Re: Contribute to ctakes: it is in your best interests! RE: unknown dependencies [EXTERNAL] [SUSPICIOUS] [SUSPICIOUS] [SUSPICIOUS] [SUSPICIOUS] [SUSPICIOUS]

Alexandru Zbarcea Tue, 21 Nov 2017 15:11:57 -0800

Tim, this is extremely informative. Thank you Sean for updating the .xml's.
I will play a little bit more with them and understand the process.


Until then, I can only speculate (please excuse my lack of understanding)
that the *model.jar is produced from these .txt (raw data) + .xml
(annotations) files. For example, I presume the anafora_annotated .xmls
were produced using the Anafora tool [1]. This is a great step forward.
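
As I understand the layout Tim describes below (one directory per
document, holding the plaintext and an annotation .xml), checking that
each document has both inputs could be sketched roughly like this. It
is only an illustration in Python: the function names are mine, and
treating every non-.xml file as raw text is an assumption, not something
cTAKES defines.

```python
from pathlib import Path
from typing import Dict, List, Tuple

DocFiles = Tuple[List[Path], List[Path]]


def pair_documents(root: Path) -> Dict[str, DocFiles]:
    """Map each document directory name to (text files, xml files)."""
    pairs: Dict[str, DocFiles] = {}
    for doc_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        files = sorted(f for f in doc_dir.iterdir() if f.is_file())
        xml_files = [f for f in files if f.suffix.lower() == ".xml"]
        # Assumption: anything that is not .xml is the raw text.
        text_files = [f for f in files if f.suffix.lower() != ".xml"]
        pairs[doc_dir.name] = (text_files, xml_files)
    return pairs


def missing_inputs(pairs: Dict[str, DocFiles]) -> List[str]:
    """Names of documents that lack either text or annotations."""
    return [name for name, (texts, xmls) in pairs.items()
            if not texts or not xmls]
```

Running missing_inputs(pair_documents(Path("anafora_annotated"))) on a
local checkout would list the documents that cannot be used for
training as they stand.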

Another step would be to understand the metadata relevant to these models.
Just to give some examples from software delivery: groupId (owner),
artifactId (product), version, classifier, packaging (e.g. .jar). I can see
other associated metadata (language, ontology, NLP techniques, etc.) that
would allow these models to be compared and measured, similar to how
Docker images are shared and distributed.
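
To make the metadata idea a bit more concrete, here is a minimal sketch
in Python of what such a descriptor could look like. It is purely
illustrative: the class, the field names beyond the Maven-style ones
listed above, and all example values are hypothetical and not an
existing cTAKES format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class ModelMetadata:
    """Hypothetical descriptor for a trained model artifact."""
    group_id: str          # owner
    artifact_id: str       # product
    version: str
    classifier: str
    packaging: str = "jar"
    language: str = "en"
    ontology: str = ""
    techniques: List[str] = field(default_factory=list)

    def coordinates(self) -> str:
        # Maven-style form: groupId:artifactId:packaging:classifier:version
        return ":".join([self.group_id, self.artifact_id,
                         self.packaging, self.classifier, self.version])


# All values below are made up for illustration.
example = ModelMetadata(
    group_id="org.example.nlp",
    artifact_id="temporal-event-model",
    version="0.1.0",
    classifier="liblinear",
    ontology="SNOMED-CT",
    techniques=["linear-svm"],
)
print(example.coordinates())
```

With something like this next to each model, two models could be
compared by language, ontology or technique without opening the .jar.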

I can only emphasize what Sean said at the beginning of this thread:

"With a project like ctakes there are a lot of things that can be
done, there are great opportunities (...)" [2]

Alex

[1] -
https://www.semanticscholar.org/paper/Anafora-A-Web-based-General-Purpose-Annotation-Too-Chen-Styler/66ccd53060a018cadb804bcff266cfc202a4c5dd
[2] -
http://mail-archives.apache.org/mod_mbox/ctakes-dev/201711.mbox/%3Cc9144a6bfcd74c5fbd352791080ffdf1%40CHEXMAIL1A.CHBOSTON.ORG%3E


On Tue, Nov 21, 2017 at 11:32 AM, Finan, Sean <
sean.fi...@childrens.harvard.edu> wrote:

> I just checked the files into trunk an hour ago, so you'll need to update.
>
> ctakes-examples-res   /src/main/resources/  org/apache/ctakes/examples/
> annotation/anafora_annotated
>
> -----Original Message-----
> From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu]
> Sent: Tuesday, November 21, 2017 11:20 AM
> To: dev@ctakes.apache.org
> Subject: Re: Contribute to ctakes: it is in your best interests! RE:
> unknown dependencies [EXTERNAL] [SUSPICIOUS] [SUSPICIOUS] [SUSPICIOUS]
> [SUSPICIOUS] [SUSPICIOUS]
>
> Yeah, it's definitely hard to do it the most efficient way because of the
> sensitive nature of our source data. You can see roughly what the source
> data looks like in our ctakes-examples-res project
> (/home/tmill/Projects/ctakes-git/ctakes-examples-res/src/main/resources/org/apache/ctakes/examples/annotation/anafora_annotated).
> Each document has a directory with the plaintext document and an xml file
> indicating spans of entities and relations between entities. The xml files
> contain no identifying information, but the plaintext is required for
> feature extraction, and so we cannot rebuild models without it.
>
> However, another possibility, as Alex mentioned, is to have models be not
> in the git repo but be resources. We already intended something like that
> by having them in *-res modules, but if there are other ideas for
> structures that would keep models completely out of the repo (or in another
> repo that wouldn't be required), I would be happy to hear about them.
>
> One final thing we (myself and others) need to be better at is that large
> models shouldn't be checked in until they are used for default modules, and
> shouldn't be used for default models unless they offer large performance
> benefits (in terms of accuracy). It might be worth a dev discussion if there
> is some indecision (for example, is a 1 GB model that offers a 2% improvement
> on relation extraction worth it?). Sometimes I've checked in things
> that run in experimental projects where they may or may not make it into
> default models.
>
> Tim
>
>
>
>
> On Tue, 2017-11-21 at 14:21 +0000, Finan, Sean wrote:
> > Hi Alex,
> >
> > >
> > > I know about the importance of these models.
> > My apologies if I offended.
> >
> > >
> > > I would like to know if there is a way also to generate them.
> >  There is a little bit of documentation on models expertly written by
> > Tim.  Right now it is in a pamphlet that we distributed at a hackathon
> > a couple of years ago and the contents should definitely be copied
> > into the wiki.  I think that there is a jira for it, but I'm not
> > certain.
> > On the main ctakes wiki page for 4.0
> > https://cwiki.apache.org/confluence/display/CTAKES/cTAKES+4.0
> > it is on the second line in the "Documentation" list.
> > Again, it needs to be moved into the wiki - and updated if necessary.
> >
> > >
> > > The same principle (I presume) applies.
> > You need a bit of machine learning awareness and annotated data.
> >
> > >
> > > If we are able to generate them, then we can version the source and
> > > the process to generate them and not the binaries themselves.
> > Some of the models are created using 'proprietary' data that cannot be
> > distributed.
> > Some of the models are created with data that is actually larger in
> > footprint than the models.
> >
> > >
> > > What is the lifecycle of a model?
> > It depends what you mean by lifecycle.  In terms of sdlc it is a very
> > long waterfall.  First, the aims are set.  This often (around us,
> > anyway) involves brainstorming between a number of people on aims for
> > the model, like what types and attributes can and should be produced.
> > An appropriate source for data needs to be found, the data acquired
> > ... and getting a grant to cover the cost of doing it.  Then the data
> > needs to be annotated, then experts fiddle with the various features
> > and methods for a while, running a gazillion times to fine-tune.  For
> > example, I think that the temporal models have been under development
> > for over five years by several developers, and the training data was
> > annotated by another half dozen or so experts.  If new data is
> > acquired from another project the model is improved and updated.
> > If you are asking about the lifetime of a model, that is highly
> > variable.  New data, new researchers, available time, interest and of
> > course the accuracy of an existing model all play a part.  A model may
> > go years without any changes, or it might be updated monthly or weekly
> > or even daily depending upon how a person is working and using vcs.
> >
> > >
> > > Can it be integrated with other Deep Learning frameworks from ASF?
> > Are you asking about other frameworks using ctakes models or ctakes
> > using other models?  I think that some of the models used by ctakes do
> > originally come from other sources.  Besides that, if those other
> > frameworks are willing to use libraries like cleartk then there
> > shouldn't be much of a problem.  There are currently some initiatives
> > trying to incorporate some deep learning frameworks.  If anybody out
> > there working on one is reading this then they can give you some
> > information.
> >
> > >
> > > I also come from a background of Continuous Delivery,
> > I appreciate that in every sense of the word!
> >
> > I hope that this information helps.  The pamphlet section on models
> > that Tim wrote is the best starting point.  ML experts (which I am
> > not) out there can contribute a lot more information, probably even a
> > correction or two.
> >
> > Sean
> >
> > -----Original Message-----
> > From: Alexandru Zbarcea [mailto:zbarce...@gmail.com]
> > Sent: Tuesday, November 21, 2017 8:35 AM
> > To: Apache cTAKES Dev
> > Subject: Re: Contribute to ctakes: it is in your best interests! RE:
> > unknown dependencies [EXTERNAL] [SUSPICIOUS] [SUSPICIOUS] [SUSPICIOUS]
> >
> > Hi Sean,
> >
> > I know about the importance of these models. Tim was also kind enough
> > to explain to me in a previous email on the mailing list about the
> > importance of them and about the fact that these models were created
> > by experts.
> >
> > However, I'm not proposing to remove them, but to better document
> > their importance. Also, I would like to know if there is a way to
> > generate them. I appreciate the way Pipeline aggregation was solved in
> > cTAKES, by creating a new DSL [1] (Piper) that is easy to read and
> > also builds in a lot of automation and flexibility. The same principle
> > (I presume) applies.
> > If we are able to generate them, then we can version the source and
> > the process to generate them and not the binaries themselves.
> >
> > If we can use the cTAKES CLIs to generate some of these models, and
> > simulate what the expert would do using the UI, we would have a
> > reproducible process that can also be perfected over time by other
> > experts.
> > It is like the Lucene viewer vs the Lucene Java API. I don't know how
> > feasible this is, though. Just my $0.02.
> >
> > I'm looking to understand not only the cTAKES Java code, but how the
> > entire process works. One of the pieces missing for me is what
> > expertise you actually need and how context-dependent it is to
> > build these models. I also come from a background of Continuous
> > Delivery, so a few questions popped out: What is the lifecycle of a
> > model? Can it be integrated with other Deep Learning frameworks from
> > ASF?
> >
> > What do you think?
> >
> > Alex
> >
> > [1] - https://en.wikipedia.org/wiki/Domain-specific_language
> >
> > On Tue, Nov 21, 2017 at 7:30 AM, Finan, Sean <
> > Sean.Finan@childrens.harvard.edu> wrote:
> >
> > >
> > > Hi Alex,
> > >
> > > The model.jar files are needed and cannot be removed.  You may have
> > > noticed that a lot of those hard-coded paths point to these
> > > model.jar files.
> > >
> > > Sean
> > >
> > >
> > > -----Original Message-----
> > > From: Alexandru Zbarcea [mailto:al...@apache.org]
> > > Sent: Monday, November 20, 2017 7:33 PM
> > > To: Apache cTAKES Dev
> > > Subject: Re: Contribute to ctakes: it is in your best interests!
> > > RE:
> > > unknown dependencies [EXTERNAL] [SUSPICIOUS] [SUSPICIOUS]
> > > [SUSPICIOUS]
> > >
> > > Thanks Tim,
> > >
> > > I am in favor of moving to git too. If there is a desire from the
> > > community to move entirely over to git,
> > > I can work with Apache Infra to make the migration.
> > >
> > > I wonder if we can reduce the repository size during this transition.
> > > Based on Apache rules, history is not allowed to be rewritten.
> > > Migrations like these are used, though, to clean up some of the
> > > big (space-consuming) resources (e.g. the "*.jar" models):
> > > $ find . -name "*.jar" | xargs du -hsc
> > > 2.3M    ./ctakes-temporal-res/src/main/resources/org/apache/
> > > ctakes/temporal/ae/eventevent/model.jar
> > > 348K    ./ctakes-temporal-res/src/main/resources/org/apache/
> > > ctakes/temporal/ae/contextualmodality/model.jar
> > > 4.0K    ./ctakes-temporal-res/src/main/resources/org/apache/
> > > ctakes/temporal/ae/salience/model.jar
> > > 1.0M    ./ctakes-temporal-res/src/main/resources/org/apache/
> > > ctakes/temporal/ae/eventannotator/model.jar
> > > 568K    ./ctakes-temporal-res/src/main/resources/org/apache/
> > > ctakes/temporal/ae/doctimerel/model.jar
> > > 2.2M    ./ctakes-temporal-res/src/main/resources/org/apache/
> > > ctakes/temporal/ae/eventtime/model.jar
> > > 1.3M    ./ctakes-temporal-res/src/main/resources/org/apache/
> > > ctakes/temporal/ae/timeannotator/model.jar
> > > 7.8M    ./ctakes-pos-tagger-res/src/main/resources/org/apache/
> > > ctakes/postagger/models/clearnlp/mayo-en-pos-1.3.0.jar
> > > 4.0K    ./ctakes-coreference-res/src/main/resources/org/apache/
> > > ctakes/coreference/models/mention-cluster/model.jar
> > > 1.5M    ./ctakes-core-res/src/main/resources/org/apache/ctakes/
> > > core/sentdetect/model.jar
> > >
> > > 504K    ./ctakes-assertion-res/src/main/resources/org/apache/
> > > ctakes/assertion/models/subject/model.jar
> > > 588K    ./ctakes-assertion-res/src/main/resources/org/apache/
> > > ctakes/assertion/models/historyOf/model.jar
> > > 332K    ./ctakes-assertion-res/src/main/resources/org/apache/
> > > ctakes/assertion/models/uncertainty/model.jar
> > > 740K    ./ctakes-assertion-res/src/main/resources/org/apache/
> > > ctakes/assertion/models/conditional/model.jar
> > > 592K    ./ctakes-assertion-res/src/main/resources/org/apache/
> > > ctakes/assertion/models/polarity/sharpi2b2mipacqnegex/model.jar
> > > 572K    ./ctakes-assertion-res/src/main/resources/org/apache/
> > > ctakes/assertion/models/generic/model.jar
> > > 1.5M    ./ctakes-assertion-res/resources/model/
> > > sharpi2b2mipacqnegex/polarity/model.jar
> > > 312K    ./ctakes-dependency-parser-res/src/main/resources/org/
> > > apache/ctakes/dependency/parser/models/lemmatizer/dictionary-
> > > 1.3.1.jar
> > > 228M    ./ctakes-dependency-parser-res/src/main/resources/org/
> > > apache/ctakes/dependency/parser/models/clearparser_models.jar
> > > 5.8M    ./ctakes-dependency-parser-res/src/main/resources/org/
> > > apache/ctakes/dependency/parser/models/srl/mayo-en-srl-1.3.0.jar
> > > 452K    ./ctakes-dependency-parser-res/src/main/resources/org/
> > > apache/ctakes/dependency/parser/models/pred/mayo-en-pred-1.3.0.jar
> > > 1.2M    ./ctakes-dependency-parser-res/src/main/resources/org/
> > > apache/ctakes/dependency/parser/models/role/mayo-en-role-1.3.0.jar
> > > 25M     ./ctakes-dependency-parser-res/src/main/resources/
> > > org/apache/ctakes/dependency/parser/models/dependency/mayo-
> > > en-dep-1.3.0.jar
> > > 688K    ./ctakes-relation-extractor-res/src/main/
> > > resources/org/apache/ctakes/relationextractor/models/location_of/mo
> > > del.jar
> > > 488K    ./ctakes-relation-extractor-res/src/main/
> > > resources/org/apache/ctakes/relationextractor/models/degree_of/mode
> > > l.jar
> > > 300K    ./ctakes-relation-extractor-res/src/main/
> > > resources/org/apache/ctakes/relationextractor/models/
> > > modifier_extractor/model.jar
> > >
> > > 282M    total
> > >
> > > or
> > >
> > > $ find ./ -type f -size +5M | grep -v "\.jar" | grep -v "\.svn" |
> > > grep -v "\.git" | xargs du -hsc
> > > 9.2M    ./ctakes-coreference-res/src/main/resources/org/apache/
> > > ctakes/coreference/models/index_med_5k/_3.prx
> > > 20M     ./ctakes-coreference-res/src/main/resources/org/apache/
> > > ctakes/coreference/models/index_med_5k/_3.tvf
> > > 6.9M    ./ctakes-coreference-res/src/main/resources/org/apache/
> > > ctakes/coreference/pref_probs.txt
> > > 13M     ./ctakes-chunker-res/src/main/resources/org/apache/ctakes/
> > > chunker/models/chunker-model.zip
> > > 6.4M    ./ctakes-constituency-parser-res/src/main/resources/org/
> > > apache/ctakes/constituency/parser/models/thyme.bin
> > > 15M     ./ctakes-constituency-parser-res/src/main/resources/org/
> > > apache/ctakes/constituency/parser/models/sharpacq-3.1.bin
> > > 12M     ./ctakes-constituency-parser-res/src/main/resources/org/
> > > apache/ctakes/constituency/parser/models/sharpacq-1.5.bin
> > > 84M     ./resources/org/apache/ctakes/dictionary/lookup/fast/sno_rx_
> > > 16ab/sno_rx_16ab.script
> > > 11M     ./ctakes-assertion-res/src/main/resources/org/apache/
> > > ctakes/assertion/models/pos.model
> > > 38M     ./ctakes-assertion-res/resources/model/sharpi2b2mipacqnegex/polarity/
> > > training-data.liblinear
> > > 9.6M    ./ctakes-temporal/src/main/resources/org/apache/ctakes/
> > > temporal/thyme_word2vec_mapped_50.vec
> > > 91M     ./ctakes-temporal/src/main/resources/org/apache/ctakes/
> > > temporal/gloveresult_3
> > > 67M     ./ctakes-temporal/src/main/resources/org/apache/ctakes/
> > > temporal/mimic_vectors.txt
> > >
> > > 378M    total
> > >
> > > Are all these resources still relevant? Is there a way to generate
> > > them?
> > >
> > > I do not wish to open Pandora's box, though.
> > >
> > > Alex
> > >
> > >
> > > On Mon, Nov 20, 2017 at 9:29 AM, Finan, Sean <
> > > Sean.Finan@childrens.harvard.edu> wrote:
> > >
> > > >
> > > > Thanks Tim!
> > > >
> > > > -----Original Message-----
> > > > From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu]
> > > > Sent: Monday, November 20, 2017 6:33 AM
> > > > To: dev@ctakes.apache.org
> > > > Subject: Re: Contribute to ctakes: it is in your best interests!
> > > > RE:
> > > > unknown dependencies [EXTERNAL] [SUSPICIOUS] [SUSPICIOUS]
> > > > [SUSPICIOUS]
> > > >
> > > > Git is available to apache projects, and many projects have moved
> > > > over (see here: https://git-wip-us.apache.org/repos/asf).
> > > > Here is the general info on what that looks like:
> > > > https://www.apache.org/dev/writable-git
> > > >
> > > > A few points from that link:
> > > > >
> > > > > Projects can request moving to Git as their main code repository, by
> > > > > creating an INFRA issue. See also the infra-contact page.
> > > > >
> > > > > Projects can request new, blank repositories by using
> > > > > reporeq.apache.org.
> > > > >
> > > > > The current system has basic git support only. We are working on
> > > > > extending this service in the near future.
> > > > >
> > > > > Custom commit or other hooks will not be supported, all projects
> > > > > get the same hooks. Setting up gitpubsub should provide sufficient
> > > > > flexibility without impacting the core Git setup, volunteers are
> > > > > welcome to make that happen.
> > > >
> > > > (Not sure what basic support only means.)
> > > >
> > > > There are also read-only git repos available by default for every
> > > > project and updated in near-real-time:
> > > > https://www.apache.org/dev/git.html
> > > >
> > > > With those I guess the suggested workflow is to work off of that
> > > > repo and then just submit patches to someone who commits with svn
> > > > rather than committing directly.
> > > >
> > > > I've been using the git-svn connector myself recently since I just
> > > > vastly prefer the git lightweight branching for focused
> > > > development, as it helps me keep a cleaner working directory. But
> > > > that adds some additional annoying steps.
> > > >
> > > > Tim
> > > >
> > > > ________________________________________
> > > > From: Finan, Sean <sean.fi...@childrens.harvard.edu>
> > > > Sent: Saturday, November 18, 2017 1:23 PM
> > > > To: dev@ctakes.apache.org
> > > > Subject: RE: Contribute to ctakes: it is in your best interests!
> > > > RE:
> > > > unknown dependencies [EXTERNAL] [SUSPICIOUS] [SUSPICIOUS]
> > > >
> > > > Hi Dave,
> > > >
> > > > Those are some great thoughts.  Being an apache project I am not
> > > > sure how far we can move from svn, but there may be a way.  You
> > > > are not the first to voice this desire for an active github repo
> > > > and I'm sure that you won't be the last.
> > > >
> > > > I completely agree with your discussion board preference.  Do you
> > > > have any recommendations?
> > > >
> > > > You make a great point regarding documentation.  In reference to
> > > > things that anybody can quickly contribute ... that would be a big
> > > > one.
> > > > Volunteers?!?
> > > >
> > > > I am really happy to hear that you want to contribute - more than
> > > > you already have, which is actually quite a bit!
> > > >
> > > > Cheers,
> > > > Sean
> > > >
> > > > -----Original Message-----
> > > > From: David Kincaid [mailto:kincaid.d...@gmail.com]
> > > > Sent: Saturday, November 18, 2017 1:10 PM
> > > > To: dev@ctakes.apache.org
> > > > Subject: Re: Contribute to ctakes: it is in your best interests!
> > > > RE:
> > > > unknown dependencies [EXTERNAL] [SUSPICIOUS]
> > > >
> > > > Sean, I can share a couple things that have been an obstacle for
> > > > me.
> > > > It may seem a minor point to some, but I left Subversion behind
> > > > years ago and really have no desire to go back. If the project
> > > > were moved over to Git/Github it would really smooth the way for
> > > > me at least. I would be happy to help out with this. One of the
> > > > other things I would really like to see is the mailing list moved
> > > > onto a discussion board platform. It seems to me that a discussion
> > > > board style of tool tends to create a more active community than a
> > > > mailing list does.
> > > >
> > > > The other thing that might help get new people involved is making
> > > > it easier to find information about the development environment.
> > > > Things
> > > > like branching strategies, coding conventions, etc are really hard
> > > > to find from the main cTAKES web site. I saw some references to
> > > > Jenkins builds recently on the list. I had no idea there was a
> > > > Jenkins CI server for the project somewhere. It also takes some
> > > > digging to find a link to Jira. Maybe we could create a Wiki page
> > > > that describes where all these tools are and how they are used.
> > > >
> > > > You guys have really done some great work over the last couple of
> > > > years cleaning up the code base and improving the documentation by
> > > > a ton. Things like the fast dictionary annotator, dictionary
> > > > creator GUI are a great addition and make it a lot easier for
> > > > other people to get up and running more quickly. As I'm ramping up
> > > > my research as well as some proof of concept stuff at work I'll be
> > > > working more and more with cTAKES and would love to contribute
> > > > more to the project.
> > > >
> > > > Just my thoughts.
> > > >
> > > > - Dave
> > > >
> > > >
> > > > On Sat, Nov 18, 2017 at 11:10 AM, Finan, Sean <
> > > > sean.fi...@childrens.harvard.edu> wrote:
> > > >
> > > > >
> > > > > Hi Tim, Alex,
> > > > >
> > > > > Great ideas.  I like your (Tim) idea to
> > > > > 1. start with commented code removal.
> > > > > Then maybe move on to
> > > > > 2. sanity-test type unit tests - Little two or three-line "does
> > > > > this method crack" tests.
> > > > > And another that is simply
> > > > > 3. "populate a test cas with type(s) X" and a factory with
> > > > > "getSectionTestCas" "getSentenceTestCas" "getPosTestCas"
> > > > > "getChunkTestCas" ...  just really simple reusables for tests.
> > > > > Then
> > > > > 4. refactor to extract and consolidate duplicate code - it is
> > > > > all over the place ...
> > > > >
> > > > > These are just my initial thoughts and suggestions, but I think
> > > > > that those 4 tasks can be performed by anybody of any experience
> > > > > level.  They build upon each other and should help the
> > > > > implementers better understand ctakes.
> > > > > After that the sky is the limit.
> > > > >
> > > > > A couple of years ago I sat on a panel at a workshop for open
> > > > > source scientific software.  For the half dozen or so
> > > > > highlighted projects (ctakes was one!) the common thread was
> > > > > that getting people to contribute is extremely difficult.
> > > > > I have a tendency to assume that people always act in their best
> > > > > interests.  Any student thinking of going towards industry
> > > > > should be jumping at the opportunity to contribute to a large,
> > > > > production-quality project.  They should also realize that
> > > > > contribution means potential recommendation (and possibly hiring
> > > > > interest) by established developers, physicians and researchers
> > > > > that use ctakes.  Even just answering questions on a user or dev
> > > > > list creates credibility and can build a network.
> > > > >
> > > > > Active researchers could discover common thoughts and directions
> > > > > that could lead to collaboration outside ctakes.  Researchers
> > > > > and companies trying to build upon open source should realize
> > > > > that direct contribution is easier than custom substitution.
> > > > > Plus, it is in their best interests that code does what they
> > > > > need it to do in the fastest, lightest, most stable way
> > > > > possible.
> > > > > With a project like ctakes there are a lot of things that can be
> > > > > done, there are great opportunities to really shine.  "I wrote
> > > > > this tool for my thesis that performs some nlp task" sounds
> > > > > good.
> > > > > Appending "in an Apache product and it has been taken up by
> > > > > thousands across the globe" makes it sound a lot better.
> > > > > At my previous job in industry the company actively contributed
> > > > > to several open source projects.  We had a few people for whom
> > > > > that was 50% of their job.  Why?  Because we made a commitment
> > > > > to use that open source software.
> > > > >
> > > > > It was a better use of our resources to contribute to it,
> > > > > improve it and keep its momentum going and prevent it from
> > > > > becoming stale (or
> > > > > abandoned) while our software continued to move forward.
> > > > >
> > > > > Hmm, that was a touch more than I had planned to write.  A whole
> > > > > cup of coffee in that one.
> > > > >
> > > > > Sean
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > -----Original Message-----
> > > > > From: Miller, Timothy
> > > > > [mailto:timothy.mil...@childrens.harvard.edu]
> > > > > Sent: Saturday, November 18, 2017 8:13 AM
> > > > > To: dev@ctakes.apache.org
> > > > > Subject: Re: unknown dependencies [EXTERNAL] [SUSPICIOUS]
> > > > >
> > > > > Thanks Alex, looks like that was probably a fat-fingered
> > > > > auto-import on my part.
> > > > >
> > > > > I like your idea, and I don't know the best way to start
> > > > > either, but maybe one suggestion is to start with one or two
> > > > > focused things to clean up, and then ask for volunteers to take
> > > > > on specific modules?
> > > > > Then people can contribute an hour here and there to do cleanup
> > > > > on their task/module and try to fix that thing in a 1-2-month
> > > > > long sprint. I am happy to contribute to cleanup, I am
> > > > > responsible for my fair share of unclean code, but since I don't
> > > > > have strong software engineering chops it would be good to have
> > > > > people with that background propose the tasks and describe
> > > > > exactly what needs to be done. My idea of cleaning is just to
> > > > > delete commented out sections of evaluation code.
> > > >
> > > > >
> > > > >
> > > > > Tim
> > > > >
> > > > > ________________________________________
> > > > > From: Alexandru Zbarcea <al...@apache.org>
> > > > > Sent: Friday, November 17, 2017 4:46 PM
> > > > > To: Apache cTAKES Dev
> > > > > Subject: unknown dependencies [EXTERNAL]
> > > > >
> > > > > Hi,
> > > > >
> > > > > I noticed that a mistaken dependency has slipped into the code:
> > > > > jdk.internal.org.objectweb.asm.commons.AnalyzerAdapter;
> > > > >
> > > > > Now that the Jenkins build is successful, I think it is easier
> > > > > to clean up the code. I would like it to be a common effort. I
> > > > > don't know the best way to approach this.
> > > > >
> > > > > Looking forward to your advice,
> > > > > Alex
> > > > >
>
