Very informative and a lot of information to digest. I will start with the
pamphlet, per your advice.

So what I understand is that with a little bit of machine learning
awareness and annotated data, such a process can be built. More than that,
there is a collaboration over this data until the expert community agrees
upon the last state of the model. All of which is great news. Hard, but
possible.

Related to the lifecycle, I wonder if an analogous system exists that
identifies a model by its author (groupId), name (artifactId), version and
context (classifier). Is there any other registry for experts to share
these models (like Maven Central)? Is there an opportunity to create one,
maybe as a subproject? (just a thought).
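
To make the idea concrete, here is a minimal sketch (in Python, purely
illustrative) of how such Maven-style coordinates could map to a path in a
shared registry. The ModelCoordinate class, the example values and the .jar
extension are assumptions made for illustration; nothing like this exists in
cTAKES as far as I know. The path layout simply mirrors the conventional
Maven repository layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelCoordinate:
    """Hypothetical Maven-style coordinate for a trained model (illustration only)."""
    group_id: str          # model author / organization
    artifact_id: str       # name of the model
    version: str           # model version
    classifier: str = ""   # context, e.g. the kind of data the model was trained on

    def repository_path(self, extension: str = "jar") -> str:
        """Relative path following the usual Maven repository layout."""
        file_name = f"{self.artifact_id}-{self.version}"
        if self.classifier:
            file_name += f"-{self.classifier}"
        group_path = self.group_id.replace(".", "/")
        return f"{group_path}/{self.artifact_id}/{self.version}/{file_name}.{extension}"


# Example values are made up for illustration.
coordinate = ModelCoordinate("org.example.models", "sentence-detector", "1.0.0", "clinical")
print(coordinate.repository_path())
```

With coordinates like these, two versions of the same model, or the same
model trained in two contexts, would resolve to different paths, which is
what would let a registry hold them side by side.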

Thank you Sean for taking the time to reply. Deeply appreciated.

Alex


On Nov 21, 2017 9:25 AM, "Finan, Sean" <sean.fi...@childrens.harvard.edu>
wrote:

Hi Alex,

> I know about the importance of these models.
My apologies if I offended.

> I would like to know if there is a way also to generate them.
 There is a little bit of documentation on models expertly written by Tim.
Right now it is in a pamphlet that we distributed at a hackathon a couple
of years ago and the contents should definitely be copied into the wiki.  I
think that there is a jira for it, but I'm not certain.
On the main ctakes wiki page for 4.0
https://cwiki.apache.org/confluence/display/CTAKES/cTAKES+4.0
it is on the second line in the "Documentation" list.
Again, it needs to be moved into the wiki - and updated if necessary.

> The same principle (I presume) it applies.
You need a bit of machine learning awareness and annotated data.

> If we are able to generate them, then we can version the source and the
process to generate them and not the binaries themselves.
Some of the models are created using 'proprietary' data that cannot be
distributed.
Some of the models are created with data that is actually larger in
footprint than the models.

> What is the lifecycle of a model?
It depends what you mean by lifecycle.  In terms of SDLC it is a very long
waterfall.  First, the aims are set.  This often (around us, anyway)
involves brainstorming between a number of people on aims for the model,
like what types and attributes can and should be produced.  An appropriate
source for data needs to be found, the data acquired ... and a grant
obtained to cover the cost of doing it.  Then the data needs to be annotated,
then experts fiddle with the various features and methods for a while
running a gazillion times to fine-tune.  For example, I think that the
temporal models have been under development for over five years by several
developers, and the training data was annotated by another half dozen or so
experts.  If new data is acquired from another project the model is
improved and updated.
If you are asking about the lifetime of a model, that is highly variable.
New data, new researchers, available time, interest and of course the
accuracy of an existing model all play a part.  A model may go years
without any changes, or it might be updated monthly or weekly or even daily
depending upon how a person is working and using vcs.

> Can it be integrated with other Deep Learning frameworks from ASF?
Are you asking about other frameworks using ctakes models or ctakes using
other models?  I think that some of the models used by ctakes do originally
come from other sources.  Besides that, if those other frameworks are
willing to use libraries like cleartk then there shouldn't be much of a
problem.  There are currently some initiatives trying to incorporate some
deep learning frameworks.  If anybody out there working on one is reading
this then they can give you some information.

> I also come from a background of Continuous Delivery,
I appreciate that in every sense of the word!

I hope that this information helps.  The pamphlet section on models that
Tim wrote is the best starting point.  ML experts (which I am not) out
there can contribute a lot more information, probably even a correction or
two.

Sean

-----Original Message-----
From: Alexandru Zbarcea [mailto:zbarce...@gmail.com]
Sent: Tuesday, November 21, 2017 8:35 AM
To: Apache cTAKES Dev
Subject: Re: Contribute to ctakes: it is in your best interests! RE:
unknown dependencies [EXTERNAL] [SUSPICIOUS] [SUSPICIOUS] [SUSPICIOUS]

Hi Sean,

I know about the importance of these models. Tim was also kind enough to
explain to me in a previous email on the mailing list about the importance
of them and about the fact that these models were created by experts.

However, I'm not proposing to remove them, but to better document their
importance. Also, I would like to know if there is a way to generate
them. I appreciate the way Pipeline aggregation was solved in cTAKES, by
creating a new DSL [1] (Piper) that was easy to read and also brought a lot
of automation and flexibility. The same principle (I presume) applies.
If we are able to generate them, then we can version the source and the
process to generate them and not the binaries themselves.

If we can use the cTAKES CLIs to generate some of these models, and
simulate what the expert would do using the UI, we would have a
reproducible process that can also be perfected over time by other experts.
It is like the Lucene viewer vs. the Lucene Java API. I don't know how
feasible this is, though. Just my $0.02.

I'm looking to understand not only the cTAKES Java code, but how the entire
process works. One of the pieces missing for me is what expertise you
actually need and how context-dependent it is to build these models. I
also come from a background of Continuous Delivery, so a few questions
popped up: What is the lifecycle of a model? Can it be integrated with other
Deep Learning frameworks from ASF?

What do you think?

Alex

[1] - https://en.wikipedia.org/wiki/Domain-specific_language

On Tue, Nov 21, 2017 at 7:30 AM, Finan, Sean <Sean.Finan@childrens.harvard.edu> wrote:

> Hi Alex,
>
> The model.jar files are needed and cannot be removed.  You may have
> noticed that a lot of those hard-coded paths point to these model.jar
> files.
>
> Sean
>
>
> -----Original Message-----
> From: Alexandru Zbarcea [mailto:al...@apache.org]
> Sent: Monday, November 20, 2017 7:33 PM
> To: Apache cTAKES Dev
> Subject: Re: Contribute to ctakes: it is in your best interests! RE:
> unknown dependencies [EXTERNAL] [SUSPICIOUS] [SUSPICIOUS] [SUSPICIOUS]
>
> Thanks Tim,
>
> I am in favor of moving to git too. If there is a desire from the
> community to move entirely over to git, I can work with Apache Infra to
> make the migration.
>
> I wonder if we can reduce the repository size during this transition.
> Based on Apache rules, history is not allowed to be rewritten.
> Migrations like these are used, though, to clean up some of the big
> (space-consuming) resources (e.g. models "*.jar"):
> $ find . -name "*.jar" | xargs du -hsc
> 2.3M    ./ctakes-temporal-res/src/main/resources/org/apache/ctakes/temporal/ae/eventevent/model.jar
> 348K    ./ctakes-temporal-res/src/main/resources/org/apache/ctakes/temporal/ae/contextualmodality/model.jar
> 4.0K    ./ctakes-temporal-res/src/main/resources/org/apache/ctakes/temporal/ae/salience/model.jar
> 1.0M    ./ctakes-temporal-res/src/main/resources/org/apache/ctakes/temporal/ae/eventannotator/model.jar
> 568K    ./ctakes-temporal-res/src/main/resources/org/apache/ctakes/temporal/ae/doctimerel/model.jar
> 2.2M    ./ctakes-temporal-res/src/main/resources/org/apache/ctakes/temporal/ae/eventtime/model.jar
> 1.3M    ./ctakes-temporal-res/src/main/resources/org/apache/ctakes/temporal/ae/timeannotator/model.jar
> 7.8M    ./ctakes-pos-tagger-res/src/main/resources/org/apache/ctakes/postagger/models/clearnlp/mayo-en-pos-1.3.0.jar
> 4.0K    ./ctakes-coreference-res/src/main/resources/org/apache/ctakes/coreference/models/mention-cluster/model.jar
> 1.5M    ./ctakes-core-res/src/main/resources/org/apache/ctakes/core/sentdetect/model.jar
> 504K    ./ctakes-assertion-res/src/main/resources/org/apache/ctakes/assertion/models/subject/model.jar
> 588K    ./ctakes-assertion-res/src/main/resources/org/apache/ctakes/assertion/models/historyOf/model.jar
> 332K    ./ctakes-assertion-res/src/main/resources/org/apache/ctakes/assertion/models/uncertainty/model.jar
> 740K    ./ctakes-assertion-res/src/main/resources/org/apache/ctakes/assertion/models/conditional/model.jar
> 592K    ./ctakes-assertion-res/src/main/resources/org/apache/ctakes/assertion/models/polarity/sharpi2b2mipacqnegex/model.jar
> 572K    ./ctakes-assertion-res/src/main/resources/org/apache/ctakes/assertion/models/generic/model.jar
> 1.5M    ./ctakes-assertion-res/resources/model/sharpi2b2mipacqnegex/polarity/model.jar
> 312K    ./ctakes-dependency-parser-res/src/main/resources/org/apache/ctakes/dependency/parser/models/lemmatizer/dictionary-1.3.1.jar
> 228M    ./ctakes-dependency-parser-res/src/main/resources/org/apache/ctakes/dependency/parser/models/clearparser_models.jar
> 5.8M    ./ctakes-dependency-parser-res/src/main/resources/org/apache/ctakes/dependency/parser/models/srl/mayo-en-srl-1.3.0.jar
> 452K    ./ctakes-dependency-parser-res/src/main/resources/org/apache/ctakes/dependency/parser/models/pred/mayo-en-pred-1.3.0.jar
> 1.2M    ./ctakes-dependency-parser-res/src/main/resources/org/apache/ctakes/dependency/parser/models/role/mayo-en-role-1.3.0.jar
> 25M     ./ctakes-dependency-parser-res/src/main/resources/org/apache/ctakes/dependency/parser/models/dependency/mayo-en-dep-1.3.0.jar
> 688K    ./ctakes-relation-extractor-res/src/main/resources/org/apache/ctakes/relationextractor/models/location_of/model.jar
> 488K    ./ctakes-relation-extractor-res/src/main/resources/org/apache/ctakes/relationextractor/models/degree_of/model.jar
> 300K    ./ctakes-relation-extractor-res/src/main/resources/org/apache/ctakes/relationextractor/models/modifier_extractor/model.jar
>
> 282M    total
>
> or
>
> $ find ./ -type f -size +5M | grep -v "\.jar" | grep -v "\.svn" | grep -v "\.git" | xargs du -hsc
> 9.2M    ./ctakes-coreference-res/src/main/resources/org/apache/ctakes/coreference/models/index_med_5k/_3.prx
> 20M     ./ctakes-coreference-res/src/main/resources/org/apache/ctakes/coreference/models/index_med_5k/_3.tvf
> 6.9M    ./ctakes-coreference-res/src/main/resources/org/apache/ctakes/coreference/pref_probs.txt
> 13M     ./ctakes-chunker-res/src/main/resources/org/apache/ctakes/chunker/models/chunker-model.zip
> 6.4M    ./ctakes-constituency-parser-res/src/main/resources/org/apache/ctakes/constituency/parser/models/thyme.bin
> 15M     ./ctakes-constituency-parser-res/src/main/resources/org/apache/ctakes/constituency/parser/models/sharpacq-3.1.bin
> 12M     ./ctakes-constituency-parser-res/src/main/resources/org/apache/ctakes/constituency/parser/models/sharpacq-1.5.bin
> 84M     ./resources/org/apache/ctakes/dictionary/lookup/fast/sno_rx_16ab/sno_rx_16ab.script
> 11M     ./ctakes-assertion-res/src/main/resources/org/apache/ctakes/assertion/models/pos.model
> 38M     ./ctakes-assertion-res/resources/model/sharpi2b2mipacqnegex/polarity/training-data.liblinear
> 9.6M    ./ctakes-temporal/src/main/resources/org/apache/ctakes/temporal/thyme_word2vec_mapped_50.vec
> 91M     ./ctakes-temporal/src/main/resources/org/apache/ctakes/temporal/gloveresult_3
> 67M     ./ctakes-temporal/src/main/resources/org/apache/ctakes/temporal/mimic_vectors.txt
>
> 378M    total
>
> Are all these resources still relevant? Is there a way to generate them?
>
> I do not wish to open Pandora's box though, Alex
>
>
> On Mon, Nov 20, 2017 at 9:29 AM, Finan, Sean <Sean.Finan@childrens.harvard.edu> wrote:
>
> > Thanks Tim!
> >
> > -----Original Message-----
> > From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu]
> > Sent: Monday, November 20, 2017 6:33 AM
> > To: dev@ctakes.apache.org
> > Subject: Re: Contribute to ctakes: it is in your best interests! RE:
> > unknown dependencies [EXTERNAL] [SUSPICIOUS] [SUSPICIOUS]
> > [SUSPICIOUS]
> >
> > Git is available to apache projects, and many projects have moved
> > over (see here: https://git-wip-us.apache.org/repos/asf):
> > Here is the general info on what that looks like:
> > https://www.apache.org/dev/writable-git
> >
> > A few points from that link:
> > > Projects can request moving to Git as their main code repository
> > > by creating an INFRA issue. See also the infra-contact page.
> > > Projects can request new, blank repositories by using
> > > reporeq.apache.org.
> > > The current system has basic git support only. We are working on
> > > extending this service in the near future.
> > > Custom commit or other hooks will not be supported, all projects
> > > get the same hooks. Setting up gitpubsub should provide sufficient
> > > flexibility without impacting the core Git setup, volunteers are
> > > welcome to make that happen.
> >
> > (Not sure what basic support only means.)
> >
> > There are also read-only git repos available by default for every
> > project and updated in near-real-time:
> > https://www.apache.org/dev/git.html
> >
> > With those, I guess the suggested workflow is to work off of that
> > repo and then just submit patches to someone who commits with svn
> > rather than committing directly.
> >
> > I've been using the git-svn connector myself recently since I just
> > vastly prefer the git lightweight branching for focused development,
> > as it helps me keep a cleaner working directory. But that adds some
> > additional annoying steps.
> >
> > Tim
> >
> > ________________________________________
> > From: Finan, Sean <sean.fi...@childrens.harvard.edu>
> > Sent: Saturday, November 18, 2017 1:23 PM
> > To: dev@ctakes.apache.org
> > Subject: RE: Contribute to ctakes: it is in your best interests! RE:
> > unknown dependencies [EXTERNAL] [SUSPICIOUS] [SUSPICIOUS]
> >
> > Hi Dave,
> >
> > Those are some great thoughts.  Being an apache project I am not
> > sure how far we can move from svn, but there may be a way.  You are
> > not the first to voice this desire for an active github repo and I'm
> > sure that you won't be the last.
> >
> > I completely agree with your discussion board preference.  Do you
> > have any recommendations?
> >
> > You make a great point regarding documentation.  In reference to
> > things that anybody can quickly contribute ... that would be a big one.
> > Volunteers?!?
> >
> > I am really happy to hear that you want to contribute - more than
> > you already have, which is actually quite a bit!
> >
> > Cheers,
> > Sean
> >
> > -----Original Message-----
> > From: David Kincaid [mailto:kincaid.d...@gmail.com]
> > Sent: Saturday, November 18, 2017 1:10 PM
> > To: dev@ctakes.apache.org
> > Subject: Re: Contribute to ctakes: it is in your best interests! RE:
> > unknown dependencies [EXTERNAL] [SUSPICIOUS]
> >
> > Sean, I can share a couple things that have been an obstacle for me.
> > It may seem a minor point to some, but I left Subversion behind
> > years ago and really have no desire to go back. If the project were
> > moved over to Git/Github it would really smooth the way for me at
> > least. I would be happy to help out with this. One of the other
> > things I would really like to see is the mailing list moved onto a
> > discussion board platform. It seems to me that a discussion board
> > style of tool tends to create a more active community than a mailing
> > list does.
> >
> > The other thing that might help get new people involved is making it
> > easier to find information about the development environment. Things
> > like branching strategies, coding conventions, etc are really hard
> > to find from the main cTAKES web site. I saw some references to
> > Jenkins builds recently on the list. I had no idea there was a
> > Jenkins CI server for the project somewhere. It also takes some
> > digging to find a link to Jira. Maybe we could create a Wiki page
> > that describes where all these tools are and how they are used.
> >
> > You guys have really done some great work over the last couple of
> > years cleaning up the code base and improving the documentation by a
> > ton. Things like the fast dictionary annotator and the dictionary creator
> > GUI are great additions and make it a lot easier for other people
> > to get up and running more quickly. As I'm ramping up my research as
> > well as some proof of concept stuff at work I'll be working more and
> > more with cTAKES and would love to contribute more to the project.
> >
> > Just my thoughts.
> >
> > - Dave
> >
> >
> > On Sat, Nov 18, 2017 at 11:10 AM, Finan, Sean <
> > sean.fi...@childrens.harvard.edu> wrote:
> >
> > > Hi Tim, Alex,
> > >
> > > Great ideas.  I like your (Tim) idea to
> > > 1. start with commented code removal.
> > > Then maybe move on to
> > > 2. sanity-test type unit tests - Little two or three-line "does
> > > this method crack" tests.
> > > And another that is simply
> > > 3. "populate a test cas with type(s) X" and a factory with
> > > "getSectionTestCas" "getSentenceTestCas" "getPosTestCas"
> > > "getChunkTestCas" ...  just really simple reusables for tests.
> > > Then
> > > 4. refactor to extract and consolidate duplicate code - it is all
> > > over the place ...
> > >
> > > These are just my initial thoughts and suggestions, but I think
> > > that those 4 tasks can be performed by anybody of any experience
> > > level.  They build upon each other and should help the implementers
> > > better understand ctakes.
> > > After that the sky is the limit.
> > >
> > > A couple of years ago I sat on a panel at a workshop for open
> > > source scientific software.  For the half dozen or so highlighted
> > > projects (ctakes was one!) the common thread was that getting
> > > people to contribute is extremely difficult.
> > > I have a tendency to assume that people always act in their best
> > > interests.  Any student thinking of going towards industry should
> > > be jumping at the opportunity to contribute to a large,
> > > production-quality project.  They should also realize that
> > > contribution means potential recommendation (and possibly hiring
> > > interest) by established developers, physicians and researchers
> > > that use ctakes.  Even just answering questions on a user or dev
> > > list creates credibility and can build a network.
> > > Active researchers could discover common thoughts and directions
> > > that could lead to collaboration outside ctakes.  Researchers and
> > > companies trying to build upon open source should realize that
> > > direct contribution is easier than custom substitution.  Plus, it
> > > is in their best interests that code does what they need it to do
> > > in the fastest, lightest, most stable way possible.
> > > With a project like ctakes there are a lot of things that can be
> > > done, there are great opportunities to really shine.  "I wrote
> > > this tool for my thesis that performs some nlp task" sounds good.
> > > Appending "in an Apache product and it has been taken up by
> > > thousands across the globe" makes it sound a lot better.
> > > At my previous job in industry the company actively contributed to
> > > several open source projects.  We had a few people for whom that
> > > was 50% of their job.  Why?  Because we made a commitment to use
> > > that open source software.
> > > It was a better use of our resources to contribute to it, improve
> > > it and keep its momentum going and prevent it from becoming stale
> > > (or abandoned) while our software continued to move forward.
> > >
> > > Hmm, that was a touch more than I had planned to write.  A whole
> > > cup of coffee in that one.
> > >
> > > Sean
> > >
> > >
> > >
> > >
> > > -----Original Message-----
> > > From: Miller, Timothy
> > > [mailto:timothy.mil...@childrens.harvard.edu]
> > > Sent: Saturday, November 18, 2017 8:13 AM
> > > To: dev@ctakes.apache.org
> > > Subject: Re: unknown dependencies [EXTERNAL] [SUSPICIOUS]
> > >
> > > Thanks Alex, looks like that was probably a fat-fingered
> > > auto-import on my part.
> > >
> > > I like your idea, and I don't know the best way to start
> > > either, but maybe one suggestion is to start with one or two
> > > focused things to clean up, and then ask for volunteers to take on
> > > specific modules?
> > > Then people can contribute an hour here and there to do cleanup on
> > > their task/module and try to fix that thing in a 1-2-month long
> > > sprint. I am happy to contribute to cleanup, I am responsible for
> > > my fair share of unclean code, but since I don't have strong
> > > software engineering chops it would be good to have people with
> > > that background propose the tasks and describe exactly what needs
> > > to be done. My idea of cleaning is just to delete commented out
> > > sections of evaluation code.
> > >
> > > Tim
> > >
> > > ________________________________________
> > > From: Alexandru Zbarcea <al...@apache.org>
> > > Sent: Friday, November 17, 2017 4:46 PM
> > > To: Apache cTAKES Dev
> > > Subject: unknown dependencies [EXTERNAL]
> > >
> > > Hi,
> > >
> > > I noticed that a mistaken dependency has slipped into the code:
> > > jdk.internal.org.objectweb.asm.commons.AnalyzerAdapter;
> > >
> > > Now that the Jenkins build is successful, I think it is easier
> > > to clean up the code. I would like it to be a common effort. I don't
> > > know the best way to approach this.
> > >
> > > Looking forward to your advice,
> > > Alex
> > >
> >
>
