Thanks Tim,

I am in favor of moving to git too. If the community wants to move entirely
over to git, I can work with Apache Infra on the migration.

I wonder if we can reduce the repository size during this transition. Under
Apache rules, history may not be rewritten. Migrations like this one are
nevertheless used to clean up some of the big (space-consuming) resources,
e.g. the model "*.jar" files:
$ find . -name "*.jar" | xargs du -hsc
2.3M    ./ctakes-temporal-res/src/main/resources/org/apache/ctakes/temporal/ae/eventevent/model.jar
348K    ./ctakes-temporal-res/src/main/resources/org/apache/ctakes/temporal/ae/contextualmodality/model.jar
4.0K    ./ctakes-temporal-res/src/main/resources/org/apache/ctakes/temporal/ae/salience/model.jar
1.0M    ./ctakes-temporal-res/src/main/resources/org/apache/ctakes/temporal/ae/eventannotator/model.jar
568K    ./ctakes-temporal-res/src/main/resources/org/apache/ctakes/temporal/ae/doctimerel/model.jar
2.2M    ./ctakes-temporal-res/src/main/resources/org/apache/ctakes/temporal/ae/eventtime/model.jar
1.3M    ./ctakes-temporal-res/src/main/resources/org/apache/ctakes/temporal/ae/timeannotator/model.jar
7.8M    ./ctakes-pos-tagger-res/src/main/resources/org/apache/ctakes/postagger/models/clearnlp/mayo-en-pos-1.3.0.jar
4.0K    ./ctakes-coreference-res/src/main/resources/org/apache/ctakes/coreference/models/mention-cluster/model.jar
1.5M    ./ctakes-core-res/src/main/resources/org/apache/ctakes/core/sentdetect/model.jar
504K    ./ctakes-assertion-res/src/main/resources/org/apache/ctakes/assertion/models/subject/model.jar
588K    ./ctakes-assertion-res/src/main/resources/org/apache/ctakes/assertion/models/historyOf/model.jar
332K    ./ctakes-assertion-res/src/main/resources/org/apache/ctakes/assertion/models/uncertainty/model.jar
740K    ./ctakes-assertion-res/src/main/resources/org/apache/ctakes/assertion/models/conditional/model.jar
592K    ./ctakes-assertion-res/src/main/resources/org/apache/ctakes/assertion/models/polarity/sharpi2b2mipacqnegex/model.jar
572K    ./ctakes-assertion-res/src/main/resources/org/apache/ctakes/assertion/models/generic/model.jar
1.5M    ./ctakes-assertion-res/resources/model/sharpi2b2mipacqnegex/polarity/model.jar
312K    ./ctakes-dependency-parser-res/src/main/resources/org/apache/ctakes/dependency/parser/models/lemmatizer/dictionary-1.3.1.jar
228M    ./ctakes-dependency-parser-res/src/main/resources/org/apache/ctakes/dependency/parser/models/clearparser_models.jar
5.8M    ./ctakes-dependency-parser-res/src/main/resources/org/apache/ctakes/dependency/parser/models/srl/mayo-en-srl-1.3.0.jar
452K    ./ctakes-dependency-parser-res/src/main/resources/org/apache/ctakes/dependency/parser/models/pred/mayo-en-pred-1.3.0.jar
1.2M    ./ctakes-dependency-parser-res/src/main/resources/org/apache/ctakes/dependency/parser/models/role/mayo-en-role-1.3.0.jar
25M     ./ctakes-dependency-parser-res/src/main/resources/org/apache/ctakes/dependency/parser/models/dependency/mayo-en-dep-1.3.0.jar
688K    ./ctakes-relation-extractor-res/src/main/resources/org/apache/ctakes/relationextractor/models/location_of/model.jar
488K    ./ctakes-relation-extractor-res/src/main/resources/org/apache/ctakes/relationextractor/models/degree_of/model.jar
300K    ./ctakes-relation-extractor-res/src/main/resources/org/apache/ctakes/relationextractor/models/modifier_extractor/model.jar

282M    total

or

$ find ./ -type f -size +5M | grep -v "\.jar" | grep -v "\.svn" | grep -v "\.git" | xargs du -hsc
9.2M    ./ctakes-coreference-res/src/main/resources/org/apache/ctakes/coreference/models/index_med_5k/_3.prx
20M     ./ctakes-coreference-res/src/main/resources/org/apache/ctakes/coreference/models/index_med_5k/_3.tvf
6.9M    ./ctakes-coreference-res/src/main/resources/org/apache/ctakes/coreference/pref_probs.txt
13M     ./ctakes-chunker-res/src/main/resources/org/apache/ctakes/chunker/models/chunker-model.zip
6.4M    ./ctakes-constituency-parser-res/src/main/resources/org/apache/ctakes/constituency/parser/models/thyme.bin
15M     ./ctakes-constituency-parser-res/src/main/resources/org/apache/ctakes/constituency/parser/models/sharpacq-3.1.bin
12M     ./ctakes-constituency-parser-res/src/main/resources/org/apache/ctakes/constituency/parser/models/sharpacq-1.5.bin
84M     ./resources/org/apache/ctakes/dictionary/lookup/fast/sno_rx_16ab/sno_rx_16ab.script
11M     ./ctakes-assertion-res/src/main/resources/org/apache/ctakes/assertion/models/pos.model
38M     ./ctakes-assertion-res/resources/model/sharpi2b2mipacqnegex/polarity/training-data.liblinear
9.6M    ./ctakes-temporal/src/main/resources/org/apache/ctakes/temporal/thyme_word2vec_mapped_50.vec
91M     ./ctakes-temporal/src/main/resources/org/apache/ctakes/temporal/gloveresult_3
67M     ./ctakes-temporal/src/main/resources/org/apache/ctakes/temporal/mimic_vectors.txt

378M    total
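
The survey above could also be wrapped in a small reusable function. This is
only a minimal sketch, assuming a POSIX shell with find, du and sort; the name
list_large_files and its two arguments are made up for illustration and are
not part of cTAKES. It prunes the .svn and .git directories instead of
filtering them out with grep, and it prints the largest files first:

```shell
#!/bin/sh
# list_large_files DIR MIN_KB   (hypothetical helper, for illustration only)
# Prints "<disk usage in KB><TAB><path>" for every regular file under DIR
# whose size exceeds MIN_KB kilobytes, skipping .svn and .git metadata,
# sorted with the largest file first.
list_large_files() {
    dir=${1:-.}
    min_kb=${2:-5120}   # 5120 KB = 5 MB, the same threshold as "-size +5M"
    # -prune stops find from descending into the version-control directories;
    # "-exec du -k {} +" hands the matching files to du in batches.
    find "$dir" \( -name .svn -o -name .git \) -prune -o \
        -type f -size +"${min_kb}"k -exec du -k {} + | sort -rn
}
```

For example, "list_large_files . 5120" run from the top of the checkout lists
every file above 5 MB, including the jars, so both listings above can be
produced with one command.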

Are all these resources still relevant? Is there a way to generate them?

I do not wish to open Pandora's box though,
Alex


On Mon, Nov 20, 2017 at 9:29 AM, Finan, Sean <Sean.Finan@childrens.harvard.edu> wrote:

> Thanks Tim!
>
> -----Original Message-----
> From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu]
> Sent: Monday, November 20, 2017 6:33 AM
> To: dev@ctakes.apache.org
> Subject: Re: Contribute to ctakes: it is in your best interests! RE:
> unknown dependencies [EXTERNAL] [SUSPICIOUS] [SUSPICIOUS] [SUSPICIOUS]
>
> Git is available to apache projects, and many projects have moved over
> (see here: https://git-wip-us.apache.org/repos/asf):
> Here is the general info on what that looks like:
> https://www.apache.org/dev/writable-git
>
> A few points from that link:
> > Projects can request moving to Git as their main code repository, by
> > creating an INFRA issue. See also the infra-contact page.
> > Projects can request new, blank repositories by using reporeq.apache.org.
> > The current system has basic git support only. We are working on
> > extending this service in the near future.
> > Custom commit or other hooks will not be supported, all projects get the
> > same hooks. Setting up gitpubsub should provide sufficient flexibility
> > without impacting the core Git setup, volunteers are welcome to make that
> > happen.
>
> (Not sure what basic support only means.)
>
> There are also read-only git repos available by default for every project
> and updated in near-real-time:
> https://www.apache.org/dev/git.html
>
> With those, I guess the suggested workflow is to work off of that repo and
> then just submit patches to someone who commits with svn rather than
> committing directly.
>
> I've been using the git-svn connector myself recently since I just vastly
> prefer the git lightweight branching for focused development, as it helps
> me keep a cleaner working directory. But that adds some additional annoying
> steps.
>
> Tim
>
> ________________________________________
> From: Finan, Sean <sean.fi...@childrens.harvard.edu>
> Sent: Saturday, November 18, 2017 1:23 PM
> To: dev@ctakes.apache.org
> Subject: RE: Contribute to ctakes: it is in your best interests! RE:
> unknown dependencies [EXTERNAL] [SUSPICIOUS] [SUSPICIOUS]
>
> Hi Dave,
>
> Those are some great thoughts.  Being an apache project I am not sure how
> far we can move from svn, but there may be a way.  You are not the first to
> voice this desire for an active github repo and I'm sure that you won't be
> the last.
>
> I completely agree with your discussion board preference.  Do you have any
> recommendations?
>
> You make a great point regarding documentation.  In reference to things
> that anybody can quickly contribute ... that would be a big one.
> Volunteers?!?
>
> I am really happy to hear that you want to contribute - more than you
> already have, which is actually quite a bit!
>
> Cheers,
> Sean
>
> -----Original Message-----
> From: David Kincaid [mailto:kincaid.d...@gmail.com]
> Sent: Saturday, November 18, 2017 1:10 PM
> To: dev@ctakes.apache.org
> Subject: Re: Contribute to ctakes: it is in your best interests! RE:
> unknown dependencies [EXTERNAL] [SUSPICIOUS]
>
> Sean, I can share a couple things that have been an obstacle for me. It
> may seem a minor point to some, but I left Subversion behind years ago and
> really have no desire to go back. If the project were moved over to
> Git/Github it would really smooth the way for me at least. I would be happy
> to help out with this. One of the other things I would really like to see
> is the mailing list moved onto a discussion board platform. It seems to me
> that a discussion board style of tool tends to create a more active
> community than a mailing list does.
>
> The other thing that might help get new people involved is making it
> easier to find information about the development environment. Things like
> branching strategies, coding conventions, etc are really hard to find from
> the main cTAKES web site. I saw some references to Jenkins builds recently
> on the list. I had no idea there was a Jenkins CI server for the project
> somewhere. It also takes some digging to find a link to Jira. Maybe we
> could create a Wiki page that describes where all these tools are and how
> they are used.
>
> You guys have really done some great work over the last couple of years
> cleaning up the code base and improving the documentation by a ton. Things
> like the fast dictionary annotator and the dictionary creator GUI are great
> additions and make it a lot easier for other people to get up and running
> more quickly. As I'm ramping up my research as well as some proof of
> concept stuff at work I'll be working more and more with cTAKES and would
> love to contribute more to the project.
>
> Just my thoughts.
>
> - Dave
>
>
> On Sat, Nov 18, 2017 at 11:10 AM, Finan, Sean <
> sean.fi...@childrens.harvard.edu> wrote:
>
> > Hi Tim, Alex,
> >
> > Great ideas.  I like your (Tim) idea to 1. start with commented code
> > removal.
> > Then maybe move on to
> > 2. sanity-test type unit tests - Little two or three-line "does this
> > method crack" tests.
> > And another that is simply
> > 3. "populate a test cas with type(s) X" and a factory with
> > "getSectionTestCas" "getSentenceTestCas" "getPosTestCas" "getChunkTestCas"
> > ...  just really simple reusables for tests.
> > Then
> > 4. refactor to extract and consolidate duplicate code - it is all over
> > the place ...
> >
> > These are just my initial thoughts and suggestions, but I think that those
> > 4 tasks can be performed by anybody of any experience level.   They build
> > upon each other and should help the implementers better understand ctakes.
> > After that the sky is the limit.
> >
> > A couple of years ago I sat on a panel at a workshop for open source
> > scientific software.  For the half dozen or so highlighted projects
> > (ctakes was one!) the common thread was that getting people to
> > contribute is extremely difficult.
> > I have a tendency to assume that people always act in their best
> > interests.  Any student thinking of going towards industry should be
> > jumping at the opportunity to contribute to a large,
> > production-quality project.  They should also realize that
> > contribution means potential recommendation (and possibly hiring
> > interest) by established developers, physicians and researchers that
> > use ctakes.  Even just answering questions on a user or dev list creates
> > credibility and can build a network.
> > Active researchers could discover common thoughts and directions that
> > could lead to collaboration outside ctakes.  Researchers and companies
> > trying to build upon open source should realize that direct
> > contribution is easier than custom substitution.  Plus, it is in their
> > best interests that code does what they need it to do in the fastest,
> > lightest, most stable way possible.
> > With a project like ctakes there are a lot of things that can be done,
> > there are great opportunities to really shine.  "I wrote this tool for
> > my thesis that performs some nlp task" sounds good.  Appending "in an
> > Apache product and it has been taken up by thousands across the globe"
> > makes it sound a lot better.
> > At my previous job in industry the company actively contributed to
> > several open source projects.  We had a few people for whom that was
> > 50% of their job.  Why?  Because we made a commitment to use that open
> > source software.
> > It was a better use of our resources to contribute to it, improve it
> > and keep its momentum going and prevent it from becoming stale (or
> > abandoned) while our software continued to move forward.
> >
> > Hmm, that was a touch more than I had planned to write.  A whole cup
> > of coffee in that one.
> >
> > Sean
> >
> >
> >
> >
> > -----Original Message-----
> > From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu]
> > Sent: Saturday, November 18, 2017 8:13 AM
> > To: dev@ctakes.apache.org
> > Subject: Re: unknown dependencies [EXTERNAL] [SUSPICIOUS]
> >
> > Thanks Alex, looks like that was probably a fat-fingered auto-import
> > on my part.
> >
> > I like your idea, and I don't know the best way to start either,
> > but maybe one suggestion is to start with one or two focused things to
> > clean up, and then ask for volunteers to take on specific modules?
> > Then people can contribute an hour here and there to do cleanup on
> > their task/module and try to fix that thing in a 1-2-month long
> > sprint. I am happy to contribute to cleanup, I am responsible for my
> > fair share of unclean code, but since I don't have strong software
> > engineering chops it would be good to have people with that background
> > propose the tasks and describe exactly what needs to be done. My idea
> > of cleaning is just to delete commented out sections of evaluation code.
> >
> > Tim
> >
> > ________________________________________
> > From: Alexandru Zbarcea <al...@apache.org>
> > Sent: Friday, November 17, 2017 4:46 PM
> > To: Apache cTAKES Dev
> > Subject: unknown dependencies [EXTERNAL]
> >
> > Hi,
> >
> > I noticed that a mistaken dependency has slipped into the code:
> > jdk.internal.org.objectweb.asm.commons.AnalyzerAdapter;
> >
> > Now that the Jenkins build is successful, I think it is easier to
> > clean up the code. I would like this to be a common effort. I don't know
> > the best way to approach this.
> >
> > Looking forward to your advice,
> > Alex
> >
>
