Hi Hadrian,

Very true on all points.  Any connections that you can make in healthcare are 
very welcome.

Sean

-----Original Message-----
From: Hadrian Zbarcea [mailto:hadr...@apache.org] 
Sent: Monday, November 20, 2017 8:11 PM
To: dev@ctakes.apache.org
Subject: Re: Contribute to ctakes: it is in your best interests! RE: unknown 
dependencies [EXTERNAL] [SUSPICIOUS] [SUSPICIOUS]

Hi all,

Sorry for the late reply, quite a bit to digest. I will add some thoughts now 
and more later. (I may not be super responsive, I have to deal with a lot of 
email).

Looks like the topic of the thread is growing the community. It is indeed hard, 
I spent a ton of time on building strong communities for a few projects in the 
past. Something that is sometimes not well understood is that every open source 
project (ASF makes no exception) has actually two communities: users and 
developers. ASF recognizes that by providing separate mailing lists for the 
two. It is hard to grow dev@ without growing users@ and I would focus on that. 
Most of the ideas proposed only address dev@. Growing users@ means a few hard 
and unrewarding tasks:
* clearly articulate the value proposition of the project, target audience and 
benefits (not features)
* easy to 'get started' (ideally under 5 min)
* easy to post feedback (comments/bug reports)
* responsiveness in fixing issues
* and yes, documentation

Things like migration to git, in my experience, address the middle bullets. 
(However, I would think very hard about removing the binaries from the code 
base before moving to git).

Personally, I do have some good connections in healthcare and I may be able to 
help. I had a presentation at the OSEHRA Summit in June and I was surprised 
that almost nobody was aware of cTAKES, but NLP was mentioned quite a bit. Last 
week I was at the HSPC meeting in Indianapolis where, again, very few knew 
about cTAKES.

If the project is interested I could find more resources for the project and 
help with growing adoption.

Cheers,
Hadrian



On 11/20/2017 10:41 AM, Finan, Sean wrote:
> Hi Alex,
> Some great ideas, all of which are deserving of comment.
>
>>    - There is code commented out, but much of this code seems to still be
>>    valuable, like it was commented from some migrations and was left over for
>>    somebody to follow-up (e.g. unit tests).
> True.  Some intelligence is required.  When in doubt, leave it - but there 
> are a lot of things that are obviously moved or old rewritten code.  This is 
> all volunteer and just getting people involved with "baby steps" would be 
> great.  I would also hope that some inactive authors come back and clean up 
> comments in their own code.  Or write those unit tests if that was the 
> intention.  There are  TODO comments in the code that could be tackled.
>
>>    - There are issues reported by SonarQube [1] like:
> This should be handled with kid gloves.  A lot of those reports cover items 
> that are not yet complete, ordered for easier following / understanding of 
> code, etc.  However a lot can be handled easily and quickly, like adding 
> @Override ...  People can use local plugins that check code like FindBugs.  I 
> used to be religious about it but have become lax.  This is a good reminder 
> for me to start again.
>
>>    - Removal of hardcoded paths like: "/tmp",
> I am in complete agreement.  Things like /tmp should probably even be 
> refactored to use temp files.  Things like default paths used in static 
> createAnnotatorDescription() should instead probably be used in 
> @ConfigurationParameter default= ...
> --- Building upon that statement, it would be nice to migrate older 
> annotation engines, readers, and cas consumers to the uimafit paradigm.  This 
> would help a newbie understand the difference and how to use AEs, etc.
>
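On the point above about replacing hardcoded paths like "/tmp" with temp files, here is a minimal sketch of what that could look like in plain Java. The class name `TempFileSketch`, the helper `writeToTempFile`, and the "ctakes-sketch-" prefix are hypothetical names used only for illustration; nothing here is taken from the cTAKES code base.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration only; not part of cTAKES.
class TempFileSketch {

    // Writes text to a new file in the platform's default temp directory,
    // so no "/tmp" or "C:/Users/..." path is hardcoded.
    static Path writeToTempFile(String text) throws IOException {
        // The prefix and suffix are assumptions chosen for this sketch.
        Path tempFile = Files.createTempFile("ctakes-sketch-", ".txt");
        // Ask the JVM to delete the file on normal exit.
        tempFile.toFile().deleteOnExit();
        Files.write(tempFile, text.getBytes(StandardCharsets.UTF_8));
        return tempFile;
    }

    public static void main(String[] args) throws IOException {
        Path written = writeToTempFile("hello");
        System.out.println("Wrote " + written);
    }
}
```

The location then follows the java.io.tmpdir system property, so the same code should behave on Linux, macOS and Windows without edits.
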
>>    - Migrate scripts from Ant (files like build-*.xml) to maven.
> Does ctakes have these?  I guess that I've missed them.  Yeah, full maven 
> would be nice.
>
>>    - Deprecated code
> We certainly have a lot of it.  It is a good excuse to make unit tests before 
> updating.
>
>>    - I think it is time to define some conventions for:
>        - formatting (indentation),
>        - crlf conventions (see .gitattributes)
>        - etc
> You are correct; indentation and crlf should also be settable by a decent ide 
> for any VCS.  I think that most ctakes code is space indented, 3 per 
> indentation, and \n only for newlines.  I could be wrong.
> Things that are more stylistic (naming, ordering, etc.) are much more coder- 
> preference.  I would rather have contributions than turn people off with 
> strictures.  I'll even take things like missing { } ...  though there is 
> another great target for refactoring ...
>
>>    - For git vs Subversion, I am able to use the same folder with a .git
> Thanks for the documentation!  As an Apache project we would need to vote on 
> fully moving to git (as Tim and Dave suggested).  I am definitely not opposed 
> to that - I use github for everything else these days ...
>
>>    - There are commits without any reference to Jira issues or other type
> Guilty as charged.  A lot of my commits are new development and I only write 
> commit comments.  I could open a jira for each, but I am admittedly lazy 
> about such things.  Ditto for placing links in an email appendix.
>
>> Also, based on the decision to use semantic versioning, it
>> will need to choose between 4.0.1 or 4.1.0.
> Personally I think that our next release should be 4.1.0 as there are enough 
> new features to distinguish that it isn't just a patch release.
> https://urldefense.proofpoint.com/v2/url?u=http-3A__semver.org_&d=DwIC
> aQ&c=qS4goWBT7poplM69zy_3xhKwEW14JZMSdioCoppxeFU&r=fs67GvlGZstTpyIisCY
> NYmQCP6r0bcpKGd4f7d4gTao&m=2C9CloGmEzEKjb3IhrnmQQjurILI1_YIt7IHsyAiVzQ
> &s=WQhtqrfi8UvHWyiJo0-iNT3I6GXB41jxS0dWS9YMd4s&e=
>
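To make the 4.0.1 vs 4.1.0 choice above concrete, here is a small sketch of the semantic versioning bump rules from semver.org. It assumes the current release is 4.0.0 (implied by the two candidates, not stated in the thread), and the class `SemVerSketch`, the enum `Change` and the method `nextVersion` are hypothetical names for illustration.

```java
// Hypothetical illustration of semantic versioning bumps; not part of cTAKES.
class SemVerSketch {

    // Kind of change since the last release.
    enum Change { PATCH, MINOR, MAJOR }

    // Returns the next MAJOR.MINOR.PATCH version for the given kind of change.
    // Assumes a plain three-part numeric version with no pre-release suffix.
    static String nextVersion(String current, Change change) {
        String[] parts = current.split("\\.");
        int major = Integer.parseInt(parts[0]);
        int minor = Integer.parseInt(parts[1]);
        int patch = Integer.parseInt(parts[2]);
        switch (change) {
            case MAJOR:
                // Incompatible API changes reset minor and patch.
                return (major + 1) + ".0.0";
            case MINOR:
                // Backwards-compatible new features reset patch.
                return major + "." + (minor + 1) + ".0";
            default:
                // Backwards-compatible bug fixes only.
                return major + "." + minor + "." + (patch + 1);
        }
    }

    public static void main(String[] args) {
        System.out.println(nextVersion("4.0.0", Change.PATCH)); // prints 4.0.1
        System.out.println(nextVersion("4.0.0", Change.MINOR)); // prints 4.1.0
    }
}
```

Under these rules, a release that only fixes bugs would be 4.0.1, while one that adds backwards-compatible features would be 4.1.0.
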
> Thanks,
> Sean
>
> -----Original Message-----
> From: Alexandru Zbarcea [mailto:al...@apache.org]
> Sent: Monday, November 20, 2017 8:34 AM
> To: Apache cTAKES Dev; Hadrian Zbarcea
> Subject: Re: Contribute to ctakes: it is in your best interests! RE: 
> unknown dependencies [EXTERNAL] [SUSPICIOUS] [SUSPICIOUS]
>
> Hi,
>
> To grow the community and bring even more adoption is my desire, too. I 
> cannot agree more with what you said, Sean, Tim.
>
> I have discussed with Hadrian (Apache member) about cTAKES adoption and I 
> think he has great ideas about the priorities for this community to grow. I 
> would like to introduce him to the community and let him express some ideas.
>
> In regards to the technical issues that were already identified on this 
> thread, I would like to understand your perspective and prioritization.
>
>     - There is code commented out, but much of this code seems to still be
>     valuable, like it was commented from some migrations and was left over for
>     somebody to follow-up (e.g. unit tests).
>     - There are issues reported by SonarQube [1] like:
>        - 3.3K bugs [2]
>        - 16.5% code duplication (24K LoC) [3]
>        - 174 bugs in the last month [4]
>
>
>     - I would like to see more Unit Tests for the code. There are new
>     commits unrelated to a feature description and so, there is no clear
>     understanding about what the review should focus on. I think it relates to
>     the same request from Sean to have "sanity-test type unit tests - Little
>     two or three-line "does this method crack" tests.". I see this task as one
>     of the most important ones.
>     - Removal of hardcoded paths like: "/tmp",
>     "C:/Users/<some-user>/<some-path>".
>     - Migrate scripts from Ant (files like build-*.xml) to maven. It makes
>     the code so unpredictable. I find it difficult to navigate through these
>     when tests are dependent upon these executions.
>     - Classpaths manually specified.
>     - Deprecated code
>     - Old libraries which involve security risks in production (e.g. Spring
>     that was just upgraded)
>
> Other tasks that are related more to productivity.
>
>     - I think it is time to define some conventions for:
>        - formatting (indentation),
>        - crlf conventions (see .gitattributes)
>        - etc
>     - For git vs Subversion, I am able to use the same folder with a .git
>     and .svn VCS and documented on the wiki [5].
>     - There are commits without any reference to Jira issues or other type
>     of documentation. In consequence, when the release comes, it will be very
>     hard to hunt those changes and understand why those commits were made:
>     bugs vs features. Also, based on the decision to use semantic versioning, it
>     will need to choose between 4.0.1 or 4.1.0.
>
> My $0.02,
> Alex
>
> [1] -
> https://urldefense.proofpoint.com/v2/url?u=https-3A__builds.apache.org
> _analysis_overview-3Fid-3Dorg.apache.ctakes-253Actakes&d=DwIBaQ&c=qS4g
> oWBT7poplM69zy_3xhKwEW14JZMSdioCoppxeFU&r=fs67GvlGZstTpyIisCYNYmQCP6r0
> bcpKGd4f7d4gTao&m=PHstasp4Y8wYPWGquySLw2RPNP-d8XkCTXvOuP-YWuI&s=ZBpW0O
> VPlYu308dmEv3E6DK93VfUe8NLi0OClLqa2Sk&e=
> [2] -
> https://urldefense.proofpoint.com/v2/url?u=https-3A__builds.apache.org
> _analysis_component-5Fissues-3Fid-3Dorg.apache.ctakes-253Actakes-23res
> olved-3Dfalse-257Ctypes-3DBUG&d=DwIBaQ&c=qS4goWBT7poplM69zy_3xhKwEW14J
> ZMSdioCoppxeFU&r=fs67GvlGZstTpyIisCYNYmQCP6r0bcpKGd4f7d4gTao&m=PHstasp
> 4Y8wYPWGquySLw2RPNP-d8XkCTXvOuP-YWuI&s=Vot25EW4XwGjz9uLwHo4rc62shM_0n-
> 6Yy5u9BjktsM&e=
> [3] -
> https://urldefense.proofpoint.com/v2/url?u=https-3A__builds.apache.org
> _analysis_component-5Fmeasures_metric_duplicated-5Fblocks_list-3Fid-3D
> org.apache.ctakes-253Actakes&d=DwIBaQ&c=qS4goWBT7poplM69zy_3xhKwEW14JZ
> MSdioCoppxeFU&r=fs67GvlGZstTpyIisCYNYmQCP6r0bcpKGd4f7d4gTao&m=PHstasp4
> Y8wYPWGquySLw2RPNP-d8XkCTXvOuP-YWuI&s=NKhS3KX3JBBiuFbfjPSq2WT-qibS-QSQ
> zqkG8KbiLIk&e=
> [4] -
> https://urldefense.proofpoint.com/v2/url?u=https-3A__builds.apache.org
> _analysis_component-5Fissues-3Fid-3Dorg.apache.ctakes-253Actakes-23res
> olved-3Dfalse-257Ctypes-3DBUG-257CsinceLeakPeriod-3Dtrue&d=DwIBaQ&c=qS
> 4goWBT7poplM69zy_3xhKwEW14JZMSdioCoppxeFU&r=fs67GvlGZstTpyIisCYNYmQCP6
> r0bcpKGd4f7d4gTao&m=PHstasp4Y8wYPWGquySLw2RPNP-d8XkCTXvOuP-YWuI&s=tNsg
> iXoIKXQPQAzM7g-EEXEephKMNEG50OBl8iuD6lU&e=
> [5] -
> https://urldefense.proofpoint.com/v2/url?u=https-3A__cwiki.apache.org_
> confluence_display_CTAKES_cTAKES-2B4.0-2BDeveloper-2BInstall-2BGuide-2
> 3cTAKES4.0DeveloperInstallGuide-2DSubversion-2BGit&d=DwIBaQ&c=qS4goWBT
> 7poplM69zy_3xhKwEW14JZMSdioCoppxeFU&r=fs67GvlGZstTpyIisCYNYmQCP6r0bcpK
> Gd4f7d4gTao&m=PHstasp4Y8wYPWGquySLw2RPNP-d8XkCTXvOuP-YWuI&s=lMZ18SEZob
> 73AXp4a3sMrd22nHpwFtQ__4fR-Q5QQuI&e=
>
>
>
> On Mon, Nov 20, 2017 at 6:32 AM, Miller, Timothy < 
> timothy.mil...@childrens.harvard.edu> wrote:
>
>> Git is available to apache projects, and many projects have moved 
>> over (see here: 
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__git-2Dwip-2Dus.apache.org_repos_asf&d=DwIBaQ&c=qS4goWBT7poplM69zy_3xhKwEW14JZMSdioCoppxeFU&r=fs67GvlGZstTpyIisCYNYmQCP6r0bcpKGd4f7d4gTao&m=PHstasp4Y8wYPWGquySLw2RPNP-d8XkCTXvOuP-YWuI&s=qGV9tIcYJGK-tQAMYm5cWevWrBSixPCHj3VfaXum288&e=):
>> Here is the general info on what that looks like:
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__www.apache.org_dev_writable-2Dgit&d=DwIBaQ&c=qS4goWBT7poplM69zy_3xhKwEW14JZMSdioCoppxeFU&r=fs67GvlGZstTpyIisCYNYmQCP6r0bcpKGd4f7d4gTao&m=PHstasp4Y8wYPWGquySLw2RPNP-d8XkCTXvOuP-YWuI&s=BRSYUV67HZtyxzLNbqPzAlS-YZmqUpA30rvPsNKX6i0&e=
>>
>> A few points from that link:
>>> Projects can request moving to Git as their main code repository, by
>>> creating an INFRA issue. See also the infra-contact page.
>>> Projects can request new, blank repositories by using reporeq.apache.org.
>>> The current system has basic git support only. We are working on
>>> extending this service in the near future.
>>> Custom commit or other hooks will not be supported, all projects get
>>> the same hooks. Setting up gitpubsub should provide sufficient
>>> flexibility without impacting the core Git setup, volunteers are
>>> welcome to make that happen.
>>
>> (Not sure what basic support only means.)
>>
>> There are also read-only git repos available by default for every 
>> project and updated in near-real-time:
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__www.apache.org_dev_git.html&d=DwIBaQ&c=qS4goWBT7poplM69zy_3xhKwEW14JZMSdioCoppxeFU&r=fs67GvlGZstTpyIisCYNYmQCP6r0bcpKGd4f7d4gTao&m=PHstasp4Y8wYPWGquySLw2RPNP-d8XkCTXvOuP-YWuI&s=CtgGvLG2s_KqVRWx_tZAcaMSh_KKH4aqc6HGTP3dmtA&e=
>>
>> with those I guess the suggested workflow is to work off of that repo 
>> and then just submit patches to someone who commits with svn rather 
>> than committing directly.
>>
>> I've been using the git-svn connector myself recently since I just 
>> vastly prefer the git lightweight branching for focused development, 
>> as it helps me keep a cleaner working directory. But that adds some 
>> additional annoying steps.
>>
>> Tim
>>
>> ________________________________________
>> From: Finan, Sean <sean.fi...@childrens.harvard.edu>
>> Sent: Saturday, November 18, 2017 1:23 PM
>> To: dev@ctakes.apache.org
>> Subject: RE: Contribute to ctakes: it is in your best interests! RE:
>> unknown dependencies [EXTERNAL] [SUSPICIOUS] [SUSPICIOUS]
>>
>> Hi Dave,
>>
>> Those are some great thoughts.  Being an apache project I am not sure 
>> how far we can move from svn, but there may be a way.  You are not 
>> the first to voice this desire for an active github repo and I'm sure 
>> that you won't be the last.
>>
>> I completely agree with your discussion board preference.  Do you 
>> have any recommendations?
>>
>> You make a great point regarding documentation.  In reference to 
>> things that anybody can quickly contribute ... that would be a big one.
>> Volunteers?!?
>>
>> I am really happy to hear that you want to contribute - more than you 
>> already have, which is actually quite a bit!
>>
>> Cheers,
>> Sean
>>
>> -----Original Message-----
>> From: David Kincaid [mailto:kincaid.d...@gmail.com]
>> Sent: Saturday, November 18, 2017 1:10 PM
>> To: dev@ctakes.apache.org
>> Subject: Re: Contribute to ctakes: it is in your best interests! RE:
>> unknown dependencies [EXTERNAL] [SUSPICIOUS]
>>
>> Sean, I can share a couple things that have been an obstacle for me.
>> It may seem a minor point to some, but I left Subversion behind years 
>> ago and really have no desire to go back. If the project were moved 
>> over to Git/Github it would really smooth the way for me at least. I 
>> would be happy to help out with this. One of the other things I would 
>> really like to see is the mailing list moved onto a discussion board 
>> platform. It seems to me that a discussion board style of tool tends 
>> to create a more active community than a mailing list does.
>>
>> The other thing that might help get new people involved is making it 
>> easier to find information about the development environment. Things 
>> like branching strategies, coding conventions, etc are really hard to 
>> find from the main cTAKES web site. I saw some references to Jenkins 
>> builds recently on the list. I had no idea there was a Jenkins CI 
>> server for the project somewhere. It also takes some digging to find 
>> a link to Jira. Maybe we could create a Wiki page that describes 
>> where all these tools are and how they are used.
>>
>> You guys have really done some great work over the last couple of 
>> years cleaning up the code base and improving the documentation by a 
>> ton. Things like the fast dictionary annotator, dictionary creator 
>> GUI are a great addition and make it a lot easier for other people to 
>> get up and running more quickly. As I'm ramping up my research as 
>> well as some proof of concept stuff at work I'll be working more and 
>> more with cTAKES and would love to contribute more to the project.
>>
>> Just my thoughts.
>>
>> - Dave
>>
>>
>> On Sat, Nov 18, 2017 at 11:10 AM, Finan, Sean < 
>> sean.fi...@childrens.harvard.edu> wrote:
>>
>>> Hi Tim, Alex,
>>>
>>> Great ideas.  I like your (Tim) idea to 1. start with commented code 
>>> removal.
>>> Then maybe move on to
>>> 2. sanity-test type unit tests - Little two or three-line "does this 
>>> method crack" tests.
>>> And another that is simply
>>> 3. "populate a test cas with type(s) X" and a factory with 
>>> "getSectionTestCas" "getSentenceTestCas" "getPosTestCas" "getChunkTestCas"
>>> ...  just really simple reusables for tests.
>>> Then
>>> 4. refactor to extract and consolidate duplicate code - it is all 
>>> over the place ...
>>>
>>> These are just my initial thoughts and suggestions, but I think that
>>> those 4 tasks can be performed by anybody of any experience level.
>>> They build upon each other and should help the implementers better
>>> understand ctakes.
>>> After that the sky is the limit.
>>>
>>> A couple of years ago I sat on a panel at a workshop for open source 
>>> scientific software.  For the half dozen or so highlighted projects 
>>> (ctakes was one!) the common thread was that getting people to 
>>> contribute is extremely difficult.
>>> I have a tendency to assume that people always act in their best 
>>> interests.  Any student thinking of going towards industry should be 
>>> jumping at the opportunity to contribute to a large, 
>>> production-quality project.  They should also realize that 
>>> contribution means potential recommendation (and possibly hiring
>>> interest) by established developers, physicians and researchers that 
>>> use ctakes.  Even just answering questions on a user or dev list 
>>> creates credibility and can build a network.
>>> Active researchers could discover common thoughts and directions 
>>> that could lead to collaboration outside ctakes.  Researchers and 
>>> companies trying to build upon open source should realize that 
>>> direct contribution is easier than custom substitution.  Plus, it is 
>>> in their best interests that code does what they need it to do in 
>>> the fastest, lightest, most stable way possible.
>>> With a project like ctakes there are a lot of things that can be 
>>> done, there are great opportunities to really shine.  "I wrote this 
>>> tool for my thesis that performs some nlp task" sounds good.
>>> Appending "in an Apache product and it has been taken up by thousands 
>>> across the globe"
>>> makes it sound a lot better.
>>> At my previous job in industry the company actively contributed to 
>>> several open source projects.  We had a few people for whom that was 
>>> 50% of their job.  Why?  Because we made a commitment to use that 
>>> open source software.
>>> It was a better use of our resources to contribute to it, improve it 
>>> and keep its momentum going and prevent it from becoming stale (or
>>> abandoned) while our software continued to move forward.
>>>
>>> Hmm, that was a touch more than I had planned to write.  A whole cup 
>>> of coffee in that one.
>>>
>>> Sean
>>>
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu]
>>> Sent: Saturday, November 18, 2017 8:13 AM
>>> To: dev@ctakes.apache.org
>>> Subject: Re: unknown dependencies [EXTERNAL] [SUSPICIOUS]
>>>
>>> Thanks Alex, looks like that was probably a fat-fingered auto-import 
>>> on my part.
>>>
>>> I like your idea, and I don't know the best way to start either, 
>>> but maybe one suggestion is to start with one or two focused things 
>>> to clean up, and then ask for volunteers to take on specific modules?
>>> Then people can contribute an hour here and there to do cleanup on 
>>> their task/module and try to fix that thing in a 1-2-month long 
>>> sprint. I am happy to contribute to cleanup, I am responsible for my 
>>> fair share of unclean code, but since I don't have strong software 
>>> engineering chops it would be good to have people with that 
>>> background propose the tasks and describe exactly what needs to be 
>>> done. My idea of cleaning is just to delete commented out sections of 
>>> evaluation code.
>>>
>>> Tim
>>>
>>> ________________________________________
>>> From: Alexandru Zbarcea <al...@apache.org>
>>> Sent: Friday, November 17, 2017 4:46 PM
>>> To: Apache cTAKES Dev
>>> Subject: unknown dependencies [EXTERNAL]
>>>
>>> Hi,
>>>
>>> I notice that a mistaken dependency has slipped into the code:
>>> jdk.internal.org.objectweb.asm.commons.AnalyzerAdapter;
>>>
>>> Now that the Jenkins build is successful, I think it is easier to 
>>> clean up the code. I would like it to be a common effort. I don't know 
>>> the best way to approach this.
>>>
>>> Looking forward to your advice,
>>> Alex
>>>
