Hi Hadrian, Very true on all points. Any connections that you can make in healthcare are very welcome.
Sean

-----Original Message-----
From: Hadrian Zbarcea [mailto:hadr...@apache.org]
Sent: Monday, November 20, 2017 8:11 PM
To: dev@ctakes.apache.org
Subject: Re: Contribute to ctakes: it is in your best interests! RE: unknown dependencies [EXTERNAL] [SUSPICIOUS] [SUSPICIOUS]

Hi all,

Sorry for the late reply, quite a bit to digest. I will add some thoughts now and more later. (I may not be super responsive, I have to deal with a lot of email).

Looks like the topic of the thread is growing the community. It is indeed hard, I spent a ton of time on building strong communities for a few projects in the past.

Something that is sometimes not well understood is that every open source project (ASF makes no exception) has actually two communities: users and developers. ASF recognizes that by providing separate mailing lists for the two. It is hard to grow dev@ without growing users@ and I would focus on that. Most of the ideas proposed only address dev@. Growing users@ means a few hard and unrewarding tasks:

* clearly articulate the value proposition of the project, target audience and benefits (not features)
* easy to 'get started' (ideally under 5 min)
* easy to post feedback (comments/bug reports)
* responsiveness in fixing issues
* and yes, documentation

Things like migration to git, in my experience, address the middle bullets. (However, I would think very hard about removing the binaries from the code base before moving to git).

Personally, I do have some good connections in healthcare and I may be able to help. I had a presentation at the OSEHRA Summit in June and I was surprised that almost nobody was aware of cTakes, but NLP was mentioned quite a bit. Last week I was at the HSPC meeting in Indianapolis where, again, very few knew about cTakes. If the project is interested I could find more resources for the project and help with growing adoption.

Cheers,
Hadrian

On 11/20/2017 10:41 AM, Finan, Sean wrote:
> Hi Alex,
> Some great ideas, all of which are deserving of comment.
>
>> - There is code commented out, but much of this code seems to still be
>> valuable, like it was commented from some migrations and was left over for
>> somebody to follow-up (e.g. unit tests).
> True. Some intelligence is required. When in doubt, leave it - but there
> are a lot of things that are obviously moved or old rewritten code. This is
> all volunteer and just getting people involved with "baby steps" would be
> great. I would also hope that some inactive authors come back and clean up
> comments in their own code. Or write those unit tests if that was the
> intention. There are TODO comments in the code that could be tackled.
>
>> - There are issues reported by SonarQube [1] like:
> This should be handled with kid gloves. A lot of those reports cover items
> that are not yet complete, ordered for easier following / understanding of
> code, etc. However a lot can be handled easily and quickly, like adding
> @Override ... People can use local plugins that check code like findBugs. I
> used to be religious about it but have become lax. This is a good reminder
> for me to start again.
>
>> - Removal of hardcoded paths like: "/tmp",
> I am in complete agreement. Things like /tmp should probably even be
> refactored to use temp files. Things like default paths used in static
> createAnnotatorDescription() should instead probably be used in
> @ConfigurationParameter default= ...
> --- Building upon that statement, it would be nice to migrate older
> annotation engines, readers, and cas consumers to the uimafit paradigm. This
> would help a newbie understand the difference and how to use AEs, etc.
>
>> - Migrate scripts from Ant (files like build-*.xml) to maven.
> Does ctakes have these? I guess that I've missed them. Yeah, full maven
> would be nice.
>
>> - Deprecated code
> We certainly have a lot of it. It is a good excuse to make unit tests before
> updating.
>
>> - I think it is time to define some conventions for:
>>   - formatting (indentation),
>>   - crlf conventions (see .gitattributes)
>>   - etc
> You are correct; indentation and crlf should also be settable by a decent IDE
> for any VCS. I think that most ctakes code is space indented, 3 per
> indentation, and \n only for newlines. I could be wrong.
> Things that are more stylistic (naming, ordering, etc.) are much more coder-
> preference. I would rather have contributions than turn people off with
> strictures. I'll even take things like missing { } ... though there is
> another great target for refactoring ...
>
>> - For git vs Subversion, I am able to use the same folder with a .git
> Thanks for the documentation! As an Apache project we would need to vote on
> fully moving to git (as Tim and Dave suggested). I am definitely not opposed
> to that - I use github for everything else these days ...
>
>> - There are commits without any reference to Jira issues or other type
> Guilty as charged. A lot of my commits are new development and I only write
> commit comments. I could open a jira for each, but I am admittedly lazy
> about such things. Ditto for placing links in an email appendix.
>
>> Also, based on the decision to use semantic versioning, it
>> will need to choose between 4.0.1 or 4.1.0.
> Personally I think that our next release should be 4.1.0 as there are enough
> new features to distinguish that it isn't just a patch release.
> https://urldefense.proofpoint.com/v2/url?u=http-3A__semver.org_&d=DwICaQ&c=qS4goWBT7poplM69zy_3xhKwEW14JZMSdioCoppxeFU&r=fs67GvlGZstTpyIisCYNYmQCP6r0bcpKGd4f7d4gTao&m=2C9CloGmEzEKjb3IhrnmQQjurILI1_YIt7IHsyAiVzQ&s=WQhtqrfi8UvHWyiJo0-iNT3I6GXB41jxS0dWS9YMd4s&e=
>
> Thanks,
> Sean
>
> -----Original Message-----
> From: Alexandru Zbarcea [mailto:al...@apache.org]
> Sent: Monday, November 20, 2017 8:34 AM
> To: Apache cTAKES Dev; Hadrian Zbarcea
> Subject: Re: Contribute to ctakes: it is in your best interests! RE: unknown dependencies [EXTERNAL] [SUSPICIOUS] [SUSPICIOUS]
>
> Hi,
>
> To grow the community and bring even more adoption is my desire, too. I
> cannot agree more with what you said, Sean, Tim.
>
> I have discussed with Hadrian (Apache member) about cTAKES adoption and I
> think he has great ideas about the priorities for this community to grow. I
> would like to introduce him to the community and let him express some ideas.
>
> In regards to the technical issues that were already identified on this
> thread, I would like to understand your perspective and prioritization.
>
> - There is code commented out, but much of this code seems to still be
>   valuable, like it was commented from some migrations and was left over for
>   somebody to follow-up (e.g. unit tests).
> - There are issues reported by SonarQube [1] like:
>   - 3.3K bugs [2]
>   - 16.5% code duplication (24K LoC) [3]
>   - 174 bugs in the last month [4]
>
> - I would like to see more Unit Tests for the code. There are new
>   commits unrelated to a feature description and so, there is no clear
>   understanding about what the review should focus on. I think it relates to
>   the same request from Sean to have "sanity-test type unit tests - Little
>   two or three-line "does this method crack" tests.". I see this task as one
>   of the most important ones.
> - Removal of hardcoded paths like: "/tmp",
>   "C:/Users/<some-user>/<some-path>".
> - Migrate scripts from Ant (files like build-*.xml) to maven. It makes
>   the code so unpredictable. I find it difficult to navigate through these
>   when tests are dependent upon these executions.
> - Classpaths manually specified.
> - Deprecated code
> - Old libraries which involve security risks in production (e.g. Spring
>   that was just upgraded)
>
> Other tasks that are related more to productivity.
>
> - I think it is time to define some conventions for:
>   - formatting (indentation),
>   - crlf conventions (see .gitattributes)
>   - etc
> - For git vs Subversion, I am able to use the same folder with a .git
>   and .svn VCS and documented this on the wiki [5].
> - There are commits without any reference to Jira issues or other type
>   of documentation. As a consequence, when the release comes, it will be very
>   hard to hunt down those changes and understand why those commits were made:
>   bugs vs features. Also, based on the decision to use semantic versioning, it
>   will need to choose between 4.0.1 or 4.1.0.
>
> My $0.02,
> Alex
>
> [1] - https://urldefense.proofpoint.com/v2/url?u=https-3A__builds.apache.org_analysis_overview-3Fid-3Dorg.apache.ctakes-253Actakes&d=DwIBaQ&c=qS4goWBT7poplM69zy_3xhKwEW14JZMSdioCoppxeFU&r=fs67GvlGZstTpyIisCYNYmQCP6r0bcpKGd4f7d4gTao&m=PHstasp4Y8wYPWGquySLw2RPNP-d8XkCTXvOuP-YWuI&s=ZBpW0OVPlYu308dmEv3E6DK93VfUe8NLi0OClLqa2Sk&e=
> [2] - https://urldefense.proofpoint.com/v2/url?u=https-3A__builds.apache.org_analysis_component-5Fissues-3Fid-3Dorg.apache.ctakes-253Actakes-23resolved-3Dfalse-257Ctypes-3DBUG&d=DwIBaQ&c=qS4goWBT7poplM69zy_3xhKwEW14JZMSdioCoppxeFU&r=fs67GvlGZstTpyIisCYNYmQCP6r0bcpKGd4f7d4gTao&m=PHstasp4Y8wYPWGquySLw2RPNP-d8XkCTXvOuP-YWuI&s=Vot25EW4XwGjz9uLwHo4rc62shM_0n-6Yy5u9BjktsM&e=
> [3] - https://urldefense.proofpoint.com/v2/url?u=https-3A__builds.apache.org_analysis_component-5Fmeasures_metric_duplicated-5Fblocks_list-3Fid-3Dorg.apache.ctakes-253Actakes&d=DwIBaQ&c=qS4goWBT7poplM69zy_3xhKwEW14JZMSdioCoppxeFU&r=fs67GvlGZstTpyIisCYNYmQCP6r0bcpKGd4f7d4gTao&m=PHstasp4Y8wYPWGquySLw2RPNP-d8XkCTXvOuP-YWuI&s=NKhS3KX3JBBiuFbfjPSq2WT-qibS-QSQzqkG8KbiLIk&e=
> [4] - https://urldefense.proofpoint.com/v2/url?u=https-3A__builds.apache.org_analysis_component-5Fissues-3Fid-3Dorg.apache.ctakes-253Actakes-23resolved-3Dfalse-257Ctypes-3DBUG-257CsinceLeakPeriod-3Dtrue&d=DwIBaQ&c=qS4goWBT7poplM69zy_3xhKwEW14JZMSdioCoppxeFU&r=fs67GvlGZstTpyIisCYNYmQCP6r0bcpKGd4f7d4gTao&m=PHstasp4Y8wYPWGquySLw2RPNP-d8XkCTXvOuP-YWuI&s=tNsgiXoIKXQPQAzM7g-EEXEephKMNEG50OBl8iuD6lU&e=
> [5] - https://urldefense.proofpoint.com/v2/url?u=https-3A__cwiki.apache.org_confluence_display_CTAKES_cTAKES-2B4.0-2BDeveloper-2BInstall-2BGuide-23cTAKES4.0DeveloperInstallGuide-2DSubversion-2BGit&d=DwIBaQ&c=qS4goWBT7poplM69zy_3xhKwEW14JZMSdioCoppxeFU&r=fs67GvlGZstTpyIisCYNYmQCP6r0bcpKGd4f7d4gTao&m=PHstasp4Y8wYPWGquySLw2RPNP-d8XkCTXvOuP-YWuI&s=lMZ18SEZob73AXp4a3sMrd22nHpwFtQ__4fR-Q5QQuI&e=
>
> On Mon, Nov 20, 2017 at 6:32 AM, Miller, Timothy <timothy.mil...@childrens.harvard.edu> wrote:
>
>> Git is available to apache projects, and many projects have moved
>> over (see here:
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__git-2Dwip-2Dus.apache.org_repos_asf&d=DwIBaQ&c=qS4goWBT7poplM69zy_3xhKwEW14JZMSdioCoppxeFU&r=fs67GvlGZstTpyIisCYNYmQCP6r0bcpKGd4f7d4gTao&m=PHstasp4Y8wYPWGquySLw2RPNP-d8XkCTXvOuP-YWuI&s=qGV9tIcYJGK-tQAMYm5cWevWrBSixPCHj3VfaXum288&e=):
>> Here is the general info on what that looks like:
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__www.apache.org_dev_writable-2Dgit&d=DwIBaQ&c=qS4goWBT7poplM69zy_3xhKwEW14JZMSdioCoppxeFU&r=fs67GvlGZstTpyIisCYNYmQCP6r0bcpKGd4f7d4gTao&m=PHstasp4Y8wYPWGquySLw2RPNP-d8XkCTXvOuP-YWuI&s=BRSYUV67HZtyxzLNbqPzAlS-YZmqUpA30rvPsNKX6i0&e=
>>
>> A few points from that link:
>>> Projects can request moving to Git as their main code repository, by
>>> creating an INFRA issue. See also the infra-contact page.
>>> Projects can request new, blank repositories by using reporeq.apache.org.
>>> The current system has basic git support only. We are working on
>>> extending this service in the near future.
>>> Custom commit or other hooks will not be supported, all projects get the
>>> same hooks. Setting up gitpubsub should provide sufficient flexibility
>>> without impacting the core Git setup, volunteers are welcome to make
>>> that happen.
>>
>> (Not sure what basic support only means.)
>>
>> There are also read-only git repos available by default for every
>> project and updated in near-real-time:
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__www.apache.org_dev_git.html&d=DwIBaQ&c=qS4goWBT7poplM69zy_3xhKwEW14JZMSdioCoppxeFU&r=fs67GvlGZstTpyIisCYNYmQCP6r0bcpKGd4f7d4gTao&m=PHstasp4Y8wYPWGquySLw2RPNP-d8XkCTXvOuP-YWuI&s=CtgGvLG2s_KqVRWx_tZAcaMSh_KKH4aqc6HGTP3dmtA&e=
>>
>> With those, I guess the suggested workflow is to work off of that repo
>> and then just submit patches to someone who commits with svn rather
>> than committing directly.
>>
>> I've been using the git-svn connector myself recently since I just
>> vastly prefer the git lightweight branching for focused development,
>> as it helps me keep a cleaner working directory. But that adds some
>> additional annoying steps.
>>
>> Tim
>>
>> ________________________________________
>> From: Finan, Sean <sean.fi...@childrens.harvard.edu>
>> Sent: Saturday, November 18, 2017 1:23 PM
>> To: dev@ctakes.apache.org
>> Subject: RE: Contribute to ctakes: it is in your best interests! RE: unknown dependencies [EXTERNAL] [SUSPICIOUS] [SUSPICIOUS]
>>
>> Hi Dave,
>>
>> Those are some great thoughts. Being an apache project I am not sure
>> how far we can move from svn, but there may be a way. You are not
>> the first to voice this desire for an active github repo and I'm sure
>> that you won't be the last.
>>
>> I completely agree with your discussion board preference. Do you
>> have any recommendations?
>>
>> You make a great point regarding documentation. In reference to
>> things that anybody can quickly contribute ... that would be a big one.
>> Volunteers?!?
>>
>> I am really happy to hear that you want to contribute - more than you
>> already have, which is actually quite a bit!
>>
>> Cheers,
>> Sean
>>
>> -----Original Message-----
>> From: David Kincaid [mailto:kincaid.d...@gmail.com]
>> Sent: Saturday, November 18, 2017 1:10 PM
>> To: dev@ctakes.apache.org
>> Subject: Re: Contribute to ctakes: it is in your best interests! RE: unknown dependencies [EXTERNAL] [SUSPICIOUS]
>>
>> Sean, I can share a couple things that have been an obstacle for me.
>> It may seem a minor point to some, but I left Subversion behind years
>> ago and really have no desire to go back. If the project were moved
>> over to Git/Github it would really smooth the way for me at least. I
>> would be happy to help out with this. One of the other things I would
>> really like to see is the mailing list moved onto a discussion board
>> platform. It seems to me that a discussion board style of tool tends
>> to create a more active community than a mailing list does.
>>
>> The other thing that might help get new people involved is making it
>> easier to find information about the development environment. Things
>> like branching strategies, coding conventions, etc are really hard to
>> find from the main cTAKES web site. I saw some references to Jenkins
>> builds recently on the list. I had no idea there was a Jenkins CI
>> server for the project somewhere. It also takes some digging to find
>> a link to Jira. Maybe we could create a Wiki page that describes
>> where all these tools are and how they are used.
>>
>> You guys have really done some great work over the last couple of
>> years cleaning up the code base and improving the documentation by a
>> ton. Things like the fast dictionary annotator, dictionary creator
>> GUI are a great addition and make it a lot easier for other people to
>> get up and running more quickly. As I'm ramping up my research as
>> well as some proof of concept stuff at work I'll be working more and
>> more with cTAKES and would love to contribute more to the project.
>>
>> Just my thoughts.
>>
>> - Dave
>>
>>
>> On Sat, Nov 18, 2017 at 11:10 AM, Finan, Sean <sean.fi...@childrens.harvard.edu> wrote:
>>
>>> Hi Tim, Alex,
>>>
>>> Great ideas. I like your (Tim) idea to
>>> 1. start with commented code removal.
>>> Then maybe move on to
>>> 2. sanity-test type unit tests - Little two or three-line "does this
>>> method crack" tests.
>>> And another that is simply
>>> 3. "populate a test cas with type(s) X" and a factory with
>>> "getSectionTestCas" "getSentenceTestCas" "getPosTestCas" "getChunkTestCas"
>>> ... just really simple reusables for tests.
>>> Then
>>> 4. refactor to extract and consolidate duplicate code - it is all
>>> over the place ...
>>>
>>> These are just my initial thoughts and suggestions, but I think that those
>>> 4 tasks can be performed by anybody of any experience level. They build
>>> upon each other and should help the implementers better understand ctakes.
>>> After that the sky is the limit.
>>>
>>> A couple of years ago I sat on a panel at a workshop for open source
>>> scientific software. For the half dozen or so highlighted projects
>>> (ctakes was one!) the common thread was that getting people to
>>> contribute is extremely difficult.
>>> I have a tendency to assume that people always act in their best
>>> interests. Any student thinking of going towards industry should be
>>> jumping at the opportunity to contribute to a large,
>>> production-quality project. They should also realize that
>>> contribution means potential recommendation (and possibly hiring
>>> interest) by established developers, physicians and researchers that
>>> use ctakes. Even just answering questions on a user or dev list
>>> creates credibility and can build a network.
>>> Active researchers could discover common thoughts and directions
>>> that could lead to collaboration outside ctakes. Researchers and
>>> companies trying to build upon open source should realize that
>>> direct contribution is easier than custom substitution. Plus, it is
>>> in their best interests that code does what they need it to do in
>>> the fastest, lightest, most stable way possible.
>>> With a project like ctakes there are a lot of things that can be
>>> done, there are great opportunities to really shine. "I wrote this
>>> tool for my thesis that performs some nlp task" sounds good.
>>> Appending "in an Apache product and it has been taken up by thousands
>>> across the globe" makes it sound a lot better.
>>> At my previous job in industry the company actively contributed to
>>> several open source projects. We had a few people for whom that was
>>> 50% of their job. Why? Because we made a commitment to use that
>>> open source software.
>>> It was a better use of our resources to contribute to it, improve it
>>> and keep its momentum going and prevent it from becoming stale (or
>>> abandoned) while our software continued to move forward.
>>>
>>> Hmm, that was a touch more than I had planned to write. A whole cup
>>> of coffee in that one.
>>>
>>> Sean
>>>
>>>
>>> -----Original Message-----
>>> From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu]
>>> Sent: Saturday, November 18, 2017 8:13 AM
>>> To: dev@ctakes.apache.org
>>> Subject: Re: unknown dependencies [EXTERNAL] [SUSPICIOUS]
>>>
>>> Thanks Alex, looks like that was probably a fat-fingered auto-import
>>> on my part.
>>>
>>> I like your idea, and I don't know the best way to start either,
>>> but maybe one suggestion is to start with one or two focused things
>>> to clean up, and then ask for volunteers to take on specific modules?
>>> Then people can contribute an hour here and there to do cleanup on
>>> their task/module and try to fix that thing in a 1-2-month long
>>> sprint. I am happy to contribute to cleanup, I am responsible for my
>>> fair share of unclean code, but since I don't have strong software
>>> engineering chops it would be good to have people with that
>>> background propose the tasks and describe exactly what needs to be
>>> done. My idea of cleaning is just to delete commented out sections of
>>> evaluation code.
>>>
>>> Tim
>>>
>>> ________________________________________
>>> From: Alexandru Zbarcea <al...@apache.org>
>>> Sent: Friday, November 17, 2017 4:46 PM
>>> To: Apache cTAKES Dev
>>> Subject: unknown dependencies [EXTERNAL]
>>>
>>> Hi,
>>>
>>> I notice that a mis-dependency has slipped into the code:
>>> jdk.internal.org.objectweb.asm.commons.AnalyzerAdapter;
>>>
>>> Now that the Jenkins build is successful, I think it is easier to
>>> clean up the code. I would like this to be a common effort. I don't know
>>> the best way to approach this.
>>>
>>> Looking forward to your advice,
>>> Alex
>>>
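
For anyone picking up the "remove hardcoded paths like /tmp" item from this thread, here is a rough sketch of the "use temp files" suggestion. It assumes only the JDK (Java 7 or later). The class name TempOutput, the method writeToTempFile and the "ctakes-demo-" prefix are made up for illustration and are not part of cTAKES.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of cTAKES: a temp file stands in for a hardcoded "/tmp/..." path.
public class TempOutput {

    // Writes the text to a new file in the platform's default temp directory and returns its path.
    public static Path writeToTempFile(String text) throws IOException {
        // The JDK resolves the directory from java.io.tmpdir, so no OS-specific path appears in code.
        Path file = Files.createTempFile("ctakes-demo-", ".txt");
        Files.write(file, text.getBytes(StandardCharsets.UTF_8));
        return file;
    }

    public static void main(String[] args) throws IOException {
        Path file = writeToTempFile("sample note text");
        System.out.println("Wrote " + Files.size(file) + " bytes to " + file);
        // Clean up so repeated runs leave nothing behind.
        Files.deleteIfExists(file);
    }
}
```

The same call works unchanged on Linux, macOS and Windows, which is the main point of dropping "/tmp" and "C:/Users/..." literals. Whether a given cTAKES component should use a temp file or a configurable path is a separate decision.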