Re: Next cTAKES release (3.1)?

Andy McMurry Fri, 28 Jun 2013 07:31:45 -0700

+1 cTAKES IS domain specific
+1 UIMAFit should become a part of UIMA and not the focus of ctakes-dev

At first glance, people should think of cTAKES as the "UIMA medical text
library".

Here are examples that I know users are interested in. 

Suggestions: 

1. ctakes DRUG PROFILE 
http://www.mtsamples.com/site/pages/sample.asp?Type=6-Cardiovascular&Sample=775-H%26P+-+Cardio+(Angina)

2. ctakes NER : 
http://www.mtsamples.com/site/pages/sample.asp?Type=77-rheumatology&Sample=790-Rheumatoid+Arthritis+-+H%26P

3. ctakes SMOKING: 
http://www.mtsamples.com/site/pages/sample.asp?Type=6-Cardiovascular%20/%20Pulmonary&Sample=571-Trouble%20breathing

4. ctakes Lexical features (PoS, sentence boundaries, etc) 
http://www.medicaltranscriptionsamples.com/diabetes-mellitus-followup/







> Very interesting discussion. I think Giri is right about giving example
> training data in the format that our training code can read. While our
> ultimate goal would be to build and release models that are completely
> domain-independent, in the real world it is almost always better to use
> some domain-specific data and we should think more about how to
> facilitate that.
> 
> As for making it easier to get started, it is not totally clear to me
> what this means or how to do it, so it might be useful to get specific.
> I think our biggest hurdle is:
> 
> 1) Prerequisite of understanding UIMA/UIMAFit
> 
> Since UIMAFit is officially becoming part of UIMA that will be easier,
> and hopefully people will just learn the easier (in my opinion) UIMAFit
> way than the standard UIMA way of doing things. Is there something we
> can be doing to make understanding UIMA easier? Or do we just need to
> say upfront that this is a prerequisite and hope that people don't give
> up due to this thing that is out of our control?
> 
> Another hurdle is:
> 
> 2) cTAKES is a multi-purpose developer-aimed tool
> 
> So it's not just a matter of hiding complexity -- at some point people
> have to understand their problem, understand cTAKES' capabilities, and
> start coding. Pei's GUI will help for some common use cases but will not
> remove the requirement that someone at the organization knows cTAKES.
> I think one part of this problem is the fact that the typesystem is not
> well documented. A developer needs to know what the output is (objects
> from the typesystem), how to get them (which modules/pipelines), and
> what information is in them. So maybe on this end my recommendation
> would be:
> i) Make the typesystem forefront in documentation -- generate javadocs
> and have as a link on the ctakes frontpage/sidebar
> ii) Similar to the way that we are aiming to have tests in every module,
> also have clearly labeled examples in every module that set up a
> pipeline, run on sample notes (could be the same sample notes from the
> tests), and do something with the results.
> iii) Follow Giri's recommendation to have example training data for
> people who want to take the next step and train their own models
> 
> This is quite a bit of developer overhead, so it's worth asking whether
> you agree with my "diagnosis" and "treatment" or whether you think there
> are different problems/solutions that should be higher priority.
> 
> Tim
> 
> On 06/27/2013 10:59 PM, Girivaraprasad Nambari wrote:
>> Hi Vijay and Andy,
>> 
>> Thanks for sharing those examples.
>> 
>> "Trouble is, privacy requires that these examples be made up by hand"
>> 
>> I agree with this statement; it is a very valid concern.
>> 
>> In "getting started examples", I think we should have just a couple of
>> entries (5-10 small entries), not more than that (with an explicit statement
>> like "ONLY EXAMPLE, NOT GOOD FOR REAL USAGE"). I understand handcrafting
>> these may not be easy because we are not medical domain experts, but I feel
>> it is worth the time, because it brings in more of the user community.
>> 
>> Thank you,
>> Giri
>> 
>> 
>> 
>> 
>> 
>> On Thu, Jun 27, 2013 at 10:25 PM, Andy McMurry <[email protected]> wrote:
>> 
>>> GREAT !
>>> 
>>> The i2b2 data isn't publicly distributable, though; you still need to
>>> request access to it since it is "semi private".
>>> 
>>> 
>>> On Jun 27, 2013, at 9:52 PM, vijay garla <[email protected]> wrote:
>>> 
>>>> We released code on using cTAKES to annotate clinical text and SVMs that
>>>> use the annotations to classify clinical text from the CMC 2007 and I2B2
>>>> 2008 challenges:
>>>> 
>>>> We did the CMC 2007 with cTAKES 2.5:
>>>> https://code.google.com/p/ytex/wiki/WordSenseDisambiguation_V08#Reproducing_results_on_CMC_2007_challenge
>>>> <https://code.google.com/p/ytex/downloads/list>
>>>> 
>>>> And the i2b2 2008 with the version of cTAKES distributed with the first
>>>> version of ARC:
>>>> https://code.google.com/p/ytex/wiki/FeatEng_V05#i2b2_2008
>>>> 
>>>> These are both publicly available datasets, and represent real-world
>>>> problems (in general I believe when publishing a paper the code should be
>>>> reproducible and made publicly available, but that's a different issue).
>>>> 
>>>> When we get around to upgrading YTEX to cTAKES 3.1, we would like to
>>>> upgrade these samples as well.
>>>> 
>>>> Best,
>>>> 
>>>> VJ
>>>> 
>>>> 
>>>> 
>>>> On Thu, Jun 27, 2013 at 8:32 PM, Andy McMurry <[email protected]>
>>>> wrote:
>>>> 
>>>>> +1 suggestion for documenting many examples of "getting started" NLP
>>>>> datasets.
>>>>> 
>>>>> I have at least one we can use that was created by our lead Pathologist
>>>>> 
>>>>> https://open.med.harvard.edu/svn/scrubber/releases/3.0/data/input/cases/train/traincase.xml
>>>>> We should provide at least one sample for each domain.
>>>>> Trouble is, privacy requires that these examples be made up by hand and
>>>>> not copy-pasted from EMR systems.
>>>>> 
>>>>> --Andy
>>>>> 
>>>>> On Jun 27, 2013, at 5:32 PM, Girivaraprasad Nambari <[email protected]>
>>>>> wrote:
>>>>> 
>>>>>> +1 for this observation Andy!
>>>>>> 
>>>>>> Lowering the time to get invested will motivate users to write blogs
>>>>>> about features, how-tos, etc., which reduces the core team's workload
>>>>>> on documentation.
>>>>>> 
>>>>>> I have been trying to write a small "how to write a standalone client for
>>>>>> ctakes" from my experience (I saw at least 4 users post similar questions
>>>>>> in the last 2 months), but I am not getting enough time because ctakes
>>>>>> depends on a lot of other frameworks (UimaFit, cleartk, UIMA Framework,
>>>>>> etc.). Most of my spare time is being spent juggling between these
>>>>>> frameworks, posting and browsing those forums, and relating observations
>>>>>> to ctakes code. I think we need to have some high-level documentation
>>>>>> about these (with links to the corresponding forums).
>>>>>> 
>>>>>> The above case is for developers (I think this will be the larger user
>>>>>> base as ctakes progresses); for users I think the documentation is a lot
>>>>>> better, though some improvements need to be made.
>>>>>> 
>>>>>> As a developer I struggled with the lack of sample training data (I am
>>>>>> still struggling in this area even though I browsed all the relevant
>>>>>> code), though the training classes are there. I understand that there are
>>>>>> licensing issues with REAL data, but we could at least have some handmade
>>>>>> example sentences, which may not be real but help developers understand
>>>>>> the type/structure of input the TRAINING classes expect. This way people
>>>>>> who browse the code can reverse engineer and develop their own models.
>>>>>> Sorry if you guys feel this is a novice issue, but I feel most developers
>>>>>> will be novices when they adopt a system, and Machine Learning/NLP is an
>>>>>> ocean. Some documentation in this area will save a lot of time for us.
>>>>>> 
>>>>>> I hope there will be some activity in this area from the ctakes core team.
>>>>>> 
>>>>>> Thank you,
>>>>>> Giri
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Thu, Jun 27, 2013 at 5:11 PM, Andy McMurry <[email protected]>
>>>>>> wrote:
>>>>>> 
>>>>>>> ctakes is at a point where we have a LOT of features but it is still
>>>>>>> hard to get started.
>>>>>>> 
>>>>>>> Judging from the mailing lists a lot of how cTakes works is not
>>>>>>> obvious and requires hand holding.
>>>>>>> This is very typical in early FOSS projects.
>>>>>>> 
>>>>>>> Lowering the time to get invested in ctakes gets more users AND better
>>>>>>> bug reports, FAQ, etc.
>>>>>>> 
>>>>>>> thoughts?
>>>>>>> --Andy
>>>>>>> 
>>>>>>> 
>>>>>>> On Apr 11, 2013, at 8:55 PM, "Chen, Pei" <[email protected]>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> Hi,
>>>>>>>> I just wanted to gauge the interest of creating the next release of
>>>>>>>> cTAKES (3.1) which is currently marked for May in Jira.
>>>>>>>> There have already been 22/53 issues [1] marked as fixed or closed.
>>>>>>>> Plenty of bug fixes and new components including:
>>>>>>>> - New CEM Instance Template population
>>>>>>>> - New Dependency Parser/Semantic Role Labeler
>>>>>>>> - New optional Clear POSTagger
>>>>>>>> - New regression testing component
>>>>>>>> 
>>>>>>>> Should we wait for the Temporal component?
>>>>>>>> 
>>>>>>>> [1] https://issues.apache.org/jira/issues/?jql=fixVersion%20%3D%20%223.1%22%20AND%20project%20%3D%20CTAKES
>>>>>>> 
>>>>> 
>>> 
>
