Re: Next cTAKES release (3.1)?

John Green Wed, 03 Jul 2013 17:00:04 -0700

I see. It's a pretty random collection of formats.


On Jul 3, 2013, at 18:25, andy mcmurry <[email protected]> wrote:

> Mtsamples has lots of free public examples already but we aren't using them
> yet.  This is probably because mtsamples don't have the annotations we need
> to use them as training examples.
> On Jul 3, 2013 2:46 PM, "Hephaestus Studio" <[email protected]>
> wrote:
> 
>> @Andy - Not a doctor yet, but soon! Thanks for the promotion though, one
>> more year!
>> 
>> - Apropos meds or clinical type questions: any developer on here can feel
>> free to shoot me a quick question via the list anytime; I'd be happy to
>> confirm that a drug or anything else makes sense given a particular
>> clinical/note context.
>> 
>> - "I wonder if there is someway in which you could guide us in making
>> better use of the medical knowledge sources (ontologies) that are
>> available." - I'd be happy to brainstorm about using existing resources to
>> help in decision making. We use these all the time in the clinic.
>> 
>> @ Tim+Andy+Chen - I haven't had a chance to really start chewing into the
>> code, though I hope to over the next year; so, what kind of examples would
>> be most helpful?
>>    - Any particular disease processes?
>>    - Are you all familiar with the ubiquitous SOAP style presentation
>> that doctors use to write free notes? The few examples I clicked through in
>> the repository that Chen pointed me to are very sparse. Would we want
>> gradations? E.g., a scale from "well done" notes to "very quick
>> I-don't-care-because-I'm-in-a-rush" notes?
>> 
>> @ Chen - Thank you for the kind words. It's nice to be welcomed by a
>> community in which you hope to integrate. And thank you for pointing me to
>> the directory with the current sample notes. This was very helpful in
>> determining where those are in their development. I know that each of
>> your hospitals has a wealth of HIPAA-closed notes, but I'll see what I can
>> do to make some "stereotypical" open-notes for common disease
>> presentations. Again: maybe a scale, not necessarily just on brevity but
>> on some other metric along a continuum, maybe of difficulty in processing?
>> Apropos code,
>> Chen: I will help where I can but where I want to be is elbow deep in the
>> code :)
>> 
>> Finally: I haven't had a chance to look into some of the links from
>> earlier in this thread regarding open access repositories of free text
>> clinical notes: what do you all feel the quality of these resources is?
>> Abundant but low quality? Scarce but high quality?
>> 
>> Bottom line: no problem either answering contextual questions (can afib be
>> associated with a lower GI bleed??) or writing some notes. My only
>> question, before I put in any time, would be: what disease/specialty domain?
>> And would we want some system that puts them on a continuum of some
>> variable, say, brevity or "readability"?
>> 
>> Just thinking before leaping,
>> 
>> Thanks,
>> JG
>> 
>> Sent from my iPhone
>> 
>> On Jul 2, 2013, at 21:23, "Chen, Pei" <[email protected]>
>> wrote:
>> 
>>> Hi John,
>>> Welcome!  There are actually many ways to contribute and it's not limited
>>> to just code.  It's always great to hear new ideas and suggestions on how
>>> to improve the software.  Therefore, even things like user feedback,
>>> documentation, new use cases, essentially anything that will make things
>>> better would be awesome!
>>> 
>>> To get started, I would suggest subscribing to the email lists.  If you
>>> would like to contribute anything, just create a Jira account (anyone
>>> should be able to do this), and add/review Jira items (add attachments if
>>> you like) and we can even help integrate it.
>>> 
>>> We normally use Jira to keep track of issues:
>>> [1] https://issues.apache.org/jira/browse/ctakes
>>> 
>>> Current collection of sample test notes that have been collected over
>>> the years:
>>> https://svn.apache.org/repos/asf/ctakes/trunk/ctakes-regression-test/testdata/input/plaintext/
>>> 
>>> ________________________________________
>>> From: Tim Miller [[email protected]]
>>> Sent: Tuesday, July 02, 2013 6:31 PM
>>> To: [email protected]
>>> Subject: Re: Next cTAKES release (3.1)?
>>> 
>>> Agreed that you could definitely help out, and that would be a great way
>>> to do so. We don't really have "examples" right now, more like just
>>> short test sentences for showing simple results and verifying that
>>> nothing has been broken by changes. I think regular length fake but
>>> realistic notes would be very useful.
>>> Tim
>>> 
>>> On 07/02/2013 05:19 PM, John Green wrote:
>>>> Hi all,
>>>> 
>>>> I've been following this mail list for a couple of months. I'm a third
>>>> year medical student rounding the bend toward my MD. I used to be a
>>>> computer programmer, however, and continue my own projects. I'm very
>>>> interested in contributing eventually to cTAKES development. In the
>>>> meantime, given the current talk of examples, if any domain-specific
>>>> examples need to be generated, I am knowledgeable enough in the domain
>>>> that I could pound out a few free text notes made to order.
>>>> 
>>>> Let me know, you all may already have docs on hand willing to do this,
>>>> but if not...
>>>> 
>>>> John Green
>>>> 
>>>> Sent from my iPhone
>>>> 
>>>> On Jun 28, 2013, at 8:59, "Chen, Pei" <[email protected]>
>> wrote:
>>>> 
>>>>> I completely agree with making cTAKES easier to use.  I think it is
>>>>> exciting to hear the different use cases here and to understand where
>>>>> some of the areas that need improvement are (which we haven't thought
>>>>> about earlier).
>>>>> I think Tim's suggestions and the 3 concrete actionable items make a lot
>>>>> of sense.  Hopefully they will attract new users, adopters, and perhaps
>>>>> more committers.
>>>>> 
>>>>>> i) Make the typesystem forefront in documentation -- generate javadocs
>>>>>> and have as a link on the ctakes frontpage/sidebar
>>>>>> ii) Similar to the way that we are aiming to have tests in every module,
>>>>>> also have clearly labeled examples in every module that set up a
>>>>>> pipeline, run on sample notes (could be the same sample notes from the
>>>>>> tests), and do something with the results.
>>>>>> iii) Follow Giri's recommendation to have example training data for
>>>>>> people who want to take the next step and train their own models
>>>>> I think Java developers are accustomed to including a library as a
>>>>> dependency/jar, having an API to pass input, and getting the results via
>>>>> POJOs, so the examples could initially shield the complexity of wiring a
>>>>> pipeline together etc.
>>>>> If we can improve the APIs and how they get integrated with other apps,
>>>>> we can add any GUI/CLI tools on top of this afterwards.
>>>>> 
>>>>> --Pei
>>>>> 
>>>>>> -----Original Message-----
>>>>>> From: Miller, Timothy [mailto:[email protected]]
>>>>>> Sent: Friday, June 28, 2013 8:00 AM
>>>>>> To: [email protected]
>>>>>> Subject: Re: Next cTAKES release (3.1)?
>>>>>> 
>>>>>> Very interesting discussion. I think Giri is right about giving example
>>>>>> training data in the format that our training code can read. While our
>>>>>> ultimate goal would be to build and release models that are completely
>>>>>> domain-independent, in the real world it is almost always better to use
>>>>>> some domain-specific data and we should think more about how to
>>>>>> facilitate that.
>>>>>> 
>>>>>> As for making it easier to get started, it is not totally clear to me
>>>>>> what this means or how to do it, so it might be useful to get specific.
>>>>>> I think our biggest hurdle is
>>>>>> 
>>>>>> 1) Prerequisite of understanding UIMA/UIMAFit
>>>>>> 
>>>>>> Since UIMAFit is officially becoming part of UIMA that will be easier,
>>>>>> and hopefully people will just learn the easier (in my opinion) UIMAFit
>>>>>> way rather than the standard UIMA way of doing things. Is there
>>>>>> something we can be doing to make understanding UIMA easier? Or do we
>>>>>> just need to say upfront that this is a prerequisite and hope that
>>>>>> people don't give up due to this thing that is out of our control?
>>>>>> 
>>>>>> Another hurdle is:
>>>>>> 
>>>>>> 2) cTAKES is a multi-purpose developer-aimed tool
>>>>>> 
>>>>>> So it's not just a matter of hiding complexity -- at some point people
>>>>>> have to understand their problem, understand cTAKES' capabilities, and
>>>>>> start coding. Pei's GUI will help for some common use cases but will not
>>>>>> remove the requirement that someone at the organization knows cTAKES.
>>>>>> I think one part of this problem is the fact that the typesystem is not
>>>>>> well documented. A developer needs to know what the output is (objects
>>>>>> from the typesystem), how to get them (which modules/pipelines), and what
>>>>>> information is in them. So maybe on this end my recommendation would be:
>>>>>> i) Make the typesystem forefront in documentation -- generate javadocs
>>>>>> and have as a link on the ctakes frontpage/sidebar
>>>>>> ii) Similar to the way that we are aiming to have tests in every module,
>>>>>> also have clearly labeled examples in every module that set up a
>>>>>> pipeline, run on sample notes (could be the same sample notes from the
>>>>>> tests), and do something with the results.
>>>>>> iii) Follow Giri's recommendation to have example training data for
>>>>>> people who want to take the next step and train their own models
>>>>>> 
>>>>>> This is quite a bit of developer overhead, so it's worth asking whether
>>>>>> you agree with my "diagnosis" and "treatment" or whether you think there
>>>>>> are different problems/solutions that should be higher priority.
>>>>>> 
>>>>>> Tim
>>>>>> 
>>>>>> On 06/27/2013 10:59 PM, Girivaraprasad Nambari wrote:
>>>>>>> Hi Vijay and Andy,
>>>>>>> 
>>>>>>> Thanks for sharing those examples.
>>>>>>> 
>>>>>>> "Trouble is, privacy requires that these examples be made up by hand"
>>>>>>> 
>>>>>>> I agree with this statement, and this is a very valid concern.
>>>>>>> 
>>>>>>> In "getting started examples", I think we should just have a couple of
>>>>>>> entries (5-10 small entries), not more than that (with an explicit
>>>>>>> statement like "ONLY EXAMPLE, NOT GOOD FOR REAL USAGE"). I understand
>>>>>>> handcrafting these may not be easy because we are not medical domain
>>>>>>> experts, but I feel it is worth the time, because it brings in more of
>>>>>>> a user community.
>>>>>>> 
>>>>>>> Thank you,
>>>>>>> Giri
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Thu, Jun 27, 2013 at 10:25 PM, Andy McMurry
>>>>>>> <[email protected]> wrote:
>>>>>>>> GREAT !
>>>>>>>> 
>>>>>>>> The i2b2 data though isn't publicly distributable, you still need to
>>>>>>>> request access to it since it is "semi private"
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Jun 27, 2013, at 9:52 PM, vijay garla <[email protected]> wrote:
>>>>>>>> 
>>>>>>>>> We released code on using cTAKES to annotate clinical text and SVMs
>>>>>>>>> that use the annotations to classify clinical text from the CMC 2007
>>>>>>>>> and i2b2 2008 challenges:
>>>>>>>>> 
>>>>>>>>> We did the CMC 2007 with cTAKES 2.5:
>>>>>>>>> https://code.google.com/p/ytex/wiki/WordSenseDisambiguation_V08#Reproducing_results_on_CMC_2007_challenge
>>>>>>>>> <https://code.google.com/p/ytex/downloads/list>
>>>>>>>>> And the i2b2 2008 with the version of cTAKES distributed with the
>>>>>>>>> first version of ARC:
>>>>>>>>> https://code.google.com/p/ytex/wiki/FeatEng_V05#i2b2_2008
>>>>>>>>> 
>>>>>>>>> These are both publicly available datasets, and represent real-world
>>>>>>>>> problems (in general I believe when publishing a paper the code
>>>>>>>>> should be reproducible and made publicly available, but that's a
>>>>>>>>> different issue).
>>>>>>>>> When we get around to upgrading YTEX to cTAKES 3.1, we would like to
>>>>>>>>> upgrade these samples as well.
>>>>>>>>> 
>>>>>>>>> Best,
>>>>>>>>> 
>>>>>>>>> VJ
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Thu, Jun 27, 2013 at 8:32 PM, Andy McMurry
>>>>>>>>> <[email protected]> wrote:
>>>>>>>>> 
>>>>>>>>>> +1 suggestion for documenting many examples of "getting started" NLP
>>>>>>>>>> datasets.
>>>>>>>>>> 
>>>>>>>>>> I have at least one we can use that was created by our lead
>>>>>>>>>> Pathologist
>>>>>>>>>> https://open.med.harvard.edu/svn/scrubber/releases/3.0/data/input/cases/train/traincase.xml
>>>>>>>>>> We should provide at least one sample for each domain.
>>>>>>>>>> Trouble is, privacy requires that these examples be made up by hand
>>>>>>>>>> and not copy-pasted from EMR systems.
>>>>>>>>>> 
>>>>>>>>>> --Andy
>>>>>>>>>> 
>>>>>>>>>> On Jun 27, 2013, at 5:32 PM, Girivaraprasad Nambari <
>>>>>>>> [email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> +1 for this observation Andy!
>>>>>>>>>>> 
>>>>>>>>>>> Lowering that time will motivate users to write blogs about features,
>>>>>>>>>>> how-tos, etc., which reduces the core team's documentation workload.
>>>>>>>>>>> 
>>>>>>>>>>> I have been trying to write a small "how to write a standalone client
>>>>>>>>>>> for ctakes" from my experience (I saw at least 4 users post similar
>>>>>>>>>>> questions in the last 2 months), but I am not getting enough time:
>>>>>>>>>>> ctakes depends on a lot of other frameworks (UimaFit, cleartk, UIMA
>>>>>>>>>>> Framework, etc.), so most of my spare time is spent juggling between
>>>>>>>>>>> these frameworks, posting on and browsing those forums, and relating
>>>>>>>>>>> observations to ctakes code. I think we need to have some high-level
>>>>>>>>>>> documentation about these (with links to the corresponding forums).
>>>>>>>>>>> 
>>>>>>>>>>> The above case is for developers (I think this will be the larger user
>>>>>>>>>>> base as ctakes progresses); for users I think the documentation is a
>>>>>>>>>>> lot better, though some improvements are still needed.
>>>>>>>>>>> 
>>>>>>>>>>> As a developer I found the lack of sample training data tough (I am
>>>>>>>>>>> still struggling in this area even though I browsed all the relevant
>>>>>>>>>>> code), though the training classes are there. I understand that there
>>>>>>>>>>> are licensing issues with REAL data, but at least some hand-made
>>>>>>>>>>> example sentences, which may not be real, would help developers
>>>>>>>>>>> understand the type/structure of input the TRAINING classes expect.
>>>>>>>>>>> This way people who browse the code can reverse engineer and develop
>>>>>>>>>>> their own models. Sorry if you guys feel this is a novice issue, but I
>>>>>>>>>>> feel most developers will be novices when they adopt a system, and
>>>>>>>>>>> Machine Learning/NLP is an ocean. Some documentation in this area will
>>>>>>>>>>> save a lot of time for us.
>>>>>>>>>>> 
>>>>>>>>>>> I hope there will be some activity in this area from the ctakes core
>>>>>>>>>>> team.
>>>>>>>>>>> 
>>>>>>>>>>> Thank you,
>>>>>>>>>>> Giri
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Thu, Jun 27, 2013 at 5:11 PM, Andy McMurry
>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> ctakes is at a point where we have a LOT of features but it is still
>>>>>>>>>>>> hard to get started.
>>>>>>>>>>>>
>>>>>>>>>>>> Judging from the mailing lists a lot of how cTakes works is not
>>>>>>>>>>>> obvious and requires hand holding.
>>>>>>>>>>>> This is very typical in early FOSS projects.
>>>>>>>>>>>>
>>>>>>>>>>>> Lowering the time to get invested in ctakes gets more users AND
>>>>>>>>>>>> better bug reports, FAQ, etc.
>>>>>>>>>>>> 
>>>>>>>>>>>> thoughts?
>>>>>>>>>>>> --Andy
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On Apr 11, 2013, at 8:55 PM, "Chen, Pei" <
>>>>>>>>>> [email protected]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>> I just wanted to gauge the interest of creating the next release of
>>>>>>>>>>>>> cTAKES (3.1), which is currently marked for May in Jira.
>>>>>>>>>>>>> There have already been 22/53 issues [1] marked as fixed or closed.
>>>>>>>>>>>>> Plenty of bug fixes and new components including:
>>>>>>>>>>>>> - New CEM Instance Template population
>>>>>>>>>>>>> - New Dependency Parser/Semantic Role Labeler
>>>>>>>>>>>>> - New optional Clear POSTagger
>>>>>>>>>>>>> - New regression testing component
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Should we wait for the Temporal component?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> [1]
>>>>>>>>>>>>> https://issues.apache.org/jira/issues/?jql=fixVersion%20%3D%20%223.1%22%20AND%20project%20%3D%20CTAKES
>>
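
Pei's point in the thread above, that Java developers expect to pass input to
an API and get results back as POJOs without wiring a pipeline themselves, can
be illustrated with a minimal sketch. Everything in it is hypothetical:
NoteAnnotator, Mention, and DictionaryAnnotator are illustrative names, not
cTAKES or UIMA classes, and the case-insensitive dictionary lookup is only a
toy stand-in for a real pipeline. The sample sentence borrows the afib / lower
GI bleed question from John's message.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Plain result object (POJO) for one mention. Hypothetical, not a cTAKES type. */
final class Mention {
    final String text;      // the covered text, exactly as it appears in the note
    final String category;  // illustrative label such as "DISORDER"
    final int begin;        // start offset in the note (inclusive)
    final int end;          // end offset in the note (exclusive)

    Mention(String text, String category, int begin, int end) {
        this.text = text;
        this.category = category;
        this.begin = begin;
        this.end = end;
    }

    @Override
    public String toString() {
        return category + "[" + begin + "," + end + "): " + text;
    }
}

/** Hypothetical facade: callers pass note text in and get POJOs back. */
interface NoteAnnotator {
    List<Mention> annotate(String noteText);
}

/** Toy stand-in for a real pipeline: plain substring lookup, no word boundaries. */
final class DictionaryAnnotator implements NoteAnnotator {
    private final Map<String, String> terms = new LinkedHashMap<>();

    DictionaryAnnotator add(String term, String category) {
        if (term.isEmpty()) {
            throw new IllegalArgumentException("term must not be empty");
        }
        terms.put(term.toLowerCase(Locale.ROOT), category);
        return this;
    }

    @Override
    public List<Mention> annotate(String noteText) {
        List<Mention> found = new ArrayList<>();
        // Results are grouped by term (in insertion order), then by position.
        for (Map.Entry<String, String> entry : terms.entrySet()) {
            String term = entry.getKey();
            int i = 0;
            while (i + term.length() <= noteText.length()) {
                // regionMatches with ignoreCase=true keeps offsets in the original text.
                if (noteText.regionMatches(true, i, term, 0, term.length())) {
                    int end = i + term.length();
                    found.add(new Mention(noteText.substring(i, end), entry.getValue(), i, end));
                    i = end;
                } else {
                    i++;
                }
            }
        }
        return found;
    }
}

final class AnnotatorDemo {
    public static void main(String[] args) {
        NoteAnnotator annotator = new DictionaryAnnotator()
                .add("afib", "DISORDER")
                .add("warfarin", "MEDICATION")
                .add("lower gi bleed", "DISORDER");
        String note = "Pt with afib on warfarin, now with lower GI bleed.";
        for (Mention mention : annotator.annotate(note)) {
            System.out.println(mention);
        }
    }
}
```

The caller only ever sees NoteAnnotator and Mention, so a real implementation
backed by an actual pipeline could be swapped in behind the same interface
without changing calling code; that is the "shield the complexity" idea, not a
description of any existing cTAKES API.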
