We have 5-6 clinical notes that we got from the web (=publicly available to anyone). We can include them as samples in the 3.1 release. We have been using these notes for demo purposes. --Guergana
-----Original Message-----
From: Andy McMurry [mailto:[email protected]]
Sent: Friday, June 28, 2013 10:15 AM
To: [email protected]
Subject: Re: Next cTAKES release (3.1)?

iDash and others have medical NLP datasets that could be used for ctakes "Getting Started" examples:
http://idash.ucsd.edu/nlp-and-data-modeling
http://idash.ucsd.edu/nlp/umls-vm

the GOOD: iDash already includes ctakes
the BAD: iDash references old versions of ctakes and points to cabig (which is now defunct)

Recommendation: we should talk to iDash, create "hello medical world" training examples, and request that iDash point to the cTakes Apache home page.

Disclaimer: I'm not involved with iDash

On Jun 27, 2013, at 10:58 PM, Girivaraprasad Nambari <[email protected]> wrote:

> Hi Vijay and Andy,
>
> Thanks for sharing those examples.
>
> "Trouble is, privacy requires that these examples be made up by hand"
>
> Agree with this statement and this is a very valid concern.
>
> In "getting started examples", I think we should just have a couple of
> entries (5-10 small entries), not more than that (with an explicit
> statement like "ONLY EXAMPLE, NOT GOOD FOR REAL USAGE"). I understand
> handcrafting these may not be easy because we are not medical domain
> experts, but I feel it is worth the time, because it brings in more of
> the user community.
>
> Thank you,
> Giri
>
> On Thu, Jun 27, 2013 at 10:25 PM, Andy McMurry <[email protected]> wrote:
>
>> GREAT!
>>
>> The i2b2 data though isn't publicly distributable; you still need to
>> request access to it since it is "semi private"
>>
>> On Jun 27, 2013, at 9:52 PM, vijay garla <[email protected]> wrote:
>>
>>> We released code on using cTAKES to annotate clinical text and SVMs
>>> that use the annotations to classify clinical text from the CMC 2007
>>> and I2B2 2008 challenges:
>>>
>>> We did the CMC 2007 with cTAKES 2.5:
>>> https://code.google.com/p/ytex/wiki/WordSenseDisambiguation_V08#Reproducing_results_on_CMC_2007_challenge
>>> <https://code.google.com/p/ytex/downloads/list>
>>>
>>> And the i2b2 2008 with the version of cTAKES distributed with the
>>> first version of ARC:
>>> https://code.google.com/p/ytex/wiki/FeatEng_V05#i2b2_2008
>>>
>>> These are both publicly available datasets, and represent real-world
>>> problems (in general I believe when publishing a paper the code
>>> should be reproducible and made publicly available, but that's a
>>> different issue).
>>>
>>> When we get around to upgrading YTEX to cTAKES 3.1, we would like to
>>> upgrade these samples as well.
>>>
>>> Best,
>>>
>>> VJ
>>>
>>> On Thu, Jun 27, 2013 at 8:32 PM, Andy McMurry <[email protected]> wrote:
>>>
>>>> +1 suggestion for documenting many examples of "getting started"
>>>> +NLP datasets.
>>>>
>>>> I have at least one we can use that was created by our lead
>>>> Pathologist:
>>>> https://open.med.harvard.edu/svn/scrubber/releases/3.0/data/input/cases/train/traincase.xml
>>>>
>>>> We should provide at least one sample for each domain.
>>>> Trouble is, privacy requires that these examples be made up by hand
>>>> and not copy-pasted from EMR systems.
>>>>
>>>> --Andy
>>>>
>>>> On Jun 27, 2013, at 5:32 PM, Girivaraprasad Nambari <[email protected]> wrote:
>>>>
>>>>> +1 for this observation Andy!
>>>>>
>>>>> Lowering time will motivate users to write blogs about features,
>>>>> how-tos, etc., which reduces the core team's workload on
>>>>> documentation.
>>>>>
>>>>> I have been trying to write a small "how to write standalone
>>>>> client for ctakes" with my experience (I saw at least 4 users
>>>>> post similar questions in the last 2 months), but I am not getting
>>>>> enough time because ctakes depends on a lot of other frameworks
>>>>> (UimaFit, cleartk, UIMA Framework, etc.); most of my spare time is
>>>>> being spent juggling between these frameworks, posting and
>>>>> browsing those forums, and relating observations to ctakes code.
>>>>> I think we need to have some high-level documentation about these
>>>>> (with links to corresponding forums).
>>>>>
>>>>> The above case is for developers (I think this will be more of
>>>>> the user base as ctakes progresses); for users I think the
>>>>> documentation is a lot better, though some improvements need to
>>>>> be done.
>>>>>
>>>>> As a developer I found the lack of sample training data tough (I
>>>>> am still struggling in this area even though I browsed all the
>>>>> relevant code), though the training classes are there. I
>>>>> understand that there are licensing issues with REAL data, but at
>>>>> least some handmade example sentences, which may not be real,
>>>>> would help developers understand the type/structure of input the
>>>>> TRAINING classes expect. This way people who browse the code can
>>>>> reverse engineer and develop their own models. Sorry if you guys
>>>>> feel this is a novice issue, but I feel most developers will be
>>>>> novices when they adopt a system, and Machine Learning/NLP is an
>>>>> ocean. Some documentation in this area will save a lot of time
>>>>> for us.
>>>>>
>>>>> I wish there will be some activity in this area from the ctakes
>>>>> core team.
>>>>>
>>>>> Thank you,
>>>>> Giri
>>>>>
>>>>> On Thu, Jun 27, 2013 at 5:11 PM, Andy McMurry <[email protected]> wrote:
>>>>>
>>>>>> ctakes is at a point where we have a LOT of features but it is
>>>>>> still hard to get started.
>>>>>>
>>>>>> Judging from the mailing lists, a lot of how cTakes works is not
>>>>>> obvious and requires hand holding.
>>>>>> This is very typical in early FOSS projects.
>>>>>>
>>>>>> Lowering the time to get invested in ctakes gets more users AND
>>>>>> better bug reports, FAQ, etc.
>>>>>>
>>>>>> thoughts?
>>>>>> --Andy
>>>>>>
>>>>>> On Apr 11, 2013, at 8:55 PM, "Chen, Pei" <[email protected]> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>> I just wanted to gauge the interest in creating the next
>>>>>>> release of cTAKES (3.1), which is currently marked for May in
>>>>>>> Jira.
>>>>>>>
>>>>>>> There have already been 22/53 issues [1] marked as fixed or
>>>>>>> closed. Plenty of bug fixes and new components including:
>>>>>>> - New CEM Instance Template population
>>>>>>> - New Dependency Parser/Semantic Role Labeler
>>>>>>> - New optional Clear POSTagger
>>>>>>> - New regression testing component
>>>>>>>
>>>>>>> Should we wait for the Temporal component?
>>>>>>>
>>>>>>> [1] https://issues.apache.org/jira/issues/?jql=fixVersion%20%3D%20%223.1%22%20AND%20project%20%3D%20CTAKES
