Great! The i2b2 data isn't publicly distributable, though. You still need to request access to it, since it is "semi-private".
On Jun 27, 2013, at 9:52 PM, vijay garla <[email protected]> wrote:
> We released code on using cTAKES to annotate clinical text and SVMs that
> use the annotations to classify clinical text from the CMC 2007 and I2B2
> 2008 challenges:
>
> We did the CMC 2007 with cTAKES 2.5:
> https://code.google.com/p/ytex/wiki/WordSenseDisambiguation_V08#Reproducing_results_on_CMC_2007_challenge<https://code.google.com/p/ytex/downloads/list>
>
> And the i2b2 2008 with the version of cTAKES distributed with the first
> version of ARC:
> https://code.google.com/p/ytex/wiki/FeatEng_V05#i2b2_2008
>
> These are both publicly available datasets, and represent real-world
> problems (in general I believe when publishing a paper the code should be
> reproducible and made publicly available, but that's a different issue).
>
> When we get around to upgrading YTEX to cTAKES 3.1, we would like to
> upgrade these samples as well.
>
> Best,
>
> VJ
>
> On Thu, Jun 27, 2013 at 8:32 PM, Andy McMurry <[email protected]> wrote:
>
>> +1 suggestion for documenting many examples of "getting started" NLP
>> datasets.
>>
>> I have at least one we can use that was created by our lead Pathologist
>>
>> https://open.med.harvard.edu/svn/scrubber/releases/3.0/data/input/cases/train/traincase.xml
>>
>> We should provide at least one sample for each domain.
>> Trouble is, privacy requires that these examples be made up by hand and
>> not copy-pasted from EMR systems.
>>
>> --Andy
>>
>> On Jun 27, 2013, at 5:32 PM, Girivaraprasad Nambari <[email protected]>
>> wrote:
>>
>>> +1 for this observation Andy!
>>>
>>> Lowering time will motivate users in writing blogs about features, how to,
>>> etc., which reduces core team work load on documentation.
>>>
>>> I have been trying to write a small "how to write standalone client for
>>> ctakes" with my experience (I saw at least 4 users posted similar question
>>> in last 2 months), but not getting enough time because ctakes depends on
>>> lot of other frameworks (UimaFit, cleartk, UIMA Framework etc.,), most of
>>> my spare time is being spent on juggling between these frameworks, posting
>>> and browsing those forums, relating observations to ctakes code. I think we
>>> need to have some high level documentation about these (with links to
>>> corresponding forums).
>>>
>>> Above case is for developers (I think this will be more user base as ctakes
>>> progress), for users I think documentation is lot better though some
>>> improvements need to be done.
>>>
>>> As a developer I felt tough with lack of sample training data (I am still
>>> struggling in this area even though I browsed all relevant code), though
>>> training class are there. I understood that there are licensing issues with
>>> REAL data, but at least some hand made example sentences, which may not be
>>> real but helps developers in understanding the type/structure of input
>>> TRAINING classes expecting. This way people who browse the code can reverse
>>> engineer and develop their own models. Sorry if you guys feel this as
>>> novice issue, but I feel most of the developers will be novice when they
>>> adopt a system and Machine Learning/NLP is ocean. Some documentation in
>>> this area will save lot of time for us.
>>>
>>> I wish there will be some activity in this area from ctakes core team.
>>>
>>> Thank you,
>>> Giri
>>>
>>> On Thu, Jun 27, 2013 at 5:11 PM, Andy McMurry <[email protected]> wrote:
>>>
>>>> ctakes is at a point where we have a LOT of features but it is still hard
>>>> to get started.
>>>>
>>>> Judging from the mailing lists a lot of how cTakes works is not obvious
>>>> and requires hand holding.
>>>> This is very typical in early FOSS projects.
>>>>
>>>> Lowering the time to get invested in ctakes gets more users AND better bug
>>>> reports, FAQ, etc.
>>>>
>>>> thoughts?
>>>> --Andy
>>>>
>>>> On Apr 11, 2013, at 8:55 PM, "Chen, Pei" <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>> I just wanted to gauge the interest of creating the next release of
>>>>> cTAKES (3.1) which is currently marked for May in Jira-
>>>>>
>>>>> There have already been 22/53 issues [1] marked as fixed or closed.
>>>>> Plenty of bug fixes and new components including:
>>>>> - New CEM Instance Template population
>>>>> - New Dependency Parser/Semantic Role Labeler
>>>>> - New optional Clear POSTagger
>>>>> - New regression testing component
>>>>>
>>>>> Should we wait for the Temporal component?
>>>>>
>>>>> [1] https://issues.apache.org/jira/issues/?jql=fixVersion%20%3D%20%223.1%22%20AND%20project%20%3D%20CTAKES
