+1 ctakes IS domain specific
+1 UIMAFit should become a part of UIMA and not the focus of ctakes-dev

At first glance, people should think of cTakes as the "UIMA medical text library".
Here are examples that I know users are interested in.

Suggestions:
1. ctakes DRUG PROFILE: http://www.mtsamples.com/site/pages/sample.asp?Type=6-Cardiovascular&Sample=775-H%26P+-+Cardio+(Angina)
2. ctakes NER: http://www.mtsamples.com/site/pages/sample.asp?Type=77-rheumatology&Sample=790-Rheumatoid+Arthritis+-+H%26P
3. ctakes SMOKING: http://www.mtsamples.com/site/pages/sample.asp?Type=6-Cardiovascular%20/%20Pulmonary&Sample=571-Trouble%20breathing
4. ctakes Lexical features (PoS, sentence boundaries, etc.): http://www.medicaltranscriptionsamples.com/diabetes-mellitus-followup/

> Very interesting discussion. I think Giri is right about giving example
> training data in the format that our training code can read. While our
> ultimate goal would be to build and release models that are completely
> domain-independent, in the real world it is almost always better to use
> some domain-specific data and we should think more about how to
> facilitate that.
>
> As for making it easier to get started, it is not totally clear to me
> what this means/how to do it so it might be useful to get specific about
> what this means. I think our biggest hurdle is
>
> 1) Prerequisite of understanding UIMA/UIMAFit
>
> Since UIMAFit is officially becoming part of UIMA that will be easier,
> and hopefully people will just learn the easier (in my opinion) UIMAFit
> way than the standard UIMA way of doing things. Is there something we
> can be doing to make understanding UIMA easier? Or do we just need to
> say upfront that this is a prerequisite and hope that people don't give
> up due to this thing that is out of our control?
>
> Another hurdle is:
>
> 2) cTAKES is a multi-purpose developer-aimed tool
>
> So it's not just a matter of hiding complexity -- at some point people
> have to understand their problem, understand cTAKES' capabilities, and
> start coding.
> Pei's GUI will help for some common use cases but will not
> remove the requirement that someone at the organization knows cTAKES.
> I think one part of this problem is the fact that the typesystem is not
> well documented. A developer needs to know what the output is (objects
> from the typesystem), how to get them (which modules/pipelines), and
> what information is in them. So maybe on this end my recommendation
> would be:
> i) Make the typesystem forefront in documentation -- generate javadocs
> and have as a link on the ctakes frontpage/sidebar
> ii) Similar to the way that we are aiming to have tests in every module,
> also have clearly labeled examples in every module that set up a
> pipeline, run on sample notes (could be the same sample notes from the
> tests), and do something with the results.
> iii) Follow Giri's recommendation to have example training data for
> people who want to take the next step and train their own models
>
> This is quite a bit of developer overhead, so it's worth asking whether
> you agree with my "diagnosis" and "treatment" or whether you think there
> are different problems/solutions that should be higher priority.
>
> Tim
>
> On 06/27/2013 10:59 PM, Girivaraprasad Nambari wrote:
>> Hi Vijay and Andy,
>>
>> Thanks for sharing those examples.
>>
>> "Trouble is, privacy requires that these examples be made up by hand"
>>
>> Agree with this statement and this is very valid concern.
>>
>> In "getting started examples", I think we should just have couple of
>> entries (5-10 small entries), not more than that (with explicit statement
>> like "ONLY EXAMPLE", NOT GOOD FOR REAL USAGE). I understand handcrafting
>> these may not be easy because we are not medical domain experts, but I feel
>> worth time, because it brings in more user community.
>>
>> Thank you,
>> Giri
>>
>> On Thu, Jun 27, 2013 at 10:25 PM, Andy McMurry <[email protected]> wrote:
>>
>>> GREAT !
>>>
>>> The i2b2 data though isn't publicly distributable, you still need to
>>> request access to it since it is "semi private"
>>>
>>> On Jun 27, 2013, at 9:52 PM, vijay garla <[email protected]> wrote:
>>>
>>>> We released code on using cTAKES to annotate clinical text and SVMs that
>>>> use the annotations to classify clinical text from the CMC 2007 and I2B2
>>>> 2008 challenges:
>>>>
>>>> We did the CMC 2007 with cTAKES 2.5:
>>>> https://code.google.com/p/ytex/wiki/WordSenseDisambiguation_V08#Reproducing_results_on_CMC_2007_challenge
>>>> <https://code.google.com/p/ytex/downloads/list>
>>>>
>>>> And the i2b2 2008 with the version of cTAKES distributed with the first
>>>> version of ARC:
>>>> https://code.google.com/p/ytex/wiki/FeatEng_V05#i2b2_2008
>>>>
>>>> These are both publicly available datasets, and represent real-world
>>>> problems (in general I believe when publishing a paper the code should be
>>>> reproducible and made publicly available, but that's a different issue).
>>>>
>>>> When we get around to upgrading YTEX to cTAKES 3.1, we would like to
>>>> upgrade these samples as well.
>>>>
>>>> Best,
>>>>
>>>> VJ
>>>>
>>>> On Thu, Jun 27, 2013 at 8:32 PM, Andy McMurry <[email protected]> wrote:
>>>>
>>>>> +1 suggestion for documenting many examples of "getting started" NLP
>>>>> datasets.
>>>>>
>>>>> I have at least one we can use that was created by our lead Pathologist
>>>>>
>>>>> https://open.med.harvard.edu/svn/scrubber/releases/3.0/data/input/cases/train/traincase.xml
>>>>>
>>>>> We should provide at least one sample for each domain.
>>>>> Trouble is, privacy requires that these examples be made up by hand and
>>>>> not copy-pasted from EMR systems.
>>>>>
>>>>> --Andy
>>>>>
>>>>> On Jun 27, 2013, at 5:32 PM, Girivaraprasad Nambari <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> +1 for this observation Andy!
>>>>>>
>>>>>> Lowering time will motivate users in writing blogs about features, how to,
>>>>>> etc., which reduces core team work load on documentation.
>>>>>>
>>>>>> I have been trying to write a small "how to write standalone client for
>>>>>> ctakes" with my experience (I saw at least 4 users posted similar question
>>>>>> in last 2 months), but not getting enough time because ctakes depends on
>>>>>> lot of other frameworks (UimaFit, cleartk, UIMA Framework etc.,), most of
>>>>>> my spare time is being spent on juggling between these frameworks, posting
>>>>>> and browsing those forums, relating observations to ctakes code. I think we
>>>>>> need to have some high level documentation about these (with links to
>>>>>> corresponding forums).
>>>>>>
>>>>>> Above case is for developers (I think this will be more user base as ctakes
>>>>>> progress), for users I think documentation is lot better though some
>>>>>> improvements need to be done.
>>>>>>
>>>>>> As a developer I felt tough with lack of sample training data (I am still
>>>>>> struggling in this area even though I browsed all relevant code), though
>>>>>> training classes are there. I understood that there are licensing issues with
>>>>>> REAL data, but at least some hand made example sentences, which may not be
>>>>>> real but helps developers in understanding the type/structure of input
>>>>>> TRAINING classes expecting. This way people who browse the code can reverse
>>>>>> engineer and develop their own models. Sorry if you guys feel this as
>>>>>> novice issue, but I feel most of the developers will be novice when they
>>>>>> adopt a system and Machine Learning/NLP is ocean. Some documentation in
>>>>>> this area will save a lot of time for us.
>>>>>>
>>>>>> I wish there will be some activity in this area from ctakes core team.
>>>>>>
>>>>>> Thank you,
>>>>>> Giri
>>>>>>
>>>>>> On Thu, Jun 27, 2013 at 5:11 PM, Andy McMurry <[email protected]> wrote:
>>>>>>
>>>>>>> ctakes is at a point where we have a LOT of features but it is still hard
>>>>>>> to get started.
>>>>>>>
>>>>>>> Judging from the mailing lists a lot of how cTakes works is not obvious
>>>>>>> and requires hand holding.
>>>>>>> This is very typical in early FOSS projects.
>>>>>>>
>>>>>>> Lowering the time to get invested in ctakes gets more users AND better bug
>>>>>>> reports, FAQ, etc.
>>>>>>>
>>>>>>> thoughts?
>>>>>>> --Andy
>>>>>>>
>>>>>>> On Apr 11, 2013, at 8:55 PM, "Chen, Pei" <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>> I just wanted to gauge the interest of creating the next release of
>>>>>>>> cTAKES (3.1) which is currently marked for May in Jira-
>>>>>>>> There have already been 22/53 issues [1] marked as fixed or closed.
>>>>>>>> Plenty of bug fixes and new components including:
>>>>>>>> - New CEM Instance Template population
>>>>>>>> - New Dependency Parser/Semantic Role Labeler
>>>>>>>> - New optional Clear POSTagger
>>>>>>>> - New regression testing component
>>>>>>>>
>>>>>>>> Should we wait for the Temporal component?
>>>>>>>>
>>>>>>>> [1] https://issues.apache.org/jira/issues/?jql=fixVersion%20%3D%20%223.1%22%20AND%20project%20%3D%20CTAKES
