I see. It's a pretty random collection of formats.

Sent from my iPhone
On Jul 3, 2013, at 18:25, andy mcmurry <[email protected]> wrote:

> Mtsamples has lots of free public examples already, but we aren't using them yet. This is probably because mtsamples don't have the annotations we need to use them as training examples.
>
> On Jul 3, 2013 2:46 PM, "Hephaestus Studio" <[email protected]> wrote:
>
>> @Andy - Not a doctor yet, but soon! Thanks for the promotion though, one more year!
>>
>> - Apropos meds or clinical type questions: any developer on here can feel free to shoot me a quick question via the list anytime; I'd be happy to confirm that a drug or anything else makes sense given a particular clinical/note context.
>>
>> - "I wonder if there is some way in which you could guide us in making better use of the medical knowledge sources (ontologies) that are available." - I'd be happy to brainstorm about using existing resources to help in decision making. We use these all the time in the clinic.
>>
>> @ Tim+Andy+Chen - I haven't had a chance to really start chewing into the code, though I hope to over the next year; so, what kind of examples would be most helpful?
>> - Any particular disease processes?
>> - Are you all familiar with the ubiquitous SOAP style presentation that doctors use to write free notes? The few examples I clicked through in the repository that Chen pointed me to are very sparse. Would we want gradations? E.g., a scale from "well done" notes to "very quick I-don't-care-because-I'm-in-a-rush" notes?
>>
>> @ Chen - Thank you for the kind words. It's nice to be welcomed by a community in which you hope to integrate. And thank you for pointing me to the directory with the current sample notes. This was very helpful in determining where those are in their development. I know that each of your hospitals has a wealth of HIPAA-closed notes, but I'll see what I can do to make some "stereotypical" open notes for common disease presentations.
>> Again: maybe a scale, not necessarily just on brevity but on some other metric, whose continuum represents various degrees of something, maybe of difficulty in processing? Apropos code, Chen: I will help where I can, but where I want to be is elbow deep in the code :)
>>
>> Finally: I haven't had a chance to look into some of the links from earlier in this thread regarding open access repositories of free text clinical notes: what do you all feel the quality of these resources is? Abundant but low quality? Scarce, but those that are there are high quality?
>>
>> Bottom line: no problem answering contextual questions (can afib be associated with a lower GI bleed??) and no problem writing some notes. My only question, before I put in any time: what disease/specialty domain? And would we want some system that puts them on a continuum of some variable, say, brevity or "readability"?
>>
>> Just thinking before leaping,
>>
>> Thanks,
>> JG
>>
>> Sent from my iPhone
>>
>> On Jul 2, 2013, at 21:23, "Chen, Pei" <[email protected]> wrote:
>>
>>> Hi John,
>>> Welcome! There are actually many ways to contribute and it's not limited to just code. It's always great to hear new ideas and suggestions on how to improve the software. Therefore, even things like user feedback, documentation, new use cases, essentially anything that will make things better would be awesome!
>>>
>>> To get started, I would suggest subscribing to the email lists. If you would like to contribute anything, just create a Jira account (anyone should be able to do this), and add/review Jira items (add attachments if you like) and we can even help integrate it.
>>>
>>> We normally use Jira to keep track of issues:
>>> [1] https://issues.apache.org/jira/browse/ctakes
>>>
>>> Current collection of sample test notes that have been collected over the years:
>>> https://svn.apache.org/repos/asf/ctakes/trunk/ctakes-regression-test/testdata/input/plaintext/
>>>
>>> ________________________________________
>>> From: Tim Miller [[email protected]]
>>> Sent: Tuesday, July 02, 2013 6:31 PM
>>> To: [email protected]
>>> Subject: Re: Next cTAKES release (3.1)?
>>>
>>> Agreed that you could definitely help out, and that would be a great way to do so. We don't really have "examples" right now, more like just short test sentences for showing simple results and verifying that nothing has been broken by changes. I think regular length fake but realistic notes would be very useful.
>>> Tim
>>>
>>> On 07/02/2013 05:19 PM, John Green wrote:
>>>> Hi all,
>>>>
>>>> I've been following this mail list for a couple of months. I'm a third year medical student rounding the bend toward my MD. I used to be a computer programmer, however, and continue my own projects. I'm very interested in contributing eventually to cTakes development. In the meantime, given the current talk of examples, if any domain-specific examples need to be generated, I am domain-knowledgeable enough that I could pound out a few free text notes made to order.
>>>>
>>>> Let me know; you all may already have docs on hand willing to do this, but if not...
>>>>
>>>> John Green
>>>>
>>>> Sent from my iPhone
>>>>
>>>> On Jun 28, 2013, at 8:59, "Chen, Pei" <[email protected]> wrote:
>>>>
>>>>> I completely agree with making cTAKES easier to use. I think it is exciting to hear the different use cases here and to understand where some of the areas that need improvement are (which we hadn't thought about earlier).
>>>>> I think Tim's suggestions and the 3 concrete actionable items make a lot of sense.
>>>>> Hopefully it should attract new users, adopters, and perhaps more committers.
>>>>>
>>>>>> i) Make the typesystem forefront in documentation -- generate javadocs and have as a link on the ctakes frontpage/sidebar
>>>>>> ii) Similar to the way that we are aiming to have tests in every module, also have clearly labeled examples in every module that set up a pipeline, run on sample notes (could be the same sample notes from the tests), and do something with the results.
>>>>>> iii) Follow Giri's recommendation to have example training data for people who want to take the next step and train their own models
>>>>> I think Java developers are accustomed to including a library as a dependency/jar, having an API to pass input, and getting the results via POJOs; so the examples could initially shield the complexity of wiring a pipeline together etc.
>>>>> If we can improve the APIs and how it gets integrated with other apps, we can add any GUI/CLI tools on top of this afterwards.
>>>>>
>>>>> --Pei
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Miller, Timothy [mailto:[email protected]]
>>>>>> Sent: Friday, June 28, 2013 8:00 AM
>>>>>> To: [email protected]
>>>>>> Subject: Re: Next cTAKES release (3.1)?
>>>>>>
>>>>>> Very interesting discussion. I think Giri is right about giving example training data in the format that our training code can read. While our ultimate goal would be to build and release models that are completely domain-independent, in the real world it is almost always better to use some domain-specific data and we should think more about how to facilitate that.
>>>>>>
>>>>>> As for making it easier to get started, it is not totally clear to me what this means/how to do it so it might be useful to get specific about what this means.
>>>>>> I think our biggest hurdle is
>>>>>>
>>>>>> 1) Prerequisite of understanding UIMA/UIMAFit
>>>>>>
>>>>>> Since UIMAFit is officially becoming part of UIMA that will be easier, and hopefully people will just learn the easier (in my opinion) UIMAFit way rather than the standard UIMA way of doing things. Is there something we can be doing to make understanding UIMA easier? Or do we just need to say upfront that this is a prerequisite and hope that people don't give up due to this thing that is out of our control?
>>>>>>
>>>>>> Another hurdle is:
>>>>>>
>>>>>> 2) cTAKES is a multi-purpose developer-aimed tool
>>>>>>
>>>>>> So it's not just a matter of hiding complexity -- at some point people have to understand their problem, understand cTAKES' capabilities, and start coding. Pei's GUI will help for some common use cases but will not remove the requirement that someone at the organization knows cTAKES.
>>>>>> I think one part of this problem is the fact that the typesystem is not well documented. A developer needs to know what the output is (objects from the typesystem), how to get them (which modules/pipelines), and what information is in them. So maybe on this end my recommendation would be:
>>>>>> i) Make the typesystem forefront in documentation -- generate javadocs and have as a link on the ctakes frontpage/sidebar
>>>>>> ii) Similar to the way that we are aiming to have tests in every module, also have clearly labeled examples in every module that set up a pipeline, run on sample notes (could be the same sample notes from the tests), and do something with the results.
>>>>>> iii) Follow Giri's recommendation to have example training data for people who want to take the next step and train their own models
>>>>>>
>>>>>> This is quite a bit of developer overhead, so it's worth asking whether you agree with my "diagnosis" and "treatment" or whether you think there are different problems/solutions that should be higher priority.
>>>>>>
>>>>>> Tim
>>>>>>
>>>>>> On 06/27/2013 10:59 PM, Girivaraprasad Nambari wrote:
>>>>>>> Hi Vijay and Andy,
>>>>>>>
>>>>>>> Thanks for sharing those examples.
>>>>>>>
>>>>>>> "Trouble is, privacy requires that these examples be made up by hand"
>>>>>>>
>>>>>>> I agree with this statement, and it is a very valid concern.
>>>>>>>
>>>>>>> In "getting started examples", I think we should just have a couple of entries (5-10 small entries), not more than that (with an explicit statement like "ONLY EXAMPLE, NOT GOOD FOR REAL USAGE"). I understand handcrafting these may not be easy because we are not medical domain experts, but I feel it is worth the time, because it brings in a larger user community.
>>>>>>>
>>>>>>> Thank you,
>>>>>>> Giri
>>>>>>>
>>>>>>> On Thu, Jun 27, 2013 at 10:25 PM, Andy McMurry <[email protected]> wrote:
>>>>>>>> GREAT !
>>>>>>>>
>>>>>>>> The i2b2 data though isn't publicly distributable; you still need to request access to it since it is "semi private".
>>>>>>>>
>>>>>>>> On Jun 27, 2013, at 9:52 PM, vijay garla <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> We released code on using cTAKES to annotate clinical text and SVMs that use the annotations to classify clinical text from the CMC 2007 and I2B2 2008 challenges:
>>>>>>>>>
>>>>>>>>> We did the CMC 2007 with cTAKES 2.5:
>>>>>>>>> https://code.google.com/p/ytex/wiki/WordSenseDisambiguation_V08#Reproducing_results_on_CMC_2007_challenge
>>>>>>>>> <https://code.google.com/p/ytex/downloads/list>
>>>>>>>>> And the i2b2 2008 with the version of cTAKES distributed with the first version of ARC:
>>>>>>>>> https://code.google.com/p/ytex/wiki/FeatEng_V05#i2b2_2008
>>>>>>>>>
>>>>>>>>> These are both publicly available datasets, and represent real-world problems (in general I believe when publishing a paper the code should be reproducible and made publicly available, but that's a different issue).
>>>>>>>>> When we get around to upgrading YTEX to cTAKES 3.1, we would like to upgrade these samples as well.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>>
>>>>>>>>> VJ
>>>>>>>>>
>>>>>>>>> On Thu, Jun 27, 2013 at 8:32 PM, Andy McMurry <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> +1 suggestion for documenting many examples of "getting started" +NLP datasets.
>>>>>>>>>>
>>>>>>>>>> I have at least one we can use that was created by our lead Pathologist:
>>>>>>>>>> https://open.med.harvard.edu/svn/scrubber/releases/3.0/data/input/cases/train/traincase.xml
>>>>>>>>>> We should provide at least one sample for each domain.
>>>>>>>>>> Trouble is, privacy requires that these examples be made up by hand and not copy-pasted from EMR systems.
>>>>>>>>>>
>>>>>>>>>> --Andy
>>>>>>>>>>
>>>>>>>>>> On Jun 27, 2013, at 5:32 PM, Girivaraprasad Nambari <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> +1 for this observation Andy!
>>>>>>>>>>>
>>>>>>>>>>> Lowering the time will motivate users to write blogs about features, how-tos, etc., which reduces the core team's workload on documentation.
>>>>>>>>>>>
>>>>>>>>>>> I have been trying to write a small "how to write a standalone client for ctakes" from my experience (I saw at least 4 users post similar questions in the last 2 months), but I am not getting enough time: because ctakes depends on a lot of other frameworks (UimaFit, cleartk, UIMA Framework, etc.), most of my spare time is spent juggling between these frameworks, posting on and browsing those forums, and relating observations to the ctakes code. I think we need to have some high-level documentation about these (with links to the corresponding forums).
>>>>>>>>>>>
>>>>>>>>>>> The above case is for developers (I think this will be the larger user base as ctakes progresses); for users I think the documentation is a lot better, though some improvements need to be made.
>>>>>>>>>>>
>>>>>>>>>>> As a developer I found the lack of sample training data tough (I am still struggling in this area even though I browsed all the relevant code), though the training classes are there. I understand that there are licensing issues with REAL data, but we could at least have some handmade example sentences, which may not be real but would help developers understand the type/structure of input the TRAINING classes expect.
>>>>>>>>>>> This way people who browse the code can reverse engineer and develop their own models. Sorry if you guys feel this is a novice issue, but I feel most developers will be novices when they adopt a system, and Machine Learning/NLP is an ocean. Some documentation in this area will save a lot of time for us.
>>>>>>>>>>>
>>>>>>>>>>> I hope there will be some activity in this area from the ctakes core team.
>>>>>>>>>>>
>>>>>>>>>>> Thank you,
>>>>>>>>>>> Giri
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Jun 27, 2013 at 5:11 PM, Andy McMurry <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> ctakes is at a point where we have a LOT of features but it is still hard to get started.
>>>>>>>>>>>>
>>>>>>>>>>>> Judging from the mailing lists, a lot of how cTakes works is not obvious and requires hand holding.
>>>>>>>>>>>> This is very typical in early FOSS projects.
>>>>>>>>>>>>
>>>>>>>>>>>> Lowering the time to get invested in ctakes gets more users AND better bug reports, FAQ, etc.
>>>>>>>>>>>>
>>>>>>>>>>>> thoughts?
>>>>>>>>>>>> --Andy
>>>>>>>>>>>>
>>>>>>>>>>>> On Apr 11, 2013, at 8:55 PM, "Chen, Pei" <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>> I just wanted to gauge the interest of creating the next release of cTAKES (3.1), which is currently marked for May in Jira.
>>>>>>>>>>>>> There have already been 22/53 issues [1] marked as fixed or closed.
>>>>>>>>>>>>> Plenty of bug fixes and new components including:
>>>>>>>>>>>>> - New CEM Instance Template population
>>>>>>>>>>>>> - New Dependency Parser/Semantic Role Labeler
>>>>>>>>>>>>> - New optional Clear POSTagger
>>>>>>>>>>>>> - New regression testing component
>>>>>>>>>>>>>
>>>>>>>>>>>>> Should we wait for the Temporal component?
>>>>>>>>>>>>>
>>>>>>>>>>>>> [1] https://issues.apache.org/jira/issues/?jql=fixVersion%20%3D%20%223.1%22%20AND%20project%20%3D%20CTAKES
