Ephi,

The ClearNLP models in the current cTAKES releases (since 3.1.0 [1]) should contain much more than that: at least the MiPACQ and SHARP training data. Could you point us to the documentation so we can update it? I believe the breakdown was:
- Clinical questions: 1,600 sentences, 30,138 tokens.
- Medpedia articles: 2,796 sentences, 49,922 tokens.
- MiPACQ clinical notes: 8,040 sentences, 107,663 tokens.
- MiPACQ pathological notes: 1,225 sentences, 21,581 tokens.
- Seattle Group Health clinical notes: 5,020 sentences, 61,124 tokens.
- Seattle Group Health pathological notes: 2,294 sentences, 34,384 tokens.
- SHARP clinical notes: 6,787 sentences, 94,205 tokens.
- SHARP stratified: 4,316 sentences, 43,037 tokens.
- SHARP stratified SGH: 4,963 sentences, 49,081 tokens.
- TEMPREL clinical notes: 19,775 sentences, 266,979 tokens.
- TEMPREL pathological notes: 4,335 sentences, 78,829 tokens.

There has been some discussion of appending to or augmenting the existing annotated training data [2]. I think the short answer is that there is currently no easy way, short of signing DUAs with every single source institution.

[1] http://svn.apache.org/r1465043
[2] http://mail-archives.apache.org/mod_mbox/ctakes-dev/201412.mbox/%3ce5a9fa5abbf1ca4085d4f0794852a51e24241...@chexmbx3a.chboston.org%3E

On Sun, Mar 15, 2015 at 11:58 AM, Ephi <[email protected]> wrote:

> Hi -
>
> From the documentation, the data used to train the dep parser in cTAKES
> seems to be 1600 clinical questions (from the Mayo clinic?).
>
> Is there a way to retrieve this data in order to retrain the model (while
> adding on additional data) ?
>
> Thanks!
> Ephi
>
