Ephi,
The ClearNLP models in the current cTAKES releases (since 3.1.0 [1]) should
be trained on much more than that.  They should include at least the MiPACQ
and SHARP training data.  Could you point us to the documentation you saw so
we can update it?  I believe the breakdown was:


   - Clinical questions: 1,600 sentences, 30,138 tokens.
   - Medpedia articles: 2,796 sentences, 49,922 tokens.
   - MiPACQ clinical notes: 8,040 sentences, 107,663 tokens.
   - MiPACQ pathological notes: 1,225 sentences, 21,581 tokens.
   - Seattle group health clinical notes: 5,020 sentences, 61,124 tokens.
   - Seattle group health pathological notes: 2,294 sentences, 34,384 tokens.
   - SHARP clinical notes: 6,787 sentences, 94,205 tokens.
   - SHARP stratified: 4,316 sentences, 43,037 tokens.
   - SHARP stratified SGH: 4,963 sentences, 49,081 tokens.
   - TEMPREL clinical notes: 19,775 sentences, 266,979 tokens.
   - TEMPREL pathological notes: 4,335 sentences, 78,829 tokens.
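
For a sense of scale, the figures above can be totalled with a short
script.  This is only a sketch: the numbers are the ones recalled in the
list above, and the variable names are made up for illustration.

```python
# Corpus sizes as recalled in the list above: (name, sentences, tokens).
corpora = [
    ("Clinical questions", 1600, 30138),
    ("Medpedia articles", 2796, 49922),
    ("MiPACQ clinical notes", 8040, 107663),
    ("MiPACQ pathological notes", 1225, 21581),
    ("Seattle group health clinical notes", 5020, 61124),
    ("Seattle group health pathological notes", 2294, 34384),
    ("SHARP clinical notes", 6787, 94205),
    ("SHARP stratified", 4316, 43037),
    ("SHARP stratified SGH", 4963, 49081),
    ("TEMPREL clinical notes", 19775, 266979),
    ("TEMPREL pathological notes", 4335, 78829),
]

# Sum the sentence and token columns across all listed corpora.
total_sentences = sum(sentences for _, sentences, _ in corpora)
total_tokens = sum(tokens for _, _, tokens in corpora)

print(f"{total_sentences:,} sentences, {total_tokens:,} tokens")
```

With the figures as listed, that comes to 61,151 sentences and 836,943
tokens, so the 1,600 clinical questions are only a small part of the whole.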

There has been some discussion about appending to or augmenting the existing
annotated/training data [2].  I think the short answer is that there is
currently no easy way to get the data, short of signing data use agreements
(DUAs) with every single source institution.

[1] http://svn.apache.org/r1465043
[2] http://mail-archives.apache.org/mod_mbox/ctakes-dev/201412.mbox/%3ce5a9fa5abbf1ca4085d4f0794852a51e24241...@chexmbx3a.chboston.org%3E


On Sun, Mar 15, 2015 at 11:58 AM, Ephi <[email protected]> wrote:

> Hi -
>
> From the documentation, the data used to train the dep parser in cTAKES
> seems to be 1600 clinical questions (from the Mayo clinic?).
>
> Is there a way to retrieve this data in order to retrain the model (while
> adding on additional data) ?
>
> Thanks!
> Ephi
>
