That sounds like a good plan.  Of the data used to train the first cTAKES 
sentence detector (prior to Apache cTAKES), fewer than 8,000 sentences 
came from clinical notes. 
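
In case it helps with planning: the model is an OpenNLP maxent sentence 
model, so retraining on whatever corpus you assemble is straightforward. 
Here is a minimal sketch against the OpenNLP 1.5-era API -- the file names 
are illustrative, and it assumes training data formatted one sentence per 
line with a blank line between documents:

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.OutputStream;

    import opennlp.tools.sentdetect.SentenceDetectorME;
    import opennlp.tools.sentdetect.SentenceModel;
    import opennlp.tools.sentdetect.SentenceSample;
    import opennlp.tools.sentdetect.SentenceSampleStream;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;
    import opennlp.tools.util.TrainingParameters;

    public class TrainSentenceDetector {
        public static void main(String[] args) throws Exception {
            // Training data: one sentence per line, blank line between documents.
            ObjectStream<String> lines = new PlainTextByLineStream(
                    new FileInputStream("sentences.train"), "UTF-8");
            ObjectStream<SentenceSample> samples = new SentenceSampleStream(lines);

            // Train a maxent sentence model
            // (useTokenEnd=true, no abbreviation dictionary).
            SentenceModel model = SentenceDetectorME.train(
                    "en", samples, true, null, TrainingParameters.defaultParams());

            // Serialize the model so it can be swapped into the resources dir.
            OutputStream out = new FileOutputStream("sd-model.zip");
            model.serialize(out);
            out.close();
        }
    }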

Also of interest may be this table, which shows that GENIA, PTB, and Mayo 
Clinic data were all used for that model:

http://jamia.bmj.com/content/17/5/507/T2.expansion.html 

-- James

-----Original Message-----
From: Tim Miller [mailto:[email protected]] 
Sent: Friday, February 07, 2014 4:24 PM
To: [email protected]
Subject: training data for sentence detector

James,
We were discussing the sentence detector thing in person here the other 
day, and Pei had a thought: depending on what sources you were using to 
train the sentence detector, we might be able to do something equivalent 
here in Boston by using SHARP, THYME, and MIPACQ data, which are largely 
from Mayo and probably similar to what you used, then augmenting with the 
little bit of MIMIC that I annotated. I don't know how that compares 
size-wise to the dataset you are using. Is it quite large, or do you think 
we will be fine using derived data from those other projects? What do you 
think of this plan? Anyone else?
Tim
