That sounds like a good plan. The data used to train the first cTAKES sentence detector (prior to Apache cTAKES) contained fewer than 8,000 sentences from clinical notes.
Also of interest may be this table, which shows that GENIA, PTB, and Mayo Clinic data were all used for that model: http://jamia.bmj.com/content/17/5/507/T2.expansion.html

-- James

-----Original Message-----
From: Tim Miller [mailto:[email protected]]
Sent: Friday, February 07, 2014 4:24 PM
To: [email protected]
Subject: training data for sentence detector

James,

We were discussing the sentence detector thing in person here the other day, and Pei had a thought: depending on what sources you were using for training the sentence detector, we might be able to do something equivalent here in Boston by using SHARP, THYME, and MIPACQ data, which are largely from Mayo and probably similar to what you use, then augmenting with the little bit of MIMIC that I annotated. I don't know how that compares size-wise to the dataset you are using. Is it quite large, or do you think we will be fine using derived data from those other projects?

What do you think of this plan? Anyone else?

Tim
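[Editor's note: the cTAKES sentence detector is an OpenNLP maximum-entropy model, so the combined SHARP/THYME/MIPACQ/MIMIC sentences discussed above would need to be serialized into OpenNLP's sentence-detector training format (one sentence per line, blank line between documents) before training. A minimal sketch of that conversion, assuming the annotations are already available as lists of sentence strings per note — the function name and sample data below are hypothetical, not from the thread:]

```python
def write_opennlp_training_file(notes, path):
    """Write sentences in OpenNLP SentenceDetectorTrainer format:
    one sentence per line, with a blank line between documents.
    Returns the number of sentences written, which answers the
    size-comparison question raised in the thread."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for sentences in notes:
            for sent in sentences:
                # Collapse internal whitespace so each sentence stays on one line.
                f.write(" ".join(sent.split()) + "\n")
                count += 1
            f.write("\n")  # blank line marks a document boundary
    return count

# Hypothetical example: two short de-identified notes.
notes = [
    ["Patient presents with chest pain.", "No prior cardiac history."],
    ["Discharged in stable condition."],
]
n = write_opennlp_training_file(notes, "sentences.train")
print(n)
```

[The resulting file could then be passed to OpenNLP's SentenceDetectorTrainer, and the returned sentence count compared against the roughly 8,000 Mayo sentences James mentions above.]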
