Hmm, one problem there is that, in my experience, medical records tend to be punctuated completely differently from normal text.
--
Karthik Sarma
UCLA Medical Scientist Training Program Class of 20??
Member, UCLA Medical Imaging & Informatics Lab
Member, CA Delegation to the House of Delegates of the American Medical Association
[email protected]
gchat: [email protected]
linkedin: www.linkedin.com/in/ksarma

On Mon, Aug 26, 2013 at 9:46 AM, John Green <[email protected]> wrote:

> Just out of curiosity, how was the training data originally built? I mean,
> who separated the lines? By hand? Regex?
>
> Question two: has anyone made attempts at adding project gutenberg to
> the training data for things like sentence detection? Wide variety of
> punctuation in the years a lot of those books were written.
>
> Trying to piece together how it all works,
> JG
>
> --
> Sent from Mailbox for iPhone
>
> On Mon, Aug 26, 2013 at 12:35 PM, Tim Miller
> <[email protected]> wrote:
>
> > Ah, so we might suspect that some of those 7 lines in the file were
> > indeed followed by newlines in the original training data. In the
> > absence of more/better training data which would help us learn this I
> > think it would be reasonable to restore the list of sentence-breaking
> > characters to not include apostrophe. Seems like it is rare for a
> > sentence to end on it, and my preference is to accidentally call 2
> > sentences one sentence, rather than splitting one sentence in the
> > middle. I think it's probably better for downstream processing.
> > Just my .02,
> > Tim
> >
> > On 08/26/2013 12:29 PM, Masanz, James J. wrote:
> >> The training data is one sentence per line.
> >> That's how you feed data to the sentence detector.
> >>
> >> -----Original Message-----
> >> From: [email protected] [mailto:[email protected]] On Behalf Of Tim Miller
> >> Sent: Monday, August 26, 2013 11:12 AM
> >> To: [email protected]
> >> Subject: Re: apostrophe and sentence detector
> >>
> >> On 08/26/2013 12:05 PM, Masanz, James J. wrote:
> >>> The recently rebuilt sentence detector (currently in trunk and the
> >>> 3.1.0 branch) is sometimes taking the apostrophe as a sentence break
> >>> where the ctakes-3.0.0-incubating model didn't.
> >>>
> >>> The training data used for the recently rebuilt model contains
> >>> only 7 lines that end with an apostrophe (single quote)
> >> Do you mean 7 sentences that end in a single apostrophe or 7 lines? The
> >> sentence detector will currently break on newlines no matter what, so
> >> the important number is how many sentences end mid-line with an
> >> apostrophe, right?
> >> Tim
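
The count discussed in the thread (training lines that end with an apostrophe) can be checked with a short script. This is only a minimal sketch, assuming the training data is plain text with one sentence per line, as James describes. The function name and the sample lines are hypothetical and are not part of cTAKES.

```python
from typing import Iterable


def count_lines_ending_with(lines: Iterable[str], char: str = "'") -> int:
    """Count non-blank lines whose last non-whitespace character is `char`."""
    count = 0
    for line in lines:
        stripped = line.rstrip()  # drop the trailing newline and spaces
        if stripped and stripped.endswith(char):
            count += 1
    return count


if __name__ == "__main__":
    # Hypothetical sample standing in for a one-sentence-per-line training file.
    sample = [
        "Patient states 'I feel fine'\n",
        "No acute distress.\n",
        "\n",
        "Pt's mother was present.\n",
    ]
    print(count_lines_ending_with(sample))
```

To run it against a real file, pass an open file handle instead of the sample list, since iterating over a text file yields one line at a time. Note that this counts line-final apostrophes only; it does not answer Tim's question about sentences that end mid-line.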
