The 7 lines I referred to as "ending with apostrophe" do indeed have an apostrophe followed immediately by a newline.
In the training data it is indeed very rare to end on an apostrophe: 7 out of >400K sentences. I second your suggestion of removing the apostrophe from the list of sentence-breaking characters. It is straightforward and cleaner.

Thanks

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Tim Miller
Sent: Monday, August 26, 2013 11:35 AM
To: [email protected]
Subject: Re: apostrophe and sentence detector

Ah, so we might suspect that some of those 7 lines in the file were indeed followed by newlines in the original training data. In the absence of more/better training data which would help us learn this I think it would be reasonable to restore the list of sentence-breaking characters to not include apostrophe. Seems like it is rare for a sentence to end on it, and my preference is to accidentally call 2 sentences one sentence, rather than splitting one sentence in the middle. I think it's probably better for downstream processing.

Just my .02,
Tim

On 08/26/2013 12:29 PM, Masanz, James J. wrote:
> The training data is one sentence per line.
> That's how you feed data to the sentence detector.
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Tim Miller
> Sent: Monday, August 26, 2013 11:12 AM
> To: [email protected]
> Subject: Re: apostrophe and sentence detector
>
>
> On 08/26/2013 12:05 PM, Masanz, James J. wrote:
>> The recently rebuilt sentence detector (currently in trunk and the 3.1.0
>> branch) is sometimes taking the apostrophe as a sentence break where the
>> ctakes-3.0.0-incubating model didn't.
>>
>> The training data used for the recently rebuilt model contains only 7
>> lines that end with an apostrophe (single quote)
> Do you mean 7 sentences that end in a single apostrophe or 7 lines? The
> sentence detector will currently break on newlines no matter what, so
> the important number is how many sentences end mid-line with an
> apostrophe, right?
> Tim
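
For anyone who wants to reproduce the kind of count discussed in this thread on their own one-sentence-per-line training file, here is a minimal Python sketch. It is not part of cTAKES or OpenNLP; the function name `count_line_endings`, the candidate character set, and the sample sentences are all made up for illustration, and the real list of sentence-breaking characters is not shown in this thread.

```python
from collections import Counter

# Hypothetical candidate set for illustration only; the actual list of
# sentence-breaking characters used by the detector is not given in the thread.
CANDIDATE_EOS_CHARS = {".", "?", "!", "'"}


def count_line_endings(lines, chars=CANDIDATE_EOS_CHARS):
    """Count how many non-empty lines end with each candidate character.

    Assumes one sentence per line, as described for the training data.
    """
    counts = Counter()
    for line in lines:
        sentence = line.rstrip("\r\n")  # drop only the line terminator
        if sentence and sentence[-1] in chars:
            counts[sentence[-1]] += 1
    return counts


# Made-up sample standing in for a training file.
sample = [
    "The patient denies chest pain.\n",
    "Is there a family history?\n",
    "Patient described the pain as 'sharp'\n",
    "No acute distress.\n",
]

counts = count_line_endings(sample)
total = sum(1 for line in sample if line.strip())
for ch, n in sorted(counts.items()):
    print(f"{ch!r}: {n} of {total} lines")
```

With a real file, passing the open file object (for example `count_line_endings(open(path, encoding="utf-8"))`, where `path` is your own training file) gives the same per-character tally. A character that ends only a handful of lines out of hundreds of thousands is a weak candidate for the sentence-breaking list.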
