Ah, so we might suspect that some of those 7 lines in the file were indeed followed by newlines in the original training data. In the absence of more or better training data that would help us learn this, I think it would be reasonable to restore the list of sentence-breaking characters so that it does not include the apostrophe. It seems rare for a sentence to end on one, and my preference is to accidentally call two sentences one sentence rather than split one sentence in the middle. I think that's probably better for downstream processing.
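To make the trade-off concrete, here is a toy sketch, not the actual cTAKES or OpenNLP code. The two candidate sets and the function name are assumptions made up for illustration. It only shows where a detector would be allowed to consider a break, assuming newlines are always break points as described in this thread.

```python
# Toy illustration only; this is not the cTAKES or OpenNLP implementation.
# Both candidate sets below are hypothetical.
CANDIDATES_WITH_APOSTROPHE = {".", "?", "!", "'"}
CANDIDATES_WITHOUT_APOSTROPHE = {".", "?", "!"}


def candidate_offsets(text, candidates):
    """Return offsets where a detector could consider a sentence break.

    Newlines are always included, mirroring the behaviour described in
    this thread (the detector breaks on newlines no matter what).
    """
    return [i for i, ch in enumerate(text) if ch in candidates or ch == "\n"]


text = "Pt's mother reports no fever. Denies pain.\nFollow up in 2 weeks."
with_apos = candidate_offsets(text, CANDIDATES_WITH_APOSTROPHE)
without_apos = candidate_offsets(text, CANDIDATES_WITHOUT_APOSTROPHE)

# The only difference between the two lists is the apostrophe in "Pt's".
extra = sorted(set(with_apos) - set(without_apos))
```

With the apostrophe removed from the candidate set, a split in the middle of "Pt's" cannot happen at all. The cost is that a sentence that really does end on an apostrophe mid-line would be merged with the one that follows.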
Just my .02,
Tim

On 08/26/2013 12:29 PM, Masanz, James J. wrote:
The training data is one sentence per line.
That's how you feed data to the sentence detector.
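
Since each line is one sentence, counting lines that end with an apostrophe is the same as counting sentences that do. A minimal sketch of such a count follows. The function name and the sample lines are hypothetical, and this is not the script actually used on the training data.

```python
def count_lines_ending_with_apostrophe(lines):
    """Count lines whose last non-whitespace character is an apostrophe."""
    count = 0
    for line in lines:
        # rstrip() drops the trailing newline and any trailing spaces.
        if line.rstrip().endswith("'"):
            count += 1
    return count


# Made-up sample in one-sentence-per-line form.
sample = [
    "Patient states 'I feel fine'\n",
    "No acute distress.\n",
    "Pt's mother present.\n",
    "\n",
]
n = count_lines_ending_with_apostrophe(sample)
```

For a real file, the same function can be passed an open file handle, since iterating over a text file yields its lines.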

-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On Behalf Of 
Tim Miller
Sent: Monday, August 26, 2013 11:12 AM
To: [email protected]
Subject: Re: apostrophe and sentence detector


On 08/26/2013 12:05 PM, Masanz, James J. wrote:
The recently rebuilt sentence detector (currently in trunk and the 3.1.0 
branch) is sometimes taking the apostrophe as a sentence break where the 
ctakes-3.0.0-incubating model didn't.

The training data used for the recently rebuilt model contains only 7 
lines that end with an apostrophe (single quote).
Do you mean 7 sentences that end in a single apostrophe or 7 lines? The
sentence detector will currently break on newlines no matter what, so
the important number is how many sentences end mid-line with an
apostrophe, right?
Tim
