The corpus used to train the cTAKES sentence detector combines Mayo Clinic 
clinical notes that were manually separated into sentences with the 
Penn Treebank (Wall Street Journal).

-- James

-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On Behalf Of 
John Green
Sent: Monday, August 26, 2013 11:46 AM
To: [email protected]
Subject: Re: apostrophe and sentence detector

Just out of curiosity, how was the training data originally built? I mean, who 
separated the lines? By hand? Regex?

Question two: has anyone tried adding Project Gutenberg to the training data 
for things like sentence detection? There is a wide variety of punctuation 
from the years when a lot of those books were written.

Trying to piece together how it all works,

JG

On Mon, Aug 26, 2013 at 12:35 PM, Tim Miller
<[email protected]> wrote:

> Ah, so we might suspect that some of those 7 lines in the file were 
> indeed followed by newlines in the original training data. In the 
> absence of more/better training data which would help us learn this I 
> think it would be reasonable to restore the list of sentence-breaking 
> characters to not include apostrophe. Seems like it is rare for a 
> sentence to end on it, and my preference is to accidentally call 2 
> sentences one sentence, rather than splitting one sentence in the 
> middle. I think it's probably better for downstream processing.
> Just my .02,
> Tim
> On 08/26/2013 12:29 PM, Masanz, James J. wrote:
>> The training data is one sentence per line.
>> That's how you feed data to the sentence detector.
>>
>> -----Original Message-----
>> From: [email protected] 
>> [mailto:[email protected]] On Behalf 
>> Of Tim Miller
>> Sent: Monday, August 26, 2013 11:12 AM
>> To: [email protected]
>> Subject: Re: apostrophe and sentence detector
>>
>>
>> On 08/26/2013 12:05 PM, Masanz, James J. wrote:
>>> The recently rebuilt sentence detector (currently in trunk and the 3.1.0 
>>> branch) is sometimes taking the apostrophe as a sentence break where the 
>>> ctakes-3.0.0-incubating model didn't.
>>>
>>> The training data used for the recently rebuilt model contains only 7 
>>> lines that end with an apostrophe (single quote).
>> Do you mean 7 sentences that end in a single apostrophe or 7 lines? The
>> sentence detector will currently break on newlines no matter what, so
>> the important number is how many sentences end mid-line with an
>> apostrophe, right?
>> Tim
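
The count discussed in this thread (lines of training data that end with an
apostrophe) could be checked with something like the sketch below. It assumes
the training data holds one sentence per line, as stated above. The function
name and the sample sentences are hypothetical and are not taken from cTAKES
or its training corpus.

```python
def count_lines_ending_with(lines, char="'"):
    """Count non-empty lines whose last non-whitespace character is `char`."""
    count = 0
    for line in lines:
        stripped = line.rstrip()  # ignore trailing spaces and the newline
        if stripped and stripped.endswith(char):
            count += 1
    return count


# Hypothetical sample in one-sentence-per-line form (not real training data).
sample = [
    "The patient denies pain.",
    "He described it as 'sharp'",
    "",
    "Follow up in two weeks.",
    "Mother's maiden name unknown'  ",
]

apostrophe_final = count_lines_ending_with(sample)
```

With a real file, the same function could be called on the open file object,
since iterating over a text file yields one line at a time. Note that this
only counts line-final apostrophes; it says nothing about apostrophes that
end a sentence mid-line, which is the case Tim asks about.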
