Karthik, well said. There are many differences. I wonder, what do you think
about the logical division of the two sets? Do they share a domain? Is one a
subset of the other? I would propose that it wouldn't be unreasonable to think
of clinical notes as a subset of the English language. It seems to me that
Project Gutenberg is a fairly representative sample of that English language,
so the superset could contribute to the recognition of the subset.
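
One rough way to test the subset idea would be to measure how much of the
clinical vocabulary also shows up in a general-English corpus. A minimal
sketch, assuming both corpora are available as plain strings; the function
name, the crude regex tokenizer, and the toy texts are made up for
illustration and are not part of cTAKES:

```python
import re


def vocab_coverage(subset_text, superset_text):
    """Fraction of distinct word types in subset_text also found in superset_text."""
    def word_types(text):
        # Crude tokenizer (an assumption): lowercase runs of ASCII letters only.
        return set(re.findall(r"[a-z]+", text.lower()))

    sub = word_types(subset_text)
    sup = word_types(superset_text)
    if not sub:
        return 0.0
    return len(sub & sup) / len(sub)


# Toy stand-ins for a clinical note and a general-English text (hypothetical).
clinical = "Pt denies chest pain. No SOB."
general = "The patient denies any pain in the chest."
print(vocab_coverage(clinical, general))
```

A coverage well below 1.0 on real data would suggest the clinical notes carry
vocabulary (abbreviations, drug names) that the general corpus cannot teach.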
JG
On Mon, Aug 26, 2013 at 2:07 PM, Masanz, James J. <[email protected]>
wrote:
> The corpus used for cTAKES sentence detection is a combination of some Mayo
> Clinic clinical notes that were manually separated into sentences and the
> Penn Treebank (Wall Street Journal).
> -- James
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of
> John Green
> Sent: Monday, August 26, 2013 11:46 AM
> To: [email protected]
> Subject: Re: apostrophe and sentence detector
> Just out of curiosity, how was the training data originally built? I mean,
> who separated the lines? By hand? Regex?
>
>
> Question two: has anyone tried adding Project Gutenberg texts to the
> training data for things like sentence detection? There is a wide variety of
> punctuation across the years in which many of those books were written.
>
>
> Trying to piece together how it all works,
> JG
>
>
> On Mon, Aug 26, 2013 at 12:35 PM, Tim Miller
> <[email protected]> wrote:
>> Ah, so we might suspect that some of those 7 lines in the file were
>> indeed followed by newlines in the original training data. In the
>> absence of more/better training data which would help us learn this I
>> think it would be reasonable to restore the list of sentence-breaking
>> characters to not include apostrophe. Seems like it is rare for a
>> sentence to end on it, and my preference is to accidentally call 2
>> sentences one sentence, rather than splitting one sentence in the
>> middle. I think it's probably better for downstream processing.
>> Just my .02,
>> Tim
>> On 08/26/2013 12:29 PM, Masanz, James J. wrote:
>>> The training data is one sentence per line.
>>> That's how you feed data to the sentence detector.
>>>
>>> -----Original Message-----
>>> From: [email protected]
>>> [mailto:[email protected]] On Behalf
>>> Of Tim Miller
>>> Sent: Monday, August 26, 2013 11:12 AM
>>> To: [email protected]
>>> Subject: Re: apostrophe and sentence detector
>>>
>>>
>>> On 08/26/2013 12:05 PM, Masanz, James J. wrote:
>>>> The recently rebuilt sentence detector (currently in trunk and the 3.1.0
>>>> branch) is sometimes taking the apostrophe as a sentence break where the
>>>> ctakes-3.0.0-incubating model didn't.
>>>>
>>>> The training data used for the recently rebuilt model contains only 7
>>>> lines that end with an apostrophe (single quote).
>>> Do you mean 7 sentences that end in a single apostrophe or 7 lines? The
>>> sentence detector will currently break on newlines no matter what, so
>>> the important number is how many sentences end mid-line with an
>>> apostrophe, right?
>>> Tim
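
The count discussed above can be reproduced with a few lines of code. A
minimal sketch, assuming the training data is plain text with one sentence per
line (as James describes); the function name and the sample sentences are
hypothetical, and the typographic apostrophe is included only as a guess at
what the data might contain:

```python
def count_apostrophe_endings(lines):
    """Count lines whose last non-whitespace character is an apostrophe."""
    # Straight single quote plus the typographic right single quote (assumption).
    apostrophes = ("'", "\u2019")
    return sum(1 for line in lines if line.rstrip().endswith(apostrophes))


# Made-up sample lines in the one-sentence-per-line training format.
sample = [
    "The patient said 'I feel fine.'",
    "Pt is a 54 y/o male.",
    "Follow up with the Smiths'",
]
print(count_apostrophe_endings(sample))
```

The function accepts any iterable of strings, so an open file handle can be
passed in place of the list. Because each line holds exactly one sentence,
this counts sentences that end with an apostrophe; it says nothing about
whether those sentences were followed by a newline in the original notes,
which is the distinction Tim is asking about.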