Re: OpenNLP Sentence Detector: EOS Characters

Joern Kottmann Thu, 09 Feb 2012 01:11:17 -0800

We already have a properties file inside the model. It wouldn't be a
difficult fix to add a property to it that stores the EOS characters
used during training.
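
A minimal sketch of that idea, using plain java.util.Properties rather than
the actual OpenNLP model classes. The property key "eosCharacters", the class
name and the fallback set ".!?" are assumptions made for illustration only;
none of them are taken from the OpenNLP code base. The point is that the
detector reads back the characters used in training, so a ":" added at
training time is also scanned as a candidate at detection time.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

class EosPropertySketch {

    // Hypothetical property key, not an existing OpenNLP manifest entry.
    static final String EOS_KEY = "eosCharacters";

    // Training side: write the EOS characters into a properties text blob.
    static String store(char[] eosChars) {
        Properties props = new Properties();
        props.setProperty(EOS_KEY, new String(eosChars));
        StringWriter out = new StringWriter();
        try {
            props.store(out, null);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    // Detection side: read the characters back; fall back to an assumed
    // default set when the property is missing (e.g. an older model).
    static char[] load(String serialized) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(serialized));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty(EOS_KEY, ".!?").toCharArray();
    }

    // Collect the offsets of all characters that are in the EOS set.
    // Only these offsets would ever be handed to the classifier.
    static List<Integer> candidates(String text, char[] eosChars) {
        String eos = new String(eosChars);
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            if (eos.indexOf(text.charAt(i)) >= 0) {
                positions.add(i);
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        String text = "Diagnosis: none. Done!";
        char[] trained = load(store(new char[] {'.', '!', '?', ':'}));
        // With the stored set, the ":" offset is among the candidates.
        System.out.println(candidates(text, trained));
        // With the assumed defaults only, the ":" is never considered.
        System.out.println(candidates(text, ".!?".toCharArray()));
    }
}
```

With something like this, training and detection could not drift apart,
because the detector would take its candidate set from the model itself.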


Jörn

On Thu, Feb 9, 2012 at 10:06 AM, Katrin Tomanek
<katrin.toma...@averbis.com> wrote:

> Hi Jörn,
>
> thanks for this explanation.
> So what you are saying means that the context generator and the EOS scanner
> are not stored in the model, right?
>
> I had assumed this... other ML toolkits, e.g. Mallet (which uses "Pipe"
> logic where OpenNLP uses event streams), actually do this.
>
> Maybe this would also be a good improvement...
>
> Best
> Katrin
>
> On 02/09/2012 09:56 AM, Joern Kottmann wrote:
>
>> When you only do it during training then it will not consider ":" as
>> a possible split during detection. That explains your drop in accuracy.
>>
>> It looks like it is not possible to modify the EOS characters properly
>> with the current version. I suggest that you check out the source code
>> and then change the defaultEosCharacters array in
>> opennlp.tools.sentdetect.Factory.
>> With that you are able to do your test and get it working for now.
>>
>> Anyway we should have an easy way to specify the EOS characters without
>> implementing a custom Factory class.
>>
>> Please open a Jira issue to improve this.
>>
>> Jörn
>>
>> On Thu, Feb 9, 2012 at 9:21 AM, Katrin Tomanek
>> <katrin.toma...@averbis.com> wrote:
>>
>>> Hi Jörn,
>>>
>>> I only modified the training process.
>>>
>>> However, when I check the predictions it turns out that the model never
>>> learns to split at ":" positions.
>>>
>>> Shouldn't it be enough to modify the DefaultSDContextGenerator and the
>>> DefaultEndOfSentenceScanner so that they know about ":" as an EOS
>>> character? Or are there other places where ":" should be added?
>>>
>>> Best
>>> Katrin
>>>
>>>
>>>
>>> On 02/09/2012 09:18 AM, Joern Kottmann wrote:
>>>
>>>> Did you modify the evaluation as well? If you just do it during training,
>>>> the evaluator will not be able to consider ":" as an EOS character.
>>>>
>>>> To me it sounds like it fails to split on the ":" somewhere.
>>>>
>>>> The sentence detector uses a maxent model to classify every EOS
>>>> character
>>>> as either a SPLIT or NO_SPLIT.
>>>>
>>>> Jörn
>>>>
>>>> On Thu, Feb 9, 2012 at 8:59 AM, Katrin Tomanek
>>>> <katrin.toma...@averbis.com> wrote:
>>>>
>>>>
>>>>> Hi Willian,
>>>>>
>>>>> I am currently using opennlp-1.5.2 and try to use it as an API, i.e.
>>>>> not to modify its code but to write my own code around it. However,
>>>>> what I described below (with the SDEventStream) amounts to the same
>>>>> thing you are describing: I am changing the set of EOS characters.
>>>>>
>>>>> I am just wondering why adding ":" as an EOS character decreases the
>>>>> results (dropping from ~80F to 45F in sentence splitting, and ":" is
>>>>> always a sentence boundary symbol in my data!).
>>>>>
>>>>> Looks like I need to debug a little more what's happening in the
>>>>> DefaultSDContextGenerator.
>>>>>
>>>>>
>>>>>
>>>>
>>> --
>>> Dr. Katrin Tomanek
>>> Averbis GmbH
>>> Tennenbacher Strasse 11
>>> D-79106 Freiburg
>>>
>>> Fon: +49 (0) 761 - 203 97696
>>> Fax: +49 (0) 761 - 203 97694
>>> E-Mail: katrin.toma...@averbis.com
>>>
>>> Geschäftsführer: Dr. med. Philipp Daumke, Dr. Kornél Markó
>>> Sitz der Gesellschaft: Freiburg i. Br.
>>> AG Freiburg i. Br., HRB 701080
>>>
>>>
>>
>
