Thanks a lot everyone, it's working for me now.

On , Jörn Kottmann <kottm...@gmail.com> wrote:
On 11/16/11 8:26 PM, lishe...@gmail.com wrote:
That's how I thought the "Learnable Tokenizer" works, but it doesn't work for some reason.
What I did:
1) Edit a test.train file with the following content:
boysandgirls.
boysandgirls.
boysandgirls.
... repeat 30 times ...
2) Train a model with:
bin/opennlp TokenizerTrainer -encoding UTF-8 -lang en -alphaNumOpt -data test.train -model test.bin
3) Evaluate the model with:
echo "boysandgirls" | bin/opennlp TokenizerME test.bin
The result I got:
------------------------------------------------------------------------
Loading Tokenizer model ... done (0.019s)
boysandgirls
Average: 500.0 sent/s
Total: 1 sent
Runtime: 0.0020s
------------------------------------------------------------------------
So the text is still not segmented into words.
Any thoughts?
You shouldn't repeat your training data, since you don't
add any information by doing that. Instead, you should either manually
label such data for at least a few hundred domains, construct
it out of tokenized text, or try an approach as suggested by Aliaksandr.
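Constructing training data out of tokenized text can be sketched roughly like this in Python. This assumes the usual OpenNLP tokenizer training format, where each sentence is one line and token boundaries that have no whitespace in the raw text are marked with a <SPLIT> tag; the helper name is mine, not part of OpenNLP:

```python
def to_opennlp_train(raw: str, tokens: list[str]) -> str:
    """Rebuild the raw sentence, inserting <SPLIT> wherever two
    consecutive tokens touch without intervening whitespace
    (assumed OpenNLP tokenizer training format)."""
    out = []
    pos = 0
    for i, tok in enumerate(tokens):
        start = raw.index(tok, pos)  # locate token in the raw text
        if i > 0:
            # adjacent tokens -> <SPLIT>; otherwise keep one space
            out.append("<SPLIT>" if start == pos else " ")
        out.append(tok)
        pos = start + len(tok)
    return "".join(out)
```

For example, the raw sentence "He said." with tokens ["He", "said", "."] would become the training line "He said<SPLIT>.".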
The reason it doesn't split "boysandgirls" is that you enabled the alphanumeric
optimization, which is a performance optimization: it skips the processing
of whitespace-separated strings that contain only letters.
If you disable it, the model will decide for each character in your test string
whether it is a valid split point or not (except the last one).
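The behavior described above can be sketched in Python. This is a minimal stand-in, not the OpenNLP implementation: the `is_split` predicate plays the role of the trained maxent model, and the exact alphanumeric pattern is an assumption:

```python
import re

# Assumed pattern for the alphanumeric optimization.
ALPHANUMERIC = re.compile(r"^[A-Za-z0-9]+$")

def tokenize_chunk(chunk, is_split, use_alnum_opt):
    """Split one whitespace-separated chunk into tokens.

    With the optimization on, a purely alphanumeric chunk is returned
    untouched. Otherwise the model (here: the is_split predicate) is
    asked at every internal position whether it is a split point --
    every character except the last can start a new token.
    """
    if use_alnum_opt and ALPHANUMERIC.match(chunk):
        return [chunk]
    tokens, start = [], 0
    for i in range(1, len(chunk)):
        if is_split(chunk, i):
            tokens.append(chunk[start:i])
            start = i
    tokens.append(chunk[start:])
    return tokens
```

With the optimization enabled, "boysandgirls" matches the pattern and is never shown to the model at all, which is exactly why the test above produced a single token.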
Jörn