Thanks a lot everyone, it's working for me now.

Jörn Kottmann <kottm...@gmail.com> wrote:
On 11/16/11 8:26 PM, lishe...@gmail.com wrote:


That's how I thought the "Learnable Tokenizer" works, but for some reason it doesn't.

What I did:

1) Create a test.train file with the following content:

boysandgirls.
boysandgirls.
boysandgirls.
... repeat 30 times ...



2) Train a model with:

bin/opennlp TokenizerTrainer -encoding UTF-8 -lang en -alphaNumOpt -data test.train -model test.bin



3) Evaluate the model with:

echo "boysandgirls" | bin/opennlp TokenizerME test.bin



The result I got:

------------------------------------------------------------------------
Loading Tokenizer model ... done (0.019s)
boysandgirls

Average: 500.0 sent/s
Total: 1 sent
Runtime: 0.0020s
------------------------------------------------------------------------



So the text is still not segmented into words.

Any thoughts?




You shouldn't repeat your training data, since you don't add any information by doing that. Instead you should either manually label such data for at least a few hundred domains, construct it out of tokenized text, or try an approach like the one suggested by Aliaksandr.
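For example, if I remember the tokenizer training format correctly, a manually labeled line marks each split with the <SPLIT> tag, so for your case it would look like:

boys<SPLIT>and<SPLIT>girls<SPLIT>.

Lines like this, over many different domains, actually give the trainer something to learn from.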



The reason it doesn't split "boysandgirls" is that you enabled the alphanumeric optimization, which is a performance optimization: it skips the processing of whitespace-separated strings that contain only letters.



If you disable it, the model will decide for each character in your test string whether or not it is a valid split point (except for the last one).
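For example, reusing your own commands, retraining with the flag simply left out (assuming it then defaults to off) and re-testing would be:

bin/opennlp TokenizerTrainer -encoding UTF-8 -lang en -data test.train -model test.bin

echo "boysandgirls" | bin/opennlp TokenizerME test.bin

Keep in mind that with unlabeled, repeated training data the model still has nothing to learn the splits from, as mentioned above.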



Jörn



