Thanks a lot everyone, it's working for me now.

On , Jörn Kottmann <kottm...@gmail.com> wrote:
On 11/16/11 8:26 PM, lishe...@gmail.com wrote:
That's how I thought the "Learnable Tokenizer" works, but it doesn't work for some reason.
What I did:
1) Edit a test.train file with the following content:
boysandgirls.
boysandgirls.
boysandgirls.
... repeat 30 times ...
2) Train a model with:
bin/opennlp TokenizerTrainer -encoding UTF-8 -lang en -alphaNumOpt -data test.train -model test.bin
3) Evaluate the model with:
echo "boysandgirls" | bin/opennlp TokenizerME test.bin
The result I got:
------------------------------------------------------------------------
Loading Tokenizer model ... done (0.019s)
boysandgirls
Average: 500.0 sent/s
Total: 1 sent
Runtime: 0.0020s
------------------------------------------------------------------------
So the text is still not segmented into words.
Any thoughts?
You shouldn't repeat your training data, since you don't
add any information by doing that. Instead, you should either manually
label such data for at least a few hundred domains, construct
it out of tokenized text, or try an approach as suggested by Aliaksandr.
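Constructing training data out of tokenized text can be sketched roughly like this in Python. This assumes the usual OpenNLP tokenizer training format, where each sentence is one line and token boundaries that have no whitespace in the raw text are marked with a <SPLIT> tag; the helper name is mine, not part of OpenNLP:

```python
def to_opennlp_train(raw: str, tokens: list[str]) -> str:
    """Rebuild the raw sentence, inserting <SPLIT> wherever two
    consecutive tokens touch without intervening whitespace
    (assumed OpenNLP tokenizer training format)."""
    out = []
    pos = 0
    for i, tok in enumerate(tokens):
        start = raw.index(tok, pos)  # locate token in the raw text
        if i > 0:
            # adjacent tokens -> <SPLIT>; otherwise keep one space
            out.append("<SPLIT>" if start == pos else " ")
        out.append(tok)
        pos = start + len(tok)
    return "".join(out)
```

For example, the raw sentence "He said." with tokens ["He", "said", "."] would become the training line "He said<SPLIT>.".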
The reason it doesn't split "boysandgirls" is that you enabled the alphanumeric
optimization, which is a performance optimization: it skips the processing
of whitespace-separated strings that contain only letters.
If you disable it, the model will decide for each character in your test string
whether it is a valid split point or not (except the last one).
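The behavior described above can be sketched in Python. This is a minimal stand-in, not the OpenNLP implementation: the `is_split` predicate plays the role of the trained maxent model, and the exact alphanumeric pattern is an assumption:

```python
import re

# Assumed pattern for the alphanumeric optimization.
ALPHANUMERIC = re.compile(r"^[A-Za-z0-9]+$")

def tokenize_chunk(chunk, is_split, use_alnum_opt):
    """Split one whitespace-separated chunk into tokens.

    With the optimization on, a purely alphanumeric chunk is returned
    untouched. Otherwise the model (here: the is_split predicate) is
    asked at every internal position whether it is a split point --
    every character except the last can start a new token.
    """
    if use_alnum_opt and ALPHANUMERIC.match(chunk):
        return [chunk]
    tokens, start = [], 0
    for i in range(1, len(chunk)):
        if is_split(chunk, i):
            tokens.append(chunk[start:i])
            start = i
    tokens.append(chunk[start:])
    return tokens
```

With the optimization enabled, "boysandgirls" matches the pattern and is never shown to the model at all, which is exactly why the test above produced a single token.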
Jörn