That's how I thought the "Learnable Tokenizer" works, but it doesn't
work for some reason.
What I did:
1) edit a test.train file with following content:
boys<SPLIT>and<SPLIT>girls.
boys<SPLIT>and<SPLIT>girls.
boys<SPLIT>and<SPLIT>girls.
... repeat 30 times ...
2) train a model by:
bin/opennlp TokenizerTrainer -encoding UTF-8 -lang en -alphaNumOpt -data test.train -model test.bin
3) evaluate model by:
echo "boysandgirls" | bin/opennlp TokenizerME test.bin
The result I got:
------------------------------------------------------------------------
Loading Tokenizer model ... done (0.019s)
boysandgirls
Average: 500.0 sent/s
Total: 1 sent
Runtime: 0.0020s
------------------------------------------------------------------------
So the text is still not segmented into words.
Any thoughts?
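For reference, here is a minimal sketch of the same training and tokenization
steps through the OpenNLP Java API (1.5-era signatures assumed; later
versions differ, and the file names match the commands above). Note that the
boolean flag mirrors -alphaNumOpt: with the alphanumeric optimization
enabled, spans consisting only of alphanumeric characters are skipped as
split candidates, which may be worth double-checking for input like
"boysandgirls".

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;

import opennlp.tools.tokenize.TokenSample;
import opennlp.tools.tokenize.TokenSampleStream;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

public class TrainDomainTokenizer {
    public static void main(String[] args) throws IOException {
        // Read the <SPLIT>-annotated training file, one sample per line.
        ObjectStream<String> lines = new PlainTextByLineStream(
                new FileInputStream("test.train"), Charset.forName("UTF-8"));
        ObjectStream<TokenSample> samples = new TokenSampleStream(lines);

        // Third argument mirrors -alphaNumOpt; with the optimization on,
        // purely alphanumeric spans are not considered for splitting, so
        // "boysandgirls" would stay in one piece.
        TokenizerModel model = TokenizerME.train("en", samples, false);
        samples.close();

        FileOutputStream modelOut = new FileOutputStream("test.bin");
        model.serialize(modelOut);
        modelOut.close();

        // Tokenize the concatenated string with the freshly trained model.
        TokenizerME tokenizer = new TokenizerME(model);
        for (String token : tokenizer.tokenize("boysandgirls")) {
            System.out.println(token);
        }
    }
}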
On , Jörn Kottmann <kottm...@gmail.com> wrote:
The spaces are only used to cut words apart. He has already
done that by looking at one domain at a time.
To go back to his sample:
boysandgirls.com
This you can easily turn into training data like this:
boysandgirls<SPLIT>.com
I would try to get a good amount of English text,
perform tokenization on it, and then just assume every
token is written together without a space in between.
Then you should be able to generate training strings like
the one above. The TLD can easily be attached randomly.
I guess that might already work well.
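If it helps, here is a rough sketch of that generation step in plain Java
(the input/output file names and the TLD list are placeholders for
illustration):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Random;

public class GenerateSplitData {
    // Assumed sample TLD list, attached randomly as suggested above.
    private static final String[] TLDS = {".com", ".net", ".org"};

    public static void main(String[] args) throws IOException {
        Random rnd = new Random();
        BufferedReader in = new BufferedReader(new FileReader("tokenized.txt"));
        PrintWriter out = new PrintWriter(new FileWriter("domains.train"));
        String line;
        while ((line = in.readLine()) != null) {
            String[] tokens = line.trim().toLowerCase().split("\\s+");
            if (tokens.length == 0 || tokens[0].isEmpty()) {
                continue;
            }
            // Write the tokens back without spaces, marking every boundary
            // with <SPLIT>, and attach a random TLD as a final split point.
            StringBuilder sb = new StringBuilder(tokens[0]);
            for (int i = 1; i < tokens.length; i++) {
                sb.append("<SPLIT>").append(tokens[i]);
            }
            sb.append("<SPLIT>").append(TLDS[rnd.nextInt(TLDS.length)]);
            out.println(sb);
        }
        in.close();
        out.close();
    }
}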
To evaluate it, you should make a file with real domains
and split them manually. The tokenizer has an evaluator
which can calculate for you how accurate it is.
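If memory serves, the CLI evaluator call looks something like this (the
evaluation file name is assumed):

bin/opennlp TokenizerMEEvaluator -encoding UTF-8 -model test.bin -data domains.eval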
Hope this helps,
Jörn
On 11/16/11 8:02 PM, Aliaksandr Autayeu wrote:
Hi Ryan,
The learnable tokenizer is trained on standard text, where words are separated
by a fair amount of spaces. Your data looks different, and one way to tackle
it is to tag a fair amount of samples, creating your own corpus, and then
train a model on it. Tagging might take some time, though. Another approach
might be to use a dictionary, like WordNet, and look up potential tokens
there. A fairly simple approach might be to start from an empty string,
add to it char by char, and look the result up in WordNet. If it returns
something, make that string a token and start again from an empty string.
The suffixes (.com, .net, etc.) are well known and can be cut off. With this
approach you'll encounter difficulties with something like "hotelchain" ->
"hot" is a word and is present in WordNet. Well, these might not be the only
approaches out there; this is just what came to mind quickly.
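A minimal sketch of that char-by-char lookup, with a plain word set standing
in for a real WordNet call (the word list is only for illustration):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class GreedyLookupSegmenter {
    // Grow a candidate char by char and cut as soon as the dictionary
    // recognizes it, i.e. the shortest match wins.
    public static List<String> segment(String text, Set<String> dict) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            current.append(c);
            if (dict.contains(current.toString())) {
                tokens.add(current.toString());
                current.setLength(0); // start again from the empty string
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString()); // unmatched leftover
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<String>(Arrays.asList(
                "boys", "and", "girls", "hot", "hotel", "chain"));
        System.out.println(segment("boysandgirls", dict)); // [boys, and, girls]
        // The pitfall mentioned above: "hot" is cut off before "hotel" can
        // ever match, leaving an unmatched remainder.
        System.out.println(segment("hotelchain", dict));   // [hot, elchain]
    }
}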
Aliaksandr
On Wed, Nov 16, 2011 at 7:44 PM, Ryan L. <sunlishe...@gmail.com> wrote:
Hi all,
I'm facing a problem splitting concatenated English text, more
specifically, domain names.
For example:
boysandgirls.com -> boy(s)|and|girl(s)|.com
haveaniceday.net -> have|a|nice|day|.net
Can I use OpenNLP to do this? I checked the OpenNLP documentation and
it looks like the "Learnable Tokenizer" is promising, but I couldn't get it
to work.
Any help is appreciated.