That's how I thought the "Learnable Tokenizer" works, but it doesn't work for some reason.
What I did:
1) Create a test.train file with the following content:
boys<SPLIT>and<SPLIT>girls.
boys<SPLIT>and<SPLIT>girls.
boys<SPLIT>and<SPLIT>girls.
... repeat 30 times ...

2) Train a model with:
bin/opennlp TokenizerTrainer -encoding UTF-8 -lang en -alphaNumOpt -data test.train -model test.bin

3) Evaluate the model with:
echo "boysandgirls" | bin/opennlp TokenizerME test.bin

The result I got:
------------------------------------------------------------------------
Loading Tokenizer model ... done (0.019s)
boysandgirls


Average: 500.0 sent/s
Total: 1 sent
Runtime: 0.0020s
------------------------------------------------------------------------

So the text is still not segmented into words.
Any thoughts?

On , Jörn Kottmann <kottm...@gmail.com> wrote:
The spaces are only used to cut words, and he already did that
by just looking at one domain at a time.

To go back to his sample:

boysandgirls.com

This you can easily turn into training data like this:

boysandgirls<SPLIT>.com



I would try to get a good amount of English text, perform
tokenization on it, and then just assume every token is written
together without a space in between. Then you should be able to
generate training strings like the one above. The TLD can easily
be attached randomly.
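
For example, a minimal sketch of that generation step might look like
the following (english.txt and domains.train are placeholder file
names, and WhitespaceTokenizer is just the simplest tokenizer to use
here):

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

import opennlp.tools.tokenize.WhitespaceTokenizer;

public class MakeDomainTrainingData {

    public static void main(String[] args) throws Exception {
        String[] tlds = {".com", ".net", ".org"}; // attached at random below
        Random rnd = new Random();
        List<String> samples = new ArrayList<>();

        // english.txt (placeholder name): ordinary English text, one line each
        for (String line : Files.readAllLines(Paths.get("english.txt"))) {
            String[] tokens = WhitespaceTokenizer.INSTANCE.tokenize(line.toLowerCase());
            if (tokens.length == 0) {
                continue;
            }
            // Write the tokens back without spaces, marking every cut
            // point with <SPLIT>, and attach a random TLD the same way.
            samples.add(String.join("<SPLIT>", tokens)
                    + "<SPLIT>" + tlds[rnd.nextInt(tlds.length)]);
        }

        Files.write(Paths.get("domains.train"), samples);
    }
}

The resulting domains.train can then be fed to TokenizerTrainer
exactly like test.train above.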



I guess that might already work well.



To evaluate it, you should make a file with real domains and split
them manually. The tokenizer has an evaluator which can calculate
for you how accurate it is.
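
For reference, the CLI evaluator can be run like the other commands
above, something along these lines (test.eval being a hand-split
file in the same <SPLIT> format as the training data; exact flags
may depend on your OpenNLP version):

bin/opennlp TokenizerMEEvaluator -encoding UTF-8 -model test.bin -data test.eval

It should report precision, recall, and F-measure for the model.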



Hope this helps,
Jörn





On 11/16/11 8:02 PM, Aliaksandr Autayeu wrote:


Hi Ryan,



The learnable tokenizer is trained on standard text, where words are
separated by a fair amount of spaces. Your data looks different, and
one way to tackle it is to tag a fair number of samples, creating
your own corpus, and then train a model on it. Tagging might take
some time, though. Another approach might be to use a dictionary,
like WordNet, and look up potential tokens there. A fairly simple
method is to start from an empty string, add characters to it one by
one, and look the string up in WordNet. If the lookup returns
something, make that string a token and start again from an empty
string. The suffixes (.com, .net, etc.) are well known and can be
cut off. With this approach you'll encounter difficulties with
something like "hotelchain", since "hot" is a word and is present in
WordNet. These might not be the only approaches out there; this is
just what came to mind quickly.
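
As an illustration only, here is a minimal sketch of that greedy
lookup, with a plain Set standing in for a real WordNet query (the
class and method names are made up for the example):

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class GreedyDictionarySplitter {

    private final Set<String> dictionary; // stand-in for a WordNet lookup

    public GreedyDictionarySplitter(Set<String> dictionary) {
        this.dictionary = dictionary;
    }

    public List<String> split(String domain) {
        // The well-known suffixes (.com, .net, ...) can be cut off first.
        String name = domain.replaceAll("\\.[a-z]+$", "");
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : name.toCharArray()) {
            current.append(c);
            // As soon as the buffer is a dictionary word, emit it and start
            // over. This is exactly where "hotelchain" goes wrong: "hot" is
            // emitted before "hotel" is ever considered.
            if (dictionary.contains(current.toString())) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString()); // leftover that never matched
        }
        return tokens;
    }
}

For instance, new GreedyDictionarySplitter(Set.of("have", "a",
"nice", "day")).split("haveaniceday.net") returns [have, a, nice,
day], while a full dictionary would indeed trip over "hotelchain".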



Aliaksandr



On Wed, Nov 16, 2011 at 7:44 PM, Ryan L. <sunlishe...@gmail.com> wrote:




Hi all,



I'm facing a problem splitting concatenated English text, more
specifically, domain names. For example:

boysandgirls.com -> boy(s)|and|girl(s)|.com
haveaniceday.net -> have|a|nice|day|.net

Can I use OpenNLP to do this? I checked the OpenNLP documentation
and it looks like the "Learnable Tokenizer" is promising, but I
couldn't get it to work. Any help is appreciated.






