That's how I thought the "Learnable Tokenizer" works, but it doesn't
work for some reason.
What I did:
1) edit a test.train file with following content:
boys<SPLIT>and<SPLIT>girls.
boys<SPLIT>and<SPLIT>girls.
boys<SPLIT>and<SPLIT>girls.
... repeat 30 times ...
2) train a model by:
bin/opennlp TokenizerTrainer -encoding UTF-8 -lang en -alphaNumOpt -data test.train -model test.bin
3) evaluate model by:
echo "boysandgirls" | bin/opennlp TokenizerME test.bin
The result I got:
------------------------------------------------------------------------
Loading Tokenizer model ... done (0.019s)
boysandgirls
Average: 500.0 sent/s
Total: 1 sent
Runtime: 0.0020s
------------------------------------------------------------------------
So the text is still not segmented into words.
Any thoughts?
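For reference, here is a minimal sketch of the same training and tokenization
steps through the OpenNLP Java API (1.5-era signatures assumed; later
versions differ, and the file names match the commands above). Note that the
boolean flag mirrors -alphaNumOpt: with the alphanumeric optimization
enabled, spans consisting only of alphanumeric characters are skipped as
split candidates, which may be worth double-checking for input like
"boysandgirls".

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;

import opennlp.tools.tokenize.TokenSample;
import opennlp.tools.tokenize.TokenSampleStream;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

public class TrainDomainTokenizer {
    public static void main(String[] args) throws IOException {
        // Read the <SPLIT>-annotated training file, one sample per line.
        ObjectStream<String> lines = new PlainTextByLineStream(
                new FileInputStream("test.train"), Charset.forName("UTF-8"));
        ObjectStream<TokenSample> samples = new TokenSampleStream(lines);

        // Third argument mirrors -alphaNumOpt; with the optimization on,
        // purely alphanumeric spans are not considered for splitting, so
        // "boysandgirls" would stay in one piece.
        TokenizerModel model = TokenizerME.train("en", samples, false);
        samples.close();

        FileOutputStream modelOut = new FileOutputStream("test.bin");
        model.serialize(modelOut);
        modelOut.close();

        // Tokenize the concatenated string with the freshly trained model.
        TokenizerME tokenizer = new TokenizerME(model);
        for (String token : tokenizer.tokenize("boysandgirls")) {
            System.out.println(token);
        }
    }
}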
On , Jörn Kottmann <kottm...@gmail.com> wrote:
The spaces are only used to cut words apart. He has already
done that by looking at one domain at a time.
To go back to his sample:
boysandgirls.com
This you can easily turn into training data like this:
boysandgirls<SPLIT>.com
I would try to get a good amount of English text,
perform tokenization on it, and then just assume every
token is written together without a space in between.
Then you should be able to generate training strings like
the one above. The TLD can easily be attached randomly.
I guess that might already work well.
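If it helps, here is a rough sketch of that generation step in plain Java
(the input/output file names and the TLD list are placeholders for
illustration):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Random;

public class GenerateSplitData {
    // Assumed sample TLD list, attached randomly as suggested above.
    private static final String[] TLDS = {".com", ".net", ".org"};

    public static void main(String[] args) throws IOException {
        Random rnd = new Random();
        BufferedReader in = new BufferedReader(new FileReader("tokenized.txt"));
        PrintWriter out = new PrintWriter(new FileWriter("domains.train"));
        String line;
        while ((line = in.readLine()) != null) {
            String[] tokens = line.trim().toLowerCase().split("\\s+");
            if (tokens.length == 0 || tokens[0].isEmpty()) {
                continue;
            }
            // Write the tokens back without spaces, marking every boundary
            // with <SPLIT>, and attach a random TLD as a final split point.
            StringBuilder sb = new StringBuilder(tokens[0]);
            for (int i = 1; i < tokens.length; i++) {
                sb.append("<SPLIT>").append(tokens[i]);
            }
            sb.append("<SPLIT>").append(TLDS[rnd.nextInt(TLDS.length)]);
            out.println(sb);
        }
        in.close();
        out.close();
    }
}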
To evaluate it, you should make a file with real domains
and split them manually. The tokenizer has an evaluator
which can calculate for you how accurate it is.
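If memory serves, the CLI evaluator call looks something like this (the
evaluation file name is assumed):

bin/opennlp TokenizerMEEvaluator -encoding UTF-8 -model test.bin -data domains.eval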
Hope this helps,
Jörn
On 11/16/11 8:02 PM, Aliaksandr Autayeu wrote:
Hi Ryan,
The learnable tokenizer is trained on standard text, where words are separated
by a fair amount of spaces. Your data looks different, and one way to tackle
it is to tag a fair amount of samples, creating your own corpus, and then
train a model on it. Tagging might take some time, though. Another approach
might be to use a dictionary, like WordNet, and look up potential tokens
there. A fairly simple approach might be to start from an empty string,
add to it char by char, and look the result up in WordNet. If it returns
something, make that string a token and start again from an empty string.
The suffixes (.com, .net, etc.) are well known and can be cut off. With this
approach you'll encounter difficulties with something like "hotelchain" ->
"hot" is a word and is present in WordNet. Well, these might not be the only
approaches out there; this is just what came to mind quickly.
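A minimal sketch of that char-by-char lookup, with a plain word set standing
in for a real WordNet call (the word list is only for illustration):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class GreedyLookupSegmenter {
    // Grow a candidate char by char and cut as soon as the dictionary
    // recognizes it, i.e. the shortest match wins.
    public static List<String> segment(String text, Set<String> dict) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            current.append(c);
            if (dict.contains(current.toString())) {
                tokens.add(current.toString());
                current.setLength(0); // start again from the empty string
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString()); // unmatched leftover
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<String>(Arrays.asList(
                "boys", "and", "girls", "hot", "hotel", "chain"));
        System.out.println(segment("boysandgirls", dict)); // [boys, and, girls]
        // The pitfall mentioned above: "hot" is cut off before "hotel" can
        // ever match, leaving an unmatched remainder.
        System.out.println(segment("hotelchain", dict));   // [hot, elchain]
    }
}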
Aliaksandr
On Wed, Nov 16, 2011 at 7:44 PM, Ryan L. <sunlishe...@gmail.com> wrote:
Hi all,
I'm facing a problem splitting concatenated English text, more
specifically, domain names.
For example:
boysandgirls.com -> boy(s)|and|girl(s)|.com
haveaniceday.net -> have|a|nice|day|.net
Can I use OpenNLP to do this? I checked the OpenNLP documentation and
it looks like the "Learnable Tokenizer" is promising, but I couldn't get it
to work.
Any help is appreciated.