Hi Ryan,

The learnable tokenizer is trained on standard text, where words are separated by spaces. Your data looks different, and one way to tackle it is to tag a fair number of samples, creating your own corpus, and then train a model on it. Tagging might take some time, though.

Another approach might be to use a dictionary, like WordNet, and look up potential tokens there. A fairly simple version would be to start from an empty string, add to it character by character, and look the result up in WordNet. If the lookup returns something, make that string a token and start again from an empty string. The suffixes (.com, .net, etc.) are well known and can be cut off first. With this approach you'll run into difficulties with something like "hotelchain": "hot" is a word and is present in WordNet, so the greedy match stops too early.

These might not be the only approaches out there; this is just what came to mind quickly.
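A quick sketch of that greedy char-by-char lookup, with a small hand-made word set standing in for the WordNet query (in practice you would call something like nltk.corpus.wordnet instead; the word list and function names here are just for illustration):

```python
# Greedy shortest-match splitter: grow a candidate string one character
# at a time and emit it as a token as soon as the dictionary knows it.
# WORDS is a tiny stand-in for a real WordNet lookup.
KNOWN_SUFFIXES = (".com", ".net", ".org")
WORDS = {"boy", "and", "girl", "have", "a", "nice", "day", "hot", "hotel", "chain"}

def split_domain(domain):
    # Cut the well-known suffix first.
    suffix = ""
    for s in KNOWN_SUFFIXES:
        if domain.endswith(s):
            domain, suffix = domain[: -len(s)], s
            break

    tokens = []
    current = ""
    for ch in domain:
        current += ch
        if current in WORDS:       # "WordNet returned something"
            tokens.append(current)
            current = ""           # start again from the empty string
    if current:                    # leftover that never matched
        tokens.append(current)
    if suffix:
        tokens.append(suffix)
    return tokens

print(split_domain("haveaniceday.net"))  # ['have', 'a', 'nice', 'day', '.net']
print(split_domain("hotelchain.com"))    # ['hot', 'elchain', '.com'] -- the "hot" problem
```

The second call shows exactly the difficulty mentioned above: shortest-match commits to "hot" before "hotel" is ever considered, so a real implementation would want longest-match or backtracking on top of this.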
Aliaksandr

On Wed, Nov 16, 2011 at 7:44 PM, Ryan L. Sun <lishe...@gmail.com> wrote:
> Hi all,
>
> I'm facing a problem to split concatenated English text, more
> specifically, domain name.
> For example:
> boysandgirls.com -> boy(s)|and|girl(s)|.com
> haveaniceday.net -> have|a|nice|day|.net
>
> Can I use opennlp to do this? I checked the opennlp documentation and
> looks like "Learnable Tokenizer" is promising, but i couldn't get it
> to work.
> Any help is appreciated.