Hi Ryan,

The learnable tokenizer is trained on standard text, where words are separated by spaces. Your data looks different, and one way to tackle it is to tag a fair number of samples, creating your own corpus, and then train a model on it. Tagging might take some time, though.

Another approach might be to use a dictionary, like WordNet, and look up potential tokens there. A fairly simple version would be to start from an empty string, add to it character by character, and look the result up in WordNet. If the lookup returns something, make that string a token and start again from an empty string. The suffixes (.com, .net, etc.) are well known and can be cut off first. With this approach you'll run into difficulties with something like "hotelchain": "hot" is a word and is present in WordNet, so the greedy match stops too early.

These might not be the only approaches out there; this is just what came to mind quickly.
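A quick sketch of that greedy char-by-char lookup, with a small hand-made word set standing in for the WordNet query (in practice you would call something like nltk.corpus.wordnet instead; the word list and function names here are just for illustration):

```python
# Greedy shortest-match splitter: grow a candidate string one character
# at a time and emit it as a token as soon as the dictionary knows it.
# WORDS is a tiny stand-in for a real WordNet lookup.
KNOWN_SUFFIXES = (".com", ".net", ".org")
WORDS = {"boy", "and", "girl", "have", "a", "nice", "day", "hot", "hotel", "chain"}

def split_domain(domain):
    # Cut the well-known suffix first.
    suffix = ""
    for s in KNOWN_SUFFIXES:
        if domain.endswith(s):
            domain, suffix = domain[: -len(s)], s
            break

    tokens = []
    current = ""
    for ch in domain:
        current += ch
        if current in WORDS:       # "WordNet returned something"
            tokens.append(current)
            current = ""           # start again from the empty string
    if current:                    # leftover that never matched
        tokens.append(current)
    if suffix:
        tokens.append(suffix)
    return tokens

print(split_domain("haveaniceday.net"))  # ['have', 'a', 'nice', 'day', '.net']
print(split_domain("hotelchain.com"))    # ['hot', 'elchain', '.com'] -- the "hot" problem
```

The second call shows exactly the difficulty mentioned above: shortest-match commits to "hot" before "hotel" is ever considered, so a real implementation would want longest-match or backtracking on top of this.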
Aliaksandr

On Wed, Nov 16, 2011 at 7:44 PM, Ryan L. Sun <lishe...@gmail.com> wrote:
> Hi all,
>
> I'm facing a problem to split concatenated English text, more
> specifically, domain name.
> For example:
> boysandgirls.com -> boy(s)|and|girl(s)|.com
> haveaniceday.net -> have|a|nice|day|.net
>
> Can I use opennlp to do this? I checked the opennlp documentation and
> looks like "Learnable Tokenizer" is promising, but i couldn't get it
> to work.
> Any help is appreciated.