Peter Norvig gave an excellent presentation in which he described one solution
to this problem. You can watch it
(http://videolectures.net/cikm08_norvig_slatuad/), starting from the slide
"Text Data".
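
For illustration, here is a minimal sketch of probabilistic word segmentation, the general kind of approach this problem calls for. It is not code from the talk. The COUNTS table is a made-up toy vocabulary (a real system would load unigram counts from a large corpus), and the unknown-word penalty and the helper names segment and split_domain are assumptions chosen for this example.

```python
import math
from functools import lru_cache

# Hypothetical toy unigram counts, for illustration only.
# A real system would load counts from a large text corpus.
COUNTS = {
    "have": 50, "a": 200, "nice": 20, "day": 40,
    "boys": 10, "boy": 15, "and": 300, "girls": 10, "girl": 15,
}
TOTAL = sum(COUNTS.values())
MAX_WORD_LEN = 20  # assumed upper bound on the length of one word


def log_prob(word):
    """Log-probability of a single word under the unigram model."""
    if word in COUNTS:
        return math.log(COUNTS[word] / TOTAL)
    # Assumed penalty for unknown words: it grows with the word length,
    # so splits made of known words are preferred.
    return math.log(1.0 / (TOTAL * 10 ** len(word)))


@lru_cache(maxsize=None)
def segment(text):
    """Return (score, words) for the most probable split of text."""
    if not text:
        return 0.0, ()
    candidates = []
    for i in range(1, min(len(text), MAX_WORD_LEN) + 1):
        head, tail = text[:i], text[i:]
        tail_score, tail_words = segment(tail)
        candidates.append((log_prob(head) + tail_score, (head,) + tail_words))
    return max(candidates)


def split_domain(domain):
    """Segment the name part of a domain and keep the TLD as the last token."""
    name, dot, tld = domain.rpartition(".")
    if not dot:
        # No dot found: treat the whole string as the name.
        return list(segment(domain.lower())[1])
    return list(segment(name.lower())[1]) + ["." + tld]


print("|".join(split_domain("haveaniceday.net")))  # have|a|nice|day|.net
print("|".join(split_domain("boysandgirls.com")))  # boys|and|girls|.com
```

The memoization (lru_cache) keeps the recursion from re-scoring the same suffix many times. The quality of the splits depends almost entirely on the word counts, so the toy table above only handles the two examples from the question.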
Hope this helps,
Alexandre
On 11-11-16 01:44 PM, Ryan L. Sun wrote:
Hi all,
I'm trying to split concatenated English text, more
specifically, domain names.
For example:
boysandgirls.com -> boy(s)|and|girl(s)|.com
haveaniceday.net -> have|a|nice|day|.net
Can I use OpenNLP to do this? I checked the OpenNLP documentation and
it looks like the "Learnable Tokenizer" is promising, but I couldn't get it
to work.
Any help is appreciated.
--
Alexandre Patry
Research Engineer
http://KeaText.com
Turn your documents into decision tools