Word boundaries

Zbigniew Łukasiak Mon, 26 Mar 2012 02:03:27 -0700

For our spam classifier I need to split the text into words.
Unfortunately the '\b' regex assertion does not yet work for languages
written without spaces (apparently that is covered by Level 3 of
Unicode regex support,
http://unicode.org/reports/tr18/#Tailored_Word_Boundaries), so I need
a custom solution.  This did not seem very difficult: split the text
into blocks of the same Unicode script, then use '\b' for most scripts
and appropriate libraries for the rest (for Chinese, at least, there
are some tokenizers on CPAN).  But:


1. How can I split the text into blocks of the same script?  (Wouldn't a
script-boundary regex property be useful?)  OK, I can always loop over
the characters, look up each one's script and compare it with the
previous one's, i.e. back to C-style programming.  But then there is
still the question of:

2. How can I check what script a character belongs to?  Do I need to
cut and paste all the script ranges from unicode.org into a huge
if-else branch in my program or is there a simpler way?
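
For concreteness, the character loop from point 1 might look roughly
like the sketch below.  It assumes that charscript() from the core
Unicode::UCD module is an acceptable way to answer point 2.  The name
split_by_script is hypothetical, and attaching Common/Inherited
characters (spaces, punctuation, combining marks) to the run in
progress is an assumption rather than a requirement.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charscript);

# Hypothetical helper: split an already-decoded string into runs of
# [script name, text].  Characters whose script is Common or Inherited
# (spaces, punctuation, combining marks) are appended to the current
# run instead of starting a new one (an assumption of this sketch).
sub split_by_script {
    my ($text) = @_;
    my @runs;
    for my $char (split //, $text) {
        # charscript() takes a code point; fall back to 'Unknown'
        # if it returns undef for an unassigned code point.
        my $script = charscript(ord $char) // 'Unknown';
        if (@runs
            && (   $script eq 'Common'
                || $script eq 'Inherited'
                || $script eq $runs[-1][0]))
        {
            $runs[-1][1] .= $char;
        }
        else {
            push @runs, [ $script, $char ];
        }
    }
    return @runs;
}
```

Each run could then be handed to '\b'-based splitting or to a
script-specific tokenizer, depending on its script name.  Looking up
every character this way is probably slow on large inputs, so it is
only meant to show the shape of the loop.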

Thanks in advance,
Zbigniew
