For our spam classifier I need to split text into words. Unfortunately the '\b' regex assertion does not yet work for languages written without spaces (apparently that is covered by Level 3 of Unicode regex support, http://unicode.org/reports/tr18/#Tailored_Word_Boundaries), so I need a custom solution. This did not seem very difficult: split the text into blocks of the same Unicode script, then use '\b' for most scripts and appropriate libraries for the rest (at least for Chinese there are some tokenizers on CPAN). But:
1. How can I split the text into blocks of the same script? (Wouldn't a script-boundary regex property be useful?) OK, I can always loop over the characters, check each one's script and compare it with the previous one's, i.e. back to the C mode of programming. But then there is still the question of:
2. How can I check which script a character belongs to? Do I need to cut and paste all the script ranges from unicode.org into a huge if-else branch in my program, or is there a simpler way?

Thanks in advance,
Zbigniew
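
For reference, the character loop described in point 1 could be sketched roughly as below. This is only a sketch under some assumptions: it relies on charscript() from the core Unicode::UCD module to look up each character's script, so no ranges need to be pasted in, and it assumes the input is already a decoded character string. The name script_runs and the choice to attach Common and Inherited characters (spaces, punctuation, combining marks) to the preceding run are my own illustrative choices, not an established convention.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charscript);

# Hypothetical helper: split a decoded string into runs of one script.
# Returns a list of [script_name, substring] pairs.
sub script_runs {
    my ($text) = @_;
    my @runs;
    for my $ch (split //, $text) {
        # charscript() takes a code point and returns a name such as
        # "Latin", "Han" or "Common"; fall back to "Unknown" if undefined.
        my $script = charscript(ord $ch) // 'Unknown';

        # Assumption: characters shared between scripts stay with the
        # run that precedes them instead of starting a new one.
        my $shared = $script eq 'Common' || $script eq 'Inherited';

        if (@runs && ($shared || $script eq $runs[-1][0])) {
            $runs[-1][1] .= $ch;
        }
        else {
            push @runs, [ $script, $ch ];
        }
    }
    return @runs;
}

# Usage: Latin text followed by Cyrillic and Han characters.
binmode STDOUT, ':encoding(UTF-8)';
for my $run (script_runs("abc \x{43C}\x{438}\x{440}\x{4E2D}\x{6587}")) {
    printf "%s: '%s'\n", @$run;
}
```

Each run could then be handed to '\b'-based splitting or to a script-specific tokenizer. Perl regexes also accept script properties such as \p{Script=Han}, which may be enough when only a few scripts need special handling.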