On Fri, Jul 20, 2012 at 8:56 PM, Дмитрий <dm...@yandex.ru> wrote:
>
> As for the character classes, they can be generated quite easily from the
> UnicodeData.txt[1] file. We can get the general category[2] from this file
> with something like (string->symbol (caddr (string-split line ";"))); then
> we just need to map the categories onto the appropriate character classes
> (e.g. Lu belongs to upper, alpha, alphanum, graph) and merge characters of
> the same category if they have adjacent codes.
> It's quite easy to do. If I'm not lazy I'll do this this weekend.
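
For illustration, the approach described in the quoted message might look
roughly like the sketch below. It is plain Scheme with no egg dependencies
and is not code from irregex or the utf8 egg. The helper names (split-fields,
parse-line, merge-ranges, category->classes) are made up for this example,
and only the Lu mapping comes from the message; the Ll and Nd mappings are
assumed by analogy. Note that UnicodeData.txt separates its fields with
semicolons, and that a case-sensitive reader (as in CHICKEN) is assumed so
that 'Lu matches (string->symbol "Lu").

```scheme
;; Split STR on the character DELIM, keeping empty fields.
(define (split-fields str delim)
  (let loop ((i (- (string-length str) 1))
             (end (string-length str))
             (acc '()))
    (cond ((< i 0) (cons (substring str 0 end) acc))
          ((char=? (string-ref str i) delim)
           (loop (- i 1) i (cons (substring str (+ i 1) end) acc)))
          (else (loop (- i 1) end acc)))))

;; Parse one UnicodeData.txt line into (code-point . category).
;; Field 0 is the hexadecimal code point, field 2 the general category.
(define (parse-line line)
  (let ((fields (split-fields line #\;)))
    (cons (string->number (car fields) 16)
          (string->symbol (caddr fields)))))

;; Merge (code-point . category) entries, sorted by code point, into
;; (category start end) ranges covering adjacent code points.
(define (merge-ranges entries)
  (let loop ((rest entries) (acc '()))
    (cond ((null? rest) (reverse acc))
          ((and (pair? acc)
                (eq? (car (car acc)) (cdr (car rest)))
                (= (+ (caddr (car acc)) 1) (car (car rest))))
           ;; Same category and adjacent code: extend the current range.
           (loop (cdr rest)
                 (cons (list (car (car acc)) (cadr (car acc)) (car (car rest)))
                       (cdr acc))))
          (else
           ;; Otherwise start a new single-character range.
           (loop (cdr rest)
                 (cons (list (cdr (car rest)) (car (car rest)) (car (car rest)))
                       acc))))))

;; Map a general category to character classes (illustrative subset).
(define (category->classes cat)
  (case cat
    ((Lu) '(upper alpha alphanum graph))  ; from the quoted message
    ((Ll) '(lower alpha alphanum graph))  ; assumed by analogy
    ((Nd) '(digit alphanum graph))        ; assumed by analogy
    (else '())))
```

For example, (parse-line "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;")
yields the pair (65 . Lu). A real generator would also have to expand the
"<..., First>" / "<..., Last>" pairs that UnicodeData.txt uses for large
ranges, which this sketch ignores.
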
Full Unicode character classes and case handling are already in the
utf8 egg. These are not yet integrated with irregex because irregex
is written to be portable across any Scheme, and so it uses its own
char-set implementation. When R7RS is released I'll re-package
irregex accordingly.

Unfortunately, while the utf8 char-sets are very compact, the DFA
conversion of large, sparse Unicode char-sets is quite large. I'd
like eventually to make a non-backtracking NFA regex matcher which
only compiles to DFA when you really need the speed.

In the meantime, a fast lookup table for the script of a character
would be nice, and this could be used to tokenize a string of
mixed-language text. I thought I had this and can't seem to find
it anywhere...

-- 
Alex
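
P.S. A rough sketch of the kind of script lookup table mentioned above,
not the lost code. It assumes a sorted vector of (start end script) ranges
searched by binary search, and it works on lists of code points so that it
does not depend on how a given Scheme encodes strings. The names
script-table, script-of and tokenize-by-script are made up for this
example, and the four ranges are hand-picked toy data; a real table would
presumably be generated from the Unicode Character Database (Scripts.txt).

```scheme
;; Toy table: non-overlapping (start end script) ranges sorted by start.
(define script-table
  (vector '(#x41 #x5A latin)       ; A-Z
          '(#x61 #x7A latin)       ; a-z
          '(#x3B1 #x3C9 greek)     ; Greek small alpha to omega
          '(#x410 #x44F cyrillic)  ; Cyrillic capital A to small ya
          ))

;; Binary search for the range containing code point CP.
;; Returns the script symbol, or 'unknown if no range matches.
(define (script-of cp)
  (let loop ((lo 0) (hi (- (vector-length script-table) 1)))
    (if (> lo hi)
        'unknown
        (let* ((mid (quotient (+ lo hi) 2))
               (r (vector-ref script-table mid)))
          (cond ((< cp (car r)) (loop lo (- mid 1)))
                ((> cp (cadr r)) (loop (+ mid 1) hi))
                (else (caddr r)))))))

;; Group a list of code points into runs of the same script.
;; Each run is returned as (script code-point ...).
(define (tokenize-by-script cps)
  (let loop ((rest cps) (run '()) (script #f) (acc '()))
    ;; Push the current run, if any, onto the accumulated result.
    (define (flush)
      (if (null? run) acc (cons (cons script (reverse run)) acc)))
    (cond ((null? rest) (reverse (flush)))
          ((and (pair? run) (eq? (script-of (car rest)) script))
           (loop (cdr rest) (cons (car rest) run) script acc))
          (else
           (loop (cdr rest)
                 (list (car rest))
                 (script-of (car rest))
                 (flush))))))

;; "Hi пр" as code points:
;; (tokenize-by-script '(#x48 #x69 #x20 #x43F #x440))
;; => ((latin 72 105) (unknown 32) (cyrillic 1087 1088))
```

Lookup is O(log n) in the number of ranges. A flat vector indexed by code
point would be faster still, at the cost of a much larger table.
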