André Søreng <[EMAIL PROTECTED]> wrote:
>  Given a string, I want to find all occurrences of
>  certain predefined words in that string. Problem is, the list of
>  words that should be detected can be in the order of thousands.
>
>  With the re module, this can be solved something like this:
>
>  import re
>
>  r = re.compile("word1|word2|word3|.......|wordN")
>  r.findall(some_string)
>
>  Unfortunately, when having more than about 10 000 words in
>  the regexp, I get a regular expression runtime error when
>  trying to execute the findall function (compile works fine, but
>  slow).

I wrote a regexp optimiser for exactly this case.

E.g. a regexp for all 5 letter words starting with re:

$ grep -c '^re' /usr/share/dict/words
2727

$ grep '^re' /usr/share/dict/words | ./words-to-regexp.pl 5

re|re's|reac[ht]|rea(?:d|d[sy]|l|lm|m|ms|p|ps|r|r[ms])|reb(?:el|u[st])|rec(?:ap|ta|ur)|red|red's|red(?:id|o|s)|ree(?:d|ds|dy|f|fs|k|ks|l|ls|ve)|ref|ref's|refe[dr]|ref(?:it|s)|re(?:gal|hab|(?:ig|i)n|ins|lax|lay|lic|ly|mit|nal|nd|nds|new|nt|nts|p)|rep's|rep(?:ay|el|ly|s)|rer(?:an|un)|res(?:et|in|t|ts)|ret(?:ch|ry)|re(?:use|v)|rev's|rev(?:el|s|ue)

As you can see it's not perfect: for example, it emits
ref|ref's|refe[dr]|ref(?:it|s) instead of factoring out the common "ref".

Find it at http://www.craig-wood.com/nick/pub/words-to-regexp.pl

Yes, it's Perl and rather kludgy, but it may give you ideas!

-- 
Nick Craig-Wood <[EMAIL PROTECTED]> -- http://www.craig-wood.com/nick
-- 
http://mail.python.org/mailman/listinfo/python-list
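
For anyone who would rather stay in Python, the prefix-factoring idea can be sketched roughly as below. This is not a port of words-to-regexp.pl; it is a minimal illustration that builds a trie from the word list and emits one alternation per branch point. The names build_trie, trie_to_pattern and words_to_regexp are made up for this example. It also wraps the result in \b...\b so only whole words match, which is an assumption the original one-liner did not make (and which only suits words that start and end with a word character).

```python
import re

def build_trie(words):
    """Build a nested-dict trie; the key '' marks the end of a word."""
    root = {}
    for word in words:
        if not word:          # skip empty strings, they would match everywhere
            continue
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = True
    return root

def trie_to_pattern(node):
    """Return a regexp fragment matching every suffix stored below node."""
    is_end = "" in node
    # One alternative per child character, each followed by its own subtree.
    alternatives = [re.escape(ch) + trie_to_pattern(node[ch])
                    for ch in sorted(k for k in node if k)]
    if not alternatives:
        return ""             # leaf: nothing more to match
    if len(alternatives) == 1 and not is_end:
        return alternatives[0]  # a single mandatory branch needs no group
    group = "(?:" + "|".join(alternatives) + ")"
    # If a word also ends here, everything below is optional.
    return group + "?" if is_end else group

def words_to_regexp(words):
    """Compile a whole-word matcher for words (hypothetical helper)."""
    return re.compile(r"\b(?:" + trie_to_pattern(build_trie(words)) + r")\b")

if __name__ == "__main__":
    r = words_to_regexp(["re", "read", "reads", "real"])
    print(r.pattern)
    print(r.findall("reads real re red"))
```

Because every branch in the trie starts with a different character, the engine never has to try more than one alternative at each position, which should keep both the pattern and the matching work much smaller than a flat word1|word2|...|wordN list. Whether that is enough to avoid the runtime error with 10 000 words is something to measure rather than assume.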