[André Søreng] > Given a string, I want to find all ocurrences of > certain predefined words in that string. Problem is, the list of > words that should be detected can be in the order of thousands. > > With the re module, this can be solved something like this: > > import re > > r = re.compile("word1|word2|word3|.......|wordN") > r.findall(some_string) > > Unfortunately, when having more than about 10 000 words in > the regexp, I get a regular expression runtime error when > trying to execute the findall function (compile works fine, but slow). > > I don't know if using the re module is the right solution here, any > suggestions on alternative solutions or data structures which could > be used to solve the problem?
Put the words you're looking for into a set (or as the keys of a dict in older Pythons; the values in the dict are irrelevant). I don't know what you mean by "word", so write something that breaks your string into what you mean by words. Then: for word in something_that_produces_words(the_string): if word in set_of_words: # found one This takes expected-case time proportional to the number of words in the string, + setup time proportional to the number of "interesting" words (the time needed to create a set or dict from them). 10,000 interesting words won't even start to strain it. -- http://mail.python.org/mailman/listinfo/python-list