Another possibility is to build a suffix array 
(https://en.wikipedia.org/wiki/Suffix_array#Applications) as an index for the 
string, and then search for patterns within the suffix array.  The basic idea 
is that you index the string you're searching over once, and then look up 
each pattern against that index instead of rescanning the string.  
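
To make the idea concrete, here is a minimal sketch in Python.  The names 
build_suffix_array and find_all are mine, not from any library, and the 
construction is the naive sort-all-suffixes approach, so treat it as an 
illustration of the lookup rather than something tuned for millions of rows.

```python
def build_suffix_array(text):
    # Naive construction: sort the suffix start positions by the suffix text.
    # Fine for a demo; real implementations use much faster algorithms.
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_all(text, sa, pattern):
    # Suffixes that start with `pattern` sit next to each other in sorted
    # order, so two binary searches find the whole block of matches.
    m = len(pattern)

    # Lower bound: first suffix whose first m characters are >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose first m characters are > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # Return the match positions in left-to-right order.
    return sorted(sa[start:lo])


text = "banana"
sa = build_suffix_array(text)
positions = find_all(text, sa, "ana")
```

Once the index exists, each lookup costs a couple of binary searches, which 
is where the "index once, search many times" benefit comes from.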

The main complication with this method is the replacement step.  If the text 
produced by one replacement can itself match a different pattern that is 
applied later on, then the result depends on the order of the replacements, 
and you really should use what INADA Naoki suggested.
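
Here is a small sketch of why that matters.  The mapping and the two function 
names are made up for illustration, and the one-pass version uses the 
standard re module on literal keys rather than Aho-Corasick, purely to show 
the difference between replacing pattern by pattern and replacing in a single 
scan.

```python
import re

# Hypothetical mapping: the replacement for "cat" is itself a key.
replacements = {"cat": "dog", "dog": "wolf"}


def replace_sequentially(text, mapping):
    # Apply one replacement after another; later keys also see the text
    # written by earlier replacements.
    for old, new in mapping.items():
        text = text.replace(old, new)
    return text


def replace_in_one_pass(text, mapping):
    # Scan the input once; replacement text is never rescanned.
    # Longer keys go first so they win over shorter keys at the same position.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: mapping[m.group(0)], text)


cascaded = replace_sequentially("cat and dog", replacements)
single = replace_in_one_pass("cat and dog", replacements)
```

With this mapping the sequential version turns both words into "wolf", while 
the single scan gives "dog and wolf".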

Thanks,
Cem Karan

On Feb 25, 2017, at 2:08 PM, INADA Naoki <songofaca...@gmail.com> wrote:

> If you can use third party library, I think you can use Aho-Corasick 
> algorithm.
> 
> https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm
> 
> https://pypi.python.org/pypi/pyahocorasick/
> 
> On Sat, Feb 25, 2017 at 3:54 AM,  <kar6...@gmail.com> wrote:
>> I have a task to search for multiple patterns in an incoming string and 
>> replace each match with its replacement. I'm storing all the patterns as 
>> keys in a dict and the replacements as values. I'm using regex to compile 
>> all the patterns and the sub method on the pattern object for replacement. 
>> But the problem is that I have tens of millions of rows that I need to 
>> check against about 1000 patterns, and this turns out to be a very 
>> expensive operation.
>> 
>> What can be done to optimize it? Also, I have special characters for 
>> matching; where can I specify raw string combinations?
>> 
>> For example, if the search string is not a variable we can say
>> 
>> re.sub(r"\$%^search_text", "replace_text", "some_text") but when I read 
>> from the dict, where should I place the "r" prefix? Unfortunately, putting 
>> it inside the key doesn't work, "r key" like this....
>> 
>> Pseudo code
>> 
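>> import re
>> # Compile once, outside the loop (re.escape assumes the keys are literal text).
>> pattern = re.compile('|'.join(map(re.escape, regex_map)))
>> for string in genobj_of_million_strings:
>>   # Look up the matched text; the match object itself is not a dict key.
>>   print(pattern.sub(lambda m: regex_map[m.group(0)], string))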
>> --
>> https://mail.python.org/mailman/listinfo/python-list
