Patrick Maupin added the comment:

Just to be perfectly clear, this is no exaggeration:

My original file was slightly over 5GB.

I have approximately 1050 bad strings in it, averaging around 11 characters per 
string.

If I split it without capturing those 1050 strings, it takes 3.7 seconds.

If I split it and capture those 1050 strings, it takes 39 seconds.

ISTM that 33 ms per group ((39 s - 3.7 s) / 1050 groups comes to roughly 33 ms 
each) to create a capture group holding a single 11-character string is 
excessive, so there is probably something else going on, such as excessive 
object copying, that just isn't noticeable on a smaller source string.

In the small example I posted, if I replace the line:

data = 100 * (200000 * ' ' + '\n')

with 

data = 1000 * (500000 * ' ' + '\n')

then I get approximately the same 3.7 second vs. 39 second results on that 
(somewhat older) machine.  I didn't start out with those numbers in the example, 
because I thought the problem would still be obvious from the scaled-down 
version.
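For anyone who wants to reproduce this without a 5GB file, the small example 
boils down to something like the sketch below.  (The '\n' separator here stands 
in for the real "bad strings"; the exact pattern in my original post may differ 
slightly.)

    import re
    import time

    # Same shape as the small example: long runs of spaces, with the
    # newline standing in for the "bad string" being split out.
    data = 100 * (200000 * ' ' + '\n')

    t0 = time.time()
    re.split('\n', data)       # split without capturing the separator
    print('no capture:', time.time() - t0)

    t0 = time.time()
    re.split('(\n)', data)     # same split, but capturing the separator
    print('capture:   ', time.time() - t0)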

Obviously, your CPU numbers will be somewhat different.  The question remains, 
though, why it takes around 60 million CPU cycles for each and every returned 
capture group.  Or, to put it another way: why can I skip the capture group 
entirely and grab the same string in pure Python, by using the lengths of the 
intervening strings to locate it in the original data, well over 100 times 
faster than it takes the re module to hand me that group?
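The pure-Python workaround I'm alluding to looks roughly like this (a sketch of 
the idea, not my exact code; the helper name is made up):

    import re

    def split_with_separators(pattern, text):
        # Split *without* a capture group, then walk the pieces and use
        # their lengths to find where each separator starts in the
        # original string, re-matching the pattern there to slice it out.
        regex = re.compile(pattern)
        pieces = regex.split(text)
        result = []
        pos = 0
        for i, piece in enumerate(pieces):
            result.append(piece)
            pos += len(piece)
            if i < len(pieces) - 1:
                sep = regex.match(text, pos).group(0)
                result.append(sep)
                pos += len(sep)
        return result

For my data the ~1050 extra Pattern.match() calls are essentially free next to 
the cost of the big split, which is why this comes out so much faster than 
letting re.split() build the groups itself.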

Thanks,
Pat

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue24426>
_______________________________________