Patrick Maupin added the comment:

1) I have obviously oversimplified my test case, to the point where a developer 
thinks I'm silly enough to reach for the regex module just to split on a 
linefeed.

2) '\n(?<=(\n))' -- yes, of course, any casual user of the re module would 
immediately choose that as the most obvious thing to do.

3) My real regex is r'( [a-zA-Z0-9_]+ \[[0-9]+\][0-9:]+\].*\n)' because I am 
taking nasty broken output from a Cadence tool, fixing it up, and dumping it 
back out to a file.  Yes, I'm sure this could be optimized as well.  But when I 
can just remove the parentheses, get a 10X speedup, and then recover the 
string I meant to capture by looking at string lengths, shouldn't there at 
least be a warning that the re module has performance issues with capturing 
groups in split(), so that casual users like me know to recover the matching 
strings some other way?


Since I saw almost exactly the same performance degradation with \n as I did 
with the real regex, I assumed that was a valid testcase.  If that was a bad 
assumption and this is insufficient to debug it, I can submit a bigger testcase.


But if this issue is going to be wontfixed for some reason, there should 
certainly be a note added to the documentation, because it is not intuitive 
that splitting 5GB of data into 1000 strings of around 5MB each should be 10X 
faster than doing the same thing while also capturing the 1K ten-byte strings 
in between the big ones.


Thanks,
Pat

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue24426>
_______________________________________