Martin v. Löwis wrote:

> Walter Dörwald wrote:
>
>> Martin v. Löwis wrote:
>>
>>> Walter Dörwald wrote:
>>>
>>>> I think a maxsplit argument (just as for unicode.split()) would help
>>>> too.
>>>
>>> Correct - that would allow to get rid of the quadratic part.
>>
>> OK, such a patch should be rather simple. I'll give it a try.
>
> Actually, on a second thought - it would not remove the quadratic
> aspect.
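(To make that concrete, here is a rough sketch -- not the actual patch --
of the readline pattern we are talking about, emulating a hypothetical
splitlines(maxsplit=1) with split(u"\n", 1), which of course only handles
"\n". Even with maxsplit, every call copies the whole remaining buffer
into the rest string, so reading N lines still costs on the order of
N*N/2 character copies; only the per-character line break scan goes
away.)

def read_all_lines(buffer):
    # Emulate a hypothetical splitlines(maxsplit=1) via split(u"\n", 1).
    lines = []
    rest = buffer
    while rest:
        parts = rest.split(u"\n", 1)
        lines.append(parts[0])
        if len(parts) > 1:
            rest = parts[1]   # copies the entire remaining buffer
        else:
            rest = u""
    return lines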
At least it would remove the quadratic number of calls to
_PyUnicodeUCS2_IsLinebreak(). For each character it would be called
only once.

> You would still copy the rest string completely on each
> split. So on the first split, you copy N lines (one result line,
> and N-1 lines into the rest string), on the second split, N-2
> lines, and so on, totalling N*N/2 line copies again.

OK, that's true. We could prevent string copying if we kept the unsplit
string and the position of the current line terminator, but this would
require a "first position after a line terminator" method.

> The only
> thing you save is the join (as the rest is already joined), and
> the IsLineBreak calls (which are necessary only for the first
> line).
>
> Please see python.org/sf/1268314;

The last part of the patch seems to be more related to bug #1235646.

With the patch, test_pep263 and test_codecs fail (and test_parser, but
this might be unrelated).

python Lib/test/test_pep263.py gives the following output:

  File "Lib/test/test_pep263.py", line 22
SyntaxError: list index out of range

test_codecs.py has the following two complaints:

  File "/var/home/walter/Achtung/Python-linecache/dist/src/Lib/codecs.py", line 366, in readline
    self.charbuffer = lines[1] + self.charbuffer
IndexError: list index out of range

and

  File "/var/home/walter/Achtung/Python-linecache/dist/src/Lib/codecs.py", line 336, in readline
    line = result.splitlines(False)[0]
NameError: global name 'result' is not defined

> it solves the problem by
> keeping the splitlines result. It only invokes IsLineBreak
> once per character, and also copies each character only once,
> and allocates each line only once, totalling in O(N) for
> these operations. It still does contain a quadratic operation:
> the lines are stored in a list, and the result line is
> removed from the list with del lines[0]. This copies N-1
> pointers, resulting in N*N/2 pointer copies. That should still
> be much faster than the current code.

Using collections.deque() should get rid of this problem.

>> unicodelinebreaks = u"".join(unichr(c) for c in xrange(0,
>> sys.maxunicode) if unichr(c).islinebreak())
>
> That is very inefficient. I would rather add a static list
> to the string module, and have a test that says
>
> assert str.unicodelinebreaks == u"".join(ch for ch in (unichr(c) for c
> in xrange(0, sys.maxunicode)) if unicodedata.bidirectional(ch)=='B' or
> unicodedata.category(ch)=='Zl')

You mean, in the test suite?

> unicodelinebreaks could then be defined as
>
> # u"\r\n\x1c\x1d\x1e\x85\u2028\u2029"
> '\n\r\x1c\x1d\x1e\xc2\x85\xe2\x80\xa8\xe2\x80\xa9'.decode("utf-8")

That might be better, as this definition won't change very often. BTW,
why the decode() call? For a Python without unicode?

>> OK, this would mean we'd have to distinguish between a direct call to
>> read() and one done by readline() (which we do anyway through the
>> firstline argument).
>
> See my patch. If we have cached lines, we don't need to call .read
> at all.

I wonder what happens if calls to read() and readline() are mixed (e.g.
if I'm reading Fortran source or anything with a fixed line header):
read() would be used to read the first n characters (which joins the
line buffer) and readline() reads the rest (which would split it
again), etc. (Of course this could be done via a single readline()
call.) But I think a maxsplit argument for splitlines() would make
sense independent of this problem.
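To illustrate the deque point and the read()/readline() mixing from
above: a minimal sketch (this is not the code from the patch) of a
reader that caches the splitlines() result in a collections.deque.
popleft() hands out the next line in O(1) instead of the O(N) pointer
copies behind del lines[0], and a mixed read() call simply rejoins
whatever is still cached:

import collections

class LineCachingReader:
    """Minimal sketch, not the patch: cache split lines in a deque."""

    def __init__(self, stream):
        self.stream = stream
        self.linebuffer = collections.deque()

    def readline(self):
        if not self.linebuffer:
            data = self.stream.read()
            # keepends=True, so nothing is lost if lines get rejoined
            self.linebuffer.extend(data.splitlines(True))
        if not self.linebuffer:
            return u""
        return self.linebuffer.popleft()   # O(1), unlike del lines[0]

    def read(self):
        # A mixed read() call has to rejoin whatever is still cached.
        rest = u"".join(self.linebuffer)
        self.linebuffer.clear()
        return rest + self.stream.read()

The real StreamReader of course reads in chunks and has to deal with
keepends, size limits and the firstline logic; the sketch only shows
where the deque would fit in.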
Bye,
   Walter Dörwald