Martin v. Löwis wrote:
> Walter Dörwald wrote:
>
>>This is caused by the changes to the codecs in 2.4. Basically the codecs
>>no longer rely on C's readline() to do line splitting (which can't work
>>for UTF-16), but do it themselves (via unicode.splitlines()).
>
> That explains why you get any calls to IsLineBreak; it doesn't explain
> why you get so many of them.
>
> I investigated this a bit, and one issue seems to be that
> StreamReader.readline performs splitlines on the entire input, only to
> fetch the first line. It then joins the rest for later processing.
> In addition, it also performs splitlines on a single line, just to
> strip any trailing line breaks.
This is because unicode.splitlines() is the only API available to Python
that knows about Unicode line breaks.
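
As a rough illustration of the pattern described above (this is not the
actual codecs code; naive_readline and count_scanned are made-up names, and
Python 3's str stands in for the 2.4 unicode type), splitting the whole
buffer just to take the first line means the remaining text gets rescanned
on every call:

```python
def naive_readline(buffer):
    """Return (first_line, rest), splitting the whole buffer every time."""
    lines = buffer.splitlines(True)   # scans every character in buffer
    if not lines:
        return "", ""
    first = lines[0]
    rest = "".join(lines[1:])         # re-join the remainder for later calls
    return first, rest


def count_scanned(text):
    """Read all lines and count how many characters were scanned in total."""
    scanned = 0
    lines = []
    while text:
        scanned += len(text)          # splitlines() looks at the whole buffer
        line, text = naive_readline(text)
        lines.append(line)
    return lines, scanned


# 100 two-character lines: the scanned total grows quadratically with the
# number of lines, not linearly with the 200 characters of input.
lines, scanned = count_scanned("a\n" * 100)
```

The last character of the input is examined once per line read, which is
where the quadratic behaviour comes from.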
> The net effect is that, for a file with N lines, IsLineBreak is invoked
> up to N*N/2 times per character (at least for the last character).
>
> So I think it would be best if Unicode characters exposed a .islinebreak
> method (or, failing that, codecs just knew what the line break
> characters are in Unicode 3.2), and then codecs would split off
> the first line of input itself.
I think a maxsplit argument (just as for unicode.split()) would help too.
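
A minimal sketch of what splitting off only the first line could look like,
roughly what a maxsplit of 1 would give. split_first_line is a hypothetical
helper, and LINE_BREAKS is an assumption: it is the set of line boundaries
documented for Python 3's str.splitlines(), which is not necessarily the
set used by the 2.4 codecs:

```python
# Assumed set: the line boundaries documented for Python 3's str.splitlines().
LINE_BREAKS = frozenset("\n\r\x0b\x0c\x1c\x1d\x1e\x85\u2028\u2029")


def split_first_line(buffer):
    """Hypothetical helper: return (first_line_with_ending, rest)."""
    for i, ch in enumerate(buffer):
        if ch in LINE_BREAKS:
            end = i + 1
            # Treat "\r\n" as a single line ending.
            if ch == "\r" and buffer[end:end + 1] == "\n":
                end += 1
            return buffer[:end], buffer[end:]
    return buffer, ""                 # no line break found


first, rest = split_first_line("ab\r\ncd\u2028ef")
```

This stops scanning at the first line break, so the rest of the buffer is
not touched until it is actually needed. A real stream reader would also
have to handle a "\r" that ends the buffer, since the matching "\n" may
only arrive with the next read.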
> [...]
Bye,
Walter Dörwald
_______________________________________________
Python-Dev mailing list
http://mail.python.org/mailman/listinfo/python-dev