On 24.08.2005 at 21:15, Martin v. Löwis wrote:

> Walter Dörwald wrote:
>
>>> Right. Not sure what people think whether this should still be
>>> supported, but I keep supporting it whenever I think of it.
>>
>> OK, so should we add this for 2.4.2 or only for 2.5?
>
> You mean, string.unicodelinebreaks?

Yes.

> I think something needs to be
> done to fix the performance problem. In doing so, API changes
> might occur. We should not add API changes in 2.4.2 unless they
> contribute to the bug fix, and even then, the release manager
> probably needs to approve them (in any case, they certainly
> need to be backwards compatible).

OK. Your version of the patch (without replacing
line = line.splitlines(False)[0] with something better) might be
enough for 2.4.2.

>> Should this really be put into string.py, or should it be a class
>> attribute of unicode? (At least that's what was proposed for the
>> other strings in string.py (string.whitespace etc.) too.)
>
> If the 2.4.2 fix is based on this kind of data, I think it should go
> into a private attribute of codecs.py.

I think codecs.unicodelinebreaks has one big problem: it will not
work for codecs that do str->str decoding.

> For 2.5, I would put it
> into strings for tradition. There is no point in having some of these
> constants in strings and others as class attributes (unless we also
> add them as class attributes in 2.5, in which case adding
> unicodelinebreaks into strings would be pointless).
>
> So I think in 2.5, I would like to see
>
> # string.py
> ascii_letters = str.ascii_letters
>
> in which case unicode.linebreaks would be the right spelling.

And it would have the advantage that it could work with both str
and unicode if we had both str.linebreaks and unicode.linebreaks.

>>> I'm not so sure anymore. It is good for consistency, but I doubt
>>> there are actual use cases: how often do you want only the first
>>> n lines of some string? Reading the first n lines of a file might
>>> be an application, but then, you would rather use .readline()
>>> directly.
>>
>> Not every unicode string is read from a StreamReader.
>
> Sure: but how often do you want to fetch the first line of a Unicode
> string you happen to have in memory, without iterating over all lines
> eventually?

I don't know.
The only obvious spot in the standard library (apart from
codecs.py) seems to be

    def shortdescription(self):
        return self.description().splitlines()[0]

in Lib/plat-mac/pimp.py.

>> Another solution would be to have a unicode.itersplitlines() and
>> store the iterator. Then we wouldn't need a maxsplit because you
>> simply can stop iterating once you have what you want.
>
> That might work. I would then ask for itersplitlines to return pairs
> of (line, truncated) so you can easily know whether you merely ran
> into the end of the string, or whether you got a complete line
> (although it might be a bit too specific for the readlines() case).

Or maybe (line, terminatorlength), which gives you the same info
(terminatorlength == 0 means truncated) and makes it easy to strip
the terminator.

>> So reverting to the 2.3 behaviour for simple codecs is out?
>
> I'm -1, at least. It would also fix the problem at hand, for the
> reported case. However, it does leave some codecs in the cold, most
> notably UTF-8 (which, in turn, isn't an issue for PEP 262, since
> UTF-8 is built-in in the parser).

You meant PEP 263, right?

> I think the UTF-8 stream reader should support
> all Unicode line breaks, so it should continue to use the Python
> approach.

OK.

> However, UTF-8 is fairly common, so that reading a
> UTF-8-encoded file line-by-line shouldn't suck.

OK, so what's missing is a solution for str->str codecs (or we keep
line = line.splitlines(False)[0] and test whether this is fast
enough).

Bye,
   Walter Dörwald
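To make the (line, terminatorlength) idea from the thread concrete, here is a minimal sketch of what such an iterator might look like. It is only an illustration under stated assumptions: itersplitlines is written as a free function rather than a method, it targets Python 3 str rather than the 2.x unicode type discussed above, and the terminator set is the one documented for str.splitlines() in current Python 3. None of this is code from the patch.

```python
import re

# Assumption: the line terminators documented for str.splitlines() in
# Python 3. "\r\n" comes first so it is matched as a single terminator.
_LINEBREAK = re.compile(
    "\r\n|[\n\r\x0b\x0c\x1c\x1d\x1e\x85\u2028\u2029]"
)


def itersplitlines(text):
    """Hypothetical helper: lazily yield (line, terminatorlength) pairs.

    Each line keeps its terminator. A terminatorlength of 0 means the
    line was cut off by the end of the string, not by a line break.
    """
    pos = 0
    for match in _LINEBREAK.finditer(text):
        yield text[pos:match.end()], match.end() - match.start()
        pos = match.end()
    if pos < len(text):
        # Trailing data without a terminator: report it as truncated.
        yield text[pos:], 0


def strip_terminator(line, terminatorlength):
    # len(line) - n is used instead of line[:-n], which would return
    # an empty string when n == 0.
    return line[:len(line) - terminatorlength]
```

Because the generator is lazy, fetching only the first line (for example with next(itersplitlines(data))) stops scanning at the first terminator, which is the property the maxsplit discussion was after.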