Nick Coghlan writes:

> Yes, I'm currently thinking the appropriate approach to the docs
> will be to remove the current "these have most of the str methods
> too" paragraph for binary sequences and instead create three
> completely explicit lists of methods:
>
> - provided, works with arbitrary data
> - provided, assumes the use of an ASCII compatible data format

I'm not sure what that means. If you mean that in the format string
for .format() and %-formatting, bytes 0-127 must always have ASCII
coded character semantics with bytes 128-255 unrestricted, then indeed,
that is the pragmatic restriction. Is there anything else?

The implications of this should be made clear, though: funky Asian
encodings cannot be safely used in format strings for format(),
GB18030 isn't safe in %-formatting either, and the value returned by
these operations should be assumed to be non-ASCII-compatible unless
proven otherwise (no iterated formatting).

I think you also need

  - provided, assumes pure ASCII-encoded text

since as far as I know the only strictly ASCII-compatible binary
formats are ISO 2022-compatible encodings and UTF-8, i.e., text. The
characters represented with bytes in the range 128-255 are not handled
by the bytes versions of the case-checking and case-converting
operations, and so those operations have extremely dubious semantics
unless the data is pure ASCII. This is also true of most of the is*
operations.

Note that .center and .strip have pretty dubious semantics for
arbitrary "ASCII-compatible" data:

    >>> b"abc\r\n".center(15)
    b'     abc\r\n     '
    >>> " \xA0abc\xA0 ".strip()
    'abc'
    >>> b" \xA0abc\xA0 ".strip()
    b'\xa0abc\xa0'

Of course the case of .center() is purely a programmer error, and I
don't have a use case where it's problematic in practice. But it's
sort of unpleasant.

Although I have internalized Guido's point that what's important is
that there be no implicit conversions between bytes and str, I still
worry that this slew of subtle semantic differences when moving str
methods wholesale to bytes is a bug magnet.

I have an especially bad feeling about str-into-bytes interpolation.
If people want that, they should use a type like asciistr that
provides more or less firm guarantees that the content is pure ASCII.
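
A minimal sketch of two of the hazards described above, assuming
Python 3. The sample characters (U+00E9 and U+30DC) and the variable
names are chosen purely for illustration and do not come from any real
data: the first part shows the bytes case and is* methods ignoring
bytes 128-255, the second shows a Shift JIS trail byte that collides
with the ASCII "{" a format string would treat as markup.

```python
# Illustrative only: the sample characters below are assumptions.

# 1. Case conversion and classification on bytes only understand ASCII.
e_acute = "\xe9"                       # LATIN SMALL LETTER E WITH ACUTE
latin1 = e_acute.encode("latin-1")     # single byte 0xE9
print(e_acute.upper() == "\xc9")       # True: str knows the Unicode mapping
print(latin1.upper() == latin1)        # True: bytes leaves 0xE9 untouched
print(latin1.isalpha())                # False: 0xE9 is not an ASCII letter

# 2. Some multibyte encodings reuse ASCII-range values as trail bytes.
bo = "\u30dc"                          # KATAKANA LETTER BO
sjis = bo.encode("shift_jis")
print(sjis)                            # b'\x83{'
print(b"{" in sjis)                    # True: looks like format() markup

# UTF-8 never uses bytes 0-127 inside a multibyte sequence.
utf8 = bo.encode("utf-8")
print(all(byte >= 0x80 for byte in utf8))  # True
```

The UTF-8 check at the end is the contrast case: every byte of a
non-ASCII character is in the range 128-255, so it cannot be mistaken
for ASCII markup.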
> - not provided
>
> PEP 461 would add a fourth category, of being provided, but with
> more restricted semantics.

I haven't looked closely at PEP 461 yet, and I'm not sure I'm going to
have time this week.

_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev