[Guido]
> > My first response to the PEP, however, is that instead of a new
> > built-in function, I'd rather relax the requirement that str() return
> > an 8-bit string -- after all, int() is allowed to return a long, so
> > why couldn't str() be allowed to return a Unicode string?

[MAL]
> The problem here is that strings and Unicode are used in different
> ways, whereas integers and longs are very similar. Strings are used
> for both arbitrary data and text data, Unicode can only be used
> for text data.

Yes, that is the case in Python 2.x. In Python 3.x, I'd like to use a separate "bytes" array type for non-text and for encoded text data, just like Java; strings should always be considered text data.

We might be able to get there halfway in Python 2.x: we could introduce the bytes type now, and provide separate APIs to read and write them. (In fact, the array module and the f.readinto() method make this possible today, but it's too clunky so nobody uses it.) Perhaps a better API would be a new file-open mode ("B"?) to indicate that a file's read* operations should return bytes instead of strings. The bytes type could just be a very thin wrapper around array('b').

> The new text() built-in would help make a clear distinction
> between "convert this object to a string of bytes" and
> "please convert this to a text representation". We need to
> start making the separation somewhere and I think this is
> a good non-invasive start.

I agree with the latter, but I would prefer that any new APIs use a 'bytes' data type to represent non-text data, rather than having two different sets of APIs to differentiate between the use of 8-bit strings as text vs. data -- while we *currently* use 8-bit strings for both text and data, in Python 3.0 we won't, so then the interim APIs would have to change again. I'd rather introduce a new data type and new APIs that work with it.

> Furthermore, the text() built-in could be used to only
> allow 8-bit strings with ASCII content to pass through
> and require that all non-ASCII content be returned as
> Unicode.
>
> We wouldn't be able to enforce this in str().
>
> I'm +1 on adding text().

I'm still -1.

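For reference, the array module plus f.readinto() route mentioned earlier in this reply can be sketched roughly as follows. This is a minimal illustration written in present-day Python syntax rather than the 2.x of this thread; the helper name read_into_array and the temporary file are hypothetical, chosen only for the example.

```python
import array
import os
import tempfile


def read_into_array(path, size):
    """Read up to `size` bytes from `path` into an array of signed bytes.

    Hypothetical helper, not an existing API.
    """
    # Preallocate a zero-filled buffer; typecode 'b' is a signed 8-bit integer.
    buf = array.array('b', bytes(size))
    with open(path, 'rb') as f:
        # readinto() fills the buffer in place and returns the byte count.
        n = f.readinto(buf)
    # Trim the buffer to the number of bytes actually read.
    return buf[:n]


# Usage: write three raw bytes to a temporary file, then read them back.
fd, tmp_path = tempfile.mkstemp()
try:
    with os.fdopen(fd, 'wb') as out:
        out.write(b'\x00\x01\xff')
    data = read_into_array(tmp_path, 16)
finally:
    os.remove(tmp_path)

print(list(data))  # [0, 1, -1]
```

The caller has to guess a buffer size up front and trim the result afterwards, which gives some sense of why this route feels clunky compared with a read* call that simply returns a bytes object.
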
> I would also like to suggest a new formatting marker '%t'
> to have the same semantics as text() - instead of changing
> the semantics of %s as Neil suggests in the PEP. Again,
> the reason is to make the difference between text and
> arbitrary data explicit and visible in the code.

Hm. What would be the use case for using %s with binary, non-text data?

> > The main problem for a smooth Unicode transition remains I/O, in my
> > opinion; I'd like to see a PEP describing a way to attach an encoding
> > to text files, and a way to decide on a default encoding for stdin,
> > stdout, stderr.
>
> Hmm, not sure why you need PEPs for this:

I'd forgotten how far we've come. I'm still unsure how the default encoding on stdin/stdout works.

But it still needs to be simpler; IMO the built-in open() function should have an encoding keyword. (But it could return something whose type is not 'file' -- once again making a distinction between open and file.) Do these files support universal newlines? IMO they should.

--
--Guido van Rossum (home page: http://www.python.org/~guido/)