Guido van Rossum wrote:
> [Guido]
>
>>>My first response to the PEP, however, is that instead of a new
>>>built-in function, I'd rather relax the requirement that str() return
>>>an 8-bit string -- after all, int() is allowed to return a long, so
>>>why couldn't str() be allowed to return a Unicode string?
>
>
> [MAL]
>
>>The problem here is that strings and Unicode are used in different
>>ways, whereas integers and longs are very similar. Strings are used
>>for both arbitrary data and text data, Unicode can only be used
>>for text data.
>
> Yes, that is the case in Python 2.x. In Python 3.x, I'd like to use a
> separate "bytes" array type for non-text and for encoded text data,
> just like Java; strings should always be considered text data.
>
> We might be able to get there halfway in Python 2.x: we could
> introduce the bytes type now, and provide separate APIs to read and
> write them.
>
> (In fact, the array module and the f.readinto() method
> make this possible today, but it's too klunky so nobody uses it.
> Perhaps a better API would be a new file-open mode ("B"?) to indicate
> that a file's read* operations should return bytes instead of strings.
> The bytes type could just be a very thin wrapper around array('b').

I'd prefer to keep such a bytes type immutable (arrays are mutable),
otherwise, as Martin already mentioned, they wouldn't be usable as
dictionary keys and the transition from the current string
implementation would be made more difficult than necessary.

Since we won't have any use for the string type in Py3k, why not
simply strip it down to a plain bytes type? (I wouldn't want to
lose or have to reinvent all the optimizations that went into
its implementation and which are missing in the array implementation.)

About the file-type idea: We already have text mode and binary
mode - with their implementation being platform dependent. I don't
think that this is a particularly good area to add new functionality.

If you use codecs.open() to open a file, you could easily write
a codec which implements what you have in mind.

>>The new text() built-in would help make a clear distinction
>>between "convert this object to a string of bytes" and
>>"please convert this to a text representation". We need to
>>start making the separation somewhere and I think this is
>>a good non-invasive start.
>
>
> I agree with the latter, but I would prefer that any new APIs we use
> use a 'bytes' data type to represent non-text data, rather than having
> two different sets of APIs to differentiate between the use of 8-bit
> strings as text vs. data -- while we *currently* use 8-bit strings for
> both text and data, in Python 3.0 we won't, so then the interim APIs
> would have to change again. I'd rather introduce a new data type and
> new APIs that work with it.

Well, let's put it this way: it all really depends on what
str() should mean in Py3k.

Given that str() is used for mixed content data strings,
simply aliasing str() to unicode() in Py3k would cause a lot
of breakage, due to changed semantics.

Aliasing str() to bytes() would also cause breakage, due to
the fact that bytes types wouldn't have string methods like
.lower(), .upper(), etc.
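
The dictionary-key argument made earlier in this message can be sketched as follows. This is only an illustration under a loud assumption: it uses the built-in bytes type of a current Python 3 interpreter as a stand-in for an immutable bytes type, not anything that existed when this thread was written, and the names payload, cache and mutable are made up for the example.

```python
from array import array

# Assumption: Python 3's built-in bytes stands in for the immutable
# bytes type discussed above. It is hashable, so it works as a key.
payload = b"\x00\x01\x02"
cache = {payload: "header"}

# array objects are mutable and therefore unhashable, so using one
# as a dictionary key raises TypeError.
mutable = array("b", [0, 1, 2])
try:
    cache[mutable] = "header"
    array_usable_as_key = True
except TypeError:
    array_usable_as_key = False
```

A thin wrapper around array('b') would inherit the same limitation unless it added its own hashing and gave up mutability.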

Perhaps str() in Py3k should become a helper that converts bytes()
to Unicode, provided the content is ASCII-only.

In any case, Py3k would only have unicode() for text and bytes()
for data, so there's no real need to continue using str(). If we
add the text() API in Py2k with the above meaning, then we
could rename unicode() to text() in Py3k - only a cosmetic change,
but one that I would find useful: text() and bytes() are more
intuitive to understand than unicode() and bytes().

>>Furthermore, the text() built-in could be used to only
>>allow 8-bit strings with ASCII content to pass through
>>and require that all non-ASCII content be returned as
>>Unicode.
>>
>>We wouldn't be able to enforce this in str().
>>
>>I'm +1 on adding text().
>
>
> I'm still -1.
>
>
>>I would also like to suggest a new formatting marker '%t'
>>to have the same semantics as text() - instead of changing
>>the semantics of %s as Neil suggests in the PEP. Again,
>>the reason is to make the difference between text and
>>arbitrary data explicit and visible in the code.
>
>
> Hm. What would be the use case for using %s with binary, non-text data?

I guess we'd only keep it for backwards compatibility and map it
to the str() helper.

>>>The main problem for a smooth Unicode transition remains I/O, in my
>>>opinion; I'd like to see a PEP describing a way to attach an encoding
>>>to text files, and a way to decide on a default encoding for stdin,
>>>stdout, stderr.
>>
>>Hmm, not sure why you need PEPs for this:
>
>
> I'd forgotten how far we've come. I'm still unsure how the default
> encoding on stdin/stdout works.

Codecs in general work like this: they take an existing file-like
object and wrap it with new versions of .read(), .write(),
.readline(), etc. which filter the data through encoding and/or
decoding functions. Once a file is wrapped with a codec
StreamWriter/Reader, you can continue using it as if it were
a standard file-like object.
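
A minimal sketch of that wrapping, using codecs.getwriter() and codecs.getreader() from the standard library. It assumes a current Python 3 interpreter, and the in-memory io.BytesIO stream is only a stand-in for a real file opened in binary mode; the variable names are illustrative.

```python
import codecs
import io

# Stand-in for a real binary file object.
raw = io.BytesIO()

# codecs.getwriter() returns a StreamWriter class; calling it wraps
# the stream so that .write() encodes text before passing it on.
writer = codecs.getwriter("utf-8")(raw)
writer.write("Gr\u00fc\u00dfe\n")
writer.write("second line\n")

# The underlying stream only ever saw encoded bytes.
encoded = raw.getvalue()

# codecs.getreader() does the reverse: .readline() and .read() decode
# the bytes coming from the wrapped stream.
reader = codecs.getreader("utf-8")(io.BytesIO(encoded))
first = reader.readline()
rest = reader.read()
```

After wrapping, the calling code only deals with text, while the wrapped stream keeps reading and writing bytes.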

> But it still needs to be simpler; IMO the built-in open() function
> should have an encoding keyword. (But it could return something whose
> type is not 'file' -- once again making a distinction between open and
> file.)

Right, because it would then return a wrapped file object.

> Do these files support universal newlines? IMO they should.

Since the codecs wrap the underlying file object, which does
support universal newlines, this should be the case. However,
you should be aware of the fact that Unicode defines a lot more
line break characters than just \r, \r\n, \n. The codecs use
the .splitlines() methods of strings and Unicode - which support
all of them transparently, so you don't need to enable universal
newlines support at all - it's sort of enabled by default.

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Aug 08 2005)
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev