Re: [Python-Dev] bytes / unicode
R. David Murray wrote:

> Having such a poly_str type would probably make my life easier.

A thought on this poly_str type: perhaps it could be called ascii, since
that's what it would have to be restricted to, and have a'xxx' as a
literal syntax for it, seeing as literals seem to be one of its main use
cases.

> I also would like to just vent a little frustration at having to use
> single-character-slice notation when I want to index a character in a
> string in my algorithms

Thinking way outside the square, and probably the pale as well, maybe @
could be pressed into service as an infix operator, with s@i being
equivalent to s[i:i+1].

--
Greg

___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] bytes / unicode
On Mon, Jun 28, 2010 at 08:28:45PM +1200, Greg Ewing wrote:

> A thought on this poly_str type: perhaps it could be called ascii,
> since that's what it would have to be restricted to, and have a'xxx'
> as a literal syntax for it, seeing as literals seem to be one of its
> main use cases.

This seems like a good idea.

> Thinking way outside the square, and probably the pale as well, maybe
> @ could be pressed into service as an infix operator, with s@i being
> equivalent to s[i:i+1]

And this is way beyond being intuitive.

--
Senthil
Re: [Python-Dev] bytes / unicode
On Mon, 28 Jun 2010 13:55:26 +0530, Senthil Kumaran orsent...@gmail.com wrote:

> On Mon, Jun 28, 2010 at 08:28:45PM +1200, Greg Ewing wrote:
>> Thinking way outside the square, and probably the pale as well, maybe
>> @ could be pressed into service as an infix operator, with s@i being
>> equivalent to s[i:i+1]
>
> And this is way beyond being intuitive.

Agreed, -1 on that. Like I said, I was just venting. The decision to
have indexing bytes return an int is set in stone now and I just have to
live with it.

--
R. David Murray www.bitdance.com
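[The behaviour Murray is resigned to here is easy to demonstrate; this quick illustration is an editorial addition, not part of the original thread:]

```python
data = b"abc"

# In Python 3, indexing a bytes object yields an int...
print(data[0])       # 97

# ...so code that wants a one-byte bytes object must slice instead.
print(data[0:1])     # b'a'

# str behaves differently: indexing gives a length-1 str.
assert "abc"[0] == "a"

# The proposed s@i operator would have been sugar for exactly this slice.
assert data[0:1] == b"a"
```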
Re: [Python-Dev] bytes / unicode
On Mon, Jun 28, 2010 at 6:28 PM, Greg Ewing greg.ew...@canterbury.ac.nz wrote:

> R. David Murray wrote:
>> Having such a poly_str type would probably make my life easier.
>
> A thought on this poly_str type: perhaps it could be called ascii,
> since that's what it would have to be restricted to, and have a'xxx'
> as a literal syntax for it, seeing as literals seem to be one of its
> main use cases.

One of the virtues of doing this as a helper type in a module somewhere
(probably string) is that we can defer that kind of decision until
later.

Cheers,
Nick.

--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
Re: [Python-Dev] bytes / unicode
On Sat, 26 Jun 2010 23:49:11 -0400 P.J. Eby p...@telecommunity.com wrote:

> Remember, bytes and strings already have to detect mixed-type
> operations.

Not in Python 3. They just raise a TypeError on bad (mixed-type)
arguments.

Regards
Antoine.
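[The behaviour Antoine describes is the one Python 3 still has; a quick editorial illustration:]

```python
# Python 3 rejects mixed str/bytes operations outright rather than coercing:
try:
    b"abc" + "def"
except TypeError:
    print("concatenating bytes and str raises TypeError")

# Membership tests are type-checked the same way.
try:
    "py" in b"python"
except TypeError:
    print("mixed-type membership tests raise too")
```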
Re: [Python-Dev] bytes / unicode
P.J. Eby writes:

> At 12:42 PM 6/26/2010 +0900, Stephen J. Turnbull wrote:
>> What I'm saying here is that if bytes are the signal of validity, and
>> the stdlib functions preserve validity, then it's better to have the
>> stdlib functions object to unicode data as an argument. Compare the
>> alternative: it returns a unicode object which might get passed
>> around for a while before one of your functions receives it and
>> identifies it as unvalidated data.
>
> I still don't follow,

OK, I give up, since it was your use case that concerned me. I obviously
misunderstood. Sorry for the confusion.

Sign me,
+1 on polymorphic functions in Tsukuba Japan

>> In general this is a hard problem, though. Polymorphism, OK, one-way
>> tainting OK, but in general combining related types is pretty
>> arbitrary, and as in the encoded-bytes case, the result type often
>> varies depending on expectations of callers, not the types of the
>> data.
>
> But the caller can enforce those expectations by passing in arguments
> whose types do what they want in such cases, as long as the string
> literals used by the function don't get to override the relevant parts
> of the string protocol(s).

This simply isn't true for encoded bytes as proposed. For encoded text,
the current encoding has no deterministic relationship to the desired
encoding (at the level of generality of the stdlib; of course in
specific applications it may be mandated by a standard or private
convention).

I will have to pass on your other user-defined string types. I've never
tried to implement one. I only wanted to point out that a
user-controllable tainted string type would be preferable to confounding
unicode with tainted.
Re: [Python-Dev] bytes / unicode
At 03:53 PM 6/27/2010 +1000, Nick Coghlan wrote:

> We could talk about this even longer, but the most effective way
> forward is going to be a patch that improves the URL parsing
> situation.

Certainly, it's the only practical solution for the immediate problems
in 3.2. I only mentioned that I hate the idea because I'd be more
comfortable if it was explicitly declared to be a temporary hack to work
around the absence of a string coercion protocol, due to the moratorium
on language changes.

But, since the moratorium *is* in effect, I'll try to make this my last
post on string protocols for a while... and maybe wait until I've looked
at the code (str/bytes C implementations) in more detail and can make a
more concrete proposal for what the protocol would be and how it would
work. (Not to mention closer to the end of the moratorium.)

> There are a *very small* number of APIs where it is appropriate to be
> polymorphic

This is only true if you focus exclusively on bytes vs. unicode, rather
than the general issue that it's currently impractical to pass *any*
sort of user-defined string type through code that you don't directly
control (stdlib or third-party).

> The virtues of a separate poly_str type are that:
> 1. It can be simple and implemented in Python, dispatching to str or
>    bytes as appropriate (probably in the strings module)
> 2. No chance of impacting the performance of the core interpreter (as
>    builtins are not affected)

Note that adding a string coercion protocol isn't going to change core
performance for existing cases, since any place where the protocol would
be invoked would be a code branch that either throws an error or
*already* falls back to some other protocol (e.g. the buffer protocol).

> 3. Lower impact if it turns out to have been a bad idea

How many protocols have been added that turned out to be bad ideas? The
only ones that have been removed in 3.x, IIRC, are three-way compare,
slice-specific operations, and __coerce__... and I'm going to miss
__cmp__. ;-)

However, IIUC, the reason these protocols were dropped isn't because
they were bad ideas. Rather, they're things that can be implemented in
terms of a finer-grained protocol. i.e., if you want __cmp__ or
__getslice__ or __coerce__, you can always implement them via a mixin
that converts the newer fine-grained protocols into invocations of the
older protocol. (As I plan to do for __cmp__ in the handful of places I
use it.)

At the moment, however, this isn't possible for multi-string operations
outside of __add__/__radd__ and comparison -- the coercion rules are
hard-wired and can't be overridden by user-defined types.
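[The mixin Eby alludes to, converting the fine-grained rich comparison protocol back into a 2.x-style __cmp__, might look like this editorial sketch; the class names are illustrative, not from the thread:]

```python
class CmpMixin:
    """Implement Python 3 rich comparisons in terms of an old-style __cmp__."""
    def __eq__(self, other): return self.__cmp__(other) == 0
    def __ne__(self, other): return self.__cmp__(other) != 0
    def __lt__(self, other): return self.__cmp__(other) < 0
    def __le__(self, other): return self.__cmp__(other) <= 0
    def __gt__(self, other): return self.__cmp__(other) > 0
    def __ge__(self, other): return self.__cmp__(other) >= 0

class Version(CmpMixin):
    # A user class only has to supply the single coarse-grained method.
    def __init__(self, n):
        self.n = n
    def __cmp__(self, other):
        return (self.n > other.n) - (self.n < other.n)

assert Version(1) < Version(2)
assert Version(3) >= Version(3)
```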
Re: [Python-Dev] bytes / unicode
I've been watching this discussion with intense interest, but have been
so lagged in following the thread that I haven't replied. I got caught
up today.

On Sun, 27 Jun 2010 15:53:59 +1000, Nick Coghlan ncogh...@gmail.com wrote:

> The difference is that we have three classes of algorithm here:
> - those that work only on octet sequences
> - those that work only on character sequences
> - those that can work on either
>
> Python 2 lumped all 3 classes of algorithm together through the
> multi-purpose 8-bit str type. The unicode type provided some scope to
> separate out the second category, but the divisions were rather
> blurry. Python 3 forces the first two to be separated by using either
> octets (bytes/bytearray) or characters (str).
>
> There are a *very small* number of APIs where it is appropriate to be
> polymorphic, but this is currently difficult due to the need to supply
> literals of the appropriate type for the objects being operated on.
> This isn't ever going to happen automagically due to the need to
> explicitly provide two literals (one for octet sequences, one for
> character sequences).

In email6 I'm currently handling this by putting the algorithm on a base
class and the literals on 'Bytes...' and 'String...' subclasses as class
variables. Slightly ugly, but it works.

The current design also speaks to an earlier point someone made about
the fact that we are really dealing with more complex, and domain
specific, data, not simply byte strings. A BytesMessage contains lots of
structured encoding information as well as the possibility of 'garbage'
bytes. A StringMessage contains text and data decoded into objects (ex:
an image object), possibly with some PEP 383 surrogates included
(haven't quite figured that part out yet). So, a BytesMessage object
isn't just a byte string, it's a load of structured data that requires
the associated algorithms to convert into meaningful text and objects.
Going the other way, the decisions made about character encodings need
to be encoded into the structured bytes representation that could
ultimately go out on the wire.

I suspect that the same thing needs to be done for URIs/IRIs, and
html/MIME and the corresponding text and objects. It is my hope that the
email6 work will lay a firm foundation for the latter, but URI/IRI is a
whole different protocol that I'm glad I don't have to deal with :)

> The virtues of a separate poly_str type are that:

Having such a poly_str type would probably make my life easier.

I also would like to just vent a little frustration at having to use
single-character-slice notation when I want to index a character in a
string in my algorithms.

--
R. David Murray www.bitdance.com
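[The pattern Murray describes, the algorithm on a base class with literals as class variables on Bytes.../String... subclasses, can be sketched roughly as below; the class names and the header-parsing method are made up for illustration and are not taken from the actual email6 code:]

```python
class _HeaderParseBase:
    # Subclasses supply a literal of the right type; the algorithm is shared.
    COLON = None

    def parse_header(self, line):
        # Split "Name: value" into a (name, value) pair of the input's type.
        name, _, value = line.partition(self.COLON)
        return name.strip(), value.strip()

class BytesHeaderParse(_HeaderParseBase):
    COLON = b":"

class StringHeaderParse(_HeaderParseBase):
    COLON = ":"

assert BytesHeaderParse().parse_header(b"To: guido") == (b"To", b"guido")
assert StringHeaderParse().parse_header("To: guido") == ("To", "guido")
```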
Re: [Python-Dev] bytes / unicode
At 12:42 PM 6/26/2010 +0900, Stephen J. Turnbull wrote:

> What I'm saying here is that if bytes are the signal of validity, and
> the stdlib functions preserve validity, then it's better to have the
> stdlib functions object to unicode data as an argument. Compare the
> alternative: it returns a unicode object which might get passed around
> for a while before one of your functions receives it and identifies it
> as unvalidated data.

I still don't follow, since passing in bytes should return bytes.
Returning unicode would be an error, in the case of a polymorphic
function (per Guido).

> But you agree that there are better mechanisms for validation
> (although not available in Python yet), so I don't see this as a
> potential obstacle to polymorphism now.

Nope. I'm just saying that, given two bytestrings to url-join or path
join or whatever, a polymorph should hand back a bytestring. This seems
pretty uncontroversial.

>> What I want is for the stdlib to create stringlike objects of a type
>> determined by the types of the inputs --
>
> In general this is a hard problem, though. Polymorphism, OK, one-way
> tainting OK, but in general combining related types is pretty
> arbitrary, and as in the encoded-bytes case, the result type often
> varies depending on expectations of callers, not the types of the
> data.

But the caller can enforce those expectations by passing in arguments
whose types do what they want in such cases, as long as the string
literals used by the function don't get to override the relevant parts
of the string protocol(s).

The idea that I'm proposing is that the basic string and byte types
should defer to user-defined string types for mixed type operations, so
that polymorphism of string-manipulation functions is the *default*
case, rather than a *special* case. This makes tainting easier to
implement, as well as optimizing and other special cases (like my source
string w/file and line info, or a string with font/formatting
attributes).
Re: [Python-Dev] bytes / unicode
On Sun, Jun 27, 2010 at 4:17 AM, P.J. Eby p...@telecommunity.com wrote:

> The idea that I'm proposing is that the basic string and byte types
> should defer to user-defined string types for mixed type operations,
> so that polymorphism of string-manipulation functions is the *default*
> case, rather than a *special* case. This makes tainting easier to
> implement, as well as optimizing and other special cases (like my
> source string w/file and line info, or a string with font/formatting
> attributes).

Rather than building this into the base string type, perhaps it would be
better (at least initially) to add in a polymorphic str subtype that
worked along the following lines:

1. Has an encoded argument in the constructor
   (e.g. poly_str('/', encoded=b'/'))
2. If given objects with an encode() method, assumes they're strings and
   uses its own parent class methods
3. If given objects with a decode() method, assumes they're encoded and
   delegates to the encoded attribute

str/bytes agnostic functions would need to invoke poly_str deliberately,
while bytes-only and text-only algorithms could just use the appropriate
literals. Third party types would be supported to some degree (by having
either encode or decode methods), although they could still run into
trouble with some operations. (While full support for third party
strings and byte sequence implementations is an interesting idea, I
think it's overkill for the specific problem of making it easier to
write str/bytes agnostic functions for tasks like URL parsing.)

Regards,
Nick.

--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
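[A rough editorial rendering of the three points above; the dispatch rule and names are guesses at what the sketch intends, not an actual implementation from the thread:]

```python
class poly_str(str):
    """Sketch: a str that carries its encoded twin and dispatches on input type."""

    def __new__(cls, text, encoded):
        self = super().__new__(cls, text)
        self.encoded = encoded   # point 1: constructor takes the bytes form
        return self

    def join(self, items):
        items = list(items)
        if items and hasattr(items[0], "decode"):
            # point 3: bytes-like input -> delegate to the encoded attribute
            return self.encoded.join(items)
        # point 2: str-like input -> use the ordinary str machinery
        return str(self).join(items)

slash = poly_str("/", encoded=b"/")
assert slash.join(["a", "b"]) == "a/b"     # text in, text out
assert slash.join([b"a", b"b"]) == b"a/b"  # bytes in, bytes out
```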
Re: [Python-Dev] bytes / unicode
At 12:43 PM 6/27/2010 +1000, Nick Coghlan wrote:

> While full support for third party strings and byte sequence
> implementations is an interesting idea, I think it's overkill for the
> specific problem of making it easier to write str/bytes agnostic
> functions for tasks like URL parsing.

OTOH, to write your partial implementation is almost as complex -- it
still must take into account joining and formatting, and so by that
point, you've just proposed a new protocol for coercion... so why not
just make the coercion protocol explicit in the first place, rather than
hardwiring a third type's worth of special cases?

Remember, bytes and strings already have to detect mixed-type
operations. If there was an API for that, then the hardcoded special
cases would just be replaced, or supplemented with type slot checks and
calls after the special cases.

To put it another way, if you already have two types special-casing
their interactions with each other, then rather than add a *third* type
to that mix, maybe it's time to have a protocol instead, so that the
types that care can do the special-casing themselves, and you generalize
to N user types.

(Btw, those who are saying that the resulting potential for N*N
interaction makes the feature unworkable seem to be overlooking
metaclasses and custom numeric types -- two Python features that in
principle have the exact same problem, when you use them beyond a
certain scope. At least with those features, though, you can generally
mix your user-defined metaclasses or numeric types with the
Python-supplied basic ones and call arbitrary Python functions on them,
without as much heartbreak as you'll get with a from-scratch stringlike
object.)

All that having been said, a new protocol probably falls under the
heading of the language moratorium, unless it can be considered new
methods on builtins? (But that seems like a stretch even to me.)

I just hate the idea that functions taking strings should have to be
*rewritten* to be explicitly type-agnostic. It seems *so* un-Pythonic...
like if all the bitmasking functions you'd ever written using 32-bit int
constants had to be rewritten just because we added longs to the
language, and you had to upcast them to be compatible or something.
Sounds too much like C or Java or some other non-Python language, where
dynamism and polymorphy are the special case, instead of the general
rule.
Re: [Python-Dev] bytes / unicode
On Sun, Jun 27, 2010 at 1:49 PM, P.J. Eby p...@telecommunity.com wrote:

> I just hate the idea that functions taking strings should have to be
> *rewritten* to be explicitly type-agnostic. It seems *so*
> un-Pythonic... like if all the bitmasking functions you'd ever written
> using 32-bit int constants had to be rewritten just because we added
> longs to the language, and you had to upcast them to be compatible or
> something. Sounds too much like C or Java or some other non-Python
> language, where dynamism and polymorphy are the special case, instead
> of the general rule.

The difference is that we have three classes of algorithm here:
- those that work only on octet sequences
- those that work only on character sequences
- those that can work on either

Python 2 lumped all 3 classes of algorithm together through the
multi-purpose 8-bit str type. The unicode type provided some scope to
separate out the second category, but the divisions were rather blurry.
Python 3 forces the first two to be separated by using either octets
(bytes/bytearray) or characters (str).

There are a *very small* number of APIs where it is appropriate to be
polymorphic, but this is currently difficult due to the need to supply
literals of the appropriate type for the objects being operated on. This
isn't ever going to happen automagically due to the need to explicitly
provide two literals (one for octet sequences, one for character
sequences).

The virtues of a separate poly_str type are that:
1. It can be simple and implemented in Python, dispatching to str or
   bytes as appropriate (probably in the strings module)
2. No chance of impacting the performance of the core interpreter (as
   builtins are not affected)
3. Lower impact if it turns out to have been a bad idea

We could talk about this even longer, but the most effective way forward
is going to be a patch that improves the URL parsing situation.

Cheers,
Nick.

--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
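[The literal problem Nick describes shows up even in a trivial function: to stay str/bytes agnostic today, the separator literal has to be chosen by hand to match the argument's type. This helper is an editorial illustration, not code from the thread:]

```python
def first_segment(path):
    # The algorithm is identical either way; only the literal differs,
    # which is exactly what poly_str would paper over.
    sep = "/" if isinstance(path, str) else b"/"
    return path.split(sep, 1)[0]

assert first_segment("usr/lib/python") == "usr"
assert first_segment(b"usr/lib/python") == b"usr"
```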
Re: [Python-Dev] bytes / unicode
Guido van Rossum writes:

> On Thu, Jun 24, 2010 at 1:12 AM, Stephen J. Turnbull step...@xemacs.org wrote:
>> Understood, but both the majority of str/bytes methods and several
>> existing APIs (e.g. many in the os module, like os.listdir()) do it
>> this way.
>
> Understood. Also, IMO a polymorphic function should *not* accept
> *mixed* bytes/text input -- join('x', b'y') should be rejected.

Agreed.

> But join('x', 'y') -> 'x/y' and join(b'x', b'y') -> b'x/y' make sense
> to me. So, actually, I *don't* understand what you mean by needing
> LBYL.

Consider docutils. Some folks assert that URIs *are* bytes and should be
manipulated as such. So base URIs should be bytes. But there are various
ways to refer to a base URI and combine it with a relative URI taken
from literal text in reST. That literal text will be represented as str.
So you want to use urljoin, but this usage isn't polymorphic. If you
forget to do a conversion here, urljoin will raise, of course. But late
conversion may not be appropriate. AIUI Philip at least wants ways to
raise exceptions earlier than that on some code paths. That's LBYL, no?
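[For context: the patch this thread eventually produced made urllib.parse polymorphic in exactly this all-or-nothing way (from Python 3.2 on). Mixing still raises immediately, which is the early error being discussed; an editorial illustration:]

```python
from urllib.parse import urljoin

# All-str and all-bytes calls both work...
assert urljoin("http://example.com/a/", "b") == "http://example.com/a/b"
assert urljoin(b"http://example.com/a/", b"b") == b"http://example.com/a/b"

# ...but mixing the two types is rejected up front.
try:
    urljoin(b"http://example.com/a/", "b")
except TypeError:
    print("mixed str/bytes urljoin raises TypeError")
```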
Re: [Python-Dev] bytes / unicode
P.J. Eby writes:

> This doesn't have to be in the functions; it can be in the *types*.
> Mixed-type string operations have to do type checking and upcasting
> already, but if the protocol were open, you could make an
> encoded-bytes type that would handle the error checking.

Don't you realize that encoded-bytes is equivalent to use of a very
limited profile of ISO 2022 coding extensions? Such as Emacs/MULE
internal encoding or TRON code? It has been tried. It does not work.

I understand how types can do such checking; my point is that the
encoded-bytes type doesn't have enough information to do it in the cases
where you think it is better than converting to str. There are *no
useful operations* that can be done on two encoded-bytes with different
encodings unless you know the ultimate target codec. The only sensible
way to define the concatenation of ('ascii', 'English') with
('euc-jp', '日本語') is something like ('ascii', 'English', 'euc-jp',
'日本語'), and *not* ('euc-jp', 'English日本語'), because you don't know
that the ultimate target codec is 'euc-jp'-compatible.

Worse, you need to build in all the information about which codecs are
mutually compatible into the encoded-bytes type. For example, if the
ultimate target is known to be 'shift_jis', it's trivially compatible
with 'ascii' and 'euc-jp' requires a conversion, but latin-9 you can't
have.

> (Btw, in some earlier emails, Stephen, you implied that this could be
> fixed with codecs -- but it can't, because the problem isn't with the
> bytes containing invalid Unicode, it's with the Unicode containing
> invalid bytes -- i.e., characters that can't be encoded to the
> ultimate codec target.)

No, the problem is not with the Unicode, it is with the code that allows
characters not encodable with the target codec. If you don't have a
target codec, there are ascii-safe source codecs, such as 'latin-1' or
'ascii' with surrogateescape, that will work any time that
bytes-oriented processing can work.
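[The ascii-safe escape hatch Turnbull mentions is the PEP 383 surrogateescape error handler, which round-trips arbitrary bytes through str losslessly; an editorial illustration:]

```python
raw = b"caf\xe9 \xff\xfe"   # not valid ASCII (or UTF-8)

# Decoding with surrogateescape maps undecodable bytes to lone surrogates...
text = raw.decode("ascii", "surrogateescape")

# ...and encoding back with the same handler restores the exact bytes.
assert text.encode("ascii", "surrogateescape") == raw
```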
Re: [Python-Dev] bytes / unicode
At 04:49 PM 6/25/2010 +0900, Stephen J. Turnbull wrote:

> Don't you realize that encoded-bytes is equivalent to use of a very
> limited profile of ISO 2022 coding extensions? Such as Emacs/MULE
> internal encoding or TRON code? It has been tried. It does not work.
>
> I understand how types can do such checking; my point is that the
> encoded-bytes type doesn't have enough information to do it in the
> cases where you think it is better than converting to str. There are
> *no useful operations* that can be done on two encoded-bytes with
> different encodings unless you know the ultimate target codec.

I do know the ultimate target codec -- that's the point. IOW, I want to
be able to do all my operations by passing target-encoded strings to
polymorphic functions. Then, the moment something creeps in that won't
go to the target codec, I'll be able to track down the hole in the
legacy code that's letting bad data creep in.

> The only sensible way to define the concatenation of ('ascii',
> 'English') with ('euc-jp', '日本語') is something like ('ascii',
> 'English', 'euc-jp', '日本語'), and *not* ('euc-jp', 'English日本語'),
> because you don't know that the ultimate target codec is
> 'euc-jp'-compatible. Worse, you need to build in all the information
> about which codecs are mutually compatible into the encoded-bytes
> type. For example, if the ultimate target is known to be 'shift_jis',
> it's trivially compatible with 'ascii' and 'euc-jp' requires a
> conversion, but latin-9 you can't have.

The interaction won't be with other encoded bytes, it'll be with other
*unicode* strings. Ones coming from other code, and literals embedded in
the stdlib.

> No, the problem is not with the Unicode, it is with the code that
> allows characters not encodable with the target codec.

And which code that is, precisely, is the thing that may be very
difficult to find, unless I can identify it at the first point it enters
(and corrupts) my output data. When dealing with a large code base, this
may be a nontrivial problem.
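[Eby's "known target codec" discipline can be enforced with an ordinary encode call at the boundary, so bad data fails where it enters rather than at output time. This sketch is an editorial addition; 'shift_jis' stands in for whatever the application's wire encoding happens to be:]

```python
def accept(s, target="shift_jis"):
    # Fail fast: raise at the point unencodable text enters the system,
    # instead of discovering it when the final output is encoded.
    s.encode(target)
    return s

accept("English and 日本語")    # ASCII and Japanese both fit in Shift JIS

try:
    accept("café")              # é has no Shift JIS encoding
except UnicodeEncodeError:
    print("caught bad data at the entry point")
```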
Re: [Python-Dev] bytes / unicode
On Fri, Jun 25, 2010 at 2:05 AM, Stephen J. Turnbull step...@xemacs.org wrote:

>> But join('x', 'y') -> 'x/y' and join(b'x', b'y') -> b'x/y' make sense
>> to me. So, actually, I *don't* understand what you mean by needing
>> LBYL.
>
> Consider docutils. Some folks assert that URIs *are* bytes and should
> be manipulated as such. So base URIs should be bytes.

I don't get what you are arguing against. Are you worried that if we
make URL code polymorphic that this will mean some code will treat URLs
as bytes, and that code will be incompatible with URLs as text? No one
is arguing we remove text support from any of these functions, only that
we allow bytes.

--
Ian Bicking | http://blog.ianbicking.org
Re: [Python-Dev] bytes / unicode
Ian Bicking writes:

> I don't get what you are arguing against. Are you worried that if we
> make URL code polymorphic that this will mean some code will treat
> URLs as bytes, and that code will be incompatible with URLs as text?
> No one is arguing we remove text support from any of these functions,
> only that we allow bytes.

No, I understand what Guido means by polymorphic. I'm arguing that, as I
understand one of Philip Eby's use cases, bytes is a misspelling of
validated and unicode is a misspelling of unvalidated. In case of some
kind of bug, polymorphic stdlib functions would allow propagation of
unvalidated/unicode within the validated zone, aka errors passing
silently.

Now that I understand that that use case doesn't actually care about
bytes vs. unicode *string* semantics at all, the argument becomes moot,
I guess.
Re: [Python-Dev] bytes / unicode
At 01:18 AM 6/26/2010 +0900, Stephen J. Turnbull wrote:

> It seems to me what is wanted here is something like Perl's taint
> mechanism, for *both* kinds of strings. Am I missing something?

You could certainly view it as a kind of tainting. The part where the
type would be bytes-based is indeed somewhat incidental to the actual
use case -- it's just that if you already have the bytes, and all you
want to do is tag them (e.g. the WSGI headers case), the extra encoding
step seems pointless.

A string coercion protocol (that would be used by .join(), .format(),
__contains__, __mod__, etc.) would allow you to do whatever sort of
tainted-string or tainted-bytes implementations one might wish to have.
I suppose that tainting user inputs (as in Perl) would be just as useful
an application of the same coercion protocol.

Actually, I have another use case for this custom string coercion, which
is that I once wrote a string subclass whose purpose was to track the
original file and line number of some text. Even though only my code was
manipulating the strings, it was very difficult to get the tainting to
work correctly without extreme care as to the string methods used. (For
example, I had to use string addition rather than %-formatting.)

> But with your architecture, it seems to me that you actually don't
> want polymorphic functions in the stdlib. You want the stdlib
> functions to be bytes-oriented if and only if they are reliable. (This
> is what I was saying to Guido elsewhere.)

I'm not sure I follow you. What I want is for the stdlib to create
stringlike objects of a type determined by the types of the inputs --
where the logic for deciding this coercion can be controlled by the
input objects' types, rather than putting this in the hands of the
stdlib function. And of course, this applies to non-stdlib functions,
too -- anything that simply manipulates user-defined string classes
should allow the user-defined classes to determine the coercion of the
result.

> BTW, this was a little unclear to me:
>
>> [Collisions will] be with other *unicode* strings. Ones coming from
>> other code, and literals embedded in the stdlib.
>
> What about the literals in the stdlib? Are you saying they contain
> invalid code points for your known output encoding? Or are you saying
> that with a non-polymorphic unicode stdlib, you get lots of false
> positives when combining with your validated bytes?

No, I mean that the current string coercion rules cause everything to be
converted to unicode, thereby discarding the tainting information, so to
speak. This applies equally to other tainting use cases, and other uses
for custom stringlike objects.
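[The difficulty Eby hit with his file/line-tracking subclass is easy to reproduce: only __add__/__radd__ can be intercepted by a subclass, while every other str operation silently returns a plain str. The class below is an editorial illustration, not his actual code:]

```python
class SourceStr(str):
    """Hypothetical subclass that wants to survive string operations."""
    def __add__(self, other):
        # Addition is one of the few hooks a subclass can override...
        return SourceStr(str(self) + other)

s = SourceStr("hello")

assert type(s + " world") is SourceStr  # __add__ preserves the subclass...
assert type(s.upper()) is str           # ...but ordinary methods drop it
assert type("%s!" % s) is str           # ...and so does %-formatting
```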
Re: [Python-Dev] bytes / unicode
P.J. Eby writes:

> it's just that if you already have the bytes, and all you want to do
> is tag them (e.g. the WSGI headers case), the extra encoding step
> seems pointless.

Well, I'll have to concede that unless and until I get involved in the
WSGI development effort. <wink>

>> But with your architecture, it seems to me that you actually don't
>> want polymorphic functions in the stdlib. You want the stdlib
>> functions to be bytes-oriented if and only if they are reliable.
>> (This is what I was saying to Guido elsewhere.)
>
> I'm not sure I follow you.

What I'm saying here is that if bytes are the signal of validity, and
the stdlib functions preserve validity, then it's better to have the
stdlib functions object to unicode data as an argument. Compare the
alternative: it returns a unicode object which might get passed around
for a while before one of your functions receives it and identifies it
as unvalidated data.

But you agree that there are better mechanisms for validation (although
not available in Python yet), so I don't see this as a potential
obstacle to polymorphism now.

> What I want is for the stdlib to create stringlike objects of a type
> determined by the types of the inputs --

In general this is a hard problem, though. Polymorphism, OK, one-way
tainting OK, but in general combining related types is pretty arbitrary,
and as in the encoded-bytes case, the result type often varies depending
on expectations of callers, not the types of the data.
Re: [Python-Dev] bytes / unicode
Guido van Rossum writes: For example: how we can make the suite of functions used for URL processing more polymorphic, so that each developer can choose for herself how URLs need to be treated in her application. While you have come down on the side of polymorphism (as opposed to separate functions), I'm a little nervous about it. Specifically, Philip Eby expressed a desire for earlier type errors, while polymorphism seems to ensure that you'll need to Look Before You Leap to get early error detection.
Re: [Python-Dev] bytes / unicode
On Tue, Jun 22, 2010 at 20:07, James Y Knight f...@fuhm.net wrote: Yeah. This is a real issue I have with the direction Python3 went: it pushes you into decoding everything to unicode early, even when you don't care -- Well, yes, maybe even if *you* don't care. But often the functions you need to call must care, and then you need to decode to unicode, even if you personally don't care. And in those cases, you should decode as early as possible. In the cases where neither you nor the functions you call care, then you don't have to decode, and you can happily pass binary data from one function to another. So this is not really a question of the direction Python 3 went. It's more a case that some methods that *could* do their transformations in a well defined way on bytes don't, and then force you to decode to unicode. But that's not a problem with direction, it's just a missing feature in the stdlib. -- Lennart Regebro: http://regebro.wordpress.com/ Python 3 Porting: http://python3porting.com/
Re: [Python-Dev] bytes / unicode
Lennart Regebro wrote: On Tue, Jun 22, 2010 at 20:07, James Y Knight f...@fuhm.net wrote: Yeah. This is a real issue I have with the direction Python3 went: it pushes you into decoding everything to unicode early, even when you don't care -- Well, yes, maybe even if *you* don't care. But often the functions you need to call must care, and then you need to decode to unicode, even if you personally don't care. And in those cases, you should decode as early as possible. In the cases where neither you nor the functions you call care, then you don't have to decode, and you can happily pass binary data from one function to another. So this is not really a question of the direction Python 3 went. It's more a case that some methods that *could* do their transformations in a well defined way on bytes don't, and then force you to decode to unicode. But that's not a problem with direction, it's just a missing feature in the stdlib. The discussion is showing that in at least a few application spaces, the stdlib should be able to work on both bytes and Unicode, preferably using the same interfaces via polymorphism, i.e. some_function(bytes) -> bytes, some_function(str) -> str. In Python 2 this partially works due to the automatic bytes -> str conversion (in some cases you get some_function(bytes) -> str), the codec base class implementations being a prime example. In Python 3, things have to be done explicitly, and I think we need to add a few helpers to make writing such str/bytes interfaces easier. We've already had some suggestions in that area, but probably need to collect a few more ideas based on real-life porting attempts. I'd like to make this a topic at the upcoming language summit in Birmingham, if Michael agrees. -- Marc-Andre Lemburg, eGenix.com
Re: [Python-Dev] bytes / unicode
On 24/06/2010 11:58, M.-A. Lemburg wrote: Lennart Regebro wrote: On Tue, Jun 22, 2010 at 20:07, James Y Knight f...@fuhm.net wrote: Yeah. This is a real issue I have with the direction Python3 went: it pushes you into decoding everything to unicode early, even when you don't care -- Well, yes, maybe even if *you* don't care. But often the functions you need to call must care, and then you need to decode to unicode, even if you personally don't care. And in those cases, you should decode as early as possible. In the cases where neither you nor the functions you call care, then you don't have to decode, and you can happily pass binary data from one function to another. So this is not really a question of the direction Python 3 went. It's more a case that some methods that *could* do their transformations in a well defined way on bytes don't, and then force you to decode to unicode. But that's not a problem with direction, it's just a missing feature in the stdlib. The discussion is showing that in at least a few application spaces, the stdlib should be able to work on both bytes and Unicode, preferably using the same interfaces via polymorphism, i.e. some_function(bytes) -> bytes, some_function(str) -> str. In Python 2 this partially works due to the automatic bytes -> str conversion (in some cases you get some_function(bytes) -> str), the codec base class implementations being a prime example. In Python 3, things have to be done explicitly, and I think we need to add a few helpers to make writing such str/bytes interfaces easier. We've already had some suggestions in that area, but probably need to collect a few more ideas based on real-life porting attempts. I'd like to make this a topic at the upcoming language summit in Birmingham, if Michael agrees. Yep, it sounds like a great topic for the language summit.
Michael -- http://www.ironpythoninaction.com/
Re: [Python-Dev] bytes / unicode
On Thu, Jun 24, 2010 at 1:12 AM, Stephen J. Turnbull step...@xemacs.org wrote: Guido van Rossum writes: For example: how we can make the suite of functions used for URL processing more polymorphic, so that each developer can choose for herself how URLs need to be treated in her application. While you have come down on the side of polymorphism (as opposed to separate functions), I'm a little nervous about it. Specifically, Philip Eby expressed a desire for earlier type errors, while polymorphism seems to ensure that you'll need to Look Before You Leap to get early error detection. Understood, but both the majority of str/bytes methods and several existing APIs (e.g. many in the os module, like os.listdir()) do it this way. Also, IMO a polymorphic function should *not* accept *mixed* bytes/text input -- join('x', b'y') should be rejected. But join('x', 'y') -> 'x/y' and join(b'x', b'y') -> b'x/y' make sense to me. So, actually, I *don't* understand what you mean by needing LBYL. --Guido van Rossum (python.org/~guido)
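A minimal sketch of the all-text-or-all-bytes policy Guido describes (illustrative only, not the actual posixpath implementation): same-type arguments work, and a mixture raises TypeError early, which also addresses the earlier-type-errors concern:

```python
def join(*parts):
    # All arguments must share one type: all str, or all bytes.
    kind = str if isinstance(parts[0], str) else bytes
    if not all(isinstance(p, kind) for p in parts):
        raise TypeError("can't mix str and bytes path components")
    sep = '/' if kind is str else b'/'
    return sep.join(parts)

assert join('x', 'y') == 'x/y'
assert join(b'x', b'y') == b'x/y'
```

Calling join('x', b'y') raises immediately at the boundary, instead of producing a mixed-type result that fails somewhere downstream.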
Re: [Python-Dev] bytes / unicode
On Fri, Jun 25, 2010 at 12:33 AM, Guido van Rossum gu...@python.org wrote: Also, IMO a polymorphic function should *not* accept *mixed* bytes/text input -- join('x', b'y') should be rejected. But join('x', 'y') -> 'x/y' and join(b'x', b'y') -> b'x/y' make sense to me. A policy of allowing arguments to be either str or bytes, but not a mixture, actually avoids one of the more painful aspects of the 2.x "promote mixed operations to unicode" approach. Specifically, you either had to scan all the arguments up front to check for unicode, or else you had to stop what you were doing and start again with the unicode version if you encountered unicode partway through. Neither was particularly nice to implement. As you noted elsewhere, literals and string methods are still likely to be a major sticking point with that approach - common operations like ''.join(seq) and b''.join(seq) aren't polymorphic, so functions that use them won't be polymorphic either. (It's only the str -> unicode promotion behaviour in 2.x that works around this problem there.) Would it be heretical to suggest that sum() be allowed to work on strings to at least eliminate ''.join() as something that breaks bytes processing? It already works for bytes, although it then fails with a confusing message for bytearray:

>>> sum(b"a b c".split(), b'')
b'abc'
>>> sum(bytearray(b"a b c").split(), bytearray(b''))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: sum() can't sum bytes [use b''.join(seq) instead]
>>> sum("a b c".split(), '')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: sum() can't sum strings [use ''.join(seq) instead]

Cheers, Nick. -- Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
Re: [Python-Dev] bytes / unicode
On Thu, Jun 24, 2010 at 8:25 AM, Nick Coghlan ncogh...@gmail.com wrote: On Fri, Jun 25, 2010 at 12:33 AM, Guido van Rossum gu...@python.org wrote: Also, IMO a polymorphic function should *not* accept *mixed* bytes/text input -- join('x', b'y') should be rejected. But join('x', 'y') -> 'x/y' and join(b'x', b'y') -> b'x/y' make sense to me. A policy of allowing arguments to be either str or bytes, but not a mixture, actually avoids one of the more painful aspects of the 2.x "promote mixed operations to unicode" approach. Specifically, you either had to scan all the arguments up front to check for unicode, or else you had to stop what you were doing and start again with the unicode version if you encountered unicode partway through. Neither was particularly nice to implement. Right. Polymorphic functions should *not* allow mixing text and bytes. It's all text or all bytes. As you noted elsewhere, literals and string methods are still likely to be a major sticking point with that approach - common operations like ''.join(seq) and b''.join(seq) aren't polymorphic, so functions that use them won't be polymorphic either. (It's only the str -> unicode promotion behaviour in 2.x that works around this problem there.) Would it be heretical to suggest that sum() be allowed to work on strings to at least eliminate ''.join() as something that breaks bytes processing? It already works for bytes, although it then fails with a confusing message for bytearray:

>>> sum(b"a b c".split(), b'')
b'abc'
>>> sum(bytearray(b"a b c").split(), bytearray(b''))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: sum() can't sum bytes [use b''.join(seq) instead]
>>> sum("a b c".split(), '')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: sum() can't sum strings [use ''.join(seq) instead]

I don't think we should abuse sum for this.
A simple idiom to get the *empty* string of a particular type is x[:0], so you could write something like this to concatenate a list of strings or bytes: xs[:0].join(xs). Note that if xs is empty we wouldn't know what to do anyway, so this should be disallowed. --Guido van Rossum (python.org/~guido)
Re: [Python-Dev] bytes / unicode
P.J. Eby wrote: [...] stdlib constants are almost always ASCII, and the main use cases for ebytes would involve ascii-extended encodings.) Then, how about a new ascii string literal? This would produce a special kind of string that would coerce to a normal string when mixed with a str, and to bytes using the ascii codec when mixed with bytes. Then you could write a'/'.join((base, path)) and not worry whether base and path are both str, or both bytes (mixed being of course forbidden). B.
Re: [Python-Dev] bytes / unicode
At 05:12 PM 6/24/2010 +0900, Stephen J. Turnbull wrote: Guido van Rossum writes: For example: how we can make the suite of functions used for URL processing more polymorphic, so that each developer can choose for herself how URLs need to be treated in her application. While you have come down on the side of polymorphism (as opposed to separate functions), I'm a little nervous about it. Specifically, Philip Eby expressed a desire for earlier type errors, while polymorphism seems to ensure that you'll need to Look Before You Leap to get early error detection. This doesn't have to be in the functions; it can be in the *types*. Mixed-type string operations have to do type checking and upcasting already, but if the protocol were open, you could make an encoded-bytes type that would handle the error checking. (Btw, in some earlier emails, Stephen, you implied that this could be fixed with codecs -- but it can't, because the problem isn't with the bytes containing invalid Unicode, it's with the Unicode containing invalid bytes -- i.e., characters that can't be encoded to the ultimate codec target.)
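As a toy sketch of such an open mixing protocol (entirely hypothetical; `ebytes` here is not a real type, just an illustration of the idea), bytes tagged with a target encoding can do the error checking at mixing time, failing early when the str side can't be represented in that encoding:

```python
class ebytes(bytes):
    """Toy: bytes that remember the encoding they must stay valid in."""
    def __new__(cls, data, encoding):
        self = super().__new__(cls, data)
        self.encoding = encoding
        return self

    def __add__(self, other):
        if isinstance(other, str):
            # Early error: raises UnicodeEncodeError right here if `other`
            # contains characters the target encoding can't represent.
            other = other.encode(self.encoding)
        return ebytes(bytes(self) + other, self.encoding)

hdr = ebytes(b'Location: ', 'ascii')
assert hdr + '/home' == b'Location: /home'
```

Appending 'é' to an ASCII-tagged ebytes would raise at the point of mixing, which is exactly the early type/encoding error PJE is asking for.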
Re: [Python-Dev] bytes / unicode
On Fri, Jun 25, 2010 at 3:07 AM, P.J. Eby p...@telecommunity.com wrote: (Btw, in some earlier emails, Stephen, you implied that this could be fixed with codecs -- but it can't, because the problem isn't with the bytes containing invalid Unicode, it's with the Unicode containing invalid bytes -- i.e., characters that can't be encoded to the ultimate codec target.) That's what the surrogateescape error handler is for though - it will happily accept mojibake on input (putting invalid bytes into the PUA), and happily generate mojibake on output (recreating the invalid bytes from the PUA) as well. Cheers, Nick. -- Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
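A small demonstration of the surrogateescape round trip (in CPython the smuggled bytes are carried as lone surrogate code points):

```python
# Bytes that are invalid UTF-8 survive a decode/encode round trip
# via the surrogateescape error handler.
raw = b'caf\xe9'  # Latin-1 encoded, not valid UTF-8
text = raw.decode('utf-8', errors='surrogateescape')
assert text == 'caf\udce9'  # the invalid byte became U+DCE9
assert text.encode('utf-8', errors='surrogateescape') == raw
```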
Re: [Python-Dev] bytes / unicode
On Fri, Jun 25, 2010 at 1:41 AM, Guido van Rossum gu...@python.org wrote: I don't think we should abuse sum for this. A simple idiom to get the *empty* string of a particular type is x[:0] so you could write something like this to concatenate a list or strings or bytes: xs[:0].join(xs). Note that if xs is empty we wouldn't know what to do anyway so this should be disallowed. That's a good trick, although there's a [0] missing from your join example (type(xs[0])() is another way to spell the same idea, but the subscripting version would likely be faster since it skips the builtin lookup). Promoting that over explicit use of empty str and bytes literals is probably step 1 in eliminating gratuitous breakage of bytes/str polymorphism (this trick also has the benefit of working with non-builtin character sequence types). Use of non-empty bytes/str literals is going to be harder to handle - actually trying to apply a polymorphic philosophy to the Python 3 URL parsing libraries may be a good way to learn more on that front. Cheers, Nick. P.S. I'm off to Sydney for PyconAU this evening, so I'm not sure how much time I'll get to follow python-dev until next week. -- Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
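With Nick's correction applied (the empty string comes from the first *element*, xs[0][:0], since xs itself is a list), the idiom looks like this; `concat` is just an illustrative name:

```python
def concat(xs):
    # Derive the empty str/bytes from the first element's type,
    # so one function concatenates both without hard-coded literals.
    if not xs:
        raise ValueError("can't infer the result type from an empty sequence")
    return xs[0][:0].join(xs)

assert concat(['a', 'b', 'c']) == 'abc'
assert concat([b'a', b'b', b'c']) == b'abc'
```

As Nick notes, this also works for any third-party character sequence type whose slices and join() behave like the builtins'.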
Re: [Python-Dev] bytes / unicode
Ian Bicking writes: Just for perspective, I don't know if I've ever wanted to deal with a URL like that. Ditto, I do many times a day for Japanese media sites and Wikipedia. I know how it is supposed to work, and I know what a browser does with that, but so many tools will clean that URL up *or* won't be able to deal with it at all that it's not something I'll be passing around. I'm not suggesting that is something you want to be passing around; it's a presentation form, and I prefer that the internal form use Unicode. While it's nice to be correct about encodings, sometimes it is impractical. And it is far nicer to avoid the situation entirely. But you cannot avoid it entirely. Processing bytes means you are assuming ASCII compatibility. Granted, this is a pretty good assumption, especially if you got the bytes off the wire, but it's not universally so. Maybe it's a YAGNI, but one reason I prefer the decode-process-encode paradigm is that the choice of codec is a specification of the assumptions you're making about encoding. So the Know-Nothing codec described above assumes just enough ASCII compatibility to parse the scheme. You could also have codecs which assume just enough ASCII compatibility to parse a hierarchical scheme, etc. That is, decoding content you don't care about isn't just inefficient, it's complicated and can introduce errors. That depends on the codec(s) used. Similarly I'd expect (from experience) a programmer using Python to want to take the same approach, sticking with unencoded data in nearly all situations. Indeed, a programmer using Python 2 would want to do so, because all her literal strings are bytes by default (ie, if she doesn't mark them with `u'), and interactive input is, too. This is no longer so obvious in Python 3, which takes the attitude that things that are expected to be human-readable should be processed as str.
The obvious example in URI space is the file:/// URL, which you'll typically build up from a user string or a file browser, which will call the os.path stuff which returns str. Text editors and viewers will also use str for their buffers, and if they provide a way to fish out URIs for their users, they'll probably return str. I won't pretend to judge the relative importance of such use cases. But use cases for urllib which naturally favor str until you put the URI on the wire do exist, as does the debugging presentation aspect.
Re: [Python-Dev] bytes / unicode
James Y Knight writes: The surrogateescape method is a nice workaround for this, but I can't help thinking that it might've been better to just treat stuff as possibly-invalid-but-probably-utf8 byte-strings from input, through processing, to output. This is the world we already have, modulo s/utf8/ascii + random GR charset/. It doesn't work, and it can't, in Japan or China or Korea, and probably not in Russia or Kazakhstan, for some time yet. That's not to say that byte-oriented processing doesn't have its place. And in many cases it's reasonable (but not secure or bulletproof!) to assume ASCII compatibility of the byte stream, passing through syntactically unimportant bytes verbatim. Syntactic analysis of such streams will surely have a lot in common with that for text streams, so the same tools should be available. (That's the point of Guido's endorsement of polymorphism, AIUI.) But it's just not reasonable to assume that will work in a context where text streams from various sources are mixed with byte streams. In that case, the byte streams need to be converted to text before mixing. (You can't do it the other way around because there is no guarantee that the text is compatible with the current encoding of the byte stream, nor that all the byte streams have the same encoding.) We do need str-based implementations of modules like urllib.
Re: [Python-Dev] bytes / unicode
Nick Coghlan wrote: On Wed, Jun 23, 2010 at 4:09 AM, M.-A. Lemburg m...@egenix.com wrote: It would be great if we could have something like the above as builtin method: x.split(''.as(x)) As per my other message, another possible (and reasonably intuitive) spelling would be: x.split(x.coerce('')) You are right: there are two ways to adapt one object to another. You can either adapt object 1 to object 2, or object 2 to object 1. This is what the Python 2 coercion protocol does for operators. I just wanted to avoid using that term, since Python 3 removes the coercion protocol. Writing it as a helper function is also possible, although it may be trickier to remember the correct argument ordering:

    def coerce_to(target, obj, encoding=None, errors='surrogateescape'):
        if isinstance(obj, type(target)):
            return obj
        if encoding is None:
            encoding = sys.getdefaultencoding()
        try:
            convert = obj.decode
        except AttributeError:
            convert = obj.encode
        return convert(encoding, errors)

    x.split(coerce_to(x, ''))

Perhaps something to discuss at the language summit at EuroPython. Too bad we can't add such porting enhancements to Python 2 anymore. Well, we can if we really want to, it just entails convincing Benjamin to reschedule the 2.7 final release. Given the UserDict/ABC/old-style classes issue, there's a fair chance there's going to be at least one more 2.7 RC anyway. That said, since this kind of coercion can be done in a helper function, that should be adequate for the 2.x to 3.x conversion case (for 2.x, the helper function can be defined to just return the second argument since bytes and str are the same type, while the 3.x version would look something like the code above). True. Note that the point of using a builtin method was to get better performance. Such type adaptations are often needed in loops, so adding a few extra Python function calls just to convert a str object to a bytes object or vice-versa is a bit too much overhead.
-- Marc-Andre Lemburg, eGenix.com
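For reference, a self-contained, runnable rendering of the helper sketched above (Python 3), with a quick check in both directions; str has no .decode, so the AttributeError fallback selects .encode for str input:

```python
import sys

def coerce_to(target, obj, encoding=None, errors='surrogateescape'):
    """Adapt obj (str or bytes) to the type of target."""
    if isinstance(obj, type(target)):
        return obj
    if encoding is None:
        encoding = sys.getdefaultencoding()
    try:
        convert = obj.decode   # bytes -> str
    except AttributeError:
        convert = obj.encode   # str -> bytes
    return convert(encoding, errors)

# The separator is adapted to whichever type x happens to be.
assert 'a b'.split(coerce_to('a b', b' ')) == ['a', 'b']
assert b'a b'.split(coerce_to(b'a b', ' ')) == [b'a', b'b']
```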
Re: [Python-Dev] bytes / unicode
On Wed, Jun 23, 2010 at 7:18 PM, M.-A. Lemburg m...@egenix.com wrote: Note that the point of using a builtin method was to get better performance. Such type adaptations are often needed in loops, so adding a few extra Python function calls just to convert a str object to a bytes object or vice-versa is a bit too much overhead. I actually agree with that, I just think we need more real world experience as to what works with the Python 3 text model before we start messing with the APIs for the builtin objects (fair point that 'coerce' is a loaded term given the existence of the old coercion protocol; it's the right word for the task, though). One of the key points coming out of this thread (to my mind) is the lack of a Text ABC or other way of making an object that can be passed to functions expecting a str instance with a reasonable expectation of having it work. Are there some core string capabilities that can be identified and then expanded out to a full str-compatible API? (i.e. something along the lines of what collections.MutableMapping now provides for dict-alikes). However, even if something like that was added, PJE is correct in pointing out that builtin strings still don't play well with others in many cases (usually due to underlying optimisations or other sound reasons, but perhaps sometimes gratuitously). Most of the string binary operations can be dealt with through their reflected forms, but str.__mod__ will never return NotImplemented, __contains__ has no reflected form, and the actual method calls are of course right out (e.g. the arguments to str.join() or str.split() calls have no ability to affect the type of the result). Third party number implementations couldn't provide comparable functionality to builtin int and long objects until the __index__ protocol was added.
Perhaps PJE is right that what this is really crying out for is a way to have third party real string implementations, so that there can actually be genuine experimentation in the Unicode handling space outside the language core (comparable to the difference between the "you can turn me into an int" __int__ method and the "I am an int" equivalent __index__ method). That may be tapping in a nail with a sledgehammer (and would raise significant moratorium questions if pursued further), but I think it's a valid question to at least ask. Cheers, Nick. -- Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
Re: [Python-Dev] bytes / unicode
At 08:34 PM 6/22/2010 -0400, Glyph Lefkowitz wrote: I suspect the practical problem here is that there's no CharacterString ABC That, and the absence of a string coercion protocol so that mixing your custom string with standard strings will do the right thing for your intended use.
Re: [Python-Dev] bytes / unicode
Stephen J. Turnbull wrote: We do need str-based implementations of modules like urllib. Why would that be? URLs aren't text, and never will be. The fact that to the eye they may seem to be text-ish doesn't make them text. This *is* a case where "don't make me think" is a losing proposition: programmers who work with URLs in any non-opaque way as text are eventually going to be bitten by this issue no matter how hard we wave our hands. Tres. -- Tres Seaver, Palladion Software
Re: [Python-Dev] bytes / unicode
On Wed, Jun 23, 2010 at 8:30 AM, Tres Seaver tsea...@palladion.com wrote: Stephen J. Turnbull wrote: We do need str-based implementations of modules like urllib. Why would that be? URLs aren't text, and never will be. The fact that to the eye they may seem to be text-ish doesn't make them text. This *is* a case where "don't make me think" is a losing proposition: programmers who work with URLs in any non-opaque way as text are eventually going to be bitten by this issue no matter how hard we wave our hands. This has been asserted and contested several times now, and I don't see the two positions getting any closer. So I propose that we drop the "are URLs text or bytes" discussion and try to find something more pragmatic to discuss. For example: how we can make the suite of functions used for URL processing more polymorphic, so that each developer can choose for herself how URLs need to be treated in her application. --Guido van Rossum (python.org/~guido)
Re: [Python-Dev] bytes / unicode
On Jun 23, 2010, at 08:43 AM, Guido van Rossum wrote: So I propose that we drop the "are URLs text or bytes" discussion and try to find something more pragmatic to discuss. email has exactly the same question, and the answer is yes. <wink> For example: how we can make the suite of functions used for URL processing more polymorphic, so that each developer can choose for herself how URLs need to be treated in her application. I think email package hackers should watch this effort closely. RDM has written some stuff up on how we think we're going to handle this, though it's probably pretty email package specific. Maybe there's a better, general, or conventional approach lurking around somewhere. http://wiki.python.org/moin/Email%20SIG -Barry
Re: [Python-Dev] bytes / unicode
Tres Seaver tsea...@palladion.com wrote: Stephen J. Turnbull wrote: We do need str-based implementations of modules like urllib. Why would that be? URLs aren't text, and never will be. The fact that to the eye they may seem to be text-ish doesn't make them text. URLs are exactly text (strings, representable as Unicode strings in Py3K), and were designed as such from the start. The fact that some of the things tunneled or carried in URLs are string representations of non-string data shouldn't obscure that point. They're not text-ish, they're text. They're not opaque, either; they break down in well-specified ways, mainly into strings. The trouble comes in when we try to go beyond the spec, or handle things that don't conform to the spec. Sure, a path component of a URI might actually be a %-escaped sequence of arbitrary bytes, even bytes that don't represent a string in any known encoding, but that's only *after* reversing the %-escapes, which should happen in a scheme-specific piece of code, not in generic URL parsing or manipulation. Bill
Re: [Python-Dev] bytes / unicode
On Wed, Jun 23, 2010 at 10:30 AM, Tres Seaver tsea...@palladion.com wrote: Stephen J. Turnbull wrote: We do need str-based implementations of modules like urllib. Why would that be? URLs aren't text, and never will be. The fact that to the eye they may seem to be text-ish doesn't make them text. This *is* a case where "don't make me think" is a losing proposition: programmers who work with URLs in any non-opaque way as text are eventually going to be bitten by this issue no matter how hard we wave our hands. HTML is text, and URLs are embedded in that text, so it's easy to get a URL that is text. Though, with a little testing, I notice that text alone can't tell you what the right URL really is (at least the intended URL when unsafe characters are embedded in HTML). To test I created two pages, one in Latin-1 and another in UTF-8, and put in the link: ./test.html?param=Réunion On the Latin-1 page it created a link to test.html?param=R%E9union and on the UTF-8 page it created a link to test.html?param=R%C3%A9union (the second link displays in the URL bar as test.html?param=Réunion but copies with percent encoding). Though if you link to ./Réunion.html then both pages create UTF-8 links. And both pages also link http://Réunion.com to http://xn--runion-bva.com/. So really neither bytes nor text works completely; query strings receive the encoding of the page, which would be handled transparently if you worked on the page's bytes. Path and domain are consistently encoded with UTF-8 and punycode respectively and so would be handled best when treated as text. And of course if you are a page with a non-ASCII-compatible encoding you really must handle encodings before the URL is sensible. Another issue here is that there's no encoding for turning a URL into bytes if the URL is not already ASCII.
A proper way to encode a URL would be: (Totally as an aside, as I remind myself of new module names I notice it's not easy to google specifically for Python 3 docs, e.g. "python 3 urlsplit" gives me 2.6 docs)

    from urllib.parse import urlsplit, urlunsplit
    import encodings.idna

    def encode_http_url(url, page_encoding='ASCII', errors='strict'):
        scheme, netloc, path, query, fragment = urlsplit(url)
        scheme = scheme.encode('ASCII', errors)
        auth = port = None
        if '@' in netloc:
            auth, netloc = netloc.split('@', 1)
        if ':' in netloc:
            netloc, port = netloc.split(':', 1)
        netloc = encodings.idna.ToASCII(netloc)
        if port:
            netloc = netloc + b':' + port.encode('ASCII', errors)
        if auth:
            netloc = auth.encode('UTF-8', errors) + b'@' + netloc
        path = path.encode('UTF-8', errors)
        query = query.encode(page_encoding, errors)
        fragment = fragment.encode('UTF-8', errors)
        return urlunsplit_bytes((scheme, netloc, path, query, fragment))

Where urlunsplit_bytes handles bytes (urlunsplit does not). It's helpful for me at least to look at that code specifically:

    def urlunsplit(components):
        scheme, netloc, url, query, fragment = components
        if netloc or (scheme and scheme in uses_netloc and url[:2] != '//'):
            if url and url[:1] != '/':
                url = '/' + url
            url = '//' + (netloc or '') + url
        if scheme:
            url = scheme + ':' + url
        if query:
            url = url + '?' + query
        if fragment:
            url = url + '#' + fragment
        return url

In this case it really would be best to have Python 2's system where things are coerced to ASCII implicitly. Or, more specifically, if all those string literals in that routine could be implicitly converted to bytes using ASCII. Conceptually I think this is reasonable, as for URLs (at least with HTTP, but in practice I think this applies to all URLs) the ASCII bytes really do have meaning. That is, '/' (*in the context of urlunsplit*) really is \x2f specifically. Or another example, making a GET request really means sending the bytes \x47\x45\x54 and there is no other set of bytes that has that meaning.
The WebSockets specification for instance defines things like colon: http://tools.ietf.org/html/draft-hixie-thewebsocketprotocol-76#page-5 -- in an earlier version they even used bytes to describe HTTP ( http://tools.ietf.org/html/draft-hixie-thewebsocketprotocol-54#page-13), though this annoyed many people.

-- Ian Bicking | http://blog.ianbicking.org
Re: [Python-Dev] bytes / unicode
Guido van Rossum gu...@python.org wrote: So I propose that we drop the discussion "are URLs text or bytes" and try to find something more pragmatic to discuss. For example: how we can make the suite of functions used for URL processing more polymorphic, so that each developer can choose for herself how URLs need to be treated in her application.

While I agree with "find something more pragmatic to discuss", it also seems to me that introducing polymorphic URL processing might make things more confusing and error-prone. The bigger problem seems to be that we're revisiting the design discussion about urllib.parse from the summer of 2008. See http://bugs.python.org/issue3300 if you want to recall how we hashed this out 2 years ago. I didn't particularly like that design, but I had to go off on vacation :-), and things got settled while I was away. I haven't heard much from Matt Giuca since he stopped by and lobbed that patch into the standard library. But since Guido is the one who settled it, why are we talking about it again?

Bill
Re: [Python-Dev] bytes / unicode
Oops, I forgot some important quoting (important for the algorithm, maybe not actually for the discussion)...

    from urllib.parse import urlsplit, urlunsplit
    import encodings.idna

    # urllib.parse.quote both always returns str, and is not as
    # conservative in quoting as required here...
    def quote_unsafe_bytes(b):
        result = []
        for c in b:
            if c < 0x20 or c >= 0x80:
                result.extend(('%%%02X' % c).encode('ASCII'))
            else:
                result.append(c)
        return bytes(result)

    def encode_http_url(url, page_encoding='ASCII', errors='strict'):
        scheme, netloc, path, query, fragment = urlsplit(url)
        scheme = scheme.encode('ASCII', errors)
        auth = port = None
        if '@' in netloc:
            auth, netloc = netloc.split('@', 1)
        if ':' in netloc:
            netloc, port = netloc.split(':', 1)
        netloc = encodings.idna.ToASCII(netloc)
        if port:
            netloc = netloc + b':' + port.encode('ASCII', errors)
        if auth:
            netloc = quote_unsafe_bytes(auth.encode('UTF-8', errors)) + b'@' + netloc
        path = quote_unsafe_bytes(path.encode('UTF-8', errors))
        query = quote_unsafe_bytes(query.encode(page_encoding, errors))
        fragment = quote_unsafe_bytes(fragment.encode('UTF-8', errors))
        return urlunsplit_bytes((scheme, netloc, path, query, fragment))

-- Ian Bicking | http://blog.ianbicking.org
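Restated standalone (with the `<` and `>=` comparisons written out, since HTML-escaping in archives tends to eat them), the quote_unsafe_bytes helper behaves like this:

```python
def quote_unsafe_bytes(b):
    # Percent-encode control bytes and anything non-ASCII;
    # pass printable ASCII bytes through unchanged.
    result = []
    for c in b:
        if c < 0x20 or c >= 0x80:
            result.extend(('%%%02X' % c).encode('ASCII'))
        else:
            result.append(c)
    return bytes(result)

print(quote_unsafe_bytes('café'.encode('utf-8')))  # b'caf%C3%A9'
print(quote_unsafe_bytes(b'abc'))                  # b'abc'
```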
Re: [Python-Dev] bytes / unicode
On Jun 22, 2010, at 8:57 PM, Robert Collins wrote: bzr has a cache of decoded strings in it precisely because decode is slow. We accept slowness encoding to the user's locale because that's typically much less data to examine than we've examined while generating the commit/diff/whatever. We also face memory pressure on a regular basis, and that has been, at least partly, due to UCS4 - our translation cache helps there because we have less duplicate UCS4 strings.

Thanks for setting the record straight - apologies if I missed this earlier in the thread. It does seem vaguely familiar.
Re: [Python-Dev] bytes / unicode
Bill Janssen wrote: The bigger problem seems to be that we're revisiting the design discussion about urllib.parse from the summer of 2008. See http://bugs.python.org/issue3300 if you want to recall how we hashed this out 2 years ago. I didn't particularly like that design, but I had to go off on vacation :-), and things got settled while I was away. I haven't heard much from Matt Giuca since he stopped by and lobbed that patch into the standard library. But since Guido is the one who settled it, why are we talking about it again?

Perhaps such decisions need revisiting in light of subsequent experience / pain / learning. E.g.:

- the repeated inability of the web-sig to converge on appropriate semantics for a Python3-compatible version of the WSGI spec;
- the subsequent quirkiness of the Python3 wsgiref implementation;
- the breakage in cgi.py which prevents handling file uploads in a web application;
- the slow adoption / porting rate of major web frameworks and libraries to Python 3.

Tres.

-- Tres Seaver +1 540-429-0999 tsea...@palladion.com Palladion Software "Excellence by Design" http://palladion.com
Re: [Python-Dev] bytes / unicode
On Wed, 23 Jun 2010 14:23:33 -0400 Tres Seaver tsea...@palladion.com wrote: Perhaps such decisions need revisiting in light of subsequent experience / pain / learning. E.g.: - the repeated inability of the web-sig to converge on appropriate semantics for a Python3-compatible version of the WSGI spec; - the subsequent quirkiness of the Python3 wsgiref implementation;

The way wsgiref was adapted is admittedly suboptimal. It was totally broken at first, and PJE didn't want to look very deeply into it. We therefore had to settle on a series of small modifications that seemed rather reasonable, but without any in-depth discussion of what WSGI had to look like under Python 3 (since it was not our job and responsibility). Therefore, I don't think wsgiref should be taken as a guide to what a cleaned up, Python 3-specific WSGI must look like.

- the slow adoption / porting rate of major web frameworks and libraries to Python 3.

Some of the major web frameworks and libraries have a ton of dependencies, which would explain why they really haven't bothered yet. I don't think you can claim, though, that Python 3 makes things significantly harder for these frameworks. The proof is that many of them already give the user unicode strings in Python 2.x. They must have somehow got the decoding right.

Regards

Antoine.
Re: [Python-Dev] bytes / unicode
On Wed, Jun 23, 2010 at 09:36:45PM +0200, Antoine Pitrou wrote: On Wed, 23 Jun 2010 14:23:33 -0400 Tres Seaver tsea...@palladion.com wrote: - the slow adoption / porting rate of major web frameworks and libraries to Python 3. Some of the major web frameworks and libraries have a ton of dependencies, which would explain why they really haven't bothered yet. I don't think you can claim, though, that Python 3 makes things significantly harder for these frameworks. The proof is that many of them already give the user unicode strings in Python 2.x. They must have somehow got the decoding right.

Note that this assumption seems optimistic to me. I started talking to Graham Dumpleton, author of mod_wsgi, a couple years back because mod_wsgi and paste do decoding of bytes to unicode at different layers, which caused problems for application level code that should otherwise run fine when being served by mod_wsgi or paste httpserver. That was the beginning of Graham starting to talk about what the wsgi spec really should look like under python3 instead of the broken way that the appendix to the current wsgi spec states.

-Toshio
Re: [Python-Dev] bytes / unicode
On Wed, 23 Jun 2010 17:30:22 -0400 Toshio Kuratomi a.bad...@gmail.com wrote: Note that this assumption seems optimistic to me. I started talking to Graham Dumpleton, author of mod_wsgi a couple years back because mod_wsgi and paste do decoding of bytes to unicode at different layers which caused problems for application level code that should otherwise run fine when being served by mod_wsgi or paste httpserver. That was the beginning of Graham starting to talk about what the wsgi spec really should look like under python3 instead of the broken way that the appendix to the current wsgi spec states.

Ok, but the reason would be that the WSGI spec is broken. Not Python 3 itself.

Regards

Antoine.
Re: [Python-Dev] bytes / unicode
On Wed, Jun 23, 2010 at 11:35:12PM +0200, Antoine Pitrou wrote: On Wed, 23 Jun 2010 17:30:22 -0400 Toshio Kuratomi a.bad...@gmail.com wrote: Note that this assumption seems optimistic to me. I started talking to Graham Dumpleton, author of mod_wsgi a couple years back because mod_wsgi and paste do decoding of bytes to unicode at different layers which caused problems for application level code that should otherwise run fine when being served by mod_wsgi or paste httpserver. That was the beginning of Graham starting to talk about what the wsgi spec really should look like under python3 instead of the broken way that the appendix to the current wsgi spec states. Ok, but the reason would be that the WSGI spec is broken. Not Python 3 itself.

Agreed. Neither python2 nor python3 is broken. It's the wsgi spec and the implementation of that spec where things fall down. From your first post, I thought you were claiming that python3 was broken since web frameworks got decoding right on python2 and I just wanted to defend python3 by showing that python2 wasn't all sunshine and roses.

-Toshio
Re: [Python-Dev] bytes / unicode
On Tue, Jun 22, 2010 at 11:58:57AM +0900, Stephen J. Turnbull wrote: Toshio Kuratomi writes: One comment here -- you can also have uri's that aren't decodable into their true textual meaning using a single encoding. Apache will happily serve out uris that have utf-8, shift-jis, and euc-jp components inside of their path but the textual representation that was intended will be garbled (or be represented by escaped byte sequences). For that matter, apache will serve requests that have no true textual representation as it is working on the byte level rather than the character level. Sure. I've never seen that combination, but I have seen Shift JIS and KOI8-R in the same path. But in that case, just using 'latin-1' as the encoding allows you to use the (unicode) string operations internally, and then spew your mess out into the world for someone else to clean up, just as using bytes would.

This is true. I'm giving this as a real-world counter example to the assertion that URIs are text. In fact, I think you're confusing things a little by asserting that the RFC says that URIs are text. I'll address that in two sections down.

So a complete solution really should allow the programmer to pass in uris as bytes when the programmer knows that they need it.

Other than passing bytes into a constructor, I would argue if a complete solution requires, eg, an interface that allows urljoin(base, subdir) where the types of base and subdir are not required to match, then it doesn't belong in the stdlib. For stdlib usage, that's premature optimization IMO.

I'll definitely buy that. Would urljoin(b_base, b_subdir) => bytes and urljoin(u_base, u_subdir) => unicode be acceptable though? (I think, given other options, I'd rather see two separate functions, though. It seems more discoverable and less prone to taking bad input some of the time to have two functions that clearly only take one type of data apiece.)
The RFC says that URIs are text, and therefore they can (and IMO should) be operated on as text in the stdlib.

If I'm reading the RFC correctly, you're actually operating on two different levels here. Here's the section 2 that you quoted earlier, now in its entirety::

    2. Characters

    The URI syntax provides a method of encoding data, presumably for the
    sake of identifying a resource, as a sequence of characters. The URI
    characters are, in turn, frequently encoded as octets for transport or
    presentation. This specification does not mandate any particular
    character encoding for mapping between URI characters and the octets
    used to store or transmit those characters. When a URI appears in a
    protocol element, the character encoding is defined by that protocol;
    without such a definition, a URI is assumed to be in the same character
    encoding as the surrounding text.

    The ABNF notation defines its terminal values to be non-negative
    integers (codepoints) based on the US-ASCII coded character set
    [ASCII]. Because a URI is a sequence of characters, we must invert
    that relation in order to understand the URI syntax. Therefore, the
    integer values used by the ABNF must be mapped back to their
    corresponding characters via US-ASCII in order to complete the syntax
    rules.

    A URI is composed from a limited set of characters consisting of
    digits, letters, and a few graphic symbols. A reserved subset of those
    characters may be used to delimit syntax components within a URI while
    the remaining characters, including both the unreserved set and those
    reserved characters not acting as delimiters, define each component's
    identifying data.

So here's some data that matches those terms up to actual steps in the process::

    # We start off with some arbitrary data that defines a resource. This is
    # not necessarily text. It's the "data" from the first sentence:
    data = b"\xff\xf0\xef\xe0"

    # We encode that into text and combine it with the scheme and host to form
    # a complete uri. This is the "URI characters" mentioned in section 2.
    # It's also the "sequence of characters" mentioned in 1.1 as it is not
    # until this point that we actually have a URI.
    uri = b"http://host/" + percentencoded(data)
    #
    # Note1: percentencoded() needs to take any bytes or characters outside of
    # the characters listed in section 2.3 (ALPHA / DIGIT / "-" / "." / "_"
    # / "~") and percent encode them. The URI can only consist of characters
    # from this set and the reserved character set (2.2).
    #
    # Note2: in this simplistic example, we're only dealing with one piece of
    # data. With multiple pieces, we'd need to combine them with separators,
    # for instance like this:
    # uri = b'http://host/' + percentencoded(data1) + b'/'
    #       + percentencoded(data2)
    #
    # Note3: at this point, the uri could be stored as unicode or bytes in
    # python3. It doesn't matter. It will be a subset of ASCII in either
    # case.

    # Then we
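The percentencoded() helper assumed in that walkthrough could be sketched like this, quoting everything outside the RFC 3986 unreserved set; the name and the bytes-in, bytes-out signature are the message's own assumptions, not a stdlib API:

```python
# RFC 3986 section 2.3 unreserved characters: ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = (b'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
              b'abcdefghijklmnopqrstuvwxyz'
              b'0123456789-._~')

def percentencoded(data):
    # data is arbitrary bytes; the result is ASCII bytes suitable
    # for splicing into a URI.
    out = bytearray()
    for c in data:
        if c in UNRESERVED:
            out.append(c)
        else:
            out.extend(b'%%%02X' % c)
    return bytes(out)

print(percentencoded(b'\xff\xf0\xef\xe0'))  # b'%FF%F0%EF%E0'
```

With it, the example URI from the message comes out as pure ASCII bytes regardless of what the original data was.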
Re: [Python-Dev] bytes / unicode
On Jun 21, 2010, at 10:58 PM, Stephen J. Turnbull wrote: The RFC says that URIs are text, and therefore they can (and IMO should) be operated on as text in the stdlib. No, *blue* is the best color for a shed. Oops, wait, let me try that again. While I broadly agree with this statement, it is really an oversimplification. An URI is a structured object, with many different parts, which are transformed from bytes to ASCII (or something latin1-ish, which is really just bytes with a nice face on them) to real, honest-to-goodness text via the IRI specification: http://tools.ietf.org/html/rfc3987. Note also that the complete solution argument cuts both ways. Eg, a complete solution should implement UTS 39 confusables detection[1] and IDNA[2]. Good luck doing that with bytes! And good luck doing that with just characters, too. You need a parsed representation of the URI that you can encode different parts of in different ways. (My understanding is that you should only really implement confusables detection in the netloc... while that may be a bogus example, you're certainly only supposed to do IDNA in the netloc!) You can just call urlsplit() all over the place to emulate this, but this does not give you the ability to go back to the original bytes, and thereby preserve things like brokenly-encoded segments, which seems to be what a lot of this hand-wringing is about. To put it another way, there is no possible information-preserving string or bytes type that will make everyone happy as a result from urljoin(). The only return-type that gives you *everything* is URI. just using 'latin-1' as the encoding allows you to use the (unicode) string operations internally, and then spew your mess out into the world for someone else to clean up, just as using bytes would. This is the limitation that everyone seems to keep dancing around. If you are using the stdlib, with functions that operate on sequences like 'str' or 'bytes', you need to choose from one of three options: 1. 
decode everything to latin1 (although I prefer to call it charmap when used in this way) so that you can have some mojibake that will fool a function that needs a unicode object, but not lose any information about your input so that it can be transformed back into exact bytes (and be very careful to never pass it somewhere that it will interact with real text!), 2. actually decode things to an appropriate encoding to be displayed to the user and manipulated with proper text-manipulation tools, and throw away information about the bytes, 3. keep both the bytes and the characters together (perhaps in a data structure) so that you can both display the data and encode it in situationally-appropriate ways. The stdlib as it is today is not going to handle the 3rd case for anyone. I think that's fine; it is not the stdlib's job to solve everyone's problems. I've been happy with it providing correctly-functioning pieces that can be used to build more elaborate solutions. This is what I meant when I said I agree with Stephen's first point: the stdlib *should* just keep operating entirely on strings, because URIs are defined, by the spec, to be sequences of ASCII characters. But that's not the whole story. PJE's bstr and ebytes proposals set my teeth on edge. I can totally understand the motivation for them, but I think it would be a big step backwards for python 3 to succumb to that temptation, even in the form of a third-party library. It is really trying to cram more information into a pile of bytes than truly exists there. (Also, if we're going to have encodings attached to bytes objects, I would very much like to add JPEG and FLAC to the list of possibilities.) The real tension there is that WSGI is desperately trying to avoid defining any data structures (i.e. classes), while still trying to work with structured data. An URI class with a 'child' method could handily solve this problem. 
You could happily call IRI(...).join(some bytes).join(some text) and then just say give me some bytes, it's time to put this on the network, or give me some characters, I have to show something to the user, or even give me some characters appropriate for an 'href=' target in some HTML I'm generating - although that last one could be left to the HTML generator, provided it could get enough information from the URI/IRI object's various parts itself. I don't mean to pick on WSGI, either. This is a common pain-point for porting software to 3.x - you had a string, it kinda worked most of the time before, but now you need to keep track of text too and the functions which seemed to work on bytes no longer do.
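Glyph's third option (keeping both the bytes and the characters together in a data structure) can be illustrated with a toy sketch; the class name and fallback policy here are invented for illustration, not part of any proposed API:

```python
class SegmentPair:
    """Keep the raw bytes of a URI segment alongside a best-effort text view."""

    def __init__(self, raw, encoding='utf-8'):
        self.raw = raw  # the exact original bytes, never lossy
        try:
            self.text = raw.decode(encoding)
        except UnicodeDecodeError:
            # Display-only fallback for brokenly-encoded segments.
            self.text = raw.decode('latin-1')

    def __repr__(self):
        return 'SegmentPair(%r, %r)' % (self.raw, self.text)

seg = SegmentPair(b'caf\xc3\xa9')
print(seg.text)  # café  (decoded view for display)
print(seg.raw)   # b'caf\xc3\xa9'  (exact bytes for the wire)
```

The point is that "give me bytes for the network" and "give me characters for the user" become two views of one object instead of a lossy one-way conversion.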
Re: [Python-Dev] bytes / unicode
On Jun 21, 2010, at 10:31 PM, Glyph Lefkowitz wrote: This is a common pain-point for porting software to 3.x - you had a string, it kinda worked most of the time before, but now you need to keep track of text too and the functions which seemed to work on bytes no longer do.

Thanks Glyph. That is a nice summary of one kind of challenge facing programmers.

Raymond
Re: [Python-Dev] bytes / unicode
Glyph Lefkowitz writes: On Jun 21, 2010, at 10:58 PM, Stephen J. Turnbull wrote: Note also that the complete solution argument cuts both ways. Eg, a complete solution should implement UTS 39 confusables detection[1] and IDNA[2]. Good luck doing that with bytes! And good luck doing that with just characters, too.

I agree with you, sorry. I meant to cast doubt on the idea of complete solutions, or at least claims that completeness is an excuse for putting it in the stdlib.

This is the limitation that everyone seems to keep dancing around. If you are using the stdlib, with functions that operate on sequences like 'str' or 'bytes', you need to choose from one of three options:

There's a *fourth* way: specially designed codecs to preserve as much metainformation as you need, while always using the str format internally. This can be done for at least 100,000 separate (character, encoding) pairs by multiplexing into private space with an auxiliary table of encodings and equivalences. That's probably overkill. In many cases, adding simple PEP 383 mechanism (to preserve uninterpreted bytes) might be enough though, and that's pretty plausible IMO.
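PEP 383's surrogateescape error handler (added in Python 3.1) is exactly this preserve-uninterpreted-bytes mechanism, and it round-trips cleanly:

```python
raw = b'caf\xe9/r\xc3\xa9union'  # mixed Latin-1 and UTF-8 bytes in one path

# Decode as UTF-8; the invalid byte 0xE9 is smuggled through as a
# lone surrogate (U+DCE9) instead of raising UnicodeDecodeError.
text = raw.decode('utf-8', 'surrogateescape')

# Encoding back with the same handler restores the exact original bytes.
roundtrip = text.encode('utf-8', 'surrogateescape')
assert roundtrip == raw
```

You get a str you can pass through str-based APIs, at the cost of the smuggled surrogates being unprintable and unsafe to hand to anything expecting real text.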
Re: [Python-Dev] bytes / unicode
Toshio Kuratomi writes: I'll definitely buy that. Would urljoin(b_base, b_subdir) => bytes and urljoin(u_base, u_subdir) => unicode be acceptable though? Probably. But it doesn't matter what I say, since Guido has defined that as polymorphism and approved it in principle. (I think, given other options, I'd rather see two separate functions, though. Yes.

If you want to deal with things like this:: http://host/café Yes. At that point you are no longer dealing with the sequence of characters talked about in the RFC. You are dealing with data which may or may not be text. That's right, and I think that in most cases that is what programmers want to be dealing with. Let the library make sure that what goes on the wire conforms to the RFC. I don't want to know about it, I want to work with the content of the URI.

The proliferation of encoding I agree is a thing that is ugly. Although, if I'm thinking correctly, that only matters when you want to allow mixing bytes and unicode, correct? Well you need to know a fair amount about the encoding: that the reserved bytes are used as defined in the RFC, for example.

For debugging, I'm either not understanding or you're wrong. If I'm given an arbitrary sequence of bytes how do I sanely store them as str internally? If it's really arbitrary, you use either a mapping to private space or PEP 383, and accept that it won't make sense. But in most cases you should be able to achieve a fair degree of sanity. If I transform them using an encoding that anticipates the full range of bytes I may be able to display some representation of them but it's not necessarily the sanest method of display (for instance, if I know that path element 1 is always going to be a utf8 encoded string and path element 2 is always shift-jis encoded, and path element 3 is binary data, I could construct a much saner display method than treating the whole thing as latin1).
And I think in most cases you will know, although the cases where you'll know will be because of a system-wide encoding. What is your basis for asserting that URIs that aren't sanely treated as text are garbage? I don't mean we can throw them away, I mean we can't do any sensible processing on them. You at least need to know about the reserved delimiters. In the same way that Philip used 'garbage' for the unknown encoding. And in the sense of garbage in, garbage out. unicode handling redesign. I'm stating my reading of the RFC not to defend the use case Philip has, but because I think that the outlook that non-text uris (before being percentencoded) are violations of the RFC That's not what I'm saying. What I'm trying to point out is that manipulating a bytes object as an URI sort of presumes a lot about its encoding as text. Since many of the URIs we deal with are more or less textual, why not take advantage of that?
Re: [Python-Dev] bytes / unicode
[Just addressing one little issue here; generally I'm just happy that we're discussing this issue in such detail from so many points of view.] On Mon, Jun 21, 2010 at 10:50 PM, Toshio Kuratomi a.bad...@gmail.com wrote: [...] Would urljoin(b_base, b_subdir) => bytes and urljoin(u_base, u_subdir) => unicode be acceptable though? (I think, given other options, I'd rather see two separate functions, though. It seems more discoverable and less prone to taking bad input some of the time to have two functions that clearly only take one type of data apiece.)

Hm. I'd rather see a single function (it would be "polymorphic" in my earlier terminology). After all a large number of string method calls (and some other utility function calls) already look the same regardless of whether they are handling bytes or text (as long as it's uniform). If the building blocks are all polymorphic it's easier to create additional polymorphic functions.

FWIW, there are two problems with polymorphic functions, though they can be overcome:

(1) Literals. If you write something like x.split('&') you are implicitly assuming x is text. I don't see a very clean way to overcome this; you'll have to implement some kind of type check, e.g.

    x.split('&') if isinstance(x, str) else x.split(b'&')

A handy helper function can be written:

    def literal_as(constant, variable):
        if isinstance(variable, str):
            return constant
        else:
            return constant.encode('utf-8')

So now you can write x.split(literal_as('&', x)).

(2) Data sources. These can be functions that produce new data from non-string data, e.g. str(int), read it from a named file, etc. An example is read() vs. write(): it's easy to create a (hypothetical) polymorphic stream object that accepts both f.write('booh') and f.write(b'booh'); but you need some other hack to make read() return something that matches a desired return type.
I don't have a generic suggestion for a solution; for streams in particular, the existing distinction between binary and text streams works, of course, but there are other situations where this doesn't generalize (I think some XML interfaces have this awkwardness in their API for converting a tree to a string).

-- --Guido van Rossum (python.org/~guido)
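Guido's literal_as helper makes a split polymorphic over str and bytes; a quick check (assuming the '&' separator that HTML-escaping appears to have stripped from the examples):

```python
def literal_as(constant, variable):
    # Return the str literal as-is, or encoded to bytes,
    # to match the type of the value being operated on.
    if isinstance(variable, str):
        return constant
    else:
        return constant.encode('utf-8')

def split_query(x):
    # Works on either str or bytes input, returning the matching type.
    return x.split(literal_as('&', x))

print(split_query('a=1&b=2'))   # ['a=1', 'b=2']
print(split_query(b'a=1&b=2'))  # [b'a=1', b'b=2']
```

One helper per literal keeps the polymorphic function free of explicit isinstance checks at every call site.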
Re: [Python-Dev] bytes / unicode
On Tue, Jun 22, 2010 at 6:31 AM, Stephen J. Turnbull step...@xemacs.org wrote: Toshio Kuratomi writes: I'll definitely buy that. Would urljoin(b_base, b_subdir) => bytes and urljoin(u_base, u_subdir) => unicode be acceptable though? Probably. But it doesn't matter what I say, since Guido has defined that as polymorphism and approved it in principle. (I think, given other options, I'd rather see two separate functions, though. Yes. If you want to deal with things like this:: http://host/café Yes.

Just for perspective, I don't know if I've ever wanted to deal with a URL like that. I know how it is supposed to work, and I know what a browser does with that, but so many tools will clean that URL up *or* won't be able to deal with it at all that it's not something I'll be passing around. So from a practical point of view this really doesn't come up, and if it did it would be in a situation where you could easily do something ad hoc (though there is not currently a routine to quote unsafe characters in a URL... that would be helpful, though maybe urllib.quote(url.encode('utf8'), '%/:') would do it). Also while it is problematic to treat the URL-unquoted value as text (because it has an unknown encoding, no encoding, or regularly a mixture of encodings), the URL-quoted value is pretty easy to pass around, and normalization (in this case to http://host/caf%C3%A9) is generally fine.

While it's nice to be correct about encodings, sometimes it is impractical. And it is far nicer to avoid the situation entirely. That is, decoding content you don't care about isn't just inefficient, it's complicated and can introduce errors. The encoding of the underlying bytes of a %-decoded URL is largely uninteresting. Browsers (whose behavior drives a lot of convention) don't touch any of that encoding except lately occasionally to *display* some data in a more friendly way. But it's only display, and errors just make it revert to the old encoded display.
Similarly I'd expect (from experience) that a programmer using Python would want to take the same approach, sticking with unencoded data in nearly all situations. -- Ian Bicking | http://blog.ianbicking.org ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
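Ian's parenthetical about urllib.quote translates to urllib.parse.quote in Python 3. A minimal sketch of the ad hoc normalization he describes, using an illustrative URL; quote() percent-encodes the UTF-8 bytes of non-ASCII characters while leaving the characters listed in `safe` alone:

```python
from urllib.parse import quote, unquote

# Illustrative path with a non-ASCII character.
path = "café"
print("http://host/" + quote(path, safe="/%"))  # http://host/caf%C3%A9

# Going the other way recovers the text, assuming UTF-8 underneath.
print(unquote("caf%C3%A9"))  # café
```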
Re: [Python-Dev] bytes / unicode
On Tue, Jun 22, 2010 at 08:31:13PM +0900, Stephen J. Turnbull wrote: Toshio Kuratomi writes: unicode handling redesign. I'm stating my reading of the RFC not to defend the use case Philip has, but because I think that the outlook that non-text uris (before being percent-encoded) are violations of the RFC

That's not what I'm saying. What I'm trying to point out is that manipulating a bytes object as an URI sort of presumes a lot about its encoding as text.

I think we're more or less in agreement now, but here I'm not sure. What manipulations are you thinking about? Which stage of URI construction are you considering? I've just taken a quick look at python3.1's urllib module and I see that there is a bit of confusion there. But it's not about unicode vs bytes but about whether a URI should be operated on at the real-URI level or the data-that-makes-a-URI level.

* all functions I looked at take python3 str rather than bytes, so there's no confusing stuff here
* urllib.request.urlopen takes a strict URI. That means that you must have a percent-encoded URI at this point
* urllib.parse.urljoin takes regular string values
* urllib.parse.urlparse and urllib.parse.urlunparse take regular string values

Since many of the URIs we deal with are more or less textual, why not take advantage of that?

Cool, so to summarize what I think we agree on:

* Percent-encoded URIs are text according to the RFC.
* The data that is used to construct the URI is not defined as text by the RFC.
* However, it is very often text in an unspecified encoding
* It is extremely convenient for programmers to be able to treat the data that is used to form a URI as text in nearly all common cases.

-Toshio
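A short illustration of the str-level behaviour summarized above, using illustrative URLs; a percent-encoded URI passes through urlparse/urlunparse unchanged:

```python
from urllib.parse import urljoin, urlparse, urlunparse

# All of these operate on str in Python 3.
print(urljoin("http://host/a/b", "c"))  # http://host/a/c

# Parsing keeps the path in its percent-encoded textual form...
parts = urlparse("http://host/caf%C3%A9?q=1")
print(parts.path)  # /caf%C3%A9

# ...and unparsing round-trips the URI exactly.
assert urlunparse(parts) == "http://host/caf%C3%A9?q=1"
```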
Re: [Python-Dev] bytes / unicode
On Jun 22, 2010, at 1:03 PM, Ian Bicking wrote: Similarly I'd expect (from experience) that a programmer using Python would want to take the same approach, sticking with unencoded data in nearly all situations.

Yeah. This is a real issue I have with the direction Python3 went: it pushes you into decoding everything to unicode early, even when you don't care -- all you really wanted to do is pass it from one API to another, with some well-defined transformations, which don't actually depend on it having been decoded properly. (For example, extracting the path from the URL and attempting to open it as a file on the filesystem.) This means that Python3 programs can become *more* fragile in the face of random data you encounter out in the real world, rather than less fragile, which was the goal of the whole exercise.

The surrogateescape method is a nice workaround for this, but I can't help thinking that it might've been better to just treat stuff as possibly-invalid-but-probably-utf8 byte-strings from input, through processing, to output. It seems kinda too late for that, though: next time someone designs a language, they can try that. :) James
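A minimal sketch of the surrogateescape round-trip mentioned above (the byte string is an invented example): bytes that are invalid UTF-8 are smuggled through str as lone surrogates and restored exactly when re-encoded with the same error handler:

```python
# latin-1 bytes for 'café path'; \xe9 is invalid as UTF-8.
data = b"caf\xe9 path"

# Decoding with surrogateescape never fails; the bad byte becomes
# the lone surrogate U+DCE9.
text = data.decode("utf-8", "surrogateescape")
print(ascii(text))  # 'caf\udce9 path'

# Encoding with the same handler restores the original bytes exactly.
assert text.encode("utf-8", "surrogateescape") == data
```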
Re: [Python-Dev] bytes / unicode
Guido van Rossum wrote: [Just addressing one little issue here; generally I'm just happy that we're discussing this issue in such detail from so many points of view.]

On Mon, Jun 21, 2010 at 10:50 PM, Toshio Kuratomi a.bad...@gmail.com wrote: [...] Would urljoin(b_base, b_subdir) = bytes and urljoin(u_base, u_subdir) = unicode be acceptable though? (I think, given other options, I'd rather see two separate functions, though. It seems more discoverable and less prone to taking bad input some of the time to have two functions that clearly only take one type of data apiece.)

Hm. I'd rather see a single function (it would be polymorphic in my earlier terminology). After all a large number of string method calls (and some other utility function calls) already look the same regardless of whether they are handling bytes or text (as long as it's uniform). If the building blocks are all polymorphic it's easier to create additional polymorphic functions.

FWIW, there are two problems with polymorphic functions, though they can be overcome: (1) Literals. If you write something like x.split('') you are implicitly assuming x is text. I don't see a very clean way to overcome this; you'll have to implement some kind of type check, e.g. x.split('') if isinstance(x, str) else x.split(b''). A handy helper function can be written:

def literal_as(constant, variable):
    if isinstance(variable, str):
        return constant
    else:
        return constant.encode('utf-8')

So now you can write x.split(literal_as('', x)).

This polymorphism is what we used in Python2 a lot to write code that works for both Unicode and 8-bit strings. Unfortunately, this no longer works as easily in Python3 due to the literals sometimes having the wrong type, and using such a helper function slows things down a lot. It would be great if we could have something like the above as a builtin method: x.split(''.as(x)) Perhaps something to discuss on the language summit at EuroPython.
Too bad we can't add such porting enhancements to Python2 anymore. -- Marc-Andre Lemburg, eGenix.com
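For reference, a runnable version of the literal_as helper quoted above; the ',' separator is my own illustrative literal, since the one in the archived post was lost:

```python
def literal_as(constant, variable):
    """Return the str literal `constant`, encoded to bytes if
    `variable` is a bytes object, so the types always match."""
    if isinstance(variable, str):
        return constant
    return constant.encode("utf-8")

# The same call works for both str and bytes inputs.
for x in ("a,b", b"a,b"):
    print(x.split(literal_as(",", x)))
# ['a', 'b']
# [b'a', b'b']
```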
Re: [Python-Dev] bytes / unicode
On 6/22/2010 1:22 AM, Glyph Lefkowitz wrote: The thing that I have heard in passing from a couple of folks with experience in this area is that some older software in asia would present characters differently if they were originally encoded in a japanese encoding versus a chinese encoding, even though they were really the same characters.

As I tried to say in another post, that to me is similar to wanting to present English text in different fonts depending on whether spoken by an American or a Brit, or a modern person versus a Renaissance person.

I do know that Han Unification is a giant political mess (http://en.wikipedia.org/wiki/Han_unification makes for some interesting reading), but my understanding is that it has handled enough of the cases by now that one can write software to display asian languages and it will basically work with a modern version of unicode. (And of course, there's always the private use area, as Stephen Turnbull pointed out.)

Thanks, I will take a look.

Regardless, this is another example where keeping around a string isn't really enough. If you need to display a japanese character in a distinct way because you are operating in the japanese *script*, you need a tag surrounding your data that is a hint to its presentation. The fact that these presentation hints were sometimes determined by their encoding is an unfortunate historical accident.

Yes. The asian languages I know anything about seem to natively have almost none of the symbols English has, many borrowed from math, that have been pressed into service for text markup. -- Terry Jan Reedy
Re: [Python-Dev] bytes / unicode
On 6/22/2010 12:53 PM, Guido van Rossum wrote: On Mon, Jun 21, 2010 at 11:47 PM, Raymond Hettinger raymond.hettin...@gmail.com wrote: On Jun 21, 2010, at 10:31 PM, Glyph Lefkowitz wrote: This is a common pain-point for porting software to 3.x - you had a string, it kinda worked most of the time before, but now you need to keep track of text too and the functions which seemed to work on bytes no longer do. Thanks Glyph. That is a nice summary of one kind of challenge facing programmers. Ironically, Glyph also described the pain in 2.x: it only kinda worked.

The people with problematic code to convert must include some who managed to tolerate and perhaps suppress the pain. I suspect that conversion attempts bring it back to the surface. It is natural to blame the re-surfacer rather than the original source. (As in 'blame the messenger'.) -- Terry Jan Reedy
Re: [Python-Dev] bytes / unicode
On Tue, Jun 22, 2010 at 1:07 PM, James Y Knight f...@fuhm.net wrote: The surrogateescape method is a nice workaround for this, but I can't help thinking that it might've been better to just treat stuff as possibly-invalid-but-probably-utf8 byte-strings from input, through processing, to output. It seems kinda too late for that, though: next time someone designs a language, they can try that. :)

surrogateescape does help a lot; my only problem with it is that it's out-of-band information. That is, if you have data that went through data.decode('utf8', 'surrogateescape') you can restore it to bytes or transcode it to another encoding, but you have to know that it was decoded specifically that way. And of course if you did have to transcode it (e.g., text.encode('utf8', 'surrogateescape').decode('latin1')) then if you had actually handled the text in any way you may have broken it; you don't *really* have valid text. A lazier solution feels like it would be easier and more transparent to work with.

But... I also don't see any major language constraint to having another kind of string that is bytes+encoding. I think PJE brought up a problem with a couple of coercion aspects. -- Ian Bicking | http://blog.ianbicking.org
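Ian's transcoding example, made concrete with invented data; the caveat he raises is that nothing in the resulting str records that surrogateescape was used, so the final step only works if you know the history of the value:

```python
# latin-1 bytes for 'café'; invalid as UTF-8.
raw = b"caf\xe9"

# Decoding with surrogateescape smuggles the bad byte through as U+DCE9.
text = raw.decode("utf-8", "surrogateescape")

# Re-encoding with surrogateescape restores the original bytes, which
# can then be decoded as latin-1 -- but only because we *know* that
# surrogateescape was applied in the first place.
fixed = text.encode("utf-8", "surrogateescape").decode("latin-1")
print(fixed)  # café
```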
Re: [Python-Dev] bytes / unicode
On Wed, Jun 23, 2010 at 6:09 AM, M.-A. Lemburg m...@egenix.com wrote: return constant.encode('utf-8') So now you can write x.split(literal_as('', x)). This polymorphism is what we used in Python2 a lot to write code that works for both Unicode and 8-bit strings. Unfortunately, this no longer works as easily in Python3 due to the literals sometimes having the wrong type and using such a helper function slows things down a lot.

It didn't work in 2 either - see for instance the traceback module with an Exception with unicode args and a non-ascii file path - the file path is in its bytes form, the string joining logic triggers an implicit upcast and *boom*.

Too bad we can't add such porting enhancements to Python2 anymore

Perhaps a 'py3compat' module on pypi, with things like the py._builtin reraise helper and so forth? -Rob
Re: [Python-Dev] bytes / unicode
On Wed, Jun 23, 2010 at 2:17 AM, Guido van Rossum gu...@python.org wrote: (1) Literals. If you write something like x.split('') you are implicitly assuming x is text. I don't see a very clean way to overcome this; you'll have to implement some kind of type check e.g. x.split('') if isinstance(x, str) else x.split(b'') A handy helper function can be written: def literal_as(constant, variable): if isinstance(variable, str): return constant else: return constant.encode('utf-8') So now you can write x.split(literal_as('', x)).

I think this is a key point. In checking the behaviour of the os module bytes APIs (see below), I used a simple filter along the lines of: [x for x in seq if x.endswith(b)] It would be nice if code along those lines could easily be made polymorphic. Maybe what we want is a new class method on bytes and str (this idea is similar to what MAL suggests later in the thread):

def coerce(cls, obj, encoding=None, errors='surrogateescape'):
    if isinstance(obj, cls):
        return obj
    if encoding is None:
        encoding = sys.getdefaultencoding()
    # This is the str version; bytes.coerce would use obj.encode() instead
    return obj.decode(encoding, errors)

Then my example above could be made polymorphic (for ASCII-compatible encodings) by writing: [x for x in seq if x.endswith(x.coerce(b))]

I'm trying to see downsides to this idea, and I'm not really seeing any (well, other than 2.7 being almost out the door and the fact we'd have to grant ourselves an exception to the language moratorium).

(2) Data sources. These can be functions that produce new data from non-string data, e.g. str(int), read it from a named file, etc. An example is read() vs. write(): it's easy to create a (hypothetical) polymorphic stream object that accepts both f.write('booh') and f.write(b'booh'); but you need some other hack to make read() return something that matches a desired return type.
I don't have a generic suggestion for a solution; for streams in particular, the existing distinction between binary and text streams works, of course, but there are other situations where this doesn't generalize (I think some XML interfaces have this awkwardness in their API for converting a tree to a string).

We may need to use the os and io modules as the precedents here:

os: normal API is text using the surrogateescape error handler, parallel bytes API exposes raw bytes. The parallel API is polymorphic if possible (e.g. os.listdir), but appends a 'b' to the name if the polymorphic approach isn't practical (e.g. os.environb, os.getcwdb, os.getenvb).

io: layered API, where both the raw bytes of the wire protocol and the decoded text of the text layer are available.

Regards, Nick. -- Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
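The os-module precedent Nick describes can be seen directly; a small sketch, using the current directory as sample input:

```python
import os

# os.listdir is polymorphic: the argument type selects the result type.
str_names = os.listdir(".")
bytes_names = os.listdir(b".")
assert all(isinstance(n, str) for n in str_names)
assert all(isinstance(n, bytes) for n in bytes_names)

# Where polymorphism isn't practical, a parallel 'b' name is used instead.
print(type(os.getcwd()).__name__, type(os.getcwdb()).__name__)  # str bytes
```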
Re: [Python-Dev] bytes / unicode
On Wed, Jun 23, 2010 at 4:09 AM, M.-A. Lemburg m...@egenix.com wrote: It would be great if we could have something like the above as a builtin method: x.split(''.as(x))

As per my other message, another possible (and reasonably intuitive) spelling would be: x.split(x.coerce('')) Writing it as a helper function is also possible, although it would be trickier to remember the correct argument ordering:

def coerce_to(target, obj, encoding=None, errors='surrogateescape'):
    if isinstance(obj, type(target)):
        return obj
    if encoding is None:
        encoding = sys.getdefaultencoding()
    try:
        convert = obj.decode
    except AttributeError:
        convert = obj.encode
    return convert(encoding, errors)

x.split(coerce_to(x, ''))

Perhaps something to discuss on the language summit at EuroPython. Too bad we can't add such porting enhancements to Python2 anymore.

Well, we can if we really want to, it just entails convincing Benjamin to reschedule the 2.7 final release. Given the UserDict/ABC/old-style classes issue, there's a fair chance there's going to be at least one more 2.7 RC anyway. That said, since this kind of coercion can be done in a helper function, that should be adequate for the 2.x to 3.x conversion case (for 2.x, the helper function can be defined to just return the second argument since bytes and str are the same type, while the 3.x version would look something like the code above). Cheers, Nick. -- Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
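A runnable sketch of the coerce_to helper above; the ',' literal is my own illustrative stand-in for the one lost in the archive:

```python
import sys

def coerce_to(target, obj, encoding=None, errors="surrogateescape"):
    """Coerce `obj` (str or bytes) to the type of `target`."""
    if isinstance(obj, type(target)):
        return obj
    if encoding is None:
        encoding = sys.getdefaultencoding()
    try:
        convert = obj.decode   # obj is bytes, target is str
    except AttributeError:
        convert = obj.encode   # obj is str, target is bytes
    return convert(encoding, errors)

# One call site handles both str and bytes inputs.
for x in ("a,b", b"a,b"):
    print(x.split(coerce_to(x, ",")))
# ['a', 'b']
# [b'a', b'b']
```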
Re: [Python-Dev] bytes / unicode
On 22/06/2010 22:40, Robert Collins wrote: On Wed, Jun 23, 2010 at 6:09 AM, M.-A. Lemburg m...@egenix.com wrote: return constant.encode('utf-8') So now you can write x.split(literal_as('', x)). This polymorphism is what we used in Python2 a lot to write code that works for both Unicode and 8-bit strings. Unfortunately, this no longer works as easily in Python3 due to the literals sometimes having the wrong type and using such a helper function slows things down a lot.

It didn't work in 2 either - see for instance the traceback module with an Exception with unicode args and a non-ascii file path - the file path is in its bytes form, the string joining logic triggers an implicit upcast and *boom*.

Yeah, there are still a few places in unittest where a unicode exception can cause the whole test run to bomb out. No-one has *yet* reported these as bugs, and I try to ferret them out as I find them. All the best, Michael

Too bad we can't add such porting enhancements to Python2 anymore

Perhaps a 'py3compat' module on pypi, with things like the py._builtin reraise helper and so forth? -Rob

-- http://www.ironpythoninaction.com/ http://www.voidspace.org.uk/blog

READ CAREFULLY. By accepting and reading this email you agree, on behalf of your employer, to release me from all obligations and waivers arising from any and all NON-NEGOTIATED agreements, licenses, terms-of-service, shrinkwrap, clickwrap, browsewrap, confidentiality, non-disclosure, non-compete and acceptable use policies ("BOGUS AGREEMENTS") that I have entered into with your employer, its partners, licensors, agents and assigns, in perpetuity, without prejudice to my ongoing rights and privileges. You further represent that you have the authority to release me from any BOGUS AGREEMENTS on behalf of your employer.
Re: [Python-Dev] bytes / unicode
On 22/06/2010 19:07, James Y Knight wrote: On Jun 22, 2010, at 1:03 PM, Ian Bicking wrote: Similarly I'd expect (from experience) that a programmer using Python would want to take the same approach, sticking with unencoded data in nearly all situations. Yeah. This is a real issue I have with the direction Python3 went: it pushes you into decoding everything to unicode early,

Well, both .NET and Java take this approach as well. I wonder how they cope with the particular issues that have been mentioned for web applications - both platforms are used extensively for web apps. Having used IronPython, which has .NET unicode strings (although it does a lot of magic to *allow* you to store binary data in strings for compatibility with CPython), I have to say that this approach makes a lot of programming *so* much more pleasant.

We did a lot of I/O (can you do useful programming without I/O?) including working with databases, but I didn't work *much* with wire protocols (fetching a fair bit of data from the web, though, now I think about it). I think wire protocols can present particular problems; sometimes having mixed encodings in the same data, it seems. Where you don't have these problems, keeping bytes data and all Unicode text data separate and encoding / decoding at the boundaries is really much more sane and pleasant. It would be a real shame if we decided that the way forward for Python 3 was to try and move closer to how bytes/text was handled in Python 2. All the best, Michael

even when you don't care -- all you really wanted to do is pass it from one API to another, with some well-defined transformations, which don't actually depend on it having been decoded properly. (For example, extracting the path from the URL and attempting to open it as a file on the filesystem.) This means that Python3 programs can become *more* fragile in the face of random data you encounter out in the real world, rather than less fragile, which was the goal of the whole exercise.
The surrogateescape method is a nice workaround for this, but I can't help thinking that it might've been better to just treat stuff as possibly-invalid-but-probably-utf8 byte-strings from input, through processing, to output. It seems kinda too late for that, though: next time someone designs a language, they can try that. :) James
Re: [Python-Dev] bytes / unicode
On Tue, Jun 22, 2010 at 11:17 AM, Guido van Rossum gu...@python.org wrote: (2) Data sources. These can be functions that produce new data from non-string data, e.g. str(int), read it from a named file, etc. An example is read() vs. write(): it's easy to create a (hypothetical) polymorphic stream object that accepts both f.write('booh') and f.write(b'booh'); but you need some other hack to make read() return something that matches a desired return type. I don't have a generic suggestion for a solution; for streams in particular, the existing distinction between binary and text streams works, of course, but there are other situations where this doesn't generalize (I think some XML interfaces have this awkwardness in their API for converting a tree to a string).

This reminds me of the optimization ElementTree and lxml made in Python 2 (not sure what they do in Python 3?) where they use str when a string is ASCII to avoid the memory and performance overhead of unicode. At least lxml is also dealing with the divide between the internal libxml2 string representation and the Python representation. This is a place where bytes+encoding might also have some benefit. XML is someplace where you might load a bunch of data but only touch a little bit of it, and the amount of data is frequently large enough that the efficiencies are important. -- Ian Bicking | http://blog.ianbicking.org
Re: [Python-Dev] bytes / unicode
At 07:41 AM 6/23/2010 +1000, Nick Coghlan wrote: Then my example above could be made polymorphic (for ASCII compatible encodings) by writing: [x for x in seq if x.endswith(x.coerce(b))] I'm trying to see downsides to this idea, and I'm not really seeing any (well, other than 2.7 being almost out the door and the fact we'd have to grant ourselves an exception to the language moratorium)

Notice, however, that if multi-string operations used a coercion protocol (they currently have to do type checks already for byte/unicode mixes), then you could make the entire stdlib polymorphic by default, even for other kinds of strings that don't exist yet. If you invent a new numeric type, generally speaking you can pass it to existing stdlib functions taking numbers, as long as it implements the appropriate protocols. Why not do the same for strings?
Re: [Python-Dev] bytes / unicode
On Jun 22, 2010, at 12:53 PM, Guido van Rossum wrote: On Mon, Jun 21, 2010 at 11:47 PM, Raymond Hettinger raymond.hettin...@gmail.com wrote: On Jun 21, 2010, at 10:31 PM, Glyph Lefkowitz wrote: This is a common pain-point for porting software to 3.x - you had a string, it kinda worked most of the time before, but now you need to keep track of text too and the functions which seemed to work on bytes no longer do. Thanks Glyph. That is a nice summary of one kind of challenge facing programmers. Ironically, Glyph also described the pain in 2.x: it only kinda worked.

It was not my intention to be ironic about it - that was exactly what I meant :). 3.x is forcing you to confront an issue that you _should_ have confronted for 2.x anyway. (And, I hope, most libraries doing a 3.x migration will take the opportunity to make their 2.x APIs unicode-clean while still in 2to3 mode, and jump ship to 3.x source only _after_ there's a nice transition path for their clients that can be taken in 2 steps.)
Re: [Python-Dev] bytes / unicode
On Jun 22, 2010, at 2:07 PM, James Y Knight wrote: Yeah. This is a real issue I have with the direction Python3 went: it pushes you into decoding everything to unicode early, even when you don't care -- all you really wanted to do is pass it from one API to another, with some well-defined transformations, which don't actually depend on it having been decoded properly. (For example, extracting the path from the URL and attempting to open it as a file on the filesystem.)

But you _do_ need to decode it in this case. If you got your URL from some funky UTF-32 datasource, b'\x00\x00\x00/' is not a path separator, '/' is. Plus, you should really be separating path segments and looking at them individually so that you don't fall victim to %2F bugs. And if you want your code to be portable, you need a Unicode representation of your pathname anyway for Windows; plus, there, you need to care about \ as well as /. The fact that your wire-bytes were probably ASCII(-ish) and your filesystem probably encodes pathnames as UTF-8, and so everything looks like it lines up, is no excuse not to be explicit about your expectations there. You may want to transcode your characters into some other characters later, but that shouldn't stop you from treating them as characters of some variety in the meanwhile.

The surrogateescape method is a nice workaround for this, but I can't help thinking that it might've been better to just treat stuff as possibly-invalid-but-probably-utf8 byte-strings from input, through processing, to output. It seems kinda too late for that, though: next time someone designs a language, they can try that. :)

I can think of lots of optimizations that might be interesting for Python (or perhaps some other runtime less concerned with cleverness overload, like PyPy) to implement, like a UTF-8 combining-characters overlay that would allow for fast indexing, lazily populated as random access dictates.
But this could all be implemented as smartness inside .encode() and .decode() and the str and bytes types without changing the way the API works. I realize that there are implications at the C level, but as long as you can squeeze a function call in to certain places, it could still work.

I can also appreciate what's been said in this thread a bunch of times: to my knowledge, nobody has actually shown a profile of an application where encoding is significant overhead. I believe that encoding _will_ be a significant overhead for some applications (and actually I think it will be very significant for some applications that I work on), but optimizations should really be implemented once that's been demonstrated, so that there's a better understanding of what the overhead is, exactly. Is memory a big deal? Is CPU? Is it both? Do you want to tune for the tradeoff? etc, etc. Clever data-structures seem premature until someone has a good idea of all those things.
Re: [Python-Dev] bytes / unicode
On Jun 22, 2010, at 7:23 PM, Ian Bicking wrote: This is a place where bytes+encoding might also have some benefit. XML is someplace where you might load a bunch of data but only touch a little bit of it, and the amount of data is frequently large enough that the efficiencies are important.

Different encodings have different characteristics, though, which makes them amenable to different types of optimizations. If you've got an ASCII string or a latin1 string, the optimizations of unicode are pretty obvious; if you've got one in UTF-16 with no multi-code-unit sequences, you could also hypothetically cheat for a while if you're on a UCS4 build of Python. I suspect the practical problem here is that there's no CharacterString ABC in the collections module for third-party libraries to provide their own peculiarly-optimized implementations that could lazily turn into real 'str's as needed. I'd volunteer to write a PEP if I thought I could actually get it done :-\. If someone else wants to be the primary author though, I'll try to help out.
Re: [Python-Dev] bytes / unicode
On Tue, Jun 22, 2010 at 4:23 PM, Ian Bicking i...@colorstudy.com wrote: This reminds me of the optimization ElementTree and lxml made in Python 2 (not sure what they do in Python 3?) where they use str when a string is ASCII to avoid the memory and performance overhead of unicode.

An optimization that forces me to typecheck the return value of the function, and that I only discovered after code started breaking. I can't say I was enthused about that decision when I discovered it. -Mike
Re: [Python-Dev] bytes / unicode
On Wed, Jun 23, 2010 at 12:25 PM, Glyph Lefkowitz gl...@twistedmatrix.com wrote: I can also appreciate what's been said in this thread a bunch of times: to my knowledge, nobody has actually shown a profile of an application where encoding is significant overhead. I believe that encoding _will_ be a significant overhead for some applications (and actually I think it will be very significant for some applications that I work on), but optimizations should really be implemented once that's been demonstrated, so that there's a better understanding of what the overhead is, exactly. Is memory a big deal? Is CPU? Is it both? Do you want to tune for the tradeoff? etc, etc. Clever data-structures seem premature until someone has a good idea of all those things.

bzr has a cache of decoded strings in it precisely because decode is slow. We accept slowness encoding to the user's locale because that's typically much less data to examine than we've examined while generating the commit/diff/whatever. We also face memory pressure on a regular basis, and that has been, at least partly, due to UCS4 - our translation cache helps there because we have fewer duplicate UCS4 strings. You're welcome to dig deeper into this, but I don't have more detail paged into my head at the moment. -Rob
Re: [Python-Dev] bytes / unicode
Robert Collins writes: Also, url's are bytestrings - by definition; Eh? RFC 3986 explicitly says A URI is an identifier consisting of a sequence of characters matching the syntax rule named URI in Section 3. (where the phrase sequence of characters appears in all ancestors I found back to RFC 1738), and 2. Characters The URI syntax provides a method of encoding data, presumably for the sake of identifying a resource, as a sequence of characters. The URI characters are, in turn, frequently encoded as octets for transport or presentation. This specification does not mandate any particular character encoding for mapping between URI characters and the octets used to store or transmit those characters. When a URI appears in a protocol element, the character encoding is defined by that protocol; without such a definition, a URI is assumed to be in the same character encoding as the surrounding text. if the standard library has made them unicode objects in 3, I expect a lot of pain in the webserver space. Yup. But pain is inevitable if people are treating URIs (whether URLs or otherwise) as octet sequences. Then your base URL is gonna be b'mailto:step...@xemacs.org', but the natural thing the UI will want to do is formurl = baseurl + '?subject=うるさいやつだなぁ…' IMO, the UI is right. Something like the above ought to work. So the function that actually handles composing the URL should take a string (ie, unicode), and do all escaping. The UI code should not need to know about escaping. If nothing escapes except the function that puts the URL in composed form, and that function always escapes, life is easy. Of course, in real life it's not that easy. But it's possible to make things unnecessarily hard for the users of your URI API(s), and one way to do that is to make URIs into just bytes (and just unicode is probably nearly as bad, except that at least you know it's not ready for the wire).
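As it happens, the stdlib eventually went the route implied here: in current Python 3 (3.2 and later, not the 3.1 under discussion), urllib.parse is polymorphic -- all-str arguments give str back, all-bytes give bytes back, and mixing the two is rejected. A quick sketch of that behavior:

```python
from urllib.parse import urljoin

# all-str in, str out
print(urljoin('http://example.com/a/b', 'c'))    # http://example.com/a/c

# all-bytes in, bytes out
print(urljoin(b'http://example.com/a/b', b'c'))  # b'http://example.com/a/c'

# mixing str and bytes is rejected outright
try:
    urljoin('http://example.com/a/b', b'c')
except TypeError as e:
    print(e)  # Cannot mix str and non-str arguments
```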
Re: [Python-Dev] bytes / unicode
2010/6/21 Stephen J. Turnbull step...@xemacs.org: IMO, the UI is right. Something like the above ought to work. Right. That said, many times when you want to do urlparse etc they might be binary, and you might want binary. So maybe the methods should work with both? -- Lennart Regebro: http://regebro.wordpress.com/ Python 3 Porting: http://python3porting.com/ +33 661 58 14 64
Re: [Python-Dev] bytes / unicode
On Mon, Jun 21, 2010 at 12:30 PM, P.J. Eby p...@telecommunity.com wrote: I also find it weird that there seem to be two camps on this subject, one of which claims that All Is Well And There Is No Problem -- but I do not recall seeing anyone who was in the What do I do; this doesn't seem ready camp who switched sides and took the time to write down what made them realize that they were wrong about there being a problem, and what steps they had to take. The existence of one or more such documents would certainly ease my mind, and I imagine that of other people who are less waiting for others' libraries, than for the stdlib (and/or language) itself to settle. (Or more precisely, for it to be SEEN to have settled.) I don't know that the all is well camp actually exists. The camp that I do see existing is the one that says without a bug report, inconsistencies in the standard library's unicode handling won't get fixed. The issues picked up by the regression test suite have already been dealt with, but that suite is unfortunately far from comprehensive. Just like a lot of Python code that is out there, the standard library isn't immune to the poor coding practices that were permitted by the blurry lines between text and octet streams in 2.x. It may be that there are places where we need to rewrite standard library algorithms to be bytes/str neutral (e.g. by using length one slices instead of indexing). It may be that there are more APIs that need to grow encoding keyword arguments that they then pass on to the functions they call or use to convert str arguments to bytes (or vice-versa). But without people trying to port affected libraries and reporting bugs when they find issues, the situation isn't going to improve. Now, if these bugs are already being reported against 3.1 and just aren't getting fixed, that's a completely different story... Cheers, Nick. 
-- Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
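The length-one-slice technique Nick mentions can be sketched with a hypothetical helper (not stdlib code): indexing bytes yields an int in Python 3, so `x[0]` breaks type-neutral code, while `x[:1]` preserves the argument's type.

```python
def starts_with_sep(path, seps):
    # path[0] on bytes yields an int in Python 3, so `path[0] in seps`
    # misbehaves; a length-one slice preserves the argument's type and
    # the same line works for both str and bytes
    return path[:1] in seps

print(starts_with_sep('/usr/bin', '/\\'))    # True
print(starts_with_sep(b'usr/bin', b'/\\'))   # False
```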
Re: [Python-Dev] bytes / unicode
Lennart Regebro writes: 2010/6/21 Stephen J. Turnbull step...@xemacs.org: IMO, the UI is right. Something like the above ought to work. Right. That said, many times when you want to do urlparse etc they might be binary, and you might want binary. So maybe the methods should work with both? First, a caveat: I'm a Unicode/encodings person, not an experienced web programmer. My opinions on whether this would work well in practice should be taken with a grain of salt. Speaking for myself, I live in a country where the natives have saddled themselves with no less than 4 encodings in common use, and I would never want binary since none of them would display as anything useful in a traceback. Wherever possible, I decode blobs into structured objects, I do it as soon as possible, and if for efficiency reasons I want to do this lazily, I store the blob in a separate .raw_object attribute. If they're textual, I decode them to text. I can't see an efficiency argument for decoding URIs lazily in most applications. In the case of structured text like URIs, I would create a separate class for handling them with string-like operations. Internally, all text would be raw Unicode (ie, not url-encoded); repr(uri) would use some kind of readable quoting convention (not url-encoding) to disambiguate random reserved characters from separators, while str(uri) would produce an url-encoded string. Converting to and from wire format is just .encode and .decode, then, and in this country you need to be flexible about which encoding you use. Agreed, this stuff is really annoying. But I think that just comes with the territory. PJE reports that folks don't like doing encoding and decoding all over the place. I understand that, but if they're doing a lot of that, I have to wonder why. Why not define the one line function and get on with life? The thing is, where I live, it's not going to be a one line function. 
I'm going to be dealing with URLs that are url-encoded representations of UTF-8, Shift-JIS, EUC-JP, and occasionally RFC 2047! So I need an API that explicitly encodes and decodes. And I need an API that presents Japanese as Japanese rather than as line noise. Eg, PJE writes Ugh. I meant:

newurl = urljoin(str(base, 'latin-1'), 'subdir').encode('latin-1')

Which just goes to the point of how ridiculous it is to have to convert things to strings and back again to use APIs that ought to just handle bytes properly in the first place. But if you need that everywhere, what's so hard about

def urljoin_wrapper(base, subdir):
    return urljoin(str(base, 'latin-1'), subdir).encode('latin-1')

Now, note how that pattern fails as soon as you want to use non-ISO-8859-1 languages for subdir names. In Python 3, the code above is just plain buggy, IMHO. The original author probably will never need the generalization. But her name will be cursed unto the nth generation by people who use her code on a different continent. The net result is that bytes are *not* a programmer- or user-friendly way to do this, except for the minority of the world for whom Latin-1 is a good approximation to their daily-use unibyte encoding (eg, it's probably usable for debugging in Dansk, but you won't win any popularity contests in Tel Aviv or Shanghai).
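Stephen's objection can be made concrete: the latin-1 round-trip wrapper (PJE's hypothetical helper, quoted above) works only while the text half of the operation stays inside Latin-1's repertoire, whereas percent-encoding the text explicitly does not have that restriction.

```python
from urllib.parse import urljoin, quote

def urljoin_wrapper(base, subdir):
    # PJE's pattern: smuggle bytes through str via latin-1 and back
    return urljoin(str(base, 'latin-1'), subdir).encode('latin-1')

print(urljoin_wrapper(b'http://example.com/a/', 'sub'))  # fine: ASCII subdir

try:
    urljoin_wrapper(b'http://example.com/a/', 'うるさい')
except UnicodeEncodeError as e:
    print('fails outside Latin-1:', e.encoding)

# treating the subdir as *text* and percent-encoding it explicitly
# (Stephen's position) has no such restriction; quote() defaults to UTF-8
print(urljoin('http://example.com/a/', quote('うるさい')))
```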
Re: [Python-Dev] bytes / unicode
At 10:51 PM 6/21/2010 +1000, Nick Coghlan wrote: It may be that there are places where we need to rewrite standard library algorithms to be bytes/str neutral (e.g. by using length one slices instead of indexing). It may be that there are more APIs that need to grow encoding keyword arguments that they then pass on to the functions they call or use to convert str arguments to bytes (or vice-versa). But without people trying to port affected libraries and reporting bugs when they find issues, the situation isn't going to improve. Now, if these bugs are already being reported against 3.1 and just aren't getting fixed, that's a completely different story... The overall impression, though, is that this isn't really a step forward. Now, bytes are the special case instead of unicode, but that special case isn't actually handled any better by the stdlib - in fact, it's arguably worse. And, the burden of addressing this seems to have been shifted from the people who made the change, to the people who are going to use it. But those people are not necessarily in a position to tell you anything more than, give me something that works with bytes. What I can tell you is that before, since string constants in the stdlib were ascii bytes, and transparently promoted to unicode, stdlib behavior was *predictable* in the presence of special cases: you got back either bytes or unicode, but either way, you could idempotently upgrade the result to unicode, or just pass it on. APIs were str safe, unicode aware. If you passed in bytes, you weren't going to get unicode without a warning, and if you passed in unicode, it'd work and you'd get unicode back. Now, the APIs are neither safe nor aware -- if you pass bytes in, you get unpredictable results back. Ironically, it almost *would* have been better if bytes simply didn't work as strings at all, *ever*, but if you could wrap them with a bstr() to *treat* them as text. 
You could still have restrictions on combining them, as long as it was a restriction on the unicode you mixed with them. That is, if you could combine a bstr and a str if the *str* was restricted to ASCII. If we had the Python 3 design discussions to do over again, I think I would now have stuck with the position of not letting bytes be string-compatible at all, and instead proposed an explicit bstr() wrapper/adapter to use them as strings, that would (in that case) force coercion in the direction of bytes rather than strings. (And bstr need not have been a builtin - it could have been something you import, to help discourage casual usage.) Might this approach lead to some people doing things wrong in the case of porting? Sure. But there'd be little reason to use it in new code that didn't have a real need for bytestring manipulation. It might've been a better balance between practicality and purity, in that it keeps the language pure, while offering a practical way to deal with things in bytes if you really need to. And, bytes wouldn't silently succeed *some* of the time, leading to a trap. An easy inconsistency is worse than a bit of uniform chicken-waving. Is it too late to make that tradeoff? Probably. Certainly it's not practical to *implement* outside the language core, and removing string methods would fux0r anybody whose currently-ported code relies on bytes objects having string-like methods. ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
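A minimal sketch of the bstr idea as described here (hypothetical -- my reading of the proposal, not an implementation from the thread): coercion runs toward bytes, and only ASCII-restricted str is allowed into the mix.

```python
class bstr(bytes):
    """Hypothetical adapter: bytes usable in string-ish concatenation.

    Mixing with str coerces the *str* side to bytes, and only if it
    is pure ASCII -- the restriction PJE describes above.
    """
    def __add__(self, other):
        if isinstance(other, str):
            other = other.encode('ascii')  # non-ASCII str raises here
        return bstr(bytes(self) + other)

    def __radd__(self, other):
        if isinstance(other, str):
            other = other.encode('ascii')
        return bstr(other + bytes(self))

print(bstr(b'GET ') + '/index')   # b'GET /index' -- result stays bytes
try:
    bstr(b'GET ') + 'é'
except UnicodeEncodeError:
    print('non-ASCII str rejected early')
```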
Re: [Python-Dev] bytes / unicode
On 21/06/2010 17:46, P.J. Eby wrote: [snip -- P.J. Eby's message quoted in full above] Why is your proposed bstr wrapper not practical to implement outside the core and use in your own libraries and frameworks? Michael -- http://www.ironpythoninaction.com/ http://www.voidspace.org.uk/blog
Re: [Python-Dev] bytes / unicode
At 01:08 AM 6/22/2010 +0900, Stephen J. Turnbull wrote: But if you need that everywhere, what's so hard about

def urljoin_wrapper(base, subdir):
    return urljoin(str(base, 'latin-1'), subdir).encode('latin-1')

Now, note how that pattern fails as soon as you want to use non-ISO-8859-1 languages for subdir names. Bear in mind that the use cases I'm talking about here are WSGI stacks with components written by multiple authors -- each of whom may have to define that function, and still get it right. Sure, there are some things that could go in wsgiref in the stdlib. However, as of this moment, there's only a very uneasy rough consensus in Web-Sig as to how the heck WSGI should actually *work* on Python 3, because of issues like these. That makes it tough to actually say what should happen in the stdlib -- e.g., which things should be classed as stdlib bugs, which things should be worked around with wrappers or new functions, etc.
Re: [Python-Dev] bytes / unicode
On Tue, Jun 22, 2010 at 01:08:53AM +0900, Stephen J. Turnbull wrote: [snip -- Stephen's message quoted in full above] The net result is that bytes are *not* a programmer- or user-friendly way to do this, except for the minority of the world for whom Latin-1 is a good approximation to their daily-use unibyte encoding. One comment here -- you can also have URIs that aren't decodable into their true textual meaning using a single encoding. Apache will happily serve out URIs that have utf-8, shift-jis, and euc-jp components inside of their path, but the textual representation that was intended will be garbled (or be represented by escaped byte sequences). For that matter, Apache will serve requests that have no true textual representation, as it is working on the byte level rather than the character level. So a complete solution really should allow the programmer to pass in URIs as bytes when the programmer knows that they need it.
-Toshio
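Toshio's point can be illustrated with the byte-oriented escape handling that did land in the stdlib (current Python 3 API; the Shift-JIS example segment is my own): unquote_to_bytes() keeps the byte level lossless, while text-level unquote() with the wrong encoding garbles it.

```python
from urllib.parse import unquote, unquote_to_bytes

# a path segment whose percent-escapes are the Shift-JIS bytes for 日本
segment = '%93%FA%96%7B'

raw = unquote_to_bytes(segment)   # lossless: the original octets
print(raw.decode('shift_jis'))    # 日本 -- correct, given the right encoding

# decoding with the default utf-8 substitutes replacement characters
print(unquote(segment))
```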
Re: [Python-Dev] bytes / unicode
On 6/20/2010 11:56 PM, Terry Reedy wrote: The specific example is

>>> urllib.parse.parse_qsl('a=b%e0')
[('a', 'b�')]

where the character after 'b' is white ? in dark diamond, indicating an error. parse_qsl() splits that input on '=' and sends each piece to urllib.parse.unquote. unquote() attempts to Replace %xx escapes by their single-character equivalent. unquote has an encoding parameter that defaults to 'utf-8' in *its* call to .decode. parse_qsl does not have an encoding parameter. If it did, and it passed that to unquote, then the above example would become (simulated interaction)

>>> urllib.parse.parse_qsl('a=b%e0', encoding='latin-1')
[('a', 'bà')]

I got that output by copying the file and adding encoding='latin-1' to the unquote call. Does this solve this problem? Has anything like this been added for 3.2? Should it be? With a little searching, I found http://bugs.python.org/issue5468 with Miles Kaufmann's year-old comment parse_qs and parse_qsl should also grow encoding and errors parameters to pass to the underlying unquote(). Patch review is needed. Terry Jan Reedy
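For the record, these parameters did land: in Python 3.2 and later, parse_qs() and parse_qsl() accept encoding and errors arguments that are passed through to unquote(), so Terry's simulated interaction works unpatched.

```python
from urllib.parse import parse_qsl, unquote

print(unquote('b%e0'))                          # 'b\ufffd' -- utf-8 default, replaced
print(unquote('b%e0', encoding='latin-1'))      # 'bà'
print(parse_qsl('a=b%e0', encoding='latin-1'))  # [('a', 'bà')]
```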
Re: [Python-Dev] bytes / unicode
On Mon, Jun 21, 2010 at 9:46 AM, P.J. Eby p...@telecommunity.com wrote: [snip -- P.J. Eby's reply to Nick Coghlan, quoted in full above] APIs were str safe, unicode aware. If you passed in bytes, you weren't going to get unicode without a warning, and if you passed in unicode, it'd work and you'd get unicode back. Actually, the big problem with Python 2 is that if you mix str and unicode, things work or crash depending on whether any of the str objects involved contain non-ASCII bytes.
If one API decides to upgrade to Unicode, the result, when passed to another API, may well cause a UnicodeError because not all arguments have had the same treatment. Now, the APIs are neither safe nor aware -- if you pass bytes in, you get unpredictable results back. This seems an overgeneralization of a particular bug. There are APIs that are strictly text-in, text-out. There are others that are bytes-in, bytes-out. Let's call all those *pure*. For some operations it makes sense that the API is *polymorphic*, with which I mean that text-in causes text-out, and bytes-in causes byte-out. All of these are fine. Perhaps there are more situations where a polymorphic API would be helpful. Such APIs are not always so easy to implement, because they have to be careful with literals or other constants (and even more so mutable state) used internally -- but it can be done, and there are plenty of examples in the stdlib. The real problem apparently lies in (what I believe is only a few rare) APIs that are text-or-bytes-in and always-text-out (or always-bytes-out). Let's call them *hybrid*. Clearly, mixing hybrid APIs in a stream of pure or polymorphic API calls is a problem, because they turn a pure or polymorphic overall operation into a hybrid one. There are also text-in, bytes-out or bytes-in, text-out APIs that are intended for encoding/decoding of course, but these are in a totally different class. Abstractly, it would be good if there were as few as possible hybrid APIs, many pure or polymorphic APIs (which it should be in a particular case is a pragmatic choice), and a limited number of encoding/decoding APIs, which should generally be invoked at the edges of the program (e.g., I/O). Ironically, it almost *would* have been better if bytes simply didn't work as strings at all, *ever*, but if you could wrap them with a bstr() to *treat* them as text. You could still have restrictions on combining them, as long as it was a restriction on the unicode you mixed with them. 
That is, if you could combine a bstr and a str if the *str* was restricted to ASCII. ISTR that we considered something like this and decided to stay away from it. At this point I think that a successful 3rd party bstr implementation would be required before we rush to add one to the stdlib. If we had the Python 3 design discussions to do over again, I think I would now have stuck with the position of not letting bytes be string-compatible at all, They aren't, unless you consider the presence of some methods with similar behavior (.lower(), .split() and so on) and the existence of some polymorphic APIs (see above) as compatibility. and instead proposed an explicit bstr() wrapper/adapter to use them as strings, that would (in that case) force coercion in the direction of bytes rather than strings. (And bstr need not have been a builtin - it could have been something you import, to help discourage casual usage.) I'm still unclear on exactly what bstr is supposed to be, but it sounds a bit like one of the rejected proposals for having a single (Unicode-capable) str type that is implemented using different width encodings (Latin-1, UCS-2, UCS-4) underneath.
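Guido's taxonomy in code: a sketch of a *polymorphic* function (a hypothetical helper, not stdlib code). The implementation care he mentions is visible in the literal handling -- the constant has to be derived from the argument's type rather than written as a bare str literal.

```python
def strip_leading_dots(s):
    # polymorphic: text-in/text-out and bytes-in/bytes-out
    dot = '.' if isinstance(s, str) else b'.'
    while s[:1] == dot:        # length-1 slice keeps the code type-neutral
        s = s[1:]
    return s

print(strip_leading_dots('..name'))    # 'name'
print(strip_leading_dots(b'..name'))   # b'name'
```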
Re: [Python-Dev] bytes / unicode
At 05:49 PM 6/21/2010 +0100, Michael Foord wrote: Why is your proposed bstr wrapper not practical to implement outside the core and use in your own libraries and frameworks? __contains__ doesn't have a converse operation, so you can't code a type that works around this (Python 3.1 shown):

>>> from os.path import join
>>> join(b'x', 'y')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\Python31\lib\ntpath.py", line 161, in join
    if b[:1] in seps:
TypeError: Type str doesn't support the buffer API
>>> join('y', b'x')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\Python31\lib\ntpath.py", line 161, in join
    if b[:1] in seps:
TypeError: 'in <string>' requires string as left operand, not bytes

IOW, only one of these two cases can be worked around by using a bstr (or ebytes) that doesn't have support from the core string type. I'm not sure if the in operator is the only case where implementing such a type would fail, but it's the most obvious one. String formatting, of both the % and .format() varieties is another. (__rmod__ doesn't help if your bytes object is one of several data items in a tuple or dict -- the common case for % formatting.)
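The asymmetry PJE describes is easy to reproduce with a toy wrapper (ebytes here is a hypothetical class, not a real type): a user-defined type can make *itself* tolerant as a container, but `a in b` always dispatches to type(b), and there is no __rcontains__ for the member side to hook.

```python
class ebytes(bytes):
    # can make *itself* tolerant as a container...
    def __contains__(self, item):
        if isinstance(item, str):
            item = item.encode('ascii')
        return bytes.__contains__(self, item)

print('x' in ebytes(b'xyz'))   # True: our __contains__ runs

# ...but not as a member: containment dispatches to the container's
# type (str here), which rejects bytes-like left operands
try:
    ebytes(b'x') in 'xyz'
except TypeError as e:
    print(e)
```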
Re: [Python-Dev] bytes / unicode
On 6/21/2010 8:51 AM, Nick Coghlan wrote: [snip -- Nick's message quoted in full above] Now, if these bugs are already being reported against 3.1 and just aren't getting fixed, that's a completely different story... Some of the above have been, over a year ago. See, for instance, http://bugs.python.org/issue5468 I am getting the impression that the people who use the web modules tend, like me, to not have the tools to write and test patches. So they can squeak but not grease. Terry Jan Reedy
Re: [Python-Dev] bytes / unicode
At 12:56 PM 6/21/2010 -0400, Toshio Kuratomi wrote: [snip -- Toshio's comment quoted in full above] So a complete solution really should allow the programmer to pass in URIs as bytes when the programmer knows that they need it. ebytes(somebytes, 'garbage'), perhaps, which would be like ascii, but where combining with non-garbage would result in another 'garbage' ebytes?
Re: [Python-Dev] bytes / unicode
At 10:29 AM 6/21/2010 -0700, Guido van Rossum wrote:

Perhaps there are more situations where a polymorphic API would be helpful. Such APIs are not always so easy to implement, because they have to be careful with literals or other constants (and even more so mutable state) used internally -- but it can be done, and there are plenty of examples in the stdlib.

What if we could use the time machine to make the APIs that *were* polymorphic regain their previously-polymorphic status, without needing to actually *change* any of the code of those functions? That's what Barry's ebytes proposal would do, with appropriate coercion rules. Passing ebytes into such a function would yield back ebytes, even if the function used strings internally, as long as those strings could be encoded back to bytes using the ebytes' encoding. (Which would normally be the case, since stdlib constants are almost always ASCII, and the main use cases for ebytes would involve ascii-extended encodings.)

I'm still unclear on exactly what bstr is supposed to be, but it sounds a bit like one of the rejected proposals for having a single (Unicode-capable) str type that is implemented using different width encodings (Latin-1, UCS-2, UCS-4) underneath.

Not quite -- as modified by Barry's proposal (which I like better than mine), it'd be an object that just combines bytes with an attribute indicating the underlying encoding. When it interacts with strings, the strings are *encoded* to bytes, rather than upgrading the bytes to text. This is actually a big advantage for error detection in any application where you're working with data that *must* be encodable in a specific encoding for output, as it allows you to catch errors much *earlier* than you would if you only did the encoding at your output boundary. Anyway, this would not be the normal bytes type or string type; it's bytes with an encoding.
It's also more general than Unicode, in the sense that it allows you to work with character sets that don't really *have* a proper Unicode mapping. One issue I remember from my enterprise days is some of the Asian-language developers at NTT/Verio explaining to me that unicode doesn't actually solve certain issues -- that there are use cases where you really *do* need bytes plus encoding in order to properly express something. Unfortunately, I never quite wrapped my head around the idea; I just remember it had something to do with the fact that Unicode has single character codes that mean different things in different languages, such that you were actually losing information by converting to unicode, or something like that. (Or maybe the characters were expressed differently in certain encodings according to what language they came from, so you couldn't round-trip them through unicode without losing information. I think that's probably what it was; maybe somebody here can chime in more on that point.)

Anyway, a type like this would need to have at least a bit of support from the core language, because the str type would need to be able to handle at least the __contains__ and %/.format() coercion cases, since these functions don't have __r*__ equivalents that a user-implemented type could provide... and strings don't have anything like a '__coerce__' either. If sufficient hooks existed, then an ebytes could be implemented outside the stdlib, and still used within it.
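For concreteness, here is a minimal sketch of the kind of type being discussed: bytes carrying an encoding attribute, which coerces str operands *down* to bytes rather than promoting itself to text. Everything here (the name ebytes, the attribute, the coercion rule) is hypothetical; nothing like this exists in the stdlib, and real coercion with str literals would need the core-language hooks mentioned above.

```python
class ebytes(bytes):
    """Hypothetical 'bytes with an encoding' type sketched in this
    thread; illustrative only, not a real stdlib type."""

    def __new__(cls, data, encoding):
        self = super().__new__(cls, data)
        self.encoding = encoding
        return self

    def __add__(self, other):
        if isinstance(other, str):
            # Coerce str operands down to bytes using *our* encoding,
            # instead of upgrading ourselves to text. A str that cannot
            # be encoded raises UnicodeEncodeError right here -- the
            # "catch errors earlier than the output boundary" property.
            other = other.encode(self.encoding)
        return ebytes(bytes(self) + bytes(other), self.encoding)

    def __str__(self):
        # Decoding is always well-defined, since we carry our encoding.
        return bytes(self).decode(self.encoding)
```

For example, `ebytes(b'caf', 'utf-8') + 'é'` yields an ebytes equal to `b'caf\xc3\xa9'` that still remembers it is utf-8, while `ebytes(b'x', 'ascii') + 'é'` fails immediately with UnicodeEncodeError.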
Re: [Python-Dev] bytes / unicode
2010/6/21 Stephen J. Turnbull step...@xemacs.org:

Robert Collins writes: Also, url's are bytestrings - by definition;

Eh? RFC 3896 explicitly says...

RFC 3896 is "Definitions of Managed Objects for the DS3/E3 Interface Type". Perhaps you mean 3986? :)

"A URI is an identifier consisting of a sequence of characters matching the syntax rule named URI in Section 3." (where the phrase "sequence of characters" appears in all ancestors I found back to RFC 1738), and...

Sure, ok, let me unpack what I meant just a little. An abstract URI is neither unicode nor bytes per se - see section 1.2.1: "A URI is a sequence of characters from a very limited set: the letters of the basic Latin alphabet, digits, and a few special characters."

URI interpretation is fairly strictly separated between producers and consumers. A consumer can manipulate a url with other url fragments - e.g. doing urljoin. But it needs to keep the url as a url and not try to decode it to a unicode representation. The producer of the url, however, can decode via whatever heuristics it wants - because it defines the encoding used to go from unicode to URL encoding. As an example, if I give the uri http://server/%c3%83, rendering that as http://server/Ã can lead to transcription errors and reinterpretation problems unless you know - out of band - that the server is using utf-8 to encode. Conversely, if someone enters http://server/Ã in their browser window, choosing utf-8 or their local encoding is quite arbitrary and may not match how the server would represent that resource. Beyond that, producers can do odd things - like when there are a series of servers stacked and forwarding requests amongst themselves - where they generate different parts of the same URL using different encodings.

2. Characters

The URI syntax provides a method of encoding data, presumably for the sake of identifying a resource, as a sequence of characters. The URI characters are, in turn, frequently encoded as octets for transport or presentation.
This specification does not mandate any particular character encoding for mapping between URI characters and the octets used to store or transmit those characters. When a URI appears in a protocol element, the character encoding is defined by that protocol; without such a definition, a URI is assumed to be in the same character encoding as the surrounding text.

That's true, but it's been taken out of context; the set of characters permitted in a URL is a strict subset of the characters found in ASCII; there is a BNF that defines it, and it is quite precise. While it doesn't define a set of octets, it also doesn't define support for unicode characters - individual schemes need to define the mapping used between characters defined as safe and those that get percent encoded. E.g. unicode (abstract) -> utf-8 -> percent encoded. See also the section on comparing URLs - Unicode isn't at all relevant.

if the standard library has made them unicode objects in 3, I expect a lot of pain in the webserver space.

Yup. But pain is inevitable if people are treating URIs (whether URLs or otherwise) as octet sequences.

Then your base URL is gonna be b'mailto:step...@xemacs.org', but the natural thing the UI will want to do is formurl = baseurl + '?subject=うるさいやつだなぁ…' IMO, the UI is right. Something like the above ought to work.

I wish it would. The problem is not in Python here though - and casual handwaving will exacerbate it, not fix it. Modelling URLs as string-like things is great from a convenience perspective, but, like file paths, they are much more complex and difficult. For your particular case, subject contains characters outside the URL specification, so someone needs to choose an encoding to get them into a sequence-of-bytes-that-can-be-percent-escaped. Section 2.5, "identifying data", goes into this to some degree. Note a trap - the last paragraph says "when a *NEW* URI scheme..." (emphasis mine). Existing schemes do not mandate UTF-8, which is why the producer/consumer split matters.
I spent a few minutes looking, but it's lost in the minutiae somewhere - HTTP does not specify UTF-8 (though I wish it would) for its URIs, and std66 is the generic definition and rules for new URI schemes, preserving intact the mistake of HTTP.

So the function that actually handles composing the URL should take a string (i.e., unicode), and do all escaping. The UI code should not need to know about escaping. If nothing escapes except the function that puts the URL in composed form, and that function always escapes, life is easy.

Arg. The problem is very similar to the file system problem:
- We get given a sequence of bytes
- we have some rules that will let us manipulate the sequence to get hostnames, query parameters and so forth
- and others to let us walk a directory structure
- and no guarantee that any of the data is in any particular encoding other than 'URL'. In
Re: [Python-Dev] bytes / unicode
On 6/21/2010 1:29 PM, P.J. Eby wrote:

At 05:49 PM 6/21/2010 +0100, Michael Foord wrote: Why is your proposed bstr wrapper not practical to implement outside the core and use in your own libraries and frameworks?

__contains__ doesn't have a converse operation, so you can't code a type that works around this (Python 3.1 shown):

from os.path import join
join(b'x', 'y')
join('y', b'x')

I am really unclear what result you intend for such mixed pairs, for all possible mixed pairs, sensible or not. It would seem to me best to write your own pjoin function that did exactly what you want over the whole input domain.

-- Terry Jan Reedy
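The missing "converse operation" Eby refers to can be seen with a toy wrapper class (invented here for illustration): defining __contains__ helps when the wrapper is the container, but when the wrapper appears as the *left* operand of `in` against a plain str, str.__contains__ runs and the wrapper gets no hook, because `in` has no reflected protocol the way __add__/__radd__ does.

```python
class Tagged:
    """Toy str-like wrapper (hypothetical), demonstrating the
    asymmetry of the `in` operator."""

    def __init__(self, s):
        self.s = s

    def __contains__(self, item):
        # Consulted only when Tagged is the *container* (right operand).
        return item in self.s

t = Tagged("hello")
print("ell" in t)          # True: Tagged.__contains__ runs
try:
    t in "hello world"     # str.__contains__ runs and rejects non-str
except TypeError as e:
    print("TypeError:", e)
```

This is why Eby argues such a type needs core-language support: a third-party type cannot intercept `myobj in some_str` no matter what methods it defines.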
Re: [Python-Dev] bytes / unicode
On 6/21/2010 1:29 PM, Guido van Rossum wrote:

Actually, the big problem with Python 2 is that if you mix str and unicode, things work or crash depending on whether any of the str objects involved contain non-ASCII bytes. If one API decides to upgrade to Unicode, the result, when passed to another API, may well cause a UnicodeError because not all arguments have had the same treatment. Now, the APIs are neither safe nor aware -- if you pass bytes in, you get unpredictable results back.

This seems an overgeneralization of a particular bug.

There are APIs that are strictly text-in, text-out. There are others that are bytes-in, bytes-out. Let's call all those *pure*. For some operations it makes sense that the API is *polymorphic*, with which I mean that text-in causes text-out, and bytes-in causes bytes-out. All of these are fine. Perhaps there are more situations where a polymorphic API would be helpful. Such APIs are not always so easy to implement, because they have to be careful with literals or other constants (and even more so mutable state) used internally -- but it can be done, and there are plenty of examples in the stdlib.

The real problem apparently lies in (what I believe is only a few rare) APIs that are text-or-bytes-in and always-text-out (or always-bytes-out). Let's call them *hybrid*. Clearly, mixing hybrid APIs in a stream of pure or polymorphic API calls is a problem, because they turn a pure or polymorphic overall operation into a hybrid one. There are also text-in, bytes-out or bytes-in, text-out APIs that are intended for encoding/decoding of course, but these are in a totally different class. Abstractly, it would be good if there were as few as possible hybrid APIs, many pure or polymorphic APIs (which it should be in a particular case is a pragmatic choice), and a limited number of encoding/decoding APIs, which should generally be invoked at the edges of the program (e.g., I/O).

Nice summary of part of the 'why' for Python3.
I still believe that the instances of bytes silently succeeding *some* of the time refers to specific bugs in specific APIs, either intentional because of misguided compatibility desires, or accidental in the haste of trying to convert the entire stdlib to Python 3 in a finite time.

I think http://bugs.python.org/issue5468 reports one aspect of haste: missing encoding and errors parameters. But it has not gotten much attention.

-- Terry Jan Reedy
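Guido's point about polymorphic APIs having to "be careful with literals" is concrete in the stdlib: any internal constant must match the argument's type. A hedged sketch of the pattern (the function itself is invented for illustration; the type-matched-constant trick is the same one modules like posixpath use for their separators):

```python
def strip_fragment(url):
    """Drop a '#fragment' suffix. Polymorphic in Guido's sense:
    str in -> str out, bytes in -> bytes out. Illustrative only."""
    # A bare '#' literal would raise TypeError on bytes input, so the
    # constant is selected to match the argument's type.
    hash_sign = '#' if isinstance(url, str) else b'#'
    head, _, _ = url.partition(hash_sign)
    return head

print(strip_fragment('http://h/p#frag'))    # str in, str out
print(strip_fragment(b'http://h/p#frag'))   # bytes in, bytes out
```

A *hybrid* API, by contrast, would return str (or bytes) regardless of the input type, which is exactly what turns an otherwise pure pipeline into an unpredictable one.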
Re: [Python-Dev] bytes / unicode
Toshio Kuratomi writes:

One comment here -- you can also have uris that aren't decodable into their true textual meaning using a single encoding. Apache will happily serve out uris that have utf-8, shift-jis, and euc-jp components inside of their path, but the textual representation that was intended will be garbled (or be represented by escaped byte sequences). For that matter, apache will serve requests that have no true textual representation, as it is working on the byte level rather than the character level.

Sure. I've never seen that combination, but I have seen Shift JIS and KOI8-R in the same path. But in that case, just using 'latin-1' as the encoding allows you to use the (unicode) string operations internally, and then spew your mess out into the world for someone else to clean up, just as using bytes would.

So a complete solution really should allow the programmer to pass in uris as bytes when the programmer knows that they need it.

Other than passing bytes into a constructor, I would argue that if a "complete solution" requires, e.g., an interface that allows urljoin(base, subdir) where the types of base and subdir are not required to match, then it doesn't belong in the stdlib. For stdlib usage, that's premature optimization IMO. The RFC says that URIs are text, and therefore they can (and IMO should) be operated on as text in the stdlib.

It's not just a matter of manipulating the URIs themselves, where working directly on bytes will work just as well and with the same string operations (as long as everything is bytes). It's also a question of API complexity (e.g., Barry's bugaboo of proliferation of encoding= parameters) and of debugging (if URIs are internally str, then they will display sanely in tracebacks and the interpreter). The cases where URIs can't be sanely treated as text are garbage input, and the stdlib should not try to provide a solution. Just passing in bytes and getting out bytes is GIGO.
Trying to do some error-checking is going to be insufficient much of the time and overly strict most of the rest of the time. The programmer in the trenches is going to need to decide what to allow and what not; I don't think there are general answers, because we know that allowing random URLs on the web leads to various kinds of problems. Some sites will need to address some of them.

Note also that the "complete solution" argument cuts both ways. E.g., a complete solution should implement UTS 39 confusables detection[1] and IDNA[2]. Good luck doing that with bytes!

If you *need* bytes (rather than simply trying to avoid conversion overhead), you're in a hazmat handling situation. Passing bytes in to stdlib APIs here is the equivalent of carrying around kilograms of fissionables in an open bucket. While the Tokaimura comparison is hyperbole, it can't be denied that use of bytes here shortcuts a lot of processing strongly suggested by the RFCs, and prevents use of various programming conveniences (such as reasonable display of URI values in debugging). Does the efficiency really justify including that in the stdlib? I dunno, I'm not a web programmer in the trenches. But I take my cue from MvL and MAL, who don't seem real enthusiastic about this. And as Martin says, there is as yet no evidence offered that the overhead of conversion is a general problem.

Footnotes:
[1] http://www.unicode.org/reports/tr39/
[2] http://www.rfc-editor.org/rfc/rfc3490.txt
Re: [Python-Dev] bytes / unicode
Robert Collins writes:

Perhaps you mean 3986 ? :)

Thank you for the correction.

"A URI is an identifier consisting of a sequence of characters matching the syntax rule named URI in Section 3." (where the phrase "sequence of characters" appears in all ancestors I found back to RFC 1738), and...

Sure, ok, let me unpack what I meant just a little. An abstract URI is neither unicode nor bytes per se - see section 1.2.1: "A URI is a sequence of characters from a very limited set: the letters of the basic Latin alphabet, digits, and a few special characters."

My position is that this describes the network protocol, not the abstract URI. It in no way suggests that uri-encoded forms should be handled internally. And the RFC explicitly says this is text, and therefore sanctions the user- and programmer-friendly practice of doing internal processing as text. Note that in a hypothetical bytes-oriented API

base = convert_uri_to_wire_format('http://www.example.org/')
formuri = uri_join(base, b'/home/steve/public_html')

the bytes literal b'/home/steve/public_html' clearly is intended as readable text. This is mixing types in the programmer's mind, even though base is internally in bytes format and the relative URI is also in bytes format. This is un-Pythonic IMO.

URI interpretation is fairly strictly separated between producers and consumers. A consumer can manipulate a url with other url fragments - e.g. doing urljoin. But it needs to keep the url as a url and not try to decode it to a unicode representation.

Unfortunately, outside of Kansas and Canberra, it don't work that way. How do you propose to uri_join base as above and '/home/スティーブ/public_html'? Encoding and/or decoding must be done somewhere, and it would be damn unfriendly to make the browser user do it!
In the bytes-oriented API, the programmer must be continually making decisions about whether and how to handle non-ASCII components from outside (or, more likely, cursing the existence of the damned foreigners, and then ignoring the possibility ... let them eat UnicodeException!).

As an example, if I give the uri http://server/%c3%83, rendering that as http://server/Ã can lead to transcription errors and reinterpretation problems unless you know - out of band - that the server is using utf-8 to encode. Conversely, if someone enters http://server/Ã in their browser window, choosing utf-8 or their local encoding is quite arbitrary and may not match how the server would represent that resource.

Sure. Using bytes doesn't solve either problem. It just allows you to wash your hands of it and pass it on to someone else, who probably has even less information than you do. E.g., passing the uri http://server/%c3%83 to someone else without telling them the encoding means that effectively they're limited to ASCII if they want to append meaningful relative paths without guessing the encoding. In the case of the user entering http://server/Ã, you have to do *something* to produce bytes eventually. When was the last time you typed %c3%83 at the end of a URL in a browser address field?

2. Characters

The URI syntax provides a method of encoding data, presumably for the sake of identifying a resource, as a sequence of characters. The URI characters are, in turn, frequently encoded as octets for transport or presentation. This specification does not mandate any particular character encoding for mapping between URI characters and the octets used to store or transmit those characters. When a URI appears in a protocol element, the character encoding is defined by that protocol; without such a definition, a URI is assumed to be in the same character encoding as the surrounding text.
That's true, but it's been taken out of context; the set of characters permitted in a URL is a strict subset of characters found in ASCII;

No. Again, you're confounding the URL with its network format. There's no question that the network format is in bytes, and before putting the URI into a wire protocol, you need to encode non-URI characters. However, the abstract URI is text, and may not even be represented by octets or Unicode at all (e.g., represented by carbon residue on recycled wood pulp).

See also the section on comparing URLs - Unicode isn't at all relevant.

Not to the RFC, which talks about *characters* and gives examples that imply transcoding (e.g., between EBCDIC and UTF-16); see the section you cite. However, Unicode is the canonical representation of text inside Python, and therefore TOOWTDI for URL comparison in Python. Thank you for that killer argument for my position; I hadn't thought of it.

I wish it would. The problem is not in Python here though - and casual handwaving will exacerbate it, not fix it.

Using bytes "because we just don't know" is exactly casual handwaving. Well, maybe not
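The ambiguity in the http://server/%c3%83 example can be shown directly with urllib.parse on Python 3: percent-encoding is only invertible if the encoding is known out of band, exactly as both sides of this exchange observe.

```python
from urllib.parse import quote, unquote

# 'Ã' (U+00C3) percent-encoded via UTF-8 gives the two escapes from
# the example; via Latin-1 it is a single escape.
print(quote('Ã', safe=''))                       # '%C3%83'
print(quote('Ã', safe='', encoding='latin-1'))   # '%C3'

# Going the other way, '%c3%83' decodes differently depending on the
# encoding assumed -- the out-of-band knowledge problem.
print(unquote('%c3%83'))                      # 'Ã' under the utf-8 default
print(unquote('%c3%83', encoding='latin-1'))  # 'Ã\x83' under latin-1
```

Both quote() and unquote() take encoding and errors parameters in Python 3, which is the stdlib's way of making the producer's choice explicit.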
Re: [Python-Dev] bytes / unicode
On Jun 21, 2010, at 2:17 PM, P.J. Eby wrote:

One issue I remember from my enterprise days is some of the Asian-language developers at NTT/Verio explaining to me that unicode doesn't actually solve certain issues -- that there are use cases where you really *do* need bytes plus encoding in order to properly express something.

The thing that I have heard in passing from a couple of folks with experience in this area is that some older software in Asia would present characters differently if they were originally encoded in a Japanese encoding versus a Chinese encoding, even though they were really the same characters. I do know that Han unification is a giant political mess (http://en.wikipedia.org/wiki/Han_unification makes for some interesting reading), but my understanding is that it has handled enough of the cases by now that one can write software to display Asian languages and it will basically work with a modern version of unicode. (And of course, there's always the private use area, as Stephen Turnbull pointed out.)

Regardless, this is another example where keeping around a string isn't really enough. If you need to display a Japanese character in a distinct way because you are operating in the Japanese *script*, you need a tag surrounding your data that is a hint to its presentation. The fact that these presentation hints were sometimes determined by their encoding is an unfortunate historical accident.
Re: [Python-Dev] bytes / unicode
2010/6/20 Antoine Pitrou solip...@pitrou.net:

On Sun, 20 Jun 2010 14:40:56 -0400 P.J. Eby p...@telecommunity.com wrote: Actually, I would say that it's more that (in the network protocol case) we *have* bytes, some of which we would like to *treat* as text, yet do not wish to constantly convert back and forth to full-blown unicode

Well, then why don't you just stick with a bytes object?

There are not many tools for treating bytes as text.

-- Regards, Benjamin
Re: [Python-Dev] bytes / unicode
Also, url's are bytestrings - by definition; if the standard library has made them unicode objects in 3, I expect a lot of pain in the webserver space.

-Rob
Re: [Python-Dev] bytes / unicode
On 6/20/2010 5:55 PM, Benjamin Peterson wrote:

2010/6/20 Antoine Pitrou solip...@pitrou.net: On Sun, 20 Jun 2010 14:40:56 -0400 P.J. Eby p...@telecommunity.com wrote: Actually, I would say that it's more that (in the network protocol case) we *have* bytes, some of which we would like to *treat* as text, yet do not wish to constantly convert back and forth to full-blown unicode

Well, then why don't you just stick with a bytes object?

There are not many tools for treating bytes as text.

If one writes a function (most easily in Python)
1. in terms of the methods and operations shared by unicode and bytes, which is nearly all of them, and
2. does not gratuitously (and dare I say, unpythonically) do a class check to unnecessarily exclude one or the other, and
3. does not specialize by assuming only one of the possible values for type-specific constants, such as the number of chars/codes, and
4. does not do something unicode-specific such as normalization,
then the function should be agnostic and operate generically.

I think there was some temptation to be 'pure' and limit text methods to str and enforce the decode-manipulate-encode paradigm (which is extremely common in various forms, and nothing unusual). But for practicality and efficiency, that was not done.

Do you have in mind any tools that could and should operate on both, but do not? (I realize that at the C level, code is not just specialized to 'unicode', but to 2-byte versus 4-byte representations.)

Terry Jan Reedy
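A function meeting all four conditions above needs no class check at all: no-argument split() and slicing are shared by str and bytes, and s[:0] yields an empty value of whichever type came in. A small illustrative example (invented here, not from the thread):

```python
def first_token(s):
    """Return the first whitespace-separated token of a str or bytes
    value, or an empty value of the input's own type."""
    parts = s.split()   # no-argument split works on both types
    # s[:0] is '' for str and b'' for bytes, so even the fallback
    # constant is derived from the input rather than hard-coded.
    return parts[0] if parts else s[:0]

print(first_token('  hello world '))    # str in, str out
print(first_token(b'  hello world '))   # bytes in, bytes out
```

The function is polymorphic (str in, str out; bytes in, bytes out) purely by writing to the shared subset of the two APIs, which is Terry's point.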
Re: [Python-Dev] bytes / unicode
At 07:33 PM 6/20/2010 -0400, Terry Reedy wrote:

Do you have in mind any tools that could and should operate on both, but do not?

From http://mail.python.org/pipermail/web-sig/2009-September/004105.html :

"The problem which arises is that unquoting of URLs in Python 3.X stdlib can only be done on unicode strings. If though a string contains non UTF-8 encoded characters it can fail."

I don't have any direct experience with the specific issue demonstrated in that post, but in the context of the discussion as a whole, I understood the overall issue as "if you pass bytes to certain stdlib functions, you might get back unicode, an explicit error, or (at least in the case shown above) something that's just plain wrong."
Re: [Python-Dev] bytes / unicode
At 11:47 PM 6/20/2010 +0200, Antoine Pitrou wrote:

On Sun, 20 Jun 2010 14:40:56 -0400 P.J. Eby p...@telecommunity.com wrote: Actually, I would say that it's more that (in the network protocol case) we *have* bytes, some of which we would like to *treat* as text, yet do not wish to constantly convert back and forth to full-blown unicode

Well, then why don't you just stick with a bytes object?

Because the stdlib is not consistent in how well it handles bytes objects.

While reading over this thread, I'm wondering whether at least my (WSGI-related) problems in this area would be solved by the availability of a type (say bstr) that was simply a wrapper providing string-like behavior over an underlying bytes, byte array, or memoryview, that would produce objects of compatible type when combined with strings (by encoding them to match).

This really sounds horrible. Python 3 was designed precisely to discourage ad hoc mixing of bytes and unicode.

Who said ad hoc mixing? The point is to have a simple way to ensure that my bytes don't get implicitly converted to unicode, and (ideally) don't have to get converted *back*, either. The idea that by passing bytes to the stdlib, I randomly get back either bytes or unicode (i.e. undocumentedly and inconsistently between different stdlib APIs, as well as possibly dependent on runtime conditions), is NOT discouraging ad hoc mixing.

seems so much saner than writing *this* everywhere:

newurl = str(urljoin(str(base, 'latin-1'), 'subdir'), 'latin-1')

urljoin already returns an str object. Why do you want to decode it again?

Ugh. I meant:

newurl = urljoin(str(base, 'latin-1'), 'subdir').encode('latin-1')

Which just goes to the point of how ridiculous it is to have to convert things to strings and back again to use APIs that ought to just handle bytes properly in the first place.
(I don't know if there are actually any problems in the case of urljoin; I wasn't the person who originally brought up the "stdlib not treating URLs as bytestrings in 3.x" issue on the Web-SIG. Somewhere along the line I got the impression that urljoin was one such API, but in researching the issue it looks like maybe the canonical example was parse_qsl.)

It's possible that the stdlib situation has improved tremendously since then, of course. I don't know if the bug was reported, or how many remain. And it's precisely the part where I don't know how many remain that keeps me from doing more than idly thinking about porting any of my libraries (let alone apps) to Python 3.x. The fact that the stdlib itself has these sorts of issues raises major red flags to me about whether the One Obvious Way has yet been found. If the stdlib maintainers don't agree on the One Obvious Way, that seems even worse. Or if there is such a Way, but nobody has documented its practices yet, that's almost the same thing.

I also find it weird that there seem to be two camps on this subject, one of which claims that All Is Well And There Is No Problem -- but I do not recall seeing anyone who was in the "What do I do; this doesn't seem ready" camp who switched sides and took the time to write down what made them realize that they were wrong about there being a problem, and what steps they had to take. The existence of one or more such documents would certainly ease my mind, and I imagine that of other people who are less waiting for others' libraries than for the stdlib (and/or language) itself to settle. (Or more precisely, for it to be SEEN to have settled.)
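For what it's worth, the corrected one-liner from the exchange above does work on Python 3, because latin-1 maps every byte 0-255 to the same code point and back losslessly (the base URL here is an invented example):

```python
from urllib.parse import urljoin

base = b'http://example.com/a/b/'  # bytes straight off the wire
# Decode via latin-1 (lossless for arbitrary bytes), join as text,
# then encode straight back to bytes -- the dance being objected to.
newurl = urljoin(str(base, 'latin-1'), 'subdir').encode('latin-1')
print(newurl)   # b'http://example.com/a/b/subdir'
```

The round trip is correct but verbose, which is precisely Eby's complaint: the conversion carries no information, it only satisfies the type signatures.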
Re: [Python-Dev] bytes / unicode
On 6/20/2010 9:33 PM, P.J. Eby wrote:

At 07:33 PM 6/20/2010 -0400, Terry Reedy wrote: Do you have in mind any tools that could and should operate on both, but do not? From http://mail.python.org/pipermail/web-sig/2009-September/004105.html :

Thanks for the concrete examples in this and your other post. I am cc-ing the author of the above.

"The problem which arises is that unquoting of URLs in Python 3.X stdlib can only be done on unicode strings."

Actually, I believe this is an encoding rather than a bytes-versus-unicode issue.

"If though a string contains non UTF-8 encoded characters it can fail."

Which is to say, I believe, if the ascii text in the (unicode) string has a % encoding of a byte that is not a legal utf-8 encoding of anything. The specific example is

urllib.parse.parse_qsl('a=b%e0')
[('a', 'b�')]

where the character after 'b' is a white '?' in a dark diamond, indicating an error.

parse_qsl() splits that input on '=' and sends each piece to urllib.parse.unquote. unquote() attempts to "Replace %xx escapes by their single-character equivalent." unquote has an encoding parameter that defaults to 'utf-8' in *its* call to .decode. parse_qsl does not have an encoding parameter. If it did, and it passed that to unquote, then the above example would become (simulated interaction)

urllib.parse.parse_qsl('a=b%e0', encoding='latin-1')
[('a', 'bà')]

I got that output by copying the file and adding encoding='latin-1' to the unquote call. Does this solve this problem? Has anything like this been added for 3.2? Should it be?

I don't have any direct experience with the specific issue demonstrated in that post, but in the context of the discussion as a whole, I understood the overall issue as "if you pass bytes to certain stdlib functions, you might get back unicode, an explicit error, or (at least in the case shown above) something that's just plain wrong."

As indicated above, I so far think that the problem is with the application of the new model, not the model itself.
Just for 'fun', I tried feeding bytes to the function:

p.parse_qsl(b'a=b%e0')
Traceback (most recent call last):
  File "<pyshell#2>", line 1, in <module>
    p.parse_qsl(b'a=b%e0')
  File "C:\Programs\Python31\lib\urllib\parse.py", line 377, in parse_qsl
    pairs = [s2 for s1 in qs.split('&') for s2 in s1.split(';')]
TypeError: Type str doesn't support the buffer API

I do not know if that message is correct, but certainly trying to split bytes with unicode is (now, at least) a mistake. This could be 'fixed' by replacing the typed literals with expressions that match the type of the input. But I am not sure if that is sensible, since the next step is to unquote and decode to unicode anyway. I just do not know the use case.

Terry Jan Reedy
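As it turned out, parse_qsl did grow encoding and errors parameters (in Python 3.2), so Terry's "simulated interaction" runs as-is on modern versions:

```python
from urllib.parse import parse_qsl

# Default: escapes are decoded as UTF-8 with errors='replace', so the
# lone %e0 byte comes back as U+FFFD, the replacement character --
# the '?' in a dark diamond described above.
print(parse_qsl('a=b%e0'))                     # [('a', 'b\ufffd')]

# With the encoding stated explicitly, the byte decodes as intended.
print(parse_qsl('a=b%e0', encoding='latin-1')) # [('a', 'bà')]
```

This is an instance of the "APIs that need to grow encoding keyword arguments that they then pass on to the functions they call" pattern from earlier in the thread.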
Re: [Python-Dev] bytes / unicode
On Sun, Jun 20, 2010 at 23:55, Benjamin Peterson benja...@python.org wrote:

There are not many tools for treating bytes as text.

Well, what tools would you need that can be used also on bytes? Bytes objects have a lot of the same methods that strings do, and that will cover 99% of the cases. Most text tools assume that the text really is text, and much of it doesn't make sense unless you've converted it to Unicode first. But most of the things you would need to do, such as in a web server, don't really involve treating the text as something linguistic; it's a matter of replacing and escaping and such, and that could be done while the text is in bytes form. But the tools for that exist... Is there some specific tool that is missing?

-- Lennart Regebro: http://regebro.wordpress.com/ Python 3 Porting: http://python3porting.com/ +33 661 58 14 64