At 10:29 AM 6/21/2010 -0700, Guido van Rossum wrote:
> Perhaps there are more situations where a polymorphic API would be
> helpful. Such APIs are not always so easy to implement, because they
> have to be careful with literals or other constants (and even more so
> mutable state) used internally -- but it can be done, and there are
> plenty of examples in the stdlib.

What if we could use the time machine to let the APIs that *were* polymorphic regain their previously-polymorphic status, without needing to actually *change* any of the code of those functions?

That's what Barry's ebytes proposal would do, with appropriate coercion rules. Passing ebytes into such a function would yield back ebytes, even if the function used strings internally, as long as those strings could be encoded back to bytes using the ebytes' encoding. (Which would normally be the case, since stdlib constants are almost always ASCII, and the main use cases for ebytes would involve ASCII-compatible encodings.)
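A minimal sketch of what such a type might look like. To be clear, the name `EBytes` and every detail of this API are my assumptions, not anything from Barry's actual proposal; the point is only the coercion direction, where str operands get *encoded* using the ebytes' encoding rather than the bytes being decoded up to text:

```python
# Hypothetical sketch of an "ebytes"-style type: bytes that remember
# their encoding, and that encode str operands down to bytes on mixing.

class EBytes(bytes):
    """Bytes plus an encoding attribute (illustrative only)."""

    def __new__(cls, data, encoding):
        self = super().__new__(cls, data)
        self.encoding = encoding
        return self

    def _coerce(self, other):
        # Encode str operands using *our* encoding; this raises
        # UnicodeEncodeError if the text doesn't fit that encoding.
        if isinstance(other, str):
            return other.encode(self.encoding)
        return other

    def __add__(self, other):
        return EBytes(bytes(self) + self._coerce(other), self.encoding)

    def __radd__(self, other):
        return EBytes(self._coerce(other) + bytes(self), self.encoding)


path = EBytes(b"caf\xe9", "latin-1")
result = path + ".txt"            # the str is encoded, not the bytes decoded
print(type(result).__name__, result.encoding)   # EBytes latin-1
```

Because `str.__add__` returns NotImplemented for a non-str right operand, even `"dir/" + path` lands in `__radd__` here and stays an `EBytes`.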


> I'm still unclear on exactly what bstr is supposed to be, but it sounds
> a bit like one of the rejected proposals for having a single
> (Unicode-capable) str type that is implemented using different width
> encodings (Latin-1, UCS-2, UCS-4) underneath.

Not quite - as modified by Barry's proposal (which I like better than mine) it'd be an object that just combines bytes with an attribute indicating the underlying encoding. When it interacts with strings, the strings are *encoded* to bytes, rather than upgrading the bytes to text.

This is actually a big advantage for error detection in any application where you're working with data that *must* be encodable in a specific encoding for output, because it lets you catch errors much *earlier* than you would if you only did the encoding at your output boundary.
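A toy illustration of that early-failure property (plain functions, all names hypothetical): if str operands are encoded at the point of combination, a value that can't survive the output encoding blows up right where it is introduced, not later at the boundary:

```python
# Illustrative only: encode str operands with the byte payload's
# declared encoding at the moment they're combined, so bad data fails
# here rather than at the output boundary.

def combine(payload: bytes, text: str, encoding: str) -> bytes:
    return payload + text.encode(encoding)

header = b"Subject: "
ok = combine(header, "hello", "ascii")      # fine: pure ASCII
print(ok)                                   # b'Subject: hello'

try:
    combine(header, "caf\u00e9", "ascii")   # non-ASCII sneaks in
except UnicodeEncodeError as exc:
    print("caught at combination time:", exc.reason)
```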

Anyway, this would not be the normal bytes type or string type; it's "bytes with an encoding". It's also more general than Unicode, in the sense that it allows you to work with character sets that don't really *have* a proper Unicode mapping.

One issue I remember from my "enterprise" days is some of the Asian-language developers at NTT/Verio explaining to me that Unicode doesn't actually solve certain issues -- that there are use cases where you really *do* need "bytes plus encoding" in order to properly express something. Unfortunately, I never quite wrapped my head around the idea, I just remember it had something to do with the fact that Unicode has single character codes that mean different things in different languages, such that you were actually losing information by converting to Unicode, or something like that. (Or maybe the characters were expressed differently in certain encodings according to what language they came from, so you couldn't roundtrip them through Unicode without losing information. I think that's probably what it was; maybe somebody here can chime in more on that point.)

Anyway, a type like this would need at least a bit of support from the core language, because the str type would need to handle at least the __contains__ and %/.format() coercion cases: those operations don't have __r*__ equivalents that a user-implemented type could hook, and strings don't have anything like a '__coerce__' either.
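A quick demonstration of the gap, using a minimal stand-in for the same hypothetical type (class name and setup are mine): when the str object is the one driving the operation, a bytes-derived type never gets a chance to coerce, no matter what methods it defines:

```python
# Hypothetical ebytes-like subclass; str.__contains__ and %-formatting
# never consult the other operand, so there's nothing it can override.

class EBytes(bytes):
    encoding = "latin-1"

frag = EBytes(b"caf")

try:
    frag in "caf\xe9 menu"    # str is the container; it rejects non-str
except TypeError as exc:
    print("no hook for this case:", exc)

formatted = "value: %s" % frag
print(type(formatted).__name__)   # str -- the ebytes-ness is lost
```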

If sufficient hooks existed, then an ebytes could be implemented outside the stdlib, and still used within it.

_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev