At 10:29 AM 6/21/2010 -0700, Guido van Rossum wrote:
> Perhaps there are more situations where a polymorphic API would be
> helpful. Such APIs are not always so easy to implement, because they
> have to be careful with literals or other constants (and even more so
> mutable state) used internally -- but it can be done, and there are
> plenty of examples in the stdlib.

What if we could use the time machine to let the APIs that *were* polymorphic regain their previously-polymorphic status, without needing to actually *change* any of the code of those functions?

That's what Barry's ebytes proposal would do, with appropriate coercion rules. Passing ebytes into such a function would yield back ebytes, even if the function used strings internally, as long as those strings could be encoded back to bytes using the ebytes' encoding. (Which would normally be the case, since stdlib constants are almost always ASCII, and the main use cases for ebytes would involve ASCII-compatible encodings.)
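A minimal sketch of what such a type might look like. To be clear, the name `EBytes` and every detail of this API are my assumptions, not anything from Barry's actual proposal; the point is only the coercion direction, where str operands get *encoded* using the ebytes' encoding rather than the bytes being decoded up to text:

```python
# Hypothetical sketch of an "ebytes"-style type: bytes that remember
# their encoding, and that encode str operands down to bytes on mixing.

class EBytes(bytes):
    """Bytes plus an encoding attribute (illustrative only)."""

    def __new__(cls, data, encoding):
        self = super().__new__(cls, data)
        self.encoding = encoding
        return self

    def _coerce(self, other):
        # Encode str operands using *our* encoding; this raises
        # UnicodeEncodeError if the text doesn't fit that encoding.
        if isinstance(other, str):
            return other.encode(self.encoding)
        return other

    def __add__(self, other):
        return EBytes(bytes(self) + self._coerce(other), self.encoding)

    def __radd__(self, other):
        return EBytes(self._coerce(other) + bytes(self), self.encoding)


path = EBytes(b"caf\xe9", "latin-1")
result = path + ".txt"            # the str is encoded, not the bytes decoded
print(type(result).__name__, result.encoding)   # EBytes latin-1
```

Because `str.__add__` returns NotImplemented for a non-str right operand, even `"dir/" + path` lands in `__radd__` here and stays an `EBytes`.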


> I'm still unclear on exactly what bstr is supposed to be, but it sounds
> a bit like one of the rejected proposals for having a single
> (Unicode-capable) str type that is implemented using different width
> encodings (Latin-1, UCS-2, UCS-4) underneath.

Not quite - as modified by Barry's proposal (which I like better than mine) it'd be an object that just combines bytes with an attribute indicating the underlying encoding. When it interacts with strings, the strings are *encoded* to bytes, rather than upgrading the bytes to text.

This is actually a big advantage for error detection in any application where you're working with data that *must* be encodable in a specific encoding for output, because it lets you catch errors much *earlier* than you would if you only did the encoding at your output boundary.
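A toy illustration of that early-failure property (plain functions, all names hypothetical): if str operands are encoded at the point of combination, a value that can't survive the output encoding blows up right where it is introduced, not later at the boundary:

```python
# Illustrative only: encode str operands with the byte payload's
# declared encoding at the moment they're combined, so bad data fails
# here rather than at the output boundary.

def combine(payload: bytes, text: str, encoding: str) -> bytes:
    return payload + text.encode(encoding)

header = b"Subject: "
ok = combine(header, "hello", "ascii")      # fine: pure ASCII
print(ok)                                   # b'Subject: hello'

try:
    combine(header, "caf\u00e9", "ascii")   # non-ASCII sneaks in
except UnicodeEncodeError as exc:
    print("caught at combination time:", exc.reason)
```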

Anyway, this would not be the normal bytes type or string type; it's "bytes with an encoding". It's also more general than Unicode, in the sense that it allows you to work with character sets that don't really *have* a proper Unicode mapping.

One issue I remember from my "enterprise" days is some of the Asian-language developers at NTT/Verio explaining to me that Unicode doesn't actually solve certain issues -- that there are use cases where you really *do* need "bytes plus encoding" in order to properly express something. Unfortunately, I never quite wrapped my head around the idea, I just remember it had something to do with the fact that Unicode has single character codes that mean different things in different languages, such that you were actually losing information by converting to Unicode, or something like that. (Or maybe the characters were expressed differently in certain encodings according to what language they came from, so you couldn't roundtrip them through Unicode without losing information. I think that's probably what it was; maybe somebody here can chime in more on that point.)

Anyway, a type like this would need at least a bit of support from the core language, because the str type would need to handle at least the __contains__ and %/.format() coercion cases: those operations don't have __r*__ equivalents that a user-implemented type could hook, and strings don't have anything like a '__coerce__' either.
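A quick demonstration of the gap, using a minimal stand-in for the same hypothetical type (class name and setup are mine): when the str object is the one driving the operation, a bytes-derived type never gets a chance to coerce, no matter what methods it defines:

```python
# Hypothetical ebytes-like subclass; str.__contains__ and %-formatting
# never consult the other operand, so there's nothing it can override.

class EBytes(bytes):
    encoding = "latin-1"

frag = EBytes(b"caf")

try:
    frag in "caf\xe9 menu"    # str is the container; it rejects non-str
except TypeError as exc:
    print("no hook for this case:", exc)

formatted = "value: %s" % frag
print(type(formatted).__name__)   # str -- the ebytes-ness is lost
```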

If sufficient hooks existed, then an ebytes could be implemented outside the stdlib, and still used within it.

_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev