On Sat, 18 Feb 2006 09:59:38 +0100, "Martin v. Löwis"
<[EMAIL PROTECTED]> wrote:
>Aahz wrote:
>> The problem is that they don't understand that "Martin v. Löwis" is not
>> Unicode -- once all strings are Unicode, this is guaranteed to work.
Well, after all the "string" literal escapes that were being used
to define byte values are all rewritten, yes, I'll believe the guarantee ;-)
(BTW, are there plans for migration tools?)
Ok, now back to the s/bytes/octet/ topic:
>
>This specific call, yes. I don't think the problem will go away as long
>as both encode and decode are available for both strings and byte
>arrays.
>
>> While it's not absolutely true, my experience of watching Unicode
>> confusion is that the simplest approach for newbies is: encode FROM
>> Unicode, decode TO Unicode.
>
>I think this is what should be in-grained into the library, also. It
>shouldn't try to give additional meaning to these terms.
>
Thinking about bytes recently, it occurs to me that bytes are really not
intrinsically numeric in nature. They don't necessarily represent uint8's.
E.g., a binary file is really a sequence of bit octets in its most primitive
and abstract sense.
So I'm wondering if we shouldn't have an octet type analogous to unicode, and
instances of octet would be vectors of octets as abstract 8-bit bit vectors,
like instances of unicode are vectors of abstract characters.
If you wanted integers you could map ord for integers guaranteed to be in
range(256).
The constructor would naturally take any suitable integer sequence so
octet([65,66,67]) would work.
In general, all encode methods would produce an octet instance, e.g.
unicode.encode.
octet.decode(octet_instance, 'src_encoding') or
octet_instance.decode('src_encoding') would do all the familiar character
code sequence decoding, e.g., octet.decode(oseq, 'utf-8') or
oseq.decode('utf-8') to make a unicode instance.
Going from unicode, unicode.encode(uinst, 'utf-8') or uinst.encode('utf-8')
would produce an octet instance.
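A minimal sketch of that direction rule, assuming Python 3, where the built-in bytes type stands in for the proposed octet type (that mapping is an assumption made for illustration, not part of the proposal above):

```python
# Python 3 sketch: bytes plays the role of the proposed octet type.
text = "L\u00f6wis"                 # abstract characters
octets = text.encode("utf-8")       # characters -> octets
restored = octets.decode("utf-8")   # octets -> characters

assert isinstance(octets, bytes)
assert restored == text

# Each type offers only one direction: str cannot be decoded
# and bytes cannot be encoded.
assert not hasattr(text, "decode")
assert not hasattr(octets, "encode")
```

With only one method per type, "encode FROM unicode, decode TO unicode" is the only reading available.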
I think this is conceptually purer than the current bytes idea, since the
result really has no arithmetic significance.
Also, ord would work on a length-one octet instance, and produce the unsigned
integer value you'd expect, but would fail if not length-one, like ord on
unicode (or current str).
Thus octet would replace bytes as the binary info container, and would not
have any presumed arithmetic significance, either as integer or as
character-of-current-source-encoding-inferred-from-integer-value-as-ord.
To get a text representation of octets, hex is natural, e.g.,
octet('6162 6380') # spaces ignored
so repr(octet('a deaf bee')) => "octet('adeafbee')" and
octet('616263').decode('ascii') => u'abc' and
back: u'abc'.encode('ascii') => octet('616263').
The base64 codec looks conceptually cleaner too, so long as you keep in mind
base64 as a character subset of unicode and the name of the transformation
function pair.
octet('616263').decode('base64') => u'YWJj\n' # octets -> characters
u'YWJj\n'.encode('base64') => octet('616263') # characters -> octets
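The hex and base64 examples above can be approximated in Python 3 with a small wrapper class. This is only a sketch under stated assumptions: the names Octet, ordinal and encode_text are hypothetical, ordinal() stands in for ord() (which cannot be extended to a user-defined class), and encode_text() stands in for an encode method on the text type.

```python
import base64


class Octet:
    """Hypothetical octet vector whose text form is hex (illustration only)."""

    def __init__(self, source=()):
        if isinstance(source, int):
            raise TypeError("expected hex text or a sequence of integers")
        if isinstance(source, str):
            # Hex text; all whitespace is ignored, as in octet('6162 6380').
            self._data = bytes.fromhex("".join(source.split()))
        else:
            # Any iterable of integers in range(256), or a bytes object.
            self._data = bytes(source)

    def __repr__(self):
        return "octet('%s')" % self._data.hex()

    def __eq__(self, other):
        return isinstance(other, Octet) and self._data == other._data

    def __hash__(self):
        return hash(self._data)

    def __len__(self):
        return len(self._data)

    def ordinal(self):
        # Stand-in for ord(): only a length-one instance has an integer value.
        if len(self._data) != 1:
            raise TypeError("ordinal() expected length 1, got %d" % len(self._data))
        return self._data[0]

    def decode(self, encoding):
        # octets -> characters
        if encoding == "base64":
            return base64.encodebytes(self._data).decode("ascii")
        return self._data.decode(encoding)


def encode_text(text, encoding):
    # characters -> octets
    if encoding == "base64":
        return Octet(base64.decodebytes(text.encode("ascii")))
    return Octet(text.encode(encoding))


assert repr(Octet("a deaf bee")) == "octet('adeafbee')"
assert Octet("616263").decode("ascii") == "abc"
assert encode_text("abc", "ascii") == Octet("616263")
assert Octet([65, 66, 67]) == Octet("41 42 43")
```

Note that base64 runs in the direction described in the text: decoding octets yields base64 characters, and encoding those characters yields the octets again.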
If you wanted integer-nature bytes, you could have octet codecs for uint8 and
int8, e.g., octseq.decode('int8') could produce a list of signed integers all
in range(-128,128). Or maybe map(dec_int8, octseq).
The array module could easily be a target for octet.decode, e.g.,
octseq.decode('array_B') or octet.decode(octseq, 'array_B'), and
octet(array_instance) the other way.
Likewise, other types could be destinations for octet.decode.
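As a rough illustration of the integer views, Python 3's bytes and the standard array module already give the uint8 and int8 interpretations. Here bytes is assumed to stand in for the octet type, and the 'int8' and 'array_B' codec names from the text are not real codecs; the sketch calls the array module directly instead.

```python
import array

raw = bytes.fromhex("7f80ff00")

# uint8 view: iterating over bytes yields integers in range(256).
as_uint8 = list(raw)

# int8 view: array type code 'b' is a signed char, so values
# fall in range(-128, 128).
as_int8 = array.array("b", raw).tolist()

# array as a decode target and back again ('B' is unsigned char).
arr = array.array("B", raw)
round_trip = arr.tobytes()

assert as_uint8 == [127, 128, 255, 0]
assert as_int8 == [127, -128, -1, 0]
assert round_trip == raw
```

The same four octets give two different integer lists, which is the point: the numeric meaning comes from the chosen interpretation, not from the octets themselves.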
E.g., if you had an abstraction for a display image one could have 'gif' and
'png' and 'bmp' etc. be like 'cp437', 'latin-1', and 'utf-8' etc. are for
decoding octets to unicode, and write stuff like
o_seq = open('pic.gif','rb') # makes octet instance
img = o_seq.decode('gif89') # => img is abstract, internally represented
                            #    suitably but hidden, like unicode.
open('pic.png', 'wb').write(img.encode('png'))
UIAM PIL has this functionality, if not as encode/decode methods.
Similarly, there could be an abstract archive container, and you could have
arch = open('tree.tgz','rb').decode('tgz') # => might do lazy things
                                           #    waiting for encode
egg_octets = arch.encode('python_egg') # convert to egg format??
                                       # (just hand-waving ;-)
Probably all it would take is to wrap some things in abstract-container (AC)
types, to enforce the protocol.
Image(octet_seq, 'gif') might produce an AC that only saved an (octet_seq,
'gif') pair internally, or it might do eager conversion per optional
additional args. Certainly .bmp without rle can be hugely wasteful.
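The lazy-versus-eager choice might look roughly like the sketch below. Everything in it is hypothetical: the class name LazyContainer, the eager flag and the is_materialized property are invented for illustration, and text codecs stand in for image formats because the standard library has no image codecs.

```python
class LazyContainer:
    """Hypothetical AC that keeps (octets, encoding) until a conversion is needed."""

    def __init__(self, octets, enc, eager=False):
        self._source = (bytes(octets), enc)
        self._value = None
        if eager:
            self._materialize()

    @property
    def is_materialized(self):
        return self._value is not None

    def _materialize(self):
        # Decode the stored octets once and cache the abstract value.
        if self._value is None:
            octets, enc = self._source
            self._value = octets.decode(enc)
        return self._value

    def encode(self, enc):
        octets, src_enc = self._source
        if enc == src_enc:
            # Same format requested: hand back the original octets untouched.
            return octets
        return self._materialize().encode(enc)


lazy = LazyContainer(b"abc", "ascii")
same = lazy.encode("ascii")          # no conversion performed
assert not lazy.is_materialized
wide = lazy.encode("utf-16-le")      # forces the decode step
assert lazy.is_materialized
```

A container built with eager=True would pay the decoding cost up front, which may be preferable when the source octets should not be kept around.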
For flexibility like eager vs not, or perhaps returning an iterator instead of
a byte sequence, I guess the encode/decode signatures should be
(enc, *args, **kw) and pass those things on to the worker functions? An
abstract container could have a "pack" codec to do serial
composition/decomposition.
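One way to read the (enc, *args, **kw) idea is a codec registry that forwards the extra arguments to per-format worker functions. The sketch below assumes a toy container of small integers; IntVector, register and the 'uint8'/'uint16' codec names are made up for illustration, and decode is a classmethod only because methods cannot be added to the built-in bytes type.

```python
class IntVector:
    """Toy abstract container; the registry and codec names are hypothetical."""

    _codecs = {}  # name -> (encode_worker, decode_worker)

    def __init__(self, values):
        self._values = list(values)

    @classmethod
    def register(cls, name, encode_worker, decode_worker):
        cls._codecs[name] = (encode_worker, decode_worker)

    @classmethod
    def decode(cls, octets, enc, *args, **kw):
        # Extra arguments go straight through to the worker function.
        return cls(cls._codecs[enc][1](octets, *args, **kw))

    def encode(self, enc, *args, **kw):
        return self._codecs[enc][0](self._values, *args, **kw)


def enc_uint16(values, byteorder="big"):
    return b"".join(v.to_bytes(2, byteorder) for v in values)


def dec_uint16(octets, byteorder="big"):
    return [int.from_bytes(octets[i:i + 2], byteorder)
            for i in range(0, len(octets), 2)]


IntVector.register("uint8", lambda values: bytes(values), lambda octets: list(octets))
IntVector.register("uint16", enc_uint16, dec_uint16)

ac = IntVector.decode(b"\x01\x02", "uint8")
little = ac.encode("uint16", byteorder="little")  # keyword reaches enc_uint16
big = ac.encode("uint16")                         # worker default applies
assert little == b"\x01\x00\x02\x00"
assert big == b"\x00\x01\x00\x02"
```

The container itself never inspects byteorder; only the worker that understands the option sees it, which keeps the encode/decode protocol uniform across formats.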
I'm sure Mal has all this stuff one way or another, but I wanted the
conceptual purity of AC instances ac in
ac = octet_seq.decode('src_enc'); octet_seq = ac.encode('dst_enc') ;-)
Bottom line thought: binary octets aren't numeric ;-)
Regards,
Bengt Richter