On 8/26/07, Gregory P. Smith <[EMAIL PROTECTED]> wrote:
> On 8/26/07, Travis Oliphant <[EMAIL PROTECTED]> wrote:
> > Gregory P. Smith wrote:
> > > I'm in favor of not allowing unicode for hash functions. Depending
> > > on the system default encoding for a hash will not be portable.
> > >
> > > Another question for hashlib: it uses PyArg_Parse to get a single
> > > 's' out of an optional parameter [see the code], and I couldn't
> > > figure out what the best thing to do there was. It just needs a C
> > > string to pass to OpenSSL to look up a hash function by name. It's
> > > C, so I doubt it'll ever be anything but ASCII. How should that
> > > parameter be parsed instead of with the old 's' string format?
> > > PyBUF_CHARACTER actually sounds ideal in that case, assuming it
> > > guarantees UTF-8, but I wasn't clear that it does (is it always
> > > UTF-8, or the system "default encoding", which is possibly useless
> > > as far as APIs expecting C strings are concerned)? Requiring a
> > > bytes object would also work, but I really don't like the idea of
> > > users needing to use a specific type for something so simple. (I
> > > consider string constants with their preceding b, r, u, s type
> > > characters ugly in code without a good reason for them to be
> > > there.)
> >
> > The PyBUF_CHARACTER flag was an add-on after I realized that the old
> > buffer API was being used in several places to get Unicode objects
> > to encode their data as a string (in the default encoding of the
> > system, I believe).
> >
> > The unicode object is the only one that I know of that actually does
> > something different when it is called with PyBUF_CHARACTER.
> >
> > > Is it just me, or do unicode objects supporting the buffer API
> > > seem like an odd concept, given that buffer API consumers (rather
> > > than unicode consumers) shouldn't need to know about encodings of
> > > the data being received?
> >
> > I think you have a point. The buffer API does support the concept
> > of "formats" but not "encodings", so having this PyBUF_CHARACTER
> > flag looks rather like a hack. I'd have to look, because I don't
> > even remember what is returned as the "format" from a unicode object
> > if it is requested (it is probably not correct).
>
> Given that UTF-8 characters are of varying widths, I don't see how it
> could ever practically be correct for unicode.
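[For concreteness, the bytes-only design Gregory is weighing here is
the one released Python 3's hashlib adopted: data must be bytes, while
the algorithm name stays a plain str. A minimal sketch; the literals
are arbitrary illustrative values:

    import hashlib

    # The bytes-only design: the caller encodes text explicitly, so
    # the digest never depends on a system default encoding.
    digest = hashlib.sha256("p\u00e4ssword".encode("utf-8")).hexdigest()

    # The algorithm name given to hashlib.new() is a plain str; it is
    # only an ASCII identifier used to look the hash up by name.
    h = hashlib.new("sha256")
    h.update(b"some bytes")    # passing a str here is a TypeError
    print(h.hexdigest(), digest)
]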
Well, *practically*, the unicode object returns UTF-8 for
PyBUF_CHARACTER. That is correct (at least until I rip all this out,
which I'm in the middle of -- but no time to finish it tonight).

> > I would prefer that the notion of encoding a unicode object be
> > separated from the notion of the buffer API, but last week I
> > couldn't see another way to un-tease it.
> >
> > -Travis
>
> A thought that just occurred to me... Would a PyBUF_CANONICAL flag be
> useful instead of CHARACTERS? For unicode that'd mean UTF-8 (not just
> the default encoding), but I could imagine other potential uses, such
> as multi-dimensional buffers (PIL image objects?) presenting a
> defined canonical form of the data, useful for either serialization
> or hashing. Any object implementing the buffer API would define its
> own canonical form.

Note, the default encoding in 3.0 is fixed to UTF-8. (And it's fixed in
a much more permanent way than in 2.x -- it is really hardcoded, and
there is really no way to change it.)

But I'm thinking YAGNI -- the buffer API should always just return the
bytes as they already are sitting in memory, not some transformation
thereof. The current behavior of the unicode object for PyBUF_CHARACTER
violates this. (There are no other violations, BTW.) This is why I want
to rip it out. I'm close...

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
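[A footnote on the "return the bytes as they sit in memory" rule: in
released Python 3 the str type ended up with no buffer support at all,
so encoding is always an explicit step rather than something hidden
behind the buffer API. A short sketch of how that looks from Python:

    # bytes expose their raw memory through the buffer protocol; a
    # memoryview is a zero-copy window onto exactly those bytes.
    view = memoryview(b"raw bytes")
    print(view.format, view.nbytes)   # 'B' 9 -- no transformation

    # str does not implement the buffer protocol, so nothing encodes
    # implicitly behind a buffer consumer's back...
    try:
        memoryview("text")
    except TypeError:
        # ...the encoding step has to be spelled out instead.
        view = memoryview("text".encode("utf-8"))
]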