While working on the 3.0 version of bsddb, I have run into the following issue.
Until 3.0, keys and values were strings. For bsddb, they are opaque, and stored unchanged. In 3.0 the string type is replaced by unicode, and a new "bytes" type is added. So, code like "db.put('key','value')" needs to be changed to "db.put(bytes('key', 'utf-8'), bytes('value', 'utf-8'))", or something similar. This is ugly, and the resulting code is incompatible with previous Python releases.

I was wondering what to do.

The obvious path would be to put a proxy object between application code and bsddb, doing the bytes<->unicode translation on the fly. This could be problematic when dealing with legacy data, since it might not be a validly encoded byte string. Data misrepresentation would be dangerous and could go undetected for a long time, slowly corrupting the database. Moreover, the data is application specific, so automatic conversion can introduce incompatibilities and bugs.

Another approach would be to add a new bsddb method to specify the default encoding used to convert unicode->bytes, and to do the conversion internally when getting unicode data as a parameter. The issue here is that "u'hi' != b'hi'", so the translation must be done both when storing and when retrieving data.

These problems arise because now string != bytes. In fact the approach in 3.0 is the right one, and any attempt to hide this difference with proxy objects or automatic conversion is going to bite us someday.

So, I'm seriously thinking of accepting *ONLY* "bytes" in the bsddb API (when working under Python 3.0), and doing the proxy thing *ONLY* in the testsuite, to be able to reuse it.

What do you think?

PS: Since most of the time keys/values are 7-bit, a direct "ascii" encoding would be fine... until we are required to store an 8-bit value.

PPS: In dbm (gdbm) I'm seeing automatic unicode->bytes conversion, but NO bytes->unicode.
See the problem when storing non-ASCII data:

"""
Python 3.0b2 (r30b2:65080, Jul 19 2008, 03:39:09)
[GCC 4.2.3] on sunos5
Type "help", "copyright", "credits" or "license" for more information.
|>> import dbm
|>> a=dbm.open("z","c")
|>> a
<_gdbm.gdbm object at 0x82fb560>
|>> a["a"]="b"
|>> a["b"]="c"
|>> a.sync()
|>> a.close()
|>> a=dbm.open("z","w")
|>> a.keys()
[b'b', b'a']
|>> a["c"]=chr(210)
|>> a["c"]
b'\xc3\x92'
|>> a["c"]==chr(210)
False
"""

-- 
Jesus Cea Avion
[EMAIL PROTECTED] - http://www.jcea.es/
jabber / xmpp:[EMAIL PROTECTED]
"Things are not so easy"
"My name is Dump, Core Dump"
"Love is placing your happiness in the happiness of another" - Leibniz

_______________________________________________
Python-3000 mailing list
Python-3000@python.org
http://mail.python.org/mailman/listinfo/python-3000
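
Addendum: a minimal sketch of the bytes-only idea proposed above, with the str<->bytes proxy confined to the test suite. Everything here is an assumption for illustration: "BytesOnlyStore" and "StrProxy" are hypothetical names, and the store is a plain dict standing in for a bsddb handle, not the real bsddb API. It only shows the intended behaviour: str is rejected at the API boundary, and the proxy decodes strictly so that invalid legacy bytes fail loudly instead of being misrepresented.

```python
class BytesOnlyStore:
    """Hypothetical stand-in for a bytes-only bsddb handle (a dict underneath)."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Reject str outright instead of guessing an encoding.
        if not isinstance(key, bytes) or not isinstance(value, bytes):
            raise TypeError("keys and values must be bytes")
        self._data[key] = value

    def get(self, key):
        if not isinstance(key, bytes):
            raise TypeError("keys must be bytes")
        return self._data.get(key)


class StrProxy:
    """Hypothetical test-suite-only wrapper: str in, str out, explicit encoding."""

    def __init__(self, store, encoding="utf-8"):
        self._store = store
        self._encoding = encoding

    def put(self, key, value):
        self._store.put(key.encode(self._encoding),
                        value.encode(self._encoding))

    def get(self, key):
        raw = self._store.get(key.encode(self._encoding))
        # Strict decoding: invalid legacy bytes raise UnicodeDecodeError
        # rather than being silently turned into something else.
        return None if raw is None else raw.decode(self._encoding)


store = BytesOnlyStore()
proxy = StrProxy(store)
proxy.put("c", chr(210))
assert proxy.get("c") == chr(210)       # round-trips through the proxy
assert store.get(b"c") == b"\xc3\x92"   # the bytes actually stored
```

Unlike the dbm session above, the translation here happens in both directions and only inside the proxy, so application code that talks to the store directly always sees exactly the bytes it wrote.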