DBIers,

tl;dr: I think it's time to add proper Unicode support to the DBI. What do you think it should look like?

Background

I've brought this up a time or two in the past, but a number of things have happened lately to make me think that it was again time:

First, on the DBD::Pg list, we've been having a discussion about improving the DBD::Pg encoding interface.

  http://www.nntp.perl.org/group/perl.dbd.pg/2011/07/msg603.html

That design discussion followed on the extended discussion in this bug report:

  https://rt.cpan.org/Ticket/Display.html?id=40199

Seems that the pg_enable_utf8 flag that's been in DBD::Pg for a long time is rather broken in a few ways. Notably, PostgreSQL sends *all* data back to clients in a single encoding -- even binary data (which is usually hex-encoded). So it made no sense to only decode certain columns. How to go about fixing it, though, and adding a useful interface, has proven a bit tricky.

Then there was Tom Christiansen's StackOverflow comment:

  stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default/6163129#6163129

This made me realize that Unicode handling is much trickier than I ever realized. But it also emphasized for me how important it is to do everything one can to do Unicode right. Tom followed up with a *lot* more detail in three OSCON presentations this year, all of which you can read here:

  http://98.245.80.27/tcpc/OSCON2011/index.html

(You're likely gonna want to install the fonts linked at the bottom of that page before you read the presentations in HTML.)

And finally, I ran into an issue recently with Oracle, where we have an Oracle database that should have only UTF-8 data but some row values are actually in other encodings. This was a problem because I told DBD::Oracle that the encoding was Unicode, and it just blindly turned on the Perl utf8 flag. So I got broken data back from the database, and then my app crashed when I tried to act on a string with the utf8 flag on but containing non-Unicode bytes.
I reported this issue in a DBD::Oracle bug report:

  https://rt.cpan.org/Public/Bug/Display.html?id=70819

But all this together leads me to believe that it's time to examine adding explicit Unicode support to the DBI. But it needs to be designed as carefully as possible to account for a few key points:

* The API must be as straightforward as possible without sacrificing necessary flexibility. I think it should mostly stay out of users' way and have reasonable defaults. But it should be clear what each knob we offer does and how it affects things. Side effects should be avoided.

* Ability to enforce the correctness of encoding and decoding must be given priority. Perl has pretty specific ideas about what is and is not Unicode, so we should respect that as much as possible. If that means encoding and decoding rather than just flipping the utf8 bit, then fine.

* The performance impact must be kept as minimal as possible. So if we can get away with just flipping the utf8 bit on and off, it should be so. I'm not entirely clear on that, though, since Perl's internal representation, called "utf8", is not the same thing as UTF-8. But if there's an efficient way to convert between the two, then it should be adopted. For other encodings, obviously a full encode/decode path must be followed.

* Drivers must be able to adopt the API in a straightforward way. That is to say, we need to make sure that the interface covers what most (all?) drivers need. Some, like DBD::Pg, can specify that only one encoding come back from the database. Maybe others (DBD::mysql) can have individual columns in different encodings? It needs to cover that case, too.

* It must be able to give the drivers some flexibility. Where we can't account for everything that all drivers need forever, we should make it possible for them to add what they need without changing the overall API or the meaning of the interfaces provided by the DBI.

I'm not at all clear what such an API should look like.
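
To make the correctness and performance points above a bit more concrete, here is a minimal sketch of the difference between flipping the utf8 flag and doing a strict decode. It uses only the core Encode module. The byte strings are made up for illustration (Latin-1 bytes posing as UTF-8, roughly the Oracle situation described above), and nothing here is DBI or driver API.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical column value: "cafe" with an e-acute, encoded as Latin-1.
# A lone 0xE9 byte is NOT valid UTF-8.
my $bytes = "caf\xE9";

# Approach 1: just flip the flag. Cheap, but nothing is validated, so
# the scalar claims to hold characters while its buffer is malformed.
my $flagged = $bytes;
Encode::_utf8_on($flagged);
my $flag_is_on   = utf8::is_utf8($flagged) ? 1 : 0;
my $flag_is_sane = utf8::valid($flagged)   ? 1 : 0;

# Approach 2: strict decode. Costs a validation pass, but bad input is
# rejected at the boundary rather than surfacing later in the app.
# LEAVE_SRC keeps decode() from modifying its input in place.
my $check   = Encode::FB_CROAK | Encode::LEAVE_SRC;
my $decoded = eval { Encode::decode('UTF-8', $bytes, $check) };
my $strict_failed = defined $decoded ? 0 : 1;

# Well-formed UTF-8 decodes to characters as expected.
my $good_bytes  = "caf\xC3\xA9";
my $good        = Encode::decode('UTF-8', $good_bytes, $check);
my $good_length = length $good;    # counts characters, not bytes
```

The flag flip succeeds silently on the bad value, while the strict decode dies inside the eval. That is the trade-off any DBI-level design would need to settle.
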
Based on my extensive experience with DBD::Pg, a fair amount of experience with DBD::SQLite, and limited experience with DBD::Oracle and DBD::mysql, I'd say it'd be useful to have at least these knobs:

1. An attribute indicating the database encoding. This is the encoding one expects all data coming from the database to be in. When this is set, the DBI or the driver would decode incoming data to Perl's internal format and encode data sent to the database.

2. A fourth param to bind_param() to indicate the encoding in which to send column data to the database. Defaults to the database encoding.

3. A new parameter to prepare() to indicate the encodings of specific columns to be selected.

4. An ENCODING attribute on statement handles that indicates the encoding of each column.

This is just a preliminary proposal, but covers most of the basics, I think. (I'm sure I'm suggesting the wrong places for some things.) It does assume that one wants one's text data to always be decoded into Perl's internal form and encoded when sent to the database. There ought to be a way for one also to just continue to get binary data and encode and decode oneself (e.g. for actual binary columns).

I know that Tim Bunce has thought about this some in the past, and Greg Sabino Mullane and I have discussed it quite a lot with Dave Rolsky and others. So I know that folks have some ideas about this stuff. So let's hear them. Let's put our minds together and try to come up with an interface that we can all work with.

Thanks,

David
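
For illustration only, the behavior described in knobs 1 and 2 above might look roughly like this pure-Perl sketch. Every name in it (`db_encoding`, `from_db`, `to_db`) is hypothetical; none of this exists in the DBI or any driver. It only shows the intended decode-on-the-way-in, encode-on-the-way-out semantics, with a per-call encoding override and a pass-through when no encoding is set.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical stand-in for a handle carrying a database-encoding
# attribute. "db_encoding" is an invented name, not a DBI attribute.
my %handle = ( db_encoding => 'UTF-8' );

my $CHECK = Encode::FB_CROAK | Encode::LEAVE_SRC;

# Data coming from the database: bytes in, Perl characters out.
# With no encoding set, the bytes pass through untouched.
sub from_db {
    my ($h, $bytes) = @_;
    return $bytes unless defined $bytes && defined $h->{db_encoding};
    return Encode::decode($h->{db_encoding}, $bytes, $CHECK);
}

# Data going to the database: Perl characters in, bytes out.
# An explicit encoding (knob 2) overrides the handle default (knob 1).
sub to_db {
    my ($h, $chars, $encoding) = @_;
    $encoding = $h->{db_encoding} unless defined $encoding;
    return $chars unless defined $chars && defined $encoding;
    return Encode::encode($encoding, $chars, $CHECK);
}

my $chars  = "caf\x{E9}";                            # four characters
my $utf8   = to_db(\%handle, $chars);                # UTF-8 bytes
my $latin1 = to_db(\%handle, $chars, 'iso-8859-1');  # per-call override
my $back   = from_db(\%handle, $utf8);               # round trip

# With no encoding configured, data is handed back as raw bytes.
my %raw_handle = ( db_encoding => undef );
my $raw = from_db(\%raw_handle, $utf8);
```

The pass-through case is one possible answer to the "just give me the bytes" requirement; whether that belongs on the handle, the statement, or the column is exactly the open design question.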