Re: Add Unicode Support to the DBI

2011-09-10 Thread H.Merijn Brand
On Sat, 10 Sep 2011 03:06:49 -, Greg Sabino Mullane
g...@turnstep.com wrote:

 One thing I see bandied about a lot is that Perl 5.14 is highly preferred. 
 However, it's not clear exactly what the gains are and how bad 5.12 is 
 compared to 5.14, how bad 5.10 is, how bad 5.8 is, etc. Right now 5.8 is 
 the required minimum for DBI: should we consider bumping this? I know TC 
 would be horrified to see us attempting to talk about Unicode support 
 with a 5.8.1 requirement, but how much of that will affect database 
 drivers? I have no idea myself.

Unicode-6.0 and Unicode improvements in general are *THE* reason for me
(our company) to plan for a 5.10.1 to 5.14.2 update.

I use Unicode a lot, and we require 5.8.4 as an absolute minimum when
dealing with Unicode. 5.8.1 is not good enough.
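
A script that depends on such a floor can state it explicitly. This is only a sketch: the 5.8.4 figure comes from the paragraph above, while the variable name and the 5.14 check are illustrative assumptions.

```perl
use strict;
use warnings;

# Refuse to compile on anything older than 5.8.4 (the floor named above).
use 5.008004;

# $] holds the running version as a number, e.g. 5.014002 for 5.14.2.
# $has_5_14 is a hypothetical name used only for this illustration.
my $has_5_14 = $] >= 5.014 ? 1 : 0;

print "Running perl $], 5.14 or later: ", ($has_5_14 ? "yes" : "no"), "\n";
```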

 Another aspect to think about that came up during some offline DBD::Pg 
 talks was the need to support legacy scripts and legacy data. While the 
 *correct* thing is to blaze forward and Do Things Correctly everywhere, 
 I think we at least need some prominent knobs so that we can maintain 
 backwards compatibility for existing scripts that expect a bunch of 
 Latin1, or need the data to come back in the current, undecoded, 
 un-utf8-flagged way.
 
 - -- 
 Greg Sabino Mullane g...@turnstep.com

-- 
H.Merijn Brand  http://tux.nl  Perl Monger  http://amsterdam.pm.org/
using 5.00307 through 5.14 and porting perl5.15.x on HP-UX 10.20, 11.00,
11.11, 11.23 and 11.31, OpenSuSE 10.1, 11.0 .. 11.4 and AIX 5.2 and 5.3.
http://mirrors.develooper.com/hpux/   http://www.test-smoke.org/
http://qa.perl.org  http://www.goldmark.org/jeff/stupid-disclaimers/


Re: Add Unicode Support to the DBI

2011-09-10 Thread Martin J. Evans

On 10/09/2011 03:52, David E. Wheeler wrote:

DBIers,

tl;dr: I think it's time to add proper Unicode support to the DBI. What do you 
think it should look like?

I'm not sure any change is required to DBI to support Unicode. As far as 
I'm aware Unicode already works with DBI if the DBDs do the right thing.


If you stick to the rule that all data Perl receives must be decoded and 
all data Perl exports must be encoded it works (ignoring any issues in 
Perl itself).
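
A minimal sketch of that rule using the core Encode module. The sample bytes and variable names are assumptions for illustration; no DBI or driver call is involved, the byte string merely stands in for data arriving from a database.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as they might arrive from a database client library: the UTF-8
# encoding of "cafe" with an acute accent on the e (bytes 0xC3 0xA9).
my $bytes_in = "caf\xC3\xA9";

# Decode at the boundary: Perl now holds characters, not bytes.
my $text       = decode('UTF-8', $bytes_in);
my $char_count = length $text;        # counts characters

# Work on characters inside the program.
my $upper = uc $text;

# Encode at the boundary before handing data back to the outside world.
my $bytes_out  = encode('UTF-8', $upper);
my $byte_count = length $bytes_out;   # counts bytes

print "characters: $char_count, bytes out: $byte_count\n";
```

Inside the program the accented letter is one character; only at the edges does it become two bytes.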



Background

I've brought this up a time or two in the past, but a number of things have 
happened lately to make me think that it was again time:

First, on the DBD::Pg list, we've been having a discussion about improving the 
DBD::Pg encoding interface.

   http://www.nntp.perl.org/group/perl.dbd.pg/2011/07/msg603.html

That design discussion followed on the extended discussion in this bug report:

   https://rt.cpan.org/Ticket/Display.html?id=40199

Seems that the pg_enable_utf8 flag that's been in DBD::Pg for a long time is 
rather broken in a few ways. Notably, PostgreSQL sends *all* data back to 
clients in a single encoding -- even binary data (which is usually 
hex-encoded). So it made no sense to only decode certain columns. How to go 
about fixing it, though, and adding a useful interface, has proven a bit tricky.

Then there was Tom Christiansen's StackOverflow comment:

   stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default/6163129#6163129

I bow to Tom's experience, but I'm still not sure how that applies to DBI. 
So long as the interface between the database and Perl always encodes 
and decodes, the issues Tom describes are all Perl ones, no?

This made me realize that Unicode handling is much trickier than I ever 
realized. But it also emphasized for me how important it is to do everything one 
can to do Unicode right. Tom followed up with a *lot* more detail in three 
OSCON presentations this year, all of which you can read here:

   http://98.245.80.27/tcpc/OSCON2011/index.html

(You're likely gonna want to install the fonts linked at the bottom of that 
page before you read the presentations in HTML).

And finally, I ran into an issue recently with Oracle, where we have an Oracle 
database that should have only UTF-8 data but some row values are actually in 
other encodings. This was a problem because I told DBD::Oracle that the 
encoding was Unicode, and it just blindly turned on the Perl utf8 flag. So I 
got broken data back from the database and then my app crashed when I tried to 
act on a string with the utf8 flag on but containing non-unicode bytes. I 
reported this issue in a DBD::Oracle bug report:

   https://rt.cpan.org/Public/Bug/Display.html?id=70819

Surely Oracle should return the data encoded as you asked for it, and if 
it did not, Oracle is broken.
I'd still like to see this case, and then we can see if Oracle is broken 
and if there is a fix for it.


In some places DBD::Oracle does sv_utf8_decode(scalar) or 
SvUTF8_on(scalar) (depending on your Perl) and in some places it just 
does SvUTF8_on(scalar). I believe the latter is much quicker as the data 
is not checked. Many people (myself included) are particularly 
interested in DBD::Oracle being fast and if all the occurrences were 
changed to decode I'd patch that out in my copy as I know the data I 
receive is UTF-8 encoded.
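
The difference between the two paths can be seen from Perl space. This is a sketch, not DBD::Oracle's code: utf8::decode() is the Perl-level counterpart of sv_utf8_decode(), and Encode::_utf8_on() is used here as a stand-in for SvUTF8_on(). The sample string is an assumption.

```perl
use strict;
use warnings;
use Encode ();

# "\xE9" alone is Latin-1 for e-acute and is not a valid UTF-8 sequence.
my $bad = "caf\xE9";

# Checked path: utf8::decode() validates the bytes. On failure it returns
# false and leaves the string as plain bytes with the flag off.
my $checked      = $bad;
my $decoded_ok   = utf8::decode($checked);
my $checked_flag = utf8::is_utf8($checked);

# Unchecked path: Encode::_utf8_on() only sets the flag, with no validation.
my $blind = $bad;
Encode::_utf8_on($blind);
my $blind_flag  = utf8::is_utf8($blind);
my $blind_valid = utf8::valid($blind);

printf "checked: decoded=%d flag=%d\n", $decoded_ok ? 1 : 0, $checked_flag ? 1 : 0;
printf "blind:   flag=%d valid=%d\n",   $blind_flag ? 1 : 0, $blind_valid ? 1 : 0;
```

The unchecked path skips the scan of the data, which is where the speed comes from, and it is also how a flagged but malformed string like the one in the bug report can reach the application.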



But all this together leads me to believe that it's time to examine adding 
explicit Unicode support to the DBI. But it needs to be designed as carefully 
as possible to account for a few key points:

* The API must be as straightforward as possible without sacrificing necessary 
flexibility. I think it should mostly stay out of users' way and have 
reasonable defaults. But it should be clear what each knob we offer does and 
how it affects things. Side-effects should be avoided.

* Ability to enforce the correctness of encoding and decoding must be given 
priority. Perl has pretty specific ideas about what is and is not Unicode, so we 
should respect that as much as possible. If that means encoding and decoding 
rather than just flipping the utf8 bit, then fine.

See above. I'd like the chance to go with speed and take the 
consequences rather than go slower but know that incorrect UTF-8 is spotted.



* The performance impact must be kept as minimal as possible. So if we can get away with 
just flipping the UTF-8 bit on and off, it should be so. I'm not entirely clear on that, 
though, since Perl's internal representation, called utf8, is not the same 
thing as UTF-8. But if there's an efficient way to convert between the two, then it 
should be adopted. For other encodings, obviously a full encode/decode path must be 
followed.

I thought UTF-8 when used in Perl used the strict definition and utf8 
used Perl's looser definition - see 
http://search.cpan.org/~dankogai/Encode-2.44/Encode.pm#UTF-8_vs._utf8_vs._UTF8
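
A small sketch of that distinction as Encode exposes it. The encoding names are the ones the linked Encode section documents; the test code point and the variable names are assumptions chosen for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode find_encoding FB_CROAK LEAVE_SRC);

# Encode resolves the two spellings to different encodings.
my $strict_name = find_encoding('UTF-8')->name;   # strict definition
my $loose_name  = find_encoding('utf8')->name;    # Perl's looser definition

# A code point above U+10FFFF: representable in Perl, outside Unicode.
my $beyond = chr(0x110000);

# With FB_CROAK, encode() dies on data the encoding refuses; LEAVE_SRC
# keeps the source string untouched. eval turns the outcome into a flag.
my $loose_ok  = eval { encode('utf8',  $beyond, FB_CROAK | LEAVE_SRC); 1 } || 0;
my $strict_ok = eval { encode('UTF-8', $beyond, FB_CROAK | LEAVE_SRC); 1 } || 0;

print "UTF-8 resolves to $strict_name, accepts the code point: $strict_ok\n";
print "utf8  resolves to $loose_name, accepts the code point: $loose_ok\n";
```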

* Drivers must be able to adopt the API in a straight-forward way. That is to 
say, we need to 

Re: Add Unicode Support to the DBI

2011-09-10 Thread Lyle

On 10/09/2011 04:06, Greg Sabino Mullane wrote:

Right now 5.8 is the required minimum for DBI: should we consider bumping this?


I know a lot of servers in the wild are still running RHEL5 and its 
variants, which are stuck on 5.8 in the standard package management. The 
new RHEL6 only has 5.10...

So at this time the impact of such a change could be significant.


Lyle