Re: Unicode and Sybase univarchar
Alexander Foken wrote: On 04.06.2010 15:41, Dave Rolsky wrote: On Fri, 4 Jun 2010, Alexander Foken wrote: That's why I proposed to switch to DBD::ODBC: It is well tested and supports Unicode as good as the ODBC driver does. And as I said in private email, that's not really feasible for us. Yes, it would cause you a lot of work, starting by distributing an ODBC manager and an ODBC driver. Perhaps too much work. Think of it as a last resort. But for patching Unicode support into DBD::Sybase, DBD::ODBC could be helpful, both by showing how to patch DBD::Sybase, and by providing a Unicode-aware way to the database on the development machine. Now you just need to find someone who is willing and has the time to patch DBD::Sybase ... ;-) Which is why I'm hoping we can pay Michael to work on this. Throwing money at the problem is definitly a good idea! And it would be great if that work would end on CPAN (and not in a secret private branch). Clearly, he's the most qualified. Otherwise, I might have to do this, which is a scary thought. Well, I also was scared by DBD::ODBC, and I must admit that I still don't understand all of it. I'm still happy that MJE took over the stalled DBD::ODBC development and improved my patch (and many other parts of DBD::ODBC). A first step to patching DBD::Sybase would be like my first step with DBD::ODBC: Just make binding input parameters and fetching results work with Unicode. Don't care about Unicode in the SQL statement (use placeholders, so you don't need Unicode there), don't care about binding output parameters, connect strings, user names, passwords, private functions. Steal the Unicode tests from DBD::ODBC and make DBD::Sybase pass those tests without breaking the existing tests. Alexander Just a little note on copying the unicode tests in DBD::ODBC. There is nothing in particular wrong with them but they can cause some drivers problems because they use functions on parameters and sometimes the driver cannot (or fails) to work out the correct type of the parameter: e.g., SELECT ? as roundtrip, length(?) as roundtriplen Some drivers (MS SQL Server in particular) attempts to rearrange the above SQL to a select on 2 columns in the database to work out the parameter types. In the above case there are no real database columns and it fails. Martin -- Martin J. Evans Easysoft Limited http://www.easysoft.com
Re: Unicode and Sybase univarchar
On 04.06.2010 15:41, Dave Rolsky wrote: On Fri, 4 Jun 2010, Alexander Foken wrote: That's why I proposed to switch to DBD::ODBC: It is well tested and supports Unicode as good as the ODBC driver does. And as I said in private email, that's not really feasible for us. Yes, it would cause you a lot of work, starting by distributing an ODBC manager and an ODBC driver. Perhaps too much work. Think of it as a last resort. But for patching Unicode support into DBD::Sybase, DBD::ODBC could be helpful, both by showing how to patch DBD::Sybase, and by providing a Unicode-aware way to the database on the development machine. Now you just need to find someone who is willing and has the time to patch DBD::Sybase ... ;-) Which is why I'm hoping we can pay Michael to work on this. Throwing money at the problem is definitly a good idea! And it would be great if that work would end on CPAN (and not in a secret private branch). Clearly, he's the most qualified. Otherwise, I might have to do this, which is a scary thought. Well, I also was scared by DBD::ODBC, and I must admit that I still don't understand all of it. I'm still happy that MJE took over the stalled DBD::ODBC development and improved my patch (and many other parts of DBD::ODBC). A first step to patching DBD::Sybase would be like my first step with DBD::ODBC: Just make binding input parameters and fetching results work with Unicode. Don't care about Unicode in the SQL statement (use placeholders, so you don't need Unicode there), don't care about binding output parameters, connect strings, user names, passwords, private functions. Steal the Unicode tests from DBD::ODBC and make DBD::Sybase pass those tests without breaking the existing tests. Alexander -- Alexander Foken mailto:alexan...@foken.de http://www.foken.de/alexander/
Re: Unicode and Sybase univarchar
... should just work. It doesn't quite, because the hex string is not just a dump of a 16 Bit Unicode encoding, but it is a UTF-8 byte stream written with a 16 Bit Hex Format for each byte. Each and every 16-Bit-Word has its most significant byte set to 0. If it was a dump of a 16 Bit Unicode encoding, it should read 0065006d00200064006100730068003a00202014 and not 0065006d00200064006100730068003a002000e200800094 Your call to decode() compensates that, probably because you encoded once too much before writing the data into the database. I did not encode at all. I simply created a utf8 string in Perl land and inserted it into Sybase. And Sybase / DBD::Sybase happily ignored Perl's UTF8 flag and stored the byte stream as if it were characters. Really, DBD::Sybase needs to handle any character set translation, not the end user. Right. (But remember that DBI was there before Unicode support was added to Perl, and also most DBDs are older that the Unicode support. Before Unicode was there, you just passed bytes around and everything just worked.) For the same reason, but with a different DBD, I hacked the first Unicode patch for DBD::ODBC, just enough code to have proper Unicode support in bind parameters from Perl to the database, and in returned fetch values. DBD::ODBC has been improved since the patch was merged by Martin J. Evans into v1.14, and it now supports Unicode in many other places, too. The DBI API was also improved during that process, allowing to have Unicode SQL query strings. That's why I proposed to switch to DBD::ODBC: It is well tested and supports Unicode as good as the ODBC driver does. The raw patch is still available at http://www.alexander-foken.de/unicode-patch.txt.gz. DBD::Sybase obviously lacks such a patch. Sybase may have a Unicode API, but no part of DBD::Sybase uses it (properly). In DBD::ODBC, the patch just tests the UTF8 flag for all relevant data coming from Perl and converts UTF8 to the UCS2 encoding required by the ODBC API, and converts UCS2 encoded data back to UTF8, settings the UTF8 flag when needed. Most of the new test just tests that all conversions work, even inside the database. The length check is very primitive, but a good indicator to test if the database saw bytes (length too large) or characters (length just right). The original patch has some minor problems in the test code with ancient perls that were fixed during DBD::ODBC development, see DBD::ODBC Changes. Now you just need to find someone who is willing and has the time to patch DBD::Sybase ... ;-) The new tests should work nearly unmodified with a properly patched DBD::Sybase. Alexander
Re: Unicode and Sybase univarchar
On Fri, 4 Jun 2010, Alexander Foken wrote: Right. (But remember that DBI was there before Unicode support was added to Perl, and also most DBDs are older that the Unicode support. Before Unicode was there, you just passed bytes around and everything just worked.) Only if by everything you exclude any and all operations that need to be character aware, such as length, substr, regexes, encoding, etc. That's why I proposed to switch to DBD::ODBC: It is well tested and supports Unicode as good as the ODBC driver does. And as I said in private email, that's not really feasible for us. Now you just need to find someone who is willing and has the time to patch DBD::Sybase ... ;-) Which is why I'm hoping we can pay Michael to work on this. Clearly, he's the most qualified. Otherwise, I might have to do this, which is a scary thought. -dave /* http://VegGuide.org http://blog.urth.org Your guide to all that's veg House Absolute(ly Pointless) */
Unicode and Sybase univarchar
I'm working on an i18n project, and we use Sybase (sigh). Newer versions of Sybase have built-in support for Unicode with the univarchar (and other uni*) type. However, it seems like DBD::Sybase doesn't have any support for this. Specifically, if I take a Perl unicode string (utf8 flag is on) and insert it in a univarchar column, it seems to be inserted as raw bytes (or something). What's really bizarre is that when I select the value back I get something like 0065006d00200064006100730068003a002000e200800094. Yes, that's a literal string containing a series of 2-digit hex numbers! I can translate this back to Perl unicode with this madness: my $chars = do { use bytes; join q{}, map { chr( eval '0x' . $_ ) } $fromdb =~ /()/g; }; my $unicode = decode( 'utf8', $chars ); So the data is there, but not in a very usable form. Has anyone researched or solved this problem? Michael Peppler, if you're reading this, is there any work on supporting Perl's unicode format transparently in DBD::Sybase? My employer might be able to pay to have this work done, if you're interested. Alternately, maybe you could give me some hints and I could try to figure it out. -dave /* http://VegGuide.org http://blog.urth.org Your guide to all that's veg House Absolute(ly Pointless) */
Re: Unicode and Sybase univarchar
Hi, Which version of Sybase, which version of Sybase OpenClient, and which version of DBD::Sybase? Are you setting the connection charset to utf8 (in the connect() call?) Thanks, Michael On Jun 3, 2010, at 5:22 PM, Dave Rolsky wrote: I'm working on an i18n project, and we use Sybase (sigh). Newer versions of Sybase have built-in support for Unicode with the univarchar (and other uni*) type. However, it seems like DBD::Sybase doesn't have any support for this. Specifically, if I take a Perl unicode string (utf8 flag is on) and insert it in a univarchar column, it seems to be inserted as raw bytes (or something). What's really bizarre is that when I select the value back I get something like 0065006d00200064006100730068003a002000e200800094. Yes, that's a literal string containing a series of 2-digit hex numbers! I can translate this back to Perl unicode with this madness: my $chars = do { use bytes; join q{}, map { chr( eval '0x' . $_ ) } $fromdb =~ /()/g; }; my $unicode = decode( 'utf8', $chars ); So the data is there, but not in a very usable form. Has anyone researched or solved this problem? Michael Peppler, if you're reading this, is there any work on supporting Perl's unicode format transparently in DBD::Sybase? My employer might be able to pay to have this work done, if you're interested. Alternately, maybe you could give me some hints and I could try to figure it out. -dave /* http://VegGuide.org http://blog.urth.org Your guide to all that's veg House Absolute(ly Pointless) */ -- Michael Peppler Sybase on Linux FAQ: http://www.peppler.org/FAQ/linux.html A successful [software] tool is one that was used to do something undreamed of by its author. -- S. C. Johnson
Re: Unicode and Sybase univarchar
On Jun 3, 2010, at 7:29 PM, Michael Peppler wrote: Hi, Which version of Sybase, which version of Sybase OpenClient, and which version of DBD::Sybase? Are you setting the connection charset to utf8 (in the connect() call?) I just gave this a try - I'm under linux, with ASE 15.5. I created a table with a univarchar column, entered some data via isql, then wrote a minimal perl script to fetch the data. If I use a UTF8 locale (i.e. LANG=en_us.UTF8) I get the correct output. If I don't I do not get the correct output, at least for rows where non-ascii data has been entered into the table. I'm using DBD::Sybase 1.10. Michael On Jun 3, 2010, at 5:22 PM, Dave Rolsky wrote: I'm working on an i18n project, and we use Sybase (sigh). Newer versions of Sybase have built-in support for Unicode with the univarchar (and other uni*) type. However, it seems like DBD::Sybase doesn't have any support for this. Specifically, if I take a Perl unicode string (utf8 flag is on) and insert it in a univarchar column, it seems to be inserted as raw bytes (or something). What's really bizarre is that when I select the value back I get something like 0065006d00200064006100730068003a002000e200800094. Yes, that's a literal string containing a series of 2-digit hex numbers! I can translate this back to Perl unicode with this madness: my $chars = do { use bytes; join q{}, map { chr( eval '0x' . $_ ) } $fromdb =~ /()/g; }; my $unicode = decode( 'utf8', $chars ); So the data is there, but not in a very usable form. Has anyone researched or solved this problem? Michael Peppler, if you're reading this, is there any work on supporting Perl's unicode format transparently in DBD::Sybase? My employer might be able to pay to have this work done, if you're interested. Alternately, maybe you could give me some hints and I could try to figure it out. -dave /* http://VegGuide.org http://blog.urth.org Your guide to all that's veg House Absolute(ly Pointless) */ -- Michael Peppler Sybase on Linux FAQ: http://www.peppler.org/FAQ/linux.html A successful [software] tool is one that was used to do something undreamed of by its author. -- S. C. Johnson -- Michael Peppler Sybase on Linux FAQ: http://www.peppler.org/FAQ/linux.html A successful [software] tool is one that was used to do something undreamed of by its author. -- S. C. Johnson
Re: Unicode and Sybase univarchar
On Thu, 3 Jun 2010, Michael Peppler wrote: I just gave this a try - I'm under linux, with ASE 15.5. I created a table with a univarchar column, entered some data via isql, then wrote a minimal perl script to fetch the data. If I use a UTF8 locale (i.e. LANG=en_us.UTF8) I get the correct output. If I don't I do not get the correct output, at least for rows where non-ascii data has been entered into the table. Define correct output here. From looking at the DBD::Sybase code, I don't see how it could possibly be right, because there's nothing in there to set the utf8 flag on the Perl string when the data is retrieved. So even if I can work around the bizarro bytes as a string issue, I still won't have utf8 after the round trip. My test script did something like this: my $dbh = ...; # set charset to utf8 round_trip(em dash: \x{2014}); sub round_trip { my $unicode = shift; $dbh-do('DELETE FROM unicode'); check($unicode); $dbh-do( 'INSERT INTO unicode (utest) VALUES (?)', {}, $unicode ); my $rows = $dbh-selectall_arrayref('SELECT * FROM unicode'); my $fromdb = $rows-[0][0]; check($fromdb); my $chars = do { use bytes; join q{}, map { chr( eval '0x' . $_ ) } $fromdb =~ /()/g; }; check($chars); my $decoded = decode( 'utf-8', $chars ); check($decoded); } sub check { my $string = shift; print $string is utf8? , ( Encode::is_utf8($string) ? 'yes' : 'no' ), \n; } __END__ CREATE TABLE unicode ( utest univarchar(250) NOT NULL, ); -dave /* http://VegGuide.org http://blog.urth.org Your guide to all that's veg House Absolute(ly Pointless) */
Re: Unicode and Sybase univarchar
On Thu, 3 Jun 2010, Michael Peppler wrote: Which version of Sybase, which version of Sybase OpenClient, and which version of DBD::Sybase? Ah, I was using the old libs (11.0), which may have been the problem. I was also using DBD::Sybase 1.07. I switch to Sybase 15.0 (OCS 15.0 if that makes sense), OpenClient 15.0 libs, and DBD::Sybase 1.10. Now it's closer to working. If I set charset=utf8 in the dsn, I get 2010/06/03 14:08:11 unicode CRITICAL: FATAL: DBD::Sybase::db do failed: Server message number=2402 severity=16 state=1 line=1 server=HDATADEV1 text=Error converting characters into server's character set. Some character(s) could not be converted. I'm not sure what that means. If I _don't_ set that, the data goes in and comes out as bytes, rather than the bizarro hex string. However, the data does have the utf8 flag set when it comes back from Sybase, so I have to run it through Encode::decode. I really don't think I can realistically tell the bazillion developers here just run all the data through Encode. I'd really like see an end-to-end solution. Also, it's not clear to me that the data is actually being stored as characters at the Sybase level. I'm not even sure how I'd figure this out. When I do a select from sqsh, I see the wacky hex string, but I can't tell if that's Sybase trying to present data to me in a format it thinks my environment can handle. I did try setting LC_ALL=us_english.utf8 when running sqsh, but that didn't make a difference. Basically, what I need is to be able to take Perl native unicode strings, store them in Sybase in Sybase's native format (utf16, I believe), and then retrieve them as Perl native unicode strings again. -dave /* http://VegGuide.org http://blog.urth.org Your guide to all that's veg House Absolute(ly Pointless) */
Re: Unicode and Sybase univarchar
On Thu, 3 Jun 2010, Dave Rolsky wrote: If I _don't_ set that, the data goes in and comes out as bytes, rather than the bizarro hex string. However, the data does have the utf8 flag set when it comes back from Sybase, so I have to run it through Encode::decode. Doh, the data _does not_ have the utf8 flag set. That's what I meant to say. -dave /* http://VegGuide.org http://blog.urth.org Your guide to all that's veg House Absolute(ly Pointless) */
Re: Unicode and Sybase univarchar
On Jun 3, 2010, at 8:14 PM, Dave Rolsky wrote: On Thu, 3 Jun 2010, Michael Peppler wrote: Which version of Sybase, which version of Sybase OpenClient, and which version of DBD::Sybase? Ah, I was using the old libs (11.0), which may have been the problem. I was also using DBD::Sybase 1.07. I switch to Sybase 15.0 (OCS 15.0 if that makes sense), OpenClient 15.0 libs, and DBD::Sybase 1.10. Now it's closer to working. If I set charset=utf8 in the dsn, I get 2010/06/03 14:08:11 unicode CRITICAL: FATAL: DBD::Sybase::db do failed: Server message number=2402 severity=16 state=1 line=1 server=HDATADEV1 text=Error converting characters into server's character set. Some character(s) could not be converted. I'm not sure what that means. Hmmm - is that on a query, or on an insert operation? If I _don't_ set that, the data goes in and comes out as bytes, rather than the bizarro hex string. However, the data does have the utf8 flag set when it comes back from Sybase, so I have to run it through Encode::decode. I really don't think I can realistically tell the bazillion developers here just run all the data through Encode. I'd really like see an end-to-end solution. I agree - I've just not had much opportunity (or requests) to ensure that this works 100%. Ideally if you could send me some sample code, and a simple table structure and data that reproduces the problem for you I could try to look at it and see if I can fix it. Also, it's not clear to me that the data is actually being stored as characters at the Sybase level. I'm not even sure how I'd figure this out. When I do a select from sqsh, I see the wacky hex string, but I can't tell if that's Sybase trying to present data to me in a format it thinks my environment can handle. When in doubt - use the Sybase tools (i.e. isql, and use -Jutf8 to force conversion to/from utf8 when reading/writing the data). You can also do a convert(varbinary, the_univarchar_col) to see the hex representation of the data in the database. Basically, what I need is to be able to take Perl native unicode strings, store them in Sybase in Sybase's native format (utf16, I believe), and then retrieve them as Perl native unicode strings again. If you store data in univarchar, etc columns, then it's utf16. You can also set up the dataserver to use utf8 throughout - but that's something you'd have to discuss with your DBAs. Michael -- Michael Peppler Sybase on Linux FAQ: http://www.peppler.org/FAQ/linux.html A successful [software] tool is one that was used to do something undreamed of by its author. -- S. C. Johnson
Re: Unicode and Sybase univarchar
On Thu, 3 Jun 2010, Michael Peppler wrote: 2010/06/03 14:08:11 unicode CRITICAL: FATAL: DBD::Sybase::db do failed: Server message number=2402 severity=16 state=1 line=1 server=HDATADEV1 text=Error converting characters into server's character set. Some character(s) could not be converted. I'm not sure what that means. Hmmm - is that on a query, or on an insert operation? That was from an insert. If I _don't_ set that, the data goes in and comes out as bytes, rather than the bizarro hex string. However, the data does have the utf8 flag set when it comes back from Sybase, so I have to run it through Encode::decode. I really don't think I can realistically tell the bazillion developers here just run all the data through Encode. I'd really like see an end-to-end solution. I agree - I've just not had much opportunity (or requests) to ensure that this works 100%. Ideally if you could send me some sample code, and a simple table structure and data that reproduces the problem for you I could try to look at it and see if I can fix it. See my previous email. I included some code, and there was a table definition after the __END__ marker. As I said, my employer might be willing to pay to have this done. If that's something you're interested, let's talk off list and I can try to help coordinate that. Also, it's not clear to me that the data is actually being stored as characters at the Sybase level. I'm not even sure how I'd figure this out. When I do a select from sqsh, I see the wacky hex string, but I can't tell if that's Sybase trying to present data to me in a format it thinks my environment can handle. When in doubt - use the Sybase tools (i.e. isql, and use -Jutf8 to force conversion to/from utf8 when reading/writing the data). Unfortunately, from the terminal on my work dev machine, this gives me the hex string. I think something is being helpful here ;) I fiddled with -Jutf8 and setting LC_ALL, to no avail. -dave /* http://VegGuide.org http://blog.urth.org Your guide to all that's veg House Absolute(ly Pointless) */
Re: Unicode and Sybase univarchar
Sorry, forgot to CC to the list ... Original Message Subject:Re: Unicode and Sybase univarchar Date: Thu, 03 Jun 2010 22:20:37 +0200 From: Alexander Foken alexan...@foken.de To: Dave Rolsky auta...@urth.org References: alpine.deb.0..1006031018000.9...@urth.org On 03.06.2010 17:22, Dave Rolsky wrote: What's really bizarre is that when I select the value back I get something like 0065006d00200064006100730068003a002000e200800094. Yes, that's a literal string containing a series of 2-digit hex numbers! I can translate this back to Perl unicode with this madness: my $chars = do { use bytes; join q{}, map { chr( eval '0x' . $_ ) } $fromdb =~ /()/g; }; my $unicode = decode( 'utf8', $chars ); Really strange way to avoid pack()/unpack(). At least, you can get rid of the evil string-eval, simply use hex($_). The combination of use bytes, chr() with an argument larger than 0x00FF, and decode() also looks very strange. $fromdb contains only hex digits, so there should be no need to use bytes. Unless you force byte mode, chr() should already return perfect Unicode, flagged as such. So ... my $unicode=join('',map { chr hex $_ } $fromdb=~/([0-9a-fA-F]{4})/g); ... should just work. It doesn't quite, because the hex string is not just a dump of a 16 Bit Unicode encoding, but it is a UTF-8 byte stream written with a 16 Bit Hex Format for each byte. Each and every 16-Bit-Word has its most significant byte set to 0. If it was a dump of a 16 Bit Unicode encoding, it should read 0065006d00200064006100730068003a00202014 and not 0065006d00200064006100730068003a002000e200800094 Your call to decode() compensates that, probably because you encoded once too much before writing the data into the database. Appart from that, have a look at the tests in DBD::Oracle, there are a few tests for Unicode round trips in 40UnicodeRoundTrip.t and 41Unicode.t, try to run them on DBD::Sybase. Also consider using DBD::ODBC, a Unicode capable ODBC manager (the one(s) on Windows is/are fine) and an Unicode capable ODBC driver for Sybase. It may cost you a few CPU cycles for the extra layers, but DBD::ODBC supports Unicode quite well (on non-Windows, you need to explicitly enable Unicode). Alexander -- Alexander Foken mailto:alexan...@foken.de http://www.foken.de/alexander/
Re: Unicode and Sybase univarchar
On Thu, 3 Jun 2010, Alexander Foken wrote: Really strange way to avoid pack()/unpack(). At least, you can get rid of the Yes, yes, this is hack code, not production. ... should just work. It doesn't quite, because the hex string is not just a dump of a 16 Bit Unicode encoding, but it is a UTF-8 byte stream written with a 16 Bit Hex Format for each byte. Each and every 16-Bit-Word has its most significant byte set to 0. If it was a dump of a 16 Bit Unicode encoding, it should read 0065006d00200064006100730068003a00202014 and not 0065006d00200064006100730068003a002000e200800094 Your call to decode() compensates that, probably because you encoded once too much before writing the data into the database. I did not encode at all. I simply created a utf8 string in Perl land and inserted it into Sybase. Really, DBD::Sybase needs to handle any character set translation, not the end user. -dave /* http://VegGuide.org http://blog.urth.org Your guide to all that's veg House Absolute(ly Pointless) */