On 30.06.2007 17:31, Martin J. Evans wrote:
Alexander/Gabor,

I've reduced this reply to dbi-dev only for now.
(I currently do NOT read that list.)

[...]
I've got a lot of problems attempting to make this work for UNIX, not least of which is that wchar_t on UNIX is typically 4 bytes while the ODBC API only really does UCS-2 (2 bytes) - this makes wcslen etc. rather useless. Then there are the additional issues of the lack of Unicode ODBC drivers for UNIX and of the ODBC driver manager on UNIX (IBM have a UCS-2-handling ODBC driver for UNIX - but see http://publib.boulder.ibm.com/infocenter/db2luw/v8/index.jsp?topic=/com.ibm.db2.udb.doc/ad/c0011522.htm). At this point in time I don't believe anyone is using SQL_Wxxx types on UNIX via ODBC, but I'm prepared to be proved wrong. The problem is also that there is no definitive definition of what Unicode in ODBC on UNIX means. If it is (as would seem to be the only sensible thing for ODBC) taken as UCS-2, then it is simply a matter of converting between UCS-2 (in ODBC) and UTF-8 (in Perl) - any pointers from anyone here on how to do that would be appreciated.
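Since Martin asks for pointers on the conversion: a minimal sketch in C, assuming the driver really does hand back UCS-2 code units (BMP only, so no surrogate handling). The function name and signature are illustrative, not from the patch; in real XS code one would more likely lean on Perl's own UTF-8 routines (e.g. sv_utf8_decode and the helpers in utf8.h) rather than hand-rolling this.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: convert n UCS-2 code units to UTF-8.
 * Returns the number of UTF-8 bytes written; `out` must have room
 * for up to 3 bytes per input code unit. */
static size_t ucs2_to_utf8(const unsigned short *in, size_t n,
                           unsigned char *out)
{
    size_t i, o = 0;
    for (i = 0; i < n; i++) {
        unsigned short c = in[i];
        if (c < 0x80) {                         /* 1 byte: ASCII */
            out[o++] = (unsigned char)c;
        } else if (c < 0x800) {                 /* 2 bytes */
            out[o++] = (unsigned char)(0xC0 | (c >> 6));
            out[o++] = (unsigned char)(0x80 | (c & 0x3F));
        } else {                                /* 3 bytes */
            out[o++] = (unsigned char)(0xE0 | (c >> 12));
            out[o++] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    return o;
}
```

The reverse direction (UTF-8 to UCS-2, for bound input parameters) is the mirror image, with the added wrinkle that UTF-8 sequences encoding code points above U+FFFF simply do not fit in UCS-2.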

As for UTF-8, I could never see how this could ever be done with the ODBC API (on any platform), as the API uses counts of characters in places but expects buffers sized in bytes - e.g. if it reports that a column is 20 characters in size, how can you tell how many bytes of space you need for it? Then there are loads of places where it says that if something is a Unicode string, the buffer size must be a multiple of 2, etc.
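Martin's sizing problem can be made concrete: if the driver reports a width in characters and the buffer must hold UTF-8, the only safe allocation is the worst case. This sketch (the helper name is mine, not from DBD::ODBC) assumes 4 bytes per character, the UTF-8 maximum, plus a terminator:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: bytes needed for a UTF-8 buffer when the ODBC
 * driver only tells us the column width in characters. A BMP character
 * needs up to 3 UTF-8 bytes, a supplementary-plane character up to 4,
 * so we must assume 4 to be safe. */
static size_t utf8_buffer_bytes(size_t column_chars)
{
    return column_chars * 4 + 1;    /* worst case, plus NUL terminator */
}
```

So the 20-character column from the example needs an 81-byte buffer even though the actual data might fit in 21 bytes, which is exactly the kind of guesswork the byte-multiple rules in the wide API are trying to avoid.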
Without searching and reading the UNIX ODBC docs: are you sure the ODBC API expects UCS-2 and not UTF-16? If we are talking about UTF-16, a character may need two or FOUR bytes. (See <http://unicode.org/faq/basic_q.html#25>)
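The two-or-four-byte point can be shown with a concrete code point. A sketch (not from the patch) of encoding U+1D11E, MUSICAL SYMBOL G CLEF, as UTF-16: any code point above U+FFFF becomes a surrogate pair of two 2-byte code units, which plain UCS-2 cannot represent at all.

```c
#include <assert.h>

/* Encode a supplementary-plane code point (above U+FFFF) as a
 * UTF-16 surrogate pair: two 16-bit code units, i.e. 4 bytes. */
static void utf16_encode_supplementary(unsigned long cp,
                                       unsigned short out[2])
{
    cp -= 0x10000;
    out[0] = (unsigned short)(0xD800 + (cp >> 10));    /* high surrogate */
    out[1] = (unsigned short)(0xDC00 + (cp & 0x3FF));  /* low surrogate  */
}
```

A driver that really speaks UTF-16 will hand back such pairs, and code that assumes one code unit per character (as UCS-2 does) will miscount lengths for them.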


As it would seem a number of people are using your patch for Windows currently, I've integrated it into DBD::ODBC with the following conditions:

1. all the SQL_Wxxx C code is conditional on compilation on Windows, i.e. inside #ifdef WIN32, with the exception of a few harmless places which convert a SQL type to a string, etc.
That looks like a good idea.

2. all your new unicode tests are only run on Windows, skipped for other platforms.
logical consequence.

3. There are a few aspects of the patch I am unsure about and ideally I'd like a comment on them:

a)
/* MS SQL returns bytes, Oracle returns characters ... */
fbh->ColLength*=sizeof(WCHAR);
fbh->ColDisplaySize = DBIc_LongReadLen(imp_sth)+1;

The comment seems to suggest a difference between the two, but I don't see a code difference. It looks as though the code agrees with the comment as far as SQL Server goes, but not for Oracle.
Right, there is no code difference. The code avoids wild guessing about which unit (bytes or Unicode characters) the ODBC driver counts ColLength in, and instead always assumes whichever interpretation needs the larger buffer. This wastes memory, especially on Oracle: for a string of n Unicode characters, the code causes an allocation of n*sizeof(WCHAR) bytes on MS SQL, but (n*sizeof(WCHAR))*sizeof(WCHAR) bytes on Oracle. It is a quick and lazy hack to avoid buffer overflows.
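To make the wasted-memory arithmetic concrete, here is a sketch of that defensive sizing (the function name and the WCHAR16 typedef are mine, for illustration only): whatever unit the driver reported in, multiplying by sizeof(WCHAR) guarantees the buffer is never too small.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned short WCHAR16;     /* 2-byte wide char, as on Win32 */

/* Defensive sizing: treat the reported ColLength as if it were a
 * character count, whatever it actually is, and scale up to bytes. */
static size_t defensive_bytes(size_t col_length)
{
    return col_length * sizeof(WCHAR16);
}
```

For a 10-character column: a driver reporting characters (10) yields exactly the 20 bytes needed, while a driver reporting bytes (20) yields 40 bytes, twice what is needed - correct in both cases, wasteful in the second.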


b) In dbd_describe() there is a:
     fbh->ColLength += 1; /* add terminator */
in your patch and I'm unclear why that is required.

This is to make room for a double-byte NUL character at the end of the UTF-16/UCS-2 string. A single NUL byte would not terminate a UTF-16/UCS-2 string. In fact, UTF-16/UCS-2 text is about 50% NUL bytes for German and English texts.

Assume you have these bytes representing a UTF-16 or UCS-2 string:
   00 50 00 65 00 72 00 6C  => "Perl"
Append a single NUL byte, as you would with byte=char semantics, and look at the random junk that happens to be in memory:
   00 50 00 65 00 72 00 6C *00* 3D 00 50 00 48 00 50 00 00 00  => "Perl=PHP\x{0000}"
Appending two NUL bytes would have done the job, but you need to allocate one more byte than with byte=char semantics:
   00 50 00 65 00 72 00 6C *00* *00*  => "Perl\x{0000}"
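A minimal sketch of the two-byte termination in C (the helper and the WCHAR16 typedef are illustrative, not code from the patch): the key point is that the "+1" is one extra *code unit*, not one extra byte, and that the terminator is written as a 16-bit zero.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef unsigned short WCHAR16;     /* 2-byte code unit, as the wide ODBC API expects */

/* Hypothetical helper: copy n UTF-16/UCS-2 code units into a freshly
 * allocated buffer and terminate it with a full 2-byte NUL. */
static WCHAR16 *copy_wstr(const WCHAR16 *src, size_t n)
{
    WCHAR16 *buf = malloc((n + 1) * sizeof(WCHAR16)); /* +1 code unit, not +1 byte */
    if (!buf)
        return NULL;
    memcpy(buf, src, n * sizeof(WCHAR16));
    buf[n] = 0;                     /* both bytes of the terminator are zero */
    return buf;
}
```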

4. I've only currently tested it on Windows with SQL Server and may need to do some tidying up for UNIX.

I can still arrange access to an Oracle database (and MS VC++) licensed to my former employer for testing. Unfortunately, I don't have time for detailed tests. If wget, untar, perl Makefile.PL, make, make test, and mailing the output (all on Win2k/WinXP) is sufficient, I'm willing to help. I could also test against PostgreSQL 8.0 and MS SQL 2000 databases.
5. I've completed the integration work for the code and tests but not the other areas like Changes, README etc as yet.
You should at least make clear that Unicode is not supported everywhere (to be honest, it is only supported for input bind parameters and fetch results). Just adding a link to <http://www.alexander-foken.de/README.unicode-patch.html> should be sufficient for development versions. I'm sure my provider would not mind a few more requests to my tiny webspace, as long as you don't post the link on Slashdot.

Ideally I'd like some comments on (3) first, but then I could commit this to Subversion next week and perhaps some of the people already using your patch could try it out. By the time we get to that stage I hope to be able to come up with Perl equivalents of the mbs2utf8 etc. functions.

Martin
Alexander

--
Alexander Foken
mailto:[EMAIL PROTECTED]  http://www.foken.de/alexander/
