On 30.06.2007 17:31, Martin J. Evans wrote:
Alexander/Gabor,

I've reduced this reply to dbi-dev only for now.
(I currently do NOT read that list.)

[...]
I've got a lot of problems attempting to make this work for UNIX, not least of which is that wchar_t on UNIX is typically 4 bytes while the ODBC API only really does UCS-2 (2 bytes) - this makes wcslen etc. rather useless. Then there are the additional issues of the lack of Unicode ODBC drivers for UNIX and of the ODBC driver manager on UNIX (IBM have a UCS-2-handling ODBC driver for UNIX - but see http://publib.boulder.ibm.com/infocenter/db2luw/v8/index.jsp?topic=/com.ibm.db2.udb.doc/ad/c0011522.htm). At this point in time I don't believe anyone is using SQL_Wxxx types on UNIX via ODBC, but I'm prepared to be proved wrong. The problem is also that there is no definitive definition of what Unicode in ODBC on UNIX means. If it is (as would seem to be the only sensible thing for ODBC) taken as UCS-2, then it is simply a matter of converting between UCS-2 (in ODBC) and UTF-8 (in Perl) - any pointers from anyone here on how to do that would be appreciated.
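Since Martin asks for pointers on the conversion: a minimal sketch in C, assuming the driver really does hand back UCS-2 code units (BMP only, so no surrogate handling). The function name and signature are illustrative, not from the patch; in real XS code one would more likely lean on Perl's own UTF-8 routines (e.g. sv_utf8_decode and the helpers in utf8.h) rather than hand-rolling this.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: convert n UCS-2 code units to UTF-8.
 * Returns the number of UTF-8 bytes written; `out` must have room
 * for up to 3 bytes per input code unit. */
static size_t ucs2_to_utf8(const unsigned short *in, size_t n,
                           unsigned char *out)
{
    size_t i, o = 0;
    for (i = 0; i < n; i++) {
        unsigned short c = in[i];
        if (c < 0x80) {                         /* 1 byte: ASCII */
            out[o++] = (unsigned char)c;
        } else if (c < 0x800) {                 /* 2 bytes */
            out[o++] = (unsigned char)(0xC0 | (c >> 6));
            out[o++] = (unsigned char)(0x80 | (c & 0x3F));
        } else {                                /* 3 bytes */
            out[o++] = (unsigned char)(0xE0 | (c >> 12));
            out[o++] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    return o;
}
```

The reverse direction (UTF-8 to UCS-2, for bound input parameters) is the mirror image, with the added wrinkle that UTF-8 sequences encoding code points above U+FFFF simply do not fit in UCS-2.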

As for UTF-8, I could never see how this could ever be done with the ODBC API (on any platform), as the API uses counts of characters in places but expects buffers sized in bytes - e.g. if it reports that a column is 20 characters in size, how can you tell how many bytes of space you need for it? Then there are loads of places where it says that if something is a Unicode string, the buffer size must be a multiple of 2, etc.
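Martin's sizing problem can be made concrete: if the driver reports a width in characters and the buffer must hold UTF-8, the only safe allocation is the worst case. This sketch (the helper name is mine, not from DBD::ODBC) assumes 4 bytes per character, the UTF-8 maximum, plus a terminator:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: bytes needed for a UTF-8 buffer when the ODBC
 * driver only tells us the column width in characters. A BMP character
 * needs up to 3 UTF-8 bytes, a supplementary-plane character up to 4,
 * so we must assume 4 to be safe. */
static size_t utf8_buffer_bytes(size_t column_chars)
{
    return column_chars * 4 + 1;    /* worst case, plus NUL terminator */
}
```

So the 20-character column from the example needs an 81-byte buffer even though the actual data might fit in 21 bytes, which is exactly the kind of guesswork the byte-multiple rules in the wide API are trying to avoid.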
Without searching and reading the UNIX ODBC docs: are you sure the ODBC API expects UCS-2 and not UTF-16? If we are talking about UTF-16, a character may need two or FOUR bytes. (See <http://unicode.org/faq/basic_q.html#25>)
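The two-or-four-byte point can be shown with a concrete code point. A sketch (not from the patch) of encoding U+1D11E, MUSICAL SYMBOL G CLEF, as UTF-16: any code point above U+FFFF becomes a surrogate pair of two 2-byte code units, which plain UCS-2 cannot represent at all.

```c
#include <assert.h>

/* Encode a supplementary-plane code point (above U+FFFF) as a
 * UTF-16 surrogate pair: two 16-bit code units, i.e. 4 bytes. */
static void utf16_encode_supplementary(unsigned long cp,
                                       unsigned short out[2])
{
    cp -= 0x10000;
    out[0] = (unsigned short)(0xD800 + (cp >> 10));    /* high surrogate */
    out[1] = (unsigned short)(0xDC00 + (cp & 0x3FF));  /* low surrogate  */
}
```

A driver that really speaks UTF-16 will hand back such pairs, and code that assumes one code unit per character (as UCS-2 does) will miscount lengths for them.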


As it would seem a number of people are using your patch for Windows currently, I've integrated it into DBD::ODBC with the following conditions:

1. all the SQL_Wxxx C code is conditional on compilation on Windows, i.e. inside #ifdef WIN32, with the exception of a few harmless places which convert a SQL type to a string, etc.
That looks like a good idea.

2. all your new unicode tests are only run on Windows, skipped for other platforms.
logical consequence.

3. There are a few aspects of the patch I am unsure about and ideally I'd like a comment on them:

a)
/* MS SQL returns bytes, Oracle returns characters ... */
fbh->ColLength*=sizeof(WCHAR);
fbh->ColDisplaySize = DBIc_LongReadLen(imp_sth)+1;

The comment seems to suggest a difference between the two, but I don't see a code difference. It looks as though the code agrees with the comment as far as SQL Server goes, but not for Oracle.
Right, there is no code difference. The code avoids wild guessing about which unit (bytes or Unicode characters) the ODBC driver counts ColLength in, and instead always assumes whichever interpretation needs the larger buffer. This wastes memory, especially on Oracle: for a string of n Unicode characters, the code causes an allocation of n*sizeof(WCHAR) bytes on MS SQL, but (n*sizeof(WCHAR))*sizeof(WCHAR) bytes on Oracle. It is a quick and lazy hack to avoid buffer overflows.
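To make the wasted-memory arithmetic concrete, here is a sketch of that defensive sizing (the function name and the WCHAR16 typedef are mine, for illustration only): whatever unit the driver reported in, multiplying by sizeof(WCHAR) guarantees the buffer is never too small.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned short WCHAR16;     /* 2-byte wide char, as on Win32 */

/* Defensive sizing: treat the reported ColLength as if it were a
 * character count, whatever it actually is, and scale up to bytes. */
static size_t defensive_bytes(size_t col_length)
{
    return col_length * sizeof(WCHAR16);
}
```

For a 10-character column: a driver reporting characters (10) yields exactly the 20 bytes needed, while a driver reporting bytes (20) yields 40 bytes, twice what is needed - correct in both cases, wasteful in the second.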


b) In dbd_describe() there is a:
     fbh->ColLength += 1; /* add terminator */
in your patch and I'm unclear why that is required.

This is to make room for a double-byte NUL character at the end of the UTF-16/UCS-2 string. A single NUL byte would not terminate a UTF-16/UCS-2 string. In fact, UTF-16/UCS-2 text is about 50% NUL bytes for German and English texts.

Assume you have these bytes representing a UTF-16 or UCS-2 string:
   00 50 00 65 00 72 00 6C  => "Perl"
Append a single NUL byte, as you would with byte=char semantics, and look at the random junk that happens to be in memory:
   00 50 00 65 00 72 00 6C *00* 3D 00 50 00 48 00 50 00 00 00  => "Perl=PHP\x{0000}"
Appending two NUL bytes would have done the job, but you need to allocate one more byte than with byte=char semantics:
   00 50 00 65 00 72 00 6C *00* *00*  => "Perl\x{0000}"
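A minimal sketch of the two-byte termination in C (the helper and the WCHAR16 typedef are illustrative, not code from the patch): the key point is that the "+1" is one extra *code unit*, not one extra byte, and that the terminator is written as a 16-bit zero.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef unsigned short WCHAR16;     /* 2-byte code unit, as the wide ODBC API expects */

/* Hypothetical helper: copy n UTF-16/UCS-2 code units into a freshly
 * allocated buffer and terminate it with a full 2-byte NUL. */
static WCHAR16 *copy_wstr(const WCHAR16 *src, size_t n)
{
    WCHAR16 *buf = malloc((n + 1) * sizeof(WCHAR16)); /* +1 code unit, not +1 byte */
    if (!buf)
        return NULL;
    memcpy(buf, src, n * sizeof(WCHAR16));
    buf[n] = 0;                     /* both bytes of the terminator are zero */
    return buf;
}
```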

4. I've only currently tested it on Windows with SQL Server and may need to do some tidying up for UNIX.

I can still arrange access to an Oracle database (and MS VC++) licensed to my former employer for testing. Unfortunately, I don't have time for detailed tests. If wget, untar, perl Makefile.PL, make, make test, and mailing the output (all on Win2k/WinXP) is sufficient, I'm willing to help. I could also test against PostgreSQL 8.0 and MS SQL 2000 databases.
5. I've completed the integration work for the code and tests but not the other areas like Changes, README etc as yet.
You should at least make clear that Unicode is not supported everywhere (to be honest, it is only supported for input bind parameters and fetch results). Just adding a link to <http://www.alexander-foken.de/README.unicode-patch.html> should be sufficient for development versions. I'm sure my provider would not mind a few more requests to my tiny webspace, as long as you don't post the link on Slashdot.

Ideally I'd like some comments on (3) first, but then I could commit this to Subversion next week and perhaps some of the people already using your patch could try it out. By the time we get to that stage I hope to be able to come up with Perl equivalents of the mbs2utf8 etc. functions.

Martin
Alexander

--
Alexander Foken
mailto:[EMAIL PROTECTED]  http://www.foken.de/alexander/
