On 30.06.2007 17:31, Martin J. Evans wrote:
Alexander/Gabor,
I've reduced this reply to dbi-dev only for now.
(I currently do NOT read that list.)
[...]
I've run into a lot of problems attempting to make this work for UNIX, not
least of which is that wchar_t on UNIX is typically 4 bytes while the ODBC
API only really does UCS-2 (2 bytes) - this makes wcslen etc. rather
useless. Then there are the additional issues of the lack of Unicode ODBC
drivers for UNIX and of the ODBC driver manager on UNIX (IBM have a
UCS-2-handling ODBC driver for UNIX - but see
http://publib.boulder.ibm.com/infocenter/db2luw/v8/index.jsp?topic=/com.ibm.db2.udb.doc/ad/c0011522.htm).
At this point in time I don't believe anyone is using SQL_Wxxx
characters on UNIX via ODBC, but I'm prepared to be proved wrong. The
problem is also that there is no definite definition of what Unicode in
ODBC on UNIX means. If it is taken as UCS-2 (which would seem to be the
only sensible thing for ODBC), then it is simply a matter of converting
between UCS-2 (in ODBC) and UTF-8 in Perl - any pointers from anyone here
on how to do that would be appreciated.
As for UTF-8, I could never see how this could ever work with the
ODBC API (on any platform), as the API uses counts of characters in
some places but expects buffers sized in bytes - e.g. if it reports that
a column is 20 characters wide, how can you tell how many bytes of
space you need for it? Then there are plenty of places where it says
that if something is a Unicode string, the buffer size must be a
multiple of 2, etc.
Without searching and reading the UNIX ODBC docs: are you sure the ODBC
API expects UCS-2 and not UTF-16? If we are talking about UTF-16, a
character may need two or FOUR bytes. (See
<http://unicode.org/faq/basic_q.html#25>)
As it would seem a number of people are using your patch for Windows
currently, I've integrated it into DBD::ODBC with the following
conditions:
1. all the SQL_Wxxx C code is conditional on compilation on Windows
i.e. in #ifdef WIN32
with the exception of a few harmless places which convert a SQL type
to a string etc.
That looks like a good idea.
2. all your new unicode tests are only run on Windows, skipped for
other platforms.
A logical consequence.
3. There are a few aspects of the patch I am unsure about and ideally
I'd like a comment on them:
a)
/* MS SQL returns bytes, Oracle returns characters ... */
fbh->ColLength*=sizeof(WCHAR);
fbh->ColDisplaySize = DBIc_LongReadLen(imp_sth)+1;
The comment seems to suggest a difference between the two, but I don't
see a code difference. It looks as though the code agrees with the
comment for SQL Server, but not for Oracle.
Right, there is no code difference. The code avoids wild guessing about
whether the ODBC driver counts bytes or Unicode characters in ColLength,
and instead always assumes the worst case (a count of characters, each
needing sizeof(WCHAR) bytes). This wastes memory on whichever driver
already counts bytes - per the comment, MS SQL. For a string of n Unicode
characters, the code causes an allocation of n*sizeof(WCHAR) bytes on
Oracle, but (n*sizeof(WCHAR))*sizeof(WCHAR) bytes on MS SQL. It is a
quick and lazy hack to avoid buffer overflows.
b) In dbd_describe() there is a:
fbh->ColLength += 1; /* add terminator */
in your patch and I'm unclear why that is required.
This is to make room for a double-byte NUL character at the UTF-16/UCS-2
string end. A single NUL byte would not terminate a UTF-16/UCS-2 string.
In fact, UTF-16/UCS-2 text is about 50% NUL bytes for German and
English texts.
Assume you have these bytes that represent a UTF-16 or UCS-2 string:
00 50 00 65 00 72 00 6C => "Perl"
Append a single NUL byte like you would with byte=char semantics (and
look at the random junk that happens to be in memory):
00 50 00 65 00 72 00 6C *00* 3D 00 50 00 48 00 50 00 00 00 =>
"Perl=PHP\x{0000}"
Appending two NUL bytes would have done the job, but you need to
allocate 1 more byte than with byte=char semantics:
00 50 00 65 00 72 00 6C *00* *00* => "Perl\x{0000}"
4. I've only currently tested it on Windows with SQL Server and may
need to do some tidying up for UNIX.
I can still arrange access to an Oracle database (and MS VC++) licensed
to my former employer for testing. Unfortunately, I don't have time for
detailed tests. If wget, untar, perl Makefile.PL, make, make test, and
mailing the output (all on Win2k/WinXP) is sufficient, I'm willing to
help. I could also test against PostgreSQL 8.0 and MS SQL 2000 databases.
5. I've completed the integration work for the code and tests but not
the other areas like Changes, README etc as yet.
You should at least make clear that Unicode is not supported everywhere
(to be honest, it is only supported for input bind parameters and fetch
results). Just adding a link to
<http://www.alexander-foken.de/README.unicode-patch.html> should be
sufficient for development versions. I'm sure my provider would not mind
a few more requests to my tiny webspace, as long as you don't post that
link on Slashdot.
Ideally I'd like some comments on (3) first but then I could commit
this to subversion next week and perhaps some of the people already
using your patch could try it out. By the time we get to that stage I
hope to be able to come up with Perl equivalents of the mbs2utf8 etc
functions.
Martin
Alexander
--
Alexander Foken
mailto:[EMAIL PROTECTED] http://www.foken.de/alexander/