On Wednesday, February 3, 2016 at 9:48:24 AM UTC-5, Stefan Karpinski wrote:
>
> I've been working through encoding issues with this as well. To connect to
> Microsoft SQL Server, this patch made things work:
>
> diff --git a/src/backend.jl b/src/backend.jl
> index b5f24af..bf4ee11 100644
> --- a/src/backend.jl
> +++ b/src/backend.jl
> @@ -40,7 +40,7 @@ end
>
> # Send query to DMBS
> function ODBCQueryExecute(stmt::Ptr{Void}, querystring::AbstractString)
> - if @FAILED SQLExecDirect(stmt, utf16(querystring))
> + if @FAILED SQLExecDirect(stmt, utf8(querystring))
> ODBCError(SQL_HANDLE_STMT,stmt)
> error("[ODBC]: SQLExecDirect failed; Return Code: $ret")
> end
>
>
> The query string gets passed through to SQLExecDirect:
>
> #SQLExecDirect
> #
> http://msdn.microsoft.com/en-us/library/windows/desktop/ms713611(v=vs.85).aspx
> #Description: executes a preparable statement
> #Status:
> function SQLExecDirect(stmt::Ptr{Void},query::AbstractString)
> @windows_only ret = ccall( (:SQLExecDirect, odbc_dm), stdcall,
> Int16, (Ptr{Void},Ptr{UInt8},Int),
> stmt,query,sizeof(query))
> @unix_only ret = ccall( (:SQLExecDirect, odbc_dm),
> Int16, (Ptr{Void},Ptr{UInt8},Int),
> stmt,query,sizeof(query))
> return ret
> end
>
>
> This function just convert whatever it's argument is to a pointer. Looking
> at the docs
> <https://msdn.microsoft.com/en-us/library/windows/desktop/ms713611(v=vs.85).aspx>
> for
> this function, the signature is this:
>
> SQLRETURN SQLExecDirect(
> SQLHSTMT StatementHandle,
> SQLCHAR * StatementText,
> SQLINTEGER TextLength);
>
> and SQLCHAR is defined
> <https://msdn.microsoft.com/en-us/library/ms714556(v=vs.85).aspx> as
> unsigned char. So this would seem to be a non-wide character string – i.e.
> ASCII or UTF-8. And indeed, that's what the Microsoft SQL driver seems to
> be expecting
>
> The question I have is this: how the heck is this working for other ODBC
> drivers? How are they getting pointers to UTF-16 data and interpreting it
> correctly? The correct fix would seem to be to make this always send UTF-8
> strings. But when I made a PR
> <https://www.google.com/url?q=https%3A%2F%2Fgithub.com%2FJuliaDB%2FODBC.jl%2Fpull%2F71&sa=D&sntz=1&usg=AFQjCNEiuYPGnEMl7BB5dukz6gomNMzo0g>
>
> that did that, it seemed to break other setups.
>
SQLCHAR is for encodings with 8-bit code units. It doesn't imply ASCII or
UTF-8 (probably one of the more common character sets used with that is
actually Microsoft's CP1252, which is often mistakenly described as ANSI
Latin-1 - of which it is a superset).
Even when something says it is UTF-8, it frequently is not *really* valid
UTF-8, for example, there are two common variations of UTF-8, CESU-8, used
by MySQL and others, which encodes any non-BMP code point using the two
UTF-16 surrogate pairs, i.e. to 6 bytes instead of the correct 4-byte UTF-8
sequence, and Java's Modified UTF-8, which is the same as CESU-8, plus
embedded \0s are encoded in a "long" form (0xc0 0x80) so that you can still
use old-fashioned C \0 termination for strings while allowing embedded \0s.
(note: handling those formats, common in databases, was part of the reason
I did the work to rewrite the `convert` functions for strings for Julia).
The reason your PR didn't work is because you need to find out from the
database which character set it is using, and then do the conversions
necessary to go back and forth between that character set and whatever
Unicode you are using in Julia (UTF8String, UTF16String, or UTF32String).
One thing that could help is the following package that @nalimilan has been
working on: https://github.com/nalimilan/StringEncodings.jl)
You should also be able to use the SQLWCHAR API, but that can have it's own
issues, see the following from Oracle:
> NOTE: The mapping of *SQLWCHAR* type is somewhat complicated and it can
> create hidden pitfalls for programmers porting their code from
> Windows to Linux. Usually a *SQLWCHAR* character is a 16-bit
> unit and
> we will not consider the exotic cases when *SQLWCHAR* is
> different.
> Windows uses *UTF-16* and maps *SQLWCHAR* to 16-bit wchar_t type.
> However, many Linux versions such as Ubuntu Linux use *UTF-32* as
> an
> internal character set and therefore their 32-bit wchar_t is not
> compatible with *SQLWCHAR*, which is always 16-bit.
Your working theory
https://github.com/JuliaDB/ODBC.jl/pull/71#issuecomment-175095266 is
incorrect, it is an issue of character encodings used, not platforms. You
see platform differences because the databases you are connecting to have
different default character encodings on different platforms.