Re: [julia-users] load a Julia dataframe from Microsoft SQL Server table

Scott Jones Wed, 03 Feb 2016 21:34:35 -0800


On Wednesday, February 3, 2016 at 9:48:24 AM UTC-5, Stefan Karpinski wrote:
>
> I've been working through encoding issues with this as well. To connect to 
> Microsoft SQL Server, this patch made things work:
>
> diff --git a/src/backend.jl b/src/backend.jl
> index b5f24af..bf4ee11 100644
> --- a/src/backend.jl
> +++ b/src/backend.jl
> @@ -40,7 +40,7 @@ end
>
>  # Send query to DMBS
>  function ODBCQueryExecute(stmt::Ptr{Void}, querystring::AbstractString)
> -    if @FAILED SQLExecDirect(stmt, utf16(querystring))
> +    if @FAILED SQLExecDirect(stmt, utf8(querystring))
>          ODBCError(SQL_HANDLE_STMT,stmt)
>          error("[ODBC]: SQLExecDirect failed; Return Code: $ret")
>      end
>
>
> The query string gets passed through to SQLExecDirect:
>
> #SQLExecDirect
> #
> http://msdn.microsoft.com/en-us/library/windows/desktop/ms713611(v=vs.85).aspx
> #Description: executes a preparable statement
> #Status:
> function SQLExecDirect(stmt::Ptr{Void},query::AbstractString)
>     @windows_only ret = ccall( (:SQLExecDirect, odbc_dm), stdcall,
>         Int16, (Ptr{Void},Ptr{UInt8},Int),
>         stmt,query,sizeof(query))
>     @unix_only ret = ccall( (:SQLExecDirect, odbc_dm),
>             Int16, (Ptr{Void},Ptr{UInt8},Int),
>             stmt,query,sizeof(query))
>     return ret
> end
>
>  
> This function just convert whatever it's argument is to a pointer. Looking 
> at the docs 
> <https://msdn.microsoft.com/en-us/library/windows/desktop/ms713611(v=vs.85).aspx>
>  for 
> this function, the signature is this:
>
> SQLRETURN SQLExecDirect(
>      SQLHSTMT     StatementHandle,
>      SQLCHAR *    StatementText,
>      SQLINTEGER   TextLength);
>
> and SQLCHAR is defined 
> <https://msdn.microsoft.com/en-us/library/ms714556(v=vs.85).aspx> as 
> unsigned char. So this would seem to be a non-wide character string – i.e. 
> ASCII or UTF-8. And indeed, that's what the Microsoft SQL driver seems to 
> be expecting
>
> The question I have is this: how the heck is this working for other ODBC 
> drivers? How are they getting pointers to UTF-16 data and interpreting it 
> correctly? The correct fix would seem to be to make this always send UTF-8 
> strings. But when I made a PR 
> <https://www.google.com/url?q=https%3A%2F%2Fgithub.com%2FJuliaDB%2FODBC.jl%2Fpull%2F71&sa=D&sntz=1&usg=AFQjCNEiuYPGnEMl7BB5dukz6gomNMzo0g>
>  
> that did that, it seemed to break other setups.
>


SQLCHAR is for encodings with 8-bit code units.  It doesn't imply ASCII or 
UTF-8 (probably one of the more common character sets used with that is 
actually Microsoft's CP1252, which is often mistakenly described as ANSI 
Latin-1 - of which it is a superset).
Even when something says it is UTF-8, it frequently is not *really* valid 
UTF-8, for example, there are two common variations of UTF-8, CESU-8, used 
by MySQL and others, which encodes any non-BMP code point using the two 
UTF-16 surrogate pairs, i.e. to 6 bytes instead of the correct 4-byte UTF-8 
sequence, and Java's Modified UTF-8, which is the same as CESU-8, plus 
embedded \0s are encoded in a "long" form (0xc0 0x80) so that you can still 
use old-fashioned C \0 termination for strings while allowing embedded \0s.
(note: handling those formats, common in databases, was part of the reason 
I did the work to rewrite the `convert` functions for strings for Julia).

The reason your PR didn't work is because you need to find out from the 
database which character set it is using, and then do the conversions 
necessary to go back and forth between that character set and whatever 
Unicode you are using in Julia (UTF8String, UTF16String, or UTF32String).
One thing that could help is the following package that @nalimilan has been 
working on: https://github.com/nalimilan/StringEncodings.jl)
You should also be able to use the SQLWCHAR API, but that can have it's own 
issues, see the following from Oracle:

> NOTE: The mapping of *SQLWCHAR* type is somewhat complicated and it can
>          create hidden pitfalls for programmers porting their code from 
>          Windows to Linux. Usually a *SQLWCHAR* character is a 16-bit 
> unit and
>          we will not consider the exotic cases when *SQLWCHAR* is 
> different.
>          Windows uses *UTF-16* and maps *SQLWCHAR* to 16-bit wchar_t type.
>          However, many Linux versions such as Ubuntu Linux use *UTF-32* as 
> an
>          internal character set and therefore their 32-bit wchar_t is not
>          compatible with *SQLWCHAR*, which is always 16-bit.

 
Your working theory 
https://github.com/JuliaDB/ODBC.jl/pull/71#issuecomment-175095266 is 
incorrect, it is an issue of character encodings used, not platforms.  You 
see platform differences because the databases you are connecting to have 
different default character encodings on different platforms.

Re: [julia-users] load a Julia dataframe from Microsoft SQL Server table

Reply via email to