Re: [sqlite] UTF8 support?

Roger Binns Mon, 27 Oct 2008 22:01:17 -0700

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

William Kyngesburye wrote:
> So, sqlite supports UTF8 directly - UTF8 in, UTF8 out.


No.  SQLite supports Unicode internally.  The APIs let you supply and
receive Unicode strings in UTF8 and UTF16.  The actual encoding
serialized to disk depends on a number of factors, but is also
irrelevant to API usage as the API will accept or supply them in the
encoding you request (UTF8, UTF16).
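
A minimal sketch of that point, using Python's standard sqlite3 module
rather than the C API (the table name t and the helper roundtrip are made
up for illustration).  It asks for a different on-disk encoding with
PRAGMA encoding on a fresh database and checks that the string handed
back is the same either way:

```python
import sqlite3

def roundtrip(encoding_name):
    """Store one string in a fresh database that uses the given on-disk encoding."""
    conn = sqlite3.connect(":memory:")
    # PRAGMA encoding only takes effect before the database has any content.
    conn.execute(f"PRAGMA encoding = '{encoding_name}'")
    conn.execute("CREATE TABLE t (s TEXT)")  # hypothetical table
    text = "na\u00efve caf\u00e9 \u65e5\u672c\u8a9e"
    conn.execute("INSERT INTO t VALUES (?)", (text,))
    stored_as = conn.execute("PRAGMA encoding").fetchone()[0]
    value = conn.execute("SELECT s FROM t").fetchone()[0]
    conn.close()
    return stored_as, value, text

for name in ("UTF-8", "UTF-16le"):
    stored_as, value, text = roundtrip(name)
    # The caller sees the same string whatever the storage encoding is.
    print(stored_as, value == text)
```

The caller never has to know which encoding ended up on disk.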

> And then, ICU adds internal unicode sorting, searching and case  
> conversion.

The builtin SQLite sorting and case conversion only know about US ASCII.
Case conversion leaves other codepoints alone, and sorting is by
codepoint.  ICU lets you specify the locale:

  -- standard sqlite
  select upper('instant text');
  -- with ICU
  select upper('instant text', 'tr_TR');

The former will give "INSTANT TEXT" while the latter gives "İNSTANT
TEXT" (note the dot on top of the I).
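
To see the builtin behaviour for yourself, here is a small sketch using
Python's standard sqlite3 module (the table and the sample words are made
up).  It assumes the SQLite library behind the module was built without
ICU, which is the usual case but not guaranteed:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (w TEXT)")  # hypothetical table
conn.executemany(
    "INSERT INTO words VALUES (?)",
    [("apple",), ("Zebra",), ("\u00c9clair",), ("zoo",)],
)

# The default collation compares encoded bytes, so "Z" sorts before "a"
# and the non-ASCII "\u00c9" sorts after every ASCII letter.
ordered = [row[0] for row in conn.execute("SELECT w FROM words ORDER BY w")]
print(ordered)

# ASCII letters are converted; without ICU the e-acute is expected to
# come back unchanged.
ascii_upper = conn.execute("SELECT upper('instant text')").fetchone()[0]
mixed_upper = conn.execute("SELECT upper(?)", ("\u00e9",)).fetchone()[0]
print(ascii_upper, mixed_upper)
```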

> The spatialite unicode support seems to be conversion routines to/from  
> UTF8 in the shell when the shell uses some other encoding.  

The shell appears to be UTF8 - it seems to make no effort to do character
set conversion.  (It also has a number of escaping options such as CSV,
HTML, C style backslashes).  However it really only does codepoints less
than 255.  The various output routines treat the strings as a sequence
of bytes and make output decisions on a byte by byte basis.  This means
for example that if there is a multibyte utf8 sequence the subsequent
bytes will not be treated as part of a utf8 encoded codepoint.
Basically not getting mangled output when using non-latin1 codepoints is
a matter of luck.
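
This is not the shell's code, just a sketch of the general failure mode
when output decisions are made per byte.  The two helper functions are
made up for illustration; one cuts a string after N bytes, the other
after N codepoints:

```python
def truncate_bytes(text, width):
    """Cut after `width` bytes, as byte-oriented output code would."""
    return text.encode("utf-8")[:width]

def truncate_codepoints(text, width):
    """Cut after `width` codepoints, then encode."""
    return text[:width].encode("utf-8")

text = "na\u00efve"  # the i-diaeresis takes two bytes in UTF8

cut = truncate_bytes(text, 3)
# The cut lands in the middle of the two-byte sequence, leaving a lone
# lead byte that no longer decodes as a character.
print(cut)
print(cut.decode("utf-8", errors="replace"))

# Cutting on codepoint boundaries keeps the sequence whole.
print(truncate_codepoints(text, 3).decode("utf-8"))
```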

> I'm  
> more interested in the library.  I'll have to play with it a bit to  
> see for sure.

The library does unicode, only unicode, full stop end of story.  The
SQLite apis that take or supply text are usually in two variants.  If
there is a "16" suffix then UTF16 is used, else UTF8 is used.  Use
whichever variant is most convenient for you.  The underlying behaviour
is identical.  In general Windows programmers will find the UTF16
variant more useful as the Windows API has been unicode since Windows
NT(*).  Linux programmers will find the UTF8 variant more useful since
that is what other Linux apis tend to be.  I have no idea about Mac.
Note that there is no problem using both variants in the same program -
I regularly do!
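
As an illustration of what the two variants carry (for example
sqlite3_column_text versus sqlite3_column_text16 in the C API), here is
a sketch in plain Python.  It does not call SQLite at all; it only shows
that the same string has two byte forms that decode to identical
codepoints:

```python
text = "\u0130NSTANT TEXT"  # capital I with dot above, then ASCII

as_utf8 = text.encode("utf-8")       # what a UTF8 variant would hand over
as_utf16 = text.encode("utf-16-le")  # what a "16" variant would hand over

# Different byte counts for the same twelve codepoints.
print(len(text), len(as_utf8), len(as_utf16))

# Either form decodes back to exactly the same string.
same = as_utf8.decode("utf-8") == as_utf16.decode("utf-16-le") == text
print(same)
```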

(*) There are also legacy versions that take bytes in the local code
page, and then usually convert to Unicode and call the Unicode version
of the API.  Windows 9x had an amusing library named unicows that did
the opposite - it took calls to the Unicode system apis and converted to
local code page and called them since the internals were not unicode.

Roger
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)

iEYEARECAAYFAkkGnIsACgkQmOOfHg372QQz8ACeKaahVpynXD51yVJH2LXsHl++
P2YAoLpXceo492DgQmq2dgabCvL6XuHW
=kgA7
-----END PGP SIGNATURE-----
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
