On Sun, 20 Oct 2019 at 17:04, Simon Slavin <slav...@bigfraud.org> wrote:

> Another common request is full support for Unicode (searching, sorting,
> length()).  But even just the tables required to identify character
> boundaries are huge.
>

Nitpick: no tables are required to identify character boundaries. In UTF-8 a
byte belongs to the current codepoint exactly when its top two bits are 10
(the high bit alone isn't enough, since the lead bytes of multi-byte
sequences also have it set), and the lead byte's bit pattern tells you how
many bytes to expect.

I'm less familiar with UTF-16, which SQLite also has some support for, but a
quick read suggests there are exactly two reserved bit patterns you need to
care about, the high and low surrogate ranges, to identify surrogate pairs
and thus codepoint boundaries.
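
A similar sketch for UTF-16, assuming the text is an array of uint16_t code
units (again just illustrative, not SQLite's implementation):

#include <stddef.h>
#include <stdint.h>

/* The two reserved ranges: high surrogates 0xD800-0xDBFF and low
** surrogates 0xDC00-0xDFFF. A high surrogate followed by a low one
** encodes a single codepoint; every other code unit stands alone. */
static int is_high_surrogate(uint16_t u){ return (u & 0xFC00) == 0xD800; }
static int is_low_surrogate(uint16_t u){ return (u & 0xFC00) == 0xDC00; }

static size_t utf16_codepoint_count(const uint16_t *s, size_t n){
  size_t count = 0;
  for(size_t i = 0; i < n; i++){
    count++;
    if(is_high_surrogate(s[i]) && i+1 < n && is_low_surrogate(s[i+1])){
      i++;  /* skip the low half of the surrogate pair */
    }
  }
  return count;
}

So there too, boundary detection is a couple of bit masks, not a table
lookup.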

Tables relating to collation order, character case, and similar codepoint
data can of course get huge, so your point stands.
-Rowan