Re: [sqlite] FTS3 Unicode support

Scott Hess Wed, 30 Jan 2008 11:28:43 -0800

The [3] status is ... pending, sorry :-(.  But it is more along the
lines of adding stuff to ICU rather than adding ICU-less stuff to
SQLite, so it sounds like that is not relevant to what you're doing.

As Dan mentioned, the fts3 source already has support for alternate
tokenizers, including an ICU-based tokenizer.  Even if you aren't
using the ICU-based tokenizer, the scheme for loading tokenizers
described in README.tokenizers is probably the way to go.  Otherwise,
if you're compiling your own SQLite code, it's not very hard at all to
introduce a custom tokenizer.

Note that if you redefine how your tokenizer tokenizes, it can leave
your existing fts index broken.  Basically, if something tokenizes to
"X" now, and later changes cause it to tokenize to "XY", then you will
no longer be able to match on "X" because it's baked into the index
that way.  The only real solution is to expose this as a new tokenizer
and rebuild the table.  [Indeed, I'm still making up my story in this
area, too.  It's similar to how changing the implementation of a
custom collator can mess with your regular SQLite indices.]
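
To make the failure mode concrete, here is a toy sketch in Python.  It
is not the fts3 code: build_index, match, tokenize_v1 and tokenize_v2
are hypothetical names made up for illustration, and the "index" is
just a dict of sets.  It builds an index under one tokenizer and then
queries it under a changed one:

```python
from collections import defaultdict

def tokenize_v1(text):
    # Hypothetical original tokenizer: lowercase, split on whitespace and hyphens.
    return text.lower().replace("-", " ").split()

def tokenize_v2(text):
    # Hypothetical revised tokenizer: keeps hyphenated words as a single token.
    return text.lower().split()

def build_index(docs, tokenize):
    # Map each token to the set of document ids that contain it.
    index = defaultdict(set)
    for docid, text in docs.items():
        for token in tokenize(text):
            index[token].add(docid)
    return index

def match(index, query, tokenize):
    # A document matches when it contains every token of the query.
    postings = [index.get(token, set()) for token in tokenize(query)]
    return set.intersection(*postings) if postings else set()

docs = {1: "full-text search", 2: "plain text"}
index = build_index(docs, tokenize_v1)   # tokens are baked in under the old rules

print(match(index, "full-text", tokenize_v1))    # old rules at query time: finds doc 1
print(match(index, "full-text", tokenize_v2))    # new rules at query time: finds nothing

rebuilt = build_index(docs, tokenize_v2)         # rebuild under the new rules
print(match(rebuilt, "full-text", tokenize_v2))  # finds doc 1 again
```

The stored tokens never change when the tokenizer does, which is why
only the rebuild brings the match back.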

At this time, the fts index is internally ordered using memcmp()
ordering.  This may make the results of prefix queries incorrect in
certain cases.  I am not knowledgeable enough about
internationalization issues to know whether this is a real problem or
just a theoretical one, and, if it is real, whether it is at all
reasonable to solve.
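
One case where byte ordering and byte-wise prefixes could plausibly
matter is Unicode normalization.  This is an assumption about what
"certain cases" might include, not something confirmed for fts3.  The
Python sketch below uses hypothetical helpers (memcmp_sorted,
prefix_query) that only mimic byte-wise comparison of UTF-8 terms:

```python
import unicodedata

def memcmp_sorted(terms):
    # Python orders bytes objects byte by byte, like memcmp() with a length tiebreak.
    return sorted(term.encode("utf-8") for term in terms)

def prefix_query(sorted_terms, prefix):
    # Return the terms whose UTF-8 bytes begin with the bytes of the prefix.
    needle = prefix.encode("utf-8")
    return [t.decode("utf-8") for t in sorted_terms if t.startswith(needle)]

nfc = unicodedata.normalize("NFC", "\u00e9cole")  # precomposed e-acute
nfd = unicodedata.normalize("NFD", "\u00e9cole")  # "e" followed by a combining acute

terms = memcmp_sorted(["zebra", "eclair", nfc, nfd])

# Byte order puts the precomposed spelling after "zebra", away from the other "e" words.
print([t.decode("utf-8") for t in terms])

# The two spellings render identically but fall under different prefixes.
print(prefix_query(terms, "e"))       # "eclair" and the decomposed spelling only
print(prefix_query(terms, "\u00e9"))  # the precomposed spelling only
```

If the tokenizer normalized every token to one form before indexing
and before querying, this particular mismatch would go away; whether
other collation issues remain is a separate question.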

I believe that the existing fts MATCH code makes certain assumptions
about how the tokenizer works.  Specifically, if the tokenizer returns
more than one variant at a position, I don't think the MATCH code is
going to deal with that very well.  For instance, if you want to
tokenize an accented word both with and without the accent, things
might go awry when you run a query with the accented word.  I've
currently got nothing planned for resolving this, but suggestions (or
prospective solutions) are welcome.
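
One way to sidestep the multiple-variants question, offered as a
sketch rather than as anything fts3 does, is to have the tokenizer
emit a single folded token per position and to apply the same folding
to query text.  The Python below is illustrative only; fold_accents
and tokenize are hypothetical names:

```python
import unicodedata

def fold_accents(token):
    # Decompose the token, then drop combining marks, so an accented
    # spelling and its unaccented spelling become the same string.
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def tokenize(text):
    # Hypothetical tokenizer: exactly one folded, lowercased token per
    # whitespace-separated word, paired with its position.
    return [(pos, fold_accents(word.lower())) for pos, word in enumerate(text.split())]

print(tokenize("Caf\u00e9 au lait"))  # accented input
print(tokenize("cafe au lait"))       # unaccented input gives the same tokens
```

The trade-off is that a query can no longer distinguish the accented
word from the unaccented one, which may or may not be acceptable for a
given language.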

-scott

On Thu, Jan 24, 2008 at 4:26 PM, Myk Melez <[EMAIL PROTECTED]> wrote:
> Hi all,
>
>  I'm working to enable FTS3 in the next version of Firefox [1] so that
>  extenders can take advantage of it, although Firefox itself isn't using
>  it for the next release.
>
>  Given Firefox's international audience, it would be useful for FTS3 to
>  support Unicode.  We currently do this for upper(), lower(), and LIKE by
>  redefining them with sqlite3_create_function [2].
>
>  For FTS3 it seems like we'd have to redefine the tokenizer and MATCH.
>  Can that be done using sqlite3_create_function, and what's the status of
>  the international support mentioned in a previous message on this list [3]?
>
>  -myk
>
>
>  [1] https://bugzilla.mozilla.org/show_bug.cgi?id=413589
>  [2] http://lxr.mozilla.org/mozilla/source/storage/src/mozStorageUnicodeFunctions.cpp
>  [3] http://www.mail-archive.com/sqlite-users@sqlite.org/msg27238.html
>
>  -----------------------------------------------------------------------------
>  To unsubscribe, send email to [EMAIL PROTECTED]
>  -----------------------------------------------------------------------------
>
>
