Re: [sqlite] FTS3 Unicode support

2008-01-30 Thread Myk Melez

Scott Hess wrote:

> The [3] status is ... pending, sorry :-(.  But it is more along the
> lines of adding stuff to ICU rather than adding ICU-less stuff to
> SQLite, so it sounds like that is not relevant to what you're doing.

Hi Scott,

Thanks for the info.  Indeed, enhancements to ICU don't sound like the 
right approach for us.  I'll look into implementing an alternate tokenizer.


-myk


-
To unsubscribe, send email to [EMAIL PROTECTED]
-



Re: [sqlite] FTS3 Unicode support

2008-01-30 Thread Scott Hess
The [3] status is ... pending, sorry :-(.  But it is more along the
lines of adding stuff to ICU rather than adding ICU-less stuff to
SQLite, so it sounds like that is not relevant to what you're doing.

As Dan mentioned, fts3 already has support for alternate
tokenizers, including an ICU-based tokenizer.  Even if you aren't
using the ICU-based tokenizer, the scheme for loading tokenizers
described in README.tokenizers is probably the way to go.  Otherwise,
if you're compiling your own SQLite code, it's not very hard at all to
introduce a custom tokenizer.

Note that if you redefine how your tokenizer tokenizes, it can leave
your existing fts index broken.  Basically, if something tokenizes to
"X" now, and later changes cause it to tokenize to "XY", then you will
no longer be able to match on "X" because it's baked into the index
that way.  The only real solution is to expose this as a new tokenizer
and rebuild the table.  [Indeed, I'm still making up my story in this
area, too.  It's similar to how changing the implementation of a
custom collator can mess with your regular SQLite indices.]
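
To make the "X" versus "XY" failure concrete, here is a toy sketch in
Python.  It does not use fts3 at all: the dictionary-based index and
both tokenizer functions are hypothetical stand-ins invented for this
illustration.  The only point is that terms stored under the old
tokenizer are never found by queries that go through the new one,
until the index is rebuilt.

```python
from collections import defaultdict


def tokenize_v1(text):
    # Hypothetical "old" tokenizer: split on whitespace, fold ASCII letters only.
    return [
        "".join(c.lower() if c.isascii() else c for c in word)
        for word in text.split()
    ]


def tokenize_v2(text):
    # Hypothetical "new" tokenizer: split on whitespace, fold all letters.
    return [word.lower() for word in text.split()]


def build_index(docs, tokenize):
    # Toy inverted index: term -> set of document ids.
    index = defaultdict(set)
    for docid, text in docs.items():
        for term in tokenize(text):
            index[term].add(docid)
    return index


def match(index, query, tokenize):
    # Every query term must be present in a document for it to match.
    result = None
    for term in tokenize(query):
        ids = index.get(term, set())
        result = ids if result is None else result & ids
    return result or set()


docs = {1: "\u00c9clair recipe", 2: "plain recipe"}

# Index built with the old tokenizer, queried through the new one.
old_index = build_index(docs, tokenize_v1)
stale = match(old_index, "\u00c9clair", tokenize_v2)

# Same query after rebuilding the index with the new tokenizer.
rebuilt = match(build_index(docs, tokenize_v2), "\u00c9clair", tokenize_v2)

print(stale, rebuilt)
```

Here the stale lookup comes back empty and the rebuilt one finds
document 1, which is the "rebuild the table" step described above.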

At this time, the fts index is internally ordered using memcmp()
ordering.  This may make the results of prefix queries incorrect in
certain cases.  I am not knowledgeable enough about
internationalization issues to know whether this is a real problem or
just a theoretical one, or, if it is real, whether it is at all
reasonable to solve.
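
One case where bytewise ordering might matter can be sketched in
Python.  This is an assumption about the kind of problem meant here,
not a description of fts3 internals: the sorted list of UTF-8 byte
strings stands in for the index, and prefix_scan is a hypothetical
helper.  Python compares bytes objects the way memcmp() would, so the
sort order is the same.

```python
import bisect
import unicodedata


def prefix_scan(sorted_terms, prefix):
    # Walk the byte-ordered term list from the first candidate and stop
    # at the first term that no longer starts with the prefix bytes.
    start = bisect.bisect_left(sorted_terms, prefix)
    hits = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break
        hits.append(term)
    return hits


nfc = "\u00e9cole"                       # precomposed e-acute
nfd = unicodedata.normalize("NFD", nfc)  # "e" followed by a combining accent

# Bytewise (memcmp-style) ordering of the UTF-8 encoded terms.
terms = sorted(w.encode("utf-8") for w in ["zebra", nfc, nfd, "echo"])

# A prefix query typed in precomposed form.
hits = prefix_scan(terms, "\u00e9c".encode("utf-8"))

print([t.decode("utf-8") for t in terms])
print([t.decode("utf-8") for t in hits])
```

In this sketch the precomposed spelling sorts after "zebra", the
decomposed spelling sorts next to "echo", and the prefix scan finds
only the precomposed term even though both spellings render the same.
A tokenizer that normalizes its output to one form would sidestep
this particular case.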

I believe that the existing fts MATCH code makes certain assumptions
about how the tokenizer works.  Specifically, if the tokenizer returns
more than one variant at a position, I don't think the MATCH code is
going to deal with that very well.  For instance, if you want to
tokenize an accented word both with and without the accent, things
might go awry when you run a query with the accented word.  I've
currently got nothing planned for resolving this, but suggestions (or
prospective solutions) are welcome.
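
For anyone unsure what "more than one variant at a position" means,
here is a toy Python sketch.  Nothing in it is fts3 code: the
tokenizer and both phrase matchers are hypothetical, and the naive
matcher is only a guess at the kind of one-token-per-position
assumption that could go wrong.

```python
import unicodedata


def strip_accents(word):
    # Decompose, then drop combining marks (accents).
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def tokenize_with_variants(text):
    # Return (position, term) pairs; an accented word yields two terms
    # that share the same position.
    tokens = []
    for position, word in enumerate(text.lower().split()):
        tokens.append((position, word))
        plain = strip_accents(word)
        if plain != word:
            tokens.append((position, plain))
    return tokens


def naive_phrase_match(doc_tokens, query_tokens):
    # Hypothetical matcher that assumes the i-th query token sits at
    # position start + i, i.e. exactly one token per position.
    doc = set(doc_tokens)
    query_terms = [term for _, term in query_tokens]
    starts = {pos for pos, _ in doc_tokens}
    return any(
        all((start + i, term) in doc for i, term in enumerate(query_terms))
        for start in starts
    )


def position_aware_phrase_match(doc_tokens, query_tokens):
    # Uses the positions reported by the tokenizer as offsets instead.
    doc = set(doc_tokens)
    starts = {pos for pos, _ in doc_tokens}
    return any(
        all((start + offset, term) in doc for offset, term in query_tokens)
        for start in starts
    )


doc_tokens = tokenize_with_variants("\u00e9cole maternelle")
accented_query = tokenize_with_variants("\u00e9cole maternelle")
plain_query = tokenize_with_variants("ecole maternelle")

print(doc_tokens)
print(naive_phrase_match(doc_tokens, accented_query))
print(naive_phrase_match(doc_tokens, plain_query))
print(position_aware_phrase_match(doc_tokens, accented_query))
```

With the naive matcher the unaccented query matches but the accented
one does not, because the extra variant shifts every later term by
one slot.  Honoring the tokenizer's positions is one possible way
out, offered here only as a suggestion.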

-scott


On Thu, Jan 24, 2008 at 4:26 PM, Myk Melez <[EMAIL PROTECTED]> wrote:
> Hi all,
>
>  I'm working to enable FTS3 in the next version of Firefox [1] so that
>  extenders can take advantage of it, although Firefox itself isn't using
>  it for the next release.
>
>  Given Firefox's international audience, it would be useful for FTS3 to
>  support Unicode.  We currently do this for upper(), lower(), and LIKE by
>  redefining them with sqlite3_create_function [2].
>
>  For FTS3 it seems like we'd have to redefine the tokenizer and MATCH.
>  Can that be done using sqlite3_create_function, and what's the status of
>  the international support mentioned in a previous message on this list [3]?
>
>  -myk
>
>
>  [1] https://bugzilla.mozilla.org/show_bug.cgi?id=413589
>  [2] http://lxr.mozilla.org/mozilla/source/storage/src/mozStorageUnicodeFunctions.cpp
>  [3] http://www.mail-archive.com/sqlite-users@sqlite.org/msg27238.html
>



Re: [sqlite] FTS3 Unicode support

2008-01-24 Thread Shawn Wilsher
The problem with ICU is that it's a rather large library, and Mozilla
already has its own Unicode system.  That's why we opted to do
Unicode support ourselves (less code duplication, and a smaller
binary).

Cheers,

Shawn Wilsher

On Jan 24, 2008 11:35 PM, Dan <[EMAIL PROTECTED]> wrote:
>
> On Jan 25, 2008, at 7:26 AM, Myk Melez wrote:
>
> > Hi all,
> >
> > I'm working to enable FTS3 in the next version of Firefox [1] so
> > that extenders can take advantage of it, although Firefox itself
> > isn't using it for the next release.
> >
> > Given Firefox's international audience, it would be useful for FTS3
> > to support Unicode.  We currently do this for upper(), lower(), and
> > LIKE by redefining them with sqlite3_create_function [2].
> >
> > For FTS3 it seems like we'd have to redefine the tokenizer and
> > MATCH. Can that be done using sqlite3_create_function, and what's
> > the status of the international support mentioned in a previous
> > message on this list [3]?
>
> Hi Myk,
>
> The 'icu' and 'fts3' SQLite extensions can take advantage of the
> ICU library to provide internationalization if it is available.
> The ICU extension provides internationalized versions of upper(),
> lower(), collation sequences and a REGEXP operator. Details
> are available here:
>
>   http://www.sqlite.org/cvstrac/fileview?f=sqlite/ext/icu/README.txt
>
> Fts3 has an API for creating new tokenizers. See here:
>
>   http://www.sqlite.org/cvstrac/fileview?f=sqlite/ext/fts3/README.tokenizers
>
> One of the example tokenizers uses the ICU library for localization.
> See the same document for details. It is built if the
> SQLITE_ENABLE_ICU macro is defined when fts3 is compiled.
>
> Regards,
> Dan.



Re: [sqlite] FTS3 Unicode support

2008-01-24 Thread Dan


On Jan 25, 2008, at 7:26 AM, Myk Melez wrote:

> Hi all,
>
> I'm working to enable FTS3 in the next version of Firefox [1] so
> that extenders can take advantage of it, although Firefox itself
> isn't using it for the next release.
>
> Given Firefox's international audience, it would be useful for FTS3
> to support Unicode.  We currently do this for upper(), lower(), and
> LIKE by redefining them with sqlite3_create_function [2].
>
> For FTS3 it seems like we'd have to redefine the tokenizer and
> MATCH. Can that be done using sqlite3_create_function, and what's
> the status of the international support mentioned in a previous
> message on this list [3]?

Hi Myk,

The 'icu' and 'fts3' SQLite extensions can take advantage of the
ICU library to provide internationalization if it is available.
The ICU extension provides internationalized versions of upper(),
lower(), collation sequences and a REGEXP operator. Details
are available here:

  http://www.sqlite.org/cvstrac/fileview?f=sqlite/ext/icu/README.txt

Fts3 has an API for creating new tokenizers. See here:

  http://www.sqlite.org/cvstrac/fileview?f=sqlite/ext/fts3/README.tokenizers


One of the example tokenizers uses the ICU library for localization.
See the same document for details. It is built if the
SQLITE_ENABLE_ICU macro is defined when fts3 is compiled.

Regards,
Dan.





