Re: [sqlite] FTS3 Unicode support
Scott Hess wrote:
> The [3] status is ... pending, sorry :-(. But it is more along the
> lines of adding stuff to ICU rather than adding ICU-less stuff to
> SQLite, so it sounds like that is not relevant to what you're doing.

Hi Scott,

Thanks for the info. Indeed, enhancements to ICU don't sound like the right approach for us. I'll look into implementing an alternate tokenizer.

-myk
Re: [sqlite] FTS3 Unicode support
The [3] status is ... pending, sorry :-(. But it is more along the lines of adding stuff to ICU rather than adding ICU-less stuff to SQLite, so it sounds like that is not relevant to what you're doing.

As Dan mentioned, there's stuff in there for supporting alternate tokenizers, including an ICU-based tokenizer. Even if you aren't using the ICU-based tokenizer, the scheme for loading tokenizers in README.tokenizers is probably the way to go. Otherwise, if you're compiling your own SQLite code, it's not very hard at all to introduce a custom tokenizer.

Note that if you redefine how your tokenizer tokenizes, it can leave your existing fts index broken. Basically, if something tokenizes to "X" now, and later changes cause it to tokenize to "XY", then you will no longer be able to match on "X" because it's baked into the index that way. The only real solution is to expose this as a new tokenizer and rebuild the table. [Indeed, I'm still making up my story in this area, too. It's similar to how changing the implementation of a custom collator can mess with your regular SQLite indices.]

At this time, the fts index is internally ordered using memcmp() ordering. This may make the results of prefix queries incorrect in certain cases. I am not knowledgeable enough about internationalization issues to know if this is a real problem, or just a theoretical problem, and if it's a real problem, is it a problem which is at all reasonable to solve?

I believe that the existing fts MATCH code makes certain assumptions about how the tokenizer works. Specifically, if the tokenizer returns more than one variant at a position, I don't think the MATCH code is going to deal with that very well. For instance, if you want to tokenize an accented word both with and without the accent, things might go awry when you run a query with the accented word. I've currently got nothing planned for resolving this, but suggestions (or prospective solutions) are welcome.
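To make the "X" versus "XY" point concrete, here is a minimal Python sketch that uses a toy inverted index rather than FTS3 itself. Everything in it is hypothetical and made up for illustration (the two tokenizer functions, `build_index`, `search`); real fts3 stores its index very differently. The failure mode is the one described above: terms are baked into the index at insert time, while the query is tokenized with whatever tokenizer is current.

```python
import unicodedata
from collections import defaultdict

def tokenize_v1(text):
    """Old (hypothetical) tokenizer: lowercase, split on whitespace, strip accents."""
    terms = []
    for word in text.lower().split():
        decomposed = unicodedata.normalize("NFD", word)
        # Drop combining marks so that "café" is indexed as "cafe".
        terms.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return terms

def tokenize_v2(text):
    """Changed (hypothetical) tokenizer: lowercase, split on whitespace, keep accents."""
    return [unicodedata.normalize("NFC", word) for word in text.lower().split()]

def build_index(docs, tokenize):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query, tokenize):
    """Return the ids of documents containing every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    hits = set(index.get(terms[0], set()))
    for term in terms[1:]:
        hits &= index.get(term, set())
    return hits

docs = {1: "Caf\u00e9 au lait", 2: "plain coffee"}  # "\u00e9" is e with acute accent

old_index = build_index(docs, tokenize_v1)

# The index was built with v1, but the query is now tokenized with v2.
stale_hits = search(old_index, "caf\u00e9", tokenize_v2)

# Rebuilding the index with the new tokenizer restores the match.
fresh_hits = search(build_index(docs, tokenize_v2), "caf\u00e9", tokenize_v2)

print("stale index:", stale_hits)
print("rebuilt index:", fresh_hits)
```

The stale lookup comes back empty because the index only knows the unaccented term, which is why exposing the change as a new tokenizer and rebuilding the table is the safe route.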
-scott

On Thu, Jan 24, 2008 at 4:26 PM, Myk Melez <[EMAIL PROTECTED]> wrote:
> Hi all,
>
> I'm working to enable FTS3 in the next version of Firefox [1] so that
> extenders can take advantage of it, although Firefox itself isn't using
> it for the next release.
>
> Given Firefox's international audience, it would be useful for FTS3 to
> support Unicode. We currently do this for upper(), lower(), and LIKE by
> redefining them with sqlite3_create_function [2].
>
> For FTS3 it seems like we'd have to redefine the tokenizer and MATCH.
> Can that be done using sqlite3_create_function, and what's the status of
> the international support mentioned in a previous message on this list [3]?
>
> -myk
>
> [1] https://bugzilla.mozilla.org/show_bug.cgi?id=413589
> [2] http://lxr.mozilla.org/mozilla/source/storage/src/mozStorageUnicodeFunctions.cpp
> [3] http://www.mail-archive.com/sqlite-users@sqlite.org/msg27238.html
Re: [sqlite] FTS3 Unicode support
The problem with ICU is that it's a rather large library, and Mozilla already has its own Unicode system. That's why we opted to do Unicode support ourselves (less code duplication, and a smaller binary).

Cheers,

Shawn Wilsher

On Jan 24, 2008 11:35 PM, Dan <[EMAIL PROTECTED]> wrote:
>
> On Jan 25, 2008, at 7:26 AM, Myk Melez wrote:
>
> > Hi all,
> >
> > I'm working to enable FTS3 in the next version of Firefox [1] so
> > that extenders can take advantage of it, although Firefox itself
> > isn't using it for the next release.
> >
> > Given Firefox's international audience, it would be useful for FTS3
> > to support Unicode. We currently do this for upper(), lower(), and
> > LIKE by redefining them with sqlite3_create_function [2].
> >
> > For FTS3 it seems like we'd have to redefine the tokenizer and
> > MATCH. Can that be done using sqlite3_create_function, and what's
> > the status of the international support mentioned in a previous
> > message on this list [3]?
>
> Hi Myk,
>
> The 'icu' and 'fts3' SQLite extensions can take advantage of the
> ICU library to provide internationalization if it is available.
> The ICU extension provides internationalized versions of upper(),
> lower(), collation sequences and a REGEXP operator. Details
> are available here:
>
>   http://www.sqlite.org/cvstrac/fileview?f=sqlite/ext/icu/README.txt
>
> Fts3 has an API for creating new tokenizers. See here:
>
>   http://www.sqlite.org/cvstrac/fileview?f=sqlite/ext/fts3/README.tokenizers
>
> One of the example tokenizers uses the ICU library for localization.
> See the same document for details. It is built if the
> SQLITE_ENABLE_ICU macro is defined when fts3 is compiled.
>
> Regards,
> Dan.
Re: [sqlite] FTS3 Unicode support
On Jan 25, 2008, at 7:26 AM, Myk Melez wrote:

> Hi all,
>
> I'm working to enable FTS3 in the next version of Firefox [1] so
> that extenders can take advantage of it, although Firefox itself
> isn't using it for the next release.
>
> Given Firefox's international audience, it would be useful for FTS3
> to support Unicode. We currently do this for upper(), lower(), and
> LIKE by redefining them with sqlite3_create_function [2].
>
> For FTS3 it seems like we'd have to redefine the tokenizer and
> MATCH. Can that be done using sqlite3_create_function, and what's
> the status of the international support mentioned in a previous
> message on this list [3]?

Hi Myk,

The 'icu' and 'fts3' SQLite extensions can take advantage of the ICU library to provide internationalization if it is available. The ICU extension provides internationalized versions of upper(), lower(), collation sequences and a REGEXP operator. Details are available here:

  http://www.sqlite.org/cvstrac/fileview?f=sqlite/ext/icu/README.txt

Fts3 has an API for creating new tokenizers. See here:

  http://www.sqlite.org/cvstrac/fileview?f=sqlite/ext/fts3/README.tokenizers

One of the example tokenizers uses the ICU library for localization. See the same document for details. It is built if the SQLITE_ENABLE_ICU macro is defined when fts3 is compiled.

Regards,
Dan.
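Since the ICU tokenizer only exists when SQLITE_ENABLE_ICU was defined at build time, it can help to ask a given SQLite build which options it was compiled with. The sketch below does that from Python using `PRAGMA compile_options`, which is an assumption on my part relative to this thread: the pragma was added to SQLite after these messages were written, it reports option names without the `SQLITE_` prefix, and a build may omit the diagnostics entirely, in which case both checks simply report False. The helper names are made up for illustration.

```python
import sqlite3

def sqlite_compile_options(conn):
    """Return the set of compile-time options this SQLite build reports."""
    return {row[0] for row in conn.execute("PRAGMA compile_options")}

def built_with(conn, flag):
    """True if `flag` (given without the SQLITE_ prefix) is among the build options."""
    options = sqlite_compile_options(conn)
    # Some options are reported as NAME=value, so accept either form.
    return any(opt == flag or opt.startswith(flag + "=") for opt in options)

conn = sqlite3.connect(":memory:")
has_fts3 = built_with(conn, "ENABLE_FTS3")
has_icu = built_with(conn, "ENABLE_ICU")
print("FTS3 compiled in:", has_fts3)
print("ICU compiled in:", has_icu)
```

The output depends entirely on how the local SQLite library was built, so treat a False result as "not reported" rather than proof that the feature is absent.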