Re: RealSQLDatabase text index

Guyren Howe Thu, 30 Mar 2006 23:37:02 -0800

On Mar 31, 2006, at 1:06 AM, Marco Bambini wrote:

Unfortunately that does not make any difference if the search is made
using 'LIKE' because SQLite will ignore the index and will perform a
full table scan. See 5.5.1 at:
http://web.utk.edu/~jplyon/sqlite/SQLite_optimization_FAQ.html

The document refers to version 2.8.6 (10 September 2003), a lot ofchanges occurs in sqlite since that version.

Anyway, sqlite doesn't officially supports full text index, but aguy wrote a small full text search engine for sqlite3 onsourceforge.net.It is C/C++ code and I am not sure you can do the same things withthe REALSQLDatabase, even if I am quite sure you can recreate thesame algorithms with RB and SQLite3ProfessionalPlugin.


This is the description of his work and a link to the project:

Basic ideas are as follow:
--------------------------
* use another sqlite3 file for storing full text index

  information
* store everything into sqlite3 (this is not the
  fastest strategy in particular for inverted index)
* don't care too much about disk space constrains
  (ft3 use roughly 8x more space than initial datas)
* be resonnably efficient (index the 500k small
  documents in the english part of dmoz.org in 2H on
  my desktop PC, a search with frequent words take
  around 3/5 seconds on 1 IDE disk whitout cache, and
  less than 0,5 second if data in cache)

Support basic things:
---------------------
* for words: stemming, metaphone, stopwords,
  dictionnary, some statistics
* for scores: TFIDF, proximity
* special parsing for urls
* special parsing for topics (If you have some)
* web classical syntax for queries (support ™±/""/:)
* make it easy to search from a php module
* configuration stored inside sqlite3
* language detection (currently unplugged)


<snip>

You could certainly do this in REALbasic. A useful, free resource forthe dictionary etc would be WordNet <http://wordnet.princeton.edu/>.

But you're getting into some moderately complex stuff here. See, forexample: <http://citeseer.ist.psu.edu/context/20836/0> to see thatthis is an area of active research for computer scientists. There arestandard approaches, though. Typical is to index all the substringsof a particular length of each word, along with other informationsuch as in what order they occur. As you can imagine, the index canwind up being many times the size of the original data.

Depending on your project, a compelling alternative might be Lucene:<http://lucene.apache.org/java/docs/>. Lucene is very, very good atthis kind of thing.

OTOH, I'd love to see someone put together an all-REALbasic solutionfor this. Extra bonus points if you implement an index that lets medo grep searches. :-)


Regards,

Guyren G Howe
guyren-at-relevantlogic.com
http://relevantlogic.com

REALbasic, PHP, Python programming
PostgreSQL, MySQL database design and consulting
Technical writing and training


_______________________________________________
Unsubscribe or switch delivery mode:
<http://www.realsoftware.com/support/listmanager/>

Search the archives of this list here:
<http://support.realsoftware.com/listarchives/lists.html>

Re: RealSQLDatabase text index

Reply via email to