Re: [lazarus] Somewhat OT: The massive db-less search

Vincent Snijders Tue, 28 Feb 2006 08:07:44 -0800

A.J. Venter wrote:

On Tuesday 28 February 2006 17:24, Vincent Snijders wrote:
If java is an option for you:
http://lucene.apache.org/java/docs/

If not, maybe you can port it to fpc.

We use this (the .NET port) at work to index all publications of Statistics
Netherlands. Searching is fast.

Thanks, I am looking now. There is of course a nice catch: most search engines
do word-list indexing, which is FINE for web pages, but NOT for searching
12000 books, as just about every search would match nearly every book - a book
is MUCH more data than a web page. So literally the only "in data" search
that would give more or less useful results is a full-sentence search - e.g.
ALL the words you entered IN THE ORDER you entered them DIRECTLY juxtaposed -
easier in one sense, since a substring search will either find an exact match
or none at all, but harder in that word-list indexing simply will not work.


I think lucene supports phrase queries.
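
As a rough illustration of why a word-list index can still answer exact-phrase searches: if the index records the position of each word, not just which books contain it, a phrase matches only where the words occur consecutively. The sketch below is a toy in Python and not Lucene's actual implementation; the names `build_index` and `phrase_search` and the sample texts are made up for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase and split into words; no stemming, apostrophes kept."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_index(books):
    """Map word -> {book_id: [positions]} (a positional inverted index)."""
    index = defaultdict(lambda: defaultdict(list))
    for book_id, text in books.items():
        for pos, word in enumerate(tokenize(text)):
            index[word][book_id].append(pos)
    return index


def phrase_search(index, phrase):
    """Return ids of books that contain the words of `phrase` consecutively."""
    words = tokenize(phrase)
    if not words:
        return set()
    # Candidate books must contain every word of the phrase somewhere.
    candidates = set(index.get(words[0], {}))
    for word in words[1:]:
        candidates &= set(index.get(word, {}))
    hits = set()
    for book_id in candidates:
        rest = [set(index[w][book_id]) for w in words[1:]]
        # A start position matches if word i+1 sits at start + i + 1 for all i.
        if any(all(start + i + 1 in positions
                   for i, positions in enumerate(rest))
               for start in index[words[0]][book_id]):
            hits.add(book_id)
    return hits


# Hypothetical sample data: both texts contain all five words,
# but only one contains them in order and adjacent.
books = {
    "casablanca": "Rick said: Here's looking at you, kid.",
    "other": "The kid was looking at you. Here's the thing.",
}
index = build_index(books)
result = phrase_search(index, "Here's looking at you kid")
```

Here a plain keyword match would return both books, while the positional check narrows the result to the one with the exact phrase.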

Looking at things like lucene and egothor it seems that they actually want to
search the files themselves... all good and well except for a catch - all the
files are gz compressed. openbook has on-demand decompression built in, so
users don't even need to know about it; the file just appears to open from a
user's PoV.
Now this is not to say that using the indexes from such a search index will
not work - I can index on the uncompressed copy and then just use the data -
but somehow I just don't see keyword-based searching as being truly useful
here; the data is just too different. Most large document warehouses have
fairly diverse data in each document, but this is a disk full of books - most
of them fiction. In other words, the data you are talking about here is
several megabytes per file, highly repetitive (in computing terms) and not
very diverse (again in computing terms). A character name will probably get
you only a few books, but a search like "Here's looking at you kid" is
supposed to get pretty much only Casablanca, not every book that ever used the
words looking and kid (which are the ones in that phrase which typical keyword
searches would consider uncommon).
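
The compression need not get in the way of indexing or searching: the text can be decompressed on the fly and handed to whatever indexer is used, without writing an uncompressed copy to disk. A minimal Python sketch using the standard `gzip` module follows; the function names and the assumption that the books are UTF-8 plain text are illustrative guesses, not details of openbook.

```python
import gzip


def read_gz_text(path, encoding="utf-8"):
    """Return the decompressed text of a .gz file without unpacking it on disk."""
    with gzip.open(path, "rt", encoding=encoding, errors="replace") as handle:
        return handle.read()


def contains_phrase(path, phrase):
    """Brute-force check: exact, case-insensitive substring match in one book."""
    return phrase.lower() in read_gz_text(path).lower()
```

Running `contains_phrase` over all 12000 files would be the plain substring search described above; feeding `read_gz_text` into an indexer instead does the decompression once, at indexing time.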


Lucene should give you the book you are searching for.

In Lucene terms, a book is a Document with some properties. One of them is content
(or text, you are free to choose). Another one is path or ISBN or whatever property
you want to use to identify your book (we use a GUID to identify our datacubes,
i.e. publications). These are not indexed, but are returned with the hits.

You search for the phrase "Here's looking at you kid" in the content property; you
might even want to turn off stemming.

Lucene returns hits, the search results, which are documents. Then you get the path
or whatever extra property you added. You can use that to show the result to the user.
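
The document model described above can be mimicked in a few lines to see the flow from query to result. This is a toy Python sketch and not Lucene's API: `search_phrase`, the dictionary layout, and the sample paths are hypothetical, and the phrase match is a plain scan over unstemmed tokens instead of a real index lookup.

```python
import re


def tokenize(text):
    """Lowercase word split with no stemming, so 'look' never matches 'looking'."""
    return re.findall(r"[a-z0-9']+", text.lower())


def search_phrase(books, phrase):
    """Return the stored 'path' of every book whose 'content' has the exact phrase."""
    wanted = tokenize(phrase)
    hits = []
    for book in books:
        words = tokenize(book["content"])
        # Slide a window of len(wanted) words over the book's content.
        if wanted and any(words[i:i + len(wanted)] == wanted
                          for i in range(len(words) - len(wanted) + 1)):
            hits.append(book["path"])  # 'path' is only returned, never searched
    return hits


# Hypothetical documents: 'content' is the searched property,
# 'path' is the identifying property handed back with each hit.
books = [
    {"path": "/books/casablanca.txt.gz",
     "content": "Rick: Here's looking at you, kid."},
    {"path": "/books/other.txt.gz",
     "content": "The kid kept looking at you."},
]
hits = search_phrase(books, "Here's looking at you kid")
```

With stemming left off, a query such as "look at you" finds nothing here, which is the behaviour wanted for exact quotes.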

So IMHO, it is doable, but you would have to test how large the indices will be
and what the performance is.


Vincent.

_________________________________________________________________
    To unsubscribe: mail [EMAIL PROTECTED] with
               "unsubscribe" as the Subject
  archives at http://www.lazarus.freepascal.org/mailarchives
