A.J. Venter wrote:
On Tuesday 28 February 2006 17:24, Vincent Snijders wrote:

If Java is an option for you:
http://lucene.apache.org/java/docs/

If not, maybe you can port it to FPC.

We use this (the .NET port) at work to index all publications of Statistics
Netherlands. Searching is fast.


Thanks, I am looking now. There is of course a nice catch: most search engines do word-list indexing, which is FINE for web pages but NOT for searching 12000 books, as just about every search would match nearly every book; a book is MUCH more data than a web page. So literally the only "in data" search that would give more or less useful results is a full-sentence search, i.e. ALL the words you entered, IN THE ORDER you entered them, DIRECTLY juxtaposed. That is easier in one sense, since a substring search will either find an exact match or none at all, but harder in that word-list indexing simply will not work.

I think Lucene supports phrase queries.
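
To illustrate why a phrase query does not have to degrade into "every book that contains these words", here is a minimal sketch of a positional word index. It is not Lucene's API or its on-disk format; the function names (tokenize, build_index, phrase_search) and the sample texts are made up for the example. The idea is only that the index remembers where each word occurs, so adjacency and order can be checked without rescanning the text.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Lowercase and keep runs of letters, digits and apostrophes as words.
    return re.findall(r"[a-z0-9']+", text.lower())


def build_index(docs):
    """Map each word to {doc_id: [positions where the word occurs]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(tokenize(text)):
            index[word][doc_id].append(pos)
    return index


def phrase_search(index, phrase):
    """Return the ids of documents containing the words of `phrase`
    consecutively and in order."""
    words = tokenize(phrase)
    if not words:
        return set()
    hits = set()
    # Every occurrence of the first word is a candidate start position.
    for doc_id, starts in index.get(words[0], {}).items():
        for start in starts:
            # The i-th following word must sit exactly i positions later.
            if all(
                start + i in index.get(word, {}).get(doc_id, ())
                for i, word in enumerate(words[1:], 1)
            ):
                hits.add(doc_id)
                break
    return hits


# Hypothetical sample data, only to show the difference to keyword matching.
docs = {
    "casablanca": "Here's looking at you, kid.",
    "other": "The kid was looking at you here.",
}
index = build_index(docs)
result = phrase_search(index, "Here's looking at you kid")
```

With this sample data only "casablanca" matches the full phrase, although both texts contain "looking" and "kid".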


Looking at things like Lucene and egothor, it seems that they actually want to search the files themselves... all good and well except for a catch - all the files are gz compressed. openbook has on-demand decompression built in, so users don't even need to know about it; the file just appears to open from a user's PoV.

Now this is not to say that using the indexes from such a search engine will not work - I can index the uncompressed copy and then just use the data - but somehow I just don't see keyword-based searching as being truly useful here; the data is just too different. Most large document warehouses have fairly diverse data in each document, but this is a disk full of books, most of them fiction. In other words, the data you are talking about here is several megabytes per file, highly repetitive (in computing terms) and not very diverse (again in computing terms). A character name will probably get you only a few books, but a search like "Here's looking at you kid" is supposed to get pretty much only Casablanca, not every book that ever used the words "looking" and "kid" (which are the ones in that phrase that typical keyword searches would consider uncommon).
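
For comparison, the plain "substring search over the decompressed text" approach described above could look roughly like the sketch below. It assumes the books are gzip-compressed text files in a known encoding (UTF-8 is a guess here); the function names and the whitespace/punctuation normalization are choices made for the example, not anything openbook is known to do. It reads each book fully on every search, so it serves as a baseline rather than a substitute for an index.

```python
import gzip
import re


def normalize(text):
    # Lowercase, drop punctuation and collapse whitespace so that line
    # breaks or commas inside a sentence do not prevent a match.
    # The padding spaces make sure only whole words match.
    words = re.findall(r"[a-z0-9']+", text.lower())
    return " " + " ".join(words) + " "


def books_containing(paths, phrase, encoding="utf-8"):
    """Return the paths of gzip-compressed books that contain `phrase`
    as consecutive whole words."""
    needle = normalize(phrase)
    if not needle.strip():
        return []
    matches = []
    for path in paths:
        # gzip.open in text mode decompresses on the fly while reading.
        with gzip.open(path, "rt", encoding=encoding, errors="replace") as book:
            if needle in normalize(book.read()):
                matches.append(path)
    return matches
```

The same gzip.open call could equally feed an indexer, so the compressed files would not need to be unpacked on disk first.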

Lucene should give you the book you are searching for.

In Lucene terms, a book is a Document with some properties. One of them is the content (or text; you are free to choose the name). Another one is the path or ISBN or whatever property you want to use to identify your book (we use a GUID to identify our data cubes, i.e. publications). These identifying properties are not indexed, but they are returned with the hits.

You search for the phrase "Here's looking at you kid" in the content property. You might even want to turn off stemming.

Lucene returns hits, the search results, which are documents. Then you get the path or whatever extra property you added. You can use that to show the result to the user.
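
The document model described above (content that is indexed, plus identifying properties that are only stored and handed back with each hit) can be sketched as follows. This is a toy model, not Lucene code: the BookIndex class, its method names and the example paths are all hypothetical.

```python
import re
from collections import defaultdict

WORD = re.compile(r"[a-z0-9']+")


class BookIndex:
    def __init__(self):
        # Document number -> stored properties (path, ISBN, ...).
        # These are never searched, only returned with the hits.
        self.stored = []
        # Word -> {document number: set of word positions}.
        self.postings = defaultdict(dict)

    def add(self, content, **stored_fields):
        """Index `content` and remember `stored_fields` for the results."""
        doc_no = len(self.stored)
        self.stored.append(stored_fields)
        for pos, word in enumerate(WORD.findall(content.lower())):
            self.postings[word].setdefault(doc_no, set()).add(pos)

    def search_phrase(self, phrase):
        """Return the stored properties of every document that contains
        the phrase as consecutive words, in order."""
        words = WORD.findall(phrase.lower())
        if not words:
            return []
        hits = []
        for doc_no in sorted(self.postings.get(words[0], {})):
            # Possible start positions; narrowed down word by word.
            starts = set(self.postings[words[0]][doc_no])
            for offset, word in enumerate(words[1:], 1):
                positions = self.postings.get(word, {}).get(doc_no, set())
                starts &= {p - offset for p in positions}
                if not starts:
                    break
            if starts:
                hits.append(self.stored[doc_no])
        return hits


# Hypothetical usage: the path is what you would show to the user.
books = BookIndex()
books.add("Here's looking at you, kid.", path="books/casablanca.txt.gz")
books.add("The kid kept looking.", path="books/other.txt.gz")
found = books.search_phrase("Here's looking at you kid")
```

Each hit is just the dictionary of stored properties, so the caller can open the matching file by its path without the index holding a copy of the text.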

So IMHO it is doable, but you would have to test how large the indices will be and what the performance is.

Vincent.
