A.J. Venter wrote:
On Tuesday 28 February 2006 17:24, Vincent Snijders wrote:
If java is an option for you:
http://lucene.apache.org/java/docs/
If not, maybe you can port it to fpc.
We use this (the .NET port) at work to index all publications of Statistics
Netherlands. Searching is fast.
Thanks, I am looking now. There is of course a nice catch: most search engines
do word-list indexing, which is FINE for web pages but NOT for searching
12000 books, as just about every search would match nearly every book. A book
is MUCH more data than a web page. So literally the only "in data" search
that would give more or less useful results is a full-sentence search, i.e.
ALL the words you entered IN THE ORDER you entered them, DIRECTLY juxtaposed.
That is easier in one sense, since a substring search will either find an exact
match or none at all, but harder in that word-list indexing simply will not work.
I think Lucene supports phrase queries.
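
To make the phrase-query idea concrete, here is a minimal sketch in Python of how a word-list index can still answer "all the words, in order, directly juxtaposed" if it also records word positions. This is only an illustration of the general technique, not Lucene's actual implementation or API; the function names and the two sample texts are made up for the example.

```python
import re
from collections import defaultdict

def tokenize(text):
    # Lowercase, then keep runs of letters, digits and apostrophes as words.
    return re.findall(r"[a-z0-9']+", text.lower())

def build_index(docs):
    # Positional index: term -> {doc_id -> [word positions]}.
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def keyword_search(index, query):
    # Plain keyword search: documents containing every word, in any order.
    terms = tokenize(query)
    if not terms:
        return set()
    result = None
    for term in terms:
        found = set(index.get(term, {}))
        result = found if result is None else result & found
    return result

def phrase_search(index, phrase):
    # Phrase search: the words must appear at consecutive positions.
    terms = tokenize(phrase)
    hits = set()
    for doc_id in keyword_search(index, phrase):
        later = [set(index[t][doc_id]) for t in terms[1:]]
        starts = index[terms[0]][doc_id]
        if any(all(s + i + 1 in p for i, p in enumerate(later)) for s in starts):
            hits.add(doc_id)
    return hits

# Hypothetical sample data, just for the illustration.
docs = {
    "casablanca.txt": "Here's looking at you, kid.",
    "other.txt": "The kid was looking at you. Here's the thing.",
}
index = build_index(docs)
```

With this sample data, keyword_search(index, "looking kid") matches both texts, while phrase_search(index, "Here's looking at you kid") matches only casablanca.txt. The price is a larger index, since every occurrence of every word is recorded.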
Looking at things like Lucene and egothor, it seems that they actually want to
search the files themselves. All good and well, except for a catch: all the
files are gz compressed. openbook has on-demand decompression built in, so
users don't even need to know about it; the file just appears to open from a
user's PoV.
Now this is not to say that using the indexes from such a search engine will
not work - I can index the uncompressed copy and then just use the data -
but somehow I just don't see keyword-based searching as being truly useful
here; the data is just too different. Most large document warehouses have
fairly diverse data in each document, but this is a disk full of books, most
of them fiction. In other words, the data you are talking about here is
several megabytes per file, highly repetitive (in computing terms) and not
very diverse (again in computing terms).
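
On the compression point: an indexer generally only needs the text, not an uncompressed file on disk, so the books could be decompressed as a stream while they are being indexed. A minimal sketch using Python's standard gzip module follows; the function name iter_tokens and the UTF-8 default are assumptions for the example, and the real files may well use another encoding.

```python
import gzip
import re

def iter_tokens(gz_path, encoding="utf-8"):
    """Yield (position, word) pairs from a gzip-compressed text file.

    The file is decompressed line by line while reading, so no
    uncompressed copy is ever written to disk.
    """
    pos = 0
    # "rt" opens the compressed file as text; undecodable bytes are replaced.
    with gzip.open(gz_path, "rt", encoding=encoding, errors="replace") as fh:
        for line in fh:
            for word in re.findall(r"[a-z0-9']+", line.lower()):
                yield pos, word
                pos += 1
```

The (position, word) pairs can be fed straight into whatever index is being built, and the position counter keeps running across line breaks so a phrase that wraps onto the next line is still contiguous.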
A character name will probably get you only a few books, but a search like
"Here's looking at you kid" is supposed to get pretty much only Casablanca,
not every book that ever used the words "looking" and "kid" (which are the ones
in that phrase which typical keyword searches would consider uncommon).
Lucene should give you the book you are searching for.
In Lucene terms, a book is a Document with some properties. One of them is content
(or text; you are free to choose the name). Another is the path, ISBN or whatever property
you want to use to identify your book (we use a guid to identify our data
cubes=publications). These are not indexed, but returned with the hits.
You search for the phrase "Here's looking at you kid" in the content property. You
might even want to turn off stemming.
Lucene returns hits, the search results, which are documents. Then you get the path
or whatever extra property you added. You can use that to show the result to the user.
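
The split between a searchable content property and stored identifying properties can be sketched as a toy model in Python. ToyIndex is a hypothetical stand-in written for this example, not the Lucene API, and the paths are invented. It matches on whitespace-separated words only and ignores punctuation handling and stemming.

```python
from dataclasses import dataclass, field

@dataclass
class ToyIndex:
    # Hypothetical stand-in for a search index; this is not the Lucene API.
    _stored: list = field(default_factory=list)  # properties returned with hits
    _tokens: list = field(default_factory=list)  # searchable words per document

    def add(self, content, **stored):
        # 'content' is searchable; 'stored' properties are only kept and returned.
        self._stored.append(dict(stored))
        self._tokens.append(content.lower().split())

    def search_phrase(self, phrase):
        # Return the stored properties of every document containing the phrase.
        words = phrase.lower().split()
        n = len(words)
        if n == 0:
            return []
        hits = []
        for stored, tokens in zip(self._stored, self._tokens):
            if any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1)):
                hits.append(stored)
        return hits

# Invented example paths, for illustration only.
books = ToyIndex()
books.add("Here's looking at you kid", path="books/casablanca.txt.gz")
books.add("The kid is looking at you", path="books/other.txt.gz")
hits = books.search_phrase("looking at you kid")
```

Here hits contains only the stored properties of the first book, so the application can use the path to open the file and show the result to the user.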
So IMHO, it is doable, but you would have to test how large the indices will be
and what the performance is.
Vincent.
_________________________________________________________________
To unsubscribe: mail [EMAIL PROTECTED] with
"unsubscribe" as the Subject
archives at http://www.lazarus.freepascal.org/mailarchives