Re: [lazarus] Somewhat OT: The massive db-less search

A.J. Venter Wed, 01 Mar 2006 22:54:00 -0800

Actually I found a nicer solution :)
I integrated with wikiquote.org (something I came up with while discussing
the problem with you guys).
Selecting a phrase search hides the book, submits the search to wikiquote,
grabs the results, pre-parses them and displays the list of matches in an
iphtml panel. Clicking on any of the links in the list then submits the link
value as a search back into my index, in BOTH the artist and title fields.
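
To make the flow above concrete, here is a minimal sketch in Python rather than Pascal. It is not the code from my application: the sample HTML, the catalogue entries and the names extract_link_texts and search_catalogue are all made up for illustration, and the real results page will look different.

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Collect the visible text of every <a> element in a results page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._in_link = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link = True
            self._buf = []

    def handle_data(self, data):
        if self._in_link:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            text = "".join(self._buf).strip()
            if text:
                self.links.append(text)
            self._in_link = False


def extract_link_texts(html):
    """Pre-parse step: reduce a results page to a list of link texts."""
    parser = LinkTextParser()
    parser.feed(html)
    parser.close()
    return parser.links


def search_catalogue(term, catalogue):
    """Return entries whose artist OR title contains the term (case-insensitive)."""
    needle = term.casefold()
    return [
        entry for entry in catalogue
        if needle in entry["artist"].casefold() or needle in entry["title"].casefold()
    ]


# Hypothetical stand-ins for a fetched results page and the local book list.
SAMPLE_RESULTS_HTML = (
    '<ul><li><a href="/wiki/William_Shakespeare">William Shakespeare</a></li></ul>'
)
CATALOGUE = [
    {"artist": "William Shakespeare", "title": "Shakespeare's First Folio"},
    {"artist": "Jane Austen", "title": "Emma"},
]

if __name__ == "__main__":
    # Clicking a link corresponds to searching the catalogue with its text.
    for link in extract_link_texts(SAMPLE_RESULTS_HTML):
        print(link, "->", search_catalogue(link, CATALOGUE))
```

Matching on both fields is what lets an author name from the results list pull up a collected-works entry whose title never mentions the play itself.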


So if you search for "Wherefore art thou Romeo" one of the results will be
"William Shakespeare". Clicking the link brings up the entry "Shakespeare's
First Folio" by William Shakespeare from my list, which happens to contain
among its 35 plays "Romeo and Juliet".

True, it requires an internet connection, but what it doesn't require is any
real CPU/disk usage, and it has the power of an index maintained by
thousands of volunteers :) These days, programs being able to integrate
cleanly with online information sources is considered a good thing, right?
It was for this that I needed the connectivity check, so that I could disable
the quote-search checkbox if there was no internet.
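
For anyone wanting something similar, a connectivity check of this kind can be sketched as a plain TCP connection attempt with a short timeout. This is a Python illustration only, not what I actually wrote in Lazarus; the function name can_reach, the default host and port, and the timeout are assumptions.

```python
import socket


def can_reach(host="wikiquote.org", port=80, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    DNS failures, refused connections and timeouts all raise OSError
    (or a subclass of it), so they are all reported as "not reachable".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # The result would drive the enabled state of the quote-search checkbox.
    quote_search_enabled = can_reach()
    print("quote search enabled:", quote_search_enabled)
```

A successful connection only shows the host is reachable at that moment, so the search itself still needs to handle a request that fails later.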

Ciao
A.J.

On Tuesday 28 February 2006 17:51, William Cairns wrote:
> Have you considered a two pass approach?
>
> i.e. do the first regular search using "Looking" and "Kid" to get a list of
> the books that "might" have the full phrase in them. Then only decompress and
> search in those books for the full phrase.
>
> -----Original Message-----
> From: A.J. Venter [mailto:[EMAIL PROTECTED]
> Sent: 28 February 2006 17:44 PM
> To: lazarus@miraclec.com
> Subject: Re: [lazarus] Somewhat OT: The massive db-less search
>
> On Tuesday 28 February 2006 17:24, Vincent Snijders wrote:
> > If java is an option for you:
> > http://lucene.apache.org/java/docs/
> >
> > If not, maybe you can port it to fpc.
> >
> > We use this (the .NET port) at work to index all publications of
> > Statistic Netherlands. Searching is fast.
>
> Thanks, I am looking now. There is of course a nice catch: most search
> engines do word-list indexing, which is FINE for web-pages, but NOT for
> searching 12000 books, as just about every search would match nearly every
> book - a book is MUCH more data than a web-page. So literally the only "in
> data" search that would give more or less useful results is a full-sentence
> search, i.e. ALL the words you entered IN THE ORDER you entered them,
> DIRECTLY juxtaposed. That is easier in one sense, since a substring search
> will either find an exact match or none at all, but harder in that wordlist
> indexing simply will not work.
>
> Looking at things like lucene and egothor it seems that they actually want
> to search the files themselves... all good and well except for a catch:
> all the files are gz compressed, and openbook has on-demand decompression
> built in, so users don't even need to know about it; the file just appears
> to open from a user's PoV.
>
> Now this is not to say that using the indexes from such a search engine will
> not work - I can index on the uncompressed copy and then just use the data
> - but somehow I just don't see keyword based searching as being truly
> useful here, the data is just too different. Most large document
> warehouses have fairly diverse data in each document, but this is a disk
> full of books, most of them fiction. In other words the data you are
> talking about here is several megabytes per file, highly repetitive (in
> computing terms) and not very diverse (again in computing terms).
> A character name will probably get you only a few books, but a search like
> "Here's looking at you kid" is supposed to get pretty much only
> Casablanca, not every book that ever used the words looking and kid (which
> are the ones in that phrase which typical keyword searches would consider
> uncommon).
>
> Frankly I am ready to tell my boss it cannot be done. Doing per-file
> searching on the DVD is likely to take a few DAYS per result, and I just
> don't think you can DO this kind of search from metadata.
> Well, maybe if I could stick wikiquote in there and then compare the
> results to my available book list - of course wikiquote is about 20GB and
> needs a webserver etc., so it cannot exactly run from a DVD.
>
> Basically unless somebody already knows how to do this, I am happy to admit
> I am not smart enough to solve THIS one :)
>
> A.J.
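
For the archives, the two-pass approach William suggested in the quoted message could be sketched roughly as below. This is a Python illustration under assumptions, not something I built: the word index is taken to be a prebuilt mapping from word to book ids, and the names candidate_books, phrase_search and write_demo_library, along with the three tiny demo "books", are invented for the example.

```python
import gzip
import os
import tempfile


def candidate_books(words, word_index):
    """First pass: ids of books that are listed under every given word."""
    sets = [word_index.get(word.lower(), set()) for word in words]
    return set.intersection(*sets) if sets else set()


def phrase_search(phrase, words, word_index, book_paths):
    """Second pass: decompress only the candidates and look for the exact phrase."""
    needle = phrase.lower()
    hits = []
    for book_id in sorted(candidate_books(words, word_index)):
        with gzip.open(book_paths[book_id], "rt", encoding="utf-8") as fh:
            if needle in fh.read().lower():
                hits.append(book_id)
    return hits


def write_demo_library(directory):
    """Write three tiny gz-compressed 'books' and return (word_index, paths)."""
    texts = {
        "casablanca": "Rick: Here's looking at you kid.",
        "roadtrip": "The kid kept looking at the road.",
        "cookbook": "Whisk the eggs and fold in the flour.",
    }
    paths = {}
    for book_id, text in texts.items():
        path = os.path.join(directory, book_id + ".txt.gz")
        with gzip.open(path, "wt", encoding="utf-8") as fh:
            fh.write(text)
        paths[book_id] = path
    # Hypothetical prebuilt index covering only the two rarer words.
    word_index = {
        "looking": {"casablanca", "roadtrip"},
        "kid": {"casablanca", "roadtrip"},
    }
    return word_index, paths


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        index, paths = write_demo_library(tmp)
        print(phrase_search("Here's looking at you kid", ["looking", "kid"], index, paths))
```

With 12000 full-length books the first pass may still leave a very large candidate set for common words, which is the weakness described in the quoted message.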

-- 
"80% Of a hardware engineer's job is application of the uncertainty principle.
80% of a software engineer's job is pretending this isn't so."
A.J. Venter
Chief Software Architect
OpenLab International
http://www.getopenlab.com       | +27 82 726 5103 (South Africa)
http://www.silentcoder.co.za    | +55 118 162 2079 (Brazil)

_________________________________________________________________
     To unsubscribe: mail [EMAIL PROTECTED] with
                "unsubscribe" as the Subject
   archives at http://www.lazarus.freepascal.org/mailarchives
