Re: [lazarus] Somewhat OT: The massive db-less search

2006-03-01 Thread A.J. Venter
Actually I found a nicer solution :)
I integrated with wikiquote.org (which was something I came up with while 
discussing the problem with you guys).
Selecting a phrase search hides the book and submits the search to wikiquote, 
grabs the results, pre-parses them and displays the list of matches in an 
IpHtml panel. Clicking any of the links in the list then submits the link 
value as a search back into my index, in BOTH author and title fields.

So if you search for "Wherefore art thou, Romeo" one of the results will be 
"William Shakespeare"; clicking the link brings up the entry "Shakespeare's 
First Folio" by William Shakespeare from my list, which happens to contain 
among its 35 plays "Romeo and Juliet".

True, it requires an internet connection, but what it doesn't require is any 
real CPU/disk usage - and it has the power of an index maintained by 
thousands of volunteers :). These days, programs being able to integrate 
cleanly with online information sources is considered a good thing, right? 
It was for this that I needed the connectivity check, so that I could disable 
the quote-search checkbox if there was no internet.
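
A rough console sketch of that round trip, for reference. It uses the modern
FPC fphttpclient unit (which did not exist back then - Synapse or lNet would
play the same role), and the search URL, the link scraping and
SearchLocalIndex are illustrative assumptions, not the actual OpenBook code:

program quotesearch;
{ Sketch only: fetch a wikiquote search page, list the /wiki/ links, and
  feed one back into a stand-in for OpenBook's author/title search. }

{$mode objfpc}{$H+}

uses
  SysUtils, Classes, StrUtils, fphttpclient;

{ Stand-in for OpenBook's existing author/title index search. }
procedure SearchLocalIndex(const Term: string);
begin
  WriteLn('Would now search author AND title fields for: ', Term);
end;

{ Doubles as the connectivity check: if this fails, the quote-search
  checkbox gets disabled. }
function FetchPage(const Url: string; out Body: string): Boolean;
begin
  Result := True;
  try
    Body := TFPHTTPClient.SimpleGet(Url);
  except
    Body := '';
    Result := False;
  end;
end;

{ Very crude link scraping; a real version would parse the HTML properly
  before showing the list in the IpHtml panel. }
procedure ExtractWikiLinks(const Html: string; Links: TStrings);
var
  P, Q: Integer;
begin
  P := Pos('/wiki/', Html);
  while P > 0 do
  begin
    Q := P + 6;
    while (Q <= Length(Html)) and not (Html[Q] in ['"', '''', '#', '?', '&']) do
      Inc(Q);
    Links.Add(Copy(Html, P + 6, Q - (P + 6)));
    P := PosEx('/wiki/', Html, Q);
  end;
end;

var
  Page: string;
  Links: TStringList;
begin
  if not FetchPage('http://en.wikiquote.org/wiki/Special:Search?search=' +
                   'Wherefore+art+thou+Romeo', Page) then
  begin
    WriteLn('No connectivity - quote search would be greyed out.');
    Halt(1);
  end;
  Links := TStringList.Create;
  try
    Links.Sorted := True;
    Links.Duplicates := dupIgnore;
    ExtractWikiLinks(Page, Links);
    WriteLn(Links.Text);
    { Clicking a result in OpenBook would then do something like: }
    if Links.Count > 0 then
      SearchLocalIndex(StringReplace(Links[0], '_', ' ', [rfReplaceAll]));
  finally
    Links.Free;
  end;
end.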

Ciao
A.J.
On Tuesday 28 February 2006 17:51, William Cairns wrote:
 Have you considered a two pass approach?

 i.e. do the first regular search using "looking" and "kid" to get a list of
 the books that might have the full phrase in them. Then only decompress and
 search those books for the full phrase.
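
A minimal sketch of that two-pass idea in Free Pascal - CandidateBooks and
LoadBookText are hypothetical stand-ins for the word index and for the
existing on-demand gz decompression, not real OpenBook code:

program twopass;
{ Sketch only: pass 1 shortlists books via a word index, pass 2 decompresses
  just the shortlist and confirms the exact phrase. }

{$mode objfpc}{$H+}

uses
  SysUtils, Classes;

{ Pass 1 stand-in: would ask the word index for books containing ALL words. }
procedure CandidateBooks(const Words: array of string; Books: TStrings);
begin
  Books.Add('casablanca_script.txt.gz');   { pretend the index returned this }
end;

{ Stand-in for the existing on-demand decompression. }
function LoadBookText(const FileName: string): string;
begin
  Result := 'Ilsa: ... Rick: Here''s looking at you, kid.';
end;

{ Pass 2: confirm the exact phrase, but only in the shortlisted books. }
procedure ConfirmPhrase(const Phrase: string; Candidates, Hits: TStrings);
var
  i: Integer;
begin
  for i := 0 to Candidates.Count - 1 do
    if Pos(LowerCase(Phrase), LowerCase(LoadBookText(Candidates[i]))) > 0 then
      Hits.Add(Candidates[i]);
end;

var
  Candidates, Hits: TStringList;
begin
  Candidates := TStringList.Create;
  Hits := TStringList.Create;
  try
    CandidateBooks(['looking', 'kid'], Candidates);
    ConfirmPhrase('Here''s looking at you, kid', Candidates, Hits);
    WriteLn(Hits.Text);
  finally
    Hits.Free;
    Candidates.Free;
  end;
end.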

 -Original Message-
 From: A.J. Venter [mailto:[EMAIL PROTECTED]
 Sent: 28 February 2006 17:44 PM
 To: lazarus@miraclec.com
 Subject: Re: [lazarus] Somewhat OT: The massive db-less search

 On Tuesday 28 February 2006 17:24, Vincent Snijders wrote:
  If java is an option for you:
  http://lucene.apache.org/java/docs/
 
  If not, maybe you can port it to fpc.
 
  We use this (the .NET port) at work to index all publications of
  Statistics Netherlands. Searching is fast.

 Thanks, I am looking now. There is of course a nice catch: most search
 engines do word-list indexing, which is FINE for web pages, but NOT for
 searching 12000 books, as just about every search would match nearly every
 book - a book is MUCH more data than a web page. So literally the only
 in-data search that would give more or less useful results is full-sentence
 searches - i.e. ALL the words you entered, IN THE ORDER you entered them,
 DIRECTLY juxtaposed - easier in one sense, since a substring search will
 either find an exact match or none at all, but harder in that word-list
 indexing simply will not work.

 Looking at things like Lucene and Egothor, it seems that they actually want
 to search the files themselves... all well and good except for a catch -
 all the files are gz compressed. OpenBook has on-demand decompression
 built in - so users don't even need to know about it; the file just appears
 to open from a user's PoV.

 Now this is not to say that using the indexes from such a search engine will
 not work - I can index on the uncompressed copy and then just use the data
 - but somehow I just don't see keyword-based searching as being truly
 useful here; the data is just too different. Most large document
 warehouses have fairly diverse data in each document, but this is a disk
 full of books - most of them fiction. In other words, the data you are
 talking about here is several megabytes per file, highly repetitive (in
 computing terms) and not very diverse (again in computing terms).
 A character name will probably get you only a few books, but a search like
 "Here's looking at you, kid" is supposed to get pretty much only
 Casablanca, not every book that ever used the words "looking" and "kid"
 (which are the ones in that phrase that typical keyword searches would
 consider uncommon).

 Frankly I am ready to tell my boss it cannot be done; doing per-file
 searching on the DVD is likely to take a few DAYS per result, and I just
 don't think you can DO this kind of search from metadata.
 Well, maybe if I could stick wikiquotes in there and then compare the
 results to my available book list - of course wikiquotes is about 20GB and
 needs a webserver etc. - so it cannot exactly run from a DVD.

 Basically, unless somebody already knows how to do this, I am happy to admit
 I am not smart enough to solve THIS one :)

 A.J.

-- 
80% Of a hardware engineer's job is application of the uncertainty principle.
80% of a software engineer's job is pretending this isn't so.
A.J. Venter
Chief Software Architect
OpenLab International
http://www.getopenlab.com   | +27 82 726 5103 (South Africa)
http://www.silentcoder.co.za| +55 118 162 2079 (Brazil)

_
 To unsubscribe: mail [EMAIL PROTECTED] with
unsubscribe as the Subject
   archives at http://www.lazarus.freepascal.org/mailarchives


[lazarus] Somewhat OT: The massive db-less search

2006-02-28 Thread A.J. Venter
Right, the final missing feature of OpenBook is to be able to search for 
phrases INSIDE the books. Now the logical way would be to just check each 
book and determine whether or not it contains a matching phrase...
except that there are 12000 of them.

For the author/title searching it's easy: I have a prebuilt index of files 
matching them to these details and I JUST search the index. I cannot rely on 
any kind of SQL-style database (except maybe SQLite or something else that can 
work without a server) since the program must be able to run from DVD.

So the question arises how best to do it; searching for a phrase in 12000 
books one by one will take FOREVER.
Ideally I need to somehow build a wordlist/phrase index and search that, 
something like what ht://Dig does, but I need to pass the results back to my 
Lazarus code in a format I can easily map back to the index (so I can 
display author/title information with the results). The last part is easy if 
the index contains a filename: I can just look up the filename in the 
author/title index.
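
That last lookup step could be as simple as this sketch - the file name
books.idx and the filename=author|title line format are only assumptions
about how such a DVD-resident index could look:

program titlelookup;
{ Sketch only: map a search hit (a filename) back to author and title
  using a plain name=value text index shipped next to the books. }

{$mode objfpc}{$H+}

uses
  SysUtils, Classes;

{ Each index line is assumed to look like:
    romeo_and_juliet.txt.gz=William Shakespeare|Romeo and Juliet }
function DescribeFile(Index: TStrings; const FileName: string): string;
var
  Entry: string;
  P: Integer;
begin
  Entry := Index.Values[FileName];
  if Entry = '' then
  begin
    Result := FileName + ' (not in index)';
    Exit;
  end;
  P := Pos('|', Entry);
  Result := Copy(Entry, P + 1, MaxInt) + ' by ' + Copy(Entry, 1, P - 1);
end;

var
  Idx: TStringList;
begin
  Idx := TStringList.Create;
  try
    Idx.LoadFromFile('books.idx');   { hypothetical index file on the DVD }
    WriteLn(DescribeFile(Idx, 'romeo_and_juliet.txt.gz'));
  finally
    Idx.Free;
  end;
end.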

So my question is: can anybody recommend a good tool for creating such a 
search index? It should either create it as some kind of easily parseable text 
format, or alternatively something like an SQLite database could work as 
well. I just reckoned that before I test a hundred apps, I would ask if anybody 
has any suggestions on which to try first.

One crucial element is that OpenBook is multiplatform, with both Windows and 
Linux versions, so it's vital that whatever method I use to do this search 
behind the scenes is ALSO multiplatform.

TIA
A.J.
-- 
80% Of a hardware engineer's job is application of the uncertainty principle.
80% of a software engineer's job is pretending this isn't so.
A.J. Venter
Chief Software Architect
OpenLab International
http://www.getopenlab.com   | +27 82 726 5103 (South Africa)
http://www.silentcoder.co.za| +55 118 162 2079 (Brazil)

_
 To unsubscribe: mail [EMAIL PROTECTED] with
unsubscribe as the Subject
   archives at http://www.lazarus.freepascal.org/mailarchives


Re: [lazarus] Somewhat OT: The massive db-less search

2006-02-28 Thread Michael Van Canneyt



On Tue, 28 Feb 2006, A.J. Venter wrote:


Right, the final missing feature of OpenBook is to be able to search for
phrases INSIDE the books. Now the logical way would be to just check each
book and determine whether or not it contains a matching phrase...
except that there are 12000 of them.

For the author/title searching it's easy: I have a prebuilt index of files
matching them to these details and I JUST search the index. I cannot rely on
any kind of SQL-style database (except maybe SQLite or something else that can
work without a server) since the program must be able to run from DVD.

So the question arises how best to do it; searching for a phrase in 12000
books one by one will take FOREVER.
Ideally I need to somehow build a wordlist/phrase index and search that,
something like what ht://Dig does, but I need to pass the results back to my
Lazarus code in a format I can easily map back to the index (so I can
display author/title information with the results). The last part is easy if
the index contains a filename: I can just look up the filename in the
author/title index.

So my question is: can anybody recommend a good tool for creating such a
search index? It should either create it as some kind of easily parseable text
format, or alternatively something like an SQLite database could work as
well. I just reckoned that before I test a hundred apps, I would ask if anybody
has any suggestions on which to try first.

One crucial element is that OpenBook is multiplatform, with both Windows and
Linux versions, so it's vital that whatever method I use to do this search
behind the scenes is ALSO multiplatform.


Michael Hess, the Lazarus website webmaster, has a tool called IDKSM, which
does exactly what you need, for HTML files. It comes with Delphi/Java code.
Only one problem: IDKSM is closed-source :/

But surely he can give you a hint on how to go about this.


Michael.

Re: [lazarus] Somewhat OT: The massive db-less search

2006-02-28 Thread Eduardo


Take a look at "Managing Gigabytes", a book which explains how to do 
a mixed database/compression algorithm.


HTH

_
To unsubscribe: mail [EMAIL PROTECTED] with
   unsubscribe as the Subject
  archives at http://www.lazarus.freepascal.org/mailarchives


Re: [lazarus] Somewhat OT: The massive db-less search

2006-02-28 Thread Vincent Snijders

If java is an option for you:
http://lucene.apache.org/java/docs/

If not, maybe you can port it to fpc.

We use this (the .NET port) at work to index all publications of Statistics 
Netherlands. Searching is fast.


Vincent.

_
To unsubscribe: mail [EMAIL PROTECTED] with
   unsubscribe as the Subject
  archives at http://www.lazarus.freepascal.org/mailarchives


Re: [lazarus] Somewhat OT: The massive db-less search

2006-02-28 Thread A.J. Venter
On Tuesday 28 February 2006 17:24, Vincent Snijders wrote:
 If java is an option for you:
 http://lucene.apache.org/java/docs/

 If not, maybe you can port it to fpc.

 We use this (the .NET port) at work to index all publications of Statistics
 Netherlands. Searching is fast.

Thanks, I am looking now. There is of course a nice catch: most search engines 
do word-list indexing, which is FINE for web pages, but NOT for searching 
12000 books, as just about every search would match nearly every book - a book 
is MUCH more data than a web page. So literally the only in-data search 
that would give more or less useful results is full-sentence searches - i.e. 
ALL the words you entered, IN THE ORDER you entered them, DIRECTLY juxtaposed - 
easier in one sense, since a substring search will either find an exact match 
or none at all, but harder in that word-list indexing simply will not work.

Looking at things like Lucene and Egothor, it seems that they actually want to 
search the files themselves... all well and good except for a catch - all the 
files are gz compressed. OpenBook has on-demand decompression built in - so 
users don't even need to know about it; the file just appears to open from a 
user's PoV.
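
For what it's worth, reading one compressed book straight through FPC's
zstream unit and doing a plain substring scan looks roughly like the sketch
below; the file name and phrase are only examples, and this brute-force
per-file scan only makes sense on a small shortlist, never on all 12000 books:

program gzscan;
{ Sketch only: decompress one gz book on the fly with the zstream unit and
  do a case-folded substring scan for the exact phrase. }

{$mode objfpc}{$H+}

uses
  SysUtils, Classes, zstream;

function BookContainsPhrase(const GzFile, Phrase: string): Boolean;
var
  Gz: TGZFileStream;
  Buf: array[0..65535] of Byte;
  Got: Integer;
  Data: string;
begin
  Data := '';
  Gz := TGZFileStream.Create(GzFile, gzOpenRead);
  try
    repeat
      { read decompressed bytes in chunks and append them to Data }
      Got := Gz.Read(Buf, SizeOf(Buf));
      if Got > 0 then
      begin
        SetLength(Data, Length(Data) + Got);
        Move(Buf, Data[Length(Data) - Got + 1], Got);
      end;
    until Got <= 0;
  finally
    Gz.Free;
  end;
  Result := Pos(LowerCase(Phrase), LowerCase(Data)) > 0;
end;

begin
  if BookContainsPhrase('casablanca_script.txt.gz',
                        'Here''s looking at you, kid') then
    WriteLn('phrase found');
end.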

Now this is not to say that using the indexes from such a search engine will 
not work - I can index on the uncompressed copy and then just use the data - 
but somehow I just don't see keyword-based searching as being truly useful 
here; the data is just too different. Most large document warehouses have 
fairly diverse data in each document, but this is a disk full of books - most 
of them fiction. In other words, the data you are talking about here is 
several megabytes per file, highly repetitive (in computing terms) and not 
very diverse (again in computing terms). 
A character name will probably get you only a few books, but a search like 
"Here's looking at you, kid" is supposed to get pretty much only Casablanca, 
not every book that ever used the words "looking" and "kid" (which are the 
ones in that phrase that typical keyword searches would consider uncommon).

Frankly I am ready to tell my boss it cannot be done; doing per-file searching 
on the DVD is likely to take a few DAYS per result, and I just don't think 
you can DO this kind of search from metadata.
Well, maybe if I could stick wikiquotes in there and then compare the results 
to my available book list - of course wikiquotes is about 20GB and needs a 
webserver etc. - so it cannot exactly run from a DVD.

Basically, unless somebody already knows how to do this, I am happy to admit I 
am not smart enough to solve THIS one :)

A.J.
-- 
A.J. Venter
Chief Software Architect
OpenLab International
www.getopenlab.com
www.silentcoder.co.za
+27 82 726 5103

_
 To unsubscribe: mail [EMAIL PROTECTED] with
unsubscribe as the Subject
   archives at http://www.lazarus.freepascal.org/mailarchives


Re: [lazarus] Somewhat OT: The massive db-less search

2006-02-28 Thread Vincent Snijders

A.J. Venter wrote:

On Tuesday 28 February 2006 17:24, Vincent Snijders wrote:


If java is an option for you:
http://lucene.apache.org/java/docs/

If not, maybe you can port it to fpc.

We use this (the .NET port) at work to index all publications of Statistic
Netherlands. Searching is fast.



Thanks, I am looking now. There is of course a nice catch: most search engines 
do word-list indexing, which is FINE for web pages, but NOT for searching 
12000 books, as just about every search would match nearly every book - a book 
is MUCH more data than a web page. So literally the only in-data search 
that would give more or less useful results is full-sentence searches - i.e. 
ALL the words you entered, IN THE ORDER you entered them, DIRECTLY juxtaposed - 
easier in one sense, since a substring search will either find an exact match 
or none at all, but harder in that word-list indexing simply will not work.


I think Lucene supports phrase queries.



Looking at things like Lucene and Egothor, it seems that they actually want to 
search the files themselves... all well and good except for a catch - all the 
files are gz compressed. OpenBook has on-demand decompression built in - so 
users don't even need to know about it; the file just appears to open from a 
user's PoV.


Now this is not to say that using the indexes from such a search engine will 
not work - I can index on the uncompressed copy and then just use the data - 
but somehow I just don't see keyword-based searching as being truly useful 
here; the data is just too different. Most large document warehouses have 
fairly diverse data in each document, but this is a disk full of books - most 
of them fiction. In other words, the data you are talking about here is 
several megabytes per file, highly repetitive (in computing terms) and not 
very diverse (again in computing terms). 
A character name will probably get you only a few books, but a search like 
"Here's looking at you, kid" is supposed to get pretty much only Casablanca, 
not every book that ever used the words "looking" and "kid" (which are the 
ones in that phrase that typical keyword searches would consider uncommon).


Lucene should give you the book you are searching for.

In Lucene terms, a book is a Document with some properties. One of them is 
content (or text); you are free to choose. Another one is path, or ISBN, or 
whatever property you want to use to identify your book (we use a GUID to 
identify our data cubes = publications). These are not indexed, but returned 
with the hits.


You search for the phrase "Here's looking at you, kid" in the content 
property; you might even want to turn off stemming.


Lucene returns hits, the search results, which are documents. Then you get the path 
or whatever extra property you added. You can use that to show the result to the user.


So IMHO, it is doable, but you would have to test how large the indices will 
be and what the performance is.


Vincent.

_
To unsubscribe: mail [EMAIL PROTECTED] with
   unsubscribe as the Subject
  archives at http://www.lazarus.freepascal.org/mailarchives