Re: [lazarus] Somewhat OT: The massive db-less search
Actually I found a nicer solution :) I integrated with wikiquote.org (which was something I came up with while discussing the problem with you guys). Selecting a phrase search hides the book and submits the search to Wikiquote, grabs the results, pre-parses them and displays the list of matches in an ipHTML panel. Clicking on any of the links in the list then submits the link value as a search back into my index, in BOTH the author and title fields.

So if you search for "Wherefore art thou Romeo", one of the results will be "William Shakespeare"; clicking that link brings up the entry "Shakespeare's First Folio" by William Shakespeare from my list, which happens to contain, among its 35 plays, "Romeo and Juliet".

True, it requires an internet connection, but what it doesn't require is any real CPU/disk usage - and it has the power of an index maintained by thousands of volunteers :). These days, programs being able to integrate cleanly with online information sources is considered a good thing, right? It was for this that I needed the connectivity check, so that I could disable the quote-search checkbox if there was no internet.

Ciao,
A.J.

On Tuesday 28 February 2006 17:51, William Cairns wrote:
> Have you considered a two-pass approach? I.e. do the first regular search using "looking" and "kid" to get a list of the books that might contain the full phrase, then only decompress and search those books for the full phrase.
>
> -----Original Message-----
> From: A.J. Venter [mailto:[EMAIL PROTECTED]
> Sent: 28 February 2006 17:44 PM
> To: lazarus@miraclec.com
> Subject: Re: [lazarus] Somewhat OT: The massive db-less search
>
> On Tuesday 28 February 2006 17:24, Vincent Snijders wrote:
> > If Java is an option for you: http://lucene.apache.org/java/docs/
> > If not, maybe you can port it to FPC. We use this (the .NET port) at work to index all publications of Statistics Netherlands. Searching is fast.
>
> Thanks, I am looking now. There is of course a nice catch: most search engines do word-list indexing, which is FINE for web pages but NOT for searching 12000 books, as just about every search would match nearly every book - a book is MUCH more data than a web page. So literally the only in-data search that would give more or less useful results is a full-sentence search - i.e. ALL the words you entered, IN THE ORDER you entered them, DIRECTLY juxtaposed. Easier in one sense, since a substring search will either find an exact match or none at all, but harder in that word-list indexing simply will not work.
>
> Looking at things like Lucene and Egothor, it seems that they actually want to search the files themselves... all good and well, except for a catch: all the files are gz compressed. OpenBook has on-demand decompression built in, so users don't even need to know about it - the file just appears to open, from the user's point of view. Now this is not to say that using the indexes from such a search engine will not work - I can index the uncompressed copy and then just use the data - but somehow I just don't see keyword-based searching as being truly useful here; the data is just too different. Most large document warehouses have fairly diverse data in each document, but this is a disk full of books, most of them fiction. In other words, the data you are talking about here is several megabytes per file, highly repetitive (in computing terms) and not very diverse (again in computing terms).
>
> A character name will probably get you only a few books, but a search like "Here's looking at you, kid" is supposed to get pretty much only Casablanca, not every book that ever used the words "looking" and "kid" (which are the ones in that phrase which typical keyword searches would consider uncommon).
>
> Frankly, I am ready to tell my boss it cannot be done. Doing per-file searching on the DVD is likely to take a few DAYS per result, and I just don't think you can DO this kind of search from metadata. Well, maybe if I could stick Wikiquote in there and then compare the results to my available book list - but of course Wikiquote is about 20GB and needs a webserver etc., so it cannot exactly run from a DVD. Basically, unless somebody already knows how to do this, I am happy to admit I am not smart enough to solve THIS one :)
>
> A.J.

--
80% of a hardware engineer's job is application of the uncertainty principle.
80% of a software engineer's job is pretending this isn't so.
A.J. Venter
Chief Software Architect
OpenLab International
http://www.getopenlab.com    | +27 82 726 5103 (South Africa)
http://www.silentcoder.co.za | +55 118 162 2079 (Brazil)
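For illustration, a minimal Free Pascal sketch of the feedback step described above: take the text of a clicked Wikiquote link and run it against both the author and title fields of the prebuilt index. The one-book-per-line "filename|author|title" index format and all names here are assumptions for the example, not OpenBook's actual code.

program QuoteLinkLookup;

{$mode objfpc}{$H+}

uses
  Classes, SysUtils;

{ Return every index line whose author or title field contains LinkText.
  Assumed index format, one book per line: filename|author|title
  The caller frees the returned list. }
function SearchAuthorAndTitle(const IndexFile, LinkText: string): TStringList;
var
  Lines: TStringList;
  Line, Author, Title: string;
  i, p1, p2: Integer;
begin
  Result := TStringList.Create;
  Lines := TStringList.Create;
  try
    Lines.LoadFromFile(IndexFile);
    for i := 0 to Lines.Count - 1 do
    begin
      Line := Lines[i];
      p1 := Pos('|', Line);                              // end of filename
      p2 := p1 + Pos('|', Copy(Line, p1 + 1, MaxInt));   // end of author
      Author := Copy(Line, p1 + 1, p2 - p1 - 1);
      Title  := Copy(Line, p2 + 1, MaxInt);
      { The clicked link value is matched against BOTH fields. }
      if (Pos(LowerCase(LinkText), LowerCase(Author)) > 0) or
         (Pos(LowerCase(LinkText), LowerCase(Title)) > 0) then
        Result.Add(Line);
    end;
  finally
    Lines.Free;
  end;
end;

var
  Hits: TStringList;
  i: Integer;
begin
  { In the real program the link value would come from the HTML panel's
    link-click handler; here it is just a constant. }
  Hits := SearchAuthorAndTitle('books.idx', 'William Shakespeare');
  try
    for i := 0 to Hits.Count - 1 do
      WriteLn(Hits[i]);
  finally
    Hits.Free;
  end;
end.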
[lazarus] Somewhat OT: The massive db-less search
Right, the final missing feature of OpenBook is being able to search for phrases INSIDE the books. Now the logical way would be to just check each book and determine whether or not it contains a matching phrase... except that there are 12000 of them.

For the author/title searching it's easy: I have a prebuilt index of files matching them to these details and I JUST search the index. I cannot rely on any kind of SQL-style database (except maybe SQLite or something else that can work without a server), since the program must be able to run from DVD.

So the question arises how best to do it; searching for a phrase in 12000 books one by one will take FOREVER. Ideally I need to somehow build a wordlist/phrase index and search that, something like what ht://Dig does, but I need to pass the results back to my Lazarus code in a format I can easily map back to the index (so I can display author/title information with the results). The last part is easy: if the index contains a filename, I can just look up the filename in the author/title index.

So my question is: can anybody recommend a good tool for creating such a search index? It should either create it in some kind of easily parseable text format, or alternatively something like an SQLite database could work as well. I just reckoned that before I test a hundred apps, I would ask if anybody has suggestions on which to try first.

One crucial element is that OpenBook is multiplatform, with both Windows and Linux versions, so it's vital that whatever method I use to do this search behind the scenes is ALSO multiplatform.

TIA,
A.J.

--
80% of a hardware engineer's job is application of the uncertainty principle.
80% of a software engineer's job is pretending this isn't so.
A.J. Venter
Chief Software Architect
OpenLab International
http://www.getopenlab.com    | +27 82 726 5103 (South Africa)
http://www.silentcoder.co.za | +55 118 162 2079 (Brazil)
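To make the idea concrete, here is a very small Free Pascal sketch of the kind of word-list index described above, written out as a plain, easily parseable text file. The "word=file1;file2;..." format and the file names are assumptions for illustration only; a real indexer for 12000 books would need a smarter structure than a linear TStringList.

program BuildWordIndex;

{$mode objfpc}{$H+}

uses
  Classes, SysUtils;

{ Add every word of one (already uncompressed) book to the word list.
  Index line format: word=file1;file2;... }
procedure IndexBook(Index: TStringList; const BookFile: string);
var
  Book: TStringList;
  Text, W: string;
  i: Integer;
  C: Char;
begin
  Book := TStringList.Create;
  try
    Book.LoadFromFile(BookFile);
    Text := LowerCase(Book.Text);
  finally
    Book.Free;
  end;
  W := '';
  for i := 1 to Length(Text) + 1 do
  begin
    if i <= Length(Text) then C := Text[i] else C := ' ';
    if C in ['a'..'z'] then
      W := W + C
    else if W <> '' then
    begin
      { Linear name=value lookups: fine for a sketch, far too slow for
        12000 books, where a sorted or hashed structure would be needed. }
      if Index.Values[W] = '' then
        Index.Values[W] := BookFile
      else if Pos(BookFile, Index.Values[W]) = 0 then
        Index.Values[W] := Index.Values[W] + ';' + BookFile;
      W := '';
    end;
  end;
end;

var
  WordIndex: TStringList;
begin
  WordIndex := TStringList.Create;
  try
    IndexBook(WordIndex, 'books/romeo_and_juliet.txt');  // hypothetical file
    IndexBook(WordIndex, 'books/casablanca.txt');        // hypothetical file
    WordIndex.SaveToFile('wordlist.idx');                // plain, greppable text
  finally
    WordIndex.Free;
  end;
end.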
Re: [lazarus] Somewhat OT: The massive db-less search
On Tue, 28 Feb 2006, A.J. Venter wrote:
> Right, the final missing feature of OpenBook is being able to search for phrases INSIDE the books. Now the logical way would be to just check each book and determine whether or not it contains a matching phrase... except that there are 12000 of them.
>
> For the author/title searching it's easy: I have a prebuilt index of files matching them to these details and I JUST search the index. I cannot rely on any kind of SQL-style database (except maybe SQLite or something else that can work without a server), since the program must be able to run from DVD.
>
> So the question arises how best to do it; searching for a phrase in 12000 books one by one will take FOREVER. Ideally I need to somehow build a wordlist/phrase index and search that, something like what ht://Dig does, but I need to pass the results back to my Lazarus code in a format I can easily map back to the index (so I can display author/title information with the results). The last part is easy: if the index contains a filename, I can just look up the filename in the author/title index.
>
> So my question is: can anybody recommend a good tool for creating such a search index? It should either create it in some kind of easily parseable text format, or alternatively something like an SQLite database could work as well. I just reckoned that before I test a hundred apps, I would ask if anybody has suggestions on which to try first.
>
> One crucial element is that OpenBook is multiplatform, with both Windows and Linux versions, so it's vital that whatever method I use to do this search behind the scenes is ALSO multiplatform.

Michael Hess, the Lazarus website webmaster, has a tool called IDKSM which does exactly what you need, for HTML files. It comes with Delphi/Java code. Only one problem: IDKSM is closed-source :/ But surely he can give you a hint on how to go about this.

Michael.
Re: [lazarus] Somewhat OT: The massive db-less search
Take a look at "Managing Gigabytes", a book which explains how to do a mixed database/compression algorithm.

HTH
Re: [lazarus] Somewhat OT: The massive db-less search
If Java is an option for you: http://lucene.apache.org/java/docs/

If not, maybe you can port it to FPC. We use this (the .NET port) at work to index all publications of Statistics Netherlands. Searching is fast.

Vincent.
Re: [lazarus] Somewhat OT: The massive db-less search
On Tuesday 28 February 2006 17:24, Vincent Snijders wrote:
> If Java is an option for you: http://lucene.apache.org/java/docs/
> If not, maybe you can port it to FPC. We use this (the .NET port) at work to index all publications of Statistics Netherlands. Searching is fast.

Thanks, I am looking now. There is of course a nice catch: most search engines do word-list indexing, which is FINE for web pages but NOT for searching 12000 books, as just about every search would match nearly every book - a book is MUCH more data than a web page. So literally the only in-data search that would give more or less useful results is a full-sentence search - i.e. ALL the words you entered, IN THE ORDER you entered them, DIRECTLY juxtaposed. Easier in one sense, since a substring search will either find an exact match or none at all, but harder in that word-list indexing simply will not work.

Looking at things like Lucene and Egothor, it seems that they actually want to search the files themselves... all good and well, except for a catch: all the files are gz compressed. OpenBook has on-demand decompression built in, so users don't even need to know about it - the file just appears to open, from the user's point of view. Now this is not to say that using the indexes from such a search engine will not work - I can index the uncompressed copy and then just use the data - but somehow I just don't see keyword-based searching as being truly useful here; the data is just too different. Most large document warehouses have fairly diverse data in each document, but this is a disk full of books, most of them fiction. In other words, the data you are talking about here is several megabytes per file, highly repetitive (in computing terms) and not very diverse (again in computing terms).

A character name will probably get you only a few books, but a search like "Here's looking at you, kid" is supposed to get pretty much only Casablanca, not every book that ever used the words "looking" and "kid" (which are the ones in that phrase which typical keyword searches would consider uncommon).

Frankly, I am ready to tell my boss it cannot be done. Doing per-file searching on the DVD is likely to take a few DAYS per result, and I just don't think you can DO this kind of search from metadata. Well, maybe if I could stick Wikiquote in there and then compare the results to my available book list - but of course Wikiquote is about 20GB and needs a webserver etc., so it cannot exactly run from a DVD. Basically, unless somebody already knows how to do this, I am happy to admit I am not smart enough to solve THIS one :)

A.J.

--
A.J. Venter
Chief Software Architect
OpenLab International
www.getopenlab.com
www.silentcoder.co.za
+27 82 726 5103
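For reference, a minimal Free Pascal sketch of what that per-file search amounts to: decompress one gz-compressed book on the fly (here via the TGZFileStream class in FPC's zstream unit, assuming it is available) and substring-search it for the exact phrase. Repeating this for every one of the 12000 books on every query is the "few DAYS per result" scenario; the file name is hypothetical.

program BruteForcePhraseSearch;

{$mode objfpc}{$H+}

uses
  Classes, SysUtils, zstream;

{ True if the decompressed text of BookFile contains Phrase (case-insensitive). }
function BookContainsPhrase(const BookFile, Phrase: string): Boolean;
var
  Gz: TGZFileStream;
  Chunk, Text: string;
  n: Integer;
begin
  Text := '';
  SetLength(Chunk, 64 * 1024);
  Gz := TGZFileStream.Create(BookFile, gzopenread);  // decompresses as it reads
  try
    repeat
      n := Gz.Read(Chunk[1], Length(Chunk));
      Text := Text + Copy(Chunk, 1, n);              // whole book ends up in RAM
    until n = 0;
  finally
    Gz.Free;
  end;
  Result := Pos(LowerCase(Phrase), LowerCase(Text)) > 0;
end;

begin
  { Scanning ONE book like this is fine; doing it for all 12000 books on a
    DVD for every query is what makes the brute-force approach hopeless. }
  if BookContainsPhrase('books/casablanca.txt.gz', 'here''s looking at you, kid') then
    WriteLn('match');
end.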
Re: [lazarus] Somewhat OT: The massive db-less search
A.J. Venter wrote:
> On Tuesday 28 February 2006 17:24, Vincent Snijders wrote:
> > If Java is an option for you: http://lucene.apache.org/java/docs/
> > If not, maybe you can port it to FPC. We use this (the .NET port) at work to index all publications of Statistics Netherlands. Searching is fast.
>
> Thanks, I am looking now. There is of course a nice catch: most search engines do word-list indexing, which is FINE for web pages but NOT for searching 12000 books, as just about every search would match nearly every book - a book is MUCH more data than a web page. So literally the only in-data search that would give more or less useful results is a full-sentence search - i.e. ALL the words you entered, IN THE ORDER you entered them, DIRECTLY juxtaposed. Easier in one sense, since a substring search will either find an exact match or none at all, but harder in that word-list indexing simply will not work.

I think Lucene supports phrase queries.

> Looking at things like Lucene and Egothor, it seems that they actually want to search the files themselves... all good and well, except for a catch: all the files are gz compressed. OpenBook has on-demand decompression built in, so users don't even need to know about it - the file just appears to open, from the user's point of view. Now this is not to say that using the indexes from such a search engine will not work - I can index the uncompressed copy and then just use the data - but somehow I just don't see keyword-based searching as being truly useful here; the data is just too different. Most large document warehouses have fairly diverse data in each document, but this is a disk full of books, most of them fiction. In other words, the data you are talking about here is several megabytes per file, highly repetitive (in computing terms) and not very diverse (again in computing terms).
>
> A character name will probably get you only a few books, but a search like "Here's looking at you, kid" is supposed to get pretty much only Casablanca, not every book that ever used the words "looking" and "kid" (which are the ones in that phrase which typical keyword searches would consider uncommon).

Lucene should give you the book you are searching for. In Lucene terms, a book is a Document with some properties. One of them is "content" (or "text"); you are free to choose. Another one is "path", or "ISBN", or whatever property you want to use to identify your book (we use a GUID to identify our data cubes = publications). These are not indexed, but are returned with the hits. You search for the phrase "Here's looking at you, kid" in the content property; you might even want to turn off stemming. Lucene returns hits, the search results, which are documents. Then you get the path, or whatever extra property you added, and you can use that to show the result to the user.

So IMHO it is doable, but you would have to test how large the indices will be and what the performance is.

Vincent.
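This is not Lucene's actual API, only a rough Pascal sketch of the Document/hit model Vincent describes, as it might look if the idea were ported to FPC; every type, field and method name below is made up for illustration.

unit phraseindex;

{$mode objfpc}{$H+}

interface

type
  { One book = one "document": the content field is what would be indexed,
    the path field is only stored and handed back with every hit. }
  TBookDocument = record
    Content: string;   // full uncompressed text - indexed, not stored
    Path: string;      // file name on the DVD - stored, not indexed
  end;

  THit = record
    Path: string;      // enough to look up author/title in the existing index
    Score: Double;
  end;

  THitArray = array of THit;

  { The part that would have to be written or ported: index documents,
    then answer exact-phrase queries (stemming off) with the stored fields. }
  IPhraseIndex = interface
    procedure AddDocument(const Doc: TBookDocument);
    function SearchPhrase(const Phrase: string): THitArray;
  end;

implementation

end.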