I can see the problem, but I'm not sure it's something Lucene should provide. You could try some post-processing of the Lucene results. For AND and OR operations it should be quite easy: if you get any hits for a page in a book, the whole book has the terms. The hard part will be handling NOT operations. It seems you'd have to actually run a '+' search for the negated term and then rule out all the books that do contain it. Yuck.
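The post-processing idea above can be sketched roughly as follows. This is a minimal Python sketch, not Lucene API code; the `page_hits_for` callback and the book-ID sets are hypothetical stand-ins for whatever a per-page search would return:

```python
def books_matching(required_terms, excluded_terms, page_hits_for):
    """page_hits_for(term) -> set of book IDs whose pages contain the term.

    A book satisfies the query if every required term appears on *some*
    page, and no excluded term appears on *any* page.
    """
    # AND across the book level: intersect the per-term book sets.
    result = None
    for term in required_terms:
        books = page_hits_for(term)
        result = books if result is None else result & books
    result = result if result is not None else set()
    # NOT: run a positive search for each excluded term, then rule out
    # every book that contains it anywhere.
    for term in excluded_terms:
        result -= page_hits_for(term)
    return result
```

For AND this is just set intersection over the per-term book sets; the NOT case works exactly as described above, by running a positive search for the excluded term and subtracting every book it appears in.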
On Wed, Jan 07, 2004 at 09:16:16PM +0100, Thomas Scheffler wrote:
> On Wed, 07.01.2004 at 20:10, Dror Matalon wrote:
> > On Wed, Jan 07, 2004 at 07:58:52PM +0100, Thomas Scheffler wrote:
> > > On Wed, 07.01.2004 at 19:00, Dror Matalon wrote:
> > > > The solution is simple, but you need to think of it conceptually
> > > > in a different way. Instead of "all documents with the same DocID
> > > > are the same document", think "fetch all the documents where
> > > > DocID is XYZ."
> > > >
> > > > Assuming the contents are in a field called contents, you do
> > > > +(DocID:XYZ) (contents:foo) (contents:bar)
> > >
> > > I was already on that track, but think of a search like (foo -bar).
> > > With your solution it will result in a hit, because page 345 (to
> > > keep my example) contains the word "foo" and no "bar". Of course,
> > > with my model I want the book *not* to get a hit for that query.
> > > You see how hard it is to handle, don't you?
> >
> > I think I'm starting to understand. So you want to treat several
> > documents as one, and if the hit fails for one of the documents, it
> > should fail for all the documents with the same id. OK. This begs
> > the question: why don't you make all these documents with the same
> > id one document, and index them together?
>
> That would work, but it is not a nice solution. The "pages" are sent
> to my Java class one at a time; I cannot change that, because of an
> API restriction. To index 1000 pages I would have to index the first
> one, and when the second one arrives I would need to re-fetch the
> first page, bind both together, and send the result to the
> IndexWriter. I would have to keep track of every single page the
> "book" contains. This procedure is repeated for every page and gets
> uglier as the page count increases. Furthermore, my "book" allows
> single pages to be deleted or updated. Every time such an atomic task
> (adding/deleting) is performed, the index for the whole "book" would
> have to be rebuilt.
> The mechanism for transferring a "page" into a Lucene document is
> very time-consuming, so I want to do that as little as possible. As
> you can see, it would be great if Lucene could somehow treat a
> "logical document" (consisting of several Lucene documents) like a
> normal Lucene document.
>
> > > > For that matter, you can use a standard analyzer on the query
> > > > and use a boolean to tie it to the specific document set.
> > > >
> > > > This is how we do searching on a specific channel at
> > > > fastbuzz.com.
> > > >
> > > > Dror
> > > >
> > > > On Wed, Jan 07, 2004 at 05:21:43PM +0100, Thomas Scheffler wrote:
> > > > > Jamie Stallwood said:
> > > > > > +(DocID:XYZ DocID:ABC) +(foo bar)
> > > > > >
> > > > > > will find a document that (MUST have (xyz OR abc)) AND
> > > > > > (MUST have (foo OR bar)).
> > > > >
> > > > > That is only a solution for the example; in the real world I
> > > > > don't have documents containing "foo" or "bar". What I meant
> > > > > was: make Lucene think that all documents with the same DocID
> > > > > are ONE document. Imagine you have a big book, say 1000 pages.
> > > > > Instead of putting the whole book into the index, you split it
> > > > > up into single pages and index those. Now, when a page changes
> > > > > or is deleted, it is faster to update your index than to redo
> > > > > it over and over for all 1000 pages. The problem starts when
> > > > > you search the book. You search for (foo bar); foo is on page
> > > > > 345 while bar is on page 435. You want the book to get a hit.
> > > > > So I need a solution matching this more generic example.
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Thomas Scheffler [mailto:[EMAIL PROTECTED]
> > > > > > Sent: 07 January 2004 11:23
> > > > > > To: [EMAIL PROTECTED]
> > > > > > Subject: merged search of document
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I need a tip for an implementation.
> > > > > > I have several documents, all of them with a field named
> > > > > > DocID. DocID identifies not a single Lucene Document but a
> > > > > > collection of them. When I want to start a search, it should
> > > > > > handle the search as if these Lucene documents were one.
> > > > > >
> > > > > > Example:
> > > > > >
> > > > > > Document 1: DocID:XYZ, containing: foo
> > > > > > Document 2: DocID:XYZ, containing: bar
> > > > > > Document 3: DocID:ABC, containing: foo bar
> > > > > > Document 4: DocID:GHJ, containing: foo
> > > > > >
> > > > > > As you have already guessed, when I'm searching for
> > > > > > "+foo +bar" I want the hits to contain Document 1, Document 2
> > > > > > and Document 3, but not Document 4. Is it clear what I want?
> > > > > > How do I implement such a monster? Is it possible with
> > > > > > Lucene? The content is not stored within Lucene; it's just
> > > > > > tokenized and indexed.
> > > > > >
> > > > > > Any help?
> > > > > >
> > > > > > Thanks in advance!
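The grouping asked for here can be done outside Lucene by collapsing per-page hits into per-DocID hits. A minimal Python sketch, using the four documents above; the data structures are hypothetical stand-ins for what a search over the page index would return:

```python
pages = [  # (DocID, terms found on that page)
    ("XYZ", {"foo"}),         # Document 1
    ("XYZ", {"bar"}),         # Document 2
    ("ABC", {"foo", "bar"}),  # Document 3
    ("GHJ", {"foo"}),         # Document 4
]

def required_hits(required, pages):
    """DocIDs where every required term occurs on at least one page."""
    merged = {}
    for doc_id, terms in pages:
        # Union each book's pages into one logical term set.
        merged.setdefault(doc_id, set()).update(terms)
    return {d for d, terms in merged.items() if set(required) <= terms}

print(sorted(required_hits({"foo", "bar"}, pages)))  # ['ABC', 'XYZ']
```

For "+foo +bar" this reports XYZ (foo on one page, bar on another) and ABC, but not GHJ, which is exactly the behaviour requested.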
> > > > > >
> > > > > > Thomas Scheffler
> > > > > >
> > > > > > ---------------------------------------------------------------------
> > > > > > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > > > > > For additional commands, e-mail: [EMAIL PROTECTED]
> > >
> > > --
> > > Computer science terms, simply explained
> > > ========================================
> > > No. 37 -- Fault-tolerant:
> > >
> > > The program does not accept any user input.
>
> --
> Computer science terms, simply explained
> ========================================
> No. 385 -- Integrates into existing structures:
>
> A Microsoft Passport account is required. (Henryk Plötz)

--
Dror Matalon
Zapatec Inc
1700 MLK Way
Berkeley, CA 94709
http://www.fastbuzz.com
http://www.zapatec.com
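To make the (foo -bar) pitfall discussed in this thread concrete, here is a small self-contained sketch (hypothetical page data, not Lucene API code) contrasting naive page-level evaluation with the book-level evaluation the thread is asking for:

```python
pages = [("XYZ", 345, {"foo"}),  # page 345 of book XYZ contains "foo"
         ("XYZ", 435, {"bar"})]  # page 435 of book XYZ contains "bar"

def page_level(pages):
    """Naive: evaluate (foo AND NOT bar) per page, report the book."""
    return {doc for doc, _page, terms in pages
            if "foo" in terms and "bar" not in terms}

def book_level(pages):
    """Correct: union each book's terms first, then evaluate."""
    merged = {}
    for doc, _page, terms in pages:
        merged.setdefault(doc, set()).update(terms)
    return {doc for doc, terms in merged.items()
            if "foo" in terms and "bar" not in terms}

print(page_level(pages))  # {'XYZ'}  -- false positive: page 345 matches
print(book_level(pages))  # set()    -- the book contains "bar", so no hit
```

The page-level result is exactly the false hit described above for (foo -bar); only evaluating the NOT against the merged book avoids it.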
