Thanks for the very thorough reply. I ended up storing the revision numbers in a database and querying it before searching the Lucene index.
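
A minimal sketch of that lookup step, in case it helps someone else. Everything here is an assumption made for illustration: the class and method names (RevisionCutoffs, cutoffs, isSearchable) are hypothetical, and an in-memory map stands in for the database table of revision numbers. It computes, per document name, the lowest revision that is still among the newest `keep` revisions. How that cutoff is then applied to the Lucene search is left open.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene. The map passed in stands in
// for whatever the database query returns.
class RevisionCutoffs {

    // For each document name, return the lowest revision number that is
    // still among the `keep` newest revisions of that document.
    static Map<String, Integer> cutoffs(Map<String, List<Integer>> revisionsByName, int keep) {
        if (keep < 1) {
            throw new IllegalArgumentException("keep must be at least 1");
        }
        Map<String, Integer> result = new HashMap<>();
        for (Map.Entry<String, List<Integer>> entry : revisionsByName.entrySet()) {
            List<Integer> sorted = new ArrayList<>(entry.getValue());
            if (sorted.isEmpty()) {
                continue; // no revisions known for this name
            }
            sorted.sort(Collections.reverseOrder()); // newest first
            int index = Math.min(keep, sorted.size()) - 1;
            result.put(entry.getKey(), sorted.get(index));
        }
        return result;
    }

    // A revision is searchable when it is at or above its document's cutoff.
    static boolean isSearchable(Map<String, Integer> cutoffs, String name, int revision) {
        Integer cutoff = cutoffs.get(name);
        return cutoff != null && revision >= cutoff;
    }
}
```

Because revision numbers only increase, a single cutoff per document name is enough, even when numbers are skipped in the sequence.
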
Regards,
Anders Lybecker

On Tue, Sep 8, 2009 at 11:06 AM, Moray McConnachie <[email protected]> wrote:

> Assuming that you need to keep ALL revisions in the index (otherwise you
> would just delete the older documents at index time), it will be far
> more efficient to do what you need at index time, particularly if your
> system receives far more query requests than index requests. For
> example, you could store the previous 3/4 version numbers with each
> document, and then when you write a revised document into the index you
> go to the 5th newest document and add a field "searchinactive" with
> value "t". All your searches then simply include a query making sure
> that searchinactive is not "t", or use a QueryWrapperFilter to the same
> effect. Of course you could also manage actions at index time
> differently, e.g. by storing one revision index per document name (most
> flexible), or implementing a linked list so that you can chain back and
> forward (this is most index minimal).
>
> You imply this is not what you want to do, but you want to do it at
> search. The problem with doing it at search is that in your scenario no
> individual document knows how many later revisions it has. To do this in
> SQL would be complicated - your TOP query would need to be a subquery
> of your search query, so even in SQL it would be very slow (e.g.
>
> SELECT d.Name, d.stuff FROM Documents d WHERE d.Text LIKE
> 'searchcriterion' AND d.RevisionNumber IN (SELECT TOP 4
> dr.RevisionNumber FROM Documents dr WHERE d.Name LIKE dr.Name ORDER BY
> dr.RevisionNumber DESC)
>
> )
>
> We effectively have to use the same technique (and slowness!) for what
> you want by calling a Lucene search for every document match. We do this
> in a custom HitCollector.
> When the collect method is called (once per match), you would have to:
>
> 1) retrieve that document in Lucene
> 2) get its name
> 3) search on the index for that name
> 4) retrieve a list of all version numbers for that document
> 5) check if your current document's revision number is among the most
> recent 4. If no, skip 6)
> 6) store its lucene document number and score in a results member
> variable for the HitCollector.
>
> You can see example HitCollectors by googling -
> http://www.google.co.uk/search?hl=en&q=lucene+hitcollector&btnG=Search&meta=&aq=f&oq=
>
> Because I guess it is quite likely that if one version of the document
> matches the search, other revisions of the same document will also
> match, I suggest you cache the list of version numbers per document name
> to avoid having to do one Lucene search for each index hit.
>
> Because this will be hit many, many times if you have a lot of documents
> matching a search, you should optimise a HitCollector carefully. In your
> case this should definitely include what the quickest form of will be,
> and you should profile several different approaches.
>
> Even if you stick to the HitCollector route, you will find it faster and
> less intensive to store a list of the revision versions elsewhere so
> that you can calculate faster which versions of the document you should
> keep. Preferably you would cache the entire list of revision numbers for
> all documents. Then you could use this cache in your HitCollector.
> Alternatively you implement some kind of general cache (if this is
> ASP.NET you could easily use the application cache) which caches version
> numbers for the most recently hit documents for a period (tuned). You
> would need to find a way ideally to invalidate the cache members during
> indexing, depending on how long your period is and how realtime your
> searches need to be.
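
The per-hit check with a cache, as described above, could be sketched roughly as follows. This is an illustration under stated assumptions, not Lucene code: RecentRevisionCache, accept and invalidate are hypothetical names, and the `loader` function stands in for steps 3 and 4 (searching the index for a document name and listing its revision numbers). A collector would call accept once per hit and keep only the hits for which it returns true.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical class, not part of Lucene. The loader stands in for
// "search the index for that name and list all its revision numbers".
class RecentRevisionCache {
    private final Map<String, Set<Integer>> newestByName = new ConcurrentHashMap<>();
    private final Function<String, List<Integer>> loader;
    private final int keep;

    RecentRevisionCache(Function<String, List<Integer>> loader, int keep) {
        if (keep < 1) {
            throw new IllegalArgumentException("keep must be at least 1");
        }
        this.loader = loader;
        this.keep = keep;
    }

    // Called once per hit: true if `revision` is among the newest `keep`
    // revisions of `name`. The loader runs only on the first hit per name.
    boolean accept(String name, int revision) {
        return newestByName.computeIfAbsent(name, this::loadNewest).contains(revision);
    }

    // Called from the indexing side when a new revision of `name` is written,
    // so the next hit reloads the revision list.
    void invalidate(String name) {
        newestByName.remove(name);
    }

    private Set<Integer> loadNewest(String name) {
        List<Integer> all = new ArrayList<>(loader.apply(name));
        all.sort(Collections.reverseOrder()); // newest first
        return new HashSet<>(all.subList(0, Math.min(keep, all.size())));
    }
}
```

Since the cache is filled from the full revision list rather than from the hits seen so far, a revision that does not match the query still counts towards the newest four.
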
>
> I initially thought you could just store the revision numbers you hit
> for each document and check those when you get a new hit for the same
> document name. This would be much faster because you wouldn't need to do
> any Lucene searching within the HitCollector. However it would not work
> in the scenario where you have versions 1,2,3,4,5 and versions 1-4 match
> a query but version 5 doesn't. So version 5 is never hit, and your
> HitCollector returns the 4 newest versions it hears about (1,2,3,4) where
> it should return just 2,3,4.
>
> You may want to extend TopFieldDocCollector rather than write a wholly
> new HitCollector if you want to implement paging.
>
> Hope this is helpful,
>
> Moray
>
> -------------------------------------
> Moray McConnachie
> Director of IT    +44 1865 261 600
> Oxford Analytica  http://www.oxan.com
>
> -----Original Message-----
> From: Anders Lybecker [mailto:[email protected]]
> Sent: 08 September 2009 07:40
> To: [email protected]
> Subject: SQL TOP clause functionality
>
> Hi all,
>
> I have a requirement that I'm not sure how to solve with Lucene. Any
> help appreciated.
>
> The solution searches through documents and each document can have
> multiple revisions. Each document revision is uniquely identified by
> document name and revision number. The revision number is incremental,
> but not sequential - meaning that document X can have revision 1 and 3,
> but not necessarily revision 2. Revision number can skip numbers in the
> sequence.
>
> The requirement dictates that the users shall be able to search in the
> current revision or 3 last revisions only. How do I solve that with
> Lucene?
>
> I could keep a record for all revisions for each document, but I'd
> prefer not to.
>
> Regards,
> Anders Lybecker
