I'm sure that was the easiest thing to do! I find Lucene works best in combination with an SQL database, especially when you have tasks in which you need to search and combine datasets (e.g. user permissioning and aggregation).

Yours,
Moray

-------------------------------------
Moray McConnachie
Director of IT  +44 1865 261 600
Oxford Analytica  http://www.oxan.com

-----Original Message-----
From: Anders Lybecker [mailto:[email protected]]
Sent: 14 September 2009 08:49
To: [email protected]
Subject: Re: SQL TOP clause functionality

Thanks for the very thorough reply.

I ended up storing the revision numbers in a database and querying it before searching the Lucene index.

Regards,
Anders Lybecker

On Tue, Sep 8, 2009 at 11:06 AM, Moray McConnachie <[email protected]> wrote:

> Assuming that you need to keep ALL revisions in the index (otherwise
> you would just delete the older documents at index time), it will be
> far more efficient to do what you need at index time, particularly if
> your system receives far more query requests than index requests. For
> example, you could store the previous 3/4 version numbers with each
> document, and then when you write a revised document into the index
> you go to the 5th newest document and add a field "searchinactive"
> with value "t". All your searches then simply include a query making
> sure that searchinactive is not "t", or use a QueryWrapperFilter to
> the same effect. Of course you could also manage actions at index time
> differently, e.g. by storing one revision index per document name
> (most flexible), or implementing a linked list so that you can chain
> back and forward (this is most index minimal).
>
> You imply this is not what you want to do, but you want to do it at
> search. The problem with doing it at search is that in your scenario
> no individual document knows how many later revisions it has. To do
> this in SQL would be complicated - your TOP query would need to be a
> subquery of your search query, so even in SQL it would be very slow (e.g.
>
> SELECT d.Name, d.stuff FROM Documents d WHERE d.Text LIKE
> 'searchcriterion' AND d.RevisionNumber IN (SELECT TOP 4
> dr.RevisionNumber FROM Documents dr WHERE dr.Name = d.Name ORDER BY
> dr.RevisionNumber DESC)
>
> )
>
> We effectively have to use the same technique (and slowness!) for
> what you want by calling a Lucene search for every document match. We
> do this in a custom HitCollector. When the collect method is called
> (once per match), you would have to:
>
> 1) retrieve that document in Lucene
> 2) get its name
> 3) search on the index for that name
> 4) retrieve a list of all version numbers for that document
> 5) check if your current document's revision number is among the most
>    recent 4. If no, skip 6)
> 6) store its Lucene document number and score in a results member
>    variable for the HitCollector.
>
> You can see example HitCollectors by googling -
> http://www.google.co.uk/search?hl=en&q=lucene+hitcollector&btnG=Search&meta=&aq=f&oq=
>
> Because I guess it is quite likely that if one version of the document
> matches the search, other revisions of the same document will also
> match, I suggest you cache the list of version numbers per document
> name to avoid having to do one Lucene search for each index hit.
>
> Because this will be hit many, many times if you have a lot of
> documents matching a search, you should optimise a HitCollector
> carefully. In your case this should definitely include what the
> quickest form of will be, and you should profile several different approaches.
>
> Even if you stick to the HitCollector route, you will find it faster
> and less intensive to store a list of the revision versions elsewhere
> so that you can calculate faster which versions of the document you
> should keep. Preferably you would cache the entire list of revision
> numbers for all documents. Then you could use this cache in your HitCollector.
> Alternatively you could implement some kind of general cache (if this
> is ASP.NET you could easily use the application cache) which caches
> version numbers for the most recently hit documents for a period
> (tuned). You would ideally need to find a way to invalidate the cache
> members during indexing, depending on how long your period is and how
> realtime your searches need to be.
>
> I initially thought you could just store the revision numbers you hit
> for each document and check those when you get a new hit for the same
> document name. This would be much faster because you wouldn't need to
> do any Lucene searching within the HitCollector. However it would not
> work in the scenario where you have versions 1,2,3,4,5 and versions
> 1-4 match a query but version 5 doesn't. So version 5 is never hit,
> and your HitCollector returns the 4 newest versions it hears about
> (1,2,3,4) where it should return just 2,3,4.
>
> You may want to extend TopFieldDocCollector rather than write a wholly
> new HitCollector if you want to implement paging.
>
> Hope this is helpful,
>
> Moray
>
> -------------------------------------
> Moray McConnachie
> Director of IT  +44 1865 261 600
> Oxford Analytica  http://www.oxan.com
>
> -----Original Message-----
> From: Anders Lybecker [mailto:[email protected]]
> Sent: 08 September 2009 07:40
> To: [email protected]
> Subject: SQL TOP clause functionality
>
> Hi all,
>
> I have a requirement that I'm not sure how to solve with Lucene. Any
> help appreciated.
>
> The solution searches through documents and each document can have
> multiple revisions. Each document revision is uniquely identified by
> document name and revision number. The revision number is incremental,
> but not sequential - meaning that document X can have revision 1 and
> 3, but not necessarily revision 2. Revision number can skip numbers in
> the sequence.
>
> The requirement dictates that the users shall be able to search in the
> current revision or 3 last revisions only. How do I solve that with
> Lucene?
>
> I could keep a record for all revisions for each document, but I'd
> prefer not to.
>
> Regards,
> Anders Lybecker
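
The search-time filtering discussed in this thread relies on a complete list of revision numbers per document name, held outside the hit loop (a cache, or a database as Anders ended up using). A rough sketch of that idea in plain Python follows. It makes no Lucene calls: the function names, the (name, revision, score) hit tuples and the sample data are all made up for illustration, and keep=4 is an assumption taken from "the most recent 4" in Moray's step 5.

```python
from collections import defaultdict


def newest_revisions(all_revisions, keep=4):
    """Map each document name to the set of its `keep` highest revision numbers.

    `all_revisions` must list every revision that exists, not only the ones
    that matched a query; otherwise a non-matching newest revision is missed.
    """
    by_name = defaultdict(list)
    for name, revision in all_revisions:
        by_name[name].append(revision)
    return {
        name: set(sorted(revisions, reverse=True)[:keep])
        for name, revisions in by_name.items()
    }


def filter_hits(hits, allowed):
    """Keep only the hits whose revision is among the newest for its name."""
    return [
        (name, revision, score)
        for name, revision, score in hits
        if revision in allowed.get(name, set())
    ]


# Hypothetical data: every revision in the index. Gaps are fine (Y has 1 and 3).
all_revisions = [("X", 1), ("X", 2), ("X", 3), ("X", 4), ("X", 5),
                 ("Y", 1), ("Y", 3)]

# Hypothetical search hits: revisions 1-4 of X match, revision 5 does not.
hits = [("X", 1, 0.9), ("X", 2, 0.8), ("X", 3, 0.7), ("X", 4, 0.6),
        ("Y", 3, 0.5)]

allowed = newest_revisions(all_revisions)
kept = filter_hits(hits, allowed)
```

Because the allowed set is built from all revisions rather than from the hits, revision 1 of X is dropped even though revision 5 never matched the query, which is the case Moray describes as breaking the "remember what you have hit so far" shortcut.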
