Assuming that you need to keep ALL revisions in the index (otherwise you
would just delete the older documents at index time), it will be far
more efficient to do the work at index time, particularly if your
system receives far more query requests than index requests. For
example, you could store the previous 3 or 4 revision numbers with each
document, and then when you write a revised document into the index you
go to the 5th newest revision and add a field "searchinactive" with
value "t". All your searches then simply include a clause making sure
that searchinactive is not "t", or use a QueryWrapperFilter to the same
effect. Of course you could also manage things at index time
differently, e.g. by storing one revision index per document name (the
most flexible option), or by implementing a linked list so that you can
chain back and forward between revisions (this adds the least to the
index).
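
To make the index-time idea concrete, here is a minimal sketch in plain
Java. It does not use Lucene at all: the class RevisionIndexSketch and
its methods are hypothetical stand-ins, with an in-memory set playing
the role of the "searchinactive" field. It assumes "current revision
plus the 3 before it" means the 4 newest revisions stay searchable.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// In-memory stand-in for the index. Every name here is hypothetical.
final class RevisionIndexSketch {
    // Assumption: the current revision plus the 3 before it stay searchable.
    static final int SEARCHABLE_REVISIONS = 4;

    // Document name -> revision numbers indexed so far, kept sorted.
    private final Map<String, TreeSet<Integer>> revisionsByName = new HashMap<>();
    // Keys of revisions that would carry searchinactive = "t" in the real index.
    private final Set<String> inactive = new HashSet<>();

    private static String key(String name, int revision) {
        return name + "#" + revision;
    }

    // Index-time step: record the revision, then flag anything older than the 4 newest.
    void addRevision(String name, int revision) {
        TreeSet<Integer> revisions =
                revisionsByName.computeIfAbsent(name, k -> new TreeSet<>());
        revisions.add(revision);
        int position = 0;
        for (int r : revisions.descendingSet()) {
            position++;
            if (position > SEARCHABLE_REVISIONS) {
                inactive.add(key(name, r));
            }
        }
    }

    // Query-time check: the equivalent of requiring searchinactive != "t".
    boolean isSearchable(String name, int revision) {
        TreeSet<Integer> revisions = revisionsByName.get(name);
        return revisions != null
                && revisions.contains(revision)
                && !inactive.contains(key(name, revision));
    }
}
```

Because revision numbers can skip, the sketch ranks by position among
the known revisions rather than by arithmetic on the numbers. In a real
index the flagging step would mean re-writing the older document with
the extra field.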

You imply this is not what you want to do, and that you want to do it at
search time. The problem with doing it at search time is that in your
scenario no individual document knows how many later revisions it has.
To do this in SQL would be complicated - your TOP query would need to be
a correlated subquery of your search query, so even in SQL it would be
very slow, e.g.

SELECT d.Name, d.stuff FROM Documents d WHERE d.Text LIKE
'searchcriterion' AND d.RevisionNumber IN (SELECT TOP 4
dr.RevisionNumber FROM Documents dr WHERE dr.Name = d.Name ORDER BY
dr.RevisionNumber DESC)

We effectively have to use the same technique (with the same slowness!)
for what you want, by running a Lucene search for every document match.
We do this in a custom HitCollector. When the collect method is called
(once per match), you would have to:

1) retrieve that document in Lucene
2) get its name
3) search on the index for that name
4) retrieve a list of all revision numbers for that document
5) check if your current document's revision number is among the most
recent 4. If not, skip step 6
6) store its Lucene document number and score in a results member
variable of the HitCollector.
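
The six steps above can be sketched as follows. This is plain Java with
no Lucene dependency: the class RecentRevisionCollectorSketch, the Doc
type and the list standing in for the index are all hypothetical, and
only the collect(int, float) shape is borrowed from HitCollector. The
per-name scan in revisionsNewestFirst stands in for the extra Lucene
search in steps 3 and 4, which is where the cost comes from.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a custom HitCollector; not a Lucene class.
final class RecentRevisionCollectorSketch {
    static final int KEEP = 4; // assumption: the 4 newest revisions are wanted

    // Stand-in for a stored document with a name and a revision number.
    static final class Doc {
        final String name;
        final int revision;
        Doc(String name, int revision) {
            this.name = name;
            this.revision = revision;
        }
    }

    // List position plays the role of the Lucene document number.
    private final List<Doc> index;
    // Document number -> score for the hits that survive the filter.
    private final Map<Integer, Float> results = new LinkedHashMap<>();

    RecentRevisionCollectorSketch(List<Doc> index) {
        this.index = index;
    }

    // Steps 3 and 4: find every revision of the named document, newest first.
    private List<Integer> revisionsNewestFirst(String name) {
        List<Integer> revisions = new ArrayList<>();
        for (Doc d : index) {
            if (d.name.equals(name)) {
                revisions.add(d.revision);
            }
        }
        revisions.sort(Collections.reverseOrder());
        return revisions;
    }

    // Called once per match.
    void collect(int docId, float score) {
        Doc doc = index.get(docId);                               // step 1
        String name = doc.name;                                   // step 2
        List<Integer> revisions = revisionsNewestFirst(name);     // steps 3 and 4
        List<Integer> newest =
                revisions.subList(0, Math.min(KEEP, revisions.size()));
        if (newest.contains(doc.revision)) {                      // step 5
            results.put(docId, score);                            // step 6
        }
    }

    Map<Integer, Float> results() {
        return results;
    }
}
```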

You can see example HitCollectors by googling -
http://www.google.co.uk/search?hl=en&q=lucene+hitcollector&btnG=Search&meta=&aq=f&oq=

Because it is quite likely that if one version of the document matches
the search, other revisions of the same document will also match, I
suggest you cache the list of version numbers per document name to
avoid having to do one Lucene search for each index hit.

Because collect will be called many, many times if you have a lot of
documents matching a search, you should optimise your HitCollector
carefully. In your case this should definitely include working out
which approach is quickest, and you should profile several different
approaches.

Even if you stick to the HitCollector route, you will find it faster and
less intensive to store a list of the revision numbers elsewhere so
that you can work out more quickly which versions of the document you
should keep. Preferably you would cache the entire list of revision
numbers for all documents. Then you could use this cache in your
HitCollector. Alternatively you could implement some kind of general
cache (if this is ASP.NET you could easily use the application cache)
which caches version numbers for the most recently hit documents for a
tunable period. Ideally you would also find a way to invalidate the
cache entries during indexing, depending on how long your period is and
how realtime your searches need to be.
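
A minimal sketch of such a cache, in plain Java, might look like this.
RevisionCacheSketch is a hypothetical name, the loader function stands
in for whatever Lucene search fetches the revision numbers for a name,
and the clock is passed in only so the expiry can be exercised without
waiting. It is not tied to any particular caching library.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

// Hypothetical per-document-name cache of revision numbers.
final class RevisionCacheSketch {
    private static final class Entry {
        final List<Integer> revisions;
        final long loadedAt;
        Entry(List<Integer> revisions, long loadedAt) {
            this.revisions = revisions;
            this.loadedAt = loadedAt;
        }
    }

    private final Map<String, Entry> entries = new ConcurrentHashMap<>();
    private final Function<String, List<Integer>> loader; // stand-in for the index lookup
    private final long ttlMillis;                         // the tunable period
    private final LongSupplier clock;                     // e.g. System::currentTimeMillis

    RevisionCacheSketch(Function<String, List<Integer>> loader,
                        long ttlMillis, LongSupplier clock) {
        this.loader = loader;
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    // Returns cached revisions, reloading when the entry is missing or expired.
    List<Integer> revisionsFor(String name) {
        long now = clock.getAsLong();
        Entry entry = entries.get(name);
        if (entry == null || now - entry.loadedAt >= ttlMillis) {
            entry = new Entry(loader.apply(name), now);
            entries.put(name, entry);
        }
        return entry.revisions;
    }

    // Call from the indexing code when a new revision of this name is written.
    void invalidate(String name) {
        entries.remove(name);
    }
}
```

Calling invalidate from the indexing path lets the period be long
without searches going stale; without it, the period bounds how out of
date a result can be.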

I initially thought you could just store the revision numbers you hit
for each document and check those when you get a new hit for the same
document name. This would be much faster because you wouldn't need to do
any Lucene searching within the HitCollector. However, it would not work
in the scenario where you have versions 1,2,3,4,5 and versions 1-4 match
a query but version 5 doesn't. Version 5 is never hit, so your
HitCollector returns the 4 newest versions it hears about (1,2,3,4)
where it should return just 2,3,4.
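
The failure case can be checked with a small plain-Java sketch. The
class and method names are hypothetical; naive keeps the newest 4 of
the revisions that matched, while correct keeps only the matches that
are among the newest 4 of all revisions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper comparing the two filtering strategies.
final class NaiveVersusCorrectSketch {
    // The `keep` highest revision numbers from the given list.
    static List<Integer> newest(List<Integer> revisions, int keep) {
        List<Integer> sorted = new ArrayList<>(revisions);
        sorted.sort(Collections.reverseOrder());
        return new ArrayList<>(sorted.subList(0, Math.min(keep, sorted.size())));
    }

    // Only looks at revisions that matched the query.
    static List<Integer> naive(List<Integer> matched, int keep) {
        List<Integer> kept = newest(matched, keep);
        Collections.sort(kept);
        return kept;
    }

    // Looks at every revision of the document, matched or not.
    static List<Integer> correct(List<Integer> all, List<Integer> matched, int keep) {
        List<Integer> allowed = newest(all, keep);
        List<Integer> kept = new ArrayList<>();
        for (Integer revision : matched) {
            if (allowed.contains(revision)) {
                kept.add(revision);
            }
        }
        Collections.sort(kept);
        return kept;
    }
}
```

With all revisions 1,2,3,4,5 and matches 1,2,3,4, naive(matched, 4)
gives 1,2,3,4 while correct(all, matched, 4) gives 2,3,4.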

You may want to extend TopFieldDocCollector rather than write a wholly
new HitCollector if you want to implement paging.

Hope this is helpful,

Moray

-------------------------------------
Moray McConnachie
Director of IT    +44 1865 261 600
Oxford Analytica  http://www.oxan.com

-----Original Message-----
From: Anders Lybecker [mailto:[email protected]]
Sent: 08 September 2009 07:40
To: [email protected]
Subject: SQL TOP clause functionality

Hi all,

I have a requirement that I'm not sure how to solve with Lucene. Any
help appreciated.

The solution searches through documents and each document can have
multiple revisions. Each document revision is uniquely identified by
document name and revision number. The revision number is incremental,
but not sequential - meaning that document X can have revision 1 and 3,
but not necessarily revision 2. Revision numbers can skip numbers in the
sequence.

The requirement dictates that the users shall be able to search in the
current revision or 3 last revisions only. How do I solve that with
Lucene?

I could keep a record of all revisions for each document, but I'd
prefer not to.

Regards,
Anders Lybecker

