Re: SQL TOP clause functionality

Anders Lybecker Mon, 14 Sep 2009 00:50:03 -0700

Thanks for the very thorough reply.

I ended up storing the revision numbers in a database and querying it before
searching the Lucene index.
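
A minimal sketch of that lookup, assuming a hypothetical table "revisions"
with columns "name" and "revision" (neither name comes from the real system),
and using an in-memory SQLite database purely for illustration. It works out
the newest 4 revision numbers per document, which can then be used to
restrict the Lucene search:

```python
import sqlite3


def newest_revisions(conn, limit=4):
    """Map each document name to its `limit` highest revision numbers."""
    rows = conn.execute(
        "SELECT name, revision FROM revisions ORDER BY name, revision DESC"
    )
    newest = {}
    for name, revision in rows:
        kept = newest.setdefault(name, [])
        if len(kept) < limit:  # rows arrive newest first within each name
            kept.append(revision)
    return newest


# Hypothetical sample data: revision numbers increase but skip values.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE revisions (name TEXT, revision INTEGER)")
conn.executemany(
    "INSERT INTO revisions VALUES (?, ?)",
    [("X", r) for r in (1, 3, 4, 7, 9)] + [("Y", 2)],
)
allowed = newest_revisions(conn)
```

Doing the grouping in application code avoids relying on TOP, which not
every database supports.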


Regards,
Anders Lybecker



On Tue, Sep 8, 2009 at 11:06 AM, Moray McConnachie <
[email protected]> wrote:

> Assuming that you need to keep ALL revisions in the index (otherwise you
> would just delete the older documents at index time), it will be far
> more efficient to do what you need at index time, particularly if your
> system receives far more query requests than index requests. For
> example, you could store the previous 3/4 version numbers with each
> document, and then when you write a revised document into the index you
> go to the 5th newest document and add a field "searchinactive" with
> value "t". All your searches then simply include a query making sure
> that searchinactive is not "t", or use a QueryWrapperFilter to the same
> effect. Of course you could also manage actions at index time
> differently, e.g. by storing one revision index per document name (most
> flexible), or implement a linked list so that you can chain back and
> forward (this adds the least to the index).
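
The index-time flagging described above could be sketched roughly as follows.
This is not Lucene code: the "index" is a plain dict standing in for the real
index, and add_revision and searchable are hypothetical helper names. Only
the field name "searchinactive" and the value "t" come from the suggestion
above. The sketch flags every revision older than the newest 4, which covers
the "5th newest" case even if revisions arrive out of order.

```python
def add_revision(index, name, revision, keep=4):
    """Add a revision, then flag everything older than the `keep` newest."""
    index[(name, revision)] = {"name": name, "revision": revision}
    revisions = sorted((r for (n, r) in index if n == name), reverse=True)
    for old in revisions[keep:]:
        index[(name, old)]["searchinactive"] = "t"


def searchable(index):
    """Stand-in for a search that excludes documents with searchinactive == "t"."""
    return sorted(
        key for key, doc in index.items() if doc.get("searchinactive") != "t"
    )


index = {}
for rev in (1, 3, 4, 7, 9):  # incremental but not sequential
    add_revision(index, "X", rev)
active = searchable(index)
```

In a real Lucene index, adding a field to an existing document means
deleting that document and re-adding it with the extra field.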
>
> You imply this is not what you want to do, but you want to do it at
> search. The problem with doing it at search is that in your scenario no
> individual document knows how many later revisions it has. To do this in
> SQL would be complicated - your TOP query would need to be a subquery
> of your search query, so even in SQL it would be very slow, e.g.
>
> SELECT d.Name, d.stuff FROM Documents d WHERE d.Text LIKE
> 'searchcriterion' AND d.RevisionNumber IN (SELECT TOP 4
> dr.RevisionNumber FROM Documents dr WHERE dr.Name = d.Name ORDER BY
> dr.RevisionNumber DESC)
>
> We effectively have to use the same technique (and slowness!) for what
> you want by calling a Lucene search for every document match. We do this
> in a custom HitCollector. When the collect method is called (once per
> match), you would have to:
>
> 1) retrieve that document in Lucene
> 2) get its name
> 3) search on the index for that name
> 4) retrieve a list of all version numbers for that document
> 5) check if your current document's revision number is among the most
> recent 4. If no, skip 6)
> 6) store its lucene document number and score in a results member
> variable for the HitCollector.
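
The six steps above can be sketched in plain Python. This is only a model of
the logic, not the Lucene HitCollector API: "stored" stands in for document
retrieval (steps 1-2), lookup_revisions is a hypothetical callable standing
in for the per-name search (steps 3-4), and the class name is made up. The
per-name cache means the lookup runs once per document name rather than once
per hit.

```python
class NewestRevisionsCollector:
    """Keeps only hits whose revision is among the `keep` newest for its name."""

    def __init__(self, stored, lookup_revisions, keep=4):
        self.stored = stored                      # doc id -> (name, revision)
        self.lookup_revisions = lookup_revisions  # name -> all revision numbers
        self.keep = keep
        self.cache = {}    # name -> set of its newest revision numbers
        self.results = []  # (doc id, score) pairs that passed the check
        self.lookups = 0   # how many per-name lookups were actually needed

    def collect(self, doc_id, score):
        name, revision = self.stored[doc_id]      # steps 1 and 2
        if name not in self.cache:                # steps 3 and 4, once per name
            self.lookups += 1
            newest = sorted(self.lookup_revisions(name), reverse=True)
            self.cache[name] = set(newest[: self.keep])
        if revision in self.cache[name]:          # step 5
            self.results.append((doc_id, score))  # step 6


# Hypothetical data: document X has revisions 1, 3, 4, 7, 9.
stored = {0: ("X", 1), 1: ("X", 3), 2: ("X", 4), 3: ("X", 7), 4: ("X", 9)}
all_revisions = {"X": [1, 3, 4, 7, 9]}
collector = NewestRevisionsCollector(stored, all_revisions.__getitem__)
for doc_id in (0, 1, 2, 3):  # the search matched every revision except 9
    collector.collect(doc_id, 1.0)
```

Revision 1 is dropped even though revision 9 never matched, because the
check uses the full revision list rather than only the hits seen so far.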
>
> You can see example HitCollectors by googling:
> http://www.google.co.uk/search?hl=en&q=lucene+hitcollector&btnG=Search&meta=&aq=f&oq=
>
> Because I guess it is quite likely that if one version of the document
> matches the search, other revisions of the same document will also
> match, I suggest you cache the list of version numbers per document name
> to avoid having to do one Lucene search for each index hit.
>
> Because this will be hit many, many times if you have a lot of documents
> matching a search, you should optimise a HitCollector carefully. In your
> case this should definitely include working out what the quickest form of
> caching will be, and you should profile several different approaches.
>
> Even if you stick to the HitCollector route, you will find it faster and
> less intensive to store a list of the revision versions elsewhere so
> that you can calculate faster which versions of the document you should
> keep. Preferably you would cache the entire list of revision numbers for
> all documents. Then you could use this cache in your HitCollector.
> Alternatively you implement some kind of general cache (if this is
> ASP.NET you could easily use the application cache) which caches version
> numbers for the most recently hit documents for a (tuned) period. Ideally
> you would find a way to invalidate the cache members during
> indexing, depending on how long your period is and how realtime your
> searches need to be.
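
A rough sketch of such a time-limited cache with explicit invalidation. The
class and method names are hypothetical, "loader" stands in for whatever
actually fetches the revision numbers, and the clock is injectable only so
that expiry can be exercised without waiting:

```python
import time


class RevisionCache:
    """Caches revision numbers per document name for a tunable period."""

    def __init__(self, loader, ttl_seconds=60.0, clock=time.monotonic):
        self.loader = loader  # name -> iterable of revision numbers
        self.ttl = ttl_seconds
        self.clock = clock
        self.entries = {}     # name -> (expires_at, revisions)

    def get(self, name):
        entry = self.entries.get(name)
        now = self.clock()
        if entry is None or entry[0] <= now:  # missing or expired: reload
            entry = (now + self.ttl, list(self.loader(name)))
            self.entries[name] = entry
        return entry[1]

    def invalidate(self, name):
        """Call from the indexing code when a new revision is written."""
        self.entries.pop(name, None)


# Hypothetical usage with a fake clock and an in-memory source.
source = {"X": [1, 3, 4]}
calls = []


def load(name):
    calls.append(name)
    return source[name]


now = [0.0]
cache = RevisionCache(load, ttl_seconds=60.0, clock=lambda: now[0])
first = cache.get("X")
source["X"] = [1, 3, 4, 7]  # indexing adds revision 7
stale = cache.get("X")      # served from the cache, so revision 7 is missing
cache.invalidate("X")
fresh = cache.get("X")      # reloaded after invalidation
```

Without the invalidate call, the new revision only becomes visible once the
period has elapsed, which is the trade-off between period length and how
realtime the searches need to be.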
>
> I initially thought you could just store the revision numbers you hit
> for each document and check those when you get a new hit for the same
> document name. This would be much faster because you wouldn't need to do
> any Lucene searching within the HitCollector. However it would not work
> in the scenario where you have versions 1,2,3,4,5 and versions 1-4 match
> a query but version 5 doesn't. So version 5 is never hit, and your
> HitCollector returns the 4 newest versions it hears about (1,2,3,4) where
> it should return just 2,3,4.
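
The failure case above, worked through in a few lines. Both function names
are hypothetical; the first ranks only the revisions that matched, the second
ranks against the full revision list:

```python
def newest_seen(hit_revisions, keep=4):
    """Naive: keep the newest `keep` revisions among the hits only."""
    return sorted(sorted(hit_revisions, reverse=True)[:keep])


def newest_known(hit_revisions, all_revisions, keep=4):
    """Keep hits that are among the newest `keep` of ALL revisions."""
    newest = set(sorted(all_revisions, reverse=True)[:keep])
    return [r for r in sorted(hit_revisions) if r in newest]


hits = [1, 2, 3, 4]           # versions 1-4 match the query, version 5 does not
everything = [1, 2, 3, 4, 5]  # all versions of the document

naive = newest_seen(hits)
correct = newest_known(hits, everything)
```

Here naive comes out as 1,2,3,4 and correct as 2,3,4, matching the
description above.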
>
> You may want to extend TopFieldDocCollector rather than write a wholly
> new HitCollector if you want to implement paging.
>
> Hope this is helpful,
>
> Moray
>
> -------------------------------------
> Moray McConnachie
> Director of IT    +44 1865 261 600
> Oxford Analytica  http://www.oxan.com
>
> -----Original Message-----
> From: Anders Lybecker [mailto:[email protected]]
> Sent: 08 September 2009 07:40
> To: [email protected]
> Subject: SQL TOP clause functionality
>
> Hi all,
>
> I have a requirement that I'm not sure how to solve with Lucene. Any
> help appreciated.
>
> The solution searches through documents and each document can have
> multiple revisions. Each document revision is uniquely identified by
> document name and revision number. The revision number is incremental,
> but not sequential - meaning that document X can have revision 1 and 3,
> but not necessarily revision 2. Revision number can skip numbers in the
> sequence.
>
> The requirement dictates that the users shall be able to search in the
> current revision or 3 last revisions only. How do I solve that with
> Lucene?
>
> I could keep a record for all revisions for each document, but I'd
> prefer not to.
>
> Regards,
> Anders Lybecker
>
>
>
