RE: Lucene sort performance roots?

Uwe Schindler Fri, 24 Jun 2011 06:58:28 -0700

 
> Lucene is a great toolkit for search applications. And it's so fast in
most of
> cases. I think I am understand why it's faster than relational databases
for
> information retrieval. For example, Lucene use very efficient index than
> allows to retrieve posting list in constant time and do intersect between
> them.
> 
> But there is one aspect which I couldn't understand for a long time now.
In
> our test cases Lucene perform sorting blazingly fast. This one is freaking
me
> out. I have no explanation why Lucene should do sorting faster than
> relational database. Let me put it another way -- I have no explanation
why
> SQL databases should not do it as fast as Lucene.


This is easy to explain by seeing the main difference between RDBMS and
Lucene:

RDBMS do sorting (because of transactions and other reasons) after
retrieving the data. Especially for lots of results this can take lots of
time. RDBMS are also slower on retrieving the data, as they are always
retrieving *all* data (unless you use limit clauses) from the underlying
store, to be at least also able to sort them. All this data (sorted or
unsorted) is then transferred to the application (with cursors that may be
optimized, but in general the result set is already in memory).

Lucene  has some differences in doing this:

Lucene never returns all stored fields for all documents that hit the query.
This would take lots of time and in fact Lucene would even be much slower
than RDBMS databases. The trick, that all full text search engines use, is
to never touch the actual data during query execution, you only touch the
unique term list and the posting lists. When the query executes, Lucene
simply collects all result ids from the posting lists and returns them. No
data document data is transferred or touched at all. One step further, in
general full text engines only return the most relevant documents, that are
then displayed in e.g. pages on the web (like Google).

To do this score sorting, all collected document ids are inserted into a
Priority Queue to be sorted by score. Documents with too low score will
immediately fall out of this queue if their score is not competitive.

The same applies if you manually sort: In general you only return the
beginning of the list to the user (the top-n documents), so the sorting
algorithm does the same like for scores, all ids are inserted using the
requested sort key into a priority queue and all ids, not competitive fall
out and are not touched at all. Important to know is, that in constrast to a
relational database the sort keys are completely in memory and are currently
never materialized on disk (we have now column stride fields in trunk, but
that's just a materialization as a column - relational databases need to
scan each row to sort. Some databases [like Sybase IQ] use column oriented
storage, those are also faster on sorting, as the sort keys are not stored
row-by-row, they are stored as a single column like Lucene's column-stride
fields [called DocValues, only in Lucene 4.0]).

Back to Lucene: Once the top-n document ids are collected (which also
returns the total result count, as at least all results are counted), you
have a list of those top-n document ids in the order requested. And after
that the stored fields are retrieved from the underlying index for display
only, so it fetches only the really required stored fields from the index
for the few documents in top-n.

Because of this top-n behavior, its generally slow with Lucene to scan
deeply into the result set. If you want to go on page 100 of your search
results, the priority queue must at least have a size of n=docsPerPage*100.
Because of this, most full text search engines (e.g. Google does this, too)
prevent you from going to deep into the result set, as it would get slower
and slower, because the PQ gets bigger and bigger. E.g. Google prevents you
to go as far as I know beyond page 50 or like that.

Uwe


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

RE: Lucene sort performance roots?

Reply via email to