The issue here is a general one of trying to perform an efficient join between 
an external resource (rdbms) and Lucene.
This experiment may be of interest:
    http://issues.apache.org/jira/browse/LUCENE-434

KeyMap.java embodies the core service which translates from lucene doc ids to 
DB primary keys or vice versa.
There are a couple of implementations of KeyMap that are not optimal (they 
pre-date Lucene's FieldCache) but it may give you food for thought.

Cheers
Mark


----- Original Message ----
From: Stephane Nicoll <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Thursday, 1 May, 2008 9:00:33 AM
Subject: hybrid query (lucene + db)

Hi there,

We're using lucene with Hibernate search and we're very happy so far
with the performance and the usability of lucene. We have however a
specific use cases that prevent us to use only lucene: spatial
queries. I already sent a mail on this list a while back about the
problem and we started investigating multiple solutions.

When the user selects a geographic area and some keywords we do the following:

* Perform a search on the lucene index for the keywords with a
projection that returns only the primaryKey of the element sorted by
primary key
* Perform a search on the database with other criterias and a
projection that returns only the primary key of the elements
* Iterate on both list to find N matching IDs, optionally with paging
(some from X to X + N where X is the first result of the page)
* Run a query on the database to return the actual objects (select a
from MyClass a where a.id IN (the list of matching IDs) ) We limit the
page to 1000 results

We have searched a way to optimize the queries and to avoid to consume
too much memory, knowing that we must support paging.

With a single user a search by kewyords takes 30msec to complete, a
search by box takes 45msec. With both (keywords + spatial area)  it
takes 300msec

With 10 concurrent users, a search by keywords takes 150msec/user  but
for both it takes 3 sec/user !!!

I had the profiler running on this scenario and I've found that *all*
threads are waiting on org.apache.lucene.index.SegmentReader. I then
configured Hibernate Search to use a separate index reader per thread.
The deadlocks disappeared but it's still very slow (2.8sec).

Some questions:

* Does anyone knows where the deadlocks on SegmentReader are coming from?
* Is the sorting on the primary keys a bad idea regarding performance
and memory usage?
* Does anyone has an idea to perform this kind of hybrid query in an
efficient way?

I am using lucene 2.3.1 and Hibernate Search 3.0.1. I already ask for
support on the Hibernate Search forum but did not get any answer so
far.

Thanks,
Stéphane

-- 
Large Systems Suck: This rule is 100% transitive. If you build one,
you suck" -- S.Yegge

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]






      __________________________________________________________
Sent from Yahoo! Mail.
A Smarter Email http://uk.docs.yahoo.com/nowyoucan.html

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to