Matt, I think you're mostly on track in suspecting thread pool task overload as the culprit here. First, the old-school (pre-Java 7) ThreadPoolExecutor only accepts a BlockingQueue for its internal work queue, rather than a non-blocking concurrent variant (I'm not sure why). That internal work queue becomes a significant point of contention when the pool is used in a pattern similar to your use case, i.e. submitting lots of tasks to the pool as fast as possible.
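To make that concrete, here's a minimal sketch of the pattern I'm referring to, i.e. handing IndexSearcher an ExecutorService so each query fans out per-slice tasks into the pool's work queue. The index path, pool size, field name and query below are just placeholders, not taken from your setup:

import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class ExecutorSearchSketch {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(new File("/path/to/index")); // placeholder path
        DirectoryReader reader = DirectoryReader.open(dir);

        // Executors.newFixedThreadPool() is a ThreadPoolExecutor fed by a single
        // LinkedBlockingQueue: the internal work queue discussed above.
        ExecutorService pool = Executors.newFixedThreadPool(8);

        // With this constructor, every search() call submits one task per leaf
        // slice (by default one per segment) into that shared queue.
        IndexSearcher searcher = new IndexSearcher(reader, pool);

        TopDocs hits = searcher.search(new TermQuery(new Term("field", "value")), 10);
        System.out.println("total hits: " + hits.totalHits);

        pool.shutdown();
        reader.close();
        dir.close();
    }
}

Run enough of these queries concurrently and every one of them ends up contending on that single queue.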
Second, I am not too familiar with the internals of the fork/join pool implementation in Java 1.7 (if that's what you're using), but from reading the rather daunting javadoc for ForkJoinTask, my rough guess is that it's not terribly well suited for use inside IndexSearcher. In particular, one of the potentially "non-compliant" behaviors is that a mutex lock is taken on each call() invocation that operates on an individual leaf slice; this is evident from code inspection. Based on that, I'm not sure what benefit, if any, multi-threaded search over a multi-segment index provides in general, regardless of the choice of thread pool implementation.

I think a better strategy, as mentioned in another thread, is to optimize your system for multiple concurrent queries rather than forcing each individual query to run across multiple threads/cores. With that approach you could, for instance, set up a non-blocking queue such as ConcurrentLinkedQueue to hold individual query tasks, then have a fixed pool of worker threads consume the queue in a loop and run them (see the rough sketch at the bottom of this mail). In that scenario you shouldn't need to pass an ExecutorService instance to IndexSearcher at all. A strategy like that should give better query throughput regardless of whether each shard consists of a single segment, provided each query is tied to a particular shard and can't search any others.

On Tue, Oct 1, 2013 at 4:10 PM, Desidero <desid...@gmail.com> wrote:
> Uwe,
>
> I was using a bounded thread pool.
>
> I don't know if the problem was the task overload or something about the
> actual efficiency of searching a single segment rather than iterating over
> multiple AtomicReaderContexts, but I'd lean toward task overload. I will do
> some testing tonight to find out for sure.
>
> Matt
> Hi,
>
> use a bounded thread pool.
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>
> > -----Original Message-----
> > From: Desidero [mailto:desid...@gmail.com]
> > Sent: Tuesday, October 01, 2013 11:37 PM
> > To: java-user@lucene.apache.org
> > Subject: Re: Query performance in Lucene 4.x
> >
> > For anyone who was wondering, this was actually resolved in a different
> > thread today. I misread the information in the
> > IndexSearcher(IndexReader,ExecutorService) constructor documentation - I
> > was under the impression that it was submitting a thread for each index
> > shard (MultiReader wraps 20 shards, so 20 tasks) but it was really submitting
> > a task for each segment within each shard (20 shards * ~10 segments = ~200
> > tasks) which is horrible. Since my index changes infrequently, I'm using
> > forceMerge(1) before sending out updated indexes to the slave servers.
> > Without any extra tuning (threads, # of shards, etc) I've gone from ~2900
> > requests per minute to ~10k requests per minute.
> >
> > Thanks to Adrien and Mike for the clarification and Benson for bringing up
> > the question that led to my answer.
> >
> > I'm still pretty new to Lucene so I have a lot of poking around to do, but I'm
> > going to try to implement the "virtual segment" concept that Mike
> > mentioned. It'll be really helpful for those of us who want parallelism within
> > queries and don't want to forceMerge.
> >
> >
> > On Fri, Sep 27, 2013 at 9:55 AM, Desidero <desid...@gmail.com> wrote:
> > >
> > > Erick,
> > >
> > > Thank you for responding.
> > >
> > > I ran tests using both compressed fields and uncompressed fields, and
> > > it was significantly slower with uncompressed fields. I looked into
> > > the lazy field loading per your suggestion, but we don't get any
> > > values from the returned Documents until the result set has been appropriately reduced.
> > > Since we only store one retrievable field and we always need to get
> > > it, it doesn't save any time loading it lazily.
> > >
> > > I'll try running a test without loading any fields just to see how it
> > > affects performance and let you know how that goes.
> > >
> > > Regards,
> > > Matt
> > >
> > >
> > > On Fri, Sep 27, 2013 at 8:01 AM, Erick Erickson <erickerick...@gmail.com> wrote:
> > >
> > >> Hmmm, since 4.1, fields have been stored compressed by default.
> > >> I suppose it's possible that this is a result of
> > >> compressing/uncompressing.
> > >>
> > >> What happens if
> > >> 1> you enable lazy field loading
> > >> 2> don't load any fields?
> > >>
> > >> FWIW,
> > >> Erick
> > >>
> > >> On Thu, Sep 26, 2013 at 10:55 AM, Desidero <desid...@gmail.com> wrote:
> > >> > A quick update:
> > >> >
> > >> > In order to confirm that none of the standard migration changes had
> > >> > a negative effect on performance, I ported my Lucene 4.x version
> > >> > back to Lucene 3.6.2 and kept the newer API rather than using the
> > >> > custom ParallelMultiSearcher and other deprecated methods/classes.
> > >> >
> > >> > Performance in 3.6.2 is even faster than before (~2900 requests/min with 4.x
> > >> > vs ~6200 requests/min with 3.6.2), so none of my code changes
> > >> > should be causing the difference. It seems to be something Lucene
> > >> > is doing under the covers.
> > >> >
> > >> > Again, if there's any other information I can provide to help determine
> > >> > what's going on, please let me know.
> > >> >
> > >> > Thanks,
> > >> > Matt
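PS: since it's easier to show than describe, here's a rough sketch of the queue-plus-worker-pool idea I mentioned above. Everything here (QueryTask, ShardQueryWorkers, the shard-name lookup, the sleep-based backoff) is made up for illustration; it isn't an existing Lucene or library API:

import java.util.Map;
import java.util.concurrent.ConcurrentLinkedQueue;

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;

// Hypothetical task shape: one query, bound to exactly one shard.
class QueryTask {
    final String shardName;
    final Query query;
    QueryTask(String shardName, Query query) {
        this.shardName = shardName;
        this.query = query;
    }
}

public class ShardQueryWorkers {
    // Non-blocking queue of pending query tasks.
    private final ConcurrentLinkedQueue<QueryTask> queue = new ConcurrentLinkedQueue<QueryTask>();
    // One plain IndexSearcher per shard, constructed without an ExecutorService.
    private final Map<String, IndexSearcher> searchersByShard;

    public ShardQueryWorkers(Map<String, IndexSearcher> searchersByShard, int workerCount) {
        this.searchersByShard = searchersByShard;
        for (int i = 0; i < workerCount; i++) {
            Thread worker = new Thread(new Runnable() {
                public void run() {
                    while (!Thread.currentThread().isInterrupted()) {
                        QueryTask task = queue.poll();
                        if (task == null) {
                            // Queue is empty: back off briefly rather than spinning hot.
                            try { Thread.sleep(1); } catch (InterruptedException e) { return; }
                            continue;
                        }
                        try {
                            // Each query runs on exactly one thread against one shard's searcher.
                            IndexSearcher searcher = searchersByShard.get(task.shardName);
                            TopDocs hits = searcher.search(task.query, 10);
                            // ... hand hits off to whatever collects the results ...
                        } catch (Exception e) {
                            e.printStackTrace();
                        }
                    }
                }
            }, "query-worker-" + i);
            worker.setDaemon(true);
            worker.start();
        }
    }

    public void submit(QueryTask task) {
        queue.offer(task);
    }
}

The sleep-based backoff is only there to keep the sketch short; the point is that each worker thread runs whole queries one at a time, so throughput scales with the number of concurrent queries rather than with per-query fan-out.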