Hmm, I guess your IndexSearcher is backed by a MultiReader that operates on these "shards" you're referring to, which are supposed to be single-segment indexes? If so, this topology sounds fairly equivalent, at least in concept and possibly in performance as well, to the regular case where you search a single index shard but allow multiple segments in it and let Lucene manage segment creation/merging, instead of effectively disabling it.
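Roughly, the two setups I have in mind look like this (a minimal sketch; the directory paths, shard count, and pool size are made up for illustration, not taken from your actual configuration):

import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;

public class ShardTopologySketch {
  public static void main(String[] args) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(20);

    // Setup A (what I understand yours to be): 20 single-segment "shard"
    // indexes wrapped in a MultiReader. Each leaf is one shard, so the
    // searcher fans out one task per shard.
    IndexReader[] shards = new IndexReader[20];
    for (int i = 0; i < shards.length; i++) {
      shards[i] = DirectoryReader.open(FSDirectory.open(new File("/indexes/shard" + i)));
    }
    IndexSearcher shardedSearcher = new IndexSearcher(new MultiReader(shards), pool);

    // Setup B: a single index whose segments are left to the merge policy.
    // Each segment is a leaf, so the searcher fans out one task per segment --
    // conceptually the same fan-out, just with Lucene deciding the boundaries.
    IndexReader whole = DirectoryReader.open(FSDirectory.open(new File("/indexes/whole")));
    IndexSearcher plainSearcher = new IndexSearcher(whole, pool);
  }
}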
But this is all speculation on my part, absent a concrete side-by-side comparison. Also, from your description, what you are doing to extend/replace internal plumbing in IndexSearcher sounds awfully complex and hard to envision... do make sure to run functional tests when it's all in place. :)

On Wed, Oct 2, 2013 at 5:37 PM, Desidero <desid...@gmail.com> wrote:

> Vitaly,
>
> Thanks for your comments.
>
> Unfortunately, thread pool task overload is not the problem. When I extended the IndexSearcher class last night, I had it create one task per shard (20 tasks) instead of the default, which turned out to be somewhere around 320 (I didn't realize it created quite so many segments when I recreated my index yesterday!), and performance was still awful as long as there were multiple segments per shard.
>
> I understand that the fork/join framework doesn't work well with the standard IndexSearcher's parallel search implementation, but I extended the class, and my implementation avoids the pitfalls as long as there are very few (preferably one) segments per shard. The blocking in the original SegmentWorkerCallable is still there, but I set it up so that if one worker already has the lock, any others that try to lock it will just throw their results in a separate queue, which is polled by the thread that managed to get the lock. After all the child tasks are complete, the queue is checked one more time to be safe, and then processing continues. There is basically no waiting in that chunk of code (per the profiler). After doing some tests with the ForkJoinPool and custom code as well as a standard ThreadPoolExecutor, I found that the fork/join implementation was faster.
>
> The system I'm working on is much more sensitive to response time (latency) than overall throughput, so we've had to focus on parallelism within queries rather than across many queries. Batch processing performance is still important, but we cannot sacrifice quick "online" responses. It would be much easier to avoid this whole mess, but we cannot meet our performance requirements without it.
>
> Regards,
>
> Matt
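A rough sketch of the hand-off Matt describes: a worker that cannot acquire the merge lock parks its partial result in a concurrent queue instead of blocking, and the worker holding the lock drains that queue. The class, method, and field names here are placeholders, not his actual code:

import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.locks.ReentrantLock;

import org.apache.lucene.search.TopDocs;

public class PartialResultCollector {
  private final ReentrantLock mergeLock = new ReentrantLock();
  private final ConcurrentLinkedQueue<TopDocs> pending = new ConcurrentLinkedQueue<TopDocs>();

  // Called by each worker task with the TopDocs for its slice.
  public void offer(TopDocs partial) {
    if (mergeLock.tryLock()) {
      try {
        merge(partial);
        // Drain anything parked by workers that lost the race for the lock.
        TopDocs queued;
        while ((queued = pending.poll()) != null) {
          merge(queued);
        }
      } finally {
        mergeLock.unlock();
      }
    } else {
      // Don't block: leave the result for whichever worker holds the lock.
      pending.add(partial);
    }
  }

  // Called once after all worker tasks have completed, since a result can
  // slip into the queue after the lock holder has finished draining.
  public void drainRemaining() {
    TopDocs queued;
    while ((queued = pending.poll()) != null) {
      merge(queued);
    }
  }

  private void merge(TopDocs partial) {
    // Merge the partial hits into the global top-N (details omitted).
  }
}

The tryLock-or-park pattern gives up strict merge ordering in exchange for never blocking a worker, which matches the profiler observation above.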
> On Oct 2, 2013 3:36 PM, "Vitaly Funstein" <vfunst...@gmail.com> wrote:
>
> > Matt,
> >
> > I think you are mostly on track with suspecting thread pool task overload as the possible culprit here. First, the old-school (prior to Java 7) ThreadPoolExecutor only accepts a BlockingQueue to use internally for worker tasks, instead of a concurrent variant (not sure why). So this internal work queue will become a significant point of contention when using the pool in a pattern similar to your use case, i.e. submitting lots of tasks to the pool as fast as possible.
> >
> > Second, I am not too familiar with the internals of the fork/join pool implementation in 1.7 (if that's what you're using), but from reading the daunting javadoc for ForkJoinTask my rough guess is that it's not terribly well suited for use in IndexSearcher. In particular, one of the possible "non-compliant" behaviors is that a mutex lock is taken for each call() invocation that operates on an individual leaf slice. This is evident from code inspection, and based on this, I am not sure what benefit, if any, multi-threaded search over a multi-segment index would provide in general - regardless of the choice of thread pool implementation.
> >
> > I think a better strategy, as mentioned in another thread, is to optimize your system for multiple concurrent queries, rather than focusing on forcing each query to run across multiple threads/cores. With that approach, you could, for instance, set up a non-blocking queue like ConcurrentLinkedQueue that would contain individual query tasks, then use a fixed pool of worker threads to consume the queue in a loop and run them. In this scenario, you shouldn't need to pass an ExecutorService instance to IndexSearcher.
> >
> > A strategy like that should provide better query throughput, regardless of whether each shard consists of a single segment or not - provided each of the queries is tied to a particular shard and can't search any others.
> >
> >
> > On Tue, Oct 1, 2013 at 4:10 PM, Desidero <desid...@gmail.com> wrote:
> >
> > > Uwe,
> > >
> > > I was using a bounded thread pool.
> > >
> > > I don't know if the problem was the task overload or something about the actual efficiency of searching a single segment rather than iterating over multiple AtomicReaderContexts, but I'd lean toward task overload. I will do some testing tonight to find out for sure.
> > >
> > > Matt
> > >
> > > Hi,
> > >
> > > use a bounded thread pool.
> > >
> > > Uwe
> > >
> > > -----
> > > Uwe Schindler
> > > H.-H.-Meier-Allee 63, D-28213 Bremen
> > > http://www.thetaphi.de
> > > eMail: u...@thetaphi.de
> > >
> > > > -----Original Message-----
> > > > From: Desidero [mailto:desid...@gmail.com]
> > > > Sent: Tuesday, October 01, 2013 11:37 PM
> > > > To: java-user@lucene.apache.org
> > > > Subject: Re: Query performance in Lucene 4.x
> > > >
> > > > For anyone who was wondering, this was actually resolved in a different thread today. I misread the information in the IndexSearcher(IndexReader, ExecutorService) constructor documentation - I was under the impression that it was submitting a task for each index shard (MultiReader wraps 20 shards, so 20 tasks), but it was really submitting a task for each segment within each shard (20 shards * ~10 segments = ~200 tasks), which is horrible. Since my index changes infrequently, I'm using forceMerge(1) before sending out updated indexes to the slave servers. Without any extra tuning (threads, # of shards, etc.) I've gone from ~2900 requests per minute to ~10k requests per minute.
> > > >
> > > > Thanks to Adrien and Mike for the clarification and Benson for bringing up the question that led to my answer.
> > > >
> > > > I'm still pretty new to Lucene so I have a lot of poking around to do, but I'm going to try to implement the "virtual segment" concept that Mike mentioned. It'll be really helpful for those of us who want parallelism within queries and don't want to forceMerge.
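For reference, the forceMerge(1) step mentioned above amounts to something like the following, run against each shard before the updated index is shipped to the slave servers; the directory path and the Version constant are assumptions, not taken from the thread:

import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class MergeBeforePublish {
  public static void main(String[] args) throws Exception {
    IndexWriterConfig cfg =
        new IndexWriterConfig(Version.LUCENE_44, new StandardAnalyzer(Version.LUCENE_44));
    IndexWriter writer = new IndexWriter(FSDirectory.open(new File("/indexes/shard0")), cfg);
    // Collapse the ~10 segments in this shard into a single segment so the
    // searcher submits exactly one task for it.
    writer.forceMerge(1);
    writer.close();
  }
}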
> > > >
> > > > On Fri, Sep 27, 2013 at 9:55 AM, Desidero <desid...@gmail.com> wrote:
> > > >
> > > > > Erick,
> > > > >
> > > > > Thank you for responding.
> > > > >
> > > > > I ran tests using both compressed fields and uncompressed fields, and it was significantly slower with uncompressed fields. I looked into the lazy field loading per your suggestion, but we don't get any values from the returned Documents until the result set has been appropriately reduced. Since we only store one retrievable field and we always need to get it, it doesn't save any time loading it lazily.
> > > > >
> > > > > I'll try running a test without loading any fields just to see how it affects performance and let you know how that goes.
> > > > >
> > > > > Regards,
> > > > > Matt
> > > > >
> > > > > On Fri, Sep 27, 2013 at 8:01 AM, Erick Erickson <erickerick...@gmail.com> wrote:
> > > > >
> > > > >> Hmmm, since 4.1, fields have been stored compressed by default. I suppose it's possible that this is a result of compressing/uncompressing.
> > > > >>
> > > > >> What happens if
> > > > >> 1> you enable lazy field loading
> > > > >> 2> don't load any fields?
> > > > >>
> > > > >> FWIW,
> > > > >> Erick
> > > > >>
> > > > >> On Thu, Sep 26, 2013 at 10:55 AM, Desidero <desid...@gmail.com> wrote:
> > > > >> > A quick update:
> > > > >> >
> > > > >> > In order to confirm that none of the standard migration changes had a negative effect on performance, I ported my Lucene 4.x version back to Lucene 3.6.2 and kept the newer API rather than using the custom ParallelMultiSearcher and other deprecated methods/classes.
> > > > >> >
> > > > >> > Performance in 3.6.2 is even faster than before (~2900 requests/min with 4.x vs ~6200 requests/min with 3.6.2), so none of my code changes should be causing the difference. It seems to be something Lucene is doing under the covers.
> > > > >> >
> > > > >> > Again, if there's any other information I can provide to help determine what's going on, please let me know.
> > > > >> >
> > > > >> > Thanks,
> > > > >> > Matt
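Since only one stored field is ever retrieved in this setup, another variant worth trying alongside the "load no fields" test is restricting document loading to just that field. A minimal sketch; the field name "payload" is a placeholder for whatever the one retrievable field is actually called:

import java.io.IOException;
import java.util.Collections;
import java.util.Set;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

public class SingleFieldLoader {
  // Placeholder name for the one retrievable stored field.
  private static final Set<String> FIELDS = Collections.singleton("payload");

  public static String[] loadStoredValues(IndexSearcher searcher, TopDocs hits) throws IOException {
    IndexReader reader = searcher.getIndexReader();
    String[] values = new String[hits.scoreDocs.length];
    for (int i = 0; i < hits.scoreDocs.length; i++) {
      ScoreDoc sd = hits.scoreDocs[i];
      // Ask the reader for just the one field instead of materializing every stored field.
      Document doc = reader.document(sd.doc, FIELDS);
      values[i] = doc.get("payload");
    }
    return values;
  }
}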