I agree with you that we should not tie concurrency w/in a single search to index segments; that solution is just a hack. Will Lucene 4 support multithreaded search for a single query? I haven't found any patch about this.
2011/1/4 Michael McCandless <luc...@mikemccandless.com>:
> Here's the paper:
>
> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.156.8091
>
> I haven't read it yet...
>
> In general I don't like tying concurrency w/in a single search to
> index segments; I'd rather they be (relatively?) independent. EG an
> optimized index would then force single-threaded queries, which is
> crazy...
>
> In Lucene it should be easy to chop up the maxDoc(): ie if I have 10M
> docs and I want to use 4 threads, I just give each thread a chunk of
> 2.5M docs. Each thread would first skip to its start doc, and then go
> from there.
>
> I'm not sure whether we'd want to share a single collector vs a private
> collector per thread and then merge at the end (this is a similar
> tradeoff to the one we've discussed for per-segment collectors).
>
> Mike
>
> 2011/1/1 Li Li <fancye...@gmail.com>:
>> I sent a mail to the MG4J group and Sebastiano Vigna recommended the paper
>> "Reducing query latencies in web search using fine-grained parallelism",
>> World Wide Web, 12(4):441-460, 2009.
>> I read it roughly, but I have some questions. It says:
>>
>> It first coalesces all disk reads in a single process, and then
>> distributes the index data among the parallel threads. So when the
>> index server receives a query, it loads from storage (disk or cache)
>> the required posting lists, initializes the query execution, and then
>> spawns lightweight threads, one per core. Each thread receives an
>> equal-sized subset of document IDs to scan, together covering the
>> entire index partition. All threads execute the same code on the same
>> query, but with private data structures. The only writable shared
>> data structure is a heap of the top-scoring hits, protected by a lock.
>> At the end of the threads' execution, the heap contains the
>> highest-scoring hits in the entire index partition, which is then
>> transmitted to the query integrator as usual. Since the index contains
>> skip-lists that permit near-random access to any document ID, and
>> since hits are generally distributed evenly in the index, our
>> measurements show that all threads complete their work with little
>> variability, within 1-2% of each other.
>>
>> My questions:
>>
>> 1. "Since the index contains skip-lists that permit near-random access
>> to any document ID": can a skip list really give near-random access,
>> especially when it sits on a hard disk?
>>
>> 2. For an "or" query it's easy: e.g. if we search for "search" and
>> "engine", one main thread fetches the postings of the two terms,
>> divides them into 2 groups (e.g. even docIds and odd docIds), and 2
>> threads score them; finally we merge the results (or use a locked
>> priority queue). But for an "and" query we usually don't decode all
>> the docIDs, because we can skip many documents; especially when
>> searching a low-frequency term together with a high-frequency term,
>> only a small number of the high-frequency term's docIDs are decoded.
>>
>> Btw, I think an "or" query is often much slower than an "and" query.
>> If we can parallelize "or" queries well, that's also very useful.
>>
>> On Dec 31, 2010, at 7:25 AM, Li Li wrote:
>>
>>> Which one is used in MG4J to support multithreaded searching? Are
>>
>> 2010/12/31 Li Li <fancye...@gmail.com>:
>>> Is there anyone familiar with MG4J (http://mg4j.dsi.unimi.it/)?
>>> It says: "Multithreading. Indices can be queried and scored concurrently."
>>> Maybe we can learn something from it.
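To make the maxDoc() chunking Mike describes above concrete, here's a rough, untested sketch. It assumes the 3.x/4.0-trunk API where Scorer extends DocIdSetIterator; the class, method, and parameter names are all invented for illustration and are not an existing Lucene API. Each thread would get its own Scorer over the same (single-segment) reader, advance to the start of its docID range, and fill a private top-N queue:

import java.io.IOException;
import java.util.PriorityQueue;

import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Scorer;

class DocRangeCollector {

  /** Collect the top-N hits whose docIDs fall in [startDoc, endDoc). */
  static PriorityQueue<ScoreDoc> collectRange(Scorer scorer, int startDoc,
                                              int endDoc, int topN) throws IOException {
    // Private min-heap per thread, ordered by score, so the weakest hit is evicted first.
    PriorityQueue<ScoreDoc> queue =
        new PriorityQueue<ScoreDoc>(topN, (a, b) -> Float.compare(a.score, b.score));

    // "Each thread would first skip to its start doc": advance() moves the
    // iterator to the first docID >= startDoc.
    int doc = scorer.advance(startDoc);
    while (doc != DocIdSetIterator.NO_MORE_DOCS && doc < endDoc) {
      queue.offer(new ScoreDoc(doc, scorer.score()));
      if (queue.size() > topN) {
        queue.poll(); // drop the current lowest-scoring hit
      }
      doc = scorer.nextDoc();
    }
    return queue; // per-thread result; the caller merges the queues from all threads
  }
}

The caller would then merge the per-thread queues, which avoids lock contention on a shared heap; the paper's single lock-protected heap is the other side of the same shared-vs-private-collector tradeoff Mike mentions.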
>>>
>>> 2010/12/31 Li Li <fancye...@gmail.com>:
>>>> Plus, point 2 below means that looking up a term may need many disk
>>>> seeks in the .tis file (if the entry isn't cached via the .tii).
>>>>
>>>> 2010/12/31 Li Li <fancye...@gmail.com>:
>>>>> Searching multiple segments in parallel is an alternative solution,
>>>>> but it has some disadvantages:
>>>>> 1. idf is not global? (I am not familiar with its implementation.)
>>>>> Maybe this is easy to solve by sharing a global idf.
>>>>> 2. Each segment has its own .tii and .tis files, which may make
>>>>> search slower (that's why index optimization is necessary).
>>>>> 3. One term's doc list is spread across many files rather than one.
>>>>> More than one .frq file means the hard disk must seek across
>>>>> different tracks, which is time-consuming; if there is only one
>>>>> segment, the postings are likely stored in a single track.
>>>>>
>>>>>
>>>>> 2010/12/31 Earwin Burrfoot <ear...@gmail.com>:
>>>>>>>>> until we fix Lucene to run a single search concurrently (which we
>>>>>>>>> badly need to do).
>>>>>>> I am interested in this idea (I have posted about it before). Do you
>>>>>>> have some resources such as papers or tech articles about it?
>>>>>>> I tried it, but it requires dramatic changes to the index format, and
>>>>>>> we use Solr distributed search to relieve the response-time problem,
>>>>>>> so I finally gave it up.
>>>>>>> Lucene 4's index format is more flexible in that it supports custom
>>>>>>> codecs, and it's still under development, so I think now is a good
>>>>>>> time to consider letting it support multithreaded searching for a
>>>>>>> single query.
>>>>>>> I have a naive solution: divide each doc list into several groups,
>>>>>>> e.g. grouping docIds by whether they are even or odd:
>>>>>>> term1 df1=4 docList = 0 4 8 10
>>>>>>> term1 df2=4 docList = 1 3 9 11
>>>>>>>
>>>>>>> term2 df1=4 docList = 0 6 8 12
>>>>>>> term2 df2=4 docList = 3 9 11 15
>>>>>>> Then we can use 2 threads to search the top-N docs on the even group
>>>>>>> and the odd group, and finally merge their results into a single
>>>>>>> list, just like Solr distributed search.
>>>>>>> But it's better than Solr distributed search.
>>>>>>> First, it's in a single process, and data communication between
>>>>>>> threads is much faster than over the network.
>>>>>>> Second, each thread processes the same number of documents. With Solr
>>>>>>> distributed search, one shard may process 7 documents while another
>>>>>>> processes only 1. Even if we make each shard hold the same number of
>>>>>>> documents, we cannot make the split uniform for every term, e.g.:
>>>>>>> shard1 has doc1 doc2
>>>>>>> shard2 has doc3 doc4
>>>>>>> but term1 may only occur in doc1 and doc2,
>>>>>>> while term2 may only occur in doc3 and doc4.
>>>>>>> We may modify it:
>>>>>>> shard1 doc1 doc3
>>>>>>> shard2 doc2 doc4
>>>>>>> That's good for term1 and term2, but term3 may occur in doc1 and doc3...
>>>>>>> So I think this is fine-grained distribution inside the index, while
>>>>>>> Solr distributed search is coarse-grained.
>>>>>> This is just crazy :)
>>>>>>
>>>>>> The simple way is just to search different segments in parallel.
>>>>>> BalancedSegmentMergePolicy makes sure you have roughly even-sized
>>>>>> large segments (and small ones don't count, they're small!).
>>>>>> If you're set on squeezing out that extra millisecond (and making
>>>>>> your life miserable along the way), you can search a single segment
>>>>>> with multiple threads (by dividing it into even chunks, and then doing
>>>>>> skipTo to position your iterators at the beginning of each chunk).
>>>>>>
>>>>>> The first approach is really easy to implement. The second one is
>>>>>> harder, but still doesn't require you to cook the number of available
>>>>>> CPU cores into your index!
>>>>>>
>>>>>> It's the law of diminishing returns at play here. You're most likely
>>>>>> to search in parallel over a mostly memory-resident index
>>>>>> (RAMDir/mmap/filesystem cache - doesn't matter), as most IO subsystems
>>>>>> tend to slow down considerably on parallel sequential reads, so you
>>>>>> already have pretty decent speed.
>>>>>> Searching different segments in parallel (with BSMP) makes you several
>>>>>> times faster.
>>>>>> Searching in parallel within a segment requires some weird hacks, but
>>>>>> has maybe a few percent advantage over the previous solution.
>>>>>> Sharding posting lists requires a great deal of weird hacks, makes the
>>>>>> index machine-bound, and boosts speed by another couple of percent.
>>>>>> Sounds worthless.
>>>>>>
>>>>>> --
>>>>>> Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
>>>>>> Phone: +7 (495) 683-567-4
>>>>>> ICQ: 104465785
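Regarding Earwin's first approach (searching different segments in parallel): if I read trunk correctly, IndexSearcher there can already be constructed with an ExecutorService and will then search each segment in its own task and merge the per-segment results, so this part may not even need a patch. A minimal sketch, assuming the 3.x API; the index path, field name, and thread count are made up for the example:

import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class ParallelSegmentSearch {
  public static void main(String[] args) throws Exception {
    // One task per segment gets scheduled on this pool; size it to the core count.
    ExecutorService pool = Executors.newFixedThreadPool(4);
    IndexReader reader = IndexReader.open(FSDirectory.open(new File("/path/to/index")));
    try {
      // With an ExecutorService, each segment is searched concurrently and the
      // per-segment TopDocs are merged into a single result.
      IndexSearcher searcher = new IndexSearcher(reader, pool);
      TopDocs top = searcher.search(new TermQuery(new Term("body", "lucene")), 10);
      System.out.println("totalHits=" + top.totalHits);
    } finally {
      reader.close();
      pool.shutdown(); // the searcher does not own or shut down the pool
    }
  }
}

If that constructor behaves the way I think it does, this already gives the "several times faster" case Earwin describes, without baking the CPU core count into the index.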