I agree with you that we should not tie concurrency w/in a single search to index segments; that solution is just a hack. Will Lucene 4 support multithreaded search for a single query? I haven't found any patch about this.
2011/1/4 Michael McCandless <luc...@mikemccandless.com>:
> Here's the paper:
>
> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.156.8091
>
> I haven't read it yet...
>
> In general I don't like tying concurrency w/in a single search to
> index segments; I'd rather they be (relatively?) independent. EG an
> optimized index would then force single-threaded queries, which is
> crazy...
>
> In Lucene it should be easy to chop up the maxDoc(): ie if I have 10M
> docs and I want to use 4 threads, I just give each thread a chunk of
> 2.5M docs. Each thread would first skip to its start doc, and then go
> from there.
>
> I'm not sure whether we'd want to share a single collector vs a private
> collector per thread and then merge at the end (this is a similar
> tradeoff to the one we've discussed for per-segment collectors).
>
> Mike
>
> 2011/1/1 Li Li <fancye...@gmail.com>:
>> I sent a mail to the MG4J group and Sebastiano Vigna recommended the paper
>> "Reducing query latencies in web search using fine-grained parallelism",
>> World Wide Web, 12(4):441-460, 2009.
>> I read it roughly, but I have some questions. It says:
>>
>> It first coalesces all disk reads in a single process, and then
>> distributes the index data among the parallel threads. So when the
>> index server receives a query, it loads from storage (disk or cache)
>> the required posting lists, initializes the query execution, and then
>> spawns lightweight threads, one per core. Each thread receives an
>> equal-sized subset of document IDs to scan, together covering the
>> entire index partition. All threads execute the same code on the same
>> query, but with private data structures. The only writable shared
>> data structure is a heap of the top-scoring hits, protected by a lock.
>> At the end of the threads' execution, the heap contains the
>> highest-scoring hits in the entire index partition, which is then
>> transmitted to the query integrator as usual. Since the index contains
>> skip-lists that permit near-random access to any document ID, and
>> since hits are generally distributed evenly in the index, our
>> measurements show that all threads complete their work with little
>> variability, within 1-2% of each other.
>>
>> My questions:
>>
>> 1. "Since the index contains skip-lists that permit near-random access
>> to any document ID": can a skip list really give near-random access,
>> especially when it sits on a hard disk?
>>
>> 2. For an "or" query it's easy: e.g. if we search for "search" and
>> "engine", one main thread fetches the postings of the two terms,
>> divides them into 2 groups (e.g. even docIds and odd docIds), and 2
>> threads score them; finally we merge the results (or use a locked
>> priority queue). But for an "and" query we usually don't decode all
>> the docIDs, because we can skip many documents; especially when
>> searching a low-frequency term together with a high-frequency term,
>> only a small number of the high-frequency term's docIDs are decoded.
>>
>> Btw, I think an "or" query is often much slower than an "and" query.
>> If we can parallelize "or" queries well, that's also very useful.
>>
>> On Dec 31, 2010, at 7:25 AM, Li Li wrote:
>>
>>> Which one is used in MG4J to support multithreaded searching? Are
>>
>> 2010/12/31 Li Li <fancye...@gmail.com>:
>>> Is there anyone familiar with MG4J (http://mg4j.dsi.unimi.it/)?
>>> It says: "Multithreading. Indices can be queried and scored concurrently."
>>> Maybe we can learn something from it.
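To make the maxDoc() chunking Mike describes above concrete, here's a rough, untested sketch. It assumes the 3.x/4.0-trunk API where Scorer extends DocIdSetIterator; the class, method, and parameter names are all invented for illustration and are not an existing Lucene API. Each thread would get its own Scorer over the same (single-segment) reader, advance to the start of its docID range, and fill a private top-N queue:

import java.io.IOException;
import java.util.PriorityQueue;

import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Scorer;

class DocRangeCollector {

  /** Collect the top-N hits whose docIDs fall in [startDoc, endDoc). */
  static PriorityQueue<ScoreDoc> collectRange(Scorer scorer, int startDoc,
                                              int endDoc, int topN) throws IOException {
    // Private min-heap per thread, ordered by score, so the weakest hit is evicted first.
    PriorityQueue<ScoreDoc> queue =
        new PriorityQueue<ScoreDoc>(topN, (a, b) -> Float.compare(a.score, b.score));

    // "Each thread would first skip to its start doc": advance() moves the
    // iterator to the first docID >= startDoc.
    int doc = scorer.advance(startDoc);
    while (doc != DocIdSetIterator.NO_MORE_DOCS && doc < endDoc) {
      queue.offer(new ScoreDoc(doc, scorer.score()));
      if (queue.size() > topN) {
        queue.poll(); // drop the current lowest-scoring hit
      }
      doc = scorer.nextDoc();
    }
    return queue; // per-thread result; the caller merges the queues from all threads
  }
}

The caller would then merge the per-thread queues, which avoids lock contention on a shared heap; the paper's single lock-protected heap is the other side of the same shared-vs-private-collector tradeoff Mike mentions.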
>>>
>>> 2010/12/31 Li Li <fancye...@gmail.com>:
>>>> Plus, point 2 below means that looking up a term may need many disk
>>>> seeks in the .tis file (if the entry isn't cached via the .tii).
>>>>
>>>> 2010/12/31 Li Li <fancye...@gmail.com>:
>>>>> Searching multiple segments in parallel is an alternative solution,
>>>>> but it has some disadvantages:
>>>>> 1. idf is not global? (I am not familiar with its implementation.)
>>>>> Maybe this is easy to solve by sharing a global idf.
>>>>> 2. Each segment has its own .tii and .tis files, which may make
>>>>> search slower (that's why index optimization is necessary).
>>>>> 3. One term's doc list is spread across many files rather than one.
>>>>> More than one .frq file means the hard disk must seek across
>>>>> different tracks, which is time-consuming; if there is only one
>>>>> segment, the postings are likely stored in a single track.
>>>>>
>>>>>
>>>>> 2010/12/31 Earwin Burrfoot <ear...@gmail.com>:
>>>>>>>>> until we fix Lucene to run a single search concurrently (which we
>>>>>>>>> badly need to do).
>>>>>>> I am interested in this idea (I have posted about it before). Do you
>>>>>>> have some resources such as papers or tech articles about it?
>>>>>>> I tried it, but it requires dramatic changes to the index format, and
>>>>>>> we use Solr distributed search to relieve the response-time problem,
>>>>>>> so I finally gave it up.
>>>>>>> Lucene 4's index format is more flexible in that it supports custom
>>>>>>> codecs, and it's still under development, so I think now is a good
>>>>>>> time to consider letting it support multithreaded searching for a
>>>>>>> single query.
>>>>>>> I have a naive solution: divide each doc list into several groups,
>>>>>>> e.g. grouping docIds by whether they are even or odd:
>>>>>>> term1 df1=4 docList = 0 4 8 10
>>>>>>> term1 df2=4 docList = 1 3 9 11
>>>>>>>
>>>>>>> term2 df1=4 docList = 0 6 8 12
>>>>>>> term2 df2=4 docList = 3 9 11 15
>>>>>>> Then we can use 2 threads to search the top-N docs on the even group
>>>>>>> and the odd group, and finally merge their results into a single
>>>>>>> list, just like Solr distributed search.
>>>>>>> But it's better than Solr distributed search.
>>>>>>> First, it's in a single process, and data communication between
>>>>>>> threads is much faster than over the network.
>>>>>>> Second, each thread processes the same number of documents. With Solr
>>>>>>> distributed search, one shard may process 7 documents while another
>>>>>>> processes only 1. Even if we make each shard hold the same number of
>>>>>>> documents, we cannot make the split uniform for every term, e.g.:
>>>>>>> shard1 has doc1 doc2
>>>>>>> shard2 has doc3 doc4
>>>>>>> but term1 may only occur in doc1 and doc2,
>>>>>>> while term2 may only occur in doc3 and doc4.
>>>>>>> We may modify it:
>>>>>>> shard1 doc1 doc3
>>>>>>> shard2 doc2 doc4
>>>>>>> That's good for term1 and term2, but term3 may occur in doc1 and doc3...
>>>>>>> So I think this is fine-grained distribution inside the index, while
>>>>>>> Solr distributed search is coarse-grained.
>>>>>> This is just crazy :)
>>>>>>
>>>>>> The simple way is just to search different segments in parallel.
>>>>>> BalancedSegmentMergePolicy makes sure you have roughly even-sized
>>>>>> large segments (and small ones don't count, they're small!).
>>>>>> If you're set on squeezing out that extra millisecond (and making
>>>>>> your life miserable along the way), you can search a single segment
>>>>>> with multiple threads (by dividing it into even chunks, and then doing
>>>>>> skipTo to position your iterators at the beginning of each chunk).
>>>>>>
>>>>>> The first approach is really easy to implement. The second one is
>>>>>> harder, but still doesn't require you to cook the number of available
>>>>>> CPU cores into your index!
>>>>>>
>>>>>> It's the law of diminishing returns at play here. You're most likely
>>>>>> to search in parallel over a mostly memory-resident index
>>>>>> (RAMDir/mmap/filesystem cache - doesn't matter), as most IO subsystems
>>>>>> tend to slow down considerably on parallel sequential reads, so you
>>>>>> already have pretty decent speed.
>>>>>> Searching different segments in parallel (with BSMP) makes you several
>>>>>> times faster.
>>>>>> Searching in parallel within a segment requires some weird hacks, but
>>>>>> has maybe a few percent advantage over the previous solution.
>>>>>> Sharding posting lists requires a great deal of weird hacks, makes the
>>>>>> index machine-bound, and boosts speed by another couple of percent.
>>>>>> Sounds worthless.
>>>>>>
>>>>>> --
>>>>>> Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
>>>>>> Phone: +7 (495) 683-567-4
>>>>>> ICQ: 104465785
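Regarding Earwin's first approach (searching different segments in parallel): if I read trunk correctly, IndexSearcher there can already be constructed with an ExecutorService and will then search each segment in its own task and merge the per-segment results, so this part may not even need a patch. A minimal sketch, assuming the 3.x API; the index path, field name, and thread count are made up for the example:

import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class ParallelSegmentSearch {
  public static void main(String[] args) throws Exception {
    // One task per segment gets scheduled on this pool; size it to the core count.
    ExecutorService pool = Executors.newFixedThreadPool(4);
    IndexReader reader = IndexReader.open(FSDirectory.open(new File("/path/to/index")));
    try {
      // With an ExecutorService, each segment is searched concurrently and the
      // per-segment TopDocs are merged into a single result.
      IndexSearcher searcher = new IndexSearcher(reader, pool);
      TopDocs top = searcher.search(new TermQuery(new Term("body", "lucene")), 10);
      System.out.println("totalHits=" + top.totalHits);
    } finally {
      reader.close();
      pool.shutdown(); // the searcher does not own or shut down the pool
    }
  }
}

If that constructor behaves the way I think it does, this already gives the "several times faster" case Earwin describes, without baking the CPU core count into the index.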