Re: strange problem of PForDelta decoder

Li Li Thu, 30 Dec 2010 19:46:06 -0800

Is anyone familiar with MG4J (http://mg4j.dsi.unimi.it/)?
Its feature list says: "Multithreading. Indices can be queried and scored concurrently."
Maybe we can learn something from it.


2010/12/31 Li Li <fancye...@gmail.com>:
> Also,
> point 2 means that searching for a term needs many seeks in the tis files
> (if it's not cached in tii).
>
> 2010/12/31 Li Li <fancye...@gmail.com>:
>> Searching multiple segments is an alternative solution, but it has some
>> disadvantages.
>> 1. idf is not global? (I am not familiar with its implementation.) Maybe
>> it's easy to solve by sharing a global idf.
>> 2. Each segment will have its own tii and tis files, which may make
>> search slower (that's why optimization of the
>> index is necessary).
>> 3. One term's docList is distributed across many files rather than one.
>> More than one frq file means the
>> hard disk must seek to different tracks, which is time-consuming. If there is
>> only one segment, they are likely
>> stored in a single track.
>>
>>
>> 2010/12/31 Earwin Burrfoot <ear...@gmail.com>:
>>>>>>until we fix Lucene to run a single search concurrently (which we
>>>>>>badly need to do).
>>>> I am interested in this idea (I have posted about it before). Do you have any
>>>> resources, such as papers or tech articles, about it?
>>>> I have tried, but it needs dramatic changes to the index format, and we use
>>>> Solr distributed search to relieve the response-time problem, so I finally
>>>> gave it up.
>>>> Lucene 4's index format is more flexible in that it supports custom codecs,
>>>> and it's under development now, so I think it's a good time to consider
>>>> letting it support multithreaded searching for a single query.
>>>> I have a naive solution: dividing the docList into many groups,
>>>> e.g. grouping docIds by whether they are even or odd:
>>>> term1 df1=4  docList =  0  4  8  10
>>>> term1 df2=4  docList = 1  3  9  11
>>>>
>>>> term2 df1=4  docList = 0  6  8  12
>>>> term2 df2=4  docList = 3  9  11 15
>>>>   Then we can use 2 threads to search the topN docs in the even group and
>>>> the odd group, and finally merge their results into a single one, just like
>>>> Solr distributed search.
>>>> But it's better than Solr distributed search.
>>>>   First, it's in a single process, and data communication between
>>>> threads is much
>>>> faster than over the network.
>>>>   Second, each thread processes the same number of documents. With Solr
>>>> distributed
>>>> search, one shard may process 7 documents and another shard only 1 document.
>>>> Even if we can make each shard hold the same number of documents, we cannot
>>>> make the distribution uniform for each term.
>>>>    e.g. shard1 has doc1 doc2
>>>>           shard2 has doc3 doc4
>>>>    but term1 may only occur in doc1 and doc2
>>>>    while term2 may only occur in doc3 and doc4.
>>>>    We may modify it:
>>>>           shard1 doc1 doc3
>>>>           shard2 doc2 doc4
>>>>    It's good for term1 and term2,
>>>>    but term3 may occur in doc1 and doc3...
>>>>    So I think this is fine-grained distribution within the index, while Solr
>>>> distributed search is coarse-grained.
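
The even/odd grouping proposed above can be sketched roughly like this. It is a toy model, not Lucene code: the doc IDs come from the example in the message, but the term frequencies, the sum-of-tf scoring, and all function names are made up for illustration. Note that CPython threads will not give a real speedup here; the point is only the shape of "search each group, then merge".

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

# Toy index: term -> sorted list of (doc_id, term_frequency).
# Doc IDs follow the example above; the frequencies are invented.
POSTINGS = {
    "term1": [(0, 1), (1, 2), (3, 1), (4, 3), (8, 1), (9, 2), (10, 1), (11, 1)],
    "term2": [(0, 2), (3, 1), (6, 1), (8, 4), (9, 1), (11, 2), (12, 1), (15, 3)],
}

def rank_key(hit):
    # Higher score first; ties broken by lower doc ID.
    doc_id, score = hit
    return (score, -doc_id)

def split_by_parity(postings, groups=2):
    """Build one small index per group, assigning doc_id to group doc_id % groups."""
    shards = [dict() for _ in range(groups)]
    for term, plist in postings.items():
        for g in range(groups):
            shards[g][term] = [(d, tf) for d, tf in plist if d % groups == g]
    return shards

def search_shard(shard, query_terms, n):
    """Score each doc as the sum of its term frequencies and return the top n."""
    scores = {}
    for term in query_terms:
        for doc_id, tf in shard.get(term, []):
            scores[doc_id] = scores.get(doc_id, 0) + tf
    return heapq.nlargest(n, scores.items(), key=rank_key)

def search(postings, query_terms, n, groups=2):
    """Search every group on its own thread, then merge the partial top-n lists."""
    shards = split_by_parity(postings, groups)
    with ThreadPoolExecutor(max_workers=groups) as pool:
        partials = list(pool.map(lambda s: search_shard(s, query_terms, n), shards))
    merged = [hit for part in partials for hit in part]
    return heapq.nlargest(n, merged, key=rank_key)

top3 = search(POSTINGS, ["term1", "term2"], 3)
```

Since every document lives in exactly one group, taking the top n of each group and then the top n of the union gives the same result as a single-threaded search over the whole index.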
>>> This is just crazy :)
>>>
>>> The simple way is just to search different segments in parallel.
>>> BalancedSegmentMergePolicy makes sure you have roughly even-sized
>>> large segments (and small ones don't count, they're small!).
>>> If you're bound on squeezing out that extra millisecond (and making
>>> your life miserable along the way), you can search a single segment
>>> with multiple threads (by dividing it in even chunks, and then doing
>>> skipTo to position your iterators to the beginning of each chunk).
>>>
>>> First approach is really easy to implement. Second one is harder, but
>>> still doesn't require you to cook the number of CPU cores available
>>> into your index!
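
The second approach (splitting one segment into even doc ID chunks and positioning the iterators at the start of each chunk) might look like the sketch below. It is a simplified model under stated assumptions: posting lists are plain sorted Python lists, `bisect_left` stands in for skipTo, the query is a two-term AND, and all names are hypothetical.

```python
from bisect import bisect_left
from concurrent.futures import ThreadPoolExecutor

def chunk_bounds(max_doc, chunks):
    """Split [0, max_doc) into at most `chunks` contiguous, near-equal ranges."""
    step = max(1, -(-max_doc // chunks))  # ceiling division, never zero
    return [(lo, min(lo + step, max_doc)) for lo in range(0, max_doc, step)]

def intersect_range(a, b, lo, hi):
    """Intersect two sorted doc ID lists, restricted to doc IDs in [lo, hi)."""
    i = bisect_left(a, lo)  # stands in for skipTo(lo) on the first iterator
    j = bisect_left(b, lo)  # and on the second
    hits = []
    while i < len(a) and j < len(b):
        if a[i] >= hi or b[j] >= hi:
            break  # left this chunk; the next worker owns the rest
        if a[i] == b[j]:
            hits.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return hits

def parallel_intersect(a, b, max_doc, chunks):
    """Run one worker per chunk and concatenate the results in doc ID order."""
    bounds = chunk_bounds(max_doc, chunks)
    with ThreadPoolExecutor(max_workers=chunks) as pool:
        parts = list(pool.map(lambda r: intersect_range(a, b, r[0], r[1]), bounds))
    return [doc_id for part in parts for doc_id in part]

term1 = [0, 1, 3, 4, 8, 9, 10, 11]
term2 = [0, 3, 6, 8, 9, 11, 12, 15]
matches = parallel_intersect(term1, term2, max_doc=16, chunks=4)
```

The number of chunks is chosen at query time, so nothing about the machine's core count is baked into the index.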
>>>
>>> It's the law of diminishing returns at play here. You're most likely
>>> to search in parallel over a mostly memory-resident index
>>> (RAMDir/mmap/filesystem cache, doesn't matter), as most IO subsystems
>>> tend to slow down considerably on parallel sequential reads, so you
>>> already have pretty decent speed.
>>> Searching different segments in parallel (with BSMP) makes you several
>>> times faster.
>>> Searching in parallel within a segment requires some weird hacks, but
>>> has maybe a few percent advantage over previous solution.
>>> Sharding posting lists requires a great deal of weird hacks, makes
>>> index machine-bound, and boosts speed by another couple of percent.
>>> Sounds worthless.
>>>
>>> --
>>> Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
>>> Phone: +7 (495) 683-567-4
>>> ICQ: 104465785
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>>
>>>
>>
>

