Please purchase and read Lucene in Action, 2nd Edition. It explains how all of this works and how to use Lucene efficiently.
http://www.lucidimagination.com/blog/2009/03/11/lia2/
http://www.lucidimagination.com/blog/2010/08/01/lucene-in-action-2d-edition-authors-round-table-podcast/
http://www.lucidimagination.com/Community/Hear-from-the-Experts/Podcasts-and-Videos/Lucene-in-Action

It is available from Manning as a downloadable e-book (PDF), so you don't have to wait for it to be shipped.

2010/12/30 Li Li <fancye...@gmail.com>:
> Is anyone familiar with MG4J (http://mg4j.dsi.unimi.it/)? It says:
> "Multithreading. Indices can be queried and scored concurrently."
> Maybe we can learn something from it.
>
> 2010/12/31 Li Li <fancye...@gmail.com>:
>> Plus: point 2 means that looking up a term requires many seeks in the
>> tis file (if the term is not cached in the tii).
>>
>> 2010/12/31 Li Li <fancye...@gmail.com>:
>>> Searching multiple segments is an alternative solution, but it has
>>> some disadvantages:
>>> 1. idf is not global (I am not familiar with the implementation);
>>> maybe this is easy to solve by sharing a global idf.
>>> 2. Each segment has its own tii and tis files, which may make search
>>> slower (that's why index optimization is necessary).
>>> 3. A term's doc list is distributed across many files rather than
>>> one. More than one frq file means the hard disk must seek across
>>> different tracks, which is time-consuming. If there is only one
>>> segment, the postings are likely stored in a single track.
>>>
>>> 2010/12/31 Earwin Burrfoot <ear...@gmail.com>:
>>>>>>> until we fix Lucene to run a single search concurrently (which we
>>>>>>> badly need to do).
>>>>> I am interested in this idea (I have posted about it before). Do
>>>>> you have any resources, such as papers or tech articles, about it?
>>>>> I tried it, but it required dramatic changes to the index format,
>>>>> and we use Solr distributed search to relieve the response-time
>>>>> problem, so I finally gave it up.
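[Li Li's first disadvantage above, per-segment idf, is usually addressed by summing document frequencies across segments before scoring. A minimal sketch, assuming a hypothetical Segment stand-in (not a Lucene class) that exposes per-segment docFreq and maxDoc:]

```java
import java.util.List;

public class GlobalIdf {
    // Hypothetical stand-in for one segment's statistics for a single term.
    static class Segment {
        final int docFreq;  // docs in this segment containing the term
        final int maxDoc;   // total docs in this segment
        Segment(int docFreq, int maxDoc) { this.docFreq = docFreq; this.maxDoc = maxDoc; }
    }

    // Classic Lucene-style idf = 1 + ln(numDocs / (docFreq + 1)),
    // computed from totals summed over all segments rather than per segment.
    static double globalIdf(List<Segment> segments) {
        long df = 0, numDocs = 0;
        for (Segment s : segments) {
            df += s.docFreq;
            numDocs += s.maxDoc;
        }
        return 1.0 + Math.log((double) numDocs / (df + 1));
    }

    public static void main(String[] args) {
        List<Segment> segs = List.of(new Segment(3, 100), new Segment(7, 400));
        // Globally: df = 10, numDocs = 500
        System.out.printf("%.4f%n", globalIdf(segs));
    }
}
```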
>>>>> Lucene 4's index format is more flexible in that it supports custom
>>>>> codecs, and it is now under development, so I think it's a good time
>>>>> to consider letting it support multithreaded searching for a single
>>>>> query.
>>>>> I have a naive solution: divide each doc list into several groups,
>>>>> e.g. grouping docIds by whether they are even or odd:
>>>>> term1 df1=4 docList = 0 4 8 10
>>>>> term1 df2=4 docList = 1 3 9 11
>>>>>
>>>>> term2 df1=4 docList = 0 6 8 12
>>>>> term2 df2=4 docList = 3 9 11 15
>>>>> Then we can use 2 threads to search the top-N docs on the even group
>>>>> and the odd group, and finally merge their results into a single
>>>>> list, just like Solr distributed search.
>>>>> But it's better than Solr distributed search.
>>>>> First, it runs in a single process, and data communication between
>>>>> threads is much faster than the network.
>>>>> Second, each thread processes the same number of documents. In Solr
>>>>> distributed search, one shard may process 7 documents while another
>>>>> processes 1. Even if we make each shard hold the same number of
>>>>> documents, we cannot make the distribution uniform for every term.
>>>>> E.g. shard1 has doc1, doc2 and shard2 has doc3, doc4, but term1 may
>>>>> occur only in doc1 and doc2, while term2 may occur only in doc3 and
>>>>> doc4.
>>>>> We may modify it so that shard1 has doc1, doc3 and shard2 has doc2,
>>>>> doc4. That's good for term1 and term2, but then term3 may occur only
>>>>> in doc1 and doc3...
>>>>> So I think this is fine-grained distribution inside the index, while
>>>>> Solr distributed search is coarse-grained.
>>>> This is just crazy :)
>>>>
>>>> The simple way is just to search different segments in parallel.
>>>> BalancedSegmentMergePolicy makes sure you have roughly even-sized
>>>> large segments (and small ones don't count, they're small!).
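[The "search different segments in parallel, then merge" approach can be sketched as follows. This is purely illustrative, not Lucene's API: each "segment" is just an array of precomputed scores, scored by its own task, with the per-segment top-N lists merged into a global top-N the same way Solr merges shard results. The Hit type and docBase mapping are assumptions for the sketch.]

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelSegments {
    record Hit(int globalDoc, float score) {}

    // "Score" one segment: docBase maps segment-local ids to global ids.
    static List<Hit> searchSegment(float[] scores, int docBase, int n) {
        List<Hit> hits = new ArrayList<>();
        for (int i = 0; i < scores.length; i++) hits.add(new Hit(docBase + i, scores[i]));
        hits.sort(Comparator.comparingDouble((Hit h) -> h.score()).reversed());
        return hits.subList(0, Math.min(n, hits.size()));
    }

    // Submit one task per segment, then merge the partial top-N lists.
    static List<Hit> search(List<float[]> segments, int n) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(segments.size());
        try {
            List<Future<List<Hit>>> futures = new ArrayList<>();
            int docBase = 0;
            for (float[] seg : segments) {
                final int base = docBase;
                futures.add(pool.submit(() -> searchSegment(seg, base, n)));
                docBase += seg.length;
            }
            List<Hit> all = new ArrayList<>();
            for (Future<List<Hit>> f : futures) all.addAll(f.get());
            all.sort(Comparator.comparingDouble((Hit h) -> h.score()).reversed());
            return all.subList(0, Math.min(n, all.size()));
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<float[]> segments = List.of(
            new float[] {0.2f, 0.9f, 0.1f},  // segment 0: global docs 0-2
            new float[] {0.8f, 0.3f});       // segment 1: global docs 3-4
        for (Hit h : search(segments, 2)) System.out.println(h.globalDoc() + " " + h.score());
    }
}
```

[Note that scoring stays embarrassingly parallel because each segment is self-contained; only the cheap final merge is serial.]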
>>>> If you're bound on squeezing out that extra millisecond (and making
>>>> your life miserable along the way), you can search a single segment
>>>> with multiple threads (by dividing it into even chunks, and then
>>>> doing skipTo to position your iterators at the beginning of each
>>>> chunk).
>>>>
>>>> The first approach is really easy to implement. The second one is
>>>> harder, but still doesn't require you to cook the number of available
>>>> CPU cores into your index!
>>>>
>>>> It's the law of diminishing returns at play here. You're most likely
>>>> to search in parallel over a mostly memory-resident index
>>>> (RAMDir/mmap/filesystem cache - doesn't matter), as most IO
>>>> subsystems tend to slow down considerably on parallel sequential
>>>> reads, so you already have pretty decent speed.
>>>> Searching different segments in parallel (with BSMP) makes you
>>>> several times faster.
>>>> Searching in parallel within a segment requires some weird hacks, but
>>>> has maybe a few percent advantage over the previous solution.
>>>> Sharding posting lists requires a great deal of weird hacks, makes
>>>> the index machine-bound, and boosts speed by another couple of
>>>> percent. Sounds worthless.
>>>>
>>>> --
>>>> Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
>>>> Phone: +7 (495) 683-567-4
>>>> ICQ: 104465785
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>>> For additional commands, e-mail: dev-h...@lucene.apache.org

--
Lance Norskog
goks...@gmail.com
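[The second approach, chunking a single segment's doc-id space and positioning each thread's iterator with skipTo, can be sketched as below. PostingList is an illustrative stand-in for Lucene's postings iterator, not the real class; in a real searcher each chunk would run in its own thread over its own iterator instance.]

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkedPostings {
    static class PostingList {
        private final int[] docs;  // sorted docIds containing the term
        private int pos = 0;
        PostingList(int[] docs) { this.docs = docs; }
        // skipTo: move to the first doc >= target (like Lucene's advance()).
        int skipTo(int target) {
            while (pos < docs.length && docs[pos] < target) pos++;
            return pos < docs.length ? docs[pos] : Integer.MAX_VALUE;
        }
        int next() {
            pos++;
            return pos < docs.length ? docs[pos] : Integer.MAX_VALUE;
        }
    }

    // Collect the matching docs of one chunk [chunkStart, chunkEnd):
    // skipTo jumps to the chunk start, then we iterate until the chunk end.
    static List<Integer> collectChunk(int[] postings, int chunkStart, int chunkEnd) {
        PostingList it = new PostingList(postings);
        List<Integer> found = new ArrayList<>();
        for (int doc = it.skipTo(chunkStart); doc < chunkEnd; doc = it.next()) {
            found.add(doc);
        }
        return found;
    }

    public static void main(String[] args) {
        int[] postings = {0, 4, 8, 10, 13, 21};
        int maxDoc = 24, chunks = 2, chunkSize = maxDoc / chunks;
        for (int c = 0; c < chunks; c++) {
            System.out.println("chunk " + c + ": "
                + collectChunk(postings, c * chunkSize, (c + 1) * chunkSize));
        }
    }
}
```

[This is why the index needn't know the core count: the chunk boundaries are chosen at query time, and skipTo does the positioning.]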