Re: anyone has interests about mg4j's new integer compression algorithm?

Robert Muir Fri, 06 Jul 2012 03:06:59 -0700

I reviewed the benchmarking code on his website very quickly:

* I don't like his NullCollector: it returns false from
acceptsDocsOutOfOrder(), but it does nothing except count hits. By
returning false, he declares that the collector cares about docid order
(which it doesn't), and that prevents the use of BooleanScorer. He could
just use TotalHitCountCollector:
http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/search/TotalHitCountCollector.java
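
To illustrate why a counting collector has no reason to care about docid
order, here is a minimal, self-contained sketch. It does not use Lucene:
SketchCollector is a hypothetical stand-in that only borrows the
collect() and acceptsDocsOutOfOrder() method names from Lucene's
Collector, and countHits() is an invented helper that plays the role of
the searcher.

```java
import java.util.Arrays;
import java.util.List;

class CountingCollectorSketch {

    // Hypothetical stand-in for the shape of Lucene's Collector (NOT the real class).
    abstract static class SketchCollector {
        // Called once per matching document id.
        abstract void collect(int doc);

        // true means the caller may deliver doc ids in any order.
        abstract boolean acceptsDocsOutOfOrder();
    }

    // Only counts hits, so it can safely accept doc ids out of order.
    static class CountingCollector extends SketchCollector {
        private int totalHits;

        @Override
        void collect(int doc) {
            totalHits++; // the doc id itself is never inspected
        }

        @Override
        boolean acceptsDocsOutOfOrder() {
            return true;
        }

        int getTotalHits() {
            return totalHits;
        }
    }

    // Invented helper: feeds doc ids to a fresh collector in the given order.
    static int countHits(List<Integer> docs) {
        CountingCollector collector = new CountingCollector();
        for (int doc : docs) {
            collector.collect(doc);
        }
        return collector.getTotalHits();
    }

    public static void main(String[] args) {
        int inOrder = countHits(Arrays.asList(1, 4, 7, 9));
        int outOfOrder = countHits(Arrays.asList(7, 1, 9, 4));
        System.out.println(inOrder + " " + outOfOrder);
    }
}
```

Both calls count 4 hits, so main prints "4 4": the count is the same
whichever order the doc ids arrive in, which is why returning false from
acceptsDocsOutOfOrder() buys such a collector nothing.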


* I'm not sure I like that he uses SpanNearQuery for the 'proximity
window' benchmarking. For just a list of terms, I think a sloppy
PhraseQuery (a PhraseQuery with a slop) is the more natural choice and
would be faster: "foo bar baz"~5 or whatever.

On Fri, Jul 6, 2012 at 5:53 AM, Dawid Weiss
<[email protected]> wrote:
> That 4.0 is significantly faster than 3.6 for this benchmark and there
> were minor glitches in the benchmarking code itself.
>
> Dawid
>
> On Fri, Jul 6, 2012 at 11:47 AM, Li Li <[email protected]> wrote:
>> I can understand these quotes. What's the conclusion from your communication?
>>
>> On Fri, Jul 6, 2012 at 4:20 PM, Dawid Weiss
>> <[email protected]> wrote:
>>> I've repeated Sebastiano's experiments (and so did he). A few quotes
>>> from the communication.
>>>
>>>> The index appears to be larger now--43.1GB. Probably they have better 
>>>> skipping structures that take more space.
>>>>
>>>> From what I can see the format is the same as before--the .frq file 
>>>> contains document pointers and positions. So my SearchFiles class still 
>>>> reads documents *and* counts.
>>>>
>>>> But the most interesting part I've read in a blog is that now Lucene has a 
>>>> pluggable index format. This means that someone can actually write a QS 
>>>> index for Lucene and test what happens in production. That's a most 
>>>> interesting change!
>>>
>>> and:
>>>
>>>> Well, they did a great job:
>>>>
>>>> trec-40-text    unscored    terms     result:   5511    494901
>>>> trec-40-text    unscored    and       result:   2193    769110
>>>> trec-40-text    unscored    phrase    result:   6615    148663
>>>> trec-40-text    unscored    spans     result:  12407    545090
>>>>
>>>> So conjunction is still better, but by a much smaller margin. The worst
>>>> part is term scanning--they are now significantly faster than QS indices.
>>>
>>> Dawid
>>>
>>>
>>>
>>> On Sun, Jun 24, 2012 at 9:31 AM, Dawid Weiss
>>> <[email protected]> wrote:
>>>> Fyi. I contacted Sebastiano and will get hold of the data set and
>>>> benchmarks he used to repeat his experiment with current trunk
>>>> (curiosity). Any hints on which configuration should be used will be
>>>> welcome.
>>>>
>>>> Dawid
>>>>
>>>> On Sat, Jun 23, 2012 at 12:38 PM, Li Li <[email protected]> wrote:
>>>>> http://mg4j.di.unimi.it/
>>>>> http://vigna.di.unimi.it/papers.php#VigQSI
>>>>>
>>>>> sounds very interesting and attractive.
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: [email protected]
>>>>> For additional commands, e-mail: [email protected]
>>>>>



-- 
lucidimagination.com
