I reviewed the benchmarking code on his website very quickly:

* I don't like his NullCollector: its acceptsDocsOutOfOrder() returns false, but it's doing nothing but counting. By returning false here, he is declaring that the collector cares about docid order (which it doesn't), and that prevents the use of BooleanScorer. He could just use TotalHitCountCollector: http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/search/TotalHitCountCollector.java
* I'm not sure I like that he uses SpanNearQuery for the 'proximity window' benchmarking. For just a list of terms, I think a sloppy PhraseQuery (a PhraseQuery with slop) is the more natural choice and would be faster: "foo bar baz"~5 or whatever.

On Fri, Jul 6, 2012 at 5:53 AM, Dawid Weiss <[email protected]> wrote:
> That 4.0 is significantly faster than 3.6 for this benchmark and there
> were minor glitches in the benchmarking code itself.
>
> Dawid
>
> On Fri, Jul 6, 2012 at 11:47 AM, Li Li <[email protected]> wrote:
>> I can understand these quotes. what's the conclusion from your communication?
>>
>> On Fri, Jul 6, 2012 at 4:20 PM, Dawid Weiss
>> <[email protected]> wrote:
>>> I've repeated Sebastiano's experiments (and so did he). A few quotes
>>> from the communication.
>>>
>>>> The index appears to be larger now--43.1GB. Probably they have better
>>>> skipping structures that take more space.
>>>>
>>>> From what I can see the format is the same as before--the .frq file
>>>> contains document pointers and positions. So my SearchFiles class still
>>>> reads documents *and* counts.
>>>>
>>>> But the most interesting part I've read in a blog is that now Lucene has a
>>>> pluggable index format. This means that someone can actually write a QS
>>>> index for Lucene and test what happens in production. That's a most
>>>> interesting change!
>>>
>>> and:
>>>
>>>> Well, they made a great job:
>>>>
>>>> trec-40-text unscored terms result: 5511 494901
>>>> trec-40-text unscored and result: 2193 769110
>>>> trec-40-text unscored phrase result: 6615 148663
>>>> trec-40-text unscored spans result: 12407 545090
>>>>
>>>> So conjunction is still better, but by a really smaller margin. The worst
>>>> part is term scanning--they are now significantly faster than QS indices.
>>>
>>> Dawid
>>>
>>> On Sun, Jun 24, 2012 at 9:31 AM, Dawid Weiss
>>> <[email protected]> wrote:
>>>> Fyi. I contacted Sebastiano and will get hold of the data set and
>>>> benchmarks he used to repeat his experiment with current trunk
>>>> (curiosity). Any hints on which configuration should be used will be
>>>> welcome.
>>>>
>>>> Dawid
>>>>
>>>> On Sat, Jun 23, 2012 at 12:38 PM, Li Li <[email protected]> wrote:
>>>>> http://mg4j.di.unimi.it/
>>>>> http://vigna.di.unimi.it/papers.php#VigQSI
>>>>>
>>>>> sounds very interesting and attractive.
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: [email protected]
>>>>> For additional commands, e-mail: [email protected]

--
lucidimagination.com
