[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13288675#comment-13288675 ]
Michael McCandless commented on LUCENE-3892:
--------------------------------------------

Excellent! All tests also pass for me w/ PFor postings format as well... this is a great starting point :)

One Solr test failed (ContentStreamTest)... but I think it was a false failure...

I did notice the tests seem to run slower, especially certain ones, eg TestJoinUtil.

Still missing a couple of license headers (TestMin, TestCompress)...

I ran a quick perf test using http://code.google.com/a/apache-extras.org/p/luceneutil on a 10M doc Wikipedia index. Indexing time is ~18% slower than Lucene40PostingsFormat (1261 sec vs 1071 sec). But more important is the slower search times:

{noformat}
Task            QPS base  StdDev base  QPS pfor  StdDev pfor  Pct diff
Phrase              8.52         0.50      4.43         0.40  -55% - -39%
SloppyPhrase       12.52         0.39      7.87         0.51  -43% - -30%
AndHighMed         67.69         2.82     44.22         1.47  -39% - -29%
SpanNear            5.19         0.12      3.90         0.28  -31% - -17%
PKLookup          112.16         1.71     95.61         1.30  -17% - -12%
AndHighHigh        13.22         0.34     11.86         0.72  -17% -  -2%
Wildcard           46.04         0.37     41.68         4.45  -19% -   1%
Fuzzy1             50.11         2.03     48.06         1.91  -11% -   3%
OrHighMed           9.26         0.48      8.90         0.37  -12% -   5%
OrHighHigh         12.28         0.56     11.83         0.49  -11% -   5%
TermBGroup1M1P     40.47         1.94     39.88         2.51  -11% -  10%
Fuzzy2             53.71         2.66     53.01         2.08   -9% -   7%
TermGroup1M        36.46         1.21     35.99         1.58   -8% -   6%
TermBGroup1M       55.53         1.99     55.26         2.68   -8% -   8%
Respell            69.71         4.49     69.73         2.07   -8% -  10%
Term               94.38         7.62     94.96        12.19  -18% -  23%
Prefix3            41.63         0.34     42.21         5.82  -13% -  16%
IntNRQ              7.08         0.15      7.28         1.29  -17% -  23%
{noformat}

The queries that do skipping are quite a bit slower; this makes sense, since on skip we do a full block decode. A smaller block size (we use 128 now, right?) should help, I think.

It's strange that the non-skipping queries (Term, OrHighMed, OrHighHigh) don't show any performance gain... maybe we need to optimize the decode... or it could be that the removal of the bulk api is hurting us here.
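
To make the skipping cost concrete, here is a rough sketch of fixed-size block packing with a single bit width per block. It is not the code from the attached patches: the class `ForBlockSketch` and all of its method names are hypothetical, the block size of 128 is taken from the guess above, and doc deltas are assumed to be non-negative. The point is only that `advance` has to unpack every value in the block before it can scan for the target.

```java
/** Hypothetical illustration only; not code from the LUCENE-3892 patches. */
final class ForBlockSketch {
    /** Assumed block size (the comment above guesses 128). */
    static final int BLOCK_SIZE = 128;

    /** Bits needed for the largest value in the block; assumes non-negative values. */
    static int bitsRequired(int[] values) {
        int or = 0;
        for (int v : values) {
            or |= v; // OR-ing keeps the highest set bit of any value
        }
        return Math.max(1, 32 - Integer.numberOfLeadingZeros(or));
    }

    /** Packs each value into numBits bits, back to back, inside a long[]. */
    static long[] encode(int[] values, int numBits) {
        long[] packed = new long[(values.length * numBits + 63) / 64];
        for (int i = 0; i < values.length; i++) {
            long bitPos = (long) i * numBits;
            int word = (int) (bitPos >>> 6);
            int shift = (int) (bitPos & 63);
            long v = values[i] & 0xFFFFFFFFL;
            packed[word] |= v << shift;
            if (shift + numBits > 64) {
                // Value straddles two longs: spill the high bits into the next word.
                packed[word + 1] |= v >>> (64 - shift);
            }
        }
        return packed;
    }

    /** Unpacks all count values; the cost is proportional to the block size. */
    static int[] decode(long[] packed, int numBits, int count) {
        int[] values = new int[count];
        long mask = (1L << numBits) - 1;
        for (int i = 0; i < count; i++) {
            long bitPos = (long) i * numBits;
            int word = (int) (bitPos >>> 6);
            int shift = (int) (bitPos & 63);
            long v = packed[word] >>> shift;
            if (shift + numBits > 64) {
                v |= packed[word + 1] << (64 - shift);
            }
            values[i] = (int) (v & mask);
        }
        return values;
    }

    /**
     * Returns the first doc >= target within one block of doc deltas, or
     * Integer.MAX_VALUE if the block has none. Even when the target is among
     * the first few entries, the whole block is decoded first.
     */
    static int advance(long[] packed, int numBits, int count, int blockBaseDoc, int target) {
        int[] deltas = decode(packed, numBits, count); // full block decode on every skip
        int doc = blockBaseDoc;
        for (int delta : deltas) {
            doc += delta;
            if (doc >= target) {
                return doc;
            }
        }
        return Integer.MAX_VALUE;
    }
}
```

In this sketch a single large delta (say 1000 among values below 8) forces 10 bits for every entry in the block, and a smaller `BLOCK_SIZE` directly reduces the work done per `advance` call.
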
I'm also curious whether the results would improve if we tried a pure FOR (no patching, so we must set numBits according to the max value: a larger index but hopefully faster decode)...

> Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3892
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3892
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.1
>
>         Attachments: LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_settings.patch, LUCENE-3892_settings.patch
>
>
> On the flex branch we explored a number of possible intblock
> encodings, but for whatever reason never brought them to completion.
> There are still a number of issues opened with patches in different
> states.
> Initial results (based on prototype) were excellent (see
> http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html
> ).
> I think this would make a good GSoC project.