[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
Uwe Schindler commented on LUCENE-3892 Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.) We should keep the size of methods small, as bigger methods work against the code cache of hotspot and if Lucene is not used alone, may get de-optimized. This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438728#comment-13438728 ]

Robert Muir commented on LUCENE-3892:

Thanks Billy for all the hard work and endless benchmarking, so nice to have a block codec that is simple and clean and reuses our packed ints optimizations.

Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

Key: LUCENE-3892
URL: https://issues.apache.org/jira/browse/LUCENE-3892
Project: Lucene - Core
Issue Type: Improvement
Reporter: Michael McCandless
Labels: gsoc2012, lucene-gsoc-12
Fix For: 5.0, 4.0
Attachments: LUCENE-3892-blockForhardcode(base).patch, LUCENE-3892-blockForpackedecoder(comp).patch, LUCENE-3892-blockFor-with-packedints-decoder.patch, LUCENE-3892-blockFor-with-packedints-decoder.patch, LUCENE-3892-blockFor-with-packedints.patch, LUCENE-3892-blockpfor.patch, LUCENE-3892-BlockTermScorer.patch, LUCENE-3892-bulkVInt.patch, LUCENE-3892-direct-IntBuffer.patch, LUCENE-3892_for_byte[].patch, LUCENE-3892_for_int[].patch, LUCENE-3892-forpfor-with-javadoc.patch, LUCENE-3892_for_unfold_method.patch, LUCENE-3892-handle_open_files.patch, LUCENE-3892-javadocs.patch, LUCENE-3892-non-specialized.patch, LUCENE-3892-pfor-compress-iterate-numbits.patch, LUCENE-3892-pfor-compress-slow-estimate.patch, LUCENE-3892_pfor_unfold_method.patch, LUCENE-3892_pulsing_support.patch, LUCENE-3892_settings.patch, LUCENE-3892_settings.patch, LUCENE-3892-trunk.patch

On the flex branch we explored a number of possible intblock encodings, but for whatever reason never brought them to completion. There are still a number of issues opened with patches in different states. Initial results (based on prototype) were excellent (see http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html ). I think this would make a good GSoC project.
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13437825#comment-13437825 ]

Michael McCandless commented on LUCENE-3892:

Woops sorry!
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13435033#comment-13435033 ]

Michael McCandless commented on LUCENE-3892:

Uwe just started builds for this branch (thanks!): http://jenkins.sd-datasolutions.de/job/pforcodec-3892-branch
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13433143#comment-13433143 ]

Adrien Grand commented on LUCENE-3892:

bq. (From mailing-list) So I think if it's this ambiguous for wikipedia we should shoot for the most COMPACT form as a safe default.

+1 too. I just committed the change.
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13432716#comment-13432716 ]

Adrien Grand commented on LUCENE-3892:

I ran the comparison between acceptableOverheadRatio=PackedInts.COMPACT (0%) and PackedInts.DEFAULT (20%) and it seems to be much faster with PackedInts.COMPACT:

{noformat}
base=COMPACT, challenger=DEFAULT
Task              QPS base StdDev base   QPS def  StdDev def   Pct diff
IntNRQ               81.83        5.43     74.14        2.94   -18% - 0%
HighTerm            146.55       10.34    133.57        9.02   -20% - 4%
LowPhrase            93.91        1.63     86.90        1.67   -10% - -4%
MedTerm             824.58       43.48    766.35       38.78   -16% - 3%
LowSloppyPhrase      83.29        1.99     77.65        1.18   -10% - -3%
OrHighMed            94.15        5.28     88.34        4.54   -15% - 4%
OrHighHigh          100.63        5.42     94.57        4.20   -14% - 3%
OrHighLow           128.62        7.21    120.92        6.07   -15% - 4%
HighPhrase           13.05        0.45     12.29        0.39   -11% - 0%
Prefix3             217.06        6.82    205.05        4.62   -10% - 0%
MedPhrase            27.50        0.97     26.33        0.79   -10% - 2%
Wildcard            183.20        4.87    175.58        3.89   -8% - 0%
LowTerm            1763.31       43.24   1693.31       39.29   -8% - 0%
HighSloppyPhrase     10.05        0.48      9.67        0.40   -11% - 5%
AndHighHigh         111.59        1.15    107.45        1.66   -6% - -1%
LowSpanNear          56.16        1.32     54.25        1.01   -7% - 0%
AndHighMed          423.44        7.40    409.32        5.10   -6% - 0%
MedSpanNear          33.14        0.91     32.32        0.74   -7% - 2%
AndHighLow         2177.50       30.79   2134.05       28.64   -4% - 0%
Fuzzy1               95.34        2.41     93.66        2.32   -6% - 3%
HighSpanNear          5.28        0.17      5.21        0.11   -6% - 3%
MedSloppyPhrase      18.41        0.72     18.19        0.70   -8% - 6%
Fuzzy2               37.73        1.31     37.31        1.14   -7% - 5%
Respell             109.71        3.09    108.64        2.76   -6% - 4%
PKLookup            257.32        6.64    260.00        7.15   -4% - 6%
{noformat}
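To make the meaning of the two ratios concrete, here is a minimal sketch of how an overhead budget could translate into a wider, byte-aligned width. It is a simplified model and not the code of {{PackedInts.fastestFormatAndBits}}: the class name, the method names and the rounding rule are assumptions made for illustration, and the {{PACKED_SINGLE_BLOCK}} and 24-bit cases mentioned elsewhere in this thread are left out. The two ratio constants mirror the 0% and 20% quoted in the comment.

```java
final class OverheadBudgetSketch {
    // Ratios as quoted in the comment above (0% and 20%); hypothetical constants.
    static final float COMPACT = 0f;
    static final float DEFAULT = 0.2f;

    /** Largest bits-per-value that the overhead budget permits (assumed rule). */
    static int maxBitsPerValue(int bitsPerValue, float acceptableOverheadRatio) {
        return bitsPerValue + (int) (bitsPerValue * acceptableOverheadRatio);
    }

    /** Rounds up to 8, 16, 32 or 64 bits when the budget covers it, else keeps the exact width. */
    static int chooseBitsPerValue(int bitsPerValue, float acceptableOverheadRatio) {
        int max = maxBitsPerValue(bitsPerValue, acceptableOverheadRatio);
        for (int aligned : new int[] {8, 16, 32, 64}) {
            if (bitsPerValue <= aligned && aligned <= max) {
                return aligned;
            }
        }
        return bitsPerValue;
    }

    public static void main(String[] args) {
        for (int bpv : new int[] {5, 7, 13, 14}) {
            System.out.println(bpv + " bits: COMPACT -> " + chooseBitsPerValue(bpv, COMPACT)
                + ", DEFAULT -> " + chooseBitsPerValue(bpv, DEFAULT));
        }
    }
}
```

Under this model a 0% budget always keeps the exact width, so the blocks are as small as possible, while a 20% budget trades some space for aligned widths. The benchmark above suggests that, for this postings format, the smaller blocks won.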
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13431709#comment-13431709 ]

Adrien Grand commented on LUCENE-3892:

The comment you added in 1371011 on the value of {{BLOCK_SIZE}} caught my attention: I think that BLOCK_SIZE should be at least 64 with PackedInts encoding/decoding since these conversions are long-aligned (I backported your two commits and added a comment about this). For example, the {{PACKED}} 7-bit encoder cannot encode fewer than 64 values in one iteration.

In case someone really wants to use smaller block sizes (e.g. 32), I think it should still perform pretty well if {{acceptableOverheadRatio = ~25%}} (in that case, all bits-per-value in the [1-24] range use either a {{PACKED_SINGLE_BLOCK}} encoder or an 8-bit, 16-bit or 24-bit {{PACKED}} decoder).

Do we plan to make the block size configurable?
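The long-alignment argument can be checked with a little arithmetic. The sketch below is only a model of the constraint described in the comment, not Lucene code; the class and method names are hypothetical. It assumes one iteration of a {{PACKED}} encoder must consume a whole number of 64-bit longs, so it handles the smallest value count whose total bit length is a multiple of 64.

```java
final class PackedIterationSketch {
    static int gcd(int a, int b) {
        return b == 0 ? a : gcd(b, a % b);
    }

    /** Smallest number of values whose total bit length is a multiple of 64. */
    static int valuesPerIteration(int bitsPerValue) {
        return 64 / gcd(64, bitsPerValue);
    }

    /** Number of longs that one such iteration reads or writes. */
    static int longsPerIteration(int bitsPerValue) {
        return bitsPerValue / gcd(64, bitsPerValue);
    }

    /** True when a block of blockSize values is a whole number of iterations. */
    static boolean fitsBlock(int blockSize, int bitsPerValue) {
        return blockSize % valuesPerIteration(bitsPerValue) == 0;
    }

    public static void main(String[] args) {
        for (int bpv : new int[] {7, 8, 16, 24}) {
            System.out.println(bpv + " bits: " + valuesPerIteration(bpv) + " values in "
                + longsPerIteration(bpv) + " long(s); fits 32? " + fitsBlock(32, bpv)
                + ", fits 128? " + fitsBlock(128, bpv));
        }
    }
}
```

At 7 bits per value one iteration covers 64 values in 7 longs, so a block of 32 values is not a whole iteration, while 8, 16 and 24 bits need only 8, 4 and 8 values per iteration and fit a block of 32. A block size that is a multiple of 64 fits every width.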
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13431733#comment-13431733 ]

Michael McCandless commented on LUCENE-3892:

Thanks Adrien. So now we just have to replace Block with BlockPacked right?

OK let's just fix the comment to be multiple of 64. I don't think we need to make BLOCK_SIZE configurable.
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13431735#comment-13431735 ]

Adrien Grand commented on LUCENE-3892:

bq. So now we just have to replace Block with BlockPacked right?

Yes, I think so.

bq. I don't think we need to make BLOCK_SIZE configurable.

In that case, should we also hard-code the value of {{acceptableOverheadRatio}}?
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13431742#comment-13431742 ]

Michael McCandless commented on LUCENE-3892:

Actually let's hold off a bit on replacing Block w/ BlockPacked: Billy was going to do some more tests with PFOR...

bq. In that case, should we also hard-code the value of acceptableOverheadRatio?

Hmm that one seems more compelling to let apps change?
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13431753#comment-13431753 ]

Michael McCandless commented on LUCENE-3892:

Shouldn't MIN_ENCODED_SIZE be MAX_ENCODED_SIZE? Ie the max number of bytes encoding will ever require. And I think the same for MIN_DATA_SIZE -> MAX_DATA_SIZE? Or maybe MIN_REQUIRED_XXX?

I think readVIntBlock shouldn't be in ForUtil? Ie it's very postings-format-specific and it's not using packed ints at all. Also the equivalent readVIntBlock code for the positions case (in the readPositions methods) is still in the BlockPackedPostingsReader. I think it's great to have writeBlock/readBlock/skipBlock in ForUtil.

Do we really need to write/write the 32 format.getId(), numBits into the postings file header? I guess it's either that or ... store the float acceptableOverheadRatio (eg using Float.floatToIntBits I guess) and have some back-compat enforced in the logic in PackedInts.fastestFormatAndBits... hmm.

Hmm ... MIN_DATA_SIZE is 147 (PACKED_SINGLE_BLOCK, bpv=3), but BLOCK_SIZE is 128 ... so I guess this means if we ever pick that format (because acceptableOverheadRatio allowed us to), we're encoding/decoding those extra 19 unused ints right? (I was just trying to understand why we alloc all the int[] to MIN_DATA_SIZE not BLOCK_SIZE...).

ForUtil.getMinRequiredBufferSize seems like dead code?
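The figure of 147 follows from the {{PACKED_SINGLE_BLOCK}} layout, in which each 64-bit long holds a whole number of values and no value spans two longs. A small sketch of that arithmetic (the class and method names are hypothetical and used only for illustration):

```java
final class SingleBlockPaddingSketch {
    /** Values that fit in one long when no value may span two longs. */
    static int valuesPerLong(int bitsPerValue) {
        return 64 / bitsPerValue;
    }

    /** Longs needed to hold blockSize values. */
    static int longsNeeded(int blockSize, int bitsPerValue) {
        int perLong = valuesPerLong(bitsPerValue);
        return (blockSize + perLong - 1) / perLong; // ceiling division
    }

    /** Values produced when every needed long is decoded in full. */
    static int decodedCapacity(int blockSize, int bitsPerValue) {
        return longsNeeded(blockSize, bitsPerValue) * valuesPerLong(bitsPerValue);
    }

    public static void main(String[] args) {
        int blockSize = 128;
        int bpv = 3;
        int capacity = decodedCapacity(blockSize, bpv);
        System.out.println(valuesPerLong(bpv) + " values per long, "
            + longsNeeded(blockSize, bpv) + " longs, capacity " + capacity
            + ", unused " + (capacity - blockSize));
    }
}
```

With bpv=3 a long holds 21 values, 128 values need 7 longs, and decoding 7 full longs yields 147 values, 19 of which are unused. That is why the int[] buffers are sized to 147 rather than 128.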
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13431762#comment-13431762 ]

Han Jiang commented on LUCENE-3892:

Thank you Adrien! The BlockPacked PF also worked well on my computer :)

{noformat}
Task              QPS base StdDev base  QPS packed  StdDev packed   Pct diff
AndHighHigh         122.57        3.01      123.90           2.49   -3% - 5%
AndHighLow         2260.53       21.18     2273.77          55.09   -2% - 3%
AndHighMed          328.01        8.18      329.31          11.36   -5% - 6%
Fuzzy1               86.37        0.94       86.24           2.12   -3% - 3%
Fuzzy2               31.40        0.46       31.22           0.64   -4% - 2%
HighPhrase            9.09        0.51        9.15           0.40   -8% - 11%
HighSloppyPhrase      5.30        0.25        5.34           0.08   -5% - 7%
HighSpanNear         10.11        0.44       10.42           0.34   -4% - 11%
HighTerm            179.43        7.26      178.96           5.70   -7% - 7%
IntNRQ               61.87        3.79       60.59           4.31   -14% - 11%
LowPhrase            41.23        1.54       42.97           1.32   -2% - 11%
LowSloppyPhrase      62.83        2.11       68.23           0.99   3% - 14%
LowSpanNear          81.28        2.74       85.74           2.67   -1% - 12%
LowTerm            1763.70       29.21     1778.41          23.07   -2% - 3%
MedPhrase            27.06        1.16       27.54           0.88   -5% - 9%
MedSloppyPhrase      31.82        1.16       33.70           0.14   1% - 10%
MedSpanNear          23.09        0.93       23.84           0.79   -4% - 11%
MedTerm             659.09       22.65      671.54          19.79   -4% - 8%
OrHighHigh           27.36        0.52       27.41           1.25   -6% - 6%
OrHighLow           154.99        2.07      156.20           7.08   -5% - 6%
OrHighMed           105.13        1.52      105.30           4.65   -5% - 6%
PKLookup            210.64        6.95      217.57           2.08   0% - 7%
Prefix3             170.22        6.22      166.80           4.18   -7% - 4%
Respell              83.96        1.47       83.75           1.25   -3% - 3%
Wildcard            155.08        4.31      155.31           3.12   -4% - 5%
{noformat}
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13431764#comment-13431764 ]

Michael McCandless commented on LUCENE-3892:

I think, for a fair test, we should also test w/ acceptableOverheadRatio=0 ... I'll run that.
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13431767#comment-13431767 ]

Adrien Grand commented on LUCENE-3892:

bq. Shouldn't MIN_ENCODED_SIZE be MAX_ENCODED_SIZE?

I prefixed it with MIN because it is the minimum size the encoded buffer must have to be able to handle all cases. But I think you are right, MAX or REQUIRED would be clearer.

bq. I think readVIntBlock shouldn't be in ForUtil?

I'll move it back to BlockPackedPostingsReader.

{quote}
Do we really need to write/read the 32 format.getId(), numBits into the postings file header? I guess it's either that or ... store the float acceptableOverheadRatio (eg using Float.floatToIntBits I guess) and have some back-compat enforced in the logic in PackedInts.fastestFormatAndBits... hmm.
{quote}

I hesitated between these two approaches but I think writing all cases to the header is less error-prone? Moreover it would allow us to change the logic of {{fastestFormatAndBits}} without having to bump the version number.

{quote}
Hmm ... MIN_DATA_SIZE is 147 (PACKED_SINGLE_BLOCK, bpv=3), but BLOCK_SIZE is 128 ... so I guess this means if we ever pick that format (because acceptableOverheadRatio allowed us to), we're encoding/decoding those extra 19 unused ints right? (I was just trying to understand why we alloc all the int[] to MIN_DATA_SIZE not BLOCK_SIZE...).
{quote}

Exactly. The other problem is that we are also storing these unnecessary 19 values (but it is not easy to fix since PACKED_SINGLE_BLOCK writes values in the low-order long bits first (little endian)). Maybe we should make PACKED_SINGLE_BLOCK write values in the high-order bits first and split byte encoders and decoders from the long ones (so that they have a lower {{valueCount()}}).

bq. ForUtil.getMinRequiredBufferSize seems like dead code?

I'll remove it.
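The 147-versus-128 arithmetic discussed in the comment above can be checked with a small sketch. It assumes that PACKED_SINGLE_BLOCK never lets a value straddle a 64-bit long, so each long holds floor(64 / bpv) values and a block is padded up to a whole number of longs. The class and method names are hypothetical and are not part of Lucene.

```java
// Hypothetical helper, not Lucene code: reproduces the padding arithmetic
// under the assumption that no value crosses a 64-bit long boundary.
final class SingleBlockPaddingSketch {

    /** Number of values that fit into one long when none may straddle it. */
    static int valuesPerLong(int bitsPerValue) {
        return 64 / bitsPerValue;
    }

    /** Value count after rounding blockSize up to a whole number of longs. */
    static int paddedValueCount(int blockSize, int bitsPerValue) {
        int perLong = valuesPerLong(bitsPerValue);
        int longs = (blockSize + perLong - 1) / perLong; // ceiling division
        return longs * perLong;
    }

    public static void main(String[] args) {
        int blockSize = 128;
        int bpv = 3;
        int padded = paddedValueCount(blockSize, bpv);
        System.out.println("values per long:  " + valuesPerLong(bpv));
        System.out.println("padded values:    " + padded);
        System.out.println("unused values:    " + (padded - blockSize));
    }
}
```

With bpv=3 this gives 21 values per long and 7 longs, hence 147 slots for a 128-value block, which matches the 19 unused ints mentioned in the thread.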
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13431882#comment-13431882 ]

Han Jiang commented on LUCENE-3892:

I revived the PFor code, and tested it against BlockFor and BlockPacked:

BlockFor as base:
{noformat}
Task               QPS base  StdDev base  QPS pfor  StdDev pfor  Pct diff
AndHighHigh          121.54         1.37    116.69         2.03   -6% - -1%
AndHighLow          2286.36        14.19   2212.92        11.48   -4% - -2%
AndHighMed           322.97         7.37    294.19         4.76  -12% - -5%
Fuzzy1                85.56         1.46     87.97         3.27   -2% - 8%
Fuzzy2                30.94         0.56     32.16         1.34   -2% - 10%
HighPhrase             9.39         0.38      9.02         0.45  -12% - 5%
HighSloppyPhrase       5.38         0.08      5.24         0.12   -6% - 1%
HighSpanNear          10.38         0.39      9.92         0.08   -8% - 0%
HighTerm             180.30         6.87    172.83         6.26  -11% - 3%
IntNRQ                62.01         3.73     60.89         3.54  -12% - 10%
LowPhrase             42.44         0.67     38.73         0.89  -12% - -5%
LowSloppyPhrase       62.82         0.79     56.79         0.43  -11% - -7%
LowSpanNear           81.79         2.00     74.10         1.13  -12% - -5%
LowTerm             1763.95        39.62   1721.30        34.22   -6% - 1%
MedPhrase             27.87         0.59     25.82         0.74  -11% - -2%
MedSloppyPhrase       32.15         0.41     29.91         0.31   -9% - -4%
MedSpanNear           23.48         0.71     22.00         0.05   -9% - -3%
MedTerm              662.11        24.22    638.81        19.31   -9% - 3%
OrHighHigh            26.82         0.47     27.14         1.93   -7% - 10%
OrHighLow            152.40         3.54    156.58        11.11   -6% - 12%
OrHighMed            103.20         2.26    105.84         7.55   -6% - 12%
PKLookup             216.38         4.32    219.32         2.59   -1% - 4%
Prefix3              169.89         4.97    163.82         3.34   -8% - 1%
Respell               83.23         1.44     86.20         3.00   -1% - 9%
Wildcard             155.81         2.79    152.30         2.54   -5% - 1%
{noformat}

BlockPacked as base:
{noformat}
Task               QPS base  StdDev base  QPS pfor  StdDev pfor  Pct diff
AndHighHigh          122.94         3.43    116.24         1.90   -9% - -1%
AndHighLow          2294.32        58.32   2199.14        31.97   -7% - 0%
AndHighMed           325.55        12.44    290.20         3.80  -15% - -6%
Fuzzy1                88.33         1.84     87.86         2.54   -5% - 4%
Fuzzy2                31.92         0.80     32.00         0.92   -5% - 5%
HighPhrase             9.73         0.47      9.04         0.29  -14% - 0%
HighSloppyPhrase       5.49         0.19      5.16         0.03   -9% - -1%
HighSpanNear          10.93         0.23      9.90         0.09  -12% - -6%
HighTerm             178.31         6.37    171.06         6.14  -10% - 3%
IntNRQ                60.87         4.71     62.38         5.49  -13% - 20%
LowPhrase             44.97         1.18     38.36         1.01  -19% - -10%
LowSloppyPhrase       69.61         1.19     55.90         1.39  -23% - -16%
LowSpanNear           88.50         0.66     72.80         2.23  -20% - -14%
LowTerm             1769.84        32.66   1717.02        39.75   -6% - 1%
MedPhrase             28.88         0.84     25.57         0.68  -16% - -6%
MedSloppyPhrase       34.47         0.50     29.29         0.54  -17% - -12%
MedSpanNear           24.88         0.32     21.69         0.38  -15% - -10%
MedTerm              667.95        21.61    633.73        22.17  -11% - 1%
OrHighHigh            27.96         1.29     26.82         0.81  -11% - 3%
OrHighLow            158.62         5.82    155.08         5.05   -8% - 4%
OrHighMed            107.16         4.19    104.81         3.17   -8% - 4%
PKLookup             217.22         1.86    216.83         1.87   -1% - 1%
Prefix3              167.32         6.72    166.12         6.53   -8% - 7%
Respell               85.25         2.27     85.85         2.16   -4% - 6%
Wildcard             156.24         5.69    154.63         3.02   -6% - 4%
{noformat}

The current PFor impl only saves 1.8% against For, but incurs a noticeable perf loss. Let's use the Packed version!
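For readers checking the "Pct diff" column, the ranges in the tables above appear consistent with comparing the candidate's QPS minus or plus one standard deviation against the baseline's QPS plus or minus one standard deviation, then truncating to a whole percent. That formula is an assumption inferred from the rows themselves, not a statement of how the benchmark tool computes it, and the class name below is hypothetical.

```java
// Hypothetical sketch: the formula is inferred from the table rows and is
// not taken from the benchmark tool's source.
final class PctDiffSketch {

    /** Pessimistic bound: slowest plausible candidate vs. fastest plausible baseline. */
    static int low(double baseQps, double baseStd, double cmpQps, double cmpStd) {
        return (int) (100.0 * ((cmpQps - cmpStd) / (baseQps + baseStd) - 1.0));
    }

    /** Optimistic bound: fastest plausible candidate vs. slowest plausible baseline. */
    static int high(double baseQps, double baseStd, double cmpQps, double cmpStd) {
        return (int) (100.0 * ((cmpQps + cmpStd) / (baseQps - baseStd) - 1.0));
    }

    public static void main(String[] args) {
        // AndHighHigh row from the "BlockFor as base" table.
        System.out.println(low(121.54, 1.37, 116.69, 2.03) + "% - "
                + high(121.54, 1.37, 116.69, 2.03) + "%");
    }
}
```

Read this way, a range that spans zero (for example Fuzzy1) means the difference is within the measured noise, while a range that stays negative (for example LowSloppyPhrase) indicates a slowdown larger than the noise.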
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13431914#comment-13431914 ]

Michael McCandless commented on LUCENE-3892:

I compared Block w/ BlockPacked, but set acceptableOverheadRatio to 0 for a fairer test:
{noformat}
Task               QPS base  StdDev base  QPS pack  StdDev pack  Pct diff
HighSloppyPhrase       1.94         0.01      1.91         0.05   -4% - 2%
LowPhrase             21.05         0.07     20.84         0.37   -3% - 1%
MedPhrase             13.05         0.04     12.93         0.23   -3% - 1%
Wildcard              43.87         2.76     43.49         2.10  -11% - 10%
IntNRQ                 8.88         1.39      8.83         0.78  -21% - 28%
Fuzzy1                63.07         1.96     62.78         1.46   -5% - 5%
LowSloppyPhrase        6.92         0.01      6.91         0.13   -2% - 1%
Prefix3               71.38         5.20     71.35         3.17  -10% - 12%
PKLookup             157.00         1.78    158.01         2.01   -1% - 3%
AndHighLow           668.76         4.82    674.80         7.48    0% - 2%
HighPhrase             1.56         0.03      1.58         0.03   -3% - 5%
MedSloppyPhrase        7.71         0.03      7.80         0.11    0% - 2%
AndHighMed            74.05         0.49     75.35         0.36    0% - 2%
AndHighHigh           25.92         0.30     26.78         0.19    1% - 5%
Respell               57.07         2.70     59.20         1.80   -3% - 12%
Fuzzy2                60.81         2.92     63.32         1.68   -3% - 12%
OrHighHigh             8.99         0.17      9.39         0.11    1% - 7%
OrHighMed             17.65         0.37     18.52         0.13    2% - 7%
MedSpanNear            3.90         0.17      4.11         0.09   -1% - 12%
OrHighLow             22.99         0.51     24.22         0.15    2% - 8%
HighSpanNear           1.40         0.06      1.48         0.03    0% - 12%
LowSpanNear            7.84         0.31      8.32         0.17    0% - 12%
LowTerm              406.02        28.53    444.21        37.75   -6% - 27%
MedTerm              149.83         8.11    167.60        15.06   -3% - 28%
HighTerm              29.57         1.67     33.42         3.20   -3% - 31%
{noformat}

Curiously it seems even faster than w/ acceptableOverheadRatio=0.2! But it makes it clear we should do a hard cutover.
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13431916#comment-13431916 ]

Michael McCandless commented on LUCENE-3892:

bq. I revived the PFor code, and tested it against BlockFor and BlockPacked

Thanks Billy, I'll run a test too ...
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13431951#comment-13431951 ]

Adrien Grand commented on LUCENE-3892:

bq. Curiously it seems even faster than w/ acceptableOverheadRatio=0.2! But it makes it clear we should do a hard cutover.

I had been doing some tests with the bulk version of PackedInts.get (which uses the same methods that we use for BlockPacked) while working on LUCENE-4098 and it seemed that the bottleneck was more memory bandwidth than CPU (for large arrays at least). If you look at the last graph of http://people.apache.org/~jpountz/packed_ints3.html, the throughput seems to depend more on the memory efficiency of the picked impl than on the way it stores data. Maybe we are experiencing a similar phenomenon here...

Unless I am missing something, the only difference between BlockPacked and Block is that BlockPacked decodes directly from byte[] whereas Block uses ByteBuffer.asLongBuffer to translate from bytes to ints and then decodes from the ints... Interesting to know it has so much overhead...
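The two decoding paths contrasted in the comment above can be illustrated with a minimal sketch. This is not the branch's code: the class and method names are hypothetical, the layout is assumed to be big-endian with values packed back to back, and the loops work one bit at a time for clarity where a real decoder would be unrolled per bits-per-value. The point is only that the second path needs an extra pass copying the bytes into a long[] before any value is decoded.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch, not Lucene code: the same packed data decoded
// (a) directly from byte[] and (b) after a copy through a LongBuffer view.
final class PackedDecodeSketch {

    /** Packs values big-endian, bitsPerValue bits each, padded to whole longs. */
    static byte[] encode(int[] values, int bitsPerValue) {
        int totalBits = values.length * bitsPerValue;
        byte[] out = new byte[((totalBits + 63) / 64) * 8];
        int bitPos = 0;
        for (int v : values) {
            for (int b = bitsPerValue - 1; b >= 0; b--) {
                if (((v >>> b) & 1) != 0) {
                    out[bitPos >>> 3] |= (byte) (0x80 >>> (bitPos & 7));
                }
                bitPos++;
            }
        }
        return out;
    }

    /** Path (a): read bits straight out of the byte[]. */
    static int[] decodeFromBytes(byte[] in, int count, int bitsPerValue) {
        int[] out = new int[count];
        int bitPos = 0;
        for (int i = 0; i < count; i++) {
            int v = 0;
            for (int b = 0; b < bitsPerValue; b++) {
                int bit = (in[bitPos >>> 3] >>> (7 - (bitPos & 7))) & 1;
                v = (v << 1) | bit;
                bitPos++;
            }
            out[i] = v;
        }
        return out;
    }

    /** Path (b): copy into long[] via a LongBuffer view first, then decode. */
    static int[] decodeViaLongBuffer(byte[] in, int count, int bitsPerValue) {
        long[] longs = new long[in.length / 8];
        ByteBuffer.wrap(in).asLongBuffer().get(longs); // extra copy; big-endian by default
        int[] out = new int[count];
        int bitPos = 0;
        for (int i = 0; i < count; i++) {
            int v = 0;
            for (int b = 0; b < bitsPerValue; b++) {
                long word = longs[bitPos >>> 6];
                int bit = (int) ((word >>> (63 - (bitPos & 63))) & 1L);
                v = (v << 1) | bit;
                bitPos++;
            }
            out[i] = v;
        }
        return out;
    }
}
```

Both methods return the same values; the sketch says nothing about their relative speed, which is what the benchmarks in the thread measure.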
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13431983#comment-13431983 ]

Michael McCandless commented on LUCENE-3892:

OK indeed PFOR is slower for me too:
{noformat}
Task               QPS base  StdDev base  QPS pfor  StdDev pfor  Pct diff
HighPhrase             1.56         0.03      1.25         0.12  -28% - -10%
MedPhrase             13.05         0.10     10.50         0.58  -24% - -14%
LowPhrase             21.08         0.08     17.35         0.85  -22% - -13%
AndHighMed            73.78         0.66     62.50         1.68  -18% - -12%
AndHighLow           674.60         2.54    573.00        12.06  -17% - -12%
LowSpanNear            8.04         0.17      6.97         0.23  -17% - -8%
MedSpanNear            3.97         0.10      3.58         0.15  -15% - -3%
MedSloppyPhrase        7.58         0.11      6.93         0.14  -11% - -5%
AndHighHigh           25.71         0.47     23.58         0.61  -12% - -4%
HighSpanNear           1.42         0.04      1.31         0.05  -12% - -1%
MedTerm              155.44        18.75    144.46        12.33  -24% - 14%
HighTerm              30.27         4.31     28.25         2.88  -26% - 19%
LowSloppyPhrase        6.73         0.13      6.28         0.12  -10% - -3%
OrHighHigh             9.06         0.24      8.53         0.33  -11% - 0%
OrHighLow             23.09         0.67     21.88         0.91  -11% - 1%
OrHighMed             17.71         0.51     16.79         0.67  -11% - 1%
HighSloppyPhrase       1.88         0.05      1.80         0.04   -9% - 0%
IntNRQ                 9.42         0.50      9.05         0.89  -17% - 11%
Prefix3               72.67         2.42     70.42         3.61  -11% - 5%
Fuzzy1                63.71         1.07     62.34         1.55   -6% - 1%
Wildcard              45.25         0.99     44.28         1.55   -7% - 3%
PKLookup             159.04         2.13    157.17         1.90   -3% - 1%
Fuzzy2                62.51         2.28     63.40         1.65   -4% - 8%
LowTerm              400.06        57.60    407.73        52.40  -22% - 34%
Respell               56.72         3.19     59.83         2.10   -3% - 15%
{noformat}

I think we should replace Block with BlockPacked now?
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13431985#comment-13431985 ]

Michael McCandless commented on LUCENE-3892:

bq. I had been doing some tests with the bulk version of PackedInts.get (which uses the same methods that we use for BlockPacked) while working on LUCENE-4098 and it seemed that the bottleneck was more memory bandwidth than CPU (for large arrays at least).

Ahh, interesting... So I think we should test different acceptableOverheadRatios to find the best ... it could be it's 0!
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13431991#comment-13431991 ]

Michael McCandless commented on LUCENE-3892:

{quote}
bq. Do we really need to write/read the 32 format.getId(), numBits into the postings file header? I guess it's either that or ... store the float acceptableOverheadRatio (eg using Float.floatToIntBits I guess) and have some back-compat enforced in the logic in PackedInts.fastestFormatAndBits... hmm.

I hesitated between these two approaches but I think writing all cases to the header is less error-prone? Moreover it would allow us to change the logic of fastestFormatAndBits without having to bump the version number.
{quote}

Maybe for starters we should just hardwire acceptableOverheadRatio at 0 ... then we simplify this back-compat until/unless we really need to make this configurable.
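The alternative mentioned in the quoted question, storing the float acceptableOverheadRatio in the header via Float.floatToIntBits, could look roughly like the sketch below. It uses java.nio.ByteBuffer as a stand-in for Lucene's own input/output classes, and the class and method names are hypothetical; the thread's actual direction was to hardwire the ratio at 0 rather than store it.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch, not Lucene code: round-trips a float setting through
// a 4-byte header field using its IEEE 754 bit pattern.
final class OverheadRatioHeaderSketch {

    /** Serializes the ratio as the 4 bytes of its int bit pattern. */
    static byte[] write(float acceptableOverheadRatio) {
        return ByteBuffer.allocate(4)
                .putInt(Float.floatToIntBits(acceptableOverheadRatio))
                .array();
    }

    /** Reads the 4 bytes back and restores the exact float value. */
    static float read(byte[] header) {
        return Float.intBitsToFloat(ByteBuffer.wrap(header).getInt());
    }

    public static void main(String[] args) {
        byte[] header = write(0.2f);
        System.out.println(header.length + " bytes, ratio=" + read(header));
    }
}
```

The trade-off raised in the thread still applies: a reader that only knows the ratio must re-derive the per-bpv format with the same fastestFormatAndBits logic the writer used, whereas writing every (format id, numBits) pair keeps the file self-describing.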
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13432001#comment-13432001 ]

Michael McCandless commented on LUCENE-3892:

bq. The other problem is that we are also storing these unnecessary 19 values (but it is not easy to fix since PACKED_SINGLE_BLOCK writes values in the low-order long bits first (little endian)). Maybe we should make PACKED_SINGLE_BLOCK write values in the high-order bits first and split byte encoders and decoders from the long ones (so that they have a lower valueCount()).

OK, we can explore that later (another reason to simply always use Format.PACKED for now...).
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13432009#comment-13432009 ]

Robert Muir commented on LUCENE-3892:

{quote}
OK indeed PFOR is slower for me too:
{quote}

I think for starters since you guys have gotten FOR pretty nice we should just focus on that one? We could later see if PFOR could get additional wins as a second step: getting FOR working nice and fast is awesome on its own!
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13432235#comment-13432235 ]

Michael McCandless commented on LUCENE-3892:

bq. I think for starters since you guys have gotten FOR pretty nice we should just focus on that one?

Yeah I think we should do that. I think the branch is nearly ready to land! I just replaced Block with BlockPacked ...
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13431117#comment-13431117 ] Han Jiang commented on LUCENE-3892:

And here are the results for skipMultiplier, using the current value of 8 as the baseline: http://pastebin.com/TG4C6u6S

Somewhat noisy, but or-queries benefit a little when skipMultiplier=32.
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13431193#comment-13431193 ] Robert Muir commented on LUCENE-3892:

{quote}
So ... most of the gains come from BlockPF cutover. This is sort of ... surprising/disappointing, ie, our bottlenecks are the abstraction layers, not the actual decode cost. Still it's good to make progress on removing the abstractions.
{quote}

I don't think it's that disappointing. This isn't a very interesting benchmark for a compression algorithm like FOR: instead imagine the very common case of apps today indexing small fields like product names, restaurant names, or something like that. Freqs are nearly always 1, and positions are tiny, but often people still want the ability to use things like phrase queries. And imagine cases where people are indexing data from a database and there are only a few unique values (e.g. product type = tshirt, pants, shoes) in a field.

I think the wikipedia benchmark doesn't do a very good job of illustrating performance on use-cases like this, which I think are common and also where I'm fairly positive FOR will be a win. It's nice that it's not slower or too much bigger in the worst case of large docs where the numbers aren't so tiny?

{quote}
Also, it looks like the only query that is slower than Lucene40 is AndHighLow ... however, it's also an extremely fast query to begin with so I think it's a fine tradeoff that it gets slower while the hard/slower queries get faster.
{quote}

+1, let's not even think twice about that one.
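To make the "freqs are nearly always 1" argument concrete, here is a minimal sketch of how a FOR-style block picks one bit width for all of its values. The class and method names are hypothetical and this is not the branch's code; it assumes non-negative values and ignores the per-block header.

```java
import java.util.Arrays;

// Hypothetical illustration, not Lucene code: estimates the packed size of one block.
final class ForBlockSize {

    /** Bits needed to represent the largest value in the block (at least 1). */
    static int bitsRequired(int[] block) {
        int or = 0;
        for (int v : block) {
            or |= v; // OR-ing keeps the highest set bit seen in any value
        }
        return Math.max(1, 32 - Integer.numberOfLeadingZeros(or));
    }

    /** Payload bytes when every value is stored with the same bit width. */
    static int packedBytes(int[] block) {
        long bits = (long) block.length * bitsRequired(block);
        return (int) ((bits + 7) / 8);
    }

    public static void main(String[] args) {
        int[] freqs = new int[128];
        Arrays.fill(freqs, 1); // the "freq is nearly always 1" case
        System.out.println("bytes for 128 freqs of 1: " + packedBytes(freqs));
    }
}
```

With 128 freqs that are all 1, each value needs a single bit, so the block payload is 16 bytes. A block whose largest value is 1000 needs 10 bits per value, or 160 bytes.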
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13431324#comment-13431324 ] Adrien Grand commented on LUCENE-3892:

I made some changes to the {{BlockPacked}} codec:
- encoding and decoding using int[] instead of long[]
- selection of the format based on a configurable overhead ratio.

The results are encouraging:

{noformat}
Task              QPS 3892  StdDev 3892  QPS 3892-packed  StdDev 3892-packed  Pct diff
PKLookup            256.93         8.89           256.85                7.47  -6% - 6%
OrHighLow           145.14         9.86           145.14                9.35  -12% - 14%
Respell             110.26         1.84           110.27                2.01  -3% - 3%
AndHighHigh         112.97         0.81           113.19                2.17  -2% - 2%
Fuzzy1              102.15         1.47           102.86                3.13  -3% - 5%
OrHighHigh           94.56         6.56            95.43                6.35  -11% - 15%
Fuzzy2               42.49         0.77            42.89                1.43  -4% - 6%
OrHighMed           175.30        11.34           177.42               10.83  -10% - 14%
AndHighLow         1925.02        23.92          1952.57               48.68  -2% - 5%
HighPhrase            8.96         0.41             9.11                0.46  -7% - 11%
Wildcard            189.79         2.13           193.12                1.57  0% - 3%
HighSpanNear          6.47         0.15             6.59                0.25  -4% - 8%
Prefix3             256.67         2.58           262.40                2.84  0% - 4%
LowTerm            1746.52        52.80          1789.54               54.30  -3% - 8%
HighTerm            238.70        13.46           245.63               16.60  -9% - 16%
MedTerm             923.64        38.19           951.18               46.85  -5% - 12%
AndHighMed          364.46         3.65           377.09               10.03  0% - 7%
IntNRQ               56.58         1.02            58.84                0.80  0% - 7%
HighSloppyPhrase     11.73         0.30            12.40                0.62  -2% - 13%
LowSpanNear          29.64         0.96            32.44                0.98  2% - 16%
MedSpanNear          22.96         0.72            25.16                0.85  2% - 16%
MedPhrase            40.99         1.25            45.09                1.24  3% - 16%
LowSloppyPhrase      37.88         0.99            41.98                1.49  4% - 17%
LowPhrase            64.40         2.04            71.84                1.41  5% - 17%
MedSloppyPhrase      42.29         1.16            47.32                1.54  5% - 18%
{noformat}

I hope this will be confirmed on your computers this time. :-)
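One way to read "selection of the format based on a configurable overhead ratio" is sketched below. It is a simplified illustration and not the codec's actual selection logic: the class name, the method name, and the rule of preferring byte-aligned widths (8, 16, 32) are assumptions made for the example.

```java
// Hypothetical illustration of trading space for decode speed via an overhead ratio.
final class OverheadRatioSketch {

    /**
     * Returns the bit width to store values that need {@code bitsPerValue} bits,
     * allowing up to {@code acceptableOverheadRatio} (e.g. 0.2f for 20%) wasted space.
     * A byte-aligned width is chosen when it fits in that budget, on the assumption
     * that aligned widths decode faster.
     */
    static int chooseBitsPerValue(int bitsPerValue, float acceptableOverheadRatio) {
        if (bitsPerValue < 1 || bitsPerValue > 32 || acceptableOverheadRatio < 0f) {
            throw new IllegalArgumentException("bitsPerValue must be 1..32, ratio must be >= 0");
        }
        int maxBits = bitsPerValue + (int) (bitsPerValue * acceptableOverheadRatio);
        for (int aligned : new int[] {8, 16, 32}) {
            if (bitsPerValue <= aligned && aligned <= maxBits) {
                return aligned; // within budget: take the aligned width
            }
        }
        return bitsPerValue; // otherwise keep the tightest width
    }

    public static void main(String[] args) {
        System.out.println(chooseBitsPerValue(7, 0.2f));
        System.out.println(chooseBitsPerValue(5, 0.2f));
    }
}
```

Under these assumptions a ratio of 0 always keeps the most compact width, while larger ratios round more blocks up to an aligned width, which is the size-versus-speed knob the comment describes.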
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13431485#comment-13431485 ] Michael McCandless commented on LUCENE-3892:

I also see (smaller) gains with BlockPacked vs Block (this is 10M doc index):

{noformat}
Task              QPS base  StdDev base  QPS packed  StdDev packed  Pct diff
AndHighMed           69.19         0.53       66.43           0.63  -5% - -2%
Fuzzy2               63.71         1.24       62.25           1.58  -6% - 2%
Respell              62.69         1.41       61.53           1.47  -6% - 2%
IntNRQ               11.86         0.43       11.73           0.03  -4% - 2%
Fuzzy1               75.48         1.21       75.05           1.52  -4% - 3%
Wildcard             53.23         0.63       52.96           0.25  -2% - 1%
MedSpanNear           4.88         0.16        4.88           0.11  -5% - 5%
PKLookup            191.48         2.84      191.62           3.98  -3% - 3%
HighTerm             35.71         0.63       35.91           0.06  -1% - 2%
Prefix3              83.14         1.34       83.83           0.49  -1% - 3%
LowTerm             513.35         0.77      517.92           1.50  0% - 1%
HighSpanNear          1.70         0.06        1.71           0.03  -4% - 6%
AndHighHigh          23.45         0.09       23.69           0.10  0% - 1%
OrHighLow            27.27         1.06       27.59           0.15  -3% - 5%
OrHighMed            23.61         0.92       23.89           0.17  -3% - 6%
OrHighHigh           11.42         0.44       11.59           0.12  -3% - 6%
MedSloppyPhrase       6.84         0.17        6.95           0.23  -4% - 7%
LowPhrase            22.02         0.39       22.43           0.15  0% - 4%
MedTerm             196.76         3.01      200.62           0.33  0% - 3%
LowSpanNear           9.60         0.24        9.82           0.31  -3% - 8%
MedPhrase            13.08         0.30       13.41           0.12  0% - 5%
LowSloppyPhrase       7.55         0.21        7.77           0.27  -3% - 9%
AndHighLow          649.84        18.26      669.08           6.63  0% - 6%
HighSloppyPhrase      1.98         0.08        2.04           0.09  -4% - 12%
HighPhrase            1.76         0.11        1.96           0.10  0% - 24%
{noformat}

The index is 4669 MB with Block and 4790 with BlockPacked = ~2.6% larger ... seems worth it! Apps can always tune the 20% too.
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13431498#comment-13431498 ] Adrien Grand commented on LUCENE-3892:

Thanks Mike for your tests. Do you think {{BlockPacked}} is now fast enough to replace {{Block}}? I am asking because it is a little painful to always have to backport changes from one format to the other.
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13431504#comment-13431504 ] Michael McCandless commented on LUCENE-3892:

Yes I think we should do a hard cutover now? Ie, merge any final changes (sorry for all the commits! we are nearly ready to land I think...) over to BlockPacked, then remove Block and rename BlockPacked to Block?
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13431506#comment-13431506 ] Adrien Grand commented on LUCENE-3892:

Sounds good. I think the only commits that have not been merged yet are 1371010 and 1371011.
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13431511#comment-13431511 ] Michael McCandless commented on LUCENE-3892:

OK I'll merge, then replace Block w/ BlockPacked... likely sometime tomorrow. Thanks Adrien!
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13430373#comment-13430373 ] Adrien Grand commented on LUCENE-3892:

I backported Mike's changes to the {{BlockPacked}} codec and tried to understand why it was slower than {{Block}}... The use of {{java.nio.*Buffer}} seemed to be the bottleneck of the decoding step ({{ByteBuffer.asLongBuffer}} and {{ByteBuffer.getLong}} especially are _very_ slow), so I switched back to decoding from long[] (instead of LongBuffer) and added direct decoding from byte[] to avoid having to convert the bytes to longs before decoding.

Tests passed with -Dtests.postingsformat=BlockPacked.

Here are the results of the benchmark (unfortunately, it started before Mike committed r1370179):

{noformat}
Task              QPS 3892  StdDev 3892  QPS 3892-packed  StdDev 3892-packed  Pct diff
PKLookup            259.41         9.06           255.77                8.89  -8% - 5%
AndHighLow         1656.30        50.44          1653.85               55.05  -6% - 6%
AndHighHigh          82.90         1.82            83.47                2.52  -4% - 6%
AndHighMed          274.76        11.11           278.51               13.42  -7% - 10%
Prefix3             285.41         4.82           289.60                6.31  -2% - 5%
HighTerm            230.78        14.33           235.16               20.61  -12% - 18%
IntNRQ               55.91         1.03            57.13                2.73  -4% - 9%
LowTerm            1720.10        47.06          1759.16               55.47  -3% - 8%
Wildcard            290.54         3.82           297.39                5.42  0% - 5%
MedTerm             733.01        35.38           750.46               50.37  -8% - 14%
HighSpanNear          6.93         0.23             7.12                0.39  -6% - 11%
HighPhrase            6.46         0.22             6.65                0.46  -7% - 14%
Respell              96.11         2.84            99.00                3.98  -3% - 10%
OrHighHigh           38.07         2.53            39.23                3.06  -10% - 19%
Fuzzy2               50.29         1.70            51.87                2.25  -4% - 11%
MedPhrase            26.20         0.94            27.03                1.07  -4% - 11%
OrHighMed           138.83         7.76           143.54                9.79  -8% - 16%
Fuzzy1              100.58         2.15           104.21                3.99  -2% - 9%
HighSloppyPhrase      5.26         0.11             5.45                0.24  -3% - 10%
OrHighLow            78.43         5.55            81.80                6.89  -10% - 21%
MedSpanNear          32.75         1.13            34.28                1.73  -3% - 13%
LowPhrase            90.27         3.20            95.06                3.58  -2% - 13%
LowSpanNear          46.40         1.95            48.89                2.40  -3% - 15%
MedSloppyPhrase      36.29         1.00            38.59                1.46  0% - 13%
LowSloppyPhrase      37.41         1.11            40.48                1.39  1% - 15%
{noformat}

Mike, Billy, could you check that {{BlockPacked}} is at least as fast as {{Block}} on your computer too?
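The idea of decoding directly from byte[] can be sketched as follows. This is a minimal illustration under assumed conventions (values packed MSB-first at a uniform bit width of 1 to 32), not the {{BlockPacked}} implementation; the class and method names are hypothetical. The point is that the decoder reads plain array elements and never goes through a ByteBuffer or LongBuffer view or an intermediate long[].

```java
// Hypothetical illustration: pack and unpack fixed-width values straight from a byte[].
final class PackedByteDecoder {

    /** Packs values MSB-first; each value uses bitsPerValue bits (1..32). */
    static byte[] encode(int[] values, int bitsPerValue) {
        byte[] out = new byte[(values.length * bitsPerValue + 7) / 8];
        int bitPos = 0;
        for (int v : values) {
            for (int b = bitsPerValue - 1; b >= 0; b--, bitPos++) {
                if (((v >>> b) & 1) != 0) {
                    out[bitPos >>> 3] |= (byte) (0x80 >>> (bitPos & 7));
                }
            }
        }
        return out;
    }

    /** Fills out[] by reading bytes directly from in[]; no buffer views, no long[] copy. */
    static void decode(byte[] in, int bitsPerValue, int[] out) {
        long acc = 0;     // bit accumulator; stale high bits are masked off on extraction
        int accBits = 0;  // number of not-yet-consumed bits at the low end of acc
        int inPos = 0;
        long mask = (1L << bitsPerValue) - 1;
        for (int i = 0; i < out.length; i++) {
            while (accBits < bitsPerValue) {
                acc = (acc << 8) | (in[inPos++] & 0xFFL);
                accBits += 8;
            }
            accBits -= bitsPerValue;
            out[i] = (int) ((acc >>> accBits) & mask);
        }
    }

    public static void main(String[] args) {
        int[] values = {1, 5, 0, 7, 3};
        int[] decoded = new int[values.length];
        decode(encode(values, 3), 3, decoded);
        System.out.println(java.util.Arrays.toString(decoded));
    }
}
```

A production decoder would typically be unrolled per bit width; the loop above only shows the data flow.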
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13430423#comment-13430423 ] Han Jiang commented on LUCENE-3892:

Thanks Adrien! Your code is really clean! At first glance, I think we should still support the all-values-the-same case? For some applications (like an index with payloads), that might be helpful.

Also, I'm a little confused about your performance test. Did you use BlockPF before r1370179 as a baseline, and compare it with your latest commit? Here, I tested these two PFs at the latest revision (r1370345).

{noformat}
Task              QPS base  StdDev base  QPS comp  StdDev comp  Pct diff
AndHighHigh         124.53         9.36    100.46         3.31  -27% - -9%
AndHighLow         2141.08        63.93   1922.73        36.32  -14% - -5%
AndHighMed          281.48        36.49    218.68        13.10  -35% - -5%
Fuzzy1               84.33         2.56     83.94         1.67  -5% - 4%
Fuzzy2               30.49         1.13     30.48         0.71  -5% - 6%
HighPhrase            9.08         0.28      7.56         0.20  -21% - -11%
HighSloppyPhrase      5.46         0.21      4.88         0.23  -17% - -2%
HighSpanNear         10.12         0.21      9.21         0.30  -13% - -3%
HighTerm            176.52         6.13    146.13         5.43  -22% - -11%
IntNRQ               59.56         1.98     51.05         1.33  -19% - -9%
LowPhrase            40.02         1.03     32.75         0.37  -21% - -15%
LowSloppyPhrase      59.59         2.85     51.49         1.33  -19% - -6%
LowSpanNear          73.86         3.17     61.98         1.45  -21% - -10%
LowTerm            1755.38        15.56   1622.61        26.87  -9% - -5%
MedPhrase            25.99         0.47     21.01         0.17  -21% - -16%
MedSloppyPhrase      30.52         0.89     24.77         0.55  -22% - -14%
MedSpanNear          22.26         0.43     18.73         0.47  -19% - -12%
MedTerm             651.90        18.97    573.34        19.25  -17% - -6%
OrHighHigh           26.75         0.33     23.53         0.50  -14% - -9%
OrHighLow           151.69         2.13    134.17         3.19  -14% - -8%
OrHighMed           102.48         1.48     90.73         2.01  -14% - -8%
PKLookup            216.59         5.70    215.99         2.99  -4% - 3%
Prefix3             166.00         0.78    145.25         1.29  -13% - -11%
Respell              82.01         3.01     82.80         1.66  -4% - 6%
Wildcard            151.66         2.22    141.14         1.57  -9% - -4%
{noformat}

Strange that it isn't working well on my computer. And results are similar when I change MMapDirectory to NIOFSDirectory.
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13430439#comment-13430439 ] Michael McCandless commented on LUCENE-3892:

Hmm also not great results on my env (base=Block, packed=BlockPacked), based on current branch head:

{noformat}
Task              QPS base  StdDev base  QPS packed  StdDev packed  Pct diff
AndHighMed           59.23         3.07       34.24           0.69  -46% - -37%
AndHighLow          576.35        21.09      349.57           7.44  -42% - -35%
AndHighHigh          23.83         0.72       15.53           0.29  -37% - -31%
MedPhrase            12.56         0.20        8.87           0.31  -32% - -25%
LowPhrase            20.52         0.21       14.89           0.43  -30% - -24%
MedSloppyPhrase       7.46         0.20        5.41           0.13  -31% - -23%
LowSloppyPhrase       6.73         0.18        4.92           0.12  -30% - -22%
LowSpanNear           7.63         0.32        5.65           0.19  -31% - -20%
HighSloppyPhrase      1.90         0.08        1.52           0.05  -25% - -14%
HighPhrase            1.57         0.04        1.26           0.08  -26% - -12%
MedSpanNear           3.84         0.18        3.14           0.14  -25% - -10%
LowTerm             433.22        34.89      364.03          15.63  -25% - -4%
HighSpanNear          1.40         0.07        1.19           0.06  -23% - -6%
IntNRQ                9.50         0.43        8.09           0.92  -27% - 0%
HighTerm             29.47         4.89       25.46           2.35  -32% - 13%
MedTerm             148.76        21.53      129.17           9.59  -29% - 9%
Prefix3              72.81         2.20       63.65           3.88  -20% - -4%
Wildcard             44.79         0.92       39.91           2.20  -17% - -4%
OrHighMed            16.81         0.48       15.28           0.21  -12% - -5%
OrHighLow            21.85         0.67       20.03           0.32  -12% - -3%
OrHighHigh            8.49         0.28        7.80           0.14  -12% - -3%
Fuzzy1               61.33         1.95       58.91           1.11  -8% - 1%
PKLookup            156.87         1.14      154.08           2.13  -3% - 0%
Respell              58.72         1.57       59.60           1.28  -3% - 6%
Fuzzy2               60.98         2.34       62.03           1.89  -5% - 9%
{noformat}

I think optimizing the all-values-same case is actually quite important for payloads (but luceneutil doesn't test this today). But, curiously, my BlockPacked index is a bit smaller than my Block index (4643 MB vs 4650 MB).

I do wonder about using long[] to hold the uncompressed results (they only need int[]); that's one big difference still.

Also: I'd love to see how acceptableOverheadRatio 0 does ... (and, using PACKED_SINGLE_BLOCK ... we'd have to put a bit in the header to record the format).
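The all-values-same optimization mentioned above can be sketched as a header decision made per block. The layout here is an assumption chosen for illustration (one header byte holding the bit width, with 0 reserved to mean "all values equal", followed by either one 4-byte value or the packed payload); it is not taken from the branch, and the names are hypothetical.

```java
// Hypothetical illustration: choose a per-block header and estimate the encoded size.
final class AllEqualBlockSketch {

    /** Assumed marker: a bit width of 0 in the header means every value is identical. */
    static final int ALL_VALUES_EQUAL = 0;

    /** Header byte for a non-empty block: ALL_VALUES_EQUAL, or the bits per value (1..32). */
    static int headerFor(int[] block) {
        int or = 0;
        boolean allEqual = true;
        for (int v : block) {
            or |= v;
            allEqual &= (v == block[0]);
        }
        // If the values differ, at least one is non-zero, so the width below is >= 1.
        return allEqual ? ALL_VALUES_EQUAL : 32 - Integer.numberOfLeadingZeros(or);
    }

    /** Total bytes under the assumed layout: 1 header byte plus the payload. */
    static int encodedBytes(int[] block) {
        int header = headerFor(block);
        if (header == ALL_VALUES_EQUAL) {
            return 1 + 4; // header + the single repeated value as a 4-byte int
        }
        return 1 + (block.length * header + 7) / 8; // header + uniformly packed values
    }

    public static void main(String[] args) {
        int[] payloadLengths = new int[128];
        java.util.Arrays.fill(payloadLengths, 4); // e.g. every position carries a 4-byte payload
        System.out.println("all-equal block bytes: " + encodedBytes(payloadLengths));
    }
}
```

For 128 identical payload lengths of 4, this layout stores 5 bytes, where packing them at 3 bits each would take 49 bytes including the header, which is why the case matters for payload-heavy indexes.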
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13430507#comment-13430507 ]

Michael McCandless commented on LUCENE-3892:

I tried smaller block sizes than 128. Here's 128 (base) vs 64:

{noformat}
Task                QPS base  StdDev base  QPS block64  StdDev block64  Pct diff
AndHighHigh            23.91         0.57        22.28            0.27  -10% -  -3%
AndHighMed             60.63         1.02        56.96            1.13   -9% -  -2%
MedSloppyPhrase         7.69         0.01         7.30            0.13   -6% -  -3%
HighSloppyPhrase        1.93         0.02         1.83            0.04   -8% -  -1%
LowSloppyPhrase         6.84         0.03         6.57            0.11   -6% -  -1%
Fuzzy1                 65.49         0.85        63.50            1.68   -6% -   0%
HighPhrase              1.57         0.04         1.53            0.04   -7% -   3%
OrHighLow              22.89         0.98        22.38            0.61   -8% -   4%
OrHighMed              17.65         0.70        17.27            0.43   -8% -   4%
IntNRQ                  9.50         0.48         9.33            0.36  -10% -   7%
OrHighHigh              8.98         0.36         8.84            0.19   -7% -   4%
HighTerm               29.60         2.64        29.16            1.44  -13% -  13%
Fuzzy2                 65.54         0.86        64.63            2.13   -5% -   3%
Wildcard               45.27         1.27        44.78            0.48   -4% -   2%
MedTerm               150.40        12.65       148.99            6.63  -12% -  12%
Prefix3                72.55         2.55        72.31            1.02   -5% -   4%
LowTerm               421.62        38.27       422.40            9.47  -10% -  12%
LowSpanNear             7.55         0.34         7.62            0.22   -6% -   8%
HighSpanNear            1.34         0.09         1.35            0.06   -9% -  12%
MedPhrase              12.45         0.24        12.66            0.13   -1% -   4%
Respell                59.54         1.80        60.95            1.86   -3% -   8%
MedSpanNear             3.70         0.24         3.80            0.15   -7% -  14%
PKLookup              154.56         2.45       158.96            1.89    0% -   5%
LowPhrase              20.21         0.33        20.95            0.15    1% -   6%
AndHighLow            577.81        12.46       637.96           29.80    3% -  18%
{noformat}

And 128 (base) vs 32:

{noformat}
Task                QPS base  StdDev base  QPS block64  StdDev block64  Pct diff
AndHighHigh            23.86         0.52        20.68            0.59  -17% -  -8%
IntNRQ                  9.48         0.38         8.84            0.46  -15% -   2%
HighSloppyPhrase        1.87         0.04         1.76            0.06  -11% -   0%
Prefix3                72.65         2.18        68.24            2.96  -12% -   1%
HighTerm               29.91         1.40        28.28            2.94  -19% -   9%
Wildcard               44.74         0.83        42.43            1.49  -10% -   0%
HighSpanNear            1.37         0.08         1.30            0.07  -15% -   6%
MedTerm               152.73         5.28       145.45           14.69  -17% -   8%
MedSloppyPhrase         7.46         0.12         7.12            0.25   -9% -   0%
HighPhrase              1.57         0.03         1.50            0.01   -7% -  -1%
OrHighLow              22.94         0.70        22.00            1.10  -11% -   3%
AndHighMed             58.72         1.79        56.60            1.95   -9% -   2%
LowSloppyPhrase         6.67         0.10         6.44            0.20   -7% -   1%
OrHighMed              17.52         0.56        17.00            0.82  -10% -   5%
LowSpanNear             7.53         0.35         7.34            0.39  -11% -   7%
OrHighHigh              8.84         0.31         8.62            0.43  -10% -   6%
MedSpanNear             3.79         0.20         3.71            0.21  -12% -   9%
PKLookup              153.34         3.22       150.19            4.91   -7% -   3%
Fuzzy1                 62.93         1.77        62.28            2.23   -7% -   5%
LowTerm               410.23        21.57       410.83           35.19  -13% -  14%
MedPhrase              12.55         0.14        12.65            0.08    0% -   2%
LowPhrase              20.42         0.17        20.77            0.21    0% -   3%
Fuzzy2                 61.44         3.12        64.13            1.97   -3% -  13%
Respell                56.65         3.29        60.21            1.39   -1% -  15%
AndHighLow            588.05        12.37       720.63           19.33   16% -  28%
{noformat}

It looks like there's some speedup to AndHighLow and LowPhrase ... but slowdowns in other (harder) queries... so I think net/net we should leave block size at 128.
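For readers wondering how to read the "Pct diff" column in these tables: the ranges shown are consistent with a pessimistic/optimistic bound built from the mean QPS plus or minus one standard deviation, truncated toward zero. The sketch below reproduces that arithmetic. The formula is inferred from the numbers in the tables rather than taken from the benchmark tool's source, and the class and method names are made up for this example.

```java
// Hypothetical sketch: the formula is inferred from the benchmark rows, not copied from luceneutil.
class PctDiffSketch {
    /**
     * Returns {low, high}: the worst-case and best-case percentage change of the
     * competitor relative to the baseline, each truncated toward zero.
     */
    static int[] pctDiffRange(double qpsBase, double stdBase, double qpsCmp, double stdCmp) {
        // Worst case: competitor at its slowest against the baseline at its fastest.
        double worst = (qpsCmp - stdCmp) / (qpsBase + stdBase) - 1.0;
        // Best case: competitor at its fastest against the baseline at its slowest.
        double best = (qpsCmp + stdCmp) / (qpsBase - stdBase) - 1.0;
        return new int[] {(int) (100.0 * worst), (int) (100.0 * best)};
    }

    public static void main(String[] args) {
        // AndHighLow row from the 128 vs 64 run: 577.81 +/- 12.46 vs 637.96 +/- 29.80
        int[] range = pctDiffRange(577.81, 12.46, 637.96, 29.80);
        System.out.println(range[0] + "% - " + range[1] + "%");
    }
}
```

Read this way, a row only shows a clear win or loss when both ends of the range have the same sign, which is why AndHighLow stands out in the two runs above while most Term and Or rows straddle zero.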
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13430798#comment-13430798 ]

Han Jiang commented on LUCENE-3892:

Thanks Mike. The detailed comparison results from my computer are here: http://pastebin.com/HLaAuCNp

I tried block sizes ranging from 1024 down to 32, also using 128 as the base.
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13428936#comment-13428936 ]

Michael McCandless commented on LUCENE-3892:

I just committed an optimization to BlockPF DocsEnum.advance, inlining the scanning step (still have to do DPEnum and EverythingEnum):

{noformat}
Task                QPS base  StdDev base      QPS for      StdDev for  Pct diff
IntNRQ                 12.46         1.45        11.60            0.04  -16% -   5%
Wildcard               54.36         2.75        52.72            0.38   -8% -   2%
Prefix3                85.43         4.97        83.08            0.47   -8% -   3%
Fuzzy2                 63.86         2.13        62.44            1.79   -8% -   4%
Respell                62.75         1.52        61.42            2.02   -7% -   3%
Fuzzy1                 75.68         1.65        74.69            1.44   -5% -   2%
LowSpanNear             9.24         0.20         9.13            0.19   -5% -   3%
PKLookup              192.89         2.91       190.66            2.43   -3% -   1%
HighSpanNear            1.71         0.05         1.69            0.05   -6% -   4%
MedSpanNear             4.80         0.11         4.76            0.12   -5% -   4%
MedPhrase              12.57         0.27        12.56            0.21   -3% -   3%
MedSloppyPhrase         6.57         0.11         6.56            0.11   -3% -   3%
LowPhrase              21.55         0.35        21.55            0.28   -2% -   2%
LowSloppyPhrase         7.25         0.16         7.28            0.12   -3% -   4%
HighPhrase              1.81         0.11         1.82            0.10  -10% -  13%
HighSloppyPhrase        1.94         0.10         1.96            0.05   -6% -   9%
LowTerm               512.53         5.66       518.31            2.30    0% -   2%
MedTerm               196.09         4.68       198.76            0.30   -1% -   3%
HighTerm               35.53         0.95        36.11            0.03   -1% -   4%
OrHighMed              23.34         0.83        23.85            0.70   -4% -   9%
OrHighLow              26.91         0.98        27.53            0.82   -4% -   9%
OrHighHigh             11.27         0.41        11.53            0.34   -4% -   9%
AndHighHigh            21.24         0.05        23.79            0.13   11% -  12%
AndHighLow            553.19         8.47       621.35            4.01    9% -  14%
AndHighMed             57.45         0.13        67.78            0.70   16% -  19%
{noformat}
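To make "inlining the scanning step" concrete, here is a small sketch of the general idea, assuming a block of already-decoded doc deltas. It is not the committed code: the class, its fields, and the single-block simplification (no skip data, no refill) are all hypothetical. The naive version of advance calls nextDoc() in a loop, while the inlined version scans the delta buffer directly with local variables and writes the state back once.

```java
// Hypothetical sketch: a single decoded block, no skip lists, no block refill.
class InlineScanSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final int[] docDeltas; // decoded doc ID deltas for one block
    private int upto;              // index of the next delta to consume
    private int accum;             // running sum of consumed deltas
    private int doc = -1;          // current doc ID

    InlineScanSketch(int[] docDeltas) {
        this.docDeltas = docDeltas;
    }

    int nextDoc() {
        if (upto == docDeltas.length) {
            return doc = NO_MORE_DOCS;
        }
        accum += docDeltas[upto++];
        return doc = accum;
    }

    /** Straightforward advance: one nextDoc() call per candidate. */
    int advanceViaNextDoc(int target) {
        int d;
        do {
            d = nextDoc();
        } while (d < target);
        return d;
    }

    /** Advance with the scan inlined: a tight loop over the delta buffer using locals. */
    int advance(int target) {
        final int[] deltas = docDeltas;
        int i = upto;
        int acc = accum;
        while (i < deltas.length) {
            acc += deltas[i++];
            if (acc >= target) {
                upto = i;
                accum = acc;
                return doc = acc;
            }
        }
        upto = i;
        accum = acc;
        return doc = NO_MORE_DOCS;
    }

    public static void main(String[] args) {
        // Deltas {3, 2, 5, 1, 10} correspond to doc IDs 3, 5, 10, 11, 21.
        InlineScanSketch e = new InlineScanSketch(new int[] {3, 2, 5, 1, 10});
        System.out.println(e.advance(6));
    }
}
```

Both methods return the same documents; the inlined form just avoids the per-document method call and field writes, which would fit the pattern in the table above where the conjunction (And*) tasks, which lean on advance, show the clearest gains.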
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13425153#comment-13425153 ]

Michael McCandless commented on LUCENE-3892:

I'm confused by these two patches: are they against trunk? How come eg they have mods to build.xml?
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13425246#comment-13425246 ]

Michael McCandless commented on LUCENE-3892:

OK I think I understand the two patches now.

First, the build.xml changes are noise I think.

Second, the patches both mix in the removal of the current For/PFor postings formats based on sep (I will separately commit this removal: BlockPF is faster).

Then, one patch (LUCENE-3892-blockForhardcode(base).patch) keeps using the separate packed-ints impl we have, but cuts over to LongBuffer instead of int[] for the decoded values (still uses IntBuffer for the encoded values), while the other patch (LUCENE-3892-blockForpackedecoder(comp).patch) uses oal.util.packed and LongBuffer for both encoded and decoded values.

So it's nice to see that merely switching to LongBuffer to pass encoded/decoded values around doesn't seem to hurt much, except for And queries (odd?), but then switching to oal.util.packed does hurt (also odd because our packed ints impl has been heavily optimized lately).
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13425296#comment-13425296 ]

Adrien Grand commented on LUCENE-3892:

My benchmark results are a little different but oal.util.packed is still behind... (it compares the current branch vs. patched with PackedInts):

{noformat}
Task                QPS pforcodec  StdDev pforcodec  QPS pforcodec-packedints  StdDev pforcodec-packedints  Pct diff
Phrase                      38.21              3.01                     35.73                         2.41  -19% -   8%
SpanNear                    27.99              1.30                     26.30                         1.23  -14% -   3%
SloppyPhrase                43.32              2.98                     41.02                         2.53  -16% -   7%
AndHighMed                 230.23              8.48                    219.88                         9.35  -11% -   3%
AndHighHigh                 52.53              2.02                     50.80                         2.62  -11% -   5%
IntNRQ                      43.24              3.42                     41.84                         2.79  -16% -  12%
Wildcard                   113.26              3.17                    109.91                         3.50   -8% -   3%
Prefix3                    194.56              9.56                    189.39                         9.64  -11% -   7%
Term                       301.86             14.49                    295.28                        17.51  -12% -   8%
OrHighMed                  100.60              8.30                     99.06                         8.00  -16% -  15%
OrHighHigh                  32.35              2.92                     31.90                         2.88  -17% -  18%
Fuzzy2                      36.27              0.67                     35.87                         0.93   -5% -   3%
Fuzzy1                      81.14              1.24                     80.24                         1.68   -4% -   2%
TermGroup100K              193.40              3.36                    191.27                         4.13   -4% -   2%
TermBGroup100K1P           152.78              5.06                    151.23                         3.98   -6% -   5%
TermBGroup100K             242.78              7.06                    240.71                         8.01   -6% -   5%
Respell                     85.75              1.36                     85.17                         2.04   -4% -   3%
PKLookup                   206.02              5.05                    205.57                         4.63   -4% -   4%
{noformat}

I am not sure why oal.util.packed is slower. The only differences I see are that they use inheritance instead of a switch block to know how to decode data and that they encode values in the high-order long bits first while the branch currently starts with the low-order int bits. I'll try to dig deeper to understand what happens...
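The bit-order difference described in the comment above can be illustrated with a toy example. This is only a sketch under simplifying assumptions: the class and method names are invented, and the bit width is required to divide the word size evenly (1, 2, 4, 8 or 16 bits), which neither real implementation requires. One packer fills 64-bit words starting from the high-order bits, the other fills 32-bit words starting from the low-order bits.

```java
// Hypothetical sketch: widths must divide the word size; real packers handle arbitrary widths.
class BitOrderSketch {
    /** Packs values into longs, first value in the highest-order bits of each word. */
    static long[] packHighFirst(int[] values, int bits) {
        int perWord = 64 / bits;
        long[] out = new long[(values.length + perWord - 1) / perWord];
        for (int i = 0; i < values.length; i++) {
            int shift = 64 - bits * (i % perWord + 1);
            out[i / perWord] |= ((long) values[i]) << shift;
        }
        return out;
    }

    /** Packs values into ints, first value in the lowest-order bits of each word. */
    static int[] packLowFirst(int[] values, int bits) {
        int perWord = 32 / bits;
        int[] out = new int[(values.length + perWord - 1) / perWord];
        for (int i = 0; i < values.length; i++) {
            out[i / perWord] |= values[i] << (bits * (i % perWord));
        }
        return out;
    }

    /** Reads value number index back from a high-first long layout. */
    static int getHighFirst(long[] packed, int bits, int index) {
        int perWord = 64 / bits;
        int shift = 64 - bits * (index % perWord + 1);
        return (int) ((packed[index / perWord] >>> shift) & ((1L << bits) - 1));
    }

    /** Reads value number index back from a low-first int layout. */
    static int getLowFirst(int[] packed, int bits, int index) {
        int perWord = 32 / bits;
        return (packed[index / perWord] >>> (bits * (index % perWord))) & ((1 << bits) - 1);
    }

    public static void main(String[] args) {
        int[] values = {1, 2, 3, 4, 5, 6, 7, 8};
        System.out.println(Long.toHexString(packHighFirst(values, 8)[0]));
        System.out.println(Integer.toHexString(packLowFirst(values, 8)[0]));
    }
}
```

Both layouts hold the same information in the same number of bits; they differ only in shift direction and word size, so any speed difference between them would have to come from how the decode loops are compiled rather than from the amount of data read.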
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13425306#comment-13425306 ]

Michael McCandless commented on LUCENE-3892:

I just committed a new BlockPacked postings format, which is a copy of Block postings format but using oal.util.packed for encode/decode. I left Block unchanged, except I moved the util classes it had been using out of oal.codecs.pfor, and removed oal.codecs.pfor.

So now we can iterate to speed up packed ints cutover, and do perf tests off the branch.
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13425313#comment-13425313 ]

Michael McCandless commented on LUCENE-3892:

Sorry I meant to say: the BlockPacked PF is from Billy's LUCENE-3892-blockForpackedecoder(comp).patch.
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13425388#comment-13425388 ]

Michael McCandless commented on LUCENE-3892:

I tested Block vs BlockPacked as checked in. On a Westmere Xeon machine (Java 1.7.0_04):

{noformat}
Task                QPS base  StdDev base      QPS for      StdDev for  Pct diff
AndHighMed             15.14         0.14        13.78            0.13  -10% -  -7%
SloppyPhrase            2.55         0.11         2.33            0.09  -15% -  -1%
OrHighHigh              3.75         0.16         3.44            0.09  -14% -  -1%
Wildcard                8.44         0.01         7.78            0.28  -11% -  -4%
SpanNear                1.11         0.04         1.03            0.04  -13% -   0%
Prefix3                17.91         0.08        16.63            0.50  -10% -  -3%
OrHighMed              11.35         0.65        10.63            0.44  -15% -   3%
IntNRQ                  6.73         0.03         6.32            0.27  -10% -  -1%
TermBGroup1M            3.87         0.03         3.68            0.04   -6% -  -3%
AndHighHigh             4.86         0.09         4.63            0.03   -7% -  -2%
Phrase                  1.10         0.06         1.05            0.06  -14% -   6%
Term                    7.86         0.03         7.52            0.04   -5% -  -3%
TermBGroup1M1P          4.65         0.12         4.49            0.06   -6% -   0%
TermGroup1M             2.97         0.04         2.88            0.02   -4% -  -1%
Fuzzy1                 71.22         1.93        71.02            1.44   -4% -   4%
Fuzzy2                 49.76         1.33        49.90            1.23   -4% -   5%
Respell                76.23         2.67        76.93            2.67   -5% -   8%
PKLookup              161.89         3.28       168.28            7.87   -2% -  11%
{noformat}

And on a desktop Ivy Bridge (Java 1.7.0_04):

{noformat}
Task                QPS base  StdDev base      QPS for      StdDev for  Pct diff
AndHighMed             17.32         0.12        15.41            0.03  -11% - -10%
SloppyPhrase            2.74         0.21         2.56            0.11  -16% -   5%
Phrase                  1.32         0.07         1.23            0.06  -15% -   3%
Wildcard                9.65         0.11         9.08            0.12   -8% -  -3%
SpanNear                1.20         0.01         1.13            0.01   -7% -  -3%
AndHighHigh             5.32         0.03         5.04            0.02   -6% -  -4%
Prefix3                18.93         0.20        18.04            0.24   -6% -  -2%
IntNRQ                  7.79         0.13         7.48            0.13   -7% -   0%
Term                    9.48         0.10         9.15            0.43   -8% -   2%
TermBGroup1M            4.74         0.05         4.59            0.12   -6% -   0%
OrHighMed              13.01         0.24        12.60            0.55   -9% -   2%
OrHighHigh              4.08         0.05         3.97            0.17   -8% -   2%
TermGroup1M             3.30         0.03         3.22            0.07   -5% -   0%
TermBGroup1M1P          5.52         0.11         5.42            0.22   -7% -   4%
PKLookup              194.62         4.43       193.44            5.07   -5% -   4%
Fuzzy1                 79.23         1.31        79.21            0.96   -2% -   2%
Respell                78.97         1.04        79.87            1.15   -1% -   3%
Fuzzy2                 56.17         0.93        56.82            0.64   -1% -   4%
{noformat}

So packed is still behind ...
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13420732#comment-13420732 ]

Robert Muir commented on LUCENE-3892:

FYI: I committed the TestPostingsFormat here to trunk/4.x to get it going in jenkins. I will merge back to the branch... it can then be modified/improved as usual!
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13419381#comment-13419381 ]

Robert Muir commented on LUCENE-3892:

{quote}
I'm afraid that the for loop of readLong() hurts the performance. Here is the comparison against last patch:
{quote}

I think so too. I think in each enum, up front you want a pre-allocated byte[] (maximum size possible for the block), and you do ByteBuffer.wrap(x).asLongBuffer(). After you read the header, call readBytes() and then just rewind()?

So this is just like what you do now in the branch, except with LongBuffer instead of IntBuffer.
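The buffer setup suggested in the comment above might look roughly like the following. This is a sketch, not the branch's code: the class name and the 512-byte maximum are assumptions, and a plain System.arraycopy stands in for the readBytes() call on the index input so the example is self-contained. The point is that the byte[] and its LongBuffer view are allocated once, filled with a single bulk read per block, and rewound, instead of calling readLong() in a loop.

```java
import java.nio.ByteBuffer;
import java.nio.LongBuffer;

// Hypothetical sketch: System.arraycopy stands in for a bulk readBytes() on the input.
class LongBufferBlockReader {
    // Assumed upper bound for one encoded block (128 values at up to 32 bits each).
    static final int MAX_BLOCK_BYTES = 128 * 4;

    // Allocated once per enum; the LongBuffer is a view over the same bytes.
    private final byte[] encoded = new byte[MAX_BLOCK_BYTES];
    private final LongBuffer encodedLongs = ByteBuffer.wrap(encoded).asLongBuffer();

    /** Bulk-copies numBytes of block data, then returns the long view positioned at 0. */
    LongBuffer readBlock(byte[] src, int offset, int numBytes) {
        System.arraycopy(src, offset, encoded, 0, numBytes);
        encodedLongs.rewind();
        return encodedLongs;
    }

    public static void main(String[] args) {
        byte[] src = new byte[16];
        src[7] = 1;  // first big-endian long == 1
        src[14] = 1; // second big-endian long == 256
        LongBuffer longs = new LongBufferBlockReader().readBlock(src, 0, 16);
        System.out.println(longs.get() + " " + longs.get());
    }
}
```

ByteBuffer.wrap defaults to big-endian byte order, so the view decodes longs the same way a byte-by-byte big-endian readLong() would, without the per-long loop.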
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13415072#comment-13415072 ]

Michael McCandless commented on LUCENE-3892:

Thanks Billy, I committed last baseline patch!
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13415085#comment-13415085 ]

Michael McCandless commented on LUCENE-3892:

I opened LUCENE-4225 with a new base PostingsFormat that gives better performance for For than Sep does...
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13415226#comment-13415226 ]

Michael McCandless commented on LUCENE-3892:

I think a good thing to explore next is to stop using our own packed ints impl and instead cut over to oal.util.packed (since so much effort has gone into making those impls fast)?

LUCENE-4161 has already taken a big step towards making them usable ... we should prototype an initial cutover and then iterate?
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13415230#comment-13415230 ]

Adrien Grand commented on LUCENE-3892:

+1

Don't hesitate to tell me if you're missing methods for this issue (I'm thinking at least of bulk int[] read/write; we currently only make it possible with longs).
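To make the "bulk int[] read/write" idea concrete, here is a minimal sketch of fixed-width bit packing, assuming values are packed back to back into a long[] and decoded in one pass into an int[]. The class and method names (PackedIntsSketch, pack, unpack) are hypothetical and are not the oal.util.packed API; the real implementations are considerably more specialized.

```java
// Hypothetical sketch, not the oal.util.packed API.
final class PackedIntsSketch {

    /** Packs each value into bitsPerValue bits (1..32), low bits first. */
    static long[] pack(int[] values, int bitsPerValue) {
        long mask = (1L << bitsPerValue) - 1;
        long totalBits = (long) values.length * bitsPerValue;
        long[] blocks = new long[(int) ((totalBits + 63) / 64)];
        long bitPos = 0;
        for (int v : values) {
            long value = v & mask;               // drop bits that do not fit
            int idx = (int) (bitPos >>> 6);      // which long holds the first bit
            int shift = (int) (bitPos & 63);     // bit offset inside that long
            blocks[idx] |= value << shift;
            if (shift + bitsPerValue > 64) {     // value straddles two longs
                blocks[idx + 1] |= value >>> (64 - shift);
            }
            bitPos += bitsPerValue;
        }
        return blocks;
    }

    /** Bulk-decodes dest.length values from blocks into dest. */
    static void unpack(long[] blocks, int bitsPerValue, int[] dest) {
        long mask = (1L << bitsPerValue) - 1;
        long bitPos = 0;
        for (int i = 0; i < dest.length; i++) {
            int idx = (int) (bitPos >>> 6);
            int shift = (int) (bitPos & 63);
            long value = blocks[idx] >>> shift;
            if (shift + bitsPerValue > 64) {     // pull the spilled high bits
                value |= blocks[idx + 1] << (64 - shift);
            }
            dest[i] = (int) (value & mask);
            bitPos += bitsPerValue;
        }
    }
}
```

Decoding straight into an int[] avoids a per-value long-to-int conversion step in the caller, which is presumably why a bulk int[] method is attractive for a postings decoder.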
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13415236#comment-13415236 ]

Han Jiang commented on LUCENE-3892:

bq. I opened LUCENE-4225 with a new base PostingsFormat that gives better perf for For than Sep...

Wow, the result looks great! I'm quite curious why some queries improve so much, like AndHighHigh.

bq. LUCENE-4161 has already taken a big step towards making them usable ... we should prototype an initial cutover and then iterate?

Yes, but shouldn't we make the PostingsFormat pass the tests first? Currently it also fails some tests for ForPF.
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13415335#comment-13415335 ]

Michael McCandless commented on LUCENE-3892:

bq. Yes, but we should make the PostingsFormat pass test first? Currently it also fails some tests for ForPF.

Uh oh, I didn't know tests were failing on the branch: do you have a seed?
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13413682#comment-13413682 ]

Han Jiang commented on LUCENE-3892:

bq. Was the numBits==0 case for all 0s not all 1s? We may want to have it mean all 1s instead?

OK, I just tested this: in most cases (93%) where the whole block shares one value v, v==1. This change improves indexing speed and reduces file size a bit (280s vs 320s and 589M vs 591M).

But why? Does Lucene store freq() when it is 0 as well, so that a whole block with v==1 becomes more likely?
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13413698#comment-13413698 ]

Michael McCandless commented on LUCENE-3892:

bq. But why? Does lucene store freq() when it is 0 as well, so a whole block with v==1 will be more possible?

A whole block of 1s can easily happen: if all freqs are one (the term always occurred only once in each document), or if the term occurs in every document, then the delta between docIDs is always 1.

I don't think we should ever hit an all 0s block today (hmm: except for positions, if the given term always occurred at the first position in each doc).

We could in theory subtract 1 from all these deltas (except the first one! so maybe we add one to the docID to begin with...) so that these turn into all 0s blocks, but then at decode time we'd have to add 1 back and I'm not sure that'd net/net be a win.
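The numBits==0 discussion can be illustrated with a small sketch. It assumes a hypothetical header convention in which numBits==0 means "every value in the block is 1" and no payload follows; the names (ForBlockSketch, bitsRequired, headerNumBits) are made up for illustration and are not taken from the patches.

```java
// Hypothetical sketch of a per-block header choice; not the patch code.
final class ForBlockSketch {

    /** Bits needed for the largest value in a block of non-negative ints. */
    static int bitsRequired(int[] block) {
        int or = 0;
        for (int v : block) {
            or |= v;                 // highest set bit of any value survives
        }
        return 32 - Integer.numberOfLeadingZeros(or);
    }

    /** True if every value in the block is exactly 1. */
    static boolean allOnes(int[] block) {
        for (int v : block) {
            if (v != 1) {
                return false;
            }
        }
        return true;
    }

    /**
     * Header value under the assumed convention: 0 marks an all-1s block
     * with no payload, so any other block must use at least 1 bit per value.
     */
    static int headerNumBits(int[] block) {
        if (allOnes(block)) {
            return 0;
        }
        return Math.max(1, bitsRequired(block));
    }
}
```

One consequence of this convention, at least in this sketch, is that the rare all-0s block can no longer use the zero-width case and falls back to 1 bit per value.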
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13413709#comment-13413709 ]

Han Jiang commented on LUCENE-3892:

bq. We could in theory subtract 1 from all these deltas (except the first one! so maybe we add one to the docID to begin with...) so that these turn into all 0s blocks, but then at decode time we'd have to add 1 back and I'm not sure that'd net/net be a win.

Hmm, so the current strategy is:
1. for docIDs, store v[i+1]-v[i]-1;
2. for freq and positions, store v[i] directly?

Yes, there are blocks with all 0s, although they are very rare.
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13413719#comment-13413719 ]

Michael McCandless commented on LUCENE-3892:

No, for docIDs we store docID - lastDocID. So that delta can be 0 for the first doc in a posting list, and then >= 1 thereafter.

But an all 0s block is possible if a bunch of terms in a row occurred only in doc 0.
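The docID - lastDocID scheme can be sketched as follows, assuming lastDocID starts at 0 for each postings list (which is what allows a first delta of 0). DocDeltaSketch and its methods are hypothetical names used only for this illustration.

```java
// Hypothetical sketch of docID delta coding; not the patch code.
final class DocDeltaSketch {

    /** Converts strictly increasing docIDs into deltas (docID - lastDocID). */
    static int[] toDeltas(int[] docIDs) {
        int[] deltas = new int[docIDs.length];
        int lastDocID = 0;                    // assumed starting point per list
        for (int i = 0; i < docIDs.length; i++) {
            deltas[i] = docIDs[i] - lastDocID;
            lastDocID = docIDs[i];
        }
        return deltas;
    }

    /** Restores docIDs from deltas with a running sum. */
    static int[] fromDeltas(int[] deltas) {
        int[] docIDs = new int[deltas.length];
        int docID = 0;
        for (int i = 0; i < deltas.length; i++) {
            docID += deltas[i];
            docIDs[i] = docID;
        }
        return docIDs;
    }
}
```

For a term that occurs in docs 0, 1, 2, 3 the deltas are 0, 1, 1, 1: only the first entry can be 0, and a term present in every document yields a run of 1s.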
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13413935#comment-13413935 ]

Michael McCandless commented on LUCENE-3892:

Those are interesting results! Curious how much faster indexing is for PFor if you use all_Vs; cutting the header is also a nice reduction in index size.

Instead of having P/ForUtil reach up into P/ForPostingsFormat for the default block size, I think we can assume the int[] array length (of the decoded buffer) is the size of the block?
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13413312#comment-13413312 ]

Michael McCandless commented on LUCENE-3892:

Thanks Billy, I'll commit!

One thing I noticed: I think we shouldn't separately read numBytes and the int header? Can't we do a single readVInt() that encodes numBytes as well as the format (bit width and format, once we tie into the oal.util.packed APIs)?

Also, we shouldn't encode numInts at all, i.e., this should be fixed for the whole segment, and not written per block.
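One way to picture the "single readVInt()" suggestion is the sketch below. The bit layout (numBits in the low 6 bits, numBytes in the bits above) is purely an assumption for illustration, as are the names BlockHeaderSketch, encodeHeader, writeVInt and readVInt; the variable-length encoding itself follows the usual VInt scheme of 7 payload bits per byte with the high bit marking continuation.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical header layout; the real format in the patches may differ.
final class BlockHeaderSketch {

    /** Combines both fields into one int: numBytes above, numBits (0..32) below. */
    static int encodeHeader(int numBytes, int numBits) {
        return (numBytes << 6) | numBits;
    }

    static int numBytes(int header) {
        return header >>> 6;
    }

    static int numBits(int header) {
        return header & 63;
    }

    /** Writes a non-negative int as a VInt: low 7 bits first, 0x80 = more bytes follow. */
    static byte[] writeVInt(int i) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((i & ~0x7F) != 0) {
            out.write((i & 0x7F) | 0x80);
            i >>>= 7;
        }
        out.write(i);
        return out.toByteArray();
    }

    /** Reads one VInt from the start of bytes. */
    static int readVInt(byte[] bytes) {
        int result = 0;
        int shift = 0;
        for (byte b : bytes) {
            result |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                break;                        // last byte of this VInt
            }
            shift += 7;
        }
        return result;
    }
}
```

With a fixed block size, numBytes is often derivable from numBits alone, so a layout like this mainly matters for variable-size payloads such as PFor exceptions; that reading is an inference, not something stated in the thread.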
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13413314#comment-13413314 ]

Michael McCandless commented on LUCENE-3892:

I didn't commit lucene/core/src/java/org/apache/lucene/codecs/pfor/ForPostingsFormat.java -- your IDE had changed it to a wildcard import (I prefer we stick with individual imports).

Was the numBits==0 case for all 0s not all 1s? We may want to have it mean all 1s instead?
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13411472#comment-13411472 ]

Michael McCandless commented on LUCENE-3892:

bq. I'm still not sure about the IOUtils.closeWhileHandlingException(), I think the exceptions should not be suppressed when out.close() is called?

Actually I think you want them to be suppressed, so that the original exception is seen?
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13411475#comment-13411475 ]

Michael McCandless commented on LUCENE-3892:

Docs/cleanup patch looks good, I'll commit to the branch! Thanks.
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13411495#comment-13411495 ] Han Jiang commented on LUCENE-3892: bq. Actually I think you want them to be suppressed, so that the original exception is seen? Not my idea actually, I think the exception should be thrown for out.close()? closeWhileHandlingException() will suppress those exceptions.
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13411515#comment-13411515 ] Michael McCandless commented on LUCENE-3892: bq. Not my idea actually, I think the exception should be thrown for out.close()? closeWhileHandlingException() will suppress those exceptions But the problem is some other exception has already been thrown (because success is false). If out.close then hits a second exception we have to pick which one should be thrown, and I think the original one is better? (Since it's likely the root cause of whatever went wrong).
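The success-flag pattern discussed in this exchange can be sketched in plain Java. This is only an illustration under stated assumptions, not the code on the branch: `FailingResource`, `closeQuietly`, `write` and `messageOf` are hypothetical names, and `closeQuietly` merely stands in for a helper in the spirit of closeWhileHandlingException() by swallowing any exception raised during close.

```java
import java.io.Closeable;
import java.io.IOException;

class SuppressedCloseDemo {
  /** Hypothetical resource whose close() always fails. */
  static final class FailingResource implements Closeable {
    @Override
    public void close() throws IOException {
      throw new IOException("secondary failure in close()");
    }
  }

  /** Stand-in helper (an assumption): close the resource and drop any IOException. */
  static void closeQuietly(Closeable c) {
    try {
      if (c != null) {
        c.close();
      }
    } catch (IOException ignored) {
      // Deliberately dropped so that it cannot mask an exception already in flight.
    }
  }

  /** Writes to 'out' (simulated) and closes it using the success-flag pattern. */
  static void write(Closeable out, boolean failDuringWrite) throws IOException {
    boolean success = false;
    try {
      if (failDuringWrite) {
        throw new IOException("original failure while writing");
      }
      success = true;
    } finally {
      if (success) {
        out.close();        // nothing else went wrong, so a close() failure must surface
      } else {
        closeQuietly(out);  // an exception is already propagating; keep it as the one thrown
      }
    }
  }

  /** Returns the message of whichever exception reaches the caller. */
  static String messageOf(boolean failDuringWrite) {
    try {
      write(new FailingResource(), failDuringWrite);
      return "no exception";
    } catch (IOException e) {
      return e.getMessage();
    }
  }

  public static void main(String[] args) {
    System.out.println(messageOf(true));   // the write failure wins over the close failure
    System.out.println(messageOf(false));  // only close() failed, so that is what is reported
  }
}
```

The design choice matches the argument above: when two exceptions compete, the first one is more likely to be the root cause, so the close failure is only reported when nothing else went wrong.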
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13411517#comment-13411517 ]
Han Jiang commented on LUCENE-3892:
bq. The Pulsing parts in last patch is not included here, because they doesn't improve performance significantly.
Here are some tests comparing For vs PulsingFor and PFor vs PulsingPFor, run on the 1M docs with wikimediumhard.tasks. It is strange that PKLookup still doesn't benefit with FixedBlockInt:
{noformat}
Task            QPS For  StdDev For  QPS PulsingFor  StdDev PulsingFor  Pct diff
AndHighHigh     23.01    0.33        22.94           0.66               -4% - 4%
AndHighMed      56.41    0.76        57.41           1.74               -2% - 6%
Fuzzy1          86.74    0.85        82.22           2.39               -8% - -1%
Fuzzy2          28.23    0.38        26.15           0.97               -11% - -2%
IntNRQ          41.78    1.65        40.78           3.53               -14% - 10%
OrHighHigh      14.44    0.34        14.50           0.92               -8% - 9%
OrHighMed       30.59    0.77        31.12           1.93               -6% - 10%
PKLookup        110.31   2.03        109.22          2.43               -4% - 3%
Phrase          8.18     0.44        7.97            0.40               -12% - 8%
Prefix3         99.64    2.38        97.09           3.46               -8% - 3%
Respell         99.66    0.45        92.76           2.81               -10% - -3%
SloppyPhrase    4.28     0.16        4.08            0.13               -11% - 2%
SpanNear        4.08     0.13        3.93            0.06               -7% - 0%
Term            33.63    1.25        34.06           1.71               -7% - 10%
TermBGroup1M    15.54    0.46        15.78           0.56               -4% - 8%
TermBGroup1M1P  20.34    0.73        20.62           0.62               -5% - 8%
TermGroup1M     19.18    0.52        19.72           0.49               -2% - 8%
Wildcard        34.86    0.88        34.27           1.77               -9% - 6%
{noformat}
{noformat}
AndHighHigh     19.98    0.31        19.92           0.26               -3% - 2%
AndHighMed      58.21    1.51        57.86           1.18               -5% - 4%
Fuzzy1          91.86    1.17        85.86           1.18               -8% - -4%
Fuzzy2          32.66    0.58        30.08           0.57               -11% - -4%
IntNRQ          33.89    0.82        32.66           1.10               -9% - 2%
OrHighHigh      15.79    1.29        14.96           0.67               -16% - 7%
OrHighMed       30.31    2.09        28.91           1.67               -15% - 8%
PKLookup        112.80   0.81        111.82          2.90               -4% - 2%
Phrase          6.14     0.11        6.23            0.10               -1% - 5%
Prefix3         147.80   2.88        138.35          2.11               -9% - -3%
Respell         118.57   1.18        108.30          1.86               -11% - -6%
SloppyPhrase    5.78     0.15        5.66            0.29               -9% - 5%
SpanNear        6.32     0.14        6.40            0.16               -3% - 6%
Term            41.60    2.44        38.12           0.33               -14% - -1%
TermBGroup1M    14.40    0.48        13.73           0.19               -8% - 0%
TermBGroup1M1P  23.68    0.44        22.82           0.44               -7% - 0%
TermGroup1M     15.25    0.48        14.51           0.20               -9% - 0%
Wildcard        32.76    0.53        31.76           0.62               -6% - 0%
{noformat}
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13411533#comment-13411533 ] Han Jiang commented on LUCENE-3892: bq. But the problem is some other exception has already been thrown (because success is false). If out.close then hits a second exception we have to pick which one should be thrown, and I think the original one is better? (Since it's likely the root cause of whatever went wrong). OK, I see, then let's change ForPostingsFormat.fieldsConsumer/Producer as well.
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13411546#comment-13411546 ] Michael McCandless commented on LUCENE-3892: OK I committed that! Let me know if I missed any...
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13411556#comment-13411556 ] Han Jiang commented on LUCENE-3892: OK, thanks!
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13409664#comment-13409664 ] Michael McCandless commented on LUCENE-3892: bq. Current branch cannot pass tests like this: Thanks, I committed the patch.
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13405477#comment-13405477 ] Michael McCandless commented on LUCENE-3892: Thanks Billy, I committed this to the branch.
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13399869#comment-13399869 ] Chris Male commented on LUCENE-3892: It's really interesting the effect of peeling back those abstractions.
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13399883#comment-13399883 ]
Han Jiang commented on LUCENE-3892:
Yes, really interesting. And that should make sense. As far as I know, a method with exception handling may be quite a bit slower than a simple if-statement check. Here is part of the result from my test, with Mike's patch:
{noformat}
OrHighMed       2.53      0.31         2.57      0.13         -13% - 21%
Wildcard        3.86      0.12         3.94      0.38         -10% - 15%
OrHighHigh      1.57      0.18         1.61      0.08         -12% - 21%
TermBGroup1M1P  1.93      0.03         2.48      0.10         21% - 35%
TermGroup1M     1.37      0.02         1.81      0.05         26% - 37%
TermBGroup1M    1.17      0.02         1.64      0.07         32% - 47%
Term            2.92      0.13         4.46      0.23         38% - 68%
{noformat}
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13398800#comment-13398800 ]
Han Jiang commented on LUCENE-3892:
And the same code with the wikimediumhard.tasks file. (This is really a hard test case, since the QPS values are so small that we can hardly depend on Pct diff :) )
{noformat}
Task            QPS Base  StdDev Base  QPS For   StdDev For   Pct diff
AndHighMed      10.76     0.21         6.47      0.32         -43% - -35%
AndHighHigh     2.89      0.08         2.57      0.19         -20% - -1%
SpanNear        0.60      0.01         0.55      0.01         -11% - -6%
SloppyPhrase    0.61      0.01         0.57      0.01         -9% - -3%
PKLookup        87.72     2.61         86.28     1.48         -6% - 3%
Fuzzy1          36.22     1.14         35.90     0.97         -6% - 5%
Phrase          1.22      0.03         1.22      0.08         -9% - 8%
Respell         32.84     0.92         33.55     0.87         -3% - 7%
IntNRQ          3.66      0.35         3.74      0.08         -8% - 15%
Fuzzy2          21.62     0.66         22.10     0.51         -3% - 7%
Prefix3         13.30     0.49         14.09     0.76         -3% - 15%
OrHighMed       3.43      0.16         3.65      0.45         -10% - 25%
OrHighHigh      1.66      0.09         1.79      0.22         -10% - 28%
Wildcard        3.39      0.14         3.74      0.20         0% - 21%
TermBGroup1M1P  1.84      0.03         2.10      0.16         3% - 25%
TermGroup1M     1.14      0.03         1.34      0.10         5% - 29%
TermBGroup1M    1.49      0.05         1.78      0.13         7% - 32%
Term            3.49      0.13         4.38      0.65         2% - 49%
{noformat}
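For readers wondering how a "Pct diff" range relates to the QPS and StdDev columns: the numbers in these tables look consistent with a worst-case/best-case comparison that shifts each mean by one standard deviation. The sketch below is an assumption inferred from the reported values, not taken from the benchmark tool's source; `PctDiffRange` and `range` are hypothetical names.

```java
class PctDiffRange {
  /**
   * Returns {low, high} as fractions (for example -0.10 means -10%).
   * Assumed formula: the low end pits the slowest plausible candidate against
   * the fastest plausible baseline, the high end does the opposite.
   */
  static double[] range(double qpsBase, double stdBase, double qpsCmp, double stdCmp) {
    double low = (qpsCmp - stdCmp) / (qpsBase + stdBase) - 1.0;
    double high = (qpsCmp + stdCmp) / (qpsBase - stdBase) - 1.0;
    return new double[] {low, high};
  }

  public static void main(String[] args) {
    // AndHighMed row from the table above: Base 10.76 +/- 0.21, For 6.47 +/- 0.32.
    double[] r = range(10.76, 0.21, 6.47, 0.32);
    System.out.println("low  = " + r[0] * 100 + " %");
    System.out.println("high = " + r[1] * 100 + " %");
  }
}
```

For the AndHighMed row this lands close to the reported "-43% - -35%". With QPS values this small, even a modest StdDev widens the range a great deal, which is the caveat raised in the comment.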
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13397605#comment-13397605 ] Michael McCandless commented on LUCENE-3892: OK I created a branch and committed last For patch: https://svn.apache.org/repos/asf/lucene/dev/branches/pforcodec_3892
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13397694#comment-13397694 ]
Han Jiang commented on LUCENE-3892:
OK, I just reproduced your test. But Mike, are we using a same task file? Our relative speeds for different queries are not the same.
{quote}
Task            QPS Base  StdDev Base  QPS For   StdDev For   Pct diff
Phrase          5.07      0.45         3.76      0.19         -35% - -14%   (-44% - -18%)
AndHighMed      28.32     2.34         22.67     0.67         -28% - -10%   (-38% - -9%)
SpanNear        2.72      0.13         2.36      0.14         -22% - -3%    (-36% - -8%)
SloppyPhrase    4.18      0.20         3.83      0.15         -16% - 0%     (-33% - -6%)
Respell         42.02     1.83         38.86     2.30         -16% - 2%     (-18% - 0%)
Fuzzy1          44.96     1.58         42.85     1.69         -11% - 2%     (-12% - 0%)
Fuzzy2          16.78     0.69         16.34     0.68         -10% - 5%     (-12% - 3%)
PKLookup        89.11     2.15         87.33     2.19         -6% - 2%      (-2% - 5%)
AndHighHigh     7.61      0.44         7.69      0.21         -7% - 10%     (-21% - 10%)
Wildcard        19.50     0.91         20.02     0.72         -5% - 11%     (-21% - 3%)
TermBGroup1M    20.82     0.37         21.73     0.69         0% - 9%       (2% - 10%)
TermGroup1M     13.79     0.13         14.61     0.32         2% - 9%       (1% - 9%)
IntNRQ          4.11      0.56         4.56      0.56         -14% - 43%    (-25% - 33%)
TermBGroup1M1P  21.45     0.75         24.00     0.51         5% - 18%      (-1% - 22%)
OrHighMed       5.08      0.49         5.73      0.15         0% - 28%      (-16% - 25%)
OrHighHigh      4.22      0.39         4.78      0.13         1% - 28%      (-15% - 24%)
Prefix3         30.91     1.63         35.65     2.02         3% - 28%      (-14% - 21%)
Term            44.36     1.87         54.01     1.96         12% - 31%     (-1% - 33%)
{quote}
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13397958#comment-13397958 ] Michael McCandless commented on LUCENE-3892: bq. But Mike, are we using a same task file? Our relative speeds for different queries are not the same. Sorry, I'm using a hand-edited hard tasks file; I'll commit and push it to luceneutil. But, separately: each run picks a different subset of the tasks from each category to run, so results from one run to another in general aren't comparable unless we fix the random seed it uses.
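The point about the random seed can be illustrated with a small, self-contained sketch. It does not reproduce how the benchmark tool actually selects tasks; `SeededTaskSample` and `pick` are hypothetical names, and the task strings are just labels borrowed from the tables in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class SeededTaskSample {
  /** Picks 'count' tasks pseudo-randomly; the same seed always yields the same subset. */
  static List<String> pick(List<String> tasks, int count, long seed) {
    List<String> copy = new ArrayList<>(tasks);   // leave the caller's list untouched
    Collections.shuffle(copy, new Random(seed));  // deterministic for a given seed
    return new ArrayList<>(copy.subList(0, Math.min(count, copy.size())));
  }

  public static void main(String[] args) {
    List<String> tasks = Arrays.asList("Term", "Phrase", "Fuzzy1", "Wildcard", "PKLookup");
    // Two runs with the same seed select the same tasks, so their numbers are comparable.
    System.out.println(pick(tasks, 2, 42L));
    System.out.println(pick(tasks, 2, 42L));
    // A different seed may select a different subset.
    System.out.println(pick(tasks, 2, 43L));
  }
}
```

Fixing the seed trades coverage for comparability: every run measures the same queries, so differences between runs reflect the code change rather than the sample.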
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13396987#comment-13396987 ]

Han Jiang commented on LUCENE-3892:

Oh, thank you Mike! I haven't thought much about those skipping policies yet.

bq. Up above, in ForFactory, when we readInt() to get numBytes ... it seems like we could stuff the header numBits into that same int and save checking that in FORUtil.decompress

Ah, yes, I just forgot to remove the redundant code. Here is an initial try that removes the header and calls ForDecompressImpl directly in readBlock(), with For, blockSize=128. Data in parentheses shows the prior benchmark.

{noformat}
Task              QPS Base  StdDev Base   QPS For  StdDev For      Pct diff
Phrase                4.99         0.37      3.57        0.26  -38% - -17%  (-44% - -18%)
AndHighMed           28.91         2.17     22.66        0.82  -29% - -12%  (-38% -  -9%)
SpanNear              2.72         0.14      2.22        0.13  -26% -  -8%  (-36% -  -8%)
SloppyPhrase          4.24         0.26      3.70        0.16  -21% -  -3%  (-33% -  -6%)
Respell              40.71         2.59     37.66        1.36  -16% -   2%  (-18% -   0%)
Fuzzy1               43.22         2.01     40.66        0.32  -10% -   0%  (-12% -   0%)
Fuzzy2               16.25         0.90     15.64        0.26  -10% -   3%  (-12% -   3%)
Wildcard             19.07         0.86     19.07        0.73   -8% -   8%  (-21% -   3%)
AndHighHigh           7.76         0.47      7.77        0.15   -7% -   8%  (-21% -  10%)
PKLookup             87.50         4.56     88.51        1.24   -5% -   8%  ( -2% -   5%)
TermBGroup1M         20.42         0.87     21.32        0.74   -3% -  12%  (  2% -  10%)
OrHighMed             5.33         0.68      5.61        0.14   -9% -  23%  (-16% -  25%)
OrHighHigh            4.43         0.53      4.69        0.12   -8% -  23%  (-15% -  24%)
TermGroup1M          13.30         0.34     14.31        0.40    2% -  13%  (  0% -  13%)
TermBGroup1M1P       20.92         0.59     23.71        0.86    6% -  20%  ( -1% -  22%)
Prefix3              30.30         1.41     35.14        1.76    5% -  27%  (-14% -  21%)
IntNRQ                3.90         0.54      4.58        0.47   -7% -  50%  (-25% -  33%)
Term                 42.17         1.55     52.33        2.57   13% -  35%  (  1% -  33%)
{noformat}

The improvement is quite general. However, I still suspect this just benefits from fewer method calls. I'm trying to change the PFor code to remove those nested calls.

bq. Get more direct access to the file as an int[]; ...

OK, this will be considered when pfor+pulsing is completed. I'm just curious why we don't have readInts in oal.util yet...

bq. Skipping: can we partially decode a block? ...

The pfor-opt approach (encode the lower bits of each exception in the normal area, and the remaining bits in the exception area) naturally fits partially decoding a block; that will be possible when we optimize skipping queries.
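To make the pfor-opt idea above concrete, here is a minimal sketch, not the code from the attached patches. The class name PForOptSketch and its fields are hypothetical, the arrays are left unpacked for readability, and it assumes non-negative values and numBits in 0..31. Every value keeps its low numBits in the normal area, and only the remaining high bits of the exceptions go to the exception area.

```java
// Hypothetical illustration of the "pfor-opt" split; not the patch's PForUtil.
final class PForOptSketch {
  final int numBits;              // width of the normal area, assumed 0..31
  final int[] lowBits;            // "normal area": low numBits of every value
  final int[] exceptionIndex;     // positions of values that did not fit
  final int[] exceptionHighBits;  // "exception area": the remaining high bits

  PForOptSketch(int[] values, int numBits) {
    if (numBits < 0 || numBits > 31) {
      throw new IllegalArgumentException("numBits must be in 0..31");
    }
    this.numBits = numBits;
    int mask = (1 << numBits) - 1;
    int count = 0;
    for (int v : values) {
      if (v < 0) throw new IllegalArgumentException("values must be non-negative");
      if ((v >>> numBits) != 0) count++;   // value needs more than numBits
    }
    lowBits = new int[values.length];
    exceptionIndex = new int[count];
    exceptionHighBits = new int[count];
    int e = 0;
    for (int i = 0; i < values.length; i++) {
      lowBits[i] = values[i] & mask;       // always stored, exception or not
      int high = values[i] >>> numBits;
      if (high != 0) {
        exceptionIndex[e] = i;
        exceptionHighBits[e] = high;
        e++;
      }
    }
  }

  // Full decode: start from the normal area, then patch in the high bits.
  int[] decode() {
    int[] out = lowBits.clone();
    for (int e = 0; e < exceptionIndex.length; e++) {
      out[exceptionIndex[e]] |= exceptionHighBits[e] << numBits;
    }
    return out;
  }
}
```

Because the normal area is complete on its own, a skipping reader could in principle unpack only a suffix of lowBits and apply just the exceptions whose index falls in that suffix.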
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13397228#comment-13397228 ]

Han Jiang commented on LUCENE-3892:

And the result for PFor (blocksize=128):

{noformat}
Task              QPS Base  StdDev Base  QPS PFor  StdDev PFor     Pct diff
Phrase                4.87         0.36      3.39         0.18  -38% - -20%  (-47% - -25%)
AndHighMed           27.78         2.35     21.13         0.52  -31% - -14%  (-37% - -15%)
SpanNear              2.70         0.14      2.20         0.11  -26% -  -9%  (-36% - -13%)
SloppyPhrase          4.17         0.15      3.77         0.21  -17% -   0%  (-30% -  -6%)
Respell              39.97         1.56     37.65         1.95  -14% -   3%  (-15% -   2%)
Wildcard             19.08         0.77     18.33         0.92  -12% -   5%  (-17% -   3%)
Fuzzy1               42.29         1.13     40.78         1.44   -9% -   2%  (-11% -   1%)
AndHighHigh           7.61         0.55      7.45         0.08   -9% -   6%  (-19% -   6%)
Fuzzy2               15.79         0.55     15.64         0.70   -8% -   7%  (-11% -   6%)
PKLookup             86.71         2.13     88.92         2.24   -2% -   7%  ( -2% -   7%)
TermGroup1M          13.04         0.23     14.03         0.40    2% -  12%  (  1% -   9%)
IntNRQ                3.97         0.48      4.35         0.61  -15% -  41%  (-16% -  24%)
TermBGroup1M1P       21.04         0.35     23.20         0.60    5% -  14%  (  0% -  14%)
TermBGroup1M         19.27         0.47     21.28         0.84    3% -  17%  (  1% -  10%)
OrHighHigh            4.13         0.47      4.63         0.27   -5% -  34%  (-14% -  27%)
OrHighMed             4.95         0.59      5.58         0.34   -5% -  35%  (-14% -  27%)
Prefix3              30.33         1.36     34.26         2.14    1% -  25%  ( -6% -  20%)
Term                 41.99         1.19     50.75         1.72   13% -  28%  (  2% -  26%)
{noformat}

It works, and it is quite interesting that StdDev for the Term query is reduced significantly.
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13395894#comment-13395894 ]

Han Jiang commented on LUCENE-3892:

There's a potential bottleneck in the method calls... Here is an example for PFor, with blocksize=128, exception rate = 97%, normal value = 2 bits, exception value = 32 bits:

{noformat}
Decoding normal values: 4703 ns
Patching exceptions: 5797 ns
Single call of PForUtil.decompress totally takes: 58318 ns
{noformat}

In addition, it costs about 4000 ns to record the time span.
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13396325#comment-13396325 ]

Michael McCandless commented on LUCENE-3892:

On the For patch ... we shouldn't encode/decode numInts right? It's always 128?

Up above, in ForFactory, when we readInt() to get numBytes ... it seems like we could stuff the header numBits into that same int and save checking that in FORUtil.decompress

I think there are a few possible ideas to explore to get faster PFor/For performance:

* Get more direct access to the file as an int[]; eg MMapDir could expose an IntBuffer from its ByteBuffer (saving the initial copy into byte[] that we now do). Or maybe we add IndexInput.readInts(int[]) and dir impl can optimize how that's done (MMapDir could use Unsafe.copyBytes... except for little endian architectures ... we'd probably have to have separate specialized decoder rather than letting Int/ByteBuffer do the byte swapping). This would require the whole file stays aligned w/ int (eg the header must be 0 mod 4).
* Copy/share how oal.packed works, i.e. being able to waste a bit to have faster decode (eg storing the 7 bit case as byte[], wasting 1 bit for each value).
* Skipping: can we partially decode a block? EG if we are skipping and we know we only want values after the 80th one, then we shouldn't decode those first 80...
* Since doc/freq are aligned, when we store pointers to a given spot, eg in the terms dict or in skip data, we should only store the offset once (today we store it twice).
* Alternatively, maybe we should only save skip data on doc/freq block boundaries (prox would still need skip-within-block).
* Maybe we should store doc frq blocks interleaved in a single file (since they are aligned) and then skip would skip to the start of a doc/frq block pair.

Other ideas...?
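The suggestion of stuffing numBits into the same int as numBytes could look roughly like the sketch below. This is only an illustration under stated assumptions: BlockHeader and its layout (low 6 bits for numBits in 0..32, the remaining 26 bits for numBytes) are hypothetical and are not taken from the ForFactory or FORUtil code in the patch.

```java
// Hypothetical header layout: [ numBytes : 26 bits | numBits : 6 bits ].
final class BlockHeader {
  private static final int NUM_BITS_WIDTH = 6;                        // enough for 0..32
  private static final int NUM_BITS_MASK = (1 << NUM_BITS_WIDTH) - 1;
  private static final int MAX_NUM_BYTES = (1 << (32 - NUM_BITS_WIDTH)) - 1;

  // Pack both fields into the single int that would be written per block.
  static int encode(int numBytes, int numBits) {
    if (numBits < 0 || numBits > 32) {
      throw new IllegalArgumentException("numBits out of range: " + numBits);
    }
    if (numBytes < 0 || numBytes > MAX_NUM_BYTES) {
      throw new IllegalArgumentException("numBytes out of range: " + numBytes);
    }
    return (numBytes << NUM_BITS_WIDTH) | numBits;
  }

  // After one readInt(), both fields are available without a second header read.
  static int numBytes(int header) {
    return header >>> NUM_BITS_WIDTH;
  }

  static int numBits(int header) {
    return header & NUM_BITS_MASK;
  }
}
```

With blockSize=128 a block holds at most 512 bytes of packed data, so 26 bits for numBytes leaves ample room in this layout.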
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13289296#comment-13289296 ]

Michael McCandless commented on LUCENE-3892:

Hi Billy,

bq. Can I get it from a wiki dump instead?

You can download it at http://people.apache.org/~mikemccand/enwiki-20120502-lines-1k.txt.lzma

That's ~6.3 GB (compressed) and 28.7 GB (decompressed); it's the 2012/05/02 Wikipedia en export, filtered to plain text and then broken into 33.3 M ~1 KB sized docs.

I can help you get the luceneutil env set up...

{quote}
bq. Indexing time is ~18% slower than Lucene40PostingsFormat (1071 sec vs 1261 sec).

Yes, it is expected, actually it scans every block 33 times to estimate metadata such as numFrameBits and numExceptions.
{quote}

OK, in that case I'm surprised it's only ~18% slower!
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13288675#comment-13288675 ]

Michael McCandless commented on LUCENE-3892:

Excellent! All tests also pass for me w/ PFor postings format as well... this is a great starting point :)

One Solr test failed (ContentStreamTest)... but I think it was false failure... I did notice the tests seem to run slower, especially certain ones eg TestJoinUtil.

Still missing a couple license headers (TestMin, TestCompress)...

I ran a quick perf test using http://code.google.com/a/apache-extras.org/p/luceneutil on a 10M doc Wikipedia index. Indexing time is ~18% slower than Lucene40PostingsFormat (1071 sec vs 1261 sec). But more important is the slower search times:

{noformat}
Task              QPS base  StdDev base  QPS pfor  StdDev pfor     Pct diff
Phrase                8.52         0.50      4.43         0.40  -55% - -39%
SloppyPhrase         12.52         0.39      7.87         0.51  -43% - -30%
AndHighMed           67.69         2.82     44.22         1.47  -39% - -29%
SpanNear              5.19         0.12      3.90         0.28  -31% - -17%
PKLookup            112.16         1.71     95.61         1.30  -17% - -12%
AndHighHigh          13.22         0.34     11.86         0.72  -17% -  -2%
Wildcard             46.04         0.37     41.68         4.45  -19% -   1%
Fuzzy1               50.11         2.03     48.06         1.91  -11% -   3%
OrHighMed             9.26         0.48      8.90         0.37  -12% -   5%
OrHighHigh           12.28         0.56     11.83         0.49  -11% -   5%
TermBGroup1M1P       40.47         1.94     39.88         2.51  -11% -  10%
Fuzzy2               53.71         2.66     53.01         2.08   -9% -   7%
TermGroup1M          36.46         1.21     35.99         1.58   -8% -   6%
TermBGroup1M         55.53         1.99     55.26         2.68   -8% -   8%
Respell              69.71         4.49     69.73         2.07   -8% -  10%
Term                 94.38         7.62     94.96        12.19  -18% -  23%
Prefix3              41.63         0.34     42.21         5.82  -13% -  16%
IntNRQ                7.08         0.15      7.28         1.29  -17% -  23%
{noformat}

The queries that do skipping are quite a bit slower; this makes sense, since on skip we do a full block decode. A smaller block size (we use 128 now right?) should help I think.

It's strange that the non-skipping queries (Term, OrHighMed, OrHighHigh) don't show any performance gain ... maybe we need to optimize the decode... or it could be the removal of the bulk api is hurting us here.

I'm also curious if we tried a pure FOR (no patching, so we must set numBits according to the max value = larger index but hopefully faster decode) if the results would improve...
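For readers unfamiliar with the pure FOR variant mentioned above, the following is a rough sketch of the idea only: no exceptions, with numBits chosen from the largest value in the block. The SimpleFor class and its long[] layout are assumptions made for this example, values are assumed non-negative, and this is neither the patch's code nor the oal.packed implementation.

```java
// Hypothetical pure-FOR sketch: every value in a block uses the same bit width.
final class SimpleFor {
  // Smallest width that fits the largest value (0 when all values are 0).
  static int bitsRequired(int[] values) {
    int or = 0;
    for (int v : values) or |= v;
    return 32 - Integer.numberOfLeadingZeros(or);
  }

  // Pack values back to back into 64-bit words, numBits each.
  static long[] encode(int[] values, int numBits) {
    long[] packed = new long[(values.length * numBits + 63) >>> 6];
    if (numBits == 0) return packed;          // nothing to store
    long mask = (1L << numBits) - 1;
    int bitPos = 0;
    for (int v : values) {
      int word = bitPos >>> 6, shift = bitPos & 63;
      long bits = v & mask;
      packed[word] |= bits << shift;
      if (shift + numBits > 64) {             // value straddles two words
        packed[word + 1] |= bits >>> (64 - shift);
      }
      bitPos += numBits;
    }
    return packed;
  }

  // Decode has no per-value branch on exceptions: just shift and mask.
  static int[] decode(long[] packed, int numBits, int count) {
    int[] out = new int[count];
    if (numBits == 0) return out;
    long mask = (1L << numBits) - 1;
    int bitPos = 0;
    for (int i = 0; i < count; i++) {
      int word = bitPos >>> 6, shift = bitPos & 63;
      long v = packed[word] >>> shift;
      if (shift + numBits > 64) {
        v |= packed[word + 1] << (64 - shift);
      }
      out[i] = (int) (v & mask);
      bitPos += numBits;
    }
    return out;
  }
}
```

The trade-off is the one described in the comment: a single large value forces the whole block to a wide numBits, which grows the index, in exchange for a decode loop with no patching step.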
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13289104#comment-13289104 ]

Han Jiang commented on LUCENE-3892:

Thanks Mike, these are so many details to help optimize!

bq. Still missing a couple license headers (TestMin, TestCompress)...

OK, I'll add them later.

bq. I ran a quick perf test using http://code.google.com/a/apache-extras.org/p/luceneutil on a 10M doc Wikipedia index.

The script is wonderful! But the wiki data is missing? Can I get it from a wiki dump instead?

bq. Indexing time is ~18% slower than Lucene40PostingsFormat (1071 sec vs 1261 sec).

Yes, it is expected, actually it scans every block 33 times to estimate metadata such as numFrameBits and numExceptions.
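One plausible reading of "scans every block 33 times" is one pass per candidate width from 0 to 32 bits. The sketch below illustrates that reading only; FrameBitsEstimator and its cost model (each exception costing a 32-bit value plus an 8-bit position) are assumptions for this example and are not taken from the patch.

```java
// Hypothetical estimator: try every width 0..32 and keep the cheapest.
final class FrameBitsEstimator {
  // Assumed cost of one exception in bits: 32-bit value + 8-bit position.
  static final int EXCEPTION_COST_BITS = 32 + 8;

  // Returns {numFrameBits, numExceptions} for the cheapest candidate.
  static int[] estimate(int[] block) {
    long bestSize = Long.MAX_VALUE;
    int bestBits = 0, bestExceptions = 0;
    for (int numBits = 0; numBits <= 32; numBits++) {   // 33 candidate widths
      int exceptions = 0;
      for (int v : block) {                             // one full scan per candidate
        if (numBits < 32 && (v >>> numBits) != 0) exceptions++;
      }
      long size = (long) block.length * numBits
          + (long) exceptions * EXCEPTION_COST_BITS;
      if (size < bestSize) {                            // ties keep the smaller width
        bestSize = size;
        bestBits = numBits;
        bestExceptions = exceptions;
      }
    }
    return new int[] {bestBits, bestExceptions};
  }
}
```

A single pass that builds a histogram of required bit widths would give the same answer with one scan, which may be one way to reduce the indexing overhead discussed here.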
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13287936#comment-13287936 ]

Michael McCandless commented on LUCENE-3892:

Awesome progress! Nice to have a dirt path online that we can then iterate from ...

Hmm, I'm seeing some test failures when I run:

{noformat}
ant test -Dtests.postingsformat=PFor
{noformat}

Eg, TestNRTThreads, TestShardSearching, TestTimeLimitingCollector.

Remember to add the standard copyright headers to each new source file...

We don't have to do this now, but I wonder if we can share code w/ the packed ints impl we have, instead generating another one with the .py source.

TestDemo makes a nice TestMin... I usually start with TestDemo when testing scary new code, and then it's a huge milestone once TestDemo passes :)

We should definitely cutover to BlockTree terms dict (I would upgrade that TODO to a nocommit!).

I suspect that wrapping the blocks byte[] as ByteBuffer and then IntBuffer is going to be too costly per decode so we should init them once and re-use (upgrade that TODO to a nocommit).
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13287951#comment-13287951 ]

Han Jiang commented on LUCENE-3892:

Ah, yes, I forgot to use -Dtests.postingsformat... I can see the errors now.

{quote}
TestDemo makes a nice TestMin... I usually start with TestDemo when testing scary new code, and then it's a huge milestone once TestDemo passes
{quote}

Hmm, does that mean I should remove TestMin.java? This test case works fine for the patch.

{quote}
We should definitely cutover to BlockTree terms dict (I would upgrade that TODO to a nocommit!).
{quote}

I'm not quite familiar with these markers; shall I change all the TODOs into nocommit? Are the markers related to documentation, or are they just reminders not to commit the current code?
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13287952#comment-13287952 ]

Michael McCandless commented on LUCENE-3892:

bq. Hmm, does that mean I should remove TestMin.java? This test case works fine for the patch.

Oh it's fine to keep TestMin now that you wrote it ... I was just saying that TestDemo is the test I run when I want the most trivial test for a new big change.

{quote}
I'm not quite familiar with these markers; shall I change all the TODOs into nocommit? Are the markers related to documentation, or are they just reminders not to commit the current code?
{quote}

Sorry - this is just a convention I use: I put a // nocommit comment whenever there's a blocker to committing; this way I can grep for nocommit to see what still needs fixing... and towards the end, nocommits will often be downgraded to TODOs since on closer inspection they really don't have to block committing...
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13265950#comment-13265950 ]

Han Jiang commented on LUCENE-3892:

An encoding named VSEncoding also seems promising! It is available here: http://integerencoding.isti.cnr.it/

Its license is compatible: https://github.com/maropu/integer_encoding_library/blob/master/LICENSE
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13262694#comment-13262694 ]

Han Jiang commented on LUCENE-3892:

It's quite strange that sometimes I cannot access repo1.maven.org, so ant ivy-bootstrap and ant resolve fail. (Since I'm in China, the network connection might be limited.)

Mike and I first hoped to make things work by pointing lucene/common-build.xml and dev-tools/scripts/poll-mirrors.pl at another maven mirror, listed in http://docs.codehaus.org/display/MAVENUSER/Mirrors+Repositories. Unfortunately, the main site repo1.maven.org is configured inside ivy-2.2.0.jar, so even if ant ivy-bootstrap passes, ant resolve still fails.

Well, here is how I got things to work (too ugly, hoping for a better suggestion!): change /etc/hosts and redirect the current maven site to a mirror with the same directory structure, for example:

194.8.197.22 repo1.maven.org # to http://mirror.netcologne.de/
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13262701#comment-13262701 ]

Michael McCandless commented on LUCENE-3892:

Phew, I'm glad to hear you got it working! So ant resolve finished successfully?
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13262707#comment-13262707 ]

Han Jiang commented on LUCENE-3892:

Yes, and ant test is running now. Maybe we can configure something to avoid the ugly hack?
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13262711#comment-13262711 ]

Robert Muir commented on LUCENE-3892:

Maybe a good solution is to have an ant property (that we somehow pass to ivy), and conditionally set it in ant by default to a server we know works in China, if ${user.language}=zh?
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13262715#comment-13262715 ] Han Jiang commented on LUCENE-3892: Thank you, Robert! But currently the Maven mirror in China (http://mirrors.redv.com/maven2) is not available. Can we pass a property to ivy to replace the repo1* stuff?
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13262777#comment-13262777 ] Robert Muir commented on LUCENE-3892: The patch does not yet fix ivy-bootstrap, which still only tries repo1.maven.org. We need a different strategy for that: either we depend on try-catch from ant-contrib (undesired), use a custom ant task (g), or use a chain of targets with fail-on-error=false unless the file already exists, and checksum at the end... Lemme see if I can fix ivy-bootstrap, too! Attachments: LUCENE-3892_settings.patch
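The third option mentioned above (a chain of download attempts that are allowed to fail, skipped once the file exists, with a checksum check at the end) can be sketched in Java. This is only an illustration of the control flow, not the Ant change discussed on the issue: the class `MirrorFallbackSketch`, the `fetcher` callback and the use of SHA-1 are assumptions made so that the example runs without network access.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;
import java.util.function.Function;

/** Hypothetical sketch: try mirrors in order, verify the checksum once at the end. */
final class MirrorFallbackSketch {

    /** Returns the lower-case hex SHA-1 digest of the given bytes. */
    static String sha1Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(data)) {
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /**
     * Asks each mirror in turn for the file. A failing mirror is ignored
     * (the analogue of fail-on-error=false), and later mirrors are skipped
     * once a file has been obtained. The checksum is verified at the end.
     */
    static byte[] fetchWithFallback(List<String> mirrors,
                                    Function<String, byte[]> fetcher,
                                    String expectedSha1) {
        byte[] data = null;
        for (String mirror : mirrors) {
            if (data != null) {
                break; // the file already exists, skip the remaining mirrors
            }
            try {
                data = fetcher.apply(mirror);
            } catch (RuntimeException e) {
                // this mirror is unreachable; fall through to the next one
            }
        }
        if (data == null || !sha1Hex(data).equals(expectedSha1)) {
            throw new IllegalStateException("no mirror delivered a file with the expected checksum");
        }
        return data;
    }
}
```

Because the checksum is only checked once, a corrupt file from an early mirror fails the whole bootstrap instead of triggering another attempt; verifying inside the loop would be the alternative design.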
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13262835#comment-13262835 ] Robert Muir commented on LUCENE-3892: I will commit this patch: please let us know if you have more problems from China! :) Attachments: LUCENE-3892_settings.patch, LUCENE-3892_settings.patch
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13260422#comment-13260422 ] Michael McCandless commented on LUCENE-3892: Hi Billy, I'm very excited that your proposal was accepted! Congrats :) Now the fun work begins...
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13260149#comment-13260149 ] Han Jiang commented on LUCENE-3892: Thank you all for giving me this opportunity! Let us begin!
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13247175#comment-13247175 ] Han Jiang commented on LUCENE-3892:

{quote}
* There are actually more than 2 codecs (eg we also have Lucene3x, SimpleText, sep/intblock (abstract), random codecs/postings formats for testing...), but our default codec now is Lucene40.
{quote}
Yes, but it seems that our baseline will be Lucene40 and Pulsing? Lucene3x is read-only, and other approaches are not productive. And what is the random codec? Does it randomly pick a codec for the user?

{quote}
* I think you can use the existing abstract sep/intblock classes (ie, they implement layers like FieldsProducer/Consumer...), and then you can just implement the required methods (eg to encode/decode one int[] block).
{quote}
This was my initial thought about the PForDelta interface. The class hierarchy will be as below (quite similar to pulsing):
* PForDeltaPostingsFormat (extends PostingsFormat): It will define global behaviors such as the file suffix, and provide a customized FieldsWriter/Reader.
* PForDeltaFieldsWriter (extends FieldsConsumer): It will define how terms, docids, freqs and offsets are written into the postings files. Inner classes include:
** PForDeltaTermsConsumer (extends TermsConsumer)
** PForDeltaPostingsConsumer (extends PostingsConsumer)
* PForDeltaFieldsReader (extends FieldsProducer): It will define how postings are read from the index, and provide *Enum classes to iterate docids, freqs etc. Inner classes include:
** PForDeltaFieldsEnum (extends FieldsEnum)
** PForDeltaTermsEnum (extends TermsEnum)
** PForDeltaDocsEnum (extends DocsEnum)
** PForDeltaDocsAndPositionsEnum (extends DocsAndPositionsEnum)
** PForDeltaTerms (extends Terms)

It seems that BlockTermsReader/Writer already implement those subclasses, and we can just pass our Postings(Writer/Reader)Base as an argument, like PatchedFrameOfRefCodec::fieldsConsumer() does. Then, to introduce PForDeltaCodec into trunk, we should also introduce the fixed codec? Also, why isn't lucene40codec implemented along these lines?

{quote}
* We may need to tune the skipper settings, based on profiling results from skip-intensive (Phrase, And) queries... since it's currently geared towards single-doc-at-once encoding. I don't think we should try to make a new skipper impl here... (there is a separate issue for that).
{quote}
I haven't investigated much about different kinds of queries. What are skipper settings?

{quote}
* Maybe explore the combination of pulsing and PForDelta codecs; seems like the combination of those two could be important, since for low docFreq terms, retrieving the docs is now more expensive...
{quote}
Yes, it seems that if PForDelta outperforms current approaches, a Pulsing version will work better? This feature will also come as phase 2.
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13245374#comment-13245374 ] Michael McCandless commented on LUCENE-3892:

The proposal at http://www.google-melange.com/gsoc/proposal/review/google/gsoc2012/billybob/1 looks great! Some initial feedback:
* There are actually more than 2 codecs (eg we also have Lucene3x, SimpleText, sep/intblock (abstract), random codecs/postings formats for testing...), but our default codec now is Lucene40.
* I think you can use the existing abstract sep/intblock classes (ie, they implement layers like FieldsProducer/Consumer...), and then you can just implement the required methods (eg to encode/decode one int[] block).
* We may need to tune the skipper settings, based on profiling results from skip-intensive (Phrase, And) queries... since it's currently geared towards single-doc-at-once encoding. I don't think we should try to make a new skipper impl here... (there is a separate issue for that).
* Maybe explore the combination of pulsing and PForDelta codecs; seems like the combination of those two could be important, since for low docFreq terms, retrieving the docs is now more expensive...
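To make "encode/decode one int[] block" concrete, here is a minimal sketch of FOR-style fixed-width bit packing, assuming non-negative values and a single bit width per block. The class and method names (`ForBlockSketch`, `bitsRequired`, `encode`, `decode`) are hypothetical. The sketch does not use the sep/intblock classes or Lucene's packed ints code, and it leaves out the exception handling that PFOR adds on top.

```java
/** Hypothetical sketch of packing one int[] block with a single bit width. */
final class ForBlockSketch {

    /** Bits needed for the largest value in the block (at least 1). */
    static int bitsRequired(int[] block) {
        int or = 0;
        for (int v : block) {
            if (v < 0) {
                throw new IllegalArgumentException("values must be non-negative");
            }
            or |= v;
        }
        return Math.max(1, 32 - Integer.numberOfLeadingZeros(or));
    }

    /** Packs every value with the same bit width (1..31) into a long[]. */
    static long[] encode(int[] block, int bits) {
        if (bits < 1 || bits > 31) {
            throw new IllegalArgumentException("bits must be in 1..31");
        }
        long[] packed = new long[(block.length * bits + 63) / 64];
        for (int i = 0; i < block.length; i++) {
            if (block[i] < 0 || (block[i] >>> bits) != 0) {
                throw new IllegalArgumentException("value does not fit in " + bits + " bits");
            }
            long bitPos = (long) i * bits;
            int word = (int) (bitPos >>> 6);   // index of the long holding the low bits
            int shift = (int) (bitPos & 63);   // offset inside that long
            packed[word] |= ((long) block[i]) << shift;
            if (shift + bits > 64) {
                // the value straddles two longs: store the high bits in the next one
                packed[word + 1] |= ((long) block[i]) >>> (64 - shift);
            }
        }
        return packed;
    }

    /** Reverses encode: reads count values of the given bit width. */
    static int[] decode(long[] packed, int bits, int count) {
        int[] block = new int[count];
        long mask = (1L << bits) - 1;
        for (int i = 0; i < count; i++) {
            long bitPos = (long) i * bits;
            int word = (int) (bitPos >>> 6);
            int shift = (int) (bitPos & 63);
            long v = packed[word] >>> shift;
            if (shift + bits > 64) {
                v |= packed[word + 1] << (64 - shift);
            }
            block[i] = (int) (v & mask);
        }
        return block;
    }
}
```

A block would typically hold deltas between consecutive doc IDs, so that the values (and therefore the bit width) stay small; the loop-per-value decode here is written for clarity rather than speed.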
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13240403#comment-13240403 ] Han Jiang commented on LUCENE-3892: Hi, I have submitted my proposal. Comments are welcome!
[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13240527#comment-13240527 ] Michael McCandless commented on LUCENE-3892: That's great Han, I'll have a look. I can be a mentor for this...