[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-08-24 Thread Uwe Schindler (JIRA)

Uwe Schindler commented on LUCENE-3892:

We should keep methods small: bigger methods work against HotSpot's code cache and, if Lucene is not the only code running in the JVM, may get de-optimized.
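
As an illustration of the point about method size, here is a minimal sketch of a decoder split into small per-bit-width methods behind a thin dispatcher, instead of one large method containing every case. The class and method names (SmallDecoders, decode8, decode16, decode) are hypothetical and are not taken from any of the attached patches; the bit layout (most significant value first) is likewise an assumption made for the example.

```java
import java.util.Arrays;

// Sketch only: names and bit layout are assumptions, not Lucene's actual code.
final class SmallDecoders {
    // Unpacks eight 8-bit values from one long, most significant byte first.
    static void decode8(long block, int[] out, int off) {
        for (int i = 0; i < 8; i++) {
            out[off + i] = (int) ((block >>> (56 - 8 * i)) & 0xFF);
        }
    }

    // Unpacks four 16-bit values from one long, most significant value first.
    static void decode16(long block, int[] out, int off) {
        for (int i = 0; i < 4; i++) {
            out[off + i] = (int) ((block >>> (48 - 16 * i)) & 0xFFFF);
        }
    }

    // Thin dispatcher: each case delegates to a separate small method
    // and returns the number of values written.
    static int decode(int bitsPerValue, long block, int[] out, int off) {
        switch (bitsPerValue) {
            case 8:
                decode8(block, out, off);
                return 8;
            case 16:
                decode16(block, out, off);
                return 4;
            default:
                throw new IllegalArgumentException("unsupported bitsPerValue: " + bitsPerValue);
        }
    }

    public static void main(String[] args) {
        int[] out = new int[8];
        int n = decode(8, 0x0102030405060708L, out, 0);
        System.out.println(n + " values: " + Arrays.toString(out));
    }
}
```

Each method body stays short, so the JIT can treat the hot cases independently; whether this actually helps in a given deployment would have to be measured.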

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-08-21 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13438728#comment-13438728
 ] 

Robert Muir commented on LUCENE-3892:
-

Thanks Billy for all the hard work and endless benchmarking, so nice to have a 
block codec that is simple and clean and reuses our packed ints optimizations.


 Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, 
 Simple9/16/64, etc.)
 -

 Key: LUCENE-3892
 URL: https://issues.apache.org/jira/browse/LUCENE-3892
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Michael McCandless
  Labels: gsoc2012, lucene-gsoc-12
 Fix For: 5.0, 4.0

 Attachments: LUCENE-3892-blockForhardcode(base).patch, 
 LUCENE-3892-blockForpackedecoder(comp).patch, 
 LUCENE-3892-blockFor-with-packedints-decoder.patch, 
 LUCENE-3892-blockFor-with-packedints-decoder.patch, 
 LUCENE-3892-blockFor-with-packedints.patch, LUCENE-3892-blockpfor.patch, 
 LUCENE-3892-BlockTermScorer.patch, LUCENE-3892-bulkVInt.patch, 
 LUCENE-3892-direct-IntBuffer.patch, LUCENE-3892_for_byte[].patch, 
 LUCENE-3892_for_int[].patch, LUCENE-3892-forpfor-with-javadoc.patch, 
 LUCENE-3892_for_unfold_method.patch, LUCENE-3892-handle_open_files.patch, 
 LUCENE-3892-javadocs.patch, LUCENE-3892-non-specialized.patch, 
 LUCENE-3892-pfor-compress-iterate-numbits.patch, 
 LUCENE-3892-pfor-compress-slow-estimate.patch, 
 LUCENE-3892_pfor_unfold_method.patch, LUCENE-3892_pulsing_support.patch, 
 LUCENE-3892_settings.patch, LUCENE-3892_settings.patch, 
 LUCENE-3892-trunk.patch


 On the flex branch we explored a number of possible intblock
 encodings, but for whatever reason never brought them to completion.
 There are still a number of issues opened with patches in different
 states.
 Initial results (based on prototype) were excellent (see
 http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html
 ).
 I think this would make a good GSoC project.




[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-08-20 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13437825#comment-13437825
 ] 

Michael McCandless commented on LUCENE-3892:


Woops sorry!




[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-08-15 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13435033#comment-13435033
 ] 

Michael McCandless commented on LUCENE-3892:


Uwe just started builds for this branch (thanks!): 
http://jenkins.sd-datasolutions.de/job/pforcodec-3892-branch




[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-08-13 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13433143#comment-13433143
 ] 

Adrien Grand commented on LUCENE-3892:
--

bq. (From mailing-list) So I think if it's this ambiguous for Wikipedia we 
should shoot for the most COMPACT form as a safe default.

+1 too. I just committed the change.




[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-08-10 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13432716#comment-13432716
 ] 

Adrien Grand commented on LUCENE-3892:
--

I ran the comparison between acceptableOverheadRatio=PackedInts.COMPACT (0%) 
and PackedInts.DEFAULT (20%) and it seems to be much faster with 
PackedInts.COMPACT:

{noformat}
base=COMPACT, challenger=DEFAULT
            Task  QPS base  StdDev base   QPS def  StdDev def      Pct diff
          IntNRQ     81.83         5.43     74.14        2.94   -18% -   0%
        HighTerm    146.55        10.34    133.57        9.02   -20% -   4%
       LowPhrase     93.91         1.63     86.90        1.67   -10% -  -4%
         MedTerm    824.58        43.48    766.35       38.78   -16% -   3%
 LowSloppyPhrase     83.29         1.99     77.65        1.18   -10% -  -3%
       OrHighMed     94.15         5.28     88.34        4.54   -15% -   4%
      OrHighHigh    100.63         5.42     94.57        4.20   -14% -   3%
       OrHighLow    128.62         7.21    120.92        6.07   -15% -   4%
      HighPhrase     13.05         0.45     12.29        0.39   -11% -   0%
         Prefix3    217.06         6.82    205.05        4.62   -10% -   0%
       MedPhrase     27.50         0.97     26.33        0.79   -10% -   2%
        Wildcard    183.20         4.87    175.58        3.89    -8% -   0%
         LowTerm   1763.31        43.24   1693.31       39.29    -8% -   0%
HighSloppyPhrase     10.05         0.48      9.67        0.40   -11% -   5%
     AndHighHigh    111.59         1.15    107.45        1.66    -6% -  -1%
     LowSpanNear     56.16         1.32     54.25        1.01    -7% -   0%
      AndHighMed    423.44         7.40    409.32        5.10    -6% -   0%
     MedSpanNear     33.14         0.91     32.32        0.74    -7% -   2%
      AndHighLow   2177.50        30.79   2134.05       28.64    -4% -   0%
          Fuzzy1     95.34         2.41     93.66        2.32    -6% -   3%
    HighSpanNear      5.28         0.17      5.21        0.11    -6% -   3%
 MedSloppyPhrase     18.41         0.72     18.19        0.70    -8% -   6%
          Fuzzy2     37.73         1.31     37.31        1.14    -7% -   5%
         Respell    109.71         3.09    108.64        2.76    -6% -   4%
        PKLookup    257.32         6.64    260.00        7.15    -4% -   6%
{noformat}
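
To make the two settings compared above concrete, the following is a simplified sketch of how an acceptable overhead ratio can widen the number of bits used per value. It is an assumption-laden illustration, not the logic of PackedInts.fastestFormatAndBits: it only considers rounding up to 8, 16, 32 or 64 bits, ignores PACKED_SINGLE_BLOCK and 24-bit packing, and the names OverheadSketch and chosenBits are hypothetical. The ratios 0 and 0.2 correspond to the COMPACT (0%) and DEFAULT (20%) figures quoted in the comment.

```java
// Sketch only: a simplified model of overhead-driven bit widening.
final class OverheadSketch {
    // Returns the bits per value used after rounding up to an aligned width
    // (8, 16, 32 or 64) when that stays within the allowed overhead.
    static int chosenBits(int bitsPerValue, float acceptableOverheadRatio) {
        int maxBits = bitsPerValue + (int) (bitsPerValue * acceptableOverheadRatio);
        for (int aligned : new int[] {8, 16, 32, 64}) {
            if (bitsPerValue <= aligned && aligned <= maxBits) {
                return aligned;
            }
        }
        return bitsPerValue; // no aligned width fits the budget
    }

    public static void main(String[] args) {
        for (int bpv : new int[] {3, 7, 14}) {
            System.out.println(bpv + " bits: ratio 0 -> " + chosenBits(bpv, 0f)
                + ", ratio 0.2 -> " + chosenBits(bpv, 0.2f));
        }
    }
}
```

With a ratio of 0 every value keeps its exact width, which gives the smallest postings; a non-zero ratio trades space for byte-aligned widths that are usually cheaper to decode.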




[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-08-09 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13431709#comment-13431709
 ] 

Adrien Grand commented on LUCENE-3892:
--

The comment you added in 1371011 on the value of {{BLOCK_SIZE}} caught my 
attention: I think that BLOCK_SIZE should be at least 64 with PackedInts 
encoding/decoding since these conversions are long-aligned (I backported your 
two commits and added a comment about this). For example, the {{PACKED}} 7-bit 
encoder cannot encode fewer than 64 values in one iteration.

In case someone really wants to use smaller block sizes (e.g. 32), I think 
it should still perform pretty well if {{acceptableOverheadRatio = ~25%}} (in 
that case, all bits-per-value in the [1-24] range use either a 
{{PACKED_SINGLE_BLOCK}} encoder or an 8-bit, 16-bit or 24-bit {{PACKED}} 
decoder).
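
The long-alignment argument can be checked with a little arithmetic. The sketch below is not Lucene code: PackedAlignment, valuesPerIteration and longsPerIteration are hypothetical names, and it assumes that a PACKED encoder processes, per iteration, the smallest whole number of 64-bit longs that holds a whole number of values.

```java
// Sketch only: computes the iteration granularity of a long-aligned packed encoding.
final class PackedAlignment {
    static int gcd(int a, int b) {
        return b == 0 ? a : gcd(b, a % b);
    }

    // Smallest number of values whose total bit count is a multiple of 64,
    // i.e. lcm(bitsPerValue, 64) / bitsPerValue.
    static int valuesPerIteration(int bitsPerValue) {
        return 64 / gcd(bitsPerValue, 64);
    }

    // Number of longs consumed by one such iteration,
    // i.e. lcm(bitsPerValue, 64) / 64.
    static int longsPerIteration(int bitsPerValue) {
        return bitsPerValue / gcd(bitsPerValue, 64);
    }

    public static void main(String[] args) {
        for (int bpv : new int[] {7, 8, 16, 24}) {
            System.out.println(bpv + " bits: " + valuesPerIteration(bpv)
                + " values in " + longsPerIteration(bpv) + " longs");
        }
    }
}
```

Under this model, 7 bits per value needs 64 values (7 longs) per iteration, whereas 8, 16 and 24 bits need only 8, 4 and 8 values, all of which divide a block size of 32.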

Do we plan to make the block size configurable?




[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-08-09 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13431733#comment-13431733
 ] 

Michael McCandless commented on LUCENE-3892:


Thanks Adrien.  So now we just have to replace Block with BlockPacked right?

OK let's just fix the comment to be multiple of 64.

I don't think we need to make BLOCK_SIZE configurable.




[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-08-09 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13431735#comment-13431735
 ] 

Adrien Grand commented on LUCENE-3892:
--

bq. So now we just have to replace Block with BlockPacked right?

Yes, I think so.

bq. I don't think we need to make BLOCK_SIZE configurable.

In that case, should we also hard-code the value of {{acceptableOverheadRatio}}?






[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-08-09 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13431742#comment-13431742
 ] 

Michael McCandless commented on LUCENE-3892:


Actually let's hold off a bit on replacing Block w/ BlockPacked: Billy was 
going to do some more tests with PFOR...

bq. In that case, should we also hard-code the value of acceptableOverheadRatio?

Hmm that one seems more compelling to let apps change?




[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-08-09 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13431753#comment-13431753
 ] 

Michael McCandless commented on LUCENE-3892:


Shouldn't MIN_ENCODED_SIZE be MAX_ENCODED_SIZE?  Ie the max number of
bytes encoding will ever require.  And I think the same for
MIN_DATA_SIZE -> MAX_DATA_SIZE?  Or maybe MIN_REQUIRED_XXX?

I think readVIntBlock shouldn't be in ForUtil?  Ie it's very
postings-format-specific and it's not using packed ints at all.  Also
the equivalent readVIntBlock code for the positions case (in the
readPositions methods) is still in the BlockPackedPostingsReader.  I
think it's great to have writeBlock/readBlock/skipBlock in ForUtil.

Do we really need to write/read the 32 format.getId(), numBits into
the postings file header?  I guess it's either that or ... store the float
acceptableOverheadRatio (eg using Float.floatToIntBits I guess) and
have some back-compat enforced in the logic in
PackedInts.fastestFormatAndBits... hmm.
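
For the alternative mentioned here, storing the float itself, the round trip through Float.floatToIntBits is lossless. The sketch below uses java.nio.ByteBuffer as a stand-in for the postings file header; RatioHeader and its methods are hypothetical names, and the real code would write through Lucene's own output classes instead.

```java
import java.nio.ByteBuffer;

// Sketch only: ByteBuffer stands in for the postings file header.
final class RatioHeader {
    // Encodes the ratio as the 4 bytes of its IEEE 754 bit pattern.
    static byte[] write(float acceptableOverheadRatio) {
        return ByteBuffer.allocate(4)
            .putInt(Float.floatToIntBits(acceptableOverheadRatio))
            .array();
    }

    // Decodes the ratio back from the 4-byte header.
    static float read(byte[] header) {
        return Float.intBitsToFloat(ByteBuffer.wrap(header).getInt());
    }

    public static void main(String[] args) {
        byte[] header = write(0.2f);
        System.out.println(header.length + " bytes, ratio read back: " + read(header));
    }
}
```

The cost of this approach is the one noted above: the mapping from ratio to format and bit width would then have to stay backward compatible.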

Hmm ... MIN_DATA_SIZE is 147 (PACKED_SINGLE_BLOCK, bpv=3), but
BLOCK_SIZE is 128 ... so I guess this means if we ever pick that
format (because acceptableOverheadRatio allowed us to), we're
encoding/decoding those extra 19 unused ints right?  (I was just
trying to understand why we alloc all the int[] to MIN_DATA_SIZE not
BLOCK_SIZE...).
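
The figures 147 and 19 follow from simple arithmetic, sketched below under the assumption that PACKED_SINGLE_BLOCK never lets a value straddle two 64-bit longs. SingleBlockPadding and its methods are hypothetical names used only for this illustration.

```java
// Sketch only: reproduces the 147 / 19 arithmetic for bpv=3, BLOCK_SIZE=128.
final class SingleBlockPadding {
    // Values that fit in one 64-bit long when values never straddle longs.
    static int valuesPerLong(int bitsPerValue) {
        return 64 / bitsPerValue;
    }

    // Value slots covered when blockSize values are stored in whole longs.
    static int decodedSlots(int blockSize, int bitsPerValue) {
        int perLong = valuesPerLong(bitsPerValue);
        int longs = (blockSize + perLong - 1) / perLong; // ceiling division
        return longs * perLong;
    }

    public static void main(String[] args) {
        int slots = decodedSlots(128, 3);
        System.out.println("slots=" + slots + ", unused=" + (slots - 128));
    }
}
```

With 3 bits per value, one long holds 21 values, so 128 values need 7 longs, which cover 147 slots and leave 19 unused.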

ForUtil.getMinRequiredBufferSize seems like dead code?





[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-08-09 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13431762#comment-13431762
 ] 

Han Jiang commented on LUCENE-3892:
---

Thank you Adrien! The BlockPacked PF also worked well on my computer :)
{noformat}
            Task  QPS base  StdDev base  QPS packed  StdDev packed      Pct diff
     AndHighHigh    122.57         3.01      123.90           2.49    -3% -   5%
      AndHighLow   2260.53        21.18     2273.77          55.09    -2% -   3%
      AndHighMed    328.01         8.18      329.31          11.36    -5% -   6%
          Fuzzy1     86.37         0.94       86.24           2.12    -3% -   3%
          Fuzzy2     31.40         0.46       31.22           0.64    -4% -   2%
      HighPhrase      9.09         0.51        9.15           0.40    -8% -  11%
HighSloppyPhrase      5.30         0.25        5.34           0.08    -5% -   7%
    HighSpanNear     10.11         0.44       10.42           0.34    -4% -  11%
        HighTerm    179.43         7.26      178.96           5.70    -7% -   7%
          IntNRQ     61.87         3.79       60.59           4.31   -14% -  11%
       LowPhrase     41.23         1.54       42.97           1.32    -2% -  11%
 LowSloppyPhrase     62.83         2.11       68.23           0.99     3% -  14%
     LowSpanNear     81.28         2.74       85.74           2.67    -1% -  12%
         LowTerm   1763.70        29.21     1778.41          23.07    -2% -   3%
       MedPhrase     27.06         1.16       27.54           0.88    -5% -   9%
 MedSloppyPhrase     31.82         1.16       33.70           0.14     1% -  10%
     MedSpanNear     23.09         0.93       23.84           0.79    -4% -  11%
         MedTerm    659.09        22.65      671.54          19.79    -4% -   8%
      OrHighHigh     27.36         0.52       27.41           1.25    -6% -   6%
       OrHighLow    154.99         2.07      156.20           7.08    -5% -   6%
       OrHighMed    105.13         1.52      105.30           4.65    -5% -   6%
        PKLookup    210.64         6.95      217.57           2.08     0% -   7%
         Prefix3    170.22         6.22      166.80           4.18    -7% -   4%
         Respell     83.96         1.47       83.75           1.25    -3% -   3%
        Wildcard    155.08         4.31      155.31           3.12    -4% -   5%
{noformat}




[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-08-09 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13431764#comment-13431764
 ] 

Michael McCandless commented on LUCENE-3892:


I think, for a fair test, we should also test w/ acceptableOverheadRatio=0 ... 
I'll run that.




[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-08-09 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13431767#comment-13431767
 ] 

Adrien Grand commented on LUCENE-3892:
--

bq. Shouldn't MIN_ENCODED_SIZE be MAX_ENCODED_SIZE?

I prefixed with MIN because it is the minimum size the encoded buffer must 
have to be able to handle all cases. But I think you are right, MAX or 
REQUIRED would be clearer.

bq. I think readVIntBlock shouldn't be in ForUtil?

I'll move it back to BlockPackedPostingsReader.


{quote} Do we really need to write/write the 32 format.getId(), numBits into
the postings file header? I guess it's either that or ... store the float
acceptableOverheadRatio (eg using Float.floatToIntBits I guess) and
have some back-compat enforced in the logic in
PackedInts.fastestFormatAndBits... hmm.{quote}

I hesitated between these two approaches but I think writing all cases to the 
header is less error-prone? Moreover it would allow us to change the logic of 
{{fastestFormatAndBits}} without having to bump the version number.
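
For illustration only, the alternative mentioned in the quote (storing the float acceptableOverheadRatio itself via Float.floatToIntBits) could look roughly like the sketch below. It uses a plain java.nio.ByteBuffer instead of Lucene's output classes, and the class and method names are hypothetical, not taken from the patch.

```java
import java.nio.ByteBuffer;

// Hypothetical helper: round-trips a float overhead ratio through a 4-byte header field.
final class OverheadRatioHeader {

  // Encode the ratio as its raw IEEE 754 bit pattern (4 bytes, big-endian).
  static byte[] encode(float acceptableOverheadRatio) {
    return ByteBuffer.allocate(Integer.BYTES)
        .putInt(Float.floatToIntBits(acceptableOverheadRatio))
        .array();
  }

  // Decode the 4-byte field back into the exact same float value.
  static float decode(byte[] header) {
    return Float.intBitsToFloat(ByteBuffer.wrap(header).getInt());
  }
}
```

The round trip is bit-exact, but a reader would then have to re-run the same format selection logic the writer used, which is the back-compat concern raised above.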

{quote} Hmm ... MIN_DATA_SIZE is 147 (PACKED_SINGLE_BLOCK, bpv=3), but
BLOCK_SIZE is 128 ... so I guess this means if we ever pick that
format (because acceptableOverheadRatio allowed us to), we're
encoding/decoding those extra 19 unused ints right? (I was just
trying to understand why we alloc all the int[] to MIN_DATA_SIZE not
BLOCK_SIZE...).{quote}

Exactly. The other problem is that we are also storing these unnecessary 19 
values (but it is not easy to fix since PACKED_SINGLE_BLOCK writes values in 
the low-order long bits first (little endian)). Maybe we should make 
PACKED_SINGLE_BLOCK write values in the high-order bits first and split byte 
encoders and decoders from the long ones (so that they have a lower 
{{valueCount()}}). 
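
The 147 vs. 128 figures can be reproduced with a little arithmetic, assuming a single-block layout packs floor(64 / bpv) values into each long and never splits a value across longs. The class below is only a sketch of that calculation with made-up names, not code from PackedInts.

```java
// Hypothetical sketch: how many values end up being decoded when a block is
// stored in whole longs that each hold floor(64 / bitsPerValue) values.
final class SingleBlockPadding {

  static int valuesPerLong(int bitsPerValue) {
    return 64 / bitsPerValue;
  }

  static int paddedValueCount(int blockSize, int bitsPerValue) {
    int perLong = valuesPerLong(bitsPerValue);
    int longs = (blockSize + perLong - 1) / perLong; // round up to whole longs
    return longs * perLong;
  }

  public static void main(String[] args) {
    int padded = paddedValueCount(128, 3);
    System.out.println(padded + " values decoded, " + (padded - 128) + " unused");
  }
}
```

With bpv=3 each long holds 21 values, so 128 values need 7 longs, which decode to 7 * 21 = 147 values, i.e. 19 more than BLOCK_SIZE.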

bq. ForUtil.getMinRequiredBufferSize seems like dead code?

I'll remove it.




[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-08-09 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13431882#comment-13431882
 ] 

Han Jiang commented on LUCENE-3892:
---

I revived the PFor code and tested it against BlockFor and BlockPacked:

BlockFor as base:
{noformat}
                Task    QPS base  StdDev base    QPS pfor  StdDev pfor       Pct diff
         AndHighHigh      121.54         1.37      116.69         2.03     -6% -  -1%
          AndHighLow     2286.36        14.19     2212.92        11.48     -4% -  -2%
          AndHighMed      322.97         7.37      294.19         4.76    -12% -  -5%
              Fuzzy1       85.56         1.46       87.97         3.27     -2% -   8%
              Fuzzy2       30.94         0.56       32.16         1.34     -2% -  10%
          HighPhrase        9.39         0.38        9.02         0.45    -12% -   5%
    HighSloppyPhrase        5.38         0.08        5.24         0.12     -6% -   1%
        HighSpanNear       10.38         0.39        9.92         0.08     -8% -   0%
            HighTerm      180.30         6.87      172.83         6.26    -11% -   3%
              IntNRQ       62.01         3.73       60.89         3.54    -12% -  10%
           LowPhrase       42.44         0.67       38.73         0.89    -12% -  -5%
     LowSloppyPhrase       62.82         0.79       56.79         0.43    -11% -  -7%
         LowSpanNear       81.79         2.00       74.10         1.13    -12% -  -5%
             LowTerm     1763.95        39.62     1721.30        34.22     -6% -   1%
           MedPhrase       27.87         0.59       25.82         0.74    -11% -  -2%
     MedSloppyPhrase       32.15         0.41       29.91         0.31     -9% -  -4%
         MedSpanNear       23.48         0.71       22.00         0.05     -9% -  -3%
             MedTerm      662.11        24.22      638.81        19.31     -9% -   3%
          OrHighHigh       26.82         0.47       27.14         1.93     -7% -  10%
           OrHighLow      152.40         3.54      156.58        11.11     -6% -  12%
           OrHighMed      103.20         2.26      105.84         7.55     -6% -  12%
            PKLookup      216.38         4.32      219.32         2.59     -1% -   4%
             Prefix3      169.89         4.97      163.82         3.34     -8% -   1%
             Respell       83.23         1.44       86.20         3.00     -1% -   9%
            Wildcard      155.81         2.79      152.30         2.54     -5% -   1%
{noformat}

BlockPacked as base:
{noformat}
                Task    QPS base  StdDev base    QPS pfor  StdDev pfor       Pct diff
         AndHighHigh      122.94         3.43      116.24         1.90     -9% -  -1%
          AndHighLow     2294.32        58.32     2199.14        31.97     -7% -   0%
          AndHighMed      325.55        12.44      290.20         3.80    -15% -  -6%
              Fuzzy1       88.33         1.84       87.86         2.54     -5% -   4%
              Fuzzy2       31.92         0.80       32.00         0.92     -5% -   5%
          HighPhrase        9.73         0.47        9.04         0.29    -14% -   0%
    HighSloppyPhrase        5.49         0.19        5.16         0.03     -9% -  -1%
        HighSpanNear       10.93         0.23        9.90         0.09    -12% -  -6%
            HighTerm      178.31         6.37      171.06         6.14    -10% -   3%
              IntNRQ       60.87         4.71       62.38         5.49    -13% -  20%
           LowPhrase       44.97         1.18       38.36         1.01    -19% - -10%
     LowSloppyPhrase       69.61         1.19       55.90         1.39    -23% - -16%
         LowSpanNear       88.50         0.66       72.80         2.23    -20% - -14%
             LowTerm     1769.84        32.66     1717.02        39.75     -6% -   1%
           MedPhrase       28.88         0.84       25.57         0.68    -16% -  -6%
     MedSloppyPhrase       34.47         0.50       29.29         0.54    -17% - -12%
         MedSpanNear       24.88         0.32       21.69         0.38    -15% - -10%
             MedTerm      667.95        21.61      633.73        22.17    -11% -   1%
          OrHighHigh       27.96         1.29       26.82         0.81    -11% -   3%
           OrHighLow      158.62         5.82      155.08         5.05     -8% -   4%
           OrHighMed      107.16         4.19      104.81         3.17     -8% -   4%
            PKLookup      217.22         1.86      216.83         1.87     -1% -   1%
             Prefix3      167.32         6.72      166.12         6.53     -8% -   7%
             Respell       85.25         2.27       85.85         2.16     -4% -   6%
            Wildcard      156.24         5.69      154.63         3.02     -6% -   4%
{noformat}

The current PFor impl only saves 1.8% against For, but takes quite a perf hit. 
Let's use the Packed version!


[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-08-09 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13431914#comment-13431914
 ] 

Michael McCandless commented on LUCENE-3892:


I compared Block w/ BlockPacked, but set acceptableOverheadRatio to 0 for a 
fairer test:

{noformat}
                Task    QPS base  StdDev base    QPS pack  StdDev pack       Pct diff
    HighSloppyPhrase        1.94         0.01        1.91         0.05     -4% -   2%
           LowPhrase       21.05         0.07       20.84         0.37     -3% -   1%
           MedPhrase       13.05         0.04       12.93         0.23     -3% -   1%
            Wildcard       43.87         2.76       43.49         2.10    -11% -  10%
              IntNRQ        8.88         1.39        8.83         0.78    -21% -  28%
              Fuzzy1       63.07         1.96       62.78         1.46     -5% -   5%
     LowSloppyPhrase        6.92         0.01        6.91         0.13     -2% -   1%
             Prefix3       71.38         5.20       71.35         3.17    -10% -  12%
            PKLookup      157.00         1.78      158.01         2.01     -1% -   3%
          AndHighLow      668.76         4.82      674.80         7.48      0% -   2%
          HighPhrase        1.56         0.03        1.58         0.03     -3% -   5%
     MedSloppyPhrase        7.71         0.03        7.80         0.11      0% -   2%
          AndHighMed       74.05         0.49       75.35         0.36      0% -   2%
         AndHighHigh       25.92         0.30       26.78         0.19      1% -   5%
             Respell       57.07         2.70       59.20         1.80     -3% -  12%
              Fuzzy2       60.81         2.92       63.32         1.68     -3% -  12%
          OrHighHigh        8.99         0.17        9.39         0.11      1% -   7%
           OrHighMed       17.65         0.37       18.52         0.13      2% -   7%
         MedSpanNear        3.90         0.17        4.11         0.09     -1% -  12%
           OrHighLow       22.99         0.51       24.22         0.15      2% -   8%
        HighSpanNear        1.40         0.06        1.48         0.03      0% -  12%
         LowSpanNear        7.84         0.31        8.32         0.17      0% -  12%
             LowTerm      406.02        28.53      444.21        37.75     -6% -  27%
             MedTerm      149.83         8.11      167.60        15.06     -3% -  28%
            HighTerm       29.57         1.67       33.42         3.20     -3% -  31%
{noformat}

Curiously it seems even faster than w/ acceptableOverheadRatio=0.2!  But it 
makes it clear we should do a hard cutover.




[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-08-09 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13431916#comment-13431916
 ] 

Michael McCandless commented on LUCENE-3892:


bq. I revived the PFor code and tested it against BlockFor and BlockPacked

Thanks Billy, I'll run a test too ...




[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-08-09 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13431951#comment-13431951
 ] 

Adrien Grand commented on LUCENE-3892:
--

bq. Curiously it seems even faster than w/ acceptableOverheadRatio=0.2! But it 
makes it clear we should do a hard cutover.

I had been doing some tests with the bulk version of PackedInts.get (which uses 
the same methods that we use for BlockPacked) while working on LUCENE-4098 and 
it seemed that the bottleneck was more memory bandwidth than CPU (for large 
arrays at least). If you look at the last graph of 
http://people.apache.org/~jpountz/packed_ints3.html, the throughput seems to 
depend more on the memory efficiency of the picked impl than on the way it 
stores data. Maybe we are experiencing a similar phenomenon here...

Unless I am missing something, the only difference between BlockPacked and 
Block is that BlockPacked decodes directly from byte[] whereas Block uses 
ByteBuffer.asLongBuffer to translate from bytes to ints and then decodes from 
the ints... Interesting to know it has so much overhead...
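
To make the comparison concrete, here is a minimal sketch of the two read paths described above, written against plain java.nio rather than the actual codec classes: one goes through a LongBuffer view, the other assembles longs straight from the byte[]. The class name is made up, and big-endian byte order is an assumption (it is the ByteBuffer default).

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of two ways to turn encoded bytes into longs before unpacking.
final class LongDecodePaths {

  // Path 1: wrap the bytes and bulk-copy through a LongBuffer view.
  static long[] viaLongBuffer(byte[] bytes) {
    long[] out = new long[bytes.length / 8];
    ByteBuffer.wrap(bytes).asLongBuffer().get(out);
    return out;
  }

  // Path 2: read directly from the byte[], most significant byte first.
  static long[] direct(byte[] bytes) {
    long[] out = new long[bytes.length / 8];
    int offset = 0;
    for (int i = 0; i < out.length; i++) {
      long value = 0;
      for (int b = 0; b < 8; b++) {
        value = (value << 8) | (bytes[offset++] & 0xFFL);
      }
      out[i] = value;
    }
    return out;
  }
}
```

Both paths produce the same longs; which one is faster in practice depends on the JVM and the access pattern, so this says nothing by itself about the overhead measured in the benchmark.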




[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-08-09 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13431983#comment-13431983
 ] 

Michael McCandless commented on LUCENE-3892:


OK indeed PFOR is slower for me too:

{noformat}
                Task    QPS base  StdDev base    QPS pfor  StdDev pfor       Pct diff
          HighPhrase        1.56         0.03        1.25         0.12    -28% - -10%
           MedPhrase       13.05         0.10       10.50         0.58    -24% - -14%
           LowPhrase       21.08         0.08       17.35         0.85    -22% - -13%
          AndHighMed       73.78         0.66       62.50         1.68    -18% - -12%
          AndHighLow      674.60         2.54      573.00        12.06    -17% - -12%
         LowSpanNear        8.04         0.17        6.97         0.23    -17% -  -8%
         MedSpanNear        3.97         0.10        3.58         0.15    -15% -  -3%
     MedSloppyPhrase        7.58         0.11        6.93         0.14    -11% -  -5%
         AndHighHigh       25.71         0.47       23.58         0.61    -12% -  -4%
        HighSpanNear        1.42         0.04        1.31         0.05    -12% -  -1%
             MedTerm      155.44        18.75      144.46        12.33    -24% -  14%
            HighTerm       30.27         4.31       28.25         2.88    -26% -  19%
     LowSloppyPhrase        6.73         0.13        6.28         0.12    -10% -  -3%
          OrHighHigh        9.06         0.24        8.53         0.33    -11% -   0%
           OrHighLow       23.09         0.67       21.88         0.91    -11% -   1%
           OrHighMed       17.71         0.51       16.79         0.67    -11% -   1%
    HighSloppyPhrase        1.88         0.05        1.80         0.04     -9% -   0%
              IntNRQ        9.42         0.50        9.05         0.89    -17% -  11%
             Prefix3       72.67         2.42       70.42         3.61    -11% -   5%
              Fuzzy1       63.71         1.07       62.34         1.55     -6% -   1%
            Wildcard       45.25         0.99       44.28         1.55     -7% -   3%
            PKLookup      159.04         2.13      157.17         1.90     -3% -   1%
              Fuzzy2       62.51         2.28       63.40         1.65     -4% -   8%
             LowTerm      400.06        57.60      407.73        52.40    -22% -  34%
             Respell       56.72         3.19       59.83         2.10     -3% -  15%
{noformat}

I think we should replace Block with BlockPacked now?




[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-08-09 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13431985#comment-13431985
 ] 

Michael McCandless commented on LUCENE-3892:


bq. I had been doing some tests with the bulk version of PackedInts.get (which 
uses the same methods that we use for BlockPacked) while working on LUCENE-4098 
and it seemed that the bottleneck was more memory bandwidth than CPU (for large 
arrays at least). 

Ahh, interesting...

So I think we should test different acceptableOverheadRatios to find the best 
... it could be it's 0!




[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-08-09 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13431991#comment-13431991
 ] 

Michael McCandless commented on LUCENE-3892:


{quote}
bq. Do we really need to write/write the 32 format.getId(), numBits into the 
postings file header? I guess it's either that or ... store the float 
acceptableOverheadRatio (eg using Float.floatToIntBits I guess) and have some 
back-compat enforced in the logic in PackedInts.fastestFormatAndBits... hmm.

I hesitated between these two approaches but I think writing all cases to the 
header is less error-prone? Moreover it would allow us to change the logic of 
fastestFormatAndBits without having to bump the version number.
{quote}

Maybe for starters we should just hardwire acceptableOverheadRatio at
0 ... then we simplify this back-compat until/unless we really need to
make this configurable.
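
As a rough mental model of what the ratio controls (an assumption for illustration, not the actual PackedInts.fastestFormatAndBits logic, which considers more formats): the ratio bounds how many extra bits per value may be spent to reach a width that is cheaper to decode. The names below are hypothetical.

```java
// Simplified, hypothetical model: allow up to bitsRequired * ratio extra bits
// per value if that reaches a byte-aligned width; otherwise stay compact.
final class OverheadRatioSketch {

  static int pickBitsPerValue(int bitsRequired, float acceptableOverheadRatio) {
    int maxBits = bitsRequired + (int) (bitsRequired * acceptableOverheadRatio);
    for (int aligned : new int[] {8, 16, 32, 64}) {
      if (bitsRequired <= aligned && aligned <= maxBits) {
        return aligned; // an aligned width fits inside the allowed overhead
      }
    }
    return bitsRequired; // no aligned width is affordable, keep the exact width
  }
}
```

Under this model a ratio of 0 always keeps the exact bit width, which is why hardwiring it to 0 removes the need to record the choice made per bit width.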





[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-08-09 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13432001#comment-13432001
 ] 

Michael McCandless commented on LUCENE-3892:


bq. The other problem is that we are also storing these unnecessary 19 values 
(but it is not easy to fix since PACKED_SINGLE_BLOCK writes values in the 
low-order long bits first (little endian)). Maybe we should make 
PACKED_SINGLE_BLOCK write values in the high-order bits first and split byte 
encoders and decoders from the long ones (so that they have a lower 
valueCount()).

OK, we can explore that later (another reason to simply always use 
Format.PACKED for now...).




[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-08-09 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13432009#comment-13432009
 ] 

Robert Muir commented on LUCENE-3892:
-

{quote}
OK indeed PFOR is slower for me too:
{quote}

I think, for starters, since you guys have gotten FOR working pretty nicely, we 
should just focus on that one?

We could later see if PFOR could get additional wins as a second step: getting 
FOR working nice and fast
is awesome on its own!




[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-08-09 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13432235#comment-13432235
 ] 

Michael McCandless commented on LUCENE-3892:


bq. I think for starters since you guys have gotten FOR pretty nice we should 
just focus on that one?

Yeah I think we should do that.  I think the branch is nearly ready to land!

I just replaced Block with BlockPacked ...




[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-08-08 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13431117#comment-13431117
 ] 

Han Jiang commented on LUCENE-3892:
---

And here are the results for skipMultiplier, using the current value of 8 as the baseline: 
http://pastebin.com/TG4C6u6S
Somewhat noisy, but OR-queries benefit a little when skipMultiplier=32.




[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-08-08 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13431193#comment-13431193
 ] 

Robert Muir commented on LUCENE-3892:
-

{quote}
So ... most of the gains come from BlockPF cutover. This is sort of
... surprising/disappointing, ie, our bottlenecks are the abstraction
layers, not the actual decode cost. Still it's good to make progress
on removing the abstractions.
{quote}

I don't think it's that disappointing. This isn't a very interesting
benchmark for a compression algorithm like FOR: instead imagine the
very common case of apps today indexing small fields like product names,
restaurant names, or something like that. Freqs are nearly always 1,
and positions are tiny, but often people still want the ability to
use things like phrase queries. And imagine cases where people
are indexing data from a database and there are only a few unique
values (e.g. product type = tshirt, pants, shoes) in a field. 

I think the wikipedia benchmark doesn't do a very good job of illustrating 
performance on use-cases like this, which I think are common and also
where I'm fairly positive FOR will be a win. 
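
To make the small-values argument concrete, here is a minimal sketch of the frame-of-reference (FOR) size calculation. It is not the code from the branch: the class name, the one-byte header, and the block size of 128 used in the note below are assumptions chosen for illustration.

```java
// Illustrative sketch only; not the Lucene block postings format code.
final class ForSketch {

    // Smallest bit width that can represent every value in the block.
    // Values are assumed to be non-negative (doc deltas, freqs, position deltas).
    static int bitsRequired(int[] block) {
        int or = 0;
        for (int v : block) {
            or |= v; // the OR of all values has the same highest set bit as the maximum
        }
        return or == 0 ? 0 : 32 - Integer.numberOfLeadingZeros(or);
    }

    // Bytes needed to store the block at that fixed width,
    // plus one (assumed) header byte that records the width.
    static int packedSizeInBytes(int[] block) {
        long bits = (long) bitsRequired(block) * block.length;
        return 1 + (int) ((bits + 7) / 8);
    }
}
```

Under these assumptions, a block of 128 freqs that are all 1 needs 1 bit per value, so 17 bytes in total, where a byte-per-value encoding such as vInt would need 128 bytes.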

It's nice that it's not slower or too much bigger in the worst case
of large docs where the numbers aren't so tiny?

{quote}
Also, it looks like the only query that is slower than Lucene40 is
AndHighLow ... however, it's also an extremely fast query to begin
with so I think it's a fine tradeoff that it gets slower while the
hard/slower queries get faster.
{quote}

+1, let's not even think twice about that one.






[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-08-08 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13431324#comment-13431324
 ] 

Adrien Grand commented on LUCENE-3892:
--

I did some changes to the {{BlockPacked}} codec:
 - encoding and decoding using int[] instead of long[]
 - selection of the format based on a configurable overhead ratio.
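
A rough sketch of what selection by overhead ratio could look like, assuming the idea is to round a bit width up to a byte-aligned one whenever the wasted space stays within a budget. The class, method, and the set of "aligned" widths are hypothetical and are not taken from the branch.

```java
// Illustrative sketch only; names and the list of aligned widths are assumptions.
final class OverheadRatioSketch {

    // Widths assumed to decode fastest because they line up with byte/short/int boundaries.
    private static final int[] ALIGNED_WIDTHS = {8, 16, 32};

    // Returns the bit width to use for a block that needs at least bitsRequired bits
    // per value, allowing up to acceptableOverheadRatio extra space (0.2f means 20%).
    static int chooseBitsPerValue(int bitsRequired, float acceptableOverheadRatio) {
        if (bitsRequired < 1 || bitsRequired > 32) {
            throw new IllegalArgumentException("bitsRequired must be in [1, 32]");
        }
        int maxBits = bitsRequired + (int) (bitsRequired * acceptableOverheadRatio);
        for (int width : ALIGNED_WIDTHS) {
            if (bitsRequired <= width && width <= maxBits) {
                return width; // an aligned width fits inside the space budget
            }
        }
        return bitsRequired; // otherwise stay with the compact width
    }
}
```

With a 20% budget, 7 bits would be widened to 8 and 14 bits to 16, while 5 bits would stay at 5 because 8 bits would waste too much space.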

The results are encouraging:
{noformat}
            Task  QPS 3892  StdDev 3892  QPS 3892-packed  StdDev 3892-packed     Pct diff
        PKLookup    256.93         8.89           256.85                7.47   -6% -   6%
       OrHighLow    145.14         9.86           145.14                9.35  -12% -  14%
         Respell    110.26         1.84           110.27                2.01   -3% -   3%
     AndHighHigh    112.97         0.81           113.19                2.17   -2% -   2%
          Fuzzy1    102.15         1.47           102.86                3.13   -3% -   5%
      OrHighHigh     94.56         6.56            95.43                6.35  -11% -  15%
          Fuzzy2     42.49         0.77            42.89                1.43   -4% -   6%
       OrHighMed    175.30        11.34           177.42               10.83  -10% -  14%
      AndHighLow   1925.02        23.92          1952.57               48.68   -2% -   5%
      HighPhrase      8.96         0.41             9.11                0.46   -7% -  11%
        Wildcard    189.79         2.13           193.12                1.57    0% -   3%
    HighSpanNear      6.47         0.15             6.59                0.25   -4% -   8%
         Prefix3    256.67         2.58           262.40                2.84    0% -   4%
         LowTerm   1746.52        52.80          1789.54               54.30   -3% -   8%
        HighTerm    238.70        13.46           245.63               16.60   -9% -  16%
         MedTerm    923.64        38.19           951.18               46.85   -5% -  12%
      AndHighMed    364.46         3.65           377.09               10.03    0% -   7%
          IntNRQ     56.58         1.02            58.84                0.80    0% -   7%
HighSloppyPhrase     11.73         0.30            12.40                0.62   -2% -  13%
     LowSpanNear     29.64         0.96            32.44                0.98    2% -  16%
     MedSpanNear     22.96         0.72            25.16                0.85    2% -  16%
       MedPhrase     40.99         1.25            45.09                1.24    3% -  16%
 LowSloppyPhrase     37.88         0.99            41.98                1.49    4% -  17%
       LowPhrase     64.40         2.04            71.84                1.41    5% -  17%
 MedSloppyPhrase     42.29         1.16            47.32                1.54    5% -  18%
{noformat}

I hope this will be confirmed on your computers this time. :-)




[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-08-08 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13431485#comment-13431485
 ] 

Michael McCandless commented on LUCENE-3892:


I also see (smaller) gains with BlockPacked vs Block (this is 10M doc index):
{noformat}
            Task  QPS base  StdDev base  QPS packed  StdDev packed     Pct diff
      AndHighMed     69.19         0.53       66.43           0.63   -5% -  -2%
          Fuzzy2     63.71         1.24       62.25           1.58   -6% -   2%
         Respell     62.69         1.41       61.53           1.47   -6% -   2%
          IntNRQ     11.86         0.43       11.73           0.03   -4% -   2%
          Fuzzy1     75.48         1.21       75.05           1.52   -4% -   3%
        Wildcard     53.23         0.63       52.96           0.25   -2% -   1%
     MedSpanNear      4.88         0.16        4.88           0.11   -5% -   5%
        PKLookup    191.48         2.84      191.62           3.98   -3% -   3%
        HighTerm     35.71         0.63       35.91           0.06   -1% -   2%
         Prefix3     83.14         1.34       83.83           0.49   -1% -   3%
         LowTerm    513.35         0.77      517.92           1.50    0% -   1%
    HighSpanNear      1.70         0.06        1.71           0.03   -4% -   6%
     AndHighHigh     23.45         0.09       23.69           0.10    0% -   1%
       OrHighLow     27.27         1.06       27.59           0.15   -3% -   5%
       OrHighMed     23.61         0.92       23.89           0.17   -3% -   6%
      OrHighHigh     11.42         0.44       11.59           0.12   -3% -   6%
 MedSloppyPhrase      6.84         0.17        6.95           0.23   -4% -   7%
       LowPhrase     22.02         0.39       22.43           0.15    0% -   4%
         MedTerm    196.76         3.01      200.62           0.33    0% -   3%
     LowSpanNear      9.60         0.24        9.82           0.31   -3% -   8%
       MedPhrase     13.08         0.30       13.41           0.12    0% -   5%
 LowSloppyPhrase      7.55         0.21        7.77           0.27   -3% -   9%
      AndHighLow    649.84        18.26      669.08           6.63    0% -   6%
HighSloppyPhrase      1.98         0.08        2.04           0.09   -4% -  12%
      HighPhrase      1.76         0.11        1.96           0.10    0% -  24%
{noformat}

The index is 4669 MB with Block and 4790 with BlockPacked = ~2.6%
larger ... seems worth it!  Apps can always tune the 20% too.
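
For reference, the ~2.6% figure follows directly from the two index sizes quoted above. A tiny sketch of the arithmetic (the class and method names are only for illustration):

```java
// Illustrative arithmetic only: relative growth of the index size.
final class IndexSizeDelta {

    // Percentage by which newSizeMb exceeds baseSizeMb.
    static double percentLarger(double baseSizeMb, double newSizeMb) {
        return 100.0 * (newSizeMb - baseSizeMb) / baseSizeMb;
    }

    public static void main(String[] args) {
        // 4669 MB with Block vs 4790 MB with BlockPacked: 121 / 4669, roughly 2.6%
        System.out.printf("%.1f%%%n", percentLarger(4669, 4790));
    }
}
```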





[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-08-08 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13431498#comment-13431498
 ] 

Adrien Grand commented on LUCENE-3892:
--

Thanks Mike for your tests. Do you think {{BlockPacked}} is now fast enough to 
replace {{Block}} with {{BlockPacked}}? I am asking because it is a little 
painful to always have to backport changes from one format to the other.




[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-08-08 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13431504#comment-13431504
 ] 

Michael McCandless commented on LUCENE-3892:


Yes I think we should do a hard cutover now?  Ie, merge any final changes 
(sorry for all the commits!  we are nearly ready to land I think...) over to 
BlockPacked, then remove Block and rename BlockPacked to Block?




[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-08-08 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13431506#comment-13431506
 ] 

Adrien Grand commented on LUCENE-3892:
--

Sounds good. I think the only commits that have not been merged yet are 1371010 
and 1371011.




[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-08-08 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13431511#comment-13431511
 ] 

Michael McCandless commented on LUCENE-3892:


OK I'll merge and replace Block w/ BlockPacked... likely sometime tomorrow.  
Thanks Adrien!




[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-08-07 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13430373#comment-13430373
 ] 

Adrien Grand commented on LUCENE-3892:
--

I backported Mike's changes to the {{BlockPacked}} codec and tried to 
understand why it was slower than {{Block}}...

The use of {{java.nio.*Buffer}} seemed to be the bottleneck 
({{ByteBuffer.asLongBuffer}} and {{ByteBuffer.getLong}} especially are _very_ 
slow) of the decoding step so I switched back to decoding from long[] (instead 
of LongBuffer) and added direct decoding from byte[] to avoid having to convert 
the bytes to longs before decoding.
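
As an illustration of decoding straight from a byte[] without going through java.nio buffers, here is a generic, non-specialized unpacking loop. It is only a sketch under assumptions: the MSB-first bit order and the class name are mine, and the branch may well use unrolled per-bit-width decoders rather than a loop like this.

```java
// Illustrative sketch only; a generic loop, not the decoders from the branch.
final class ByteArrayUnpackSketch {

    // Decodes count values of bitsPerValue bits each (1..32) that were packed
    // back to back, most significant bit first, into the packed array.
    static int[] decode(byte[] packed, int bitsPerValue, int count) {
        int[] out = new int[count];
        long buffer = 0;        // low bits hold the bits read but not yet consumed
        int bitsInBuffer = 0;   // number of unconsumed bits in buffer
        int pos = 0;            // next byte to read from packed
        long mask = (1L << bitsPerValue) - 1;
        for (int i = 0; i < count; i++) {
            // Refill one byte at a time until a whole value is available.
            while (bitsInBuffer < bitsPerValue) {
                buffer = (buffer << 8) | (packed[pos++] & 0xFFL);
                bitsInBuffer += 8;
            }
            bitsInBuffer -= bitsPerValue;
            out[i] = (int) ((buffer >>> bitsInBuffer) & mask);
        }
        return out;
    }
}
```

For example, the single byte 0b10110100 decoded at 2 bits per value yields 2, 3, 1, 0.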

Tests passed with -Dtests.postingsformat=BlockPacked. Here are the results of 
the benchmark (unfortunately, it started before Mike committed r1370179):

{noformat}
            Task  QPS 3892  StdDev 3892  QPS 3892-packed  StdDev 3892-packed     Pct diff
        PKLookup    259.41         9.06           255.77                8.89   -8% -   5%
      AndHighLow   1656.30        50.44          1653.85               55.05   -6% -   6%
     AndHighHigh     82.90         1.82            83.47                2.52   -4% -   6%
      AndHighMed    274.76        11.11           278.51               13.42   -7% -  10%
         Prefix3    285.41         4.82           289.60                6.31   -2% -   5%
        HighTerm    230.78        14.33           235.16               20.61  -12% -  18%
          IntNRQ     55.91         1.03            57.13                2.73   -4% -   9%
         LowTerm   1720.10        47.06          1759.16               55.47   -3% -   8%
        Wildcard    290.54         3.82           297.39                5.42    0% -   5%
         MedTerm    733.01        35.38           750.46               50.37   -8% -  14%
    HighSpanNear      6.93         0.23             7.12                0.39   -6% -  11%
      HighPhrase      6.46         0.22             6.65                0.46   -7% -  14%
         Respell     96.11         2.84            99.00                3.98   -3% -  10%
      OrHighHigh     38.07         2.53            39.23                3.06  -10% -  19%
          Fuzzy2     50.29         1.70            51.87                2.25   -4% -  11%
       MedPhrase     26.20         0.94            27.03                1.07   -4% -  11%
       OrHighMed    138.83         7.76           143.54                9.79   -8% -  16%
          Fuzzy1    100.58         2.15           104.21                3.99   -2% -   9%
HighSloppyPhrase      5.26         0.11             5.45                0.24   -3% -  10%
       OrHighLow     78.43         5.55            81.80                6.89  -10% -  21%
     MedSpanNear     32.75         1.13            34.28                1.73   -3% -  13%
       LowPhrase     90.27         3.20            95.06                3.58   -2% -  13%
     LowSpanNear     46.40         1.95            48.89                2.40   -3% -  15%
 MedSloppyPhrase     36.29         1.00            38.59                1.46    0% -  13%
 LowSloppyPhrase     37.41         1.11            40.48                1.39    1% -  15%
{noformat}

Mike, Billy, could you check that {{BlockPacked}} is at least as fast as 
{{Block}} on your computers too?


[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-08-07 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13430423#comment-13430423
 ] 

Han Jiang commented on LUCENE-3892:
---

Thanks Adrien! Your code is really clean!

At first glance, I think we should still support the all-values-the-same case. 
For some applications (like an index with payloads), that might be helpful.

Also, I'm a little confused about your performance test. Did you use BlockPF 
from before r1370179 as the baseline and compare it with your latest commit? 
Here, I tested these two PFs at the latest revision (r1370345).

{noformat}
Task                     QPS base    StdDev base       QPS comp    StdDev comp      Pct diff
AndHighHigh                124.53           9.36         100.46           3.31   -27% -  -9%
AndHighLow                2141.08          63.93        1922.73          36.32   -14% -  -5%
AndHighMed                 281.48          36.49         218.68          13.10   -35% -  -5%
Fuzzy1                      84.33           2.56          83.94           1.67    -5% -   4%
Fuzzy2                      30.49           1.13          30.48           0.71    -5% -   6%
HighPhrase                   9.08           0.28           7.56           0.20   -21% - -11%
HighSloppyPhrase             5.46           0.21           4.88           0.23   -17% -  -2%
HighSpanNear                10.12           0.21           9.21           0.30   -13% -  -3%
HighTerm                   176.52           6.13         146.13           5.43   -22% - -11%
IntNRQ                      59.56           1.98          51.05           1.33   -19% -  -9%
LowPhrase                   40.02           1.03          32.75           0.37   -21% - -15%
LowSloppyPhrase             59.59           2.85          51.49           1.33   -19% -  -6%
LowSpanNear                 73.86           3.17          61.98           1.45   -21% - -10%
LowTerm                   1755.38          15.56        1622.61          26.87    -9% -  -5%
MedPhrase                   25.99           0.47          21.01           0.17   -21% - -16%
MedSloppyPhrase             30.52           0.89          24.77           0.55   -22% - -14%
MedSpanNear                 22.26           0.43          18.73           0.47   -19% - -12%
MedTerm                    651.90          18.97         573.34          19.25   -17% -  -6%
OrHighHigh                  26.75           0.33          23.53           0.50   -14% -  -9%
OrHighLow                  151.69           2.13         134.17           3.19   -14% -  -8%
OrHighMed                  102.48           1.48          90.73           2.01   -14% -  -8%
PKLookup                   216.59           5.70         215.99           2.99    -4% -   3%
Prefix3                    166.00           0.78         145.25           1.29   -13% - -11%
Respell                     82.01           3.01          82.80           1.66    -4% -   6%
Wildcard                   151.66           2.22         141.14           1.57    -9% -  -4%
{noformat}

Strange that it isn't working well on my computer. And results are similar when 
I change MMapDirectory to NIOFSDirectory.
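
A note for anyone reading these tables: the "Pct diff" column is a range, not a single number. The figures above appear consistent with comparing the pessimistic case (comp QPS minus its StdDev, against base QPS plus its StdDev) and the optimistic case (the reverse), each truncated toward zero. The sketch below reproduces the AndHighHigh row under that assumption. The class and method names are made up for illustration, and the formula is inferred from the reported numbers rather than taken from the benchmark tool's source.

```java
final class PctDiffRange {
  /** Pessimistic bound: slowest plausible comp run vs. fastest plausible base run. */
  static int low(double qpsBase, double sdBase, double qpsComp, double sdComp) {
    return (int) (100.0 * ((qpsComp - sdComp) / (qpsBase + sdBase) - 1.0));
  }

  /** Optimistic bound: fastest plausible comp run vs. slowest plausible base run. */
  static int high(double qpsBase, double sdBase, double qpsComp, double sdComp) {
    return (int) (100.0 * ((qpsComp + sdComp) / (qpsBase - sdBase) - 1.0));
  }

  public static void main(String[] args) {
    // AndHighHigh row: base 124.53 +/- 9.36, comp 100.46 +/- 3.31
    int lo = low(124.53, 9.36, 100.46, 3.31);
    int hi = high(124.53, 9.36, 100.46, 3.31);
    System.out.println(lo + "% - " + hi + "%"); // prints "-27% - -9%"
  }
}
```

Read this way, a row is only a clear regression (or win) when both ends of the range share the same sign.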


[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-08-07 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13430439#comment-13430439
 ] 

Michael McCandless commented on LUCENE-3892:


Hmm also not great results on my env (base=Block, packed=BlockPacked), based on 
current branch head:

{noformat}
Task                     QPS base    StdDev base     QPS packed  StdDev packed      Pct diff
AndHighMed                  59.23           3.07          34.24           0.69   -46% - -37%
AndHighLow                 576.35          21.09         349.57           7.44   -42% - -35%
AndHighHigh                 23.83           0.72          15.53           0.29   -37% - -31%
MedPhrase                   12.56           0.20           8.87           0.31   -32% - -25%
LowPhrase                   20.52           0.21          14.89           0.43   -30% - -24%
MedSloppyPhrase              7.46           0.20           5.41           0.13   -31% - -23%
LowSloppyPhrase              6.73           0.18           4.92           0.12   -30% - -22%
LowSpanNear                  7.63           0.32           5.65           0.19   -31% - -20%
HighSloppyPhrase             1.90           0.08           1.52           0.05   -25% - -14%
HighPhrase                   1.57           0.04           1.26           0.08   -26% - -12%
MedSpanNear                  3.84           0.18           3.14           0.14   -25% - -10%
LowTerm                    433.22          34.89         364.03          15.63   -25% -  -4%
HighSpanNear                 1.40           0.07           1.19           0.06   -23% -  -6%
IntNRQ                       9.50           0.43           8.09           0.92   -27% -   0%
HighTerm                    29.47           4.89          25.46           2.35   -32% -  13%
MedTerm                    148.76          21.53         129.17           9.59   -29% -   9%
Prefix3                     72.81           2.20          63.65           3.88   -20% -  -4%
Wildcard                    44.79           0.92          39.91           2.20   -17% -  -4%
OrHighMed                   16.81           0.48          15.28           0.21   -12% -  -5%
OrHighLow                   21.85           0.67          20.03           0.32   -12% -  -3%
OrHighHigh                   8.49           0.28           7.80           0.14   -12% -  -3%
Fuzzy1                      61.33           1.95          58.91           1.11    -8% -   1%
PKLookup                   156.87           1.14         154.08           2.13    -3% -   0%
Respell                     58.72           1.57          59.60           1.28    -3% -   6%
Fuzzy2                      60.98           2.34          62.03           1.89    -5% -   9%
{noformat}

I think optimizing the all-values-same case is actually quite important for 
payloads (but luceneutil doesn't test this today).

But, curiously, my BlockPacked index is a bit smaller than my Block index (4643 
MB vs 4650 MB).

I do wonder about using long[] to hold the uncompressed results (they only need 
int[]); that's one big difference still.  Also: I'd love to see how 
acceptableOverheadRatio > 0 does ... (and, using PACKED_SINGLE_BLOCK ... we'd 
have to put a bit in the header to record the format).
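
To make the all-values-same case concrete, here is a minimal sketch of why it matters for space (for example, if every payload length in a block is identical). It assumes a hypothetical layout: one header byte per block, followed by either a single variable-length int when all values are equal, or the bit-packed payload otherwise. The layout and every name below are assumptions for illustration, not the format used on the branch.

```java
final class AllEqualBlockSketch {
  /** True if every value in the block equals the first one. */
  static boolean allEqual(int[] block) {
    for (int i = 1; i < block.length; i++) {
      if (block[i] != block[0]) {
        return false;
      }
    }
    return true;
  }

  /** Bits needed for the largest value; assumes non-negative values, minimum 1. */
  static int bitsRequired(int[] block) {
    int or = 0;
    for (int v : block) {
      or |= v;
    }
    return Math.max(1, 32 - Integer.numberOfLeadingZeros(or));
  }

  /** Bytes used by a variable-length int with 7 payload bits per byte. */
  static int vIntBytes(int value) {
    int bytes = 1;
    while ((value & ~0x7F) != 0) {
      value >>>= 7;
      bytes++;
    }
    return bytes;
  }

  /** Encoded size under the hypothetical layout described above. */
  static int encodedBytes(int[] block) {
    if (allEqual(block)) {
      return 1 + vIntBytes(block[0]);          // header + the single repeated value
    }
    int bits = bitsRequired(block);
    return 1 + (block.length * bits + 7) / 8;  // header + packed payload
  }
}
```

Under these assumptions a block of 128 identical small values costs 2 bytes, while 128 distinct values below 128 cost 113 bytes, and decoding the first kind is a simple array fill.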


[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-08-07 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13430507#comment-13430507
 ] 

Michael McCandless commented on LUCENE-3892:


I tried smaller block sizes than 128.  Here's 128 (base) vs 64:
{noformat}
Task                     QPS base    StdDev base    QPS block64 StdDev block64      Pct diff
AndHighHigh                 23.91           0.57          22.28           0.27   -10% -  -3%
AndHighMed                  60.63           1.02          56.96           1.13    -9% -  -2%
MedSloppyPhrase              7.69           0.01           7.30           0.13    -6% -  -3%
HighSloppyPhrase             1.93           0.02           1.83           0.04    -8% -  -1%
LowSloppyPhrase              6.84           0.03           6.57           0.11    -6% -  -1%
Fuzzy1                      65.49           0.85          63.50           1.68    -6% -   0%
HighPhrase                   1.57           0.04           1.53           0.04    -7% -   3%
OrHighLow                   22.89           0.98          22.38           0.61    -8% -   4%
OrHighMed                   17.65           0.70          17.27           0.43    -8% -   4%
IntNRQ                       9.50           0.48           9.33           0.36   -10% -   7%
OrHighHigh                   8.98           0.36           8.84           0.19    -7% -   4%
HighTerm                    29.60           2.64          29.16           1.44   -13% -  13%
Fuzzy2                      65.54           0.86          64.63           2.13    -5% -   3%
Wildcard                    45.27           1.27          44.78           0.48    -4% -   2%
MedTerm                    150.40          12.65         148.99           6.63   -12% -  12%
Prefix3                     72.55           2.55          72.31           1.02    -5% -   4%
LowTerm                    421.62          38.27         422.40           9.47   -10% -  12%
LowSpanNear                  7.55           0.34           7.62           0.22    -6% -   8%
HighSpanNear                 1.34           0.09           1.35           0.06    -9% -  12%
MedPhrase                   12.45           0.24          12.66           0.13    -1% -   4%
Respell                     59.54           1.80          60.95           1.86    -3% -   8%
MedSpanNear                  3.70           0.24           3.80           0.15    -7% -  14%
PKLookup                   154.56           2.45         158.96           1.89     0% -   5%
LowPhrase                   20.21           0.33          20.95           0.15     1% -   6%
AndHighLow                 577.81          12.46         637.96          29.80     3% -  18%
{noformat}

And 128 (base) vs 32:
{noformat}
Task                     QPS base    StdDev base    QPS block64 StdDev block64      Pct diff
AndHighHigh                 23.86           0.52          20.68           0.59   -17% -  -8%
IntNRQ                       9.48           0.38           8.84           0.46   -15% -   2%
HighSloppyPhrase             1.87           0.04           1.76           0.06   -11% -   0%
Prefix3                     72.65           2.18          68.24           2.96   -12% -   1%
HighTerm                    29.91           1.40          28.28           2.94   -19% -   9%
Wildcard                    44.74           0.83          42.43           1.49   -10% -   0%
HighSpanNear                 1.37           0.08           1.30           0.07   -15% -   6%
MedTerm                    152.73           5.28         145.45          14.69   -17% -   8%
MedSloppyPhrase              7.46           0.12           7.12           0.25    -9% -   0%
HighPhrase                   1.57           0.03           1.50           0.01    -7% -  -1%
OrHighLow                   22.94           0.70          22.00           1.10   -11% -   3%
AndHighMed                  58.72           1.79          56.60           1.95    -9% -   2%
LowSloppyPhrase              6.67           0.10           6.44           0.20    -7% -   1%
OrHighMed                   17.52           0.56          17.00           0.82   -10% -   5%
LowSpanNear                  7.53           0.35           7.34           0.39   -11% -   7%
OrHighHigh                   8.84           0.31           8.62           0.43   -10% -   6%
MedSpanNear                  3.79           0.20           3.71           0.21   -12% -   9%
PKLookup                   153.34           3.22         150.19           4.91    -7% -   3%
Fuzzy1                      62.93           1.77          62.28           2.23    -7% -   5%
LowTerm                    410.23          21.57         410.83          35.19   -13% -  14%
MedPhrase                   12.55           0.14          12.65           0.08     0% -   2%
LowPhrase                   20.42           0.17          20.77           0.21     0% -   3%
Fuzzy2                      61.44           3.12          64.13           1.97    -3% -  13%
Respell                     56.65           3.29          60.21           1.39    -1% -  15%
AndHighLow                 588.05          12.37         720.63          19.33    16% -  28%
{noformat}

It looks like there's some speedup to AndHighLow and LowPhrase ... but
slowdowns in other (harder) queries... so I think net/net we should
leave block size at 128.



[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-08-07 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13430798#comment-13430798
 ] 

Han Jiang commented on LUCENE-3892:
---

Thanks Mike. A detailed comparison from my computer is here: 
http://pastebin.com/HLaAuCNp
I tried block sizes ranging from 1024 down to 32, again using 128 as the base.




[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-08-05 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13428936#comment-13428936
 ] 

Michael McCandless commented on LUCENE-3892:


I just committed an optimization to BlockPF DocsEnum.advance, inlining
the scanning step (still have to do DPEnum and EverythingEnum):

{noformat}
Task                     QPS base    StdDev base        QPS for     StdDev for      Pct diff
IntNRQ                      12.46           1.45          11.60           0.04   -16% -   5%
Wildcard                    54.36           2.75          52.72           0.38    -8% -   2%
Prefix3                     85.43           4.97          83.08           0.47    -8% -   3%
Fuzzy2                      63.86           2.13          62.44           1.79    -8% -   4%
Respell                     62.75           1.52          61.42           2.02    -7% -   3%
Fuzzy1                      75.68           1.65          74.69           1.44    -5% -   2%
LowSpanNear                  9.24           0.20           9.13           0.19    -5% -   3%
PKLookup                   192.89           2.91         190.66           2.43    -3% -   1%
HighSpanNear                 1.71           0.05           1.69           0.05    -6% -   4%
MedSpanNear                  4.80           0.11           4.76           0.12    -5% -   4%
MedPhrase                   12.57           0.27          12.56           0.21    -3% -   3%
MedSloppyPhrase              6.57           0.11           6.56           0.11    -3% -   3%
LowPhrase                   21.55           0.35          21.55           0.28    -2% -   2%
LowSloppyPhrase              7.25           0.16           7.28           0.12    -3% -   4%
HighPhrase                   1.81           0.11           1.82           0.10   -10% -  13%
HighSloppyPhrase             1.94           0.10           1.96           0.05    -6% -   9%
LowTerm                    512.53           5.66         518.31           2.30     0% -   2%
MedTerm                    196.09           4.68         198.76           0.30    -1% -   3%
HighTerm                    35.53           0.95          36.11           0.03    -1% -   4%
OrHighMed                   23.34           0.83          23.85           0.70    -4% -   9%
OrHighLow                   26.91           0.98          27.53           0.82    -4% -   9%
OrHighHigh                  11.27           0.41          11.53           0.34    -4% -   9%
AndHighHigh                 21.24           0.05          23.79           0.13    11% -  12%
AndHighLow                 553.19           8.47         621.35           4.01     9% -  14%
AndHighMed                  57.45           0.13          67.78           0.70    16% -  19%
{noformat}
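
For context, the "scanning step" is the linear walk over a block's decoded doc deltas once the right block has been found; And queries benefit most because they call advance constantly. A rough sketch of what such an inlined loop might look like is below. The method, its parameters, and the -1 return convention are hypothetical and are not taken from the committed code.

```java
final class AdvanceScanSketch {
  /**
   * Scans one decoded block of doc deltas and returns the first doc ID
   * that is >= target, or -1 if the block holds no such doc (a caller
   * would then move on to the next block).
   */
  static int scan(int[] docDeltas, int count, int lastDocBeforeBlock, int target) {
    int doc = lastDocBeforeBlock;
    for (int i = 0; i < count; i++) {
      doc += docDeltas[i];   // deltas are gaps between consecutive doc IDs
      if (doc >= target) {
        return doc;
      }
    }
    return -1;
  }
}
```

With deltas {3, 2, 5, 1} after doc 10, the block holds docs 13, 15, 20, 21, so a target of 16 lands on 20.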





[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-07-30 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13425153#comment-13425153
 ] 

Michael McCandless commented on LUCENE-3892:


I'm confused by these two patches: are they against trunk?  How come eg they 
have mods to build.xml?




[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-07-30 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13425246#comment-13425246
 ] 

Michael McCandless commented on LUCENE-3892:


OK I think I understand the two patches now.

First, the build.xml changes are noise I think.  Second, the patches
both mix in the removal of the current For/PFor postings formats based
on sep (I will separately commit this removal: BlockPF is faster).

Then, one patch (LUCENE-3892-blockForhardcode(base).patch) keeps
using the separate packed-ints impl we have, but cuts over to
LongBuffer instead of int[] for the decoded values (still uses
IntBuffer for the encoded values), while the other patch
(LUCENE-3892-blockForpackedecoder(comp).patch) uses oal.util.packed
and LongBuffer for both encoded and decoded values.

So it's nice to see that merely switching to LongBuffer to pass
encoded/decoded values around doesn't seem to hurt much, except for
And queries (odd?), but then switching to oal.util.packed does hurt
(also odd because our packed ints impl has been heavily optimized
lately).





[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-07-30 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13425296#comment-13425296
 ] 

Adrien Grand commented on LUCENE-3892:
--

My benchmark results are a little different but oal.util.packed is still 
behind... (it compares the current branch vs. patched with PackedInts):

{noformat}
Task                QPS pforcodec  StdDev pforcodec  QPS pforcodec-packedints  StdDev pforcodec-packedints      Pct diff
Phrase                      38.21              3.01                     35.73                         2.41   -19% -   8%
SpanNear                    27.99              1.30                     26.30                         1.23   -14% -   3%
SloppyPhrase                43.32              2.98                     41.02                         2.53   -16% -   7%
AndHighMed                 230.23              8.48                    219.88                         9.35   -11% -   3%
AndHighHigh                 52.53              2.02                     50.80                         2.62   -11% -   5%
IntNRQ                      43.24              3.42                     41.84                         2.79   -16% -  12%
Wildcard                   113.26              3.17                    109.91                         3.50    -8% -   3%
Prefix3                    194.56              9.56                    189.39                         9.64   -11% -   7%
Term                       301.86             14.49                    295.28                        17.51   -12% -   8%
OrHighMed                  100.60              8.30                     99.06                         8.00   -16% -  15%
OrHighHigh                  32.35              2.92                     31.90                         2.88   -17% -  18%
Fuzzy2                      36.27              0.67                     35.87                         0.93    -5% -   3%
Fuzzy1                      81.14              1.24                     80.24                         1.68    -4% -   2%
TermGroup100K              193.40              3.36                    191.27                         4.13    -4% -   2%
TermBGroup100K1P           152.78              5.06                    151.23                         3.98    -6% -   5%
TermBGroup100K             242.78              7.06                    240.71                         8.01    -6% -   5%
Respell                     85.75              1.36                     85.17                         2.04    -4% -   3%
PKLookup                   206.02              5.05                    205.57                         4.63    -4% -   4%
{noformat}

I am not sure why oal.util.packed is slower. The only differences I see are 
that it uses inheritance instead of a switch block to select how to decode 
data, and that it encodes values in the high-order long bits first while the 
branch currently starts with the low-order int bits. I'll try to dig deeper to 
understand what happens...
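
The bit-order difference can be illustrated with a small sketch that packs a few fixed-width values into one 64-bit word both ways. This is a simplified model only: it assumes all values fit in a single long (bits * count <= 64, with bits between 1 and 32), and the class and method names are invented here rather than taken from oal.util.packed or the branch.

```java
final class BitOrderSketch {
  static long mask(int bits) {
    return (1L << bits) - 1; // valid for 1 <= bits <= 32 in this sketch
  }

  /** First value goes into the highest-order bits of the word. */
  static long packHighFirst(int[] values, int bits) {
    long word = 0L;
    for (int i = 0; i < values.length; i++) {
      word |= (values[i] & mask(bits)) << (64 - bits * (i + 1));
    }
    return word;
  }

  /** First value goes into the lowest-order bits of the word. */
  static long packLowFirst(int[] values, int bits) {
    long word = 0L;
    for (int i = 0; i < values.length; i++) {
      word |= (values[i] & mask(bits)) << (bits * i);
    }
    return word;
  }

  static int unpackHighFirst(long word, int bits, int index) {
    return (int) ((word >>> (64 - bits * (index + 1))) & mask(bits));
  }

  static int unpackLowFirst(long word, int bits, int index) {
    return (int) ((word >>> (bits * index)) & mask(bits));
  }

  public static void main(String[] args) {
    int[] values = {1, 2, 3};
    System.out.println(String.format("%016x", packHighFirst(values, 8))); // 0102030000000000
    System.out.println(String.format("%016x", packLowFirst(values, 8)));  // 0000000000030201
  }
}
```

Both layouts round-trip the same values; they differ only in the shift arithmetic the decoder performs, so any speed difference would have to come from how the JIT handles those shifts and the surrounding dispatch rather than from the data itself.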




[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-07-30 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13425306#comment-13425306
 ] 

Michael McCandless commented on LUCENE-3892:


I just committed a new BlockPacked postings format, which is a copy of
Block postings format but using oal.util.packed for encode/decode.

I left Block unchanged, except I moved the util classes it had been
using out of oal.codecs.pfor, and removed oal.codecs.pfor.

So now we can iterate to speed up packed ints cutover, and do perf
tests off the branch.





[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-07-30 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13425313#comment-13425313
 ] 

Michael McCandless commented on LUCENE-3892:


Sorry I meant to say: the BlockPacked PF is from Billy's 
LUCENE-3892-blockForpackedecoder(comp).patch.




[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-07-30 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13425388#comment-13425388
 ] 

Michael McCandless commented on LUCENE-3892:


I tested Block vs BlockPacked as checked in.

On a Westmere Xeon machine (Java 1.7.0_04):

{noformat}
Task                     QPS base    StdDev base        QPS for     StdDev for      Pct diff
AndHighMed                  15.14           0.14          13.78           0.13   -10% -  -7%
SloppyPhrase                 2.55           0.11           2.33           0.09   -15% -  -1%
OrHighHigh                   3.75           0.16           3.44           0.09   -14% -  -1%
Wildcard                     8.44           0.01           7.78           0.28   -11% -  -4%
SpanNear                     1.11           0.04           1.03           0.04   -13% -   0%
Prefix3                     17.91           0.08          16.63           0.50   -10% -  -3%
OrHighMed                   11.35           0.65          10.63           0.44   -15% -   3%
IntNRQ                       6.73           0.03           6.32           0.27   -10% -  -1%
TermBGroup1M                 3.87           0.03           3.68           0.04    -6% -  -3%
AndHighHigh                  4.86           0.09           4.63           0.03    -7% -  -2%
Phrase                       1.10           0.06           1.05           0.06   -14% -   6%
Term                         7.86           0.03           7.52           0.04    -5% -  -3%
TermBGroup1M1P               4.65           0.12           4.49           0.06    -6% -   0%
TermGroup1M                  2.97           0.04           2.88           0.02    -4% -  -1%
Fuzzy1                      71.22           1.93          71.02           1.44    -4% -   4%
Fuzzy2                      49.76           1.33          49.90           1.23    -4% -   5%
Respell                     76.23           2.67          76.93           2.67    -5% -   8%
PKLookup                   161.89           3.28         168.28           7.87    -2% -  11%
{noformat}

And on a desktop Ivy Bridge (Java 1.7.0_04):
{noformat}
            Task    QPS base  StdDev base     QPS for  StdDev for       Pct diff
      AndHighMed       17.32         0.12       15.41        0.03   -11% -  -10%
    SloppyPhrase        2.74         0.21        2.56        0.11   -16% -    5%
          Phrase        1.32         0.07        1.23        0.06   -15% -    3%
        Wildcard        9.65         0.11        9.08        0.12    -8% -   -3%
        SpanNear        1.20         0.01        1.13        0.01    -7% -   -3%
     AndHighHigh        5.32         0.03        5.04        0.02    -6% -   -4%
         Prefix3       18.93         0.20       18.04        0.24    -6% -   -2%
          IntNRQ        7.79         0.13        7.48        0.13    -7% -    0%
            Term        9.48         0.10        9.15        0.43    -8% -    2%
    TermBGroup1M        4.74         0.05        4.59        0.12    -6% -    0%
       OrHighMed       13.01         0.24       12.60        0.55    -9% -    2%
      OrHighHigh        4.08         0.05        3.97        0.17    -8% -    2%
     TermGroup1M        3.30         0.03        3.22        0.07    -5% -    0%
  TermBGroup1M1P        5.52         0.11        5.42        0.22    -7% -    4%
        PKLookup      194.62         4.43      193.44        5.07    -5% -    4%
          Fuzzy1       79.23         1.31       79.21        0.96    -2% -    2%
         Respell       78.97         1.04       79.87        1.15    -1% -    3%
          Fuzzy2       56.17         0.93       56.82        0.64    -1% -    4%
{noformat}

So packed is still behind ...

 Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, 
 Simple9/16/64, etc.)
 -

 Key: LUCENE-3892
 URL: https://issues.apache.org/jira/browse/LUCENE-3892
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Michael McCandless
  Labels: gsoc2012, lucene-gsoc-12
 Fix For: 4.1

 Attachments: LUCENE-3892-BlockTermScorer.patch, 
 LUCENE-3892-blockForhardcode(base).patch, 
 LUCENE-3892-blockForpackedecoder(comp).patch, 
 LUCENE-3892-blockFor-with-packedints-decoder.patch, 
 LUCENE-3892-blockFor-with-packedints-decoder.patch, 
 LUCENE-3892-blockFor-with-packedints.patch, 
 LUCENE-3892-direct-IntBuffer.patch, LUCENE-3892-forpfor-with-javadoc.patch, 
 LUCENE-3892-handle_open_files.patch, 
 LUCENE-3892-pfor-compress-iterate-numbits.patch, 
 LUCENE-3892-pfor-compress-slow-estimate.patch, LUCENE-3892_for_byte[].patch, 
 LUCENE-3892_for_int[].patch, LUCENE-3892_for_unfold_method.patch, 
 LUCENE-3892_pfor_unfold_method.patch, LUCENE-3892_pulsing_support.patch, 
 LUCENE-3892_settings.patch, LUCENE-3892_settings.patch



[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-07-23 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13420732#comment-13420732
 ] 

Robert Muir commented on LUCENE-3892:
-

FYI: I committed the TestPostingsFormat here to trunk/4.x to get it going in 
jenkins.

I will merge back to the branch... it can then be modified/improved as usual!


 Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, 
 Simple9/16/64, etc.)
 -

 Key: LUCENE-3892
 URL: https://issues.apache.org/jira/browse/LUCENE-3892
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Michael McCandless
  Labels: gsoc2012, lucene-gsoc-12
 Fix For: 4.1

 Attachments: LUCENE-3892-BlockTermScorer.patch, 
 LUCENE-3892-blockFor-with-packedints-decoder.patch, 
 LUCENE-3892-blockFor-with-packedints-decoder.patch, 
 LUCENE-3892-blockFor-with-packedints.patch, 
 LUCENE-3892-direct-IntBuffer.patch, LUCENE-3892-forpfor-with-javadoc.patch, 
 LUCENE-3892-forpfor-with-javadoc.patch, 
 LUCENE-3892-forpfor-with-javadoc.patch, 
 LUCENE-3892-forpfor-with-javadoc.patch, LUCENE-3892-forpfor.patch, 
 LUCENE-3892-handle_open_files.patch, 
 LUCENE-3892-pfor-compress-iterate-numbits.patch, 
 LUCENE-3892-pfor-compress-slow-estimate.patch, LUCENE-3892_for.patch, 
 LUCENE-3892_for_byte[].patch, LUCENE-3892_for_int[].patch, 
 LUCENE-3892_for_unfold_method.patch, LUCENE-3892_pfor.patch, 
 LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, 
 LUCENE-3892_pfor_unfold_method.patch, LUCENE-3892_pulsing_support.patch, 
 LUCENE-3892_settings.patch, LUCENE-3892_settings.patch





[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-07-20 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13419381#comment-13419381
 ] 

Robert Muir commented on LUCENE-3892:
-

{quote}
I'm afraid that the for loop of readLong() hurts the performance. Here is the 
comparison against last patch:
{quote}

I think so too. I think in each enum, up front you want a pre-allocated byte[] 
(maximum size possible for the block),
and you do ByteBuffer.wrap(x).asLongBuffer().

After you read the header, call readBytes() and then just rewind()?

So this is just like what you do now in the branch, except with LongBuffer 
instead of IntBuffer.
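
A rough sketch of that idea, for illustration only. The class name, the 
MAX_BLOCK_BYTES constant and the fill() helper are hypothetical, and the 
System.arraycopy call merely stands in for reading the block bytes from the 
index input with readBytes(); only the java.nio calls are real API.

```java
import java.nio.ByteBuffer;
import java.nio.LongBuffer;

final class LongBlockBuffer {
    // Hypothetical upper bound on one encoded block, in bytes.
    static final int MAX_BLOCK_BYTES = 128 * 8;

    // Allocated once per enum and reused for every block.
    private final byte[] bytes = new byte[MAX_BLOCK_BYTES];
    // A view over the same array: no copying, big-endian by default.
    private final LongBuffer longs = ByteBuffer.wrap(bytes).asLongBuffer();

    /** Copies numBytes of encoded data in and returns the view positioned at 0. */
    LongBuffer fill(byte[] src, int offset, int numBytes) {
        // Stand-in for readBytes(bytes, 0, numBytes) on the real input.
        System.arraycopy(src, offset, bytes, 0, numBytes);
        // Reset the position so the next get() returns the first long.
        longs.rewind();
        return longs;
    }
}
```

The decoder then pulls whole longs with get() instead of assembling each one 
from bytes in a loop.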

 Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, 
 Simple9/16/64, etc.)
 -

 Key: LUCENE-3892
 URL: https://issues.apache.org/jira/browse/LUCENE-3892
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
  Labels: gsoc2012, lucene-gsoc-12
 Fix For: 4.1

 Attachments: LUCENE-3892-BlockTermScorer.patch, 
 LUCENE-3892-blockFor-with-packedints-decoder.patch, 
 LUCENE-3892-blockFor-with-packedints.patch, 
 LUCENE-3892-direct-IntBuffer.patch, LUCENE-3892-forpfor-with-javadoc.patch, 
 LUCENE-3892-forpfor-with-javadoc.patch, 
 LUCENE-3892-forpfor-with-javadoc.patch, 
 LUCENE-3892-forpfor-with-javadoc.patch, LUCENE-3892-forpfor.patch, 
 LUCENE-3892-handle_open_files.patch, 
 LUCENE-3892-pfor-compress-iterate-numbits.patch, 
 LUCENE-3892-pfor-compress-slow-estimate.patch, LUCENE-3892_for.patch, 
 LUCENE-3892_for_byte[].patch, LUCENE-3892_for_int[].patch, 
 LUCENE-3892_for_unfold_method.patch, LUCENE-3892_pfor.patch, 
 LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, 
 LUCENE-3892_pfor_unfold_method.patch, LUCENE-3892_pulsing_support.patch, 
 LUCENE-3892_settings.patch, LUCENE-3892_settings.patch





[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-07-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13415072#comment-13415072
 ] 

Michael McCandless commented on LUCENE-3892:


Thanks Billy, I committed last baseline patch!


 Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, 
 Simple9/16/64, etc.)
 -

 Key: LUCENE-3892
 URL: https://issues.apache.org/jira/browse/LUCENE-3892
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
  Labels: gsoc2012, lucene-gsoc-12
 Fix For: 4.1

 Attachments: LUCENE-3892-BlockTermScorer.patch, 
 LUCENE-3892-direct-IntBuffer.patch, LUCENE-3892-forpfor-with-javadoc.patch, 
 LUCENE-3892-forpfor-with-javadoc.patch, 
 LUCENE-3892-forpfor-with-javadoc.patch, 
 LUCENE-3892-forpfor-with-javadoc.patch, LUCENE-3892-forpfor.patch, 
 LUCENE-3892-handle_open_files.patch, 
 LUCENE-3892-pfor-compress-iterate-numbits.patch, 
 LUCENE-3892-pfor-compress-slow-estimate.patch, LUCENE-3892_for.patch, 
 LUCENE-3892_for_byte[].patch, LUCENE-3892_for_int[].patch, 
 LUCENE-3892_for_unfold_method.patch, LUCENE-3892_pfor.patch, 
 LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, 
 LUCENE-3892_pfor_unfold_method.patch, LUCENE-3892_pulsing_support.patch, 
 LUCENE-3892_settings.patch, LUCENE-3892_settings.patch





[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-07-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13415085#comment-13415085
 ] 

Michael McCandless commented on LUCENE-3892:


I opened LUCENE-4225 with a new base PostingsFormat that gives better perf for 
For than Sep...

 Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, 
 Simple9/16/64, etc.)
 -

 Key: LUCENE-3892
 URL: https://issues.apache.org/jira/browse/LUCENE-3892
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
  Labels: gsoc2012, lucene-gsoc-12
 Fix For: 4.1

 Attachments: LUCENE-3892-BlockTermScorer.patch, 
 LUCENE-3892-direct-IntBuffer.patch, LUCENE-3892-forpfor-with-javadoc.patch, 
 LUCENE-3892-forpfor-with-javadoc.patch, 
 LUCENE-3892-forpfor-with-javadoc.patch, 
 LUCENE-3892-forpfor-with-javadoc.patch, LUCENE-3892-forpfor.patch, 
 LUCENE-3892-handle_open_files.patch, 
 LUCENE-3892-pfor-compress-iterate-numbits.patch, 
 LUCENE-3892-pfor-compress-slow-estimate.patch, LUCENE-3892_for.patch, 
 LUCENE-3892_for_byte[].patch, LUCENE-3892_for_int[].patch, 
 LUCENE-3892_for_unfold_method.patch, LUCENE-3892_pfor.patch, 
 LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, 
 LUCENE-3892_pfor_unfold_method.patch, LUCENE-3892_pulsing_support.patch, 
 LUCENE-3892_settings.patch, LUCENE-3892_settings.patch





[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-07-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13415226#comment-13415226
 ] 

Michael McCandless commented on LUCENE-3892:


I think a good thing to explore next is to stop using our own packed ints impl 
and instead cutover to oal.util.packed?  (Since so much effort has gone into 
making those impls fast).

LUCENE-4161 has already taken a big step towards making them usable ... we 
should prototype an initial cutover and then iterate?
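
For readers unfamiliar with what a packed ints implementation does, here is a 
minimal fixed-width packing sketch. It is not the oal.util.packed API and not 
the branch's code; the class and method names are made up, and it assumes 
non-negative values with 1 <= bitsPerValue <= 32.

```java
final class FixedWidthPacker {
    /** Packs values into longs, bitsPerValue bits each, lowest bits first. */
    static long[] pack(int[] values, int bitsPerValue) {
        long[] out = new long[(values.length * bitsPerValue + 63) / 64];
        long mask = (1L << bitsPerValue) - 1;
        for (int i = 0; i < values.length; i++) {
            long bitPos = (long) i * bitsPerValue;
            int word = (int) (bitPos >>> 6);
            int shift = (int) (bitPos & 63);
            long v = values[i] & mask;
            out[word] |= v << shift;
            // The value straddles two longs: spill the high bits into the next one.
            if (shift + bitsPerValue > 64) {
                out[word + 1] |= v >>> (64 - shift);
            }
        }
        return out;
    }

    /** Reverses pack(): reads count values of bitsPerValue bits each. */
    static int[] unpack(long[] packed, int count, int bitsPerValue) {
        int[] out = new int[count];
        long mask = (1L << bitsPerValue) - 1;
        for (int i = 0; i < count; i++) {
            long bitPos = (long) i * bitsPerValue;
            int word = (int) (bitPos >>> 6);
            int shift = (int) (bitPos & 63);
            long v = packed[word] >>> shift;
            if (shift + bitsPerValue > 64) {
                v |= packed[word + 1] << (64 - shift);
            }
            out[i] = (int) (v & mask);
        }
        return out;
    }
}
```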

 Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, 
 Simple9/16/64, etc.)
 -

 Key: LUCENE-3892
 URL: https://issues.apache.org/jira/browse/LUCENE-3892
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
  Labels: gsoc2012, lucene-gsoc-12
 Fix For: 4.1

 Attachments: LUCENE-3892-BlockTermScorer.patch, 
 LUCENE-3892-direct-IntBuffer.patch, LUCENE-3892-forpfor-with-javadoc.patch, 
 LUCENE-3892-forpfor-with-javadoc.patch, 
 LUCENE-3892-forpfor-with-javadoc.patch, 
 LUCENE-3892-forpfor-with-javadoc.patch, LUCENE-3892-forpfor.patch, 
 LUCENE-3892-handle_open_files.patch, 
 LUCENE-3892-pfor-compress-iterate-numbits.patch, 
 LUCENE-3892-pfor-compress-slow-estimate.patch, LUCENE-3892_for.patch, 
 LUCENE-3892_for_byte[].patch, LUCENE-3892_for_int[].patch, 
 LUCENE-3892_for_unfold_method.patch, LUCENE-3892_pfor.patch, 
 LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, 
 LUCENE-3892_pfor_unfold_method.patch, LUCENE-3892_pulsing_support.patch, 
 LUCENE-3892_settings.patch, LUCENE-3892_settings.patch





[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-07-16 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13415230#comment-13415230
 ] 

Adrien Grand commented on LUCENE-3892:
--

+1 Don't hesitate to tell me if you're missing methods for this issue (I'm 
thinking at least of bulk int[] read/write, we currently only make it possible 
with longs).

 Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, 
 Simple9/16/64, etc.)
 -

 Key: LUCENE-3892
 URL: https://issues.apache.org/jira/browse/LUCENE-3892
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
  Labels: gsoc2012, lucene-gsoc-12
 Fix For: 4.1

 Attachments: LUCENE-3892-BlockTermScorer.patch, 
 LUCENE-3892-direct-IntBuffer.patch, LUCENE-3892-forpfor-with-javadoc.patch, 
 LUCENE-3892-forpfor-with-javadoc.patch, 
 LUCENE-3892-forpfor-with-javadoc.patch, 
 LUCENE-3892-forpfor-with-javadoc.patch, LUCENE-3892-forpfor.patch, 
 LUCENE-3892-handle_open_files.patch, 
 LUCENE-3892-pfor-compress-iterate-numbits.patch, 
 LUCENE-3892-pfor-compress-slow-estimate.patch, LUCENE-3892_for.patch, 
 LUCENE-3892_for_byte[].patch, LUCENE-3892_for_int[].patch, 
 LUCENE-3892_for_unfold_method.patch, LUCENE-3892_pfor.patch, 
 LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, 
 LUCENE-3892_pfor_unfold_method.patch, LUCENE-3892_pulsing_support.patch, 
 LUCENE-3892_settings.patch, LUCENE-3892_settings.patch





[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-07-16 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13415236#comment-13415236
 ] 

Han Jiang commented on LUCENE-3892:
---

bq. I opened LUCENE-4225 with a new base PostingsFormat that gives better perf 
for For than Sep...
Wow, the result looks great! Quite curious why some queries improve so much, 
like AndHighHigh.

bq. LUCENE-4161 has already taken a big step towards making them usable ... we 
should prototype an initial cutover and then iterate?
Yes, but we should make the PostingsFormat pass test first? Currently it also 
fails some tests for ForPF.

 Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, 
 Simple9/16/64, etc.)
 -

 Key: LUCENE-3892
 URL: https://issues.apache.org/jira/browse/LUCENE-3892
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
  Labels: gsoc2012, lucene-gsoc-12
 Fix For: 4.1

 Attachments: LUCENE-3892-BlockTermScorer.patch, 
 LUCENE-3892-direct-IntBuffer.patch, LUCENE-3892-forpfor-with-javadoc.patch, 
 LUCENE-3892-forpfor-with-javadoc.patch, 
 LUCENE-3892-forpfor-with-javadoc.patch, 
 LUCENE-3892-forpfor-with-javadoc.patch, LUCENE-3892-forpfor.patch, 
 LUCENE-3892-handle_open_files.patch, 
 LUCENE-3892-pfor-compress-iterate-numbits.patch, 
 LUCENE-3892-pfor-compress-slow-estimate.patch, LUCENE-3892_for.patch, 
 LUCENE-3892_for_byte[].patch, LUCENE-3892_for_int[].patch, 
 LUCENE-3892_for_unfold_method.patch, LUCENE-3892_pfor.patch, 
 LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, 
 LUCENE-3892_pfor_unfold_method.patch, LUCENE-3892_pulsing_support.patch, 
 LUCENE-3892_settings.patch, LUCENE-3892_settings.patch





[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-07-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13415335#comment-13415335
 ] 

Michael McCandless commented on LUCENE-3892:


bq. Yes, but we should make the PostingsFormat pass test first? Currently it 
also fails some tests for ForPF.

Uh oh I didn't know tests are failing on the branch: do you have a seed?

 Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, 
 Simple9/16/64, etc.)
 -

 Key: LUCENE-3892
 URL: https://issues.apache.org/jira/browse/LUCENE-3892
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
  Labels: gsoc2012, lucene-gsoc-12
 Fix For: 4.1

 Attachments: LUCENE-3892-BlockTermScorer.patch, 
 LUCENE-3892-direct-IntBuffer.patch, LUCENE-3892-forpfor-with-javadoc.patch, 
 LUCENE-3892-forpfor-with-javadoc.patch, 
 LUCENE-3892-forpfor-with-javadoc.patch, 
 LUCENE-3892-forpfor-with-javadoc.patch, LUCENE-3892-forpfor.patch, 
 LUCENE-3892-handle_open_files.patch, 
 LUCENE-3892-pfor-compress-iterate-numbits.patch, 
 LUCENE-3892-pfor-compress-slow-estimate.patch, LUCENE-3892_for.patch, 
 LUCENE-3892_for_byte[].patch, LUCENE-3892_for_int[].patch, 
 LUCENE-3892_for_unfold_method.patch, LUCENE-3892_pfor.patch, 
 LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, 
 LUCENE-3892_pfor_unfold_method.patch, LUCENE-3892_pulsing_support.patch, 
 LUCENE-3892_settings.patch, LUCENE-3892_settings.patch





[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-07-13 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13413682#comment-13413682
 ] 

Han Jiang commented on LUCENE-3892:
---

bq. Was the numBits==0 case for all 0s not all 1s? We may want to have it mean 
all 1s instead?
OK, I just tested this: in most cases (93%) where the whole block shares one 
value v, v==1. This change improves indexing speed and reduces file size a bit 
(280s vs 320s and 589M vs 591M). But why? Does Lucene store freq() when it is 0 
as well, so that a whole block with v==1 becomes more likely?
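
To make the numBits==0 case concrete, here is a small sketch of how an encoder 
might detect a block whose values are all the same. The class and method names 
are hypothetical and the branch's actual header layout may differ; it assumes 
non-negative values.

```java
final class UniformBlock {
    /**
     * Returns the bit width needed for the largest value in the block,
     * or 0 when every value equals the first one (the "uniform" case).
     */
    static int numBits(int[] block) {
        boolean allEqual = true;
        int or = 0;
        for (int v : block) {
            allEqual &= (v == block[0]);
            or |= v; // OR-ing keeps the highest set bit of any value
        }
        if (allEqual) {
            return 0;
        }
        return 32 - Integer.numberOfLeadingZeros(or);
    }
}
```

With numBits==0 the encoder only needs the header plus, at most, the single 
shared value, which is where the size and indexing-time savings would come from.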

 Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, 
 Simple9/16/64, etc.)
 -

 Key: LUCENE-3892
 URL: https://issues.apache.org/jira/browse/LUCENE-3892
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
  Labels: gsoc2012, lucene-gsoc-12
 Fix For: 4.1

 Attachments: LUCENE-3892-BlockTermScorer.patch, 
 LUCENE-3892-direct-IntBuffer.patch, LUCENE-3892-forpfor-with-javadoc.patch, 
 LUCENE-3892-forpfor-with-javadoc.patch, LUCENE-3892-forpfor.patch, 
 LUCENE-3892-handle_open_files.patch, LUCENE-3892_for.patch, 
 LUCENE-3892_for_byte[].patch, LUCENE-3892_for_int[].patch, 
 LUCENE-3892_for_unfold_method.patch, LUCENE-3892_pfor.patch, 
 LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, 
 LUCENE-3892_pfor_unfold_method.patch, LUCENE-3892_pulsing_support.patch, 
 LUCENE-3892_settings.patch, LUCENE-3892_settings.patch





[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-07-13 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13413698#comment-13413698
 ] 

Michael McCandless commented on LUCENE-3892:


bq. But why? Does lucene store freq() when it is 0 as well, so a whole block 
with v==1 will be more possible?

A whole block of 1s can easily happen: if all freqs are one (the term always 
occurred only once in each document), or if the term occurs in every document 
then the delta between docIDs is always 1.

I don't think we should ever hit an all 0s block today (hmm: except for 
positions, if the given term always occurred at the first position in each doc).

We could in theory subtract 1 from all these deltas (except the first one!  so 
maybe we add one to the docID to begin with...) so that these turn into all 0s 
blocks, but then at decode time we'd have to add 1 back and I'm not sure that'd 
net/net be a win.
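
A sketch of the subtract-one idea, purely to show the trade-off. Names are 
hypothetical and this is not what the branch does; starting lastDocID at -1 is 
one possible reading of "add one to the docID to begin with", so that the first 
entry is handled like all the others.

```java
final class GapMinusOne {
    /** Encodes strictly increasing docIDs as (gap - 1). */
    static int[] encode(int[] docIDs) {
        int[] out = new int[docIDs.length];
        int last = -1; // so the first entry comes out as the docID itself
        for (int i = 0; i < docIDs.length; i++) {
            out[i] = docIDs[i] - last - 1;
            last = docIDs[i];
        }
        return out;
    }

    /** Reverses encode(); note the extra +1 paid for every value. */
    static int[] decode(int[] gaps) {
        int[] out = new int[gaps.length];
        int last = -1;
        for (int i = 0; i < gaps.length; i++) {
            last += gaps[i] + 1;
            out[i] = last;
        }
        return out;
    }
}
```

A term that occurs in every document (docIDs 0, 1, 2, ...) encodes to all 0s 
here, but every decoded value costs one more addition.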

 Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, 
 Simple9/16/64, etc.)
 -

 Key: LUCENE-3892
 URL: https://issues.apache.org/jira/browse/LUCENE-3892
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
  Labels: gsoc2012, lucene-gsoc-12
 Fix For: 4.1

 Attachments: LUCENE-3892-BlockTermScorer.patch, 
 LUCENE-3892-direct-IntBuffer.patch, LUCENE-3892-forpfor-with-javadoc.patch, 
 LUCENE-3892-forpfor-with-javadoc.patch, LUCENE-3892-forpfor.patch, 
 LUCENE-3892-handle_open_files.patch, LUCENE-3892_for.patch, 
 LUCENE-3892_for_byte[].patch, LUCENE-3892_for_int[].patch, 
 LUCENE-3892_for_unfold_method.patch, LUCENE-3892_pfor.patch, 
 LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, 
 LUCENE-3892_pfor_unfold_method.patch, LUCENE-3892_pulsing_support.patch, 
 LUCENE-3892_settings.patch, LUCENE-3892_settings.patch





[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-07-13 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13413709#comment-13413709
 ] 

Han Jiang commented on LUCENE-3892:
---

bq. We could in theory subtract 1 from all these deltas (except the first one! 
so maybe we add one to the docID to begin with...) so that these turn into all 
0s blocks, but then at decode time we'd have to add 1 back and I'm not sure 
that'd net/net be a win.

Hmm, so the current strategy is: 1. for docIDs, store v[i+1]-v[i]-1; 2. for freq 
and positions, store v[i] directly? Yes, there are blocks with all 0s, although 
they are very rare. 


 Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, 
 Simple9/16/64, etc.)
 -

 Key: LUCENE-3892
 URL: https://issues.apache.org/jira/browse/LUCENE-3892
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
  Labels: gsoc2012, lucene-gsoc-12
 Fix For: 4.1

 Attachments: LUCENE-3892-BlockTermScorer.patch, 
 LUCENE-3892-direct-IntBuffer.patch, LUCENE-3892-forpfor-with-javadoc.patch, 
 LUCENE-3892-forpfor-with-javadoc.patch, LUCENE-3892-forpfor.patch, 
 LUCENE-3892-handle_open_files.patch, LUCENE-3892_for.patch, 
 LUCENE-3892_for_byte[].patch, LUCENE-3892_for_int[].patch, 
 LUCENE-3892_for_unfold_method.patch, LUCENE-3892_pfor.patch, 
 LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, 
 LUCENE-3892_pfor_unfold_method.patch, LUCENE-3892_pulsing_support.patch, 
 LUCENE-3892_settings.patch, LUCENE-3892_settings.patch





[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-07-13 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13413719#comment-13413719
 ] 

Michael McCandless commented on LUCENE-3892:


No, for docIDs we store docID - lastDocID.  So that delta can be 0 for the 
first doc in a posting list, and then >= 1 thereafter.

But an all 0s block is possible if a bunch of terms in a row occurred only in 
doc 0.
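
As a small illustration of that scheme (hypothetical names; it assumes the 
delta stream runs across consecutive terms and that lastDocID restarts at 0 
for each term):

```java
final class DocDeltaStream {
    /** Emits docID - lastDocID for every posting, term after term. */
    static int[] deltas(int[][] postingsPerTerm) {
        int total = 0;
        for (int[] postings : postingsPerTerm) {
            total += postings.length;
        }
        int[] out = new int[total];
        int upto = 0;
        for (int[] postings : postingsPerTerm) {
            int lastDocID = 0; // restarts for each term's posting list
            for (int docID : postings) {
                out[upto++] = docID - lastDocID;
                lastDocID = docID;
            }
        }
        return out;
    }
}
```

Three consecutive terms that each occur only in doc 0 produce the deltas 
0, 0, 0, which is how an all 0s run can arise.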

 Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, 
 Simple9/16/64, etc.)
 -

 Key: LUCENE-3892
 URL: https://issues.apache.org/jira/browse/LUCENE-3892
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
  Labels: gsoc2012, lucene-gsoc-12
 Fix For: 4.1

 Attachments: LUCENE-3892-BlockTermScorer.patch, 
 LUCENE-3892-direct-IntBuffer.patch, LUCENE-3892-forpfor-with-javadoc.patch, 
 LUCENE-3892-forpfor-with-javadoc.patch, LUCENE-3892-forpfor.patch, 
 LUCENE-3892-handle_open_files.patch, LUCENE-3892_for.patch, 
 LUCENE-3892_for_byte[].patch, LUCENE-3892_for_int[].patch, 
 LUCENE-3892_for_unfold_method.patch, LUCENE-3892_pfor.patch, 
 LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, 
 LUCENE-3892_pfor_unfold_method.patch, LUCENE-3892_pulsing_support.patch, 
 LUCENE-3892_settings.patch, LUCENE-3892_settings.patch





[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-07-13 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13413935#comment-13413935
 ] 

Michael McCandless commented on LUCENE-3892:


Those are interesting results!  Curious how much faster indexing is for PFor if 
you use all_Vs; cutting the header is also a nice reduction on index size.

Instead of having P/ForUtil reach up into P/ForPostingsFormat for the default 
block size, I think we can assume the int[] array length (of the decoded 
buffer) is the size of the block?
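
A rough sketch of the suggestion above, in which the decoder takes its block size from the length of the decoded int[] instead of reaching up into the postings format for a constant. Everything here (BlockSizeSketch, the long[] packing layout, encode/decode) is a hypothetical stand-in for illustration and is not the P/ForUtil code from the patch; it assumes non-negative values and a bit width between 1 and 32.

```java
// Hypothetical stand-in for illustration; not the ForUtil class from the patch.
final class BlockSizeSketch {

  /** Packs values (each assumed to fit in numBits, 1..32) into 64-bit words. */
  static long[] encode(int[] values, int numBits) {
    final long mask = (1L << numBits) - 1;
    final long[] packed = new long[(values.length * numBits + 63) / 64];
    for (int i = 0; i < values.length; i++) {
      final long bitPos = (long) i * numBits;
      final int word = (int) (bitPos >>> 6);
      final int shift = (int) (bitPos & 63);
      final long v = values[i] & mask;
      packed[word] |= v << shift;
      if (shift + numBits > 64) {
        // The value straddles two words; store its high bits in the next word.
        packed[word + 1] |= v >>> (64 - shift);
      }
    }
    return packed;
  }

  /** Unpacks exactly buffer.length values; the block size is not passed in separately. */
  static void decode(long[] packed, int numBits, int[] buffer) {
    final int blockSize = buffer.length; // inferred from the decoded buffer
    final long mask = (1L << numBits) - 1;
    for (int i = 0; i < blockSize; i++) {
      final long bitPos = (long) i * numBits;
      final int word = (int) (bitPos >>> 6);
      final int shift = (int) (bitPos & 63);
      long value = packed[word] >>> shift;
      if (shift + numBits > 64) {
        value |= packed[word + 1] << (64 - shift);
      }
      buffer[i] = (int) (value & mask);
    }
  }

  public static void main(String[] args) {
    int[] values = {1, 5, 2, 7, 0, 3, 6, 4};
    int[] buffer = new int[values.length]; // the buffer length alone fixes the block size
    decode(encode(values, 3), 3, buffer);
    System.out.println(java.util.Arrays.toString(buffer));
  }
}
```

With this shape the utility class needs no reference to the postings format at all; whoever allocates the buffer decides the block size.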




[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-07-12 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13413312#comment-13413312
 ] 

Michael McCandless commented on LUCENE-3892:


Thanks Billy, I'll commit!

One thing I noticed: I think we shouldn't separately read numBytes and the int 
header?  Can't we do a single readVInt(), and that encodes numBytes as well as 
format (bit width and format, once we tie into oal.util.packed APIs)?  Also, we 
shouldn't encode numInts at all, ie, this should be fixed for the whole 
segment, and not written per block.
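
As a sketch of what a combined header could look like: the layout below puts the bit width in the low 6 bits and numBytes in the remaining bits of one int, which could then be written with writeVInt() and read back with a single readVInt(). The layout, the class name BlockHeaderSketch and the field width are assumptions made for illustration; the thread does not specify the actual encoding.

```java
// Hypothetical header layout for illustration; the real on-disk format is not given in this thread.
final class BlockHeaderSketch {
  private static final int BITS_FIELD_WIDTH = 6; // enough for bit widths 0..32
  private static final int BITS_FIELD_MASK = (1 << BITS_FIELD_WIDTH) - 1;

  /** Combines numBytes and numBits into one int suitable for a single VInt. */
  static int encode(int numBytes, int numBits) {
    if (numBits < 0 || numBits > 32) {
      throw new IllegalArgumentException("numBits out of range: " + numBits);
    }
    if (numBytes < 0 || numBytes > (Integer.MAX_VALUE >>> BITS_FIELD_WIDTH)) {
      throw new IllegalArgumentException("numBytes out of range: " + numBytes);
    }
    return (numBytes << BITS_FIELD_WIDTH) | numBits;
  }

  /** Extracts the byte count of the packed block from the header. */
  static int numBytes(int header) {
    return header >>> BITS_FIELD_WIDTH;
  }

  /** Extracts the bit width from the header. */
  static int numBits(int header) {
    return header & BITS_FIELD_MASK;
  }

  public static void main(String[] args) {
    int header = encode(112, 7); // e.g. 128 values at 7 bits each occupy 112 bytes
    System.out.println(numBytes(header) + " bytes, " + numBits(header) + " bits per value");
  }
}
```

numInts does not appear in the header at all, in line with the point that it should be fixed per segment rather than written per block.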




[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-07-12 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13413314#comment-13413314
 ] 

Michael McCandless commented on LUCENE-3892:


I didn't commit 
lucene/core/src/java/org/apache/lucene/codecs/pfor/ForPostingsFormat.java -- 
your IDE had changed it to a wildcard import (I prefer we stick with individual 
imports).

Was the numBits==0 case for all 0s not all 1s?  We may want to have it mean all 
1s instead?
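
For reference, a small sketch of how a numBits==0 block can arise and be decoded without reading any payload. The names (ZeroBitsSketch, bitsRequired, decodeConstant) are hypothetical and the code is not from the patch; it assumes non-negative values. Which constant a 0-bit block stands for (0 or 1) is exactly the open question above, so it is left as a parameter here.

```java
import java.util.Arrays;

// Hypothetical illustration; not the encoder from the patch.
final class ZeroBitsSketch {

  /** Bits needed for the largest value in the block; 0 when every value is 0. */
  static int bitsRequired(int[] block) {
    int or = 0;
    for (int v : block) {
      or |= v; // the OR of all values has the same highest set bit as the maximum
    }
    return 32 - Integer.numberOfLeadingZeros(or);
  }

  /** Decodes a block whose header says numBits == 0: fills the buffer with one constant. */
  static void decodeConstant(int[] buffer, int constant) {
    Arrays.fill(buffer, constant);
  }

  public static void main(String[] args) {
    System.out.println(bitsRequired(new int[] {0, 0, 0, 0})); // all-0s block
    System.out.println(bitsRequired(new int[] {1, 5, 2, 7}));
    int[] buffer = new int[4];
    decodeConstant(buffer, 1); // the "all 1s" interpretation
    System.out.println(Arrays.toString(buffer));
  }
}
```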




[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-07-11 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13411472#comment-13411472
 ] 

Michael McCandless commented on LUCENE-3892:


bq. I'm still not sure about the IOUtils.closeWhileHandlingException(), I think 
the exceptions should not be suppressed when out.close() is called?

Actually I think you want them to be suppressed, so that the original exception 
is seen?




[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-07-11 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13411475#comment-13411475
 ] 

Michael McCandless commented on LUCENE-3892:


Docs/cleanup patch looks good, I'll commit to the branch!  Thanks.




[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-07-11 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13411495#comment-13411495
 ] 

Han Jiang commented on LUCENE-3892:
---

bq. Actually I think you want them to be suppressed, so that the original 
exception is seen?

Not my idea actually, I think the exception should be thrown for out.close()? 
closeWhileHandlingException() will suppress those exceptions. 




[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-07-11 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13411515#comment-13411515
 ] 

Michael McCandless commented on LUCENE-3892:


bq. Not my idea actually, I think the exception should be thrown for 
out.close()? closeWhileHandlingException() will suppress those exceptions

But the problem is some other exception has already been thrown (because 
success is false).  If out.close then hits a second exception we have to pick 
which one should be thrown, and I think the original one is better?  (Since 
it's likely the root cause of whatever went wrong).
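
A minimal sketch of the success-flag pattern being discussed. closeSuppressing() below is a self-written stand-in for a helper like IOUtils.closeWhileHandlingException(), and CloseOnFailureSketch, FailingOnClose and write() are hypothetical names; none of this is the actual code of ForPostingsFormat.

```java
import java.io.Closeable;
import java.io.IOException;

// Hypothetical illustration of the success-flag pattern; not Lucene code.
final class CloseOnFailureSketch {

  /** Stand-in for a quiet-close helper: closes and drops any secondary exception. */
  static void closeSuppressing(Closeable c) {
    try {
      if (c != null) {
        c.close();
      }
    } catch (IOException | RuntimeException suppressed) {
      // Intentionally ignored so that the original failure stays visible.
    }
  }

  /** A resource whose close() always fails, to simulate the second exception. */
  static final class FailingOnClose implements Closeable {
    @Override
    public void close() throws IOException {
      throw new IOException("secondary failure in close()");
    }
  }

  static void write(Closeable out, boolean failWhileWriting) throws IOException {
    boolean success = false;
    try {
      if (failWhileWriting) {
        throw new IOException("original failure while writing");
      }
      success = true;
    } finally {
      if (success) {
        out.close(); // nothing else went wrong, so a close() failure must surface
      } else {
        closeSuppressing(out); // keep the root cause, drop the close() failure
      }
    }
  }

  public static void main(String[] args) {
    try {
      write(new FailingOnClose(), true);
    } catch (IOException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

When the write itself fails, the caller sees the original exception; when the write succeeds, a failing close() is still reported, so nothing is silently lost on the normal path.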




[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-07-11 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13411517#comment-13411517
 ] 

Han Jiang commented on LUCENE-3892:
---

bq. The Pulsing parts in the last patch are not included here, because they 
don't improve performance significantly. 

Here are some tests comparing For vs PulsingFor and PFor vs PulsingPFor, run 
on 1M docs with wikimediumhard.tasks.

It is strange that PKLookup still doesn't benefit with FixedBlockInt:

{noformat}
Task              QPS For  StdDev For  QPS PulsingFor  StdDev PulsingFor     Pct diff
AndHighHigh         23.01        0.33           22.94               0.66   -4% -   4%
AndHighMed          56.41        0.76           57.41               1.74   -2% -   6%
Fuzzy1              86.74        0.85           82.22               2.39   -8% -  -1%
Fuzzy2              28.23        0.38           26.15               0.97  -11% -  -2%
IntNRQ              41.78        1.65           40.78               3.53  -14% -  10%
OrHighHigh          14.44        0.34           14.50               0.92   -8% -   9%
OrHighMed           30.59        0.77           31.12               1.93   -6% -  10%
PKLookup           110.31        2.03          109.22               2.43   -4% -   3%
Phrase               8.18        0.44            7.97               0.40  -12% -   8%
Prefix3             99.64        2.38           97.09               3.46   -8% -   3%
Respell             99.66        0.45           92.76               2.81  -10% -  -3%
SloppyPhrase         4.28        0.16            4.08               0.13  -11% -   2%
SpanNear             4.08        0.13            3.93               0.06   -7% -   0%
Term                33.63        1.25           34.06               1.71   -7% -  10%
TermBGroup1M        15.54        0.46           15.78               0.56   -4% -   8%
TermBGroup1M1P      20.34        0.73           20.62               0.62   -5% -   8%
TermGroup1M         19.18        0.52           19.72               0.49   -2% -   8%
Wildcard            34.86        0.88           34.27               1.77   -9% -   6%
{noformat}

{noformat}
AndHighHigh        19.98    0.31   19.92    0.26   -3% -   2%
AndHighMed         58.21    1.51   57.86    1.18   -5% -   4%
Fuzzy1             91.86    1.17   85.86    1.18   -8% -  -4%
Fuzzy2             32.66    0.58   30.08    0.57  -11% -  -4%
IntNRQ             33.89    0.82   32.66    1.10   -9% -   2%
OrHighHigh         15.79    1.29   14.96    0.67  -16% -   7%
OrHighMed          30.31    2.09   28.91    1.67  -15% -   8%
PKLookup          112.80    0.81  111.82    2.90   -4% -   2%
Phrase              6.14    0.11    6.23    0.10   -1% -   5%
Prefix3           147.80    2.88  138.35    2.11   -9% -  -3%
Respell           118.57    1.18  108.30    1.86  -11% -  -6%
SloppyPhrase        5.78    0.15    5.66    0.29   -9% -   5%
SpanNear            6.32    0.14    6.40    0.16   -3% -   6%
Term               41.60    2.44   38.12    0.33  -14% -  -1%
TermBGroup1M       14.40    0.48   13.73    0.19   -8% -   0%
TermBGroup1M1P     23.68    0.44   22.82    0.44   -7% -   0%
TermGroup1M        15.25    0.48   14.51    0.20   -9% -   0%
Wildcard           32.76    0.53   31.76    0.62   -6% -   0%
{noformat}



[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-07-11 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13411533#comment-13411533
 ] 

Han Jiang commented on LUCENE-3892:
---

bq. But the problem is some other exception has already been thrown (because 
success is false). If out.close then hits a second exception we have to pick 
which one should be thrown, and I think the original one is better? (Since it's 
likely the root cause of whatever went wrong).

OK, I see, then let's change ForPostingsFormat.fieldsConsumer/Producer as well.




[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-07-11 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13411546#comment-13411546
 ] 

Michael McCandless commented on LUCENE-3892:


OK I committed that!  Let me know if I missed any...




[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-07-11 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13411556#comment-13411556
 ] 

Han Jiang commented on LUCENE-3892:
---

OK, thanks!




[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-07-09 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13409664#comment-13409664
 ] 

Michael McCandless commented on LUCENE-3892:


bq. Current branch cannot pass tests like this:

Thanks, I committed the patch.




[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-07-02 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13405477#comment-13405477
 ] 

Michael McCandless commented on LUCENE-3892:


Thanks Billy, I committed this to the branch.




[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-06-23 Thread Chris Male (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13399869#comment-13399869
 ] 

Chris Male commented on LUCENE-3892:


It's really interesting the effect of peeling back those abstractions.




[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-06-23 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13399883#comment-13399883
 ] 

Han Jiang commented on LUCENE-3892:
---

Yes, really interesting. And that makes sense: as far as I know, a method 
with exception handling may be considerably slower than a simple if-statement 
check. Here is part of the result from my test, with Mike's patch:
{noformat}
       OrHighMed      2.53         0.31      2.57         0.13  -13% -  21%
        Wildcard      3.86         0.12      3.94         0.38  -10% -  15%
      OrHighHigh      1.57         0.18      1.61         0.08  -12% -  21%
  TermBGroup1M1P      1.93         0.03      2.48         0.10   21% -  35%
     TermGroup1M      1.37         0.02      1.81         0.05   26% -  37%
    TermBGroup1M      1.17         0.02      1.64         0.07   32% -  47%
            Term      2.92         0.13      4.46         0.23   38% -  68%
{noformat}

 Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, 
 Simple9/16/64, etc.)
 -

 Key: LUCENE-3892
 URL: https://issues.apache.org/jira/browse/LUCENE-3892
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
  Labels: gsoc2012, lucene-gsoc-12
 Fix For: 4.1

 Attachments: LUCENE-3892-BlockTermScorer.patch, 
 LUCENE-3892-direct-IntBuffer.patch, LUCENE-3892_for.patch, 
 LUCENE-3892_for_byte[].patch, LUCENE-3892_for_int[].patch, 
 LUCENE-3892_for_unfold_method.patch, LUCENE-3892_pfor.patch, 
 LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, 
 LUCENE-3892_pfor_unfold_method.patch, LUCENE-3892_settings.patch, 
 LUCENE-3892_settings.patch





[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-06-21 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13398800#comment-13398800
 ] 

Han Jiang commented on LUCENE-3892:
---

And the same code with the wikimediumhard.tasks file. (This is a really hard 
test case: the QPS values are so small that we can hardly rely on Pct diff :) )
{noformat}
            Task  QPS Base  StdDev Base   QPS For   StdDev For  Pct diff
      AndHighMed     10.76         0.21      6.47         0.32  -43% - -35%
     AndHighHigh      2.89         0.08      2.57         0.19  -20% -  -1%
        SpanNear      0.60         0.01      0.55         0.01  -11% -  -6%
    SloppyPhrase      0.61         0.01      0.57         0.01   -9% -  -3%
        PKLookup     87.72         2.61     86.28         1.48   -6% -   3%
          Fuzzy1     36.22         1.14     35.90         0.97   -6% -   5%
          Phrase      1.22         0.03      1.22         0.08   -9% -   8%
         Respell     32.84         0.92     33.55         0.87   -3% -   7%
          IntNRQ      3.66         0.35      3.74         0.08   -8% -  15%
          Fuzzy2     21.62         0.66     22.10         0.51   -3% -   7%
         Prefix3     13.30         0.49     14.09         0.76   -3% -  15%
       OrHighMed      3.43         0.16      3.65         0.45  -10% -  25%
      OrHighHigh      1.66         0.09      1.79         0.22  -10% -  28%
        Wildcard      3.39         0.14      3.74         0.20    0% -  21%
  TermBGroup1M1P      1.84         0.03      2.10         0.16    3% -  25%
     TermGroup1M      1.14         0.03      1.34         0.10    5% -  29%
    TermBGroup1M      1.49         0.05      1.78         0.13    7% -  32%
            Term      3.49         0.13      4.38         0.65    2% -  49%
{noformat}

 Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, 
 Simple9/16/64, etc.)
 -

 Key: LUCENE-3892
 URL: https://issues.apache.org/jira/browse/LUCENE-3892
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
  Labels: gsoc2012, lucene-gsoc-12
 Fix For: 4.1

 Attachments: LUCENE-3892-direct-IntBuffer.patch, 
 LUCENE-3892_for.patch, LUCENE-3892_for_byte[].patch, 
 LUCENE-3892_for_int[].patch, LUCENE-3892_for_unfold_method.patch, 
 LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, 
 LUCENE-3892_pfor_unfold_method.patch, LUCENE-3892_settings.patch, 
 LUCENE-3892_settings.patch





[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-06-20 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13397605#comment-13397605
 ] 

Michael McCandless commented on LUCENE-3892:


OK, I created a branch and committed the last For patch: 
https://svn.apache.org/repos/asf/lucene/dev/branches/pforcodec_3892

 Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, 
 Simple9/16/64, etc.)
 -

 Key: LUCENE-3892
 URL: https://issues.apache.org/jira/browse/LUCENE-3892
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
  Labels: gsoc2012, lucene-gsoc-12
 Fix For: 4.1

 Attachments: LUCENE-3892-direct-IntBuffer.patch, 
 LUCENE-3892_for.patch, LUCENE-3892_for_unfold_method.patch, 
 LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, 
 LUCENE-3892_pfor_unfold_method.patch, LUCENE-3892_settings.patch, 
 LUCENE-3892_settings.patch





[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-06-20 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13397694#comment-13397694
 ] 

Han Jiang commented on LUCENE-3892:
---

OK, I just reproduced your test. But Mike, are we using the same task file? Our 
relative speeds for different queries are not the same. 
{quote}
            Task  QPS Base  StdDev Base   QPS For   StdDev For  Pct diff
          Phrase      5.07         0.45      3.76         0.19  -35% - -14% (-44% - -18%)
      AndHighMed     28.32         2.34     22.67         0.67  -28% - -10% (-38% -  -9%)
        SpanNear      2.72         0.13      2.36         0.14  -22% -  -3% (-36% -  -8%)
    SloppyPhrase      4.18         0.20      3.83         0.15  -16% -   0% (-33% -  -6%)
         Respell     42.02         1.83     38.86         2.30  -16% -   2% (-18% -   0%)
          Fuzzy1     44.96         1.58     42.85         1.69  -11% -   2% (-12% -   0%)
          Fuzzy2     16.78         0.69     16.34         0.68  -10% -   5% (-12% -   3%)
        PKLookup     89.11         2.15     87.33         2.19   -6% -   2% ( -2% -   5%)
     AndHighHigh      7.61         0.44      7.69         0.21   -7% -  10% (-21% -  10%)
        Wildcard     19.50         0.91     20.02         0.72   -5% -  11% (-21% -   3%)
    TermBGroup1M     20.82         0.37     21.73         0.69    0% -   9% (  2% -  10%)
     TermGroup1M     13.79         0.13     14.61         0.32    2% -   9% (  1% -   9%)
          IntNRQ      4.11         0.56      4.56         0.56  -14% -  43% (-25% -  33%)
  TermBGroup1M1P     21.45         0.75     24.00         0.51    5% -  18% ( -1% -  22%)
       OrHighMed      5.08         0.49      5.73         0.15    0% -  28% (-16% -  25%)
      OrHighHigh      4.22         0.39      4.78         0.13    1% -  28% (-15% -  24%)
         Prefix3     30.91         1.63     35.65         2.02    3% -  28% (-14% -  21%)
            Term     44.36         1.87     54.01         1.96   12% -  31% ( -1% -  33%)
{quote}

 Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, 
 Simple9/16/64, etc.)
 -

 Key: LUCENE-3892
 URL: https://issues.apache.org/jira/browse/LUCENE-3892
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
  Labels: gsoc2012, lucene-gsoc-12
 Fix For: 4.1

 Attachments: LUCENE-3892-direct-IntBuffer.patch, 
 LUCENE-3892_for.patch, LUCENE-3892_for_unfold_method.patch, 
 LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, 
 LUCENE-3892_pfor_unfold_method.patch, LUCENE-3892_settings.patch, 
 LUCENE-3892_settings.patch





[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-06-20 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13397958#comment-13397958
 ] 

Michael McCandless commented on LUCENE-3892:


bq. But Mike, are we using the same task file? Our relative speeds for different 
queries are not the same.

Sorry, I'm using a hand-edited hard tasks file; I'll commit & push to 
luceneutil.  But, separately: each run picks a different subset of the tasks 
from each category to run, so results from one run to another in general aren't 
comparable unless we fix the random seed it uses.
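
To illustrate why a fixed seed makes runs comparable, here is a minimal sketch. 
It is an illustration only: luceneutil itself is a Python tool, and the 
TaskSampler class and its pick() method are hypothetical names, not part of it. 
Shuffling the task list with a seeded java.util.Random selects the same subset 
on every run.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not luceneutil code: picks a reproducible subset of tasks.
final class TaskSampler {

    /** Returns up to count tasks; the choice depends only on the input list and the seed. */
    static List<String> pick(List<String> tasks, int count, long seed) {
        List<String> copy = new ArrayList<>(tasks);
        // A Random created with the same seed yields the same shuffle every time.
        Collections.shuffle(copy, new Random(seed));
        return new ArrayList<>(copy.subList(0, Math.min(count, copy.size())));
    }

    public static void main(String[] args) {
        List<String> tasks =
            Arrays.asList("Term", "Phrase", "AndHighMed", "OrHighHigh", "Prefix3");
        List<String> firstRun = pick(tasks, 3, 17L);
        List<String> secondRun = pick(tasks, 3, 17L);
        System.out.println(firstRun.equals(secondRun)); // prints "true"
    }
}
```

With an unseeded Random the two runs would generally pick different tasks, 
which is the effect described above.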

 Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, 
 Simple9/16/64, etc.)
 -

 Key: LUCENE-3892
 URL: https://issues.apache.org/jira/browse/LUCENE-3892
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
  Labels: gsoc2012, lucene-gsoc-12
 Fix For: 4.1

 Attachments: LUCENE-3892-direct-IntBuffer.patch, 
 LUCENE-3892_for.patch, LUCENE-3892_for_unfold_method.patch, 
 LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, 
 LUCENE-3892_pfor_unfold_method.patch, LUCENE-3892_settings.patch, 
 LUCENE-3892_settings.patch





[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-06-19 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13396987#comment-13396987
 ] 

Han Jiang commented on LUCENE-3892:
---

Oh, thank you Mike! I hadn't thought much about those skipping policies.

bq. Up above, in ForFactory, when we readInt() to get numBytes ... it seems 
like we could stuff the header numBits into that same int and save checking 
that in FORUtil.decompress
Ah, yes, I just forgot to remove the redundant code. Here is an initial attempt 
to remove the header and call ForDecompressImpl directly in readBlock(), with 
For and blockSize=128. The figures in brackets show the prior benchmark.
{noformat}
            Task  QPS Base  StdDev Base   QPS For   StdDev For  Pct diff
          Phrase      4.99         0.37      3.57         0.26  -38% - -17% (-44% - -18%)
      AndHighMed     28.91         2.17     22.66         0.82  -29% - -12% (-38% -  -9%)
        SpanNear      2.72         0.14      2.22         0.13  -26% -  -8% (-36% -  -8%)
    SloppyPhrase      4.24         0.26      3.70         0.16  -21% -  -3% (-33% -  -6%)
         Respell     40.71         2.59     37.66         1.36  -16% -   2% (-18% -   0%)
          Fuzzy1     43.22         2.01     40.66         0.32  -10% -   0% (-12% -   0%)
          Fuzzy2     16.25         0.90     15.64         0.26  -10% -   3% (-12% -   3%)
        Wildcard     19.07         0.86     19.07         0.73   -8% -   8% (-21% -   3%)
     AndHighHigh      7.76         0.47      7.77         0.15   -7% -   8% (-21% -  10%)
        PKLookup     87.50         4.56     88.51         1.24   -5% -   8% ( -2% -   5%)
    TermBGroup1M     20.42         0.87     21.32         0.74   -3% -  12% (  2% -  10%)
       OrHighMed      5.33         0.68      5.61         0.14   -9% -  23% (-16% -  25%)
      OrHighHigh      4.43         0.53      4.69         0.12   -8% -  23% (-15% -  24%)
     TermGroup1M     13.30         0.34     14.31         0.40    2% -  13% (  0% -  13%)
  TermBGroup1M1P     20.92         0.59     23.71         0.86    6% -  20% ( -1% -  22%)
         Prefix3     30.30         1.41     35.14         1.76    5% -  27% (-14% -  21%)
          IntNRQ      3.90         0.54      4.58         0.47   -7% -  50% (-25% -  33%)
            Term     42.17         1.55     52.33         2.57   13% -  35% (  1% -  33%)
{noformat}
The improvement is quite general. However, I still suspect it mainly comes 
from fewer method calls. I'm trying to change the PFor code to remove those 
nested calls.

bq. Get more direct access to the file as an int[]; ...
Ok, this will be considered when the pfor+pulsing is completed. I'm just 
curious why we don't have readInts in ora.util yet...

bq. Skipping: can we partially decode a block? ...
The pfor-opt approach (encode the lower bits of each exception in the normal 
area and the remaining bits in the exception area) naturally fits partial 
decoding of a block; that will become possible when we optimize skipping queries.
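
As a rough illustration of what partially decoding a block could look like, 
here is a minimal sketch for a plain fixed-width (FOR-style) block. It is not 
the code from the patch and does not model the pfor-opt exception area; the 
PartialDecode class, its bit layout, and its method names are assumptions made 
for this example. Because every value has the same width, the bit position of 
value i is simply i * numBits, so decoding can start at any index.

```java
// Hypothetical sketch: fixed-width packing with a decode that can start mid-block.
final class PartialDecode {

    /** Packs values (each assumed to fit in numBits bits, 1..32) back to back into longs. */
    static long[] pack(int[] values, int numBits) {
        if (numBits < 1 || numBits > 32) throw new IllegalArgumentException("numBits: " + numBits);
        long[] packed = new long[(values.length * numBits + 63) / 64];
        for (int i = 0; i < values.length; i++) {
            long v = values[i] & 0xFFFFFFFFL;      // treat the int as unsigned
            long bitPos = (long) i * numBits;
            int word = (int) (bitPos >>> 6);       // which long holds the first bit
            int shift = (int) (bitPos & 63);       // offset of that bit inside the long
            packed[word] |= v << shift;
            if (shift + numBits > 64) {            // value straddles two longs
                packed[word + 1] |= v >>> (64 - shift);
            }
        }
        return packed;
    }

    /** Decodes only values [from, count), skipping the first 'from' values entirely. */
    static int[] decodeFrom(long[] packed, int numBits, int from, int count) {
        if (numBits < 1 || numBits > 32) throw new IllegalArgumentException("numBits: " + numBits);
        int[] out = new int[count - from];
        long mask = (1L << numBits) - 1;
        for (int i = from; i < count; i++) {
            long bitPos = (long) i * numBits;
            int word = (int) (bitPos >>> 6);
            int shift = (int) (bitPos & 63);
            long v = packed[word] >>> shift;
            if (shift + numBits > 64) {            // pull the remaining high bits from the next long
                v |= packed[word + 1] << (64 - shift);
            }
            out[i - from] = (int) (v & mask);
        }
        return out;
    }
}
```

For a 128-value block where only values after the 80th are wanted, 
decodeFrom(packed, numBits, 80, 128) touches 48 values instead of 128. A 
patched format would additionally have to apply only the exceptions that fall 
in the decoded range.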

 Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, 
 Simple9/16/64, etc.)
 -

 Key: LUCENE-3892
 URL: https://issues.apache.org/jira/browse/LUCENE-3892
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
  Labels: gsoc2012, lucene-gsoc-12
 Fix For: 4.1

 Attachments: LUCENE-3892_for.patch, LUCENE-3892_pfor.patch, 
 LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_settings.patch, 
 LUCENE-3892_settings.patch





[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-06-19 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13397228#comment-13397228
 ] 

Han Jiang commented on LUCENE-3892:
---

And the result for PFor (blocksize=128):
{noformat}
            Task  QPS Base  StdDev Base  QPS PFor  StdDev PFor  Pct diff
          Phrase      4.87         0.36      3.39         0.18  -38% - -20% (-47% - -25%)
      AndHighMed     27.78         2.35     21.13         0.52  -31% - -14% (-37% - -15%)
        SpanNear      2.70         0.14      2.20         0.11  -26% -  -9% (-36% - -13%)
    SloppyPhrase      4.17         0.15      3.77         0.21  -17% -   0% (-30% -  -6%)
         Respell     39.97         1.56     37.65         1.95  -14% -   3% (-15% -   2%)
        Wildcard     19.08         0.77     18.33         0.92  -12% -   5% (-17% -   3%)
          Fuzzy1     42.29         1.13     40.78         1.44   -9% -   2% (-11% -   1%)
     AndHighHigh      7.61         0.55      7.45         0.08   -9% -   6% (-19% -   6%)
          Fuzzy2     15.79         0.55     15.64         0.70   -8% -   7% (-11% -   6%)
        PKLookup     86.71         2.13     88.92         2.24   -2% -   7% ( -2% -   7%)
     TermGroup1M     13.04         0.23     14.03         0.40    2% -  12% (  1% -   9%)
          IntNRQ      3.97         0.48      4.35         0.61  -15% -  41% (-16% -  24%)
  TermBGroup1M1P     21.04         0.35     23.20         0.60    5% -  14% (  0% -  14%)
    TermBGroup1M     19.27         0.47     21.28         0.84    3% -  17% (  1% -  10%)
      OrHighHigh      4.13         0.47      4.63         0.27   -5% -  34% (-14% -  27%)
       OrHighMed      4.95         0.59      5.58         0.34   -5% -  35% (-14% -  27%)
         Prefix3     30.33         1.36     34.26         2.14    1% -  25% ( -6% -  20%)
            Term     41.99         1.19     50.75         1.72   13% -  28% (  2% -  26%)
{noformat}
It works, and it is quite interesting that the StdDev for the Term query is 
reduced significantly.  

 Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, 
 Simple9/16/64, etc.)
 -

 Key: LUCENE-3892
 URL: https://issues.apache.org/jira/browse/LUCENE-3892
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
  Labels: gsoc2012, lucene-gsoc-12
 Fix For: 4.1

 Attachments: LUCENE-3892_for.patch, LUCENE-3892_pfor.patch, 
 LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_settings.patch, 
 LUCENE-3892_settings.patch





[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-06-18 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13395894#comment-13395894
 ] 

Han Jiang commented on LUCENE-3892:
---

There's a potential bottleneck in method calls. Here is an example for 
PFor, with blocksize=128, exception rate = 97%, normal value = 2 bits, 
exception value = 32 bits:

{noformat}
Decoding normal values:                           4703 ns
Patching exceptions:                              5797 ns
Single call of PForUtil.decompress, in total:    58318 ns
{noformat}

In addition, it costs about 4000ns to record the time span.
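
Since recording a time span costs about as much as the work being measured, 
one common workaround is to time many calls inside a single span and divide. 
A minimal sketch, assuming the decode call can be wrapped in a Runnable (the 
CoarseTimer class is hypothetical and is not part of the patch):

```java
// Hypothetical helper: amortizes the cost of System.nanoTime() over many calls.
final class CoarseTimer {

    /**
     * Runs the task 'warmup' times untimed (to give the JIT a chance to compile it),
     * then 'iterations' times inside one timed span, and returns the mean cost per call.
     */
    static double averageNanos(Runnable task, int warmup, int iterations) {
        for (int i = 0; i < warmup; i++) {
            task.run();
        }
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            task.run();
        }
        long elapsed = System.nanoTime() - start;
        // Only two nanoTime() calls are paid for the whole loop, not two per call.
        return (double) elapsed / iterations;
    }
}
```

The per-call figure still includes loop and call overhead, and the JIT may 
optimize a trivial task away, so the numbers are only a rough guide.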

 Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, 
 Simple9/16/64, etc.)
 -

 Key: LUCENE-3892
 URL: https://issues.apache.org/jira/browse/LUCENE-3892
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
  Labels: gsoc2012, lucene-gsoc-12
 Fix For: 4.1

 Attachments: LUCENE-3892_for.patch, LUCENE-3892_pfor.patch, 
 LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_settings.patch, 
 LUCENE-3892_settings.patch





[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-06-18 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13396325#comment-13396325
 ] 

Michael McCandless commented on LUCENE-3892:


On the For patch ... we shouldn't encode/decode numInts right?  It's
always 128?

Up above, in ForFactory, when we readInt() to get numBytes ... it
seems like we could stuff the header numBits into that same int and
save checking that in FORUtil.decompress
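
A minimal sketch of stuffing numBits into the same int as numBytes. The layout 
(low 6 bits for numBits, the rest for numBytes) and the BlockHeader class are 
assumptions for illustration, not what the patch does:

```java
// Hypothetical header layout: [ numBytes : 26 bits | numBits : 6 bits ].
final class BlockHeader {
    private static final int NUM_BITS_WIDTH = 6;                  // enough for 0..32
    private static final int NUM_BITS_MASK = (1 << NUM_BITS_WIDTH) - 1;
    private static final int MAX_NUM_BYTES = Integer.MAX_VALUE >>> NUM_BITS_WIDTH;

    /** Combines both fields into the single int that would be written per block. */
    static int encode(int numBytes, int numBits) {
        if (numBits < 0 || numBits > 32) {
            throw new IllegalArgumentException("numBits out of range: " + numBits);
        }
        if (numBytes < 0 || numBytes > MAX_NUM_BYTES) {
            throw new IllegalArgumentException("numBytes out of range: " + numBytes);
        }
        return (numBytes << NUM_BITS_WIDTH) | numBits;
    }

    /** Extracts the block length in bytes from a header produced by encode(). */
    static int numBytes(int header) {
        return header >>> NUM_BITS_WIDTH;
    }

    /** Extracts the per-value bit width from a header produced by encode(). */
    static int numBits(int header) {
        return header & NUM_BITS_MASK;
    }
}
```

With this layout a reader does one readInt() per block and gets both fields 
from shifts and masks, so the decoder no longer needs to parse a separate 
header word.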

I think there are a few possible ideas to explore to get faster
PFor/For performance:

  * Get more direct access to the file as an int[]; eg MMapDir could
expose an IntBuffer from its ByteBuffer (saving the initial copy
into byte[] that we now do).  Or maybe we add
IndexInput.readInts(int[]) and dir impl can optimize how that's
done (MMapDir could use Unsafe.copyBytes... except for little
endian architectures ... we'd probably have to have separate
specialized decoder rather than letting Int/ByteBuffer do the byte
swapping).  This would require the whole file stays aligned w/ int
(eg the header must be 0 mod 4).

  * Copy/share how oal.packed works, i.e. being able to waste a bit to
have faster decode (eg storing the 7 bit case as byte[], wasting 1
bit for each value).

  * Skipping: can we partially decode a block?  EG if we are skipping
and we know we only want values after the 80th one, then we
shouldn't decode those first 80...

  * Since doc/freq are aligned, when we store pointers to a given
spot, eg in the terms dict or in skip data, we should only store
the offset once (today we store it twice).

  * Alternatively, maybe we should only save skip data on doc/freq
block boundaries (prox would still need skip-within-block).

  * Maybe we should store doc & frq blocks interleaved in a single
file (since they are aligned) and then skip would skip to the
start of a doc/frq block pair.

Other ideas...?
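
For the first idea in the list above (viewing the bytes as ints without an 
intermediate byte[] copy), here is a minimal sketch using only java.nio. The 
IntView class and readInts() are hypothetical names, not an existing Lucene 
API, and the sketch assumes big-endian ints on disk and an int-aligned offset:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch: bulk-read ints from a (possibly memory-mapped) ByteBuffer.
final class IntView {

    /** Reads 'count' big-endian ints starting at byteOffset, leaving 'file' untouched. */
    static int[] readInts(ByteBuffer file, int byteOffset, int count) {
        if ((byteOffset & 3) != 0) {
            // Mirrors the "must stay 0 mod 4" requirement; asIntBuffer() itself does not need it.
            throw new IllegalArgumentException("offset must be int-aligned: " + byteOffset);
        }
        ByteBuffer dup = file.duplicate();   // own position/limit, shared content
        dup.order(ByteOrder.BIG_ENDIAN);     // assumption: ints are stored big-endian
        dup.position(byteOffset);
        int[] out = new int[count];
        dup.asIntBuffer().get(out);          // one bulk transfer instead of count reads
        return out;
    }
}
```

On a little-endian machine the IntBuffer view still swaps bytes for each value, 
which is the cost a specialized decoder would try to avoid.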


 Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, 
 Simple9/16/64, etc.)
 -

 Key: LUCENE-3892
 URL: https://issues.apache.org/jira/browse/LUCENE-3892
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
  Labels: gsoc2012, lucene-gsoc-12
 Fix For: 4.1

 Attachments: LUCENE-3892_for.patch, LUCENE-3892_pfor.patch, 
 LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_settings.patch, 
 LUCENE-3892_settings.patch





[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-06-05 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13289296#comment-13289296
 ] 

Michael McCandless commented on LUCENE-3892:


Hi Billy,

bq. Can I get it from a wiki dump instead?

You can download it at 
http://people.apache.org/~mikemccand/enwiki-20120502-lines-1k.txt.lzma

That's ~6.3 GB (compressed) and 28.7 GB (decompressed); it's the 2012/05/02 
Wikipedia en export, filtered to plain text and then broken into 33.3 M ~1 KB 
sized docs.  I can help you get the luceneutil env set up...

{quote}
bq. Indexing time is ~18% slower than Lucene40PostingsFormat (1071 sec vs 1261 
sec).

Yes, it is expected, actually it scans every block 33 times to estimate 
metadata such as numFrameBits and numExceptions.
{quote}

OK, in that case I'm surprised it's only ~18% slower!

 Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, 
 Simple9/16/64, etc.)
 -

 Key: LUCENE-3892
 URL: https://issues.apache.org/jira/browse/LUCENE-3892
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
  Labels: gsoc2012, lucene-gsoc-12
 Fix For: 4.1

 Attachments: LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, 
 LUCENE-3892_settings.patch, LUCENE-3892_settings.patch





[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-06-04 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13288675#comment-13288675
 ] 

Michael McCandless commented on LUCENE-3892:


Excellent!  All tests pass for me w/ the PFor postings format as
well... this is a great starting point :) One Solr test failed
(ContentStreamTest)... but I think it was a false failure...

I did notice the tests seem to run slower, especially certain ones eg
TestJoinUtil.

Still missing a couple license headers (TestMin, TestCompress)...

I ran a quick perf test using
http://code.google.com/a/apache-extras.org/p/luceneutil on a 10M doc
Wikipedia index.

Indexing time is ~18% slower than Lucene40PostingsFormat (1071 sec vs
1261 sec).

But more important is the slower search times:

{noformat}
            Task  QPS base  StdDev base  QPS pfor  StdDev pfor  Pct diff
          Phrase      8.52         0.50      4.43         0.40  -55% - -39%
    SloppyPhrase     12.52         0.39      7.87         0.51  -43% - -30%
      AndHighMed     67.69         2.82     44.22         1.47  -39% - -29%
        SpanNear      5.19         0.12      3.90         0.28  -31% - -17%
        PKLookup    112.16         1.71     95.61         1.30  -17% - -12%
     AndHighHigh     13.22         0.34     11.86         0.72  -17% -  -2%
        Wildcard     46.04         0.37     41.68         4.45  -19% -   1%
          Fuzzy1     50.11         2.03     48.06         1.91  -11% -   3%
       OrHighMed      9.26         0.48      8.90         0.37  -12% -   5%
      OrHighHigh     12.28         0.56     11.83         0.49  -11% -   5%
  TermBGroup1M1P     40.47         1.94     39.88         2.51  -11% -  10%
          Fuzzy2     53.71         2.66     53.01         2.08   -9% -   7%
     TermGroup1M     36.46         1.21     35.99         1.58   -8% -   6%
    TermBGroup1M     55.53         1.99     55.26         2.68   -8% -   8%
         Respell     69.71         4.49     69.73         2.07   -8% -  10%
            Term     94.38         7.62     94.96        12.19  -18% -  23%
         Prefix3     41.63         0.34     42.21         5.82  -13% -  16%
          IntNRQ      7.08         0.15      7.28         1.29  -17% -  23%
{noformat}

The queries that do skipping are quite a bit slower; this makes sense,
since on skip we do a full block decode.  A smaller block size (we use
128 now right?) should help I think.

It's strange that the non-skipping queries (Term, OrHighMed,
OrHighHigh) don't show any performance gain ... maybe we need to
optimize the decode... or it could be the removal of the bulk api
is hurting us here.

I'm also curious if we tried a pure FOR (no patching, so we must set
numBits according to the max value = larger index but hopefully faster
decode) if the results would improve...
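
To make the size trade-off of a pure FOR concrete, here is a minimal sketch of 
choosing numBits from the largest value in a block. The PureFor class and its 
method names are hypothetical, and values are assumed to be non-negative:

```java
import java.util.Arrays;

// Hypothetical sketch: pure FOR picks one bit width that fits every value in the block.
final class PureFor {

    /** Bits needed for the largest value in the block (0 if all values are 0). */
    static int bitsRequired(int[] block) {
        int or = 0;
        for (int v : block) {
            or |= v; // the OR of all values has the same highest set bit as the maximum
        }
        return 32 - Integer.numberOfLeadingZeros(or);
    }

    /** Size in bytes of valueCount values stored at numBits bits each, rounded up. */
    static int packedBytes(int valueCount, int numBits) {
        return (valueCount * numBits + 7) / 8;
    }

    public static void main(String[] args) {
        int[] block = new int[128];
        Arrays.fill(block, 3);   // small values that need 2 bits ...
        block[57] = 1000;        // ... and a single outlier that needs 10 bits
        int numBits = bitsRequired(block);
        // prints "10 bits/value, 160 bytes"
        System.out.println(numBits + " bits/value, " + packedBytes(block.length, numBits) + " bytes");
    }
}
```

Without the outlier the same block would need 2 bits per value, i.e. 32 bytes. 
That gap is what patching saves, at the price of the extra decode step.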



 Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, 
 Simple9/16/64, etc.)
 -

 Key: LUCENE-3892
 URL: https://issues.apache.org/jira/browse/LUCENE-3892
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
  Labels: gsoc2012, lucene-gsoc-12
 Fix For: 4.1

 Attachments: LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, 
 LUCENE-3892_settings.patch, LUCENE-3892_settings.patch





[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-06-04 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13289104#comment-13289104
 ] 

Han Jiang commented on LUCENE-3892:
---

Thanks Mike, that's a lot of detail to help us optimize!

bq. Still missing a couple license headers (TestMin, TestCompress)...
OK, I'll add them later.

bq. I ran a quick perf test using 
http://code.google.com/a/apache-extras.org/p/luceneutil on a 10M doc Wikipedia 
index.
The script is wonderful! But the wiki data is missing. Can I get it from a wiki 
dump instead?

bq. Indexing time is ~18% slower than Lucene40PostingsFormat (1071 sec vs 1261 
sec).
Yes, that is expected: it actually scans every block 33 times to estimate 
metadata such as numFrameBits and numExceptions.
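
For comparison, the same metadata can be estimated with a single scan of the 
block plus a 33-entry histogram. This is a sketch of an alternative, not the 
code in the patch; the FrameBitsEstimator class and its cost model (a fixed 
number of extra bits per exception) are assumptions for illustration:

```java
// Hypothetical sketch: pick numFrameBits from a histogram of bit lengths (one scan).
final class FrameBitsEstimator {

    /**
     * Returns {numFrameBits, numExceptions} minimizing
     * blockLength * numFrameBits + numExceptions * exceptionCostBits.
     */
    static int[] estimate(int[] block, int exceptionCostBits) {
        // histogram[b] = number of values whose minimal width is exactly b bits (0..32)
        int[] histogram = new int[33];
        for (int v : block) {
            histogram[32 - Integer.numberOfLeadingZeros(v)]++;
        }
        int bestBits = 32;
        int bestExceptions = 0;
        long bestCost = Long.MAX_VALUE;
        int exceptions = block.length;       // values that do not fit in 'bits' bits
        for (int bits = 0; bits <= 32; bits++) {
            exceptions -= histogram[bits];   // values of width 'bits' now fit
            long cost = (long) block.length * bits + (long) exceptions * exceptionCostBits;
            if (cost < bestCost) {           // strict '<' keeps the smallest width on ties
                bestCost = cost;
                bestBits = bits;
                bestExceptions = exceptions;
            }
        }
        return new int[] {bestBits, bestExceptions};
    }
}
```

The 33 candidate widths are still evaluated, but each evaluation reads the 
histogram instead of rescanning the 128 values.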

 Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, 
 Simple9/16/64, etc.)
 -

 Key: LUCENE-3892
 URL: https://issues.apache.org/jira/browse/LUCENE-3892
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
  Labels: gsoc2012, lucene-gsoc-12
 Fix For: 4.1

 Attachments: LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, 
 LUCENE-3892_settings.patch, LUCENE-3892_settings.patch


 On the flex branch we explored a number of possible intblock
 encodings, but for whatever reason never brought them to completion.
 There are still a number of issues opened with patches in different
 states.
 Initial results (based on prototype) were excellent (see
 http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html
 ).
 I think this would make a good GSoC project.




[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-06-02 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13287936#comment-13287936
 ] 

Michael McCandless commented on LUCENE-3892:


Awesome progress!  Nice to have a dirt path online that we can then
iterate from ...

Hmm, I'm seeing some test failures when I run:
{noformat}
ant test -Dtests.postingsformat=PFor
{noformat}
Eg, TestNRTThreads, TestShardSearching, TestTimeLimitingCollector.

Remember to add the standard copyright headers to each new source
file...

We don't have to do this now, but I wonder if we can share code w/ the
packed ints impl we have, instead of generating another one with the .py
source.

TestDemo makes a nice TestMin... I usually start with TestDemo when
testing scary new code, and then it's a huge milestone once TestDemo
passes :)

We should definitely cutover to BlockTree terms dict (I would upgrade
that TODO to a nocommit!).

I suspect that wrapping the blocks byte[] as ByteBuffer and then
IntBuffer is going to be too costly per decode so we should init them
once and re-use (upgrade that TODO to a nocommit).
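
A minimal sketch of the "init once and re-use" idea, assuming a fixed block size: the byte[] is wrapped as a ByteBuffer and viewed as an IntBuffer a single time in the constructor, and each decode only rewinds the existing view. The class name ReusableBlockDecoder and its methods are hypothetical and are not taken from the patch.

```java
import java.nio.ByteBuffer;
import java.nio.IntBuffer;

/** Hypothetical decoder that creates its buffer views once and reuses them for every block. */
final class ReusableBlockDecoder {
  private final byte[] bytes;   // raw block bytes, refilled by the caller before each decode
  private final IntBuffer ints; // int view over the same bytes, created only once

  ReusableBlockDecoder(int blockSizeInInts) {
    this.bytes = new byte[blockSizeInInts * 4];
    // The view shares storage with the array, so later writes to bytes are visible through it.
    this.ints = ByteBuffer.wrap(bytes).asIntBuffer();
  }

  /** The backing array; a caller would read the next block from the index into it. */
  byte[] bytes() {
    return bytes;
  }

  /** Copies the current block into dest (big-endian ints) without allocating new buffers. */
  void decode(int[] dest) {
    ints.rewind();
    ints.get(dest, 0, ints.capacity());
  }
}
```

The per-decode work is then a position reset plus a bulk copy, with no ByteBuffer or IntBuffer objects allocated per block.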


 Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, 
 Simple9/16/64, etc.)
 -

 Key: LUCENE-3892
 URL: https://issues.apache.org/jira/browse/LUCENE-3892
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
  Labels: gsoc2012, lucene-gsoc-12
 Fix For: 4.1

 Attachments: LUCENE-3892_pfor.patch, LUCENE-3892_settings.patch, 
 LUCENE-3892_settings.patch


 On the flex branch we explored a number of possible intblock
 encodings, but for whatever reason never brought them to completion.
 There are still a number of issues opened with patches in different
 states.
 Initial results (based on prototype) were excellent (see
 http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html
 ).
 I think this would make a good GSoC project.




[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-06-02 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13287951#comment-13287951
 ] 

Han Jiang commented on LUCENE-3892:
---

Ah, yes, I forgot to use -Dtests.postingsformat...I can see the errors
now.

{quote}
TestDemo makes a nice TestMin... I usually start with TestDemo when
testing scary new code, and then it's a huge milestone once TestDemo
passes 
{quote}
Hmm, that means I should remove TestMin.java? This testcase works fine
for the patch.

{quote}
We should definitely cutover to BlockTree terms dict (I would upgrade
that TODO to a nocommit!).
{quote}
I'm not familiar with these markers. Should I change every 
TODO into a nocommit? Are the markers related to documentation, 
or are they just reminders not to commit the current code?

 Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, 
 Simple9/16/64, etc.)
 -

 Key: LUCENE-3892
 URL: https://issues.apache.org/jira/browse/LUCENE-3892
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
  Labels: gsoc2012, lucene-gsoc-12
 Fix For: 4.1

 Attachments: LUCENE-3892_pfor.patch, LUCENE-3892_settings.patch, 
 LUCENE-3892_settings.patch


 On the flex branch we explored a number of possible intblock
 encodings, but for whatever reason never brought them to completion.
 There are still a number of issues opened with patches in different
 states.
 Initial results (based on prototype) were excellent (see
 http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html
 ).
 I think this would make a good GSoC project.




[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-06-02 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13287952#comment-13287952
 ] 

Michael McCandless commented on LUCENE-3892:


bq. Hmm, that means I should remove TestMin.java? This testcase works fine for 
the patch.

Oh it's fine to keep TestMin now that you wrote it ... I was just saying that 
TestDemo is the test I run when I want the most trivial test for a new big 
change.

{quote}
I'm not familiar with these markers. Should I change every 
TODO into a nocommit? Are the markers related to documentation, 
or are they just reminders not to commit the current code?
{quote}

Sorry - this is just a convention I use: I put a // nocommit comment whenever 
there's a blocker to committing; this way I can grep for nocommit to see what 
still needs fixing... and towards the end, nocommits will often be downgraded 
to TODOs since on closer inspection they really don't have to block 
committing...
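
As a small illustration of this convention, the sketch below lists the line numbers that still carry a nocommit marker in a piece of source text, which is roughly what grepping for nocommit gives you. The class NocommitScanner is made up for this example and is not part of Lucene's build.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: reports which lines of a source text still contain a nocommit marker. */
final class NocommitScanner {

  /** Returns the 1-based line numbers of all lines containing "nocommit". */
  static List<Integer> findNocommits(String source) {
    List<Integer> lines = new ArrayList<>();
    String[] split = source.split("\n", -1);
    for (int i = 0; i < split.length; i++) {
      if (split[i].contains("nocommit")) {
        lines.add(i + 1);
      }
    }
    return lines;
  }
}
```

Lines marked only with TODO are not reported, which matches the idea that a TODO does not block committing.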

 Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, 
 Simple9/16/64, etc.)
 -

 Key: LUCENE-3892
 URL: https://issues.apache.org/jira/browse/LUCENE-3892
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
  Labels: gsoc2012, lucene-gsoc-12
 Fix For: 4.1

 Attachments: LUCENE-3892_pfor.patch, LUCENE-3892_settings.patch, 
 LUCENE-3892_settings.patch


 On the flex branch we explored a number of possible intblock
 encodings, but for whatever reason never brought them to completion.
 There are still a number of issues opened with patches in different
 states.
 Initial results (based on prototype) were excellent (see
 http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html
 ).
 I think this would make a good GSoC project.




[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-05-01 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13265950#comment-13265950
 ] 

Han Jiang commented on LUCENE-3892:
---

A postings format named VSEncoding also seems promising! 

It is available here: http://integerencoding.isti.cnr.it/

And license compatible: 
https://github.com/maropu/integer_encoding_library/blob/master/LICENSE

 Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, 
 Simple9/16/64, etc.)
 -

 Key: LUCENE-3892
 URL: https://issues.apache.org/jira/browse/LUCENE-3892
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
  Labels: gsoc2012, lucene-gsoc-12
 Fix For: 4.0

 Attachments: LUCENE-3892_settings.patch, LUCENE-3892_settings.patch


 On the flex branch we explored a number of possible intblock
 encodings, but for whatever reason never brought them to completion.
 There are still a number of issues opened with patches in different
 states.
 Initial results (based on prototype) were excellent (see
 http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html
 ).
 I think this would make a good GSoC project.




[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-04-26 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13262694#comment-13262694
 ] 

Han Jiang commented on LUCENE-3892:
---

Strangely, I sometimes cannot access repo1.maven.org, so ant ivy-bootstrap 
and ant resolve fail. (Since I'm in China, the network connection might be 
limited.)

Mike and I first tried to make things work by pointing 
lucene/common-build.xml and dev-tools/scripts/poll-mirrors.pl at another 
Maven mirror listed in 
http://docs.codehaus.org/display/MAVENUSER/Mirrors+Repositories. Unfortunately, 
the main site repo1.maven.org is configured inside ivy-2.2.0.jar, so even when 
ant ivy-bootstrap passes, ant resolve still fails.

Here is how I got things to work (an ugly hack; better suggestions are welcome!):

Edit /etc/hosts to redirect the current Maven site to a mirror with the same 
directory structure, for example: 

194.8.197.22    repo1.maven.org    # to http://mirror.netcologne.de/

 Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, 
 Simple9/16/64, etc.)
 -

 Key: LUCENE-3892
 URL: https://issues.apache.org/jira/browse/LUCENE-3892
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
  Labels: gsoc2012, lucene-gsoc-12
 Fix For: 4.0


 On the flex branch we explored a number of possible intblock
 encodings, but for whatever reason never brought them to completion.
 There are still a number of issues opened with patches in different
 states.
 Initial results (based on prototype) were excellent (see
 http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html
 ).
 I think this would make a good GSoC project.




[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-04-26 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13262701#comment-13262701
 ] 

Michael McCandless commented on LUCENE-3892:


Phew, I'm glad to hear you got it working!  So ant resolve finished 
successfully?

 Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, 
 Simple9/16/64, etc.)
 -

 Key: LUCENE-3892
 URL: https://issues.apache.org/jira/browse/LUCENE-3892
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
  Labels: gsoc2012, lucene-gsoc-12
 Fix For: 4.0


 On the flex branch we explored a number of possible intblock
 encodings, but for whatever reason never brought them to completion.
 There are still a number of issues opened with patches in different
 states.
 Initial results (based on prototype) were excellent (see
 http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html
 ).
 I think this would make a good GSoC project.




[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-04-26 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13262707#comment-13262707
 ] 

Han Jiang commented on LUCENE-3892:
---

Yes, and ant test is running now. Maybe we can configure something to avoid 
the ugly hack?

 Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, 
 Simple9/16/64, etc.)
 -

 Key: LUCENE-3892
 URL: https://issues.apache.org/jira/browse/LUCENE-3892
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
  Labels: gsoc2012, lucene-gsoc-12
 Fix For: 4.0


 On the flex branch we explored a number of possible intblock
 encodings, but for whatever reason never brought them to completion.
 There are still a number of issues opened with patches in different
 states.
 Initial results (based on prototype) were excellent (see
 http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html
 ).
 I think this would make a good GSoC project.




[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-04-26 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13262711#comment-13262711
 ] 

Robert Muir commented on LUCENE-3892:
-

Maybe a good solution is to have an ant property (that we somehow pass to 
ivy), and to conditionally set it in ant by default to a server that we know 
works in China, if ${user.language}=zh?

 Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, 
 Simple9/16/64, etc.)
 -

 Key: LUCENE-3892
 URL: https://issues.apache.org/jira/browse/LUCENE-3892
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
  Labels: gsoc2012, lucene-gsoc-12
 Fix For: 4.0


 On the flex branch we explored a number of possible intblock
 encodings, but for whatever reason never brought them to completion.
 There are still a number of issues opened with patches in different
 states.
 Initial results (based on prototype) were excellent (see
 http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html
 ).
 I think this would make a good GSoC project.




[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-04-26 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13262715#comment-13262715
 ] 

Han Jiang commented on LUCENE-3892:
---

Thank you, Robert! But currently the Maven mirror in 
China (http://mirrors.redv.com/maven2) is not available. Also, can we pass a 
property to ivy to replace the repo1* stuff?

 Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, 
 Simple9/16/64, etc.)
 -

 Key: LUCENE-3892
 URL: https://issues.apache.org/jira/browse/LUCENE-3892
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
  Labels: gsoc2012, lucene-gsoc-12
 Fix For: 4.0


 On the flex branch we explored a number of possible intblock
 encodings, but for whatever reason never brought them to completion.
 There are still a number of issues opened with patches in different
 states.
 Initial results (based on prototype) were excellent (see
 http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html
 ).
 I think this would make a good GSoC project.




[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-04-26 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13262777#comment-13262777
 ] 

Robert Muir commented on LUCENE-3892:
-

Patch does not yet fix ivy-bootstrap. Ivy-bootstrap still only tries 
repo1.maven.org. We need a different strategy for that: either we depend on 
try-catch from ant contrib (undesired), use custom ant task (g), or use a 
chain of targets with fail-on-error=false unless the file already exists and 
checksum at the end... Lemme see if i can fix ivy-bootstrap, too!

 Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, 
 Simple9/16/64, etc.)
 -

 Key: LUCENE-3892
 URL: https://issues.apache.org/jira/browse/LUCENE-3892
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
  Labels: gsoc2012, lucene-gsoc-12
 Fix For: 4.0

 Attachments: LUCENE-3892_settings.patch


 On the flex branch we explored a number of possible intblock
 encodings, but for whatever reason never brought them to completion.
 There are still a number of issues opened with patches in different
 states.
 Initial results (based on prototype) were excellent (see
 http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html
 ).
 I think this would make a good GSoC project.




[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-04-26 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13262835#comment-13262835
 ] 

Robert Muir commented on LUCENE-3892:
-

I will commit this patch: please let us know if you have more problems from 
china! :)

 Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, 
 Simple9/16/64, etc.)
 -

 Key: LUCENE-3892
 URL: https://issues.apache.org/jira/browse/LUCENE-3892
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
  Labels: gsoc2012, lucene-gsoc-12
 Fix For: 4.0

 Attachments: LUCENE-3892_settings.patch, LUCENE-3892_settings.patch


 On the flex branch we explored a number of possible intblock
 encodings, but for whatever reason never brought them to completion.
 There are still a number of issues opened with patches in different
 states.
 Initial results (based on prototype) were excellent (see
 http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html
 ).
 I think this would make a good GSoC project.




[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-04-24 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13260422#comment-13260422
 ] 

Michael McCandless commented on LUCENE-3892:


Hi Billy, I'm very excited your proposal is accepted!  Congrats :)  Now the fun 
work begins...

 Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, 
 Simple9/16/64, etc.)
 -

 Key: LUCENE-3892
 URL: https://issues.apache.org/jira/browse/LUCENE-3892
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
  Labels: gsoc2012, lucene-gsoc-12
 Fix For: 4.0


 On the flex branch we explored a number of possible intblock
 encodings, but for whatever reason never brought them to completion.
 There are still a number of issues opened with patches in different
 states.
 Initial results (based on prototype) were excellent (see
 http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html
 ).
 I think this would make a good GSoC project.




[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-04-23 Thread Han Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13260149#comment-13260149
 ] 

Han Jiang commented on LUCENE-3892:
---

Thank you all for giving me this opportunity! Let's begin!

 Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, 
 Simple9/16/64, etc.)
 -

 Key: LUCENE-3892
 URL: https://issues.apache.org/jira/browse/LUCENE-3892
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
  Labels: gsoc2012, lucene-gsoc-12
 Fix For: 4.0


 On the flex branch we explored a number of possible intblock
 encodings, but for whatever reason never brought them to completion.
 There are still a number of issues opened with patches in different
 states.
 Initial results (based on prototype) were excellent (see
 http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html
 ).
 I think this would make a good GSoC project.




[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-04-05 Thread Han Jiang (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13247175#comment-13247175
 ] 

Han Jiang commented on LUCENE-3892:
---

{quote}
* There are actually more than 2 codecs (eg we also have Lucene3x,
SimpleText, sep/intblock (abstract), random codecs/postings
formats for testing...), but our default codec now is Lucene40.
{quote}

Yes, but it seems that our baseline will be Lucene40 and Pulsing? Lucene3x is 
read-only, and the other approaches are not meant for production use.
And what is the random codec? Does it randomly pick a codec for the user?

{quote}
* I think you can use the existing abstract sep/intblock classes
(ie, they implement layers like FieldsProducer/Consumer...), and
then you can just implement the required methods (eg to
encode/decode one int[] block).
{quote}

And this was my initial thought about the PForDelta interface:

The class hierarchy will be as below (quite similar to pulsing):
* PForDeltaPostingsFormat (extends PostingsFormat): 
It will define global behaviors such as the file suffix, and provide a 
customized FieldsWriter/Reader.
* PForDeltaFieldsWriter (extends FieldsConsumer): 
It will define how terms, docids, freqs and offsets are written into the 
postings files.
Inner classes include: 
** PForDeltaTermsConsumer (extends TermsConsumer)
** PForDeltaPostingsConsumer (extends PostingsConsumer)
* PForDeltaFieldsReader (extends FieldsProducer):
It will define how postings are read from the index, and provide *Enum 
classes to iterate over docids, freqs etc.
Inner classes include:
** PForDeltaFieldsEnum (extends FieldsEnum)
** PForDeltaTermsEnum (extends TermsEnum)
** PForDeltaDocsEnum (extends DocsEnum)
** PForDeltaDocsAndPositionsEnum (extends DocsAndPositionsEnum)
** PForDeltaTerms (extends Terms)

It seems that BlockTermsReader/Writer have already implemented those 
subclasses, and we can just pass our Postings(Writer/Reader)Base as an 
argument, like PatchedFrameOfRefCodec::fieldsConsumer() does.
Then, to introduce PForDeltaCodec into trunk, should we also introduce the 
fixed codec? Also, why isn't Lucene40Codec implemented this way? 

{quote}
* We may need to tune the skipper settings, based on profiling
results from skip-intensive (Phrase, And) queries... since it's
currently geared towards single-doc-at-once encoding. I don't think
we should try to make a new skipper impl here... (there is a separate
issue for that).
{quote}

I haven't looked much into the different kinds of queries yet. What are the 
skipper settings? 

{quote}
* Maybe explore the combination of pulsing and PForDelta codecs;
seems like the combination of those two could be important, since
for low docFreq terms, retrieving the docs is now more
expensive...
{quote}

Yes, it seems that if PForDelta outperforms the current approaches, a Pulsing 
version should work even better? This feature will also come in phase 2.


 Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, 
 Simple9/16/64, etc.)
 -

 Key: LUCENE-3892
 URL: https://issues.apache.org/jira/browse/LUCENE-3892
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
  Labels: gsoc2012, lucene-gsoc-12
 Fix For: 4.0


 On the flex branch we explored a number of possible intblock
 encodings, but for whatever reason never brought them to completion.
 There are still a number of issues opened with patches in different
 states.
 Initial results (based on prototype) were excellent (see
 http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html
 ).
 I think this would make a good GSoC project.




[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-04-03 Thread Michael McCandless (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13245374#comment-13245374
 ] 

Michael McCandless commented on LUCENE-3892:


The proposal at
http://www.google-melange.com/gsoc/proposal/review/google/gsoc2012/billybob/1
looks great!  Some initial feedback:

  * There are actually more than 2 codecs (eg we also have Lucene3x,
SimpleText, sep/intblock (abstract), random codecs/postings
formats for testing...), but our default codec now is Lucene40.

  * I think you can use the existing abstract sep/intblock classes
(ie, they implement layers like FieldsProducer/Consumer...), and
then you can just implement the required methods (eg to
encode/decode one int[] block).

  * We may need to tune the skipper settings, based on profiling
results from skip-intensive (Phrase, And) queries... since it's
currently geared towards single-doc-at-once encoding.  I don't think
we should try to make a new skipper impl here... (there is a separate
issue for that).

  * Maybe explore the combination of pulsing and PForDelta codecs;
seems like the combination of those two could be important, since
for low docFreq terms, retrieving the docs is now more
expensive...
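
To make "encode/decode one int[] block" concrete, here is a minimal frame-of-reference style sketch that packs every value of a block with the same number of bits and unpacks it again. It does not use the sep/intblock classes or Lucene's packed ints; the class SimpleForBlock and its method names are assumptions for illustration, and it assumes every value fits in the chosen width (no exceptions, so this is plain FOR rather than PFOR).

```java
/** Hypothetical fixed-width block packer; a sketch of FOR-style encoding, not Lucene code. */
final class SimpleForBlock {

  /** Bits needed to represent the largest value in the block (values treated as unsigned). */
  static int bitsRequired(int[] block) {
    int or = 0;
    for (int v : block) {
      or |= v;
    }
    return 32 - Integer.numberOfLeadingZeros(or);
  }

  /** Packs each value into exactly bits bits; every value must fit in that width. */
  static long[] encode(int[] block, int bits) {
    long[] packed = new long[(block.length * bits + 63) / 64];
    if (bits == 0) {
      return packed; // all values are zero, nothing to store
    }
    long bitPos = 0;
    for (int v : block) {
      int word = (int) (bitPos >>> 6);
      int shift = (int) (bitPos & 63);
      long value = v & 0xFFFFFFFFL;
      packed[word] |= value << shift;
      if (shift + bits > 64) {
        // The value straddles two longs; store the high part in the next word.
        packed[word + 1] |= value >>> (64 - shift);
      }
      bitPos += bits;
    }
    return packed;
  }

  /** Unpacks dest.length values of the given width from packed into dest. */
  static void decode(long[] packed, int bits, int[] dest) {
    if (bits == 0) {
      java.util.Arrays.fill(dest, 0);
      return;
    }
    long mask = (1L << bits) - 1;
    long bitPos = 0;
    for (int i = 0; i < dest.length; i++) {
      int word = (int) (bitPos >>> 6);
      int shift = (int) (bitPos & 63);
      long value = packed[word] >>> shift;
      if (shift + bits > 64) {
        value |= packed[word + 1] << (64 - shift);
      }
      dest[i] = (int) (value & mask);
      bitPos += bits;
    }
  }
}
```

A postings format built on the abstract intblock layers would call something like this once per block, storing the width alongside the packed longs so the reader knows how to unpack.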


 Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, 
 Simple9/16/64, etc.)
 -

 Key: LUCENE-3892
 URL: https://issues.apache.org/jira/browse/LUCENE-3892
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
  Labels: gsoc2012, lucene-gsoc-12
 Fix For: 4.0


 On the flex branch we explored a number of possible intblock
 encodings, but for whatever reason never brought them to completion.
 There are still a number of issues opened with patches in different
 states.
 Initial results (based on prototype) were excellent (see
 http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html
 ).
 I think this would make a good GSoC project.




[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-03-28 Thread Han Jiang (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13240403#comment-13240403
 ] 

Han Jiang commented on LUCENE-3892:
---

Hi, I have submitted my proposal. Comments are welcome!




[jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)

2012-03-28 Thread Michael McCandless (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13240527#comment-13240527
 ] 

Michael McCandless commented on LUCENE-3892:


That's great Han, I'll have a look.

I can be a mentor for this...
