[jira] [Updated] (LUCENE-4062) More fine-grained control over the packed integer implementation that is chosen

2012-07-02 Thread Toke Eskildsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Toke Eskildsen updated LUCENE-4062:
---

Attachment: PackedZero.java
PackedIntsBenchmark.java

Attached an improved benchmark that allows for user-specified value counts, 
bpv's and implementations. Also attached is PackedZero, a dummy Mutable used 
for measuring the maximum possible throughput.
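
For illustration, a dummy Mutable along these lines might look like the 
following minimal sketch (names and shape are assumptions; the real 
attachment plugs into PackedInts.Mutable):

{code}
/**
 * Minimal stand-in for a PackedZero-style dummy: a "mutable" that stores
 * nothing, so benchmarking it measures only loop and call overhead,
 * i.e. the maximum possible throughput.
 */
public class PackedZeroSketch {
  private final int valueCount;

  public PackedZeroSketch(int valueCount) {
    this.valueCount = valueCount;
  }

  public int size() {
    return valueCount;
  }

  public long get(int index) {
    return 0L; // no memory access at all
  }

  public void set(int index, long value) {
    // intentionally a no-op
  }
}
{code}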

 More fine-grained control over the packed integer implementation that is 
 chosen
 ---

 Key: LUCENE-4062
 URL: https://issues.apache.org/jira/browse/LUCENE-4062
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/other
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Minor
  Labels: performance
 Fix For: 4.0, 5.0

 Attachments: LUCENE-4062-2.patch, LUCENE-4062.patch, 
 LUCENE-4062.patch, LUCENE-4062.patch, LUCENE-4062.patch, LUCENE-4062.patch, 
 LUCENE-4062.patch, LUCENE-4062.patch, Packed64SingleBlock.java, 
 Packed64Strategy.java, Packed64calc.java, PackedIntsBenchmark.java, 
 PackedIntsBenchmark.java, PackedIntsBenchmark.java, PackedIntsBenchmark.java, 
 PackedZero.java, measurements_te_graphs.pdf, measurements_te_i7.txt, 
 measurements_te_p4.txt, measurements_te_xeon.txt


 In order to save space, Lucene has two main PackedInts.Mutable 
 implementations, one that is very fast and is based on a 
 byte/short/integer/long array (Direct*) and another one which packs bits in 
 a memory-efficient manner (Packed*).
 The packed implementation tends to be much slower than the direct one, which 
 discourages some Lucene components from using it. On the other hand, if you 
 store 21-bit integers in a Direct32, this is a space loss of (32-21)/32=35%.
 If you are willing to trade some space for speed, you can store 3 of these 
 21-bit integers in a long, resulting in an overhead of 1/3 bit per value. 
 One advantage of this approach is that you never need to read more than one 
 block to read or write a value, so this can be significantly faster than 
 Packed32 and Packed64, which always read/write two blocks in order to avoid 
 costly branches.
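
To make the single-block layout concrete, here is a minimal sketch for 
bitsPerValue = 21 (illustrative only, not the actual patch; the names are 
made up):

{code}
// Single-block packing for bitsPerValue = 21: 3 values per long,
// 1 bit of padding per block; get and set touch exactly one block.
public class SingleBlock21Sketch {
  private static final int BPV = 21;
  private static final int VALUES_PER_BLOCK = 64 / BPV; // 3
  private static final long MASK = (1L << BPV) - 1;

  private final long[] blocks;

  public SingleBlock21Sketch(int valueCount) {
    blocks = new long[(valueCount + VALUES_PER_BLOCK - 1) / VALUES_PER_BLOCK];
  }

  public long get(int index) {
    final int block = index / VALUES_PER_BLOCK;         // which long
    final int shift = (index % VALUES_PER_BLOCK) * BPV; // offset inside it
    return (blocks[block] >>> shift) & MASK;            // single read
  }

  public void set(int index, long value) {
    final int block = index / VALUES_PER_BLOCK;
    final int shift = (index % VALUES_PER_BLOCK) * BPV;
    blocks[block] = (blocks[block] & ~(MASK << shift))
        | ((value & MASK) << shift);
  }
}
{code}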
 I ran some tests, and for 1000 21-bit values, this implementation takes less 
 than 2% more space and has 44% faster writes and 30% faster reads. The 
 12-bit version (5 values per block) has the same performance improvement and 
 a 6% memory overhead compared to the packed implementation.
 In order to select the best implementation for a given integer size, I wrote 
 the {{PackedInts.getMutable(valueCount, bitsPerValue, 
 acceptableOverheadPerValue)}} method. This method selects the fastest 
 implementation that has fewer than {{acceptableOverheadPerValue}} wasted 
 bits per value. For example, if you accept an overhead of 20% 
 ({{acceptableOverheadPerValue = 0.2f * bitsPerValue}}), which is pretty 
 reasonable, here is what implementations would be selected:
  * 1: Packed64SingleBlock1
  * 2: Packed64SingleBlock2
  * 3: Packed64SingleBlock3
  * 4: Packed64SingleBlock4
  * 5: Packed64SingleBlock5
  * 6: Packed64SingleBlock6
  * 7: Direct8
  * 8: Direct8
  * 9: Packed64SingleBlock9
  * 10: Packed64SingleBlock10
  * 11: Packed64SingleBlock12
  * 12: Packed64SingleBlock12
  * 13: Packed64
  * 14: Direct16
  * 15: Direct16
  * 16: Direct16
  * 17: Packed64
  * 18: Packed64SingleBlock21
  * 19: Packed64SingleBlock21
  * 20: Packed64SingleBlock21
  * 21: Packed64SingleBlock21
  * 22: Packed64
  * 23: Packed64
  * 24: Packed64
  * 25: Packed64
  * 26: Packed64
  * 27: Direct32
  * 28: Direct32
  * 29: Direct32
  * 30: Direct32
  * 31: Direct32
  * 32: Direct32
  * 33: Packed64
  * 34: Packed64
  * 35: Packed64
  * 36: Packed64
  * 37: Packed64
  * 38: Packed64
  * 39: Packed64
  * 40: Packed64
  * 41: Packed64
  * 42: Packed64
  * 43: Packed64
  * 44: Packed64
  * 45: Packed64
  * 46: Packed64
  * 47: Packed64
  * 48: Packed64
  * 49: Packed64
  * 50: Packed64
  * 51: Packed64
  * 52: Packed64
  * 53: Packed64
  * 54: Direct64
  * 55: Direct64
  * 56: Direct64
  * 57: Direct64
  * 58: Direct64
  * 59: Direct64
  * 60: Direct64
  * 61: Direct64
  * 62: Direct64
 Under 32 bits per value, only 13, 17 and 22-26 bits per value would still 
 choose the slower Packed64 implementation. Allowing a 50% overhead would 
 prevent the packed implementation from being selected for bits per value 
 under 32. Allowing an overhead of 32 bits per value would make sure that a 
 Direct* implementation is always selected.
 Next steps would be to:
  * make Lucene components use this {{getMutable}} method and let users 
 decide which trade-off best suits them,
  * write a Packed32SingleBlock implementation if necessary (I didn't do it 
 because I have no 32-bit computer to test the performance improvements).
 I think this would allow more fine-grained control over the speed/space 
 trade-off, what do you think?
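
A minimal sketch of the selection idea behind such a table (illustrative 
only: the real getMutable only offers the single-block sizes that actually 
exist, such as 12 and 21, so it differs from this sketch in a few spots):

{code}
// Walk the candidates from fastest to slowest and take the first one whose
// effective width (including padding) stays within the overhead budget.
static String chooseImpl(int bitsPerValue, float acceptableOverheadPerValue) {
  final float maxBits = bitsPerValue + acceptableOverheadPerValue;
  // Fastest: the array-backed Direct* implementations.
  for (int direct : new int[] {8, 16, 32, 64}) {
    if (direct >= bitsPerValue && direct <= maxBits) {
      return "Direct" + direct;
    }
  }
  // Next: single-block padding, e.g. 3 x 21 bits per long costs 64/3 bits
  // per value. (The real method only considers existing block sizes.)
  for (int n = bitsPerValue; n <= 32; ++n) {
    final float effectiveBits = 64f / (64 / n); // width including padding
    if (effectiveBits <= maxBits) {
      return "Packed64SingleBlock" + n;
    }
  }
  return "Packed64"; // compact but slower: values may straddle two longs
}
{code}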

[jira] [Updated] (LUCENE-4062) More fine-grained control over the packed integer implementation that is chosen

2012-06-29 Thread Toke Eskildsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Toke Eskildsen updated LUCENE-4062:
---

Attachment: PackedIntsBenchmark.java
Packed64Strategy.java

Nice trick with the inheritance in Packed64SingleBlock.java. It really helped 
a lot on the machines I tested with. I tried to use the Strategy Pattern 
myself, using a switch on the final bitsPerValue in set & get, but it seems 
that Hotspot is not keen on optimizing such a conditional away. I have now 
used the inheritance method to provide faster implementations in Packed64 for 
1, 2, 4, 8, 16, 32 and 64 bpv. Due to the existence of the Direct classes, 
this is only relevant for 1, 2 & 4 bpv though.
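
To illustrate the contrast (a sketch, not the attached code): with a per-call 
switch the conditional stays on the hot path, while one subclass per bpv 
turns the shift amounts into constants.

{code}
abstract class ReaderSketch {
  abstract long get(int index);
}

// Variant 1: switch on bitsPerValue inside get(); Hotspot tends to keep
// this branch on the hot path even though bitsPerValue is final.
final class SwitchReader extends ReaderSketch {
  private final int bitsPerValue;
  private final long[] blocks;
  SwitchReader(int bitsPerValue, long[] blocks) {
    this.bitsPerValue = bitsPerValue;
    this.blocks = blocks;
  }
  @Override long get(int index) {
    switch (bitsPerValue) {
      case 4: return (blocks[index >>> 4] >>> ((index & 15) << 2)) & 0xFL;
      case 8: return (blocks[index >>> 3] >>> ((index & 7) << 3)) & 0xFFL;
      default: throw new UnsupportedOperationException();
    }
  }
}

// Variant 2: one subclass per bpv; shift amounts are compile-time constants
// and the virtual call is monomorphic once only one subclass is hot.
final class Reader4 extends ReaderSketch {
  private final long[] blocks;
  Reader4(long[] blocks) { this.blocks = blocks; }
  @Override long get(int index) {
    return (blocks[index >>> 4] >>> ((index & 15) << 2)) & 0xFL;
  }
}
{code}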

Some new measurements at http://ekot.dk/misc/packedints/ with the following 
implementations:
- contiguous: The Packed64 that is being reviewed at LUCENE-4171
- padding: The new Packed64SingleBlock.java with sub-classes
- 3 blocks: As before
- direct: As before
- old padding: Packed64SingleBlock.java without sub-classes
- contiguous strategy: Packed64Strategy.java, which uses sub-classes to 
optimize for bpv = 2^n

I changed PackedIntsBenchmark to report the highest measured performance from 
the iterations instead of the mean. The rationale is that the mean is 
vulnerable to machine load fluctuations and GC throughout the whole 
measurement process.

My observations are that the measurements from atria & pc254, which are 
relatively old machines, are very similar and fit well with the theory: 
padding gives performance higher than or equal to Packed64 at all bpv's other 
than 2^n.

The story is not so clear for mars, which is a very fast machine: For the 
current test case of 10M values, the padding only provides better performance 
above 18-20 bpv, and for some bpv Packed64 is a bit faster. I suspect that 
point would move if other processes were competing for memory cache. Still 
pending are measurements from my i7 machine. I hope to add them this evening 
or tomorrow.

Also pending is why the performance of Direct set is so low for mars. It 
makes no sense: Even if there is some freaky hit for requesting bytes, shorts 
or ints on that machine, performance should be >= everything else at 64 bpv.

When should we stop optimizing? For 1 bpv, a list of booleans would probably be 
faster, so should we make a PackedBoolean? Similarly, Packed8, Packed16 and 
Packed32 would also make sense if we were really determined. None of the 
implementations are hard to make, but it becomes a hassle to update when a new 
feature is needed.


[jira] [Updated] (LUCENE-4062) More fine-grained control over the packed integer implementation that is chosen

2012-06-28 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4062:
-

Attachment: Packed64SingleBlock.java

Very interesting. Out of curiosity, could you try to run the tests on your 
machines with the attached Packed64SingleBlock.java? It generates sub-classes 
that might help the JVM optimize some multiplications/remainders. I'd be 
interested to see what the difference is on your machines (on mine, it is 
slightly faster than the current Packed64SingleBlock impl).


[jira] [Updated] (LUCENE-4062) More fine-grained control over the packed integer implementation that is chosen

2012-06-27 Thread Toke Eskildsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Toke Eskildsen updated LUCENE-4062:
---

Attachment: measurements_te_xeon.txt
measurements_te_p4.txt
measurements_te_i7.txt
measurements_te_graphs.pdf

I ran the test on three different machines. Results are attached as 
measurements*.txt along with a PDF with graphs generated from iteration #6 
(which should probably be the mean or max of runs 2-5). The setter graph for 
the p4 looks extremely strange for Direct, but I tried generating a graph for 
iteration #5 instead and it looked the same. In the same vein, the Direct 
performance for the Xeon is suspiciously low, so I wonder if there's some 
freaky JITting happening to the test code.

Unfortunately I did not find an AMD machine to test on. For the three tested 
Intels, it seems that Packed64calc performs very well.


[jira] [Updated] (LUCENE-4062) More fine-grained control over the packed integer implementation that is chosen

2012-06-26 Thread Toke Eskildsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Toke Eskildsen updated LUCENE-4062:
---

Attachment: PackedIntsBenchmark.java
Packed64calc.java

I tried running the PackedIntsBenchmark on an i7 processor and I agree on the 
overall conclusion with regard to the speed, or rather lack of speed, of 
Packed64. However, the winners for the different bpv's were not always the 
same as Adrien observed. I suspect that CPU architecture and the memory 
system, especially caching, play a very large role here.

Re-thinking the no-branching idea for Packed64, it seems that in reality it 
is slower to perform two memory requests (where the second will most probably 
be cached) than to take the pipeline flush. Therefore I have created 
Packed64calc (see attachment), which is a full replacement for Packed64.
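
A minimal sketch of the branching get this reasoning leads to (illustrative; 
the attached Packed64calc may differ in details such as bit order):

{code}
// Branching Packed64-style get (little-endian bit order): read one block in
// the common case and accept the conditional; read the second block only
// when the value actually straddles a boundary. The old branch-free code
// always read two blocks via precomputed shift/mask tables instead.
static long get(long[] blocks, int index, int bpv) {
  final long mask = (1L << bpv) - 1;
  final long bitPos = (long) index * bpv;
  final int block = (int) (bitPos >>> 6);  // bitPos / 64
  final int offset = (int) (bitPos & 63);  // bitPos % 64
  if (offset + bpv <= 64) {
    return (blocks[block] >>> offset) & mask; // common case: one read
  }
  // straddles two longs: second read, most probably cache-hot
  return ((blocks[block] >>> offset)
      | (blocks[block + 1] << (64 - offset))) & mask;
}
{code}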

My own tests show Packed64calc to be significantly faster than Packed64 and 
in many cases faster than Packed64SingleBlock. I suspect the latter is caused 
either by caching or by the fact that Packed64SingleBlock uses division and 
modulo for set & get. While the modulo can be avoided in Packed64SingleBlock, 
I never did find a reliable way to bypass the division when I experimented 
with it.

I have attached an updated PackedIntsBenchmark which can be used with 
Packed64calc, and I hope that Adrien will take a look.


[jira] [Updated] (LUCENE-4062) More fine-grained control over the packed integer implementation that is chosen

2012-06-15 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4062:
-

Attachment: PackedIntsBenchmark.java

@Dawid I attached the file I used to compute these numbers. There is a lot of 
garbage in the output because I print some stuff to prevent the JVM from 
optimizing the benchmarked code away.
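
The "print some stuff" trick is the usual hand-rolled blackhole; a sketch of 
the pattern (reader and valueCount are assumed from the benchmark's scope):

{code}
// Fold every result into a sum and print it once, so the JIT cannot prove
// the get() calls are dead and eliminate the measured loop.
long sink = 0;
long start = System.nanoTime();
for (int i = 0; i < valueCount; ++i) {
  sink += reader.get(i);
}
System.out.println("ns=" + (System.nanoTime() - start) + " sink=" + sink);
{code}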

@Mike Yes, I was actually thinking of it too... :-)


[jira] [Updated] (LUCENE-4062) More fine-grained control over the packed integer implementation that is chosen

2012-06-14 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4062:
-

Attachment: LUCENE-4062-2.patch

I have run more tests on {{PackedInts}} impls over the last few days to test 
their relative performance.

It appears that the specializations in {{Packed64SingleBlock}} don't help 
much and even hurt performance in some cases. Moreover, replacing the naive 
bulk operations with a {{System.arraycopy}} in {{Direct64}} is a big win. 
(See attached patch.)
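
A sketch of the Direct64 change (illustrative, not the patch itself): the 
backing store is already a long[], so a bulk read into a long[] collapses 
into a single arraycopy.

{code}
final class Direct64Sketch {
  private final long[] values; // backing array, one long per value

  Direct64Sketch(int valueCount) {
    values = new long[valueCount];
  }

  public int get(int index, long[] arr, int off, int len) {
    final int gets = Math.min(values.length - index, len);
    // before: for (int i = 0; i < gets; ++i) arr[off + i] = values[index + i];
    System.arraycopy(values, index, arr, off, gets);
    return gets;
  }
}
{code}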

You can look at the details of the tests here: 
http://people.apache.org/~jpountz/packed_ints.html (contiguous=Packed64, 
padding=Packed64SingleBlock, 3 blocks=Packed*ThreeBlocks, direct=Direct*).

The tests were run on a 64-bit computer (Core 2 Duo E5500) with 
valueCount=10 000 000. Memory overhead is {unused space in bits}/{bits per 
value}, while the other charts measure the number of gets/sets per second.

The random get/set results are very good for the packed versions, probably 
because they manage to fit many more values into the CPU caches than the 
other impls. The reason why bulk get/set is faster when bitsPerValue > 32 is 
that Direct64 uses System.arraycopy instead of a naive copy (in a for loop).

Interestingly, the different impls have very close random get performance.


[jira] [Updated] (LUCENE-4062) More fine-grained control over the packed integer implementation that is chosen

2012-05-24 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4062:
-

Attachment: LUCENE-4062.patch

New patch. I added a VInt flag to the streams generated by writers so that the 
readers can know how to parse the stream. All tests passed.
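
The shape of that header change can be sketched as follows (the format ids 
and helper are assumptions for illustration, not the patch's actual 
constants; DataOutput is org.apache.lucene.store.DataOutput):

{code}
// The writer prepends a VInt telling the reader which layout follows.
static final int FORMAT_PACKED = 0;       // assumed id: compact layout
static final int FORMAT_THREE_BLOCKS = 1; // assumed id: another layout

static void writeHeader(DataOutput out, int format, int valueCount,
    int bitsPerValue) throws IOException {
  out.writeVInt(format);   // readers switch on this to pick a decoder
  out.writeVInt(valueCount);
  out.writeVInt(bitsPerValue);
}
{code}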

[jira] [Updated] (LUCENE-4062) More fine-grained control over the packed integer implementation that is chosen

2012-05-22 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4062:
-

Attachment: LUCENE-4062.patch

Oh right, the cases are swapped; I uploaded the wrong patch. Here is a 
correct version.

I guess there are two options:
 1. always writing the most compact form to disk and deserializing it into a 
fast {{PackedInts.Mutable}} implementation (depending on 
acceptableOverheadRatio),
 2. writing a possibly aligned version to disk (depending on 
acceptableOverheadRatio) and then selecting the fastest PackedInts.Mutable 
implementation that exactly matches bitsPerValue (with no padding bits).

A third option could be to write padding bits ({{Packed64SingleBlock}} 
subclasses may have such padding bits) as well, but I really dislike the fact 
that the on-disk format would be implementation-dependent.

Option 1 is likely to make deserialization slower since some decoding might 
occur (the copyData method), but on the other hand, option 2 would prevent us 
from using implementations that add padding bits (such as 
Packed64SingleBlock21, which has one padding bit every 3 21-bit integers, or 
Packed64SingleBlock12, which has 4 padding bits every 5 12-bit integers, but 
not Packed64SingleBlock{1,2,4} since 64%{1,2,4}=0).
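
A minimal sketch of what the option-1 decode step might look like (the 
patch's copyData may differ):

{code}
// Stream values from the compact on-disk reader into whichever fast
// Mutable was chosen from the acceptable overhead ratio.
static PackedInts.Mutable copyData(PackedInts.Reader compact,
    PackedInts.Mutable fast) {
  for (int i = 0; i < compact.size(); ++i) {
    fast.set(i, compact.get(i));
  }
  return fast;
}
{code}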

I initially chose option 1 because I think it is nice to have 
Packed64SingleBlock21 when bitsPerValue is close to 21, since it might be 
significantly faster than Packed64.

In order to know how much slower deserialization is (option 1), I ran a 
micro-benchmark which:
  - loads a PackedInts.Reader from a ByteArrayDataInput,
  - performs n random reads on the reader.

Here is the code for the micro-benchmark in case you would like to run it.
{code}
int valueCount = 1000;
int bitsPerValue = 21;
int[] offsets = new int[valueCount];
Random random = new Random();
for (int i = 0; i < valueCount; ++i) {
  offsets[i] = random.nextInt(valueCount);
}
byte[] bytes = new byte[valueCount * 4];
DataOutput out = new ByteArrayDataOutput(bytes);
PackedInts.Writer writer = PackedInts.getWriter(out, valueCount, bitsPerValue);
for (int i = 0; i < valueCount; ++i) {
  writer.add(random.nextInt(1 << bitsPerValue));
}
writer.finish();
for (int i = 0; i < 50; ++i) {
  long start = System.nanoTime();
  DataInput in = new ByteArrayDataInput(bytes);
  PackedInts.Reader reader = PackedInts.getReader(in, 0f); // Packed64
  // PackedInts.Reader reader = PackedInts.getReader(in, 0.1f); // Packed64SingleBlock
  for (int j = 0, n = valueCount; j < n; ++j) {
    reader.get(offsets[j]);
  }
  long end = System.nanoTime();
  System.out.println(end - start);
}
{code}

I ran this micro-benchmark for bitsPerValue=21 and valueCount in (1 000 000, 
10 000 000). The loading time (n = 0) is 2x to 3x slower with 
Packed64SingleBlock21. However, as soon as you perform valueCount/4 or more 
random reads (n >= valueCount/4), the total time is better for 
Packed64SingleBlock21. When n=valueCount, the total time is even 2x better 
for Packed64SingleBlock21.

The loading overhead doesn't seem too bad.

However, I guess that some people might still be more interested in the 
loading time than in the query time (for write-intensive applications), but 
in this case they could still request a reader in its compact form (Packed64 
or Direct* when bitsPerValue is 8, 16, 32 or 64). If they want to be sure to 
have a fast reader too, they could also make sure that they used 8, 16, 32 or 
64 as the bitsPerValue parameter of {{PackedInts.getWriter}}.

Mike, what do you think?


[jira] [Updated] (LUCENE-4062) More fine-grained control over the packed integer implementation that is chosen

2012-05-21 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4062:
-

Attachment: LUCENE-4062.patch

New patch with updated getReader method. All tests still pass.


[jira] [Updated] (LUCENE-4062) More fine-grained control over the packed integer implementation that is chosen

2012-05-21 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4062:
-

Attachment: LUCENE-4062.patch

Updated javadocs (there were still a few references to `fasterButMoreRam`).


[jira] [Updated] (LUCENE-4062) More fine-grained control over the packed integer implementation that is chosen

2012-05-16 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4062:
-

Attachment: LUCENE-4062.patch

I added the changes we discussed in the patch, replacing fasterButMoreRam=true 
with PackedInts.FAST and fasterButMoreRam=false with PackedInts.DEFAULT 
(= 20%). All Lucene tests passed.
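
Call sites then look like this (a sketch; count and bpv are placeholders):

{code}
// Choose speed over space, or the default 20% acceptable overhead.
PackedInts.Mutable fast = PackedInts.getMutable(count, bpv, PackedInts.FAST);
PackedInts.Mutable dflt = PackedInts.getMutable(count, bpv, PackedInts.DEFAULT);
{code}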

[jira] [Updated] (LUCENE-4062) More fine-grained control over the packed integer implementation that is chosen

2012-05-16 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4062:
-

Attachment: LUCENE-4062.patch

Updated patch with improved performance when bitsPerValue is close (depending 
on the acceptable overhead) to 24 or 48. The two new implementations store 
values on three contiguous blocks (byte and short) and have performance very 
close to the Direct* implementations. Their maximum capacity is lower (by a 
factor of 3), so {{PackedInts.getMutable}} decides whether they should be 
selected based on valueCount. All tests still pass.
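
The layout can be sketched like this (illustrative; the patch's byte-based 
implementation may differ in byte order and naming):

{code}
// 24-bpv storage over three contiguous byte blocks: get reassembles exactly
// three blocks and a value never straddles anything. Capacity is ~3x lower
// than Direct* because index * 3 must stay within int range.
final class ThreeByteBlocksSketch {
  private final byte[] blocks; // 3 bytes per value

  ThreeByteBlocksSketch(int valueCount) {
    blocks = new byte[valueCount * 3];
  }

  public long get(int index) {
    final int o = index * 3;
    return ((blocks[o] & 0xFFL) << 16)
        | ((blocks[o + 1] & 0xFFL) << 8)
        | (blocks[o + 2] & 0xFFL);
  }

  public void set(int index, long value) {
    final int o = index * 3;
    blocks[o] = (byte) (value >>> 16);
    blocks[o + 1] = (byte) (value >>> 8);
    blocks[o + 2] = (byte) value;
  }
}
{code}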
