[
https://issues.apache.org/jira/browse/LUCENE-4609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13573590#comment-13573590
]
Gilad Barkai edited comment on LUCENE-4609 at 2/7/13 3:29 PM:
--------------------------------------------------------------
Finally figured out I was doing things completely wrong.. instead of having a
super smart optimizing code for semi-packed encoder - there's now a strait
forward semi-packed encoder:
Values smaller than 256 use only one byte (the value itself) and larger values
are encoded as VInt plus a leading zero byte. Worst case it can be 6 bytes per
value (zero marker + 5 of VInt).
The idea, is to pay the penalty for variable length encoding for large values
which should be less common in a sort-uniq-dgap scenario.
Wrote two versions w and w/o dgap specialization, though I'm not sure how
useful is the non-specialized code.
I do not currently have the means to run the LuceneUtil (nor the wikipedia
index with the categories) - but I ran the {EncodingSpeed} test - and was
surprised.
While the encoding is on a little worse (or on par) with dgap-vint, the
decoding speed is significantly faster. The new encode is the only (?!) encoder
to beat {SimpleIntEncoder} (which writes plain 32 bits per value) in decoding
time.
Those values being used in the EncodingSpeed are real scenario, but I'm not
sure how much they represent a "common" case (e.g wikipedia).
Mike - could you please try this encoder? I guess it only makes sense to run
the specialized {DGapSemiPackedEncoder}.
Also, I'm not sure SimpleIntEncoder was ever used (without any sorting, or
unique). It would be interesting to test it as well. We will pay in more I/O
and much larger file size (~4 times larger..) but it doesn't mean it will be
any slower.
Here are the results of the EncodingSpeed test:
{noformat}
Estimating ~100000000 Integers compression time by
Encoding/decoding facets' ID payload of docID = 3630 (unsorted, length of:
2430) 41152 times.
Encoder Bits/Int
Encode Time Encode Time Decode Time
Decode Time
[milliseconds] [microsecond / int] [milliseconds]
[microsecond / int]
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Simple 32.0000
190 1.9000 165
1.6500
VInt8 18.4955
436 4.3600 359
3.5900
Sorting(Unique(VInt8)) 18.4955
3557 35.5702 314
3.1400
Sorting(Unique(DGap(VInt8))) 8.5597
3485 34.8502 270
2.7000
Sorting(Unique(DGapVInt8)) 8.5597
3434 34.3402 192
1.9200
Sorting(Unique(DGap(SemiPacked))) 8.6453
3386 33.8602 156
1.5600
Sorting(Unique(DGapSemiPacked)) 8.6453
3397 33.9702 99
0.9900
Sorting(Unique(DGap(EightFlags(VInt)))) 4.9679
4002 40.0203 381
3.8100
Sorting(Unique(DGap(FourFlags(VInt)))) 4.8198
3972 39.7203 399
3.9900
Sorting(Unique(DGap(NOnes(3) (FourFlags(VInt))))) 4.5794
4448 44.4803 645
6.4500
Sorting(Unique(DGap(NOnes(4) (FourFlags(VInt))))) 4.5794
4461 44.6103 641
6.4100
Estimating ~100000000 Integers compression time by
Encoding/decoding facets' ID payload of docID = 9910 (unsorted, length of:
1489) 67159 times.
Encoder Bits/Int
Encode Time Encode Time Decode Time
Decode Time
[milliseconds] [microsecond / int] [milliseconds]
[microsecond / int]
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Simple 32.0000
169 1.6900 163
1.6300
VInt8 18.2673
421 4.2100 374
3.7400
Sorting(Unique(VInt8)) 18.2673
2230 22.3001 337
3.3700
Sorting(Unique(DGap(VInt8))) 8.9456
2257 22.5701 273
2.7300
Sorting(Unique(DGapVInt8)) 8.9456
2214 22.1401 192
1.9200
Sorting(Unique(DGap(SemiPacked))) 9.2787
2162 21.6201 180
1.8000
Sorting(Unique(DGapSemiPacked)) 9.2787
2148 21.4801 120
1.2000
Sorting(Unique(DGap(EightFlags(VInt)))) 5.7542
2937 29.3701 395
3.9500
Sorting(Unique(DGap(FourFlags(VInt)))) 5.5447
2768 27.6801 407
4.0700
Sorting(Unique(DGap(NOnes(3) (FourFlags(VInt))))) 5.3566
3294 32.9401 651
6.5100
Sorting(Unique(DGap(NOnes(4) (FourFlags(VInt))))) 5.3996
3318 33.1801 662
6.6200
Estimating ~100000000 Integers compression time by
Encoding/decoding facets' ID payload of docID = 10000 (unsorted, length of: 18)
5555555 times.
Encoder Bits/Int
Encode Time Encode Time Decode Time
Decode Time
[milliseconds] [microsecond / int] [milliseconds]
[microsecond / int]
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Simple 32.0000
316 3.1600 318
3.1800
VInt8 20.8889
569 5.6900 496
4.9600
Sorting(Unique(VInt8)) 20.8889
1196 11.9600 488
4.8800
Sorting(Unique(DGap(VInt8))) 12.0000
1151 11.5100 481
4.8100
Sorting(Unique(DGapVInt8)) 12.0000
1100 11.0000 352
3.5200
Sorting(Unique(DGap(SemiPacked))) 14.6667
1107 11.0700 507
5.0700
Sorting(Unique(DGapSemiPacked)) 14.6667
1037 10.3700 439
4.3900
Sorting(Unique(DGap(EightFlags(VInt)))) 10.2222
1315 13.1500 656
6.5600
Sorting(Unique(DGap(FourFlags(VInt)))) 10.2222
1408 14.0800 675
6.7500
Sorting(Unique(DGap(NOnes(3) (FourFlags(VInt))))) 9.7778
1617 16.1700 990
9.9000
Sorting(Unique(DGap(NOnes(4) (FourFlags(VInt))))) 10.2222
1708 17.0800 992
9.9200
Estimating ~100000000 Integers compression time by
Encoding/decoding facets' ID payload of docID = 501871 (unsorted, length of:
957) 104493 times.
Encoder Bits/Int
Encode Time Encode Time Decode Time
Decode Time
[milliseconds] [microsecond / int] [milliseconds]
[microsecond / int]
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Simple 32.0000
181 1.8100 168
1.6800
VInt8 16.5768
428 4.2800 351
3.5100
Sorting(Unique(VInt8)) 16.5768
1586 15.8600 330
3.3000
Sorting(Unique(DGap(VInt8))) 8.4848
1477 14.7700 267
2.6700
Sorting(Unique(DGapVInt8)) 8.4848
1435 14.3500 193
1.9300
Sorting(Unique(DGap(SemiPacked))) 8.6771
1424 14.2400 159
1.5900
Sorting(Unique(DGapSemiPacked)) 8.6771
1402 14.0200 100
1.0000
Sorting(Unique(DGap(EightFlags(VInt)))) 4.4138
1603 16.0300 376
3.7600
Sorting(Unique(DGap(FourFlags(VInt)))) 4.1797
1700 17.0000 382
3.8200
Sorting(Unique(DGap(NOnes(3) (FourFlags(VInt))))) 3.8955
2011 20.1100 602
6.0200
Sorting(Unique(DGap(NOnes(4) (FourFlags(VInt))))) 3.8871
2024 20.2400 594
5.9400
{noformat}
I have another version which uses two bytes before using the variable length
(e.g values smaller than 0x10000 are encoded as is in two bytes, other values
are encoded as two leading zero bytes and the VInt) - but I did not optimize it
yet, nor I'm sure it's very useful.
was (Author: gilad):
Finally figured out I was doing things completely wrong.. instead of having
a super smart optimizing code for semi-packed encoder - there's now a strait
forward semi-packed encoder:
Values smaller than 256 use only one byte (the value itself) and larger values
are encoded as VInt plus a leading zero byte. Worst case it can be 6 bytes per
value (zero marker + 5 of VInt).
The idea, is to pay the penalty for variable length encoding for large values
which should be less common in a sort-uniq-dgap scenario.
Wrote two versions w and w/o dgap specialization, though I'm not sure how
useful is the non-specialized code.
I do not currently have the means to run the LuceneUtil (nor the wikipedia
index with the categories) - but I ran the EncodingSpeed test - and was
surprised.
While the encoding is on a little worse (or on par) with dgap-vint, the
decoding speed is significantly faster. The new encode is the only (?!) encoder
to beat {code}SimpleIntEncoder{code} (which writes plain 32 bits per value) in
decoding time.
Those values being used in the EncodingSpeed are real scenario, but I'm not
sure how much they represent a "common" case (e.g wikipedia).
Mike - could you please try this encoder? I guess it only makes sense to run
the specialized {code}DGapSemiPackedEncoder{code}.
Also, I'm not sure SimpleIntEncoder was ever used (without any sorting, or
unique). It would be interesting to test it as well. We will pay in more I/O
and much larger file size (~4 times larger..) but it doesn't mean it will be
any slower.
Here are the results of the EncodingSpeed test:
{noformat}
Estimating ~100000000 Integers compression time by
Encoding/decoding facets' ID payload of docID = 3630 (unsorted, length of:
2430) 41152 times.
Encoder Bits/Int
Encode Time Encode Time Decode Time
Decode Time
[milliseconds] [microsecond / int] [milliseconds]
[microsecond / int]
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Simple 32.0000
190 1.9000 165
1.6500
VInt8 18.4955
436 4.3600 359
3.5900
Sorting(Unique(VInt8)) 18.4955
3557 35.5702 314
3.1400
Sorting(Unique(DGap(VInt8))) 8.5597
3485 34.8502 270
2.7000
Sorting(Unique(DGapVInt8)) 8.5597
3434 34.3402 192
1.9200
Sorting(Unique(DGap(SemiPacked))) 8.6453
3386 33.8602 156
1.5600
Sorting(Unique(DGapSemiPacked)) 8.6453
3397 33.9702 99
0.9900
Sorting(Unique(DGap(EightFlags(VInt)))) 4.9679
4002 40.0203 381
3.8100
Sorting(Unique(DGap(FourFlags(VInt)))) 4.8198
3972 39.7203 399
3.9900
Sorting(Unique(DGap(NOnes(3) (FourFlags(VInt))))) 4.5794
4448 44.4803 645
6.4500
Sorting(Unique(DGap(NOnes(4) (FourFlags(VInt))))) 4.5794
4461 44.6103 641
6.4100
Estimating ~100000000 Integers compression time by
Encoding/decoding facets' ID payload of docID = 9910 (unsorted, length of:
1489) 67159 times.
Encoder Bits/Int
Encode Time Encode Time Decode Time
Decode Time
[milliseconds] [microsecond / int] [milliseconds]
[microsecond / int]
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Simple 32.0000
169 1.6900 163
1.6300
VInt8 18.2673
421 4.2100 374
3.7400
Sorting(Unique(VInt8)) 18.2673
2230 22.3001 337
3.3700
Sorting(Unique(DGap(VInt8))) 8.9456
2257 22.5701 273
2.7300
Sorting(Unique(DGapVInt8)) 8.9456
2214 22.1401 192
1.9200
Sorting(Unique(DGap(SemiPacked))) 9.2787
2162 21.6201 180
1.8000
Sorting(Unique(DGapSemiPacked)) 9.2787
2148 21.4801 120
1.2000
Sorting(Unique(DGap(EightFlags(VInt)))) 5.7542
2937 29.3701 395
3.9500
Sorting(Unique(DGap(FourFlags(VInt)))) 5.5447
2768 27.6801 407
4.0700
Sorting(Unique(DGap(NOnes(3) (FourFlags(VInt))))) 5.3566
3294 32.9401 651
6.5100
Sorting(Unique(DGap(NOnes(4) (FourFlags(VInt))))) 5.3996
3318 33.1801 662
6.6200
Estimating ~100000000 Integers compression time by
Encoding/decoding facets' ID payload of docID = 10000 (unsorted, length of: 18)
5555555 times.
Encoder Bits/Int
Encode Time Encode Time Decode Time
Decode Time
[milliseconds] [microsecond / int] [milliseconds]
[microsecond / int]
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Simple 32.0000
316 3.1600 318
3.1800
VInt8 20.8889
569 5.6900 496
4.9600
Sorting(Unique(VInt8)) 20.8889
1196 11.9600 488
4.8800
Sorting(Unique(DGap(VInt8))) 12.0000
1151 11.5100 481
4.8100
Sorting(Unique(DGapVInt8)) 12.0000
1100 11.0000 352
3.5200
Sorting(Unique(DGap(SemiPacked))) 14.6667
1107 11.0700 507
5.0700
Sorting(Unique(DGapSemiPacked)) 14.6667
1037 10.3700 439
4.3900
Sorting(Unique(DGap(EightFlags(VInt)))) 10.2222
1315 13.1500 656
6.5600
Sorting(Unique(DGap(FourFlags(VInt)))) 10.2222
1408 14.0800 675
6.7500
Sorting(Unique(DGap(NOnes(3) (FourFlags(VInt))))) 9.7778
1617 16.1700 990
9.9000
Sorting(Unique(DGap(NOnes(4) (FourFlags(VInt))))) 10.2222
1708 17.0800 992
9.9200
Estimating ~100000000 Integers compression time by
Encoding/decoding facets' ID payload of docID = 501871 (unsorted, length of:
957) 104493 times.
Encoder Bits/Int
Encode Time Encode Time Decode Time
Decode Time
[milliseconds] [microsecond / int] [milliseconds]
[microsecond / int]
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Simple 32.0000
181 1.8100 168
1.6800
VInt8 16.5768
428 4.2800 351
3.5100
Sorting(Unique(VInt8)) 16.5768
1586 15.8600 330
3.3000
Sorting(Unique(DGap(VInt8))) 8.4848
1477 14.7700 267
2.6700
Sorting(Unique(DGapVInt8)) 8.4848
1435 14.3500 193
1.9300
Sorting(Unique(DGap(SemiPacked))) 8.6771
1424 14.2400 159
1.5900
Sorting(Unique(DGapSemiPacked)) 8.6771
1402 14.0200 100
1.0000
Sorting(Unique(DGap(EightFlags(VInt)))) 4.4138
1603 16.0300 376
3.7600
Sorting(Unique(DGap(FourFlags(VInt)))) 4.1797
1700 17.0000 382
3.8200
Sorting(Unique(DGap(NOnes(3) (FourFlags(VInt))))) 3.8955
2011 20.1100 602
6.0200
Sorting(Unique(DGap(NOnes(4) (FourFlags(VInt))))) 3.8871
2024 20.2400 594
5.9400
{noformat}
I have another version which uses two bytes before using the variable length
(e.g values smaller than 0x10000 are encoded as is in two bytes, other values
are encoded as two leading zero bytes and the VInt) - but I did not optimize it
yet, nor I'm sure it's very useful.
> Write a PackedIntsEncoder/Decoder for facets
> --------------------------------------------
>
> Key: LUCENE-4609
> URL: https://issues.apache.org/jira/browse/LUCENE-4609
> Project: Lucene - Core
> Issue Type: New Feature
> Components: modules/facet
> Reporter: Shai Erera
> Priority: Minor
> Attachments: LUCENE-4609.patch, LUCENE-4609.patch, LUCENE-4609.patch,
> LUCENE-4609.patch, LUCENE-4609.patch, SemiPackedEncoder.patch
>
>
> Today the facets API lets you write IntEncoder/Decoder to encode/decode the
> category ordinals. We have several such encoders, including VInt (default),
> and block encoders.
> It would be interesting to implement and benchmark a
> PackedIntsEncoder/Decoder, with potentially two variants: (1) receives
> bitsPerValue up front, when you e.g. know that you have a small taxonomy and
> the max value you can see and (2) one that decides for each doc on the
> optimal bitsPerValue, writes it as a header in the byte[] or something.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]