Github user maropu commented on the pull request:
https://github.com/apache/spark/pull/11236#issuecomment-187733938
I tried implementing `IntDeltaBinaryPacking` in `compressionSchemes`. It is
a simplified version of `IntDeltaBinaryPackingReader/Writer` in
`parquet-column`, simplified so that the compressed size is easy to
calculate in `gatherCompressibilityStats`. The benchmark results are as follows:
```
Running benchmark: INT Decode(Lower Skew)
Running case: PassThrough(1.000)
Running case: RunLengthEncoding(1.002)
Running case: DictionaryEncoding(0.500)
Running case: IntDelta(0.250)
Running case: IntDeltaBinaryPacking(0.068)

Intel(R) Core(TM) i7-4578U CPU @ 3.00GHz
INT Decode(Lower Skew):          Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
PassThrough(1.000)                      285 /  360        235.7           4.2       1.0X
RunLengthEncoding(1.002)                700 /  715         95.8          10.4       0.4X
DictionaryEncoding(0.500)               763 /  782         88.0          11.4       0.4X
IntDelta(0.250)                         684 /  702         98.1          10.2       0.4X
IntDeltaBinaryPacking(0.068)            805 /  811         83.4          12.0       0.4X

Running benchmark: INT Decode(Higher Skew)
Running case: PassThrough(1.000)
Running case: RunLengthEncoding(1.337)
Running case: DictionaryEncoding(0.501)
Running case: IntDelta(0.250)
Running case: IntDeltaBinaryPacking(0.182)

Intel(R) Core(TM) i7-4578U CPU @ 3.00GHz
INT Decode(Higher Skew):         Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
PassThrough(1.000)                      690 /  716         97.3          10.3       1.0X
RunLengthEncoding(1.337)               1127 / 1148         59.5          16.8       0.6X
DictionaryEncoding(0.501)               836 /  856         80.2          12.5       0.8X
IntDelta(0.250)                         763 /  778         88.0          11.4       0.9X
IntDeltaBinaryPacking(0.182)            873 /  884         76.9          13.0       0.8X
```
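Encoding/decoding gets a little slower, but the compression ratios (the
number in parentheses after each case name) get much better: in the
benchmark above, `IntDeltaBinaryPacking` reaches 0.068 and 0.182, versus
0.250 for `IntDelta` in both runs.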
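
To illustrate why the compressed size of a delta + bit-packing scheme is
cheap to estimate, here is a minimal sketch in Java. It is not the code from
this pull request and not the `parquet-column` implementation: the class
name, the block size of 128, and the 9-byte block header are all
assumptions made for illustration. Each block stores its minimum delta and
a bit width, then packs `delta - minDelta` for every value in that many bits.

```java
// Hypothetical sketch: estimates the encoded size of a simplified
// delta binary packing layout. Names and layout are illustrative only.
final class DeltaPackingSizeSketch {
    // Assumed number of deltas per block; not taken from the pull request.
    static final int BLOCK_SIZE = 128;
    // Assumed block header: 8-byte minimum delta + 1-byte bit width.
    static final int BLOCK_HEADER_BYTES = 9;

    /** Returns the estimated encoded size in bytes for the given values. */
    static long estimateSize(int[] values) {
        if (values.length == 0) {
            return 0;
        }
        long size = 4; // the first value is stored as a plain 4-byte int
        for (int start = 1; start < values.length; start += BLOCK_SIZE) {
            int end = Math.min(start + BLOCK_SIZE, values.length);
            long minDelta = Long.MAX_VALUE;
            long maxDelta = Long.MIN_VALUE;
            for (int i = start; i < end; i++) {
                // Use long arithmetic so int differences cannot overflow.
                long delta = (long) values[i] - values[i - 1];
                minDelta = Math.min(minDelta, delta);
                maxDelta = Math.max(maxDelta, delta);
            }
            // Bits needed to represent (delta - minDelta) for the whole block;
            // this is 0 when every delta in the block is identical.
            int bitWidth = 64 - Long.numberOfLeadingZeros(maxDelta - minDelta);
            long packedBits = (long) bitWidth * (end - start);
            size += BLOCK_HEADER_BYTES + (packedBits + 7) / 8; // round up to bytes
        }
        return size;
    }

    public static void main(String[] args) {
        int[] sequential = new int[1000];
        for (int i = 0; i < sequential.length; i++) {
            sequential[i] = i;
        }
        long encoded = estimateSize(sequential);
        long plain = 4L * sequential.length;
        System.out.printf("estimated ratio: %.3f%n", (double) encoded / plain);
    }
}
```

Only the per-block minimum and maximum delta are needed, so a size estimate
like this takes a single pass over the column without packing any bits.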