GitHub user kiszk opened a pull request:
https://github.com/apache/spark/pull/18033
Add compression/decompression of column data to ColumnVector
## What changes were proposed in this pull request?
This PR adds compression/decompression of column data to `ColumnVector`.
While the current `CachedBatch` can compress column data by using multiple
compression schemes, `ColumnVector` cannot compress column data. Compression
is mandatory for the table cache.
As a first step, this PR enables `RunLengthEncoding` for
boolean/byte/short/int/long and `BooleanBitSet` for boolean. Other compression
schemes will be supported in another JIRA.
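As a standalone illustration of the idea behind the first scheme (not code from this PR; all names here are hypothetical), the sketch below run-length encodes an `Array[Int]` into (value, run length) pairs and decodes it back:

```scala
import scala.collection.mutable

// Hypothetical sketch of run-length encoding (RLE): consecutive repeated
// values collapse into (value, runLength) pairs.
object RleSketch {
  def encode(input: Array[Int]): Array[(Int, Int)] = {
    val runs = mutable.ListBuffer.empty[(Int, Int)]
    var i = 0
    while (i < input.length) {
      var runLen = 1
      while (i + runLen < input.length && input(i + runLen) == input(i)) runLen += 1
      runs += ((input(i), runLen))
      i += runLen
    }
    runs.toArray
  }

  def decode(runs: Array[(Int, Int)]): Array[Int] =
    runs.flatMap { case (value, runLen) => Array.fill(runLen)(value) }

  def main(args: Array[String]): Unit = {
    val data = Array(7, 7, 7, 3, 3, 9)
    val encoded = encode(data)                  // Array((7,3), (3,2), (9,1))
    assert(decode(encoded).sameElements(data))
  }
}
```

RLE pays off when a column contains long runs of repeated values, for example sorted or low-cardinality columns.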
At a high level, when `ColumnVector.compress()` is called, data is compressed
from an array of the primitive data type into a byte array in `ColumnVector`.
When `ColumnVector.decompress()` is called, data is decompressed from that
byte array back into the array of the primitive data type in `ColumnVector`.
For this compression/decompression, `ArrayBuffer` is used to access the data.
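A minimal, self-contained model of this flow is sketched below. The class stands in for a boolean `ColumnVector` and uses `BooleanBitSet`-style bit packing; its shape and method signatures are assumptions for illustration, not the PR's actual code.

```scala
// Hypothetical stand-in for a boolean column: values live in a primitive array,
// compress() packs them into a byte array (8 values per byte), and
// decompress() unpacks them back into the primitive array.
class BooleanColumnSketch(capacity: Int) {
  private val values = new Array[Boolean](capacity)   // primitive data
  private var packed: Array[Byte] = _                 // compressed byte array

  def putBoolean(rowId: Int, v: Boolean): Unit = values(rowId) = v
  def getBoolean(rowId: Int): Boolean = values(rowId)

  def compress(): Unit = {
    packed = new Array[Byte]((capacity + 7) / 8)
    var i = 0
    while (i < capacity) {
      if (values(i)) packed(i / 8) = (packed(i / 8) | (1 << (i % 8))).toByte
      i += 1
    }
  }

  def decompress(): Unit = {
    var i = 0
    while (i < capacity) {
      values(i) = (packed(i / 8) & (1 << (i % 8))) != 0
      i += 1
    }
  }
}

object BooleanColumnSketch {
  def main(args: Array[String]): Unit = {
    val col = new BooleanColumnSketch(10)
    (0 until 10).foreach(i => col.putBoolean(i, i % 3 == 0))
    col.compress()
    col.decompress()
    assert((0 until 10).forall(i => col.getBoolean(i) == (i % 3 == 0)))
  }
}
```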
This PR adds and changes the following APIs:
`ArrayBuffer`
* This new class is similar to `java.nio.ByteBuffer`. `ArrayBuffer` can wrap
an array of any primitive data type, such as `Array[Int]` or `Array[Long]`,
and manages the current position to be accessed.
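A rough sketch of what such a position-managed wrapper might look like is below; for brevity it only wraps `Array[Int]`, and the name and methods are assumptions rather than the PR's actual `ArrayBuffer` definition:

```scala
// Hypothetical sketch of a position-managed wrapper over a primitive array,
// in the spirit of the ArrayBuffer class described above.
class IntArrayBufferSketch(underlying: Array[Int]) {
  private var pos = 0                      // current position, advanced by get/put

  def position: Int = pos
  def hasRemaining: Boolean = pos < underlying.length

  def getInt(): Int = { val v = underlying(pos); pos += 1; v }
  def putInt(v: Int): Unit = { underlying(pos) = v; pos += 1 }
}

object IntArrayBufferSketch {
  def main(args: Array[String]): Unit = {
    val writer = new IntArrayBufferSketch(new Array[Int](3))
    writer.putInt(1); writer.putInt(2); writer.putInt(3)

    val reader = new IntArrayBufferSketch(Array(1, 2, 3))
    assert(reader.getInt() == 1 && reader.getInt() == 2 && reader.hasRemaining)
  }
}
```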
`ColumnType.get(buffer: ArrayBuffer): JvmType`, `ColumnType.put(buffer:
ArrayBuffer)`
* These APIs get a primitive value from, or put a primitive value into, the
current position of the given `ArrayBuffer`.
`Encoder.gatherCompressibilityStats(in: ArrayBuffer)`
* This API calculates the uncompressed and compressed sizes of the data for
the encoder's compression scheme.
`Encoder.compress(from: ArrayBuffer, to: ArrayBuffer): Unit`
* This API compresses the data in `from` and stores the compressed data in
`to`. `to` has to wrap a byte array large enough for the compressed data.
`Decoder.decompress(values: ArrayBuffer): Unit`
* This API decompresses the data passed to the `Decoder` constructor and
stores the uncompressed data in `values`. `values` has to wrap an array large
enough for the uncompressed data. A usage sketch of these `Encoder`/`Decoder`
APIs follows below.
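As a rough sketch of the call sequence these `Encoder`/`Decoder` APIs describe (gather stats, compress, then decompress), the code below uses `java.nio.ByteBuffer` in place of the PR's `ArrayBuffer` and a trivial pass-through scheme; the trait shapes and all names are assumptions, not the PR's actual definitions.

```scala
import java.nio.ByteBuffer

// Hypothetical Encoder/Decoder shapes mirroring the APIs described above,
// using java.nio.ByteBuffer in place of the PR's ArrayBuffer. The scheme is a
// trivial pass-through, shown only to illustrate the call sequence.
trait EncoderSketch {
  def gatherCompressibilityStats(in: ByteBuffer): Unit
  def compress(from: ByteBuffer, to: ByteBuffer): Unit
}

trait DecoderSketch {
  def decompress(values: ByteBuffer): Unit
}

class PassThroughEncoder extends EncoderSketch {
  // Pass-through never shrinks the data, so both sizes grow identically.
  var uncompressedSize = 0
  var compressedSize = 0

  override def gatherCompressibilityStats(in: ByteBuffer): Unit = {
    uncompressedSize += in.remaining()
    compressedSize += in.remaining()
  }

  override def compress(from: ByteBuffer, to: ByteBuffer): Unit = to.put(from)
}

// The compressed data is handed to the Decoder's constructor, as described above.
class PassThroughDecoder(compressed: ByteBuffer) extends DecoderSketch {
  override def decompress(values: ByteBuffer): Unit = values.put(compressed)
}

object CompressionRoundTrip {
  def main(args: Array[String]): Unit = {
    val input = ByteBuffer.wrap(Array[Byte](1, 2, 3, 4))

    val encoder = new PassThroughEncoder
    encoder.gatherCompressibilityStats(input.duplicate())   // 1. estimate sizes

    val compressed = ByteBuffer.allocate(encoder.compressedSize)
    encoder.compress(input.duplicate(), compressed)          // 2. compress into `to`
    compressed.flip()

    val restored = ByteBuffer.allocate(encoder.uncompressedSize)
    new PassThroughDecoder(compressed).decompress(restored)  // 3. decompress into `values`
    assert(restored.array().sameElements(Array[Byte](1, 2, 3, 4)))
  }
}
```

A real scheme such as `RunLengthEncoding` would report a smaller `compressedSize` from its stats gathering whenever the column contains runs of repeated values.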
## How was this patch tested?
Added new test suites
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/kiszk/spark SPARK-20807
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/18033.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #18033
----
commit 6d5497ef38b3efff6ac1b1b48fe9e873f5c9394a
Author: Kazuaki Ishizaki <[email protected]>
Date: 2017-05-19T09:33:38Z
initial commit
----