GitHub user kiszk opened a pull request:
https://github.com/apache/spark/pull/18033
Add compression/decompression of column data to ColumnVector
## What changes were proposed in this pull request?
This PR adds compression/decompression of column data to `ColumnVector`.
While the current `CachedBatch` can compress column data by using multiple
compression schemes, `ColumnVector` cannot compress column data. Compression
is mandatory for the table cache.
As a first step, this PR enables `RunLengthEncoding` for
boolean/byte/short/int/long and `BooleanBitSet` for boolean. Other compression
schemes will be supported in another JIRA.
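As a standalone illustration of the idea behind the first scheme (not code from this PR; all names here are hypothetical), the sketch below run-length encodes an `Array[Int]` into (value, run length) pairs and decodes it back:

```scala
import scala.collection.mutable

// Hypothetical sketch of run-length encoding (RLE): consecutive repeated
// values collapse into (value, runLength) pairs.
object RleSketch {
  def encode(input: Array[Int]): Array[(Int, Int)] = {
    val runs = mutable.ListBuffer.empty[(Int, Int)]
    var i = 0
    while (i < input.length) {
      var runLen = 1
      while (i + runLen < input.length && input(i + runLen) == input(i)) runLen += 1
      runs += ((input(i), runLen))
      i += runLen
    }
    runs.toArray
  }

  def decode(runs: Array[(Int, Int)]): Array[Int] =
    runs.flatMap { case (value, runLen) => Array.fill(runLen)(value) }

  def main(args: Array[String]): Unit = {
    val data = Array(7, 7, 7, 3, 3, 9)
    val encoded = encode(data)                  // Array((7,3), (3,2), (9,1))
    assert(decode(encoded).sameElements(data))
  }
}
```

RLE pays off when a column contains long runs of repeated values, for example sorted or low-cardinality columns.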
At a high level, when `ColumnVector.compress()` is called, data is compressed
from an array of the primitive data type into a byte array in `ColumnVector`.
When `ColumnVector.decompress()` is called, data is decompressed from that
byte array back into the array of the primitive data type in `ColumnVector`.
For this compression/decompression, `ArrayBuffer` is used to access the data.
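A minimal, self-contained model of this flow is sketched below. The class stands in for a boolean `ColumnVector` and uses `BooleanBitSet`-style bit packing; its shape and method signatures are assumptions for illustration, not the PR's actual code.

```scala
// Hypothetical stand-in for a boolean column: values live in a primitive array,
// compress() packs them into a byte array (8 values per byte), and
// decompress() unpacks them back into the primitive array.
class BooleanColumnSketch(capacity: Int) {
  private val values = new Array[Boolean](capacity)   // primitive data
  private var packed: Array[Byte] = _                 // compressed byte array

  def putBoolean(rowId: Int, v: Boolean): Unit = values(rowId) = v
  def getBoolean(rowId: Int): Boolean = values(rowId)

  def compress(): Unit = {
    packed = new Array[Byte]((capacity + 7) / 8)
    var i = 0
    while (i < capacity) {
      if (values(i)) packed(i / 8) = (packed(i / 8) | (1 << (i % 8))).toByte
      i += 1
    }
  }

  def decompress(): Unit = {
    var i = 0
    while (i < capacity) {
      values(i) = (packed(i / 8) & (1 << (i % 8))) != 0
      i += 1
    }
  }
}

object BooleanColumnSketch {
  def main(args: Array[String]): Unit = {
    val col = new BooleanColumnSketch(10)
    (0 until 10).foreach(i => col.putBoolean(i, i % 3 == 0))
    col.compress()
    col.decompress()
    assert((0 until 10).forall(i => col.getBoolean(i) == (i % 3 == 0)))
  }
}
```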
This PR adds and changes the following APIs:
`ArrayBuffer`
* This new class is similar to `java.nio.ByteBuffer`. `ArrayBuffer` can wrap
an array of any primitive data type, such as `Array[Int]` or `Array[Long]`,
and manages the current position to be accessed.
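A rough sketch of what such a position-managed wrapper might look like is below; for brevity it only wraps `Array[Int]`, and the name and methods are assumptions rather than the PR's actual `ArrayBuffer` definition:

```scala
// Hypothetical sketch of a position-managed wrapper over a primitive array,
// in the spirit of the ArrayBuffer class described above.
class IntArrayBufferSketch(underlying: Array[Int]) {
  private var pos = 0                      // current position, advanced by get/put

  def position: Int = pos
  def hasRemaining: Boolean = pos < underlying.length

  def getInt(): Int = { val v = underlying(pos); pos += 1; v }
  def putInt(v: Int): Unit = { underlying(pos) = v; pos += 1 }
}

object IntArrayBufferSketch {
  def main(args: Array[String]): Unit = {
    val writer = new IntArrayBufferSketch(new Array[Int](3))
    writer.putInt(1); writer.putInt(2); writer.putInt(3)

    val reader = new IntArrayBufferSketch(Array(1, 2, 3))
    assert(reader.getInt() == 1 && reader.getInt() == 2 && reader.hasRemaining)
  }
}
```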
`ColumnType.get(buffer: ArrayBuffer): JvmType`, `ColumnType.put(buffer:
ArrayBuffer)`
* These APIs get a primitive value from, or put a primitive value into, the
current position of the given `ArrayBuffer`.
`Encoder.gatherCompressibilityStats(in: ArrayBuffer)`
* This API calculates the uncompressed and compressed sizes of the data for
the encoder's compression scheme.
`Encoder.compress(from: ArrayBuffer, to: ArrayBuffer): Unit`
* This API compresses the data in `from` and stores the compressed data in
`to`. `to` has to wrap a byte array large enough for the compressed data.
`Decoder.decompress(values: ArrayBuffer): Unit`
* This API decompresses the data passed to the `Decoder` constructor and
stores the uncompressed data in `values`. `values` has to wrap an array large
enough for the uncompressed data. A usage sketch of these `Encoder`/`Decoder`
APIs follows below.
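As a rough sketch of the call sequence these `Encoder`/`Decoder` APIs describe (gather stats, compress, then decompress), the code below uses `java.nio.ByteBuffer` in place of the PR's `ArrayBuffer` and a trivial pass-through scheme; the trait shapes and all names are assumptions, not the PR's actual definitions.

```scala
import java.nio.ByteBuffer

// Hypothetical Encoder/Decoder shapes mirroring the APIs described above,
// using java.nio.ByteBuffer in place of the PR's ArrayBuffer. The scheme is a
// trivial pass-through, shown only to illustrate the call sequence.
trait EncoderSketch {
  def gatherCompressibilityStats(in: ByteBuffer): Unit
  def compress(from: ByteBuffer, to: ByteBuffer): Unit
}

trait DecoderSketch {
  def decompress(values: ByteBuffer): Unit
}

class PassThroughEncoder extends EncoderSketch {
  // Pass-through never shrinks the data, so both sizes grow identically.
  var uncompressedSize = 0
  var compressedSize = 0

  override def gatherCompressibilityStats(in: ByteBuffer): Unit = {
    uncompressedSize += in.remaining()
    compressedSize += in.remaining()
  }

  override def compress(from: ByteBuffer, to: ByteBuffer): Unit = to.put(from)
}

// The compressed data is handed to the Decoder's constructor, as described above.
class PassThroughDecoder(compressed: ByteBuffer) extends DecoderSketch {
  override def decompress(values: ByteBuffer): Unit = values.put(compressed)
}

object CompressionRoundTrip {
  def main(args: Array[String]): Unit = {
    val input = ByteBuffer.wrap(Array[Byte](1, 2, 3, 4))

    val encoder = new PassThroughEncoder
    encoder.gatherCompressibilityStats(input.duplicate())   // 1. estimate sizes

    val compressed = ByteBuffer.allocate(encoder.compressedSize)
    encoder.compress(input.duplicate(), compressed)          // 2. compress into `to`
    compressed.flip()

    val restored = ByteBuffer.allocate(encoder.uncompressedSize)
    new PassThroughDecoder(compressed).decompress(restored)  // 3. decompress into `values`
    assert(restored.array().sameElements(Array[Byte](1, 2, 3, 4)))
  }
}
```

A real scheme such as `RunLengthEncoding` would report a smaller `compressedSize` from its stats gathering whenever the column contains runs of repeated values.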
## How was this patch tested?
Added new test suites
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/kiszk/spark SPARK-20807
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/18033.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #18033
----
commit 6d5497ef38b3efff6ac1b1b48fe9e873f5c9394a
Author: Kazuaki Ishizaki <[email protected]>
Date: 2017-05-19T09:33:38Z
initial commit
----