GitHub user maropu opened a pull request:
https://github.com/apache/spark/pull/11461
[SPARK-13607][SQL] Improve compression performance for integer-typed values on cache
## What changes were proposed in this pull request?
This PR improves compression for integer-typed values in the in-memory
columnar cache, reducing the cache size and GC pressure.
A goal of this work is to bring the in-memory cache size close to the size
of Parquet-formatted data on disk. Since Spark uses simpler compression
algorithms in `compressionSchemes` than Parquet does, the in-memory columnar
cache is much larger than the same data stored as Parquet on disk. In one
use case (see
https://www.mail-archive.com/[email protected]/msg45241.html), 24.59GB of
Parquet data on disk grows to 41.7GB when cached. This PR reuses the bit
packers implemented in parquet-column, which Spark already has as a
package dependency.
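
To illustrate the idea only (this is not the code in this PR), below is a minimal sketch of delta encoding plus bit packing using parquet-column's existing `Packer`/`BytePacker` API; the `DeltaBitPackExample` object and the sample data are hypothetical:

```scala
import org.apache.parquet.column.values.bitpacking.Packer

object DeltaBitPackExample {
  def main(args: Array[String]): Unit = {
    // A sorted int column: deltas between neighbors are small, so they fit in a few bits.
    val values = Array(100, 103, 104, 108, 110, 111, 115, 116)
    val deltas = values.indices.map(i => if (i == 0) 0 else values(i) - values(i - 1)).toArray

    // Bit width needed for the largest delta (here 4 -> 3 bits).
    val bitWidth = 32 - Integer.numberOfLeadingZeros(deltas.max max 1)
    val packer = Packer.LITTLE_ENDIAN.newBytePacker(bitWidth)

    // pack8Values packs 8 ints into `bitWidth` bytes (8 * bitWidth bits).
    val packed = new Array[Byte](bitWidth)
    packer.pack8Values(deltas, 0, packed, 0)
    println(s"8 ints (32 bytes raw) packed into ${packed.length} bytes at width $bitWidth")

    // Unpack and rebuild the original values from the running sum of deltas.
    // (A real scheme would store the first value and the bit width in a header.)
    val unpacked = new Array[Int](8)
    packer.unpack8Values(packed, 0, unpacked, 0)
    val restored = unpacked.scanLeft(values(0))(_ + _).tail
    assert(restored.sameElements(values))
  }
}
```

Because cached integer columns (ids, timestamps, dictionary codes) often change slowly from row to row, packing deltas at a narrow bit width is where the size win over the current `compressionSchemes` comes from.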
## How was this patch tested?
Added `DeltaBinaryPackingSuite`, which exercises various input patterns for
compression and decompression.
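
As a rough, hypothetical illustration of the kind of pattern-based roundtrip checks such a suite runs (the class and helper names below are made up and this is not the actual suite), a ScalaTest sketch against the raw parquet-column bit packer could look like:

```scala
import scala.util.Random
import org.apache.parquet.column.values.bitpacking.Packer
import org.scalatest.FunSuite

// Hypothetical sketch, not the PR's DeltaBinaryPackingSuite: it only shows
// pack/unpack roundtrip checks over a few representative input patterns.
class BitPackingRoundTripSketch extends FunSuite {

  // Pack 8 ints at the given bit width, then unpack them again.
  private def roundTrip(input: Array[Int], bitWidth: Int): Array[Int] = {
    val packer = Packer.LITTLE_ENDIAN.newBytePacker(bitWidth)
    val packed = new Array[Byte](bitWidth)
    packer.pack8Values(input, 0, packed, 0)
    val unpacked = new Array[Int](8)
    packer.unpack8Values(packed, 0, unpacked, 0)
    unpacked
  }

  test("constant, increasing, and random patterns survive a roundtrip") {
    val patterns = Seq(
      Array.fill(8)(5),                  // constant column
      Array.tabulate(8)(identity),       // small increasing run
      Array.fill(8)(Random.nextInt(16))  // random values below 2^4
    )
    for (p <- patterns) {
      assert(roundTrip(p, bitWidth = 4).sameElements(p))
    }
  }
}
```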
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/maropu/spark BinaryPackingSpike
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/11461.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #11461
----
commit d443e90c3b623edd3dad51353ccbe2448f30db0d
Author: Takeshi YAMAMURO <[email protected]>
Date: 2016-02-23T05:23:41Z
Implement IntDeltaBinaryPacking in CompressionSchemes
----