Github user maropu commented on the pull request:
https://github.com/apache/spark/pull/11236#issuecomment-185123607
Anyway, I'd like to make PRs to improve compression performance in
`InMemoryRelation`.
The goal of this effort is to bring the in-memory cache size closer to the size of
Parquet-formatted data.
As a first step, I'd like to use `DeltaBinaryPackingValuesReader/Writer` from
`parquet-column` in the `IntDelta` and `LongDelta` encoders, because this efficient
integer compression can be applied to many types such as SHORT, INT, and
LONG. However, I have one technical issue:
`DeltaBinaryPackingValuesReader/Writer` keeps an internal buffer for
compressing/decompressing data, so we need to copy the whole output into a Spark
internal buffer, which adds overhead. We could avoid this by inlining the
Parquet code in Spark, but that raises a maintenance issue. A rough sketch of the
round trip (and the extra copy) follows below.
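Just to illustrate the idea (this is not the actual patch), a minimal Scala sketch of
the round trip might look like the following. The `DeltaBinaryPackingValuesWriter`
constructor arguments and the `initFromPage` signature vary across `parquet-column`
releases, so the calls below assume the byte[]-based API of the 1.7.x line; the copy
of `getBytes` into a separate `ByteBuffer` is the overhead mentioned above.

```scala
import java.nio.ByteBuffer

import org.apache.parquet.column.values.delta.{DeltaBinaryPackingValuesReader, DeltaBinaryPackingValuesWriter}

object DeltaPackingRoundTrip {
  def main(args: Array[String]): Unit = {
    val values = Array.tabulate(1024)(i => i * 3)

    // Slab/page sizes here are placeholders; the constructor arguments differ
    // across parquet-column versions, so adjust for the version on the classpath.
    val writer = new DeltaBinaryPackingValuesWriter(64 * 1024, 64 * 1024)
    values.foreach(v => writer.writeInteger(v))

    // getBytes() materializes the writer's internal buffer; copying it into a
    // Spark-managed ByteBuffer is the extra copy mentioned above.
    val packed = writer.getBytes.toByteArray
    val sparkBuffer = ByteBuffer.allocate(packed.length)
    sparkBuffer.put(packed)
    sparkBuffer.flip()

    val reader = new DeltaBinaryPackingValuesReader
    // byte[]-based initFromPage as in the 1.7.x API; newer releases take a
    // ByteBuffer (or ByteBufferInputStream) instead.
    reader.initFromPage(values.length, sparkBuffer.array(), 0)
    val decoded = Array.fill(values.length)(reader.readInteger())
    assert(decoded.sameElements(values))
  }
}
```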
As a second step, I plan to add code that applies general-purpose
compression algorithms such as LZ4 and Snappy in the final step of
`ColumnBuilder#build`. This is because the byte arrays generated
by some type-specific encoders like `DictionaryEncoding` are still compressible
with these algorithms.
Parquet also applies compression just before writing data to disk.
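As an illustration only, such a whole-buffer pass could look roughly like the
sketch below. `BufferCompression` and its methods are hypothetical names, not
actual Spark classes; it uses the `net.jpountz.lz4` API that Spark already
depends on, and a real implementation would also record the codec and the
uncompressed length in the column header so the reader side can decode.

```scala
import java.nio.ByteBuffer

import net.jpountz.lz4.LZ4Factory

// Hypothetical helper showing a final whole-buffer pass that could run at the
// end of ColumnBuilder#build, after a type-specific encoder has produced its
// output.
object BufferCompression {
  private val factory = LZ4Factory.fastestInstance()

  def compress(encoded: ByteBuffer): ByteBuffer = {
    val src = new Array[Byte](encoded.remaining())
    encoded.duplicate().get(src)
    // A real implementation would also store the codec id and uncompressed
    // length in the column header for the reader side.
    ByteBuffer.wrap(factory.fastCompressor().compress(src))
  }

  def decompress(compressed: ByteBuffer, uncompressedLength: Int): ByteBuffer = {
    val src = new Array[Byte](compressed.remaining())
    compressed.duplicate().get(src)
    ByteBuffer.wrap(factory.fastDecompressor().decompress(src, uncompressedLength))
  }
}
```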
Could you give me some suggestions on this?