Github user maropu commented on the pull request:
https://github.com/apache/spark/pull/11236#issuecomment-185123607
Anyway, I'd like to make PRs to improve compression performance in
`InMemoryRelation`.
The goal of this effort is to bring the in-memory cache size closer to the size of
Parquet-formatted data.
As a first step, I'd like to use `DeltaBinaryPackingValuesReader/Writer` from
`parquet-column` in the `IntDelta` and `LongDelta` encoders, because this efficient
integer compression can be applied to many types such as SHORT, INT, and
LONG. However, I have one technical issue:
`DeltaBinaryPackingValuesReader/Writer` keeps an internal buffer for
compressing/decompressing data, so we need to copy the whole output into a Spark
internal buffer, which adds overhead. We could avoid this by inlining the
Parquet code in Spark, but that raises a maintenance issue. A rough sketch of the
round trip (and the extra copy) follows below.
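Just to illustrate the idea (this is not the actual patch), a minimal Scala sketch of
the round trip might look like the following. The `DeltaBinaryPackingValuesWriter`
constructor arguments and the `initFromPage` signature vary across `parquet-column`
releases, so the calls below assume the byte[]-based API of the 1.7.x line; the copy
of `getBytes` into a separate `ByteBuffer` is the overhead mentioned above.

```scala
import java.nio.ByteBuffer

import org.apache.parquet.column.values.delta.{DeltaBinaryPackingValuesReader, DeltaBinaryPackingValuesWriter}

object DeltaPackingRoundTrip {
  def main(args: Array[String]): Unit = {
    val values = Array.tabulate(1024)(i => i * 3)

    // Slab/page sizes here are placeholders; the constructor arguments differ
    // across parquet-column versions, so adjust for the version on the classpath.
    val writer = new DeltaBinaryPackingValuesWriter(64 * 1024, 64 * 1024)
    values.foreach(v => writer.writeInteger(v))

    // getBytes() materializes the writer's internal buffer; copying it into a
    // Spark-managed ByteBuffer is the extra copy mentioned above.
    val packed = writer.getBytes.toByteArray
    val sparkBuffer = ByteBuffer.allocate(packed.length)
    sparkBuffer.put(packed)
    sparkBuffer.flip()

    val reader = new DeltaBinaryPackingValuesReader
    // byte[]-based initFromPage as in the 1.7.x API; newer releases take a
    // ByteBuffer (or ByteBufferInputStream) instead.
    reader.initFromPage(values.length, sparkBuffer.array(), 0)
    val decoded = Array.fill(values.length)(reader.readInteger())
    assert(decoded.sameElements(values))
  }
}
```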
As a second step, I plan to add code that applies general-purpose
compression algorithms such as LZ4 and Snappy in the final step of
`ColumnBuilder#build`. This is because the byte arrays generated
by some type-specific encoders like `DictionaryEncoding` are still compressible
with these algorithms.
Parquet also applies compression just before writing data to disk.
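As an illustration only, such a whole-buffer pass could look roughly like the
sketch below. `BufferCompression` and its methods are hypothetical names, not
actual Spark classes; it uses the `net.jpountz.lz4` API that Spark already
depends on, and a real implementation would also record the codec and the
uncompressed length in the column header so the reader side can decode.

```scala
import java.nio.ByteBuffer

import net.jpountz.lz4.LZ4Factory

// Hypothetical helper showing a final whole-buffer pass that could run at the
// end of ColumnBuilder#build, after a type-specific encoder has produced its
// output.
object BufferCompression {
  private val factory = LZ4Factory.fastestInstance()

  def compress(encoded: ByteBuffer): ByteBuffer = {
    val src = new Array[Byte](encoded.remaining())
    encoded.duplicate().get(src)
    // A real implementation would also store the codec id and uncompressed
    // length in the column header for the reader side.
    ByteBuffer.wrap(factory.fastCompressor().compress(src))
  }

  def decompress(compressed: ByteBuffer, uncompressedLength: Int): ByteBuffer = {
    val src = new Array[Byte](compressed.remaining())
    compressed.duplicate().get(src)
    ByteBuffer.wrap(factory.fastDecompressor().decompress(src, uncompressedLength))
  }
}
```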
Could you give me some suggestions on this?