[GitHub] [spark] parthchandra commented on pull request #34471: [SPARK-36879][SQL] Support Parquet v2 data page encoding (DELTA_BINARY_PACKED) for the vectorized path

GitBox Wed, 17 Nov 2021 20:32:39 -0800


parthchandra commented on pull request #34471:
URL: https://github.com/apache/spark/pull/34471#issuecomment-972529729



   I also checked on a real-world dataset. After writing the data, compression
was marginally better with Parquet v2:
   Parquet v1 (RLE) = 301.9 GB, Parquet v2 (DELTA_BINARY_PACKED, DELTA_BYTE_ARRAY) =
299.9 GB. The bulk of the data was String data, so this does not reflect the
effectiveness of DELTA_BINARY_PACKED, since most of the data would be encoded
using DELTA_BYTE_ARRAY.
   I haven't been able to run a read benchmark on this yet.
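
   To illustrate why an integer-heavy dataset would be a better test of
DELTA_BINARY_PACKED, here is a rough sketch of the idea behind delta encoding
with bit packing. It is not the actual Parquet layout (which splits values into
blocks and miniblocks with varint headers); the function names and the sample
data are made up for illustration.

```python
def bits_needed(values):
    """Bit width needed to store the largest non-negative value in `values`."""
    return max(values).bit_length() if values else 0


def delta_bit_width(values):
    """Bit width per stored value after delta encoding with a min-delta offset."""
    deltas = [b - a for a, b in zip(values, values[1:])]
    if not deltas:
        return 0
    min_delta = min(deltas)
    # Subtracting the minimum delta makes every stored value non-negative,
    # so only the spread of the deltas determines the packed width.
    adjusted = [d - min_delta for d in deltas]
    return bits_needed(adjusted)


# Hypothetical column: slowly increasing integers with a little jitter,
# e.g. epoch-second timestamps.
timestamps = [1_637_200_000 + 5 * i + (i % 3) for i in range(1024)]

plain_width = bits_needed(timestamps)      # width if each value is stored as-is
delta_width = delta_bit_width(timestamps)  # width of the packed deltas

print(f"plain: {plain_width} bits/value, delta-packed: {delta_width} bits/value")
```

   String columns get no benefit from this integer scheme, which is consistent
with the small size difference reported above.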


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



