parthchandra commented on pull request #34471: URL: https://github.com/apache/spark/pull/34471#issuecomment-972529729
I also checked on a real-world dataset. After writing the data, compression was marginally better: Parquet v1 (RLE) = 301.9 GB, Parquet v2 (DELTA_BINARY, DELTA_BYTE_ARRAY) = 299.9 GB. The bulk of the data was String data, so this does not reflect the effectiveness of DELTA_BINARY, since most of the data would have been encoded using DELTA_BYTE_ARRAY. I haven't been able to run a read benchmark on this yet.
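To put "marginally better" in numbers, here is a small sketch that computes the relative saving from the two sizes reported above. The helper name `relative_saving` is hypothetical and only for illustration; the sizes are the ones quoted in the comment.

```python
# Sizes reported in the comment, in GB.
v1_gb = 301.9  # Parquet v1 (RLE)
v2_gb = 299.9  # Parquet v2 (DELTA_BINARY, DELTA_BYTE_ARRAY)


def relative_saving(baseline: float, candidate: float) -> float:
    """Return the fraction of the baseline size saved by the candidate."""
    return (baseline - candidate) / baseline


saved_gb = v1_gb - v2_gb
saving = relative_saving(v1_gb, v2_gb)

# Prints: saved 2.0 GB (0.66% of the v1 size)
print(f"saved {saved_gb:.1f} GB ({saving:.2%} of the v1 size)")
```

That is well under 1%, which fits a mostly-String dataset where DELTA_BYTE_ARRAY, not DELTA_BINARY, does most of the encoding.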
