Hi guys,

We’re evaluating Parquet as a high-compression format for our logs. We took 
~850 GB of TSV data (some columns contain JSON); Parquet with 
CompressionCodec.GZIP gave us 6.8x compression, whereas plain GZip (with 
Deflater.BEST_COMPRESSION) gave 4.9x on the same data, i.e. about 1.4x worse. 

So the questions are:

1) Is this roughly the expected compression ratio compared to plain GZip?
2) Since we specifically crafted the Parquet schema with maps and lists for 
certain fields, is there a tool that shows the sizes of individual Parquet 
columns, so we can find the biggest ones?

Thanks in advance,
 Kirill
