Hi guys,

We're evaluating Parquet as a high-compression format for our logs. We took ~850 GB of TSV data (some columns contain JSON) and Parquet with CompressionCodec.GZIP gave us 6.8x compression, whereas plain gzip (with Deflater.BEST_COMPRESSION) gave 4.9x on the same data, i.e. Parquet compressed about 1.4x better.
So the questions are:
1) Is this an expected compression ratio compared to plain gzip?
2) Since we specifically crafted the Parquet schema with maps and lists for certain fields, is there a tool that shows the sizes of individual Parquet columns, so we can find the biggest ones?

Thanks in advance,
Kirill
