Hi Kirill,

It's hard to say what the expected compression rate should be since that's heavily data-dependent. Sounds like Parquet isn't doing too badly, though.
For inspecting the files, check out parquet-tools [1]. That can dump the metadata from a file all the way down to the page level. The "meta" command will print out each row group and column within those row groups, which should give you the info you're looking for. (There's also a rough programmatic sketch below the quoted message, in case a small program is easier to run over many files.)

rb

[1]: http://search.maven.org/#artifactdetails%7Corg.apache.parquet%7Cparquet-tools%7C1.8.1%7Cjar

On Sun, Mar 6, 2016 at 7:37 AM, Kirill Safonov <[email protected]> wrote:

> Hi guys,
>
> We're evaluating Parquet as the high-compression format for our logs. We
> took some ~850 GB of TSV data (some columns are JSON) and Parquet
> (CompressionCodec.GZIP) gave us 6.8x compression, whereas plain GZip (with
> Deflater.BEST_COMPRESSION) gave 4.9x (~1.4 times less) on the same data.
>
> So the questions are:
>
> 1) Is this a somewhat expected compression rate (compared to GZip)?
> 2) As we specially crafted the Parquet schema with maps and lists for
> certain fields, is there any tool to show the sizes of individual Parquet
> columns so we can find the biggest ones?
>
> Thanks in advance,
> Kirill

--
Ryan Blue
Software Engineer
Netflix
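P.S. If a program is more convenient than the CLI, here's a rough sketch of reading the same per-column size numbers straight from the file footer with the parquet-hadoop API (readFooter and the metadata getters shown are the 1.8.x-era calls; the ColumnSizes class name and the single-file argument are just for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;

public class ColumnSizes {
  public static void main(String[] args) throws Exception {
    // First argument: path to a single Parquet file (illustrative)
    Path file = new Path(args[0]);

    // Read only the footer; no row data is decompressed
    ParquetMetadata footer =
        ParquetFileReader.readFooter(new Configuration(), file);

    // Each row group (block) has one column chunk per leaf column
    for (BlockMetaData block : footer.getBlocks()) {
      for (ColumnChunkMetaData column : block.getColumns()) {
        System.out.printf("%s: compressed=%d bytes, uncompressed=%d bytes%n",
            column.getPath(), column.getTotalSize(),
            column.getTotalUncompressedSize());
      }
    }
  }
}

Summing the compressed sizes per column path across row groups (and across files) should surface the biggest columns quickly.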
