For anyone who would also like to test the compression codecs, I’ve
uploaded a copy of parquet-cli that can read and write zstd, lz4, and
brotli to my Apache public folder:

http://home.apache.org/~blue/

There’s also a copy of hadoop-common that has all the codec bits for
testing zstd. LZ4 should be supported by default, and brotli is built into
the parquet-cli Jar. If you want to build the brotli-codec that the Jar
uses, the project is here:

https://github.com/rdblue/brotli-codec
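
It should be a standard Maven build; the exact commands below are an
assumption, so check the project's README if they don't match:

git clone https://github.com/rdblue/brotli-codec.git
cd brotli-codec
mvn package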

All you need to do is add the hadoop-common Jar to your Hadoop install,
copy over the native libs, and run the Parquet CLI like this:

alias parquet='hadoop jar parquet-cli-0.2.0.jar org.apache.parquet.cli.Main'
parquet help
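
To write Parquet with a specific codec, pass a compression codec to the
convert command. The flag and codec names below may differ slightly, so
check the CLI's help output before relying on them:

parquet convert /path/to/table.avro -o table.zstd.parquet --compression-codec ZSTD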

rb

On Wed, Sep 27, 2017 at 5:04 PM, Ryan Blue <[email protected]> wrote:

> Hi everyone,
>
> I ran some tests using 4 of our large tables to compare compression
> codecs. I tested gzip, brotli, lz4, and zstd, all with the default
> configuration. You can find the raw data and summary tables/graphs in this
> spreadsheet:
>
> https://docs.google.com/spreadsheets/d/1MAPrKHJn1li4MEbtQ9-T1Myu-AI0AshTPSC6C0ttuIw/edit?usp=sharing
>
> For the test, I used parquet-cli to convert Avro data to Parquet using
> each compression format. Run times come from the `time` utility, so this is
> an end-to-end test, not just time spent in the compression algorithm.
> Still, the overhead was the same across all runs for a given table.
>
> I ran the tests on my laptop. I had more physical memory available than
> the maximum size of the JVM, so I don't think paging was an issue. Data was
> read from and written to my local SSD. I wrote an output file for each
> compression codec and table 5 times.
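>
> A sketch of the kind of loop this amounts to (file names and the
> compression flag are illustrative, not the exact commands I ran):
>
> for codec in GZIP BROTLI LZ4 ZSTD; do
>   for run in 1 2 3 4 5; do
>     # time the full Avro-to-Parquet conversion, not just compression
>     time parquet convert table.avro -o table-$codec-$run.parquet \
>       --compression-codec $codec
>   done
> done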
>
> I'm also attaching some sanitized summary information for a row group in
> each table.
>
> Everyone should be able to comment on the results using that link.
>
> rb
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>



-- 
Ryan Blue
Software Engineer
Netflix
