Thanks for sharing these, Ryan. Definitely intriguing.

On Wed, Sep 27, 2017 at 5:38 PM, Ryan Blue <[email protected]>
wrote:

> For anyone that would also like to test the compression codecs, I’ve
> uploaded a copy of parquet-cli that can read and write zstd, lz4, and
> brotli to my Apache public folder:
>
> http://home.apache.org/~blue/
>
> There’s also a copy of hadoop-common that has all the codec bits for
> testing zstd. LZ4 should be supported by default, and brotli is built into
> the parquet-cli Jar. If you want to build the brotli-codec that the Jar
> uses, the project is here:
>
> https://github.com/rdblue/brotli-codec
>
> All you need to do is add the hadoop-common Jar to your Hadoop install,
> copy over the native libs, and run the Parquet CLI like this:
>
> alias parquet='hadoop jar parquet-cli-0.2.0.jar org.apache.parquet.cli.Main'
> parquet help
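>
> For example, to write a zstd-compressed copy of an Avro file and check its
> metadata, something like this should work (option names may differ slightly
> by version, so see parquet help convert for the exact flags):
>
> parquet convert table.avro -o table.zstd.parquet --compression-codec ZSTD
> parquet meta table.zstd.parquet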
>
> rb
>
> On Wed, Sep 27, 2017 at 5:04 PM, Ryan Blue <[email protected]> wrote:
>
> > Hi everyone,
> >
> > I ran some tests using 4 of our large tables to compare compression
> > codecs. I tested gzip, brotli, lz4, and zstd, all with the default
> > configuration. You can find the raw data and summary tables/graphs in
> > this spreadsheet:
> >
> > https://docs.google.com/spreadsheets/d/1MAPrKHJn1li4MEbtQ9-T1Myu-AI0AshTPSC6C0ttuIw/edit?usp=sharing
> >
> > For the test, I used parquet-cli to convert Avro data to Parquet using
> > each compression format. Run times come from the `time` utility, so this
> > is an end-to-end test, not just time spent in the compression algorithm.
> > Still, the overhead was the same across all runs for a given table.
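> >
> > As a rough sketch, each measured run had this shape (the convert options
> > shown are illustrative rather than the exact command line):
> >
> > time parquet convert table.avro -o table.gzip.parquet --compression-codec GZIP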
> >
> > I ran the tests on my laptop. I had more physical memory available than
> > the maximum size of the JVM, so I don't think paging was an issue. Data
> > was read from and written to my local SSD. I wrote an output file for each
> > compression codec and table 5 times.
> >
> > I'm also attaching some sanitized summary information for a row group in
> > each table.
> >
> > Everyone should be able to comment on the results using that link.
> >
> > rb
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
> >
>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
