Thanks for sharing these, Ryan. Definitely intriguing.

On Wed, Sep 27, 2017 at 5:38 PM, Ryan Blue <[email protected]> wrote:
> For anyone that would also like to test the compression codecs, I’ve
> uploaded a copy of parquet-cli that can read and write zstd, lz4, and
> brotli to my Apache public folder:
>
> http://home.apache.org/~blue/
>
> There’s also a copy of hadoop-common that has all the codec bits for
> testing zstd. LZ4 should be supported by default, and brotli is built into
> the parquet-cli Jar. If you want to build the brotli-codec that the Jar
> uses, the project is here:
>
> https://github.com/rdblue/brotli-codec
>
> All you need to do is add the hadoop-common Jar to your Hadoop install,
> copy over the native libs, and run the Parquet CLI like this:
>
> alias parquet='hadoop jar parquet-cli-0.2.0.jar org.apache.parquet.cli.Main'
> parquet help
>
> rb
>
>
> On Wed, Sep 27, 2017 at 5:04 PM, Ryan Blue <[email protected]> wrote:
>
> > Hi everyone,
> >
> > I ran some tests using 4 of our large tables to compare compression
> > codecs. I tested gzip, brotli, lz4, and zstd, all with the default
> > configuration. You can find the raw data and summary tables/graphs in
> > this spreadsheet:
> >
> > https://docs.google.com/spreadsheets/d/1MAPrKHJn1li4MEbtQ9-T1Myu-AI0AshTPSC6C0ttuIw/edit?usp=sharing
> >
> > For the test, I used parquet-cli to convert Avro data to Parquet using
> > each compression format. Run times come from the `time` utility, so this
> > is an end-to-end test, not just time spent in the compression algorithm.
> > Still, the overhead was the same across all runs for a given table.
> >
> > I ran the tests on my laptop. I had more physical memory available than
> > the maximum size of the JVM, so I don't think paging was an issue. Data
> > was read from and written to my local SSD. I wrote an output file for
> > each compression codec and table 5 times.
> >
> > I'm also attaching some sanitized summary information for a row group in
> > each table.
> >
> > Everyone should be able to comment on the results using that link.
> >
> > rb
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
> >
> >
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
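
For anyone who wants to repeat the timing runs described above (one output file per codec and table, five times each, measured end to end), here is a minimal Python sketch of that loop. It is not the script that was actually used. The `convert` subcommand, the `-o` and `--compression-codec` flags, and the `<table>.avro` input naming are assumptions on my part, so check them against `parquet help` before relying on them. The helper names (`build_command`, `time_command`, `benchmark`) are made up for illustration.

```python
import statistics
import subprocess
import sys
import time

CODECS = ["gzip", "brotli", "lz4", "zstd"]  # codecs compared in the thread
RUNS = 5  # each codec/table pair was written 5 times


def build_command(table, codec, run):
    """Hypothetical parquet-cli invocation; the subcommand, flags, and file
    names are assumptions, so verify them with `parquet help` first."""
    return [
        "hadoop", "jar", "parquet-cli-0.2.0.jar", "org.apache.parquet.cli.Main",
        "convert", f"{table}.avro",
        "-o", f"{table}.{codec}.{run}.parquet",  # separate output file per run
        "--compression-codec", codec,
    ]


def time_command(argv):
    """Run one command and return its end-to-end wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(argv, check=True, stdout=subprocess.DEVNULL)
    return time.perf_counter() - start


def benchmark(table, make_command=build_command, codecs=CODECS, runs=RUNS):
    """Return {codec: [seconds, ...]} for one table."""
    return {
        codec: [time_command(make_command(table, codec, run)) for run in range(runs)]
        for codec in codecs
    }


if __name__ == "__main__":
    # Usage: python bench.py table1 table2 ...
    for table in sys.argv[1:]:
        for codec, times in benchmark(table).items():
            print(f"{table} {codec}: median {statistics.median(times):.2f}s, "
                  f"min {min(times):.2f}s")
```

Like the `time` utility, this measures the whole process (JVM startup, Avro read, Parquet write), not just time spent in the compression algorithm.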
