For anyone who would also like to test the compression codecs, I’ve uploaded a copy of parquet-cli that can read and write zstd, lz4, and brotli to my Apache public folder:
http://home.apache.org/~blue/

There’s also a copy of hadoop-common that has all the codec bits for
testing zstd. LZ4 should be supported by default, and brotli is built
into the parquet-cli Jar. If you want to build the brotli-codec that the
Jar uses, the project is here:

https://github.com/rdblue/brotli-codec

All you need to do is add the hadoop-common Jar to your Hadoop install,
copy over the native libs, and run the Parquet CLI like this:

    alias parquet='hadoop jar parquet-cli-0.2.0.jar org.apache.parquet.cli.Main'
    parquet help

rb

On Wed, Sep 27, 2017 at 5:04 PM, Ryan Blue <[email protected]> wrote:

> Hi everyone,
>
> I ran some tests using 4 of our large tables to compare compression
> codecs. I tested gzip, brotli, lz4, and zstd, all with the default
> configuration. You can find the raw data and summary tables/graphs in
> this spreadsheet:
>
> https://docs.google.com/spreadsheets/d/1MAPrKHJn1li4MEbtQ9-T1Myu-AI0AshTPSC6C0ttuIw/edit?usp=sharing
>
> For the test, I used parquet-cli to convert Avro data to Parquet using
> each compression format. Run times come from the `time` utility, so
> this is an end-to-end test, not just time spent in the compression
> algorithm. Still, the overhead was the same across all runs for a given
> table.
>
> I ran the tests on my laptop. I had more physical memory available than
> the maximum size of the JVM, so I don't think paging was an issue. Data
> was read from and written to my local SSD. I wrote an output file for
> each compression codec and table 5 times.
>
> I'm also attaching some sanitized summary information for a row group
> in each table.
>
> Everyone should be able to comment on the results using that link.
>
> rb
>
> --
> Ryan Blue
> Software Engineer
> Netflix

--
Ryan Blue
Software Engineer
Netflix
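For anyone setting this up from scratch, the steps are roughly the
following. The Hadoop paths shown are the standard layout and may differ
in your install, and the source paths are placeholders for wherever you
put the downloaded jars and native libs:

    # jars and native libs downloaded from http://home.apache.org/~blue/
    cp hadoop-common-*.jar $HADOOP_HOME/share/hadoop/common/
    cp native/* $HADOOP_HOME/lib/native/   # native codec libs for zstd

    # run parquet-cli through hadoop so the codecs are on the classpath
    alias parquet='hadoop jar parquet-cli-0.2.0.jar org.apache.parquet.cli.Main'
    parquet help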

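To reproduce the timing runs from the quoted message, each run amounts
to converting the same Avro file with a different codec under the
shell's `time`, something like the sketch below. The file names are
placeholders, and the convert options shown are the ones I'd expect in
parquet-cli 0.2.0; `parquet help convert` will show the exact usage:

    # one output file per codec and table, 5 runs each, timed end to end
    for codec in GZIP BROTLI LZ4 ZSTD; do
      for run in 1 2 3 4 5; do
        time parquet convert table.avro \
            -o table-$codec-run$run.parquet \
            --compression-codec $codec
      done
    done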