Hello Cary,

actually, this is a feature built directly into Apache Parquet. The format is designed so that within a single file you only need to read the data for the columns you request; there is no need to spread them over multiple files. Parquet's footer metadata contains the byte offsets of every column chunk, so a reader only has to parse and load the metadata and can then read exactly the columns it needs. This is transparent to the FS implementation you use; it only needs to be supported by the specific ParquetReader you use. If you want to learn more about the format, it is worth reading Julien's blog post https://blog.twitter.com/2013/dremel-made-simple-with-parquet or the format spec: https://github.com/apache/parquet-format
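Since you already use AvroParquetWriter, here is a minimal sketch of the reading side with parquet-avro; the file path, record name, and field names are made up for illustration. The projection is requested through AvroReadSupport.setRequestedProjection, and Parquet then only fetches the footer plus the byte ranges of the projected column chunks:

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetReader;
    import org.apache.parquet.avro.AvroReadSupport;
    import org.apache.parquet.hadoop.ParquetReader;

    public class ProjectedParquetRead {
        public static void main(String[] args) throws Exception {
            // Avro schema holding only the columns we want to read.
            // The field names must match your writer schema; "user_id"
            // is just a placeholder here.
            Schema projection = new Schema.Parser().parse(
                "{\"type\": \"record\", \"name\": \"Event\", \"fields\": ["
                + "{\"name\": \"user_id\", \"type\": \"long\"}]}");

            Configuration conf = new Configuration();
            // Ask the Avro read support to only materialize the projected
            // columns; the column chunks of all other fields are skipped.
            AvroReadSupport.setRequestedProjection(conf, projection);

            // Hypothetical location of a single Parquet file on S3.
            Path file = new Path("s3a://my-bucket/data/events.parquet");
            try (ParquetReader<GenericRecord> reader =
                     AvroParquetReader.<GenericRecord>builder(file).withConf(conf).build()) {
                GenericRecord record;
                while ((record = reader.read()) != null) {
                    System.out.println(record.get("user_id"));
                }
            }
        }
    }

Only the footer and the projected column chunks are actually read; the byte ranges of the other columns are never requested from the store.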
Note that the S3(a) implementation in Hadoop < 2.8 does not support random access very well. Using Hadoop 2.8+/3.0+ and setting fs.s3a.experimental.input.fadvise=random should give you a very significant performance boost. Sadly, both are still unreleased, but you can already check out the alphas; a small configuration sketch follows below the quoted message.

Cheers,
Uwe

On Mon, Feb 6, 2017, at 10:03 PM, Cary Cherng wrote:
> How does one write output to multiple S3 files so that reading a
> single column does not read all the S3 data? How can this be done?
>
> If it's a single file, I've figured out how to output using
> AvroParquetWriter. But I can't find any documentation or examples of
> how to partition by column to multiple files on S3. Does Parquet even
> support this?
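For reference, here is a minimal sketch of how that setting could be applied from Java. Only the property key itself comes from Hadoop; the class and method names are illustrative:

    import org.apache.hadoop.conf.Configuration;

    public class S3ARandomAccess {
        public static Configuration randomAccessConf() {
            Configuration conf = new Configuration();
            // Tell S3A to issue ranged GET requests sized to each read instead
            // of streaming the rest of the object, which makes the seek-heavy
            // footer-then-columns access pattern of Parquet much cheaper.
            // Hadoop < 2.8 simply ignores this key.
            conf.set("fs.s3a.experimental.input.fadvise", "random");
            return conf;
        }
    }

The resulting Configuration can be passed to the reader via withConf(conf), as in the projection sketch above, so the S3A tuning and the column projection travel together.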
