Hello Cary,

actually, this is a feature built directly into Apache Parquet. The format is designed so that within a single file you only need to read the data for the columns you request; there is no need to spread them over multiple files. Parquet's footer metadata contains the byte offsets of every column chunk, so a reader only has to parse and load the metadata and can then read exactly the columns it needs. This is transparent to the FS implementation you use; it only needs to be supported by the specific ParquetReader you use. If you want to learn more about the format, it is worth reading Julien's blog post https://blog.twitter.com/2013/dremel-made-simple-with-parquet or the format spec: https://github.com/apache/parquet-format
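Since you already use AvroParquetWriter, here is a minimal sketch of the reading side with parquet-avro; the file path, record name, and field names are made up for illustration. The projection is requested through AvroReadSupport.setRequestedProjection, and Parquet then only fetches the footer plus the byte ranges of the projected column chunks:

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetReader;
    import org.apache.parquet.avro.AvroReadSupport;
    import org.apache.parquet.hadoop.ParquetReader;

    public class ProjectedParquetRead {
        public static void main(String[] args) throws Exception {
            // Avro schema holding only the columns we want to read.
            // The field names must match your writer schema; "user_id"
            // is just a placeholder here.
            Schema projection = new Schema.Parser().parse(
                "{\"type\": \"record\", \"name\": \"Event\", \"fields\": ["
                + "{\"name\": \"user_id\", \"type\": \"long\"}]}");

            Configuration conf = new Configuration();
            // Ask the Avro read support to only materialize the projected
            // columns; the column chunks of all other fields are skipped.
            AvroReadSupport.setRequestedProjection(conf, projection);

            // Hypothetical location of a single Parquet file on S3.
            Path file = new Path("s3a://my-bucket/data/events.parquet");
            try (ParquetReader<GenericRecord> reader =
                     AvroParquetReader.<GenericRecord>builder(file).withConf(conf).build()) {
                GenericRecord record;
                while ((record = reader.read()) != null) {
                    System.out.println(record.get("user_id"));
                }
            }
        }
    }

Only the footer and the projected column chunks are actually read; the byte ranges of the other columns are never requested from the store.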
Note that the S3(a) implementation in Hadoop < 2.8 does not support random access very well. Using Hadoop 2.8+/3.0+ and setting fs.s3a.experimental.input.fadvise=random should give you a very significant performance boost. Sadly, both are still unreleased, but you can already check out the alphas; a small configuration sketch follows below the quoted message.

Cheers,
Uwe

On Mon, Feb 6, 2017, at 10:03 PM, Cary Cherng wrote:
> How does one write output to multiple S3 files so that reading a
> single column does not read all the S3 data? How can this be done?
>
> If it's a single file, I've figured out how to output using
> AvroParquetWriter. But I can't find any documentation or examples of
> how to partition by column to multiple files on S3. Does Parquet even
> support this?
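For reference, here is a minimal sketch of how that setting could be applied from Java. Only the property key itself comes from Hadoop; the class and method names are illustrative:

    import org.apache.hadoop.conf.Configuration;

    public class S3ARandomAccess {
        public static Configuration randomAccessConf() {
            Configuration conf = new Configuration();
            // Tell S3A to issue ranged GET requests sized to each read instead
            // of streaming the rest of the object, which makes the seek-heavy
            // footer-then-columns access pattern of Parquet much cheaper.
            // Hadoop < 2.8 simply ignores this key.
            conf.set("fs.s3a.experimental.input.fadvise", "random");
            return conf;
        }
    }

The resulting Configuration can be passed to the reader via withConf(conf), as in the projection sketch above, so the S3A tuning and the column projection travel together.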
