Suppose I want to read from columns A and B where column A has some value v. That is, I want to read columns A and B and filter out all the rows where column A does not have value v. Is there a way to partition the data into many files based on the value of column A, so as to avoid reading and filtering unwanted rows?
On Tue, Feb 7, 2017 at 1:06 AM, Uwe L. Korn <[email protected]> wrote:
> Hello Cary,
>
> actually this is a feature built directly into Apache Parquet. The format
> is designed so that you only need to read the data for the columns you
> need inside a single file; there is no need to spread them over multiple
> files. Parquet's metadata contains the needed pointers to the bytes, so
> you only need to parse and load the metadata and can then directly read
> just the data for those columns. This should be transparent to the FS
> implementation you use; it only needs to be supported by the specific
> ParquetReader you use. If you want to learn more about the format, it is
> worth reading Julien's blog post
> https://blog.twitter.com/2013/dremel-made-simple-with-parquet or the
> format spec: https://github.com/apache/parquet-format
>
> Note that the S3(a) implementation of Hadoop <2.8 does not support random
> access very well. Using Hadoop 2.8+/3.0+ and setting
> fs.s3a.experimental.input.fadvise=random should give you a very
> significant performance boost. Sadly both are still unreleased, but you
> can already check out the alphas.
>
> Cheers
> Uwe
>
> On Mon, Feb 6, 2017, at 10:03 PM, Cary Cherng wrote:
>> How does one write to more than just one file and output to multiple s3
>> files, so that reading a single column does not read all the s3 data?
>> How can this be done?
>>
>> If it's a single file, I've figured out how to output using
>> AvroParquetWriter. But I can't find any documentation or examples of
>> how to partition by column to multiple files on s3. Does Parquet even
>> support this?
