Suppose I want to read from columns A and B where column A has some
value v. That is, I want to read columns A and B and filter out all
rows where column A does not have the value v. Is there a way to
partition the data into many files based on the value of column A, so
as to avoid having to read and filter out the unwanted rows at all?
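The layout I have in mind is a hive-style partition scheme: one file (or directory) per distinct value of column A, so a reader opens only the partition for value v and never touches the rest. A minimal, Parquet-free sketch in plain Python (the helper names and the `A=<value>` directory layout are just for illustration):

```python
import csv
import os
import tempfile

def write_partitioned(rows, key, out_dir):
    """Write rows into one CSV file per distinct value of `key`,
    using a hive-style layout: out_dir/<key>=<value>/part.csv"""
    by_value = {}
    for row in rows:
        by_value.setdefault(row[key], []).append(row)
    for value, group in by_value.items():
        part_dir = os.path.join(out_dir, f"{key}={value}")
        os.makedirs(part_dir, exist_ok=True)
        with open(os.path.join(part_dir, "part.csv"), "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=sorted(group[0]))
            writer.writeheader()
            writer.writerows(group)

def read_partition(out_dir, key, value):
    """Read only the partition for `value` -- no filtering needed,
    because the other rows' bytes are never read."""
    path = os.path.join(out_dir, f"{key}={value}", "part.csv")
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

rows = [
    {"A": "v", "B": "1"},
    {"A": "w", "B": "2"},
    {"A": "v", "B": "3"},
]
out = tempfile.mkdtemp()
write_partitioned(rows, "A", out)
print(read_partition(out, "A", "v"))  # only the rows where A == "v"
```

The filter on A becomes a path lookup instead of a scan, at the cost of one file per distinct value of A.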

On Tue, Feb 7, 2017 at 1:06 AM, Uwe L. Korn <[email protected]> wrote:
> Hello Cary,
>
> Actually, this is a feature built directly into Apache Parquet. The
> format is designed so that you only need to read the data for the
> columns you want within a single file; there is no need to spread them
> over multiple files. Parquet's metadata contains pointers to the
> relevant byte ranges, so you only need to parse and load the metadata
> and can then read the data for just those columns. This should be
> transparent to the FS implementation you use; it only needs to be
> supported by the specific ParquetReader you use. If you want to learn
> more about the format, it is worth reading Julien's blog post
> https://blog.twitter.com/2013/dremel-made-simple-with-parquet or the
> format spec: https://github.com/apache/parquet-format
>
> Note that the S3(a) implementation in Hadoop <2.8 doesn't support
> random access very well. Using Hadoop 2.8+/3.0+ and setting
> fs.s3a.experimental.input.fadvise=random should give you a very
> significant performance boost. Sadly, both are still unreleased, but
> you can already check out the alphas.
>
> Cheers
> Uwe
>
> On Mon, Feb 6, 2017, at 10:03 PM, Cary Cherng wrote:
>> How does one write output to multiple S3 files, so that reading a
>> single column does not require reading all of the S3 data? How can
>> this be done?
>>
>> For a single file, I've figured out how to write output using
>> AvroParquetWriter. But I can't find any documentation or examples of
>> how to partition by column into multiple files on S3. Does Parquet
>> even support this?
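Uwe's point — that Parquet's footer metadata holds pointers to each column's bytes, so a reader can fetch just the columns it needs — can be illustrated with a toy columnar format in plain Python. This is not real Parquet; the layout and names are invented for the sketch:

```python
import io
import json
import struct

def write_columnar(columns, buf):
    """Toy columnar writer: column data first, then a JSON footer
    mapping column name -> (offset, length), then the footer size."""
    meta = {}
    for name, values in columns.items():
        data = json.dumps(values).encode()
        meta[name] = (buf.tell(), len(data))
        buf.write(data)
    footer = json.dumps(meta).encode()
    buf.write(footer)
    buf.write(struct.pack("<I", len(footer)))  # last 4 bytes = footer size

def read_column(buf, name):
    """Read one column: parse the footer, then seek directly to that
    column's byte range -- the other columns' bytes are never read."""
    buf.seek(-4, io.SEEK_END)
    (footer_len,) = struct.unpack("<I", buf.read(4))
    buf.seek(-4 - footer_len, io.SEEK_END)
    meta = json.loads(buf.read(footer_len))
    offset, length = meta[name]
    buf.seek(offset)
    return json.loads(buf.read(length))

buf = io.BytesIO()
write_columnar({"A": ["v", "w", "v"], "B": [1, 2, 3]}, buf)
print(read_column(buf, "B"))  # [1, 2, 3], without touching column A's bytes
```

On S3 the seeks become ranged GETs, which is why the fadvise=random setting Uwe mentions matters: sequential-read-optimized clients throw away exactly the random-access pattern this depends on.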
