Hi,

My comments inline

On Sun, Aug 17, 2014 at 7:30 PM, Gary Malouf <[email protected]> wrote:

> My team currently uses Apache Spark over different types of 'tables' of
> protobuf serialized to HDFS.  Today, the performance of our queries is
> less than ideal and we are trying to figure out if using Parquet in
> specific places will help us.
>
> Questions:
>
> 1) Does a single protobuf message get broken up over a number of columns as
> it seems to read?
>

A file of Protobuf messages written one directly after another is an
example of a row-wise format, meaning that all the fields of a record
are stored together. In a columnar format like Parquet, the values of
each column are stored together instead, which lets you encode
(compress) and read individual columns efficiently.
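
To make the layout difference concrete, here is a minimal Scala sketch;
the Event record and its fields are made up for illustration:

    case class Event(userId: Long, url: String)

    val events = Seq(Event(1L, "/a"), Event(2L, "/b"), Event(3L, "/c"))

    // Row-wise (e.g. Protobuf messages appended to a file): the fields
    // of each record are adjacent on disk:
    //   1, "/a", 2, "/b", 3, "/c"

    // Columnar (e.g. Parquet): the values of each column are adjacent,
    // so a single column can be compressed and scanned on its own:
    val userIds = events.map(_.userId) // 1, 2, 3
    val urls    = events.map(_.url)    // "/a", "/b", "/c"

Keeping similar values next to each other is also why columnar
encodings (run-length, dictionary, delta) tend to compress much better
than row-wise data.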


>
> 2) Our protobuf has mostly required fields - how does Parquet work with
> this when at query time we sometimes only need say 2 of our 15 fields?
>

This is an ideal use case for a columnar format such as Parquet, since
you won't have to read the fields you don't care about (the other 13)
off disk.
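
As a rough sketch with Spark's DataFrame API (the object name, path,
and column names below are assumptions, not from your setup), a
projection like this reads only the two requested column chunks:

    import org.apache.spark.sql.SparkSession

    object TwoOfFifteen {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("projection-sketch")
          .getOrCreate()

        // Hypothetical path; the table is assumed to have 15 columns.
        val events = spark.read.parquet("hdfs:///data/events.parquet")

        // Parquet lays out each column in its own chunks, so this query
        // fetches only the userId and url chunks; the other 13 columns
        // are never read from disk.
        events.select("userId", "url").show()

        spark.stop()
      }
    }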
