Hi Brock,

Thank you for following up - I have some follow-ups in-line:
On Sun, Aug 17, 2014 at 11:44 PM, Brock Noland <[email protected]> wrote:

> Hi,
>
> My comments inline
>
> On Sun, Aug 17, 2014 at 7:30 PM, Gary Malouf <[email protected]> wrote:
> >
> > My team currently uses Apache Spark over different types of 'tables' of
> > protobuf serialized to HDFS. Today, the performance of our queries is
> > less than ideal, and we are trying to figure out if using Parquet in
> > specific places will help us.
> >
> > Questions:
> >
> > 1) Does a single protobuf message get broken up over a number of
> > columns, as it appears from my reading?
>
> Storing a file with Protobuf messages written one directly after another
> would be an example of a row-wise format, meaning that all the columns of
> a row are grouped together. In a columnar format like Parquet, the values
> of a column are grouped together. This allows you to efficiently encode
> (compress) and read columns.

So I interpret this to mean that each field's value in a protobuf message
is grouped together with the same field's values from the other messages.
(I've put a small sketch of the two layouts at the end of this mail.)

> > 2) Our protobuf has mostly required fields - how does Parquet work with
> > this when at query time we sometimes only need, say, 2 of our 15 fields?
>
> This is an ideal use of a columnar format such as Parquet, since you
> won't have to read the fields you don't care about (the other 13) off
> disk.

This only partially answers my question - protobuf messages have a concept
of 'required fields' - wouldn't the message fail to initialize in memory
(for example, on the JVM) if some of these are not set?
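To make sure I have the row-wise vs. columnar distinction right, here is the
toy picture in my head - plain Java, with made-up names (Event, userId, url,
httpStatus). This only illustrates the grouping, not Parquet's actual on-disk
format:

    import java.util.Arrays;
    import java.util.List;

    public class LayoutSketch {

        // Row-wise (e.g., protobuf messages appended to a file): all the
        // fields of one record sit next to each other, so reading any one
        // field still pulls the whole record off disk.
        static class Event {
            final long userId;
            final String url;
            final int httpStatus;

            Event(long userId, String url, int httpStatus) {
                this.userId = userId;
                this.url = url;
                this.httpStatus = httpStatus;
            }
        }

        public static void main(String[] args) {
            List<Event> rowWise = Arrays.asList(
                new Event(1L, "/a", 200),
                new Event(2L, "/b", 404),
                new Event(3L, "/c", 200));

            // Columnar (the Parquet idea): each field's values are stored
            // contiguously. A query that only needs httpStatus touches just
            // this one array, and runs of repeated values (200, 200, ...)
            // encode and compress well.
            long[]   userIds      = {1L, 2L, 3L};
            String[] urls         = {"/a", "/b", "/c"};
            int[]    httpStatuses = {200, 404, 200};

            System.out.println("records (row-wise): " + rowWise.size());
            System.out.println("httpStatus column: " + Arrays.toString(httpStatuses));
        }
    }

If that mental model is right, then our 2-of-15-fields query should only ever
read 2 of the 15 'arrays'.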

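And on question 2, here is what I mean about required fields, using plain
protobuf-java. Event below stands in for a hypothetical generated class whose
.proto marks http_status as required - it is not our real schema, so this
won't compile without that generated code:

    import com.google.protobuf.UninitializedMessageException;

    public class RequiredFieldSketch {
        public static void main(String[] args) {
            // Assume a class generated from:
            //   message Event {
            //     required int64  user_id     = 1;
            //     required string url         = 2;
            //     required int32  http_status = 3;
            //   }
            Event.Builder builder = Event.newBuilder()
                .setUserId(42L)
                .setUrl("/a");   // http_status deliberately left unset

            try {
                // build() enforces required fields, so this throws.
                builder.build();
            } catch (UninitializedMessageException e) {
                System.out.println("build() failed: " + e.getMessage());
            }

            // buildPartial() skips the check and hands back a message with
            // the required field missing; isInitialized() reports false.
            Event partial = builder.buildPartial();
            System.out.println("initialized? " + partial.isInitialized());  // false
        }
    }

So if Parquet only materializes 2 of our 15 columns, my guess is the reader
has to assemble messages via something like buildPartial() (or map into a
different in-memory type) rather than the fully-checked build() - is that
what actually happens?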