On Sun, Aug 17, 2014 at 9:12 PM, Gary Malouf <[email protected]> wrote:
> Hi Brock, > > Thank you for following up - I have some follow ups in-line: > > > On Sun, Aug 17, 2014 at 11:44 PM, Brock Noland <[email protected]> wrote: > > > Hi, > > > > My comments inline > > > > On Sun, Aug 17, 2014 at 7:30 PM, Gary Malouf <[email protected]> > > wrote: > > > > > My team currently uses Apache Spark over different types 'tables' of > > > protobuf serialized to HDFS. Today, the performance of our queries is > > less > > > than ideal and we are trying to figure out if using Parquet in specific > > > places will help us. > > > > > > Questions: > > > > > > 1) Does a single protobuf message get broken up over a number of > columns > > as > > > it seems to read? > > > > > > > Storing a file with Protobuf messages written one directly after another > > would be an example of a row-wise format. Meaning that all columns in a > row > > are grouped together. In a columnar format like Parquet, values of > columns > > are grouped together. This allows you to efficiently encode (compress) > and > > read columns. > > > > So I interpret this to mean each field in a protobuf message's value is > grouped along with that same value for other messages together. > Here is an example PB schema: https://github.com/apache/incubator-parquet-mr/blob/master/parquet-protobuf/src/test/resources/TestProtobuf.proto#L34 and the resulting Parquet schema: https://github.com/apache/incubator-parquet-mr/blob/master/parquet-protobuf/src/test/java/parquet/proto/ProtoSchemaConverterTest.java#L47 > > > > > > > > > > > 2) Our protobuf has mostly required fields - how does Parquet work with > > > this when at query time we sometimes only need say 2 of our 15 fields? > > > > > > > This is an ideal use of a columnar format such as Parquet since you won't > > have to read the fields you don't care (the other 13) about off disk. > > > > This only partially answers my question - protobuf messages have a concept > of 'required fields' - wouldn't the message fail to initialize in-memory > (for example, the JVM) if some of these are not set? > IIRC Protobuf throws this exception when calling build on the message builder. I am guessing that is why ProtoRecordConverter allows you to obtain the builder as opposed to the build message, by setting a flag: https://github.com/apache/incubator-parquet-mr/blob/master/parquet-protobuf/src/main/java/parquet/proto/ProtoRecordConverter.java#L78
