My team currently uses Apache Spark over different types of 'tables' of
protobuf messages serialized to HDFS.  Today the performance of our queries
is less than ideal, and we are trying to figure out whether using Parquet
in specific places would help us.

Questions:

1) Does a single protobuf message get split across a number of columns, as
the documentation seems to suggest?

2) Our protobuf messages have mostly required fields. How does Parquet
handle this when, at query time, we sometimes only need say 2 of our 15
fields?
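
For context, here is roughly the access pattern we have in mind (a minimal
sketch in Scala; the field names, HDFS path, and 4-column schema are
invented for illustration, our real messages have 15 mostly-required
fields):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("parquet-test").getOrCreate()
    import spark.implicits._

    // Stand-in for a DataFrame we would decode from one of our protobuf
    // "tables" (hypothetical fields; the real schema is wider).
    val records = Seq(
      ("u1", 1620000000L, "click", "us-east"),
      ("u2", 1620000100L, "view",  "eu-west")
    ).toDF("userId", "ts", "event", "region")

    // One-time conversion to Parquet on HDFS (path is made up).
    records.write.mode("overwrite").parquet("hdfs:///tmp/events.parquet")

    // At query time we select only the 2 columns we need, hoping Parquet's
    // columnar layout means the other columns are never read from disk.
    val pruned = spark.read.parquet("hdfs:///tmp/events.parquet")
      .select("userId", "event")
    pruned.show()
    // pruned.explain() should show a ReadSchema containing only these
    // 2 columns if pruning is happening as we expect.
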
