My team currently runs Apache Spark over several different 'tables' of protobuf messages serialized to HDFS. Query performance today is less than ideal, and we are trying to figure out whether using Parquet in specific places will help us.
Questions:

1) When a protobuf message is written out as Parquet, does a single message get split across a number of columns, as it seems from what we've read?

2) Our protobuf schema has mostly required fields - how does Parquet handle this when, at query time, we sometimes only need, say, 2 of our 15 fields?
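To make question 2 concrete, the read pattern we have in mind is roughly the sketch below (the path, field names, and object name are placeholders, not our real schema):

```scala
import org.apache.spark.sql.SparkSession

object ParquetProjectionExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parquet-projection-example")
      .getOrCreate()

    // Parquet stores each field as its own column, so selecting two fields
    // should only read those two columns from disk rather than all 15.
    val events = spark.read
      .parquet("hdfs:///data/events")    // placeholder path
      .select("user_id", "event_time")   // placeholder field names

    events.show(10)
    spark.stop()
  }
}
```

Is this the kind of query where Parquet's columnar layout would actually pay off for us, given that the protobuf fields are mostly required?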
