Hello,

We are using Parquet, Avro, and Spark for analytics.
About 40-50% of our query time is spent deserializing data, specifically in the Parquet-Avro interaction (AvroRecordMaterializer) constructing Avro GenericRecords. We determined this by using a custom ReadSupport class with a dummy record materializer that simply discards the data passed to it.

What options do we have for optimizing in this area? We could move away from Avro, but we observed the same overhead with protobufs. What do others do to overcome this?

Additionally, even with a predicate pushdown (Filter2 API) that rules out some rows, we still see the full deserialization cost whenever a row group cannot be excluded by its metadata checks. How can we avoid deserializing the whole record when the predicate does not match it?

Thanks for your great work!
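For reference, the dummy-materializer measurement we used can be sketched roughly as below. Class names here are our own, this assumes a flat schema for brevity, and it is written against the org.apache.parquet package layout; it is a sketch of the technique, not our exact code:

```java
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.hadoop.api.InitContext;
import org.apache.parquet.hadoop.api.ReadSupport;
import org.apache.parquet.io.api.Binary;
import org.apache.parquet.io.api.Converter;
import org.apache.parquet.io.api.GroupConverter;
import org.apache.parquet.io.api.PrimitiveConverter;
import org.apache.parquet.io.api.RecordMaterializer;
import org.apache.parquet.schema.MessageType;

/**
 * A ReadSupport that discards every value handed to it, so a full scan
 * measures only Parquet decode/assembly cost, with no record building.
 */
public class NoopReadSupport extends ReadSupport<Void> {

  @Override
  public ReadContext init(InitContext context) {
    // Request every column; narrow the schema here to time pruned scans.
    return new ReadContext(context.getFileSchema());
  }

  @Override
  public RecordMaterializer<Void> prepareForRead(
      Configuration conf,
      Map<String, String> keyValueMetaData,
      MessageType fileSchema,
      ReadContext readContext) {
    return new NoopRecordMaterializer(fileSchema);
  }

  private static final class NoopRecordMaterializer extends RecordMaterializer<Void> {
    private final GroupConverter root;

    NoopRecordMaterializer(MessageType schema) {
      this.root = new NoopGroupConverter(schema.getFieldCount());
    }

    @Override
    public Void getCurrentRecord() {
      return null; // nothing is ever materialized
    }

    @Override
    public GroupConverter getRootConverter() {
      return root;
    }
  }

  // Swallows group structure; assumes a flat (all-primitive) schema.
  private static final class NoopGroupConverter extends GroupConverter {
    private final Converter[] converters;

    NoopGroupConverter(int fieldCount) {
      converters = new Converter[fieldCount];
      for (int i = 0; i < fieldCount; i++) {
        converters[i] = new NoopPrimitiveConverter();
      }
    }

    @Override public Converter getConverter(int fieldIndex) { return converters[fieldIndex]; }
    @Override public void start() {}
    @Override public void end() {}
  }

  // Overrides every add* so values are dropped instead of throwing
  // (the base PrimitiveConverter methods throw UnsupportedOperationException).
  private static final class NoopPrimitiveConverter extends PrimitiveConverter {
    @Override public void addBinary(Binary value) {}
    @Override public void addBoolean(boolean value) {}
    @Override public void addDouble(double value) {}
    @Override public void addFloat(float value) {}
    @Override public void addInt(int value) {}
    @Override public void addLong(long value) {}
  }
}
```

Pointing a ParquetReader built with this ReadSupport at a file and iterating gives the scan-only baseline; the gap between that and the Avro read is the materialization cost we quoted.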
