Hello,

We are using Parquet, Avro and Spark for analytics.

About 40-50% of our query time is spent deserializing data, specifically in
the Parquet-Avro interaction (AvroRecordMaterializer) constructing Avro
GenericRecord instances. We determined this by using a custom ReadSupport
class with a dummy record materializer that simply discards the data passed
to it.
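For reference, a minimal sketch of the kind of no-op ReadSupport we used for
the measurement (class name, package layout, and the flat/nested handling
are ours; the Parquet types are from parquet-hadoop and parquet-column):

```java
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.hadoop.api.InitContext;
import org.apache.parquet.hadoop.api.ReadSupport;
import org.apache.parquet.io.api.Binary;
import org.apache.parquet.io.api.Converter;
import org.apache.parquet.io.api.GroupConverter;
import org.apache.parquet.io.api.PrimitiveConverter;
import org.apache.parquet.io.api.RecordMaterializer;
import org.apache.parquet.schema.GroupType;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.Type;

// A ReadSupport whose materializer discards every value. Reading a file
// through this measures column I/O + decoding without any record assembly;
// the delta against a normal Avro read is the GenericRecord construction cost.
public class NoOpReadSupport extends ReadSupport<Void> {

  @Override
  public ReadContext init(InitContext context) {
    // Request the full file schema; trim it here to isolate specific columns.
    return new ReadContext(context.getFileSchema());
  }

  @Override
  public RecordMaterializer<Void> prepareForRead(
      Configuration conf, Map<String, String> keyValueMetaData,
      MessageType fileSchema, ReadContext readContext) {
    final GroupConverter root = (GroupConverter) noOp(fileSchema);
    return new RecordMaterializer<Void>() {
      @Override public Void getCurrentRecord() { return null; }
      @Override public GroupConverter getRootConverter() { return root; }
    };
  }

  // Build a converter tree that accepts and drops values at every level.
  private static Converter noOp(Type type) {
    if (type.isPrimitive()) {
      return new PrimitiveConverter() {
        @Override public void addBinary(Binary value) {}
        @Override public void addBoolean(boolean value) {}
        @Override public void addDouble(double value) {}
        @Override public void addFloat(float value) {}
        @Override public void addInt(int value) {}
        @Override public void addLong(long value) {}
      };
    }
    final GroupType group = type.asGroupType();
    return new GroupConverter() {
      @Override public Converter getConverter(int fieldIndex) {
        return noOp(group.getType(fieldIndex));
      }
      @Override public void start() {}
      @Override public void end() {}
    };
  }
}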

What options do we have for optimizing this area? We could move away from
Avro, but we observed the same overhead with protobufs. What do others do to
overcome this?

Additionally, even when a push-down predicate (Filter2 API) rejects rows, we
still see the full deserialization overhead whenever a row group cannot be
excluded by the metadata (statistics) check. How can we avoid deserializing
the whole record when the predicate doesn't match?
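To make the second question concrete, this is roughly how we attach the
filter (the column name "id" and the threshold are hypothetical; FilterApi,
FilterCompat, and AvroParquetReader are from parquet-mr 1.7+):

```java
import java.io.IOException;

import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.filter2.compat.FilterCompat;
import org.apache.parquet.filter2.predicate.FilterApi;
import org.apache.parquet.filter2.predicate.FilterPredicate;
import org.apache.parquet.hadoop.ParquetReader;

public class FilteredRead {
  public static void main(String[] args) throws IOException {
    // Hypothetical predicate: keep rows where the long column "id" > 1000.
    FilterPredicate pred = FilterApi.gt(FilterApi.longColumn("id"), 1000L);

    try (ParquetReader<GenericRecord> reader =
             AvroParquetReader.<GenericRecord>builder(new Path("data.parquet"))
                 .withFilter(FilterCompat.get(pred))
                 .build()) {
      GenericRecord record;
      while ((record = reader.read()) != null) {
        // Only matching records arrive here, but in our observation the
        // non-matching rows inside row groups that survive the statistics
        // check still pay the per-record decoding cost.
      }
    }
  }
}
```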

Thanks for your great work!
