Hi Petar,
First, when measuring the time spent deserializing, be careful: a no-op
ReadSupport can trigger JVM optimizations (dead-code elimination) that
remove the very code paths you are trying to measure, since their results
are never used.
That's why JMH provides a Blackhole class to work around this.
See: http://java-performance.info/jmh/
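For reference, a Blackhole-based benchmark might look like this sketch (assuming JMH is on the classpath; `MyRecordReader` is a hypothetical stand-in for however you wrap your ParquetReader):

```java
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Benchmark)
public class DeserializationBenchmark {

    // Hypothetical wrapper around your ParquetReader<GenericRecord> setup.
    private MyRecordReader reader = new MyRecordReader();

    @Benchmark
    public void readRecords(Blackhole bh) throws Exception {
        Object record;
        while ((record = reader.read()) != null) {
            // Consuming each record prevents the JIT from proving the
            // deserialized value is unused and eliminating the work.
            bh.consume(record);
        }
    }
}
```

This way the deserialization cost stays in the measurement even though the records are discarded.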
Predicate push down is most useful when a whole row group can be skipped
based on the column statistics in the footer; when a row group does match,
records are still assembled before the record-level filter is applied.
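As a sketch of the Filter2 API (the column name "age" and the threshold are illustrative):

```java
import org.apache.parquet.filter2.compat.FilterCompat;
import org.apache.parquet.filter2.predicate.FilterApi;
import org.apache.parquet.filter2.predicate.FilterPredicate;

public class FilterExample {
    public static FilterCompat.Filter ageFilter() {
        // Row groups whose min/max statistics for "age" cannot satisfy
        // the predicate are skipped without decoding any of their pages.
        FilterPredicate pred = FilterApi.gt(FilterApi.intColumn("age"), 30);
        return FilterCompat.get(pred);
    }
}
```

You would pass the resulting filter to the reader, e.g. via the ParquetReader builder's withFilter(...), so that non-matching row groups never reach the materializer.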
To optimize the materializer itself, a good starting point is to run your
workload under a profiler and see where the time actually goes.


On Wed, Feb 15, 2017 at 11:30 PM, Petar Petrov <[email protected]>
wrote:

> Hello,
>
> We are using Parquet, Avro and Spark for analytics.
>
> About 40-50% of our query time is spent in deserializing data. That's the
> Parquet-Avro interaction (AvroRecordMaterializer) in constructing Avro's
> GenericRecord. We have determined this by using a custom ReadSupport class
> with a dummy record materializer - it does nothing with the data passed to
> it.
>
> What options do we have about optimizing in this area? We could move away
> from Avro, but with protobufs we observed the same overhead. What do others
> do to overcome this?
>
> Additionally, it seems that even with a predicate push down (Filter2 API)
> not matching some rows, we are still seeing the overhead for cases where
> row groups are not excluded based on metadata checks. How can we avoid the
> whole record deserialization when a predicate doesn't match?
>
> Thanks for your great work!
>



-- 
Julien
