Hi Petar,

First, when measuring the time spent deserializing, be careful: using a no-op read support might trigger JVM optimizations that eliminate code paths whose results are never used, so you may end up measuring less work than really happens. That's why JMH provides a Blackhole class to work around this. See: http://java-performance.info/jmh/

Predicate push down is most useful when a whole row group doesn't match; when a row group can't be excluded by its metadata, its records still get assembled. To optimize the materializer, a profiler is the place to start.
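To illustrate the pitfall, here is a minimal plain-Java sketch of the idea behind JMH's Blackhole (the `ValueSink` class and its methods are hypothetical names for this example, not JMH's actual API): folding every consumed value into observable state prevents the JIT from proving the work dead and eliminating it.

```java
// Sketch: why a truly no-op materializer can mislead a benchmark.
// If consumed values are never used, the JIT may remove the code that
// produced them. A sink that mixes each value into reachable state
// (roughly what JMH's Blackhole does, far more carefully) defeats that.
public class ValueSink {
    private long state = 0xcafebabeL;

    // Mix each consumed value into the state so the call has a
    // visible side effect the JIT cannot delete.
    public void consume(Object value) {
        state = state * 31 + (value == null ? 0 : value.hashCode());
    }

    // Reading the state afterwards keeps the consumed values observable.
    public long state() {
        return state;
    }

    public static void main(String[] args) {
        ValueSink sink = new ValueSink();
        for (int i = 0; i < 1_000; i++) {
            sink.consume("record-" + i);  // stand-in for deserialized fields
        }
        System.out.println("sink state: " + sink.state());
    }
}
```

For a real measurement, prefer running under JMH itself and passing values to `Blackhole.consume(...)` inside a `@Benchmark` method rather than hand-rolling a sink like this.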
On Wed, Feb 15, 2017 at 11:30 PM, Petar Petrov <[email protected]> wrote:
> Hello,
>
> We are using Parquet, Avro and Spark for analytics.
>
> About 40-50% of our query time is spent deserializing data: the
> Parquet-Avro interaction (AvroRecordMaterializer) constructing Avro's
> GenericRecord. We determined this by using a custom ReadSupport class
> with a dummy record materializer that does nothing with the data passed
> to it.
>
> What options do we have for optimizing in this area? We could move away
> from Avro, but with protobufs we observed the same overhead. What do
> others do to overcome this?
>
> Additionally, it seems that even with a predicate push down (Filter2
> API) not matching some rows, we still see the full deserialization
> overhead for row groups that are not excluded by the metadata checks.
> How can we avoid deserializing the whole record when a predicate
> doesn't match?
>
> Thanks for your great work!

-- 
Julien
