msamirkhan edited a comment on pull request #29354:
URL: https://github.com/apache/spark/pull/29354#issuecomment-669519533


   The [pdf attached to the 
PR](https://github.com/apache/spark/files/5025167/AvroBenchmarks.pdf) contains 
the read and write time improvements with the commits. I have also added 
comments to the individual commits.
   
   For read, the previous behavior was Decoder => GenericDatumReader => AvroDeserializer. Changes to the way SpecificInternalRow is created result in improvements to read times, but these can be made in the SpecificInternalRow constructor instead (PR here: https://github.com/apache/spark/pull/29366). Moving to the native reader changes the behavior to Decoder => SparkAvroDatumReader. The benefits include the ability to "skip" data that is not needed. This is reflected in column K of the read benchmark cases for the single column scan, as well as in the pruning benchmark.
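   The value of "skipping" can be illustrated with a small self-contained Python sketch. This is not Spark's actual Scala implementation; the helpers below are simplified stand-ins for Avro's binary format (zig-zag varint longs, length-prefixed UTF-8 strings), and the record layout is hypothetical. The point is that a skip only reads the length prefix and advances the cursor, so an unneeded column is never decoded or materialized:

```python
import io

def write_long(buf: io.BytesIO, n: int) -> None:
    """Avro-style zig-zag varint encoding for a long."""
    n = (n << 1) ^ (n >> 63)  # zig-zag: small magnitudes -> small encodings
    while (n & ~0x7F) != 0:
        buf.write(bytes([(n & 0x7F) | 0x80]))  # 7 payload bits + continuation bit
        n >>= 7
    buf.write(bytes([n]))

def write_string(buf: io.BytesIO, s: str) -> None:
    data = s.encode("utf-8")
    write_long(buf, len(data))  # length prefix
    buf.write(data)

def read_long(buf: io.BytesIO) -> int:
    shift, acc = 0, 0
    while True:
        b = buf.read(1)[0]
        acc |= (b & 0x7F) << shift
        if not (b & 0x80):
            break
        shift += 7
    return (acc >> 1) ^ -(acc & 1)  # undo zig-zag

def skip_string(buf: io.BytesIO) -> None:
    """Skip a string column: read only its length, then seek past the bytes."""
    length = read_long(buf)
    buf.seek(length, io.SEEK_CUR)  # no UTF-8 decode, no allocation

# Encode one hypothetical record: (id: long, payload: string, score: long).
buf = io.BytesIO()
write_long(buf, 42)
write_string(buf, "x" * 1000)  # large column the query does not need
write_long(buf, 7)
buf.seek(0)

# Pruned read: only `id` and `score` are projected, so `payload` is skipped.
row_id = read_long(buf)
skip_string(buf)
score = read_long(buf)
```

A reader driven by the Spark schema can emit a skip like this for every field the scan does not project, which is where the single-column-scan and pruning improvements come from.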
   
   For write, the previous behavior was AvroSerializer => ReflectDatumWriter => Encoder. Spark doesn't need ReflectDatumWriter and can use GenericDatumWriter instead. This is a one-line change (https://github.com/apache/spark/pull/29354/commits/515b4a99d3edeb902a6680f78a38f0d3f977528f) and improves the write times significantly (column A, pg 3 of the pdf). Moving to the native writer changes the behavior to SparkAvroDatumWriter => Encoder and improves the write times significantly more (columns D-K, pg 3 of the pdf).
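   The shape of the write-side win can be sketched in Python (again a simplified stand-in, not the PR's Scala code, and the encoding helpers are hypothetical approximations of Avro's binary format). The old-style path builds an intermediate generic record and dispatches on each field's runtime type; a schema-specialized writer streams the row's fields straight to the encoder. Both produce identical bytes, but the direct path avoids the intermediate allocation and the per-field dispatch:

```python
import io

def encode_long(buf: io.BytesIO, n: int) -> None:
    """Avro-style zig-zag varint encoding for a long."""
    n = (n << 1) ^ (n >> 63)
    while (n & ~0x7F) != 0:
        buf.write(bytes([(n & 0x7F) | 0x80]))
        n >>= 7
    buf.write(bytes([n]))

def encode_string(buf: io.BytesIO, s: str) -> None:
    data = s.encode("utf-8")
    encode_long(buf, len(data))
    buf.write(data)

def write_via_generic(buf: io.BytesIO, row: tuple) -> None:
    # Old-style path: materialize an intermediate generic record, then let a
    # generic writer dispatch on the runtime type of every field value.
    record = {"id": row[0], "name": row[1]}  # extra per-row allocation
    for value in record.values():
        if isinstance(value, int):           # per-field runtime dispatch
            encode_long(buf, value)
        elif isinstance(value, str):
            encode_string(buf, value)
        else:
            raise TypeError(value)

def write_direct(buf: io.BytesIO, row: tuple) -> None:
    # Native-style path: the writer is specialized to the schema up front,
    # so each field goes straight to the encoder.
    encode_long(buf, row[0])
    encode_string(buf, row[1])

row = (123, "hello")
generic_buf, direct_buf = io.BytesIO(), io.BytesIO()
write_via_generic(generic_buf, row)
write_direct(direct_buf, row)
```

In a real writer the specialization would be done once per schema (e.g. a precomputed list of field encoders), so the per-row cost is just the encoding itself.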





---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to