bersprockets opened a new pull request #33072:
URL: https://github.com/apache/spark/pull/33072


   ### What changes were proposed in this pull request?
   
   This is a backport of #32969.
   
   When creating a record writer in an `AvroDeserializer`, or creating a struct converter in an `AvroSerializer`, look up Avro fields using a map rather than scanning the entire list of Avro fields. With the linear scan, matching each of n catalyst fields against n Avro fields costs O(n^2) comparisons, and that cost is incurred again for every split read or file written.
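   
   For illustration, here is a minimal Scala sketch of the idea (hypothetical names, not the actual code from #32969): build the name-to-field map once per schema, then resolve each column with a hash lookup.
   
   ```scala
   import org.apache.avro.Schema
   import scala.jdk.CollectionConverters._
   
   // Hypothetical helper illustrating the optimization: build the
   // name -> field map once per Avro schema so that each subsequent
   // field lookup is a hash probe instead of a scan of all fields.
   class AvroFieldLookup(avroSchema: Schema) {
     // Keys are lower-cased for case-insensitive matching; handling of
     // duplicate names that differ only by case is omitted in this sketch.
     private val fieldMap: Map[String, Schema.Field] =
       avroSchema.getFields.asScala.map(f => f.name.toLowerCase -> f).toMap
   
     // O(1) per lookup, versus O(number of fields) for a linear scan.
     def getField(name: String): Option[Schema.Field] =
       fieldMap.get(name.toLowerCase)
   }
   ```
   
   With the map in place, matching a wide projection against the Avro schema is linear in the column count rather than quadratic.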
   
   ### Why are the changes needed?
   
   A query against an Avro table can be quite slow when all of the following are true:
   
   * There are many columns in the Avro file
   * The query contains a wide projection
   * There are many splits in the input
   * Some of the splits are read serially (e.g., fewer executors than tasks)
   
   A write to an Avro table can be quite slow when all of the following are true:
   
   * There are many columns in the new rows
   * The operation is creating many files
   
   For example, a single-threaded query against a 6000-column Avro data set with 50K rows and 20 files takes less than a minute with Spark 3.0.1, but over 7 minutes with Spark 3.2.0-SNAPSHOT.
   
   This PR restores the faster time.
   
   | Benchmark (1000 columns) | Before patch | After patch | Percent improvement |
   |--------------------------|--------------|-------------|---------------------|
   | Read                     | 108447 ms    | 35925 ms    | 66%                 |
   | Write                    | 123307 ms    | 42313 ms    | 65%                 |
   
   ### Does this PR introduce _any_ user-facing change?
   
   No
   
   ### How was this patch tested?
   
   * Ran existing unit tests
   * Added new unit tests
   * Added new benchmarks

