bersprockets opened a new pull request #32969:
URL: https://github.com/apache/spark/pull/32969


   ### What changes were proposed in this pull request?
   
   When creating a record writer in an AvroDeserializer, or creating a struct 
converter in an AvroSerializer, look up Avro fields by name using a map rather 
than scanning the entire list of Avro fields for each lookup (see the sketch 
below).
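
   For illustration, here is a minimal sketch of the map-based lookup; the 
class and method names are hypothetical, not the PR's actual code:

   ```scala
   import scala.collection.JavaConverters._

   import org.apache.avro.Schema

   // Build the name -> field map once per schema, then resolve each
   // Catalyst column with an O(1) map lookup instead of an O(F) scan
   // of avroSchema.getFields for every column.
   class AvroFieldLookup(avroSchema: Schema) {
     private val fieldsByName: Map[String, Schema.Field] =
       avroSchema.getFields.asScala.map(f => f.name() -> f).toMap

     def getField(name: String): Option[Schema.Field] = fieldsByName.get(name)
   }
   ```

   With `F` Avro fields and `C` projected columns, converter construction 
drops from roughly O(C × F) comparisons to O(F + C) work.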
   
   
   ### Why are the changes needed?
   
   A query against an Avro table can be quite slow when all of the following 
are true:
   
   * There are many columns in the Avro file
   * The query contains a wide projection
   * There are many splits in the input
   * Some of the splits are read serially (e.g., there are fewer executors 
than tasks)
   
   A write to an Avro table can be quite slow when all of the following are 
true:
   
   * There are many columns in the new rows
   * The operation is creating many files
   
   For example, a single-threaded query against a 6000-column Avro data set 
with 50K rows and 20 files takes less than a minute with Spark 3.0.1, but over 
7 minutes with Spark 3.2.0-SNAPSHOT.
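
   As a back-of-the-envelope check (assuming each per-column lookup scans 
about half of the 6000-entry field list), building the converters for one file 
costs roughly 6000 × 3000 ≈ 18 million comparisons, repeated for each of the 
20 files: on the order of 360 million comparisons on a single thread. With the 
map, the same work is about 6000 constant-time lookups per file.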
   
   This PR restores the faster time.
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   No
   
   ### How was this patch tested?
   
   * Ran existing unit tests
   * Added new unit tests
   * Added new benchmarks
   

