Baohe Zhang created SPARK-34336:
-----------------------------------
Summary: Use GenericData as Avro serialization data model can
improve Avro write/read performance
Key: SPARK-34336
URL: https://issues.apache.org/jira/browse/SPARK-34336
Project: Spark
Issue Type: Improvement
Components: Input/Output, SQL
Affects Versions: 3.1.2
Reporter: Baohe Zhang
We found that using "org.apache.avro.generic.GenericData" as the Avro
serialization data model in the Avro writer can significantly improve Avro
write performance and slightly improve Avro read performance.
This optimization was originally proposed by [~samkhan] in this PR:
https://github.com/apache/spark/pull/29354.
We re-evaluated the change "Use GenericData instead of ReflectData when writing
Avro data" from that PR and verified that it improves performance in the
Avro write/read benchmarks.
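To make the GenericData/ReflectData distinction concrete, here is a minimal sketch of choosing the data model on a plain Avro writer. It is not Spark's code and does not reproduce the change in the PR: the class DataModelSketch, its schema, and the helpers write and demo are hypothetical names used only for illustration, and it assumes the Avro Java library is on the classpath. It only shows that the data model is a parameter of the datum writer, so the same GenericRecord can be written with either model.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.reflect.ReflectData;

// Hypothetical illustration class; not part of Spark or Avro.
final class DataModelSketch {

    // Writes one record to an in-memory Avro container file using the
    // given data model and returns the number of bytes produced.
    static int write(Schema schema, GenericRecord record, GenericData model)
            throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // The data model is passed to the datum writer explicitly.
        GenericDatumWriter<GenericRecord> datumWriter =
                new GenericDatumWriter<>(schema, model);
        try (DataFileWriter<GenericRecord> fileWriter =
                     new DataFileWriter<>(datumWriter)) {
            fileWriter.create(schema, out);
            fileWriter.append(record);
        }
        return out.size();
    }

    // Builds a small example record and writes it once per data model.
    static int[] demo() throws IOException {
        // Assumed example schema: a record with a string and an int field.
        Schema schema = SchemaBuilder.record("Example").fields()
                .requiredString("name")
                .requiredInt("id")
                .endRecord();
        GenericRecord record = new GenericData.Record(schema);
        record.put("name", "spark");
        record.put("id", 1);

        // ReflectData is a subclass of GenericData, so both fit the same parameter.
        int genericBytes = write(schema, record, GenericData.get());
        int reflectBytes = write(schema, record, ReflectData.get());
        return new int[] {genericBytes, reflectBytes};
    }

    public static void main(String[] args) throws IOException {
        int[] sizes = demo();
        System.out.println("GenericData bytes: " + sizes[0]);
        System.out.println("ReflectData bytes: " + sizes[1]);
    }
}
```

This sketch does not measure anything; the performance comparison is the one reported in the benchmarks for this issue.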
The base branch is branch-3.1 as of today (2/2/21).
Besides the current Avro read/write benchmarks, I also ran some extra
benchmarks for reading and writing nested structs and arrays. These benchmarks
were put up in this PR, https://github.com/apache/spark/pull/29352, but have
not been merged.
Benchmark results are posted in a comment on this issue.