This is an automated email from the ASF dual-hosted git repository.
dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new 842c0c3 [SPARK-37832][SQL] Orc struct converter should use an array
to look up field converters rather than a linked list
842c0c3 is described below
commit 842c0c303c3994a08c92b34e23b468050250da77
Author: Bruce Robbins <[email protected]>
AuthorDate: Thu Jan 6 18:18:17 2022 -0800
[SPARK-37832][SQL] Orc struct converter should use an array to look up
field converters rather than a linked list
### What changes were proposed in this pull request?
Change the Orc struct converter to index an array rather than a linked list
when looking up field converters.
### Why are the changes needed?
Currently, the OrcSerializer's struct converter uses an index to look up
each field converter in a linked list. Each lookup walks n/2 links on
average, so converting one row costs roughly n*(n/2) link traversals
(where n is the field count), which is quadratic in n.
Simply converting the linked list to an array brings performance gains,
especially for wide structs.
| field count | row count | master | pr | improvement |
| ----------- | --------- | ------ | ----- | ----------- |
| 10 | 15728640 | 4729 | 4338 | none |
| 100 | 157286 | 5270 | 4064 | 22% |
| 600 | 26214 | 13548 | 4726 | 65% |
The above benchmarks were run on my local machine. Official benchmarks are
forthcoming.
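
The access pattern behind that cost can be sketched as follows. This is a minimal illustration, not the OrcSerializer code: `FieldConverterLookup`, `Converter`, `linkedConverters`, `convertRow`, and `linksWalkedPerRow` are hypothetical names, and the converters are simple stand-ins that format a value as a string. It assumes the converters end up in a Scala `List`, whose `apply(i)` walks `i` links, while an `Array` is indexed directly.

```scala
object FieldConverterLookup {
  // Hypothetical stand-in for a per-field converter.
  type Converter = Any => String

  // Builds the converters as a linked list (the shape before the change).
  def linkedConverters(n: Int): List[Converter] =
    List.tabulate(n)(i => (v: Any) => s"field$i:$v")

  // Same while-loop shape as the struct converter: one indexed lookup per field.
  // `lookup` is either list(_) (walks i links) or array(_) (direct index).
  def convertRow(row: Array[Any], lookup: Int => Converter): Array[String] = {
    val out = new Array[String](row.length)
    var i = 0
    while (i < row.length) {
      out(i) = lookup(i)(row(i))
      i += 1
    }
    out
  }

  // Links walked per row when a List is indexed at 0, 1, ..., n-1:
  // 0 + 1 + ... + (n-1) = n*(n-1)/2.
  def linksWalkedPerRow(n: Int): Long = n.toLong * (n - 1) / 2
}
```

Under these assumptions, a 600-field struct walks `linksWalkedPerRow(600)` = 179700 links per row through the list, against 600 direct index reads once the converters are copied into an array with `.toArray`. Both lookups return the same converters, so the output is unchanged.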
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- Existing unit tests
- New benchmark (in a separate PR)
Closes #35120 from bersprockets/orc_struct_play2.
Authored-by: Bruce Robbins <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
---
.../org/apache/spark/sql/execution/datasources/orc/OrcSerializer.scala | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcSerializer.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcSerializer.scala
index edd5052..a928cd9 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcSerializer.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcSerializer.scala
@@ -154,7 +154,7 @@ class OrcSerializer(dataSchema: StructType) {
     case st: StructType => (getter, ordinal) =>
       val result = createOrcValue(st).asInstanceOf[OrcStruct]
-      val fieldConverters = st.map(_.dataType).map(newConverter(_))
+      val fieldConverters = st.map(_.dataType).map(newConverter(_)).toArray
       val numFields = st.length
       val struct = getter.getStruct(ordinal, numFields)
       var i = 0