This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
     new 842c0c3  [SPARK-37832][SQL] Orc struct converter should use an array to look up field converters rather than a linked list
842c0c3 is described below

commit 842c0c303c3994a08c92b34e23b468050250da77
Author: Bruce Robbins <[email protected]>
AuthorDate: Thu Jan 6 18:18:17 2022 -0800

    [SPARK-37832][SQL] Orc struct converter should use an array to look up field converters rather than a linked list
    
    ### What changes were proposed in this pull request?
    
    Change the Orc struct converter to index an array rather than a linked list when looking up field converters.
    
    ### Why are the changes needed?
    
    Currently, the OrcSerializer's struct converter uses an index to look up each field converter in a linked list, resulting in an average complexity of n*(n/2) per row (where n is the field count).
    
    Simply converting the linked list to an array brings performance gains, especially for wide structs.
    
    | field count | row count | master | pr    | improvement |
    | ----------- | --------- | ------ | ----- | ----------- |
    | 10          | 15728640  | 4729   | 4338  | none        |
    | 100         | 157286    | 5270   | 4064  | 22%         |
    | 600         | 26214     | 13548  | 4726  | 65%         |
    
    The above benchmarks were run on my local machine. Official benchmarks are forthcoming.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    - Existing unit tests
    - New benchmark (in a separate PR)
    
    Closes #35120 from bersprockets/orc_struct_play2.
    
    Authored-by: Bruce Robbins <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
---
 .../org/apache/spark/sql/execution/datasources/orc/OrcSerializer.scala  | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcSerializer.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcSerializer.scala
index edd5052..a928cd9 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcSerializer.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcSerializer.scala
@@ -154,7 +154,7 @@ class OrcSerializer(dataSchema: StructType) {
 
     case st: StructType => (getter, ordinal) =>
       val result = createOrcValue(st).asInstanceOf[OrcStruct]
-      val fieldConverters = st.map(_.dataType).map(newConverter(_))
+      val fieldConverters = st.map(_.dataType).map(newConverter(_)).toArray
       val numFields = st.length
       val struct = getter.getStruct(ordinal, numFields)
       var i = 0
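
A minimal sketch of why the `.toArray` call matters, assuming (as the commit message states) that the mapped converters are otherwise held in a linked list. Plain Scala collections stand in for `StructType` and the real ORC converters; `FieldConverterLookup`, `Converter`, `listHopsPerRow`, and `applyAll` are hypothetical names used only for illustration and do not appear in `OrcSerializer.scala`.

```scala
object FieldConverterLookup {
  // Hypothetical stand-in for a per-field converter.
  type Converter = Int => Int

  // Reaching index i in a singly linked List walks i nodes, so indexing
  // every field of one row costs 0 + 1 + ... + (n - 1) = n * (n - 1) / 2 hops.
  def listHopsPerRow(n: Int): Long = (0 until n).map(_.toLong).sum

  // Same loop shape as the struct converter: look up each converter by
  // field position. With a List each lookup is O(i); with an Array it is O(1).
  def applyAll(converters: Seq[Converter], n: Int): Long = {
    var i = 0
    var total = 0L
    while (i < n) {
      total += converters(i)(1)
      i += 1
    }
    total
  }

  def main(args: Array[String]): Unit = {
    val n = 600
    val asList: List[Converter] = List.tabulate(n)(i => (x: Int) => x + i)
    val asArray: Array[Converter] = asList.toArray

    // Both representations produce the same result; only lookup cost differs.
    // The Array is passed as a Seq through Scala's implicit array wrapper.
    assert(applyAll(asList, n) == applyAll(asArray, n))

    println(s"linked-list hops per row for n=$n: ${listHopsPerRow(n)}")
  }
}
```

For the 600-field case in the benchmark table, `listHopsPerRow(600)` evaluates to 179700 node traversals per row, which is the n*(n/2) growth the commit message describes; the array lookup performs one direct index per field instead.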
