This is an automated email from the ASF dual-hosted git repository.
yao pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new 355c7cbdc149 [SPARK-54840][SQL] OrcList Pre-allocation
355c7cbdc149 is described below
commit 355c7cbdc1495d9d6bddc2033ab0deb70eed5ddc
Author: Kent Yao <[email protected]>
AuthorDate: Fri Dec 26 19:36:05 2025 +0800
[SPARK-54840][SQL] OrcList Pre-allocation
### What changes were proposed in this pull request?
This PR optimizes ORC serialization performance by pre-allocating `OrcList`
with the exact array size instead of relying on dynamic resizing.
**Key changes:**
- Pre-allocate `OrcList` with `numElements` from the input array
- Avoid multiple ArrayList resize operations and element copying during
serialization
- Cache `numElements` value to avoid redundant calls in the loop condition
### Why are the changes needed?
**Problem:**
When serializing arrays to ORC format, the current implementation creates
an empty `OrcList` (which extends `ArrayList`) and grows it dynamically. For
large arrays, this triggers multiple resize operations, each requiring:
1. Allocating a new larger backing array
2. Copying all existing elements to the new array
3. Discarding the old array
**Performance Impact:**
For an array with 65,536 elements, the default ArrayList growth pattern
(1.5x capacity increases starting from a capacity of 10) causes 22 resize
operations, copying roughly 142,000 elements in total (per the resize sequence
shown below).
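This growth behavior is easy to verify with a short, self-contained simulation
(plain Scala, not Spark code) of OpenJDK's `ArrayList.grow()`, which computes
`newCapacity = oldCapacity + (oldCapacity >> 1)` starting from a default
capacity of 10. It reproduces the 1,000,000-element numbers in the analysis
under **Solution** below:
```scala
// Simulates java.util.ArrayList's 1.5x growth rule to count resizes and
// copied elements when appending `target` elements one at a time.
object ResizeSimulation {
  def main(args: Array[String]): Unit = {
    val target = 1000000 // elements to append
    var capacity = 10    // ArrayList's default initial capacity
    var resizes = 0
    var copied = 0L
    while (capacity < target) {
      copied += capacity          // each resize copies all live elements
      capacity += (capacity >> 1) // newCap = oldCap + (oldCap >> 1)
      resizes += 1
    }
    // Prints: resizes=29 copied=2430972 finalCapacity=1215487
    println(s"resizes=$resizes copied=$copied finalCapacity=$capacity")
  }
}
```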
**Solution:**
By pre-allocating the `OrcList` with the known size, we eliminate all
resize operations and the associated element copying. The analysis below
quantifies the overhead removed, and the benchmark that follows shows the
resulting improvement:
```
=== Resize Behavior Analysis: 1000000 elements ===
Target size: 1,000,000 elements
Total resizes: 29
Total elements copied: 2,430,972
Final capacity: 1,215,487
Resize sequence:
1. Resize: 10 → 15 (copy 10 elements)
2. Resize: 15 → 22 (copy 15 elements)
3. Resize: 22 → 33 (copy 22 elements)
4. Resize: 33 → 49 (copy 33 elements)
5. Resize: 49 → 73 (copy 49 elements)
6. Resize: 73 → 109 (copy 73 elements)
7. Resize: 109 → 163 (copy 109 elements)
8. Resize: 163 → 244 (copy 163 elements)
9. Resize: 244 → 366 (copy 244 elements)
10. Resize: 366 → 549 (copy 366 elements)
11. Resize: 549 → 823 (copy 549 elements)
12. Resize: 823 → 1,234 (copy 823 elements)
13. Resize: 1,234 → 1,851 (copy 1,234 elements)
14. Resize: 1,851 → 2,776 (copy 1,851 elements)
15. Resize: 2,776 → 4,164 (copy 2,776 elements)
16. Resize: 4,164 → 6,246 (copy 4,164 elements)
17. Resize: 6,246 → 9,369 (copy 6,246 elements)
18. Resize: 9,369 → 14,053 (copy 9,369 elements)
19. Resize: 14,053 → 21,079 (copy 14,053 elements)
20. Resize: 21,079 → 31,618 (copy 21,079 elements)
21. Resize: 31,618 → 47,427 (copy 31,618 elements)
22. Resize: 47,427 → 71,140 (copy 47,427 elements)
23. Resize: 71,140 → 106,710 (copy 71,140 elements)
24. Resize: 106,710 → 160,065 (copy 106,710 elements)
25. Resize: 160,065 → 240,097 (copy 160,065 elements)
26. Resize: 240,097 → 360,145 (copy 240,097 elements)
27. Resize: 360,145 → 540,217 (copy 360,145 elements)
28. Resize: 540,217 → 810,325 (copy 540,217 elements)
29. Resize: 810,325 → 1,215,487 (copy 810,325 elements)
Overhead ratio: 2.43x
(For every element added, 2.43 elements are copied during resizes)
=== Testing Array Size: 1000000 ===
Array Size: 1,000,000 elements
Iterations: 50
Without pre-allocation: 1328.69 ms total (26.574 ms/iter)
With pre-allocation: 620.36 ms total (12.407 ms/iter)
Time saved: 708.33 ms (53.3% improvement)
Expected resize count: 29 times
Final capacity needed: 1215487
```
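The harness that produced these numbers isn't included in the commit. A
minimal standalone sketch that isolates the same effect could look like the
following; it assumes only ORC classes already on Spark's classpath
(`org.apache.orc.TypeDescription`, `org.apache.orc.mapred.OrcList`, with the
one- and two-argument `OrcList` constructors used in the diff below), and it
takes a single unwarmed timing, so treat the output as indicative rather than
rigorous:
```scala
import org.apache.hadoop.io.IntWritable
import org.apache.orc.TypeDescription
import org.apache.orc.mapred.OrcList

object OrcListPreallocBench {
  private val schema = TypeDescription.fromString("array<int>")

  // Appends n elements to the given list, mirroring the serializer's loop.
  private def fill(list: OrcList[IntWritable], n: Int): Unit = {
    var i = 0
    while (i < n) { list.add(new IntWritable(i)); i += 1 }
  }

  private def time(label: String)(body: => Unit): Unit = {
    val start = System.nanoTime()
    body
    println(f"$label: ${(System.nanoTime() - start) / 1e6}%.2f ms")
  }

  def main(args: Array[String]): Unit = {
    val n = 1000000
    // One-arg constructor: backing ArrayList grows via repeated resizes.
    time("without pre-allocation")(fill(new OrcList[IntWritable](schema), n))
    // Two-arg constructor (used by this patch): exact capacity up front.
    time("with pre-allocation")(fill(new OrcList[IntWritable](schema, n), n))
  }
}
```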
### Does this PR introduce _any_ user-facing change?
No. This is a performance optimization with no functional changes. The
output remains identical.
### How was this patch tested?
1. **Existing Tests:** All existing ORC-related tests pass, ensuring
correctness is maintained.
2. **Performance Testing:** See the benchmark output above; a quick manual
round-trip check is sketched below.
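As an additional, purely illustrative sanity check (hypothetical, not part of
the patch), an array column can be round-tripped through ORC in `spark-shell`;
the path and sizes below are arbitrary:
```scala
// Hypothetical spark-shell check: write an array column to ORC and read it
// back. The path and array size are arbitrary choices for illustration.
val path = "/tmp/spark-54840-check"
val df = spark.range(100).selectExpr("sequence(0, 65535) AS arr")
df.write.mode("overwrite").orc(path)

// The round trip should preserve both the row count and the element count.
val back = spark.read.orc(path)
assert(back.count() == 100)
assert(back.selectExpr("size(arr)").distinct().head().getInt(0) == 65536)
```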
### Was this patch authored or co-authored using generative AI tooling?
GitHub Copilot w/ Claude Sonnet 4.5
Closes #53605 from yaooqinn/SPARK-54840.
Authored-by: Kent Yao <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
---
.../execution/datasources/orc/OrcSerializer.scala | 25 ++++++++++++----------
1 file changed, 14 insertions(+), 11 deletions(-)
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcSerializer.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcSerializer.scala
index 1399931dd6b6..dc124dc2f7c9 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcSerializer.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcSerializer.scala
@@ -178,19 +178,22 @@ class OrcSerializer(dataSchema: StructType) {
       result

     case ArrayType(elementType, _) => (getter, ordinal) =>
-      val result = OrcStruct.createValue(orcType)
-        .asInstanceOf[OrcList[WritableComparable[_]]]
-      // Need to put all converted values to a list, can't reuse object.
-      val elementConverter = newConverter(elementType,
-        orcType.getChildren.get(0), reuseObj = false)
       val array = getter.getArray(ordinal)
-      var i = 0
-      while (i < array.numElements()) {
-        if (array.isNullAt(i)) {
-          result.add(null)
-        } else {
-          result.add(elementConverter(array, i))
+      val numElements = array.numElements()
+      val result = new OrcList[WritableComparable[_]](orcType, numElements)
+      if (numElements > 0) {
+        // Need to put all converted values to a list, can't reuse object.
+        val elementConverter =
+          newConverter(elementType, orcType.getChildren.get(0), reuseObj = false)
+        var i = 0
+        while (i < numElements) {
+          if (array.isNullAt(i)) {
+            result.add(null)
+          } else {
+            result.add(elementConverter(array, i))
+          }
+          i += 1
         }
-        i += 1
       }
       result