This is an automated email from the ASF dual-hosted git repository.
gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new f689ff75a037 [SPARK-53939][PYTHON] Use batch.num_columns instead of len(batch.columns)
f689ff75a037 is described below
commit f689ff75a037d90e78312e6d3c826b498300a703
Author: Takuya Ueshin <[email protected]>
AuthorDate: Fri Oct 17 13:12:33 2025 +0900
[SPARK-53939][PYTHON] Use batch.num_columns instead of len(batch.columns)
### What changes were proposed in this pull request?
Use `batch.num_columns` instead of `len(batch.columns)` in `ArrowStreamUDFSerializer`.
### Why are the changes needed?
`len(batch.columns)` has a noticeable performance overhead compared with `batch.num_columns`, so it should be avoided on hot paths.
<details>
<summary>benchmark_column_count_access</summary>
```python
import pyarrow as pa
import time
import numpy as np


def benchmark_column_count_access():
    # Create test data with varying numbers of columns
    column_counts = [10, 50, 100, 500, 1000]
    iterations = 1_000_000

    print("PyArrow RecordBatch Column Count Access Benchmark")
    print("=" * 70)
    print(f"Iterations: {iterations:,}")
    print()

    for num_cols in column_counts:
        # Create a RecordBatch with the specified number of columns
        arrays = [pa.array(np.random.rand(100)) for _ in range(num_cols)]
        names = [f"col_{i}" for i in range(num_cols)]
        batch = pa.record_batch(arrays, names=names)

        # Benchmark len(batch.columns)
        start = time.perf_counter()
        for _ in range(iterations):
            _ = len(batch.columns)
        time_len_columns = time.perf_counter() - start

        # Benchmark batch.num_columns
        start = time.perf_counter()
        for _ in range(iterations):
            _ = batch.num_columns
        time_num_columns = time.perf_counter() - start

        # Calculate speedup
        speedup = time_len_columns / time_num_columns

        print(f"Columns: {num_cols:>4}")
        print(f"  len(batch.columns): {time_len_columns:.4f}s")
        print(f"  batch.num_columns:  {time_num_columns:.4f}s")
        print(f"  Speedup: {speedup:.2f}x")
        print(f"  Difference: {(time_len_columns - time_num_columns) * 1000:.2f}ms")
        print()


benchmark_column_count_access()
```
</details>
```
PyArrow RecordBatch Column Count Access Benchmark
======================================================================
Iterations: 1,000,000

Columns:   10
  len(batch.columns): 4.0907s
  batch.num_columns:  0.0215s
  Speedup: 190.38x
  Difference: 4069.23ms

Columns:   50
  len(batch.columns): 19.7623s
  batch.num_columns:  0.0212s
  Speedup: 932.47x
  Difference: 19741.09ms

Columns:  100
  len(batch.columns): 39.7946s
  batch.num_columns:  0.0191s
  Speedup: 2088.70x
  Difference: 39775.53ms

Columns:  500
  len(batch.columns): 205.2582s
  batch.num_columns:  0.0210s
  Speedup: 9783.63x
  Difference: 205237.24ms

Columns: 1000
  len(batch.columns): 447.2963s
  batch.num_columns:  0.0234s
  Speedup: 19085.39x
  Difference: 447272.90ms
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
The existing tests should pass.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #52639 from ueshin/issues/SPARK-53939/num_columns.
Authored-by: Takuya Ueshin <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
---
python/pyspark/sql/pandas/serializers.py | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/python/pyspark/sql/pandas/serializers.py b/python/pyspark/sql/pandas/serializers.py
index 55fbfe4b34bb..59358578d729 100644
--- a/python/pyspark/sql/pandas/serializers.py
+++ b/python/pyspark/sql/pandas/serializers.py
@@ -168,7 +168,7 @@ class ArrowStreamUDFSerializer(ArrowStreamSerializer):
         assert isinstance(batch, pa.RecordBatch)

         # Wrap the root struct
-        if len(batch.columns) == 0:
+        if batch.num_columns == 0:
             # When batch has no column, it should still create
             # an empty batch with the number of rows set.
             struct = pa.array([{}] * batch.num_rows)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]