[PR] [SPARK-53939][PYTHON] Use batch.num_columns instead of len(batch.columns) [spark]

via GitHub Thu, 16 Oct 2025 14:19:18 -0700


ueshin opened a new pull request, #52639:
URL: https://github.com/apache/spark/pull/52639


   ### What changes were proposed in this pull request?
   
   Use `batch.num_columns` instead of `len(batch.columns)` in `ArrowStreamUDFSerializer`.
   
   ### Why are the changes needed?
   
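   `len(batch.columns)` is far slower than `batch.num_columns`, and its cost grows with the number of columns: in the benchmark below, one million calls take about 4 s at 10 columns and about 447 s at 1000 columns, while `batch.num_columns` stays near 0.02 s throughout.
   Avoid `len(batch.columns)` when only the column count is needed.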
   
   <details>
   <summary>benchmark_column_count_access</summary>
   
   ```python
   import pyarrow as pa
   import time
   import numpy as np
   
   def benchmark_column_count_access():
       # Create test data with varying number of columns
       column_counts = [10, 50, 100, 500, 1000]
       iterations = 1_000_000
       
       print("PyArrow RecordBatch Column Count Access Benchmark")
       print("=" * 70)
       print(f"Iterations: {iterations:,}")
       print()
       
       for num_cols in column_counts:
           # Create a RecordBatch with the specified number of columns
           arrays = [pa.array(np.random.rand(100)) for _ in range(num_cols)]
           names = [f"col_{i}" for i in range(num_cols)]
           batch = pa.record_batch(arrays, names=names)
           
           # Benchmark len(batch.columns)
           start = time.perf_counter()
           for _ in range(iterations):
               _ = len(batch.columns)
           time_len_columns = time.perf_counter() - start
           
           # Benchmark batch.num_columns
           start = time.perf_counter()
           for _ in range(iterations):
               _ = batch.num_columns
           time_num_columns = time.perf_counter() - start
           
           # Calculate speedup
           speedup = time_len_columns / time_num_columns
           
           print(f"Columns: {num_cols:>4}")
           print(f"  len(batch.columns):  {time_len_columns:.4f}s")
           print(f"  batch.num_columns:   {time_num_columns:.4f}s")
           print(f"  Speedup:             {speedup:.2f}x")
           print(f"  Difference:          {(time_len_columns - time_num_columns)*1000:.2f}ms")
           print()
   
   if __name__ == "__main__":
       benchmark_column_count_access()
   ```
   
   </details>
   
   ```
   PyArrow RecordBatch Column Count Access Benchmark
   ======================================================================
   Iterations: 1,000,000
   
   Columns:   10
     len(batch.columns):  4.0907s
     batch.num_columns:   0.0215s
     Speedup:             190.38x
     Difference:          4069.23ms
   
   Columns:   50
     len(batch.columns):  19.7623s
     batch.num_columns:   0.0212s
     Speedup:             932.47x
     Difference:          19741.09ms
   
   Columns:  100
     len(batch.columns):  39.7946s
     batch.num_columns:   0.0191s
     Speedup:             2088.70x
     Difference:          39775.53ms
   
   Columns:  500
     len(batch.columns):  205.2582s
     batch.num_columns:   0.0210s
     Speedup:             9783.63x
     Difference:          205237.24ms
   
   Columns: 1000
     len(batch.columns):  447.2963s
     batch.num_columns:   0.0234s
     Speedup:             19085.39x
     Difference:          447272.90ms
   ```
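
To read the totals above per call: dividing each `len(batch.columns)` time by the 1,000,000 iterations, and then by the column count, gives a rough per-column cost. The sketch below only does arithmetic on the reported numbers; the variable names are illustrative.

```python
# Reported totals for len(batch.columns), in seconds, keyed by column count.
reported_len_seconds = {
    10: 4.0907,
    50: 19.7623,
    100: 39.7946,
    500: 205.2582,
    1000: 447.2963,
}
iterations = 1_000_000

per_column_us = {}
for num_cols, total_seconds in reported_len_seconds.items():
    per_call_us = total_seconds / iterations * 1e6  # microseconds per call
    per_column_us[num_cols] = per_call_us / num_cols
    print(f"{num_cols:>4} columns: {per_call_us:8.2f} us/call, "
          f"{per_column_us[num_cols]:.3f} us/column")
```

The per-column figure stays roughly constant (about 0.4 µs) across all five sizes, which is consistent with a cost that is linear in the number of columns. `batch.num_columns`, by contrast, stays near 0.02 µs per call regardless of width.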
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   The existing tests should pass.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.
   

