ueshin opened a new pull request, #52639:
URL: https://github.com/apache/spark/pull/52639
### What changes were proposed in this pull request?
Use `batch.num_columns` instead of `len(batch.columns)` in
`ArrowStreamUDFSerializer`.
### Why are the changes needed?
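`len(batch.columns)` is far slower than `batch.num_columns`. In the
benchmark below, its cost grows roughly linearly with the number of columns
(presumably because `batch.columns` builds the full list of column arrays
before `len` is taken), while `batch.num_columns` stays roughly constant.
It should be avoided.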
<details>
<summary>benchmark_column_count_access</summary>
```python
import pyarrow as pa
import time
import numpy as np


def benchmark_column_count_access():
    # Create test data with varying number of columns
    column_counts = [10, 50, 100, 500, 1000]
    iterations = 1_000_000

    print("PyArrow RecordBatch Column Count Access Benchmark")
    print("=" * 70)
    print(f"Iterations: {iterations:,}")
    print()

    for num_cols in column_counts:
        # Create a RecordBatch with the specified number of columns
        arrays = [pa.array(np.random.rand(100)) for _ in range(num_cols)]
        names = [f"col_{i}" for i in range(num_cols)]
        batch = pa.record_batch(arrays, names=names)

        # Benchmark len(batch.columns)
        start = time.perf_counter()
        for _ in range(iterations):
            _ = len(batch.columns)
        time_len_columns = time.perf_counter() - start

        # Benchmark batch.num_columns
        start = time.perf_counter()
        for _ in range(iterations):
            _ = batch.num_columns
        time_num_columns = time.perf_counter() - start

        # Calculate speedup
        speedup = time_len_columns / time_num_columns

        print(f"Columns: {num_cols:>4}")
        print(f" len(batch.columns): {time_len_columns:.4f}s")
        print(f" batch.num_columns: {time_num_columns:.4f}s")
        print(f" Speedup: {speedup:.2f}x")
        print(f" Difference: {(time_len_columns - time_num_columns)*1000:.2f}ms")
        print()


if __name__ == "__main__":
    # Run the benchmark when the file is executed as a script
    benchmark_column_count_access()
```
</details>
```
PyArrow RecordBatch Column Count Access Benchmark
======================================================================
Iterations: 1,000,000

Columns:   10
 len(batch.columns): 4.0907s
 batch.num_columns: 0.0215s
 Speedup: 190.38x
 Difference: 4069.23ms

Columns:   50
 len(batch.columns): 19.7623s
 batch.num_columns: 0.0212s
 Speedup: 932.47x
 Difference: 19741.09ms

Columns:  100
 len(batch.columns): 39.7946s
 batch.num_columns: 0.0191s
 Speedup: 2088.70x
 Difference: 39775.53ms

Columns:  500
 len(batch.columns): 205.2582s
 batch.num_columns: 0.0210s
 Speedup: 9783.63x
 Difference: 205237.24ms

Columns: 1000
 len(batch.columns): 447.2963s
 batch.num_columns: 0.0234s
 Speedup: 19085.39x
 Difference: 447272.90ms
```
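To make the scaling easier to read, the totals above can be converted to a per-call and per-column cost. The sketch below only does arithmetic on the `len(batch.columns)` timings reported in the output; the variable names are illustrative, and the numbers are specific to the machine the benchmark ran on.

```python
# Timings copied from the benchmark output: seconds per 1,000,000 calls
# of len(batch.columns), keyed by the number of columns in the batch.
iterations = 1_000_000
len_columns_seconds = {
    10: 4.0907,
    50: 19.7623,
    100: 39.7946,
    500: 205.2582,
    1000: 447.2963,
}

# Cost of one call in microseconds, then normalized by column count.
per_call_us = {n: total / iterations * 1e6 for n, total in len_columns_seconds.items()}
per_column_us = {n: per_call_us[n] / n for n in per_call_us}

for n in len_columns_seconds:
    print(f"Columns: {n:>4}  per call: {per_call_us[n]:8.2f}us  per column: {per_column_us[n]:.3f}us")
```

The per-column figure stays between roughly 0.40 and 0.45 microseconds across all five sizes, which is consistent with a cost that is linear in the number of columns.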
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
The existing tests should pass.
### Was this patch authored or co-authored using generative AI tooling?
No.