[PR] [SPARK-56716][PYTHON][TESTS] Add ASV microbenchmark for SQL_ARROW_UDTF [spark]

via GitHub Tue, 05 May 2026 13:53:24 -0700


Yicong-Huang opened a new pull request, #55691:
URL: https://github.com/apache/spark/pull/55691


   ### What changes were proposed in this pull request?
   
   This PR adds ASV microbenchmarks for `SQL_ARROW_UDTF` (PyArrow-native Python 
UDTFs created via `@arrow_udtf`) to `python/benchmarks/bench_eval_type.py`.
   
   The new `_ArrowUDTFBenchMixin` produces `ArrowUDTFTimeBench` and 
`ArrowUDTFPeakmemBench`, parametrized by scenario (batch size, column count, 
type pool) and three handler variants:
   - `identity_udtf` - yields the input batch as a `pa.Table`
   - `filter_udtf` - keeps rows whose first column is non-null (vectorized)
   - `count_udtf` - aggregates each batch into a single-row count table
   
   To support this, `MockProtocolWriter` is extended in two ways:
   - `write_worker_input` gains an optional `eval_conf` parameter (UDTF needs 
`table_arg_offsets` in EvalConf)
   - `write_arrow_udtf_payload` mirrors `PythonUDTFRunner.writeUDTF` on the JVM 
side: argument offsets, partition-child indexes, optional pickled 
`AnalyzeResult`, the cloudpickled UDTF class, the result schema, and the UDTF 
name
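
The field order listed above can be sketched with the standard library. This is only an illustration of the ordering: the framing used here (big-endian 4-byte ints, length-prefixed byte strings, an int flag for the optional `AnalyzeResult`) and the names `write_int`, `write_bytes`, and `write_udtf_payload_sketch` are assumptions, not the actual layout of `PythonUDTFRunner.writeUDTF` or the signature of `write_arrow_udtf_payload`. The pickled class is passed in as opaque bytes.

```python
import io
import struct
from typing import List, Optional


def write_int(out: io.BytesIO, value: int) -> None:
    """Write a big-endian signed 32-bit integer."""
    out.write(struct.pack("!i", value))


def write_bytes(out: io.BytesIO, data: bytes) -> None:
    """Write a length prefix followed by the raw bytes."""
    write_int(out, len(data))
    out.write(data)


def write_udtf_payload_sketch(
    out: io.BytesIO,
    arg_offsets: List[int],
    partition_child_indexes: List[int],
    pickled_analyze_result: Optional[bytes],
    pickled_udtf: bytes,
    result_schema_json: str,
    udtf_name: str,
) -> None:
    """Emit the fields in the order named in the PR description (assumed framing)."""
    write_int(out, len(arg_offsets))
    for offset in arg_offsets:
        write_int(out, offset)
    write_int(out, len(partition_child_indexes))
    for index in partition_child_indexes:
        write_int(out, index)
    if pickled_analyze_result is None:
        write_int(out, 0)  # flag: no AnalyzeResult present
    else:
        write_int(out, 1)  # flag: AnalyzeResult follows
        write_bytes(out, pickled_analyze_result)
    write_bytes(out, pickled_udtf)
    write_bytes(out, result_schema_json.encode("utf-8"))
    write_bytes(out, udtf_name.encode("utf-8"))


# Hypothetical inputs: placeholder bytes stand in for the cloudpickled class.
pickled_udtf = b"<cloudpickled UDTF class>"
schema_json = '{"type":"struct","fields":[]}'
buf = io.BytesIO()
write_udtf_payload_sketch(buf, [0], [], None, pickled_udtf, schema_json, "identity_udtf")
data = buf.getvalue()
```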
   
   The wire batch carries one struct column `_0` whose fields are the table's 
schema; `table_arg_offsets=[0]` tells `ArrowStreamArrowUDTFSerializer` to 
flatten that struct into a `pa.RecordBatch` for the UDTF's `eval(batch)` method.
   
   ### Why are the changes needed?
   
   This is part of 
[SPARK-55724](https://issues.apache.org/jira/browse/SPARK-55724) 
(Micro-benchmark PySpark Eval Types). Establishing a stable baseline for 
`SQL_ARROW_UDTF` is a prerequisite for the upcoming serializer refactor so we 
can detect any regression objectively.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   New ASV microbenchmarks. Two stable runs on the same machine produced 
consistent numbers (one run shown):
   
   ```text
   === bench_eval_type.ArrowUDTFTimeBench.time_worker ===
   scenario            udtf                   value
   sm_batch_few_col    identity_udtf        1.23 ms
   sm_batch_few_col    filter_udtf          1.62 ms
   sm_batch_few_col    count_udtf           0.97 ms
   sm_batch_many_col   identity_udtf        2.23 ms
   sm_batch_many_col   filter_udtf          3.13 ms
   sm_batch_many_col   count_udtf           1.01 ms
   lg_batch_few_col    identity_udtf        1.66 ms
   lg_batch_few_col    filter_udtf          3.57 ms
   lg_batch_few_col    count_udtf           1.14 ms
   lg_batch_many_col   identity_udtf        9.75 ms
   lg_batch_many_col   filter_udtf         15.67 ms
   lg_batch_many_col   count_udtf           3.06 ms
   pure_ints           identity_udtf        5.73 ms
   pure_ints           filter_udtf          7.17 ms
   pure_ints           count_udtf           2.02 ms
   pure_strings        identity_udtf        5.06 ms
   pure_strings        filter_udtf          9.55 ms
   pure_strings        count_udtf           2.19 ms
   
   === bench_eval_type.ArrowUDTFPeakmemBench.peakmem_worker ===
   scenario            udtf                   value
   sm_batch_few_col    identity_udtf       442.2 MB
   sm_batch_few_col    filter_udtf         443.8 MB
   sm_batch_few_col    count_udtf          442.2 MB
   sm_batch_many_col   identity_udtf       443.2 MB
   sm_batch_many_col   filter_udtf         445.6 MB
   sm_batch_many_col   count_udtf          442.6 MB
   lg_batch_few_col    identity_udtf       446.0 MB
   lg_batch_few_col    filter_udtf         447.7 MB
   lg_batch_few_col    count_udtf          443.5 MB
   lg_batch_many_col   identity_udtf       458.7 MB
   lg_batch_many_col   filter_udtf         460.8 MB
   lg_batch_many_col   count_udtf          452.1 MB
   pure_ints           identity_udtf       447.5 MB
   pure_ints           filter_udtf         449.0 MB
   pure_ints           count_udtf          444.8 MB
   pure_strings        identity_udtf       451.2 MB
   pure_strings        filter_udtf         453.6 MB
   pure_strings        count_udtf          445.2 MB
   ```
   
   Run command:
   ```bash
   COLUMNS=120 asv run --bench ArrowUDTF -a repeat=3 --python=same
   ```
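
For readers unfamiliar with ASV, the run command above selects benchmarks whose names match `ArrowUDTF`, and ASV expands the class-level `params` grid into one measurement per combination. The following is a rough sketch of that shape, assuming a mixin that carries the grid and two subclasses that expose a `time_` and a `peakmem_` method. The class names ending in `Sketch` and the placeholder workload are hypothetical; the scenario and handler names are taken from the result tables.

```python
import itertools


class _ArrowUDTFBenchMixinSketch:
    # ASV reads `params` and `param_names` and runs every combination.
    params = [
        [
            "sm_batch_few_col",
            "sm_batch_many_col",
            "lg_batch_few_col",
            "lg_batch_many_col",
            "pure_ints",
            "pure_strings",
        ],
        ["identity_udtf", "filter_udtf", "count_udtf"],
    ]
    param_names = ["scenario", "udtf"]

    def setup(self, scenario, udtf):
        # Placeholder: the real benchmark would build the serialized worker input here.
        self.payload = f"{scenario}/{udtf}".encode("utf-8")

    def _run_worker(self):
        # Placeholder workload standing in for a worker invocation.
        return len(self.payload)


class ArrowUDTFTimeBenchSketch(_ArrowUDTFBenchMixinSketch):
    def time_worker(self, scenario, udtf):
        # Methods prefixed `time_` are timed by ASV.
        self._run_worker()


class ArrowUDTFPeakmemBenchSketch(_ArrowUDTFBenchMixinSketch):
    def peakmem_worker(self, scenario, udtf):
        # Methods prefixed `peakmem_` report peak process memory.
        self._run_worker()


# The grid ASV would iterate over: 6 scenarios x 3 handlers.
combos = list(itertools.product(*_ArrowUDTFBenchMixinSketch.params))
```

The 18 combinations correspond to the 18 rows in each result table above.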
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

