Yicong-Huang opened a new pull request, #55691: URL: https://github.com/apache/spark/pull/55691
### What changes were proposed in this pull request?

This PR adds ASV microbenchmarks for `SQL_ARROW_UDTF` (PyArrow-native Python UDTFs created via `@arrow_udtf`) to `python/benchmarks/bench_eval_type.py`.

The new `_ArrowUDTFBenchMixin` produces `ArrowUDTFTimeBench` and `ArrowUDTFPeakmemBench`, parametrized by scenario (batch size, column count, type pool) and three handler variants:

- `identity_udtf` - yields the input batch as a `pa.Table`
- `filter_udtf` - keeps rows whose first column is non-null (vectorized)
- `count_udtf` - aggregates each batch into a single-row count table

To support this, two helpers are added to `MockProtocolWriter`:

- `write_worker_input` gains an optional `eval_conf` parameter (UDTF needs `table_arg_offsets` in EvalConf)
- `write_arrow_udtf_payload` mirrors `PythonUDTFRunner.writeUDTF` on the JVM side: argument offsets, partition-child indexes, optional pickled `AnalyzeResult`, the cloudpickled UDTF class, the result schema, and the UDTF name

The wire batch carries one struct column `_0` whose fields are the table's schema; `table_arg_offsets=[0]` tells `ArrowStreamArrowUDTFSerializer` to flatten that struct into a `pa.RecordBatch` for the UDTF's `eval(batch)` method.

### Why are the changes needed?

This is part of [SPARK-55724](https://issues.apache.org/jira/browse/SPARK-55724) (Micro-benchmark PySpark Eval Types). Establishing a stable baseline for `SQL_ARROW_UDTF` is a prerequisite for the upcoming serializer refactor so we can detect any regression objectively.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New ASV microbenchmarks.
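
For readers unfamiliar with ASV, the parametrization described above (one mixin, two benchmark classes, a scenario axis and a handler axis) can be sketched roughly as follows. This is not the code in the PR: the `Sketch*` class names, the scenario sizes, and the use of a plain dict of lists in place of a `pa.Table` are all assumptions made so that the example runs without PyArrow or a Spark worker. Only the ASV conventions (`params`, `param_names`, `setup`, `time_*`, `peakmem_*`) are real.

```python
# Hypothetical sketch of the ASV benchmark shape; not the PR's implementation.
# A "batch" here is a dict mapping column name -> list of values, standing in
# for a pa.Table so that the example has no third-party dependencies.

# Stand-in scenarios as (rows, columns). These sizes are invented; the PR's
# scenarios also vary the type pool.
SCENARIOS = {
    "sm_batch_few_col": (100, 2),
    "lg_batch_many_col": (1000, 8),
}


def identity_udtf(batch):
    # Yield the input batch unchanged.
    yield batch


def filter_udtf(batch):
    # Keep only the rows whose first column is non-null.
    first = next(iter(batch.values()))
    keep = [i for i, v in enumerate(first) if v is not None]
    yield {name: [col[i] for i in keep] for name, col in batch.items()}


def count_udtf(batch):
    # Collapse the batch into a single-row count table.
    first = next(iter(batch.values()))
    yield {"count": [len(first)]}


UDTFS = {f.__name__: f for f in (identity_udtf, filter_udtf, count_udtf)}


class _SketchUDTFBenchMixin:
    # ASV calls each benchmark method once per combination of these params.
    params = [list(SCENARIOS), list(UDTFS)]
    param_names = ["scenario", "udtf"]

    def setup(self, scenario, udtf):
        rows, cols = SCENARIOS[scenario]
        # Every third value is None so that filter_udtf has rows to drop.
        self.batch = {
            f"c{j}": [None if i % 3 == 0 else i for i in range(rows)]
            for j in range(cols)
        }
        self.handler = UDTFS[udtf]

    def _run(self):
        # Drain the generator, as a worker would consume the UDTF output.
        return list(self.handler(self.batch))


class SketchUDTFTimeBench(_SketchUDTFBenchMixin):
    def time_worker(self, scenario, udtf):
        self._run()


class SketchUDTFPeakmemBench(_SketchUDTFBenchMixin):
    def peakmem_worker(self, scenario, udtf):
        self._run()
```

Putting the shared `params` and `setup` in a mixin means the timing and peak-memory classes measure the same workload, which is the arrangement the PR describes for `ArrowUDTFTimeBench` and `ArrowUDTFPeakmemBench`.
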
Two stable runs on the same machine produced consistent numbers (one run shown):

```text
=== bench_eval_type.ArrowUDTFTimeBench.time_worker ===
scenario            udtf            value
sm_batch_few_col    identity_udtf   1.23 ms
sm_batch_few_col    filter_udtf     1.62 ms
sm_batch_few_col    count_udtf      0.97 ms
sm_batch_many_col   identity_udtf   2.23 ms
sm_batch_many_col   filter_udtf     3.13 ms
sm_batch_many_col   count_udtf      1.01 ms
lg_batch_few_col    identity_udtf   1.66 ms
lg_batch_few_col    filter_udtf     3.57 ms
lg_batch_few_col    count_udtf      1.14 ms
lg_batch_many_col   identity_udtf   9.75 ms
lg_batch_many_col   filter_udtf     15.67 ms
lg_batch_many_col   count_udtf      3.06 ms
pure_ints           identity_udtf   5.73 ms
pure_ints           filter_udtf     7.17 ms
pure_ints           count_udtf      2.02 ms
pure_strings        identity_udtf   5.06 ms
pure_strings        filter_udtf     9.55 ms
pure_strings        count_udtf      2.19 ms

=== bench_eval_type.ArrowUDTFPeakmemBench.peakmem_worker ===
scenario            udtf            value
sm_batch_few_col    identity_udtf   442.2 MB
sm_batch_few_col    filter_udtf     443.8 MB
sm_batch_few_col    count_udtf      442.2 MB
sm_batch_many_col   identity_udtf   443.2 MB
sm_batch_many_col   filter_udtf     445.6 MB
sm_batch_many_col   count_udtf      442.6 MB
lg_batch_few_col    identity_udtf   446.0 MB
lg_batch_few_col    filter_udtf     447.7 MB
lg_batch_few_col    count_udtf      443.5 MB
lg_batch_many_col   identity_udtf   458.7 MB
lg_batch_many_col   filter_udtf     460.8 MB
lg_batch_many_col   count_udtf      452.1 MB
pure_ints           identity_udtf   447.5 MB
pure_ints           filter_udtf     449.0 MB
pure_ints           count_udtf      444.8 MB
pure_strings        identity_udtf   451.2 MB
pure_strings        filter_udtf     453.6 MB
pure_strings        count_udtf      445.2 MB
```

Run command:

```bash
COLUMNS=120 asv run --bench ArrowUDTF -a repeat=3 --python=same
```

### Was this patch authored or co-authored using generative AI tooling?

No.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
