Yicong-Huang opened a new pull request, #56192: URL: https://github.com/apache/spark/pull/56192
### What changes were proposed in this pull request?

Add an ASV micro-benchmark for `SQL_TRANSFORM_WITH_STATE_PANDAS_UDF` (stateful streaming with Pandas) to `bench_eval_type.py`.

The new benchmark drives the worker through the `TransformWithStateInPandasSerializer` wire protocol: a single plain Arrow stream pre-sorted by the grouping key column, which the serializer chunks into one `(mode, key, pdfs)` tuple per group, followed by phantom `PROCESS_TIMER` and `COMPLETE` calls with empty pdf iterators.

`StatefulProcessorApiClient.__init__` opens a real TCP socket to the JVM state server; a stub `_StubStateServer` listener satisfies that connect. The benchmark UDFs never invoke any state API method, so no protocol exchange beyond connect is needed.

Scenarios (`num_groups` x `rows_per_group` x `num_value_cols`):

- `few_groups_sm` (50 x 5K x 5)
- `few_groups_lg` (50 x 50K x 5)
- `many_groups_sm` (2000 x 500 x 5)
- `many_groups_lg` (500 x 2K x 5)
- `wide_cols` (200 x 5K x 20)

UDFs covered: `identity_udf` (full pdf passthrough), `sort_udf` (sort by first column), `count_udf` (collapse each group to a single-row count).

Row counts are scaled so `identity_udf` stays under ASV's 60s per-sample timeout.

### Why are the changes needed?

Part of [SPARK-55724](https://issues.apache.org/jira/browse/SPARK-55724). Establishes a performance baseline before refactoring `SQL_TRANSFORM_WITH_STATE_PANDAS_UDF`.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

`COLUMNS=120 ./python/asv run --python=same --bench "TransformWithStatePandas" -a "repeat=(3,5,5.0)"` (one of two stable runs):

`TransformWithStatePandasUDFTimeBench`:

```text
================ ============== ============ ============
--                                  udf
---------------- ----------------------------------------
scenario         identity_udf   sort_udf     count_udf
================ ============== ============ ============
few_groups_sm    393+/-1ms      404+/-0.7ms  380+/-2ms
few_groups_lg    3.68+/-0.01s   3.80+/-0.01s 3.46+/-0.01s
many_groups_sm   3.34+/-0.01s   3.62+/-0.02s 2.86+/-0.01s
many_groups_lg   1.90+/-0.01s   1.98+/-0.01s 1.77+/-0s
wide_cols        3.71+/-0.01s   3.79+/-0.02s 3.40+/-0.01s
================ ============== ============ ============
```

`TransformWithStatePandasUDFPeakmemBench`:

```text
================ ============== ========== ===========
--                                 udf
---------------- -------------------------------------
scenario         identity_udf   sort_udf   count_udf
================ ============== ========== ===========
few_groups_sm    486M           486M       476M
few_groups_lg    569M           579M       541M
many_groups_sm   511M           512M       492M
many_groups_lg   510M           510M       492M
wide_cols        619M           610M       585M
================ ============== ========== ===========
```

### Was this patch authored or co-authored using generative AI tooling?

No

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
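
For readers unfamiliar with the stub-listener idea described under "What changes were proposed", here is a minimal sketch of a TCP listener that accepts connections and never replies. It is not the PR's `_StubStateServer`: the class name `StubListener`, the loopback host, and the 0.1 s poll interval are assumptions made for illustration, and the real stub may be structured differently.

```python
import socket
import threading


class StubListener:
    """Hypothetical stand-in for a state server: accepts TCP connects, sends nothing."""

    def __init__(self, host: str = "127.0.0.1") -> None:
        self._server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        self._server.bind((host, 0))   # port 0: let the OS pick a free port
        self._server.listen()
        self._server.settimeout(0.1)   # assumed poll interval for the stop flag
        self.port = self._server.getsockname()[1]
        self._clients = []
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._accept_loop, daemon=True)
        self._thread.start()

    def _accept_loop(self) -> None:
        while not self._stop.is_set():
            try:
                conn, _ = self._server.accept()
            except socket.timeout:
                continue               # no pending connect; re-check the stop flag
            except OSError:
                break                  # listening socket was closed
            self._clients.append(conn)  # hold the connection open, never write

    def close(self) -> None:
        self._stop.set()
        self._thread.join(timeout=1.0)
        self._server.close()
        for conn in self._clients:
            conn.close()


if __name__ == "__main__":
    server = StubListener()
    # A client that only needs the connect to succeed can now proceed.
    client = socket.create_connection(("127.0.0.1", server.port), timeout=2.0)
    client.close()
    server.close()
```

A listener like this is enough only when the client opens the socket but never sends a request that expects a response, which matches the PR's statement that the benchmark UDFs never invoke a state API method.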
