[PR] [SPARK-57020][PYTHON][TESTS] Add ASV microbenchmark for SQL_TRANSFORM_WITH_STATE_PANDAS_UDF [spark]

via GitHub Thu, 28 May 2026 13:53:04 -0700


Yicong-Huang opened a new pull request, #56192:
URL: https://github.com/apache/spark/pull/56192


   ### What changes were proposed in this pull request?
   
   Add an ASV microbenchmark for `SQL_TRANSFORM_WITH_STATE_PANDAS_UDF` 
(stateful streaming with Pandas) to `bench_eval_type.py`.
   
   The new benchmark drives the worker through the 
`TransformWithStateInPandasSerializer` wire protocol: a single plain Arrow 
stream pre-sorted by the grouping key column, which the serializer chunks into 
one `(mode, key, pdfs)` tuple per group, followed by phantom `PROCESS_TIMER` 
and `COMPLETE` calls with empty pdf iterators. 
`StatefulProcessorApiClient.__init__` opens a real TCP socket to the JVM state 
server, so a stub listener, `_StubStateServer`, accepts that connection. The 
benchmark UDFs never call a state API method, so no protocol exchange beyond 
the initial connect is needed.
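
A connect-only stub of this kind could look roughly like the sketch below. It is not the `_StubStateServer` from this PR: the class name `StubStateServer`, the loopback host, and the polling interval are assumptions made for illustration. The listener accepts connections and holds them open without speaking any protocol.

```python
import socket
import threading


class StubStateServer:
    """Hypothetical stand-in for the JVM state server: accepts connections only."""

    def __init__(self, host="127.0.0.1"):
        self._listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        self._listener.bind((host, 0))  # port 0 lets the OS pick a free port
        self._listener.listen()
        self._listener.settimeout(0.1)  # poll so the accept loop can notice shutdown
        self.port = self._listener.getsockname()[1]
        self._accepted = []
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._accept_loop, daemon=True)
        self._thread.start()

    def _accept_loop(self):
        while not self._stop.is_set():
            try:
                conn, _ = self._listener.accept()
            except socket.timeout:
                continue  # nothing pending; re-check the stop flag
            except OSError:
                return  # listener was closed underneath us
            # Keep the connection open so the client never sees EOF.
            self._accepted.append(conn)

    def close(self):
        self._stop.set()
        self._thread.join()
        self._listener.close()
        for conn in self._accepted:
            conn.close()


# A client that only needs connect() to succeed, as in the benchmark setup.
server = StubStateServer()
client = socket.create_connection(("127.0.0.1", server.port), timeout=5)
connected = client.getpeername()[1] == server.port
client.close()
server.close()
```

Because no state API method is called, the stub never has to read or write on the accepted sockets.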
   
   Scenarios (`num_groups` x `rows_per_group` x `num_value_cols`):
   
   - `few_groups_sm` (50 x 5K x 5)
   - `few_groups_lg` (50 x 50K x 5)
   - `many_groups_sm` (2000 x 500 x 5)
   - `many_groups_lg` (500 x 2K x 5)
   - `wide_cols` (200 x 5K x 20)
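
For orientation, the input volume of each scenario follows directly from the three factors. The snippet below only redoes that arithmetic; the dictionary layout and function names are illustrative, and the counts cover value columns only, since the list does not say how the grouping key column is stored.

```python
# (num_groups, rows_per_group, num_value_cols), copied from the scenario list
SCENARIOS = {
    "few_groups_sm": (50, 5_000, 5),
    "few_groups_lg": (50, 50_000, 5),
    "many_groups_sm": (2_000, 500, 5),
    "many_groups_lg": (500, 2_000, 5),
    "wide_cols": (200, 5_000, 20),
}


def total_rows(name):
    """Rows in the whole input stream for one scenario."""
    num_groups, rows_per_group, _ = SCENARIOS[name]
    return num_groups * rows_per_group


def total_values(name):
    """Value cells (rows x value columns), excluding the grouping key."""
    return total_rows(name) * SCENARIOS[name][2]


for name in SCENARIOS:
    print(f"{name:<15} rows={total_rows(name):>9,} values={total_values(name):>10,}")
```

`few_groups_sm` is the only scenario with 250K rows. `few_groups_lg` has 2.5M rows, and the other three have 1M rows each, differing in how those rows are split into groups or in column count.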
   
   UDFs covered: `identity_udf` (full pdf passthrough), `sort_udf` (sort by 
first column), `count_udf` (collapse each group to a single-row count). Row 
counts are scaled so `identity_udf` stays under ASV's 60s per-sample timeout.
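
As a rough sketch of how a scenario-by-UDF grid maps onto ASV's parametrized benchmark classes: the code below is not taken from `bench_eval_type.py`. The class names, the scaled-down sizes, and the plain-Python UDF stand-ins (lists of tuples instead of pandas DataFrames) are all assumptions, and the sketch skips the worker and serializer entirely.

```python
import random


# Plain-Python stand-ins for the three UDF shapes (assumption: the real
# UDFs operate on pandas DataFrames, not on lists of tuples).
def identity_udf(rows):
    return rows  # full passthrough


def sort_udf(rows):
    return sorted(rows, key=lambda r: r[0])  # sort by first column


def count_udf(rows):
    return [(len(rows),)]  # collapse the group to a single-row count


UDFS = {"identity_udf": identity_udf, "sort_udf": sort_udf, "count_udf": count_udf}

# Hypothetical, heavily scaled-down sizes: (num_groups, rows_per_group)
SKETCH_SCENARIOS = {"few_groups_sm": (5, 50), "many_groups_sm": (20, 5)}


class _SketchBase:
    # ASV runs each benchmark method once per combination of these parameters.
    params = [list(SKETCH_SCENARIOS), list(UDFS)]
    param_names = ["scenario", "udf"]

    def setup(self, scenario, udf):
        num_groups, rows_per_group = SKETCH_SCENARIOS[scenario]
        rng = random.Random(0)  # fixed seed keeps the input identical across runs
        self.groups = [
            [(rng.random(), g) for _ in range(rows_per_group)]
            for g in range(num_groups)
        ]
        self.udf = UDFS[udf]

    def _run(self):
        for rows in self.groups:
            self.udf(rows)


class SketchTimeBench(_SketchBase):
    def time_udf(self, scenario, udf):  # `time_` prefix: wall-clock timing
        self._run()


class SketchPeakmemBench(_SketchBase):
    def peakmem_udf(self, scenario, udf):  # `peakmem_` prefix: peak memory
        self._run()


# Driving one parameter combination by hand, outside of ASV.
bench = SketchTimeBench()
bench.setup("few_groups_sm", "count_udf")
bench.time_udf("few_groups_sm", "count_udf")
sample = count_udf(bench.groups[0])
```

Keeping the timing and peak-memory methods in separate classes mirrors the two result tables reported in the test section of this PR.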
   
   ### Why are the changes needed?
   
   Part of [SPARK-55724](https://issues.apache.org/jira/browse/SPARK-55724). 
Establishes a performance baseline before refactoring 
`SQL_TRANSFORM_WITH_STATE_PANDAS_UDF`.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No
   
   ### How was this patch tested?
   
   `COLUMNS=120 ./python/asv run --python=same --bench 
"TransformWithStatePandas" -a "repeat=(3,5,5.0)"` (one of two stable runs):
   
   `TransformWithStatePandasUDFTimeBench`:
   
   ```text
   ================ ============== ============== ==============
   --                                   udf
   ---------------- --------------------------------------------
       scenario      identity_udf     sort_udf      count_udf
   ================ ============== ============== ==============
    few_groups_sm     393+/-1ms     404+/-0.7ms     380+/-2ms
    few_groups_lg    3.68+/-0.01s   3.80+/-0.01s   3.46+/-0.01s
    many_groups_sm   3.34+/-0.01s   3.62+/-0.02s   2.86+/-0.01s
    many_groups_lg   1.90+/-0.01s   1.98+/-0.01s    1.77+/-0s
      wide_cols      3.71+/-0.01s   3.79+/-0.02s   3.40+/-0.01s
   ================ ============== ============== ==============
   ```
   
   `TransformWithStatePandasUDFPeakmemBench`:
   
   ```text
   ================ ============== ========== ===========
   --                                udf
   ---------------- -------------------------------------
       scenario      identity_udf   sort_udf   count_udf
   ================ ============== ========== ===========
    few_groups_sm        486M         486M        476M
    few_groups_lg        569M         579M        541M
    many_groups_sm       511M         512M        492M
    many_groups_lg       510M         510M        492M
      wide_cols          619M         610M        585M
   ================ ============== ========== ===========
   ```
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

