[
https://issues.apache.org/jira/browse/SPARK-55724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yicong Huang updated SPARK-55724:
---------------------------------
Description:
SPARK-55384 and SPARK-55388 are refactoring the serializer and Python UDF eval
types. To guard against performance regressions, we need microbenchmarks
covering the Python-side serialization round-trip ({{dump_stream}} /
{{load_stream}}) for each eval type — no JVM required.
h3. Plan
New file: {{python/benchmarks/bench_eval_type.py}}
29 eval types, one ASV benchmark class each:
||#||Eval Type||Value||
|1|{{NON_UDF}}|0|
|2|{{SQL_BATCHED_UDF}}|100|
|3|{{SQL_ARROW_BATCHED_UDF}}|101|
|4|{{SQL_SCALAR_PANDAS_UDF}}|200|
|5|{{SQL_GROUPED_MAP_PANDAS_UDF}}|201|
|6|{{SQL_GROUPED_AGG_PANDAS_UDF}}|202|
|7|{{SQL_WINDOW_AGG_PANDAS_UDF}}|203|
|8|{{SQL_SCALAR_PANDAS_ITER_UDF}}|204|
|9|{{SQL_MAP_PANDAS_ITER_UDF}}|205|
|10|{{SQL_COGROUPED_MAP_PANDAS_UDF}}|206|
|11|{{SQL_MAP_ARROW_ITER_UDF}}|207|
|12|{{SQL_GROUPED_MAP_PANDAS_UDF_WITH_STATE}}|208|
|13|{{SQL_GROUPED_MAP_ARROW_UDF}}|209|
|14|{{SQL_COGROUPED_MAP_ARROW_UDF}}|210|
|15|{{SQL_TRANSFORM_WITH_STATE_PANDAS_UDF}}|211|
|16|{{SQL_TRANSFORM_WITH_STATE_PANDAS_INIT_STATE_UDF}}|212|
|17|{{SQL_TRANSFORM_WITH_STATE_PYTHON_ROW_UDF}}|213|
|18|{{SQL_TRANSFORM_WITH_STATE_PYTHON_ROW_INIT_STATE_UDF}}|214|
|19|{{SQL_GROUPED_MAP_ARROW_ITER_UDF}}|215|
|20|{{SQL_GROUPED_MAP_PANDAS_ITER_UDF}}|216|
|21|{{SQL_GROUPED_AGG_PANDAS_ITER_UDF}}|217|
|22|{{SQL_SCALAR_ARROW_UDF}}|250|
|23|{{SQL_SCALAR_ARROW_ITER_UDF}}|251|
|24|{{SQL_GROUPED_AGG_ARROW_UDF}}|252|
|25|{{SQL_WINDOW_AGG_ARROW_UDF}}|253|
|26|{{SQL_GROUPED_AGG_ARROW_ITER_UDF}}|254|
|27|{{SQL_TABLE_UDF}}|300|
|28|{{SQL_ARROW_TABLE_UDF}}|301|
|29|{{SQL_ARROW_UDTF}}|302|
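One benchmark class per eval type could follow the usual ASV shape, as in the minimal sketch below (the class name is hypothetical, and plain {{pickle}} stands in for whatever serializer the eval type actually selects — the real benchmark would exercise the PySpark serializer's {{dump_stream}} / {{load_stream}}):

```python
# Hypothetical sketch of one benchmark class for bench_eval_type.py.
# ASV runs `setup` before timing and discovers methods named `time_*`.
import pickle


class TimeSqlBatchedUdf:
    def setup(self):
        # Synthetic rows standing in for a UDF's input batch.
        self.rows = [(i, float(i)) for i in range(10_000)]
        # Pre-serialized payload so the load benchmark measures only loading.
        # Placeholder serializer: the real class would use the serializer
        # chosen for SQL_BATCHED_UDF (eval type 100).
        self.payload = pickle.dumps(self.rows)

    def time_dump_stream(self):
        pickle.dumps(self.rows)

    def time_load_stream(self):
        pickle.loads(self.payload)
```

Keeping one class per eval type (rather than parameterizing a single class) gives each eval type its own timing history in ASV, so a regression in one serializer is visible in isolation.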
h3. Stream protocol simulation (pure Python, no JVM)
* *Plain*: raw Arrow IPC
* *Grouped*: {{write_int(1)}} \+ Arrow IPC \+ {{write_int(0)}}
* *Cogroup*: {{write_int(2)}} \+ 2x Arrow IPC \+ {{write_int(0)}}
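The framing above can be simulated with only the standard library. {{write_int}} here mirrors PySpark's 4-byte big-endian helper, and a bytes literal stands in for a real Arrow IPC stream (an IPC stream is self-delimiting, so the reader would hand it to pyarrow rather than slice it by length):

```python
import struct
from io import BytesIO


def write_int(stream, value):
    # Mirrors PySpark's write_int: 4-byte big-endian signed int.
    stream.write(struct.pack("!i", value))


def read_int(stream):
    return struct.unpack("!i", stream.read(4))[0]


def dump_grouped(stream, ipc_bytes):
    # Grouped protocol: write_int(1) + Arrow IPC + write_int(0).
    write_int(stream, 1)
    stream.write(ipc_bytes)
    write_int(stream, 0)


def dump_cogrouped(stream, left_ipc, right_ipc):
    # Cogroup protocol: write_int(2) + 2x Arrow IPC + write_int(0).
    write_int(stream, 2)
    stream.write(left_ipc)
    stream.write(right_ipc)
    write_int(stream, 0)


buf = BytesIO()
dump_grouped(buf, b"<arrow-ipc>")  # placeholder for a real IPC payload
buf.seek(0)
assert read_int(buf) == 1  # start-of-group marker
```

Because both sides of the round-trip are pure Python, this is enough to benchmark {{dump_stream}} / {{load_stream}} framing overhead without starting a JVM.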
> Micro-benchmark PySpark Eval Types
> ----------------------------------
>
> Key: SPARK-55724
> URL: https://issues.apache.org/jira/browse/SPARK-55724
> Project: Spark
> Issue Type: Umbrella
> Components: PySpark
> Affects Versions: 4.2.0
> Reporter: Yicong Huang
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]