Kimahriman commented on PR #32987:
URL: https://github.com/apache/spark/pull/32987#issuecomment-1496399849
Threw together a quick script to get some rough numbers. Did two types of
queries, one doing a `sqrt` and one doing a `regexp_extract` to test a simple
numeric thing and a more complicated string thing, creating 10 and 5 columns,
respectively that would be eligible for conditional subexpressions.
```python
cols1 = []
for i in range(10):
cols1.append(F.col('id') + i)
cols2 = []
for i in range(5):
cols2.append(F.regexp_extract(F.col('id') + i + 1, r'^1+$', 0))
# Condition will always be false
df1 = spark.range(1_000_000_000).select(*(F.when(c < 0, c) for c in cols1))
df2 = spark.range(100_000_000).select(*(F.when(c == '2', c) for c in cols2))
# Condition will always be true
df3 = spark.range(1_000_000_000).select(*(F.when(c > 0, c) for c in cols1))
df4 = spark.range(100_000_000).select(*(F.when(c != '2', c) for c in cols2))
```
I timed writing this df using `noop`, running three runs of each. Also
confirmed the subexpressions when the config are enabled generated subexpr
functions and weren't inlined.
Result:
```
Evaluated once:
Time for addition without subexpr elimination 0:00:40.509674
Time for addition without subexpr elimination 0:00:38.797353
Time for addition without subexpr elimination 0:00:38.619189
Total time for addition without subexpr elimination 0:00:38.619189
Time for addition with subexpr elimination 0:00:40.144825
Time for addition with subexpr elimination 0:00:40.105391
Time for addition with subexpr elimination 0:00:40.247531
Total time for addition with subexpr elimination 0:00:40.247531
Time for regex without subexpr elimination 0:01:01.763887
Time for regex without subexpr elimination 0:01:03.322138
Time for regex without subexpr elimination 0:01:02.939263
Total time for regex without subexpr elimination 0:01:02.939263
Time for regex with subexpr elimination 0:01:03.709922
Time for regex with subexpr elimination 0:01:02.039912
Time for regex with subexpr elimination 0:01:01.814399
Total time for regex with subexpr elimination 0:01:01.814399
Evaluated twice:
Time for addition without subexpr elimination 0:00:32.890037
Time for addition without subexpr elimination 0:00:35.095279
Time for addition without subexpr elimination 0:00:34.788411
Total time for addition without subexpr elimination 0:00:34.788411
Time for addition with subexpr elimination 0:00:33.468247
Time for addition with subexpr elimination 0:00:35.002380
Time for addition with subexpr elimination 0:00:34.932783
Total time for addition with subexpr elimination 0:00:34.932783
Time for regex without subexpr elimination 0:02:05.794253
Time for regex without subexpr elimination 0:02:03.235071
Time for regex without subexpr elimination 0:02:03.980708
Total time for regex without subexpr elimination 0:02:03.980708
Time for regex with subexpr elimination 0:01:11.863820
Time for regex with subexpr elimination 0:01:12.579065
Time for regex with subexpr elimination 0:01:12.959850
Total time for regex with subexpr elimination 0:01:12.959850
```
You can clearly see the benefit for the more complex expression (String
manipulation). For the numeric expression it's so quick to evaluate there's
less benefit, and maybe slight overhead creating the additional functions. And
in fact the addition was quicker to execute twice inlined than once in a
function, indicating existing subexpression elimination can fail for very fast
expressions. But the variability makes it hard to know exactly, and the
differences seem to be pretty small in the grand scheme of things.
Two additional thoughts on all this:
- It's currently feature flagged and off by default so it won't introduce
any performance regressions without opting into it
- The idea is to reduce the tail execution time, not make the best case
scenario faster, which (in my biased opinion) seems generally more important
for consistency and reliability
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]