Kimahriman commented on PR #32987:
URL: https://github.com/apache/spark/pull/32987#issuecomment-1496399849

   Threw together a quick script to get some rough numbers. Did two types of 
queries, one doing a `sqrt` and one doing a `regexp_extract` to test a simple 
numeric thing and a more complicated string thing, creating 10 and 5 columns, 
respectively that would be eligible for conditional subexpressions. 
   
   ```python
   cols1 = []
   for i in range(10):
       cols1.append(F.col('id') + i)
   cols2 = []
   for i in range(5):
       cols2.append(F.regexp_extract(F.col('id') + i + 1, r'^1+$', 0))
   
   # Condition will always be false
   df1 = spark.range(1_000_000_000).select(*(F.when(c < 0, c) for c in cols1))
   df2 = spark.range(100_000_000).select(*(F.when(c == '2', c) for c in cols2))
   # Condition will always be true
   df3 = spark.range(1_000_000_000).select(*(F.when(c > 0, c) for c in cols1))
   df4 = spark.range(100_000_000).select(*(F.when(c != '2', c) for c in cols2))
   ```
   
   I timed writing this df using `noop`, running three runs of each. Also 
confirmed the subexpressions when the config are enabled generated subexpr 
functions and weren't inlined.
   
   Result:
   ```
   Evaluated once:
   Time for addition without subexpr elimination 0:00:40.509674
   Time for addition without subexpr elimination 0:00:38.797353
   Time for addition without subexpr elimination 0:00:38.619189
   Total time for addition without subexpr elimination 0:00:38.619189
   
   Time for addition with subexpr elimination 0:00:40.144825
   Time for addition with subexpr elimination 0:00:40.105391
   Time for addition with subexpr elimination 0:00:40.247531
   Total time for addition with subexpr elimination 0:00:40.247531
   
   Time for regex without subexpr elimination 0:01:01.763887
   Time for regex without subexpr elimination 0:01:03.322138
   Time for regex without subexpr elimination 0:01:02.939263
   Total time for regex without subexpr elimination 0:01:02.939263
   
   Time for regex with subexpr elimination 0:01:03.709922
   Time for regex with subexpr elimination 0:01:02.039912
   Time for regex with subexpr elimination 0:01:01.814399
   Total time for regex with subexpr elimination 0:01:01.814399
   
   Evaluated twice:
   Time for addition without subexpr elimination 0:00:32.890037
   Time for addition without subexpr elimination 0:00:35.095279
   Time for addition without subexpr elimination 0:00:34.788411
   Total time for addition without subexpr elimination 0:00:34.788411
   
   Time for addition with subexpr elimination 0:00:33.468247
   Time for addition with subexpr elimination 0:00:35.002380
   Time for addition with subexpr elimination 0:00:34.932783
   Total time for addition with subexpr elimination 0:00:34.932783
   
   Time for regex without subexpr elimination 0:02:05.794253
   Time for regex without subexpr elimination 0:02:03.235071
   Time for regex without subexpr elimination 0:02:03.980708
   Total time for regex without subexpr elimination 0:02:03.980708
   
   Time for regex with subexpr elimination 0:01:11.863820
   Time for regex with subexpr elimination 0:01:12.579065
   Time for regex with subexpr elimination 0:01:12.959850
   Total time for regex with subexpr elimination 0:01:12.959850
   ```
   
   You can clearly see the benefit for the more complex expression (String 
manipulation). For the numeric expression it's so quick to evaluate there's 
less benefit, and maybe slight overhead creating the additional functions. And 
in fact the addition was quicker to execute twice inlined than once in a 
function, indicating existing subexpression elimination can fail for very fast 
expressions. But the variability makes it hard to know exactly, and the 
differences seem to be pretty small in the grand scheme of things.
   
   Two additional thoughts on all this:
   - It's currently feature flagged and off by default so it won't introduce 
any performance regressions without opting into it
   - The idea is to reduce the tail execution time, not make the best case 
scenario faster, which (in my biased opinion) seems generally more important 
for consistency and reliability


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to