[ https://issues.apache.org/jira/browse/SPARK-29881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Devyn Cairns updated SPARK-29881:
---------------------------------
    Description:
I have an interesting situation where I'm calling functions that are relatively expensive from Spark SQL, and then using the result several times in a loop through {{transform}}.

Although the WholeStageCodegen is usually helpful, it always calls expressions as they're used, which means that in the case of, for example:

{{SELECT transform(sequence(0, 32), x -> expensive_result * x)}}
{{FROM (}}
{{ SELECT expensive_operation(foo) AS expensive_result FROM source}}
{{)}}

the expensive_operation function will almost certainly be called 32 times for each source row, without any explicit way to cache that value intermediately.

I've found a workaround for now is to insert something like {{.filter \{ _ => true }}} in the middle, which will create a barrier to whole-stage codegen without much negative impact, aside from preventing other optimizations like PushDown. This does indeed produce the intended result and expensive_operation is only run once.

But it would be great to have an API on Dataset like {{.barrier()}} to introduce an explicit barrier to whole-stage codegen without adding any additional behavior or getting in the way of any PushDown optimizations.

  was:
I have an interesting situation where I'm calling functions that are relatively expensive from Spark SQL, and then using the result several times in a loop through {{transform}}.

Although the WholeStageCodegen is usually helpful, it always calls expressions as they're used, which means that in the case of, for example:

{{SELECT transform(sequence(0, 32), x -> expensive_result * x)}}
{{FROM (}}
{{ SELECT expensive_operation(foo) AS expensive_result FROM source}}
{{)}}

the expensive_operation function will almost certainly be called 32 times for each source row, without any explicit way to cache that value intermediately.
I've found a workaround for now is to insert something like {{.filter \{ _ => true }}} in the middle, which will create a barrier to whole-stage codegen without much negative impact, aside from preventing other optimizations like PushDown. This does indeed produce the intended result and expensive_operation is only run once.

But it would be great to have an API on Dataset like {{.barrier()}} to introduce an explicit barrier to whole-stage codegen without adding any additional behavior or getting in the way of any PushDown optimizations.

> Introduce API for manually breaking up dataset plan
> ---------------------------------------------------
>
>                 Key: SPARK-29881
>                 URL: https://issues.apache.org/jira/browse/SPARK-29881
>             Project: Spark
>          Issue Type: Wish
>          Components: SQL
>    Affects Versions: 2.4.4
>            Reporter: Devyn Cairns
>            Priority: Trivial
>
> I have an interesting situation where I'm calling functions that are
> relatively expensive from Spark SQL, and then using the result several times
> in a loop through {{transform}}.
> Although the WholeStageCodegen is usually helpful, it always calls
> expressions as they're used, which means that in the case of, for example:
> {{SELECT transform(sequence(0, 32), x -> expensive_result * x)}}
> {{FROM (}}
> {{ SELECT expensive_operation(foo) AS expensive_result FROM source}}
> {{)}}
> the expensive_operation function will almost certainly be called 32 times for
> each source row, without any explicit way to cache that value intermediately.
> I've found a workaround for now is to insert something like {{.filter \{ _ =>
> true }}} in the middle, which will create a barrier to whole-stage codegen
> without much negative impact, aside from preventing other optimizations like
> PushDown. This does indeed produce the intended result and
> expensive_operation is only run once.
> But it would be great to have an API on Dataset like {{.barrier()}} to
> introduce an explicit barrier to whole-stage codegen without adding any
> additional behavior or getting in the way of any PushDown optimizations.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
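
The workaround described in the issue can be sketched roughly as follows. This is only an illustration under assumptions, not code from the report: the `expensiveOperation` UDF, the three-row `source` DataFrame, and the accumulator used to count calls are hypothetical stand-ins, and it assumes Spark 2.4 or later (where the SQL `transform` function is available) on the classpath. Whether the always-true filter actually keeps the optimizer from inlining the expensive expression depends on the Spark version, so the sketch prints the physical plan and the call count instead of asserting a particular result.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

object CodegenBarrierSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("codegen-barrier-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stand-in for expensive_operation. The accumulator counts
    // invocations; counts from transformations can be inflated by task retries.
    val calls = spark.sparkContext.longAccumulator("expensive_operation calls")
    val expensiveOperation = udf { (foo: Long) =>
      calls.add(1L)
      foo * 2L
    }

    // Hypothetical stand-in for the `source` table from the issue.
    val source = Seq(1L, 2L, 3L).toDF("foo")

    // Inner query: SELECT expensive_operation(foo) AS expensive_result FROM source
    val inner = source.select(expensiveOperation(col("foo")).as("expensive_result"))

    // The workaround from the issue: a typed filter that keeps every row.
    val keepAll: Row => Boolean = _ => true

    val result = inner
      .filter(keepAll)
      .selectExpr("transform(sequence(0, 32), x -> expensive_result * x) AS scaled")

    // Inspect the physical plan to see where the filter sits relative to the projections.
    result.explain()
    result.collect()
    println(s"expensive_operation calls: ${calls.value}")

    spark.stop()
  }
}
```

Running the same pipeline with the `.filter(keepAll)` line removed and comparing the two printed call counts is one way to check, on a given Spark version, whether the filter changes how often the UDF is evaluated.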