holdenk commented on PR #46143: URL: https://github.com/apache/spark/pull/46143#issuecomment-3712080976
> That's why my initial suggestion was to not do this optimization at all. We just keep the `Filter` above the `Project`. By doing so we avoid the expensive expression duplication caused by filter pushdown, but all expressions in `Project` now need to be evaluated against the full input. I'm not sure how serious this issue is, and I was just trying to help simplify the algorithm given that you are doing this optimization. I'm more than happy if you agree to drop this optimization and simplify the code.

So just always leave complex filters up and don't attempt to split them when needed? I think that's sub-optimal for fairly self-evident reasons, *but* if you still find the current implementation too complex, I could move it into a follow-on PR so there's less to review here and we *just* fix the perf regression introduced in 3.0.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
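The trade-off under discussion can be sketched outside of Spark. This is an illustrative Python sketch (not Catalyst code, and the names here are hypothetical): pushing a filter below a projection substitutes the projected expression into the predicate, so an expensive expression is evaluated once in the filter and again in the projection for surviving rows; keeping the filter above the projection evaluates the expression exactly once per row, but for every input row, including rows the filter later discards.

```python
# Hypothetical model of the two plan shapes; `expensive` stands in for a
# costly projected expression (e.g. a UDF call).
calls = {"expensive": 0}

def expensive(x):
    calls["expensive"] += 1
    return x * x

rows = list(range(10))

# Plan A: filter pushed below the projection. The predicate becomes
# expensive(x) > 25 (the expression is duplicated into the filter),
# then the projection re-evaluates expensive(x) on surviving rows.
calls["expensive"] = 0
pushed = [expensive(x) for x in rows if expensive(x) > 25]
pushed_cost = calls["expensive"]  # 10 filter evals + 4 projection evals = 14

# Plan B: filter kept above the projection. expensive(x) is computed
# once per input row, even for rows the filter then discards.
calls["expensive"] = 0
kept = [y for y in (expensive(x) for x in rows) if y > 25]
kept_cost = calls["expensive"]  # 10 evals

assert pushed == kept            # same result either way
print(pushed_cost, kept_cost)    # prints: 14 10
```

Which shape wins depends on the predicate's selectivity and the cost of the duplicated expression, which is why neither "always push" nor "never push" is obviously right here.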
