[GitHub] [spark] ChenMichael commented on pull request #34684: [SPARK-37442] - Bug when AQE is enabled where replanning tries to rea…

GitBox Mon, 22 Nov 2021 11:23:32 -0800


ChenMichael commented on pull request #34684:
URL: https://github.com/apache/spark/pull/34684#issuecomment-975845896



   I'm not sure the approach in this PR is the best way to fix the bug, so I 
will describe the other solutions I could come up with and the possible 
problems with each.
   
   1. Materialize the cached RDD immediately (e.g. by calling `.count`)
       - `buildBuffers` becomes a blocking call, which moves the cost of 
materialization from execution time to planning time.
       - Some existing code may assume the cached RDD is not materialized 
until execution.
       - If someone obtains the cached RDD but never executes it, the effort 
spent materializing it is wasted. From the code, though, obtaining the RDD 
always seems to be followed by submitting the job to DAGScheduler, so I don't 
know when this would actually happen.
   
   2. Never use accumulator stats for InMemoryRelation when AQE is on
       - The accumulator stats should be more accurate than the estimated 
stats, so this could miss optimization opportunities.
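
To make the trade-off between the two options concrete, here is a toy model in plain Python. It is only a sketch and not Spark code: `ToyCachedRelation`, `build_eager`, `build_lazy`, and the `use_accumulator` flag are hypothetical names invented for illustration, and a simple integer stands in for the accumulator.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class ToyCachedRelation:
    """Toy stand-in for a cached relation (hypothetical, not InMemoryRelation)."""

    compute: Callable[[], List[bytes]]  # produces the cached batches
    estimated_size: int                 # planner's estimate before materialization
    _batches: Optional[List[bytes]] = field(default=None, init=False)
    _size_accumulator: int = field(default=0, init=False)

    def materialize(self) -> None:
        """Run the computation once and record the observed size."""
        if self._batches is None:
            self._batches = self.compute()
            self._size_accumulator = sum(len(b) for b in self._batches)

    @property
    def is_materialized(self) -> bool:
        return self._batches is not None

    def stats(self, use_accumulator: bool = True) -> int:
        """Return the observed size if allowed and available, else the estimate."""
        if use_accumulator and self.is_materialized:
            return self._size_accumulator
        return self.estimated_size


def build_eager(compute: Callable[[], List[bytes]], estimated_size: int) -> ToyCachedRelation:
    """Option 1: materialize while building, so the caller blocks on the work."""
    relation = ToyCachedRelation(compute, estimated_size)
    relation.materialize()
    return relation


def build_lazy(compute: Callable[[], List[bytes]], estimated_size: int) -> ToyCachedRelation:
    """Lazy variant: nothing runs until someone calls materialize()."""
    return ToyCachedRelation(compute, estimated_size)
```

In this model, `build_eager` pays for `compute` up front even if the relation is never used afterwards (the wasted-effort concern in option 1), while calling `stats(use_accumulator=False)` always returns the estimate even when an observed size is available (the missed-optimization concern in option 2).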


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


