ChenMichael commented on pull request #34684:
URL: https://github.com/apache/spark/pull/34684#issuecomment-975845896
I'm not sure this is the best way to fix the bug, so I'll describe the other
solutions I could come up with and the possible problems with each.
1. Materialize the cached RDD immediately (`.count`)
- `buildBuffers` becomes a blocking call, which moves the cost of
materialization from execution time to planning time.
- Some code may assume the cached RDD is not materialized until execution.
- If someone obtains the cached RDD but never executes it, the work spent
materializing it is wasted. From the code, obtaining the RDD always seems to be
followed by submitting the job to the DAGScheduler, so I don't know when this
would happen.
2. Never use accumulator stats for InMemoryRelation when AQE is on
- The accumulator stats should be more accurate than the estimated stats, so
this could mean missed opportunities for optimization.
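
To make the trade-off between the two options concrete, here is a toy Python sketch. It is not Spark code: `ToyCachedRelation`, `stats_eager`, and `stats_estimate_only` are hypothetical names, rows are pretended to be 8 bytes each, and the model assumes the underlying concern is that an accumulator-backed size is only trustworthy once every partition has been cached.

```python
from dataclasses import dataclass, field
from typing import List, Optional

ROW_BYTES = 8  # assumption: every row occupies 8 bytes in this toy model


@dataclass
class ToyCachedRelation:
    """Hypothetical stand-in for a cached relation (not Spark's InMemoryRelation)."""
    partitions: List[List[int]]   # source data, one inner list per partition
    estimated_size: int           # size estimate available before any scan
    accumulated_size: int = 0     # "accumulator": grows as partitions are cached
    cached: List[Optional[List[int]]] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Nothing is cached until a partition is explicitly computed.
        self.cached = [None] * len(self.partitions)

    def cache_partition(self, i: int) -> None:
        """Cache a single partition and add its size to the accumulator."""
        if self.cached[i] is None:
            self.cached[i] = list(self.partitions[i])
            self.accumulated_size += ROW_BYTES * len(self.partitions[i])

    def materialize(self) -> None:
        """Cache every partition; plays the role of a blocking `.count`."""
        for i in range(len(self.partitions)):
            self.cache_partition(i)

    @property
    def fully_materialized(self) -> bool:
        return all(p is not None for p in self.cached)


def stats_eager(rel: ToyCachedRelation) -> int:
    """Option 1: pay the materialization cost up front, then trust the accumulator."""
    rel.materialize()
    return rel.accumulated_size


def stats_estimate_only(rel: ToyCachedRelation) -> int:
    """Option 2: ignore the accumulator and always report the estimate."""
    return rel.estimated_size


if __name__ == "__main__":
    rel = ToyCachedRelation(partitions=[[0] * 100] * 3, estimated_size=5000)
    rel.cache_partition(0)  # only one of three partitions cached so far
    print("partial accumulator:", rel.accumulated_size)
    print("option 2 (estimate):", stats_estimate_only(rel))
    print("option 1 (eager):   ", stats_eager(rel))
```

In this model option 1 always reports the true size but does all the caching work at the point where statistics are requested, while option 2 costs nothing extra but can be far from the true size.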
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]