[GitHub] [spark] HeartSaVioR opened a new pull request, #37248: [SPARK-39834][SQL][SS] Include the origin stats and constraints for LogicalRDD if it comes from DataFrame

GitBox Thu, 21 Jul 2022 20:12:06 -0700


HeartSaVioR opened a new pull request, #37248:
URL: https://github.com/apache/spark/pull/37248


   ### What changes were proposed in this pull request?
   
   This PR proposes to effectively revert SPARK-39748 and instead include the 
origin stats and constraints in LogicalRDD when it comes from a DataFrame, to 
help the optimizer figure out a better plan.
   
   ### Why are the changes needed?
   
   We found several issues with 
[SPARK-39748](https://issues.apache.org/jira/browse/SPARK-39748):
   
   1. One of the major use cases for DataFrame.checkpoint is ML, especially 
iterative algorithms, where its purpose is to "prune" the logical plan. Including 
the origin logical plan defeats that purpose, and we risk ending up with nested 
LogicalRDDs that make the logical plan grow without bound.
   
   2. We leverage the logical plan to carry over stats, but the correct stats 
information is in the optimized plan.
   
   3. (Not an issue, but a missed spot) Constraints are also something we can 
carry over.
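
   To illustrate the first point, here is a minimal toy model in plain Python. It is not Spark code: `Plan`, `step`, `checkpoint_keep_plan`, `checkpoint_keep_summary` and `run` are hypothetical names invented for this sketch, and the `child` field stands in for any reference a node retains to the plan it came from. The sketch only shows why retaining the origin plan at every checkpoint makes the plan grow with each iteration, while copying just the stats and constraints keeps it at a single leaf. It does not model the second point (taking stats from the optimized plan).

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Optional


@dataclass(frozen=True)
class Plan:
    """Toy logical plan node (hypothetical, not a Spark class)."""
    name: str
    child: Optional["Plan"] = None          # retained reference to an earlier plan
    row_count: int = 0                      # stand-in for plan statistics
    constraints: FrozenSet[str] = frozenset()

    def depth(self) -> int:
        """Number of nodes reachable by following retained references."""
        return 1 + (self.child.depth() if self.child else 0)


def step(plan: Plan) -> Plan:
    """One iteration of a toy algorithm: a filter that halves the rows."""
    return Plan("Filter", plan, plan.row_count // 2, plan.constraints | {"x > 0"})


def checkpoint_keep_plan(plan: Plan) -> Plan:
    """Assumed model of the reverted approach: the leaf retains the origin plan."""
    return Plan("LogicalRDD", plan, plan.row_count, plan.constraints)


def checkpoint_keep_summary(plan: Plan) -> Plan:
    """Assumed model of this PR's approach: copy only stats and constraints."""
    return Plan("LogicalRDD", None, plan.row_count, plan.constraints)


def run(checkpoint: Callable[[Plan], Plan], iterations: int) -> Plan:
    """Apply step + checkpoint repeatedly, as an iterative algorithm would."""
    plan = Plan("Scan", None, 1024)
    for _ in range(iterations):
        plan = checkpoint(step(plan))
    return plan


nested = run(checkpoint_keep_plan, 5)
pruned = run(checkpoint_keep_summary, 5)
print(nested.depth(), pruned.depth())
print(nested.row_count == pruned.row_count, nested.constraints == pruned.constraints)
```

   In this toy model both variants end up with the same row count and constraints, but only the second keeps the plan from growing across iterations.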
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Existing and new UTs.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

