[
https://issues.apache.org/jira/browse/SPARK-39834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jungtaek Lim resolved SPARK-39834.
----------------------------------
Fix Version/s: 3.4.0
Resolution: Fixed
Issue resolved by pull request 37248
[https://github.com/apache/spark/pull/37248]
> Include the origin stats and constraints for LogicalRDD if it comes from
> DataFrame
> ----------------------------------------------------------------------------------
>
> Key: SPARK-39834
> URL: https://issues.apache.org/jira/browse/SPARK-39834
> Project: Spark
> Issue Type: Improvement
> Components: SQL, Structured Streaming
> Affects Versions: 3.4.0
> Reporter: Jungtaek Lim
> Assignee: Jungtaek Lim
> Priority: Major
> Fix For: 3.4.0
>
>
> With SPARK-39748, Spark includes the origin logical plan for LogicalRDD if it
> comes from DataFrame, in order to carry over stats and to provide the
> information needed to possibly connect two disconnected logical plans into one.
> After we introduced the change, we found several issues:
> 1. One of the major use cases for DataFrame.checkpoint is ML, especially
> iterative algorithms, where the purpose is to "prune" the logical plan. That
> goes against the purpose of including the origin logical plan, and we risk
> ending up with nested LogicalRDDs that grow the size of the logical plan
> without bound.
> 2. We leverage the logical plan to carry over stats, but the correct stats
> information is in the optimized plan.
> 3. (Not an issue but a missed spot) Constraints are also something we can
> carry over.
> To address the above issues, it would be better to include stats and
> constraints in LogicalRDD rather than the logical plan.
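
The plan-growth concern in the description can be illustrated with a small toy model. This is only a sketch in plain Python, not Spark code: `Plan`, `Stats`, `checkpoint_keep_plan`, and `checkpoint_keep_stats` are hypothetical names made up for illustration, and the stats and constraint values are placeholders rather than anything Spark computes.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class Stats:
    # Placeholder for plan statistics (hypothetical, not Spark's Statistics class).
    size_in_bytes: int


@dataclass(frozen=True)
class Plan:
    # A toy single-child plan node (hypothetical, not Spark's LogicalPlan).
    name: str
    child: Optional["Plan"] = None
    stats: Optional[Stats] = None
    constraints: Tuple[str, ...] = ()


def depth(plan: Optional[Plan]) -> int:
    """Count the nodes reachable by following `child` links."""
    n = 0
    while plan is not None:
        n += 1
        plan = plan.child
    return n


def checkpoint_keep_plan(plan: Plan) -> Plan:
    # Toy version of "keep the origin plan": the checkpointed node still
    # references everything that came before it.
    return Plan("LogicalRDD", child=plan)


def checkpoint_keep_stats(plan: Plan) -> Plan:
    # Toy version of "keep only stats and constraints": the lineage is cut,
    # and only small summary values are carried over. Both values below are
    # stand-ins for what would be read from the optimized plan.
    return Plan(
        "LogicalRDD",
        stats=Stats(size_in_bytes=1024),
        constraints=("id IS NOT NULL",),
    )


def iterate(checkpoint: Callable[[Plan], Plan], steps: int) -> Plan:
    """Mimic an iterative algorithm: transform, then checkpoint, each step."""
    plan = Plan("Scan")
    for _ in range(steps):
        plan = Plan("Project", child=plan)
        plan = checkpoint(plan)
    return plan


if __name__ == "__main__":
    print(depth(iterate(checkpoint_keep_plan, 3)))   # grows with the step count
    print(depth(iterate(checkpoint_keep_stats, 3)))  # stays constant
```

In this toy model, the first variant adds two nodes per iteration, while the second always ends at a single leaf that still carries stats and constraints.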