GitHub user mgaido91 opened a pull request:
https://github.com/apache/spark/pull/21432
[SPARK-24373][SQL] Add AnalysisBarrier to RelationalGroupedDataset's child
## What changes were proposed in this pull request?
When we create a `RelationalGroupedDataset`, we set its child to the
`logicalPlan` of the `DataFrame` we need to aggregate. Since that `logicalPlan`
is already analyzed, it should not be analyzed again. However, it is
re-analyzed when the new aggregate plan is analyzed.
In most cases the current behavior is likely harmless, but in other cases
re-analyzing an analyzed plan can change it, since analysis is not
idempotent. This can cause issues like the one described in the JIRA
(failing to find a cached plan).
This PR wraps the `logicalPlan` that is used as the child of
`RelationalGroupedDataset` in an `AnalysisBarrier`.
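
To illustrate why a barrier helps, here is a minimal toy model in Python. It
is not Spark's Catalyst code: the classes `Relation`, `Aggregate`, `Barrier`
and the functions `analyze` and `strip_barriers` are hypothetical stand-ins,
and the "version bump" is just an assumed way to make the toy analyzer
non-idempotent. The cache is modeled as a dictionary keyed by the analyzed
plan.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    name: str
    version: int = 0  # bumped on every analysis pass (toy non-idempotence)


@dataclass(frozen=True)
class Barrier:
    child: object  # already-analyzed subtree the analyzer must not enter


@dataclass(frozen=True)
class Aggregate:
    child: object


def analyze(plan):
    """Toy analyzer: rewrites every Relation it visits, skips Barriers."""
    if isinstance(plan, Barrier):
        return plan  # do not descend into an analyzed subtree
    if isinstance(plan, Relation):
        return Relation(plan.name, plan.version + 1)
    if isinstance(plan, Aggregate):
        return Aggregate(analyze(plan.child))
    raise TypeError(f"unexpected node: {plan!r}")


def strip_barriers(plan):
    """Remove Barrier nodes once analysis is finished."""
    if isinstance(plan, Barrier):
        return strip_barriers(plan.child)
    if isinstance(plan, Aggregate):
        return Aggregate(strip_barriers(plan.child))
    return plan


# The DataFrame's plan is analyzed once and then cached under that plan.
analyzed_df = analyze(Relation("t"))
cache = {analyzed_df: "cached data"}

# Without a barrier the child is analyzed a second time and changes.
without_barrier = analyze(Aggregate(analyzed_df))

# With a barrier the child is left untouched.
with_barrier = strip_barriers(analyze(Aggregate(Barrier(analyzed_df))))

print(without_barrier.child in cache)  # False
print(with_barrier.child in cache)     # True
```

In this sketch the lookup fails only in the unprotected case, which mirrors
the "cached plan not found" symptom described above.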
## How was this patch tested?
Added a unit test.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/mgaido91/spark SPARK-24373
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/21432.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #21432
----
commit 361fee8082b3401128ea13be82e878a987bc9b61
Author: Marco Gaido <marcogaido91@...>
Date: 2018-05-25T13:59:49Z
[SPARK-24373][SQL] Add AnalysisBarrier to RelationalGroupedDataset's child
----