GitHub user mengxr opened a pull request:
https://github.com/apache/spark/pull/15567
[SPARK-14393][SQL] values generated by non-deterministic functions
shouldn't change after coalesce or union
## What changes were proposed in this pull request?
When a user appended a column using a "nondeterministic" function to a
DataFrame, e.g., `rand`, `randn`, and `monotonically_increasing_id`, the
expected semantic is the following:
* The value for each row should remain unchanged, as if we materialize the
values immediately, regardless of later DataFrame operations.
However, since we use `TaskContext.getPartitionId` to get the partition
index from the current thread, the values from nondeterministic columns might
change if we call `union` or `coalesce` after. `TaskContext.getPartitionId`
returns the partition index of the current Spark task, which might not be the
corresponding partition index of the DataFrame where we defined the column.
See the unit tests below or JIRA for examples.
This PR uses the partition index from `RDD.mapPartitionWithIndex` instead
of `TaskContext` and fixes the partition initialization logic in whole-stage
codegen, normal codegen, and codegen fallback.
## How was this patch tested?
Unit tests. (Actually I'm not very confident that this PR fixed all issues
without introducing new ones ...)
cc: @rxin
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/mengxr/spark SPARK-14393
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/15567.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #15567
----
commit f3b9b1017a9b93dbf07ccbfee3f140eab51ae1ea
Author: Xiangrui Meng <[email protected]>
Date: 2016-10-20T04:30:02Z
add test cases
commit 1a668586e2fd0c53c7c17d388eba256d1601339e
Author: Xiangrui Meng <[email protected]>
Date: 2016-10-20T04:31:38Z
fix WholeStageCodegen
commit 06a39e11f7a47a3e027e20043ef4ec524e1468a8
Author: Xiangrui Meng <[email protected]>
Date: 2016-10-20T05:09:15Z
add initializeStateForPartition to Projection
commit 7840c95f556a46cf5934e66c3f8d99b053227577
Author: Xiangrui Meng <[email protected]>
Date: 2016-10-20T05:15:01Z
add RDD.mapPartitionsWithIndexInternal
commit ccd2fe70a7e0d3f7f9a1bf9ce54b1a00a544d5cc
Author: Xiangrui Meng <[email protected]>
Date: 2016-10-20T05:39:20Z
fix issue without whole stage codegen
commit 1ca355e456da868eb2411c851b1e5b88779fd854
Author: Xiangrui Meng <[email protected]>
Date: 2016-10-20T06:26:13Z
Nondeterministic.setInitialValues =>
initializeStatesForPartition(partitionIndex)
commit 9478fd6630c5ab31b9361c387292e739c4e78cbd
Author: Xiangrui Meng <[email protected]>
Date: 2016-10-20T07:28:31Z
test all code paths
commit bc4ea2c355e324e247028ab26e17a258948bddbd
Author: Xiangrui Meng <[email protected]>
Date: 2016-10-20T08:56:31Z
fixed codegen fallback case
commit 7ffe0ed88fc6e2bbd68b9fa807dacafbe80ed05b
Author: Xiangrui Meng <[email protected]>
Date: 2016-10-20T09:13:07Z
also initialize predicate
commit 38dcb7aa81e94b0db81a9cc1be24e8e4524562a1
Author: Xiangrui Meng <[email protected]>
Date: 2016-10-20T09:21:00Z
fix other nondeterministic functions
commit da9d2619ebf69e8573a6352d39455883f882f86b
Author: Xiangrui Meng <[email protected]>
Date: 2016-10-20T09:26:16Z
test all nondeterministic functions
commit 2ec32063a2fee26a42c74026eaf287189a4ec0ac
Author: Xiangrui Meng <[email protected]>
Date: 2016-10-20T09:41:52Z
fix partition initialization in local relation
commit 4dc025593fb06e501b148c2625c0b2e5d72655bc
Author: Xiangrui Meng <[email protected]>
Date: 2016-10-20T09:53:48Z
minor
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]