GitHub user mengxr opened a pull request:

    https://github.com/apache/spark/pull/15567

    [SPARK-14393][SQL] values generated by non-deterministic functions 
shouldn't change after coalesce or union

    ## What changes were proposed in this pull request?
    
    When a user appends a column to a DataFrame using a nondeterministic 
function, e.g., `rand`, `randn`, or `monotonically_increasing_id`, the 
expected semantics are as follows:
    
    * The value for each row should remain unchanged, as if we had 
materialized the values immediately, regardless of later DataFrame operations.
    
    However, since we use `TaskContext.getPartitionId` to get the partition 
index from the current thread, the values in nondeterministic columns might 
change if we call `union` or `coalesce` afterwards. `TaskContext.getPartitionId` 
returns the partition index of the current Spark task, which might not be the 
corresponding partition index of the DataFrame where we defined the column.
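
To illustrate the mismatch, here is a minimal plain-Python model of the `union` case. It does not use Spark; `monotonic_id`, `ids_for`, and the sample data are hypothetical names introduced for this sketch. The only Spark fact it relies on is the documented ID layout of `monotonically_increasing_id` (partition index in the upper 31 bits, record number within the partition in the lower 33 bits). It assumes that, after a union with a 3-partition DataFrame, the tasks for the second DataFrame's partitions run with indices shifted by 3.

```python
# Plain-Python model of the problem; no Spark involved, all names hypothetical.

def monotonic_id(partition_index, row_offset):
    # Documented layout of monotonically_increasing_id: partition index in
    # the upper 31 bits, per-partition record number in the lower 33 bits.
    return (partition_index << 33) + row_offset

def ids_for(partitions, index_of):
    """Assign an ID to every row; index_of maps a partition's position
    in its own DataFrame to the index actually used for the ID."""
    return {
        row: monotonic_id(index_of(pos), offset)
        for pos, part in enumerate(partitions)
        for offset, row in enumerate(part)
    }

# A DataFrame with two partitions, modeled as a list of lists.
df_b = [["b0", "b1"], ["b2"]]

# Evaluated on its own: task partition index == DataFrame partition index.
alone = ids_for(df_b, lambda pos: pos)

# Evaluated after other.union(df_b), where `other` has 3 partitions
# (assumption): the tasks for df_b's partitions now have indices 3 and 4.
after_union = ids_for(df_b, lambda pos: pos + 3)

for row in alone:
    print(row, alone[row], after_union[row])
```

In this model every row of `df_b` receives a different ID once the task index is used instead of the index of the partition where the column was defined.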
    
    See the unit tests below or JIRA for examples.
    
    This PR uses the partition index from `RDD.mapPartitionsWithIndex` instead 
of `TaskContext` and fixes the partition initialization logic in whole-stage 
codegen, normal codegen, and codegen fallback.
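
The idea can be sketched in plain Python for the `coalesce` case. This is a simulation under stated assumptions, not the code in this PR: `rand_column`, `map_partitions_with_index`, and `coalesce_to_one` are hypothetical stand-ins, and the model assumes a `rand`-like column draws from one generator per partition seeded with `seed + partitionIndex`.

```python
import random

# Plain-Python model of the fix; no Spark involved, all names hypothetical.

def rand_column(rows, seed, partition_index):
    # Assumed model of a rand-like column: one generator per partition,
    # seeded with seed + partition index.
    rng = random.Random(seed + partition_index)
    return [(row, rng.random()) for row in rows]

def map_partitions_with_index(partitions, f):
    # Stand-in for RDD.mapPartitionsWithIndex: f receives the index of the
    # partition in the dataset where the column is defined.
    return [f(i, part) for i, part in enumerate(partitions)]

def coalesce_to_one(partitions):
    # Stand-in for coalesce(1): a single task processes all rows.
    return [[x for part in partitions for x in part]]

partitions = [["a", "b"], ["c", "d"]]
seed = 42

# Values as defined on the original two partitions.
defined = map_partitions_with_index(
    partitions, lambda i, part: rand_column(part, seed, i))
baseline = [pair for part in defined for pair in part]

# Fixed behavior: state is initialized with the defining partition's index,
# so coalescing afterwards leaves the values untouched.
fixed = coalesce_to_one(defined)[0]

# Old behavior: the single coalesced task uses its own index (0) for all rows.
buggy = rand_column(coalesce_to_one(partitions)[0], seed, 0)

print(fixed == baseline)  # values survive coalesce
print(buggy == baseline)  # values for rows "c" and "d" differ
```

In this model, rows from the first partition keep their values either way, while rows from the second partition change under the old behavior because they are drawn from the generator seeded for index 0.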
    
    ## How was this patch tested?
    
    Unit tests. (Actually I'm not very confident that this PR fixed all issues 
without introducing new ones ...)
    
    cc: @rxin 

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mengxr/spark SPARK-14393

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/15567.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #15567
    
----
commit f3b9b1017a9b93dbf07ccbfee3f140eab51ae1ea
Author: Xiangrui Meng <[email protected]>
Date:   2016-10-20T04:30:02Z

    add test cases

commit 1a668586e2fd0c53c7c17d388eba256d1601339e
Author: Xiangrui Meng <[email protected]>
Date:   2016-10-20T04:31:38Z

    fix WholeStageCodegen

commit 06a39e11f7a47a3e027e20043ef4ec524e1468a8
Author: Xiangrui Meng <[email protected]>
Date:   2016-10-20T05:09:15Z

    add initializeStateForPartition to Projection

commit 7840c95f556a46cf5934e66c3f8d99b053227577
Author: Xiangrui Meng <[email protected]>
Date:   2016-10-20T05:15:01Z

    add RDD.mapPartitionsWithIndexInternal

commit ccd2fe70a7e0d3f7f9a1bf9ce54b1a00a544d5cc
Author: Xiangrui Meng <[email protected]>
Date:   2016-10-20T05:39:20Z

    fix issue without whole stage codegen

commit 1ca355e456da868eb2411c851b1e5b88779fd854
Author: Xiangrui Meng <[email protected]>
Date:   2016-10-20T06:26:13Z

    Nondeterministic.setInitialValues => 
initializeStatesForPartition(partitionIndex)

commit 9478fd6630c5ab31b9361c387292e739c4e78cbd
Author: Xiangrui Meng <[email protected]>
Date:   2016-10-20T07:28:31Z

    test all code paths

commit bc4ea2c355e324e247028ab26e17a258948bddbd
Author: Xiangrui Meng <[email protected]>
Date:   2016-10-20T08:56:31Z

    fixed codegen fallback case

commit 7ffe0ed88fc6e2bbd68b9fa807dacafbe80ed05b
Author: Xiangrui Meng <[email protected]>
Date:   2016-10-20T09:13:07Z

    also initialize predicate

commit 38dcb7aa81e94b0db81a9cc1be24e8e4524562a1
Author: Xiangrui Meng <[email protected]>
Date:   2016-10-20T09:21:00Z

    fix other nondeterministic functions

commit da9d2619ebf69e8573a6352d39455883f882f86b
Author: Xiangrui Meng <[email protected]>
Date:   2016-10-20T09:26:16Z

    test all nondeterministic functions

commit 2ec32063a2fee26a42c74026eaf287189a4ec0ac
Author: Xiangrui Meng <[email protected]>
Date:   2016-10-20T09:41:52Z

    fix partition initialization in local relation

commit 4dc025593fb06e501b148c2625c0b2e5d72655bc
Author: Xiangrui Meng <[email protected]>
Date:   2016-10-20T09:53:48Z

    minor

----

