[jira] [Commented] (SPARK-25870) RandomSplit with seed gives different results depending on column order

Daniel (JIRA) Mon, 29 Oct 2018 11:08:13 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-25870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16667548#comment-16667548
 ]


Daniel commented on SPARK-25870:
--------------------------------

Thanks for your message. However, I would think that if you are using the same 
source dataframe and doing some simple per-row transformations, without even 
changing the content, the order shouldn't change. In my particular use case, I 
am running tests on my students' code, and it shouldn't matter how they order 
the columns in that code.
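
To illustrate why column order can matter, here is a minimal pure-Python 
sketch. It is a toy model, not Spark's implementation: it assumes the split 
first sorts the rows by all columns in their given order and then keeps each 
row using a seeded random generator. The names seeded_split and 
order_independent_split are hypothetical helpers invented for this example.

```python
import random


def seeded_split(rows, columns, weight, seed):
    """Toy model of a seeded split (an assumption, not Spark's code):
    sort rows by all columns in the given order, then keep each row
    with probability `weight` using a generator seeded with `seed`."""
    ordered = sorted(rows, key=lambda row: tuple(row[c] for c in columns))
    rng = random.Random(seed)
    return [row for row in ordered if rng.random() < weight]


def order_independent_split(rows, columns, weight, seed):
    """Hypothetical workaround: put the columns in a canonical
    (alphabetical) order before splitting, so the caller's column
    order no longer influences the sort."""
    return seeded_split(rows, sorted(columns), weight, seed)


# Ten rows with an 'id' column and a second column 'r' whose ordering
# differs from the ordering of 'id'.
rows = [{"id": i, "r": (7 * i + 3) % 10} for i in range(10)]

# Same data, same seed, different column order: the rows are visited in
# a different sequence, so the same random draws land on different rows.
sample_1 = seeded_split(rows, ["id", "r"], 0.8, seed=0)
sample_2 = seeded_split(rows, ["r", "id"], 0.8, seed=0)
print(sorted(row["id"] for row in sample_1))
print(sorted(row["id"] for row in sample_2))

# With a canonical column order, both calls return the same rows.
fixed_1 = order_independent_split(rows, ["id", "r"], 0.8, seed=0)
fixed_2 = order_independent_split(rows, ["r", "id"], 0.8, seed=0)
print(fixed_1 == fixed_2)
```

Under the same assumption, the corresponding PySpark workaround would be to 
call df.select(sorted(df.columns)) before randomSplit. I have not verified 
that this is sufficient on 2.3.2, so treat it as a suggestion to test rather 
than a confirmed fix.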

> RandomSplit with seed gives different results depending on column order
> -----------------------------------------------------------------------
>
>                 Key: SPARK-25870
>                 URL: https://issues.apache.org/jira/browse/SPARK-25870
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.2
>            Reporter: Daniel
>            Priority: Minor
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> Co-discovered by Zhihui Hong ([email protected]):
> If you run the following example, the resulting dataframes will have 
> different rows even though they have the same seed:
> from pyspark.sql import SparkSession, functions as fn
> spark = SparkSession.builder.getOrCreate()
> df = spark.range(0, 10).withColumn('r', (fn.rand()*10).cast('int'))
> # sample 1
> df.randomSplit([0.8, 0.2], seed=0)[0].show(5)
> # sample 2
> df.select('r', 'id').randomSplit([0.8, 0.2], seed=0)[0].show(5)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
