[ https://issues.apache.org/jira/browse/SPARK-25870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16667548#comment-16667548 ]
Daniel commented on SPARK-25870:
--------------------------------
Thanks for your message. However, I would expect that if you start from the
same source dataframe and apply only simple per-row transformations, without
even changing the content, the row order (and therefore the result of a seeded
split) shouldn't change. In my particular use case, I am running tests on code
that my students write, and it shouldn't matter how those students order the
columns in their code.
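
To illustrate why column order can matter, here is a minimal pure-Python sketch.
It assumes that randomSplit sorts the rows of each partition by their column
values, in column order, before drawing seeded random numbers row by row. This
is a toy model, not Spark's implementation; seeded_split and canonical_split
are hypothetical helper names and the data is made up.

```python
import random

def seeded_split(rows, fraction, seed):
    """Toy model: sort the rows, then keep each row whose seeded draw is < fraction."""
    rng = random.Random(seed)
    return [row for row in sorted(rows) if rng.random() < fraction]

def canonical_split(records, fraction, seed):
    """Arrange every record's fields by column name before splitting."""
    rows = [tuple(rec[name] for name in sorted(rec)) for rec in records]
    return seeded_split(rows, fraction, seed)

# Made-up data: 'id' ascending, 'r' an arbitrary permutation of 0..9.
data = [{"id": i, "r": (7 * i + 3) % 10} for i in range(10)]

# Same rows, two column orders: the tuples sort differently, so the same
# sequence of random draws is paired with different rows.
id_first = [(d["id"], d["r"]) for d in data]
r_first = [(d["r"], d["id"]) for d in data]
ids_a = sorted(row[0] for row in seeded_split(id_first, 0.8, seed=0))
ids_b = sorted(row[1] for row in seeded_split(r_first, 0.8, seed=0))
print(ids_a)
print(ids_b)  # need not match ids_a

# Fixing the column order first makes the split independent of how the
# caller happened to arrange the columns.
reordered = [{"r": d["r"], "id": d["id"]} for d in data]
same_a = canonical_split(data, 0.8, seed=0)
same_b = canonical_split(reordered, 0.8, seed=0)
print(same_a == same_b)
```

In PySpark, the analogous workaround would presumably be to select the columns
in a fixed order, for example df.select(*sorted(df.columns)), before calling
randomSplit. I have not verified this against 2.3.2.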
> RandomSplit with seed gives different results depending on column order
> -----------------------------------------------------------------------
>
> Key: SPARK-25870
> URL: https://issues.apache.org/jira/browse/SPARK-25870
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.2
> Reporter: Daniel
> Priority: Minor
> Original Estimate: 96h
> Remaining Estimate: 96h
>
> Co-discovered by Zhihui Hong ([email protected]):
> {{If you run the following example, the resulting dataframes will have
> different rows even though they use the same seed:}}
> {{from pyspark.sql import SparkSession, functions as fn}}
> {{spark = SparkSession.builder.getOrCreate()}}
> {{df = spark.range(0, 10).withColumn('r', (fn.rand()*10).cast('int'))}}
> {{# sample 1}}
> {{df.randomSplit([0.8, 0.2], seed=0)[0].show(5)}}
> {{# sample 2}}
> {{df.select('r', 'id').randomSplit([0.8, 0.2], seed=0)[0].show(5)}}