Joseph K. Bradley created SPARK-16686:
-----------------------------------------
Summary: Dataset.sample with seed: result seems to depend on downstream usage
Key: SPARK-16686
URL: https://issues.apache.org/jira/browse/SPARK-16686
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.0.0
Environment: Spark 2.0 - RC4, standalone, single-worker cluster
Reporter: Joseph K. Bradley
Steps to reproduce the bug:
* Create a DataFrame DF, and sample it with a fixed seed.
* Collect the sampled DataFrame -> result1
* Call a particular UDF on the same sampled DataFrame -> result2
You would expect result1 and result2 to be built from the same rows of DF,
but they appear not to be.
Note: result1 and result2 are each deterministic, so the mismatch is between
the two results rather than between repeated runs of either one.
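
The expectation above can be stated as a small check. The following is a plain-Python sketch, not Spark code and not a reproduction of the bug: `random.Random` stands in for `Dataset.sample`, and `sample_rows`, `hypothetical_udf`, the seed, and the fraction are all assumptions made for illustration (the actual UDF is only in the attached notebook). It shows the invariant the report says is violated: a sample taken with a fixed seed should contain the same rows whether it is collected directly or passed through a UDF first.

```python
import random


def sample_rows(rows, fraction, seed):
    """Bernoulli-sample rows with a fixed seed (stand-in for Dataset.sample)."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


def hypothetical_udf(value):
    """Placeholder for the 'particular UDF' in the report; its real body is not given."""
    return value * 2


df = list(range(100))  # stand-in for DataFrame DF
sampled = sample_rows(df, fraction=0.1, seed=42)

# Step 2: "collect" the sample.
result1 = list(sampled)

# Step 3: apply the UDF, keeping the input row next to the UDF output.
result2 = [(row, hypothetical_udf(row)) for row in sampled]

# The invariant: both results should be derived from the same sampled rows.
rows_in_result2 = [row for row, _ in result2]
same_rows = rows_in_result2 == result1
print("same rows:", same_rows)
```

In this model the check holds by construction, because the sample is computed once and reused. The report describes a case where the equivalent comparison on Spark 2.0 RC4 appears to fail.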
See the attached notebook for details. Cells in the notebook were executed in
order.