I ran into an issue where I get unstable results after sampling a dataframe that has had distinct called on it. The following code should reproduce it: each of the three print calls can give a different answer, even though the sample uses a fixed seed.

from pyspark.sql import functions as F

# sc and sqlContext are assumed to be the ones the pyspark shell provides
d = sqlContext.createDataFrame(sc.parallelize([[x] for x in range(100000)]), ['t'])
# withReplacement=False, fraction=0.01, seed=478
sampled = d.distinct().sample(False, 0.01, 478)

# Each collect() re-evaluates sampled and can print a different minimum
print(sampled.select(F.min('t').alias('t')).collect())
print(sampled.select(F.min('t').alias('t')).collect())
print(sampled.select(F.min('t').alias('t')).collect())

Removing distinct fixes the problem, as does caching after sampling or using a smaller dataframe. The Spark bug reporting docs dissuaded me from creating a JIRA issue without first checking with this mailing list that this is reproducible. I'm not familiar enough with the Spark code to fix this myself :\
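
For anyone trying to reason about this, here is a small pure-Python model of one possible explanation. It is only a guess and is not confirmed anywhere in this thread: a seeded sampler decides which positions to keep, so if the step before it (distinct here) hands over the same rows in a different order on each evaluation, the same seed can keep different values. The function bernoulli_sample and the reversed list are hypothetical stand-ins, not Spark's implementation.

```python
import random


def bernoulli_sample(rows, fraction, seed):
    """Keep each row with probability `fraction`, using a seeded generator.

    Hypothetical stand-in for a seeded sampler; not Spark's actual code.
    """
    rng = random.Random(seed)
    # One random draw per row, in arrival order, so the seed fixes which
    # positions are kept, not which values.
    return [row for row in rows if rng.random() < fraction]


rows = list(range(1000))
reordered = list(reversed(rows))  # same rows, different arrival order

first = bernoulli_sample(rows, 0.01, 478)
again = bernoulli_sample(rows, 0.01, 478)
other = bernoulli_sample(reordered, 0.01, 478)

# Same order and same seed: the sample is reproducible.
print(first == again)

# Same rows in a different order: the same positions are kept,
# so the values that end up in the sample can differ.
print(first)
print(other)

# In this model, "caching" just means keeping the computed sample and
# reusing it, so every later aggregate sees the same rows.
cached = first
print(min(cached, default=None), min(cached, default=None))
```

If this guess is right, it would be consistent with caching after sampling making the results stable, since the sample would then be computed once and reused instead of being re-evaluated for each collect().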