I ran into an issue where I get unstable results after sampling a
dataframe that has had distinct called on it. Even though the sample seed
is fixed, the following code should print a different answer on each of
the three calls.

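from pyspark.sql import functions as F

# assumes sc and sqlContext as provided by the pyspark shell
d = sqlContext.createDataFrame(sc.parallelize([[x] for x in range(100000)]), ['t'])
# distinct, then sample without replacement (fraction 0.01, seed 478)
sampled = d.distinct().sample(False, 0.01, 478)
# the same query, run three times
print(sampled.select(F.min('t').alias('t')).collect())
print(sampled.select(F.min('t').alias('t')).collect())
print(sampled.select(F.min('t').alias('t')).collect())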

Removing distinct, or caching after sampling, fixes the problem (as does
using a smaller dataframe). The Spark bug reporting docs dissuaded me from
creating a JIRA issue before checking with this mailing list that this is
reproducible.

I'm not familiar enough with the Spark code to fix this myself :\



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/DataFrame-Distinct-Sample-Bug-tp20439.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
