I get the same result every time on Spark 2.1:
Using Python version 2.7.12 (default, Jul  2 2016 17:43:17)
SparkSession available as 'spark'.
>>> from pyspark.sql import functions as F
>>>
>>> d = sqlContext.createDataFrame(sc.parallelize([[x] for x in range(100000)]),
...                                ['t'])
>>> sampled = d.distinct().sample(False, 0.01, 478)
>>> print sampled.select(F.min('t').alias('t')).collect()
[Row(t=4)]
>>> print sampled.select(F.min('t').alias('t')).collect()
[Row(t=4)]
>>> print sampled.select(F.min('t').alias('t')).collect()
[Row(t=4)]

On Wed, Jan 4, 2017 at 8:15 AM, dstuck <david.e.st...@gmail.com> wrote:
> I ran into an issue where I'm getting unstable results after sampling a
> dataframe that has had the distinct function called on it. The following
> code should print a different answer each time:
>
>     from pyspark.sql import functions as F
>     d = sqlContext.createDataFrame(sc.parallelize([[x] for x in range(100000)]),
>                                    ['t'])
>     sampled = d.distinct().sample(False, 0.01, 478)
>     print sampled.select(F.min('t').alias('t')).collect()
>     print sampled.select(F.min('t').alias('t')).collect()
>     print sampled.select(F.min('t').alias('t')).collect()
>
> Removing distinct, or caching after sampling, fixes the problem (as does
> using a smaller dataframe). The Spark bug reporting docs dissuaded me from
> creating a JIRA issue without first checking with this mailing list that
> this is reproducible.
>
> I'm not familiar enough with the Spark code to fix this myself.
>
> --
> View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/DataFrame-Distinct-Sample-Bug-tp20439.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
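For what it's worth, one plausible mechanism for this kind of instability (an assumption on my part, not something I've confirmed against the Spark source) is that a seeded Bernoulli sample keeps or drops each row based on successive draws from the RNG, so the outcome depends on the order rows arrive; if the shuffle feeding distinct() doesn't produce rows in a deterministic order across re-evaluations, the same seed can select different rows. A pure-Python sketch of that order sensitivity, using a hypothetical sample_with_seed helper (names are illustrative, not Spark APIs):

```python
import random

def sample_with_seed(rows, fraction, seed):
    # Bernoulli sampling: keep each row when the next draw from a
    # seeded RNG falls below `fraction`. The kept *positions* are
    # fixed by the seed, so the result depends on row order.
    rng = random.Random(seed)
    return [r for r in rows if rng.random() < fraction]

rows = list(range(100))
reordered = rows[::-1]  # same rows, different arrival order

a = sample_with_seed(rows, 0.1, 478)
b = sample_with_seed(reordered, 0.1, 478)

# Re-running with the SAME order is deterministic...
assert a == sample_with_seed(rows, 0.1, 478)
# ...but `a` and `b` will generally differ, even though the seed,
# fraction, and underlying set of rows are identical.
```

If that's what is happening here, it would also explain why caching after sample() hides the problem: the sampled rows are materialized once, so later actions no longer re-run the shuffle and re-draw the sample.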