I get the same result every time on Spark 2.1:
Using Python version 2.7.12 (default, Jul  2 2016 17:43:17)
SparkSession available as 'spark'.
>>> from pyspark.sql import functions as F
>>>
>>> d = sqlContext.createDataFrame(sc.parallelize([[x] for x in range(100000)]),
...                                ['t'])
>>> sampled = d.distinct().sample(False, 0.01, 478)
>>> print sampled.select(F.min('t').alias('t')).collect()
[Row(t=4)]
>>> print sampled.select(F.min('t').alias('t')).collect()
[Row(t=4)]
>>> print sampled.select(F.min('t').alias('t')).collect()
[Row(t=4)]

On Wed, Jan 4, 2017 at 8:15 AM, dstuck <david.e.st...@gmail.com> wrote:
> I ran into an issue where I'm getting unstable results after sampling a
> dataframe that has had the distinct function called on it. The following
> code should print a different answer each time:
>
>     from pyspark.sql import functions as F
>     d = sqlContext.createDataFrame(sc.parallelize([[x] for x in range(100000)]),
>                                    ['t'])
>     sampled = d.distinct().sample(False, 0.01, 478)
>     print sampled.select(F.min('t').alias('t')).collect()
>     print sampled.select(F.min('t').alias('t')).collect()
>     print sampled.select(F.min('t').alias('t')).collect()
>
> Removing distinct, or caching after sampling, fixes the problem (as does
> using a smaller dataframe). The Spark bug reporting docs dissuaded me from
> creating a JIRA issue without first checking with this mailing list that
> this is reproducible.
>
> I'm not familiar enough with the Spark code to fix this myself.
>
> --
> View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/DataFrame-Distinct-Sample-Bug-tp20439.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
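For what it's worth, one plausible mechanism for this kind of instability (an assumption on my part, not something I've confirmed against the Spark source) is that a seeded Bernoulli sample keeps or drops each row based on successive draws from the RNG, so the outcome depends on the order rows arrive; if the shuffle feeding distinct() doesn't produce rows in a deterministic order across re-evaluations, the same seed can select different rows. A pure-Python sketch of that order sensitivity, using a hypothetical sample_with_seed helper (names are illustrative, not Spark APIs):

```python
import random

def sample_with_seed(rows, fraction, seed):
    # Bernoulli sampling: keep each row when the next draw from a
    # seeded RNG falls below `fraction`. The kept *positions* are
    # fixed by the seed, so the result depends on row order.
    rng = random.Random(seed)
    return [r for r in rows if rng.random() < fraction]

rows = list(range(100))
reordered = rows[::-1]  # same rows, different arrival order

a = sample_with_seed(rows, 0.1, 478)
b = sample_with_seed(reordered, 0.1, 478)

# Re-running with the SAME order is deterministic...
assert a == sample_with_seed(rows, 0.1, 478)
# ...but `a` and `b` will generally differ, even though the seed,
# fraction, and underlying set of rows are identical.
```

If that's what is happening here, it would also explain why caching after sample() hides the problem: the sampled rows are materialized once, so later actions no longer re-run the shuffle and re-draw the sample.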