[
https://issues.apache.org/jira/browse/SPARK-11008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14991854#comment-14991854
]
Nic Eggert commented on SPARK-11008:
------------------------------------
I was having a similar problem in 1.5.1 that appears to be fixed in 1.5.2-rc1,
which includes the Oct 13 patch.
> Spark window function returns inconsistent/wrong results
> --------------------------------------------------------
>
> Key: SPARK-11008
> URL: https://issues.apache.org/jira/browse/SPARK-11008
> Project: Spark
> Issue Type: Bug
> Components: Spark Core, SQL
> Affects Versions: 1.4.0, 1.5.0
> Environment: Amazon Linux AMI (Amazon Linux version 2015.09)
> Reporter: Prasad Chalasani
> Priority: Minor
>
> Summary: applying a window function to a DataFrame and then calling count()
> gives widely varying results across repeated runs: no result exceeds the
> correct value, but all but one are wrong. On large data sets I sometimes get
> as little as HALF of the correct value.
> A minimal reproducible example is here:
> (1) start spark-shell
> (2) run these:
> val data = 1.to(100).map(x => (x,1))
> import sqlContext.implicits._
> val tbl = sc.parallelize(data).toDF("id", "time")
> tbl.write.parquet("s3n://path/to/mybucket/id-time-tiny.pqt")
> (3) exit the shell (this is important)
> (4) start spark-shell again
> (5) run these:
> import org.apache.spark.sql.expressions.Window
> val df = sqlContext.read.parquet("s3n://path/to/mybucket/id-time-tiny.pqt")
> val win = Window.partitionBy("id").orderBy("time")
> df.select($"id", rank().over(win).alias("rnk")).filter("rnk=1").select("id").count()
> I get 98, but the correct result is 100.
> If I re-run the code in step 5 in the same shell, then the result gets
> "fixed" and I always get 100.
> Note that this is only a minimal example to reproduce the error. In my real
> application the data is much larger and the window function is not trivial as
> above (i.e. there are multiple "time" values per "id", etc.), and I sometimes
> see results as small as HALF of the correct value (e.g. 120,000 when the
> correct value is 250,000). So this is a serious problem.
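For reference, the query in step (5) has an unambiguous expected answer, which can be checked with a plain-Scala sketch (no Spark) of the same semantics: rank rows within each "id" partition by "time", keep rank 1, and count. The helper logic below is illustrative, not Spark code; with one row per id and no ties, row position within the sorted partition equals rank.

```scala
// Toy data matching the repro: (id, time), one row per id for ids 1..100.
val data = 1.to(100).map(x => (x, 1))

// Emulate rank() over (partition by id order by time).
// zipWithIndex yields row position; with no ties this equals rank.
val ranked = data
  .groupBy(_._1)                     // partitionBy("id")
  .toSeq
  .flatMap { case (_, rows) =>
    rows
      .sortBy(_._2)                  // orderBy("time")
      .zipWithIndex
      .map { case ((id, time), i) => (id, time, i + 1) } // (id, time, rnk)
  }

// filter("rnk=1").select("id").count()
val count = ranked.count(_._3 == 1)
println(count)                       // prints 100
```

Every id contributes exactly one rank-1 row, so any run of the Spark query that returns fewer than 100 has dropped rows, consistent with the bug described above.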
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]