[
https://issues.apache.org/jira/browse/SPARK-15070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15269468#comment-15269468
]
Eric Wasserman commented on SPARK-15070:
----------------------------------------
I tried the same example (slightly modified to accommodate the 2.0 API change
from groupBy to groupByKey) against the current 2.0 branch, and the bug does
not reproduce. The fix for SPARK-12555 went into the 2.0.0 branch, so it seems
this bug may share the same root cause as SPARK-12555.
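For reference, this is roughly what the modified repro looks like against the 2.0 API (a sketch, assuming the SparkSession entry point; Dataset.groupBy(func) was renamed to groupByKey in 2.0, and the object/class names here are mine):

```scala
package bug

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

case class BugRecord(m: String, elapsed_time: java.lang.Double)

object Bug20 {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("BugTest")
      .getOrCreate()
    import spark.implicits._

    val logs = spark.read.json("bug-data.json").as[BugRecord]

    // 1.6's Dataset.groupBy(r => "FOO") becomes groupByKey(r => "FOO") in 2.0
    logs.groupByKey(r => "FOO")
      .agg(avg($"elapsed_time").as[Double])
      .show(20, truncate = false)

    spark.stop()
  }
}
```

With this version against the current 2.0 branch, the key column shows "FOO" as expected rather than the corrupted "POSTabc".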
> Data corruption when using Dataset.groupBy[K : Encoder](func: T => K) when
> data loaded from JSON file.
> ------------------------------------------------------------------------------------------------------
>
> Key: SPARK-15070
> URL: https://issues.apache.org/jira/browse/SPARK-15070
> Project: Spark
> Issue Type: Bug
> Components: Input/Output, SQL
> Affects Versions: 1.6.1
> Environment: produced on Mac OS X 10.11.4 in local mode
> Reporter: Eric Wasserman
>
> full running case at: https://github.com/ewasserman/spark-bug.git
> Bug.scala
> ==========
> package bug
>
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.SQLContext
> import org.apache.spark.{SparkContext, SparkConf}
>
> case class BugRecord(m: String, elapsed_time: java.lang.Double)
>
> object Bug {
>   def main(args: Array[String]): Unit = {
>     val c = new SparkConf().setMaster("local[2]").setAppName("BugTest")
>     val sc = new SparkContext(c)
>     val sqlc = new SQLContext(sc)
>     import sqlc.implicits._
>
>     val logs = sqlc.read.json("bug-data.json").as[BugRecord]
>     logs.groupBy(r => "FOO").agg(avg($"elapsed_time").as[Double]).show(20, truncate = false)
>
>     sc.stop()
>   }
> }
> bug-data.json
> ==========
> {"m":"POST","elapsed_time":0.123456789012345678,"source_time":"abcdefghijk"}
> -----------------
> Expected Output:
> +-----------+-------------------+
> |_1 |_2 |
> +-----------+-------------------+
> |FOO |0.12345678901234568|
> +-----------+-------------------+
> Observed Output:
> +-----------+-------------------+
> |_1 |_2 |
> +-----------+-------------------+
> |POSTabc    |0.12345726584950388|
> +-----------+-------------------+
> The grouping key has been corrupted (it is *not* the product of the groupBy
> function) and is a combination of bytes from the actual key column and an
> extra attribute in the JSON not present in the case class. The aggregated
> value is also corrupted.
> NOTE:
> The problem does not manifest when using an alternate form of groupBy:
> logs.groupBy($"m").agg(avg($"elapsed_time").as[Double])
> The corrupted-key problem does not manifest when there is no additional
> field in the JSON, i.e. if the data file is:
> {"m":"POST","elapsed_time":0.123456789012345678}