[ https://issues.apache.org/jira/browse/SPARK-47061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18013641#comment-18013641 ]
Gaël Jourdan commented on SPARK-47061: -------------------------------------- Hello folks, anyone has feedback on this? Is there a related Scala bug maybe? Or is it entirely in Spark scope? > Wrong result from flatMapGroups using Scala 2.13.x > -------------------------------------------------- > > Key: SPARK-47061 > URL: https://issues.apache.org/jira/browse/SPARK-47061 > Project: Spark > Issue Type: Bug > Components: Spark Core > Affects Versions: 3.5.0 > Environment: Tested with Windows using OpenJDK 17, as well as Ubuntu > using OpenJDK 19 > Reporter: Magnus Kühn > Priority: Major > > Using Scala 2.13 and `KeyValueGroupedDataset::flatMapGroups` produces wrong > results. All rows produced by flatMapGroups have the values from the last > entry in the returned iterator. > * Downgrading to Scala 2.12 fixes the issue. > * Using `mapGroups` followed by `flatMap` also fixes the issue. > > Test-Setup: > {code:scala} > import org.apache.spark.sql.SparkSession > object Main { > def main(args: Array[String]): Unit = { > val spark = SparkSession.builder().master("local[*]").getOrCreate() > import spark.implicits._ > // using flatMapGroups > spark.createDataset(Seq(1, 2)) > .groupByKey(x => x) > .flatMapGroups((x, _) => Seq(10 + x, 20 + x, 30 + x)).show() > // second code using map, then flatMap ~> should yield the same result > spark.createDataset(Seq(1, 2)) > .groupByKey(x => x) > .mapGroups((x, _) => Seq(10 + x, 20 + x, 30 + x)) > .flatMap(x => x).show() > } > } {code} > We map the key 1 to the Sequence (11, 21, 31). Analogously the key 2 is > mapped to (12, 22, 32). Both computations should produce the following > (identical) result: > {code:java} > +-----+ > |value| > +-----+ > | 11| > | 21| > | 31| > | 12| > | 22| > | 32| > +-----+ {code} > This was run using Scala 2.12 with Spark 3.5 - using the following > `build.sbt`: > {code:java} > ThisBuild / scalaVersion := "2.12.18" > libraryDependencies += "org.apache.spark" % "spark-sql_2.12" % "3.5.0" > {code} > > Problem: By upgrading to Scala 2.13 we instead get the following result: > {code:java} > +-----+ > |value| > +-----+ > | 31| > | 31| > | 31| > | 32| > | 32| > | 32| > +-----+ {code} > Using this new `build.sbt`. The Code was not modified. > {code:java} > ThisBuild / scalaVersion := "2.13.10" > libraryDependencies += "org.apache.spark" % "spark-sql_2.13" % "3.5.0" {code} > [The test-case is inspired by this StackOverflow > question.|https://stackoverflow.com/questions/74633091/why-does-keyvaluegroupeddatasets-flatmapgroups-give-incorrect-result-when-runni] -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org