[ 
https://issues.apache.org/jira/browse/SPARK-47061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18013641#comment-18013641
 ] 

Gaël Jourdan commented on SPARK-47061:
--------------------------------------

Hello folks, anyone has feedback on this? Is there a related Scala bug maybe? 
Or is it entirely in Spark scope?

> Wrong result from flatMapGroups using Scala 2.13.x
> --------------------------------------------------
>
>                 Key: SPARK-47061
>                 URL: https://issues.apache.org/jira/browse/SPARK-47061
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.5.0
>         Environment: Tested with Windows using OpenJDK 17, as well as Ubuntu 
> using OpenJDK 19
>            Reporter: Magnus Kühn
>            Priority: Major
>
> Using Scala 2.13 and `KeyValueGroupedDataset::flatMapGroups` produces wrong 
> results. All rows produced by flatMapGroups have the values from the last 
> entry in the returned iterator.
>  * Downgrading to Scala 2.12 fixes the issue.
>  * Using `mapGroups` followed by `flatMap` also fixes the issue.
>  
> Test-Setup:
> {code:scala}
> import org.apache.spark.sql.SparkSession
> object Main {
>   def main(args: Array[String]): Unit = {
>     val spark = SparkSession.builder().master("local[*]").getOrCreate()
>     import spark.implicits._
>     // using flatMapGroups
>     spark.createDataset(Seq(1, 2))
>       .groupByKey(x => x)
>       .flatMapGroups((x, _) => Seq(10 + x, 20 + x, 30 + x)).show()
>     // second code using map, then flatMap ~> should yield the same result    
>     spark.createDataset(Seq(1, 2))
>       .groupByKey(x => x)
>       .mapGroups((x, _) => Seq(10 + x, 20 + x, 30 + x))
>       .flatMap(x => x).show()
>    }
> } {code}
> We map the key 1 to the Sequence (11, 21, 31). Analogously the key 2 is 
> mapped to (12, 22, 32). Both computations should produce the following 
> (identical) result:
> {code:java}
> +-----+
> |value|
> +-----+
> |   11|
> |   21|
> |   31|
> |   12|
> |   22|
> |   32|
> +-----+ {code}
> This was run using Scala 2.12 with Spark 3.5 - using the following 
> `build.sbt`:
> {code:java}
> ThisBuild / scalaVersion := "2.12.18"
> libraryDependencies += "org.apache.spark" % "spark-sql_2.12" % "3.5.0"
> {code}
>  
> Problem: By upgrading to Scala 2.13 we instead get the following result:
> {code:java}
> +-----+
> |value|
> +-----+
> |   31|
> |   31|
> |   31|
> |   32|
> |   32|
> |   32|
> +-----+ {code}
> Using this new `build.sbt`. The Code was not modified.
> {code:java}
> ThisBuild / scalaVersion := "2.13.10"
> libraryDependencies += "org.apache.spark" % "spark-sql_2.13" % "3.5.0" {code}
> [The test-case is inspired by this StackOverflow 
> question.|https://stackoverflow.com/questions/74633091/why-does-keyvaluegroupeddatasets-flatmapgroups-give-incorrect-result-when-runni]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to