[ https://issues.apache.org/jira/browse/SPARK-32136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dongjoon Hyun updated SPARK-32136:
----------------------------------
    Priority: Blocker  (was: Major)

> Spark producing incorrect groupBy results when key is a struct with nullable properties
> ----------------------------------------------------------------------------------------
>
>                 Key: SPARK-32136
>                 URL: https://issues.apache.org/jira/browse/SPARK-32136
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Jason Moore
>            Priority: Blocker
>
> I'm in the process of migrating from Spark 2.4.x to Spark 3.0.0, and I'm noticing a behaviour change in a particular aggregation we're doing. I think I've tracked it down to how Spark now treats nullable properties within the column being grouped by.
>
> Here's a simple test I've been able to set up to reproduce it:
>
> {code:scala}
> case class B(c: Option[Double])
> case class A(b: Option[B])
>
> val df = Seq(
>   A(None),
>   A(Some(B(None))),
>   A(Some(B(Some(1.0))))
> ).toDF
>
> val res = df.groupBy("b").agg(count("*"))
> {code}
> Spark 2.4.6 has the expected result:
> {noformat}
> > res.show
> +-----+--------+
> |    b|count(1)|
> +-----+--------+
> |   []|       1|
> | null|       1|
> |[1.0]|       1|
> +-----+--------+
> > res.collect.foreach(println)
> [[null],1]
> [null,1]
> [[1.0],1]
> {noformat}
> But Spark 3.0.0 has an unexpected result:
> {noformat}
> > res.show
> +-----+--------+
> |    b|count(1)|
> +-----+--------+
> |   []|       2|
> |[1.0]|       1|
> +-----+--------+
> > res.collect.foreach(println)
> [[null],2]
> [[1.0],1]
> {noformat}
> Notice how it has keyed one of the values of `b` as `[null]`; that is, as an instance of B with a null value for the `c` property, instead of as a null for the overall struct value itself.
> Is this an intended change?
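
For anyone wanting to reproduce this outside of spark-shell, below is a self-contained sketch of the same repro as a standalone application. The SparkSession setup, the imports, and the Spark32136Repro object name are additions for illustration only; the case classes and the groupBy query are taken from the report above.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.count

object Spark32136Repro {
  // Case classes as given in the issue description.
  case class B(c: Option[Double])
  case class A(b: Option[B])

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("SPARK-32136 repro")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      A(None),                  // b is null
      A(Some(B(None))),         // b is a struct whose field c is null
      A(Some(B(Some(1.0))))     // b is a struct with c = 1.0
    ).toDF

    // On Spark 2.4.6 this yields three groups; on 3.0.0 the first two rows
    // collapse into a single [null] group, as shown in the issue above.
    val res = df.groupBy("b").agg(count("*"))
    res.show()
    res.collect.foreach(println)

    spark.stop()
  }
}
{code}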
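As an untested workaround sketch (not verified against this bug), one could group by the struct's nullability and its nested field separately, so the grouping key is a pair of scalar columns rather than a nullable struct; the column names below are hypothetical:

{code:scala}
import org.apache.spark.sql.functions.{col, count}

// Hypothetical workaround sketch, not confirmed against SPARK-32136:
// b_is_null distinguishes a null struct from a struct with a null field,
// so the three rows should land in three distinct groups.
val byFields = df
  .groupBy(col("b").isNull.as("b_is_null"), col("b.c").as("c"))
  .agg(count("*"))
{code}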