[
https://issues.apache.org/jira/browse/SPARK-24051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16447847#comment-16447847
]
Emlyn Corrin commented on SPARK-24051:
--------------------------------------
I tried again to reproduce it in the Scala API, keeping as close as possible to
the Java calls, but did not get the same error. It looks like the error may be
specific to the Java API.
For reference, here is the Scala code I came up with:
{code:scala}
import org.apache.spark.sql.{Column, RowFactory}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{count, lit}
import org.apache.spark.sql.types.{DataTypes, StructField, StructType}
import scala.collection.JavaConverters._

// val ds1 = Seq((1, 42), (2, 99)).toDF("a", "b")
val arr1 = List(RowFactory.create(Int.box(1), Int.box(42)),
                RowFactory.create(Int.box(2), Int.box(99)))
val sch1 = new StructType(Array(new StructField("a", DataTypes.IntegerType),
                                new StructField("b", DataTypes.IntegerType)))
val ds1 = spark.createDataFrame(arr1.asJava, sch1)
ds1.show

// val ds2 = Seq(3).toDF("a").withColumn("b", lit(0))
val arr2 = List(RowFactory.create(Int.box(3)))
val sch2 = new StructType(Array(new StructField("a", DataTypes.IntegerType)))
val ds2 = spark.createDataFrame(arr2.asJava, sch2).withColumn("b", lit(0))
ds2.show

// val cols = Array(new Column("a"), new Column("b").as("b"),
//                  count(lit(1)).over(Window.partitionBy()).as("n"))
val ds = ds1
  .select(new Column("a"),
          new Column("b").as("b"),
          count(lit(1)).over(Window.partitionBy()).as("n"))
  .union(ds2.select(new Column("a"),
                    new Column("b").as("b"),
                    count(lit(1)).over(Window.partitionBy()).as("n")))
  .where(new Column("n").geq(1))
  .drop("n")
ds.show
{code}
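Incidentally, the description below notes that replacing {{new Column("b").as("b")}} with just {{new Column("b")}} makes the Java program return the correct result. A minimal sketch of that workaround (untested against 2.3.0; it assumes the same {{ds1}} and {{ds2}} set up by the Java program in the description):

{code:java}
// Workaround sketch, per the report: build the projection without the
// extra alias on "b". With new Column("b") in place of
// new Column("b").as("b"), the union reportedly keeps the original values.
Column[] cols = new Column[]{
        new Column("a"),
        new Column("b"),  // no .as("b") alias here
        functions.count(functions.lit(1))
                .over(Window.partitionBy())
                .as("n")};

Dataset<Row> ds = ds1
        .select(cols)
        .union(ds2.select(cols))
        .where(new Column("n").geq(1))
        .drop("n");
ds.show();  // per the report, this now prints rows (1,42), (2,99), (3,0)
{code}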
> Incorrect results
> -----------------
>
> Key: SPARK-24051
> URL: https://issues.apache.org/jira/browse/SPARK-24051
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.0
> Reporter: Emlyn Corrin
> Priority: Major
>
> I'm seeing Spark 2.3.0 return incorrect results for a certain (very specific)
> query, demonstrated by the Java program below. It was simplified from a much
> more complex query, but I'm having trouble simplifying it further without
> removing the erroneous behaviour.
> {code:java}
> package sparktest;
>
> import org.apache.spark.SparkConf;
> import org.apache.spark.sql.*;
> import org.apache.spark.sql.expressions.Window;
> import org.apache.spark.sql.types.DataTypes;
> import org.apache.spark.sql.types.Metadata;
> import org.apache.spark.sql.types.StructField;
> import org.apache.spark.sql.types.StructType;
>
> import java.util.Arrays;
>
> public class Main {
>     public static void main(String[] args) {
>         SparkConf conf = new SparkConf()
>                 .setAppName("SparkTest")
>                 .setMaster("local[*]");
>         SparkSession session = SparkSession.builder().config(conf).getOrCreate();
>
>         Row[] arr1 = new Row[]{
>                 RowFactory.create(1, 42),
>                 RowFactory.create(2, 99)};
>         StructType sch1 = new StructType(new StructField[]{
>                 new StructField("a", DataTypes.IntegerType, true, Metadata.empty()),
>                 new StructField("b", DataTypes.IntegerType, true, Metadata.empty())});
>         Dataset<Row> ds1 = session.createDataFrame(Arrays.asList(arr1), sch1);
>         ds1.show();
>
>         Row[] arr2 = new Row[]{
>                 RowFactory.create(3)};
>         StructType sch2 = new StructType(new StructField[]{
>                 new StructField("a", DataTypes.IntegerType, true, Metadata.empty())});
>         Dataset<Row> ds2 = session.createDataFrame(Arrays.asList(arr2), sch2)
>                 .withColumn("b", functions.lit(0));
>         ds2.show();
>
>         Column[] cols = new Column[]{
>                 new Column("a"),
>                 new Column("b").as("b"),
>                 functions.count(functions.lit(1))
>                         .over(Window.partitionBy())
>                         .as("n")};
>
>         Dataset<Row> ds = ds1
>                 .select(cols)
>                 .union(ds2.select(cols))
>                 .where(new Column("n").geq(1))
>                 .drop("n");
>
>         ds.show();
>         //ds.explain(true);
>     }
> }
> {code}
> It just computes the union of two datasets,
> {code}
> +---+---+
> | a| b|
> +---+---+
> | 1| 42|
> | 2| 99|
> +---+---+
> {code}
> with
> {code}
> +---+---+
> | a| b|
> +---+---+
> | 3| 0|
> +---+---+
> {code}
> The expected result is:
> {code}
> +---+---+
> | a| b|
> +---+---+
> | 1| 42|
> | 2| 99|
> | 3| 0|
> +---+---+
> {code}
> but instead it prints:
> {code}
> +---+---+
> | a| b|
> +---+---+
> | 1| 0|
> | 2| 0|
> | 3| 0|
> +---+---+
> {code}
> Notice how the value in column b is always zero, overriding the original
> values in rows 1 and 2.
> Seemingly trivial changes, such as replacing {{new Column("b").as("b",
> Metadata.empty())}} with just {{new Column("b")}}, or removing the {{where}}
> clause after the union, make it behave correctly again.