[ https://issues.apache.org/jira/browse/SPARK-27298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031938#comment-17031938 ]

Sunitha Kambhampati commented on SPARK-27298:
---------------------------------------------

Thanks for trying it out in your environment. It is good to know that you are 
getting the right result on Spark 2.4.4 but not on Spark 2.3.0.

Based on that, I ran this test on Spark 2.3.0 in my Linux environment and I can 
reproduce the wrong count. I generated the explain output to debug this; the 
plan is optimized to:

Spark 2.3.0

 
{code:java}
== Optimized Logical Plan ==
Aggregate [CustID#92, DOB#93, Gender#94, HouseholdID#95, Income#96, 
Initials#97, Occupation#98, Surname#99, Telephone#100L, Title#101], [CustID#92, 
DOB#93, Gender#94, HouseholdID#95, Income#96, Initials#97, Occupation#98, 
Surname#99, Telephone#100L, Title#101]
+- Filter ((isnotnull(Gender#94) && (Gender#94 = M)) && NOT Income#96 IN 
(503.65,495.54,486.82,481.28,479.79))
   +- 
Relation[CustID#92,DOB#93,Gender#94,HouseholdID#95,Income#96,Initials#97,Occupation#98,Surname#99,Telephone#100L,Title#101]
 parquet
{code}
 

 

With Spark 2.3.3, which generates the correct count, I see this optimized 
plan:

 
{code:java}
== Optimized Logical Plan ==
Aggregate [CustID#92, DOB#93, Gender#94, HouseholdID#95, Income#96, 
Initials#97, Occupation#98, Surname#99, Telephone#100L, Title#101], [CustID#92, 
DOB#93, Gender#94, HouseholdID#95, Income#96, Initials#97, Occupation#98, 
Surname#99, Telephone#100L, Title#101]
+- Filter ((isnotnull(Gender#94) && (Gender#94 = M)) && NOT coalesce(Income#96 
IN (503.65,495.54,486.82,481.28,479.79), false))
   +- 
Relation[CustID#92,DOB#93,Gender#94,HouseholdID#95,Income#96,Initials#97,Occupation#98,Surname#99,Telephone#100L,Title#101]
 parquet
{code}
 

 

Observations:

This issue is caused by rows with null values in the Income column being 
incorrectly filtered out. That behavior comes from the optimizer rule 
'ReplaceExceptWithFilter'.
 

This bug was fixed in SPARK-26366, and the fix was back-ported to Spark 2.3.3.
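The difference between the two plans is SQL's three-valued logic: a WHERE clause keeps a row only when its predicate evaluates to exactly TRUE, and `NOT (Income IN (...))` evaluates to NULL (not FALSE) when Income is null, so such rows are silently dropped. A minimal plain-Java sketch of this, not using Spark APIs (all method names here are illustrative):

```java
import java.util.Arrays;
import java.util.List;

public class ThreeValuedLogicDemo {

    // SQL "x IN (list)" under three-valued logic: a NULL input yields NULL (unknown).
    static Boolean sqlIn(Double value, List<Double> list) {
        return value == null ? null : Boolean.valueOf(list.contains(value));
    }

    // SQL NOT: NOT NULL stays NULL.
    static Boolean sqlNot(Boolean b) {
        return b == null ? null : Boolean.valueOf(!b);
    }

    // coalesce(b, false), the guard added by the SPARK-26366 fix.
    static Boolean coalesceFalse(Boolean b) {
        return b == null ? Boolean.FALSE : b;
    }

    // A WHERE clause keeps a row only when its predicate is exactly TRUE.
    static boolean whereKeeps(Boolean predicate) {
        return Boolean.TRUE.equals(predicate);
    }

    public static void main(String[] args) {
        List<Double> excluded = Arrays.asList(503.65, 495.54, 486.82, 481.28, 479.79);
        Double nullIncome = null; // a Gender = 'M' customer whose Income is null

        // Buggy plan (2.3.0): NOT (Income IN (...)) is NULL for a null Income,
        // so the filter drops the row and the except() count comes out too low.
        Boolean buggy = sqlNot(sqlIn(nullIncome, excluded));
        System.out.println("buggy plan keeps row: " + whereKeeps(buggy));  // false

        // Fixed plan (2.3.3+): NOT coalesce(Income IN (...), false) turns the
        // unknown into FALSE first, so NOT yields TRUE and the row survives.
        Boolean fixed = sqlNot(coalesceFalse(sqlIn(nullIncome, excluded)));
        System.out.println("fixed plan keeps row: " + whereKeeps(fixed));  // true
    }
}
```

This matches the plan diff above: the only change between the two optimized plans is the `coalesce(..., false)` wrapped around the IN predicate.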

---------

There doesn't seem to be anything OS-specific in this fix, so I am not sure why 
you were seeing the correct count on Windows with Spark 2.3.0 (which has the 
bug). To investigate that, I will need the explain output, and also the count 
of rows where Income is null for Gender = 'M', on Windows (as mentioned in my 
earlier comments).

> Dataset except operation gives different results(dataset count) on Spark 
> 2.3.0 Windows and Spark 2.3.0 Linux environment
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-27298
>                 URL: https://issues.apache.org/jira/browse/SPARK-27298
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0, 2.4.2
>            Reporter: Mahima Khatri
>            Priority: Blocker
>              Labels: data-loss
>         Attachments: Console-Result-Windows.txt, 
> Linux-spark-2.3.0_result.txt, Linux-spark-2.4.4_result.txt, 
> console-reslt-2.3.3-linux.txt, console-result-2.3.3-windows.txt, 
> console-result-LinuxonVM.txt, console-result-spark-2.4.2-linux, 
> console-result-spark-2.4.2-windows, customer.csv, pom.xml
>
>
> {code:java}
> // package com.verifyfilter.example;
> import org.apache.spark.SparkConf;
> import org.apache.spark.SparkContext;
> import org.apache.spark.sql.SparkSession;
> import org.apache.spark.sql.Column;
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.SaveMode;
>
> public class ExcludeInTesting {
>     public static void main(String[] args) {
>         SparkSession spark = SparkSession.builder()
>                 .appName("ExcludeInTesting")
>                 .config("spark.some.config.option", "some-value")
>                 .getOrCreate();
>
>         Dataset<Row> dataReadFromCSV = spark.read().format("com.databricks.spark.csv")
>                 .option("header", "true")
>                 .option("delimiter", "|")
>                 .option("inferSchema", "true")
>                 //.load("E:/resources/customer.csv"); // local; below path for VM
>                 .load("/home/myproject/bda/home/bin/customer.csv");
>         dataReadFromCSV.printSchema();
>         dataReadFromCSV.show();
>
>         // Adding an extra step of saving to db and then loading it again
>         dataReadFromCSV.write().mode(SaveMode.Overwrite).saveAsTable("customer");
>         Dataset<Row> dataLoaded = spark.sql("select * from customer");
>
>         // Gender EQ M
>         Column genderCol = dataLoaded.col("Gender");
>         Dataset<Row> onlyMaleDS = dataLoaded.where(genderCol.equalTo("M"));
>         //Dataset<Row> onlyMaleDS = spark.sql("select count(*) from customer where Gender='M'");
>         onlyMaleDS.show();
>         System.out.println("The count of Male customers is :" + onlyMaleDS.count());
>         System.out.println("*************************************");
>
>         // Income in the list
>         Object[] valuesArray = new Object[5];
>         valuesArray[0] = 503.65;
>         valuesArray[1] = 495.54;
>         valuesArray[2] = 486.82;
>         valuesArray[3] = 481.28;
>         valuesArray[4] = 479.79;
>         Column incomeCol = dataLoaded.col("Income");
>         Dataset<Row> incomeMatchingSet = dataLoaded.where(incomeCol.isin(valuesArray));
>         System.out.println("The count of customers satisfying Income is :" + incomeMatchingSet.count());
>         System.out.println("*************************************");
>
>         Dataset<Row> maleExcptIncomeMatch = onlyMaleDS.except(incomeMatchingSet);
>         System.out.println("The count of final customers is :" + maleExcptIncomeMatch.count());
>         System.out.println("*************************************");
>     }
> }
> {code}
>  When the above code is executed on Spark 2.3.0, it gives the following 
> different results:
> *Windows*:  The code gives the correct dataset count, 148237.
> *Linux*:      The code gives a different dataset count, 129532.
>  
> Some more info related to this bug:
> 1. Application code (attached)
> 2. CSV file used (attached)
> 3. Windows spec:
>           Windows 10, 64-bit OS
> 4. Linux spec (running on Oracle VM VirtualBox)
>       Specifications: \{as captured from Vbox.log}
>         00:00:26.112908 VMMDev: Guest Additions information report: Version 5.0.32 r112930          '5.0.32_Ubuntu'
>         00:00:26.112996 VMMDev: Guest Additions information report: Interface = 0x00010004         osType = 0x00053100 (Linux >= 2.6, 64-bit)
> 5. Snapshots of output in both cases (attached)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
