[ 
https://issues.apache.org/jira/browse/SPARK-21754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ed Lee updated SPARK-21754:
---------------------------
    Description: 
No exception or warning is raised when join columns have differing types,
which can lead to surprising join results for the unsuspecting.

from pyspark.sql import SparkSession
from pyspark.sql import functions as sf
import pandas as pd
spark = SparkSession.builder.master("local").appName("JoinTest").getOrCreate()


*Spark infers a LongType schema for keycol:*

left_df = pd.DataFrame({"keycol": [1], "col1": ["hello"]})
left_sdf = spark.createDataFrame(left_df)
left_sdf.schema
left_sdf.show()

right_df = pd.DataFrame({"keycol": ["1", "1", "01", "01"],
                         "r_col2": ["alpha", "beta", "gamma", "theta"]
                         })
right_sdf = spark.createDataFrame(right_df)
right_sdf.schema


*But the join gives no warning about the mismatched types; '01' gets converted to 1:*

left_sdf.join(right_sdf, on="keycol", how="left").show()

*Result:*
 +------+-----+------+
 |keycol| col1|r_col2|
 +------+-----+------+
 |     1|hello| alpha|
 |     1|hello|  beta|
 |     1|hello| gamma|
 |     1|hello| theta|
 +------+-----+------+

I think it would be safer if the join failed, or at least warned, in this case.
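
Until Spark itself complains, a caller-side guard is one possible workaround. The sketch below is not part of Spark: `assert_join_keys_match` is a hypothetical helper name, and it only assumes that both inputs expose a PySpark-style `dtypes` attribute, i.e. a list of (column_name, type_string) tuples.

```python
def assert_join_keys_match(left, right, on):
    """Raise TypeError if a join column has different types on the two sides.

    Hypothetical helper, not a Spark API. `left` and `right` only need a
    PySpark-style `dtypes` attribute: a list of (column_name, type_string).
    """
    keys = [on] if isinstance(on, str) else list(on)
    left_types = dict(left.dtypes)
    right_types = dict(right.dtypes)

    mismatched = {}
    for key in keys:
        # A join column missing on either side is reported separately.
        if key not in left_types or key not in right_types:
            raise KeyError("join column %r is missing on one side" % key)
        if left_types[key] != right_types[key]:
            mismatched[key] = (left_types[key], right_types[key])

    if mismatched:
        details = ", ".join(
            "%s: %s vs %s" % (name, lt, rt)
            for name, (lt, rt) in sorted(mismatched.items())
        )
        raise TypeError("join column types differ (" + details + ")")


# Intended use with the DataFrames from this report (requires a Spark session):
#   assert_join_keys_match(left_sdf, right_sdf, "keycol")
#   left_sdf.join(right_sdf, on="keycol", how="left").show()
```

With the data above, the guard would be expected to raise before the join runs, because the key is inferred as a long on the left and a string on the right. If the coercion is actually wanted, casting one side explicitly first, for example with `right_sdf["keycol"].cast("long")`, at least makes the conversion visible in the code.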

  was:
No Exception/Warn When Join Columns are Differing Types, which can lead to 
problematic join results to the unsuspecting.

from pyspark.sql import SparkSession
from pyspark.sql import functions as sf
import pandas as pd
spark = SparkSession.builder.master("local").appName("JoinTest").getOrCreate()


# Spark infers LongType Schema for keycol.

left_df = pd.DataFrame({"keycol": [1], "col1": ["hello"]})
left_sdf = spark.createDataFrame(left_df)
left_sdf.schema
left_sdf.show()

right_df = pd.DataFrame({"keycol": ["1", "1", "01", "01"],
                         "r_col2": ["alpha", "beta", "gamma", "theta"]
                         })
right_sdf = spark.createDataFrame(right_df)
right_sdf.schema


# When joining no warning of mismatched types '01' get converted to 1

left_sdf.join(right_sdf, on="keycol", how="left").show()

Think it'd be safer if it fails?


> No Exception/Warn When Join Columns are Differing Types
> -------------------------------------------------------
>
>                 Key: SPARK-21754
>                 URL: https://issues.apache.org/jira/browse/SPARK-21754
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.2.0
>         Environment: Ubuntu Xenial 16.04
>            Reporter: Ed Lee


