[
https://issues.apache.org/jira/browse/SPARK-23786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Apache Spark reassigned SPARK-23786:
------------------------------------
Assignee: (was: Apache Spark)
> CSV schema validation - column names are not checked
> ----------------------------------------------------
>
> Key: SPARK-23786
> URL: https://issues.apache.org/jira/browse/SPARK-23786
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.0
> Reporter: Maxim Gekk
> Priority: Major
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> Here is a CSV file that contains two columns of the same type:
> {code}
> $ cat marina.csv
> depth, temperature
> 10.2, 9.0
> 5.5, 12.3
> {code}
> If we define the schema with correct types but wrong column names (reversed
> order):
> {code:scala}
> import org.apache.spark.sql.types.{DoubleType, StructType}
> val schema = new StructType().
>   add("temperature", DoubleType).
>   add("depth", DoubleType)
> {code}
> Spark reads the CSV file without any errors:
> {code:scala}
> val ds = spark.read.schema(schema).option("header", "true").csv("marina.csv")
> ds.show
> {code}
> and outputs the wrong result:
> {code}
> +-----------+-----+
> |temperature|depth|
> +-----------+-----+
> |       10.2|  9.0|
> |        5.5| 12.3|
> +-----------+-----+
> {code}
> The correct behavior would be either to output an error or to read the
> columns according to their names in the schema.
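
The two behaviors proposed in the issue can be illustrated with a minimal sketch in plain Scala. It is not Spark's implementation or any actual fix for this issue. HeaderCheck and columnOrder are hypothetical names introduced only for illustration. The sketch assumes header names are compared after trimming whitespace (the sample file has a space after the comma) and does not handle duplicate column names.

```scala
// Hypothetical helper, not part of Spark: compares CSV header names with the
// field names of a user-supplied schema.
object HeaderCheck {
  // Returns Right(indices), where indices(i) is the position in the header of
  // the i-th schema field, or Left(message) when the names do not match.
  def columnOrder(header: Seq[String],
                  schemaFields: Seq[String]): Either[String, Seq[Int]] = {
    // Assumption: surrounding whitespace in header names is not significant.
    val names = header.map(_.trim)
    val missing = schemaFields.filterNot(names.contains)
    if (names.length != schemaFields.length)
      Left(s"expected ${schemaFields.length} columns, header has ${names.length}")
    else if (missing.nonEmpty)
      Left(s"schema fields not found in header: ${missing.mkString(", ")}")
    else
      Right(schemaFields.map(f => names.indexOf(f)))
  }
}
```

For the file above, the header is Seq("depth", " temperature") and the schema fields are Seq("temperature", "depth"), so the helper returns the index mapping Seq(1, 0), meaning that "temperature" should be read from the second column and "depth" from the first. A header whose names do not appear in the schema yields a Left with an error message instead of silently mislabeled data.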