[
https://issues.apache.org/jira/browse/SPARK-11319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15014098#comment-15014098
]
Harry Brundage commented on SPARK-11319:
----------------------------------------
Forgive my frankness, but that is ridiculous. This means anyone reading a JSON
or CSV file needs to do their own validation pass over the data before passing
it to Spark SQL. For anyone working with any kind of data source they don't
trust completely, this renders the entire loader layer of Spark SQL useless and
forces each user to implement their own. You have an opportunity to solve your
users' problem by notifying them when their schema expectations are violated,
and instead you silently allow, and happily encourage, data quality issues.
Every database worth its salt validates data on input!
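To illustrate, here is a minimal sketch of the validation pass each user is
currently forced to write for themselves. It is plain Python over row tuples,
not part of any Spark API; the helper name {{validate_non_nullable}} and the
(name, nullable) schema format are hypothetical stand-ins for a
{{StructType}}.

{code}
def validate_non_nullable(rows, schema):
    """Raise ValueError if a non-nullable field holds None.

    rows   -- iterable of tuples, one value per schema field
    schema -- list of (field_name, nullable) pairs, mirroring
              the StructField(name, type, nullable) declaration
    """
    for i, row in enumerate(rows):
        for value, (name, nullable) in zip(row, schema):
            if value is None and not nullable:
                raise ValueError(
                    "row %d: null in non-nullable field %r" % (i, name))
    return rows

# Run this over the data *before* sqlContext.createDataFrame(...),
# since createDataFrame itself will not complain.
{code}

This is exactly the check one would expect the loader layer to perform when a
field is declared with nullable=False.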
> PySpark silently Accepts null values in non-nullable DataFrame fields.
> ----------------------------------------------------------------------
>
> Key: SPARK-11319
> URL: https://issues.apache.org/jira/browse/SPARK-11319
> Project: Spark
> Issue Type: Bug
> Components: PySpark, SQL
> Reporter: Kevin Cox
>
> Running the following code with a null value in a non-nullable column
> silently works. This makes the code incredibly hard to trust.
> {code}
> In [2]: from pyspark.sql.types import *
> In [3]: sqlContext.createDataFrame([(None,)], StructType([StructField("a",
> TimestampType(), False)])).collect()
> Out[3]: [Row(a=None)]
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)