[ https://issues.apache.org/jira/browse/SPARK-11319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15014405#comment-15014405 ]
Harry Brundage commented on SPARK-11319:
----------------------------------------
Why force all users everywhere to do a pass over the data first just to establish
whether or not there are nulls? You are forcing a bunch of people to go through
the struggle of figuring out that this is even a problem, then to write their own
data validators, then to debug that code when it breaks.
Instead, Spark should do it properly for everyone and act as a useful library.
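To make the burden concrete, here is a rough sketch of the validation pass users
currently have to write for themselves. The helper name and the approach of
walking the schema's nullable flags are mine, not anything Spark ships:
{code}
from pyspark.sql import functions as F

def assert_no_forbidden_nulls(df):
    """Hypothetical helper: fail fast if any non-nullable column holds nulls.

    This is the extra pass users have to bolt on themselves, because
    createDataFrame does not enforce nullable=False.
    """
    non_nullable = [f.name for f in df.schema.fields if not f.nullable]
    if not non_nullable:
        return df
    # One full extra scan over the data, just to count nulls per declared
    # non-nullable column.
    counts = df.select([
        F.sum(F.col(c).isNull().cast("long")).alias(c) for c in non_nullable
    ]).first()
    bad = dict((c, counts[c]) for c in non_nullable if counts[c])
    if bad:
        raise ValueError("Nulls found in non-nullable columns: %s" % bad)
    return df
{code}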
Also, the schema inference inside PySpark (and maybe Scala Spark as well) only
looks at the first 100 rows to establish the schema, which I presume is why it
only ever outputs nullable columns: it hasn't done a full pass over the data.
That means the only way to get a non-nullable field is to pass your own schema,
as Kevin did above, so this case will bite far more people than just poor Kevin
up there. The fact that the collect even worked boggles my mind: imagine the
havoc nulls sitting where Catalyst and Tungsten assume non-nullable values will
start causing once things get complicated! The user experience of debugging an
NPE from inside Spark because their input data violated either Spark's or their
own expectations is absolute garbage. Again, every database I have ever
encountered validates the schema on the INSERT or COPY statement.
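A quick sketch of both halves of that claim (minimal example; sqlContext is the
same one from Kevin's report, and the printed schema is indicative rather than
exact for any particular Spark version):
{code}
from pyspark.sql.types import StructType, StructField, TimestampType

# Inferred schemas come back with every field marked nullable:
df = sqlContext.createDataFrame([(1, "x")], ["id", "label"])
print(df.schema)
# StructType(... StructField(id,LongType,true), StructField(label,StringType,true))

# The only way to get nullable=False is to hand in the schema yourself --
# and that is exactly the path that then silently accepts nulls:
schema = StructType([StructField("a", TimestampType(), False)])
sqlContext.createDataFrame([(None,)], schema).collect()
# [Row(a=None)]  <- no error, even though the field is declared non-nullable
{code}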
Also, assuming we eventually have non-nullable UDFs, the behaviour of nulls
appearing in the output column when a UDF throws will also violate these false
assumptions Spark is making. This needs to be fixed! Am I taking crazy pills?!
> PySpark silently accepts null values in non-nullable DataFrame fields.
> ----------------------------------------------------------------------
>
> Key: SPARK-11319
> URL: https://issues.apache.org/jira/browse/SPARK-11319
> Project: Spark
> Issue Type: Bug
> Components: PySpark, SQL
> Reporter: Kevin Cox
>
> Running the following code with a null value in a non-nullable column
> silently works. This makes the code incredibly hard to trust.
> {code}
> In [2]: from pyspark.sql.types import *
> In [3]: sqlContext.createDataFrame([(None,)], StructType([StructField("a", TimestampType(), False)])).collect()
> Out[3]: [Row(a=None)]
> {code}