[ https://issues.apache.org/jira/browse/SPARK-11319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15014405#comment-15014405 ]

Harry Brundage commented on SPARK-11319:
----------------------------------------

Why force all users everywhere to do a pass over the data first just to establish 
whether or not it contains nulls? You are forcing a bunch of people to go through 
the struggle of figuring out that this is even a problem, then to write their own 
data validators, and then to debug that code when it breaks. Instead, Spark should 
do it properly for everyone and act as a useful library. 
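
For context, this is roughly the validation pass every user is currently forced 
to write for themselves (a minimal sketch; the assert_no_forbidden_nulls helper 
is my own invention, not anything Spark ships, and it assumes the sqlContext 
provided by the PySpark shell):

{code}
from pyspark.sql.types import StructType, StructField, TimestampType

def assert_no_forbidden_nulls(df):
    # Hypothetical helper: fail fast if any column declared non-nullable
    # actually contains nulls. Note this costs an extra full pass over the
    # data per non-nullable column -- exactly the work Spark should be
    # doing (or avoiding) on our behalf.
    non_nullable = [f.name for f in df.schema.fields if not f.nullable]
    for name in non_nullable:
        bad_rows = df.filter(df[name].isNull()).count()
        if bad_rows:
            raise ValueError("column %r is non-nullable but has %d null rows"
                             % (name, bad_rows))
    return df

schema = StructType([StructField("a", TimestampType(), False)])
df = sqlContext.createDataFrame([(None,)], schema)
assert_no_forbidden_nulls(df)  # raises -- the error Spark should have raised itself
{code}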

Also, the schema inference inside PySpark (and maybe Scala Spark as well) only 
looks at the first 100 rows to establish the schema, which I presume is why it 
only ever outputs nullable columns: it hasn't done a full pass over the data. 
That means the only way to get a non-nullable field is to pass your own schema, 
as Kevin did above, so this case will bite far more people than just poor Kevin 
up there. The fact that the collect even worked boggles my mind: imagine the 
havoc that nulls sitting where Catalyst and Tungsten assume non-null values will 
cause once things get complicated! The user experience of trying to debug an NPE 
from inside Spark because their input data violated either Spark's or their own 
expectations is absolute garbage. Again, every database I have ever encountered 
validates data against the schema on the INSERT or COPY statement.
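
To illustrate the inference point, a quick sketch against the PySpark API (the 
sample data is made up, and it again assumes the shell-provided sqlContext): 
inference always yields nullable fields, so nullable=False only ever appears 
when the caller hands over the schema, which is precisely the declaration Spark 
never checks:

{code}
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, TimestampType

# Inferred schema: every field comes out nullable, because inference only
# samples the head of the data and never proves a column is null-free.
inferred = sqlContext.createDataFrame([Row(a=1), Row(a=2)])
print(inferred.schema.fields[0].nullable)   # True

# The only way to get nullable=False is to supply the schema yourself...
schema = StructType([StructField("a", TimestampType(), False)])
explicit = sqlContext.createDataFrame([(None,)], schema)

# ...and that declaration is exactly what Spark never validates:
print(explicit.collect())   # [Row(a=None)] -- a null in a "non-nullable" column
{code}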

Also, assuming we eventually get non-nullable UDFs, the nulls that come out in 
the column when a UDF throws will violate these same false assumptions Spark is 
making. This needs to be fixed! Am I taking crazy pills?!

> PySpark silently Accepts null values in non-nullable DataFrame fields.
> ----------------------------------------------------------------------
>
>                 Key: SPARK-11319
>                 URL: https://issues.apache.org/jira/browse/SPARK-11319
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>            Reporter: Kevin Cox
>
> Running the following code with a null value in a non-nullable column 
> silently works. This makes the code incredibly hard to trust.
> {code}
> In [2]: from pyspark.sql.types import *
> In [3]: sqlContext.createDataFrame([(None,)], StructType([StructField("a", TimestampType(), False)])).collect()
> Out[3]: [Row(a=None)]
> {code}


