[ 
https://issues.apache.org/jira/browse/SPARK-11319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15014535#comment-15014535
 ] 

Michael Armbrust commented on SPARK-11319:
------------------------------------------

Okay, the tone here is quickly veering away from productive.  There are clearly 
some tradeoffs at play here, or there would not be a debate.

First of all, let's talk about nullability in the context of Spark SQL.  It is a 
*hint* to the optimizer that we *can ignore null checks*.  When in doubt, you 
should always set it to true, since then you are asking Spark to perform null 
checks.  All interfaces should default to setting this to true when it is 
unspecified.  If there are places where this is not the case, or where it is 
not clearly documented, we should fix that.
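
For reference, here is a rough sketch of what that looks like on the PySpark 
side (the field name is just illustrative):

{code}
from pyspark.sql.types import StructType, StructField, TimestampType

# nullable defaults to True when left unspecified, so Spark keeps the null checks
checked = StructType([StructField("a", TimestampType())])

# Passing False is the hint that Spark may skip those checks; only do this when
# you can actually guarantee the data contains no nulls.
unchecked = StructType([StructField("a", TimestampType(), False)])
{code}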

There is a tension between performance and checking everything so that we can 
provide better errors.  In many cases, we've had similarly emphatic requests 
from users to change behavior where we were being overly cautious and it was 
hurting performance (i.e. "Just Trust Us!").

bq. This means anyone reading a JSON or CSV file needs to do their own 
validation pass over the data before passing it to SparkSQL.

I do not believe that this is true.  Both data sources know that it's possible 
to have corrupt lines, and thus the schema information they produce marks all 
columns as nullable.
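
You can verify this by inspecting the inferred schema (the path here is 
hypothetical):

{code}
# The JSON data source infers a schema in which every column is nullable = true,
# precisely because corrupt or missing values are always possible.
df = sqlContext.read.json("/data/events.json")  # hypothetical path
df.printSchema()
{code}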

If you are doing the parsing or reading yourself, don't tell Spark SQL to skip 
the null checks by saying the column is not nullable.
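
As a sketch of the anti-pattern (parse_line and the paths are hypothetical):

{code}
from pyspark.sql.types import StructType, StructField, StringType

# We parsed the lines ourselves, so there is no guarantee the field is never
# null, yet nullable=False promises Spark it can skip the check.
rdd = sc.textFile("/data/raw.txt").map(parse_line)  # parse_line is hypothetical
bad = sqlContext.createDataFrame(rdd, StructType([StructField("a", StringType(), False)]))

# Better: keep nullable=True (the default) and validate or filter explicitly.
good = sqlContext.createDataFrame(rdd, StructType([StructField("a", StringType(), True)]))
{code}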

bq. Again, every database I have ever encountered validates schema on the 
INSERT or COPY statement.

Every database you have ever used was built on the assumption that you are 
going to do an expensive ETL and then query that data many times.  Spark SQL is 
trying to optimize for the case where people are querying data in situ, so I 
don't think a direct comparison here is really fair.  A traditional RDBMS has 
integrity constraints because it controls the data that it is querying.  We 
can't do this, because someone can always just drop another file that violates 
those constraints into HDFS or S3, etc.  Thus, in all cases where we know we 
don't have control, we set nullable = true internally.  In the advanced 
interfaces where we allow users to specify a schema manually, we expect them to 
do the same.
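
Concretely, when handing Spark a schema for data you don't control, that means 
something like the following (the column names and path are made up):

{code}
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Data we don't control: mark the columns nullable rather than asking Spark
# to skip the null checks.
schema = StructType([
    StructField("event_time", TimestampType(), nullable=True),
    StructField("user_id", StringType(), nullable=True),
])
df = sqlContext.read.schema(schema).json("/data/events/")  # new files can appear here at any time
{code}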

bq. Also, assuming we eventually have non-nullable UDFs, the behaviour of nulls 
coming out in the column upon UDF exceptions will also violate these false 
assumptions Spark is making.

Whenever we encounter an expression that could return null, we change the 
nullability of the result, even if the input was non-nullable.
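
As a small sketch of that propagation (the UDF and column names are 
hypothetical): a Python UDF can always return null, so its output column is 
marked nullable even though the input column was declared non-nullable.

{code}
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType, StructType, StructField

# Input column declared non-nullable.
schema = StructType([StructField("x", IntegerType(), False)])
df = sqlContext.createDataFrame([(1,), (2,)], schema)

# The UDF may return None, so the resulting column comes back nullable = true.
plus_one = udf(lambda x: x + 1, IntegerType())
df.select(plus_one(df["x"]).alias("y")).printSchema()
{code}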

Given the above, I'm thinking the best option is to add better documentation 
about the semantics here, but I'm open to other concrete suggestions.

> PySpark silently Accepts null values in non-nullable DataFrame fields.
> ----------------------------------------------------------------------
>
>                 Key: SPARK-11319
>                 URL: https://issues.apache.org/jira/browse/SPARK-11319
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>            Reporter: Kevin Cox
>
> Running the following code with a null value in a non-nullable column 
> silently works. This makes the code incredibly hard to trust.
> {code}
> In [2]: from pyspark.sql.types import *
> In [3]: sqlContext.createDataFrame([(None,)], StructType([StructField("a", 
> TimestampType(), False)])).collect()
> Out[3]: [Row(a=None)]
> {code}


