[https://issues.apache.org/jira/browse/SPARK-11319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15014657#comment-15014657]

Kevin Cox commented on SPARK-11319:
-----------------------------------

{quote}
First of all, lets talk about nullability in the context of Spark SQL. It is a 
hint to the optimizer that we can ignore null checks.
{quote}

It doesn't sound much like a _hint_ to me. In my mind, _hints_ don't allow an 
optimizer to produce incorrect behaviour. This is more of a contract that the 
optimizer can rely on.
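
To make the distinction concrete, here is a hypothetical sketch (same shell-style 
PySpark as the report below, with a {{sqlContext}} assumed in scope) of the kind of 
incorrect behaviour a contract-trusting optimizer could produce once a null slips 
into a non-nullable column; the rewrite shown is illustrative, not a claim about 
Spark's actual rules:

{code}
from pyspark.sql.types import StructType, StructField, TimestampType

# The schema promises that "a" is never null.
schema = StructType([StructField("a", TimestampType(), False)])

# ...but, per this issue, a null is accepted silently.
df = sqlContext.createDataFrame([(None,)], schema)

# If the optimizer trusts the contract, a filter like this could
# legitimately be rewritten to a no-op, returning the null row anyway.
df.filter(df.a.isNotNull()).collect()
{code}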

{quote}
I do not believe that this is true. Both datasources know that its possible to 
have corrupt lines and thus the schema information they produce says that all 
columns are nullable.
{quote}

I see two problems with this statement. Firstly, if a datasource has corrupt 
lines, someone should know about it; that could be thousands of dollars down the 
drain if you lose the wrong lines. I find the current behaviour of silently 
turning corrupt lines into all nulls completely unacceptable.

Secondly, I'm not talking about inference. It makes sense to me that inferred 
types should be nullable. I'm talking about when I give you the schema I expect; 
in that case I would hope to see an error if you couldn't sanely map the data 
into my schema.
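
To illustrate both points, here is a hedged sketch (hypothetical input, assuming a 
PySpark shell with {{sc}} and {{sqlContext}} available) of reading a corrupt line 
under an explicit, non-nullable schema; instead of an error, the corrupt line comes 
back as a row of nulls:

{code}
from pyspark.sql.types import StructType, StructField, StringType

# Hypothetical input: one well-formed line and one corrupt line.
lines = sc.parallelize(['{"a": "good"}', '{this is not json'])

# The schema I give you says "a" must never be null.
schema = StructType([StructField("a", StringType(), False)])

# No error is raised; the corrupt line silently becomes a row of nulls.
sqlContext.read.schema(schema).json(lines).collect()
# roughly: [Row(a=u'good'), Row(a=None)]
{code}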

{quote}
Given the above, I'm thinking the best option is to add better documentation 
about the semantics here, but I'm open to other concrete suggestions.
{quote}

This definitely has to be done, one way or the other.

I'm not sure I buy the performance argument, for two reasons. The first is that 
bad data is usually far more costly than some CPU time. The other is that when 
you are parsing JSON or reading input you are already doing type checks and 
conversions, so a null check would be essentially free. (The JVM is incredibly 
good at optimizing out null checks.)
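
As a rough illustration of how cheap that check is, here is a hypothetical 
per-row verification in plain Python (not Spark's actual conversion path); it is 
one comparison per field on top of work that is already being done:

{code}
from pyspark.sql.types import StructType, StructField, TimestampType

schema = StructType([StructField("a", TimestampType(), False)])

def verify_nullability(row, schema):
    # Hypothetical check: while walking the row for type conversion anyway,
    # rejecting a null in a non-nullable field is one extra comparison.
    for value, field in zip(row, schema.fields):
        if value is None and not field.nullable:
            raise ValueError("null value in non-nullable field %r" % field.name)
    return row

# With such a check in the conversion path, the example from this issue
# would fail loudly instead of silently producing Row(a=None).
[verify_nullability(r, schema) for r in [(None,)]]  # raises ValueError
{code}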

> PySpark silently Accepts null values in non-nullable DataFrame fields.
> ----------------------------------------------------------------------
>
>                 Key: SPARK-11319
>                 URL: https://issues.apache.org/jira/browse/SPARK-11319
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>            Reporter: Kevin Cox
>
> Running the following code with a null value in a non-nullable column 
> silently works. This makes the code incredibly hard to trust.
> {code}
> In [2]: from pyspark.sql.types import *
> In [3]: sqlContext.createDataFrame([(None,)], StructType([StructField("a", TimestampType(), False)])).collect()
> Out[3]: [Row(a=None)]
> {code}


