[
https://issues.apache.org/jira/browse/SPARK-11319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15014535#comment-15014535
]
Michael Armbrust commented on SPARK-11319:
------------------------------------------
Okay, the tone here is quickly veering away from productive. There are clearly
some tradeoffs at play here or there would not be a debate.
First of all, let's talk about nullability in the context of Spark SQL. The
nullable flag is a *hint* to the optimizer: marking a column non-nullable tells
Spark that it *can skip null checks* for that column. When in doubt, you should
always set it to true, since then you are asking Spark to perform null checks.
All interfaces should default to true when it is unspecified. If there are
places where this is not the case, or where this is not clearly documented, we
should fix them.
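For example (a minimal sketch; the field name is made up), the flag is just the
third argument to StructField:
{code}
from pyspark.sql.types import StructType, StructField, StringType

# nullable=True (the safe default): Spark keeps checking this column for nulls.
safe_schema = StructType([StructField("name", StringType(), True)])

# nullable=False: a promise that nulls never appear, which lets the optimizer
# skip null checks for this column -- only claim this if you are certain.
strict_schema = StructType([StructField("name", StringType(), False)])
{code}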
There is a tension between performance and checking everything so that we can
provide better errors. In many cases we've had similarly effusive requests from
users to change behavior in the opposite direction, where we were being overly
cautious and it was hurting performance (i.e., "just trust us!").
bq. This means anyone reading a JSON or CSV file needs to do their own
validation pass over the data before passing it to SparkSQL.
I do not believe that this is true. Both data sources know that it's possible
to have corrupt lines, and thus the schema information they produce says that
all columns are nullable.
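For instance (a hedged sketch; the path is hypothetical and the same sqlContext
as in the report below is assumed), the schema inferred by the JSON reader
marks every column nullable:
{code}
# The JSON reader infers a schema in which every column is nullable,
# precisely because corrupt or missing values are always possible.
df = sqlContext.read.json("/path/to/events.json")  # hypothetical path
df.printSchema()  # every field prints as "... (nullable = true)"
{code}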
If you are doing the parsing or reading yourself, don't tell Spark SQL to skip
the null checks by saying the column is not nullable.
bq. Again, every database I have ever encountered validates schema on the
INSERT or COPY statement.
Every database you have ever used was built on the assumption that you will do
an expensive ETL and then query that data many times. Spark SQL is trying to
optimize for the case where people are querying data in-situ, so I don't think
a direct comparison here is really fair. A traditional RDBMS can enforce
integrity constraints because it controls the data it is querying. We can't do
this, because someone can always drop another file that violates those
constraints into HDFS or S3, etc. Thus, in all cases where we know we don't
have control, we set nullable = true internally. In the advanced interfaces
where we allow users to specify a schema manually, we expect them to do the
same.
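Concretely (a sketch; the path and column names are invented), a manually
supplied schema for data you do not control should keep every column nullable:
{code}
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Someone can always drop a file into this directory that violates your
# expectations, so declare the columns nullable when supplying the schema.
schema = StructType([
    StructField("user", StringType(), True),
    StructField("ts", TimestampType(), True),
])
df = sqlContext.read.json("/data/events/", schema=schema)  # hypothetical path
{code}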
bq. Also, assuming we eventually have non-nullable UDFs, the behaviour of nulls
coming out in the column upon UDF exceptions will also violate these false
assumptions Spark is making.
Whenever we encounter an expression that could return null, we change the
nullability of the result, even if the input was non-nullable.
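One concrete case (a hedged sketch; the UDF and names are invented) is a Python
UDF, whose result column is marked nullable even when its input column is not,
since the function could always return None:
{code}
from pyspark.sql.functions import udf
from pyspark.sql.types import StructType, StructField, IntegerType

# Input column declared non-nullable.
df = sqlContext.createDataFrame(
    [(1,), (2,)], StructType([StructField("x", IntegerType(), False)]))

# The UDF might return None, so Spark marks the result column nullable.
plus_one = udf(lambda v: v + 1, IntegerType())
result = df.select(plus_one(df["x"]).alias("y"))
print(result.schema.fields[0].nullable)  # expected: True
{code}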
Given the above, I'm thinking the best option is to add better documentation
about the semantics here, but I'm open to other concrete suggestions.
> PySpark silently accepts null values in non-nullable DataFrame fields.
> ----------------------------------------------------------------------
>
> Key: SPARK-11319
> URL: https://issues.apache.org/jira/browse/SPARK-11319
> Project: Spark
> Issue Type: Bug
> Components: PySpark, SQL
> Reporter: Kevin Cox
>
> Running the following code with a null value in a non-nullable column
> silently works. This makes the code incredibly hard to trust.
> {code}
> In [2]: from pyspark.sql.types import *
> In [3]: sqlContext.createDataFrame([(None,)], StructType([StructField("a", TimestampType(), False)])).collect()
> Out[3]: [Row(a=None)]
> {code}