[
https://issues.apache.org/jira/browse/SPARK-11319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15014678#comment-15014678
]
Michael Armbrust commented on SPARK-11319:
------------------------------------------
bq. It doesn't sound much like a hint to me. In my mind hints don't allow an
optimizer to produce incorrect behaviour. This is more of a contract that the
optimizer could rely on.
Fair, this is *a contract with the optimizer that you will not produce null
data*. If you are not certain that you will never produce null values, then
set nullable to true.
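
One way to honour that contract is to decide the nullable flag from the data
rather than from memory. The following is a minimal sketch in plain Python,
assuming the rows are small enough to scan up front; nullable_flags is a
hypothetical helper, not part of PySpark. Its result could be passed as the
third argument of StructField.

```python
from typing import Any, Dict, Sequence


def nullable_flags(rows: Sequence[Sequence[Any]],
                   column_names: Sequence[str]) -> Dict[str, bool]:
    """Map each column name to True if any row holds None in that column."""
    flags = {name: False for name in column_names}
    for row in rows:
        if len(row) != len(column_names):
            raise ValueError("row width does not match the column list")
        for name, value in zip(column_names, row):
            if value is None:
                # A single null is enough to break a non-nullable promise.
                flags[name] = True
    return flags


rows = [(1, "a"), (2, None)]
flags = nullable_flags(rows, ["id", "label"])
print(flags)  # {'id': False, 'label': True}
```

A scan like this only vouches for the rows it has seen. For a source that can
change later, declaring the field nullable remains the safe choice.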
bq. Firstly it sounds like if a datasource has corrupt lines someone should
know about it. That could be thousands of dollars down the drain if you lose
the wrong lines. I find the current behaviour of turning corrupt lines into all
nulls completely unacceptable.
*Many* users asked for this feature. The inability to read dirty data is
unacceptable. If you want to be more strict than the average user then check
the value of {{_corrupt_record}}.
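
The strict check can be sketched as follows. This is a plain-Python imitation
of the permissive behaviour described above, not Spark's implementation: a
line that fails to parse becomes a record whose fields are all null, with the
raw text kept in the corrupt-record column (named _corrupt_record by default
in Spark's JSON reader). parse_permissive and corrupt_lines are hypothetical
names used only for illustration.

```python
import json

CORRUPT_COLUMN = "_corrupt_record"


def parse_permissive(lines, fields):
    """Parse JSON lines; unparseable lines become all-null records."""
    records = []
    for line in lines:
        record = {name: None for name in fields}
        record[CORRUPT_COLUMN] = None
        try:
            parsed = json.loads(line)
            if not isinstance(parsed, dict):
                raise ValueError("expected a JSON object")
            for name in fields:
                record[name] = parsed.get(name)
        except ValueError:
            # json.JSONDecodeError is a subclass of ValueError.
            record[CORRUPT_COLUMN] = line
        records.append(record)
    return records


def corrupt_lines(records):
    """Return the raw text of every line that failed to parse."""
    return [r[CORRUPT_COLUMN] for r in records
            if r[CORRUPT_COLUMN] is not None]


lines = ['{"a": 1}', '{"a": 2', '{"a": 3}']
records = parse_permissive(lines, ["a"])
bad = corrupt_lines(records)
if bad:
    print("found %d corrupt line(s): %r" % (len(bad), bad))
```

A stricter job could raise an error instead of printing when the list is not
empty. Against a real DataFrame the equivalent check would be a filter on the
corrupt-record column being non-null.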
bq. This definitely has to be done, one way or the other.
Pull requests welcome.
bq. The JVM is incredibly good at optimizing out null checks.
These generally are not JVM null checks. They are lookups into a bitset once
we are in the execution engine. If I could go back in time I would probably
not expose this to Python users at all. Probably the best solution is to
ignore you and set nullable to true no matter what you say.
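
To illustrate why a wrong nullable=False declaration can change results, here
is a minimal sketch of a row that tracks nulls in a bitset. It is an
assumption-laden toy, not Spark's actual row format: RowWithNullBits and its
methods are hypothetical, and the placeholder value 0 stands in for whatever
happens to be stored in a null slot.

```python
class RowWithNullBits:
    """Toy row: values in slots, nulls recorded as bits in an integer."""

    PLACEHOLDER = 0  # what the slot holds when the logical value is null

    def __init__(self, values):
        self._null_bits = 0
        self._slots = []
        for ordinal, value in enumerate(values):
            if value is None:
                self._null_bits |= 1 << ordinal
                self._slots.append(self.PLACEHOLDER)
            else:
                self._slots.append(value)

    def is_null_at(self, ordinal):
        """Look up one bit; this is the check a non-nullable field skips."""
        return ((self._null_bits >> ordinal) & 1) == 1

    def get(self, ordinal, nullable=True):
        """Read a field, consulting the bitset only for nullable fields."""
        if nullable and self.is_null_at(ordinal):
            return None
        return self._slots[ordinal]


row = RowWithNullBits([None, 7])
print(row.get(0))                  # None
print(row.get(0, nullable=False))  # 0
print(row.get(1))                  # 7
```

When the field is declared non-nullable the bitset lookup is skipped, so the
null silently reads back as the placeholder.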
> PySpark silently Accepts null values in non-nullable DataFrame fields.
> ----------------------------------------------------------------------
>
> Key: SPARK-11319
> URL: https://issues.apache.org/jira/browse/SPARK-11319
> Project: Spark
> Issue Type: Bug
> Components: PySpark, SQL
> Reporter: Kevin Cox
>
> Running the following code with a null value in a non-nullable column
> silently works. This makes the code incredibly hard to trust.
> {code}
> In [2]: from pyspark.sql.types import *
> In [3]: sqlContext.createDataFrame([(None,)], StructType([StructField("a",
> TimestampType(), False)])).collect()
> Out[3]: [Row(a=None)]
> {code}