I've opened a JIRA for the issue, as you requested earlier: https://issues.apache.org/jira/browse/SPARK-17939
On Fri, Oct 14, 2016 at 9:24 AM Aleksander Eskilson <aleksander...@gmail.com> wrote:

> Interesting. I'm quite glad to read your explanation; it makes some of our
> work quite a bit clearer. I'll open a ticket in a similar vein to this
> discussion, https://github.com/apache/spark/pull/11785, contrasting
> nullability implemented as an optimization versus as an enforcement
> mechanism.
>
> Additionally, shall I go ahead and open a ticket pointing out the missing
> call to .asNullable in the streaming reader?
>
> Thanks,
> Alek
>
> On Thu, Oct 13, 2016 at 4:05 PM Michael Armbrust <mich...@databricks.com> wrote:
>
>> There is a lot of confusion around nullable in StructType (see
>> https://issues.apache.org/jira/browse/SPARK-15192,
>> https://issues.apache.org/jira/browse/SPARK-11319,
>> https://issues.apache.org/jira/browse/SPARK-11868,
>> https://issues.apache.org/jira/browse/SPARK-13740,
>> https://github.com/apache/spark/pull/15329,
>> https://github.com/apache/spark/pull/14124,
>> https://github.com/apache/spark/pull/13873, and
>> https://github.com/apache/spark/pull/11785), and we should definitely
>> come up with a consistent story and make sure we have better
>> documentation. This might even mean deprecating this field. At a high
>> level, I think the key problem is that internally, nullable is used as an
>> optimization, not an enforcement mechanism. This is very different from
>> NOT NULL in a traditional database. Specifically, we take "nullable =
>> false" as a promise that the column will never be null and use that fact
>> to do optimizations like skipping null checks. This means that if you lie
>> to us, you can actually get wrong answers (i.e. 0 instead of null). This
>> is clearly confusing and not ideal.
>>
>> A little bit of explanation for some of the specific cases you brought
>> up: the reason that we call asNullable on file sources is that even if a
>> column is never null in the data, there are cases that can still produce
>> a null result. For example, when parsing JSON, we null out all columns
>> other than _corrupt_record when we fail to parse a line. The fact that
>> it's different in streaming is a bug.
>>
>> Would you mind opening up a JIRA ticket, and we can discuss the right
>> path forward there?
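To make the failure mode you describe concrete for anyone following along,
here is a rough, untested sketch of what "lying" about nullability might look
like; the session setup, column names, and values are invented purely for
illustration:

    import org.apache.spark.sql.{Row, SparkSession}
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("nullable-sketch")
      .getOrCreate()

    // The data actually contains a null, but the schema promises it never will.
    val rows = spark.sparkContext.parallelize(Seq(Row("a", 1L), Row("b", null)))
    val lyingSchema = StructType(Seq(
      StructField("name", StringType, nullable = false),
      StructField("age", LongType, nullable = false)))

    // Spark trusts the supplied schema rather than validating it, so generated
    // code may skip null checks on "age" and surface a default value such as 0
    // where a null (or an error) might otherwise be expected.
    val df = spark.createDataFrame(rows, lyingSchema)
    df.select(df("age")).show()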
>> On Thu, Oct 13, 2016 at 12:35 PM, Aleksander Eskilson <aleksander...@gmail.com> wrote:
>>
>>> Hi there,
>>>
>>> Working in the space of custom Encoders/ExpressionEncoders, I've noticed
>>> that the StructType schema set when creating an ExpressionEncoder[T] [1]
>>> is not the schema actually used to set the column types of a Dataset
>>> created via the .as(encoder) method [2] on read data. Instead, the
>>> schema is either inferred through analysis of the data, or provided
>>> using the .schema(structType) method [3] of the DataFrameReader.
>>> However, when using the .schema(..) method of the DataFrameReader,
>>> potentially undesirable behaviour occurs: while the DataSource is being
>>> resolved, all fields of the StructType schema have their nullability set
>>> to *true* (via the asNullable method of StructType) [4] when the data is
>>> read from a non-streaming file-based source, as opposed to a streaming
>>> source.
>>>
>>> Of course, allowing null values where they shouldn't exist can weaken
>>> the type guarantees for Datasets over certain kinds of encoded data.
>>>
>>> Thinking about how this might be resolved: first, if it's a legitimate
>>> bug, I'm not sure why "non-streaming file-based" datasources need to
>>> have their StructFields all rendered nullable. Simply removing the call
>>> to asNullable would fix the issue. Second, if it's actually necessary
>>> for most filesystem-read data sources to have their StructFields
>>> potentially nullable in this manner, we could instead let the StructType
>>> schema provided to the Encoder have the final say in the Dataset's
>>> schema.
>>>
>>> This latter option seems sensible to me: if a client is willing to
>>> provide a custom Encoder via the .as(..) method on the reader,
>>> presumably in setting the schema field of the encoder they have some
>>> legitimate notion of how their object's types should be mapped to
>>> Dataset column types. Any failure when resolving their data to a Dataset
>>> by means of their Encoder can then be traced to their Encoder for their
>>> own debugging.
>>>
>>> Thoughts? Thanks,
>>> Alek Eskilson
>>>
>>> [1] - https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala#L213
>>> [2] - https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L374
>>> [3] - https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L62
>>> [4] - https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L426
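To spell out the round trip that [3] and [4] describe, here is a rough,
untested sketch; the case class, file path, and data are invented for
illustration. The encoder reports age as non-nullable (it is a primitive
Long), but after the read every field comes back nullable:

    import org.apache.spark.sql.{Encoders, SparkSession}

    case class Person(name: String, age: Long)

    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("schema-sketch")
      .getOrCreate()

    val encoder = Encoders.product[Person]
    // Schema as declared by the encoder: age (a primitive Long) is nullable = false.
    encoder.schema.printTreeString()

    // After resolution against a non-streaming file-based source, the supplied
    // schema has asNullable applied, so the resulting Dataset reports every
    // field as nullable = true.
    val ds = spark.read.schema(encoder.schema).json("/tmp/people.json").as(encoder)
    ds.schema.printTreeString()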