I've opened a JIRA for the issue, as you requested earlier: https://issues.apache.org/jira/browse/SPARK-17939
On Fri, Oct 14, 2016 at 9:24 AM Aleksander Eskilson <aleksander...@gmail.com> wrote:

> Interesting. I'm quite glad to read your explanation; it makes some of our
> work quite a bit clearer. I'll open a ticket in a similar vein to this
> discussion, https://github.com/apache/spark/pull/11785, contrasting
> nullability implemented as an optimization versus as an enforcement
> mechanism.
>
> Additionally, shall I go ahead and open a ticket pointing out the missing
> call to .asNullable in the streaming reader?
>
> Thanks,
> Alek
>
> On Thu, Oct 13, 2016 at 4:05 PM Michael Armbrust <mich...@databricks.com> wrote:
>
>> There is a lot of confusion around nullable in StructType (see
>> https://issues.apache.org/jira/browse/SPARK-15192,
>> https://issues.apache.org/jira/browse/SPARK-11319,
>> https://issues.apache.org/jira/browse/SPARK-11868,
>> https://issues.apache.org/jira/browse/SPARK-13740,
>> https://github.com/apache/spark/pull/15329,
>> https://github.com/apache/spark/pull/14124,
>> https://github.com/apache/spark/pull/13873, and
>> https://github.com/apache/spark/pull/11785), and we should definitely
>> come up with a consistent story and make sure we have better
>> documentation. This might even mean deprecating this field. At a high
>> level, I think the key problem is that internally, nullable is used as an
>> optimization, not an enforcement mechanism. This is very different from
>> NOT NULL in a traditional database. Specifically, we take "nullable =
>> false" as a promise that the column will never be null and use that fact
>> to do optimizations like skipping null checks. This means that if you lie
>> to us, you can actually get wrong answers (i.e. 0 instead of null). This
>> is clearly confusing and not ideal.
>>
>> A little bit of explanation for some of the specific cases you brought
>> up: the reason that we call asNullable on file sources is that even if a
>> column is never null in the data, there are cases that can still produce
>> a null result. For example, when parsing JSON, we null out all columns
>> other than _corrupt_record when we fail to parse a line. The fact that
>> it's different in streaming is a bug.
>>
>> Would you mind opening up a JIRA ticket, and we can discuss the right
>> path forward there?
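To make the failure mode you describe concrete for anyone following along,
here is a rough, untested sketch of what "lying" about nullability might look
like; the session setup, column names, and values are invented purely for
illustration:

    import org.apache.spark.sql.{Row, SparkSession}
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("nullable-sketch")
      .getOrCreate()

    // The data actually contains a null, but the schema promises it never will.
    val rows = spark.sparkContext.parallelize(Seq(Row("a", 1L), Row("b", null)))
    val lyingSchema = StructType(Seq(
      StructField("name", StringType, nullable = false),
      StructField("age", LongType, nullable = false)))

    // Spark trusts the supplied schema rather than validating it, so generated
    // code may skip null checks on "age" and surface a default value such as 0
    // where a null (or an error) might otherwise be expected.
    val df = spark.createDataFrame(rows, lyingSchema)
    df.select(df("age")).show()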
>> On Thu, Oct 13, 2016 at 12:35 PM, Aleksander Eskilson <aleksander...@gmail.com> wrote:
>>
>>> Hi there,
>>>
>>> Working in the space of custom Encoders/ExpressionEncoders, I've noticed
>>> that the StructType schema set when creating an ExpressionEncoder[T] [1]
>>> is not the schema actually used to set the column types of a Dataset
>>> created via the .as(encoder) method [2] on read data. Instead, the
>>> schema is either inferred through analysis of the data, or provided
>>> using the .schema(structType) method [3] of the DataFrameReader.
>>> However, when using the .schema(..) method of the DataFrameReader,
>>> potentially undesirable behaviour occurs: while the DataSource is being
>>> resolved, all fields of the StructType schema have their nullability set
>>> to *true* (via the asNullable method of StructType) [4] when the data is
>>> read from a non-streaming file-based source, as opposed to a streaming
>>> source.
>>>
>>> Of course, allowing null values where they shouldn't exist can weaken
>>> the type guarantees for Datasets over certain kinds of encoded data.
>>>
>>> Thinking about how this might be resolved: first, if it's a legitimate
>>> bug, I'm not sure why "non-streaming file-based" datasources need to
>>> have their StructFields all rendered nullable. Simply removing the call
>>> to asNullable would fix the issue. Second, if it's actually necessary
>>> for most filesystem-read data sources to have their StructFields
>>> potentially nullable in this manner, we could instead let the StructType
>>> schema provided to the Encoder have the final say in the Dataset's
>>> schema.
>>>
>>> This latter option seems sensible to me: if a client is willing to
>>> provide a custom Encoder via the .as(..) method on the reader,
>>> presumably in setting the schema field of the encoder they have some
>>> legitimate notion of how their object's types should be mapped to
>>> Dataset column types. Any failure when resolving their data to a Dataset
>>> by means of their Encoder can then be traced to their Encoder for their
>>> own debugging.
>>>
>>> Thoughts? Thanks,
>>> Alek Eskilson
>>>
>>> [1] - https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala#L213
>>> [2] - https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L374
>>> [3] - https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L62
>>> [4] - https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L426
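To spell out the round trip that [3] and [4] describe, here is a rough,
untested sketch; the case class, file path, and data are invented for
illustration. The encoder reports age as non-nullable (it is a primitive
Long), but after the read every field comes back nullable:

    import org.apache.spark.sql.{Encoders, SparkSession}

    case class Person(name: String, age: Long)

    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("schema-sketch")
      .getOrCreate()

    val encoder = Encoders.product[Person]
    // Schema as declared by the encoder: age (a primitive Long) is nullable = false.
    encoder.schema.printTreeString()

    // After resolution against a non-streaming file-based source, the supplied
    // schema has asNullable applied, so the resulting Dataset reports every
    // field as nullable = true.
    val ds = spark.read.schema(encoder.schema).json("/tmp/people.json").as(encoder)
    ds.schema.printTreeString()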