Hi,
Regarding the read path for nullability, adding a data cleaning step seems to 
be under consideration, as Xiao said at 
https://www.mail-archive.com/user@spark.apache.org/msg39233.html.

Here is a PR, https://github.com/apache/spark/pull/17293, that adds a data 
cleaning step which throws an exception if a null exists in a non-nullable column.
Any comments are appreciated.
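
For illustration only, here is a rough user-side sketch of such a check (this 
is not the PR's actual implementation; the helper name is made up and it 
assumes an existing DataFrame):

  import org.apache.spark.sql.DataFrame
  import org.apache.spark.sql.functions.col

  // Hypothetical helper: fail fast if a column that is declared
  // non-nullable in the schema actually contains nulls.
  def assertNoNullsInNonNullableColumns(df: DataFrame): Unit = {
    df.schema.fields.filterNot(_.nullable).foreach { field =>
      val nullCount = df.filter(col(field.name).isNull).count()
      if (nullCount > 0) {
        throw new RuntimeException(
          s"Column '${field.name}' is declared non-nullable but contains " +
          s"$nullCount null value(s)")
      }
    }
  }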

Kazuaki Ishizaki



From:   Jason White <jason.wh...@shopify.com>
To:     dev@spark.apache.org
Date:   2017/03/21 06:31
Subject:        Why are DataFrames always read with nullable=True?



If I create a dataframe in Spark with non-nullable columns, and then save
that to disk as a Parquet file, the columns are properly marked as
non-nullable. I confirmed this using parquet-tools. Then, when loading it
back, Spark forces the nullable back to True.
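
For anyone who wants to reproduce this, a minimal sketch (assumes a 
SparkSession named `spark`; the schema and path are made up):

  import org.apache.spark.sql.Row
  import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

  val schema =
    StructType(Seq(StructField("id", IntegerType, nullable = false)))
  val df = spark.createDataFrame(
    spark.sparkContext.parallelize(Seq(Row(1), Row(2))), schema)

  df.schema("id").nullable       // false, as declared
  df.write.parquet("/tmp/ids")   // parquet-tools shows the column as required

  val loaded = spark.read.parquet("/tmp/ids")
  loaded.schema("id").nullable   // true -- forced back to nullable on read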

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L378


If I remove the `.asNullable` part, Spark behaves exactly as I'd like by
default, picking up the data using the schema either from the Parquet file or
provided by me.
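
For reference, "provided by me" means passing an explicit schema to the 
reader, roughly like this (reusing the path from the sketch above); as the 
linked line and the behavior above suggest, the non-nullable flag is 
currently still overridden on read:

  // Supplying an explicit schema to the DataFrameReader; today the
  // nullable = false flag is still flipped to true when reading.
  val explicitSchema =
    StructType(Seq(StructField("id", IntegerType, nullable = false)))
  val byMySchema = spark.read.schema(explicitSchema).parquet("/tmp/ids")
  byMySchema.schema("id").nullable   // still true today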

This particular LoC goes back a year now, and I've seen a variety of
discussions about this issue, in particular with Michael here:
https://www.mail-archive.com/user@spark.apache.org/msg39230.html. Those
discussions seemed to be about writing, not reading, though, and writing is
already supported now.

Is this functionality still desirable? Is it potentially not applicable for
all file formats and situations (e.g. HDFS/Parquet)? Would it be suitable to
pass an option to the DataFrameReader to disable this functionality?



