Hi, Regarding the read side of nullability, adding a data cleaning step appears to be under consideration, as Xiao said at https://www.mail-archive.com/user@spark.apache.org/msg39233.html.
Here is a PR, https://github.com/apache/spark/pull/17293, that adds the data cleaning step; it throws an exception if a null exists in a non-nullable column. Any comments are appreciated.

Kazuaki Ishizaki

From: Jason White <jason.wh...@shopify.com>
To: dev@spark.apache.org
Date: 2017/03/21 06:31
Subject: Why are DataFrames always read with nullable=True?

If I create a dataframe in Spark with non-nullable columns, and then save that to disk as a Parquet file, the columns are properly marked as non-nullable. I confirmed this using parquet-tools. Then, when loading it back, Spark forces the nullable back to True.

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L378

If I remove the `.asNullable` part, Spark performs exactly as I'd like by default, picking up the data using the schema either in the Parquet file or provided by me.

This particular LoC goes back a year now, and I've seen a variety of discussions about this issue. In particular with Michael here: https://www.mail-archive.com/user@spark.apache.org/msg39230.html. Those seemed to be discussing writing, not reading, though, and writing is already supported now.

Is this functionality still desirable? Is it potentially not applicable for all file formats and situations (e.g. HDFS/Parquet)? Would it be suitable to pass an option to the DataFrameReader to disable this functionality?

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Why-are-DataFrames-always-read-with-nullable-True-tp21207.html
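
For illustration only, here is a minimal sketch in plain Python of the two behaviours discussed in this thread: forcing every column to nullable on read, and a data cleaning step that throws an exception when a null appears in a non-nullable column. This is not Spark code and not the implementation in the PR; `Field`, `as_nullable`, and `enforce_non_null` are hypothetical names made up for this sketch, and rows are assumed to be plain tuples.

```python
from dataclasses import dataclass
from typing import Any, Iterable, Iterator, List, Sequence


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for a schema field: a column name and its nullability."""
    name: str
    nullable: bool


def as_nullable(schema: Sequence[Field]) -> List[Field]:
    """Return a copy of the schema with every column marked nullable.

    This mimics the effect described in the thread, where the schema read
    back from disk has nullable forced to True.
    """
    return [Field(f.name, True) for f in schema]


def enforce_non_null(rows: Iterable[Sequence[Any]],
                     schema: Sequence[Field]) -> Iterator[Sequence[Any]]:
    """Yield rows unchanged; raise ValueError on a None in a non-nullable column.

    The check is lazy: it runs as the rows are consumed, not up front.
    """
    # Positions and names of the columns that must not contain None.
    required = [(i, f.name) for i, f in enumerate(schema) if not f.nullable]
    for row_number, row in enumerate(rows):
        for i, name in required:
            if row[i] is None:
                raise ValueError(
                    f"null found in non-nullable column '{name}' at row {row_number}"
                )
        yield row
```

With a schema such as `[Field("id", False), Field("note", True)]`, consuming `enforce_non_null(rows, schema)` raises as soon as a row has `None` in `id`, while the same rows pass untouched under `as_nullable(schema)`. That difference is the trade-off in question: keeping the declared nullability on read is only safe if something verifies that the data actually honours it.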