dongjoon-hyun commented on a change in pull request #25454: [MINOR][DOCS] Remove Note on Parquet Nullability Behaviour
URL: https://github.com/apache/spark/pull/25454#discussion_r313957648
##########
File path: docs/sql-data-sources-parquet.md
##########

```
@@ -24,8 +24,7 @@ license: |
 [Parquet](http://parquet.io) is a columnar format that is supported by many other data processing systems.
 Spark SQL provides support for both reading and writing Parquet files that automatically preserves the schema
-of the original data. When reading Parquet files, all columns are automatically converted to be nullable for
-compatibility reasons.
```

Review comment:
   Hi, @sujithjay. Thank you for making a PR, but the existing statement is correct. You have confused reading with writing: when writing, the schema (including nullability) is preserved in the Parquet file; when reading, every column becomes nullable.

```
$ parquet-tools meta /tmp/nullability.parquet/part-00000-f8024888-e746-41ac-ac9c-dc1cab5dfb80-c000.snappy.parquet
file:        file:/tmp/nullability.parquet/part-00000-f8024888-e746-41ac-ac9c-dc1cab5dfb80-c000.snappy.parquet
creator:     parquet-mr version 1.10.1 (build a89df8f9932b6ef6633d06069e50c9b7970bebd1)
extra:       org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}}]}

file schema: spark_schema
--------------------------------------------------------------------------------
id:          REQUIRED INT64 R:0 D:0

scala> spark.read.parquet("/tmp/nullability.parquet").printSchema
root
 |-- id: long (nullable = true)
```
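   For reference, a file like the one above can be produced with something like the following minimal spark-shell sketch; the `spark.range` session and the output path are assumptions for illustration, not taken from the original thread:

```
// Sketch (assumed, not from the original thread): produce a Parquet file
// whose `id` column is written as non-nullable.
scala> val df = spark.range(5)   // Dataset[java.lang.Long]; column `id` is non-nullable

scala> df.printSchema
root
 |-- id: long (nullable = false)

// On write, the non-nullable schema is preserved in the Parquet footer
// (REQUIRED INT64 / "nullable":false in the Spark metadata above).
scala> df.write.parquet("/tmp/nullability.parquet")

// On read back, Spark marks the column nullable for compatibility.
scala> spark.read.parquet("/tmp/nullability.parquet").printSchema
root
 |-- id: long (nullable = true)
```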
