dongjoon-hyun commented on a change in pull request #25454: [MINOR][DOCS] Remove Note on Parquet Nullability Behaviour
URL: https://github.com/apache/spark/pull/25454#discussion_r313957648
##########
File path: docs/sql-data-sources-parquet.md
##########

```
@@ -24,8 +24,7 @@ license: |
 [Parquet](http://parquet.io) is a columnar format that is supported by many other data processing systems.
 Spark SQL provides support for both reading and writing Parquet files that automatically preserves the schema
-of the original data. When reading Parquet files, all columns are automatically converted to be nullable for
-compatibility reasons.
```

Review comment:
   Hi, @sujithjay. Thank you for making a PR, but the existing statement is correct. You have confused reading with writing: when writing, the schema (including nullability) is preserved in the Parquet file; when reading, every column becomes nullable.

```
$ parquet-tools meta /tmp/nullability.parquet/part-00000-f8024888-e746-41ac-ac9c-dc1cab5dfb80-c000.snappy.parquet
file:        file:/tmp/nullability.parquet/part-00000-f8024888-e746-41ac-ac9c-dc1cab5dfb80-c000.snappy.parquet
creator:     parquet-mr version 1.10.1 (build a89df8f9932b6ef6633d06069e50c9b7970bebd1)
extra:       org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}}]}

file schema: spark_schema
--------------------------------------------------------------------------------
id:          REQUIRED INT64 R:0 D:0

scala> spark.read.parquet("/tmp/nullability.parquet").printSchema
root
 |-- id: long (nullable = true)
```
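   For reference, a file like the one above can be produced with something like the following minimal spark-shell sketch; the `spark.range` session and the output path are assumptions for illustration, not taken from the original thread:

```
// Sketch (assumed, not from the original thread): produce a Parquet file
// whose `id` column is written as non-nullable.
scala> val df = spark.range(5)   // Dataset[java.lang.Long]; column `id` is non-nullable

scala> df.printSchema
root
 |-- id: long (nullable = false)

// On write, the non-nullable schema is preserved in the Parquet footer
// (REQUIRED INT64 / "nullable":false in the Spark metadata above).
scala> df.write.parquet("/tmp/nullability.parquet")

// On read back, Spark marks the column nullable for compatibility.
scala> spark.read.parquet("/tmp/nullability.parquet").printSchema
root
 |-- id: long (nullable = true)
```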
