[jira] [Created] (SPARK-24264) [Structured Streaming] Remove 'mergeSchema' option from Parquet source configuration

Gerard Maas (JIRA) Sun, 13 May 2018 14:01:00 -0700

Gerard Maas created SPARK-24264:
-----------------------------------

             Summary: [Structured Streaming] Remove 'mergeSchema' option from 
Parquet source configuration
                 Key: SPARK-24264
                 URL: https://issues.apache.org/jira/browse/SPARK-24264
             Project: Spark
          Issue Type: Improvement
          Components: Structured Streaming
    Affects Versions: 2.3.0
            Reporter: Gerard Maas



While looking into the Parquet format support for the File source in Structured 
Streaming, I noticed that the docs mention the option 'mergeSchema' to merge the 
schemas of the part files found.[1]
 
There seems to be no practical use of that configuration in a streaming context.
 
In its batch counterpart, `mergeSchema` would infer the schema superset of the 
part-files found.
 
When using the File source with the Parquet format in streaming mode, we must 
provide a schema through the readStream.schema(...) builder, and that schema is 
fixed for the duration of the stream.
 
My current understanding is that:
 
- Files containing a subset of the fields declared in the schema will render 
null values for the non-existing fields.
- For files containing a superset of the fields, the additional data fields 
will be lost. 
- Files not matching the schema set on the streaming source will render all 
fields null for each record in the file.
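
The three cases above can be illustrated with a small model. This is a minimal 
sketch in plain Python, not Spark code: it assumes that reading against a fixed 
schema behaves like a by-name projection of each record, which matches my 
understanding described above but has not been confirmed. The names 
STREAM_SCHEMA and project_onto_schema, and the sample records, are hypothetical.

```python
from typing import Any, Dict, List, Optional

# Hypothetical fixed schema, standing in for the one given to readStream.schema(...).
STREAM_SCHEMA: List[str] = ["id", "name", "score"]


def project_onto_schema(
    record: Dict[str, Any], schema: List[str]
) -> Dict[str, Optional[Any]]:
    """Keep only the declared fields; fields absent from the record become None."""
    return {field: record.get(field) for field in schema}


# Assumed sample records, one per case in the list above.
subset_record = {"id": 1, "name": "a"}                               # missing 'score'
superset_record = {"id": 2, "name": "b", "score": 0.5, "extra": "x"}  # extra field
unrelated_record = {"foo": 1, "bar": 2}                              # no overlap

for rec in (subset_record, superset_record, unrelated_record):
    print(project_onto_schema(rec, STREAM_SCHEMA))
```

In this model the result depends only on the declared schema and the fields of 
each individual record. No merged schema across files is ever consulted, which 
is the reason 'mergeSchema' would have nothing to contribute.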
 
It looks like 'mergeSchema' has no practical effect, although enabling it might 
lead to additional processing to actually merge the Parquet schema of the input 
files. 
 
I inquired on the dev+user mailing lists about any other behavior but I got no 
responses.
 
From the user's perspective, this option may appear to help a job cope with 
schema evolution at runtime, but that is not the case.
 
Looks like removing this option and leaving the value always set to false is 
the reasonable thing to do.  
 
[1] 
[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala#L376]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

