[
https://issues.apache.org/jira/browse/SPARK-24264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Gerard Maas updated SPARK-24264:
--------------------------------
Issue Type: Bug (was: Improvement)
> [Structured Streaming] Remove 'mergeSchema' option from Parquet source
> configuration
> ------------------------------------------------------------------------------------
>
> Key: SPARK-24264
> URL: https://issues.apache.org/jira/browse/SPARK-24264
> Project: Spark
> Issue Type: Bug
> Components: Structured Streaming
> Affects Versions: 2.3.0
> Reporter: Gerard Maas
> Priority: Major
> Labels: features, usability
>
> While looking into the Parquet format support for the File source in
> Structured Streaming, I found that the docs mention the option 'mergeSchema'
> to merge the schemas of the part files found.[1]
>
> There seems to be no practical use of that configuration in a streaming
> context.
>
> In its batch counterpart, `mergeSchema` infers the superset of the schemas
> of the part files found.
>
> When using the File source + parquet format in streaming mode, we must
> provide a schema to the readStream.schema(...) builder and that schema is
> fixed for the duration of the stream.
>
> My current understanding is that:
>
> - Files containing a subset of the fields declared in the schema will yield
> null values for the missing fields.
> - For files containing a superset of the fields, the additional data fields
> will be lost.
> - Files not matching the schema set on the streaming source will yield null
> in every field of each record in the file.
>
> It looks like 'mergeSchema' has no practical effect, although enabling it
> might lead to additional processing to actually merge the Parquet schema of
> the input files.
>
> I asked on the dev and user mailing lists whether there is any other
> behavior, but I got no responses.
>
> Users may assume that this option helps their job cope with schema
> evolution at runtime, but that is not the case.
>
> Removing this option and always leaving the value set to false looks like
> the reasonable thing to do.
>
> [1]
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala#L376]
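
The three cases listed in the description can be illustrated with a toy model. This is a minimal sketch in plain Python, not Spark code: it only mimics projecting each record onto a fixed schema, which is how the reporter describes the streaming File source behaving. The names `STREAM_SCHEMA` and `project_onto_schema` and the sample records are hypothetical, and the sketch reflects the reporter's stated understanding rather than verified Spark behavior.

```python
from typing import Any, Dict, List, Optional

# Hypothetical fixed schema, standing in for the one passed to readStream.schema(...).
STREAM_SCHEMA: List[str] = ["id", "name", "score"]


def project_onto_schema(record: Dict[str, Any], schema: List[str]) -> Dict[str, Optional[Any]]:
    """Keep only the declared fields; fields absent from the record become None."""
    return {field: record.get(field) for field in schema}


# Record from a file with a subset of the declared fields:
# the missing field ("score") comes back as None.
subset_row = project_onto_schema({"id": 1, "name": "a"}, STREAM_SCHEMA)

# Record from a file with a superset of the declared fields:
# the additional field ("extra") is dropped.
superset_row = project_onto_schema(
    {"id": 2, "name": "b", "score": 0.5, "extra": "x"}, STREAM_SCHEMA
)

# Record from a file sharing no fields with the schema:
# every declared field is None.
mismatch_row = project_onto_schema({"foo": 1, "bar": 2}, STREAM_SCHEMA)

print(subset_row)
print(superset_row)
print(mismatch_row)
```

In this model the output depends only on the fixed schema, never on the union of the input schemas, which is why merging the schemas of the part files would not change the result.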