[
https://issues.apache.org/jira/browse/SPARK-28651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Shixiong Zhu updated SPARK-28651:
---------------------------------
Description:
Currently, a batch DataFrame created from a file source always converts its
schema to nullable automatically (see this line:
https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L399).
However, a streaming DataFrame's schema is read in this line
https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L259
which does not convert the schema to nullable.
We should make streaming DataFrames consistent with batch DataFrames.
This issue was reported by Tomasz Magdanski, who found that the resulting
schema mismatch caused corrupted Parquet files.
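
To illustrate the inconsistency, here is a minimal, self-contained sketch in
plain Python. It does not use Spark: the Field, Struct and Array classes and
the as_nullable helper are hypothetical stand-ins for Spark's schema types,
written only to show what "changing the schema to nullable" means and how the
batch and streaming paths can end up with different schemas for the same
user-supplied schema.

```python
from dataclasses import dataclass
from typing import Tuple, Union


# Hypothetical stand-ins for schema types; these are NOT Spark classes.
@dataclass(frozen=True)
class Field:
    name: str
    dtype: Union[str, "Struct", "Array"]
    nullable: bool


@dataclass(frozen=True)
class Struct:
    fields: Tuple[Field, ...]


@dataclass(frozen=True)
class Array:
    element: Union[str, "Struct", "Array"]
    contains_null: bool


def as_nullable(dtype):
    """Return a copy of dtype with every nested field and element nullable."""
    if isinstance(dtype, Struct):
        return Struct(tuple(
            Field(f.name, as_nullable(f.dtype), True) for f in dtype.fields
        ))
    if isinstance(dtype, Array):
        return Array(as_nullable(dtype.element), True)
    return dtype  # primitive type names carry no nullability of their own


# A user-specified schema that declares non-nullable columns.
user_schema = Struct((
    Field("id", "long", nullable=False),
    Field("tags", Array("string", contains_null=False), nullable=False),
))

# Batch path (as described above): the schema is relaxed to nullable.
batch_schema = as_nullable(user_schema)

# Streaming path (as described above): the schema is used unchanged.
streaming_schema = user_schema

print(batch_schema == streaming_schema)
```

Under these assumptions the two paths disagree on nullability for the same
input, which is the kind of mismatch described above. Applying the same
relaxation on the streaming path would make the two schemas equal.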
> Streaming file source doesn't change the schema to nullable automatically
> -------------------------------------------------------------------------
>
> Key: SPARK-28651
> URL: https://issues.apache.org/jira/browse/SPARK-28651
> Project: Spark
> Issue Type: Bug
> Components: Structured Streaming
> Affects Versions: 2.4.3
> Reporter: Shixiong Zhu
> Priority: Major