[jira] [Updated] (SPARK-28651) Streaming file source doesn't change the schema to nullable automatically

Shixiong Zhu (JIRA) Wed, 07 Aug 2019 14:37:23 -0700


[ https://issues.apache.org/jira/browse/SPARK-28651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Shixiong Zhu updated SPARK-28651:
---------------------------------
    Description: 
Right now, a batch DataFrame always changes its schema to nullable automatically 
(see this line: 
https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L399).

However, a streaming DataFrame's schema is read at this line 
https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L259
 which does not change the schema to nullable automatically.

We should make the streaming DataFrame consistent with batch.

This issue was reported by Tomasz Magdanski. He found that the schema 
mismatch caused corrupted Parquet files.

  was:
Right now, batch DataFrame always changes the schema to nullable automatically 
(See this line: 
https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L399).

However, streaming DataFrame's schema is read in this line 
https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L259
 which doesn't change the schema to nullable automatically.

We should make streaming DataFrame consistent with batch.


> Streaming file source doesn't change the schema to nullable automatically
> -------------------------------------------------------------------------
>
>                 Key: SPARK-28651
>                 URL: https://issues.apache.org/jira/browse/SPARK-28651
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 2.4.3
>            Reporter: Shixiong Zhu
>            Priority: Major
>
> Right now, a batch DataFrame always changes its schema to nullable 
> automatically (see this line: 
> https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L399).
> However, a streaming DataFrame's schema is read at this line 
> https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L259
>  which does not change the schema to nullable automatically.
> We should make the streaming DataFrame consistent with batch.
> This issue was reported by Tomasz Magdanski. He found that the schema 
> mismatch caused corrupted Parquet files.
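
The inconsistency described above can be illustrated with a small toy model. This is only a sketch in plain Python, not Spark code: the `Field` and `Struct` classes and the `as_nullable`, `batch_schema`, and `streaming_schema` helpers are hypothetical names invented for this illustration, and the model covers only nested structs (no arrays or maps). It assumes, as the description states, that the batch path relaxes every field to nullable while the streaming path uses the user-supplied schema unchanged.

```python
from dataclasses import dataclass, replace
from typing import Tuple, Union


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for one schema field."""
    name: str
    dtype: Union[str, "Struct"]
    nullable: bool


@dataclass(frozen=True)
class Struct:
    """Hypothetical stand-in for a struct schema."""
    fields: Tuple[Field, ...]


def as_nullable(schema: Struct) -> Struct:
    """Return a copy of the schema with every field, nested ones included, marked nullable."""
    return Struct(tuple(
        replace(
            f,
            dtype=as_nullable(f.dtype) if isinstance(f.dtype, Struct) else f.dtype,
            nullable=True,
        )
        for f in schema.fields
    ))


def batch_schema(user_schema: Struct) -> Struct:
    # Assumption taken from the issue text: batch reads always relax nullability.
    return as_nullable(user_schema)


def streaming_schema(user_schema: Struct) -> Struct:
    # Assumption taken from the issue text: streaming reads keep the schema as given.
    return user_schema


user_schema = Struct((
    Field("id", "long", nullable=False),
    Field("payload", Struct((Field("value", "string", nullable=False),)), nullable=False),
))

# The two paths disagree on nullability for the same user-supplied schema.
print(batch_schema(user_schema) == streaming_schema(user_schema))  # False
```

In this model, making streaming consistent with batch would amount to having `streaming_schema` also return `as_nullable(user_schema)`.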



