[
https://issues.apache.org/jira/browse/SPARK-16518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15375374#comment-15375374
]
Xiao Li edited comment on SPARK-16518 at 7/13/16 5:16 PM:
----------------------------------------------------------
[~hyukjin.kwon] You can open a child JIRA under this one and submit a PR to
resolve the issue. I am trying to summarize the schema
compatibility/merging/evolution issues across all of the natively supported
file formats. BTW, thank you for your investigation!
was (Author: smilegator):
[~hyukjin.kwon] You can open a child JIRA under this and submit a PR to resolve
the issue. I am trying to summarize schema compatibility/merging/evolution
issues for all the natively supported file formats.
> Schema Compatibility of Parquet Data Source
> -------------------------------------------
>
> Key: SPARK-16518
> URL: https://issues.apache.org/jira/browse/SPARK-16518
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Xiao Li
>
> Currently, we are not checking schema compatibility when appending to
> existing data, and the different file formats behave differently. This JIRA
> summarizes what I observed for Parquet data source tables.
> *Scenario 1: Data type mismatch*
> The existing schema is {{(col1 int, col2 string)}}.
> The schema of the appended dataset is {{(col1 int, col2 int)}}.
> *Case 1*: _when {{spark.sql.parquet.mergeSchema}} is {{false}}_, the error we
> got:
> {noformat}
> Job aborted due to stage failure: Task 0 in stage 4.0 failed 1 times, most
> recent failure:
> Lost task 0.0 in stage 4.0 (TID 4, localhost): java.lang.NullPointerException
> at
> org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getInt(OnHeapColumnVector.java:231)
> at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(generated.java:62)
> {noformat}
> *Case 2*: _when {{spark.sql.parquet.mergeSchema}} is {{true}}_, the error we
> got:
> {noformat}
> Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most
> recent failure: Lost task 0.0 in stage 3.0 (TID 3, localhost):
> org.apache.spark.SparkException:
> Failed merging schema of file
> file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-4c2f0b69-ee05-4be1-91f0-0e54f89f2308/part-r-00000-6b76638c-a624-444c-9479-3c8e894cb65e.snappy.parquet:
> root
> |-- a: integer (nullable = false)
> |-- b: string (nullable = true)
> {noformat}
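To make the {{mergeSchema=true}} failure above concrete, here is a minimal
sketch of what "failed merging schema" means conceptually: merging unions the
columns of the per-file schemas and raises when the same column carries two
incompatible types. This is an illustrative toy model, not Spark's actual
implementation; the function name and error message are hypothetical.

```python
def merge_schemas(existing, appended):
    """Union two schemas, given as {column: type} dicts.

    Raises ValueError on a type conflict, analogous to the
    SparkException in Case 2 above. (Toy model, not Spark code.)
    """
    merged = dict(existing)
    for col, typ in appended.items():
        if col in merged and merged[col] != typ:
            raise ValueError(
                f"Failed merging schemas: column '{col}' is "
                f"{merged[col]} in one file and {typ} in another")
        merged[col] = typ
    return merged


existing = {"col1": "int", "col2": "string"}
appended = {"col1": "int", "col2": "int"}  # col2 type mismatch, as in Scenario 1

try:
    merge_schemas(existing, appended)
except ValueError as e:
    print(e)
```

Note that in Case 1 ({{mergeSchema=false}}) no such check runs at all, so the
type conflict only surfaces later as an NPE deep in the vectorized reader.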
> *Scenario 2: More columns in the appended dataset*
> The existing schema is {{(col1 int, col2 string)}}.
> The schema of the appended dataset is {{(col1 int, col2 string, col3 int)}}.
> *Case 1*: _when {{spark.sql.parquet.mergeSchema}} is {{false}}_, the schema
> of the result set is {{(col1 int, col2 string)}}.
> *Case 2*: _when {{spark.sql.parquet.mergeSchema}} is {{true}}_, the schema of
> the result set is {{(col1 int, col2 string, col3 int)}}.
> *Scenario 3: Fewer columns in the appended dataset*
> The existing schema is {{(col1 int, col2 string)}}.
> The schema of the appended dataset is {{(col1 int)}}.
> *Case 1*: _when {{spark.sql.parquet.mergeSchema}} is {{false}}_, the schema
> of the result set is {{(col1 int, col2 string)}}.
> *Case 2*: _when {{spark.sql.parquet.mergeSchema}} is {{true}}_, the schema
> of the result set is {{(col1 int)}}.
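Scenarios 2 and 3 can be sketched the same way: with merging off, the read
uses the schema of a single file footer, and with merging on it takes a union
of columns across files. The sketch below is an assumption about the intended
semantics, not Spark's code; note that for Scenario 3 a union would be
expected to keep {{col2}}, whereas the report above observed only
{{(col1 int)}}, which is part of the inconsistency this JIRA summarizes.

```python
def read_schema(file_schemas, merge_schema):
    """file_schemas: list of {column: type} dicts, one per Parquet file.

    Toy model (hypothetical helper): merge_schema=False picks one file's
    schema; merge_schema=True unions columns across all files.
    """
    if not merge_schema:
        # Without merging, one footer's schema wins; here the first dict
        # stands in for the existing table's schema.
        return dict(file_schemas[0])
    merged = {}
    for schema in file_schemas:
        merged.update(schema)  # union of columns across files
    return merged


existing = {"col1": "int", "col2": "string"}

# Scenario 2: the appended file adds col3.
wider = {"col1": "int", "col2": "string", "col3": "int"}
print(read_schema([existing, wider], merge_schema=False))  # no col3
print(read_schema([existing, wider], merge_schema=True))   # col3 appears

# Scenario 3: the appended file drops col2. The union keeps col2 here,
# unlike the observed Spark 2.0 behavior reported above.
narrower = {"col1": "int"}
print(read_schema([existing, narrower], merge_schema=True))
```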
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)