[
https://issues.apache.org/jira/browse/SPARK-18727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16174660#comment-16174660
]
Serge Smertin edited comment on SPARK-18727 at 9/21/17 12:31 PM:
-----------------------------------------------------------------
I have use-cases similar to the ones mentioned in [#comment-15987668] by
[~simeons] - adding fields to nested _struct_ fields. The application is built
so that Parquet files are created/partitioned outside of Spark and only new
columns may be added, mostly within a couple of nested structs.
I don't know all the potential implications of the idea, but could we just use
the last element of the selected files instead of the first one, given that the
FileStatus [list is already sorted by path
lexicographically|https://github.com/apache/spark/blob/32fa0b81411f781173e185f4b19b9fd6d118f9fe/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L251]?
It is easier to guarantee that only new columns are added over time. The
following code change doesn't seem to be a huge deviation from the current
behavior, and it saves a tremendous amount of time compared to
{{spark.sql.parquet.mergeSchema=true}}:
{code:scala}
// ParquetFileFormat.scala (lines 232..240)
filesByType.commonMetadata.lastOption
.orElse(filesByType.metadata.lastOption)
.orElse(filesByType.data.lastOption)
{code}
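To illustrate the ordering assumption behind the change, here is a minimal standalone sketch (file names below are hypothetical, not taken from any real dataset): if part files are listed sorted lexicographically by path and files only ever gain columns, then {{lastOption}} picks the most recently written file, whose footer carries the newest (superset) schema, while the current {{headOption}} behavior reads the oldest one.

{code:scala}
object LastFooterSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical part-file names; timestamps embedded in the names make
    // lexicographic order match write order.
    val files = Seq(
      "part-00000-2017-01-01.parquet", // oldest schema, e.g. {id}
      "part-00001-2017-06-01.parquet", // column added, e.g. {id, name}
      "part-00002-2017-09-01.parquet"  // newest schema, e.g. {id, name, email}
    ).sorted

    // Current behavior reads the first footer; the proposed change reads the
    // last one, which would contain every column added so far.
    val current  = files.headOption
    val proposed = files.lastOption

    println(s"headOption picks: ${current.getOrElse("none")}")
    println(s"lastOption picks: ${proposed.getOrElse("none")}")
  }
}
{code}

This only holds when file paths sort in write order and schema changes are strictly additive, which is exactly the constraint described above.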
/cc [[email protected]] [~xwu0226]
> Support schema evolution as new files are inserted into table
> -------------------------------------------------------------
>
> Key: SPARK-18727
> URL: https://issues.apache.org/jira/browse/SPARK-18727
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: Eric Liang
> Priority: Critical
>
> Now that we have pushed partition management of all tables to the catalog,
> one issue for scalable partition handling remains: handling schema updates.
> Currently, a schema update requires dropping and recreating the entire table,
> which does not scale well with the size of the table.
> We should support updating the schema of the table, either via ALTER TABLE,
> or automatically as new files with compatible schemas are appended into the
> table.
> cc [~rxin]
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)