[ https://issues.apache.org/jira/browse/SPARK-18727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16174660#comment-16174660 ]

Serge Smertin edited comment on SPARK-18727 at 9/21/17 12:31 PM:
-----------------------------------------------------------------

I have use-cases similar to those mentioned in [#comment-15987668] by 
[~simeons] - adding fields to nested _struct_ columns. The application is built 
so that Parquet files are created/partitioned outside of Spark, and only new 
columns are ever added, mostly within a couple of nested structs.

I don't know all the potential implications of the idea, but could we simply use 
the last element of the selected files instead of the first one, given that the 
FileStatus [list is already sorted by path 
lexicographically|https://github.com/apache/spark/blob/32fa0b81411f781173e185f4b19b9fd6d118f9fe/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L251]?
 It is easier to guarantee that only new columns are added over time, and the 
following code change doesn't seem to be a huge deviation from the current 
behavior, while saving a lot of time compared to 
{{spark.sql.parquet.mergeSchema=true}}:

{code:scala}
// ParquetFileFormat.scala (lines 232..240): pick the file whose footer is
// used for schema inference; lastOption replaces the current headOption,
// so the lexicographically last (newest) file wins.
filesByType.commonMetadata.lastOption
  .orElse(filesByType.metadata.lastOption)
  .orElse(filesByType.data.lastOption)
{code}
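
For comparison, this is the read-side workaround being measured against (path is hypothetical): {{mergeSchema}} has to read every footer, which is exactly the cost the one-line change above is meant to avoid:

{code:scala}
// Workaround today: merge the schemas of all part-file footers (expensive
// on tables with many files).
val merged = spark.read
  .option("mergeSchema", "true")
  .parquet("/tmp/events")

// Default path: the schema comes from a single footer; with the change
// above it would be taken from the lexicographically last file, so newly
// added columns become visible without the full merge.
val fast = spark.read.parquet("/tmp/events")
{code}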

/cc [~r...@databricks.com] [~xwu0226] 


> Support schema evolution as new files are inserted into table
> -------------------------------------------------------------
>
>                 Key: SPARK-18727
>                 URL: https://issues.apache.org/jira/browse/SPARK-18727
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Eric Liang
>            Priority: Critical
>
> Now that we have pushed partition management of all tables to the catalog, 
> one issue for scalable partition handling remains: handling schema updates.
> Currently, a schema update requires dropping and recreating the entire table, 
> which does not scale well with the size of the table.
> We should support updating the schema of the table, either via ALTER TABLE, 
> or automatically as new files with compatible schemas are appended into the 
> table.
> cc [~rxin]


