[ https://issues.apache.org/jira/browse/SPARK-18199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15770282#comment-15770282 ]
Soubhik Chakraborty commented on SPARK-18199:
---------------------------------------------
Can't we use the PARQUET-382 feature that was added in Parquet 1.9.0?
SPARK-13127 (the upgrade to Parquet 1.9) is already in progress, and assuming it
doesn't hit any major hurdle, we should be able to use it right away. The only
downside is that it doesn't allow appending files with two different schemas,
which is understandable.
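For illustration, a minimal sketch (Scala, against the parquet-mr API that PARQUET-382 introduced; the paths below are hypothetical) of merging an existing file and a small delta file with the same schema into a new file, copying the encoded row groups as-is:

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.format.converter.ParquetMetadataConverter
import org.apache.parquet.hadoop.{ParquetFileReader, ParquetFileWriter}

val conf = new Configuration()
// Hypothetical paths, for illustration only.
val existing = new Path("/data/events/part-00000.parquet")
val delta    = new Path("/data/events/_delta.parquet")
val merged   = new Path("/data/events/part-00000-merged.parquet")

// Both inputs must share the same schema; PARQUET-382 does not merge schemas.
val schema = ParquetFileReader
  .readFooter(conf, existing, ParquetMetadataConverter.NO_FILTER)
  .getFileMetaData
  .getSchema

val writer = new ParquetFileWriter(conf, schema, merged, ParquetFileWriter.Mode.CREATE)
writer.start()
writer.appendFile(conf, existing)  // row groups are copied without re-encoding
writer.appendFile(conf, delta)
writer.end(java.util.Collections.emptyMap[String, String]())
{code}

Note this still produces a new file rather than appending in place; the in-place variant described in the issue would additionally have to overwrite the footer at the end of the existing file.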
> Support appending to Parquet files
> ----------------------------------
>
> Key: SPARK-18199
> URL: https://issues.apache.org/jira/browse/SPARK-18199
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Jeremy Smith
>
> Currently, appending to a Parquet directory simply means creating new
> Parquet files in the directory. With many small appends (for example, in a
> streaming job with a short batch duration), this leads to an unbounded number
> of small Parquet files accumulating. These must be cleaned up periodically by
> removing them all and writing a new file containing all the rows (see the
> compaction sketch after this quoted description).
> It would be far better if Spark supported appending to the Parquet files
> themselves. HDFS supports this, as does Parquet:
> * The Parquet footer can be read in order to obtain necessary metadata.
> * The new rows can then be appended to the Parquet file as a row group.
> * A new footer can then be appended containing the metadata and referencing
> the new row groups as well as the previously existing row groups.
> This would result in a small amount of bloat in the file as new row groups
> are added (since duplicate metadata would accumulate) but it's hugely
> preferable to accumulating small files, which is bad for HDFS health and also
> eventually leads to Spark being unable to read the Parquet directory at all.
> Periodic rewriting of the file could still be performed in order to remove
> the duplicate metadata.
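Until something like that is supported, the cleanup described above amounts to a periodic compaction job. A minimal sketch, assuming an active SparkSession named {{spark}} and a hypothetical /data/events directory:

{code:scala}
// Periodic compaction: read the accumulated small Parquet files and rewrite
// them as a single larger file in a staging directory.
val df = spark.read.parquet("/data/events")

df.coalesce(1)                      // target file count is a tuning choice
  .write
  .mode("overwrite")
  .parquet("/data/events_compacted")
{code}

Swapping the compacted directory back into place (and the choice of target file count) is deliberately left out here, since both depend on the job's layout.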