[ https://issues.apache.org/jira/browse/SPARK-18199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Reynold Xin closed SPARK-18199.
-------------------------------
Resolution: Invalid
I'm closing this as invalid. It is not a good idea to append to an existing
file in distributed systems, especially given we might have two writers at the
same time.
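
The hazard with two writers can be sketched as a lost update. This is a toy
in-memory model, not Spark, HDFS, or Parquet code: the dict layout and the
names read_footer, append_row_group, and visible_rows are hypothetical, and it
assumes nothing serializes the two writers. Both writers read the same footer,
each appends a row group, and whichever footer is written last silently hides
the other writer's rows.

```python
# Toy model (assumption): a "file" is a list of row groups plus a footer that
# lists the indices of the row groups a reader should see.

def read_footer(file: dict) -> list:
    """Return a snapshot copy of the footer, as a writer would before appending."""
    return list(file["footer"])


def append_row_group(file: dict, snapshot: list, rows: list) -> None:
    """Append rows as a new row group, then write a footer built from the snapshot."""
    file["row_groups"].append(rows)
    new_index = len(file["row_groups"]) - 1
    # The new footer only knows about what the writer saw in its own snapshot.
    file["footer"] = snapshot + [new_index]


def visible_rows(file: dict) -> list:
    """Rows reachable through the current footer."""
    return [row for i in file["footer"] for row in file["row_groups"][i]]


shared = {"row_groups": [["a"]], "footer": [0]}

# Both writers read the footer before either one writes.
snapshot_a = read_footer(shared)
snapshot_b = read_footer(shared)

append_row_group(shared, snapshot_a, ["b"])
append_row_group(shared, snapshot_b, ["c"])

print(visible_rows(shared))  # ['a', 'c'] -- writer A's row "b" is no longer referenced
```

Writer A's row group is still physically present in the toy file, but no
footer points at it, so readers never see it.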
> Support appending to Parquet files
> ----------------------------------
>
> Key: SPARK-18199
> URL: https://issues.apache.org/jira/browse/SPARK-18199
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Jeremy Smith
>
> Currently, appending to a Parquet directory simply creates new Parquet files
> in the directory. With many small appends (for example, in a streaming job
> with a short batch duration), this leads to an unbounded number of small
> Parquet files accumulating. These must be cleaned up periodically by
> removing them all and rewriting a single new file containing all the rows.
>
> It would be far better if Spark supported appending to the Parquet files
> themselves. HDFS supports this, as does Parquet:
>
> * The Parquet footer can be read in order to obtain necessary metadata.
> * The new rows can then be appended to the Parquet file as a row group.
> * A new footer can then be appended containing the metadata and referencing
>   the new row groups as well as the previously existing row groups.
>
> This would result in a small amount of bloat in the file as new row groups
> are added (since duplicate metadata would accumulate) but it's hugely
> preferable to accumulating small files, which is bad for HDFS health and also
> eventually leads to Spark being unable to read the Parquet directory at all.
> Periodic rewriting of the file could still be performed in order to remove
> the duplicate metadata.
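
The three steps proposed in the description can be sketched with a toy file
format. This is not real Parquet and not a Spark API: the TOY1 magic, the JSON
encoding, and the function names below are all assumptions made for
illustration. The only thing borrowed from Parquet is the general shape of a
footer at the end of the file followed by its length and a magic marker. Each
append leaves the previous footer behind as unreferenced bytes, which is the
"duplicate metadata" bloat the description mentions.

```python
import json
import struct

MAGIC = b"TOY1"  # hypothetical marker for this toy format (not Parquet's)


def write_footer(row_groups: list) -> bytes:
    """Serialize a footer as [JSON body][4-byte little-endian length][magic]."""
    body = json.dumps({"row_groups": row_groups}).encode("utf-8")
    return body + struct.pack("<I", len(body)) + MAGIC


def read_footer(buf: bytes) -> dict:
    """Parse the footer found at the very end of the file."""
    if buf[-4:] != MAGIC:
        raise ValueError("not a toy-format file")
    (footer_len,) = struct.unpack("<I", buf[-8:-4])
    start = len(buf) - 8 - footer_len
    return json.loads(buf[start:len(buf) - 8])


def create(rows: list) -> bytes:
    """Write a new file with one row group and one footer."""
    data = json.dumps(rows).encode("utf-8")
    return data + write_footer([{"offset": 0, "length": len(data)}])


def append(buf: bytes, rows: list) -> bytes:
    """Append rows as a new row group and write a fresh footer after it."""
    # Step 1: read the existing footer to learn about earlier row groups.
    groups = read_footer(buf)["row_groups"]
    # Step 2: the new row group goes after everything already in the file,
    # including the old footer, which stays behind as unreferenced bytes.
    data = json.dumps(rows).encode("utf-8")
    groups.append({"offset": len(buf), "length": len(data)})
    # Step 3: the new footer references both old and new row groups.
    return buf + data + write_footer(groups)


def read_rows(buf: bytes) -> list:
    """Read all rows reachable from the latest footer."""
    rows = []
    for group in read_footer(buf)["row_groups"]:
        chunk = buf[group["offset"]:group["offset"] + group["length"]]
        rows.extend(json.loads(chunk))
    return rows


toy_file = create([1, 2])
toy_file = append(toy_file, [3])
toy_file = append(toy_file, [4, 5])
print(read_rows(toy_file))
```

After two appends the toy file holds three footers, of which only the last is
live; a periodic rewrite would drop the other two.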