Jeremy Smith created SPARK-18199:
------------------------------------
Summary: Support appending to Parquet files
Key: SPARK-18199
URL: https://issues.apache.org/jira/browse/SPARK-18199
Project: Spark
Issue Type: Improvement
Components: SQL
Reporter: Jeremy Smith
Currently, appending to a Parquet directory simply creates new Parquet files in
the directory. With many small appends (for example, in a streaming job with a
short batch duration) this leads to an unbounded number of small Parquet files
accumulating. These must periodically be cleaned up by deleting them all and
rewriting their rows into a single new file.
It would be far better if Spark supported appending to the Parquet files
themselves. HDFS supports this, as does Parquet:
* The existing Parquet footer can be read to obtain the necessary metadata.
* The new rows can then be appended to the Parquet file as a new row group.
* A new footer can then be appended containing the updated metadata, referencing
both the new row group and the previously existing row groups.
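Concretely, the steps above rely on Parquet's trailer layout: a file ends with its Thrift-encoded footer, followed by a 4-byte little-endian footer length, followed by the magic bytes `PAR1`. A minimal byte-level sketch of the proposed append, assuming that layout (the helper names and the placeholder footer/row-group bytes are illustrative, not a real Parquet implementation):

```python
import struct

MAGIC = b"PAR1"

def footer_info(data: bytes):
    """Locate the footer in a Parquet byte stream.
    Trailer layout: [footer bytes][4-byte LE footer length][b"PAR1"]."""
    assert data[-4:] == MAGIC, "not a Parquet file"
    footer_len = struct.unpack("<I", data[-8:-4])[0]
    footer_start = len(data) - 8 - footer_len
    return footer_start, footer_len

def append_row_group(data: bytes, row_group: bytes, new_footer: bytes) -> bytes:
    """Sketch of the append described above: the new row group and a new
    footer (covering old and new row groups) are written after the old
    trailer. The old footer bytes stay behind as dead space -- this is the
    metadata "bloat" the issue mentions."""
    return (data
            + row_group
            + new_footer
            + struct.pack("<I", len(new_footer))
            + MAGIC)
```

On HDFS this maps onto a plain `append()` to the existing file; readers are unaffected because they locate the footer from the end of the file, so they always see the most recently written footer.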
This would add a small amount of bloat to the file as new row groups are
appended (since superseded footer metadata accumulates), but that is hugely
preferable to accumulating small files, which is bad for HDFS health and
eventually leaves Spark unable to read the Parquet directory at all. The file
could still be rewritten periodically to remove the duplicate metadata.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)