Jeremy Smith created SPARK-18199:
------------------------------------

             Summary: Support appending to Parquet files
                 Key: SPARK-18199
                 URL: https://issues.apache.org/jira/browse/SPARK-18199
             Project: Spark
          Issue Type: Improvement
          Components: SQL
            Reporter: Jeremy Smith


Currently, appending to a Parquet directory simply creates new Parquet files in 
the directory. With many small appends (for example, in a streaming job with a 
short batch duration) this leads to an unbounded number of small Parquet files 
accumulating. These must be compacted periodically by removing them all and 
rewriting a single file containing all the rows.

It would be far better if Spark supported appending to the Parquet files 
themselves. HDFS supports this, as does Parquet:

* The Parquet footer can be read in order to obtain necessary metadata.
* The new rows can then be appended to the Parquet file as a row group.
* A new footer can then be appended containing the metadata and referencing the 
new row groups as well as the previously existing row groups.
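The three steps above can be sketched with a toy footer-at-the-end file format. This is a minimal illustration in plain Python (standard library only); the layout, the field names, and the `TOY1` magic marker are hypothetical stand-ins for the real Parquet format and are not the parquet-mr API:

```python
import json
import os
import struct

MAGIC = b"TOY1"  # stand-in for Parquet's trailing magic bytes


def read_footer(path):
    """Read the trailing footer: [...data...][footer json][4-byte len][magic]."""
    with open(path, "rb") as f:
        f.seek(-8, os.SEEK_END)
        length, magic = struct.unpack("<I4s", f.read(8))
        assert magic == MAGIC, "not a toy file"
        f.seek(-(8 + length), os.SEEK_END)
        return json.loads(f.read(length))


def write_footer(f, footer):
    """Serialize the footer and its trailing length + magic marker."""
    payload = json.dumps(footer).encode()
    f.write(payload)
    f.write(struct.pack("<I4s", len(payload), MAGIC))


def create(path, rows):
    """Write an initial row group followed by a footer describing it."""
    with open(path, "wb") as f:
        data = json.dumps(rows).encode()
        offset = f.tell()
        f.write(data)
        write_footer(f, {"row_groups": [{"offset": offset, "length": len(data)}]})


def append(path, rows):
    """Append a row group after the old footer, then append a new footer
    referencing both the old and the new row groups. The old footer's
    bytes stay behind in the file -- the metadata bloat described above."""
    footer = read_footer(path)                 # 1. read existing metadata
    with open(path, "ab") as f:
        data = json.dumps(rows).encode()
        offset = f.tell()
        f.write(data)                          # 2. append the new row group
        footer["row_groups"].append({"offset": offset, "length": len(data)})
        write_footer(f, footer)                # 3. append the new footer


def read_all(path):
    """Readers consult only the latest footer, so one file yields all rows."""
    footer = read_footer(path)
    rows = []
    with open(path, "rb") as f:
        for rg in footer["row_groups"]:
            f.seek(rg["offset"])
            rows.extend(json.loads(f.read(rg["length"])))
    return rows
```

Because the footer is always located by seeking from the end of the file, readers never see the superseded footers, and each append is a pure file append, which is the operation HDFS supports.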

This would add a small amount of bloat to the file as new row groups are 
appended (since each append leaves a superseded copy of the footer metadata in 
place), but that is hugely preferable to accumulating small files, which 
strains HDFS (each small file costs NameNode metadata) and eventually leaves 
Spark unable to read the Parquet directory at all. The file could still be 
rewritten periodically to remove the duplicated metadata.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
