flykobe cheng created PARQUET-460:
-------------------------------------
Summary: Parquet files concat tool
Key: PARQUET-460
URL: https://issues.apache.org/jira/browse/PARQUET-460
Project: Parquet
Issue Type: Improvement
Components: parquet-mr
Affects Versions: 1.8.0, 1.7.0
Reporter: flykobe cheng
Currently, generating Parquet files is time-consuming; most of the time is spent
on serialization and compression. In our scenario it takes about 10 minutes to
generate a ~100 MB Parquet file. We want to improve write performance without
generating too many small files, which would hurt read performance.
We propose to:
1. Generate several small Parquet files concurrently.
2. Merge the small files into one file: concatenate the Parquet blocks as raw
bytes (without SerDe), merge the footers, and update the path and offset metadata.
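The offset update in step 2 can be sketched as follows. This is only an
illustration, not the ParquetFilesConcat code: BlockMeta, ConcatPlanner, rebase
and shiftFor are hypothetical names, and the sketch assumes each block is
described by a start offset and a compressed size, and that the merged file
begins with the 4-byte "PAR1" magic followed by the copied blocks back to back.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified stand-in for one block's entry in a footer. */
final class BlockMeta {
    final long startOffset;    // first byte of the block within its file
    final long compressedSize; // number of bytes the block occupies on disk
    final long rowCount;       // rows stored in the block

    BlockMeta(long startOffset, long compressedSize, long rowCount) {
        this.startOffset = startOffset;
        this.compressedSize = compressedSize;
        this.rowCount = rowCount;
    }
}

/** Hypothetical helper that plans where each copied block lands. */
final class ConcatPlanner {
    // Assumption: the output starts with the 4-byte "PAR1" magic.
    static final long MAGIC_LENGTH = 4;

    /**
     * Returns the merged footer entries in output order. Each block keeps
     * its size and row count; only its start offset changes.
     */
    static List<BlockMeta> rebase(List<List<BlockMeta>> inputFiles) {
        List<BlockMeta> merged = new ArrayList<>();
        long writePos = MAGIC_LENGTH;
        for (List<BlockMeta> file : inputFiles) {
            for (BlockMeta block : file) {
                merged.add(new BlockMeta(writePos, block.compressedSize, block.rowCount));
                // The block's bytes are copied verbatim, so the next block
                // starts right after them.
                writePos += block.compressedSize;
            }
        }
        return merged;
    }

    /**
     * Delta to add to every offset stored inside a block (for example page
     * offsets) so that it points into the merged file.
     */
    static long shiftFor(BlockMeta original, BlockMeta rebased) {
        return rebased.startOffset - original.startOffset;
    }
}
```

Because the block bytes are copied unchanged, no page is decompressed or
re-encoded; only the footer entries are rewritten, which is where the expected
time saving comes from.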
We created a ParquetFilesConcat class to perform step 2. It can be invoked
through parquet.tools.command.ConcatCommand. If the Parquet community approves
this feature, we will integrate it into Spark.
This approach will affect compression and introduce more dictionary pages, but
the effect can be mitigated by adjusting the concurrency of step 1.