lei yu created PARQUET-2207:
-------------------------------
Summary: support saving meta and data separately
Key: PARQUET-2207
URL: https://issues.apache.org/jira/browse/PARQUET-2207
Project: Parquet
Issue Type: New Feature
Components: parquet-format
Reporter: lei yu
I often need to create tens of millions of small dataframes and save them
as Parquet files. All of these dataframes have the same column and index
information, and they normally have the same number of rows (around 300).
Because the data is so small, the Parquet metadata is relatively large, and
repeating the same metadata tens of millions of times wastes a lot of disk
space.
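One way to quantify this overhead is to read the footer length straight from
the end of a file. The sketch below assumes the standard unencrypted Parquet
layout, in which a file ends with the serialized file metadata, a 4-byte
little-endian metadata length, and the magic bytes "PAR1". The function name
footer_overhead is hypothetical and is not part of any Parquet library.

```python
import os
import struct

PARQUET_MAGIC = b"PAR1"  # trailer magic of an unencrypted Parquet file


def footer_overhead(path):
    """Return (footer_bytes, file_bytes) for the Parquet file at `path`.

    footer_bytes counts the serialized file metadata plus the 8 trailing
    bytes (4-byte length and 4-byte magic). Assumes an unencrypted footer.
    """
    file_bytes = os.path.getsize(path)
    # Smallest possible file: leading magic + length field + trailing magic.
    if file_bytes < 12:
        raise ValueError("file is too small to be a Parquet file")
    with open(path, "rb") as f:
        f.seek(-8, os.SEEK_END)
        tail = f.read(8)
    if tail[4:] != PARQUET_MAGIC:
        raise ValueError("file does not end with the PAR1 magic bytes")
    # The metadata length is stored as a little-endian unsigned 32-bit int.
    (metadata_len,) = struct.unpack("<I", tail[:4])
    return metadata_len + 8, file_bytes
```

Dividing the first returned value by the second gives the fraction of each
file taken up by the footer. Page headers inside the data section are not
counted, so the true per-file metadata cost is somewhat higher.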
Concatenating them into one big Parquet file can save disk space, but it is
not friendly to parallel processing of each small dataframe.
If I could save one copy of the metadata in one file, with the remaining
Parquet files containing only the data, disk space would be saved and the
files would still be suitable for parallel processing.
It seems to me that this is possible by design, but I couldn't find any API
that supports it.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)