[
https://issues.apache.org/jira/browse/PARQUET-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Nandor Kollar updated PARQUET-1115:
-----------------------------------
Summary: Warn users when misusing parquet-tools merge (was: Prevent users
from misusing parquet-tools merge)
> Warn users when misusing parquet-tools merge
> --------------------------------------------
>
> Key: PARQUET-1115
> URL: https://issues.apache.org/jira/browse/PARQUET-1115
> Project: Parquet
> Issue Type: Improvement
> Reporter: Zoltan Ivanfi
> Assignee: Nandor Kollar
> Priority: Major
>
> To prevent users from using {{parquet-tools merge}} in scenarios where its
> use is not practical, we should describe its limitations in the help text of
> this command. Additionally, we should add a warning to the output of the
> merge command if the size of the original row groups is below a threshold.
> Reasoning:
> Many users are tempted to use the new {{parquet-tools merge}} functionality,
> because they want to achieve good performance and historically that has been
> associated with large Parquet files. However, in practice Hive performance
> won't change significantly after using {{parquet-tools merge}}, but Impala
> performance will be much worse. The reason is that good performance comes
> not from large files but from large row groups (up to the HDFS block
> size).
> However, {{parquet-tools merge}} does not merge row groups; it just places
> them one after the other. It was intended to be used for Parquet files that
> are already arranged in row groups of the desired size. When used to merge
> many small files, the resulting file will still contain small row groups and
> one loses most of the advantages of larger files (the only one that remains
> is that it takes a single HDFS operation to read them).
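
The behavior described above can be illustrated with a small model. This is only a sketch, not the parquet-tools implementation: each input file is represented as a list of row group sizes in bytes, and the function names and the 64 MiB threshold are hypothetical, since the issue does not name a threshold value.

```python
from typing import List, Optional

# Hypothetical threshold, chosen only for illustration; the issue does not
# specify a value.
SMALL_ROW_GROUP_BYTES = 64 * 1024 * 1024


def merge_row_groups(files: List[List[int]]) -> List[int]:
    """Model the merge: row groups are appended one after the other, unchanged."""
    merged: List[int] = []
    for row_group_sizes in files:
        merged.extend(row_group_sizes)
    return merged


def small_row_group_warning(
    files: List[List[int]], threshold: int = SMALL_ROW_GROUP_BYTES
) -> Optional[str]:
    """Return a warning if any input row group is below the threshold, else None."""
    merged = merge_row_groups(files)
    small = [size for size in merged if size < threshold]
    if not small:
        return None
    return (
        f"{len(small)} of {len(merged)} row groups are smaller than "
        f"{threshold} bytes; merging will not combine them"
    )
```

In this model, merging many files that each hold one small row group yields a single file with just as many small row groups, which is the case the proposed warning is meant to flag.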