[
https://issues.apache.org/jira/browse/PARQUET-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ryan Blue updated PARQUET-1115:
-------------------------------
Fix Version/s: 1.10.0 (was: 1.9.1)
> Warn users when misusing parquet-tools merge
> --------------------------------------------
>
> Key: PARQUET-1115
> URL: https://issues.apache.org/jira/browse/PARQUET-1115
> Project: Parquet
> Issue Type: Improvement
> Reporter: Zoltan Ivanfi
> Assignee: Nandor Kollar
> Priority: Major
> Fix For: 1.10.0
>
>
> To prevent users from using {{parquet-tools merge}} in scenarios where its
> use is not practical, we should describe its limitations in the help text of
> this command. Additionally, we should add a warning to the output of the
> merge command if the size of the original row groups is below a threshold.
> Reasoning:
> Many users are tempted to use the new {{parquet-tools merge}} functionality,
> because they want to achieve good performance and historically that has been
> associated with large Parquet files. However, in practice Hive performance
> won't change significantly after using {{parquet-tools merge}}, but Impala
> performance will be much worse. The reason is that good performance comes
> not from large files but from large row groups (up to the HDFS block size).
> However, {{parquet-tools merge}} does not merge row groups; it just places
> them one after the other. It was intended to be used for Parquet files that
> are already arranged in row groups of the desired size. When used to merge
> many small files, the resulting file will still contain small row groups and
> one loses most of the advantages of larger files (the only one that remains
> is that it takes a single HDFS operation to read them).
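The behavior described above can be sketched with a small model. This is only an illustration, not the parquet-tools implementation: each file is represented as a list of row group sizes in bytes, and the function names, the 64 MiB threshold, and the warning text are all hypothetical. It shows that an append-only merge leaves the row group sizes unchanged, and how a threshold check of the kind proposed in this issue could flag that.

```python
# Illustrative model only: a "file" is a list of row group sizes in bytes.
# None of these names or values come from parquet-tools.

MIB = 1024 * 1024
SMALL_ROW_GROUP_THRESHOLD = 64 * MIB  # hypothetical threshold, for illustration


def merge_files(files):
    """Model an append-only merge: row groups are placed one after the
    other and are never combined into larger ones."""
    merged = []
    for row_groups in files:
        merged.extend(row_groups)
    return merged


def small_row_group_warning(row_groups, threshold=SMALL_ROW_GROUP_THRESHOLD):
    """Return a warning string if any row group is below the threshold,
    otherwise None."""
    small = [size for size in row_groups if size < threshold]
    if not small:
        return None
    return (
        f"WARNING: {len(small)} of {len(row_groups)} row groups are smaller "
        f"than {threshold // MIB} MiB; merge appends row groups without "
        f"combining them."
    )


# 100 input files, each holding a single 1 MiB row group.
inputs = [[1 * MIB] for _ in range(100)]
merged = merge_files(inputs)

# The merged file is larger, but it still holds 100 row groups of 1 MiB each.
print(len(merged), max(merged) // MIB)
print(small_row_group_warning(merged))
```

In this model the merged file has the same number of row groups as the inputs combined, each at its original size, which is the situation the proposed warning is meant to catch.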