[jira] [Updated] (PARQUET-1115) Warn users when misusing parquet-tools merge

Nandor Kollar (JIRA) Thu, 02 Nov 2017 09:38:22 -0700

     [ 
https://issues.apache.org/jira/browse/PARQUET-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Nandor Kollar updated PARQUET-1115:
-----------------------------------
    Summary: Warn users when misusing parquet-tools merge  (was: Prevent users 
from misusing parquet-tools merge)

> Warn users when misusing parquet-tools merge
> --------------------------------------------
>
>                 Key: PARQUET-1115
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1115
>             Project: Parquet
>          Issue Type: Improvement
>            Reporter: Zoltan Ivanfi
>            Assignee: Nandor Kollar
>            Priority: Major
>
> To prevent users from using {{parquet-tools merge}} in scenarios where its 
> use is not practical, we should describe its limitations in the help text of 
> this command. Additionally, we should add a warning to the output of the 
> merge command if the original row groups are smaller than a threshold.
> Reasoning:
> Many users are tempted to use the new {{parquet-tools merge}} functionality 
> because they want to achieve good performance, and historically that has been 
> associated with large Parquet files. In practice, however, Hive performance 
> won't change significantly after using {{parquet-tools merge}}, and Impala 
> performance will be much worse. The reason is that good performance comes 
> from large row groups (up to the HDFS block size), not from large files.
> However, {{parquet-tools merge}} does not merge row groups; it just places 
> them one after the other. It was intended to be used for Parquet files that 
> are already arranged in row groups of the desired size. When it is used to 
> merge many small files, the resulting file will still contain small row 
> groups, and one loses most of the advantages of larger files (the only one 
> that remains is that it takes a single HDFS operation to read them).
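
The behaviour described above can be modelled in a few lines. This is only an illustrative sketch in Python, not the parquet-tools implementation: each file is represented as a plain list of row group sizes in bytes, and the function names and the 64 MiB threshold are assumptions made for the example (the issue does not name a threshold value).

```python
from typing import List, Optional

# Hypothetical threshold for the example; the issue does not specify a value.
SMALL_ROW_GROUP_BYTES = 64 * 1024 * 1024


def merge_row_groups(files: List[List[int]]) -> List[int]:
    """Model the merge: row groups are placed one after the other, never combined."""
    merged: List[int] = []
    for row_group_sizes in files:
        merged.extend(row_group_sizes)
    return merged


def small_row_group_warning(
    row_group_sizes: List[int], threshold: int = SMALL_ROW_GROUP_BYTES
) -> Optional[str]:
    """Return a warning message if any row group is below the threshold, else None."""
    small = [size for size in row_group_sizes if size < threshold]
    if not small:
        return None
    return (
        f"{len(small)} of {len(row_group_sizes)} row groups are smaller than "
        f"{threshold} bytes; merging will not improve read performance."
    )


if __name__ == "__main__":
    # 100 small files, each holding a single 1 MiB row group.
    small_files = [[1024 * 1024] for _ in range(100)]
    merged = merge_row_groups(small_files)
    print(len(merged))  # the merged file still has one row group per input file
    print(small_row_group_warning(merged))
```

In this model, merging 100 files with one small row group each yields a single file that still holds 100 small row groups, which is the case the proposed warning is meant to flag.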



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
