[jira] [Updated] (PARQUET-1115) Warn users when misusing parquet-tools merge

Ryan Blue (JIRA) Fri, 30 Mar 2018 14:40:24 -0700

     [ 
https://issues.apache.org/jira/browse/PARQUET-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Ryan Blue updated PARQUET-1115:
-------------------------------
    Fix Version/s:     (was: 1.9.1)
                   1.10.0

> Warn users when misusing parquet-tools merge
> --------------------------------------------
>
>                 Key: PARQUET-1115
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1115
>             Project: Parquet
>          Issue Type: Improvement
>            Reporter: Zoltan Ivanfi
>            Assignee: Nandor Kollar
>            Priority: Major
>             Fix For: 1.10.0
>
>
> To prevent users from using {{parquet-tools merge}} in scenarios where its 
> use is not practical, we should describe its limitations in the help text of 
> this command. Additionally, we should add a warning to the output of the 
> merge command if the size of the original row groups is below a threshold.
> Reasoning:
> Many users are tempted to use the new {{parquet-tools merge}} functionality 
> because they want good performance, which has historically been associated 
> with large Parquet files. In practice, however, Hive performance won't 
> change significantly after using {{parquet-tools merge}}, but Impala 
> performance will be much worse. The reason is that good performance comes 
> not from large files but from large row groups (up to the HDFS block size).
> However, {{parquet-tools merge}} does not merge row groups; it just places 
> them one after the other. It was intended for Parquet files that are 
> already arranged in row groups of the desired size. When it is used to merge 
> many small files, the resulting file still contains small row groups and 
> loses most of the advantages of larger files (the only one that remains 
> is that it takes a single HDFS operation to read them).
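
The behaviour described above can be illustrated with a small model. This is only a sketch, not the parquet-tools implementation: row groups are represented by their byte sizes alone, and the names merge_row_groups, small_row_group_warning, and the 64 MiB value of SMALL_ROW_GROUP_BYTES are hypothetical, since the issue does not name a threshold.

```python
from typing import List, Optional

# Hypothetical threshold; the issue does not specify a value.
SMALL_ROW_GROUP_BYTES = 64 * 1024 * 1024


def merge_row_groups(files: List[List[int]]) -> List[int]:
    """Model a merge that appends row groups unchanged, one after the other.

    Each input file is a list of row group sizes in bytes.
    """
    merged: List[int] = []
    for row_group_sizes in files:
        merged.extend(row_group_sizes)
    return merged


def small_row_group_warning(
    row_group_sizes: List[int], threshold: int = SMALL_ROW_GROUP_BYTES
) -> Optional[str]:
    """Return a warning if any row group is below the threshold, else None."""
    small = [size for size in row_group_sizes if size < threshold]
    if not small:
        return None
    return (
        f"{len(small)} of {len(row_group_sizes)} row groups are smaller than "
        f"{threshold} bytes; merging will not combine them into larger ones."
    )


# Three small files, each holding a single 1 MiB row group.
inputs = [[1 << 20], [1 << 20], [1 << 20]]
merged = merge_row_groups(inputs)

# The merged file still has three small row groups, so the check warns.
print(len(merged))
print(small_row_group_warning(merged))
```

In a real check, the sizes would presumably be read from each file's footer metadata rather than passed in as plain integers.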



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
