Zoltan Ivanfi created PARQUET-1115:
--------------------------------------

             Summary: Prevent users from misusing parquet-tools merge
                 Key: PARQUET-1115
                 URL: https://issues.apache.org/jira/browse/PARQUET-1115
             Project: Parquet
          Issue Type: Improvement
            Reporter: Zoltan Ivanfi
            Assignee: Zoltan Ivanfi


To prevent users from using {{parquet-tools merge}} in scenarios where its use 
is not practical, we should describe its limitations in the help text of this 
command. Additionally, the merge command should print a warning if the sizes 
of the original row groups are below a threshold.
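
A minimal sketch of what such a check could look like, assuming the row group 
sizes of the input files are already available as byte counts. The class name, 
the method names, the 64 MiB threshold and the warning text are all 
hypothetical; this issue does not specify any of them, and none of this is 
existing parquet-tools code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of parquet-tools.
final class SmallRowGroupCheck {
    // Assumed threshold of 64 MiB. The issue does not name a value.
    static final long ASSUMED_THRESHOLD_BYTES = 64L * 1024 * 1024;

    /** Returns the indexes of the row groups whose size is below the threshold. */
    static List<Integer> smallRowGroups(List<Long> rowGroupSizes, long thresholdBytes) {
        List<Integer> small = new ArrayList<>();
        for (int i = 0; i < rowGroupSizes.size(); i++) {
            if (rowGroupSizes.get(i) < thresholdBytes) {
                small.add(i);
            }
        }
        return small;
    }

    /** Builds a warning message, or an empty string if no row group is small. */
    static String warningFor(List<Long> rowGroupSizes, long thresholdBytes) {
        List<Integer> small = smallRowGroups(rowGroupSizes, thresholdBytes);
        if (small.isEmpty()) {
            return "";
        }
        return "WARNING: " + small.size() + " of " + rowGroupSizes.size()
                + " input row groups are smaller than " + thresholdBytes
                + " bytes; merge will not combine them.";
    }

    public static void main(String[] args) {
        // Two small row groups and one large one (sizes in bytes).
        List<Long> sizes = Arrays.asList(1024L, 128L * 1024 * 1024, 2048L);
        System.out.println(warningFor(sizes, ASSUMED_THRESHOLD_BYTES));
    }
}
```

In a real implementation the sizes would come from the footers of the input 
files, and the threshold would more likely be configurable than hard-coded.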

Reasoning:

Many users are tempted to use the new {{parquet-tools merge}} functionality 
because they want good performance, which has historically been associated 
with large Parquet files. In practice, however, Hive performance won't change 
significantly after using {{parquet-tools merge}}, and Impala performance will 
be much worse. The reason is that good performance comes from large row groups 
(up to the HDFS block size), not from large files.

However, {{parquet-tools merge}} does not merge row groups; it just places them 
one after the other. It was intended to be used for Parquet files that are 
already arranged in row groups of the desired size. When used to merge many 
small files, the resulting file will still contain small row groups and one 
loses most of the advantages of larger files (the only one that remains is that 
it takes a single HDFS operation to read them).
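
The effect described above can be illustrated with a simplified model in which 
each file is just a list of row group sizes in bytes. The class and method 
names are hypothetical, and the model ignores footers and other metadata; it 
is a sketch of the behaviour described in this issue, not the actual merge 
code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical model for illustration only; not part of parquet-tools.
final class MergeModel {
    /** Places the row groups of all input files one after the other, unchanged. */
    static List<Long> mergeByConcatenation(List<List<Long>> files) {
        List<Long> merged = new ArrayList<>();
        for (List<Long> rowGroups : files) {
            merged.addAll(rowGroups);
        }
        return merged;
    }

    /** Sums the row group sizes, approximating the data size of the file. */
    static long totalBytes(List<Long> rowGroups) {
        long total = 0;
        for (long size : rowGroups) {
            total += size;
        }
        return total;
    }

    public static void main(String[] args) {
        long oneMiB = 1024L * 1024;
        // Three small files, each holding a single 1 MiB row group.
        List<List<Long>> files = Arrays.asList(
                Arrays.asList(oneMiB), Arrays.asList(oneMiB), Arrays.asList(oneMiB));
        List<Long> merged = mergeByConcatenation(files);
        System.out.println("row groups: " + merged.size());
        System.out.println("largest row group: " + Collections.max(merged));
        System.out.println("total bytes: " + totalBytes(merged));
    }
}
```

In this model the merged file is three times larger than each input, but its 
largest row group is still 1 MiB, so the file grows while the row groups do 
not.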



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
