[jira] [Commented] (HIVE-6134) Merging small files based on file size only works for CTAS queries

Xuefu Zhang (JIRA) Mon, 06 Jan 2014 11:27:27 -0800

    [ 
https://issues.apache.org/jira/browse/HIVE-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13863294#comment-13863294
 ]


Xuefu Zhang commented on HIVE-6134:
-----------------------------------

[~ericchu30] Merging or concatenating files for a table/partition makes more 
sense because the table/partition will likely be read over and over again. On 
the other hand, merging small files produced by a query whose result is not 
permanently stored, while helping your case, adds extra cost to the query 
execution, which is probably not a good idea for every query. If we choose to 
merge selectively, then the challenge is knowing when to merge the result.

If the user knows best when a merge is needed, then we can extend the Hive 
syntax to provide a construct such as "SELECTM col1, col2 FROM table1", but 
from your description, the users may not be able to tell in advance. They will 
not know until the query fails. This "select and merge" approach is close to 
your workaround of a temp table, right?

Having too many partitions poses many challenges, including the problem you're 
facing. I'd suggest you revisit your partition strategy and try to reduce the 
number of partitions that a query would involve.

> Merging small files based on file size only works for CTAS queries
> ------------------------------------------------------------------
>
>                 Key: HIVE-6134
>                 URL: https://issues.apache.org/jira/browse/HIVE-6134
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 0.8.0, 0.10.0, 0.11.0, 0.12.0
>            Reporter: Eric Chu
>
> According to the documentation, if we set hive.merge.mapfiles to true, Hive 
> will launch an additional MR job to merge the small output files at the end 
> of a map-only job when the average output file size is smaller than 
> hive.merge.smallfiles.avgsize. Similarly, by setting hive.merge.mapredfiles 
> to true, Hive will merge the output files of a map-reduce job. 
> My expectation is that this is true for all MR queries. However, my 
> observation is that this is only true for CTAS queries. In 
> GenMRFileSink1.java, HIVEMERGEMAPFILES and HIVEMERGEMAPREDFILES are only used 
> if ((ctx.getMvTask() != null) && (!ctx.getMvTask().isEmpty())). So, for a 
> regular SELECT query that doesn't have move tasks, these properties are not 
> used.
> Is my understanding correct and if so, what's the reasoning behind the logic 
> of not supporting this for regular SELECT queries? It seems to me that this 
> should be supported for regular SELECT queries as well. One scenario where 
> this hits us hard is when users try to download the result in HUE, and HUE 
> times out b/c there are thousands of output files. The workaround is to 
> re-run the query as CTAS, but it's a significant time sink.
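
The behavior described in the report can be modeled roughly as follows. This 
is only a simplified sketch, not Hive's actual code: the class name, the method 
shouldMerge, and all of its parameters are hypothetical stand-ins for 
hive.merge.mapfiles, hive.merge.mapredfiles, hive.merge.smallfiles.avgsize, and 
the move-task check quoted from GenMRFileSink1.java. It assumes the merge 
decision reduces to "a move task exists, the relevant flag is on, and the 
average output file size is below the threshold".

```java
import java.util.Arrays;
import java.util.List;

public class MergeDecisionSketch {

    /**
     * Hypothetical model of the merge decision described in the report.
     * None of these names come from Hive itself.
     */
    static boolean shouldMerge(boolean mapOnlyJob,
                               boolean mergeMapFiles,     // stands in for hive.merge.mapfiles
                               boolean mergeMapRedFiles,  // stands in for hive.merge.mapredfiles
                               long avgSizeThreshold,     // stands in for hive.merge.smallfiles.avgsize
                               List<Long> outputFileSizes,
                               boolean hasMoveTask) {
        // Reported behavior: without a move task (a plain SELECT),
        // the merge settings are never consulted.
        if (!hasMoveTask) {
            return false;
        }
        // Pick the flag that applies to this kind of job.
        boolean enabled = mapOnlyJob ? mergeMapFiles : mergeMapRedFiles;
        if (!enabled || outputFileSizes.isEmpty()) {
            return false;
        }
        long total = 0;
        for (long size : outputFileSizes) {
            total += size;
        }
        long average = total / outputFileSizes.size();
        // Merge only when the average output file is smaller than the threshold.
        return average < avgSizeThreshold;
    }

    public static void main(String[] args) {
        List<Long> smallFiles = Arrays.asList(1024L, 2048L, 4096L);
        long threshold = 16_000_000L; // illustrative value only

        // CTAS-like case: a move task is present.
        System.out.println(shouldMerge(true, true, false, threshold, smallFiles, true));
        // Plain SELECT case: same files and settings, but no move task.
        System.out.println(shouldMerge(true, true, false, threshold, smallFiles, false));
    }
}
```

Under this model the two calls in main differ only in hasMoveTask, which is the 
gap the report points out between CTAS and regular SELECT queries.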



