[ https://issues.apache.org/jira/browse/HIVE-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13863599#comment-13863599 ]
Xuefu Zhang commented on HIVE-6134:
-----------------------------------

[~ericchu30] I get what you're saying. However, I'm not sure these properties are enough for queries that generate small files. For one thing, a query that generates two small files is different from one that generates thousands: we may not want to spawn a job just to merge two small files for a query result, whereas merging two files may be acceptable for a table's files. Thus, I'm not totally convinced that these properties are equally applicable to the scenarios you described.

I agree with Ashutosh's comment in general. However, if the same problem also occurs with other clients such as JDBC, or if compacting makes their lives much easier, then it might make sense to fix it in Hive. Still, I am not convinced that we can reuse the above-mentioned properties for such a purpose.

> Merging small files based on file size only works for CTAS queries
> ------------------------------------------------------------------
>
>                 Key: HIVE-6134
>                 URL: https://issues.apache.org/jira/browse/HIVE-6134
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 0.8.0, 0.10.0, 0.11.0, 0.12.0
>            Reporter: Eric Chu
>
> According to the documentation, if we set hive.merge.mapfiles to true, Hive
> will launch an additional MR job to merge the small output files at the end
> of a map-only job when the average output file size is smaller than
> hive.merge.smallfiles.avgsize. Similarly, by setting hive.merge.mapredfiles
> to true, Hive will merge the output files of a map-reduce job.
> My expectation is that this is true for all MR queries. However, my
> observation is that this is only true for CTAS queries. In
> GenMRFileSink1.java, HIVEMERGEMAPFILES and HIVEMERGEMAPREDFILES are only used
> if ((ctx.getMvTask() != null) && (!ctx.getMvTask().isEmpty())). So, for a
> regular SELECT query that doesn't have move tasks, these properties are not
> used.
> Is my understanding correct and if so, what's the reasoning behind the logic
> of not supporting this for regular SELECT queries? It seems to me that this
> should be supported for regular SELECT queries as well. One scenario where
> this hits us hard is when users try to download the result in HUE, and HUE
> times out b/c there are thousands of output files. The workaround is to
> re-run the query as CTAS, but it's a significant time sink.

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)