[ https://issues.apache.org/jira/browse/HIVE-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13861870#comment-13861870 ]
Eric Chu commented on HIVE-6134:
--------------------------------

Thanks Xuefu for the quick response! A few questions/comments:

1. Could you elaborate on why you think it makes sense to merge small files only for queries that result in a new table? Alternatively, what are the issues with supporting these properties for regular queries? I'd love to have this support for regular queries, unless there's a strong reason against it.

2. If these properties are indeed designed only for queries that result in a new table, then we should mention that in the documentation. Currently it's misleading: it sounds like they'd work for regular queries as well.

3. The main pain point here is that users won't know that there are many output files until AFTER the query is run. Imagine analysts who don't know these details and for whom Hue is the only query interface. It's frustrating and time-consuming to run a long-running query in Hue, only to find out they can't get the results b/c Hue times out trying to read so many small files, and so they have to run the query again as CTAS. Creating a table just so they can download the result seems like overkill.

4. Do you have a suggestion for the aforementioned Hue issue? Hue starts timing out when the query results in thousands of small output files. This is a major pain point for our analysts today.

> Merging small files based on file size only works for CTAS queries
> ------------------------------------------------------------------
>
>                 Key: HIVE-6134
>                 URL: https://issues.apache.org/jira/browse/HIVE-6134
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 0.8.0, 0.10.0, 0.11.0, 0.12.0
>            Reporter: Eric Chu
>
> According to the documentation, if we set hive.merge.mapfiles to true, Hive
> will launch an additional MR job to merge the small output files at the end
> of a map-only job when the average output file size is smaller than
> hive.merge.smallfiles.avgsize. Similarly, by setting hive.merge.mapredfiles
> to true, Hive will merge the output files of a map-reduce job.
> My expectation is that this is true for all MR queries. However, my
> observation is that this is only true for CTAS queries. In
> GenMRFileSink1.java, HIVEMERGEMAPFILES and HIVEMERGEMAPREDFILES are only used
> if ((ctx.getMvTask() != null) && (!ctx.getMvTask().isEmpty())). So, for a
> regular SELECT query that doesn't have move tasks, these properties are not
> used.
> Is my understanding correct and if so, what's the reasoning behind the logic
> of not supporting this for regular SELECT queries? It seems to me that this
> should be supported for regular SELECT queries as well. One scenario where
> this hits us hard is when users try to download the result in HUE, and HUE
> times out b/c there are thousands of output files. The workaround is to
> re-run the query as CTAS, but it's a significant time sink.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)