[ https://issues.apache.org/jira/browse/HIVE-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13962335#comment-13962335 ]
Eric Chu commented on HIVE-6134:
--------------------------------

Hi [~xuefuz] and [~ashutoshc], it turns out this issue affects not only Hue but also the Hive CLI: results don't show up in the CLI until more than a minute has passed, with timeout errors for connections to nodes.

I'm trying to make the change myself in GenMRFileSink1.java to support a new property that, when turned on, makes Hive merge files for a regular (i.e., without mvTask) map-only job that uses more than X mappers (X being set by another new property). I'm wondering whether, and how, we can find out the number of mappers that will be used for that job at that stage of the optimization. I want to set chDir to true when this number is greater than the threshold set via that property. I notice that currWork.getMapWork().getNumMapTasks() actually returns null. Can you give me some pointers?

> Merging small files based on file size only works for CTAS queries
> ------------------------------------------------------------------
>
>                 Key: HIVE-6134
>                 URL: https://issues.apache.org/jira/browse/HIVE-6134
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 0.8.0, 0.10.0, 0.11.0, 0.12.0
>            Reporter: Eric Chu
>
> According to the documentation, if we set hive.merge.mapfiles to true, Hive
> will launch an additional MR job to merge the small output files at the end
> of a map-only job when the average output file size is smaller than
> hive.merge.smallfiles.avgsize. Similarly, by setting hive.merge.mapredfiles
> to true, Hive will merge the output files of a map-reduce job.
> My expectation is that this is true for all MR queries. However, my
> observation is that this is only true for CTAS queries. In
> GenMRFileSink1.java, HIVEMERGEMAPFILES and HIVEMERGEMAPREDFILES are only used
> if ((ctx.getMvTask() != null) && (!ctx.getMvTask().isEmpty())). So, for a
> regular SELECT query that doesn't have move tasks, these properties are not
> used.
> Is my understanding correct and if so, what's the reasoning behind the logic
> of not supporting this for regular SELECT queries? It seems to me that this
> should be supported for regular SELECT queries as well. One scenario where
> this hits us hard is when users try to download the result in HUE, and HUE
> times out b/c there are thousands of output files. The workaround is to
> re-run the query as CTAS, but it's a significant time sink.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
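
For illustration only, the decision described in the comment above could be sketched roughly as follows. This is plain Java with no Hive dependencies, not the actual GenMRFileSink1 code. The parameters mergeWithoutMoveTask and mapperThreshold are hypothetical stand-ins for the two proposed properties (the comment does not name them), and estimatedMappers is a nullable Integer so the sketch can cover the case where the mapper count is not known at plan time, as with the null from getNumMapTasks() reported above. Treating an unknown count as "do not merge" is an assumption of this sketch, not something the comment decides.

```java
public class MergeDecisionSketch {

    /**
     * Returns true when a job's output should be considered for merging.
     *
     * mergeFlagEnabled     stands in for hive.merge.mapfiles on a map-only job
     * hasMoveTask          true when the query has move tasks (e.g. CTAS)
     * mergeWithoutMoveTask hypothetical new property: allow merging without move tasks
     * mapperThreshold      hypothetical new property: the "X mappers" limit
     * estimatedMappers     mapper count if known at plan time, otherwise null
     */
    static boolean isMergeCandidate(boolean mergeFlagEnabled,
                                    boolean hasMoveTask,
                                    boolean mergeWithoutMoveTask,
                                    int mapperThreshold,
                                    Integer estimatedMappers) {
        if (!mergeFlagEnabled) {
            return false; // merging is switched off entirely
        }
        if (hasMoveTask) {
            return true; // existing path: merge flags are honoured when move tasks exist
        }
        if (!mergeWithoutMoveTask) {
            return false; // current behaviour for a regular SELECT
        }
        if (estimatedMappers == null) {
            return false; // assumption: unknown mapper count means no merge
        }
        return estimatedMappers > mapperThreshold;
    }

    public static void main(String[] args) {
        // Regular SELECT, new behaviour on, 5000 mappers against a limit of 1000.
        System.out.println(isMergeCandidate(true, false, true, 1000, 5000));
        // Same query, but the mapper count is not available at plan time.
        System.out.println(isMergeCandidate(true, false, true, 1000, null));
    }
}
```

The open question in the comment is where the value passed as estimatedMappers would come from at that stage of the optimization; the sketch leaves that unanswered and only shows how the threshold check would combine with the existing move-task condition.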