[ 
https://issues.apache.org/jira/browse/HIVE-439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722680#action_12722680
 ] 

Namit Jain commented on HIVE-439:
---------------------------------

Had a discussion with Dhruba regarding the default file size as well:

1. In the case of an identity select, there is no map-reduce job, and 
therefore no merging is required.
2. There is no harm in having bigger files as far as the name node is 
concerned. The only problem is that a bigger size results in fewer reducers, 
thereby increasing the time for merging. However, a 10GB file should take 
~20 minutes on most installations, so that should be OK.
3. Consider a filter that selects most of the rows: select * from T where ... 
If T is a big table, we don't want to break the selected rows into small 
pieces of T. So, a big size is better.
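
To illustrate the trade-off in point 2, here is a rough sketch of how the target file size drives the number of merge reducers. It assumes the merge job uses one reducer per merged output file, which is an assumption for illustration and not necessarily what the patch does. The helper name merge_reducer_count and the 100GB total are hypothetical.

```python
GB = 1024 ** 3
MB = 1024 ** 2


def merge_reducer_count(total_bytes: int, target_file_bytes: int) -> int:
    """Hypothetical helper: reducers needed if each reducer writes one merged file.

    Assumes one reducer per output file of roughly target_file_bytes.
    """
    if total_bytes < 0:
        raise ValueError("total_bytes must be non-negative")
    if target_file_bytes <= 0:
        raise ValueError("target_file_bytes must be positive")
    # Integer ceiling division, with at least one reducer for any output.
    return max(1, (total_bytes + target_file_bytes - 1) // target_file_bytes)


if __name__ == "__main__":
    total = 100 * GB  # illustrative total output size of the map-only job
    for target in (256 * MB, 1 * GB, 10 * GB):
        reducers = merge_reducer_count(total, target)
        print(f"target={target // MB}MB -> {reducers} merge reducer(s)")
```

A bigger target size means fewer reducers, each handling more data, which is why the merge takes longer per file but leaves fewer files for the name node to track.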

I am generating the patch after resolving conflicts - will upload again

> merge small files after a map-only job
> --------------------------------------
>
>                 Key: HIVE-439
>                 URL: https://issues.apache.org/jira/browse/HIVE-439
>             Project: Hadoop Hive
>          Issue Type: Improvement
>          Components: Query Processor
>    Affects Versions: 0.3.0, 0.3.1
>            Reporter: Namit Jain
>            Assignee: Namit Jain
>             Fix For: 0.4.0
>
>         Attachments: hive.439.1.patch, hive.439.2.patch, hive.439.3.patch, 
> hive.439.4.patch
>
>
> There are cases when the input to a Hive job is thousands of small files. In 
> this case, there is a mapper for each file. Most of the overhead for spawning 
> all these mappers can be avoided if these small files are combined into fewer 
> larger files.
> The problem can also be addressed by having a mapper span multiple blocks as 
> in:
> https://issues.apache.org/jira/browse/HIVE-74
> But it also makes sense in HIVE to merge files whenever possible.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.