[jira] Updated: (HIVE-439) merge small files after a map-only job

Zheng Shao (JIRA) Fri, 31 Jul 2009 00:51:08 -0700

     [ 
https://issues.apache.org/jira/browse/HIVE-439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Zheng Shao updated HIVE-439:
----------------------------

    Description: 
There are cases when the input to a Hive job consists of thousands of small 
files. In this case, a mapper is spawned for each file. Most of the overhead of 
spawning all these mappers can be avoided if the small files are combined into 
fewer, larger files.

The problem can also be addressed by having a mapper span multiple blocks as in:

https://issues.apache.org/jira/browse/HIVE-74


But it also makes sense in Hive to merge files whenever possible.

{code}
<property>
  <name>hive.merge.mapfiles</name>
  <value>true</value>
  <description>Merge small files at the end of the job</description>
</property>

<property>
  <name>hive.merge.size.per.task</name>
  <value>256000000</value>
  <description>Size of merged files at the end of the job</description>
</property>
{code}
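
A minimal sketch of how the two settings above might sit together in a site 
configuration file. The enclosing <configuration> element and the hive-site.xml 
file name are assumptions based on the usual Hadoop-style configuration layout, 
not something stated in this issue; the property names and values simply repeat 
the ones above.

{code}
<?xml version="1.0"?>
<!-- Assumed location: conf/hive-site.xml (not stated in this issue) -->
<configuration>
  <!-- Turn on merging of small output files at the end of the job -->
  <property>
    <name>hive.merge.mapfiles</name>
    <value>true</value>
  </property>

  <!-- Size of merged files at the end of the job (presumably bytes, about 256 MB) -->
  <property>
    <name>hive.merge.size.per.task</name>
    <value>256000000</value>
  </property>
</configuration>
{code}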


  was:
There are cases when the input to a Hive job consists of thousands of small 
files. In this case, a mapper is spawned for each file. Most of the overhead of 
spawning all these mappers can be avoided if the small files are combined into 
fewer, larger files.

The problem can also be addressed by having a mapper span multiple blocks as in:

https://issues.apache.org/jira/browse/HIVE-74


But it also makes sense in Hive to merge files whenever possible.

{code}
<property>
  <name>hive.merge.mapfiles</name>
  <value>true</value>
  <description>Merge small files at the end of the job</description>
</property>

<property>
  <name>hive.merge.size.per.mapper</name>
  <value>1000000000</value>
  <description>Size of merged files at the end of the job</description>
</property>
{code}



> merge small files after a map-only job
> --------------------------------------
>
>                 Key: HIVE-439
>                 URL: https://issues.apache.org/jira/browse/HIVE-439
>             Project: Hadoop Hive
>          Issue Type: Improvement
>          Components: Query Processor
>    Affects Versions: 0.3.0, 0.3.1
>            Reporter: Namit Jain
>            Assignee: Namit Jain
>             Fix For: 0.4.0
>
>         Attachments: hive.439.1.patch, hive.439.2.patch, hive.439.3.patch, 
> hive.439.4.patch, hive.439.5.patch
>
>
> There are cases when the input to a Hive job consists of thousands of small 
> files. In this case, a mapper is spawned for each file. Most of the overhead 
> of spawning all these mappers can be avoided if the small files are combined 
> into fewer, larger files.
> The problem can also be addressed by having a mapper span multiple blocks as 
> in:
> https://issues.apache.org/jira/browse/HIVE-74
> But it also makes sense in Hive to merge files whenever possible.
> {code}
> <property>
>   <name>hive.merge.mapfiles</name>
>   <value>true</value>
>   <description>Merge small files at the end of the job</description>
> </property>
> <property>
>   <name>hive.merge.size.per.task</name>
>   <value>256000000</value>
>   <description>Size of merged files at the end of the job</description>
> </property>
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
