[jira] [Commented] (HIVE-8043) Support merging small files [Spark Branch]

Rui Li (JIRA) Wed, 17 Sep 2014 08:03:53 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-8043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14137365#comment-14137365
 ]


Rui Li commented on HIVE-8043:
------------------------------

Hi [~xuefuz],

I looked into the patch in HIVE-7704. My understanding is that the newly added 
operator, mapper, etc. are only for (fast) merging of RC and ORC files. Other 
file formats will still be merged by the {{TS -> FS}} work. For RC and ORC 
files, this work is a {{MergeFileWork}}; for other formats, it is a 
{{MapWork}}. Depending on the execution engine, this work is then wrapped in a 
{{MapredWork}}, {{TezWork}} or {{SparkWork}}.

For RC and ORC files, {{MergeFileMapper}} is used instead of {{ExecMapper}}. 
The main difference between the two mappers is that {{MergeFileMapper}} wraps 
and uses {{AbstractFileMergeOperator}} (which has two implementations, one for 
RC files and one for ORC files) as the top operator, while {{ExecMapper}} uses 
{{MapOperator}}.
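
To make the mapping above concrete, here is a minimal toy sketch of it in Java. It is not Hive code: {{MergePlanSketch}}, its enums and its methods are hypothetical names made up for illustration, and the returned strings are just the class names mentioned in this comment, based on my reading of the patch.

```java
import java.util.EnumSet;
import java.util.Set;

/** Toy model of the merge-work selection described above; not Hive's actual classes. */
final class MergePlanSketch {
    // Hypothetical enums, introduced only for this illustration.
    enum FileFormat { RCFILE, ORC, TEXT, SEQUENCEFILE }
    enum Engine { MR, TEZ, SPARK }

    // Formats that, per the comment, take the fast file-level merge path.
    private static final Set<FileFormat> FAST_MERGE =
            EnumSet.of(FileFormat.RCFILE, FileFormat.ORC);

    /** Name of the work object used for the merge step. */
    static String workFor(FileFormat format) {
        return FAST_MERGE.contains(format) ? "MergeFileWork" : "MapWork";
    }

    /** Name of the mapper that runs that work. */
    static String mapperFor(FileFormat format) {
        return FAST_MERGE.contains(format) ? "MergeFileMapper" : "ExecMapper";
    }

    /** Name of the top operator used by that mapper. */
    static String topOperatorFor(FileFormat format) {
        return FAST_MERGE.contains(format) ? "AbstractFileMergeOperator" : "MapOperator";
    }

    /** Name of the engine-specific wrapper around the merge work. */
    static String wrapperFor(Engine engine) {
        switch (engine) {
            case MR: return "MapredWork";
            case TEZ: return "TezWork";
            case SPARK: return "SparkWork";
            default: throw new IllegalArgumentException("unknown engine: " + engine);
        }
    }

    public static void main(String[] args) {
        // Print the selection for every format, using Spark as the engine.
        for (FileFormat format : FileFormat.values()) {
            System.out.println(format + ": " + workFor(format)
                    + " run by " + mapperFor(format)
                    + " (top operator " + topOperatorFor(format) + ")"
                    + ", wrapped in " + wrapperFor(Engine.SPARK));
        }
    }
}
```

The point of the sketch is only that the file format decides the work, mapper and top operator, while the engine decides the wrapper; the two choices are independent.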

I think the following needs to be considered on the Spark side:
* For file formats other than RC and ORC, I think it should work out of the 
box, at least for simple cases. We may need to take extra care with 
dynamically partitioned tables, multi-insert queries, union queries, etc. I 
tested some simple insert queries where I increased {{mapreduce.job.reduces}} 
to generate many small files. With {{hive.merge.sparkfiles=false}}, the 
destination table consists of all these small files; with it set to {{true}}, 
the small files get merged. I noticed that the merging feature caused an issue 
in HIVE-7810. I'll verify whether it's still a problem now that union-remove is 
disabled for Spark.
* For RC and ORC files, we need to be aware of the {{MergeFileWork}}. Since 
{{SparkMapRecordHandler}} is our counterpart of {{ExecMapper}}, we'll need 
another record handler as the counterpart of {{MergeFileMapper}}, and maybe 
another Hive function as well. I'm working on an implementation so that I can 
run some tests.
* MR distinguishes between map-only and map-reduce jobs for merging. I'm not 
sure whether we should do something similar for Spark.
* Besides, there seem to be two scenarios where merging is needed: at the end 
of a job (map-only or map-reduce), and in a DDL task. I'll investigate this 
further.
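
As a rough illustration of the second point, the choice of record handler on the Spark side could look something like the toy Java sketch below. This is only an assumption about the shape of the dispatch, not the actual implementation: {{RecordHandlerSketch}}, {{WorkKind}}, {{Handler}}, {{MapHandlerSketch}} and {{MergeFileHandlerSketch}} are all hypothetical names, with {{MapHandlerSketch}} standing in for {{SparkMapRecordHandler}} and {{MergeFileHandlerSketch}} standing in for the not-yet-written counterpart of {{MergeFileMapper}}.

```java
/** Toy dispatch sketch; every type in here is hypothetical, not part of Hive. */
final class RecordHandlerSketch {
    // The two kinds of merge work mentioned in this comment.
    enum WorkKind { MAP_WORK, MERGE_FILE_WORK }

    /** Minimal stand-in for a record handler. */
    interface Handler {
        String describe();
    }

    /** Stands in for SparkMapRecordHandler, the counterpart of ExecMapper. */
    static final class MapHandlerSketch implements Handler {
        @Override
        public String describe() {
            return "row-level handler, counterpart of ExecMapper";
        }
    }

    /** Stands in for the new handler that would be the counterpart of MergeFileMapper. */
    static final class MergeFileHandlerSketch implements Handler {
        @Override
        public String describe() {
            return "file-level merge handler, counterpart of MergeFileMapper";
        }
    }

    /** Picks a handler based on the kind of work being executed. */
    static Handler handlerFor(WorkKind kind) {
        switch (kind) {
            case MAP_WORK: return new MapHandlerSketch();
            case MERGE_FILE_WORK: return new MergeFileHandlerSketch();
            default: throw new IllegalArgumentException("unknown work kind: " + kind);
        }
    }

    public static void main(String[] args) {
        for (WorkKind kind : WorkKind.values()) {
            System.out.println(kind + " -> " + handlerFor(kind).describe());
        }
    }
}
```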

Any ideas or suggestions are appreciated. Thanks.

> Support merging small files [Spark Branch]
> ------------------------------------------
>
>                 Key: HIVE-8043
>                 URL: https://issues.apache.org/jira/browse/HIVE-8043
>             Project: Hive
>          Issue Type: Task
>          Components: Spark
>            Reporter: Xuefu Zhang
>            Assignee: Rui Li
>              Labels: Spark-M1
>
> Hive currently supports merging small files with MR as the execution engine. 
> There are options available for this, such as 
> {code}
> hive.merge.mapfiles
> hive.merge.mapredfiles
> {code}
> {{hive.merge.sparkfiles}} was already introduced in HIVE-7810. To make it 
> work, we might need a little more research and design.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
