[ https://issues.apache.org/jira/browse/HIVE-8043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14137365#comment-14137365 ]
Rui Li commented on HIVE-8043:
------------------------------

Hi [~xuefuz],

I looked into the patch in HIVE-7704. My understanding is that the newly added operator, mapper, etc. are only for (fast) merging of RC and ORC files. Other file formats will still be merged by the {{TS -> FS}} work. For RC and ORC files this work is a {{MergeFileWork}}; for other formats it is a {{MapWork}}. Depending on the execution engine, this work is then wrapped in a {{MapredWork}}, {{TezWork}} or {{SparkWork}}.

For RC and ORC files, {{MergeFileMapper}} is used instead of {{ExecMapper}}. The main difference between the two mappers is that {{MergeFileMapper}} wraps and uses {{AbstractFileMergeOperator}} (which has two implementations, one for RC files and one for ORC files) as the top operator, while {{ExecMapper}} uses {{MapOperator}}.

I think the following needs to be considered on the Spark side:
* For files other than RC and ORC, I think merging should work out of the box, at least for simple cases. We may need to take extra care with dynamically partitioned tables, multi-insert queries, union queries, etc. I tested some simple insert queries where I increased {{mapreduce.job.reduces}} to generate many small files. With {{hive.merge.sparkfiles=false}}, the destination table consists of all these small files; with it set to true, all the small files get merged. I noticed that the merging feature caused an issue in HIVE-7810. I'll verify whether it is still a problem now that union-remove is disabled for Spark.
* For RC and ORC files, we need to be aware of {{MergeFileWork}}. Since {{SparkMapRecordHandler}} is our counterpart of {{ExecMapper}}, we'll need another record handler as the counterpart of {{MergeFileMapper}}, and maybe another Hive function as well. I'm working on implementing this so that I can run some tests.
* MR distinguishes between map-only and map-reduce jobs for merging. I'm not sure whether we should do something similar for Spark.
* Besides, there seem to be two scenarios where merging is needed: at the end of a job (map-only or map-reduce), and in a DDL task.
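To make the split described above concrete, here is a minimal Java sketch of how I read it. This is only an illustration and not Hive code: {{MergePlanSketch}}, its enums, its methods and the format labels {{RCFILE}} / {{ORC}} are hypothetical names I made up. Only the class names returned as strings ({{MergeFileWork}}, {{MapWork}}, {{MergeFileMapper}}, {{ExecMapper}}, {{AbstractFileMergeOperator}}, {{MapOperator}}, {{MapredWork}}, {{TezWork}}, {{SparkWork}}) come from the discussion, and the mapper names describe the MR path, not the Spark record handlers.

```java
import java.util.Locale;

// Hypothetical sketch: none of these types or methods exist in Hive.
// They only model the choice between the two kinds of merge work.
final class MergePlanSketch {
    enum Engine { MR, TEZ, SPARK }
    enum WorkKind { MERGE_FILE_WORK, MAP_WORK }

    // RC and ORC files get the fast file-level merge; every other
    // format falls back to the regular TS -> FS work (a MapWork).
    // The labels "RCFILE" and "ORC" are assumed for this sketch.
    static WorkKind chooseWork(String fileFormat) {
        String f = fileFormat.toUpperCase(Locale.ROOT);
        boolean fastMerge = f.equals("RCFILE") || f.equals("ORC");
        return fastMerge ? WorkKind.MERGE_FILE_WORK : WorkKind.MAP_WORK;
    }

    static String workName(WorkKind kind) {
        return kind == WorkKind.MERGE_FILE_WORK ? "MergeFileWork" : "MapWork";
    }

    // Mapper used on the MR path for each kind of work.
    static String mapperFor(WorkKind kind) {
        return kind == WorkKind.MERGE_FILE_WORK ? "MergeFileMapper" : "ExecMapper";
    }

    // Top operator driven by that mapper.
    static String topOperatorFor(WorkKind kind) {
        return kind == WorkKind.MERGE_FILE_WORK
                ? "AbstractFileMergeOperator"
                : "MapOperator";
    }

    // The merge work is wrapped according to the execution engine.
    static String wrapperFor(Engine engine) {
        switch (engine) {
            case MR:    return "MapredWork";
            case TEZ:   return "TezWork";
            case SPARK: return "SparkWork";
            default:    throw new IllegalArgumentException("unknown engine: " + engine);
        }
    }

    // One-line summary of the plan for a given format and engine.
    static String describe(String fileFormat, Engine engine) {
        WorkKind kind = chooseWork(fileFormat);
        return wrapperFor(engine) + "(" + workName(kind) + ") via "
                + mapperFor(kind) + " -> " + topOperatorFor(kind);
    }

    public static void main(String[] args) {
        System.out.println(describe("ORC", Engine.SPARK));
        System.out.println(describe("TextFile", Engine.MR));
    }
}
```

On the Spark side, the {{MapWork}} branch already has a counterpart in {{SparkMapRecordHandler}}, while the {{MergeFileWork}} branch is the part that still needs a new record handler.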

I'll investigate this further. Any ideas or suggestions are appreciated. Thanks.

> Support merging small files [Spark Branch]
> ------------------------------------------
>
>                 Key: HIVE-8043
>                 URL: https://issues.apache.org/jira/browse/HIVE-8043
>             Project: Hive
>          Issue Type: Task
>          Components: Spark
>            Reporter: Xuefu Zhang
>            Assignee: Rui Li
>              Labels: Spark-M1
>
> Hive currently supports merging small files with MR as the execution engine.
> There are options available for this, such as
> {code}
> hive.merge.mapfiles
> hive.merge.mapredfiles
> {code}
> {{hive.merge.sparkfiles}} was already introduced in HIVE-7810. To make it work, we
> might need a little more research and design.