On Tue, Jun 22, 2010 at 8:57 PM, Zhou Shuaifeng
<[email protected]> wrote:
>
> Hi Edward,
> You said that there may be a "chain" of many map/reduce jobs. Is this 
> realized by the class "Chain"? (org.apache.hadoop.mapred.lib)
>
> And I think it could save jobs if, in the chain, the output of one
> map/reduce job can be the input of many other jobs; that would be more
> efficient. This means the "chain" has many branches, with many jobs
> sharing the same input. The structure of such a chain is like a tree.
>
> If it is only a simple chain, I think there is no efficiency gain.
>
> So, what's your opinion?
>
> Regards,
> Zhou
>
>
> ________________________________
> From: Edward Capriolo [mailto:[email protected]]
> Sent: Tuesday, June 22, 2010 11:32 PM
> To: [email protected]
> Cc: [email protected]
> Subject: Re: hive Multi Table/File Inserts questions
>
>
> On Tue, Jun 22, 2010 at 2:55 AM, Zhou Shuaifeng <[email protected]> 
> wrote:
>>
>> Hi, when I use Multi Table/File Insert commands, some may be no more
>> efficient than running the insert commands separately.
>>
>> For example,
>>
>>     from pokes
>>     insert overwrite table pokes_count
>>     select bar,count(foo) group by bar
>>     insert overwrite table pokes_sum
>>     select bar,sum(foo) group by bar;
>>
>> To execute this, 2 map/reduce jobs are needed, which is no fewer than
>> running the two commands separately:
>>
>>     insert overwrite table pokes_count select bar,count(foo) from 
>> pokes group by bar;
>>     insert overwrite table pokes_sum select bar,sum(foo) from 
>> pokes group by bar;
>>
>> And the time taken is the same.
>> But the first one seems to scan the table 'pokes' only once, so why are
>> 2 map/reduce jobs still needed? And why can't the time taken be less?
>> Is there any way to make it more efficient?
>>
>> Thanks a lot,
>> Zhou
>
> Zhou,
>
> In the case of simple selects and a few tables you are not going to see the 
> full benefit.
>
> Imagine some complex query was like this:
>
> from (
>   from (
>     select (table1 join table2 where x=6) t1
>   ) x
>   join table3 t3 on x.col1 = t3.col1
> ) y
>
> This could theoretically be a chain of thousands of map/reduce jobs. Then
> you would save jobs and time by evaluating the shared input only once.
>
> Also you are only testing with 2 output tables. What happens with 10 or 20? 
> Just curious.
>
> Regards,
> Edward

By "chain" I meant that a complex query can compile to multiple
map/reduce jobs (stages). Hive does not use ChainMapper or ChainReducer
(that I know of). Depending on the query, the output of one stage can be
the input to the next.
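
The staging idea can be sketched abstractly: each stage is one map/reduce
pass, and the output of one stage becomes the input of the next. A toy
Python sketch (purely illustrative; the data, functions, and stage logic
here are hypothetical and not how Hive's planner actually works):

```python
# Toy sketch of chained map/reduce stages: stage 2 consumes stage 1's output.
# Hypothetical example data and stage logic; not Hive internals.

def mapreduce(rows, map_fn, reduce_fn):
    """Run one 'stage': map each row to (key, value) pairs,
    group by key, then reduce each group."""
    groups = {}
    for row in rows:
        for key, value in map_fn(row):
            groups.setdefault(key, []).append(value)
    return [reduce_fn(key, values) for key, values in groups.items()]

rows = [("k1", "x"), ("k1", "y"), ("k2", "x")]

# Stage 1: group values by key (a join-like shuffle).
stage1 = mapreduce(rows, lambda r: [(r[0], r[1])], lambda k, vs: (k, vs))

# Stage 2: count how many keys have each group size, reading stage 1's output.
stage2 = mapreduce(stage1, lambda r: [(len(r[1]), 1)], lambda k, vs: (k, sum(vs)))

print(sorted(stage2))  # [(1, 1), (2, 1)]
```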

Let me give an example.
http://wiki.apache.org/hadoop/Hive/LanguageManual/Joins

  SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1)
  JOIN c ON (c.key = b.key2)

"there are two map/reduce jobs involved in computing the join."

I hope my syntax is OK, but this gets the point across.

FROM (
  SELECT a.val AS aval, b.val AS bval, c.val AS cval
  FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2)
) t
INSERT OVERWRITE TABLE a_count
SELECT t.aval, count(t.bval) GROUP BY t.aval
INSERT OVERWRITE TABLE b_count
SELECT t.aval, count(t.cval) GROUP BY t.aval;

= 4 jobs (2 for the join chain, plus 1 group-by job per insert)

INSERT OVERWRITE TABLE a_count
SELECT a.val, count(b.val)
FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2)
GROUP BY a.val;

INSERT OVERWRITE TABLE b_count
SELECT a.val, count(c.val)
FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2)
GROUP BY a.val;

= 6 map/reduce jobs (each query repeats the 2 join jobs, plus 1 group-by job)

With more outputs, the savings would grow.
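
The saving comes from reusing one scan of the shared input for several
outputs, which is what the original `FROM pokes INSERT ... INSERT ...`
question was getting at. A minimal Python sketch of that idea (toy data
and names are hypothetical; this is not Hive internals):

```python
# One pass over the input feeds two aggregations at once, the way a Hive
# multi-insert reads the source table a single time.
from collections import defaultdict

pokes = [("a", 1), ("a", 2), ("b", 3)]  # hypothetical (bar, foo) rows

pokes_count = defaultdict(int)  # bar -> count(foo)
pokes_sum = defaultdict(int)    # bar -> sum(foo)

for bar, foo in pokes:          # single scan of the input
    pokes_count[bar] += 1       # feeds the pokes_count output
    pokes_sum[bar] += foo       # feeds the pokes_sum output

print(dict(pokes_count))  # {'a': 2, 'b': 1}
print(dict(pokes_sum))    # {'a': 3, 'b': 3}
```

The scan is shared, but each aggregation is still its own reduce, which
is why the group-by jobs themselves are not eliminated.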
