On Tue, Jun 22, 2010 at 2:55 AM, Zhou Shuaifeng <[email protected]> wrote:
> Hi, when I use multi-table/file insert commands, some are no more
> efficient than running the insert commands separately.
>
> For example,
>
> from pokes
> insert overwrite table pokes_count
> select bar,count(foo) group by bar
> insert overwrite table pokes_sum
> select bar,sum(foo) group by bar;
>
> To execute this, two map/reduce jobs are needed, which is no fewer than
> running the two commands separately:
>
> insert overwrite table pokes_count select bar,count(foo) from
> pokes group by bar;
> insert overwrite table pokes_sum select bar,sum(foo) from pokes group
> by bar;
>
> And the time taken is the same.
> But the first statement seems to scan the table 'pokes' only once, so why
> are two map/reduce jobs still needed? And why can't the time taken be less?
> Is there any way to make it more efficient?
>
> Thanks a lot,
> Zhou
>
Zhou,
In the case of simple selects and a few tables you are not going to see the
full benefit.
Imagine some complex query like this (a rough sketch):
from (
  from (
    select t1.col1, t1.foo
    from table1 t1 join table2 t2 on (t1.key = t2.key)
    where t1.x = 6
  ) x
  join table3 t3 on (x.col1 = t3.col1)
  select x.col1, x.foo
) y
This could theoretically be a chain of thousands of map/reduce jobs. With a
multi-insert you would save jobs and time because that chain is evaluated
only once, no matter how many outputs consume it.
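For example, a multi-insert that reuses such a subquery might be sketched like this (table, column, and output names are hypothetical; it is the shared "from" clause that matters):

```sql
-- Sketch: the expensive join chain is evaluated once, then fanned out
-- to two output tables in a single statement.
FROM (
  SELECT x.col1, x.foo
  FROM (
    SELECT t1.col1, t1.foo
    FROM table1 t1 JOIN table2 t2 ON (t1.key = t2.key)
    WHERE t1.x = 6
  ) x
  JOIN table3 t3 ON (x.col1 = t3.col1)
) y
INSERT OVERWRITE TABLE out_count SELECT col1, count(foo) GROUP BY col1
INSERT OVERWRITE TABLE out_sum   SELECT col1, sum(foo)   GROUP BY col1;
```

The join chain itself may still take several map/reduce jobs, but it runs once regardless of how many INSERT clauses follow, instead of once per standalone statement.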
Also, you are only testing with 2 output tables. What happens with 10 or 20?
Just curious.
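To illustrate, fanning pokes out to several aggregates in one statement might look like this (pokes_min and pokes_max are hypothetical tables added for the sketch):

```sql
-- Sketch: one statement, several outputs. Each GROUP BY may still get
-- its own reduce stage, but the common input work is shared rather
-- than repeated once per standalone INSERT.
FROM pokes
INSERT OVERWRITE TABLE pokes_count SELECT bar, count(foo) GROUP BY bar
INSERT OVERWRITE TABLE pokes_sum   SELECT bar, sum(foo)   GROUP BY bar
INSERT OVERWRITE TABLE pokes_min   SELECT bar, min(foo)   GROUP BY bar
INSERT OVERWRITE TABLE pokes_max   SELECT bar, max(foo)   GROUP BY bar;
```

Run separately, those would be four full statements; as a multi-insert, the savings should grow with the number of outputs.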
Regards,
Edward