Add syntax to force a new mapreduce job / transform subquery in mapper
----------------------------------------------------------------------

                 Key: HIVE-836
                 URL: https://issues.apache.org/jira/browse/HIVE-836
             Project: Hadoop Hive
          Issue Type: Wish
            Reporter: Adam Kramer


Hive currently does a lot of awesome work to figure out when my transformers 
should be used in the mapper and when they should be used in the reducer. 
However, sometimes I have a different plan.

For example, consider this:

SELECT TRANSFORM(a.val1, a.val2)
USING './niftyscript'
AS part1, part2, part3
FROM (
    SELECT b.val AS val1, c.val AS val2
    FROM tblb b JOIN tblc c on (b.key=c.key)
) a

...in this syntax b and c will be joined (in the reducer, of course), and then 
the rows that pass the join clause will be passed to niftyscript _in the 
reducer._ However, when niftyscript is high-computation and there is a lot of 
data coming out of the join but very few reducers, there's a huge hold-up. It 
would be awesome if I could somehow force a new mapreduce step after the 
subquery, so that ./niftyscript is run in the mappers rather than the prior 
step's reducers.

The current workaround is to dump everything to a temporary table and then 
start over, but that does not scale easily. With a subquery, the mid-points 
are effectively (and easily) "locked" so no other job can touch them; a 
temporary table gives up that protection.
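
As a rough sketch of that workaround (tmp_joined is a made-up table name, and 
the STRING column types are an assumption, since the real types of b.val and 
c.val are not given here):

CREATE TABLE tmp_joined (val1 STRING, val2 STRING);

-- Job 1: the join runs in the reducers and lands in the temporary table.
INSERT OVERWRITE TABLE tmp_joined
SELECT b.val AS val1, c.val AS val2
FROM tblb b JOIN tblc c ON (b.key = c.key);

-- Job 2: a plain scan of the temporary table, so ./niftyscript
-- runs in the mappers.
SELECT TRANSFORM(t.val1, t.val2)
USING './niftyscript'
AS part1, part2, part3
FROM tmp_joined t;

DROP TABLE tmp_joined;

Between the INSERT and the DROP, tmp_joined is an ordinary table that any 
other job can read or overwrite.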

SUGGESTED FIX: Either cause MAP and REDUCE to force map/reduce steps (cf. 
https://issues.apache.org/jira/browse/HIVE-835 ), or add a query element to 
specify that "the job ends here." For example, in the above query, FROM a 
SELF-CONTAINED or PRECOMPUTE a or START JOB AFTER a or something like that.
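
For illustration only, the first of those spellings applied to the query 
above might look like this. SELF-CONTAINED is not an existing Hive keyword, 
and its exact placement here is a guess:

SELECT TRANSFORM(a.val1, a.val2)
USING './niftyscript'
AS part1, part2, part3
FROM (
    SELECT b.val AS val1, c.val AS val2
    FROM tblb b JOIN tblc c ON (b.key = c.key)
) a SELF-CONTAINED

The intent would be that the join finishes as its own job and ./niftyscript 
starts in the mappers of the next one.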

