Add syntax to force a new mapreduce job / transform subquery in mapper
----------------------------------------------------------------------
Key: HIVE-836
URL: https://issues.apache.org/jira/browse/HIVE-836
Project: Hadoop Hive
Issue Type: Wish
Reporter: Adam Kramer
Hive currently does a lot of awesome work to figure out when my transformers
should be used in the mapper and when they should be used in the reducer.
However, sometimes I have a different plan.
For example, consider this:
SELECT TRANSFORM(a.val1, a.val2)
  USING './niftyscript'
  AS part1, part2, part3
FROM (
  SELECT b.val AS val1, c.val AS val2
  FROM tblb b JOIN tblc c ON (b.key = c.key)
) a
In this query, b and c will be joined (in the reducer, of course), and then
the rows that pass the join clause will be passed to niftyscript _in the
reducer._ However, when niftyscript is high-computation and a lot of data
comes out of the join but there are very few reducers, there's a huge hold-up.
It would be awesome if I could somehow force a new mapreduce step after the
subquery, so that ./niftyscript is run in the mappers of the new step rather
than in the prior step's reducers.
The current workaround is to dump everything to a temporary table and then
start over, but that does not scale easily: the subquery structure effectively
(and easily) "locks" the mid-points so that no other job can touch the
intermediate data, whereas a temporary table does not.
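The temporary-table workaround described above might look roughly like the
following. This is only a sketch: the table name tmp_joined is made up for
illustration, and it assumes a Hive version that supports CREATE TABLE ... AS
SELECT (otherwise the table would need to be created with explicit column
types and filled with INSERT OVERWRITE TABLE).

```sql
-- Step 1: materialize the join on its own. The join still runs in the
-- reducers, but nothing else is attached to that reduce phase.
-- tmp_joined is a hypothetical name, not something from the original report.
CREATE TABLE tmp_joined AS
SELECT b.val AS val1, c.val AS val2
FROM tblb b JOIN tblc c ON (b.key = c.key);

-- Step 2: a separate query over the materialized rows. With no join or
-- aggregation in this query, the transform is expected to run in the mappers.
SELECT TRANSFORM(a.val1, a.val2)
  USING './niftyscript'
  AS part1, part2, part3
FROM tmp_joined a;

-- Step 3: clean up the intermediate table by hand.
DROP TABLE tmp_joined;
```

Between steps 1 and 3, tmp_joined is an ordinary table that any other job can
read or overwrite, which is the drawback relative to the subquery form.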
SUGGESTED FIX: Either cause MAP and REDUCE to force map/reduce steps (cf.
https://issues.apache.org/jira/browse/HIVE-835 ), or add a query element to
specify that "the job ends here." For example, in the above query, "FROM a
SELF-CONTAINED", "PRECOMPUTE a", "START JOB AFTER a", or something like that.