[ https://issues.apache.org/jira/browse/HIVE-836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Adam Kramer updated HIVE-836:
-----------------------------

    Description:
Hive currently does a lot of awesome work to figure out when my transformers should be used in the mapper and when they should be used in the reducer. However, sometimes I have a different plan.

For example, consider this:

{code:title=foo.sql}
SELECT TRANSFORM(a.val1, a.val2)
USING './niftyscript'
AS part1, part2, part3
FROM (
  SELECT b.val AS val1, c.val AS val2
  FROM tblb b JOIN tblc c on (b.key=c.key)
) a
{code}

...now, assume that the join step is very easy and 'niftyscript' is really processor intensive. The ideal format for this is an MR task with few mappers and few reducers, and then a second MR task with lots of mappers.

Currently, there is no way to even require that the outer TRANSFORM statement occur in a separate map phase. Implementing a "hint" such as /* +MAP */, akin to /* +MAPJOIN(x) */, would be awesome.

Current workaround is to dump everything to a temporary table and then start over, but that is not easy to scale--the subquery structure effectively (and easily) "locks" the mid-points so no other job can touch the table.

  was:
Hive currently does a lot of awesome work to figure out when my transformers should be used in the mapper and when they should be used in the reducer. However, sometimes I have a different plan.

For example, consider this:

SELECT TRANSFORM(a.val1, a.val2)
USING './niftyscript'
AS part1, part2, part3
FROM (
  SELECT b.val AS val1, c.val AS val2
  FROM tblb b JOIN tblc c on (b.key=c.key)
) a

...in this syntax b and c will be joined (in the reducer, of course), and then the rows that pass the join clause will be passed to niftyscript _in the reducer._ However, when niftyscript is high-computation and there is a lot of data coming out of the join but very few reducers, there's a huge hold-up. It would be awesome if I could somehow force a new mapreduce step after the subquery, so that ./niftyscript is run in the mappers rather than the prior step's reducers.
Current workaround is to dump everything to a temporary table and then start over, but that is not easy to scale--the subquery structure effectively (and easily) "locks" the mid-points so no other job can touch the table.

SUGGESTED FIX: Either cause MAP and REDUCE to force map/reduce steps (cf. https://issues.apache.org/jira/browse/HIVE-835 ), or add a query element to specify that "the job ends here." For example, in the above query,
FROM a SELF-CONTAINED
or
PRECOMPUTE a
or
START JOB AFTER a
or something like that.


> Add syntax to force a new mapreduce job / transform subquery in mapper
> ----------------------------------------------------------------------
>
>                 Key: HIVE-836
>                 URL: https://issues.apache.org/jira/browse/HIVE-836
>             Project: Hive
>          Issue Type: Wish
>            Reporter: Adam Kramer
>
> Hive currently does a lot of awesome work to figure out when my transformers
> should be used in the mapper and when they should be used in the reducer.
> However, sometimes I have a different plan.
> For example, consider this:
> {code:title=foo.sql}
> SELECT TRANSFORM(a.val1, a.val2)
> USING './niftyscript'
> AS part1, part2, part3
> FROM (
>   SELECT b.val AS val1, c.val AS val2
>   FROM tblb b JOIN tblc c on (b.key=c.key)
> ) a
> {code}
> ...now, assume that the join step is very easy and 'niftyscript' is really
> processor intensive. The ideal format for this is an MR task with few mappers
> and few reducers, and then a second MR task with lots of mappers.
> Currently, there is no way to even require that the outer TRANSFORM statement
> occur in a separate map phase. Implementing a "hint" such as /* +MAP */, akin
> to /* +MAPJOIN(x) */, would be awesome.
> Current workaround is to dump everything to a temporary table and then start
> over, but that is not easy to scale--the subquery structure effectively
> (and easily) "locks" the mid-points so no other job can touch the table.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira