okumin opened a new pull request, #4432: URL: https://github.com/apache/hive/pull/4432
…of Tez ### What changes were proposed in this pull request? This PR adds a new threshold to enable/disable auto reducer parallelism based on the potential reduction. https://issues.apache.org/jira/browse/HIVE-23831 ### Why are the changes needed? We introduced a heuristics to disable auto reducer parallelism in [HIVE-14200](https://issues.apache.org/jira/browse/HIVE-14200) when `{# of tasks} * {hive.tez.min.partition.factor}` is smaller than 1.0. It takes effect when the estimated number is smaller than 4 by default because the default `hive.tez.min.partition.factor` is 0.25. For example, if the estimated number is equal to 4, Tez tunes the final number between 1 and 8 because the minimum number can be 4 * 0.25 = 1.0, and the maximum number can be 4 * 2.0 = 8 as the default `hive.tez.max.partition.factor` is 2.0. ``` $ beeline -e 'SELECT item_id, sum(price) FROM orders GROUP BY item_id' --hiveconf hive.server2.in.place.progress=false --hiveconf hive.tez.auto.reducer.parallelism=true --hiveconf hive.exec.reducers.bytes.per.reducer=150 ... INFO : Map 1: 0(+1)/1 Reducer 2: 0/8 INFO : Map 1: 1/1 Reducer 2: 2/2 ``` If the estimated number is 3, Tez doesn't apply auto-reducer parallelism because `{# of tasks} * {hive.tez.min.partition.factor}` = `3 * 0.25` is smaller than 1.0. ``` $ beeline -e 'SELECT item_id, sum(price) FROM orders GROUP BY item_id' --hiveconf hive.server2.in.place.progress=false --hiveconf hive.tez.auto.reducer.parallelism=true --hiveconf hive.exec.reducers.bytes.per.reducer=200 ... INFO : Map 1: 1/1 Reducer 2: 0(+1)/6 INFO : Map 1: 1/1 Reducer 2: 6/6 ``` I understand we introduced [HIVE-14200](https://issues.apache.org/jira/browse/HIVE-14200) so that Hive on Tez can avoid overhead when the possible reduction is small. However, I know `{# of tasks} * {hive.tez.min.partition.factor} < 1.0` is a little overfitting the default values of `hive.tez.min.partition.factor` and `hive.tez.max.partition.factor`. For example, when we configure `hive.tez.min.partition.factor=0.05`, auto reducer parallelism is disabled when # of tasks is 19 or less. It means Hive on Tez misses a chance to tune parallelism between 1 and 38 in the worst case. I expect the case is worth trying auto reducer parallelism. ``` $ beeline -e 'SELECT item_id, sum(price) FROM orders GROUP BY item_id' --hiveconf hive.server2.in.place.progress=false --hiveconf hive.tez.auto.reducer.parallelism=true --hiveconf hive.exec.reducers.bytes.per.reducer=30 --hiveconf hive.tez.min.partition.factor=0.05 ... INFO : Map 1: 0(+1)/1 Reducer 2: 0/34 INFO : Map 1: 1/1 Reducer 2: 1(+1)/34 INFO : Map 1: 1/1 Reducer 2: 13(+0)/34 ``` I guess a similar issue can happen when we give big `hive.tez.max.partition.factor`. ### Does this PR introduce _any_ user-facing change? Yes if a user already configures `hive.tez.min.partition.factor` or `hive.tez.max.partition.factor`. If not, the behavior doesn't change. ### Is the change a dependency upgrade? No. ### How was this patch tested? I tested it on my local machine. No behaviors change with default parameters. ``` $ beeline -e 'SELECT item_id, sum(price) FROM orders GROUP BY item_id' --hiveconf hive.server2.in.place.progress=false --hiveconf hive.tez.auto.reducer.parallelism=true --hiveconf hive.exec.reducers.bytes.per.reducer=150 ... INFO : Map 1: 0(+1)/1 Reducer 2: 0/8 INFO : Map 1: 1/1 Reducer 2: 0(+1)/2 ... $ beeline -e 'SELECT item_id, sum(price) FROM orders GROUP BY item_id' --hiveconf hive.server2.in.place.progress=false --hiveconf hive.tez.auto.reducer.parallelism=true --hiveconf hive.exec.reducers.bytes.per.reducer=200 ... INFO : Map 1: 0(+1)/1 Reducer 2: 0/6 INFO : Map 1: 1/1 Reducer 2: 0(+1)/6 ``` Auto reducer parallelism is enabled even when `hive.tez.min.partition.factor` is small. ``` $ beeline -e 'SELECT item_id, sum(price) FROM orders GROUP BY item_id' --hiveconf hive.server2.in.place.progress=false --hiveconf hive.tez.auto.reducer.parallelism=true --hiveconf hive.exec.reducers.bytes.per.reducer=30 --hiveconf hive.tez.min.partition.factor=0.05 ... INFO : Map 1: 0(+1)/1 Reducer 2: 0/34 INFO : Map 1: 1/1 Reducer 2: 0(+1)/12 ``` The threshold is tunable. ``` $ beeline -e 'SELECT item_id, sum(price) FROM orders GROUP BY item_id' --hiveconf hive.server2.in.place.progress=false --hiveconf hive.tez.auto.reducer.parallelism=true --hiveconf hive.exec.reducers.bytes.per.reducer=200 --hiveconf hive.tez.auto.reducer.parallelism.threshold=6 ... INFO : Map 1: 0(+1)/1 Reducer 2: 0/6 INFO : Map 1: 1/1 Reducer 2: 2/2 ... $ beeline -e 'SELECT item_id, sum(price) FROM orders GROUP BY item_id' --hiveconf hive.server2.in.place.progress=false --hiveconf hive.tez.auto.reducer.parallelism=true --hiveconf hive.exec.reducers.bytes.per.reducer=300 --hiveconf hive.tez.auto.reducer.parallelism.threshold=6 ... INFO : Map 1: 0/1 Reducer 2: 0/4 INFO : Map 1: 1/1 Reducer 2: 4/4 ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
