[ 
https://issues.apache.org/jira/browse/HIVE-105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12665513#action_12665513
 ] 

Zheng Shao commented on HIVE-105:
---------------------------------

Some detailed plan:

Part A: Automatically detect the number of reducers:
1. If "mapred.reduce.tasks" is set to a non-negative value, use that as the 
number of reducers and skip the following steps;
2. Divide the total size of the input files by 
"hive.exec.bytes.per.reducer" to get a number R;
3. If "hive.exec.max.reducers" is set to a non-negative value, use min(R, 
"hive.exec.max.reducers") as the number of reducers.

NOTE: The user needs to set "mapred.reduce.tasks" to a negative number to take 
advantage of this new feature; this preserves backward compatibility. Because 
hadoop-default.xml sets "mapred.reduce.tasks", there is no way for the user 
to "unset" that variable.
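The three steps of Part A can be sketched as follows. This is an illustrative 
sketch only, not Hive's actual code: the function name, the dict-based conf 
lookup, and the default values are assumptions made for the example.

```python
import math

def estimate_reducers(conf, total_input_bytes):
    """Illustrative sketch of Part A's reducer estimation."""
    # Step 1: an explicitly set, non-negative mapred.reduce.tasks wins
    # (a negative value means "estimate automatically").
    n = conf.get("mapred.reduce.tasks", -1)
    if n >= 0:
        return n
    # Step 2: size-based estimate R = total input size / bytes per reducer
    # (the 1 GB default here is an assumption for the sketch).
    bytes_per_reducer = conf.get("hive.exec.bytes.per.reducer", 1_000_000_000)
    r = max(1, math.ceil(total_input_bytes / bytes_per_reducer))
    # Step 3: cap R by hive.exec.max.reducers when it is non-negative.
    max_reducers = conf.get("hive.exec.max.reducers", -1)
    if max_reducers >= 0:
        r = min(r, max_reducers)
    return r
```

For example, with "mapred.reduce.tasks" set to -1 and 1000 bytes of input at 
100 bytes per reducer, the estimate is 10 reducers, or 4 if 
"hive.exec.max.reducers" is 4.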


Part B: Allow users to override per-stage reducer numbers:
Add a step 0 before step 1 in Part A:
0. If "hive.exec.reducers.stage.N", where "N" is the stage number, is set, 
use that as the number of reducers.
Part B will also remove all parameters of the form 
"hive.exec.reducers.stage.N" once the query is done.


> estimate number of required reducers and other map-reduce parameters 
> automatically
> ----------------------------------------------------------------------------------
>
>                 Key: HIVE-105
>                 URL: https://issues.apache.org/jira/browse/HIVE-105
>             Project: Hadoop Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Joydeep Sen Sarma
>            Assignee: Zheng Shao
>
> Currently users have to specify the number of reducers. In a multi-user 
> environment we generally ask users to be prudent in selecting the number of 
> reducers (since reducers are long running and block other users). A large 
> number of reducers also produces a large number of output files, which puts 
> pressure on namenode resources.
> There are other map-reduce parameters - for example, the minimum split size 
> and the proposed use of CombineFileInputFormat - that are also fairly tricky 
> for the user to determine (since they depend on map-side selectivity and 
> cluster size). This will become critical once there is integration with BI 
> tools, since there will be no opportunity to optimize job settings and there 
> will be a wide variety of jobs.
> This jira calls for automating the selection of such parameters - possibly by 
> a best effort at estimating map-side selectivity/output size using sampling 
> and determining such parameters from there.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
