[ https://issues.apache.org/jira/browse/HIVE-105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654029#action_12654029 ]

Joydeep Sen Sarma commented on HIVE-105:
----------------------------------------

Currently mapred.reduce.tasks controls the 'default' number of reducers. It is 
in fact expected that users _would_ override it, since the default in standard 
Hadoop configs is 1, which is useless.

I am just afraid of overloading the semantics of well-understood Hadoop 
variables. For example, a novice Hive user who is reasonably experienced with 
Hadoop might, without reading the documentation, try to increase 
mapred.reduce.tasks and expect something interesting to happen, when in fact 
nothing will, since we would still default to 1G per reducer.
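
For concreteness, a minimal sketch of that inference (the class, method, and 
parameter names here are illustrative, not actual HiveConf keys; only the 
1G-per-reducer default comes from the discussion above):

    public class ReducerEstimator {
        // Infer the reducer count from total input size: one reducer per
        // bytesPerReducer of input (1G by default), capped at maxReducers.
        public static int inferReducers(long totalInputBytes,
                                        long bytesPerReducer,
                                        int maxReducers) {
            long wanted =
                (totalInputBytes + bytesPerReducer - 1) / bytesPerReducer; // ceil
            return (int) Math.max(1, Math.min(wanted, maxReducers));
        }
    }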

So I would argue for a differently named variable (say hive.exec.maxreducers) 
at a minimum. (I wish Hadoop had something equivalent, but since Hadoop 
doesn't determine reducer count automatically, it makes little sense there.) 
If we go this route, I would actually say that we should forbid setting 
mapred.reduce.tasks altogether; perhaps HiveConf could keep a list of Hadoop 
options that cannot be set by the user because Hive ignores them (see the 
sketch below).
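
As a sketch of what that guard could look like (the class name and error 
message are hypothetical; per the proposal, the list itself would live in 
HiveConf):

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    public class RestrictedHadoopVars {
        // Hadoop options that hive ignores and therefore refuses to let
        // the user set.
        private static final Set<String> IGNORED_BY_HIVE =
            new HashSet<String>(Arrays.asList("mapred.reduce.tasks"));

        public static void checkSettable(String name) {
            if (IGNORED_BY_HIVE.contains(name)) {
                throw new IllegalArgumentException("Cannot set " + name
                    + ": hive ignores it and derives the value automatically");
            }
        }
    }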

Another quick thought: we should round the inferred reducer count to a nearby 
prime (or alternately perhaps a multiple of large primes), based on previously 
observed problems with skew. The default partitioner assigns keys by hash code 
modulo the reducer count, so a count that shares factors with patterned hash 
codes distributes keys unevenly; a prime count sidesteps that.
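
A minimal sketch of the prime rounding (illustrative only):

    public class PrimeRounding {
        // Round the inferred reducer count up to the nearest prime.
        public static int nextPrime(int n) {
            int candidate = Math.max(n, 2);
            while (!isPrime(candidate)) {
                candidate++;
            }
            return candidate;
        }

        private static boolean isPrime(int n) {
            for (int i = 2; (long) i * i <= n; i++) {
                if (n % i == 0) {
                    return false;
                }
            }
            return true;
        }
    }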

> estimate number of required reducers and other map-reduce parameters 
> automatically
> ----------------------------------------------------------------------------------
>
>                 Key: HIVE-105
>                 URL: https://issues.apache.org/jira/browse/HIVE-105
>             Project: Hadoop Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Joydeep Sen Sarma
>
> Currently users have to specify the number of reducers. In a multi-user 
> environment we generally ask users to be prudent in selecting the number of 
> reducers (since reducers are long-running and block other users). Also, a 
> large number of reducers produces a large number of output files, which puts 
> pressure on namenode resources.
> There are other map-reduce parameters, for example the min split size and 
> the proposed use of CombineFileInputFormat, that are also fairly tricky for 
> the user to determine (since they depend on map-side selectivity and cluster 
> size). This will become critical once there is integration with BI tools, 
> since there will be no opportunity to hand-tune job settings and there will 
> be a wide variety of jobs.
> This jira calls for automating the selection of such parameters, possibly by 
> making a best effort at estimating map-side selectivity/output size using 
> sampling and deriving the parameters from there.
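
To make the sampling idea from the description concrete, a rough sketch (the 
signature and names are purely illustrative): run the map-side plan over a 
small sample of the input, measure the output/input byte ratio as the 
selectivity estimate, and scale to the full input to size the reduce phase:

    public class SamplingEstimator {
        // Estimate map output size from a sample's selectivity, then size
        // the reduce phase at one reducer per bytesPerReducer of output.
        public static int estimateReducers(long totalInputBytes,
                                           long sampledInputBytes,
                                           long sampledMapOutputBytes,
                                           long bytesPerReducer) {
            double selectivity =
                (double) sampledMapOutputBytes / sampledInputBytes;
            long estimatedOutput = (long) (selectivity * totalInputBytes);
            long wanted =
                (estimatedOutput + bytesPerReducer - 1) / bytesPerReducer;
            return (int) Math.max(1, wanted);
        }
    }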

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
