[ https://issues.apache.org/jira/browse/HIVE-105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12665651#action_12665651 ]
Joydeep Sen Sarma commented on HIVE-105:
----------------------------------------
> 1. If "mapred.reduce.tasks" is set and not less than 0, use that as the
> number of reducers, and skip all following steps;
Most Hadoop installations have a non-zero value specified for this;
hadoop-default.xml ships with a value of 1. Did you mean if it is > 1?
> 2. Take a look at the total size of the input files, divide that by
> "hive.exec.bytes.per.reducer", to get a number R;
s/hive.exec.bytes.per.reducer/hive.exec.bytesperreducer
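
As a rough illustration of the two quoted steps, and of why the default of 1
matters, here is a minimal Python sketch. The function name
estimate_reducers, its parameters, the round-up, and the floor of one reducer
are assumptions made for illustration, not Hive's actual code; only the steps
quoted above are covered.

```python
from typing import Optional


def estimate_reducers(configured_reducers: Optional[int],
                      total_input_bytes: int,
                      bytes_per_reducer: int) -> int:
    """Hypothetical sketch of steps 1 and 2 as quoted in the proposal."""
    # Step 1 (as quoted): if "mapred.reduce.tasks" is set and not less
    # than 0, use it and skip everything else.
    if configured_reducers is not None and configured_reducers >= 0:
        return configured_reducers

    # Step 2 (as quoted): R = total input size / bytes per reducer.
    # Rounding up and clamping to at least 1 are assumptions.
    if bytes_per_reducer <= 0:
        raise ValueError("bytes_per_reducer must be positive")
    r = -(-total_input_bytes // bytes_per_reducer)  # integer ceiling division
    return max(1, r)


GIB = 2 ** 30
# With the stock default of 1, step 1 always short-circuits.
default_case = estimate_reducers(1, 10 * GIB, GIB)
# With a negative (i.e. "unset") value, the size-based estimate is used.
estimated_case = estimate_reducers(-1, 10 * GIB, GIB)
```

With a configured value of 1 (the hadoop-default.xml default), step 1 always
returns 1 and the size-based estimate in step 2 is never reached, which is the
concern raised above.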
For part B: I think Prasad filed a JIRA for variables that are query-specific
(versus session-specific), and that is easy to get done. I am a little
uncomfortable with query hinting, since the user doesn't even know whether
reduces will be used (that depends on how we implement group-bys and
joins). It seems a query hint would first have to force a particular type of
plan (use group-by algo1) and then force the reducer count for that particular algo.
> estimate number of required reducers and other map-reduce parameters
> automatically
> ----------------------------------------------------------------------------------
>
> Key: HIVE-105
> URL: https://issues.apache.org/jira/browse/HIVE-105
> Project: Hadoop Hive
> Issue Type: Improvement
> Components: Query Processor
> Reporter: Joydeep Sen Sarma
> Assignee: Zheng Shao
>
> Currently users have to specify the number of reducers. In a multi-user
> environment we generally ask users to be prudent in selecting the number of
> reducers (since they are long-running and block other users). Also, a large
> number of reducers produces a large number of output files, which puts pressure
> on namenode resources.
> There are other map-reduce parameters, for example the min split size and
> the proposed use of CombineFileInputFormat, that are also fairly tricky for
> the user to determine (since they depend on map-side selectivity and cluster
> size). This will become critical when there is integration with BI
> tools, since there will be no opportunity to optimize job settings and there
> will be a wide variety of jobs.
> This JIRA calls for automating the selection of such parameters, possibly by
> a best effort at estimating map-side selectivity/output size using sampling
> and determining such parameters from there.