[ https://issues.apache.org/jira/browse/HIVE-105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zheng Shao updated HIVE-105:
----------------------------

    Description: 
Currently, users have to specify the number of reducers. In a multi-user environment 
we generally ask users to be prudent in choosing the number of reducers (since 
reducers are long running and block other users). Also, a large number of reducers 
produces a large number of output files, which puts pressure on namenode 
resources.

There are other map-reduce parameters, for example the minimum split size and the 
proposed use of CombineFileInputFormat, that are also fairly tricky for the user 
to determine (since they depend on map-side selectivity and cluster size). This 
will become critical once there is integration with BI tools, since there will be 
no opportunity to hand-tune job settings and there will be a wide variety of jobs.

This jira calls for automating the selection of such parameters, possibly by making 
a best effort at estimating map-side selectivity/output size using sampling and 
deriving the parameters from there.
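
For illustration only, here is one way the sampling-based estimate could work 
(this is an assumption about the approach, not taken from the attached patches; 
the class and method names are hypothetical): run the map-side plan over a small 
fraction of the input, measure the output it produces, and scale up to the full 
input size.

    // Hypothetical sketch of sampling-based estimation; all names are illustrative.
    public class SelectivityEstimate {
      /**
       * Estimate total map-side output size by scaling what a small sample produced.
       *
       * @param sampleInputBytes  bytes of input actually fed through the map-side plan
       * @param sampleOutputBytes bytes of map output observed for that sample
       * @param totalInputBytes   total input size of the job
       */
      public static long estimateMapOutputBytes(long sampleInputBytes,
                                                long sampleOutputBytes,
                                                long totalInputBytes) {
        // Observed selectivity = output bytes per input byte on the sample.
        double selectivity = (double) sampleOutputBytes / sampleInputBytes;
        return (long) (selectivity * totalInputBytes);
      }

      public static void main(String[] args) {
        // Example: a 100 MB sample produced 25 MB of map output (selectivity 0.25),
        // so a 40 GB input is expected to produce roughly 10 GB of map output.
        System.out.println(estimateMapOutputBytes(100L << 20, 25L << 20, 40L << 30));
      }
    }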

Configs:
hive.exec.reducers.bytes.per.reducer
hive.exec.reducers.max
mapred.reduce.tasks
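
A minimal sketch of how the reducer count could be derived from these three 
configs (the class and method names are illustrative, not the actual patch): 
honor mapred.reduce.tasks when it is set explicitly; otherwise divide the 
estimated input size by hive.exec.reducers.bytes.per.reducer and cap the result 
at hive.exec.reducers.max.

    // Illustrative sketch only, not the attached HIVE-105 patch.
    public class ReducerEstimator {
      public static int estimateReducers(long totalInputBytes,
                                         long bytesPerReducer,  // hive.exec.reducers.bytes.per.reducer
                                         int maxReducers,       // hive.exec.reducers.max
                                         int configuredReducers /* mapred.reduce.tasks; <= 0 means "auto" */) {
        // If the user set mapred.reduce.tasks explicitly, respect it.
        if (configuredReducers > 0) {
          return configuredReducers;
        }
        // Otherwise aim for roughly bytesPerReducer of input per reducer,
        // never exceeding the configured cap and always running at least one.
        long estimated = (totalInputBytes + bytesPerReducer - 1) / bytesPerReducer; // ceiling division
        return (int) Math.max(1, Math.min(maxReducers, estimated));
      }

      public static void main(String[] args) {
        // Example: 10 GB of input, 1 GB per reducer, cap of 999, no explicit setting -> 10 reducers.
        System.out.println(estimateReducers(10L << 30, 1L << 30, 999, -1));
      }
    }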



  was:
Currently, users have to specify the number of reducers. In a multi-user environment 
we generally ask users to be prudent in choosing the number of reducers (since 
reducers are long running and block other users). Also, a large number of reducers 
produces a large number of output files, which puts pressure on namenode 
resources.

There are other map-reduce parameters, for example the minimum split size and the 
proposed use of CombineFileInputFormat, that are also fairly tricky for the user 
to determine (since they depend on map-side selectivity and cluster size). This 
will become critical once there is integration with BI tools, since there will be 
no opportunity to hand-tune job settings and there will be a wide variety of jobs.

This jira calls for automating the selection of such parameters, possibly by making 
a best effort at estimating map-side selectivity/output size using sampling and 
deriving the parameters from there.


> estimate number of required reducers and other map-reduce parameters 
> automatically
> ----------------------------------------------------------------------------------
>
>                 Key: HIVE-105
>                 URL: https://issues.apache.org/jira/browse/HIVE-105
>             Project: Hadoop Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Joydeep Sen Sarma
>            Assignee: Zheng Shao
>             Fix For: 0.2.0
>
>         Attachments: HIVE-105.1.patch, HIVE-105.2.patch, HIVE-105.3.patch, 
> HIVE-105.4.patch
>
>
> Currently, users have to specify the number of reducers. In a multi-user 
> environment we generally ask users to be prudent in choosing the number of 
> reducers (since reducers are long running and block other users). Also, a large 
> number of reducers produces a large number of output files, which puts pressure 
> on namenode resources.
> There are other map-reduce parameters, for example the minimum split size and 
> the proposed use of CombineFileInputFormat, that are also fairly tricky for 
> the user to determine (since they depend on map-side selectivity and cluster 
> size). This will become critical once there is integration with BI tools, 
> since there will be no opportunity to hand-tune job settings and there will 
> be a wide variety of jobs.
> This jira calls for automating the selection of such parameters, possibly by 
> making a best effort at estimating map-side selectivity/output size using 
> sampling and deriving the parameters from there.
> Configs:
> hive.exec.reducers.bytes.per.reducer
> hive.exec.reducers.max
> mapred.reduce.tasks

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
