BELUGA BEHR created HIVE-17004:
----------------------------------

             Summary: Calculating Number Of Reducers Looks At All Files
                 Key: HIVE-17004
                 URL: https://issues.apache.org/jira/browse/HIVE-17004
             Project: Hive
          Issue Type: Improvement
          Components: Hive
    Affects Versions: 2.1.1
            Reporter: BELUGA BEHR
When calculating the number of Mappers and Reducers, the two algorithms look at different data sets. The number of Mappers is calculated from the number of splits, while the number of Reducers is estimated from the input data size found under the table's HDFS directory. The effect is that if I add files to a sub-directory of the HDFS directory, the number of splits remains the same (I did not tell Hive to search recursively), but the number of Reducers increases. Please improve this so that the Reducer estimate looks at the same files that are considered for splits, and not at files within sub-directories (unless configured to do so). A sketch of the size-based arithmetic follows the first directory listing below.

{code}
CREATE EXTERNAL TABLE Complaints (
  a string,
  b string,
  c string,
  d string,
  e string,
  f string,
  g string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/user/admin/complaints';
{code}

{code}
[root@host ~]# sudo -u hdfs hdfs dfs -ls -R /user/admin/complaints
-rwxr-xr-x 2 admin admin 122607137 2017-05-02 14:12 /user/admin/complaints/Consumer_Complaints.1.csv
-rwxr-xr-x 2 admin admin 122607137 2017-05-02 14:12 /user/admin/complaints/Consumer_Complaints.2.csv
-rwxr-xr-x 2 admin admin 122607137 2017-05-02 14:12 /user/admin/complaints/Consumer_Complaints.3.csv
-rwxr-xr-x 2 admin admin 122607137 2017-05-02 14:12 /user/admin/complaints/Consumer_Complaints.4.csv
-rwxr-xr-x 2 admin admin 122607137 2017-05-02 14:12 /user/admin/complaints/Consumer_Complaints.5.csv
-rwxr-xr-x 2 admin admin 122607137 2017-05-02 14:12 /user/admin/complaints/Consumer_Complaints.csv
{code}
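For reference, both estimates in the runs below are consistent with the formula reducers = min(hive.exec.reducers.max, ceil(totalInputSize / hive.exec.reducers.bytes.per.reducer)). A minimal sketch of that arithmetic, assuming a 64 MiB bytes-per-reducer value (inferred from the observed numbers, not quoted from Hive's code):

{code}
// Illustrative sketch only -- not Hive's actual implementation.
// Assumes: reducers = min(maxReducers, ceil(totalInputSize / bytesPerReducer)).
public class ReducerEstimateSketch {

  static int estimateReducers(long totalInputSize, long bytesPerReducer, int maxReducers) {
    int reducers = (int) Math.ceil((double) totalInputSize / bytesPerReducer);
    return Math.max(1, Math.min(maxReducers, reducers));
  }

  public static void main(String[] args) {
    long bytesPerReducer = 64L * 1024 * 1024; // 67108864 bytes, assumed CDH 5 default
    int maxReducers = 1099;                   // hive.exec.reducers.max default

    long topLevelOnly = 6L * 122607137L;      // the six top-level CSVs
    long withSubDir = 12L * 122607137L;       // plus the six copies later added under /t

    // Prints 11 and 22 -- exactly the two estimates reported in the runs below.
    System.out.println(estimateReducers(topLevelOnly, bytesPerReducer, maxReducers));
    System.out.println(estimateReducers(withSubDir, bytesPerReducer, maxReducers));
  }
}
{code}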
{code}
INFO : Compiling command(queryId=hive_20170502142020_dfcf77ef-56b7-4544-ab90-6e9726ea86ae): select a, count(1) from complaints group by a limit 10
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:a, type:string, comment:null), FieldSchema(name:_c1, type:bigint, comment:null)], properties:null)
INFO : Completed compiling command(queryId=hive_20170502142020_dfcf77ef-56b7-4544-ab90-6e9726ea86ae); Time taken: 0.077 seconds
INFO : Executing command(queryId=hive_20170502142020_dfcf77ef-56b7-4544-ab90-6e9726ea86ae): select a, count(1) from complaints group by a limit 10
INFO : Query ID = hive_20170502142020_dfcf77ef-56b7-4544-ab90-6e9726ea86ae
INFO : Total jobs = 1
INFO : Launching Job 1 out of 1
INFO : Starting task [Stage-1:MAPRED] in serial mode
INFO : Number of reduce tasks not specified. Estimated from input data size: 11
INFO : In order to change the average load for a reducer (in bytes):
INFO :   set hive.exec.reducers.bytes.per.reducer=<number>
INFO : In order to limit the maximum number of reducers:
INFO :   set hive.exec.reducers.max=<number>
INFO : In order to set a constant number of reducers:
INFO :   set mapreduce.job.reduces=<number>
INFO : number of splits:2
INFO : Submitting tokens for job: job_1493729203063_0003
INFO : The url to track the job: http://host:8088/proxy/application_1493729203063_0003/
INFO : Starting Job = job_1493729203063_0003, Tracking URL = http://host:8088/proxy/application_1493729203063_0003/
INFO : Kill Command = /opt/cloudera/parcels/CDH-5.8.4-1.cdh5.8.4.p0.5/lib/hadoop/bin/hadoop job -kill job_1493729203063_0003
INFO : Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 11
INFO : 2017-05-02 14:20:14,206 Stage-1 map = 0%, reduce = 0%
INFO : 2017-05-02 14:20:22,520 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.48 sec
INFO : 2017-05-02 14:20:34,029 Stage-1 map = 100%, reduce = 27%, Cumulative CPU 15.72 sec
INFO : 2017-05-02 14:20:35,069 Stage-1 map = 100%, reduce = 55%, Cumulative CPU 21.94 sec
INFO : 2017-05-02 14:20:36,110 Stage-1 map = 100%, reduce = 64%, Cumulative CPU 23.97 sec
INFO : 2017-05-02 14:20:39,233 Stage-1 map = 100%, reduce = 73%, Cumulative CPU 25.26 sec
INFO : 2017-05-02 14:20:43,392 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 30.9 sec
INFO : MapReduce Total cumulative CPU time: 30 seconds 900 msec
INFO : Ended Job = job_1493729203063_0003
INFO : MapReduce Jobs Launched:
INFO : Stage-Stage-1: Map: 2  Reduce: 11  Cumulative CPU: 30.9 sec  HDFS Read: 735691149 HDFS Write: 153 SUCCESS
INFO : Total MapReduce CPU Time Spent: 30 seconds 900 msec
INFO : Completed executing command(queryId=hive_20170502142020_dfcf77ef-56b7-4544-ab90-6e9726ea86ae); Time taken: 36.035 seconds
INFO : OK
{code}

Now copy the same six files into a sub-directory {{t}} of the table's location:

{code}
[root@host ~]# sudo -u hdfs hdfs dfs -ls -R /user/admin/complaints
-rwxr-xr-x 2 admin admin 122607137 2017-05-02 14:12 /user/admin/complaints/Consumer_Complaints.1.csv
-rwxr-xr-x 2 admin admin 122607137 2017-05-02 14:12 /user/admin/complaints/Consumer_Complaints.2.csv
-rwxr-xr-x 2 admin admin 122607137 2017-05-02 14:12 /user/admin/complaints/Consumer_Complaints.3.csv
-rwxr-xr-x 2 admin admin 122607137 2017-05-02 14:12 /user/admin/complaints/Consumer_Complaints.4.csv
-rwxr-xr-x 2 admin admin 122607137 2017-05-02 14:12 /user/admin/complaints/Consumer_Complaints.5.csv
-rwxr-xr-x 2 admin admin 122607137 2017-05-02 14:12 /user/admin/complaints/Consumer_Complaints.csv
drwxr-xr-x - admin admin 0 2017-05-02 14:16 /user/admin/complaints/t
-rwxr-xr-x 2 admin admin 122607137 2017-05-02 14:16 /user/admin/complaints/t/Consumer_Complaints.1.csv
-rwxr-xr-x 2 admin admin 122607137 2017-05-02 14:16 /user/admin/complaints/t/Consumer_Complaints.2.csv
-rwxr-xr-x 2 admin admin 122607137 2017-05-02 14:16 /user/admin/complaints/t/Consumer_Complaints.3.csv
-rwxr-xr-x 2 admin admin 122607137 2017-05-02 14:16 /user/admin/complaints/t/Consumer_Complaints.4.csv
-rwxr-xr-x 2 admin admin 122607137 2017-05-02 14:16 /user/admin/complaints/t/Consumer_Complaints.5.csv
-rwxr-xr-x 2 admin admin 122607137 2017-05-02 14:16 /user/admin/complaints/t/Consumer_Complaints.csv
{code}
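As the second run below shows, the estimate doubles (11 to 22) even though the number of splits does not change. That is consistent with the input size being gathered recursively, for example via Hadoop's {{FileSystem#getContentSummary}}, which by definition includes everything beneath the path. A hypothetical comparison against a non-recursive listing (the path is this reproduction's; the class itself is illustrative, not Hive code):

{code}
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical demonstration, not Hive code: a recursive content summary
// now sees ~1.47 GB (12 files), while a non-recursive listing of the
// table location still sees ~736 MB (6 files).
public class InputSizeComparison {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path dir = new Path("/user/admin/complaints");

    // Recursive: includes everything under /user/admin/complaints/t
    long recursive = fs.getContentSummary(dir).getLength();

    // Non-recursive: only the six top-level CSVs
    long topLevel = Arrays.stream(fs.listStatus(dir))
        .filter(FileStatus::isFile)
        .mapToLong(FileStatus::getLen)
        .sum();

    System.out.println("recursive=" + recursive + " topLevel=" + topLevel);
  }
}
{code}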
{code}
INFO : Compiling command(queryId=hive_20170502142929_66a476e5-0591-4abe-92b7-bd3e4973466e): select a, count(1) from complaints group by a limit 10
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:a, type:string, comment:null), FieldSchema(name:_c1, type:bigint, comment:null)], properties:null)
INFO : Completed compiling command(queryId=hive_20170502142929_66a476e5-0591-4abe-92b7-bd3e4973466e); Time taken: 0.073 seconds
INFO : Executing command(queryId=hive_20170502142929_66a476e5-0591-4abe-92b7-bd3e4973466e): select a, count(1) from complaints group by a limit 10
INFO : Query ID = hive_20170502142929_66a476e5-0591-4abe-92b7-bd3e4973466e
INFO : Total jobs = 1
INFO : Launching Job 1 out of 1
INFO : Starting task [Stage-1:MAPRED] in serial mode
INFO : Number of reduce tasks not specified. Estimated from input data size: 22
INFO : In order to change the average load for a reducer (in bytes):
INFO :   set hive.exec.reducers.bytes.per.reducer=<number>
INFO : In order to limit the maximum number of reducers:
INFO :   set hive.exec.reducers.max=<number>
INFO : In order to set a constant number of reducers:
INFO :   set mapreduce.job.reduces=<number>
INFO : number of splits:2
INFO : Submitting tokens for job: job_1493729203063_0004
INFO : The url to track the job: http://host:8088/proxy/application_1493729203063_0004/
INFO : Starting Job = job_1493729203063_0004, Tracking URL = http://host:8088/proxy/application_1493729203063_0004/
INFO : Kill Command = /opt/cloudera/parcels/CDH-5.8.4-1.cdh5.8.4.p0.5/lib/hadoop/bin/hadoop job -kill job_1493729203063_0004
INFO : Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 22
INFO : 2017-05-02 14:29:27,464 Stage-1 map = 0%, reduce = 0%
INFO : 2017-05-02 14:29:36,829 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 10.2 sec
INFO : 2017-05-02 14:29:47,287 Stage-1 map = 100%, reduce = 14%, Cumulative CPU 15.36 sec
INFO : 2017-05-02 14:29:49,381 Stage-1 map = 100%, reduce = 27%, Cumulative CPU 20.76 sec
INFO : 2017-05-02 14:29:50,433 Stage-1 map = 100%, reduce = 32%, Cumulative CPU 22.69 sec
INFO : 2017-05-02 14:29:56,743 Stage-1 map = 100%, reduce = 45%, Cumulative CPU 27.73 sec
INFO : 2017-05-02 14:30:00,916 Stage-1 map = 100%, reduce = 64%, Cumulative CPU 34.95 sec
INFO : 2017-05-02 14:30:06,142 Stage-1 map = 100%, reduce = 77%, Cumulative CPU 41.49 sec
INFO : 2017-05-02 14:30:10,297 Stage-1 map = 100%, reduce = 82%, Cumulative CPU 42.92 sec
INFO : 2017-05-02 14:30:11,334 Stage-1 map = 100%, reduce = 86%, Cumulative CPU 45.24 sec
INFO : 2017-05-02 14:30:12,365 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 50.33 sec
INFO : MapReduce Total cumulative CPU time: 50 seconds 330 msec
INFO : Ended Job = job_1493729203063_0004
INFO : MapReduce Jobs Launched:
INFO : Stage-Stage-1: Map: 2  Reduce: 22  Cumulative CPU: 50.33 sec  HDFS Read: 735731640 HDFS Write: 153 SUCCESS
INFO : Total MapReduce CPU Time Spent: 50 seconds 330 msec
INFO : Completed executing command(queryId=hive_20170502142929_66a476e5-0591-4abe-92b7-bd3e4973466e); Time taken: 51.841 seconds
INFO : OK
{code}

https://github.com/apache/hive/blob/bc510f63de9d6baab3a5ad8a4bf4eed9c6fde8b1/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L2959

The number of splits (Mappers) stays the same between the two runs, but the number of Reducers doubles. Note that HDFS Read is nearly identical across the runs (735691149 vs. 735731640 bytes), confirming that the Mappers never read the sub-directory:

*INFO : number of splits:2*
# Number of reduce tasks not specified. Estimated from input data size: 11
# Number of reduce tasks not specified. Estimated from input data size: 22
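One possible direction, sketched here only as an illustration of the request above: derive the Reducer estimate from the same file set the split computation uses, descending into sub-directories only when recursive input is actually enabled (e.g. {{mapred.input.dir.recursive}}). This is a hypothetical helper, not a patch against {{Utilities.java}}:

{code}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper, not a patch: sum only the files a split
// computation would see; sub-directories contribute only when
// recursion is enabled.
public class SplitConsistentInputSize {

  public static long inputSize(FileSystem fs, Path dir, boolean recursive)
      throws IOException {
    long total = 0;
    for (FileStatus stat : fs.listStatus(dir)) {
      if (stat.isFile()) {
        total += stat.getLen();
      } else if (recursive && stat.isDirectory()) {
        total += inputSize(fs, stat.getPath(), true);
      }
    }
    return total;
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    boolean recursive = conf.getBoolean("mapred.input.dir.recursive", false);
    // With recursion off, this matches the split view of the reproduction
    // above: 735642822 bytes, which would estimate 11 reducers, not 22.
    System.out.println(inputSize(fs, new Path("/user/admin/complaints"), recursive));
  }
}
{code}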