[ https://issues.apache.org/jira/browse/MAPREDUCE-1238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Koji Noguchi updated MAPREDUCE-1238:
------------------------------------
Attachment: MAPREDUCE-1238-v0.20-1.patch
This is not my patch. The issue was pointed out internally by a developer, but nobody
followed up, so I am uploading it here to see if it makes sense.
Copy-pasting his comment:
{noformat}
I tried to reproduce this on a small cluster (60 nodes) with Hadoop 0.20.202.
Steps to reproduce the issue:
===============================
1. Set up a file sink for the JobTracker metrics, so that the waiting_maps
counter is written to a separate file.
2. Use queue configs similar to those in production (I took the nitroblue config).
3. Submit a simple sort job with -Dmapred.job.queue.name="search_general", where
this queue does not exist in the cluster. The waiting_maps counter then goes
negative. An example is given below.
1298346748441 mapred.jobtracker: context=mapred, sessionId=, hostName=gsta90014.tan.ygrid.yahoo.com, waiting_maps=-120, waiting_reduces=-16, jobs_failed=1, jobs_preparing=0
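To confirm the symptom from the file sink output, a small check like the following can scan one metrics record for negative waiting counts. This is a hypothetical helper (MetricsLineCheck is not part of Hadoop), and it assumes the record layout shown above, "<timestamp> <context>.<record>: key=value, key=value, ...", which may differ in other metrics setups.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not part of Hadoop): checks one file-sink metrics record. */
public class MetricsLineCheck {

    // Assumes the layout shown above: "<timestamp> <context>.<record>: k1=v1, k2=v2, ..."
    static Map<String, String> parse(String line) {
        Map<String, String> fields = new LinkedHashMap<>();
        int colon = line.indexOf(": ");
        if (colon < 0) {
            return fields; // no "context.record:" prefix, so not a metrics record
        }
        for (String pair : line.substring(colon + 2).split(",\\s*")) {
            int eq = pair.indexOf('=');
            if (eq > 0) {
                fields.put(pair.substring(0, eq).trim(), pair.substring(eq + 1).trim());
            }
        }
        return fields;
    }

    // True if waiting_maps or waiting_reduces is present and below zero.
    static boolean hasNegativeWaitingCounts(String line) {
        Map<String, String> fields = parse(line);
        for (String key : new String[] {"waiting_maps", "waiting_reduces"}) {
            String value = fields.get(key);
            if (value != null && !value.isEmpty() && Long.parseLong(value) < 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String line = "1298346748441 mapred.jobtracker: context=mapred, sessionId=, "
                + "hostName=gsta90014.tan.ygrid.yahoo.com, waiting_maps=-120, "
                + "waiting_reduces=-16, jobs_failed=1, jobs_preparing=0";
        System.out.println(parse(line).get("waiting_maps")); // prints -120
        System.out.println(hasNegativeWaitingCounts(line));  // prints true
    }
}
```

Empty values such as "sessionId=" are kept as empty strings, so the check only parses the two gauges it cares about.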
Problem:
========
1. waiting_maps and waiting_reduces are incremented in JobInProgress.initTasks().
If a job fails with an exception before its tasks are initialized (here, the
queue check in JobTracker.submitJob()), JobInProgress.garbageCollect() still
decrements both counters. Since nothing was added, this drives waiting_maps
and waiting_reduces negative.
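A simplified, hypothetical model of this asymmetry is sketched below. The Instrumentation and Job classes are illustrative stand-ins, not the real JobTracker/JobInProgress code; only the method names initTasks() and garbageCollect() mirror the originals, and the "pending" count is assumed to equal the configured task count because no task has run.

```java
/** Hypothetical model of the unpatched bookkeeping; not the real Hadoop classes. */
public class WaitingMapsModel {

    /** Stand-in for the JobTracker-wide waiting_maps / waiting_reduces gauges. */
    static class Instrumentation {
        long waitingMaps;
        long waitingReduces;
    }

    static class Job {
        private final Instrumentation metrics;
        private final int numMaps;
        private final int numReduces;

        Job(Instrumentation metrics, int numMaps, int numReduces) {
            this.metrics = metrics;
            this.numMaps = numMaps;
            this.numReduces = numReduces;
        }

        // Mirrors initTasks(): the only place the gauges are incremented.
        void initTasks() {
            metrics.waitingMaps += numMaps;
            metrics.waitingReduces += numReduces;
        }

        // Mirrors the unpatched garbageCollect(): decrements unconditionally,
        // even when initTasks() never ran for this job.
        void garbageCollect() {
            metrics.waitingMaps -= numMaps;
            metrics.waitingReduces -= numReduces;
        }
    }

    public static void main(String[] args) {
        Instrumentation metrics = new Instrumentation();
        // Job rejected at submission (e.g. unknown queue): initTasks() is skipped.
        Job rejected = new Job(metrics, 120, 16);
        rejected.garbageCollect();
        System.out.println("waiting_maps=" + metrics.waitingMaps
                + ", waiting_reduces=" + metrics.waitingReduces);
        // prints waiting_maps=-120, waiting_reduces=-16
    }
}
```

With nMaps=120 and nReduces=16 (as in the JobTracker log below), the model ends at the same -120/-16 values seen in the metrics record.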
Tried the following code change in JobInProgress as a fix:
==========================================================
// Decrement waiting_maps/waiting_reduces only if initTasks() ran,
// i.e. only if they were incremented for this job in the first place.
if (tasksInited) {
  // Let the JobTracker know that a job is complete
  jobtracker.getInstrumentation().decWaitingMaps(getJobID(), pendingMaps());
  jobtracker.getInstrumentation().decWaitingReduces(getJobID(), pendingReduces());
}
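The same kind of hypothetical model with the proposed tasksInited guard added is sketched below. Again, the classes are illustrative stand-ins rather than the actual JobInProgress, and the sketch assumes no tasks were scheduled, so the pending counts equal the configured counts. Both the rejected job and a normally initialized job leave the gauges at zero.

```java
/** Hypothetical model of the guarded cleanup; not the real JobInProgress. */
public class GuardedWaitingMapsModel {

    /** Stand-in for the JobTracker-wide waiting_maps / waiting_reduces gauges. */
    static class Instrumentation {
        long waitingMaps;
        long waitingReduces;
    }

    static class Job {
        private final Instrumentation metrics;
        private final int numMaps;
        private final int numReduces;
        private boolean tasksInited = false; // set only once the gauges were incremented

        Job(Instrumentation metrics, int numMaps, int numReduces) {
            this.metrics = metrics;
            this.numMaps = numMaps;
            this.numReduces = numReduces;
        }

        void initTasks() {
            metrics.waitingMaps += numMaps;
            metrics.waitingReduces += numReduces;
            tasksInited = true;
        }

        // Guarded cleanup: only undo increments that actually happened.
        void garbageCollect() {
            if (tasksInited) {
                metrics.waitingMaps -= numMaps;
                metrics.waitingReduces -= numReduces;
            }
        }
    }

    public static void main(String[] args) {
        Instrumentation metrics = new Instrumentation();

        Job rejected = new Job(metrics, 120, 16); // fails before initTasks()
        rejected.garbageCollect();

        Job normal = new Job(metrics, 120, 16);
        normal.initTasks();
        normal.garbageCollect();

        System.out.println("waiting_maps=" + metrics.waitingMaps
                + ", waiting_reduces=" + metrics.waitingReduces);
        // prints waiting_maps=0, waiting_reduces=0
    }
}
```

In this model the flag is set in the same method that increments the gauges, so increments and decrements stay paired. If the real code sets tasksInited at a different point than the increments, that gap would be worth checking during review.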
The above logic still needs to be reviewed by a developer.
JobTracker logs when the problem was observed:
==============================================
11/02/22 03:52:22 INFO ipc.Server: SASL server context established. Negotiated QoP is auth
11/02/22 03:52:22 INFO ipc.Server: SASL server successfully authenticated client: [email protected]
11/02/22 03:52:22 INFO ipc.Server: Auth successfull for [email protected]
11/02/22 03:52:22 INFO authorize.ServiceAuthorizationManager: Authorization successfull for [email protected] for protocol=interface org.apache.hadoop.mapred.JobSubmissionProtocol
11/02/22 03:52:23 INFO token.DelegationTokenRenewal: registering token for renewal for service =98.138.162.177:8020 and jobID = job_201102220351_0001
11/02/22 03:52:23 INFO mapred.JobInProgress: job_201102220351_0001: nMaps=120 nReduces=16 max=200000
11/02/22 03:52:23 INFO hdfs.DFSClient: Renewing HDFS_DELEGATION_TOKEN token 1940 for gridperf on 98.138.162.177:8020
11/02/22 03:52:23 INFO mapred.JobInProgress$JobSummary: jobId=job_201102220351_0001,submitTime=1298346743640,launchTime=0,,finishTime=1298346743724,numMaps=0,numSlotsPerMap=1,numReduces=0,numSlotsPerReduce=1,user=gridperf,queue=search_general,status=FAILED,mapSlotSeconds=0,reduceSlotsSeconds=0,clusterMapCapacity=0,clusterReduceCapacity=0
11/02/22 03:52:23 INFO mapred.JobHistory: No file for job-history with job_201102220351_0001 found in cache!
11/02/22 03:52:23 INFO mapred.JobHistory: No file for jobconf with job_201102220351_0001 found in cache!
11/02/22 03:52:23 INFO hdfs.DFSClient: Cancelling HDFS_DELEGATION_TOKEN token 1940 for gridperf on 98.138.162.177:8020
11/02/22 03:52:23 INFO ipc.Server: IPC Server handler 1 on 8021, call submitJob(job_201102220351_0001, hdfs://gsta90013.tan.ygrid.yahoo.com/grid/0/daytona/hadoop/tmp/mapred/staging/gridperf/.staging/job_201102220351_0001, org.apache.hadoop.security.Credentials@7051630a) from 98.138.162.177:45951: error: java.io.IOException: Queue "search_general" does not exist
java.io.IOException: Queue "search_general" does not exist
	at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3930)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:523)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1380)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1378)
{noformat}
> mapred metrics shows negative count of waiting maps and reduces
> ----------------------------------------------------------------
>
> Key: MAPREDUCE-1238
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1238
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: jobtracker
> Reporter: Ramya Sunil
> Attachments: MAPREDUCE-1238-v0.20-1.patch
>
>
> Negative waiting_maps and waiting_reduces counts are observed in the mapred
> metrics.