> On June 3, 2013, 5:36 p.m., Ben Mahler wrote:
> > What differences were you seeing between job.pendingMaps() and this new
> > technique?
> >
> > Looking at JobInProgress.pendingMaps():
> >
> >   public synchronized int pendingMaps() {
> >     return numMapTasks - runningMapTasks - failedMapTIPs -
> >         finishedMapTasks + speculativeMapTasks;
> >   }
> >
> > vs. jobdetails_jsp.printTaskSummary() {
> >     int totalTasks = tasks.length;
> >     int runningTasks = 0;
> >     int finishedTasks = 0;
> >     int killedTasks = 0;
> >     int failedTaskAttempts = 0;
> >     int killedTaskAttempts = 0;
> >     for (int i = 0; i < totalTasks; ++i) {
> >       TaskInProgress task = tasks[i];
> >       if (task.isComplete()) {
> >         finishedTasks += 1;
> >       } else if (task.isRunning()) {
> >         runningTasks += 1;
> >       } else if (task.wasKilled()) {
> >         killedTasks += 1;
> >       }
> >       failedTaskAttempts += task.numTaskFailures();
> >       killedTaskAttempts += task.numKilledTasks();
> >     }
> >     int pendingTasks = totalTasks - runningTasks - killedTasks -
> >         finishedTasks;
> >     ...
> >   }
> >
> > It seems like the difference here might be between failed_ vs. killed_
> > and/or the fact that the former case uses speculativeMapTasks?
>
> Brenden Matthews wrote:
> Yes, possibly. To be honest, I don't remember; it's been months since I
> fixed this. All I remember specifically is that it was wrong.
>
> Ben Mahler wrote:
> Ah, yes that's what I'm getting at, wrong in what way?
>
> Ben Mahler wrote:
> Hey Brenden, I would really like to get this change in as I trust that
> you've found an issue with the code. But for posterity, it would be nice to
> have a clearer explanation of what was wrong with the old code (so if we run
> into bugs again we can understand why we chose to do this instead of just
> calling pendingMaps/Reduces). Perhaps we need a different term for "pending",
> i.e., we're looking for tasks that don't have a slot open to run on.
>
> Do you remember how you noticed there was a problem? Did you notice in
> production? Was it because pending was 0, despite there being map tasks that
> needed to be run? Was the symptom that there were tasks that were not running
> because pending was 0?
Hi Ben,

As I recall, the pending task counts that Mesos reported in the log did not
match the Hadoop JobTracker web UI. On many occasions the reported pending
count was negative, which is impossible for a count, so the calculation was
clearly broken.

I thought perhaps it was related to this issue:
https://mail-archives.apache.org/mod_mbox/hadoop-common-commits/201204.mbox/%[email protected]%3E

However, the build of Hadoop I was running already included that patch.
Rather than patching Hadoop, I decided to use the same calculation that the
web UI was using.

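
A minimal sketch of the difference between the two calculations, not the
actual patch. TaskState is a hypothetical stand-in for the isComplete(),
isRunning() and wasKilled() checks on TaskInProgress, and the counter values
in main() are made up to show how a counter-based formula can go negative
when the counters disagree; they are not figures from production.

```java
import java.util.Arrays;
import java.util.List;

public class PendingTaskCount {
    // Hypothetical stand-in for the per-task state checks on TaskInProgress.
    enum TaskState { COMPLETE, RUNNING, KILLED, WAITING }

    // Counter-based formula, mirroring the quoted JobInProgress.pendingMaps().
    static int pendingFromCounters(int numTasks, int running, int failedTips,
                                   int finished, int speculative) {
        return numTasks - running - failedTips - finished + speculative;
    }

    // Per-task formula, mirroring the loop in the quoted printTaskSummary().
    // Each task lands in at most one bucket, so the result cannot be negative.
    static int pendingFromTasks(List<TaskState> tasks) {
        int running = 0;
        int finished = 0;
        int killed = 0;
        for (TaskState state : tasks) {
            switch (state) {
                case COMPLETE: finished++; break;
                case RUNNING:  running++;  break;
                case KILLED:   killed++;   break;
                default:       break; // still waiting for a slot
            }
        }
        return tasks.size() - running - killed - finished;
    }

    public static void main(String[] args) {
        List<TaskState> tasks = Arrays.asList(
            TaskState.COMPLETE, TaskState.COMPLETE,
            TaskState.RUNNING, TaskState.WAITING);
        System.out.println(pendingFromTasks(tasks));              // prints 1

        // Made-up counters where "finished" has drifted above the real count.
        System.out.println(pendingFromCounters(4, 1, 0, 4, 0));   // prints -1
    }
}
```

Because the per-task version derives every number from one pass over the
same task array, it stays consistent with what the web UI displays.
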
- Brenden
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/11116/#review21326
-----------------------------------------------------------
On June 6, 2013, 2:08 a.m., Brenden Matthews wrote:
>
> (Updated June 6, 2013, 2:08 a.m.)
>
>
> Review request for mesos.
>
>
> Description
> -------
>
> Fix TaskTracker pending tasks calculation.
>
> Review: https://reviews.apache.org/r/11116
>
>
> Diffs
> -----
>
> hadoop/mesos/src/java/org/apache/hadoop/mapred/MesosScheduler.java
> afe401f5265e3d9494af7eace42eec45943184a3
>
> Diff: https://reviews.apache.org/r/11116/diff/
>
>
> Testing
> -------
>
> Used in production at Airbnb.
>
>
> Thanks,
>
> Brenden Matthews
>
>