> On June 3, 2013, 5:36 p.m., Ben Mahler wrote:
> > What differences were you seeing between job.pendingMaps() and this new
> > technique?
> >
> > Looking at JobInProgress.pendingMaps():
> >
> >   public synchronized int pendingMaps() {
> >     return numMapTasks - runningMapTasks - failedMapTIPs -
> >       finishedMapTasks + speculativeMapTasks;
> >   }
> >
> > vs. jobdetails_jsp.printTaskSummary() {
> >     int totalTasks = tasks.length;
> >     int runningTasks = 0;
> >     int finishedTasks = 0;
> >     int killedTasks = 0;
> >     int failedTaskAttempts = 0;
> >     int killedTaskAttempts = 0;
> >     for (int i = 0; i < totalTasks; ++i) {
> >       TaskInProgress task = tasks[i];
> >       if (task.isComplete()) {
> >         finishedTasks += 1;
> >       } else if (task.isRunning()) {
> >         runningTasks += 1;
> >       } else if (task.wasKilled()) {
> >         killedTasks += 1;
> >       }
> >       failedTaskAttempts += task.numTaskFailures();
> >       killedTaskAttempts += task.numKilledTasks();
> >     }
> >     int pendingTasks = totalTasks - runningTasks - killedTasks -
> >       finishedTasks;
> >     ...
> >   }
> >
> > It seems like the difference here might be between failed_ vs. killed_
> > and/or the fact that the former case uses speculativeMapTasks?
>
> Brenden Matthews wrote:
> Yes, possibly. To be honest I don't remember, it's been months since I
> fixed this. All I remember specifically is that it was wrong.
>
> Ben Mahler wrote:
> Ah, yes that's what I'm getting at, wrong in what way?
>
> Ben Mahler wrote:
> Hey Brenden, I would really like to get this change in as I trust that
> you've found an issue with the code. But for posterity, it would be nice to
> have a clearer explanation of what was wrong with the old code (so if we run
> into bugs again we can understand why we chose to do this instead of just
> calling pendingMaps/Reduces). Perhaps we need a different term for "pending",
> i.e., we're looking for tasks that don't have a slot open to run on.
>
> Do you remember how you noticed there was a problem? Did you notice in
> production? Was it because pending was 0, despite there being map tasks that
> needed to be run? Was the symptom that there were tasks that were not running
> because pending was 0?
>
> Brenden Matthews wrote:
> Hi Ben,
>
> As I recall, I was finding that the values Mesos was reporting in the log
> did not match the Hadoop JobTracker web UI. In fact, on many occasions the
> values for the pending tasks were negative. Since it's not possible to have
> a negative count, it was obviously broken.
>
> I thought perhaps it was related to this issue:
>
>
> https://mail-archives.apache.org/mod_mbox/hadoop-common-commits/201204.mbox/%[email protected]%3E
>
> But the build of Hadoop I was running had this patch. I decided that
> rather than patching Hadoop, I should just use the same calculation that the
> web UI was using.
Perfect! Can you mention that negative values were observed, and reference
https://issues.apache.org/jira/browse/MAPREDUCE-1238? It would be great to
mention that you've seen discrepancies between pendingMaps/Reduces and what the
web UI reports as pending.
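
For posterity, here is a minimal sketch of the two calculations discussed
above, under stated assumptions. TaskState and both method names are
hypothetical stand-ins, not Hadoop APIs, and the counter values are made up
purely to show how the counter-based formula can go negative when its counters
disagree with the actual task list. The per-task count puts each task in at
most one bucket, so it cannot drop below zero.

```java
import java.util.Arrays;
import java.util.List;

public class PendingTasksSketch {
    // Hypothetical stand-in for the per-task checks used by the web UI
    // (isComplete / isRunning / wasKilled); not a real Hadoop type.
    enum TaskState { PENDING, RUNNING, COMPLETE, KILLED }

    // Counter-based formula, same shape as JobInProgress.pendingMaps().
    // Nothing here stops the result from going negative if the counters drift.
    static int pendingFromCounters(int numTasks, int running, int failedTIPs,
                                   int finished, int speculative) {
        return numTasks - running - failedTIPs - finished + speculative;
    }

    // Per-task count, same shape as the loop in printTaskSummary().
    // Each task lands in at most one bucket, so the result is never negative.
    static int pendingFromTasks(List<TaskState> tasks) {
        int running = 0;
        int finished = 0;
        int killed = 0;
        for (TaskState state : tasks) {
            if (state == TaskState.COMPLETE) {
                finished += 1;
            } else if (state == TaskState.RUNNING) {
                running += 1;
            } else if (state == TaskState.KILLED) {
                killed += 1;
            }
        }
        return tasks.size() - running - killed - finished;
    }

    public static void main(String[] args) {
        // Four tasks: two running, one complete, one still waiting for a slot.
        List<TaskState> tasks = Arrays.asList(
            TaskState.RUNNING, TaskState.RUNNING,
            TaskState.COMPLETE, TaskState.PENDING);

        // Made-up counters that over-count running tasks for the same job.
        System.out.println(pendingFromCounters(4, 4, 0, 1, 0)); // prints -1
        System.out.println(pendingFromTasks(tasks));            // prints 1
    }
}
```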
- Ben
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/11116/#review21326
-----------------------------------------------------------
On June 6, 2013, 2:08 a.m., Brenden Matthews wrote:
>
> (Updated June 6, 2013, 2:08 a.m.)
>
>
> Review request for mesos.
>
>
> Description
> -------
>
> Fix TaskTracker pending tasks calculation.
>
> Review: https://reviews.apache.org/r/11116
>
>
> Diffs
> -----
>
> hadoop/mesos/src/java/org/apache/hadoop/mapred/MesosScheduler.java
> afe401f5265e3d9494af7eace42eec45943184a3
>
> Diff: https://reviews.apache.org/r/11116/diff/
>
>
> Testing
> -------
>
> Used in production at Airbnb.
>
>
> Thanks,
>
> Brenden Matthews
>
>