On Thursday 24 July 2008 21:40:22 Devaraj Das wrote:
> On 7/25/08 12:09 AM, "Andreas Kostyrka" <[EMAIL PROTECTED]> wrote:
> > On Thursday 24 July 2008 15:19:22 Devaraj Das wrote:
> >> Could you try to kill the tasktracker hosting the task the next time
> >> it happens? I just want to isolate the problem - whether it is a
> >> problem in the TT-JT communication or in the Task-TT communication.
> >> From your description it looks like the problem is between the JT-TT
> >> communication. But pls run the experiment when it happens again and
> >> let us know what happens.
> >
> > Well, I did restart the tasktracker where the reduce task was running,
> > but that led only to a situation where the jobtracker did not restart
> > the job, showed it as still running, and I was not able to kill the
> > reduce task via hadoop job -kill-task nor -fail-task.
>
> The reduce task would eventually be reexecuted (after some timeout,
> defaulting to 10 minutes, the tasktracker would be assumed to be lost
> and all reducers that were running on that node would be reexecuted).
>
> > Hoping to avoid a repeat, I'll be rolling our cluster back to 0.15
> > today. A peer at another startup confirmed the whole batch of problems
> > I've been experiencing, and for him 0.15 works in production.
> >
> > <rant-mode>
> > No question, 0.17 is way better than 0.16; on the other hand I wonder
> > how 0.16 could get released. (I'm using streaming.jar, and with 0.16.x
> > I introduced reducing to our workloads; before 0.17, it failed >80% of
> > the jobs with reducers not being able to get their output. 0.17.0
> > improved that to the point where one can, with some pain - e.g.
> > restarting the cluster daily, not storing anything important on HDFS,
> > only temporary data, ... - use it somehow in production, at least for
> > small jobs.) So one wonders how 0.16 got released. Or was it meant only
> > as a developer-only bug-fix series?
> > </rant-mode>
>
> Pls raise jiras for the specific problems.
I know, that's why I bracketed it as rant-mode. OTOH, many of these issues either had this creepy feeling where you wondered if you had done something wrong, or were issues where I had to react relatively quickly, which usually destroys the faulty state. (I know, for a developer a reproduced bug is golden. For an admin being asked about processing lag, it's rather the opposite.) Plus, fixing the issue in the next release or even via a patch means that I have a non-working cluster until then. And that means I would need to start debugging the cluster utility software instead of our apps. ;(

Andreas
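
For reference, a minimal sketch of the knobs discussed above, assuming a 0.17-era classic MapReduce setup. The task attempt id is a placeholder, and the property name/default is what I believe ships in hadoop-default.xml of that line - please double-check against your release:

    # Manually fail or kill a specific task attempt from the CLI
    # (<task-attempt-id> is a placeholder, e.g. as shown on the JobTracker web UI)
    bin/hadoop job -fail-task <task-attempt-id>
    bin/hadoop job -kill-task <task-attempt-id>

    # The "lost TaskTracker" timeout mentioned above is controlled by
    # mapred.tasktracker.expiry.interval (milliseconds, default 600000 = 10 min);
    # override it in hadoop-site.xml if you want the JobTracker to give up sooner.

Lowering that interval makes the JobTracker declare a dead TaskTracker lost sooner and reschedule its tasks earlier, at the cost of more spurious re-executions on a heavily loaded cluster.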
