[EMAIL PROTECTED] wrote on Fri, 08 Sep 2006 10:34 -0500:
> I've pulled the latest code from the cvs head, and rebuilt it,
> unfortunately the MD server still crashes hard, and now it seems that
> the servers are eating 100% cpu (though I didn't check this prior to
> updating, I suppose that could have been useful)
No clue why the other servers are at 100%, but let's make that a
secondary issue for now (see below).
> Here's a log of the IB version.. I'm almost positive this is an IB
> specific problem at this point, as nobody else is having these problems
> that I know of.
>
> ---- client ----
> p5l6:~# pvfs2-cp -t /tmp/junkfile /pvfs2/6node/
> Wrote 2147483648 bytes in 2.695799 seconds. 759.700592 MB/seconds
> p5l6:~# pvfs2-cp -t /pvfs2/6node/junkfile /dev/null
[..]
> [E 10:23:20.739431] Job time out: cancelling flow operation, job_id: 4370.
Is the MD server also an IO server?
Once a job times out, the cancel path may recover cleanly or it may
be buggy. But we should focus on why the timeout happens in the
first place. If we can solve that, we won't have to look at the
cancel operations.
> and with my current level (lack) of debugging, none of the data servers
> show anything, but are running away at 100% cpu, their logs show nothing
> other than the startup line.
Can you reproduce the failure with "network" debugging enabled?
That would help the most, since we suspect IB issues. If you can
gather logs from the single client and from the server that times
out a job, we can piece together what everybody was doing. If you
can synchronize the clocks on the machines first, we won't have to
correct for clock skew when lining up the logs.
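
If the clocks can't be synchronized, the logs can still be lined up
after the fact. Here is a minimal Python sketch, assuming every log
line starts with a "[L HH:MM:SS.usec]" prefix like the client error
quoted above. The host labels, the per-host offset_us correction, and
the function names are hypothetical, not part of PVFS2:

```python
import re
from typing import Iterable, List, Tuple

# Matches the prefix seen in the client log, e.g.
# "[E 10:23:20.739431] Job time out: ..."
LINE_RE = re.compile(r"^\[(\w) (\d{2}):(\d{2}):(\d{2})\.(\d{6})\] (.*)$")

Entry = Tuple[int, str, str]  # (microseconds since midnight, host, message)


def parse_log(host: str, lines: Iterable[str], offset_us: int = 0) -> List[Entry]:
    """Parse one host's log; offset_us is an assumed clock correction."""
    entries = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # skip lines without a timestamp prefix
        _level, hh, mm, ss, us, msg = m.groups()
        t = ((int(hh) * 60 + int(mm)) * 60 + int(ss)) * 1_000_000 + int(us)
        entries.append((t + offset_us, host, msg))
    return entries


def merge_logs(*logs: List[Entry]) -> List[Entry]:
    """Interleave entries from all hosts in timestamp order.

    Timestamps carry no date, so a run that crosses midnight
    would need extra handling.
    """
    return sorted((e for log in logs for e in log), key=lambda e: (e[0], e[1]))
```

Calling parse_log once per machine and passing the results to
merge_logs gives a single timeline. The offset for each host has to
be measured separately, for example by comparing the output of
"date" on both machines.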
-- Pete