On Tue, Jan 08, 2008 at 10:48:48PM -0500, David Dillow wrote:
>
> The aborts are caused by a command timing out in the SCSI mid-layer and
> its error handling taking over -- more details about the escalation are
> in Documentation/scsi_eh.txt and assorted files.

Thanks for decoding those errors/warnings.

> You can turn up the SCSI logging facilities to track down the command
> that is dying, but expect that to be _very_ noisy on a busy system.

From coincident errors on the DDN, it looks like these are SCSI Write
commands (2A) that are failing.

> I've often seen this during the initial bus scan when adding a target to
> SRP, and I've seen it happen under heavy load once -- maybe more, but I
> saw it today for sure.

In our case, I'm pretty sure it is heavy load. Well, I didn't see what
was going on at the time this started, but the targets (LUNs) were
already mounted, and we've been seeing heavy load on the DDN recently.

> I am curious, though, what command could be getting stuck
> for long enough for the mid-layer to time it out -- I think the default
> timeout for the sd driver is 60 seconds, and the INQUIRY timeout is 5
> seconds. I just cannot account for what could be taking that long.

I'm curious too as to why WRITEs are taking so long. :) I think we're
overloading the DDN, but it could be something else going on. This is a
freshly installed configuration (only about a week old), with 6 GFS file
servers reading and writing to ~6 shared LUNs on the DDN over IB/SRP
(which in turn are shared off the servers via NFS to a ~350 node HPC
cluster). We've been running an identical setup for a few years with
another DDN, but over FC. I think we still have a few things to
tune/optimize for IB.

That said, after talking with DDN support, it's looking like something
got wedged on the DDN, which was causing the timeouts.

> Do your targets come back after this? During the scans, mine do, but
> today's under load effectively left the target dead. Rebooting the
> server brought it back.

Yes, after unwedging the DDN, the targets were fully accessible on the
server again.

Thanks for the reply.

John
