Hi All,

I’ve got some very weird problems going on here (and I do have a PMR open with 
IBM).  On Monday I attempted to unlink a fileset, something that I’ve done many 
times with no issues.  This time, however, it hung up the filesystem.  I was 
able to clear things up by shutting down GPFS on the filesystem manager for 
that filesystem and restarting it.

The very next morning we awoke to problems with GPFS.  I noticed that in the 
messages file on all of my NSD servers I had entries like:

Jun 12 22:03:32 nsd32 kernel: sd 0:0:4:2: rejecting I/O to offline device
Jun 12 22:03:32 nsd32 kernel: sd 0:0:4:2: [sdab] Write Protect is off
Jun 12 22:03:32 nsd32 kernel: sd 0:0:4:2: rejecting I/O to offline device
Jun 12 22:03:32 nsd32 kernel: sd 0:0:4:2: [sdab] Asking for cache data failed
Jun 12 22:03:32 nsd32 kernel: sd 0:0:4:2: [sdab] Assuming drive cache: write through
Jun 12 22:03:32 nsd32 kernel: sd 0:0:4:2: rejecting I/O to offline device
Jun 12 22:03:32 nsd32 kernel: sd 0:0:4:2: [sdab] Attached SCSI disk
Jun 12 22:03:32 nsd32 multipathd: sdab: add path (uevent)
Jun 12 22:03:32 nsd32 multipathd: sdab: failed to get path uid
Jun 12 22:03:32 nsd32 multipathd: uevent trigger error
Jun 12 22:03:42 nsd32 kernel: rport-0:0-4: blocked FC remote port time out: removing target and saving binding

Since we use an FC SAN and Linux multipathing, I was expecting some sort of 
problem with the switches.  And sure enough, on the switches I see messages like:

 [114][Thu Jun 15 19:02:05.411 UTC 2017][I][8600.0020][Port][Port: 9][SYNC_LOSS]
  [115][Thu Jun 15 19:03:49.988 UTC 2017][I][8600.001F][Port][Port: 9][SYNC_ACQ]

These switch messages (though not the two in this example) do correlate 
time-wise with the multipath messages on the servers.  So it’s not a GPFS 
problem and I shouldn’t be bugging this list about this EXCEPT…
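(For anyone who wants to check the correlation themselves, here’s a rough 
sketch of how I’ve been lining the timestamps up.  The sample lines, the 
hard-coded year, and the five-minute window are all assumptions; in practice 
you’d read the full /var/log/messages and a saved switch event log.)

```python
from datetime import datetime, timedelta

# Sample lines copied from the logs above; in practice read these from
# /var/log/messages and a saved switch event log (paths are up to you).
syslog_lines = [
    "Jun 12 22:03:32 nsd32 kernel: sd 0:0:4:2: rejecting I/O to offline device",
]
switch_lines = [
    "[114][Thu Jun 15 19:02:05.411 UTC 2017][I][8600.0020][Port][Port: 9][SYNC_LOSS]",
]

YEAR = 2017  # syslog lines omit the year, so supply it (an assumption)

def parse_syslog(line):
    # "Jun 12 22:03:32 ..." -> datetime
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"{YEAR} {stamp}", "%Y %b %d %H:%M:%S")

def parse_switch(line):
    # "[114][Thu Jun 15 19:02:05.411 UTC 2017][I]..." -> datetime
    stamp = line.split("][")[1]  # "Thu Jun 15 19:02:05.411 UTC 2017"
    return datetime.strptime(stamp, "%a %b %d %H:%M:%S.%f UTC %Y")

def correlated(t1, t2, window=timedelta(minutes=5)):
    # "Correlate" here just means within an arbitrary 5-minute window
    return abs(t1 - t2) <= window

sys_times = [parse_syslog(l) for l in syslog_lines]
sw_times = [parse_switch(l) for l in switch_lines]

for s in sys_times:
    for w in sw_times:
        if correlated(s, w):
            print(f"possible match: syslog {s} <-> switch {w}")
```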

These issues only started on Monday after I ran the mmunlinkfileset command.  
That’s right … NO such errors prior to then.  And literally NOTHING changed on 
Monday with my SAN environment (nothing had changed there for months actually). 
 Nothing added to nor removed from the SAN.  No changes until today when, in an 
attempt to solve this issue, I updated the switch firmware on all switches one 
at a time.  I also yum updated to the latest RHEL 7 version of the multipathd 
packages.

I’ve been Googling and haven’t found anything useful on those SYNC_LOSS 
messages on the QLogic SANbox 5800 switches.  Anybody out there happen to have 
any knowledge of them and what could be causing them?  Oh, and I’m 
investigating this now … but it’s not all ports that are throwing the errors.  
The ports that are affected seem random and don’t have one specific type of 
hardware plugged in … i.e. some ports have NSD servers plugged in, others have 
storage arrays.
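(To see which ports are throwing the errors, I’ve just been tallying SYNC_LOSS 
events per port from a saved copy of the switch event log.  A quick sketch, 
using the sample line from above; the log format is exactly what the SANbox 
prints, everything else is an assumption:)

```python
from collections import Counter
import re

# Sample lines; in practice read the full saved switch event log.
switch_lines = [
    "[114][Thu Jun 15 19:02:05.411 UTC 2017][I][8600.0020][Port][Port: 9][SYNC_LOSS]",
    "[115][Thu Jun 15 19:03:49.988 UTC 2017][I][8600.001F][Port][Port: 9][SYNC_ACQ]",
]

# Count SYNC_LOSS events per port number
sync_loss_by_port = Counter(
    re.search(r"\[Port: (\d+)\]", line).group(1)
    for line in switch_lines
    if "SYNC_LOSS" in line
)

for port, count in sync_loss_by_port.most_common():
    print(f"port {port}: {count} SYNC_LOSS event(s)")
```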

I understand that it makes no sense that mmunlinkfileset hanging would cause 
problems with my SAN … but I also don’t believe in coincidences!

I’m running GPFS 4.2.2.3.  Any help / suggestions appreciated!

—
Kevin Buterbaugh - Senior System Administrator
Vanderbilt University - Advanced Computing Center for Research and Education
[email protected]<mailto:[email protected]> - 
(615)875-9633



_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
