We are in the middle of upgrading our servers to Lustre 1.8.3, but are still 
running Lustre 1.6.6 in production. The upgrade is on track for the next 
couple of months, but it has to happen during our next center-wide outage, so 
in the meantime I need to fix an issue with this older version of Lustre.

We have 6 Lustre clients that recently stopped being able to communicate with 
one OSS via the o2ib Lustre interface. I can ping over TCP and ibping in both 
directions, but "lctl ping nid" returns an Input/Output error. The OSTs on 
this OSS show as inactive.

The clients can talk to the 'inactive' OSTs via the tcp interface if the o2ib 
lustre interface is disabled on the clients.
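
The LNet-level check described above can be scripted across both NIDs of the 
OSS. This is only a minimal sketch, assuming that "lctl ping" exits non-zero 
when the ping fails; the helper names and the example NIDs are placeholders 
for illustration, not our real addresses.

```python
import subprocess


def lnet_ping(nid, run=subprocess.run):
    """Return True if `lctl ping <nid>` exits with status 0.

    Assumption: lctl signals a failed ping through a non-zero exit code.
    `run` is injectable so the logic can be exercised without lctl installed.
    """
    result = run(["lctl", "ping", nid], capture_output=True, text=True)
    return result.returncode == 0


def compare_nids(nids, run=subprocess.run):
    """Ping every NID and return a dict of {nid: reachable}."""
    return {nid: lnet_ping(nid, run) for nid in nids}


# Hypothetical usage on a client (placeholder NIDs):
#   compare_nids(["192.0.2.10@tcp", "192.0.2.10@o2ib"])
```

On an affected client I would expect the tcp NID to come back reachable and 
the o2ib NID to come back unreachable.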

scw-045:~ # lfs df -h
UUID                     bytes      Used Available  Use% Mounted on
ls09-MDT0000_UUID        61.7G      2.5G     48.9G    4% /mnt/lustre_scratch_2009[MDT:0]
ls09-OST0000_UUID         1.8T      1.1T    622.7G   60% /mnt/lustre_scratch_2009[OST:0]
ls09-OST0001_UUID         1.8T      1.1T    608.9G   61% /mnt/lustre_scratch_2009[OST:1]
ls09-OST0002_UUID         1.8T      1.1T    649.5G   59% /mnt/lustre_scratch_2009[OST:2]
ls09-OST0003_UUID         1.8T   1000.8G    739.9G   54% /mnt/lustre_scratch_2009[OST:3]
ls09-OST0004_UUID         1.8T      1.1T    602.0G   62% /mnt/lustre_scratch_2009[OST:4]
ls09-OST0005_UUID         1.8T    960.2G    780.5G   52% /mnt/lustre_scratch_2009[OST:5]
ls09-OST0006_UUID         1.8T      1.1T    570.0G   63% /mnt/lustre_scratch_2009[OST:6]
ls09-OST0007_UUID         1.8T      1.2T    519.4G   66% /mnt/lustre_scratch_2009[OST:7]
ls09-OST0008_UUID         1.8T    888.3G    852.4G   48% /mnt/lustre_scratch_2009[OST:8]
ls09-OST0009_UUID         1.8T    951.3G    789.4G   51% /mnt/lustre_scratch_2009[OST:9]
ls09-OST000a_UUID         1.8T      1.0T    688.8G   57% /mnt/lustre_scratch_2009[OST:10]
ls09-OST000b_UUID         1.8T    969.9G    770.8G   52% /mnt/lustre_scratch_2009[OST:11]
ls09-OST000c_UUID         1.8T      1.0T    695.4G   56% /mnt/lustre_scratch_2009[OST:12]
ls09-OST000d_UUID         1.8T      1.0T    680.1G   57% /mnt/lustre_scratch_2009[OST:13]
ls09-OST000e_UUID         1.8T    901.8G    838.8G   49% /mnt/lustre_scratch_2009[OST:14]
ls09-OST000f_UUID         1.8T      1.0T    695.4G   56% /mnt/lustre_scratch_2009[OST:15]
ls09-OST0010_UUID         1.8T    995.2G    745.4G   54% /mnt/lustre_scratch_2009[OST:16]
ls09-OST0011_UUID         1.8T    919.9G    820.7G   50% /mnt/lustre_scratch_2009[OST:17]
ls09-OST0012_UUID   : inactive device
ls09-OST0013_UUID   : inactive device
ls09-OST0014_UUID   : inactive device
ls09-OST0015_UUID   : inactive device
ls09-OST0016_UUID   : inactive device
ls09-OST0017_UUID   : inactive device
ls09-OST0018_UUID   : inactive device
ls09-OST0019_UUID   : inactive device
ls09-OST001a_UUID   : inactive device
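
To track which targets are affected without reading the listing by eye, the 
inactive entries can be pulled out of the "lfs df" output with a short script. 
This is a minimal sketch, assuming the output format shown above; the function 
names are my own and purely illustrative.

```python
import re

# Matches lines such as "ls09-OST0012_UUID   : inactive device".
INACTIVE_RE = re.compile(r"^(\S+_UUID)\s*:\s*inactive device\s*$")

# The four characters after "OST" are the target index in hexadecimal
# (for example, OST000a is listed as [OST:10] in the output above).
OST_INDEX_RE = re.compile(r"-OST([0-9a-fA-F]{4})_UUID$")


def inactive_targets(lfs_df_output):
    """Return the UUIDs that `lfs df` reports as inactive, in listing order."""
    found = []
    for line in lfs_df_output.splitlines():
        match = INACTIVE_RE.match(line.strip())
        if match:
            found.append(match.group(1))
    return found


def ost_index(uuid):
    """Decode the OST index from a UUID, or return None if it is not an OST."""
    match = OST_INDEX_RE.search(uuid)
    return int(match.group(1), 16) if match else None
```

Passing the captured text of "lfs df" to inactive_targets() and then mapping 
each UUID through ost_index() gives the list of OST indices served by the 
unreachable OSS.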

On the client, I'm seeing these log messages when attempting to mount the 
filesystem:
Jul 27 11:17:38 scw-045 kernel: LustreError: 2947:0:(o2iblnd_cb.c:2468:kiblnd_rejected()) 10.0.0...@o2ib rejected: consumer defined fatal error
Jul 27 11:17:38 scw-045 kernel: LustreError: 2947:0:(o2iblnd_cb.c:2468:kiblnd_rejected()) Skipped 48 previous similar messages
Jul 27 11:17:38 scw-045 kernel: LustreError: 2947:0:(events.c:66:request_out_callback()) @@@ type 4, status -113  r...@ffff81021d0aa800 x2157/t0 o8->ls09-OST[email protected]@o2ib:6/4 lens 240/400 e 0 to 1 dl 1280243863 ref 2 fl Rpc:N/0/0 rc 0/0
Jul 27 11:17:38 scw-045 kernel: LustreError: 2947:0:(events.c:66:request_out_callback()) Skipped 223 previous similar messages
Jul 27 11:17:38 scw-045 kernel: Lustre: Request x2157 sent from ls09-OST001a-osc-ffff810219166800 to NID 10.0.0...@o2ib 0s ago has timed out (limit 5s).
Jul 27 11:17:38 scw-045 kernel: Lustre: Skipped 223 previous similar messages
Jul 27 11:18:53 scw-045 kernel: Lustre: 4746:0:(import.c:507:import_select_connection()) ls09-OST0012-osc-ffff810219166800: tried all connections, increasing latency to 50s
Jul 27 11:18:53 scw-045 kernel: Lustre: 4746:0:(import.c:507:import_select_connection()) Skipped 224 previous similar messages

Rebooting an affected client has not resolved the issue. All other clients and 
servers are fully functional, and I was not able to find any notable errors in 
our IB network.

My next step is most likely rebooting the OSS, but I'd like to avoid that if 
I can: some of our users run rather unusual applications that would be quite 
sensitive to a 15-minute IO pause.

Thanks,


-Greg


--
Greg Mason
HPC Administrator
Michigan State University
High Performance Computing Center

web: www.hpcc.msu.edu
email: [email protected]



