Hello,
We are having an interesting issue with OST connectivity that we just can't
figure out on our own. For background, in this instance we have one VM running
as an MDS and two physical servers with directly connected storage systems
serving as OSSs. There are two OSTs per OSS, making four total, and
pacemaker/corosync is set up to run HA for the OSTs. When I set up this instance
in February, I tested thoroughly and made sure that I could run the filesystem
with any combination of OSTs running on either of the OSSs. Lately, however,
the OSTs have connectivity issues if they run on a certain OSS. For instance,
if OST0 and OST1 are running on OSS1, and OST2 and OST3 are running on OSS2,
the filesystem works fine with no issues. But if I migrate any OST to the
other OSS, that OST mounts and appears to be working fine in 'lctl dl' run
from the MDS, yet all files located on the affected OST are unavailable from
any client, and 'lfs check servers' run from a client hangs for a while, then
shows "resource temporarily unavailable (11)" for that OST. Any attempt to
access or even check the metadata of a file (ls, df, du, etc.) freezes the
session.
I kicked off a 'lfsck_start -o -t layout -A' from the MDT, and it completed
without finding anything to repair.
I'd appreciate it if anyone could point me in a direction to look for answers
to this issue.

root@[MDS] ~ $ lctl dl
  0 UP osd-ldiskfs lustrest-MDT0000-osd lustrest-MDT0000-osd_UUID 11
  1 UP mgs MGS MGS 56
  2 UP mgc MGC[MDS STORAGE NETWORK  IP]@tcp bc90ff88-6a97-fd41-f1af-97bf148bf883 4
  3 UP mds MDS MDS_uuid 2
  4 UP lod lustrest-MDT0000-mdtlov lustrest-MDT0000-mdtlov_UUID 3
  5 UP mdt lustrest-MDT0000 lustrest-MDT0000_UUID 60
  6 UP mdd lustrest-MDD0000 lustrest-MDD0000_UUID 3
  7 UP qmt lustrest-QMT0000 lustrest-QMT0000_UUID 3
  8 UP osp lustrest-OST0000-osc-MDT0000 lustrest-MDT0000-mdtlov_UUID 4
  9 UP osp lustrest-OST0001-osc-MDT0000 lustrest-MDT0000-mdtlov_UUID 4
 10 UP osp lustrest-OST0002-osc-MDT0000 lustrest-MDT0000-mdtlov_UUID 4
 11 UP osp lustrest-OST0003-osc-MDT0000 lustrest-MDT0000-mdtlov_UUID 4
 12 UP lwp lustrest-MDT0000-lwp-MDT0000 lustrest-MDT0000-lwp-MDT0000_UUID 4

Attached is a screencap of the 'lfs check servers' output during the failure.
