Hi,

We are running Lustre 1.6.3 using o2ib (OFED 1.2) and tcp networks.
The clients are patchless CentOS 4.5 nodes, and the single server
(MGS/MDS/OSS combo) runs CentOS 5.0 with a patched kernel (which
includes the proposed fix for bug 13438).  All nodes are x86_64.

We have run into a problem on the clients when one of our users
tries to install an rpm package into an rpm database that lives
on the Lustre filesystem.

The rpm install command hangs in uninterruptible I/O wait (D state);
the client is using o2ib.  Attempts to access the rpm database directory
from other processes, such as ls, also hang in D state.  ldlm_poold and
pdflush are stuck as well:

[EMAIL PROTECTED] ~]# ps auxww | grep ' D ' | grep -v grep
root        79  0.0  0.0     0    0 ?        D    Oct26   0:01 [pdflush]
uscms01   7155  0.0  0.0 16912 1684 ?        D    Oct26   0:00 ls -al /scratch/mri/osg/app/cmssoft/cms/slc3_ia32_gcc323/var/lib/rpm
uscms01  11712  0.0  0.0 16912 1684 ?        D    Oct27   0:00 ls -al /scratch/mri/osg/app/cmssoft/cms/slc3_ia32_gcc323/var/lib/rpm
uscms01  15363  0.0  0.0  2820  820 ?        D    Oct27   0:00 /usr/sbin/lsof /scratch/mri/osg/app/cmssoft/cms/slc3_ia32_gcc323/var/lib
uscms01  16589  0.0  0.1 12700 5564 ?        D    Oct26   0:02 rpm -Uvh --define _rpmlock_path /scratch/mri/osg/app/cmssoft/cms/slc3_ia32_gcc323/var/lib/rpm/__db.0 -r /scratch/mri/osg/app/cmssoft/cms --dbpath /scratch/mri/osg/app/cmssoft/cms/slc3_ia32_gcc323/var/lib/rpm --rcfile /scratch/mri/osg/app/cmssoft/cms/tmp/BOOTSTRAP/build/eulisse/rpm/PKGTOOLS/slc3_ia32_gcc323/external/rpm/4.4.2.1-CMS3/lib/rpm/rpmrc --nodeps --prefix /scratch/mri/osg/app/cmssoft/cms --ignoreos --ignorearch /scratch/mri/osg/app/cmssoft/cms/tmp/BOOTSTRAP/external+elfutils+0.128-CMS3-1-1007.slc3_ia32_gcc323.rpm
uscms01  21531  0.0  0.0 16912 1684 ?        D    Oct27   0:00 ls -al /scratch/mri/osg/app/cmssoft/cms/slc3_ia32_gcc323/var/lib/rpm
root     23827  0.0  0.0     0    0 ?        D    Oct26   0:02 [ldlm_poold]
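
For anyone trying to reproduce this, here is a minimal Python sketch of
the same D-state check.  It reads the STAT column of `ps aux`-style
output instead of grepping for ' D ', so states such as "D+" or "Ds"
are caught too.  The function names are hypothetical, and it assumes
the usual 11-column procps layout (USER PID %CPU %MEM VSZ RSS TTY STAT
START TIME COMMAND).

```python
import subprocess


def d_state_processes(ps_output):
    """Return (pid, stat, command) for each process whose STAT starts with 'D'.

    Assumes `ps aux`-style output: one header row, then 11 columns per
    process, with COMMAND as the last (possibly space-containing) column.
    """
    stuck = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 10)  # keep COMMAND intact as field 10
        if len(fields) == 11 and fields[7].startswith("D"):
            stuck.append((int(fields[1]), fields[7], fields[10]))
    return stuck


def list_stuck_processes():
    """Run `ps auxww` on the local node and return its D-state processes."""
    result = subprocess.run(
        ["ps", "auxww"], capture_output=True, text=True, check=True
    )
    return d_state_processes(result.stdout)
```

Calling list_stuck_processes() on an affected client should return the
same processes as the listing above, one tuple per process.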

Rebooting the client and remounting restores access to the rpm
database directory (ls works), but if the user starts their commands
again, the problem repeats.

We tried adding the 'flock' mount option, and the user was able to
complete the software installation once, but then the problem returned.

In the system logs on the client, I see some LustreErrors, but am
unsure whether they correspond to the user's activity (see appended).
They mention bug 11742, which does not appear to have a solution.

Has anyone seen this before, or does anyone know of a fix?  Any help
would be appreciated.

Thanks,
Craig

From the client:

Oct 26 12:56:33 iogw1 kernel: LustreError: 132-0: BAD WRITE CHECKSUM: changed 
in transit AND doesn't match the original - likely false positive due to mmap 
IO (bug 11742): from 
[EMAIL PROTECTED] inum 18200061/3052028908 object 3065594/0 extent 
[446464-450559]
Oct 26 12:56:33 iogw1 kernel: LustreError: 
23766:0:(osc_request.c:1087:check_write_checksum()) original client csum 
947c387a, server csum bdfdb0f6, client csum now 279c1071
Oct 26 12:56:33 iogw1 kernel: LustreError: 
23766:0:(osc_request.c:1281:osc_brw_redo_request()) @@@ redo for checksum error 
 [EMAIL PROTECTED] x390550/t31236588 
o4->[EMAIL PROTECTED]@o2ib:28 lens 384/352 ref 2 fl Complete:R/0/0 rc 0/0
Oct 26 12:56:33 iogw1 kernel: LustreError: 132-0: BAD WRITE CHECKSUM: changed 
in transit AND doesn't match the original - likely false positive due to mmap 
IO (bug 11742): from 
[EMAIL PROTECTED] inum 18200059/1461467594 object 3307767/0 extent [20480-24575]
Oct 26 12:56:33 iogw1 kernel: LustreError: 
23766:0:(osc_request.c:1087:check_write_checksum()) original client csum 
38323908, server csum 8ecf8abb, client csum now 8e60616c
Oct 26 12:56:33 iogw1 kernel: LustreError: 
23766:0:(osc_request.c:1281:osc_brw_redo_request()) @@@ redo for checksum error 
 [EMAIL PROTECTED] x390536/t34503458 
o4->[EMAIL PROTECTED]@o2ib:28 lens 384/352 ref 2 fl Complete:R/0/0 rc 0/0
Oct 26 12:56:33 iogw1 kernel: LustreError: 132-0: BAD WRITE CHECKSUM: changed 
in transit AND doesn't match the original - likely false positive due to mmap 
IO (bug 11742): from 
[EMAIL PROTECTED] inum 18200060/1314702654 object 3279245/0 extent 
[1204224-1318911]
Oct 26 12:56:33 iogw1 kernel: LustreError: Skipped 3 previous similar messages
Oct 26 12:56:33 iogw1 kernel: LustreError: 
23766:0:(osc_request.c:1087:check_write_checksum()) original client csum 
18528fa8, server csum 4cda3c4e, client csum now 7b2e3f37
Oct 26 12:56:33 iogw1 kernel: LustreError: 
23766:0:(osc_request.c:1087:check_write_checksum()) Skipped 3 previous similar 
messages
Oct 26 12:56:33 iogw1 kernel: LustreError: 
23766:0:(osc_request.c:1281:osc_brw_redo_request()) @@@ redo for checksum error 
 [EMAIL PROTECTED] x390556/t29531809 
o4->[EMAIL PROTECTED]@o2ib:28 lens 432/360 ref 2 fl Complete:R/0/0 rc 0/0
Oct 26 12:56:33 iogw1 kernel: LustreError: 
23766:0:(osc_request.c:1281:osc_brw_redo_request()) Skipped 3 previous similar 
messages
Oct 26 12:56:33 iogw1 kernel: LustreError: 132-0: BAD WRITE CHECKSUM: changed 
in transit AND doesn't match the original - likely false positive due to mmap 
IO (bug 11742): from 
[EMAIL PROTECTED] inum 18200060/1314702654 object 3279245/0 extent 
[1204224-1318911]
Oct 26 12:56:33 iogw1 kernel: LustreError: Skipped 4 previous similar messages
Oct 26 12:56:33 iogw1 kernel: LustreError: 
23766:0:(osc_request.c:1087:check_write_checksum()) original client csum 
6fc1808f, server csum f9c02df2, client csum now 271aff8c
Oct 26 12:56:33 iogw1 kernel: LustreError: 
23766:0:(osc_request.c:1087:check_write_checksum()) Skipped 4 previous similar 
messages
Oct 26 12:56:33 iogw1 kernel: LustreError: 
23766:0:(osc_request.c:1281:osc_brw_redo_request()) @@@ redo for checksum error 
 [EMAIL PROTECTED] x390559/t29531810 
o4->[EMAIL PROTECTED]@o2ib:28 lens 432/360 ref 2 fl Complete:R/0/0 rc 0/0
Oct 26 12:56:33 iogw1 kernel: LustreError: 
23766:0:(osc_request.c:1281:osc_brw_redo_request()) Skipped 4 previous similar 
messages
Oct 26 12:56:33 iogw1 kernel: LustreError: 132-0: BAD WRITE CHECKSUM: changed 
in transit AND doesn't match the original - likely false positive due to mmap 
IO (bug 11742): from 
[EMAIL PROTECTED] inum 18200061/3052028908 object 3065594/0 extent 
[446464-450559]
Oct 26 12:56:33 iogw1 kernel: LustreError: Skipped 2 previous similar messages
Oct 26 12:56:33 iogw1 kernel: LustreError: 
23766:0:(osc_request.c:1087:check_write_checksum()) original client csum 
be7ec5be, server csum 952e539d, client csum now 7c80e126
Oct 26 12:56:33 iogw1 kernel: LustreError: 
23766:0:(osc_request.c:1087:check_write_checksum()) Skipped 2 previous similar 
messages
Oct 26 12:56:33 iogw1 kernel: LustreError: 
23766:0:(osc_request.c:1277:osc_brw_redo_request()) too many checksum retries, 
returning error
Oct 26 12:56:33 iogw1 kernel: LustreError: 
23766:0:(osc_request.c:1281:osc_brw_redo_request()) @@@ redo for checksum error 
 [EMAIL PROTECTED] x390567/t29531812 
o4->[EMAIL PROTECTED]@o2ib:28 lens 432/360 ref 2 fl Complete:R/0/0 rc 0/0
Oct 26 12:56:33 iogw1 kernel: LustreError: 
23766:0:(osc_request.c:1281:osc_brw_redo_request()) Skipped 3 previous similar 
messages
Oct 26 12:56:33 iogw1 kernel: LustreError: 
23766:0:(osc_request.c:1277:osc_brw_redo_request()) too many checksum retries, 
returning error
Oct 26 12:56:33 iogw1 kernel: LustreError: 
23766:0:(osc_request.c:1277:osc_brw_redo_request()) Skipped 1 previous similar 
message
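
To make a log excerpt like the one above easier to compare against a
user's activity, here is a small Python sketch that tallies the
checksum messages.  The function name and the result keys are
hypothetical.  It only pattern-matches the message text shown above
(tolerating the line wrapping) and does not interpret the errors
beyond comparing the reported checksum values.

```python
import re

# Matches the three checksums reported by check_write_checksum(),
# allowing for the line wrapping seen in the pasted log.
CSUM_RE = re.compile(
    r"original client csum\s+([0-9a-f]+),\s+server csum\s+([0-9a-f]+),"
    r"\s+client csum now\s+([0-9a-f]+)"
)


def summarize_checksum_errors(log_text):
    """Count checksum-related LustreError messages in a block of log text."""
    triples = CSUM_RE.findall(log_text)
    return {
        # "BAD WRITE CHECKSUM" messages that were actually printed
        "bad_write_checksum": len(
            re.findall(r"BAD\s+WRITE\s+CHECKSUM", log_text)
        ),
        # writes that gave up after repeated checksum retries
        "retries_exhausted": len(
            re.findall(r"too\s+many\s+checksum\s+retries", log_text)
        ),
        # (original client, server, client now) checksum values
        "csum_triples": triples,
        # cases where the recomputed client checksum differs from the original
        "recomputed_differs": sum(
            1 for original, _server, now in triples if original != now
        ),
    }
```

Note that the kernel suppresses repeats ("Skipped N previous similar
messages"), so these counts are a lower bound on the number of events.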



_______________________________________________
Lustre-discuss mailing list
https://mail.clusterfs.com/mailman/listinfo/lustre-discuss
