Hi,

We are running Lustre 1.6.3 using o2ib (OFED 1.2) and tcp networks. Clients are CentOS 4.5 patchless clients, and the single server (MGS/MDS/OSS combo) is CentOS 5.0 with a patched kernel (includes the proposed fix for bug 13438). All nodes are x86_64.

We have run into a problem on the clients when one of our users tries to install an rpm package into an rpm database that lives on the Lustre filesystem. The rpm install command hangs in I/O wait state (the client is using o2ib). Attempts to access the rpm database directory from other processes, such as ls, also hang in D state. ldlm_poold and pdflush are stuck as well:

[EMAIL PROTECTED] ~]# ps auxww | grep ' D ' | grep -v grep
root 79 0.0 0.0 0 0 ? D Oct26 0:01 [pdflush]
uscms01 7155 0.0 0.0 16912 1684 ? D Oct26 0:00 ls -al /scratch/mri/osg/app/cmssoft/cms/slc3_ia32_gcc323/var/lib/rpm
uscms01 11712 0.0 0.0 16912 1684 ? D Oct27 0:00 ls -al /scratch/mri/osg/app/cmssoft/cms/slc3_ia32_gcc323/var/lib/rpm
uscms01 15363 0.0 0.0 2820 820 ? D Oct27 0:00 /usr/sbin/lsof /scratch/mri/osg/app/cmssoft/cms/slc3_ia32_gcc323/var/lib
uscms01 16589 0.0 0.1 12700 5564 ? D Oct26 0:02 rpm -Uvh --define _rpmlock_path /scratch/mri/osg/app/cmssoft/cms/slc3_ia32_gcc323/var/lib/rpm/__db.0 -r /scratch/mri/osg/app/cmssoft/cms --dbpath /scratch/mri/osg/app/cmssoft/cms/slc3_ia32_gcc323/var/lib/rpm --rcfile /scratch/mri/osg/app/cmssoft/cms/tmp/BOOTSTRAP/build/eulisse/rpm/PKGTOOLS/slc3_ia32_gcc323/external/rpm/4.4.2.1-CMS3/lib/rpm/rpmrc --nodeps --prefix /scratch/mri/osg/app/cmssoft/cms --ignoreos --ignorearch /scratch/mri/osg/app/cmssoft/cms/tmp/BOOTSTRAP/external+elfutils+0.128-CMS3-1-1007.slc3_ia32_gcc323.rpm
uscms01 21531 0.0 0.0 16912 1684 ? D Oct27 0:00 ls -al /scratch/mri/osg/app/cmssoft/cms/slc3_ia32_gcc323/var/lib/rpm
root 23827 0.0 0.0 0 0 ? D Oct26 0:02 [ldlm_poold]

Rebooting the client and remounting restores access to the rpm database directory (ls works), but if the user starts their commands again, the problem repeats. We tried adding the 'flock' mount option, and the user was able to complete his software installation once, but then the problem returned.

In the system logs on the client, I see some LustreErrors, but I am unsure whether they correspond to the user's activity (see appended).
They mention bug 11742, which does not appear to have a solution. Has anyone ever seen this before or know of a fix? Any help would be appreciated.

Thanks,
Craig

From the client:

Oct 26 12:56:33 iogw1 kernel: LustreError: 132-0: BAD WRITE CHECKSUM: changed in transit AND doesn't match the original - likely false positive due to mmap IO (bug 11742): from [EMAIL PROTECTED] inum 18200061/3052028908 object 3065594/0 extent [446464-450559]
Oct 26 12:56:33 iogw1 kernel: LustreError: 23766:0:(osc_request.c:1087:check_write_checksum()) original client csum 947c387a, server csum bdfdb0f6, client csum now 279c1071
Oct 26 12:56:33 iogw1 kernel: LustreError: 23766:0:(osc_request.c:1281:osc_brw_redo_request()) @@@ redo for checksum error [EMAIL PROTECTED] x390550/t31236588 o4->[EMAIL PROTECTED]@o2ib:28 lens 384/352 ref 2 fl Complete:R/0/0 rc 0/0
Oct 26 12:56:33 iogw1 kernel: LustreError: 132-0: BAD WRITE CHECKSUM: changed in transit AND doesn't match the original - likely false positive due to mmap IO (bug 11742): from [EMAIL PROTECTED] inum 18200059/1461467594 object 3307767/0 extent [20480-24575]
Oct 26 12:56:33 iogw1 kernel: LustreError: 23766:0:(osc_request.c:1087:check_write_checksum()) original client csum 38323908, server csum 8ecf8abb, client csum now 8e60616c
Oct 26 12:56:33 iogw1 kernel: LustreError: 23766:0:(osc_request.c:1281:osc_brw_redo_request()) @@@ redo for checksum error [EMAIL PROTECTED] x390536/t34503458 o4->[EMAIL PROTECTED]@o2ib:28 lens 384/352 ref 2 fl Complete:R/0/0 rc 0/0
Oct 26 12:56:33 iogw1 kernel: LustreError: 132-0: BAD WRITE CHECKSUM: changed in transit AND doesn't match the original - likely false positive due to mmap IO (bug 11742): from [EMAIL PROTECTED] inum 18200060/1314702654 object 3279245/0 extent [1204224-1318911]
Oct 26 12:56:33 iogw1 kernel: LustreError: Skipped 3 previous similar messages
Oct 26 12:56:33 iogw1 kernel: LustreError: 23766:0:(osc_request.c:1087:check_write_checksum()) original client csum 18528fa8, server csum 4cda3c4e, client csum now 7b2e3f37
Oct 26 12:56:33 iogw1 kernel: LustreError: 23766:0:(osc_request.c:1087:check_write_checksum()) Skipped 3 previous similar messages
Oct 26 12:56:33 iogw1 kernel: LustreError: 23766:0:(osc_request.c:1281:osc_brw_redo_request()) @@@ redo for checksum error [EMAIL PROTECTED] x390556/t29531809 o4->[EMAIL PROTECTED]@o2ib:28 lens 432/360 ref 2 fl Complete:R/0/0 rc 0/0
Oct 26 12:56:33 iogw1 kernel: LustreError: 23766:0:(osc_request.c:1281:osc_brw_redo_request()) Skipped 3 previous similar messages
Oct 26 12:56:33 iogw1 kernel: LustreError: 132-0: BAD WRITE CHECKSUM: changed in transit AND doesn't match the original - likely false positive due to mmap IO (bug 11742): from [EMAIL PROTECTED] inum 18200060/1314702654 object 3279245/0 extent [1204224-1318911]
Oct 26 12:56:33 iogw1 kernel: LustreError: Skipped 4 previous similar messages
Oct 26 12:56:33 iogw1 kernel: LustreError: 23766:0:(osc_request.c:1087:check_write_checksum()) original client csum 6fc1808f, server csum f9c02df2, client csum now 271aff8c
Oct 26 12:56:33 iogw1 kernel: LustreError: 23766:0:(osc_request.c:1087:check_write_checksum()) Skipped 4 previous similar messages
Oct 26 12:56:33 iogw1 kernel: LustreError: 23766:0:(osc_request.c:1281:osc_brw_redo_request()) @@@ redo for checksum error [EMAIL PROTECTED] x390559/t29531810 o4->[EMAIL PROTECTED]@o2ib:28 lens 432/360 ref 2 fl Complete:R/0/0 rc 0/0
Oct 26 12:56:33 iogw1 kernel: LustreError: 23766:0:(osc_request.c:1281:osc_brw_redo_request()) Skipped 4 previous similar messages
Oct 26 12:56:33 iogw1 kernel: LustreError: 132-0: BAD WRITE CHECKSUM: changed in transit AND doesn't match the original - likely false positive due to mmap IO (bug 11742): from [EMAIL PROTECTED] inum 18200061/3052028908 object 3065594/0 extent [446464-450559]
Oct 26 12:56:33 iogw1 kernel: LustreError: Skipped 2 previous similar messages
Oct 26 12:56:33 iogw1 kernel: LustreError: 23766:0:(osc_request.c:1087:check_write_checksum()) original client csum be7ec5be, server csum 952e539d, client csum now 7c80e126
Oct 26 12:56:33 iogw1 kernel: LustreError: 23766:0:(osc_request.c:1087:check_write_checksum()) Skipped 2 previous similar messages
Oct 26 12:56:33 iogw1 kernel: LustreError: 23766:0:(osc_request.c:1277:osc_brw_redo_request()) too many checksum retries, returning error
Oct 26 12:56:33 iogw1 kernel: LustreError: 23766:0:(osc_request.c:1281:osc_brw_redo_request()) @@@ redo for checksum error [EMAIL PROTECTED] x390567/t29531812 o4->[EMAIL PROTECTED]@o2ib:28 lens 432/360 ref 2 fl Complete:R/0/0 rc 0/0
Oct 26 12:56:33 iogw1 kernel: LustreError: 23766:0:(osc_request.c:1281:osc_brw_redo_request()) Skipped 3 previous similar messages
Oct 26 12:56:33 iogw1 kernel: LustreError: 23766:0:(osc_request.c:1277:osc_brw_redo_request()) too many checksum retries, returning error
Oct 26 12:56:33 iogw1 kernel: LustreError: 23766:0:(osc_request.c:1277:osc_brw_redo_request()) Skipped 1 previous similar message

_______________________________________________
Lustre-discuss mailing list
[email protected]
https://mail.clusterfs.com/mailman/listinfo/lustre-discuss
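
To check whether the checksum errors above keep hitting the same few files (for example the rpm database files), the log can be grouped by inode, object and extent. The following is only a rough sketch in Python: the regular expression is inferred from the message format in the excerpt above, the names `summarize` and `SAMPLE` are made up for illustration, and the second number in each `inum` and `object` pair is ignored because its meaning is not stated in the log.

```python
import re
from collections import Counter

# Pattern inferred from the "BAD WRITE CHECKSUM" lines in the log excerpt;
# it is an assumption about the format, not taken from the Lustre sources.
BAD_CSUM = re.compile(
    r"BAD WRITE CHECKSUM.*?inum (\d+)/(\d+) object (\d+)/(\d+) "
    r"extent \[(\d+)-(\d+)\]"
)


def summarize(lines):
    """Count checksum reports per (inum, object, extent start, extent end).

    Also counts how often the client gave up retrying.
    """
    hits = Counter()
    retries_exhausted = 0
    for line in lines:
        match = BAD_CSUM.search(line)
        if match:
            inum, _, obj, _, start, end = match.groups()
            hits[(int(inum), int(obj), int(start), int(end))] += 1
        elif "too many checksum retries" in line:
            retries_exhausted += 1
    return hits, retries_exhausted


# A few lines copied from the log above, shortened at the front only.
SAMPLE = [
    "LustreError: 132-0: BAD WRITE CHECKSUM: changed in transit AND doesn't "
    "match the original - likely false positive due to mmap IO (bug 11742): "
    "from [EMAIL PROTECTED] inum 18200061/3052028908 object 3065594/0 "
    "extent [446464-450559]",
    "LustreError: 132-0: BAD WRITE CHECKSUM: changed in transit AND doesn't "
    "match the original - likely false positive due to mmap IO (bug 11742): "
    "from [EMAIL PROTECTED] inum 18200060/1314702654 object 3279245/0 "
    "extent [1204224-1318911]",
    "LustreError: 132-0: BAD WRITE CHECKSUM: changed in transit AND doesn't "
    "match the original - likely false positive due to mmap IO (bug 11742): "
    "from [EMAIL PROTECTED] inum 18200061/3052028908 object 3065594/0 "
    "extent [446464-450559]",
    "LustreError: 23766:0:(osc_request.c:1277:osc_brw_redo_request()) "
    "too many checksum retries, returning error",
]

if __name__ == "__main__":
    hits, gave_up = summarize(SAMPLE)
    for (inum, obj, start, end), count in hits.most_common():
        print(f"inum {inum} object {obj} extent [{start}-{end}]: {count}")
    print(f"retries exhausted: {gave_up}")
```

Because the kernel suppresses repeats ("Skipped N previous similar messages"), the counts are lower bounds. In practice the lines would be read from the client's syslog file rather than from `SAMPLE`.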

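
For the hung processes in the ps listing, it can also help to see which kernel function each D-state task is sleeping in. The sketch below reads `/proc/<pid>/stat` and `/proc/<pid>/wchan` on Linux. It assumes a Python 3 interpreter, which the CentOS 4.5 clients do not ship by default, and it should be run as root so that `wchan` is readable. The function names `parse_stat` and `blocked_processes` are illustrative only.

```python
import os


def parse_stat(stat_text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    left, _, rest = stat_text.partition("(")
    comm, _, tail = rest.rpartition(")")
    return int(left), comm, tail.split()[0]


def blocked_processes(proc_root="/proc"):
    """Yield (pid, comm, wchan) for every task in uninterruptible sleep."""
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "stat")) as handle:
                pid, comm, state = parse_stat(handle.read())
        except OSError:
            continue  # the process exited while we were scanning
        if state != "D":
            continue
        try:
            with open(os.path.join(base, "wchan")) as handle:
                wchan = handle.read().strip() or "?"
        except OSError:
            wchan = "?"  # wchan missing or not readable
        yield pid, comm, wchan


if __name__ == "__main__":
    for pid, comm, wchan in blocked_processes():
        print(pid, comm, wchan)
```

If several of the stuck tasks report the same wait channel, that is a hint about which lock or request they are all blocked on.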