Peter Bojanic wrote:
Nathan,
The following bugs were reported by Cray when they attempted to
integrate Lustre 1.6.0 beta6 with their environment. Could these
problems have been found by integrating with a stock SLES 9
distribution first? Is anyone running Lustre 1.6.0 with SLES 9 yet?
There were 2 SUSE-specific issues and at least 1 Catamount-specific
issue. 1 issue was well known ahead of time. 2 are intended behavior.
I think 4 issues (11138, 11120, 11114, 11091) would have been caught by
SUSE liblustre testing, and 2 more (11093, maybe 11147) by doing our
own Catamount testing.
11153 unexpected errors seen during evict by nid
I don't think this is an error yet - still looking. We don't seem to
actually have a test for evict by nid though; a regression test for this
should be added to our test suite.
11147 more sanity test failures in 1.6 beta
There are a few issues in this bug. 10809 is dealt with below. I don't
see the LNET issue on my x86 machine, and I am not sure whether SLES 9
testing would have shown it.
11143 sanity test fails in fcntl test
Previously known as 10842, still open. I have been very vocal about this
bug for a while now (since before the Cray testing).
11138 sanity fails early in 1.6 beta
Previously known and fixed as 10999, but the fix didn't make it into the beta.
11134 liblustre clients won't connect to servers
Cross-version issue that I never thought to check. We need to add a
"cross-version liblustre check" to our major release process.
11133 old mount syntax does not work
Expected behavior; the documentation has been clarified.
11120 bad ldiskfs build suse in 1.6 beta
This would have been found with a SUSE build: it was a malformed patch
from an update from b1_4.
11114 bad patch in 1.6 beta
This would have been found with SUSE testing.
11102 no zero-copy TCP in new 1.6 beta
Intentional.
11093 build failure in 1.6 beta
This would only have been found in an environment where HAVE_LIBPTHREAD
isn't defined, which afaik is only Catamount. It might have been found in
a very careful code review.
11091 build question about latest 1.6 beta
This broke many builds and should have been caught by our current
testing. (It was an unreviewed change to the build that should never have
been checked in.) I think this was just unlucky timing for Cray.
10809 liblustre sanity test fails
I dropped the ball on this one: the test 55 failure was masked by the
earlier test 21 failure (10842), and I never tried to run the remaining
tests. I have now started skipping test 21 to get the remaining coverage.
I should have done this as soon as it became apparent that 10842 would
take a while to fix.
Please give a run-down of how these problems could have been prevented
in the first place. Also, how could the liblustre issues have been
identified prior to running on Catamount?
Thanks,
Bojanic