Thank you Wojciech, Alexey, Parinay and John. It looks like the controller (Areca 1280) is the problem. In the logs I found a corresponding error message (arcmsr6:...). Googling this shows that a lot of users are having problems with Areca controllers under heavy load. The OSS crash can be reproduced with tiobench running 32 threads each on two clients (!). So the problem might not be directly related to obdfilter-survey, which just triggers a totally different problem.
Wojciech, I read about this. The odd thing is that the speed per process stays constant even with more processes on the same or another client. The system is obviously capable of much more but shows a limit per reader process.
Alexey, I was not able to capture the sysrq+t output into a file in my test installation, and after discovering the arcmsr message I followed that lead first.
Parinay, I tried to trace the behaviour with strace, but I get no output apart from the "attached" message, and obdfilter-survey does not continue afterwards.
John, panic_on_lbug is not set, and today I saw that the system freezes after 1-2 h even without interrupting obdfilter-survey.
I will run a test with a different controller in the next few days and will post log info if the problem persists.
Thanks again!
Robert

On 05.01.2011 18:25, John Hammond wrote:
> On 01/04/2011 02:14 PM, robert wrote:
>> Hi Everyone!
>>
>> I just set up a Lustre system on CentOS 5.5 and Lustre 1.8.5. There are
>> three identical OSS with four OSTs each.
>>
>> After having fantastic write rates but low read rates, I ran the
>> obdfilter-survey script to get a hint of what may cause this.
>>
>> Unfortunately obdfilter-survey in case=disk mode freezes on two of my
>> three OSS at the write task of the 4 objs, 16 threads line and leaves
>> the system in an unstable state requiring a reboot. The other OSS runs
>> through the script without problems. To exclude a problem in the
>> system's setup, I booted one of the bad OSS with the working OSS' disk,
>> with the same faulty result. Creating a new filesystem on all OSTs of
>> one of the problem OSS did not do the trick either.
>>
>> Any ideas what may cause this behavior? Thanks!
> Do you have panic_on_lbug set?
>
> It's easy to LBUG Lustre by interrupting (Ctrl-C/SIGINT/Arrivederci Roma) a
> running obdfilter-survey.
> Using 1.8.4 on RHEL 5.5:
>
> [r...@oss21 obdfilter-survey]# nobjhi=2 thrhi=2 size=1024 case=disk sh
> obdfilter-survey
> Wed Jan 5 10:51:05 CST 2011 Obdfilter-survey for case=disk from
> oss21.ranger.tacc.utexas.edu
> ost 6 sz 6291456K rsz 1024K obj 6 thr 6 write
> ^C
>
> [r...@oss21 ~]# dmesg
> [87251.960393] Lustre: 11759:(echo_client.c:1409:echo_client_cleanup())
> ASSERTION(eco->eco_refcount == 0) failed
> [87251.960451] Lustre: 11759:(echo_client.c:1409:echo_client_cleanup()) LBUG()
> [87251.960482] Pid: 11759, comm: lctl
> ...
>
> See https://bugzilla.lustre.org/show_bug.cgi?id=21745
>
_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
