Hi Vlad - Randy had done some work to work around this; I guess I was confused about what he had actually done, because I thought it addressed something else! At any rate, can you try out the stable branch and see if the changes help?
If they don't work, we'll start working on it from there.
Cheers,
Kyle Schochenmaier

On Tue, Jun 12, 2012 at 12:19 PM, vlad <[email protected]> wrote:
> Hi Kyle!
>
>> Hi vlad, this is a new one for me, and similar issues rarely occur under
>> relatively low loads like 1 GB/s in my experience. Are you able to reproduce
>> by using pvfs2-cp /input/file /dev/null and specifying -b to set block
>> sizes? If this is what I think it is, you shouldn't have any associated
>> timeouts on the server side; can you verify?
>
> Yeah, that is and was true...
>
> This is the new faulty output:
>
> "..
> [root@doppler18 ~]# time /share/apps/orangefs/bin/pvfs2-cp -b 8192k
> /scratchfs/testfile-100GB.dump /dev/null
> [E 17:24:41.214472] Error: encourage_recv_incoming: mop_id 7fdd60000950 in
> RTS_DONE message not found.
> [E 17:24:41.223019] [bt] /share/apps/orangefs/bin/pvfs2-cp(error+0xca)
> [0x4689ba]
> [E 17:24:41.223036] [bt] /share/apps/orangefs/bin/pvfs2-cp() [0x465d64]
> [E 17:24:41.223044] [bt] /share/apps/orangefs/bin/pvfs2-cp() [0x467b05]
> [E 17:24:41.223052] [bt]
> /share/apps/orangefs/bin/pvfs2-cp(BMI_testcontext+0xf3) [0x4549c3]
> [E 17:24:41.223060] [bt]
> /share/apps/orangefs/bin/pvfs2-cp(PINT_thread_mgr_bmi_push+0x159)
> [0x4599c9]
> [E 17:24:41.223068] [bt] /share/apps/orangefs/bin/pvfs2-cp() [0x455aca]
> [E 17:24:41.223075] [bt]
> /share/apps/orangefs/bin/pvfs2-cp(job_testcontext+0x12a) [0x4562ba]
> [E 17:24:41.223082] [bt]
> /share/apps/orangefs/bin/pvfs2-cp(PINT_client_state_machine_test+0xd2)
> [0x411632]
> [E 17:24:41.223090] [bt]
> /share/apps/orangefs/bin/pvfs2-cp(PINT_client_wait_internal+0x78)
> [0x4118b8]
> [E 17:24:41.223098] [bt]
> /share/apps/orangefs/bin/pvfs2-cp(PVFS_sys_io+0xae) [0x420ffe]
> [E 17:24:41.223105] [bt] /share/apps/orangefs/bin/pvfs2-cp(main+0x3b2)
> [0x40e492]
>
> real 0m13.389s
> user 0m13.001s
> sys 0m0.024s
>
> ...
> "
> That should have been about 100 GB... but tomorrow I will run a test by
> reducing the block size to 4 MB, copying that 100 GB file, and post the
> results again.
>
> Mind you, a copy of 10 GB (block size 8 MB) went through right now without
> errors, though.
>
> I forgot to tell you that our nodes each have 2x Opteron 6200 CPUs with
> 64 GB of RAM installed, so there is some caching involved. Also, there are
> active jobs running on my nodes at present.
>
>> More info to come once I get into the office.
>
> Great! Thanks!
>
> Greetings
> Vlad
>
>> On Jun 12, 2012 7:52 AM, "vlad" <[email protected]> wrote:
>>
>>> Hi!
>>>
>>> We are evaluating OrangeFS 2.8.6 with QDR InfiniBand on Rocks cluster
>>> suite 6.0 (based on CentOS 6.x), and I have set up 8 nodes (doppler14-20
>>> and doppler22). Each node is a metadata server, storage server, and
>>> client.
>>>
>>> The connection is made via ib://doppler18:3335/pvfs2-fs. The file system
>>> is mounted to /scratchfs via the kernel interface (pvfs2.ko). Our kernel
>>> version is "2.6.32-220.13.1.el6.x86_64".
>>>
>>> We get very impressive transfer rates (600-800 MB/s) when we dump very
>>> big files (1 TB) onto the filesystem (dd if=/dev/zero
>>> of=/scratchfs/testfile.dump bs=8192K), but when reading the dump back to
>>> /dev/zero the client core collapses and our /scratchfs becomes
>>> inaccessible.
>>>
>>> The use of pvfs2fuse does not improve the situation, since we get a
>>> socket error (usually after dumping 1 GB of data, sometimes earlier,
>>> sometimes later...). The pvfs2fuse mount point also becomes
>>> inaccessible.
>>>
>>> I found this in one of our client log files:
>>> "..
>>> [E 14:22:23.279365] Error: encourage_recv_incoming: mop_id 7f6ce4000950
>>> in
>>> RTS_DONE message not found.
>>> [E 14:22:23.292947] [bt] pvfs2-client-core(error+0xca) [0x46f91a]
>>> [E 14:22:23.292978] [bt] pvfs2-client-core() [0x46ccc4]
>>> [E 14:22:23.292999] [bt] pvfs2-client-core() [0x46ea65]
>>> [E 14:22:23.293018] [bt] pvfs2-client-core(BMI_testcontext+0xf3)
>>> [0x45aa83]
>>> [E 14:22:23.293037] [bt]
>>> pvfs2-client-core(PINT_thread_mgr_bmi_push+0x159) [0x4608a9]
>>> [E 14:22:23.293056] [bt] pvfs2-client-core() [0x45c9aa]
>>> [E 14:22:23.293074] [bt] pvfs2-client-core(job_testcontext+0x12a)
>>> [0x45d19a]
>>> [E 14:22:23.293092] [bt]
>>> pvfs2-client-core(PINT_client_state_machine_testsome+0xee) [0x41757e]
>>> [E 14:22:23.293111] [bt] pvfs2-client-core() [0x412ecd]
>>> [E 14:22:23.293130] [bt] pvfs2-client-core(main+0x703) [0x413fb3]
>>> [E 14:22:23.293165] [bt] /lib64/libc.so.6(__libc_start_main+0xfd)
>>> [0x392b41ecdd]
>>> [E 14:22:23.303725] pvfs2-client-core with pid 29108 exited with value 1
>>> .."
>>>
>>> I have not found any evidence of this error in the server log files,
>>> though.
>>>
>>> This is the contents of our /etc/pvfs2tab:
>>>
>>> "ib://doppler18:3335/pvfs2-fs /scratchfs pvfs2 defaults,noauto 0 0"
>>>
>>> Please, can you help me stabilize read access to our files?
>>>
>>> Greetings from Salzburg/Austria/Europe
>>>
>>> Vlad Popa
>>>
>>> University of Salzburg
>>> Dept. of Computer Science - HPC Computing
>>> Jakob-Harringer-Str. 2
>>> 5020 Salzburg
>>> Tel 0043-662-80446313
>>> mail: [email protected]
>>> _______________________________________________
>>> Pvfs2-users mailing list
>>> [email protected]
>>> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
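For anyone following along, the write-then-read pattern discussed in this thread can be exercised with plain dd. This is a minimal sketch only: the MOUNT variable, the local tmp-dir default, and the 128 MiB smoke-test size are stand-ins for illustration; the original runs used /scratchfs and files up to 1 TB, and triggering the crash requires an actual PVFS2/OrangeFS mount.

```shell
#!/bin/sh
# Sketch of the benchmark pattern from the thread. MOUNT would normally be
# the PVFS2 mount point (/scratchfs in the report); here it defaults to a
# local temp dir so the commands run anywhere.
MOUNT=${MOUNT:-$(mktemp -d)}
BS=8192K        # block size under test; the thread also tries 4 MB
COUNT=16        # 16 x 8 MiB = 128 MiB for a quick smoke test

# Write phase: fast and stable in Vlad's report.
dd if=/dev/zero of="$MOUNT/testfile.dump" bs=$BS count=$COUNT 2>/dev/null

# Read phase: the direction that crashed pvfs2-client-core on PVFS2.
dd if="$MOUNT/testfile.dump" of=/dev/null bs=$BS 2>/dev/null && echo "read ok"
```

On a real mount, the read phase can also be driven through pvfs2-cp with the same -b block size, as in the log above, which bypasses the kernel client.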
