Hi Vlad - Randy had done some work to work around this; I guess I was confused about what he had actually done, because I thought it addressed something else! At any rate, can you try out the stable branch and see if the changes help?
If they don't work, we'll start working on it from there.
Cheers,
Kyle Schochenmaier

On Tue, Jun 12, 2012 at 12:19 PM, vlad <[email protected]> wrote:
> Hi Kyle!
>
>> Hi vlad, this is a new one for me, and similar issues rarely occur under
>> relatively low loads like 1 GB/s in my experience. Are you able to reproduce
>> by using pvfs2-cp /input/file /dev/null and specifying -b to set block
>> sizes? If this is what I think it is, you shouldn't have any associated
>> timeouts on the server side; can you verify?
>
> Yeah, that is and was true...
>
> This is the new faulty output:
>
> "..
> [root@doppler18 ~]# time /share/apps/orangefs/bin/pvfs2-cp -b 8192k
> /scratchfs/testfile-100GB.dump /dev/null
> [E 17:24:41.214472] Error: encourage_recv_incoming: mop_id 7fdd60000950 in
> RTS_DONE message not found.
> [E 17:24:41.223019] [bt] /share/apps/orangefs/bin/pvfs2-cp(error+0xca)
> [0x4689ba]
> [E 17:24:41.223036] [bt] /share/apps/orangefs/bin/pvfs2-cp() [0x465d64]
> [E 17:24:41.223044] [bt] /share/apps/orangefs/bin/pvfs2-cp() [0x467b05]
> [E 17:24:41.223052] [bt]
> /share/apps/orangefs/bin/pvfs2-cp(BMI_testcontext+0xf3) [0x4549c3]
> [E 17:24:41.223060] [bt]
> /share/apps/orangefs/bin/pvfs2-cp(PINT_thread_mgr_bmi_push+0x159)
> [0x4599c9]
> [E 17:24:41.223068] [bt] /share/apps/orangefs/bin/pvfs2-cp() [0x455aca]
> [E 17:24:41.223075] [bt]
> /share/apps/orangefs/bin/pvfs2-cp(job_testcontext+0x12a) [0x4562ba]
> [E 17:24:41.223082] [bt]
> /share/apps/orangefs/bin/pvfs2-cp(PINT_client_state_machine_test+0xd2)
> [0x411632]
> [E 17:24:41.223090] [bt]
> /share/apps/orangefs/bin/pvfs2-cp(PINT_client_wait_internal+0x78)
> [0x4118b8]
> [E 17:24:41.223098] [bt]
> /share/apps/orangefs/bin/pvfs2-cp(PVFS_sys_io+0xae) [0x420ffe]
> [E 17:24:41.223105] [bt] /share/apps/orangefs/bin/pvfs2-cp(main+0x3b2)
> [0x40e492]
>
> real 0m13.389s
> user 0m13.001s
> sys 0m0.024s
>
> ...
> "
> That should have been about 100 GB... but tomorrow I will run a test by
> reducing the block size to 4 MB, copying that 100 GB file, and post the
> results again.
>
> Mind you, a copy of 10 GB (block size 8 MB) went through right now without
> errors, though.
>
> I forgot to tell you that our nodes each have 2x Opteron 6200 CPUs with
> 64 GB of RAM installed, so there is some caching involved. Also, there are
> active jobs running on my nodes at present.
>
>> More info to come once I get into the office.
>
> Great! Thanks!
>
> Greetings
> Vlad
>
>> On Jun 12, 2012 7:52 AM, "vlad" <[email protected]> wrote:
>>
>>> Hi!
>>>
>>> We are evaluating OrangeFS 2.8.6 with QDR InfiniBand on Rocks cluster
>>> suite 6.0 (based on CentOS 6.x), and I have set up 8 nodes (doppler14-20
>>> and doppler22). Each node is a metadata server, storage server, and
>>> client.
>>>
>>> The connection is made via ib://doppler18:3335/pvfs2-fs. The file system
>>> is mounted to /scratchfs via the kernel interface (pvfs2.ko). Our kernel
>>> version is "2.6.32-220.13.1.el6.x86_64".
>>>
>>> We get very impressive transfer rates (600-800 MB/s) when we dump very
>>> big files (1 TB) onto the filesystem (dd if=/dev/zero
>>> of=/scratchfs/testfile.dump bs=8192K), but when reading the dump back to
>>> /dev/zero the client core collapses and our /scratchfs becomes
>>> inaccessible.
>>>
>>> The use of pvfs2fuse does not improve the situation, since we get a
>>> socket error (usually after dumping 1 GB of data, sometimes earlier,
>>> sometimes later...). The pvfs2fuse mount point also becomes
>>> inaccessible.
>>>
>>> I found this in one of our client log files:
>>> "..
>>> [E 14:22:23.279365] Error: encourage_recv_incoming: mop_id 7f6ce4000950
>>> in
>>> RTS_DONE message not found.
>>> [E 14:22:23.292947] [bt] pvfs2-client-core(error+0xca) [0x46f91a]
>>> [E 14:22:23.292978] [bt] pvfs2-client-core() [0x46ccc4]
>>> [E 14:22:23.292999] [bt] pvfs2-client-core() [0x46ea65]
>>> [E 14:22:23.293018] [bt] pvfs2-client-core(BMI_testcontext+0xf3)
>>> [0x45aa83]
>>> [E 14:22:23.293037] [bt]
>>> pvfs2-client-core(PINT_thread_mgr_bmi_push+0x159) [0x4608a9]
>>> [E 14:22:23.293056] [bt] pvfs2-client-core() [0x45c9aa]
>>> [E 14:22:23.293074] [bt] pvfs2-client-core(job_testcontext+0x12a)
>>> [0x45d19a]
>>> [E 14:22:23.293092] [bt]
>>> pvfs2-client-core(PINT_client_state_machine_testsome+0xee) [0x41757e]
>>> [E 14:22:23.293111] [bt] pvfs2-client-core() [0x412ecd]
>>> [E 14:22:23.293130] [bt] pvfs2-client-core(main+0x703) [0x413fb3]
>>> [E 14:22:23.293165] [bt] /lib64/libc.so.6(__libc_start_main+0xfd)
>>> [0x392b41ecdd]
>>> [E 14:22:23.303725] pvfs2-client-core with pid 29108 exited with value 1
>>> .."
>>>
>>> I have not found any evidence of this error in the server log files,
>>> though.
>>>
>>> This is the contents of our /etc/pvfs2tab:
>>>
>>> "ib://doppler18:3335/pvfs2-fs /scratchfs pvfs2 defaults,noauto 0 0"
>>>
>>> Please, can you help me stabilize read access to our files?
>>>
>>> Greetings from Salzburg/Austria/Europe
>>>
>>> Vlad Popa
>>>
>>> University of Salzburg
>>> Dept. of Computer Science - HPC Computing
>>> Jakob-Harringer-Str. 2
>>> 5020 Salzburg
>>> Tel 0043-662-80446313
>>> mail: [email protected]
>>> _______________________________________________
>>> Pvfs2-users mailing list
>>> [email protected]
>>> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
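For anyone following along, the write-then-read pattern discussed in this thread can be exercised with plain dd. This is a minimal sketch only: the MOUNT variable, the local tmp-dir default, and the 128 MiB smoke-test size are stand-ins for illustration; the original runs used /scratchfs and files up to 1 TB, and triggering the crash requires an actual PVFS2/OrangeFS mount.

```shell
#!/bin/sh
# Sketch of the benchmark pattern from the thread. MOUNT would normally be
# the PVFS2 mount point (/scratchfs in the report); here it defaults to a
# local temp dir so the commands run anywhere.
MOUNT=${MOUNT:-$(mktemp -d)}
BS=8192K        # block size under test; the thread also tries 4 MB
COUNT=16        # 16 x 8 MiB = 128 MiB for a quick smoke test

# Write phase: fast and stable in Vlad's report.
dd if=/dev/zero of="$MOUNT/testfile.dump" bs=$BS count=$COUNT 2>/dev/null

# Read phase: the direction that crashed pvfs2-client-core on PVFS2.
dd if="$MOUNT/testfile.dump" of=/dev/null bs=$BS 2>/dev/null && echo "read ok"
```

On a real mount, the read phase can also be driven through pvfs2-cp with the same -b block size, as in the log above, which bypasses the kernel client.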
