Is it possible that a response just isn't making it back to the clients for some reason?

If the client library can't find anything else to do, it is normal for it to spend most of its time sleeping in either poll() or epoll() until the messages it needs show up. It should give up eventually, but the job timeouts may be set rather high by default; it looks like the defaults are 300-second timeouts with 5 retries.

You might find some more information by setting the PVFS2_DEBUGMASK environment variable to "network" before running one of the pvfs2-* utilities that hangs. If that doesn't turn up anything useful, you could try setting it to "verbose" to get even more debugging output. In conjunction with this, you might want to set ClientJobBMITimeoutSecs and ClientJobFlowTimeoutSecs to something lower (say, 30 seconds) so you can see whether the client times out and retries while you watch.
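For example (a sketch; the file names are placeholders, and I believe the timeout options live in the Defaults section of the fs config file, so double-check that against your own config):

  export PVFS2_DEBUGMASK=network
  pvfs2-cp /etc/hosts /home/A40001/u72877927/florin/mnt/pvfs2/hosts-test

  # in the fs config file:
  <Defaults>
      ClientJobBMITimeoutSecs 30
      ClientJobFlowTimeoutSecs 30
  </Defaults>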

-Phil

Murali Vilayannur wrote:
Hi Florin,
Thanks for getting back on that!
This is quite weird; it probably points to some platform-specific library issue.
Since we do use threads, perhaps it is time to rerun configure
with thread usage disabled and see if that helps?

./configure --disable-thread-safety is something you can try.
Perhaps ./configure --enable-nptl-workaround is also worth trying
(not together with the previous one, though) to work around glibc
oddities.
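For example, one full rebuild per attempt (just a sketch; adjust the
install step to however you normally install PVFS2):

  make clean
  ./configure --disable-thread-safety
  make && make install

and then, as a separate experiment, the same sequence with
--enable-nptl-workaround instead.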
Sam, RobL, Pete, any ideas? I am lost.. :(
A final alternative is perhaps a live debugging session on your machine, if possible..
thanks,
Murali

On 7/2/07, Florin Isaila <[EMAIL PROTECTED]> wrote:
Hi,

Many thanks, Murali. I have just tried that, but it keeps getting stuck
with an even stranger stack trace:

(gdb) bt
#0  0x0ff4b2d0 in poll () from /lib/tls/libc.so.6
#1  0x0ffc871c in ?? () from /lib/tls/libc.so.6
#2  0x0ffc871c in ?? () from /lib/tls/libc.so.6
Previous frame identical to this frame (corrupt stack?)

Any other suggestions?

Best regards
Florin

On 7/2/07, Murali Vilayannur <[EMAIL PROTECTED]> wrote:
> Hi Florin,
> Given that both your backtraces point to epoll(), can you run make
> clean followed by configure with --disable-epoll, rebuild everything
> and see if that works?
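> For example (a sketch; adjust the install step to your setup):
>
>   make clean
>   ./configure --disable-epoll
>   make && make install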
> If it does work, it probably points to some epoll-specific bug on ppc,
> either in pvfs2 or the libepoll code..
> thanks,
> Murali
>
> On 7/2/07, Florin Isaila <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> > We have installed PVFS2 2.6.3 over Ethernet on a SUSE distribution,
> > locally on a dual-processor (PowerPC 970FX) machine.
> >
> > Some commands like pvfs2-ping, pvfs2-mkdir, pvfs2-ls (w/o parameters)
> > work fine.
> >
> > But we cannot get some pvfs2-* commands to run. For instance,
> > pvfs2-cp gets stuck. Here is the gdb backtrace:
> >
> > (gdb) bt
> > #0  0x0ff5596c in epoll_wait () from /lib/tls/libc.so.6
> > #1  0x100a062c in BMI_socket_collection_testglobal (scp=0x100e48b0,
> > incount=128, outcount=0xffff97b0, maps=0xffff93b0, status=0xffff95b0,
> >     poll_timeout=10, external_mutex=0x100d2ce0)
> >     at socket-collection-epoll.c:281
> > #2  0x1009bf24 in tcp_do_work (max_idle_time=10) at bmi-tcp.c:2681
> > #3 0x10098d10 in BMI_tcp_testcontext (incount=5, out_id_array=0x100d2b58,
> >     outcount=0xffff9864, error_code_array=0x100d2b80,
> > actual_size_array=0x100d2b98, user_ptr_array=0x100d2bc0, max_idle_time=10,
> >     context_id=0) at bmi-tcp.c:1303
> > #4 0x1005aa18 in BMI_testcontext (incount=5, out_id_array=0x100d2b58,
> >     outcount=0x100d14cc, error_code_array=0x100d2b80,
> >     actual_size_array=0x100d2b98, user_ptr_array=0x100d2bc0,
> >     max_idle_time_ms=10, context_id=0) at bmi.c:944
> > #5  0x10071fc8 in bmi_thread_function (ptr=0x0) at thread-mgr.c:239
> > #6  0x10072e24 in PINT_thread_mgr_bmi_push (max_idle_time=10)
> >     at thread-mgr.c:815
> > #7 0x10071460 in do_one_work_cycle_all (idle_time_ms=10) at job.c:4661
> > #8  0x1007025c in job_testcontext (out_id_array_p=0xffff99d0,
> >     inout_count_p=0xffff99b8, returned_user_ptr_array=0xffffd1d0,
> > out_status_array_p=0xffffa1d0, timeout_ms=10, context_id=1) at job.c:4068
> > #9  0x1000fdb0 in PINT_client_state_machine_test (op_id=3,
> >     error_code=0xffffd670) at client-state-machine.c:536
> > #10 0x1001041c in PINT_client_wait_internal (op_id=3,
> >     in_op_str=0x100b209c "fs_add", out_error=0xffffd670,
> >     in_class_str=0x100a97d4 "sys") at client-state-machine.c:733
> > #11 0x10010734 in PVFS_sys_wait (op_id=3, in_op_str=0x100b209c "fs_add",
> >     out_error=0xffffd670) at client-state-machine.c:861
> > #12 0x10035c4c in PVFS_sys_fs_add (mntent=0x100d3030) at fs-add.sm:205
> > #13 0x1004c220 in PVFS_util_init_defaults () at pvfs2-util.c:1040
> > #14 0x1000a5c8 in main (argc=3, argv=0xffffe3b4) at pvfs2-cp.c:135
> >
> > At other times (though rarely) it gets stuck in a different place:
> >
> > (gdb) bt
> > #0  0x0ff5596c in epoll_wait () from /lib/tls/libc.so.6
> > #1  0x100a062c in BMI_socket_collection_testglobal (scp=0x100e48b0,
> > incount=128, outcount=0xffff9b30, maps=0xffff9730, status=0xffff9930,
> >     poll_timeout=10, external_mutex=0x100d2ce0)
> >     at socket-collection-epoll.c:281
> > #2  0x1009bf24 in tcp_do_work (max_idle_time=10) at bmi-tcp.c:2681
> > #3 0x10098d10 in BMI_tcp_testcontext (incount=5, out_id_array=0x100d2b58,
> >     outcount=0xffff9be4, error_code_array=0x100d2b80,
> > actual_size_array=0x100d2b98, user_ptr_array=0x100d2bc0, max_idle_time=10,
> >     context_id=0) at bmi-tcp.c:1303
> > #4 0x1005aa18 in BMI_testcontext (incount=5, out_id_array=0x100d2b58,
> >     outcount=0x100d14cc, error_code_array=0x100d2b80,
> >     actual_size_array=0x100d2b98, user_ptr_array=0x100d2bc0,
> >     max_idle_time_ms=10, context_id=0) at bmi.c:944
> > #5  0x10071fc8 in bmi_thread_function (ptr=0x0) at thread-mgr.c:239
> > #6  0x10072e24 in PINT_thread_mgr_bmi_push (max_idle_time=10)
> >     at thread-mgr.c:815
> > #7 0x10071460 in do_one_work_cycle_all (idle_time_ms=10) at job.c:4661
> > #8  0x1007025c in job_testcontext (out_id_array_p=0xffff9d50,
> >     inout_count_p=0xffff9d38, returned_user_ptr_array=0xffffd550,
> > out_status_array_p=0xffffa550, timeout_ms=10, context_id=1) at job.c:4068
> > #9  0x1000fdb0 in PINT_client_state_machine_test (op_id=28,
> >     error_code=0xffffda1c) at client-state-machine.c:536
> > #10 0x1001041c in PINT_client_wait_internal (op_id=28,
> >     in_op_str=0x100ac1b8 "io", out_error=0xffffda1c,
> >     in_class_str=0x100a97d4 "sys") at client-state-machine.c:733
> > #11 0x10010734 in PVFS_sys_wait (op_id=28, in_op_str=0x100ac1b8 "io",
> >     out_error=0xffffda1c) at client-state-machine.c:861
> > #12 0x1001b78c in PVFS_sys_io (ref=
> >       {handle = 1048570, fs_id = 1957135728, __pad1 = -26176},
> >     file_req=0x100d07d8, file_req_offset=0, buffer=0x40068008,
> >     mem_req=0x100efbd0, credentials=0xffffe060, resp_p=0xffffda90,
> >     io_type=PVFS_IO_WRITE) at sys-io.sm:363
> > #13 0x1000b078 in generic_write (dest=0xffffddb0,
> >     buffer=0x40068008 "\177ELF\001\002\001", offset=0, count=2469777,
> >     credentials=0xffffe060) at pvfs2-cp.c:365
> > #14 0x1000a824 in main (argc=3, argv=0xffffe3b4) at pvfs2-cp.c:180
> >
> >
> > After breaking the program with Ctrl-C, the files appear to have been
> > created. Any clue where this could come from? It appears that the
> > metadata communication works but the data transfer does not.
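> >
> > To illustrate the pattern (the file names here are just examples):
> >
> >   pvfs2-mkdir /home/A40001/u72877927/florin/mnt/pvfs2/testdir  # works
> >   pvfs2-cp somefile \
> >     /home/A40001/u72877927/florin/mnt/pvfs2/somefile           # hangs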
> >
> > Below is the result of the ping command.
> >
> > Many thanks
> > Florin
> >
> > pvfs2-ping -m ~/florin/mnt/pvfs2/
> >
> > (1) Parsing tab file...
> >
> > (2) Initializing system interface...
> >
> > (3) Initializing each file system found in tab file:
> > /home/A40001/u72877927/florin/apps/etc/pvfs2tab...
> >
> >    PVFS2 servers: tcp://localhost:55555
> >    Storage name: pvfs2-fs
> >    Local mount point: /home/A40001/u72877927/florin/mnt/pvfs2
> >    /home/A40001/u72877927/florin/mnt/pvfs2: Ok
> >
> > (4) Searching for /home/A40001/u72877927/florin/mnt/pvfs2/ in pvfstab...
> >
> >    PVFS2 servers: tcp://localhost:55555
> >    Storage name: pvfs2-fs
> >    Local mount point: /home/A40001/u72877927/florin/mnt/pvfs2
> >
> >    meta servers:
> >    tcp://localhost:55555
> >
> >    data servers:
> >    tcp://localhost:55555
> >
> > (5) Verifying that all servers are responding...
> >
> >    meta servers:
> >    tcp://localhost:55555 Ok
> >
> >    data servers:
> >    tcp://localhost:55555 Ok
> >
> > (6) Verifying that fsid 1957135728 is acceptable to all servers...
> >
> >    Ok; all servers understand fs_id 1957135728
> >
> > (7) Verifying that root handle is owned by one server...
> >
> >    Root handle: 1048576
> >      Ok; root handle is owned by exactly one server.
> >
> > =============================================================
> >
> > The PVFS2 filesystem at /home/A40001/u72877927/florin/mnt/pvfs2/
> > appears to be correctly configured.

_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
