Hi guys, many thanks for your replies. Unfortunately, despite all my
attempts, the problems persist:
1) --disable-thread-safety does not compile
2) --enable-nptl-workaround does not create the file system correctly
3) Configured without any extra parameters and with debugging enabled,
it gets stuck for 5 minutes and then retries:
[D 15:38:53.732290] Posted PVFS_SYS_IO (waiting for test)
[E 15:43:53.314350] job_time_mgr_expire: job time out: cancelling bmi
operation, job_id: 29.
Murali, you mentioned something about live debugging; would it be
possible for you to do that? If so, just send me your public RSA/DSA
key and I will give you the machine details.
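For example (a sketch, assuming OpenSSH with default key paths):
  ssh-keygen -t rsa    # creates ~/.ssh/id_rsa and ~/.ssh/id_rsa.pub
Send me the id_rsa.pub file and I will append it to ~/.ssh/authorized_keys
on the machine.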
Many thanks
Florin
On 7/3/07, Phil Carns <[EMAIL PROTECTED]> wrote:
> Is it possible that a response just isn't making it back to the clients
> for some reason?
>
> If the client library can't find anything else to do, it will be normal
> for it to spend the majority of its time sleeping in either poll() or
> epoll() until some messages show up that it needs. It should give up
> eventually, but the job timeouts may be set rather high by default. It
> looks like the defaults are 300-second timeouts with 5 retries.
>
> You might find some more information by setting the PVFS2_DEBUGMASK
> environment variable to "network" before running one of the pvfs2-*
> utilities that hangs. If that doesn't indicate anything useful you
> could try setting it to "verbose" to get even more debugging output. In
> conjunction with this you might want to set ClientJobBMITimeoutSecs and
> ClientJobFlowTimeoutSecs to something lower (like 30 seconds) so you can
> see if the client times out and retries while you watch.
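>
> For example (a sketch; the paths are illustrative, and the fs
> configuration file location depends on your install):
>
>    export PVFS2_DEBUGMASK=network    # or "verbose" for more output
>    pvfs2-cp /etc/hosts /mnt/pvfs2/hosts
>
> and, in the fs configuration file, something like:
>
>    ClientJobBMITimeoutSecs 30
>    ClientJobFlowTimeoutSecs 30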
>
> -Phil
>
> Murali Vilayannur wrote:
> > Hi Florin,
> > Thanks for getting back on that!
> > This is quite weird. It probably points to some platform-specific
> > library issue.
> > Since we do use threads, perhaps it is time to retry running
> > configure with threads disabled and see if that helps?
> >
> > You can try ./configure --disable-thread-safety. Perhaps
> > ./configure --enable-nptl-workaround is also worth trying (not
> > together with the previous one, though) to work around glibc
> > oddities.
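> >
> > For example, a minimal sketch of the rebuild cycle (one flag at a
> > time; assuming the stock make targets):
> >
> >    make clean
> >    ./configure --disable-thread-safety   # or: --enable-nptl-workaround
> >    make && make install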
> > Sam, RobL, Pete, any ideas? I am lost.. :(
> > A final alternative is perhaps a live debugging session on your
> > machine, if possible..
> > thanks,
> > Murali
> >
> > On 7/2/07, Florin Isaila <[EMAIL PROTECTED]> wrote:
> >> Hi,
> >>
> >> Many thanks, Murali. I have just tried that, but it keeps getting
> >> stuck, with an even stranger stack trace:
> >>
> >> (gdb) bt
> >> #0 0x0ff4b2d0 in poll () from /lib/tls/libc.so.6
> >> #1 0x0ffc871c in ?? () from /lib/tls/libc.so.6
> >> #2 0x0ffc871c in ?? () from /lib/tls/libc.so.6
> >> Previous frame identical to this frame (corrupt stack?)
> >>
> >> Any other suggestions?
> >>
> >> Best regards
> >> Florin
> >>
> >> On 7/2/07, Murali Vilayannur <[EMAIL PROTECTED]> wrote:
> >> > Hi Florin,
> >> > Given that both your backtraces point to epoll(), can you run make
> >> > clean, rerun configure with --disable-epoll, rebuild everything,
> >> > and see if that works?
> >> > If it does work, it probably points to an epoll-specific bug on
> >> > ppc, either in pvfs2 or in the libepoll code..
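> >> >
> >> > For example (a sketch, assuming the stock make targets):
> >> >
> >> >    make clean
> >> >    ./configure --disable-epoll
> >> >    make && make install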
> >> > thanks,
> >> > Murali
> >> >
> >> > On 7/2/07, Florin Isaila <[EMAIL PROTECTED]> wrote:
> >> > > Hi,
> >> > >
> >> > > We have installed PVFS2 2.6.3 over Ethernet on a SUSE distribution,
> >> > > locally on a dual-processor (PowerPC 970FX) machine.
> >> > >
> >> > > Some commands like pvfs2-ping, pvfs2-mkdir, and pvfs2-ls (without
> >> > > parameters) work fine.
> >> > >
> >> > > But we cannot get some pvfs2-* commands to run. For instance,
> >> > > pvfs2-cp gets stuck. Here is the gdb backtrace:
> >> > >
> >> > > (gdb) bt
> >> > > #0  0x0ff5596c in epoll_wait () from /lib/tls/libc.so.6
> >> > > #1  0x100a062c in BMI_socket_collection_testglobal (scp=0x100e48b0,
> >> > >     incount=128, outcount=0xffff97b0, maps=0xffff93b0, status=0xffff95b0,
> >> > >     poll_timeout=10, external_mutex=0x100d2ce0)
> >> > >     at socket-collection-epoll.c:281
> >> > > #2  0x1009bf24 in tcp_do_work (max_idle_time=10) at bmi-tcp.c:2681
> >> > > #3  0x10098d10 in BMI_tcp_testcontext (incount=5, out_id_array=0x100d2b58,
> >> > >     outcount=0xffff9864, error_code_array=0x100d2b80,
> >> > >     actual_size_array=0x100d2b98, user_ptr_array=0x100d2bc0,
> >> > >     max_idle_time=10, context_id=0) at bmi-tcp.c:1303
> >> > > #4  0x1005aa18 in BMI_testcontext (incount=5, out_id_array=0x100d2b58,
> >> > >     outcount=0x100d14cc, error_code_array=0x100d2b80,
> >> > >     actual_size_array=0x100d2b98, user_ptr_array=0x100d2bc0,
> >> > >     max_idle_time_ms=10, context_id=0) at bmi.c:944
> >> > > #5  0x10071fc8 in bmi_thread_function (ptr=0x0) at thread-mgr.c:239
> >> > > #6  0x10072e24 in PINT_thread_mgr_bmi_push (max_idle_time=10)
> >> > >     at thread-mgr.c:815
> >> > > #7  0x10071460 in do_one_work_cycle_all (idle_time_ms=10) at job.c:4661
> >> > > #8  0x1007025c in job_testcontext (out_id_array_p=0xffff99d0,
> >> > >     inout_count_p=0xffff99b8, returned_user_ptr_array=0xffffd1d0,
> >> > >     out_status_array_p=0xffffa1d0, timeout_ms=10, context_id=1) at job.c:4068
> >> > > #9  0x1000fdb0 in PINT_client_state_machine_test (op_id=3,
> >> > >     error_code=0xffffd670) at client-state-machine.c:536
> >> > > #10 0x1001041c in PINT_client_wait_internal (op_id=3,
> >> > >     in_op_str=0x100b209c "fs_add", out_error=0xffffd670,
> >> > >     in_class_str=0x100a97d4 "sys") at client-state-machine.c:733
> >> > > #11 0x10010734 in PVFS_sys_wait (op_id=3, in_op_str=0x100b209c "fs_add",
> >> > >     out_error=0xffffd670) at client-state-machine.c:861
> >> > > #12 0x10035c4c in PVFS_sys_fs_add (mntent=0x100d3030) at fs-add.sm:205
> >> > > #13 0x1004c220 in PVFS_util_init_defaults () at pvfs2-util.c:1040
> >> > > #14 0x1000a5c8 in main (argc=3, argv=0xffffe3b4) at pvfs2-cp.c:135
> >> > >
> >> > > At other times (but rarely) it gets stuck at a different place:
> >> > >
> >> > > (gdb) bt
> >> > > #0  0x0ff5596c in epoll_wait () from /lib/tls/libc.so.6
> >> > > #1  0x100a062c in BMI_socket_collection_testglobal (scp=0x100e48b0,
> >> > >     incount=128, outcount=0xffff9b30, maps=0xffff9730, status=0xffff9930,
> >> > >     poll_timeout=10, external_mutex=0x100d2ce0)
> >> > >     at socket-collection-epoll.c:281
> >> > > #2  0x1009bf24 in tcp_do_work (max_idle_time=10) at bmi-tcp.c:2681
> >> > > #3  0x10098d10 in BMI_tcp_testcontext (incount=5, out_id_array=0x100d2b58,
> >> > >     outcount=0xffff9be4, error_code_array=0x100d2b80,
> >> > >     actual_size_array=0x100d2b98, user_ptr_array=0x100d2bc0,
> >> > >     max_idle_time=10, context_id=0) at bmi-tcp.c:1303
> >> > > #4  0x1005aa18 in BMI_testcontext (incount=5, out_id_array=0x100d2b58,
> >> > >     outcount=0x100d14cc, error_code_array=0x100d2b80,
> >> > >     actual_size_array=0x100d2b98, user_ptr_array=0x100d2bc0,
> >> > >     max_idle_time_ms=10, context_id=0) at bmi.c:944
> >> > > #5  0x10071fc8 in bmi_thread_function (ptr=0x0) at thread-mgr.c:239
> >> > > #6  0x10072e24 in PINT_thread_mgr_bmi_push (max_idle_time=10)
> >> > >     at thread-mgr.c:815
> >> > > #7  0x10071460 in do_one_work_cycle_all (idle_time_ms=10) at job.c:4661
> >> > > #8  0x1007025c in job_testcontext (out_id_array_p=0xffff9d50,
> >> > >     inout_count_p=0xffff9d38, returned_user_ptr_array=0xffffd550,
> >> > >     out_status_array_p=0xffffa550, timeout_ms=10, context_id=1) at job.c:4068
> >> > > #9  0x1000fdb0 in PINT_client_state_machine_test (op_id=28,
> >> > >     error_code=0xffffda1c) at client-state-machine.c:536
> >> > > #10 0x1001041c in PINT_client_wait_internal (op_id=28,
> >> > >     in_op_str=0x100ac1b8 "io", out_error=0xffffda1c,
> >> > >     in_class_str=0x100a97d4 "sys") at client-state-machine.c:733
> >> > > #11 0x10010734 in PVFS_sys_wait (op_id=28, in_op_str=0x100ac1b8 "io",
> >> > >     out_error=0xffffda1c) at client-state-machine.c:861
> >> > > #12 0x1001b78c in PVFS_sys_io (ref=
> >> > >     {handle = 1048570, fs_id = 1957135728, __pad1 = -26176},
> >> > >     file_req=0x100d07d8, file_req_offset=0, buffer=0x40068008,
> >> > >     mem_req=0x100efbd0, credentials=0xffffe060, resp_p=0xffffda90,
> >> > >     io_type=PVFS_IO_WRITE) at sys-io.sm:363
> >> > > #13 0x1000b078 in generic_write (dest=0xffffddb0,
> >> > >     buffer=0x40068008 "\177ELF\001\002\001", offset=0, count=2469777,
> >> > >     credentials=0xffffe060) at pvfs2-cp.c:365
> >> > > #14 0x1000a824 in main (argc=3, argv=0xffffe3b4) at pvfs2-cp.c:180
> >> > >
> >> > >
> >> > > After interrupting the program with Ctrl-C, the files appear to have
> >> > > been created. Any clue where this could come from? It looks as if the
> >> > > metadata communication works but the data communication does not.
> >> > >
> >> > > Below is the output of the ping command.
> >> > >
> >> > > Many thanks
> >> > > Florin
> >> > >
> >> > > pvfs2-ping -m ~/florin/mnt/pvfs2/
> >> > >
> >> > > (1) Parsing tab file...
> >> > >
> >> > > (2) Initializing system interface...
> >> > >
> >> > > (3) Initializing each file system found in tab file:
> >> > > /home/A40001/u72877927/florin/apps/etc/pvfs2tab...
> >> > >
> >> > > PVFS2 servers: tcp://localhost:55555
> >> > > Storage name: pvfs2-fs
> >> > > Local mount point: /home/A40001/u72877927/florin/mnt/pvfs2
> >> > > /home/A40001/u72877927/florin/mnt/pvfs2: Ok
> >> > >
> >> > > (4) Searching for /home/A40001/u72877927/florin/mnt/pvfs2/ in pvfstab...
> >> > >
> >> > > PVFS2 servers: tcp://localhost:55555
> >> > > Storage name: pvfs2-fs
> >> > > Local mount point: /home/A40001/u72877927/florin/mnt/pvfs2
> >> > >
> >> > > meta servers:
> >> > > tcp://localhost:55555
> >> > >
> >> > > data servers:
> >> > > tcp://localhost:55555
> >> > >
> >> > > (5) Verifying that all servers are responding...
> >> > >
> >> > > meta servers:
> >> > > tcp://localhost:55555 Ok
> >> > >
> >> > > data servers:
> >> > > tcp://localhost:55555 Ok
> >> > >
> >> > > (6) Verifying that fsid 1957135728 is acceptable to all servers...
> >> > >
> >> > > Ok; all servers understand fs_id 1957135728
> >> > >
> >> > > (7) Verifying that root handle is owned by one server...
> >> > >
> >> > > Root handle: 1048576
> >> > > Ok; root handle is owned by exactly one server.
> >> > >
> >> > > =============================================================
> >> > >
> >> > > The PVFS2 filesystem at /home/A40001/u72877927/florin/mnt/pvfs2/
> >> > > appears to be correctly configured.