Hi guys, many thanks for your replies. Unfortunately, despite all my
attempts, the problems persist:
1) --disable-thread-safety does not compile
2) --enable-nptl-workaround does not create the file system correctly
3) Configured without any extra parameters and with debugging enabled,
it gets stuck for 5 minutes and then retries:
[D 15:38:53.732290] Posted PVFS_SYS_IO (waiting for test)
[E 15:43:53.314350] job_time_mgr_expire: job time out: cancelling bmi
operation, job_id: 29.
Murali, you mentioned something about live debugging; would it be
possible for you to do that? If so, just send me your public RSA/DSA
key and I will give you the machine details.
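For example (a sketch, assuming OpenSSH with default key paths):
  ssh-keygen -t rsa    # creates ~/.ssh/id_rsa and ~/.ssh/id_rsa.pub
Send me the id_rsa.pub file and I will append it to ~/.ssh/authorized_keys
on the machine.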
Many thanks
Florin
On 7/3/07, Phil Carns <[EMAIL PROTECTED]> wrote:
> Is it possible that a response just isn't making it back to the clients
> for some reason?
>
> If the client library can't find anything else to do, it will be normal
> for it to spend the majority of its time sleeping in either poll() or
> epoll() until some messages show up that it needs. It should give up
> eventually, but the job timeouts may be set rather high by default. It
> looks like the defaults are 300-second timeouts with 5 retries.
>
> You might find some more information by setting the PVFS2_DEBUGMASK
> environment variable to "network" before running one of the pvfs2-*
> utilities that hangs. If that doesn't indicate anything useful you
> could try setting it to "verbose" to get even more debugging output. In
> conjunction with this you might want to set ClientJobBMITimeoutSecs and
> ClientJobFlowTimeoutSecs to something lower (like 30 seconds) so you can
> see if the client times out and retries while you watch.
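>
> For example (a sketch; the paths are illustrative, and the fs
> configuration file location depends on your install):
>
>    export PVFS2_DEBUGMASK=network    # or "verbose" for more output
>    pvfs2-cp /etc/hosts /mnt/pvfs2/hosts
>
> and, in the fs configuration file, something like:
>
>    ClientJobBMITimeoutSecs 30
>    ClientJobFlowTimeoutSecs 30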
>
> -Phil
>
> Murali Vilayannur wrote:
> > Hi Florin,
> > Thanks for getting back on that!
> > This is quite weird. It probably points to some platform-specific
> > library issue.
> > Since we do use threads, perhaps it is time to retry running
> > configure with threads disabled and see if that helps?
> >
> > You can try ./configure --disable-thread-safety. Perhaps
> > ./configure --enable-nptl-workaround is also worth trying (not
> > together with the previous one, though) to work around glibc
> > oddities.
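> >
> > For example, a minimal sketch of the rebuild cycle (one flag at a
> > time; assuming the stock make targets):
> >
> >    make clean
> >    ./configure --disable-thread-safety   # or: --enable-nptl-workaround
> >    make && make install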
> > Sam, RobL, Pete, any ideas? I am lost.. :(
> > A final alternative is perhaps a live debugging session on your
> > machine, if possible..
> > thanks,
> > Murali
> >
> > On 7/2/07, Florin Isaila <[EMAIL PROTECTED]> wrote:
> >> Hi,
> >>
> >> Many thanks, Murali. I have just tried that, but it keeps getting
> >> stuck, with an even stranger stack trace:
> >>
> >> (gdb) bt
> >> #0 0x0ff4b2d0 in poll () from /lib/tls/libc.so.6
> >> #1 0x0ffc871c in ?? () from /lib/tls/libc.so.6
> >> #2 0x0ffc871c in ?? () from /lib/tls/libc.so.6
> >> Previous frame identical to this frame (corrupt stack?)
> >>
> >> Any other suggestions?
> >>
> >> Best regards
> >> Florin
> >>
> >> On 7/2/07, Murali Vilayannur <[EMAIL PROTECTED]> wrote:
> >> > Hi Florin,
> >> > Given that both your backtraces point to epoll(), can you run make
> >> > clean, rerun configure with --disable-epoll, rebuild everything,
> >> > and see if that works?
> >> > If it does work, it probably points to an epoll-specific bug on
> >> > ppc, either in pvfs2 or in the libepoll code..
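> >> >
> >> > For example (a sketch, assuming the stock make targets):
> >> >
> >> >    make clean
> >> >    ./configure --disable-epoll
> >> >    make && make install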
> >> > thanks,
> >> > Murali
> >> >
> >> > On 7/2/07, Florin Isaila <[EMAIL PROTECTED]> wrote:
> >> > > Hi,
> >> > >
> >> > > We have installed PVFS2 2.6.3 over Ethernet on a SUSE distribution,
> >> > > locally on a dual-processor (PowerPC 970FX) machine.
> >> > >
> >> > > Some commands like pvfs2-ping, pvfs2-mkdir, and pvfs2-ls (without
> >> > > parameters) work fine.
> >> > >
> >> > > But we cannot get some pvfs2-* commands to run. For instance,
> >> > > pvfs2-cp gets stuck. Here is the gdb backtrace:
> >> > >
> >> > > (gdb) bt
> >> > > #0  0x0ff5596c in epoll_wait () from /lib/tls/libc.so.6
> >> > > #1  0x100a062c in BMI_socket_collection_testglobal (scp=0x100e48b0,
> >> > >     incount=128, outcount=0xffff97b0, maps=0xffff93b0, status=0xffff95b0,
> >> > >     poll_timeout=10, external_mutex=0x100d2ce0)
> >> > >     at socket-collection-epoll.c:281
> >> > > #2  0x1009bf24 in tcp_do_work (max_idle_time=10) at bmi-tcp.c:2681
> >> > > #3  0x10098d10 in BMI_tcp_testcontext (incount=5, out_id_array=0x100d2b58,
> >> > >     outcount=0xffff9864, error_code_array=0x100d2b80,
> >> > >     actual_size_array=0x100d2b98, user_ptr_array=0x100d2bc0,
> >> > >     max_idle_time=10, context_id=0) at bmi-tcp.c:1303
> >> > > #4  0x1005aa18 in BMI_testcontext (incount=5, out_id_array=0x100d2b58,
> >> > >     outcount=0x100d14cc, error_code_array=0x100d2b80,
> >> > >     actual_size_array=0x100d2b98, user_ptr_array=0x100d2bc0,
> >> > >     max_idle_time_ms=10, context_id=0) at bmi.c:944
> >> > > #5  0x10071fc8 in bmi_thread_function (ptr=0x0) at thread-mgr.c:239
> >> > > #6  0x10072e24 in PINT_thread_mgr_bmi_push (max_idle_time=10)
> >> > >     at thread-mgr.c:815
> >> > > #7  0x10071460 in do_one_work_cycle_all (idle_time_ms=10) at job.c:4661
> >> > > #8  0x1007025c in job_testcontext (out_id_array_p=0xffff99d0,
> >> > >     inout_count_p=0xffff99b8, returned_user_ptr_array=0xffffd1d0,
> >> > >     out_status_array_p=0xffffa1d0, timeout_ms=10, context_id=1) at job.c:4068
> >> > > #9  0x1000fdb0 in PINT_client_state_machine_test (op_id=3,
> >> > >     error_code=0xffffd670) at client-state-machine.c:536
> >> > > #10 0x1001041c in PINT_client_wait_internal (op_id=3,
> >> > >     in_op_str=0x100b209c "fs_add", out_error=0xffffd670,
> >> > >     in_class_str=0x100a97d4 "sys") at client-state-machine.c:733
> >> > > #11 0x10010734 in PVFS_sys_wait (op_id=3, in_op_str=0x100b209c "fs_add",
> >> > >     out_error=0xffffd670) at client-state-machine.c:861
> >> > > #12 0x10035c4c in PVFS_sys_fs_add (mntent=0x100d3030) at fs-add.sm:205
> >> > > #13 0x1004c220 in PVFS_util_init_defaults () at pvfs2-util.c:1040
> >> > > #14 0x1000a5c8 in main (argc=3, argv=0xffffe3b4) at pvfs2-cp.c:135
> >> > >
> >> > > At other times (but rarely) it gets stuck at a different place:
> >> > >
> >> > > (gdb) bt
> >> > > #0  0x0ff5596c in epoll_wait () from /lib/tls/libc.so.6
> >> > > #1  0x100a062c in BMI_socket_collection_testglobal (scp=0x100e48b0,
> >> > >     incount=128, outcount=0xffff9b30, maps=0xffff9730, status=0xffff9930,
> >> > >     poll_timeout=10, external_mutex=0x100d2ce0)
> >> > >     at socket-collection-epoll.c:281
> >> > > #2  0x1009bf24 in tcp_do_work (max_idle_time=10) at bmi-tcp.c:2681
> >> > > #3  0x10098d10 in BMI_tcp_testcontext (incount=5, out_id_array=0x100d2b58,
> >> > >     outcount=0xffff9be4, error_code_array=0x100d2b80,
> >> > >     actual_size_array=0x100d2b98, user_ptr_array=0x100d2bc0,
> >> > >     max_idle_time=10, context_id=0) at bmi-tcp.c:1303
> >> > > #4  0x1005aa18 in BMI_testcontext (incount=5, out_id_array=0x100d2b58,
> >> > >     outcount=0x100d14cc, error_code_array=0x100d2b80,
> >> > >     actual_size_array=0x100d2b98, user_ptr_array=0x100d2bc0,
> >> > >     max_idle_time_ms=10, context_id=0) at bmi.c:944
> >> > > #5  0x10071fc8 in bmi_thread_function (ptr=0x0) at thread-mgr.c:239
> >> > > #6  0x10072e24 in PINT_thread_mgr_bmi_push (max_idle_time=10)
> >> > >     at thread-mgr.c:815
> >> > > #7  0x10071460 in do_one_work_cycle_all (idle_time_ms=10) at job.c:4661
> >> > > #8  0x1007025c in job_testcontext (out_id_array_p=0xffff9d50,
> >> > >     inout_count_p=0xffff9d38, returned_user_ptr_array=0xffffd550,
> >> > >     out_status_array_p=0xffffa550, timeout_ms=10, context_id=1) at job.c:4068
> >> > > #9  0x1000fdb0 in PINT_client_state_machine_test (op_id=28,
> >> > >     error_code=0xffffda1c) at client-state-machine.c:536
> >> > > #10 0x1001041c in PINT_client_wait_internal (op_id=28,
> >> > >     in_op_str=0x100ac1b8 "io", out_error=0xffffda1c,
> >> > >     in_class_str=0x100a97d4 "sys") at client-state-machine.c:733
> >> > > #11 0x10010734 in PVFS_sys_wait (op_id=28, in_op_str=0x100ac1b8 "io",
> >> > >     out_error=0xffffda1c) at client-state-machine.c:861
> >> > > #12 0x1001b78c in PVFS_sys_io (ref=
> >> > >     {handle = 1048570, fs_id = 1957135728, __pad1 = -26176},
> >> > >     file_req=0x100d07d8, file_req_offset=0, buffer=0x40068008,
> >> > >     mem_req=0x100efbd0, credentials=0xffffe060, resp_p=0xffffda90,
> >> > >     io_type=PVFS_IO_WRITE) at sys-io.sm:363
> >> > > #13 0x1000b078 in generic_write (dest=0xffffddb0,
> >> > >     buffer=0x40068008 "\177ELF\001\002\001", offset=0, count=2469777,
> >> > >     credentials=0xffffe060) at pvfs2-cp.c:365
> >> > > #14 0x1000a824 in main (argc=3, argv=0xffffe3b4) at pvfs2-cp.c:180
> >> > >
> >> > >
> >> > > After interrupting the program with Ctrl-C, the files appear to have
> >> > > been created. Any clue where this could come from? It looks as if the
> >> > > metadata communication works but the data communication does not.
> >> > >
> >> > > Below is the output of the ping command.
> >> > >
> >> > > Many thanks
> >> > > Florin
> >> > >
> >> > > pvfs2-ping -m ~/florin/mnt/pvfs2/
> >> > >
> >> > > (1) Parsing tab file...
> >> > >
> >> > > (2) Initializing system interface...
> >> > >
> >> > > (3) Initializing each file system found in tab file:
> >> > > /home/A40001/u72877927/florin/apps/etc/pvfs2tab...
> >> > >
> >> > > PVFS2 servers: tcp://localhost:55555
> >> > > Storage name: pvfs2-fs
> >> > > Local mount point: /home/A40001/u72877927/florin/mnt/pvfs2
> >> > > /home/A40001/u72877927/florin/mnt/pvfs2: Ok
> >> > >
> >> > > (4) Searching for /home/A40001/u72877927/florin/mnt/pvfs2/ in pvfstab...
> >> > >
> >> > > PVFS2 servers: tcp://localhost:55555
> >> > > Storage name: pvfs2-fs
> >> > > Local mount point: /home/A40001/u72877927/florin/mnt/pvfs2
> >> > >
> >> > > meta servers:
> >> > > tcp://localhost:55555
> >> > >
> >> > > data servers:
> >> > > tcp://localhost:55555
> >> > >
> >> > > (5) Verifying that all servers are responding...
> >> > >
> >> > > meta servers:
> >> > > tcp://localhost:55555 Ok
> >> > >
> >> > > data servers:
> >> > > tcp://localhost:55555 Ok
> >> > >
> >> > > (6) Verifying that fsid 1957135728 is acceptable to all servers...
> >> > >
> >> > > Ok; all servers understand fs_id 1957135728
> >> > >
> >> > > (7) Verifying that root handle is owned by one server...
> >> > >
> >> > > Root handle: 1048576
> >> > > Ok; root handle is owned by exactly one server.
> >> > >
> >> > > =============================================================
> >> > >
> >> > > The PVFS2 filesystem at /home/A40001/u72877927/florin/mnt/pvfs2/
> >> > > appears to be correctly configured.