Hi Florin,
Given that both of your backtraces point to epoll_wait(), can you run
make clean, re-run configure with --disable-epoll, rebuild everything,
and see if that works?
If it does work, that probably points to an epoll-specific bug on ppc,
either in pvfs2 or in the libepoll code.
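For reference, the rebuild sequence would be roughly the following, run
from the top of the PVFS2 source tree. This is only a sketch: add back
whatever other configure options (prefix, etc.) you used when you built
it originally.

    make clean
    ./configure --disable-epoll
    make
    make install

That should make BMI fall back to the plain poll()-based socket
collection instead of the epoll one.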
thanks,
Murali

On 7/2/07, Florin Isaila <[EMAIL PROTECTED]> wrote:
Hi,

We have installed PVFS2 2.6.3 over Ethernet on a SUSE distribution,
locally on a dual-processor (PowerPC 970FX) machine.

Some commands, such as pvfs2-ping, pvfs2-mkdir, and pvfs2-ls (without
parameters), work fine.

But we cannot get some of the pvfs2-* commands to run. For instance,
pvfs2-cp gets stuck. Here is the gdb backtrace:

(gdb) bt
#0  0x0ff5596c in epoll_wait () from /lib/tls/libc.so.6
#1  0x100a062c in BMI_socket_collection_testglobal (scp=0x100e48b0,
    incount=128, outcount=0xffff97b0, maps=0xffff93b0, status=0xffff95b0,
    poll_timeout=10, external_mutex=0x100d2ce0)
    at socket-collection-epoll.c:281
#2  0x1009bf24 in tcp_do_work (max_idle_time=10) at bmi-tcp.c:2681
#3  0x10098d10 in BMI_tcp_testcontext (incount=5, out_id_array=0x100d2b58,
    outcount=0xffff9864, error_code_array=0x100d2b80,
    actual_size_array=0x100d2b98, user_ptr_array=0x100d2bc0, max_idle_time=10,
    context_id=0) at bmi-tcp.c:1303
#4  0x1005aa18 in BMI_testcontext (incount=5, out_id_array=0x100d2b58,
    outcount=0x100d14cc, error_code_array=0x100d2b80,
    actual_size_array=0x100d2b98, user_ptr_array=0x100d2bc0,
    max_idle_time_ms=10, context_id=0) at bmi.c:944
#5  0x10071fc8 in bmi_thread_function (ptr=0x0) at thread-mgr.c:239
#6  0x10072e24 in PINT_thread_mgr_bmi_push (max_idle_time=10)
    at thread-mgr.c:815
#7  0x10071460 in do_one_work_cycle_all (idle_time_ms=10) at job.c:4661
#8  0x1007025c in job_testcontext (out_id_array_p=0xffff99d0,
    inout_count_p=0xffff99b8, returned_user_ptr_array=0xffffd1d0,
    out_status_array_p=0xffffa1d0, timeout_ms=10, context_id=1) at job.c:4068
#9  0x1000fdb0 in PINT_client_state_machine_test (op_id=3,
    error_code=0xffffd670) at client-state-machine.c:536
#10 0x1001041c in PINT_client_wait_internal (op_id=3,
    in_op_str=0x100b209c "fs_add", out_error=0xffffd670,
    in_class_str=0x100a97d4 "sys") at client-state-machine.c:733
#11 0x10010734 in PVFS_sys_wait (op_id=3, in_op_str=0x100b209c "fs_add",
    out_error=0xffffd670) at client-state-machine.c:861
#12 0x10035c4c in PVFS_sys_fs_add (mntent=0x100d3030) at fs-add.sm:205
#13 0x1004c220 in PVFS_util_init_defaults () at pvfs2-util.c:1040
#14 0x1000a5c8 in main (argc=3, argv=0xffffe3b4) at pvfs2-cp.c:135

Other times (but rarely) it gets stuck in a different place:

(gdb) bt
#0  0x0ff5596c in epoll_wait () from /lib/tls/libc.so.6
#1  0x100a062c in BMI_socket_collection_testglobal (scp=0x100e48b0,
    incount=128, outcount=0xffff9b30, maps=0xffff9730, status=0xffff9930,
    poll_timeout=10, external_mutex=0x100d2ce0)
    at socket-collection-epoll.c:281
#2  0x1009bf24 in tcp_do_work (max_idle_time=10) at bmi-tcp.c:2681
#3  0x10098d10 in BMI_tcp_testcontext (incount=5, out_id_array=0x100d2b58,
    outcount=0xffff9be4, error_code_array=0x100d2b80,
    actual_size_array=0x100d2b98, user_ptr_array=0x100d2bc0, max_idle_time=10,
    context_id=0) at bmi-tcp.c:1303
#4  0x1005aa18 in BMI_testcontext (incount=5, out_id_array=0x100d2b58,
    outcount=0x100d14cc, error_code_array=0x100d2b80,
    actual_size_array=0x100d2b98, user_ptr_array=0x100d2bc0,
    max_idle_time_ms=10, context_id=0) at bmi.c:944
#5  0x10071fc8 in bmi_thread_function (ptr=0x0) at thread-mgr.c:239
#6  0x10072e24 in PINT_thread_mgr_bmi_push (max_idle_time=10)
    at thread-mgr.c:815
#7  0x10071460 in do_one_work_cycle_all (idle_time_ms=10) at job.c:4661
#8  0x1007025c in job_testcontext (out_id_array_p=0xffff9d50,
    inout_count_p=0xffff9d38, returned_user_ptr_array=0xffffd550,
    out_status_array_p=0xffffa550, timeout_ms=10, context_id=1) at job.c:4068
#9  0x1000fdb0 in PINT_client_state_machine_test (op_id=28,
    error_code=0xffffda1c) at client-state-machine.c:536
#10 0x1001041c in PINT_client_wait_internal (op_id=28,
    in_op_str=0x100ac1b8 "io", out_error=0xffffda1c,
    in_class_str=0x100a97d4 "sys") at client-state-machine.c:733
#11 0x10010734 in PVFS_sys_wait (op_id=28, in_op_str=0x100ac1b8 "io",
    out_error=0xffffda1c) at client-state-machine.c:861
#12 0x1001b78c in PVFS_sys_io (ref=
      {handle = 1048570, fs_id = 1957135728, __pad1 = -26176},
    file_req=0x100d07d8, file_req_offset=0, buffer=0x40068008,
    mem_req=0x100efbd0, credentials=0xffffe060, resp_p=0xffffda90,
    io_type=PVFS_IO_WRITE) at sys-io.sm:363
#13 0x1000b078 in generic_write (dest=0xffffddb0,
    buffer=0x40068008 "\177ELF\001\002\001", offset=0, count=2469777,
    credentials=0xffffe060) at pvfs2-cp.c:365
#14 0x1000a824 in main (argc=3, argv=0xffffe3b4) at pvfs2-cp.c:180


After interrupting the program with Ctrl-C, the files appear to have
been created. Any clue where this could come from? It looks as though
the metadata communication works but the data transfer does not.

Below is the output of the pvfs2-ping command.

Many thanks
Florin

pvfs2-ping -m ~/florin/mnt/pvfs2/

(1) Parsing tab file...

(2) Initializing system interface...

(3) Initializing each file system found in tab file:
/home/A40001/u72877927/florin/apps/etc/pvfs2tab...

   PVFS2 servers: tcp://localhost:55555
   Storage name: pvfs2-fs
   Local mount point: /home/A40001/u72877927/florin/mnt/pvfs2
   /home/A40001/u72877927/florin/mnt/pvfs2: Ok

(4) Searching for /home/A40001/u72877927/florin/mnt/pvfs2/ in pvfstab...

   PVFS2 servers: tcp://localhost:55555
   Storage name: pvfs2-fs
   Local mount point: /home/A40001/u72877927/florin/mnt/pvfs2

   meta servers:
   tcp://localhost:55555

   data servers:
   tcp://localhost:55555

(5) Verifying that all servers are responding...

   meta servers:
   tcp://localhost:55555 Ok

   data servers:
   tcp://localhost:55555 Ok

(6) Verifying that fsid 1957135728 is acceptable to all servers...

   Ok; all servers understand fs_id 1957135728

(7) Verifying that root handle is owned by one server...

   Root handle: 1048576
     Ok; root handle is owned by exactly one server.

=============================================================

The PVFS2 filesystem at /home/A40001/u72877927/florin/mnt/pvfs2/
appears to be correctly configured.
_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
