Hey guys,

We have a script that is giving PVFS some trouble. It starts several
processes that divide up a file and read different chunks of it. The failure
needs at least two clients accessing the file, and it seems to happen more
often with more processes participating, but not necessarily with more
nodes. As a reference point, with 8 clients running 2 processes each I only
saw 2 errors in 3 or 4 days; with 2 clients running 8 processes each I see 3
or 4 errors per hour. Each run lasts about 5 minutes.
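
For reference, the access pattern boils down to something like the sketch
below. This is not the actual script; the mount point, file name, chunk
size, and process count are placeholders I picked for illustration. Each
forked reader pread()s its own disjoint chunk of one shared file on the
PVFS mount:

#define _XOPEN_SOURCE 500   /* for pread() */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

#define NPROCS     8                    /* readers per client       */
#define CHUNK_SIZE (64L * 1024 * 1024)  /* bytes handled per reader */

int main(void)
{
    /* Fork NPROCS readers; reader i pread()s chunk i of the shared file. */
    for (int i = 0; i < NPROCS; i++) {
        if (fork() == 0) {
            int fd = open("/mnt/pvfs2/testfile", O_RDONLY);
            if (fd < 0) { perror("open"); _exit(1); }

            char *buf = malloc(CHUNK_SIZE);
            if (!buf) { perror("malloc"); _exit(1); }

            off_t offset = (off_t)i * CHUNK_SIZE;
            if (pread(fd, buf, CHUNK_SIZE, offset) < 0)
                perror("pread");

            free(buf);
            close(fd);
            _exit(0);
        }
    }
    while (wait(NULL) > 0)
        ;                               /* wait for all readers */
    return 0;
}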

I get what appear to be two separate errors, both on the client side. The
servers never log anything during these tests. The error that actually
causes a failure leaves the following messages from each process:

Jul 13 09:00:31 client3 PVFS2: [E] fp_multiqueue_cancel: flow proto cancel called on 0x9a0b420
Jul 13 09:00:31 client3 PVFS2: [E] handle_io_error: flow proto error cleanup started on 0x9a0b420, error_code: -1610613121
Jul 13 09:00:31 client3 PVFS2: [E] handle_io_error: flow proto 0x9a0b420 canceled 1 operations, will clean up.

The script searches for a newline at a specific offset in the file, and it
appears to be reading bad data, because the newline is not found where it
should be.
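
In case it helps, the check that trips is roughly equivalent to the sketch
below; the path, offset, and window size are placeholders, not the script's
actual values:

#define _XOPEN_SOURCE 500   /* for pread() */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Return 1 if a newline shows up in a small window read at
 * expected_offset, 0 if not, -1 on I/O error. */
static int newline_at(const char *path, off_t expected_offset)
{
    char window[256];
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return -1; }

    ssize_t n = pread(fd, window, sizeof(window), expected_offset);
    close(fd);
    if (n <= 0) { perror("pread"); return -1; }

    /* When the failure hits, this returns 0 even though the file on
     * disk really does contain a newline in this window. */
    return memchr(window, '\n', (size_t)n) != NULL;
}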

The other error, which occurs less frequently, leaves a core file from
pvfs2-client-core and logs these messages:

Jul 13 09:19:49 client3 PVFS2: [E] Error: payload_progress: Invalid argument
Jul 13 09:19:49 client3 PVFS2: [E] handle_io_error: flow proto error cleanup started on 0x9a0b420, error_code: -1073741967
Jul 13 09:19:49 client3 PVFS2: [E] handle_io_error: flow proto 0x9a0b420 canceled 0 operations, will clean up.
Jul 13 09:19:49 client3 PVFS2: [E] handle_io_error: flow proto 0x9a0b420 error cleanup finished, error_code: -1073741967
Jul 13 09:19:49 client3 PVFS2: [E] handle_io_error: flow proto error cleanup started on 0x9a0bab0, error_code: -1073741967
Jul 13 09:19:49 client3 PVFS2: [E] handle_io_error: flow proto 0x9a0bab0 canceled 0 operations, will clean up.
Jul 13 09:19:49 client3 PVFS2: [E] handle_io_error: flow proto 0x9a0bab0 error cleanup finished, error_code: -1073741967

The backtrace from the core file is as follows:

(gdb) bt
#0  0x0021eeff in raise () from /lib/tls/libc.so.6
#1  0x00220705 in abort () from /lib/tls/libc.so.6
#2  0x00218619 in __assert_fail () from /lib/tls/libc.so.6
#3  0x00536edd in bmi_to_mem_callback_wrapper (user_ptr=0x99fab88, actual_size=65536, error_code=0)
    at ../pvfs2_src/src/io/flow/flowproto-bmi-trove/flowproto-multiqueue.c:260
#4  0x00540300 in bmi_thread_function (ptr=0x0) at ../pvfs2_src/src/io/job/thread-mgr.c:276
#5  0x00540cec in PINT_thread_mgr_bmi_push (max_idle_time=10) at ../pvfs2_src/src/io/job/thread-mgr.c:814
#6  0x0053f944 in do_one_work_cycle_all (idle_time_ms=10) at ../pvfs2_src/src/io/job/job.c:4730
#7  0x0053eae5 in job_testcontext (out_id_array_p=0xbfffb500, inout_count_p=0xbfffbd10, returned_user_ptr_array=0xbfff8100, out_status_array_p=0xbfff8500, timeout_ms=10, context_id=1)
    at ../pvfs2_src/src/io/job/job.c:4137
#8  0x0056246f in PINT_client_state_machine_testsome (op_id_array=0xbfffc0a0, op_count=0xbfffc3a8, user_ptr_array=0xbfffc2a0, error_code_array=0xbfffbfa0, timeout_ms=10)
    at ../pvfs2_src/src/client/sysint/client-state-machine.c:654
#9  0x00562887 in PVFS_sys_testsome (op_id_array=0xbfffc0a0, op_count=0xbfffc3a8, user_ptr_array=0xbfffc2a0, error_code_array=0xbfffbfa0, timeout_ms=10)
    at ../pvfs2_src/src/client/sysint/client-state-machine.c:862
#10 0x0804e56e in process_vfs_requests () at ../pvfs2_src/src/apps/kernel/linux/pvfs2-client-core.c:2932
#11 0x0804f2b0 in main (argc=13, argv=0xbfffc494) at ../pvfs2_src/src/apps/kernel/linux/pvfs2-client-core.c:3313

I have reproduced this on both 2.4 and 2.6 kernels against a PVFS 2.6 file
system, but I haven't been able to reproduce it on a 2.8 file system. Do the
log messages and backtrace point to anything that changed between 2.6 and
2.8 that we could turn into a patch? I can provide the core files if needed.

Thanks,
Bart.