Hi Mikhail,
The default flow protocol used to check BMI_CHECK_MAXSIZE so that it
wouldn't try to send messages too large for the BMI method. I'm not
sure why it doesn't right now; that check may have been removed by
accident. I can see one of the other protocols (flowproto-dump-offsets,
which simply prints debugging information rather than moving data) still
does, which is ironic since it is the only flow protocol that doesn't
actually use BMI :)
At any rate, you have two options.
For a short-term fix, you can modify your server configuration file to
add a "FlowBufferSizeBytes" parameter to the <FileSystem> section, and
make sure that its value is no larger than the maximum message size
that your BMI method supports. That should be good enough to get you
back on track testing your code.
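As a rough illustration, the addition might look like the fragment
shown here. The value 8192 is only an assumption, taken from the 8 KB
limit you mention for bmi_m2; use whatever maximum your method really
reports, and leave the directives already in your <FileSystem> section
unchanged.

```
<FileSystem>
    FlowBufferSizeBytes 8192
</FileSystem>
```

The servers would need to be restarted to pick up the change.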
For a longer-term fix, you would want to modify flowproto-multiqueue.c
so that it performs the proper check and then either reduces the flow
buffer size accordingly or at least prints out a meaningful error
message. That would be a great patch to post back to the list in case
anyone else hits that problem in the future.
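To make the idea concrete, here is a minimal sketch of the clamping
logic, not the actual flowproto-multiqueue.c code. The function name
clamp_flow_buffer_size is hypothetical, and the sketch assumes the
caller has already obtained the method's maximum message size (in the
real code, presumably through the BMI_CHECK_MAXSIZE query discussed
above).

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper, not part of PVFS2.
 * configured: flow buffer size requested by the configuration, in bytes.
 * bmi_max:    largest message the BMI method accepts, in bytes
 *             (0 is treated here as "no usable limit reported").
 * Returns the buffer size the flow should actually use.
 */
size_t clamp_flow_buffer_size(size_t configured, size_t bmi_max)
{
    if (bmi_max == 0) {
        /* No limit to enforce; keep the configured size. */
        return configured;
    }
    if (configured > bmi_max) {
        /* Print a meaningful message instead of failing later on send. */
        fprintf(stderr,
                "flow buffer size %zu exceeds BMI maximum %zu; reducing\n",
                configured, bmi_max);
        return bmi_max;
    }
    return configured;
}
```

With a 256 KB configured buffer and an 8 KB method limit, this would
return 8192 and log the reduction once, rather than letting the send
fail with "Message too long".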
-Phil
On 11/23/2011 07:07 AM, Mikhail Gilmendinov wrote:
Hi.
I have a problem:
[mix@smart bin]$ trun ./pvfs2-cp ./pvfs2-cp /home/mix/pvfs2fs -n 4
[E 17:54:31.582929] mem_to_bmi_callback_fn: I/O error occurred
[E 17:54:31.583273] handle_io_error: flow proto error cleanup started
on 0x9d71604: Message too long
[E 17:54:31.583374] handle_io_error: flow proto 0x9d71604 canceled 0
operations, will clean up.
[E 17:54:31.583469] handle_io_error: flow proto 0x9d71604 error
cleanup finished: Message too long
[E 17:54:31.584654] mem_to_bmi_callback_fn: I/O error occurred
[E 17:54:31.584767] handle_io_error: flow proto error cleanup started
on 0x9d71cc0: Message too long
[E 17:54:31.584860] handle_io_error: flow proto 0x9d71cc0 canceled 0
operations, will clean up.
[E 17:54:31.584961] handle_io_error: flow proto 0x9d71cc0 error
cleanup finished: Message too long
^Ctask2: Program /home/mix/orfs/bin/pvfs2-cp exited with exitcode 255.
The server's reaction:
[mix@smart sbin]$ trun ./pvfs2-server ./fs.conf -d -n 0
task1: pvfs2-server started on nodes 0
[S 11/22/2011 20:49:32] PVFS2 Server on node torus0 version
2.8.4-orangefs starting...
[E 11/22/2011 20:49:32] BMI_initialize: j=0, ladr = m2://0, proto=m2:
bmi_m2
[S 11/22/2011 20:49:34] PVFS2 Server ready.
[E 11/22/2011 20:55:37] job_time_mgr_expire: job time out: cancelling
flow operation, job_id: 1000.
[E 11/22/2011 20:55:37] fp_multiqueue_cancel: flow proto cancel called
on 0x83c2a38
[E 11/22/2011 20:55:37] fp_multiqueue_cancel: I/O error occurred
[E 11/22/2011 20:55:37] handle_io_error: flow proto error cleanup
started on 0x83c2a38: Operation cancelled (possibly due to timeout)
[E 11/22/2011 20:55:37] handle_io_error: flow proto 0x83c2a38 canceled
1 operations, will clean up.
[mix@smart bin]$ trun ./pvfs2-ls -l /home/mix/pvfs2fs -n 4
task2: pvfs2-ls started on nodes 4
-rwxr-xr-x 1 mix mix 0 2011-11-22 20:53 pvfs2-cp
drwxrwxrwx 1 mix mix 4096 2011-11-21 17:24 lost+found
task2: Program /home/mix/orfs/bin/pvfs2-ls exited with exitcode 0
The file exists but it is empty.
When I run pvfs2-validate, one server crashes and the other servers don't
respond to further requests from the pvfs2 utilities:
[mix@smart bin]$ trun ./pvfs2-validate -d /home/mix/pvfs2fs -n 4
task2: pvfs2-validate started on nodes 4
^Ctask2: Program /home/mix/orfs/bin/pvfs2-validate exited with
exitcode 255. (pvfs2-validate also hangs)
The server that crashes:
[E 11/22/2011 20:57:55] Error: poorly formatted protocol message received.
[E 11/22/2011 20:57:55] Too small: message only 0 bytes.
[E 11/22/2011 20:57:55] msgpairarray decode error: Protocol error
[E 11/22/2011 20:57:55] PVFS2 server: signal 11, faulty address is
(nil), from 0x80c30ac
[E 11/22/2011 20:57:55] [bt] /home/mix/orfs/sbin/pvfs2-server [0x80c30ac]
[E 11/22/2011 20:57:55] [bt] /home/mix/orfs/sbin/pvfs2-server [0x80e1493]
[E 11/22/2011 20:57:55] [bt]
/home/mix/orfs/sbin/pvfs2-server(PINT_state_machine_invoke+0x12f)
[0x80de9b1]
[E 11/22/2011 20:57:55] [bt]
/home/mix/orfs/sbin/pvfs2-server(PINT_state_machine_next+0x23c)
[0x80ded83]
[E 11/22/2011 20:57:55] [bt]
/home/mix/orfs/sbin/pvfs2-server(PINT_state_machine_continue+0x18)
[0x80dedb7]
[E 11/22/2011 20:57:55] [bt]
/home/mix/orfs/sbin/pvfs2-server(main+0x665) [0x80586c9]
[E 11/22/2011 20:57:55] [bt] /lib/libc.so.6(__libc_start_main+0xe0)
[0xbb5390]
[E 11/22/2011 20:57:55] [bt] /home/mix/orfs/sbin/pvfs2-server [0x8057f01]
task0: Program /home/mix/orfs/sbin/pvfs2-server exited with exitcode 11.
Then I start the servers again, and pvfs2-validate doesn't complain about errors:
[mix@smart bin]$ trun ./pvfs2-ls -l /home/mix/pvfs2fs -n 4
task2: pvfs2-ls started on nodes 4
-rwxr-xr-x 1 mix mix 0 2011-11-22 20:53 pvfs2-cp
drwxrwxrwx 1 mix mix 4096 2011-11-21 17:24 lost+found
task2: Program /home/mix/orfs/bin/pvfs2-ls exited with exitcode 0.
The zero-sized file exists, and pvfs2-validate now accepts it.
This problem does not occur when I copy a small file:
[mix@smart bin]$ trun ./pvfs2-cp ./pvfs2tab /home/mix/pvfs2fs -n 4
task2: pvfs2-cp started on nodes 4
task2: Program /home/mix/orfs/bin/pvfs2-cp exited with exitcode 0.
and back
[mix@smart bin]$ trun ./pvfs2-cp /home/mix/pvfs2fs/pvfs2tab
/home/mix/p2tab -n 4
task2: pvfs2-cp started on nodes 4
task2: Program /home/mix/orfs/bin/pvfs2-cp exited with exitcode 0.
[mix@smart bin]$ trun ./pvfs2-ls -l /home/mix/pvfs2fs -n 4
task2: pvfs2-ls started on nodes 4
-rwxr-xr-x 1 mix mix 0 2011-11-22 20:53 pvfs2-cp
-rw-rw-r-- 1 mix mix 60 2011-11-22 21:04 pvfs2tab
drwxrwxrwx 1 mix mix 4096 2011-11-21 17:24 lost+found
task2: Program /home/mix/orfs/bin/pvfs2-ls exited with exitcode 0.
[mix@smart bin]$ diff ./pvfs2tab /home/mix/p2tab
[mix@smart bin]$ ls -l /home/mix/p2tab
-rw-rw-r-- 1 mix mix 60 2011-11-22 21:44 /home/mix/p2tab
There are no differences between the files.
It seems that pvfs2-cp tries to send the file in a single message, but
the maximum message size in my bmi_m2 method is 8 KB, and in the log I found
that BMI_post_send_list tries to send one buffer of 49216 bytes. Earlier,
it calls bmi_get_info with option 10 (BMI_GET_UNEXP_SIZE), sends an
unexpected message to the server, and receives a message from the server.
But it never calls bmi_get_info with option 3 (BMI_CHECK_MAXSIZE), and
bmi_post_send_list returns BMI_EMSGSIZE.
Is this a problem in pvfs2-cp, or must a BMI method support sending large
expected messages (10 MB, for instance)?
Thanks,
Mikhail Gilmendinov
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers