Oh, and as a side note, will your protocol eventually support sizes larger than 8kb? You probably know this already, but 8kb won't be a very efficient access size for most hard drives, regardless of your network characteristics.

-Phil

On 11/23/2011 03:25 PM, Phil Carns wrote:
Hi Mikhail,

The default flow protocol used to check BMI_CHECK_MAXSIZE so that it wouldn't try to send messages too large for the BMI method. I'm not sure why it doesn't right now; that check may have been removed by accident. I can see one of the other protocols (flowproto-dump-offsets, which simply prints debugging information rather than moving data) still does, which is ironic since it is the only flow protocol that doesn't actually use BMI :)

At any rate, you have two options.

For a short term fix, you can modify your server configuration file to add a "FlowBufferSizeBytes" parameter to the <FileSystem> section, and make sure that its value is set to something within the maximum size that your BMI protocol supports. That should be good enough to get you back on track testing your code.
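
As a rough illustration of that short-term fix, the fragment below assumes the BMI method's limit is 8 KB (8192 bytes), the figure given for bmi_m2 in the original report. The value is an assumption to adjust for your method, and every other option already in the <FileSystem> section stays as it is.

```
<FileSystem>
    # Existing FileSystem options stay unchanged.
    # 8192 is an assumed limit taken from the bmi_m2 report; use your method's maximum.
    FlowBufferSizeBytes 8192
</FileSystem>
```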

For a longer term fix, you would want to modify flowproto-multiqueue.c so that it performs the proper check and then either reduces the flow buffer size accordingly or at least prints out a meaningful error message. That would be a great patch to post back to the list in case anyone else hits that problem in the future.
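
The clamping logic such a patch would need can be sketched as below. This is a standalone illustration rather than code from flowproto-multiqueue.c: clamp_flow_buffer_size is a hypothetical helper, and its max_msg_size argument stands in for whatever value a BMI_CHECK_MAXSIZE query would report for the method in use.

```c
#include <stddef.h>

/*
 * Hypothetical helper, not part of PVFS2.
 *
 * requested    - configured flow buffer size in bytes
 * max_msg_size - largest message the BMI method accepts (assumed to come
 *                from a BMI_CHECK_MAXSIZE query); <= 0 means "unknown"
 *
 * Returns the buffer size to use. If the limit is unknown, the configured
 * value is kept unchanged.
 */
size_t clamp_flow_buffer_size(size_t requested, long max_msg_size)
{
    if (max_msg_size <= 0) {
        return requested;               /* no usable limit reported */
    }
    if (requested > (size_t)max_msg_size) {
        return (size_t)max_msg_size;    /* shrink to what the method can carry */
    }
    return requested;                   /* already within the limit */
}
```

With the numbers from the report, clamp_flow_buffer_size(49216, 8192) yields 8192. A real patch would probably also log a message when it has to reduce the size, so that the change is visible to the administrator.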

-Phil

On 11/23/2011 07:07 AM, Mikhail Gilmendinov wrote:

Hi.

I have a problem:

[mix@smart bin]$ trun ./pvfs2-cp ./pvfs2-cp /home/mix/pvfs2fs -n 4
[E 17:54:31.582929] mem_to_bmi_callback_fn: I/O error occurred
[E 17:54:31.583273] handle_io_error: flow proto error cleanup started on 0x9d71604: Message too long
[E 17:54:31.583374] handle_io_error: flow proto 0x9d71604 canceled 0 operations, will clean up.
[E 17:54:31.583469] handle_io_error: flow proto 0x9d71604 error cleanup finished: Message too long
[E 17:54:31.584654] mem_to_bmi_callback_fn: I/O error occurred
[E 17:54:31.584767] handle_io_error: flow proto error cleanup started on 0x9d71cc0: Message too long
[E 17:54:31.584860] handle_io_error: flow proto 0x9d71cc0 canceled 0 operations, will clean up.
[E 17:54:31.584961] handle_io_error: flow proto 0x9d71cc0 error cleanup finished: Message too long
^Ctask2: Program /home/mix/orfs/bin/pvfs2-cp exited with exitcode 255.

Server's reaction:

[mix@smart sbin]$ trun ./pvfs2-server ./fs.conf -d -n 0
task1: pvfs2-server started on nodes 0
[S 11/22/2011 20:49:32] PVFS2 Server on node torus0 version 2.8.4-orangefs starting...
[E 11/22/2011 20:49:32] BMI_initialize: j=0, ladr = m2://0, proto=m2: bmi_m2
[S 11/22/2011 20:49:34] PVFS2 Server ready.
[E 11/22/2011 20:55:37] job_time_mgr_expire: job time out: cancelling flow operation, job_id: 1000.
[E 11/22/2011 20:55:37] fp_multiqueue_cancel: flow proto cancel called on 0x83c2a38
[E 11/22/2011 20:55:37] fp_multiqueue_cancel: I/O error occurred
[E 11/22/2011 20:55:37] handle_io_error: flow proto error cleanup started on 0x83c2a38: Operation cancelled (possibly due to timeout)
[E 11/22/2011 20:55:37] handle_io_error: flow proto 0x83c2a38 canceled 1 operations, will clean up.

[mix@smart bin]$ trun ./pvfs2-ls -l /home/mix/pvfs2fs -n 4
task2: pvfs2-ls started on nodes 4
-rwxr-xr-x 1 mix mix 0 2011-11-22 20:53 pvfs2-cp
drwxrwxrwx 1 mix mix 4096 2011-11-21 17:24 lost+found
task2: Program /home/mix/orfs/bin/pvfs2-ls exited with exitcode 0

The file exists, but it is empty.

When I run pvfs2-validate, one server crashes and the other servers stop responding to requests from the pvfs2 utilities:
[mix@smart bin]$ trun ./pvfs2-validate -d /home/mix/pvfs2fs -n 4
task2: pvfs2-validate started on nodes 4
^Ctask2: Program /home/mix/orfs/bin/pvfs2-validate exited with exitcode 255. (pvfs2-validate also hangs)

Server that crashes:

[E 11/22/2011 20:57:55] Error: poorly formatted protocol message received.
[E 11/22/2011 20:57:55] Too small: message only 0 bytes.
[E 11/22/2011 20:57:55] msgpairarray decode error: Protocol error
[E 11/22/2011 20:57:55] PVFS2 server: signal 11, faulty address is (nil), from 0x80c30ac
[E 11/22/2011 20:57:55] [bt] /home/mix/orfs/sbin/pvfs2-server [0x80c30ac]
[E 11/22/2011 20:57:55] [bt] /home/mix/orfs/sbin/pvfs2-server [0x80e1493]
[E 11/22/2011 20:57:55] [bt] /home/mix/orfs/sbin/pvfs2-server(PINT_state_machine_invoke+0x12f) [0x80de9b1]
[E 11/22/2011 20:57:55] [bt] /home/mix/orfs/sbin/pvfs2-server(PINT_state_machine_next+0x23c) [0x80ded83]
[E 11/22/2011 20:57:55] [bt] /home/mix/orfs/sbin/pvfs2-server(PINT_state_machine_continue+0x18) [0x80dedb7]
[E 11/22/2011 20:57:55] [bt] /home/mix/orfs/sbin/pvfs2-server(main+0x665) [0x80586c9]
[E 11/22/2011 20:57:55] [bt] /lib/libc.so.6(__libc_start_main+0xe0) [0xbb5390]
[E 11/22/2011 20:57:55] [bt] /home/mix/orfs/sbin/pvfs2-server [0x8057f01]
task0: Program /home/mix/orfs/sbin/pvfs2-server exited with exitcode 11.

Then I start the servers again, and pvfs2-validate no longer reports errors:
[mix@smart bin]$ trun ./pvfs2-ls -l /home/mix/pvfs2fs -n 4
task2: pvfs2-ls started on nodes 4
-rwxr-xr-x 1 mix mix 0 2011-11-22 20:53 pvfs2-cp
drwxrwxrwx 1 mix mix 4096 2011-11-21 17:24 lost+found
task2: Program /home/mix/orfs/bin/pvfs2-ls exited with exitcode 0.
The zero-size file still exists, and pvfs2-validate now accepts it.

This problem does not occur when I copy a small file:

[mix@smart bin]$ trun ./pvfs2-cp ./pvfs2tab /home/mix/pvfs2fs -n 4
task2: pvfs2-cp started on nodes 4
task2: Program /home/mix/orfs/bin/pvfs2-cp exited with exitcode 0.

and back

[mix@smart bin]$ trun ./pvfs2-cp /home/mix/pvfs2fs/pvfs2tab /home/mix/p2tab -n 4
task2: pvfs2-cp started on nodes 4
task2: Program /home/mix/orfs/bin/pvfs2-cp exited with exitcode 0.

[mix@smart bin]$ trun ./pvfs2-ls -l /home/mix/pvfs2fs -n 4
task2: pvfs2-ls started on nodes 4
-rwxr-xr-x 1 mix mix 0 2011-11-22 20:53 pvfs2-cp
-rw-rw-r-- 1 mix mix 60 2011-11-22 21:04 pvfs2tab
drwxrwxrwx 1 mix mix 4096 2011-11-21 17:24 lost+found
task2: Program /home/mix/orfs/bin/pvfs2-ls exited with exitcode 0.

[mix@smart bin]$ diff ./pvfs2tab /home/mix/p2tab
[mix@smart bin]$ ls -l /home/mix/p2tab
-rw-rw-r-- 1 mix mix 60 2011-11-22 21:44 /home/mix/p2tab
There are no differences between the files.

It seems that pvfs2-cp tries to send the file in a single message, but the maximum message size in my bmi_m2 method is 8 KB, and in the log I found that BMI_post_send_list tries to send one buffer of 49216 bytes. Earlier, it calls bmi_get_info with option 10 (BMI_GET_UNEXP_SIZE), sends an unexpected message to the server, and receives a message from the server. But it never calls bmi_get_info with option 3 (BMI_CHECK_MAXSIZE), and bmi_post_send_list returns BMI_EMSGSIZE.
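
For scale, the arithmetic behind those numbers can be sketched as follows. This only counts how many messages a buffer would need if a sender split it to fit the limit; messages_needed is a hypothetical helper, and nothing here claims that pvfs2-cp or the flow protocol splits buffers this way.

```c
#include <stddef.h>

/*
 * Hypothetical helper, not part of PVFS2 or BMI: number of messages needed
 * to carry total_bytes when each message holds at most max_msg_size bytes.
 * Returns 0 when max_msg_size is 0, since no message could be sent at all.
 */
size_t messages_needed(size_t total_bytes, size_t max_msg_size)
{
    if (max_msg_size == 0) {
        return 0;
    }
    /* Ceiling division without risking overflow in the addition. */
    return total_bytes / max_msg_size + (total_bytes % max_msg_size != 0);
}
```

With a 49216-byte buffer and an 8192-byte limit this gives 7 messages: six full ones and a final one of 64 bytes. The 60-byte pvfs2tab file fits in a single message, which is consistent with the small copy succeeding.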

Is this a problem in pvfs2-cp, or must a BMI method support sending large expected messages (10 MB, for instance)?

Thanks,
Mikhail Gilmendinov



_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers


