Hi.

I have a problem:

[mix@smart bin]$ trun ./pvfs2-cp ./pvfs2-cp /home/mix/pvfs2fs -n 4
[E 17:54:31.582929] mem_to_bmi_callback_fn: I/O error occurred
[E 17:54:31.583273] handle_io_error: flow proto error cleanup started on 
0x9d71604: Message too long
[E 17:54:31.583374] handle_io_error: flow proto 0x9d71604 canceled 0 
operations, will clean up.
[E 17:54:31.583469] handle_io_error: flow proto 0x9d71604 error cleanup 
finished: Message too long
[E 17:54:31.584654] mem_to_bmi_callback_fn: I/O error occurred
[E 17:54:31.584767] handle_io_error: flow proto error cleanup started on 
0x9d71cc0: Message too long
[E 17:54:31.584860] handle_io_error: flow proto 0x9d71cc0 canceled 0 
operations, will clean up.
[E 17:54:31.584961] handle_io_error: flow proto 0x9d71cc0 error cleanup 
finished: Message too long
^Ctask2: Program /home/mix/orfs/bin/pvfs2-cp exited with exitcode 255.

The server's reaction:

[mix@smart sbin]$ trun ./pvfs2-server ./fs.conf -d -n 0
task1: pvfs2-server started on nodes 0
[S 11/22/2011 20:49:32] PVFS2 Server on node torus0 version 2.8.4-orangefs 
starting...
[E 11/22/2011 20:49:32] BMI_initialize: j=0, ladr = m2://0, proto=m2: bmi_m2
[S 11/22/2011 20:49:34] PVFS2 Server ready.
[E 11/22/2011 20:55:37] job_time_mgr_expire: job time out: cancelling flow 
operation, job_id: 1000.
[E 11/22/2011 20:55:37] fp_multiqueue_cancel: flow proto cancel called on 
0x83c2a38
[E 11/22/2011 20:55:37] fp_multiqueue_cancel: I/O error occurred
[E 11/22/2011 20:55:37] handle_io_error: flow proto error cleanup started on 
0x83c2a38: Operation cancelled (possibly due to timeout)
[E 11/22/2011 20:55:37] handle_io_error: flow proto 0x83c2a38 canceled 1 
operations, will clean up.

[mix@smart bin]$ trun ./pvfs2-ls -l /home/mix/pvfs2fs -n 4
task2: pvfs2-ls started on nodes 4
-rwxr-xr-x 1 mix mix 0 2011-11-22 20:53 pvfs2-cp
drwxrwxrwx 1 mix mix 4096 2011-11-21 17:24 lost+found
task2: Program /home/mix/orfs/bin/pvfs2-ls exited with exitcode 0

The file exists, but it is empty.

When I run pvfs2-validate, one server crashes and the other servers do not
respond to further requests from the pvfs2 utilities:
[mix@smart bin]$ trun ./pvfs2-validate -d /home/mix/pvfs2fs -n 4
task2: pvfs2-validate started on nodes 4
^Ctask2: Program /home/mix/orfs/bin/pvfs2-validate exited with exitcode 255. 
(pvfs2-validate also hangs)

The server that crashes:

[E 11/22/2011 20:57:55] Error: poorly formatted protocol message received.
[E 11/22/2011 20:57:55] Too small: message only 0 bytes.
[E 11/22/2011 20:57:55] msgpairarray decode error: Protocol error
[E 11/22/2011 20:57:55] PVFS2 server: signal 11, faulty address is (nil), from 
0x80c30ac
[E 11/22/2011 20:57:55] [bt] /home/mix/orfs/sbin/pvfs2-server [0x80c30ac]
[E 11/22/2011 20:57:55] [bt] /home/mix/orfs/sbin/pvfs2-server [0x80e1493]
[E 11/22/2011 20:57:55] [bt] 
/home/mix/orfs/sbin/pvfs2-server(PINT_state_machine_invoke+0x12f) [0x80de9b1]
[E 11/22/2011 20:57:55] [bt] 
/home/mix/orfs/sbin/pvfs2-server(PINT_state_machine_next+0x23c) [0x80ded83]
[E 11/22/2011 20:57:55] [bt] 
/home/mix/orfs/sbin/pvfs2-server(PINT_state_machine_continue+0x18) [0x80dedb7]
[E 11/22/2011 20:57:55] [bt] /home/mix/orfs/sbin/pvfs2-server(main+0x665) 
[0x80586c9]
[E 11/22/2011 20:57:55] [bt] /lib/libc.so.6(__libc_start_main+0xe0) [0xbb5390]
[E 11/22/2011 20:57:55] [bt] /home/mix/orfs/sbin/pvfs2-server [0x8057f01]
task0: Program /home/mix/orfs/sbin/pvfs2-server exited with exitcode 11.

Then I start the servers again. pvfs2-validate no longer reports any errors, and
[mix@smart bin]$ trun ./pvfs2-ls -l /home/mix/pvfs2fs -n 4
task2: pvfs2-ls started on nodes 4
-rwxr-xr-x 1 mix mix 0 2011-11-22 20:53 pvfs2-cp
drwxrwxrwx 1 mix mix 4096 2011-11-21 17:24 lost+found
task2: Program /home/mix/orfs/bin/pvfs2-ls exited with exitcode 0.
The zero-length file still exists, and pvfs2-validate now accepts it.

This problem does not occur when I copy a small file:

[mix@smart bin]$ trun ./pvfs2-cp ./pvfs2tab /home/mix/pvfs2fs -n 4
task2: pvfs2-cp started on nodes 4
task2: Program /home/mix/orfs/bin/pvfs2-cp exited with exitcode 0.

and when I copy it back:

[mix@smart bin]$ trun ./pvfs2-cp /home/mix/pvfs2fs/pvfs2tab /home/mix/p2tab -n 4
task2: pvfs2-cp started on nodes 4
task2: Program /home/mix/orfs/bin/pvfs2-cp exited with exitcode 0.

[mix@smart bin]$ trun ./pvfs2-ls -l /home/mix/pvfs2fs -n 4
task2: pvfs2-ls started on nodes 4
-rwxr-xr-x 1 mix mix 0 2011-11-22 20:53 pvfs2-cp
-rw-rw-r-- 1 mix mix 60 2011-11-22 21:04 pvfs2tab
drwxrwxrwx 1 mix mix 4096 2011-11-21 17:24 lost+found
task2: Program /home/mix/orfs/bin/pvfs2-ls exited with exitcode 0.

[mix@smart bin]$ diff ./pvfs2tab /home/mix/p2tab
[mix@smart bin]$ ls -l /home/mix/p2tab 
-rw-rw-r-- 1 mix mix 60 2011-11-22 21:44 /home/mix/p2tab
There are no differences between the files.

It seems that pvfs2-cp tries to send the file in a single message, but the
maximum message size in my bmi_m2 method is 8 KB. In the log I found that
BMI_post_send_list tries to send one buffer of 49216 bytes. Before that, it
calls bmi_get_info with option 10 (BMI_GET_UNEXP_SIZE), sends an unexpected
message to the server, and receives a message from the server.
But it never calls bmi_get_info with option 3 (BMI_CHECK_MAXSIZE), and
bmi_post_send_list returns BMI_EMSGSIZE.
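
To make the size mismatch concrete, here is a minimal sketch in Python (it is
not PVFS2 or BMI code). It assumes that "8kb" means 8192 bytes, and
chunk_sizes is a hypothetical helper introduced only for this illustration. It
shows how many messages the 49216-byte buffer would need if it were split to
respect that limit.

```python
from typing import List

MAX_MSG_SIZE = 8 * 1024  # assumption: "8kb" in the text means 8192 bytes
BUFFER_SIZE = 49216      # buffer size seen in the log for BMI_post_send_list


def chunk_sizes(total: int, max_size: int) -> List[int]:
    """Return the sizes of consecutive pieces, each at most max_size bytes."""
    if total < 0 or max_size <= 0:
        raise ValueError("total must be >= 0 and max_size must be > 0")
    full, rest = divmod(total, max_size)
    sizes = [max_size] * full
    if rest:
        # The final piece carries whatever does not fill a whole message.
        sizes.append(rest)
    return sizes


if __name__ == "__main__":
    sizes = chunk_sizes(BUFFER_SIZE, MAX_MSG_SIZE)
    # Prints: 7 pieces, last one is 64 bytes
    print(f"{len(sizes)} pieces, last one is {sizes[-1]} bytes")
```

Under the same assumption, a 60-byte payload fits in a single piece, which
would be consistent with the small-file copy succeeding. Any protocol overhead
added to the payload is not modelled here.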

Is this a problem in pvfs2-cp, or must a BMI method support sending large
expected messages (10 MB, for instance)?
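
If the answer is that the method itself has to accept large expected messages,
one possible approach is to fragment them inside the method and reassemble
them on the receiving side. The sketch below is Python and purely
illustrative: fragment and reassemble are hypothetical names, not part of BMI
or bmi_m2, and a real method would also need tagging, ordering and error
handling, none of which is shown.

```python
from typing import Iterable, Iterator


def fragment(payload: bytes, max_size: int) -> Iterator[bytes]:
    """Yield consecutive slices of payload, each at most max_size bytes."""
    if max_size <= 0:
        raise ValueError("max_size must be > 0")
    # An empty payload yields no fragments at all.
    for offset in range(0, len(payload), max_size):
        yield payload[offset:offset + max_size]


def reassemble(fragments: Iterable[bytes]) -> bytes:
    """Concatenate fragments in the order in which they were received."""
    return b"".join(fragments)


if __name__ == "__main__":
    payload = bytes(49216)  # same length as the buffer seen in the log
    pieces = list(fragment(payload, 8 * 1024))  # assumption: 8 KB = 8192 bytes
    assert all(len(piece) <= 8 * 1024 for piece in pieces)
    assert reassemble(pieces) == payload
```

The round trip only demonstrates that splitting and joining preserve the data.
It says nothing about which layer PVFS2 expects to do this work.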

Thanks,
Mikhail Gilmendinov
