Hi.
I have a problem:
[mix@smart bin]$ trun ./pvfs2-cp ./pvfs2-cp /home/mix/pvfs2fs -n 4
[E 17:54:31.582929] mem_to_bmi_callback_fn: I/O error occurred
[E 17:54:31.583273] handle_io_error: flow proto error cleanup started
on 0x9d71604: Message too long
[E 17:54:31.583374] handle_io_error: flow proto 0x9d71604 canceled 0
operations, will clean up.
[E 17:54:31.583469] handle_io_error: flow proto 0x9d71604 error
cleanup finished: Message too long
[E 17:54:31.584654] mem_to_bmi_callback_fn: I/O error occurred
[E 17:54:31.584767] handle_io_error: flow proto error cleanup started
on 0x9d71cc0: Message too long
[E 17:54:31.584860] handle_io_error: flow proto 0x9d71cc0 canceled 0
operations, will clean up.
[E 17:54:31.584961] handle_io_error: flow proto 0x9d71cc0 error
cleanup finished: Message too long
^Ctask2: Program /home/mix/orfs/bin/pvfs2-cp exited with exitcode 255.
Server reaction:
[mix@smart sbin]$ trun ./pvfs2-server ./fs.conf -d -n 0
task1: pvfs2-server started on nodes 0
[S 11/22/2011 20:49:32] PVFS2 Server on node torus0 version
2.8.4-orangefs starting...
[E 11/22/2011 20:49:32] BMI_initialize: j=0, ladr = m2://0, proto=m2:
bmi_m2
[S 11/22/2011 20:49:34] PVFS2 Server ready.
[E 11/22/2011 20:55:37] job_time_mgr_expire: job time out: cancelling
flow operation, job_id: 1000.
[E 11/22/2011 20:55:37] fp_multiqueue_cancel: flow proto cancel
called on 0x83c2a38
[E 11/22/2011 20:55:37] fp_multiqueue_cancel: I/O error occurred
[E 11/22/2011 20:55:37] handle_io_error: flow proto error cleanup
started on 0x83c2a38: Operation cancelled (possibly due to timeout)
[E 11/22/2011 20:55:37] handle_io_error: flow proto 0x83c2a38
canceled 1 operations, will clean up.
[mix@smart bin]$ trun ./pvfs2-ls -l /home/mix/pvfs2fs -n 4
task2: pvfs2-ls started on nodes 4
-rwxr-xr-x 1 mix mix 0 2011-11-22 20:53 pvfs2-cp
drwxrwxrwx 1 mix mix 4096 2011-11-21 17:24 lost+found
task2: Program /home/mix/orfs/bin/pvfs2-ls exited with exitcode 0
The file exists, but it is empty.
When I run pvfs2-validate, one server crashes and the other servers
stop responding to requests from the pvfs2 utilities:
[mix@smart bin]$ trun ./pvfs2-validate -d /home/mix/pvfs2fs -n 4
task2: pvfs2-validate started on nodes 4
^Ctask2: Program /home/mix/orfs/bin/pvfs2-validate exited with
exitcode 255. (pvfs2-validate also hangs)
The server that crashes:
[E 11/22/2011 20:57:55] Error: poorly formatted protocol message
received.
[E 11/22/2011 20:57:55] Too small: message only 0 bytes.
[E 11/22/2011 20:57:55] msgpairarray decode error: Protocol error
[E 11/22/2011 20:57:55] PVFS2 server: signal 11, faulty address is
(nil), from 0x80c30ac
[E 11/22/2011 20:57:55] [bt] /home/mix/orfs/sbin/pvfs2-server [0x80c30ac]
[E 11/22/2011 20:57:55] [bt] /home/mix/orfs/sbin/pvfs2-server [0x80e1493]
[E 11/22/2011 20:57:55] [bt]
/home/mix/orfs/sbin/pvfs2-server(PINT_state_machine_invoke+0x12f)
[0x80de9b1]
[E 11/22/2011 20:57:55] [bt]
/home/mix/orfs/sbin/pvfs2-server(PINT_state_machine_next+0x23c)
[0x80ded83]
[E 11/22/2011 20:57:55] [bt]
/home/mix/orfs/sbin/pvfs2-server(PINT_state_machine_continue+0x18)
[0x80dedb7]
[E 11/22/2011 20:57:55] [bt]
/home/mix/orfs/sbin/pvfs2-server(main+0x665) [0x80586c9]
[E 11/22/2011 20:57:55] [bt] /lib/libc.so.6(__libc_start_main+0xe0)
[0xbb5390]
[E 11/22/2011 20:57:55] [bt] /home/mix/orfs/sbin/pvfs2-server [0x8057f01]
task0: Program /home/mix/orfs/sbin/pvfs2-server exited with exitcode 11.
Then I start the servers again; pvfs2-validate no longer complains
about errors, and pvfs2-ls shows:
[mix@smart bin]$ trun ./pvfs2-ls -l /home/mix/pvfs2fs -n 4
task2: pvfs2-ls started on nodes 4
-rwxr-xr-x 1 mix mix 0 2011-11-22 20:53 pvfs2-cp
drwxrwxrwx 1 mix mix 4096 2011-11-21 17:24 lost+found
task2: Program /home/mix/orfs/bin/pvfs2-ls exited with exitcode 0.
The zero-size file still exists, and pvfs2-validate now accepts it.
This problem does not occur when I copy a small file:
[mix@smart bin]$ trun ./pvfs2-cp ./pvfs2tab /home/mix/pvfs2fs -n 4
task2: pvfs2-cp started on nodes 4
task2: Program /home/mix/orfs/bin/pvfs2-cp exited with exitcode 0.
and back:
[mix@smart bin]$ trun ./pvfs2-cp /home/mix/pvfs2fs/pvfs2tab
/home/mix/p2tab -n 4
task2: pvfs2-cp started on nodes 4
task2: Program /home/mix/orfs/bin/pvfs2-cp exited with exitcode 0.
[mix@smart bin]$ trun ./pvfs2-ls -l /home/mix/pvfs2fs -n 4
task2: pvfs2-ls started on nodes 4
-rwxr-xr-x 1 mix mix 0 2011-11-22 20:53 pvfs2-cp
-rw-rw-r-- 1 mix mix 60 2011-11-22 21:04 pvfs2tab
drwxrwxrwx 1 mix mix 4096 2011-11-21 17:24 lost+found
task2: Program /home/mix/orfs/bin/pvfs2-ls exited with exitcode 0.
[mix@smart bin]$ diff ./pvfs2tab /home/mix/p2tab
[mix@smart bin]$ ls -l /home/mix/p2tab
-rw-rw-r-- 1 mix mix 60 2011-11-22 21:44 /home/mix/p2tab
There are no differences between the files.
It seems that pvfs2-cp tries to send the file in a single message, but
the maximum message size in my bmi_m2 method is 8 KB, and in the log I
found that BMI_post_send_list tries to send one buffer of 49216 bytes.
Earlier it calls bmi_get_info with option 10 (BMI_GET_UNEXP_SIZE),
sends an unexpected message to the server, and receives a message from
the server. But it never calls bmi_get_info with option 3
(BMI_CHECK_MAXSIZE), and bmi_post_send_list returns BMI_EMSGSIZE.
Is this a problem in pvfs2-cp, or must a BMI method support sending
large expected messages (10 MB, for instance)?
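To make the size mismatch concrete, here is a minimal sketch of the kind
of splitting that would keep every message within the limit. It is not
PVFS2 or BMI code: send_chunk, send_in_chunks, and MAX_MSG_SIZE are
hypothetical names, and the 8192-byte limit is simply the 8 KB maximum
of my bmi_m2 method.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 8 KB maximum message size, as in my bmi_m2 method. */
#define MAX_MSG_SIZE 8192

/* Hypothetical stand-in for a transport send call (not a BMI function).
 * It rejects oversized messages, mimicking "Message too long". */
static int send_chunk(const char *buf, size_t len)
{
    (void)buf;
    return (len > MAX_MSG_SIZE) ? -1 : 0;
}

/* Send `total` bytes as consecutive messages of at most `max_size` bytes.
 * Returns the number of messages sent, or -1 if any send fails. */
static int send_in_chunks(const char *buf, size_t total, size_t max_size)
{
    size_t off = 0;
    int count = 0;

    assert(buf != NULL);
    if (max_size == 0)
        return -1;

    while (off < total) {
        size_t len = total - off;
        if (len > max_size)
            len = max_size;          /* clamp to the transport limit */
        if (send_chunk(buf + off, len) != 0)
            return -1;
        off += len;
        count++;
    }
    return count;
}
```

With these assumptions, the 49216-byte buffer from the log would go out
as six 8192-byte messages plus one 64-byte message, whereas passing it
as a single 49216-byte message fails the size check.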
Thanks,
Mikhail Gilmendinov