I'm seeing an issue when removing large files from a PVFS2 file system. My
example setup is a 12-node PVFS2 file system with a 2.2TB EXT3 SAN mount on
each PVFS2 server. The server is configured for 30-second timeouts and 5
retries. We would really prefer not to change the timeout and retry values
if possible.

 

The file in question is 2TB. When the client tries to 'rm' it, the client
waits out the 30-second timeout, exhausts the retries, and then reports
"Invalid argument" on the command line. From everything I can tell, the file
*really* does get deleted: it no longer shows up in a directory listing.
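As a rough sanity check on how long the 'rm' can block with these settings,
here is a minimal sketch of the worst-case wait. It assumes that "5 retries"
means five additional attempts after the first one and that every attempt
waits out the full timeout. The function name and the retry_delay_s parameter
are hypothetical, for illustration only, and are not PVFS2 settings.

```python
def worst_case_wait(timeout_s, retries, retry_delay_s=0.0):
    """Upper bound, in seconds, on how long one operation can block.

    Assumes one initial attempt plus `retries` further attempts, each
    waiting the full timeout, with an optional pause between attempts.
    """
    attempts = retries + 1
    return attempts * timeout_s + retries * retry_delay_s


# 30-second timeout and 5 retries, as configured in this setup.
print(worst_case_wait(30, 5))
```

Under those assumptions the client could sit for about three minutes before
the error comes back to the command line.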

 

I've included the client command line results and the log messages from the
delete below.

 

bash-2.05b$ rm cmsdb_silo_mstr_20080606a

rm: cannot remove `cmsdb_silo_mstr_20080606a': Invalid argument

 

Oct  2 10:29:35 clientNode1 PVFS2: [E] job_time_mgr_expire: job time out:
cancelling bmi operation, job_id: 192955100.

Oct  2 10:29:35 clientNode1 PVFS2: [E] job_time_mgr_expire: job time out:
cancelling bmi operation, job_id: 192955103.

Oct  2 10:29:35 clientNode1 PVFS2: [E] job_time_mgr_expire: job time out:
cancelling bmi operation, job_id: 192955106.

Oct  2 10:29:35 clientNode1 PVFS2: [E] job_time_mgr_expire: job time out:
cancelling bmi operation, job_id: 192955109.

Oct  2 10:29:35 clientNode1 PVFS2: [E] job_time_mgr_expire: job time out:
cancelling bmi operation, job_id: 192955112.

Oct  2 10:29:35 clientNode1 PVFS2: [E] job_time_mgr_expire: job time out:
cancelling bmi operation, job_id: 192955115.

Oct  2 10:29:35 clientNode1 PVFS2: [E] job_time_mgr_expire: job time out:
cancelling bmi operation, job_id: 192955118.

Oct  2 10:29:35 clientNode1 PVFS2: [E] job_time_mgr_expire: job time out:
cancelling bmi operation, job_id: 192955121.

Oct  2 10:29:35 clientNode1 PVFS2: [E] job_time_mgr_expire: job time out:
cancelling bmi operation, job_id: 192955124.

Oct  2 10:29:35 clientNode1 PVFS2: [E] job_time_mgr_expire: job time out:
cancelling bmi operation, job_id: 192955127.

Oct  2 10:29:35 clientNode1 PVFS2: [E] job_time_mgr_expire: job time out:
cancelling bmi operation, job_id: 192955130.

Oct  2 10:29:35 clientNode1 PVFS2: [E] job_time_mgr_expire: job time out:
cancelling bmi operation, job_id: 192955133.

Oct  2 10:29:35 clientNode1 PVFS2: [E] msgpair failed, will retry: Operation
cancelled (possibly due to timeout)

Oct  2 10:29:36 clientNode1 last message repeated 11 times.

 

<SKIPPING REPEAT OF THE ABOVE 5 MORE TIMES>

 

Oct  2 10:32:10 clientNode1 PVFS2: [E] *** msgpairarray_completion_fn:
msgpair to server tcp://server1HA:3334 failed: Operation cancelled (possibly
due to timeout)

Oct  2 10:32:10 clientNode1 PVFS2: [E] *** Out of retries.

Oct  2 10:32:10 clientNode1 PVFS2: [E] *** msgpairarray_completion_fn:
msgpair to server tcp://server2HA:3334 failed: Operation cancelled (possibly
due to timeout)

Oct  2 10:32:10 clientNode1 PVFS2: [E] *** Out of retries.

Oct  2 10:32:10 clientNode1 PVFS2: [E] *** msgpairarray_completion_fn:
msgpair to server tcp://server3HA:3334 failed: Operation cancelled (possibly
due to timeout)

Oct  2 10:32:10 clientNode1 PVFS2: [E] *** Out of retries.

Oct  2 10:32:10 clientNode1 PVFS2: [E] *** msgpairarray_completion_fn:
msgpair to server tcp://server4HA:3334 failed: Operation cancelled (possibly
due to timeout)

Oct  2 10:32:10 clientNode1 PVFS2: [E] *** Out of retries.

Oct  2 10:32:10 clientNode1 PVFS2: [E] *** msgpairarray_completion_fn:
msgpair to server tcp://server5HA:3334 failed: Operation cancelled (possibly
due to timeout)

Oct  2 10:32:10 clientNode1 PVFS2: [E] *** Out of retries.

Oct  2 10:32:10 clientNode1 PVFS2: [E] *** msgpairarray_completion_fn:
msgpair to server tcp://server6HA:3334 failed: Operation cancelled (possibly
due to timeout)

Oct  2 10:32:10 clientNode1 PVFS2: [E] *** Out of retries.

Oct  2 10:32:10 clientNode1 PVFS2: [E] *** msgpairarray_completion_fn:
msgpair to server tcp://server7HA:3334 failed: Operation cancelled (possibly
due to timeout)

Oct  2 10:32:10 clientNode1 PVFS2: [E] *** Out of retries.

Oct  2 10:32:10 clientNode1 PVFS2: [E] *** msgpairarray_completion_fn:
msgpair to server tcp://server8HA:3334 failed: Operation cancelled (possibly
due to timeout)

Oct  2 10:32:10 clientNode1 PVFS2: [E] *** Out of retries.

Oct  2 10:32:10 clientNode1 PVFS2: [E] *** msgpairarray_completion_fn:
msgpair to server tcp://server9HA:3334 failed: Operation cancelled (possibly
due to timeout)

Oct  2 10:32:10 clientNode1 PVFS2: [E] *** Out of retries.

Oct  2 10:32:10 clientNode1 PVFS2: [E] *** msgpairarray_completion_fn:
msgpair to server tcp://server10HA:3334 failed: Operation cancelled
(possibly due to timeout)

Oct  2 10:32:10 clientNode1 PVFS2: [E] *** Out of retries.

Oct  2 10:32:10 clientNode1 PVFS2: [E] *** msgpairarray_completion_fn:
msgpair to server tcp://server11HA:3334 failed: Operation cancelled
(possibly due to timeout)

Oct  2 10:32:10 clientNode1 PVFS2: [E] *** Out of retries.

Oct  2 10:32:10 clientNode1 PVFS2: [E] *** msgpairarray_completion_fn:
msgpair to server tcp://server12HA:3334 failed: Operation cancelled
(possibly due to timeout)

Oct  2 10:32:10 clientNode1 PVFS2: [E] *** Out of retries.

Oct  2 10:32:10 clientNode1 PVFS2: [E] Error: failed removing one or more
datafiles associated with the meta handle 1610612708

Oct  2 10:32:10 clientNode1 PVFS2: [E] WARNING: PVFS_sys_remove()
encountered an error which may lead to inconsistent state: Operation
cancelled (possibly due to timeout)

Oct  2 10:32:10 clientNode1 PVFS2: [E] WARNING: PVFS2 fsck (if available)
may be needed.

Oct  2 10:32:10 clientNode1 kernel: pvfs2: warning: got error code without
errno equivalent: -1610612865.

Oct  2 10:32:59 clientNode1 PVFS2: [E] job_time_mgr_expire: job time out:
cancelling bmi operation, job_id: 192955696.

Oct  2 10:32:59 clientNode1 PVFS2: [E] msgpair failed, will retry: Operation
cancelled (possibly due to timeout)

Oct  2 10:33:29 clientNode1 PVFS2: [E] job_time_mgr_expire: job time out:
cancelling bmi operation, job_id: 192955732.

Oct  2 10:33:29 clientNode1 PVFS2: [E] msgpair failed, will retry: Operation
cancelled (possibly due to timeout)

Oct  2 10:33:59 clientNode1 PVFS2: [E] job_time_mgr_expire: job time out:
cancelling bmi operation, job_id: 192955766.

Oct  2 10:33:59 clientNode1 PVFS2: [E] msgpair failed, will retry: Operation
cancelled (possibly due to timeout)

Oct  2 10:34:20 clientNode1 PVFS2: [E] Error: failed removing one or more
datafiles associated with the meta handle 1252698765

Oct  2 10:34:20 clientNode1 PVFS2: [E] WARNING: PVFS_sys_remove()
encountered an error which may lead to inconsistent state: No such file or
directory

Oct  2 10:34:20 clientNode1 PVFS2: [E] WARNING: PVFS2 fsck (if available)
may be needed.
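
To relate the log above to the configured timeout, the following sketch
measures the time between the first timeout message and the "Out of retries"
messages. It uses only two lines copied from the log. Treating the skipped
repeats as five further timeout cycles is an assumption based on the
"5 MORE TIMES" note, and the helper name is hypothetical.

```python
def seconds_of_day(syslog_line):
    """Return the time-of-day stamp of a syslog line as seconds since midnight."""
    # Fields are: month, day, HH:MM:SS, host, ...
    hh, mm, ss = syslog_line.split()[2].split(":")
    return int(hh) * 3600 + int(mm) * 60 + int(ss)


first_timeout = "Oct  2 10:29:35 clientNode1 PVFS2: [E] job_time_mgr_expire: job time out:"
out_of_retries = "Oct  2 10:32:10 clientNode1 PVFS2: [E] *** Out of retries."

elapsed = seconds_of_day(out_of_retries) - seconds_of_day(first_timeout)
cycles = 5  # assumed: the first block of timeouts repeated five more times
per_cycle = elapsed / cycles

print(elapsed, per_cycle)
```

If that reading is right, each cycle takes roughly the 30-second timeout plus
about a second, which fits the timeout and retry settings being applied as
configured rather than the client hanging somewhere else.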

 

 

_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
