Yes. I reran the test so that I could grab the actual messages, and this time
the failure was a bit more aggressive: instead of leaving a single file in a
bad state, it left my whole directory structure under the file system root in
that state. I have attached a chunk of log messages from the client and from
the server that was timing out. The other servers did not log anything.
I am currently getting this back from pvfs2-fsck:
server 1, exceeding number of handles it declared (42923), currently (43000)
pvfs28-fsck: ../pvfs2_src/src/apps/admin/pvfs2-fsck.c:1325:
handlelist_add_handles: Assertion `0' failed.
Aborted
Bart.
On Fri, Jun 18, 2010 at 8:15 AM, Sam Lang <[email protected]> wrote:
>
> Hi Bart,
>
> When you run the script, do you see any timeout error messages in the
> client log?
>
> -sam
>
> On Jun 18, 2010, at 9:03 AM, Bart Taylor wrote:
>
> > Hey Phil,
> >
> > Yes, it is running 2.8.2. My setup was using 3 servers with 2.6.18-194.el5
> > kernels and High Availability. I have not had a chance yet to try it on
> > another file system, so I do not know if it is specific to that setup. It
> > has been triggered from more than one client, but the only one I know for
> > certain was running a 2.6.9-89.ELsmp kernel.
> >
> > Bart.
> >
> >
> > On Fri, Jun 18, 2010 at 7:39 AM, Phil Carns <[email protected]> wrote:
> > Hi Bart,
> >
> > Is this on 2.8.2? Do you happen to know how many servers are needed to
> > trigger the problem?
> >
> > thanks,
> > -Phil
> >
> >
> > On 06/17/2010 04:08 PM, Bart Taylor wrote:
> >>
> >> Hey guys,
> >>
> >> We have had some problems in the past on 2.6 with file creations leaving
> >> bad files that we cannot delete. Most utilities like ls and rm return "No
> >> such file or directory", and pvfs utilities like viewdist, pvfs2-ls, and
> >> pvfs2-rm return various errors. We have resorted to looking up the parent
> >> handle, the fsid, and filename and using pvfs2-remove-object to delete the
> >> entry. But we weren't ever able to intentionally recreate the problem.
> >>
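> >> (For the record, that manual cleanup goes roughly like the sketch below.
> >> The pvfs2-remove-object arguments are written as placeholders, not the
> >> tool's real usage, so check its usage output before running anything:)
> >>
> >>   # 1. find the fs_id of the file system (e.g. from /etc/pvfs2tab)
> >>   # 2. find the handle of the parent directory that holds the bad entry
> >>   # 3. remove the stale entry/object directly by handle:
> >>   pvfs2-remove-object <fs_id> <parent_handle> <entry_name>   # placeholder args
> >>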
> >> Recently while testing 2.8, I have been able to reliably trigger a similar
> >> scenario where a file creation fails and leaves a garbage entry that cannot
> >> be deleted in any of the normal ways, requiring the pvfs2-remove-object
> >> approach to clean up. The file and various outputs for this case:
> >>
> >> [r...@client dir]# ls -l 2010.06.10.28050
> >> total 0
> >> ?--------- ? ? ? ? ? File17027
> >>
> >> [r...@client dir]# rm 2010.06.10.28050/File17027
> >> rm: cannot lstat `2010.06.10.28050/File17027': No such file or directory
> >>
> >> [r...@client dir]# rm -rf 2010.06.10.28050
> >> rm: cannot remove directory `2010.06.10.28050': Directory not empty
> >>
> >> [r...@client dir]# pvfs2-rm 2010.06.10.28050/File17027
> >> Error: An error occurred while removing 2010.06.10.28050/File17027
> >> PVFS_sys_remove: No such file or directory (error class: 0)
> >>
> >> [r...@client dir]# pvfs2-stat 2010.06.10.28050/File17027
> >> PVFS_sys_lookup: No such file or directory (error class: 0)
> >> Error stating [2010.06.10.28050/File17027]
> >>
> >> [r...@client dir]# pvfs2-viewdist -f 2010.06.10.28050/File17027
> >> PVFS_sys_lookup: No such file or directory (error class: 0)
> >> Could not open 2010.06.10.28050/File17027
> >>
> >> [r...@client dir]# ls -l 2010.06.10.28050
> >> total 0
> >> ?--------- ? ? ? ? ? File17027
> >>
> >>
> >> I have included a test script that will spawn off a number of processes,
> >> open a bunch of files, write to each of them, then close them. You can
> >> tweak the options as you want, but using 5 processes and 50,000 files will
> >> usually create at least one of these files. Here is an example command:
> >>
> >> $> ulimit -n 1000000 && ./open-file-limit --num-files=50000 --sleep-time=1 --num-processes=5 --directory=/mnt/pvfs2/ --file-size=1
> >>
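> >> (In case the attachment gets stripped by the list: the idea is just to
> >> spawn several processes that each open, write, and close a lot of files.
> >> A stripped-down sketch of that pattern follows; the file naming and the
> >> spot where the sleep happens are illustrative, not the script's exact
> >> behavior:)
> >>
> >>   #!/bin/sh
> >>   # crude stand-in for the attached open-file-limit script
> >>   DIR=/mnt/pvfs2
> >>   NUM_PROCS=5
> >>   NUM_FILES=50000
> >>   for p in $(seq 1 $NUM_PROCS); do
> >>     (
> >>       for i in $(seq 1 $NUM_FILES); do
> >>         f="$DIR/proc${p}.file${i}"
> >>         echo x > "$f"        # open + tiny write + close (--file-size=1)
> >>         sleep 1              # rough stand-in for --sleep-time=1
> >>       done
> >>     ) &
> >>   done
> >>   wait
> >>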
> >> You may have to do a long listing on any left-over directories to find the
> >> file(s).
> >>
> >> I will give any help I can to recreate the bad file or find the cause.
> >> Until then, is there a better (simpler) way to remove these entries, maybe
> >> some sort of utility that doesn't require doing manual handle lookups
> >> before getting the file removed? It would ease some support pain if it
> >> were simpler to fix.
> >>
> >> Thanks for your help,
> >> Bart.
> >>
>
>
Jun 18 14:24:39 server2 PVFS2: [E] dbpf_bstream_direct_write_op_svc: failed to
get size from dspace attr: (error=-1073742082)
Jun 18 14:26:09 server2 PVFS2: [E] Error: failed to retrieve peer name for
client.
Jun 18 14:26:10 server2 last message repeated 13 times
Jun 18 14:26:39 server2 PVFS2: [E] Error: failed to retrieve peer name for
client.
Jun 18 14:26:39 server2 last message repeated 27 times
Jun 18 14:27:50 server2 PVFS2: [E] dbpf_bstream_direct_write_op_svc: failed to
get dspace attr for bstream: (error=-1073742082)
Jun 18 14:27:55 server2 PVFS2: [E] Error: got unknown type when verifying
attributes for handle 1431655749.
Jun 18 14:28:09 server2 last message repeated 15599 times
Jun 18 14:28:09 server2 PVFS2: [E] Error: got unknown type when verifying
attributes for handle 1431655749.
Jun 18 14:28:20 server2 last message repeated 11749 times
Jun 18 14:28:09 server2 PVFS2: [E] Error: got unknown type when verifying
attributes for handle 1431655749.
Jun 18 14:28:20 server2 PVFS2: [E] Error: failed to retrieve peer name for
client.
Jun 18 14:28:20 server2 last message repeated 13 times
Jun 18 14:28:20 server2 PVFS2: [E] Error: got unknown type when verifying
attributes for handle 1431655749.
Jun 18 14:28:51 server2 last message repeated 33796 times
Jun 18 14:28:58 server2 last message repeated 7394 times
Jun 18 14:30:50 server2 PVFS2: [E] Error: got unknown type when verifying
attributes for handle 1431655749.
Jun 18 14:30:57 server2 last message repeated 4928 times
Jun 18 14:31:18 server2 PVFS2: [E] Error: got unknown type when verifying
attributes for handle 1431655749.
Jun 18 14:31:36 server2 last message repeated 19 times
Jun 18 14:25:09 client PVFS2: [E] job_time_mgr_expire: job time out: cancelling
bmi operation, job_id: 50950661.
Jun 18 14:25:09 client PVFS2: [E] Warning: msgpair failed to
tcp://server2:3334, will retry: Operation cancelled (possibly due to timeout)
Jun 18 14:25:09 client PVFS2: [E] Warning: msgpair failed to
tcp://server2:3334, will retry: Operation cancelled (possibly due to timeout)
Jun 18 14:25:39 client PVFS2: [E] job_time_mgr_expire: job time out: cancelling
bmi operation, job_id: 51222121.
Jun 18 14:25:39 client PVFS2: [E] Warning: msgpair failed to
tcp://server2:3334, will retry: Operation cancelled (possibly due to timeout)
Jun 18 14:26:09 client PVFS2: [E] job_time_mgr_expire: job time out: cancelling
bmi operation, job_id: 51471914.
Jun 18 14:26:09 client PVFS2: [E] Warning: msgpair failed to
tcp://server2:3334, will retry: Operation cancelled (possibly due to timeout)
Jun 18 14:26:09 client PVFS2: [E] Warning: msgpair failed to
tcp://server2:3334, will retry: Operation cancelled (possibly due to timeout)
Jun 18 14:26:39 client PVFS2: [E] job_time_mgr_expire: job time out: cancelling
bmi operation, job_id: 51787079.
Jun 18 14:26:39 client PVFS2: [E] Warning: msgpair failed to
tcp://server2:3334, will retry: Operation cancelled (possibly due to timeout)
Jun 18 14:26:39 client last message repeated 4 times
Jun 18 14:27:09 client PVFS2: [E] job_time_mgr_expire: job time out: cancelling
bmi operation, job_id: 52047716.
Jun 18 14:27:09 client PVFS2: [E] Warning: msgpair failed to
tcp://server2:3334, will retry: Operation cancelled (possibly due to timeout)
Jun 18 14:27:39 client PVFS2: [E] job_time_mgr_expire: job time out: cancelling
bmi operation, job_id: 52323603.
Jun 18 14:27:39 client PVFS2: [E] Warning: msgpair failed to
tcp://server2:3334, will retry: Operation cancelled (possibly due to timeout)
Jun 18 14:27:39 client PVFS2: [E] *** msgpairarray_completion_fn: msgpair to
server tcp://server2:3334 failed: Operation cancelled (possibly due to timeout)
Jun 18 14:27:39 client PVFS2: [E] *** Out of retries.
Jun 18 14:27:39 client kernel: pvfs2_file_write: error in vectored write to
handle 1430807757, FILE: File4293
Jun 18 14:27:39 client kernel: -- returning -110
Jun 18 14:27:55 client PVFS2: [E] getattr_object_getattr_failure : No such
device or address
Jun 18 14:28:20 client last message repeated 27349 times
Jun 18 14:28:20 client PVFS2: [E] job_time_mgr_expire: job time out: cancelling
bmi operation, job_id: 52722668.
Jun 18 14:28:20 client PVFS2: [E] Warning: msgpair failed to
tcp://server2:3334, will retry: Operation cancelled (possibly due to timeout)
Jun 18 14:28:20 client last message repeated 2 times
Jun 18 14:28:20 client PVFS2: [E] getattr_object_getattr_failure : No such
device or address
Jun 18 14:28:50 client last message repeated 33012 times
Jun 18 14:28:50 client PVFS2: [E] job_time_mgr_expire: job time out: cancelling
bmi operation, job_id: 53040224.
Jun 18 14:28:50 client PVFS2: [E] Warning: msgpair failed to
tcp://server2:3334, will retry: Operation cancelled (possibly due to timeout)
Jun 18 14:28:50 client PVFS2: [E] getattr_object_getattr_failure : No such
device or address
Jun 18 14:28:58 client last message repeated 8177 times
Jun 18 14:29:20 client PVFS2: [E] job_time_mgr_expire: job time out: cancelling
bmi operation, job_id: 53370380.
Jun 18 14:29:20 client PVFS2: [E] Warning: msgpair failed to
tcp://server2:3334, will retry: Operation cancelled (possibly due to timeout)
Jun 18 14:29:50 client PVFS2: [E] job_time_mgr_expire: job time out: cancelling
bmi operation, job_id: 53452139.
Jun 18 14:29:50 client PVFS2: [E] Warning: msgpair failed to
tcp://server2:3334, will retry: Operation cancelled (possibly due to timeout)
Jun 18 14:30:20 client PVFS2: [E] job_time_mgr_expire: job time out: cancelling
bmi operation, job_id: 53452171.
Jun 18 14:30:20 client PVFS2: [E] Warning: msgpair failed to
tcp://server2:3334, will retry: Operation cancelled (possibly due to timeout)
Jun 18 14:30:50 client PVFS2: [E] job_time_mgr_expire: job time out: cancelling
bmi operation, job_id: 53452202.
Jun 18 14:30:50 client PVFS2: [E] Warning: msgpair failed to
tcp://server2:3334, will retry: Operation cancelled (possibly due to timeout)
Jun 18 14:30:50 client PVFS2: [E] *** msgpairarray_completion_fn: msgpair to
server tcp://server2:3334 failed: Operation cancelled (possibly due to timeout)
Jun 18 14:30:50 client PVFS2: [E] *** Out of retries.
Jun 18 14:30:50 client kernel: pvfs2_file_write: error in vectored write to
handle 1430790881, FILE: File4925
Jun 18 14:30:50 client kernel: -- returning -110
Jun 18 14:30:50 client PVFS2: [E] getattr_object_getattr_failure : No such
device or address
Jun 18 14:31:23 client last message repeated 4937 times
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers