[EMAIL PROTECTED] wrote on Wed, 19 Dec 2007 14:12 -0700:
> I don't have a solution for you, but I included some comments that should
> clear up a couple of things.
Thanks for taking a look.
> On Dec 17, 2007 11:41 AM, Pete Wyckoff <[EMAIL PROTECTED]> wrote:
>
> > We have an abusive file/dir create/remove script that led me to find
> > this repeatable problem, with CVS head.
> >
> > Start 1 server, both meta + data. This is stock pvfs, no OSD code
> > here. 40k is a 40*1024 byte file of zeroes on NFS.
> >
> > ib30$ pvfs2-mkdir -p /pvfs/1/2/3
> > ib30$ pvfs2-cp 40k /pvfs/1/2/3/a
> > ib30$ pvfs2-cp 40k /pvfs/1/2/3/b
> > ib30$ pvfs2-rm /pvfs/1/2/3/a /pvfs/1/2/3/b
> > PVFS_sys_lookup: No such file or directory (error class: 0)
> >
> > Here's a verbose dump of that last command. The rm code takes two
> > trips through the loop, one for each file to remove. There are
> > differences in the two lookups. The second time through hits on the
> > directory /1/2/3, which makes sense since the first time it looked
> > it up successfully.
>
> I think there may be an error in the debug message that tells what was
> successfully found in the cache. The lookup-ncache.sm strips off the first
> segment to lookup, but that "*** ncache hit" message prints out the entire
> path as being found. I think this message should just print out the segment
> that was looked up.
That observation helps. I'll update the debug message to make this
clear.
> > But after finding this in the cache, it goes to the completion
> > function and claims it only resolved 1 segment. (Grep for the
> > second occurrence of lookup_segment_lookup_comp_fn below.)
> > Shouldn't this be 3, since it found all three dirs at once in the
> > cache? Then it goes to get the attrs for /1, which weren't present,
>
> This sounds correct in light of the incorrect debug message. It really did
> only find 1 segment since that is all it attempted to find, and will
> immediately try to lookup the attributes for that segment. Ncache will
> always return a NULL value for the attributes array and a count of 0.
Okay, understood.

All of this is on the second trip, where we have the ncache problems.
Coming out of lookup_context_check_completion,
cur_ctx->current_segment goes up by 1, good, and cur_seg->seg_name
is "2" with seg_remaining "3". And seg_starting_refn is the handle
of "1", so all seems good.

Then we loop back to lookup_segment_lookup_setup_msgpair to look up
the next segment, and this code runs:

    seg_to_lookup = (cur_seg->seg_remaining ? cur_seg->seg_remaining :
                     cur_seg->seg_name);

But seg_remaining is just "3", not "2/3". So we go looking for "3"
as a subdir of "1", which isn't going to work.
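
To spell out what that expression picks, here is a minimal standalone
sketch. The struct and function names (seg_sketch, pick_seg_to_lookup)
are made up for illustration and model only the two fields mentioned
above; this is not the real sys-lookup.sm code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-segment state. Only the two
 * fields discussed in this thread are modeled. */
struct seg_sketch {
    const char *seg_name;      /* this path component, e.g. "2" */
    const char *seg_remaining; /* set by initialize_context, or NULL */
};

/* Same selection as the expression quoted above: prefer
 * seg_remaining when it is set, else fall back to seg_name. */
static const char *pick_seg_to_lookup(const struct seg_sketch *seg)
{
    assert(seg != NULL);
    return seg->seg_remaining ? seg->seg_remaining : seg->seg_name;
}
```

With seg_name "2" and seg_remaining "3", this returns "3", which is
the string that then gets looked up under "1". seg_name is only used
when seg_remaining is NULL.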

Maybe whatever built the cur_seg in the first place was broken? The
function that initializes it is initialize_context, and it logs the
following on both the first and second trips through remove:

[D 13:30:21.539414] initialize_context called
[D 13:30:21.539420] original pathname is: 1/2/3
[D 13:30:21.539426] cur_seg_name[0]: 1
[D 13:30:21.539431] pathname is: 1
[D 13:30:21.539437] *seg_remaining is: 1/2/3
[D 13:30:21.539442] cur_seg_name[1]: 2
[D 13:30:21.539448] pathname is: 1/2
[D 13:30:21.539453] *seg_remaining is: 3
[D 13:30:21.539459] cur_seg_name[2]: 3
[D 13:30:21.539464] pathname is: 1/2/3
(and *seg_remaining is NULL for this last one)
Aha. There is some suspicious code, added by Murali back in Jun
2007 to fix a different lookup bug.
The thread starts here:
http://www.beowulf-underground.org/pipermail/pvfs2-users/2007-June/001968.html
Here is the diff (excerpt) that added the code from 1.68 to 1.69:
Index: src/client/sysint/sys-lookup.sm
--- src/client/sysint/sys-lookup.sm 13 Apr 2007 05:14:16 -0000 1.68
+++ src/client/sysint/sys-lookup.sm 20 Jun 2007 06:08:51 -0000
@@ -378,7 +378,16 @@
         cur_seg->seg_name = strdup(cur_seg_name);
         assert(cur_seg->seg_name);
-        seg_remaining = strstr(orig_pathname, cur_seg_name);
+        slash_str = orig_pathname;
+        for (i = 0; i < cur_seg_index; i++) {
+            slash_str = strrchr(slash_str, '/');
+            if (slash_str == NULL) {
+                break;
+            }
+            slash_str++;
+        }
+        //seg_remaining = strstr(orig_pathname, cur_seg_name);
+        seg_remaining = slash_str;
         if (seg_remaining)
         {
             gossip_debug(GOSSIP_LOOKUP_DEBUG,

If I change the strrchr to strchr, my case works, but it breaks the
bugfix reported previously (though without a segv). If I revert the
change and go back to the original strstr, my case also works, but
the previous bug comes back (with segv). I don't fully understand
this yet, but we seem to be getting closer. Or maybe the problem is
in how other parts of the lookup code interpret seg_remaining.
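
For reference, here is a small standalone sketch of the loop from the
diff next to the strchr variant I tried. The function names are
hypothetical and the code is lifted out of its context, so it only
shows what each variant computes for a given path and segment index,
not how the rest of the lookup code consumes the result.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Loop as in the 1.69 diff: strrchr jumps to the LAST '/' on every
 * iteration, regardless of which segment index is wanted. */
static const char *remaining_strrchr(const char *orig_pathname,
                                     int cur_seg_index)
{
    const char *slash_str = orig_pathname;
    int i;

    assert(orig_pathname != NULL);
    for (i = 0; i < cur_seg_index; i++) {
        slash_str = strrchr(slash_str, '/');
        if (slash_str == NULL)
            break;
        slash_str++;
    }
    return slash_str;
}

/* Same loop with strchr: each iteration skips exactly one leading
 * segment by moving past the FIRST '/'. */
static const char *remaining_strchr(const char *orig_pathname,
                                    int cur_seg_index)
{
    const char *slash_str = orig_pathname;
    int i;

    assert(orig_pathname != NULL);
    for (i = 0; i < cur_seg_index; i++) {
        slash_str = strchr(slash_str, '/');
        if (slash_str == NULL)
            break;
        slash_str++;
    }
    return slash_str;
}
```

For "1/2/3" the strrchr version gives "1/2/3", "3", and NULL for
indexes 0, 1, and 2, matching the trace above, while the strchr
version gives "1/2/3", "2/3", and "3". This says nothing about the
earlier bug that the strrchr change was meant to fix.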
-- Pete