Hello old friends!  I return with a gift, like an almost-forgotten
uncle visiting from a faraway land.

I have an interesting issue, on 1.6.6:

# cat /proc/fs/lustre/version
lustre: 1.6.6
kernel: patchless
build:  
1.6.6-19700101080000-PRISTINE-.usr.src.kernels.2.6.18-128.1.16.el5-x86_64.-2.6.18-128.1.16.el5


Consider this setup:

- one task creates a "foo.working" directory, does all its work
inside, then renames it to "foo.done"

- another task polls, waiting for "foo.working" to disappear.  all of
this occurs on one node.

- the problem: the rename occurred, but "foo.working" remains as a valid dentry!


Witness:

node1$ ls
tape18            tapeLabel_19.txt  tid64.done     tid67.working  tid70.working
tape19.working    tid62.done        tid65.done     tid68.working
tapeLabel_18.txt  tid63.done        tid66.working  tid69

-- Note the absence of a "tid65.working" directory

node1$ stat tid65.working
  File: `tid65.working'
  Size: 12288           Blocks: 24         IO Block: 4096   directory
Device: 6d48dd40h/1833491776d   Inode: 76317251    Links: 3
Access: (2775/drwxrwsr-x)  Uid: ( 3005/ stuartm)   Gid: ( 2000/    prod)
Access: 2009-11-18 17:14:26.000000000 +0800
Modify: 2009-11-18 16:08:01.000000000 +0800
Change: 2009-11-18 16:08:01.000000000 +0800

This is unique to node1.  On node2:

node2$ stat tid65.working
stat: cannot stat `tid65.working': No such file or directory


Attached is a lnet.debug=-1 log of the stat on node1, in which we can
see it revalidating the dentry for a directory that no longer exists.

A snapshot of that lock in the DLM cache reveals no obvious abnormal pathology:

00010000:00010000:7:
1258537200.873945:0:30794:0:(ldlm_resource.c:1116:ldlm_resource_dump())
--- Resource: ffff810133749500 (76317251/3438387721/0/0) (rc: 3)
00010000:00010000:7:1258537200.873947:0:30794:0:(ldlm_resource.c:1120:ldlm_resource_dump())
Granted locks:
00010000:00010000:7:1258537200.873948:0:30794:0:(ldlm_lock.c:1729:ldlm_lock_dump())
 -- Lock dump: ffff8101ac7d2c00/0x3b4ffa12a31eb406 (rc: 1) (pos: 1)
(pid: 28837)
00010000:00010000:7:1258537200.873950:0:30794:0:(ldlm_lock.c:1742:ldlm_lock_dump())
  Node: NID 172.16.0....@tcp (rhandle: 0x3ad2b5ae2b9e570a)
00010000:00010000:7:1258537200.873951:0:30794:0:(ldlm_lock.c:1746:ldlm_lock_dump())
  Resource: ffff810133749500 (76317251/3438387721)
00010000:00010000:7:1258537200.873953:0:30794:0:(ldlm_lock.c:1751:ldlm_lock_dump())
  Req mode: CR, grant mode: CR, rc: 1, read: 0, write: 0 flags: 0x0
00010000:00010000:7:1258537200.873954:0:30794:0:(ldlm_lock.c:1765:ldlm_lock_dump())
  Bits: 0x3


We stopped the job when it became clear that it would never finish.
Eventually that lock did disappear -- likely just due to normal DLM
turnover -- and the problem resolved itself.  If the task had been
allowed to continue, however, constantly stat()ing that dead
directory, the lock would have remained at the bottom of the LRU --
and thus it would be an effectively infinite loop!


In the interest of full disclosure, there were some of these in the
dmesg.  But they were from several days prior to the creation even of
the parent directory, so I think it's very unlikely that they are
related:

BUG: warning at fs/inotify.c:202/set_dentry_child_flags() (Tainted: G     )

Call Trace:
 [<ffffffff800f2777>] set_dentry_child_flags+0xef/0x14d
 [<ffffffff800f280d>] remove_watch_no_event+0x38/0x47
 [<ffffffff800f2834>] inotify_remove_watch_locked+0x18/0x3b
 [<ffffffff800f296f>] inotify_rm_wd+0x8d/0xb6
 [<ffffffff800f2ee5>] sys_inotify_rm_watch+0x46/0x63
 [<ffffffff8005e28d>] tracesys+0xd5/0xe0


Has this cheerful missive induced an "A ha!" moment in anyone, that
would explain this?  Have I overlooked something important?

Much like those given by your own almost-forgotten uncles, this gift
was essentially a pair of itchy wool socks.  I hope you will forgive
me.

Cheers,

-p

(nobody handles the moderator queue any more, eh?)
00000080:00200000:2:1258536010.927949:0:31018:0:(file.c:3101:ll_inode_permission()) VFS Op:inode=76317214/3438387692(ffff81012b2142f8), mask 1
00000080:00000001:2:1258536010.927953:0:31018:0:(dcache.c:693:ll_revalidate_nd()) Process entered
00000080:00000010:2:1258536010.927955:0:31018:0:(namei.c:663:ll_convert_intent()) kmalloced 'it': 48 at ffff8100cd63fbc0.
00000080:00000001:2:1258536010.927957:0:31018:0:(dcache.c:374:ll_revalidate_it()) Process entered
00000080:00200000:2:1258536010.927958:0:31018:0:(dcache.c:376:ll_revalidate_it()) VFS Op:name=tid65.working,intent=getattr
00000002:00000001:2:1258536010.927962:0:31018:0:(mdc_locks.c:773:mdc_intent_lock()) Process entered
00000002:00010000:2:1258536010.927962:0:31018:0:(mdc_locks.c:779:mdc_intent_lock()) name: tid65.working in inode 76317214, intent: getattr flags 00
00010000:00000001:2:1258536010.927964:0:31018:0:(ldlm_lock.c:1019:ldlm_lock_match()) Process entered
00010000:00000001:2:1258536010.927965:0:31018:0:(ldlm_resource.c:868:ldlm_resource_get()) Process entered
00010000:00000040:2:1258536010.927967:0:31018:0:(ldlm_resource.c:901:ldlm_resource_getref()) getref res: ffff810133749500 count: 3
00010000:00000001:2:1258536010.927968:0:31018:0:(ldlm_resource.c:884:ldlm_resource_get()) Process leaving (rc=18446604440891069696 : -139632818481920 : ffff810133749500)
00010000:00000001:2:1258536010.927970:0:31018:0:(ldlm_lock.c:198:ldlm_lock_remove_from_lru()) Process entered
00010000:00000001:2:1258536010.927971:0:31018:0:(ldlm_lock.c:202:ldlm_lock_remove_from_lru()) Process leaving
00010000:00010000:2:1258536010.927972:0:31018:0:(ldlm_lock.c:580:ldlm_lock_addref_internal_nolock()) ### ldlm_lock_addref(CR) ns: p1-MDT0000-mdc-ffff81021dae2000 lock: ffff8101ac7d2c00/0x3b4ffa12a31eb406 lrc: 2/1,0 mode: CR/CR res: 76317251/3438387721 bits 0x3 rrc: 3 type: IBT flags: 0 remote: 0x3ad2b5ae2b9e570a expref: -99 pid 28837
00010000:00000001:2:1258536010.927976:0:31018:0:(ldlm_lock.c:1041:ldlm_lock_match()) Process leaving via out (rc=1 : 1 : 1)
00010000:00000001:2:1258536010.927977:0:31018:0:(ldlm_resource.c:947:ldlm_resource_putref()) Process entered
00010000:00000040:2:1258536010.927978:0:31018:0:(ldlm_resource.c:950:ldlm_resource_putref()) putref res: ffff810133749500 count: 2
00010000:00000001:2:1258536010.927979:0:31018:0:(ldlm_resource.c:965:ldlm_resource_putref()) Process leaving (rc=0 : 0 : 0)
00010000:00010000:2:1258536010.927980:0:31018:0:(ldlm_lock.c:1088:ldlm_lock_match()) ### matched (0 0) ns: p1-MDT0000-mdc-ffff81021dae2000 lock: ffff8101ac7d2c00/0x3b4ffa12a31eb406 lrc: 2/1,0 mode: CR/CR res: 76317251/3438387721 bits 0x3 rrc: 2 type: IBT flags: 0 remote: 0x3ad2b5ae2b9e570a expref: -99 pid 28837
00000002:00000001:2:1258536010.927985:0:31018:0:(mdc_locks.c:788:mdc_intent_lock()) Process leaving (rc=1 : 1 : 1)
00000080:00000001:2:1258536010.927987:0:31018:0:(dcache.c:475:ll_revalidate_it()) Process leaving via out (rc=1 : 1 : 1)
00000080:00002000:2:1258536010.927988:0:31018:0:(dcache.c:530:ll_revalidate_it()) revalidated dentry tid65.working (ffff81011e8c24b0) parent ffff8102043bb738 inode ffff8101e41f38f8 refc 10
00000080:00010000:2:1258536010.927990:0:31018:0:(dcache.c:330:ll_lookup_finish_locks()) setting l_data to inode ffff8101e41f38f8 (76317251/3438387721)
00000002:00000001:2:1258536010.927992:0:31018:0:(mdc_locks.c:115:mdc_set_lock_data()) Process entered
00010000:00000001:2:1258536010.927993:0:31018:0:(ldlm_lock.c:427:__ldlm_handle2lock()) Process entered
00000020:00000001:2:1258536010.927994:0:31018:0:(lustre_handles.c:166:class_handle2object()) Process entered
00000020:00000001:2:1258536010.927995:0:31018:0:(lustre_handles.c:191:class_handle2object()) Process leaving (rc=18446604442921675776 : -139630787875840 : ffff8101ac7d2c00)
00010000:00000001:2:1258536010.927998:0:31018:0:(ldlm_lock.c:461:__ldlm_handle2lock()) Process leaving
00010000:00000001:2:1258536010.927999:0:31018:0:(ldlm_lock.c:147:ldlm_lock_put()) Process entered
00010000:00000001:2:1258536010.927999:0:31018:0:(ldlm_lock.c:177:ldlm_lock_put()) Process leaving
00000002:00000001:2:1258536010.928000:0:31018:0:(mdc_locks.c:142:mdc_set_lock_data()) Process leaving
00000080:00010000:2:1258536010.928001:0:31018:0:(dcache.c:164:ll_intent_drop_lock()) releasing lock with cookie 0x3b4ffa12a31eb406 from it ffff8100cd63fbc0
00010000:00000001:2:1258536010.928002:0:31018:0:(ldlm_lock.c:427:__ldlm_handle2lock()) Process entered
00000020:00000001:2:1258536010.928003:0:31018:0:(lustre_handles.c:166:class_handle2object()) Process entered
00000020:00000001:2:1258536010.928004:0:31018:0:(lustre_handles.c:191:class_handle2object()) Process leaving (rc=18446604442921675776 : -139630787875840 : ffff8101ac7d2c00)
00010000:00000001:2:1258536010.928006:0:31018:0:(ldlm_lock.c:461:__ldlm_handle2lock()) Process leaving
00010000:00000001:2:1258536010.928006:0:31018:0:(ldlm_lock.c:613:ldlm_lock_decref_internal()) Process entered
00010000:00010000:2:1258536010.928008:0:31018:0:(ldlm_lock.c:597:ldlm_lock_decref_internal_nolock()) ### ldlm_lock_decref(CR) ns: p1-MDT0000-mdc-ffff81021dae2000 lock: ffff8101ac7d2c00/0x3b4ffa12a31eb406 lrc: 3/1,0 mode: CR/CR res: 76317251/3438387721 bits 0x3 rrc: 2 type: IBT flags: 0 remote: 0x3ad2b5ae2b9e570a expref: -99 pid 28837
00010000:00000001:2:1258536010.928011:0:31018:0:(ldlm_lock.c:147:ldlm_lock_put()) Process entered
00010000:00000001:2:1258536010.928012:0:31018:0:(ldlm_lock.c:177:ldlm_lock_put()) Process leaving
00010000:00000001:2:1258536010.928013:0:31018:0:(ldlm_lock.c:220:ldlm_lock_add_to_lru()) Process entered
00010000:00000001:2:1258536010.928014:0:31018:0:(ldlm_lock.c:224:ldlm_lock_add_to_lru()) Process leaving
00010000:00000001:2:1258536010.928015:0:31018:0:(ldlm_lock.c:663:ldlm_lock_decref_internal()) Process leaving
00010000:00000001:2:1258536010.928015:0:31018:0:(ldlm_lock.c:147:ldlm_lock_put()) Process entered
00010000:00000001:2:1258536010.928016:0:31018:0:(ldlm_lock.c:177:ldlm_lock_put()) Process leaving
00000080:00000001:2:1258536010.928017:0:31018:0:(dcache.c:538:ll_revalidate_it()) Process leaving (rc=1 : 1 : 1)
00000080:00000001:2:1258536010.928018:0:31018:0:(dcache.c:175:ll_intent_release()) Process entered
00000080:00000001:2:1258536010.928019:0:31018:0:(dcache.c:193:ll_intent_release()) Process leaving
00000080:00000010:2:1258536010.928020:0:31018:0:(dcache.c:750:ll_revalidate_nd()) kfreed 'it': 48 at ffff8100cd63fbc0.
00000080:00000001:2:1258536010.928021:0:31018:0:(dcache.c:756:ll_revalidate_nd()) Process leaving (rc=1 : 1 : 1)
00000080:00000001:2:1258536010.928023:0:31018:0:(file.c:2939:ll_inode_revalidate_it()) Process entered
00000080:00200000:2:1258536010.928024:0:31018:0:(file.c:2946:ll_inode_revalidate_it()) VFS Op:inode=76317251/3438387721(ffff8101e41f38f8),name=tid65.working
00000002:00000001:2:1258536010.928025:0:31018:0:(mdc_locks.c:773:mdc_intent_lock()) Process entered
00000002:00010000:2:1258536010.928026:0:31018:0:(mdc_locks.c:779:mdc_intent_lock()) name:  in inode 76317214, intent: getattr flags 00
00010000:00000001:2:1258536010.928030:0:31018:0:(ldlm_lock.c:1019:ldlm_lock_match()) Process entered
00010000:00000001:2:1258536010.928030:0:31018:0:(ldlm_resource.c:868:ldlm_resource_get()) Process entered
00010000:00000040:2:1258536010.928031:0:31018:0:(ldlm_resource.c:901:ldlm_resource_getref()) getref res: ffff810133749500 count: 3
00010000:00000001:2:1258536010.928032:0:31018:0:(ldlm_resource.c:884:ldlm_resource_get()) Process leaving (rc=18446604440891069696 : -139632818481920 : ffff810133749500)
00010000:00000001:2:1258536010.928034:0:31018:0:(ldlm_lock.c:198:ldlm_lock_remove_from_lru()) Process entered
00010000:00000001:2:1258536010.928035:0:31018:0:(ldlm_lock.c:202:ldlm_lock_remove_from_lru()) Process leaving
00010000:00010000:2:1258536010.928036:0:31018:0:(ldlm_lock.c:580:ldlm_lock_addref_internal_nolock()) ### ldlm_lock_addref(CR) ns: p1-MDT0000-mdc-ffff81021dae2000 lock: ffff8101ac7d2c00/0x3b4ffa12a31eb406 lrc: 2/1,0 mode: CR/CR res: 76317251/3438387721 bits 0x3 rrc: 3 type: IBT flags: 0 remote: 0x3ad2b5ae2b9e570a expref: -99 pid 28837
00010000:00000001:2:1258536010.928039:0:31018:0:(ldlm_lock.c:1041:ldlm_lock_match()) Process leaving via out (rc=1 : 1 : 1)
00010000:00000001:2:1258536010.928040:0:31018:0:(ldlm_resource.c:947:ldlm_resource_putref()) Process entered
00010000:00000040:2:1258536010.928041:0:31018:0:(ldlm_resource.c:950:ldlm_resource_putref()) putref res: ffff810133749500 count: 2
00010000:00000001:2:1258536010.928042:0:31018:0:(ldlm_resource.c:965:ldlm_resource_putref()) Process leaving (rc=0 : 0 : 0)
00010000:00010000:2:1258536010.928043:0:31018:0:(ldlm_lock.c:1088:ldlm_lock_match()) ### matched (0 0) ns: p1-MDT0000-mdc-ffff81021dae2000 lock: ffff8101ac7d2c00/0x3b4ffa12a31eb406 lrc: 2/1,0 mode: CR/CR res: 76317251/3438387721 bits 0x3 rrc: 2 type: IBT flags: 0 remote: 0x3ad2b5ae2b9e570a expref: -99 pid 28837
00000002:00000001:2:1258536010.928051:0:31018:0:(mdc_locks.c:788:mdc_intent_lock()) Process leaving (rc=1 : 1 : 1)
00000080:00000001:2:1258536010.928052:0:31018:0:(dcache.c:308:revalidate_it_finish()) Process entered
00000080:00000001:2:1258536010.928053:0:31018:0:(dcache.c:311:revalidate_it_finish()) Process leaving (rc=0 : 0 : 0)
00000080:00010000:2:1258536010.928054:0:31018:0:(dcache.c:330:ll_lookup_finish_locks()) setting l_data to inode ffff8101e41f38f8 (76317251/3438387721)
00000002:00000001:2:1258536010.928055:0:31018:0:(mdc_locks.c:115:mdc_set_lock_data()) Process entered
00010000:00000001:2:1258536010.928056:0:31018:0:(ldlm_lock.c:427:__ldlm_handle2lock()) Process entered
00000020:00000001:2:1258536010.928057:0:31018:0:(lustre_handles.c:166:class_handle2object()) Process entered
00000020:00000001:2:1258536010.928058:0:31018:0:(lustre_handles.c:191:class_handle2object()) Process leaving (rc=18446604442921675776 : -139630787875840 : ffff8101ac7d2c00)
00010000:00000001:2:1258536010.928059:0:31018:0:(ldlm_lock.c:461:__ldlm_handle2lock()) Process leaving
00010000:00000001:2:1258536010.928060:0:31018:0:(ldlm_lock.c:147:ldlm_lock_put()) Process entered
00010000:00000001:2:1258536010.928061:0:31018:0:(ldlm_lock.c:177:ldlm_lock_put()) Process leaving
00000002:00000001:2:1258536010.928061:0:31018:0:(mdc_locks.c:142:mdc_set_lock_data()) Process leaving
00000080:00010000:2:1258536010.928062:0:31018:0:(dcache.c:164:ll_intent_drop_lock()) releasing lock with cookie 0x3b4ffa12a31eb406 from it ffff81014dca3d68
00010000:00000001:2:1258536010.928063:0:31018:0:(ldlm_lock.c:427:__ldlm_handle2lock()) Process entered
00000020:00000001:2:1258536010.928064:0:31018:0:(lustre_handles.c:166:class_handle2object()) Process entered
00000020:00000001:2:1258536010.928065:0:31018:0:(lustre_handles.c:191:class_handle2object()) Process leaving (rc=18446604442921675776 : -139630787875840 : ffff8101ac7d2c00)
00010000:00000001:2:1258536010.928066:0:31018:0:(ldlm_lock.c:461:__ldlm_handle2lock()) Process leaving
00010000:00000001:2:1258536010.928067:0:31018:0:(ldlm_lock.c:613:ldlm_lock_decref_internal()) Process entered
00010000:00010000:2:1258536010.928068:0:31018:0:(ldlm_lock.c:597:ldlm_lock_decref_internal_nolock()) ### ldlm_lock_decref(CR) ns: p1-MDT0000-mdc-ffff81021dae2000 lock: ffff8101ac7d2c00/0x3b4ffa12a31eb406 lrc: 3/1,0 mode: CR/CR res: 76317251/3438387721 bits 0x3 rrc: 2 type: IBT flags: 0 remote: 0x3ad2b5ae2b9e570a expref: -99 pid 28837
00010000:00000001:2:1258536010.928071:0:31018:0:(ldlm_lock.c:147:ldlm_lock_put()) Process entered
00010000:00000001:2:1258536010.928072:0:31018:0:(ldlm_lock.c:177:ldlm_lock_put()) Process leaving
00010000:00000001:2:1258536010.928073:0:31018:0:(ldlm_lock.c:220:ldlm_lock_add_to_lru()) Process entered
00010000:00000001:2:1258536010.928074:0:31018:0:(ldlm_lock.c:224:ldlm_lock_add_to_lru()) Process leaving
00010000:00000001:2:1258536010.928074:0:31018:0:(ldlm_lock.c:663:ldlm_lock_decref_internal()) Process leaving
00010000:00000001:2:1258536010.928075:0:31018:0:(ldlm_lock.c:147:ldlm_lock_put()) Process entered
00010000:00000001:2:1258536010.928076:0:31018:0:(ldlm_lock.c:177:ldlm_lock_put()) Process leaving
00000080:00000001:2:1258536010.928076:0:31018:0:(file.c:3017:ll_inode_revalidate_it()) Process leaving via out (rc=0 : 0 : 0)
00000100:00000001:2:1258536010.928077:0:31018:0:(client.c:1515:__ptlrpc_req_finished()) Process entered
00000100:00000001:2:1258536010.928078:0:31018:0:(client.c:1517:__ptlrpc_req_finished()) Process leaving (rc=1 : 1 : 1)
00000080:00000001:2:1258536010.928079:0:31018:0:(file.c:3025:ll_inode_revalidate_it()) Process leaving (rc=0 : 0 : 0)
_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss

Reply via email to