Hello old friends! I return with a gift, like an almost-forgotten uncle visiting from a faraway land.
I have an interesting issue, on 1.6.6: # cat /proc/fs/lustre/version lustre: 1.6.6 kernel: patchless build: 1.6.6-19700101080000-PRISTINE-.usr.src.kernels.2.6.18-128.1.16.el5-x86_64.-2.6.18-128.1.16.el5 Consider this setup: - one task creates a "foo.working" directory, does all its work inside, then renames it to "foo.done" - another task polls, waiting for "foo.working" to disappear. all of this occurs on one node. - the problem: the rename occurred, but "foo.working" remains as a valid dentry! Witness: node1$ ls tape18 tapeLabel_19.txt tid64.done tid67.working tid70.working tape19.working tid62.done tid65.done tid68.working tapeLabel_18.txt tid63.done tid66.working tid69 -- Note the absence of a "tid65.working" directory node1$ stat tid65.working File: `tid65.working' Size: 12288 Blocks: 24 IO Block: 4096 directory Device: 6d48dd40h/1833491776d Inode: 76317251 Links: 3 Access: (2775/drwxrwsr-x) Uid: ( 3005/ stuartm) Gid: ( 2000/ prod) Access: 2009-11-18 17:14:26.000000000 +0800 Modify: 2009-11-18 16:08:01.000000000 +0800 Change: 2009-11-18 16:08:01.000000000 +0800 This is unique to node1. On node2: node2$ stat tid65.working stat: cannot stat `tid65.working': No such file or directory Attached is a lnet.debug=-1 log of the stat on node1, in which we can see it revalidating the dentry for a directory that no longer exists. A snapshot of that lock in the DLM cache reveals no obvious abnormal pathology: 00010000:00010000:7: 1258537200.873945:0:30794:0:(ldlm_resource.c:1116:ldlm_resource_dump()) --- Resource: ffff810133749500 (76317251/3438387721/0/0) (rc: 3) 00010000:00010000:7:1258537200.873947:0:30794:0:(ldlm_resource.c:1120:ldlm_resource_dump()) Granted locks: 00010000:00010000:7:1258537200.873948:0:30794:0:(ldlm_lock.c:1729:ldlm_lock_dump()) -- Lock dump: ffff8101ac7d2c00/0x3b4ffa12a31eb406 (rc: 1) (pos: 1) (pid: 28837) 00010000:00010000:7:1258537200.873950:0:30794:0:(ldlm_lock.c:1742:ldlm_lock_dump()) Node: NID 172.16.0....@tcp (rhandle: 0x3ad2b5ae2b9e570a) 00010000:00010000:7:1258537200.873951:0:30794:0:(ldlm_lock.c:1746:ldlm_lock_dump()) Resource: ffff810133749500 (76317251/3438387721) 00010000:00010000:7:1258537200.873953:0:30794:0:(ldlm_lock.c:1751:ldlm_lock_dump()) Req mode: CR, grant mode: CR, rc: 1, read: 0, write: 0 flags: 0x0 00010000:00010000:7:1258537200.873954:0:30794:0:(ldlm_lock.c:1765:ldlm_lock_dump()) Bits: 0x3 We stopped the job when it became clear that it would never finish. Eventually that lock did disappear -- likely just due to normal DLM turnover -- and the problem resolved itself. If the task had been allowed to continue, however, constantly stat()ing that dead directory, the lock would have remained at the bottom of the LRU -- and thus it would be an effectively infinite loop! In the interest of full disclosure, there were some of these in the dmesg. But they were from several days prior to the creation even of the parent directory, so I think it's very unlikely that they are related: BUG: warning at fs/inotify.c:202/set_dentry_child_flags() (Tainted: G ) Call Trace: [<ffffffff800f2777>] set_dentry_child_flags+0xef/0x14d [<ffffffff800f280d>] remove_watch_no_event+0x38/0x47 [<ffffffff800f2834>] inotify_remove_watch_locked+0x18/0x3b [<ffffffff800f296f>] inotify_rm_wd+0x8d/0xb6 [<ffffffff800f2ee5>] sys_inotify_rm_watch+0x46/0x63 [<ffffffff8005e28d>] tracesys+0xd5/0xe0 Has this cheerful missive induced an "A ha!" moment in anyone, that would explain this? Have I overlooked something important? Much like those given by your own almost-forgotten uncles, this gift was essentially a pair of itchy wool socks. I hope you will forgive me. Cheers, -p (nobody handles the moderator queue any more, eh?)
00000080:00200000:2:1258536010.927949:0:31018:0:(file.c:3101:ll_inode_permission()) VFS Op:inode=76317214/3438387692(ffff81012b2142f8), mask 1 00000080:00000001:2:1258536010.927953:0:31018:0:(dcache.c:693:ll_revalidate_nd()) Process entered 00000080:00000010:2:1258536010.927955:0:31018:0:(namei.c:663:ll_convert_intent()) kmalloced 'it': 48 at ffff8100cd63fbc0. 00000080:00000001:2:1258536010.927957:0:31018:0:(dcache.c:374:ll_revalidate_it()) Process entered 00000080:00200000:2:1258536010.927958:0:31018:0:(dcache.c:376:ll_revalidate_it()) VFS Op:name=tid65.working,intent=getattr 00000002:00000001:2:1258536010.927962:0:31018:0:(mdc_locks.c:773:mdc_intent_lock()) Process entered 00000002:00010000:2:1258536010.927962:0:31018:0:(mdc_locks.c:779:mdc_intent_lock()) name: tid65.working in inode 76317214, intent: getattr flags 00 00010000:00000001:2:1258536010.927964:0:31018:0:(ldlm_lock.c:1019:ldlm_lock_match()) Process entered 00010000:00000001:2:1258536010.927965:0:31018:0:(ldlm_resource.c:868:ldlm_resource_get()) Process entered 00010000:00000040:2:1258536010.927967:0:31018:0:(ldlm_resource.c:901:ldlm_resource_getref()) getref res: ffff810133749500 count: 3 00010000:00000001:2:1258536010.927968:0:31018:0:(ldlm_resource.c:884:ldlm_resource_get()) Process leaving (rc=18446604440891069696 : -139632818481920 : ffff810133749500) 00010000:00000001:2:1258536010.927970:0:31018:0:(ldlm_lock.c:198:ldlm_lock_remove_from_lru()) Process entered 00010000:00000001:2:1258536010.927971:0:31018:0:(ldlm_lock.c:202:ldlm_lock_remove_from_lru()) Process leaving 00010000:00010000:2:1258536010.927972:0:31018:0:(ldlm_lock.c:580:ldlm_lock_addref_internal_nolock()) ### ldlm_lock_addref(CR) ns: p1-MDT0000-mdc-ffff81021dae2000 lock: ffff8101ac7d2c00/0x3b4ffa12a31eb406 lrc: 2/1,0 mode: CR/CR res: 76317251/3438387721 bits 0x3 rrc: 3 type: IBT flags: 0 remote: 0x3ad2b5ae2b9e570a expref: -99 pid 28837 00010000:00000001:2:1258536010.927976:0:31018:0:(ldlm_lock.c:1041:ldlm_lock_match()) Process leaving via out (rc=1 : 1 : 1) 00010000:00000001:2:1258536010.927977:0:31018:0:(ldlm_resource.c:947:ldlm_resource_putref()) Process entered 00010000:00000040:2:1258536010.927978:0:31018:0:(ldlm_resource.c:950:ldlm_resource_putref()) putref res: ffff810133749500 count: 2 00010000:00000001:2:1258536010.927979:0:31018:0:(ldlm_resource.c:965:ldlm_resource_putref()) Process leaving (rc=0 : 0 : 0) 00010000:00010000:2:1258536010.927980:0:31018:0:(ldlm_lock.c:1088:ldlm_lock_match()) ### matched (0 0) ns: p1-MDT0000-mdc-ffff81021dae2000 lock: ffff8101ac7d2c00/0x3b4ffa12a31eb406 lrc: 2/1,0 mode: CR/CR res: 76317251/3438387721 bits 0x3 rrc: 2 type: IBT flags: 0 remote: 0x3ad2b5ae2b9e570a expref: -99 pid 28837 00000002:00000001:2:1258536010.927985:0:31018:0:(mdc_locks.c:788:mdc_intent_lock()) Process leaving (rc=1 : 1 : 1) 00000080:00000001:2:1258536010.927987:0:31018:0:(dcache.c:475:ll_revalidate_it()) Process leaving via out (rc=1 : 1 : 1) 00000080:00002000:2:1258536010.927988:0:31018:0:(dcache.c:530:ll_revalidate_it()) revalidated dentry tid65.working (ffff81011e8c24b0) parent ffff8102043bb738 inode ffff8101e41f38f8 refc 10 00000080:00010000:2:1258536010.927990:0:31018:0:(dcache.c:330:ll_lookup_finish_locks()) setting l_data to inode ffff8101e41f38f8 (76317251/3438387721) 00000002:00000001:2:1258536010.927992:0:31018:0:(mdc_locks.c:115:mdc_set_lock_data()) Process entered 00010000:00000001:2:1258536010.927993:0:31018:0:(ldlm_lock.c:427:__ldlm_handle2lock()) Process entered 00000020:00000001:2:1258536010.927994:0:31018:0:(lustre_handles.c:166:class_handle2object()) Process entered 00000020:00000001:2:1258536010.927995:0:31018:0:(lustre_handles.c:191:class_handle2object()) Process leaving (rc=18446604442921675776 : -139630787875840 : ffff8101ac7d2c00) 00010000:00000001:2:1258536010.927998:0:31018:0:(ldlm_lock.c:461:__ldlm_handle2lock()) Process leaving 00010000:00000001:2:1258536010.927999:0:31018:0:(ldlm_lock.c:147:ldlm_lock_put()) Process entered 00010000:00000001:2:1258536010.927999:0:31018:0:(ldlm_lock.c:177:ldlm_lock_put()) Process leaving 00000002:00000001:2:1258536010.928000:0:31018:0:(mdc_locks.c:142:mdc_set_lock_data()) Process leaving 00000080:00010000:2:1258536010.928001:0:31018:0:(dcache.c:164:ll_intent_drop_lock()) releasing lock with cookie 0x3b4ffa12a31eb406 from it ffff8100cd63fbc0 00010000:00000001:2:1258536010.928002:0:31018:0:(ldlm_lock.c:427:__ldlm_handle2lock()) Process entered 00000020:00000001:2:1258536010.928003:0:31018:0:(lustre_handles.c:166:class_handle2object()) Process entered 00000020:00000001:2:1258536010.928004:0:31018:0:(lustre_handles.c:191:class_handle2object()) Process leaving (rc=18446604442921675776 : -139630787875840 : ffff8101ac7d2c00) 00010000:00000001:2:1258536010.928006:0:31018:0:(ldlm_lock.c:461:__ldlm_handle2lock()) Process leaving 00010000:00000001:2:1258536010.928006:0:31018:0:(ldlm_lock.c:613:ldlm_lock_decref_internal()) Process entered 00010000:00010000:2:1258536010.928008:0:31018:0:(ldlm_lock.c:597:ldlm_lock_decref_internal_nolock()) ### ldlm_lock_decref(CR) ns: p1-MDT0000-mdc-ffff81021dae2000 lock: ffff8101ac7d2c00/0x3b4ffa12a31eb406 lrc: 3/1,0 mode: CR/CR res: 76317251/3438387721 bits 0x3 rrc: 2 type: IBT flags: 0 remote: 0x3ad2b5ae2b9e570a expref: -99 pid 28837 00010000:00000001:2:1258536010.928011:0:31018:0:(ldlm_lock.c:147:ldlm_lock_put()) Process entered 00010000:00000001:2:1258536010.928012:0:31018:0:(ldlm_lock.c:177:ldlm_lock_put()) Process leaving 00010000:00000001:2:1258536010.928013:0:31018:0:(ldlm_lock.c:220:ldlm_lock_add_to_lru()) Process entered 00010000:00000001:2:1258536010.928014:0:31018:0:(ldlm_lock.c:224:ldlm_lock_add_to_lru()) Process leaving 00010000:00000001:2:1258536010.928015:0:31018:0:(ldlm_lock.c:663:ldlm_lock_decref_internal()) Process leaving 00010000:00000001:2:1258536010.928015:0:31018:0:(ldlm_lock.c:147:ldlm_lock_put()) Process entered 00010000:00000001:2:1258536010.928016:0:31018:0:(ldlm_lock.c:177:ldlm_lock_put()) Process leaving 00000080:00000001:2:1258536010.928017:0:31018:0:(dcache.c:538:ll_revalidate_it()) Process leaving (rc=1 : 1 : 1) 00000080:00000001:2:1258536010.928018:0:31018:0:(dcache.c:175:ll_intent_release()) Process entered 00000080:00000001:2:1258536010.928019:0:31018:0:(dcache.c:193:ll_intent_release()) Process leaving 00000080:00000010:2:1258536010.928020:0:31018:0:(dcache.c:750:ll_revalidate_nd()) kfreed 'it': 48 at ffff8100cd63fbc0. 00000080:00000001:2:1258536010.928021:0:31018:0:(dcache.c:756:ll_revalidate_nd()) Process leaving (rc=1 : 1 : 1) 00000080:00000001:2:1258536010.928023:0:31018:0:(file.c:2939:ll_inode_revalidate_it()) Process entered 00000080:00200000:2:1258536010.928024:0:31018:0:(file.c:2946:ll_inode_revalidate_it()) VFS Op:inode=76317251/3438387721(ffff8101e41f38f8),name=tid65.working 00000002:00000001:2:1258536010.928025:0:31018:0:(mdc_locks.c:773:mdc_intent_lock()) Process entered 00000002:00010000:2:1258536010.928026:0:31018:0:(mdc_locks.c:779:mdc_intent_lock()) name: in inode 76317214, intent: getattr flags 00 00010000:00000001:2:1258536010.928030:0:31018:0:(ldlm_lock.c:1019:ldlm_lock_match()) Process entered 00010000:00000001:2:1258536010.928030:0:31018:0:(ldlm_resource.c:868:ldlm_resource_get()) Process entered 00010000:00000040:2:1258536010.928031:0:31018:0:(ldlm_resource.c:901:ldlm_resource_getref()) getref res: ffff810133749500 count: 3 00010000:00000001:2:1258536010.928032:0:31018:0:(ldlm_resource.c:884:ldlm_resource_get()) Process leaving (rc=18446604440891069696 : -139632818481920 : ffff810133749500) 00010000:00000001:2:1258536010.928034:0:31018:0:(ldlm_lock.c:198:ldlm_lock_remove_from_lru()) Process entered 00010000:00000001:2:1258536010.928035:0:31018:0:(ldlm_lock.c:202:ldlm_lock_remove_from_lru()) Process leaving 00010000:00010000:2:1258536010.928036:0:31018:0:(ldlm_lock.c:580:ldlm_lock_addref_internal_nolock()) ### ldlm_lock_addref(CR) ns: p1-MDT0000-mdc-ffff81021dae2000 lock: ffff8101ac7d2c00/0x3b4ffa12a31eb406 lrc: 2/1,0 mode: CR/CR res: 76317251/3438387721 bits 0x3 rrc: 3 type: IBT flags: 0 remote: 0x3ad2b5ae2b9e570a expref: -99 pid 28837 00010000:00000001:2:1258536010.928039:0:31018:0:(ldlm_lock.c:1041:ldlm_lock_match()) Process leaving via out (rc=1 : 1 : 1) 00010000:00000001:2:1258536010.928040:0:31018:0:(ldlm_resource.c:947:ldlm_resource_putref()) Process entered 00010000:00000040:2:1258536010.928041:0:31018:0:(ldlm_resource.c:950:ldlm_resource_putref()) putref res: ffff810133749500 count: 2 00010000:00000001:2:1258536010.928042:0:31018:0:(ldlm_resource.c:965:ldlm_resource_putref()) Process leaving (rc=0 : 0 : 0) 00010000:00010000:2:1258536010.928043:0:31018:0:(ldlm_lock.c:1088:ldlm_lock_match()) ### matched (0 0) ns: p1-MDT0000-mdc-ffff81021dae2000 lock: ffff8101ac7d2c00/0x3b4ffa12a31eb406 lrc: 2/1,0 mode: CR/CR res: 76317251/3438387721 bits 0x3 rrc: 2 type: IBT flags: 0 remote: 0x3ad2b5ae2b9e570a expref: -99 pid 28837 00000002:00000001:2:1258536010.928051:0:31018:0:(mdc_locks.c:788:mdc_intent_lock()) Process leaving (rc=1 : 1 : 1) 00000080:00000001:2:1258536010.928052:0:31018:0:(dcache.c:308:revalidate_it_finish()) Process entered 00000080:00000001:2:1258536010.928053:0:31018:0:(dcache.c:311:revalidate_it_finish()) Process leaving (rc=0 : 0 : 0) 00000080:00010000:2:1258536010.928054:0:31018:0:(dcache.c:330:ll_lookup_finish_locks()) setting l_data to inode ffff8101e41f38f8 (76317251/3438387721) 00000002:00000001:2:1258536010.928055:0:31018:0:(mdc_locks.c:115:mdc_set_lock_data()) Process entered 00010000:00000001:2:1258536010.928056:0:31018:0:(ldlm_lock.c:427:__ldlm_handle2lock()) Process entered 00000020:00000001:2:1258536010.928057:0:31018:0:(lustre_handles.c:166:class_handle2object()) Process entered 00000020:00000001:2:1258536010.928058:0:31018:0:(lustre_handles.c:191:class_handle2object()) Process leaving (rc=18446604442921675776 : -139630787875840 : ffff8101ac7d2c00) 00010000:00000001:2:1258536010.928059:0:31018:0:(ldlm_lock.c:461:__ldlm_handle2lock()) Process leaving 00010000:00000001:2:1258536010.928060:0:31018:0:(ldlm_lock.c:147:ldlm_lock_put()) Process entered 00010000:00000001:2:1258536010.928061:0:31018:0:(ldlm_lock.c:177:ldlm_lock_put()) Process leaving 00000002:00000001:2:1258536010.928061:0:31018:0:(mdc_locks.c:142:mdc_set_lock_data()) Process leaving 00000080:00010000:2:1258536010.928062:0:31018:0:(dcache.c:164:ll_intent_drop_lock()) releasing lock with cookie 0x3b4ffa12a31eb406 from it ffff81014dca3d68 00010000:00000001:2:1258536010.928063:0:31018:0:(ldlm_lock.c:427:__ldlm_handle2lock()) Process entered 00000020:00000001:2:1258536010.928064:0:31018:0:(lustre_handles.c:166:class_handle2object()) Process entered 00000020:00000001:2:1258536010.928065:0:31018:0:(lustre_handles.c:191:class_handle2object()) Process leaving (rc=18446604442921675776 : -139630787875840 : ffff8101ac7d2c00) 00010000:00000001:2:1258536010.928066:0:31018:0:(ldlm_lock.c:461:__ldlm_handle2lock()) Process leaving 00010000:00000001:2:1258536010.928067:0:31018:0:(ldlm_lock.c:613:ldlm_lock_decref_internal()) Process entered 00010000:00010000:2:1258536010.928068:0:31018:0:(ldlm_lock.c:597:ldlm_lock_decref_internal_nolock()) ### ldlm_lock_decref(CR) ns: p1-MDT0000-mdc-ffff81021dae2000 lock: ffff8101ac7d2c00/0x3b4ffa12a31eb406 lrc: 3/1,0 mode: CR/CR res: 76317251/3438387721 bits 0x3 rrc: 2 type: IBT flags: 0 remote: 0x3ad2b5ae2b9e570a expref: -99 pid 28837 00010000:00000001:2:1258536010.928071:0:31018:0:(ldlm_lock.c:147:ldlm_lock_put()) Process entered 00010000:00000001:2:1258536010.928072:0:31018:0:(ldlm_lock.c:177:ldlm_lock_put()) Process leaving 00010000:00000001:2:1258536010.928073:0:31018:0:(ldlm_lock.c:220:ldlm_lock_add_to_lru()) Process entered 00010000:00000001:2:1258536010.928074:0:31018:0:(ldlm_lock.c:224:ldlm_lock_add_to_lru()) Process leaving 00010000:00000001:2:1258536010.928074:0:31018:0:(ldlm_lock.c:663:ldlm_lock_decref_internal()) Process leaving 00010000:00000001:2:1258536010.928075:0:31018:0:(ldlm_lock.c:147:ldlm_lock_put()) Process entered 00010000:00000001:2:1258536010.928076:0:31018:0:(ldlm_lock.c:177:ldlm_lock_put()) Process leaving 00000080:00000001:2:1258536010.928076:0:31018:0:(file.c:3017:ll_inode_revalidate_it()) Process leaving via out (rc=0 : 0 : 0) 00000100:00000001:2:1258536010.928077:0:31018:0:(client.c:1515:__ptlrpc_req_finished()) Process entered 00000100:00000001:2:1258536010.928078:0:31018:0:(client.c:1517:__ptlrpc_req_finished()) Process leaving (rc=1 : 1 : 1) 00000080:00000001:2:1258536010.928079:0:31018:0:(file.c:3025:ll_inode_revalidate_it()) Process leaving (rc=0 : 0 : 0)
_______________________________________________ Lustre-discuss mailing list [email protected] http://lists.lustre.org/mailman/listinfo/lustre-discuss
