Hello community, I'm still facing this issue with GlusterFS-3.8.5-1 and I would like to know whether it could be related to a potential bug in the disperse mode or to the way Spark writes its job results to the gluster mount point.
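For reference, the statedump below was collected roughly like this (a minimal sketch; the volume name "exp" is inferred from the lock domain names, and the dump location may differ if server.statedump-path has been changed):

    # trigger a statedump of the brick processes for the volume
    gluster volume statedump exp

    # the dump files are written on each brick host, by default under /var/run/gluster/
    ls /var/run/gluster/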
Found the following locks using statedump:

[xlator.features.locks.exp-locks.inode]
path=/tests/output/_temporary/0/_temporary
mandatory=0
inodelk-count=3
lock-dump.domain.domain=dht.layout.heal
lock-dump.domain.domain=exp-disperse-0
inodelk.inodelk[0](ACTIVE)=type=WRITE, whence=0, start=0, len=0, pid = 10978, owner=60af7c2a0c7f0000, client=0x7f4fd4009900, connection-id=server1-5422-2016/10/26-15:03:03:550732-exp-client-0-0-0, blocked at 2016-10-26 15:53:13, granted at 2016-10-26 15:53:14
inodelk.inodelk[1](BLOCKED)=type=WRITE, whence=0, start=0, len=0, pid = 19137, owner=18590414617f0000, client=0x7f4fdc0e1950, connection-id=server2-1720-2016/10/26-15:05:59:328008-exp-client-0-0-0, blocked at 2016-10-26 15:53:14
inodelk.inodelk[2](BLOCKED)=type=WRITE, whence=0, start=0, len=0, pid = 9798, owner=303b34a4e8720000, client=0x7f4fdc0deac0, connection-id=server1-3955-2016/10/26-15:04:00:874305-exp-client-0-0-0, blocked at 2016-10-26 15:53:18

Many thanks for your advice and help on this topic.

2016-06-20 10:43 GMT+02:00 Florian Philippon <[email protected]>:

> Hello guys,
>
> I would like to get some advice on a few problems we have on our 3-host
> gluster setup.
>
> Here is the setup used:
>
> 1. GlusterFS 3.8.0-1 (we upgraded from 3.7.11 last week)
> 2. Type: Disperse
> 3. Number of Bricks: 1 x (2 + 1) = 3
> 4. Transport-type: tcp
> 5. Options Reconfigured: transport.address-family: inet
>
> Please note that we also have the ACL option enabled on the volume mount.
>
> Use case:
>
> A user submits jobs/tasks to a Spark cluster which has the glusterfs
> volume mounted on each host.
> 13 tasks were successfully completed in ~30 min each (convert some
> logs to a JSON format and write the output to the gluster fs), but one was
> blocked for more than 12 hours when we checked what
> was going wrong.
>
> We found some log entries related to an inode locking in the brick log on
> one host:
>
> [2016-06-19 03:15:08.563397] E [inodelk.c:304:__inode_unlock_lock]
> 0-exp-locks: Matching lock not found for unlock 0-9223372036854775807, by
> 10613ebc6c6a0000 on 0x6cee5c0f4730
> [2016-06-19 03:15:08.563684] E [MSGID: 115053]
> [server-rpc-fops.c:273:server_inodelk_cbk] 0-exp-server: 5375861: INODELK
> /spark/user/20160328/_temporary/0/_temporary (015bde3a-09d6-41a2-8e9f-7e7c5295d596)
> ==> (Invalid argument) [Invalid argument]
>
> Errors in the data log:
>
> [2016-06-19 03:13:29.198676] I [MSGID: 109036]
> [dht-common.c:8824:dht_log_new_layout_for_dir_selfheal]
> 0-exp-dht: Setting layout of
> /spark/user/20160328/_temporary/0/_temporary/attempt_201606190511_0004_m_000004_26
> with [Subvol_name: exp-disperse-0, Err: -1 , Start: 0 , Stop: 4294967295 , Hash: 1 ],
> [2016-06-19 03:14:59.349357] I [MSGID: 109066]
> [dht-rename.c:1562:dht_rename] 0-exp-dht: renaming
> /spark/user/20160328/_temporary/0/_temporary/attempt_201606190511_0004_m_000001_23
> (hash=exp-disperse-0/cache=exp-disperse-0) =>
> /spark/user/20160328/_temporary/0/task_201606190511_0004_m_000001
> (hash=exp-disperse-0/cache=<nul>)
>
> And these entries are also spamming the data log when an action is done
> on the fs:
>
> [2016-06-19 13:58:22.817308] I [dict.c:462:dict_get]
> (-->/usr/lib64/glusterfs/3.8.0/xlator/debug/io-stats.so(+0x13628)
> [0x6f0655cd1628] -->/usr/lib64/glusterfs/3.8.0/xlator/system/posix-acl.so(+0x9ccb)
> [0x6f0655ab5ccb] -->/lib64/libglusterfs.so.0(dict_get+0xec)
> [0x6f066528df7c] ) 0-dict: !this || key=system.posix_acl_access [Invalid
> argument]
> [2016-06-19 13:58:22.817364] I [dict.c:462:dict_get]
> (-->/usr/lib64/glusterfs/3.8.0/xlator/debug/io-stats.so(+0x13628)
> [0x6f0655cd1628] -->/usr/lib64/glusterfs/3.8.0/xlator/system/posix-acl.so(+0x9d21)
> [0x6f0655ab5d21] -->/lib64/libglusterfs.so.0(dict_get+0xec)
> [0x6f066528df7c] ) 0-dict: !this || key=system.posix_acl_default [Invalid
> argument]
>
> We did a statedump and we got confirmation that some processes were in a
> blocking state.
>
> We cleared the lock on the blocked inode and the spark job finally
> finished (with errors).
>
> What could be the root cause of these locks?
>
> Thanks for your help!
>
> Florian
>
>
> --
> Florian Philippon
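For anyone hitting the same situation, the lock clearing mentioned in the quoted mail was done roughly like this (a sketch only; the path here is the stuck directory from the new statedump above, and the kind/range arguments follow the admin-guide examples for clear-locks, so adjust them for your case):

    # clear the granted inode lock on the stuck directory so the blocked ones can proceed
    gluster volume clear-locks exp /tests/output/_temporary/0/_temporary kind granted inode 0,0-0

Note that clearing locks server-side only unblocks the waiters; the client that held the cleared lock can still fail its operation, which matches the job finishing "with errors".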
_______________________________________________
Gluster-users mailing list
[email protected]
http://www.gluster.org/mailman/listinfo/gluster-users
