# gluster --version
glusterfs 3.7.9 built on Jun 10 2016 06:32:42

Try not to make fun of my python, but I was able to make a small
modification to the sync_files.py script from smallfile that at least
lets my team move on with testing. It's terribly hacky and ugly, but it
works around the problem, which at this point I am pretty convinced is a
Gluster bug.


# diff bin/sync_files.py.orig bin/sync_files.py
6a7,8
> import errno
> import binascii
27c29,40
<         shutil.rmtree(master_invoke.network_dir)
---
>         try:
>             shutil.rmtree(master_invoke.network_dir)
>         except OSError as e:
>             err = e.errno
>             if err != errno.EEXIST:
>                 # workaround for possible bug in Gluster
>                 if err != errno.ENOTEMPTY:
>                     raise e
>                 else:
>                     print('saw ENOTEMPTY on stonewall, moving shared directory')
>                     ext = str(binascii.b2a_hex(os.urandom(15)))
>                     shutil.move(master_invoke.network_dir, master_invoke.network_dir + ext)

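For reference, here is the same workaround pulled out into a standalone
sketch (the function name and messages are mine, and unlike the diff above
it treats ENOENT rather than EEXIST as the benign errno, since that is what
rmtree raises when the directory is already gone):

import binascii
import errno
import os
import shutil

def remove_shared_dir(path):
    # Remove the shared network-sync directory, working around the suspected
    # Gluster bug where rmtree fails with ENOTEMPTY even though the directory
    # looks empty from the client.
    try:
        shutil.rmtree(path)
    except OSError as e:
        if e.errno == errno.ENOTEMPTY:
            # Rename the directory aside with a random suffix so the
            # benchmark can recreate a fresh one and keep going.
            suffix = binascii.b2a_hex(os.urandom(15)).decode('ascii')
            print('saw ENOTEMPTY, moving shared directory aside')
            shutil.move(path, path + '.' + suffix)
        elif e.errno != errno.ENOENT:
            # Anything other than "already gone" is a real error.
            raise

In smallfile's create_top_dirs() this would be called on
master_invoke.network_dir; the renamed directories can be cleaned up by hand
once the underlying Gluster issue is understood.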

Dustin Black, RHCA
Senior Architect, Software-Defined Storage
Red Hat, Inc.
(o) +1.212.510.4138  (m) +1.215.821.7423
dus...@redhat.com


On Tue, Oct 18, 2016 at 7:09 PM, Dustin Black <dbl...@redhat.com> wrote:

> Dang. I always think I get all the detail and inevitably leave out
> something important. :-/
>
> I'm mobile and don't have the exact version in front of me, but this is
> recent if not latest RHGS on RHEL 7.2.
>
>
> On Oct 18, 2016 7:04 PM, "Dan Lambright" <dlamb...@redhat.com> wrote:
>
>> Dustin,
>>
>> What level code? I often run smallfile on upstream code with tiered
>> volumes and have not seen this.
>>
>> Sure, one of us will get back to you.
>>
>> Unfortunately, Gluster has a lot of protocol overhead (LOOKUPs), which
>> overwhelms the boost in transfer speeds you get for small files. A
>> presentation at the Berlin Gluster summit evaluated this. The expectation
>> is that md-cache will go a long way towards helping with that before too
>> long.
>>
>> Dan
>>
>>
>>
>> ----- Original Message -----
>> > From: "Dustin Black" <dbl...@redhat.com>
>> > To: gluster-devel@gluster.org
>> > Cc: "Annette Clewett" <aclew...@redhat.com>
>> > Sent: Tuesday, October 18, 2016 4:30:04 PM
>> > Subject: [Gluster-devel] Possible race condition bug with tiered volume
>> >
>> > I have a 3x2 hot tier on NVMe drives with a 3x2 cold tier on RAID6
>> > drives.
>> >
>> > # gluster vol info 1nvme-distrep3x2
>> > Volume Name: 1nvme-distrep3x2
>> > Type: Tier
>> > Volume ID: 21e3fc14-c35c-40c5-8e46-c258c1302607
>> > Status: Started
>> > Number of Bricks: 12
>> > Transport-type: tcp
>> > Hot Tier :
>> > Hot Tier Type : Distributed-Replicate
>> > Number of Bricks: 3 x 2 = 6
>> > Brick1: n5:/rhgs/hotbricks/1nvme-distrep3x2-hot
>> > Brick2: n4:/rhgs/hotbricks/1nvme-distrep3x2-hot
>> > Brick3: n3:/rhgs/hotbricks/1nvme-distrep3x2-hot
>> > Brick4: n2:/rhgs/hotbricks/1nvme-distrep3x2-hot
>> > Brick5: n1:/rhgs/hotbricks/1nvme-distrep3x2-hot
>> > Brick6: n0:/rhgs/hotbricks/1nvme-distrep3x2-hot
>> > Cold Tier:
>> > Cold Tier Type : Distributed-Replicate
>> > Number of Bricks: 3 x 2 = 6
>> > Brick7: n0:/rhgs/coldbricks/1nvme-distrep3x2
>> > Brick8: n1:/rhgs/coldbricks/1nvme-distrep3x2
>> > Brick9: n2:/rhgs/coldbricks/1nvme-distrep3x2
>> > Brick10: n3:/rhgs/coldbricks/1nvme-distrep3x2
>> > Brick11: n4:/rhgs/coldbricks/1nvme-distrep3x2
>> > Brick12: n5:/rhgs/coldbricks/1nvme-distrep3x2
>> > Options Reconfigured:
>> > cluster.tier-mode: cache
>> > features.ctr-enabled: on
>> > performance.readdir-ahead: on
>> >
>> >
>> > I am attempting to run the 'smallfile' benchmark tool on this volume. The
>> > 'smallfile' tool creates a starting gate directory and files in a shared
>> > filesystem location. The first run (write) works as expected.
>> >
>> > # smallfile_cli.py --threads 12 --file-size 4096 --files 300 --top
>> > /rhgs/client/1nvme-distrep3x2 --host-set
>> > c0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11 --prefix test1 --stonewall Y
>> > --network-sync-dir /rhgs/client/1nvme-distrep3x2/smf1 --operation create
>> >
>> > For the second run (read), I believe that smallfile attempts first to
>> > 'rm -rf' the "network-sync-dir" path, which fails with ENOTEMPTY,
>> > causing the run to fail.
>> >
>> > # smallfile_cli.py --threads 12 --file-size 4096 --files 300 --top
>> > /rhgs/client/1nvme-distrep3x2 --host-set
>> > c0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11 --prefix test1 --stonewall Y
>> > --network-sync-dir /rhgs/client/1nvme-distrep3x2/smf1 --operation create
>> > ...
>> > Traceback (most recent call last):
>> > File "/root/bin/smallfile_cli.py", line 280, in <module>
>> > run_workload()
>> > File "/root/bin/smallfile_cli.py", line 270, in run_workload
>> > return run_multi_host_workload(params)
>> > File "/root/bin/smallfile_cli.py", line 62, in run_multi_host_workload
>> > sync_files.create_top_dirs(master_invoke, True)
>> > File "/root/bin/sync_files.py", line 27, in create_top_dirs
>> > shutil.rmtree(master_invoke.network_dir)
>> > File "/usr/lib64/python2.7/shutil.py", line 256, in rmtree
>> > onerror(os.rmdir, path, sys.exc_info())
>> > File "/usr/lib64/python2.7/shutil.py", line 254, in rmtree
>> > os.rmdir(path)
>> > OSError: [Errno 39] Directory not empty: '/rhgs/client/1nvme-distrep3x2/smf1'
>> >
>> >
>> > From the client perspective, the directory is clearly empty.
>> >
>> > # ls -a /rhgs/client/1nvme-distrep3x2/smf1/
>> > . ..
>> >
>> >
>> > And a quick search on the bricks shows that the hot tier on the last
>> > replica pair is the offender.
>> >
>> > # for i in {0..5}; do ssh n$i "hostname; ls
>> > /rhgs/coldbricks/1nvme-distrep3x2/smf1 | wc -l; ls
>> > /rhgs/hotbricks/1nvme-distrep3x2-hot/smf1 | wc -l"; done
>> > rhosd0
>> > 0
>> > 0
>> > rhosd1
>> > 0
>> > 0
>> > rhosd2
>> > 0
>> > 0
>> > rhosd3
>> > 0
>> > 0
>> > rhosd4
>> > 0
>> > 1
>> > rhosd5
>> > 0
>> > 1
>> >
>> >
>> > (For the record, multiple runs of this reproducer show that it is
>> > consistently the hot tier that is to blame, but it is not always the
>> > same replica pair.)
>> >
>> >
>> > consistent?
>> > Please reach out if you need me to provide any further details.
>> >
>> >
>> > Dustin Black, RHCA
>> > Senior Architect, Software-Defined Storage
>> > Red Hat, Inc.
>> > (o) +1.212.510.4138 (m) +1.215.821.7423
>> > dus...@redhat.com
>> >
>>
>
_______________________________________________
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel
