3.7.11 had quite a few bugs in afr and in the sharding+afr interaction that were fixed in 3.7.12. Some of them involved files being incorrectly reported as in split-brain. Chances are that some of them existed in 3.7.10 as well - which is what you're using.

Do you mind trying the same test with 3.7.12 or a later version?
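
If it does reproduce again, it would also help to capture what afr itself reports at that point. A quick check along these lines should show whether any files are really flagged as split-brain (the volume name below is taken from the log prefix in your mail; adjust it if yours differs):

    # confirm the version actually running on clients and servers
    glusterfs --version

    # entries afr currently considers to be in split-brain
    gluster volume heal ad17hwssd7 info split-brain

    # overall pending-heal backlog for the volume
    gluster volume heal ad17hwssd7 info

If those come back empty while the client keeps failing writes with EIO, that points more towards one of the fixed bugs than towards genuine split-brain.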

-Krutika

On Tue, Aug 16, 2016 at 2:46 PM, qingwei wei <[email protected]> wrote:
> Hi Niels,
>
> My situation is that when i unplug the HDD physically, the FIO
> application exits with an Input/Output error. However, when i do echo
> offline on the disk, the FIO application does freeze a bit but still
> manages to resume the IO workload after the freeze.
>
> From what i can see in the client log, the error is split-brain,
> which does not make sense as i still have 2 working replicas.
>
> [2016-08-12 10:33:41.854283] E [MSGID: 108008]
> [afr-transaction.c:1989:afr_transaction] 0-ad17hwssd7-replicate-0:
> Failing WRITE on gfid 665a43df-1ece-4c9a-a6ee-fcfa960d95bf: split-brain
> observed. [Input/output error]
>
> So can anyone share their testing experience with this type of
> disruptive test on a sharded volume? Thanks!
>
> Regards,
>
> Cheng Wee
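
(If you still have the setup around, it would also be interesting to look at the afr changelog xattrs of the affected shard directly on the two bricks that stayed up. Roughly the following, where the brick path is a placeholder for whatever directory your bricks use and the shard name is the one from your server log:

    getfattr -d -m . -e hex /path/to/brick/.shard/150e99ee-ce3b-4b57-8c40-99b4ecdf3822.90

The trusted.afr.ad17hwssd7-client-* values on each surviving copy show whether the replicas actually accuse each other, i.e. whether this is genuine split-brain or a spurious report.)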

> On Tue, Aug 16, 2016 at 4:45 PM, Niels de Vos <[email protected]> wrote:
> > On Tue, Aug 16, 2016 at 01:34:36PM +0800, qingwei wei wrote:
> >> Hi,
> >>
> >> I am currently trying to test the reliability of a distributed replica
> >> (3 replicas) volume when 1 brick is down. I tried both the software
> >> unplug method, by issuing echo offline > /sys/block/sdx/device/state,
> >> and physically unplugging the HDD, and i encountered 2 different
> >> outcomes. For the software unplug, the FIO workload continues to run,
> >> but after physically unplugging the HDD, the FIO workload cannot
> >> continue and fails with the following error:
> >>
> >> [2016-08-12 10:33:41.854283] E [MSGID: 108008]
> >> [afr-transaction.c:1989:afr_transaction] 0-ad17hwssd7-replicate-0:
> >> Failing WRITE on gfid 665a43df-1ece-4c9a-a6ee-fcfa960d95bf:
> >> split-brain observed. [Input/output error]
> >>
> >> From the server where i unplugged the disk, i can see the following:
> >>
> >> [2016-08-12 10:33:41.916456] D [MSGID: 0]
> >> [io-threads.c:351:iot_schedule] 0-ad17hwssd7-io-threads: LOOKUP
> >> scheduled as fast fop
> >> [2016-08-12 10:33:41.916666] D [MSGID: 115050]
> >> [server-rpc-fops.c:179:server_lookup_cbk] 0-ad17hwssd7-server: 8127:
> >> LOOKUP /.shard/150e99ee-ce3b-4b57-8c40-99b4ecdf3822.90
> >> (be318638-e8a0-4c6d-977d-7a937aa84806/150e99ee-ce3b-4b57-8c40-99b4ecdf3822.90)
> >> ==> (No such file or directory) [No such file or directory]
> >> [2016-08-12 10:33:41.916804] D [MSGID: 101171]
> >> [client_t.c:417:gf_client_unref] 0-client_t:
> >> hp.dctopenstack.org-25780-2016/08/12-10:33:07:589960-ad17hwssd7-client-0-0-0:
> >> ref-count 1
> >> [2016-08-12 10:33:41.917098] D [MSGID: 101171]
> >> [client_t.c:333:gf_client_ref] 0-client_t:
> >> hp.dctopenstack.org-25780-2016/08/12-10:33:07:589960-ad17hwssd7-client-0-0-0:
> >> ref-count 2
> >> [2016-08-12 10:33:41.917145] W [MSGID: 115009]
> >> [server-resolve.c:571:server_resolve] 0-ad17hwssd7-server: no
> >> resolution type for (null) (LOOKUP)
> >> [2016-08-12 10:33:41.917182] E [MSGID: 115050]
> >> [server-rpc-fops.c:179:server_lookup_cbk] 0-ad17hwssd7-server: 8128:
> >> LOOKUP (null) (00000000-0000-0000-0000-000000000000/150e99ee-ce3b-4b57-8c40-99b4ecdf3822.90)
> >> ==> (Invalid argument) [Invalid argument]
> >>
> >> I am using gluster 3.7.10 and the configuration is as follows:
> >>
> >> diagnostics.brick-log-level: DEBUG
> >> diagnostics.client-log-level: DEBUG
> >> performance.io-thread-count: 16
> >> client.event-threads: 2
> >> server.event-threads: 2
> >> features.shard-block-size: 16MB
> >> features.shard: on
> >> server.allow-insecure: on
> >> storage.owner-uid: 165
> >> storage.owner-gid: 165
> >> nfs.disable: true
> >> performance.quick-read: off
> >> performance.io-cache: off
> >> performance.read-ahead: off
> >> performance.stat-prefetch: off
> >> cluster.lookup-optimize: on
> >> cluster.quorum-type: auto
> >> cluster.server-quorum-type: server
> >> transport.address-family: inet
> >> performance.readdir-ahead: on
> >>
> >> This error only occurs with the sharding configuration. Have you guys
> >> performed this type of test before? Or do you think physically
> >> unplugging the HDD is a valid test case?
> >
> > If you use replica-3, things should settle down again. The kernel and
> > the brick process need a little time to find out that the filesystem on
> > the disk that you pulled out is not responding anymore. The output of
> > "gluster volume status" should show that the brick process is offline.
> > As long as you have quorum, things should continue after a small delay
> > while waiting to mark the brick offline.
> >
> > People actually should test this scenario; it can be that power to disks
> > fails, or even (connections to) RAID-controllers. Hot-unplugging is
> > definitely a scenario that can emulate real-world problems.
> >
> > Niels
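
For anyone who wants to reproduce this, the software-offline variant described above boils down to roughly the following; the mount point, device name and fio parameters are placeholders for the actual setup:

    # on the client: keep a write workload running against the fuse mount
    fio --name=unplug-test --directory=/mnt/ad17hwssd7 --rw=randwrite --bs=4k \
        --size=1G --ioengine=libaio --direct=1 --time_based --runtime=300

    # on the affected server: take the brick's data disk offline
    echo offline > /sys/block/sdX/device/state

    # the brick should eventually show as offline, while client quorum
    # (2 of the 3 replicas still up) keeps the volume writable
    gluster volume status ad17hwssd7
    gluster volume heal ad17hwssd7 info

The physical-unplug run should behave the same way: the brick goes offline and writes continue after a short stall, rather than failing with EIO/split-brain.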

_______________________________________________
Gluster-devel mailing list
[email protected]
http://www.gluster.org/mailman/listinfo/gluster-devel
