I don't know if this has any relation to your issue, but I have seen several times during gluster healing that my VMs fail or are marked unresponsive in RHEV. My conclusion is that the load gluster puts on the VM images during checksumming while healing results in too much latency, and the VMs fail.

My plan is to try using sharding, so the VM images/files are split into smaller files, changing the number of allowed concurrent heals ('cluster.background-self-heal-count'), and disabling 'cluster.self-heal-daemon'.
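A rough sketch of the corresponding CLI, assuming a volume named <volname> (a placeholder); the option names exist in GlusterFS 3.8, but the values shown are illustrative rather than recommendations:

# Enable sharding so newly created VM images are stored as smaller pieces.
# Note: sharding only applies to files created after it is turned on.
gluster volume set <volname> features.shard on
gluster volume set <volname> features.shard-block-size 64MB

# Reduce the number of heals allowed to run in parallel in the background.
gluster volume set <volname> cluster.background-self-heal-count 4

# Stop the self-heal daemon; files are then healed only when clients access them.
gluster volume set <volname> cluster.self-heal-daemon off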
/Jesper

From: [email protected] [mailto:[email protected]] On behalf of Krutika Dhananjay
Sent: 8 May 2017 12:38
To: Alessandro Briosi <[email protected]>; de Vos, Niels <[email protected]>
Cc: gluster-users <[email protected]>
Subject: Re: [Gluster-users] VM going down

The newly introduced "SEEK" fop seems to be failing at the bricks. Adding Niels for his inputs/help.

-Krutika

On Mon, May 8, 2017 at 3:43 PM, Alessandro Briosi <[email protected]> wrote:

Hi all,
I have VMs sporadically going down whose files are on GlusterFS. If I look at the gluster logs, the only events I find are:

/var/log/glusterfs/bricks/data-brick2-brick.log

[2017-05-08 09:51:17.661697] I [MSGID: 115036] [server.c:548:server_rpc_notify] 0-datastore2-server: disconnecting connection from srvpve2-9074-2017/05/04-14:12:53:301448-datastore2-client-0-0-0
[2017-05-08 09:51:17.661697] I [MSGID: 115036] [server.c:548:server_rpc_notify] 0-datastore2-server: disconnecting connection from srvpve2-9074-2017/05/04-14:12:53:367950-datastore2-client-0-0-0
[2017-05-08 09:51:17.661810] W [inodelk.c:399:pl_inodelk_log_cleanup] 0-datastore2-server: releasing lock on 66d9eefb-ee55-40ad-9f44-c55d1e809006 held by {client=0x7f4c7c004880, pid=0 lk-owner=5c7099efc97f0000}
[2017-05-08 09:51:17.661810] W [inodelk.c:399:pl_inodelk_log_cleanup] 0-datastore2-server: releasing lock on a8d82b3d-1cf9-45cf-9858-d8546710b49c held by {client=0x7f4c840f31d0, pid=0 lk-owner=5c7019fac97f0000}
[2017-05-08 09:51:17.661835] I [MSGID: 115013] [server-helpers.c:293:do_fd_cleanup] 0-datastore2-server: fd cleanup on /images/201/vm-201-disk-2.qcow2
[2017-05-08 09:51:17.661838] I [MSGID: 115013] [server-helpers.c:293:do_fd_cleanup] 0-datastore2-server: fd cleanup on /images/201/vm-201-disk-1.qcow2
[2017-05-08 09:51:17.661953] I [MSGID: 101055] [client_t.c:415:gf_client_unref] 0-datastore2-server: Shutting down connection srvpve2-9074-2017/05/04-14:12:53:301448-datastore2-client-0-0-0
[2017-05-08 09:51:17.661953] I [MSGID: 101055] [client_t.c:415:gf_client_unref] 0-datastore2-server: Shutting down connection srvpve2-9074-2017/05/04-14:12:53:367950-datastore2-client-0-0-0
[2017-05-08 10:01:06.210392] I [MSGID: 115029] [server-handshake.c:692:server_setvolume] 0-datastore2-server: accepted client from srvpve2-162483-2017/05/08-10:01:06:189720-datastore2-client-0-0-0 (version: 3.8.11)
[2017-05-08 10:01:06.237433] E [MSGID: 113107] [posix.c:1079:posix_seek] 0-datastore2-posix: seek failed on fd 18 length 42957209600 [No such device or address]
[2017-05-08 10:01:06.237463] E [MSGID: 115089] [server-rpc-fops.c:2007:server_seek_cbk] 0-datastore2-server: 18: SEEK-2 (a8d82b3d-1cf9-45cf-9858-d8546710b49c) ==> (No such device or address) [No such device or address]
[2017-05-08 10:01:07.019974] I [MSGID: 115029] [server-handshake.c:692:server_setvolume] 0-datastore2-server: accepted client from srvpve2-162483-2017/05/08-10:01:07:3687-datastore2-client-0-0-0 (version: 3.8.11)
[2017-05-08 10:01:07.041967] E [MSGID: 113107] [posix.c:1079:posix_seek] 0-datastore2-posix: seek failed on fd 19 length 859136720896 [No such device or address]
[2017-05-08 10:01:07.041992] E [MSGID: 115089] [server-rpc-fops.c:2007:server_seek_cbk] 0-datastore2-server: 18: SEEK-2 (66d9eefb-ee55-40ad-9f44-c55d1e809006) ==> (No such device or address) [No such device or address]
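A side note on the "No such device or address" errors above: that string is strerror(ENXIO), and posix_seek() on the brick maps the client's SEEK onto lseek(2) with SEEK_DATA/SEEK_HOLE, which is documented to fail with ENXIO when the requested offset is at or beyond end-of-file. If the logged offsets match the image sizes, these errors could simply be seeks issued at EOF rather than a storage fault. A minimal way to observe the errno, assuming xfs_io (from xfsprogs 3.2+) and strace are installed; the path and sizes are made up:

# Create a 1 MiB sparse file, then ask for the next data at an offset
# past EOF; strace shows the underlying lseek call failing with ENXIO.
truncate -s 1M /tmp/seektest
strace -e trace=lseek xfs_io -r -c "seek -d 2097152" /tmp/seektest
# Expect a line like:
#   lseek(3, 2097152, SEEK_DATA) = -1 ENXIO (No such device or address)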
The strange part is that I cannot seem to find any other error. If I restart the VM, everything works as expected (it stopped at ~09:51 UTC and was started at ~10:01 UTC). This is not the first time this has happened, and I do not see any problems with the network or the hosts.

Gluster version is 3.8.11. This is the incriminated volume (though it happened on a different one too):

Volume Name: datastore2
Type: Replicate
Volume ID: c95ebb5f-6e04-4f09-91b9-bbbe63d83aea
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: srvpve2g:/data/brick2/brick
Brick2: srvpve3g:/data/brick2/brick
Brick3: srvpve1g:/data/brick2/brick (arbiter)
Options Reconfigured:
nfs.disable: on
performance.readdir-ahead: on
transport.address-family: inet

Any hint on how to dig more deeply into the reason would be greatly appreciated.

Alessandro
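One way to dig deeper when this recurs, sketched with stock gluster CLI commands (the volume name follows the output above; DEBUG logging is verbose, so reset it afterwards):

# Raise log verbosity on clients and bricks; reset to INFO when done.
gluster volume set datastore2 diagnostics.client-log-level DEBUG
gluster volume set datastore2 diagnostics.brick-log-level DEBUG

# Show which clients are connected to the bricks.
gluster volume status datastore2 clients

# Check whether heals were pending around the time of the outage.
gluster volume heal datastore2 info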
_______________________________________________
Gluster-users mailing list
[email protected]
http://lists.gluster.org/mailman/listinfo/gluster-users
