Yesterday I reproduced this situation twice. The setup:

1. Hardware and network
   a. Disks: INTEL SSDSC2BB240G4
   b1. Network cards: X540-AT2
   b2. Netgear 10G switch
2. Software setup:
   a. OS: Debian wheezy
   b. GlusterFS: 3.4.4 (latest 3.4.4 from the gluster repository)
   c. Proxmox VE with updated glusterfs from the gluster repository
3. Software configuration (a minimal sketch of the commands follows this list)
   a. Create a replicated volume with these options: cluster.self-heal-daemon: off; nfs.disable: off; network.ping-timeout: 2
   b. Mount it on Proxmox VE (via the Proxmox GUI; it mounts with these options:
      stor1:HA-fast-150G-PVE1 on /mnt/pve/FAST-TESt type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072))
   c. Install a VM with a qcow2 or raw disk image.
   d. Disable the switch port / remove the network cable of one of the storage servers.
   e. Wait, then put the cable back.
   f. Keep waiting for the sync (pointless, it never starts).
   g. Disable the other server's port (or remove its cable).
   h. Profit.
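For reference, a minimal sketch of the volume setup described above, assuming hostnames stor1/stor2 and the brick path /exports/fast-test/150G (my reconstruction, not the exact commands that were run):

    # Create a 2-way replicated volume across both storage servers
    gluster volume create HA-fast-150G-PVE1 replica 2 \
        stor1:/exports/fast-test/150G stor2:/exports/fast-test/150G
    # Apply the options listed in step 3a
    gluster volume set HA-fast-150G-PVE1 cluster.self-heal-daemon off
    gluster volume set HA-fast-150G-PVE1 nfs.disable off
    gluster volume set HA-fast-150G-PVE1 network.ping-timeout 2
    gluster volume start HA-fast-150G-PVE1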
Maybe I could use 3.5.2 from the Debian sid (unstable) repository to test with?

2014-08-06 9:39 GMT+03:00 Pranith Kumar Karampuri <[email protected]>:

> Roman,
>     The file went into split-brain. I think we should do these tests with
> 3.5.2, where monitoring the heals is easier. Let me also come up with a
> document about how to do the testing you are trying to do.
>
> Humble/Niels,
>     Do we have debs available for 3.5.2? In 3.5.1 there was a packaging
> issue where /usr/bin/glfsheal was not packaged along with the deb. I think
> that should be fixed now as well?
>
> Pranith
>
> On 08/06/2014 11:52 AM, Roman wrote:
>
> good morning,
>
> root@stor1:~# getfattr -d -m. -e hex
> /exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
> getfattr: Removing leading '/' from absolute path names
> # file: exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
> trusted.afr.HA-fast-150G-PVE1-client-0=0x000000000000000000000000
> trusted.afr.HA-fast-150G-PVE1-client-1=0x000001320000000000000000
> trusted.gfid=0x23c79523075a4158bea38078da570449
>
> getfattr: Removing leading '/' from absolute path names
> # file: exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
> trusted.afr.HA-fast-150G-PVE1-client-0=0x000000040000000000000000
> trusted.afr.HA-fast-150G-PVE1-client-1=0x000000000000000000000000
> trusted.gfid=0x23c79523075a4158bea38078da570449
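The trusted.afr values above are AFR's pending-operation changelogs: each 24-hex-digit value packs three 32-bit big-endian counters (data, metadata, entry). A quick bash decode of the non-zero value above, assuming that layout:

    # Split a trusted.afr changelog into its data/metadata/entry counters (bash)
    v=000001320000000000000000
    echo "data=$((16#${v:0:8})) metadata=$((16#${v:8:8})) entry=$((16#${v:16:8}))"
    # -> data=306 metadata=0 entry=0

So stor1 blames client-1 (presumably stor2's brick) for 306 pending data operations, while the second brick's 0x00000004 on client-0 blames stor1's brick for 4. Each brick accusing the other is exactly the split-brain Pranith describes above.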
>
> 2014-08-06 9:20 GMT+03:00 Pranith Kumar Karampuri <[email protected]>:
>
>> On 08/06/2014 11:30 AM, Roman wrote:
>>
>> Also, this time the files are not the same!
>>
>> root@stor1:~# md5sum
>> /exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
>> 32411360c53116b96a059f17306caeda
>> /exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
>>
>> root@stor2:~# md5sum
>> /exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
>> 65b8a6031bcb6f5fb3a11cb1e8b1c9c9
>> /exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
>>
>> What is the getfattr output?
>>
>> Pranith
>>
>> 2014-08-05 16:33 GMT+03:00 Roman <[email protected]>:
>>
>>> Nope, it is not working. But this time it went a bit differently:
>>>
>>> root@gluster-client:~# dmesg
>>> Segmentation fault
>>>
>>> I was not even able to start the VM after I had done the tests:
>>>
>>> Could not read qcow2 header: Operation not permitted
>>>
>>> And it seems it never starts to sync the files after the first
>>> disconnect. The VM survives the first disconnect, but not the second
>>> (I waited around 30 minutes). Also, I've got network.ping-timeout: 2 in
>>> the volume settings, but the logs reacted to the first disconnect only
>>> after around 30 seconds; the second was faster, 2 seconds.
>>>
>>> The reaction was also different.
>>>
>>> the slower one:
>>> [2014-08-05 13:26:19.558435] W [socket.c:514:__socket_rwv]
>>> 0-glusterfs: readv failed (Connection timed out)
>>> [2014-08-05 13:26:19.558485] W
>>> [socket.c:1962:__socket_proto_state_machine] 0-glusterfs: reading from
>>> socket failed. Error (Connection timed out), peer (10.250.0.1:24007)
>>> [2014-08-05 13:26:21.281426] W [socket.c:514:__socket_rwv]
>>> 0-HA-fast-150G-PVE1-client-0: readv failed (Connection timed out)
>>> [2014-08-05 13:26:21.281474] W
>>> [socket.c:1962:__socket_proto_state_machine] 0-HA-fast-150G-PVE1-client-0:
>>> reading from socket failed. Error (Connection timed out), peer
>>> (10.250.0.1:49153)
>>> [2014-08-05 13:26:21.281507] I [client.c:2098:client_rpc_notify]
>>> 0-HA-fast-150G-PVE1-client-0: disconnected
>>>
>>> the fast one:
>>> [2014-08-05 12:52:44.607389] C
>>> [client-handshake.c:127:rpc_client_ping_timer_expired]
>>> 0-HA-fast-150G-PVE1-client-1: server 10.250.0.2:49153 has not responded
>>> in the last 2 seconds, disconnecting.
>>> [2014-08-05 12:52:44.607491] W [socket.c:514:__socket_rwv]
>>> 0-HA-fast-150G-PVE1-client-1: readv failed (No data available)
>>> [2014-08-05 12:52:44.607585] E [rpc-clnt.c:368:saved_frames_unwind]
>>> (-->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_notify+0xf8)
>>> [0x7fcb1b4b0558]
>>> (-->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3)
>>> [0x7fcb1b4aea63]
>>> (-->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(saved_frames_destroy+0xe)
>>> [0x7fcb1b4ae97e]))) 0-HA-fast-150G-PVE1-client-1: forced unwinding frame
>>> type(GlusterFS 3.3) op(LOOKUP(27)) called at 2014-08-05 12:52:42.463881
>>> (xid=0x381883x)
>>> [2014-08-05 12:52:44.607604] W
>>> [client-rpc-fops.c:2624:client3_3_lookup_cbk] 0-HA-fast-150G-PVE1-client-1:
>>> remote operation failed: Transport endpoint is not connected. Path: /
>>> (00000000-0000-0000-0000-000000000001)
>>> [2014-08-05 12:52:44.607736] E [rpc-clnt.c:368:saved_frames_unwind]
>>> (-->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_notify+0xf8)
>>> [0x7fcb1b4b0558]
>>> (-->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3)
>>> [0x7fcb1b4aea63]
>>> (-->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(saved_frames_destroy+0xe)
>>> [0x7fcb1b4ae97e]))) 0-HA-fast-150G-PVE1-client-1: forced unwinding frame
>>> type(GlusterFS Handshake) op(PING(3)) called at 2014-08-05 12:52:42.463891
>>> (xid=0x381884x)
>>> [2014-08-05 12:52:44.607753] W [client-handshake.c:276:client_ping_cbk]
>>> 0-HA-fast-150G-PVE1-client-1: timer must have expired
>>> [2014-08-05 12:52:44.607776] I [client.c:2098:client_rpc_notify]
>>> 0-HA-fast-150G-PVE1-client-1: disconnected
>>>
>>> I've got SSD disks (just for info).
>>> Should I go and give 3.5.2 a try?
>>>
>>> 2014-08-05 13:06 GMT+03:00 Pranith Kumar Karampuri <[email protected]>:
>>>
>>>> Reply along with gluster-users please :-). Maybe you are hitting
>>>> 'reply' instead of 'reply all'?
>>>>
>>>> Pranith
>>>>
>>>> On 08/05/2014 03:35 PM, Roman wrote:
>>>>
>>>> To be sure and start clean, I've created another VM with a raw image,
>>>> and I am going to repeat those steps. So now I've got two VMs: one with
>>>> a qcow2 image and the other with a raw one. I will send another e-mail
>>>> shortly.
>>>>
>>>> 2014-08-05 13:01 GMT+03:00 Pranith Kumar Karampuri <[email protected]>:
>>>>
>>>>> On 08/05/2014 03:07 PM, Roman wrote:
>>>>>
>>>>> really, seems like the same file
>>>>>
>>>>> stor1:
>>>>> a951641c5230472929836f9fcede6b04
>>>>> /exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
>>>>>
>>>>> stor2:
>>>>> a951641c5230472929836f9fcede6b04
>>>>> /exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
>>>>>
>>>>> One thing I've seen in the logs: is Proxmox VE somehow connecting to
>>>>> the servers with the wrong version?
>>>>> [2014-08-05 09:23:45.218550] I
>>>>> [client-handshake.c:1659:select_server_supported_programs]
>>>>> 0-HA-fast-150G-PVE1-client-0: Using Program GlusterFS 3.3, Num (1298437),
>>>>> Version (330)
>>>>>
>>>>> It is the RPC (over-the-network data structures) version, which has not
>>>>> changed at all since 3.3, so that's not a problem. So what is the
>>>>> conclusion? Is your test case working now or not?
>>>>>
>>>>> Pranith
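A quick way to cross-check the client and server package versions against that RPC program version, assuming default log locations under /var/log/glusterfs (the mount log name is derived from the mountpoint, so the exact path is a guess):

    # Package version of the FUSE client
    glusterfs -V
    # RPC program negotiated per connection (stays "GlusterFS 3.3" on 3.4.x)
    grep 'Using Program GlusterFS' /var/log/glusterfs/mnt-pve-FAST-TESt.log
    # Client package versions as the bricks see them
    grep 'accepted client' /var/log/glusterfs/bricks/*.log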
>>>>> but if I issue:
>>>>> root@pve1:~# glusterfs -V
>>>>> glusterfs 3.4.4 built on Jun 28 2014 03:44:57
>>>>> it seems ok.
>>>>>
>>>>> the servers use 3.4.4 meanwhile:
>>>>> [2014-08-05 09:23:45.117875] I
>>>>> [server-handshake.c:567:server_setvolume] 0-HA-fast-150G-PVE1-server:
>>>>> accepted client from
>>>>> stor1-9004-2014/08/05-09:23:45:93538-HA-fast-150G-PVE1-client-1-0
>>>>> (version: 3.4.4)
>>>>> [2014-08-05 09:23:49.103035] I
>>>>> [server-handshake.c:567:server_setvolume] 0-HA-fast-150G-PVE1-server:
>>>>> accepted client from
>>>>> stor1-8998-2014/08/05-09:23:45:89883-HA-fast-150G-PVE1-client-0-0
>>>>> (version: 3.4.4)
>>>>>
>>>>> if this could be the reason, of course.
>>>>> I did restart the Proxmox VE yesterday (just for information).
>>>>>
>>>>> 2014-08-05 12:30 GMT+03:00 Pranith Kumar Karampuri <[email protected]>:
>>>>>
>>>>>> On 08/05/2014 02:33 PM, Roman wrote:
>>>>>>
>>>>>> Waited long enough for now; still different sizes and no logs about
>>>>>> healing :(
>>>>>>
>>>>>> stor1
>>>>>> # file: exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
>>>>>> trusted.afr.HA-fast-150G-PVE1-client-0=0x000000000000000000000000
>>>>>> trusted.afr.HA-fast-150G-PVE1-client-1=0x000000000000000000000000
>>>>>> trusted.gfid=0xf10ad81b58484bcd9b385a36a207f921
>>>>>>
>>>>>> root@stor1:~# du -sh /exports/fast-test/150G/images/127/
>>>>>> 1.2G /exports/fast-test/150G/images/127/
>>>>>>
>>>>>> stor2
>>>>>> # file: exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
>>>>>> trusted.afr.HA-fast-150G-PVE1-client-0=0x000000000000000000000000
>>>>>> trusted.afr.HA-fast-150G-PVE1-client-1=0x000000000000000000000000
>>>>>> trusted.gfid=0xf10ad81b58484bcd9b385a36a207f921
>>>>>>
>>>>>> root@stor2:~# du -sh /exports/fast-test/150G/images/127/
>>>>>> 1.4G /exports/fast-test/150G/images/127/
>>>>>>
>>>>>> According to the changelogs, the file doesn't need any healing.
>>>>>> Could you stop the operations on the VMs and take an md5sum on both
>>>>>> these machines?
>>>>>>
>>>>>> Pranith
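A sketch of that check, assuming root ssh access to both storage servers; the VM should be stopped or paused first, otherwise the image keeps changing under the checksum:

    # Compare checksum and changelog xattrs of the image on both bricks
    f=/exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
    for h in stor1 stor2; do
        echo "== $h =="
        ssh "$h" "md5sum $f; getfattr -d -m. -e hex $f"
    done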
>>>>>>
>>>>>> 2014-08-05 11:49 GMT+03:00 Pranith Kumar Karampuri <[email protected]>:
>>>>>>
>>>>>>> On 08/05/2014 02:06 PM, Roman wrote:
>>>>>>>
>>>>>>> Well, it seems like it doesn't see that changes were made to the
>>>>>>> volume? I created two files, 200 and 100 MB (from /dev/zero), after I
>>>>>>> disconnected the first brick. Then I connected it back and got these
>>>>>>> logs:
>>>>>>>
>>>>>>> [2014-08-05 08:30:37.830150] I
>>>>>>> [glusterfsd-mgmt.c:1584:mgmt_getspec_cbk] 0-glusterfs: No change in
>>>>>>> volfile, continuing
>>>>>>> [2014-08-05 08:30:37.830207] I [rpc-clnt.c:1676:rpc_clnt_reconfig]
>>>>>>> 0-HA-fast-150G-PVE1-client-0: changing port to 49153 (from 0)
>>>>>>> [2014-08-05 08:30:37.830239] W [socket.c:514:__socket_rwv]
>>>>>>> 0-HA-fast-150G-PVE1-client-0: readv failed (No data available)
>>>>>>> [2014-08-05 08:30:37.831024] I
>>>>>>> [client-handshake.c:1659:select_server_supported_programs]
>>>>>>> 0-HA-fast-150G-PVE1-client-0: Using Program GlusterFS 3.3, Num (1298437),
>>>>>>> Version (330)
>>>>>>> [2014-08-05 08:30:37.831375] I
>>>>>>> [client-handshake.c:1456:client_setvolume_cbk]
>>>>>>> 0-HA-fast-150G-PVE1-client-0: Connected to 10.250.0.1:49153,
>>>>>>> attached to remote volume '/exports/fast-test/150G'.
>>>>>>> [2014-08-05 08:30:37.831394] I
>>>>>>> [client-handshake.c:1468:client_setvolume_cbk]
>>>>>>> 0-HA-fast-150G-PVE1-client-0: Server and Client lk-version numbers are
>>>>>>> not same, reopening the fds
>>>>>>> [2014-08-05 08:30:37.831566] I
>>>>>>> [client-handshake.c:450:client_set_lk_version_cbk]
>>>>>>> 0-HA-fast-150G-PVE1-client-0: Server lk version = 1
>>>>>>>
>>>>>>> [2014-08-05 08:30:37.830150] I
>>>>>>> [glusterfsd-mgmt.c:1584:mgmt_getspec_cbk] 0-glusterfs: No change in
>>>>>>> volfile, continuing
>>>>>>> this line seems weird to me, to be honest.
>>>>>>> I do not see any traffic on the switch interfaces between the gluster
>>>>>>> servers, which means there is no syncing between them.
>>>>>>> I tried to ls -l the files on the client and the servers to trigger
>>>>>>> the healing, but seemingly without success. Should I wait longer?
>>>>>>>
>>>>>>> Yes, it should take around 10-15 minutes. Could you provide
>>>>>>> 'getfattr -d -m. -e hex <file-on-brick>' output from both bricks?
>>>>>>>
>>>>>>> Pranith
>>>>>>>
>>>>>>> 2014-08-05 11:25 GMT+03:00 Pranith Kumar Karampuri <[email protected]>:
>>>>>>>
>>>>>>>> On 08/05/2014 01:10 PM, Roman wrote:
>>>>>>>>
>>>>>>>> Aha! For some reason I was not able to start the VM anymore;
>>>>>>>> Proxmox VE told me that it is not able to read the qcow2 header
>>>>>>>> because permission is denied. So I just deleted that file and created
>>>>>>>> a new VM. And the next message I got was this:
>>>>>>>>
>>>>>>>> Seems like these are the messages from the run where you took down
>>>>>>>> the bricks before the self-heal. Could you restart the run, waiting
>>>>>>>> for self-heals to complete before taking down the next brick?
>>>>>>>>
>>>>>>>> Pranith
>>>>>>>>
>>>>>>>> [2014-08-05 07:31:25.663412] E
>>>>>>>> [afr-self-heal-common.c:197:afr_sh_print_split_brain_log]
>>>>>>>> 0-HA-fast-150G-PVE1-replicate-0: Unable to self-heal contents of
>>>>>>>> '/images/124/vm-124-disk-1.qcow2' (possible split-brain). Please
>>>>>>>> delete the file from all but the preferred subvolume.- Pending
>>>>>>>> matrix: [ [ 0 60 ] [ 11 0 ] ]
>>>>>>>> [2014-08-05 07:31:25.663955] E
>>>>>>>> [afr-self-heal-common.c:2262:afr_self_heal_completion_cbk]
>>>>>>>> 0-HA-fast-150G-PVE1-replicate-0: background data self-heal failed on
>>>>>>>> /images/124/vm-124-disk-1.qcow2
>>>>>>>>
>>>>>>>> 2014-08-05 10:13 GMT+03:00 Pranith Kumar Karampuri <[email protected]>:
>>>>>>>>
>>>>>>>>> I just responded to your earlier mail about what the log looks like.
>>>>>>>>> The message appears in the mount's logfile.
>>>>>>>>>
>>>>>>>>> Pranith
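Since the self-heal daemon is off in this setup, the FUSE mount is the only process that heals, and it only does so when the file is accessed. A sketch of forcing that access from the client, assuming the Proxmox mountpoint shown earlier:

    # An explicit stat through the mount should queue a background heal
    stat /mnt/pve/FAST-TESt/images/127/vm-127-disk-1.qcow2 > /dev/null
    # Or walk the whole mount to trigger heals for every file
    find /mnt/pve/FAST-TESt -print0 | xargs -0 stat > /dev/null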
>>>>>>>>>
>>>>>>>>> On 08/05/2014 12:41 PM, Roman wrote:
>>>>>>>>>
>>>>>>>>> Ok, so I've waited long enough, I think. There was no traffic at all
>>>>>>>>> on the switch ports between the servers, and I could not find any
>>>>>>>>> suitable log message about a completed self-heal (waited about 30
>>>>>>>>> minutes). Plugged out the other server's UTP cable this time and got
>>>>>>>>> into the same situation:
>>>>>>>>> root@gluster-test1:~# cat /var/log/dmesg
>>>>>>>>> -bash: /bin/cat: Input/output error
>>>>>>>>>
>>>>>>>>> brick logs:
>>>>>>>>> [2014-08-05 07:09:03.005474] I [server.c:762:server_rpc_notify]
>>>>>>>>> 0-HA-fast-150G-PVE1-server: disconnecting connection from
>>>>>>>>> pve1-27649-2014/08/04-13:27:54:720789-HA-fast-150G-PVE1-client-0-0
>>>>>>>>> [2014-08-05 07:09:03.005530] I
>>>>>>>>> [server-helpers.c:729:server_connection_put]
>>>>>>>>> 0-HA-fast-150G-PVE1-server: Shutting down connection
>>>>>>>>> pve1-27649-2014/08/04-13:27:54:720789-HA-fast-150G-PVE1-client-0-0
>>>>>>>>> [2014-08-05 07:09:03.005560] I
>>>>>>>>> [server-helpers.c:463:do_fd_cleanup] 0-HA-fast-150G-PVE1-server: fd
>>>>>>>>> cleanup on /images/124/vm-124-disk-1.qcow2
>>>>>>>>> [2014-08-05 07:09:03.005797] I
>>>>>>>>> [server-helpers.c:617:server_connection_destroy]
>>>>>>>>> 0-HA-fast-150G-PVE1-server: destroyed connection of
>>>>>>>>> pve1-27649-2014/08/04-13:27:54:720789-HA-fast-150G-PVE1-client-0-0
>>>>>>>>>
>>>>>>>>> 2014-08-05 9:53 GMT+03:00 Pranith Kumar Karampuri <[email protected]>:
>>>>>>>>>
>>>>>>>>>> Do you think it is possible for you to do these tests on the latest
>>>>>>>>>> version, 3.5.2? 'gluster volume heal <volname> info' would give you
>>>>>>>>>> that information in versions > 3.5.1.
>>>>>>>>>> Otherwise you will have to check it from the logs (there will be a
>>>>>>>>>> self-heal-completed message in the mount logs) or by observing
>>>>>>>>>> 'getfattr -d -m. -e hex <image-file-on-bricks>'.
>>>>>>>>>>
>>>>>>>>>> Pranith
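For reference, the monitoring Pranith refers to would look roughly like this; the heal info command needs 3.5.1 or later, while the getfattr fallback works on 3.4.x:

    # On 3.5.1+: list files with pending heals
    gluster volume heal HA-fast-150G-PVE1 info
    # On 3.4.x: watch the pending counters on each brick; all-zero
    # trusted.afr.* values on both bricks mean no heal is pending
    getfattr -d -m. -e hex /exports/fast-test/150G/images/127/vm-127-disk-1.qcow2 | grep trusted.afr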
>>>>>>>>>>
>>>>>>>>>> On 08/05/2014 12:09 PM, Roman wrote:
>>>>>>>>>>
>>>>>>>>>> Ok, I understand. I will try this shortly.
>>>>>>>>>> How can I be sure that the healing process is done if I am not able
>>>>>>>>>> to see its status?
>>>>>>>>>>
>>>>>>>>>> 2014-08-05 9:30 GMT+03:00 Pranith Kumar Karampuri <[email protected]>:
>>>>>>>>>>
>>>>>>>>>>> Mounts will do the healing, not the self-heal daemon. The problem,
>>>>>>>>>>> I feel, is that whichever process does the healing must have the
>>>>>>>>>>> latest information about the good bricks in this usecase. Since for
>>>>>>>>>>> the VM usecase the mounts have the latest information, we should
>>>>>>>>>>> let the mounts do the healing. If the mount accesses the VM image,
>>>>>>>>>>> either because someone does operations inside the VM or because of
>>>>>>>>>>> an explicit stat on the file, it should do the healing.
>>>>>>>>>>>
>>>>>>>>>>> Pranith.
>>>>>>>>>>>
>>>>>>>>>>> On 08/05/2014 10:39 AM, Roman wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hmmm, you told me to turn it off. Did I understand something wrong?
>>>>>>>>>>> After I issued the command you sent me, I was not able to watch the
>>>>>>>>>>> healing process; it said it won't be healed because it's turned off.
>>>>>>>>>>>
>>>>>>>>>>> 2014-08-05 5:39 GMT+03:00 Pranith Kumar Karampuri <[email protected]>:
>>>>>>>>>>>
>>>>>>>>>>>> You didn't mention anything about self-healing. Did you wait until
>>>>>>>>>>>> the self-heal was complete?
>>>>>>>>>>>>
>>>>>>>>>>>> Pranith
>>>>>>>>>>>>
>>>>>>>>>>>> On 08/04/2014 05:49 PM, Roman wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi!
>>>>>>>>>>>> The result is pretty much the same. I set the switch port down for
>>>>>>>>>>>> the 1st server; that was ok. Then I set it back up and set the
>>>>>>>>>>>> other server's port off, and it triggered an IO error on two
>>>>>>>>>>>> virtual machines: one with a local root FS but network-mounted
>>>>>>>>>>>> storage, and the other with a network root FS. The 1st gave an
>>>>>>>>>>>> error on copying to or from the mounted network disk; the other
>>>>>>>>>>>> gave me an error even for reading log files:
>>>>>>>>>>>>
>>>>>>>>>>>> cat: /var/log/alternatives.log: Input/output error
>>>>>>>>>>>>
>>>>>>>>>>>> Then I reset the KVM VM and it said there is no boot device. Next
>>>>>>>>>>>> I virtually powered it off and back on, and it booted.
>>>>>>>>>>>>
>>>>>>>>>>>> By the way, did I have to stop/start the volume?
>>>>>>>>>>>>
>>>>>>>>>>>> >> Could you do the following and test it again?
>>>>>>>>>>>> >> gluster volume set <volname> cluster.self-heal-daemon off
>>>>>>>>>>>> >>
>>>>>>>>>>>> >> Pranith
>>>>>>>>>>>>
>>>>>>>>>>>> 2014-08-04 14:10 GMT+03:00 Pranith Kumar Karampuri <[email protected]>:
>>>>>>>>>>>>
>>>>>>>>>>>>> On 08/04/2014 03:33 PM, Roman wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hello!
>>>>>>>>>>>>>
>>>>>>>>>>>>> Facing the same problem as mentioned here:
>>>>>>>>>>>>> http://supercolony.gluster.org/pipermail/gluster-users/2014-April/039959.html
>>>>>>>>>>>>>
>>>>>>>>>>>>> My setup is up and running, so I'm ready to help you back with
>>>>>>>>>>>>> feedback.
>>>>>>>>>>>>>
>>>>>>>>>>>>> setup:
>>>>>>>>>>>>> proxmox server as client
>>>>>>>>>>>>> 2 gluster physical servers
>>>>>>>>>>>>>
>>>>>>>>>>>>> Both the server side and the client side are running glusterfs
>>>>>>>>>>>>> 3.4.4 from the gluster repo at the moment.
>>>>>>>>>>>>>
>>>>>>>>>>>>> the problem is:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1. created replica bricks.
>>>>>>>>>>>>> 2. mounted in proxmox (tried both proxmox ways: via the GUI and
>>>>>>>>>>>>> via fstab (with a backup volume line); btw, while mounting via
>>>>>>>>>>>>> fstab I'm unable to launch a VM without cache, even though
>>>>>>>>>>>>> direct-io-mode is enabled in the fstab line)
>>>>>>>>>>>>> 3. installed a VM
>>>>>>>>>>>>> 4. brought one brick down - ok
>>>>>>>>>>>>> 5. brought it up and waited for the sync to finish.
>>>>>>>>>>>>> 6. brought the other brick down - got IO errors on the VM guest
>>>>>>>>>>>>> and was not able to restore the VM after resetting it via the
>>>>>>>>>>>>> host; it says (no bootable media). After I shut it down (forced)
>>>>>>>>>>>>> and bring it back up, it boots.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Could you do the following and test it again?
>>>>>>>>>>>>> gluster volume set <volname> cluster.self-heal-daemon off
>>>>>>>>>>>>>
>>>>>>>>>>>>> Pranith
>>>>>>>>>>>>>
>>>>>>>>>>>>> Need help. Tried 3.4.3, 3.4.4.
>>>>>>>>>>>>> Still missing pkgs for 3.4.5 for debian and for 3.5.2 (3.5.1
>>>>>>>>>>>>> always gives a healing error for some reason).
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>> Roman.
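For completeness, a hypothetical fstab line matching the "fstab with a backup volume line" variant from step 2 of the first post; the mountpoint and exact option spellings are assumptions based on the thread, not a quote from the original setup:

    # /etc/fstab: glusterfs FUSE mount with a fallback volfile server
    stor1:HA-fast-150G-PVE1 /mnt/pve/FAST-TESt glusterfs defaults,_netdev,backupvolfile-server=stor2,direct-io-mode=enable 0 0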
--
Best regards,
Roman.
_______________________________________________
Gluster-users mailing list
[email protected]
http://supercolony.gluster.org/mailman/listinfo/gluster-users
