On 11-09-15 14:43, Dmytro Shevchenko wrote:
> Thanks a lot Wido! Any chance to find out why the management server
> decided that it had lost the connection to the agent after those
> exceptions? It's not as critical as the bug with 16 snapshots, but
> during the last week we caught a situation where the Agent failed to
> unprotect a snapshot and raised an exception, and this was the reason
> for a disconnection a bit later. (It is not clear why CS decided to
> remove that volume; it was a template with one 'gold' snapshot and
> several active clones.)
>
No, I didn't look at CS at all. I just spent the day improving the RADOS
bindings.

Wido

> On 09/11/2015 03:20 PM, Wido den Hollander wrote:
>>
>> On 11-09-15 10:19, Wido den Hollander wrote:
>>>
>>> On 10-09-15 23:15, Andrija Panic wrote:
>>>> Wido,
>>>>
>>>> could you maybe follow what my colleague Dmytro just sent?
>>>>
>>> Yes, seems logical.
>>>
>>>> It's not only a matter of fixing rados-java (the 16-snapshot limit) -
>>>> it seems that for any RBD exception, ACS will freak out...
>>>>
>>> No, an RbdException will be caught, but the RADOS bindings shouldn't
>>> throw a NegativeArraySizeException in any case.
>>>
>>> That's the main problem.
>>>
>> Seems to be fixed with this commit:
>> https://github.com/ceph/rados-java/commit/5584f3961c95d998d2a9eff947a5b7b4d4ba0b64
>>
>> Just tested it with 256 snapshots:
>>
>> -------------------------------------------------------
>>  T E S T S
>> -------------------------------------------------------
>> Running com.ceph.rbd.TestRbd
>> Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 521.014 sec
>>
>> Results :
>>
>> Tests run: 1, Failures: 0, Errors: 0, Skipped: 0
>>
>> The bindings should now be capable of listing more than 16 snapshots.
>>
>> You can build the bindings manually and replace rados.jar on your
>> running systems.
>>
>> For 4.6 I'll try to get the updated rados-java included.
>>
>> Wido
>>
>>> Wido
>>>
>>>> Thx
>>>>
>>>> On 10 September 2015 at 17:06, Dmytro Shevchenko <
>>>> dmytro.shevche...@safeswisscloud.com> wrote:
>>>>
>>>>> Hello everyone, some clarification about this. Configuration:
>>>>> CS: 4.5.1
>>>>> Primary storage: Ceph
>>>>>
>>>>> Actually we have 2 separate bugs:
>>>>>
>>>>> 1. When you remove a volume with more than 16 snapshots (destroyed or
>>>>> active, it doesn't matter - they are always present on Ceph), the next
>>>>> storage garbage collector cycle invokes 'deletePhysicalDisk' from
>>>>> LibvirtStorageAdaptor.java. On line 854 we call the snapshot listing
>>>>> in the external rados-java library and get an exception.
>>>>>
>>>>> https://github.com/apache/cloudstack/blob/4.5.1/plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/LibvirtStorageAdaptor.java#L854
>>>>>
>>>>> This exception is not caught in the current function, but the Agent
>>>>> DOES NOT CRASH at this moment and continues working fine. The Agent
>>>>> forms a proper answer to the server and sends it; the text in the
>>>>> answer is a Java stack trace.
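For context on the fix Wido references above: librbd's rbd_snap_list() signals a too-small caller buffer by returning -ERANGE and writing the required entry count back through its max_snaps argument. A binding that sizes its buffer once (for 16 entries) and never handles that error path will fail exactly as in the stack trace below. What follows is a minimal, runnable sketch of the grow-and-retry loop such a fix needs; fakeSnapList() is a stand-in with librbd semantics, not the real JNA call in rados-java:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    public class SnapListRetryDemo {
        static final int ERANGE = 34; // errno: result too large

        // Stand-in for the JNA call to librbd's rbd_snap_list(): returns
        // the snapshot count on success, or -ERANGE when the buffer is too
        // small, writing the required size back through needed[0].
        static int fakeSnapList(String[] snapsOnDisk, String[] buf, int[] needed) {
            if (buf.length < snapsOnDisk.length) {
                needed[0] = snapsOnDisk.length;
                return -ERANGE;
            }
            System.arraycopy(snapsOnDisk, 0, buf, 0, snapsOnDisk.length);
            return snapsOnDisk.length;
        }

        // The grow-and-retry pattern: never trust the initial guess of 16.
        static List<String> snapList(String[] snapsOnDisk) {
            int size = 16; // initial guess only, not a hard limit
            while (true) {
                String[] buf = new String[size];
                int[] needed = { size };
                int ret = fakeSnapList(snapsOnDisk, buf, needed);
                if (ret >= 0)
                    return new ArrayList<>(Arrays.asList(buf).subList(0, ret));
                if (ret == -ERANGE) {
                    size = needed[0]; // retry with the size librbd reported
                    continue;
                }
                throw new RuntimeException("rbd_snap_list failed: errno " + -ret);
            }
        }

        public static void main(String[] args) {
            String[] snaps = new String[50];
            for (int i = 0; i < 50; i++) snaps[i] = "snap-" + i;
            System.out.println(snapList(snaps).size() + " snapshots listed");
        }
    }

With the retry in place, listing 50 (or 256) snapshots simply takes one extra native call instead of corrupting the array size.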
>>>>> Log from the Agent side:
>>>>>
>>>>> 2015-09-10 02:32:35,312 DEBUG [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-4:null) Trying to fetch storage pool 33ebaf83-5d09-3038-b63b-742e759a992e from libvirt
>>>>> 2015-09-10 02:32:35,431 INFO [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-4:null) Attempting to remove volume 4c6a2092-056c-4446-a2ca-d6bba9f7f7f8 from pool 33ebaf83-5d09-3038-b63b-742e759a992e
>>>>> 2015-09-10 02:32:35,431 INFO [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-4:null) Unprotecting and Removing RBD snapshots of image cloudstack-storage/4c6a2092-056c-4446-a2ca-d6bba9f7f7f8 prior to removing the image
>>>>> 2015-09-10 02:32:35,436 DEBUG [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-4:null) Succesfully connected to Ceph cluster at 10.10.1.26:6789
>>>>> 2015-09-10 02:32:35,454 DEBUG [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-4:null) Fetching list of snapshots of RBD image cloudstack-storage/4c6a2092-056c-4446-a2ca-d6bba9f7f7f8
>>>>> 2015-09-10 02:32:35,457 WARN [cloud.agent.Agent] (agentRequest-Handler-4:null) Caught: java.lang.NegativeArraySizeException
>>>>>     at com.ceph.rbd.RbdImage.snapList(Unknown Source)
>>>>>     at com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.deletePhysicalDisk(LibvirtStorageAdaptor.java:854)
>>>>>     at com.cloud.hypervisor.kvm.storage.LibvirtStoragePool.deletePhysicalDisk(LibvirtStoragePool.java:175)
>>>>>     at com.cloud.hypervisor.kvm.storage.KVMStorageProcessor.deleteVolume(KVMStorageProcessor.java:1206)
>>>>> 2015-09-10 02:32:35,458 DEBUG [cloud.agent.Agent] (agentRequest-Handler-4:null) Seq 1-1743737480722513946: { Ans: , MgmtId: 90520739779588, via: 1, Ver: v1, Flags: 10, [{"com.cloud.agent.api.Answer":{"result":false,"details":"java.lang.NegativeArraySizeException\n\tat com.ceph.rbd.RbdImage.snapList(Unknown Source)\n\tat com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.deletePhysicalDisk(LibvirtStorageAdaptor.java:854)\n\tat com.cloud.hypervisor.kvm.storage.LibvirtStoragePool.deletePhysicalDisk(LibvirtStoragePool.java:175)\n\tat com.cloud.hypervisor.kvm.storage.KVMStorageProcessor.deleteVolume(KVMStorageProcessor.java:1206)\n\tat com.cloud.storage.resource.StorageSubsystemCommandHandlerBase.execute(StorageSubsystemCommandHandlerBase.java:124)\n\tat com.cloud.storage.re.....
>>>>>
>>>>> So this volume and its snapshots will never be removed.
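The stack trace above also shows why the cleanup never completes: the code around line 854 anticipates RbdException, but NegativeArraySizeException is unchecked and escapes the method, so the delete is abandoned every garbage-collector cycle. A defensive variant would report unexpected runtime failures from the bindings instead of letting them escape. This is a sketch under that assumption, not the actual 4.5.1 code; RbdImage and RbdSnapInfo are the rados-java types LibvirtStorageAdaptor already uses:

    import com.ceph.rbd.RbdException;
    import com.ceph.rbd.RbdImage;
    import com.ceph.rbd.jna.RbdSnapInfo;
    import java.util.List;

    // Sketch: unprotect and remove every snapshot of an image, converting
    // unexpected runtime errors from the bindings into a reported failure.
    public class SnapshotCleanupSketch {
        public static void removeAllSnapshots(RbdImage image) throws RbdException {
            List<RbdSnapInfo> snaps;
            try {
                snaps = image.snapList();
            } catch (RuntimeException e) {
                // e.g. the NegativeArraySizeException from the broken
                // snapList(); fail the command instead of leaking it.
                throw new RbdException("Unable to list RBD snapshots: " + e);
            }
            for (RbdSnapInfo snap : snaps) {
                if (image.snapIsProtected(snap.name)) {
                    image.snapUnprotect(snap.name);
                }
                image.snapRemove(snap.name);
            }
        }
    }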
>>>>>
>>>>> 2. Second bug. Experimentally it was found that 50 minutes after the
>>>>> exception on the Agent, for some unknown reason the Management server
>>>>> decided it had lost the connection to this agent, started the HA
>>>>> process, and started the Agent process again.
>>>>> Log on the Agent side:
>>>>> 2015-09-10 02:57:12,664 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:null) Executing: /bin/bash -c free|grep Mem:|awk '{print $2}'
>>>>> 2015-09-10 02:57:12,667 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:null) Execution is successful.
>>>>> 2015-09-10 02:57:40,502 DEBUG [kvm.resource.LibvirtComputingResource] (UgentTask-5:null) Executing: /usr/share/cloudstack-common/scripts/vm/network/security_group.py get_rule_logs_for_vms
>>>>> 2015-09-10 02:57:40,572 DEBUG [kvm.resource.LibvirtComputingResource] (UgentTask-5:null) Execution is successful.
>>>>> 2015-09-10 02:57:54,135 INFO [cloud.agent.AgentShell] (main:null) Agent started
>>>>> 2015-09-10 02:57:54,136 INFO [cloud.agent.AgentShell] (main:null) Implementation Version is 4.5.1
>>>>> 2015-09-10 02:57:54,138 INFO [cloud.agent.AgentShell] (main:null) agent.properties found at /etc/cloudstack/agent/agent.properties
>>>>> .....
>>>>>
>>>>> Log on the Server side:
>>>>> 2015-09-10 02:57:53,710 INFO [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-2127ada4) Investigating why host 1 has disconnected with event AgentDisconnected
>>>>> 2015-09-10 02:57:53,714 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-2127ada4) checking if agent (1) is alive
>>>>> 2015-09-10 02:57:53,723 DEBUG [c.c.a.t.Request] (AgentTaskPool-1:ctx-2127ada4) Seq 1-1743737480722513988: Sending { Cmd , MgmtId: 90520739779588, via: 1(ix1-c7-2), Ver: v1, Flags: 100011, [{"com.cloud.agent.api.CheckHealthCommand":{"wait":50}}] }
>>>>> 2015-09-10 02:57:53,724 INFO [c.c.a.m.AgentAttache] (AgentTaskPool-1:ctx-2127ada4) Seq 1-1743737480722513988: Unable to send due to Resource [Host:1] is unreachable: Host 1: Channel is closed
>>>>> 2015-09-10 02:57:53,724 DEBUG [c.c.a.m.AgentAttache] (AgentTaskPool-1:ctx-2127ada4) Seq 1-1743737480722513988: Cancelling.
>>>>> 2015-09-10 02:57:53,724 WARN [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-2127ada4) Resource [Host:1] is unreachable: Host 1: Channel is closed
>>>>> 2015-09-10 02:57:53,728 DEBUG [c.c.h.HighAvailabilityManagerImpl] (AgentTaskPool-1:ctx-2127ada4) SimpleInvestigator unable to determine the state of the host. Moving on.
>>>>>
>>>>> It looks like a connection problem, but it appears only when we have
>>>>> this RBD exception on the agent side, and only with this node. I tried
>>>>> playing with the "storage.cleanup.interval" parameter and set it to 5
>>>>> minutes; now we get the exception on the agent side every 5 minutes,
>>>>> but the disconnects still happen every 50 minutes and I can't find out
>>>>> why.
>>>>>
>>>>> On 09/10/2015 03:21 PM, Andrija Panic wrote:
>>>>>
>>>>>> Thx Wido,
>>>>>>
>>>>>> I will have my colleagues Igor and Dmytro join with details on this.
>>>>>>
>>>>>> I agree we need a fix upstream, that is the main purpose from our side!
>>>>>>
>>>>>> With this temp fix we just avoid the agent crashing (the agent somehow
>>>>>> restarts again fine :) ), but VMs also go down on that host, at least
>>>>>> some of them.
>>>>>>
>>>>>> Do you see any lifecycle/workflow issue if we implement deleting the
>>>>>> snapshot from CEPH after you snapshot a volume in ACS and successfully
>>>>>> move it to Secondary NFS - or perhaps only delete the snapshot from
>>>>>> CEPH as part of the actual snapshot deletion (when you, manually or
>>>>>> via scheduled snapshots, delete a snapshot from the DB and NFS)? Maybe
>>>>>> the second option is better; I don't know how you guys handle this for
>>>>>> regular NFS as primary storage etc...
>>>>>>
>>>>>> Any guidance is most welcome, and our team will try to code all this.
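Either variant of the workflow Andrija describes is expressible with rados-java alone; no external script is needed. Here is a sketch of a complete image removal that first unprotects and deletes every snapshot. It assumes cephx authentication with the "admin" user; the monitor address is taken from the logs above, and the key is a placeholder:

    import com.ceph.rados.IoCTX;
    import com.ceph.rados.Rados;
    import com.ceph.rbd.Rbd;
    import com.ceph.rbd.RbdImage;
    import com.ceph.rbd.jna.RbdSnapInfo;
    import java.util.List;

    public class RbdCleanup {
        // Remove every snapshot of an RBD image, then the image itself.
        public static void deleteImageWithSnapshots(String pool, String imageName)
                throws Exception {
            Rados rados = new Rados("admin");
            rados.confSet("mon_host", "10.10.1.26:6789"); // from the log above
            rados.confSet("key", "<cephx key>");          // placeholder
            rados.connect();
            IoCTX io = rados.ioCtxCreate(pool);
            try {
                Rbd rbd = new Rbd(io);
                RbdImage image = rbd.open(imageName);
                try {
                    List<RbdSnapInfo> snaps = image.snapList();
                    for (RbdSnapInfo snap : snaps) {
                        if (image.snapIsProtected(snap.name)) {
                            image.snapUnprotect(snap.name); // 'gold' snapshots
                        }
                        image.snapRemove(snap.name);
                    }
                } finally {
                    rbd.close(image);
                }
                rbd.remove(imageName); // image must have no snapshots left
            } finally {
                rados.ioCtxDestroy(io);
            }
        }
    }

Note that rbd.remove() fails while snapshots remain, which is exactly why the snapshot cleanup has to precede the volume deletion in either workflow.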
>>>>>>
>>>>>> Thx Wido again
>>>>>>
>>>>>> On 10 September 2015 at 14:14, Wido den Hollander <w...@widodh.nl> wrote:
>>>>>>
>>>>>> On 10-09-15 14:07, Andrija Panic wrote:
>>>>>> > Wido,
>>>>>> >
>>>>>> > the part of the code where you want to delete some volume checks if
>>>>>> > the volume is of type RBD - and then tries to list snapshots, delete
>>>>>> > snapshots, and finally remove the image. Here the first step -
>>>>>> > listing snapshots - fails if there are more than 16 snapshots
>>>>>> > present - the number 16 is hardcoded elsewhere in the code - and
>>>>>> > throws an RBD exception... then the agent crashes... and then VMs
>>>>>> > go down etc.
>>>>>> >
>>>>>>
>>>>>> Hmmm, that seems like a bug in rados-java indeed. I don't know if
>>>>>> there is a release of rados-java where this is fixed.
>>>>>>
>>>>>> Looking at the code of rados-java it should be, but I'm not 100%
>>>>>> certain.
>>>>>>
>>>>>> > So our current quick fix is to invoke an external script which will
>>>>>> > also list and remove all snapshots, but will not fail.
>>>>>> >
>>>>>>
>>>>>> Yes, but we should fix it upstream. I understand that you will use a
>>>>>> temp script to clean up everything.
>>>>>>
>>>>>> > I'm not sure why 16 is the hardcoded limit - I will try to provide
>>>>>> > the part of the code where this is present... we can increase this
>>>>>> > number (from 16 to e.g. 200), but it doesn't make any sense, since
>>>>>> > we still have a lot of garbage left on CEPH (snapshots that were
>>>>>> > removed in ACS (DB and Secondary NFS) but not removed from CEPH).
>>>>>> > And in my understanding this needs to be implemented, so we don't
>>>>>> > catch any of the exceptions that I originally described...
>>>>>> >
>>>>>> > Any thoughts on this?
>>>>>> >
>>>>>>
>>>>>> A cleanup script for now should be OK indeed. Afterwards the Java
>>>>>> code should be able to do this.
>>>>>>
>>>>>> You can try manually by using rados-java and fix that.
>>>>>>
>>>>>> This is the part where the listing is done:
>>>>>>
>>>>>> https://github.com/ceph/rados-java/blob/master/src/main/java/com/ceph/rbd/RbdImage.java
>>>>>>
>>>>>> Wido
>>>>>>
>>>>>> > Thx for input!
>>>>>> >
>>>>>> > On 10 September 2015 at 13:56, Wido den Hollander <w...@widodh.nl> wrote:
>>>>>> >
>>>>>> >>
>>>>>> >> On 10-09-15 12:17, Andrija Panic wrote:
>>>>>> >>> We are testing some [dirty?] patch on our dev system and we shall
>>>>>> >>> soon share it for review.
>>>>>> >>>
>>>>>> >>> Basically, we are using an external python script that is invoked
>>>>>> >>> in some part of the code execution to delete the needed CEPH
>>>>>> >>> snapshots, and then it proceeds with the volume deletion etc...
>>>>>> >>>
>>>>>> >>
>>>>>> >> That shouldn't be required. The Java bindings for librbd and
>>>>>> >> librados should be able to remove the snapshots.
>>>>>> >>
>>>>>> >> There is no need to invoke external code; this can all be handled
>>>>>> >> in Java.
>>>>>> >>
>>>>>> >>> On 10 September 2015 at 11:26, Andrija Panic
>>>>>> >>> <andrija.pa...@gmail.com> wrote:
>>>>>> >>>
>>>>>> >>>> Eh, OK. Thx for the info.
>>>>>> >>>>
>>>>>> >>>> BTW, why is the 16-snapshot limit hardcoded - any reason for that?
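On the "why 16" question: the thread indicates the bindings sized their initial native buffer for 16 entries, and the real defect was in the error path rather than in the limit itself. A hypothetical, deliberately minimal illustration (not the actual rados-java source) of how a mishandled -ERANGE becomes the exception from the agent log:

    // Hypothetical illustration, not the rados-java source: librbd returns
    // a negative errno (-ERANGE, i.e. -34) once more snapshots exist than
    // the assumed 16. Using that return value as an array length is the bug.
    public class NegativeSizeDemo {
        public static void main(String[] args) {
            int ret = -34; // what rbd_snap_list reports with >16 snapshots
            String[] snaps = new String[ret]; // java.lang.NegativeArraySizeException
        }
    }

So raising the constant to 200 would only move the cliff; the retry loop shown earlier removes it.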
>>>>>> >>>>
>>>>>> >>>> Not cleaning snapshots on CEPH and trying to delete a volume
>>>>>> >>>> after having more than 16 snapshots in CEPH = the Agent crashing
>>>>>> >>>> on the KVM side... and some VMs being rebooted etc. - which
>>>>>> >>>> means downtime :|
>>>>>> >>>>
>>>>>> >>>> Thanks,
>>>>>> >>>>
>>>>>> >>>> On 9 September 2015 at 22:05, Simon Weller <swel...@ena.com> wrote:
>>>>>> >>>>
>>>>>> >>>>> Andrija,
>>>>>> >>>>>
>>>>>> >>>>> The Ceph snapshot deletion is not currently implemented.
>>>>>> >>>>>
>>>>>> >>>>> See: https://issues.apache.org/jira/browse/CLOUDSTACK-8302
>>>>>> >>>>>
>>>>>> >>>>> - Si
>>>>>> >>>>>
>>>>>> >>>>> ________________________________________
>>>>>> >>>>> From: Andrija Panic <andrija.pa...@gmail.com>
>>>>>> >>>>> Sent: Wednesday, September 9, 2015 3:03 PM
>>>>>> >>>>> To: dev@cloudstack.apache.org; us...@cloudstack.apache.org
>>>>>> >>>>> Subject: ACS 4.5 - volume snapshots NOT removed from CEPH (only
>>>>>> >>>>> from Secondary NFS and DB)
>>>>>> >>>>>
>>>>>> >>>>> Hi folks,
>>>>>> >>>>>
>>>>>> >>>>> we encountered an issue in ACS 4.5.1 (perhaps other versions are
>>>>>> >>>>> also affected) - when we delete some snapshot (volume snapshot)
>>>>>> >>>>> in ACS, ACS marks it as deleted in the DB and deletes it from
>>>>>> >>>>> NFS Secondary Storage, but fails to delete the snapshot on CEPH
>>>>>> >>>>> primary storage (it doesn't even try to delete it, AFAIK).
>>>>>> >>>>>
>>>>>> >>>>> So we end up having 5 live snapshots in the DB (just an example)
>>>>>> >>>>> but actually in CEPH there are more than i.e. 16 snapshots.
>>>>>> >>>>>
>>>>>> >>>>> Worse, when the ACS agent tries to obtain the list of snapshots
>>>>>> >>>>> from CEPH for some volume - if the number of snapshots is over
>>>>>> >>>>> 16, it raises an exception (and perhaps this is the reason the
>>>>>> >>>>> Agent crashed for us - I need to check with my colleagues who
>>>>>> >>>>> are investigating this in detail). This number 16 is for
>>>>>> >>>>> whatever reason hardcoded in the ACS code.
>>>>>> >>>>>
>>>>>> >>>>> Wondering if anyone has experienced this, or has any info - we
>>>>>> >>>>> plan to try to fix this, and I will include my dev colleagues
>>>>>> >>>>> here, but we might need some help, at least for guidance.
>>>>>> >>>>>
>>>>>> >>>>> Any help is really appreciated, or at least confirmation that
>>>>>> >>>>> this is a known issue etc.
>>>>>> >>>>>
>>>>>> >>>>> Thanks,
>>>>>> >>>>>
>>>>>> >>>>> --
>>>>>> >>>>>
>>>>>> >>>>> Andrija Panić
>>>>>> >>>>
>>>>>> >>>> --
>>>>>> >>>>
>>>>>> >>>> Andrija Panić
>>>>>>
>>>>>> --
>>>>>>
>>>>>> Andrija Panić
>>>>>
>>>>> --
>>>>> ---
>>>>> Best regards
>>>>> Dmytro Shevchenko
>>>>> dshevchenko.m...@gmail.com
>>>>> skype: demonsh_mk
>>>>> +380(66)2426648