On 15-09-15 13:56, Dmytro Shevchenko wrote:
> Hello Wido, I saw you updated this code again. Do you know what the
> procedure is for getting a new build of this library into the Apache
> Maven repository? http://repo.maven.apache.org/maven2/com/ceph/rados/ still
> only has the old 0.1.4 version, so it's impossible to recompile CloudStack
> with the new patches. Of course we can download the source code from GitHub,
> compile it and replace the 'jar' file in production, but that is a dirty
> hack and not acceptable for continuous integration.
> 

It's up to me to do a new release of rados-java and I haven't done that
yet since I wanted to know for sure if the code works.

While writing some code for libvirt yesterday I came up with a better
solution for rados-java as well.

https://www.redhat.com/archives/libvir-list/2015-September/msg00458.html

For now you can replace 'rados.jar' on the production systems, but for
4.6 I want to make sure we depend on a new, to-be-released version of
rados-java.

Wido

> 
> ---
> Best regards
> Dmytro Shevchenko
> dshevchenko.m...@gmail.com
> skype: demonsh_mk
> 
> 
> On 09/12/2015 06:16 PM, Wido den Hollander wrote:
>> On 09/11/2015 05:08 PM, Andrija Panic wrote:
>>> Thx a lot Wido!!! We will patch this. For my understanding: is this a
>>> "temporary" solution, since it raises the limit to 256 snaps? Or am I
>>> wrong? I mean, since we still don't have proper snapshot removal, after
>>> e.g. 3-6 months we will again have 256 snapshots of a single volume on
>>> CEPH?
>>>
>> No, it will also work with >256 snapshots. I've tested it with 256 and
>> that worked fine. I see no reason why it won't work with 1024 or 2048
>> for example.
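
For readers following along, the general pattern behind such a fix is to stop
treating the initial listing buffer size as a hard limit. Below is an
illustrative sketch only, not the actual commit; the NativeSnapLister
interface is a hypothetical stand-in for the JNA call into librbd's
rbd_snap_list, which reports the required buffer size when the caller's
buffer is too small.

    import java.util.Arrays;

    public class SnapListPattern {

        /** Hypothetical stand-in for the native rbd_snap_list call. */
        public interface NativeSnapLister {
            /**
             * Fills names with up to names.length entries. Returns the number
             * written, or a negative value when the buffer is too small, in
             * which case required[0] holds the needed capacity.
             */
            int list(String[] names, int[] required);
        }

        public static String[] listAll(NativeSnapLister nativeCall) {
            int capacity = 16; // starting guess only, no longer a hard limit
            while (true) {
                String[] names = new String[capacity];
                int[] required = new int[] { capacity };
                int rc = nativeCall.list(names, required);
                if (rc >= 0) {
                    return Arrays.copyOf(names, rc);
                }
                // Grow to what the native side reported (at least double) and
                // retry, instead of turning the negative return value into an
                // array size.
                capacity = Math.max(required[0], capacity * 2);
            }
        }
    }

With this shape there is no fixed ceiling, which is why 1024 or 2048 snapshots
should behave the same as 256.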
>>
>>> BTW we also have another exception that causes the same consequences - the
>>> agent disconnecting and VMs going down...
>>> As Dmytro explained, unprotecting a snapshot causes the same consequence...
>>>
>>>  From my understanding, any RBD exception might cause the Agent to
>>> disconnect (or actually the mgmt server to disconnect the agent)...
>>>
>>> Any clue on this, or any recommendation?
>>>
>> No, I don't have a clue. It could be that the job hangs somewhere inside
>> the Agent due to an uncaught exception though.
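
To illustrate that point, here is a hedged sketch (not existing CloudStack
code) of how a caller could keep an unexpected unchecked exception from the
bindings from escaping the request handler; it assumes only the rados-java
call seen in the stack traces quoted further down, RbdImage.snapList(), and
the RbdSnapInfo entries it returns (type names from the bindings).

    import java.util.Collections;
    import java.util.List;

    import com.ceph.rbd.RbdImage;
    import com.ceph.rbd.jna.RbdSnapInfo;

    public class SafeSnapList {

        /**
         * List snapshots without letting an unchecked exception (such as the
         * NegativeArraySizeException from the old bindings) escape into the
         * agent's request handler thread.
         */
        public static List<RbdSnapInfo> snapListSafely(RbdImage image) {
            try {
                return image.snapList();
            } catch (RuntimeException e) {
                // Defect in the bindings (e.g. the undersized listing buffer).
                return Collections.emptyList();
            } catch (Exception e) {
                // Error reported by librbd through the bindings (RbdException).
                return Collections.emptyList();
            }
        }
    }

Whether the caller should then skip the delete or fail the command cleanly is
a separate decision; the point is only that the handler thread survives.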
>>
>>> Thx a lot for fixing rados-java stuff !
>>>
>> You're welcome!
>>
>> Wido
>>
>>> Andrija
>>>
>>> On 11 September 2015 at 15:28, Wido den Hollander <w...@widodh.nl> wrote:
>>>
>>>> On 11-09-15 14:43, Dmytro Shevchenko wrote:
>>>>> Thanks a lot Wido! Any chance to find out why the management server
>>>>> decided that it had lost the connection to the agent after those
>>>>> exceptions? It's not as critical as this bug with 16 snapshots, but
>>>>> during the last week we caught a situation where the Agent failed to
>>>>> unprotect a snapshot and raised an exception, and that was the reason
>>>>> for a disconnection a bit later. (It is not clear why CS decided to
>>>>> remove that volume; it was a template with one 'gold' snapshot and
>>>>> several active clones.)
>>>>>
>>>> No, I didn't look at CS at all. I just spent the day improving the RADOS
>>>> bindings.
>>>>
>>>> Wido
>>>>
>>>>> On 09/11/2015 03:20 PM, Wido den Hollander wrote:
>>>>>> On 11-09-15 10:19, Wido den Hollander wrote:
>>>>>>> On 10-09-15 23:15, Andrija Panic wrote:
>>>>>>>> Wido,
>>>>>>>>
>>>>>>>> could you folow maybe what my colegue Dmytro just sent ?
>>>>>>>>
>>>>>>> Yes, seems logical.
>>>>>>>
>>>>>>>> It's not only a matter of fixing rados-java (the 16 snaps limit) - it
>>>>>>>> seems that for any RBD exception, ACS will freak out...
>>>>>>>>
>>>>>>> No, an RbdException will be caught, but the Rados bindings shouldn't
>>>>>>> throw a NegativeArraySizeException in any case.
>>>>>>>
>>>>>>> That's the main problem.
>>>>>>>
>>>>>> Seems to be fixed with this commit:
>>>>>>
>>>>>> https://github.com/ceph/rados-java/commit/5584f3961c95d998d2a9eff947a5b7b4d4ba0b64
>>>>>> Just tested it with 256 snapshots:
>>>>>>
>>>>>> -------------------------------------------------------
>>>>>>    T E S T S
>>>>>> -------------------------------------------------------
>>>>>> Running com.ceph.rbd.TestRbd
>>>>>> Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed:
>>>>>> 521.014 sec
>>>>>>
>>>>>> Results :
>>>>>>
>>>>>> Tests run: 1, Failures: 0, Errors: 0, Skipped: 0
>>>>>>
>>>>>> The bindings should now be capable of listing more than 16 snapshots.
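
For anyone who wants to verify a locally built jar against their own test
cluster before 4.6, something along these lines should do it. This is a
sketch rather than the TestRbd test itself; it assumes the rados-java calls
referenced in this thread (snapList) plus Rbd.create, snapCreate and
snapRemove, which these bindings provide as far as I know, and the monitor
address, pool name and key below are placeholders to adapt.

    import com.ceph.rados.IoCTX;
    import com.ceph.rados.Rados;
    import com.ceph.rbd.Rbd;
    import com.ceph.rbd.RbdImage;

    public class SnapListCheck {

        public static void main(String[] args) throws Exception {
            // Placeholder connection details: point these at a TEST pool.
            Rados rados = new Rados("admin");
            rados.confSet("mon_host", "10.10.1.26:6789");
            rados.confSet("key", "YOUR-CEPHX-KEY");
            rados.connect();

            IoCTX io = rados.ioCtxCreate("cloudstack-storage");
            try {
                Rbd rbd = new Rbd(io);
                String name = "rados-java-snaplist-check";
                rbd.create(name, 10 * 1024 * 1024); // small throwaway image

                RbdImage image = rbd.open(name);
                try {
                    // Anything above 16 used to trip the old bindings.
                    for (int i = 0; i < 32; i++) {
                        image.snapCreate("snap-" + i);
                    }
                    System.out.println("Listed " + image.snapList().size() + " snapshots");

                    // Remove the throwaway snapshots again.
                    for (int i = 0; i < 32; i++) {
                        image.snapRemove("snap-" + i);
                    }
                } finally {
                    rbd.close(image);
                }
                rbd.remove(name);
            } finally {
                rados.ioCtxDestroy(io);
            }
        }
    }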
>>>>>>
>>>>>> You can build the bindings manually and replace rados.jar on your
>>>>>> running systems.
>>>>>>
>>>>>> For 4.6 I'll try to get the updated rados-java included.
>>>>>>
>>>>>> Wido
>>>>>>
>>>>>>> Wido
>>>>>>>
>>>>>>>> THx
>>>>>>>>
>>>>>>>> On 10 September 2015 at 17:06, Dmytro Shevchenko <
>>>>>>>> dmytro.shevche...@safeswisscloud.com> wrote:
>>>>>>>>
>>>>>>>>> Hello everyone, some clarification about this. Configuration:
>>>>>>>>> CS: 4.5.1
>>>>>>>>> Primary storage: Ceph
>>>>>>>>>
>>>>>>>>> Actually we have 2 separate bugs:
>>>>>>>>>
>>>>>>>>> 1. When you remove a volume with more than 16 snapshots (it doesn't
>>>>>>>>> matter whether they are destroyed or active - they are always present
>>>>>>>>> on Ceph), the next storage garbage collector cycle invokes
>>>>>>>>> 'deletePhysicalDisk' from LibvirtStorageAdaptor.java. On line 854 it
>>>>>>>>> calls the snapshot listing in the external rados-java library and gets
>>>>>>>>> an exception:
>>>>>>>>>
>>>>>>>>> https://github.com/apache/cloudstack/blob/4.5.1/plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/LibvirtStorageAdaptor.java#L854
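
For context, the flow at that point in the adaptor is roughly the following
(a simplified sketch based on the code linked above, not a verbatim copy):
list all snapshots of the image, unprotect and remove each, close the image
and remove it. The snapList() call is the one that blows up once there are
more than 16 snapshots, and snapUnprotect() is the call that throws when a
snapshot is not actually protected - the other failure discussed in this
thread.

    import java.util.List;

    import com.ceph.rbd.Rbd;
    import com.ceph.rbd.RbdImage;
    import com.ceph.rbd.jna.RbdSnapInfo;

    public class RbdVolumeDelete {

        /**
         * Simplified sketch of the snapshot-cleanup-then-delete flow: list all
         * snapshots, unprotect and remove each, then remove the image itself.
         */
        public static void deleteWithSnapshots(Rbd rbd, String imageName) throws Exception {
            RbdImage image = rbd.open(imageName);
            try {
                // The listing below is what fails with >16 snapshots.
                List<RbdSnapInfo> snaps = image.snapList();
                for (RbdSnapInfo snap : snaps) {
                    image.snapUnprotect(snap.name); // throws if not protected
                    image.snapRemove(snap.name);
                }
            } finally {
                rbd.close(image);
            }
            rbd.remove(imageName);
        }
    }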
>>>>>>>>> This exception is not caught in the current function, but the Agent
>>>>>>>>> DOES NOT CRASH at this moment and continues working fine. The Agent
>>>>>>>>> forms a proper answer to the server and sends it; the text in the
>>>>>>>>> answer is the Java stack trace. Log from the Agent side:
>>>>>>>>>
>>>>>>>>> 2015-09-10 02:32:35,312 DEBUG [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-4:null) Trying to fetch storage pool 33ebaf83-5d09-3038-b63b-742e759a992e from libvirt
>>>>>>>>> 2015-09-10 02:32:35,431 INFO  [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-4:null) Attempting to remove volume 4c6a2092-056c-4446-a2ca-d6bba9f7f7f8 from pool 33ebaf83-5d09-3038-b63b-742e759a992e
>>>>>>>>> 2015-09-10 02:32:35,431 INFO  [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-4:null) Unprotecting and Removing RBD snapshots of image cloudstack-storage/4c6a2092-056c-4446-a2ca-d6bba9f7f7f8 prior to removing the image
>>>>>>>>> 2015-09-10 02:32:35,436 DEBUG [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-4:null) Succesfully connected to Ceph cluster at 10.10.1.26:6789
>>>>>>>>> 2015-09-10 02:32:35,454 DEBUG [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-4:null) Fetching list of snapshots of RBD image cloudstack-storage/4c6a2092-056c-4446-a2ca-d6bba9f7f7f8
>>>>>>>>> 2015-09-10 02:32:35,457 WARN  [cloud.agent.Agent] (agentRequest-Handler-4:null) Caught: java.lang.NegativeArraySizeException
>>>>>>>>>           at com.ceph.rbd.RbdImage.snapList(Unknown Source)
>>>>>>>>>           at com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.deletePhysicalDisk(LibvirtStorageAdaptor.java:854)
>>>>>>>>>           at com.cloud.hypervisor.kvm.storage.LibvirtStoragePool.deletePhysicalDisk(LibvirtStoragePool.java:175)
>>>>>>>>>           at com.cloud.hypervisor.kvm.storage.KVMStorageProcessor.deleteVolume(KVMStorageProcessor.java:1206)
>>>>>>>>> 2015-09-10 02:32:35,458 DEBUG [cloud.agent.Agent] (agentRequest-Handler-4:null) Seq 1-1743737480722513946:  { Ans: , MgmtId: 90520739779588, via: 1, Ver: v1, Flags: 10, [{"com.cloud.agent.api.Answer":{"result":false,"details":"java.lang.NegativeArraySizeException\n\tat com.ceph.rbd.RbdImage.snapList(Unknown Source)\n\tat com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.deletePhysicalDisk(LibvirtStorageAdaptor.java:854)\n\tat com.cloud.hypervisor.kvm.storage.LibvirtStoragePool.deletePhysicalDisk(LibvirtStoragePool.java:175)\n\tat com.cloud.hypervisor.kvm.storage.KVMStorageProcessor.deleteVolume(KVMStorageProcessor.java:1206)\n\tat com.cloud.storage.resource.StorageSubsystemCommandHandlerBase.execute(StorageSubsystemCommandHandlerBase.java:124)\n\tat com.cloud.storage.re.....
>>>>>>>>>
>>>>>>>>> so this volume and its snapshots will never be removed.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 2. Second bug. Experimentally we found that 50 minutes after we had
>>>>>>>>> the exception on the Agent, for some unknown reason the Management
>>>>>>>>> server decided that it had lost the connection to this agent, started
>>>>>>>>> the HA process and restarted the Agent process.
>>>>>>>>> Log on Agent side:
>>>>>>>>> 2015-09-10 02:57:12,664 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:null) Executing: /bin/bash -c free|grep Mem:|awk '{print $2}'
>>>>>>>>> 2015-09-10 02:57:12,667 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:null) Execution is successful.
>>>>>>>>> 2015-09-10 02:57:40,502 DEBUG [kvm.resource.LibvirtComputingResource] (UgentTask-5:null) Executing: /usr/share/cloudstack-common/scripts/vm/network/security_group.py get_rule_logs_for_vms
>>>>>>>>> 2015-09-10 02:57:40,572 DEBUG [kvm.resource.LibvirtComputingResource] (UgentTask-5:null) Execution is successful.
>>>>>>>>> 2015-09-10 02:57:54,135 INFO  [cloud.agent.AgentShell] (main:null) Agent started
>>>>>>>>> 2015-09-10 02:57:54,136 INFO  [cloud.agent.AgentShell] (main:null) Implementation Version is 4.5.1
>>>>>>>>> 2015-09-10 02:57:54,138 INFO  [cloud.agent.AgentShell] (main:null) agent.properties found at /etc/cloudstack/agent/agent.properties
>>>>>>>>> .....
>>>>>>>>>
>>>>>>>>> Log on Server side:
>>>>>>>>> 2015-09-10 02:57:53,710 INFO  [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-2127ada4) Investigating why host 1 has disconnected with event AgentDisconnected
>>>>>>>>> 2015-09-10 02:57:53,714 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-2127ada4) checking if agent (1) is alive
>>>>>>>>> 2015-09-10 02:57:53,723 DEBUG [c.c.a.t.Request] (AgentTaskPool-1:ctx-2127ada4) Seq 1-1743737480722513988: Sending { Cmd , MgmtId: 90520739779588, via: 1(ix1-c7-2), Ver: v1, Flags: 100011, [{"com.cloud.agent.api.CheckHealthCommand":{"wait":50}}] }
>>>>>>>>> 2015-09-10 02:57:53,724 INFO  [c.c.a.m.AgentAttache] (AgentTaskPool-1:ctx-2127ada4) Seq 1-1743737480722513988: Unable to send due to Resource [Host:1] is unreachable: Host 1: Channel is closed
>>>>>>>>> 2015-09-10 02:57:53,724 DEBUG [c.c.a.m.AgentAttache] (AgentTaskPool-1:ctx-2127ada4) Seq 1-1743737480722513988: Cancelling.
>>>>>>>>> 2015-09-10 02:57:53,724 WARN  [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-2127ada4) Resource [Host:1] is unreachable: Host 1: Channel is closed
>>>>>>>>> 2015-09-10 02:57:53,728 DEBUG [c.c.h.HighAvailabilityManagerImpl] (AgentTaskPool-1:ctx-2127ada4) SimpleInvestigator unable to determine the state of the host.  Moving on.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> It looks like a connection problem, but it appears only when we have
>>>>>>>>> this RBD exception on the agent side and only with this node. I tried
>>>>>>>>> to play with the "storage.cleanup.interval" parameter and set it to 5
>>>>>>>>> minutes; now we get the exception on the agent side every 5 min, but
>>>>>>>>> the disconnects still happen every 50 min and I can't find out why.
>>>>>>>>>
>>>>>>>>> On 09/10/2015 03:21 PM, Andrija Panic wrote:
>>>>>>>>>
>>>>>>>>>> Thx Wido,
>>>>>>>>>>
>>>>>>>>>> I will have my colleagues Igor and Dmytro join with details on this.
>>>>>>>>>>
>>>>>>>>>> I agree we need a fix upstream, that is the main goal from our side!
>>>>>>>>>> With this temp fix we just avoid the agent crashing (the agent
>>>>>>>>>> somehow restarts again fine :) ), but VMs also go down on that host,
>>>>>>>>>> at least some of them.
>>>>>>>>>>
>>>>>>>>>> Do you see any lifecycle/workflow issue if we implement deleting the
>>>>>>>>>> SNAP from CEPH after you SNAP a volume in ACS and successfully move
>>>>>>>>>> it to Secondary NFS - or perhaps only delete the SNAP from CEPH as
>>>>>>>>>> part of the actual SNAP deletion (when you, manually or via scheduled
>>>>>>>>>> snapshots, delete the snapshot from the DB and NFS)? Maybe the second
>>>>>>>>>> option is better; I don't know how you guys handle this for regular
>>>>>>>>>> NFS as primary storage etc...
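
Purely as an illustration of the second option (hedged: the hook point and
naming below are assumptions, not existing ACS code), once ACS deletes a
volume snapshot from the DB and secondary NFS, the matching RBD snapshot
could be dropped on the primary with the same rados-java bindings, along
these lines:

    import com.ceph.rbd.Rbd;
    import com.ceph.rbd.RbdImage;

    public class PrimarySnapshotCleanup {

        /**
         * Remove one named RBD snapshot of a volume on primary storage, e.g.
         * after the snapshot has been copied to secondary NFS or when it is
         * deleted in ACS. No unprotect step is attempted here, on the
         * assumption that plain volume snapshots are not protected.
         */
        public static void removeSnapshot(Rbd rbd, String volumeUuid, String snapshotName)
                throws Exception {
            RbdImage image = rbd.open(volumeUuid);
            try {
                image.snapRemove(snapshotName);
            } finally {
                rbd.close(image);
            }
        }
    }

Whether to call this right after the copy to secondary or only in the
snapshot-deletion path is exactly the workflow question above.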
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Any guidance is most welcomed, and our team will try to code all
>>>>>>>>>> this.
>>>>>>>>>>
>>>>>>>>>> Thx Wido again
>>>>>>>>>>
>>>>>>>>>> On 10 September 2015 at 14:14, Wido den Hollander <w...@widodh.nl> wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>       On 10-09-15 14:07, Andrija Panic wrote:
>>>>>>>>>>       > Wido,
>>>>>>>>>>       >
>>>>>>>>>>       > The part of the code where you want to delete some volume
>>>>>>>>>>       > checks if the volume is of type RBD - and then tries to list
>>>>>>>>>>       > snapshots, delete snapshots, and finally remove the image.
>>>>>>>>>>       > Here the first step - listing snapshots - fails if there are
>>>>>>>>>>       > more than 16 snapshots present - the number 16 is hardcoded
>>>>>>>>>>       > elsewhere in the code and throws an RBD exception... then the
>>>>>>>>>>       > agent crashes... and then VMs go down etc.
>>>>>>>>>>       >
>>>>>>>>>>
>>>>>>>>>>       Hmmm, that seems like a bug in rados-java indeed. I don't know
>>>>>>>>>>       if there is a release of rados-java where this is fixed.
>>>>>>>>>>
>>>>>>>>>>       Looking at the code of rados-java it should be, but I'm not
>>>>>>>>>>       100% certain.
>>>>>>>>>>
>>>>>>>>>>       > So our current quick fix is to invoke an external script
>>>>>>>>>>       > which will also list and remove all snapshots, but will not
>>>>>>>>>>       > fail.
>>>>>>>>>>       >
>>>>>>>>>>
>>>>>>>>>>       Yes, but we should fix it upstream. I understand that you
>>>>>>>>>> will use a
>>>>>>>>>>       temp script to clean up everything.
>>>>>>>>>>
>>>>>>>>>>       > I'm not sure why 16 is the hardcoded limit - I will try to
>>>>>>>>>>       > provide the part of the code where this is present... We can
>>>>>>>>>>       > increase this number, but it doesn't make much sense (from 16
>>>>>>>>>>       > to e.g. 200), since we still have a lot of garbage left on
>>>>>>>>>>       > CEPH (snapshots that were removed in ACS (DB and Secondary
>>>>>>>>>>       > NFS) but not removed from CEPH). And in my understanding this
>>>>>>>>>>       > needs to be implemented, so we don't catch any of the
>>>>>>>>>>       > exceptions that I originally described...
>>>>>>>>>>       >
>>>>>>>>>>       > Any thoughts on this ?
>>>>>>>>>>       >
>>>>>>>>>>
>>>>>>>>>>       A cleanup script for now should be OK indeed. Afterwards the
>>>>>>>>>>       Java code should be able to do this.
>>>>>>>>>>
>>>>>>>>>>       You can try manually by using rados-java and fix that.
>>>>>>>>>>
>>>>>>>>>>       This is the part where the listing is done:
>>>>>>>>>>
>>>>>>>>>>       https://github.com/ceph/rados-java/blob/master/src/main/java/com/ceph/rbd/RbdImage.java
>>>>>>>>>>       Wido
>>>>>>>>>>
>>>>>>>>>>       > Thx for input!
>>>>>>>>>>       >
>>>>>>>>>>       > On 10 September 2015 at 13:56, Wido den Hollander
>>>>>>>>>>       > <w...@widodh.nl> wrote:
>>>>>>>>>>       >
>>>>>>>>>>       >>
>>>>>>>>>>       >>
>>>>>>>>>>       >> On 10-09-15 12:17, Andrija Panic wrote:
>>>>>>>>>>       >>> We are testing a [dirty?] patch on our dev system and we
>>>>>>>>>>       >>> shall soon share it for review.
>>>>>>>>>>       >>>
>>>>>>>>>>       >>> Basically, we are using an external python script that is
>>>>>>>>>>       >>> invoked in some part of the code execution to delete the
>>>>>>>>>>       >>> needed CEPH snapshots and then proceeds with the volume
>>>>>>>>>>       >>> deletion etc...
>>>>>>>>>>       >>>
>>>>>>>>>>       >>
>>>>>>>>>>       >> That shouldn't be required. The Java bindings for librbd
>>>>>>>>>>       >> and librados should be able to remove the snapshots.
>>>>>>>>>>       >>
>>>>>>>>>>       >> There is no need to invoke external code; this can all be
>>>>>>>>>>       >> handled in Java.
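
To make that concrete, here is a hedged sketch only (not existing code) of a
Java cleanup along the same lines as the external script: it diffs the
snapshots Ceph reports against the ones ACS still knows about and removes
the rest.

    import java.util.List;
    import java.util.Set;

    import com.ceph.rbd.Rbd;
    import com.ceph.rbd.RbdImage;
    import com.ceph.rbd.jna.RbdSnapInfo;

    public class OrphanSnapshotCleanup {

        /**
         * Remove every RBD snapshot of the given image that ACS no longer
         * tracks. The caller supplies the set of snapshot names to keep,
         * e.g. read from the ACS database.
         */
        public static void removeOrphans(Rbd rbd, String imageName, Set<String> keep)
                throws Exception {
            RbdImage image = rbd.open(imageName);
            try {
                List<RbdSnapInfo> snaps = image.snapList();
                for (RbdSnapInfo snap : snaps) {
                    if (!keep.contains(snap.name)) {
                        image.snapRemove(snap.name);
                    }
                }
            } finally {
                rbd.close(image);
            }
        }
    }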
>>>>>>>>>>       >>
>>>>>>>>>>       >>> On 10 September 2015 at 11:26, Andrija Panic
>>>>>>>>>>       >>> <andrija.pa...@gmail.com> wrote:
>>>>>>>>>>       >>>
>>>>>>>>>>       >>>> Eh, OK. Thx for the info.
>>>>>>>>>>       >>>>
>>>>>>>>>>       >>>> BTW why is the 16 snapshot limit hardcoded - any reason
>>>>>>>>>>       >>>> for that?
>>>>>>>>>>       >>>>
>>>>>>>>>>       >>>> Not cleaning snapshots on CEPH and trying to delete a
>>>>>>>>>>       >>>> volume after having more than 16 snapshots in CEPH = the
>>>>>>>>>>       >>>> Agent crashing on the KVM side... and some VMs being
>>>>>>>>>>       >>>> rebooted etc - which means downtime :|
>>>>>>>>>>       >>>>
>>>>>>>>>>       >>>> Thanks,
>>>>>>>>>>       >>>>
>>>>>>>>>>       >>>> On 9 September 2015 at 22:05, Simon Weller
>>>>>>>>>>       >>>> <swel...@ena.com> wrote:
>>>>>>>>>>       >>>>
>>>>>>>>>>       >>>>> Andrija,
>>>>>>>>>>       >>>>>
>>>>>>>>>>       >>>>> The Ceph snapshot deletion is not currently
>>>>>>>>>> implemented.
>>>>>>>>>>       >>>>>
>>>>>>>>>>       >>>>> See: https://issues.apache.org/jira/browse/CLOUDSTACK-8302
>>>>>>>>>>       >>>>>
>>>>>>>>>>       >>>>> - Si
>>>>>>>>>>       >>>>>
>>>>>>>>>>       >>>>> ________________________________________
>>>>>>>>>>       >>>>> From: Andrija Panic <andrija.pa...@gmail.com>
>>>>>>>>>>       >>>>> Sent: Wednesday, September 9, 2015 3:03 PM
>>>>>>>>>>       >>>>> To: dev@cloudstack.apache.org; us...@cloudstack.apache.org
>>>>>>>>>>
>>>>>>>>>>       >>>>> Subject: ACS 4.5 - volume snapshots NOT removed from CEPH
>>>>>>>>>>       >>>>> (only from Secondary NFS and DB)
>>>>>>>>>>       >>>>>
>>>>>>>>>>       >>>>> Hi folks,
>>>>>>>>>>       >>>>>
>>>>>>>>>>       >>>>> we encountered an issue in ACS 4.5.1 (perhaps other
>>>>>>>>>>       >>>>> versions are also affected) - when we delete a snapshot
>>>>>>>>>>       >>>>> (volume snapshot) in ACS, ACS marks it as deleted in the
>>>>>>>>>>       >>>>> DB and deletes it from NFS Secondary Storage, but it fails
>>>>>>>>>>       >>>>> to delete the snapshot on CEPH primary storage (it doesn't
>>>>>>>>>>       >>>>> even try to delete it AFAIK)
>>>>>>>>>>       >>>>>
>>>>>>>>>>       >>>>> So we end up having 5 live snapshots in the DB (just an
>>>>>>>>>>       >>>>> example), but in CEPH there are actually more than e.g.
>>>>>>>>>>       >>>>> 16 snapshots.
>>>>>>>>>>       >>>>>
>>>>>>>>>>       >>>>> More to the issue: when the ACS agent tries to obtain the
>>>>>>>>>>       >>>>> list of snapshots from CEPH for some volume - if the
>>>>>>>>>>       >>>>> number of snapshots is over 16, it raises an exception
>>>>>>>>>>       >>>>> (and perhaps this is the reason the Agent crashed for us -
>>>>>>>>>>       >>>>> I need to check with my colleagues who are investigating
>>>>>>>>>>       >>>>> this in detail). This number 16 is for whatever reason
>>>>>>>>>>       >>>>> hardcoded in the ACS code.
>>>>>>>>>>       >>>>>
>>>>>>>>>>       >>>>> Wondering if anyone has experienced this or has any info
>>>>>>>>>>       >>>>> - we plan to try to fix this, and I will include my dev
>>>>>>>>>>       >>>>> colleagues here, but we might need some help, at least
>>>>>>>>>>       >>>>> for guidance.
>>>>>>>>>>       >>>>>
>>>>>>>>>>       >>>>> Any help is really appreciated, or at least confirmation
>>>>>>>>>>       >>>>> that this is a known issue etc.
>>>>>>>>>>       >>>>>
>>>>>>>>>>       >>>>> Thanks,
>>>>>>>>>>       >>>>>
>>>>>>>>>>       >>>>> --
>>>>>>>>>>       >>>>>
>>>>>>>>>>       >>>>> Andrija Panić
>>>>>>>>>>       >>>>>
>>>>>>>>>>       >>>>
>>>>>>>>>>       >>>>
>>>>>>>>>>       >>>>
>>>>>>>>>>       >>>> --
>>>>>>>>>>       >>>>
>>>>>>>>>>       >>>> Andrija Panić
>>>>>>>>>>       >>>>
>>>>>>>>>>       >>>
>>>>>>>>>>       >>>
>>>>>>>>>>       >>>
>>>>>>>>>>       >>
>>>>>>>>>>       >
>>>>>>>>>>       >
>>>>>>>>>>       >
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> -- 
>>>>>>>>>>
>>>>>>>>>> Andrija Panić
>>>>>>>>>>
>>>>>>>>> -- 
>>>>>>>>> ---
>>>>>>>>> Best regards
>>>>>>>>> Dmytro Shevchenko
>>>>>>>>> dshevchenko.m...@gmail.com
>>>>>>>>> skype: demonsh_mk
>>>>>>>>> +380(66)2426648
>>>>>>>>>
>>>>>>>>>
>>>
> 
