On 15-09-15 13:56, Dmytro Shevchenko wrote:
> Hello Wido, I saw you updated this code again. Do you know what the procedure is for rebuilding this library in the Apache Maven repository? Here http://repo.maven.apache.org/maven2/com/ceph/rados/ only the old 0.1.4 version is present, so it is impossible to recompile CloudStack with the new patches. Of course we can download the source code from GitHub, compile it and replace the 'jar' file in production, but that is a dirty hack and not acceptable for continuous integration.
>
It's up to me to do a new release of rados-java and I haven't done that yet, since I wanted to know for sure that the code works.

While writing some code for libvirt yesterday I came up with a better solution for rados-java as well:
https://www.redhat.com/archives/libvir-list/2015-September/msg00458.html

For now you can replace 'rados.jar' on the production systems, but for 4.6 I want to make sure we depend on a new, to-be-released version of rados-java.

Wido

> ---
> Best regards
> Dmytro Shevchenko
> dshevchenko.m...@gmail.com
> skype: demonsh_mk
>
> On 09/12/2015 06:16 PM, Wido den Hollander wrote:
>> On 09/11/2015 05:08 PM, Andrija Panic wrote:
>>> Thx a lot Wido!!! We will patch this. For my understanding - is this a "temporary" solution, since it raises the limit to 256 snaps? Or am I wrong? I mean, since we still don't have proper snapshot removal, after e.g. 3-6 months we will again have 256 snapshots of a single volume on Ceph?
>>>
>> No, it will also work with >256 snapshots. I've tested it with 256 and that worked fine. I see no reason why it won't work with 1024 or 2048, for example.
>>
>>> BTW we also have another exception that causes the same consequences - the agent disconnecting and VMs going down... As Dmytro explained, unprotecting a snapshot causes the same consequence...
>>>
>>> From my understanding, any RBD exception might cause the Agent to disconnect (or actually the mgmt server to disconnect the agent)...
>>>
>>> Any clue on this, recommendation?
>>>
>> No, I don't have a clue. It could be that the job hangs somewhere inside the Agent due to an uncaught exception though.
>>
>>> Thx a lot for fixing the rados-java stuff!
>>>
>> You're welcome!
>>
>> Wido
>>
>>> Andrija
>>>
>>> On 11 September 2015 at 15:28, Wido den Hollander <w...@widodh.nl> wrote:
>>>
>>>> On 11-09-15 14:43, Dmytro Shevchenko wrote:
>>>>> Thanks a lot Wido! Any chance to find out why the management server decided that it lost the connection to the agent after those exceptions? It's not as critical as the bug with 16 snapshots, but during the last week we caught a situation where the Agent failed to unprotect a snapshot, raised an exception, and that was the reason for a disconnection a bit later. (It is not clear why CS decided to remove that volume; it was a template with one 'gold' snapshot with several active clones.)
>>>>>
>>>> No, I didn't look at CS at all. I just spent the day improving the RADOS bindings.
>>>>
>>>> Wido
>>>>
>>>>> On 09/11/2015 03:20 PM, Wido den Hollander wrote:
>>>>>> On 11-09-15 10:19, Wido den Hollander wrote:
>>>>>>> On 10-09-15 23:15, Andrija Panic wrote:
>>>>>>>> Wido,
>>>>>>>>
>>>>>>>> could you maybe follow up on what my colleague Dmytro just sent?
>>>>>>>>
>>>>>>> Yes, seems logical.
>>>>>>>
>>>>>>>> It's not only a matter of fixing rados-java (the 16-snapshot limit) - it seems that on any RBD exception, ACS will freak out...
>>>>>>>>
>>>>>>> No, a RbdException will be caught, but the Rados bindings shouldn't throw a NegativeArraySizeException in any case.
>>>>>>>
>>>>>>> That's the main problem.
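The crash pattern being described is a listing call that hands the native layer a fixed-size buffer (16 entries) and then uses the return code directly as an array length; once that return code goes negative, a NegativeArraySizeException is the result. Below is a minimal sketch of the retry pattern that avoids this - the native call is a hypothetical stand-in, not the actual rados-java binding:

import java.util.ArrayList;
import java.util.List;

public class SnapshotLister {

    /**
     * Hypothetical stand-in for a native call such as rbd_snap_list():
     * fills 'out' with up to out.length snapshot names and returns the
     * total number of snapshots the image has, or a negative errno on a
     * real failure.
     */
    public interface NativeSnapList {
        int list(String[] out);
    }

    public static List<String> snapList(NativeSnapList rbd) {
        int capacity = 16; // the old hardcoded first guess
        while (true) {
            String[] buf = new String[capacity];
            int count = rbd.list(buf);
            if (count < 0) {
                // Surface the error instead of letting a negative value be
                // used as an array size further up the stack.
                throw new IllegalStateException("snapshot listing failed: " + count);
            }
            if (count <= buf.length) {
                List<String> snaps = new ArrayList<>(count);
                for (int i = 0; i < count; i++) {
                    snaps.add(buf[i]);
                }
                return snaps;
            }
            // The buffer was too small: retry with the size the native layer reported.
            capacity = count;
        }
    }
}

With this pattern the 16-entry starting size is only a first guess, which matches the observation above that the fixed bindings handle 256 snapshots and should handle 1024 or 2048 just as well.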
>>>>>>>
>>>>>> Seems to be fixed with this commit:
>>>>>> https://github.com/ceph/rados-java/commit/5584f3961c95d998d2a9eff947a5b7b4d4ba0b64
>>>>>>
>>>>>> Just tested it with 256 snapshots:
>>>>>>
>>>>>> -------------------------------------------------------
>>>>>>  T E S T S
>>>>>> -------------------------------------------------------
>>>>>> Running com.ceph.rbd.TestRbd
>>>>>> Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 521.014 sec
>>>>>>
>>>>>> Results :
>>>>>>
>>>>>> Tests run: 1, Failures: 0, Errors: 0, Skipped: 0
>>>>>>
>>>>>> The bindings should now be capable of listing more than 16 snapshots.
>>>>>>
>>>>>> You can build the bindings manually and replace rados.jar on your running systems.
>>>>>>
>>>>>> For 4.6 I'll try to get the updated rados-java included.
>>>>>>
>>>>>> Wido
>>>>>>
>>>>>>> Wido
>>>>>>>
>>>>>>>> Thx
>>>>>>>>
>>>>>>>> On 10 September 2015 at 17:06, Dmytro Shevchenko <dmytro.shevche...@safeswisscloud.com> wrote:
>>>>>>>>
>>>>>>>>> Hello everyone, some clarification about this. Configuration:
>>>>>>>>> CS: 4.5.1
>>>>>>>>> Primary storage: Ceph
>>>>>>>>>
>>>>>>>>> Actually we have 2 separate bugs:
>>>>>>>>>
>>>>>>>>> 1. When you remove a volume with more than 16 snapshots (destroyed or active doesn't matter - they are always present on Ceph), the next storage garbage collector cycle invokes 'deletePhysicalDisk' from LibvirtStorageAdaptor.java. On line 854 we call the snapshot listing in the external rados-java library and get an exception.
>>>>>>>>>
>>>>>>>>> https://github.com/apache/cloudstack/blob/4.5.1/plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/LibvirtStorageAdaptor.java#L854
>>>>>>>>>
>>>>>>>>> This exception is not caught in the current function, but the Agent DOES NOT CRASH at this moment and continues working fine. The Agent forms a proper answer for the server and sends it; the text in the answer is the Java stack trace.
Log from the Agent side:
>>>>>>>>>
>>>>>>>>> 2015-09-10 02:32:35,312 DEBUG [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-4:null) Trying to fetch storage pool 33ebaf83-5d09-3038-b63b-742e759a992e from libvirt
>>>>>>>>> 2015-09-10 02:32:35,431 INFO [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-4:null) Attempting to remove volume 4c6a2092-056c-4446-a2ca-d6bba9f7f7f8 from pool 33ebaf83-5d09-3038-b63b-742e759a992e
>>>>>>>>> 2015-09-10 02:32:35,431 INFO [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-4:null) Unprotecting and Removing RBD snapshots of image cloudstack-storage/4c6a2092-056c-4446-a2ca-d6bba9f7f7f8 prior to removing the image
>>>>>>>>> 2015-09-10 02:32:35,436 DEBUG [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-4:null) Succesfully connected to Ceph cluster at 10.10.1.26:6789
>>>>>>>>> 2015-09-10 02:32:35,454 DEBUG [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-4:null) Fetching list of snapshots of RBD image cloudstack-storage/4c6a2092-056c-4446-a2ca-d6bba9f7f7f8
>>>>>>>>> 2015-09-10 02:32:35,457 WARN [cloud.agent.Agent] (agentRequest-Handler-4:null) Caught: java.lang.NegativeArraySizeException
>>>>>>>>>         at com.ceph.rbd.RbdImage.snapList(Unknown Source)
>>>>>>>>>         at com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.deletePhysicalDisk(LibvirtStorageAdaptor.java:854)
>>>>>>>>>         at com.cloud.hypervisor.kvm.storage.LibvirtStoragePool.deletePhysicalDisk(LibvirtStoragePool.java:175)
>>>>>>>>>         at com.cloud.hypervisor.kvm.storage.KVMStorageProcessor.deleteVolume(KVMStorageProcessor.java:1206)
>>>>>>>>> 2015-09-10 02:32:35,458 DEBUG [cloud.agent.Agent] (agentRequest-Handler-4:null) Seq 1-1743737480722513946: { Ans: , MgmtId: 90520739779588, via: 1, Ver: v1, Flags: 10, [{"com.cloud.agent.api.Answer":{"result":false,"details":"java.lang.NegativeArraySizeException\n\tat com.ceph.rbd.RbdImage.snapList(Unknown Source)\n\tat com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.deletePhysicalDisk(LibvirtStorageAdaptor.java:854)\n\tat com.cloud.hypervisor.kvm.storage.LibvirtStoragePool.deletePhysicalDisk(LibvirtStoragePool.java:175)\n\tat com.cloud.hypervisor.kvm.storage.KVMStorageProcessor.deleteVolume(KVMStorageProcessor.java:1206)\n\tat com.cloud.storage.resource.StorageSubsystemCommandHandlerBase.execute(StorageSubsystemCommandHandlerBase.java:124)\n\tat com.cloud.storage.re.....
>>>>>>>>>
>>>>>>>>> So this volume and its snapshots will never be removed.
>>>>>>>>>
>>>>>>>>> 2. Second bug. Experimentally it was found that after 50 minutes we had an exception on the Agent; for some unknown reason the Management server decided that it had lost the connection to this agent, started the HA process and started the Agent process again.
>>>>>>>>> Log on the Agent side:
>>>>>>>>> 2015-09-10 02:57:12,664 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:null) Executing: /bin/bash -c free|grep Mem:|awk '{print $2}'
>>>>>>>>> 2015-09-10 02:57:12,667 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:null) Execution is successful.
>>>>>>>>> 2015-09-10 02:57:40,502 DEBUG [kvm.resource.LibvirtComputingResource] (UgentTask-5:null) Executing: /usr/share/cloudstack-common/scripts/vm/network/security_group.py get_rule_logs_for_vms
>>>>>>>>> 2015-09-10 02:57:40,572 DEBUG [kvm.resource.LibvirtComputingResource] (UgentTask-5:null) Execution is successful.
>>>>>>>>> 2015-09-10 02:57:54,135 INFO [cloud.agent.AgentShell] (main:null) Agent started
>>>>>>>>> 2015-09-10 02:57:54,136 INFO [cloud.agent.AgentShell] (main:null) Implementation Version is 4.5.1
>>>>>>>>> 2015-09-10 02:57:54,138 INFO [cloud.agent.AgentShell] (main:null) agent.properties found at /etc/cloudstack/agent/agent.properties
>>>>>>>>> .....
>>>>>>>>>
>>>>>>>>> Log on the Server side:
>>>>>>>>> 2015-09-10 02:57:53,710 INFO [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-2127ada4) Investigating why host 1 has disconnected with event AgentDisconnected
>>>>>>>>> 2015-09-10 02:57:53,714 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-2127ada4) checking if agent (1) is alive
>>>>>>>>> 2015-09-10 02:57:53,723 DEBUG [c.c.a.t.Request] (AgentTaskPool-1:ctx-2127ada4) Seq 1-1743737480722513988: Sending { Cmd , MgmtId: 90520739779588, via: 1(ix1-c7-2), Ver: v1, Flags: 100011, [{"com.cloud.agent.api.CheckHealthCommand":{"wait":50}}] }
>>>>>>>>> 2015-09-10 02:57:53,724 INFO [c.c.a.m.AgentAttache] (AgentTaskPool-1:ctx-2127ada4) Seq 1-1743737480722513988: Unable to send due to Resource [Host:1] is unreachable: Host 1: Channel is closed
>>>>>>>>> 2015-09-10 02:57:53,724 DEBUG [c.c.a.m.AgentAttache] (AgentTaskPool-1:ctx-2127ada4) Seq 1-1743737480722513988: Cancelling.
>>>>>>>>> 2015-09-10 02:57:53,724 WARN [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-2127ada4) Resource [Host:1] is unreachable: Host 1: Channel is closed
>>>>>>>>> 2015-09-10 02:57:53,728 DEBUG [c.c.h.HighAvailabilityManagerImpl] (AgentTaskPool-1:ctx-2127ada4) SimpleInvestigator unable to determine the state of the host. Moving on.
>>>>>>>>>
>>>>>>>>> It looks like a connection problem, but it appears only when we get this RBD exception on the agent side, and only with this node. I tried playing with the "storage.cleanup.interval" parameter and set it to 5 minutes; now we get the exception on the agent side every 5 min, but the disconnects still happen every 50 min and I can't find out why.
>>>>>>>>>
>>>>>>>>> On 09/10/2015 03:21 PM, Andrija Panic wrote:
>>>>>>>>>
>>>>>>>>>> Thx Wido,
>>>>>>>>>>
>>>>>>>>>> I will have my colleagues Igor and Dmytro join with details on this.
>>>>>>>>>>
>>>>>>>>>> I agree we need a fix upstream, that is the main purpose from our side!
>>>>>>>>>> With this temp fix we just avoid the agent crashing (the agent somehow restarts fine again :) ), but VMs also go down on that host, at least some of them.
>>>>>>>>>>
>>>>>>>>>> Do you see any lifecycle/workflow issue if we implement deleting the snapshot from Ceph after you snapshot a volume in ACS and successfully move it to Secondary NFS - or perhaps only delete the snapshot from Ceph as part of the actual snapshot deletion (when you delete a snapshot from DB and NFS manually or via scheduled snapshots)? Maybe the second option is better; I don't know how you guys handle this for regular NFS as primary storage etc...
>>>>>>>>>>
>>>>>>>>>> Any guidance is most welcome, and our team will try to code all this.
>>>>>>>>>>
>>>>>>>>>> Thx Wido again
>>>>>>>>>>
>>>>>>>>>> On 10 September 2015 at 14:14, Wido den Hollander <w...@widodh.nl> wrote:
>>>>>>>>>>
>>>>>>>>>> On 10-09-15 14:07, Andrija Panic wrote:
>>>>>>>>>> > Wido,
>>>>>>>>>> >
>>>>>>>>>> > the part of the code where you want to delete a volume checks if the volume is of type RBD, and then tries to list snapshots, delete snapshots, and finally remove the image. Here the first step - listing snapshots - fails if there are more than 16 snapshots present; the number 16 is hardcoded elsewhere in the code and an RBD exception is thrown... then the agent crashes... and then VMs go down etc.
>>>>>>>>>> >
>>>>>>>>>> Hmmm, that seems like a bug in rados-java indeed. I don't know if there is a release of rados-java where this is fixed.
>>>>>>>>>>
>>>>>>>>>> Looking at the code of rados-java it should be, but I'm not 100% certain.
>>>>>>>>>>
>>>>>>>>>> > So our current quick fix is to invoke an external script which will also list and remove all snapshots, but will not fail.
>>>>>>>>>> >
>>>>>>>>>> Yes, but we should fix it upstream. I understand that you will use a temp script to clean up everything.
>>>>>>>>>>
>>>>>>>>>> > I'm not sure why 16 is the hardcoded limit - I will try to provide the part of the code where this is present... We can increase this number, but it doesn't make much sense (from 16 to e.g. 200), since we still have a lot of garbage left on Ceph (snapshots that were removed in ACS (DB and Secondary NFS) but not removed from Ceph). And in my understanding this needs to be implemented, so we don't catch any of the exceptions that I originally described...
>>>>>>>>>> >
>>>>>>>>>> > Any thoughts on this?
>>>>>>>>>> >
>>>>>>>>>> A cleanup script for now should be OK indeed. Afterwards the Java code should be able to do this.
>>>>>>>>>>
>>>>>>>>>> You can try manually by using rados-java and fix that.
>>>>>>>>>>
>>>>>>>>>> This is the part where the listing is done:
>>>>>>>>>> https://github.com/ceph/rados-java/blob/master/src/main/java/com/ceph/rbd/RbdImage.java
>>>>>>>>>>
>>>>>>>>>> Wido
>>>>>>>>>>
>>>>>>>>>> > Thx for input!
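For the cleanup itself, here is a rough sketch of how it could look in Java along the lines described above: unprotect and remove every snapshot, then remove the image. The class and method names follow the rados-java bindings referenced in this thread (Rados, IoCTX, Rbd, RbdImage), but treat the exact signatures, package locations and the RbdSnapInfo.name field as assumptions and verify them against the rados-java version you actually build:

import com.ceph.rados.IoCTX;
import com.ceph.rados.Rados;
import com.ceph.rbd.Rbd;
import com.ceph.rbd.RbdException;
import com.ceph.rbd.RbdImage;
import com.ceph.rbd.jna.RbdSnapInfo;

public class RbdImageCleaner {

    /** Remove every snapshot of an RBD image and then the image itself. */
    public static void deleteImageWithSnapshots(String monHost, String key,
                                                String pool, String imageName) throws Exception {
        Rados rados = new Rados("admin");
        rados.confSet("mon_host", monHost);
        rados.confSet("key", key);
        rados.connect();
        IoCTX io = rados.ioCtxCreate(pool);
        try {
            Rbd rbd = new Rbd(io);
            RbdImage image = rbd.open(imageName);
            try {
                // Snapshots have to go before the image, otherwise the final remove() fails.
                for (RbdSnapInfo snap : image.snapList()) {
                    try {
                        image.snapUnprotect(snap.name);
                    } catch (RbdException e) {
                        // Assume the snapshot simply was not protected and keep going.
                    }
                    image.snapRemove(snap.name);
                }
            } finally {
                rbd.close(image);
            }
            rbd.remove(imageName);
        } finally {
            rados.ioCtxDestroy(io);
        }
    }
}

Wrapping the unprotect call in a try/catch keeps the loop going when a snapshot was never protected; whether that matches how the bindings actually behave for unprotected snapshots is also something to verify before relying on it.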
>>>>>>>>>> >
>>>>>>>>>> > On 10 September 2015 at 13:56, Wido den Hollander <w...@widodh.nl> wrote:
>>>>>>>>>> >
>>>>>>>>>> >>
>>>>>>>>>> >> On 10-09-15 12:17, Andrija Panic wrote:
>>>>>>>>>> >>> We are testing some [dirty?] patch on our dev system and we shall soon share it for review.
>>>>>>>>>> >>>
>>>>>>>>>> >>> Basically, we are using an external Python script that is invoked in some part of the code execution to delete the needed Ceph snapshots, and after that it proceeds with the volume deletion etc...
>>>>>>>>>> >>>
>>>>>>>>>> >> That shouldn't be required. The Java bindings for librbd and librados should be able to remove the snapshots.
>>>>>>>>>> >>
>>>>>>>>>> >> There is no need to invoke external code, this can all be handled in Java.
>>>>>>>>>> >>
>>>>>>>>>> >>> On 10 September 2015 at 11:26, Andrija Panic <andrija.pa...@gmail.com> wrote:
>>>>>>>>>> >>>
>>>>>>>>>> >>>> Eh, OK. Thx for the info.
>>>>>>>>>> >>>>
>>>>>>>>>> >>>> BTW why is the 16-snapshot limit hardcoded - any reason for that?
>>>>>>>>>> >>>>
>>>>>>>>>> >>>> Not cleaning snapshots on Ceph and trying to delete a volume after having more than 16 snapshots in Ceph = Agent crashing on the KVM side... and some VMs being rebooted etc - which means downtime :|
>>>>>>>>>> >>>>
>>>>>>>>>> >>>> Thanks,
>>>>>>>>>> >>>>
>>>>>>>>>> >>>> On 9 September 2015 at 22:05, Simon Weller <swel...@ena.com> wrote:
>>>>>>>>>> >>>>
>>>>>>>>>> >>>>> Andrija,
>>>>>>>>>> >>>>>
>>>>>>>>>> >>>>> The Ceph snapshot deletion is not currently implemented.
>>>>>>>>>> >>>>>
>>>>>>>>>> >>>>> See: https://issues.apache.org/jira/browse/CLOUDSTACK-8302
>>>>>>>>>> >>>>>
>>>>>>>>>> >>>>> - Si
>>>>>>>>>> >>>>>
>>>>>>>>>> >>>>> ________________________________________
>>>>>>>>>> >>>>> From: Andrija Panic <andrija.pa...@gmail.com>
>>>>>>>>>> >>>>> Sent: Wednesday, September 9, 2015 3:03 PM
>>>>>>>>>> >>>>> To: dev@cloudstack.apache.org; us...@cloudstack.apache.org
>>>>>>>>>> >>>>> Subject: ACS 4.5 - volume snapshots NOT removed from CEPH (only from Secondary NFS and DB)
>>>>>>>>>> >>>>>
>>>>>>>>>> >>>>> Hi folks,
>>>>>>>>>> >>>>>
>>>>>>>>>> >>>>> we encountered an issue in ACS 4.5.1 (perhaps other versions are also affected) - when we delete a snapshot (volume snapshot) in ACS, ACS marks it as deleted in the DB and deletes it from NFS Secondary Storage, but fails to delete the snapshot on the Ceph primary storage (it doesn't even try to delete it AFAIK).
>>>>>>>>>> >>>>>
>>>>>>>>>> >>>>> So we end up having 5 live snapshots in the DB (just an example), but actually in Ceph there are more than e.g. 16 snapshots.
>>>>>>>>>> >>>>>
>>>>>>>>>> >>>>> What's more, when the ACS agent tries to obtain the list of snapshots from Ceph for some volume - if the number of snapshots is over 16, it raises an exception (and perhaps this is the reason the Agent crashed for us - I need to check with my colleagues who are investigating this in detail). This number 16 is for whatever reason hardcoded in the ACS code.
>>>>>>>>>> >>>>>
>>>>>>>>>> >>>>> Wondering if anyone has experienced this, or has any info - we plan to try to fix this, and I will include my dev colleagues here, but we might need some help, at least for guidance.
>>>>>>>>>> >>>>>
>>>>>>>>>> >>>>> Any help is really appreciated, or at least confirmation that this is a known issue etc.
>>>>>>>>>> >>>>>
>>>>>>>>>> >>>>> Thanks,
>>>>>>>>>> >>>>>
>>>>>>>>>> >>>>> --
>>>>>>>>>> >>>>>
>>>>>>>>>> >>>>> Andrija Panić
>>>>>>>>>> >>>>>
>>>>>>>>>> >>>>
>>>>>>>>>> >>>> --
>>>>>>>>>> >>>>
>>>>>>>>>> >>>> Andrija Panić
>>>>>>>>>> >>>>
>>>>>>>>>> >>>
>>>>>>>>>> >>
>>>>>>>>>> >
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>>
>>>>>>>>>> Andrija Panić
>>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> ---
>>>>>>>>> Best regards
>>>>>>>>> Dmytro Shevchenko
>>>>>>>>> dshevchenko.m...@gmail.com
>>>>>>>>> skype: demonsh_mk
>>>>>>>>> +380(66)2426648