Thanks a lot Wido!!! - we will patch this. For my understanding - is this a "temporary" solution, since it raises the limit to 256 snaps? Or am I wrong? I mean, since we still don't have proper snapshot removal, after e.g. 3-6 months we will again have 256 snapshots of a single volume on CEPH?
BTW we also have another exception that causes the same consequences - the agent disconnecting and VMs going down... As Dmytro explained, unprotecting a snapshot causes the same consequence... From my understanding, any RBD exception might cause the Agent to disconnect (or, actually, the mgmt server to disconnect the agent)... Any clue on this, any recommendation?

Thanks a lot for fixing the rados-java stuff!

Andrija

On 11 September 2015 at 15:28, Wido den Hollander <w...@widodh.nl> wrote:
>
> On 11-09-15 14:43, Dmytro Shevchenko wrote:
> > Thanks a lot Wido! Any chance to find out why the management server decided that it lost the connection to the agent after those exceptions? It's not as critical as this bug with 16 snapshots, but during the last week we caught a situation where the Agent failed to unprotect a snapshot and raised an exception, and this was the reason for a disconnection a bit later. (It is not clear why CS decided to remove that volume; it was a template with one 'gold' snapshot with several active clones.)
> >
>
> No, I didn't look at CS at all. I just spent the day improving the RADOS bindings.
>
> Wido
>
> > On 09/11/2015 03:20 PM, Wido den Hollander wrote:
> >>
> >> On 11-09-15 10:19, Wido den Hollander wrote:
> >>>
> >>> On 10-09-15 23:15, Andrija Panic wrote:
> >>>> Wido,
> >>>>
> >>>> could you follow up maybe on what my colleague Dmytro just sent?
> >>>>
> >>> Yes, seems logical.
> >>>
> >>>> It's not only a matter of fixing rados-java (the 16 snaps limit) - it seems that for any RBD exception, ACS will freak out...
> >>>>
> >>> No, an RbdException will be caught, but the Rados bindings shouldn't throw NegativeArraySizeException in any case.
> >>>
> >>> That's the main problem.
> >>>
> >> Seems to be fixed with this commit:
> >> https://github.com/ceph/rados-java/commit/5584f3961c95d998d2a9eff947a5b7b4d4ba0b64
> >>
> >> Just tested it with 256 snapshots:
> >>
> >> -------------------------------------------------------
> >>  T E S T S
> >> -------------------------------------------------------
> >> Running com.ceph.rbd.TestRbd
> >> Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 521.014 sec
> >>
> >> Results :
> >>
> >> Tests run: 1, Failures: 0, Errors: 0, Skipped: 0
> >>
> >> The bindings should now be capable of listing more than 16 snapshots.
> >>
> >> You can build the bindings manually and replace rados.jar on your running systems.
> >>
> >> For 4.6 I'll try to get the updated rados-java included.
> >>
> >> Wido
> >>
> >>> Wido
> >>>
> >>>> Thx
> >>>>
> >>>> On 10 September 2015 at 17:06, Dmytro Shevchenko <dmytro.shevche...@safeswisscloud.com> wrote:
> >>>>
> >>>>> Hello everyone, some clarification about this. Configuration:
> >>>>> CS: 4.5.1
> >>>>> Primary storage: Ceph
> >>>>>
> >>>>> Actually we have 2 separate bugs:
> >>>>>
> >>>>> 1. When you remove a volume with more than 16 snapshots (it doesn't matter whether they are destroyed or active - they are always present on Ceph), the next storage garbage collector cycle invokes 'deletePhysicalDisk' from LibvirtStorageAdaptor.java. On line 854 we call the snapshot listing in the external rados-java library and get an exception.
> >>>>>
> >>>>> https://github.com/apache/cloudstack/blob/4.5.1/plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/LibvirtStorageAdaptor.java#L854
> >>>>>
> >>>>> This exception is not caught in the current function, but the Agent DOES NOT CRASH at this moment and continues working fine.
> >>>>> The Agent forms a proper answer to the server and sends it; the text in the answer is the Java stack trace. Log from the Agent side:
> >>>>>
> >>>>> 2015-09-10 02:32:35,312 DEBUG [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-4:null) Trying to fetch storage pool 33ebaf83-5d09-3038-b63b-742e759a992e from libvirt
> >>>>> 2015-09-10 02:32:35,431 INFO [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-4:null) Attempting to remove volume 4c6a2092-056c-4446-a2ca-d6bba9f7f7f8 from pool 33ebaf83-5d09-3038-b63b-742e759a992e
> >>>>> 2015-09-10 02:32:35,431 INFO [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-4:null) Unprotecting and Removing RBD snapshots of image cloudstack-storage/4c6a2092-056c-4446-a2ca-d6bba9f7f7f8 prior to removing the image
> >>>>> 2015-09-10 02:32:35,436 DEBUG [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-4:null) Succesfully connected to Ceph cluster at 10.10.1.26:6789
> >>>>> 2015-09-10 02:32:35,454 DEBUG [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-4:null) Fetching list of snapshots of RBD image cloudstack-storage/4c6a2092-056c-4446-a2ca-d6bba9f7f7f8
> >>>>> 2015-09-10 02:32:35,457 WARN [cloud.agent.Agent] (agentRequest-Handler-4:null) Caught: java.lang.NegativeArraySizeException
> >>>>>         at com.ceph.rbd.RbdImage.snapList(Unknown Source)
> >>>>>         at com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.deletePhysicalDisk(LibvirtStorageAdaptor.java:854)
> >>>>>         at com.cloud.hypervisor.kvm.storage.LibvirtStoragePool.deletePhysicalDisk(LibvirtStoragePool.java:175)
> >>>>>         at com.cloud.hypervisor.kvm.storage.KVMStorageProcessor.deleteVolume(KVMStorageProcessor.java:1206)
> >>>>> 2015-09-10 02:32:35,458 DEBUG [cloud.agent.Agent] (agentRequest-Handler-4:null) Seq 1-1743737480722513946: { Ans: , MgmtId: 90520739779588, via: 1, Ver: v1, Flags: 10, [{"com.cloud.agent.api.Answer":{"result":false,"details":"java.lang.NegativeArraySizeException\n\tat com.ceph.rbd.RbdImage.snapList(Unknown Source)\n\tat com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.deletePhysicalDisk(LibvirtStorageAdaptor.java:854)\n\tat com.cloud.hypervisor.kvm.storage.LibvirtStoragePool.deletePhysicalDisk(LibvirtStoragePool.java:175)\n\tat com.cloud.hypervisor.kvm.storage.KVMStorageProcessor.deleteVolume(KVMStorageProcessor.java:1206)\n\tat com.cloud.storage.resource.StorageSubsystemCommandHandlerBase.execute(StorageSubsystemCommandHandlerBase.java:124)\n\tat com.cloud.storage.re.....
> >>>>>
> >>>>> so this volume and its snapshots will never be removed.
> >>>>>
> >>>>> 2. Second bug. Experimentally it has been found that 50 minutes after we had the exception on the Agent, for some unknown reason the Management server decided that it had lost the connection to this agent, started the HA process and started the Agent process again.
> >>>>> Log on the Agent side:
> >>>>> 2015-09-10 02:57:12,664 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:null) Executing: /bin/bash -c free|grep Mem:|awk '{print $2}'
> >>>>> 2015-09-10 02:57:12,667 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:null) Execution is successful.
> >>>>> 2015-09-10 02:57:40,502 DEBUG [kvm.resource.LibvirtComputingResource] (UgentTask-5:null) Executing: /usr/share/cloudstack-common/scripts/vm/network/security_group.py get_rule_logs_for_vms
> >>>>> 2015-09-10 02:57:40,572 DEBUG [kvm.resource.LibvirtComputingResource] (UgentTask-5:null) Execution is successful.
> >>>>> 2015-09-10 02:57:54,135 INFO [cloud.agent.AgentShell] (main:null) Agent started
> >>>>> 2015-09-10 02:57:54,136 INFO [cloud.agent.AgentShell] (main:null) Implementation Version is 4.5.1
> >>>>> 2015-09-10 02:57:54,138 INFO [cloud.agent.AgentShell] (main:null) agent.properties found at /etc/cloudstack/agent/agent.properties
> >>>>> .....
> >>>>>
> >>>>> Log on the Server side:
> >>>>> 2015-09-10 02:57:53,710 INFO [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-2127ada4) Investigating why host 1 has disconnected with event AgentDisconnected
> >>>>> 2015-09-10 02:57:53,714 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-2127ada4) checking if agent (1) is alive
> >>>>> 2015-09-10 02:57:53,723 DEBUG [c.c.a.t.Request] (AgentTaskPool-1:ctx-2127ada4) Seq 1-1743737480722513988: Sending { Cmd , MgmtId: 90520739779588, via: 1(ix1-c7-2), Ver: v1, Flags: 100011, [{"com.cloud.agent.api.CheckHealthCommand":{"wait":50}}] }
> >>>>> 2015-09-10 02:57:53,724 INFO [c.c.a.m.AgentAttache] (AgentTaskPool-1:ctx-2127ada4) Seq 1-1743737480722513988: Unable to send due to Resource [Host:1] is unreachable: Host 1: Channel is closed
> >>>>> 2015-09-10 02:57:53,724 DEBUG [c.c.a.m.AgentAttache] (AgentTaskPool-1:ctx-2127ada4) Seq 1-1743737480722513988: Cancelling.
> >>>>> 2015-09-10 02:57:53,724 WARN [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-2127ada4) Resource [Host:1] is unreachable: Host 1: Channel is closed
> >>>>> 2015-09-10 02:57:53,728 DEBUG [c.c.h.HighAvailabilityManagerImpl] (AgentTaskPool-1:ctx-2127ada4) SimpleInvestigator unable to determine the state of the host. Moving on.
> >>>>>
> >>>>> It looks like a connection problem, but it appears only when we have this RBD exception on the agent side, and only with this node. I tried to play with the "storage.cleanup.interval" parameter and set it to 5 minutes; now we get the exception on the agent side every 5 min, but the disconnects still happen every 50 min and I can't find out why.
> >>>>>
> >>>>> On 09/10/2015 03:21 PM, Andrija Panic wrote:
> >>>>>
> >>>>>> Thx Wido,
> >>>>>>
> >>>>>> I will have my colleagues Igor and Dmytro join with details on this.
> >>>>>>
> >>>>>> I agree we need a fix upstream, that is the main purpose from our side!
> >>>>>>
> >>>>>> With this temp fix, we just avoid the agent crashing (the agent somehow restarts again fine :) ) but VMs also go down on that host, at least some of them.
> >>>>>>
> >>>>>> Do you see any lifecycle/workflow issue if we implement deleting the SNAP from CEPH after you snapshot a volume in ACS and successfully move it to the Secondary NFS - or perhaps only delete the SNAP from CEPH as a part of the actual SNAP deletion (when you, manually or via scheduled snapshots, delete the snapshot from DB and NFS)? Maybe the second option is better; I don't know how you guys handle this for regular NFS as primary storage etc...
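[Either way, the Ceph-side step itself should be small. A rough, untested sketch of it, assuming the rados-java snapshot calls that LibvirtStorageAdaptor already uses (snapIsProtected/snapUnprotect/snapRemove); the class and method names here are purely illustrative:]

import com.ceph.rbd.RbdException;
import com.ceph.rbd.RbdImage;

public class RbdSnapshotCleaner {

    /**
     * Remove a single snapshot from an already opened RBD image,
     * unprotecting it first if needed (e.g. the base snapshot of linked clones).
     * Sketch only - logging and error handling would follow the surrounding code.
     */
    public static void deleteSnapshot(RbdImage image, String snapName) throws RbdException {
        if (image.snapIsProtected(snapName)) {
            image.snapUnprotect(snapName);
        }
        image.snapRemove(snapName);
    }
}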
> >>>>>>
> >>>>>> Any guidance is most welcome, and our team will try to code all this.
> >>>>>>
> >>>>>> Thx Wido again
> >>>>>>
> >>>>>> On 10 September 2015 at 14:14, Wido den Hollander <w...@widodh.nl> wrote:
> >>>>>>
> >>>>>> On 10-09-15 14:07, Andrija Panic wrote:
> >>>>>> > Wido,
> >>>>>> >
> >>>>>> > the part of the code where you want to delete some volume checks if the volume is of type RBD - and then tries to list the snapshots, delete the snapshots, and finally remove the image. Here the first step - listing the snapshots - fails if there are more than 16 snapshots present - the number 16 is hardcoded elsewhere in the code and an RBD exception is thrown... then the agent crashes... and then VMs go down etc.
> >>>>>> >
> >>>>>> Hmmm, that seems like a bug in rados-java indeed. I don't know if there is a release of rados-java where this is fixed.
> >>>>>>
> >>>>>> Looking at the code of rados-java it should be, but I'm not 100% certain.
> >>>>>>
> >>>>>> > So our current quick fix is to invoke an external script which will also list and remove all snapshots, but will not fail.
> >>>>>> >
> >>>>>> Yes, but we should fix it upstream. I understand that you will use a temp script to clean up everything.
> >>>>>>
> >>>>>> > I'm not sure why 16 is there as the hardcoded limit - I will try to provide the part of the code where this is present... we can increase this number, but it doesn't make any sense (from 16 to e.g. 200), since we still have a lot of garbage left on CEPH (snapshots that were removed in ACS (DB and Secondary NFS) but not removed from CEPH). And in my understanding this needs to be implemented, so we don't catch any of the exceptions that I originally described...
> >>>>>> >
> >>>>>> > Any thoughts on this?
> >>>>>> >
> >>>>>> A cleanup script for now should be OK indeed. Afterwards the Java code should be able to do this.
> >>>>>>
> >>>>>> You can try manually by using rados-java and fix that.
> >>>>>>
> >>>>>> This is the part where the listing is done:
> >>>>>> https://github.com/ceph/rados-java/blob/master/src/main/java/com/ceph/rbd/RbdImage.java
> >>>>>>
> >>>>>> Wido
> >>>>>>
> >>>>>> > Thx for the input!
> >>>>>> >
> >>>>>> > On 10 September 2015 at 13:56, Wido den Hollander <w...@widodh.nl> wrote:
> >>>>>> >
> >>>>>> >> On 10-09-15 12:17, Andrija Panic wrote:
> >>>>>> >>> We are testing some [dirty?] patch on our dev system and we shall soon share it for review.
> >>>>>> >>>
> >>>>>> >>> Basically, we are using an external python script that is invoked in some part of the code execution to delete the needed CEPH snapshots, and then after that it proceeds with the volume deletion etc...
> >>>>>> >>>
> >>>>>> >> That shouldn't be required. The Java bindings for librbd and librados should be able to remove the snapshots.
> >>>>>> >>
> >>>>>> >> There is no need to invoke external code, this can all be handled in Java.
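[For reference, roughly what that pure-Java cleanup can look like once snapList() handles any number of snapshots: list all snapshots of the image, unprotect and remove each, then remove the image - essentially what deletePhysicalDisk already attempts. This is a simplified, untested sketch against the rados-java API; the monitor address, pool and image name are placeholders taken from the logs above, and the key is obviously not real:]

import com.ceph.rados.IoCTX;
import com.ceph.rados.Rados;
import com.ceph.rbd.Rbd;
import com.ceph.rbd.RbdImage;
import com.ceph.rbd.jna.RbdSnapInfo;

import java.util.List;

public class RbdVolumeCleanup {

    public static void main(String[] args) throws Exception {
        // Placeholder connection details - in CloudStack these come from the primary storage pool.
        Rados rados = new Rados("admin");
        rados.confSet("mon_host", "10.10.1.26:6789");
        rados.confSet("key", "replace-with-cephx-secret");
        rados.connect();

        IoCTX io = rados.ioCtxCreate("cloudstack-storage");
        try {
            Rbd rbd = new Rbd(io);
            RbdImage image = rbd.open("4c6a2092-056c-4446-a2ca-d6bba9f7f7f8");
            try {
                // With the fixed bindings this returns all snapshots, not only the first 16.
                List<RbdSnapInfo> snaps = image.snapList();
                for (RbdSnapInfo snap : snaps) {
                    // Protected snapshots (clone bases) must be unprotected before removal.
                    if (image.snapIsProtected(snap.name)) {
                        image.snapUnprotect(snap.name);
                    }
                    image.snapRemove(snap.name);
                }
            } finally {
                rbd.close(image);
            }
            // Only after all snapshots are gone can the image itself be removed.
            rbd.remove("4c6a2092-056c-4446-a2ca-d6bba9f7f7f8");
        } finally {
            rados.ioCtxDestroy(io);
        }
    }
}

[Hooking something like this - or the per-snapshot variant above - into the snapshot deletion path would then keep CEPH in sync with the DB and secondary storage.]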
> >>>>>> >>
> >>>>>> >>> On 10 September 2015 at 11:26, Andrija Panic <andrija.pa...@gmail.com> wrote:
> >>>>>> >>>
> >>>>>> >>>> Eh, OK. Thx for the info.
> >>>>>> >>>>
> >>>>>> >>>> BTW, why is the 16 snapshot limit hardcoded - any reason for that?
> >>>>>> >>>>
> >>>>>> >>>> Not cleaning snapshots on CEPH and trying to delete a volume after having more than 16 snapshots in CEPH = the Agent crashing on the KVM side... and some VMs being rebooted etc - which means downtime :|
> >>>>>> >>>>
> >>>>>> >>>> Thanks,
> >>>>>> >>>>
> >>>>>> >>>> On 9 September 2015 at 22:05, Simon Weller <swel...@ena.com> wrote:
> >>>>>> >>>>
> >>>>>> >>>>> Andrija,
> >>>>>> >>>>>
> >>>>>> >>>>> The Ceph snapshot deletion is not currently implemented.
> >>>>>> >>>>>
> >>>>>> >>>>> See: https://issues.apache.org/jira/browse/CLOUDSTACK-8302
> >>>>>> >>>>>
> >>>>>> >>>>> - Si
> >>>>>> >>>>>
> >>>>>> >>>>> ________________________________________
> >>>>>> >>>>> From: Andrija Panic <andrija.pa...@gmail.com>
> >>>>>> >>>>> Sent: Wednesday, September 9, 2015 3:03 PM
> >>>>>> >>>>> To: dev@cloudstack.apache.org; us...@cloudstack.apache.org
> >>>>>> >>>>> Subject: ACS 4.5 - volume snapshots NOT removed from CEPH (only from Secondaryt NFS and DB)
> >>>>>> >>>>>
> >>>>>> >>>>> Hi folks,
> >>>>>> >>>>>
> >>>>>> >>>>> we encounter an issue in ACS 4.5.1 (perhaps other versions are also affected) - when we delete some snapshot (volume snapshot) in ACS, ACS marks it as deleted in the DB and deletes it from the NFS Secondary Storage, but fails to delete the snapshot on the CEPH primary storage (it doesn't even try to delete it, AFAIK).
> >>>>>> >>>>>
> >>>>>> >>>>> So we end up having 5 live snapshots in the DB (just an example) but actually in CEPH there are more than e.g. 16 snapshots.
> >>>>>> >>>>>
> >>>>>> >>>>> More of the issue: when the ACS agent tries to obtain the list of snapshots from CEPH for some volume - if the number of snapshots is over 16, it raises an exception (and perhaps this is the reason the Agent crashed for us - I need to check with my colleagues who are investigating this in detail). This number 16 is for whatever reason hardcoded in ACS code.
> >>>>>> >>>>>
> >>>>>> >>>>> Wondering if anyone has experienced this, or has any info - we plan to try to fix this, and I will include my dev colleagues here, but we might need some help, at least for guidance.
> >>>>>> >>>>>
> >>>>>> >>>>> Any help is really appreciated, or at least confirmation that this is a known issue etc.
> >>>>>> >>>>>
> >>>>>> >>>>> Thanks,
> >>>>>> >>>>>
> >>>>>> >>>>> --
> >>>>>> >>>>> Andrija Panić
> >>>>>> >>>>
> >>>>>> >>>> --
> >>>>>> >>>> Andrija Panić
> >>>>>> >>>
> >>>>>> >>
> >>>>>> >
> >>>>>>
> >>>>>> --
> >>>>>> Andrija Panić
> >>>>>
> >>>>> --
> >>>>> ---
> >>>>> Best regards
> >>>>> Dmytro Shevchenko
> >>>>> dshevchenko.m...@gmail.com
> >>>>> skype: demonsh_mk
> >>>>> +380(66)2426648
> >>>>
>

--
Andrija Panić