On 10-09-15 23:15, Andrija Panic wrote:
> Wido,
>
> could you maybe follow up on what my colleague Dmytro just sent?
>
Yes, seems logical.

> It's not only a matter of fixing rados-java (the 16-snapshot limit) - it
> seems that for any RBD exception, ACS will freak out...

No, a RbdException will be caught, but the Rados bindings shouldn't throw
a NegativeArraySizeException in any case. That's the main problem.

Wido

> Thx
>
> On 10 September 2015 at 17:06, Dmytro Shevchenko <
> dmytro.shevche...@safeswisscloud.com> wrote:
>
>> Hello everyone, some clarification about this. Configuration:
>> CS: 4.5.1
>> Primary storage: Ceph
>>
>> Actually we have two separate bugs:
>>
>> 1. When you remove a volume with more than 16 snapshots (destroyed or
>> active, it doesn't matter - they are always present on Ceph), the next
>> storage garbage collector cycle invokes 'deletePhysicalDisk' from
>> LibvirtStorageAdaptor.java. On line 854 we call the snapshot listing in
>> the external rados-java library and get an exception:
>>
>> https://github.com/apache/cloudstack/blob/4.5.1/plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/LibvirtStorageAdaptor.java#L854
>>
>> This exception is not caught in the current function, but the agent DOES
>> NOT CRASH at this moment and continues working fine. The agent forms a
>> proper answer to the server and sends it; the text in the answer is the
>> Java stack trace. Log from the agent side:
>>
>> 2015-09-10 02:32:35,312 DEBUG [kvm.storage.LibvirtStorageAdaptor]
>> (agentRequest-Handler-4:null) Trying to fetch storage pool
>> 33ebaf83-5d09-3038-b63b-742e759a992e from libvirt
>> 2015-09-10 02:32:35,431 INFO [kvm.storage.LibvirtStorageAdaptor]
>> (agentRequest-Handler-4:null) Attempting to remove volume
>> 4c6a2092-056c-4446-a2ca-d6bba9f7f7f8 from pool
>> 33ebaf83-5d09-3038-b63b-742e759a992e
>> 2015-09-10 02:32:35,431 INFO [kvm.storage.LibvirtStorageAdaptor]
>> (agentRequest-Handler-4:null) Unprotecting and Removing RBD snapshots of
>> image cloudstack-storage/4c6a2092-056c-4446-a2ca-d6bba9f7f7f8 prior to
>> removing the image
>> 2015-09-10 02:32:35,436 DEBUG [kvm.storage.LibvirtStorageAdaptor]
>> (agentRequest-Handler-4:null) Succesfully connected to Ceph cluster at
>> 10.10.1.26:6789
>> 2015-09-10 02:32:35,454 DEBUG [kvm.storage.LibvirtStorageAdaptor]
>> (agentRequest-Handler-4:null) Fetching list of snapshots of RBD image
>> cloudstack-storage/4c6a2092-056c-4446-a2ca-d6bba9f7f7f8
>> 2015-09-10 02:32:35,457 WARN [cloud.agent.Agent]
>> (agentRequest-Handler-4:null) Caught:
>> java.lang.NegativeArraySizeException
>> at com.ceph.rbd.RbdImage.snapList(Unknown Source)
>> at com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.deletePhysicalDisk(LibvirtStorageAdaptor.java:854)
>> at com.cloud.hypervisor.kvm.storage.LibvirtStoragePool.deletePhysicalDisk(LibvirtStoragePool.java:175)
>> at com.cloud.hypervisor.kvm.storage.KVMStorageProcessor.deleteVolume(KVMStorageProcessor.java:1206)
>> 2015-09-10 02:32:35,458 DEBUG [cloud.agent.Agent]
>> (agentRequest-Handler-4:null) Seq 1-1743737480722513946: { Ans: , MgmtId:
>> 90520739779588, via: 1, Ver: v1, Flags: 10,
>> [{"com.cloud.agent.api.Answer":{"result":false,"details":"java.lang.NegativeArraySizeException\n\tat
>> com.ceph.rbd.RbdImage.snapList(Unknown Source)\n\tat
>> com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.deletePhysicalDisk(LibvirtStorageAdaptor.java:854)\n\tat
>> com.cloud.hypervisor.kvm.storage.LibvirtStoragePool.deletePhysicalDisk(LibvirtStoragePool.java:175)\n\tat
>> com.cloud.hypervisor.kvm.storage.KVMStorageProcessor.deleteVolume(KVMStorageProcessor.java:1206)\n\tat
>> com.cloud.storage.resource.StorageSubsystemCommandHandlerBase.execute(StorageSubsystemCommandHandlerBase.java:124)\n\tat
>> com.cloud.storage.re.....
>>
>> So this volume and its snapshots will never be removed.
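As an illustration of the missing guard: the snapList() call could be
wrapped so that both the declared RbdException and any unchecked exception
coming out of the bindings surface as a CloudRuntimeException, which the
agent already reports cleanly, instead of escaping as a raw stack trace.
A minimal sketch against the 4.5.1 code path above; the helper itself is
hypothetical, not the actual fix:

    import java.util.List;

    import com.ceph.rbd.RbdException;
    import com.ceph.rbd.RbdImage;
    import com.ceph.rbd.jna.RbdSnapInfo;
    import com.cloud.utils.exception.CloudRuntimeException;

    // Hypothetical helper for LibvirtStorageAdaptor: guard the rados-java
    // call so a binding-level bug (like the NegativeArraySizeException
    // above) cannot escape deletePhysicalDisk() as an unhandled error.
    private List<RbdSnapInfo> listSnapshotsGuarded(RbdImage image, String name) {
        try {
            return image.snapList();
        } catch (RbdException e) {
            // the declared exception; callers already expect this
            throw new CloudRuntimeException("Failed to list snapshots of " + name, e);
        } catch (RuntimeException e) {
            // undeclared exceptions from the bindings, e.g. a
            // NegativeArraySizeException when an image has > 16 snapshots
            throw new CloudRuntimeException("Unexpected error listing snapshots of " + name, e);
        }
    }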
>>
>> 2. Second bug. Experimentally we found that 50 minutes after the
>> exception on the agent, the management server for some unknown reason
>> decided it had lost the connection to this agent, started the HA process
>> and restarted the agent.
>>
>> Log on the agent side:
>> 2015-09-10 02:57:12,664 DEBUG [kvm.resource.LibvirtComputingResource]
>> (agentRequest-Handler-2:null) Executing: /bin/bash -c free|grep Mem:|awk
>> '{print $2}'
>> 2015-09-10 02:57:12,667 DEBUG [kvm.resource.LibvirtComputingResource]
>> (agentRequest-Handler-2:null) Execution is successful.
>> 2015-09-10 02:57:40,502 DEBUG [kvm.resource.LibvirtComputingResource]
>> (UgentTask-5:null) Executing:
>> /usr/share/cloudstack-common/scripts/vm/network/security_group.py
>> get_rule_logs_for_vms
>> 2015-09-10 02:57:40,572 DEBUG [kvm.resource.LibvirtComputingResource]
>> (UgentTask-5:null) Execution is successful.
>> 2015-09-10 02:57:54,135 INFO [cloud.agent.AgentShell] (main:null) Agent
>> started
>> 2015-09-10 02:57:54,136 INFO [cloud.agent.AgentShell] (main:null)
>> Implementation Version is 4.5.1
>> 2015-09-10 02:57:54,138 INFO [cloud.agent.AgentShell] (main:null)
>> agent.properties found at /etc/cloudstack/agent/agent.properties
>> .....
>>
>> Log on the server side:
>> 2015-09-10 02:57:53,710 INFO [c.c.a.m.AgentManagerImpl]
>> (AgentTaskPool-1:ctx-2127ada4) Investigating why host 1 has disconnected
>> with event AgentDisconnected
>> 2015-09-10 02:57:53,714 DEBUG [c.c.a.m.AgentManagerImpl]
>> (AgentTaskPool-1:ctx-2127ada4) checking if agent (1) is alive
>> 2015-09-10 02:57:53,723 DEBUG [c.c.a.t.Request]
>> (AgentTaskPool-1:ctx-2127ada4) Seq 1-1743737480722513988: Sending { Cmd ,
>> MgmtId: 90520739779588, via: 1(ix1-c7-2), Ver: v1, Flags: 100011,
>> [{"com.cloud.agent.api.CheckHealthCommand":{"wait":50}}] }
>> 2015-09-10 02:57:53,724 INFO [c.c.a.m.AgentAttache]
>> (AgentTaskPool-1:ctx-2127ada4) Seq 1-1743737480722513988: Unable to send
>> due to Resource [Host:1] is unreachable: Host 1: Channel is closed
>> 2015-09-10 02:57:53,724 DEBUG [c.c.a.m.AgentAttache]
>> (AgentTaskPool-1:ctx-2127ada4) Seq 1-1743737480722513988: Cancelling.
>> 2015-09-10 02:57:53,724 WARN [c.c.a.m.AgentManagerImpl]
>> (AgentTaskPool-1:ctx-2127ada4) Resource [Host:1] is unreachable: Host 1:
>> Channel is closed
>> 2015-09-10 02:57:53,728 DEBUG [c.c.h.HighAvailabilityManagerImpl]
>> (AgentTaskPool-1:ctx-2127ada4) SimpleInvestigator unable to determine the
>> state of the host. Moving on.
>>
>> It looks like a connection problem, but it appears only when we have
>> this RBD exception on the agent side, and only on this node. I tried
>> playing with the "storage.cleanup.interval" parameter and set it to 5
>> minutes; now we get the exception on the agent side every 5 minutes, but
>> the disconnects still happen every 50 minutes and I can't find out why.
>>
>> On 09/10/2015 03:21 PM, Andrija Panic wrote:
>>
>>> Thx Wido,
>>>
>>> I will have my colleagues Igor and Dmytro join with details on this.
>>>
>>> I agree we need to fix this upstream, that is the main purpose from our
>>> side!
>>>
>>> With this temp fix, we just avoid the agent crashing (the agent somehow
>>> restarts again fine :) ), but VMs also go down on that host, at least
>>> some of them.
>>>
>>> Do you see any lifecycle/workflow issue if we implement deleting the
>>> snapshot from Ceph after you snapshot a volume in ACS and successfully
>>> move it to secondary NFS - or perhaps only deleting the snapshot from
>>> Ceph as part of the actual snapshot deletion (when you delete a
>>> snapshot from the DB and NFS, manually or via scheduled snapshots)?
>>> Maybe the second option is better; I don't know how you guys handle
>>> this for regular NFS as primary storage etc...
>>>
>>> Any guidance is most welcome, and our team will try to code all this.
>>>
>>> Thx Wido again
>>>
>>> On 10 September 2015 at 14:14, Wido den Hollander <w...@widodh.nl> wrote:
>>>
>>> On 10-09-15 14:07, Andrija Panic wrote:
>>> > Wido,
>>> >
>>> > the part of the code where you want to delete some volume checks if
>>> > the volume is of type RBD - and then tries to list snapshots, delete
>>> > snapshots, and finally remove the image. Here the first step - listing
>>> > snapshots - fails if there are more than 16 snapshots present - the
>>> > number 16 is hardcoded elsewhere in the code - and throws an RBD
>>> > exception... then the agent crashes... and then VMs go down etc.
>>> >
>>>
>>> Hmmm, that seems like a bug in rados-java indeed. I don't know if there
>>> is a release of rados-java where this is fixed.
>>>
>>> Looking at the code of rados-java it should be, but I'm not 100% certain.
>>>
>>> > So our current way, as a quick fix, is to invoke an external script
>>> > which will also list and remove all snapshots, but will not fail.
>>> >
>>>
>>> Yes, but we should fix it upstream. I understand that you will use a
>>> temp script to clean up everything.
>>>
>>> > I'm not sure why 16 is the hardcoded limit - I will try to provide
>>> > the part of the code where this is present... we can increase this
>>> > number, but it doesn't make any sense (from 16 to e.g. 200), since we
>>> > still have a lot of garbage left on Ceph (snapshots that were removed
>>> > in ACS (DB and secondary NFS) but not removed from Ceph). And in my
>>> > understanding this needs to be implemented, so we don't catch any of
>>> > the exceptions that I originally described...
>>> >
>>> > Any thoughts on this?
>>> >
>>>
>>> A cleanup script for now should be OK indeed. Afterwards the Java code
>>> should be able to do this.
>>>
>>> You can try manually by using rados-java and fix that.
>>>
>>> This is the part where the listing is done:
>>> https://github.com/ceph/rados-java/blob/master/src/main/java/com/ceph/rbd/RbdImage.java
>>>
>>> Wido
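For context on that listing code: librbd's rbd_snap_list() fails with
-ERANGE when the caller's buffer is too small and reports the required
count back to the caller, so a binding has to grow its buffer and retry
rather than trust a fixed initial size (the hardcoded 16 discussed above).
A sketch of that grow-and-retry pattern; nativeSnapList() and
SnapListResult are illustrative stand-ins for the JNA plumbing, not the
real rados-java API:

    import java.util.List;

    import com.ceph.rbd.RbdException;
    import com.ceph.rbd.jna.RbdSnapInfo;

    static final int ERANGE = 34; // errno: buffer too small for the result

    // Keep retrying with a bigger buffer until librbd is satisfied; the
    // initial capacity is only a starting guess, never a hard limit.
    static List<RbdSnapInfo> snapList(long imageHandle) throws RbdException {
        int capacity = 16;
        while (true) {
            SnapListResult res = nativeSnapList(imageHandle, capacity);
            if (res.returnCode >= 0) {
                return res.snapshots; // returnCode is the snapshot count
            }
            if (res.returnCode == -ERANGE) {
                capacity = res.requiredCount; // librbd reported what it needs
                continue;                     // retry with the larger buffer
            }
            throw new RbdException("rbd_snap_list failed: " + res.returnCode);
        }
    }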
>>> > Thx for input!
>>> >
>>> > On 10 September 2015 at 13:56, Wido den Hollander <w...@widodh.nl> wrote:
>>> >
>>> >>
>>> >> On 10-09-15 12:17, Andrija Panic wrote:
>>> >>> We are testing a [dirty?] patch on our dev system and we shall soon
>>> >>> share it for review.
>>> >>>
>>> >>> Basically, we are using an external Python script that is invoked
>>> >>> in some part of the code execution to delete the needed Ceph
>>> >>> snapshots, and then the volume deletion etc. proceeds...
>>> >>>
>>> >>
>>> >> That shouldn't be required. The Java bindings for librbd and librados
>>> >> should be able to remove the snapshots.
>>> >>
>>> >> There is no need to invoke external code; this can all be handled in
>>> >> Java.
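Roughly what that looks like with rados-java - the same unprotect-and-remove
sequence deletePhysicalDisk() performs, as a standalone sketch. The monitor
address, pool and image names are taken from the logs above; the auth key is
a placeholder:

    import java.util.List;

    import com.ceph.rados.IoCTX;
    import com.ceph.rados.Rados;
    import com.ceph.rbd.Rbd;
    import com.ceph.rbd.RbdImage;
    import com.ceph.rbd.jna.RbdSnapInfo;

    // Remove all snapshots of an RBD image and then the image itself,
    // purely through the Java bindings - no external script needed.
    public class RbdCleanup {
        public static void main(String[] args) throws Exception {
            Rados r = new Rados("admin");
            r.confSet("mon_host", "10.10.1.26"); // monitor from the logs above
            r.confSet("key", "<cephx key>");     // placeholder auth key
            r.connect();

            IoCTX io = r.ioCtxCreate("cloudstack-storage");
            try {
                Rbd rbd = new Rbd(io);
                RbdImage image = rbd.open("4c6a2092-056c-4446-a2ca-d6bba9f7f7f8");
                try {
                    List<RbdSnapInfo> snaps = image.snapList();
                    for (RbdSnapInfo snap : snaps) {
                        if (image.snapIsProtected(snap.name)) {
                            // protected snapshots must be unprotected first
                            image.snapUnprotect(snap.name);
                        }
                        image.snapRemove(snap.name);
                    }
                } finally {
                    rbd.close(image);
                }
                rbd.remove("4c6a2092-056c-4446-a2ca-d6bba9f7f7f8");
            } finally {
                r.ioCtxDestroy(io);
            }
        }
    }

Note that this still goes through snapList(), so past 16 snapshots it hits
the same listing bug until rados-java itself is fixed.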
>>> >>> On 10 September 2015 at 11:26, Andrija Panic
>>> >>> <andrija.pa...@gmail.com> wrote:
>>> >>>
>>> >>>> Eh, OK. Thx for the info.
>>> >>>>
>>> >>>> BTW, why is the 16-snapshot limit hardcoded - any reason for that?
>>> >>>>
>>> >>>> Not cleaning snapshots on Ceph and trying to delete a volume after
>>> >>>> having more than 16 snapshots in Ceph = the agent crashing on the
>>> >>>> KVM side... and some VMs being rebooted etc. - which means
>>> >>>> downtime :|
>>> >>>>
>>> >>>> Thanks,
>>> >>>>
>>> >>>> On 9 September 2015 at 22:05, Simon Weller <swel...@ena.com> wrote:
>>> >>>>
>>> >>>>> Andrija,
>>> >>>>>
>>> >>>>> The Ceph snapshot deletion is not currently implemented.
>>> >>>>>
>>> >>>>> See: https://issues.apache.org/jira/browse/CLOUDSTACK-8302
>>> >>>>>
>>> >>>>> - Si
>>> >>>>>
>>> >>>>> ________________________________________
>>> >>>>> From: Andrija Panic <andrija.pa...@gmail.com>
>>> >>>>> Sent: Wednesday, September 9, 2015 3:03 PM
>>> >>>>> To: dev@cloudstack.apache.org; us...@cloudstack.apache.org
>>> >>>>> Subject: ACS 4.5 - volume snapshots NOT removed from CEPH (only
>>> >>>>> from Secondary NFS and DB)
>>> >>>>>
>>> >>>>> Hi folks,
>>> >>>>>
>>> >>>>> we encountered an issue in ACS 4.5.1 (perhaps other versions are
>>> >>>>> also affected) - when we delete some snapshot (a volume snapshot)
>>> >>>>> in ACS, ACS marks it as deleted in the DB and deletes it from NFS
>>> >>>>> secondary storage, but fails to delete the snapshot on the Ceph
>>> >>>>> primary storage (it doesn't even try to delete it, AFAIK).
>>> >>>>>
>>> >>>>> So we end up having 5 live snapshots in the DB (just an example),
>>> >>>>> but in Ceph there are actually more than e.g. 16 snapshots.
>>> >>>>>
>>> >>>>> What's more, when the ACS agent tries to obtain the list of
>>> >>>>> snapshots from Ceph for some volume - if the number of snapshots
>>> >>>>> is over 16, it raises an exception (and perhaps this is the
>>> >>>>> reason the agent crashed for us - I need to check with my
>>> >>>>> colleagues who are investigating this in detail). This number 16
>>> >>>>> is, for whatever reason, hardcoded in the ACS code.
>>> >>>>>
>>> >>>>> Wondering if anyone has experienced this, or has any info - we
>>> >>>>> plan to try to fix this, and I will include my dev colleagues
>>> >>>>> here, but we might need some help, at least for guidance.
>>> >>>>>
>>> >>>>> Any help is really appreciated, or at least confirmation that
>>> >>>>> this is a known issue etc.
>>> >>>>>
>>> >>>>> Thanks,
>>> >>>>>
>>> >>>>> --
>>> >>>>> Andrija Panić
>>> >>>>
>>> >>>> --
>>> >>>> Andrija Panić
>>>
>>> --
>>> Andrija Panić
>>
>> --
>> ---
>> Best regards
>> Dmytro Shevchenko
>> dshevchenko.m...@gmail.com
>> skype: demonsh_mk
>> +380(66)2426648