Wido, could you maybe follow up on what my colleague Dmytro just sent?
It's not only a matter of fixing rados-java (the 16-snapshot limit) - it seems that for any RBD exception, ACS will freak out... Thx

On 10 September 2015 at 17:06, Dmytro Shevchenko <dmytro.shevche...@safeswisscloud.com> wrote:

> Hello everyone, some clarification about this. Configuration:
> CS: 4.5.1
> Primary storage: Ceph
>
> Actually we have 2 separate bugs:
>
> 1. When you remove a volume that has more than 16 snapshots (destroyed or active does not matter - they are always present on Ceph), the next storage garbage collector cycle invokes 'deletePhysicalDisk' from LibvirtStorageAdaptor.java. On line 854 we call the snapshot listing in the external rados-java library and get an exception:
> https://github.com/apache/cloudstack/blob/4.5.1/plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/LibvirtStorageAdaptor.java#L854
> The exception is not caught in that function, but the Agent DOES NOT CRASH at this point and keeps working fine. The Agent forms a proper answer to the server and sends it; the text of the answer is the Java stack trace. Log from the Agent side:
>
> 2015-09-10 02:32:35,312 DEBUG [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-4:null) Trying to fetch storage pool 33ebaf83-5d09-3038-b63b-742e759a992e from libvirt
> 2015-09-10 02:32:35,431 INFO [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-4:null) Attempting to remove volume 4c6a2092-056c-4446-a2ca-d6bba9f7f7f8 from pool 33ebaf83-5d09-3038-b63b-742e759a992e
> 2015-09-10 02:32:35,431 INFO [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-4:null) Unprotecting and Removing RBD snapshots of image cloudstack-storage/4c6a2092-056c-4446-a2ca-d6bba9f7f7f8 prior to removing the image
> 2015-09-10 02:32:35,436 DEBUG [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-4:null) Succesfully connected to Ceph cluster at 10.10.1.26:6789
> 2015-09-10 02:32:35,454 DEBUG [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-4:null) Fetching list of snapshots of RBD image cloudstack-storage/4c6a2092-056c-4446-a2ca-d6bba9f7f7f8
> 2015-09-10 02:32:35,457 WARN [cloud.agent.Agent] (agentRequest-Handler-4:null) Caught:
> java.lang.NegativeArraySizeException
>         at com.ceph.rbd.RbdImage.snapList(Unknown Source)
>         at com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.deletePhysicalDisk(LibvirtStorageAdaptor.java:854)
>         at com.cloud.hypervisor.kvm.storage.LibvirtStoragePool.deletePhysicalDisk(LibvirtStoragePool.java:175)
>         at com.cloud.hypervisor.kvm.storage.KVMStorageProcessor.deleteVolume(KVMStorageProcessor.java:1206)
> 2015-09-10 02:32:35,458 DEBUG [cloud.agent.Agent] (agentRequest-Handler-4:null) Seq 1-1743737480722513946: { Ans: , MgmtId: 90520739779588, via: 1, Ver: v1, Flags: 10, [{"com.cloud.agent.api.Answer":{"result":false,"details":"java.lang.NegativeArraySizeException\n\tat com.ceph.rbd.RbdImage.snapList(Unknown Source)\n\tat com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.deletePhysicalDisk(LibvirtStorageAdaptor.java:854)\n\tat com.cloud.hypervisor.kvm.storage.LibvirtStoragePool.deletePhysicalDisk(LibvirtStoragePool.java:175)\n\tat com.cloud.hypervisor.kvm.storage.KVMStorageProcessor.deleteVolume(KVMStorageProcessor.java:1206)\n\tat com.cloud.storage.resource.StorageSubsystemCommandHandlerBase.execute(StorageSubsystemCommandHandlerBase.java:124)\n\tat com.cloud.storage.re.....
>
> So this volume and its snapshots will never be removed.
>
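> Just to sketch the direction we are thinking of on the ACS side (this is only an illustration, not our actual patch - the helper below is hypothetical, only the rados-java calls in it are real), the snapshot cleanup in deletePhysicalDisk could be made best-effort, so that one broken snapList() call no longer fails the whole volume delete command:
>
> import java.util.List;
> import com.ceph.rbd.Rbd;
> import com.ceph.rbd.RbdException;
> import com.ceph.rbd.RbdImage;
> import com.ceph.rbd.jna.RbdSnapInfo;
>
> public class RbdSnapCleanup {
>     /**
>      * Best-effort removal of all snapshots of an RBD image.
>      * Returns false instead of throwing, so deletePhysicalDisk can log the
>      * failure and leave the image for the next GC cycle instead of sending
>      * a stack trace back to the management server as the Answer.
>      */
>     public static boolean purgeSnapshots(Rbd rbd, String imageName) {
>         try {
>             RbdImage image = rbd.open(imageName);
>             try {
>                 // this is the call that today throws NegativeArraySizeException with >16 snapshots
>                 List<RbdSnapInfo> snaps = image.snapList();
>                 for (RbdSnapInfo snap : snaps) {
>                     try {
>                         image.snapUnprotect(snap.name);
>                     } catch (RbdException e) {
>                         // snapshot was not protected - fine, just remove it
>                     }
>                     image.snapRemove(snap.name);
>                 }
>             } finally {
>                 rbd.close(image);
>             }
>             return true;
>         } catch (RbdException | NegativeArraySizeException e) {
>             // swallow and report; do not let the exception bubble up into the Agent answer
>             return false;
>         }
>     }
> }
>
> For now we work around it with the external python script Andrija mentioned, but something along these lines (plus fixing snapList() itself in rados-java) is what we would like to get upstream.
>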
> 2. Second bug. Experimentally we found that about 50 minutes after the exception on the Agent, the Management server for some unknown reason decided it had lost the connection to this agent, started the HA process and restarted the Agent process.
> Log on Agent side:
> 2015-09-10 02:57:12,664 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:null) Executing: /bin/bash -c free|grep Mem:|awk '{print $2}'
> 2015-09-10 02:57:12,667 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:null) Execution is successful.
> 2015-09-10 02:57:40,502 DEBUG [kvm.resource.LibvirtComputingResource] (UgentTask-5:null) Executing: /usr/share/cloudstack-common/scripts/vm/network/security_group.py get_rule_logs_for_vms
> 2015-09-10 02:57:40,572 DEBUG [kvm.resource.LibvirtComputingResource] (UgentTask-5:null) Execution is successful.
> 2015-09-10 02:57:54,135 INFO [cloud.agent.AgentShell] (main:null) Agent started
> 2015-09-10 02:57:54,136 INFO [cloud.agent.AgentShell] (main:null) Implementation Version is 4.5.1
> 2015-09-10 02:57:54,138 INFO [cloud.agent.AgentShell] (main:null) agent.properties found at /etc/cloudstack/agent/agent.properties
> .....
>
> Log on Server side:
> 2015-09-10 02:57:53,710 INFO [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-2127ada4) Investigating why host 1 has disconnected with event AgentDisconnected
> 2015-09-10 02:57:53,714 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-2127ada4) checking if agent (1) is alive
> 2015-09-10 02:57:53,723 DEBUG [c.c.a.t.Request] (AgentTaskPool-1:ctx-2127ada4) Seq 1-1743737480722513988: Sending { Cmd , MgmtId: 90520739779588, via: 1(ix1-c7-2), Ver: v1, Flags: 100011, [{"com.cloud.agent.api.CheckHealthCommand":{"wait":50}}] }
> 2015-09-10 02:57:53,724 INFO [c.c.a.m.AgentAttache] (AgentTaskPool-1:ctx-2127ada4) Seq 1-1743737480722513988: Unable to send due to Resource [Host:1] is unreachable: Host 1: Channel is closed
> 2015-09-10 02:57:53,724 DEBUG [c.c.a.m.AgentAttache] (AgentTaskPool-1:ctx-2127ada4) Seq 1-1743737480722513988: Cancelling.
> 2015-09-10 02:57:53,724 WARN [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-2127ada4) Resource [Host:1] is unreachable: Host 1: Channel is closed
> 2015-09-10 02:57:53,728 DEBUG [c.c.h.HighAvailabilityManagerImpl] (AgentTaskPool-1:ctx-2127ada4) SimpleInvestigator unable to determine the state of the host. Moving on.
>
> It looks like a connection problem, but it appears only when we have this RBD exception on the agent side, and only on this node. I tried playing with the "storage.cleanup.interval" parameter and set it to 5 minutes; now we get the exception on the agent side every 5 minutes, but the disconnects still happen every 50 minutes and I can't find out why.
>
> On 09/10/2015 03:21 PM, Andrija Panic wrote:
>
>> Thx Wido,
>>
>> I will have my colleagues Igor and Dmytro join with details on this.
>>
>> I agree we need to fix this upstream, that is the main purpose from our side!
>>
>> With this temp fix, we just avoid the agent crashing (the agent somehow restarts again fine :) ), but VMs also go down on that host, at least some of them.
>>
>> Do you see any lifecycle/workflow issue if we implement deleting the SNAP from CEPH after you SNAP a volume in ACS and successfully move it to Secondary NFS - or perhaps only delete the SNAP from CEPH as part of the actual SNAP deletion (when you, manually or via scheduled snapshots, delete the snapshot from DB and NFS)?
>> Maybe the second option is better, I don't know how you guys handle this for regular NFS as primary storage etc...
>>
>> Any guidance is most welcome, and our team will try to code all this.
>>
>> Thx Wido again
>>
>> On 10 September 2015 at 14:14, Wido den Hollander <w...@widodh.nl <mailto:w...@widodh.nl>> wrote:
>>
>> On 10-09-15 14:07, Andrija Panic wrote:
>> > Wido,
>> >
>> > the part of the code where you want to delete a volume checks if the volume is of type RBD - and then tries to list the snapshots, delete the snapshots, and finally remove the image. Here the first step - listing snapshots - fails if there are more than 16 snapshots present - the number 16 is hardcoded elsewhere in the code and an RBD exception is thrown... then the agent crashes... and then VMs go down etc.
>> >
>>
>> Hmmm, that seems like a bug in rados-java indeed. I don't know if there is a release of rados-java where this is fixed.
>>
>> Looking at the code of rados-java it should be, but I'm not 100% certain.
>>
>> > So our current quick fix is to invoke an external script which will also list and remove all snapshots, but will not fail.
>> >
>>
>> Yes, but we should fix it upstream. I understand that you will use a temp script to clean up everything.
>>
>> > I'm not sure why 16 is the hardcoded limit - I will try to provide the part of the code where this is present... We can increase this number, but it doesn't make much sense (from 16 to e.g. 200), since we still have a lot of garbage left on CEPH (snapshots that were removed in ACS (DB and Secondary NFS) but not removed from CEPH). And in my understanding this cleanup needs to be implemented, so we don't hit the exceptions that I originally described...
>> >
>> > Any thoughts on this?
>> >
>>
>> A cleanup script for now should be OK indeed. Afterwards the Java code should be able to do this.
>>
>> You can try it manually by using rados-java and fix that.
>>
>> This is the part where the listing is done:
>> https://github.com/ceph/rados-java/blob/master/src/main/java/com/ceph/rbd/RbdImage.java
>>
>> Wido
>>
>> > Thx for input!
>> >
>> > On 10 September 2015 at 13:56, Wido den Hollander <w...@widodh.nl <mailto:w...@widodh.nl>> wrote:
>> >
>> >>
>> >> On 10-09-15 12:17, Andrija Panic wrote:
>> >>> We are testing a [dirty?] patch on our dev system and we will soon share it for review.
>> >>>
>> >>> Basically, we are using an external python script that is invoked at a certain point of the code execution to delete the needed CEPH snapshots, and only after that do we proceed with the volume deletion etc...
>> >>>
>> >>
>> >> That shouldn't be required. The Java bindings for librbd and librados should be able to remove the snapshots.
>> >>
>> >> There is no need to invoke external code, this can all be handled in Java.
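>> >>
>> >> For illustration only (an untested sketch with placeholder connection details, class name and image name, using roughly the same rados-java calls LibvirtStorageAdaptor already uses - snapList() being the one that currently misbehaves past 16 snapshots), the whole cleanup fits in plain Java:
>> >>
>> >> import java.util.List;
>> >> import com.ceph.rados.Rados;
>> >> import com.ceph.rados.IoCTX;
>> >> import com.ceph.rbd.Rbd;
>> >> import com.ceph.rbd.RbdException;
>> >> import com.ceph.rbd.RbdImage;
>> >> import com.ceph.rbd.jna.RbdSnapInfo;
>> >>
>> >> public class RbdPurge {
>> >>     public static void main(String[] args) throws Exception {
>> >>         // placeholders: monitor, cephx key, pool and image name of the volume to delete
>> >>         Rados rados = new Rados("admin");
>> >>         rados.confSet("mon_host", "10.10.1.26:6789");
>> >>         rados.confSet("key", "<cephx secret>");
>> >>         rados.connect();
>> >>
>> >>         IoCTX io = rados.ioCtxCreate("cloudstack-storage");
>> >>         try {
>> >>             Rbd rbd = new Rbd(io);
>> >>             RbdImage image = rbd.open("4c6a2092-056c-4446-a2ca-d6bba9f7f7f8");
>> >>             try {
>> >>                 // list, unprotect and remove every snapshot of the image
>> >>                 for (RbdSnapInfo snap : image.snapList()) {
>> >>                     try {
>> >>                         image.snapUnprotect(snap.name);
>> >>                     } catch (RbdException e) {
>> >>                         // snapshot was not protected to begin with
>> >>                     }
>> >>                     image.snapRemove(snap.name);
>> >>                 }
>> >>             } finally {
>> >>                 rbd.close(image);
>> >>             }
>> >>             // only now can the image itself be removed
>> >>             rbd.remove("4c6a2092-056c-4446-a2ca-d6bba9f7f7f8");
>> >>         } finally {
>> >>             rados.ioCtxDestroy(io);
>> >>         }
>> >>     }
>> >> }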
>> >>
>> >>> On 10 September 2015 at 11:26, Andrija Panic <andrija.pa...@gmail.com <mailto:andrija.pa...@gmail.com>> wrote:
>> >>>
>> >>>> Eh, OK. Thx for the info.
>> >>>>
>> >>>> BTW why is the 16-snapshot limit hardcoded - any reason for that?
>> >>>>
>> >>>> Not cleaning up snapshots on CEPH and then trying to delete a volume that has more than 16 snapshots in CEPH = Agent crashing on the KVM side... and some VMs being rebooted etc - which means downtime :|
>> >>>>
>> >>>> Thanks,
>> >>>>
>> >>>> On 9 September 2015 at 22:05, Simon Weller <swel...@ena.com <mailto:swel...@ena.com>> wrote:
>> >>>>
>> >>>>> Andrija,
>> >>>>>
>> >>>>> The Ceph snapshot deletion is not currently implemented.
>> >>>>>
>> >>>>> See: https://issues.apache.org/jira/browse/CLOUDSTACK-8302
>> >>>>>
>> >>>>> - Si
>> >>>>>
>> >>>>> ________________________________________
>> >>>>> From: Andrija Panic <andrija.pa...@gmail.com>
>> >>>>> Sent: Wednesday, September 9, 2015 3:03 PM
>> >>>>> To: dev@cloudstack.apache.org <mailto:dev@cloudstack.apache.org>; us...@cloudstack.apache.org <mailto:us...@cloudstack.apache.org>
>> >>>>> Subject: ACS 4.5 - volume snapshots NOT removed from CEPH (only from Secondary NFS and DB)
>> >>>>>
>> >>>>> Hi folks,
>> >>>>>
>> >>>>> we encountered an issue in ACS 4.5.1 (perhaps other versions are also affected) - when we delete a snapshot (volume snapshot) in ACS, ACS marks it as deleted in the DB and deletes it from NFS Secondary Storage, but fails to delete the snapshot on the CEPH primary storage (it doesn't even try to delete it, AFAIK).
>> >>>>>
>> >>>>> So we end up having 5 live snapshots in the DB (just an example) but actually more than e.g. 16 snapshots in CEPH.
>> >>>>>
>> >>>>> Worse, when the ACS agent tries to obtain the list of snapshots from CEPH for some volume - if the number of snapshots is over 16, it raises an exception (and perhaps this is the reason the Agent crashed for us - I need to check with my colleagues who are investigating this in detail). This number 16 is, for whatever reason, hardcoded in the ACS code.
>> >>>>>
>> >>>>> Wondering if anyone has experienced this, or has any info - we plan to try to fix this, and I will include my dev colleagues here, but we might need some help, at least for guidance.
>> >>>>>
>> >>>>> Any help is really appreciated, or at least confirmation that this is a known issue etc.
>> >>>>>
>> >>>>> Thanks,
>> >>>>>
>> >>>>> --
>> >>>>> Andrija Panić
>> >>>>
>> >>>> --
>> >>>> Andrija Panić
>>
>> --
>> Andrija Panić
>
> --
> ---
> Best regards
> Dmytro Shevchenko
> dshevchenko.m...@gmail.com
> skype: demonsh_mk
> +380(66)2426648

--
Andrija Panić