On 11-09-15 14:43, Dmytro Shevchenko wrote:
> Thanks a lot Wido! Any chance to find out why the management server
> decided that it had lost the connection to the agent after those
> exceptions? It's not as critical as the bug with 16 snapshots, but
> during the last week we caught a situation where the Agent failed to
> unprotect a snapshot and raised an exception, and this was the reason
> for a disconnection a bit later. (It is not clear why CS decided to
> remove that volume; it was a template with one 'gold' snapshot and
> several active clones.)
>
No, I didn't look at CS at all. I just spent the day improving the RADOS
bindings.

Wido

> On 09/11/2015 03:20 PM, Wido den Hollander wrote:
>>
>> On 11-09-15 10:19, Wido den Hollander wrote:
>>>
>>> On 10-09-15 23:15, Andrija Panic wrote:
>>>> Wido,
>>>>
>>>> could you maybe follow what my colleague Dmytro just sent?
>>>>
>>> Yes, seems logical.
>>>
>>>> It's not only a matter of fixing rados-java (the 16-snapshot limit) -
>>>> it seems that for any RBD exception, ACS will freak out...
>>>>
>>> No, an RbdException will be caught, but the RADOS bindings shouldn't
>>> throw a NegativeArraySizeException in any case.
>>>
>>> That's the main problem.
>>>
>> Seems to be fixed with this commit:
>> https://github.com/ceph/rados-java/commit/5584f3961c95d998d2a9eff947a5b7b4d4ba0b64
>>
>> Just tested it with 256 snapshots:
>>
>> -------------------------------------------------------
>>  T E S T S
>> -------------------------------------------------------
>> Running com.ceph.rbd.TestRbd
>> Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 521.014 sec
>>
>> Results :
>>
>> Tests run: 1, Failures: 0, Errors: 0, Skipped: 0
>>
>> The bindings should now be capable of listing more than 16 snapshots.
>>
>> You can build the bindings manually and replace rados.jar on your
>> running systems.
>>
>> For 4.6 I'll try to get the updated rados-java included.
>>
>> Wido
>>
>>> Wido
>>>
>>>> Thx
>>>>
>>>> On 10 September 2015 at 17:06, Dmytro Shevchenko <
>>>> dmytro.shevche...@safeswisscloud.com> wrote:
>>>>
>>>>> Hello everyone, some clarification about this. Configuration:
>>>>> CS: 4.5.1
>>>>> Primary storage: Ceph
>>>>>
>>>>> Actually we have 2 separate bugs:
>>>>>
>>>>> 1. When you remove a volume with more than 16 snapshots (destroyed or
>>>>> active, it doesn't matter - they are always present on Ceph), the next
>>>>> storage garbage collector cycle invokes 'deletePhysicalDisk' from
>>>>> LibvirtStorageAdaptor.java. On line 854 we call the snapshot listing
>>>>> in the external rados-java library and get an exception.
>>>>>
>>>>> https://github.com/apache/cloudstack/blob/4.5.1/plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/LibvirtStorageAdaptor.java#L854
>>>>>
>>>>> This exception is not caught in the current function, but the Agent
>>>>> DOES NOT CRASH at this moment and continues working fine. The Agent
>>>>> forms a proper answer to the server and sends it; the text in the
>>>>> answer is a Java stack trace.
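For context on the fix Wido references above: librbd's rbd_snap_list() signals a too-small caller buffer by returning -ERANGE and writing the required entry count back through its max_snaps argument. A binding that sizes its buffer once (for 16 entries) and never handles that error path will fail exactly as in the stack trace below. What follows is a minimal, runnable sketch of the grow-and-retry loop such a fix needs; fakeSnapList() is a stand-in with librbd semantics, not the real JNA call in rados-java:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    public class SnapListRetryDemo {
        static final int ERANGE = 34; // errno: result too large

        // Stand-in for the JNA call to librbd's rbd_snap_list(): returns
        // the snapshot count on success, or -ERANGE when the buffer is too
        // small, writing the required size back through needed[0].
        static int fakeSnapList(String[] snapsOnDisk, String[] buf, int[] needed) {
            if (buf.length < snapsOnDisk.length) {
                needed[0] = snapsOnDisk.length;
                return -ERANGE;
            }
            System.arraycopy(snapsOnDisk, 0, buf, 0, snapsOnDisk.length);
            return snapsOnDisk.length;
        }

        // The grow-and-retry pattern: never trust the initial guess of 16.
        static List<String> snapList(String[] snapsOnDisk) {
            int size = 16; // initial guess only, not a hard limit
            while (true) {
                String[] buf = new String[size];
                int[] needed = { size };
                int ret = fakeSnapList(snapsOnDisk, buf, needed);
                if (ret >= 0)
                    return new ArrayList<>(Arrays.asList(buf).subList(0, ret));
                if (ret == -ERANGE) {
                    size = needed[0]; // retry with the size librbd reported
                    continue;
                }
                throw new RuntimeException("rbd_snap_list failed: errno " + -ret);
            }
        }

        public static void main(String[] args) {
            String[] snaps = new String[50];
            for (int i = 0; i < 50; i++) snaps[i] = "snap-" + i;
            System.out.println(snapList(snaps).size() + " snapshots listed");
        }
    }

With the retry in place, listing 50 (or 256) snapshots simply takes one extra native call instead of corrupting the array size.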
>>>>> Log from the Agent side:
>>>>>
>>>>> 2015-09-10 02:32:35,312 DEBUG [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-4:null) Trying to fetch storage pool 33ebaf83-5d09-3038-b63b-742e759a992e from libvirt
>>>>> 2015-09-10 02:32:35,431 INFO [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-4:null) Attempting to remove volume 4c6a2092-056c-4446-a2ca-d6bba9f7f7f8 from pool 33ebaf83-5d09-3038-b63b-742e759a992e
>>>>> 2015-09-10 02:32:35,431 INFO [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-4:null) Unprotecting and Removing RBD snapshots of image cloudstack-storage/4c6a2092-056c-4446-a2ca-d6bba9f7f7f8 prior to removing the image
>>>>> 2015-09-10 02:32:35,436 DEBUG [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-4:null) Succesfully connected to Ceph cluster at 10.10.1.26:6789
>>>>> 2015-09-10 02:32:35,454 DEBUG [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-4:null) Fetching list of snapshots of RBD image cloudstack-storage/4c6a2092-056c-4446-a2ca-d6bba9f7f7f8
>>>>> 2015-09-10 02:32:35,457 WARN [cloud.agent.Agent] (agentRequest-Handler-4:null) Caught: java.lang.NegativeArraySizeException
>>>>>     at com.ceph.rbd.RbdImage.snapList(Unknown Source)
>>>>>     at com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.deletePhysicalDisk(LibvirtStorageAdaptor.java:854)
>>>>>     at com.cloud.hypervisor.kvm.storage.LibvirtStoragePool.deletePhysicalDisk(LibvirtStoragePool.java:175)
>>>>>     at com.cloud.hypervisor.kvm.storage.KVMStorageProcessor.deleteVolume(KVMStorageProcessor.java:1206)
>>>>> 2015-09-10 02:32:35,458 DEBUG [cloud.agent.Agent] (agentRequest-Handler-4:null) Seq 1-1743737480722513946: { Ans: , MgmtId: 90520739779588, via: 1, Ver: v1, Flags: 10, [{"com.cloud.agent.api.Answer":{"result":false,"details":"java.lang.NegativeArraySizeException\n\tat com.ceph.rbd.RbdImage.snapList(Unknown Source)\n\tat com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.deletePhysicalDisk(LibvirtStorageAdaptor.java:854)\n\tat com.cloud.hypervisor.kvm.storage.LibvirtStoragePool.deletePhysicalDisk(LibvirtStoragePool.java:175)\n\tat com.cloud.hypervisor.kvm.storage.KVMStorageProcessor.deleteVolume(KVMStorageProcessor.java:1206)\n\tat com.cloud.storage.resource.StorageSubsystemCommandHandlerBase.execute(StorageSubsystemCommandHandlerBase.java:124)\n\tat com.cloud.storage.re.....
>>>>>
>>>>> So this volume and its snapshots will never be removed.
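The stack trace above also shows why the cleanup never completes: the code around line 854 anticipates RbdException, but NegativeArraySizeException is unchecked and escapes the method, so the delete is abandoned every garbage-collector cycle. A defensive variant would report unexpected runtime failures from the bindings instead of letting them escape. This is a sketch under that assumption, not the actual 4.5.1 code; RbdImage and RbdSnapInfo are the rados-java types LibvirtStorageAdaptor already uses:

    import com.ceph.rbd.RbdException;
    import com.ceph.rbd.RbdImage;
    import com.ceph.rbd.jna.RbdSnapInfo;
    import java.util.List;

    // Sketch: unprotect and remove every snapshot of an image, converting
    // unexpected runtime errors from the bindings into a reported failure.
    public class SnapshotCleanupSketch {
        public static void removeAllSnapshots(RbdImage image) throws RbdException {
            List<RbdSnapInfo> snaps;
            try {
                snaps = image.snapList();
            } catch (RuntimeException e) {
                // e.g. the NegativeArraySizeException from the broken
                // snapList(); fail the command instead of leaking it.
                throw new RbdException("Unable to list RBD snapshots: " + e);
            }
            for (RbdSnapInfo snap : snaps) {
                if (image.snapIsProtected(snap.name)) {
                    image.snapUnprotect(snap.name);
                }
                image.snapRemove(snap.name);
            }
        }
    }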
>>>>>
>>>>> 2. Second bug. Experimentally it was found that 50 minutes after the
>>>>> exception on the Agent, for some unknown reason the Management server
>>>>> decided it had lost the connection to this agent, started the HA
>>>>> process, and started the Agent process again.
>>>>> Log on the Agent side:
>>>>> 2015-09-10 02:57:12,664 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:null) Executing: /bin/bash -c free|grep Mem:|awk '{print $2}'
>>>>> 2015-09-10 02:57:12,667 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:null) Execution is successful.
>>>>> 2015-09-10 02:57:40,502 DEBUG [kvm.resource.LibvirtComputingResource] (UgentTask-5:null) Executing: /usr/share/cloudstack-common/scripts/vm/network/security_group.py get_rule_logs_for_vms
>>>>> 2015-09-10 02:57:40,572 DEBUG [kvm.resource.LibvirtComputingResource] (UgentTask-5:null) Execution is successful.
>>>>> 2015-09-10 02:57:54,135 INFO [cloud.agent.AgentShell] (main:null) Agent started
>>>>> 2015-09-10 02:57:54,136 INFO [cloud.agent.AgentShell] (main:null) Implementation Version is 4.5.1
>>>>> 2015-09-10 02:57:54,138 INFO [cloud.agent.AgentShell] (main:null) agent.properties found at /etc/cloudstack/agent/agent.properties
>>>>> .....
>>>>>
>>>>> Log on the Server side:
>>>>> 2015-09-10 02:57:53,710 INFO [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-2127ada4) Investigating why host 1 has disconnected with event AgentDisconnected
>>>>> 2015-09-10 02:57:53,714 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-2127ada4) checking if agent (1) is alive
>>>>> 2015-09-10 02:57:53,723 DEBUG [c.c.a.t.Request] (AgentTaskPool-1:ctx-2127ada4) Seq 1-1743737480722513988: Sending { Cmd , MgmtId: 90520739779588, via: 1(ix1-c7-2), Ver: v1, Flags: 100011, [{"com.cloud.agent.api.CheckHealthCommand":{"wait":50}}] }
>>>>> 2015-09-10 02:57:53,724 INFO [c.c.a.m.AgentAttache] (AgentTaskPool-1:ctx-2127ada4) Seq 1-1743737480722513988: Unable to send due to Resource [Host:1] is unreachable: Host 1: Channel is closed
>>>>> 2015-09-10 02:57:53,724 DEBUG [c.c.a.m.AgentAttache] (AgentTaskPool-1:ctx-2127ada4) Seq 1-1743737480722513988: Cancelling.
>>>>> 2015-09-10 02:57:53,724 WARN [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-2127ada4) Resource [Host:1] is unreachable: Host 1: Channel is closed
>>>>> 2015-09-10 02:57:53,728 DEBUG [c.c.h.HighAvailabilityManagerImpl] (AgentTaskPool-1:ctx-2127ada4) SimpleInvestigator unable to determine the state of the host. Moving on.
>>>>>
>>>>> It looks like a connection problem, but it appears only when we have
>>>>> this RBD exception on the agent side, and only with this node. I tried
>>>>> playing with the "storage.cleanup.interval" parameter and set it to 5
>>>>> minutes; now we get the exception on the agent side every 5 minutes,
>>>>> but the disconnects still happen every 50 minutes and I can't find out
>>>>> why.
>>>>>
>>>>> On 09/10/2015 03:21 PM, Andrija Panic wrote:
>>>>>
>>>>>> Thx Wido,
>>>>>>
>>>>>> I will have my colleagues Igor and Dmytro join with details on this.
>>>>>>
>>>>>> I agree we need a fix upstream, that is the main purpose from our side!
>>>>>>
>>>>>> With this temp fix we just avoid the agent crashing (the agent somehow
>>>>>> restarts again fine :) ), but VMs also go down on that host, at least
>>>>>> some of them.
>>>>>>
>>>>>> Do you see any lifecycle/workflow issue if we implement deleting the
>>>>>> snapshot from CEPH after you snapshot a volume in ACS and successfully
>>>>>> move it to Secondary NFS - or perhaps only delete the snapshot from
>>>>>> CEPH as part of the actual snapshot deletion (when you, manually or
>>>>>> via scheduled snapshots, delete a snapshot from the DB and NFS)? Maybe
>>>>>> the second option is better; I don't know how you guys handle this for
>>>>>> regular NFS as primary storage etc...
>>>>>>
>>>>>> Any guidance is most welcome, and our team will try to code all this.
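Either variant of the workflow Andrija describes is expressible with rados-java alone; no external script is needed. Here is a sketch of a complete image removal that first unprotects and deletes every snapshot. It assumes cephx authentication with the "admin" user; the monitor address is taken from the logs above, and the key is a placeholder:

    import com.ceph.rados.IoCTX;
    import com.ceph.rados.Rados;
    import com.ceph.rbd.Rbd;
    import com.ceph.rbd.RbdImage;
    import com.ceph.rbd.jna.RbdSnapInfo;
    import java.util.List;

    public class RbdCleanup {
        // Remove every snapshot of an RBD image, then the image itself.
        public static void deleteImageWithSnapshots(String pool, String imageName)
                throws Exception {
            Rados rados = new Rados("admin");
            rados.confSet("mon_host", "10.10.1.26:6789"); // from the log above
            rados.confSet("key", "<cephx key>");          // placeholder
            rados.connect();
            IoCTX io = rados.ioCtxCreate(pool);
            try {
                Rbd rbd = new Rbd(io);
                RbdImage image = rbd.open(imageName);
                try {
                    List<RbdSnapInfo> snaps = image.snapList();
                    for (RbdSnapInfo snap : snaps) {
                        if (image.snapIsProtected(snap.name)) {
                            image.snapUnprotect(snap.name); // 'gold' snapshots
                        }
                        image.snapRemove(snap.name);
                    }
                } finally {
                    rbd.close(image);
                }
                rbd.remove(imageName); // image must have no snapshots left
            } finally {
                rados.ioCtxDestroy(io);
            }
        }
    }

Note that rbd.remove() fails while snapshots remain, which is exactly why the snapshot cleanup has to precede the volume deletion in either workflow.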
>>>>>>
>>>>>> Thx Wido again
>>>>>>
>>>>>> On 10 September 2015 at 14:14, Wido den Hollander <w...@widodh.nl> wrote:
>>>>>>
>>>>>> On 10-09-15 14:07, Andrija Panic wrote:
>>>>>> > Wido,
>>>>>> >
>>>>>> > the part of the code where you want to delete some volume checks if
>>>>>> > the volume is of type RBD - and then tries to list snapshots, delete
>>>>>> > snapshots, and finally remove the image. Here the first step -
>>>>>> > listing snapshots - fails if there are more than 16 snapshots
>>>>>> > present - the number 16 is hardcoded elsewhere in the code - and
>>>>>> > throws an RBD exception... then the agent crashes... and then VMs
>>>>>> > go down etc.
>>>>>> >
>>>>>>
>>>>>> Hmmm, that seems like a bug in rados-java indeed. I don't know if
>>>>>> there is a release of rados-java where this is fixed.
>>>>>>
>>>>>> Looking at the code of rados-java it should be, but I'm not 100%
>>>>>> certain.
>>>>>>
>>>>>> > So our current quick fix is to invoke an external script which will
>>>>>> > also list and remove all snapshots, but will not fail.
>>>>>> >
>>>>>>
>>>>>> Yes, but we should fix it upstream. I understand that you will use a
>>>>>> temp script to clean up everything.
>>>>>>
>>>>>> > I'm not sure why 16 is the hardcoded limit - I will try to provide
>>>>>> > the part of the code where this is present... we can increase this
>>>>>> > number (from 16 to e.g. 200), but it doesn't make any sense, since
>>>>>> > we still have a lot of garbage left on CEPH (snapshots that were
>>>>>> > removed in ACS (DB and Secondary NFS) but not removed from CEPH).
>>>>>> > And in my understanding this needs to be implemented, so we don't
>>>>>> > catch any of the exceptions that I originally described...
>>>>>> >
>>>>>> > Any thoughts on this?
>>>>>> >
>>>>>>
>>>>>> A cleanup script for now should be OK indeed. Afterwards the Java
>>>>>> code should be able to do this.
>>>>>>
>>>>>> You can try manually by using rados-java and fix that.
>>>>>>
>>>>>> This is the part where the listing is done:
>>>>>>
>>>>>> https://github.com/ceph/rados-java/blob/master/src/main/java/com/ceph/rbd/RbdImage.java
>>>>>>
>>>>>> Wido
>>>>>>
>>>>>> > Thx for input!
>>>>>> >
>>>>>> > On 10 September 2015 at 13:56, Wido den Hollander <w...@widodh.nl> wrote:
>>>>>> >
>>>>>> >>
>>>>>> >> On 10-09-15 12:17, Andrija Panic wrote:
>>>>>> >>> We are testing some [dirty?] patch on our dev system and we shall
>>>>>> >>> soon share it for review.
>>>>>> >>>
>>>>>> >>> Basically, we are using an external python script that is invoked
>>>>>> >>> in some part of the code execution to delete the needed CEPH
>>>>>> >>> snapshots, and then it proceeds with the volume deletion etc...
>>>>>> >>>
>>>>>> >>
>>>>>> >> That shouldn't be required. The Java bindings for librbd and
>>>>>> >> librados should be able to remove the snapshots.
>>>>>> >>
>>>>>> >> There is no need to invoke external code; this can all be handled
>>>>>> >> in Java.
>>>>>> >>
>>>>>> >>> On 10 September 2015 at 11:26, Andrija Panic
>>>>>> >>> <andrija.pa...@gmail.com> wrote:
>>>>>> >>>
>>>>>> >>>> Eh, OK. Thx for the info.
>>>>>> >>>>
>>>>>> >>>> BTW, why is the 16-snapshot limit hardcoded - any reason for that?
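On the "why 16" question: the thread indicates the bindings sized their initial native buffer for 16 entries, and the real defect was in the error path rather than in the limit itself. A hypothetical, deliberately minimal illustration (not the actual rados-java source) of how a mishandled -ERANGE becomes the exception from the agent log:

    // Hypothetical illustration, not the rados-java source: librbd returns
    // a negative errno (-ERANGE, i.e. -34) once more snapshots exist than
    // the assumed 16. Using that return value as an array length is the bug.
    public class NegativeSizeDemo {
        public static void main(String[] args) {
            int ret = -34; // what rbd_snap_list reports with >16 snapshots
            String[] snaps = new String[ret]; // java.lang.NegativeArraySizeException
        }
    }

So raising the constant to 200 would only move the cliff; the retry loop shown earlier removes it.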
>>>>>> >>>>
>>>>>> >>>> Not cleaning snapshots on CEPH and trying to delete a volume
>>>>>> >>>> after having more than 16 snapshots in CEPH = the Agent crashing
>>>>>> >>>> on the KVM side... and some VMs being rebooted etc. - which
>>>>>> >>>> means downtime :|
>>>>>> >>>>
>>>>>> >>>> Thanks,
>>>>>> >>>>
>>>>>> >>>> On 9 September 2015 at 22:05, Simon Weller <swel...@ena.com> wrote:
>>>>>> >>>>
>>>>>> >>>>> Andrija,
>>>>>> >>>>>
>>>>>> >>>>> The Ceph snapshot deletion is not currently implemented.
>>>>>> >>>>>
>>>>>> >>>>> See: https://issues.apache.org/jira/browse/CLOUDSTACK-8302
>>>>>> >>>>>
>>>>>> >>>>> - Si
>>>>>> >>>>>
>>>>>> >>>>> ________________________________________
>>>>>> >>>>> From: Andrija Panic <andrija.pa...@gmail.com>
>>>>>> >>>>> Sent: Wednesday, September 9, 2015 3:03 PM
>>>>>> >>>>> To: dev@cloudstack.apache.org; us...@cloudstack.apache.org
>>>>>> >>>>> Subject: ACS 4.5 - volume snapshots NOT removed from CEPH (only
>>>>>> >>>>> from Secondary NFS and DB)
>>>>>> >>>>>
>>>>>> >>>>> Hi folks,
>>>>>> >>>>>
>>>>>> >>>>> we encountered an issue in ACS 4.5.1 (perhaps other versions are
>>>>>> >>>>> also affected) - when we delete some snapshot (volume snapshot)
>>>>>> >>>>> in ACS, ACS marks it as deleted in the DB and deletes it from
>>>>>> >>>>> NFS Secondary Storage, but fails to delete the snapshot on CEPH
>>>>>> >>>>> primary storage (it doesn't even try to delete it, AFAIK).
>>>>>> >>>>>
>>>>>> >>>>> So we end up having 5 live snapshots in the DB (just an example)
>>>>>> >>>>> but actually in CEPH there are more than i.e. 16 snapshots.
>>>>>> >>>>>
>>>>>> >>>>> Worse, when the ACS agent tries to obtain the list of snapshots
>>>>>> >>>>> from CEPH for some volume - if the number of snapshots is over
>>>>>> >>>>> 16, it raises an exception (and perhaps this is the reason the
>>>>>> >>>>> Agent crashed for us - I need to check with my colleagues who
>>>>>> >>>>> are investigating this in detail). This number 16 is for
>>>>>> >>>>> whatever reason hardcoded in the ACS code.
>>>>>> >>>>>
>>>>>> >>>>> Wondering if anyone has experienced this, or has any info - we
>>>>>> >>>>> plan to try to fix this, and I will include my dev colleagues
>>>>>> >>>>> here, but we might need some help, at least for guidance.
>>>>>> >>>>>
>>>>>> >>>>> Any help is really appreciated, or at least confirmation that
>>>>>> >>>>> this is a known issue etc.
>>>>>> >>>>>
>>>>>> >>>>> Thanks,
>>>>>> >>>>>
>>>>>> >>>>> --
>>>>>> >>>>>
>>>>>> >>>>> Andrija Panić
>>>>>> >>>>
>>>>>> >>>> --
>>>>>> >>>>
>>>>>> >>>> Andrija Panić
>>>>>>
>>>>>> --
>>>>>>
>>>>>> Andrija Panić
>>>>>
>>>>> --
>>>>> ---
>>>>> Best regards
>>>>> Dmytro Shevchenko
>>>>> dshevchenko.m...@gmail.com
>>>>> skype: demonsh_mk
>>>>> +380(66)2426648