> On Jul 8, 2016, at 2:53 AM, Wido den Hollander <w...@widodh.nl> wrote:
>
>> Op 7 juli 2016 om 6:35 schreef Aaron Hurt <ah...@ena.com>:
>>
>>> On Jul 6, 2016, at 5:14 PM, Wido den Hollander <w...@widodh.nl> wrote:
>>>
>>>> Op 6 juli 2016 om 16:18 schreef Aaron Hurt <ah...@ena.com>:
>>>>
>>>>> On Jul 2, 2016, at 11:37 AM, Wido den Hollander <w...@widodh.nl> wrote:
>>>>>
>>>>>> Op 30 juni 2016 om 18:29 schreef Aaron Hurt <ah...@ena.com>:
>>>>>>
>>>>>> In preparation to roll a new platform built on 4.8 with a Ceph storage backend, we’ve been encountering segfaults that appear to be related to snapshot operations via rados-java (librbd) on the host agent. We’ve been able to isolate this to two possible places in the code.
>>>>>>
>>>>>> Lines ~866-875 in plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/LibvirtStorageAdaptor.java:
>>>>>>
>>>>>>     for (RbdSnapInfo snap : snaps) {
>>>>>>         if (image.snapIsProtected(snap.name)) {
>>>>>>             s_logger.debug("Unprotecting snapshot " + pool.getSourceDir() + "/" + uuid + "@" + snap.name);
>>>>>>             image.snapUnprotect(snap.name);
>>>>>>         } else {
>>>>>>             s_logger.debug("Snapshot " + pool.getSourceDir() + "/" + uuid + "@" + snap.name + " is not protected.");
>>>>>>         }
>>>>>>         s_logger.debug("Removing snapshot " + pool.getSourceDir() + "/" + uuid + "@" + snap.name);
>>>>>>         image.snapRemove(snap.name);
>>>>>>     }
>>>>>>
>>>>>> Should we be checking whether the unprotect actually succeeded before attempting to remove the snapshot? (A sketch of the kind of guard we have in mind follows the workaround below.)
>>>>>>
>>>>>> Code from PR #1230 (https://github.com/apache/cloudstack/pull/1230) duplicates some of this functionality, and there doesn’t seem to be anything preventing deletePhysicalDisk and the cleanup routine from running simultaneously.
>>>>>>
>>>>>> To reproduce (with Ceph/RBD primary storage):
>>>>>>
>>>>>> 1. Leave the global setting concurrent.snapshots.threshold.perhost at its default NULL value
>>>>>> 2. Set the global settings snapshot.poll.interval and storage.cleanup.interval to a low interval, e.g. 10 seconds
>>>>>> 3. Restart the management server
>>>>>> 4. Deploy several VMs from templates
>>>>>> 5. Destroy + expunge the VMs after they are running
>>>>>> 6. Observe segfaults in the host agent
>>>>>>
>>>>>> Workaround
>>>>>>
>>>>>> We’ve been able to eliminate the host-agent segfaults in our testing, even with the decreased poll intervals, simply by setting concurrent.snapshots.threshold.perhost to 1.
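>>>>>> For illustration, the guard we have in mind would look roughly like this (only a sketch built on the same rados-java calls as above; the re-check and skip-on-failure policy is our suggestion, not existing CloudStack code):
>>>>>>
>>>>>>     for (RbdSnapInfo snap : snaps) {
>>>>>>         String snapPath = pool.getSourceDir() + "/" + uuid + "@" + snap.name;
>>>>>>         try {
>>>>>>             if (image.snapIsProtected(snap.name)) {
>>>>>>                 s_logger.debug("Unprotecting snapshot " + snapPath);
>>>>>>                 image.snapUnprotect(snap.name);
>>>>>>                 // Re-check before removal: snapRemove on a snapshot that is
>>>>>>                 // still protected will fail.
>>>>>>                 if (image.snapIsProtected(snap.name)) {
>>>>>>                     s_logger.warn("Snapshot " + snapPath + " is still protected, skipping removal");
>>>>>>                     continue;
>>>>>>                 }
>>>>>>             }
>>>>>>             s_logger.debug("Removing snapshot " + snapPath);
>>>>>>             image.snapRemove(snap.name);
>>>>>>         } catch (RbdException e) {
>>>>>>             // Log and continue with the next snapshot rather than letting
>>>>>>             // the exception escape mid-loop.
>>>>>>             s_logger.error("Failed to clean up snapshot " + snapPath + ": " + e.getMessage());
>>>>>>         }
>>>>>>     }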
>>>>>>
>>>>>> Segfault Logs
>>>>>>
>>>>>> https://slack-files.com/T0RJECUV7-F1M39K4F5-f9c6b3986d
>>>>>> https://slack-files.com/T0RJECUV7-F1KCTRNNN-8d36665b56
>>>>>>
>>>>>> We would really appreciate any feedback and/or confirmation from the community around the above issues. I’d also be happy to provide any additional information needed to get this addressed.
>>>>>
>>>>> What seems to be happening is that it failed to unprotect the snapshot of the volume. This could have various reasons, for example if there is a child image of the snapshot. I don’t think that’s the case here, however.
>>>>>
>>>>> It could still be that it tries to remove the master/golden image of the template while it still has children attached to that snapshot.
>>>>>
>>>>> I’m not sure if this is due to rados-java or a bug in librados. The Java code should just throw an exception and not completely crash the JVM. This happens lower in the stack, not in Java.
>>>>>
>>>>> The assert shows this also happens while Java is talking to libvirt. I guess a librados bug, but I’m not completely sure.
>>>>>
>>>>> Wido
>>>>
>>>> We’re seeing this happen around other issues as well, and it does seem to be related to rados-java and the JNA wrappings around librbd. This is an exception that just occurred this morning while performing a load balancer update.
>>>>
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-3:) (logid:7b48049b) Execution is successful.
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:) (logid:4a2bd0ba) Executing: /usr/share/cloudstack-common/scripts/network/domr/router_proxy.sh checkrouter.sh 169.254.3.93
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.agent.Agent] (agentRequest-Handler-3:) (logid:7b48049b) Seq 1-5871286539207573659: { Ans: , MgmtId: 52239507206, via: 1, Ver: v1, Flags: 10, [{"com.cloud.agent.api.CheckRouterAnswer":{"state":"BACKUP","result":true,"details":"Status: BACKUP\n","wait":0}}] }
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.agent.Agent] (agentRequest-Handler-1:) (logid:aec84669) Request:Seq 1-5871286539207573662: { Cmd , MgmtId: 52239507206, via: 1, Ver: v1, Flags: 100001, [{"com.cloud.agent.api.routing.LoadBalancerConfigCommand":{"loadBalancers":[{"uuid":"733276c0-df18-4110-a4ed-5efc1f523eb2","srcIp":"10.107.0.12","srcPort":80,"protocol":"tcp","algorithm":"roundrobin","revoked":false,"alreadyAdded":false,"inline":false,"destinations":[]},{"uuid":"1dcba169-922c-45d2-9f85-33d63ef6f0e7","srcIp":"10.107.0.11","srcPort":300,"protocol":"tcp","algorithm":"roundrobin","revoked":false,"alreadyAdded":false,"inline":false,"destinations":[{"destIp":"10.90.0.191","destPort":400,"revoked":false,"alreadyAdded":false}]},{"uuid":"580dd6d7-b12a-4e14-93d8-a6f87dd75763","srcIp":"10.107.0.13","srcPort":5000,"protocol":"tcp","algorithm":"roundrobin","revoked":false,"alreadyAdded":false,"inline":false,"destinations":[{"destIp":"10.90.0.35","destPort":80,"revoked":false,"alreadyAdded":false},{"destIp":"10.90.0.36","destPort":80,"revoked":false,"alreadyAdded":false}]},{"uuid":"6b8f4872-1d05-4942-b715-3b0bf92e9d20","srcIp":"10.107.0.19","srcPort":111,"protocol":"tcp","algorithm":"roundrobin","revoked":true,"alreadyAdded":false,"inline":false,"destinations":[]}],"lbStatsVisibility":"global","lbStatsPublicIP":"10.107.0.6","lbStatsPrivateIP":"169.254.0.11","lbStatsGuestIP":"10.90.0.14","lbStatsPort":"8081","lbStatsSrcCidrs":"0/0","lbStatsAuth":"admin1:AdMiN123","lbStatsUri":"/admin?stats","maxconn":"4096","keepAliveEnabled":false,"nic":{"deviceId":3,"networkRateMbps":200,"defaultNic":false,"pxeDisable":true,"nicUuid":"503eca28-76fb-4e0a-aaf1-66bb63fae4b5","uuid":"e0b77f27-83b3-4ce4-b81d-05b0c559b395","ip":"10.90.0.14","netmask":"255.255.255.0","gateway":"10.90.0.1","mac":"02:00:42:58:00:04","broadcastType":"Vxlan","type":"Guest","broadcastUri":"vxlan://6054","isolationUri":"vxlan://6054","isSecurityGroupEnabled":false,"name":"bond0.109"},"vpcId":10,"accessDetails":{"router.guest.ip":"10.90.0.14","zone.network.type":"Advanced","router.ip":"169.254
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: .0.11","router.name":"r-167-QA"},"wait":0}}] }
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.agent.Agent] (agentRequest-Handler-1:) (logid:aec84669) Processing command: com.cloud.agent.api.routing.LoadBalancerConfigCommand
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [resource.virtualnetwork.VirtualRoutingResource] (agentRequest-Handler-1:) (logid:aec84669) Transforming com.cloud.agent.api.routing.LoadBalancerConfigCommand to ConfigItems
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section: global
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section: log 127.0.0.1:3914 local0 warning
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section: maxconn 4096
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section: maxpipes 1024
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section: chroot /var/lib/haproxy
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section: user haproxy
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section: group haproxy
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section: daemon
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section: defaults
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section: log global
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section: mode tcp
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section: option dontlognull
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section: retries 3
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section: option redispatch
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section: option forwardfor
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section: option forceclose
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section: timeout connect 5000
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section: timeout client 50000
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section: timeout server 50000
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: INFO [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) Haproxy mode http enabled
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) Haproxystats rule:
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: listen stats_on_public 10.107.0.6:8081
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: mode http
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: option httpclose
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: stats enable
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: stats uri /admin?stats
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: stats realm Haproxy\ Statistics
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: stats auth admin1:AdMiN123
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:) (logid:4a2bd0ba) Execution is successful.
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.agent.Agent] (agentRequest-Handler-2:) (logid:4a2bd0ba) Seq 1-5871286539207573661: { Ans: , MgmtId: 52239507206, via: 1, Ver: v1, Flags: 10, [{"com.cloud.agent.api.CheckRouterAnswer":{"state":"BACKUP","result":true,"details":"Status: BACKUP\n","wait":0}}] }
>>>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: DEBUG [resource.virtualnetwork.VirtualRoutingResource] (agentRequest-Handler-1:) (logid:aec84669) Processing FileConfigItem, copying 1315 characters to load_balancer.json took 237ms
>>>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-1:) (logid:aec84669) Executing: /usr/share/cloudstack-common/scripts/network/domr/router_proxy.sh update_config.py 169.254.0.11 load_balancer.json
>>>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: ./log/SubsystemMap.h: In function 'bool ceph::log::SubsystemMap::should_gather(unsigned int, int)' thread 7f530dff3700 time 2016-07-06 08:59:09.143659
>>>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: ./log/SubsystemMap.h: 62: FAILED assert(sub < m_subsys.size())
>>>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
>>>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 1: (()+0x289da5) [0x7f534a94bda5]
>>>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 2: (()+0x50028) [0x7f534a712028]
>>>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 3: (()+0x81ff3) [0x7f534a743ff3]
>>>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 4: (()+0x4d8e8e) [0x7f534ab9ae8e]
>>>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 5: (()+0x4e617d) [0x7f534aba817d]
>>>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 6: (()+0x7dc5) [0x7f5455e78dc5]
>>>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 7: (clone()+0x6d) [0x7f5455793ced]
>>>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>>> Jul 06 08:59:14 njcloudhost.dev.ena.net sh[13457]: /bin/sh: line 1: 13461 Aborted (core dumped) /usr/lib/jvm/jre/bin/java -Xms256m -Xmx2048m -cp "$CLASSPATH" $JAVA_CLASS
>>>> Jul 06 08:59:14 njcloudhost.dev.ena.net systemd[1]: cloudstack-agent.service: main process exited, code=exited, status=134/n/a
>>>> Jul 06 08:59:14 njcloudhost.dev.ena.net systemd[1]: Unit cloudstack-agent.service entered failed state.
>>>> Jul 06 08:59:14 njcloudhost.dev.ena.net systemd[1]: cloudstack-agent.service failed.
>>>> Jul 06 08:59:24 njcloudhost.dev.ena.net systemd[1]: cloudstack-agent.service holdoff time over, scheduling restart.
>>>> Jul 06 08:59:24 njcloudhost.dev.ena.net systemd[1]: Started CloudStack Agent.
>>>> Jul 06 08:59:24 njcloudhost.dev.ena.net systemd[1]: Starting CloudStack Agent…
>>>>
>>>> Is there something obvious I’m missing here?
>>>
>>> I am confused. In this path you are not even touching Ceph, but still it crashes in librados.
>>>
>>> Searching, I found this issue in the Ceph tracker: http://tracker.ceph.com/issues/14314
>>>
>>> Isn’t there a package version mismatch in your Ceph cluster?
>>>
>>> Wido
>>
>> I agree it’s very confusing, and I’m running out of ideas as to what the cause may be.
>>
>> Here are the package versions on all our related boxes in the lab:
>>
>> http://pastie.org/private/5t5p61ryqbaqm6mw07bw9g
>>
>> I’ve also collected the most recent instances of our two segfaults/aborts below.
>>
>> journalctl -u cloudstack-agent.service --no-pager | grep -B30 -A2 Aborted
>>
>> http://sprunge.us/LcYA
>>
>> journalctl -u cloudstack-agent.service --no-pager | grep -B5 -A20 com.ceph.rbd.RbdException
>>
>> http://sprunge.us/SiCf
>
> Looking at this I come to the conclusion that this is a librados bug, not rados-java nor CloudStack.
>
> The crashes are happening with exactly the same backtrace. I have a few clusters running Hammer 0.94.5 and they all clean up their snapshots just fine, no crashes.
>
>> I also went back to look at the two places in our tree where snapshot cleanup takes place.
>>
>> The place where the failed-to-unprotect exceptions seem to be triggered:
>> https://github.com/myENA/cloudstack/blob/release/ENA-4.8/plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/LibvirtStorageAdaptor.java#L840
>>
>> The cleanup code for RBD snapshots:
>> https://github.com/myENA/cloudstack/blob/679840ae674cc1c655c256e8047187fa3b157ce7/plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/KVMStorageProcessor.java#L1289
>>
>> I’ve rolled a small patch into our testing tree that moves the context and image cleanup into a ‘finally’ block, because I thought there might be a problem with those not being closed/freed when the unprotect threw an exception. So the code in LibvirtStorageAdaptor.java is slightly different from the mainline 4.8 branch. Here is the PR/diff in our fork that shows the changes:
>>
>> https://github.com/myENA/cloudstack/pull/11/files
>>
>> Does this make sense? Is this even possibly related to the issue I’m seeing?
>
> The patch looks sane, you catch more exceptions, but still, a hard crash of Java/librados like this can’t easily be triggered from Java. I truly think this is a librados bug.
>
> Wido
>
>> Once again, all the advice and attention is definitely appreciated.
>>
>> — Aaron
We finally solved the issue. The exceptions were being thrown because of a malformed/typoed ACL (“reb_children” instead of “rbd_children”) that caused librbd to raise an “Operation not permitted” error when attempting to unprotect the base snapshot. That exception, combined with a failure to clean up contexts when exceptions occurred inside the loop in deletePhysicalDisk, eventually caused the host agent to crash. We fixed the access list and issued a PR to ensure contexts are cleaned up even when exceptions occur. In addition to the context cleanup, the PR also adds logging to help troubleshoot similar issues if/when they happen again.

PR for the fix: https://github.com/apache/cloudstack/pull/1608

Thank you again for all the support.

— Aaron
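P.S. For anyone who runs into the same symptoms later: the bad entry was in the osd capabilities of the Ceph key our agents use. With the typo corrected, the caps follow the usual pattern from the Ceph documentation, along these lines (the pool name here is just an example, substitute your own):

    mon 'allow r'
    osd 'allow class-read object_prefix rbd_children, allow rwx pool=cloudstack'

And the shape of the context-cleanup change is roughly the following. This is a simplified sketch of the pattern only, not the exact diff, with illustrative variable names (see the PR for the real change):

    Rados r = new Rados(pool.getAuthUserName());
    IoCTX io = null;
    Rbd rbd = null;
    RbdImage image = null;
    try {
        r.confSet("mon_host", pool.getSourceHost() + ":" + pool.getSourcePort());
        r.confSet("key", pool.getAuthSecret());
        r.connect();
        io = r.ioCtxCreate(pool.getSourceDir());
        rbd = new Rbd(io);
        image = rbd.open(uuid);
        // ... unprotect and remove snapshots, then remove the image;
        // an RbdException thrown here previously skipped the cleanup entirely
    } catch (RadosException | RbdException e) {
        s_logger.error("Failed to delete image " + uuid + ": " + e.getMessage());
    } finally {
        // Always close the image and destroy the rados context, even when
        // the cleanup above throws.
        try {
            if (image != null && rbd != null) {
                rbd.close(image);
            }
            if (io != null) {
                r.ioCtxDestroy(io);
            }
        } catch (RbdException e) {
            s_logger.warn("Failed to close image " + uuid + ": " + e.getMessage());
        }
    }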