> On 7 July 2016 at 6:35, Aaron Hurt <ah...@ena.com> wrote:
>
> > On Jul 6, 2016, at 5:14 PM, Wido den Hollander <w...@widodh.nl> wrote:
> >
> >> On 6 July 2016 at 16:18, Aaron Hurt <ah...@ena.com> wrote:
> >>
> >>> On Jul 2, 2016, at 11:37 AM, Wido den Hollander <w...@widodh.nl> wrote:
> >>>
> >>>> On 30 June 2016 at 18:29, Aaron Hurt <ah...@ena.com> wrote:
> >>>>
> >>>> In preparation to roll out a new platform built on 4.8 with a Ceph storage
> >>>> backend, we've been encountering segfaults that appear to be related to
> >>>> snapshot operations via rados-java (librbd) on the host agent. We've
> >>>> been able to isolate this to two possible places in the code:
> >>>>
> >>>> lines ~866-875 in
> >>>> plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/LibvirtStorageAdaptor.java
> >>>>
> >>>>     for (RbdSnapInfo snap : snaps) {
> >>>>         if (image.snapIsProtected(snap.name)) {
> >>>>             s_logger.debug("Unprotecting snapshot " + pool.getSourceDir() + "/" + uuid + "@" + snap.name);
> >>>>             image.snapUnprotect(snap.name);
> >>>>         } else {
> >>>>             s_logger.debug("Snapshot " + pool.getSourceDir() + "/" + uuid + "@" + snap.name + " is not protected.");
> >>>>         }
> >>>>         s_logger.debug("Removing snapshot " + pool.getSourceDir() + "/" + uuid + "@" + snap.name);
> >>>>         image.snapRemove(snap.name);
> >>>>     }
> >>>>
> >>>> Should we be checking whether the unprotect actually failed or succeeded
> >>>> before attempting to remove the snapshot?
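[Editor's sketch of the check asked about above. The interface below is a hypothetical stand-in for the rados-java image handle, added only so the example is self-contained; the real RbdImage calls need a live cluster. The idea: only call snapRemove() when the snapshot either needed no unprotect or the unprotect did not throw.]

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal stand-in for the rbd snapshot calls used in the
// CloudStack loop (snapIsProtected / snapUnprotect / snapRemove).
interface SnapOps {
    boolean snapIsProtected(String name) throws Exception;
    void snapUnprotect(String name) throws Exception; // throws on failure
    void snapRemove(String name) throws Exception;
}

public class SnapshotCleanupSketch {
    // Returns the snapshots that were actually removed. A snapshot whose
    // unprotect throws is logged and skipped instead of being passed to
    // snapRemove() while still protected.
    static List<String> cleanupSnapshots(SnapOps image, List<String> snaps) {
        List<String> removed = new ArrayList<>();
        for (String snap : snaps) {
            try {
                if (image.snapIsProtected(snap)) {
                    image.snapUnprotect(snap); // may throw; abort removal if it does
                }
                image.snapRemove(snap);
                removed.add(snap);
            } catch (Exception e) {
                // Skip rather than attempt removal of a still-protected snapshot.
                System.err.println("Skipping snapshot " + snap + ": " + e.getMessage());
            }
        }
        return removed;
    }
}
```

This keeps one failing snapshot from aborting the whole loop, while never issuing a remove after a failed unprotect.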
> >>>>
> >>>> Code from PR #1230 (https://github.com/apache/cloudstack/pull/1230)
> >>>> duplicates some of this functionality, and there doesn't seem to be
> >>>> anything preventing deletePhysicalDisk and the cleanup routine from
> >>>> running simultaneously.
> >>>>
> >>>> To Reproduce (with ceph/rbd primary storage)
> >>>>
> >>>> 1. Set the global concurrent.snapshots.threshold.perhost to the default NULL value
> >>>> 2. Set the globals snapshot.poll.interval and storage.cleanup.interval to a low interval … 10 seconds
> >>>> 3. Restart the management server
> >>>> 4. Deploy several VMs from templates
> >>>> 5. Destroy+expunge the VMs after they are running
> >>>> 6. Observe segfaults in the management server
> >>>>
> >>>> Workaround
> >>>>
> >>>> We've been able to eliminate the segfaults of the host agent in our
> >>>> testing by simply setting concurrent.snapshots.threshold.perhost to 1,
> >>>> even with the decreased poll intervals.
> >>>>
> >>>> Segfault Logs
> >>>>
> >>>> https://slack-files.com/T0RJECUV7-F1M39K4F5-f9c6b3986d
> >>>>
> >>>> https://slack-files.com/T0RJECUV7-F1KCTRNNN-8d36665b56
> >>>>
> >>>> We would really appreciate any feedback and/or confirmation from the
> >>>> community around the above issues. I'd also be happy to provide any
> >>>> additional information needed to get this addressed.
> >>>
> >>> What seems to be happening is that it failed to unprotect the snapshot of
> >>> the volume. There could be various reasons for this, for example a child
> >>> image of the snapshot still existing. I don't think that's the case here,
> >>> however.
> >>>
> >>> It could still be that it tries to remove the master/golden image of the
> >>> template while that snapshot still has children attached.
> >>>
> >>> I'm not sure if this is due to rados-java or a bug in librados. The Java
> >>> code should just throw an exception and not completely crash the JVM.
> >>> This happens lower in the stack, not in Java.
> >>>
> >>> The assert shows this also happens when Java is talking to libvirt.
> >>> I guess it's a librados bug, but I'm not completely sure.
> >>>
> >>> Wido
> >>
> >> We're seeing this happen around other issues, and it does seem to be
> >> related to rados-java and the JNA wrappings around librbd. This is an
> >> exception that just occurred this morning when performing a load balancer
> >> update.
> >>
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-3:) (logid:7b48049b) Execution is successful.
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:) (logid:4a2bd0ba) Executing: /usr/share/cloudstack-common/scripts/network/domr/router_proxy.sh checkrouter.sh 169.254.3.93
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.agent.Agent] (agentRequest-Handler-3:) (logid:7b48049b) Seq 1-5871286539207573659: { Ans: , MgmtId: 52239507206, via: 1, Ver: v1, Flags: 10, [{"com.cloud.agent.api.CheckRouterAnswer":{"state":"BACKUP","result":true,"details":"Status: BACKUP\n","wait":0}}] }
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.agent.Agent] (agentRequest-Handler-1:) (logid:aec84669) Request:Seq 1-5871286539207573662: { Cmd , MgmtId: 52239507206, via: 1, Ver: v1, Flags: 100001,
[{"com.cloud.agent.api.routing.LoadBalancerConfigCommand":{"loadBalancers":[{"uuid":"733276c0-df18-4110-a4ed-5efc1f523eb2","srcIp":"10.107.0.12","srcPort":80,"protocol":"tcp","algorithm":"roundrobin","revoked":false,"alreadyAdded":false,"inline":false,"destinations":[]},{"uuid":"1dcba169-922c-45d2-9f85-33d63ef6f0e7","srcIp":"10.107.0.11","srcPort":300,"protocol":"tcp","algorithm":"roundrobin","revoked":false,"alreadyAdded":false,"inline":false,"destinations":[{"destIp":"10.90.0.191","destPort":400,"revoked":false,"alreadyAdded":false}]},{"uuid":"580dd6d7-b12a-4e14-93d8-a6f87dd75763","srcIp":"10.107.0.13","srcPort":5000,"protocol":"tcp","algorithm":"roundrobin","revoked":false,"alreadyAdded":false,"inline":false,"destinations":[{"destIp":"10.90.0.35","destPort":80,"revoked":false,"alreadyAdded":false},{"destIp":"10.90.0.36","destPort":80,"revoked":false,"alreadyAdded":false}]},{"uuid":"6b8f4872-1d05-4942-b715-3b0bf92e9d20","srcIp":"10.107.0.19","srcPort":111,"protocol":"tcp","algorithm":"roundrobin","revoked":true,"alreadyAdded":false,"inline":false,"destinations":[]}],"lbStatsVisibility":"global","lbStatsPublicIP":"10.107.0.6","lbStatsPrivateIP":"169.254.0.11","lbStatsGuestIP":"10.90.0.14","lbStatsPort":"8081","lbStatsSrcCidrs":"0/0","lbStatsAuth":"admin1:AdMiN123","lbStatsUri":"/admin?stats","maxconn":"4096","keepAliveEnabled":false,"nic":{"deviceId":3,"networkRateMbps":200,"defaultNic":false,"pxeDisable":true,"nicUuid":"503eca28-76fb-4e0a-aaf1-66bb63fae4b5","uuid":"e0b77f27-83b3-4ce4-b81d-05b0c559b395","ip":"10.90.0.14","netmask":"255.255.255.0","gateway":"10.90.0.1","mac":"02:00:42:58:00:04","broadcastType":"Vxlan","type":"Guest","broadcastUri":"vxlan://6054","isolationUri":"vxlan://6054","isSecurityGroupEnabled":false,"name":"bond0.109"},"vpcId":10,"accessDetails":{"router.guest.ip":"10.90.0.14","zone.network.type":"Advanced","router.ip":"169.254 > >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: > >> .0.11","router.name":"r-167-QA"},"wait":0}}] } > >> Jul 
06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG > >> [cloud.agent.Agent] (agentRequest-Handler-1:) (logid:aec84669) Processing > >> command: com.cloud.agent.api.routing.LoadBalancerConfigCommand > >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG > >> [resource.virtualnetwork.VirtualRoutingResource] (agentRequest-Handler-1:) > >> (logid:aec84669) Transforming > >> com.cloud.agent.api.routing.LoadBalancerConfigCommand to ConfigItems > >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG > >> [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) > >> (logid:aec84669) global section: global > >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG > >> [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) > >> (logid:aec84669) global section: log 127.0.0.1:3914 local0 > >> warning > >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG > >> [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) > >> (logid:aec84669) global section: maxconn 4096 > >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG > >> [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) > >> (logid:aec84669) global section: maxpipes 1024 > >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG > >> [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) > >> (logid:aec84669) global section: chroot /var/lib/haproxy > >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG > >> [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) > >> (logid:aec84669) global section: user haproxy > >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG > >> [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) > >> (logid:aec84669) global section: group haproxy > >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG > >> [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) > >> (logid:aec84669) global section: daemon > >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG > >> 
[cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) > >> (logid:aec84669) default section: defaults > >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG > >> [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) > >> (logid:aec84669) default section: log global > >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG > >> [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) > >> (logid:aec84669) default section: mode tcp > >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG > >> [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) > >> (logid:aec84669) default section: option dontlognull > >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG > >> [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) > >> (logid:aec84669) default section: retries 3 > >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG > >> [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) > >> (logid:aec84669) default section: option redispatch > >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG > >> [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) > >> (logid:aec84669) default section: option forwardfor > >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG > >> [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) > >> (logid:aec84669) default section: option forceclose > >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG > >> [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) > >> (logid:aec84669) default section: timeout connect 5000 > >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG > >> [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) > >> (logid:aec84669) default section: timeout client 50000 > >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG > >> [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) > >> (logid:aec84669) default section: timeout server 50000 > >> Jul 06 08:59:08 
njcloudhost.dev.ena.net sh[13457]: INFO > >> [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) > >> (logid:aec84669) Haproxy mode http enabled > >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG > >> [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) > >> (logid:aec84669) Haproxystats rule: > >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: listen stats_on_public > >> 10.107.0.6:8081 > >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: mode http > >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: option httpclose > >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: stats enable > >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: stats uri > >> /admin?stats > >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: stats realm Haproxy\ > >> Statistics > >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: stats auth > >> admin1:AdMiN123 > >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG > >> [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:) > >> (logid:4a2bd0ba) Execution is successful. 
> >> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG > >> [cloud.agent.Agent] (agentRequest-Handler-2:) (logid:4a2bd0ba) Seq > >> 1-5871286539207573661: { Ans: , MgmtId: 52239507206, via: 1, Ver: v1, > >> Flags: 10, > >> [{"com.cloud.agent.api.CheckRouterAnswer":{"state":"BACKUP","result":true,"details":"Status: > >> BACKUP\n","wait":0}}] } > >> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: DEBUG > >> [resource.virtualnetwork.VirtualRoutingResource] (agentRequest-Handler-1:) > >> (logid:aec84669) Processing FileConfigItem, copying 1315 characters to > >> load_balancer.json took 237ms > >> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: DEBUG > >> [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-1:) > >> (logid:aec84669) Executing: > >> /usr/share/cloudstack-common/scripts/network/domr/router_proxy.sh > >> update_config.py 169.254.0.11 load_balancer.json > >> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: ./log/SubsystemMap.h: > >> In function 'bool ceph::log::SubsystemMap::should_gather(unsigned int, > >> int)' thread 7f530dff3700 time 2016-07-06 08:59:09.143659 > >> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: ./log/SubsystemMap.h: > >> 62: FAILED assert(sub < m_subsys.size()) > >> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: ceph version 10.2.2 > >> (45107e21c568dd033c2f0a3107dec8f0b0e58374) > >> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 1: (()+0x289da5) > >> [0x7f534a94bda5] > >> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 2: (()+0x50028) > >> [0x7f534a712028] > >> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 3: (()+0x81ff3) > >> [0x7f534a743ff3] > >> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 4: (()+0x4d8e8e) > >> [0x7f534ab9ae8e] > >> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 5: (()+0x4e617d) > >> [0x7f534aba817d] > >> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 6: (()+0x7dc5) > >> [0x7f5455e78dc5] > >> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 7: 
(clone()+0x6d) [0x7f5455793ced]
> >> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> >> Jul 06 08:59:14 njcloudhost.dev.ena.net sh[13457]: /bin/sh: line 1: 13461 Aborted (core dumped) /usr/lib/jvm/jre/bin/java -Xms256m -Xmx2048m -cp "$CLASSPATH" $JAVA_CLASS
> >> Jul 06 08:59:14 njcloudhost.dev.ena.net systemd[1]: cloudstack-agent.service: main process exited, code=exited, status=134/n/a
> >> Jul 06 08:59:14 njcloudhost.dev.ena.net systemd[1]: Unit cloudstack-agent.service entered failed state.
> >> Jul 06 08:59:14 njcloudhost.dev.ena.net systemd[1]: cloudstack-agent.service failed.
> >> Jul 06 08:59:24 njcloudhost.dev.ena.net systemd[1]: cloudstack-agent.service holdoff time over, scheduling restart.
> >> Jul 06 08:59:24 njcloudhost.dev.ena.net systemd[1]: Started CloudStack Agent.
> >> Jul 06 08:59:24 njcloudhost.dev.ena.net systemd[1]: Starting CloudStack Agent…
> >>
> >> Is there something obvious I'm missing here?
> >>
> >
> > I am confused. In this path you are not even touching Ceph, but it still crashes in librados.
> >
> > Searching, I found this issue in the Ceph tracker:
> > http://tracker.ceph.com/issues/14314
> >
> > Isn't there a package version mismatch in your Ceph cluster?
> >
> > Wido
>
> I agree it's very confusing, and I'm running out of ideas as to what the cause may be.
>
> Here are the package versions on all our related boxes in the lab.
>
> http://pastie.org/private/5t5p61ryqbaqm6mw07bw9g
>
> I've also collected the most recent instances of our two segfaults/aborts below.
>
> journalctl -u cloudstack-agent.service --no-pager | grep -B30 -A2 Aborted
>
> http://sprunge.us/LcYA
>
> journalctl -u cloudstack-agent.service --no-pager | grep -B5 -A20 com.ceph.rbd.RbdException
>
> http://sprunge.us/SiCf
>
Looking at this, I come to the conclusion that this is a librados bug, not one in rados-java or CloudStack. The crashes are happening with exactly the same backtrace. I have a few clusters running with Hammer 0.94.5 and they all clean up their snapshots just fine, with no crashes.

> I also went back to look at the two places in our tree where snapshot
> cleanup takes place:
>
> The place where the "failed to unprotect" exceptions seem to be triggered:
> https://github.com/myENA/cloudstack/blob/release/ENA-4.8/plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/LibvirtStorageAdaptor.java#L840
>
> The cleanup code for rbd snapshots:
> https://github.com/myENA/cloudstack/blob/679840ae674cc1c655c256e8047187fa3b157ce7/plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/KVMStorageProcessor.java#L1289
>
> I've rolled a small patch into our testing tree that moves the context and
> image cleanup into a 'finally' block, because I thought there might be a
> problem with those not being closed/freed when the unprotect threw an
> exception. So the code in LibvirtStorageAdaptor.java is slightly different
> from the mainline 4.8 branch. Here is the PR/diff in our fork that shows
> the changes.
>
> https://github.com/myENA/cloudstack/pull/11/files
>
> Does this make sense? Is this even possibly related to the issue I'm seeing?

The patch looks sane; you can catch more exceptions, but still, such a hard crash of Java/librados can't easily be triggered from Java. I truly think this is a librados bug.

Wido

> Once again, all the advice and attention is definitely appreciated.
>
> — Aaron
>
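[Editor's note: the pattern in the fork's patch amounts to releasing the image and context handles in a finally block, so they are freed even when the unprotect throws. A minimal self-contained sketch of that idea follows; the counter is a hypothetical stand-in for the native handles that the real code releases with calls like rbd.close(image) and rados.ioCtxDestroy(io).]

```java
import java.util.concurrent.atomic.AtomicInteger;

public class FinallyCleanupSketch {
    // Hypothetical stand-in for open native image/ioctx handles; in the
    // actual patch these are released via the rados-java close/destroy calls.
    static final AtomicInteger openHandles = new AtomicInteger();

    static void unprotectAndRemove(boolean unprotectThrows) {
        openHandles.incrementAndGet(); // "open" the image handle
        try {
            if (unprotectThrows) {
                // models image.snapUnprotect(...) failing
                throw new RuntimeException("failed to unprotect snapshot");
            }
            // image.snapRemove(...) would run here on the success path
        } finally {
            // Runs on both the success and the exception path, so the
            // handle is never leaked when the unprotect throws.
            openHandles.decrementAndGet();
        }
    }
}
```

Whether or not the leaked handles are what eventually trips the librados assert, the finally block at least guarantees the exception path cannot leave native resources open.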