> On Jul 8, 2016, at 2:53 AM, Wido den Hollander <w...@widodh.nl> wrote:
> 
> 
>> On 7 July 2016 at 6:35, Aaron Hurt <ah...@ena.com> wrote:
>> 
>> 
>> 
>>> On Jul 6, 2016, at 5:14 PM, Wido den Hollander <w...@widodh.nl> wrote:
>>> 
>>>> 
>>>> On 6 July 2016 at 16:18, Aaron Hurt <ah...@ena.com> wrote:
>>>> 
>>>> 
>>>> 
>>>>> On Jul 2, 2016, at 11:37 AM, Wido den Hollander <w...@widodh.nl> wrote:
>>>>> 
>>>>>> 
>>>>>> On 30 June 2016 at 18:29, Aaron Hurt <ah...@ena.com> wrote:
>>>>>> 
>>>>>> 
>>>>>> In preparation to roll a new platform built on 4.8 with a Ceph storage 
>>>>>> backend, we’ve been encountering segfaults that appear to be related to 
>>>>>> snapshot operations via rados-java (librbd) on the host agent.  We’ve 
>>>>>> been able to isolate this to two possible places in the code:
>>>>>> 
>>>>>> lines ~866-875 in 
>>>>>> plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/LibvirtStorageAdaptor.java
>>>>>> 
>>>>>>              for (RbdSnapInfo snap : snaps) {
>>>>>>                  if (image.snapIsProtected(snap.name)) {
>>>>>>                      s_logger.debug("Unprotecting snapshot " + 
>>>>>> pool.getSourceDir() + "/" + uuid + "@" + snap.name);
>>>>>>                      image.snapUnprotect(snap.name);
>>>>>>                  } else {
>>>>>>                      s_logger.debug("Snapshot " + pool.getSourceDir() + 
>>>>>> "/" + uuid + "@" + snap.name + " is not protected.");
>>>>>>                  }
>>>>>>                  s_logger.debug("Removing snapshot " + 
>>>>>> pool.getSourceDir() + "/" + uuid + "@" + snap.name);
>>>>>>                  image.snapRemove(snap.name);
>>>>>>              }
>>>>>> 
>>>>>> Should we be checking whether the unprotect actually succeeded before 
>>>>>> attempting to remove the snapshot?
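>>>>>> 
>>>>>> For illustration, a sketch of such a check (reusing the variables from 
>>>>>> the loop above; this is not the actual CloudStack code, just the shape 
>>>>>> we have in mind): snapRemove is only called once the unprotect is known 
>>>>>> to have succeeded, and an RbdException from the unprotect skips that 
>>>>>> snapshot instead of falling through.
>>>>>> 
>>>>>>              for (RbdSnapInfo snap : snaps) {
>>>>>>                  try {
>>>>>>                      if (image.snapIsProtected(snap.name)) {
>>>>>>                          s_logger.debug("Unprotecting snapshot " +
>>>>>>                              pool.getSourceDir() + "/" + uuid + "@" + snap.name);
>>>>>>                          image.snapUnprotect(snap.name);
>>>>>>                      } else {
>>>>>>                          s_logger.debug("Snapshot " + pool.getSourceDir() +
>>>>>>                              "/" + uuid + "@" + snap.name + " is not protected.");
>>>>>>                      }
>>>>>>                  } catch (RbdException e) {
>>>>>>                      // Unprotect failed (for example, clones still exist or
>>>>>>                      // the client's caps deny it); skip removing this snapshot.
>>>>>>                      s_logger.warn("Failed to unprotect snapshot " +
>>>>>>                          pool.getSourceDir() + "/" + uuid + "@" + snap.name +
>>>>>>                          ", skipping removal: " + e.getMessage());
>>>>>>                      continue;
>>>>>>                  }
>>>>>>                  s_logger.debug("Removing snapshot " +
>>>>>>                      pool.getSourceDir() + "/" + uuid + "@" + snap.name);
>>>>>>                  image.snapRemove(snap.name);
>>>>>>              }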
>>>>>> 
>>>>>> Code from PR #1230 (https://github.com/apache/cloudstack/pull/1230) 
>>>>>> duplicates some of this functionality and there doesn’t seem to be any 
>>>>>> protection preventing deletePhysicalDisk and the cleanup routine being 
>>>>>> run simultaneously.
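>>>>>> 
>>>>>> Purely as an illustration of the kind of guard we mean (nothing like 
>>>>>> this exists in the tree today; the names here are hypothetical), both 
>>>>>> code paths could take a per-volume lock before touching a volume's 
>>>>>> snapshots, e.g. with java.util.concurrent.ConcurrentHashMap and 
>>>>>> ReentrantLock:
>>>>>> 
>>>>>>     private static final ConcurrentHashMap<String, ReentrantLock> VOLUME_LOCKS =
>>>>>>             new ConcurrentHashMap<String, ReentrantLock>();
>>>>>> 
>>>>>>     // deletePhysicalDisk and the cleanup routine would both wrap their
>>>>>>     // snapshot loops in withVolumeLock(pool.getSourceDir(), uuid, ...)
>>>>>>     // so the two paths cannot interleave on the same image.
>>>>>>     private static void withVolumeLock(String pool, String uuid, Runnable action) {
>>>>>>         String key = pool + "/" + uuid;
>>>>>>         ReentrantLock lock = VOLUME_LOCKS.get(key);
>>>>>>         if (lock == null) {
>>>>>>             ReentrantLock fresh = new ReentrantLock();
>>>>>>             ReentrantLock existing = VOLUME_LOCKS.putIfAbsent(key, fresh);
>>>>>>             lock = (existing != null) ? existing : fresh;
>>>>>>         }
>>>>>>         lock.lock();
>>>>>>         try {
>>>>>>             action.run();
>>>>>>         } finally {
>>>>>>             lock.unlock();
>>>>>>         }
>>>>>>     }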
>>>>>> 
>>>>>> 
>>>>>> To Reproduce (with ceph/rbd primary storage)
>>>>>> 
>>>>>> 1.  Set global concurrent.snapshots.threshold.perhost to the default 
>>>>>> NULL value
>>>>>> 2.  Set global snapshot.poll.interval and storage.cleanup.interval to a 
>>>>>> low interval … 10 seconds
>>>>>> 3.  Restart management server
>>>>>> 4.  Deploy several VMs from templates
>>>>>> 5.  Destroy+expunge the VMs after they are running
>>>>>> 6.  Observe segfaults in management server
>>>>>> 
>>>>>> 
>>>>>> Workaround
>>>>>> 
>>>>>> We’ve been able to eliminate the segfaults of the host agent in our 
>>>>>> testing by simply setting concurrent.snapshots.threshold.perhost to 1 
>>>>>> even with the decreased poll intervals.
>>>>>> 
>>>>>> Segfault Logs
>>>>>> 
>>>>>> https://slack-files.com/T0RJECUV7-F1M39K4F5-f9c6b3986d
>>>>>> 
>>>>>> https://slack-files.com/T0RJECUV7-F1KCTRNNN-8d36665b56
>>>>>> 
>>>>>> We would really appreciate any feedback and/or confirmation from the 
>>>>>> community around the above issues.  I’d also be happy to provide any 
>>>>>> additional information needed to get this addressed.
>>>>> 
>>>>> What seems to be happening is that it failed to unprotect the snapshot of 
>>>>> the volume. This could have various reasons, for example if there is a 
>>>>> child image of the snapshot. I don't think that's the case here, however.
>>>>> 
>>>>> It could still be that it tries to remove the master/golden image of the 
>>>>> template while it still has children attached to that snapshot.
>>>>> 
>>>>> I'm not sure if this is due to rados-java or a bug in librados. The Java 
>>>>> code should just throw an exception and not completely crash the JVM. 
>>>>> This happens lower in the stack, not in Java.
>>>>> 
>>>>> The assert shows this also happens while Java is talking to libvirt. I 
>>>>> guess it's a librados bug, but I'm not completely sure.
>>>>> 
>>>>> Wido
>>>> 
>>>> 
>>>> We’re seeing this happen in other operations as well, and it does seem to 
>>>> be related to rados-java and the JNA wrappers around librbd.  This is an 
>>>> exception that occurred just this morning while performing a load balancer 
>>>> update.
>>>> 
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG 
>>>> [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-3:) 
>>>> (logid:7b48049b) Execution is successful.
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG 
>>>> [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:) 
>>>> (logid:4a2bd0ba) Executing: 
>>>> /usr/share/cloudstack-common/scripts/network/domr/router_proxy.sh 
>>>> checkrouter.sh 169.254.3.93
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG 
>>>> [cloud.agent.Agent] (agentRequest-Handler-3:) (logid:7b48049b) Seq 
>>>> 1-5871286539207573659:  { Ans: , MgmtId: 52239507206, via: 1, Ver: v1, 
>>>> Flags: 10, 
>>>> [{"com.cloud.agent.api.CheckRouterAnswer":{"state":"BACKUP","result":true,"details":"Status:
>>>>  BACKUP\n","wait":0}}] }
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG 
>>>> [cloud.agent.Agent] (agentRequest-Handler-1:) (logid:aec84669) Request:Seq 
>>>> 1-5871286539207573662:  { Cmd , MgmtId: 52239507206, via: 1, Ver: v1, 
>>>> Flags: 100001, 
>>>> [{"com.cloud.agent.api.routing.LoadBalancerConfigCommand":{"loadBalancers":[{"uuid":"733276c0-df18-4110-a4ed-5efc1f523eb2","srcIp":"10.107.0.12","srcPort":80,"protocol":"tcp","algorithm":"roundrobin","revoked":false,"alreadyAdded":false,"inline":false,"destinations":[]},{"uuid":"1dcba169-922c-45d2-9f85-33d63ef6f0e7","srcIp":"10.107.0.11","srcPort":300,"protocol":"tcp","algorithm":"roundrobin","revoked":false,"alreadyAdded":false,"inline":false,"destinations":[{"destIp":"10.90.0.191","destPort":400,"revoked":false,"alreadyAdded":false}]},{"uuid":"580dd6d7-b12a-4e14-93d8-a6f87dd75763","srcIp":"10.107.0.13","srcPort":5000,"protocol":"tcp","algorithm":"roundrobin","revoked":false,"alreadyAdded":false,"inline":false,"destinations":[{"destIp":"10.90.0.35","destPort":80,"revoked":false,"alreadyAdded":false},{"destIp":"10.90.0.36","destPort":80,"revoked":false,"alreadyAdded":false}]},{"uuid":"6b8f4872-1d05-4942-b715-3b0bf92e9d20","srcIp":"10.107.0.19","srcPort":111,"protocol":"tcp","algorithm":"roundrobin","revoked":true,"alreadyAdded":false,"inline":false,"destinations":[]}],"lbStatsVisibility":"global","lbStatsPublicIP":"10.107.0.6","lbStatsPrivateIP":"169.254.0.11","lbStatsGuestIP":"10.90.0.14","lbStatsPort":"8081","lbStatsSrcCidrs":"0/0","lbStatsAuth":"admin1:AdMiN123","lbStatsUri":"/admin?stats","maxconn":"4096","keepAliveEnabled":false,"nic":{"deviceId":3,"networkRateMbps":200,"defaultNic":false,"pxeDisable":true,"nicUuid":"503eca28-76fb-4e0a-aaf1-66bb63fae4b5","uuid":"e0b77f27-83b3-4ce4-b81d-05b0c559b395","ip":"10.90.0.14","netmask":"255.255.255.0","gateway":"10.90.0.1","mac":"02:00:42:58:00:04","broadcastType":"Vxlan","type":"Guest","broadcastUri":"vxlan://6054","isolationUri":"vxlan://6054","isSecurityGroupEnabled":false,"name":"bond0.109"},"vpcId":10,"accessDetails":{"router.guest.ip":"10.90.0.14","zone.network.type":"Advanced","router.ip":"169.254
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: 
>>>> .0.11","router.name":"r-167-QA"},"wait":0}}] }
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG 
>>>> [cloud.agent.Agent] (agentRequest-Handler-1:) (logid:aec84669) Processing 
>>>> command: com.cloud.agent.api.routing.LoadBalancerConfigCommand
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG 
>>>> [resource.virtualnetwork.VirtualRoutingResource] (agentRequest-Handler-1:) 
>>>> (logid:aec84669) Transforming 
>>>> com.cloud.agent.api.routing.LoadBalancerConfigCommand to ConfigItems
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG 
>>>> [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) 
>>>> (logid:aec84669) global section: global
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG 
>>>> [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) 
>>>> (logid:aec84669) global section:         log 127.0.0.1:3914   local0 
>>>> warning
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG 
>>>> [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) 
>>>> (logid:aec84669) global section:         maxconn 4096
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG 
>>>> [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) 
>>>> (logid:aec84669) global section:         maxpipes 1024
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG 
>>>> [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) 
>>>> (logid:aec84669) global section:         chroot /var/lib/haproxy
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG 
>>>> [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) 
>>>> (logid:aec84669) global section:         user haproxy
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG 
>>>> [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) 
>>>> (logid:aec84669) global section:         group haproxy
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG 
>>>> [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) 
>>>> (logid:aec84669) global section:         daemon
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG 
>>>> [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) 
>>>> (logid:aec84669) default section: defaults
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG 
>>>> [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) 
>>>> (logid:aec84669) default section:         log     global
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG 
>>>> [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) 
>>>> (logid:aec84669) default section:         mode    tcp
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG 
>>>> [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) 
>>>> (logid:aec84669) default section:         option  dontlognull
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG 
>>>> [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) 
>>>> (logid:aec84669) default section:         retries 3
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG 
>>>> [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) 
>>>> (logid:aec84669) default section:         option redispatch
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG 
>>>> [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) 
>>>> (logid:aec84669) default section:         option forwardfor
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG 
>>>> [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) 
>>>> (logid:aec84669) default section:         option forceclose
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG 
>>>> [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) 
>>>> (logid:aec84669) default section:         timeout connect    5000
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG 
>>>> [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) 
>>>> (logid:aec84669) default section:         timeout client     50000
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG 
>>>> [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) 
>>>> (logid:aec84669) default section:         timeout server     50000
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: INFO  
>>>> [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) 
>>>> (logid:aec84669) Haproxy mode http enabled
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG 
>>>> [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) 
>>>> (logid:aec84669) Haproxystats rule:
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: listen stats_on_public 
>>>> 10.107.0.6:8081
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: mode http
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: option httpclose
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: stats enable
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: stats uri     
>>>> /admin?stats
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: stats realm   Haproxy\ 
>>>> Statistics
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: stats auth    
>>>> admin1:AdMiN123
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG 
>>>> [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:) 
>>>> (logid:4a2bd0ba) Execution is successful.
>>>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG 
>>>> [cloud.agent.Agent] (agentRequest-Handler-2:) (logid:4a2bd0ba) Seq 
>>>> 1-5871286539207573661:  { Ans: , MgmtId: 52239507206, via: 1, Ver: v1, 
>>>> Flags: 10, 
>>>> [{"com.cloud.agent.api.CheckRouterAnswer":{"state":"BACKUP","result":true,"details":"Status:
>>>>  BACKUP\n","wait":0}}] }
>>>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: DEBUG 
>>>> [resource.virtualnetwork.VirtualRoutingResource] (agentRequest-Handler-1:) 
>>>> (logid:aec84669) Processing FileConfigItem, copying 1315 characters to 
>>>> load_balancer.json took 237ms
>>>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: DEBUG 
>>>> [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-1:) 
>>>> (logid:aec84669) Executing: 
>>>> /usr/share/cloudstack-common/scripts/network/domr/router_proxy.sh 
>>>> update_config.py 169.254.0.11 load_balancer.json
>>>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: ./log/SubsystemMap.h: 
>>>> In function 'bool ceph::log::SubsystemMap::should_gather(unsigned int, 
>>>> int)' thread 7f530dff3700 time 2016-07-06 08:59:09.143659
>>>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: ./log/SubsystemMap.h: 
>>>> 62: FAILED assert(sub < m_subsys.size())
>>>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: ceph version 10.2.2 
>>>> (45107e21c568dd033c2f0a3107dec8f0b0e58374)
>>>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 1: (()+0x289da5) 
>>>> [0x7f534a94bda5]
>>>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 2: (()+0x50028) 
>>>> [0x7f534a712028]
>>>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 3: (()+0x81ff3) 
>>>> [0x7f534a743ff3]
>>>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 4: (()+0x4d8e8e) 
>>>> [0x7f534ab9ae8e]
>>>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 5: (()+0x4e617d) 
>>>> [0x7f534aba817d]
>>>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 6: (()+0x7dc5) 
>>>> [0x7f5455e78dc5]
>>>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 7: (clone()+0x6d) 
>>>> [0x7f5455793ced]
>>>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: NOTE: a copy of the 
>>>> executable, or `objdump -rdS <executable>` is needed to interpret this.
>>>> Jul 06 08:59:14 njcloudhost.dev.ena.net sh[13457]: /bin/sh: line 1: 13461 
>>>> Aborted                 (core dumped) /usr/lib/jvm/jre/bin/java -Xms256m 
>>>> -Xmx2048m -cp "$CLASSPATH" $JAVA_CLASS
>>>> Jul 06 08:59:14 njcloudhost.dev.ena.net systemd[1]: 
>>>> cloudstack-agent.service: main process exited, code=exited, status=134/n/a
>>>> Jul 06 08:59:14 njcloudhost.dev.ena.net systemd[1]: Unit 
>>>> cloudstack-agent.service entered failed state.
>>>> Jul 06 08:59:14 njcloudhost.dev.ena.net systemd[1]: 
>>>> cloudstack-agent.service failed.
>>>> Jul 06 08:59:24 njcloudhost.dev.ena.net systemd[1]: 
>>>> cloudstack-agent.service holdoff time over, scheduling restart.
>>>> Jul 06 08:59:24 njcloudhost.dev.ena.net systemd[1]: Started CloudStack 
>>>> Agent.
>>>> Jul 06 08:59:24 njcloudhost.dev.ena.net systemd[1]: Starting CloudStack 
>>>> Agent…
>>>> 
>>>> Is there something obvious I’m missing here?
>>>> 
>>> 
>>> I am confused. In this path you are not even touching Ceph, yet it still 
>>> crashes in librados.
>>> 
>>> Searching, I found this issue in the Ceph tracker: 
>>> http://tracker.ceph.com/issues/14314
>>> 
>>> Isn't there a package version mismatch in your Ceph cluster?
>>> 
>>> Wido
>>> 
>> 
>> I agree it’s very confusing and I’m running out of ideas as to what the 
>> cause may be.
>> 
>> Here are the package versions on all our related boxes in the lab.
>> 
>> http://pastie.org/private/5t5p61ryqbaqm6mw07bw9g
>> 
>> I’ve also collected the most recent instances of our two segfaults/aborts 
>> below.
>> 
>> journalctl -u cloudstack-agent.service --no-pager | grep -B30 -A2 Aborted
>> 
>> http://sprunge.us/LcYA
>> 
>> journalctl -u cloudstack-agent.service --no-pager | grep -B5 -A20 
>> com.ceph.rbd.RbdException
>> 
>> http://sprunge.us/SiCf
>> 
> 
> Looking at this, I come to the conclusion that this is a librados bug, not a 
> bug in rados-java or CloudStack.
> 
> The crashes are happening with exactly the same backtrace. I have a few 
> clusters running with Hammer 0.94.5 and they all clean up their snapshots 
> just fine, no crashes.
> 
>> I also went back to look at the two places in the code where snapshot 
>> cleanup is taking place in our tree:
>> 
>> The place where the “failed to unprotect” exceptions seem to be triggered:
>> https://github.com/myENA/cloudstack/blob/release/ENA-4.8/plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/LibvirtStorageAdaptor.java#L840
>> 
>> The cleanup code for rbd snapshots:
>> https://github.com/myENA/cloudstack/blob/679840ae674cc1c655c256e8047187fa3b157ce7/plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/KVMStorageProcessor.java#L1289
>> 
>> I’ve rolled a small patch into our testing tree that moves the context and 
>> image cleanup into a ‘finally’ block, because I thought there may be a 
>> problem with those not being closed/freed when the unprotect threw an 
>> exception.  So the code in LibvirtStorageAdaptor.java is slightly different 
>> from the mainline 4.8 branch.  Here is the PR/diff in our fork that shows 
>> the changes.
>> 
>> https://github.com/myENA/cloudstack/pull/11/files
>> 
>> Does this make sense?  Is this even possibly related to the issue I’m seeing?
> 
> The patch looks sane, and you catch more exceptions, but still, such a hard 
> crash of Java/librados can't be triggered easily from Java. I truly think 
> this is a librados bug.
> 
> Wido
> 
>> 
>> Once again, all the advice and attention is definitely appreciated.
>> 
>> — Aaron
>> 

We finally solved the issue.  The exceptions were being thrown because of a 
malformed/typoed ACL (“reb_children” instead of “rbd_children”) that caused an 
“Operation not permitted” error to be raised by librbd when attempting to 
unprotect the base snapshot.  That exception, in combination with a failure to 
clean up contexts when exceptions occurred inside the loop in 
deletePhysicalDisk, eventually caused the host agent to crash.  We fixed the 
access list and issued a PR to ensure contexts are cleaned up even when 
exceptions occur.  In addition to the context cleanup, we also added extra 
logging to help troubleshoot similar issues if/when they happen again.

PR for the fix:

https://github.com/apache/cloudstack/pull/1608
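
For anyone hitting something similar, the shape of the change is roughly the 
following (a simplified sketch of the pattern using the same rados-java calls 
and the rados/pool/uuid variables from LibvirtStorageAdaptor, not the actual 
diff; exact checked exception types depend on the rados-java version in use):

    IoCTX ioCtx = rados.ioCtxCreate(pool.getSourceDir());
    try {
        Rbd rbd = new Rbd(ioCtx);
        RbdImage image = rbd.open(uuid);
        try {
            // unprotect and remove snapshots here; an RbdException still
            // propagates, but the handles below are now always released
        } finally {
            rbd.close(image);          // always release the image handle
        }
        rbd.remove(uuid);              // then remove the RBD image itself
    } finally {
        rados.ioCtxDestroy(ioCtx);     // always tear down the pool context
    }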

Thank you again for all the support.

— Aaron
