From everything we analyzed, it looks like it is this issue:
http://tracker.ceph.com/issues/13045

PR: https://github.com/ceph/ceph/pull/6097

Can anyone help us to confirm this? :)
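
For context, the assertion string in the backtraces below ("sub < m_subsys.size()" in ./log/SubsystemMap.h) is the bounds check at the top of ceph::log::SubsystemMap::should_gather(). Here is a minimal sketch of what that check does, reconstructed from the backtrace rather than from the exact Ceph source (Ceph's own assert is always compiled in, unlike the plain assert() used here):

    // Sketch only: roughly what the failing check does. Field names and the
    // exact return expression are assumptions, not the real Ceph code.
    #include <cassert>
    #include <vector>

    struct Subsystem {
      int log_level;     // level written to the log file
      int gather_level;  // level kept for the in-memory log buffer
    };

    struct SubsystemMap {
      std::vector<Subsystem> m_subsys;

      bool should_gather(unsigned sub, int level) const {
        // This is the assert that aborts in our core dumps. If the map has
        // already been torn down (m_subsys is empty), every subsystem index
        // fails the check.
        assert(sub < m_subsys.size());
        return level <= m_subsys[sub].log_level ||
               level <= m_subsys[sub].gather_level;
      }
    };

    int main() {
      SubsystemMap m;          // empty map, as after teardown
      m.should_gather(1, 5);   // trips the assert and aborts, like in our dumps
    }

If that reading is right, the crash is not really in the logging code itself: ObjectCacher::flusher_entry() is still running and trying to log while the logging subsystems look like they have already been torn down underneath it, which is why the tracker issue and PR above look like a match to us.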

2015-10-29 23:13 GMT+02:00 Voloshanenko Igor <igor.voloshane...@gmail.com>:

> Additional trace:
>
> #0  0x00007f30f9891cc9 in __GI_raise (sig=sig@entry=6) at
> ../nptl/sysdeps/unix/sysv/linux/raise.c:56
> #1  0x00007f30f98950d8 in __GI_abort () at abort.c:89
> #2  0x00007f30f87b36b5 in __gnu_cxx::__verbose_terminate_handler() () from
> /usr/lib/x86_64-linux-gnu/libstdc++.so.6
> #3  0x00007f30f87b1836 in ?? () from
> /usr/lib/x86_64-linux-gnu/libstdc++.so.6
> #4  0x00007f30f87b1863 in std::terminate() () from
> /usr/lib/x86_64-linux-gnu/libstdc++.so.6
> #5  0x00007f30f87b1aa2 in __cxa_throw () from
> /usr/lib/x86_64-linux-gnu/libstdc++.so.6
> #6  0x00007f2fddb50778 in ceph::__ceph_assert_fail
> (assertion=assertion@entry=0x7f2fdddeca05 "sub < m_subsys.size()",
>     file=file@entry=0x7f2fdddec9f0 "./log/SubsystemMap.h", line=line@entry
> =62,
>     func=func@entry=0x7f2fdddedba0
> <_ZZN4ceph3log12SubsystemMap13should_gatherEjiE19__PRETTY_FUNCTION__> "bool
> ceph::log::SubsystemMap::should_gather(unsigned int, int)") at
> common/assert.cc:77
> #7  0x00007f2fdda1fed2 in ceph::log::SubsystemMap::should_gather
> (level=<optimized out>, sub=<optimized out>, this=<optimized out>)
>     at ./log/SubsystemMap.h:62
> #8  0x00007f2fdda3b693 in ceph::log::SubsystemMap::should_gather
> (this=<optimized out>, sub=<optimized out>, level=<optimized out>)
>     at ./log/SubsystemMap.h:61
> #9  0x00007f2fddd879be in ObjectCacher::flusher_entry
> (this=0x7f2ff80b27a0) at osdc/ObjectCacher.cc:1527
> #10 0x00007f2fddd9851d in ObjectCacher::FlusherThread::entry
> (this=<optimized out>) at osdc/ObjectCacher.h:374
> #11 0x00007f30f9c28182 in start_thread (arg=0x7f2e1a7fc700) at
> pthread_create.c:312
> #12 0x00007f30f995547d in clone () at
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
>
> 2015-10-29 17:38 GMT+02:00 Voloshanenko Igor <igor.voloshane...@gmail.com>:
>
>> Hi Wido and the whole community.
>>
>> We have hit a very strange issue on our CloudStack installation, which is
>> related to ceph and possibly to the java-rados lib.
>>
>> The agent crashes constantly (which causes a very big problem for us...).
>>
>> When the agent crashes, it takes the whole JVM down with it, and there is
>> no event in the logs at all. We enabled crash dumps, and after a crash we
>> see the following picture:
>>
>> #grep -A1 "Problematic frame" < /hs_err_pid30260.log
>>  Problematic frame:
>>  C  [librbd.so.1.0.0+0x5d681]
>>
>> # gdb /usr/lib/librbd.so.1.0.0 /var/tmp/cores/jsvc.25526.0.core
>> (gdb)  bt
>> ...
>> #7  0x00007f30b9a1fed2 in ceph::log::SubsystemMap::should_gather
>> (level=<optimized out>, sub=<optimized out>, this=<optimized out>)
>>     at ./log/SubsystemMap.h:62
>> #8  0x00007f30b9a3b693 in ceph::log::SubsystemMap::should_gather
>> (this=<optimized out>, sub=<optimized out>, level=<optimized out>)
>>     at ./log/SubsystemMap.h:61
>> #9  0x00007f30b9d879be in ObjectCacher::flusher_entry
>> (this=0x7f2fb4017910) at osdc/ObjectCacher.cc:1527
>> #10 0x00007f30b9d9851d in ObjectCacher::FlusherThread::entry
>> (this=<optimized out>) at osdc/ObjectCacher.h:374
>>
>> From the ceph code, this part is executed when flushing a cache object...
>> and we don't understand why, because the race condition we use to
>> reproduce it is completely different.
>>
>> As CloudStack does not yet have a good implementation of the snapshot
>> lifecycle, it sometimes happens that some volumes are already marked as
>> EXPUNGED in the DB, and CloudStack then tries to delete the base volume
>> (and unprotect its base snapshot) while that snapshot still has children.
>>
>> Of course the unprotect fails and a normal exception is returned (it fails
>> because the snap still has children...):
>>
>> 2015-10-29 09:02:19,401 DEBUG [kvm.resource.KVMHAMonitor]
>> (Thread-1304:null) Executing:
>> /usr/share/cloudstack-common/scripts/vm/hypervisor/kvm/kvmheartbeat.sh -i
>> 10.44.253.13 -p /var/lib/libvirt/PRIMARY -m
>> /mnt/93655746-a9ef-394d-95e9-6e62471dd39f -h 10.44.253.11
>> 2015-10-29 09:02:19,412 DEBUG [kvm.resource.KVMHAMonitor]
>> (Thread-1304:null) Execution is successful.
>> 2015-10-29 09:02:19,554 INFO  [kvm.storage.LibvirtStorageAdaptor]
>> (agentRequest-Handler-5:null) Unprotecting and Removing RBD snapshots of
>> image 6789/71b1e2e9-1985-45ca-9ab6-9e5016b86b7c prior to removing the image
>> 2015-10-29 09:02:19,571 DEBUG [kvm.storage.LibvirtStorageAdaptor]
>> (agentRequest-Handler-5:null) Succesfully connected to Ceph cluster at
>> cephmon.anolim.net:6789
>> 2015-10-29 09:02:19,608 DEBUG [kvm.storage.LibvirtStorageAdaptor]
>> (agentRequest-Handler-5:null) Unprotecting snapshot
>> cloudstack/71b1e2e9-1985-45ca-9ab6-9e5016b86b7c@cloudstack-base-snap
>> 2015-10-29 09:02:19,627 DEBUG [kvm.storage.KVMStorageProcessor]
>> (agentRequest-Handler-5:null) Failed to delete volume:
>> com.cloud.utils.exception.CloudRuntimeException: com.ceph.rbd.RbdException:
>> Failed to unprotect snapshot cloudstack-base-snap
>> 2015-10-29 09:02:19,628 DEBUG [cloud.agent.Agent]
>> (agentRequest-Handler-5:null) Seq 4-1921583831:  { Ans: , MgmtId:
>> 161344838950, via: 4, Ver: v1, Flags: 10,
>> [{"com.cloud.agent.api.Answer":{"result":false,"details":"com.cloud.utils.exception.CloudRuntimeException:
>> com.ceph.rbd.RbdException: Failed to unprotect snapshot
>> cloudstack-base-snap","wait":0}}] }
>> 2015-10-29 09:02:25,722 DEBUG [cloud.agent.Agent]
>> (agentRequest-Handler-2:null) Processing command:
>> com.cloud.agent.api.GetHostStatsCommand
>> 2015-10-29 09:02:25,722 DEBUG [kvm.resource.LibvirtComputingResource]
>> (agentRequest-Handler-2:null) Executing: /bin/bash -c idle=$(top -b -n 1|
>> awk -F, '/^[%]*[Cc]pu/{$0=$4; gsub(/[^0-9.,]+/,""); print }'); echo $idle
>> 2015-10-29 09:02:26,249 DEBUG [kvm.resource.LibvirtComputingResource]
>> (agentRequest-Handler-2:null) Execution is successful.
>> 2015-10-29 09:02:26,250 DEBUG [kvm.resource.LibvirtComputingResource]
>> (agentRequest-Handler-2:null) Executing: /bin/bash -c freeMem=$(free|grep
>> cache:|awk '{print $4}');echo $freeMem
>> 2015-10-29 09:02:26,254 DEBUG [kvm.resource.LibvirtComputingResource]
>> (agentRequest-Handler-2:null) Execution is successful.
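>>
>> (For reference, the same state can be checked by hand with the rbd CLI;
>> the image and snapshot names below are taken from the log above. "rbd
>> children" lists the clones of the snapshot, and "rbd snap unprotect" keeps
>> failing with EBUSY as long as any clone exists, which matches the
>> RbdException in the agent log; the exact error text depends on the rbd
>> version.)
>>
>> # rbd children cloudstack/71b1e2e9-1985-45ca-9ab6-9e5016b86b7c@cloudstack-base-snap
>> # rbd snap unprotect cloudstack/71b1e2e9-1985-45ca-9ab6-9e5016b86b7c@cloudstack-base-snap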
>>
>> BUT, about 20 minutes later the agent crashes... If we first remove all
>> the children and give CloudStack the conditions to delete the volume
>> cleanly, everything is OK and there is no agent crash 20 minutes later...
>>
>> We cannot see how this action (the volume delete) is connected to the
>> agent crash... We also don't understand why roughly 20 minutes have to
>> pass before the agent crashes...
>>
>> In the logs before the crash there is only GetVmStats... and then the
>> agent starts up again...
>>
>> 2015-10-29 09:21:55,143 DEBUG [cloud.agent.Agent] (UgentTask-5:null)
>> Sending ping: Seq 4-1343:  { Cmd , MgmtId: -1, via: 4, Ver: v1, Flags: 11,
>> [{"com.cloud.agent.api.PingRoutingCommand":{"newStates":{},"_hostVmStateReport":{"i-881-1117-VM":{"state":"PowerOn","host":"
>> cs2.anolim.net"},"i-7-106-VM":{"state":"PowerOn","host":"cs2.anolim.net
>> "},"i-1683-1984-VM":{"state":"PowerOn","host":"cs2.anolim.net
>> "},"i-11-504-VM":{"state":"PowerOn","host":"cs2.anolim.net
>> "},"i-325-616-VM":{"state":"PowerOn","host":"cs2.anolim.net
>> "},"i-10-52-VM":{"state":"PowerOn","host":"cs2.anolim.net
>> "},"i-941-1237-VM":{"state":"PowerOn","host":"cs2.anolim.net"}},"_gatewayAccessible":true,"_vnetAccessible":true,"hostType":"Routing","hostId":4,"wait":0}}]
>> }
>> 2015-10-29 09:21:55,149 DEBUG [cloud.agent.Agent] (Agent-Handler-3:null)
>> Received response: Seq 4-1343:  { Ans: , MgmtId: 161344838950, via: 4, Ver:
>> v1, Flags: 100010,
>> [{"com.cloud.agent.api.PingAnswer":{"_command":{"hostType":"Routing","hostId":4,"wait":0},"result":true,"wait":0}}]
>> }
>> 2015-10-29 09:22:20,755 DEBUG [cloud.agent.Agent]
>> (agentRequest-Handler-5:null) Processing command:
>> com.cloud.agent.api.GetVmStatsCommand
>> 2015-10-29 09:23:02,630 INFO  [cloud.agent.AgentShell] (main:null) Agent
>> started
>> 2015-10-29 09:23:02,632 INFO  [cloud.agent.AgentShell] (main:null)
>> Implementation Version is 4.3.2
>>
>>
>> Can you please point us in a direction to dig deeper... as we don't
>> understand what is happening...
>>
>> The issue is very easy to reproduce (example rbd commands for the manual
>> steps are shown after the list):
>>
>> (to speed up the delete process we set storage.cleanup.interval=10)
>>
>> 1. Create a volume
>> 2. Attach it to a VM
>> 3. Create a snapshot of this volume by hand on ceph
>> 4. Protect that snapshot by hand on ceph (snap protect)
>> 5. Create a child image via snap clone
>> 6. Detach the volume from the VM and delete it
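>>
>> For steps 3-5, the manual part on the ceph side is roughly the following
>> (a sketch: the pool, volume and child names are placeholders, and the
>> parent volume has to be a format 2 image so it can be cloned):
>>
>> # rbd snap create cloudstack/<volume-uuid>@test-snap
>> # rbd snap protect cloudstack/<volume-uuid>@test-snap
>> # rbd clone cloudstack/<volume-uuid>@test-snap cloudstack/<child-image>
>>
>> After step 6, CloudStack's cleanup then runs into the same failed
>> unprotect as in the log above.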
>>
>> Our problem is that this happens not only in such a lab setup, but in
>> real life...
>>
>
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
