Re: [ceph-users] Cloudstack agent crashed JVM with exception in librbd

2015-11-03 Thread Wido den Hollander


On 03-11-15 01:54, Voloshanenko Igor wrote:
> Thank you, Jason!
> 
> Any advice for troubleshooting?
> 
> I'm looking in the code, and right now I don't see any bad things :(
> 

Can you run the CloudStack Agent in DEBUG mode and then see after which
lines in the logs it crashes?

Wido


Re: [ceph-users] Cloudstack agent crashed JVM with exception in librbd

2015-11-03 Thread Voloshanenko Igor
Wido, also a minor issue with the 0.2.0 java-rados

We still catch:

-storage/ae1b6e5f-f5f4-4abe-aee3-084f2fe71876
2015-11-02 11:41:14,958 WARN  [cloud.agent.Agent]
(agentRequest-Handler-4:null) Caught:
java.lang.NegativeArraySizeException
at com.ceph.rbd.RbdImage.snapList(Unknown Source)
at
com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.deletePhysicalDisk(LibvirtStorageAdaptor.java:854)
at
com.cloud.hypervisor.kvm.storage.LibvirtStoragePool.deletePhysicalDisk(LibvirtStoragePool.java:175)
at
com.cloud.hypervisor.kvm.storage.KVMStorageProcessor.deleteVolume(KVMStorageProcessor.java:1206)
at
com.cloud.storage.resource.StorageSubsystemCommandHandlerBase.execute(StorageSubsystemCommandHandlerBase.java:124)
at
com.cloud.storage.resource.StorageSubsystemCommandHandlerBase.handleStorageCommands(StorageSubsystemCommandHandlerBase.java:57)
at
com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.executeRequest(LibvirtComputingResource.java:1385)
at com.cloud.agent.Agent.processRequest(Agent.java:503)
at com.cloud.agent.Agent$AgentRequestHandler.doTask(Agent.java:808)
at com.cloud.utils.nio.Task.run(Task.java:84)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Even with the updated lib:

root@ix1-c7-2:/usr/share/cloudstack-agent/lib# ls
/usr/share/cloudstack-agent/lib | grep rados
rados-0.2.0.jar
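
That NegativeArraySizeException is a hint about where snapList() fails: librbd's
rbd_snap_list() reports errors as a negative return code, and if the binding
sizes a Java array directly from that raw count, the JVM throws exactly this
exception. Until the binding handles it, a defensive wrapper in the agent is
one option; a minimal sketch, assuming the rados-java 0.2.0 API
(RbdImage.snapList() returning List<RbdSnapInfo> and an RbdException(String)
constructor):

    import java.util.List;

    import com.ceph.rbd.RbdException;
    import com.ceph.rbd.RbdImage;
    import com.ceph.rbd.jna.RbdSnapInfo;

    class SafeSnapList {
        // Turn a bad native snapshot count into a checked RbdException
        // instead of letting NegativeArraySizeException escape into the
        // agent's request handler.
        static List<RbdSnapInfo> snapListSafe(RbdImage image) throws RbdException {
            try {
                return image.snapList();
            } catch (NegativeArraySizeException e) {
                // A negative count means rbd_snap_list() itself failed,
                // e.g. because the image or its io-context is already gone.
                throw new RbdException("rbd_snap_list returned a negative snapshot count");
            }
        }
    }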


Re: [ceph-users] Cloudstack agent crashed JVM with exception in librbd

2015-11-03 Thread Wido den Hollander


On 03-11-15 10:04, Voloshanenko Igor wrote:
> Wido, also a minor issue with the 0.2.0 java-rados
> 

Did you also re-compile CloudStack against the new rados-java? I still
think it's related to when the Agent starts cleaning up and there are
snapshots which need to be unprotected.

In the meantime you might want to remove any existing RBD snapshots
using the rbd commands from Ceph; that might solve the problem.

Wido
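
For reference, the cleanup Wido suggests from the Ceph side (e.g. rbd snap
unprotect / rbd snap rm per snapshot) is also roughly what
LibvirtStorageAdaptor.deletePhysicalDisk() attempts from Java before removing
an image. A minimal sketch of that loop, assuming the rados-java 0.2.0 API
(snapList/snapIsProtected/snapUnprotect/snapRemove); the pool and image names
are only placeholders taken from the logs in this thread:

    import java.io.File;
    import java.util.List;

    import com.ceph.rados.IoCTX;
    import com.ceph.rados.Rados;
    import com.ceph.rbd.Rbd;
    import com.ceph.rbd.RbdImage;
    import com.ceph.rbd.jna.RbdSnapInfo;

    public class RbdSnapshotCleanup {
        public static void main(String[] args) throws Exception {
            Rados rados = new Rados("admin");
            rados.confReadFile(new File("/etc/ceph/ceph.conf"));
            rados.connect();
            IoCTX io = rados.ioCtxCreate("cloudstack");
            Rbd rbd = new Rbd(io);
            RbdImage image = rbd.open("71b1e2e9-1985-45ca-9ab6-9e5016b86b7c");
            try {
                // Unprotect and remove every snapshot before the image is
                // deleted, mirroring the "Unprotecting and Removing RBD
                // snapshots ... prior to removing the image" agent log line.
                List<RbdSnapInfo> snaps = image.snapList();
                for (RbdSnapInfo snap : snaps) {
                    if (image.snapIsProtected(snap.name)) {
                        // Throws an RbdException (EBUSY) while the snapshot
                        // still has cloned children.
                        image.snapUnprotect(snap.name);
                    }
                    image.snapRemove(snap.name);
                }
            } finally {
                // Close in reverse order of creation: image, io-context,
                // cluster handle.
                rbd.close(image);
                rados.ioCtxDestroy(io);
                rados.shutDown();
            }
        }
    }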


Re: [ceph-users] Cloudstack agent crashed JVM with exception in librbd

2015-11-03 Thread Voloshanenko Igor
Wido, it's the main issue. No records at all...


So, from last time:


2015-11-02 11:40:33,204 DEBUG [kvm.resource.LibvirtComputingResource]
(agentRequest-Handler-2:null) Executing: /bin/bash -c free|grep Mem:|awk
'{print $2}'
2015-11-02 11:40:33,207 DEBUG [kvm.resource.LibvirtComputingResource]
(agentRequest-Handler-2:null) Execution is successful.
2015-11-02 11:40:35,316 DEBUG [cloud.agent.Agent]
(agentRequest-Handler-4:null) Processing command:
com.cloud.agent.api.GetVmStatsCommand
2015-11-02 11:40:35,867 INFO  [cloud.agent.AgentShell] (main:null) Agent
started
2015-11-02 11:40:35,868 INFO  [cloud.agent.AgentShell] (main:null)
Implementation Version is 4.5.1

So, almost always it's an exception after RbdUnprotect, then in approx. 20
minutes - a crash...
Almost all the time it happens after GetVmStatsCommand or disk stats...
Possibly the evil is hidden in the UpadteDiskInfo method... but I can't find
any bad code there (((


Re: [ceph-users] Cloudstack agent crashed JVM with exception in librbd

2015-11-02 Thread Jason Dillaman
Most likely not going to be related to 13045 since you aren't actively 
exporting an image diff.  The most likely problem is that the RADOS IO context 
is being closed prior to closing the RBD image.

-- 

Jason Dillaman 
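
To make that concrete: librbd keeps using the RADOS IO context (and its
CephContext) for background work such as the ObjectCacher flusher thread, so
destroying the context while an image is still open leaves that thread running
against freed state, and the delayed abort in SubsystemMap::should_gather seen
in the backtraces would fit that. A minimal sketch of the safe ordering with
rados-java (hypothetical helper, assuming the 0.2.0 API):

    import com.ceph.rados.IoCTX;
    import com.ceph.rados.Rados;
    import com.ceph.rbd.Rbd;
    import com.ceph.rbd.RbdImage;

    class CloseOrdering {
        // SAFE: close the image before destroying its io-context, so
        // librbd can stop its flusher thread while the context is valid.
        static void openAndClose(Rados rados, String pool, String name) throws Exception {
            IoCTX io = rados.ioCtxCreate(pool);
            Rbd rbd = new Rbd(io);
            RbdImage image = rbd.open(name);
            rbd.close(image);        // stop librbd's background threads first
            rados.ioCtxDestroy(io);  // only now drop the context

            // UNSAFE (the suspected pattern): swapping the two calls above
            // leaves the ObjectCacher flusher thread alive after its context
            // is gone; the crash then fires minutes later, on the next cache
            // flush, not at the call site.
        }
    }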



Re: [ceph-users] Cloudstack agent crashed JVM with exception in librbd

2015-11-02 Thread Voloshanenko Igor
Dear all, can anybody help?


Re: [ceph-users] Cloudstack agent crashed JVM with exception in librbd

2015-11-02 Thread Voloshanenko Igor
Thank you, Jason!

Any advice for troubleshooting?

I'm looking in the code, and right now I don't see any bad things :(


Re: [ceph-users] Cloudstack agent crashed JVM with exception in librbd

2015-11-02 Thread Jason Dillaman
I can't say that I know too much about CloudStack's integration with RBD to
offer much assistance.  Perhaps if CloudStack is receiving an exception for
something, it is not properly handling object lifetimes in this case.

-- 

Jason Dillaman 
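
One concrete way an exception could break object lifetimes, sketched as a
guess rather than as CloudStack's actual code: if snapshot cleanup throws
(e.g. "Failed to unprotect snapshot") and teardown then runs out of order or
not at all, a librbd image can outlive its io-context. A try/finally shape
avoids that; a minimal sketch, assuming the rados-java 0.2.0 API:

    import com.ceph.rados.IoCTX;
    import com.ceph.rados.Rados;
    import com.ceph.rbd.Rbd;
    import com.ceph.rbd.RbdImage;

    class LifetimeGuard {
        // Guarantee image -> io-context teardown order even when the
        // snapshot cleanup in the middle throws an RbdException.
        static void deleteVolume(Rados rados, String pool, String name) throws Exception {
            IoCTX io = rados.ioCtxCreate(pool);
            try {
                Rbd rbd = new Rbd(io);
                RbdImage image = rbd.open(name);
                try {
                    // ... unprotect and remove snapshots here; this is the
                    // step that failed in the logs and may throw ...
                } finally {
                    rbd.close(image);  // always runs, even on the error path
                }
                rbd.remove(name);
            } finally {
                rados.ioCtxDestroy(io);  // dropped only after the image is closed
            }
        }
    }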



Re: [ceph-users] Cloudstack agent crashed JVM with exception in librbd

2015-10-30 Thread Wido den Hollander


On 29-10-15 16:38, Voloshanenko Igor wrote:
> Hi Wido and all the community.
> 
> We caught a very idiotic issue on our CloudStack installation, which is
> related to Ceph and possibly to the java-rados lib.
> 

I think you ran into this one:
https://issues.apache.org/jira/browse/CLOUDSTACK-8879

Cleaning up RBD snapshots for volumes didn't go well and caused the JVM
to crash.

Wido


Re: [ceph-users] Cloudstack agent crashed JVM with exception in librbd

2015-10-30 Thread Voloshanenko Igor
It's a pain, but no... :(
We already use your updated lib in the dev env... :(


[ceph-users] Cloudstack agent crashed JVM with exception in librbd

2015-10-29 Thread Voloshanenko Igor
Hi Wido and all the community.

We caught a very idiotic issue on our CloudStack installation, which is
related to Ceph and possibly to the java-rados lib.

So, we constantly have the agent crashing (which causes a very big problem
for us...).

When the agent crashes - it crashes the JVM. And no event in the logs at all.
We enabled crash dumps, and after a crash we see the next picture:

#grep -A1 "Problematic frame" < /hs_err_pid30260.log
 Problematic frame:
 C  [librbd.so.1.0.0+0x5d681]

# gdb /usr/lib/librbd.so.1.0.0 /var/tmp/cores/jsvc.25526.0.core
(gdb)  bt
...
#7  0x7f30b9a1fed2 in ceph::log::SubsystemMap::should_gather
(level=<optimized out>, sub=<optimized out>, this=<optimized out>)
at ./log/SubsystemMap.h:62
#8  0x7f30b9a3b693 in ceph::log::SubsystemMap::should_gather
(this=<optimized out>, sub=<optimized out>, level=<optimized out>)
at ./log/SubsystemMap.h:61
#9  0x7f30b9d879be in ObjectCacher::flusher_entry (this=0x7f2fb4017910)
at osdc/ObjectCacher.cc:1527
#10 0x7f30b9d9851d in ObjectCacher::FlusherThread::entry
(this=<optimized out>) at osdc/ObjectCacher.h:374

From the Ceph code, this part is executed when flushing a cache object...
And we don't understand why, because we have an absolutely different race
condition to reproduce it.

As CloudStack does not yet have a good implementation of the snapshot
lifecycle, it sometimes happens that some volumes are already marked as
EXPUNGED in the DB, and then CloudStack tries to delete the base volume
before it tries to unprotect it.

Sure, the unprotect fails and a normal exception is returned back (it fails
because the snap has children... )
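
That failure mode is expected from librbd: unprotecting a snapshot that still
has cloned children is refused (an EBUSY-style error, surfaced by rados-java
as an RbdException), and the log excerpt below shows CloudStack wrapping
exactly that into a CloudRuntimeException. A minimal sketch of the call site,
as an illustration only:

    import com.ceph.rbd.RbdException;
    import com.ceph.rbd.RbdImage;

    class UnprotectExample {
        // Returns false instead of propagating when librbd refuses to
        // unprotect a snapshot that still has cloned children.
        static boolean tryUnprotect(RbdImage image, String snapName) {
            try {
                image.snapUnprotect(snapName);
                return true;
            } catch (RbdException e) {
                // Matches the log: "Failed to unprotect snapshot
                // cloudstack-base-snap" while child images exist.
                return false;
            }
        }
    }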

2015-10-29 09:02:19,401 DEBUG [kvm.resource.KVMHAMonitor]
(Thread-1304:null) Executing:
/usr/share/cloudstack-common/scripts/vm/hypervisor/kvm/kvmheartbeat.sh -i
10.44.253.13 -p /var/lib/libvirt/PRIMARY -m
/mnt/93655746-a9ef-394d-95e9-6e62471dd39f -h 10.44.253.11
2015-10-29 09:02:19,412 DEBUG [kvm.resource.KVMHAMonitor]
(Thread-1304:null) Execution is successful.
2015-10-29 09:02:19,554 INFO  [kvm.storage.LibvirtStorageAdaptor]
(agentRequest-Handler-5:null) Unprotecting and Removing RBD snapshots of
image 6789/71b1e2e9-1985-45ca-9ab6-9e5016b86b7c prior to removing the image
2015-10-29 09:02:19,571 DEBUG [kvm.storage.LibvirtStorageAdaptor]
(agentRequest-Handler-5:null) Succesfully connected to Ceph cluster at
cephmon.anolim.net:6789
2015-10-29 09:02:19,608 DEBUG [kvm.storage.LibvirtStorageAdaptor]
(agentRequest-Handler-5:null) Unprotecting snapshot
cloudstack/71b1e2e9-1985-45ca-9ab6-9e5016b86b7c@cloudstack-base-snap
2015-10-29 09:02:19,627 DEBUG [kvm.storage.KVMStorageProcessor]
(agentRequest-Handler-5:null) Failed to delete volume:
com.cloud.utils.exception.CloudRuntimeException: com.ceph.rbd.RbdException:
Failed to unprotect snapshot cloudstack-base-snap
2015-10-29 09:02:19,628 DEBUG [cloud.agent.Agent]
(agentRequest-Handler-5:null) Seq 4-1921583831:  { Ans: , MgmtId:
161344838950, via: 4, Ver: v1, Flags: 10,
[{"com.cloud.agent.api.Answer":{"result":false,"details":"com.cloud.utils.exception.CloudRuntimeException:
com.ceph.rbd.RbdException: Failed to unprotect snapshot
cloudstack-base-snap","wait":0}}] }
2015-10-29 09:02:25,722 DEBUG [cloud.agent.Agent]
(agentRequest-Handler-2:null) Processing command:
com.cloud.agent.api.GetHostStatsCommand
2015-10-29 09:02:25,722 DEBUG [kvm.resource.LibvirtComputingResource]
(agentRequest-Handler-2:null) Executing: /bin/bash -c idle=$(top -b -n 1|
awk -F, '/^[%]*[Cc]pu/{$0=$4; gsub(/[^0-9.,]+/,""); print }'); echo $idle
2015-10-29 09:02:26,249 DEBUG [kvm.resource.LibvirtComputingResource]
(agentRequest-Handler-2:null) Execution is successful.
2015-10-29 09:02:26,250 DEBUG [kvm.resource.LibvirtComputingResource]
(agentRequest-Handler-2:null) Executing: /bin/bash -c freeMem=$(free|grep
cache:|awk '{print $4}');echo $freeMem
2015-10-29 09:02:26,254 DEBUG [kvm.resource.LibvirtComputingResource]
(agentRequest-Handler-2:null) Execution is successful.

BUT, after 20 minutes - the agent crashed... If we remove all children and
create the conditions for CloudStack to delete the volume - all OK - no
agent crash in 20 minutes...

We can't connect this action - the volume delete - with the agent crash...
Also we don't understand why +-20 minutes need to pass, and only then the
agent crashes...

From the logs, before the crash - only GetVMStats... And then - the agent
started...

2015-10-29 09:21:55,143 DEBUG [cloud.agent.Agent] (UgentTask-5:null)
Sending ping: Seq 4-1343:  { Cmd , MgmtId: -1, via: 4, Ver: v1, Flags: 11,
[{"com.cloud.agent.api.PingRoutingCommand":{"newStates":{},"_hostVmStateReport":{"i-881-1117-VM":{"state":"PowerOn","host":"
cs2.anolim.net"},"i-7-106-VM":{"state":"PowerOn","host":"cs2.anolim.net
"},"i-1683-1984-VM":{"state":"PowerOn","host":"cs2.anolim.net
"},"i-11-504-VM":{"state":"PowerOn","host":"cs2.anolim.net
"},"i-325-616-VM":{"state":"PowerOn","host":"cs2.anolim.net
"},"i-10-52-VM":{"state":"PowerOn","host":"cs2.anolim.net
"},"i-941-1237-VM":{"state":"PowerOn","host":"cs2.anolim.net"}},"_gatewayAccessible":true,"_vnetAccessible":true,"hostType":"Routing","hostId":4,"wait":0}}]
}
2015-10-29 09:21:55,149 DEBUG [cloud.agent.Agent] (Agent-Handler-3:null)
Received 

Re: [ceph-users] Cloudstack agent crashed JVM with exception in librbd

2015-10-29 Thread Voloshanenko Igor
From all we analyzed, it looks like it's this issue:
http://tracker.ceph.com/issues/13045

PR: https://github.com/ceph/ceph/pull/6097

Can anyone help us to confirm this? :)


Re: [ceph-users] Cloudstack agent crashed JVM with exception in librbd

2015-10-29 Thread Voloshanenko Igor
Additional trace:

#0  0x7f30f9891cc9 in __GI_raise (sig=sig@entry=6) at
../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x7f30f98950d8 in __GI_abort () at abort.c:89
#2  0x7f30f87b36b5 in __gnu_cxx::__verbose_terminate_handler() () from
/usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x7f30f87b1836 in ?? () from
/usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x7f30f87b1863 in std::terminate() () from
/usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x7f30f87b1aa2 in __cxa_throw () from
/usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x7f2fddb50778 in ceph::__ceph_assert_fail
(assertion=assertion@entry=0x7f2fdddeca05 "sub < m_subsys.size()",
file=file@entry=0x7f2fdddec9f0 "./log/SubsystemMap.h", line=line@entry=62,
func=func@entry=0x7f2fdddedba0
<_ZZN4ceph3log12SubsystemMap13should_gatherEjiE19__PRETTY_FUNCTION__> "bool
ceph::log::SubsystemMap::should_gather(unsigned int, int)") at
common/assert.cc:77
#7  0x7f2fdda1fed2 in ceph::log::SubsystemMap::should_gather
(level=<optimized out>, sub=<optimized out>, this=<optimized out>)
at ./log/SubsystemMap.h:62
#8  0x7f2fdda3b693 in ceph::log::SubsystemMap::should_gather
(this=<optimized out>, sub=<optimized out>, level=<optimized out>)
at ./log/SubsystemMap.h:61
#9  0x7f2fddd879be in ObjectCacher::flusher_entry (this=0x7f2ff80b27a0)
at osdc/ObjectCacher.cc:1527
#10 0x7f2fddd9851d in ObjectCacher::FlusherThread::entry
(this=<optimized out>) at osdc/ObjectCacher.h:374
#11 0x7f30f9c28182 in start_thread (arg=0x7f2e1a7fc700) at
pthread_create.c:312
#12 0x7f30f995547d in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:111
