Re: ACS 4.5 - volume snapshots NOT removed from CEPH (only from Secondaryt NFS and DB)

Dmytro Shevchenko Thu, 17 Sep 2015 05:34:58 -0700

Nice work. I compiled and install new version into local mavenrepository, but now I can't compile Cloudstack with this library. Ichanged dependency version in pom file to new, but got next exceptionwhile compiling 'cloud-plugin-hypervisor-kvm':


Konsole output

[ERROR] Failed to execute goalorg.apache.maven.plugins:maven-compiler-plugin:2.5.1:compile(default-compile) on project cloud-plugin-hypervisor-kvm: Compilationfailure: Co

mpilation failure:

[ERROR] Picked up JAVA_TOOL_OPTIONS:-javaagent:/usr/share/java/jayatanaag.jar[ERROR]/home/dmytro.shevchenko/test/cloudstack/plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/resource/LibvirtComputingResource.java:[80,21]error: cannot find symbol

[ERROR] symbol:   class RadosException
[ERROR] location: package com.ceph.rados

After investigation I found that class RadosException was moved toKonsole output 'exceptions' subdirectory, but inLibvirtComputingResource.java it import as " importcom.ceph.rados.RadosException;". Question is next, if I want compilesome release version with this new changes, which way would bepreferred? change import path LibvirtComputingResource.java and otherplaces where this class is used to"com.ceph.rados.exceptions.RadosException"?



--
Best regards
Dmytro Shevchenko
[email protected]
skype: demonsh_mk



On 09/15/2015 03:11 PM, Wido den Hollander wrote:


On 15-09-15 13:56, Dmytro Shevchenko wrote:

Hello Wido, I saw you updated this code again. Maybe you know, what
procedure for rebuilding this library in Apache maven repository?
Because here http://repo.maven.apache.org/maven2/com/ceph/rados/ still
present only old 0.1.4 version and it's impossible recompile Cloudstack
with new patches.  Of cause we can download source code from Github,
compile it and replace 'jar' file on production, but this is dirty hack
and not acceptable for 'continues integration'.

It's up to me to do a new release of rados-java and I haven't done that
yet since I wanted to know for sure if the code works.

While writing some code for libvirt yesterday I came up with a better
solution for rados-java as well.

https://www.redhat.com/archives/libvir-list/2015-September/msg00458.html

For now you can replace 'rados.jar' on the production systems, but for
4.6 I want to make sure we depend on a new, to be released, version of
rados-java.

Wido

---
Best regards
Dmytro Shevchenko
[email protected]
skype: demonsh_mk


On 09/12/2015 06:16 PM, Wido den Hollander wrote:

On 09/11/2015 05:08 PM, Andrija Panic wrote:

THx a lot Wido !!! - we will patch this - For my understanding - is this
"temorary"solution - since it raises limit to 256 snaps ? Or am I
wrong ? I
mean, since we dont stil have proper snapshots removal etc, so after
i.e.
3-6months we will again have 256 snapshots of a single volume on CEPH ?

No, it will also work with >256 snapshots. I've tested it with 256 and
that worked fine. I see no reason why it won't work with 1024 or 2048
for example.

BTW we also have other exception, that causes same consequences - agent
disocnnecting and VMs going down...
As Dmytro explained, unprotecting snapshot causes same consequence...

  From my understanding, any RBD exception, might cause Agent to
disconnect
(or actually mgmt server to disconnect agent)...

Any clue on this, recommendation?

No, I don't have a clue. It could be that the job hangs somewhere inside
the Agent due to a uncaught exception though.

Thx a lot for fixing rados-java stuff !

You're welcome!

Wido

Andrija

On 11 September 2015 at 15:28, Wido den Hollander<[email protected]>
wrote:

On 11-09-15 14:43, Dmytro Shevchenko wrote:

Thanks a lot Wido! Any chance to find out why management server
decided
that it lost connection to agent after that exceptions? It's not so
critical as this bug with 16 snapshots, but during last week we catch
situation when Agent failed unprotect snapshot, rise exception and
this
is was a reason of disconnection a bit later after that. (It is not
clear why CS decided remove that volume, it was template with one
'gold'
snapshot with several active clones)

No, I didn't look at CS at all. I just spend the day improving the
RADOS
bindings.

Wido

On 09/11/2015 03:20 PM, Wido den Hollander wrote:

On 11-09-15 10:19, Wido den Hollander wrote:

On 10-09-15 23:15, Andrija Panic wrote:

Wido,

could you folow maybe what my colegue Dmytro just sent ?

Yes, seems logical.

Its not only matter of question fixing rados-java (16 snaps limit)
- it
seems that for any RBD exception, ACS will freak out...

No, a RbdException will be caught, but the Rados Bindings shouldn't
throw NegativeArraySizeException in any case.

That's the main problem.

Seems to be fixed with this commit:

https://github.com/ceph/rados-java/commit/5584f3961c95d998d2a9eff947a5b7b4d4ba0b64

Just tested it with 256 snapshots:

-------------------------------------------------------
    T E S T S
-------------------------------------------------------
Running com.ceph.rbd.TestRbd
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed:
521.014 sec

Results :

Tests run: 1, Failures: 0, Errors: 0, Skipped: 0

The bindings should now be capable of listing more then 16 snapshots.

You can build the bindings manually and replace rados.jar on your
running systems.

For 4.6 I'll try to get the updated rados-java included.

Wido

Wido

THx

On 10 September 2015 at 17:06, Dmytro Shevchenko <
[email protected]> wrote:

Hello everyone, some clarification about this. Configuration:
CS: 4.5.1
Primary storage: Ceph

Actually we have 2 separate bugs:

1. When you remove some volume with more then 16 snapshots
(doesn't
matter
destroyed or active - they always present on Ceph), on next
storage
garbage
collector cycle it invoke 'deletePhysicalDisk' from
LibvirtStorageAdaptor.java. On line 854 we calling list snapshots

from

external rados-java library and getting exception.

https://github.com/apache/cloudstack/blob/4.5.1/plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/LibvirtStorageAdaptor.java#L854

This exception do not catching in current function, but Agent DO
NOT CRASH
at this moment and continue working fine. Agent form proper
answer to
server and send it, text in answer - java stack trace. Log from
Agent side:

2015-09-10 02:32:35,312 DEBUG [kvm.storage.LibvirtStorageAdaptor]
(agentRequest-Handler-4:null) Trying to fetch storage pool
33ebaf83-5d09-3038-b63b-742e759a
992e from libvirt
2015-09-10 02:32:35,431 INFO  [kvm.storage.LibvirtStorageAdaptor]
(agentRequest-Handler-4:null) Attempting to remove volume
4c6a2092-056c-4446-a2ca-d6bba9f7f
7f8 from pool 33ebaf83-5d09-3038-b63b-742e759a992e
2015-09-10 02:32:35,431 INFO  [kvm.storage.LibvirtStorageAdaptor]
(agentRequest-Handler-4:null) Unprotecting and Removing RBD
snapshots of
image cloudstack-storage
/4c6a2092-056c-4446-a2ca-d6bba9f7f7f8 prior to removing the image
2015-09-10 02:32:35,436 DEBUG [kvm.storage.LibvirtStorageAdaptor]
(agentRequest-Handler-4:null) Succesfully connected to Ceph
cluster

at

10.10.1.26:6789
2015-09-10 02:32:35,454 DEBUG [kvm.storage.LibvirtStorageAdaptor]
(agentRequest-Handler-4:null) Fetching list of snapshots of RBD
image
cloudstack-storage/4c6a2092
-056c-4446-a2ca-d6bba9f7f7f8
2015-09-10 02:32:35,457 WARN  [cloud.agent.Agent]
(agentRequest-Handler-4:null) Caught:
java.lang.NegativeArraySizeException
           at com.ceph.rbd.RbdImage.snapList(Unknown Source)
           at

com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.deletePhysicalDisk(LibvirtStorageAdaptor.java:854)

at

com.cloud.hypervisor.kvm.storage.LibvirtStoragePool.deletePhysicalDisk(LibvirtStoragePool.java:175)

at

com.cloud.hypervisor.kvm.storage.KVMStorageProcessor.deleteVolume(KVMStorageProcessor.java:1206)

2015-09-10 02:32:35,458 DEBUG [cloud.agent.Agent]
(agentRequest-Handler-4:null) Seq 1-1743737480722513946:  { Ans: ,
MgmtId:
90520739779588, via: 1, Ver: v1,
    Flags: 10,

[{"com.cloud.agent.api.Answer":{"result":false,"details":"java.lang.NegativeArraySizeException\n\tat

com.ceph.rbd.RbdImage.snapList(Unknown Sourc
e)\n\tat

com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.deletePhysicalDisk(LibvirtStorageAdaptor.java:854)\n\tat

com.cloud.hypervisor.kvm.storage.Lib
virtStoragePool.deletePhysicalDisk(LibvirtStoragePool.java:175)\n\tat

com.cloud.hypervisor.kvm.storage.KVMStorageProcessor.deleteVolume(KVMStorageProcessor.j

ava:1206)\n\tat

com.cloud.storage.resource.StorageSubsystemCommandHandlerBase.execute(StorageSubsystemCommandHandlerBase.java:124)\n\tat

com.cloud.storage.re.....

so this volume and it snapshots never will be removed.


2. Second bug. Experimentally it has been found that after 50
minutes we
had exception on Agent, for some unknown reason Management server
decided
about it lost connection to this agent, start HA process and start
Agent
process again.
Log on Agent side:
2015-09-10 02:57:12,664 DEBUG
[kvm.resource.LibvirtComputingResource]
(agentRequest-Handler-2:null) Executing: /bin/bash -c free|grep
Mem:|awk
'{print $2}'
2015-09-10 02:57:12,667 DEBUG
[kvm.resource.LibvirtComputingResource]
(agentRequest-Handler-2:null) Execution is successful.
2015-09-10 02:57:40,502 DEBUG
[kvm.resource.LibvirtComputingResource]
(UgentTask-5:null) Executing:
/usr/share/cloudstack-common/scripts/vm/network/security_
group.py get_rule_logs_for_vms
2015-09-10 02:57:40,572 DEBUG
[kvm.resource.LibvirtComputingResource]
(UgentTask-5:null) Execution is successful.
2015-09-10 02:57:54,135 INFO  [cloud.agent.AgentShell] (main:null)
Agent
started
2015-09-10 02:57:54,136 INFO  [cloud.agent.AgentShell] (main:null)
Implementation Version is 4.5.1
2015-09-10 02:57:54,138 INFO  [cloud.agent.AgentShell] (main:null)
agent.properties found at /etc/cloudstack/agent/agent.properties
.....

Log on Server side:
2015-09-10 02:57:53,710 INFO  [c.c.a.m.AgentManagerImpl]
(AgentTaskPool-1:ctx-2127ada4) Investigating why host 1 has
disconnected
with event AgentDisconnecte
d
2015-09-10 02:57:53,714 DEBUG [c.c.a.m.AgentManagerImpl]
(AgentTaskPool-1:ctx-2127ada4) checking if agent (1) is alive
2015-09-10 02:57:53,723 DEBUG [c.c.a.t.Request]
(AgentTaskPool-1:ctx-2127ada4) Seq 1-1743737480722513988:
Sending {
Cmd ,
MgmtId: 90520739779588, via: 1(ix1
-c7-2), Ver: v1, Flags: 100011,
[{"com.cloud.agent.api.CheckHealthCommand":{"wait":50}}] }
2015-09-10 02:57:53,724 INFO  [c.c.a.m.AgentAttache]
(AgentTaskPool-1:ctx-2127ada4) Seq 1-1743737480722513988:
Unable to
send
due to Resource [Host:1] is unr
eachable: Host 1: Channel is closed
2015-09-10 02:57:53,724 DEBUG [c.c.a.m.AgentAttache]
(AgentTaskPool-1:ctx-2127ada4) Seq 1-1743737480722513988:
Cancelling.
2015-09-10 02:57:53,724 WARN  [c.c.a.m.AgentManagerImpl]
(AgentTaskPool-1:ctx-2127ada4) Resource [Host:1] is unreachable:
Host 1:
Channel is closed
2015-09-10 02:57:53,728 DEBUG [c.c.h.HighAvailabilityManagerImpl]
(AgentTaskPool-1:ctx-2127ada4) SimpleInvestigator unable to
determine the
state of the host
.  Moving on.


it's look like connection problem, but it appear only when we have
this
RBD exception on agent side and only with this node. I tried to
play with
"storage.cleanup.interval" parameter and set it to 5 minutes,
now we
getting exception on agent side every 5min, but disconnects still
happening
every 50min and I can't find out why.

On 09/10/2015 03:21 PM, Andrija Panic wrote:

Thx Wido,

I will have my colegue Igor and Dmytro join with details on this.

I agree we need fix upstream, that is the main purpose from our

side!

With this temp fix, we just avoid agent crashing (agent somehow
restarts
again fine :) ) but VMs also go down on that host, at least some
of them.

Do you see any lifecycle/workflow issue, if we implement deleting
SNAP
from CEPH after you SNAP a volume in ACS and sucsssfully move to
Secondary
NFS - or perhaps only delete SNAP from CEPH as a part of actuall

SNAP

deletion (when you manually or via scheduled snapshots, delete
snapshot
from DB and NFS) ? Maybe second option is better, I dont know how
you guys
handle this for regular NFS as primary storage etc...

Any guidance is most welcomed, and our team will try to code all
this.

Thx Wido again

On 10 September 2015 at 14:14, Wido den Hollander <[email protected]
<mailto:[email protected]>> wrote:

       On 10-09-15 14:07, Andrija Panic wrote:
       > Wido,
       >
       > part of code where you want to delete some volume,
checks if
       volume is type
       > RBD - and then tries to list snapshots, delete
snapshtos, and
       finally
       > remove image. Here first step- Listing snapshtos-
fails, if
       there are more
       > than 16 snapshtos present - number 16 is hardcoded in
elsewhere
       part of
       > code and throws RBD exception...then agent crashes... and

then

       VMs goe down
       > etc.
       >

       Hmmm, that seems like a bug in rados-java indeed. I don't
know if
       there
       is a release of rados-java where this is fixed in.

       Looking at the code of rados-java it should, but I'm not
100%
certain.

       > So our current way as quick fix is to invoke external
script
       which will
       > also list and remove all snapshtos, but will not fail.
       >

       Yes, but we should fix it upstream. I understand that you
will use a
       temp script to clean up everything.

       > I'm not sure why is there 16 as the hardcoded limit -
will

try

       to provide
       > part of code where this is present...we can increase this
number
       but it
       > doesn make any sense (from 16 to i.e. 200), since we
still
have
       lot of
       > garbage left on CEPH (snapshtos that were removed in ACS
(DB and
       Secondary
       > NFS) - but not removed from CEPH. And in my understanding

this

       needs to be
       > implemented, so we dont catch any exceptions that I

originally

       described...
       >
       > Any thoughts on this ?
       >

       A cleanup script for now should be OK indeed. Afterwards
the
Java code
       should be able to do this.

       You can try manually by using rados-java and fix that.

       This is the part where the listing is done:

https://github.com/ceph/rados-java/blob/master/src/main/java/com/ceph/rbd/RbdImage.java

       Wido

       > Thx for input!
       >
       > On 10 September 2015 at 13:56, Wido den Hollander
       <[email protected]  <mailto:[email protected]>> wrote:
       >
       >>
       >>
       >> On 10-09-15 12:17, Andrija Panic wrote:
       >>> We are testing some [dirty?] patch on our dev system
and we
       shall soon
       >>> share it for review.
       >>>
       >>> Basically, we are using external python script that is
invoked
       in some
       >> part
       >>> of code execution to delete needed CEPH snapshots
and then
       after that
       >>> proceeds with volume deleteion etc...
       >>>
       >>
       >> That shouldn't be required. The Java bindings for
librbd and
       librados
       >> should be able to remove the snapshots.
       >>
       >> There is no need to invoke external code, this can
all be
       handled in Java.
       >>
       >>> On 10 September 2015 at 11:26, Andrija Panic
       <[email protected]  <mailto:[email protected]>>
       >>> wrote:
       >>>
       >>>> Eh, OK. Thx for the info.
       >>>>
       >>>> BTW why is 16 snapshot limits hardcoded - any
reason for
that ?
       >>>>
       >>>> Not cleaning snapshots on CEPH and trying to delete
volume
       after having
       >>>> more than 16 snapshtos in CEPH = Agent crashing on KVM
       side...and some
       >> VMs
       >>>> being rebooted etc - which means downtime :|
       >>>>
       >>>> Thanks,
       >>>>
       >>>> On 9 September 2015 at 22:05, Simon Weller <

[email protected]

       <mailto:[email protected]>> wrote:
       >>>>
       >>>>> Andrija,
       >>>>>
       >>>>> The Ceph snapshot deletion is not currently
implemented.
       >>>>>
       >>>>> See:

https://issues.apache.org/jira/browse/CLOUDSTACK-8302

       >>>>>
       >>>>> - Si
       >>>>>
       >>>>> ________________________________________
       >>>>> From: Andrija Panic<[email protected]>
       >>>>> Sent: Wednesday, September 9, 2015 3:03 PM
       >>>>> To:[email protected]
       <mailto:[email protected]>;

[email protected]

       <mailto:[email protected]>

       >>>>> Subject: ACS 4.5 - volume snapshots NOT removed
from CEPH
       (only from
       >>>>> Secondaryt NFS and DB)
       >>>>>
       >>>>> Hi folks,
       >>>>>
       >>>>> we enounter issue in ACS 4.5.1 (perhaps other
versions
also
       affected) -
       >>>>> when we delete some snapshot (volume snapshot) in
ACS,

ACS

       marks it as
       >>>>> deleted in DB, deletes from NFS Secondary Storage but
fails
       to delete
       >>>>> snapshot on CEPH primary storage (doesn even try to

delete

       it AFAIK)
       >>>>>
       >>>>> So we end up having 5 live snapshots in DB (just
example)
       but actually
       >> in
       >>>>> CEPH there are more than i.e. 16 snapshots.
       >>>>>
       >>>>> More of the issue, when ACS agent tries to obtain
list of
       snapshots
       >> from
       >>>>> CEPH for some volume or so - if number of
snapshots is
over
       16, it
       >> raises
       >>>>> exception  (and perhaps this is the reason Agent
crashed for
       us - need
       >> to
       >>>>> check with my colegues who are investigatin this in
       details). This
       >> number
       >>>>> 16 is for whatever reasons hardcoded in ACS code.
       >>>>>
       >>>>> Wondering if anyone experienced this, or have any
info
- we
       plan to
       >> try to
       >>>>> fix this, and I will inlcude my dev colegues here,
but we
       might need
       >> some
       >>>>> help at least for guidance or-
       >>>>>
       >>>>> Any help is really apreaciated or at list
confirmation
that
       this is
       >> known
       >>>>> issue etc.
       >>>>>
       >>>>> Thanks,
       >>>>>
       >>>>> --
       >>>>>
       >>>>> Andrija Panić
       >>>>>
       >>>>
       >>>>
       >>>>
       >>>> --
       >>>>
       >>>> Andrija Panić
       >>>>
       >>>
       >>>
       >>>
       >>
       >
       >
       >




--

Andrija Panić

--
---
Best regards
Dmytro Shevchenko
[email protected]
skype: demonsh_mk
+380(66)2426648

Re: ACS 4.5 - volume snapshots NOT removed from CEPH (only from Secondaryt NFS and DB)

Reply via email to