Re: [ceph-users] rbd rm snap on image with exclusive lock

2017-10-25 Thread Jason Dillaman
Thanks -- let me know. In the future, you may want to consider having
librbd create an admin socket so that you can change (certain)
settings or interact w/ the process w/o restarting it.
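A minimal sketch of that setup (the client name, socket path and PID below
are placeholders, not taken from this thread): enable an admin socket for
librbd clients in ceph.conf on the client host, e.g.

[client]
    admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok
    log file = /var/log/ceph/$cluster-$type.$id.$pid.log

Once the client process has picked that up (it does have to be restarted
once for the socket itself to appear), settings such as the rbd debug level
can then be changed at runtime:

ceph --admin-daemon /var/run/ceph/ceph-client.cinder.12345.94113632415744.asok \
    config set debug_rbd 20
ceph --admin-daemon /var/run/ceph/ceph-client.cinder.12345.94113632415744.asok \
    config show | grep debug_rbd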

On Wed, Oct 25, 2017 at 9:54 AM, Piotr Dałek  wrote:
> On 17-10-25 03:30 PM, Jason Dillaman wrote:
>>
>> Hmm, hard to say off the top of my head. If you could enable "debug
>> librbd = 20" logging on the buggy client that owns the lock, create a
>> new snapshot, and attempt to delete it, it would be interesting to
>> verify that the image is being properly refreshed.
>
>
> I'd love to, but that would require us to restart that client - not an
> option. We'll try to reproduce this somehow anyway and let you know if
> something interesting shows up.
>
>
> --
> Piotr Dałek
> piotr.da...@corp.ovh.com
> https://www.ovh.com/us/



-- 
Jason


Re: [ceph-users] rbd rm snap on image with exclusive lock

2017-10-25 Thread Piotr Dałek

On 17-10-25 03:30 PM, Jason Dillaman wrote:

Hmm, hard to say off the top of my head. If you could enable "debug
librbd = 20" logging on the buggy client that owns the lock, create a
new snapshot, and attempt to delete it, it would be interesting to
verify that the image is being properly refreshed.


I'd love to, but that would require us to restart that client - not an 
option. We'll try to reproduce this somehow anyway and let you know if 
something interesting shows up.


--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovh.com/us/


Re: [ceph-users] rbd rm snap on image with exclusive lock

2017-10-25 Thread Jason Dillaman
Hmm, hard to say off the top of my head. If you could enable "debug
librbd = 20" logging on the buggy client that owns the lock, create a
new snapshot, and attempt to delete it, it would be interesting to
verify that the image is being properly refreshed.
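A rough sketch of that test (pool, image and snapshot names below are
placeholders, not from this report), run from another host while the
lock-owning client has the higher debug level active:

rbd snap create volumes/myimage@refresh-test
rbd snap ls volumes/myimage                   # new snapshot should be listed
rbd snap rm volumes/myimage@refresh-test
rbd snap ls volumes/myimage                   # check whether it is really gone

The interesting part is then the lock owner's log: whether it refreshes the
image and sees the new snapshot before answering the proxied removal request.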

On Wed, Oct 25, 2017 at 9:23 AM, Piotr Dałek  wrote:
> On 17-10-25 02:39 PM, Jason Dillaman wrote:
>>
>> That log shows that a snap remove request was sent from a client that
>> couldn't acquire the lock to the client that currently owns the lock.
>> The client that owns the lock responded w/ an -ENOENT error indicating
>> that the snapshot doesn't exist. Depending on the maintenance operation
>> requested, different error codes are filtered out to handle the case
>> where Ceph delivers the request message to the lock owner twice (or
>> more). Normally this isn't an issue since the local client pre-checks
>> the image state before sending the RPC message (i.e. snap remove will
>> first locally ensure the snap exists and respond w/ -ENOENT if it
>> doesn't).
>>
>> Therefore, in this case, the question is who this rogue client is that
>> still owns the lock and is responding to the snap remove request, but
>> hasn't refreshed its state to know that the snapshot exists.
>
>
> Thanks, that makes things clear.
>
> Seems like we have some Cinder instances using Infernalis (9.2.1) librbd.
> Are you aware of any bugs in 9.2.x that could cause such behavior? This is
> the first time we've seen it...
>
> --
> Piotr Dałek
> piotr.da...@corp.ovh.com
> https://www.ovh.com/us/



-- 
Jason


Re: [ceph-users] rbd rm snap on image with exclusive lock

2017-10-25 Thread Piotr Dałek

On 17-10-25 02:39 PM, Jason Dillaman wrote:

That log shows that a snap remove request was sent from a client that
couldn't acquire the lock to the client that currently owns the lock. The
client that owns the lock responded w/ an -ENOENT error indicating that the
snapshot doesn't exist. Depending on the maintenance operation requested,
different error codes are filtered out to handle the case where Ceph
delivers the request message to the lock owner twice (or more). Normally
this isn't an issue since the local client pre-checks the image state
before sending the RPC message (i.e. snap remove will first locally ensure
the snap exists and respond w/ -ENOENT if it doesn't).

Therefore, in this case, the question is who this rogue client is that
still owns the lock and is responding to the snap remove request, but
hasn't refreshed its state to know that the snapshot exists.


Thanks, that makes things clear.

Seems like we have some Cinder instances using Infernalis (9.2.1) librbd.
Are you aware of any bugs in 9.2.x that could cause such behavior? This is
the first time we've seen it...
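One way to narrow down which client that is (a rough sketch; pool and image
names are placeholders, not from this report) is to look at who currently
holds the exclusive lock and who is watching the image header:

rbd lock ls volumes/myimage      # current lock holder: client id and address
rbd info volumes/myimage         # block_name_prefix contains the image id
rados -p volumes listwatchers rbd_header.<image id>   # watchers and addresses

The client id/address reported there can then be matched against the Cinder
hosts running the old librbd.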


--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovh.com/us/


Re: [ceph-users] rbd rm snap on image with exclusive lock

2017-10-25 Thread Jason Dillaman
That log shows that a snap remove request was sent from a client that
couldn't acquire the lock to the client that currently owns the lock. The
client that owns the lock responded w/ an -ENOENT error indicating that the
snapshot doesn't exist. Depending on the maintenance operation requested,
different error codes are filtered out to handle the case where Ceph
delivers the request message to the lock owner twice (or more). Normally
this isn't an issue since the local client pre-checks the image state
before sending the RPC message (i.e. snap remove will first locally ensure
the snap exists and respond w/ -ENOENT if it doesn't).

Therefore, in this case, the question is who this rogue client is that
still owns the lock and is responding to the snap remove request, but
hasn't refreshed its state to know that the snapshot exists.
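Roughly, the flow described above can be sketched as follows. This is an
illustrative, self-contained example with made-up names and stubbed helpers,
not the actual librbd code:

// Sketch of the proxied snap-remove flow described above; names and the
// stubbed helpers are illustrative only, not the real librbd API.
#include <cerrno>
#include <iostream>
#include <string>

struct ImageCtx {
  bool lock_owner = false;         // does this client hold the exclusive lock?
  bool snap_in_local_view = true;  // local (possibly stale) image state
};

// Stub helpers standing in for the real cluster operations.
static bool local_snap_exists(const ImageCtx& ictx, const std::string&) {
  return ictx.snap_in_local_view;
}
static int remove_snap_as_owner(ImageCtx&, const std::string&) { return 0; }
static int send_snap_remove_rpc(ImageCtx&, const std::string&) {
  // A stale lock owner that never refreshed would answer -ENOENT here.
  return -ENOENT;
}

static int snap_remove(ImageCtx& ictx, const std::string& snap) {
  // 1. Local pre-check: if the local view says the snapshot doesn't exist,
  //    fail right away with -ENOENT; no request is sent to the lock owner.
  if (!local_snap_exists(ictx, snap))
    return -ENOENT;

  // 2a. Lock owner: remove the snapshot directly.
  if (ictx.lock_owner)
    return remove_snap_as_owner(ictx, snap);

  // 2b. Not the lock owner: proxy the request to the current owner.
  int r = send_snap_remove_rpc(ictx, snap);

  // 3. The request may reach the owner more than once; a repeated delivery
  //    would find the snapshot already gone and answer -ENOENT, so that
  //    code is treated as success.  The same filter hides the case where a
  //    stale owner answers -ENOENT because it never refreshed its state.
  if (r == -ENOENT)
    return 0;
  return r;
}

int main() {
  ImageCtx ictx;  // non-owner talking to a stale lock owner
  std::cout << "snap_remove returned " << snap_remove(ictx, "snap1") << "\n";
  return 0;
}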

On Wed, Oct 25, 2017 at 3:33 AM, Piotr Dałek  wrote:
> Hello,
>
> I have a 10.2.5 Ceph cluster with an image whose exclusive lock is held by
> a client. Some other client creates a snapshot on that image, then (that
> client) goes away. Later, a third client attempts to remove that snapshot
> using rbd snap rm, but it fails to do so, without any error being reported:
>
> 2017-10-24 09:50:29.585675 7f754cbb1d80  5 librbd::AioImageRequestWQ:
> 0x7f75579349f0 : ictx=0x7f7557932f50
> 2017-10-24 09:50:29.585759 7f754cbb1d80 20 librbd::ImageState: open
> 2017-10-24 09:50:29.585761 7f754cbb1d80 10 librbd::ImageState:
> 0x7f7557933d90 send_open_unlock
> 2017-10-24 09:50:29.585767 7f754cbb1d80 10 librbd::image::OpenRequest:
> 0x7f7557935360 send_v2_detect_header
> 2017-10-24 09:50:29.588006 7f752e205700 10 librbd::image::OpenRequest:
> handle_v2_detect_header: r=0
> 2017-10-24 09:50:29.588018 7f752e205700 10 librbd::image::OpenRequest:
> 0x7f7557935360 send_v2_get_id
> 2017-10-24 09:50:29.589679 7f752e205700 10 librbd::image::OpenRequest:
> handle_v2_get_id: r=0
> 2017-10-24 09:50:29.589698 7f752e205700 10 librbd::image::OpenRequest:
> 0x7f7557935360 send_v2_get_immutable_metadata
> 2017-10-24 09:50:29.598466 7f752e205700 10 librbd::image::OpenRequest:
> handle_v2_get_immutable_metadata: r=0
> 2017-10-24 09:50:29.598481 7f752e205700 10 librbd::image::OpenRequest:
> 0x7f7557935360 send_v2_get_stripe_unit_count
> 2017-10-24 09:50:29.601050 7f752e205700 10 librbd::image::OpenRequest:
> handle_v2_get_stripe_unit_count: r=-8
> 2017-10-24 09:50:29.601074 7f752e205700 10 librbd::ImageCtx: init_layout
> stripe_unit 4194304 stripe_count 1 object_size 4194304 prefix
> rbd_data.[redacted] format rbd_data.[redacted].%016llx
> 2017-10-24 09:50:29.601077 7f752e205700 10 librbd::image::OpenRequest:
> 0x7f7557935360 send_v2_apply_metadata: start_key=conf_
> 2017-10-24 09:50:29.603277 7f752e205700 10 librbd::image::OpenRequest:
> 0x7f7557935360 handle_v2_apply_metadata: r=0
> 2017-10-24 09:50:29.603294 7f752e205700 20 librbd::ImageCtx: apply_metadata
> 2017-10-24 09:50:29.603425 7f752e205700 20 librbd::ImageCtx: enabling
> caching...
> 2017-10-24 09:50:29.603432 7f752e205700 20 librbd::ImageCtx: Initial cache
> settings: size=33554432 num_objects=10 max_dirty=25165824
> target_dirty=16777216 max_dirty_age=1
> 2017-10-24 09:50:29.603535 7f752e205700 10 librbd::ImageCtx:  cache bytes
> 33554432 -> about 1048 objects
> 2017-10-24 09:50:29.603790 7f752e205700 10 librbd::image::OpenRequest:
> 0x7f7557935360 send_register_watch
> 2017-10-24 09:50:29.603867 7f752e205700 10 librbd::ImageWatcher:
> 0x7f75140017e0 registering image watcher
> 2017-10-24 09:50:29.606497 7f752e205700 10 librbd::image::OpenRequest:
> 0x7f7557935360 handle_register_watch: r=0
> 2017-10-24 09:50:29.606513 7f752e205700 10 librbd::image::OpenRequest:
> 0x7f7557935360 send_refresh
> 2017-10-24 09:50:29.606518 7f752e205700 10 librbd::image::RefreshRequest:
> 0x7f7514008ab0 send_v2_get_mutable_metadata
> 2017-10-24 09:50:29.619931 7f752e205700 10 librbd::image::RefreshRequest:
> 0x7f7514008ab0 handle_v2_get_mutable_metadata: r=0
> 2017-10-24 09:50:29.619964 7f752e205700 10 librbd::image::RefreshRequest:
> 0x7f7514008ab0 send_v2_get_flags
> 2017-10-24 09:50:29.626218 7f752e205700 10 librbd::image::RefreshRequest:
> 0x7f7514008ab0 handle_v2_get_flags: r=0
> 2017-10-24 09:50:29.626239 7f752e205700 10 librbd::image::RefreshRequest:
> 0x7f7514008ab0 send_v2_get_snapshots
> 2017-10-24 09:50:29.647141 7f752e205700 10 librbd::image::RefreshRequest:
> 0x7f7514008ab0 handle_v2_get_snapshots: r=0
> 2017-10-24 09:50:29.647164 7f752e205700 10 librbd::image::RefreshRequest:
> 0x7f7514008ab0 send_v2_init_exclusive_lock
> 2017-10-24 09:50:29.647173 7f752e205700 10 librbd::ExclusiveLock:
> 0x7f7514013510 init
> 2017-10-24 09:50:29.647177 7f752e205700  5 librbd::AioImageRequestWQ:
> block_writes: 0x7f7557932f50, num=1
> 2017-10-24 09:50:29.647227 7f752da04700 10 librbd::ExclusiveLock:
> 0x7f7514013510 handle_init_complete
> 2017-10-24 09:50:29.647237 7f752da04700 10 librbd::image::RefreshRequest:
> 0x7f7514008ab0 handle_v2_init_exclusive_lock: 

[ceph-users] rbd rm snap on image with exclusive lock

2017-10-25 Thread Piotr Dałek

Hello,

I have a 10.2.5 Ceph cluster with an image whose exclusive lock is held by
a client. Some other client creates a snapshot on that image, then (that
client) goes away. Later, a third client attempts to remove that snapshot
using rbd snap rm, but it fails to do so, without any error being reported:


2017-10-24 09:50:29.585675 7f754cbb1d80  5 librbd::AioImageRequestWQ: 
0x7f75579349f0 : ictx=0x7f7557932f50

2017-10-24 09:50:29.585759 7f754cbb1d80 20 librbd::ImageState: open
2017-10-24 09:50:29.585761 7f754cbb1d80 10 librbd::ImageState: 
0x7f7557933d90 send_open_unlock
2017-10-24 09:50:29.585767 7f754cbb1d80 10 librbd::image::OpenRequest: 
0x7f7557935360 send_v2_detect_header
2017-10-24 09:50:29.588006 7f752e205700 10 librbd::image::OpenRequest: 
handle_v2_detect_header: r=0
2017-10-24 09:50:29.588018 7f752e205700 10 librbd::image::OpenRequest: 
0x7f7557935360 send_v2_get_id
2017-10-24 09:50:29.589679 7f752e205700 10 librbd::image::OpenRequest: 
handle_v2_get_id: r=0
2017-10-24 09:50:29.589698 7f752e205700 10 librbd::image::OpenRequest: 
0x7f7557935360 send_v2_get_immutable_metadata
2017-10-24 09:50:29.598466 7f752e205700 10 librbd::image::OpenRequest: 
handle_v2_get_immutable_metadata: r=0
2017-10-24 09:50:29.598481 7f752e205700 10 librbd::image::OpenRequest: 
0x7f7557935360 send_v2_get_stripe_unit_count
2017-10-24 09:50:29.601050 7f752e205700 10 librbd::image::OpenRequest: 
handle_v2_get_stripe_unit_count: r=-8
2017-10-24 09:50:29.601074 7f752e205700 10 librbd::ImageCtx: init_layout 
stripe_unit 4194304 stripe_count 1 object_size 4194304 prefix 
rbd_data.[redacted] format rbd_data.[redacted].%016llx
2017-10-24 09:50:29.601077 7f752e205700 10 librbd::image::OpenRequest: 
0x7f7557935360 send_v2_apply_metadata: start_key=conf_
2017-10-24 09:50:29.603277 7f752e205700 10 librbd::image::OpenRequest: 
0x7f7557935360 handle_v2_apply_metadata: r=0

2017-10-24 09:50:29.603294 7f752e205700 20 librbd::ImageCtx: apply_metadata
2017-10-24 09:50:29.603425 7f752e205700 20 librbd::ImageCtx: enabling caching...
2017-10-24 09:50:29.603432 7f752e205700 20 librbd::ImageCtx: Initial cache 
settings: size=33554432 num_objects=10 max_dirty=25165824 
target_dirty=16777216 max_dirty_age=1
2017-10-24 09:50:29.603535 7f752e205700 10 librbd::ImageCtx:  cache bytes 
33554432 -> about 1048 objects
2017-10-24 09:50:29.603790 7f752e205700 10 librbd::image::OpenRequest: 
0x7f7557935360 send_register_watch
2017-10-24 09:50:29.603867 7f752e205700 10 librbd::ImageWatcher: 
0x7f75140017e0 registering image watcher
2017-10-24 09:50:29.606497 7f752e205700 10 librbd::image::OpenRequest: 
0x7f7557935360 handle_register_watch: r=0
2017-10-24 09:50:29.606513 7f752e205700 10 librbd::image::OpenRequest: 
0x7f7557935360 send_refresh
2017-10-24 09:50:29.606518 7f752e205700 10 librbd::image::RefreshRequest: 
0x7f7514008ab0 send_v2_get_mutable_metadata
2017-10-24 09:50:29.619931 7f752e205700 10 librbd::image::RefreshRequest: 
0x7f7514008ab0 handle_v2_get_mutable_metadata: r=0
2017-10-24 09:50:29.619964 7f752e205700 10 librbd::image::RefreshRequest: 
0x7f7514008ab0 send_v2_get_flags
2017-10-24 09:50:29.626218 7f752e205700 10 librbd::image::RefreshRequest: 
0x7f7514008ab0 handle_v2_get_flags: r=0
2017-10-24 09:50:29.626239 7f752e205700 10 librbd::image::RefreshRequest: 
0x7f7514008ab0 send_v2_get_snapshots
2017-10-24 09:50:29.647141 7f752e205700 10 librbd::image::RefreshRequest: 
0x7f7514008ab0 handle_v2_get_snapshots: r=0
2017-10-24 09:50:29.647164 7f752e205700 10 librbd::image::RefreshRequest: 
0x7f7514008ab0 send_v2_init_exclusive_lock
2017-10-24 09:50:29.647173 7f752e205700 10 librbd::ExclusiveLock: 
0x7f7514013510 init
2017-10-24 09:50:29.647177 7f752e205700  5 librbd::AioImageRequestWQ: 
block_writes: 0x7f7557932f50, num=1
2017-10-24 09:50:29.647227 7f752da04700 10 librbd::ExclusiveLock: 
0x7f7514013510 handle_init_complete
2017-10-24 09:50:29.647237 7f752da04700 10 librbd::image::RefreshRequest: 
0x7f7514008ab0 handle_v2_init_exclusive_lock: r=0
2017-10-24 09:50:29.647239 7f752da04700 10 librbd::image::RefreshRequest: 
0x7f7514008ab0 send_v2_apply
2017-10-24 09:50:29.647242 7f752da04700 10 librbd::image::RefreshRequest: 
0x7f7514008ab0 handle_v2_apply
2017-10-24 09:50:29.647246 7f752da04700 20 librbd::image::RefreshRequest: 
0x7f7514008ab0 apply
2017-10-24 09:50:29.647248 7f752da04700 20 librbd::image::RefreshRequest: 
new snapshot id=272717 name=[redacted name] size=107374182400
2017-10-24 09:50:29.647257 7f752da04700 20 librbd::image::RefreshRequest: 
new snapshot id=240459 name=[redacted name#2] size=107374182400
2017-10-24 09:50:29.647270 7f752da04700 10 librbd::image::RefreshRequest: 
0x7f7514008ab0 send_flush_aio
2017-10-24 09:50:29.647272 7f752da04700 10 librbd::image::RefreshRequest: 
0x7f7514008ab0 handle_flush_aio: r=0
2017-10-24 09:50:29.647276 7f752da04700 10 librbd::image::OpenRequest: 
handle_refresh: r=0
2017-10-24 09:50:29.647278 7f752da04700 10 librbd::ImageState: 
0x7f7557933d90 handle_open: r=0
2017-10-24 09:50:29.647328