It took 30 minutes for the watcher to time out after an ungraceful restart. Is 
there a way to limit it to something a bit more reasonable, like 1-3 minutes?
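
If the OSD-side "osd_client_watch_timeout" setting that Jason mentions further
down in the thread is the relevant knob, a ceph.conf fragment like the
following on the OSD host might be a starting point for testing a shorter
timeout. This is a sketch, not a verified fix: the value is in seconds, and
whether this setting governs the kernel client's stale-watcher expiry in this
scenario is an assumption on my part.

```ini
[osd]
# Hypothetical: shorten the client watch timeout to 60 seconds.
# Untested against Luminous 12.2.2; restart the OSDs after changing.
osd client watch timeout = 60
```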

On 2017-12-20, 12:01 PM, "Serguei Bezverkhi (sbezverk)" <[email protected]> 
wrote:

    OK, here is what I found out. If I gracefully kill a pod, the watcher gets 
    properly cleared, but if it is killed ungracefully, without “rbd unmap”, 
    then even after a node reboot the watcher stays up for a long time. It has 
    been more than 20 minutes and it is still active (no kubernetes services 
    are running).
    
    I was wondering if you would accept the following solution: in rbdStatus, 
    instead of checking just for a watcher, we also check for the existence of 
    /dev/rbd/{pool}/{image}. If it is not there, that would mean the watcher is 
    stale and it is safe to map the image. I would appreciate your thoughts here.
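
    [As a rough shell sketch of that proposal (hypothetical; it assumes the
    kernel rbd udev rules create /dev/rbd/{pool}/{image} symlinks for mapped
    images, and uses the pool and image names from the transcript below):]

```shell
# check_mapping: decide whether a listed watcher should be trusted, based
# on whether the image is actually mapped on this node. Prints "mapped" if
# the /dev/rbd symlink exists, "stale" otherwise.
check_mapping() {
    pool="$1"
    image="$2"
    if [ -e "/dev/rbd/${pool}/${image}" ]; then
        echo "mapped"   # image mapped locally: watcher is live
    else
        echo "stale"    # no local mapping: watcher is likely stale
    fi
}

check_mapping kubernetes raw-volume
```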
    
    Thank you
    Serguei
    
    On 2017-12-20, 11:32 AM, "Serguei Bezverkhi (sbezverk)" 
<[email protected]> wrote:
    
        
        
        On 2017-12-20, 11:17 AM, "Jason Dillaman" <[email protected]> wrote:
        
            On Wed, Dec 20, 2017 at 11:01 AM, Serguei Bezverkhi (sbezverk)
            <[email protected]> wrote:
            > Hello Jason, thank you for your prompt reply.
            >
            > My setup is very simple: one CentOS 7.4 VM acting as the storage 
            > node, running the latest 12.2.2 Luminous, and a second Ubuntu 
            > 16.04.3 VM (192.168.80.235) where I run a local kubernetes cluster 
            > built from master.
            >
            > On the client side I have ceph-common installed, and I copied the 
            > config and keyrings from the storage node to /etc/ceph.
            >
            > While running my PR I noticed that rbd map was failing on a just 
            > rebooted VM because rbdStatus was finding an active watcher. Even 
            > adding 30 seconds did not help, as it was not timing out at all, 
            > even with no image mapped.
            
            OK -- but how did you get two different watchers listed? That 
implies
            the first one timed out at some point in time. Does the watcher
            eventually go away if you shut down all k8s processes on
        
        I cannot say why there are two different watchers; I was just capturing 
        info and was not aware of it until you pointed it out. I just checked 
        the VM and the watcher finally timed out. I cannot say how long it 
        took, but I will run another set of tests to find out.
        
            192.168.80.235?  Are you overriding the "osd_client_watch_timeout"
            configuration setting somewhere on the OSD host?
        
        No, no changes to default values were done.
            
        > Regarding your format 1 comment: I tried using format v2, and it was 
        > failing to map due to differences in capabilities, so per rootfs's 
        > suggestion I switched back to v1. Once the watcher issue is resolved 
        > I can switch back to v2 to show the exact issue I hit.
            >
            > Please let me know if you need any additional info.
            >
            > Thank you
            > Serguei
            >
            > On 2017-12-20, 10:39 AM, "Jason Dillaman" <[email protected]> 
wrote:
            >
            >     Can you please provide steps to repeat this scenario? What 
is/was the
            >     client running on the host at 192.168.80.235 and how did you 
shut down
            >     that client? In your PR [1], it showed a different client as 
a watcher
            >     ("192.168.80.235:0/34739158 client.64354 cookie=1"), so how 
did the
            >     previous entry get cleaned up?
            >
            >     BTW -- unrelated, but k8s should be creating RBD image format 
2 images
            >     [2]. Was that image created using an older version of k8s or 
did you
            >     override your settings to pick the deprecated v1 format?
            >
            >     [1] 
https://github.com/kubernetes/kubernetes/pull/56651#issuecomment-352850884
            >     [2] https://github.com/kubernetes/kubernetes/pull/51574
            >
            >     On Wed, Dec 20, 2017 at 10:24 AM, Serguei Bezverkhi (sbezverk)
            >     <[email protected]> wrote:
            >     > Hello,
            >     >
            >     > I hit an issue with the latest Luminous where a watcher is 
            >     > not timing out even though the image is not mapped. It seems 
            >     > something similar was reported in 2016; here is the link:
            >     > http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-August/012140.html
            >     > Has it been fixed? I would appreciate some help here.
            >     > Thank you
            >     > Serguei
            >     >
            >     > date; sudo rbd status raw-volume --pool kubernetes
            >     > Wed Dec 20 10:04:19 EST 2017
            >     > Watchers:
            >     >         watcher=192.168.80.235:0/3789045165 client.64439 
cookie=1
            >     > date; sudo rbd status raw-volume --pool kubernetes
            >     > Wed Dec 20 10:04:51 EST 2017
            >     > Watchers:
            >     >         watcher=192.168.80.235:0/3789045165 client.64439 
cookie=1
            >     > date; sudo rbd status raw-volume --pool kubernetes
            >     > Wed Dec 20 10:05:14 EST 2017
            >     > Watchers:
            >     >         watcher=192.168.80.235:0/3789045165 client.64439 
cookie=1
            >     >
            >     > date; sudo rbd status raw-volume --pool kubernetes
            >     > Wed Dec 20 10:07:24 EST 2017
            >     > Watchers:
            >     >         watcher=192.168.80.235:0/3789045165 client.64439 
cookie=1
            >     >
            >     > sudo ls /dev/rbd*
            >     > ls: cannot access '/dev/rbd*': No such file or directory
            >     >
            >     > sudo rbd info raw-volume --pool kubernetes
            >     > rbd image 'raw-volume':
            >     >         size 10240 MB in 2560 objects
            >     >         order 22 (4096 kB objects)
            >     >         block_name_prefix: rb.0.fafa.625558ec
            >     >         format: 1
            >     >
            >     >
            >     >
            >     > _______________________________________________
            >     > ceph-users mailing list
            >     > [email protected]
            >     > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
            >
            >
            >
            >     --
            >     Jason
            >
            >
            
            
            
            -- 
            Jason
            
        
        
    
    
