Hello,
On Tue, Jan 10, 2017 at 11:11 PM, Nick Fisk <n...@fisk.me.uk> wrote:
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Daznis
>> Sent: 09 January 2017 12:54
>> To: ceph-users <ceph-users@lists.ceph.com>
>> Subject: [ceph-users] Ceph cache tier removal.
>>
>> Hello,
>>
>> I'm running preliminary tests of cache tier removal on a live cluster,
>> before I try to do it on a production one. I'm trying to avoid
>> downtime, but from what I've noticed it's either impossible or I'm
>> doing something wrong. My cluster is running CentOS 7.2 and Ceph
>> 0.94.9.
>>
>> Example 1:
>> I'm setting the cache layer to forward mode:
>> 1. ceph osd tier cache-mode test-cache forward
>> Then flushing the cache:
>> 1. rados -p test-cache cache-flush-evict-all
>> Then I get stuck with some objects that can't be removed:
>>
>> rbd_header.29c3cdb2ae8944a
>> failed to evict /rbd_header.29c3cdb2ae8944a: (16) Device or resource busy
>> rbd_header.28c96316763845e
>> failed to evict /rbd_header.28c96316763845e: (16) Device or resource busy
>> error from cache-flush-evict-all: (1) Operation not permitted
>>
>
> These are probably the objects which have watchers attached. The
> current evict logic seems to be unable to evict these, hence the
> error. I'm not sure if anything can be done to work around this other
> than what you have tried, i.e. stopping the VM, which will remove the
> watcher.

You can move them from the cache pool once you remove the tier overlay,
but I wasn't sure about data consistency, so I ran a few tests to
confirm. I spawned a few VMs that were just idling, a few that were
writing small files with consistent CRCs to disk, and a few that were
writing larger files to disk with the sync option. I ran the test
multiple times -- I don't remember the exact number, since I was really
waiting for a CRC mismatch or a general VM crash, but it was 20+ times.
You flush the cache a few times, until no new objects appear in it.
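That flush loop could be sketched roughly like this. This is only a
sketch, not something I ran as-is: the pool name test-cache is from the
example above, and you'd still want to check the output of
cache-flush-evict-all by hand for "Device or resource busy" errors:

```shell
#!/bin/sh
# Sketch: repeatedly flush/evict the cache tier until the object count
# stops shrinking. Assumes the example pool name from this thread.
POOL=test-cache
prev=-1
count=$(rados -p "$POOL" ls | wc -l)
while [ "$count" != "$prev" ]; do
    rados -p "$POOL" cache-flush-evict-all
    prev=$count
    count=$(rados -p "$POOL" ls | wc -l)
    echo "objects left in $POOL: $count"
done
```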
Then do a flush followed by overlay removal. After about a minute the
header objects will unlock and you will be able to flush them down to
cold storage. Once that was done I ran a CRC check on everything I was
verifying, so I'm pretty confident that I will not lose any data while
doing this on a live/production cluster. I will run a few more tests and
decide what to do then, and if I do this on production I will report the
progress. Maybe this will help others struggling with similar issues.

>
>> I found a workaround for this. You can bypass these errors by:
>> 1. running "ceph osd tier remove-overlay test-pool", or
>> 2. turning off the VMs that are using them.
>>
>> For the second option: I can boot the VMs normally after recreating a
>> new overlay/cache tier. At this point everything is working fine, but
>> I'm trying to avoid downtime, as it takes almost 8 hours to start
>> everything and check that it is in optimal condition.
>>
>> Now for the first part. I can remove the overlay and flush the cache
>> layer, and the VMs run fine with it removed. Issues start after I have
>> re-added the cache layer to the cold pool and try to write/read from
>> the disk. For no apparent reason the VMs just freeze, and you need to
>> force stop/start all VMs to get them working again.
>
> Which pool are the VMs being pointed at, base or cache? I'm wondering
> if it's something to do with the pool id changing?

They were pointing to the base pool. After reading about it online, I
found that I can add the tier back with live machines. You just need to
run these commands:

1. "ceph osd tier add cold-pool cache-pool --force-nonempty"
2. "ceph osd tier cache-mode cache-pool forward" <-- no other mode seems
   to work at this point, only forward.
3. "ceph osd tier set-overlay cold-pool cache-pool" <-- after you run
   this, the header objects should start appearing in the cache pool.
   You need to wait a while for all the rbd_header objects to reappear
   before switching the cache-mode again, or the VMs will crash.
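The re-add sequence above could be scripted roughly as follows. Again a
sketch only: pool names are the examples from this thread, and the sleep
interval is a guess at "wait a while", not a tested value:

```shell
#!/bin/sh
# Sketch of re-adding a non-empty cache tier to a live base pool,
# following the steps in this thread. Run against a test cluster first.
COLD=cold-pool
CACHE=cache-pool

# 1. Re-attach the (non-empty) cache pool as a tier of the cold pool.
ceph osd tier add "$COLD" "$CACHE" --force-nonempty

# 2. Only forward mode seems to work at this point.
ceph osd tier cache-mode "$CACHE" forward

# 3. Restore the overlay; header objects should start appearing in the
#    cache pool after this.
ceph osd tier set-overlay "$COLD" "$CACHE"

# Wait until rbd_header objects have reappeared before changing the
# cache-mode again -- per the thread, the VMs crash otherwise.
until rados -p "$CACHE" ls | grep -q '^rbd_header\.'; do
    sleep 10   # interval is a guess; the thread just says "a while"
done
```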
You can check with "rados -p cache-pool ls".

>
>> From what I have read about it, all objects should leave the cache
>> tier, and you shouldn't have to "force" removing a tier that still
>> contains objects.
>>
>> Now onto the questions:
>>
>> 1. Is it normal for VMs to freeze while adding a cache layer/tier?
>> 2. Do VMs need to be offline to remove the caching layer?
>> 3. I have read somewhere that snapshots might interfere with cache
>>    tier clean up. Is it true?
>> 4. Are there other ways to remove the caching tier on a live system?
>>
>> Regards,
>>
>> Darius
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Regards,
Darius