Hello,
On Tue, Jan 10, 2017 at 11:11 PM, Nick Fisk <n...@fisk.me.uk> wrote:
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Daznis
>> Sent: 09 January 2017 12:54
>> To: ceph-users <ceph-users@lists.ceph.com>
>> Subject: [ceph-users] Ceph cache tier removal.
>>
>> Hello,
>>
>> I'm running preliminary tests of cache tier removal on a live cluster,
>> before I try to do it on a production one. I'm trying to avoid
>> downtime, but from what I've noticed it's either impossible or I'm
>> doing something wrong. My cluster is running CentOS 7.2 and Ceph
>> 0.94.9.
>>
>> Example 1:
>> I'm setting the cache layer to forward mode:
>> 1. ceph osd tier cache-mode test-cache forward
>> Then flushing the cache:
>> 1. rados -p test-cache cache-flush-evict-all
>> Then I get stuck with some objects that can't be removed:
>>
>> rbd_header.29c3cdb2ae8944a
>> failed to evict /rbd_header.29c3cdb2ae8944a: (16) Device or resource busy
>> rbd_header.28c96316763845e
>> failed to evict /rbd_header.28c96316763845e: (16) Device or resource busy
>> error from cache-flush-evict-all: (1) Operation not permitted
>>
>
> These are probably the objects which have watchers attached. The
> current evict logic seems to be unable to evict these, hence the
> error. I'm not sure if anything can be done to work around this other
> than what you have tried, i.e. stopping the VM, which will remove the
> watcher.

You can move them from the cache pool once you remove the tier overlay,
but I wasn't sure about data consistency, so I ran a few tests to
confirm. I spawned a few VMs that were just idling, a few that were
writing small files with consistent CRCs to disk, and a few that were
writing larger files to disk with the sync option. I ran the test
multiple times -- I don't remember the exact number, since I was really
waiting for a CRC mismatch or a general VM crash, but it was 20+ times.
You flush the cache a few times, until no new objects appear in it.
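That flush loop could be sketched roughly like this. This is only a
sketch, not something I ran as-is: the pool name test-cache is from the
example above, and you'd still want to check the output of
cache-flush-evict-all by hand for "Device or resource busy" errors:

```shell
#!/bin/sh
# Sketch: repeatedly flush/evict the cache tier until the object count
# stops shrinking. Assumes the example pool name from this thread.
POOL=test-cache
prev=-1
count=$(rados -p "$POOL" ls | wc -l)
while [ "$count" != "$prev" ]; do
    rados -p "$POOL" cache-flush-evict-all
    prev=$count
    count=$(rados -p "$POOL" ls | wc -l)
    echo "objects left in $POOL: $count"
done
```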
Then do a flush followed by overlay removal. After about a minute the
header objects will unlock and you will be able to flush them down to
cold storage. Once that was done I ran a CRC check on everything I was
verifying, so I'm pretty confident that I will not lose any data while
doing this on a live/production cluster. I will run a few more tests and
decide what to do then, and if I do this on production I will report the
progress. Maybe this will help others struggling with similar issues.

>
>> I found a workaround for this. You can bypass these errors by:
>> 1. running "ceph osd tier remove-overlay test-pool", or
>> 2. turning off the VMs that are using them.
>>
>> For the second option: I can boot the VMs normally after recreating a
>> new overlay/cache tier. At this point everything is working fine, but
>> I'm trying to avoid downtime, as it takes almost 8 hours to start
>> everything and check that it is in optimal condition.
>>
>> Now for the first part. I can remove the overlay and flush the cache
>> layer, and the VMs run fine with it removed. Issues start after I have
>> re-added the cache layer to the cold pool and try to write/read from
>> the disk. For no apparent reason the VMs just freeze, and you need to
>> force stop/start all VMs to get them working again.
>
> Which pool are the VMs being pointed at, base or cache? I'm wondering
> if it's something to do with the pool id changing?

They were pointing to the base pool. After reading about it online, I
found that I can add the tier back with live machines. You just need to
run these commands:

1. "ceph osd tier add cold-pool cache-pool --force-nonempty"
2. "ceph osd tier cache-mode cache-pool forward" <-- no other mode seems
   to work at this point, only forward.
3. "ceph osd tier set-overlay cold-pool cache-pool" <-- after you run
   this, the header objects should start appearing in the cache pool.
   You need to wait a while for all the rbd_header objects to reappear
   before switching the cache-mode again, or the VMs will crash.
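The re-add sequence above could be scripted roughly as follows. Again a
sketch only: pool names are the examples from this thread, and the sleep
interval is a guess at "wait a while", not a tested value:

```shell
#!/bin/sh
# Sketch of re-adding a non-empty cache tier to a live base pool,
# following the steps in this thread. Run against a test cluster first.
COLD=cold-pool
CACHE=cache-pool

# 1. Re-attach the (non-empty) cache pool as a tier of the cold pool.
ceph osd tier add "$COLD" "$CACHE" --force-nonempty

# 2. Only forward mode seems to work at this point.
ceph osd tier cache-mode "$CACHE" forward

# 3. Restore the overlay; header objects should start appearing in the
#    cache pool after this.
ceph osd tier set-overlay "$COLD" "$CACHE"

# Wait until rbd_header objects have reappeared before changing the
# cache-mode again -- per the thread, the VMs crash otherwise.
until rados -p "$CACHE" ls | grep -q '^rbd_header\.'; do
    sleep 10   # interval is a guess; the thread just says "a while"
done
```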
You can check with "rados -p cache-pool ls".

>
>> From what I have read about it, all objects should leave the cache
>> tier, and you shouldn't have to "force" removing a tier that still
>> contains objects.
>>
>> Now onto the questions:
>>
>> 1. Is it normal for VMs to freeze while adding a cache layer/tier?
>> 2. Do VMs need to be offline to remove the caching layer?
>> 3. I have read somewhere that snapshots might interfere with cache
>>    tier clean up. Is it true?
>> 4. Are there other ways to remove the caching tier on a live system?
>>
>> Regards,
>>
>> Darius
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Regards,
Darius