Re: [ceph-users] pg incomplete second osd in acting set still available
So I think I know what might have gone wrong. When I took my osds out of the cluster and shut them down, the first set of osds likely came back up and into the cluster before the 300-second timeout expired. This would have prevented the cluster from triggering recovery of the pg from the replica osd.

So the question is, can I force this to happen? Can I take the supposed primary osd down for 300+ seconds to allow the cluster to start recovering the pgs (this will of course affect all other pgs on those osds)? Or is there a better way?

Note that all the secondary osds in these pgs have the expected amount of data for the pg, remained up during the primary's downtime, and should have the state needed to become the primary for the acting set.

Thanks for listening.

~jpr

On 03/25/2016 11:57 AM, John-Paul Robinson wrote:
> Hi Folks,
>
> One last dip into my old bobtail cluster. (new hardware is on order)
>
> I have three pgs in an incomplete state. The cluster was previously
> stable but with a health warn state due to a few near full osds. I
> started resizing drives on one host to expand space after taking the
> osds that served them out and down. My failure domain is two levels,
> osds and hosts, and I have two copies per placement group.
>
> I have three of my pgs flagging incomplete.
>
> root@d90-b1-1c-3a-c4-8f:~# date; sudo ceph --id nova health detail | grep incomplete
> Fri Mar 25 11:28:47 CDT 2016
> HEALTH_WARN 168 pgs backfill; 107 pgs backfilling; 241 pgs degraded; 3 pgs incomplete;
> 3 pgs stuck inactive; 287 pgs stuck unclean; recovery 4913393/39589336 degraded (12.411%);
> recovering 120 o/s, 481MB/s; 4 near full osd(s)
> pg 3.5 is stuck inactive since forever, current state incomplete, last acting [53,22]
> pg 3.150 is stuck inactive since forever, current state incomplete, last acting [50,74]
> pg 3.38c is stuck inactive since forever, current state incomplete, last acting [14,70]
> pg 3.5 is stuck unclean since forever, current state incomplete, last acting [53,22]
> pg 3.150 is stuck unclean since forever, current state incomplete, last acting [50,74]
> pg 3.38c is stuck unclean since forever, current state incomplete, last acting [14,70]
> pg 3.38c is incomplete, acting [14,70]
> pg 3.150 is incomplete, acting [50,74]
> pg 3.5 is incomplete, acting [53,22]
>
> Given that incomplete means:
>
> "Ceph detects that a placement group is missing information about writes
> that may have occurred, or does not have any healthy copies. If you see
> this state, try to start any failed OSDs that may contain the needed
> information or temporarily adjust min_size to allow recovery."
>
> I have restarted all osds in these acting sets and they log normally,
> opening their respective journals and such. However, the incomplete
> state remains.
>
> All three of the primary osds (53, 50, 14) were reformatted to expand
> their size, so I know there's no "spare" journal if it's referring to
> what was there before. Btw, I did take all osds to out and down before
> resizing their drives, so I'm not sure how these pgs would actually be
> expecting an old journal.
>
> I suspect I need to forgo the journal and let the secondaries become
> primary for these pgs.
>
> I sure hope that's possible.
>
> As always, thanks for any pointers.
>
> ~jpr
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
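If marking the reformatted primary out explicitly works in this situation, it avoids leaving it down for the full timeout. A rough sketch of what that could look like (untested here; it assumes osd.53 is the primary for pg 3.5 and the bobtail-era sysvinit scripts, so adjust ids and service commands for your hosts):

$ sudo service ceph stop osd.53            # on the osd's host; "stop ceph-osd id=53" on upstart
$ sudo ceph osd out 53                     # mark it out immediately instead of waiting ~300s
$ sudo ceph --id nova pg 3.5 query | less  # watch recovery_state; the replica should take over
# once the pg goes active and backfill is underway:
$ sudo service ceph start osd.53
$ sudo ceph osd in 53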
[ceph-users] pg incomplete second osd in acting set still available
Hi Folks,

One last dip into my old bobtail cluster. (new hardware is on order)

I have three pgs in an incomplete state. The cluster was previously stable but with a health warn state due to a few near full osds. I started resizing drives on one host to expand space after taking the osds that served them out and down. My failure domain is two levels, osds and hosts, and I have two copies per placement group.

I have three of my pgs flagging incomplete.

root@d90-b1-1c-3a-c4-8f:~# date; sudo ceph --id nova health detail | grep incomplete
Fri Mar 25 11:28:47 CDT 2016
HEALTH_WARN 168 pgs backfill; 107 pgs backfilling; 241 pgs degraded; 3 pgs incomplete; 3 pgs stuck inactive; 287 pgs stuck unclean; recovery 4913393/39589336 degraded (12.411%); recovering 120 o/s, 481MB/s; 4 near full osd(s)
pg 3.5 is stuck inactive since forever, current state incomplete, last acting [53,22]
pg 3.150 is stuck inactive since forever, current state incomplete, last acting [50,74]
pg 3.38c is stuck inactive since forever, current state incomplete, last acting [14,70]
pg 3.5 is stuck unclean since forever, current state incomplete, last acting [53,22]
pg 3.150 is stuck unclean since forever, current state incomplete, last acting [50,74]
pg 3.38c is stuck unclean since forever, current state incomplete, last acting [14,70]
pg 3.38c is incomplete, acting [14,70]
pg 3.150 is incomplete, acting [50,74]
pg 3.5 is incomplete, acting [53,22]

Given that incomplete means:

"Ceph detects that a placement group is missing information about writes that may have occurred, or does not have any healthy copies. If you see this state, try to start any failed OSDs that may contain the needed information or temporarily adjust min_size to allow recovery."

I have restarted all osds in these acting sets and they log normally, opening their respective journals and such. However, the incomplete state remains.

All three of the primary osds (53, 50, 14) were reformatted to expand their size, so I know there's no "spare" journal if it's referring to what was there before. Btw, I did take all osds to out and down before resizing their drives, so I'm not sure how these pgs would actually be expecting an old journal.

I suspect I need to forgo the journal and let the secondaries become primary for these pgs.

I sure hope that's possible.

As always, thanks for any pointers.

~jpr

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] upgrading to major releases
Hi, When upgrading to the next release, is it necessary to first upgrade to the most recent point release of the prior release or can one upgrade from the initial release of the named version? The release notes don't appear to indicate it is necessary (http://docs.ceph.com/docs/master/release-notes), just wondering if there are benefits or assumptions. Thanks, ~jpr ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] hanging nfsd requests on an RBD to NFS gateway
Hi,

Has anyone else experienced a problem with RBD-to-NFS gateways blocking nfsd server requests when their ceph cluster has a placement group that is not servicing I/O for some reason, e.g. too few replicas or an osd with slow request warnings?

We have an RBD-NFS gateway that stops responding to NFS clients (interaction with RBD-backed NFS shares hangs on the NFS client) whenever our ceph cluster has some part of it in an I/O block condition. This issue only affects the ability of the nfsd processes to serve requests to the client. I can look at and access the underlying mounted RBD containers without issue, although they appear hung from the NFS client side. The gateway node load numbers spike to a number that reflects the number of nfsd processes, but the system is otherwise untaxed (unlike the case of a normal high OS load, i.e. I can type and run commands with normal responsiveness).

The behavior comes across like there is some nfsd global lock that an nfsd sets before requesting I/O from a backend device. In the case above, the I/O request hangs on one RBD image affected by the I/O block caused by the problematic pg or OSD. The nfsd request blocks on the ceph I/O and, because it has set a global lock, all other nfsd processes are prevented from servicing requests to their clients. The nfsd processes are now all in the wait queue, causing the load number on the gateway system to spike. Once the Ceph I/O issue is resolved, the nfsd I/O request completes and all service returns to normal. The load on the gateway drops to normal immediately and all NFS clients can again interact with the nfsd processes. Throughout this time unaffected ceph objects remain available to other clients, e.g. OpenStack volumes.

Our RBD-NFS gateway is running Ubuntu 12.04.5 with kernel 3.11.0-15-generic. The ceph version installed on this client is 0.72.2, though I assume only the kernel-resident RBD module matters.

Any thoughts or pointers appreciated.

~jpr

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] pg incomplete state
Greg,

Thanks for providing this background on the incomplete state. With that context, and a little more digging online and in our environment, I was able to resolve the issue. My cluster is back in health ok.

The key to fixing the incomplete state was the information provided by pg query. I did not have to change the min_size setting. In addition to your comments, these two references were helpful.

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-August/042102.html
http://tracker.ceph.com/issues/5226

The tail of `ceph pg 3.ea query` showed there were a number of osds involved in servicing the backfill.

      "probing_osds": [
            10,
            11,
            30,
            37,
            39,
            54],
      "down_osds_we_would_probe": [],
      "peering_blocked_by": []},
    { "name": "Started",
      "enter_time": "2015-10-21 14:39:13.824613"}]}

After checking all the OSDs, I confirmed that only osd.11 had the pg data and all the rest had an empty dir for pg 3.ea. Because osd 10 was listed first and had an empty copy of the pg, my assumption was it was blocking the backfill.

I stopped osd.10 briefly and the state of pg 3.ea immediately entered "active+degraded+remapped+backfilling". After the backfill started, I started osd.10 again. In particular, osd 11 became the primary (as desired) and began backfilling osd 30.

    { "state": "active+degraded+remapped+backfilling",
      "up": [
            30,
            11],
      "acting": [
            11,
            30],

osd.10 was no longer holding up the start of the backfill operation:

      "recovery_state": [
            { "name": "Started\/Primary\/Active",
              "enter_time": "2015-10-22 12:46:50.907955",
              "might_have_unfound": [
                    { "osd": 10,
                      "status": "not queried"}],
              "recovery_progress": { "backfill_target": 30,
                  "waiting_on_backfill": 0,

Based on the steps that triggered the original incomplete state, my guess is that when I took osd.30 down and out to reformat, a number of alternates (including osd.10) were mapped as backfill targets for the pg. These operations didn't have a chance to start up before osd 30's reformat completed and it was back in the cluster. At that point, pg 3.ea was remapped again, leaving osd 10 at the top of the list. Not having any data, it blocked the backfill from osd 11 from starting. Not sure if that was the exact cause, but it makes some sense.

Thanks again for pointing me in a useful direction.

~jpr

On 10/21/2015 03:01 PM, Gregory Farnum wrote:
> I don't remember the exact timeline, but min_size is designed to
> prevent data loss from under-replicated objects (ie, if you only have
> 1 copy out of 3 and you lose that copy, you're in trouble, so maybe
> you don't want it to go active). Unfortunately it could also prevent
> the OSDs from replicating/backfilling the data to new OSDs in the case
> where you only had one copy left — that's fixed now, but wasn't
> initially. And in that case it reported the PG as incomplete (in later
> versions, PGs in this state get reported as undersized).
>
> So if you drop the min_size to 1, it will allow new writes to the PG
> (which might not be great), but it will also let the OSD go into the
> backfilling state. (At least, assuming the number of replicas is the
> only problem.) Based on your description of the problem I think this
> is the state you're in, and decreasing min_size is the solution.
> *shrug*
> You could also try and do something like extracting the PG from osd.11
> and copying it to osd.30, but that's quite tricky without the modern
> objectstore tool stuff, and I don't know if any of that works on
> dumpling (which it sounds like you're on — incidentally, you probably
> want to upgrade from that).
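In case it's useful to someone hitting the same thing later, the stop/check/start cycle described above boils down to something like this (osd ids are from the query output above; setting noout first is optional but keeps the brief stop from marking the osd out and triggering extra data movement):

$ sudo ceph osd set noout
$ sudo service ceph stop osd.10                   # on osd.10's host
$ sudo ceph --id nova pg 3.ea query | tail -40    # state should move to active+degraded+remapped+backfilling
$ sudo service ceph start osd.10
$ sudo ceph osd unset noout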
> -Greg > > On Wed, Oct 21, 2015 at 12:55 PM, John-Paul Robinson <j...@uab.edu> wrote: >> Greg, >> >> Thanks for the insight. I suspect things are somewhat sane given that I >> did erase the primary (osd.30) and the secondary (osd.11) still contains >> pg data. >> >> If I may, could you clarify the process of backfill a little? >> >> I understand the min_size allows I/O on the object to resume while there >> are only that many replicas (ie. 1 once changed) and this would let >> things move forward. >> >> I would expect, however, that some backfill would already be on-going >> for pg 3.ea on osd.30. As far as I can tell, there isn't anything >> happening. The pg 3.ea directory is just as empty today as it was >> yesterday. >> >> Will changing t
Re: [ceph-users] hanging nfsd requests on an RBD to NFS gateway
On 10/22/2015 04:03 PM, Wido den Hollander wrote:
> On 10/22/2015 10:57 PM, John-Paul Robinson wrote:
>> Hi,
>>
>> Has anyone else experienced a problem with RBD-to-NFS gateways blocking
>> nfsd server requests when their ceph cluster has a placement group that
>> is not servicing I/O for some reason, eg. too few replicas or an osd
>> with slow request warnings?
>>
>> We have an RBD-NFS gateway that stops responding to NFS clients
>> (interaction with RBD-backed NFS shares hang on the NFS client),
>> whenever our ceph cluster has some part of it in an I/O block
>> condition. This issue only affects the ability of the nfsd processes
>> to serve requests to the client. I can look at and access underlying
>> mounted RBD containers without issue, although they appear hung from the
>> NFS client side. The gateway node load numbers spike to a number that
>> reflects the number of nfsd processes, but the system is otherwise
>> untaxed (unlike the case in a normal high os load, ie. i can type and
>> run commands with normal responsiveness.)
>>
> Well, that is normal I think. Certain objects become unresponsive if a
> PG is not serving I/O.
>
> With a simple 'ls' or 'df -h' you might not be touching those objects,
> so for you it seems like everything is functioning.
>
> The nfsd process however might be hung due to a blocking I/O call. That
> is completely normal and to be expected.

I agree that an nfsd process blocking on a blocked backend I/O request is expected and normal.

> That it hangs the complete NFS server might be just a side-effect of how
> nfsd was written.

Hanging all nfsd processes is the part I find unexpected. I'm just wondering if someone has experience with this or if this is a known nfsd issue.

> It might be that Ganesha works better for you:
> http://blog.widodh.nl/2014/12/nfs-ganesha-with-libcephfs-on-ubuntu-14-04/

Thanks, Ganesha looks very interesting!

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] hanging nfsd requests on an RBD to NFS gateway
A few clarifications on our experience: * We have 200+ rbd images mounted on our RBD-NFS gateway. (There's nothing easier for a user to understand than "your disk is full".) * I'd expect more contention potential with a single shared RBD back end, but with many distinct and presumably isolated backend RBD images, I've always been surprised that *all* the nfsd task hang. This leads me to think it's an nfsd issue rather than and rbd issue. (I realize this is an rbd list, looking for shared experience. ;) ) * I haven't seen any difference between reads and writes. Any access to any backing RBD store from the NFS client hangs. ~jpr On 10/22/2015 06:42 PM, Ryan Tokarek wrote: >> On Oct 22, 2015, at 3:57 PM, John-Paul Robinson <j...@uab.edu> wrote: >> >> Hi, >> >> Has anyone else experienced a problem with RBD-to-NFS gateways blocking >> nfsd server requests when their ceph cluster has a placement group that >> is not servicing I/O for some reason, eg. too few replicas or an osd >> with slow request warnings? > We have experienced exactly that kind of problem except that it sometimes > happens even when ceph health reports "HEALTH_OK". This has been incredibly > vexing for us. > > > If the cluster is unhealthy for some reason, then I'd expect your/our > symptoms as writes can't be completed. > > I'm guessing that you have file systems with barriers turned on. Whichever > file system that has a barrier write stuck on the problem pg, will cause any > other process trying to write anywhere in that FS also to block. This likely > means a cascade of nfsd processes will block as they each try to service > various client writes to that FS. Even though, theoretically, the rest of the > "disk" (rbd) and other file systems might still be writable, the NFS > processes will still be in uninterruptible sleep just because of that stuck > write request (or such is my understanding). > > Disabling barriers on the gateway machine might postpone the problem (never > tried it and don't want to) until you hit your vm.dirty_bytes or > vm.dirty_ratio thresholds, but it is dangerous as you could much more easily > lose data. You'd be better off solving the underlying issues when they happen > (too few replicas available or overloaded osds). > > > For us, even when the cluster reports itself as healthy, we sometimes have > this problem. All nfsd processes block. sync blocks. echo 3 > > /proc/sys/vm/drop_caches blocks. There is a persistent 4-8MB "Dirty" in > /proc/meminfo. None of the osds log slow requests. Everything seems fine on > the osds and mons. Neither CPU nor I/O load is extraordinary on the ceph > nodes, but at least one file system on the gateway machine will stop > accepting writes. > > If we just wait, the situation resolves itself in 10 to 30 minutes. A forced > reboot of the NFS gateway "solves" the performance problem, but is annoying > and dangerous (we unmount all of the file systems that are still unmountable, > but the stuck ones lead us to a sysrq-b). > > This is on Scientific Linux 6.7 systems with elrepo 4.1.10 Kernels running > Ceph Firefly (0.8.10) and XFS file systems exported over NFS and samba. > > Ryan > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
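When one of these hangs is in progress, a few quick checks on the gateway can show whether it is the stuck-writeback situation Ryan describes (these are generic Linux diagnostics, nothing Ceph-specific, and are offered only as a starting point):

$ grep -E '^(Dirty|Writeback|NFS_Unstable)' /proc/meminfo
$ sysctl vm.dirty_background_ratio vm.dirty_ratio vm.dirty_bytes
# as root, dump blocked (D-state) tasks to the kernel log to see where nfsd/xfs/rbd are stuck:
$ echo w > /proc/sysrq-trigger
$ dmesg | tail -100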
Re: [ceph-users] pg incomplete state
Greg,

Thanks for the insight. I suspect things are somewhat sane given that I did erase the primary (osd.30) and the secondary (osd.11) still contains pg data.

If I may, could you clarify the process of backfill a little?

I understand the min_size allows I/O on the object to resume while there are only that many replicas (ie. 1 once changed) and this would let things move forward.

I would expect, however, that some backfill would already be on-going for pg 3.ea on osd.30. As far as I can tell, there isn't anything happening. The pg 3.ea directory is just as empty today as it was yesterday.

Will changing the min_size actually trigger backfill to begin for an object if it has stalled or never got started?

An alternative idea I had was to take osd.30 back out of the cluster so that pg 3.ea [30,11] would get mapped to some other osd to maintain replication. This seems a bit heavy handed though, given that only this one pg is affected.

Thanks for any follow up.

~jpr

On 10/21/2015 01:21 PM, Gregory Farnum wrote:
> On Tue, Oct 20, 2015 at 7:22 AM, John-Paul Robinson <j...@uab.edu> wrote:
>> Hi folks
>>
>> I've been rebuilding drives in my cluster to add space. This has gone
>> well so far.
>>
>> After the last batch of rebuilds, I'm left with one placement group in
>> an incomplete state.
>>
>> [sudo] password for jpr:
>> HEALTH_WARN 1 pgs incomplete; 1 pgs stuck inactive; 1 pgs stuck unclean
>> pg 3.ea is stuck inactive since forever, current state incomplete, last
>> acting [30,11]
>> pg 3.ea is stuck unclean since forever, current state incomplete, last
>> acting [30,11]
>> pg 3.ea is incomplete, acting [30,11]
>>
>> I've restarted both OSDs a few times but it hasn't cleared the error.
>>
>> On the primary I see errors in the log related to slow requests:
>>
>> 2015-10-20 08:40:36.678569 7f361585c700 0 log [WRN] : 8 slow requests,
>> 3 included below; oldest blocked for > 31.922487 secs
>> 2015-10-20 08:40:36.678580 7f361585c700 0 log [WRN] : slow request
>> 31.531606 seconds old, received at 2015-10-20 08:40:05.146902:
>> osd_op(client.158903.1:343217143 rb.0.25cf8.238e1f29.a044 [read
>> 1064960~262144] 3.ae9968ea RETRY) v4 currently reached pg
>> 2015-10-20 08:40:36.678592 7f361585c700 0 log [WRN] : slow request
>> 31.531591 seconds old, received at 2015-10-20 08:40:05.146917:
>> osd_op(client.158903.1:343217144 rb.0.25cf8.238e1f29.a044 [read
>> 2113536~262144] 3.ae9968ea RETRY) v4 currently reached pg
>> 2015-10-20 08:40:36.678599 7f361585c700 0 log [WRN] : slow request
>> 31.531551 seconds old, received at 2015-10-20 08:40:05.146957:
>> osd_op(client.158903.1:343232634 ekessler-default.rbd [watch 35~0]
>> 3.e4bd50ea) v4 currently reached pg
>>
>> Notes online suggest this is an issue with the journal and that it may
>> be possible to export and rebuild the pg. I don't have firefly.
>>
>> https://ceph.com/community/incomplete-pgs-oh-my/
>>
>> Interestingly, pg 3.ea appears to be complete on osd.11 (the secondary)
>> but missing entirely on osd.30 (the primary).
>>
>> on osd.30 (primary):
>>
>> crowbar@da0-36-9f-0e-2b-88:~$ du -sk
>> /var/lib/ceph/osd/ceph-30/current/3.ea_head/
>> 0 /var/lib/ceph/osd/ceph-30/current/3.ea_head/
>>
>> on osd.11 (secondary):
>>
>> crowbar@da0-36-9f-0e-2b-40:~$ du -sh
>> /var/lib/ceph/osd/ceph-11/current/3.ea_head/
>> 63G /var/lib/ceph/osd/ceph-11/current/3.ea_head/
>>
>> This makes some sense, since my disk drive rebuilding activity
>> reformatted the primary osd.30. It also gives me some hope that my data
>> is not lost.
>> >> I understand incomplete means problem with journal, but is there a way >> to dig deeper into this or possible to get the secondary's data to take >> over? > If you're running an older version of Ceph (Firefly or earlier, > maybe?), "incomplete" can also mean "not enough replicas". It looks > like that's what you're hitting here, if osd.11 is not reporting any > issues. If so, simply setting the min_size on this pool to 1 until the > backfilling is done should let you get going. > -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
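For anyone following along later, the min_size change Greg describes is a pool-level setting. A sketch (the pool name is a placeholder; substitute your own, and remember to put the value back once backfill completes):

$ sudo ceph osd pool get <poolname> min_size
$ sudo ceph osd pool set <poolname> min_size 1
# ... wait for backfill to finish, then restore it:
$ sudo ceph osd pool set <poolname> min_size 2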
Re: [ceph-users] pg incomplete state
Yes. That's the intention. I was fixing the osd size to ensure the cluster was in health ok for the upgrades (instead of multiple osds in near full). Thanks again for all the insight. Very helpful. ~jpr On 10/21/2015 03:01 PM, Gregory Farnum wrote: > (which it sounds like you're on — incidentally, you probably > want to upgrade from that). ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] pg incomplete state
Hi folks

I've been rebuilding drives in my cluster to add space. This has gone well so far.

After the last batch of rebuilds, I'm left with one placement group in an incomplete state.

[sudo] password for jpr:
HEALTH_WARN 1 pgs incomplete; 1 pgs stuck inactive; 1 pgs stuck unclean
pg 3.ea is stuck inactive since forever, current state incomplete, last acting [30,11]
pg 3.ea is stuck unclean since forever, current state incomplete, last acting [30,11]
pg 3.ea is incomplete, acting [30,11]

I've restarted both OSDs a few times but it hasn't cleared the error.

On the primary I see errors in the log related to slow requests:

2015-10-20 08:40:36.678569 7f361585c700 0 log [WRN] : 8 slow requests, 3 included below; oldest blocked for > 31.922487 secs
2015-10-20 08:40:36.678580 7f361585c700 0 log [WRN] : slow request 31.531606 seconds old, received at 2015-10-20 08:40:05.146902: osd_op(client.158903.1:343217143 rb.0.25cf8.238e1f29.a044 [read 1064960~262144] 3.ae9968ea RETRY) v4 currently reached pg
2015-10-20 08:40:36.678592 7f361585c700 0 log [WRN] : slow request 31.531591 seconds old, received at 2015-10-20 08:40:05.146917: osd_op(client.158903.1:343217144 rb.0.25cf8.238e1f29.a044 [read 2113536~262144] 3.ae9968ea RETRY) v4 currently reached pg
2015-10-20 08:40:36.678599 7f361585c700 0 log [WRN] : slow request 31.531551 seconds old, received at 2015-10-20 08:40:05.146957: osd_op(client.158903.1:343232634 ekessler-default.rbd [watch 35~0] 3.e4bd50ea) v4 currently reached pg

Notes online suggest this is an issue with the journal and that it may be possible to export and rebuild the pg. I don't have firefly.

https://ceph.com/community/incomplete-pgs-oh-my/

Interestingly, pg 3.ea appears to be complete on osd.11 (the secondary) but missing entirely on osd.30 (the primary).

on osd.30 (primary):

crowbar@da0-36-9f-0e-2b-88:~$ du -sk /var/lib/ceph/osd/ceph-30/current/3.ea_head/
0 /var/lib/ceph/osd/ceph-30/current/3.ea_head/

on osd.11 (secondary):

crowbar@da0-36-9f-0e-2b-40:~$ du -sh /var/lib/ceph/osd/ceph-11/current/3.ea_head/
63G /var/lib/ceph/osd/ceph-11/current/3.ea_head/

This makes some sense, since my disk drive rebuilding activity reformatted the primary osd.30. It also gives me some hope that my data is not lost.

I understand incomplete means problem with journal, but is there a way to dig deeper into this or possible to get the secondary's data to take over?

Thanks,

~jpr

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
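The usual way to dig deeper on a pg like this is pg query; it reports the peering/recovery state machine, which osds it is probing, and what (if anything) it is blocked on. For example, with the pg id from the health output above:

$ sudo ceph pg 3.ea query | less
# look at the "recovery_state" section near the end, in particular
# "probing_osds", "down_osds_we_would_probe", and "peering_blocked_by"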
Re: [ceph-users] question on reusing OSD
The move journal, partition resize, grow file system approach would work nicely if the spare capacity were at the end of the disk.

Unfortunately, the gdisk (0.8.1) end-of-disk location bug caused the journal placement to be at the 800GB mark, leaving the largest remaining partition at the end of the disk.

I'm assuming the gdisk bug was caused by overflowing a 32bit int during the -1000M offset from end-of-disk calculation. When it computed the end of disk for the journal placement on disks >2TB it dropped the 2TB part of the size and was left only with the 800GB value, putting the journal there. After gdisk created the journal at the 800GB mark (splitting the disk), ceph-disk-prepare told gdisk to take the largest remaining partition for data, using the 2TB partition at the end.

Here's an example of the buggy partitioning:

crowbar@da0-36-9f-0e-28-2c:~$ sudo gdisk -l /dev/sdd
GPT fdisk (gdisk) version 0.8.8

Partition table scan:
  MBR: protective
  BSD: not present
  APM: not present
  GPT: present

Found valid GPT with protective MBR; using GPT.
Disk /dev/sdd: 5859442688 sectors, 2.7 TiB
Logical sector size: 512 bytes
Disk identifier (GUID): 6F76BD12-05D6-4FA2-A132-CAC3E1C26C81
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 5859442654
Partitions will be aligned on 2048-sector boundaries
Total free space is 1562425343 sectors (745.0 GiB)

Number  Start (sector)   End (sector)   Size        Code  Name
   1        1564475392     5859442654   2.0 TiB           ceph data
   2        1562425344     1564475358   1001.0 MiB        ceph journal

I assume I could still follow a disk-level relocation of data using dd and shift all my content forward in the disk and then grow the file system to the end, but this would take a significant amount of time, more than a quick restart of the OSD.

This leaves me the option of setting noout and hoping for the best (no other failures) during my somewhat lengthy dd data movement, or taking my osd down and letting the cluster begin repairing the redundancy. If I follow the second option of normal osd loss repair, my disk repartition step would be fast and I could bring the OSD back up rather quickly.

Does taking an OSD out of service, erasing it and bringing the same OSD back into service present any undue stress to the cluster? I'd prefer to use the second option if I can because I'm likely to repeat this in the near future in order to add encryption to these disks.

~jpr

On 09/15/2015 06:44 PM, Lionel Bouton wrote:
> Le 16/09/2015 01:21, John-Paul Robinson a écrit :
>> Hi,
>>
>> I'm working to correct a partitioning error from when our cluster was
>> first installed (ceph 0.56.4, ubuntu 12.04). This left us with 2TB
>> partitions for our OSDs, instead of the 2.8TB actually available on
>> disk, a 29% space hit. (The error was due to a gdisk bug that
>> mis-computed the end of the disk during the ceph-disk-prepare and placed
>> the journal at the 2TB mark instead of the true end of the disk at
>> 2.8TB. I've updated gdisk to a newer release that works correctly.)
>>
>> I'd like to fix this problem by taking my existing 2TB OSDs offline one
>> at a time, repartitioning them and then bringing them back into the
>> cluster. Unfortunately I can't just grow the partitions, so the
>> repartition will be destructive.
> Hum, why should it be?
> If the journal is at the 2TB mark, you should be able to:
> - stop the OSD,
> - flush the journal (ceph-osd -i <osd-id> --flush-journal),
> - unmount the data filesystem (might be superfluous but the kernel seems
>   to cache the partition layout when a partition is active),
> - remove the journal partition,
> - extend the data partition,
> - place the journal partition at the end of the drive (in fact you
>   probably want to write a precomputed partition layout in one go),
> - mount the data filesystem, resize it online,
> - ceph-osd -i <osd-id> --mkjournal (assuming your setup can find the
>   partition again automatically without reconfiguration),
> - start the OSD.
>
> If you script this you should not have to use noout: the OSD should come
> back in a matter of seconds and the impact on the storage network minimal.
>
> Note that the start of the disk is where you get the best sequential
> reads/writes. Given that most data accesses are random and all journal
> accesses are sequential I put the journal at the start of the disk when
> data and journal are sharing the same platters.
>
> Best regards,
>
> Lionel

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
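For reference, Lionel's procedure roughly scripted out (a sketch only, not something run here; it assumes the data partition sits before the journal as in his scenario, an xfs data filesystem, osd id 12 on /dev/sdd, and sysvinit service scripts; the actual repartitioning step is left as a comment because the gdisk/sgdisk incantation depends on your exact layout):

$ sudo service ceph stop osd.12
$ sudo ceph-osd -i 12 --flush-journal
$ sudo umount /var/lib/ceph/osd/ceph-12
# repartition: delete the old journal partition, extend the data partition
# (keeping its start sector), and recreate a ~1GB journal partition at the end
$ sudo mount /dev/sdd1 /var/lib/ceph/osd/ceph-12
$ sudo xfs_growfs /var/lib/ceph/osd/ceph-12
$ sudo ceph-osd -i 12 --mkjournal
$ sudo service ceph start osd.12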
Re: [ceph-users] question on reusing OSD
So I just realized I had described the partition error incorrectly in my initial post. The journal was placed at the 800GB mark leaving the 2TB data partition at the end of the disk. (See my follow-up to Lionel for details.) I'm working to correct that so I have a single large partition the size of the disk, save for the journal. Sorry for any confusion. ~jpr > On Sep 15, 2015, at 6:21 PM, John-Paul Robinson <j...@uab.edu> wrote: > > I'm working to correct a partitioning error from when our cluster was > first installed (ceph 0.56.4, ubuntu 12.04). This left us with 2TB > partitions for our OSDs, instead of the 2.8TB actually available on > disk, a 29% space hit. (The error was due to a gdisk bug that > mis-computed the end of the disk during the ceph-disk-prepare and placed > the journal at the 2TB mark instead of the true end of the disk at > 2.8TB. I've updated gdisk to a newer release that works correctly.) ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] question on reusing OSD
Christian,

Thanks for the feedback.

I guess I'm wondering about step 4, "clobber partition, leaving data intact and grow partition and the file system as needed".

My understanding of xfs_growfs is that the free space must be at the end of the existing file system. In this case the existing partition starts around the 800GB mark on the disk and extends to the end of the disk. My goal is to add the first 800GB on the disk to that partition so it can become a single data partition. Note that my volumes are not LVM based, so I can't extend the volume by incorporating the free space at the start of the disk.

Am I misunderstanding something about file system grow commands?

Regarding your comments on the impact to the cluster of a downed OSD: I have lost OSDs and the impact is minimal (acceptable). My concern is around taking an OSD down, having the cluster initiate recovery, and then bringing that same OSD back into the cluster in an empty state. Are the placement groups that originally had data on this OSD already remapped by this point (even if they aren't fully recovered), so that bringing the empty replacement OSD online simply causes a different set of placement groups to be mapped onto it to achieve the rebalance?

Thanks,

~jpr

On 09/16/2015 08:37 AM, Christian Balzer wrote:
> Hello,
>
> On Wed, 16 Sep 2015 07:21:26 -0500 John-Paul Robinson wrote:
>
>> The move journal, partition resize, grow file system approach would
>> work nicely if the spare capacity were at the end of the disk.
>>
> That shouldn't matter, you can "safely" loose your journal in controlled
> circumstances.
>
> This would also be an ideal time to put your journals on SSDs. ^o^
>
> Roughly (you do have a test cluster, do you? Or at least try this with
> just one OSD):
>
> 1. set noout just to be sure.
> 2. stop the OSD
> 3. "ceph-osd -i osdnum --flush-journal" for warm fuzzies (see man page or --help)
> 4. clobber your partitions in a way that leaves you with an intact data
>    partition, grow that and the FS in it as desired.
> 5. re-init the journal with "ceph-osd -i osdnum --mkjournal"
> 6. start the OSD and rejoice.
>
> More below.
>

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] question on reusing OSD
Hi,

I'm working to correct a partitioning error from when our cluster was first installed (ceph 0.56.4, ubuntu 12.04). This left us with 2TB partitions for our OSDs, instead of the 2.8TB actually available on disk, a 29% space hit. (The error was due to a gdisk bug that mis-computed the end of the disk during the ceph-disk-prepare and placed the journal at the 2TB mark instead of the true end of the disk at 2.8TB. I've updated gdisk to a newer release that works correctly.)

I'd like to fix this problem by taking my existing 2TB OSDs offline one at a time, repartitioning them and then bringing them back into the cluster. Unfortunately I can't just grow the partitions, so the repartition will be destructive.

I would like for the reformatted OSD to come back into the cluster looking just like the original OSD, except that it now has 2.8TB for its data. That is, I'd like the OSD number to stay the same and for the cluster to think of it like the original disk (save for not having any data on it).

Ordinarily, I would add an OSD by bringing a system into the cluster, triggering these events:

ceph-disk-prepare /dev/sdb /dev/sdb   # partitions disk, note older cluster with journal on same disk
ceph-disk-activate /dev/sdb           # registers osd with cluster

The ceph-disk-prepare is focused on partitioning and doesn't interact with the cluster. The ceph-disk-activate takes care of making the OSD look like an OSD and adding it into the cluster. Inside of the ceph-disk-activate the code looks for some special files at the top of the /dev/sdb1 file system, including magic, ceph_fsid, and whoami (which is where the osd number is stored).

My first question is, can I preserve these special files and put them back on the repartitioned/formatted drive, causing ceph-disk-activate to just bring the OSD back into the cluster using its original identity, or is there a better way to do what I want?

My second question is, if I take an OSD out of the cluster, should I wait for the subsequent rebalance to complete before bringing the reformatted OSD back in the cluster? That is, will it cause problems to drop an OSD out of the cluster and then bring the same OSD back into the cluster except without any of the data? I'm assuming this is similar to what would happen in a standard disk replacement scenario.

I reviewed the thread from Sept 2014 (https://www.mail-archive.com/ceph-users@lists.ceph.com/msg13394.html) discussing a similar scenario. This was more focused on re-using a journal slot on an SSD. In my case the journal is on the same disk as the data. Also, I don't have a recent release of ceph so likely won't benefit from the associated fix.

Thanks for any suggestions.

~jpr

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
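One approach that may avoid hand-copying the marker files (a sketch based on the manual OSD deployment flow; not verified on a 0.56 cluster, so treat it as an assumption to test first): leave the OSD's crush entry and auth key in place, reformat the disk, and let ceph-osd --mkfs regenerate the data directory (including magic, ceph_fsid and whoami) under the same id. Assuming osd.12 on /dev/sdb1:

$ sudo service ceph stop osd.12
# repartition /dev/sdb, then:
$ sudo mkfs.xfs -f /dev/sdb1
$ sudo mount /dev/sdb1 /var/lib/ceph/osd/ceph-12
$ sudo ceph-osd -i 12 --mkfs --mkjournal      # recreates the osd data dir for the existing osd.12
# if the osd's keyring lived on the old data partition, restore or re-register it
# (e.g. "ceph auth add osd.12 ..." with the saved key) before starting
$ sudo service ceph start osd.12              # rejoins as an empty osd.12 and backfills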
Re: [ceph-users] NFS interaction with RBD
In the end this came down to one slow OSD. There were no hardware issues so have to just assume something gummed up during rebalancing and peering. I restarted the osd process after setting the cluster to noout. After the osd was restarted the rebalance completed and the cluster returned to health ok. As soon as the osd restarted all previously hanging operations returned to normal. I'm surprised by a single slow OSD impacting access to the entire cluster. I understand now that only the primary osd is used for reads and writes must go to the primary then secondary, but I would have expected the impact to be more contained. We currently build XFS file systems directly on RBD images. I'm wondering if there would be any value in using an LVM abstraction on top to spread access to other osds for read and failure scenarios. Any thoughts on the above appreciated. ~jpr On 05/28/2015 03:18 PM, John-Paul Robinson wrote: To follow up on the original post, Further digging indicates this is a problem with RBD image access and is not related to NFS-RBD interaction as initially suspected. The nfsd is simply hanging as a result of a hung request to the XFS file system mounted on our RBD-NFS gateway.This hung XFS call is caused by a problem with the RBD module interacting with our Ceph pool. I've found a reliable way to trigger a hang directly on an rbd image mapped into our RBD-NFS gateway box. The image contains an XFS file system. When I try to list the contents of a particular directory, the request hangs indefinitely. Two weeks ago our ceph status was: jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova status health HEALTH_WARN 1 near full osd(s) monmap e1: 3 mons at {da0-36-9f-0e-28-2c=172.16.171.6:6789/0,da0-36-9f-0e-2b-88=172.16.171.5:6789/0,da0-36-9f-0e-2b-a0=172.16.171.4:6789/0}, election epoch 350, quorum 0,1,2 da0-36-9f-0e-28-2c,da0-36-9f-0e-2b-88,da0-36-9f-0e-2b-a0 osdmap e5978: 66 osds: 66 up, 66 in pgmap v26434260: 3072 pgs: 3062 active+clean, 6 active+clean+scrubbing, 4 active+clean+scrubbing+deep; 45712 GB data, 91590 GB used, 51713 GB / 139 TB avail; 12234B/s wr, 1op/s mdsmap e1: 0/0/1 up The near full osd was number 53 and we updated our crush map to rewieght the osd. All of the OSDs had a weight of 1 based on the assumption that all osds were 2.0TB. Apparently one of our severs had the OSDs Sized to 2.8TB and this caused the OSD imbalance eventhough we are only at 50% utilization. We reweighted the near full osd to .8 and that initiated a rebalance that has since relieved the 95% full condition on that OSD. However, since that time the repeering has not completed and we suspect this is causing problems with our access of RBD images. 
Our current ceph status is: jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova status health HEALTH_WARN 1 pgs peering; 1 pgs stuck inactive; 4 pgs stuck unclean; recovery 9/23842120 degraded (0.000%) monmap e1: 3 mons at {da0-36-9f-0e-28-2c=172.16.171.6:6789/0,da0-36-9f-0e-2b-88=172.16.171.5:6789/0,da0-36-9f-0e-2b-a0=172.16.171.4:6789/0}, election epoch 350, quorum 0,1,2 da0-36-9f-0e-28-2c,da0-36-9f-0e-2b-88,da0-36-9f-0e-2b-a0 osdmap e6036: 66 osds: 66 up, 66 in pgmap v27104371: 3072 pgs: 3 active, 3056 active+clean, 9 active+clean+scrubbing, 1 remapped+peering, 3 active+clean+scrubbing+deep; 45868 GB data, 92006 GB used, 51297 GB / 139 TB avail; 3125B/s wr, 0op/s; 9/23842120 degraded (0.000%) mdsmap e1: 0/0/1 up Here are further details on our stuck pgs: jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova pg dump_stuck inactive ok pg_stat objects mip degrunf bytes log disklog state state_stamp v reportedup acting last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp 3.3af 11600 0 0 0 47941791744 153812 153812 remapped+peering2015-05-15 12:47:17.223786 5979'293066 6000'1248735 [48,62] [53,48,62] 5979'293056 2015-05-15 07:40:36.275563 5979'293056 2015-05-15 07:40:36.275563 jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova pg dump_stuck unclean ok pg_stat objects mip degrunf bytes log disklog state state_stamp v reportedup acting last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp 3.106 11870 0 9 0 49010106368 163991 163991 active 2015-05-15 12:47:19.761469 6035'356332 5968'1358516 [62,53] [62,53] 5979'356242 2015-05-14 22:22:12.966150 5979'351351 2015-05-12 18:04:41.838686 5.104 0 0 0 0 0 0 0 active 2015-05-15 12:47:19.800676 0'0 5968'1615
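If someone hits a similar whole-gateway hang, the rough cycle that resolved it in the follow-up was to find the osd logging slow-request warnings and restart just that daemon without kicking off extra data movement (osd number and admin id will differ on your cluster):

# on each osd host, find the one logging slow requests:
$ sudo grep 'slow request' /var/log/ceph/ceph-osd.*.log | tail
# restart only that osd, with noout set so the brief outage doesn't trigger a rebalance:
$ sudo ceph osd set noout
$ sudo service ceph restart osd.53     # on the affected osd's host
$ sudo ceph osd unset noout
$ sudo ceph --id nova status           # watch peering/backfill complete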
Re: [ceph-users] NFS interaction with RBD
To follow up on the original post,

Further digging indicates this is a problem with RBD image access and is not related to NFS-RBD interaction as initially suspected. The nfsd is simply hanging as a result of a hung request to the XFS file system mounted on our RBD-NFS gateway. This hung XFS call is caused by a problem with the RBD module interacting with our Ceph pool.

I've found a reliable way to trigger a hang directly on an rbd image mapped into our RBD-NFS gateway box. The image contains an XFS file system. When I try to list the contents of a particular directory, the request hangs indefinitely.

Two weeks ago our ceph status was:

jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova status
   health HEALTH_WARN 1 near full osd(s)
   monmap e1: 3 mons at {da0-36-9f-0e-28-2c=172.16.171.6:6789/0,da0-36-9f-0e-2b-88=172.16.171.5:6789/0,da0-36-9f-0e-2b-a0=172.16.171.4:6789/0}, election epoch 350, quorum 0,1,2 da0-36-9f-0e-28-2c,da0-36-9f-0e-2b-88,da0-36-9f-0e-2b-a0
   osdmap e5978: 66 osds: 66 up, 66 in
   pgmap v26434260: 3072 pgs: 3062 active+clean, 6 active+clean+scrubbing, 4 active+clean+scrubbing+deep; 45712 GB data, 91590 GB used, 51713 GB / 139 TB avail; 12234B/s wr, 1op/s
   mdsmap e1: 0/0/1 up

The near full osd was number 53 and we updated our crush map to reweight the osd. All of the OSDs had a weight of 1 based on the assumption that all osds were 2.0TB. Apparently one of our servers had its OSDs sized to 2.8TB and this caused the OSD imbalance even though we are only at 50% utilization.

We reweighted the near full osd to .8 and that initiated a rebalance that has since relieved the 95% full condition on that OSD. However, since that time the re-peering has not completed and we suspect this is causing problems with our access of RBD images.

Our current ceph status is:

jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova status
   health HEALTH_WARN 1 pgs peering; 1 pgs stuck inactive; 4 pgs stuck unclean; recovery 9/23842120 degraded (0.000%)
   monmap e1: 3 mons at {da0-36-9f-0e-28-2c=172.16.171.6:6789/0,da0-36-9f-0e-2b-88=172.16.171.5:6789/0,da0-36-9f-0e-2b-a0=172.16.171.4:6789/0}, election epoch 350, quorum 0,1,2 da0-36-9f-0e-28-2c,da0-36-9f-0e-2b-88,da0-36-9f-0e-2b-a0
   osdmap e6036: 66 osds: 66 up, 66 in
   pgmap v27104371: 3072 pgs: 3 active, 3056 active+clean, 9 active+clean+scrubbing, 1 remapped+peering, 3 active+clean+scrubbing+deep; 45868 GB data, 92006 GB used, 51297 GB / 139 TB avail; 3125B/s wr, 0op/s; 9/23842120 degraded (0.000%)
   mdsmap e1: 0/0/1 up

Here are further details on our stuck pgs:

jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova pg dump_stuck inactive
ok
pg_stat objects mip degr unf bytes log disklog state state_stamp v reported up acting last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp
3.3af 11600 0 0 0 47941791744 153812 153812 remapped+peering 2015-05-15 12:47:17.223786 5979'293066 6000'1248735 [48,62] [53,48,62] 5979'293056 2015-05-15 07:40:36.275563 5979'293056 2015-05-15 07:40:36.275563

jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova pg dump_stuck unclean
ok
pg_stat objects mip degr unf bytes log disklog state state_stamp v reported up acting last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp
3.106 11870 0 9 0 49010106368 163991 163991 active 2015-05-15 12:47:19.761469 6035'356332 5968'1358516 [62,53] [62,53] 5979'356242 2015-05-14 22:22:12.966150 5979'351351 2015-05-12 18:04:41.838686
5.104 0 0 0 0 0 0 0 active 2015-05-15 12:47:19.800676 0'0 5968'1615 [62,53] [62,53] 0'0 2015-05-14 18:43:22.425105 0'0 2015-05-08 10:19:54.938934
4.105 0 0 0 0 0 0 0
active 2015-05-15 12:47:19.801028 0'0 5968'1615 [62,53] [62,53] 0'0 2015-05-14 18:43:04.434826 0'0 2015-05-14 18:43:04.434826 3.3af 11600 0 0 0 47941791744 153812 153812 remapped+peering2015-05-15 12:47:17.223786 5979'293066 6000'1248735 [48,62] [53,48,62] 5979'293056 2015-05-15 07:40:36.275563 5979'293056 2015-05-15 07:40:36.275563 The servers in the pool are not overloaded. On the ceph server that originally had the nearly full osd, (osd 53), I'm seeing entries like this in the osd log: 2015-05-28 06:25:02.900129 7f2ea8a4f700 0 log [WRN] : 6 slow requests, 6 included below; oldest blocked for 1096430.805069 secs 2015-05-28 06:25:02.900145 7f2ea8a4f700 0 log [WRN] : slow request
[ceph-users] NFS interaction with RBD
We've had an NFS gateway serving up RBD images successfully for over a year. Ubuntu 12.04 and ceph .73 iirc.

In the past couple of weeks we have developed a problem where the nfs clients hang while accessing exported rbd containers. We see errors on the server about nfsd hanging for 120sec etc. The nfs server is still able to successfully interact with the images it is serving. We can export non-rbd shares from the local file system and nfs clients can use them just fine. There seems to be something weird going on with the rbd and nfs kernel modules.

Our ceph pool is in a warn state due to an osd rebalance that is continuing slowly. But the fact that we continue to have good rbd image access directly on the server makes me think this is not related. Also the nfs server is only a client of the pool, it doesn't participate in it.

Has anyone experienced similar issues? We do have a lot of images attached to the server but the issue is there even when we map only a few.

Thanks for any pointers.

~jpr

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Recommended way to use Ceph as storage for file server
We have an NFS to RBD gateway with a large number of smaller RBDs. In our use case we are allowing users to request their own RBD containers that are then served up via NFS into a mixed cluster of clients. (There's nothing easier for a user to understand than "your disk is full".)

Our gateway is quite beefy, probably more than it needs to be: 2x8 core cpus and 96GB ram. It has been pressed into this service, drawn from a pool of homogeneous servers rather than being spec'd out for this role explicitly (it could likely be less beefy). Our RBD nodes are connected via 2x10GB nics in a transmit-load-balance config. The server has performed well in this role, though that could just be the specs.

An individual RBD in this NFS gateway won't see the parallel performance advantages that CephFS promises; however, one potential advantage is that a multi-RBD backend can service NFS client requests isolated to different RBDs simultaneously. One RBD may still get a heavy load, but at least the server as a whole has the potential to spread requests across different devices. I haven't done load comparisons so this is just a point of interest. It's probably moot if the kernel doesn't do a good job of spreading NFS load across threads or there is some other kernel/RBD constriction point.

~jpr

On 06/02/2014 12:35 PM, Dimitri Maziuk wrote:

A more or less obvious alternative for CephFS would be to simply create a huge RBD and have a separate file server (running NFS / Samba / whatever) use that block device as backend. Just put a regular FS on top of the RBD and use it that way. Clients wouldn't really have any of the real performance and resilience benefits that Ceph could offer though, because the (single machine?) file server is now the bottleneck.

Performance: assuming all your nodes are fast storage on a quad-10Gb pipe.
Resilience: your gateway can be an active-passive HA pair, that shouldn't be any different from NFS+DRBD setups.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] question on harvesting freed space
So in the mean time, are there any common work-arounds?

I'm assuming that monitoring the image used/image size ratio and, if it's greater than some tolerance, creating a new image and moving the file system content over is an effective, if crude, approach.

I'm not clear on how to measure the amount of storage an image uses at the RBD level. Probably because I don't understand the info output:

$ sudo rbd --id nova info somecontainer
rbd image 'somecontainer':
        size 1024 GB in 262144 objects
        order 22 (4096 kB objects)
        block_name_prefix: rb.0.176f3.238e1f29
        format: 1

Are there others? I assume snapshotting images doesn't help here since RBD still wouldn't be able to distinguish what's in use and what's not.

Thoughts?

~jpr

On 04/17/2014 01:38 AM, Wido den Hollander wrote:
> On 04/17/2014 02:39 AM, Somnath Roy wrote:
>> It seems Discard support for kernel rbd is targeted for v80..
>> http://tracker.ceph.com/issues/190
> True, but it will obviously take time before this hits the upstream
> kernels and goes into distributions.
> For RHEL 7 it might be that the krbd module from the Ceph extra repo might work.
> For Ubuntu it's waiting for newer kernels to be backported to the LTS releases.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
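For a rough per-image usage number you can count the image's backing objects in rados, since a format 1 image stores its data in objects named after the block_name_prefix shown by rbd info. A sketch (it assumes the image lives in the default 'rbd' pool; listing can take a while on a large pool, and objects may be only partially written, so treat the result as an upper-bound estimate):

$ sudo rados --id nova -p rbd ls | grep -c '^rb\.0\.176f3\.238e1f29\.'
# each object is at most 4MB here (order 22), so:
#   allocated space <= object_count * 4 MB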
Re: [ceph-users] question on harvesting freed space
So having learned some about fstrim, I ran it on an SSD-backed file system and it reported space freed. I ran it on an RBD-backed file system and was told it's not implemented. This is consistent with the test for FITRIM.

$ cat /sys/block/rbd3/queue/discard_max_bytes
0

On my SSD-backed device I get:

$ cat /sys/block/sda/queue/discard_max_bytes
2147450880

Is this just not needed by RBD or is cleanup handled in a different way?

I'm wondering what will happen to a thin provisioned RBD image over time on a file system with lots of file create/delete activity. Will the storage in the ceph pool stay allocated to this application (the file system) in that case?

Thanks for any additional insights.

~jpr

On 04/15/2014 04:16 PM, John-Paul Robinson wrote:

Thanks for the insight. Based on that I found the fstrim command for xfs file systems.

http://xfs.org/index.php/FITRIM/discard

Anyone had experiences using this command with RBD image backends?

~jpr

On 04/15/2014 02:00 PM, Kyle Bader wrote:

I'm assuming Ceph/RBD doesn't have any direct awareness of this since the file system doesn't traditionally have a give back blocks operation to the block device. Is there anything special RBD does in this case that communicates the release of the Ceph storage back to the pool?

VMs running a 3.2+ kernel (iirc) can give back blocks by issuing TRIM.
http://wiki.qemu.org/Features/QED/Trim

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] question on harvesting freed space
Hi, If I have an 1GB RBD image and format it with say xfs of ext4, then I basically have thin provisioned disk. It takes up only as much space from the Ceph pool as is needed to hold the data structure of the empty file system. If I add files to my file systems and then remove them, how does Ceph deal with these freed blocks? At the file system level the pointers to the blocks get removed from the dir tree and the blocks get added to the free list for potential use by other files. I'm assuming Ceph/RBD doesn't have any direct awareness of this since the file system doesn't traditionally have a give back blocks operation to the block device. Is there anything special RBD does in this case that communicates the release of the Ceph storage back to the pool? Sorry for any gross oversimplifications. ~jpr ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Live database files on Ceph
I've seen this asymmetry (fast everything except sequential reads) in my own simple dd tests on RBD images but haven't really understood the cause. Could you clarify what's going on that would cause that kind of asymmetry?

I've been assuming that once I get around to turning on/tuning read caching on my underlying OSD nodes the situation will improve, but I haven't dug into that yet.

~jpr

On 04/04/2014 04:46 AM, Mark Kirkwood wrote:
> However you may see some asymmetry in this performance - fast random and
> sequential writes, fast random reads but considerably slower sequential reads.
> The RBD cache may help here, but I need to investigate this further (and also
> some of the more fiddly settings to do with vertio disk config).

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
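One knob worth checking for slow sequential reads on a kernel-mapped rbd, independent of the OSD-side caching question, is the block-layer readahead on the mapped device, which defaults fairly low relative to the 4MB object size. A hedged example (device name and value are illustrative; test against your own workload):

$ cat /sys/block/rbd0/queue/read_ahead_kb
$ echo 4096 | sudo tee /sys/block/rbd0/queue/read_ahead_kb
# then re-run the sequential dd test to see whether it helps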
[ceph-users] rebooting nodes in a ceph cluster
What impact does rebooting nodes in a ceph cluster have on the health of the ceph cluster? Can it trigger rebalancing activities that then have to be undone once the node comes back up? I have a 4 node ceph cluster each node has 11 osds. There is a single pool with redundant storage. If it takes 15 minutes for one of my servers to reboot is there a risk that some sort of needless automatic processing will begin? I'm assuming that the ceph cluster can go into a not ok state but that in this particular configuration all the data is protected against the single node failure and there is no place for the data to migrate too so nothing bad will happen. Thanks for any feedback. ~jpr ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] rebooting nodes in a ceph cluster
So is it recommended to adjust the rebalance timeout to align with the time to reboot individual nodes? I didn't see this in my pass through the ops manual but maybe I'm not looking in the right place. Thanks, ~jpr On Dec 19, 2013, at 6:51 PM, Sage Weil s...@inktank.com wrote: On Thu, 19 Dec 2013, John-Paul Robinson wrote: What impact does rebooting nodes in a ceph cluster have on the health of the ceph cluster? Can it trigger rebalancing activities that then have to be undone once the node comes back up? I have a 4 node ceph cluster each node has 11 osds. There is a single pool with redundant storage. If it takes 15 minutes for one of my servers to reboot is there a risk that some sort of needless automatic processing will begin? By default, we start rebalancing data after 5 minutes. You can adjust this (to, say, 15 minutes) with mon osd down out interval = 900 in ceph.conf. sage I'm assuming that the ceph cluster can go into a not ok state but that in this particular configuration all the data is protected against the single node failure and there is no place for the data to migrate too so nothing bad will happen. Thanks for any feedback. ~jpr ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
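For reference, the setting Sage mentions goes in ceph.conf on the monitor hosts, and the usual practice for a short planned reboot is to set noout instead of relying on the timer. A sketch:

# ceph.conf, e.g. under [global] or [mon]
mon osd down out interval = 900

# or, for a one-off planned reboot of an osd host:
$ sudo ceph osd set noout
# ... reboot the node and wait for its osds to rejoin ...
$ sudo ceph osd unset noout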
Re: [ceph-users] Is Ceph a provider of block device too ?
Is this statement accurate? As I understand DRBD, you can replicate online block devices reliably, but with Ceph the replication for RBD images requires that the file system be offline. Thanks for the clarification, ~jpr On 11/08/2013 03:46 PM, Gregory Farnum wrote: Does Ceph provides the distributed filesystem and block device? Ceph's RBD is a distributed block device and works very well; you could use it to replace DRBD. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph and RAID
What is the general take on such a configuration? Is it worth the effort of tracking rebalancing at two layers, the RAID mirror and possibly Ceph if the pool has a redundancy policy? Or is it better to just let ceph rebalance itself when you lose a non-mirrored disk?

If following the raid mirror approach, would you then skip redundancy at the ceph layer to keep your total overhead the same? It seems that would be risky in the event you lose the storage server with the raid-1'd drives; having no Ceph-level redundancy would then be fatal. But if you do raid-1 plus ceph redundancy, doesn't that mean it takes 4TB of raw disk for each 1 real TB?

~jpr

On 10/02/2013 10:03 AM, Dimitri Maziuk wrote:
> I would consider (mdadm) raid-1, dep. on the hardware budget, because this way
> a single disk failure will not trigger a cluster-wide rebalance.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Using RBD with LVM
Thanks. This fixed the problem.

BTW, after adding this line I still got the same error on my pvcreate, but then I ran pvcreate -vvv and found that it was ignoring my /dev/rbd1 device because it had detected a partition signature (which I had added in an earlier attempt to work around this ignored issue). I deleted the partition and the pvcreate worked on all my RBD devices.

A basic recipe for creating an LVM volume is:

for i in 1 2 3
do
  rbd create user1-home-lvm-p0$i --size 102400
  rbd map user1-home-lvm-p0$i
  pvcreate /dev/rbd/rbd/user1-home-lvm-p0$i
done

vgcreate user1-home-vg \
  /dev/rbd/rbd/user1-home-lvm-p01 \
  /dev/rbd/rbd/user1-home-lvm-p02 \
  /dev/rbd/rbd/user1-home-lvm-p03

lvcreate -nuser1-home-lv -l100%FREE user1-home-vg

mkfs.ext4 /dev/user1-home-vg/user1-home-lv
mount /dev/user1-home-vg/user1-home-lv /somewhere

~jpr

On 09/24/2013 07:58 PM, Mandell Degerness wrote:
> You need to add a line to /etc/lvm/lvm.conf:
>
> types = [ "rbd", 1024 ]
>
> It should be in the devices section of the file.
>
> On Tue, Sep 24, 2013 at 5:00 PM, John-Paul Robinson j...@uab.edu wrote:
>> Hi,
>>
>> I'm exploring a configuration with multiple Ceph block devices used with LVM.
>> The goal is to provide a way to grow and shrink my file systems while they are
>> on line.
>>
>> I've created three block devices:
>>
>> $ sudo ./ceph-ls | grep home
>> jpr-home-lvm-p01: 102400 MB
>> jpr-home-lvm-p02: 102400 MB
>> jpr-home-lvm-p03: 102400 MB
>>
>> And have them mapped into my kernel (3.2.0-23-generic #36-Ubuntu SMP):
>>
>> $ sudo rbd showmapped
>> id pool image snap device
>> 0  rbd  jpr-test-vol01   - /dev/rbd0
>> 1  rbd  jpr-home-lvm-p01 - /dev/rbd1
>> 2  rbd  jpr-home-lvm-p02 - /dev/rbd2
>> 3  rbd  jpr-home-lvm-p03 - /dev/rbd3
>>
>> In order to use them with LVM, I need to define them as physical volumes.
>> But when I run this command I get an unexpected error:
>>
>> $ sudo pvcreate /dev/rbd1
>> Device /dev/rbd1 not found (or ignored by filtering).
>>
>> I am able to use other RBD on this same machine to create file systems
>> directly and mount them:
>>
>> $ df -h /mnt-test
>> Filesystem Size Used Avail Use% Mounted on
>> /dev/rbd0  50G  885M 47G   2%   /mnt-test
>>
>> Is there a reason that the /dev/rbd[1-2] devices can't be initialized as
>> physical volumes in LVM?
>>
>> Thanks,
>>
>> ~jpr
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
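And for the grow case mentioned in the original post, the same pattern extends the volume online by adding another RBD-backed PV. A sketch following the naming above (ext4 can be resized while mounted):

rbd create user1-home-lvm-p04 --size 102400
rbd map user1-home-lvm-p04
pvcreate /dev/rbd/rbd/user1-home-lvm-p04
vgextend user1-home-vg /dev/rbd/rbd/user1-home-lvm-p04
lvextend -l +100%FREE /dev/user1-home-vg/user1-home-lv
resize2fs /dev/user1-home-vg/user1-home-lv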
Re: [ceph-users] Using RBD with LVM
Thanks. After fixing the issue with the types entry in lvm.conf, I discovered the -vvv option which helped me detect a the second cause for the ignored error: pvcreate saw a partition signature and skipped the device. The -vvv is s good flag. :) ~jpr On 09/25/2013 01:52 AM, Wido den Hollander wrote: Try this: $ sudo pvcreate -vvv /dev/rbd1 It has something to do with LVM filtering RBD devices away, you might need to add them manually in /etc/lvm/lvm.conf I've seen this before and fixed it, but I forgot what the root cause was. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] RBD on CentOS 6
Hi,

We've been working with Ceph 0.56 on Ubuntu 12.04 and are able to create, map, and mount ceph block devices via the RBD kernel module. We have a CentOS 6.4 box on which we would like to do the same.

The OS recommendations (http://ceph.com/docs/next/install/os-recommendations/) state that we should be at kernel v3.4.20 or better.

Does anyone have any recommendations for or against using a CentOS 6.4 platform to work with RBD in the kernel? We're assuming we will have to upgrade the kernel to 3.4.20 or better (if possible).

Thanks,

~jpr

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
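One route mentioned elsewhere on this list is the elrepo kernel packages for EL6, which track much newer mainline/longterm kernels than the stock 2.6.32 (hedged: we haven't validated that combination with krbd ourselves). Roughly, after installing the elrepo-release package for EL6 from elrepo.org:

$ sudo yum --enablerepo=elrepo-kernel install kernel-ml   # or kernel-lt for the longterm branch
# set the new kernel as the grub default, reboot, then check:
$ uname -r
$ sudo modprobe rbd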