Re: [ceph-users] HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133 pgs stuck unclean; 29 pgs stuck undersized;
Hi Christian,

Thank you for the valuable info. As I will use this cluster mainly at home for my data and testing (backup in place), I will continue to use BTRFS. In production, I would go with XFS as recommended. ZFS - perhaps when it becomes officially supported.

BTW, I fixed the health of my cluster:
1. I set "ceph osd pool set rbd size 2"
2. I set "ceph osd pool set rbd pg_num 256" and "ceph osd pool set rbd pgp_num 256"

5 pgs remained stuck unclean (stuck unclean since forever, current state active, last acting). I fixed this by restarting Ceph (ceph -a); I think the OSD restart is what cleared it. There might be a more elegant solution, but I was not able to figure it out. I tried "pg repair" but that didn't do the trick. Anyway, it seems to be healthy now :).

cephadmin@ceph1:~$ sudo ceph status
    cluster bce2ff4d-e03b-4b75-9b17-8a48ee4d7788
     health HEALTH_OK
     monmap e1: 2 mons at {ceph1=192.168.30.21:6789/0,ceph2=192.168.30.22:6789/0}, election epoch 10, quorum 0,1 ceph1,ceph2
     osdmap e59: 4 osds: 4 up, 4 in
      pgmap v179: 256 pgs, 1 pools, 0 bytes data, 0 objects
            16924 kB used, 11154 GB / 11158 GB avail
                 256 active+clean

Thanks for the help!
Jiri

On 28/12/2014 16:59, Christian Balzer wrote:

Hello Jiri,

On Sun, 28 Dec 2014 16:14:04 +1100 Jiri Kanicky wrote:

Hi Christian. Thank you for your comments again. Very helpful. I will try to fix the current pool and see how it goes. It's good to learn some troubleshooting skills.

Indeed, knowing what to do when things break is where it's at.

Regarding BTRFS vs XFS: I am not sure if the documentation is old. My decision was based on this:
http://ceph.com/docs/master/rados/configuration/filesystem-recommendations/

It's dated for sure, and a bit of wishful thinking on behalf of the Ceph developers, who understandably didn't want to re-invent the wheel inside Ceph when the underlying file system could provide it (checksums, snapshots, etc). ZFS has all the features (much better tested) that BTRFS is aspiring to, and if kept below 80% utilization it doesn't fragment itself to death. At the end of that page they mention deduplication, which of course (as I wrote recently in the "use ZFS for OSDs" thread) is unlikely to do anything worthwhile at all. Simply put, some things _need_ to be done in Ceph to work properly and can't be delegated to the underlying FS or other storage backend.

Christian

Note: We currently recommend XFS for production deployments. We recommend btrfs for testing, development, and any non-critical deployments. *We believe that btrfs has the correct feature set and roadmap to serve Ceph in the long-term*, but XFS and ext4 provide the necessary stability for today's deployments. btrfs development is proceeding rapidly: users should be comfortable installing the latest released upstream kernels and be able to track development activity for critical bug fixes.

Thanks
Jiri

On 28/12/2014 16:01, Christian Balzer wrote:

Hello,

On Sun, 28 Dec 2014 11:58:59 +1100 ji...@ganomi.com wrote:

Hi Christian. Thank you for your suggestions. I will set the "osd pool default size" to 2 as you recommended. As mentioned, the documentation talks about OSDs, not nodes, so that must have confused me.

Note that changing this will only affect new pools of course. So to sort out your current state, either start over with this value set before creating/starting anything, or reduce the current size (ceph osd pool set size). Have a look at the crushmap example, or even better your own current one, and you will see where by default the host is the failure domain.
Which of course makes a lot of sense.

Regarding BTRFS, I thought that btrfs was the better option for the future, providing more features. I know that XFS might be more stable, but again my impression was that btrfs is the focus for future development. Is that correct?

I'm not a developer, but if you scour the ML archives you will find a number of threads about BTRFS (and ZFS). The biggest issues with BTRFS are not just stability but also the fact that it degrades rather quickly (fragmentation) due to its COW nature and less smarts than ZFS in that area. So development on the Ceph side is not the issue per se. IMHO BTRFS looks more and more stillborn, and with regard to Ceph, ZFS might become the better choice (in the future), with KV store backends being an alternative for some use cases (also far from production ready at this time).

Regards,
Christian

You are right with the round up. I forgot about that. Thanks for your help. Much appreciated.
Jiri

- Reply message -
From: "Christian Balzer"
To:
Cc: "Jiri Kanicky"
Subject: [ceph-users] HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133 pgs stuck unclean; 29 pgs stuck undersized;
Date: Sun, Dec 28, 2014 03:29

Hello,

On Sun, 28 Dec 2014 01:52:39 +1100 Jiri Kanicky wrote:

Hi, I just built my Ceph cluster but am having problems with the health of the cluster.
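[A minimal sketch of the fix sequence Jiri describes above, assuming the default "rbd" pool on a 2-host cluster; the restart step uses the sysvinit-era service script of these releases, so adjust for your init system:

    # drop replication to match a 2-host failure domain
    ceph osd pool set rbd size 2
    # round placement groups up to the next power of 2
    ceph osd pool set rbd pg_num 256
    ceph osd pool set rbd pgp_num 256
    # if a few PGs stay stuck unclean, restarting the OSDs can force re-peering
    sudo service ceph -a restart
    ceph health
]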
Re: [ceph-users] HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133 pgs stuck unclean; 29 pgs stuck undersized;
Hello Jiri,

On Sun, 28 Dec 2014 16:14:04 +1100 Jiri Kanicky wrote:
> Hi Christian.
>
> Thank you for your comments again. Very helpful.
>
> I will try to fix the current pool and see how it goes. It's good to
> learn some troubleshooting skills.
>
Indeed, knowing what to do when things break is where it's at.

> Regarding BTRFS vs XFS, I am not sure if the documentation is old. My
> decision was based on this:
>
> http://ceph.com/docs/master/rados/configuration/filesystem-recommendations/
>
It's dated for sure, and a bit of wishful thinking on behalf of the Ceph developers, who understandably didn't want to re-invent the wheel inside Ceph when the underlying file system could provide it (checksums, snapshots, etc). ZFS has all the features (much better tested) that BTRFS is aspiring to, and if kept below 80% utilization it doesn't fragment itself to death. At the end of that page they mention deduplication, which of course (as I wrote recently in the "use ZFS for OSDs" thread) is unlikely to do anything worthwhile at all.

Simply put, some things _need_ to be done in Ceph to work properly and can't be delegated to the underlying FS or other storage backend.

Christian

> Note
>
> We currently recommend XFS for production deployments. We
> recommend btrfs for testing, development, and any non-critical
> deployments. *We believe that btrfs has the correct feature set
> and roadmap to serve Ceph in the long-term*, but XFS and ext4 provide the
> necessary stability for today's deployments. btrfs development is
> proceeding rapidly: users should be comfortable installing the latest
> released upstream kernels and be able to track development activity for
> critical bug fixes.
>
> Thanks
> Jiri
>
> On 28/12/2014 16:01, Christian Balzer wrote:
> > Hello,
> >
> > On Sun, 28 Dec 2014 11:58:59 +1100 ji...@ganomi.com wrote:
> >
> >> Hi Christian.
> >>
> >> Thank you for your suggestions.
> >>
> >> I will set the "osd pool default size" to 2 as you recommended. As
> >> mentioned, the documentation talks about OSDs, not nodes, so that
> >> must have confused me.
> >>
> > Note that changing this will only affect new pools of course. So to
> > sort out your current state either start over with this value set
> > before creating/starting anything or reduce the current size (ceph osd
> > pool set size).
> >
> > Have a look at the crushmap example or even better your own, current
> > one and you will see where by default the host is the failure domain.
> > Which of course makes a lot of sense.
> >
> >> Regarding BTRFS, I thought that btrfs was the better option for the
> >> future, providing more features. I know that XFS might be more stable,
> >> but again my impression was that btrfs is the focus for future
> >> development. Is that correct?
> >>
> > I'm not a developer, but if you scour the ML archives you will find a
> > number of threads about BTRFS (and ZFS).
> > The biggest issues with BTRFS are not just stability but also the fact
> > that it degrades rather quickly (fragmentation) due to its COW nature
> > and less smarts than ZFS in that area.
> > So development on the Ceph side is not the issue per se.
> >
> > IMHO BTRFS looks more and more stillborn, and with regard to Ceph, ZFS
> > might become the better choice (in the future), with KV store backends
> > being an alternative for some use cases (also far from production
> > ready at this time).
> >
> > Regards,
> >
> > Christian
> >> You are right with the round up. I forgot about that.
> >>
> >> Thanks for your help. Much appreciated.
> >> Jiri
> >>
> >> - Reply message -
> >> From: "Christian Balzer"
> >> To:
> >> Cc: "Jiri Kanicky"
> >> Subject: [ceph-users] HEALTH_WARN 29 pgs degraded; 29 pgs stuck
> >> degraded; 133 pgs stuck unclean; 29 pgs stuck undersized; Date: Sun,
> >> Dec 28, 2014 03:29
> >>
> >> Hello,
> >>
> >> On Sun, 28 Dec 2014 01:52:39 +1100 Jiri Kanicky wrote:
> >>
> >>> Hi,
> >>>
> >>> I just built my Ceph cluster but am having problems with the health
> >>> of the cluster.
> >>>
> >> You're not telling us the version, but it's clearly 0.87 or beyond.
> >>
> >>> Here are a few details:
> >>> - I followed the ceph documentation.
> >> Outdated, unfortunately.
> >>
> >>> - I used the btrfs filesystem for all OSDs
> >> Big mistake number 1, do some research (google, ML archives).
> >> Though not related to your problems.
> >>
> >>> - I did not set "osd pool default size = 2" as I thought that if I
> >>> have 2 nodes + 4 OSDs, I can leave default=3. I am not sure if this
> >>> was right.
> >> Big mistake, assumption number 2: replication size by the default
> >> CRUSH rule is determined by hosts. So that's your main issue here.
> >> Either set it to 2 or use 3 hosts.
> >>
> >>> - I noticed that the default pools "data,metadata" were not created.
> >>> Only the "rbd" pool was created.
> >> See outdated docs above. The majority of use cases is with RBD, so
> >> since Giant the cephfs pools are not created by default
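[Since the cephfs pools no longer exist by default, anyone wanting CephFS on Giant or later has to create them by hand. A sketch, with hypothetical pool names and PG counts sized for a small cluster:

    # create data and metadata pools for CephFS
    ceph osd pool create cephfs_data 256
    ceph osd pool create cephfs_metadata 256
    # tie them together into a filesystem (ceph fs new exists since Giant)
    ceph fs new cephfs cephfs_metadata cephfs_data
]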
Re: [ceph-users] Improving Performance with more OSD's?
On Sun, 28 Dec 2014 08:59:33 +1000 Lindsay Mathieson wrote:
> I'm looking to improve the raw performance on my small setup (2 Compute
> Nodes, 2 OSD's). Only used for hosting KVM images.
>
This doesn't really make things clear; do you mean 2 STORAGE nodes with 2 OSDs (HDDs) each? In either case that's a very small setup (and with a replication of 2 a risky one, too), so don't expect great performance. It would help if you'd tell us what these nodes are made of (CPU, RAM, disks, network) so we can at least guess what that cluster might be capable of.

> Raw read/write is roughly 200/35 MB/s. Starting 4+ VM's simultaneously
> pushes iowaits over 30%, though the system keeps chugging along.
>
Throughput numbers aren't exactly worthless, but you will find IOPS to be the killer in most cases. Also, without describing how you measured these numbers (rados bench, fio, bonnie, on the host, inside a VM) they become even more muddled.

> Budget is limited ... :(
>
> I plan to upgrade my SSD journals to something better than the Samsung
> 840 EVO's (Intel 520/530?)
>
Not a big improvement really. Take a look at the 100GB Intel DC S3700s; while they can write "only" at 200MB/s, they are priced rather nicely and they will deliver that performance at ANY time, and for a long time, too.

> One of the things I see mentioned a lot in blogs etc is how ceph's
> performance improves as you add more OSD's and that the quality of the
> disks does not matter so much as the quantity.
>
> How does this work? does ceph stripe reads and writes across the OSD's
> to improve performance?
>
Yes and no. It stripes by default to 4MB objects, so with enough OSDs and clients, I/Os will become distributed, scaling up nicely. However, a single client could be hitting the same object on the same OSD all the time (a small DB file, for example), so you won't see much or any improvement in that case. There is also the option to stripe things on a much smaller scale; however, that takes some planning and needs to be done at image creation time. See and read the Ceph documentation.

> If I add 3 cheap OSD's to each node (500GB - 1TB) with 10GB SSD journal
> partition each could I expect a big improvement in performance?
>
That depends a lot on the stuff you haven't told us (CPU/RAM/network). Given that there is sufficient of those, especially CPU, the answer is yes. A large amount of RAM on the storage nodes will improve reads, as hot objects become and remain cached. Of course having decent HDDs will help even with journals on SSDs; for example, the Toshiba DTxx HDDs (totally not recommended for ANYTHING) cost about the same as their entry-level "enterprise" MG0x drives, which are nearly twice as fast in the IOPS department.

> What sort of redundancy to setup? currently its min=1, size=2. Size is
> not an issue, we already have 150% more space than we need, redundancy
> and performance is more important.
>
You really, really want size 3 and a third node for both performance (reads) and redundancy.

> Now I think on it, we can live with the slow write performance, but
> reducing iowait would be *really* good.
>
Decent SSDs (see above) and more (decent) spindles will help with both.

Regards,

Christian
--
Christian Balzer    Network/Systems Engineer
ch...@gol.com       Global OnLine Japan/Fusion Communications
http://www.gol.com/
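[To make the throughput numbers comparable, it helps to state the tool and where it ran. A hedged sketch of two common measurements; the pool name and the in-VM test path are assumptions:

    # cluster-level bandwidth, run from an admin/client node
    rados bench -p rbd 60 write --no-cleanup
    rados bench -p rbd 60 seq
    # random-IOPS view from inside a VM, closer to KVM workload behaviour
    fio --name=randwrite --filename=/root/fio.test --size=1G --bs=4k \
        --rw=randwrite --iodepth=32 --ioengine=libaio --direct=1
]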
Re: [ceph-users] HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133 pgs stuck unclean; 29 pgs stuck undersized;
Hi Christian.

Thank you for your comments again. Very helpful.

I will try to fix the current pool and see how it goes. It's good to learn some troubleshooting skills.

Regarding BTRFS vs XFS, I am not sure if the documentation is old. My decision was based on this:
http://ceph.com/docs/master/rados/configuration/filesystem-recommendations/

Note: We currently recommend XFS for production deployments. We recommend btrfs for testing, development, and any non-critical deployments. *We believe that btrfs has the correct feature set and roadmap to serve Ceph in the long-term*, but XFS and ext4 provide the necessary stability for today's deployments. btrfs development is proceeding rapidly: users should be comfortable installing the latest released upstream kernels and be able to track development activity for critical bug fixes.

Thanks
Jiri

On 28/12/2014 16:01, Christian Balzer wrote:

Hello,

On Sun, 28 Dec 2014 11:58:59 +1100 ji...@ganomi.com wrote:

Hi Christian. Thank you for your suggestions. I will set the "osd pool default size" to 2 as you recommended. As mentioned, the documentation talks about OSDs, not nodes, so that must have confused me.

Note that changing this will only affect new pools of course. So to sort out your current state either start over with this value set before creating/starting anything or reduce the current size (ceph osd pool set size). Have a look at the crushmap example or even better your own, current one and you will see where by default the host is the failure domain. Which of course makes a lot of sense.

Regarding BTRFS, I thought that btrfs was the better option for the future, providing more features. I know that XFS might be more stable, but again my impression was that btrfs is the focus for future development. Is that correct?

I'm not a developer, but if you scour the ML archives you will find a number of threads about BTRFS (and ZFS). The biggest issues with BTRFS are not just stability but also the fact that it degrades rather quickly (fragmentation) due to its COW nature and less smarts than ZFS in that area. So development on the Ceph side is not the issue per se. IMHO BTRFS looks more and more stillborn, and with regard to Ceph, ZFS might become the better choice (in the future), with KV store backends being an alternative for some use cases (also far from production ready at this time).

Regards,
Christian

You are right with the round up. I forgot about that.

Thanks for your help. Much appreciated.
Jiri

- Reply message -
From: "Christian Balzer"
To:
Cc: "Jiri Kanicky"
Subject: [ceph-users] HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133 pgs stuck unclean; 29 pgs stuck undersized;
Date: Sun, Dec 28, 2014 03:29

Hello,

On Sun, 28 Dec 2014 01:52:39 +1100 Jiri Kanicky wrote:

Hi, I just built my Ceph cluster but am having problems with the health of the cluster.

You're not telling us the version, but it's clearly 0.87 or beyond.

Here are a few details:
- I followed the ceph documentation.

Outdated, unfortunately.

- I used the btrfs filesystem for all OSDs

Big mistake number 1, do some research (google, ML archives). Though not related to your problems.

- I did not set "osd pool default size = 2" as I thought that if I have 2 nodes + 4 OSDs, I can leave default=3. I am not sure if this was right.

Big mistake, assumption number 2: replication size by the default CRUSH rule is determined by hosts. So that's your main issue here. Either set it to 2 or use 3 hosts.

- I noticed that the default pools "data,metadata" were not created. Only the "rbd" pool was created.
See outdated docs above. The majority of use cases is with RBD, so since Giant the cephfs pools are not created by default.

- As it was complaining that the pg_num is too low, I increased the pg_num for pool rbd to 133 (400/3) and ended up with "pool rbd pg_num 133 > pgp_num 64".

Re-read the (in this case correct) documentation. It clearly states to round up to the nearest power of 2, in your case 256.

Regards,

Christian

Would you give me a hint where I have made the mistake? (I can remove the OSDs and start over if needed.)

cephadmin@ceph1:/etc/ceph$ sudo ceph health
HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133 pgs stuck unclean; 29 pgs stuck undersized; 29 pgs undersized; pool rbd pg_num 133 > pgp_num 64

cephadmin@ceph1:/etc/ceph$ sudo ceph status
    cluster bce2ff4d-e03b-4b75-9b17-8a48ee4d7788
     health HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133 pgs stuck unclean; 29 pgs stuck undersized; 29 pgs undersized; pool rbd pg_num 133 > pgp_num 64
     monmap e1: 2 mons at {ceph1=192.168.30.21:6789/0,ceph2=192.168.30.22:6789/0}, election epoch 8, quorum 0,1 ceph1,ceph2
     osdmap e42: 4 osds: 4 up, 4 in
      pgmap v77: 133 pgs, 1 pools, 0 bytes data, 0 objects
            11704 kB used, 11154 GB / 11158 GB avail
                  29 active+undersized+degraded
                 104 active+remapped

cepha
Re: [ceph-users] xfs/nobarrier
On Sun, 28 Dec 2014, Mark Kirkwood wrote:
> On 28/12/14 15:51, Kyle Bader wrote:
> > > do people consider a UPS + Shutdown procedures a suitable substitute?
> >
> > I certainly wouldn't, I've seen utility power fail and the transfer
> > switch fail to transition to UPS strings. Had this happened to me with
> > nobarrier it would have been a very sad day.
>
> I'd second that. In addition I've heard of cases where the switchover to the
> UPS worked ok but the damn thing had a flat battery! So the switchover process
> and UPS reliability need to be well rehearsed + monitored if you want to
> rely on this type of solution.

Right. nobarrier is definitely *NOT* recommended under almost any circumstances. Yes, there are some situations where it is safe, but there are so many things that can go wrong and break it (from buggy kernel to buggy controller firmware to storage device to power etc.) that it is IMO rarely worth the risk.

sage
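[For anyone tempted anyway, the difference is just a mount option; a sketch of the two fstab variants, with a hypothetical device and mount point:

    # default, safe: XFS write barriers enabled
    /dev/sdb1  /var/lib/ceph/osd/ceph-0  xfs  noatime  0 0
    # risky: only defensible behind a non-volatile (battery/flash-backed) cache
    /dev/sdb1  /var/lib/ceph/osd/ceph-0  xfs  noatime,nobarrier  0 0
]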
Re: [ceph-users] HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133 pgs stuck unclean; 29 pgs stuck undersized;
Hello,

On Sun, 28 Dec 2014 11:58:59 +1100 ji...@ganomi.com wrote:
> Hi Christian.
>
> Thank you for your suggestions.
>
> I will set the "osd pool default size" to 2 as you recommended. As
> mentioned, the documentation talks about OSDs, not nodes, so that
> must have confused me.
>
Note that changing this will only affect new pools of course. So to sort out your current state either start over with this value set before creating/starting anything or reduce the current size (ceph osd pool set size).

Have a look at the crushmap example or even better your own, current one and you will see where by default the host is the failure domain. Which of course makes a lot of sense.

> Regarding BTRFS, I thought that btrfs was the better option for the
> future, providing more features. I know that XFS might be more stable,
> but again my impression was that btrfs is the focus for future
> development. Is that correct?
>
I'm not a developer, but if you scour the ML archives you will find a number of threads about BTRFS (and ZFS). The biggest issues with BTRFS are not just stability but also the fact that it degrades rather quickly (fragmentation) due to its COW nature and less smarts than ZFS in that area. So development on the Ceph side is not the issue per se.

IMHO BTRFS looks more and more stillborn, and with regard to Ceph, ZFS might become the better choice (in the future), with KV store backends being an alternative for some use cases (also far from production ready at this time).

Regards,

Christian

> You are right with the round up. I forgot about that.
>
> Thanks for your help. Much appreciated.
> Jiri
>
> - Reply message -
> From: "Christian Balzer"
> To:
> Cc: "Jiri Kanicky"
> Subject: [ceph-users] HEALTH_WARN 29 pgs degraded; 29 pgs stuck
> degraded; 133 pgs stuck unclean; 29 pgs stuck undersized; Date: Sun, Dec
> 28, 2014 03:29
>
> Hello,
>
> On Sun, 28 Dec 2014 01:52:39 +1100 Jiri Kanicky wrote:
>
> > Hi,
> >
> > I just built my Ceph cluster but am having problems with the health of
> > the cluster.
> >
> You're not telling us the version, but it's clearly 0.87 or beyond.
>
> > Here are a few details:
> > - I followed the ceph documentation.
> Outdated, unfortunately.
>
> > - I used the btrfs filesystem for all OSDs
> Big mistake number 1, do some research (google, ML archives).
> Though not related to your problems.
>
> > - I did not set "osd pool default size = 2" as I thought that if I
> > have 2 nodes + 4 OSDs, I can leave default=3. I am not sure if this
> > was right.
> Big mistake, assumption number 2: replication size by the default CRUSH
> rule is determined by hosts. So that's your main issue here.
> Either set it to 2 or use 3 hosts.
>
> > - I noticed that the default pools "data,metadata" were not created.
> > Only the "rbd" pool was created.
> See outdated docs above. The majority of use cases is with RBD, so since
> Giant the cephfs pools are not created by default.
>
> > - As it was complaining that the pg_num is too low, I increased the
> > pg_num for pool rbd to 133 (400/3) and ended up with "pool rbd pg_num
> > 133 > pgp_num 64".
> >
> Re-read the (in this case correct) documentation.
> It clearly states to round up to the nearest power of 2, in your case 256.
>
> Regards,
>
> Christian
>
> > Would you give me a hint where I have made the mistake? (I can remove
> > the OSDs and start over if needed.)
> > > > > > cephadmin@ceph1:/etc/ceph$ sudo ceph health > > HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133 pgs stuck > > unclean; 29 pgs stuck undersized; 29 pgs undersized; pool rbd pg_num > > 133 > > > pgp_num 64 > > cephadmin@ceph1:/etc/ceph$ sudo ceph status > > cluster bce2ff4d-e03b-4b75-9b17-8a48ee4d7788 > > health HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133 > > pgs stuck unclean; 29 pgs stuck undersized; 29 pgs undersized; pool > > rbd pg_num 133 > pgp_num 64 > > monmap e1: 2 mons at > > {ceph1=192.168.30.21:6789/0,ceph2=192.168.30.22:6789/0}, election > > epoch 8, quorum 0,1 ceph1,ceph2 > > osdmap e42: 4 osds: 4 up, 4 in > >pgmap v77: 133 pgs, 1 pools, 0 bytes data, 0 objects > > 11704 kB used, 11154 GB / 11158 GB avail > >29 active+undersized+degraded > > 104 active+remapped > > > > > > cephadmin@ceph1:/etc/ceph$ sudo ceph osd tree > > # idweight type name up/down reweight > > -1 10.88 root default > > -2 5.44host ceph1 > > 0 2.72osd.0 up 1 > > 1 2.72osd.1 up 1 > > -3 5.44host ceph2 > > 2 2.72osd.2 up 1 > > 3 2.72osd.3 up 1 > > > > > > cephadmin@ceph1:/etc/ceph$ sudo ceph osd lspools > > 0 rbd, > > > > cephadmin@ceph1:/etc/ceph$ cat ceph.conf > > [global] > > fsid = bce2ff4d-e03b-4b75-9b17-8a48ee4d7788 > > public_network = 192.168.30.0/24 > > cluster_n
Re: [ceph-users] xfs/nobarrier
On 28/12/14 15:51, Kyle Bader wrote:

do people consider a UPS + Shutdown procedures a suitable substitute?

I certainly wouldn't, I've seen utility power fail and the transfer switch fail to transition to UPS strings. Had this happened to me with nobarrier it would have been a very sad day.

I'd second that. In addition I've heard of cases where the switchover to the UPS worked ok but the damn thing had a flat battery! So the switchover process and UPS reliability need to be well rehearsed + monitored if you want to rely on this type of solution.

Cheers

Mark
Re: [ceph-users] xfs/nobarrier
> do people consider a UPS + Shutdown procedures a suitable substitute?

I certainly wouldn't, I've seen utility power fail and the transfer switch fail to transition to UPS strings. Had this happened to me with nobarrier it would have been a very sad day.

--
Kyle Bader
Re: [ceph-users] Weird scrub problem
On Sat, Dec 27, 2014 at 4:09 PM, Andrey Korolyov wrote:
> On Tue, Dec 23, 2014 at 4:17 AM, Samuel Just wrote:
>> Oh, that's a bit less interesting. The bug might be still around though.
>> -Sam
>>
>> On Mon, Dec 22, 2014 at 2:50 PM, Andrey Korolyov wrote:
>>> On Tue, Dec 23, 2014 at 1:12 AM, Samuel Just wrote:
>>>> You'll have to reproduce with logs on all three nodes. I suggest you
>>>> open a high priority bug and attach the logs.
>>>>
>>>> debug osd = 20
>>>> debug filestore = 20
>>>> debug ms = 1
>>>>
>>>> I'll be out for the holidays, but I should be able to look at it when
>>>> I get back.
>>>> -Sam
>>>
>>> Thanks Sam,
>>>
>>> although I am not sure whether this is of more than historical interest
>>> (the mentioned cluster is running Cuttlefish), I'll try to collect logs
>>> for scrub.
>
> Same stuff:
> https://www.mail-archive.com/ceph-users@lists.ceph.com/msg15447.html
> https://www.mail-archive.com/ceph-users@lists.ceph.com/msg14918.html
>
> Looks like the issue is still with us, though it requires meta or file
> structure corruption to show itself. I'll check if it can be
> reproduced via rsync -X sec pg subdir -> pri pg subdir or vice-versa.
> My case shows slightly different pathnames for the same objects with
> the same checksums, which may be the root cause then. As every case
> mentioned, including mine, happened in an
> oh-shit-hardware-is-broken case, I suggest that the incurable corruption
> happens during primary backfill from the active replica at recovery time.

Recovery/backfill from the corrupted primary copy results in a crash (attached) of the primary OSD; for example, it can be triggered by purging one of the secondary copies (top of cuttlefish branch for line numbers). Although, as the secondaries preserve the same data with the same checksums, it is possible to destroy both the meta record and the pg directory and refill the primary back. The interesting point is that the corrupted primary was completely refilled after the hardware failure, but it looks like it survived long enough after the failure event to spread corruption to the copies; I simply cannot imagine a better explanation.

Thread 1 (Thread 0x7f193190d700 (LWP 64087)):
#0  0x7f194a47ab7b in raise () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00857d59 in reraise_fatal (signum=6) at global/signal_handler.cc:58
#2  handle_fatal_signal (signum=6) at global/signal_handler.cc:104
#3  <signal handler called>
#4  0x7f1948879405 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#5  0x7f194887cb5b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#6  0x7f194917789d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x7f1949175996 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x7f19491759c3 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x7f1949175bee in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x0090436a in ceph::__ceph_assert_fail (assertion=0x9caf67 "r >= 0", file=<optimized out>, line=7115, func=0x9d1900 "void ReplicatedPG::scan_range(hobject_t, int, int, PG::BackfillInterval*)") at common/assert.cc:77
#11 0x0065de69 in ReplicatedPG::scan_range (this=this@entry=0x4df6000, begin=..., min=min@entry=32, max=max@entry=64, bi=bi@entry=0x4df6d40) at osd/ReplicatedPG.cc:7115
#12 0x0066f5c6 in ReplicatedPG::recover_backfill (this=this@entry=0x4df6000, max=max@entry=1) at osd/ReplicatedPG.cc:6923
#13 0x0067c18d in ReplicatedPG::start_recovery_ops (this=0x4df6000, max=1, prctx=<optimized out>) at osd/ReplicatedPG.cc:6561
#14 0x006f2340 in OSD::do_recovery (this=0x2ba7000, pg=pg@entry=0x4df6000) at osd/OSD.cc:6104
#15 0x00735361 in OSD::RecoveryWQ::_process (this=<optimized out>, pg=0x4df6000) at osd/OSD.h:1248
#16 0x008faeba in ThreadPool::worker (this=0x2ba75e0, wt=0x7be1540) at common/WorkQueue.cc:119
#17 0x008fc160 in ThreadPool::WorkThread::entry (this=<optimized out>) at common/WorkQueue.h:316
#18 0x7f194a472e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#19 0x7f19489353dd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#20 0x in ?? ()
Re: [ceph-users] HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133 pgs stuck unclean; 29 pgs stuck undersized;
Hi Christian.

Thank you for your suggestions.

I will set the "osd pool default size" to 2 as you recommended. As mentioned, the documentation talks about OSDs, not nodes, so that must have confused me.

Regarding BTRFS, I thought that btrfs was the better option for the future, providing more features. I know that XFS might be more stable, but again my impression was that btrfs is the focus for future development. Is that correct?

You are right with the round up. I forgot about that.

Thanks for your help. Much appreciated.
Jiri

- Reply message -
From: "Christian Balzer"
To:
Cc: "Jiri Kanicky"
Subject: [ceph-users] HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133 pgs stuck unclean; 29 pgs stuck undersized;
Date: Sun, Dec 28, 2014 03:29

Hello,

On Sun, 28 Dec 2014 01:52:39 +1100 Jiri Kanicky wrote:
> Hi,
>
> I just built my Ceph cluster but am having problems with the health of the
> cluster.
>
You're not telling us the version, but it's clearly 0.87 or beyond.

> Here are a few details:
> - I followed the ceph documentation.
Outdated, unfortunately.

> - I used the btrfs filesystem for all OSDs
Big mistake number 1, do some research (google, ML archives).
Though not related to your problems.

> - I did not set "osd pool default size = 2" as I thought that if I have
> 2 nodes + 4 OSDs, I can leave default=3. I am not sure if this was right.
Big mistake, assumption number 2: replication size by the default CRUSH
rule is determined by hosts. So that's your main issue here.
Either set it to 2 or use 3 hosts.

> - I noticed that the default pools "data,metadata" were not created. Only
> the "rbd" pool was created.
See outdated docs above. The majority of use cases is with RBD, so since
Giant the cephfs pools are not created by default.

> - As it was complaining that the pg_num is too low, I increased the
> pg_num for pool rbd to 133 (400/3) and ended up with "pool rbd pg_num 133
> > pgp_num 64".
>
Re-read the (in this case correct) documentation.
It clearly states to round up to the nearest power of 2, in your case 256.

Regards,

Christian

> Would you give me a hint where I have made the mistake? (I can remove the
> OSDs and start over if needed.)
> cephadmin@ceph1:/etc/ceph$ sudo ceph health
> HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133 pgs stuck
> unclean; 29 pgs stuck undersized; 29 pgs undersized; pool rbd pg_num 133
> > pgp_num 64
>
> cephadmin@ceph1:/etc/ceph$ sudo ceph status
>     cluster bce2ff4d-e03b-4b75-9b17-8a48ee4d7788
>      health HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133 pgs
>      stuck unclean; 29 pgs stuck undersized; 29 pgs undersized; pool rbd
>      pg_num 133 > pgp_num 64
>      monmap e1: 2 mons at
>      {ceph1=192.168.30.21:6789/0,ceph2=192.168.30.22:6789/0}, election
>      epoch 8, quorum 0,1 ceph1,ceph2
>      osdmap e42: 4 osds: 4 up, 4 in
>       pgmap v77: 133 pgs, 1 pools, 0 bytes data, 0 objects
>             11704 kB used, 11154 GB / 11158 GB avail
>             29 active+undersized+degraded
>             104 active+remapped
>
> cephadmin@ceph1:/etc/ceph$ sudo ceph osd tree
> # id    weight  type name       up/down reweight
> -1      10.88   root default
> -2      5.44            host ceph1
> 0       2.72                    osd.0   up      1
> 1       2.72                    osd.1   up      1
> -3      5.44            host ceph2
> 2       2.72                    osd.2   up      1
> 3       2.72                    osd.3   up      1
>
> cephadmin@ceph1:/etc/ceph$ sudo ceph osd lspools
> 0 rbd,
>
> cephadmin@ceph1:/etc/ceph$ cat ceph.conf
> [global]
> fsid = bce2ff4d-e03b-4b75-9b17-8a48ee4d7788
> public_network = 192.168.30.0/24
> cluster_network = 10.1.1.0/24
> mon_initial_members = ceph1, ceph2
> mon_host = 192.168.30.21,192.168.30.22
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
> filestore_xattr_use_omap = true
>
> Thank you
> Jiri

--
Christian Balzer    Network/Systems Engineer
ch...@gol.com       Global OnLine Japan/Fusion Communications
http://www.gol.com/
Re: [ceph-users] HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133 pgs stuck unclean; 29 pgs stuck undersized;
Thanks for the tip. Will do.

Jiri

- Reply message -
From: "Nico Schottelius"
To:
Subject: [ceph-users] HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133 pgs stuck unclean; 29 pgs stuck undersized;
Date: Sun, Dec 28, 2014 03:49

Hey Jiri,

also raise the pgp_num (pg != pgp - it's easy to overread).

Cheers,

Nico

Jiri Kanicky [Sun, Dec 28, 2014 at 01:52:39AM +1100]:
> Hi,
>
> I just built my Ceph cluster but am having problems with the health of
> the cluster.
>
> Here are a few details:
> - I followed the ceph documentation.
> - I used the btrfs filesystem for all OSDs
> - I did not set "osd pool default size = 2" as I thought that if I
> have 2 nodes + 4 OSDs, I can leave default=3. I am not sure if this
> was right.
> - I noticed that the default pools "data,metadata" were not created.
> Only the "rbd" pool was created.
> - As it was complaining that the pg_num is too low, I increased the
> pg_num for pool rbd to 133 (400/3) and ended up with "pool rbd pg_num
> 133 > pgp_num 64".
>
> Would you give me a hint where I have made the mistake? (I can remove
> the OSDs and start over if needed.)
>
> cephadmin@ceph1:/etc/ceph$ sudo ceph health
> HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133 pgs stuck
> unclean; 29 pgs stuck undersized; 29 pgs undersized; pool rbd pg_num
> 133 > pgp_num 64
> cephadmin@ceph1:/etc/ceph$ sudo ceph status
>     cluster bce2ff4d-e03b-4b75-9b17-8a48ee4d7788
>      health HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133
>      pgs stuck unclean; 29 pgs stuck undersized; 29 pgs undersized; pool
>      rbd pg_num 133 > pgp_num 64
>      monmap e1: 2 mons at
>      {ceph1=192.168.30.21:6789/0,ceph2=192.168.30.22:6789/0}, election
>      epoch 8, quorum 0,1 ceph1,ceph2
>      osdmap e42: 4 osds: 4 up, 4 in
>       pgmap v77: 133 pgs, 1 pools, 0 bytes data, 0 objects
>             11704 kB used, 11154 GB / 11158 GB avail
>             29 active+undersized+degraded
>             104 active+remapped
>
> cephadmin@ceph1:/etc/ceph$ sudo ceph osd tree
> # id    weight  type name       up/down reweight
> -1      10.88   root default
> -2      5.44            host ceph1
> 0       2.72                    osd.0   up      1
> 1       2.72                    osd.1   up      1
> -3      5.44            host ceph2
> 2       2.72                    osd.2   up      1
> 3       2.72                    osd.3   up      1
>
> cephadmin@ceph1:/etc/ceph$ sudo ceph osd lspools
> 0 rbd,
>
> cephadmin@ceph1:/etc/ceph$ cat ceph.conf
> [global]
> fsid = bce2ff4d-e03b-4b75-9b17-8a48ee4d7788
> public_network = 192.168.30.0/24
> cluster_network = 10.1.1.0/24
> mon_initial_members = ceph1, ceph2
> mon_host = 192.168.30.21,192.168.30.22
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
> filestore_xattr_use_omap = true
>
> Thank you
> Jiri

--
New PGP key: 659B 0D91 E86E 7E24 FD15 69D0 C729 21A1 293F 2D24
[ceph-users] Improving Performance with more OSD's?
I'm looking to improve the raw performance on my small setup (2 Compute Nodes, 2 OSD's). Only used for hosting KVM images.

Raw read/write is roughly 200/35 MB/s. Starting 4+ VM's simultaneously pushes iowaits over 30%, though the system keeps chugging along.

Budget is limited ... :(

I plan to upgrade my SSD journals to something better than the Samsung 840 EVO's (Intel 520/530?)

One of the things I see mentioned a lot in blogs etc is how ceph's performance improves as you add more OSD's and that the quality of the disks does not matter so much as the quantity.

How does this work? Does ceph stripe reads and writes across the OSD's to improve performance?

If I add 3 cheap OSD's to each node (500GB - 1TB) with a 10GB SSD journal partition each, could I expect a big improvement in performance?

What sort of redundancy should I set up? Currently it's min=1, size=2. Space is not an issue, we already have 150% more space than we need; redundancy and performance are more important.

Now I think on it, we can live with the slow write performance, but reducing iowait would be *really* good.

thanks,
--
Lindsay
Re: [ceph-users] xfs/nobarrier
On Sun, Dec 28, 2014 at 1:25 AM, Lindsay Mathieson wrote:
> On Sat, 27 Dec 2014 06:02:32 PM you wrote:
>> Are you able to separate the log from the data in your setup and check the
>> difference?
>
> Do you mean putting the OSD journal on a separate disk? I have the journals on
> SSD partitions, which has helped a lot, previously I was getting 13 MB/s
>
No, I meant the XFS journal, as we are speaking about filestore fs performance.

> Its not a good SSD - Samsung 840 EVO :( one of my plans for the new year is to
> get SSD's with better seq write speed and IOPS
>
> I've been trying to figure out if adding more OSD's will improve my
> performance, I only have 2 OSD's (one per node)

Erm, yes. Two OSDs cannot be considered even for a performance measurement testbed setup, and neither should three or any other small number. This explains the numbers you are getting and the impact of the nobarrier option.

>> So, depending on the type of your benchmark
>> (sync/async/IOPS-/bandwidth-hungry) you may win something just for
>> crossing journal and data between disks (and increase the failure domain
>> for a single disk as well :) ).
>
> One does tend to focus on raw seq read/writes for benchmarking, but my actual
> usage is solely for hosting KVM images, so really random R/W is probably more
> important.

Ok, then my suggestion may not help as much as it can.

> --
> Lindsay
Re: [ceph-users] xfs/nobarrier
On Sat, 27 Dec 2014 06:02:32 PM you wrote:
> Are you able to separate the log from the data in your setup and check the
> difference?

Do you mean putting the OSD journal on a separate disk? I have the journals on SSD partitions, which has helped a lot; previously I was getting 13 MB/s.

It's not a good SSD - Samsung 840 EVO :( One of my plans for the new year is to get SSD's with better seq write speed and IOPS.

I've been trying to figure out if adding more OSD's will improve my performance, I only have 2 OSD's (one per node).

> So, depending on the type of your benchmark
> (sync/async/IOPS-/bandwidth-hungry) you may win something just for
> crossing journal and data between disks (and increase the failure domain
> for a single disk as well :) ).

One does tend to focus on raw seq read/writes for benchmarking, but my actual usage is solely for hosting KVM images, so really random R/W is probably more important.

--
Lindsay
Re: [ceph-users] RBD client & STRIPINGV2 support
On Sat, Dec 27, 2014 at 6:46 PM, Florent MONTHEL wrote:
> Hi,
>
> I’ve just created an image with striping support like below (image type 2 - 16
> stripes of 64K with 4MB objects):
>
> rbd create sandevices/flaprdweb01_lun010 --size 102400 --stripe-unit 65536
> --stripe-count 16 --order 22 --image-format 2
>
> rbd info sandevices/flaprdweb01_lun010
> rbd image 'flaprdweb01_lun010':
>         size 102400 MB in 25600 objects
>         order 22 (4096 kB objects)
>         block_name_prefix: rbd_data.40c52ae8944a
>         format: 2
>         features: layering, striping
>         stripe unit: 65536 bytes
>         stripe count: 16
>
> But when I try to map the device, I get an unsupported striping alert on my
> dmesg console.
>
> rbd map sandevices/flaprdweb01_lun010 --name client.admin
> rbd: sysfs write failed
> rbd: map failed: (22) Invalid argument
>
> dmesg | tail
> [15352.510385] rbd: image flaprdweb01_lun010: unsupported stripe unit (got
> 65536 want 4194304)
>
> Do you know if it’s scheduled to support STRIPINGV2 on the rbd client?
> How can I mount my device?

You can't - krbd doesn't support it yet. It's planned; in fact it's the top item on the krbd list. Currently STRIPINGV2 images can be mapped only if su=4M and sc=1 (i.e. if the striping params match v1 images), and that's the error you are tripping over.

Thanks,

Ilya
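[Until krbd learns STRIPINGV2, a format 2 image has to keep the default striping to be mappable; a sketch with a hypothetical image name in the same pool:

    # default striping (stripe unit = object size, stripe count = 1) maps fine
    rbd create sandevices/flaprdweb01_lun011 --size 102400 --order 22 --image-format 2
    rbd map sandevices/flaprdweb01_lun011 --name client.admin
    # fancy-striped images remain usable via librbd (e.g. QEMU) in the meantime
]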
[ceph-users] Not running multiple services on the same machine?
Hi folks,

I've heard several comments on the mailing list warning against running multiple Ceph services (monitors, daemons, MDS, gateway) on the same machine. I was wondering if someone could shed more light on the dangers of this. In Deis[1] we only require clusters to be 3 machines big, and we need to run monitors, daemons, and MDS servers.

Deis runs on CoreOS, so all of our services are shipped as Docker containers. We run Ceph within containers as our store[2] component, so on a single CoreOS host we're running a monitor, daemon, MDS, gateway, and consuming the cluster with a CephFS mount.

I know it's ill-advised, but my question is - why? What sort of issues are we looking at? Data loss, performance, etc.? When I implemented this I was unaware of the recommendation not to do this, and I'd like to address any potential issues now.

Thanks!

Chris

[1]: https://github.com/deis/deis
[2]: https://github.com/deis/deis/tree/master/store
Re: [ceph-users] HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133 pgs stuck unclean; 29 pgs stuck undersized;
Hey Jiri,

also raise the pgp_num (pg != pgp - it's easy to overread).

Cheers,

Nico

Jiri Kanicky [Sun, Dec 28, 2014 at 01:52:39AM +1100]:
> Hi,
>
> I just built my Ceph cluster but am having problems with the health of
> the cluster.
>
> Here are a few details:
> - I followed the ceph documentation.
> - I used the btrfs filesystem for all OSDs
> - I did not set "osd pool default size = 2" as I thought that if I
> have 2 nodes + 4 OSDs, I can leave default=3. I am not sure if this
> was right.
> - I noticed that the default pools "data,metadata" were not created.
> Only the "rbd" pool was created.
> - As it was complaining that the pg_num is too low, I increased the
> pg_num for pool rbd to 133 (400/3) and ended up with "pool rbd pg_num
> 133 > pgp_num 64".
>
> Would you give me a hint where I have made the mistake? (I can remove
> the OSDs and start over if needed.)
>
> cephadmin@ceph1:/etc/ceph$ sudo ceph health
> HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133 pgs stuck
> unclean; 29 pgs stuck undersized; 29 pgs undersized; pool rbd pg_num
> 133 > pgp_num 64
> cephadmin@ceph1:/etc/ceph$ sudo ceph status
>     cluster bce2ff4d-e03b-4b75-9b17-8a48ee4d7788
>      health HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133
>      pgs stuck unclean; 29 pgs stuck undersized; 29 pgs undersized; pool
>      rbd pg_num 133 > pgp_num 64
>      monmap e1: 2 mons at
>      {ceph1=192.168.30.21:6789/0,ceph2=192.168.30.22:6789/0}, election
>      epoch 8, quorum 0,1 ceph1,ceph2
>      osdmap e42: 4 osds: 4 up, 4 in
>       pgmap v77: 133 pgs, 1 pools, 0 bytes data, 0 objects
>             11704 kB used, 11154 GB / 11158 GB avail
>             29 active+undersized+degraded
>             104 active+remapped
>
> cephadmin@ceph1:/etc/ceph$ sudo ceph osd tree
> # id    weight  type name       up/down reweight
> -1      10.88   root default
> -2      5.44            host ceph1
> 0       2.72                    osd.0   up      1
> 1       2.72                    osd.1   up      1
> -3      5.44            host ceph2
> 2       2.72                    osd.2   up      1
> 3       2.72                    osd.3   up      1
>
> cephadmin@ceph1:/etc/ceph$ sudo ceph osd lspools
> 0 rbd,
>
> cephadmin@ceph1:/etc/ceph$ cat ceph.conf
> [global]
> fsid = bce2ff4d-e03b-4b75-9b17-8a48ee4d7788
> public_network = 192.168.30.0/24
> cluster_network = 10.1.1.0/24
> mon_initial_members = ceph1, ceph2
> mon_host = 192.168.30.21,192.168.30.22
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
> filestore_xattr_use_omap = true
>
> Thank you
> Jiri

--
New PGP key: 659B 0D91 E86E 7E24 FD15 69D0 C729 21A1 293F 2D24
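[A quick way to confirm the two values have diverged and bring pgp_num back in line, assuming the "rbd" pool from the thread:

    # compare the two counters
    ceph osd pool get rbd pg_num
    ceph osd pool get rbd pgp_num
    # pgp_num must be raised to match pg_num before data rebalances
    ceph osd pool set rbd pgp_num 256
]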
Re: [ceph-users] HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133 pgs stuck unclean; 29 pgs stuck undersized;
Hello,

On Sun, 28 Dec 2014 01:52:39 +1100 Jiri Kanicky wrote:
> Hi,
>
> I just built my Ceph cluster but am having problems with the health of the
> cluster.
>
You're not telling us the version, but it's clearly 0.87 or beyond.

> Here are a few details:
> - I followed the ceph documentation.
Outdated, unfortunately.

> - I used the btrfs filesystem for all OSDs
Big mistake number 1, do some research (google, ML archives).
Though not related to your problems.

> - I did not set "osd pool default size = 2" as I thought that if I have
> 2 nodes + 4 OSDs, I can leave default=3. I am not sure if this was right.
Big mistake, assumption number 2: replication size by the default CRUSH
rule is determined by hosts. So that's your main issue here.
Either set it to 2 or use 3 hosts.

> - I noticed that the default pools "data,metadata" were not created. Only
> the "rbd" pool was created.
See outdated docs above. The majority of use cases is with RBD, so since
Giant the cephfs pools are not created by default.

> - As it was complaining that the pg_num is too low, I increased the
> pg_num for pool rbd to 133 (400/3) and ended up with "pool rbd pg_num 133
> > pgp_num 64".
>
Re-read the (in this case correct) documentation.
It clearly states to round up to the nearest power of 2, in your case 256.

Regards,

Christian

> Would you give me a hint where I have made the mistake? (I can remove the
> OSDs and start over if needed.)
>
> cephadmin@ceph1:/etc/ceph$ sudo ceph health
> HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133 pgs stuck
> unclean; 29 pgs stuck undersized; 29 pgs undersized; pool rbd pg_num 133
> > pgp_num 64
> cephadmin@ceph1:/etc/ceph$ sudo ceph status
>     cluster bce2ff4d-e03b-4b75-9b17-8a48ee4d7788
>      health HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133 pgs
>      stuck unclean; 29 pgs stuck undersized; 29 pgs undersized; pool rbd
>      pg_num 133 > pgp_num 64
>      monmap e1: 2 mons at
>      {ceph1=192.168.30.21:6789/0,ceph2=192.168.30.22:6789/0}, election
>      epoch 8, quorum 0,1 ceph1,ceph2
>      osdmap e42: 4 osds: 4 up, 4 in
>       pgmap v77: 133 pgs, 1 pools, 0 bytes data, 0 objects
>             11704 kB used, 11154 GB / 11158 GB avail
>             29 active+undersized+degraded
>             104 active+remapped
>
> cephadmin@ceph1:/etc/ceph$ sudo ceph osd tree
> # id    weight  type name       up/down reweight
> -1      10.88   root default
> -2      5.44            host ceph1
> 0       2.72                    osd.0   up      1
> 1       2.72                    osd.1   up      1
> -3      5.44            host ceph2
> 2       2.72                    osd.2   up      1
> 3       2.72                    osd.3   up      1
>
> cephadmin@ceph1:/etc/ceph$ sudo ceph osd lspools
> 0 rbd,
>
> cephadmin@ceph1:/etc/ceph$ cat ceph.conf
> [global]
> fsid = bce2ff4d-e03b-4b75-9b17-8a48ee4d7788
> public_network = 192.168.30.0/24
> cluster_network = 10.1.1.0/24
> mon_initial_members = ceph1, ceph2
> mon_host = 192.168.30.21,192.168.30.22
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
> filestore_xattr_use_omap = true
>
> Thank you
> Jiri

--
Christian Balzer    Network/Systems Engineer
ch...@gol.com       Global OnLine Japan/Fusion Communications
http://www.gol.com/
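[To see the host-level failure domain Christian refers to, the decompiled CRUSH map shows it directly; a sketch using the standard tooling:

    # dump the current CRUSH map and decompile it to text
    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt
    # the default replicated rule places one replica per host
    grep chooseleaf crushmap.txt
    #   step chooseleaf firstn 0 type host
]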
[ceph-users] RBD client & STRIPINGV2 support
Hi,

I’ve just created an image with striping support like below (image type 2 - 16 stripes of 64K with 4MB objects):

rbd create sandevices/flaprdweb01_lun010 --size 102400 --stripe-unit 65536 --stripe-count 16 --order 22 --image-format 2

rbd info sandevices/flaprdweb01_lun010
rbd image 'flaprdweb01_lun010':
        size 102400 MB in 25600 objects
        order 22 (4096 kB objects)
        block_name_prefix: rbd_data.40c52ae8944a
        format: 2
        features: layering, striping
        stripe unit: 65536 bytes
        stripe count: 16

But when I try to map the device, I get an unsupported striping alert on my dmesg console.

rbd map sandevices/flaprdweb01_lun010 --name client.admin
rbd: sysfs write failed
rbd: map failed: (22) Invalid argument

dmesg | tail
[15352.510385] rbd: image flaprdweb01_lun010: unsupported stripe unit (got 65536 want 4194304)

Do you know if it’s scheduled to support STRIPINGV2 on the rbd client?
How can I mount my device?

Thanks in advance

Florent Monthel
[ceph-users] HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133 pgs stuck unclean; 29 pgs stuck undersized;
Hi,

I just built my Ceph cluster but am having problems with the health of the cluster.

Here are a few details:
- I followed the ceph documentation.
- I used the btrfs filesystem for all OSDs
- I did not set "osd pool default size = 2" as I thought that if I have 2 nodes + 4 OSDs, I can leave default=3. I am not sure if this was right.
- I noticed that the default pools "data,metadata" were not created. Only the "rbd" pool was created.
- As it was complaining that the pg_num is too low, I increased the pg_num for pool rbd to 133 (400/3) and ended up with "pool rbd pg_num 133 > pgp_num 64".

Would you give me a hint where I have made the mistake? (I can remove the OSDs and start over if needed.)

cephadmin@ceph1:/etc/ceph$ sudo ceph health
HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133 pgs stuck unclean; 29 pgs stuck undersized; 29 pgs undersized; pool rbd pg_num 133 > pgp_num 64

cephadmin@ceph1:/etc/ceph$ sudo ceph status
    cluster bce2ff4d-e03b-4b75-9b17-8a48ee4d7788
     health HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133 pgs stuck unclean; 29 pgs stuck undersized; 29 pgs undersized; pool rbd pg_num 133 > pgp_num 64
     monmap e1: 2 mons at {ceph1=192.168.30.21:6789/0,ceph2=192.168.30.22:6789/0}, election epoch 8, quorum 0,1 ceph1,ceph2
     osdmap e42: 4 osds: 4 up, 4 in
      pgmap v77: 133 pgs, 1 pools, 0 bytes data, 0 objects
            11704 kB used, 11154 GB / 11158 GB avail
                  29 active+undersized+degraded
                 104 active+remapped

cephadmin@ceph1:/etc/ceph$ sudo ceph osd tree
# id    weight  type name       up/down reweight
-1      10.88   root default
-2      5.44            host ceph1
0       2.72                    osd.0   up      1
1       2.72                    osd.1   up      1
-3      5.44            host ceph2
2       2.72                    osd.2   up      1
3       2.72                    osd.3   up      1

cephadmin@ceph1:/etc/ceph$ sudo ceph osd lspools
0 rbd,

cephadmin@ceph1:/etc/ceph$ cat ceph.conf
[global]
fsid = bce2ff4d-e03b-4b75-9b17-8a48ee4d7788
public_network = 192.168.30.0/24
cluster_network = 10.1.1.0/24
mon_initial_members = ceph1, ceph2
mon_host = 192.168.30.21,192.168.30.22
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true

Thank you
Jiri
Re: [ceph-users] xfs/nobarrier
On Sat, Dec 27, 2014 at 4:31 PM, Lindsay Mathieson wrote:
> On Sat, 27 Dec 2014 04:59:51 PM you wrote:
>> Power supply means bigger capex and less redundancy, as the emergency
>> procedure in case of power failure is less deterministic than with
>> controlled battery-backed cache.
>
> Yes, the whole auto shut-down procedure is rather more complex and fragile
> for a UPS than a controller cache
>
>> Anyway XFS nobarrier
>> does not bring enough performance boost to be enabled by my
>> experience.
>
> It makes a non-trivial difference on my (admittedly slow) setup, with write
> bandwidth going from 35 MB/s to 51 MB/s
>
Are you able to separate the log from the data in your setup and check the difference? If your devices are working strictly under their upper limits for bw/IOPS, separating meta and data bytes may help a lot, at least for synchronous clients. So, depending on the type of your benchmark (sync/async/IOPS-/bandwidth-hungry) you may win something just for crossing journal and data between disks (and increase the failure domain for a single disk as well :) ).
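[Separating the XFS log from the data, as Andrey suggests, is done at mkfs/mount time; a sketch with hypothetical devices (an SSD partition for the log, a spinner for data):

    # create the filesystem with an external log on the SSD partition
    mkfs.xfs -l logdev=/dev/sdc1,size=128m /dev/sdb1
    # the log device must also be named at mount time
    mount -o logdev=/dev/sdc1,noatime /dev/sdb1 /var/lib/ceph/osd/ceph-0
]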
Re: [ceph-users] Weird scrub problem
On Tue, Dec 23, 2014 at 4:17 AM, Samuel Just wrote:
> Oh, that's a bit less interesting. The bug might be still around though.
> -Sam
>
> On Mon, Dec 22, 2014 at 2:50 PM, Andrey Korolyov wrote:
>> On Tue, Dec 23, 2014 at 1:12 AM, Samuel Just wrote:
>>> You'll have to reproduce with logs on all three nodes. I suggest you
>>> open a high priority bug and attach the logs.
>>>
>>> debug osd = 20
>>> debug filestore = 20
>>> debug ms = 1
>>>
>>> I'll be out for the holidays, but I should be able to look at it when
>>> I get back.
>>> -Sam
>>>
>>
>> Thanks Sam,
>>
>> although I am not sure whether this is of more than historical interest
>> (the mentioned cluster is running Cuttlefish), I'll try to collect logs
>> for scrub.

Same stuff:
https://www.mail-archive.com/ceph-users@lists.ceph.com/msg15447.html
https://www.mail-archive.com/ceph-users@lists.ceph.com/msg14918.html

Looks like the issue is still with us, though it requires meta or file structure corruption to show itself. I'll check if it can be reproduced via rsync -X sec pg subdir -> pri pg subdir or vice-versa. My case shows slightly different pathnames for the same objects with the same checksums, which may be the root cause then. As every case mentioned, including mine, happened in an oh-shit-hardware-is-broken case, I suggest that the incurable corruption happens during primary backfill from the active replica at recovery time.
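[Sam's debug settings can be applied to a running cluster without a restart while reproducing the scrub problem; a sketch, where the OSD and PG ids are placeholders:

    # raise logging on the involved OSDs, then trigger the scrub
    ceph tell osd.2 injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1'
    ceph pg deep-scrub 2.106
    # once an inconsistency is confirmed, repair copies from the primary
    ceph pg repair 2.106
]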
Re: [ceph-users] xfs/nobarrier
On Sat, 27 Dec 2014 04:59:51 PM you wrote:
> Power supply means bigger capex and less redundancy, as the emergency
> procedure in case of power failure is less deterministic than with
> controlled battery-backed cache.

Yes, the whole auto shut-down procedure is rather more complex and fragile for a UPS than a controller cache.

> Anyway XFS nobarrier
> does not bring enough performance boost to be enabled by my
> experience.

It makes a non-trivial difference on my (admittedly slow) setup, with write bandwidth going from 35 MB/s to 51 MB/s.

--
Lindsay
Re: [ceph-users] xfs/nobarrier
Power supply means bigger capex and less redundancy, as the emergency procedure in case of power failure is less deterministic than with a controlled battery-backed cache. A cache battery is smaller and far more predictable for health measurements than a UPS (if it passes its internal check, it will *always* be enough to keep memory powered for a while, whereas a UPS requires periodic battle testing if you want to know that it can still ride out a power failure; with two power lanes it should be safe enough, simply because the device itself has a more complex structure than a battery with a single voltage stabilizer). Anyway, XFS nobarrier does not bring enough of a performance boost to be enabled, in my experience.
Re: [ceph-users] replace osd's disk, cann't auto recover data
Sorry, the problem is fixed; it was caused by a modification we made to the code.

From: 邱尚高
Date: 2014-12-27 12:40
To: ceph-users
Subject: replace osd's disk, cann't auto recover data

3 hosts: 1 CPU + 4 disks (3 TB SATA) each
Ceph version: 0.80.6
OS: Redhat 6.5
Cluster: 3 hosts, with 3 MONs + 9 OSDs (one OSD holds one disk)

1. When the cluster status is HEALTH_OK, I write a little data; then I can find some block files in the PG directory.

[root@rhls-test2 release]# ll data/osd/ceph-0/current/2.106_head/
total 4100
-rw-r--r--. 1 root root 4194304 Dec 17 16:25 rb.0.1021.6b8b4567.0024__head_753F3906__2

2. Before replacing the OSD disk, we set the cluster NOOUT flag.

3. We stop OSD.2, which serves PG 2.106 as a replica node, and replace the disk with an empty disk.

4. We format the disk with an xfs filesystem, and use ceph-osd --mkfs to format it:

ceph-osd -i 2 --mkfs --set-osd-fsid 86828adf-7579-4127-8789-cb5e8266f15c

Note: To simplify replacing the disk, we modified the ceph-osd code to add a --set-osd-fsid option so the OSD can reuse the old fsid.

5. The OSD starts OK, and we can see that all PGs' status is active+clean.

    cluster 7c731223-9637-4e21-a6f5-c576a9cf92a4
     health HEALTH_OK
     monmap e1: 3 mons at {a=192.169.1.84:6789/0,b=192.169.1.85:6789/0,c=192.169.1.86:6789/0}, election epoch 78, quorum 0,1,2 a,b,c
     osdmap e808: 9 osds: 9 up, 9 in
      pgmap v36218: 3072 pgs, 3 pools, 7069 MB data, 8254 objects
            48063 MB used, 22298 GB / 22345 GB avail
                3072 active+clean

6. But I find that the osd.2 disk has no data blocks except the metadata (omap, superblock, etc.), and I can find all the PGs' directories, but they are empty.

[root@rhls-test2 release]# ll data/osd/ceph-2/current/2.106_head/
total 0
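[For reference, the conventional replacement procedure avoids the custom --set-osd-fsid patch entirely; a hedged sketch of the documented steps, where the device path and the way you stop/start the daemon are assumptions to adjust per platform:

    ceph osd set noout
    # stop osd.2, swap the physical disk, then rebuild it
    mkfs.xfs /dev/sdb1
    mount /dev/sdb1 /var/lib/ceph/osd/ceph-2
    ceph-osd -i 2 --mkfs --mkkey
    ceph auth add osd.2 osd 'allow *' mon 'allow rwx' \
        -i /var/lib/ceph/osd/ceph-2/keyring
    # start osd.2 again; backfill repopulates the PGs from the replicas
    ceph osd unset noout
]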
Re: [ceph-users] xfs/nobarrier
On Sat, 27 Dec 2014 09:03:16 PM Mark Kirkwood wrote:
> Yep. If you have 'em plugged into a RAID/HBA card with a battery backup
> (that also disables their individual caches) then it is safe to use
> nobarrier, otherwise data corruption will result if the server
> experiences power loss.

Thanks Mark, do people consider a UPS + shutdown procedures a suitable substitute?

--
Lindsay
Re: [ceph-users] xfs/nobarrier
On 27/12/14 20:32, Lindsay Mathieson wrote:

I see a lot of people mount their xfs osd's with nobarrier for extra performance; certainly it makes a huge difference to my small system. However I don't do it, as my understanding is this runs a risk of data corruption in the event of power failure - is this the case even with ceph?

Side note: How do I tell if my disk cache is battery backed? I have WD Red 3TB (WD30EFRX-68EUZN0) with 64M cache, but no mention of battery backup in the docs. I presume that means it isn't? :)

Yep. If you have 'em plugged into a RAID/HBA card with a battery backup (that also disables their individual caches) then it is safe to use nobarrier, otherwise data corruption will result if the server experiences power loss.

Regards

Mark
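[A quick way to see (and, absent a BBU, disable) the volatile write cache on a plain SATA drive like the WD Red; a sketch with a hypothetical device path:

    # report whether the drive's own write cache is on
    hdparm -W /dev/sdb
    # turn it off if nothing non-volatile is protecting it
    hdparm -W0 /dev/sdb
]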