Re: [ceph-users] Understanding EC properties for CephFS / small files.
> I'm trying to understand the nuts and bolts of EC / CephFS.
> We're running an EC 4+2 pool on top of 72 x 7.2K rpm 10TB drives. Pretty
> slow bulk / archive storage.

Ok, did some more searching and found this:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-October/021642.html

which to some degree confirms my understanding. I'd still like to get even
more insight, though. Gregory Farnum makes this comment:

"Unfortunately any logic like this would need to be handled in your
application layer. Raw RADOS does not do object sharding or aggregation on
its own. CERN did contribute the libradosstriper, which will break down
your multi-gigabyte objects into more typical sizes, but a generic system
for packing many small objects into larger ones is tough; the choices
depend so much on likely access patterns and such. I would definitely
recommend working out something like that, though!"

An idea about how to advance this: I can see that this would be "very hard"
to do at the object level given Ceph's design, but a suggestion would be to
do it at the CephFS/MDS level. A basic approach that would "often" work
would be to support, at the directory level, a special type of "packed"
object, where multiple files go into the same CephFS object. For common
access patterns people are reading through entire directories in the first
place, which would also limit IO on the overall system for tree traversals
(think "tar czvf linux.kernel.tar.gz git-checkout").

I have no idea how CephFS deals with concurrent updates around entities,
but in this scheme concurrency would be handled at the packed-object level.
It would be harder to "pack files across directories", since that is not
the native way for the MDS to keep track of things.

A third way would be to more "aggressively" inline data on the MDS. How
mature / well-tested / efficient is that feature?
http://docs.ceph.com/docs/master/cephfs/experimental-features/

The unfortunate consequence of bumping the 2KB inline size upwards, to meet
the point where EC pools become efficient, would be that we end up hitting
the MDS much harder than we do today. 2KB seems like a safe limit.
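For reference, a minimal sketch of how that inline-data feature could be
probed, assuming a filesystem named "cephfs" (placeholder); I haven't
verified whether the confirmation flag is required on current releases:

    # Hedged sketch: enable the experimental inline-data feature, then
    # check the flag in the FSMap dump.
    ceph fs set cephfs inline_data true --yes-i-really-mean-it
    ceph fs dump | grep inline_data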
[ceph-users] Understanding EC properties for CephFS / small files.
Hi List.

I'm trying to understand the nuts and bolts of EC / CephFS. We're running
an EC 4+2 pool on top of 72 x 7.2K rpm 10TB drives. Pretty slow bulk /
archive storage.

# getfattr -n ceph.dir.layout /mnt/home/cluster/mysqlbackup
getfattr: Removing leading '/' from absolute path names
# file: mnt/home/cluster/mysqlbackup
ceph.dir.layout="stripe_unit=4194304 stripe_count=1 object_size=4194304 pool=cephfs_data_ec42"

This configuration is taken directly out of the online documentation (which
may be where it all went wrong from our perspective):
http://docs.ceph.com/docs/master/cephfs/file-layouts/

Ok, this means that a 16MB file will be split into 4 data chunks of 4MB
each, plus 2 erasure-coding chunks? I don't really understand the
stripe_count element. And since erasure coding works at the object level,
striping individual objects across the shards (here 4 data + 2 coding),
it'll end up filling 16MB? Or is there an internal optimization causing
this not to be the case?

Additionally, when reading the file, all 4 data chunks need to be read to
assemble the object, causing (at a minimum) 4 IOPS per file.

Now, my common file size is < 8MB, and 512KB files are common on this pool.
Will that cause a 512KB file to be padded to 4MB with 3 empty chunks to
fill the erasure-coding profile, and then 2 coding chunks on top? In total
24MB for storing 512KB? And when reading it, will I hit 4 random IOs to
read 512KB, or can it optimize around not reading "empty" chunks?

If this is true, then I would be way better off, both performance- and
space/cost-wise, with 3x replication. Or is it less bad than what I arrive
at here?

If the math holds, then we can begin to calculate chunk sizes and EC
profiles for when EC begins to deliver benefits. In terms of IO it seems
like I'll always suffer a 1:4 ratio on IOPS in a reading scenario on a
4+2 EC pool, compared to 3x replication.

Side note: I'm trying to get bacula (tape backup) to read off my archive to
tape at a reasonable time/speed.

Thanks in advance.

--
Jesper
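For the record, here is the worst-case arithmetic from my question as a
small sketch. It assumes whole-chunk padding, which is exactly the
behaviour I'm asking about, so treat the result as an upper bound:

    # Hedged back-of-envelope: a 512KB file in a 4+2 pool with 4MB chunks,
    # assuming each of the k+m shards is padded to a full chunk.
    file_kb=512
    chunk_kb=4096                        # stripe_unit = 4194304 bytes
    k=4; m=2
    raw_kb=$(( chunk_kb * (k + m) ))     # 24576 KB = 24MB on disk
    echo "EC 4+2 raw usage: ${raw_kb} KB for ${file_kb} KB of data"
    echo "3x replication:   $(( file_kb * 3 )) KB for the same file"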
Re: [ceph-users] Second radosgw install
Hi all,

I know that it seems like a stupid question, but I have some concerns about
this; maybe someone can clear things up for me. I read in the official docs
that when I create a rgw server with 'ceph-deploy rgw create', the rgw
scripts will automatically create the rgw system pools. I'm not sure what
happens to the existing system pools if I already have a working rgw
server...

Thanks.

On 2/15/2019 6:35 PM, Adrian Nicolae wrote:
> Hi,
>
> I want to install a second radosgw to my existing ceph cluster (mimic) on
> another server. Should I create it like the first one, with 'ceph-deploy
> rgw create'? I don't want to mess with the existing rgw system pools.
>
> Thanks.
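One way to reassure yourself, sketched under the assumption that the second
gateway host is called "gw2" (placeholder):

    # Snapshot the pool list before deploying the second gateway, then
    # diff afterwards to confirm no existing system pool was recreated.
    ceph osd lspools > pools.before
    ceph-deploy rgw create gw2
    ceph osd lspools | diff pools.before -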
Re: [ceph-users] PG_AVAILABILITY with one osd down?
Clients' experience depends on whether, at the very moment, they need to
read/write to those particular PGs involved in peering. If their objects
are placed in other PGs, then I/O operations shouldn't be impacted. If
clients were performing I/O ops to those PGs that went into peering, then
they will notice increased latency. That's the case for Object and RBD; in
the case of CephFS I have no experience.

Peering of several PGs does not mean the whole cluster is unavailable
during that time; only a tiny part of it is.

Also, those 6 seconds are the duration of the PG_AVAILABILITY health check
warning, not the length of each PG's unavailability. It's the cluster
noticing that during that time some groups performed peering. In a proper
setup and healthy conditions, one group peers in fractions of a second.

Restarting an OSD causes the same thing, though more "smoothly" than an
unexpected death (going into the details would require quite a long
elaboration). If your setup is correct, you should be able to perform a
cluster-wide restart of everything, and the only effect visible outside
would be a slightly increased latency.

Kind regards,
Maks

On Sat, 16 Feb 2019 at 21:39 ... wrote:

> > Hello,
> >
> > your log extract shows that:
> >
> > 2019-02-15 21:40:08 OSD.29 DOWN
> > 2019-02-15 21:40:09 PG_AVAILABILITY warning start
> > 2019-02-15 21:40:15 PG_AVAILABILITY warning cleared
> >
> > 2019-02-15 21:44:06 OSD.29 UP
> > 2019-02-15 21:44:08 PG_AVAILABILITY warning start
> > 2019-02-15 21:44:15 PG_AVAILABILITY warning cleared
> >
> > What you saw is the natural consequence of OSD state change. Those two
> > periods of limited PG availability (6s each) are related to peering
> > that happens shortly after an OSD goes down or up.
> > Basically, the placement groups stored on that OSD need peering, so
> > the incoming connections are directed to other (alive) OSDs. And, yes,
> > during those few seconds the data are not accessible.
>
> Thanks, bear with my questions; I'm pretty new to Ceph.
> What will clients (CephFS, Object) experience?
> .. will they just block until time has passed and they get through, or?
>
> Which means that I'll get 72 x 6 seconds unavailability when doing
> a rolling restart of my OSDs during upgrades and such? Or is a
> controlled restart different than a crash?
>
> --
> Jesper
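For what it's worth, a controlled rolling restart usually looks roughly
like the sketch below (standard practice rather than anything specific to
this thread; osd.29 stands in for each OSD in turn):

    # noout keeps the cluster from marking restarting OSDs out (and thus
    # from rebalancing); the short per-OSD peering blips still occur.
    ceph osd set noout
    systemctl restart ceph-osd@29   # repeat per OSD, waiting for PGs to
                                    # return to active+clean between steps
    ceph osd unset noout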
[ceph-users] Some ceph config parameters default values
Dear Cephalopodians,

in some recent threads on this list, I have read about these "knobs":

- pglog_hardlimit (false by default, available at least with 12.2.11 and 13.2.5)
- bdev_enable_discard (false by default, advanced option, no description)
- bdev_async_discard (false by default, advanced option, no description)

I am wondering about the defaults for these settings, and why they seem
mostly undocumented.

It seems to me that on SSD / NVMe devices, you would always want to enable
discard for significantly increased lifetime, or run fstrim regularly
(which you can't with BlueStore, since it's a filesystem of its own). From
personal experience, I have already lost two eMMC devices in Android phones
early due to trimming not working. Of course, on first-generation SSD
devices, "discard" may lead to data loss (which for most devices has been
fixed with firmware updates, though).

I would presume that async discard is also advantageous, since it seems to
queue the discards and work on them in bulk later, instead of issuing them
immediately (that's what I grasp from the code). Additionally, it's unclear
to me whether the bdev-discard settings also affect WAL/DB devices, which
are very commonly SSD/NVMe devices in the BlueStore age.

Concerning pglog_hardlimit, I read on this list that it's safe and limits
maximum memory consumption, especially for backfills / during recovery. So
it "sounds" like this is also something that could be on by default. But
maybe that is not the case yet, to allow downgrades after failed upgrades?

So in the end, my question is: Is there a reason why these values are not
on by default, and are also not really mentioned in the documentation? Are
they just "not ready yet" / unsafe to be on by default, or are the defaults
just like that because they have always been at this value, and will they
change with the next major release (Nautilus)?

Cheers,
Oliver
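For concreteness, this is how I would expect these knobs to be enabled (a
sketch only; whether your release honours them, and whether pglog_hardlimit
really requires the confirmation flag, is part of what I'm asking):

    # Mimic+ central config store; older releases would set these in
    # the [osd] section of ceph.conf instead.
    ceph config set osd bdev_enable_discard true
    ceph config set osd bdev_async_discard true
    # pglog_hardlimit is a cluster-wide flag and cannot be unset later.
    ceph osd set pglog_hardlimit --yes-i-really-mean-it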
Re: [ceph-users] PG_AVAILABILITY with one osd down?
> Hello,
>
> your log extract shows that:
>
> 2019-02-15 21:40:08 OSD.29 DOWN
> 2019-02-15 21:40:09 PG_AVAILABILITY warning start
> 2019-02-15 21:40:15 PG_AVAILABILITY warning cleared
>
> 2019-02-15 21:44:06 OSD.29 UP
> 2019-02-15 21:44:08 PG_AVAILABILITY warning start
> 2019-02-15 21:44:15 PG_AVAILABILITY warning cleared
>
> What you saw is the natural consequence of OSD state change. Those two
> periods of limited PG availability (6s each) are related to peering
> that happens shortly after an OSD goes down or up.
> Basically, the placement groups stored on that OSD need peering, so
> the incoming connections are directed to other (alive) OSDs. And, yes,
> during those few seconds the data are not accessible.

Thanks, bear with my questions; I'm pretty new to Ceph.
What will clients (CephFS, Object) experience?
.. will they just block until time has passed and they get through, or?

Which means that I'll get 72 x 6 seconds unavailability when doing
a rolling restart of my OSDs during upgrades and such? Or is a
controlled restart different than a crash?

--
Jesper
Re: [ceph-users] PG_AVAILABILITY with one osd down?
Hello,

your log extract shows that:

2019-02-15 21:40:08 OSD.29 DOWN
2019-02-15 21:40:09 PG_AVAILABILITY warning start
2019-02-15 21:40:15 PG_AVAILABILITY warning cleared

2019-02-15 21:44:06 OSD.29 UP
2019-02-15 21:44:08 PG_AVAILABILITY warning start
2019-02-15 21:44:15 PG_AVAILABILITY warning cleared

What you saw is the natural consequence of OSD state change. Those two
periods of limited PG availability (6s each) are related to peering that
happens shortly after an OSD goes down or up. Basically, the placement
groups stored on that OSD need peering, so the incoming connections are
directed to other (alive) OSDs. And, yes, during those few seconds the data
are not accessible.

Kind regards,
Maks

On Sat, 16 Feb 2019 at 07:25 ... wrote:

> Yesterday I saw this one.. it puzzles me:
> 2019-02-15 21:00:00.000126 mon.torsk1 mon.0 10.194.132.88:6789/0 604164 : cluster [INF] overall HEALTH_OK
> 2019-02-15 21:39:55.793934 mon.torsk1 mon.0 10.194.132.88:6789/0 604304 : cluster [WRN] Health check failed: 2 slow requests are blocked > 32 sec. Implicated osds 58 (REQUEST_SLOW)
> 2019-02-15 21:40:00.887766 mon.torsk1 mon.0 10.194.132.88:6789/0 604305 : cluster [WRN] Health check update: 6 slow requests are blocked > 32 sec. Implicated osds 9,19,52,58,68 (REQUEST_SLOW)
> 2019-02-15 21:40:06.973901 mon.torsk1 mon.0 10.194.132.88:6789/0 604306 : cluster [WRN] Health check update: 14 slow requests are blocked > 32 sec. Implicated osds 3,9,19,29,32,52,55,58,68,69 (REQUEST_SLOW)
> 2019-02-15 21:40:08.466266 mon.torsk1 mon.0 10.194.132.88:6789/0 604307 : cluster [INF] osd.29 failed (root=default,host=bison) (6 reporters from different host after 33.862482 >= grace 29.247323)
> 2019-02-15 21:40:08.473703 mon.torsk1 mon.0 10.194.132.88:6789/0 604308 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)
> 2019-02-15 21:40:09.489494 mon.torsk1 mon.0 10.194.132.88:6789/0 604310 : cluster [WRN] Health check failed: Reduced data availability: 6 pgs peering (PG_AVAILABILITY)
> 2019-02-15 21:40:11.008906 mon.torsk1 mon.0 10.194.132.88:6789/0 604312 : cluster [WRN] Health check failed: Degraded data redundancy: 3828291/700353996 objects degraded (0.547%), 77 pgs degraded (PG_DEGRADED)
> 2019-02-15 21:40:13.474777 mon.torsk1 mon.0 10.194.132.88:6789/0 604313 : cluster [WRN] Health check update: 9 slow requests are blocked > 32 sec. Implicated osds 3,9,32,55,58,69 (REQUEST_SLOW)
> 2019-02-15 21:40:15.060165 mon.torsk1 mon.0 10.194.132.88:6789/0 604314 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 17 pgs peering)
> 2019-02-15 21:40:17.128185 mon.torsk1 mon.0 10.194.132.88:6789/0 604315 : cluster [WRN] Health check update: Degraded data redundancy: 9897139/700354131 objects degraded (1.413%), 200 pgs degraded (PG_DEGRADED)
> 2019-02-15 21:40:17.128219 mon.torsk1 mon.0 10.194.132.88:6789/0 604316 : cluster [INF] Health check cleared: REQUEST_SLOW (was: 2 slow requests are blocked > 32 sec. Implicated osds 32,55)
> 2019-02-15 21:40:22.137090 mon.torsk1 mon.0 10.194.132.88:6789/0 604317 : cluster [WRN] Health check update: Degraded data redundancy: 9897140/700354194 objects degraded (1.413%), 200 pgs degraded (PG_DEGRADED)
> 2019-02-15 21:40:27.249354 mon.torsk1 mon.0 10.194.132.88:6789/0 604318 : cluster [WRN] Health check update: Degraded data redundancy: 9897142/700354287 objects degraded (1.413%), 200 pgs degraded (PG_DEGRADED)
> 2019-02-15 21:40:33.335147 mon.torsk1 mon.0 10.194.132.88:6789/0 604322 : cluster [WRN] Health check update: Degraded data redundancy: 9897143/700354356 objects degraded (1.413%), 200 pgs degraded (PG_DEGRADED)
> ... shortened ...
> 2019-02-15 21:43:48.496536 mon.torsk1 mon.0 10.194.132.88:6789/0 604366 : cluster [WRN] Health check update: Degraded data redundancy: 9897168/700356693 objects degraded (1.413%), 200 pgs degraded, 201 pgs undersized (PG_DEGRADED)
> 2019-02-15 21:43:53.496924 mon.torsk1 mon.0 10.194.132.88:6789/0 604367 : cluster [WRN] Health check update: Degraded data redundancy: 9897170/700356804 objects degraded (1.413%), 200 pgs degraded, 201 pgs undersized (PG_DEGRADED)
> 2019-02-15 21:43:58.497313 mon.torsk1 mon.0 10.194.132.88:6789/0 604368 : cluster [WRN] Health check update: Degraded data redundancy: 9897172/700356879 objects degraded (1.413%), 200 pgs degraded, 201 pgs undersized (PG_DEGRADED)
> 2019-02-15 21:44:03.497696 mon.torsk1 mon.0 10.194.132.88:6789/0 604369 : cluster [WRN] Health check update: Degraded data redundancy: 9897174/700356996 objects degraded (1.413%), 200 pgs degraded, 201 pgs undersized (PG_DEGRADED)
> 2019-02-15 21:44:06.939331 mon.torsk1 mon.0 10.194.132.88:6789/0 604372 : cluster [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
> 2019-02-15 21:44:06.965401 mon.torsk1 mon.0 10.194.132.88:6789/0 604373 : cluster [INF] osd.29 10.194.133.58:6844/305358 boot
> 2019-02-15 21:44:08.498060 mon.torsk1 mon.0
Re: [ceph-users] Placing replaced disks to correct buckets.
> I recently replaced failed HDDs and removed them from their respective
> buckets as per procedure. But I'm now facing an issue when trying to
> place new ones back into the buckets. I'm getting an error of 'osd nr not
> found' OR 'file or directory not found' OR a command syntax error.
>
> I have been using the commands below:
>
> ceph osd crush set <osd-id> <weight> <bucket>
> ceph osd crush set <osd-id> <weight> <bucket>
>
> I do however find the OSD number when i run the command:
>
> ceph osd find <osd-id>
>
> Your assistance/response to this will be highly appreciated.
>
> Regards
> John.

Please paste your `ceph osd tree`, your version, and the exact error you
get, including the osd number. Less obfuscation is better in this, perhaps
simple, case.

k
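For reference, a sketch of the syntax those commands are presumably aiming
for (the osd id, weight, and bucket location below are made-up
placeholders, not values from John's cluster):

    # Confirm the OSD exists, then set its weight and location in the
    # CRUSH map; "root=default host=node01" must match your actual tree.
    ceph osd find 12
    ceph osd crush set osd.12 9.09569 root=default host=node01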
[ceph-users] Placing replaced disks to correct buckets.
Hi Everyone,

I recently replaced failed HDDs and removed them from their respective
buckets as per procedure. But I'm now facing an issue when trying to place
new ones back into the buckets. I'm getting an error of 'osd nr not found'
OR 'file or directory not found' OR a command syntax error.

I have been using the commands below:

ceph osd crush set <osd-id> <weight> <bucket>
ceph osd crush set <osd-id> <weight> <bucket>

I do however find the OSD number when i run the command:

ceph osd find <osd-id>

Your assistance/response to this will be highly appreciated.

Regards
John.
[ceph-users] Ceph auth caps 'create rbd image' permission
Currently I am using 'profile rbd' on mon and osd. Is it possible with the
caps to allow a user to:

- list rbd images
- get the state of images
- write/read to images

etc., but not allow it to create new images?
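For context, caps are adjusted with 'ceph auth caps'; the sketch below
shows the mechanism only. Whether an OSD cap string exists that permits
read/write to existing images while denying creation is exactly the open
question here, and the profile shown still allows creating images:

    # Inspect the current caps, then (re)apply the standard rbd profile.
    # "client.guest" and the pool name are placeholders.
    ceph auth get client.guest
    ceph auth caps client.guest mon 'profile rbd' osd 'profile rbd pool=rbd'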
Re: [ceph-users] ceph osd commit latency increase over time, until restart
>> There are 10 OSDs in these systems with 96GB of memory in total. We are
>> running with memory target on 6G right now to make sure there is no
>> leakage. If this runs fine for a longer period we will go to 8GB per OSD
>> so it will max out on 80GB, leaving 16GB as spare.

Thanks Wido. I'll send results on Monday with my increased memory.

@Igor: I have also noticed that sometimes, when I have bad latency on an
osd on node1 (restarted 12h ago, for example) (op_w_process_latency),
restarting osds on other nodes (last restarted some days ago, so with
bigger latency) reduces the latency on the osds of node1 too.

Does the "op_w_process_latency" counter include replication time?

----- Original Message -----
From: "Wido den Hollander"
To: "aderumier"
Cc: "Igor Fedotov", "ceph-users", "ceph-devel"
Sent: Friday, 15 February 2019 14:59:30
Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart

On 2/15/19 2:54 PM, Alexandre DERUMIER wrote:
>>> Just wanted to chime in, I've seen this with Luminous+BlueStore+NVMe
>>> OSDs as well. Over time their latency increased until we started to
>>> notice I/O-wait inside VMs.
>
> I also notice it in the vms. BTW, what is your nvme disk size?

Samsung PM983 3.84TB SSDs in both clusters.

>>> A restart fixed it. We also increased memory target from 4G to 6G on
>>> these OSDs as the memory would allow it.
>
> I have set memory to 6GB this morning, with 2 osds of 3TB for 6TB nvme.
> (my last test was 8GB with 1 osd of 6TB, but that didn't help)

There are 10 OSDs in these systems with 96GB of memory in total. We are
running with memory target on 6G right now to make sure there is no
leakage. If this runs fine for a longer period we will go to 8GB per OSD so
it will max out on 80GB, leaving 16GB as spare.

As these OSDs were all restarted earlier this week I can't tell how it will
hold up over a longer period. Monitoring (Zabbix) shows the latency is fine
at the moment.

Wido

> ----- Original Message -----
> From: "Wido den Hollander"
> To: "Alexandre Derumier", "Igor Fedotov"
> Cc: "ceph-users", "ceph-devel"
> Sent: Friday, 15 February 2019 14:50:34
> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>
> On 2/15/19 2:31 PM, Alexandre DERUMIER wrote:
>> Thanks Igor.
>>
>> I'll try to create multiple osds per nvme disk (6TB) to see if the
>> behaviour is different.
>>
>> I have other clusters (same ceph.conf), but with 1.6TB drives, and I
>> don't see this latency problem.
>
> Just wanted to chime in, I've seen this with Luminous+BlueStore+NVMe
> OSDs as well. Over time their latency increased until we started to
> notice I/O-wait inside VMs.
>
> A restart fixed it. We also increased memory target from 4G to 6G on
> these OSDs as the memory would allow it.
>
> But we noticed this on two different 12.2.10/11 clusters.
>
> A restart made the latency drop. Not only the numbers, but the
> real-world latency as experienced by a VM as well.
>
> Wido
>
>> ----- Original Message -----
>> From: "Igor Fedotov"
>> Cc: "ceph-users", "ceph-devel"
>> Sent: Friday, 15 February 2019 13:47:57
>> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>
>> Hi Alexander,
>>
>> I've read through your reports; nothing obvious so far.
>>
>> I can only see a several-times average latency increase for OSD write
>> ops (in seconds):
>> 0.002040060 (first hour) vs.
>> 0.002483516 (last 24 hours) vs.
>> 0.008382087 (last hour)
>>
>> subop_w_latency:
>> 0.000478934 (first hour) vs.
>> 0.000537956 (last 24 hours) vs.
>> 0.003073475 (last hour)
>>
>> and OSD read ops, osd_r_latency:
>> 0.000408595 (first hour)
>> 0.000709031 (24 hours)
>> 0.004979540 (last hour)
>>
>> What's interesting is that such latency differences aren't observed at
>> either the BlueStore level (any _lat params under the "bluestore"
>> section) or the rocksdb one.
>>
>> Which probably means that the issue is somewhere above BlueStore.
>>
>> I suggest proceeding with perf dump collection to see if the picture
>> stays the same.
>>
>> W.r.t. the memory usage you observed, I see nothing suspicious so far;
>> no decrease in the RSS report is a known artifact that seems to be safe.
>>
>> Thanks,
>> Igor
>>
>> On 2/13/2019 11:42 AM, Alexandre DERUMIER wrote:
>>> Hi Igor,
>>>
>>> Thanks again for helping!
>>>
>>> I have upgraded to the latest mimic this weekend, and with the new
>>> autotune memory, I have set osd_memory_target to 8G. (my nvmes are 6TB)
>>>
>>> I have done a lot of perf dumps and mempool dumps and ps of the process
>>> to see rss memory at different hours;
>>> here are the reports for osd.0:
>>>
>>> http://odisoweb1.odiso.net/perfanalysis/
>>>
>>> The osd was started on 12-02-2019 at 08:00.
>>>
>>> first report after 1h running
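For anyone following along, the counters compared above come from the OSD
admin socket; a sketch of how to sample them (the jq paths assume the usual
perf-dump layout and may differ per release):

    # Dump the write-latency counters for osd.0; each is reported as
    # {avgcount, sum}, so average latency = sum / avgcount.
    ceph daemon osd.0 perf dump | jq '.osd.op_w_process_latency, .osd.subop_w_latency'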
Re: [ceph-users] Openstack RBD EC pool
### ceph.conf
[global]
fsid = b5e30221-a214-353c-b66b-8c37b4349123
mon host = ceph-mon.service.i.ewcs.ch
auth cluster required = cephx
auth service required = cephx
auth client required = cephx

### ceph.ec.conf
[global]
fsid = b5e30221-a214-353c-b66b-8c37b4349123
mon host = ceph-mon.service.i..
auth cluster required = cephx
auth service required = cephx
auth client required = cephx

[client.cinder-ec]
rbd default data pool = ewos1-prod_cinder_ec

It is not necessary to split these settings into two files. Use one
ceph.conf instead:

[client.cinder-ec]
rbd default data pool = ewos1-prod_cinder_ec

But your pool is:

ceph osd pool create cinder_ec 512 512 erasure ec32

k
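A consolidated sketch of what that suggestion amounts to: one ceph.conf
whose data-pool name matches the pool actually created. I'm assuming here
that "cinder_ec" (from the create command) is the intended name and the
config path is the default; adjust whichever side is actually wrong:

    # Create the EC pool, then point the cinder-ec client at it in the
    # single shared ceph.conf.
    ceph osd pool create cinder_ec 512 512 erasure ec32
    printf '[client.cinder-ec]\nrbd default data pool = cinder_ec\n' >> /etc/ceph/ceph.conf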