Re: [ceph-users] Performance counters oddities, cache tier and otherwise
It seems that temperature / recency estimation hasn't worked properly at some point.

Cheers,
Shinobu

----- Original Message -----
From: "Christian Balzer"
To: ceph-users@lists.ceph.com
Sent: Thursday, April 7, 2016 11:51:38 AM
Subject: [ceph-users] Performance counters oddities, cache tier and otherwise

Hello,

Ceph 0.94.5 for the record.

As some may remember, I phased in a 2TB cache tier 5 weeks ago. About now it has reached about 60% usage, which is what I have cache_target_dirty_ratio set to. And for the last 3 days I could see some writes (op_in_bytes) to the backing storage (aka HDD pool), which hadn't seen any write action for the aforementioned 5 weeks.

Alas, my graphite dashboard showed no flushes (tier_flush), whereas tier_promote on the cache pool could always be matched more or less to op_out_bytes on the HDD pool. The documentation (RH site) just parrots the names of the various perf counters, so no help there.

OK, let's look at what we've got:
---
 "tier_promote": 49776,
 "tier_flush": 0,
 "tier_flush_fail": 0,
 "tier_try_flush": 558,
 "tier_try_flush_fail": 0,
 "agent_flush": 558,
 "tier_evict": 0,
 "agent_evict": 0,
---

Lots of promotions, that's fine. Not a single tier_flush, er, wot? So what does this denote then? OK, clearly tier_try_flush and agent_flush are where the flushing is actually recorded (in my test cluster they differ, as I have run that against the wall several times). No evictions yet; that will happen at 90% usage.

So now I changed the graph data source for flushes to tier_try_flush; however, that does not match most of the op_in_bytes (or any other counter I tried!) on the HDDs. As in, there are flushes but no activity on the HDD OSDs as far as Ceph seems to be concerned. I can however match the flushes to actual disk activity on the HDDs (gathered by collectd), which are otherwise totally dormant.

Can somebody shed some light on this? Is it a known problem, in need of a bug report?

Christian
--
Christian Balzer
Network/Systems Engineer
ch...@gol.com
Global OnLine Japan/Rakuten Communications
http://www.gol.com/

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] adding cache tier in productive hammer environment
Hello,

On Wed, 6 Apr 2016 20:35:20 +0200 Oliver Dzombic wrote:

> Hi,
>
> i have some IO issues, and after Christian's great article/hint about
> caches i plan to add caches too.
>
Thanks; version 2 is still a work in progress, as I keep running into unknowns.

IO issues in what sense, as in too many write IOPS for the current HW to sustain? Also, what are you using Ceph for, RBD hosting VM images?

It will help you a lot if you can identify and quantify the usage patterns (including a rough idea of how many hot objects you have) and where you run into limits.

> So now comes the troublesome question:
>
> How dangerous is it to add cache tiers in an existing cluster with
> around 30 OSDs and 40 TB of data on 3-6 (currently reducing) nodes?
>
You're reducing nodes? Why? More nodes/OSDs equates to more IOPS in general.

40TB is a sizable amount of data; how many objects does your cluster hold? Also, is that raw data or after replication (size 3?)? In short, "ceph -s" output please. ^.^

> I mean will just everything explode and i just die, or how is the road
> map to introduce this, after you have an already running cluster ?
>
That's pretty much straightforward from the Ceph docs at:
http://docs.ceph.com/docs/master/rados/operations/cache-tiering/
(replace master with hammer if you're running that)

Nothing happens until the "set-overlay" bit, and you will want to configure all the pertinent bits before that.

A basic question is whether you will have dedicated SSD cache tier hosts or have the SSDs holding the cache pool in your current hosts. Dedicated hosts have the advantage of matched HW (CPU power sized to the SSDs) and simpler configuration; shared hosts can have the advantage of spreading the network load further out instead of having everything go through the cache tier nodes.

The size and length of the explosion will entirely depend on:
1) how capable your current cluster is, how (over)loaded it is.
2) the actual load/usage at the time you phase the cache tier in.
3) the amount of "truly hot" objects you have.

As I wrote here:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-February/007933.html
in my case, with a BADLY overloaded base pool and a constant stream of log/status writes (4-5MB/s, 1000 IOPS) from 200 VMs, it all stabilized after 10 minutes.

Truly hot objects as mentioned above will be those (in the case of VM images) holding active directory inodes and files.

> Anything that needs to be considered ? Dangerous no-no's ?
>
> Also it will happen, that i have to add the cache tiers server by
> server, and not all at the same time.
>
You want at least 2 cache tier servers from the start, with well-known, well-tested (LSI timeouts!) SSDs in them.

Christian

> I am happy for any kind of advice.
>
> Thank you !
>
--
Christian Balzer
Network/Systems Engineer
ch...@gol.com
Global OnLine Japan/Rakuten Communications
http://www.gol.com/
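[Editor's note: the "set-overlay" sequence Christian refers to looks roughly like the following, per the hammer-era cache-tiering docs. Pool names ("rbd" as the backing pool, "cache" as the SSD pool) and all numeric values are placeholders to be sized for your cluster; this is a sketch, not a drop-in recipe.]

```shell
# Rough order of operations for phasing in a writeback cache tier.
# "rbd" = existing backing pool, "cache" = new SSD pool (placeholders).
ceph osd tier add rbd cache                            # attach cache pool to base pool
ceph osd tier cache-mode cache writeback               # writeback caching
ceph osd pool set cache hit_set_type bloom             # hit sets drive promotion decisions
ceph osd pool set cache hit_set_count 1
ceph osd pool set cache hit_set_period 3600
ceph osd pool set cache target_max_bytes 2000000000000 # ~2TB; size to your tier
ceph osd pool set cache cache_target_dirty_ratio 0.6   # start flushing at 60%
ceph osd pool set cache cache_target_full_ratio 0.9    # start evicting at 90%
# Nothing is client-visible until this last step:
ceph osd tier set-overlay rbd cache                    # clients now go through the cache
```

As Christian notes, everything before set-overlay is inert configuration, so it can be prepared and reviewed at leisure.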
[ceph-users] Performance counters oddities, cache tier and otherwise
Hello,

Ceph 0.94.5 for the record.

As some may remember, I phased in a 2TB cache tier 5 weeks ago. About now it has reached about 60% usage, which is what I have cache_target_dirty_ratio set to. And for the last 3 days I could see some writes (op_in_bytes) to the backing storage (aka HDD pool), which hadn't seen any write action for the aforementioned 5 weeks.

Alas, my graphite dashboard showed no flushes (tier_flush), whereas tier_promote on the cache pool could always be matched more or less to op_out_bytes on the HDD pool. The documentation (RH site) just parrots the names of the various perf counters, so no help there.

OK, let's look at what we've got:
---
 "tier_promote": 49776,
 "tier_flush": 0,
 "tier_flush_fail": 0,
 "tier_try_flush": 558,
 "tier_try_flush_fail": 0,
 "agent_flush": 558,
 "tier_evict": 0,
 "agent_evict": 0,
---

Lots of promotions, that's fine. Not a single tier_flush, er, wot? So what does this denote then? OK, clearly tier_try_flush and agent_flush are where the flushing is actually recorded (in my test cluster they differ, as I have run that against the wall several times). No evictions yet; that will happen at 90% usage.

So now I changed the graph data source for flushes to tier_try_flush; however, that does not match most of the op_in_bytes (or any other counter I tried!) on the HDDs. As in, there are flushes but no activity on the HDD OSDs as far as Ceph seems to be concerned. I can however match the flushes to actual disk activity on the HDDs (gathered by collectd), which are otherwise totally dormant.

Can somebody shed some light on this? Is it a known problem, in need of a bug report?

Christian
--
Christian Balzer
Network/Systems Engineer
ch...@gol.com
Global OnLine Japan/Rakuten Communications
http://www.gol.com/
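[Editor's note: the counters above come from an OSD's admin socket. A quick way to pull just the tiering counters out of a perf dump is to filter on the tier_*/agent_* names; on a live OSD host that would be `ceph daemon osd.0 perf dump | grep -E '"(tier|agent)_[a-z_]+":'` (osd.0 is a placeholder). The same filter, demonstrated here on the counters from the message so it is self-contained:]

```shell
# Filter a perf dump down to the cache-tier counters (tier_*, agent_*).
# $perf_dump stands in for the output of: ceph daemon osd.0 perf dump
perf_dump='"tier_promote": 49776,
"tier_flush": 0,
"tier_try_flush": 558,
"agent_flush": 558,
"op_in_bytes": 12345,'
tier_counters=$(printf '%s\n' "$perf_dump" | grep -E '"(tier|agent)_[a-z_]+":')
printf '%s\n' "$tier_counters"   # prints only the four tier_*/agent_* lines
```

Graphing tier_try_flush and agent_flush (rather than tier_flush) matches the observation above that those are where flushes are actually counted.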
Re: [ceph-users] Maximizing OSD to PG quantity
Hello,

On Wed, 6 Apr 2016 18:15:57 +0000 David Turner wrote:

> You can mitigate how much it affects the IO, but at the cost of how
> long it will take to complete.
>
> ceph tell osd.* injectargs '--osd-max-backfills #'
>
Also have a read of:
https://www.mail-archive.com/ceph-users@lists.ceph.com/msg27970.html
for more knobs to twiddle.

> Where # is the most PGs any OSD can participate in backfilling data for
> at any given time. This is the same setting that is used when you add,
> remove, lose, or reweight OSDs in your cluster. The lower the number,
> the less impact to cluster IO but the longer it will take to finish the
> task. Max-backfills of 5 seems to work out well enough to get through
> things in a timely manner while not critically impacting IO. I do up
> that to 20 if I need speed more than IO. These numbers are very
> dependent on your individual hardware and configuration.
>
Very, very true words.

Which brings me to the OP: you haven't told us your cluster details. 12 OSDs sounds like 2 hosts with 6 OSDs each to me. If that's the case, you'll need/want a 3rd host.

If you already have 3 or more storage nodes, you can go ahead with the replica increase, but note that this will not only reduce your storage capacity accordingly but also have an impact on performance, as one more OSD will have to ACK each write. This will be particularly noticeable with non-SSD journals, but the additional network latency will be there in any case.

Christian

> From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of
> Oliver Dzombic [i...@ip-interactive.de]
> Sent: Wednesday, April 06, 2016 11:45 AM
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Maximizing OSD to PG quantity
>
> Hi,
>
> huge, deadly, IO :-)
>
> Imagine, everything has to be multiplied one more time. That's not
> something that will go smoothly :-)
>
> --
> Mit freundlichen Gruessen / Best regards
>
> Oliver Dzombic
> IP-Interactive
>
> mailto:i...@ip-interactive.de
>
> Anschrift:
>
> IP Interactive UG ( haftungsbeschraenkt )
> Zum Sonnenberg 1-3
> 63571 Gelnhausen
>
> HRB 93402 beim Amtsgericht Hanau
> Geschäftsführung: Oliver Dzombic
>
> Steuer Nr.: 35 236 3622 1
> UST ID: DE274086107
>
> Am 06.04.2016 um 16:41 schrieb d...@integrityhost.com:
> > Will changing the replication size from 2 to 3 cause huge I/O
> > resources to be used, or does this happen quietly in the background?
> >
> > On 2016-04-06 00:40, Christian Balzer wrote:
> >> Hello,
> >>
> >> Brian already mentioned a number of very pertinent things; I've got
> >> a few more:
> >>
> >> On Tue, 05 Apr 2016 10:48:49 -0400 d...@integrityhost.com wrote:
> >>
> >>> In a 12 OSD setup, the following config is there:
> >>>
> >>>              (OSDs * 100)
> >>> Total PGs = --------------
> >>>               pool size
> >>>
> >> The PGcalc page at http://ceph.com/pgcalc/ is quite helpful and
> >> contains a lot of background info as well.
> >>
> >> As Brian said, you can never decrease PG count, but growing it is
> >> also a very I/O intensive operation and you want to avoid that as
> >> much as possible.
> >>
> >>> So with 12 OSDs and a pool size of 2 replicas, this would equal
> >>> Total PGs of 600 as per this URL:
> >> PGcalc with a target of 200 PGs per OSD (doubling of cluster size
> >> expected) gives us 1024, which is also what I would go for myself.
> >>
> >> However, if this is a production cluster and your OSDs are NOT RAID1
> >> or very, very reliable, fast and well monitored SSDs, you're
> >> basically asking Murphy to come visit, destroying your data while
> >> eating babies and washing them down with bath water.
> >>
> >> The default replication size was changed to 3 for a very good
> >> reason; there are plenty of threads in this ML about failure
> >> scenarios and probabilities.
> >>
> >> Christian
> >>
> >>> http://docs.ceph.com/docs/master/rados/operations/placement-groups/#preselection
> >>>
> >>> Yet in the same page, at the top it says:
> >>>
> >>> Between 10 and 50 OSDs set pg_num to 4096
> >>>
> >>> Our use is for shared hosting so there are lots of small writes and
> >>> reads. Which of these would be correct?
> >>>
> >>> Also is it a simple process to update PGs on a live system without
> >>> affecting service?
--
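[Editor's note: the injectargs throttling discussed above is usually applied temporarily and then reverted. A sketch, with example values only (10 and 15 were the hammer-era defaults for these options; verify against your own `ceph daemon osd.N config show` before restoring):]

```shell
# Throttle recovery/backfill impact while the change churns through...
ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'

# ... watch "ceph -s" until PGs are back to active+clean ...

# ... then restore the defaults (injected values do not survive an OSD restart).
ceph tell osd.* injectargs '--osd-max-backfills 10 --osd-recovery-max-active 15'
```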
Re: [ceph-users] cephfs rm -rf on directory of 160TB /40M files
On Wed, Apr 6, 2016 at 10:42 PM, Scottix wrote:
> I have been running some speed tests in POSIX file operations and I
> noticed even just listing files can take a while compared to an
> attached HDD. I am wondering is there a reason it takes so long to even
> just list files.

If you're running comparisons, it would really be more instructive to compare Ceph with something like an NFS server, rather than a local filesystem.

> Here is the test I ran
>
> time for i in {1..10}; do touch $i; done
>
> Internal HDD:
> real 4m37.492s
> user 0m18.125s
> sys 1m5.040s
>
> Ceph Dir
> real 12m30.059s
> user 0m16.749s
> sys 0m53.451s
>
> ~300% faster on HDD
>
> *I am actually ok with this but nice to be quicker.
>
> When I am listing the directory it is taking a lot longer compared to
> an attached HDD
>
> time ls -1
>
> Internal HDD
> real 0m2.112s
> user 0m0.560s
> sys 0m0.440s
>
> Ceph Dir
> real 3m35.982s
> user 0m2.788s
> sys 0m4.580s
>
> ~1000% faster on HDD
>
> *I understand there is some time in the display so what is really
> making it odd is the following test.
>
> time ls -1 > /dev/null
>
> Internal HDD
> real 0m0.367s
> user 0m0.324s
> sys 0m0.040s
>
> Ceph Dir
> real 0m2.807s
> user 0m0.128s
> sys 0m0.052s

If the difference when sending to /dev/null is reproducible (not just a cache artifact), I would suspect that your `ls` is noticing that it's not talking to a tty, so it's not bothering to color things, so it's not individually statting each file to decide what color to make it. On network filesystems, "ls -l" (or colored ls) is often much slower than a straight directory listing.

Cheers,
John

> ~700% faster on HDD
>
> My guess the performance issue is with the batch requests as you
> stated. So I am wondering if the file deletion of the 40M files is not
> just deleting the files but even just traversing that many files takes
> a while.

It's an unhappy feedback combination of listing them, sending N individual unlink operations, and then the MDS getting bogged down in the resulting purges while it's still trying to handle incoming unlink requests.

> I am running this on 0.94.6 with Ceph Fuse Client
> And config
> fuse multithreaded = false
>
> Since multithreaded crashes in hammer.
>
> It would be interesting to see the performance on newer versions.
>
> Any thoughts or comments would be good.
>
> On Tue, Apr 5, 2016 at 9:22 AM Gregory Farnum wrote:
>>
>> On Mon, Apr 4, 2016 at 9:55 AM, Gregory Farnum wrote:
>> > Deletes are just slow right now. You can look at the ops in flight
>> > on your client or MDS admin socket to see how far along it is and
>> > watch them to see how long stuff is taking -- I think it's a sync
>> > disk commit for each unlink though, so at 40M it's going to be a
>> > good looong while. :/
>> > -Greg
>>
>> Oh good, I misremembered — it's a synchronous request to the MDS, but
>> it's not a synchronous disk commit. They get batched up normally in
>> the metadata log. :)
>> Still, a sync MDS request can take a little bit of time. Someday we
>> will make the client able to respond to these more quickly locally
>> and batch up MDS requests or something, but it'll be tricky. Faster
>> file creates will probably come first. (If we're lucky they can use
>> some of the same client-side machinery.)
>> -Greg
>>
>> > On Monday, April 4, 2016, Kenneth Waegeman wrote:
>> >>
>> >> Hi all,
>> >>
>> >> I want to remove a large directory containing +- 40M files / 160TB
>> >> of data in CephFS by running rm -rf on the directory via the ceph
>> >> kernel client. After 7h, the rm command is still running. I
>> >> checked the rados df output, and saw that only about 2TB and 2M
>> >> files are gone.
>> >> I know this output of rados df can be confusing because ceph
>> >> should delete objects asynchronously, but then I don't know why
>> >> the rm command still hangs.
>> >> Is there some way to speed this up? And is there a way to check
>> >> how far the marked-for-deletion has progressed?
>> >>
>> >> Thank you very much!
>> >>
>> >> Kenneth
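[Editor's note: Greg's suggestion to "look at the ops in flight on your client or MDS admin socket" translates to admin-socket queries roughly like the following. "mds.a" is a placeholder for your MDS id, and the strays counter names vary by version, so treat the grep as best-effort:]

```shell
# Pending requests on the MDS admin socket (run on the MDS host):
ceph daemon mds.a dump_ops_in_flight

# Stray/purge-related counters, where the running version exposes them:
ceph daemon mds.a perf dump | grep -i stray

# And object counts dropping as the asynchronous purges complete:
rados df
```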
Re: [ceph-users] cephfs rm -rf on directory of 160TB /40M files
On Wed, Apr 6, 2016 at 2:42 PM, Scottix wrote:
> I have been running some speed tests in POSIX file operations and I
> noticed even just listing files can take a while compared to an
> attached HDD. I am wondering is there a reason it takes so long to even
> just list files.
>
> Here is the test I ran
>
> time for i in {1..10}; do touch $i; done
>
> Internal HDD:
> real 4m37.492s
> user 0m18.125s
> sys 1m5.040s
>
> Ceph Dir
> real 12m30.059s
> user 0m16.749s
> sys 0m53.451s
>
> ~300% faster on HDD
>
> *I am actually ok with this but nice to be quicker.
>
> When I am listing the directory it is taking a lot longer compared to
> an attached HDD
>
> time ls -1
>
> Internal HDD
> real 0m2.112s
> user 0m0.560s
> sys 0m0.440s
>
> Ceph Dir
> real 3m35.982s
> user 0m2.788s
> sys 0m4.580s
>
> ~1000% faster on HDD

This might be a bad interaction between your MDS cache size and the size of the directory. The subsequent run is a lot faster because after running an "ls" once you've got most of the information you need for it cached locally on the client (but perhaps not all of it, depending on various things).

> *I understand there is some time in the display so what is really
> making it odd is the following test.
>
> time ls -1 > /dev/null
>
> Internal HDD
> real 0m0.367s
> user 0m0.324s
> sys 0m0.040s
>
> Ceph Dir
> real 0m2.807s
> user 0m0.128s
> sys 0m0.052s
>
> ~700% faster on HDD
>
> My guess the performance issue is with the batch requests as you
> stated. So I am wondering if the file deletion of the 40M files is not
> just deleting the files but even just traversing that many files takes
> a while.
>
> I am running this on 0.94.6 with Ceph Fuse Client
> And config
> fuse multithreaded = false
>
> Since multithreaded crashes in hammer.

Oh, that's probably hurting things in various ways. The fix for http://tracker.ceph.com/issues/13729 ended up getting into the hammer branch after all and should go out whenever there's another stable release, FYI.
Re: [ceph-users] cephfs rm -rf on directory of 160TB /40M files
I have been running some speed tests in POSIX file operations and I noticed even just listing files can take a while compared to an attached HDD. I am wondering is there a reason it takes so long to even just list files.

Here is the test I ran:

time for i in {1..10}; do touch $i; done

Internal HDD:
real 4m37.492s
user 0m18.125s
sys 1m5.040s

Ceph Dir:
real 12m30.059s
user 0m16.749s
sys 0m53.451s

~300% faster on HDD

*I am actually ok with this but nice to be quicker.

When I am listing the directory it is taking a lot longer compared to an attached HDD:

time ls -1

Internal HDD:
real 0m2.112s
user 0m0.560s
sys 0m0.440s

Ceph Dir:
real 3m35.982s
user 0m2.788s
sys 0m4.580s

~1000% faster on HDD

*I understand there is some time in the display, so what is really making it odd is the following test.

time ls -1 > /dev/null

Internal HDD:
real 0m0.367s
user 0m0.324s
sys 0m0.040s

Ceph Dir:
real 0m2.807s
user 0m0.128s
sys 0m0.052s

~700% faster on HDD

My guess is the performance issue is with the batch requests as you stated. So I am wondering if the file deletion of the 40M files is not just deleting the files, but even just traversing that many files takes a while.

I am running this on 0.94.6 with the Ceph Fuse Client, and config:

fuse multithreaded = false

since multithreaded crashes in hammer.

It would be interesting to see the performance on newer versions.

Any thoughts or comments would be good.

On Tue, Apr 5, 2016 at 9:22 AM Gregory Farnum wrote:
> On Mon, Apr 4, 2016 at 9:55 AM, Gregory Farnum wrote:
> > Deletes are just slow right now. You can look at the ops in flight on
> > your client or MDS admin socket to see how far along it is and watch
> > them to see how long stuff is taking -- I think it's a sync disk
> > commit for each unlink though, so at 40M it's going to be a good
> > looong while. :/
> > -Greg
>
> Oh good, I misremembered — it's a synchronous request to the MDS, but
> it's not a synchronous disk commit. They get batched up normally in
> the metadata log. :)
> Still, a sync MDS request can take a little bit of time. Someday we
> will make the client able to respond to these more quickly locally and
> batch up MDS requests or something, but it'll be tricky. Faster file
> creates will probably come first. (If we're lucky they can use some of
> the same client-side machinery.)
> -Greg
>
> > On Monday, April 4, 2016, Kenneth Waegeman wrote:
> >>
> >> Hi all,
> >>
> >> I want to remove a large directory containing +- 40M files / 160TB
> >> of data in CephFS by running rm -rf on the directory via the ceph
> >> kernel client. After 7h, the rm command is still running. I checked
> >> the rados df output, and saw that only about 2TB and 2M files are
> >> gone.
> >> I know this output of rados df can be confusing because ceph should
> >> delete objects asynchronously, but then I don't know why the rm
> >> command still hangs.
> >> Is there some way to speed this up? And is there a way to check how
> >> far the marked-for-deletion has progressed?
> >>
> >> Thank you very much!
> >>
> >> Kenneth
[ceph-users] adding cache tier in productive hammer environment
Hi,

i have some IO issues, and after Christian's great article/hint about caches i plan to add caches too.

So now comes the troublesome question:

How dangerous is it to add cache tiers in an existing cluster with around 30 OSDs and 40 TB of data on 3-6 (currently reducing) nodes?

I mean, will just everything explode and i just die, or what is the road map to introduce this after you have an already running cluster?

Anything that needs to be considered? Dangerous no-no's?

Also it will happen that i have to add the cache tier servers server by server, and not all at the same time.

I am happy for any kind of advice.

Thank you !

--
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:i...@ip-interactive.de

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107
Re: [ceph-users] Maximizing OSD to PG quantity
You can mitigate how much it affects the IO, but at the cost of how long it will take to complete.

ceph tell osd.* injectargs '--osd-max-backfills #'

Where # is the most PGs any OSD can participate in backfilling data for at any given time. This is the same setting that is used when you add, remove, lose, or reweight OSDs in your cluster. The lower the number, the less impact to cluster IO but the longer it will take to finish the task. Max-backfills of 5 seems to work out well enough to get through things in a timely manner while not critically impacting IO. I do up that to 20 if I need speed more than IO. These numbers are very dependent on your individual hardware and configuration.

From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Oliver Dzombic [i...@ip-interactive.de]
Sent: Wednesday, April 06, 2016 11:45 AM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Maximizing OSD to PG quantity

Hi,

huge, deadly, IO :-)

Imagine, everything has to be multiplied one more time. That's not something that will go smoothly :-)

--
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:i...@ip-interactive.de

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107

Am 06.04.2016 um 16:41 schrieb d...@integrityhost.com:
> Will changing the replication size from 2 to 3 cause huge I/O resources
> to be used, or does this happen quietly in the background?
>
> On 2016-04-06 00:40, Christian Balzer wrote:
>> Hello,
>>
>> Brian already mentioned a number of very pertinent things; I've got a
>> few more:
>>
>> On Tue, 05 Apr 2016 10:48:49 -0400 d...@integrityhost.com wrote:
>>
>>> In a 12 OSD setup, the following config is there:
>>>
>>>              (OSDs * 100)
>>> Total PGs = --------------
>>>               pool size
>>>
>> The PGcalc page at http://ceph.com/pgcalc/ is quite helpful and
>> contains a lot of background info as well.
>>
>> As Brian said, you can never decrease PG count, but growing it is also
>> a very I/O intensive operation and you want to avoid that as much as
>> possible.
>>
>>> So with 12 OSDs and a pool size of 2 replicas, this would equal Total
>>> PGs of 600 as per this URL:
>> PGcalc with a target of 200 PGs per OSD (doubling of cluster size
>> expected) gives us 1024, which is also what I would go for myself.
>>
>> However, if this is a production cluster and your OSDs are NOT RAID1
>> or very, very reliable, fast and well monitored SSDs, you're basically
>> asking Murphy to come visit, destroying your data while eating babies
>> and washing them down with bath water.
>>
>> The default replication size was changed to 3 for a very good reason;
>> there are plenty of threads in this ML about failure scenarios and
>> probabilities.
>>
>> Christian
>>
>>> http://docs.ceph.com/docs/master/rados/operations/placement-groups/#preselection
>>>
>>> Yet in the same page, at the top it says:
>>>
>>> Between 10 and 50 OSDs set pg_num to 4096
>>>
>>> Our use is for shared hosting so there are lots of small writes and
>>> reads. Which of these would be correct?
>>>
>>> Also is it a simple process to update PGs on a live system without
>>> affecting service?
[ceph-users] Ceph Day Sunnyvale Presentations
Hey cephers,

I have all but one of the presentations from Ceph Day Sunnyvale, so rather than wait for a full hand I went ahead and posted the link to the slides on the event page:

http://ceph.com/cephdays/ceph-day-sunnyvale/

The videos probably won't be processed until after next week, but I'll add those once we get them. Thanks to all of the presenters and attendees who made this another great event.

--
Best Regards,

Patrick McGarry
Director Ceph Community || Red Hat
http://ceph.com || http://community.redhat.com
@scuttlemonkey || @ceph
Re: [ceph-users] Maximizing OSD to PG quantity
Will changing the replication size from 2 to 3 cause huge I/O resources to be used, or does this happen quietly in the background?

On 2016-04-06 00:40, Christian Balzer wrote:
> Hello,
>
> Brian already mentioned a number of very pertinent things; I've got a
> few more:
>
> On Tue, 05 Apr 2016 10:48:49 -0400 d...@integrityhost.com wrote:
>
>> In a 12 OSD setup, the following config is there:
>>
>>              (OSDs * 100)
>> Total PGs = --------------
>>               pool size
>>
> The PGcalc page at http://ceph.com/pgcalc/ is quite helpful and
> contains a lot of background info as well.
>
> As Brian said, you can never decrease PG count, but growing it is also
> a very I/O intensive operation and you want to avoid that as much as
> possible.
>
>> So with 12 OSDs and a pool size of 2 replicas, this would equal Total
>> PGs of 600 as per this URL:
> PGcalc with a target of 200 PGs per OSD (doubling of cluster size
> expected) gives us 1024, which is also what I would go for myself.
>
> However, if this is a production cluster and your OSDs are NOT RAID1 or
> very, very reliable, fast and well monitored SSDs, you're basically
> asking Murphy to come visit, destroying your data while eating babies
> and washing them down with bath water.
>
> The default replication size was changed to 3 for a very good reason;
> there are plenty of threads in this ML about failure scenarios and
> probabilities.
>
> Christian
>
>> http://docs.ceph.com/docs/master/rados/operations/placement-groups/#preselection
>>
>> Yet in the same page, at the top it says:
>>
>> Between 10 and 50 OSDs set pg_num to 4096
>>
>> Our use is for shared hosting so there are lots of small writes and
>> reads. Which of these would be correct?
>>
>> Also is it a simple process to update PGs on a live system without
>> affecting service?
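[Editor's note: plugging the thread's numbers into that formula, with the round-up-to-a-power-of-two step that PGcalc applies, looks like this. The 100-PGs-per-OSD target is the one in the quoted formula; Christian's 200-per-OSD target lands at the same 1024 for this cluster size.]

```shell
# Total PGs = (OSDs * 100) / pool_size, then round up to the next power of two.
osds=12
pool_size=2
raw=$(( osds * 100 / pool_size ))   # 600
pgs=1
while [ "$pgs" -lt "$raw" ]; do
    pgs=$(( pgs * 2 ))
done
echo "raw=$raw rounded=$pgs"        # raw=600 rounded=1024
```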
Re: [ceph-users] ceph rbd object write is atomic?
If you can guarantee that your write will be wholly contained within an object (and within a stripe), you should be able to consider the writes atomic between two clients, since the OSD will process the two writes in sequence (all ops are executed in order for a given placement group).

--
Jason Dillaman

----- Original Message -----
> From: "min fang"
> To: "Jason Dillaman"
> Cc: "ceph-users"
> Sent: Wednesday, April 6, 2016 9:37:06 AM
> Subject: Re: [ceph-users] ceph rbd object write is atomic?
>
> Thanks Jason, yes, I also do not think they can guarantee atomicity at
> the extent level. But for a stripe unit in an object, can the atomic
> write be guaranteed? Thanks.
>
> 2016-04-06 19:53 GMT+08:00 Jason Dillaman <dilla...@redhat.com>:
> > It's possible for a write to span one or more blocks -- it just
> > depends on the write address/size and the RBD image layout (object
> > size, "fancy" striping, etc). Regardless, however, RBD cannot provide
> > any ordering guarantees when two clients are writing to the same
> > image at the same extent. To safely use two or more clients
> > concurrently on the same image you need a clustering filesystem on
> > top of RBD (e.g. GFS2) or the application needs to provide its own
> > coordination to avoid concurrent writes to the same image extents.
> >
> > --
> > Jason Dillaman
> >
> > ----- Original Message -----
> > > From: "min fang" <louisfang2...@gmail.com>
> > > To: "ceph-users" <ceph-users@lists.ceph.com>
> > > Sent: Tuesday, April 5, 2016 10:11:10 PM
> > > Subject: [ceph-users] ceph rbd object write is atomic?
> > >
> > > Hi, as my understanding, a ceph rbd image will be divided into
> > > multiple objects based on LBA address.
> > >
> > > My question here is:
> > >
> > > if two clients write to the same LBA address, such as client A
> > > writes "aaaa" to LBA 0x123456 and client B writes "bbbb" to the
> > > same LBA, and the LBA address and data will only be in one object,
> > > not across two objects,
> > > will ceph guarantee the object data must be "aaaa" or "bbbb"?
> > > "aabb", "bbaa" will not happen even in a stripe data layout model?
> > >
> > > thanks.
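[Editor's note: whether a write is "wholly contained within an object" as Jason describes can be checked from the offset and length alone. A sketch assuming the default 4MB RBD object size and no fancy striping:]

```shell
# Does a write [offset, offset+len) stay inside one 4MB RADOS object?
obj_size=$(( 4 * 1024 * 1024 ))
contained() {
    off=$1
    len=$2
    first=$(( off / obj_size ))            # object holding the first byte
    last=$(( (off + len - 1) / obj_size )) # object holding the last byte
    if [ "$first" -eq "$last" ]; then echo yes; else echo no; fi
}
contained 0 4096                       # yes: well inside the first object
contained $(( obj_size - 2048 )) 4096  # no: straddles the object boundary
```

Writes where this prints "no" are the ones for which no ordering guarantee between two clients exists without external coordination.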
Re: [ceph-users] ceph rbd object write is atomic?
Thanks Jason. Yes, I also do not think they can guarantee atomicity at the extent level. But for a stripe unit in an object, can the atomic write be guaranteed? Thanks.

2016-04-06 19:53 GMT+08:00 Jason Dillaman:
> It's possible for a write to span one or more blocks -- it just depends
> on the write address/size and the RBD image layout (object size,
> "fancy" striping, etc). Regardless, however, RBD cannot provide any
> ordering guarantees when two clients are writing to the same image at
> the same extent. To safely use two or more clients concurrently on the
> same image you need a clustering filesystem on top of RBD (e.g. GFS2)
> or the application needs to provide its own coordination to avoid
> concurrent writes to the same image extents.
>
> --
> Jason Dillaman
>
> ----- Original Message -----
> > From: "min fang"
> > To: "ceph-users"
> > Sent: Tuesday, April 5, 2016 10:11:10 PM
> > Subject: [ceph-users] ceph rbd object write is atomic?
> >
> > Hi, as my understanding, a ceph rbd image will be divided into
> > multiple objects based on LBA address.
> >
> > My question here is:
> >
> > if two clients write to the same LBA address, such as client A writes
> > "aaaa" to LBA 0x123456 and client B writes "bbbb" to the same LBA,
> > and the LBA address and data will only be in one object, not across
> > two objects,
> > will ceph guarantee the object data must be "aaaa" or "bbbb"? "aabb",
> > "bbaa" will not happen even in a stripe data layout model?
> >
> > thanks.