Re: [ceph-users] SSD-primary crush rule doesn't work as intended
Oh, so it's not working as intended, even though the ssd-primary rule is officially listed in the Ceph documentation. Should I file a feature request or a Bugzilla ticket for it?

Regards,
Horace Ng

From: "Paul Emmerich"
To: "horace"
Cc: "ceph-users"
Sent: Wednesday, May 23, 2018 8:37:07 PM
Subject: Re: [ceph-users] SSD-primary crush rule doesn't work as intended

You can't mix HDDs and SSDs in a server if you want to use such a rule. The new selection step after "emit" can't know what server was selected previously.

Paul

2018-05-23 11:02 GMT+02:00 Horace <hor...@hkisl.net>:

To add to the info, I have a slightly modified rule to take advantage of the new storage class.

    rule ssd-hybrid {
        id 2
        type replicated
        min_size 1
        max_size 10
        step take default class ssd
        step chooseleaf firstn 1 type host
        step emit
        step take default class hdd
        step chooseleaf firstn -1 type host
        step emit
    }

Regards,
Horace Ng

- Original Message -
From: "horace" <hor...@hkisl.net>
To: "ceph-users" <ceph-users@lists.ceph.com>
Sent: Wednesday, May 23, 2018 3:56:20 PM
Subject: [ceph-users] SSD-primary crush rule doesn't work as intended

I've set up the rule according to the doc, but some of the PGs are still being assigned to the same host.

http://docs.ceph.com/docs/master/rados/operations/crush-map-edits/

    rule ssd-primary {
        ruleset 5
        type replicated
        min_size 5
        max_size 10
        step take ssd
        step chooseleaf firstn 1 type host
        step emit
        step take platter
        step chooseleaf firstn -1 type host
        step emit
    }

Crush tree:

    [root@ceph0 ~]# ceph osd crush tree
    ID CLASS WEIGHT   TYPE NAME
    -1       58.63989 root default
    -2       19.55095     host ceph0
     0   hdd  2.73000         osd.0
     1   hdd  2.73000         osd.1
     2   hdd  2.73000         osd.2
     3   hdd  2.73000         osd.3
    12   hdd  4.54999         osd.12
    15   hdd  3.71999         osd.15
    18   ssd  0.2             osd.18
    19   ssd  0.16100         osd.19
    -3       19.55095     host ceph1
     4   hdd  2.73000         osd.4
     5   hdd  2.73000         osd.5
     6   hdd  2.73000         osd.6
     7   hdd  2.73000         osd.7
    13   hdd  4.54999         osd.13
    16   hdd  3.71999         osd.16
    20   ssd  0.16100         osd.20
    21   ssd  0.2             osd.21
    -4       19.53799     host ceph2
     8   hdd  2.73000         osd.8
     9   hdd  2.73000         osd.9
    10   hdd  2.73000         osd.10
    11   hdd  2.73000         osd.11
    14   hdd  3.71999         osd.14
    17   hdd  4.54999         osd.17
    22   ssd  0.18700         osd.22
    23   ssd  0.16100         osd.23

    # ceph pg ls-by-pool ssd-hybrid
    27.8 1051 0 0 0 0 4399733760 1581 1581 active+clean 2018-05-23 06:20:56.088216 27957'185553 27959:368828 [23,1,11] 23 [23,1,11] 23 27953'182582 2018-05-23 06:20:56.088172 27843'162478 2018-05-20 18:28:20.118632

Here osd.23 and osd.11 are both on the same host.

Regards,
Horace Ng

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
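
The symptom reported in this thread (two replicas of one PG on the same host) can be checked mechanically. The following is only a minimal sketch: the OSD-to-host table is transcribed by hand from the crush tree in the message above, and the helper names `host_of` and `hosts_with_multiple_replicas` are made up for illustration, not part of any Ceph tool.

```python
# OSD-to-host mapping transcribed by hand from the `ceph osd crush tree`
# output quoted in this thread (assumption: the tree is unchanged).
OSD_HOSTS = {
    "ceph0": [0, 1, 2, 3, 12, 15, 18, 19],
    "ceph1": [4, 5, 6, 7, 13, 16, 20, 21],
    "ceph2": [8, 9, 10, 11, 14, 17, 22, 23],
}


def host_of(osd_id):
    """Return the host that carries the given OSD id."""
    for host, osds in OSD_HOSTS.items():
        if osd_id in osds:
            return host
    raise KeyError(f"osd.{osd_id} is not in the crush tree")


def hosts_with_multiple_replicas(acting_set):
    """Return {host: [osd ids]} for every host holding more than one replica."""
    by_host = {}
    for osd_id in acting_set:
        by_host.setdefault(host_of(osd_id), []).append(osd_id)
    return {host: ids for host, ids in by_host.items() if len(ids) > 1}


if __name__ == "__main__":
    # Acting set of PG 27.8 as shown by `ceph pg ls-by-pool ssd-hybrid`.
    print(hosts_with_multiple_replicas([23, 1, 11]))
```

For the acting set [23,1,11] the helper reports ceph2, because the SSD pick (osd.23) and one of the HDD picks (osd.11) come from two separate take/emit passes that do not know about each other, which matches Paul's explanation.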
Re: [ceph-users] 12.2.4 Both Ceph MDS nodes crashed. Please help.
On Thu, May 24, 2018 at 12:00 AM, Sean Sullivanwrote: > Thanks Yan! I did this for the bug ticket and missed these replies. I hope I > did it correctly. Here are the pastes of the dumps: > > https://pastebin.com/kw4bZVZT -- primary > https://pastebin.com/sYZQx0ER -- secondary > > > they are not that long here is the output of one: > > Thread 17 "mds_rank_progr" received signal SIGSEGV, Segmentation fault. > [Switching to Thread 0x7fe3b100a700 (LWP 120481)] > 0x5617aacc48c2 in Server::handle_client_getattr > (this=this@entry=0x5617b5acbcd0, mdr=..., is_lookup=is_lookup@entry=true) at > /build/ceph-12.2.5/src/mds/Server.cc:3065 > 3065/build/ceph-12.2.5/src/mds/Server.cc: No such file or directory. > (gdb) t > [Current thread is 17 (Thread 0x7fe3b100a700 (LWP 120481))] > (gdb) bt > #0 0x5617aacc48c2 in Server::handle_client_getattr > (this=this@entry=0x5617b5acbcd0, mdr=..., is_lookup=is_lookup@entry=true) at > /build/ceph-12.2.5/src/mds/Server.cc:3065 > #1 0x5617aacfc98b in Server::dispatch_client_request > (this=this@entry=0x5617b5acbcd0, mdr=...) 
at > /build/ceph-12.2.5/src/mds/Server.cc:1802 > #2 0x5617aacfce9b in Server::handle_client_request > (this=this@entry=0x5617b5acbcd0, req=req@entry=0x5617bdfa8700)at > /build/ceph-12.2.5/src/mds/Server.cc:1716 > #3 0x5617aad017b6 in Server::dispatch (this=0x5617b5acbcd0, > m=m@entry=0x5617bdfa8700) at /build/ceph-12.2.5/src/mds/Server.cc:258 > #4 0x5617aac6afac in MDSRank::handle_deferrable_message > (this=this@entry=0x5617b5d22000, m=m@entry=0x5617bdfa8700)at > /build/ceph-12.2.5/src/mds/MDSRank.cc:716 > #5 0x5617aac795cb in MDSRank::_dispatch > (this=this@entry=0x5617b5d22000, m=0x5617bdfa8700, > new_msg=new_msg@entry=false) at /build/ceph-12.2.5/src/mds/MDSRank.cc:551 > #6 0x5617aac7a472 in MDSRank::retry_dispatch (this=0x5617b5d22000, > m=) at /build/ceph-12.2.5/src/mds/MDSRank.cc:998 > #7 0x5617aaf0207b in Context::complete (r=0, this=0x5617bd568080) at > /build/ceph-12.2.5/src/include/Context.h:70 > #8 MDSInternalContextBase::complete (this=0x5617bd568080, r=0) at > /build/ceph-12.2.5/src/mds/MDSContext.cc:30 > #9 0x5617aac78bf7 in MDSRank::_advance_queues (this=0x5617b5d22000) at > /build/ceph-12.2.5/src/mds/MDSRank.cc:776 > #10 0x5617aac7921a in MDSRank::ProgressThread::entry > (this=0x5617b5d22d40) at /build/ceph-12.2.5/src/mds/MDSRank.cc:502 > #11 0x7fe3bb3066ba in start_thread (arg=0x7fe3b100a700) at > pthread_create.c:333 > #12 0x7fe3ba37241d in clone () at > ../sysdeps/unix/sysv/linux/x86_64/clone.S:109 > > > > I > * set the debug level to mds=20 mon=1, > * attached gdb prior to trying to mount aufs from a separate client, > * typed continue, attempted the mount, > * then backtraced after it seg faulted. > > I hope this is more helpful. Is there something else I should try to get > more info? I was hoping for something closer to a python trace where it says > a variable is a different type or a missing delimiter. womp. I am definitely > out of my depth but now is a great time to learn! Can anyone shed some more > light as to what may be wrong? 
>

I updated https://tracker.ceph.com/issues/23972. It's a kernel bug that sends a malformed request to the MDS.

Regards
Yan, Zheng
Re: [ceph-users] Ceph replication factor of 2
Hi,

About Bluestore: sure, there are checksums, but are they fully used? Rumor has it that on a replicated pool they are not used during recovery.

> My thoughts on the subject are that even though checksums do allow to find
> which replica is corrupt without having to figure which 2 out of 3 copies are
> the same, this is not the only reason min_size=2 was required. Even if you
> are running all SSD which are more reliable than HDD and are keeping the disk
> size small so you could backfill quickly in case of a single disk failure,
> you would still occasionally have longer periods of degraded operation. To
> name a couple - a full node going down; or operator deliberately wiping an
> OSD to rebuild it. min_size=1 in this case would leave you running with no
> redundancy at all. DR scenario with pool-to-pool mirroring probably means
> that you can not just replace the lost or incomplete PGs in your main site
> from your DR, cause DR is likely to have a different PG layout, so full
> resync from DR would be required in case of one disk lost during such
> unprotected times.

I have to say, this is a common yet worthless argument. If I have 3000 OSDs, using 2 or 3 replicas will not change much: the probability of losing 2 devices is still "high". On the other hand, if I have a small cluster, fewer than a hundred OSDs, that same probability becomes "low".

I do not buy the "what if someone is doing maintenance and a device fails" argument either: it is a goal without a limit. What if X servers burn at the same time? What if an admin makes a mistake and drops 5 OSDs? What if some network ToR switches or routers blow up? Should we keep one replica per OSD?

Thus, I would like to emphasize the technical sanity of using 2 replicas, as opposed to the organisational sanity of doing so. Organisational matters are specific to each site; technical ones are shared by all clusters.

I would like people, especially the Ceph devs and others who know deeply how it works (read the code!), to give us their advice.

Regards,
[ceph-users] Ceph replication factor of 2
This week at the OpenStack Summit in Vancouver I heard people entertaining the idea of running Ceph with a replication factor of 2.

Karl Vietmeier of Intel suggested that we use 2x replication because Bluestore comes with checksums.
https://www.openstack.org/summit/vancouver-2018/summit-schedule/events/21370/supporting-highly-transactional-and-low-latency-workloads-on-ceph

Later, there was a question from the audience during the Ceph DR/mirroring talk on whether we could use 2x replication if we also mirror to DR.
https://www.openstack.org/summit/vancouver-2018/summit-schedule/events/20749/how-to-survive-an-openstack-cloud-meltdown-with-ceph

So the interest is definitely there: not losing 1/3 of your disk space and performance is promising. On the other hand, it comes with higher risks.

I wonder if we as the community could come to some consensus, now that the established practice of requiring size=3, min_size=2 is being challenged.

My thoughts on the subject are that even though checksums do make it possible to find which replica is corrupt without having to figure out which 2 out of 3 copies are the same, this is not the only reason min_size=2 was required. Even if you are running all SSDs, which are more reliable than HDDs, and are keeping the disk size small so you can backfill quickly after a single disk failure, you would still occasionally have longer periods of degraded operation. To name a couple: a full node going down, or an operator deliberately wiping an OSD to rebuild it. min_size=1 in this case would leave you running with no redundancy at all. A DR scenario with pool-to-pool mirroring probably means that you cannot just replace the lost or incomplete PGs in your main site from your DR site, because the DR site is likely to have a different PG layout, so a full resync from DR would be required if one disk were lost during such unprotected times.

What are your thoughts? Would you run a 2x replication factor in production, and in what scenarios?

Regards,
Anthony
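
To make the trade-off in this message concrete, here is a small arithmetic sketch. It is a simplified model, not Ceph code: `usable_fraction`, `raw_saving` and `pool_state` are hypothetical helper names, and the model only encodes two facts, that a replicated pool stores `size` copies of each object and that a PG stops serving I/O once fewer than `min_size` copies are available.

```python
def usable_fraction(size):
    """Fraction of raw capacity left for data in a replicated pool."""
    return 1.0 / size


def raw_saving(old_size, new_size):
    """Fraction of raw space saved for the same data when lowering size."""
    return 1.0 - new_size / old_size


def pool_state(size, min_size, failed):
    """Describe a PG after `failed` of its `size` replicas are unavailable."""
    remaining = size - failed
    if remaining <= 0:
        return "no copies left"
    if remaining < min_size:
        return "I/O blocked"
    spare = remaining - 1
    noun = "copy" if spare == 1 else "copies"
    return f"serving I/O, {spare} redundant {noun}"


if __name__ == "__main__":
    print(f"usable at size=3: {usable_fraction(3):.2f}, "
          f"at size=2: {usable_fraction(2):.2f}, "
          f"raw saved going 3 -> 2: {raw_saving(3, 2):.2f}")
    for size, min_size in [(3, 2), (2, 2), (2, 1)]:
        state = pool_state(size, min_size, failed=1)
        print(f"size={size} min_size={min_size}, one replica down: {state}")
```

In this model, size=2 with min_size=1 keeps serving I/O after one failure but with zero redundant copies, which is the "no redundancy at all" window described above, while size=2 with min_size=2 blocks I/O instead.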
[ceph-users] Flush very, very slow
Hi,

the flush from the overlay cache for my EC-based CephFS is very, very slow, as are all operations on the CephFS. The flush accelerates when the MDS is stopped.

I think this is due to a large number of files that were deleted all at once, but I'm not sure how to verify that. Are there any counters I can look up that show things like "pending deletions"? How else can I debug the problem?

Any insight is very much appreciated.

Philip

(potentially helpful debug output follows)

status:

    root@lxt-prod-ceph-mon02:~# ceph -s
      cluster:
        id:     066f558c-6789-4a93-aaf1-5af1ba01a3ad
        health: HEALTH_WARN
                noscrub,nodeep-scrub flag(s) set
                102 slow requests are blocked > 32 sec

      services:
        mon: 2 daemons, quorum lxt-prod-ceph-mon01,lxt-prod-ceph-mon02
        mgr: lxt-prod-ceph-mon02(active), standbys: lxt-prod-ceph-mon01
        mds: plexfs-1/1/1 up {0=lxt-prod-ceph-mds01=up:active}
        osd: 13 osds: 7 up, 7 in
             flags noscrub,nodeep-scrub

      data:
        pools:   3 pools, 536 pgs
        objects: 5431k objects, 21056 GB
        usage:   28442 GB used, 5319 GB / 33761 GB avail
        pgs:     536 active+clean

      io:
        client: 687 kB/s wr, 0 op/s rd, 9 op/s wr
        cache:  345 kB/s flush

(Throughput is currently in the kilobyte/low megabyte range, but could go to 100 MB/s under healthy conditions.)

health:

    root@lxt-prod-ceph-mon02:~# ceph health detail
    HEALTH_WARN noscrub,nodeep-scrub flag(s) set; 105 slow requests are blocked > 32 sec
    OSDMAP_FLAGS noscrub,nodeep-scrub flag(s) set
    REQUEST_SLOW 105 slow requests are blocked > 32 sec
        45 ops are blocked > 262.144 sec
        29 ops are blocked > 131.072 sec
        20 ops are blocked > 65.536 sec
        11 ops are blocked > 32.768 sec
        osds 1,7 have blocked requests > 262.144 sec

(All OSDs have a high system load, but not a lot of iowait. cephfs/flushing usually performs much better under the same conditions.)

pool configuration:

    root@lxt-prod-ceph-mon02:~# ceph osd pool ls detail
    pool 6 'cephfs-metadata' replicated size 1 min_size 1 crush_rule 0 object_hash rjenkins pg_num 16 pgp_num 16 last_change 12515 lfor 0/12412 flags hashpspool stripe_width 0 application cephfs
    pool 9 'cephfs-data' erasure size 4 min_size 3 crush_rule 4 object_hash rjenkins pg_num 512 pgp_num 512 last_change 12482 lfor 12481/12481 flags hashpspool crash_replay_interval 45 tiers 17 read_tier 17 write_tier 17 stripe_width 4128 application cephfs
    pool 17 'cephfs-cache' replicated size 1 min_size 1 crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 12553 lfor 12481/12481 flags hashpspool,incomplete_clones,noscrub,nodeep-scrub tier_of 9 cache_mode writeback target_bytes 2000 target_objects 15 hit_set bloom{false_positive_probability: 0.05, target_size: 0, seed: 0} 180s x1 decay_rate 20 search_last_n 1 min_write_recency_for_promote 1 stripe_width 0

metadata and cache are both on the same ssd osd:

    root@lxt-prod-ceph-mon02:~# ceph osd crush tree
    ID CLASS WEIGHT   TYPE NAME
    -5        0.24399 root ssd
     7   ssd  0.24399     osd.7
    -4        5.74399 host sinnlich
     6   hdd  5.5         osd.6
     7   ssd  0.24399     osd.7
    -1       38.40399 root hdd
    -2       16.45799     host hn-lxt-ceph01
     1   hdd  5.5             osd.1
     9   hdd  5.5             osd.9
    12   hdd  5.5             osd.12
    -3       16.44600     host hn-lxt-ceph02
     2   hdd  5.5             osd.2
     3   hdd  5.5             osd.3
     4   hdd  2.72299         osd.4
     5   hdd  2.72299         osd.5
     6   hdd  5.5             osd.6

cache tier settings:

    root@lxt-prod-ceph-mon02:~# ceph osd pool get cephfs-cache all
    size: 1
    min_size: 1
    crash_replay_interval: 0
    pg_num: 8
    pgp_num: 8
    crush_rule: replicated_ruleset
    hashpspool: true
    nodelete: false
    nopgchange: false
    nosizechange: false
    write_fadvise_dontneed: false
    noscrub: true
    nodeep-scrub: true
    hit_set_type: bloom
    hit_set_period: 180
    hit_set_count: 1
    hit_set_fpp: 0.05
    use_gmt_hitset: 1
    auid: 0
    target_max_objects: 15
    target_max_bytes: 2000
    cache_target_dirty_ratio: 0.01
    cache_target_dirty_high_ratio: 0.1
    cache_target_full_ratio: 0.8
    cache_min_flush_age: 60
    cache_min_evict_age: 0
    min_read_recency_for_promote: 0
    min_write_recency_for_promote: 1
    fast_read: 0
    hit_set_grade_decay_rate: 20
    hit_set_search_last_n: 1

(I'm not sure the values make much sense; I copied them from online examples and adapted them minimally, if at all.)

The MDS shows no ops in flight, but the SSD OSD shows a lot of operations that seem to be slow (all of them with the same events timeline, stopping at reached_pg):

    root@sinnlich:~# ceph daemon osd.7 dump_ops_in_flight | head -30
    {
        "ops": [
            {
                "description": "osd_op(mds.0.3479:170284 17.1 17:98fc84de:::12a830d.:head [delete] snapc 1=[] ondisk+write+known_if_redirected+full_force e12553)",
                "initiated_at": "2018-05-23 21:27:00.140552",
                "age": 47.611064,
                "duration": 47.611077,
                "type_data": {
                    "flag_point": "reached pg",
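
One way to get an overview of where the in-flight operations are stuck is to count them per flag_point. The script below is a rough sketch, not a Ceph tool: it assumes the JSON has the shape visible in the truncated excerpt above (an "ops" list whose entries carry "age" and "type_data.flag_point"), `summarize_ops` is a made-up helper name, and the second sample op is invented for illustration.

```python
import json
from collections import Counter


def summarize_ops(dump_text):
    """Count in-flight ops per flag_point and return the oldest op's age."""
    ops = json.loads(dump_text).get("ops", [])
    by_flag_point = Counter(op["type_data"]["flag_point"] for op in ops)
    oldest_age = max((op["age"] for op in ops), default=0.0)
    return by_flag_point, oldest_age


# Sample shaped like the excerpt in the message; only the fields used by
# summarize_ops are kept, and the second op is hypothetical.
SAMPLE = json.dumps({
    "ops": [
        {"description": "delete issued by mds.0 (see excerpt above)",
         "age": 47.611064,
         "type_data": {"flag_point": "reached pg"}},
        {"description": "hypothetical second op",
         "age": 12.5,
         "type_data": {"flag_point": "started"}},
    ]
})

if __name__ == "__main__":
    counts, oldest = summarize_ops(SAMPLE)
    for flag_point, n in counts.most_common():
        print(f"{n:4d}  {flag_point}")
    print(f"oldest op: {oldest:.1f}s")
```

To use it on a live OSD, the complete output of `ceph daemon osd.7 dump_ops_in_flight` (without `head`) would be saved to a file and its contents passed to `summarize_ops` in place of `SAMPLE`.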
Re: [ceph-users] Too many objects per pg than average: deadlock situation
On Wed, 23 May 2018, Mike A wrote:
> Hello
>
> > On 21 May 2018, at 2:05, Sage Weil wrote:
> >
> > On Sun, 20 May 2018, Mike A wrote:
> >> Hello!
> >>
> >> In our cluster, we see a deadlock situation.
> >> This is a standard cluster for an OpenStack without a RadosGW, we have a
> >> standard block access pools and one for metrics from a gnocchi.
> >> The amount of data in the gnocchi pool is small, but objects are just a
> >> lot.
> >>
> >> When planning a distribution of PG between pools, the PG are distributed
> >> depending on the estimated data size of each pool. Correspondingly, as
> >> suggested by pgcalc for the gnocchi pool, it is necessary to allocate a
> >> little PG quantity.
> >>
> >> As a result, the cluster is constantly hanging with the error "1 pools
> >> have many more objects per pg than average" and this is understandable:
> >> the gnocchi produces a lot of small objects and in comparison with the
> >> rest of pools it is tens times larger.
> >>
> >> And here we are at a deadlock:
> >> 1. We can not increase the amount of PG on the gnocchi pool, since it is
> >> very small in data size
> >> 2. Even if we increase the number of PG - we can cross the recommended 200
> >> PGs limit for each OSD in cluster
> >> 3. Constantly holding the cluster in the HEALTH_WARN mode is a bad idea
> >> 4. We can set the parameter "mon pg warn max object skew", but we do not
> >> know how the Ceph will work when there is one pool with a huge object /
> >> pool ratio
> >>
> >> There is no obvious solution.
> >>
> >> How to solve this problem correctly?
> >
> > As a workaround, I'd just increase the skew option to make the warning go
> > away.
> >
> > It seems to me like the underlying problem is that we're looking at object
> > count vs pg count, but ignoring the object sizes. Unfortunately it's a
> > bit awkward to fix because we don't have a way to quantify the size of
> > omap objects via the stats (currently). So for now, just adjust the skew
> > value enough to make the warning go away!
> >
> > sage
>
> Can this situation somehow negatively affect the work of the cluster?

Eh, you'll end up with a PG count that is possibly suboptimal. You'd have to work pretty hard to notice any difference, though. I wouldn't worry about it.

sage
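
The warning discussed in this thread compares each pool's objects-per-PG with the cluster-wide average, so a pool with many small objects and few PGs trips it regardless of data size. The sketch below is a simplified model of that comparison, not the monitor's actual code: `pools_over_skew` is a hypothetical helper, the pool numbers are invented, and the default threshold of 10 for "mon pg warn max object skew" is an assumption that should be checked against the release in use.

```python
def pools_over_skew(pools, max_skew=10.0):
    """Return {pool: ratio} for pools whose objects-per-PG exceeds
    max_skew times the cluster-wide average objects-per-PG.

    pools maps a pool name to a tuple (object_count, pg_num).
    """
    total_objects = sum(objects for objects, _ in pools.values())
    total_pgs = sum(pg_num for _, pg_num in pools.values())
    if total_objects == 0 or total_pgs == 0:
        return {}
    average = total_objects / total_pgs
    flagged = {}
    for name, (objects, pg_num) in pools.items():
        ratio = (objects / pg_num) / average
        if ratio > max_skew:
            flagged[name] = round(ratio, 2)
    return flagged


# Invented numbers: two block pools and a gnocchi pool holding many
# small objects in few PGs.
EXAMPLE_POOLS = {
    "volumes": (500_000, 2048),
    "images": (50_000, 512),
    "gnocchi": (4_000_000, 64),
}

if __name__ == "__main__":
    print("threshold 10:", pools_over_skew(EXAMPLE_POOLS, max_skew=10.0))
    print("threshold 40:", pools_over_skew(EXAMPLE_POOLS, max_skew=40.0))
```

With these made-up numbers only the gnocchi pool is flagged at a threshold of 10, and raising the threshold above its ratio clears the warning, which is the workaround Sage suggests. The real check has additional conditions (for example minimum object counts) that this sketch leaves out.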
Re: [ceph-users] Too many objects per pg than average: deadlock situation
Hello

> On 21 May 2018, at 2:05, Sage Weil wrote:
>
> On Sun, 20 May 2018, Mike A wrote:
>> Hello!
>>
>> In our cluster, we see a deadlock situation.
>> This is a standard cluster for an OpenStack without a RadosGW, we have a
>> standard block access pools and one for metrics from a gnocchi.
>> The amount of data in the gnocchi pool is small, but objects are just a lot.
>>
>> When planning a distribution of PG between pools, the PG are distributed
>> depending on the estimated data size of each pool. Correspondingly, as
>> suggested by pgcalc for the gnocchi pool, it is necessary to allocate a
>> little PG quantity.
>>
>> As a result, the cluster is constantly hanging with the error "1 pools have
>> many more objects per pg than average" and this is understandable: the
>> gnocchi produces a lot of small objects and in comparison with the rest of
>> pools it is tens times larger.
>>
>> And here we are at a deadlock:
>> 1. We can not increase the amount of PG on the gnocchi pool, since it is
>> very small in data size
>> 2. Even if we increase the number of PG - we can cross the recommended 200
>> PGs limit for each OSD in cluster
>> 3. Constantly holding the cluster in the HEALTH_WARN mode is a bad idea
>> 4. We can set the parameter "mon pg warn max object skew", but we do not
>> know how the Ceph will work when there is one pool with a huge object / pool
>> ratio
>>
>> There is no obvious solution.
>>
>> How to solve this problem correctly?
>
> As a workaround, I'd just increase the skew option to make the warning go
> away.
>
> It seems to me like the underlying problem is that we're looking at object
> count vs pg count, but ignoring the object sizes. Unfortunately it's a
> bit awkward to fix because we don't have a way to quantify the size of
> omap objects via the stats (currently). So for now, just adjust the skew
> value enough to make the warning go away!
>
> sage

Can this situation somehow negatively affect the work of the cluster?

--
Mike, runs!
[ceph-users] MDS_DAMAGE: 1 MDSs report damaged metadata
Dear Ceph Experts,

I recently deleted a very big directory on my CephFS, and a few minutes later my dashboard started yelling:

    Overall status: HEALTH_ERR
    MDS_DAMAGE: 1 MDSs report damaged metadata

So I immediately logged in to my Ceph admin node and ran ceph -s:

      cluster:
        id:     472dfc88-84dc-4284-a1cf-0810ea45ae19
        health: HEALTH_ERR
                1 MDSs report damaged metadata

      services:
        mon: 3 daemons, quorum ceph-n1,ceph-n2,ceph-n3
        mgr: ceph-admin(active), standbys: ceph-n1
        mds: cephfs-2/2/2 up {0=ceph-admin=up:active,1=ceph-n1=up:active}, 1 up:standby
        osd: 17 osds: 17 up, 17 in
        rgw: 1 daemon active

      data:
        pools:   9 pools, 1584 pgs
        objects: 1093 objects, 418 MB
        usage:   2765 MB used, 6797 GB / 6799 GB avail
        pgs:     1584 active+clean

      io:
        client: 35757 B/s rd, 0 B/s wr, 34 op/s rd, 23 op/s wr

After some research I tried:

    # ceph tell mds.0 damage ls
    "damage_type": "backtrace",
    "id": 2744661796,
    "ino": 1099512314364,
    "path": "/M3/sogetel.net/t/te/testmda3/Maildir/dovecot.index.log.2"

I then tried what I saw at https://www.mail-archive.com/ceph-users@lists.ceph.com/msg35682.html, but it did not work, so now I don't know how to fix it.

Can you help me?
Re: [ceph-users] ceph-disk is getting removed from master
On Wed, May 23, 2018 at 10:03 AM, Alfredo Dezawrote: > On Wed, May 23, 2018 at 12:12 PM, Vasu Kulkarni wrote: >> Alfredo, >> >> Do we have the migration docs link from ceph-disk deployment to >> ceph-volume? the current docs as i see lacks scenario migration, maybe >> there is another link ? >> http://docs.ceph.com/docs/master/ceph-volume/simple/#ceph-volume-simple >> >> If it doesn't exist can we document, how a) ceph-disk with filestore >> (with/without) journal can migrate to ceph-volume and b) >> ceph-disk/bluestore with wal/db on same/different partitions. > > There is no "scenario" because ceph-volume scans the existing OSD and > whatever that gives us we work with it: > > * filestore with collocated/separate journals > * bluestore in any kind of deployment (with wal, with db, with db and > wal, with main only) > > multiply that with *both* ceph-disk's way of encrypting. > > In short: we support them all. No special command or flag needed. Cool that sounds great. Thanks > > >> >> Regards >> Vasu >> >> >> On Wed, May 23, 2018 at 8:12 AM, Alfredo Deza wrote: >>> Now that Mimic is fully branched out from master, ceph-disk is going >>> to be removed from master so that it is no longer available for the N >>> release (pull request to follow) >>> >>> ceph-disk should be considered as "frozen" and deprecated for Mimic, >>> in favor of ceph-volume. >>> >>> This means that if you are relying on ceph-disk *at all*, you should >>> plan on migrating to ceph-volume for Mimic, and should expect breakage >>> if using/testing it in master. >>> >>> Please refer to the guide to migrate away from ceph-disk [0] >>> >>> Willem, we don't have a way of directly supporting FreeBSD, I've >>> suggested that a plugin would be a good way to consume ceph-volume >>> with whatever FreeBSD needs, alternatively forking ceph-disk could be >>> another option? 
>>>
>>> Thanks
>>>
>>> [0] http://docs.ceph.com/docs/master/ceph-volume/#migrating
Re: [ceph-users] ceph-disk is getting removed from master
On Wed, May 23, 2018 at 11:47 AM, Willem Jan Withagenwrote: > On 23-5-2018 17:12, Alfredo Deza wrote: >> Now that Mimic is fully branched out from master, ceph-disk is going >> to be removed from master so that it is no longer available for the N >> release (pull request to follow) > >> Willem, we don't have a way of directly supporting FreeBSD, I've >> suggested that a plugin would be a good way to consume ceph-volume >> with whatever FreeBSD needs, alternatively forking ceph-disk could be >> another option? > > Yup, I'm aware of my "trouble"/commitment. > > Now that you have riped out most/all of the partitioning stuff there > should not much that one would need to do in ceph-volume other than > accept the filestore directories to format the MON/OSD stuff in. I worry about the way we poke at devices for setups (blkid, lsblk, /proc/mounts, etc...) The creation of the OSD (aside from devices) is straightforward > > IFF I could find the time to dive into ceph-volume. :( > ATM I'm having a hard time keeping up with the changes as it is. > > I'd appreciate if you could delay yanking ceph-disk until we are close > to the nautilus release. At which point feel free to use the axe. We can't delay this for ~8 months because it will obfuscate what breakage we will find on our end by ripping it up (teuthology suites, etc...) I've already started working on it, and we should be looking at 2 to 3 weeks from today. > > --WjW > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph-disk is getting removed from master
On Wed, May 23, 2018 at 12:12 PM, Vasu Kulkarniwrote: > Alfredo, > > Do we have the migration docs link from ceph-disk deployment to > ceph-volume? the current docs as i see lacks scenario migration, maybe > there is another link ? > http://docs.ceph.com/docs/master/ceph-volume/simple/#ceph-volume-simple > > If it doesn't exist can we document, how a) ceph-disk with filestore > (with/without) journal can migrate to ceph-volume and b) > ceph-disk/bluestore with wal/db on same/different partitions. There is no "scenario" because ceph-volume scans the existing OSD and whatever that gives us we work with it: * filestore with collocated/separate journals * bluestore in any kind of deployment (with wal, with db, with db and wal, with main only) multiply that with *both* ceph-disk's way of encrypting. In short: we support them all. No special command or flag needed. > > Regards > Vasu > > > On Wed, May 23, 2018 at 8:12 AM, Alfredo Deza wrote: >> Now that Mimic is fully branched out from master, ceph-disk is going >> to be removed from master so that it is no longer available for the N >> release (pull request to follow) >> >> ceph-disk should be considered as "frozen" and deprecated for Mimic, >> in favor of ceph-volume. >> >> This means that if you are relying on ceph-disk *at all*, you should >> plan on migrating to ceph-volume for Mimic, and should expect breakage >> if using/testing it in master. >> >> Please refer to the guide to migrate away from ceph-disk [0] >> >> Willem, we don't have a way of directly supporting FreeBSD, I've >> suggested that a plugin would be a good way to consume ceph-volume >> with whatever FreeBSD needs, alternatively forking ceph-disk could be >> another option? 
>>
>> Thanks
>>
>> [0] http://docs.ceph.com/docs/master/ceph-volume/#migrating
Re: [ceph-users] ceph-disk is getting removed from master
Alfredo,

Do we have a link to the docs for migrating from a ceph-disk deployment to ceph-volume? The current docs, as far as I can see, lack migration scenarios; maybe there is another link?
http://docs.ceph.com/docs/master/ceph-volume/simple/#ceph-volume-simple

If it doesn't exist, can we document how a) ceph-disk with filestore (with/without journal) can migrate to ceph-volume, and b) ceph-disk/bluestore with wal/db on the same or different partitions?

Regards
Vasu

On Wed, May 23, 2018 at 8:12 AM, Alfredo Deza wrote:
> Now that Mimic is fully branched out from master, ceph-disk is going
> to be removed from master so that it is no longer available for the N
> release (pull request to follow)
>
> ceph-disk should be considered as "frozen" and deprecated for Mimic,
> in favor of ceph-volume.
>
> This means that if you are relying on ceph-disk *at all*, you should
> plan on migrating to ceph-volume for Mimic, and should expect breakage
> if using/testing it in master.
>
> Please refer to the guide to migrate away from ceph-disk [0]
>
> Willem, we don't have a way of directly supporting FreeBSD, I've
> suggested that a plugin would be a good way to consume ceph-volume
> with whatever FreeBSD needs, alternatively forking ceph-disk could be
> another option?
>
> Thanks
>
> [0] http://docs.ceph.com/docs/master/ceph-volume/#migrating
Re: [ceph-users] 12.2.4 Both Ceph MDS nodes crashed. Please help.
Thanks Yan! I did this for the bug ticket and missed these replies. I hope I did it correctly. Here are the pastes of the dumps:

https://pastebin.com/kw4bZVZT -- primary
https://pastebin.com/sYZQx0ER -- secondary

They are not that long. Here is the output of one:

    Thread 17 "mds_rank_progr" received signal SIGSEGV, Segmentation fault.
    [Switching to Thread 0x7fe3b100a700 (LWP 120481)]
    0x5617aacc48c2 in Server::handle_client_getattr (this=this@entry=0x5617b5acbcd0, mdr=..., is_lookup=is_lookup@entry=true) at /build/ceph-12.2.5/src/mds/Server.cc:3065
    3065    /build/ceph-12.2.5/src/mds/Server.cc: No such file or directory.
    (gdb) t
    [Current thread is 17 (Thread 0x7fe3b100a700 (LWP 120481))]
    (gdb) bt
    #0  0x5617aacc48c2 in Server::handle_client_getattr (this=this@entry=0x5617b5acbcd0, mdr=..., is_lookup=is_lookup@entry=true) at /build/ceph-12.2.5/src/mds/Server.cc:3065
    #1  0x5617aacfc98b in Server::dispatch_client_request (this=this@entry=0x5617b5acbcd0, mdr=...) at /build/ceph-12.2.5/src/mds/Server.cc:1802
    #2  0x5617aacfce9b in Server::handle_client_request (this=this@entry=0x5617b5acbcd0, req=req@entry=0x5617bdfa8700) at /build/ceph-12.2.5/src/mds/Server.cc:1716
    #3  0x5617aad017b6 in Server::dispatch (this=0x5617b5acbcd0, m=m@entry=0x5617bdfa8700) at /build/ceph-12.2.5/src/mds/Server.cc:258
    #4  0x5617aac6afac in MDSRank::handle_deferrable_message (this=this@entry=0x5617b5d22000, m=m@entry=0x5617bdfa8700) at /build/ceph-12.2.5/src/mds/MDSRank.cc:716
    #5  0x5617aac795cb in MDSRank::_dispatch (this=this@entry=0x5617b5d22000, m=0x5617bdfa8700, new_msg=new_msg@entry=false) at /build/ceph-12.2.5/src/mds/MDSRank.cc:551
    #6  0x5617aac7a472 in MDSRank::retry_dispatch (this=0x5617b5d22000, m=) at /build/ceph-12.2.5/src/mds/MDSRank.cc:998
    #7  0x5617aaf0207b in Context::complete (r=0, this=0x5617bd568080) at /build/ceph-12.2.5/src/include/Context.h:70
    #8  MDSInternalContextBase::complete (this=0x5617bd568080, r=0) at /build/ceph-12.2.5/src/mds/MDSContext.cc:30
    #9  0x5617aac78bf7 in MDSRank::_advance_queues (this=0x5617b5d22000) at /build/ceph-12.2.5/src/mds/MDSRank.cc:776
    #10 0x5617aac7921a in MDSRank::ProgressThread::entry (this=0x5617b5d22d40) at /build/ceph-12.2.5/src/mds/MDSRank.cc:502
    #11 0x7fe3bb3066ba in start_thread (arg=0x7fe3b100a700) at pthread_create.c:333
    #12 0x7fe3ba37241d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

I
* set the debug level to mds=20 mon=1,
* attached gdb prior to trying to mount aufs from a separate client,
* typed continue, attempted the mount,
* then backtraced after it segfaulted.

I hope this is more helpful. Is there something else I should try to get more info? I was hoping for something closer to a python trace where it says a variable is a different type or a missing delimiter. womp. I am definitely out of my depth but now is a great time to learn! Can anyone shed some more light as to what may be wrong?

On Fri, May 4, 2018 at 7:49 PM, Yan, Zheng wrote:
> On Wed, May 2, 2018 at 7:19 AM, Sean Sullivan wrote:
> > Forgot to reply to all:
> >
> > Sure thing!
> >
> > I couldn't install the ceph-mds-dbg packages without upgrading. I just
> > finished upgrading the cluster to 12.2.5. The issue still persists in 12.2.5
> >
> > From here I'm not really sure how to generate the backtrace so I hope I
> > did it right. For others on Ubuntu this is what I did:
> >
> > * firstly up the debug_mds to 20 and debug_ms to 1:
> > ceph tell mds.* injectargs '--debug-mds 20 --debug-ms 1'
> >
> > * install the debug packages
> > ceph-mds-dbg in my case
> >
> > * I also added these options to /etc/ceph/ceph.conf just in case they
> > restart.
> >
> > * Now allow pids to dump (stolen partly from redhat docs and partly from
> > ubuntu)
> > echo -e 'DefaultLimitCORE=infinity\nPrivateTmp=true' | tee -a /etc/systemd/system.conf
> > sysctl fs.suid_dumpable=2
> > sysctl kernel.core_pattern=/tmp/core
> > systemctl daemon-reload
> > systemctl restart ceph-mds@$(hostname -s)
> >
> > * A crash was created in /var/crash by apport but gdb can't read it. I used
> > apport-unpack and then ran GDB on what is inside:
> >
>
> core dump should be in /tmp/core
>
> > apport-unpack /var/crash/$(ls /var/crash/*mds*) /root/crash_dump/
> > cd /root/crash_dump/
> > gdb $(cat ExecutablePath) CoreDump -ex 'thr a a bt' | tee /root/ceph_mds_$(hostname -s)_backtrace
> >
> > * This left me with the attached backtraces (which I think are wrong as I
> > see a lot of ?? yet gdb says
> > /usr/lib/debug/.build-id/1d/23dc5ef4fec1dacebba2c6445f05c8fe6b8a7c.debug was
> > loaded)
> >
> > kh10-8 mds backtrace -- https://pastebin.com/bwqZGcfD
> > kh09-8 mds backtrace -- https://pastebin.com/vvGiXYVY
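For readers who want something closer to a Python traceback, a minimal sketch follows that pulls the function name and source location out of gdb "bt" frame lines like the ones in this message. The regular expression and the parse_frame helper are illustrative assumptions, not part of gdb or Ceph, and they only handle frames that end in "at file:line".

```python
import re

# One gdb "bt" frame: "#N [0xADDR in ]function (args) at file:line".
# The address part is optional because inlined frames (such as #8 in the
# backtrace above) are printed without one.
FRAME_RE = re.compile(
    r"^#(?P<index>\d+)\s+"
    r"(?:0x[0-9a-fA-F]+\s+in\s+)?"
    r"(?P<function>\S+)\s*"
    r"\((?P<args>.*)\)\s+"
    r"at\s+(?P<file>\S+):(?P<line>\d+)$"
)


def parse_frame(text):
    """Return frame index, function, file and line, or None if no match."""
    # Collapse runs of whitespace first, since pasted frames are often wrapped.
    match = FRAME_RE.match(" ".join(text.split()))
    if match is None:
        return None
    return {
        "index": int(match.group("index")),
        "function": match.group("function"),
        "file": match.group("file"),
        "line": int(match.group("line")),
    }


if __name__ == "__main__":
    frame = parse_frame(
        "#0 0x5617aacc48c2 in Server::handle_client_getattr ( "
        "this=this@entry=0x5617b5acbcd0, mdr=..., is_lookup=is_lookup@entry=true) "
        "at /build/ceph-12.2.5/src/mds/Server.cc:3065"
    )
    print(frame["function"], frame["file"], frame["line"])
```

Frame #0 is the one that received the signal, so its file and line (Server.cc:3065 in 12.2.5) are the place to start reading the source.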
Re: [ceph-users] ceph-disk is getting removed from master
On 23-5-2018 17:12, Alfredo Deza wrote:
> Now that Mimic is fully branched out from master, ceph-disk is going
> to be removed from master so that it is no longer available for the N
> release (pull request to follow)

> Willem, we don't have a way of directly supporting FreeBSD, I've
> suggested that a plugin would be a good way to consume ceph-volume
> with whatever FreeBSD needs, alternatively forking ceph-disk could be
> another option?

Yup, I'm aware of my "trouble"/commitment.

Now that you have ripped out most/all of the partitioning stuff there should not be much that one would need to do in ceph-volume other than accept the filestore directories to format the MON/OSD stuff in.

IFF I could find the time to dive into ceph-volume. :( ATM I'm having a hard time keeping up with the changes as it is.

I'd appreciate it if you could delay yanking ceph-disk until we are close to the Nautilus release. At which point feel free to use the axe.

--WjW
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] ceph-disk is getting removed from master
Now that Mimic is fully branched out from master, ceph-disk is going to be removed from master so that it is no longer available for the N release (pull request to follow) ceph-disk should be considered as "frozen" and deprecated for Mimic, in favor of ceph-volume. This means that if you are relying on ceph-disk *at all*, you should plan on migrating to ceph-volume for Mimic, and should expect breakage if using/testing it in master. Please refer to the guide to migrate away from ceph-disk [0] Willem, we don't have a way of directly supporting FreeBSD, I've suggested that a plugin would be a good way to consume ceph-volume with whatever FreeBSD needs, alternatively forking ceph-disk could be another option? Thanks [0] http://docs.ceph.com/docs/master/ceph-volume/#migrating ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] open vstorage
http://www.openvstorage.com https://www.openvstorage.org I came across this the other day and am curious if anybody has run it in front of their Ceph cluster. I'm looking at it for a clean-ish Ceph integration with VMware.
Re: [ceph-users] SSD-primary crush rule doesn't work as intended
You can't mix HDDs and SSDs in a server if you want to use such a rule. The new selection step after "emit" can't know what server was selected previously. Paul 2018-05-23 11:02 GMT+02:00 Horace: > Add to the info, I have a slightly modified rule to take advantage of the > new storage class. > > rule ssd-hybrid { > id 2 > type replicated > min_size 1 > max_size 10 > step take default class ssd > step chooseleaf firstn 1 type host > step emit > step take default class hdd > step chooseleaf firstn -1 type host > step emit > } > > Regards, > Horace Ng > > - Original Message - > From: "horace" > To: "ceph-users" > Sent: Wednesday, May 23, 2018 3:56:20 PM > Subject: [ceph-users] SSD-primary crush rule doesn't work as intended > > I've set up the rule according to the doc, but some of the PGs are still > being assigned to the same host. > > http://docs.ceph.com/docs/master/rados/operations/crush-map-edits/ > > rule ssd-primary { > ruleset 5 > type replicated > min_size 5 > max_size 10 > step take ssd > step chooseleaf firstn 1 type host > step emit > step take platter > step chooseleaf firstn -1 type host > step emit > } > > Crush tree: > > [root@ceph0 ~]#ceph osd crush tree > ID CLASS WEIGHT TYPE NAME > -1 58.63989 root default > -2 19.55095 host ceph0 > 0 hdd 2.73000 osd.0 > 1 hdd 2.73000 osd.1 > 2 hdd 2.73000 osd.2 > 3 hdd 2.73000 osd.3 > 12 hdd 4.54999 osd.12 > 15 hdd 3.71999 osd.15 > 18 ssd 0.2 osd.18 > 19 ssd 0.16100 osd.19 > -3 19.55095 host ceph1 > 4 hdd 2.73000 osd.4 > 5 hdd 2.73000 osd.5 > 6 hdd 2.73000 osd.6 > 7 hdd 2.73000 osd.7 > 13 hdd 4.54999 osd.13 > 16 hdd 3.71999 osd.16 > 20 ssd 0.16100 osd.20 > 21 ssd 0.2 osd.21 > -4 19.53799 host ceph2 > 8 hdd 2.73000 osd.8 > 9 hdd 2.73000 osd.9 > 10 hdd 2.73000 osd.10 > 11 hdd 2.73000 osd.11 > 14 hdd 3.71999 osd.14 > 17 hdd 4.54999 osd.17 > 22 ssd 0.18700 osd.22 > 23 ssd 0.16100 osd.23 > > #ceph pg ls-by-pool ssd-hybrid > > 27.8 1051 00 0 0 4399733760 > 1581 1581 active+clean 2018-05-23 06:20:56.088216 > 27957'185553 
27959:368828 [23,1,11] 23 [23,1,11] 23 > 27953'182582 2018-05-23 06:20:56.08817227843'162478 2018-05-20 > 18:28:20.118632 > > With osd.23 and osd.11 being assigned on the same host. > > Regards, > Horace Ng > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > -- -- Paul Emmerich Looking for help with your Ceph cluster? Contact us at https://croit.io croit GmbH Freseniusstr. 31h 81247 München www.croit.io Tel: +49 89 1896585 90 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
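Paul's point can be illustrated with a toy model. The sketch below is not CRUSH: it assumes every host is equally likely and that the two passes choose independently, which is a simplification of CRUSH's weighted pseudo-random placement. The function name collision_fraction is made up for this example. It counts how often the host picked by the first pass (the SSD replica) shows up again among the hosts picked by the second pass (the HDD replicas).

```python
from fractions import Fraction
from itertools import combinations

# Host names taken from the crush tree in this thread.
HOSTS = ["ceph0", "ceph1", "ceph2"]


def collision_fraction(hosts, replicas):
    """Fraction of outcomes in which the SSD host is reused for an HDD replica.

    Toy assumptions: pass 1 picks one host uniformly, pass 2 picks
    (replicas - 1) distinct hosts uniformly, and pass 2 has no memory of
    what pass 1 selected.
    """
    total = 0
    collisions = 0
    for ssd_host in hosts:  # step chooseleaf firstn 1 type host
        for hdd_hosts in combinations(hosts, replicas - 1):  # firstn -1
            total += 1
            if ssd_host in hdd_hosts:
                collisions += 1
    return Fraction(collisions, total)


if __name__ == "__main__":
    print(collision_fraction(HOSTS, 3))  # prints 2/3
```

Under these assumptions, with three hosts that all carry both device classes and a pool of size 3, two out of three placements put two replicas on one host, which matches the kind of result reported for pg 27.8.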
[ceph-users] HDFS with CEPH, only single RGW works with the hdfs
Hello Cephers, Our team is currently trying to replace HDFS with Ceph object storage. However, there is a big problem: the "*hdfs dfs -put*" operation is very slow. I suspect the sessions between RGW and the Hadoop system, because only one RGW node works with Hadoop even though we have 4 RGWs. There does not seem to be any configuration for multiple sessions in HDFS. Have you experienced similar issues, and how did you overcome them? I would appreciate any advice. Best Regards, John Haan
Re: [ceph-users] Several questions on the radosgw-openstack integration
For #2, I think I found myself the answer. The admin can simply generate the S3 keys for the user, e.g.: radosgw-admin key create --key-type=s3 --gen-access-key --gen-secret --uid="a22db12575694c9e9f8650dde73ef565\$a22db12575694c9e9f8650dde73ef565" --rgw-realm=cloudtest and then the user can access her data also using S3. besides swift Cheers, Massimo On Wed, May 23, 2018 at 12:49 PM, Massimo Sgaravatto < massimo.sgarava...@gmail.com> wrote: > For #1 I guess this is a known issue (http://tracker.ceph.com/issues/20570 > ) > > On Tue, May 22, 2018 at 1:03 PM, Massimo Sgaravatto < > massimo.sgarava...@gmail.com> wrote: > >> I have several questions on the radosgw - OpenStack integration. >> >> I was more or less able to set it (using a Luminous ceph cluster >> and an Ocata OpenStack cloud), but I don't know if it working as expected. >> >> >> So, the questions: >> >> >> 1. >> I miss the meaning of the attribute "rgw keystone implicit tenants" >> If I set "rgw keystone implicit tenants = false", accounts are created >> using id: >> >> and the display name is the name of the OpenStack >> project >> >> >> If I set "rgw keystone implicit tenants = true", accounts are created >> using id: >> >> $< >> >> and, again, the display name is the name of the OpenStack project >> >> >> So one account per openstack project in both cases >> I would have expected two radosgw accounts for 2 openstack users >> belonging to the same project, setting "rgw keystone implicit tenants = >> true" >> >> >> 2 >> Are OpenStack users supposed to access to their data only using swift, or >> also via S3 ? >> In the latter case, how can the user find her S3 credentials ? >> I am not able to find the S3 keys for such OpenStack users also using >> radosgw-admin >> >> # radosgw-admin user info --uid="a22db12575694c9e9f8650d >> de73ef565\$a22db12575694c9e9f8650dde73ef565" --rgw-realm=cloudtest >> ... >> ... >> "keys": [], >> ... >> ... 
>> >> >> 3 >> How is the admin supposed to set default quota for each project/user ? >> How can then the admin modify the quota for a user ? >> How can the user see the assigned quota ? >> >> I tried relying on the "rgw user default quota max size" attribute to >> set the default quota. It works for users created using "radosgw-admin >> user create" while >> I am not able to see it working for OpenStack users (see also the thread >> "rgw default user quota for OpenStack users") >> >> If I explicitly set the quota for a OpenStack user using: >> >> radosgw-admin quota set --quota-scope=user --max-size=2G >> --uid="a22db12575694c9e9f8650dde73ef565\$a22db12575694c9e9f8650dde73ef565" >> --rgw-realm=cloudtest >> radosgw-admin quota enable --quota-scope=user >> --uid="a22db12575694c9e9f8650dde73ef565\$a22db12575694c9e9f8650dde73ef565" >> --rgw-realm=cloudtest >> >> >> this works (i.e. quota is enforced) but such quota is not exposed to the >> user (at least it is not reported anywhere in the OpenStack dashboard nor >> in the "swift stat" output) >> >> >> 4 >> I tried creating (using the OpenStack dashboard) containers with public >> access. >> It looks like this works only if "rgw keystone implicit tenants" is set >> to false >> Is this expected ? >> >> >> Many thanks, Massimo >> >> > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
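The backslash in the uid above is only there to stop the shell from expanding "$". A minimal sketch, assuming the uid has the form shown in this thread (the same project ID on both sides of the "$"), builds the same radosgw-admin command as an argument list so that no shell escaping is needed. The helper name s3_key_create_argv is hypothetical; only flags quoted in this thread are used, and the command is printed rather than executed.

```python
import shlex


def s3_key_create_argv(project_id, realm):
    """Build the 'radosgw-admin key create' call quoted in this thread.

    Assumption: the radosgw uid is "<project id>$<project id>", as in the
    example above.
    """
    uid = f"{project_id}${project_id}"
    return [
        "radosgw-admin", "key", "create",
        "--key-type=s3",
        "--gen-access-key",
        "--gen-secret",
        f"--uid={uid}",
        f"--rgw-realm={realm}",
    ]


if __name__ == "__main__":
    argv = s3_key_create_argv("a22db12575694c9e9f8650dde73ef565", "cloudtest")
    # shlex.join shows how the command would have to be quoted in a shell.
    print(shlex.join(argv))
    # To actually run it on an admin node: subprocess.run(argv, check=True)
```

Passing the list straight to subprocess.run avoids the shell, so the literal "$" reaches radosgw-admin unchanged.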
Re: [ceph-users] Luminous: resilience - private interface down , no read/write
yes it is 68 disks , and will this mon_osd_reporter_subtree_level = host have any impact on mon_osd_ min_down_reporters ? And related to min_size , yes there was many suggestions for us to move to 2 , due to storage efficiency concerns we still retain with 1 and trying to convince customers to go with 2 for better data integrity. thanks, Muthu On Wed, May 23, 2018 at 3:31 PM, David Turnerwrote: > How many disks in each node? 68? If yes, then change it to 69. Also > running with ec 4+1 is bad for the same reason as running with size=2 > min_size=1 which has been mentioned and discussed multiple times on the ML. > > > On Wed, May 23, 2018, 3:39 AM nokia ceph wrote: > >> Hi David Turner, >> >> This is our ceph config under mon section , we have EC 4+1 and set the >> failure domain as host and osd_min_down_reporters to 4 ( osds from 4 >> different host ) . >> >> [mon] >> mon_compact_on_start = True >> mon_osd_down_out_interval = 86400 >> mon_osd_down_out_subtree_limit = host >> mon_osd_min_down_reporters = 4 >> mon_osd_reporter_subtree_level = host >> >> We have 68 disks , can we increase sd_min_down_reporters to 68 ? >> >> Thanks, >> Muthu >> >> On Tue, May 22, 2018 at 5:46 PM, David Turner >> wrote: >> >>> What happens when a storage node loses its cluster network but not it's >>> public network is that all other osss on the cluster see that it's down and >>> report that to the mons, but the node call still talk to the mons telling >>> the mons that it is up and in fact everything else is down. >>> >>> The setting osd _min_reporters (I think that's the name of it off the >>> top of my head) is designed to help with this scenario. It's default is 1 >>> which means any osd on either side of the network problem will be trusted >>> by the mons to mark osds down. What you want to do with this seeing is to >>> set it to at least 1 more than the number of osds in your failure domain. 
>>> If the failure domain is host and each node has 32 osds, then setting it to >>> 33 will prevent a full problematic node from being able to cause havoc. >>> >>> The osds will still try to mark themselves as up and this will still >>> cause problems for read until the osd process stops or the network comes >>> back up. There might be a seeing for how long an odd will try telling the >>> mons it's up, but this isn't really a situation I've come across after >>> initial testing and installation of nodes. >>> >>> On Tue, May 22, 2018, 1:47 AM nokia ceph >>> wrote: >>> Hi Ceph users, We have a cluster with 5 node (67 disks) and EC 4+1 configuration and min_size set as 4. Ceph version : 12.2.5 While executing one of our resilience usecase , making private interface down on one of the node, till kraken we saw less outage in rados (60s) . Now with luminous, we could able to see rados read/write outage for more than 200s . In the logs we could able to see that peer OSDs inform that one of the node OSDs are down however the OSDs defend like it is wrongly marked down and does not move to down state for long time. 2018-05-22 05:37:17.871049 7f6ac71e6700 0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.1 down, but it is still running 2018-05-22 05:37:17.871072 7f6ac71e6700 0 log_channel(cluster) log [DBG] : map e35690 wrongly marked me down at e35689 2018-05-22 05:37:17.878347 7f6ac71e6700 0 osd.1 35690 crush map has features 1009107927421960192, adjusting msgr requires for osds 2018-05-22 05:37:18.296643 7f6ac71e6700 0 osd.1 35691 crush map has features 1009107927421960192, adjusting msgr requires for osds Only when all 67 OSDs are move to down state , the read/write traffic is resumed. Could you please help us in resolving this issue and if it is bug , we will create corresponding ticket. 
Thanks, Muthu ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >>> >> ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] ceph_vms performance
Hi, I'm testing out ceph_vms vs a cephfs mount with a cifs export. I currently have 3 active ceph mds servers to maximise throughput, and when I configure a cephfs mount with a cifs export I get reasonable benchmark results. However, when I tried some benchmarking with the ceph_vms module, I only got a third of the comparable write throughput. I'm just wondering if this is expected, or if there is an obvious configuration setting that I'm missing? Configuration: I've compiled git branch samba 4_8_test. I'm using ceph 12.2.5 Kind regards, Tom
Re: [ceph-users] Several questions on the radosgw-openstack integration
For #1 I guess this is a known issue (http://tracker.ceph.com/issues/20570) On Tue, May 22, 2018 at 1:03 PM, Massimo Sgaravatto < massimo.sgarava...@gmail.com> wrote: > I have several questions on the radosgw - OpenStack integration. > > I was more or less able to set it (using a Luminous ceph cluster > and an Ocata OpenStack cloud), but I don't know if it working as expected. > > > So, the questions: > > > 1. > I miss the meaning of the attribute "rgw keystone implicit tenants" > If I set "rgw keystone implicit tenants = false", accounts are created > using id: > > and the display name is the name of the OpenStack > project > > > If I set "rgw keystone implicit tenants = true", accounts are created > using id: > > $< > > and, again, the display name is the name of the OpenStack project > > > So one account per openstack project in both cases > I would have expected two radosgw accounts for 2 openstack users belonging > to the same project, setting "rgw keystone implicit tenants = true" > > > 2 > Are OpenStack users supposed to access to their data only using swift, or > also via S3 ? > In the latter case, how can the user find her S3 credentials ? > I am not able to find the S3 keys for such OpenStack users also using > radosgw-admin > > # radosgw-admin user info --uid="a22db12575694c9e9f8650dde73ef565\$ > a22db12575694c9e9f8650dde73ef565" --rgw-realm=cloudtest > ... > ... > "keys": [], > ... > ... > > > 3 > How is the admin supposed to set default quota for each project/user ? > How can then the admin modify the quota for a user ? > How can the user see the assigned quota ? > > I tried relying on the "rgw user default quota max size" attribute to > set the default quota. 
It works for users created using "radosgw-admin > user create" while > I am not able to see it working for OpenStack users (see also the thread > "rgw default user quota for OpenStack users") > > If I explicitly set the quota for a OpenStack user using: > > radosgw-admin quota set --quota-scope=user --max-size=2G --uid=" > a22db12575694c9e9f8650dde73ef565\$a22db12575694c9e9f8650dde73ef565" > --rgw-realm=cloudtest > radosgw-admin quota enable --quota-scope=user --uid=" > a22db12575694c9e9f8650dde73ef565\$a22db12575694c9e9f8650dde73ef565" > --rgw-realm=cloudtest > > > this works (i.e. quota is enforced) but such quota is not exposed to the > user (at least it is not reported anywhere in the OpenStack dashboard nor > in the "swift stat" output) > > > 4 > I tried creating (using the OpenStack dashboard) containers with public > access. > It looks like this works only if "rgw keystone implicit tenants" is set to > false > Is this expected ? > > > Many thanks, Massimo > > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Luminous: resilience - private interface down , no read/write
How many disks in each node? 68? If yes, then change it to 69. Also running with ec 4+1 is bad for the same reason as running with size=2 min_size=1 which has been mentioned and discussed multiple times on the ML. On Wed, May 23, 2018, 3:39 AM nokia cephwrote: > Hi David Turner, > > This is our ceph config under mon section , we have EC 4+1 and set the > failure domain as host and osd_min_down_reporters to 4 ( osds from 4 > different host ) . > > [mon] > mon_compact_on_start = True > mon_osd_down_out_interval = 86400 > mon_osd_down_out_subtree_limit = host > mon_osd_min_down_reporters = 4 > mon_osd_reporter_subtree_level = host > > We have 68 disks , can we increase sd_min_down_reporters to 68 ? > > Thanks, > Muthu > > On Tue, May 22, 2018 at 5:46 PM, David Turner > wrote: > >> What happens when a storage node loses its cluster network but not it's >> public network is that all other osss on the cluster see that it's down and >> report that to the mons, but the node call still talk to the mons telling >> the mons that it is up and in fact everything else is down. >> >> The setting osd _min_reporters (I think that's the name of it off the top >> of my head) is designed to help with this scenario. It's default is 1 which >> means any osd on either side of the network problem will be trusted by the >> mons to mark osds down. What you want to do with this seeing is to set it >> to at least 1 more than the number of osds in your failure domain. If the >> failure domain is host and each node has 32 osds, then setting it to 33 >> will prevent a full problematic node from being able to cause havoc. >> >> The osds will still try to mark themselves as up and this will still >> cause problems for read until the osd process stops or the network comes >> back up. There might be a seeing for how long an odd will try telling the >> mons it's up, but this isn't really a situation I've come across after >> initial testing and installation of nodes. 
>> >> On Tue, May 22, 2018, 1:47 AM nokia ceph >> wrote: >> >>> Hi Ceph users, >>> >>> We have a cluster with 5 node (67 disks) and EC 4+1 configuration and >>> min_size set as 4. >>> Ceph version : 12.2.5 >>> While executing one of our resilience usecase , making private interface >>> down on one of the node, till kraken we saw less outage in rados (60s) . >>> >>> Now with luminous, we could able to see rados read/write outage for more >>> than 200s . In the logs we could able to see that peer OSDs inform that one >>> of the node OSDs are down however the OSDs defend like it is wrongly >>> marked down and does not move to down state for long time. >>> >>> 2018-05-22 05:37:17.871049 7f6ac71e6700 0 log_channel(cluster) log >>> [WRN] : Monitor daemon marked osd.1 down, but it is still running >>> 2018-05-22 05:37:17.871072 7f6ac71e6700 0 log_channel(cluster) log >>> [DBG] : map e35690 wrongly marked me down at e35689 >>> 2018-05-22 05:37:17.878347 7f6ac71e6700 0 osd.1 35690 crush map has >>> features 1009107927421960192, adjusting msgr requires for osds >>> 2018-05-22 05:37:18.296643 7f6ac71e6700 0 osd.1 35691 crush map has >>> features 1009107927421960192, adjusting msgr requires for osds >>> >>> >>> Only when all 67 OSDs are move to down state , the read/write traffic is >>> resumed. >>> >>> Could you please help us in resolving this issue and if it is bug , we >>> will create corresponding ticket. >>> >>> Thanks, >>> Muthu >>> ___ >>> ceph-users mailing list >>> ceph-users@lists.ceph.com >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >>> >> > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
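The rule of thumb in this thread is simple arithmetic: one more reporter than the number of OSDs in a single failure domain, so that one isolated node cannot mark the rest of the cluster down on its own. The sketch below only restates that rule. The function name is made up, and it does not model how mon_osd_reporter_subtree_level changes the way reporters are counted, which this thread leaves open.

```python
def min_down_reporters(osds_in_failure_domain):
    """Rule of thumb from this thread: OSDs in one failure domain, plus one."""
    if osds_in_failure_domain < 1:
        raise ValueError("a failure domain needs at least one OSD")
    return osds_in_failure_domain + 1


if __name__ == "__main__":
    # 32 is the example used in the quoted explanation, 68 the figure asked about.
    for osds in (32, 68):
        value = min_down_reporters(osds)
        print(f"mon_osd_min_down_reporters = {value}  # {osds} OSDs per host")
```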
[ceph-users] IO500 Call for Submissions for ISC 2018
IO500 Call for Submission Deadline: 23 June 2018 AoE The IO500 is now accepting and encouraging submissions for the upcoming IO500 list revealed at ISC 2018 in Frankfurt, Germany. The benchmark suite is designed to be easy to run and the community has multiple active support channels to help with any questions. Please submit and we look forward to seeing many of you at ISC 2018! Please note that submissions of all size are welcome; the site has customizable sorting so it is possible to submit on a small system and still get a very good per-client score for example. Additionally, the list is about much more than just the raw rank; all submissions help the community by collecting and publishing a wider corpus of data. More details below. Following the success of the Top500 in collecting and analyzing historical trends in supercomputer technology and evolution, the IO500 was created in 2017 and published its first list at SC17. The need for such an initiative has long been known within High Performance Computing; however, defining appropriate benchmarks had long been challenging. Despite this challenge, the community, after long and spirited discussion, finally reached consensus on a suite of benchmarks and a metric for resolving the scores into a single ranking. The multi-fold goals of the benchmark suite are as follows: * Maximizing simplicity in running the benchmark suite * Encouraging complexity in tuning for performance * Allowing submitters to highlight their “hero run” performance numbers * Forcing submitters to simultaneously report performance for challenging IO patterns. Specifically, the benchmark suite includes a hero-run of both IOR and mdtest configured however possible to maximize performance and establish an upper-bound for performance. It also includes an IOR and mdtest run with highly prescribed parameters in an attempt to determine a lower-bound. 
Finally, it includes a namespace search as this has been determined to be a highly sought-after feature in HPC storage systems that has historically not been well-measured. Submitters are encouraged to share their tuning insights for publication. The goals of the community are also multi-fold: * Gather historical data for the sake of analysis and to aid predictions of storage futures * Collect tuning information to share valuable performance optimizations across the community * Encourage vendors and designers to optimize for workloads beyond “hero runs” * Establish bounded expectations for users, procurers, and administrators Once again, we encourage you to submit (see http://io500.org/submission), to join our community, and to attend our BoF “The IO-500 and the Virtual Institute of I/O” at ISC 2018 where we will announce the second ever IO500 list. The current list includes results from BeeGPFS, DataWarp, IME, Lustre, and Spectrum Scale. We hope that the next list has even more! We look forward to answering any questions or concerns you might have. Thank you! IO500 Committee ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph Luminous - OSD constantly crashing caused by corrupted placement group
Hi!

We have now deleted all snapshots of the pool in question.

With "ceph pg dump" we can see that pg 5.9b has a SNAPTRIMQ_LEN of 27826. All other PGs have 0. It looks like this value does not decrease. LAST_SCRUB and LAST_DEEP_SCRUB are both from 2018-04-24, almost 1 month ago.

The OSD still crashes a while after we start it. OSD log:

*** Caught signal (Aborted) **

and

/build/ceph-12.2.5/src/osd/PrimaryLogPG.cc: 358: FAILED assert(p != recovery_info.ss.clone_snaps.end())

Any ideas how to fix this? Is there a way to "force" the snaptrim of the pg in question? Or any other way to "clean" this pg?

We have searched a lot in the mail archives but couldn't find anything that could help us in this case.

Br,

On 17.05.2018 at 00:12, Gregory Farnum wrote:
> On Wed, May 16, 2018 at 6:49 AM Siegfried Höllrigl wrote:
> > Hi Greg !
> >
> > Thank you for your fast reply.
> >
> > We have now deleted the PG on OSD.130 like you suggested and started it :
> >
> > ceph-s-06 # ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-130/ --pgid 5.9b --op remove --force
> > marking collection for removal
> > setting '_remove' omap key
> > finish_remove_pgs 5.9b_head removing 5.9b
> > Remove successful
> > ceph-s-06 # systemctl start ceph-osd@130.service
> >
> > The cluster recovered again until it came to the PG 5.9b. Then OSD.130
> > crashed again. -> No Change
> >
> > So we wanted to start the other way and export the PG from the primary
> > (healthy) OSD. (OSD.19) but that fails:
> >
> > root@ceph-s-03:/tmp5.9b# ceph-objectstore-tool --op export --pgid 5.9b --data-path /var/lib/ceph/osd/ceph-19 --file /tmp5.9b/5.9b.export
> > OSD has the store locked
> >
> > But we don't want to stop OSD.19 on this server because this Pool has
> > size=3 and min_size=2.
> > (this would make pg 5.9b inaccessible)
>
> I'm a bit confused. Are you saying that
> 1) the ceph-objectstore-tool you pasted there successfully removed pg 5.9b from osd.130 (as it appears), AND
> 2) pg 5.9b was active with one of the other nodes as primary, so all data remained available, AND
> 3) when pg 5.9b got backfilled into osd.130, osd.130 crashed again? (But the other OSDs kept the PG fully available, without crashing?)
>
> That sequence of events is *deeply* confusing and I really don't understand how it might happen.
>
> Sadly I don't think you can grab a PG for export without stopping the OSD in question.
>
> > When we query the pg, we can see a lot of "snap_trimq".
> > Can this be cleaned somehow, even if the pg is undersized and degraded ?
>
> I *think* the PG will keep trimming snapshots even if undersized+degraded (though I don't remember for sure), but snapshot trimming is often heavily throttled and I'm not aware of any way to specifically push one PG to the front. If you're interested in speeding snaptrimming up you can search the archives or check the docs for the appropriate config options.
> -Greg
Re: [ceph-users] SSD-primary crush rule doesn't work as intended
Add to the info, I have a slightly modified rule to take advantage of the new storage class. rule ssd-hybrid { id 2 type replicated min_size 1 max_size 10 step take default class ssd step chooseleaf firstn 1 type host step emit step take default class hdd step chooseleaf firstn -1 type host step emit } Regards, Horace Ng - Original Message - From: "horace"To: "ceph-users" Sent: Wednesday, May 23, 2018 3:56:20 PM Subject: [ceph-users] SSD-primary crush rule doesn't work as intended I've set up the rule according to the doc, but some of the PGs are still being assigned to the same host. http://docs.ceph.com/docs/master/rados/operations/crush-map-edits/ rule ssd-primary { ruleset 5 type replicated min_size 5 max_size 10 step take ssd step chooseleaf firstn 1 type host step emit step take platter step chooseleaf firstn -1 type host step emit } Crush tree: [root@ceph0 ~]#ceph osd crush tree ID CLASS WEIGHT TYPE NAME -1 58.63989 root default -2 19.55095 host ceph0 0 hdd 2.73000 osd.0 1 hdd 2.73000 osd.1 2 hdd 2.73000 osd.2 3 hdd 2.73000 osd.3 12 hdd 4.54999 osd.12 15 hdd 3.71999 osd.15 18 ssd 0.2 osd.18 19 ssd 0.16100 osd.19 -3 19.55095 host ceph1 4 hdd 2.73000 osd.4 5 hdd 2.73000 osd.5 6 hdd 2.73000 osd.6 7 hdd 2.73000 osd.7 13 hdd 4.54999 osd.13 16 hdd 3.71999 osd.16 20 ssd 0.16100 osd.20 21 ssd 0.2 osd.21 -4 19.53799 host ceph2 8 hdd 2.73000 osd.8 9 hdd 2.73000 osd.9 10 hdd 2.73000 osd.10 11 hdd 2.73000 osd.11 14 hdd 3.71999 osd.14 17 hdd 4.54999 osd.17 22 ssd 0.18700 osd.22 23 ssd 0.16100 osd.23 #ceph pg ls-by-pool ssd-hybrid 27.8 1051 00 0 0 4399733760 1581 1581 active+clean 2018-05-23 06:20:56.088216 27957'185553 27959:368828 [23,1,11] 23 [23,1,11] 23 27953'182582 2018-05-23 06:20:56.08817227843'162478 2018-05-20 18:28:20.118632 With osd.23 and osd.11 being assigned on the same host. 
Regards, Horace Ng ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] SSD-primary crush rule doesn't work as intended
I've set up the rule according to the doc, but some of the PGs are still being assigned to the same host.

http://docs.ceph.com/docs/master/rados/operations/crush-map-edits/

rule ssd-primary {
        ruleset 5
        type replicated
        min_size 5
        max_size 10
        step take ssd
        step chooseleaf firstn 1 type host
        step emit
        step take platter
        step chooseleaf firstn -1 type host
        step emit
}

Crush tree:

[root@ceph0 ~]# ceph osd crush tree
ID CLASS WEIGHT   TYPE NAME
-1       58.63989 root default
-2       19.55095     host ceph0
 0   hdd  2.73000         osd.0
 1   hdd  2.73000         osd.1
 2   hdd  2.73000         osd.2
 3   hdd  2.73000         osd.3
12   hdd  4.54999         osd.12
15   hdd  3.71999         osd.15
18   ssd  0.2             osd.18
19   ssd  0.16100         osd.19
-3       19.55095     host ceph1
 4   hdd  2.73000         osd.4
 5   hdd  2.73000         osd.5
 6   hdd  2.73000         osd.6
 7   hdd  2.73000         osd.7
13   hdd  4.54999         osd.13
16   hdd  3.71999         osd.16
20   ssd  0.16100         osd.20
21   ssd  0.2             osd.21
-4       19.53799     host ceph2
 8   hdd  2.73000         osd.8
 9   hdd  2.73000         osd.9
10   hdd  2.73000         osd.10
11   hdd  2.73000         osd.11
14   hdd  3.71999         osd.14
17   hdd  4.54999         osd.17
22   ssd  0.18700         osd.22
23   ssd  0.16100         osd.23

# ceph pg ls-by-pool ssd-hybrid
27.8 1051 0 0 0 0 4399733760 1581 1581 active+clean 2018-05-23 06:20:56.088216 27957'185553 27959:368828 [23,1,11] 23 [23,1,11] 23 27953'182582 2018-05-23 06:20:56.088172 27843'162478 2018-05-20 18:28:20.118632

With osd.23 and osd.11 being assigned on the same host.

Regards,
Horace Ng
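To check other PGs for the same problem, a small sketch can map each OSD in an acting set back to its host. The OSD-to-host table is copied from the crush tree in this message; the function name shared_hosts is made up for illustration, and feeding it acting sets parsed from "ceph pg ls-by-pool" output is left to the reader.

```python
from collections import defaultdict

# OSD-to-host layout copied from the crush tree in this message.
HOST_OSDS = {
    "ceph0": [0, 1, 2, 3, 12, 15, 18, 19],
    "ceph1": [4, 5, 6, 7, 13, 16, 20, 21],
    "ceph2": [8, 9, 10, 11, 14, 17, 22, 23],
}
OSD_HOST = {osd: host for host, osds in HOST_OSDS.items() for osd in osds}


def shared_hosts(acting):
    """Return {host: [osds]} for every host holding more than one replica."""
    by_host = defaultdict(list)
    for osd in acting:
        by_host[OSD_HOST[osd]].append(osd)
    return {host: osds for host, osds in by_host.items() if len(osds) > 1}


if __name__ == "__main__":
    # Acting set of pg 27.8 from the listing above.
    print(shared_hosts([23, 1, 11]))  # {'ceph2': [23, 11]}
```

An empty result means every replica in the acting set sits on a different host.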
Re: [ceph-users] Luminous: resilience - private interface down , no read/write
Hi David Turner, This is our ceph config under mon section , we have EC 4+1 and set the failure domain as host and osd_min_down_reporters to 4 ( osds from 4 different host ) . [mon] mon_compact_on_start = True mon_osd_down_out_interval = 86400 mon_osd_down_out_subtree_limit = host mon_osd_min_down_reporters = 4 mon_osd_reporter_subtree_level = host We have 68 disks , can we increase sd_min_down_reporters to 68 ? Thanks, Muthu On Tue, May 22, 2018 at 5:46 PM, David Turnerwrote: > What happens when a storage node loses its cluster network but not it's > public network is that all other osss on the cluster see that it's down and > report that to the mons, but the node call still talk to the mons telling > the mons that it is up and in fact everything else is down. > > The setting osd _min_reporters (I think that's the name of it off the top > of my head) is designed to help with this scenario. It's default is 1 which > means any osd on either side of the network problem will be trusted by the > mons to mark osds down. What you want to do with this seeing is to set it > to at least 1 more than the number of osds in your failure domain. If the > failure domain is host and each node has 32 osds, then setting it to 33 > will prevent a full problematic node from being able to cause havoc. > > The osds will still try to mark themselves as up and this will still cause > problems for read until the osd process stops or the network comes back up. > There might be a seeing for how long an odd will try telling the mons it's > up, but this isn't really a situation I've come across after initial > testing and installation of nodes. > > On Tue, May 22, 2018, 1:47 AM nokia ceph wrote: > >> Hi Ceph users, >> >> We have a cluster with 5 node (67 disks) and EC 4+1 configuration and >> min_size set as 4. >> Ceph version : 12.2.5 >> While executing one of our resilience usecase , making private interface >> down on one of the node, till kraken we saw less outage in rados (60s) . 
>>
>> Now with Luminous, we see a rados read/write outage of more
>> than 200s. In the logs we can see that the peer OSDs report that the
>> OSDs of that node are down; however, those OSDs defend themselves as wrongly
>> marked down and do not move to the down state for a long time.
>>
>> 2018-05-22 05:37:17.871049 7f6ac71e6700 0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.1 down, but it is still running
>> 2018-05-22 05:37:17.871072 7f6ac71e6700 0 log_channel(cluster) log [DBG] : map e35690 wrongly marked me down at e35689
>> 2018-05-22 05:37:17.878347 7f6ac71e6700 0 osd.1 35690 crush map has features 1009107927421960192, adjusting msgr requires for osds
>> 2018-05-22 05:37:18.296643 7f6ac71e6700 0 osd.1 35691 crush map has features 1009107927421960192, adjusting msgr requires for osds
>>
>> Only when all 67 OSDs have moved to the down state does the read/write
>> traffic resume.
>>
>> Could you please help us resolve this issue? If it is a bug, we
>> will create a corresponding ticket.
>>
>> Thanks,
>> Muthu
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
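
A rough sketch of the sizing rule discussed in this thread, in case it helps. The first function is just David's rule of thumb (one more than the number of osds in a failure domain). The second rests on an assumption that is not confirmed anywhere in the thread: that with mon_osd_reporter_subtree_level = host the mons count reporters as distinct hosts, so an osd can be reported by at most (number of hosts - 1) reporters. Both function names are made up for illustration and are not part of Ceph.

```python
def min_reporters_rule(osds_per_failure_domain: int) -> int:
    """Rule of thumb from the thread: at least one more reporter than
    the number of osds inside a single failure domain."""
    if osds_per_failure_domain < 1:
        raise ValueError("a failure domain needs at least one osd")
    return osds_per_failure_domain + 1


def threshold_is_reachable(min_down_reporters: int, total_hosts: int) -> bool:
    """ASSUMPTION (not confirmed in the thread): reporters are counted as
    distinct hosts, so at most total_hosts - 1 hosts can report an osd
    that lives on the remaining host."""
    if total_hosts < 2:
        raise ValueError("need at least two hosts for peer reports")
    return min_down_reporters <= total_hosts - 1
```

With David's example, min_reporters_rule(32) gives 33. Under the per-host counting assumption, a 5-node cluster can reach a threshold of 4 but could never reach 68, which would leave peer reports unable to mark any osd down. Please check the behaviour on your own release before relying on this.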
Re: [ceph-users] [client.rgw.hostname] or [client.radosgw.hostname] ?
Ok, understood.

Thanks a lot.

Cheers, Massimo

On Tue, May 22, 2018 at 1:57 PM, David Turner wrote:

> We use radosgw in our deployment. It doesn't really matter, as you can
> specify the key in the config file. You could call it
> client.thatobjectthing.hostname and it would work fine.
>
> On Tue, May 22, 2018, 5:54 AM Massimo Sgaravatto
> <massimo.sgarava...@gmail.com> wrote:
>
>> # ls /var/lib/ceph/radosgw/
>> ceph-rgw.ceph-test-rgw-01
>>
>> So [client.rgw.ceph-test-rgw-01]
>>
>> Thanks, Massimo
>>
>> On Tue, May 22, 2018 at 6:28 AM, Marc Roos wrote:
>>
>>> I can relate to your issue. I always look at
>>>
>>> /var/lib/ceph/
>>>
>>> and see what is used there.
>>>
>>> -----Original Message-----
>>> From: Massimo Sgaravatto [mailto:massimo.sgarava...@gmail.com]
>>> Sent: Tuesday, May 22, 2018 11:46
>>> To: Ceph Users
>>> Subject: [ceph-users] [client.rgw.hostname] or [client.radosgw.hostname]?
>>>
>>> I am really confused about the use of [client.rgw.hostname] or
>>> [client.radosgw.hostname] in the configuration file. I don't understand
>>> whether they have different purposes or whether there is just a problem
>>> with the documentation.
>>>
>>> E.g.:
>>>
>>> http://docs.ceph.com/docs/luminous/start/quick-rgw/
>>>
>>> says that [client.rgw.hostname] should be used,
>>>
>>> while:
>>>
>>> http://docs.ceph.com/docs/luminous/radosgw/config-ref/
>>>
>>> talks about [client.radosgw.{instance-name}].
>>>
>>> In my luminous-centos7 cluster it looks like only [client.rgw.hostname]
>>> works.
>>>
>>> Thanks, Massimo
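
For anyone landing on this thread later, the lookup Marc and Massimo describe can be sketched as below. It assumes the directory under /var/lib/ceph/radosgw/ is named <cluster>-<id> and that the cluster name itself contains no hyphen (true for the default "ceph" shown above). The function name is hypothetical and is not part of any Ceph tooling.

```python
def section_from_data_dir(dir_name: str) -> str:
    """Derive the ceph.conf section name from a radosgw data directory name.

    ASSUMPTION: dir_name looks like "<cluster>-<id>" and the cluster name
    has no hyphen, so the first "-" separates the two parts.
    """
    cluster, sep, daemon_id = dir_name.partition("-")
    if not cluster or not sep or not daemon_id:
        raise ValueError(f"unexpected directory name: {dir_name!r}")
    # The section header is "client." followed by the daemon id.
    return f"client.{daemon_id}"
```

Calling section_from_data_dir("ceph-rgw.ceph-test-rgw-01") yields "client.rgw.ceph-test-rgw-01", matching the section Massimo ended up using. As David notes, the name is arbitrary as long as the config and the key agree, so this only recovers what an existing deployment already uses.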