Re: [ceph-users] PG Balancer Upmap mode not working
Hi Anthony!

Mon, 9 Dec 2019 17:11:12 -0800 Anthony D'Atri ==> ceph-users :
> > How is that possible? I don't know how much more proof I need to present
> > that there's a bug.
>
> FWIW, your pastes are hard to read with all the ? in them. Pasting
> non-7-bit-ASCII?

I don't see many "?" in his posts. Maybe a display issue?

> > I increased PGs and see no difference.
>
> From what pgp_num to what new value? Numbers that are not a power of 2 can
> contribute to the sort of problem you describe. Do you have host CRUSH fault
> domain?

Does the fault domain play a role in this situation? I can't see the reason.
This would only be important if the OSDs weren't evenly distributed across the
hosts. Philippe, can you post your 'ceph osd tree'?

> > Raising PGs to 100 is an old statement anyway, anything 60+ should be fine.
>
> Fine in what regard? To be sure, Wido's advice means a *ratio* of at least
> 100: ratio = (pgp_num * replication) / #osds
>
> The target used to be 200, a commit around 12.2.1 retconned that to 100.
> Best I can tell the rationale is memory usage at the expense of performance.
>
> Is your original excerpt complete? I.e., do you only have 24 OSDs? Across how
> many nodes?
>
> The old guidance for tiny clusters:
> • Less than 5 OSDs: set pg_num to 128
> • Between 5 and 10 OSDs: set pg_num to 512
> • Between 10 and 50 OSDs: set pg_num to 1024

This is what I thought too. But in this post
https://lists.ceph.io/hyperkitty/list/ceph-us...@ceph.io/message/TR6CJQKSMOHNGOMQO4JBDMGEL2RMWE36/
[Why are the mailing lists ceph.io and ceph.com not merged? It's hard to find
the link to messages this way.]
Konstantin suggested reducing to pg_num=512. The cluster had 35 OSDs. It is
still merging the PGs very slowly. In the meantime I added 5 more OSDs and am
thinking about raising the pg_num back to 1024. I wonder how fewer PGs can
balance better than 512.

I'm in a similar situation to Philippe with my cluster.

ceph osd df class hdd:
[…]
MIN/MAX VAR: 0.73/1.21  STDDEV: 6.27

Attached is a picture of the dashboard with tiny bars of the data
distribution. The nearly empty OSDs are SSDs used for their own pool.
I think there might be a bug in the balancing algorithm.

Thanks,
Lars
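For readers following along: a quick way to check the PG ratio Anthony mentions and the state of the upmap balancer is sketched below (pool names and OSD ids are placeholders, not taken from this thread):

```
# Effective ratio per Anthony's formula: (pgp_num * replication) / #osds
ceph osd pool ls detail          # pg_num / pgp_num and size per pool
ceph osd stat                    # number of OSDs
ceph osd df | tail -n 2          # MIN/MAX VAR and STDDEV summary

# Check that the balancer is actually running in upmap mode
ceph balancer status
ceph balancer eval               # lower score = better distribution
```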
Re: [ceph-users] Impact of a small DB size with Bluestore
Hi,

Tue, 26 Nov 2019 13:57:51 + Simon Ironside ==> ceph-users@lists.ceph.com :
> Mattia Belluco said back in May:
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-May/035086.html
>
> "when RocksDB needs to compact a layer it rewrites it
> *before* deleting the old data; if you'd like to be sure you db does not
> spill over to the spindle you should allocate twice the size of the
> biggest layer to allow for compaction."
>
> I didn't spot anyone disagreeing so I used 64GiB DB/WAL partitions on
> the SSDs in my most recent clusters to allow for this and to be certain
> that I definitely had room for the WAL on top and wouldn't get caught
> out by people saying GB (x1000^3 bytes) when they mean GiB (x1024^3
> bytes). I left the rest of the SSD empty to make the most of wear
> leveling, garbage collection etc.
>
> Simon

This is something I would like to get a comment on from a developer, too.
So what about the doubled size for block_db?

Thanks
Lars
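Independent of the sizing rule of thumb, one way to check whether a given block.db size is actually sufficient is to watch for spillover on a running OSD. This assumes the bluefs perf counters available in Nautilus; the osd id is just an example:

```
# Nautilus warns about DB partitions that overflowed onto the slow device
ceph health detail | grep -i spillover

# Per-OSD view: compare db_used_bytes against db_total_bytes;
# slow_used_bytes > 0 means RocksDB already spilled to the HDD
ceph daemon osd.0 perf dump | grep -A 12 '"bluefs"'
```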
Re: [ceph-users] NVMe disk - size
Hi Kristof,

may I add another choice? I configured my SSDs this way: every OSD host has two
fast and durable SSDs. Both SSDs are in one RAID1, which is then split up into
LVs. I took 58GB for DB & WAL (and space for a special action by the DB - was
it compaction?) for each OSD. Then there were some hundreds of GB left on this
RAID1, which I took to form a faster SSD-OSD. This is put into its own class of
OSDs. So I have (slower) pools put onto OSDs of class "hdd" and (faster) pools
put onto OSDs of class "ssd". The faster pools are used for the metadata of
CephFS. (A sketch of this layout follows below.)

Good luck,
Lars

Mon, 18 Nov 2019 07:46:23 +0100 Kristof Coucke ==> vita...@yourcmc.ru :
> Hi all,
>
> Thanks for the feedback.
> Though, just to be sure:
>
> 1. There is no 30GB limit if I understand correctly for the RocksDB size.
> If metadata crosses that barrier, the L4 part will spill over to the primary
> device? Or will it just move the RocksDB completely? Or will it just stop
> and indicate it's full?
> 2. Since the WAL will also be written to that device, I assume a few
> additional GBs are still useful...
>
> With my setup (13x 14TB + 2 NVMe of 1.6TB / host, 10 hosts) I have multiple
> possible scenarios:
> - Assigning 35GB of space of the NVMe disk (30GB for DB, 5 spare) would
> result in only 455GB being used (13 x 35GB). This is a pity, since I have
> 3.2TB of NVMe disk space...
>
> Options line-up:
>
> *Option a*: Not using the NVMe for block.db storage, but as RGW metadata pool.
> Advantages:
> - Impact of 1 defect NVMe is limited.
> - Fast storage for the metadata pool.
> Disadvantage:
> - RocksDB for each OSD is on the primary disk, resulting in slower
>   performance of each OSD.
>
> *Option b*: Hardware mirror of the NVMe drives
> Advantages:
> - Impact of 1 defect NVMe is limited
> - Fast KV lookup for each OSD
> Disadvantage:
> - I/O to NVMe is serialized for all OSDs on 1 host. Though the NVMe are
>   fast, I imagine that there still is an impact.
> - 1 TB of NVMe is not used / host
>
> *Option c*: Split the NVMe's across the OSDs
> Advantages:
> - Fast RocksDB access - up to L3 (assuming spillover does its job)
> Disadvantage:
> - 1 defect NVMe impacts max 7 OSDs (1 NVMe assigned to 7 or 6 OSD daemons
>   per host)
> - 2.7TB of NVMe space not used per host
>
> *Option d*: 1 NVMe disk for OSDs, 1 for RGW metadata pool
> Advantages:
> - Fast RocksDB access - up to L3
> - Fast RGW metadata pool (though limited to 5.3TB (raw pool size will be
>   16TB, divided by 3 due to replication)). I assume this already gives some
>   possibilities
> Disadvantages:
> - 1 defect NVMe might impact a complete host (all OSDs might be using it
>   for the RocksDB storage)
> - 1 TB of NVMe is not used
>
> Quite a menu to choose from, each with its possibilities... The initial idea
> was to assign 200GB of the NVMe space per OSD, but this would
> result in a lot of unused space. I don't know if there is anything on the
> roadmap to adapt the RocksDB sizing to make better use of the available
> NVMe disk space.
> With all the information, I would assume that the best option would be
> *option a*. Since we will be using erasure coding for the RGW data pool
> (k=6, m=3), the impact of a defect NVMe would be too significant. The other
> alternative would be option b, but then again we would be dealing with HW
> raid, which is against all Ceph design rules.
>
> Any other options or (dis)advantages I missed? Or any other opinions to
> choose another option?
>
> Regards,
>
> Kristof
>
> Op vr 15 nov. 2019 om 18:22 schreef :
> >
> > Use 30 GB for all OSDs. Other values are pointless, because
> > https://yourcmc.ru/wiki/Ceph_performance#About_block.db_sizing
> >
> > You can use the rest of free NVMe space for bcache - it's much better
> > than just allocating it for block.db.
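As a rough sketch of the SSD layout Lars describes above (device names, VG/LV names and the OSD id are assumptions, not taken from the thread):

```
# Two SSDs mirrored, then carved into per-OSD DB/WAL LVs plus one leftover LV
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda /dev/sdb
pvcreate /dev/md0
vgcreate host-1-db /dev/md0
lvcreate -L 58G -n db-1 host-1-db            # one ~58G DB/WAL LV per HDD OSD
ceph-volume lvm create --data /dev/sdc --block.db host-1-db/db-1

# Remaining space becomes a separate, faster OSD in its own device class
lvcreate -l 100%FREE -n ssd-osd host-1-db
ceph-volume lvm create --data host-1-db/ssd-osd
ceph osd crush rm-device-class osd.30        # osd id is an example
ceph osd crush set-device-class ssd osd.30
```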
Re: [ceph-users] ceph-ansible / block-db block-wal
I don't use ansible anymore. But this was my config for the host onode1:

./host_vars/onode2.yml:
lvm_volumes:
  - data: /dev/sdb
    db: '1'
    db_vg: host-2-db
  - data: /dev/sdc
    db: '2'
    db_vg: host-2-db
  - data: /dev/sde
    db: '3'
    db_vg: host-2-db
  - data: /dev/sdf
    db: '4'
    db_vg: host-2-db

… one config file per host.

The LVs were created by hand on a PV over RAID1 over two SSDs. The hosts had
empty slots for HDDs to be bought later, so I had to "partition" the PV by
hand, because ansible uses the whole RAID1 only for the HDDs that are present.
It is said that only certain sizes of DB & WAL partitions are sensible. I now
use 58GiB LVs. The remaining space in the RAID1 is used for a faster OSD.

Lars

Wed, 30 Oct 2019 10:02:23 + CUZA Frédéric ==> "ceph-users@lists.ceph.com" :
> Hi Everyone,
>
> Does anyone know how to indicate block-db and block-wal devices in ansible?
> In ceph-deploy it is quite easy:
> ceph-deploy osd create osd_host08 --data /dev/sdl --block-db /dev/sdm12
> --block-wal /dev/sdn12 --bluestore
>
> On my data nodes I have 12 HDDs and 2 SSDs. I use those SSDs for block-db
> and block-wal.
> How to indicate for each osd which partition to use?
>
> And finally, how do you handle the deployment if you have multiple data
> node setups?
> SSDs on sdm and sdn on one host and SSDs on sda and sdb on another?
>
> Thank you for your help.
>
> Regards,
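For reference, each lvm_volumes entry above should correspond to roughly the following manual ceph-volume call (the LV named '1' in VG host-2-db holds DB and WAL together, since no separate wal entry is given) - a sketch, not taken from the thread:

```
# Pre-create the DB/WAL LV on the RAID1-backed VG, then one OSD per HDD
lvcreate -L 58G -n 1 host-2-db
ceph-volume lvm create --data /dev/sdb --block.db host-2-db/1
```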
Re: [ceph-users] CephFS client hanging and cache issues
Hi.

Sounds like you use kernel clients with kernels from Canonical/Ubuntu.
Two kernels have a bug: 4.15.0-66 and 5.0.0-32.
Updated kernels are said to have fixes. Older kernels also work: 4.15.0-65 and
5.0.0-31.

Lars

Wed, 30 Oct 2019 09:42:16 + Bob Farrell ==> ceph-users :
> Hi. We are experiencing a CephFS client issue on one of our servers.
>
> ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc) nautilus
> (stable)
>
> Trying to access, `umount`, or `umount -f` a mounted CephFS volume causes
> my shell to hang indefinitely.
>
> After a reboot I can remount the volumes cleanly but they drop out after <
> 1 hour of use.
>
> I see this log entry multiple times when I reboot the server:
> ```
> cache_from_obj: Wrong slab cache. inode_cache but object is from
> ceph_inode_info
> ```
> The machine then reboots after approx. 30 minutes.
>
> All other Ceph/CephFS clients and servers seem perfectly happy. CephFS
> cluster is HEALTH_OK.
>
> Any help appreciated. If I can provide any further details please let me
> know.
>
> Thanks in advance,
Re: [ceph-users] cluster network down
Mon, 30 Sep 2019 15:21:18 +0200 Janne Johansson ==> Lars Täuber :
> > I don't remember where I read it, but it was told that the cluster is
> > migrating its complete traffic over to the public network when the cluster
> > network goes down. So this seems not to be the case?
>
> Be careful with generalizations like "when a network acts up, it will be
> completely down and noticeably unreachable for all parts", since networks
> can break in thousands of not-very-obvious ways which are not 0%-vs-100%
> but somewhere in between.

Ok. Then I'll ask my question in a new way:
What does ceph do when I switch off all switches of the cluster network?
Does ceph handle this silently without interruption?
Does the heartbeat system use the public network as a failover automatically?

Thanks
Lars
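Not an answer to the failover question itself, but a way to see what the OSDs actually registered on both networks, and whether a separate cluster network is configured at all. These are the commands as I understand them in Nautilus; output fields may differ between versions:

```
# Which networks are configured for a running OSD
ceph config show osd.0 | grep -E 'public_network|cluster_network'

# Each OSD registers public, cluster and heartbeat addresses in the OSD map
ceph osd dump | grep '^osd\.0 '
```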
Re: [ceph-users] cluster network down
Mon, 30 Sep 2019 14:49:48 +0200 Burkhard Linke ==> ceph-users@lists.ceph.com : > Hi, > > On 9/30/19 2:46 PM, Lars Täuber wrote: > > Hi! > > > > What happens when the cluster network goes down completely? > > Is the cluster silently using the public network without interruption, or > > does the admin has to act? > > The cluster network is used for OSD heartbeats and backfilling/recovery > traffic. If the heartbeats do not work anymore, the OSDs will start to > report the other OSDs as down, resulting in a completely confused cluster... > > > I would avoid an extra cluster network unless it is absolutely necessary. > > > Regards, > > Burkhard I don't remember where I read it, but it was told that the cluster is migrating its complete traffic over to the public network when the cluster networks goes down. So this seems not to be the case? Thanks Lars ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] cluster network down
Hi!

What happens when the cluster network goes down completely?
Is the cluster silently using the public network without interruption, or does
the admin have to act?

Thanks
Lars
Re: [ceph-users] BlueStore.cc: 11208: ceph_abort_msg("unexpected error")
Hi Paul,

a result of fgrep is attached. Can you do something with it? I can't read it.
Maybe this is the relevant part:

"bluestore(/var/lib/ceph/osd/first-16) _txc_add_transaction error (39)
Directory not empty not handled on operation 21 (op 1, counting from 0)"

Later I tried it again and the osd is working again. It feels like I hit a bug!?

Huge thanks for your help.

Cheers,
Lars

Fri, 23 Aug 2019 13:36:00 +0200 Paul Emmerich ==> Lars Täuber :
> Filter the log for "7f266bdc9700" which is the id of the crashed
> thread, it should contain more information on the transaction that
> caused the crash.
>
> Paul
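If an OSD aborts like this and then comes back, one way to check whether the underlying BlueStore/RocksDB state is still consistent is an offline fsck with ceph-bluestore-tool. The OSD has to be stopped first; the path is taken from the error message above, adjust ids and paths as needed:

```
systemctl stop ceph-osd@16
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/first-16
# 'repair' instead of 'fsck' attempts to fix what it finds
systemctl start ceph-osd@16
```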
[ceph-users] BlueStore.cc: 11208: ceph_abort_msg("unexpected error")
Hi there!

In our test cluster there is an osd that won't start anymore. Here is a short
part of the log:

-1> 2019-08-23 08:56:13.316 7f266bdc9700 -1 /tmp/release/Debian/WORKDIR/ceph-14.2.2/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)' thread 7f266bdc9700 time 2019-08-23 08:56:13.318938
/tmp/release/Debian/WORKDIR/ceph-14.2.2/src/os/bluestore/BlueStore.cc: 11208: ceph_abort_msg("unexpected error")

ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)
1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string, std::allocator > const&)+0xdf) [0x564406ac153a]
2: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0x2830) [0x5644070e48d0]
3: (BlueStore::queue_transactions(boost::intrusive_ptr&, std::vector >&, boost::intrusive_ptr, ThreadPool::TPHandle*)+0x42a) [0x5644070ec33a]
4: (ObjectStore::queue_transaction(boost::intrusive_ptr&, ObjectStore::Transaction&&, boost::intrusive_ptr, ThreadPool::TPHandle*)+0x7f) [0x564406cd620f]
5: (PG::_delete_some(ObjectStore::Transaction*)+0x945) [0x564406d32d85]
6: (PG::RecoveryState::Deleting::react(PG::DeleteSome const&)+0x71) [0x564406d337d1]
7: (boost::statechart::simple_state, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x109) [0x564406d81ec9]
8: (boost::statechart::state_machine, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x6b) [0x564406d4e7cb]
9: (PG::do_peering_event(std::shared_ptr, PG::RecoveryCtx*)+0x2af) [0x564406d3f39f]
10: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr, ThreadPool::TPHandle&)+0x1b4) [0x564406c7e644]
11: (OSD::dequeue_delete(OSDShard*, PG*, unsigned int, ThreadPool::TPHandle&)+0xc4) [0x564406c7e8c4]
12: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x7d7) [0x564406c72667]
13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b4) [0x56440724f7d4]
14: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x5644072521d0]
15: (()+0x7fa3) [0x7f26862f6fa3]
16: (clone()+0x3f) [0x7f2685ea64cf]

The log is so huge that I don't know which part may be of interest. The quote
above is the part I think is most useful.
Is there anybody able to read and explain this?

Thanks in advance,
Lars
Re: [ceph-users] ceph status: pg backfill_toofull, but all OSDs have enough space
Hi there!

We also experience this behaviour of our cluster while it is moving pgs.

# ceph health detail
HEALTH_ERR 1 MDSs report slow metadata IOs; Reduced data availability: 2 pgs inactive; Degraded data redundancy (low space): 1 pg backfill_toofull
MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs
    mdsmds1(mds.0): 1 slow metadata IOs are blocked > 30 secs, oldest blocked for 359 secs
PG_AVAILABILITY Reduced data availability: 2 pgs inactive
    pg 21.231 is stuck inactive for 878.224182, current state remapped, last acting [20,2147483647,13,2147483647,15,10]
    pg 21.240 is stuck inactive for 878.123932, current state remapped, last acting [26,17,21,20,2147483647,2147483647]
PG_DEGRADED_FULL Degraded data redundancy (low space): 1 pg backfill_toofull
    pg 21.376 is active+remapped+backfill_wait+backfill_toofull, acting [6,11,29,2,10,15]

# ceph pg map 21.376
osdmap e68016 pg 21.376 (21.376) -> up [6,5,23,21,10,11] acting [6,11,29,2,10,15]

# ceph osd dump | fgrep ratio
full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.85

This happens while the cluster is rebalancing the pgs after I manually mark a
single osd out. See here:
Subject: [ceph-users] pg 21.1f9 is stuck inactive for 53316.902820, current state remapped
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-August/036634.html

Mostly the cluster heals itself at least into state HEALTH_WARN:

# ceph health detail
HEALTH_WARN 1 MDSs report slow metadata IOs; Reduced data availability: 2 pgs inactive
MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs
    mdsmds1(mds.0): 1 slow metadata IOs are blocked > 30 secs, oldest blocked for 1155 secs
PG_AVAILABILITY Reduced data availability: 2 pgs inactive
    pg 21.231 is stuck inactive for 1677.312219, current state remapped, last acting [20,2147483647,13,2147483647,15,10]
    pg 21.240 is stuck inactive for 1677.211969, current state remapped, last acting [26,17,21,20,2147483647,2147483647]

Cheers,
Lars

Wed, 21 Aug 2019 17:28:05 -0500 Reed Dier ==> Vladimir Brik :
> Just chiming in to say that I too had some issues with backfill_toofull PGs,
> despite no OSDs being in a backfillfull state, albeit there were some
> nearfull OSDs.
>
> I was able to get through it by reweighting down the OSD that was the target
> reported by ceph pg dump | grep 'backfill_toofull'.
>
> This was on 14.2.2.
>
> Reed
>
> > On Aug 21, 2019, at 2:50 PM, Vladimir Brik wrote:
> >
> > Hello
> >
> > After increasing the number of PGs in a pool, ceph status is reporting
> > "Degraded data redundancy (low space): 1 pg backfill_toofull", but I don't
> > understand why, because all OSDs seem to have enough space.
> >
> > ceph health detail says:
> > pg 40.155 is active+remapped+backfill_toofull, acting [20,57,79,85]
> >
> > $ ceph pg map 40.155
> > osdmap e3952 pg 40.155 (40.155) -> up [20,57,66,85] acting [20,57,79,85]
> >
> > So I guess Ceph wants to move 40.155 from 66 to 79 (or the other way
> > around?). According to "osd df", OSD 66's utilization is 71.90%, OSD 79's
> > utilization is 58.45%. The OSD with the least free space in the cluster is
> > 81.23% full, and it's not any of the ones above.
> >
> > OSD backfillfull_ratio is 90% (is there a better way to determine this?):
> > $ ceph osd dump | grep ratio
> > full_ratio 0.95
> > backfillfull_ratio 0.9
> > nearfull_ratio 0.7
> >
> > Does anybody know why a PG could be in the backfill_toofull state if no OSD
> > is in the backfillfull state?
> >
> > Vlad
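A sketch of the workaround Reed describes (pg and OSD ids here are only examples; the reweight value is a guess to nudge data off the reported target OSD and can be set back to 1.0 afterwards):

```
# Find the PGs stuck in backfill_toofull and their acting/up sets
ceph pg dump pgs | grep backfill_toofull

# Temporarily reduce the reweight of the target OSD, then restore it later
ceph osd reweight 79 0.95
# ... wait for backfill to proceed ...
ceph osd reweight 79 1.0
```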
Re: [ceph-users] pg 21.1f9 is stuck inactive for 53316.902820, current state remapped
All osds are up. I manually marked one out of 30 "out", not "down". The primary
osds of the stuck pgs are neither marked out nor down.

Thanks
Lars

Thu, 22 Aug 2019 15:01:12 +0700 wahyu.muqs...@gmail.com ==> wahyu.muqs...@gmail.com, Lars Täuber :
> I think you use too few osds. When you use erasure code, the probability of
> a primary pg being on the down osd will increase
> On 22 Aug 2019 14.51 +0700, Lars Täuber , wrote:
> > There are 30 osds.
> >
> > Thu, 22 Aug 2019 14:38:10 +0700
> > wahyu.muqs...@gmail.com ==> ceph-users@lists.ceph.com, Lars Täuber :
> > > how many osd do you use ?
Re: [ceph-users] pg 21.1f9 is stuck inactive for 53316.902820, current state remapped
There are 30 osds.

Thu, 22 Aug 2019 14:38:10 +0700 wahyu.muqs...@gmail.com ==> ceph-users@lists.ceph.com, Lars Täuber :
> how many osd do you use ?
[ceph-users] pg 21.1f9 is stuck inactive for 53316.902820, current state remapped
Hi all,

we are using ceph version 14.2.2 from https://mirror.croit.io/debian-nautilus/
on debian buster and are experiencing problems with cephfs. The mounted file
system produces hanging processes due to pgs stuck inactive.
This often happens after I mark single osds out manually.

A typical result is this:

HEALTH_WARN 1 MDSs report slow metadata IOs; 1 MDSs behind on trimming; Reduced data availability: 4 pgs inactive
MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs
    mdsmds1(mds.0): 4 slow metadata IOs are blocked > 30 secs, oldest blocked for 51206 secs
MDS_TRIM 1 MDSs behind on trimming
    mdsmds1(mds.0): Behind on trimming (4298/128) max_segments: 128, num_segments: 4298
PG_AVAILABILITY Reduced data availability: 4 pgs inactive
    pg 21.1f9 is stuck inactive for 52858.655306, current state remapped, last acting [8,2147483647,2147483647,26,27,11]
    pg 21.22f is stuck inactive for 52858.636207, current state remapped, last acting [27,26,4,2147483647,15,2147483647]
    pg 21.2b5 is stuck inactive for 52865.857165, current state remapped, last acting [6,2147483647,21,27,11,2147483647]
    pg 21.3ed is stuck inactive for 52865.852710, current state remapped, last acting [26,18,14,20,2147483647,2147483647]

The placement groups are from an erasure coded pool:

# ceph osd erasure-code-profile get CLAYje4_2_5
crush-device-class=
crush-failure-domain=host
crush-root=default
d=5
k=4
m=2
plugin=clay

It helps to restart the primary osd of the stuck pgs to get them alive again.

This problem keeps us from using this cluster as a productive system. I'm
still a beginner with ceph and this cluster is still in its testing phase.
What am I doing wrong? Is this problem a symptom of using the clay erasure
code?

Thanks
Lars
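A sketch of the inspection and workaround steps that correspond to what is described above (the pg id and OSD id come from the health output and are only examples; restarting the primary is the workaround mentioned, not a root-cause fix):

```
# See why the PG is stuck (look at "recovery_state" / "blocked_by")
ceph pg 21.1f9 query

# The first entry of the acting set is the primary (here: osd.8)
ceph pg map 21.1f9

# Restart that OSD on its host to kick the PG out of its stuck state
systemctl restart ceph-osd@8
```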
Re: [ceph-users] SOLVED - MDSs report damaged metadata
Hi all! I solved this situation with restarting the active mds. So the next mds took over and the error was gone. This is somehow a strange situation. Similar to the situation when restarting primary osds when having scrub errors of pgs. Maybe this should be researched a bit deeper. Thanks all for this great storage solution! Cheers, Lars Tue, 20 Aug 2019 07:30:11 +0200 Lars Täuber ==> ceph-users@lists.ceph.com : > Hi there! > > Does anyone else have an idea what I could do to get rid of this error? > > BTW: it is the third time that the pg 20.0 is gone inconsistent. > This is a pg from the metadata pool (cephfs). > May this be related anyhow? > > # ceph health detail > HEALTH_ERR 1 MDSs report damaged metadata; 1 scrub errors; Possible data > damage: 1 pg inconsistent > MDS_DAMAGE 1 MDSs report damaged metadata > mdsmds3(mds.0): Metadata damage detected > OSD_SCRUB_ERRORS 1 scrub errors > PG_DAMAGED Possible data damage: 1 pg inconsistent > pg 20.0 is active+clean+inconsistent, acting [9,27,15] > > > Best regards, > Lars > > > Mon, 19 Aug 2019 13:51:59 +0200 > Lars Täuber ==> Paul Emmerich : > > Hi Paul, > > > > thanks for the hint. > > > > I did a recursive scrub from "/". The log says there where some inodes with > > bad backtraces repaired. But the error remains. > > May this have something to do with a deleted file? Or a file within a > > snapshot? > > > > The path told by > > > > # ceph tell mds.mds3 damage ls > > 2019-08-19 13:43:04.608 7f563f7f6700 0 client.894552 ms_handle_reset on > > v2:192.168.16.23:6800/176704036 > > 2019-08-19 13:43:04.624 7f56407f8700 0 client.894558 ms_handle_reset on > > v2:192.168.16.23:6800/176704036 > > [ > > { > > "damage_type": "backtrace", > > "id": 3760765989, > > "ino": 1099518115802, > > "path": "~mds0/stray7/15161f7/dovecot.index.backup" > > } > > ] > > > > starts a bit strange to me. > > > > Are the snapshots also repaired with a recursive repair operation? > > > > Thanks > > Lars > > > > > > Mon, 19 Aug 2019 13:30:53 +0200 > > Paul Emmerich ==> Lars Täuber : > > > Hi, > > > > > > that error just says that the path is wrong. I unfortunately don't > > > know the correct way to instruct it to scrub a stray path off the top > > > of my head; you can always run a recursive scrub on / to go over > > > everything, though > > > > > > > > > Paul > > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Informationstechnologie Berlin-Brandenburgische Akademie der Wissenschaften Jägerstraße 22-23 10117 Berlin Tel.: +49 30 20370-352 http://www.bbaw.de ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
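The failover Lars describes can also be triggered explicitly instead of restarting the daemon on its host - a sketch, assuming standby MDS daemons are available (the mds name is taken from the thread):

```
ceph fs status                   # shows which MDS is active and which are standby
ceph mds fail mds3               # mark the active MDS failed; a standby takes over
# alternatively, on the MDS host:
systemctl restart ceph-mds@mds3
```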
Re: [ceph-users] MDSs report damaged metadata
Hi there! Does anyone else have an idea what I could do to get rid of this error? BTW: it is the third time that the pg 20.0 is gone inconsistent. This is a pg from the metadata pool (cephfs). May this be related anyhow? # ceph health detail HEALTH_ERR 1 MDSs report damaged metadata; 1 scrub errors; Possible data damage: 1 pg inconsistent MDS_DAMAGE 1 MDSs report damaged metadata mdsmds3(mds.0): Metadata damage detected OSD_SCRUB_ERRORS 1 scrub errors PG_DAMAGED Possible data damage: 1 pg inconsistent pg 20.0 is active+clean+inconsistent, acting [9,27,15] Best regards, Lars Mon, 19 Aug 2019 13:51:59 +0200 Lars Täuber ==> Paul Emmerich : > Hi Paul, > > thanks for the hint. > > I did a recursive scrub from "/". The log says there where some inodes with > bad backtraces repaired. But the error remains. > May this have something to do with a deleted file? Or a file within a > snapshot? > > The path told by > > # ceph tell mds.mds3 damage ls > 2019-08-19 13:43:04.608 7f563f7f6700 0 client.894552 ms_handle_reset on > v2:192.168.16.23:6800/176704036 > 2019-08-19 13:43:04.624 7f56407f8700 0 client.894558 ms_handle_reset on > v2:192.168.16.23:6800/176704036 > [ > { > "damage_type": "backtrace", > "id": 3760765989, > "ino": 1099518115802, > "path": "~mds0/stray7/15161f7/dovecot.index.backup" > } > ] > > starts a bit strange to me. > > Are the snapshots also repaired with a recursive repair operation? > > Thanks > Lars > > > Mon, 19 Aug 2019 13:30:53 +0200 > Paul Emmerich ==> Lars Täuber : > > Hi, > > > > that error just says that the path is wrong. I unfortunately don't > > know the correct way to instruct it to scrub a stray path off the top > > of my head; you can always run a recursive scrub on / to go over > > everything, though > > > > > > Paul > > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] MDSs report damaged metadata - "return_code": -116
Hi Paul, thanks for the hint. I did a recursive scrub from "/". The log says there where some inodes with bad backtraces repaired. But the error remains. May this have something to do with a deleted file? Or a file within a snapshot? The path told by # ceph tell mds.mds3 damage ls 2019-08-19 13:43:04.608 7f563f7f6700 0 client.894552 ms_handle_reset on v2:192.168.16.23:6800/176704036 2019-08-19 13:43:04.624 7f56407f8700 0 client.894558 ms_handle_reset on v2:192.168.16.23:6800/176704036 [ { "damage_type": "backtrace", "id": 3760765989, "ino": 1099518115802, "path": "~mds0/stray7/15161f7/dovecot.index.backup" } ] starts a bit strange to me. Are the snapshots also repaired with a recursive repair operation? Thanks Lars Mon, 19 Aug 2019 13:30:53 +0200 Paul Emmerich ==> Lars Täuber : > Hi, > > that error just says that the path is wrong. I unfortunately don't > know the correct way to instruct it to scrub a stray path off the top > of my head; you can always run a recursive scrub on / to go over > everything, though > > > Paul > -- Informationstechnologie Berlin-Brandenburgische Akademie der Wissenschaften Jägerstraße 22-23 10117 Berlin Tel.: +49 30 20370-352 http://www.bbaw.de ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] MDSs report damaged metadata - "return_code": -116
Hi all! Where can I look up what the error number means? Or did I something wrong in my command line? Thanks in advance, Lars Fri, 16 Aug 2019 13:31:38 +0200 Lars Täuber ==> Paul Emmerich : > Hi Paul, > > thank you for your help. But I get the following error: > > # ceph tell mds.mds3 scrub start > "~mds0/stray7/15161f7/dovecot.index.backup" repair > 2019-08-16 13:29:40.208 7f7e927fc700 0 client.881878 ms_handle_reset on > v2:192.168.16.23:6800/176704036 > 2019-08-16 13:29:40.240 7f7e937fe700 0 client.867786 ms_handle_reset on > v2:192.168.16.23:6800/176704036 > { > "return_code": -116 > } > > > > Lars > > > Fri, 16 Aug 2019 13:17:08 +0200 > Paul Emmerich ==> Lars Täuber : > > Hi, > > > > damage_type backtrace is rather harmless and can indeed be repaired > > with the repair command, but it's called scrub_path. > > Also you need to pass the name and not the rank of the MDS as id, it should > > be > > > > # (on the server where the MDS is actually running) > > ceph daemon mds.mds3 scrub_path ... > > > > But you should also be able to use ceph tell since nautilus which is a > > little bit easier because it can be run from any node: > > > > ceph tell mds.mds3 scrub start 'PATH' repair > > > > > > Paul > > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
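To Lars' question about looking up the error number: ceph returns negative errno values, so -116 is errno 116. On Linux that can be looked up in the kernel headers, for example:

```
grep -w 116 /usr/include/asm-generic/errno.h
# -> #define ESTALE 116 /* Stale file handle */
```

ESTALE ("stale file handle") here presumably means the given path could not be resolved by the MDS, which fits Paul's later remark in this thread that the error just says the path is wrong.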
Re: [ceph-users] MDSs report damaged metadata
Hi Paul, thank you for your help. But I get the following error: # ceph tell mds.mds3 scrub start "~mds0/stray7/15161f7/dovecot.index.backup" repair 2019-08-16 13:29:40.208 7f7e927fc700 0 client.881878 ms_handle_reset on v2:192.168.16.23:6800/176704036 2019-08-16 13:29:40.240 7f7e937fe700 0 client.867786 ms_handle_reset on v2:192.168.16.23:6800/176704036 { "return_code": -116 } Lars Fri, 16 Aug 2019 13:17:08 +0200 Paul Emmerich ==> Lars Täuber : > Hi, > > damage_type backtrace is rather harmless and can indeed be repaired > with the repair command, but it's called scrub_path. > Also you need to pass the name and not the rank of the MDS as id, it should be > > # (on the server where the MDS is actually running) > ceph daemon mds.mds3 scrub_path ... > > But you should also be able to use ceph tell since nautilus which is a > little bit easier because it can be run from any node: > > ceph tell mds.mds3 scrub start 'PATH' repair > > > Paul > -- Informationstechnologie Berlin-Brandenburgische Akademie der Wissenschaften Jägerstraße 22-23 10117 Berlin Tel.: +49 30 20370-352 http://www.bbaw.de ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] MDSs report damaged metadata
Hi all!

The mds of our ceph cluster produces a health_err state. It is a nautilus
14.2.2 on debian buster installed from the repo made by croit.io, with osds on
bluestore.

The symptom:

# ceph -s
  cluster:
    health: HEALTH_ERR
            1 MDSs report damaged metadata

  services:
    mon: 3 daemons, quorum mon1,mon2,mon3 (age 2d)
    mgr: mon3(active, since 2d), standbys: mon2, mon1
    mds: cephfs_1:1 {0=mds3=up:active} 2 up:standby
    osd: 30 osds: 30 up (since 17h), 29 in (since 19h)

  data:
    pools:   3 pools, 1153 pgs
    objects: 435.21k objects, 806 GiB
    usage:   4.7 TiB used, 162 TiB / 167 TiB avail
    pgs:     1153 active+clean

# ceph health detail
HEALTH_ERR 1 MDSs report damaged metadata
MDS_DAMAGE 1 MDSs report damaged metadata
    mdsmds3(mds.0): Metadata damage detected

# ceph tell mds.0 damage ls
2019-08-16 07:20:09.415 7f1254ff9700 0 client.840758 ms_handle_reset on v2:192.168.16.23:6800/176704036
2019-08-16 07:20:09.431 7f1255ffb700 0 client.840764 ms_handle_reset on v2:192.168.16.23:6800/176704036
[
    {
        "damage_type": "backtrace",
        "id": 3760765989,
        "ino": 1099518115802,
        "path": "~mds0/stray7/15161f7/dovecot.index.backup"
    }
]

I tried this without much luck:

# ceph daemon mds.0 "~mds0/stray7/15161f7/dovecot.index.backup" recursive repair
admin_socket: exception getting command descriptions: [Errno 2] No such file or directory

Is there a way out of this error?

Thanks and best regards,
Lars
Re: [ceph-users] writable snapshots in cephfs? GDPR/DSGVO
Thu, 11 Jul 2019 10:24:16 +0200 "Marc Roos" ==> ceph-users , lmb :
> What about creating snaps on a 'lower level' in the directory structure
> so you do not need to remove files from a snapshot as a work around?

Thanks for the idea. This might be a solution for our use case.

Regards,
Lars
Re: [ceph-users] writable snapshots in cephfs? GDPR/DSGVO
Thu, 11 Jul 2019 10:21:16 +0200 Lars Marowsky-Bree ==> ceph-users@lists.ceph.com :
> On 2019-07-10T09:59:08, Lars Täuber wrote:
>
> > Hi everybody!
> >
> > Is it possible to make snapshots in cephfs writable?
> > We need to remove files because of this General Data Protection Regulation
> > also from snapshots.
>
> Removing data from existing WORM storage is tricky, snapshots being a
> specific form thereof.

We would like it to be non-WORM storage. It is not meant to be used as an
archive.

Thanks,
Lars
[ceph-users] writable snapshots in cephfs? GDPR/DSGVO
Hi everybody!

Is it possible to make snapshots in cephfs writable?
We need to remove files also from snapshots because of the General Data
Protection Regulation.

Thanks and best regards,
Lars
Re: [ceph-users] What does the differences in osd benchmarks mean?
Hi Nathan, yes the osd hosts are dual-socket machines. But does this make such difference? osd.0: bench: wrote 1 GiB in blocks of 4 MiB in 15.0133 sec at 68 MiB/sec 17 IOPS osd.1: bench: wrote 1 GiB in blocks of 4 MiB in 6.98357 sec at 147 MiB/sec 36 IOPS Doubling the IOPS? Thanks, Lars Thu, 27 Jun 2019 11:16:31 -0400 Nathan Fish ==> Ceph Users : > Are these dual-socket machines? Perhaps NUMA is involved? > > On Thu., Jun. 27, 2019, 4:56 a.m. Lars Täuber, wrote: > > > Hi! > > > > In our cluster I ran some benchmarks. > > The results are always similar but strange to me. > > I don't know what the results mean. > > The cluster consists of 7 (nearly) identical hosts for osds. Two of them > > have one an additional hdd. > > The hdds are from identical type. The ssds for the journal and wal are of > > identical type. The configuration is identical (ssd-db-lv-size) for each > > osd. > > The hosts are connected the same way to the same switches. > > This nautilus cluster was set up with ceph-ansible 4.0 on debian buster. > > > > This are the results of > > # ceph --format plain tell osd.* bench > > > > osd.0: bench: wrote 1 GiB in blocks of 4 MiB in 15.0133 sec at 68 MiB/sec > > 17 IOPS > > osd.1: bench: wrote 1 GiB in blocks of 4 MiB in 6.98357 sec at 147 MiB/sec > > 36 IOPS > > osd.2: bench: wrote 1 GiB in blocks of 4 MiB in 6.80336 sec at 151 MiB/sec > > 37 IOPS > > osd.3: bench: wrote 1 GiB in blocks of 4 MiB in 12.0813 sec at 85 MiB/sec > > 21 IOPS > > osd.4: bench: wrote 1 GiB in blocks of 4 MiB in 8.51311 sec at 120 MiB/sec > > 30 IOPS > > osd.5: bench: wrote 1 GiB in blocks of 4 MiB in 6.61376 sec at 155 MiB/sec > > 38 IOPS > > osd.6: bench: wrote 1 GiB in blocks of 4 MiB in 14.7478 sec at 69 MiB/sec > > 17 IOPS > > osd.7: bench: wrote 1 GiB in blocks of 4 MiB in 12.9266 sec at 79 MiB/sec > > 19 IOPS > > osd.8: bench: wrote 1 GiB in blocks of 4 MiB in 15.2513 sec at 67 MiB/sec > > 16 IOPS > > osd.9: bench: wrote 1 GiB in blocks of 4 MiB in 9.26225 sec at 111 MiB/sec > > 27 IOPS > > osd.10: bench: wrote 1 GiB in blocks of 4 MiB in 13.6641 sec at 75 MiB/sec > > 18 IOPS > > osd.11: bench: wrote 1 GiB in blocks of 4 MiB in 13.8943 sec at 74 MiB/sec > > 18 IOPS > > osd.12: bench: wrote 1 GiB in blocks of 4 MiB in 13.235 sec at 77 MiB/sec > > 19 IOPS > > osd.13: bench: wrote 1 GiB in blocks of 4 MiB in 10.4559 sec at 98 MiB/sec > > 24 IOPS > > osd.14: bench: wrote 1 GiB in blocks of 4 MiB in 12.469 sec at 82 MiB/sec > > 20 IOPS > > osd.15: bench: wrote 1 GiB in blocks of 4 MiB in 17.434 sec at 59 MiB/sec > > 14 IOPS > > osd.16: bench: wrote 1 GiB in blocks of 4 MiB in 11.7184 sec at 87 MiB/sec > > 21 IOPS > > osd.17: bench: wrote 1 GiB in blocks of 4 MiB in 12.8702 sec at 80 MiB/sec > > 19 IOPS > > osd.18: bench: wrote 1 GiB in blocks of 4 MiB in 20.1894 sec at 51 MiB/sec > > 12 IOPS > > osd.19: bench: wrote 1 GiB in blocks of 4 MiB in 9.60049 sec at 107 > > MiB/sec 26 IOPS > > osd.20: bench: wrote 1 GiB in blocks of 4 MiB in 15.0613 sec at 68 MiB/sec > > 16 IOPS > > osd.21: bench: wrote 1 GiB in blocks of 4 MiB in 17.6074 sec at 58 MiB/sec > > 14 IOPS > > osd.22: bench: wrote 1 GiB in blocks of 4 MiB in 16.39 sec at 62 MiB/sec > > 15 IOPS > > osd.23: bench: wrote 1 GiB in blocks of 4 MiB in 15.2747 sec at 67 MiB/sec > > 16 IOPS > > osd.24: bench: wrote 1 GiB in blocks of 4 MiB in 10.2462 sec at 100 > > MiB/sec 24 IOPS > > osd.25: bench: wrote 1 GiB in blocks of 4 MiB in 13.5297 sec at 76 MiB/sec > > 18 IOPS > > osd.26: bench: wrote 1 GiB in blocks of 4 MiB in 7.46824 sec at 137 > > MiB/sec 34 
IOPS > > osd.27: bench: wrote 1 GiB in blocks of 4 MiB in 11.2216 sec at 91 MiB/sec > > 22 IOPS > > osd.28: bench: wrote 1 GiB in blocks of 4 MiB in 16.6205 sec at 62 MiB/sec > > 15 IOPS > > osd.29: bench: wrote 1 GiB in blocks of 4 MiB in 10.1477 sec at 101 > > MiB/sec 25 IOPS > > > > > > The different runs differ by ±1 IOPS. > > Why are the osds 1,2,4,5,9,19,26 faster than the others? > > > > Restarting an osd did change the result. > > > > Could someone give me hint where to look further to find the reason? > > > > Thanks > > Lars > > ___ > > ceph-users mailing list > > ceph-users@lists.ceph.com > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > -- Informationstechnologie Berlin-Brandenburgische Akademie der Wissenschaften Jägerstraße 22-23 10117 Berlin Tel.: +49 30 20370-352 http://www.bbaw.de ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
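Regarding Nathan's NUMA question: on dual-socket OSD hosts it can be worth checking whether the NICs and the DB/WAL devices sit on the same NUMA node as the OSD processes that turn out slow. A hedged sketch - device names are examples and require the numactl package:

```
lscpu | grep -i numa                           # how many NUMA nodes the host has
cat /sys/class/net/eth0/device/numa_node       # which node the NIC is attached to
cat /sys/class/nvme/nvme0/device/numa_node     # which node an NVMe DB/WAL device is on
numastat -p "$(pidof -s ceph-osd)"             # memory locality of one ceph-osd process
```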
[ceph-users] What does the differences in osd benchmarks mean?
Hi!

In our cluster I ran some benchmarks. The results are always similar but
strange to me. I don't know what the results mean.

The cluster consists of 7 (nearly) identical hosts for osds. Two of them have
an additional hdd. The hdds are of identical type. The ssds for the journal
and wal are of identical type. The configuration is identical
(ssd-db-lv-size) for each osd. The hosts are connected the same way to the
same switches. This nautilus cluster was set up with ceph-ansible 4.0 on
debian buster.

These are the results of
# ceph --format plain tell osd.* bench

osd.0: bench: wrote 1 GiB in blocks of 4 MiB in 15.0133 sec at 68 MiB/sec 17 IOPS
osd.1: bench: wrote 1 GiB in blocks of 4 MiB in 6.98357 sec at 147 MiB/sec 36 IOPS
osd.2: bench: wrote 1 GiB in blocks of 4 MiB in 6.80336 sec at 151 MiB/sec 37 IOPS
osd.3: bench: wrote 1 GiB in blocks of 4 MiB in 12.0813 sec at 85 MiB/sec 21 IOPS
osd.4: bench: wrote 1 GiB in blocks of 4 MiB in 8.51311 sec at 120 MiB/sec 30 IOPS
osd.5: bench: wrote 1 GiB in blocks of 4 MiB in 6.61376 sec at 155 MiB/sec 38 IOPS
osd.6: bench: wrote 1 GiB in blocks of 4 MiB in 14.7478 sec at 69 MiB/sec 17 IOPS
osd.7: bench: wrote 1 GiB in blocks of 4 MiB in 12.9266 sec at 79 MiB/sec 19 IOPS
osd.8: bench: wrote 1 GiB in blocks of 4 MiB in 15.2513 sec at 67 MiB/sec 16 IOPS
osd.9: bench: wrote 1 GiB in blocks of 4 MiB in 9.26225 sec at 111 MiB/sec 27 IOPS
osd.10: bench: wrote 1 GiB in blocks of 4 MiB in 13.6641 sec at 75 MiB/sec 18 IOPS
osd.11: bench: wrote 1 GiB in blocks of 4 MiB in 13.8943 sec at 74 MiB/sec 18 IOPS
osd.12: bench: wrote 1 GiB in blocks of 4 MiB in 13.235 sec at 77 MiB/sec 19 IOPS
osd.13: bench: wrote 1 GiB in blocks of 4 MiB in 10.4559 sec at 98 MiB/sec 24 IOPS
osd.14: bench: wrote 1 GiB in blocks of 4 MiB in 12.469 sec at 82 MiB/sec 20 IOPS
osd.15: bench: wrote 1 GiB in blocks of 4 MiB in 17.434 sec at 59 MiB/sec 14 IOPS
osd.16: bench: wrote 1 GiB in blocks of 4 MiB in 11.7184 sec at 87 MiB/sec 21 IOPS
osd.17: bench: wrote 1 GiB in blocks of 4 MiB in 12.8702 sec at 80 MiB/sec 19 IOPS
osd.18: bench: wrote 1 GiB in blocks of 4 MiB in 20.1894 sec at 51 MiB/sec 12 IOPS
osd.19: bench: wrote 1 GiB in blocks of 4 MiB in 9.60049 sec at 107 MiB/sec 26 IOPS
osd.20: bench: wrote 1 GiB in blocks of 4 MiB in 15.0613 sec at 68 MiB/sec 16 IOPS
osd.21: bench: wrote 1 GiB in blocks of 4 MiB in 17.6074 sec at 58 MiB/sec 14 IOPS
osd.22: bench: wrote 1 GiB in blocks of 4 MiB in 16.39 sec at 62 MiB/sec 15 IOPS
osd.23: bench: wrote 1 GiB in blocks of 4 MiB in 15.2747 sec at 67 MiB/sec 16 IOPS
osd.24: bench: wrote 1 GiB in blocks of 4 MiB in 10.2462 sec at 100 MiB/sec 24 IOPS
osd.25: bench: wrote 1 GiB in blocks of 4 MiB in 13.5297 sec at 76 MiB/sec 18 IOPS
osd.26: bench: wrote 1 GiB in blocks of 4 MiB in 7.46824 sec at 137 MiB/sec 34 IOPS
osd.27: bench: wrote 1 GiB in blocks of 4 MiB in 11.2216 sec at 91 MiB/sec 22 IOPS
osd.28: bench: wrote 1 GiB in blocks of 4 MiB in 16.6205 sec at 62 MiB/sec 15 IOPS
osd.29: bench: wrote 1 GiB in blocks of 4 MiB in 10.1477 sec at 101 MiB/sec 25 IOPS

The different runs differ by ±1 IOPS.
Why are the osds 1, 2, 4, 5, 9, 19 and 26 faster than the others?

Restarting an osd did change the result.

Could someone give me a hint where to look further to find the reason?

Thanks
Lars
Re: [ceph-users] Reduced data availability: 2 pgs inactive
Hi Paul,

thanks for the hint. Restarting the primary osds of the inactive pgs resolved
the problem.

Before restarting them they said:
2019-06-19 15:55:36.190 7fcd55c4e700 -1 osd.5 33858 get_health_metrics reporting 15 slow ops, oldest is osd_op(client.220116.0:967410 21.2e4s0 21.d4e19ae4 (undecoded) ondisk+write+known_if_redirected e31569)
and
2019-06-19 15:53:31.214 7f9b946d1700 -1 osd.13 33849 get_health_metrics reporting 14560 slow ops, oldest is osd_op(mds.0.44294:99584053 23.5 23.cad28605 (undecoded) ondisk+write+known_if_redirected+full_force e31562)

Is this something to worry about?

Regards,
Lars

Wed, 19 Jun 2019 15:04:06 +0200 Paul Emmerich ==> Lars Täuber :
> That shouldn't trigger the PG limit (yet), but increasing "mon max pg per
> osd" from the default of 200 is a good idea anyways since you are running
> with more than 200 PGs per OSD.
>
> I'd try to restart all OSDs that are in the UP set for that PG:
> 13, 21, 23, 7, 29, 9, 28, 11, 8
>
> Maybe that solves it (technically it shouldn't), if that doesn't work
> you'll have to dig in deeper into the log files to see where exactly and
> why it is stuck activating.
>
> Paul
Re: [ceph-users] Reduced data availability: 2 pgs inactive
Hi Paul, thanks for your reply. Wed, 19 Jun 2019 13:19:55 +0200 Paul Emmerich ==> Lars Täuber : > Wild guess: you hit the PG hard limit, how many PGs per OSD do you have? > If this is the case: increase "osd max pg per osd hard ratio" > > Check "ceph pg query" to see why it isn't activating. > > Can you share the output of "ceph osd df tree" and "ceph pg query" > of the affected PGs? The pg queries are attached. I can't read them - to much information. Here is the osd df tree: # osd df tree ID CLASS WEIGHTREWEIGHT SIZERAW USE DATAOMAPMETA AVAIL %USE VAR PGS STATUS TYPE NAME -1 167.15057- 167 TiB 4.7 TiB 1.2 TiB 952 MiB 57 GiB 162 TiB 2.79 1.00 -root PRZ -1772.43192- 72 TiB 2.0 TiB 535 GiB 393 MiB 25 GiB 70 TiB 2.78 1.00 -rack 1-eins -922.28674- 22 TiB 640 GiB 170 GiB 82 MiB 9.0 GiB 22 TiB 2.80 1.01 -host onode1 2 hdd 5.57169 1.0 5.6 TiB 162 GiB 45 GiB 11 MiB 2.3 GiB 5.4 TiB 2.84 1.02 224 up osd.2 9 hdd 5.57169 1.0 5.6 TiB 156 GiB 39 GiB 19 MiB 2.1 GiB 5.4 TiB 2.74 0.98 201 up osd.9 14 hdd 5.57169 1.0 5.6 TiB 162 GiB 44 GiB 24 MiB 2.1 GiB 5.4 TiB 2.84 1.02 230 up osd.14 21 hdd 5.57169 1.0 5.6 TiB 160 GiB 42 GiB 27 MiB 2.5 GiB 5.4 TiB 2.80 1.00 219 up osd.21 -1322.28674- 22 TiB 640 GiB 170 GiB 123 MiB 8.9 GiB 22 TiB 2.80 1.00 -host onode4 4 hdd 5.57169 1.0 5.6 TiB 156 GiB 39 GiB 38 MiB 2.2 GiB 5.4 TiB 2.73 0.98 205 up osd.4 11 hdd 5.57169 1.0 5.6 TiB 164 GiB 47 GiB 24 MiB 2.0 GiB 5.4 TiB 2.87 1.03 241 up osd.11 18 hdd 5.57169 1.0 5.6 TiB 159 GiB 42 GiB 31 MiB 2.5 GiB 5.4 TiB 2.79 1.00 221 up osd.18 22 hdd 5.57169 1.0 5.6 TiB 160 GiB 43 GiB 29 MiB 2.1 GiB 5.4 TiB 2.81 1.01 225 up osd.22 -527.85843- 28 TiB 782 GiB 195 GiB 188 MiB 6.9 GiB 27 TiB 2.74 0.98 -host onode7 5 hdd 5.57169 1.0 5.6 TiB 158 GiB 41 GiB 26 MiB 1.2 GiB 5.4 TiB 2.77 0.99 213 up osd.5 12 hdd 5.57169 1.0 5.6 TiB 159 GiB 42 GiB 31 MiB 993 MiB 5.4 TiB 2.79 1.00 222 up osd.12 20 hdd 5.57169 1.0 5.6 TiB 157 GiB 40 GiB 47 MiB 1.2 GiB 5.4 TiB 2.76 0.99 212 up osd.20 27 hdd 5.57169 1.0 5.6 TiB 151 GiB 33 GiB 28 MiB 1.9 GiB 5.4 TiB 2.64 0.95 179 up osd.27 29 hdd 5.57169 1.0 5.6 TiB 156 GiB 39 GiB 56 MiB 1.7 GiB 5.4 TiB 2.74 0.98 203 up osd.29 -1844.57349- 45 TiB 1.3 TiB 341 GiB 248 MiB 14 GiB 43 TiB 2.81 1.01 -rack 2-zwei -722.28674- 22 TiB 641 GiB 171 GiB 132 MiB 6.7 GiB 22 TiB 2.81 1.01 -host onode2 1 hdd 5.57169 1.0 5.6 TiB 155 GiB 38 GiB 35 MiB 1.2 GiB 5.4 TiB 2.72 0.97 203 up osd.1 8 hdd 5.57169 1.0 5.6 TiB 163 GiB 46 GiB 36 MiB 2.4 GiB 5.4 TiB 2.86 1.02 243 up osd.8 16 hdd 5.57169 1.0 5.6 TiB 161 GiB 43 GiB 24 MiB 1000 MiB 5.4 TiB 2.82 1.01 221 up osd.16 23 hdd 5.57169 1.0 5.6 TiB 162 GiB 45 GiB 37 MiB 2.1 GiB 5.4 TiB 2.84 1.02 228 up osd.23 -322.28674- 22 TiB 640 GiB 170 GiB 116 MiB 7.6 GiB 22 TiB 2.80 1.00 -host onode5 3 hdd 5.57169 1.0 5.6 TiB 154 GiB 36 GiB 14 MiB 1010 MiB 5.4 TiB 2.70 0.97 186 up osd.3 7 hdd 5.57169 1.0 5.6 TiB 161 GiB 44 GiB 22 MiB 2.2 GiB 5.4 TiB 2.82 1.01 221 up osd.7 15 hdd 5.57169 1.0 5.6 TiB 165 GiB 48 GiB 26 MiB 2.3 GiB 5.4 TiB 2.89 1.04 249 up osd.15 24 hdd 5.57169 1.0 5.6 TiB 160 GiB 42 GiB 54 MiB 2.1 GiB 5.4 TiB 2.80 1.00 223 up osd.24 -1950.14517- 50 TiB 1.4 TiB 376 GiB 311 MiB 18 GiB 49 TiB 2.79 1.00 -rack 3-drei -1522.28674- 22 TiB 649 GiB 179 GiB 112 MiB 8.2 GiB 22 TiB 2.84 1.02 -host onode3 0 hdd 5.57169 1.0 5.6 TiB 162 GiB 45 GiB 28 MiB 996 MiB 5.4 TiB 2.84 1.02 229 up osd.0 10 hdd 5.57169 1.0 5.6 TiB 159 GiB 42 GiB 21 MiB 2.2 GiB 5.4 TiB 2.79 1.00 213 up osd.10 17 hdd 5.57169 1.0 5.6 TiB 165 GiB 47 GiB 19 MiB 2.5 GiB 5.4 TiB 2.88 1.03 238 up osd.17 25 hdd 5.57169 1.0 5.6 TiB 163 GiB 46 GiB 
44 MiB 2.5 GiB 5.4 TiB 2.86 1.03 242 up osd.25 -1127.85843- 28 TiB 784 GiB 197 GiB 199 MiB 9.4
[ceph-users] Reduced data availability: 2 pgs inactive
Hi there!

Recently I made our cluster rack aware by adding racks to the crush map. The
failure domain was and still is "host".

rule cephfs2_data {
        id 7
        type erasure
        min_size 3
        max_size 6
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take PRZ
        step chooseleaf indep 0 type host
        step emit
}

Then I sorted the hosts into the new rack buckets of the crush map as they are
in reality, by:
# ceph osd crush move onodeX rack=XYZ
for all hosts. The cluster started to reorder the data.

In the end the cluster has now:

HEALTH_WARN 1 filesystem is degraded; Reduced data availability: 2 pgs inactive; Degraded data redundancy: 678/2371785 objects degraded (0.029%), 2 pgs degraded, 2 pgs undersized
FS_DEGRADED 1 filesystem is degraded
    fs cephfs_1 is degraded
PG_AVAILABILITY Reduced data availability: 2 pgs inactive
    pg 21.2e4 is stuck inactive for 142792.952697, current state activating+undersized+degraded+remapped+forced_backfill, last acting [5,2147483647,25,28,11,2]
    pg 23.5 is stuck inactive for 142791.437243, current state activating+undersized+degraded+remapped+forced_backfill, last acting [13,21]
PG_DEGRADED Degraded data redundancy: 678/2371785 objects degraded (0.029%), 2 pgs degraded, 2 pgs undersized
    pg 21.2e4 is stuck undersized for 142779.321192, current state activating+undersized+degraded+remapped+forced_backfill, last acting [5,2147483647,25,28,11,2]
    pg 23.5 is stuck undersized for 142789.747915, current state activating+undersized+degraded+remapped+forced_backfill, last acting [13,21]

The cluster hosts a cephfs which is not mountable anymore. I tried a few
things (as you can see: forced_backfill), but failed. The cephfs_data pool is
EC 4+2. Both inactive pgs seem to have enough copies to recalculate the
contents for all osds.

Is there a chance to get both pgs clean again? How can I force the pgs to
recalculate all necessary copies?

Thanks
Lars
Re: [ceph-users] inconsistent number of pools
Yes, thanks. This helped.

Regards,
Lars

Tue, 28 May 2019 11:50:01 -0700 Gregory Farnum ==> Lars Täuber :
> You're the second report I've seen of this, and while it's confusing, you
> should be able to resolve it by restarting your active manager daemon.
>
> On Sun, May 26, 2019 at 11:52 PM Lars Täuber wrote:
> > Fri, 24 May 2019 21:41:33 +0200
> > Michel Raabe ==> Lars Täuber , ceph-users@lists.ceph.com :
> >
> > > You can also try
> > >
> > > $ rados lspools
> > > $ ceph osd pool ls
> > >
> > > and verify that with the pgs
> > >
> > > $ ceph pg ls --format=json-pretty | jq -r '.pg_stats[].pgid' | cut -d. -f1 | uniq
> >
> > Yes, now I know but I still get this:
> > $ sudo ceph -s
> > […]
> > data:
> >   pools: 5 pools, 1153 pgs
> > […]
> >
> > and with all other means I get:
> > $ sudo ceph osd lspools | wc -l
> > 3
> >
> > Which is what I expect, because all other pools are removed.
> > But since this has no bad side effects I can live with it.
> >
> > Cheers,
> > Lars
Re: [ceph-users] inconsistent number of pools
Fri, 24 May 2019 21:41:33 +0200 Michel Raabe ==> Lars Täuber , ceph-users@lists.ceph.com :
>
> You can also try
>
> $ rados lspools
> $ ceph osd pool ls
>
> and verify that with the pgs
>
> $ ceph pg ls --format=json-pretty | jq -r '.pg_stats[].pgid' | cut -d. -f1 | uniq
>

Yes, now I know but I still get this:
$ sudo ceph -s
[…]
data:
  pools: 5 pools, 1153 pgs
[…]

and with all other means I get:
$ sudo ceph osd lspools | wc -l
3

Which is what I expect, because all other pools are removed.
But since this has no bad side effects I can live with it.

Cheers,
Lars
Re: [ceph-users] inconsistent number of pools
Mon, 20 May 2019 10:52:14 + Eugen Block ==> ceph-users@lists.ceph.com :
> Hi, have you tried 'ceph health detail'?

No I hadn't. Thanks for the hint.
[ceph-users] inconsistent number of pools
Hi everybody,

with the status report I get a HEALTH_WARN I don't know how to get rid of. It
may be connected to recently removed pools.

# ceph -s
  cluster:
    id:     6cba13d1-b814-489c-9aac-9c04aaf78720
    health: HEALTH_WARN
            1 pools have many more objects per pg than average

  services:
    mon: 3 daemons, quorum mon1,mon2,mon3 (age 4h)
    mgr: mon1(active, since 4h), standbys: cephsible, mon2, mon3
    mds: cephfs_1:1 {0=mds3=up:active} 2 up:standby
    osd: 30 osds: 30 up (since 2h), 30 in (since 7w)

  data:
    pools:   5 pools, 1029 pgs
    objects: 315.51k objects, 728 GiB
    usage:   4.6 TiB used, 163 TiB / 167 TiB avail
    pgs:     1029 active+clean

!!! but:

# ceph osd lspools | wc -l
3

The status says there are 5 pools but the listing says there are only 3.
How do I get to know which pool is the reason for the health warning?

Thanks
Lars
[ceph-users] pool migration for cephfs?
Hi,

is there a way to migrate a cephfs to a new data pool like it is for rbd on
nautilus?
https://ceph.com/geen-categorie/ceph-pool-migration/

Thanks
Lars
Re: [ceph-users] Ceph Multi Mds Trim Log Slow
I restarted the mds process which was in "up:stopping" state. Since then there are no trimmings behind any more. All (sub)directories are accessible as normal again. It seems there are stability issues with snapshots in a multi-mds cephfs on nautilus. This has already been suspected here: http://docs.ceph.com/docs/nautilus/cephfs/experimental-features/#snapshots Regards, Lars Fri, 3 May 2019 11:45:41 +0200 Lars Täuber ==> ceph-users@lists.ceph.com : > Hi, > > I'm still new to ceph. Here are similar problems with CephFS. > > ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc) nautilus > (stable) > on Debian GNU/Linux buster/sid > > # ceph health detail > HEALTH_WARN 1 MDSs report slow requests; 1 MDSs behind on trimming > MDS_SLOW_REQUEST 1 MDSs report slow requests > mdsmds3(mds.0): 13 slow requests are blocked > 30 secs > MDS_TRIM 1 MDSs behind on trimming > mdsmds3(mds.0): Behind on trimming (33924/125) max_segments: 125, > num_segments: 33924 > > > > The workload is "doveadm backup" of more than 500 mail folders from a local > ext4 to a cephfs. > * There are ~180'000 files with a strange file size distribution: > > # NumSamples = 181056; MIN_SEEN = 377; MAX_SEEN = 584835624 > # Mean = 4477785.646005; Variance = 31526763457775.421875; SD = 5614869.852256 > 377 - 262502 [ 56652]: ∎ > 31.29% > 262502 - 524627 [ 4891]: 2.70% > 524627 - 786752 [ 3498]: ∎∎∎ 1.93% > 786752 -1048878 [ 2770]: ∎∎∎ 1.53% > 1048878 -1311003 [ 2460]: ∎∎ 1.36% > 1311003 -1573128 [ 2197]: ∎∎ 1.21% > 1573128 -1835253 [ 2014]: ∎∎ 1.11% > 1835253 -2097378 [ 1961]: ∎∎ 1.08% > 2097378 -2359503 [ 2244]: ∎∎ 1.24% > 2359503 -2621628 [ 1890]: ∎∎ 1.04% > 2621628 -2883754 [ 1897]: ∎∎ 1.05% > 2883754 -3145879 [ 2188]: ∎∎ 1.21% > 3145879 -3408004 [ 2579]: ∎∎ 1.42% > 3408004 -3670129 [ 3396]: ∎∎∎ 1.88% > 3670129 -3932254 [ 5173]: 2.86% > 3932254 -4194379 [ 24847]: ∎∎ 13.72% > 4194379 -4456505 [ 1512]: ∎∎ 0.84% > 4456505 -4718630 [ 1394]: ∎∎ 0.77% > 4718630 -4980755 [ 1412]: ∎∎ 0.78% > 4980755 - 584835624 [ 56081]: ∎ > 30.97% > > * There are two snapshots of the main directory the mails are backed up to. > * There are three sub directories where a simple ls doesn't return from. > * The cephfs is mounted using the kernel driver of Ubuntu 18.04.2 LTS kernel > 4.15.0-48-generic. > * Same behaviour with ceph-fuse 'FUSE library version: 2.9.7' with the > difference that I can't interrupt the ls. > > The reduction of the number of mds working for our cephfs to 1 made no > difference. > The number of segments is still rising. > # ceph -w > cluster: > id: 6cba13d1-b814-489c-9aac-9c04aaf78720 > health: HEALTH_WARN > 1 MDSs report slow requests > 1 MDSs behind on trimming > > services: > mon: 3 daemons, quorum mon1,mon2,mon3 (age 3d) > mgr: cephsible(active, since 27h), standbys: mon3, mon1 > mds: cephfs_1:2 {0=mds3=up:active,1=mds2=up:stopping} 1 up:standby > osd: 30 osds: 30 up (since 4w), 30 in (since 5w) > > data: > pools: 5 pools, 393 pgs > objects: 607.74k objects, 1.5 TiB > usage: 6.9 TiB used, 160 TiB / 167 TiB avail > pgs: 393 active+clean > > > 2019-05-03 11:40:17.916193 mds.mds3 [WRN] 15 slow requests, 0 included below; > oldest blocked for > 342610.193367 secs > > It seems the stopping of one out of two mds doesn't come to an end. > > How to debug this? > > Thanks in advance. > Lars > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph Multi Mds Trim Log Slow
Hi,

I'm still new to ceph. Here are similar problems with CephFS.

ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc) nautilus (stable)
on Debian GNU/Linux buster/sid

# ceph health detail
HEALTH_WARN 1 MDSs report slow requests; 1 MDSs behind on trimming
MDS_SLOW_REQUEST 1 MDSs report slow requests
    mdsmds3(mds.0): 13 slow requests are blocked > 30 secs
MDS_TRIM 1 MDSs behind on trimming
    mdsmds3(mds.0): Behind on trimming (33924/125) max_segments: 125, num_segments: 33924

The workload is "doveadm backup" of more than 500 mail folders from a local ext4 to a CephFS.

* There are ~180'000 files with a strange file size distribution:

# NumSamples = 181056; MIN_SEEN = 377; MAX_SEEN = 584835624
# Mean = 4477785.646005; Variance = 31526763457775.421875; SD = 5614869.852256
     377 -    262502 [ 56652]: ∎ 31.29%
  262502 -    524627 [  4891]: 2.70%
  524627 -    786752 [  3498]: ∎∎∎ 1.93%
  786752 -   1048878 [  2770]: ∎∎∎ 1.53%
 1048878 -   1311003 [  2460]: ∎∎ 1.36%
 1311003 -   1573128 [  2197]: ∎∎ 1.21%
 1573128 -   1835253 [  2014]: ∎∎ 1.11%
 1835253 -   2097378 [  1961]: ∎∎ 1.08%
 2097378 -   2359503 [  2244]: ∎∎ 1.24%
 2359503 -   2621628 [  1890]: ∎∎ 1.04%
 2621628 -   2883754 [  1897]: ∎∎ 1.05%
 2883754 -   3145879 [  2188]: ∎∎ 1.21%
 3145879 -   3408004 [  2579]: ∎∎ 1.42%
 3408004 -   3670129 [  3396]: ∎∎∎ 1.88%
 3670129 -   3932254 [  5173]: 2.86%
 3932254 -   4194379 [ 24847]: ∎∎ 13.72%
 4194379 -   4456505 [  1512]: ∎∎ 0.84%
 4456505 -   4718630 [  1394]: ∎∎ 0.77%
 4718630 -   4980755 [  1412]: ∎∎ 0.78%
 4980755 - 584835624 [ 56081]: ∎ 30.97%

* There are two snapshots of the main directory the mails are backed up to.
* There are three subdirectories where a simple ls doesn't return.
* The CephFS is mounted using the kernel driver of Ubuntu 18.04.2 LTS, kernel 4.15.0-48-generic.
* Same behaviour with ceph-fuse ('FUSE library version: 2.9.7'), with the difference that I can't interrupt the ls.

Reducing the number of MDS daemons working for our CephFS to 1 made no difference.
The number of segments is still rising.

# ceph -w
  cluster:
    id:     6cba13d1-b814-489c-9aac-9c04aaf78720
    health: HEALTH_WARN
            1 MDSs report slow requests
            1 MDSs behind on trimming

  services:
    mon: 3 daemons, quorum mon1,mon2,mon3 (age 3d)
    mgr: cephsible(active, since 27h), standbys: mon3, mon1
    mds: cephfs_1:2 {0=mds3=up:active,1=mds2=up:stopping} 1 up:standby
    osd: 30 osds: 30 up (since 4w), 30 in (since 5w)

  data:
    pools:   5 pools, 393 pgs
    objects: 607.74k objects, 1.5 TiB
    usage:   6.9 TiB used, 160 TiB / 167 TiB avail
    pgs:     393 active+clean

2019-05-03 11:40:17.916193 mds.mds3 [WRN] 15 slow requests, 0 included below; oldest blocked for > 342610.193367 secs

It seems the stopping of one of the two MDS daemons never completes.

How to debug this?

Thanks in advance.
Lars

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
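The reduction to a single active MDS mentioned above would typically look like the following sketch (filesystem name cephfs_1 from the ceph -w output; this only triggers the rank shutdown, it is not a fix for the trimming backlog):

# ask the second rank to stop
ceph fs set cephfs_1 max_mds 1
# the rank-1 daemon should move to up:stopping
ceph fs status cephfs_1
# watch whether num_segments actually goes down
ceph health detail | grep -A2 MDS_TRIM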
Re: [ceph-users] how to judge the results? - rados bench comparison
Wed, 17 Apr 2019 20:01:28 +0900 Christian Balzer ==> Ceph Users :
> On Wed, 17 Apr 2019 11:22:08 +0200 Lars Täuber wrote:
>
> > Wed, 17 Apr 2019 10:47:32 +0200
> > Paul Emmerich ==> Lars Täuber :
> > > The standard argument that it helps preventing recovery traffic from
> > > clogging the network and impacting client traffic is misleading:
> >
> > What do you mean by "it"? I don't know the standard argument.
> > Do you mean separating the networks, or having both together in one switched network?
> >
> He means separated networks, obviously.
>
> > > * write client traffic relies on the backend network for replication
> > > operations: your client (write) traffic is impacted anyways if the
> > > backend network is full
> >
> > This I understand as an argument for separating the networks, with the backend network being faster than the frontend network.
> > So in case of reconstruction there should be some bandwidth left in the backend for the client IO traffic.
> >
> You need to run the numbers and look at the big picture.
> As mentioned already, this is all moot in your case.
>
> 6 HDDs at realistically 150 MB/s each, if they were all doing sequential I/O, which they aren't.
> But for the sake of argument let's say that one of your nodes can read (or write, not both at the same time) 900 MB/s.
> That's still less than half of a single 25 Gb/s link.

Is this really true also with the WAL device (combined with the DB device), which is a (fast) SSD in our setup?
reading: 2150 MB/s
writing: 2120 MB/s
IOPS 4K reading/writing: 440k/320k

If so, the next version of the OSD host will be adjusted in its HW requirements.

> And that very hypothetical data rate (it's not sequential, you will have
> concurrent operations and thus seeks) is all your node can handle. If it is
> all going into recovery/rebalancing, your clients are starved because of
> that, not because of bandwidth exhaustion.

If it is like this also with our SSD WAL, the next version of the OSD host will be adjusted in its HW requirements.

Thanks
Lars

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
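To make Christian's arithmetic explicit (the ~150 MB/s per disk is his sequential best-case assumption, not a measured value):

# aggregate sequential throughput of 6 HDDs per node
echo "$((6 * 150)) MB/s per node"      # -> 900 MB/s
# raw capacity of one 25 Gb/s link, in decimal MB/s
echo "$((25000 / 8)) MB/s per link"    # -> 3125 MB/s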
Re: [ceph-users] how to judge the results? - rados bench comparison
Wed, 17 Apr 2019 10:47:32 +0200 Paul Emmerich ==> Lars Täuber :
> The standard argument that it helps preventing recovery traffic from
> clogging the network and impacting client traffic is misleading:

What do you mean by "it"? I don't know the standard argument.
Do you mean separating the networks, or having both together in one switched network?

> * write client traffic relies on the backend network for replication
> operations: your client (write) traffic is impacted anyways if the
> backend network is full

This I understand as an argument for separating the networks, with the backend network being faster than the frontend network.
So in case of reconstruction there should be some bandwidth left in the backend for the client IO traffic.

> * you are usually not limited by network speed for recovery (except
> for 1 gbit networks), and if you are you probably want to reduce
> recovery speed anyways if you would run into that limit
>
> Paul

Lars

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
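For context, "separating the networks" in Ceph terms means giving the OSDs a distinct cluster (backend) network next to the public one. A minimal sketch with made-up subnets, using the centralized config store (the same can be done with "public network" / "cluster network" lines in ceph.conf); changing this on a running cluster needs care:

ceph config set global public_network  192.0.2.0/24      # client and mon traffic
ceph config set global cluster_network 198.51.100.0/24   # OSD replication/recovery traffic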
Re: [ceph-users] how to judge the results? - rados bench comparison
Wed, 17 Apr 2019 09:52:29 +0200 Stefan Kooman ==> Lars Täuber :
> Quoting Lars Täuber (taeu...@bbaw.de):
> > > I'd probably only use the 25G network for both networks instead of
> > > using both. Splitting the network usually doesn't help.
> >
> > This is something I was told to do, because a reconstruction of failed
> > OSDs/disks would have a heavy impact on the backend network.
>
> Opinions vary on running "public" only versus "public" / "backend".
> Having a separate "backend" network might lead to difficult-to-debug
> issues when the "public" network is working fine, but the "backend" is
> having issues and OSDs can't peer with each other, while the clients can
> talk to all OSDs. You will get slow requests and OSDs marking each other
> down while they are still running, etc.

This I was not aware of.

> In your case with only 6 spinners max per server there is no way you
> will ever fill the network capacity of a 25 Gb/s network: 6 * 250 MB/s
> (for large spinners) should be just enough to fill a 10 Gb/s link. A
> redundant 25 Gb/s link would provide 50 Gb/s of bandwidth, enough for
> both OSD replication traffic and client IO.

The reason for choosing the 25 GBit network was a remark from someone that the latency on this Ethernet is way below that of 10 GBit. I never double-checked this.

> My 2 cents,
>
> Gr. Stefan

Cheers,
Lars

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
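That latency claim is easy to double-check between two OSD hosts; the hostname below is a placeholder:

ping -c 100 -q osd-node-2     # min/avg/max round-trip time summary
iperf3 -c osd-node-2 -t 30    # raw TCP throughput (run "iperf3 -s" on the other side)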
Re: [ceph-users] how to judge the results? - rados bench comparison
Thanks Paul for the judgement.

Tue, 16 Apr 2019 10:13:03 +0200 Paul Emmerich ==> Lars Täuber :
> Seems in line with what I'd expect for the hardware.
>
> Your hardware seems to be way overspecced, you'd be fine with half the
> RAM, half the CPU and way cheaper disks.

Do you mean all the components of the cluster or only the OSD nodes?
Before drawing up the requirements I had only read about replicated ("mirroring") clusters. I was afraid the CPUs would be too slow to calculate the erasure codes we planned to use.

> In fact, a good SATA 4kn disk can be faster than a SAS 512e disk.

This is a really good hint, because we just started to plan the extension.

> I'd probably only use the 25G network for both networks instead of
> using both. Splitting the network usually doesn't help.

This is something I was told to do, because a reconstruction of failed OSDs/disks would have a heavy impact on the backend network.

> Paul

Thanks again.
Lars

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] how to judge the results? - rados bench comparison
Hi there,

I'm new to ceph and just got my first cluster running. Now I'd like to know whether the performance we get is what can be expected. Is there a website with benchmark results somewhere where I could compare our HW and our results?

These are the results:

rados bench, single threaded:
# rados bench 10 write --rbd-cache=false -t 1
Object size:            4194304
Bandwidth (MB/sec):     53.7186
Stddev Bandwidth:       3.86437
Max bandwidth (MB/sec): 60
Min bandwidth (MB/sec): 48
Average IOPS:           13
Stddev IOPS:            0.966092
Average Latency(s):     0.0744599
Stddev Latency(s):      0.00911778

nearly maxing out one (idle) client with 28 threads:
# rados bench 10 write --rbd-cache=false -t 28
Bandwidth (MB/sec):     850.451
Stddev Bandwidth:       40.6699
Max bandwidth (MB/sec): 904
Min bandwidth (MB/sec): 748
Average IOPS:           212
Stddev IOPS:            10.1675
Average Latency(s):     0.131309
Stddev Latency(s):      0.0318489

four concurrent benchmarks on four clients, each with 24 threads:
Bandwidth (MB/sec):     396    376    381    389
Stddev Bandwidth:        30     25     22     22
Max bandwidth (MB/sec): 440    420    416    428
Min bandwidth (MB/sec): 352    348    344    364
Average IOPS:            99     94     95     97
Stddev IOPS:            7.5    6.3    5.6    5.6
Average Latency(s):    0.24   0.25   0.25   0.24
Stddev Latency(s):     0.12   0.15   0.15   0.14

Summing up:
write mode: ~1500 MB/s bandwidth, ~385 IOPS, ~0.25 s latency
rand mode:  ~3500 MB/s bandwidth, ~920 IOPS, ~0.154 s latency

Maybe someone could judge our numbers. I am actually very satisfied with the values.

The (mostly idle) cluster is built from these components:
* 10 GBit frontend network, bonding two connections to mon, mds and osd nodes
** no bonding to clients
* 25 GBit backend network, bonding two connections to osd nodes

cluster:
* 3x mon, 2x Intel(R) Xeon(R) Bronze 3104 CPU @ 1.70GHz, 64GB RAM
* 3x mds, 1x Intel(R) Xeon(R) Gold 5115 CPU @ 2.40GHz, 128GB RAM
* 7x OSD nodes, 2x Intel(R) Xeon(R) Silver 4112 CPU @ 2.60GHz, 96GB RAM
** 4x 6TB SAS HDD HGST HUS726T6TAL5204 (5x on two nodes, max. 6x per chassis for later growth)
** 2x 800GB SAS SSD WDC WUSTM3280ASS200 => SW-RAID1 => LVM, ~116 GiB per OSD for DB and WAL

erasure coded pool (made for CephFS):
* plugin=clay k=5 m=2 d=6 crush-failure-domain=host

Thanks and best regards
Lars

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
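For reference, an EC profile and data pool like the one described would be created roughly like this; profile name, pool name and PG count are placeholders, not Lars's actual values:

ceph osd erasure-code-profile set clay_5_2 plugin=clay k=5 m=2 d=6 crush-failure-domain=host
ceph osd pool create cephfs_data 256 256 erasure clay_5_2
# needed when an EC pool is used as a CephFS (or RBD) data pool
ceph osd pool set cephfs_data allow_ec_overwrites true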
[ceph-users] typo in news for PG auto-scaler
Hi everybody!

There is a small mistake in the news about the PG autoscaler:
https://ceph.com/rados/new-in-nautilus-pg-merging-and-autotuning/

The command
$ ceph osd pool set foo target_ratio .8
should actually be
$ ceph osd pool set foo target_size_ratio .8

Thanks for this great improvement!

Cheers,
Lars

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
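For completeness, the corrected command in context (pool name foo as in the article; a sketch of the usual Nautilus autoscaler setup):

ceph mgr module enable pg_autoscaler
ceph osd pool set foo pg_autoscale_mode on
ceph osd pool set foo target_size_ratio .8
ceph osd pool autoscale-status      # shows the ratio and the suggested pg_num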
[ceph-users] Support for buster with nautilus?
Hi there!

I just started to install a ceph cluster and would like to use the nautilus release. Because of hardware restrictions (network driver modules) I had to take the buster release of Debian.
Will there be buster packages of nautilus available after the release?

Thanks for this great storage!

Cheers,
Lars

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
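Assuming buster packages do get published under the usual naming scheme on download.ceph.com (not confirmed in this thread), the repo setup would presumably look like:

wget -q -O- 'https://download.ceph.com/keys/release.asc' | sudo apt-key add -
echo "deb https://download.ceph.com/debian-nautilus/ buster main" | sudo tee /etc/apt/sources.list.d/ceph.list
sudo apt update && sudo apt install ceph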
Re: [ceph-users] luminous/bluetsore osd memory requirements
Hi there,

can someone share her/his experiences regarding this question? Maybe differentiated according to the different available algorithms?

Sat, 12 Aug 2017 14:40:05 +0200 Stijn De Weirdt ==> Gregory Farnum , Mark Nelson , "ceph-users@lists.ceph.com" :
> also any indication how much more cpu EC uses (10%,
> 100%, ...)?

I would also be interested in the hardware recommendations for the newly introduced ceph-mgr daemon. The big search engines don't tell me anything about this yet.

Thanks in advance
Lars

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
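For what it's worth, later BlueStore releases expose a per-OSD memory budget via osd_memory_target; a sketch, assuming a release with the centralized config store (the 4 GiB value is just the common default, not a sizing recommendation for this thread):

ceph config set osd osd_memory_target 4294967296   # ~4 GiB per OSD daemon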