Re: [ceph-users] Need advice with setup planning
Hi,

In the case of three Ceph hosts, you could also consider this setup: https://pve.proxmox.com/wiki/Full_Mesh_Network_for_Ceph_Server

This only requires two 10G NICs on each machine, plus an extra 1G for 'regular' non-Ceph traffic. That way at least your Ceph comms would be 10G, since 1G is surely going to be a bottleneck. We are running the above setup, no problems. The only issue: adding a fourth node will be relatively intrusive.

MJ

On 9/20/19 8:23 PM, Salsa wrote:

Replying inline.

-- Salsa

Sent with ProtonMail <https://protonmail.com> Secure Email.

‐‐‐ Original Message ‐‐‐
On Friday, September 20, 2019 1:34 PM, Martin Verges wrote:

Hello Salsa,

I have tested Ceph using VMs but never got to put it to use and had a lot of trouble to get it to install.

if you want to get rid of all the troubles from installing to day2day operations, you could consider using https://croit.io/croit-virtual-demo

Amazing! Where were you 3 months ago? Only problem is that I think we have no more budget for this, so I can't get approval for a software license.

- Use 2 HDDs for OS using RAID 1 (I've left 3.5TB unallocated in case I can use it later for storage)
- Install CentOS 7.7

Is ok, but won't be necessary if you choose croit, as we boot from the network and don't install an operating system.

No budget for a software license.

- Use 2 vLANs, one for ceph internal usage and another for external access. Since they've 4 network adapters, I'll try to bond them in pairs to speed up the network (1Gb).

If there is no internal policy that forces you to use separate networks, you can use a simple 1-VLAN setup and bond 4*1GbE. Otherwise it's ok.

The service is critical and we are afraid that the network might become congested and QoS for the end user degrades.

- I'll try to use ceph-ansible for installation. I failed to use it in the lab, but it seems more recommended.
- Install Ceph Nautilus

Ultra easy with croit, maybe look at our videos on youtube - https://www.youtube.com/playlist?list=PL1g9zo59diHDSJgkZcMRUq6xROzt_YKox

Thanks! I'll be watching them.

- Each server will host OSD, MON, MGR and MDS.

ok, but you should use ssd for metadata.

No budget and no option to get those now.

- One VM for ceph-admin: this will be used to run ceph-ansible and maybe to host some ceph services later

perfect for croit ;)

- I'll have to serve samba, iscsi and probably NFS too. Not sure how or on which servers.

Just put it on the servers as well; with croit it is just a click away and everything is included in our interface. If not using croit, you can still install it on the same systems and configure it by hand/script.

Great! Thanks for the help and congratulations on that demo. It is the best I've used and the easiest ceph setup I've found. As feedback, the last part of the demo tutorial is not 100% compatible with the master branch from github. The RBD pool creation has a different interface than the one presented in your tutorial (or I made some mistake along the way). Also, my cluster is showing errors in my placement groups after RBD pool creation, but I'll try to find out what happened. Thanks again!

--
Martin Verges
Managing director
Mobile: +49 174 9335695
E-Mail: martin.ver...@croit.io
Chat: https://t.me/MartinVerges
croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io
YouTube: https://goo.gl/PGE1Bx

On Fri, 20 Sept 2019 at 18:14, Salsa <sa...@protonmail.com> wrote:

I have tested Ceph using VMs but never got to put it to use and had a lot of trouble to get it to install. Now I've been asked to do a production setup using 3 servers (Dell R740) with 12 4TB disks each.
My plan is this:

- Use 2 HDDs for OS using RAID 1 (I've left 3.5TB unallocated in case I can use it later for storage)
- Install CentOS 7.7
- Use 2 vLANs, one for ceph internal usage and another for external access. Since they've 4 network adapters, I'll try to bond them in pairs to speed up the network (1Gb).
- I'll try to use ceph-ansible for installation. I failed to use it in the lab, but it seems more recommended.
- Install Ceph Nautilus
- Each server will host OSD, MON, MGR and MDS.
- One VM for ceph-admin: this will be used to run ceph-ansible and maybe to host some ceph services later
- I'll have to serve samba, iscsi and probably NFS too. Not sure how or on which servers.

Am I missing anything? Am I doing anything "wrong"? I searched for some actual guidance on setup but couldn't find anything complete, like a good tutorial or reference based on possible use-cases. So, are there any suggestions you could share, or links and references I should take a look at?
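For the 4*1GbE bonding discussed above, a CentOS 7-style ifcfg sketch of an LACP bond may help. Everything here (interface names, addresses) is illustrative and not taken from the thread; the BONDING_OPTS values are the standard 802.3ad knobs:

```ini
# /etc/sysconfig/network-scripts/ifcfg-bond0  (illustrative)
DEVICE=bond0
TYPE=Bond
BONDING_MASTER=yes
ONBOOT=yes
BOOTPROTO=none
IPADDR=192.0.2.10
PREFIX=24
# 802.3ad = LACP; layer3+4 spreads flows by IP+port (per-flow, not per-packet)
BONDING_OPTS="mode=802.3ad miimon=100 xmit_hash_policy=layer3+4"

# /etc/sysconfig/network-scripts/ifcfg-em1  (repeat for each slave NIC)
DEVICE=em1
TYPE=Ethernet
ONBOOT=yes
BOOTPROTO=none
MASTER=bond0
SLAVE=yes
```

Note that a single TCP stream still tops out at 1Gb with LACP; bonding only helps with many concurrent flows, which is why the 2x10G full-mesh setup is attractive for Ceph replication traffic.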
Re: [ceph-users] clock skew
An update. We noticed contradicting output from chrony. "chronyc sources" showed that chrony was synced. However, we also noted this output:

root@ceph2:/etc/chrony# chronyc activity
200 OK
0 sources online
4 sources offline
0 sources doing burst (return to online)
0 sources doing burst (return to offline)
0 sources with unknown address

So "chronyc activity" shows OFFLINE sources. After we changed sources to nl.pool.ntp.org, "chronyc activity" started showing the sources as ONLINE, and now, after a day running, our skew as reported by "ceph time-sync-status" is 0.00 on all hosts.

It seems that relying on "chronyc sources" alone is not always enough to make sure that everything is really synced.

Thanks for the help!

MJ

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
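The lesson above — "chronyc sources" can look fine while all sources are offline — can be folded into monitoring. Below is a minimal, hypothetical Python check that parses "chronyc activity" output and flags the all-offline case; the command name is real, but the parsing and alerting logic are this sketch's own assumptions:

```python
import re
import subprocess

def sources_online(activity_output):
    """Return the number of online NTP sources reported by `chronyc activity`."""
    m = re.search(r"(\d+) sources online", activity_output)
    if m is None:
        raise ValueError("unexpected chronyc activity output")
    return int(m.group(1))

def chrony_healthy():
    """True if at least one source is online; run e.g. from cron or a monitoring agent."""
    out = subprocess.run(["chronyc", "activity"],
                         capture_output=True, text=True, check=True).stdout
    return sources_online(out) > 0
```

A check like this would have caught the "4 sources offline" state days earlier, instead of waiting for Ceph to complain about skew.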
Re: [ceph-users] clock skew
Hi all,

Thanks for all the replies!

@Huang: ceph time-sync-status is exactly what I was looking for, thanks!

@Janne: I will check out / implement the peer config per your suggestion. However, what confuses us is that chrony thinks the clocks match, and only ceph feels they don't. So we are not sure if the peer config will actually help in this situation. But time will tell.

@John: Thanks for the maxsources suggestion.

@Bill: thanks for the interesting article, will check it out!

MJ

On 4/25/19 5:47 PM, Bill Sharer wrote:

If you are just synching to the outside pool, the three hosts may end up latching on to different outside servers as their definitive sources. You might want to make one of the three a higher-priority source for the other two, and possibly just have it use the outside sources for sync.

Also, for hardware newer than about five years old, you might want to look at enabling the NIC clocks using LinuxPTP to keep clock jitter down inside your LAN. I wrote this article on the Gentoo wiki on enabling PTP in chrony: https://wiki.gentoo.org/wiki/Chrony_with_hardware_timestamping

Bill Sharer

On 4/25/19 6:33 AM, mj wrote:

Hi all,

On our three-node cluster, we have set up chrony for time sync, and even though chrony reports that it is synced to ntp time, at the same time ceph occasionally reports time skews that can last several hours.
See for example:

root@ceph2:~# ceph -v
ceph version 12.2.10 (fc2b1783e3727b66315cc667af9d663d30fe7ed4) luminous (stable)
root@ceph2:~# ceph health detail
HEALTH_WARN clock skew detected on mon.1
MON_CLOCK_SKEW clock skew detected on mon.1
    mon.1 addr 10.10.89.2:6789/0 clock skew 0.506374s > max 0.5s (latency 0.000591877s)
root@ceph2:~# chronyc tracking
Reference ID    : 7F7F0101 ()
Stratum         : 10
Ref time (UTC)  : Wed Apr 24 19:05:28 2019
System time     : 0.00133 seconds slow of NTP time
Last offset     : -0.00524 seconds
RMS offset      : 0.00524 seconds
Frequency       : 12.641 ppm slow
Residual freq   : +0.000 ppm
Skew            : 0.000 ppm
Root delay      : 0.00 seconds
Root dispersion : 0.00 seconds
Update interval : 1.4 seconds
Leap status     : Normal
root@ceph2:~#

For the record: mon.1 = ceph2 = 10.10.89.2, and time is synced similarly with NTP on the two other nodes. We don't understand this... I have now injected mon_clock_drift_allowed 0.7, so at least we have HEALTH_OK again (to stop upsetting my monitoring system). But two questions:

- can anyone explain why this is happening? It looks as if ceph and NTP/chrony disagree on just how time-synced the servers are.
- how can we determine the current clock skew from Ceph's perspective? "ceph health detail" in the HEALTH_OK case does not show it. (I want to start monitoring it continuously, to see if I can find some sort of pattern.)

Thanks!

MJ
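For continuous monitoring of skew as Ceph sees it, `ceph time-sync-status` (mentioned later in the thread) can be polled in JSON form. The sketch below is hypothetical: the command and `-f json` flag are real, but the exact JSON layout is an assumption (a per-mon map under "time_skew_status" with a "skew" field) and should be checked against your own cluster's output first:

```python
import json
import subprocess

# ASSUMED (not verified) shape of `ceph time-sync-status -f json`:
# {"time_skew_status": {"mon.0": {"skew": 0.0001, "latency": 0.0006}, ...}, ...}
def worst_skew(status):
    """Return (mon_name, abs_skew_seconds) for the mon with the largest skew."""
    skews = {mon: abs(float(v.get("skew", 0.0)))
             for mon, v in status.get("time_skew_status", {}).items()}
    mon = max(skews, key=skews.get)
    return mon, skews[mon]

def current_worst_skew():
    """Poll the cluster; suitable for feeding a monitoring system."""
    out = subprocess.run(["ceph", "time-sync-status", "-f", "json"],
                         capture_output=True, text=True, check=True).stdout
    return worst_skew(json.loads(out))
```

Graphing this value over time should reveal whether the skew warnings follow a pattern, as asked above.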
[ceph-users] clock skew
Hi all,

On our three-node cluster, we have set up chrony for time sync, and even though chrony reports that it is synced to ntp time, at the same time ceph occasionally reports time skews that can last several hours. See for example:

root@ceph2:~# ceph -v
ceph version 12.2.10 (fc2b1783e3727b66315cc667af9d663d30fe7ed4) luminous (stable)
root@ceph2:~# ceph health detail
HEALTH_WARN clock skew detected on mon.1
MON_CLOCK_SKEW clock skew detected on mon.1
    mon.1 addr 10.10.89.2:6789/0 clock skew 0.506374s > max 0.5s (latency 0.000591877s)
root@ceph2:~# chronyc tracking
Reference ID    : 7F7F0101 ()
Stratum         : 10
Ref time (UTC)  : Wed Apr 24 19:05:28 2019
System time     : 0.00133 seconds slow of NTP time
Last offset     : -0.00524 seconds
RMS offset      : 0.00524 seconds
Frequency       : 12.641 ppm slow
Residual freq   : +0.000 ppm
Skew            : 0.000 ppm
Root delay      : 0.00 seconds
Root dispersion : 0.00 seconds
Update interval : 1.4 seconds
Leap status     : Normal
root@ceph2:~#

For the record: mon.1 = ceph2 = 10.10.89.2, and time is synced similarly with NTP on the two other nodes. We don't understand this... I have now injected mon_clock_drift_allowed 0.7, so at least we have HEALTH_OK again (to stop upsetting my monitoring system). But two questions:

- can anyone explain why this is happening? It looks as if ceph and NTP/chrony disagree on just how time-synced the servers are.
- how can we determine the current clock skew from Ceph's perspective? "ceph health detail" in the HEALTH_OK case does not show it. (I want to start monitoring it continuously, to see if I can find some sort of pattern.)

Thanks!

MJ
Re: [ceph-users] PG inconsistent, "pg repair" not working
Hi,

I was able to solve a similar issue on our cluster using this blog: https://ceph.com/geen-categorie/ceph-manually-repair-object/ It does help if you are running a 3/2 config. Perhaps it helps you as well.

MJ

On 09/25/2018 02:37 AM, Sergey Malinin wrote:

Hello,

During normal operation our cluster suddenly threw an error, and since then we have had 1 inconsistent PG, and one of the clients sharing a cephfs mount has started to occasionally log "ceph: Failed to find inode X". "ceph pg repair" deep-scrubs the PG and fails with the same error in the log. Can anyone advise how to fix this?

log entries:
2018-09-20 06:48:23.081 7f0b2efd9700 -1 log_channel(cluster) log [ERR] : 1.92 soid 1:496296a8:::1000f44d0f4.0018:head: failed to pick suitable object info
2018-09-20 06:48:23.081 7f0b2efd9700 -1 log_channel(cluster) log [ERR] : scrub 1.92 1:496296a8:::1000f44d0f4.0018:head on disk size (3751936) does not match object info size (0) adjusted for ondisk to (0)
2018-09-20 06:50:36.925 7f0b2efd9700 -1 log_channel(cluster) log [ERR] : 1.92 scrub 3 errors

# ceph -v
ceph version 13.2.1 (5533ecdc0fda920179d7ad84e0aa65a127b20d77) mimic (stable)

# ceph health detail
HEALTH_ERR 3 scrub errors; Possible data damage: 1 pg inconsistent
OSD_SCRUB_ERRORS 3 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
    pg 1.92 is active+clean+inconsistent, acting [4,9]

# rados list-inconsistent-obj 1.92
{"epoch":519,"inconsistents":[]}

# ceph pg 1.92 query
{ "state": "active+clean+inconsistent", "snap_trimq": "[]", "snap_trimq_len": 0, "epoch": 520, "up": [ 4, 9 ], "acting": [ 4, 9 ], "acting_recovery_backfill": [ "4", "9" ], "info": { "pgid": "1.92", "last_update": "520'2456340", "last_complete": "520'2456340", "log_tail": "520'2453330", "last_user_version": 7914566, "last_backfill": "MAX", "last_backfill_bitwise": 0, "purged_snaps": [], "history": { "epoch_created": 63, "epoch_pool_created": 63, "last_epoch_started": 520, "last_interval_started": 519, "last_epoch_clean": 520,
"last_interval_clean": 519, "last_epoch_split": 0, "last_epoch_marked_full": 0, "same_up_since": 519, "same_interval_since": 519, "same_primary_since": 514, "last_scrub": "520'2456105", "last_scrub_stamp": "2018-09-25 02:17:35.631365", "last_deep_scrub": "520'2456105", "last_deep_scrub_stamp": "2018-09-25 02:17:35.631365", "last_clean_scrub_stamp": "2018-09-19 02:27:22.656268" }, "stats": { "version": "520'2456340", "reported_seq": "6115579", "reported_epoch": "520", "state": "active+clean+inconsistent", "last_fresh": "2018-09-25 03:02:34.338256", "last_change": "2018-09-25 02:17:35.631476", "last_active": "2018-09-25 03:02:34.338256", "last_peered": "2018-09-25 03:02:34.338256", "last_clean": "2018-09-25 03:02:34.338256", "last_became_active": "2018-09-24 15:25:30.238044", "last_became_peered": "2018-09-24 15:25:30.238044", "last_unstale": "2018-09-25 03:02:34.338256", "last_undegraded": "2018-09-25 03:02:34.338256", "last_fullsized": "2018-09-25 03:02:34.338256", "mapping_epoch": 519, "log_start": "520'2453330", "ondisk_log_start": "520'2453330", "created": 63, "last_epoch_clean": 520, "parent": "0.0", "parent_split_bits": 0, "last_scrub": "520'2456105", "last_scrub_stamp": "2018-09-25 02:17:35.631365", "last_deep_scrub": "520'2456105", "last_deep_scrub_stamp": "2018-09-25 02:17:35.631365", "last_clean_scrub_stamp": "2018-09-19 02:27:22.656268", "log_size": 3010, "ondisk_log_size": 3010, "stats_invalid": false, "dirty_stats_invalid": false, "omap_stats_invalid": false, "hitset_stats_invalid": false, "hitset_bytes_stats_invalid": false, "pin_stats_invalid": false, "manifest_stats_invalid": false, "snaptrimq_len": 0, "stat_sum": { "num_bytes": 23138366490, "num_objects": 479532, "num_object_clones": 0, "num_object_copies": 959064, "num_objects_missing_on_primary": 0, "num_object
Re: [ceph-users] [slightly OT] XFS vs. BTRFS vs. others as root/usr/var/tmp filesystems ?
On 09/24/2018 08:53 AM, Nicolas Huillard wrote:

Thanks for your anecdote ;-) Could it be that I stack too many things (XFS in LVM in md-RAID in an SSD's FTL)?

No, we regularly use the same compound of layers, just without the SSD.

mj
Re: [ceph-users] [slightly OT] XFS vs. BTRFS vs. others as root/usr/var/tmp filesystems ?
On 09/24/2018 08:46 AM, Nicolas Huillard wrote:

Too bad, since this FS has a lot of very promising features. I view it as the single-host-ceph-like FS, and do not see any equivalent (apart from ZFS, which will also never be included in the kernel).

Agreed. It's also so much more flexible than zfs, like adding disks to raids to expand space, for example.

mj
Re: [ceph-users] [slightly OT] XFS vs. BTRFS vs. others as root/usr/var/tmp filesystems ?
Hi,

Just a very quick and simple reply: XFS has *always* treated us nicely, and we have been using it for a VERY long time, ever since the pre-2000 suse 5.2 days, on pretty much all our machines. We have seen only very few corruptions on xfs, and the few times we tried btrfs, (almost) always 'something' happened. (Same for the few times we tried reiserfs, btw.)

So, while my story may be very anecdotal (and you will probably find many others here claiming the opposite), our own conclusion is very clear: we love xfs, and do not like btrfs very much.

MJ

On 09/22/2018 10:58 AM, Nicolas Huillard wrote:

Hi all,

I don't have a good track record with XFS since I got rid of ReiserFS a long time ago. I decided XFS was a good idea on servers, while I tested BTRFS on various less important devices. So far, XFS has betrayed me far more often (a few times) than BTRFS (never). Last time was yesterday, on a root filesystem with "Block out of range: block 0x17b9814b0, EOFS 0x12a000" "I/O Error Detected. Shutting down filesystem" (shutting down the root filesystem is pretty hard).

Some threads on this ML discuss a similar problem, related to partitioning and logical sectors located just after the end of the partition. The problem here does not seem to be the same, as the requested block is very far out of bounds (2 orders of magnitude too far), and I use a recent Debian stock kernel with every security patch.

My question is: should I trust XFS for small root filesystems (/, /tmp, /var on LVM sitting within an md-RAID1 smallish partition), or is BTRFS finally trusty enough for a general-purpose cluster (still root et al. filesystems), or do you guys just use the distro-recommended setup (typically Ext4 on plain disks)?

Debian stretch with 4.9.110-3+deb9u4 kernel. Ceph 12.2.8 on bluestore (not related to the question).
Partial output of lsblk /dev/sdc /dev/nvme0n1:

NAME                      MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sdc                         8:32   0 447,1G  0 disk
├─sdc1                      8:33   0  55,9G  0 part
│ └─md0                     9:0    0  55,9G  0 raid1
│   ├─oxygene_system-root 253:4    0   9,3G  0 lvm   /
│   ├─oxygene_system-tmp  253:5    0   9,3G  0 lvm   /tmp
│   └─oxygene_system-var  253:6    0   4,7G  0 lvm   /var
└─sdc2                      8:34   0  29,8G  0 part  [SWAP]
nvme0n1                   259:0    0   477G  0 disk
├─nvme0n1p1               259:1    0  55,9G  0 part
│ └─md0                     9:0    0  55,9G  0 raid1
│   ├─oxygene_system-root 253:4    0   9,3G  0 lvm   /
│   ├─oxygene_system-tmp  253:5    0   9,3G  0 lvm   /tmp
│   └─oxygene_system-var  253:6    0   4,7G  0 lvm   /var
├─nvme0n1p2               259:2    0  29,8G  0 part  [SWAP]

TIA !
Re: [ceph-users] Proxmox/ceph upgrade and addition of a new node/OSDs
Hi Hervé!

Thanks for the detailed summary, much appreciated!

Best, MJ

On 09/21/2018 09:03 AM, Hervé Ballans wrote:

Hi MJ (and all),

So we upgraded our Proxmox/Ceph cluster, and to summarize the operation in a few words: overall, everything went well :) The most critical operation of all is the 'osd crush tunables optimal'; I talk about it in more detail below...

The Proxmox documentation is really well written and accurate and, normally, following the documentation step by step is almost sufficient!

* first step: upgrade Ceph Jewel to Luminous: https://pve.proxmox.com/wiki/Ceph_Jewel_to_Luminous (Note here: OSDs remain on the FileStore backend, no BlueStore migration)
* second step: upgrade Proxmox version 4 to 5: https://pve.proxmox.com/wiki/Upgrade_from_4.x_to_5.0

Just some numbers, observations and tips (based on our feedback; I'm not an expert!):

* Before migration, make sure you are on the latest version of Proxmox 4 (4.4-24) and Ceph Jewel (10.2.11).
* We don't use the pve repository for ceph packages but the official one (download.ceph.com). Thus, during the upgrade of Proxmox PVE, we don't replace the ceph.com repository with the proxmox.com Ceph repository...
* When you upgrade Ceph to Luminous (without tunables optimal), there is no impact on Proxmox 4; VMs keep running normally. The side effect (non-blocking for the functioning of VMs) is located in the GUI, in the Ceph menu: it can't report the status of the ceph cluster as it hits a JSON formatting error (indeed the output of the command 'ceph -s' is completely different, and really more readable, on Luminous).
* A little step is missing in section 8 "Create Manager instances" of the upgrade ceph documentation. As the Ceph manager daemon is new since Luminous, the package doesn't exist on Jewel, so you have to install the ceph-mgr package on each node first before doing 'pveceph createmgr'.
* The 'osd crush tunables optimal' operation is time consuming! In our case: 5 nodes (PE R730xd), 58 OSDs, a replicated (3/2) rbd pool with 2048 pgs and 2 million objects, 22 TB used. The tunables operation took a little more than 24 hours!
* Really take the right time to do the 'tunables optimal'! We encountered some stuck pgs and blocked requests during this operation. In our case, the involved OSDs were those with a high number of pgs (as they are high-capacity disks). The consequences can be critical, since it can freeze some VMs (I guess those whose replicas are stored on the stuck pgs?). The stuck state was corrected by rebooting the involved OSDs.

If you can move the disks of your critical VMs to another storage, those VMs should not be impacted by the recovery (we moved some disks to another Ceph cluster, kept the conf in the Proxmox cluster being updated, and there was no impact). Otherwise:
- verify that all your VMs have recently been backed up to an external storage (in case of a disaster recovery plan!)
- if you can, stop all your non-critical VMs (in order to limit client io operations)
- if any, wait for the end of current backups, then disable datacenter backup (in order to limit client io operations). !! do not forget to re-enable it when all is over !!
- if any, and if no longer needed, delete your snapshots; it removes many useless objects!
- start the tunables operation outside of major activity periods (night, weekend, ...) and take into account that it can be very slow...

There are probably some options to configure in ceph to avoid 'pgs stuck' states, but on our side, as we had previously moved our critical VMs' disks, we didn't worry about that!

* Anyway, the Proxmox PVE upgrade step itself is done easily and quickly (just follow the documentation). Note that you can upgrade Proxmox PVE before doing the 'tunables optimal' operation.

Hoping that you will find this information useful, good luck with your very next migration!
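Not part of Hervé's write-up, but the usual knobs for softening a big rebalance like 'tunables optimal' are the recovery/backfill throttles. A hedged sketch with Jewel/Luminous-era `injectargs` syntax (the values are conservative examples, and slower recovery means a longer rebalance window):

```shell
# Throttle recovery before running `ceph osd crush tunables optimal`
ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'

# Optionally stop scrubbing from competing with the rebalance
ceph osd set noscrub
ceph osd set nodeep-scrub

# ... run the tunables change, watch `ceph -s` until HEALTH_OK ...

# Afterwards, unset the flags and restore your previous values
ceph osd unset noscrub
ceph osd unset nodeep-scrub
```

This trades client-visible impact against total rebalance time, which fits the advice above to run the operation outside major activity periods.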
Hervé

On 13/09/2018 at 22:04, mj wrote:

Hi Hervé,

No answer from me, but just to say that I have exactly the same upgrade path ahead of me. :-) Please report here any tips, tricks, or things you encountered doing the upgrades. It could potentially save us a lot of time. :-)

Thanks! MJ

On 09/13/2018 05:23 PM, Hervé Ballans wrote:

Dear list,

I am currently in the process of upgrading Proxmox 4/Jewel to Proxmox 5/Luminous. I also have a new node to add to my Proxmox cluster. What I plan to do is the following (from https://pve.proxmox.com/wiki/Ceph_Jewel_to_Luminous):

* upgrade Jewel to Luminous
* let the "ceph osd crush tunables optimal" command run
* upgrade my proxmox to v5
* add the new node (already up to date in v5)
* add the new OSDs
* let ceph rebalance the lot

A couple of questions I have:

* would it be a good idea to add the new node+OSDs and run the "tunables optimal" command immediately after, which would maybe gain a little time and avoid two successive pg rebalancings?
Re: [ceph-users] Proxmox/ceph upgrade and addition of a new node/OSDs
Hi Hervé,

No answer from me, but just to say that I have exactly the same upgrade path ahead of me. :-) Please report here any tips, tricks, or things you encountered doing the upgrades. It could potentially save us a lot of time. :-)

Thanks! MJ

On 09/13/2018 05:23 PM, Hervé Ballans wrote:

Dear list,

I am currently in the process of upgrading Proxmox 4/Jewel to Proxmox 5/Luminous. I also have a new node to add to my Proxmox cluster. What I plan to do is the following (from https://pve.proxmox.com/wiki/Ceph_Jewel_to_Luminous):

* upgrade Jewel to Luminous
* let the "ceph osd crush tunables optimal" command run
* upgrade my proxmox to v5
* add the new node (already up to date in v5)
* add the new OSDs
* let ceph rebalance the lot

A couple of questions I have:

* would it be a good idea to add the new node+OSDs and run the "tunables optimal" command immediately after, which would maybe gain a little time and avoid two successive pg rebalancings?
* did I miss anything in this plan?

Regards, Hervé
Re: [ceph-users] HEALTH_ERR vs HEALTH_WARN
Hi Mark, others,

I took my info from the following page: https://ceph.com/geen-categorie/ceph-manually-repair-object/ where it is written: "Of course the above works well when you have 3 replicas when it is easier for Ceph to compare two versions against another one."

Based on that info, I assumed that a simple "ceph pg repair 2.1a9" was enough to solve this without introducing corruption into our 3/2 cluster.

MJ

On 08/23/2018 12:28 PM, Mark Schouten wrote:

Gregory's answer worries us. We thought that with a 3/2 pool, and one PG corrupted, the assumption would be: the two similar ones are correct, and the third one needs to be adjusted. Can we determine from this output if I created corruption in our cluster..?

I second this assumption. Can someone clarify?
Re: [ceph-users] HEALTH_ERR vs HEALTH_WARN
Hi,

Thanks John and Gregory for your answers. Gregory's answer worries us. We thought that with a 3/2 pool, and one PG corrupted, the assumption would be: the two similar ones are correct, and the third one needs to be adjusted. Can we determine from this output if I created corruption in our cluster..?

root@pm1:~# ceph health detail
HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
pg 2.1a9 is active+clean+inconsistent, acting [15,23,6]
1 scrub errors
root@pm1:~# zgrep 2.1a9 /var/log/ceph/ceph.log*
/var/log/ceph/ceph.log.14.gz:2017-09-11 21:02:24.755778 osd.15 10.10.89.1:6812/3810 2122 : cluster [INF] 2.1a9 deep-scrub starts
/var/log/ceph/ceph.log.14.gz:2017-09-11 21:08:10.537249 osd.15 10.10.89.1:6812/3810 2123 : cluster [INF] 2.1a9 deep-scrub ok
/var/log/ceph/ceph.log.1.gz:2018-08-22 04:33:21.156004 osd.15 10.10.89.1:6800/3352 18074 : cluster [INF] 2.1a9 deep-scrub starts
/var/log/ceph/ceph.log.1.gz:2018-08-22 04:40:02.579204 osd.15 10.10.89.1:6800/3352 18075 : cluster [ERR] 2.1a9 shard 23: soid 2:95b8d975:::rbd_data.2c191e238e1f29.000c7c9d:head candidate had a read error
/var/log/ceph/ceph.log.1.gz:2018-08-22 04:41:02.720716 osd.15 10.10.89.1:6800/3352 18076 : cluster [ERR] 2.1a9 deep-scrub 0 missing, 1 inconsistent objects
/var/log/ceph/ceph.log:2018-08-22 08:23:09.682792 osd.15 10.10.89.1:6800/3352 18088 : cluster [INF] 2.1a9 repair starts
/var/log/ceph/ceph.log:2018-08-22 08:29:28.440526 osd.15 10.10.89.1:6800/3352 18089 : cluster [ERR] 2.1a9 shard 23: soid 2:95b8d975:::rbd_data.2c191e238e1f29.000c7c9d:head candidate had a read error
/var/log/ceph/ceph.log:2018-08-22 08:30:18.790176 osd.15 10.10.89.1:6800/3352 18090 : cluster [ERR] 2.1a9 repair 0 missing, 1 inconsistent objects
/var/log/ceph/ceph.log:2018-08-22 08:30:18.791718 osd.15 10.10.89.1:6800/3352 18091 : cluster [ERR] 2.1a9 repair 1 errors, 1 fixed

And also: jewel (which we're running) is considered "the old past", with the old non-checksum behaviour? In case this occurs again...
what would be the steps to determine WHICH pg copy is the corrupt one, and how to proceed if it happens to be the primary pg for an object? Upgrading to luminous would prevent this from happening again, I guess. We're a bit scared to upgrade, because there seem to be so many issues with luminous and upgrading to it.

Having said all this: we are surprised to see this on our cluster, as it should be, and has been, running stable and reliably for over two years. Perhaps just a one-time glitch.

Thanks for your replies!

MJ

On 08/23/2018 01:06 AM, Gregory Farnum wrote:
On Wed, Aug 22, 2018 at 2:46 AM John Spray <jsp...@redhat.com> wrote:
On Wed, Aug 22, 2018 at 7:57 AM mj <li...@merit.unu.edu> wrote:
>
> Hi,
>
> This morning I woke up, seeing my ceph jewel 10.2.10 cluster in
> HEALTH_ERR state. That helps you getting out of bed. :-)
>
> Anyway, much to my surprise, all VMs running on the cluster were still
> working like nothing was going on. :-)
>
> Checking a bit more revealed:
>
> root@pm1:~# ceph -s
>     cluster 1397f1dc-7d94-43ea-ab12-8f8792eee9c1
>      health HEALTH_ERR
>             1 pgs inconsistent
>             1 scrub errors
>      monmap e3: 3 mons at {0=10.10.89.1:6789/0,1=10.10.89.2:6789/0,2=10.10.89.3:6789/0}
>             election epoch 296, quorum 0,1,2 0,1,2
>      osdmap e12662: 24 osds: 24 up, 24 in
>             flags sortbitwise,require_jewel_osds
>       pgmap v64045618: 1088 pgs, 2 pools, 14023 GB data, 3680 kobjects
>             44027 GB used, 45353 GB / 89380 GB avail
>                 1087 active+clean
>                    1 active+clean+inconsistent
>   client io 26462 kB/s rd, 14048 kB/s wr, 6 op/s rd, 383 op/s wr
>
> root@pm1:~# ceph health detail
> HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
> pg 2.1a9 is active+clean+inconsistent, acting [15,23,6]
> 1 scrub errors
>
> root@pm1:~# zgrep 2.1a9 /var/log/ceph/ceph.log*
> /var/log/ceph/ceph.log.14.gz:2017-09-11 21:02:24.755778 osd.15 10.10.89.1:6812/3810 2122 : cluster [INF] 2.1a9 deep-scrub starts
> /var/log/ceph/ceph.log.14.gz:2017-09-11 21:08:10.537249 osd.15 10.10.89.1:6812/3810 2123 : cluster [INF] 2.1a9 deep-scrub ok
> /var/log/ceph/ceph.log.1.gz:2018-08-22 04:33:21.156004 osd.15 10.10.89.1:6800/3352 18074 : cluster [INF] 2.1a9 deep-scrub starts
> /var/log/ceph/ceph.log.1.gz:2018-08-2
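On "which copy is the corrupt one": after a deep-scrub flags the PG, `rados list-inconsistent-obj <pgid>` reports per-shard errors (in this thread's log the scrub already blames "shard 23", i.e. the replica on osd.23). Below is a small hypothetical helper for picking the errored shards out of that report; the JSON layout used here (an "inconsistents" list with per-object "shards", each carrying "osd" and "errors") is an assumption to verify against your own cluster's output:

```python
def bad_shards(report):
    """From an assumed `rados list-inconsistent-obj <pg> --format=json` report,
    return (osd_id, errors) for every shard that reported errors."""
    out = []
    for obj in report.get("inconsistents", []):
        for shard in obj.get("shards", []):
            if shard.get("errors"):
                out.append((shard["osd"], shard["errors"]))
    return out
```

If the bad shard is on the primary, the usual advice (per the manual-repair blog cited elsewhere in this archive) is to move the damaged copy aside on that OSD before running `ceph pg repair`, so repair cannot propagate the bad version.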
[ceph-users] HEALTH_ERR vs HEALTH_WARN
Hi,

This morning I woke up, seeing my ceph jewel 10.2.10 cluster in HEALTH_ERR state. That helps you getting out of bed. :-)

Anyway, much to my surprise, all VMs running on the cluster were still working like nothing was going on. :-)

Checking a bit more revealed:

root@pm1:~# ceph -s
    cluster 1397f1dc-7d94-43ea-ab12-8f8792eee9c1
     health HEALTH_ERR
            1 pgs inconsistent
            1 scrub errors
     monmap e3: 3 mons at {0=10.10.89.1:6789/0,1=10.10.89.2:6789/0,2=10.10.89.3:6789/0}
            election epoch 296, quorum 0,1,2 0,1,2
     osdmap e12662: 24 osds: 24 up, 24 in
            flags sortbitwise,require_jewel_osds
      pgmap v64045618: 1088 pgs, 2 pools, 14023 GB data, 3680 kobjects
            44027 GB used, 45353 GB / 89380 GB avail
                1087 active+clean
                   1 active+clean+inconsistent
  client io 26462 kB/s rd, 14048 kB/s wr, 6 op/s rd, 383 op/s wr

root@pm1:~# ceph health detail
HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
pg 2.1a9 is active+clean+inconsistent, acting [15,23,6]
1 scrub errors

root@pm1:~# zgrep 2.1a9 /var/log/ceph/ceph.log*
/var/log/ceph/ceph.log.14.gz:2017-09-11 21:02:24.755778 osd.15 10.10.89.1:6812/3810 2122 : cluster [INF] 2.1a9 deep-scrub starts
/var/log/ceph/ceph.log.14.gz:2017-09-11 21:08:10.537249 osd.15 10.10.89.1:6812/3810 2123 : cluster [INF] 2.1a9 deep-scrub ok
/var/log/ceph/ceph.log.1.gz:2018-08-22 04:33:21.156004 osd.15 10.10.89.1:6800/3352 18074 : cluster [INF] 2.1a9 deep-scrub starts
/var/log/ceph/ceph.log.1.gz:2018-08-22 04:40:02.579204 osd.15 10.10.89.1:6800/3352 18075 : cluster [ERR] 2.1a9 shard 23: soid 2:95b8d975:::rbd_data.2c191e238e1f29.000c7c9d:head candidate had a read error
/var/log/ceph/ceph.log.1.gz:2018-08-22 04:41:02.720716 osd.15 10.10.89.1:6800/3352 18076 : cluster [ERR] 2.1a9 deep-scrub 0 missing, 1 inconsistent objects

ok, according to the docs I should do "ceph pg repair 2.1a9".
Did that, and some minutes later the cluster came back to HEALTH_OK. Checking the logs:

/var/log/ceph/ceph.log:2018-08-22 08:23:09.682792 osd.15 10.10.89.1:6800/3352 18088 : cluster [INF] 2.1a9 repair starts
/var/log/ceph/ceph.log:2018-08-22 08:29:28.440526 osd.15 10.10.89.1:6800/3352 18089 : cluster [ERR] 2.1a9 shard 23: soid 2:95b8d975:::rbd_data.2c191e238e1f29.000c7c9d:head candidate had a read error
/var/log/ceph/ceph.log:2018-08-22 08:30:18.790176 osd.15 10.10.89.1:6800/3352 18090 : cluster [ERR] 2.1a9 repair 0 missing, 1 inconsistent objects
/var/log/ceph/ceph.log:2018-08-22 08:30:18.791718 osd.15 10.10.89.1:6800/3352 18091 : cluster [ERR] 2.1a9 repair 1 errors, 1 fixed

So, we are fine again, it seems. But now my question: can anyone explain what happened? Is one of my disks dying? In the proxmox gui, all osd disks have SMART status "OK".

Besides that, as the cluster was still running and the fix was relatively simple, would a HEALTH_WARN not have been more appropriate? And, since this is a size 3, min 2 pool... shouldn't this have been taken care of automatically..? ('self-healing' and all that..?)

So, I'm having my morning coffee finally, wondering what happened... :-)

Best regards to all, have a nice day!

MJ
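On the "is one of my disks dying?" question: the scrub errors blame "shard 23", i.e. the copy on osd.23, so that OSD's backing disk is the one to inspect more closely than the GUI's summary SMART status. A hedged sketch of the usual mapping steps (the device name is an example, and the exact fields printed by these commands vary by Ceph release):

```shell
# The scrub errors name "shard 23" -- the replica on osd.23
ceph osd find 23          # which host holds osd.23
ceph osd metadata 23      # includes hostname and backend details

# On that host: map the OSD's data directory to a block device,
# then read the raw SMART attributes (device name illustrative)
df /var/lib/ceph/osd/ceph-23
smartctl -a /dev/sdX | grep -iE 'reallocated|pending|uncorrect'
```

A read error during scrub with SMART still "OK" overall is plausible: the summary health flag often stays PASSED while attributes such as Current_Pending_Sector already show trouble, so the raw attribute values are worth watching.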
Re: [ceph-users] lacp bonding | working as expected..?
Hi Jacob,

Thanks for your reply. But I'm not sure I completely understand it. :-)

On 06/21/2018 09:09 PM, Jacob DeGlopper wrote: In your example, where you see one link being used, I see an even source IP paired with an odd destination port number for both transfers, or is that a search and replace issue?

Well, I left portnumbers as they were, I edited the IPs. Actually the machines are not a.b.c.9 and a.b.c.10, but a.b.c.204 and a.b.c.205; for the rest, everything is unedited. So a single link example:

Client connecting to a.b.c.205, TCP port 5001
TCP window size: 85.0 KByte (default)
[ 3] local a.b.c.204 port 60600 connected with a.b.c.205 port 5001
Client connecting to a.b.c.205, TCP port 5000
TCP window size: 85.0 KByte (default)
[ 3] local a.b.c.204 port 53788 connected with a.b.c.205 port 5000
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 746 MBytes 625 Mbits/sec
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 383 MBytes 321 Mbits/sec

And a lucky example:

Client connecting to a.b.c.205, TCP port 5001
TCP window size: 85.0 KByte (default)
[ 3] local a.b.c.204 port 37984 connected with a.b.c.205 port 5001
Client connecting to a.b.c.205, TCP port 5000
TCP window size: 85.0 KByte (default)
[ 3] local a.b.c.204 port 48850 connected with a.b.c.205 port 5000
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 1.09 GBytes 936 Mbits/sec
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 885 MBytes 742 Mbits/sec

(reason for the a.b.c.204 is that the IPs are public, and I'd rather not put them here)

I don't see the odd/even port numbers thing you noticed..? (I could very well miss something though) I see no way to specify what outgoing port iperf should use, otherwise I could try again using the same ports, to check the pattern.

Thanks again!
MJ
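For reference, the Linux bonding documentation gives the layer3+4 transmit hash as ((source port XOR dest port) XOR ((source IP XOR dest IP) AND 0xffff)) modulo the number of slaves. A sketch of that arithmetic, assuming (as in the thread) the two hosts' IPs differ only in the last octet, 204 vs 205:

```shell
#!/bin/sh
# Sketch of the bonding layer3+4 transmit hash from the Linux bonding docs:
#   slave = ((sport XOR dport) XOR ((sip XOR dip) AND 0xffff)) mod n_slaves
# Assumption: the hosts' IPs are equal except the last octet (204 vs 205),
# so the IP term reduces to 204 XOR 205 = 1.
ip_term=$(( (204 ^ 205) & 0xffff ))
slaves=2

slave_for_flow() {
    # $1 = source port, $2 = destination port
    echo $(( ( ($1 ^ $2) ^ ip_term ) % slaves ))
}

# The four iperf flows quoted above (source port, destination port):
a=$(slave_for_flow 60600 5001)
b=$(slave_for_flow 53788 5000)
c=$(slave_for_flow 37984 5001)
d=$(slave_for_flow 48850 5000)
echo "$a $b $c $d"
```

Note this only covers the transmit direction of the bond, and only if xmit_hash_policy is actually set to layer3+4: the default policy is layer2 (MAC-based), which maps all traffic between the same two hosts onto a single slave, and the switch applies its own hash for the return path. Either of those could explain two iperf flows sharing one 1G link.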
Re: [ceph-users] Adding additional disks to the production cluster without performance impacts on the existing
Hi Pardhiv,

On 06/08/2018 05:07 AM, Pardhiv Karri wrote: We recently added a lot of nodes to our ceph clusters. To mitigate a lot of problems (we are using the tree algorithm) we added an empty node first to the crushmap and then added OSDs with zero weight, made sure the ceph health was OK, and then started ramping up each OSD. I created a script to do it dynamically, which will check CPU of the new host with OSDs that

Would you mind sharing this script..?

MJ
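The ramp-up approach Pardhiv describes (add OSDs at crush weight zero, then raise the weight in steps, letting the cluster settle between steps) can be sketched as a small loop. This is a hypothetical sketch, not Pardhiv's actual script (which wasn't posted); the OSD id, target weight, and step size are made up, and CEPH is a dry-run stub that only prints the commands:

```shell
#!/bin/sh
# Hypothetical sketch: ramp a zero-weighted OSD up to its target crush
# weight in small steps. Replace "echo ceph" with "ceph" for real use.
CEPH="echo ceph"
OSD="osd.24"        # hypothetical new OSD, created with crush weight 0
TARGET_TENTHS=36    # final weight 3.6, roughly right for a 4TB disk
STEP_TENTHS=6       # raise by 0.6 per step

cmds=$(
    w=0
    while [ "$w" -lt "$TARGET_TENTHS" ]; do
        w=$((w + STEP_TENTHS))
        [ "$w" -gt "$TARGET_TENTHS" ] && w=$TARGET_TENTHS
        $CEPH osd crush reweight "$OSD" "$(awk "BEGIN{print $w/10}")"
        # On a real cluster, wait between steps until recovery settles:
        #   while ! ceph health | grep -q HEALTH_OK; do sleep 60; done
    done
)
printf '%s\n' "$cmds"
```

With the stub this prints the six reweight commands (0.6 up to 3.6); on a live cluster each step triggers a small, bounded rebalance instead of one big one.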
Re: [ceph-users] Adding cluster network to running cluster
On 06/07/2018 01:45 PM, Wido den Hollander wrote: Removing cluster network is enough. After the restart the OSDs will not publish a cluster network in the OSDMap anymore. You can keep the public network in ceph.conf and can even remove that after you removed the 10.10.x.x addresses from the system(s). Wido Thanks for the info, Wido. :-) ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] tunable question
Hi,

For the record: we changed tunables from "hammer" to "optimal" yesterday at 14:00, and it finished this morning at 9:00, so rebalancing took 19 hours. This was on a small ceph cluster: 24 4TB OSDs spread over three hosts, connected over 10G ethernet. Total amount of data: 32730 GB used, 56650 GB / 89380 GB avail.

We set noscrub and nodeep-scrub during the rebalance, and our VMs experienced basically no impact.

MJ

On 10/03/2017 05:37 PM, lists wrote: Thanks Jake, for your extensive reply. :-) MJ

On 3-10-2017 15:21, Jake Young wrote:
On Tue, Oct 3, 2017 at 8:38 AM lists <li...@merit.unu.edu> wrote: Hi, What would make the decision easier: if we knew that we could easily revert "ceph osd crush tunables optimal" once it has begun rebalancing data? Meaning: if we notice that the impact is too high, or it will take too long, could we simply say "ceph osd crush tunables hammer" again and the cluster would calm down?

Yes, you can revert the tunables back; but it will then move all the data back where it was, so be prepared for that.

Verify you have the following values in ceph.conf. Note that these are the defaults in Jewel, so if they aren't defined, you're probably good:
osd_max_backfills=1
osd_recovery_threads=1

You can try to set these (using ceph injectargs) if you notice a large impact on your client performance:
osd_recovery_op_priority=1
osd_recovery_max_active=1
osd_recovery_threads=1

I recall this tunables change when we went from hammer to jewel last year. It took over 24 hours to rebalance 122TB on our 110 osd cluster.

Jake

On 2-10-2017 9:41, Manuel Lausch wrote:
> Hi,
>
> We have similar issues. After upgrading from hammer to jewel, the
> "chooseleaf_stable" tunable was introduced. If we activate it, nearly
> all data will be moved. The cluster has 2400 OSDs on 40 nodes over two
> datacenters and is filled with 2.5 PB of data.
>
> We tried to enable it, but the backfilling traffic is too high to be
> handled without impacting other services on the network.
>
> Does someone know if it is necessary to enable this tunable? And could
> it be a problem in the future if we want to upgrade to newer versions
> without it enabled?
>
> Regards,
> Manuel Lausch
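The procedure reported above (pause scrubbing, switch tunables, wait out the rebalance, re-enable scrubbing) fits in a few commands. A sketch; CEPH is a dry-run stub that only prints the commands, so drop the "echo" on a real cluster:

```shell
#!/bin/sh
# Sketch of the tunables change performed above: pause scrubbing,
# change tunables, wait, re-enable scrubbing. Dry-run stub prints only.
CEPH="echo ceph"

log=$(
    $CEPH osd set noscrub
    $CEPH osd set nodeep-scrub
    $CEPH osd crush tunables optimal
    # ...wait here until "ceph -s" reports HEALTH_OK again
    # (19 hours for the cluster in the report above)...
    $CEPH osd unset noscrub
    $CEPH osd unset nodeep-scrub
)
printf '%s\n' "$log"
```

The noscrub/nodeep-scrub flags keep scrub I/O from piling on top of the backfill traffic, which is presumably why the VMs saw basically no impact.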
Re: [ceph-users] tunable question
Hi Dan, list,

Our cluster is small: three nodes, 24 4TB platter OSDs in total, SSD journals. Using rbd for VMs. That's it. Runs nicely though :-)

The fact that "tunable optimal" for jewel would result in "significantly fewer mappings change when an OSD is marked out of the cluster" is what attracts us. Reasoning behind it: upgrading to "optimal" NOW should result in faster rebuild-time when disaster strikes and we're all stressed out. :-)

After the jewel upgrade, we also upgraded the tunables from "(require bobtail, min is firefly)" to "hammer". This resulted in approx 24 hours of rebuild, but actually without significant impact on the hosted VMs. Is it safe to assume that setting it to "optimal" would have a similar impact, or are the implications bigger?

MJ

On 09/28/2017 10:29 AM, Dan van der Ster wrote: Hi, How big is your cluster and what is your use case? For us, we'll likely never enable the recent tunables that need to remap *all* PGs -- it would simply be too disruptive for marginal benefit. Cheers, Dan

On Thu, Sep 28, 2017 at 9:21 AM, mj <li...@merit.unu.edu> wrote: Hi, We have completed the upgrade to jewel, and we set tunables to hammer. Cluster again HEALTH_OK. :-) But now, we would like to proceed in the direction of luminous and bluestore OSDs, and we would like to ask for some feedback first. From the jewel ceph docs on tunables: "Changing tunable to "optimal" on an existing cluster will result in a very large amount of data movement as almost every PG mapping is likely to change." Given the above, and the fact that we would like to proceed to luminous/bluestore in the not too far away future, what is cleverer:

1 - keep the cluster at tunable hammer now, upgrade to luminous in a little while, change OSDs to bluestore, and then set tunables to optimal, or
2 - set tunable to optimal now, take the impact of "almost all PG remapping", and when that is finished, upgrade to luminous, bluestore etc.

Which route is the preferred one?
Or is there a third (or fourth?) option..? :-) MJ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] tunable question
Hi,

We have completed the upgrade to jewel, and we set tunables to hammer. Cluster again HEALTH_OK. :-)

But now, we would like to proceed in the direction of luminous and bluestore OSDs, and we would like to ask for some feedback first. From the jewel ceph docs on tunables: "Changing tunable to "optimal" on an existing cluster will result in a very large amount of data movement as almost every PG mapping is likely to change."

Given the above, and the fact that we would like to proceed to luminous/bluestore in the not too far away future, what is cleverer:

1 - keep the cluster at tunable hammer now, upgrade to luminous in a little while, change OSDs to bluestore, and then set tunables to optimal, or
2 - set tunable to optimal now, take the impact of "almost all PG remapping", and when that is finished, upgrade to luminous, bluestore etc.

Which route is the preferred one? Or is there a third (or fourth?) option..? :-)

MJ
Re: [ceph-users] librmb: Mail storage on RADOS with Dovecot
Hi,

I forwarded your announcement to the dovecot mailinglist. The following reply to it was posted there by Timo Sirainen. I'm forwarding it here, as you might not be reading the dovecot mailinglist.

Wido: First, the Github link: https://github.com/ceph-dovecot/dovecot-ceph-plugin I am not going to repeat everything which is on Github, but here's a short summary:

- CephFS is used for storing Mailbox Indexes
- E-Mails are stored directly as RADOS objects
- It's a Dovecot plugin

We would like everybody to test librmb and report back issues on Github so that further development can be done. It's not finalized yet, but all the help is welcome to make librmb the best solution for storing your e-mails on Ceph with Dovecot.

Timo: It would have been nicer if RADOS support was implemented as a lib-fs driver, and the fs-API had been used all over the place elsewhere. That way 1) LibRadosMailBox wouldn't have been relying so much on RADOS specifically and 2) fs-rados could have been used for other purposes. There are already fs-dict and dict-fs drivers, so the RADOS dict driver may not have been necessary to implement if fs-rados was implemented instead (although I didn't check it closely enough to verify). (We've had fs-rados on our TODO list for a while also.)

BTW. We've also been planning on open sourcing some of the obox pieces, mainly fs-drivers (e.g. fs-s3). The obox format maybe too, but without the "metacache" piece. The current obox code is a bit too much married into the metacache though to make open sourcing it easy. (The metacache is about storing the Dovecot index files in object storage and efficiently caching them on the local filesystem, which isn't planned to be open sourced in the near future. That's pretty much the only difficult piece of the obox plugin, with Cassandra integration coming as a good second. I wish there had been a better/easier geo-distributed key-value database to use - tombstones are annoyingly troublesome.)
And using the rmb-mailbox format, my main worries would be:

* it doesn't store index files (= message flags) - not necessarily a problem, as long as you don't want geo-replication
* index corruption means rebuilding them, which means rescanning the list of mail files, which means rescanning the whole RADOS namespace, which practically means rescanning the RADOS pool. That most likely is a very, very slow operation, which you want to avoid unless it's absolutely necessary. Need to be very careful to avoid that happening, and in general to avoid losing mails in case of crashes or other bugs.
* I think copying/moving mails physically copies the full data on disk
* Each IMAP/POP3/LMTP/etc process connects to RADOS separately from each other - some connection pooling would likely help here
Re: [ceph-users] Restart ceph cluster
Hi,

On 05/12/2017 03:35 PM, Vladimir Prokofev wrote: My best guess is that using systemd you can write some basic script to restart whatever OSDs you want. Another option is to use the same mechanics that ceph-deploy uses, but the principle is all the same - write some automation script. I would love to hear from people who have 100-1000+ OSDs in production though.

That's surely not me, but inserting config changes/updates is easily done to all OSDs with:

ceph tell osd.* injectargs

The only thing to take into consideration is to _also_ make the changes in ceph.conf, so they persist after a reboot.

But perhaps I completely misunderstand your question... ;-)

MJ
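The injectargs-plus-ceph.conf pattern described here can be sketched as follows; the setting used (osd_max_backfills) is just an example, and CEPH is a dry-run stub that only prints the command:

```shell
#!/bin/sh
# Sketch: push a runtime setting to all running OSDs at once. Remember
# that injectargs is runtime-only, so the same value must also go into
# ceph.conf on every node to survive a restart. Dry-run stub prints only.
CEPH="echo ceph"

out=$($CEPH tell 'osd.*' injectargs '--osd_max_backfills 1')
printf '%s\n' "$out"

# And persist it, e.g. in /etc/ceph/ceph.conf on every node:
#   [osd]
#   osd max backfills = 1
```

With the stub this prints the exact command that would be run; on a live cluster each OSD acknowledges (or rejects) the injected option.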
Re: [ceph-users] slow requests and short OSD failures in small cluster
Hi Gregory,

Reading your reply with great interest, thanks. Can you confirm my understanding now:

- live snapshots are more expensive for the cluster as a whole than taking the snapshot when the VM is switched off?
- using fstrim in VMs is (much?) more expensive when the VM has existing snapshots?
- it might be worthwhile to postpone upgrading from hammer to jewel until after your big announcement?
- we are on xfs (both for the ceph OSDs and the VMs) and that is the best combination to avoid these slow requests and CoW overhead with snapshots (or at least to minimise their impact)?

Any other tips, do's or don'ts, or things to keep in mind related to snapshots, VM/OSD filesystems, or using fstrim..? (our cluster is also small: hammer, three servers with 8 OSDs each, and journals on ssd, plenty of cpu/ram)

Again, thanks for your interesting post.

MJ
Re: [ceph-users] slow requests and short OSD failures in small cluster
On 04/18/2017 11:24 AM, Jogi Hofmüller wrote: This might have been true for hammer and older versions of ceph. From what I can tell now, every snapshot taken reduces performance of the entire cluster :(

Really? Can others confirm this? Is this a 'well-known fact'? (unknown only to us, perhaps...)

We are still on hammer, but if the result of upgrading to jewel is actually a massive performance decrease, I might postpone as long as possible... Most of our VMs have a snapshot or two...

MJ
Re: [ceph-users] slow requests and short OSD failures in small cluster
ah right: _during_ the actual removal, you mean. :-) clear now. mj On 04/13/2017 05:50 PM, Lionel Bouton wrote: Le 13/04/2017 à 17:47, mj a écrit : Hi, On 04/13/2017 04:53 PM, Lionel Bouton wrote: We use rbd snapshots on Firefly (and Hammer now) and I didn't see any measurable impact on performance... until we tried to remove them. What exactly do you mean with that? Just what I said : having snapshots doesn't impact performance, only removing them (obviously until Ceph is finished cleaning up). Lionel ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] slow requests and short OSD failures in small cluster
Hi, On 04/13/2017 04:53 PM, Lionel Bouton wrote: We use rbd snapshots on Firefly (and Hammer now) and I didn't see any measurable impact on performance... until we tried to remove them. What exactly do you mean with that? MJ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] clock skew
On 04/01/2017 04:02 PM, John Petrini wrote: Hello, I'm also curious about the impact of clock drift. We see the same on both of our clusters despite trying various NTP servers, including our own local servers. Ultimately we just ended up adjusting our monitoring to be less sensitive to it, since the clock drift always resolves on its own. Is this a dangerous practice?

Are you running ntp, or is this chrony? (which I did not know of)
Re: [ceph-users] clock skew
Hi, On 04/01/2017 02:10 PM, Wido den Hollander wrote: You could try the chrony NTP daemon instead of ntpd and make sure all MONs are peers from each other. I understand now what that means. I have set it up according to your suggestion. Curious to see how this works out, thanks! MJ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] clock skew
Hi Wido,

On 04/01/2017 02:10 PM, Wido den Hollander wrote: That warning is there for a reason. I suggest you double-check your NTP and clocks on the machines. This should never happen in production.

I know... I don't understand why this happens..! Tried both ntpd and systemd-timesyncd. I did not yet know chrony, will try it. I imagined that a 0.2 sec time skew would not be too disastrous... As a side note: I cannot find explained anywhere WHAT could happen if the skew becomes too big. Only that we should prevent it. (data loss?)

Are you running the MONs inside Virtual Machines? They are more likely to have drifting clocks.

Nope. All bare metal, on new supermicro servers.

You could try the chrony NTP daemon instead of ntpd and make sure all MONs are peers of each other.

Will try that. I had set all MONs to sync with chime1.surfnet.nl - chime4. We usually have very good experiences with those ntp servers. So, you're telling me that the MONs should be peers of each other... But if all MONs listen/sync to/with each other, where do I configure the external stratum 1 source?

MJ
Re: [ceph-users] clock skew
Hi!

On 04/01/2017 12:49 PM, Wei Jin wrote: mon_clock_drift_allowed should be used in the monitor process; what's the output of `ceph daemon mon.foo config show | grep clock`? How did you change the value? Command line or config file?

I guess I changed it wrong then... Did it in ceph.conf, like:

[global]
mon clock drift allowed = 0.1

and for immediate effect, also:

> ceph tell osd.* injectargs --mon_clock_drift_allowed "0.2"

So I guess that's wrong..? Should it be under the [mon] sections of ceph.conf? If listed under [global] like I have it now, then what does it actually change..?
[ceph-users] clock skew
Hi,

Despite ntp, we keep getting clock skews that auto-disappear again after a few minutes. To prevent the unnecessary HEALTH_WARNs, I have increased mon_clock_drift_allowed to 0.2, as can be seen below:

root@ceph1:~# ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep clock
  "mon_clock_drift_allowed": "0.2",
  "mon_clock_drift_warn_backoff": "5",
  "clock_offset": "0",
root@ceph1:~#

Despite this setting, I keep receiving HEALTH_WARNs like below:

ceph cluster node ceph1 health status became HEALTH_WARN clock skew detected on mon.1; Monitor clock skew detected mon.1 addr 10.10.89.2:6789/0 clock skew 0.113709s > max 0.1s (latency 0.000523111s)

Can anyone explain why the running config shows "mon_clock_drift_allowed": "0.2" while the HEALTH_WARN says "max 0.1s (latency 0.000523111s)"? How come there's a difference between the two?

MJ
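As Wei Jin points out earlier in this thread, mon_clock_drift_allowed is evaluated by the monitors, while the admin socket queried above (ceph-osd.0.asok) belongs to an OSD: the OSDs accepted the 0.2 value, but the MONs kept warning at their 0.1 default. A sketch of targeting the monitors instead (CEPH is a dry-run stub that only prints the command):

```shell
#!/bin/sh
# Sketch: inject the drift allowance into the MONs, not the OSDs.
# Dry-run stub; drop the "echo" on a real cluster.
CEPH="echo ceph"

out=$($CEPH tell 'mon.*' injectargs '--mon_clock_drift_allowed 0.2')
printf '%s\n' "$out"

# Persist it under [mon] (or [global]) in ceph.conf on each mon node:
#   [mon]
#   mon clock drift allowed = 0.2
# and verify against a *monitor* admin socket, not an OSD one:
#   ceph daemon mon.<id> config show | grep clock
```

With the stub this prints the tell command; on a live cluster each monitor would report the new value, and the earlier OSD-socket check explains why the setting appeared to be in effect while the warning threshold stayed at 0.1s.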
Re: [ceph-users] default pools gone. problem?
On 03/24/2017 10:13 PM, Bob R wrote: You can operate without the default pools without issue. Thanks! ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] default pools gone. problem?
Hi,

In the docs on pools, http://docs.ceph.com/docs/cuttlefish/rados/operations/pools/ it says: The default pools are:

* data
* metadata
* rbd

My ceph install has only ONE pool called "ceph-storage"; the others are gone. (probably deleted?)

Is not having those default pools a problem? Do I need to recreate them, or can they safely be deleted?

I'm on hammer, but intending to upgrade to jewel, and trying to identify potential issues, therefore this question.

MJ
Re: [ceph-users] ceph 'tech' question
On 03/24/2017 10:33 AM, ulem...@polarzone.de wrote: And why? Better distribution of read-access. Udo

Ah yes. On the other hand... In the case of specific often-requested data in your pool, the primary PG will have to handle all those requests, and in that case using a local copy would have benefits.

Anyway, thanks for your reply. :-)

MJ
[ceph-users] ceph 'tech' question
Hi all, Something that I am curious about: Suppose I have a three-server cluster, all with identical OSDs configuration, and also a replication factor of three. That would mean (I guess) that all 3 servers have a copy of everything in the ceph pool. My question: given that every machine has all the data, does that also imply that reads will be LOCAL on each machine? I'm asking because I understand that each PG has one primary copy optionally with extra secondary copies. (depending on the replication factor) I have the feeling that local reads will usually be faster than reads over the network. And if this is not the case, then why not? :-) Thanks for any insights or pointers! MJ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] add multiple OSDs to cluster
Hi Jonathan, Anthony and Steve,

Thanks very much for your valuable advice and suggestions!

MJ

On 03/21/2017 08:53 PM, Jonathan Proulx wrote: If it took 7hr for one drive you've probably already done this (or the defaults are for low-impact recovery), but before doing anything you want to be sure your OSD settings for max backfills, max recovery active, recovery sleep (perhaps others?) are set such that recovery and backfilling don't overwhelm production use. Look through the recovery section of http://docs.ceph.com/docs/master/rados/configuration/osd-config-ref/ This is important because if you do have a failure, and thus unplanned recovery, you want to have this tuned to your preferred balance of quick performance or quick return to full redundancy.

That said, my theory is to add things in as balanced a way as possible to minimize moves. What that means depends on your crush map. For me, I have 3 "racks" and all (most) of my pools are 3x replication, so each object should have one copy in each rack. I've only expanded once, but what I did was to add three servers, one to each 'rack'. I set them all 'in' at the same time, which should have minimized movement between racks and moved objects from other servers' osds in the same rack onto the osds in the new server. This seemed to work well for me.

In your case this would mean adding drives to all servers at once in a balanced way. That would prevent copying across servers, since the balance among servers wouldn't change. You could do one disk on each server, or load them all up and trust the recovery settings to keep the thundering herd in check.

As I said, I've only gone through one expansion round, and while this theory seemed to work out for me, hopefully someone with deeper knowledge can confirm or deny its general applicability.
-Jon On Tue, Mar 21, 2017 at 07:56:57PM +0100, mj wrote: :Hi, : :Just a quick question about adding OSDs, since most of the docs I can find :talk about adding ONE OSD, and I'd like to add four per server on my :three-node cluster. : :This morning I tried the careful approach, and added one OSD to server1. It :all went fine, everything rebuilt and I have a HEALTH_OK again now. It took :around 7 hours. : :But now I started thinking... (and that's when things go wrong, therefore :hoping for feedback here) : :The question: was I being stupid to add only ONE osd to the server1? Is it :not smarter to add all four OSDs at the same time? : :I mean: things will rebuild anyway...and I have the feeling that rebuilding :from 4 -> 8 OSDs is not going to be much heavier than rebuilding from 4 -> 5 :OSDs. Right? : :So better add all new OSDs together on a specific server? : :Or not? :-) : :MJ :___ :ceph-users mailing list :ceph-users@lists.ceph.com :http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
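For reference, the throttle settings Jon names would live in ceph.conf like this. These are the Jewel-era option names quoted above (and, per Jon, the Jewel defaults), shown only as a sketch of where they go:

```ini
# Recovery/backfill throttles discussed above (Jewel-era names).
# These are the defaults; define them explicitly if you want to be sure.
[osd]
osd max backfills = 1
osd recovery max active = 1
osd recovery op priority = 1
osd recovery threads = 1
```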
[ceph-users] add multiple OSDs to cluster
Hi, Just a quick question about adding OSDs, since most of the docs I can find talk about adding ONE OSD, and I'd like to add four per server on my three-node cluster. This morning I tried the careful approach, and added one OSD to server1. It all went fine, everything rebuilt and I have a HEALTH_OK again now. It took around 7 hours. But now I started thinking... (and that's when things go wrong, therefore hoping for feedback here) The question: was I being stupid to add only ONE osd to the server1? Is it not smarter to add all four OSDs at the same time? I mean: things will rebuild anyway...and I have the feeling that rebuilding from 4 -> 8 OSDs is not going to be much heavier than rebuilding from 4 -> 5 OSDs. Right? So better add all new OSDs together on a specific server? Or not? :-) MJ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] suddenly high memory usage for ceph-mon process
Hi Igor and David,

Thanks for your replies. There are no ceph-mds processes running in our cluster. I'm guessing David's reply applies to us, and we just need to set up additional monitoring for memory usage, so we get notified in case it happens again. Anyway: we learned that this can happen, so next time we know where to look first.

Thanks to you both for your replies,

MJ

On 11/04/2016 03:26 PM, igor.podo...@ts.fujitsu.com wrote: Maybe you hit this: https://github.com/ceph/ceph/pull/10238 which still waits for merge. This will occur only if you have a ceph-mds process in your cluster, but it's not configured (you don't need to use MDS; this process could be running only on some node). Check your monitor logs for something like "up but filesystem disabled" and how many similar lines you have. Regards, Igor.

-Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of mj Sent: Friday, November 4, 2016 2:06 PM To: ceph-users@lists.ceph.com Subject: [ceph-users] suddenly high memory usage for ceph-mon process

Hi, Running ceph 0.94.9 on jessie (proxmox), three hosts, 4 OSDs per host, ssd journal, 10G cluster network. Hosts have 65G ram. The cluster is generally not very busy. Suddenly we were getting HEALTH_WARN today, with two OSDs (both on the same server) being slow. Looking into this, we noticed very high memory usage on that host: 75% memory for ceph-mon! (normally here ceph-mon uses around 1% - 2%) I restarted ceph-mon on that host, and that seems to have brought things back to normal immediately. I don't see anything out of the ordinary in /var/log/syslog on that server, and also generally the cluster is HEALTH_OK. No changes to configs lately (last many weeks), and the last time I applied updates and rebooted was 30 days ago. No idea what could have caused this. Any ideas what to check, where to look? What would typically cause such high memory usage for the ceph-mon process?
MJ
[ceph-users] suddenly high memory usage for ceph-mon process
Hi,

Running ceph 0.94.9 on jessie (proxmox), three hosts, 4 OSDs per host, ssd journal, 10G cluster network. Hosts have 65G ram. The cluster is generally not very busy.

Suddenly we were getting HEALTH_WARN today, with two OSDs (both on the same server) being slow. Looking into this, we noticed very high memory usage on that host: 75% memory for ceph-mon! (normally here ceph-mon uses around 1% - 2%) I restarted ceph-mon on that host, and that seems to have brought things back to normal immediately.

I don't see anything out of the ordinary in /var/log/syslog on that server, and also generally the cluster is HEALTH_OK. No changes to configs lately (last many weeks), and the last time I applied updates and rebooted was 30 days ago. No idea what could have caused this.

Any ideas what to check, where to look? What would typically cause such high memory usage for the ceph-mon process?

MJ
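The "set up additional monitoring" conclusion from this thread could be as simple as a periodic RSS check on the ceph-mon process. A sketch; the RSS value below is a made-up sample standing in for real ps output, and the threshold is an arbitrary example:

```shell
#!/bin/sh
# Sketch of a ceph-mon memory watchdog. The rss_kb value is a made-up
# sample; on a live host you would capture it with something like:
#   rss_kb=$(ps -C ceph-mon -o rss= | head -n 1)
rss_kb=49152000                      # ~47 GB resident, roughly the 75%-of-65G incident above
threshold_kb=$((8 * 1024 * 1024))    # warn above 8 GB; pick your own limit

if [ "$rss_kb" -gt "$threshold_kb" ]; then
    status="WARN: ceph-mon RSS ${rss_kb} kB exceeds ${threshold_kb} kB"
else
    status="OK: ceph-mon RSS ${rss_kb} kB"
fi
echo "$status"
```

Run from cron and mail the WARN lines (or feed them to whatever alerting you already have), this would have flagged the runaway monitor long before the OSDs started reporting slow.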
Re: [ceph-users] 10Gbit switch advice for small ceph cluster upgrade
Hi Jelle, On 10/27/2016 03:04 PM, Jelle de Jong wrote: Hello everybody, I want to upgrade my small ceph cluster to 10Gbit networking and would like some recommendation from your experience. What is your recommend budget 10Gbit switch suitable for Ceph? We are running a 3-node cluster, with _direct_ 10G cable connections (quasi crosslink) between the three hosts. This is very low-budget, as it gives you 10G speed, without a (relatively) expensive 10G switch. Working fine here, with each host having a double 10G intel nic, plus a regular 1G interface. MJ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
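The direct-cabling setup described here follows the approach of the Proxmox "Full Mesh Network for Ceph Server" wiki page mentioned earlier in these threads. As a hypothetical illustration (interface names and addresses are made up), one simple way to address the links is a small point-to-point subnet per cable in /etc/network/interfaces:

```
# Node ceph1, hypothetical routed full mesh: one /30 per direct cable.
# Link ceph1<->ceph2 uses 10.15.15.0/30, link ceph1<->ceph3 uses 10.15.15.4/30.
auto ens1f0
iface ens1f0 inet static
        address 10.15.15.1
        netmask 255.255.255.252
        # direct cable to ceph2 (10.15.15.2)

auto ens1f1
iface ens1f1 inet static
        address 10.15.15.5
        netmask 255.255.255.252
        # direct cable to ceph3 (10.15.15.6)
```

The Proxmox wiki page describes other variants (routed with fallback, broadcast); the drawback named above stands for all of them: adding a fourth node means re-cabling and re-addressing, since a full mesh needs n-1 ports per node.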
Re: [ceph-users] running xfs_fsr on ceph OSDs
Hi Christian, Thanks for the reply / suggestion! MJ On 10/24/2016 10:02 AM, Christian Balzer wrote: Hello, On Mon, 24 Oct 2016 09:41:37 +0200 mj wrote: Hi, We have been running xfs on our servers for many years, and we are used to run a scheduled xfs_fsr during the weekend. Lately we have started using proxmox / ceph, and I'm wondering if we would benefit (like 'the old days') from scheduled xfs_fsr runs? Our OSDs are xfs, plus the VMs are also mostly running xfs. Both of which (in theory anyway) could be defragmented. Google doesn't tell me a lot, therefore I'm posing the question here: What is consensus here? Is it worth running xfs_fsr on VMs and OSDs? (or perhaps just one of both?) Only using XFS on some test OSDs, but the experience there and the anecdotes here suggest that it's quite prone to fragmentation over time. Of course instead of just running xfs_fsr willy-nilly, you might want to verify that fact for yourself and pick a schedule/time based on those results. And OSDs only, not within the VMs. Christian ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] running xfs_fsr on ceph OSDs
Hi, We have been running xfs on our servers for many years, and we are used to run a scheduled xfs_fsr during the weekend. Lately we have started using proxmox / ceph, and I'm wondering if we would benefit (like 'the old days') from scheduled xfs_fsr runs? Our OSDs are xfs, plus the VMs are also mostly running xfs. Both of which (in theory anyway) could be defragmented. Google doesn't tell me a lot, therefore I'm posing the question here: What is consensus here? Is it worth running xfs_fsr on VMs and OSDs? (or perhaps just one of both?) MJ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
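If scheduled defragmentation does turn out to help (per Christian's advice, measure fragmentation first and run it on the OSDs only), the "old days" weekend run translates directly into a cron entry. A hypothetical /etc/crontab line; xfs_fsr's -t flag limits its runtime in seconds, and without arguments it walks all mounted XFS filesystems:

```
# Hypothetical weekend window: defragment for at most 2 hours,
# Saturdays at 03:00. Without arguments xfs_fsr reorganizes all
# mounted XFS filesystems it finds.
0 3 * * 6  root  /usr/sbin/xfs_fsr -t 7200
```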
Re: [ceph-users] Surviving a ceph cluster outage: the hard way
Hi, Interesting reading! Any chance you could state some of your lessons (if any) you learned..? I can, for example, imagine your situation would have been much better with a replication factor of three instead of two..? MJ On 10/20/2016 12:09 AM, Kostis Fardelas wrote: Hello cephers, this is the blog post on our Ceph cluster's outage we experienced some weeks ago and about how we managed to revive the cluster and our clients's data. I hope it will prove useful for anyone who will find himself/herself in a similar position. Thanks for everyone on the ceph-users and ceph-devel lists who contributed to our inquiries during troubleshooting. https://blog.noc.grnet.gr/2016/10/18/surviving-a-ceph-cluster-outage-the-hard-way/ Regards, Kostis ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] rbd pool:replica size choose: 2 vs 3
Hi,

On 09/23/2016 09:41 AM, Dan van der Ster wrote:
> If you care about your data you run with size = 3 and min_size = 2.
>
> Wido

We're currently running with min_size 1. Can we simply change this, online,
with:

ceph osd pool set vm-storage min_size 2

and expect everything to continue running? (Our cluster is HEALTH_OK, enough
disk space, etc.)

MJ
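For reference, a minimal sketch of the commands involved, using the vm-storage pool name from the post. This assumes the pool's size is already at least 2; min_size can be changed online and, unlike a size change, does not by itself trigger any data movement:

```shell
# Inspect the current replication settings of the pool first:
ceph osd pool get vm-storage size
ceph osd pool get vm-storage min_size

# Raise min_size online; PGs with fewer than min_size replicas available
# will stop serving I/O until they recover, so do this while HEALTH_OK:
ceph osd pool set vm-storage min_size 2

# Verify the cluster still reports healthy afterwards:
ceph health detail
```

These commands need a running cluster and admin keyring, so they are shown as a sketch rather than something to paste blindly.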
Re: [ceph-users] rados bench output question
Hi Christian,

Thanks a lot for all your information! (Especially the bit that Ceph never
reads from the journal, but writes to the OSDs from memory, was new for me.)

MJ

On 09/07/2016 03:20 AM, Christian Balzer wrote:
> Hello,
>
> On Tue, 6 Sep 2016 13:38:45 +0200 lists wrote:
>> Hi Christian,
>>
>> Thanks for your reply.
>>
>>> What SSD model (be precise)?
>> Samsung 480GB PM863 SSD
>
> So that's not your culprit then (they are supposed to handle sync writes
> at full speed).
>
>>> Only one SSD?
>> Yes, with a 5GB partition-based journal for each OSD.
>
> A bit small, but in normal scenarios that shouldn't be a problem. Read:
> https://www.mail-archive.com/ceph-users@lists.ceph.com/msg28003.html
>
>> During the 0 MB/sec, there is NO increased cpu usage: it is usually
>> around 15-20% for the four ceph-osd processes.
>>
>>> Watch your node(s) with atop or iostat.
>> Ok, I will do.
>
> Best results will be had with 3 large terminals (one per node) running
> atop, with the interval set to at least 5, down from the default 10
> seconds. Same with iostat, parameters "-x 2".
>
>> Do we have an issue? And if yes: anyone with a suggestion where to look?
>
> You will find that either your journal SSD is overwhelmed (and a single
> SSD peaking around 500MB/s wouldn't be that surprising), or that your
> HDDs can't scribble away at more than the speed above, the more likely
> reason. Or even a combination of both.
>
> Ceph needs to flush data to the OSDs eventually (and that is usually more
> or less immediately with default parameters), so for a sustained,
> sequential write test you're looking at the speed of your HDDs. And that
> will be spiky of sorts, due to FS journals, seeks for other writes
> (replicas), etc.
>
>> But would we expect the MB/sec to drop to ZERO during journal-to-OSD
>> flushes?
>
> A common misconception when people start out with Ceph, and probably
> something that should be better explained in the docs. Or not, given that
> BlueStore is on the shimmering horizon.
>
> Ceph never reads from the journals, unless there has been a crash.
> (Now would be a good time to read that link above if you haven't yet.)
>
> What happens is that (depending on the various filestore and journal
> parameters) Ceph starts flushing the still-in-memory data to the OSD
> (disk, FS) after the journal has been written, as I mentioned above.
> The logic here is to not create an I/O storm after letting things pile
> up for a long time. People with fast storage subsystems and/or SSDs/NVMes
> as OSDs tend to tune these parameters.
>
> So now think about what happens during that rados bench run: a 4MB object
> gets written (created, then filled), so the client talks to the OSD that
> holds the primary PG for that object. That OSD writes the data to the
> journal and sends it to the other OSDs (replicas). Once all journals have
> been written, the primary OSD acks the write to the client. And this
> happens with 16 threads by default, making things nicely busy.
>
> Now, keeping in mind the above description and the fact that you have a
> small cluster, a single OSD that gets too busy will basically block the
> whole cluster. So things dropping to zero means that at least one OSD was
> so busy (not CPU in your case, but IOwait) that it couldn't take in more
> data.
>
> The fact that your drops happen at a rather predictable, roughly 9-second
> interval also suggests the possibility that the actual journal got full,
> but that's not conclusive.
>
> Christian

Thanks for the quick feedback; I'll dive into atop and iostat next.

Regards,
MJ
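Christian's journal-full hypothesis can be sanity-checked with simple arithmetic: a journal fills when client ingest exceeds the rate at which the HDDs drain it. The 5 GB journal size comes from the thread; the throughput figures below are illustrative assumptions, not measurements from this cluster:

```shell
# Back-of-envelope: how long until a journal partition fills up if the
# SSD takes in writes faster than the backing HDD can flush them?
JOURNAL_MB=5120      # 5 GB journal partition (from the thread)
INGEST_MBS=550       # assumed client ingest hitting the journal SSD
DRAIN_MBS=100        # assumed sustained flush rate to the HDD
FILL_S=$(( JOURNAL_MB / (INGEST_MBS - DRAIN_MBS) ))
echo "journal full after ~${FILL_S}s of sustained writes"
# prints: journal full after ~11s of sustained writes
```

With these made-up numbers the journal fills in roughly 11 seconds, the same ballpark as the ~9-second stalls observed, which is at least consistent with Christian's suggestion that the journal was filling up.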