Re: [ceph-users] Need advice with setup planning
Hi,

In the case of three Ceph hosts, you could also consider this setup: https://pve.proxmox.com/wiki/Full_Mesh_Network_for_Ceph_Server

This only requires two 10G NICs on each machine, plus an extra 1G for 'regular' non-Ceph traffic. That way at least your Ceph comms would be 10G, since 1G is surely going to be a bottleneck. We are running the above setup, no problems. The only issue: adding a fourth node will be relatively intrusive.

MJ

On 9/20/19 8:23 PM, Salsa wrote:

Replying inline.

-- Salsa

Sent with ProtonMail <https://protonmail.com> Secure Email.

‐‐‐ Original Message ‐‐‐
On Friday, September 20, 2019 1:34 PM, Martin Verges wrote:

Hello Salsa,

I have tested Ceph using VMs but never got to put it to use and had a lot of trouble to get it to install.

if you want to get rid of all the troubles from installing to day2day operations, you could consider using https://croit.io/croit-virtual-demo

Amazing! Where were you 3 months ago? Only problem is that I think we have no more budget for this, so I can't get approval for a software license.

- Use 2 HDDs for OS using RAID 1 (I've left 3.5TB unallocated in case I can use it later for storage)
- Install CentOS 7.7

Is ok, but won't be necessary if you choose croit, as we boot from the network and don't install an operating system.

No budget for a software license.

- Use 2 vLANs, one for ceph internal usage and another for external access. Since they've 4 network adapters, I'll try to bond them in pairs to speed up the network (1Gb).

If there is no internal policy that forces you to use separate networks, you can use a simple 1-VLAN setup and bond 4*1GbE. Otherwise it's ok.

The service is critical and we are afraid that the network might become congested and QoS for the end user degrades.

- I'll try to use ceph-ansible for installation. I failed to use it in the lab, but it seems more recommended.
- Install Ceph Nautilus

Ultra easy with croit, maybe look at our videos on youtube - https://www.youtube.com/playlist?list=PL1g9zo59diHDSJgkZcMRUq6xROzt_YKox

Thanks! I'll be watching them.

- Each server will host OSD, MON, MGR and MDS.

ok, but you should use ssd for metadata.

No budget and no option to get those now.

- One VM for ceph-admin: this will be used to run ceph-ansible and maybe to host some ceph services later

perfect for croit ;)

- I'll have to serve samba, iscsi and probably NFS too. Not sure how or on which servers.

Just put it on the servers as well; with croit it is just a click away and everything is included in our interface. If not using croit, you can still install it on the same systems and configure it by hand/script.

Great! Thanks for the help and congratulations on that demo. It is the best I've used and the easiest ceph setup I've found. As feedback, the last part of the demo tutorial is not 100% compatible with the master branch from github. The RBD pool creation has a different interface than the one presented in your tutorial (or I made some mistake along the way). Also, my cluster is showing errors in my placement groups after RBD pool creation, but I'll try to find out what happened. Thanks again!

--
Martin Verges
Managing director
Mobile: +49 174 9335695
E-Mail: martin.ver...@croit.io
Chat: https://t.me/MartinVerges
croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io
YouTube: https://goo.gl/PGE1Bx

On Fri, 20 Sept 2019 at 18:14, Salsa <sa...@protonmail.com> wrote:

I have tested Ceph using VMs but never got to put it to use and had a lot of trouble to get it to install. Now I've been asked to do a production setup using 3 servers (Dell R740) with 12 4TB disks each.
My plan is this:

- Use 2 HDDs for OS using RAID 1 (I've left 3.5TB unallocated in case I can use it later for storage)
- Install CentOS 7.7
- Use 2 vLANs, one for ceph internal usage and another for external access. Since they've 4 network adapters, I'll try to bond them in pairs to speed up the network (1Gb).
- I'll try to use ceph-ansible for installation. I failed to use it in the lab, but it seems more recommended.
- Install Ceph Nautilus
- Each server will host OSD, MON, MGR and MDS.
- One VM for ceph-admin: this will be used to run ceph-ansible and maybe to host some ceph services later
- I'll have to serve samba, iscsi and probably NFS too. Not sure how or on which servers.

Am I missing anything? Am I doing anything "wrong"? I searched for some actual guidance on setup but couldn't find anything complete, like a good tutorial or reference based on possible use-cases. So, are there any suggestions you could share, or links and references I should take a look at?
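For the 4*1GbE bonding discussed above, a CentOS 7-style ifcfg sketch of an LACP bond may help. Everything here (interface names, addresses) is illustrative and not taken from the thread; the BONDING_OPTS values are the standard 802.3ad knobs:

```ini
# /etc/sysconfig/network-scripts/ifcfg-bond0  (illustrative)
DEVICE=bond0
TYPE=Bond
BONDING_MASTER=yes
ONBOOT=yes
BOOTPROTO=none
IPADDR=192.0.2.10
PREFIX=24
# 802.3ad = LACP; layer3+4 spreads flows by IP+port (per-flow, not per-packet)
BONDING_OPTS="mode=802.3ad miimon=100 xmit_hash_policy=layer3+4"

# /etc/sysconfig/network-scripts/ifcfg-em1  (repeat for each slave NIC)
DEVICE=em1
TYPE=Ethernet
ONBOOT=yes
BOOTPROTO=none
MASTER=bond0
SLAVE=yes
```

Note that a single TCP stream still tops out at 1Gb with LACP; bonding only helps with many concurrent flows, which is why the 2x10G full-mesh setup is attractive for Ceph replication traffic.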
Re: [ceph-users] clock skew
An update. We noticed contradicting output from chrony. "chronyc sources" showed that chrony was synced. However, we also noted this output:

root@ceph2:/etc/chrony# chronyc activity
200 OK
0 sources online
4 sources offline
0 sources doing burst (return to online)
0 sources doing burst (return to offline)
0 sources with unknown address

So "chronyc activity" shows OFFLINE sources. After we changed sources to nl.pool.ntp.org, "chronyc activity" started showing the sources as ONLINE, and now, after a day running, our skew as reported by "ceph time-sync-status" is 0.00 on all hosts.

It seems that relying on "chronyc sources" alone is not always enough to make sure that everything is really synced.

Thanks for the help!

MJ

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
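The lesson above — "chronyc sources" can look fine while all sources are offline — can be folded into monitoring. Below is a minimal, hypothetical Python check that parses "chronyc activity" output and flags the all-offline case; the command name is real, but the parsing and alerting logic are this sketch's own assumptions:

```python
import re
import subprocess

def sources_online(activity_output):
    """Return the number of online NTP sources reported by `chronyc activity`."""
    m = re.search(r"(\d+) sources online", activity_output)
    if m is None:
        raise ValueError("unexpected chronyc activity output")
    return int(m.group(1))

def chrony_healthy():
    """True if at least one source is online; run e.g. from cron or a monitoring agent."""
    out = subprocess.run(["chronyc", "activity"],
                         capture_output=True, text=True, check=True).stdout
    return sources_online(out) > 0
```

A check like this would have caught the "4 sources offline" state days earlier, instead of waiting for Ceph to complain about skew.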
Re: [ceph-users] clock skew
Hi all,

Thanks for all the replies!

@Huang: ceph time-sync-status is exactly what I was looking for, thanks!

@Janne: I will check out / implement the peer config per your suggestion. However, what confuses us is that chrony thinks the clocks match, and only ceph feels they don't. So we are not sure if the peer config will actually help in this situation. But time will tell.

@John: Thanks for the maxsources suggestion.

@Bill: thanks for the interesting article, will check it out!

MJ

On 4/25/19 5:47 PM, Bill Sharer wrote:

If you are just synching to the outside pool, the three hosts may end up latching on to different outside servers as their definitive sources. You might want to make one of the three a higher-priority source for the other two, and possibly just have it use the outside sources for sync.

Also, for hardware newer than about five years old, you might want to look at enabling the NIC clocks using LinuxPTP to keep clock jitter down inside your LAN. I wrote this article on the Gentoo wiki on enabling PTP in chrony: https://wiki.gentoo.org/wiki/Chrony_with_hardware_timestamping

Bill Sharer

On 4/25/19 6:33 AM, mj wrote:

Hi all,

On our three-node cluster, we have set up chrony for time sync, and even though chrony reports that it is synced to ntp time, at the same time ceph occasionally reports time skews that can last several hours.
See for example:

root@ceph2:~# ceph -v
ceph version 12.2.10 (fc2b1783e3727b66315cc667af9d663d30fe7ed4) luminous (stable)
root@ceph2:~# ceph health detail
HEALTH_WARN clock skew detected on mon.1
MON_CLOCK_SKEW clock skew detected on mon.1
    mon.1 addr 10.10.89.2:6789/0 clock skew 0.506374s > max 0.5s (latency 0.000591877s)
root@ceph2:~# chronyc tracking
Reference ID    : 7F7F0101 ()
Stratum         : 10
Ref time (UTC)  : Wed Apr 24 19:05:28 2019
System time     : 0.00133 seconds slow of NTP time
Last offset     : -0.00524 seconds
RMS offset      : 0.00524 seconds
Frequency       : 12.641 ppm slow
Residual freq   : +0.000 ppm
Skew            : 0.000 ppm
Root delay      : 0.00 seconds
Root dispersion : 0.00 seconds
Update interval : 1.4 seconds
Leap status     : Normal
root@ceph2:~#

For the record: mon.1 = ceph2 = 10.10.89.2, and time is synced similarly with NTP on the two other nodes. We don't understand this... I have now injected mon_clock_drift_allowed 0.7, so at least we have HEALTH_OK again (to stop upsetting my monitoring system). But two questions:

- can anyone explain why this is happening? It looks as if ceph and NTP/chrony disagree on just how time-synced the servers are.
- how can we determine the current clock skew from Ceph's perspective? "ceph health detail" in the HEALTH_OK case does not show it. (I want to start monitoring it continuously, to see if I can find some sort of pattern.)

Thanks!

MJ
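For continuous monitoring of skew as Ceph sees it, `ceph time-sync-status` (mentioned later in the thread) can be polled in JSON form. The sketch below is hypothetical: the command and `-f json` flag are real, but the exact JSON layout is an assumption (a per-mon map under "time_skew_status" with a "skew" field) and should be checked against your own cluster's output first:

```python
import json
import subprocess

# ASSUMED (not verified) shape of `ceph time-sync-status -f json`:
# {"time_skew_status": {"mon.0": {"skew": 0.0001, "latency": 0.0006}, ...}, ...}
def worst_skew(status):
    """Return (mon_name, abs_skew_seconds) for the mon with the largest skew."""
    skews = {mon: abs(float(v.get("skew", 0.0)))
             for mon, v in status.get("time_skew_status", {}).items()}
    mon = max(skews, key=skews.get)
    return mon, skews[mon]

def current_worst_skew():
    """Poll the cluster; suitable for feeding a monitoring system."""
    out = subprocess.run(["ceph", "time-sync-status", "-f", "json"],
                         capture_output=True, text=True, check=True).stdout
    return worst_skew(json.loads(out))
```

Graphing this value over time should reveal whether the skew warnings follow a pattern, as asked above.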
[ceph-users] clock skew
Hi all,

On our three-node cluster, we have set up chrony for time sync, and even though chrony reports that it is synced to ntp time, at the same time ceph occasionally reports time skews that can last several hours. See for example:

root@ceph2:~# ceph -v
ceph version 12.2.10 (fc2b1783e3727b66315cc667af9d663d30fe7ed4) luminous (stable)
root@ceph2:~# ceph health detail
HEALTH_WARN clock skew detected on mon.1
MON_CLOCK_SKEW clock skew detected on mon.1
    mon.1 addr 10.10.89.2:6789/0 clock skew 0.506374s > max 0.5s (latency 0.000591877s)
root@ceph2:~# chronyc tracking
Reference ID    : 7F7F0101 ()
Stratum         : 10
Ref time (UTC)  : Wed Apr 24 19:05:28 2019
System time     : 0.00133 seconds slow of NTP time
Last offset     : -0.00524 seconds
RMS offset      : 0.00524 seconds
Frequency       : 12.641 ppm slow
Residual freq   : +0.000 ppm
Skew            : 0.000 ppm
Root delay      : 0.00 seconds
Root dispersion : 0.00 seconds
Update interval : 1.4 seconds
Leap status     : Normal
root@ceph2:~#

For the record: mon.1 = ceph2 = 10.10.89.2, and time is synced similarly with NTP on the two other nodes. We don't understand this... I have now injected mon_clock_drift_allowed 0.7, so at least we have HEALTH_OK again (to stop upsetting my monitoring system). But two questions:

- can anyone explain why this is happening? It looks as if ceph and NTP/chrony disagree on just how time-synced the servers are.
- how can we determine the current clock skew from Ceph's perspective? "ceph health detail" in the HEALTH_OK case does not show it. (I want to start monitoring it continuously, to see if I can find some sort of pattern.)

Thanks!

MJ
Re: [ceph-users] PG inconsistent, "pg repair" not working
Hi,

I was able to solve a similar issue on our cluster using this blog: https://ceph.com/geen-categorie/ceph-manually-repair-object/ It does help if you are running a 3/2 config. Perhaps it helps you as well.

MJ

On 09/25/2018 02:37 AM, Sergey Malinin wrote:

Hello,

During normal operation our cluster suddenly threw an error, and since then we have had 1 inconsistent PG, and one of the clients sharing a cephfs mount has started to occasionally log "ceph: Failed to find inode X". "ceph pg repair" deep-scrubs the PG and fails with the same error in the log. Can anyone advise how to fix this?

log entries:
2018-09-20 06:48:23.081 7f0b2efd9700 -1 log_channel(cluster) log [ERR] : 1.92 soid 1:496296a8:::1000f44d0f4.0018:head: failed to pick suitable object info
2018-09-20 06:48:23.081 7f0b2efd9700 -1 log_channel(cluster) log [ERR] : scrub 1.92 1:496296a8:::1000f44d0f4.0018:head on disk size (3751936) does not match object info size (0) adjusted for ondisk to (0)
2018-09-20 06:50:36.925 7f0b2efd9700 -1 log_channel(cluster) log [ERR] : 1.92 scrub 3 errors

# ceph -v
ceph version 13.2.1 (5533ecdc0fda920179d7ad84e0aa65a127b20d77) mimic (stable)

# ceph health detail
HEALTH_ERR 3 scrub errors; Possible data damage: 1 pg inconsistent
OSD_SCRUB_ERRORS 3 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
    pg 1.92 is active+clean+inconsistent, acting [4,9]

# rados list-inconsistent-obj 1.92
{"epoch":519,"inconsistents":[]}

# ceph pg 1.92 query
{ "state": "active+clean+inconsistent", "snap_trimq": "[]", "snap_trimq_len": 0, "epoch": 520, "up": [ 4, 9 ], "acting": [ 4, 9 ], "acting_recovery_backfill": [ "4", "9" ], "info": { "pgid": "1.92", "last_update": "520'2456340", "last_complete": "520'2456340", "log_tail": "520'2453330", "last_user_version": 7914566, "last_backfill": "MAX", "last_backfill_bitwise": 0, "purged_snaps": [], "history": { "epoch_created": 63, "epoch_pool_created": 63, "last_epoch_started": 520, "last_interval_started": 519, "last_epoch_clean": 520,
"last_interval_clean": 519, "last_epoch_split": 0, "last_epoch_marked_full": 0, "same_up_since": 519, "same_interval_since": 519, "same_primary_since": 514, "last_scrub": "520'2456105", "last_scrub_stamp": "2018-09-25 02:17:35.631365", "last_deep_scrub": "520'2456105", "last_deep_scrub_stamp": "2018-09-25 02:17:35.631365", "last_clean_scrub_stamp": "2018-09-19 02:27:22.656268" }, "stats": { "version": "520'2456340", "reported_seq": "6115579", "reported_epoch": "520", "state": "active+clean+inconsistent", "last_fresh": "2018-09-25 03:02:34.338256", "last_change": "2018-09-25 02:17:35.631476", "last_active": "2018-09-25 03:02:34.338256", "last_peered": "2018-09-25 03:02:34.338256", "last_clean": "2018-09-25 03:02:34.338256", "last_became_active": "2018-09-24 15:25:30.238044", "last_became_peered": "2018-09-24 15:25:30.238044", "last_unstale": "2018-09-25 03:02:34.338256", "last_undegraded": "2018-09-25 03:02:34.338256", "last_fullsized": "2018-09-25 03:02:34.338256", "mapping_epoch": 519, "log_start": "520'2453330", "ondisk_log_start": "520'2453330", "created": 63, "last_epoch_clean": 520, "parent": "0.0", "parent_split_bits": 0, "last_scrub": "520'2456105", "last_scrub_stamp": "2018-09-25 02:17:35.631365", "last_deep_scrub": "520'2456105", "last_deep_scrub_stamp": "2018-09-25 02:17:35.631365", "last_clean_scrub_stamp": "2018-09-19 02:27:22.656268", "log_size": 3010, "ondisk_log_size": 3010, "stats_invalid": false, "dirty_stats_invalid": false, "omap_stats_invalid": false, "hitset_stats_invalid": false, "hitset_bytes_stats_invalid": false, "pin_stats_invalid": false, "manifest_stats_invalid": false, "snaptrimq_len": 0, "stat_sum": { "num_bytes": 23138366490, "num_objects": 479532, "num_object_clones": 0, "num_object_copies": 959064, "num_objects_missing_on_primary": 0, "num_object
Re: [ceph-users] [slightly OT] XFS vs. BTRFS vs. others as root/usr/var/tmp filesystems ?
On 09/24/2018 08:53 AM, Nicolas Huillard wrote:

Thanks for your anecdote ;-) Could it be that I stack too many things (XFS in LVM in md-RAID in an SSD's FTL)?

No, we regularly use the same compound of layers, just without the SSD.

mj
Re: [ceph-users] [slightly OT] XFS vs. BTRFS vs. others as root/usr/var/tmp filesystems ?
On 09/24/2018 08:46 AM, Nicolas Huillard wrote:

Too bad, since this FS has a lot of very promising features. I view it as the single-host-ceph-like FS, and do not see any equivalent (apart from ZFS, which will also never be included in the kernel).

Agreed. It's also so much more flexible than zfs, like adding disks to raids to expand space, for example.

mj
Re: [ceph-users] [slightly OT] XFS vs. BTRFS vs. others as root/usr/var/tmp filesystems ?
Hi,

Just a very quick and simple reply: XFS has *always* treated us nicely, and we have been using it for a VERY long time, ever since the pre-2000 suse 5.2 days, on pretty much all our machines. We have seen only very few corruptions on xfs, and the few times we tried btrfs, (almost) always 'something' happened. (Same for the few times we tried reiserfs, btw.)

So, while my story may be very anecdotal (and you will probably find many others here claiming the opposite), our own conclusion is very clear: we love xfs, and do not like btrfs very much.

MJ

On 09/22/2018 10:58 AM, Nicolas Huillard wrote:

Hi all,

I don't have a good track record with XFS since I got rid of ReiserFS a long time ago. I decided XFS was a good idea on servers, while I tested BTRFS on various less important devices. So far, XFS has betrayed me far more often (a few times) than BTRFS (never). Last time was yesterday, on a root filesystem with "Block out of range: block 0x17b9814b0, EOFS 0x12a000" "I/O Error Detected. Shutting down filesystem" (shutting down the root filesystem is pretty hard).

Some threads on this ML discuss a similar problem, related to partitioning and logical sectors located just after the end of the partition. The problem here does not seem to be the same, as the requested block is very far out of bounds (2 orders of magnitude too far), and I use a recent Debian stock kernel with every security patch.

My question is: should I trust XFS for small root filesystems (/, /tmp, /var on LVM sitting within an md-RAID1 smallish partition), or is BTRFS finally trusty enough for a general-purpose cluster (still root et al. filesystems), or do you guys just use the distro-recommended setup (typically Ext4 on plain disks)?

Debian stretch with 4.9.110-3+deb9u4 kernel. Ceph 12.2.8 on bluestore (not related to the question).
Partial output of lsblk /dev/sdc /dev/nvme0n1:

NAME                      MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sdc                         8:32   0 447,1G  0 disk
├─sdc1                      8:33   0  55,9G  0 part
│ └─md0                     9:0    0  55,9G  0 raid1
│   ├─oxygene_system-root 253:4    0   9,3G  0 lvm   /
│   ├─oxygene_system-tmp  253:5    0   9,3G  0 lvm   /tmp
│   └─oxygene_system-var  253:6    0   4,7G  0 lvm   /var
└─sdc2                      8:34   0  29,8G  0 part  [SWAP]
nvme0n1                   259:0    0   477G  0 disk
├─nvme0n1p1               259:1    0  55,9G  0 part
│ └─md0                     9:0    0  55,9G  0 raid1
│   ├─oxygene_system-root 253:4    0   9,3G  0 lvm   /
│   ├─oxygene_system-tmp  253:5    0   9,3G  0 lvm   /tmp
│   └─oxygene_system-var  253:6    0   4,7G  0 lvm   /var
├─nvme0n1p2               259:2    0  29,8G  0 part  [SWAP]

TIA !
Re: [ceph-users] Proxmox/ceph upgrade and addition of a new node/OSDs
Hi Hervé!

Thanks for the detailed summary, much appreciated!

Best, MJ

On 09/21/2018 09:03 AM, Hervé Ballans wrote:

Hi MJ (and all),

So we upgraded our Proxmox/Ceph cluster, and to summarize the operation in a few words: overall, everything went well :) The most critical operation of all is the 'osd crush tunables optimal'; I talk about it in more detail below...

The Proxmox documentation is really well written and accurate and, normally, following the documentation step by step is almost sufficient!

* first step: upgrade Ceph Jewel to Luminous: https://pve.proxmox.com/wiki/Ceph_Jewel_to_Luminous (Note here: OSDs remain on the FileStore backend, no BlueStore migration)
* second step: upgrade Proxmox version 4 to 5: https://pve.proxmox.com/wiki/Upgrade_from_4.x_to_5.0

Just some numbers, observations and tips (based on our feedback; I'm not an expert!):

* Before migration, make sure you are on the latest version of Proxmox 4 (4.4-24) and Ceph Jewel (10.2.11).
* We don't use the pve repository for ceph packages but the official one (download.ceph.com). Thus, during the upgrade of Proxmox PVE, we don't replace the ceph.com repository with the proxmox.com Ceph repository...
* When you upgrade Ceph to Luminous (without tunables optimal), there is no impact on Proxmox 4; VMs keep running normally. The side effect (non-blocking for the functioning of VMs) is located in the GUI, in the Ceph menu: it can't report the status of the ceph cluster as it hits a JSON formatting error (indeed the output of the command 'ceph -s' is completely different, and really more readable, on Luminous).
* A little step is missing in section 8 "Create Manager instances" of the upgrade ceph documentation. As the Ceph manager daemon is new since Luminous, the package doesn't exist on Jewel, so you have to install the ceph-mgr package on each node first before doing 'pveceph createmgr'.
* The 'osd crush tunables optimal' operation is time consuming! In our case: 5 nodes (PE R730xd), 58 OSDs, a replicated (3/2) rbd pool with 2048 pgs and 2 million objects, 22 TB used. The tunables operation took a little more than 24 hours!
* Really take the right time to do the 'tunables optimal'! We encountered some stuck pgs and blocked requests during this operation. In our case, the involved OSDs were those with a high number of pgs (as they are high-capacity disks). The consequences can be critical, since it can freeze some VMs (I guess those whose replicas are stored on the stuck pgs?). The stuck state was corrected by rebooting the involved OSDs.

If you can move the disks of your critical VMs to another storage, those VMs should not be impacted by the recovery (we moved some disks to another Ceph cluster, kept the conf in the Proxmox cluster being updated, and there was no impact). Otherwise:
- verify that all your VMs have recently been backed up to an external storage (in case of a disaster recovery plan!)
- if you can, stop all your non-critical VMs (in order to limit client io operations)
- if any, wait for the end of current backups, then disable datacenter backup (in order to limit client io operations). !! do not forget to re-enable it when all is over !!
- if any, and if no longer needed, delete your snapshots; it removes many useless objects!
- start the tunables operation outside of major activity periods (night, weekend, ...) and take into account that it can be very slow...

There are probably some options to configure in ceph to avoid 'pgs stuck' states, but on our side, as we had previously moved our critical VMs' disks, we didn't worry about that!

* Anyway, the Proxmox PVE upgrade step itself is done easily and quickly (just follow the documentation). Note that you can upgrade Proxmox PVE before doing the 'tunables optimal' operation.

Hoping that you will find this information useful, good luck with your very next migration!
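Not part of Hervé's write-up, but the usual knobs for softening a big rebalance like 'tunables optimal' are the recovery/backfill throttles. A hedged sketch with Jewel/Luminous-era `injectargs` syntax (the values are conservative examples, and slower recovery means a longer rebalance window):

```shell
# Throttle recovery before running `ceph osd crush tunables optimal`
ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'

# Optionally stop scrubbing from competing with the rebalance
ceph osd set noscrub
ceph osd set nodeep-scrub

# ... run the tunables change, watch `ceph -s` until HEALTH_OK ...

# Afterwards, unset the flags and restore your previous values
ceph osd unset noscrub
ceph osd unset nodeep-scrub
```

This trades client-visible impact against total rebalance time, which fits the advice above to run the operation outside major activity periods.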
Hervé

On 13/09/2018 at 22:04, mj wrote:

Hi Hervé,

No answer from me, but just to say that I have exactly the same upgrade path ahead of me. :-) Please report here any tips, tricks, or things you encountered doing the upgrades. It could potentially save us a lot of time. :-)

Thanks! MJ

On 09/13/2018 05:23 PM, Hervé Ballans wrote:

Dear list,

I am currently in the process of upgrading Proxmox 4/Jewel to Proxmox 5/Luminous. I also have a new node to add to my Proxmox cluster. What I plan to do is the following (from https://pve.proxmox.com/wiki/Ceph_Jewel_to_Luminous):

* upgrade Jewel to Luminous
* let the "ceph osd crush tunables optimal" command run
* upgrade my proxmox to v5
* add the new node (already up to date in v5)
* add the new OSDs
* let ceph rebalance the lot

A couple of questions I have:

* would it be a good idea to add the new node+OSDs and run the "tunables optimal" command immediately after, which would maybe gain a little time and avoid two successive pg rebalancings?
Re: [ceph-users] Proxmox/ceph upgrade and addition of a new node/OSDs
Hi Hervé,

No answer from me, but just to say that I have exactly the same upgrade path ahead of me. :-) Please report here any tips, tricks, or things you encountered doing the upgrades. It could potentially save us a lot of time. :-)

Thanks! MJ

On 09/13/2018 05:23 PM, Hervé Ballans wrote:

Dear list,

I am currently in the process of upgrading Proxmox 4/Jewel to Proxmox 5/Luminous. I also have a new node to add to my Proxmox cluster. What I plan to do is the following (from https://pve.proxmox.com/wiki/Ceph_Jewel_to_Luminous):

* upgrade Jewel to Luminous
* let the "ceph osd crush tunables optimal" command run
* upgrade my proxmox to v5
* add the new node (already up to date in v5)
* add the new OSDs
* let ceph rebalance the lot

A couple of questions I have:

* would it be a good idea to add the new node+OSDs and run the "tunables optimal" command immediately after, which would maybe gain a little time and avoid two successive pg rebalancings?
* did I miss anything in this plan?

Regards, Hervé
Re: [ceph-users] HEALTH_ERR vs HEALTH_WARN
Hi Mark, others,

I took my info from the following page: https://ceph.com/geen-categorie/ceph-manually-repair-object/ where it is written: "Of course the above works well when you have 3 replicas when it is easier for Ceph to compare two versions against another one."

Based on that info, I assumed that a simple "ceph pg repair 2.1a9" was enough to solve this without introducing corruption into our 3/2 cluster.

MJ

On 08/23/2018 12:28 PM, Mark Schouten wrote:

Gregory's answer worries us. We thought that with a 3/2 pool, and one PG corrupted, the assumption would be: the two similar ones are correct, and the third one needs to be adjusted. Can we determine from this output if I created corruption in our cluster..?

I second this assumption. Can someone clarify?
Re: [ceph-users] HEALTH_ERR vs HEALTH_WARN
Hi,

Thanks John and Gregory for your answers. Gregory's answer worries us. We thought that with a 3/2 pool, and one PG corrupted, the assumption would be: the two similar ones are correct, and the third one needs to be adjusted. Can we determine from this output if I created corruption in our cluster..?

root@pm1:~# ceph health detail
HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
pg 2.1a9 is active+clean+inconsistent, acting [15,23,6]
1 scrub errors
root@pm1:~# zgrep 2.1a9 /var/log/ceph/ceph.log*
/var/log/ceph/ceph.log.14.gz:2017-09-11 21:02:24.755778 osd.15 10.10.89.1:6812/3810 2122 : cluster [INF] 2.1a9 deep-scrub starts
/var/log/ceph/ceph.log.14.gz:2017-09-11 21:08:10.537249 osd.15 10.10.89.1:6812/3810 2123 : cluster [INF] 2.1a9 deep-scrub ok
/var/log/ceph/ceph.log.1.gz:2018-08-22 04:33:21.156004 osd.15 10.10.89.1:6800/3352 18074 : cluster [INF] 2.1a9 deep-scrub starts
/var/log/ceph/ceph.log.1.gz:2018-08-22 04:40:02.579204 osd.15 10.10.89.1:6800/3352 18075 : cluster [ERR] 2.1a9 shard 23: soid 2:95b8d975:::rbd_data.2c191e238e1f29.000c7c9d:head candidate had a read error
/var/log/ceph/ceph.log.1.gz:2018-08-22 04:41:02.720716 osd.15 10.10.89.1:6800/3352 18076 : cluster [ERR] 2.1a9 deep-scrub 0 missing, 1 inconsistent objects
/var/log/ceph/ceph.log:2018-08-22 08:23:09.682792 osd.15 10.10.89.1:6800/3352 18088 : cluster [INF] 2.1a9 repair starts
/var/log/ceph/ceph.log:2018-08-22 08:29:28.440526 osd.15 10.10.89.1:6800/3352 18089 : cluster [ERR] 2.1a9 shard 23: soid 2:95b8d975:::rbd_data.2c191e238e1f29.000c7c9d:head candidate had a read error
/var/log/ceph/ceph.log:2018-08-22 08:30:18.790176 osd.15 10.10.89.1:6800/3352 18090 : cluster [ERR] 2.1a9 repair 0 missing, 1 inconsistent objects
/var/log/ceph/ceph.log:2018-08-22 08:30:18.791718 osd.15 10.10.89.1:6800/3352 18091 : cluster [ERR] 2.1a9 repair 1 errors, 1 fixed

And also: jewel (which we're running) is considered "the old past", with the old non-checksum behaviour? In case this occurs again...
what would be the steps to determine WHICH pg copy is the corrupt one, and how to proceed if it happens to be the primary pg for an object? Upgrading to luminous would prevent this from happening again, I guess. We're a bit scared to upgrade, because there seem to be so many issues with luminous and upgrading to it.

Having said all this: we are surprised to see this on our cluster, as it should be, and has been, running stable and reliably for over two years. Perhaps just a one-time glitch.

Thanks for your replies!

MJ

On 08/23/2018 01:06 AM, Gregory Farnum wrote:
On Wed, Aug 22, 2018 at 2:46 AM John Spray <jsp...@redhat.com> wrote:
On Wed, Aug 22, 2018 at 7:57 AM mj <li...@merit.unu.edu> wrote:
>
> Hi,
>
> This morning I woke up, seeing my ceph jewel 10.2.10 cluster in
> HEALTH_ERR state. That helps you getting out of bed. :-)
>
> Anyway, much to my surprise, all VMs running on the cluster were still
> working like nothing was going on. :-)
>
> Checking a bit more revealed:
>
> root@pm1:~# ceph -s
>     cluster 1397f1dc-7d94-43ea-ab12-8f8792eee9c1
>      health HEALTH_ERR
>             1 pgs inconsistent
>             1 scrub errors
>      monmap e3: 3 mons at {0=10.10.89.1:6789/0,1=10.10.89.2:6789/0,2=10.10.89.3:6789/0}
>             election epoch 296, quorum 0,1,2 0,1,2
>      osdmap e12662: 24 osds: 24 up, 24 in
>             flags sortbitwise,require_jewel_osds
>       pgmap v64045618: 1088 pgs, 2 pools, 14023 GB data, 3680 kobjects
>             44027 GB used, 45353 GB / 89380 GB avail
>                 1087 active+clean
>                    1 active+clean+inconsistent
>   client io 26462 kB/s rd, 14048 kB/s wr, 6 op/s rd, 383 op/s wr
>
> root@pm1:~# ceph health detail
> HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
> pg 2.1a9 is active+clean+inconsistent, acting [15,23,6]
> 1 scrub errors
>
> root@pm1:~# zgrep 2.1a9 /var/log/ceph/ceph.log*
> /var/log/ceph/ceph.log.14.gz:2017-09-11 21:02:24.755778 osd.15 10.10.89.1:6812/3810 2122 : cluster [INF] 2.1a9 deep-scrub starts
> /var/log/ceph/ceph.log.14.gz:2017-09-11 21:08:10.537249 osd.15 10.10.89.1:6812/3810 2123 : cluster [INF] 2.1a9 deep-scrub ok
> /var/log/ceph/ceph.log.1.gz:2018-08-22 04:33:21.156004 osd.15 10.10.89.1:6800/3352 18074 : cluster [INF] 2.1a9 deep-scrub starts
> /var/log/ceph/ceph.log.1.gz:2018-08-2
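On "which copy is the corrupt one": after a deep-scrub flags the PG, `rados list-inconsistent-obj <pgid>` reports per-shard errors (in this thread's log the scrub already blames "shard 23", i.e. the replica on osd.23). Below is a small hypothetical helper for picking the errored shards out of that report; the JSON layout used here (an "inconsistents" list with per-object "shards", each carrying "osd" and "errors") is an assumption to verify against your own cluster's output:

```python
def bad_shards(report):
    """From an assumed `rados list-inconsistent-obj <pg> --format=json` report,
    return (osd_id, errors) for every shard that reported errors."""
    out = []
    for obj in report.get("inconsistents", []):
        for shard in obj.get("shards", []):
            if shard.get("errors"):
                out.append((shard["osd"], shard["errors"]))
    return out
```

If the bad shard is on the primary, the usual advice (per the manual-repair blog cited elsewhere in this archive) is to move the damaged copy aside on that OSD before running `ceph pg repair`, so repair cannot propagate the bad version.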
[ceph-users] HEALTH_ERR vs HEALTH_WARN
Hi,

This morning I woke up, seeing my ceph jewel 10.2.10 cluster in HEALTH_ERR state. That helps you getting out of bed. :-)

Anyway, much to my surprise, all VMs running on the cluster were still working like nothing was going on. :-)

Checking a bit more revealed:

root@pm1:~# ceph -s
    cluster 1397f1dc-7d94-43ea-ab12-8f8792eee9c1
     health HEALTH_ERR
            1 pgs inconsistent
            1 scrub errors
     monmap e3: 3 mons at {0=10.10.89.1:6789/0,1=10.10.89.2:6789/0,2=10.10.89.3:6789/0}
            election epoch 296, quorum 0,1,2 0,1,2
     osdmap e12662: 24 osds: 24 up, 24 in
            flags sortbitwise,require_jewel_osds
      pgmap v64045618: 1088 pgs, 2 pools, 14023 GB data, 3680 kobjects
            44027 GB used, 45353 GB / 89380 GB avail
                1087 active+clean
                   1 active+clean+inconsistent
  client io 26462 kB/s rd, 14048 kB/s wr, 6 op/s rd, 383 op/s wr

root@pm1:~# ceph health detail
HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
pg 2.1a9 is active+clean+inconsistent, acting [15,23,6]
1 scrub errors

root@pm1:~# zgrep 2.1a9 /var/log/ceph/ceph.log*
/var/log/ceph/ceph.log.14.gz:2017-09-11 21:02:24.755778 osd.15 10.10.89.1:6812/3810 2122 : cluster [INF] 2.1a9 deep-scrub starts
/var/log/ceph/ceph.log.14.gz:2017-09-11 21:08:10.537249 osd.15 10.10.89.1:6812/3810 2123 : cluster [INF] 2.1a9 deep-scrub ok
/var/log/ceph/ceph.log.1.gz:2018-08-22 04:33:21.156004 osd.15 10.10.89.1:6800/3352 18074 : cluster [INF] 2.1a9 deep-scrub starts
/var/log/ceph/ceph.log.1.gz:2018-08-22 04:40:02.579204 osd.15 10.10.89.1:6800/3352 18075 : cluster [ERR] 2.1a9 shard 23: soid 2:95b8d975:::rbd_data.2c191e238e1f29.000c7c9d:head candidate had a read error
/var/log/ceph/ceph.log.1.gz:2018-08-22 04:41:02.720716 osd.15 10.10.89.1:6800/3352 18076 : cluster [ERR] 2.1a9 deep-scrub 0 missing, 1 inconsistent objects

ok, according to the docs I should do "ceph pg repair 2.1a9".
Did that, and some minutes later the cluster came back to HEALTH_OK. Checking the logs:

/var/log/ceph/ceph.log:2018-08-22 08:23:09.682792 osd.15 10.10.89.1:6800/3352 18088 : cluster [INF] 2.1a9 repair starts
/var/log/ceph/ceph.log:2018-08-22 08:29:28.440526 osd.15 10.10.89.1:6800/3352 18089 : cluster [ERR] 2.1a9 shard 23: soid 2:95b8d975:::rbd_data.2c191e238e1f29.000c7c9d:head candidate had a read error
/var/log/ceph/ceph.log:2018-08-22 08:30:18.790176 osd.15 10.10.89.1:6800/3352 18090 : cluster [ERR] 2.1a9 repair 0 missing, 1 inconsistent objects
/var/log/ceph/ceph.log:2018-08-22 08:30:18.791718 osd.15 10.10.89.1:6800/3352 18091 : cluster [ERR] 2.1a9 repair 1 errors, 1 fixed

So, we are fine again, it seems. But now my question: can anyone explain what happened? Is one of my disks dying? In the proxmox gui, all osd disks have SMART status "OK".

Besides that, as the cluster was still running and the fix was relatively simple, would a HEALTH_WARN not have been more appropriate? And, since this is a size 3, min 2 pool... shouldn't this have been taken care of automatically..? ('self-healing' and all that..?)

So, I'm having my morning coffee finally, wondering what happened... :-)

Best regards to all, have a nice day!

MJ
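On the "is one of my disks dying?" question: the scrub errors blame "shard 23", i.e. the copy on osd.23, so that OSD's backing disk is the one to inspect more closely than the GUI's summary SMART status. A hedged sketch of the usual mapping steps (the device name is an example, and the exact fields printed by these commands vary by Ceph release):

```shell
# The scrub errors name "shard 23" -- the replica on osd.23
ceph osd find 23          # which host holds osd.23
ceph osd metadata 23      # includes hostname and backend details

# On that host: map the OSD's data directory to a block device,
# then read the raw SMART attributes (device name illustrative)
df /var/lib/ceph/osd/ceph-23
smartctl -a /dev/sdX | grep -iE 'reallocated|pending|uncorrect'
```

A read error during scrub with SMART still "OK" overall is plausible: the summary health flag often stays PASSED while attributes such as Current_Pending_Sector already show trouble, so the raw attribute values are worth watching.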
Re: [ceph-users] lacp bonding | working as expected..?
Hi Jacob,

Thanks for your reply. But I'm not sure I completely understand it. :-)

On 06/21/2018 09:09 PM, Jacob DeGlopper wrote: In your example, where you see one link being used, I see an even source IP paired with an odd destination port number for both transfers, or is that a search and replace issue?

Well, I left portnumbers as they were, I edited the IPs. Actually the machines are not a.b.c.9 and a.b.c.10, but a.b.c.204 and a.b.c.205; for the rest, everything is unedited. So a single link example:

Client connecting to a.b.c.205, TCP port 5001
TCP window size: 85.0 KByte (default)
[ 3] local a.b.c.204 port 60600 connected with a.b.c.205 port 5001
Client connecting to a.b.c.205, TCP port 5000
TCP window size: 85.0 KByte (default)
[ 3] local a.b.c.204 port 53788 connected with a.b.c.205 port 5000
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 746 MBytes 625 Mbits/sec
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 383 MBytes 321 Mbits/sec

And a lucky example:

Client connecting to a.b.c.205, TCP port 5001
TCP window size: 85.0 KByte (default)
[ 3] local a.b.c.204 port 37984 connected with a.b.c.205 port 5001
Client connecting to a.b.c.205, TCP port 5000
TCP window size: 85.0 KByte (default)
[ 3] local a.b.c.204 port 48850 connected with a.b.c.205 port 5000
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 1.09 GBytes 936 Mbits/sec
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 885 MBytes 742 Mbits/sec

(reason for the a.b.c.204 is that the IPs are public, and I'd rather not put them here)

I don't see the odd/even port numbers thing you noticed..? (I could very well miss something though) I see no way to specify what outgoing port iperf should use, otherwise I could try again using the same ports, to check the pattern.

Thanks again!
MJ
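For reference, the Linux bonding documentation gives the layer3+4 transmit hash as ((source port XOR dest port) XOR ((source IP XOR dest IP) AND 0xffff)) modulo the number of slaves. A sketch of that arithmetic, assuming (as in the thread) the two hosts' IPs differ only in the last octet, 204 vs 205:

```shell
#!/bin/sh
# Sketch of the bonding layer3+4 transmit hash from the Linux bonding docs:
#   slave = ((sport XOR dport) XOR ((sip XOR dip) AND 0xffff)) mod n_slaves
# Assumption: the hosts' IPs are equal except the last octet (204 vs 205),
# so the IP term reduces to 204 XOR 205 = 1.
ip_term=$(( (204 ^ 205) & 0xffff ))
slaves=2

slave_for_flow() {
    # $1 = source port, $2 = destination port
    echo $(( ( ($1 ^ $2) ^ ip_term ) % slaves ))
}

# The four iperf flows quoted above (source port, destination port):
a=$(slave_for_flow 60600 5001)
b=$(slave_for_flow 53788 5000)
c=$(slave_for_flow 37984 5001)
d=$(slave_for_flow 48850 5000)
echo "$a $b $c $d"
```

Note this only covers the transmit direction of the bond, and only if xmit_hash_policy is actually set to layer3+4: the default policy is layer2 (MAC-based), which maps all traffic between the same two hosts onto a single slave, and the switch applies its own hash for the return path. Either of those could explain two iperf flows sharing one 1G link.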
Re: [ceph-users] Adding additional disks to the production cluster without performance impacts on the existing
Hi Pardhiv,

On 06/08/2018 05:07 AM, Pardhiv Karri wrote: We recently added a lot of nodes to our ceph clusters. To mitigate a lot of problems (we are using the tree algorithm) we added an empty node first to the crushmap and then added OSDs with zero weight, made sure the ceph health was OK, and then started ramping up each OSD. I created a script to do it dynamically, which will check CPU of the new host with OSDs that

Would you mind sharing this script..?

MJ
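The ramp-up approach Pardhiv describes (add OSDs at crush weight zero, then raise the weight in steps, letting the cluster settle between steps) can be sketched as a small loop. This is a hypothetical sketch, not Pardhiv's actual script (which wasn't posted); the OSD id, target weight, and step size are made up, and CEPH is a dry-run stub that only prints the commands:

```shell
#!/bin/sh
# Hypothetical sketch: ramp a zero-weighted OSD up to its target crush
# weight in small steps. Replace "echo ceph" with "ceph" for real use.
CEPH="echo ceph"
OSD="osd.24"        # hypothetical new OSD, created with crush weight 0
TARGET_TENTHS=36    # final weight 3.6, roughly right for a 4TB disk
STEP_TENTHS=6       # raise by 0.6 per step

cmds=$(
    w=0
    while [ "$w" -lt "$TARGET_TENTHS" ]; do
        w=$((w + STEP_TENTHS))
        [ "$w" -gt "$TARGET_TENTHS" ] && w=$TARGET_TENTHS
        $CEPH osd crush reweight "$OSD" "$(awk "BEGIN{print $w/10}")"
        # On a real cluster, wait between steps until recovery settles:
        #   while ! ceph health | grep -q HEALTH_OK; do sleep 60; done
    done
)
printf '%s\n' "$cmds"
```

With the stub this prints the six reweight commands (0.6 up to 3.6); on a live cluster each step triggers a small, bounded rebalance instead of one big one.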
Re: [ceph-users] Adding cluster network to running cluster
On 06/07/2018 01:45 PM, Wido den Hollander wrote: Removing cluster network is enough. After the restart the OSDs will not publish a cluster network in the OSDMap anymore. You can keep the public network in ceph.conf and can even remove that after you removed the 10.10.x.x addresses from the system(s). Wido Thanks for the info, Wido. :-) ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] tunable question
Hi,

For the record: we changed tunables from "hammer" to "optimal" yesterday at 14:00, and it finished this morning at 9:00, so rebalancing took 19 hours. This was on a small ceph cluster: 24 4TB OSDs spread over three hosts, connected over 10G ethernet. Total amount of data: 32730 GB used, 56650 GB / 89380 GB avail.

We set noscrub and nodeep-scrub during the rebalance, and our VMs experienced basically no impact.

MJ

On 10/03/2017 05:37 PM, lists wrote: Thanks Jake, for your extensive reply. :-) MJ

On 3-10-2017 15:21, Jake Young wrote:
On Tue, Oct 3, 2017 at 8:38 AM lists <li...@merit.unu.edu> wrote: Hi, What would make the decision easier: if we knew that we could easily revert "ceph osd crush tunables optimal" once it has begun rebalancing data? Meaning: if we notice that the impact is too high, or it will take too long, could we simply say "ceph osd crush tunables hammer" again and the cluster would calm down?

Yes, you can revert the tunables back; but it will then move all the data back where it was, so be prepared for that.

Verify you have the following values in ceph.conf. Note that these are the defaults in Jewel, so if they aren't defined, you're probably good:
osd_max_backfills=1
osd_recovery_threads=1

You can try to set these (using ceph injectargs) if you notice a large impact on your client performance:
osd_recovery_op_priority=1
osd_recovery_max_active=1
osd_recovery_threads=1

I recall this tunables change when we went from hammer to jewel last year. It took over 24 hours to rebalance 122TB on our 110 osd cluster.

Jake

On 2-10-2017 9:41, Manuel Lausch wrote:
> Hi,
>
> We have similar issues. After upgrading from hammer to jewel, the
> "chooseleaf_stable" tunable was introduced. If we activate it, nearly
> all data will be moved. The cluster has 2400 OSDs on 40 nodes over two
> datacenters and is filled with 2.5 PB of data.
>
> We tried to enable it, but the backfilling traffic is too high to be
> handled without impacting other services on the network.
>
> Does someone know if it is necessary to enable this tunable? And could
> it be a problem in the future if we want to upgrade to newer versions
> without it enabled?
>
> Regards,
> Manuel Lausch
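The procedure reported above (pause scrubbing, switch tunables, wait out the rebalance, re-enable scrubbing) fits in a few commands. A sketch; CEPH is a dry-run stub that only prints the commands, so drop the "echo" on a real cluster:

```shell
#!/bin/sh
# Sketch of the tunables change performed above: pause scrubbing,
# change tunables, wait, re-enable scrubbing. Dry-run stub prints only.
CEPH="echo ceph"

log=$(
    $CEPH osd set noscrub
    $CEPH osd set nodeep-scrub
    $CEPH osd crush tunables optimal
    # ...wait here until "ceph -s" reports HEALTH_OK again
    # (19 hours for the cluster in the report above)...
    $CEPH osd unset noscrub
    $CEPH osd unset nodeep-scrub
)
printf '%s\n' "$log"
```

The noscrub/nodeep-scrub flags keep scrub I/O from piling on top of the backfill traffic, which is presumably why the VMs saw basically no impact.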
Re: [ceph-users] tunable question
Hi Dan, list,

Our cluster is small: three nodes, 24 4TB platter OSDs in total, SSD journals. Using rbd for VMs. That's it. Runs nicely though :-)

The fact that "tunable optimal" for jewel would result in "significantly fewer mappings change when an OSD is marked out of the cluster" is what attracts us. Reasoning behind it: upgrading to "optimal" NOW should result in faster rebuild-time when disaster strikes and we're all stressed out. :-)

After the jewel upgrade, we also upgraded the tunables from "(require bobtail, min is firefly)" to "hammer". This resulted in approx 24 hours of rebuild, but actually without significant impact on the hosted VMs. Is it safe to assume that setting it to "optimal" would have a similar impact, or are the implications bigger?

MJ

On 09/28/2017 10:29 AM, Dan van der Ster wrote: Hi, How big is your cluster and what is your use case? For us, we'll likely never enable the recent tunables that need to remap *all* PGs -- it would simply be too disruptive for marginal benefit. Cheers, Dan

On Thu, Sep 28, 2017 at 9:21 AM, mj <li...@merit.unu.edu> wrote: Hi, We have completed the upgrade to jewel, and we set tunables to hammer. Cluster again HEALTH_OK. :-) But now, we would like to proceed in the direction of luminous and bluestore OSDs, and we would like to ask for some feedback first. From the jewel ceph docs on tunables: "Changing tunable to "optimal" on an existing cluster will result in a very large amount of data movement as almost every PG mapping is likely to change." Given the above, and the fact that we would like to proceed to luminous/bluestore in the not too far away future, what is cleverer:

1 - keep the cluster at tunable hammer now, upgrade to luminous in a little while, change OSDs to bluestore, and then set tunables to optimal, or
2 - set tunable to optimal now, take the impact of "almost all PG remapping", and when that is finished, upgrade to luminous, bluestore etc.

Which route is the preferred one?
Or is there a third (or fourth?) option..? :-) MJ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] tunable question
Hi,

We have completed the upgrade to jewel, and we set tunables to hammer. Cluster again HEALTH_OK. :-)

But now, we would like to proceed in the direction of luminous and bluestore OSDs, and we would like to ask for some feedback first. From the jewel ceph docs on tunables: "Changing tunable to "optimal" on an existing cluster will result in a very large amount of data movement as almost every PG mapping is likely to change."

Given the above, and the fact that we would like to proceed to luminous/bluestore in the not too far away future, what is cleverer:

1 - keep the cluster at tunable hammer now, upgrade to luminous in a little while, change OSDs to bluestore, and then set tunables to optimal, or
2 - set tunable to optimal now, take the impact of "almost all PG remapping", and when that is finished, upgrade to luminous, bluestore etc.

Which route is the preferred one? Or is there a third (or fourth?) option..? :-)

MJ
Re: [ceph-users] librmb: Mail storage on RADOS with Dovecot
Hi,

I forwarded your announcement to the dovecot mailinglist. The following reply to it was posted there by Timo Sirainen. I'm forwarding it here, as you might not be reading the dovecot mailinglist.

Wido: First, the Github link: https://github.com/ceph-dovecot/dovecot-ceph-plugin I am not going to repeat everything which is on Github, but here's a short summary:

- CephFS is used for storing Mailbox Indexes
- E-Mails are stored directly as RADOS objects
- It's a Dovecot plugin

We would like everybody to test librmb and report back issues on Github so that further development can be done. It's not finalized yet, but all the help is welcome to make librmb the best solution for storing your e-mails on Ceph with Dovecot.

Timo: It would have been nicer if RADOS support was implemented as a lib-fs driver, and the fs-API had been used all over the place elsewhere. That way 1) LibRadosMailBox wouldn't have been relying so much on RADOS specifically and 2) fs-rados could have been used for other purposes. There are already fs-dict and dict-fs drivers, so the RADOS dict driver may not have been necessary to implement if fs-rados was implemented instead (although I didn't check it closely enough to verify). (We've had fs-rados on our TODO list for a while also.)

BTW. We've also been planning on open sourcing some of the obox pieces, mainly fs-drivers (e.g. fs-s3). The obox format maybe too, but without the "metacache" piece. The current obox code is a bit too much married into the metacache though to make open sourcing it easy. (The metacache is about storing the Dovecot index files in object storage and efficiently caching them on the local filesystem, which isn't planned to be open sourced in the near future. That's pretty much the only difficult piece of the obox plugin, with Cassandra integration coming as a good second. I wish there had been a better/easier geo-distributed key-value database to use - tombstones are annoyingly troublesome.)
And using the rmb-mailbox format, my main worries would be:

* it doesn't store index files (= message flags) - not necessarily a problem, as long as you don't want geo-replication
* index corruption means rebuilding them, which means rescanning the list of mail files, which means rescanning the whole RADOS namespace, which practically means rescanning the RADOS pool. That most likely is a very, very slow operation, which you want to avoid unless it's absolutely necessary. Need to be very careful to avoid that happening, and in general to avoid losing mails in case of crashes or other bugs.
* I think copying/moving mails physically copies the full data on disk
* Each IMAP/POP3/LMTP/etc process connects to RADOS separately from each other - some connection pooling would likely help here
Re: [ceph-users] Restart ceph cluster
Hi,

On 05/12/2017 03:35 PM, Vladimir Prokofev wrote: My best guess is that using systemd you can write some basic script to restart whatever OSDs you want. Another option is to use the same mechanics that ceph-deploy uses, but the principle is all the same - write some automation script. I would love to hear from people who have 100-1000+ OSDs in production though.

That's surely not me, but inserting config changes/updates is easily done to all OSDs with:

ceph tell osd.* injectargs

The only thing to take into consideration is to _also_ make the changes in ceph.conf, so they persist after a reboot.

But perhaps I completely misunderstand your question... ;-)

MJ
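The injectargs-plus-ceph.conf pattern described here can be sketched as follows; the setting used (osd_max_backfills) is just an example, and CEPH is a dry-run stub that only prints the command:

```shell
#!/bin/sh
# Sketch: push a runtime setting to all running OSDs at once. Remember
# that injectargs is runtime-only, so the same value must also go into
# ceph.conf on every node to survive a restart. Dry-run stub prints only.
CEPH="echo ceph"

out=$($CEPH tell 'osd.*' injectargs '--osd_max_backfills 1')
printf '%s\n' "$out"

# And persist it, e.g. in /etc/ceph/ceph.conf on every node:
#   [osd]
#   osd max backfills = 1
```

With the stub this prints the exact command that would be run; on a live cluster each OSD acknowledges (or rejects) the injected option.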
Re: [ceph-users] slow requests and short OSD failures in small cluster
Hi Gregory,

Reading your reply with great interest, thanks. Can you confirm my understanding now:

- live snapshots are more expensive for the cluster as a whole than taking the snapshot when the VM is switched off?
- using fstrim in VMs is (much?) more expensive when the VM has existing snapshots?
- it might be worthwhile to postpone upgrading from hammer to jewel until after your big announcement?
- we are on xfs (both for the ceph OSDs and the VMs) and that is the best combination to avoid these slow requests and CoW overhead with snapshots (or at least to minimise their impact)?

Any other tips, do's or don'ts, or things to keep in mind related to snapshots, VM/OSD filesystems, or using fstrim..? (our cluster is also small: hammer, three servers with 8 OSDs each, and journals on ssd, plenty of cpu/ram)

Again, thanks for your interesting post.

MJ
Re: [ceph-users] slow requests and short OSD failures in small cluster
On 04/18/2017 11:24 AM, Jogi Hofmüller wrote: This might have been true for hammer and older versions of ceph. From what I can tell now, every snapshot taken reduces performance of the entire cluster :(

Really? Can others confirm this? Is this a 'well-known fact'? (unknown only to us, perhaps...)

We are still on hammer, but if the result of upgrading to jewel is actually a massive performance decrease, I might postpone as long as possible... Most of our VMs have a snapshot or two...

MJ
Re: [ceph-users] slow requests and short OSD failures in small cluster
ah right: _during_ the actual removal, you mean. :-) clear now. mj On 04/13/2017 05:50 PM, Lionel Bouton wrote: Le 13/04/2017 à 17:47, mj a écrit : Hi, On 04/13/2017 04:53 PM, Lionel Bouton wrote: We use rbd snapshots on Firefly (and Hammer now) and I didn't see any measurable impact on performance... until we tried to remove them. What exactly do you mean with that? Just what I said : having snapshots doesn't impact performance, only removing them (obviously until Ceph is finished cleaning up). Lionel ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] slow requests and short OSD failures in small cluster
Hi, On 04/13/2017 04:53 PM, Lionel Bouton wrote: We use rbd snapshots on Firefly (and Hammer now) and I didn't see any measurable impact on performance... until we tried to remove them. What exactly do you mean with that? MJ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] clock skew
On 04/01/2017 04:02 PM, John Petrini wrote: Hello, I'm also curious about the impact of clock drift. We see the same on both of our clusters despite trying various NTP servers, including our own local servers. Ultimately we just ended up adjusting our monitoring to be less sensitive to it, since the clock drift always resolves on its own. Is this a dangerous practice?

Are you running ntp, or is this chrony? (which I did not know of)
Re: [ceph-users] clock skew
Hi, On 04/01/2017 02:10 PM, Wido den Hollander wrote: You could try the chrony NTP daemon instead of ntpd and make sure all MONs are peers from each other. I understand now what that means. I have set it up according to your suggestion. Curious to see how this works out, thanks! MJ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] clock skew
Hi Wido,

On 04/01/2017 02:10 PM, Wido den Hollander wrote: That warning is there for a reason. I suggest you double-check your NTP and clocks on the machines. This should never happen in production.

I know... I don't understand why this happens..! Tried both ntpd and systemd-timesyncd. I did not yet know chrony, will try it. I imagined that a 0.2 sec time skew would not be too disastrous... As a side note: I cannot find explained anywhere WHAT could happen if the skew becomes too big. Only that we should prevent it. (data loss?)

Are you running the MONs inside Virtual Machines? They are more likely to have drifting clocks.

Nope. All bare metal, on new supermicro servers.

You could try the chrony NTP daemon instead of ntpd and make sure all MONs are peers of each other.

Will try that. I had set all MONs to sync with chime1.surfnet.nl - chime4. We usually have very good experiences with those ntp servers. So, you're telling me that the MONs should be peers of each other... But if all MONs listen/sync to/with each other, where do I configure the external stratum 1 source?

MJ
Re: [ceph-users] clock skew
Hi!

On 04/01/2017 12:49 PM, Wei Jin wrote: mon_clock_drift_allowed should be used in the monitor process; what's the output of `ceph daemon mon.foo config show | grep clock`? How did you change the value? Command line or config file?

I guess I changed it wrong then... Did it in ceph.conf, like:

[global]
mon clock drift allowed = 0.1

and for immediate effect, also:

> ceph tell osd.* injectargs --mon_clock_drift_allowed "0.2"

So I guess that's wrong..? Should it be under the [mon] sections of ceph.conf? If listed under [global] like I have it now, then what does it actually change..?
[ceph-users] clock skew
Hi,

Despite ntp, we keep getting clock skews that auto-disappear again after a few minutes. To prevent the unnecessary HEALTH_WARNs, I have increased mon_clock_drift_allowed to 0.2, as can be seen below:

root@ceph1:~# ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep clock
  "mon_clock_drift_allowed": "0.2",
  "mon_clock_drift_warn_backoff": "5",
  "clock_offset": "0",
root@ceph1:~#

Despite this setting, I keep receiving HEALTH_WARNs like below:

ceph cluster node ceph1 health status became HEALTH_WARN clock skew detected on mon.1; Monitor clock skew detected mon.1 addr 10.10.89.2:6789/0 clock skew 0.113709s > max 0.1s (latency 0.000523111s)

Can anyone explain why the running config shows "mon_clock_drift_allowed": "0.2" while the HEALTH_WARN says "max 0.1s (latency 0.000523111s)"? How come there's a difference between the two?

MJ
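As Wei Jin points out earlier in this thread, mon_clock_drift_allowed is evaluated by the monitors, while the admin socket queried above (ceph-osd.0.asok) belongs to an OSD: the OSDs accepted the 0.2 value, but the MONs kept warning at their 0.1 default. A sketch of targeting the monitors instead (CEPH is a dry-run stub that only prints the command):

```shell
#!/bin/sh
# Sketch: inject the drift allowance into the MONs, not the OSDs.
# Dry-run stub; drop the "echo" on a real cluster.
CEPH="echo ceph"

out=$($CEPH tell 'mon.*' injectargs '--mon_clock_drift_allowed 0.2')
printf '%s\n' "$out"

# Persist it under [mon] (or [global]) in ceph.conf on each mon node:
#   [mon]
#   mon clock drift allowed = 0.2
# and verify against a *monitor* admin socket, not an OSD one:
#   ceph daemon mon.<id> config show | grep clock
```

With the stub this prints the tell command; on a live cluster each monitor would report the new value, and the earlier OSD-socket check explains why the setting appeared to be in effect while the warning threshold stayed at 0.1s.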
Re: [ceph-users] default pools gone. problem?
On 03/24/2017 10:13 PM, Bob R wrote: You can operate without the default pools without issue. Thanks! ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] default pools gone. problem?
Hi,

In the docs on pools, http://docs.ceph.com/docs/cuttlefish/rados/operations/pools/ it says: The default pools are:

* data
* metadata
* rbd

My ceph install has only ONE pool called "ceph-storage"; the others are gone. (probably deleted?)

Is not having those default pools a problem? Do I need to recreate them, or can they safely be deleted?

I'm on hammer, but intending to upgrade to jewel, and trying to identify potential issues, therefore this question.

MJ
Re: [ceph-users] ceph 'tech' question
On 03/24/2017 10:33 AM, ulem...@polarzone.de wrote: And why? Better distribution of read-access. Udo

Ah yes. On the other hand... In the case of specific often-requested data in your pool, the primary PG will have to handle all those requests, and in that case using a local copy would have benefits.

Anyway, thanks for your reply. :-)

MJ
[ceph-users] ceph 'tech' question
Hi all, Something that I am curious about: Suppose I have a three-server cluster, all with identical OSDs configuration, and also a replication factor of three. That would mean (I guess) that all 3 servers have a copy of everything in the ceph pool. My question: given that every machine has all the data, does that also imply that reads will be LOCAL on each machine? I'm asking because I understand that each PG has one primary copy optionally with extra secondary copies. (depending on the replication factor) I have the feeling that local reads will usually be faster than reads over the network. And if this is not the case, then why not? :-) Thanks for any insights or pointers! MJ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] add multiple OSDs to cluster
Hi Jonathan, Anthony and Steve,

Thanks very much for your valuable advice and suggestions!

MJ

On 03/21/2017 08:53 PM, Jonathan Proulx wrote: If it took 7hr for one drive you've probably already done this (or the defaults are for low-impact recovery), but before doing anything you want to be sure your OSD settings for max backfills, max recovery active, recovery sleep (perhaps others?) are set such that recovery and backfilling don't overwhelm production use. Look through the recovery section of http://docs.ceph.com/docs/master/rados/configuration/osd-config-ref/ This is important because if you do have a failure, and thus unplanned recovery, you want to have this tuned to your preferred balance of quick performance or quick return to full redundancy.

That said, my theory is to add things in as balanced a way as possible to minimize moves. What that means depends on your crush map. For me, I have 3 "racks" and all (most) of my pools are 3x replication, so each object should have one copy in each rack. I've only expanded once, but what I did was to add three servers, one to each 'rack'. I set them all 'in' at the same time, which should have minimized movement between racks and moved objects from other servers' osds in the same rack onto the osds in the new server. This seemed to work well for me.

In your case this would mean adding drives to all servers at once in a balanced way. That would prevent copying across servers, since the balance among servers wouldn't change. You could do one disk on each server, or load them all up and trust the recovery settings to keep the thundering herd in check.

As I said, I've only gone through one expansion round, and while this theory seemed to work out for me, hopefully someone with deeper knowledge can confirm or deny its general applicability.
-Jon On Tue, Mar 21, 2017 at 07:56:57PM +0100, mj wrote: :Hi, : :Just a quick question about adding OSDs, since most of the docs I can find :talk about adding ONE OSD, and I'd like to add four per server on my :three-node cluster. : :This morning I tried the careful approach, and added one OSD to server1. It :all went fine, everything rebuilt and I have a HEALTH_OK again now. It took :around 7 hours. : :But now I started thinking... (and that's when things go wrong, therefore :hoping for feedback here) : :The question: was I being stupid to add only ONE osd to the server1? Is it :not smarter to add all four OSDs at the same time? : :I mean: things will rebuild anyway...and I have the feeling that rebuilding :from 4 -> 8 OSDs is not going to be much heavier than rebuilding from 4 -> 5 :OSDs. Right? : :So better add all new OSDs together on a specific server? : :Or not? :-) : :MJ :___ :ceph-users mailing list :ceph-users@lists.ceph.com :http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
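For reference, the throttle settings Jon names would live in ceph.conf like this. These are the Jewel-era option names quoted above (and, per Jon, the Jewel defaults), shown only as a sketch of where they go:

```ini
# Recovery/backfill throttles discussed above (Jewel-era names).
# These are the defaults; define them explicitly if you want to be sure.
[osd]
osd max backfills = 1
osd recovery max active = 1
osd recovery op priority = 1
osd recovery threads = 1
```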
[ceph-users] add multiple OSDs to cluster
Hi, Just a quick question about adding OSDs, since most of the docs I can find talk about adding ONE OSD, and I'd like to add four per server on my three-node cluster. This morning I tried the careful approach, and added one OSD to server1. It all went fine, everything rebuilt and I have a HEALTH_OK again now. It took around 7 hours. But now I started thinking... (and that's when things go wrong, therefore hoping for feedback here) The question: was I being stupid to add only ONE osd to the server1? Is it not smarter to add all four OSDs at the same time? I mean: things will rebuild anyway...and I have the feeling that rebuilding from 4 -> 8 OSDs is not going to be much heavier than rebuilding from 4 -> 5 OSDs. Right? So better add all new OSDs together on a specific server? Or not? :-) MJ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] suddenly high memory usage for ceph-mon process
Hi Igor and David,

Thanks for your replies. There are no ceph-mds processes running in our cluster. I'm guessing David's reply applies to us, and we just need to set up additional monitoring for memory usage, so we get notified in case it happens again. Anyway: we learned that this can happen, so next time we know where to look first.

Thanks to you both for your replies,

MJ

On 11/04/2016 03:26 PM, igor.podo...@ts.fujitsu.com wrote: Maybe you hit this: https://github.com/ceph/ceph/pull/10238 which still waits for merge. This will occur only if you have a ceph-mds process in your cluster, but it's not configured (you don't need to use MDS; this process could be running only on some node). Check your monitor logs for something like "up but filesystem disabled" and how many similar lines you have. Regards, Igor.

-Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of mj Sent: Friday, November 4, 2016 2:06 PM To: ceph-users@lists.ceph.com Subject: [ceph-users] suddenly high memory usage for ceph-mon process

Hi, Running ceph 0.94.9 on jessie (proxmox), three hosts, 4 OSDs per host, ssd journal, 10G cluster network. Hosts have 65G ram. The cluster is generally not very busy. Suddenly we were getting HEALTH_WARN today, with two OSDs (both on the same server) being slow. Looking into this, we noticed very high memory usage on that host: 75% memory for ceph-mon! (normally here ceph-mon uses around 1% - 2%) I restarted ceph-mon on that host, and that seems to have brought things back to normal immediately. I don't see anything out of the ordinary in /var/log/syslog on that server, and also generally the cluster is HEALTH_OK. No changes to configs lately (last many weeks), and the last time I applied updates and rebooted was 30 days ago. No idea what could have caused this. Any ideas what to check, where to look? What would typically cause such high memory usage for the ceph-mon process?
MJ
[ceph-users] suddenly high memory usage for ceph-mon process
Hi,

Running ceph 0.94.9 on jessie (proxmox), three hosts, 4 OSDs per host, ssd journal, 10G cluster network. Hosts have 65G ram. The cluster is generally not very busy.

Suddenly we were getting HEALTH_WARN today, with two OSDs (both on the same server) being slow. Looking into this, we noticed very high memory usage on that host: 75% memory for ceph-mon! (normally here ceph-mon uses around 1% - 2%) I restarted ceph-mon on that host, and that seems to have brought things back to normal immediately.

I don't see anything out of the ordinary in /var/log/syslog on that server, and also generally the cluster is HEALTH_OK. No changes to configs lately (last many weeks), and the last time I applied updates and rebooted was 30 days ago. No idea what could have caused this.

Any ideas what to check, where to look? What would typically cause such high memory usage for the ceph-mon process?

MJ
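The "set up additional monitoring" conclusion from this thread could be as simple as a periodic RSS check on the ceph-mon process. A sketch; the RSS value below is a made-up sample standing in for real ps output, and the threshold is an arbitrary example:

```shell
#!/bin/sh
# Sketch of a ceph-mon memory watchdog. The rss_kb value is a made-up
# sample; on a live host you would capture it with something like:
#   rss_kb=$(ps -C ceph-mon -o rss= | head -n 1)
rss_kb=49152000                      # ~47 GB resident, roughly the 75%-of-65G incident above
threshold_kb=$((8 * 1024 * 1024))    # warn above 8 GB; pick your own limit

if [ "$rss_kb" -gt "$threshold_kb" ]; then
    status="WARN: ceph-mon RSS ${rss_kb} kB exceeds ${threshold_kb} kB"
else
    status="OK: ceph-mon RSS ${rss_kb} kB"
fi
echo "$status"
```

Run from cron and mail the WARN lines (or feed them to whatever alerting you already have), this would have flagged the runaway monitor long before the OSDs started reporting slow.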
Re: [ceph-users] 10Gbit switch advice for small ceph cluster upgrade
Hi Jelle, On 10/27/2016 03:04 PM, Jelle de Jong wrote: Hello everybody, I want to upgrade my small ceph cluster to 10Gbit networking and would like some recommendation from your experience. What is your recommend budget 10Gbit switch suitable for Ceph? We are running a 3-node cluster, with _direct_ 10G cable connections (quasi crosslink) between the three hosts. This is very low-budget, as it gives you 10G speed, without a (relatively) expensive 10G switch. Working fine here, with each host having a double 10G intel nic, plus a regular 1G interface. MJ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
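The direct-cabling setup described here follows the approach of the Proxmox "Full Mesh Network for Ceph Server" wiki page mentioned earlier in these threads. As a hypothetical illustration (interface names and addresses are made up), one simple way to address the links is a small point-to-point subnet per cable in /etc/network/interfaces:

```
# Node ceph1, hypothetical routed full mesh: one /30 per direct cable.
# Link ceph1<->ceph2 uses 10.15.15.0/30, link ceph1<->ceph3 uses 10.15.15.4/30.
auto ens1f0
iface ens1f0 inet static
        address 10.15.15.1
        netmask 255.255.255.252
        # direct cable to ceph2 (10.15.15.2)

auto ens1f1
iface ens1f1 inet static
        address 10.15.15.5
        netmask 255.255.255.252
        # direct cable to ceph3 (10.15.15.6)
```

The Proxmox wiki page describes other variants (routed with fallback, broadcast); the drawback named above stands for all of them: adding a fourth node means re-cabling and re-addressing, since a full mesh needs n-1 ports per node.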
Re: [ceph-users] running xfs_fsr on ceph OSDs
Hi Christian, Thanks for the reply / suggestion! MJ On 10/24/2016 10:02 AM, Christian Balzer wrote: Hello, On Mon, 24 Oct 2016 09:41:37 +0200 mj wrote: Hi, We have been running xfs on our servers for many years, and we are used to run a scheduled xfs_fsr during the weekend. Lately we have started using proxmox / ceph, and I'm wondering if we would benefit (like 'the old days') from scheduled xfs_fsr runs? Our OSDs are xfs, plus the VMs are also mostly running xfs. Both of which (in theory anyway) could be defragmented. Google doesn't tell me a lot, therefore I'm posing the question here: What is consensus here? Is it worth running xfs_fsr on VMs and OSDs? (or perhaps just one of both?) Only using XFS on some test OSDs, but the experience there and the anecdotes here suggest that it's quite prone to fragmentation over time. Of course instead of just running xfs_fsr willy-nilly, you might want to verify that fact for yourself and pick a schedule/time based on those results. And OSDs only, not within the VMs. Christian ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] running xfs_fsr on ceph OSDs
Hi, We have been running xfs on our servers for many years, and we are used to run a scheduled xfs_fsr during the weekend. Lately we have started using proxmox / ceph, and I'm wondering if we would benefit (like 'the old days') from scheduled xfs_fsr runs? Our OSDs are xfs, plus the VMs are also mostly running xfs. Both of which (in theory anyway) could be defragmented. Google doesn't tell me a lot, therefore I'm posing the question here: What is consensus here? Is it worth running xfs_fsr on VMs and OSDs? (or perhaps just one of both?) MJ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
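If scheduled defragmentation does turn out to help (per Christian's advice, measure fragmentation first and run it on the OSDs only), the "old days" weekend run translates directly into a cron entry. A hypothetical /etc/crontab line; xfs_fsr's -t flag limits its runtime in seconds, and without arguments it walks all mounted XFS filesystems:

```
# Hypothetical weekend window: defragment for at most 2 hours,
# Saturdays at 03:00. Without arguments xfs_fsr reorganizes all
# mounted XFS filesystems it finds.
0 3 * * 6  root  /usr/sbin/xfs_fsr -t 7200
```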
Re: [ceph-users] Surviving a ceph cluster outage: the hard way
Hi, Interesting reading! Any chance you could state some of your lessons (if any) you learned..? I can, for example, imagine your situation would have been much better with a replication factor of three instead of two..? MJ On 10/20/2016 12:09 AM, Kostis Fardelas wrote: Hello cephers, this is the blog post on our Ceph cluster's outage we experienced some weeks ago and about how we managed to revive the cluster and our clients's data. I hope it will prove useful for anyone who will find himself/herself in a similar position. Thanks for everyone on the ceph-users and ceph-devel lists who contributed to our inquiries during troubleshooting. https://blog.noc.grnet.gr/2016/10/18/surviving-a-ceph-cluster-outage-the-hard-way/ Regards, Kostis ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] rbd pool:replica size choose: 2 vs 3
Hi,

On 09/23/2016 09:41 AM, Dan van der Ster wrote:
> If you care about your data you run with size = 3 and min_size = 2.
>
> Wido

We're currently running with min_size 1. Can we simply change this, online,
with:

ceph osd pool set vm-storage min_size 2

and expect everything to continue running? (Our cluster is HEALTH_OK, enough
disk space, etc.)

MJ
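For reference, a minimal sketch of the commands involved, using the vm-storage pool name from the post. This assumes the pool's size is already at least 2; min_size can be changed online and, unlike a size change, does not by itself trigger any data movement:

```shell
# Inspect the current replication settings of the pool first:
ceph osd pool get vm-storage size
ceph osd pool get vm-storage min_size

# Raise min_size online; PGs with fewer than min_size replicas available
# will stop serving I/O until they recover, so do this while HEALTH_OK:
ceph osd pool set vm-storage min_size 2

# Verify the cluster still reports healthy afterwards:
ceph health detail
```

These commands need a running cluster and admin keyring, so they are shown as a sketch rather than something to paste blindly.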
Re: [ceph-users] rados bench output question
Hi Christian,

Thanks a lot for all your information! (Especially the bit that Ceph never
reads from the journal, but writes to the OSDs from memory, was new for me.)

MJ

On 09/07/2016 03:20 AM, Christian Balzer wrote:
> Hello,
>
> On Tue, 6 Sep 2016 13:38:45 +0200 lists wrote:
>> Hi Christian,
>>
>> Thanks for your reply.
>>
>>> What SSD model (be precise)?
>> Samsung 480GB PM863 SSD
>
> So that's not your culprit then (they are supposed to handle sync writes
> at full speed).
>
>>> Only one SSD?
>> Yes, with a 5GB partition-based journal for each OSD.
>
> A bit small, but in normal scenarios that shouldn't be a problem. Read:
> https://www.mail-archive.com/ceph-users@lists.ceph.com/msg28003.html
>
>> During the 0 MB/sec, there is NO increased cpu usage: it is usually
>> around 15-20% for the four ceph-osd processes.
>>
>>> Watch your node(s) with atop or iostat.
>> Ok, I will do.
>
> Best results will be had with 3 large terminals (one per node) running
> atop, with the interval set to at least 5, down from the default 10
> seconds. Same with iostat, parameters "-x 2".
>
>> Do we have an issue? And if yes: anyone with a suggestion where to look?
>
> You will find that either your journal SSD is overwhelmed (and a single
> SSD peaking around 500MB/s wouldn't be that surprising), or that your
> HDDs can't scribble away at more than the speed above, the more likely
> reason. Or even a combination of both.
>
> Ceph needs to flush data to the OSDs eventually (and that is usually more
> or less immediately with default parameters), so for a sustained,
> sequential write test you're looking at the speed of your HDDs. And that
> will be spiky of sorts, due to FS journals, seeks for other writes
> (replicas), etc.
>
>> But would we expect the MB/sec to drop to ZERO during journal-to-OSD
>> flushes?
>
> A common misconception when people start out with Ceph, and probably
> something that should be better explained in the docs. Or not, given that
> BlueStore is on the shimmering horizon.
>
> Ceph never reads from the journals, unless there has been a crash.
> (Now would be a good time to read that link above if you haven't yet.)
>
> What happens is that (depending on the various filestore and journal
> parameters) Ceph starts flushing the still-in-memory data to the OSD
> (disk, FS) after the journal has been written, as I mentioned above.
> The logic here is to not create an I/O storm after letting things pile
> up for a long time. People with fast storage subsystems and/or SSDs/NVMes
> as OSDs tend to tune these parameters.
>
> So now think about what happens during that rados bench run: a 4MB object
> gets written (created, then filled), so the client talks to the OSD that
> holds the primary PG for that object. That OSD writes the data to the
> journal and sends it to the other OSDs (replicas). Once all journals have
> been written, the primary OSD acks the write to the client. And this
> happens with 16 threads by default, making things nicely busy.
>
> Now, keeping in mind the above description and the fact that you have a
> small cluster, a single OSD that gets too busy will basically block the
> whole cluster. So things dropping to zero means that at least one OSD was
> so busy (not CPU in your case, but IOwait) that it couldn't take in more
> data.
>
> The fact that your drops happen at a rather predictable, roughly 9-second
> interval also suggests the possibility that the actual journal got full,
> but that's not conclusive.
>
> Christian

Thanks for the quick feedback; I'll dive into atop and iostat next.

Regards,
MJ
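Christian's journal-full hypothesis can be sanity-checked with simple arithmetic: a journal fills when client ingest exceeds the rate at which the HDDs drain it. The 5 GB journal size comes from the thread; the throughput figures below are illustrative assumptions, not measurements from this cluster:

```shell
# Back-of-envelope: how long until a journal partition fills up if the
# SSD takes in writes faster than the backing HDD can flush them?
JOURNAL_MB=5120      # 5 GB journal partition (from the thread)
INGEST_MBS=550       # assumed client ingest hitting the journal SSD
DRAIN_MBS=100        # assumed sustained flush rate to the HDD
FILL_S=$(( JOURNAL_MB / (INGEST_MBS - DRAIN_MBS) ))
echo "journal full after ~${FILL_S}s of sustained writes"
# prints: journal full after ~11s of sustained writes
```

With these made-up numbers the journal fills in roughly 11 seconds, the same ballpark as the ~9-second stalls observed, which is at least consistent with Christian's suggestion that the journal was filling up.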