Re: [ceph-users] dashboard returns 401 on successful auth

2019-06-06 Thread Nathan Fish
I have filed this bug:
https://tracker.ceph.com/issues/40051
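
In the meantime, a few things are worth checking (a sketch, assuming the
Nautilus dashboard access-control commands are available; "drew" and the
password are placeholders):

    ceph mgr services                       # confirm you are hitting the active mgr's URL
    ceph dashboard ac-user-show drew        # confirm the account exists
    ceph dashboard ac-user-set-password drew NewPassword
    ceph mgr module disable dashboard       # bouncing the module can clear stale sessions
    ceph mgr module enable dashboard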

On Thu, Jun 6, 2019 at 12:52 PM Drew Weaver  wrote:
>
> Hello,
>
>
>
> I was able to get Nautilus running on my cluster.
>
>
>
> When I try to log in to the dashboard with the user I created and enter the 
> correct credentials, in the log I see:
>
>
>
> 2019-06-06 12:51:43.738 7f373ec9b700  1 mgr[dashboard] 
> [:::192.168.105.1:56110] [GET] [401] [0.002s] [271B] 
> /api/settings/alertmanager-api-host
>
> 2019-06-06 12:51:43.741 7f373ec9b700  1 mgr[dashboard] 
> [:::192.168.105.1:56110] [GET] [401] [0.002s] [271B] /api/health/minimal
>
> 2019-06-06 12:51:43.745 7f373ec9b700  1 mgr[dashboard] 
> [:::192.168.105.1:56110] [GET] [401] [0.002s] [271B] /api/health/minimal
>
> 2019-06-06 12:51:43.755 7f373dc99700  1 mgr[dashboard] 
> [:::192.168.105.1:56111] [GET] [401] [0.001s] [271B] /api/feature_toggles
>
>
>
> And in the browser nothing happens.
>
>
>
> Any ideas?
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS getattr op stuck in snapshot

2019-06-12 Thread Nathan Fish
I have run into a similar hang on 'ls .snap' recently:
https://tracker.ceph.com/issues/40101#note-2

On Wed, Jun 12, 2019 at 9:33 AM Yan, Zheng  wrote:
>
> On Wed, Jun 12, 2019 at 3:26 PM Hector Martin  wrote:
> >
> > Hi list,
> >
> > I have a setup where two clients mount the same filesystem and
> > read/write from mostly non-overlapping subsets of files (Dovecot mail
> > storage/indices). There is a third client that takes backups by
> > snapshotting the top-level directory, then rsyncing the snapshot over to
> > another location.
> >
> > Ever since I switched the backup process to using snapshots, the rsync
> > process has stalled at a certain point during the backup with a stuck
> > MDS op:
> >
> > root@mon02:~# ceph daemon mds.mon02 dump_ops_in_flight
> > {
> > "ops": [
> > {
> > "description": "client_request(client.146682828:199050
> > getattr pAsLsXsFs #0x107//bak-20190612094501/ > path>/dovecot.index.log 2019-06-12 12:20:56.992049 caller_uid=5000,
> > caller_gid=5000{})",
> > "initiated_at": "2019-06-12 12:20:57.001534",
> > "age": 9563.847754,
> > "duration": 9563.847780,
> > "type_data": {
> > "flag_point": "failed to rdlock, waiting",
> > "reqid": "client.146682828:199050",
> > "op_type": "client_request",
> > "client_info": {
> > "client": "client.146682828",
> > "tid": 199050
> > },
> > "events": [
> > {
> > "time": "2019-06-12 12:20:57.001534",
> > "event": "initiated"
> > },
> > {
> > "time": "2019-06-12 12:20:57.001534",
> > "event": "header_read"
> > },
> > {
> > "time": "2019-06-12 12:20:57.001538",
> > "event": "throttled"
> > },
> > {
> > "time": "2019-06-12 12:20:57.001550",
> > "event": "all_read"
> > },
> > {
> > "time": "2019-06-12 12:20:57.001713",
> > "event": "dispatched"
> > },
> > {
> > "time": "2019-06-12 12:20:57.001997",
> > "event": "failed to rdlock, waiting"
> > }
> > ]
> > }
> > }
> > ],
> > "num_ops": 1
> > }
> >
> > AIUI, when a snapshot is taken, all clients with dirty data are supposed
> > to get a message to flush it to the cluster in order to produce a
> > consistent snapshot. My guess is this isn't happening properly, so reads
> > of that file in the snapshot are blocked. Doing a 'echo 3 >
> > /proc/sys/vm/drop_caches' on both of the writing clients seems to clear
> > the stuck op, but doing it once isn't enough; usually I get the stuck op
> > and have to clear caches twice after making any given snapshot.
> >
> > Everything is on Ubuntu. The cluster is running 13.2.4 (mimic), and the
> > clients are using the kernel client version 4.18.0-20-generic (writers)
> > and 4.18.0-21-generic (backup host).
> >
> > I managed to reproduce it like this:
> >
> > host1$ mkdir _test
> > host1$ cd _test/.snap
> >
> > host2$ cd _test
> > host2$ for i in $(seq 1 1); do (sleep 0.1; echo $i; sleep 1) > b_$i
> > & sleep 0.05; done
> >
> > (while that is running)
> >
> > host1$ mkdir s11
> > host1$ cd s11
> >
> > (wait a few seconds)
> >
> > host2$ ^C
> >
> > host1$ ls -al
> > (hangs)
> >
> > This yielded this stuck request:
> >
> > {
> > "ops": [
> > {
> > "description": "client_request(client.146687505:13785
> > getattr pAsLsXsFs #0x117f41c//s11/b_42 2019-06-12 16:15:59.095025
> > caller_uid=0, caller_gid=0{})",
> > "initiated_at": "2019-06-12 16:15:59.095559",
> > "age": 30.846294,
> > "duration": 30.846318,
> > "type_data": {
> > "flag_point": "failed to rdlock, waiting",
> > "reqid": "client.146687505:13785",
> > "op_type": "client_request",
> > "client_info": {
> > "client": "client.146687505",
> > "tid": 13785
> > },
> > "events": [
> > {
> > "time": "2019-06-12 16:15:59.095559",
> > "event": "initiated"
> > },
> > {
> > "time": "2019-06-12 16:15:59.095559",
> > "event": "header_read"
> > },
> > {
> > "time": "2019-06-12 16:15:59.095562",
> > "event": 

[ceph-users] Degraded pgs during async randwrites

2019-05-06 Thread Nathan Fish
Hello all, I'm testing out a new cluster that we hope to put into
production soon. Performance has overall been great, but there's one
benchmark that not only stresses the cluster, but causes it to degrade
- async randwrites.

The benchmark:
# The file was previously laid out with dd'd random data to prevent sparseness
root@mc-3015-201:~# fio --rw=randwrite --bs=4k --size=100G
--numjobs=$JOBS --group_reporting --directory=/mnt/ceph/
--name=largerandwrite --iodepth=16 --end_fsync=1
largerandwrite: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W)
4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=16
fio-3.1
Starting 1 process
Jobs: 1 (f=1): [F(1)][100.0%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 00m:00s]
largerandwrite: (groupid=0, jobs=1): err= 0: pid=17230: Mon May  6 11:30:11 2019
  write: IOPS=14.7k, BW=57.4MiB/s (60.2MB/s)(100GiB/1782445msec)
clat (nsec): min=1617, max=120033k, avg=12644.96, stdev=379152.20
 lat (nsec): min=1656, max=120033k, avg=12687.21, stdev=379152.31
clat percentiles (usec):
 |  1.00th=[3],  5.00th=[3], 10.00th=[3], 20.00th=[3],
 | 30.00th=[3], 40.00th=[3], 50.00th=[4], 60.00th=[4],
 | 70.00th=[4], 80.00th=[5], 90.00th=[5], 95.00th=[6],
 | 99.00th=[9], 99.50th=[   11], 99.90th=[   24], 99.95th=[10290],
 | 99.99th=[19530]
   bw (  KiB/s): min=19424, max=1395544, per=100.00%, avg=306914.86,
stdev=392390.46, samples=683
   iops: min= 4856, max=348886, avg=76728.66, stdev=98097.60,
samples=683
  lat (usec)   : 2=0.01%, 4=78.02%, 10=21.37%, 20=0.43%, 50=0.10%
  lat (usec)   : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 10=0.01%, 20=0.06%, 50=0.01%, 250=0.01%
  cpu  : usr=0.78%, sys=5.03%, ctx=30215, majf=0, minf=1657
  IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 issued rwt: total=0,26214400,0, short=0,0,0, dropped=0,0,0
 latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
  WRITE: bw=57.4MiB/s (60.2MB/s), 57.4MiB/s-57.4MiB/s
(60.2MB/s-60.2MB/s), io=100GiB (107GB), run=1782445-1782445msec

Setup:
Ubuntu 18.04.2 + Nautilus repo (deb
https://download.ceph.com/debian-nautilus bionic main)
3 hosts, 100Gbit/s NIC, and each has:
Dual-socket Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz (20 cores each)
2 Optane NVMe cards, in one LVM VG - vg_optane.
18 12TB 7200rpm SAS drives, each running a bluestore OSD
each HDD OSD has a 32GB LV on vg_optane for wal+db
3 OSDs on 32GB lv's on vg_optane
1 mon, mgr, and mds

cephfs_data on hdd osds, 512 pgs
cephfs_metadata on ssd osds, 16 pgs
both replicated, size = 3, min_size = 2

root@m3-3101-422:~# ceph df
RAW STORAGE:
CLASS     SIZE        AVAIL       USED        RAW USED     %RAW USED
hdd       591 TiB     582 TiB     8.7 TiB      8.8 TiB          1.49
ssd       288 GiB     257 GiB      14 GiB       31 GiB         10.75
TOTAL     591 TiB     582 TiB     8.8 TiB      8.8 TiB          1.49

POOLS:
POOL                ID     STORED      OBJECTS     USED        %USED     MAX AVAIL
cephfs_metadata     23     5.3 GiB       5.71k     5.6 GiB      2.30        79 GiB
cephfs_data         24     1.6 TiB      11.43M     6.8 TiB      1.23       183 TiB


Observations:
When starting the benchmark, it's over 10 seconds before iostat shows
any activity on the OSD drives. fio gets to 100% very quickly, then
the end_fsync takes a long time.
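
For comparison, a variant that bypasses the client page cache should make the
OSDs show activity almost immediately and avoids the large end_fsync flush at
the end (a sketch; assumes libaio is installed on the client and the same
pre-allocated file is reused):

    fio --rw=randwrite --bs=4k --size=100G --numjobs=$JOBS --group_reporting \
        --directory=/mnt/ceph/ --name=largerandwrite \
        --ioengine=libaio --iodepth=16 --direct=1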

root@m3-3101-422:~# ceph health detail | head
HEALTH_WARN noscrub,nodeep-scrub flag(s) set; Degraded data
redundancy: 207/34295076 objects degraded (0.001%), 28 pgs degraded, 5
pgs undersized; 18 pgs not deep-scrubbed in time; 528 pgs not scrubbed
in time; 1 pools have too many placement groups; too few PGs per OSD
(25 < min 30)
OSDMAP_FLAGS noscrub,nodeep-scrub flag(s) set
PG_DEGRADED Degraded data redundancy: 207/34295076 objects degraded
(0.001%), 28 pgs degraded, 5 pgs undersized
pg 24.8 is active+recovery_wait+degraded, acting [10,40,36]
pg 24.11 is active+recovery_wait+degraded, acting [40,0,36]
pg 24.15 is stuck undersized for 137.001663, current state
active+recovery_wait+undersized+degraded+remapped, last acting [16,33]
pg 24.25 is active+recovery_wait+degraded, acting [7,20,40]
pg 24.2e is active+recovering+degraded, acting [45,40,32]
pg 24.49 is active+recovery_wait+degraded, acting [1,40,9]
pg 24.5d is active+recovery_wait+degraded, acting [16,40,27]

It seems that the OSDs cannot keep up and thus become degraded. I
understand that 4k randwrites is a very harsh benchmark for a
distributed cluster of spinning disks, and the actual performance is
acceptable. What I want is for Ceph to not enter HEALTH_WARN when
doing it.

Things I have tried:
Increasing OSD journal size to 10GB (bluestore_block_wal_size) - no effect
Setting write_congestion_kb to 

Re: [ceph-users] What does the differences in osd benchmarks mean?

2019-06-27 Thread Nathan Fish
Are these dual-socket machines? Perhaps NUMA is involved?

On Thu., Jun. 27, 2019, 4:56 a.m. Lars Täuber,  wrote:

> Hi!
>
> In our cluster I ran some benchmarks.
> The results are always similar but strange to me.
> I don't know what the results mean.
> The cluster consists of 7 (nearly) identical hosts for osds. Two of them
> have an additional hdd.
> The hdds are of identical type. The ssds for the journal and wal are of
> identical type. The configuration is identical (ssd-db-lv-size) for each
> osd.
> The hosts are connected the same way to the same switches.
> This nautilus cluster was set up with ceph-ansible 4.0 on debian buster.
>
> These are the results of
> # ceph --format plain tell osd.* bench
>
> osd.0: bench: wrote 1 GiB in blocks of 4 MiB in 15.0133 sec at 68 MiB/sec
> 17 IOPS
> osd.1: bench: wrote 1 GiB in blocks of 4 MiB in 6.98357 sec at 147 MiB/sec
> 36 IOPS
> osd.2: bench: wrote 1 GiB in blocks of 4 MiB in 6.80336 sec at 151 MiB/sec
> 37 IOPS
> osd.3: bench: wrote 1 GiB in blocks of 4 MiB in 12.0813 sec at 85 MiB/sec
> 21 IOPS
> osd.4: bench: wrote 1 GiB in blocks of 4 MiB in 8.51311 sec at 120 MiB/sec
> 30 IOPS
> osd.5: bench: wrote 1 GiB in blocks of 4 MiB in 6.61376 sec at 155 MiB/sec
> 38 IOPS
> osd.6: bench: wrote 1 GiB in blocks of 4 MiB in 14.7478 sec at 69 MiB/sec
> 17 IOPS
> osd.7: bench: wrote 1 GiB in blocks of 4 MiB in 12.9266 sec at 79 MiB/sec
> 19 IOPS
> osd.8: bench: wrote 1 GiB in blocks of 4 MiB in 15.2513 sec at 67 MiB/sec
> 16 IOPS
> osd.9: bench: wrote 1 GiB in blocks of 4 MiB in 9.26225 sec at 111 MiB/sec
> 27 IOPS
> osd.10: bench: wrote 1 GiB in blocks of 4 MiB in 13.6641 sec at 75 MiB/sec
> 18 IOPS
> osd.11: bench: wrote 1 GiB in blocks of 4 MiB in 13.8943 sec at 74 MiB/sec
> 18 IOPS
> osd.12: bench: wrote 1 GiB in blocks of 4 MiB in 13.235 sec at 77 MiB/sec
> 19 IOPS
> osd.13: bench: wrote 1 GiB in blocks of 4 MiB in 10.4559 sec at 98 MiB/sec
> 24 IOPS
> osd.14: bench: wrote 1 GiB in blocks of 4 MiB in 12.469 sec at 82 MiB/sec
> 20 IOPS
> osd.15: bench: wrote 1 GiB in blocks of 4 MiB in 17.434 sec at 59 MiB/sec
> 14 IOPS
> osd.16: bench: wrote 1 GiB in blocks of 4 MiB in 11.7184 sec at 87 MiB/sec
> 21 IOPS
> osd.17: bench: wrote 1 GiB in blocks of 4 MiB in 12.8702 sec at 80 MiB/sec
> 19 IOPS
> osd.18: bench: wrote 1 GiB in blocks of 4 MiB in 20.1894 sec at 51 MiB/sec
> 12 IOPS
> osd.19: bench: wrote 1 GiB in blocks of 4 MiB in 9.60049 sec at 107
> MiB/sec 26 IOPS
> osd.20: bench: wrote 1 GiB in blocks of 4 MiB in 15.0613 sec at 68 MiB/sec
> 16 IOPS
> osd.21: bench: wrote 1 GiB in blocks of 4 MiB in 17.6074 sec at 58 MiB/sec
> 14 IOPS
> osd.22: bench: wrote 1 GiB in blocks of 4 MiB in 16.39 sec at 62 MiB/sec
> 15 IOPS
> osd.23: bench: wrote 1 GiB in blocks of 4 MiB in 15.2747 sec at 67 MiB/sec
> 16 IOPS
> osd.24: bench: wrote 1 GiB in blocks of 4 MiB in 10.2462 sec at 100
> MiB/sec 24 IOPS
> osd.25: bench: wrote 1 GiB in blocks of 4 MiB in 13.5297 sec at 76 MiB/sec
> 18 IOPS
> osd.26: bench: wrote 1 GiB in blocks of 4 MiB in 7.46824 sec at 137
> MiB/sec 34 IOPS
> osd.27: bench: wrote 1 GiB in blocks of 4 MiB in 11.2216 sec at 91 MiB/sec
> 22 IOPS
> osd.28: bench: wrote 1 GiB in blocks of 4 MiB in 16.6205 sec at 62 MiB/sec
> 15 IOPS
> osd.29: bench: wrote 1 GiB in blocks of 4 MiB in 10.1477 sec at 101
> MiB/sec 25 IOPS
>
>
> The different runs differ by ±1 IOPS.
> Why are the osds 1,2,4,5,9,19,26 faster than the others?
>
> Restarting an osd did change the result.
>
> Could someone give me hint where to look further to find the reason?
>
> Thanks
> Lars
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] shutdown down all monitors

2019-07-11 Thread Nathan Fish
The monitors determine quorum, so stopping all monitors will
immediately stop IO to prevent split-brain. I would not recommend
shutting down all mons at once in production, though it *should* come
back up fine. If you really need to, shut them down in a certain
order, and bring them back up in the opposite order.
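
If you really do need to take them all down, the pattern would be roughly this
(a sketch; assumes systemd-managed mons named after the short hostname):

    # on each mon host, in your chosen order:
    systemctl stop ceph-mon@$(hostname -s)
    # later, bring them back in the reverse order:
    systemctl start ceph-mon@$(hostname -s)
    ceph -s    # wait for all mons to rejoin quorum before proceeding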

On Thu, Jul 11, 2019 at 5:42 AM Marc Roos  wrote:
>
>
>
> Can I temporarily shut down all my monitors? This only affects new
> connections, right? Existing connections will still keep running?
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Pool stats issue with upgrades to nautilus

2019-07-12 Thread Nathan Fish
Excellent! I have been checking the tracker
(https://tracker.ceph.com/versions/574) every day, and there hadn't
been any movement for weeks.
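
For anyone planning the conversion Sage describes below, the per-OSD repair can
be wrapped in a small loop (a sketch, assuming the standard
/var/lib/ceph/osd/ceph-$ID layout; convert one host at a time and wait for
HEALTH_OK before moving on):

    for dir in /var/lib/ceph/osd/ceph-*; do
        id=${dir##*-}
        systemctl stop ceph-osd@$id
        ceph-bluestore-tool repair --path $dir
        systemctl start ceph-osd@$id
    done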

On Fri, Jul 12, 2019 at 11:29 AM Sage Weil  wrote:
>
> On Fri, 12 Jul 2019, Nathan Fish wrote:
> > Thanks. Speaking of 14.2.2, is there a timeline for it? We really want
> > some of the fixes in it as soon as possible.
>
> I think it's basically ready now... probably Monday?
>
> sage
>
> >
> > On Fri, Jul 12, 2019 at 11:22 AM Sage Weil  wrote:
> > >
> > > Hi everyone,
> > >
> > > All current Nautilus releases have an issue where deploying a single new
> > > (Nautilus) BlueStore OSD on an upgraded cluster (i.e. one that was
> > > originally deployed pre-Nautilus) breaks the pool utilization stats
> > > reported by ``ceph df``.  Until all OSDs have been reprovisioned or
> > > updated (via ``ceph-bluestore-tool repair``), the pool stats will show
> > > values that are lower than the true value.  A fix is in the works but will
> > > not appear until 14.2.3.  Users who have upgraded to Nautilus (or are
> > > considering upgrading) may want to delay provisioning new OSDs until the
> > > fix is available in the next release.
> > >
> > > This issue will only affect you if:
> > >
> > > - You started with a pre-nautilus cluster and upgraded
> > > - You then provision one or more new BlueStore OSDs, or run
> > >   'ceph-bluestore-tool repair' on an upgraded OSD.
> > >
> > > The symptom is that the pool stats from 'ceph df' are too small.  For
> > > example, the pre-upgrade stats on our test cluster were
> > >
> > > ...
> > > POOLS:
> > > POOL     ID     STORED     OBJECTS     USED       %USED     MAX AVAIL
> > > data      0     63 TiB      44.59M     63 TiB     30.21        48 TiB
> > > ...
> > >
> > > but when one OSD was updated it changed to
> > >
> > > POOLS:
> > > POOL     ID     STORED      OBJECTS     USED        %USED     MAX AVAIL
> > > data      0     558 GiB      43.50M     1.7 TiB      1.22        45 TiB
> > >
> > > The root cause is that, starting with Nautilus, BlueStore maintains
> > > per-pool usage stats, but it requires a slight on-disk format change;
> > > upgraded OSDs won't have the new stats until you run a ceph-bluestore-tool
> > > repair.  The problem is that the mon starts using the new stats as soon as
> > > *any* OSDs are reporting per-pool stats (instead of waiting until *all*
> > > OSDs are doing so).
> > >
> > > To avoid the issue, either
> > >
> > >  - do not provision new BlueStore OSDs after the upgrade, or
> > >  - update all OSDs to keep new per-pool stats.  An existing BlueStore
> > >OSD can be converted with
> > >
> > >  systemctl stop ceph-osd@$N
> > >  ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-$N
> > >  systemctl start ceph-osd@$N
> > >
> > >Note that FileStore does not support the new per-pool stats at all, so
> > >if there are filestore OSDs in your cluster there is no workaround
> > >that doesn't involve replacing the filestore OSDs with bluestore.
> > >
> > > A fix[1] is working its way through QA and will appear in 14.2.3; it
> > > won't quite make the 14.2.2 release.
> > >
> > > sage
> > >
> > >
> > > [1] https://github.com/ceph/ceph/pull/28978
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> >
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Pool stats issue with upgrades to nautilus

2019-07-12 Thread Nathan Fish
Thanks. Speaking of 14.2.2, is there a timeline for it? We really want
some of the fixes in it as soon as possible.

On Fri, Jul 12, 2019 at 11:22 AM Sage Weil  wrote:
>
> Hi everyone,
>
> All current Nautilus releases have an issue where deploying a single new
> (Nautilus) BlueStore OSD on an upgraded cluster (i.e. one that was
> originally deployed pre-Nautilus) breaks the pool utilization stats
> reported by ``ceph df``.  Until all OSDs have been reprovisioned or
> updated (via ``ceph-bluestore-tool repair``), the pool stats will show
> values that are lower than the true value.  A fix is in the works but will
> not appear until 14.2.3.  Users who have upgraded to Nautilus (or are
> considering upgrading) may want to delay provisioning new OSDs until the
> fix is available in the next release.
>
> This issue will only affect you if:
>
> - You started with a pre-nautilus cluster and upgraded
> - You then provision one or more new BlueStore OSDs, or run
>   'ceph-bluestore-tool repair' on an upgraded OSD.
>
> The symptom is that the pool stats from 'ceph df' are too small.  For
> example, the pre-upgrade stats on our test cluster were
>
> ...
> POOLS:
> POOL     ID     STORED     OBJECTS     USED       %USED     MAX AVAIL
> data      0     63 TiB      44.59M     63 TiB     30.21        48 TiB
> ...
>
> but when one OSD was updated it changed to
>
> POOLS:
> POOL     ID     STORED      OBJECTS     USED        %USED     MAX AVAIL
> data      0     558 GiB      43.50M     1.7 TiB      1.22        45 TiB
>
> The root cause is that, starting with Nautilus, BlueStore maintains
> per-pool usage stats, but it requires a slight on-disk format change;
> upgraded OSDs won't have the new stats until you run a ceph-bluestore-tool
> repair.  The problem is that the mon starts using the new stats as soon as
> *any* OSDs are reporting per-pool stats (instead of waiting until *all*
> OSDs are doing so).
>
> To avoid the issue, either
>
>  - do not provision new BlueStore OSDs after the upgrade, or
>  - update all OSDs to keep new per-pool stats.  An existing BlueStore
>OSD can be converted with
>
>  systemctl stop ceph-osd@$N
>  ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-$N
>  systemctl start ceph-osd@$N
>
>Note that FileStore does not support the new per-pool stats at all, so
>if there are filestore OSDs in your cluster there is no workaround
>that doesn't involve replacing the filestore OSDs with bluestore.
>
> A fix[1] is working its way through QA and will appear in 14.2.3; it
> won't quite make the 14.2.2 release.
>
> sage
>
>
> [1] https://github.com/ceph/ceph/pull/28978
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] increase pg_num error

2019-07-01 Thread Nathan Fish
I ran into this recently. Try running "ceph osd require-osd-release
nautilus". This drops backwards compat with pre-nautilus and allows
changing settings.
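
A quick way to verify, using the pool name from the original post (a sketch):

    ceph osd dump | grep require_osd_release
    ceph osd require-osd-release nautilus
    ceph osd pool set Backup pg_num 512
    ceph osd pool get Backup pg_num    # should now report 512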

On Mon, Jul 1, 2019 at 4:24 AM Sylvain PORTIER  wrote:
>
> Hi all,
>
> I am using ceph 14.2.1 (Nautilus)
>
> I am unable to increase the pg_num of a pool.
>
> I have a pool named Backup, the current pg_num is 64 : ceph osd pool get
> Backup pg_num => result pg_num: 64
>
> And when I try to increase it using the command
>
> ceph osd pool set Backup pg_num 512 => result "set pool 6 pg_num to 512"
>
> And then I check with the command : ceph osd pool get Backup pg_num =>
> result pg_num: 64
>
> I don't how to increase the pg_num of a pool, I also tried the autoscale
> module, but it doesn't work (unable to activate the autoscale, always
> warn mode).
>
> Thank you for your help,
>
>
> Cabeur.
>
>
> ---
> This email has been checked for viruses by Avast antivirus software.
> https://www.avast.com/antivirus
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What's the best practice for Erasure Coding

2019-07-08 Thread Nathan Fish
This is very interesting, thank you. I'm curious, what is the reason
for avoiding k's with large prime factors? If I set k=5, what happens?
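
For reference, the kind of profile Frank describes settling on below would be
created along these lines (a sketch; pool and profile names are placeholders,
and the failure domain is assumed to be host):

    ceph osd erasure-code-profile set ec-6-2 k=6 m=2 plugin=jerasure \
        technique=reed_sol_van crush-failure-domain=host
    ceph osd pool create cephfs_data_ec 512 512 erasure ec-6-2
    ceph osd pool set cephfs_data_ec allow_ec_overwrites true    # needed for CephFS/RBD data pools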

On Mon, Jul 8, 2019 at 8:56 AM Lei Liu  wrote:
>
> Hi Frank,
>
> Thanks for sharing valuable experience.
>
> Frank Schilder  于2019年7月8日周一 下午4:36写道:
>>
>> Hi David,
>>
>> I'm running a cluster with bluestore on raw devices (no lvm) and all 
>> journals collocated on the same disk with the data. Disks are spinning 
>> NL-SAS. Our goal was to build storage at lowest cost, therefore all data on 
>> HDD only. I got a few SSDs that I'm using for FS and RBD meta data. All 
>> large pools are EC on spinning disk.
>>
>> I spent at least one month to run detailed benchmarks (rbd bench) depending 
>> on EC profile, object size, write size, etc. Results were varying a lot. My 
>> advice would be to run benchmarks with your hardware. If there was a single 
>> perfect choice, there wouldn't be so many options. For example, my tests 
>> will not be valid when using separate fast disks for WAL and DB.
>>
>> There are some results though that might be valid in general:
>>
>> 1) EC pools have high throughput but low IOP/s compared with replicated pools
>>
>> I see single-thread write speeds of up to 1.2GB (gigabyte) per second, which 
>> is probably the network limit and not the disk limit. IOP/s get better with 
>> more disks, but are way lower than what replicated pools can provide. On a 
>> cephfs with EC data pool, small-file IO will be comparably slow and eat a 
>> lot of resources.
>>
>> 2) I observe massive network traffic amplification on small IO sizes, which 
>> is due to the way EC overwrites are handled. This is one bottleneck for 
>> IOP/s. We have 10G infrastructure and use 2x10G client and 4x10G OSD 
>> network. OSD bandwidth at least 2x client network, better 4x or more.
>>
>> 3) k should only have small prime factors, power of 2 if possible
>>
>> I tested k=5,6,8,10,12. Best results in decreasing order: k=8, k=6. All 
>> other choices were poor. The value of m seems not relevant for performance. 
>> Larger k will require more failure domains (more hardware).
>>
>> 4) object size matters
>>
>> The best throughput (1M write size) I see with object sizes of 4MB or 8MB, 
>> with IOP/s getting somewhat better with slower object sizes but throughput 
>> dropping fast. I use the default of 4MB in production. Works well for us.
>>
>> 5) jerasure is quite good and seems most flexible
>>
>> jerasure is quite CPU efficient and can handle smaller chunk sizes than 
>> other plugins, which is preferrable for IOP/s. However, CPU usage can become 
>> a problem and a plugin optimized for specific values of k and m might help 
>> here. Under usual circumstances I see very low load on all OSD hosts, even 
>> under rebalancing. However, I remember that once I needed to rebuild 
>> something on all OSDs (I don't remember what it was, sorry). In this 
>> situation, CPU load went up to 30-50% (meaning up to half the cores were at 
>> 100%), which is really high considering that each server has only 16 disks 
>> at the moment and is sized to handle up to 100. CPU power could become a 
>> bottleneck for us in the future.
>>
>> These are some general observations and do not replace benchmarks for 
>> specific use cases. I was hunting for a specific performance pattern, which 
>> might not be what you want to optimize for. I would recommend to run 
>> extensive benchmarks if you have to live with a configuration for a long 
>> time - EC profiles cannot be changed.
>>
>> We settled on 8+2 and 6+2 pools with jerasure and object size 4M. We also 
>> use bluestore compression. All meta data pools are on SSD, only very little 
>> SSD space is required. This choice works well for the majority of our use 
>> cases. We can still build small expensive pools to accommodate special 
>> performance requests.
>>
>> Best regards,
>>
>> =
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> 
>> From: ceph-users  on behalf of David 
>> 
>> Sent: 07 July 2019 20:01:18
>> To: ceph-users@lists.ceph.com
>> Subject: [ceph-users]  What's the best practice for Erasure Coding
>>
>> Hi Ceph-Users,
>>
>> I'm working with a  Ceph cluster (about 50TB, 28 OSDs, all Bluestore on lvm).
>> Recently, I'm trying to use the Erasure Code pool.
>> My question is "what's the best practice for using EC pools ?".
>> More specifically, which plugin (jerasure, isa, lrc, shec or  clay) should I 
>> adopt, and how to choose the combinations of (k,m) (e.g. (k=3,m=2), 
>> (k=6,m=3) ).
>>
>> Does anyone share some experience?
>>
>> Thanks for any help.
>>
>> Regards,
>> David
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> 

Re: [ceph-users] bluestore write iops calculation

2019-08-02 Thread Nathan Fish
Any EC pool with m=1 is fragile. By default, min_size = k+1, so you'd
immediately stop IO the moment you lose a single OSD. min_size can be
lowered to k, but that can cause data loss and corruption. You should
set m=2 at a minimum. 4+2 doesn't take much more space than 4+1, and
it's far safer.
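
To check what an existing pool is doing (a sketch; "mypool" is a placeholder):

    ceph osd pool get mypool erasure_code_profile
    ceph osd erasure-code-profile get <profile-name-from-above>
    ceph osd pool get mypool min_size    # should be k+1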

On Fri, Aug 2, 2019 at 11:21 PM  wrote:
>
> > where small means 32kb or smaller going to BlueStore, so <= 128kb
> > writes
> > from the client.
> >
> > Also: please don't do 4+1 erasure coding, see older discussions for
> > details.
>
> Can you point me to the discussion about the problems of 4+1? It's not
> easy to google :)
>
> --
> Vitaliy Filippov
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs quota setfattr permission denied

2019-07-31 Thread Nathan Fish
The client key which is used to mount the FS needs the 'p' permission
to set xattrs. eg:

ceph fs authorize cephfs client.foo / rwsp

That might be your problem.
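
A minimal check/fix sequence would be something like this (a sketch; client.foo,
the FS name, and the paths are placeholders):

    ceph auth get client.foo    # the mds cap should include 'p'
    # for a fresh client key:
    ceph fs authorize cephfs client.foo / rwsp
    # then, from a mount that uses this key:
    setfattr -n ceph.quota.max_bytes -v 107374182400 /mnt/cephfs/user/dir    # 100 GiB
    getfattr -n ceph.quota.max_bytes /mnt/cephfs/user/dir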

On Wed, Jul 31, 2019 at 5:43 AM Mattia Belluco  wrote:
>
> Dear ceph users,
>
> We have been recently trying to use the two quota attributes:
>
> - ceph.quota.max_files
> - ceph.quota.max_bytes
>
> to prepare for quota enforcing.
>
> While the idea is quite straightforward we found out we cannot set any
> additional file attribute (we tried with the directory pinning, too): we
> get a straight "permission denied" each time.
>
> We are running the latest version of Mimic on a cluster used exclusively
> for cephfs.
>
> This is the output of a "ceph fs dump"
>
> Filesystem 'spindlefs' (4)
> fs_name spindlefs
> epoch   18062
> flags   12
> created 2019-02-21 17:53:48.240659
> modified2019-07-30 16:52:13.141688
> tableserver 0
> root0
> session_timeout 60
> session_autoclose   300
> max_file_size   1099511627776
> min_compat_client   -1 (unspecified)
> last_failure0
> last_failure_osd_epoch  31203
> compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable
> ranges,3=default file layouts on dirs,4=dir inode in separate
> object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no
> anchor table,9=file layout v2,10=snaprealm v2}
> max_mds 1
> in  0
> up  {0=10744488}
> failed
> damaged
> stopped
> data_pools  [11]
> metadata_pool   10
> inline_data disabled
> balancer
> standby_count_wanted1
> 10744488:   10.129.48.46:6800/549232138 'mds-l15-34' mds.0.18058 up:active
> seq 49
>
> and the command we are trying to use:
>
> setfattr -n ceph.quota.max_bytes -v 1 /user/dir
>
> It should be all pretty much default and we could find no reference
> online regarding settings to allow the setfattr to succeed.
>
> Any hint is welcome.
>
> Kind regards
> mattia
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Nautilus 14.2.1 / 14.2.2 crash

2019-07-19 Thread Nathan Fish
Good to know. I tried reset-failed and restart several times, it
didn't work on any of them. I also rebooted one of the hosts, didn't
help. Thankfully it seems they failed far enough apart that our
nearly-empty cluster rebuilt in time. But it's rather worrying.

On Fri, Jul 19, 2019 at 10:09 PM Nigel Williams
 wrote:
>
>
> On Sat, 20 Jul 2019 at 04:28, Nathan Fish  wrote:
>>
>> On further investigation, it seems to be this bug:
>> http://tracker.ceph.com/issues/38724
>
>
> We just upgraded to 14.2.2, and had a dozen OSDs at 14.2.2 go down with this bug, 
> recovered with:
>
> systemctl reset-failed ceph-osd@160
> systemctl start ceph-osd@160
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Nautilus 14.2.1 / 14.2.2 crash

2019-07-19 Thread Nathan Fish
I came in this morning and started to upgrade to 14.2.2, only to
notice that 3 OSDs had crashed overnight - exactly 1 on each of 3
hosts. Apparently there was no data loss, which implies they crashed
at different times, far enough part to rebuild? Still digging through
logs to find exactly when they first crashed.

Log from restarting ceph-osd@53:
https://termbin.com/3e0x

If someone can read this log and get anything out of it I would
appreciate it. All I can tell is that it wasn't a RocksDB ENOSPC,
which I have seen before.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Nautilus 14.2.1 / 14.2.2 crash

2019-07-19 Thread Nathan Fish
On further investigation, it seems to be this bug:
http://tracker.ceph.com/issues/38724

On Fri, Jul 19, 2019 at 1:38 PM Nathan Fish  wrote:
>
> I came in this morning and started to upgrade to 14.2.2, only to
> notice that 3 OSDs had crashed overnight - exactly 1 on each of 3
> hosts. Apparently there was no data loss, which implies they crashed
> at different times, far enough apart to rebuild? Still digging through
> logs to find exactly when they first crashed.
>
> Log from restarting ceph-osd@53:
> https://termbin.com/3e0x
>
> If someone can read this log and get anything out of it I would
> appreciate it. All I can tell is that it wasn't a RocksDB ENOSPC,
> which I have seen before.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS fails repeatedly while handling many concurrent meta data operations

2019-07-23 Thread Nathan Fish
Multiple active MDS's is a somewhat new feature, and it might obscure
debugging information.
I'm not sure what the best way to restore stability temporarily is,
but if you can manage it,
I would go down to one MDS, crank up the debugging, and try to
reproduce the problem.
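
Concretely, something along these lines (a sketch; "cephfs" is a placeholder for
your FS name and the debug levels are just a starting point):

    ceph fs set cephfs max_mds 1
    ceph config set mds debug_mds 10
    ceph config set mds debug_ms 1
    # reproduce, then collect /var/log/ceph/ceph-mds.*.log from the active MDS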

How are your OSDs configured? Are they HDDs? Do you have WAL and/or DB
devices on SSDs?
Is the metadata pool on SSDs?

On Tue, Jul 23, 2019 at 4:06 PM Janek Bevendorff
 wrote:
>
> Thanks for your reply.
>
> On 23/07/2019 21:03, Nathan Fish wrote:
> > What Ceph version? Do the clients match? What CPUs do the MDS servers
> > have, and how is their CPU usage when this occurs?
>
> Sorry, I totally forgot to mention that while transcribing my post. The
> cluster runs Nautilus (I upgraded recently). The client still had Mimic
> when I started, but an upgrade to Nautilus did not solve any of the
> problems.
>
> The MDS nodes have  Xeon E5-2620 v4 CPUs @2.10GHz with 32 threads (Dual
> CPU with 8 physical cores each) and 128GB RAM. CPU usage is rather mild.
> While MDSs are trying to rejoin, they tend to saturate a single thread
> shortly, but nothing spectacular. During normal operation, none of the
> cores is particularly under load.
>
> > While migrating to a Nautilus cluster recently, we had up to 14
> > million inodes open, and we increased the cache limit to 16GiB. Other
> > than warnings about oversized cache, this caused no issues.
>
> I tried settings of 1, 2, 5, 6, 10, 20, 50, and 90GB. Other than getting
> rid of the cache size warnings (and sometimes allowing an MDS to rejoin
> without being kicked again after a few seconds), it did not change much
> in terms of the actual problem. Right now I can change it to whatever I
> want, it doesn't do anything, because rank 0 keeps being trashed anyway
> (the other ranks are fine, but the CephFS is down anyway). Is there
> anything useful I can give you to debug this? Otherwise I would try
> killing the MDS daemons so I can at least restore the CephFS to a
> semi-operational state.
>
>
> >
> > On Tue, Jul 23, 2019 at 2:30 PM Janek Bevendorff wrote:
> >> Hi,
> >>
> >> Disclaimer: I posted this before to the ceph.io mailing list, but from
> >> the answers I didn't get and a look at the archives, I concluded that
> >> that list is very dead. So apologies if anyone has read this before.
> >>
> >> I am trying to copy our storage server to a CephFS. We have 5 MONs in
> >> our cluster and (now) 7 MDS with max_mds = 4. The list (!) of files I am
> >> trying to copy is about 23GB, so it's a lot of files. I am copying them
> >> in batches of 25k using 16 parallel rsync processes over a 10G link.
> >>
> >> I started out with 5 MDSs / 2 active, but had repeated issues with
> >> immense and growing cache sizes far beyond the theoretical maximum of
> >> 400k inodes which the 16 rsync processes could keep open at the same
> >> time. The usual inode count was between 1 and 4 million and the cache
> >> size between 20 and 80GB on average.
> >>
> >> After a while, the MDSs started failing under this load by either
> >> crashing or being kicked from the quorum. I tried increasing the max
> >> cache size, max log segments, and beacon grace period, but to no avail.
> >> A crashed MDS often needs minutes to rejoin.
> >>
> >> The MDSs fail with the following message:
> >>
> >>-21> 2019-07-22 14:00:05.877 7f67eacec700  1 heartbeat_map is_healthy
> >> 'MDSRank' had timed out after 15
> >>-20> 2019-07-22 14:00:05.877 7f67eacec700  0 mds.beacon.XXX Skipping
> >> beacon heartbeat to monitors (last acked 24.0042s ago); MDS internal
> >> heartbeat is not healthy!
> >>
> >> I found the following thread, which seems to be about the same general
> >> issue:
> >>
> >> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-February/024944.html
> >>
> >> Unfortunately, it does not really contain a solution except things I
> >> have tried already. Though it does give some explanation as to why the
> >> MDSs pile up so many open inodes. It appears like Ceph can't handle many
> >> (write-only) operations on different files, since the clients keep their
> >> capabilities open and the MDS can't evict them from its cache. This is
> >> very baffling to me, since how am I supposed to use a CephFS if I cannot
> >> fill it with files before?
> >>
> >> The next thing I tried was increasing the number of active MDSs. Three
> >> seemed to make it worse, but four worked surprisingly well.
> >>

Re: [ceph-users] MDS fails repeatedly while handling many concurrent meta data operations

2019-07-23 Thread Nathan Fish
What Ceph version? Do the clients match? What CPUs do the MDS servers
have, and how is their CPU usage when this occurs?
While migrating to a Nautilus cluster recently, we had up to 14
million inodes open, and we increased the cache limit to 16GiB. Other
than warnings about oversized cache, this caused no issues.
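
For reference, that limit is just a config option, in bytes (a sketch):

    ceph config set mds mds_cache_memory_limit 17179869184    # 16 GiB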

On Tue, Jul 23, 2019 at 2:30 PM Janek Bevendorff
 wrote:
>
> Hi,
>
> Disclaimer: I posted this before to the ceph.io mailing list, but from
> the answers I didn't get and a look at the archives, I concluded that
> that list is very dead. So apologies if anyone has read this before.
>
> I am trying to copy our storage server to a CephFS. We have 5 MONs in
> our cluster and (now) 7 MDS with max_mds = 4. The list (!) of files I am
> trying to copy is about 23GB, so it's a lot of files. I am copying them
> in batches of 25k using 16 parallel rsync processes over a 10G link.
>
> I started out with 5 MDSs / 2 active, but had repeated issues with
> immense and growing cache sizes far beyond the theoretical maximum of
> 400k inodes which the 16 rsync processes could keep open at the same
> time. The usual inode count was between 1 and 4 million and the cache
> size between 20 and 80GB on average.
>
> After a while, the MDSs started failing under this load by either
> crashing or being kicked from the quorum. I tried increasing the max
> cache size, max log segments, and beacon grace period, but to no avail.
> A crashed MDS often needs minutes to rejoin.
>
> The MDSs fail with the following message:
>
>-21> 2019-07-22 14:00:05.877 7f67eacec700  1 heartbeat_map is_healthy
> 'MDSRank' had timed out after 15
>-20> 2019-07-22 14:00:05.877 7f67eacec700  0 mds.beacon.XXX Skipping
> beacon heartbeat to monitors (last acked 24.0042s ago); MDS internal
> heartbeat is not healthy!
>
> I found the following thread, which seems to be about the same general
> issue:
>
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-February/024944.html
>
> Unfortunately, it does not really contain a solution except things I
> have tried already. Though it does give some explanation as to why the
> MDSs pile up so many open inodes. It appears like Ceph can't handle many
> (write-only) operations on different files, since the clients keep their
> capabilities open and the MDS can't evict them from its cache. This is
> very baffling to me, since how am I supposed to use a CephFS if I cannot
> fill it with files before?
>
> The next thing I tried was increasing the number of active MDSs. Three
> seemed to make it worse, but four worked surprisingly well.
> Unfortunately, the crash came eventually and the rank-0 MDS got kicked.
> Since then the standbys have been (not very successfully) playing
> round-robin to replace it, only to be kicked repeatedly. This is the
> status quo right now and it has been going for hours with no end in
> sight. The only option might be to kill all MDSs and let them restart
> from empty caches.
>
> While trying to rejoin, the MDSs keep logging the above-mentioned error
> message followed by
>
> 2019-07-23 17:53:37.386 7f3b135a5700  0 mds.0.cache.ino(0x100019693f8)
> have open dirfrag * but not leaf in fragtree_t(*^3): [dir 0x100019693f8
> /XXX_12_doc_ids_part7/ [2,head] auth{1=2,2=2} v=0 cv=0/0
> state=1140850688 f() n() hs=17033+0,ss=0+0 | child=1 replicated=1
> 0x5642a2ff7700]
>
> and then finally
>
> 2019-07-23 17:53:48.786 7fb02bc08700  1 mds.XXX Map has assigned me to
> become a standby
>
> The other thing I noticed over the last few days is that after a
> sufficient number of failures, the client locks up completely and the
> mount becomes unresponsive, even after the MDSs are back. Sometimes this
> lock-up is so catastrophic that I cannot even unmount the share with
> umount -lf anymore and a reboot of the machine lets the kernel panic.
> This looks like a bug to me.
>
> I hope somebody can provide me with tips to stabilize our setup. I can
> move data through our RadosGWs over 7x10Gbps from 130 nodes in parallel,
> no problem. But I cannot even rsync a few TB of files from a single node
> to the CephFS without knocking out the MDS daemons.
>
> Any help is greatly appreciated!
>
> Janek
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Nautilus 14.2.1 / 14.2.2 crash

2019-07-23 Thread Nathan Fish
I have not had any more OSDs crash, but the 3 that crashed still crash
on startup. I may purge and recreate them, but there's no hurry. I
have 18 OSDs per host and plenty of free space currently.
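
If I do end up recreating them, the usual sequence is roughly this (a sketch;
the OSD id and device are placeholders, and it assumes a plain
ceph-volume-managed bluestore OSD - add --block.db if the OSD had a separate
DB LV):

    ceph osd purge 53 --yes-i-really-mean-it
    ceph-volume lvm zap /dev/sdX --destroy
    ceph-volume lvm create --bluestore --data /dev/sdX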

On Tue, Jul 23, 2019 at 2:19 AM Ashley Merrick  wrote:
>
> Have they been stable since, or still had some crash?
>
> ,Thanks
>
>  On Sat, 20 Jul 2019 10:09:08 +0800 Nigel Williams 
>  wrote 
>
>
> On Sat, 20 Jul 2019 at 04:28, Nathan Fish  wrote:
>
> On further investigation, it seems to be this bug:
> http://tracker.ceph.com/issues/38724
>
>
> We just upgraded to 14.2.2, and had a dozen OSDs at 14.2.2 go down with this bug, 
> recovered with:
>
> systemctl reset-failed ceph-osd@160
> systemctl start ceph-osd@160
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph durability during outages

2019-07-24 Thread Nathan Fish
It is inherently dangerous to accept client IO - particularly writes -
when at k. Just like it's dangerous to accept IO with 1 replica in
replicated mode. It is not inherently dangerous to do recovery when at
k, but apparently it was originally written to use min_size rather
than k.
Looking at the PR, the actual code change is fairly small, ~30 lines,
but it's a fairly critical change and has several pages of testing
code associated with it. It also requires setting
"osd_allow_recovery_below_min_size" just in case. So clearly it is
being treated with caution.
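
For completeness, the emergency knob that exists today is the per-pool
min_size; dropping it to k can restore IO during an outage but, as above, risks
consistency, so it should go back up as soon as recovery finishes (a sketch;
"ecpool" is a placeholder for a 6+2 pool):

    ceph osd pool get ecpool min_size    # 7 for k=6,m=2
    ceph osd pool set ecpool min_size 6  # last resort only
    # once all hosts are back and PGs are active+clean:
    ceph osd pool set ecpool min_size 7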


On Wed, Jul 24, 2019 at 2:28 PM Jean-Philippe Méthot
 wrote:
>
> Thank you, that does make sense. I was completely unaware that min size was 
> k+1 and not k. Had I known that, I would have designed this pool differently.
>
> Regarding that feature for Octopus, I’m guessing it shouldn't be dangerous 
> for data integrity to recover at less than min_size?
>
> Jean-Philippe Méthot
> Openstack system administrator
> Administrateur système Openstack
> PlanetHoster inc.
>
>
>
>
> Le 24 juill. 2019 à 13:49, Nathan Fish  a écrit :
>
> 2/3 monitors is enough to maintain quorum, as is any majority.
>
> However, EC pools have a default min_size of  k+1 chunks.
> This can be adjusted to k, but that has its own dangers.
> I assume you are using failure domain = "host"?
> As you had k=6,m=2, and lost 2 failure domains, you had k chunks left,
> resulting in all IO stopping.
>
> Currently, EC pools that have k chunks but less than min_size do not rebuild.
> This is being worked on for Octopus: https://github.com/ceph/ceph/pull/17619
>
> k=6,m=2 is therefore somewhat slim for a 10-host cluster.
> I do not currently use EC, as I have only 3 failure domains, so others
> here may know better than me,
> but I might have done k=6,m=3. This would allow rebuilding to OK from
> 1 host failure, and remaining available in WARN state with 2 hosts
> down.
> k=4,m=4 would be very safe, but potentially too expensive.
>
>
> On Wed, Jul 24, 2019 at 1:31 PM Jean-Philippe Méthot
>  wrote:
>
>
> Hi,
>
> I’m running in production a 3 monitors, 10 osdnodes ceph cluster. This 
> cluster is used to host Openstack VM rbd. My pools are set to use a k=6 m=2 
> erasure code profile with a 3 copy metadata pool in front. The cluster runs 
> well, but we recently had a short outage which triggered unexpected behaviour 
> in the cluster.
>
> I’ve always been under the impression that Ceph would continue working 
> properly even if nodes would go down. I tested it several months ago with 
> this configuration and it worked fine as long as only 2 nodes went down. 
> However, this time, the first monitor as well as two osd nodes went down. As 
> a result, Openstack VMs were able to mount their rbd volume but unable to 
> read from it, even after the cluster had recovered with the following message 
> : Reduced data availability: 599 pgs inactive, 599 pgs incomplete .
>
> I believe the cluster should have continued to work properly despite the 
> outage, so what could have prevented it from functioning? Is it because there 
> was only two monitors remaining? Or is it that reduced data availability 
> message? In that case, is my erasure coding configuration fine for that 
> number of nodes?
>
> Jean-Philippe Méthot
> Openstack system administrator
> Administrateur système Openstack
> PlanetHoster inc.
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph durability during outages

2019-07-24 Thread Nathan Fish
2/3 monitors is enough to maintain quorum, as is any majority.

However, EC pools have a default min_size of  k+1 chunks.
This can be adjusted to k, but that has its own dangers. You should
I assume you are using failure domain = "host"?
As you had k=6,m=2, and lost 2 failure domains, you had k chunks left,
resulting in all IO stopping.

Currently, EC pools that have k chunks but less than min_size do not rebuild.
This is being worked on for Octopus: https://github.com/ceph/ceph/pull/17619

k=6,m=2 is therefore somewhat slim for a 10-host cluster.
I do not currently use EC, as I have only 3 failure domains, so others
here may know better than me,
but I might have done k=6,m=3. This would allow rebuilding to OK from
1 host failure, and remaining available in WARN state with 2 hosts
down.
k=4,m=4 would be very safe, but potentially too expensive.
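
Since an EC profile cannot be changed on an existing pool, moving to k=6,m=3
would mean creating a new pool and migrating into it (a sketch; names and pg
counts are placeholders, failure domain assumed to be host):

    ceph osd erasure-code-profile set ec-6-3 k=6 m=3 plugin=jerasure \
        crush-failure-domain=host
    ceph osd pool create vms_ec_new 512 512 erasure ec-6-3
    ceph osd pool get vms_ec_new min_size    # should report 7 (k+1)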


On Wed, Jul 24, 2019 at 1:31 PM Jean-Philippe Méthot
 wrote:
>
> Hi,
>
> I’m running in production a 3 monitors, 10 osdnodes ceph cluster. This 
> cluster is used to host Openstack VM rbd. My pools are set to use a k=6 m=2 
> erasure code profile with a 3 copy metadata pool in front. The cluster runs 
> well, but we recently had a short outage which triggered unexpected behaviour 
> in the cluster.
>
> I’ve always been under the impression that Ceph would continue working 
> properly even if nodes would go down. I tested it several months ago with 
> this configuration and it worked fine as long as only 2 nodes went down. 
> However, this time, the first monitor as well as two osd nodes went down. As 
> a result, Openstack VMs were able to mount their rbd volume but unable to 
> read from it, even after the cluster had recovered with the following message 
> : Reduced data availability: 599 pgs inactive, 599 pgs incomplete .
>
> I believe the cluster should have continued to work properly despite the 
> outage, so what could have prevented it from functioning? Is it because there 
> was only two monitors remaining? Or is it that reduced data availability 
> message? In that case, is my erasure coding configuration fine for that 
> number of nodes?
>
> Jean-Philippe Méthot
> Openstack system administrator
> Administrateur système Openstack
> PlanetHoster inc.
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Nautilus: significant increase in cephfs metadata pool usage

2019-07-25 Thread Nathan Fish
I have seen significant increases (1GB -> 8GB) proportional to number
of inodes open, just like the MDS cache grows. These went away once
the stat-heavy workloads (multiple parallel rsyncs) stopped. I
disabled autoscale warnings on the metadata pools due to this
fluctuation.
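
For reference, silencing the autoscaler for a specific pool is a per-pool
setting (a sketch; the pool name is a placeholder):

    ceph osd pool autoscale-status
    ceph osd pool set cephfs_metadata pg_autoscale_mode off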

On Thu, Jul 25, 2019 at 1:31 PM Dietmar Rieder
 wrote:
>
> On 7/25/19 11:55 AM, Konstantin Shalygin wrote:
> >> we just recently upgraded our cluster from luminous 12.2.10 to nautilus
> >> 14.2.1 and I noticed a massive increase of the space used on the cephfs
> >> metadata pool although the used space in the 2 data pools  basically did
> >> not change. See the attached graph (NOTE: log10 scale on y-axis)
> >>
> >> Is there any reason that explains this?
> >
> > Dietmar, how your metadata usage now? Is stop growing?
>
> it is stable now and only changes as the number of files in the FS changes.
>
> Dietmar
>
> --
> _
> D i e t m a r  R i e d e r, Mag.Dr.
> Innsbruck Medical University
> Biocenter - Division for Bioinformatics
> Innrain 80, 6020 Innsbruck
> Phone: +43 512 9003 71402
> Fax: +43 512 9003 73100
> Email: dietmar.rie...@i-med.ac.at
> Web:   http://www.icbi.at
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS / CephFS behaviour with unusual directory layout

2019-07-26 Thread Nathan Fish
Ok, great. Some numbers for you:
I have a filesystem of 50 million files, 5.4 TB.
The data pool is on HDD OSDs with Optane DB/WAL, size=3.
The metadata pool (Optane OSDs) has 17GiB "stored", 20GiB "used", at
size=3. 5.18M objects.
When doing parallel rsyncs, with ~14M inodes open, the MDS cache goes
to about 40GiB but it remains stable. MDS CPU usage goes to about 400%
(4 cores worth, spread across 6-8 processes). Hope you find this
useful.

On Fri, Jul 26, 2019 at 11:05 AM Stefan Kooman  wrote:
>
> Quoting Nathan Fish (lordci...@gmail.com):
> > MDS CPU load is proportional to metadata ops/second. MDS RAM cache is
> > proportional to # of files (including directories) in the working set.
> > Metadata pool size is proportional to total # of files, plus
> > everything in the RAM cache. I have seen that the metadata pool can
> > balloon 8x between being idle, and having every inode open by a
> > client.
> > The main thing I'd recommend is getting SSD OSDs to dedicate to the
> > metadata pools, and SSDs for the HDD OSD's DB/WAL. NVMe if you can. If
> > you put that much metadata on only HDDs, it's going to be slow.
>
> Only SSD for OSD data pool and NVMe for metadata pool, so that should be
> fine. Besides the initial loading of that many files / directories this
> workload shouldn't be any problem.
>
> Thanks for your feedback.
>
> Gr. Stefan
>
> --
> | BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
> | GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS / CephFS behaviour with unusual directory layout

2019-07-26 Thread Nathan Fish
Yes, definitely enable standby-replay. I saw sub-second failovers with
standby-replay, but when I restarted the new rank 0 (previously 0-s)
while the standby was syncing up to become 0-s, the failover took
several minutes. This was with ~30GiB of cache.
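
In Nautilus this is a per-filesystem flag (a sketch; "cephfs" is a placeholder):

    ceph fs set cephfs allow_standby_replay true
    ceph fs status cephfs    # a standby-replay daemon should now sit next to each rank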

On Fri, Jul 26, 2019 at 12:41 PM Burkhard Linke
 wrote:
>
> Hi,
>
>
> one particular interesting point in setups with a large number of active
> files/caps is the failover.
>
>
> If your MDS fails (assuming single MDS, multiple MDS with multiple
> active ranks behave in the same way for _each_ rank), the monitors will
> detect the failure and update the mds map. CephFS clients will be
> notified about the update, connect to the new MDS the rank has failed
> over to (hopefully within the connect timeout...). They will also
> re-request all their currently active caps from the MDS to allow it to
> recreate the state of the point in time before the failure.
>
>
> And this is were things can get "interesting". Assuming a cold standby
> MDS, the MDS will receive the information about all active files and
> capabilities assigned to the various clients. It also has to _stat_ all
> these files during the rejoin phase. And if millions of files have to be
> stat'ed, this may take time, put a lot of pressure on the metadata and
> data pools, and might even lead to timeouts and subsequent failure or
> failover to another MDS.
>
>
> We had some problems with this in the past, but it became better and
> less failure prone with every ceph release (great work, ceph
> developers!). Our current setup has up to 15 million cached inodes and
> several million caps in the worst case (during nightly backup). The caps
> per client limit in luminous/nautilus? helps a lot with reducing the
> number of active files and caps.
>
> Prior to nautilus we configured a secondary MDS as standby-replay, which
> allows it to cache the same inodes that were active on the primary.
> During rejoin the stat call can be served from cache, which makes the
> failover a lot faster and less demanding for the ceph cluster itself. In
> nautilus the setup for standby-replay has moved from a daemon feature to
> a filesystem feature (one spare MDS becomes designated standby-replay
> for a rank). But there are also other caveats like not selecting one of
> these as failover for another rank.
>
>
> So if you want to test cephfs for your use case, I would highly
> recommend to test failover, too. Both a controlled failover and an
> unexpected one. You may also want to use multiple active MDS, but my
> experience with these setups is limited.
>
>
> Regards,
>
> Burkhard
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS / CephFS behaviour with unusual directory layout

2019-07-26 Thread Nathan Fish
MDS CPU load is proportional to metadata ops/second. MDS RAM cache is
proportional to # of files (including directories) in the working set.
Metadata pool size is proportional to total # of files, plus
everything in the RAM cache. I have seen that the metadata pool can
balloon 8x between being idle, and having every inode open by a
client.
The main thing I'd recommend is getting SSD OSDs to dedicate to the
metadata pools, and SSDs for the HDD OSD's DB/WAL. NVMe if you can. If
you put that much metadata on only HDDs, it's going to be slow.
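
If the cluster has mixed devices, pinning the metadata pool to SSDs is done
with a device-class CRUSH rule (a sketch; rule and pool names are placeholders):

    ceph osd crush rule create-replicated ssd-only default host ssd
    ceph osd pool set cephfs_metadata crush_rule ssd-only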



On Fri, Jul 26, 2019 at 5:11 AM Stefan Kooman  wrote:
>
> Hi List,
>
> We are planning to move a filesystem workload (currently nfs) to CephFS.
> It's around 29 TB. The unusual thing here is the amount of directories
> in use to host the files. In order to combat a "too many files in one
> directory" scenario, a "let's make use of recursive directories" approach was taken.
> Not ideal either. This workload is supposed to be moved to (Ceph) S3
> sometime in the future, but until then, it has to go to a shared
> filesystem ...
>
> So what is unusual about this? The directory layout looks like this
>
> /data/files/00/00/[0-8][0-9]/[0-9]/ from this point on there will be 7
> directories created to store 1 file.
>
> Total amount of directories in a file path is 14. There are around 150 M
> files in 400 M directories.
>
> The working set won't be big. Most files will just sit around and will
> not be touched. The active amount of files wil be a few thousand.
>
> We are wondering if this kind of directory structure is suitable for
> CephFS. Might the MDS have difficulties keeping up with that many inodes
> / dentries, or doesn't it care at all?
>
> The amount of metadata overhead might be horrible, but we will test that
> out.
>
> Thanks,
>
> Stefan
>
>
> --
> | BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
> | GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore runs out of space and dies

2019-10-31 Thread Nathan Fish
You should never run into this problem on a 4TB drive in the first
place. Bluestore explodes if it can't allocate a few GB; but on a 4TB
drive, the default full_ratio of 0.95 will forbid placing any new
objects onto an OSD with less than 200GB free.

On Thu, Oct 31, 2019 at 9:31 AM George Shuklin  wrote:
>
> Thank you everyone, I got it. There is no way to fix out-of-space
> bluestore without expanding it.
>
> Therefore, in production we would stick with 99%FREE size for LV, as it
> gives operators a 'last chance' to repair the cluster in case of
> emergency. It's a bit unfortunate that we need to give up a whole percent
> (1% is too much for 4 TB drives).
>
> On 31/10/2019 15:04, Nathan Fish wrote:
> > The best way to prevent this on a testing cluster with tiny virtual
> > drives is probably to lower the various full_ratio's significantly.
> >
> > On Thu, Oct 31, 2019 at 7:17 AM Paul Emmerich  
> > wrote:
> >> BlueStore doesn't handle running out of space gracefully because that
> >> doesn't happen on a real disk because full_ratio (95%) and the
> >> failsafe_full_ratio (97%? some obscure config option) kick in before
> >> that happens.
> >>
> >> Yeah, I've also lost some test cluster with tiny disks to this. Usual
> >> reason is keeping it in a degraded state for weeks at a time...
> >>
> >> Paul
> >>
> >> --
> >> Paul Emmerich
> >>
> >> Looking for help with your Ceph cluster? Contact us at https://croit.io
> >>
> >> croit GmbH
> >> Freseniusstr. 31h
> >> 81247 München
> >> www.croit.io
> >> Tel: +49 89 1896585 90
> >>
> >> On Thu, Oct 31, 2019 at 10:50 AM George Shuklin
> >>  wrote:
> >>> Hello.
> >>>
> >>> In my lab a nautilus cluster with a bluestore suddenly went dark. As I
> >>> found it had used 98% of the space and most of OSDs (small, 10G each)
> >>> went offline. Any attempt to restart them failed with this message:
> >>>
> >>> # /usr/bin/ceph-osd -f --cluster  ceph --id 18 --setuser ceph --setgroup
> >>> ceph
> >>>
> >>> 2019-10-31 09:44:37.591 7f73d54b3f80 -1 osd.18 271 log_to_monitors
> >>> {default=true}
> >>> 2019-10-31 09:44:37.615 7f73bff99700 -1
> >>> bluestore(/var/lib/ceph/osd/ceph-18) _do_alloc_write failed to allocate
> >>> 0x1 allocated 0x ffe4 min_alloc_size 0x1 available 0x 0
> >>> 2019-10-31 09:44:37.615 7f73bff99700 -1
> >>> bluestore(/var/lib/ceph/osd/ceph-18) _do_write _do_alloc_write failed
> >>> with (28) No space left on device
> >>> 2019-10-31 09:44:37.615 7f73bff99700 -1
> >>> bluestore(/var/lib/ceph/osd/ceph-18) _txc_add_transaction error (28) No
> >>> space left on device not handled on operation 10 (op 30, counting from 0)
> >>> 2019-10-31 09:44:37.615 7f73bff99700 -1
> >>> bluestore(/var/lib/ceph/osd/ceph-18) ENOSPC from bluestore,
> >>> misconfigured cluster
> >>> /build/ceph-14.2.4/src/os/bluestore/BlueStore.cc: In function 'void
> >>> BlueStore::_txc_add_transaction(BlueStore::TransContext*,
> >>> ObjectStore::Transaction*)' thread 7f73bff99700 time 2019-10-31
> >>> 09:44:37.620694
> >>> /build/ceph-14.2.4/src/os/bluestore/BlueStore.cc: 11455:
> >>> ceph_abort_msg("unexpected error")
> >>>
> >>> I was able to recover cluster by adding some more space into VGs for
> >>> some of OSDs and using this command:
> >>>
> >>> ceph-bluestore-tool --log-level 30 --path /var/lib/ceph/osd/ceph-xx
> >>> --command bluefs-bdev-expand
> >>>
> >>> It worked but only because I added some space into OSD.
> >>>
> >>> I'm curious, is there a way to recover such OSD without growing it? On
> >>> the old filestore I can just remove some objects to gain space, is this
> >>> possible for bluestore? My main concern is that OSD daemon simply
> >>> crashes at start, so I can't just add 'more OSD' to cluster - all data
> >>> become unavailable, because OSDs are completely dead.
> >>>
> >>> ___
> >>> ceph-users mailing list
> >>> ceph-users@lists.ceph.com
> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Full FLash NVME Cluster recommendation

2019-11-15 Thread Nathan Fish
Bluestore will use about 4 cores, but in my experience, the maximum
utilization I've seen has been something like: 100%, 100%, 50%, 50%

So those first 2 cores are the bottleneck for pure OSD IOPS. This sort
of pattern isn't uncommon in multithreaded programs. This was on HDD
OSDs with DB/WAL on NVMe, as well as some small metadata OSDs on pure
NVMe. SSD OSDs default to 2 threads per shard, and HDD to 1, but we
had to set HDD to 2 as well when we enabled NVMe WAL/DB. Otherwise the
OSDs ran out of CPU and failed to heartbeat when under load. I believe
that if we had 50% faster cores, we might not have needed to do this.
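Something like this should do it (option name from memory - double-check
it against your release; it needs an OSD restart to take effect):

    ceph config set osd osd_op_num_threads_per_shard_hdd 2
    systemctl restart ceph-osd@<id>    # restart the HDD OSDs one at a time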

On SSDs/NVMe you can compensate for slower cores with more OSDs, but
of course only for parallel operations. Anything that is
serial+synchronous, not so much. I would expect something like 4 OSDs
per NVMe, 4 cores per OSD. That's already 16 cores per node just for
OSDs.

Our bottleneck in practice is the Ceph MDS, which seems to use exactly
2 cores and has no setting to change this. As far as I can tell, if we
had 50% faster cores just for the MDS, I would expect roughly +50%
performance in terms of metadata ops/second. Each filesystem has its
own rank-0 MDS, so this load will be split across daemons. The MDS can
also use a ton of RAM (32GB) if the clients have a working set of 1
million+ files. Multi-mds exists to further split the load, but is
quite new and I would not trust it. CephFS in general is likely where
you will have the most issues, as it is both new and complex compared to
a simple object store. Having an MDS in standby-replay mode keeps its
RAM cache synced with the active, so you get far faster failover (
O(seconds) rather than O(minutes) with a few million file caps) but
you use the same RAM again.

So, IMHO, you will want at least:
CPU:
16 cores per 1-card NVMe OSD node. 2 cores per filesystem (maybe 1 if
you don't expect a lot of simultaneous load?)

RAM:
The Bluestore default is 4GB per OSD, so 16GB per node.
~32GB of RAM per active and standby-replay MDS if you expect file
counts in the millions, so 64GB per filesystem.

128GB of RAM per node ought to do, if you have less than 14 filesystems?

YMMV.

On Fri, Nov 15, 2019 at 11:17 AM Anthony D'Atri  wrote:
>
> I’ve been trying unsuccessfully to convince some folks of the need for fast 
> cores, there’s the idea that the effect would be slight.  Do you have any 
> numbers?  I’ve also read a claim that each BlueStore will use 3-4 cores
>,
> They’re listening to me though about splitting the card into multiple OSDs.
>
> > On Nov 15, 2019, at 7:38 AM, Nathan Fish  wrote:
> >
> > In order to get optimal performance out of NVMe, you will want very
> > fast cores, and you will probably have to split each NVMe card into
> > 2-4 OSD partitions in order to throw enough cores at it.
> >
> > On Fri, Nov 15, 2019 at 10:24 AM Yoann Moulin  wrote:
> >>
> >> Hello,
> >>
> >> I'm going to deploy a new cluster soon based on 6.4TB NVME PCI-E Cards, I 
> >> will have only 1 NVME card per node and 38 nodes.
> >>
> >> The use case is to offer cephfs volumes for a k8s platform, I plan to use 
> >> an EC-POOL 8+3 for the cephfs_data pool.
> >>
> >> Do you have recommendations for the setup or mistakes to avoid? I use 
> >> ceph-ansible to deploy all myclusters.
> >>
> >> Best regards,
> >>
> >> --
> >> Yoann Moulin
> >> EPFL IC-IT
> >> ___
> >> ceph-users mailing list -- ceph-us...@ceph.io
> >> To unsubscribe send an email to ceph-users-le...@ceph.io
> > ___
> > ceph-users mailing list -- ceph-us...@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Scaling out

2019-11-21 Thread Nathan Fish
The default crush rule uses "host" as the failure domain, so in order
to deploy on one host you will need to make a crush rule that
specifies "osd". Then simply adding more hosts with osds will result
in automatic rebalancing. Once you have enough hosts to satisfy the
crush rule ( 3 for replicated size = 3) you can change the pool(s)
back to the default rule.
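A rough sketch of the commands (rule and pool names here are just examples):

    ceph osd crush rule create-replicated rep-osd default osd
    ceph osd pool set mypool crush_rule rep-osd        # while everything is on one host
    # later, once you have 3+ hosts, switch back to the default host-level rule:
    ceph osd pool set mypool crush_rule replicated_rule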

On Thu, Nov 21, 2019 at 7:46 AM Alfredo De Luca
 wrote:
>
> Hi all.
> We are doing some tests on how to scale out nodes on Ceph Nautilus.
> Basically we want to try to install Ceph on one node and scale up to 2+ 
> nodes. How to do so?
>
> Every nodes has 6 disks and maybe  we can use the crushmap to achieve this?
>
> Any thoughts/ideas/recommendations?
>
>
> Cheers
>
>
> --
> Alfredo
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Replace bad db for bluestore

2019-11-21 Thread Nathan Fish
You should design your cluster and crush rules such that a failure of
a single OSD is not a problem. Preferably such that losing any 1 host
isn't a problem either.

On Thu, Nov 21, 2019 at 6:32 AM zhanrzh...@teamsun.com.cn
 wrote:
>
> Hi, all
>  Suppose the db of a bluestore OSD can't be read or written - are there some
> methods to replace the bad db with a new one, in Luminous?
> If not, I dare not deploy Ceph with bluestore in my production.
>
> zhanrzh...@teamsun.com.cn
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Replace bad db for bluestore

2019-11-21 Thread Nathan Fish
A power outage shouldn't corrupt your db unless you are doing
dangerous async writes. And sharing an SSD for several OSDs on the
same host is normal, but not an issue given that you have planned for
the failure of hosts.

On Thu, Nov 21, 2019 at 9:57 AM 展荣臻(信泰)  wrote:
>
>
>
> In general the db is located on an SSD, and 4-5 or more OSDs share the same SSD.
> Considering a situation where the db is broken due to a power outage of the
> data center,
> it would be a disaster.
>
> > You should design your cluster and crush rules such that a failure of
> > a single OSD is not a problem. Preferably such that losing any 1 host
> > isn't a problem either.
> >
> > On Thu, Nov 21, 2019 at 6:32 AM zhanrzh...@teamsun.com.cn
> >  wrote:
> > >
> > > Hi, all
> > >  Suppose the db of a bluestore OSD can't be read or written - are there some
> > > methods to replace the bad db with a new one, in Luminous?
> > > If not, I dare not deploy Ceph with bluestore in my production.
> > >
> > > zhanrzh...@teamsun.com.cn
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Separate disk sets for high IO?

2019-12-16 Thread Nathan Fish
https://ceph.io/community/new-luminous-crush-device-classes/
https://docs.ceph.com/docs/nautilus/rados/operations/crush-map/#device-classes

On Mon, Dec 16, 2019 at 5:42 PM Philip Brown  wrote:
>
> Sounds very useful.
>
> Any online example documentation for this?
> havent found any so far?
>
>
> ----- Original Message -----
> From: "Nathan Fish" 
> To: "Marc Roos" 
> Cc: "ceph-users" , "Philip Brown" 
> 
> Sent: Monday, December 16, 2019 2:07:44 PM
> Subject: Re: [ceph-users] Separate disk sets for high IO?
>
> Indeed, you can set device class to pretty much arbitrary strings and
> specify them. By default, 'hdd', 'ssd', and I think 'nvme' are
> autodetected - though my Optanes showed up as 'ssd'.
>
> On Mon, Dec 16, 2019 at 4:58 PM Marc Roos  wrote:
> >
> >
> >
> > You can classify osd's, eg as ssd. And you can assign this class to a
> > pool you create. This way you can have rbd's running on only ssd's. I
> > think you have also a class for nvme and you can create custom classes.
> >
> >
> >
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] HEALTH_WARN 1 MDSs report oversized cache

2019-12-05 Thread Nathan Fish
MDS cache size scales with the number of files recently opened by
clients. If you have RAM to spare, increase "mds cache memory limit".
I have raised mine from the default of 1GiB to 32GiB. My rough
estimate is 2.5kiB per inode in recent use.
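For example (the value is in bytes; this option can be changed at runtime):

    ceph config set mds mds_cache_memory_limit 34359738368    # 32 GiB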


On Thu, Dec 5, 2019 at 10:39 AM Ranjan Ghosh  wrote:
>
> Okay, now, after I settled the issue with the oneshot service thanks to
> the amazing help of Paul and Richard (thanks again!), I still wonder:
>
> What could I do about that MDS warning:
>
> ===
>
> health: HEALTH_WARN
>
> 1 MDSs report oversized cache
>
> ===
>
> If anybody has any ideas? I tried googling it, of course, but came up
> with no really relevant info on how to actually solve this.
>
>
> BR
>
> Ranjan
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Separate disk sets for high IO?

2019-12-16 Thread Nathan Fish
Indeed, you can set device class to pretty much arbitrary strings and
specify them. By default, 'hdd', 'ssd', and I think 'nvme' are
autodetected - though my Optanes showed up as 'ssd'.
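A minimal sketch, with the OSD ids, class and pool names as placeholders:

    ceph osd crush rm-device-class osd.10 osd.11           # clear the autodetected class first
    ceph osd crush set-device-class nvme osd.10 osd.11
    ceph osd crush rule create-replicated fast-rule default host nvme
    ceph osd pool set fast-pool crush_rule fast-rule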

On Mon, Dec 16, 2019 at 4:58 PM Marc Roos  wrote:
>
>
>
> You can classify osd's, eg as ssd. And you can assign this class to a
> pool you create. This way you can have rbd's running on only ssd's. I
> think you have also a class for nvme and you can create custom classes.
>
>
>
>
> -Original Message-
> From: Philip Brown [mailto:pbr...@medata.com]
> Sent: 16 December 2019 22:55
> To: ceph-users
> Subject: [ceph-users] Separate disk sets for high IO?
>
> Still relatively new to ceph, but have been tinkering for a few weeks
> now.
>
> If I'm reading the various docs correctly, then any RBD in a particular
> ceph cluster, will be distributed across ALL OSDs, ALL the time.
> There is no way to designate a particular set of disks, AKA OSDs, to be
> a high performance group, and allocate certain RBDs to only use that set
> of disks.
> Pools, only control things like the replication count, and number of
> placement groups.
>
> I'd have to set up a whole new ceph cluster for the type of behavior I
> want.
>
> Am I correct?
>
>
>
> --
> Philip Brown| Sr. Linux System Administrator | Medata, Inc.
> 5 Peters Canyon Rd Suite 250
> Irvine CA 92606
> Office 714.918.1310| Fax 714.918.1325
> pbr...@medata.com| www.medata.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Large OMAP Object

2019-11-20 Thread Nathan Fish
It's a warning, not an error, and if you consider it to not be a
problem, I believe you can raise
osd_deep_scrub_large_omap_object_key_threshold (the key-count threshold
whose default dropped from 2M to 200k in 14.2.3) back to 2M.
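If you do decide to raise it, something along these lines should work
(assuming the Nautilus centralized config; 2000000 keys was the old default):

    ceph config set osd osd_deep_scrub_large_omap_object_key_threshold 2000000

The warning clears once the affected PG is deep-scrubbed again.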

On Wed, Nov 20, 2019 at 11:37 AM  wrote:
>
> All;
>
> Since I haven't heard otherwise, I have to assume that the only way to get 
> this to go away is to dump the contents of the RGW bucket(s), and  recreate 
> it (them)?
>
> How did this get past release approval?  A change which makes a valid cluster 
> state invalid, with no mitigation other than downtime, in a minor release.
>
> Thank you,
>
> Dominic L. Hilsbos, MBA
> Director – Information Technology
> Perform Air International Inc.
> dhils...@performair.com
> www.PerformAir.com
>
>
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> dhils...@performair.com
> Sent: Friday, November 15, 2019 9:13 AM
> To: ceph-users@lists.ceph.com
> Cc: Stephen Self
> Subject: Re: [ceph-users] Large OMAP Object
>
> Wido;
>
> Ok, yes, I have tracked it down to the index for one of our buckets.  I 
> missed the ID in the ceph df output previously.  Next time I'll wait to read 
> replies until I've finished my morning coffee.
>
> How would I go about correcting this?
>
> The content for this bucket is basically just junk, as we're still doing 
> production qualification, and workflow planning.  Moving from Windows file 
> shares to self-hosted cloud storage is a significant undertaking.
>
> Thank you,
>
> Dominic L. Hilsbos, MBA
> Director – Information Technology
> Perform Air International Inc.
> dhils...@performair.com
> www.PerformAir.com
>
>
>
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Wido 
> den Hollander
> Sent: Friday, November 15, 2019 8:40 AM
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Large OMAP Object
>
>
>
> On 11/15/19 4:35 PM, dhils...@performair.com wrote:
> > All;
> >
> > Thank you for your help so far.  I have found the log entries from when the 
> > object was found, but don't see a reference to the pool.
> >
> > Here the logs:
> > 2019-11-14 03:10:16.508601 osd.1 (osd.1) 21 : cluster [DBG] 56.7 deep-scrub 
> > starts
> > 2019-11-14 03:10:18.325881 osd.1 (osd.1) 22 : cluster [WRN] Large omap 
> > object found. Object: 
> > 56:f7d15b13:::.dir.f91aeff8-a365-47b4-a1c8-928cd66134e8.44130.1:head Key 
> > count: 380425 Size (bytes): 82896978
> >
>
> In this case it's in pool 56, check 'ceph df' to see which pool that is.
>
> To me this seems like a RGW bucket which index grew too big.
>
> Use:
>
> $ radosgw-admin bucket list
> $ radosgw-admin metadata get bucket:
>
> And match that UUID back to the bucket.
>
> Wido
>
> > Thank you,
> >
> > Dominic L. Hilsbos, MBA
> > Director – Information Technology
> > Perform Air International Inc.
> > dhils...@performair.com
> > www.PerformAir.com
> >
> >
> >
> > -Original Message-
> > From: Wido den Hollander [mailto:w...@42on.com]
> > Sent: Friday, November 15, 2019 1:56 AM
> > To: Dominic Hilsbos; ceph-users@lists.ceph.com
> > Cc: Stephen Self
> > Subject: Re: [ceph-users] Large OMAP Object
> >
> > Did you check /var/log/ceph/ceph.log on one of the Monitors to see which
> > pool and Object the large Object is in?
> >
> > Wido
> >
> > On 11/15/19 12:23 AM, dhils...@performair.com wrote:
> >> All;
> >>
> >> We had a warning about a large OMAP object pop up in one of our clusters 
> >> overnight.  The cluster is configured for CephFS, but nothing mounts a 
> >> CephFS, at this time.
> >>
> >> The cluster mostly uses RGW.  I've checked the cluster log, the MON log, 
> >> and the MGR log on one of the mons, with no useful references to the pool 
> >> / pg where the large OMAP objects resides.
> >>
> >> Is my only option to find this large OMAP object to go through the OSD 
> >> logs for the individual OSDs in the cluster?
> >>
> >> Thank you,
> >>
> >> Dominic L. Hilsbos, MBA
> >> Director - Information Technology
> >> Perform Air International Inc.
> >> dhils...@performair.com
> >> www.PerformAir.com
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com

Re: [ceph-users] HA and data recovery of CEPH

2019-11-28 Thread Nathan Fish
If correctly configured, your cluster should have zero downtime from a
single OSD or node failure. What is your crush map? Are you using
replica or EC? If your 'min_size' is not smaller than 'size', then you
will lose availability.
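A quick way to check (pool name is a placeholder):

    ceph osd pool get mypool size        # e.g. 3
    ceph osd pool get mypool min_size    # should be less than size, e.g. 2
    ceph osd tree                        # confirm replicas land on separate hosts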

On Thu, Nov 28, 2019 at 10:50 PM Peng Bo  wrote:
>
> Hi all,
>
> We are working on using CEPH to build our HA system; the purpose is that the system
> should always provide service even if a node of CEPH is down or an OSD is lost.
>
> Currently, as we have observed, once a node/OSD is down the CEPH cluster needs
> about 40 seconds to sync data, and our system can't provide service during
> that time.
>
> My questions:
>
> Is there any way that we can reduce the data sync time?
> How can we keep CEPH available once a node/OSD is down?
>
>
> BR
>
> --
> The modern Unified Communications provider
>
> https://www.portsip.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore runs out of space and dies

2019-10-31 Thread Nathan Fish
The best way to prevent this on a testing cluster with tiny virtual
drives is probably to lower the various full_ratio's significantly.
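On a throwaway test cluster that might look something like this
(cluster-wide settings, take effect immediately):

    ceph osd set-nearfull-ratio 0.70
    ceph osd set-backfillfull-ratio 0.75
    ceph osd set-full-ratio 0.80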

On Thu, Oct 31, 2019 at 7:17 AM Paul Emmerich  wrote:
>
> BlueStore doesn't handle running out of space gracefully because that
> doesn't happen on a real disk because full_ratio (95%) and the
> failsafe_full_ratio (97%? some obscure config option) kick in before
> that happens.
>
> Yeah, I've also lost some test cluster with tiny disks to this. Usual
> reason is keeping it in a degraded state for weeks at a time...
>
> Paul
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
>
> On Thu, Oct 31, 2019 at 10:50 AM George Shuklin
>  wrote:
> >
> > Hello.
> >
> > In my lab a nautilus cluster with a bluestore suddenly went dark. As I
> > found it had used 98% of the space and most of OSDs (small, 10G each)
> > went offline. Any attempt to restart them failed with this message:
> >
> > # /usr/bin/ceph-osd -f --cluster  ceph --id 18 --setuser ceph --setgroup
> > ceph
> >
> > 2019-10-31 09:44:37.591 7f73d54b3f80 -1 osd.18 271 log_to_monitors
> > {default=true}
> > 2019-10-31 09:44:37.615 7f73bff99700 -1
> > bluestore(/var/lib/ceph/osd/ceph-18) _do_alloc_write failed to allocate
> > 0x1 allocated 0x ffe4 min_alloc_size 0x1 available 0x 0
> > 2019-10-31 09:44:37.615 7f73bff99700 -1
> > bluestore(/var/lib/ceph/osd/ceph-18) _do_write _do_alloc_write failed
> > with (28) No space left on device
> > 2019-10-31 09:44:37.615 7f73bff99700 -1
> > bluestore(/var/lib/ceph/osd/ceph-18) _txc_add_transaction error (28) No
> > space left on device not handled on operation 10 (op 30, counting from 0)
> > 2019-10-31 09:44:37.615 7f73bff99700 -1
> > bluestore(/var/lib/ceph/osd/ceph-18) ENOSPC from bluestore,
> > misconfigured cluster
> > /build/ceph-14.2.4/src/os/bluestore/BlueStore.cc: In function 'void
> > BlueStore::_txc_add_transaction(BlueStore::TransContext*,
> > ObjectStore::Transaction*)' thread 7f73bff99700 time 2019-10-31
> > 09:44:37.620694
> > /build/ceph-14.2.4/src/os/bluestore/BlueStore.cc: 11455:
> > ceph_abort_msg("unexpected error")
> >
> > I was able to recover cluster by adding some more space into VGs for
> > some of OSDs and using this command:
> >
> > ceph-bluestore-tool --log-level 30 --path /var/lib/ceph/osd/ceph-xx
> > --command bluefs-bdev-expand
> >
> > It worked but only because I added some space into OSD.
> >
> > I'm curious, is there a way to recover such OSD without growing it? On
> > the old filestore I can just remove some objects to gain space, is this
> > possible for bluestore? My main concern is that OSD daemon simply
> > crashes at start, so I can't just add 'more OSD' to cluster - all data
> > become unavailable, because OSDs are completely dead.
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] NFS

2019-10-03 Thread Nathan Fish
We have tried running nfs-ganesha (2.7 - 2.8.1) with FSAL_CEPH backed by
a Nautilus CephFS. Performance when doing metadata operations (i.e.
anything with small files) is very slow.

On Thu, Oct 3, 2019 at 10:34 AM Marc Roos  wrote:
>
>
> How should a multi tenant RGW config look like, I am not able get this
> working:
>
> EXPORT {
>Export_ID=301;
>Path = "test:test3";
>#Path = "/";
>Pseudo = "/rgwtester";
>
>Protocols = 4;
>FSAL {
>Name = RGW;
>User_Id = "test$tester1";
>Access_Key_Id = "TESTER";
>Secret_Access_Key = "xxx";
>}
>Disable_ACL = TRUE;
>CLIENT { Clients = 192.168.10.0/24; access_type = "RO"; }
> }
>
>
> 03/10/2019 16:15:37 : epoch 5d8d274c : c01 : ganesha.nfsd-4722[sigmgr]
> create_export :FSAL :CRIT :RGW module: librgw init failed (-5)
> 03/10/2019 16:15:37 : epoch 5d8d274c : c01 : ganesha.nfsd-4722[sigmgr]
> mdcache_fsal_create_export :FSAL :MAJ :Failed to call create_export on
> underlying FSAL RGW
> 03/10/2019 16:15:37 : epoch 5d8d274c : c01 : ganesha.nfsd-4722[sigmgr]
> fsal_put :FSAL :INFO :FSAL RGW now unused
> 03/10/2019 16:15:37 : epoch 5d8d274c : c01 : ganesha.nfsd-4722[sigmgr]
> fsal_cfg_commit :CONFIG :CRIT :Could not create export for (/rgwtester)
> to (test:test3)
> 03/10/2019 16:15:37 : epoch 5d8d274c : c01 : ganesha.nfsd-4722[sigmgr]
> fsal_cfg_commit :FSAL :F_DBG :FSAL RGW refcount 0
> 03/10/2019 16:15:37 : epoch 5d8d274c : c01 : ganesha.nfsd-4722[sigmgr]
> config_errs_to_log :CONFIG :CRIT :Config File
> (/etc/ganesha/ganesha.conf:216): 1 validation errors in block FSAL
> 03/10/2019 16:15:37 : epoch 5d8d274c : c01 : ganesha.nfsd-4722[sigmgr]
> config_errs_to_log :CONFIG :CRIT :Config File
> (/etc/ganesha/ganesha.conf:216): Errors processing block (FSAL)
> 03/10/2019 16:15:37 : epoch 5d8d274c : c01 : ganesha.nfsd-4722[sigmgr]
> config_errs_to_log :CONFIG :CRIT :Config File
> (/etc/ganesha/ganesha.conf:209): 1 validation errors in block EXPORT
> 03/10/2019 16:15:37 : epoch 5d8d274c : c01 : ganesha.nfsd-4722[sigmgr]
> config_errs_to_log :CONFIG :CRIT :Config File
> (/etc/ganesha/ganesha.conf:209): Errors processing block (EXPORT)
>
> -Original Message-
> Subject: Re: [ceph-users] NFS
>
> RGW NFS can support any NFS style of authentication, but users will have
> the RGW access of their nfs-ganesha export.  You can create exports with
> disjoint privileges, and since recent L, N, RGW tenants.
>
> Matt
>
> On Tue, Oct 1, 2019 at 8:31 AM Marc Roos 
> wrote:
> >
> >  I think you can run into problems
> > with a multi user environment of RGW and nfs-ganesha.
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> --
>
> Matt Benjamin
> Red Hat, Inc.
> 315 West Huron Street, Suite 140A
> Ann Arbor, Michigan 48103
>
> http://www.redhat.com/en/technologies/storage
>
> tel.  734-821-5101
> fax.  734-769-8938
> cel.  734-216-5309
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] sharing single SSD across multiple HD based OSDs

2019-12-09 Thread Nathan Fish
You can loop over the creation of fixed-size LVs on the SSD, then loop
over creating OSDs assigned to each of them. That is what we did; it
wasn't bad.
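Roughly like this, as an untested sketch - device names, LV sizes and
counts are placeholders, and with only --block.db given the WAL ends up
on the same LV:

    vgcreate ceph-db /dev/sdx
    for i in 1 2 3 4; do lvcreate -L 60G -n db-$i ceph-db; done
    i=1
    for dev in /dev/sd{c..f}; do
        ceph-volume lvm create --data $dev --block.db ceph-db/db-$i
        i=$((i+1))
    done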

On Mon, Dec 9, 2019 at 9:32 PM Philip Brown  wrote:
>
> I have a bunch of hard drives I want to use as OSDs, with ceph nautilus.
>
> ceph-volume lvm create makes straight raw dev usage relatively easy, since 
> you can just do
>
> ceph-volume lvm create --data /dev/sdc
>
> or whatever.
> Its nice that it takes care of all the LVM jiggerypokery automatically.
>
> but.. what if you have a single SSD. lets call it /dev/sdx. and I want to use 
> it for the WAL, for
> /dev/sdc, sdd, sde, sdf, and so on.
>
> Do you have to associate each OSD with a unique WAL dev, or can they "share"?
>
> Do I really have to MANUALLY go carve up /dev/sdx into slices, LVM or 
> otherwise, and then go hand manage the slicing?
>
> ceph-volume lvm create --data /dev/sdc --block.wal /dev/sdx1
> ceph-volume lvm create --data /dev/sdd --block.wal /dev/sdx2
> ceph-volume lvm create --data /dev/sde --block.wal /dev/sdx3
> ?
>
> can I not get away with some other more simplified usage?
>
>
>
> --
> Philip Brown| Sr. Linux System Administrator | Medata, Inc.
> 5 Peters Canyon Rd Suite 250
> Irvine CA 92606
> Office 714.918.1310| Fax 714.918.1325
> pbr...@medata.com| www.medata.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Problem : "1 pools have many more objects per pg than average"

2020-01-22 Thread Nathan Fish
Injectargs causes an immediate runtime change; rebooting the mon would
negate the change.
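To make it persistent, set it in the centralized config instead. A sketch,
assuming Nautilus, where this particular warning is evaluated by the mgr
rather than the mons:

    ceph config set mgr mon_pg_warn_max_object_skew 20
    ceph config get mgr mon_pg_warn_max_object_skew    # verify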

On Wed., Jan. 22, 2020, 4:41 p.m. St-Germain, Sylvain (SSC/SPC), <
sylvain.st-germ...@canada.ca> wrote:

> / Problem ///
>
>  I've got a Warning on my cluster that I cannot remove :
>
> "1 pools have many more objects per pg than average"
>
> Does somebody has some insight ? I think it's normal to have this warning
> because I have just one pool in use, but how can I remove this warning ?
>
> Thx !
>
> / INFORMATION ///
>
> *** Here's some information about the cluster
>
> # ceph health detail
> HEALTH_WARN 1 pools have many more objects per pg than average
> MANY_OBJECTS_PER_PG 1 pools have many more objects per pg than average
> pool default.rgw.buckets.data objects per pg (50567) is more than
> 14.1249 times cluster average (3580)
>
> # sudo ceph-conf -D | grep mon_pg_warn_max_object_skew
> mon_pg_warn_max_object_skew = 10.00
>
> # ceph daemon mon.dao-wkr-04 config show | grep mon_pg_warn_max_object_skew
> "mon_pg_warn_max_object_skew": "10.00"
>
> # ceph -v
> ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus
> (stable)
>
> # ceph df
> RAW STORAGE:
>     CLASS     SIZE        AVAIL       USED       RAW USED     %RAW USED
>     hdd       873 TiB     823 TiB     50 TiB     51 TiB       5.80
>     TOTAL     873 TiB     823 TiB     50 TiB     51 TiB       5.80
>
> POOLS:
>     POOL                          ID     STORED      OBJECTS     USED        %USED     MAX AVAIL
>     .rgw.root                     11     3.5 KiB     8           1.5 MiB     0         249 TiB
>     default.rgw.control           12     0 B         8           0 B         0         249 TiB
>     default.rgw.meta              13     52 KiB      186         34 MiB      0         249 TiB
>     default.rgw.log               14     0 B         207         0 B         0         249 TiB
>     default.rgw.buckets.index     15     1.2 GiB     131         1.2 GiB     0         249 TiB
>     cephfs_data                   29     915 MiB     202         1.5 GiB     0         467 TiB
>     cephfs_metadata               30     145 KiB     23          2.1 MiB     0         249 TiB
>     default.rgw.buckets.data      31     30 TiB      12.95M      50 TiB      6.32      467 TiB
>
>
> # ceph osd dump | grep default.rgw.buckets.data
> pool 31 'default.rgw.buckets.data' erasure size 8 min_size 6 crush_rule 2
> object_hash rjenkins pg_num 256 pgp_num 256 autoscale_mode on last_change
> 9502 lfor 0/2191/7577 flags hashpspool stripe_width 20480 target_size_ratio
> 0.4 application rgw
>
> /// SOLUTION TRIED ///
>
> 1- I tried to increase the value of the mon_pg_warn_max_object_skew parameter
>
> #sudo ceph tell mon.* injectargs '--mon_pg_warn_max_object_skew 20'
> #sudo ceph tell osd.* injectargs '--mon_pg_warn_max_object_skew 20'
>
> + And reboot the monitor
>
> The parameter didn't change.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs kernel client io performance decreases extremely

2019-12-26 Thread Nathan Fish
I would start by viewing "ceph status", checking drive IO with "iostat
-x 1 /dev/sd{a..z}", and checking the CPU/RAM usage of the active MDS.
If "ceph status" warns that the MDS cache is oversized, that may be an
easy fix.
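Concretely, something like this (the MDS process name match is an
assumption, adjust as needed):

    ceph status
    ceph health detail                 # look for 'MDSs report oversized cache'
    iostat -x 1 /dev/sd{a..z}          # watch %util and await on the OSD disks
    top -p $(pgrep -d, ceph-mds)       # CPU / RSS of the active MDS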

On Thu, Dec 26, 2019 at 7:33 AM renjianxinlover 
wrote:

> hello,
>recently, after deleting some fs data in a small-scale ceph
> cluster, some clients IO performance became bad, specially latency. for
> example, opening a tiny text file by vim maybe consumed nearly twenty
>  seconds, i am not clear about how to diagnose the cause, could anyone give
> some guidence?
>
> Brs
> renjianxinlover
> renjianxinlo...@163.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com