Re: [ceph-users] OSD Marked down unable to restart continuously failing

2020-01-11 Thread Eugen Block

Hi,

you say the daemons are locally up and running, but restarting them fails? Which of the two is it? Do you see any messages suggesting flapping OSDs? After 5 retries within 10 minutes the OSDs would be marked out. What is the result of your checks with iostat etc.? Anything pointing to a high load on the OSD node?
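
Just as a sketch of what I would check first (the OSD id is a placeholder, and iostat assumes the sysstat package is installed):

ceph health detail
ceph osd tree | grep -i down
iostat -x 2 5
ceph daemon osd.<id> dump_historic_slow_ops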


Regards,
Eugen


Quoting Radhakrishnan2 S:


Can someone please help to respond to the below query ?

Regards
Radha Krishnan S
TCS Enterprise Cloud Practice
Tata Consultancy Services
Cell:- +1 848 466 4870
Mailto: radhakrishnan...@tcs.com
Website: http://www.tcs.com

Experience certainty.   IT Services
Business Solutions
Consulting



-Radhakrishnan2 S/CHN/TCS wrote: -
To: "Ceph Users" 
From: Radhakrishnan2 S/CHN/TCS
Date: 01/09/2020 08:34AM
Subject: OSD Marked down unable to restart continuously failing

Hello Everyone,

One OSD node out of 16 has 12 OSDs with bcache on NVMe. Locally those OSD daemons seem to be up and running, while the ceph osd tree shows them as down. Logs show that the OSDs have had IO stuck for over 4096 sec.


I tried checking iostat, netstat and ceph -w along with the logs. Is there a way to identify why this is happening? In addition, when I restart the OSD daemons on the respective OSD node, the restart fails. Any quick help please.
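
For reference, on a systemd-based deployment something like this usually shows why the restart fails (assuming default unit and log names; <id> is a placeholder):

systemctl status ceph-osd@<id>
journalctl -xe -u ceph-osd@<id>
tail -n 100 /var/log/ceph/ceph-osd.<id>.log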


Regards
Radha Krishnan S
TCS Enterprise Cloud Practice
Tata Consultancy Services
Cell:- +1 848 466 4870
Mailto: radhakrishnan...@tcs.com
Website: http://www.tcs.com

Experience certainty.   IT Services
Business Solutions
Consulting



-"ceph-users"  wrote: -
To: d.aber...@profihost.ag, "Janne Johansson" 
From: "Wido den Hollander"
Sent by: "ceph-users"
Date: 01/09/2020 08:19AM
Cc: "Ceph Users" , a.bra...@profihost.ag,  
"p.kra...@profihost.ag" , j.kr...@profihost.ag

Subject: Re: [ceph-users] Looking for experience

"External email. Open with Caution"


On 1/9/20 2:07 PM, Daniel Aberger - Profihost AG wrote:


On 09.01.20 at 13:39, Janne Johansson wrote:


I'm currently trying to work out a concept for a ceph cluster which can be used as a target for backups and which satisfies the following requirements:

- approx. write speed of 40,000 IOPS and 2,500 MByte/s


You might need a large number of writers (more than one, at least) to reach that aggregate rate, as opposed to trying to reach it with one single stream written from one single client.



We are aiming for about 100 writers.


So if I read it correctly the writes will be about 64 KB each (2,500 MByte/s divided by 40,000 IOPS is roughly 64 KB).

That should be doable, but you probably want something like NVMe for DB+WAL.

You might want to tune BlueStore so that larger writes also go into the WAL to speed up the ingress writes. But mainly you want more spindles rather than fewer.
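
For reference, the BlueStore option that controls this is the deferred-write threshold; a minimal sketch for ceph.conf, assuming HDD OSDs and an example value of 128 KB so that the ~64 KB backup writes go through the WAL (test before using it in production):

[osd]
bluestore_prefer_deferred_size_hdd = 131072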

Wido



Cheers


Re: [ceph-users] ceph (jewel) unable to recover after node failure

2020-01-10 Thread Eugen Block

Hi,


A. will ceph be able to recover over time? I am afraid that the 14 PGs
that are down will not recover.


if all OSDs come back (stable) the recovery should eventually finish.


B. what caused the OSDs going down and up during recovery after the
failed OSD node came back online? (step 2 above) I suspect that the
high CPU load we saw on all the nodes caused timeouts in the OSD
daemons. Is this a reasonable assumption?


Yes, this is a reasonable assumption. Just a few weeks ago we saw this in a customer cluster with EC pools. The OSDs were fully saturated, causing heartbeats to the peers to fail, OSDs being marked down, coming back up and so on (flapping OSDs). At the beginning the MON notices that the OSD processes are up although the peers report them as down, but after 5 of these "down" reports by peers (config option osd_max_markdown_count) within 10 minutes (config option osd_max_markdown_period) the OSD is marked out, which causes more rebalancing and, in turn, an even higher load.


If there are no other hints pointing to a different root cause you could run 'ceph osd set nodown' to prevent that flapping. This should help the cluster to recover; it helped in the customer environment, although there was also another issue there.
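
A rough sketch of what that could look like (the throttle values are only examples; unset the flag once the cluster has stabilized):

ceph osd set nodown
ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1'
# once recovery has settled:
ceph osd unset nodown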


Regards,
Eugen


Quoting Hanspeter Kunz:


Hi,

after a node failure ceph is unable to recover, i.e. unable to
reintegrate the failed node back into the cluster.

what happened?
1. a node with 11 osds crashed, the remaining 4 nodes (also with 11
osds each) re-balanced, although reporting the following error
condition:

too many PGs per OSD (314 > max 300)

2. after we put the failed node back online, automatic recovery
started, but very soon (after a few minutes) we saw OSDs randomly going
down and up on ALL the osd nodes (not only on the one that had failed).
We saw that the CPU load on the nodes was very high (load average 120).

3. the situation seemed to get worse over time (more and more OSDs
going down, less were coming back up) so we switched the node that had
failed off again.

4. after that, the cluster "calmed down", CPU load became normal
(average load ~4-5). we manually restarted the OSD daemons of the OSDs
that were still down and one after the other these OSDs came back up.
Recovery processes are still running now, but it seems to me that 14
PGs are not recoverable:

output of ceph -s:

 health HEALTH_ERR
16 pgs are stuck inactive for more than 300 seconds
255 pgs backfill_wait
16 pgs backfilling
205 pgs degraded
14 pgs down
2 pgs incomplete
14 pgs peering
48 pgs recovery_wait
205 pgs stuck degraded
16 pgs stuck inactive
335 pgs stuck unclean
156 pgs stuck undersized
156 pgs undersized
25 requests are blocked > 32 sec
recovery 1788571/71151951 objects degraded (2.514%)
recovery 2342374/71151951 objects misplaced (3.292%)
too many PGs per OSD (314 > max 300)

I have a few questions now:

A. will ceph be able to recover over time? I am afraid that the 14 PGs
that are down will not recover.

B. what caused the OSDs going down and up during recovery after the
failed OSD node came back online? (step 2 above) I suspect that the
high CPU load we saw on all the nodes caused timeouts in the OSD
daemons. Is this a reasonable assumption?

C. If indeed all this was caused by such an overload is there a way to
make the recovery process less CPU intensive?

D. What would you advise me to do/try to recover to a healthy state?

In what follows I try to give some more background information
(configuration, log messages).

ceph version: 10.2.11
OS version: debian jessie
[yes I know this is old]

cluster: 5 OSD nodes (12 cores, 64G RAM), 11 OSD per node, each OSD
daemon controls a 2 TB harddrive. The journals are written to an SSD.

ceph.conf:
-
[global]
fsid = [censored]
mon_initial_members = salomon, simon, ramon
mon_host = 10.65.16.44, 10.65.16.45, 10.65.16.46
public_network = 10.65.16.0/24
cluster_network = 10.65.18.0/24
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
mon osd down out interval = 7200
--

Log Messages (examples):

we see a lot of:

Jan  7 18:52:22 bruce ceph-osd[9184]: 2020-01-07 18:52:22.411377 7f0ebd93b700 -1 osd.29 15636 heartbeat_check: no reply from 10.65.16.43:6822 osd.48 since back 2020-01-07 18:51:20.119784 front 2020-01-07 18:52:21.575852 (cutoff 2020-01-07 18:52:02.411330)

however, all the networks were up (the machines could ping each other).

I guess these are the log messages of OSDs going down (on one of the
nodes):
Jan  7 16:47:37 bruce ceph-osd[3684]: 2020-01-07 16:47:37.729691  
7fbe5ee73700 -1 osd.25 15017 *** Got signal Interrupt ***
Jan  7 16:47:37 bruce ceph-osd[3684]: 2020-01-07 16:47:37.729701  
7fbe5ee73700 -1 osd.25 15017 shutdown
Jan  7 16:47:43 bruce ceph-osd[5689]: 

Re: [ceph-users] pgs backfill_toofull after removing OSD from CRUSH map

2019-12-19 Thread Eugen Block

Hi Kristof,

setting the OSD "out" doesn't change the crush weight of that OSD, but  
removing it from the tree does, that's why the cluster started to  
rebalance.
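
For the next host, one way to avoid that second rebalance is to drain the OSDs via their crush weight first, so the later removal no longer changes the CRUSH mapping; a sketch (the OSD id is a placeholder, and 'purge' assumes Luminous or newer):

ceph osd crush reweight osd.<id> 0
# wait for the rebalance to finish, then:
ceph osd out osd.<id>
ceph osd purge osd.<id> --yes-i-really-mean-it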


Regards,
Eugen


Quoting Kristof Coucke:


Hi all,

We are facing a strange symptom here.
We're testing our recovery procedures. Short description of our environment:
1. 10 OSD host nodes, each 13 disks + 2 NVMe's
2. 3 monitor nodes
3. 1 management node
4. 2 RGW's
5. 1 Client

Ceph version: Nautilus version 14.2.4

=> We are testing how to "nicely" eliminate 1 OSD host.
As a first step, we've marked the OSDs out by running "ceph osd out osd.".
The system went into an error state with a few messages that backfill was too full, but this was more or less expected.

However, after leaving the system to recover, everything went back to normal. Health did not indicate any warnings or errors. Running the "ceph osd safe-to-destroy" command indicated the disks could be safely removed.

So far so good, no problem...
Then we decided to properly remove the disks from the crush map, and now the whole story starts again: backfill_toofull errors and recovery is running again.

Why?
The disks were already marked out and no PGs were on them anymore.

Is this caused by the fact that the CRUSH map is modified and a recalculation happens, causing the PGs to be mapped to different OSDs automatically? It does seem strange behaviour, to be honest.

Any feedback is greatly appreciated!

Regards,

Kristof Coucke






Re: [ceph-users] Cluster in ERR status when rebalancing

2019-12-09 Thread Eugen Block

Hi,

since we upgraded our cluster to Nautilus we also see those messages  
sometimes when it's rebalancing. There are several reports about this  
[1] [2], we didn't see it in Luminous. But eventually the rebalancing  
finished and the error message cleared, so I'd say there's (probably)  
nothing to worry about if there aren't any other issues.
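
If it does get in the way, you can check the thresholds and, as a temporary measure, raise the backfillfull ratio a little (example value only; lower it again afterwards):

ceph osd dump | grep ratio
ceph osd set-backfillfull-ratio 0.91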


Regards,
Eugen


[1] https://tracker.ceph.com/issues/39555
[2] https://tracker.ceph.com/issues/41255


Quoting Simone Lazzaris:


Hi all;
Long story short, I have a cluster of 26 OSDs in 3 nodes (8+9+9). One of the disks is showing some read errors, so I've added an OSD to the faulty node (OSD.26) and set the (re)weight of the faulty OSD (OSD.12) to zero.

The cluster is now rebalancing, which is fine, but I now have 2 PGs in "backfill_toofull" state, so the cluster health is "ERR":

  cluster:
id: 9ec27b0f-acfd-40a3-b35d-db301ac5ce8c
health: HEALTH_ERR
Degraded data redundancy (low space): 2 pgs backfill_toofull

  services:
mon: 3 daemons, quorum s1,s2,s3 (age 7d)
mgr: s1(active, since 7d), standbys: s2, s3
osd: 27 osds: 27 up (since 2h), 26 in (since 2h); 262 remapped pgs
rgw: 3 daemons active (s1, s2, s3)

  data:
pools:   10 pools, 1200 pgs
objects: 11.72M objects, 37 TiB
usage:   57 TiB used, 42 TiB / 98 TiB avail
pgs: 2618510/35167194 objects misplaced (7.446%)
 938 active+clean
 216 active+remapped+backfill_wait
 44  active+remapped+backfilling
 2   active+remapped+backfill_wait+backfill_toofull

  io:
recovery: 163 MiB/s, 50 objects/s

  progress:
Rebalancing after osd.12 marked out
  [=.]

As you can see, there is plenty of space and none of my OSDs is in a full or near-full state:


++--+---+---++-++-+---+
| id | host |  used | avail | wr ops | wr data | rd ops | rd data |   state   |
++--+---+---++-++-+---+
| 0  |  s1  | 2415G | 1310G |0   | 0   |0   | 0   | exists,up |
| 1  |  s2  | 2009G | 1716G |0   | 0   |0   | 0   | exists,up |
| 2  |  s3  | 2183G | 1542G |0   | 0   |0   | 0   | exists,up |
| 3  |  s1  | 2680G | 1045G |0   | 0   |0   | 0   | exists,up |
| 4  |  s2  | 2063G | 1662G |0   | 0   |0   | 0   | exists,up |
| 5  |  s3  | 2269G | 1456G |0   | 0   |0   | 0   | exists,up |
| 6  |  s1  | 2523G | 1202G |0   | 0   |0   | 0   | exists,up |
| 7  |  s2  | 1973G | 1752G |0   | 0   |0   | 0   | exists,up |
| 8  |  s3  | 2007G | 1718G |0   | 0   |1   | 0   | exists,up |
| 9  |  s1  | 2485G | 1240G |0   | 0   |0   | 0   | exists,up |
| 10 |  s2  | 2385G | 1340G |0   | 0   |0   | 0   | exists,up |
| 11 |  s3  | 2079G | 1646G |0   | 0   |0   | 0   | exists,up |
| 12 |  s1  | 2272G | 1453G |0   | 0   |0   | 0   | exists,up |
| 13 |  s2  | 2381G | 1344G |0   | 0   |0   | 0   | exists,up |
| 14 |  s3  | 1923G | 1802G |0   | 0   |0   | 0   | exists,up |
| 15 |  s1  | 2617G | 1108G |0   | 0   |0   | 0   | exists,up |
| 16 |  s2  | 2099G | 1626G |0   | 0   |0   | 0   | exists,up |
| 17 |  s3  | 2336G | 1389G |0   | 0   |0   | 0   | exists,up |
| 18 |  s1  | 2435G | 1290G |0   | 0   |0   | 0   | exists,up |
| 19 |  s2  | 2198G | 1527G |0   | 0   |0   | 0   | exists,up |
| 20 |  s3  | 2159G | 1566G |0   | 0   |0   | 0   | exists,up |
| 21 |  s1  | 2128G | 1597G |0   | 0   |0   | 0   | exists,up |
| 22 |  s3  | 2064G | 1661G |0   | 0   |0   | 0   | exists,up |
| 23 |  s2  | 1943G | 1782G |0   | 0   |0   | 0   | exists,up |
| 24 |  s3  | 2168G | 1557G |0   | 0   |0   | 0   | exists,up |
| 25 |  s2  | 2113G | 1612G |0   | 0   |0   | 0   | exists,up |
| 26 |  s1  | 68.9G | 3657G |0   | 0   |0   | 0   | exists,up |
++--+---+---++-++-+---+



root@s1:~# ceph pg dump|egrep 'toofull|PG_STAT'
PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG DISK_LOG STATE STATE_STAMP VERSION REPORTED UP UP_PRIMARY ACTING ACTING_PRIMARY LAST_SCRUB SCRUB_STAMP LAST_DEEP_SCRUB DEEP_SCRUB_STAMP SNAPTRIMQ_LEN
6.212 0 0 0 0 0 38145321727 0 0 3023 3023 active+remapped+backfill_wait+backfill_toofull 2019-12-09 11:11:39.093042 13598'212053 13713:1179718 [6,19,24] 6

Re: [ceph-users] HEALTH_WARN 1 MDSs report oversized cache

2019-12-05 Thread Eugen Block

Hi,

can you provide more details?

ceph daemon mds. cache status
ceph config show mds. | grep mds_cache_memory_limit
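
If the cache really needs to be bigger, raising the limit is usually the fix; a sketch assuming a release with the config database (Mimic or newer) and an example value of 4 GB:

ceph config set mds mds_cache_memory_limit 4294967296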


Regards,
Eugen


Quoting Ranjan Ghosh:


Okay, now, after I settled the issue with the oneshot service thanks to
the amazing help of Paul and Richard (thanks again!), I still wonder:

What could I do about that MDS warning:

===

health: HEALTH_WARN

1 MDSs report oversized cache

===

Does anybody have any ideas? I tried googling it, of course, but came up with no really relevant info on how to actually solve this.


BR

Ranjan




Re: [ceph-users] Command ceph osd df hangs

2019-11-21 Thread Eugen Block

Hi,

check if the active MGR is hanging. I had this when testing the pg_autoscaler: after some time every command would hang. Restarting the MGR helped for a short period of time, so I then disabled the pg_autoscaler. This is an upgraded cluster, currently on Nautilus.
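
In case it is the MGR, something like this helped here (the MGR name is a placeholder, the module name is as in Nautilus):

ceph mgr fail <active-mgr-name>
ceph mgr module disable pg_autoscaler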


Regards,
Eugen


Quoting Thomas Schneider <74cmo...@gmail.com>:


Hi,
command ceph osd df does not return any output.
Based on the strace output there's a timeout.
[...]
mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
0) = 0x7f53006b9000
brk(0x55c2579b6000) = 0x55c2579b6000
brk(0x55c2579d7000) = 0x55c2579d7000
brk(0x55c2579f9000) = 0x55c2579f9000
brk(0x55c257a1a000) = 0x55c257a1a000
mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
0) = 0x7f5300679000
brk(0x55c257a3b000) = 0x55c257a3b000
brk(0x55c257a5c000) = 0x55c257a5c000
brk(0x55c257a7d000) = 0x55c257a7d000
clone(child_stack=0x7f53095c1fb0,
flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID,
parent_tidptr=0x7f53095c29d0, tls=0x7f53095c2700,
child_tidptr=0x7f53095c29d0) = 3261669
futex(0x55c257489940, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x55c2576246e0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0,
NULL, FUTEX_BITSET_MATCH_ANY) = -1 EAGAIN (Resource temporarily unavailable)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=2000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=4000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=8000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=16000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=32000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=5}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=5}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=5}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=5}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=5}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=5}^Cstrace: Process
3261645 detached
 
Interrupted
Traceback (most recent call last):
  File "/usr/bin/ceph", line 1263, in 
    retval = main()
  File "/usr/bin/ceph", line 1194, in main

    verbose)
  File "/usr/bin/ceph", line 619, in new_style_command
    ret, outbuf, outs = do_command(parsed_args, target, cmdargs,
sigdict, inbuf, verbose)
  File "/usr/bin/ceph", line 593, in do_command
    return ret, '', ''
UnboundLocalError: local variable 'ret' referenced before assignment


How can I fix this?
Do you need the full strace output to analyse this issue?

This Ceph health status has been reported for hours and I cannot identify any progress. Not sure if this is related to the issue with ceph osd df, though.

2019-11-21 15:00:00.000262 mon.ld5505 [ERR] overall HEALTH_ERR 1
filesystem is degraded; 1 filesystem has a failed mds daemon; 1
filesystem is offline; insufficient standby MDS daemons available;
nodown,noout,noscrub,nodeep-scrub flag(s) set; 81 osds down; Reduced
data availability: 1366 pgs inactive, 241 pgs peering; Degraded data
redundancy: 6437/190964568 objects degraded (0.003%), 7 pgs degraded, 7
pgs undersized; 1 subtrees have overcommitted pool target_size_bytes; 1
subtrees have overcommitted pool target_size_ratio

THX


Re: [ceph-users] Is deepscrub Part of PG increase?

2019-11-03 Thread Eugen Block

Hi,

deep-scrubs can also be configured per pool, so even if you have adjusted the general deep-scrub time window, deep-scrubs will still happen. To disable deep-scrubs per pool you need to set:


ceph osd pool set  nodeep-scrub true
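
For completeness, the scrub time window itself is usually restricted with these OSD options (shown with the 23:00-06:00 window from your mail; they only affect regularly scheduled scrubs):

ceph config set osd osd_scrub_begin_hour 23
ceph config set osd osd_scrub_end_hour 6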

Regards,
Eugen


Quoting c...@elchaka.de:


Hello,

I have a Nautilus cluster (14.2.4) where I set the time for scrubs to run only from 23:00 till 6:00.


But last time I increased my PGs from 512 (with 15 bluestore OSDs on 3 nodes) to 1024 (with 35 OSDs on 7 nodes), I observed deep-scrubs running once some rebalances had finished...


When I set no(deep-)scrub there is no scrub running, but when I unset no(deep-)scrub it starts again. I observed this when I first increased to 532 PGs as a test.


So, is scrubbing a necessary part of a PG increase that has to run right away and should not or cannot wait for the scrub timespan defined by the admin?


Hope you can enlighten me :)
- Mehmet






Re: [ceph-users] clust recovery stuck

2019-10-23 Thread Eugen Block

Hi,

if the OSDs are not too full it's probably the crush weight of those  
hosts and OSDs. CRUSH tries to distribute the data evenly to all three  
hosts because they have the same weight (9.31400). But since two OSDs  
are missing the distribution doesn't finish. If you can't replace the  
failed OSDs you could try to adjust the crush weights and see if the  
recovery finishes.
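
A sketch based on your osd tree below, assuming the two down OSDs won't come back and their data should be redistributed to the remaining OSDs:

ceph osd crush reweight osd.0 0
ceph osd crush reweight osd.6 0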


Regards,
Eugen


Quoting Andras Pataki:


Hi Philipp,

Given 256 PG's triple replicated onto 4 OSD's you might be  
encountering the "PG overdose protection" of OSDs.  Take a look at  
'ceph osd df' and see the number of PG's that are mapped to each OSD  
(last column or near the last).  The default limit is 200, so if any  
OSD exceeds that, it would explain the freeze, since the OSD will  
simply ignore the excess.  In that case, try increasing  
mon_max_pg_per_osd to, say, 400 and see if that helps.  This would  
allow the recovery to proceed - but you should consider adding OSDs  
(or at least increase the memory allocated to OSDs above the  
defaults).
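
A sketch of what that could look like in ceph.conf, assuming a release where this option exists (Luminous or newer); the daemons need a restart (or an injectargs) to pick it up:

[global]
mon_max_pg_per_osd = 400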


Andras

On 10/22/19 3:02 PM, Philipp Schwaha wrote:

hi,

On 2019-10-22 08:05, Eugen Block wrote:

Hi,

can you share `ceph osd tree`? What crush rules are in use in your
cluster? I assume that the two failed OSDs prevent the remapping because
the rules can't be applied.


ceph osd tree gives:

ID WEIGHT   TYPE NAMEUP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 27.94199 root default
-2  9.31400 host alpha.local
 0  4.65700 osd.0   down0  1.0
 3  4.65700 osd.3 up  1.0  1.0
-3  9.31400 host beta.local
 1  4.65700 osd.1 up  1.0  1.0
 6  4.65700 osd.6   down0  1.0
-4  9.31400 host gamma.local
 2  4.65700 osd.2 up  1.0  1.0
 4  4.65700 osd.4 up  1.0  1.0


the crush rules should be fairly simple, nothing particularly customized
as far as I can tell:
'ceph osd crush tree' gives:
[
{
"id": -1,
"name": "default",
"type": "root",
"type_id": 10,
"items": [
{
"id": -2,
"name": "alpha.local",
"type": "host",
"type_id": 1,
"items": [
{
"id": 0,
"name": "osd.0",
"type": "osd",
"type_id": 0,
"crush_weight": 4.656998,
"depth": 2
},
{
"id": 3,
"name": "osd.3",
"type": "osd",
"type_id": 0,
"crush_weight": 4.656998,
"depth": 2
}
]
},
{
"id": -3,
"name": "beta.local",
"type": "host",
"type_id": 1,
"items": [
{
"id": 1,
"name": "osd.1",
"type": "osd",
"type_id": 0,
"crush_weight": 4.656998,
"depth": 2
},
{
"id": 6,
"name": "osd.6",
"type": "osd",
"type_id": 0,
"crush_weight": 4.656998,
"depth": 2
}
]
},
{
"id": -4,
"name": "gamma.local",
"type": "host",
"type_id": 1,
"items": [
{
"id": 2,
"name": "osd.2",
"type": "osd",
"type_id": 0,
"crush_weight": 4.656998,
"depth": 2
},
{
"id": 4,
"na

Re: [ceph-users] clust recovery stuck

2019-10-22 Thread Eugen Block

Hi,

can you share `ceph osd tree`? What crush rules are in use in your  
cluster? I assume that the two failed OSDs prevent the remapping  
because the rules can't be applied.



Regards,
Eugen


Quoting Philipp Schwaha:


hi,

I have a problem with a cluster being stuck in recovery after an OSD failure. At first recovery was doing quite well, but now it just sits there without any progress. It currently looks like this:

 health HEALTH_ERR
36 pgs are stuck inactive for more than 300 seconds
50 pgs backfill_wait
52 pgs degraded
36 pgs down
36 pgs peering
1 pgs recovering
1 pgs recovery_wait
36 pgs stuck inactive
52 pgs stuck unclean
52 pgs undersized
recovery 261632/2235446 objects degraded (11.704%)
recovery 259813/2235446 objects misplaced (11.622%)
recovery 2/1117723 unfound (0.000%)
 monmap e3: 3 mons at
{0=192.168.19.13:6789/0,1=192.168.19.17:6789/0,2=192.168.19.23:6789/0}
election epoch 78, quorum 0,1,2 0,1,2
 osdmap e7430: 6 osds: 4 up, 4 in; 88 remapped pgs
flags sortbitwise
  pgmap v20023893: 256 pgs, 1 pools, 4366 GB data, 1091 kobjects
8421 GB used, 10183 GB / 18629 GB avail
261632/2235446 objects degraded (11.704%)
259813/2235446 objects misplaced (11.622%)
2/1117723 unfound (0.000%)
 168 active+clean
  50 active+undersized+degraded+remapped+wait_backfill
  36 down+remapped+peering
   1 active+recovering+undersized+degraded+remapped
   1 active+recovery_wait+undersized+degraded+remapped

Is there any way to motivate it to resume recovery?

Thanks
Philipp






Re: [ceph-users] ceph stats on the logs

2019-10-08 Thread Eugen Block

Hi,

there is also /var/log/ceph/ceph.log on the MONs, it has the stats  
you're asking for. Does this answer your question?


Regards,
Eugen


Quoting nokia ceph:


Hi Team,

With default log settings, the ceph stats will be logged like:
cluster [INF] pgmap v30410386: 8192 pgs: 8192 active+clean; 445 TB data, 1339 TB used, 852 TB / 2191 TB avail; 188 kB/s rd, 217 MB/s wr, 1618 op/s
Jewel: in the mon logs
Nautilus: in the mgr logs
Luminous: not able to view similar logs in either the mon or mgr logs. What log level has to be set to get these stats into the logs?

Thanks,
Muthu






Re: [ceph-users] Howto add DB (aka RockDB) device to existing OSD on HDD

2019-08-29 Thread Eugen Block
Sorry, I misread; your option is correct, of course, since there was no external DB device yet.

This worked for me:

ceph-2:~ # CEPH_ARGS="--bluestore-block-db-size 1048576"  
ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-1 bluefs-bdev-new-db  
--dev-target /dev/sdb

inferring bluefs devices from bluestore path
DB device added /dev/sdb

ceph-2:~ # ll /var/lib/ceph/osd/ceph-1/block*
lrwxrwxrwx 1 ceph ceph 93 31. Jul 15:04 /var/lib/ceph/osd/ceph-1/block  
->  
/dev/ceph-d1f349d6-70ba-40d3-a510-3e5afb585782/osd-block-7523a676-a9de-4ed9-890c-197c6cd2d6d1
lrwxrwxrwx 1 root root  8 29. Aug 12:14  
/var/lib/ceph/osd/ceph-1/block.db -> /dev/sdb


Regards,
Eugen


Quoting 74cmo...@gmail.com:


Hi,

I have created OSD on HDD w/o putting DB on faster drive.

In order to improve performance I have now a single SSD drive with 3.8TB.

I modified /etc/ceph/ceph.conf by adding this in [global]:
bluestore_block_db_size = 53687091200
This should create RockDB with size 50GB.

Then I tried to move DB to a new device (SSD) that is not formatted:
root@ld5505:~# ceph-bluestore-tool bluefs-bdev-new-db –-path  
/var/lib/ceph/osd/ceph-76 --dev-target /dev/sdbk

too many positional options have been specified on the command line

Checking the content of /var/lib/ceph/osd/ceph-76 it appears that  
there's no link to block.db:

root@ld5505:~# ls -l /var/lib/ceph/osd/ceph-76/
total 52
-rw-r--r-- 1 ceph ceph 418 Aug 27 11:08 activate.monmap
lrwxrwxrwx 1 ceph ceph 93 Aug 27 11:08 block ->  
/dev/ceph-8cd045dc-9eb2-47ad-9668-116cf425a66a/osd-block-9c51bde1-3c75-4767-8808-f7e7b58b8f97

-rw-r--r-- 1 ceph ceph 2 Aug 27 11:08 bluefs
-rw-r--r-- 1 ceph ceph 37 Aug 27 11:08 ceph_fsid
-rw-r--r-- 1 ceph ceph 37 Aug 27 11:08 fsid
-rw--- 1 ceph ceph 56 Aug 27 11:08 keyring
-rw-r--r-- 1 ceph ceph 8 Aug 27 11:08 kv_backend
-rw-r--r-- 1 ceph ceph 21 Aug 27 11:08 magic
-rw-r--r-- 1 ceph ceph 4 Aug 27 11:08 mkfs_done
-rw-r--r-- 1 ceph ceph 41 Aug 27 11:08 osd_key
-rw-r--r-- 1 ceph ceph 6 Aug 27 11:08 ready
-rw-r--r-- 1 ceph ceph 3 Aug 27 11:08 require_osd_release
-rw-r--r-- 1 ceph ceph 10 Aug 27 11:08 type
-rw-r--r-- 1 ceph ceph 3 Aug 27 11:08 whoami

root@ld5505:~# more /var/lib/ceph/osd/ceph-76/bluefs
1

Questions:
How can I add DB device for every single existing OSD to this new SSD drive?
How can I increase the DB size later in case it's insufficient?

THX


Re: [ceph-users] Howto add DB (aka RockDB) device to existing OSD on HDD

2019-08-29 Thread Eugen Block

Hi,


Then I tried to move DB to a new device (SSD) that is not formatted:
root@ld5505:~# ceph-bluestore-tool bluefs-bdev-new-db –-path  
/var/lib/ceph/osd/ceph-76 --dev-target /dev/sdbk

too many positional options have been specified on the command line


I think you're trying the wrong option. 'man bluefs-bdev-new-db' says:

   bluefs-bdev-new-db --path osd path --dev-target new-device
  Adds DB device to BlueFS, fails if DB device already exists.

If you want to move an existing DB you should use bluefs-bdev-migrate  
instead. I haven't tried it yet, though.



How can I increase the DB size later in case it's insufficient?


There's also a bluefs-bdev-expand command to resize the db if the  
underlying device has more space available. It depends on your ceph  
version, of course. This was not possible in Luminous, I'm not sure  
about Mimic but it works in Nautilus.
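
For reference, a sketch of the expand step, assuming the OSD is stopped first and the underlying device or LV has already been enlarged:

systemctl stop ceph-osd@76
ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-76
systemctl start ceph-osd@76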


Regards,
Eugen


Quoting 74cmo...@gmail.com:


Hi,

I have created OSD on HDD w/o putting DB on faster drive.

In order to improve performance I have now a single SSD drive with 3.8TB.

I modified /etc/ceph/ceph.conf by adding this in [global]:
bluestore_block_db_size = 53687091200
This should create RockDB with size 50GB.

Then I tried to move DB to a new device (SSD) that is not formatted:
root@ld5505:~# ceph-bluestore-tool bluefs-bdev-new-db –-path  
/var/lib/ceph/osd/ceph-76 --dev-target /dev/sdbk

too many positional options have been specified on the command line

Checking the content of /var/lib/ceph/osd/ceph-76 it appears that  
there's no link to block.db:

root@ld5505:~# ls -l /var/lib/ceph/osd/ceph-76/
total 52
-rw-r--r-- 1 ceph ceph 418 Aug 27 11:08 activate.monmap
lrwxrwxrwx 1 ceph ceph 93 Aug 27 11:08 block ->  
/dev/ceph-8cd045dc-9eb2-47ad-9668-116cf425a66a/osd-block-9c51bde1-3c75-4767-8808-f7e7b58b8f97

-rw-r--r-- 1 ceph ceph 2 Aug 27 11:08 bluefs
-rw-r--r-- 1 ceph ceph 37 Aug 27 11:08 ceph_fsid
-rw-r--r-- 1 ceph ceph 37 Aug 27 11:08 fsid
-rw--- 1 ceph ceph 56 Aug 27 11:08 keyring
-rw-r--r-- 1 ceph ceph 8 Aug 27 11:08 kv_backend
-rw-r--r-- 1 ceph ceph 21 Aug 27 11:08 magic
-rw-r--r-- 1 ceph ceph 4 Aug 27 11:08 mkfs_done
-rw-r--r-- 1 ceph ceph 41 Aug 27 11:08 osd_key
-rw-r--r-- 1 ceph ceph 6 Aug 27 11:08 ready
-rw-r--r-- 1 ceph ceph 3 Aug 27 11:08 require_osd_release
-rw-r--r-- 1 ceph ceph 10 Aug 27 11:08 type
-rw-r--r-- 1 ceph ceph 3 Aug 27 11:08 whoami

root@ld5505:~# more /var/lib/ceph/osd/ceph-76/bluefs
1

Questions:
How can I add DB device for every single existing OSD to this new SSD drive?
How can I increase the DB size later in case it's insufficient?

THX


Re: [ceph-users] ceph device list empty

2019-08-15 Thread Eugen Block

Hi,

are the OSD nodes on Nautilus already? We upgraded from Luminous to  
Nautilus recently and the commands return valid output, except for  
those OSDs that haven't been upgraded yet.
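
You can check that quickly with (available since Luminous):

ceph versions
ceph osd versions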



Quoting Gary Molenkamp:


I've had no luck in tracing this down.  I've tried setting debugging and
log channels to try and find what is failing with no success.

With debug_mgr at 20/20, the logs will show:
        log_channel(audit) log [DBG] : from='client.10424012 -'
entity='client.admin' cmd=[{"prefix": "device ls", "target": ["mgr",
""]}]: dispatch
but I don't see anything further.

Interestingly, when using "ceph device ls-by-daemon" I see this in the logs:
0 log_channel(audit) log [DBG] : from='client.10345413 -'
entity='client.admin' cmd=[{"prefix": "device ls-by-daemon", "who":
"osd.0", "target": ["mgr", ""]}]: dispatch
-1 mgr.server reply reply (22) Invalid argument No handler found for
'device ls-by-daemon'


Gary.




On 2019-08-07 11:20 a.m., Gary Molenkamp wrote:

I'm testing an upgrade to Nautilus on a development cluster and the
command "ceph device ls" is returning an empty list.

# ceph device ls
DEVICE HOST:DEV DAEMONS LIFE EXPECTANCY
#

I have walked through the luminous upgrade documentation under
https://docs.ceph.com/docs/master/releases/nautilus/#upgrading-from-mimic-or-luminous
but I don't see anything pertaining to "activating" device support under
Nautilus.

The devices are visible to ceph-volume on the OSS nodes.  ie:

osdev-stor1 ~]# ceph-volume lvm list
== osd.0 ===
    [block]
/dev/ceph-f5eb16ec-7074-477b-8f83-ce87c5f74fa3/osd-block-c1de464f-d838-4558-ba75-1c268e538d6b

    block device
/dev/ceph-f5eb16ec-7074-477b-8f83-ce87c5f74fa3/osd-block-c1de464f-d838-4558-ba75-1c268e538d6b
    block uuid dlbIm6-H5za-001b-C3mQ-EGks-yoed-zoQpoo

    devices   /dev/sdb

== osd.2 ===
    [block]
/dev/ceph-37145a74-6b2b-4519-b72e-2defe11732aa/osd-block-e06c513b-5af3-4bf6-927f-1f0142c59e8a
    block device
/dev/ceph-37145a74-6b2b-4519-b72e-2defe11732aa/osd-block-e06c513b-5af3-4bf6-927f-1f0142c59e8a
    block uuid egdvpm-3bXx-xmNO-ACzp-nxax-Wka2-81rfNT

    devices   /dev/sdc

Is there a step I missed?
Thanks.

Gary.





--
Gary Molenkamp  Computer Science/Science Technology Services
Systems Administrator   University of Western Ontario
molen...@uwo.ca http://www.csd.uwo.ca
(519) 661-2111 x86882   (519) 661-3566



Re: [ceph-users] Nautilus dashboard: crushmap viewer shows only first root

2019-07-24 Thread Eugen Block

Thank you very much!


Quoting EDH - Manuel Rios Fernandez:


Hi Eugen,

Yes its solved, we reported in 14.2.1 and team fixed in 14.2.2

Regards,
Manuel

-Original Message-
From: ceph-users  On Behalf Of Eugen Block
Sent: Wednesday, July 24, 2019 15:10
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Nautilus dashboard: crushmap viewer shows only first
root

Hi all,

we just upgraded our cluster to:

ceph version 14.2.0-300-gacd2f2b9e1
(acd2f2b9e196222b0350b3b59af9981f91706c7f) nautilus (stable)

When clicking through the dashboard to see what's new we noticed that the
crushmap viewer only shows the first root of our crushmap (we have two
roots). I couldn't find anything in the tracker, and I can't update further
to the latest release 14.2.2 to see if that has been resolved. Is this known
or already fixed?

Regards
Eugen



[ceph-users] Nautilus dashboard: crushmap viewer shows only first root

2019-07-24 Thread Eugen Block

Hi all,

we just upgraded our cluster to:

ceph version 14.2.0-300-gacd2f2b9e1  
(acd2f2b9e196222b0350b3b59af9981f91706c7f) nautilus (stable)


When clicking through the dashboard to see what's new we noticed that  
the crushmap viewer only shows the first root of our crushmap (we have  
two roots). I couldn't find anything in the tracker, and I can't  
update further to the latest release 14.2.2 to see if that has been  
resolved. Is this known or already fixed?


Regards
Eugen



Re: [ceph-users] OSD replacement causes slow requests

2019-07-24 Thread Eugen Block

Hi Wido,

thanks for your response.


Have you tried to dump the historic slow ops on the OSDs involved to see
what is going on?
$ ceph daemon osd.X dump_historic_slow_ops


Good question, I don't recall doing that. Maybe my colleague did but  
he's on vacation right now. ;-)



But to be clear, are all the OSDs on Nautilus or is there a mix of L and
N OSDs?


I'll try to clarify: it was (and still is) a mixture of L and N OSDs,  
but all L-OSDs were empty at the time. The cluster already had  
rebalanced all PGs to the new OSDs. So the L-OSDs were not involved in  
this recovery process. We're currently upgrading the remaining servers  
to Nautilus, there's one left with L-OSDs, but those OSDs don't store  
any objects at the moment (different root in crushmap).


The recovery eventually finished successfully, but my colleague had to  
do it after business hours, maybe that's why he needs his vacation. ;-)


Regards,
Eugen


Quoting Wido den Hollander:


On 7/18/19 12:21 PM, Eugen Block wrote:

Hi list,

we're facing an unexpected recovery behavior of an upgraded cluster
(Luminous -> Nautilus).

We added new servers with Nautilus to the existing Luminous cluster, so
we could first replace the MONs step by step. Then we moved the old
servers to a new root in the crush map and then added the new OSDs to
the default root so we would need to rebalance the data only once. This
almost worked as planned, except for many slow and stuck requests. We
did this after business hours so the impact was negligable and we didn't
really investigate, the goal was to finish the rebalancing.

But only after two days one of the new OSDs (osd.30) already reported
errors, so we need to replace that disk.
The replacement disk (osd.0) has been added with an initial crush weight
of 0 (also reweight 0) to control the backfill with small steps.
This seems to be harder than it should (also than we experienced so
far), no matter how small the steps are, the cluster immediately reports
slow requests. We can't disrupt the production environment so we
cancelled the backfill/recovery for now. But this procedure has been
successful in the past with Luminous, that's why we're so surprised.

The recovery and backfill parameters are pretty low:

    "osd_max_backfills": "1",
    "osd_recovery_max_active": "3",

This usually allowed us a slow backfill to be able to continue
productive work, now it doesn't.

Our ceph version is (only the active MDS still runs Luminous, the
designated server is currently being upgraded):

14.2.0-300-gacd2f2b9e1 (acd2f2b9e196222b0350b3b59af9981f91706c7f)
nautilus (stable)

Is there anything we missed that we should be aware of in Nautilus
regarding recovery and replacement scenarios?
We couldn't reduce the weight of that osd lower than 0.16, everything
else results in slow requests.
During the weight reduction several PGs keep stuck in
activating+remapped state when, only recoverable (sometimes) by
restarting that affected osd several times. Reducing crush weight leads
to the same effect.

Please note: the old servers in root-ec are going to be ec-only OSDs,
that's why they're still in the cluster.

Any pointers to what goes wrong here would be highly appreciated! If you
need any other information I'd be happy to provide it.



Have you tried to dump the historic slow ops on the OSDs involved to see
what is going on?

$ ceph daemon osd.X dump_historic_slow_ops

But to be clear, are all the OSDs on Nautilus or is there a mix of L and
N OSDs?

Wido


Best regards,
Eugen


This is our osd tree:

ID  CLASS WEIGHT   TYPE NAME STATUS REWEIGHT PRI-AFF
-19   11.09143 root root-ec
 -2    5.54572 host ceph01
  1   hdd  0.92429 osd.1   down    0 1.0
  4   hdd  0.92429 osd.4 up    0 1.0
  6   hdd  0.92429 osd.6 up    0 1.0
 13   hdd  0.92429 osd.13    up    0 1.0
 16   hdd  0.92429 osd.16    up    0 1.0
 18   hdd  0.92429 osd.18    up    0 1.0
 -3    5.54572 host ceph02
  2   hdd  0.92429 osd.2 up    0 1.0
  5   hdd  0.92429 osd.5 up    0 1.0
  7   hdd  0.92429 osd.7 up    0 1.0
 12   hdd  0.92429 osd.12    up    0 1.0
 17   hdd  0.92429 osd.17    up    0 1.0
 19   hdd  0.92429 osd.19    up    0 1.0
 -5  0 host ceph03
 -1   38.32857 root default
-31   10.79997 host ceph04
 25   hdd  3.5 osd.25    up  1.0 1.0
 26   hdd  3.5 osd.26    up  1.0 1.0
 27   hdd  3.5 osd.27    up  1.0 1.0
-34   14.39995 host ceph05
  0   hdd  3.59998 osd.0 up    0 1.0
 28   hdd  3.5 osd.28    up  1.

[ceph-users] OSD replacement causes slow requests

2019-07-18 Thread Eugen Block

Hi list,

we're facing an unexpected recovery behavior of an upgraded cluster  
(Luminous -> Nautilus).


We added new servers with Nautilus to the existing Luminous cluster,  
so we could first replace the MONs step by step. Then we moved the old  
servers to a new root in the crush map and then added the new OSDs to  
the default root so we would need to rebalance the data only once.  
This almost worked as planned, except for many slow and stuck  
requests. We did this after business hours so the impact was  
negligible and we didn't really investigate; the goal was to finish
the rebalancing.


But only after two days one of the new OSDs (osd.30) already reported  
errors, so we need to replace that disk.
The replacement disk (osd.0) has been added with an initial crush  
weight of 0 (also reweight 0) to control the backfill with small steps.
This seems to be harder than it should (also than we experienced so  
far), no matter how small the steps are, the cluster immediately  
reports slow requests. We can't disrupt the production environment so  
we cancelled the backfill/recovery for now. But this procedure has  
been successful in the past with Luminous, that's why we're so  
surprised.


The recovery and backfill parameters are pretty low:

"osd_max_backfills": "1",
"osd_recovery_max_active": "3",

This usually allowed us a slow backfill to be able to continue  
productive work, now it doesn't.


Our ceph version is (only the active MDS still runs Luminous, the  
designated server is currently being upgraded):


14.2.0-300-gacd2f2b9e1 (acd2f2b9e196222b0350b3b59af9981f91706c7f)  
nautilus (stable)


Is there anything we missed that we should be aware of in Nautilus  
regarding recovery and replacement scenarios?
We couldn't reduce the weight of that OSD lower than 0.16; everything else results in slow requests.
During the weight reduction several PGs get stuck in activating+remapped state, which is only (sometimes) recoverable by restarting the affected OSD several times. Reducing the crush weight leads to the same effect.


Please note: the old servers in root-ec are going to be ec-only OSDs,  
that's why they're still in the cluster.


Any pointers to what goes wrong here would be highly appreciated! If  
you need any other information I'd be happy to provide it.


Best regards,
Eugen


This is our osd tree:

ID  CLASS WEIGHT   TYPE NAME STATUS REWEIGHT PRI-AFF
-19   11.09143 root root-ec
 -2    5.54572 host ceph01
  1   hdd  0.92429 osd.1   down        0 1.0
  4   hdd  0.92429 osd.4     up        0 1.0
  6   hdd  0.92429 osd.6     up        0 1.0
 13   hdd  0.92429 osd.13    up        0 1.0
 16   hdd  0.92429 osd.16    up        0 1.0
 18   hdd  0.92429 osd.18    up        0 1.0
 -3    5.54572 host ceph02
  2   hdd  0.92429 osd.2     up        0 1.0
  5   hdd  0.92429 osd.5     up        0 1.0
  7   hdd  0.92429 osd.7     up        0 1.0
 12   hdd  0.92429 osd.12    up        0 1.0
 17   hdd  0.92429 osd.17    up        0 1.0
 19   hdd  0.92429 osd.19    up        0 1.0
 -5         0 host ceph03
 -1   38.32857 root default
-31   10.79997 host ceph04
 25   hdd  3.5 osd.25    up  1.0 1.0
 26   hdd  3.5 osd.26    up  1.0 1.0
 27   hdd  3.5 osd.27    up  1.0 1.0
-34   14.39995 host ceph05
  0   hdd  3.59998 osd.0     up        0 1.0
 28   hdd  3.5 osd.28    up  1.0 1.0
 29   hdd  3.5 osd.29    up  1.0 1.0
 30   hdd  3.5 osd.30    up  0.15999   0
-37   10.79997 host ceph06
 31   hdd  3.5 osd.31    up  1.0 1.0
 32   hdd  3.5 osd.32    up  1.0 1.0
 33   hdd  3.5 osd.33    up  1.0 1.0



Re: [ceph-users] ceph-volume failed after replacing disk

2019-07-05 Thread Eugen Block

Hi,

did you also remove that OSD from crush and also from auth before  
recreating it?


ceph osd crush remove osd.71
ceph auth del osd.71
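
To verify that nothing is left over before re-running ceph-volume, something like this should show it (osd.71 as in your mail):

ceph osd tree | grep osd.71
ceph auth get osd.71
ceph osd ls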

Regards,
Eugen


Zitat von "ST Wong (ITSC)" :


Hi all,

We replaced a faulty disk out of N OSD and tried to follow steps  
according to "Replacing and OSD" in  
http://docs.ceph.com/docs/nautilus/rados/operations/add-or-rm-osds/,  
but got error:


# ceph osd destroy 71 --yes-i-really-mean-it
# ceph-volume lvm create --bluestore --data /dev/data/lv01 --osd-id  
71 --block.db /dev/db/lv01

Running command: /bin/ceph-authtool --gen-print-key
Running command: /bin/ceph --cluster ceph --name  
client.bootstrap-osd --keyring  
/var/lib/ceph/bootstrap-osd/ceph.keyring osd tree -f json

-->  RuntimeError: The osd ID 71 is already in use or does not exist.

ceph -s still shows N OSDs. I then removed it with "ceph osd rm 71". Now "ceph -s" shows N-1 OSDs and id 71 doesn't appear in "ceph osd ls".


However, repeating the ceph-volume command still gives the same error. We're running Ceph 14.2.1. I must have missed some steps. Would anyone please help? Thanks a lot.


Rgds,
/stwong






Re: [ceph-users] Cinder pool inaccessible after Nautilus upgrade

2019-07-02 Thread Eugen Block

Hi,

did you try to use rbd and rados commands with the cinder keyring, not  
the admin keyring? Did you check if the caps for that client are still  
valid (do the caps differ between the two cinder pools)?


Are the ceph versions on your hypervisors also nautilus?
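
A sketch of what I mean; the client name 'cinder' and the pool name are assumptions on my side, take them from your cinder.conf:

ceph auth get client.cinder
rbd ls -p <cinder-pool> --id cinder --keyring /etc/ceph/ceph.client.cinder.keyring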

Regards,
Eugen


Quoting Adrien Georget:


Hi all,

I'm facing a very strange issue after migrating my Luminous cluster  
to Nautilus.
I have 2 pools configured for Openstack cinder volumes with a multiple-backend setup: one "service" Ceph pool with cache tiering and one "R" Ceph pool.
After the upgrade, the "R" pool became inaccessible for Cinder and the cinder-volume service using this pool can't start anymore.
What is strange is that Openstack and Ceph report no error, Ceph  
cluster is healthy, all OSDs are UP & running and the "service" pool  
is still running well with the other cinder service on the same  
openstack host.
I followed exactly the upgrade procedure  
(https://ceph.com/releases/v14-2-0-nautilus-released/#upgrading-from-mimic-or-luminous), no problem during the upgrade but I can't understand why Cinder still fails with this  
pool.
I can access, list and create volumes on this pool with rbd or rados commands from the monitors, but on the Openstack hypervisor the rbd or rados ls command stays stuck, and rados ls gives this message (134.158.208.37 is an OSD node, 10.158.246.214 an Openstack hypervisor):

2019-07-02 11:26:15.999869 7f63484b4700  0 -- 10.158.246.214:0/1404677569 >> 134.158.208.37:6884/2457222 pipe(0x555c2bf96240 sd=7 :0 s=1 pgs=0 cs=0 l=1 c=0x555c2bf97500).fault



ceph version 14.2.1
Openstack Newton

I spent 2 days checking everything on Ceph side but I couldn't find  
anything problematic...

If you have any hints which can help me, I would appreciate :)

Adrien






Re: [ceph-users] PGs allocated to osd with weights 0

2019-07-02 Thread Eugen Block

Hi,

I can’t get data flushed out of osd with weights set to 0. Is there  
any way of checking the tasks queued for PG remapping ? Thank You.


can you give some more details about your cluster (replicated or EC pools, applied rules etc.)? My first guess would be that the other OSDs are either (nearly) full, so the PGs can't be recovered on the remaining servers, or your crush rules don't allow the redistribution of those PGs since your osd tree has changed.


The output of

ceph osd df tree

would help.
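
To see what is still queued to move off the zero-weighted OSDs, something like this should work on Luminous or newer:

ceph pg ls remapped
ceph pg ls backfill_wait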

Regards,
Eugen


Quoting Yanko Davila:


Hello

I can’t get data flushed out of osd with weights set to 0. Is there  
any way of checking the tasks queued for PG remapping ? Thank You.


Yanko.


Re: [ceph-users] MGR Logs after Failure Testing

2019-06-28 Thread Eugen Block
You may want to configure your standby MDSs as "standby-replay" so the MDS that takes over from a failed one needs less time for the takeover. To manage this you add something like this to your ceph.conf:


---snip---
[mds.server1]
mds_standby_replay = true
mds_standby_for_rank = 0

[mds.server2]
mds_standby_replay = true
mds_standby_for_rank = 0

[mds.server3]
mds_standby_replay = true
mds_standby_for_rank = 0
---snip---

For your setup this would mean you have one active mds, one as  
standby-replay (that takes over immediately, depending on the load a  
very short interruption could happen) and one as standby ("cold  
standby" if you will). Currently both your standby mds servers are  
"cold".



Quoting dhils...@performair.com:


Eugen;

All services are running, yes, though they didn't all start when I  
brought the host up (configured not to start because the last thing  
I had done is physically relocate the entire cluster).


All services are running, and happy.

# ceph status
  cluster:
id: 1a8a1693-fa54-4cb3-89d2-7951d4cee6a3
health: HEALTH_OK

  services:
mon: 3 daemons, quorum S700028,S700029,S700030 (age 20h)
mgr: S700028(active, since 17h), standbys: S700029, S700030
mds: cifs:1 {0=S700029=up:active} 2 up:standby
osd: 6 osds: 6 up (since 21h), 6 in (since 21h)

  data:
pools:   16 pools, 192 pgs
objects: 449 objects, 761 MiB
usage:   724 GiB used, 65 TiB / 66 TiB avail
pgs: 192 active+clean

# ceph osd tree
ID CLASS WEIGHT   TYPE NAMESTATUS REWEIGHT PRI-AFF
-1   66.17697 root default
-5   22.05899 host S700029
 2   hdd 11.02950 osd.2up  1.0 1.0
 3   hdd 11.02950 osd.3up  1.0 1.0
-7   22.05899 host S700030
 4   hdd 11.02950 osd.4up  1.0 1.0
 5   hdd 11.02950 osd.5up  1.0 1.0
-3   22.05899 host s700028
 0   hdd 11.02950 osd.0up  1.0 1.0
 1   hdd 11.02950 osd.1up  1.0 1.0

The question about configuring the MDS for failover struck me as a potential issue, since I don't remember doing that; however, it looks like S700029 (10.0.200.111) took over from S700028 (10.0.200.110) as the active MDS.


Thank you,

Dominic L. Hilsbos, MBA
Director - Information Technology
Perform Air International Inc.
dhils...@performair.com
www.PerformAir.com



-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On  
Behalf Of Eugen Block

Sent: Thursday, June 27, 2019 8:23 AM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] MGR Logs after Failure Testing

Hi,

some more information about the cluster status would be helpful, such as

ceph -s
ceph osd tree

service status of all MONs, MDSs, MGRs.
Are all services up? Did you configure the spare MDS as standby for
rank 0 so that a failover can happen?

Regards,
Eugen


Quoting dhils...@performair.com:


All;

I built a demonstration and testing cluster, just 3 hosts
(10.0.200.110, 111, 112).  Each host runs mon, mgr, osd, mds.

During the demonstration yesterday, I pulled the power on one of the hosts.

After bringing the host back up, I'm getting several error messages
every second or so:
2019-06-26 16:01:56.424 7fcbe0af9700  0 ms_deliver_dispatch:
unhandled message 0x55e80a728f00 mgrreport(mds.S700030 +0-0 packed
6) v7 from mds.? v2:10.0.200.112:6808/980053124
2019-06-26 16:01:56.425 7fcbf4cd1700  1 mgr finish mon failed to
return metadata for mds.S700030: (2) No such file or directory
2019-06-26 16:01:56.429 7fcbe0af9700  0 ms_deliver_dispatch:
unhandled message 0x55e809f8e600 mgrreport(mds.S700029 +110-0 packed
1366) v7 from mds.0 v2:10.0.200.111:6808/2726495738
2019-06-26 16:01:56.430 7fcbf4cd1700  1 mgr finish mon failed to
return metadata for mds.S700029: (2) No such file or directory

Thoughts?

Thank you,

Dominic L. Hilsbos, MBA
Director - Information Technology
Perform Air International Inc.
dhils...@performair.com
www.PerformAir.com




Re: [ceph-users] MGR Logs after Failure Testing

2019-06-27 Thread Eugen Block

Hi,

some more information about the cluster status would be helpful, such as

ceph -s
ceph osd tree

service status of all MONs, MDSs, MGRs.
Are all services up? Did you configure the spare MDS as standby for  
rank 0 so that a failover can happen?


Regards,
Eugen


Quoting dhils...@performair.com:


All;

I built a demonstration and testing cluster, just 3 hosts  
(10.0.200.110, 111, 112).  Each host runs mon, mgr, osd, mds.


During the demonstration yesterday, I pulled the power on one of the hosts.

After bringing the host back up, I'm getting several error messages  
every second or so:
2019-06-26 16:01:56.424 7fcbe0af9700  0 ms_deliver_dispatch:  
unhandled message 0x55e80a728f00 mgrreport(mds.S700030 +0-0 packed  
6) v7 from mds.? v2:10.0.200.112:6808/980053124
2019-06-26 16:01:56.425 7fcbf4cd1700  1 mgr finish mon failed to  
return metadata for mds.S700030: (2) No such file or directory
2019-06-26 16:01:56.429 7fcbe0af9700  0 ms_deliver_dispatch:  
unhandled message 0x55e809f8e600 mgrreport(mds.S700029 +110-0 packed  
1366) v7 from mds.0 v2:10.0.200.111:6808/2726495738
2019-06-26 16:01:56.430 7fcbf4cd1700  1 mgr finish mon failed to  
return metadata for mds.S700029: (2) No such file or directory


Thoughts?

Thank you,

Dominic L. Hilsbos, MBA
Director - Information Technology
Perform Air International Inc.
dhils...@performair.com
www.PerformAir.com




Re: [ceph-users] osd daemon cluster_fsid not reflecting actual cluster_fsid

2019-06-20 Thread Eugen Block

Hi,

I don't have an answer for you, but could you elaborate on what  
exactly you're are trying to do and what has worked so far? Which ceph  
version are you running? I understand that you want to clone your  
whole cluster, how exactly are you trying to do that? Is this the  
first OSD you're trying to start or are there already running cloned  
OSDs? If there are, what were the steps required to do that and what  
did you do differently with that failing OSD?
Have you tried to set debug level of that OSD to 10 or 20 and then  
search the logs for hints?
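
For example via the admin socket you already used for the status output below (revert it afterwards):

ceph daemon osd.0 config set debug_osd 10
# watch /var/log/ceph/ceph-osd.0.log while the OSD tries to boot, then:
ceph daemon osd.0 config set debug_osd 1/5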


Regards,
Eugen


Quoting Vincent Pharabot:


I think I found where the wrong fsid is located in the OSD's osdmap, but there seems to be no way to change the fsid...
I tried ceph-objectstore-tool --op set-osdmap with an osdmap taken from the monitor (ceph osd getmap), but no luck, still the old fsid (I couldn't find a way to set the current epoch on the osdmap).

Someone to give a hint ?

My goal is to be able to duplicate a ceph cluster (with data) to make some tests... I would like to avoid keeping the same fsid.

Thanks !

# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0/ --op
get-osdmap --file /tmp/osdmapfromosd3

# osdmaptool /tmp/osdmapfromosd3 --print
osdmaptool: osdmap file '/tmp/osdmapfromosd3'
epoch 24
fsid bb55e196-eedd-478d-99b6-1aad00b95f2a
created 2019-06-17 15:27:44.102409
modified 2019-06-17 15:53:37.279770
flags sortbitwise,recovery_deletes,purged_snapdirs
crush_version 9
full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.85
require_min_compat_client jewel
min_compat_client jewel
require_osd_release mimic

pool 1 'cephfs_data' replicated size 3 min_size 1 crush_rule 0 object_hash
rjenkins pg_num 100 pgp_num 100 last_change 17 flags hashpspool
stripe_width 0 application cephfs
pool 2 'cephfs_metadata' replicated size 3 min_size 1 crush_rule 0
object_hash rjenkins pg_num 100 pgp_num 100 last_change 17 flags hashpspool
stripe_width 0 application cep
hfs

max_osd 3
osd.0 up in weight 1 up_from 23 up_thru 23 down_at 20 last_clean_interval
[5,19) 10.8.12.170:6800/3613 10.8.12.170:6801/3613 10.8.12.170:6802/3613
10.8.12.170:6803/3613 e
xists,up 01dbf73f-3866-47be-b623-b9c539dcd955
osd.1 up in weight 1 up_from 9 up_thru 23 down_at 0 last_clean_interval
[0,0) 10.8.29.71:6800/4364 10.8.29.71:6801/4364 10.8.29.71:6802/4364
10.8.29.71:6803/4364 exists,u
p ef7c0a4f-5118-4d44-a82b-c9a2cf3c0813
osd.2 up in weight 1 up_from 13 up_thru 23 down_at 0 last_clean_interval
[0,0) 10.8.32.182:6800/4361 10.8.32.182:6801/4361 10.8.32.182:6802/4361
10.8.32.182:6803/4361 exi
sts,up 905d17fc-6f37-4404-bd5d-4adc231c49b3


On Tue, Jun 18, 2019 at 12:38, Vincent Pharabot wrote:


Thanks Eugen for answering

Yes, it came from another cluster; I'm trying to move all OSDs from one cluster to another (1 to 1), so I would like to avoid wiping the disks.
It's indeed a ceph-volume OSD; I checked the lvm labels and they're correct.

# lvs --noheadings --readonly --separator=";" -o lv_tags

ceph.block_device=/dev/ceph-4681dda6-628d-47db-8981-1762effccf77/osd-block-01dbf73f-3866-47be-b623-b9c539dcd955,ceph.block_uuid=uL57Kk-9kcO-DdOY-Glwm-cg9P-atmx-3m033v,ceph.cephx_lockbox_secret=,ceph.cluster_fsid=173b6382-504b-421f-aa4d-52526fa80dfa,ceph.cluster_name=ceph,ceph.crush_device_class=None,ceph.encrypted=0,ceph.osd_fsid=01dbf73f-3866-47be-b623-b9c539dcd955,ceph.osd_id=0,ceph.type=block,ceph.vdo=0

OSD bluestore labels are also correct

# ceph-bluestore-tool show-label --dev
/dev/ceph-4681dda6-628d-47db-8981-1762effccf77/osd-block-01dbf73f
-3866-47be-b623-b9c539dcd955
{
"/dev/ceph-4681dda6-628d-47db-8981-1762effccf77/osd-block-01dbf73f-3866-47be-b623-b9c539dcd955":
{
"osd_uuid": "01dbf73f-3866-47be-b623-b9c539dcd955",
"size": 1073737629696,
"btime": "2019-06-17 15:28:53.126482",
"description": "main",
"bluefs": "1",
"ceph_fsid": "173b6382-504b-421f-aa4d-52526fa80dfa",
"kv_backend": "rocksdb",
"magic": "ceph osd volume v026",
"mkfs_done": "yes",
"osd_key": "AQBXwwddy5OEAxAAS4AidvOF0kl+kxIBvFhT1A==",
"ready": "ready",
"whoami": "0"
}
}


Is there any way to change the wrong fsid on the OSD without zapping the disk?

Thank you




Le mar. 18 juin 2019 à 12:19, Eugen Block  a écrit :


Hi,

this OSD must have been part of a previous cluster, I assume.
I would remove it from crush if it's still there (check just to make
sure), wipe the disk, remove any traces like logical volumes (if it
was a ceph-volume lvm OSD) and if possible, reboot the node.

Regards,
Eugen


Zitat von Vincent Pharabot :

> Hello
>
> I have an OSD which is stuck in booting state.
> I find out that the daemon osd cluster_fsid is not the same that the
actual
> cluster fsid, which should explain why it does not join the cluster
>
> # ceph daemon osd.0 status
> {
> "cluster_fsid": "bb55e19

Re: [ceph-users] osd daemon cluster_fsid not reflecting actual cluster_fsid

2019-06-18 Thread Eugen Block

Hi,

this OSD must have been part of a previous cluster, I assume.
I would remove it from crush if it's still there (check just to make  
sure), wipe the disk, remove any traces like logical volumes (if it  
was a ceph-volume lvm OSD) and if possible, reboot the node.


Regards,
Eugen


Zitat von Vincent Pharabot :


Hello

I have an OSD which is stuck in booting state.
I found out that the OSD daemon's cluster_fsid is not the same as the actual
cluster fsid, which should explain why it does not join the cluster

# ceph daemon osd.0 status
{
"cluster_fsid": "bb55e196-eedd-478d-99b6-1aad00b95f2a",
"osd_fsid": "01dbf73f-3866-47be-b623-b9c539dcd955",
"whoami": 0,
"state": "booting",
"oldest_map": 1,
"newest_map": 24,
"num_pgs": 200
}

#ceph fsid
173b6382-504b-421f-aa4d-52526fa80dfa

I checked on the cluster fsid file and it's correct
# cat /var/lib/ceph/osd/ceph-0/ceph_fsid
173b6382-504b-421f-aa4d-52526fa80dfa

OSDMap shows correct fsid also

# ceph osd dump
epoch 33
fsid 173b6382-504b-421f-aa4d-52526fa80dfa
created 2019-06-17 16:42:52.632757
modified 2019-06-18 09:28:10.376573
flags noout,sortbitwise,recovery_deletes,purged_snapdirs
crush_version 13
full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.85
require_min_compat_client jewel
min_compat_client jewel
require_osd_release mimic
pool 1 'cephfs_data' replicated size 3 min_size 1 crush_rule 0 object_hash
rjenkins pg_num 100 pgp_num 100 last_change 17 flags hashpspool
stripe_width 0 application cephfs
pool 2 'cephfs_metadata' replicated size 3 min_size 1 crush_rule 0
object_hash rjenkins pg_num 100 pgp_num 100 last_change 17 flags hashpspool
stripe_width 0 application cephfs
max_osd 3
osd.0 down in weight 1 up_from 0 up_thru 0 down_at 0 last_clean_interval
[0,0) - - - - exists,new 01dbf73f-3866-47be-b623-b9c539dcd955
osd.1 down in weight 1 up_from 0 up_thru 0 down_at 0 last_clean_interval
[0,0) - - - - exists,new ef7c0a4f-5118-4d44-a82b-c9a2cf3c0813
osd.2 down in weight 1 up_from 13 up_thru 23 down_at 26 last_clean_interval
[0,0) 10.8.61.24:6800/4442 10.8.61.24:6801/4442 10.8.61.24:6802/4442
10.8.61.24:6803/4442 exists e40ef3ba-8f19-4b41-be9d-f95f679df0eb

So where does the daemon take the wrong cluster id from?
I might be missing something obvious again...

Someone able to help ?

Thank you !
Vincent




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Failed Disk simulation question

2019-05-22 Thread Eugen Block

Hi Alex,

The cluster has been idle at the moment being new and all.  I  
noticed some disk related errors in dmesg but that was about it.
It looked to me for the next 20 - 30 minutes the failure has not  
been detected.  All osds were up and in and health was OK. OSD logs  
had no smoking gun either.
After 30 minutes, I restarted the OSD container and it failed to  
start as expected.


if the cluster doesn't have to read or write to specific OSDs (or  
sectors on that OSD) the failure won't be detected immediately. We had  
an issue last year where one of the SSDs (used for rocksdb and wal)  
had a failure, but that was never reported. We only discovered it when we  
tried to migrate the LV to a new device and got read errors.


Later on, I performed the same operation during the fio bench mark  
and OSD failed immediately.


This confirms our experience, if there's data to read/write on that  
disk the failure will be detected.
Please note that this was in a Luminous cluster, I don't know if and  
how Nautilus has improved in sensing disk failures.
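
If you want to provoke reads on an otherwise idle cluster to surface such  
errors earlier, forcing a scrub is one option (osd/pg ids are placeholders):

ceph osd deep-scrub <osd-id>   # deep-scrubs all PGs on that OSD
ceph pg deep-scrub <pg-id>     # or just a single PG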


Regards,
Eugen


Zitat von Alex Litvak :


Hello cephers,

I know that there was a similar question posted 5 years ago.  However  
the answer was inconclusive for me.
I installed a new Nautilus 14.2.1 cluster and started pre-production  
testing.  I followed a RedHat document and simulated a soft disk  
failure by


#  echo 1 > /sys/block/sdc/device/delete

The cluster has been idle at the moment being new and all.  I  
noticed some disk related errors in dmesg but that was about it.
It looked to me for the next 20 - 30 minutes the failure has not  
been detected.  All osds were up and in and health was OK. OSD logs  
had no smoking gun either.
After 30 minutes, I restarted the OSD container and it failed to  
start as expected.


Later on, I performed the same operation during the fio bench mark  
and OSD failed immediately.


My question is:  Should the disk problem have been detected quickly  
enough even on the idle cluster? I thought Nautilus has the means to  
sense failure before intensive IO hits the disk.

Am I wrong to expect that?


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Nautilus, k+m erasure coding a profile vs size+min_size

2019-05-21 Thread Eugen Block

Hi,

this question comes up regularly and is been discussed just now:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-May/034867.html

Regards,
Eugen


Zitat von Yoann Moulin :


Dear all,

I am doing some tests with Nautilus and cephfs on erasure coding pool.

I noticed something strange between k+m in my erasure profile and  
size+min_size in the pool created:



test@icadmin004:~$ ceph osd erasure-code-profile get ecpool-4-2
crush-device-class=
crush-failure-domain=osd
crush-root=default
jerasure-per-chunk-alignment=false
k=4
m=2
plugin=jerasure
technique=reed_sol_van
w=8


test@icadmin004:~$ ceph --cluster test osd pool create cephfs_data  
8 8 erasure ecpool-4-2

pool 'cephfs_data' created



test@icadmin004:~$ ceph osd pool ls detail | grep cephfs_data
pool 14 'cephfs_data' erasure size 6 min_size 5 crush_rule 1  
object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode warn  
last_change 2646 flags hashpspool stripe_width 16384


Why min_size = 5 and not 4 ?

Best,

--
Yoann Moulin
EPFL IC-IT
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Is a not active mds doing something?

2019-05-21 Thread Eugen Block

Hi Marc,

have you configured the other MDS to be standby-replay for the active  
MDS? I have three MDS servers, one is active, the second is  
standby-replay and the third just standby. If the active fails, the  
second takes over within seconds. This is what I have in my ceph.conf:


[mds.]
mds_standby_replay = true
mds_standby_for_rank = 0
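
You can check which daemon is currently in which state with e.g.:

ceph fs status
ceph mds stat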


Regards,
Eugen


Zitat von Marc Roos :


Should a not active mds be doing something??? When I restarted the not
active mds.c, my client io on the fs_data pool disappeared.


  services:
mon: 3 daemons, quorum a,b,c
mgr: c(active), standbys: a, b
mds: cephfs-1/1/1 up  {0=a=up:active}, 1 up:standby
osd: 32 osds: 32 up, 32 in
rgw: 2 daemons active



-Original Message-
From: Marc Roos
Sent: dinsdag 21 mei 2019 10:01
To: ceph-users@lists.ceph.com; Marc Roos
Subject: RE: [ceph-users] cephfs causing high load on vm, taking down 15
min later another cephfs vm


I have evicted all client connections and still have high load on the OSDs

And ceph osd pool stats still shows client activity?

pool fs_data id 20
  client io 565KiB/s rd, 120op/s rd, 0op/s wr




-Original Message-
From: Marc Roos
Sent: dinsdag 21 mei 2019 9:51
To: ceph-users@lists.ceph.com; Marc Roos
Subject: RE: [ceph-users] cephfs causing high load on vm, taking down 15
min later another cephfs vm


I have got this today again? I cannot unmount the filesystem and it
looks like some OSDs are having 100% CPU utilization?


-Original Message-
From: Marc Roos
Sent: maandag 20 mei 2019 12:42
To: ceph-users
Subject: [ceph-users] cephfs causing high load on vm, taking down 15 min
later another cephfs vm



I got my first problem with cephfs in a production environment. Is it
possible to deduce from these logfiles what happened?

svr1 is connected to ceph client network via switch
svr2 vm is collocated on c01 node.
c01 has osd's and the mon.a colocated.

svr1 was the first to report errors at 03:38:44. I have no error
messages about a network connection problem reported by any of the ceph
nodes. I have nothing in dmesg on c01.

[@c01 ~]# cat /etc/redhat-release
CentOS Linux release 7.6.1810 (Core)
[@c01 ~]# uname -a
Linux c01 3.10.0-957.10.1.el7.x86_64 #1 SMP Mon Mar 18 15:06:45 UTC 2019

x86_64 x86_64 x86_64 GNU/Linux
[@c01 ~]# ceph versions
{
"mon": {
"ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777)

luminous (stable)": 3
},
"mgr": {
"ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777)

luminous (stable)": 3
},
"osd": {
"ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777)

luminous (stable)": 32
},
"mds": {
"ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777)

luminous (stable)": 2
},
"rgw": {
"ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777)

luminous (stable)": 2
},
"overall": {
"ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777)

luminous (stable)": 42
}
}




[0] svr1 messages
May 20 03:36:01 svr1 systemd: Started Session 308978 of user root.
May 20 03:36:01 svr1 systemd: Started Session 308979 of user root.
May 20 03:36:01 svr1 systemd: Started Session 308979 of user root.
May 20 03:36:01 svr1 systemd: Started Session 308980 of user root.
May 20 03:36:01 svr1 systemd: Started Session 308980 of user root.
May 20 03:38:01 svr1 systemd: Started Session 308981 of user root.
May 20 03:38:01 svr1 systemd: Started Session 308981 of user root.
May 20 03:38:01 svr1 systemd: Started Session 308982 of user root.
May 20 03:38:01 svr1 systemd: Started Session 308982 of user root.
May 20 03:38:01 svr1 systemd: Started Session 308983 of user root.
May 20 03:38:01 svr1 systemd: Started Session 308983 of user root.
May 20 03:38:44 svr1 kernel: libceph: osd0 192.168.x.111:6814 io error
May 20 03:38:44 svr1 kernel: libceph: osd0 192.168.x.111:6814 io error
May 20 03:38:45 svr1 kernel: last message repeated 5 times
May 20 03:38:45 svr1 kernel: libceph: mon0 192.168.x.111:6789 io error
May 20 03:38:45 svr1 kernel: libceph: mon0 192.168.x.111:6789 session
lost, hunting for new mon
May 20 03:38:45 svr1 kernel: last message repeated 5 times
May 20 03:38:45 svr1 kernel: libceph: mon0 192.168.x.111:6789 io error
May 20 03:38:45 svr1 kernel: libceph: mon0 192.168.x.111:6789 session
lost, hunting for new mon
May 20 03:38:45 svr1 kernel: libceph: mon1 192.168.x.112:6789 session
established
May 20 03:38:45 svr1 kernel: libceph: mon1 192.168.x.112:6789 session
established
May 20 03:38:45 svr1 kernel: libceph: osd0 192.168.x.111:6814 io error
May 20 03:38:45 svr1 kernel: libceph: osd0 192.168.x.111:6814 io error
May 20 03:38:45 svr1 kernel: libceph: mon1 192.168.x.112:6789 io error
May 20 03:38:45 svr1 kernel: libceph: mon1 192.168.x.112:6789 session
lost, hunting for new mon
May 20 03:38:45 svr1 kernel: libceph: mon1 192.168.x.112:6789 io error
May 20 03:38:45 svr1 kernel: libceph: mon1 192.168.x.112:6789 session
lost, 

Re: [ceph-users] inconsistent number of pools

2019-05-20 Thread Eugen Block

Hi, have you tried 'ceph health detail'?
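
That should name the affected pool. Comparing the per-pool object counts also  
helps, something like:

ceph health detail
ceph df                 # objects per pool
ceph osd pool ls detail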


Zitat von Lars Täuber :


Hi everybody,

with the status report I get a HEALTH_WARN I don't know how to get rid of.
It may be connected to recently removed pools.

# ceph -s
  cluster:
id: 6cba13d1-b814-489c-9aac-9c04aaf78720
health: HEALTH_WARN
1 pools have many more objects per pg than average

  services:
mon: 3 daemons, quorum mon1,mon2,mon3 (age 4h)
mgr: mon1(active, since 4h), standbys: cephsible, mon2, mon3
mds: cephfs_1:1 {0=mds3=up:active} 2 up:standby
osd: 30 osds: 30 up (since 2h), 30 in (since 7w)

  data:
pools:   5 pools, 1029 pgs
objects: 315.51k objects, 728 GiB
usage:   4.6 TiB used, 163 TiB / 167 TiB avail
pgs: 1029 active+clean


!!! but:
# ceph osd lspools | wc -l
3

The status says there are 5 pools but the listing says there are only 3.
How do I find out which pool is the reason for the health warning?

Thanks
Lars
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Is it possible to get list of all the PGs assigned to an OSD?

2019-04-29 Thread Eugen Block

Sure there is:

ceph pg ls-by-osd <osd-id>

Regards,
Eugen

Zitat von Igor Podlesny :


Or is there no direct way to accomplish that?
What workarounds can be used then?

--
End of message. Next message?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: HW failure cause client IO drops

2019-04-16 Thread Eugen Block

Good morning,

the OSDs are usually marked out after 10 minutes; that's when  
rebalancing starts. But the I/O should not drop during that time; this  
could be related to your pool configuration. If you have a replicated  
pool of size 3 and also set min_size to 3 the I/O would pause if a  
node or OSD fails. So more information about the cluster would help,  
can you share that?


ceph osd tree
ceph osd pool ls detail

Were all pools affected or just specific pools?
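
By the way, the 10 minutes mentioned above come from mon_osd_down_out_interval  
(default 600 seconds); you can verify it on a MON with e.g.:

ceph daemon mon.<id> config show | grep mon_osd_down_out_interval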

Regards,
Eugen


Zitat von M Ranga Swami Reddy :


Hello - Recently we had an issue with a storage node's battery failure,
which caused ceph client IO to drop to '0' bytes. This means the ceph cluster
couldn't perform IO operations until the node was taken out. This is
not expected from Ceph; when some HW fails, the respective OSDs should
be marked out/down and IO should continue as is.

Please let me know if anyone has seen similar behavior and whether this
issue has been resolved.

Thanks
Swami




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] showing active config settings

2019-04-10 Thread Eugen Block

I always end up using "ceph --admin-daemon
/var/run/ceph/name-of-socket-here.asok config show | grep ..." to get what
is in effect now for a certain daemon.
Needs you to be on the host of the daemon of course.


Me too, I just wanted to try what OP reported. And after trying that,  
I'll keep it that way. ;-)



Zitat von Janne Johansson :


Den ons 10 apr. 2019 kl 13:37 skrev Eugen Block :


> If you don't specify which daemon to talk to, it tells you what the
> defaults would be for a random daemon started just now using the same
> config as you have in /etc/ceph/ceph.conf.

I tried that, too, but the result is not correct:

host1:~ # ceph -n osd.1 --show-config | grep osd_recovery_max_active
osd_recovery_max_active = 3



I always end up using "ceph --admin-daemon
/var/run/ceph/name-of-socket-here.asok config show | grep ..." to get what
is in effect now for a certain daemon.
Needs you to be on the host of the daemon of course.

--
May the most significant bit of your life be positive.




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] showing active config settings

2019-04-10 Thread Eugen Block

If you don't specify which daemon to talk to, it tells you what the
defaults would be for a random daemon started just now using the same
config as you have in /etc/ceph/ceph.conf.


I tried that, too, but the result is not correct:

host1:~ # ceph -n osd.1 --show-config | grep osd_recovery_max_active
osd_recovery_max_active = 3


Zitat von Janne Johansson :


Den ons 10 apr. 2019 kl 13:31 skrev Eugen Block :



While --show-config still shows

host1:~ # ceph --show-config | grep osd_recovery_max_active
osd_recovery_max_active = 3


It seems as if --show-config is not really up-to-date anymore?
Although I can execute it, the option doesn't appear in the help page
of a Mimic and Luminous cluster. So maybe this is deprecated.



If you don't specify which daemon to talk to, it tells you what the
defaults would be for a random daemon started just now using the same
config as you have in /etc/ceph/ceph.conf.


--
May the most significant bit of your life be positive.




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] showing active config settings

2019-04-10 Thread Eugen Block

Hi,

I haven't used the --show-config option until now, but if you ask your  
OSD daemon directly, your change should have been applied:



host1:~ # ceph tell 'osd.*' injectargs '--osd-recovery-max-active 4'

host1:~ # ceph daemon osd.1 config show | grep osd_recovery_max_active
"osd_recovery_max_active": "4",

While --show-config still shows

host1:~ # ceph --show-config | grep osd_recovery_max_active
osd_recovery_max_active = 3


It seems as if --show-config is not really up-to-date anymore?  
Although I can execute it, the option doesn't appear in the help page  
of a Mimic and Luminous cluster. So maybe this is deprecated.


Regards,
Eugen


Zitat von solarflow99 :


I noticed when changing some settings, they appear to stay the same, for
example when trying to set this higher:

ceph tell 'osd.*' injectargs '--osd-recovery-max-active 4'

It gives the usual warning that a restart may be needed, but it still has the
old value:

# ceph --show-config | grep osd_recovery_max_active
osd_recovery_max_active = 3


restarting the OSDs seems fairly intrusive for every configuration change.




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd: error processing image xxx (2) No such file or directory

2019-04-02 Thread Eugen Block

Sorry -- you need the "<image-spec>" as part of that command.


My bad, I only read this from the help page, ignoring the <image-spec>  
(and forgot the pool name):


  -a [ --all ] list snapshots from all namespaces

I figured this would list all existing snapshots, similar to the "rbd  
-p <pool> ls --long" command. Thanks for the clarification.
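
For the record, with an image spec it behaves as expected, e.g. (pool and  
image name are made up):

rbd snap ls rbd/vm-disk-1 --all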


Eugen


Zitat von Jason Dillaman :


On Tue, Apr 2, 2019 at 8:42 AM Eugen Block  wrote:


Hi,

> If you run "rbd snap ls --all", you should see a snapshot in
> the "trash" namespace.

I just tried the command "rbd snap ls --all" on a lab cluster
(nautilus) and get this error:

ceph-2:~ # rbd snap ls --all
rbd: image name was not specified


Sorry -- you need the "" as part of that command.


Are there any requirements I haven't noticed? This lab cluster was
upgraded from Mimic a couple of weeks ago.

ceph-2:~ # ceph version
ceph version 14.1.0-559-gf1a72cff25
(f1a72cff2522833d16ff057ed43eeaddfc17ea8a) nautilus (dev)

Regards,
Eugen


Zitat von Jason Dillaman :

> On Tue, Apr 2, 2019 at 4:19 AM Nikola Ciprich
>  wrote:
>>
>> Hi,
>>
>> on one of my clusters, I'm getting error message which is getting
>> me a bit nervous.. while listing contents of a pool I'm getting
>> error for one of images:
>>
>> [root@node1 ~]# rbd ls -l nvme > /dev/null
>> rbd: error processing image  xxx: (2) No such file or directory
>>
>> [root@node1 ~]# rbd info nvme/xxx
>> rbd image 'xxx':
>> size 60 GiB in 15360 objects
>> order 22 (4 MiB objects)
>> id: 132773d6deb56
>> block_name_prefix: rbd_data.132773d6deb56
>> format: 2
>> features: layering, operations
>> op_features: snap-trash
>> flags:
>> create_timestamp: Wed Aug 29 12:25:13 2018
>>
>> volume contains production data and seems to be working  
correctly (it's used

>> by VM)
>>
>> is this something to worry about? What is snap-trash feature?
>> wasn't able to google
>> much about it..
>
> This implies that you are (or were) using transparent image clones and
> that you deleted a snapshot that had one or more child images attached
> to it. If you run "rbd snap ls --all", you should see a snapshot in
> the "trash" namespace. You can also list its child images by running
> "rbd children --snap-id  ".
>
> There definitely is an issue w/ the "rbd ls --long" command in that
> when it attempts to list all snapshots in the image, it is incorrectly
> using the snapshot's name instead of it's ID. I've opened a tracker
> ticket to get the bug fixed [1]. It was fixed in Nautilus but it
> wasn't flagged for backport to Mimic.
>
>> I'm running ceph 13.2.4 on centos 7.
>>
>> I'd be gratefull any help
>>
>> BR
>>
>> nik
>>
>>
>> --
>> -
>> Ing. Nikola CIPRICH
>> LinuxBox.cz, s.r.o.
>> 28.rijna 168, 709 00 Ostrava
>>
>> tel.:   +420 591 166 214
>> fax:+420 596 621 273
>> mobil:  +420 777 093 799
>> www.linuxbox.cz
>>
>> mobil servis: +420 737 238 656
>> email servis: ser...@linuxbox.cz
>> -
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> [1] http://tracker.ceph.com/issues/39081
>
> --
> Jason
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




--
Jason




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd: error processing image xxx (2) No such file or directory

2019-04-02 Thread Eugen Block

Hi,


If you run "rbd snap ls --all", you should see a snapshot in
the "trash" namespace.


I just tried the command "rbd snap ls --all" on a lab cluster  
(nautilus) and get this error:


ceph-2:~ # rbd snap ls --all
rbd: image name was not specified

Are there any requirements I haven't noticed? This lab cluster was  
upgraded from Mimic a couple of weeks ago.


ceph-2:~ # ceph version
ceph version 14.1.0-559-gf1a72cff25  
(f1a72cff2522833d16ff057ed43eeaddfc17ea8a) nautilus (dev)


Regards,
Eugen


Zitat von Jason Dillaman :


On Tue, Apr 2, 2019 at 4:19 AM Nikola Ciprich
 wrote:


Hi,

on one of my clusters, I'm getting an error message which is making
me a bit nervous... while listing the contents of a pool I'm getting
an error for one of the images:

[root@node1 ~]# rbd ls -l nvme > /dev/null
rbd: error processing image  xxx: (2) No such file or directory

[root@node1 ~]# rbd info nvme/xxx
rbd image 'xxx':
size 60 GiB in 15360 objects
order 22 (4 MiB objects)
id: 132773d6deb56
block_name_prefix: rbd_data.132773d6deb56
format: 2
features: layering, operations
op_features: snap-trash
flags:
create_timestamp: Wed Aug 29 12:25:13 2018

volume contains production data and seems to be working correctly (it's used
by VM)

is this something to worry about? What is the snap-trash feature? I
wasn't able to google much about it...


This implies that you are (or were) using transparent image clones and
that you deleted a snapshot that had one or more child images attached
to it. If you run "rbd snap ls --all", you should see a snapshot in
the "trash" namespace. You can also list its child images by running
"rbd children --snap-id <snap-id> <pool>/<image>".

There definitely is an issue w/ the "rbd ls --long" command in that
when it attempts to list all snapshots in the image, it is incorrectly
using the snapshot's name instead of its ID. I've opened a tracker
ticket to get the bug fixed [1]. It was fixed in Nautilus but it
wasn't flagged for backport to Mimic.


I'm running ceph 13.2.4 on centos 7.

I'd be grateful for any help

BR

nik


--
-
Ing. Nikola CIPRICH
LinuxBox.cz, s.r.o.
28.rijna 168, 709 00 Ostrava

tel.:   +420 591 166 214
fax:+420 596 621 273
mobil:  +420 777 093 799
www.linuxbox.cz

mobil servis: +420 737 238 656
email servis: ser...@linuxbox.cz
-
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



[1] http://tracker.ceph.com/issues/39081

--
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-volume lvm batch OSD replacement

2019-03-21 Thread Eugen Block

Hi Dan,

I don't know about keeping the osd-id but I just partially recreated  
your scenario. I wiped one OSD and recreated it. You are trying to  
re-use the existing block.db-LV with the device path (--block.db  
/dev/vg-name/lv-name) instead of the LV notation (--block.db  
vg-name/lv-name):



# ceph-volume lvm create --data /dev/sdq --block.db
/dev/ceph-094c06db-98dc-47f6-a7e5-1092b099b372/osd-block-db-fa0e7927-dc3e-44d0-a8ce-1d8202fa75dd
--osd-id 240


This fails in my test, too. But if I use the LV notation it works:

ceph-2:~ # ceph-volume lvm create --data /dev/sda --block.db  
ceph-journals/journal-osd3

[...]
Running command: /bin/systemctl enable --runtime ceph-osd@3
Running command: /bin/systemctl start ceph-osd@3
--> ceph-volume lvm activate successful for osd ID: 3
--> ceph-volume lvm create successful for: /dev/sda

This is a Nautilus test cluster, but I remember having this on a  
Luminous cluster, too. I hope this helps.
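
In short, the only change to the failing command quoted above should be to  
drop the leading /dev/ from the --block.db argument, i.e. something like  
(untested with --osd-id on 12.2.11, so treat it as a sketch):

ceph-volume lvm create --data /dev/sdq --block.db ceph-094c06db-98dc-47f6-a7e5-1092b099b372/osd-block-db-fa0e7927-dc3e-44d0-a8ce-1d8202fa75dd --osd-id 240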


Regards,
Eugen


Zitat von Dan van der Ster :


On Tue, Mar 19, 2019 at 12:25 PM Dan van der Ster  wrote:


On Tue, Mar 19, 2019 at 12:17 PM Alfredo Deza  wrote:
>
> On Tue, Mar 19, 2019 at 7:00 AM Alfredo Deza  wrote:
> >
> > On Tue, Mar 19, 2019 at 6:47 AM Dan van der Ster  
 wrote:

> > >
> > > Hi all,
> > >
> > > We've just hit our first OSD replacement on a host created with
> > > `ceph-volume lvm batch` with mixed hdds+ssds.
> > >
> > > The hdd /dev/sdq was prepared like this:
> > ># ceph-volume lvm batch /dev/sd[m-r] /dev/sdac --yes
> > >
> > > Then /dev/sdq failed and was then zapped like this:
> > >   # ceph-volume lvm zap /dev/sdq --destroy
> > >
> > > The zap removed the pv/vg/lv from sdq, but left behind the db on
> > > /dev/sdac (see P.S.)
> >
> > That is correct behavior for the zap command used.
> >
> > >
> > > Now we're replaced /dev/sdq and we're wondering how to proceed. We see
> > > two options:
> > >   1. reuse the existing db lv from osd.240 (Though the osd fsid will
> > > change when we re-create, right?)
> >
> > This is possible but you are right that in the current state, the FSID
> > and other cluster data exist in the LV metadata. To reuse this LV for
> > a new (replaced) OSD
> > then you would need to zap the LV *without* the --destroy flag, which
> > would clear all metadata on the LV and do a wipefs. The command would
> > need the full path to
> > the LV associated with osd.240, something like:
> >
> > ceph-volume lvm zap /dev/ceph-osd-lvs/db-lv-240
> >
> > >   2. remove the db lv from sdac then run
> > > # ceph-volume lvm batch /dev/sdq /dev/sdac
> > >  which should do the correct thing.
> >
> > This would also work if the db lv is fully removed with --destroy
> >
> > >
> > > This is all v12.2.11 btw.
> > > If (2) is the prefered approached, then it looks like a bug that the
> > > db lv was not destroyed by lvm zap --destroy.
> >
> > Since /dev/sdq was passed in to zap, just that one device was removed,
> > so this is working as expected.
> >
> > Alternatively, zap has the ability to destroy or zap LVs associated
> > with an OSD ID. I think this is not released yet for Luminous but
> > should be in the next release (which seems to be what you want)
>
> Seems like 12.2.11 was released with the ability to zap by OSD ID. You
> can also zap by OSD FSID, both way will zap (and optionally destroy if
> using --destroy)
> all LVs associated with the OSD.
>
> Full examples on this can be found here:
>
> http://docs.ceph.com/docs/luminous/ceph-volume/lvm/zap/#removing-devices
>
>

Ohh that's an improvement! (Our goal is outsourcing the failure
handling to non-ceph experts, so this will help simplify things.)

In our example, the operator needs to know the osd id, then can do:

1. ceph-volume lvm zap --destroy --osd-id 240 (wipes sdq and removes
the lvm from sdac for osd.240)
2. replace the hdd
3. ceph-volume lvm batch /dev/sdq /dev/sdac --osd-ids 240

But I just remembered that the --osd-ids flag hasn't been backported
to luminous, so we can't yet do that. I guess we'll follow the first
(1) procedure to re-use the existing db lv.


Hmm... re-using the db lv didn't work.

We zapped it (see https://pastebin.com/N6PwpbYu) then got this error
when trying to create:

# ceph-volume lvm create --data /dev/sdq --block.db
/dev/ceph-094c06db-98dc-47f6-a7e5-1092b099b372/osd-block-db-fa0e7927-dc3e-44d0-a8ce-1d8202fa75dd
--osd-id 240
Running command: /bin/ceph-authtool --gen-print-key
Running command: /bin/ceph --cluster ceph --name client.bootstrap-osd
--keyring /var/lib/ceph/bootstrap-osd/ceph.keyring osd tree -f json
Running command: /bin/ceph --cluster ceph --name client.bootstrap-osd
--keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new
9f63b457-37e0-4e33-971e-c0fc24658b65 240
Running command: vgcreate --force --yes
ceph-8ef05e54-8909-49f8-951d-0f9d37aeba45 /dev/sdq
 stdout: Physical volume "/dev/sdq" successfully created.
 stdout: Volume group "ceph-8ef05e54-8909-49f8-951d-0f9d37aeba45"
successfully created
Running command: 

Re: [ceph-users] cluster is not stable

2019-03-12 Thread Eugen Block

Hi,

my first guess would be a network issue. Double-check your connections  
and make sure the network setup works as expected. Check syslogs,  
dmesg, switches etc. for hints that a network interruption may have  
occurred.
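
A few quick checks that usually help to narrow this down (hostnames are  
placeholders):

ceph health detail
ceph osd dump | grep flags     # any noout/nodown flags set?
ping -c 100 <other-osd-host>   # packet loss between OSD hosts?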


Regards,
Eugen


Zitat von Zhenshi Zhou :


Hi,

I deployed a ceph cluster with good performance. But the logs
indicate that the cluster is not as stable as I think it should be.

The log shows the monitors marking some OSDs as down periodically:
[image: image.png]

I didn't find any useful information in osd logs.

ceph version 13.2.4 mimic (stable)
OS version CentOS 7.6.1810
kernel version 5.0.0-2.el7

Thanks.




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD OSD crashing after upgrade to 12.2.10

2019-03-11 Thread Eugen Block

Hi all,

we had some assistance with our SSD crash issue outside of this  
mailing list - which is not resolved yet  
(http://tracker.ceph.com/issues/38395) - but there's one thing I'd  
like to ask the list.


I noticed that a lot of the OSD crashes show a correlation to MON  
elections. For the last 18 OSD failures I count 7 MON elections  
happening right before the OSD failures are reported. But if I take  
into account that there's a grace period of 20 seconds, it seems as if  
some OSD failures could trigger a MON election. Is that even possible?


The logs look like this:

---cut here---
2019-03-02 21:43:17.599452 mon.monitor02 mon.1 :6789/0 977222  
: cluster [INF] mon.monitor02 calling monitor election
2019-03-02 21:43:17.758506 mon.monitor01 mon.0 :6789/0  
1079594 : cluster [INF] mon.monitor01 calling monitor election
2019-03-02 21:43:22.938084 mon.monitor01 mon.0 :6789/0  
1079595 : cluster [INF] mon.monitor01 is new leader, mons  
monitor01,monitor02 in quorum (ranks 0,1)
2019-03-02 21:43:23.106667 mon.monitor01 mon.0 :6789/0  
1079600 : cluster [WRN] Health check failed: 1/3 mons down, quorum  
monitor01,monitor02 (MON_DOWN)
2019-03-02 21:43:23.180382 mon.monitor01 mon.0 :6789/0  
1079601 : cluster [WRN] overall HEALTH_WARN 1/3 mons down, quorum  
monitor01,monitor02
2019-03-02 21:43:27.454252 mon.monitor01 mon.0 :6789/0  
1079610 : cluster [INF] osd.20 failed (root=default,host=monitor03) (2  
reporters from different host after 20.000136 >= grace 20.00)

[...]
2019-03-04 10:06:35.743561 mon.monitor01 mon.0 :6789/0  
1164043 : cluster [INF] mon.monitor01 calling monitor election
2019-03-04 10:06:35.752565 mon.monitor02 mon.1 :6789/0  
1054674 : cluster [INF] mon.monitor02 calling monitor election
2019-03-04 10:06:35.835435 mon.monitor01 mon.0 :6789/0  
1164044 : cluster [INF] mon.monitor01 is new leader, mons  
monitor01,monitor02,monitor03 in quorum (ranks 0,1,2)
2019-03-04 10:06:35.701759 mon.monitor03 mon.2 :6789/0 287652  
: cluster [INF] mon.monitor03 calling monitor election
2019-03-04 10:06:35.954407 mon.monitor01 mon.0 :6789/0  
1164049 : cluster [INF] overall HEALTH_OK
2019-03-04 10:06:45.299686 mon.monitor01 mon.0 :6789/0  
1164057 : cluster [INF] osd.20 failed (root=default,host=monitor03) (2  
reporters from different host after 20.068848 >= grace 20.00)

[...]
---cut here---

These MON elections only happened when an OSD failure occurred, no  
elections without OSD failures. Does this make sense to anybody? Any  
insights would be greatly appreciated.


Regards,
Eugen


Zitat von Igor Fedotov :


Hi Eugen,

looks like this isn't [1] but rather

https://tracker.ceph.com/issues/38049

and

https://tracker.ceph.com/issues/36541 (=  
https://tracker.ceph.com/issues/36638 for luminous)


Hence it's not fixed in 12.2.10, target release is 12.2.11


Also please note the patch allows to avoid new occurrences for the  
issue. But there is some chance that inconsistencies caused by it  
earlier are still present in the DB. And the assertion might still happen  
(hopefully with less frequency).


So could you please run fsck for OSDs that were broken once and  
share the results?


Then we can decide if it makes sense to proceed with the repair.
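
For reference, an offline fsck can be run roughly like this (the osd id and  
path are placeholders, and the OSD has to be stopped first):

systemctl stop ceph-osd@10
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-10
# there is also a --deep option for a more thorough (and much slower) check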


Thanks,

Igor

On 2/7/2019 3:37 PM, Eugen Block wrote:

Hi list,

I found this thread [1] about crashing SSD OSDs, although that was  
about an upgrade to 12.2.7, we just hit (probably) the same issue  
after our update to 12.2.10 two days ago in a production cluster.

Just half an hour ago I saw one OSD (SSD) crashing (for the first time):

2019-02-07 13:02:07.682178 mon.host1 mon.0 :6789/0 109754 :  
cluster [INF] osd.10 failed (root=default,host=host1) (connection  
refused reported by osd.20)
2019-02-07 13:02:08.623828 mon.host1 mon.0 :6789/0 109771 :  
cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)


One minute later, the OSD was back online.
This is the stack trace reported in syslog:

---cut here---
2019-02-07T13:01:51.181027+01:00 host1 ceph-osd[1136505]: ***  
Caught signal (Aborted) **
2019-02-07T13:01:51.181232+01:00 host1 ceph-osd[1136505]:  in  
thread 7f75ce646700 thread_name:bstore_kv_final
2019-02-07T13:01:51.185873+01:00 host1 ceph-osd[1136505]:  ceph  
version 12.2.10-544-gb10c702661  
(b10c702661a31c8563b3421d6d664de93a0cb0e2) luminous (stable)
2019-02-07T13:01:51.186077+01:00 host1 ceph-osd[1136505]:  1:  
(()+0xa587d9) [0x560b921cc7d9]
2019-02-07T13:01:51.186226+01:00 host1 ceph-osd[1136505]:  2:  
(()+0x10b10) [0x7f75d8386b10]
2019-02-07T13:01:51.186368+01:00 host1 ceph-osd[1136505]:  3:  
(gsignal()+0x37) [0x7f75d73508d7]
2019-02-07T13:01:51.186773+01:00 host1 ceph-osd[1136505]:  4:  
(abort()+0x13a) [0x7f75d7351caa]
2019-02-07T13:01:51.186906+01:00 host1 ceph-osd[1136505]:  5:  
(ceph::__ceph_assert_fail(char const*, char const*, int, char  
const*)+0x280) [0x560b922096d0]
2019-02-07T13:01:51.187027+01:00 host1 ceph-osd[1136505]:  6:  
(interval_setunsigned long, std::less,  
mempool::pool_allocator<

Re: [ceph-users] ceph migration

2019-02-26 Thread Eugen Block

Hi,


Well, I've just reacted to all the text at the beginning of
http://docs.ceph.com/docs/luminous/rados/operations/add-or-rm-mons/#changing-a-monitor-s-ip-address-the-messy-way
including the title "the messy way". If the cluster is clean I see no
reason for doing brain surgery on monmaps
just to "save" a few minutes of redoing correctly from scratch.


with that I would agree. Careful planning and an installation  
following the docs should be first priority. But I would also  
encourage users to experiment with ceph before going into production.  
Dealing with failures and outages on a production cluster causes much  
more headache than on a test cluster. ;-)


If the cluster is empty anyway, I would also rather reinstall it, it  
doesn't take that much time. I just wanted to point out that there is  
a way that worked for me, although that was only a test cluster.


Regards,
Eugen


Zitat von Janne Johansson :


Den mån 25 feb. 2019 kl 13:40 skrev Eugen Block :

I just moved a (virtual lab) cluster to a different network, it worked
like a charm.
In an offline method - you need to:

- set osd noout, ensure there are no OSDs up
- Change the MONs IP, See the bottom of [1] "CHANGING A MONITOR’S IP
ADDRESS", MONs are the only ones really
sticky with the IP
- Ensure ceph.conf has the new MON IPs and network IPs
- Start MONs with new monmap, then start OSDs

> No, certain ips will be visible in the databases, and those will  
not change.

I'm not sure where old IPs will be still visible, could you clarify
that, please?


Well, I've just reacted to all the text at the beginning of
http://docs.ceph.com/docs/luminous/rados/operations/add-or-rm-mons/#changing-a-monitor-s-ip-address-the-messy-way
including the title "the messy way". If the cluster is clean I see no
reason for doing brain surgery on monmaps
just to "save" a few minutes of redoing correctly from scratch. What
if you miss some part, some command gives you an error
you really aren't comfortable with, something doesn't really feel
right after doing it, then the whole lifetime of that cluster
will be followed by a small nagging feeling that it might have been
that time you followed a guide that tries to talk you out of
doing it that way, for a cluster with no data.

I think that is the wrong way to learn how to run clusters.

--
May the most significant bit of your life be positive.




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph migration

2019-02-25 Thread Eugen Block
I just moved a (virtual lab) cluster to a different network, it worked  
like a charm.


In an offline method - you need to:

- set osd noout, ensure there are no OSDs up
- Change the MONs' IPs, see the bottom of [1] "CHANGING A MONITOR’S IP
ADDRESS" (rough monmap sketch below); MONs are the only ones really
sticky with the IP
- Ensure ceph.conf has the new MON IPs and network IPs
- Start MONs with new monmap, then start OSDs
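
A rough sketch of the monmap part (mon id "a" and the new IP are placeholders,  
all MONs have to be stopped):

ceph-mon -i a --extract-monmap /tmp/monmap
monmaptool --print /tmp/monmap
monmaptool --rm a /tmp/monmap
monmaptool --add a 192.168.100.11:6789 /tmp/monmap
ceph-mon -i a --inject-monmap /tmp/monmap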


No, certain ips will be visible in the databases, and those will not change.


I'm not sure where old IPs will be still visible, could you clarify  
that, please?


Regards,
Eugen


[1] http://docs.ceph.com/docs/luminous/rados/operations/add-or-rm-mons/


Zitat von Janne Johansson :


Den mån 25 feb. 2019 kl 12:33 skrev Zhenshi Zhou :

I deployed a new cluster (mimic). Now I have to move all servers
in this cluster to another place, with new IPs.
I'm not sure if the cluster will run well or not after I modify config
files, including /etc/hosts and /etc/ceph/ceph.conf.


No, certain ips will be visible in the databases, and those will not change.


Fortunately, the cluster has no data at present. I have never encountered
an issue like this. Is there any suggestion for me?


If you recently created the cluster, it should be easy to just
recreate it again,
using the same scripts so you don't have to repeat yourself as an admin since
computers are very good at repetitive tasks.

--
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] min_size vs. K in erasure coded pools

2019-02-20 Thread Eugen Block

Hi,

I see that as a security feature ;-)
You can prevent data loss if k chunks are intact, but you don't want  
to work with the least required amount of chunks. In a disaster  
scenario you can reduce min_size to k temporarily, but the main goal  
should always be to get the OSDs back up.
For example, in a replicated pool with size 3 we set min_size to 2 not  
to 1, although that would also work if everything is healthy. But it's  
risky since there's also a chance that two corrupt PGs overwrite a  
healthy PG.
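
If you ever do need to restore I/O with only k chunks left, min_size can be  
lowered to k temporarily and raised back to k+1 afterwards, e.g. (pool name  
and k=4 are just examples):

ceph osd pool set <ec-pool> min_size 4
# and back to 5 once the failed OSDs are up again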


Regards,
Eugen


Zitat von "Clausen, Jörn" :


Hi!

While trying to understand erasure coded pools, I would have  
expected that "min_size" of a pool is equal to the "K" parameter.  
But it turns out that it is always K+1.


Isn't the description of erasure coding misleading then? In a K+M  
setup, I would expect to be good (in the sense of "no service  
impact"), even if M OSDs are lost. But in reality, my clients would  
already experience an impact when M-1 OSDs are lost. This means, you  
should always consider one more spare than you would do in e.g. a  
classic RAID setup, right?


Joern

--
Jörn Clausen
Daten- und Rechenzentrum
GEOMAR Helmholtz-Zentrum für Ozeanforschung Kiel
Düsternbrookerweg 20
24105 Kiel




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Placing replaced disks to correct buckets.

2019-02-18 Thread Eugen Block

Hi,


We skipped stage 1 and replaced the UUIDs of old disks with the new
ones in the policy.cfg
We ran salt '*' pillar.items and confirmed that the output was correct.
It showed the new UUIDs in the correct places.
Next we ran  salt-run state.orch ceph.stage.3
PS: All of the above ran successfully.


you should lead with the information that you're using SES, otherwise  
it's likely that misunderstandings come up.
Anyway, if you change the policy.cfg you should run stage.2 to make  
sure the changes are applied. Although you state that pillar.items  
shows the correct values I would recommend to run that (short) stage,  
too.



The output of ceph osd tree showed that these new disks are currently
in a ghost bucket, not even under root=default and without a weight.


Where is the respective host listed in the tree? Can you show a ceph  
osd tree, please?
Did you remove all OSDs of one host so the complete host is in the  
"ghost bucket" or just single OSDs? Are other OSDs on that host listed  
correctly?
Since deepsea is not aware of the crush map it can't figure out which  
bucket it should put the new OSDs in. So this part is not automated  
(yet?), you have to do it yourself. But if the host containing the  
replaced OSDs is already placed correctly, then there's definitely  
something wrong.



The first step I then tried was to reweight them but found errors
below:
Error ENOENT: device osd. does not appear in the crush map
Error ENOENT: unable to set item id 39 name 'osd.39' weight 5.45599 at
location
{host=veeam-mk2-rack1-osd3,rack=veeam-mk2-rack1,room=veeam-mk2,root=veeam}:
does not exist


You can't reweight the OSD if it's not in a bucket yet; try to move it  
to its dedicated bucket first, if possible.
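
With the names and weight from your output that would be something like  
(double-check the weight, it's taken from your reweight attempt):

ceph osd crush add osd.39 5.45599 host=veeam-mk2-rack1-osd3
# or, if the item already exists somewhere in the crush map:
ceph osd crush create-or-move osd.39 5.45599 host=veeam-mk2-rack1-osd3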


As already requested by Konstantin, please paste your osd tree?

Regards,
Eugen


Zitat von John Molefe :


Hi David

Removal process/commands ran as follows:

#ceph osd crush reweight osd.<id> 0
#ceph osd out <id>
#systemctl stop ceph-osd@<id>
#umount /var/lib/ceph/osd/ceph-<id>

#ceph osd crush remove osd.<id>
#ceph auth del osd.<id>
#ceph osd rm <id>
#ceph-disk zap /dev/sd??

Adding them back on:

We skipped stage 1 and replaced the UUIDs of old disks with the new
ones in the policy.cfg
We ran salt '*' pillar.items and confirmed that the output was correct.
It showed the new UUIDs in the correct places.
Next we ran  salt-run state.orch ceph.stage.3
PS: All of the above ran successfully.

The output of ceph osd tree showed that these new disks are currently
in a ghost bucket, not even under root=default and without a weight.

The first step I then tried was to reweight them but found errors
below:
Error ENOENT: device osd. does not appear in the crush map
Error ENOENT: unable to set item id 39 name 'osd.39' weight 5.45599 at
location
{host=veeam-mk2-rack1-osd3,rack=veeam-mk2-rack1,room=veeam-mk2,root=veeam}:
does not exist

But when I run the command: ceph osd find <id>
v-cph-admin:/testing # ceph osd find 39
{
"osd": 39,
"ip": "143.160.78.97:6870\/24436",
"crush_location": {}
}

Please let me know if there's any other info that you may need to
assist

Regards
J.

David Turner  2019/02/18 17:08 >>>

Also what commands did you run to remove the failed HDDs and the
commands you have so far run to add their replacements back in?

On Sat, Feb 16, 2019 at 9:55 PM Konstantin Shalygin 
wrote:




I recently replaced failed HDDs and removed them from their respective
buckets as per procedure. But I’m now facing an issue when trying to
place new ones back into the buckets. I’m getting an error of ‘osd nr
not found’ OR ‘file or directory not found’ OR command syntax error. I
have been using the commands below: ceph osd crush set ... /
ceph osd crush ... set ... I do
however find the OSD number when I run command: ceph osd find <id>. Your
assistance/response to this will be highly appreciated. Regards John.
Please paste your `ceph osd tree`, your version and the exact error
you get, including the osd number.
Less obfuscation is better in this, perhaps, simple case.


k
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Vrywaringsklousule / Disclaimer:
http://www.nwu.ac.za/it/gov-man/disclaimer.html
( http://www.nwu.ac.za/it/gov-man/disclaimer.html )




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Nautilus Release T-shirt Design

2019-02-15 Thread Eugen Block

I have no issues opening that site from Germany.


Zitat von Dan van der Ster :


On Fri, Feb 15, 2019 at 11:40 AM Willem Jan Withagen  wrote:


On 15/02/2019 10:39, Ilya Dryomov wrote:
> On Fri, Feb 15, 2019 at 12:05 AM Mike Perez  wrote:
>>
>> Hi Marc,
>>
>> You can see previous designs on the Ceph store:
>>
>> https://www.proforma.com/sdscommunitystore
>
> Hi Mike,
>
> This site stopped working during DevConf and hasn't been working since.
> I think Greg has contacted some folks about this, but it would be great
> if you could follow up because it's been a couple of weeks now...

Ilya,

The site is working for me.
It only does not contain the Nautilus shirts (yet)


I found in the past that the http redirection for www.proforma.com
doesn't work from over here in Europe.
If someone can post the redirection target then we can access it directly.

-- dan




--WjW



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to control automatic deep-scrubs

2019-02-13 Thread Eugen Block

I created http://tracker.ceph.com/issues/38310 for this.

Regards,
Eugen


Zitat von Konstantin Shalygin :


On 2/14/19 2:21 PM, Eugen Block wrote:

Already did, but now with highlighting ;-)

http://docs.ceph.com/docs/luminous/rados/operations/health-checks/?highlight=osd_deep_mon_scrub_interval  
http://docs.ceph.com/docs/mimic/rados/operations/health-checks/?highlight=osd_deep_mon_scrub_interval


I think this should be `osd_deep_scrub_interval`. Please file a ticket  
about this.




k




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to control automatic deep-scrubs

2019-02-13 Thread Eugen Block

Already did, but now with highlighting ;-)

http://docs.ceph.com/docs/luminous/rados/operations/health-checks/?highlight=osd_deep_mon_scrub_interval
http://docs.ceph.com/docs/mimic/rados/operations/health-checks/?highlight=osd_deep_mon_scrub_interval


Zitat von Konstantin Shalygin :


On 2/14/19 2:16 PM, Eugen Block wrote:
Exactly, it's also not available in a Mimic test-cluster. But it's  
mentioned in the docs for L and M (I didn't check the docs for  
other releases), that's what I was wondering about.


Can you provide url to this page?



k




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to control automatic deep-scrubs

2019-02-13 Thread Eugen Block

My Ceph Luminous don't know anything about this option:
# ceph daemon osd.7 config help osd_deep_mon_scrub_interval
{
"error": "Setting not found: 'osd_deep_mon_scrub_interval'"
}



Exactly, it's also not available in a Mimic test-cluster. But it's  
mentioned in the docs for L and M (I didn't check the docs for other  
releases), that's what I was wondering about.


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to control automatic deep-scrubs

2019-02-13 Thread Eugen Block

Thank you, Konstantin,

I'll give that a try.

Do you have any comment on osd_deep_mon_scrub_interval?

Eugen


Zitat von Konstantin Shalygin :


The expectation was to prevent the automatic deep-scrubs but they are
started anyway
You can disable deep-scrubs per pool via `ceph osd pool set <pool>
nodeep-scrub 1`.




k




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] How to control automatic deep-scrubs

2019-02-13 Thread Eugen Block

Hi cephers,

I'm struggling a little with the deep-scrubs. I know this has been  
discussed multiple times (e.g. in [1]) and we also use a known crontab  
script in a Luminous cluster (12.2.10) to start the deep-scrubbing  
manually (a quarter of all PGs 4 times a week). The script works just  
fine, but it doesn't prevent the automatic deep-scrubs initiated by  
ceph itself.

These are the relevant config settings:

osd_scrub_begin_hour = 0
osd_scrub_end_hour = 7
osd_scrub_sleep = 0.1
osd_deep_scrub_interval = 2419200


The expectation was to prevent the automatic deep-scrubs but they are  
started anyway, and they are executed between midnight and 7 am, so at  
least some of the configs are "honored". I took a look at one specific  
PG:


2019-02-06 22:52:03.438079 7fd7f19cb700  0 log_channel(cluster) log  
[DBG] : 1.b7d deep-scrub starts
2019-02-06 22:52:24.909413 7fd7f19cb700  0 log_channel(cluster) log  
[DBG] : 1.b7d deep-scrub ok
2019-02-11 00:39:42.941238 7fd7f19cb700  0 log_channel(cluster) log  
[DBG] : 1.b7d deep-scrub starts
2019-02-11 00:40:04.447500 7fd7f19cb700  0 log_channel(cluster) log  
[DBG] : 1.b7d deep-scrub ok
2019-02-12 01:35:17.898666 7f97e42fa700  0 log_channel(cluster) log  
[DBG] : 1.b7d deep-scrub starts
2019-02-12 01:35:39.145579 7f97e42fa700  0 log_channel(cluster) log  
[DBG] : 1.b7d deep-scrub ok


The scrubs from 2019-02-06 are from the cronjob, the 2019-02-11 scrub  
was an automatic scrub, the last one could be automatic or manual,  
hard to tell because the cronjob starts at 20:00 and usually ends at  
about 5:30 in the morning. Anyway, I wouldn't expect a PG to be  
deep-scrubbed twice within 24 hours.


Then I continued looking for other config options etc., maybe we  
missed something, and I stumbled upon [2], where it says:



PGs are normally scrubbed every osd_deep_mon_scrub_interval seconds


So I searched for that config option with "ceph daemon [...] config  
show" but couldn't find anything in a Luminous or Mimic cluster.  
Setting that value in the ceph.conf of a test cluster (and restarting  
the cluster) also doesn't show it in the config dump. Is this a  
mistake in the docs? Could it be related to my question?


Of course I could set the nodeep-scrub flag to prevent automatic  
scrubs, but I consider osd flags as temporary. How do you handle your  
deep-scrubs? Any hints are appreciated!


Best regards,
Eugen

[1]  
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-September/021118.html

[2] http://docs.ceph.com/docs/luminous/rados/operations/health-checks/

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] will crush rule be used during object relocation in OSD failure ?

2019-02-12 Thread Eugen Block
Will there be much difference in performance between EC and  
replicated?  Thanks.
Hope we can do more testing on EC before the deadline of our first  
production CEPH...


In general, yes, there will be a difference in performance. Of course  
it depends on the actual configuration, but if you rely on performance  
I would stick with replication. Running your own tests with EC on your  
existing setup will reveal performance differences and help you decide  
which way to go.


Regards,
Eugen


Zitat von "ST Wong (ITSC)" :


Hi,

Thanks. As the power supply to one of our server rooms is not so  
stable, we will probably use size=4,min_size=2 to prevent data loss.



If the overhead is too high could EC be an option for your setup?


Will there be much difference in performance between EC and  
replicated?  Thanks.
Hope we can do more testing on EC before the deadline of our first  
production CEPH...


Regards,
/st

-Original Message-
From: ceph-users  On Behalf Of Eugen Block
Sent: Tuesday, February 12, 2019 5:32 PM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] will crush rule be used during object  
relocation in OSD failure ?


Hi,

I came to the same conclusion after doing various tests with rooms  
and failure domains. I agree with Maged and suggest to use size=4,
min_size=2 for replicated pools. It's more overhead but you can  
survive the loss of one room and even one more OSD (of the affected
PG) without losing data. You'll also have the certainty that there  
are always two replicas per room, no guessing or hoping which room  
is more likely to fail.


If the overhead is too high could EC be an option for your setup?

Regards,
Eugen


Zitat von "ST Wong (ITSC)" :


Hi all,

Tested 4 cases.  Case 1-3 are as expected, while for case 4,
rebuild didn’t take place on surviving room as Gregory mentioned.
Repeated case 4 several times on both rooms got same result.  We’re
running mimic 13.2.2.

E.g.

Room1
Host 1 osd: 2,5
Host 2 osd: 1,3

Room 2  <-- failed room
Host 3 osd: 0,4
Host 4 osd: 6,7


Before:
5.62  0  00 0   0
 000 active+clean 2019-02-12 04:47:28.183375
0'0  3643:2299   [0,7,5]  0   [0,7,5]  0
   0'0 2019-02-12 04:47:28.183218 0'0 2019-02-11
01:20:51.276922 0

After:
5.62  0  00 0   0
 000  undersized+peered 2019-02-12
09:10:59.1010960'0  3647:2284   [5]  5
[5]  50'0 2019-02-12 04:47:28.183218
0'0 2019-02-11 01:20:51.276922 0

Fyi.   Sorry for the belated report.

Thanks a lot.
/st


From: Gregory Farnum 
Sent: Monday, November 26, 2018 9:27 PM
To: ST Wong (ITSC) 
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] will crush rule be used during object
relocation in OSD failure ?

On Fri, Nov 23, 2018 at 11:01 AM ST Wong (ITSC)
<s...@itsc.cuhk.edu.hk> wrote:

Hi all,



We've 8 osd hosts, 4 in room 1 and 4 in room2.

A pool with size = 3 using following crush map is created, to cater
for room failure.


rule multiroom {
id 0
type replicated
min_size 2
max_size 4
step take default
step choose firstn 2 type room
step chooseleaf firstn 2 type host
step emit
}




We're expecting:

1. For each object, there are always 2 replicas in one room and 1
replica in other room making size=3.  But we can't control which room
has 1 or 2 replicas.

Right.


2. In case an osd host fails, ceph will assign remaining
osds to the same PG to hold replicas on the failed osd host.
Selection is based on crush rule of the pool, thus maintaining the
same failure domain - won't make all replicas in the same room.

Yes, if a host fails the copies it held will be replaced by new copies
in the same room.


3. In case the entire room with 1 replica fails, the pool
will remain degraded but won't do any replica relocation.

Right.


4. in case the entire room with 2 replicas fails, ceph will make use of
osds in the surviving room to make 2 replicas.  The pool will not be
writeable before all objects have 2 copies (unless we make the pool
size=4?).  Then when recovery is complete, the pool will remain in a
degraded state until the failed room recovers.

Hmm, I'm actually not sure if this will work out — because CRUSH is
hierarchical, it will keep trying to select hosts from the dead room
and will fill out the location vector's first two spots with -1. It
could be that Ceph will skip all those "nonexistent" entries and just
pick the two copies from slots 3 and 4, but it might not. You should
test this carefully and report back!
-Greg

Is our understanding correct?  Thanks a lot.
Will do some simulation later to verify.

Regards,
/stwong
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] will crush rule be used during object relocation in OSD failure ?

2019-02-12 Thread Eugen Block

Hi,

I came to the same conclusion after doing various tests with rooms and
failure domains. I agree with Maged and suggest using size=4,
min_size=2 for replicated pools. It's more overhead, but you can
survive the loss of one room and even one more OSD (of the affected
PG) without losing data. You'll also have the certainty that there are
always two replicas per room, no guessing or hoping which room is more
likely to fail.


If the overhead is too high could EC be an option for your setup?

Regards,
Eugen


Zitat von "ST Wong (ITSC)" :


Hi all,

Tested 4 cases.  Cases 1-3 are as expected, while for case 4 the
rebuild didn’t take place in the surviving room, as Gregory mentioned.
Repeating case 4 several times on both rooms gave the same result.  We’re
running mimic 13.2.2.


E.g.

Room1
Host 1 osd: 2,5
Host 2 osd: 1,3

Room 2  <-- failed room
Host 3 osd: 0,4
Host 4 osd: 6,7


Before:
5.62  0  00 0   0 
 000 active+clean 2019-02-12 04:47:28.183375 
0'0  3643:2299   [0,7,5]  0   [0,7,5]  0  
   0'0 2019-02-12 04:47:28.183218 0'0 2019-02-11  
01:20:51.276922 0


After:
5.62  0  00 0   0 
 000  undersized+peered 2019-02-12  
09:10:59.1010960'0  3647:2284   [5]  5 
[5]  50'0 2019-02-12 04:47:28.183218  
0'0 2019-02-11 01:20:51.276922 0


Fyi.   Sorry for the belated report.

Thanks a lot.
/st


From: Gregory Farnum 
Sent: Monday, November 26, 2018 9:27 PM
To: ST Wong (ITSC) 
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] will crush rule be used during object  
relocation in OSD failure ?


On Fri, Nov 23, 2018 at 11:01 AM ST Wong (ITSC)
<s...@itsc.cuhk.edu.hk> wrote:


Hi all,



We've 8 osd hosts, 4 in room 1 and 4 in room2.

A pool with size = 3 using following crush map is created, to cater  
for room failure.



rule multiroom {
id 0
type replicated
min_size 2
max_size 4
step take default
step choose firstn 2 type room
step chooseleaf firstn 2 type host
step emit
}




We're expecting:

1. for each object, there are always 2 replicas in one room and 1
replica in the other room, making size=3.  But we can't control which
room has 1 or 2 replicas.


Right.


2. in case an osd host fails, ceph will assign remaining
osds to the same PG to hold the replicas that were on the failed osd host.
Selection is based on crush rule of the pool, thus maintaining the  
same failure domain - won't make all replicas in the same room.


Yes, if a host fails the copies it held will be replaced by new  
copies in the same room.



3. in case the entire room with 1 replica fails, the pool
will remain degraded but won't do any replica relocation.


Right.


4. in case the entire room with 2 replicas fails, ceph will make use
of osds in the surviving room to make 2 replicas.  The pool will not
be writeable before all objects have 2 copies (unless we make the
pool size=4?).  Then when recovery is complete, the pool will remain in a
degraded state until the failed room recovers.


Hmm, I'm actually not sure if this will work out — because CRUSH is  
hierarchical, it will keep trying to select hosts from the dead room  
and will fill out the location vector's first two spots with -1. It  
could be that Ceph will skip all those "nonexistent" entries and  
just pick the two copies from slots 3 and 4, but it might not. You  
should test this carefully and report back!

-Greg
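
One way to check the mapping behaviour offline, before failing a real
room, is to run the compiled crushmap through crushtool and set the
weights of the failed room's OSDs (0,4,6,7 in the example above) to
zero; --num-rep 3 matches the pool size. Roughly:

ceph osd getcrushmap -o crushmap.bin
# normal mapping
crushtool -i crushmap.bin --test --rule 0 --num-rep 3 --show-mappings | head -20
# simulate the failed room by zero-weighting its OSDs
crushtool -i crushmap.bin --test --rule 0 --num-rep 3 \
    --weight 0 0 --weight 4 0 --weight 6 0 --weight 7 0 --show-mappings | head -20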

Is our understanding correct?  Thanks a lot.
Will do some simulation later to verify.

Regards,
/stwong
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] best practices for EC pools

2019-02-07 Thread Eugen Block

Hi Francois,

Is that correct that recovery will be forbidden by the crush rule if  
a node is down?


yes, that is correct, failure-domain=host means no two chunks of the
same PG can be on the same host. So if your PG is divided into 6
chunks, they are spread across all 6 hosts; if one host is down there is
no spare host left to recover its chunks to, so no recovery is possible
at this point (for the EC pool).


After rebooting all nodes we noticed that the recovery was slow,  
maybe half an hour, but all pools are currently empty (new install).

This is odd...


If the pools are empty I also wouldn't expect that. Is restarting one
OSD also that slow, or is it just when you reboot the whole cluster?



Which k values are preferred on 6 nodes?


It depends on the failures you expect and how many concurrent failures
you need to cover.
I think I would keep failure-domain=host (with only 4 OSDs per host).
As for the k and m values, 3+2 would make sense, I guess. That profile
would leave one host free for recovery, and two OSDs of one PG's acting
set could still fail without data loss, so it is as resilient as the 4+2
profile. This is one approach, so please don't read this as *the*
solution for your environment.
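
As a sketch, a matching 3+2 profile and pool would look something like
this (profile and pool names are just placeholders):

ceph osd erasure-code-profile set ec32 k=3 m=2 crush-failure-domain=host crush-device-class=nvme
ceph osd pool create cinder_ec32 256 256 erasure ec32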


Regards,
Eugen


Zitat von Scheurer François :


Dear All


We created an erasure coded pool with k=4 m=2 with  
failure-domain=host but have only 6 osd nodes.
Is that correct that recovery will be forbidden by the crush rule if  
a node is down?


After rebooting all nodes we noticed that the recovery was slow,  
maybe half an hour, but all pools are currently empty (new install).

This is odd...

Can it be related to the k+m being equal to the number of nodes? (4+2=6)
step set_choose_tries 100 was already in the EC crush rule.

rule ewos1-prod_cinder_ec {
id 2
type erasure
min_size 3
max_size 6
step set_chooseleaf_tries 5
step set_choose_tries 100
step take default class nvme
step chooseleaf indep 0 type host
step emit
}

ceph osd erasure-code-profile set ec42 k=4 m=2 crush-root=default  
crush-failure-domain=host crush-device-class=nvme

ceph osd pool create ewos1-prod_cinder_ec 256 256 erasure ec42

ceph version 12.2.10-543-gfc6f0c7299  
(fc6f0c7299e3442e8a0ab83260849a6249ce7b5f) luminous (stable)


  cluster:
id: b5e30221-a214-353c-b66b-8c37b4349123
health: HEALTH_WARN
noout flag(s) set
Reduced data availability: 125 pgs inactive, 32 pgs peering

  services:
mon: 3 daemons, quorum ewos1-osd1-prod,ewos1-osd3-prod,ewos1-osd5-prod
mgr: ewos1-osd5-prod(active), standbys: ewos1-osd3-prod, ewos1-osd1-prod
osd: 24 osds: 24 up, 24 in
 flags noout

  data:
pools:   4 pools, 1600 pgs
objects: 0 objects, 0B
usage:   24.3GiB used, 43.6TiB / 43.7TiB avail
pgs: 7.812% pgs not active
 1475 active+clean
 93   activating
 32   peering


Which k values are preferred on 6 nodes?
BTW, we plan to use this EC pool as a second rbd pool in Openstack,  
with the main first rbd pool being replicated size=3; it is nvme ssd  
only.



Thanks for your help!



Best Regards
Francois Scheurer




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous to Mimic: MON upgrade requires "full luminous scrub". What is that?

2019-02-07 Thread Eugen Block

Hi,

could it be a missing 'ceph osd require-osd-release luminous' on your cluster?

When I check a luminous cluster I get this:

host1:~ # ceph osd dump | grep recovery
flags sortbitwise,recovery_deletes,purged_snapdirs

The flags in the code you quote seem related to that.
Can you check that output on your cluster?
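
If the flags are missing, something like this should help (only run the
second command once all OSDs are actually running luminous):

ceph osd dump | grep flags
ceph osd require-osd-release luminous
# as far as I understand it, recovery_deletes is set once require-osd-release
# is bumped to luminous, while purged_snapdirs only shows up after all PGs
# have been deep-scrubbed at least once on luminous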

Found this in a thread from last year [1].


Regards,
Eugen

[1] https://www.spinics.net/lists/ceph-devel/msg40191.html

Zitat von Andrew Bruce :

Hello All! Yesterday I started the upgrade from luminous to mimic with
one of my 3 MONs.


After applying the mimic yum repo and updating, a restart reports the
following error in the MON log file:


==> /var/log/ceph/ceph-mon.lvtncephx121.log <==
2019-02-07 10:02:40.110 7fc8283ed700 -1 mon.lvtncephx121@0(probing)  
e4 handle_probe_reply existing cluster has not completed a full  
luminous scrub to purge legacy snapdir objects; please scrub before  
upgrading beyond luminous.


My question is simply: What exactly does this require?

Yesterday afternoon I did a manual:

ceph osd scrub all

But that has zero effect. I still get the same message on restarting the MON.

I have no errors in the cluster except for the single MON  
(lvtncephx121) that I'm working to migrate to mimic first:


[root@lvtncephx110 ~]# ceph status
  cluster:
id: 5fabf1b2-cfd0-44a8-a6b5-fb3fd0545517
health: HEALTH_WARN
1/3 mons down, quorum lvtncephx122,lvtncephx123

  services:
mon: 3 daemons, quorum lvtncephx122,lvtncephx123, out of quorum:  
lvtncephx121

mgr: lvtncephx122(active), standbys: lvtncephx123, lvtncephx121
mds: cephfs-1/1/1 up  {0=lvtncephx151=up:active}, 1 up:standby
osd: 18 osds: 18 up, 18 in
rgw: 2 daemons active

  data:
pools:   23 pools, 2016 pgs
objects: 2608k objects, 10336 GB
usage:   20689 GB used, 39558 GB / 60247 GB avail
pgs: 2016 active+clean

  io:
client:   5612 B/s rd, 3756 kB/s wr, 1350 op/s rd, 412 op/s wr

FWIW: The source code has the following:

// Monitor.cc
if (!osdmon()->osdmap.test_flag(CEPH_OSDMAP_PURGED_SNAPDIRS) ||
    !osdmon()->osdmap.test_flag(CEPH_OSDMAP_RECOVERY_DELETES)) {
  derr << __func__ << " existing cluster has not completed a full luminous"
       << " scrub to purge legacy snapdir objects; please scrub before"
       << " upgrading beyond luminous." << dendl;
  exit(0);
}
  }

So, two questions:
How to show the current flags in the OSD map checked by the monitor?
How to get these flags set so the MON will actually start?

Thanks,
Andy




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD OSD crashing after upgrade to 12.2.10

2019-02-07 Thread Eugen Block
At first - you should upgrade to 12.2.11 (or bring the mentioned
patch in by other means) to fix the rename procedure, which will avoid
the appearance of new inconsistent objects in the DB. This should at least
reduce the OSD crash frequency.


We'll have to wait until 12.2.11 is available for openSUSE, I'm not  
sure how long it will take.


So I'd like to have fsck report to verify that. No matter if you do  
fsck before or after the upgrade.


Once we have the fsck report we can proceed with the repair, which is a
bit risky procedure, so maybe I should try to simulate the
inconsistency in question and check if the built-in repair handles it
properly. We'll see, let's get the fsck report first.


I'll try to run the fsck today, I have to wait until there are fewer  
clients active. Depending on the log file size, would it be okay to  
attach it to an email and send it directly to you or what is the best  
procedure for you?


Thanks for your support!
Eugen

Zitat von Igor Fedotov :


Eugen,

At first - you should upgrade to 12.2.11 (or bring the mentioned
patch in by other means) to fix the rename procedure, which will avoid
the appearance of new inconsistent objects in the DB. This should at least
reduce the OSD crash frequency.


At second - theoretically, previous crashes could result in
persistent inconsistent objects in your DB. I haven't seen that in
real life before, but they probably exist. We need to check. If so,
OSD crashes might still occur.


So I'd like to have fsck report to verify that. No matter if you do  
fsck before or after the upgrade.


Once we have the fsck report we can proceed with the repair, which is a
bit risky procedure, so maybe I should try to simulate the
inconsistency in question and check if the built-in repair handles it
properly. We'll see, let's get the fsck report first.



W.r.t. running ceph-bluestore-tool - you might want to specify a log
file and increase the log level to 20 using the --log-file and --log-level
options.
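
So a full run could look roughly like this (OSD id and log path are just
examples, and the OSD has to be stopped while fsck runs):

systemctl stop ceph-osd@10
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-10 \
    --log-file /var/log/ceph/fsck-osd.10.log --log-level 20
systemctl start ceph-osd@10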



On 2/7/2019 4:45 PM, Eugen Block wrote:

Hi Igor,

thanks for the quick response!
Just to make sure I don't misunderstand, and because it's a  
production cluster:
before anything else I should run fsck on that OSD? Depending on  
the result we'll decide how to continue, right?
Is there anything else to be enabled for that command or can I  
simply run 'ceph-bluestore-tool fsck --path
/var/lib/ceph/osd/ceph-<ID>'?


Any other obstacles I should be aware of when running fsck?

Thanks!
Eugen


Zitat von Igor Fedotov :


Hi Eugen,

looks like this isn't [1] but rather

https://tracker.ceph.com/issues/38049

and

https://tracker.ceph.com/issues/36541 (=  
https://tracker.ceph.com/issues/36638 for luminous)


Hence it's not fixed in 12.2.10, target release is 12.2.11


Also please note the patch only avoids new occurrences of the
issue. There is some chance that inconsistencies caused by it
earlier are still present in the DB, and the assertion might still happen
(hopefully less frequently).


So could you please run fsck for OSDs that were broken once and  
share the results?


Then we can decide if it makes sense to proceed with the repair.


Thanks,

Igor

On 2/7/2019 3:37 PM, Eugen Block wrote:

Hi list,

I found this thread [1] about crashing SSD OSDs, although that  
was about an upgrade to 12.2.7, we just hit (probably) the same  
issue after our update to 12.2.10 two days ago in a production  
cluster.

Just half an hour ago I saw one OSD (SSD) crashing (for the first time):

2019-02-07 13:02:07.682178 mon.host1 mon.0 :6789/0 109754 :  
cluster [INF] osd.10 failed (root=default,host=host1) (connection  
refused reported by osd.20)
2019-02-07 13:02:08.623828 mon.host1 mon.0 :6789/0 109771 :  
cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)


One minute later, the OSD was back online.
This is the stack trace reported in syslog:

---cut here---
2019-02-07T13:01:51.181027+01:00 host1 ceph-osd[1136505]: ***  
Caught signal (Aborted) **
2019-02-07T13:01:51.181232+01:00 host1 ceph-osd[1136505]:  in  
thread 7f75ce646700 thread_name:bstore_kv_final
2019-02-07T13:01:51.185873+01:00 host1 ceph-osd[1136505]: ceph  
version 12.2.10-544-gb10c702661  
(b10c702661a31c8563b3421d6d664de93a0cb0e2) luminous (stable)
2019-02-07T13:01:51.186077+01:00 host1 ceph-osd[1136505]:  1:  
(()+0xa587d9) [0x560b921cc7d9]
2019-02-07T13:01:51.186226+01:00 host1 ceph-osd[1136505]:  2:  
(()+0x10b10) [0x7f75d8386b10]
2019-02-07T13:01:51.186368+01:00 host1 ceph-osd[1136505]:  3:  
(gsignal()+0x37) [0x7f75d73508d7]
2019-02-07T13:01:51.186773+01:00 host1 ceph-osd[1136505]:  4:  
(abort()+0x13a) [0x7f75d7351caa]
2019-02-07T13:01:51.186906+01:00 host1 ceph-osd[1136505]:  5:  
(ceph::__ceph_assert_fail(char const*, char const*, int, char  
const*)+0x280) [0x560b922096d0]
2019-02-07T13:01:51.187027+01:00 host1 ceph-osd[1136505]:  6:  
(interval_setunsigned long, std::less,  
mempool::pool_allocator<(mempool::pool_index_t)1,  
std::pair >, 256>  
>::insert(unsigned long, u

Re: [ceph-users] SSD OSD crashing after upgrade to 12.2.10

2019-02-07 Thread Eugen Block

Hi Igor,

thanks for the quick response!
Just to make sure I don't misunderstand, and because it's a production  
cluster:
before anything else I should run fsck on that OSD? Depending on the  
result we'll decide how to continue, right?
Is there anything else to be enabled for that command, or can I simply
run 'ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-<ID>'?


Any other obstacles I should be aware of when running fsck?

Thanks!
Eugen


Zitat von Igor Fedotov :


Hi Eugen,

looks like this isn't [1] but rather

https://tracker.ceph.com/issues/38049

and

https://tracker.ceph.com/issues/36541 (=  
https://tracker.ceph.com/issues/36638 for luminous)


Hence it's not fixed in 12.2.10, target release is 12.2.11


Also please note the patch only avoids new occurrences of the
issue. There is some chance that inconsistencies caused by it
earlier are still present in the DB, and the assertion might still happen
(hopefully less frequently).


So could you please run fsck for OSDs that were broken once and  
share the results?


Then we can decide if it makes sense to proceed with the repair.


Thanks,

Igor

On 2/7/2019 3:37 PM, Eugen Block wrote:

Hi list,

I found this thread [1] about crashing SSD OSDs, although that was  
about an upgrade to 12.2.7, we just hit (probably) the same issue  
after our update to 12.2.10 two days ago in a production cluster.

Just half an hour ago I saw one OSD (SSD) crashing (for the first time):

2019-02-07 13:02:07.682178 mon.host1 mon.0 :6789/0 109754 :  
cluster [INF] osd.10 failed (root=default,host=host1) (connection  
refused reported by osd.20)
2019-02-07 13:02:08.623828 mon.host1 mon.0 :6789/0 109771 :  
cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)


One minute later, the OSD was back online.
This is the stack trace reported in syslog:

---cut here---
2019-02-07T13:01:51.181027+01:00 host1 ceph-osd[1136505]: ***  
Caught signal (Aborted) **
2019-02-07T13:01:51.181232+01:00 host1 ceph-osd[1136505]:  in  
thread 7f75ce646700 thread_name:bstore_kv_final
2019-02-07T13:01:51.185873+01:00 host1 ceph-osd[1136505]:  ceph  
version 12.2.10-544-gb10c702661  
(b10c702661a31c8563b3421d6d664de93a0cb0e2) luminous (stable)
2019-02-07T13:01:51.186077+01:00 host1 ceph-osd[1136505]:  1:  
(()+0xa587d9) [0x560b921cc7d9]
2019-02-07T13:01:51.186226+01:00 host1 ceph-osd[1136505]:  2:  
(()+0x10b10) [0x7f75d8386b10]
2019-02-07T13:01:51.186368+01:00 host1 ceph-osd[1136505]:  3:  
(gsignal()+0x37) [0x7f75d73508d7]
2019-02-07T13:01:51.186773+01:00 host1 ceph-osd[1136505]:  4:  
(abort()+0x13a) [0x7f75d7351caa]
2019-02-07T13:01:51.186906+01:00 host1 ceph-osd[1136505]:  5:  
(ceph::__ceph_assert_fail(char const*, char const*, int, char  
const*)+0x280) [0x560b922096d0]
2019-02-07T13:01:51.187027+01:00 host1 ceph-osd[1136505]:  6:  
(interval_setunsigned long, std::less,  
mempool::pool_allocator<(mempool::pool_index_t)1,  
std::pair >, 256>  
>::insert(unsigned long, unsigned long, unsigned long*, unsigned  
long*)+0xef2) [0x560b921bd432]
2019-02-07T13:01:51.187167+01:00 host1 ceph-osd[1136505]:  7:  
(StupidAllocator::_insert_free(unsigned long, unsigned long)+0x126)  
[0x560b921b4a06]
2019-02-07T13:01:51.187294+01:00 host1 ceph-osd[1136505]:  8:  
(StupidAllocator::release(unsigned long, unsigned long)+0x7d)  
[0x560b921b4f4d]
2019-02-07T13:01:51.187418+01:00 host1 ceph-osd[1136505]:  9:  
(BlueStore::_txc_release_alloc(BlueStore::TransContext*)+0x72)  
[0x560b9207fa22]
2019-02-07T13:01:51.187539+01:00 host1 ceph-osd[1136505]:  10:  
(BlueStore::_txc_finish(BlueStore::TransContext*)+0x5d7)  
[0x560b92092d77]
2019-02-07T13:01:51.187661+01:00 host1 ceph-osd[1136505]:  11:  
(BlueStore::_txc_state_proc(BlueStore::TransContext*)+0x1f6)  
[0x560b920a3fa6]
2019-02-07T13:01:51.187781+01:00 host1 ceph-osd[1136505]:  12:  
(BlueStore::_kv_finalize_thread()+0x620) [0x560b920a58e0]
2019-02-07T13:01:51.187898+01:00 host1 ceph-osd[1136505]:  13:  
(BlueStore::KVFinalizeThread::entry()+0xd) [0x560b920fb57d]
2019-02-07T13:01:51.188017+01:00 host1 ceph-osd[1136505]:  14:  
(()+0x8744) [0x7f75d837e744]
2019-02-07T13:01:51.188138+01:00 host1 ceph-osd[1136505]:  15:  
(clone()+0x6d) [0x7f75d7405aad]
2019-02-07T13:01:51.188271+01:00 host1 ceph-osd[1136505]:  
2019-02-07 13:01:51.185833 7f75ce646700 -1 *** Caught signal  
(Aborted) **

---cut here---

Is there anything we can do about this? The issue in [1] doesn't  
seem to be resolved, yet. Debug logging is not enabled, so I don't  
have more detailed information except the full stack trace from the  
OSD log. Any help is appreciated!


Regards,
Eugen

[1]  
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/029616.html


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cg

[ceph-users] SSD OSD crashing after upgrade to 12.2.10

2019-02-07 Thread Eugen Block

Hi list,

I found this thread [1] about crashing SSD OSDs, although that was  
about an upgrade to 12.2.7, we just hit (probably) the same issue  
after our update to 12.2.10 two days ago in a production cluster.

Just half an hour ago I saw one OSD (SSD) crashing (for the first time):

2019-02-07 13:02:07.682178 mon.host1 mon.0 :6789/0 109754 :  
cluster [INF] osd.10 failed (root=default,host=host1) (connection  
refused reported by osd.20)
2019-02-07 13:02:08.623828 mon.host1 mon.0 :6789/0 109771 :  
cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)


One minute later, the OSD was back online.
This is the stack trace reported in syslog:

---cut here---
2019-02-07T13:01:51.181027+01:00 host1 ceph-osd[1136505]: *** Caught  
signal (Aborted) **
2019-02-07T13:01:51.181232+01:00 host1 ceph-osd[1136505]:  in thread  
7f75ce646700 thread_name:bstore_kv_final
2019-02-07T13:01:51.185873+01:00 host1 ceph-osd[1136505]:  ceph  
version 12.2.10-544-gb10c702661  
(b10c702661a31c8563b3421d6d664de93a0cb0e2) luminous (stable)
2019-02-07T13:01:51.186077+01:00 host1 ceph-osd[1136505]:  1:  
(()+0xa587d9) [0x560b921cc7d9]
2019-02-07T13:01:51.186226+01:00 host1 ceph-osd[1136505]:  2:  
(()+0x10b10) [0x7f75d8386b10]
2019-02-07T13:01:51.186368+01:00 host1 ceph-osd[1136505]:  3:  
(gsignal()+0x37) [0x7f75d73508d7]
2019-02-07T13:01:51.186773+01:00 host1 ceph-osd[1136505]:  4:  
(abort()+0x13a) [0x7f75d7351caa]
2019-02-07T13:01:51.186906+01:00 host1 ceph-osd[1136505]:  5:  
(ceph::__ceph_assert_fail(char const*, char const*, int, char  
const*)+0x280) [0x560b922096d0]
2019-02-07T13:01:51.187027+01:00 host1 ceph-osd[1136505]:  6:  
(interval_setlong, std::less,  
mempool::pool_allocator<(mempool::pool_index_t)1, std::pairlong const, unsigned long> >, 256> >::insert(unsigned long, unsigned  
long, unsigned long*, unsigned long*)+0xef2) [0x560b921bd432]
2019-02-07T13:01:51.187167+01:00 host1 ceph-osd[1136505]:  7:  
(StupidAllocator::_insert_free(unsigned long, unsigned long)+0x126)  
[0x560b921b4a06]
2019-02-07T13:01:51.187294+01:00 host1 ceph-osd[1136505]:  8:  
(StupidAllocator::release(unsigned long, unsigned long)+0x7d)  
[0x560b921b4f4d]
2019-02-07T13:01:51.187418+01:00 host1 ceph-osd[1136505]:  9:  
(BlueStore::_txc_release_alloc(BlueStore::TransContext*)+0x72)  
[0x560b9207fa22]
2019-02-07T13:01:51.187539+01:00 host1 ceph-osd[1136505]:  10:  
(BlueStore::_txc_finish(BlueStore::TransContext*)+0x5d7)  
[0x560b92092d77]
2019-02-07T13:01:51.187661+01:00 host1 ceph-osd[1136505]:  11:  
(BlueStore::_txc_state_proc(BlueStore::TransContext*)+0x1f6)  
[0x560b920a3fa6]
2019-02-07T13:01:51.187781+01:00 host1 ceph-osd[1136505]:  12:  
(BlueStore::_kv_finalize_thread()+0x620) [0x560b920a58e0]
2019-02-07T13:01:51.187898+01:00 host1 ceph-osd[1136505]:  13:  
(BlueStore::KVFinalizeThread::entry()+0xd) [0x560b920fb57d]
2019-02-07T13:01:51.188017+01:00 host1 ceph-osd[1136505]:  14:  
(()+0x8744) [0x7f75d837e744]
2019-02-07T13:01:51.188138+01:00 host1 ceph-osd[1136505]:  15:  
(clone()+0x6d) [0x7f75d7405aad]
2019-02-07T13:01:51.188271+01:00 host1 ceph-osd[1136505]: 2019-02-07  
13:01:51.185833 7f75ce646700 -1 *** Caught signal (Aborted) **

---cut here---

Is there anything we can do about this? The issue in [1] doesn't seem  
to be resolved, yet. Debug logging is not enabled, so I don't have  
more detailed information except the full stack trace from the OSD  
log. Any help is appreciated!


Regards,
Eugen

[1]  
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/029616.html


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Creating a block device user with restricted access to image

2019-01-25 Thread Eugen Block

You can check all objects of that pool to see if your caps match:

rados -p backup ls | grep rbd_id


Zitat von Eugen Block :


caps osd = "allow pool backup object_prefix
rbd_data.18102d6b8b4567; allow rwx pool backup object_prefix
rbd_header.18102d6b8b4567; allow rx pool backup object_prefix
rbd_id.rbd-image"


I think your caps are not entirely correct, the part "[...]  
object_prefix rbd_id.rbd-image" should contain the

actual image name, so in your case it should be "[...] rbd_id.gbs".

Regards,
Eugen

Zitat von Thomas <74cmo...@gmail.com>:


Thanks.

Unfortunately this is still not working.

Here's the info of my image:
root@ld4257:/etc/ceph# rbd info backup/gbs
rbd image 'gbs':
    size 500GiB in 128000 objects
    order 22 (4MiB objects)
    block_name_prefix: rbd_data.18102d6b8b4567
    format: 2
    features: layering
    flags:
    create_timestamp: Thu Jan 24 16:01:55 2019

And here's the user caps output:
root@ld4257:/etc/ceph# ceph auth get client.gbsadm
exported keyring for client.gbsadm
[client.gbsadm]
    key = AQBd0klcFknvMRAAwuu30bNG7L7PHk5d8cSVvg==
    caps mon = "allow r"
    caps osd = "allow pool backup object_prefix
rbd_data.18102d6b8b4567; allow rwx pool backup object_prefix
rbd_header.18102d6b8b4567; allow rx pool backup object_prefix
rbd_id.rbd-image"


Trying to map rbd "backup/gbs" now fails with this error; this operation
should be permitted:
ld7581:/etc/ceph # rbd map backup/gbs --user gbsadm -k
/etc/ceph/ceph.client.gbsadm.keyring -c /etc/ceph/ceph.conf
rbd: sysfs write failed
2019-01-25 12:15:19.786724 7fe4357fa700 -1 librbd::image::OpenRequest:
failed to stat v2 image header: (1) Operation not permitted
2019-01-25 12:15:19.786962 7fe434ff9700 -1 librbd::ImageState:
0x55b6522177f0 failed to open image: (1) Operation not permitted
rbd: error opening image gbs: (1) Operation not permitted
In some cases useful info is found in syslog - try "dmesg | tail".
rbd: map failed: (1) Operation not permitted

The same error is shown when I try to map rbd "backup/isa"; this
operation must be prohibited:
ld7581:/etc/ceph # rbd map backup/isa --user gbsadm -k
/etc/ceph/ceph.client.gbsadm.keyring -c /etc/ceph/ceph.conf
rbd: sysfs write failed
2019-01-25 12:15:04.850151 7f8041ffb700 -1 librbd::image::OpenRequest:
failed to stat v2 image header: (1) Operation not permitted
2019-01-25 12:15:04.850350 7f80417fa700 -1 librbd::ImageState:
0x5643668a5700 failed to open image: (1) Operation not permitted
rbd: error opening image isa: (1) Operation not permitted
In some cases useful info is found in syslog - try "dmesg | tail".
rbd: map failed: (1) Operation not permitted


Regards
Thomas

Am 25.01.2019 um 11:52 schrieb Eugen Block:

osd 'allow rwx
pool <pool> object_prefix rbd_data.2b36cf238e1f29; allow rwx pool <pool>
object_prefix rbd_header.2b36cf238e1f29




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Creating a block device user with restricted access to image

2019-01-25 Thread Eugen Block

caps osd = "allow pool backup object_prefix
rbd_data.18102d6b8b4567; allow rwx pool backup object_prefix
rbd_header.18102d6b8b4567; allow rx pool backup object_prefix
rbd_id.rbd-image"


I think your caps are not entirely correct, the part "[...]  
object_prefix rbd_id.rbd-image" should contain the

actual image name, so in your case it should be "[...] rbd_id.gbs".
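
So updating the caps should look something like this (keeping your
existing prefixes, with read/write on the data and header objects;
untested on your cluster):

ceph auth caps client.gbsadm mon 'allow r' \
    osd 'allow rwx pool backup object_prefix rbd_data.18102d6b8b4567; allow rwx pool backup object_prefix rbd_header.18102d6b8b4567; allow rx pool backup object_prefix rbd_id.gbs'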

Regards,
Eugen

Zitat von Thomas <74cmo...@gmail.com>:


Thanks.

Unfortunately this is still not working.

Here's the info of my image:
root@ld4257:/etc/ceph# rbd info backup/gbs
rbd image 'gbs':
    size 500GiB in 128000 objects
    order 22 (4MiB objects)
    block_name_prefix: rbd_data.18102d6b8b4567
    format: 2
    features: layering
    flags:
    create_timestamp: Thu Jan 24 16:01:55 2019

And here's the user caps output:
root@ld4257:/etc/ceph# ceph auth get client.gbsadm
exported keyring for client.gbsadm
[client.gbsadm]
    key = AQBd0klcFknvMRAAwuu30bNG7L7PHk5d8cSVvg==
    caps mon = "allow r"
    caps osd = "allow pool backup object_prefix
rbd_data.18102d6b8b4567; allow rwx pool backup object_prefix
rbd_header.18102d6b8b4567; allow rx pool backup object_prefix
rbd_id.rbd-image"


Trying to map rbd "backup/gbs" now fails with this error; this operation
should be permitted:
ld7581:/etc/ceph # rbd map backup/gbs --user gbsadm -k
/etc/ceph/ceph.client.gbsadm.keyring -c /etc/ceph/ceph.conf
rbd: sysfs write failed
2019-01-25 12:15:19.786724 7fe4357fa700 -1 librbd::image::OpenRequest:
failed to stat v2 image header: (1) Operation not permitted
2019-01-25 12:15:19.786962 7fe434ff9700 -1 librbd::ImageState:
0x55b6522177f0 failed to open image: (1) Operation not permitted
rbd: error opening image gbs: (1) Operation not permitted
In some cases useful info is found in syslog - try "dmesg | tail".
rbd: map failed: (1) Operation not permitted

The same error is shown when I try to map rbd "backup/isa"; this
operation must be prohibited:
ld7581:/etc/ceph # rbd map backup/isa --user gbsadm -k
/etc/ceph/ceph.client.gbsadm.keyring -c /etc/ceph/ceph.conf
rbd: sysfs write failed
2019-01-25 12:15:04.850151 7f8041ffb700 -1 librbd::image::OpenRequest:
failed to stat v2 image header: (1) Operation not permitted
2019-01-25 12:15:04.850350 7f80417fa700 -1 librbd::ImageState:
0x5643668a5700 failed to open image: (1) Operation not permitted
rbd: error opening image isa: (1) Operation not permitted
In some cases useful info is found in syslog - try "dmesg | tail".
rbd: map failed: (1) Operation not permitted


Regards
Thomas

Am 25.01.2019 um 11:52 schrieb Eugen Block:

osd 'allow rwx
pool <pool> object_prefix rbd_data.2b36cf238e1f29; allow rwx pool <pool>
object_prefix rbd_header.2b36cf238e1f29




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Creating a block device user with restricted access to image

2019-01-25 Thread Eugen Block

Hi,

I replied to your thread a couple of days ago, maybe you didn't notice:

Restricting user access is possible on rbd image level. You can grant  
read/write access for one client and only read access for other  
clients, you have to create different clients for that, see [1] for  
more details.


Regards,
Eugen

[1]  
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-February/024424.html




Zitat von cmonty14 <74cmo...@gmail.com>:


Hi,
I can create a block device user with this command:

ceph auth get-or-create client.{ID} mon 'profile rbd' osd 'profile
{profile name} [pool={pool-name}][, profile ...]'

Question:
How can I create a user that has access only to a specific image
created in pool ?

If this is not possible this would mean that any user with pool access
can map any image created in this pool.
In my opinion this is a security concern.

THX
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] The OSD can be “down” but still “in”.

2019-01-23 Thread Eugen Block

Hi,


If the OSD represents the primary one for a PG, then all IO will be
stopped, which may lead to application failure.


no, that's not how it works. You have an acting set of OSDs for a PG,  
typically 3 OSDs in a replicated pool. If the primary OSD goes down,  
the secondary becomes the primary immediately and serves client  
requests.
I recommend to read the docs [1] to get a better understanding of the  
workflow, or set up a practice environment to test failure scenarios  
and watch what happens if an OSD/host/rack etc. fails.
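
You can see which OSDs serve a given PG with, for example (the first
OSD in the acting set is the primary; <pgid> and <pool> are placeholders):

ceph pg map <pgid>              # shows up set and acting set of that PG
ceph osd pool get <pool> size
ceph osd pool get <pool> min_size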


Regards,
Eugen


[1] http://docs.ceph.com/docs/master/architecture/#peering-and-sets


Zitat von M Ranga Swami Reddy :


Thanks for reply.
If the OSD represents the primary one for a PG, then all IO will be
stopped, which may lead to application failure.



On Tue, Jan 22, 2019 at 5:32 PM Matthew Vernon  wrote:


Hi,

On 22/01/2019 10:02, M Ranga Swami Reddy wrote:
> Hello - If an OSD shown as down and but its still "in" state..what
> will happen with write/read operations on this down OSD?

It depends ;-)

In a typical 3-way replicated setup with min_size 2, writes to placement
groups on that OSD will still go ahead - when 2 replicas are written OK,
then the write will complete. Once the OSD comes back up, these writes
will then be replicated to that OSD. If it stays down for long enough to
be marked out, then pgs on that OSD will be replicated elsewhere.

If you had min_size 3 as well, then writes would block until the OSD was
back up (or marked out and the pgs replicated to another OSD).

Regards,

Matthew


--
 The Wellcome Sanger Institute is operated by Genome Research
 Limited, a charity registered in England with number 1021457 and a
 company registered in England with number 2742969, whose registered
 office is 215 Euston Road, London, NW1 2BE.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Using Ceph central backup storage - Best practice creating pools

2019-01-22 Thread Eugen Block

Hi Thomas,


What is the best practice for creating pools & images?
Should I create multiple pools, means one pool per database?
Or should I create a single pool "backup" and use namespace when writing
data in the pool?


I don't think one pool per DB is reasonable. If the number of DBs  
increases you'll have to create more pools and change the respective  
auth settings. One pool for your DB backups would suffice, and  
restricting user access is possible on rbd image level. You can grant  
read/write access for one client and only read access for other  
clients, you have to create different clients for that, see [1] for  
more details.


Regards,
Eugen

[1]  
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-February/024424.html



Zitat von Thomas <74cmo...@gmail.com>:


Hi,
 
my use case for Ceph is to serve as central backup storage.
This means I will backup multiple databases in Ceph storage cluster.
 
This is my question:
What is the best practice for creating pools & images?
Should I create multiple pools, means one pool per database?
Or should I create a single pool "backup" and use namespace when writing
data in the pool?
 
This is the security demand that should be considered:
DB-owner A can only modify the files that belong to A; other files
(owned by B, C or D) must not be accessible to A.

And there's another issue:
How can I identify a backup created by client A that I want to restore
on another client Z?
I mean typically client A would write a backup file identified by the
filename.
Would it be possible on client Z to identify this backup file by
filename? If yes, how?
 
 
THX




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] block.db on a LV? (Re: Mixed SSD+HDD OSD setup recommendation)

2019-01-18 Thread Eugen Block

Hi Jan,

I think you're running into an issue reported a couple of times.
For the use of LVM you have to specify the name of the Volume Group  
and the respective Logical Volume instead of the path, e.g.


ceph-volume lvm prepare --bluestore --block.db ssd_vg/ssd00 --data /dev/sda
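
So the whole sequence would be roughly (device and LV names are just
examples):

# create the VG/LV for block.db on the SSD
vgcreate ssd_vg /dev/nvme0n1
lvcreate -L 60G -n ssd00 ssd_vg
# reference the LV as VG/LV, not as a /dev path
ceph-volume lvm prepare --bluestore --data /dev/sda --block.db ssd_vg/ssd00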

Regards,
Eugen


Zitat von Jan Kasprzak :


Hello, Ceph users,

replying to my own post from several weeks ago:

Jan Kasprzak wrote:
: [...] I plan to add new OSD hosts,
: and I am looking for setup recommendations.
:
: Intended usage:
:
: - small-ish pool (tens of TB) for RBD volumes used by QEMU
: - large pool for object-based cold (or not-so-hot :-) data,
:   write-once read-many access pattern, average object size
:   10s or 100s of MBs, probably custom programmed on top of
:   libradosstriper.
:
: Hardware:
:
: The new OSD hosts have ~30 HDDs 12 TB each, and two 960 GB SSDs.
: There is a small RAID-1 root and RAID-1 swap volume spanning both SSDs,
: leaving about 900 GB free on each SSD.
: The OSD hosts have two CPU sockets (32 cores including SMT), 128 GB RAM.
:
: My questions:
[...]
: - block.db on SSDs? The docs recommend about 4 % of the data size
:   for block.db, but my SSDs are only 0.6 % of total storage size.
:
: - or would it be better to leave SSD caching on the OS and use LVMcache
:   or something?
:
: - LVM or simple volumes?

I have a problem setting this up with ceph-volume: I want to have an OSD
on each HDD, with block.db on the SSD. In order to set this up,
I have created a VG on the two SSDs, created 30 LVs on top of it for  
block.db,

and wanted to create an OSD using the following:

# ceph-volume lvm prepare --bluestore \
--block.db /dev/ssd_vg/ssd00 \
--data /dev/sda
[...]
--> blkid could not detect a PARTUUID for device: /dev/cbia_ssd_vg/ssd00
--> Was unable to complete a new OSD, will rollback changes
[...]

Then it failed, because deploying a volume used client.bootstrap-osd user,
but trying to roll the changes back required the client.admin user,
which does not have a keyring on the OSD host. Never mind.

The problem is with determining the PARTUUID of the SSD LV for block.db.
How can I deploy an OSD which is on top of bare HDD, but which also
has a block.db on an existing LV?

Thanks,

-Yenya

--
| Jan "Yenya" Kasprzak  |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
 This is the world we live in: the way to deal with computers is to google
 the symptoms, and hope that you don't have to watch a video. --P. Zaitcev
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Clarification of communication between mon and osd

2019-01-14 Thread Eugen Block

Thanks for the reply, Paul.


Yes, your understanding is correct. But the main mechanism by which
OSDs are reported as down is that other OSDs report them as down with
a much stricter timeout (20 seconds? 30 seconds? something like that).


Yes, the osd_heartbeat_grace of 20 seconds has been hit from time to
time in setups with network configuration issues.
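
You can check the values your daemons actually run with via the admin
socket on the respective host, e.g. (adjust the daemon ids):

ceph daemon osd.0 config get osd_heartbeat_grace
ceph daemon mon.<id> config get mon_osd_report_timeout
ceph daemon mon.<id> config get mon_osd_down_out_interval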



It's quite rare to hit the "mon osd report timeout" (the usual
scenario here is a network partition)


Thanks for the confirmation.

Eugen

Zitat von Paul Emmerich :


Yes, your understanding is correct. But the main mechanism by which
OSDs are reported as down is that other OSDs report them as down with
a much stricter timeout (20 seconds? 30 seconds? something like that).

It's quite rare to hit the "mon osd report timeout" (the usual
scenario here is a network partition)

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Mon, Jan 14, 2019 at 10:17 AM Eugen Block  wrote:


Hello list,

I noticed my last post was displayed as a reply to a different thread,
so I re-send my question, please excuse the noise.

There are two config options of mon/osd interaction that I don't fully
understand. Maybe one of you could clarify it for me.

> mon osd report timeout
> - The grace period in seconds before declaring unresponsive Ceph OSD
> Daemons down. Default 900

> mon osd down out interval
> - The number of seconds Ceph waits before marking a Ceph OSD Daemon
> down and out if it doesn’t respond. Default 600

> I've seen the mon_osd_down_out_interval being hit plenty of times,
e.g. If I manually take down an OSD it will be marked out after 10
minutes. But I can't quite remember seeing the 900 seconds timeout
happen. When exactly will the mon_osd_report_timeout kick in? Does
this mean that if for some reason one OSD is unresponsive the MON will
mark it down after 15 minutes, then wait another 10 minutes until it
is marked out so the recovery can start?

I'd appreciate any insight!

Regards,
Eugen

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Clarification of communication between mon and osd

2019-01-14 Thread Eugen Block

Hello list,

I noticed my last post was displayed as a reply to a different thread,  
so I re-send my question, please excuse the noise.


There are two config options of mon/osd interaction that I don't fully  
understand. Maybe one of you could clarify it for me.



mon osd report timeout
- The grace period in seconds before declaring unresponsive Ceph OSD  
Daemons down. Default 900



mon osd down out interval
- The number of seconds Ceph waits before marking a Ceph OSD Daemon  
down and out if it doesn’t respond. Default 600


I've seen the mon_osd_down_out_interval being hit plenty of times,
e.g. If I manually take down an OSD it will be marked out after 10  
minutes. But I can't quite remember seeing the 900 seconds timeout  
happen. When exactly will the mon_osd_report_timeout kick in? Does  
this mean that if for some reason one OSD is unresponsive the MON will  
mark it down after 15 minutes, then wait another 10 minutes until it  
is marked out so the recovery can start?


I'd appreciate any insight!

Regards,
Eugen

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Clarification of mon osd communication

2019-01-10 Thread Eugen Block

Hello list,

there are two config options of mon/osd interaction that I don't fully  
understand. Maybe one of you could clarify it for me.



mon osd report timeout
- The grace period in seconds before declaring unresponsive Ceph OSD  
Daemons down. Default 900



mon osd down out interval
- The number of seconds Ceph waits before marking a Ceph OSD Daemon  
down and out if it doesn’t respond. Default 600


I've seen the mon_osd_down_out_interval being hit plenty of times; if
I manually take down an OSD it will be marked out after 10 minutes.


But when exactly will the mon_osd_report_timeout kick in? Does this  
mean that if for some reason one OSD is unresponsive the MON will mark  
it down after 15 minutes, then wait another 10 minutes until it is  
marked out so the recovery can start?


I'd appreciate any insight!

Regards,
Eugen

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Openstack ceph - non bootable volumes

2018-12-20 Thread Eugen Block

Volumes are being created just fine in the "volumes" pool but they are not
bootable.
Also, ephemeral instances are working fine (disks are being created on the
dedicated ceph pool "instances").


That sounds like cinder is missing something regarding glance. So the  
instance is listed as "ACTIVE" but what do you see on the instance?  
Does it show something like "no bootable device" or a similar message?


I don't have Rocky and Mimic, so there might be something else. To  
rule out the qemu layer you could try to launch from a volume with kvm  
like suggested in [1].


Is this a new setup or an upgraded environment? If it's an existing  
environment, did it ever work with volumes? Do you use cephx and have  
you setup rbd_secret on your compute node(s)?
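
If cephx is enabled, libvirt on the compute node also needs the cinder
key defined as a secret; a quick check would be something like this (the
UUID has to match rbd_secret_uuid in your nova/cinder configs):

virsh secret-list
virsh secret-get-value <rbd_secret_uuid>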


Regards,
Eugen

[1] http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-May/031454.html


Zitat von Steven Vacaroaia :


Thanks for your prompt reply

Volumes are being created just fine in the "volumes" pool but they are not
bootable.
Also, ephemeral instances are working fine (disks are being created on the
dedicated ceph pool "instances").

Access for cinder user from compute node is fine

[root@ops-ctrl-new ~(devops)]# rbd --user cinder -k
/etc/ceph/ceph.client.cinder.keyring -p volumes ls
volume-04c3d1a6-e4f0-40bf-a257-0d76c2a6e337
volume-6393ea70-b05d-4926-a2e0-a4a41f8b38e7
volume-82d97331-6b9b-446c-a5a4-b402a5212e4c
volume-9ed2658a-1a9c-4466-a36a-587cc50ed1ff
volume-baa6c928-8ac1-4240-b189-32b444b434a3
volume-c23a69dc-d043-45f7-970d-1eec2ccb10cc
volume-f1872ae6-48e3-4a62-9f46-bf157f079e7f


On Wed, 19 Dec 2018 at 09:25, Eugen Block  wrote:


Hi,

can you explain more detailed what exactly goes wrong?
In many cases it's an authentication error, can you check if your
specified user is allowed to create volumes in the respective pool?

You could try something like this (from compute node):

rbd --user <user> -k
/etc/ceph/ceph.client.OPENSTACK_USER.keyring -p <pool> ls

additionally, try to create an image in that pool. If that works, you
should double check the credentials you created within ceph and
compare them to the credentials in your openstack configs.
That would be my first guess.

Regards,
Eugen

Zitat von Steven Vacaroaia :

> Hi,
>
> I'll appreciated if someone can provide some guidance for
troubleshooting /
> setting up Openstack (rocky) + ceph (mimic)  so that volumes created on
> ceph be bootable
>
> I have followed this http://docs.ceph.com/docs/mimic/rbd/rbd-openstack/
> enabled debug in both nova and cinder but still not been able to figure
it
> out why volumes are not bootable
>
> Many thanks
> Steven



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Openstack ceph - non bootable volumes

2018-12-19 Thread Eugen Block

Hi,

can you explain more detailed what exactly goes wrong?
In many cases it's an authentication error, can you check if your  
specified user is allowed to create volumes in the respective pool?


You could try something like this (from compute node):

rbd --user <user> -k
/etc/ceph/ceph.client.OPENSTACK_USER.keyring -p <pool> ls


additionally, try to create an image in that pool. If that works, you  
should double check the credentials you created within ceph and  
compare them to the credentials in your openstack configs.

That would be my first guess.
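
For example, a quick write test with the same credentials could look
like this (placeholders as above):

rbd --user <user> -k /etc/ceph/ceph.client.<user>.keyring -p <pool> create test-image --size 1G
rbd --user <user> -k /etc/ceph/ceph.client.<user>.keyring -p <pool> rm test-image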

Regards,
Eugen

Zitat von Steven Vacaroaia :


Hi,

I'd appreciate it if someone can provide some guidance for troubleshooting /
setting up Openstack (rocky) + ceph (mimic) so that volumes created on
ceph are bootable.

I have followed this: http://docs.ceph.com/docs/mimic/rbd/rbd-openstack/
and enabled debug in both nova and cinder, but still have not been able to
figure out why volumes are not bootable.

Many thanks
Steven




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 12.2.8: 1 node comes up (noout set), from a 6 nodes cluster -> I/O stuck (rbd usage)

2018-10-19 Thread Eugen Block

No, you do not need to set nobackfill and norecover if you only shut
down one server. The guide you are referencing is about shutting down
everything.

It will not recover degraded PGs if you shut down one server with noout.


You are right, I must have confused something in my memory with the  
recovery I/O and the flags. Thanks for the clarification.


Eugen


Zitat von Paul Emmerich :


No, you do not need to set nobackfill and norecover if you only shut
down one server. The guide you are referencing is about shutting down
everything.

It will not recover degraded PGs if you shut down one server with noout.

Paul
Am Fr., 19. Okt. 2018 um 11:37 Uhr schrieb Eugen Block :


Hi Denny,

the recommendation for ceph maintenance is to set three flags if you
need to shutdown a node (or the entire cluster):

ceph osd set noout
ceph osd set nobackfill
ceph osd set norecover

Although the 'noout' flag seems to be enough for many maintenance
tasks it doesn't prevent the cluster from rebalancing because some PGs
are degraded and the cluster tries to recover.
Of course there are also scenarios where something could go wrong
during the maintenance and a recovery would be required, but there's
always some kind of tradeoff, I guess.

Until now we had no problems with maintenance when these flags were
set, and also no stuck I/O for the clients.

> We never saw this in the past, with the same procedure ...

If this has always worked for you and you suddenly encounter these
problems for the first time there could be something else going on, of
course, but I just wanted to point out the importance of those osd
flags.

Regards,
Eugen

[1] https://ceph.com/planet/how-to-do-a-ceph-cluster-maintenance-shutdown/


Zitat von Denny Fuchs :

> Hi,
>
> today we had an issue with our 6 node Ceph cluster.
>
> We had to shutdown one node (Ceph-03), to replace a disk (because,
> we did now know the slot). We set the noout flag and did a graceful
> shutdown. All was O.K. After the disk was replaced, the node comes
> up and our VMs had a big I/O latency.
> We never saw this in the past, with the same procedure ...
>
> * From our logs on Ceph-01:
>
> 2018-10-18 15:53:45.455743 mon.qh-a07-ceph-osd-03 mon.2
> 10.3.0.3:6789/0 1 : cluster [INF] mon.qh-a07-ceph-  osd-03 calling
> monitor election
> ...
> 2018-10-18 15:53:45.503818 mon.qh-a07-ceph-osd-01 mon.0
> 10.3.0.1:6789/0 1663050 : cluster [INF] mon.qh-a07-ceph-osd-01 is
> new leader, mons
>  
qh-a07-ceph-osd-01,qh-a07-ceph-osd-02,qh-a07-ceph-osd-03,qh-a07-ceph-osd-04,qh-a07-ceph-osd-05,qh

>
> * First OSD comes up:
>
> 2018-10-18 15:53:55.207742 mon.qh-a07-ceph-osd-01 mon.0
> 10.3.0.1:6789/0 1663063 : cluster [WRN] Health check update: 10 osds
> down (OSD_DOWN)
> 2018-10-18 15:53:55.207768 mon.qh-a07-ceph-osd-01 mon.0
> 10.3.0.1:6789/0 1663064 : cluster [INF] Health check cleared:
> OSD_HOST_DOWN (was: 1 host (11 osds) down)
> 2018-10-18 15:53:55.240079 mon.qh-a07-ceph-osd-01 mon.0
> 10.3.0.1:6789/0 1663065 : cluster [INF] osd.43 10.3.0.3:6812/7554 boot
>
> * All OSDs where up:
>
> 2018-10-18 15:54:25.331692 mon.qh-a07-ceph-osd-01 mon.0
> 10.3.0.1:6789/0 1663134 : cluster [INF] Health check cleared:
> OSD_DOWN (was: 1 osds down)
> 2018-10-18 15:54:25.360151 mon.qh-a07-ceph-osd-01 mon.0
> 10.3.0.1:6789/0 1663135 : cluster [INF] osd.12 10.3.0.3:6820/8537 boot
>
> * This OSDs here are a mix of HDD and SDD and different nodes
>
> 2018-10-18 15:54:27.073266 mon.qh-a07-ceph-osd-01 mon.0
> 10.3.0.1:6789/0 1663138 : cluster [WRN] Health check update:
> Degraded data redundancy: 84012/4293867 objects degraded (1.957%),
> 1316 pgs degraded, 487 pgs undersized (PG_DEGRADED)
> 2018-10-18 15:54:32.073644 mon.qh-a07-ceph-osd-01 mon.0
> 10.3.0.1:6789/0 1663142 : cluster [WRN] Health check update:
> Degraded data redundancy: 4611/4293867 objects degraded (0.107%),
> 1219 pgs degraded (PG_DEGRADED)
> 2018-10-18 15:54:36.841189 mon.qh-a07-ceph-osd-01 mon.0
> 10.3.0.1:6789/0 1663144 : cluster [WRN] Health check failed: 1 slow
> requests are blocked > 32 sec. Implicated osds 16 (REQUEST_SLOW)
> 2018-10-18 15:54:37.074098 mon.qh-a07-ceph-osd-01 mon.0
> 10.3.0.1:6789/0 1663145 : cluster [WRN] Health check update:
> Degraded data redundancy: 4541/4293867 objects degraded (0.106%),
> 1216 pgs degraded (PG_DEGRADED)
> 2018-10-18 15:54:42.074510 mon.qh-a07-ceph-osd-01 mon.0
> 10.3.0.1:6789/0 1663149 : cluster [WRN] Health check update:
> Degraded data redundancy: 4364/4293867 objects degraded (0.102%),
> 1176 pgs degraded (PG_DEGRADED)
> 2018-10-18 15:54:42.074561 mon.qh-a07-ceph-osd-01 mon.0
> 10.3.0.1:6789/0 1663150 : cluster [WRN] Health check update: 5 slow
> requests are blocked > 32 sec. Implicated osds 15,25,30,34
> (REQUEST_SLOW)
> 2018-10-18 15:54:47.074886 mon.qh-a07-ceph-osd-01 mon.0
> 10.3.0.1:6

Re: [ceph-users] 12.2.8: 1 node comes up (noout set), from a 6 nodes cluster -> I/O stuck (rbd usage)

2018-10-19 Thread Eugen Block

Hi Denny,

the recommendation for ceph maintenance is to set three flags if you  
need to shutdown a node (or the entire cluster):


ceph osd set noout
ceph osd set nobackfill
ceph osd set norecover

Although the 'noout' flag seems to be enough for many maintenance  
tasks it doesn't prevent the cluster from rebalancing because some PGs  
are degraded and the cluster tries to recover.
Of course there are also scenarios where something could go wrong  
during the maintenance and a recovery would be required, but there's  
always some kind of tradeoff, I guess.


Until now we had no problems with maintenance when these flags were  
set, and also no stuck I/O for the clients.
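
And don't forget to remove the flags again once the node is back and the
cluster has settled:

ceph osd unset norecover
ceph osd unset nobackfill
ceph osd unset noout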



We never saw this in the past, with the same procedure ...


If this has always worked for you and you suddenly encounter these  
problems for the first time there could be something else going on, of  
course, but I just wanted to point out the importance of those osd  
flags.


Regards,
Eugen

[1] https://ceph.com/planet/how-to-do-a-ceph-cluster-maintenance-shutdown/


Zitat von Denny Fuchs :


Hi,

today we had an issue with our 6 node Ceph cluster.

We had to shut down one node (Ceph-03) to replace a disk (because
we did not know the slot). We set the noout flag and did a graceful
shutdown. All was OK. After the disk was replaced, the node came
up and our VMs had a big I/O latency.

We never saw this in the past, with the same procedure ...

* From our logs on Ceph-01:

2018-10-18 15:53:45.455743 mon.qh-a07-ceph-osd-03 mon.2  
10.3.0.3:6789/0 1 : cluster [INF] mon.qh-a07-ceph-  osd-03 calling  
monitor election

...
2018-10-18 15:53:45.503818 mon.qh-a07-ceph-osd-01 mon.0  
10.3.0.1:6789/0 1663050 : cluster [INF] mon.qh-a07-ceph-osd-01 is  
new leader, mons  
qh-a07-ceph-osd-01,qh-a07-ceph-osd-02,qh-a07-ceph-osd-03,qh-a07-ceph-osd-04,qh-a07-ceph-osd-05,qh


* First OSD comes up:

2018-10-18 15:53:55.207742 mon.qh-a07-ceph-osd-01 mon.0  
10.3.0.1:6789/0 1663063 : cluster [WRN] Health check update: 10 osds  
down (OSD_DOWN)
2018-10-18 15:53:55.207768 mon.qh-a07-ceph-osd-01 mon.0  
10.3.0.1:6789/0 1663064 : cluster [INF] Health check cleared:  
OSD_HOST_DOWN (was: 1 host (11 osds) down)
2018-10-18 15:53:55.240079 mon.qh-a07-ceph-osd-01 mon.0  
10.3.0.1:6789/0 1663065 : cluster [INF] osd.43 10.3.0.3:6812/7554 boot


* All OSDs where up:

2018-10-18 15:54:25.331692 mon.qh-a07-ceph-osd-01 mon.0  
10.3.0.1:6789/0 1663134 : cluster [INF] Health check cleared:  
OSD_DOWN (was: 1 osds down)
2018-10-18 15:54:25.360151 mon.qh-a07-ceph-osd-01 mon.0  
10.3.0.1:6789/0 1663135 : cluster [INF] osd.12 10.3.0.3:6820/8537 boot


* This OSDs here are a mix of HDD and SDD and different nodes

2018-10-18 15:54:27.073266 mon.qh-a07-ceph-osd-01 mon.0  
10.3.0.1:6789/0 1663138 : cluster [WRN] Health check update:  
Degraded data redundancy: 84012/4293867 objects degraded (1.957%),  
1316 pgs degraded, 487 pgs undersized (PG_DEGRADED)
2018-10-18 15:54:32.073644 mon.qh-a07-ceph-osd-01 mon.0  
10.3.0.1:6789/0 1663142 : cluster [WRN] Health check update:  
Degraded data redundancy: 4611/4293867 objects degraded (0.107%),  
1219 pgs degraded (PG_DEGRADED)
2018-10-18 15:54:36.841189 mon.qh-a07-ceph-osd-01 mon.0  
10.3.0.1:6789/0 1663144 : cluster [WRN] Health check failed: 1 slow  
requests are blocked > 32 sec. Implicated osds 16 (REQUEST_SLOW)
2018-10-18 15:54:37.074098 mon.qh-a07-ceph-osd-01 mon.0  
10.3.0.1:6789/0 1663145 : cluster [WRN] Health check update:  
Degraded data redundancy: 4541/4293867 objects degraded (0.106%),  
1216 pgs degraded (PG_DEGRADED)
2018-10-18 15:54:42.074510 mon.qh-a07-ceph-osd-01 mon.0  
10.3.0.1:6789/0 1663149 : cluster [WRN] Health check update:  
Degraded data redundancy: 4364/4293867 objects degraded (0.102%),  
1176 pgs degraded (PG_DEGRADED)
2018-10-18 15:54:42.074561 mon.qh-a07-ceph-osd-01 mon.0  
10.3.0.1:6789/0 1663150 : cluster [WRN] Health check update: 5 slow  
requests are blocked > 32 sec. Implicated osds 15,25,30,34  
(REQUEST_SLOW)
2018-10-18 15:54:47.074886 mon.qh-a07-ceph-osd-01 mon.0  
10.3.0.1:6789/0 1663152 : cluster [WRN] Health check update:  
Degraded data redundancy: 4193/4293867 objects degraded (0.098%),  
1140 pgs degraded (PG_DEGRADED)
2018-10-18 15:54:47.074934 mon.qh-a07-ceph-osd-01 mon.0  
10.3.0.1:6789/0 1663153 : cluster [WRN] Health check update: 5 slow  
requests are blocked > 32 sec. Implicated osds 9,15,23,30  
(REQUEST_SLOW)
2018-10-18 15:54:52.075274 mon.qh-a07-ceph-osd-01 mon.0  
10.3.0.1:6789/0 1663156 : cluster [WRN] Health check update:  
Degraded data redundancy: 4087/4293867 objects degraded (0.095%),  
1120 pgs degraded (PG_DEGRADED)
2018-10-18 15:54:52.075313 mon.qh-a07-ceph-osd-01 mon.0  
10.3.0.1:6789/0 1663157 : cluster [WRN] Health check update: 14 slow  
requests are blocked > 32 sec. Implicated osds 2,13,14,15,16,23  
(REQUEST_SLOW)
2018-10-18 15:54:57.075635 mon.qh-a07-ceph-osd-01 mon.0  
10.3.0.1:6789/0 1663158 : cluster [WRN] Health check update:  
Degraded data 

Re: [ceph-users] Luminous with osd flapping, slow requests when deep scrubbing

2018-10-15 Thread Eugen Block

Hi Andrei,

we have been using the script from [1] to define the number of PGs to
deep-scrub in parallel. We currently use MAXSCRUBS=4; you could start
with 1 to minimize the performance impact.


And these are the scrub settings from our ceph.conf:

ceph:~ # grep scrub /etc/ceph/ceph.conf
osd_scrub_begin_hour = 0
osd_scrub_end_hour = 7
osd_scrub_sleep = 0.1
osd_deep_scrub_interval = 2419200

The osd_deep_scrub_interval is set to 4 weeks so that it doesn't
interfere with our own schedule defined by the cronjob, which
deep-scrubs a quarter of the PGs four times a week, so every PG gets
deep-scrubbed within one week.
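
For reference, a minimal sketch of what such a cronjob could do (this
is only an assumption of how the referenced script works; the JSON
layout of 'ceph pg dump' differs between releases and jq is assumed
to be installed, so adapt it before using it):

# pick the 4 PGs with the oldest deep-scrub timestamp and kick them off
ceph pg dump --format json 2>/dev/null \
  | jq -r '.pg_stats | sort_by(.last_deep_scrub_stamp) | .[0:4][].pgid' \
  | while read -r pg; do ceph pg deep-scrub "$pg"; done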


Regards,
Eugen

[1]  
https://www.formann.de/2015/05/cronjob-to-enable-timed-deep-scrubbing-in-a-ceph-cluster/



Zitat von Andrei Mikhailovsky :


Hello,

I am currently running Luminous 12.2.8 on Ubuntu with the
4.15.0-36-generic kernel from the official ubuntu repo. The cluster
has 4 mon + osd servers. Each osd server has a total of 9 spinning
osds and 1 ssd for the hdd and ssd pools. The hdds are backed by
S3710 ssds for journaling with a ratio of 1:5. The ssd pool osds
are not using external journals. Ceph is used as a Primary storage
for Cloudstack - all vm disk images are stored on the cluster.


I have recently migrated all osds to bluestore, which was a long
process with ups and downs, but I am happy to say that the migration
is done. During the migration I disabled scrubbing (both deep and
standard). After re-enabling scrubbing I noticed the cluster started
having a large number of slow requests and poor client IO (to the
point of vms stalling for minutes). Further investigation showed that
the slow requests happen because of osds flapping. In a single day my
logs have over 1000 entries reporting an osd going down. This affects
random osds. Disabling deep scrubbing stabilises the cluster, the
osds no longer flap and the slow requests disappear. As a short-term
solution I've disabled deep scrubbing, but I was hoping to fix the
issue with your help.


At the moment, I am running the cluster with default settings apart  
from the following settings:


[global]
osd_disk_thread_ioprio_priority = 7
osd_disk_thread_ioprio_class = idle
osd_recovery_op_priority = 1

[osd]
debug_ms = 0
debug_auth = 0
debug_osd = 0
debug_bluestore = 0
debug_bluefs = 0
debug_bdev = 0
debug_rocksdb = 0


Could you share experiences with deep scrubbing of bluestore osds?  
Are there any options that I should set to make sure the osds are  
not flapping and the client IO is still available?


Thanks

Andrei




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Does anyone use interactive CLI mode?

2018-10-11 Thread Eugen Block
I only tried to use the Ceph CLI once out of curiosity, simply because  
it is there, but I don't really benefit from it.
Usually when I'm working with clusters it requires a combination of  
different commands (rbd, rados, ceph etc.), so this would mean either  
exiting and entering the CLI back and forth or switching between  
different terminal windows/tabs. Running all the necessary commands  
from the same terminal is much more comfortable and auto completion  
makes it even easier.
That's why I agree with the previous comments: I wouldn't mind if the
CLI were removed.


Regards,
Eugen


Zitat von Brady Deetz :


I run 2 clusters and have never purposely executed the interactive CLI. I
say remove the code bloat.

On Wed, Oct 10, 2018 at 9:20 AM John Spray  wrote:


Hi all,

Since time immemorial, the Ceph CLI has had a mode where when run with
no arguments, you just get an interactive prompt that lets you run
commands without "ceph" at the start.

I recently discovered that we actually broke this in Mimic[1], and it
seems that nobody noticed!

So the question is: does anyone actually use this feature?  It's not
particularly expensive to maintain, but it might be nice to have one
less path through the code if this is entirely unused.

Cheers,
John

1. https://github.com/ceph/ceph/pull/24521
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mds_cache_memory_limit value

2018-10-05 Thread Eugen Block

Hi,

you can monitor the cache size and see if the new values are applied:

ceph@mds:~> ceph daemon mds.<name> cache status
{
"pool": {
"items": 106708834,
"bytes": 5828227058
}
}

You should also see in top (or similar tools) that the memory  
increases/decreases. From my experience the new config value is  
applied immediately.
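
If you want to be sure which value the daemon is actually running
with, you can also query it via the admin socket, e.g. (the mds name
is just a placeholder):

ceph daemon mds.<name> config get mds_cache_memory_limit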


Regards,
Eugen


Zitat von Hervé Ballans :


Hi all,

I have just configured a new value for 'mds_cache_memory_limit'. The
output message says "not observed, change may require restart".
So I'm not really sure: has the new value been taken into account
directly, or do I have to restart the mds daemons on each MDS node?


$ sudo ceph tell mds.* injectargs '--mds_cache_memory_limit 17179869184';
2018-10-04 16:25:11.692131 7f3012ffd700  0 client.2226325  
ms_handle_reset on IP1:6804/2649460488
2018-10-04 16:25:11.714746 7f3013fff700  0 client.4154799  
ms_handle_reset on IP1:6804/2649460488
mds.mon1: mds_cache_memory_limit = '17179869184' (not observed,  
change may require restart)
2018-10-04 16:25:11.725028 7f3012ffd700  0 client.4154802  
ms_handle_reset on IP0:6805/997393445
2018-10-04 16:25:11.748790 7f3013fff700  0 client.4154805  
ms_handle_reset on IP0:6805/997393445
mds.mon0: mds_cache_memory_limit = '17179869184' (not observed,  
change may require restart)
2018-10-04 16:25:11.760127 7f3012ffd700  0 client.2226334  
ms_handle_reset on IP2:6801/2590484227
2018-10-04 16:25:11.787951 7f3013fff700  0 client.2226337  
ms_handle_reset on IP2:6801/2590484227
mds.mon2: mds_cache_memory_limit = '17179869184' (not observed,  
change may require restart)


Thanks,
Hervé




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ACL '+' not shown in 'ls' on kernel cephfs mount

2018-09-26 Thread Eugen Block

Hi,

I can confirm this for:

ceph --version
ceph version 12.2.5-419-g8cbf63d997  
(8cbf63d997fb5cdc783fe7bfcd4f5032ee140c0c) luminous (stable)


Setting ACLs on a file works as expected (restrict file access to  
specific user), getfacl displays correct information, but 'ls -la'  
does not indicate that there are ACLs on that file.



Zitat von Chad W Seys :


Hi all,
   It appears as though the '+' which indicates an extended ACL is not
shown when running 'ls' on a cephfs mounted via the kernel client.

# ls -al
total 9
drwxrwxr-x+ 4 root smbadmin4096 Aug 13 10:14 .
drwxrwxr-x  5 root smbadmin4096 Aug 17 09:37 ..
dr-xr-xr-x  4 root root   3 Sep 11 09:50 timemachine
dr-xr-xr-x+ 3 root root 2114912 Aug 15 14:09 winbak

Two directories, timemachine is mounted by kernel cephfs module:

df -h timemachine/
Filesystem                                                           Size  Used Avail Use% Mounted on
128.104.164.197,128.104.164.198,10.128.198.51:/backups/timemachine  8.7T  3.2T  5.6T  37% /srv/smb/timemachine

and winbak is ceph-fuse mounted:

df -h winbak/
Filesystem  Size  Used Avail Use% Mounted on
ceph-fuse   8.7T  3.2T  5.6T  37% /srv/smb/winbak

You can see in the 'ls' output that winbak has a '+' (indicating ACL).

timemachine also has ACL but no '+':

getfacl timemachine/
# file: timemachine/
# owner: root
# group: root
user::r-x
user:backupadmin:r-x
group::r-x
group:wheel:rwx
mask::rwx
other::r-x
default:user::rwx
default:user:backupadmin:r-x
default:group::r-x
default:group:wheel:rwx
default:mask::rwx
default:other::r-x

Known bug? My bug?

Thanks!
Chad.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: [Ceph-community] After Mimic upgrade OSD's stuck at booting.

2018-09-26 Thread Eugen Block

Hi,

I'm not sure how the recovery "still works" with the flag "norecover".
Anyway, I think you should unset the flags norecover, nobackfill. Even  
if not all OSDs come back up you should allow the cluster to backfill  
PGs. I'm not sure, but unsetting norebalance could also be useful;
that can be done step by step - first watch whether the cluster gets
any better without it.
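
For reference, unsetting the flags would look like this (one at a
time, watching the cluster in between):

ceph osd unset norecover
ceph osd unset nobackfill
ceph osd unset norebalance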



And can you check the plan "peetaur2" offered from IRC:
https://bpaste.net/show/20581774ff08
Also Be_El strongly offers to unset nodown parameter.


The suggested config settings look reasonable to me. You should also
try to raise the timeouts for the MONs and increase their db cache as  
suggested earlier today.


after this point, if an osd is down, it's fine...it'll only prevent  
access to that specific data (bad for clients, fine for recovery)


I agree with that, the cluster state has to become stable first, then  
you can take a look into those OSDs that won't get up.


Regards,
Eugen


Zitat von by morphin :


Hello Eugen. Thank you for your answer. I was losing hope of getting
an answer here.

I have faced losing 2 of 3 mons many times, but I never had a problem
like this on luminous.
The recovery is still running and it has been 30 hours. The last state of
my cluster is: https://paste.ubuntu.com/p/rDNHCcNG7P/
We are discussing on IRC whether we should unset the nodown and norecover flags.

I tried unsetting the nodown flag yesterday and now I have 15 osds that do
not start anymore, all with the same error --> : https://paste.ubuntu.com/p/94xpzxTSnr/
I don't know the reason for this, but I saw some commits for the
dump problem. Is this a bug or something else?

And can you check the plan "peetaur2" offered from IRC:
https://bpaste.net/show/20581774ff08
Also Be_El strongly offers to unset nodown parameter.
What do you think?
Eugen Block , 26 Eyl 2018 Çar, 12:54 tarihinde şunu yazdı:


Hi,

could this be related to this other Mimic upgrade thread [1]? Your
failing MONs sound a bit like the problem described there, eventually
the user reported recovery success. You could try the described steps:

  - disable cephx auth with 'auth_cluster_required = none'
  - set the mon_osd_cache_size = 20 (default 10)
  - Setting 'osd_heartbeat_interval = 30'
  - setting 'mon_lease = 75'
  - increase the rocksdb_cache_size and leveldb_cache_size on the mons
to be big enough to cache the entire db

I just copied the mentioned steps, so please read the thread before
applying anything.

Regards,
Eugen

[1]
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/030018.html


Zitat von by morphin :

> After I tried too many things with so many helps on IRC. My pool
> health is still in ERROR and I think I can't recover from this.
> https://paste.ubuntu.com/p/HbsFnfkYDT/
> At the end 2 of 3 mons crashed and started at same time and the pool
> is offlined. Recovery takes more than 12hours and it is way too slow.
> Somehow recovery seems to be not working.
>
> If I can reach my data I will re-create the pool easily.
> If I run ceph-object-tool script to regenerate mon store.db can I
> acccess the RBD pool again?
> by morphin , 25 Eyl 2018 Sal, 20:03
> tarihinde şunu yazdı:
>>
>> Hi,
>>
>> Cluster is still down :(
>>
>> Up to not we have managed to compensate the OSDs. 118s of 160 OSD are
>> stable and cluster is still in the progress of settling. Thanks for
>> the guy Be-El in the ceph IRC channel. Be-El helped a lot to make
>> flapping OSDs stable.
>>
>> What we learned up now is that this is the cause of unsudden death of
>> 2 monitor servers of 3. And when they come back if they do not start
>> one by one (each after joining cluster) this can happen. Cluster can
>> be unhealty and it can take countless hour to come back.
>>
>> Right now here is our status:
>> ceph -s : https://paste.ubuntu.com/p/6DbgqnGS7t/
>> health detail: https://paste.ubuntu.com/p/w4gccnqZjR/
>>
>> Since OSDs disks are NL-SAS it can take up to 24 hours for an online
>> cluster. What is most it has been said that we could be extremely luck
>> if all the data is rescued.
>>
>> Most unhappily our strategy is just to sit and wait :(. As soon as the
>> peering and activating count drops to 300-500 pgs we will restart the
>> stopped OSDs one by one. For each OSD and we will wait the cluster to
>> settle down. The amount of data stored is OSD is 33TB. Our most
>> concern is to export our rbd pool data outside to a backup space. Then
>> we will start again with clean one.
>>
>> I hope to justify our analysis with an expert. Any help or advise
>> would be greatly appreciated.
>> by morphin , 25 Eyl 2018 Sal, 15:08
>> tarihinde şunu yazdı:
>> >
>> > After reducing the recovery param

Re: [ceph-users] Bluestore DB showing as ssd

2018-09-26 Thread Eugen Block

Hi,

how did you create the OSDs? Were they built from scratch with the  
respective command options (--block.db /dev/)?

You could check what the bluestore tool tells you about the block.db:

ceph1:~ # ceph-bluestore-tool show-label --dev  
/var/lib/ceph/osd/ceph-21/block | grep path

"path_block.db": "/dev/ceph-journals/bluefsdb-21",

Does it point to the right device(s)?

Regards,
Eugen

Zitat von Hervé Ballans :


Hi,

By testing the command on my side, it gives me the right information  
(modulo the fact that the disk is a nvme and not ssd) :


# ceph osd metadata 1 |grep bluefs_db
    "bluefs_db_access_mode": "blk",
    "bluefs_db_block_size": "4096",
    "bluefs_db_dev": "259:3",
    "bluefs_db_dev_node": "nvme0n1",
    "bluefs_db_driver": "KernelDevice",
    "bluefs_db_model": "Dell Express Flash PM1725a 800GB SFF    ",
    "bluefs_db_partition_path": "/dev/nvme0n1p2",
    "bluefs_db_rotational": "0",
    "bluefs_db_serial": "  S39YNX0JC02801",
    "bluefs_db_size": "80016834560",
    "bluefs_db_type": "ssd",

I notice that in your case, the db_model returns the information on  
the PERC card and not on the disk...

Maybe this is where the missing information comes from ?

Le 22/09/2018 à 01:24, Brett Chancellor a écrit :
Hi all. Quick question about osd metadata information. I have  
several OSDs setup with the data dir on HDD and the db going to a  
partition on ssd. But when I look at the metadata for all the OSDs,  
it's showing the db as "hdd". Does this effect anything? And is  
there anyway to change it?


$ sudo ceph osd metadata 1
{
    "id": 1,
    "arch": "x86_64",
    "back_addr": ":6805/2053608",
    "back_iface": "eth0",
    "bluefs": "1",
    "bluefs_db_access_mode": "blk",
    "bluefs_db_block_size": "4096",
    "bluefs_db_dev": "8:80",
    "bluefs_db_dev_node": "sdf",
    "bluefs_db_driver": "KernelDevice",
    "bluefs_db_model": "PERC H730 Mini  ",
    "bluefs_db_partition_path": "/dev/sdf2",
    "bluefs_db_rotational": "1",
    "bluefs_db_size": "266287972352",
*    "bluefs_db_type": "hdd",*




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: [Ceph-community] After Mimic upgrade OSD's stuck at booting.

2018-09-26 Thread Eugen Block

Hi,

could this be related to this other Mimic upgrade thread [1]? Your  
failing MONs sound a bit like the problem described there, eventually  
the user reported recovery success. You could try the described steps:


 - disable cephx auth with 'auth_cluster_required = none'
 - set the mon_osd_cache_size = 20 (default 10)
 - Setting 'osd_heartbeat_interval = 30'
 - setting 'mon_lease = 75'
 - increase the rocksdb_cache_size and leveldb_cache_size on the mons  
to be big enough to cache the entire db


I just copied the mentioned steps, so please read the thread before  
applying anything.
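
As a ceph.conf sketch of those settings (values are taken straight
from the steps above; the section placement and the cache sizes are
my assumptions - size the caches to hold the whole mon DB):

[global]
auth_cluster_required = none

[mon]
mon_osd_cache_size = 20
mon_lease = 75
rocksdb_cache_size = 2147483648
leveldb_cache_size = 2147483648

[osd]
osd_heartbeat_interval = 30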


Regards,
Eugen

[1]  
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/030018.html



Zitat von by morphin :


After trying so many things with so much help on IRC, my pool
health is still in ERROR and I think I can't recover from this.
https://paste.ubuntu.com/p/HbsFnfkYDT/
In the end 2 of 3 mons crashed and started at the same time, and the pool
went offline. Recovery has taken more than 12 hours and it is way too slow.
Somehow recovery seems not to be working.

If I can reach my data I will re-create the pool easily.
If I run the ceph-objectstore-tool procedure to regenerate the mon
store.db, can I access the RBD pool again?
by morphin , 25 Eyl 2018 Sal, 20:03
tarihinde şunu yazdı:


Hi,

Cluster is still down :(

Up to now we have managed to get the OSDs under control. 118 of 160 OSDs
are stable and the cluster is still in the process of settling. Thanks to
the guy Be-El in the ceph IRC channel - Be-El helped a lot to make the
flapping OSDs stable.

What we have learned so far is that this was caused by the sudden death of
2 of the 3 monitor servers. And when they come back, if they do not start
one by one (each after joining the cluster), this can happen. The cluster
can become unhealthy and it can take countless hours to come back.

Right now here is our status:
ceph -s : https://paste.ubuntu.com/p/6DbgqnGS7t/
health detail: https://paste.ubuntu.com/p/w4gccnqZjR/

Since the OSD disks are NL-SAS it can take up to 24 hours to get the
cluster online again. What is more, it has been said that we would be
extremely lucky if all the data is rescued.

Most unhappily, our strategy is just to sit and wait :(. As soon as the
peering and activating count drops to 300-500 pgs we will restart the
stopped OSDs one by one, and after each OSD we will wait for the cluster to
settle down. The amount of data stored in the OSDs is 33TB. Our main
concern is to export our rbd pool data to a backup space outside. Then
we will start again with a clean one.

I hope to verify our analysis with an expert. Any help or advice
would be greatly appreciated.
by morphin , 25 Eyl 2018 Sal, 15:08
tarihinde şunu yazdı:
>
> After reducing the recovery parameter values did not change much.
> There are a lot of OSD still marked down.
>
> I don't know what I need to do after this point.
>
> [osd]
> osd recovery op priority = 63
> osd client op priority = 1
> osd recovery max active = 1
> osd max scrubs = 1
>
>
> ceph -s
>   cluster:
> id: 89569e73-eb89-41a4-9fc9-d2a5ec5f4106
> health: HEALTH_ERR
> 42 osds down
> 1 host (6 osds) down
> 61/8948582 objects unfound (0.001%)
> Reduced data availability: 3837 pgs inactive, 1822 pgs
> down, 1900 pgs peering, 6 pgs stale
> Possible data damage: 18 pgs recovery_unfound
> Degraded data redundancy: 457246/17897164 objects degraded
> (2.555%), 213 pgs degraded, 209 pgs undersized
> 2554 slow requests are blocked > 32 sec
> 3273 slow ops, oldest one blocked for 1453 sec, daemons
>  
[osd.0,osd.1,osd.10,osd.100,osd.101,osd.102,osd.103,osd.104,osd.105,osd.106]...

> have slow ops.
>
>   services:
> mon: 3 daemons, quorum SRV-SEKUARK3,SRV-SBKUARK2,SRV-SBKUARK3
> mgr: SRV-SBKUARK2(active), standbys: SRV-SEKUARK2, SRV-SEKUARK3,
> SRV-SEKUARK4
> osd: 168 osds: 118 up, 160 in
>
>   data:
> pools:   1 pools, 4096 pgs
> objects: 8.95 M objects, 17 TiB
> usage:   33 TiB used, 553 TiB / 586 TiB avail
> pgs: 93.677% pgs not active
>  457246/17897164 objects degraded (2.555%)
>  61/8948582 objects unfound (0.001%)
>  1676 down
>  1372 peering
>  528  stale+peering
>  164  active+undersized+degraded
>  145  stale+down
>  73   activating
>  40   active+clean
>  29   stale+activating
>  17   active+recovery_unfound+undersized+degraded
>  16   stale+active+clean
>  16   stale+active+undersized+degraded
>  9activating+undersized+degraded
>  3active+recovery_wait+degraded
>  2activating+undersized
>  2activating+degraded
>  1creating+down
>  1stale+active+recovery_unfound+undersized+degraded
>  1stale+active+clean+scrubbing+deep
>  1stale+active+recovery_wait+degraded
>
> ceph -w: 

Re: [ceph-users] Fwd: [Ceph-community] After Mimic upgrade OSD's stuck at booting.

2018-09-25 Thread Eugen Block
I would try to reduce recovery to a minimum, something like this  
helped us in a small cluster (25 OSDs on 3 hosts) in case of
recovery while operation continued without impact:


ceph tell 'osd.*' injectargs '--osd-recovery-max-active 2'
ceph tell 'osd.*' injectargs '--osd-max-backfills 8'

Regards,
Eugen


Zitat von by morphin :


After reducing the recovery parameter values, not much changed.
There are a lot of OSDs still marked down.

I don't know what I need to do after this point.

[osd]
osd recovery op priority = 63
osd client op priority = 1
osd recovery max active = 1
osd max scrubs = 1


ceph -s
  cluster:
id: 89569e73-eb89-41a4-9fc9-d2a5ec5f4106
health: HEALTH_ERR
42 osds down
1 host (6 osds) down
61/8948582 objects unfound (0.001%)
Reduced data availability: 3837 pgs inactive, 1822 pgs
down, 1900 pgs peering, 6 pgs stale
Possible data damage: 18 pgs recovery_unfound
Degraded data redundancy: 457246/17897164 objects degraded
(2.555%), 213 pgs degraded, 209 pgs undersized
2554 slow requests are blocked > 32 sec
3273 slow ops, oldest one blocked for 1453 sec, daemons
[osd.0,osd.1,osd.10,osd.100,osd.101,osd.102,osd.103,osd.104,osd.105,osd.106]...
have slow ops.

  services:
mon: 3 daemons, quorum SRV-SEKUARK3,SRV-SBKUARK2,SRV-SBKUARK3
mgr: SRV-SBKUARK2(active), standbys: SRV-SEKUARK2, SRV-SEKUARK3,
SRV-SEKUARK4
osd: 168 osds: 118 up, 160 in

  data:
pools:   1 pools, 4096 pgs
objects: 8.95 M objects, 17 TiB
usage:   33 TiB used, 553 TiB / 586 TiB avail
pgs: 93.677% pgs not active
 457246/17897164 objects degraded (2.555%)
 61/8948582 objects unfound (0.001%)
 1676 down
 1372 peering
 528  stale+peering
 164  active+undersized+degraded
 145  stale+down
 73   activating
 40   active+clean
 29   stale+activating
 17   active+recovery_unfound+undersized+degraded
 16   stale+active+clean
 16   stale+active+undersized+degraded
 9activating+undersized+degraded
 3active+recovery_wait+degraded
 2activating+undersized
 2activating+degraded
 1creating+down
 1stale+active+recovery_unfound+undersized+degraded
 1stale+active+clean+scrubbing+deep
 1stale+active+recovery_wait+degraded

ceph -w: https://paste.ubuntu.com/p/WZ2YqzS86S/
ceph health detail: https://paste.ubuntu.com/p/8w7Jpms8fj/
by morphin , 25 Eyl 2018 Sal, 14:32
tarihinde şunu yazdı:


The config didn't work, because increasing the numbers led to
more OSD drops.


bhfs -s
  cluster:
id: 89569e73-eb89-41a4-9fc9-d2a5ec5f4106
health: HEALTH_ERR
norebalance,norecover flag(s) set
1 osds down
17/8839434 objects unfound (0.000%)
Reduced data availability: 3578 pgs inactive, 861 pgs
down, 1928 pgs peering, 11 pgs stale
Degraded data redundancy: 44853/17678868 objects degraded
(0.254%), 221 pgs degraded, 20 pgs undersized
610 slow requests are blocked > 32 sec
3996 stuck requests are blocked > 4096 sec
6076 slow ops, oldest one blocked for 4129 sec, daemons
[osd.0,osd.1,osd.10,osd.100,osd.101,osd.102,osd.103,osd.104,osd.105,osd.106]...
have slow ops.

  services:
mon: 3 daemons, quorum SRV-SEKUARK3,SRV-SBKUARK2,SRV-SBKUARK3
mgr: SRV-SBKUARK2(active), standbys: SRV-SEKUARK2, SRV-SEKUARK3
osd: 168 osds: 128 up, 129 in; 2 remapped pgs
 flags norebalance,norecover

  data:
pools:   1 pools, 4096 pgs
objects: 8.84 M objects, 17 TiB
usage:   26 TiB used, 450 TiB / 477 TiB avail
pgs: 0.024% pgs unknown
 89.160% pgs not active
 44853/17678868 objects degraded (0.254%)
 17/8839434 objects unfound (0.000%)
 1612 peering
 720  down
 583  activating
 319  stale+peering
 255  active+clean
 157  stale+activating
 108  stale+down
 95   activating+degraded
 84   stale+active+clean
 50   active+recovery_wait+degraded
 29   creating+down
 23   stale+activating+degraded
 18   stale+active+recovery_wait+degraded
 14   active+undersized+degraded
 12   active+recovering+degraded
 4stale+creating+down
 3stale+active+recovering+degraded
 3stale+active+undersized+degraded
 2stale
 1active+recovery_wait+undersized+degraded
 1active+clean+scrubbing+deep
 1unknown
 1active+undersized+degraded+remapped+backfilling
 1active+recovering+undersized+degraded

I guess 

Re: [ceph-users] bluestore osd journal move

2018-09-24 Thread Eugen Block

Hi,

I am wondering if it is possible to move the ssd journal for the  
bluestore osd? I would like to move it from one ssd drive to another.


yes, this question has been asked several times.

Depending on your deployment there are several things to be aware of,  
maybe you should first read [1] before doing anything.


A couple of months ago we had to replace a failed SSD which held  
several journals for OSDs. We wrote a blog post [2] about it with a  
description of the required steps. Please be aware that those steps  
did work for us, there's no guarantee that they will work for you.  
Read it very carefully and recheck every step before executing it.
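
Before touching anything it's also worth double-checking where the
block.db currently lives, e.g. (the OSD id is just an example):

readlink -f /var/lib/ceph/osd/ceph-21/block.db
ceph-bluestore-tool show-label --dev /var/lib/ceph/osd/ceph-21/block | grep path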


Regards,
Eugen

[1]  
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-February/024913.html
[2]  
http://heiterbiswolkig.blogs.nde.ag/2018/04/08/migrating-bluestores-block-db/



Zitat von Andrei Mikhailovsky :


Hello everyone,

I am wondering if it is possible to move the ssd journal for the  
bluestore osd? I would like to move it from one ssd drive to another.


Thanks




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG stuck incomplete

2018-09-21 Thread Eugen Block

I also switched the cache tier to "readproxy", to avoid using this
cache. But, it's still blocked.


You could change the cache mode to "none" to disable it. Could you  
paste the output of:


ceph osd pool ls detail | grep cache-bkp-foo



Zitat von Olivier Bonvalet :


In fact, one object (only one) seems to be blocked on the cache tier
(writeback).

I tried to flush the cache with "rados -p cache-bkp-foo cache-flush-
evict-all", but it blocks on the object
"rbd_data.f66c92ae8944a.000f2596".

So I reduced (a lot) the cache tier to 200MB, "rados -p cache-bkp-foo
ls" now show only 3 objects :

rbd_directory
rbd_data.f66c92ae8944a.000f2596
rbd_header.f66c92ae8944a

And "cache-flush-evict-all" still hangs.

I also switched the cache tier to "readproxy", to avoid using this
cache. But, it's still blocked.




Le vendredi 21 septembre 2018 à 02:14 +0200, Olivier Bonvalet a écrit :

Hello,

on a Luminous cluster, I have a PG incomplete and I can't find how to
fix that.

It's an EC pool (4+2) :

pg 37.9c is incomplete, acting [32,50,59,1,0,75] (reducing pool
bkp-sb-raid6 min_size from 4 may help; search ceph.com/docs for
'incomplete')

Of course, we can't reduce min_size from 4.

And the full state : https://pastebin.com/zrwu5X0w

So, IO is blocked and we can't access those damaged data.
OSDs block too:
osds 32,68,69 have stuck requests > 4194.3 sec

OSD 32 is the primary of this PG.
And OSD 68 and 69 are for cache tiering.

Any idea how I can fix that?

Thanks,

Olivier


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG stuck incomplete

2018-09-21 Thread Eugen Block

I tried to flush the cache with "rados -p cache-bkp-foo cache-flush-
evict-all", but it blocks on the object
"rbd_data.f66c92ae8944a.000f2596".


This is the object that's stuck in the cache tier (according to your  
output in https://pastebin.com/zrwu5X0w). Can you verify if that block  
device is in use and healthy or is it corrupt?



Zitat von Maks Kowalik :


Could you, please paste the output of pg 37.9c query

pt., 21 wrz 2018 o 14:39 Olivier Bonvalet  napisał(a):


In fact, one object (only one) seems to be blocked on the cache tier
(writeback).

I tried to flush the cache with "rados -p cache-bkp-foo cache-flush-
evict-all", but it blocks on the object
"rbd_data.f66c92ae8944a.000f2596".

So I reduced (a lot) the cache tier to 200MB, "rados -p cache-bkp-foo
ls" now show only 3 objects :

rbd_directory
rbd_data.f66c92ae8944a.000f2596
rbd_header.f66c92ae8944a

And "cache-flush-evict-all" still hangs.

I also switched the cache tier to "readproxy", to avoid using this
cache. But, it's still blocked.




Le vendredi 21 septembre 2018 à 02:14 +0200, Olivier Bonvalet a écrit :
> Hello,
>
> on a Luminous cluster, I have a PG incomplete and I can't find how to
> fix that.
>
> It's an EC pool (4+2) :
>
> pg 37.9c is incomplete, acting [32,50,59,1,0,75] (reducing pool
> bkp-sb-raid6 min_size from 4 may help; search ceph.com/docs for
> 'incomplete')
>
> Of course, we can't reduce min_size from 4.
>
> And the full state : https://pastebin.com/zrwu5X0w
>
> So, IO are blocked, we can't access thoses damaged data.
> OSD blocks too :
> osds 32,68,69 have stuck requests > 4194.3 sec
>
> OSD 32 is the primary of this PG.
> And OSD 68 and 69 are for cache tiering.
>
> Any idea how can I fix that ?
>
> Thanks,
>
> Olivier
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG stuck incomplete

2018-09-21 Thread Eugen Block

cache-tier on this pool have 26GB of data (for 5.7TB of data on the EC
pool).
We tried to flush the cache tier, and restart OSD 68 & 69, without any
success.


I meant the replication size of the pool

ceph osd pool ls detail | grep 

In the experimental state of our cluster we had a cache tier (for rbd  
pool) with size 2, that can cause problems during recovery. Since only  
OSDs 68 and 69 are mentioned I was wondering if your cache tier also  
has size 2.



Zitat von Olivier Bonvalet :


Hi,

cache-tier on this pool have 26GB of data (for 5.7TB of data on the EC
pool).
We tried to flush the cache tier, and restart OSD 68 & 69, without any
success.

But I don't see any related data on cache-tier OSD (filestore) with :

find /var/lib/ceph/osd/ -maxdepth 3 -name '*37.9c*'


I don't see any useful information in the logs. Maybe I should increase
the log level?

Thanks,

Olivier


Le vendredi 21 septembre 2018 à 09:34 +0000, Eugen Block a écrit :

Hi Olivier,

what size does the cache tier have? You could set cache-mode to
forward and flush it, maybe restarting those OSDs (68, 69) helps,
too.
Or there could be an issue with the cache tier, what do those logs
say?

Regards,
Eugen


Zitat von Olivier Bonvalet :

> Hello,
>
> on a Luminous cluster, I have a PG incomplete and I can't find how
> to
> fix that.
>
> It's an EC pool (4+2) :
>
> pg 37.9c is incomplete, acting [32,50,59,1,0,75] (reducing pool
> bkp-sb-raid6 min_size from 4 may help; search ceph.com/docs for
> 'incomplete')
>
> Of course, we can't reduce min_size from 4.
>
> And the full state : https://pastebin.com/zrwu5X0w
>
> So, IO are blocked, we can't access thoses damaged data.
> OSD blocks too :
> osds 32,68,69 have stuck requests > 4194.3 sec
>
> OSD 32 is the primary of this PG.
> And OSD 68 and 69 are for cache tiering.
>
> Any idea how can I fix that ?
>
> Thanks,
>
> Olivier
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG stuck incomplete

2018-09-21 Thread Eugen Block

Hi Olivier,

what size does the cache tier have? You could set cache-mode to  
forward and flush it, maybe restarting those OSDs (68, 69) helps, too.  
Or there could be an issue with the cache tier, what do those logs say?
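
For example (the pool name is a placeholder for your cache pool, and
as far as I remember 'forward' asks for the extra flag, so double-check
against your version):

ceph osd tier cache-mode <cache-pool> forward --yes-i-really-mean-it
rados -p <cache-pool> cache-flush-evict-all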


Regards,
Eugen


Zitat von Olivier Bonvalet :


Hello,

on a Luminous cluster, I have a PG incomplete and I can't find how to
fix that.

It's an EC pool (4+2) :

pg 37.9c is incomplete, acting [32,50,59,1,0,75] (reducing pool
bkp-sb-raid6 min_size from 4 may help; search ceph.com/docs for
'incomplete')

Of course, we can't reduce min_size from 4.

And the full state : https://pastebin.com/zrwu5X0w

So, IO is blocked and we can't access those damaged data.
OSDs block too:
osds 32,68,69 have stuck requests > 4194.3 sec

OSD 32 is the primary of this PG.
And OSD 68 and 69 are for cache tiering.

Any idea how I can fix that?

Thanks,

Olivier


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Slow requests blocked. No rebalancing

2018-09-20 Thread Eugen Block

Hi,

to reduce impact on clients during migration I would set the OSD's  
primary-affinity to 0 beforehand. This should prevent the slow  
requests, at least this setting has helped us a lot with problematic  
OSDs.
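
For example, for an OSD you are about to migrate (the osd id is just
an example):

ceph osd primary-affinity osd.16 0

The OSD then stops being chosen as primary for its PGs but still
serves as a replica, which takes most of the client-facing pressure
off it during the migration.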


Regards
Eugen


Zitat von Jaime Ibar :


Hi all,

we recently upgraded from Jewel 10.2.10 to Luminous 12.2.7 and now we're
trying to migrate the osds to Bluestore following this document[0].
However, when I mark an osd as out, I'm getting warnings similar to these ones:

2018-09-20 09:32:46.079630 mon.dri-ceph01 [WRN] Health check failed:  
2 slow requests are blocked > 32 sec. Implicated osds 16,28  
(REQUEST_SLOW)
2018-09-20 09:32:52.841123 mon.dri-ceph01 [WRN] Health check update:  
7 slow requests are blocked > 32 sec. Implicated osds 10,16,28,32,59  
(REQUEST_SLOW)
2018-09-20 09:32:57.842230 mon.dri-ceph01 [WRN] Health check update:  
15 slow requests are blocked > 32 sec. Implicated osds  
10,16,28,31,32,59,78,80 (REQUEST_SLOW)


2018-09-20 09:32:58.851142 mon.dri-ceph01 [WRN] Health check update:  
244944/40100780 objects misplaced (0.611%) (OBJECT_MISPLACED)
2018-09-20 09:32:58.851160 mon.dri-ceph01 [WRN] Health check update:  
249 PGs pending on creation (PENDING_CREATING_PGS)


which prevent ceph from rebalancing; the vms running on ceph
start hanging and we have to mark the osd back in.


I tried to reweight the osd to 0.90 in order to minimize the impact  
on the cluster but the warnings are the same.


I tried to increased these settings to

mds cache memory limit = 2147483648
rocksdb cache size = 2147483648

but with no luck, same warnings.

We also have cephfs for storing files from different projects(no  
directory fragmentation enabled).


The problem here is that if one osd dies, all the services will be  
blocked as ceph won't be able to


start rebalancing.

The cluster is

- 3 mons

- 3 mds(running on the same hosts as the mons). 2 mds active and 1 standby

- 3 mgr(running on the same hosts as the mons)

- 6 servers, 12 osd's each.

- 1GB private network


Does anyone know how to fix or where the problem could be?

Thanks a lot in advance.

Jaime


[0] http://docs.ceph.com/docs/luminous/rados/operations/bluestore-migration/

--

Jaime Ibar
High Performance & Research Computing, IS Services
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
http://www.tchpc.tcd.ie/  |ja...@tchpc.tcd.ie
Tel: +353-1-896-3725




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph MDS WRN replayed op client.$id

2018-09-19 Thread Eugen Block

Yeah, since we haven't knowingly done anything about it, it would be a
(pleasant) surprise if it was accidentally resolved in mimic ;-)


Too bad ;-)
Thanks for your help!

Eugen

Zitat von John Spray :


On Wed, Sep 19, 2018 at 10:37 AM Eugen Block  wrote:


Hi John,

> I'm not 100% sure of that.  It could be that there's a path through
> the code that's healthy, but just wasn't anticipated at the point that
> warning message was added.  I wish a had a more unambiguous response
> to give!

then I guess we'll just keep ignoring these warnings from the replay
mds until we hit a real issue. ;-)

It's probably impossible to predict any improvement on this with  
mimic, right?


Yeah, since we haven't knowingly done anything about it, it would be a
(pleasant) surprise if it was accidentally resolved in mimic ;-)

John


Regards,
Eugen


Zitat von John Spray :

> On Mon, Sep 17, 2018 at 2:49 PM Eugen Block  wrote:
>>
>> Hi,
>>
>> from your response I understand that these messages are not expected
>> if everything is healthy.
>
> I'm not 100% sure of that.  It could be that there's a path through
> the code that's healthy, but just wasn't anticipated at the point that
> warning message was added.  I wish a had a more unambiguous response
> to give!
>
> John
>
>> We face them every now and then, three or four times a week, but
>> there's no real connection to specific jobs or a high load in our
>> cluster. It's a Luminous cluster (12.2.7) with 1 active, 1
>> standby-replay and 1 standby MDS.
>> Since it's only the replay server reporting this and the failover
>> works fine we didn't really bother. But what can we do to prevent this
>> from happening? The messages appear quite randomly, so I don't really
>> now when to increase the debug log level.
>>
>> Any hint would be highly appreciated!
>>
>> Regards,
>> Eugen
>>
>>
>> Zitat von John Spray :
>>
>> > On Thu, Sep 13, 2018 at 11:01 AM Stefan Kooman  wrote:
>> >>
>> >> Hi John,
>> >>
>> >> Quoting John Spray (jsp...@redhat.com):
>> >>
>> >> > On Wed, Sep 12, 2018 at 2:59 PM Stefan Kooman  wrote:
>> >> >
>> >> > When replaying a journal (either on MDS startup or on a  
standby-replay

>> >> > MDS), the replayed file creation operations are being checked for
>> >> > consistency with the state of the replayed client sessions.  Client
>> >> > sessions have a "preallocated _inos" list that contains a  
set of inode

>> >> > numbers they should be using to create new files.
>> >> >
>> >> > There are two checks being done: a soft check (just log it) that the
>> >> > inode used for a new file is the same one that the session would be
>> >> > expected to use for a new file, and a hard check  
(assertion) that the

>> >> > inode used is one of the inode numbers that can be used for a new
>> >> > file.  When that soft check fails, it doesn't indicate anything
>> >> > inconsistent in the metadata, just that the inodes are being used in
>> >> > an unexpected order.
>> >> >
>> >> > The WRN severity message mainly benefits our automated  
testing -- the

>> >> > hope would be that if we're hitting strange scenarios like this in
>> >> > automated tests then it would trigger a test failure (we by  
fail tests

>> >> > if they emit unexpected warnings).
>> >>
>> >> Thanks for the explanation.
>> >>
>> >> > It would be interesting to know more about what's going on on your
>> >> > cluster when this is happening -- do you have standby replay MDSs?
>> >> > Multiple active MDSs?  Were any daemons failing over at a  
similar time

>> >> > to the warnings?  Did you have anything funny going on with clients
>> >> > (like forcing them to reconnect after being evicted)?
>> >>
>> >> Two MDSs in total. One active, one standby-replay. The  
clients are doing

>> >> "funny" stuff. We are testing "CTDB" [1] in combination with cephfs to
>> >> build a HA setup (to prevent split brain). We have two  
clients that, in
>> >> case of a failure, need to require a lock on a file  
"ctdb_recovery_lock"

>> >> before doing a recovery. Somehow, while configuring this setup, we
>> >> triggered the "replayed op" warnings. We try to reproduce that, but no
>> >> matter what we do the "replayed op" warning

Re: [ceph-users] Ceph MDS WRN replayed op client.$id

2018-09-19 Thread Eugen Block

Hi John,


I'm not 100% sure of that.  It could be that there's a path through
the code that's healthy, but just wasn't anticipated at the point that
warning message was added.  I wish a had a more unambiguous response
to give!


then I guess we'll just keep ignoring these warnings from the replay  
mds until we hit a real issue. ;-)


It's probably impossible to predict any improvement on this with mimic, right?

Regards,
Eugen


Zitat von John Spray :


On Mon, Sep 17, 2018 at 2:49 PM Eugen Block  wrote:


Hi,

from your response I understand that these messages are not expected
if everything is healthy.


I'm not 100% sure of that.  It could be that there's a path through
the code that's healthy, but just wasn't anticipated at the point that
warning message was added.  I wish a had a more unambiguous response
to give!

John


We face them every now and then, three or four times a week, but
there's no real connection to specific jobs or a high load in our
cluster. It's a Luminous cluster (12.2.7) with 1 active, 1
standby-replay and 1 standby MDS.
Since it's only the replay server reporting this and the failover
works fine we didn't really bother. But what can we do to prevent this
from happening? The messages appear quite randomly, so I don't really
know when to increase the debug log level.

Any hint would be highly appreciated!

Regards,
Eugen


Zitat von John Spray :

> On Thu, Sep 13, 2018 at 11:01 AM Stefan Kooman  wrote:
>>
>> Hi John,
>>
>> Quoting John Spray (jsp...@redhat.com):
>>
>> > On Wed, Sep 12, 2018 at 2:59 PM Stefan Kooman  wrote:
>> >
>> > When replaying a journal (either on MDS startup or on a standby-replay
>> > MDS), the replayed file creation operations are being checked for
>> > consistency with the state of the replayed client sessions.  Client
>> > sessions have a "preallocated _inos" list that contains a set of inode
>> > numbers they should be using to create new files.
>> >
>> > There are two checks being done: a soft check (just log it) that the
>> > inode used for a new file is the same one that the session would be
>> > expected to use for a new file, and a hard check (assertion) that the
>> > inode used is one of the inode numbers that can be used for a new
>> > file.  When that soft check fails, it doesn't indicate anything
>> > inconsistent in the metadata, just that the inodes are being used in
>> > an unexpected order.
>> >
>> > The WRN severity message mainly benefits our automated testing -- the
>> > hope would be that if we're hitting strange scenarios like this in
>> > automated tests then it would trigger a test failure (we by fail tests
>> > if they emit unexpected warnings).
>>
>> Thanks for the explanation.
>>
>> > It would be interesting to know more about what's going on on your
>> > cluster when this is happening -- do you have standby replay MDSs?
>> > Multiple active MDSs?  Were any daemons failing over at a similar time
>> > to the warnings?  Did you have anything funny going on with clients
>> > (like forcing them to reconnect after being evicted)?
>>
>> Two MDSs in total. One active, one standby-replay. The clients are doing
>> "funny" stuff. We are testing "CTDB" [1] in combination with cephfs to
>> build a HA setup (to prevent split brain). We have two clients that, in
>> case of a failure, need to require a lock on a file "ctdb_recovery_lock"
>> before doing a recovery. Somehow, while configuring this setup, we
>> triggered the "replayed op" warnings. We try to reproduce that, but no
>> matter what we do the "replayed op" warnings do not occur anymore ...
>>
>> We have seen these warnings before (other clients). Warnings started
>> after we had switched from mds1 -> mds2 (upgrade of Ceph cluster
>> according to MDS upgrade procedure, reboots afterwards, hence the
>> failover).
>>
>> Something I just realised is that _only_ the active-standby MDS
>> is emitting the warnings, not the active MDS.
>>
>> Not related to the "replayed op" warning, but related to the CTDB "lock
>> issue":
>>
>> The "surviving" cephfs client tries to acquire a lock on a file, but
>> although the other client is dead (but not yet evicted by the MDS) it
>> can't. Not until the dead client is evicted by the MDS after ~ 300 sec
>> (mds_session_autoclose=300). Turns out ctdb uses fcntl() locking. Does
>> cephfs support this kind of locking in the way ctdb expects it to?
>
> We implement locking, and it's correct that another client can't gain
> the lo

Re: [ceph-users] Ceph MDS WRN replayed op client.$id

2018-09-17 Thread Eugen Block

Hi,

from your response I understand that these messages are not expected  
if everything is healthy.


We face them every now and then, three or four times a week, but  
there's no real connection to specific jobs or a high load in our  
cluster. It's a Luminous cluster (12.2.7) with 1 active, 1  
standby-replay and 1 standby MDS.
Since it's only the replay server reporting this and the failover  
works fine we didn't really bother. But what can we do to prevent this  
from happening? The messages appear quite randomly, so I don't really  
know when to increase the debug log level.


Any hint would be highly appreciated!

Regards,
Eugen


Zitat von John Spray :


On Thu, Sep 13, 2018 at 11:01 AM Stefan Kooman  wrote:


Hi John,

Quoting John Spray (jsp...@redhat.com):

> On Wed, Sep 12, 2018 at 2:59 PM Stefan Kooman  wrote:
>
> When replaying a journal (either on MDS startup or on a standby-replay
> MDS), the replayed file creation operations are being checked for
> consistency with the state of the replayed client sessions.  Client
> sessions have a "preallocated _inos" list that contains a set of inode
> numbers they should be using to create new files.
>
> There are two checks being done: a soft check (just log it) that the
> inode used for a new file is the same one that the session would be
> expected to use for a new file, and a hard check (assertion) that the
> inode used is one of the inode numbers that can be used for a new
> file.  When that soft check fails, it doesn't indicate anything
> inconsistent in the metadata, just that the inodes are being used in
> an unexpected order.
>
> The WRN severity message mainly benefits our automated testing -- the
> hope would be that if we're hitting strange scenarios like this in
> automated tests then it would trigger a test failure (we by fail tests
> if they emit unexpected warnings).

Thanks for the explanation.

> It would be interesting to know more about what's going on on your
> cluster when this is happening -- do you have standby replay MDSs?
> Multiple active MDSs?  Were any daemons failing over at a similar time
> to the warnings?  Did you have anything funny going on with clients
> (like forcing them to reconnect after being evicted)?

Two MDSs in total. One active, one standby-replay. The clients are doing
"funny" stuff. We are testing "CTDB" [1] in combination with cephfs to
build a HA setup (to prevent split brain). We have two clients that, in
case of a failure, need to require a lock on a file "ctdb_recovery_lock"
before doing a recovery. Somehow, while configuring this setup, we
triggered the "replayed op" warnings. We try to reproduce that, but no
matter what we do the "replayed op" warnings do not occur anymore ...

We have seen these warnings before (other clients). Warnings started
after we had switched from mds1 -> mds2 (upgrade of Ceph cluster
according to MDS upgrade procedure, reboots afterwards, hence the
failover).

Something I just realised is that _only_ the active-standby MDS
is emitting the warnings, not the active MDS.

Not related to the "replayed op" warning, but related to the CTDB "lock
issue":

The "surviving" cephfs client tries to acquire a lock on a file, but
although the other client is dead (but not yet evicted by the MDS) it
can't. Not until the dead client is evicted by the MDS after ~ 300 sec
(mds_session_autoclose=300). Turns out ctdb uses fcntl() locking. Does
cephfs support this kind of locking in the way ctdb expects it to?


We implement locking, and it's correct that another client can't gain
the lock until the first client is evicted.  Aside from speeding up
eviction by modifying the timeout, if you have another mechanism for
detecting node failure then you could use that to explicitly evict the
client.
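
For reference, explicit eviction looks something like this (the mds
rank and client id are placeholders):

ceph tell mds.0 client ls
ceph tell mds.0 client evict id=15327973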

John


In the mean time we will try [7] (rados object) as a recovery lock.
Would eliminate a layer / dependency as well.

Thanks,

Gr. Stefan

[1]: https://ctdb.samba.org/
[2]: https://ctdb.samba.org/manpages/ctdb_mutex_ceph_rados_helper.7.html

--
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Benchmark does not show gains with DB on SSD

2018-09-14 Thread Eugen Block

Hi,


Between tests we destroyed the OSDs and created them from scratch. We used
Docker image to deploy Ceph on one machine.
I've seen that there are WAL/DB partitions created on the disks.
Should I also check somewhere in ceph config that it actually uses those?


if you created them from scratch you should be fine.

If you want to check anyway you can run something like this:

ceph@host1:~> ceph-bluestore-tool show-label --dev  
/var/lib/ceph/osd/ceph-1/block | grep path

"path_block.db": "/dev/ceph-journals/bluefsdb-1",

If you also use LVM with Ceph you should check the respective tags of  
the OSD's block and block.db symlinks, they should match.
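
A quick way to see those tags (assuming a ceph-volume/LVM deployment):

sudo lvs -o lv_name,lv_tags | grep ceph.db_device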


You could also run something like iostat on the SSD/NVMe devices to  
see if something's going on there.
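
For example (the device names are just examples):

iostat -x 2 nvme0n1 sdb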


Regards,
Eugen


Zitat von Ján Senko :


Eugen:
Between tests we destroyed the OSDs and created them from scratch. We used
Docker image to deploy Ceph on one machine.
I've seen that there are WAL/DB partitions created on the disks.
Should I also check somewhere in ceph config that it actually uses those?

David:
We used 4MB writes.

I know about the recommended journal size, however this is the machine we
have at the moment.
For final production I can change the size of SSD (if it makes sense)
The benchmark hasn't filled the 30GB of DB in the time it was running, so I
doubt that having properly sized DB would change the results.
(It wrote 38GB per minute of testing, spread across 12 disks with 50% EC
overhead, therefore about 5GB/minute per disk.)

Jan

st 12. 9. 2018 o 17:36 David Turner  napísal(a):


If your writes are small enough (64k or smaller) they're being placed on
the WAL device regardless of where your DB is.  If you change your testing
to use larger writes you should see a difference by adding the DB.

Please note that the community has never recommended using less than 120GB
DB for a 12TB OSD and the docs have come out and officially said that you
should use at least a 480GB DB for a 12TB OSD.  If you're setting up your
OSDs with a 30GB DB, you're just going to fill that up really quick and
spill over onto the HDD and have wasted your money on the SSDs.

On Wed, Sep 12, 2018 at 11:07 AM Ján Senko  wrote:


We are benchmarking a test machine which has:
8 cores, 64GB RAM
12 * 12 TB HDD (SATA)
2 * 480 GB SSD (SATA)
1 * 240 GB SSD (NVME)
Ceph Mimic

Baseline benchmark for HDD only (Erasure Code 4+2)
Write 420 MB/s, 100 IOPS, 150ms latency
Read 1040 MB/s, 260 IOPS, 60ms latency

Now we moved WAL to the SSD (all 12 WALs on single SSD, default size
(512MB)):
Write 640 MB/s, 160 IOPS, 100ms latency
Read identical as above.

Nice boost we thought, so we moved WAL+DB to the SSD (Assigned 30GB for
DB)
All results are the same as above!

Q: This is suspicious, right? Why is the DB on SSD not helping with our
benchmark? We use *rados bench*

We tried putting WAL on the NVME, and again, the results are the same as
on SSD.
Same for WAL+DB on NVME

Again, the same speed. Any ideas why we don't gain speed by using faster
HW here?

Jan

--
Jan Senko, Skype janos-
Phone in Switzerland: +41 774 144 602
Phone in Czech Republic: +420 777 843 818 <+420%20777%20843%20818>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





--
Jan Senko, Skype janos-
Phone in Switzerland: +41 774 144 602
Phone in Czech Republic: +420 777 843 818




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph MDS WRN replayed op client.$id

2018-09-13 Thread Eugen Block

Hi Stefan,


mds.mds1 [WRN]  replayed op client.15327973:15585315,15585103 used ino
0x19918de but session next is 0x1873b8b

Nothing of importance is logged in the mds (debug_mds_log": "1/5").

What does this warning message mean / indicate?


we face these messages on a regular basis. They are (almost?) always  
reported from our standby-replay mds server. I can't explain those  
messages, I assume the standby server regularly checks the client's  
inodes and updates them if necessary. It would be great if someone  
could shed some light on this, though.


But the replay messages aren't related to the slow requests. Those are  
definitely an issue and should be resolved.
What ceph version are you running? How is your setup? Are more roles  
assigned to the mds, e.g. MON or OSD? Do you monitor the resources  
(disk saturation, RAM, CPU, network saturation)? How many clients are  
connected to cephfs? How is mds_cache_memory_limit configured?


Regards,
Eugen


Zitat von Stefan Kooman :


Hi,

Once in a while, today a bit more often, the MDS is logging the
following:

mds.mds1 [WRN]  replayed op client.15327973:15585315,15585103 used ino
0x19918de but session next is 0x1873b8b

Nothing of importance is logged in the mds (debug_mds_log": "1/5").

What does this warning message mean / indicate?

At some point this client (ceph-fuse, mimic 13.2.1) triggers the following:

mon.mon1 [WRN] Health check failed: 1 MDSs report slow requests
(MDS_SLOW_REQUEST)
mds.mds2 [WRN] 1 slow requests, 1 included below; oldest blocked for >
30.911624 secs
mds.mds2 [WRN] slow request 30.911624 seconds old, received at
2018-09-12 15:18:44.739321: client_request(client.15732335:9506 lookup
#0x16901a7/ctdb_recovery_lock caller_uid=0, caller_gid=0{})  
currently failed to

rdlock, waiting

mds logging:

2018-09-12 11:35:07.373091 7f80af91e700  0 --  
[2001:7b8:80:3:0:2c:3:2]:6800/1086374448 >>  
[2001:7b8:81:7::11]:0/2366241118 conn(0x56332404f000 :6800  
s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0  
l=0).handle_connect_msg: challenging authorizer
2018-09-12 13:24:17.000787 7f80af91e700  0 --  
[2001:7b8:80:3:0:2c:3:2]:6800/1086374448 >>  
[2001:7b8:81:7::11]:0/526035198 conn(0x56330c726000 :6800  
s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0  
l=0).handle_connect_msg: challenging authorizer
2018-09-12 15:21:17.176405 7f80af91e700  0 --  
[2001:7b8:80:3:0:2c:3:2]:6800/1086374448 >>  
[2001:7b8:81:7::11]:0/526035198 conn(0x56330c726000 :6800  
s=STATE_OPEN pgs=3 cs=1 l=0).fault server, going to standby
2018-09-12 15:22:26.641501 7f80af91e700  0 --  
[2001:7b8:80:3:0:2c:3:2]:6800/1086374448 >>  
[2001:7b8:81:7::11]:0/526035198 conn(0x5633678f7000 :6800  
s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0  
l=0).handle_connect_msg: challenging authorizer
2018-09-12 15:22:26.641694 7f80af91e700  0 --  
[2001:7b8:80:3:0:2c:3:2]:6800/1086374448 >>  
[2001:7b8:81:7::11]:0/526035198 conn(0x5633678f7000 :6800  
s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0  
l=0).handle_connect_msg accept connect_seq 2 vs existing csq=1  
existing_state=STATE_STANDBY
2018-09-12 15:22:26.641971 7f80af91e700  0 --  
[2001:7b8:80:3:0:2c:3:2]:6800/1086374448 >>  
[2001:7b8:81:7::11]:0/526035198 conn(0x56330c726000 :6800  
s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=3 cs=1  
l=0).handle_connect_msg: challenging authorizer


Thanks,

Stefan


--
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Benchmark does not show gains with DB on SSD

2018-09-12 Thread Eugen Block

Hi Jan,

how did you move the WAL and DB to the SSD/NVMe? By recreating the  
OSDs or a different approach? Did you check afterwards that the  
devices were really used for that purpose? We had to deal with that a  
couple of months ago [1] and it's not really obvious if the new  
devices are really used.


Regards,
Eugen

[1]  
http://heiterbiswolkig.blogs.nde.ag/2018/04/08/migrating-bluestores-block-db/



Zitat von Ján Senko :


We are benchmarking a test machine which has:
8 cores, 64GB RAM
12 * 12 TB HDD (SATA)
2 * 480 GB SSD (SATA)
1 * 240 GB SSD (NVME)
Ceph Mimic

Baseline benchmark for HDD only (Erasure Code 4+2)
Write 420 MB/s, 100 IOPS, 150ms latency
Read 1040 MB/s, 260 IOPS, 60ms latency

Now we moved WAL to the SSD (all 12 WALs on single SSD, default size
(512MB)):
Write 640 MB/s, 160 IOPS, 100ms latency
Read identical as above.

Nice boost, we thought, so we moved WAL+DB to the SSD (assigned 30GB for DB).
All results are the same as above!

Q: This is suspicious, right? Why is the DB on SSD not helping with our
benchmark? We use *rados bench*
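
For reference, a typical rados bench invocation looks something like this
(pool name, block size and thread count are only illustrative):

rados -p testpool bench 60 write -b 65536 -t 16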

We tried putting WAL on the NVME, and again, the results are the same as on
SSD.
Same for WAL+DB on NVME

Again, the same speed. Any ideas why we don't gain speed by using faster HW
here?

Jan

--
Jan Senko, Skype janos-
Phone in Switzerland: +41 774 144 602
Phone in Czech Republic: +420 777 843 818




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] WAL/DB size

2018-09-07 Thread Eugen Block

Hi,


Are you asking us to do 40GB  * 5 partitions on SSD just for block.db?


yes. By default ceph places block.db and the WAL (block.wal) on the same  
device if no separate WAL device is specified.
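
If you do want them on separate devices, that is specified when the OSD
is created, for example (a sketch, device names are placeholders):

ceph-volume lvm create --bluestore --data /dev/sdb \
  --block.db /dev/nvme0n1p1 --block.wal /dev/nvme0n1p2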


Regards,
Eugen


Zitat von Muhammad Junaid :


Thanks Alfredo. Just to clarify: my configuration has 5 OSDs (7200 rpm
SAS HDDs) which are slower than the 200G SSD. That's why I asked for a 10G
WAL partition for each OSD on the SSD.

Are you asking us to do 40GB  * 5 partitions on SSD just for block.db?

On Fri, Sep 7, 2018 at 5:36 PM Alfredo Deza  wrote:


On Fri, Sep 7, 2018 at 8:27 AM, Muhammad Junaid 
wrote:
> Hi there
>
> Asking the questions as a newbie. They may have been asked a number of
> times before by many, but sorry, it is not yet clear to me.
>
> 1. The WAL device is just like the journaling device used before
> BlueStore, and Ceph confirms the write to the client after writing to it
> (before the actual write to the primary device)?
>
> 2. If we have, let's say, 5 OSDs (4 TB SAS) and one 200 GB SSD, should we
> partition the SSD into 10 partitions? Should/can we set the WAL partition
> size for each OSD to 10 GB? Or what min/max should we set for the WAL
> partition? And can we use the remaining 150 GB as (30 GB * 5) for 5 DB
> partitions for all OSDs?

A WAL partition would only help if you have a device faster than the
SSD where the block.db would go.

We recently updated our sizing recommendations for block.db: at least
4% of the size of block (also referred to as the data device):


http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#sizing

In your case, what you want is to create 5 logical volumes from your
200GB at 40GB each, without a need for a WAL device.
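
A minimal sketch of what that could look like, assuming the 200GB SSD
shows up as /dev/sdf and the data disks as /dev/sd[a-e]:

vgcreate ceph-db /dev/sdf
for i in a b c d e; do lvcreate -L 40G -n db-$i ceph-db; done
ceph-volume lvm create --bluestore --data /dev/sda --block.db ceph-db/db-a
# repeat the create for sdb..sde with db-b..db-e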


>
> Thanks in advance. Regards.
>
> Muhammad Junaid
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] failing to respond to cache pressure

2018-09-06 Thread Eugen Block

Hi,

I would like to update this thread for others struggling with cache pressure.

The last time we hit that message was more than three weeks ago  
(the workload has not changed), so it seems that our current configuration  
fits our workload.
Reducing client_oc_size to 100 MB (from the default of 200 MB) seems to be  
the trick here; just increasing the cache size was not enough, at  
least not if you are limited in memory. Currently we have set  
mds_cache_memory_limit to 4 GB.
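
In case anyone wants to reproduce those values, a rough sketch (both
values are in bytes, <id> is a placeholder for the MDS name):

ceph daemon mds.<id> config set mds_cache_memory_limit 4294967296

and on the client side, e.g. in ceph.conf:

[client]
client_oc_size = 104857600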


Another note on MDS cache size:
I had configured the mds_cache_memory_limit (4 GB) and client_oc_size  
(100 MB) in version 12.2.5. Comparing the real usage from "ceph daemon  
mds.<id> cache status" with the reserved memory shown by "top", I noticed a  
huge difference: the reserved memory was almost 8 GB while "cache  
status" reported nearly 4 GB.
After upgrading to 12.2.7 the reserved memory size in top is still  
only about 5 GB after one week. Obviously there have been improvements  
regarding memory consumption of MDS, which is nice. :-)


Regards,
Eugen


Zitat von Eugen Block :


Hi,


I think it does have positive effect on the messages. Cause I get fewer
messages than before.


that's nice. I also receive definitely fewer cache pressure messages  
than before.
I also started to play around with the client side cache  
configuration. I halved the client object cache size from 200 MB to  
100 MB:


ceph@host1:~ $ ceph daemon mds.host1 config set client_oc_size 104857600

Although I still encountered one pressure message recently, the total  
number of these messages has decreased significantly.


Regards,
Eugen


Zitat von Zhenshi Zhou :


Hi Eugen,
I think it does have positive effect on the messages. Cause I get fewer
messages than before.

Eugen Block  于2018年8月20日周一 下午9:29写道:


Update: we are getting these messages again.

So the search continues...


Zitat von Eugen Block :


Hi,

Depending on your kernel (there have been memory leaks with CephFS),
increasing the mds_cache_memory_limit could be of help. What is your
current setting?

ceph:~ # ceph daemon mds.<id> config show | grep mds_cache_memory_limit

We had these messages for months, almost every day.
It would occur when hourly backup jobs ran and the MDS had to serve
an additional client (searching the whole CephFS for changes)
besides the existing CephFS clients. First we updated all clients to
a more recent kernel version, but the warnings didn't stop. Then we
doubled the cache size from 2 GB to 4 GB last week and since then I
haven't seen this warning again (for now).

Try playing with the cache size to find a setting fitting your
needs, but don't forget to monitor your MDS in case something goes
wrong.
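
The limit can also be adjusted at runtime, for example (value in bytes,
just a sketch):

ceph tell mds.* injectargs '--mds_cache_memory_limit=4294967296'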

Regards,
Eugen


Zitat von Wido den Hollander :


On 08/13/2018 01:22 PM, Zhenshi Zhou wrote:

Hi,
Recently the cluster has been running healthy, but I get warning messages
every day:




Which version of Ceph? Which version of clients?

Can you post:

$ ceph versions
$ ceph features
$ ceph fs status

Wido


2018-08-13 17:39:23.682213 [INF]  Cluster is now healthy
2018-08-13 17:39:23.682144 [INF]  Health check cleared:
MDS_CLIENT_RECALL (was: 6 clients failing to respond to cache pressure)
2018-08-13 17:39:23.052022 [INF]  MDS health message cleared (mds.0):
Client docker38:docker failing to respond to cache pressure
2018-08-13 17:39:23.051979 [INF]  MDS health message cleared (mds.0):
Client docker73:docker failing to respond to cache pressure
2018-08-13 17:39:23.051934 [INF]  MDS health message cleared (mds.0):
Client docker74:docker failing to respond to cache pressure
2018-08-13 17:39:23.051853 [INF]  MDS health message cleared (mds.0):
Client docker75:docker failing to respond to cache pressure
2018-08-13 17:39:23.051815 [INF]  MDS health message cleared (mds.0):
Client docker27:docker failing to respond to cache pressure
2018-08-13 17:39:23.051753 [INF]  MDS health message cleared (mds.0):
Client docker27 failing to respond to cache pressure
2018-08-13 17:38:11.100331 [WRN]  Health check update: 6 clients

failing

to respond to cache pressure (MDS_CLIENT_RECALL)
2018-08-13 17:37:39.570014 [WRN]  Health check update: 5 clients

failing

to respond to cache pressure (MDS_CLIENT_RECALL)
2018-08-13 17:37:31.099418 [WRN]  Health check update: 3 clients

failing

to respond to cache pressure (MDS_CLIENT_RECALL)
2018-08-13 17:36:34.564345 [WRN]  Health check update: 1 clients

failing

to respond to cache pressure (MDS_CLIENT_RECALL)
2018-08-13 17:36:27.121891 [WRN]  Health check update: 3 clients

failing

to respond to cache pressure (MDS_CLIENT_RECALL)
2018-08-13 17:36:11.967531 [WRN]  Health check update: 5 clients

failing

to respond to cache pressure (MDS_CLIENT_RECALL)
2018-08-13 17:35:59.870055 [WRN]  Health check update: 6 clients

failing

to respond to cache pressure (MDS_CLIENT_RECALL)
2018-08-13 17:35:47.787323 [WRN]  Health check update: 3 clients

failing

to respond to cache pressure (MDS_CLIENT_RECALL)
2018-08-13 17:34:59.435933 [WRN]  Health ch

Re: [ceph-users] Ceph-Deploy error on 15/71 stage

2018-09-05 Thread Eugen Block
ce
with, for instance, the nodes swap, boot or main partition?

And so the only possible way to have a functioning Ceph distributed
filesystem would be to have in each node at least one disk
dedicated to the operating system and another, independent disk
dedicated to the Ceph filesystem?

That would be an awful drawback for our plans if true, but if there is no
other way, we will just have to give up. Please just answer these two
questions clearly before we capitulate?  :(

Anyway, thanks a lot, once again,

Jones

On Mon, Sep 3, 2018 at 5:39 AM Eugen Block  wrote:


Hi Jones,

I still don't think creating an OSD on a partition will work. The
reason is that SES creates an additional partition per OSD resulting
in something like this:

vdb   253:16   05G  0 disk
├─vdb1253:17   0  100M  0 part /var/lib/ceph/osd/ceph-1
└─vdb2253:18   0  4,9G  0 part

Even with external block.db and wal.db on additional devices you would
still need two partitions for the OSD. I'm afraid with your setup this
can't work.

Regards,
Eugen





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph-Deploy error on 15/71 stage

2018-09-03 Thread Eugen Block
et: disabled)
   Active: active (running) since Fri 2018-08-31 12:01:04 -03; 50min ago
 Main PID: 1421 (salt-minion)
    Tasks: 6 (limit: 629145)
   CGroup: /system.slice/salt-minion.service
           ├─1421 /usr/bin/python3 /usr/bin/salt-minion
           ├─1595 /usr/bin/python3 /usr/bin/salt-minion
           └─1647 /usr/bin/python3 /usr/bin/salt-minion
ago 31 12:00:59 polar systemd[1]: Starting The Salt Minion...
ago 31 12:01:04 polar systemd[1]: Started The Salt Minion.
ago 31 12:11:04 polar salt-minion[1421]: [WARNING ] The function "module.run" is using its deprecated version and will expire in version "Sodium".
ago 31 12:23:30 polar salt-minion[1421]: [WARNING ] The function "module.run" is using its deprecated version and will expire in version "Sodium".
ago 31 12:37:51 polar salt-minion[1421]: [WARNING ] The function "module.run" is using its deprecated version and will expire in version "Sodium".
ago 31 12:40:23 polar salt-minion[1421]: [WARNING ] The function "module.run" is using its deprecated version and will expire in version "Sodium".
ago 31 12:40:44 polar salt-minion[1421]: [WARNING ] The function "module.run" is using its deprecated version and will expire in version "Sodium".
ago 31 12:43:51 polar salt-minion[1421]: [WARNING ] The function "module.run" is using its deprecated version and will expire in version "Sodium".
ago 31 12:43:51 polar salt-minion[1421]: [ERROR   ] Mine on polar.iq.ufrgs.br for cephdisks.list
ago 31 12:43:51 polar salt-minion[1421]: [ERROR   ] Module function osd.deploy threw an exception. Exception: Mine on polar.iq.ufrgs.br for cephdisks.list
*#*

So I mostly see warnings concerning the SNTP server (possibly due to the
bad connection and rainy weather disturbing our faulty internet) and about
deprecated versions of software. The real error I'm getting (but still not
understanding how to solve it) is:

*#*
2018-08-31T12:43:51.995945-03:00 polar salt-minion[1421]: [ERROR   ] Mine on polar.iq.ufrgs.br for cephdisks.list
2018-08-31T12:43:51.996215-03:00 polar salt-minion[1421]: [ERROR   ] Module function osd.deploy threw an exception. Exception: Mine on polar.iq.ufrgs.br for cephdisks.list
*#*

Any ideas on how to proceed from here?  :(  I'm totally clueless. :(

Thanks a lot once again,

Jones

On Fri, Aug 31, 2018 at 4:00 AM Eugen Block  wrote:


Hi,

I'm not sure if there's a misunderstanding. You need to track the logs
during the osd deployment step (stage.3), that is where it fails, and
this is where /var/log/messages could be useful. Since the deployment
failed you have no systemd-units (ceph-osd@.service) to log
anything.

Before running stage.3 again try something like

grep -C5 ceph-disk /var/log/messages (or messages-201808*.xz)

or

grep -C5 sda4 /var/log/messages (or messages-201808*.xz)

If that doesn't reveal anything run stage.3 again and watch the logs.

Regards,
Eugen


Zitat von Jones de Andrade :

> Hi Eugen.
>
> Ok, edited the file /etc/salt/minion, uncommented the "log_level_logfile"
> line and set it to "debug" level.
>
> Turned off the computer, waited a few minutes so that the time frame
would
> stand out in the /var/log/messages file, and restarted the computer.
>
> Using vi I "greped out" (awful wording) the reboot section. From that, I
> also removed most of what it seemed totally unrelated to ceph, salt,
> minions, grafana, prometheus, whatever.
>
> I got the lines below. It does not seem to complain about anything that I
> can see. :(
>
> 
> 2018-08-30T15:41:46.455383-03:00 torcello systemd[1]: systemd 234 running
> in system mode. (+PAM -AUDIT +SELINUX -IMA +APPARMOR -SMACK +SYSVINIT
+UTMP
> +LIBCRYPTSETUP +GCRYPT -GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID -ELFUTILS
> +KMOD -IDN2 -IDN default-hierarchy=hybrid)
> 2018-08-30T15:41:46.456330-03:00 torcello systemd[1]: Detected
architecture
> x86-64.
> 2018-08-30T15:41:46.456350-03:00 torcello systemd[1]: nss-lookup.target:
> Dependency Before=nss-lookup.target dropped
> 2018-08-30T15:41:46.456357-03:00 torcello systemd[1]: Started Load Kernel
> Modules.
> 2018-08-30T15:41:46.456369-03:00 torcello systemd[1]: Starting Apply
Kernel
> Variables...
> 2018-08-30T15:41:46.457230-03:00 torcello systemd[1]: Started
Alertmanager
> for prometheus.
> 2018-08-30T15:41:46.457237-03:00 torcello systemd[1]: Started Monitoring
> system and time series database.
> 2018-08-30T15:41:46.457403-03:00 torcello systemd[1]: Starting NTP
> client/server...
>
>
>
>
>
>
> *2018-08-30T15:41:46.457425-03:00 torcello systemd[1]: Started Prometheus
> exporter for machine metrics.2018-08-30T15

Re: [ceph-users] Ceph-Deploy error on 15/71 stage

2018-08-31 Thread Eugen Block
[1792]: Stopped target
Paths.
2018-08-30T15:44:15.502923-03:00 torcello systemd[1792]: Stopped target
Timers.
2018-08-30T15:44:15.503062-03:00 torcello systemd[1792]: Stopped target
Sockets.
2018-08-30T15:44:15.503200-03:00 torcello systemd[1792]: Closed D-Bus User
Message Bus Socket.
2018-08-30T15:44:15.503356-03:00 torcello systemd[1792]: Reached target
Shutdown.
2018-08-30T15:44:15.503572-03:00 torcello systemd[1792]: Starting Exit the
Session...
2018-08-30T15:44:15.511298-03:00 torcello systemd[2295]: Starting D-Bus
User Message Bus Socket.
2018-08-30T15:44:15.511493-03:00 torcello systemd[2295]: Reached target
Timers.
2018-08-30T15:44:15.511664-03:00 torcello systemd[2295]: Reached target
Paths.
2018-08-30T15:44:15.517873-03:00 torcello systemd[2295]: Listening on D-Bus
User Message Bus Socket.
2018-08-30T15:44:15.518060-03:00 torcello systemd[2295]: Reached target
Sockets.
2018-08-30T15:44:15.518216-03:00 torcello systemd[2295]: Reached target
Basic System.
2018-08-30T15:44:15.518373-03:00 torcello systemd[2295]: Reached target
Default.
2018-08-30T15:44:15.518501-03:00 torcello systemd[2295]: Startup finished
in 31ms.
2018-08-30T15:44:15.518634-03:00 torcello systemd[1]: Started User Manager
for UID 1000.
2018-08-30T15:44:15.518759-03:00 torcello systemd[1792]: Received
SIGRTMIN+24 from PID 2300 (kill).
2018-08-30T15:44:15.537634-03:00 torcello systemd[1]: Stopped User Manager
for UID 464.
2018-08-30T15:44:15.538422-03:00 torcello systemd[1]: Removed slice User
Slice of sddm.
2018-08-30T15:44:15.613246-03:00 torcello systemd[2295]: Started D-Bus User
Message Bus.
2018-08-30T15:44:15.623989-03:00 torcello dbus-daemon[2311]: [session
uid=1000 pid=2311] Successfully activated service 'org.freedesktop.systemd1'
2018-08-30T15:44:16.447162-03:00 torcello kapplymousetheme[2350]:
kcm_input: Using X11 backend
2018-08-30T15:44:16.901642-03:00 torcello node_exporter[807]:
time="2018-08-30T15:44:16-03:00" level=error msg="ERROR: ntp collector
failed after 0.000205s: couldn't get SNTP reply: read udp 127.0.0.1:53434->
127.0.0.1:123: read: connection refused" source="collector.go:123"


Any ideas?

Thanks a lot,

Jones

On Thu, Aug 30, 2018 at 4:14 AM Eugen Block  wrote:


Hi,

> So, it only contains logs concerning the node itself (is it correct?
sincer
> node01 is also the master, I was expecting it to have logs from the other
> too) and, moreover, no ceph-osd* files. Also, I'm looking the logs I have
> available, and nothing "shines out" (sorry for my poor english) as a
> possible error.

the logging is not configured to be centralised per default, you would
have to configure that yourself.

Regarding the OSDs, if there are OSD logs created, they're created on
the OSD nodes, not on the master. But since the OSD deployment fails,
there probably are no OSD specific logs yet. So you'll have to take a
look into the syslog (/var/log/messages), that's where the salt-minion
reports its attempts to create the OSDs. Chances are high that you'll
find the root cause in here.

If the output is not enough, set the log-level to debug:

osd-1:~ # grep -E "^log_level" /etc/salt/minion
log_level: debug


Regards,
Eugen


Zitat von Jones de Andrade :

> Hi Eugen.
>
> Sorry for the delay in answering.
>
> Just looked in the /var/log/ceph/ directory. It only contains the
following
> files (for example on node01):
>
> ###
> # ls -lart
> total 3864
> -rw--- 1 ceph ceph 904 ago 24 13:11 ceph.audit.log-20180829.xz
> drwxr-xr-x 1 root root 898 ago 28 10:07 ..
> -rw-r--r-- 1 ceph ceph  189464 ago 28 23:59
ceph-mon.node01.log-20180829.xz
> -rw--- 1 ceph ceph   24360 ago 28 23:59 ceph.log-20180829.xz
> -rw-r--r-- 1 ceph ceph   48584 ago 29 00:00
ceph-mgr.node01.log-20180829.xz
> -rw--- 1 ceph ceph   0 ago 29 00:00 ceph.audit.log
> drwxrws--T 1 ceph ceph 352 ago 29 00:00 .
> -rw-r--r-- 1 ceph ceph 1908122 ago 29 12:46 ceph-mon.node01.log
> -rw--- 1 ceph ceph  175229 ago 29 12:48 ceph.log
> -rw-r--r-- 1 ceph ceph 1599920 ago 29 12:49 ceph-mgr.node01.log
> ###
>
> So, it only contains logs concerning the node itself (is it correct?
sincer
> node01 is also the master, I was expecting it to have logs from the other
> too) and, moreover, no ceph-osd* files. Also, I'm looking the logs I have
> available, and nothing "shines out" (sorry for my poor english) as a
> possible error.
>
> Any suggestion on how to proceed?
>
> Thanks a lot in advance,
>
> Jones
>
>
> On Mon, Aug 27, 2018 at 5:29 AM Eugen Block  wrote:
>
>> Hi Jones,
>>
>> all ceph logs are in the directory /var/log/ceph/, each daemon has its
>> own log file, e.g. OSD logs are named ceph-osd.*.
>>
>> I haven't tried it but I don't think SUSE Enterprise Storage deploys
>> OSDs on partitioned disks. Is the

Re: [ceph-users] Clients report OSDs down/up (dmesg) nothing in Ceph logs (flapping OSDs)

2018-08-30 Thread Eugen Block

Correct, except it doesn't have to be a specific host or a specific
OSD.  What matters here is whether the client is idle.  As soon as the
client is woken up and sends a request to _any_ OSD, it receives a new
osdmap and applies it, possibly emitting those dmesg entries.


Thanks for the clarification!


Zitat von Ilya Dryomov :


On Thu, Aug 30, 2018 at 1:04 PM Eugen Block  wrote:


Hi again,

we still didn't figure out the reason for the flapping, but I wanted
to get back on the dmesg entries.
They just reflect what happened in the past, they're no indicator to
predict anything.


The kernel client is just that, a client.  Almost by definition,
everything it sees has already happened.



For example, when I changed the primary-affinity of OSD.24 last week,
one of the clients realized that only today, 4 days later. If the
clients don't have to communicate with the respective host/osd in the
meantime, they log those events on the next reconnect.


Correct, except it doesn't have to be a specific host or a specific
OSD.  What matters here is whether the client is idle.  As soon as the
client is woken up and sends a request to _any_ OSD, it receives a new
osdmap and applies it, possibly emitting those dmesg entries.

Thanks,

Ilya




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Clients report OSDs down/up (dmesg) nothing in Ceph logs (flapping OSDs)

2018-08-30 Thread Eugen Block

Hi again,

we still haven't figured out the reason for the flapping, but I wanted  
to get back to the dmesg entries.
They just reflect what happened in the past; they're not an indicator  
that predicts anything.


For example, when I changed the primary-affinity of OSD.24 last week,  
one of the clients realized that only today, 4 days later. If the  
clients don't have to communicate with the respective host/osd in the  
meantime, they log those events on the next reconnect.
I just wanted to share that in case anybody else is wondering (or  
maybe it was just me).


Regards,
Eugen


Zitat von Eugen Block :


Update:
I changed the primary affinity of one OSD back to 1.0 to test if  
those metrics change, and indeed they do:

OSD.24 immediately shows values greater than 0.
I guess the metrics are completely unrelated to the flapping.

So the search goes on...


Zitat von Eugen Block :

An hour ago host5 started to report the OSDs on host4 as down  
(still no clue why), resulting in slow requests. This time no  
flapping occurred; the cluster recovered a couple of minutes later.  
No other OSDs reported that, only those two on host5. There's  
nothing in the logs of the reporting or the affected OSDs.


Then I compared a perf dump of one healthy OSD with one on host4.  
There's something strange about the metrics (many of them are 0), I  
just can't tell if it's related to the fact that host4 has no  
primary OSDs. But even with no primary OSD I would expect different  
values for OSDs that are running for a week now.


---cut here---
host1:~ # diff -u perfdump.osd1 perfdump.osd24
--- perfdump.osd1   2018-08-23 11:03:03.695927316 +0200
+++ perfdump.osd24  2018-08-23 11:02:09.919927375 +0200
@@ -1,99 +1,99 @@
{
"osd": {
"op_wip": 0,
-"op": 7878594,
-"op_in_bytes": 852767683202,
-"op_out_bytes": 1019871565411,
+"op": 0,
+"op_in_bytes": 0,
+"op_out_bytes": 0,
"op_latency": {
-"avgcount": 7878594,
-"sum": 1018863.131206702,
-"avgtime": 0.129320425
+"avgcount": 0,
+"sum": 0.0,
+"avgtime": 0.0
},
"op_process_latency": {
-"avgcount": 7878594,
-"sum": 879970.400440694,
-"avgtime": 0.111691299
+"avgcount": 0,
+"sum": 0.0,
+"avgtime": 0.0
},
"op_prepare_latency": {
-"avgcount": 8321733,
-"sum": 41376.442963329,
-"avgtime": 0.004972094
+"avgcount": 0,
+"sum": 0.0,
+"avgtime": 0.0
},
-"op_r": 3574792,
-"op_r_out_bytes": 1019871565411,
+"op_r": 0,
+"op_r_out_bytes": 0,
"op_r_latency": {
-"avgcount": 3574792,
-"sum": 54750.502669010,
-"avgtime": 0.015315717
+"avgcount": 0,
+"sum": 0.0,
+"avgtime": 0.0
},
"op_r_process_latency": {
-"avgcount": 3574792,
-"sum": 34107.703579874,
-"avgtime": 0.009541171
+"avgcount": 0,
+"sum": 0.0,
+"avgtime": 0.0
},
"op_r_prepare_latency": {
-"avgcount": 3574817,
-"sum": 34262.515884817,
-"avgtime": 0.009584411
+"avgcount": 0,
+"sum": 0.0,
+"avgtime": 0.0
},
-"op_w": 4249520,
-"op_w_in_bytes": 847518164870,
+"op_w": 0,
+"op_w_in_bytes": 0,
"op_w_latency": {
-"avgcount": 4249520,
-"sum": 960898.540843217,
-"avgtime": 0.226119312
+"avgcount": 0,
+"sum": 0.0,
+"avgtime": 0.0
},
"op_w_process_latency": {
-"avgcount": 4249520,
-"sum": 844398.804808119,
-"avgtime": 0.198704513
+"avgcount": 0,
+"sum": 0.0,
+"avgtime": 0.0
},
"op_w_prepare_latency": {
-"avgcount": 4692618,
-"sum

Re: [ceph-users] Ceph-Deploy error on 15/71 stage

2018-08-30 Thread Eugen Block

Hi,


So, it only contains logs concerning the node itself (is it correct? since
node01 is also the master, I was expecting it to have logs from the others
too) and, moreover, no ceph-osd* files. Also, I'm looking at the logs I have
available, and nothing "shines out" (sorry for my poor English) as a
possible error.


the logging is not configured to be centralised per default, you would  
have to configure that yourself.


Regarding the OSDs, if there are OSD logs created, they're created on  
the OSD nodes, not on the master. But since the OSD deployment fails,  
there probably are no OSD specific logs yet. So you'll have to take a  
look into the syslog (/var/log/messages), that's where the salt-minion  
reports its attempts to create the OSDs. Chances are high that you'll  
find the root cause in here.


If the output is not enough, set the log-level to debug:

osd-1:~ # grep -E "^log_level" /etc/salt/minion
log_level: debug


Regards,
Eugen


Zitat von Jones de Andrade :


Hi Eugen.

Sorry for the delay in answering.

Just looked in the /var/log/ceph/ directory. It only contains the following
files (for example on node01):

###
# ls -lart
total 3864
-rw--- 1 ceph ceph 904 ago 24 13:11 ceph.audit.log-20180829.xz
drwxr-xr-x 1 root root 898 ago 28 10:07 ..
-rw-r--r-- 1 ceph ceph  189464 ago 28 23:59 ceph-mon.node01.log-20180829.xz
-rw--- 1 ceph ceph   24360 ago 28 23:59 ceph.log-20180829.xz
-rw-r--r-- 1 ceph ceph   48584 ago 29 00:00 ceph-mgr.node01.log-20180829.xz
-rw--- 1 ceph ceph   0 ago 29 00:00 ceph.audit.log
drwxrws--T 1 ceph ceph 352 ago 29 00:00 .
-rw-r--r-- 1 ceph ceph 1908122 ago 29 12:46 ceph-mon.node01.log
-rw--- 1 ceph ceph  175229 ago 29 12:48 ceph.log
-rw-r--r-- 1 ceph ceph 1599920 ago 29 12:49 ceph-mgr.node01.log
###

So, it only contains logs concerning the node itself (is it correct? since
node01 is also the master, I was expecting it to have logs from the others
too) and, moreover, no ceph-osd* files. Also, I'm looking at the logs I have
available, and nothing "shines out" (sorry for my poor English) as a
possible error.

Any suggestion on how to proceed?

Thanks a lot in advance,

Jones


On Mon, Aug 27, 2018 at 5:29 AM Eugen Block  wrote:


Hi Jones,

all ceph logs are in the directory /var/log/ceph/, each daemon has its
own log file, e.g. OSD logs are named ceph-osd.*.

I haven't tried it but I don't think SUSE Enterprise Storage deploys
OSDs on partitioned disks. Is there a way to attach a second disk to
the OSD nodes, maybe via USB or something?

Although this thread is ceph related it is referring to a specific
product, so I would recommend to post your question in the SUSE forum
[1].

Regards,
Eugen

[1] https://forums.suse.com/forumdisplay.php?99-SUSE-Enterprise-Storage

Zitat von Jones de Andrade :

> Hi Eugen.
>
> Thanks for the suggestion. I'll look for the logs (since it's our first
> attempt with ceph, I'll have to discover where they are, but no problem).
>
> One thing called my attention on your response however:
>
> I haven't made myself clear, but one of the failures we encountered were
> that the files now containing:
>
> node02:
>--
>storage:
>--
>osds:
>--
>/dev/sda4:
>--
>format:
>bluestore
>standalone:
>True
>
> Were originally empty, and we filled them by hand following a model found
> elsewhere on the web. It was necessary, so that we could continue, but
the
> model indicated that, for example, it should have the path for /dev/sda
> here, not /dev/sda4. We chosen to include the specific partition
> identification because we won't have dedicated disks here, rather just
the
> very same partition as all disks were partitioned exactly the same.
>
> While that was enough for the procedure to continue at that point, now I
> wonder if it was the right call and, if it indeed was, if it was done
> properly.  As such, I wonder: what you mean by "wipe" the partition here?
> /dev/sda4 is created, but is both empty and unmounted: Should a different
> operation be performed on it, should I remove it first, should I have
> written the files above with only /dev/sda as target?
>
> I know that probably I wouldn't run in this issues with dedicated discks,
> but unfortunately that is absolutely not an option.
>
> Thanks a lot in advance for any comments and/or extra suggestions.
>
> Sincerely yours,
>
> Jones
>
> On Sat, Aug 25, 2018 at 5:46 PM Eugen Block  wrote:
>
>> Hi,
>>
>> take a look into the logs, they should point you in the right direction.
>> Since the deployment stage fails at the OSD level, start with the OSD
>> logs. Something's not right with the disks

Re: [ceph-users] Error EINVAL: (22) Invalid argument While using ceph osd safe-to-destroy

2018-08-27 Thread Eugen Block

Hi,

could you please paste your osd tree and the exact command you're trying to execute?

Extra note, the while loop in the instructions looks like it's bad.   
I had to change it to make it work in bash.


The documented command didn't work for me either.
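
For reference, a bash variant of that loop which should be valid syntax
(with $ID being the numeric OSD id) is something like:

while ! ceph osd safe-to-destroy osd.$ID ; do sleep 10 ; done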

Regards,
Eugen

Zitat von Robert Stanford :


I am following the procedure here:
http://docs.ceph.com/docs/mimic/rados/operations/bluestore-migration/

 When I get to the part to run "ceph osd safe-to-destroy $ID" in a while
loop, I get an EINVAL error.  I get this error when I run "ceph osd
safe-to-destroy 0" on the command line by itself, too.  (Extra note, the
while loop in the instructions looks like it's bad.  I had to change it to
make it work in bash.)

 I know my ID is correct because I was able to use it in the previous step
(ceph osd out $ID).  I also substituted $ID for the number on the command
line and got the same error.  Why isn't this working?

Error: Error EINVAL: (22) Invalid argument While using ceph osd
safe-to-destroy

 Thank you
R




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph-Deploy error on 15/71 stage

2018-08-27 Thread Eugen Block

Hi Jones,

all ceph logs are in the directory /var/log/ceph/, each daemon has its  
own log file, e.g. OSD logs are named ceph-osd.*.


I haven't tried it but I don't think SUSE Enterprise Storage deploys  
OSDs on partitioned disks. Is there a way to attach a second disk to  
the OSD nodes, maybe via USB or something?


Although this thread is ceph related it is referring to a specific  
product, so I would recommend to post your question in the SUSE forum  
[1].


Regards,
Eugen

[1] https://forums.suse.com/forumdisplay.php?99-SUSE-Enterprise-Storage

Zitat von Jones de Andrade :


Hi Eugen.

Thanks for the suggestion. I'll look for the logs (since it's our first
attempt with ceph, I'll have to discover where they are, but no problem).

One thing called my attention on your response however:

I haven't made myself clear, but one of the failures we encountered were
that the files now containing:

node02:
   --
   storage:
   --
   osds:
   --
   /dev/sda4:
   --
   format:
   bluestore
   standalone:
   True

Were originally empty, and we filled them by hand following a model found
elsewhere on the web. It was necessary, so that we could continue, but the
model indicated that, for example, it should have the path for /dev/sda
here, not /dev/sda4. We chose to include the specific partition
identification because we won't have dedicated disks here, rather just the
very same partition, as all disks were partitioned exactly the same.

While that was enough for the procedure to continue at that point, now I
wonder if it was the right call and, if it indeed was, if it was done
properly.  As such, I wonder: what you mean by "wipe" the partition here?
/dev/sda4 is created, but is both empty and unmounted: Should a different
operation be performed on it, should I remove it first, should I have
written the files above with only /dev/sda as target?

I know that I probably wouldn't run into these issues with dedicated disks,
but unfortunately that is absolutely not an option.

Thanks a lot in advance for any comments and/or extra suggestions.

Sincerely yours,

Jones

On Sat, Aug 25, 2018 at 5:46 PM Eugen Block  wrote:


Hi,

take a look into the logs, they should point you in the right direction.
Since the deployment stage fails at the OSD level, start with the OSD
logs. Something's not right with the disks/partitions, did you wipe
the partition from previous attempts?

Regards,
Eugen

Zitat von Jones de Andrade :


(Please forgive my previous email: I was using another message and
completely forgot to update the subject)

Hi all.

I'm new to ceph, and after having serious problems in ceph stages 0, 1

and

2 that I could solve myself, now it seems that I have hit a wall harder
than my head. :)

When I run salt-run state.orch ceph.stage.deploy and monitor it, I see it
going up to here:

###
[14/71]   ceph.sysctl on
  node01... ✓ (0.5s)
  node02 ✓ (0.7s)
  node03... ✓ (0.6s)
  node04. ✓ (0.5s)
  node05... ✓ (0.6s)
  node06.. ✓ (0.5s)

[15/71]   ceph.osd on
  node01.. ❌ (0.7s)
  node02 ❌ (0.7s)
  node03... ❌ (0.7s)
  node04. ❌ (0.6s)
  node05... ❌ (0.6s)
  node06.. ❌ (0.7s)

Ended stage: ceph.stage.deploy succeeded=14/71 failed=1/71 time=624.7s

Failures summary:

ceph.osd (/srv/salt/ceph/osd):
  node02:
deploy OSDs: Module function osd.deploy threw an exception.

Exception:

Mine on node02 for cephdisks.list
  node03:
deploy OSDs: Module function osd.deploy threw an exception.

Exception:

Mine on node03 for cephdisks.list
  node01:
deploy OSDs: Module function osd.deploy threw an exception.

Exception:

Mine on node01 for cephdisks.list
  node04:
deploy OSDs: Module function osd.deploy threw an exception.

Exception:

Mine on node04 for cephdisks.list
  node05:
deploy OSDs: Module function osd.deploy threw an exception.

Exception:

Mine on node05 for cephdisks.list
  node06:
deploy OSDs: Module function osd.deploy threw an exception.

Exception:

Mine on node06 for cephdisks.list
###

Since this is a first attempt on 6 simple test machines, we are going to
put the mon, osds, etc., on all nodes at first. Only the master is left
on a single machine (node01) for now.

As they are simple machines, they have a single hdd, which is partitioned
as follows (the hda4 partition is unmounted and left for the ce

Re: [ceph-users] Ceph-Deploy error on 15/71 stage

2018-08-25 Thread Eugen Block

Hi,

take a look into the logs, they should point you in the right direction.
Since the deployment stage fails at the OSD level, start with the OSD  
logs. Something's not right with the disks/partitions, did you wipe  
the partition from previous attempts?
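
To wipe such a partition from earlier attempts you could use something
like this (only if there is no data on it you still need):

wipefs -a /dev/sda4
ceph-volume lvm zap /dev/sda4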


Regards,
Eugen

Zitat von Jones de Andrade :


(Please forgive my previous email: I was using another message and
completely forgot to update the subject)

Hi all.

I'm new to ceph, and after having serious problems in ceph stages 0, 1 and
2 that I could solve myself, now it seems that I have hit a wall harder
than my head. :)

When I run salt-run state.orch ceph.stage.deploy and monitor it, I see it going
up to here:

###
[14/71]   ceph.sysctl on
  node01... ✓ (0.5s)
  node02 ✓ (0.7s)
  node03... ✓ (0.6s)
  node04. ✓ (0.5s)
  node05... ✓ (0.6s)
  node06.. ✓ (0.5s)

[15/71]   ceph.osd on
  node01.. ❌ (0.7s)
  node02 ❌ (0.7s)
  node03... ❌ (0.7s)
  node04. ❌ (0.6s)
  node05... ❌ (0.6s)
  node06.. ❌ (0.7s)

Ended stage: ceph.stage.deploy succeeded=14/71 failed=1/71 time=624.7s

Failures summary:

ceph.osd (/srv/salt/ceph/osd):
  node02:
deploy OSDs: Module function osd.deploy threw an exception. Exception:
Mine on node02 for cephdisks.list
  node03:
deploy OSDs: Module function osd.deploy threw an exception. Exception:
Mine on node03 for cephdisks.list
  node01:
deploy OSDs: Module function osd.deploy threw an exception. Exception:
Mine on node01 for cephdisks.list
  node04:
deploy OSDs: Module function osd.deploy threw an exception. Exception:
Mine on node04 for cephdisks.list
  node05:
deploy OSDs: Module function osd.deploy threw an exception. Exception:
Mine on node05 for cephdisks.list
  node06:
deploy OSDs: Module function osd.deploy threw an exception. Exception:
Mine on node06 for cephdisks.list
###

Since this is a first attempt on 6 simple test machines, we are going to
put the mon, osds, etc., on all nodes at first. Only the master is left on a
single machine (node01) for now.

As they are simple machines, they have a single HDD, which is partitioned
as follows (the sda4 partition is unmounted and left for the ceph system):

###
# lsblk
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda  8:00 465,8G  0 disk
├─sda1   8:10   500M  0 part /boot/efi
├─sda2   8:2016G  0 part [SWAP]
├─sda3   8:30  49,3G  0 part /
└─sda4   8:40   400G  0 part
sr0 11:01   3,7G  0 rom

# salt -I 'roles:storage' cephdisks.list
node01:
node02:
node03:
node04:
node05:
node06:

# salt -I 'roles:storage' pillar.get ceph
node02:
--
storage:
--
osds:
--
/dev/sda4:
--
format:
bluestore
standalone:
True
(and so on for all 6 machines)
##

Finally and just in case, my policy.cfg file reads:

#
#cluster-unassigned/cluster/*.sls
cluster-ceph/cluster/*.sls
profile-default/cluster/*.sls
profile-default/stack/default/ceph/minions/*yml
config/stack/default/global.yml
config/stack/default/ceph/cluster.yml
role-master/cluster/node01.sls
role-admin/cluster/*.sls
role-mon/cluster/*.sls
role-mgr/cluster/*.sls
role-mds/cluster/*.sls
role-ganesha/cluster/*.sls
role-client-nfs/cluster/*.sls
role-client-cephfs/cluster/*.sls
##

Please, could someone help me and shed some light on this issue?

Thanks a lot in advance,

Regards,

Jones




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Clients report OSDs down/up (dmesg) nothing in Ceph logs (flapping OSDs)

2018-08-24 Thread Eugen Block

Update:
I changed the primary affinity of one OSD back to 1.0 to test if those  
metrics change, and indeed they do:

OSD.24 immediately shows values greater than 0.
I guess the metrics are completely unrelated to the flapping.

So the search goes on...
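
For completeness, changing the primary affinity back is a one-liner,
something like:

ceph osd primary-affinity osd.24 1.0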


Zitat von Eugen Block :

An hour ago host5 started to report the OSDs on host4 as down (still  
no clue why), resulting in slow requests. This time no flapping  
occurred; the cluster recovered a couple of minutes later. No other  
OSDs reported that, only those two on host5. There's nothing in the  
logs of the reporting or the affected OSDs.


Then I compared a perf dump of one healthy OSD with one on host4.  
There's something strange about the metrics (many of them are 0), I  
just can't tell if it's related to the fact that host4 has no  
primary OSDs. But even with no primary OSD I would expect different  
values for OSDs that are running for a week now.
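
The perf dumps can be collected via the admin socket on the respective
OSD hosts, e.g. something like:

ceph daemon osd.1 perf dump > perfdump.osd1
ceph daemon osd.24 perf dump > perfdump.osd24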


---cut here---
host1:~ # diff -u perfdump.osd1 perfdump.osd24
--- perfdump.osd1   2018-08-23 11:03:03.695927316 +0200
+++ perfdump.osd24  2018-08-23 11:02:09.919927375 +0200
@@ -1,99 +1,99 @@
 {
 "osd": {
 "op_wip": 0,
-"op": 7878594,
-"op_in_bytes": 852767683202,
-"op_out_bytes": 1019871565411,
+"op": 0,
+"op_in_bytes": 0,
+"op_out_bytes": 0,
 "op_latency": {
-"avgcount": 7878594,
-"sum": 1018863.131206702,
-"avgtime": 0.129320425
+"avgcount": 0,
+"sum": 0.0,
+"avgtime": 0.0
 },
 "op_process_latency": {
-"avgcount": 7878594,
-"sum": 879970.400440694,
-"avgtime": 0.111691299
+"avgcount": 0,
+"sum": 0.0,
+"avgtime": 0.0
 },
 "op_prepare_latency": {
-"avgcount": 8321733,
-"sum": 41376.442963329,
-"avgtime": 0.004972094
+"avgcount": 0,
+"sum": 0.0,
+"avgtime": 0.0
 },
-"op_r": 3574792,
-"op_r_out_bytes": 1019871565411,
+"op_r": 0,
+"op_r_out_bytes": 0,
 "op_r_latency": {
-"avgcount": 3574792,
-"sum": 54750.502669010,
-"avgtime": 0.015315717
+"avgcount": 0,
+"sum": 0.0,
+"avgtime": 0.0
 },
 "op_r_process_latency": {
-"avgcount": 3574792,
-"sum": 34107.703579874,
-"avgtime": 0.009541171
+"avgcount": 0,
+"sum": 0.0,
+"avgtime": 0.0
 },
 "op_r_prepare_latency": {
-"avgcount": 3574817,
-"sum": 34262.515884817,
-"avgtime": 0.009584411
+"avgcount": 0,
+"sum": 0.0,
+"avgtime": 0.0
 },
-"op_w": 4249520,
-"op_w_in_bytes": 847518164870,
+"op_w": 0,
+"op_w_in_bytes": 0,
 "op_w_latency": {
-"avgcount": 4249520,
-"sum": 960898.540843217,
-"avgtime": 0.226119312
+"avgcount": 0,
+"sum": 0.0,
+"avgtime": 0.0
 },
 "op_w_process_latency": {
-"avgcount": 4249520,
-"sum": 844398.804808119,
-"avgtime": 0.198704513
+"avgcount": 0,
+"sum": 0.0,
+"avgtime": 0.0
 },
 "op_w_prepare_latency": {
-"avgcount": 4692618,
-"sum": 7032.358957948,
-"avgtime": 0.001498600
+"avgcount": 0,
+"sum": 0.0,
+"avgtime": 0.0
 },
-"op_rw": 54282,
-"op_rw_in_bytes": 5249518332,
+"op_rw": 0,
+"op_rw_in_bytes": 0,
 "op_rw_out_bytes": 0,
 "op_rw_latency": {
-"avgcount": 54282,
-"sum": 3214.087694475,
-"avgtime": 0.059210929
+"av
