Re: [ceph-users] Major ceph disaster

2019-05-15 Thread Konstantin Shalygin


On 5/15/19 1:49 PM, Kevin Flöh wrote:


Since we have 3+1 EC I didn't try this before. But when I run the command 
you suggested I get the following error:


ceph osd pool set ec31 min_size 2
Error EINVAL: pool min_size must be between 3 and 4



What is your current min size? `ceph osd pool get ec31 min_size`



k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Major ceph disaster

2019-05-15 Thread Kevin Flöh

ceph osd pool get ec31 min_size
min_size: 3
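
For reference, the k/m behind the pool can be double-checked as well (a sketch; the
profile name comes from the first command and is not necessarily "ec31"):

ceph osd pool get ec31 erasure_code_profile   # which EC profile the pool uses
ceph osd erasure-code-profile get ec31        # shows k and m for that profile
# for k=3, m=1 the allowed min_size range is k (3) to k+m (4), matching the error above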

On 15.05.19 9:09 vorm., Konstantin Shalygin wrote:

ceph osd pool get ec31 min_size



Re: [ceph-users] Using centraliced management configuration drops some unrecognized config option

2019-05-15 Thread Paul Emmerich
All of these options were removed quite some time ago; most of them
were removed in Luminous.
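
If in doubt whether an option still exists on a given release, the cluster can be asked
directly (a sketch, assuming the Nautilus `ceph config` interface):

ceph config ls | grep '^osd_op'            # list the option names the daemons know about
ceph config help osd_recovery_max_active   # show help/metadata for one specific option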


Paul


-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Wed, May 15, 2019 at 2:56 AM EDH - Manuel Rios Fernandez <
mrios...@easydatahost.com> wrote:

> Hi
>
>
>
> We’re moving our configuration to centralized management with “ceph
> config set” and a minimal ceph.conf on all nodes.
>
>
>
> Several Ceph options are not accepted… Why?
>
> ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus
> (stable)
>
>
>
> ceph config set osd osd_mkfs_type xfs
>
> Error EINVAL: unrecognized config option 'osd_mkfs_type'
>
> ceph config set osd osd_op_threads 12
>
> Error EINVAL: unrecognized config option 'osd_op_threads'
>
> ceph config set osd osd_disk_threads 2
>
> Error EINVAL: unrecognized config option 'osd_disk_threads'
>
> ceph config set osd osd_recovery_threads 4
>
> Error EINVAL: unrecognized config option 'osd_recovery_threads'
>
> ceph config set osd osd_recovery_thread 4
>
> Error EINVAL: unrecognized config option 'osd_recovery_thread'
>
>
>
> Bug? Or a failure in the CLI setup?
>
>
>
> Regards
>
>
>
> Manuel
>
>


[ceph-users] Does anybody know whether S3 encryption of Ceph is ready for production?

2019-05-15 Thread Guoyong
Does anybody know whether S3 encryption of Ceph is ready for production?


Re: [ceph-users] ceph nautilus deep-scrub health error

2019-05-15 Thread nokia ceph
Hi Manuel.

Thanks for your response. We will consider these settings when we enable
deep-scrubbing. For now, I saw this write-up in the Nautilus release notes:

Configuration values mon_warn_not_scrubbed and
mon_warn_not_deep_scrubbed have been renamed. They are now
mon_warn_pg_not_scrubbed_ratio and mon_warn_pg_not_deep_scrubbed_ratio
respectively. This is to clarify that these warnings are related to
pg scrubbing and are a ratio of the related interval. These options
are now enabled by default.

So we set mon_warn_pg_not_deep_scrubbed_ratio = 0, and after that the cluster
no longer goes into a warning state for PGs that are not deep scrubbed.
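
For reference, a sketch of how that can be applied through the centralized config
(the same key can also be set in ceph.conf):

ceph config set mon mon_warn_pg_not_deep_scrubbed_ratio 0   # silence the warning entirely
ceph config get mon mon_warn_pg_not_deep_scrubbed_ratio     # verify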

Thanks,
Muthu

On Tue, May 14, 2019 at 4:30 PM EDH - Manuel Rios Fernandez <
mrios...@easydatahost.com> wrote:

> Hi Muthu
>
>
>
> We hit the same issue, with nearly 2000 PGs not deep-scrubbed in time.
>
>
>
> We’re manually force scrubbing with :
>
>
>
> ceph health detail | grep -i not | awk '{print $2}' | while read i; do
> ceph pg deep-scrub ${i}; done
>
>
>
> It launches roughly 20-30 PGs to be deep-scrubbed. I think you can improve
> this with a sleep of 120 seconds between scrubs to avoid overloading your OSDs.
>
>
>
> To disable deep-scrub you can use “ceph osd set nodeep-scrub”. You can also
> constrain deep-scrub with thresholds:
>
> #Start Scrub 22:00
>
> osd scrub begin hour = 22
>
> #Stop Scrub 8
>
> osd scrub end hour = 8
>
> #Scrub Load 0.5
>
> osd scrub load threshold = 0.5
>
>
>
> Regards,
>
>
>
> Manuel
>
>
>
>
>
>
>
>
>
> *De:* ceph-users  *En nombre de *nokia
> ceph
> *Enviado el:* martes, 14 de mayo de 2019 11:44
> *Para:* Ceph Users 
> *Asunto:* [ceph-users] ceph nautilus deep-scrub health error
>
>
>
> Hi Team,
>
>
>
> After upgrading from Luminous to Nautilus, we see a "654 pgs not
> deep-scrubbed in time" error in ceph status. How can we disable this warning?
> In our setup we disable deep-scrubbing for performance reasons.
>
>
>
> Thanks,
>
> Muthu
>


[ceph-users] How do you deal with "clock skew detected"?

2019-05-15 Thread Jan Kasprzak
Hello, Ceph users,

how do you deal with the "clock skew detected" HEALTH_WARN message?

I think the internal RTC in most x86 servers has only 1-second resolution,
but Ceph's skew limit is much smaller than that. So every time I reboot
one of my mons (for a kernel upgrade or similar), I have to wait several
minutes for the system clock to synchronize over NTP, even though ntpd
was running before the reboot and was started again during system boot.

Thanks,

-Yenya

-- 
| Jan "Yenya" Kasprzak  |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
sir_clive> I hope you don't mind if I steal some of your ideas?
 laryross> As far as stealing... we call it sharing here.   --from rcgroups


Re: [ceph-users] How do you deal with "clock skew detected"?

2019-05-15 Thread Marco Stuurman
Hi Yenya,

You could try synchronizing the system clock to the hardware clock before
rebooting. Also try chrony; it catches up very fast.
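
A minimal sketch of that pre-reboot step (assuming the RTC is kept in UTC):

hwclock --systohc   # write the NTP-disciplined system clock back to the RTC
reboot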


Kind regards,

Marco Stuurman


Op wo 15 mei 2019 om 13:48 schreef Jan Kasprzak 

> Hello, Ceph users,
>
> how do you deal with the "clock skew detected" HEALTH_WARN message?
>
> I think the internal RTC in most x86 servers does have 1 second resolution
> only, but Ceph skew limit is much smaller than that. So every time I reboot
> one of my mons (for kernel upgrade or something), I have to wait for
> several
> minutes for the system clock to synchronize over NTP, even though ntpd
> has been running before reboot and was started during the system boot
> again.
>
> Thanks,
>
> -Yenya
>
> --
> | Jan "Yenya" Kasprzak 
> |
> | http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5
> |
> sir_clive> I hope you don't mind if I steal some of your ideas?
>  laryross> As far as stealing... we call it sharing here.   --from rcgroups


Re: [ceph-users] How do you deal with "clock skew detected"?

2019-05-15 Thread Richard Hesketh
Another option would be adding a boot time script which uses ntpdate (or
something) to force an immediate sync with your timeservers before ntpd
starts - this is actually suggested in ntpdate's man page!
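
A sketch of such a boot-time step (timeserver and service names are placeholders and
vary by distro; -b forces a step rather than a slew):

ntpdate -b pool.ntp.org && systemctl start ntpd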

Rich

On 15/05/2019 13:00, Marco Stuurman wrote:
> Hi Yenya,
> 
> You could try to synchronize the system clock to the hardware clock
> before rebooting. Also try chrony, it catches up very fast.
> 
> 
> Kind regards,
> 
> Marco Stuurman
> 
> 
> Op wo 15 mei 2019 om 13:48 schreef Jan Kasprzak  >
> 
>         Hello, Ceph users,
> 
> how do you deal with the "clock skew detected" HEALTH_WARN message?
> 
> I think the internal RTC in most x86 servers does have 1 second resolution
> only, but Ceph skew limit is much smaller than that. So every time I 
> reboot
> one of my mons (for kernel upgrade or something), I have to wait for 
> several
> minutes for the system clock to synchronize over NTP, even though ntpd
> has been running before reboot and was started during the system
> boot again.
> 
> Thanks,
> 
> -Yenya





[ceph-users] pool migration for cephfs?

2019-05-15 Thread Lars Täuber
Hi,

is there a way to migrate a cephfs to a new data pool like it is for rbd on 
nautilus?
https://ceph.com/geen-categorie/ceph-pool-migration/

Thanks
Lars


Re: [ceph-users] Lost OSD from PCIe error, recovered, to restore OSD process

2019-05-15 Thread Alfredo Deza
On Tue, May 14, 2019 at 7:24 PM Bob R  wrote:
>
> Does 'ceph-volume lvm list' show it? If so you can try to activate it with 
> 'ceph-volume lvm activate 122 74b01ec2--124d--427d--9812--e437f90261d4'

Good suggestion. If `ceph-volume lvm list` can see it, it can probably
activate it again. You can activate it with the OSD ID + OSD FSID, or
do:

ceph-volume lvm activate --all

You didn't say if the OSD wasn't coming up after trying to start it
(the systemd unit should still be there for ID 122), or if you tried
rebooting and that OSD didn't come up.

The systemd unit is tied to both the ID and FSID of the OSD, so it
shouldn't matter if the underlying device changed since ceph-volume
ensures it is the right one every time it activates.
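
Putting that together, a sketch of the sequence (the OSD fsid is taken from the
`ceph-volume lvm list` output, shown here as a placeholder):

ceph-volume lvm list                       # confirm the OSD is visible, note its "osd fsid"
ceph-volume lvm activate 122 <osd-fsid>    # activate by ID + FSID
ceph-volume lvm activate --all             # ...or activate everything it can find
systemctl status ceph-osd@122              # check whether the daemon came up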
>
> Bob
>
> On Tue, May 14, 2019 at 7:35 AM Tarek Zegar  wrote:
>>
>> Someone nuked an OSD that had 1-replica PGs. They accidentally did echo 1 > 
>> /sys/block/nvme0n1/device/device/remove
>> We got it back doing a echo 1 > /sys/bus/pci/rescan
>> However, it reenumerated as a different drive number (guess we didn't have 
>> udev rules)
>> They restored the LVM volume (vgcfgrestore 
>> ceph-8c81b2a3-6c8e-4cae-a3c0-e2d91f82d841 ; vgchange -ay 
>> ceph-8c81b2a3-6c8e-4cae-a3c0-e2d91f82d841)
>>
>> lsblk
>> nvme0n2 259:9 0 1.8T 0 disk
>> ceph--8c81b2a3--6c8e--4cae--a3c0--e2d91f82d841-osd--data--74b01ec2--124d--427d--9812--e437f90261d4
>>  253:1 0 1.8T 0 lvm
>>
>> We are stuck here. How do we attach an OSD daemon to the drive? It was 
>> OSD.122 previously
>>
>> Thanks
>>


[ceph-users] Huge rebalance after rebooting OSD host (Mimic)

2019-05-15 Thread Jan Kasprzak
Hello, Ceph users,

I wanted to install the recent kernel update on my OSD hosts
with CentOS 7, Ceph 13.2.5 Mimic. So I set a noout flag and ran
"yum -y update" on the first OSD host. This host has 8 bluestore OSDs
with data on HDDs and database on LVs of two SSDs (each SSD has 4 LVs
for OSD metadata).

Everything went OK, so I rebooted this host. After the OSD host
went back online, the cluster went from HEALTH_WARN (noout flag set)
to HEALTH_ERR, and started to rebalance itself, with reportedly almost 60 %
objects misplaced, and some of them degraded. And, of course backfill_toofull:

  cluster:
health: HEALTH_ERR
2300616/3975384 objects misplaced (57.872%)
Degraded data redundancy: 74263/3975384 objects degraded (1.868%), 
146 pgs degraded, 122 pgs undersized
Degraded data redundancy (low space): 44 pgs backfill_toofull
 
  services:
mon: 3 daemons, quorum stratus1,stratus2,stratus3
mgr: stratus3(active), standbys: stratus1, stratus2
osd: 44 osds: 44 up, 44 in; 2022 remapped pgs
rgw: 1 daemon active
 
  data:
pools:   9 pools, 3360 pgs
objects: 1.33 M objects, 5.0 TiB
usage:   25 TiB used, 465 TiB / 490 TiB avail
pgs: 74263/3975384 objects degraded (1.868%)
 2300616/3975384 objects misplaced (57.872%)
 1739 active+remapped+backfill_wait
 1329 active+clean
 102  active+recovery_wait+remapped
 76   active+undersized+degraded+remapped+backfill_wait
 31   active+remapped+backfill_wait+backfill_toofull
 30   active+recovery_wait+undersized+degraded+remapped
 21   active+recovery_wait+degraded+remapped
 8    active+undersized+degraded+remapped+backfill_wait+backfill_toofull
 6active+recovery_wait+degraded
 4active+remapped+backfill_toofull
 3active+recovery_wait+undersized+degraded
 3active+remapped+backfilling
 2active+recovery_wait
 2active+recovering+undersized
 1active+clean+remapped
 1active+undersized+degraded+remapped+backfill_toofull
 1active+undersized+degraded+remapped+backfilling
 1active+recovering+undersized+remapped
 
  io:
client:   681 B/s rd, 1013 KiB/s wr, 0 op/s rd, 32 op/s wr
recovery: 142 MiB/s, 93 objects/s
 
(note that I cleared the noout flag afterwards). What is wrong here?
Why did the cluster decide to rebalance itself?

I am keeping the rest of the OSD hosts unrebooted for now.

Thanks,

-Yenya

-- 
| Jan "Yenya" Kasprzak  |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
sir_clive> I hope you don't mind if I steal some of your ideas?
 laryross> As far as stealing... we call it sharing here.   --from rcgroups


Re: [ceph-users] Huge rebalance after rebooting OSD host (Mimic)

2019-05-15 Thread Marc Roos
Are you sure your OSDs are up and reachable? (Run ceph osd tree on 
another node.)



-Original Message-
From: Jan Kasprzak [mailto:k...@fi.muni.cz] 
Sent: woensdag 15 mei 2019 14:46
To: ceph-us...@ceph.com
Subject: [ceph-users] Huge rebalance after rebooting OSD host (Mimic)

Hello, Ceph users,

I wanted to install the recent kernel update on my OSD hosts with CentOS 
7, Ceph 13.2.5 Mimic. So I set a noout flag and ran "yum -y update" on 
the first OSD host. This host has 8 bluestore OSDs with data on HDDs and 
database on LVs of two SSDs (each SSD has 4 LVs for OSD metadata).

Everything went OK, so I rebooted this host. After the OSD host 
went back online, the cluster went from HEALTH_WARN (noout flag set) to 
HEALTH_ERR, and started to rebalance itself, with reportedly almost 60 % 
objects misplaced, and some of them degraded. And, of course 
backfill_toofull:

  cluster:
health: HEALTH_ERR
2300616/3975384 objects misplaced (57.872%)
Degraded data redundancy: 74263/3975384 objects degraded 
(1.868%), 146 pgs degraded, 122 pgs undersized
Degraded data redundancy (low space): 44 pgs 
backfill_toofull
 
  services:
mon: 3 daemons, quorum stratus1,stratus2,stratus3
mgr: stratus3(active), standbys: stratus1, stratus2
osd: 44 osds: 44 up, 44 in; 2022 remapped pgs
rgw: 1 daemon active
 
  data:
pools:   9 pools, 3360 pgs
objects: 1.33 M objects, 5.0 TiB
usage:   25 TiB used, 465 TiB / 490 TiB avail
pgs: 74263/3975384 objects degraded (1.868%)
 2300616/3975384 objects misplaced (57.872%)
 1739 active+remapped+backfill_wait
 1329 active+clean
 102  active+recovery_wait+remapped
 76   active+undersized+degraded+remapped+backfill_wait
 31   active+remapped+backfill_wait+backfill_toofull
 30   active+recovery_wait+undersized+degraded+remapped
 21   active+recovery_wait+degraded+remapped
 8
active+undersized+degraded+remapped+backfill_wait+backfill_toofull
 6active+recovery_wait+degraded
 4active+remapped+backfill_toofull
 3active+recovery_wait+undersized+degraded
 3active+remapped+backfilling
 2active+recovery_wait
 2active+recovering+undersized
 1active+clean+remapped
 1active+undersized+degraded+remapped+backfill_toofull
 1active+undersized+degraded+remapped+backfilling
 1active+recovering+undersized+remapped
 
  io:
client:   681 B/s rd, 1013 KiB/s wr, 0 op/s rd, 32 op/s wr
recovery: 142 MiB/s, 93 objects/s
 
(note that I cleaned the noout flag afterwards). What is wrong with it?
Why did the cluster decided to rebalance itself?

I am keeping the rest of the OSD hosts unrebooted for now.

Thanks,

-Yenya

-- 
| Jan "Yenya" Kasprzak  |
| http://www.fi.muni.cz/~kas/ GPG: 
4096R/A45477D5 |
sir_clive> I hope you don't mind if I steal some of your ideas?
 laryross> As far as stealing... we call it sharing here.   --from 
rcgroups


Re: [ceph-users] mimic: MDS standby-replay causing blocked ops (MDS bug?)

2019-05-15 Thread Frank Schilder
Dear Stefan,

thanks for the fast reply. We encountered the problem again, this time in a 
much simpler situation; please see below. However, let me start with your 
questions first:

What bug? -- In a single-active MDS setup, should an operation with 
"op_name": "fragmentdir" ever occur?

Trimming settings: In the version I'm running, mds_log_max_expiring does not 
exist and mds_log_max_segments is 128 by default. I guess this is fine.

Upgrading: The problem described here is the only issue we observe. Unless the 
problem is fixed upstream, upgrading won't help us and would be a bit of a 
waste of time. If someone can confirm that this problem is fixed in a newer 
version, we will do it. Otherwise, we might prefer to wait until it is.

News on the problem. We encountered it again when one of our users executed a 
command in parallel with pdsh on all our ~500 client nodes. This command 
accesses the same file from all these nodes pretty much simultaneously. We did 
this quite often in the past, but this time, the command got stuck and we 
started observing the MDS health problem again. Symptoms:

- The pdsh process enters an un-interruptible state.
- It is no longer possible to access the directory where the simultaneously 
accessed file resides (from any client).
- 'ceph status' reports 'MDS slow requests'
- The 'ceph daemon mds.nnn ops' list contains operations that are waiting for 
directory fragmentation (see full log below).
- The ops list contains an operation "internal op fragmentdir:mds.0:35" that is 
dispatched, but apparently never completes.
- Any attempt to access the locked directory adds operations to the ops list 
that will then also hang indefinitely.
- I/O to other directories continues to work fine.

We waited some time to confirm that ceph does not heal itself. It is a 
dead-lock situation that seems to be triggered by a large number of clients 
simultaneously accessing the same file/directory. This problem seems not to 
occur with 100 or fewer clients. The probability of occurrence seems load 
dependent.

Temporary fix: Failing the active MDS flushed the stuck operations. The cluster 
became healthy and all clients rejoined.
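
For the record, the commands involved were roughly these (a sketch; MDS name and
rank are placeholders):

ceph daemon mds.<name> ops                # list in-flight/stuck operations
ceph daemon mds.<name> dump_blocked_ops   # just the blocked ones
ceph mds fail <rank-or-name>              # fail the active MDS so a standby takes over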

This time I captured the MDS ops list (log output does not really contain more 
info than this list). It contains 12 ops and I will include it here in full 
length (hope this is acceptable):

{
"ops": [
{
"description": "client_request(client.386087:12791 lookup 
#0x127/file.pdf 2019-05-15 11:30:47.173526 caller_uid=0, 
caller_gid=0{})",
"initiated_at": "2019-05-15 11:30:47.174134",
"age": 492.800243,
"duration": 492.800277,
"type_data": {
"flag_point": "failed to authpin, dir is being fragmented",
"reqid": "client.386087:12791",
"op_type": "client_request",
"client_info": {
"client": "client.386087",
"tid": 12791
},
"events": [
{
"time": "2019-05-15 11:30:47.174134",
"event": "initiated"
},
{
"time": "2019-05-15 11:30:47.174134",
"event": "header_read"
},
{
"time": "2019-05-15 11:30:47.174136",
"event": "throttled"
},
{
"time": "2019-05-15 11:30:47.174144",
"event": "all_read"
},
{
"time": "2019-05-15 11:30:47.174245",
"event": "dispatched"
},
{
"time": "2019-05-15 11:30:47.174271",
"event": "failed to authpin, dir is being fragmented"
}
]
}
},
{
"description": "client_request(client.62472:6092355 create 
#0x138/lastnotification.uXMjaLSt 2019-05-15 11:15:02.883027 
caller_uid=105731, caller_gid=105731{})",
"initiated_at": "2019-05-15 11:15:02.884547",
"age": 1437.089830,
"duration": 1437.089937,
"type_data": {
"flag_point": "failed to authpin, dir is being fragmented",
"reqid": "client.62472:6092355",
"op_type": "client_request",
"client_info": {
"client": "client.62472",
"tid": 6092355
},
"events": [
{
"time": "2019-05-15 11:15:02.884547",
"event": "initiated"
},
{
"time": "2019-05-15 11:15:02.884547",
"event": "header_read"

Re: [ceph-users] Huge rebalance after rebooting OSD host (Mimic)

2019-05-15 Thread kas
Marc,

Marc Roos wrote:
: Are you sure your osd's are up and reachable? (run ceph osd tree on 
: another node)

They are up, because all three mons see them as up.
However, ceph osd tree provided the hint (thanks!): The OSD host went back
with hostname "localhost" instead of the correct one for some reason.
So the OSDs moved themselves to a new HOST=localhost CRUSH node directly
under the CRUSH root. I rebooted the OSD host once again, and it went up
again with the correct hostname, and the "ceph osd tree" output looks sane
now. So I guess we have a reason for such a huge rebalance.
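
As a guard against this happening again, a sketch (option name assumed to be
recognized on Mimic; it can also go into ceph.conf as "osd crush update on start = false"):

ceph config set osd osd_crush_update_on_start false   # OSDs stop re-parenting themselves on start
ceph osd tree | head                                  # verify the host buckets look sane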

However, even though the OSD tree is back in the normal state,
the rebalance is still going on, and there are even inactive PGs,
with some Ceph clients being stuck seemingly forever:

health: HEALTH_ERR
1964645/3977451 objects misplaced (49.395%)
Reduced data availability: 11 pgs inactive
Degraded data redundancy: 315678/3977451 objects degraded (7.937%),
542 pgs degraded, 546 pgs undersized
Degraded data redundancy (low space): 76 pgs backfill_toofull

  services:
mon: 3 daemons, quorum stratus1,stratus2,stratus3
mgr: stratus3(active), standbys: stratus1, stratus2
osd: 44 osds: 44 up, 44 in; 1806 remapped pgs
rgw: 1 daemon active

  data:
pools:   9 pools, 3360 pgs
objects: 1.33 M objects, 5.0 TiB
usage:   25 TiB used, 465 TiB / 490 TiB avail
pgs: 0.327% pgs not active
 315678/3977451 objects degraded (7.937%)
 1964645/3977451 objects misplaced (49.395%)
 1554 active+clean
 1226 active+remapped+backfill_wait
 482  active+undersized+degraded+remapped+backfill_wait
 51   active+undersized+degraded+remapped+backfill_wait+backfill_toofull
 25   active+remapped+backfill_wait+backfill_toofull
 6activating+remapped
 5active+undersized+remapped+backfill_wait
 4activating+undersized+degraded+remapped
 4active+undersized+degraded+remapped+backfilling
 2active+remapped+backfilling
 1activating+degraded+remapped

  io:
client:   0 B/s rd, 126 KiB/s wr, 0 op/s rd, 5 op/s wr
recovery: 52 MiB/s, 13 objects/s

# ceph pg ls|grep activating
23.298 622  622 0   0 2591064064 3041   
 activating+undersized+degraded+remapped 2019-05-15 15:03:04.626434  
102870'1371081   103721:1369041   [8,20,70]p8  [8,20]p8 2019-05-15 
02:10:34.972050 2019-05-15 02:10:34.972050 
23.2cb 695  695   695   0 2885144354 3097   
 activating+undersized+degraded+remapped 2019-05-15 15:03:04.592438   
102890'828931   103721:1594128   [0,70,78]p0[21,78]p21 2019-05-15 
10:23:02.789435 2019-05-14 00:46:19.161050 
23.346 6231  1245   0 2602515968 3076   
activating+degraded+remapped 2019-05-15 14:56:05.317986  
103083'1061153   103721:3719154 [78,79,26]p78  [26,23,5]p26 2019-05-15 
10:21:17.388467 2019-05-15 10:21:17.388467 
23.436 6640   664   0 276736 3079   
 activating+remapped 2019-05-15 15:05:19.349660   
103083'987000   103721:1525097 [13,70,19]p13 [13,19,18]p13 2019-05-14 
09:43:52.924297 2019-05-08 04:24:41.251620 
23.454 6960  1846   0 2872765970 3031   
 activating+remapped 2019-05-15 15:05:19.152343  
102896'1092297   103721:1607448   [2,69,70]p2 [24,12,75]p24 2019-05-15 
14:06:45.123388 2019-05-11 21:53:50.183932 
23.490 6360   636   0 2635874322 3064   
 activating+remapped 2019-05-15 15:05:19.368037  
103083'4996760   103721:1789524  [13,70,1]p13  [13,1,24]p13 2019-05-14 
05:16:51.180417 2019-05-09 04:51:52.645295 
23.4f5 6330  1266   0 2641321984 3084   
 activating+remapped 2019-05-15 14:56:04.248887  
103035'4667973   103721:2116544 [70,72,27]p70 [25,27,79]p25 2019-05-15 
01:07:28.978979 2019-05-08 07:20:08.253942 
23.76b 5960  1192   0 2481048116 3025   
 activating+remapped 2019-05-15 15:05:19.135491  
102723'1445725   103721:1907186 [70,13,72]p70  [26,13,8]p26 2019-05-14 
17:04:13.644789 2019-05-14 17:04:13.644789 
23.7e1 6040   604   0 2517671954 3008   
 activating+remapped 2019-05-15 14:56:04.246016   
102730'739689   103721:1262764   [8,79,21]p8   [8,21,26]p8 2019-05-14 
13:57:52.964361 2019-05-13 09:54:51.371622 
62.4b  108  794 0   0   74451903 1028   
 activating+undersized+degraded+remapped 2019-05-15 14:56:04.330268 
102517'1028 103721:22340 [79,78,20]p79[78,20]p78 2019-

Re: [ceph-users] Lost OSD from PCIe error, recovered, to restore OSD process

2019-05-15 Thread Tarek Zegar

TL;DR: I activated the drive successfully, but the daemon won't start; it looks
like it's complaining about the mon config, and I don't know why (there is a valid
ceph.conf on the host). Thoughts? I feel like it's close. Thank you.

I executed the command:
ceph-volume lvm activate --all


It found the drive and activated it:
--> Activating OSD ID 122 FSID a151bea5-d123-45d9-9b08-963a511c042a

--> ceph-volume lvm activate successful for osd ID: 122



However, systemd would not start the OSD process 122:
May 15 14:16:13 pok1-qz1-sr1-rk001-s20 ceph-osd[757237]: 2019-05-15
14:16:13.862 71970700 -1 monclient(hunting): handle_auth_bad_method
server allowed_methods [2] but i only support [2]
May 15 14:16:13 pok1-qz1-sr1-rk001-s20 ceph-osd[757237]: 2019-05-15
14:16:13.862 7116f700 -1 monclient(hunting): handle_auth_bad_method
server allowed_methods [2] but i only support [2]
May 15 14:16:13 pok1-qz1-sr1-rk001-s20 ceph-osd[757237]: failed to fetch
mon config (--no-mon-config to skip)
May 15 14:16:13 pok1-qz1-sr1-rk001-s20 systemd[1]: ceph-osd@122.service:
Main process exited, code=exited, status=1/FAILURE
May 15 14:16:13 pok1-qz1-sr1-rk001-s20 systemd[1]: ceph-osd@122.service:
Failed with result 'exit-code'.
May 15 14:16:14 pok1-qz1-sr1-rk001-s20 systemd[1]: ceph-osd@122.service:
Service hold-off time over, scheduling restart.
May 15 14:16:14 pok1-qz1-sr1-rk001-s20 systemd[1]: ceph-osd@122.service:
Scheduled restart job, restart counter is at 3.
-- Subject: Automatic restarting of a unit has been scheduled
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- Automatic restarting of the unit ceph-osd@122.service has been
scheduled, as the result for
-- the configured Restart= setting for the unit.
May 15 14:16:14 pok1-qz1-sr1-rk001-s20 systemd[1]: Stopped Ceph object
storage daemon osd.122.
-- Subject: Unit ceph-osd@122.service has finished shutting down
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- Unit ceph-osd@122.service has finished shutting down.
May 15 14:16:14 pok1-qz1-sr1-rk001-s20 systemd[1]: ceph-osd@122.service:
Start request repeated too quickly.
May 15 14:16:14 pok1-qz1-sr1-rk001-s20 systemd[1]: ceph-osd@122.service:
Failed with result 'exit-code'.
May 15 14:16:14 pok1-qz1-sr1-rk001-s20 systemd[1]: Failed to start Ceph
object storage daemon osd.122
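
A few checks that might narrow this down (a sketch; default keyring paths and cephx
are assumed):

ceph auth get osd.122                                  # does the cluster still have a key for osd.122?
cat /var/lib/ceph/osd/ceph-122/keyring                 # does the local keyring match it?
ceph -s --name osd.122 --keyring /var/lib/ceph/osd/ceph-122/keyring   # can this identity reach the mons?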





From:   Alfredo Deza 
To: Bob R 
Cc: Tarek Zegar , ceph-users

Date:   05/15/2019 08:27 AM
Subject:[EXTERNAL] Re: [ceph-users] Lost OSD from PCIe error,
recovered, to restore OSD process



On Tue, May 14, 2019 at 7:24 PM Bob R  wrote:
>
> Does 'ceph-volume lvm list' show it? If so you can try to activate it
with 'ceph-volume lvm activate 122
74b01ec2--124d--427d--9812--e437f90261d4'

Good suggestion. If `ceph-volume lvm list` can see it, it can probably
activate it again. You can activate it with the OSD ID + OSD FSID, or
do:

ceph-volume lvm activate --all

You didn't say if the OSD wasn't coming up after trying to start it
(the systemd unit should still be there for ID 122), or if you tried
rebooting and that OSD didn't come up.

The systemd unit is tied to both the ID and FSID of the OSD, so it
shouldn't matter if the underlying device changed since ceph-volume
ensures it is the right one every time it activates.
>
> Bob
>
> On Tue, May 14, 2019 at 7:35 AM Tarek Zegar  wrote:
>>
>> Someone nuked an OSD that had 1-replica PGs. They accidentally did echo
1 > /sys/block/nvme0n1/device/device/remove
>> We got it back doing a echo 1 > /sys/bus/pci/rescan
>> However, it reenumerated as a different drive number (guess we didn't
have udev rules)
>> They restored the LVM volume (vgcfgrestore
ceph-8c81b2a3-6c8e-4cae-a3c0-e2d91f82d841 ; vgchange -ay
ceph-8c81b2a3-6c8e-4cae-a3c0-e2d91f82d841)
>>
>> lsblk
>> nvme0n2 259:9 0 1.8T 0 disk
>>
ceph--8c81b2a3--6c8e--4cae--a3c0--e2d91f82d841-osd--data--74b01ec2--124d--427d--9812--e437f90261d4
 253:1 0 1.8T 0 lvm
>>
>> We are stuck here. How do we attach an OSD daemon to the drive? It was
OSD.122 previously
>>
>> Thanks
>>


Re: [ceph-users] Huge rebalance after rebooting OSD host (Mimic)

2019-05-15 Thread kas
kas wrote:
:   Marc,
: 
: Marc Roos wrote:
: : Are you sure your osd's are up and reachable? (run ceph osd tree on 
: : another node)
: 
:   They are up, because all three mons see them as up.
: However, ceph osd tree provided the hint (thanks!): The OSD host went back
: with hostname "localhost" instead of the correct one for some reason.
: So the OSDs moved themselves to a new HOST=localhost CRUSH node directly
: under the CRUSH root. I rebooted the OSD host once again, and it went up
: again with the correct hostname, and the "ceph osd tree" output looks sane
: now. So I guess we have a reason for such a huge rebalance.
: 
:   However, even though the OSD tree is back in the normal state,
: the rebalance is still going on, and there are even inactive PGs,
: with some Ceph clients being stuck seemingly forever:
: 
: health: HEALTH_ERR
: 1964645/3977451 objects misplaced (49.395%)
: Reduced data availability: 11 pgs inactive

Wild guessing what to do, I went to the rebooted OSD host and ran
systemctl restart ceph-osd.target
- restarting all OSD processes. The previously inactive (activating) pgs
went to the active state, and Ceph clients got unstuck. Now I see
HEALTH_ERR with backfill_toofull only, which I consider a normal state
during Ceph Mimic rebalance.

It would be interesting to know why some of the PGs got stuck,
and why the restart helped. FWIW, I have a "ceph pg query" output for
one of the 11 inactive PGs.

-Yenya

---
# ceph pg 23.4f5 query
{
"state": "activating+remapped",
"snap_trimq": "[]",
"snap_trimq_len": 0,
"epoch": 104015,
"up": [
70,
72,
27
],
"acting": [
25,
27,
79
],
"backfill_targets": [
"70",
"72"
],
"acting_recovery_backfill": [
"25",
"27",
"70",
"72",
"79"
],
"info": {
"pgid": "23.4f5",
"last_update": "103035'4667973",
"last_complete": "103035'4667973",
"log_tail": "102489'4664889",
"last_user_version": 4667973,
"last_backfill": "MAX",
"last_backfill_bitwise": 1,
"purged_snaps": [],
"history": {
"epoch_created": 406,
"epoch_pool_created": 406,
"last_epoch_started": 103086,
"last_interval_started": 103085,
"last_epoch_clean": 96881,
"last_interval_clean": 96880,
"last_epoch_split": 0,
"last_epoch_marked_full": 0,
"same_up_since": 103095,
"same_interval_since": 103095,
"same_primary_since": 95398,
"last_scrub": "102517'4667556",
"last_scrub_stamp": "2019-05-15 01:07:28.978979",
"last_deep_scrub": "102491'4666011",
"last_deep_scrub_stamp": "2019-05-08 07:20:08.253942",
"last_clean_scrub_stamp": "2019-05-15 01:07:28.978979"
},
"stats": {
"version": "103035'4667973",
"reported_seq": "2116838",
"reported_epoch": "104015",
"state": "activating+remapped",
"last_fresh": "2019-05-15 16:19:44.530005",
"last_change": "2019-05-15 14:56:04.248887",
"last_active": "2019-05-15 14:56:02.579506",
"last_peered": "2019-05-15 14:56:01.401941",
"last_clean": "2019-05-15 14:53:39.291350",
"last_became_active": "2019-05-15 14:55:54.163102",
"last_became_peered": "2019-05-15 14:55:54.163102",
"last_unstale": "2019-05-15 16:19:44.530005",
"last_undegraded": "2019-05-15 16:19:44.530005",
"last_fullsized": "2019-05-15 16:19:44.530005",
"mapping_epoch": 103095,
"log_start": "102489'4664889",
"ondisk_log_start": "102489'4664889",
"created": 406,
"last_epoch_clean": 96881,
"parent": "0.0",
"parent_split_bits": 0,
"last_scrub": "102517'4667556",
"last_scrub_stamp": "2019-05-15 01:07:28.978979",
"last_deep_scrub": "102491'4666011",
"last_deep_scrub_stamp": "2019-05-08 07:20:08.253942",
"last_clean_scrub_stamp": "2019-05-15 01:07:28.978979",
"log_size": 3084,
"ondisk_log_size": 3084,
"stats_invalid": false,
"dirty_stats_invalid": false,
"omap_stats_invalid": false,
"hitset_stats_invalid": false,
"hitset_bytes_stats_invalid": false,
"pin_stats_invalid": true,
"manifest_stats_invalid": true,
"snaptrimq_len": 0,
"stat_sum": {
"num_bytes": 2641321984,
"num_objects": 633,
"num_object_clones": 49,
"num_object_copies": 1899,
"num_objects_missing_on_primary": 0,
"num_objects_mi

Re: [ceph-users] Ceph Bucket strange issues rgw.none + id and marker diferent.

2019-05-15 Thread J. Eric Ivancich
Hi Manuel,

My response is interleaved below.

On 5/8/19 3:17 PM, EDH - Manuel Rios Fernandez wrote:
> Eric,
> 
> Yes we do :
> 
> time s3cmd ls s3://[BUCKET]/ --no-ssl and we get near 2min 30 secs for list 
> the bucket.

We're adding an --allow-unordered option to `radosgw-admin bucket list`.
That would likely speed up your listing. If you want to follow the
trackers, they are:

https://tracker.ceph.com/issues/39637 [feature added to master]
https://tracker.ceph.com/issues/39730 [nautilus backport]
https://tracker.ceph.com/issues/39731 [mimic backport]
https://tracker.ceph.com/issues/39732 [luminous backport]
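
Once those land, usage would presumably look like this (a sketch; option names as in
the tracker):

radosgw-admin bucket list --bucket=<BUCKET> --allow-unordered --max-entries=1000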

> If we immediately run the query again, it normally times out.

That's interesting. I don't have an explanation for that behavior. I
would suggest creating a tracker for the issue, ideally with the minimal
steps to reproduce the issue. My concern is that your bucket has so many
objects, and if that's related to the issue, it would not be easy to
reproduce.

> Could you explain a little more "
> 
> With respect to your earlier message in which you included the output of 
> `ceph df`, I believe the reason that default.rgw.buckets.index shows as
> 0 bytes used is that the index uses the metadata branch of the object to 
> store its data.
> "

Each object in ceph has three components. The data itself plus two types
of metadata (omap and xattr). The `ceph df` command doesn't count the
metadata.

The bucket indexes that track the objects in each bucket use only the
metadata. So you won't see that reported in `ceph df`.

> I read on IRC today that in the Nautilus release this is now calculated
> correctly and no longer shows 0 B. Is that correct?

I don't know. I wasn't aware of any changes in nautilus that report
metadata in `ceph df`.

> Thanks for your response.

You're welcome,

Eric


Re: [ceph-users] How do you deal with "clock skew detected"?

2019-05-15 Thread EDH - Manuel Rios Fernandez
We setup 2 monitors as NTP server, and the other nodes are sync from monitors.

-Mensaje original-
De: ceph-users  En nombre de Richard Hesketh
Enviado el: miércoles, 15 de mayo de 2019 14:04
Para: ceph-users@lists.ceph.com
Asunto: Re: [ceph-users] How do you deal with "clock skew detected"?

Another option would be adding a boot time script which uses ntpdate (or
something) to force an immediate sync with your timeservers before ntpd starts 
- this is actually suggested in ntpdate's man page!

Rich

On 15/05/2019 13:00, Marco Stuurman wrote:
> Hi Yenya,
> 
> You could try to synchronize the system clock to the hardware clock 
> before rebooting. Also try chrony, it catches up very fast.
> 
> 
> Kind regards,
> 
> Marco Stuurman
> 
> 
> Op wo 15 mei 2019 om 13:48 schreef Jan Kasprzak  >
> 
> Hello, Ceph users,
> 
> how do you deal with the "clock skew detected" HEALTH_WARN message?
> 
> I think the internal RTC in most x86 servers does have 1 second resolution
> only, but Ceph skew limit is much smaller than that. So every time I 
> reboot
> one of my mons (for kernel upgrade or something), I have to wait for 
> several
> minutes for the system clock to synchronize over NTP, even though ntpd
> has been running before reboot and was started during the system
> boot again.
> 
> Thanks,
> 
> -Yenya




Re: [ceph-users] Ceph Bucket strange issues rgw.none + id and marker diferent.

2019-05-15 Thread EDH - Manuel Rios Fernandez
Hi Eric,

FYI, ceph osd df in Nautilus reports metadata and omap. We updated to
Nautilus 14.2.1.

I'm going to create an issue in the tracker about the timeout when the query is
re-run.

[root@CEPH001 ~]# ceph osd df tree
ID  CLASS   WEIGHTREWEIGHT SIZERAW USE DATAOMAPMETA AVAIL   
%USE  VAR  PGS STATUS TYPE NAME
-41 654.84045- 655 TiB 556 TiB 555 TiB  24 MiB  1.0 TiB  99 TiB 
84.88 1.01   -root archive
-37 130.96848- 131 TiB 111 TiB 111 TiB 698 KiB  209 GiB  20 TiB 
84.93 1.01   -host CEPH-ARCH-R03-07
100 archive  10.91399  1.0  11 TiB 9.1 TiB 9.1 TiB  28 KiB   17 GiB 1.8 TiB 
83.49 0.99 199 up osd.100
101 archive  10.91399  1.0  11 TiB 9.3 TiB 9.3 TiB  20 KiB   18 GiB 1.6 TiB 
85.21 1.01 197 up osd.101
102 archive  10.91399  1.0  11 TiB 9.4 TiB 9.4 TiB  16 KiB   18 GiB 1.5 TiB 
86.56 1.03 219 up osd.102
103 archive  10.91399  1.0  11 TiB 9.6 TiB 9.6 TiB 112 KiB   18 GiB 1.3 TiB 
87.88 1.05 240 up osd.103
104 archive  10.91399  1.0  11 TiB 9.4 TiB 9.4 TiB  48 KiB   18 GiB 1.5 TiB 
85.85 1.02 212 up osd.104
105 archive  10.91399  1.0  11 TiB 9.1 TiB 9.0 TiB  16 KiB   18 GiB 1.8 TiB 
83.07 0.99 195 up osd.105
106 archive  10.91409  1.0  11 TiB 9.3 TiB 9.3 TiB  16 KiB   18 GiB 1.6 TiB 
85.51 1.02 202 up osd.106
107 archive  10.91409  1.0  11 TiB 9.1 TiB 9.1 TiB 129 KiB   17 GiB 1.8 TiB 
83.33 0.99 193 up osd.107
108 archive  10.91409  1.0  11 TiB 9.3 TiB 9.3 TiB  76 KiB   17 GiB 1.6 TiB 
85.51 1.02 211 up osd.108
109 archive  10.91409  1.0  11 TiB 9.3 TiB 9.2 TiB 140 KiB   17 GiB 1.6 TiB 
84.89 1.01 210 up osd.109
110 archive  10.91409  1.0  11 TiB 9.1 TiB 9.1 TiB   4 KiB   17 GiB 1.8 TiB 
83.84 1.00 190 up osd.110
111 archive  10.91409  1.0  11 TiB 9.2 TiB 9.2 TiB  93 KiB   17 GiB 1.7 TiB 
84.04 1.00 201 up osd.111
-23 130.96800- 131 TiB 112 TiB 111 TiB 324 KiB  209 GiB  19 TiB 
85.26 1.01   -host CEPH005
  4 archive  10.91399  1.0  11 TiB 9.4 TiB 9.4 TiB 108 KiB   17 GiB 1.5 TiB 
85.82 1.02 226 up osd.4
 41 archive  10.91399  1.0  11 TiB 9.4 TiB 9.4 TiB  20 KiB   18 GiB 1.5 TiB 
86.11 1.02 203 up osd.41
 74 archive  10.91399  1.0  11 TiB 9.1 TiB 9.1 TiB  36 KiB   17 GiB 1.8 TiB 
83.36 0.99 198 up osd.74
 75 archive  10.91399  1.0  11 TiB 9.2 TiB 9.2 TiB  12 KiB   18 GiB 1.7 TiB 
84.25 1.00 205 up osd.75
 81 archive  10.91399  1.0  11 TiB 9.2 TiB 9.2 TiB  48 KiB   17 GiB 1.7 TiB 
84.48 1.01 203 up osd.81
 82 archive  10.91399  1.0  11 TiB 9.4 TiB 9.4 TiB  36 KiB   17 GiB 1.5 TiB 
86.57 1.03 210 up osd.82
 83 archive  10.91399  1.0  11 TiB 9.3 TiB 9.3 TiB  16 KiB   18 GiB 1.6 TiB 
85.23 1.01 200 up osd.83
 84 archive  10.91399  1.0  11 TiB 9.1 TiB 9.1 TiB   4 KiB   17 GiB 1.8 TiB 
83.83 1.00 205 up osd.84
 85 archive  10.91399  1.0  11 TiB 9.3 TiB 9.3 TiB  12 KiB   18 GiB 1.6 TiB 
85.06 1.01 202 up osd.85
 86 archive  10.91399  1.0  11 TiB 9.3 TiB 9.2 TiB  12 KiB   18 GiB 1.6 TiB 
84.90 1.01 204 up osd.86
 87 archive  10.91399  1.0  11 TiB 9.5 TiB 9.5 TiB   4 KiB   18 GiB 1.4 TiB 
87.16 1.04 223 up osd.87
 88 archive  10.91399  1.0  11 TiB 9.4 TiB 9.4 TiB  16 KiB   18 GiB 1.5 TiB 
86.35 1.03 208 up osd.88
-17 130.96800- 131 TiB 111 TiB 111 TiB 6.6 MiB  203 GiB  20 TiB 
84.65 1.01   -host CEPH006
  7 archive  10.91399  1.0  11 TiB 9.2 TiB 9.2 TiB 1.4 MiB   17 GiB 1.7 TiB 
84.49 1.01 201 up osd.7
  8 archive  10.91399  1.0  11 TiB 9.3 TiB 9.2 TiB 2.2 MiB   17 GiB 1.7 TiB 
84.79 1.01 206 up osd.8
  9 archive  10.91399  1.0  11 TiB 9.2 TiB 9.2 TiB 2.7 MiB   17 GiB 1.7 TiB 
84.28 1.00   0   down osd.9
 10 archive  10.91399  1.0  11 TiB 9.2 TiB 9.2 TiB  24 KiB   17 GiB 1.7 TiB 
84.66 1.01 190 up osd.10
 12 archive  10.91399  1.0  11 TiB 9.2 TiB 9.2 TiB  16 KiB   17 GiB 1.7 TiB 
84.38 1.00 203 up osd.12
 13 archive  10.91399  1.0  11 TiB 9.2 TiB 9.2 TiB  24 KiB   17 GiB 1.7 TiB 
84.34 1.00 202 up osd.13
 42 archive  10.91399  1.0  11 TiB 9.1 TiB 9.1 TiB   8 KiB   17 GiB 1.8 TiB 
83.73 1.00 198 up osd.42
 43 archive  10.91399  1.0  11 TiB 9.5 TiB 9.4 TiB  36 KiB   17 GiB 1.5 TiB 
86.62 1.03 213 up osd.43
 51 archive  10.91399  1.0  11 TiB 9.3 TiB 9.3 TiB  80 KiB   17 GiB 1.6 TiB 
84.99 1.01 204 up osd.51
 53 archive  10.91399  1.0  11 TiB 9.3 TiB 9.3 TiB  64 KiB   17 GiB 1.6 TiB 
85.05 1.01 217 up osd.53
 76 archive  10.91399  1.0  11 TiB 9.2 TiB 9.2 TiB  72 KiB   17 GiB 1.7 TiB 
84.27 1.00 196 up osd.76
 80 archive  10.91399  1.0  11 TiB 9.2 TiB 9.2 TiB   8 

Re: [ceph-users] pool migration for cephfs?

2019-05-15 Thread Patrick Donnelly
On Wed, May 15, 2019 at 5:05 AM Lars Täuber  wrote:
> is there a way to migrate a cephfs to a new data pool like it is for rbd on 
> nautilus?
> https://ceph.com/geen-categorie/ceph-pool-migration/

No, this isn't possible.

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D


Re: [ceph-users] pool migration for cephfs?

2019-05-15 Thread Peter Woodman
I actually made a dumb python script to do this. It's ugly and has a
lot of hardcoded things in it (like the mount location where i'm
copying things to to move pools, names of pools, the savings i was
expecting, etc) but should be easy to adapt to what you're trying to
do

https://gist.github.com/pjjw/b5fbee24c848661137d6ac09a3e0c980

On Wed, May 15, 2019 at 1:45 PM Patrick Donnelly  wrote:
>
> On Wed, May 15, 2019 at 5:05 AM Lars Täuber  wrote:
> > is there a way to migrate a cephfs to a new data pool like it is for rbd on 
> > nautilus?
> > https://ceph.com/geen-categorie/ceph-pool-migration/
>
> No, this isn't possible.
>
> --
> Patrick Donnelly, Ph.D.
> He / Him / His
> Senior Software Engineer
> Red Hat Sunnyvale, CA
> GPG: 19F28A586F808C2402351B93C3301A3E258DD79D


Re: [ceph-users] How do you deal with "clock skew detected"?

2019-05-15 Thread Alexandre DERUMIER
Since I'm using chrony instead of ntpd/openntpd, I don't have clock skew anymore.

(chrony resyncs much faster.)

- Mail original -
De: "Jan Kasprzak" 
À: "ceph-users" 
Envoyé: Mercredi 15 Mai 2019 13:47:57
Objet: [ceph-users] How do you deal with "clock skew detected"?

Hello, Ceph users, 

how do you deal with the "clock skew detected" HEALTH_WARN message? 

I think the internal RTC in most x86 servers does have 1 second resolution 
only, but Ceph skew limit is much smaller than that. So every time I reboot 
one of my mons (for kernel upgrade or something), I have to wait for several 
minutes for the system clock to synchronize over NTP, even though ntpd 
has been running before reboot and was started during the system boot again. 

Thanks, 

-Yenya 

-- 
| Jan "Yenya" Kasprzak  | 
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 | 
sir_clive> I hope you don't mind if I steal some of your ideas? 
laryross> As far as stealing... we call it sharing here. --from rcgroups 


Re: [ceph-users] pool migration for cephfs?

2019-05-15 Thread Elise Burke
I came across that and tried it - the short answer is no, you can't do that
using a cache tier. I'm less sure about the longer answer as to why, but
IIRC it has to do with copying / editing the OMAP object properties.

The good news, however, is that you can 'fake it' using File Layouts -
http://docs.ceph.com/docs/mimic/cephfs/file-layouts/

In my case I was moving around / upgrading disks and wanted to change from
unreplicated (well, r=1) to erasure coding (in my case, rs4.1). I was able
to do this keeping the following in mind:

1. The original pool, cephfs_data, must remain as a replicated pool. I'm
unsure why, IIRC some metadata can't be kept in erasure coded pools.
2. The metadata pool, cephfs_metadata, must also remain as a replicated
pool.
3. Your new pool (the destination pool) can be created however you like.
4. This procedure involves rolling unavailability on a per-file basis.

This is from memory; I should do a better writeup elsewhere, but what I did
was this:

1. Create your new pool. `ceph osd pool create cephfs_data_ec_rs4.1 8 8
erasure rs4.1`
2. Set the xattr for the root directory to use the new pool: `setfattr -n
ceph.dir.layout.pool -v cephfs_data_ec_rs4.1 /cephfs_mountpoint/`

At this stage all new files will be written to the new pool. Unfortunately
you can't change the layout of a file with data, so copying the files back
into their own place is required. You can hack up a bash script to do this,
or write a converter program. Here's the most relevant bit, per file, which
copies the file first and then renames the new file to the old file:

func doConvert(filename string) error {
poolRewriteName, previousPoolName, err :=
newNearbyTempFiles(filename)
if err != nil {
return err
}
err = SetCephFSFileLayoutPool(poolRewriteName, []byte(*toPool))
if err != nil {
os.Remove(poolRewriteName)
os.Remove(previousPoolName)
return err
}

err = CopyFilePermissions(filename, poolRewriteName)
if err != nil {
os.Remove(poolRewriteName)
os.Remove(previousPoolName)
return err
}

//log.Printf("Copying %s to %s\n", filename, poolRewriteName)
err = CopyFile(filename, poolRewriteName)
if err != nil {
os.Remove(poolRewriteName)
os.Remove(previousPoolName)
return err
}

//log.Printf("Moving %s to %s\n", filename, previousPoolName)
err = MoveFile(filename, previousPoolName)
if err != nil {
os.Remove(poolRewriteName)
os.Remove(previousPoolName)
return err
}

//log.Printf("Moving %s to %s\n", poolRewriteName, filename)
err = MoveFile(poolRewriteName, filename)
os.Remove(poolRewriteName)
os.Remove(previousPoolName)
return err
}
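
For a handful of files, a rough shell equivalent of the same copy-and-swap is enough
(a sketch; the path is a placeholder and there is no protection against concurrent
writers):

f=/cephfs_mountpoint/some/file
cp -p -- "$f" "$f.newpool" &&   # the new file inherits the directory's new layout
mv -- "$f" "$f.oldpool" &&
mv -- "$f.newpool" "$f" &&
rm -- "$f.oldpool"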



On Wed, May 15, 2019 at 10:31 AM Lars Täuber  wrote:

> Hi,
>
> is there a way to migrate a cephfs to a new data pool like it is for rbd
> on nautilus?
> https://ceph.com/geen-categorie/ceph-pool-migration/
>
> Thanks
> Lars


Re: [ceph-users] pool migration for cephfs?

2019-05-15 Thread Elise Burke
Oops, forgot a step - need to tell the MDS about the new pool before step 2:

`ceph mds add_data_pool `

You may also need to mark the pool as used by cephfs:

`ceph osd pool application enable {pool-name} cephfs`

On Wed, May 15, 2019 at 3:15 PM Elise Burke  wrote:

> I came across that and tried it - the short answer is no, you can't do
> that - using cache tier. The longer answer as to why I'm less sure about,
> but iirc it has to do with copying / editing the OMAP object properties.
>
> The good news, however, is that you can 'fake it' using File Layouts -
> http://docs.ceph.com/docs/mimic/cephfs/file-layouts/
>
> In my case I was moving around / upgrading disks and wanted to change from
> unreplicated (well, r=1) to erasure coding (in my case, rs4.1). I was able
> to do this keeping the following in mind:
>
> 1. The original pool, cephfs_data, must remain as a replicated pool. I'm
> unsure why, IIRC some metadata can't be kept in erasure coded pools.
> 2. The metadata pool, cephfs_metadata, must also remain as a replicated
> pool.
> 3. Your new pool (the destination pool) can be created however you like.
> 4. This procedure involves rolling unavailability on a per-file basis.
>
> This is from memory; I should do a better writeup elsewhere, but what I
> did was this:
>
> 1. Create your new pool. `ceph osd pool create  cephfs_data_ec_rs4.1 8 8
> erasure rs4.1`
> 2. Set the xattr for the root directory to use the new pool: `setfattr -n
> ceph.dir.layout.pool -v cephfs_data_ec_rs4.1 /cephfs_mountpoint/`
>
> At this stage all new files will be written to the new pool. Unfortunately
> you can't change the layout of a file with data, so copying the files back
> into their own place is required. You can hack up a bash script to do this,
> or write a converter program. Here's the most relevant bit, per file, which
> copies the file first and then renames the new file to the old file:
>
> func doConvert(filename string) error {
> poolRewriteName, previousPoolName, err :=
> newNearbyTempFiles(filename)
> if err != nil {
> return err
> }
> err = SetCephFSFileLayoutPool(poolRewriteName, []byte(*toPool))
> if err != nil {
> os.Remove(poolRewriteName)
> os.Remove(previousPoolName)
> return err
> }
>
> err = CopyFilePermissions(filename, poolRewriteName)
> if err != nil {
> os.Remove(poolRewriteName)
> os.Remove(previousPoolName)
> return err
> }
>
> //log.Printf("Copying %s to %s\n", filename, poolRewriteName)
> err = CopyFile(filename, poolRewriteName)
> if err != nil {
> os.Remove(poolRewriteName)
> os.Remove(previousPoolName)
> return err
> }
>
> //log.Printf("Moving %s to %s\n", filename, previousPoolName)
> err = MoveFile(filename, previousPoolName)
> if err != nil {
> os.Remove(poolRewriteName)
> os.Remove(previousPoolName)
> return err
> }
>
> //log.Printf("Moving %s to %s\n", poolRewriteName, filename)
> err = MoveFile(poolRewriteName, filename)
> os.Remove(poolRewriteName)
> os.Remove(previousPoolName)
> return err
> }
>
>
>
> On Wed, May 15, 2019 at 10:31 AM Lars Täuber  wrote:
>
>> Hi,
>>
>> is there a way to migrate a cephfs to a new data pool like it is for rbd
>> on nautilus?
>> https://ceph.com/geen-categorie/ceph-pool-migration/
>>
>> Thanks
>> Lars


Re: [ceph-users] pool migration for cephfs?

2019-05-15 Thread Brian Topping
Lars, I just got done doing this after generating about a dozen CephFS subtrees 
for different Kubernetes clients. 

tl;dr: there is no way for files to move between filesystem formats (i.e. CephFS 
<-> RBD) without copying them.

If you are doing the same thing, there may be some relevance for you in 
https://github.com/kubernetes/enhancements/pull/643. It’s worth checking to see 
if it meets your use case if so.

In any event, what I ended up doing was letting Kubernetes create the new PV 
with the RBD provisioner, then using find piped to cpio to move the file 
subtree. In a non-Kubernetes environment, one would simply create the 
destination RBD as usual. It should be most performant to do this on a monitor 
node.
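
The copy step itself is just a pass-through cpio, something like this sketch (paths
are placeholders; GNU cpio assumed for --null):

cd /mnt/cephfs/subtree &&
find . -depth -print0 | cpio --null -pdmv /mnt/rbd-volume/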

cpio ensures you don’t lose metadata. It’s been fine for me, but if you have 
special xattrs that the clients of the files need, be sure to test that those 
are copied over. It’s very difficult to move that metadata once a file is 
copied and even harder to deal with a situation where the destination volume 
went live and some files on the destination are both newer versions and missing 
metadata. 

Brian

> On May 15, 2019, at 6:05 AM, Lars Täuber  wrote:
> 
> Hi,
> 
> is there a way to migrate a cephfs to a new data pool like it is for rbd on 
> nautilus?
> https://ceph.com/geen-categorie/ceph-pool-migration/
> 
> Thanks
> Lars


Re: [ceph-users] ceph -s finds 4 pools but ceph osd lspools says no pool which is the expected answer

2019-05-15 Thread Gregory Farnum
On Tue, May 14, 2019 at 11:03 AM Rainer Krienke  wrote:
>
> Hello,
>
> for a freshly set up Ceph cluster I see a strange difference between the number
> of existing pools in the output of ceph -s and what I know should
> be there: no pools at all.
>
> I set up a fresh Nautilus cluster with 144 OSDs on 9 hosts. Just to play
> around I created a pool named rbd with
>
> $ ceph osd pool create rbd 512 512 replicated
>
> In ceph -s I saw the pool but also saw a warning:
>
>  cluster:
> id: a-b-c-d-e
> health: HEALTH_WARN
> too few PGs per OSD (21 < min 30)
>
> So I experimented around, removed the pool (ceph osd pool remove rbd)
> and it was gone in ceph osd lspools, and created a new one with some
> more PGs and repeated this a few times with larger PG nums. In the end
> in the output of ceph -s I see that 4 pools do exist:
>
>   cluster:
> id: a-b-c-d-e
> health: HEALTH_OK
>
>   services:
> mon: 3 daemons, quorum c2,c5,c8 (age 8h)
> mgr: c2(active, since 8h)
> osd: 144 osds: 144 up (since 8h), 144 in (since 8h)
>
>   data:
> pools:   4 pools, 0 pgs
> objects: 0 objects, 0 B
> usage:   155 GiB used, 524 TiB / 524 TiB avail
> pgs:
>
> but:
>
> $ ceph osd lspools
> 
>
> Since I deleted each pool I created, 0 pools is the correct answer.
> I could add another "ghost" pool by creating another pool named rbd with
> only 512 PGs and then delete it again right away. ceph -s would then
> show me 5 pools. This is the way I came from 3 to 4 "ghost pools".
>
> This does not seem to happen if I use 2048 PGs for the new pool which I
> do delete right afterwards. In this case the pool is created and ceph -s
> shows one pool more (5) and if delete this pool again the counter in
> ceph -s goes back to 4 again.
>
> How can I fix the system so that ceph -s also understands that are
> actually no pools? There must be some inconsistency. Any ideas?
>

I don't really see how this particular error can happen and be
long-lived, but if you restart the ceph-mgr it will probably resolve
itself.
("ceph osd lspools" looks directly at the OSDMap in the monitor,
whereas the "ceph -s" data output is generated from the manager's
pgmap, but there's a tight link where the pgmap gets updated and
removes dead pools on every new OSDMap the manager sees and I can't
see how that would go wrong.)
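
A sketch of that restart, using the daemon names from the status output above:

systemctl restart ceph-mgr.target   # on the active mgr host (c2 here)
ceph mgr fail c2                    # ...or just fail over to a standby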
-Greg


> Thanks
> Rainer
> --
> Rainer Krienke, Uni Koblenz, Rechenzentrum, A22, Universitaetsstrasse  1
> 56070 Koblenz, Web: http://www.uni-koblenz.de/~krienke, Tel: +49261287 1312
> PGP: http://www.uni-koblenz.de/~krienke/mypgp.html, Fax: +49261287
> 1001312


[ceph-users] PG scrub stamps reset to 0.000000 in 14.2.1

2019-05-15 Thread Brett Chancellor
After upgrading from 14.2.0 to 14.2.1, I've noticed PGs are frequently
resetting their scrub and deep scrub time stamps to 0.00.  It's extra
strange because the peers show timestamps for deep scrubs.

## First entry from a pg list at 7pm
$ grep 11.2f2 ~/pgs-active.7pm
11.2f2 6910 0   0 2897477632   0  0
2091 active+clean3h  7378'12291   8048:36261[1,6,37]p1
[1,6,37]p1 2019-05-14 21:01:29.172460 2019-05-14 21:01:29.172460

## Next Entry 3 minutes later
$ ceph pg ls active |grep 11.2f2
11.2f2 6950 0   0 2914713600   0  0
2091 active+clean6s  7378'12291   8049:36330[1,6,37]p1
[1,6,37]p1   0.00   0.00
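
As a rough way to gauge how widespread the zeroed stamps are (assuming they are rendered as 0.000000 in the plain-text dump, which may vary by release), something like the following could help:

# print the ids of PGs whose scrub stamps look zeroed out
ceph pg dump pgs 2>/dev/null | awk '/ 0\.000000/ {print $1}' | sort -u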

## PG Query
{
"state": "active+clean",
"snap_trimq": "[]",
"snap_trimq_len": 0,
"epoch": 8049,
"up": [
1,
6,
37
],
"acting": [
1,
6,
37
],
"acting_recovery_backfill": [
"1",
"6",
"37"
],
"info": {
"pgid": "11.2f2",
"last_update": "7378'12291",
"last_complete": "7378'12291",
"log_tail": "1087'10200",
"last_user_version": 12291,
"last_backfill": "MAX",
"last_backfill_bitwise": 1,
"purged_snaps": [],
"history": {
"epoch_created": 1549,
"epoch_pool_created": 216,
"last_epoch_started": 6148,
"last_interval_started": 6147,
"last_epoch_clean": 6148,
"last_interval_clean": 6147,
"last_epoch_split": 6147,
"last_epoch_marked_full": 0,
"same_up_since": 6126,
"same_interval_since": 6147,
"same_primary_since": 6126,
"last_scrub": "7378'12291",
"last_scrub_stamp": "0.00",
"last_deep_scrub": "6103'12186",
"last_deep_scrub_stamp": "0.00",
"last_clean_scrub_stamp": "2019-05-15 23:08:17.014575"
},
"stats": {
"version": "7378'12291",
"reported_seq": "36700",
"reported_epoch": "8049",
"state": "active+clean",
"last_fresh": "2019-05-15 23:08:17.014609",
"last_change": "2019-05-15 23:08:17.014609",
"last_active": "2019-05-15 23:08:17.014609",
"last_peered": "2019-05-15 23:08:17.014609",
"last_clean": "2019-05-15 23:08:17.014609",
"last_became_active": "2019-05-15 19:25:01.484322",
"last_became_peered": "2019-05-15 19:25:01.484322",
"last_unstale": "2019-05-15 23:08:17.014609",
"last_undegraded": "2019-05-15 23:08:17.014609",
"last_fullsized": "2019-05-15 23:08:17.014609",
"mapping_epoch": 6126,
"log_start": "1087'10200",
"ondisk_log_start": "1087'10200",
"created": 1549,
"last_epoch_clean": 6148,
"parent": "0.0",
"parent_split_bits": 10,
"last_scrub": "7378'12291",
"last_scrub_stamp": "0.00",
"last_deep_scrub": "6103'12186",
"last_deep_scrub_stamp": "0.00",
"last_clean_scrub_stamp": "2019-05-15 23:08:17.014575",
"log_size": 2091,
"ondisk_log_size": 2091,
"stats_invalid": false,
"dirty_stats_invalid": false,
"omap_stats_invalid": false,
"hitset_stats_invalid": false,
"hitset_bytes_stats_invalid": false,
"pin_stats_invalid": false,
"manifest_stats_invalid": true,
"snaptrimq_len": 0,
"stat_sum": {
"num_bytes": 2914713600,
"num_objects": 695,
"num_object_clones": 0,
"num_object_copies": 2085,
"num_objects_missing_on_primary": 0,
"num_objects_missing": 0,
"num_objects_degraded": 0,
"num_objects_misplaced": 0,
"num_objects_unfound": 0,
"num_objects_dirty": 695,
"num_whiteouts": 0,
"num_read": 0,
"num_read_kb": 0,
"num_write": 0,
"num_write_kb": 0,
"num_scrub_errors": 0,
"num_shallow_scrub_errors": 0,
"num_deep_scrub_errors": 0,
"num_objects_recovered": 0,
"num_bytes_recovered": 0,
"num_keys_recovered": 0,
"num_objects_omap": 0,
"num_objects_hit_set_archive": 0,
"num_bytes_hit_set_archive": 0,
"num_flush": 0,
"num_flush_kb": 0,
"num_evict": 0,
"num_evict_kb": 0,
"num_promote": 0,
"num_flush_mode_high": 0,
"num_flush_mode_low": 0,
"num_evict_mode_some": 0,
  

Re: [ceph-users] How do you deal with "clock skew detected"?

2019-05-15 Thread Konstantin Shalygin

how do you deal with the "clock skew detected" HEALTH_WARN message?

I think the internal RTC in most x86 servers has only 1-second resolution,
but the Ceph skew limit is much smaller than that. So every time I reboot
one of my mons (for a kernel upgrade or something), I have to wait several
minutes for the system clock to synchronize over NTP, even though ntpd
was running before the reboot and is started again during system boot.


Definitely you should use chrony with iburst.
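
For reference, a minimal /etc/chrony.conf along those lines; the NTP server names are placeholders and should be adjusted to your environment:

# /etc/chrony.conf - minimal example
server ntp1.example.com iburst
server ntp2.example.com iburst
# step the clock quickly right after boot instead of slewing slowly
makestep 1.0 3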



k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mimic: MDS standby-replay causing blocked ops (MDS bug?)

2019-05-15 Thread Yan, Zheng
On Wed, May 15, 2019 at 9:34 PM Frank Schilder  wrote:
>
> Dear Stefan,
>
> thanks for the fast reply. We encountered the problem again, this time in a 
> much simpler situation; please see below. However, let me start with your 
> questions first:
>
> What bug? -- In a single-active MDS set-up, should there ever occur an 
> operation with "op_name": "fragmentdir"?
>
> Trimming settings: In the version I'm running, mds_log_max_expiring does not 
> exist and mds_log_max_segments is 128 by default. I guess this is fine.
>
> Upgrading: The problem described here is the only issue we observe. Unless 
> the problem is fixed upstream, upgrading won't help us and would be a bit of 
> a waste of time. If someone can confirm that this problem is fixed in a newer 
> version, we will do it. Otherwise, we might prefer to wait until it is.
>
> News on the problem. We encountered it again when one of our users executed a 
> command in parallel with pdsh on all our ~500 client nodes. This command 
> accesses the same file from all these nodes pretty much simultaneously. We 
> did this quite often in the past, but this time, the command got stuck and we 
> started observing the MDS health problem again. Symptoms:
>
> - The pdsh process enters an un-interruptible state.
> - It is no longer possible to access the directory where the simultaneously 
> accessed file resides (from any client).
> - 'ceph status' reports 'MDS slow requests'
> - The 'ceph daemon mds.nnn ops' list contains operations that are waiting for 
> directory fragmentation (see full log below).
> - The ops list contains an operation "internal op fragmentdir:mds.0:35" that 
> is dispatched, but apparently never completes.
> - Any attempt to access the locked directory adds operations to the ops list 
> that will then also hang indefinitely.
> - I/O to other directories continues to work fine.
>
> We waited some time to confirm that ceph does not heal itself. It is a 
> dead-lock situation that seems to be triggered by a large number of clients 
> simultaneously accessing the same file/directory. This problem seems not to 
> occur with 100 or fewer clients. The probability of occurrence seems load 
> dependent.
>
> Temporary fix: Failing the active MDS flushed the stuck operations. The 
> cluster became healthy and all clients rejoined.
>
> This time I captured the MDS ops list (log output does not really contain 
> more info than this list). It contains 12 ops and I will include it here in 
> full length (hope this is acceptable):
>

Your issues were caused by stuck internal op fragmentdir.  Can you
dump mds cache and send the output to us?
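
A minimal sketch of how to capture that, assuming admin-socket access on the host running the active MDS and a daemon id of nnn (substitute your own):

# dump the MDS cache to a file on the MDS host (output path is just an example)
ceph daemon mds.nnn dump cache /tmp/mds-cache.txt
# capture the ops list alongside it for correlation
ceph daemon mds.nnn ops > /tmp/mds-ops.json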

> {
> "ops": [
> {
> "description": "client_request(client.386087:12791 lookup 
> #0x127/file.pdf 2019-05-15 11:30:47.173526 caller_uid=0, 
> caller_gid=0{})",
> "initiated_at": "2019-05-15 11:30:47.174134",
> "age": 492.800243,
> "duration": 492.800277,
> "type_data": {
> "flag_point": "failed to authpin, dir is being fragmented",
> "reqid": "client.386087:12791",
> "op_type": "client_request",
> "client_info": {
> "client": "client.386087",
> "tid": 12791
> },
> "events": [
> {
> "time": "2019-05-15 11:30:47.174134",
> "event": "initiated"
> },
> {
> "time": "2019-05-15 11:30:47.174134",
> "event": "header_read"
> },
> {
> "time": "2019-05-15 11:30:47.174136",
> "event": "throttled"
> },
> {
> "time": "2019-05-15 11:30:47.174144",
> "event": "all_read"
> },
> {
> "time": "2019-05-15 11:30:47.174245",
> "event": "dispatched"
> },
> {
> "time": "2019-05-15 11:30:47.174271",
> "event": "failed to authpin, dir is being fragmented"
> }
> ]
> }
> },
> {
> "description": "client_request(client.62472:6092355 create 
> #0x138/lastnotification.uXMjaLSt 2019-05-15 11:15:02.883027 
> caller_uid=105731, caller_gid=105731{})",
> "initiated_at": "2019-05-15 11:15:02.884547",
> "age": 1437.089830,
> "duration": 1437.089937,
> "type_data": {
> "flag_point": "failed to authpin, dir is being fragmented",
> "reqid": "client.62472:6092355",
> "op_type": "client_request",
> "client_info": {
> "client": "client.62472",

[ceph-users] Grow bluestore PV/LV

2019-05-15 Thread Michael Andersen
Hi

After growing the size of an OSD's PV/LV, how can I get bluestore to see
the new space as available? It does notice the LV has changed size, but it
sees the new space as occupied.

This is the same question as:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-January/023893.html
and
that original poster spent a lot of effort in explaining exactly what he
meant, but I could not find a reply to his email.

Thanks
Michael
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Huge rebalance after rebooting OSD host (Mimic)

2019-05-15 Thread huang jun
Did your OSDs' CRUSH location change after the reboot?
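
If the hostname really does change across reboots, one common mitigation, sketched here with a placeholder host name and offered only as an option to consider, is to stop OSDs from relocating themselves in the CRUSH map at startup, or to pin their location explicitly:

# /etc/ceph/ceph.conf on the OSD host - example only
[osd]
# do not let OSDs update their own CRUSH position when they start
osd crush update on start = false
# alternatively, pin the location explicitly instead of relying on the hostname
crush location = root=default host=osd-host-01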

kas wrote on Wednesday, May 15, 2019 at 10:39 PM:
>
> kas wrote:
> :   Marc,
> :
> : Marc Roos wrote:
> : : Are you sure your osd's are up and reachable? (run ceph osd tree on
> : : another node)
> :
> :   They are up, because all three mons see them as up.
> : However, ceph osd tree provided the hint (thanks!): The OSD host went back
> : with hostname "localhost" instead of the correct one for some reason.
> : So the OSDs moved themselves to a new HOST=localhost CRUSH node directly
> : under the CRUSH root. I rebooted the OSD host once again, and it went up
> : again with the correct hostname, and the "ceph osd tree" output looks sane
> : now. So I guess we have a reason for such a huge rebalance.
> :
> :   However, even though the OSD tree is back in the normal state,
> : the rebalance is still going on, and there are even inactive PGs,
> : with some Ceph clients being stuck seemingly forever:
> :
> : health: HEALTH_ERR
> : 1964645/3977451 objects misplaced (49.395%)
> : Reduced data availability: 11 pgs inactive
>
> Wild guessing what to do, I went to the rebooted OSD host and ran
> systemctl restart ceph-osd.target
> - restarting all OSD processes. The previously inactive (activating) pgs
> went to the active state, and Ceph clients got unstuck. Now I see
> HEALTH_ERR with backfill_toofull only, which I consider a normal state
> during Ceph Mimic rebalance.
>
> It would be interesting to know why some of the PGs went stuck,
> and why did restart help. FWIW, I have a "ceph pg query" output for
> one of the 11 inactive PGs.
>
> -Yenya
>
> ---
> # ceph pg 23.4f5 query
> {
> "state": "activating+remapped",
> "snap_trimq": "[]",
> "snap_trimq_len": 0,
> "epoch": 104015,
> "up": [
> 70,
> 72,
> 27
> ],
> "acting": [
> 25,
> 27,
> 79
> ],
> "backfill_targets": [
> "70",
> "72"
> ],
> "acting_recovery_backfill": [
> "25",
> "27",
> "70",
> "72",
> "79"
> ],
> "info": {
> "pgid": "23.4f5",
> "last_update": "103035'4667973",
> "last_complete": "103035'4667973",
> "log_tail": "102489'4664889",
> "last_user_version": 4667973,
> "last_backfill": "MAX",
> "last_backfill_bitwise": 1,
> "purged_snaps": [],
> "history": {
> "epoch_created": 406,
> "epoch_pool_created": 406,
> "last_epoch_started": 103086,
> "last_interval_started": 103085,
> "last_epoch_clean": 96881,
> "last_interval_clean": 96880,
> "last_epoch_split": 0,
> "last_epoch_marked_full": 0,
> "same_up_since": 103095,
> "same_interval_since": 103095,
> "same_primary_since": 95398,
> "last_scrub": "102517'4667556",
> "last_scrub_stamp": "2019-05-15 01:07:28.978979",
> "last_deep_scrub": "102491'4666011",
> "last_deep_scrub_stamp": "2019-05-08 07:20:08.253942",
> "last_clean_scrub_stamp": "2019-05-15 01:07:28.978979"
> },
> "stats": {
> "version": "103035'4667973",
> "reported_seq": "2116838",
> "reported_epoch": "104015",
> "state": "activating+remapped",
> "last_fresh": "2019-05-15 16:19:44.530005",
> "last_change": "2019-05-15 14:56:04.248887",
> "last_active": "2019-05-15 14:56:02.579506",
> "last_peered": "2019-05-15 14:56:01.401941",
> "last_clean": "2019-05-15 14:53:39.291350",
> "last_became_active": "2019-05-15 14:55:54.163102",
> "last_became_peered": "2019-05-15 14:55:54.163102",
> "last_unstale": "2019-05-15 16:19:44.530005",
> "last_undegraded": "2019-05-15 16:19:44.530005",
> "last_fullsized": "2019-05-15 16:19:44.530005",
> "mapping_epoch": 103095,
> "log_start": "102489'4664889",
> "ondisk_log_start": "102489'4664889",
> "created": 406,
> "last_epoch_clean": 96881,
> "parent": "0.0",
> "parent_split_bits": 0,
> "last_scrub": "102517'4667556",
> "last_scrub_stamp": "2019-05-15 01:07:28.978979",
> "last_deep_scrub": "102491'4666011",
> "last_deep_scrub_stamp": "2019-05-08 07:20:08.253942",
> "last_clean_scrub_stamp": "2019-05-15 01:07:28.978979",
> "log_size": 3084,
> "ondisk_log_size": 3084,
> "stats_invalid": false,
> "dirty_stats_invalid": false,
> "omap_stats_invalid": false,
> "hitset_stats_invalid": false,
> "hitset_bytes_stats_invalid": false,
> "pin_stats_invalid": true,
> "m

[ceph-users] MDS Crashing 14.2.1

2019-05-15 Thread Adam Tygart
Hello all,

I've got a 30 node cluster serving up lots of CephFS data.

We upgraded to Nautilus 14.2.1 from Luminous 12.2.11 on Monday earlier
this week.

We've been running 2 MDS daemons in an active-active setup. Tonight
one of the metadata daemons crashed with the following several times:

-1> 2019-05-16 00:20:56.775 7f9f22405700 -1
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/CInode.h:
In function 'void CInode::set_primary_parent(CDentry*)' thread 7f9f22405700 time 2019-05-16 00:20:56.775021
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/CInode.h:
1114: FAILED ceph_assert(parent == 0 || g_conf().get_val("mds_hack_allow_loading_invalid_metadata"))

I made a quick decision to move to a single MDS because I saw
set_primary_parent, and I thought it might be related to auto
balancing between the metadata servers.
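
For reference, the usual way to drop to a single active MDS is to lower max_mds; the filesystem name below is a placeholder, and on Nautilus the surplus rank should then stop on its own:

# reduce the number of active MDS ranks to one
ceph fs set cephfs max_mds 1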

This caused one MDS to fail, the other crashed, and now rank 0 loads,
goes active and then crashes with the following:
-1> 2019-05-16 00:29:21.151 7fe315e8d700 -1
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/MDCache.cc:
In function 'void MDCache::add_inode(CInode*)' thread 7fe315e8d700 time 2019-05-16 00:29:21.149531
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/MDCache.cc:
258: FAILED ceph_assert(!p)

It now looks like we somehow have a duplicate inode in the MDS journal?

https://people.cs.ksu.edu/~mozes/ceph-mds.melinoe.log <- was rank 0
then became rank one after the crash and attempted drop to one active
MDS
https://people.cs.ksu.edu/~mozes/ceph-mds.mormo.log <- current rank 0
and crashed

Anyone have any thoughts on this?

Thanks,
Adam
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Grow bluestore PV/LV

2019-05-15 Thread Yury Shevchuk
Hello Michael,

growing (expanding) a BlueStore OSD is possible since Nautilus (14.2.0)
using the bluefs-bdev-expand tool, as discussed in this thread:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-April/034116.html
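
A rough sketch of the procedure for an OSD id N, assuming the default data path and that the underlying LV has already been grown:

systemctl stop ceph-osd@N
# let BlueFS/BlueStore take over the newly added space on the device
ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-N
systemctl start ceph-osd@N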

-- Yury

On Wed, May 15, 2019 at 10:03:29PM -0700, Michael Andersen wrote:
> Hi
> 
> After growing the size of an OSD's PV/LV, how can I get bluestore to see
> the new space as available? It does notice the LV has changed size, but it
> sees the new space as occupied.
> 
> This is the same question as:
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-January/023893.html
> and
> that original poster spent a lot of effort in explaining exactly what he
> meant, but I could not find a reply to his email.
> 
> Thanks
> Michael

> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Grow bluestore PV/LV

2019-05-15 Thread Michael Andersen
Thanks! I'm on mimic for now, but I'll give it a shot on a test nautilus
cluster.

On Wed, May 15, 2019 at 10:58 PM Yury Shevchuk  wrote:

> Hello Michael,
>
> growing (expanding) bluestore OSD is possible since Nautilus (14.2.0)
> using bluefs-bdev-expand tool as discussed in this thread:
>
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-April/034116.html
>
> -- Yury
>
> On Wed, May 15, 2019 at 10:03:29PM -0700, Michael Andersen wrote:
> > Hi
> >
> > After growing the size of an OSD's PV/LV, how can I get bluestore to see
> > the new space as available? It does notice the LV has changed size, but
> it
> > sees the new space as occupied.
> >
> > This is the same question as:
> >
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-January/023893.html
> > and
> > that original poster spent a lot of effort in explaining exactly what he
> > meant, but I could not find a reply to his email.
> >
> > Thanks
> > Michael
>
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD Pool size doubled after upgrade to Nautilus and PG Merge

2019-05-15 Thread Wido den Hollander


On 5/12/19 4:21 PM, Thore Krüss wrote:
> Good evening,
> after upgrading our cluster yesterday to Nautilus (14.2.1) and pg-merging an
> imbalanced pool, we noticed that the number of objects in the pool has doubled
> (rising synchronously with the merge progress).
> 
> What happened there? Was this to be expected? Is it a bug? Will ceph
> housekeeping take care of it eventually?
> 

Has the PG merge already finished or is it still running?

Is it only the amount of objects or also the size in kB/MB/TB ?
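
For reference, a quick way to check both numbers while the merge settles (pool name is a placeholder):

# per-pool object counts and stored/raw sizes
ceph df detail
# pg_num (and pg_num_target while a change is still pending) for the pool
ceph osd pool ls detail | grep <poolname>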

Wido

> Best regards
> Thore
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Poor performance for 512b aligned "partial" writes from Windows guests in OpenStack + potential fix

2019-05-15 Thread Alexandre DERUMIER
Many thanks for the analysis !


I'm going to test with 4K on heavy mssql database to see if I'm seeing 
improvement on ios/latency.
I'll report results in this thread.
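
For anyone wanting to reproduce the comparison, a rough sketch based on the benchmark quoted further down (pool and image names are only examples):

rbd create -p rbd --size=256M bench2
# 512-byte random writes, the partial-page case discussed below
rbd bench-write -p rbd bench2 --io-size 512 --io-threads 256 --io-total 256M --io-pattern rand
# 4K-aligned random writes for comparison
rbd bench-write -p rbd bench2 --io-size 4096 --io-threads 256 --io-total 256M --io-pattern rand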


- Original Message -
From: "Trent Lloyd" 
To: "ceph-users" 
Sent: Friday, May 10, 2019 09:59:39
Subject: [ceph-users] Poor performance for 512b aligned "partial" writes from 
Windows guests in OpenStack + potential fix

I recently was investigating a performance problem for a reasonably sized 
OpenStack deployment having around 220 OSDs (3.5" 7200 RPM SAS HDD) with NVMe 
Journals. The primary workload is Windows guests backed by Cinder RBD volumes. 
This specific deployment is Ceph Jewel (FileStore + SimpleMessenger) which 
while it is EOL, the issue is reproducible on current versions and also on 
BlueStore however for different reasons than FileStore. 

Generally the Ceph cluster was suffering from very poor outlier performance. 
The numbers change a little depending on the exact situation, but roughly 
80% of I/O was completing in a "reasonable" 0-200 ms, while 5-20% of I/O 
operations were taking excessively long, anywhere from 500 ms up to 10-20+ 
seconds. However, the commit and apply latency metrics looked normal, and in 
fact this latency was hard to spot in the performance metrics available 
in Jewel. 

Previously I more simply considered FileStore to have the "commit" (to journal) 
stage where it was written to the journal and it is OK to return to the client 
and then the "apply" (to disk) stage where it was flushed to disk and confirmed 
so that the data could be purged from the journal. However there is really a 
third stage in the middle where FileStore submits the I/O to the operating 
system and this is done before the lock on the object is released. Until that 
succeeds another operation cannot write to the same object (generally being a 
4MB area of the disk). 

I found that the fstore_op threads would get stuck for hundreds of ms or more 
inside of pwritev(), which was blocking inside the kernel. Normally we expect 
pwritev() to be buffered I/O into the page cache and return quite fast; however, 
in this case the kernel was, in a few percent of cases, blocking with the stack 
trace included at the end of the e-mail [1]. My finding from that stack is that 
inside __block_write_begin_int we see a call to out_of_line_wait_on_bit, which 
is really an inlined call to wait_on_buffer and occurs in linux/fs/buffer.c in 
the section around lines 2000-2024 with the comment "If we issued read requests 
- let them complete." 
(https://github.com/torvalds/linux/blob/a2d635decbfa9c1e4ae15cb05b68b2559f7f827c/fs/buffer.c#L2002)

My interpretation of that code is that for Linux to store a write in the page 
cache, it has to have the entire 4K page, as that is the granularity at which it 
tracks the dirty state and it needs the entire 4K page to later submit back to 
the disk. Since we wrote a part of the page, and the page wasn't already in the 
cache, it has to fetch the remainder of the page from the disk. When this 
happens, it blocks waiting for this read to complete before returning from the 
pwritev() call - hence our normally buffered write blocks. This holds up the 
tp_fstore_op thread, of which there are (by default) only 2-4 such threads 
trying to process several hundred operations per second. Additionally the size 
of the osd_op_queue is bounded, and operations do not clear out of this queue 
until the tp_fstore_op thread is done. Which ultimately means that not only are 
these partial writes delayed but it knocks on to delay other writes behind them 
because of the constrained thread pools. 

What made this further confusing is that I could easily reproduce it in a 
test deployment using an rbd benchmark that was only writing to a total disk 
size of 256MB which I would easily have expected to fit in the page cache: 
rbd create -p rbd --size=256M bench2 
rbd bench-write -p rbd bench2 --io-size 512 --io-threads 256 --io-total 256M 
--io-pattern rand 

This is explained by the fact that on secondary OSDs (at least, there was some 
refactoring of fadvise which I have not fully understood as of yet), FileStore 
is using fadvise FADVISE_DONTNEED on the objects after write which causes the 
kernel to immediately discard them from the page cache without any regard to 
their statistics of being recently/frequently used. The motivation for this 
addition appears to be that on a secondary OSD we don't service reads (only 
writes) and so therefore we can optimize memory usage by throwing away this 
object and in theory leaving more room in the page cache for objects which we 
are primary for and expect to actually service reads from a client for. 
Unfortunately this behavior does not take into account partial writes, where we 
now pathologically throw away the cached copy instantly such that a write 

Re: [ceph-users] mimic: MDS standby-replay causing blocked ops (MDS bug?)

2019-05-15 Thread Frank Schilder
Dear Yan,

OK, I will try to trigger the problem again and dump the information requested. 
Since it is not easy to get into this situation and I usually need to resolve 
it fast (it's not a test system), is there anything else worth capturing?

I will get back as soon as it happened again.

In the meantime, I would be grateful if you could shed some light on the 
following questions:

- Is there a way to cancel an individual operation in the queue? It is a bit 
harsh to have to fail an MDS for that.
- What is the fragmentdir operation doing in a single MDS setup? I thought this 
was only relevant if multiple MDS daemons are active on a file system.

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Yan, Zheng 
Sent: 16 May 2019 05:50
To: Frank Schilder
Cc: Stefan Kooman; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] mimic: MDS standby-replay causing blocked ops (MDS 
bug?)

> [...]
> This time I captured the MDS ops list (log output does not really contain 
> more info than this list). It contains 12 ops and I will include it here in 
> full length (hope this is acceptable):
>

Your issues were caused by stuck internal op fragmentdir.  Can you
dump mds cache and send the output to us?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com