Re: [ceph-users] How do you deal with "clock skew detected"?

2019-05-16 Thread Uwe Sauter
You could also edit your ceph-mon@.service (assuming systemd) to depend on chrony and add a line
"ExecStartPre=/usr/bin/sleep 30" to stall startup, giving chrony a chance to sync before the mon is started.




On 16.05.19 at 17:38, Stefan Kooman wrote:

Quoting Jan Kasprzak (k...@fi.muni.cz):


OK, many responses (thanks for them!) suggest chrony, so I tried it:
With all three mons running chrony and being in sync with my NTP server
with offsets under 0.0001 second, I rebooted one of the mons:

There still was the HEALTH_WARN clock_skew message as soon as
the rebooted mon starts responding to ping. The cluster returns to
HEALTH_OK about 95 seconds later.

According to "ntpdate -q my.ntp.server", the initial offset
after reboot is about 0.6 s (which is the reason for the HEALTH_WARN, I think),
but it gets under 0.0001 s in about 25 seconds. The remaining ~50 seconds
of HEALTH_WARN are spent inside Ceph, with the mons already synchronized.

So the result is that chrony indeed synchronizes faster,
but nevertheless I still have about 95 seconds of HEALTH_WARN "clock skew
detected".

	I guess the workaround for now is to ignore the warning and wait
about two minutes before rebooting the next mon.


You can tune the "mon_timecheck_skew_interval" which by default is set
to 30 seconds. See [1] and look for "timecheck" to find the different
options.
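For example, a ceph.conf sketch (untested; lowering the value should make the mons re-check the skew more often, and they need a restart to pick it up):

    [mon]
        mon timecheck skew interval = 10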

Gr. Stefan

[1]:
http://docs.ceph.com/docs/master/rados/configuration/mon-config-ref/


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Blocked ops after change from filestore on HDD to bluestore on SDD

2019-03-29 Thread Uwe Sauter
Hi,

On 28.03.19 at 20:03, c...@elchaka.de wrote:
> Hi Uwe,
> 
> On 28 February 2019 11:02:09 CET, Uwe Sauter wrote:
>> On 28.02.19 at 10:42, Matthew H wrote:
>>> Have you made any changes to your ceph.conf? If so, would you mind
>> copying them into this thread?
>>
>> No, I just deleted an OSD, replaced the HDD with an SSD and created a new OSD
>> (with bluestore). Once the cluster was healthy again, I
>> repeated with the next OSD.
>>
>>
>> [global]
>>  auth client required = cephx
>>  auth cluster required = cephx
>>  auth service required = cephx
>>  cluster network = 169.254.42.0/24
>>  fsid = 753c9bbd-74bd-4fea-8c1e-88da775c5ad4
>>  keyring = /etc/pve/priv/$cluster.$name.keyring
>>  public network = 169.254.42.0/24
>>
>> [mon]
>>  mon allow pool delete = true
>>  mon data avail crit = 5
>>  mon data avail warn = 15
>>
>> [osd]
>>  keyring = /var/lib/ceph/osd/ceph-$id/keyring
>>  osd journal size = 5120
>>  osd pool default min size = 2
>>  osd pool default size = 3
>>  osd max backfills = 6
>>  osd recovery max active = 12
> 
> I guess you should decrease these last two parameters to 1. This should help to
> avoid too much pressure on your drives...
> 

Unlikely to help as no recovery / backfilling is running when the situation 
appears.
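(For reference, if one did want to throttle recovery, those two can be changed at runtime without editing ceph.conf; a sketch:)

    ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'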

> Hth
> - Mehmet 
> 
>>
>> [mon.px-golf-cluster]
>>  host = px-golf-cluster
>>  mon addr = 169.254.42.54:6789
>>
>> [mon.px-hotel-cluster]
>>  host = px-hotel-cluster
>>  mon addr = 169.254.42.55:6789
>>
>> [mon.px-india-cluster]
>>  host = px-india-cluster
>>  mon addr = 169.254.42.56:6789
>>
>>
>>
>>
>>>
>>>
>> --
>>> *From:* ceph-users  on behalf of
>> Vitaliy Filippov 
>>> *Sent:* Wednesday, February 27, 2019 4:21 PM
>>> *To:* Ceph Users
>>> *Subject:* Re: [ceph-users] Blocked ops after change from filestore
>> on HDD to bluestore on SDD
>>>  
>>> I think this should not lead to blocked ops in any case, even if the 
>>> performance is low...
>>>
>>> -- 
>>> With best regards,
>>>    Vitaliy Filippov
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: Re: Blocked ops after change from filestore on HDD to bluestore on SDD

2019-02-28 Thread Uwe Sauter
olcDbShmKey only applies to BDB and HDB backends but I'm using the new MDB 
backend.
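For reference, a quick way to double-check which settings the MDB database actually carries in cn=config (a sketch, assuming ldapi with EXTERNAL authentication is allowed for root):

    ldapsearch -LLL -Y EXTERNAL -H ldapi:/// -b cn=config '(olcDatabase=*mdb)' olcDbMaxSize olcDbEnvFlags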


On 28.02.19 at 14:47, Marc Roos wrote:
> If you have disk io every second with your current settings (which I
> also had with 'default' settings), there are some optimizations you can
> do, bringing it down to every 50 seconds or so. Adding the olcDbShmKey
> will allow slapd to access the db cache.
> I am getting an error about shared memory settings when rebooting (CentOS 7),
> but the slapd maintainers said that I can ignore that. I haven't had any
> problems since using this either.
> 
> 
> 
> -----Original Message-----
> From: Uwe Sauter [mailto:uwe.sauter...@gmail.com] 
> Sent: 28 February 2019 14:34
> To: Marc Roos; ceph-users; vitalif
> Subject: Re: [ceph-users] Fwd: Re: Blocked ops after change from 
> filestore on HDD to bluestore on SDD
> 
> Do you have anything particular in mind? I'm using mdb backend with 
> maxsize = 1GB but currently the files are only about 23MB.
> 
> 
>>
>> I am having quite a few openldap servers (slaves) running also, make 
>> sure to use proper caching that saves a lot of disk io.
>>
>>
>>
>>
>> -----Original Message-----
>> Sent: 28 February 2019 13:56
>> To: uwe.sauter...@gmail.com; Uwe Sauter; Ceph Users
>> Subject: *SPAM* Re: [ceph-users] Fwd: Re: Blocked ops after 
>> change from filestore on HDD to bluestore on SDD
>>
>> "Advanced power loss protection" is in fact a performance feature, not 
> 
>> a safety one.
>>
>>
>> On 28 February 2019 13:03:51 GMT+03:00, Uwe Sauter wrote:
>>
>>  Hi all,
>>  
>>  thanks for your insights.
>>  
>>  Eneko,
>>  
>>
>>  We tried to use a Samsung 840 Pro SSD as OSD some time ago and it
>>  was a no-go; it wasn't that performance was bad, it just didn't work
>>  for the kind of use of OSD. Any HDD was better than it (the disk was
>>  healthy and had been used in a software raid-1 for a pair of years).
>>  
>>  I suggest you check first that your Samsung 860 Pro disks work well
>>  for Ceph. Also, how is your host's RAM?
>>
>>
>>  As already mentioned the hosts each have 64GB RAM. Each host has 3
>>  SSDs for OSD usage. Each OSD is using about 1.3GB virtual
>>  memory / 400MB resident memory.
>>  
>>  
>>  
>>  Joachim,
>>  
>>
>>  I can only recommend the use of enterprise SSDs. We've tested many
>>  consumer SSDs in the past, including your SSDs. Many of them are not
>>  suitable for long-term use and some wore out within 6 months.
>>
>>
>>  Unfortunately I couldn't afford enterprise grade SSDs. But I suspect
>>  that my workload (about 20 VMs for our infrastructure, the most IO
>>  demanding is probably LDAP) is light enough that wearout won't be a problem.
>>  
>>  The issue I'm seeing then is probably related to direct IO if using
>>  bluestore. But with filestore, the file system cache probably
>>  hides the latency issues.
>>  
>>  
>>  Igor,
>>  
>>
>>  AFAIR Samsung 860 Pro isn't for the enterprise market, you shouldn't
>>  use consumer SSDs for Ceph.
>>  
>>  I had some experience with Samsung 960 Pro a while ago and it turned
>>  out that it handled fsync-ed writes very slowly (compared to the
>>  original/advertised performance), which can probably be explained by
>>  the lack of power loss protection for these drives. I suppose it's
>>  the same in your case.
>>  
>>  Here are a couple links on the topic:
>>  
>>  
>>  https://www.percona.com/blog/2018/02/08/fsync-performance-storage-devices/
>>  
>>  https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
>>
>>
>>  Power loss protection wasn't a criterion for me as the cluster hosts
>>  are distributed in two buildings with separate battery backed UPSs.
>>  As mentioned above I suspect the main difference for my case between
>>  filestore and bluestore is file system cache vs. direct IO. Which
>>  means I will keep using filestore.
>>  
>>  Regards,
>>  
>>  Uwe
>> 
>>
>>  ceph-users mailing list
>>  ceph-users@lists.ceph.com
>>  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>> --
>> With best regards,
>> Vitaliy Filippov
>>
>>
> 
> 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: Re: Blocked ops after change from filestore on HDD to bluestore on SDD

2019-02-28 Thread Uwe Sauter
I already sent my configuration to the list about 3.5 hours ago, but here it is again:


[global]
  auth client required = cephx
  auth cluster required = cephx
  auth service required = cephx
  cluster network = 169.254.42.0/24
  fsid = 753c9bbd-74bd-4fea-8c1e-88da775c5ad4
  keyring = /etc/pve/priv/$cluster.$name.keyring
  public network = 169.254.42.0/24

[mon]
  mon allow pool delete = true
  mon data avail crit = 5
  mon data avail warn = 15

[osd]
  keyring = /var/lib/ceph/osd/ceph-$id/keyring
  osd journal size = 5120
  osd pool default min size = 2
  osd pool default size = 3
  osd max backfills = 6
  osd recovery max active = 12

[mon.px-golf-cluster]
  host = px-golf-cluster
  mon addr = 169.254.42.54:6789

[mon.px-hotel-cluster]
  host = px-hotel-cluster
  mon addr = 169.254.42.55:6789

[mon.px-india-cluster]
  host = px-india-cluster
  mon addr = 169.254.42.56:6789



On 28.02.19 at 14:44, Matthew H wrote:
> Could you send your ceph.conf file over please? Are you setting any tunables 
> for OSD or Bluestore currently?
> 
> --
> *From:* ceph-users  on behalf of Uwe 
> Sauter 
> *Sent:* Thursday, February 28, 2019 8:33 AM
> *To:* Marc Roos; ceph-users; vitalif
> *Subject:* Re: [ceph-users] Fwd: Re: Blocked ops after change from filestore 
> on HDD to bluestore on SDD
>  
> Do you have anything particular in mind? I'm using mdb backend with maxsize = 
> 1GB but currently the files are only about 23MB.
> 
> 
>> 
>> I am having quite a few openldap servers (slaves) running also, make 
>> sure to use proper caching that saves a lot of disk io.  
>> 
>> 
>> 
>> 
>> -----Original Message-----
>> Sent: 28 February 2019 13:56
>> To: uwe.sauter...@gmail.com; Uwe Sauter; Ceph Users
>> Subject: *SPAM* Re: [ceph-users] Fwd: Re: Blocked ops after 
>> change from filestore on HDD to bluestore on SDD
>> 
>> "Advanced power loss protection" is in fact a performance feature, not a 
>> safety one.
>> 
>> 
>> On 28 February 2019 13:03:51 GMT+03:00, Uwe Sauter wrote:
>> 
>>    Hi all,
>>    
>>    thanks for your insights.
>>    
>>    Eneko,
>>    
>> 
>>    We tried to use a Samsung 840 Pro SSD as OSD some time ago and
>>    it was a no-go; it wasn't that performance was bad, it just didn't
>>    work for the kind of use of OSD. Any HDD was better than it (the
>>    disk was healthy and had been used in a software raid-1 for a pair of years).
>>    
>>    I suggest you check first that your Samsung 860 Pro disks work
>>    well for Ceph. Also, how is your host's RAM?
>> 
>> 
>>    As already mentioned the hosts each have 64GB RAM. Each host has 3
>>    SSDs for OSD usage. Each OSD is using about 1.3GB virtual
>>    memory / 400MB resident memory.
>>    
>>    
>>    
>>    Joachim,
>>    
>> 
>>    I can only recommend the use of enterprise SSDs. We've tested
>>    many consumer SSDs in the past, including your SSDs. Many
>>    of them are not suitable for long-term use and some wore out
>>    within 6 months.
>> 
>> 
>>    Unfortunately I couldn't afford enterprise grade SSDs. But I 
>> suspect that my workload (about 20 VMs for our infrastructure, the
>>    most IO demanding is probably LDAP) is light enough that wearout 
>> won't be a problem.
>>    
>>    The issue I'm seeing then is probably related to direct IO if using 
>> bluestore. But with filestore, the file system cache probably
>>    hides the latency issues.
>>    
>>    
>>    Igor,
>>    
>> 
>>    AFAIR Samsung 860 Pro isn't for enterprise market, you 
>> shouldn't use consumer SSDs for Ceph.
>>    
>>    I had some experience with Samsung 960 Pro a while ago and it
>>    turned out that it handled fsync-ed writes very slowly
>>    (compared to the original/advertised performance), which can
>>    probably be explained by the lack of power loss protection
>>    for these drives. I suppose it's the same in your case.
>>    
>>    Here are a couple links on the topic:
>>    
>>    
>> https://www.percona.com/blog/2018/02/08/fsync-performance-storage-devices/
>>    

Re: [ceph-users] Fwd: Re: Blocked ops after change from filestore on HDD to bluestore on SDD

2019-02-28 Thread Uwe Sauter
Do you have anything particular in mind? I'm using mdb backend with maxsize = 
1GB but currently the files are only about 23MB.


> 
> I am having quite a few openldap servers (slaves) running also, make 
> sure to use proper caching that saves a lot of disk io.  
> 
> 
> 
> 
> -----Original Message-----
> Sent: 28 February 2019 13:56
> To: uwe.sauter...@gmail.com; Uwe Sauter; Ceph Users
> Subject: *SPAM* Re: [ceph-users] Fwd: Re: Blocked ops after 
> change from filestore on HDD to bluestore on SDD
> 
> "Advanced power loss protection" is in fact a performance feature, not a 
> safety one.
> 
> 
> On 28 February 2019 13:03:51 GMT+03:00, Uwe Sauter wrote:
> 
>   Hi all,
>   
>   thanks for your insights.
>   
>   Eneko,
>   
> 
>   We tried to use a Samsung 840 Pro SSD as OSD some time ago and 
> it was a no-go; it wasn't that performance was bad, it 
>   just didn't work for the kind of use of OSD. Any HDD was 
> better than it (the disk was healthy and had been used in a 
>   software raid-1 for a pair of years).
>   
>   I suggest you check first that your Samsung 860 Pro disks work 
> well for Ceph. Also, how is your host's RAM?
> 
> 
>   As already mentioned the hosts each have 64GB RAM. Each host has 3 
> SSDs for OSD usage. Each OSD is using about 1.3GB virtual
>   memory / 400MB residual memory.
>   
>   
>   
>   Joachim,
>   
> 
>   I can only recommend the use of enterprise SSDs. We've tested 
> many consumer SSDs in the past, including your SSDs. Many 
>   of them are not suitable for long-term use and some wore out 
> within 6 months.
> 
> 
>   Unfortunately I couldn't afford enterprise grade SSDs. But I 
> suspect that my workload (about 20 VMs for our infrastructure, the
>   most IO demanding is probably LDAP) is light enough that wearout 
> won't be a problem.
>   
>   The issue I'm seeing then is probably related to direct IO if using 
> bluestore. But with filestore, the file system cache probably
>   hides the latency issues.
>   
>   
>   Igor,
>   
> 
>   AFAIR Samsung 860 Pro isn't for enterprise market, you 
> shouldn't use consumer SSDs for Ceph.
>   
>   I had some experience with Samsung 960 Pro a while ago and it 
> turned out that it handled fsync-ed writes very slowly 
>   (compared to the original/advertised performance), which can
> probably be explained by the lack of power loss protection 
>   for these drives. I suppose it's the same in your case.
>   
>   Here are a couple links on the topic:
>   
>   
> https://www.percona.com/blog/2018/02/08/fsync-performance-storage-devices/
>   
>   
> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
> 
> 
>   Power loss protection wasn't a criterion for me as the cluster hosts 
> are distributed in two buildings with separate battery backed
>   UPSs. As mentioned above I suspect the main difference for my case 
> between filestore and bluestore is file system cache vs. direct
>   IO. Which means I will keep using filestore.
>   
>   Regards,
>   
>   Uwe
> 
> 
>   ceph-users mailing list
>   ceph-users@lists.ceph.com
>   http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> --
> With best regards,
> Vitaliy Filippov
> 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Fwd: Re: Blocked ops after change from filestore on HDD to bluestore on SDD

2019-02-28 Thread Uwe Sauter
Hi all,

thanks for your insights.

Eneko,

> We tried to use a Samsung 840 Pro SSD as OSD some time ago and it was a 
> no-go; it wasn't that performance was bad, it 
> just didn't work for the kind of use of OSD. Any HDD was better than it (the 
> disk was healthy and had been used in a 
> software raid-1 for a pair of years).
> 
> I suggest you check first that your Samsung 860 Pro disks work well for Ceph. 
> Also, how is your host's RAM?

As already mentioned the hosts each have 64GB RAM. Each host has 3 SSDs for OSD 
usage. Each OSD is using about 1.3GB virtual
memory / 400MB resident memory.



Joachim,

> I can only recommend the use of enterprise SSDs. We've tested many consumer 
> SSDs in the past, including your SSDs. Many 
> of them are not suitable for long-term use and some wore out within 6 months.

Unfortunately I couldn't afford enterprise grade SSDs. But I suspect that my 
workload (about 20 VMs for our infrastructure, the
most IO demanding is probably LDAP) is light enough that wearout won't be a 
problem.

The issue I'm seeing then is probably related to direct IO if using bluestore. 
But with filestore, the file system cache probably
hides the latency issues.


Igor,

> AFAIR Samsung 860 Pro isn't for enterprise market, you shouldn't use consumer 
> SSDs for Ceph.
> 
> I had some experience with Samsung 960 Pro a while ago and it turned out that 
> it handled fsync-ed writes very slowly 
> (compared to the original/advertised performance), which can probably 
> be explained by the lack of power loss protection 
> for these drives. I suppose it's the same in your case.
> 
> Here are a couple links on the topic:
> 
> https://www.percona.com/blog/2018/02/08/fsync-performance-storage-devices/
> 
> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/

Power loss protection wasn't a criterion for me as the cluster hosts are 
distributed in two buildings with separate battery backed
UPSs. As mentioned above I suspect the main difference for my case between 
filestore and bluestore is file system cache vs. direct
IO. Which means I will keep using filestore.
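(For anyone who wants to check their own drives first, a rough sketch of the kind of fsync test the links above describe; /dev/sdX is a placeholder and the run writes to the device, so only point it at an empty disk:)

    fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k \
        --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting \
        --name=journal-test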

Regards,

Uwe
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Blocked ops after change from filestore on HDD to bluestore on SDD

2019-02-28 Thread Uwe Sauter
On 28.02.19 at 10:42, Matthew H wrote:
> Have you made any changes to your ceph.conf? If so, would you mind copying 
> them into this thread?

No, I just deleted an OSD, replaced the HDD with an SSD and created a new OSD (with 
bluestore). Once the cluster was healthy again, I
repeated with the next OSD.


[global]
  auth client required = cephx
  auth cluster required = cephx
  auth service required = cephx
  cluster network = 169.254.42.0/24
  fsid = 753c9bbd-74bd-4fea-8c1e-88da775c5ad4
  keyring = /etc/pve/priv/$cluster.$name.keyring
  public network = 169.254.42.0/24

[mon]
  mon allow pool delete = true
  mon data avail crit = 5
  mon data avail warn = 15

[osd]
  keyring = /var/lib/ceph/osd/ceph-$id/keyring
  osd journal size = 5120
  osd pool default min size = 2
  osd pool default size = 3
  osd max backfills = 6
  osd recovery max active = 12

[mon.px-golf-cluster]
  host = px-golf-cluster
  mon addr = 169.254.42.54:6789

[mon.px-hotel-cluster]
  host = px-hotel-cluster
  mon addr = 169.254.42.55:6789

[mon.px-india-cluster]
  host = px-india-cluster
  mon addr = 169.254.42.56:6789




> 
> --
> *From:* ceph-users  on behalf of Vitaliy 
> Filippov 
> *Sent:* Wednesday, February 27, 2019 4:21 PM
> *To:* Ceph Users
> *Subject:* Re: [ceph-users] Blocked ops after change from filestore on HDD to 
> bluestore on SDD
>  
> I think this should not lead to blocked ops in any case, even if the 
> performance is low...
> 
> -- 
> With best regards,
>    Vitaliy Filippov
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Blocked ops after change from filestore on HDD to bluestore on SDD

2019-02-26 Thread Uwe Sauter

Hi,

TL;DR: In my Ceph clusters I replaced all OSDs from HDDs of several brands and models with Samsung 860 Pro SSDs and used 
the opportunity to switch from filestore to bluestore. Now I'm seeing blocked ops in Ceph and file system freezes inside 
VMs. Any suggestions?



I have two Proxmox clusters for virtualization which use Ceph on HDDs as backend storage for VMs. About half a year ago 
I had to increase the pool size and used the occasion to switch from filestore to bluestore. That was when trouble 
started. Both clusters showed blocked ops that caused freezes inside VMs which needed a reboot to function properly 
again. I wasn't able to identify the cause of the blocking ops but I blamed the low performance of the HDDs. It was also 
the time when patches for Spectre/Meltdown were released. Kernel 4.13.x didn't show the behavior while kernel 4.15.x 
did. After several weeks of debugging the workaround was to go back to filestore.


Today I replaced all HDDs with brand new Samsung 860 Pro SSDs and switched to bluestore again (on one cluster). And… the 
blocked ops reappeared. I am out of ideas about the cause.


Any idea why bluestore is so much more demanding on the storage devices 
compared to filestore?

Before switching back to filestore do you have any suggestions for debugging? 
Anything special to check for in the network?

The clusters are both connected via 10GbE (MTU 9000) and are only lightly loaded (15 VMs on the first, 6 VMs on the 
second). Each host has 3 SSDs and 64GB memory.


"rados bench" gives decent results for 4M block size but 4K block size triggers blocked ops (and only finishes after I 
restart the OSD with the blocked ops). Results below.



Thanks,

Uwe




Results from "rados bench" runs with 4K block size when the cluster didn't 
block:

root@px-hotel-cluster:~# rados bench -p scbench 60 write -b 4K -t 16 
--no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4096 bytes to objects of size 4096 for up 
to 60 seconds or 0 objects
Object prefix: benchmark_data_px-hotel-cluster_3814550
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
0   0 0 0 0 0   -   0
1  16  2338  2322   9.06888   9.07031   0.0068972   0.0068597
2  16  4631  4615   9.01238   8.95703   0.0076618  0.00692027
3  16  6936  6920   9.00928   9.00391   0.0066511  0.00692966
4  16  9173  9157   8.94133   8.73828  0.00416256  0.00698071
5  16 11535 11519   8.99821   9.22656  0.00799875  0.00693842
6  16 13892 13876   9.03287   9.20703  0.00688782  0.00691459
7  15 16173 16158   9.01578   8.91406  0.00791589  0.00692736
8  16 18406 18390   8.97854   8.71875  0.00745151  0.00695723
9  16 20681 20665   8.96822   8.88672   0.0072881  0.00696475
   10  16 23037 23021   8.99163   9.20312  0.00728763   0.0069473
   11  16 24261 24245   8.60882   4.78125  0.00502342  0.00725673
   12  16 25420 25404   8.26863   4.52734  0.00443917  0.00750865
   13  16 27347 27331   8.21154   7.52734  0.00670819  0.00760455
   14  16 28750 28734   8.01642   5.48047  0.00617038  0.00779322
   15  16 30222 30206    7.8653      5.75  0.00700398  0.00794209
   16  16 32180 32164    7.8517   7.64844  0.00704785   0.0079573
   17  16 34527 34511   7.92907   9.16797  0.00582831  0.00788017
   18  15 36969 36954   8.01868   9.54297  0.00635168  0.00779228
   19  16 39059 39043   8.02609   8.16016  0.00622597  0.00778436
2019-02-26 21:55:41.623245 min lat: 0.00337595 max lat: 0.431158 avg lat: 
0.00779143
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
   20  16 41079 41063   8.01928   7.89062  0.00649895  0.00779143
   21  16 43076 43060   8.00878   7.80078  0.00726145  0.00780128
   22  16 45433 45417   8.06321   9.20703  0.00455727  0.00774944
   23  16 47763 47747   8.10832   9.10156  0.00582818  0.00770599
   24  16 50079 50063   8.14738   9.04688   0.0051125  0.00766894
   25  16 52477 52461   8.19614   9.36719  0.00537575  0.00762343
   26  16 54895 54879   8.24415   9.44531  0.00573134  0.00757909
   27  16 57276 57260   8.28325   9.30078  0.00576683  0.00754383
   28  16 59487 59471   8.29585   8.63672  0.00651535  0.00753232
   29  16 61948 61932   8.34125   9.61328  0.00499461  0.00749048
   30  16 64289 64273   8.36799   9.14453  0.00735917  0.00746708
   31  16 66645 66629    8.3949   9.20312  0.00644432  0.00744233
   32  16 68926 68910   8.41098   8.91016  0.00545702   0.0074289
   33  16 71257 71241 8.432   9.10547  0.00505016  0.00741037
   34  16 73668 73652   

Re: [ceph-users] Move rdb based image from one pool to another

2018-11-07 Thread Uwe Sauter




On 07.11.18 at 21:17, Alex Gorbachev wrote:

On Wed, Nov 7, 2018 at 2:38 PM Uwe Sauter  wrote:


I've been reading a bit and trying around but it seems I'm not quite where I 
want to be.

I want to migrate from pool "vms" to pool "vdisks".

# ceph osd pool ls
vms
vdisks

# rbd ls vms
vm-101-disk-1
vm-101-disk-2
vm-102-disk-1
vm-102-disk-2

# rbd snap ls vms/vm-102-disk-2
SNAPID NAME SIZE TIMESTAMP
  81 SL6_81 100GiB Thu Aug 23 11:57:05 2018
  92 SL6_82 100GiB Fri Oct 12 13:27:53 2018

# rbd export --export-format 2 vms/vm-102-disk-2 - | rbd import - 
vdisks/vm-102-disk-2
Exporting image: 100% complete...done.
Importing image: 100% complete...done.

# rbd snap ls vdisks/vm-102-disk-2
(no output)

# rbd export-diff --whole-object vms/vm-102-disk-2 - | rbd import-diff - 
vdisks/vm-102-disk-2
Exporting image: 100% complete...done.
Importing image diff: 100% complete...done.

# rbd snap ls vdisks/vm-102-disk-2
(still no output)

It looks like the current content is copied but not the snapshots.

What am I doing wrong? Any help is appreciated.


Hi Uwe,

If these are Proxmox images, would you be able to move them simply
using Proxmox Move Disk in hardware for VM?  I have had good results
with that.


You are correct that this is on Proxmox but the UI prohibits moving Ceph-backed 
disks when the VM has snapshots.

I know how to alter the config files so I'm going the manual route here.

But thanks for the suggestion.






--
Alex Gorbachev
Storcium



Thanks,

 Uwe



On 07.11.18 at 14:39, Uwe Sauter wrote:

I'm still on luminous (12.2.8). I'll have a look on the commands. Thanks.

On 07.11.18 at 14:31, Jason Dillaman wrote:

With the Mimic release, you can use "rbd deep-copy" to transfer the
images (and associated snapshots) to a new pool. Prior to that, you
could use "rbd export-diff" / "rbd import-diff" to manually transfer
an image and its associated snapshots.
On Wed, Nov 7, 2018 at 7:11 AM Uwe Sauter  wrote:


Hi,

I have several VM images sitting in a Ceph pool which are snapshotted. Is there 
a way to move such images from one pool to another
and preserve the snapshots?

Regards,

  Uwe
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com








___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Move rdb based image from one pool to another

2018-11-07 Thread Uwe Sauter

I do have an empty disk in that server. Just go the extra step, save the export 
to a file and import that one?
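Something like this, assuming the spare disk is mounted at /mnt/scratch (hypothetical path); since the bug only affects reading the format-2 stream from stdin, going through a file should avoid it:

    rbd export --export-format 2 vms/vm-102-disk-2 /mnt/scratch/vm-102-disk-2.rbd2
    rbd import --export-format 2 /mnt/scratch/vm-102-disk-2.rbd2 vdisks/vm-102-disk-2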



On 07.11.18 at 20:55, Jason Dillaman wrote:

There was a bug in "rbd import" where it disallowed the use of stdin
for export-format 2. This has been fixed in v12.2.9 and is in the
pending 13.2.3 release.
On Wed, Nov 7, 2018 at 2:46 PM Uwe Sauter  wrote:


I tried that but it fails:

# rbd export --export-format 2 vms/vm-102-disk-2 - | rbd import --export-format 
2 - vdisks/vm-102-disk-2
rbd: import header failed.
Importing image: 0% complete...failed.
rbd: import failed: (22) Invalid argument
Exporting image: 0% complete...failed.
rbd: export error: (32) Broken pipe


But the version seems to support that option:

# rbd help import
usage: rbd import [--path ] [--dest-pool ] [--dest ]
[--image-format ] [--new-format]
[--order ] [--object-size ]
[--image-feature ] [--image-shared]
[--stripe-unit ]
[--stripe-count ] [--data-pool ]
[--journal-splay-width ]
[--journal-object-size ]
[--journal-pool ]
[--sparse-size ] [--no-progress]
[--export-format ] [--pool ]
[--image ]
 





On 07.11.18 at 20:41, Jason Dillaman wrote:

If your CLI supports "--export-format 2", you can just do "rbd export
--export-format 2 vms/vm-102-disk2 - | rbd import --export-format 2 -
vdisks/vm-102-disk-2" (you need to specify the data format on import
otherwise it will assume it's copying a raw image).
On Wed, Nov 7, 2018 at 2:38 PM Uwe Sauter  wrote:


I've been reading a bit and trying around but it seems I'm not quite where I 
want to be.

I want to migrate from pool "vms" to pool "vdisks".

# ceph osd pool ls
vms
vdisks

# rbd ls vms
vm-101-disk-1
vm-101-disk-2
vm-102-disk-1
vm-102-disk-2

# rbd snap ls vms/vm-102-disk-2
SNAPID NAME SIZE TIMESTAMP
   81 SL6_81 100GiB Thu Aug 23 11:57:05 2018
   92 SL6_82 100GiB Fri Oct 12 13:27:53 2018

# rbd export --export-format 2 vms/vm-102-disk-2 - | rbd import - 
vdisks/vm-102-disk-2
Exporting image: 100% complete...done.
Importing image: 100% complete...done.

# rbd snap ls vdisks/vm-102-disk-2
(no output)

# rbd export-diff --whole-object vms/vm-102-disk-2 - | rbd import-diff - 
vdisks/vm-102-disk-2
Exporting image: 100% complete...done.
Importing image diff: 100% complete...done.

# rbd snap ls vdisks/vm-102-disk-2
(still no output)

It looks like the current content is copied but not the snapshots.

What am I doing wrong? Any help is appreciated.

Thanks,

  Uwe



On 07.11.18 at 14:39, Uwe Sauter wrote:

I'm still on luminous (12.2.8). I'll have a look on the commands. Thanks.

On 07.11.18 at 14:31, Jason Dillaman wrote:

With the Mimic release, you can use "rbd deep-copy" to transfer the
images (and associated snapshots) to a new pool. Prior to that, you
could use "rbd export-diff" / "rbd import-diff" to manually transfer
an image and its associated snapshots.
On Wed, Nov 7, 2018 at 7:11 AM Uwe Sauter  wrote:


Hi,

I have several VM images sitting in a Ceph pool which are snapshotted. Is there 
a way to move such images from one pool to another
and preserve the snapshots?

Regards,

   Uwe
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Move rdb based image from one pool to another

2018-11-07 Thread Uwe Sauter

Looks like I'm hitting this:

http://tracker.ceph.com/issues/34536

On 07.11.18 at 20:46, Uwe Sauter wrote:

I tried that but it fails:

# rbd export --export-format 2 vms/vm-102-disk-2 - | rbd import --export-format 
2 - vdisks/vm-102-disk-2
rbd: import header failed.
Importing image: 0% complete...failed.
rbd: import failed: (22) Invalid argument
Exporting image: 0% complete...failed.
rbd: export error: (32) Broken pipe


But the version seems to support that option:

# rbd help import
usage: rbd import [--path ] [--dest-pool ] [--dest ]
   [--image-format ] [--new-format]
   [--order ] [--object-size ]
   [--image-feature ] [--image-shared]
   [--stripe-unit ]
   [--stripe-count ] [--data-pool ]
   [--journal-splay-width ]
   [--journal-object-size ]
   [--journal-pool ]
   [--sparse-size ] [--no-progress]
   [--export-format ] [--pool ]
   [--image ]
    





On 07.11.18 at 20:41, Jason Dillaman wrote:

If your CLI supports "--export-format 2", you can just do "rbd export
--export-format 2 vms/vm-102-disk2 - | rbd import --export-format 2 -
vdisks/vm-102-disk-2" (you need to specify the data format on import
otherwise it will assume it's copying a raw image).
On Wed, Nov 7, 2018 at 2:38 PM Uwe Sauter  wrote:


I've been reading a bit and trying around but it seems I'm not quite where I 
want to be.

I want to migrate from pool "vms" to pool "vdisks".

# ceph osd pool ls
vms
vdisks

# rbd ls vms
vm-101-disk-1
vm-101-disk-2
vm-102-disk-1
vm-102-disk-2

# rbd snap ls vms/vm-102-disk-2
SNAPID NAME SIZE TIMESTAMP
  81 SL6_81 100GiB Thu Aug 23 11:57:05 2018
  92 SL6_82 100GiB Fri Oct 12 13:27:53 2018

# rbd export --export-format 2 vms/vm-102-disk-2 - | rbd import - 
vdisks/vm-102-disk-2
Exporting image: 100% complete...done.
Importing image: 100% complete...done.

# rbd snap ls vdisks/vm-102-disk-2
(no output)

# rbd export-diff --whole-object vms/vm-102-disk-2 - | rbd import-diff - 
vdisks/vm-102-disk-2
Exporting image: 100% complete...done.
Importing image diff: 100% complete...done.

# rbd snap ls vdisks/vm-102-disk-2
(still no output)

It looks like the current content is copied but not the snapshots.

What am I doing wrong? Any help is appreciated.

Thanks,

     Uwe



On 07.11.18 at 14:39, Uwe Sauter wrote:

I'm still on luminous (12.2.8). I'll have a look on the commands. Thanks.

On 07.11.18 at 14:31, Jason Dillaman wrote:

With the Mimic release, you can use "rbd deep-copy" to transfer the
images (and associated snapshots) to a new pool. Prior to that, you
could use "rbd export-diff" / "rbd import-diff" to manually transfer
an image and its associated snapshots.
On Wed, Nov 7, 2018 at 7:11 AM Uwe Sauter  wrote:


Hi,

I have several VM images sitting in a Ceph pool which are snapshotted. Is there a way to move such images from one 
pool to another

and preserve the snapshots?

Regards,

  Uwe
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Move rdb based image from one pool to another

2018-11-07 Thread Uwe Sauter

I tried that but it fails:

# rbd export --export-format 2 vms/vm-102-disk-2 - | rbd import --export-format 
2 - vdisks/vm-102-disk-2
rbd: import header failed.
Importing image: 0% complete...failed.
rbd: import failed: (22) Invalid argument
Exporting image: 0% complete...failed.
rbd: export error: (32) Broken pipe


But the version seems to support that option:

# rbd help import
usage: rbd import [--path ] [--dest-pool ] [--dest ]
  [--image-format ] [--new-format]
  [--order ] [--object-size ]
  [--image-feature ] [--image-shared]
  [--stripe-unit ]
  [--stripe-count ] [--data-pool ]
  [--journal-splay-width ]
  [--journal-object-size ]
  [--journal-pool ]
  [--sparse-size ] [--no-progress]
  [--export-format ] [--pool ]
  [--image ]
   





On 07.11.18 at 20:41, Jason Dillaman wrote:

If your CLI supports "--export-format 2", you can just do "rbd export
--export-format 2 vms/vm-102-disk2 - | rbd import --export-format 2 -
vdisks/vm-102-disk-2" (you need to specify the data format on import
otherwise it will assume it's copying a raw image).
On Wed, Nov 7, 2018 at 2:38 PM Uwe Sauter  wrote:


I've been reading a bit and trying around but it seems I'm not quite where I 
want to be.

I want to migrate from pool "vms" to pool "vdisks".

# ceph osd pool ls
vms
vdisks

# rbd ls vms
vm-101-disk-1
vm-101-disk-2
vm-102-disk-1
vm-102-disk-2

# rbd snap ls vms/vm-102-disk-2
SNAPID NAME SIZE TIMESTAMP
  81 SL6_81 100GiB Thu Aug 23 11:57:05 2018
  92 SL6_82 100GiB Fri Oct 12 13:27:53 2018

# rbd export --export-format 2 vms/vm-102-disk-2 - | rbd import - 
vdisks/vm-102-disk-2
Exporting image: 100% complete...done.
Importing image: 100% complete...done.

# rbd snap ls vdisks/vm-102-disk-2
(no output)

# rbd export-diff --whole-object vms/vm-102-disk-2 - | rbd import-diff - 
vdisks/vm-102-disk-2
Exporting image: 100% complete...done.
Importing image diff: 100% complete...done.

# rbd snap ls vdisks/vm-102-disk-2
(still no output)

It looks like the current content is copied but not the snapshots.

What am I doing wrong? Any help is appreciated.

Thanks,

     Uwe



On 07.11.18 at 14:39, Uwe Sauter wrote:

I'm still on luminous (12.2.8). I'll have a look on the commands. Thanks.

On 07.11.18 at 14:31, Jason Dillaman wrote:

With the Mimic release, you can use "rbd deep-copy" to transfer the
images (and associated snapshots) to a new pool. Prior to that, you
could use "rbd export-diff" / "rbd import-diff" to manually transfer
an image and its associated snapshots.
On Wed, Nov 7, 2018 at 7:11 AM Uwe Sauter  wrote:


Hi,

I have several VM images sitting in a Ceph pool which are snapshotted. Is there 
a way to move such images from one pool to another
and preserve the snapshots?

Regards,

  Uwe
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Move rdb based image from one pool to another

2018-11-07 Thread Uwe Sauter

I've been reading a bit and trying around but it seems I'm not quite where I 
want to be.

I want to migrate from pool "vms" to pool "vdisks".

# ceph osd pool ls
vms
vdisks

# rbd ls vms
vm-101-disk-1
vm-101-disk-2
vm-102-disk-1
vm-102-disk-2

# rbd snap ls vms/vm-102-disk-2
SNAPID NAME SIZE TIMESTAMP
81 SL6_81 100GiB Thu Aug 23 11:57:05 2018
92 SL6_82 100GiB Fri Oct 12 13:27:53 2018

# rbd export --export-format 2 vms/vm-102-disk-2 - | rbd import - 
vdisks/vm-102-disk-2
Exporting image: 100% complete...done.
Importing image: 100% complete...done.

# rbd snap ls vdisks/vm-102-disk-2
(no output)

# rbd export-diff --whole-object vms/vm-102-disk-2 - | rbd import-diff - 
vdisks/vm-102-disk-2
Exporting image: 100% complete...done.
Importing image diff: 100% complete...done.

# rbd snap ls vdisks/vm-102-disk-2
(still no output)

It looks like the current content is copied but not the snapshots.

What am I doing wrong? Any help is appreciated.

Thanks,

Uwe



On 07.11.18 at 14:39, Uwe Sauter wrote:

I'm still on luminous (12.2.8). I'll have a look on the commands. Thanks.

On 07.11.18 at 14:31, Jason Dillaman wrote:

With the Mimic release, you can use "rbd deep-copy" to transfer the
images (and associated snapshots) to a new pool. Prior to that, you
could use "rbd export-diff" / "rbd import-diff" to manually transfer
an image and its associated snapshots.
On Wed, Nov 7, 2018 at 7:11 AM Uwe Sauter  wrote:


Hi,

I have several VM images sitting in a Ceph pool which are snapshotted. Is there 
a way to move such images from one pool to another
and preserve the snapshots?

Regards,

 Uwe
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Move rdb based image from one pool to another

2018-11-07 Thread Uwe Sauter
I'm still on luminous (12.2.8). I'll have a look on the commands. Thanks.
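For the luminous route, a rough, untested sketch of chaining export-diff / import-diff per snapshot (image and snapshot names taken from this thread; import-diff applies to an existing destination image and creates each snapshot as it goes):

    # destination image, same size as the source (100 GiB; --size defaults to MB)
    rbd create --size 102400 vdisks/vm-102-disk-2

    # state up to the first snapshot, then each later snapshot incrementally
    rbd export-diff vms/vm-102-disk-2@SL6_81 - | rbd import-diff - vdisks/vm-102-disk-2
    rbd export-diff --from-snap SL6_81 vms/vm-102-disk-2@SL6_82 - | rbd import-diff - vdisks/vm-102-disk-2

    # finally the changes since the last snapshot (the current head)
    rbd export-diff --from-snap SL6_82 vms/vm-102-disk-2 - | rbd import-diff - vdisks/vm-102-disk-2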

On 07.11.18 at 14:31, Jason Dillaman wrote:
> With the Mimic release, you can use "rbd deep-copy" to transfer the
> images (and associated snapshots) to a new pool. Prior to that, you
> could use "rbd export-diff" / "rbd import-diff" to manually transfer
> an image and its associated snapshots.
> On Wed, Nov 7, 2018 at 7:11 AM Uwe Sauter  wrote:
>>
>> Hi,
>>
>> I have several VM images sitting in a Ceph pool which are snapshotted. Is 
>> there a way to move such images from one pool to another
>> and preserve the snapshots?
>>
>> Regards,
>>
>> Uwe

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Move rdb based image from one pool to another

2018-11-07 Thread Uwe Sauter
Hi,

I have several VM images sitting in a Ceph pool which are snapshotted. Is there 
a way to move such images from one pool to another
and preserve the snapshots?

Regards,

Uwe
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Slow requests from bluestore osds

2018-09-05 Thread Uwe Sauter

I'm also experiencing slow requests, though I cannot tie them to scrubbing.

Which kernel do you run? Would you be able to test against the same kernel with Spectre/Meltdown mitigations disabled 
("noibrs noibpb nopti nospectre_v2" as boot option)?


Uwe

On 05.09.18 at 19:30, Brett Chancellor wrote:

Marc,
   As with you, this problem manifests itself only when the bluestore OSD is involved in some form of deep scrub.  
Anybody have any insight on what might be causing this?


-Brett

On Mon, Sep 3, 2018 at 4:13 AM, Marc Schöchlin <m...@256bit.org> wrote:

Hi,

we are also experiencing this type of behavior for some weeks on our not
so performance critical hdd pools.
We haven't spent so much time on this problem, because there are
currently more important tasks - but here are a few details:

Running the following loop results in the following output:

while true; do ceph health|grep -q HEALTH_OK || (date;  ceph health
detail); sleep 2; done

Sun Sep  2 20:59:47 CEST 2018
HEALTH_WARN 4 slow requests are blocked > 32 sec
REQUEST_SLOW 4 slow requests are blocked > 32 sec
     4 ops are blocked > 32.768 sec
     osd.43 has blocked requests > 32.768 sec
Sun Sep  2 20:59:50 CEST 2018
HEALTH_WARN 4 slow requests are blocked > 32 sec
REQUEST_SLOW 4 slow requests are blocked > 32 sec
     4 ops are blocked > 32.768 sec
     osd.43 has blocked requests > 32.768 sec
Sun Sep  2 20:59:52 CEST 2018
HEALTH_OK
Sun Sep  2 21:00:28 CEST 2018
HEALTH_WARN 1 slow requests are blocked > 32 sec
REQUEST_SLOW 1 slow requests are blocked > 32 sec
     1 ops are blocked > 32.768 sec
     osd.41 has blocked requests > 32.768 sec
Sun Sep  2 21:00:31 CEST 2018
HEALTH_WARN 7 slow requests are blocked > 32 sec
REQUEST_SLOW 7 slow requests are blocked > 32 sec
     7 ops are blocked > 32.768 sec
     osds 35,41 have blocked requests > 32.768 sec
Sun Sep  2 21:00:33 CEST 2018
HEALTH_WARN 7 slow requests are blocked > 32 sec
REQUEST_SLOW 7 slow requests are blocked > 32 sec
     7 ops are blocked > 32.768 sec
     osds 35,51 have blocked requests > 32.768 sec
Sun Sep  2 21:00:35 CEST 2018
HEALTH_WARN 7 slow requests are blocked > 32 sec
REQUEST_SLOW 7 slow requests are blocked > 32 sec
     7 ops are blocked > 32.768 sec
     osds 35,51 have blocked requests > 32.768 sec

Our details:

   * system details:
     * Ubuntu 16.04
      * Kernel 4.13.0-39
      * 30 * 8 TB Disk (SEAGATE/ST8000NM0075)
      * 3* Dell Power Edge R730xd (Firmware 2.50.50.50)
        * Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
        * 2*10GBITS SFP+ Network Adapters
        * 192GB RAM
      * Pools are using replication factor 3, 2MB object size,
        85% write load, 1700 write IOPS/sec
        (ops mainly between 4k and 16k size), 300 read IOPS/sec
   * we have the impression that this appears on deepscrub/scrub activity.
   * Ceph 12.2.5, we alread played with the osd settings OSD Settings
     (our assumtion was that the problem is related to rocksdb compaction)
     bluestore cache kv max = 2147483648
     bluestore cache kv ratio = 0.9
     bluestore cache meta ratio = 0.1
     bluestore cache size hdd = 10737418240
   * this type problem only appears on hdd/bluestore osds, ssd/bluestore
     osds did never experienced that problem
   * the system is healthy, no swapping, no high load, no errors in dmesg

I attached a log excerpt of osd.35 - probably this is useful for
investigating the problem if someone has deeper bluestore knowledge.
(slow requests appeared on Sun Sep  2 21:00:35)

Regards
Marc


On 02.09.2018 at 15:50, Brett Chancellor wrote:
> The warnings look like this. 
>

> 6 ops are blocked > 32.768 sec on osd.219
> 1 osds have slow requests
>
> On Sun, Sep 2, 2018, 8:45 AM Alfredo Deza <ad...@redhat.com> wrote:
>
>     On Sat, Sep 1, 2018 at 12:45 PM, Brett Chancellor <bchancel...@salesforce.com> wrote:
 >     > Hi Cephers,
 >     >   I am in the process of upgrading a cluster from Filestore to
 >     bluestore,
 >     > but I'm concerned about frequent warnings popping up against the 
new
 >     > bluestore devices. I'm frequently seeing messages like this,
 >     although the
 >     > specific osd changes, it's always one of the few hosts I've
 >     converted to
 >     > bluestore.
 >     >
 >     > 6 ops are blocked > 32.768 sec on osd.219
 >     > 1 osds have slow requests
 >     >
 >     > I'm running 12.2.4, have any of you seen similar issues? It
 >     seems as though
 >     > these 

[ceph-users] Documentation regarding log file structure

2018-08-21 Thread Uwe Sauter
Hi list,

does documentation exist that explains the structure of Ceph log files? Other 
than the source code?

Thanks,

Uwe
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Help needed for debugging slow_requests

2018-08-13 Thread Uwe Sauter


Dear community,

TL;DR: Cluster runs good with Kernel 4.13, produces slow_requests with Kernel 
4.15. How to debug?


I'm running a combined Ceph / KVM cluster consisting of 6 hosts of 2 different 
kinds (details at the end).
The main difference between those hosts is CPU generation (Westmere / 
Sandybridge),  and number of OSD disks.

The cluster runs Proxmox 5.2 which essentially is a Debian 9 but using Ubuntu 
kernels and the Proxmox
virtualization framework. The Proxmox WebUI also integrates some kind of Ceph 
management.

On the Ceph side, the cluster has 3 nodes that run MGR, MON and OSDs while the 
other 3 only run OSDs.
OSD tree and CRUSH map are at the end. Ceph version is 12.2.7. All OSDs are 
BlueStore.


Now here's the thing:

Some weeks ago Proxmox upgraded from kernel 4.13 to 4.15. Since then I'm 
getting slow requests that
cause blocked IO inside the VMs that are running on the cluster (but not 
necessarily on the host
with the OSD causing the slow request).

If I boot back into 4.13 then Ceph runs smoothly again.


I'm seeking for help to debug this issue as I'm running out of ideas what I 
could else do.
So far I was using "ceph daemon osd.X dump_blocked_ops" to diagnose this, which always 
indicates that the
primary OSD scheduled copies on two secondaries (e.g. OSD 15: "event": "waiting 
for subops from 9,23")
but only one of those succeeds ("event": "sub_op_commit_rec from 23"). The 
other one blocks (there is
no commit message from OSD 9).

On OSD 9 there is no blocked operation ("num_blocked_ops": 0) which confuses me 
a lot. If this OSD
does not commit there should be an operation that does not succeed, should it 
not?
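(One more thing that might be worth a look is the sub-op side itself while a request is stuck; a sketch, run on the host that carries osd.9 and needing admin-socket access:)

    ceph daemon osd.9 dump_ops_in_flight    # sub-ops currently being processed
    ceph daemon osd.9 dump_historic_ops     # recently finished ops, with per-step durations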

Restarting the OSD with the blocked operation clears the error, restarting the 
secondary OSD that
does not commit has no effect on the issue.


Any ideas on how to debug this further? What should I do to identify this as a 
Ceph issue and not
a networking or kernel issue?


I can provide more specific info if needed.


Thanks,

  Uwe



 Hardware details 
Host type 1:
  CPU: 2x Intel Xeon E5-2670
  RAM: 64GiB
  Storage: 1x SSD for OS, 3x HDD for Ceph (232GiB, some replaced by 931GiB)
  connected NIC: 1x 1GbE Intel (management access, MTU 1500), 1x 10GbE Myricom 
(Ceph & KVM, MTU 9000)

Host type 2:
  CPU: 2x Intel Xeon E5606
  RAM: 96GiB
  Storage: 1x HDD for OS, 5x HDD for Ceph (465GiB, some replaced by 931GiB)
  connected NIC: 1x 1GbE Intel (management access, MTU 1500), 1x 10GbE Myricom 
(Ceph & KVM, MTU 9000)
 End Hardware 

 Ceph OSD Tree 
ID  CLASS WEIGHT   TYPE NAMESTATUS REWEIGHT PRI-AFF
 -1   12.72653 root default
 -2         1.36418     host px-alpha-cluster
  0   hdd   0.22729         osd.0    up  1.0 1.0
  1   hdd   0.22729         osd.1    up  1.0 1.0
  2   hdd   0.90959         osd.2    up  1.0 1.0
 -3         1.36418     host px-bravo-cluster
  3   hdd   0.22729         osd.3    up  1.0 1.0
  4   hdd   0.22729         osd.4    up  1.0 1.0
  5   hdd   0.90959         osd.5    up  1.0 1.0
 -4         2.04648     host px-charlie-cluster
  6   hdd   0.90959         osd.6    up  1.0 1.0
  7   hdd   0.22729         osd.7    up  1.0 1.0
  8   hdd   0.90959         osd.8    up  1.0 1.0
 -5         2.04648     host px-delta-cluster
  9   hdd   0.22729         osd.9    up  1.0 1.0
 10   hdd   0.90959         osd.10   up  1.0 1.0
 11   hdd   0.90959         osd.11   up  1.0 1.0
-11         2.72516     host px-echo-cluster
 12   hdd   0.45419         osd.12   up  1.0 1.0
 13   hdd   0.45419         osd.13   up  1.0 1.0
 14   hdd   0.45419         osd.14   up  1.0 1.0
 15   hdd   0.45419         osd.15   up  1.0 1.0
 16   hdd   0.45419         osd.16   up  1.0 1.0
 17   hdd   0.45419         osd.17   up  1.0 1.0
-13         3.18005     host px-foxtrott-cluster
 18   hdd   0.45419         osd.18   up  1.0 1.0
 19   hdd   0.45419         osd.19   up  1.0 1.0
 20   hdd   0.45419         osd.20   up  1.0 1.0
 21   hdd   0.90909         osd.21   up  1.0 1.0
 22   hdd   0.45419         osd.22   up  1.0 1.0
 23   hdd   0.45419         osd.23   up  1.0 1.0
 End OSD Tree 

 CRUSH map 
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class hdd

Re: [ceph-users] pg count question

2018-08-09 Thread Uwe Sauter

Given your formula, you would have 512 PGs in total. Instead of dividing that 
evenly you could also do
128 PGs for pool-1 and 384 PGs for pool-2, which gives you 1/4 and 3/4 of total PGs. This might not be 100% optimal for 
the pools but keeps the calculated total PGs and the 100PG/OSD target.
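A sketch of what that would look like on the command line (pg numbers as above; the set-quota form, mentioned further down the thread, takes a value in bytes):

    ceph osd pool create pool-1 128 128
    ceph osd pool create pool-2 384 384

    # cap pool-1 at roughly 500 GB, pool-2 keeps using the remaining space
    ceph osd pool set-quota pool-1 max_bytes 500000000000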



On 8/9/18 10:11 PM, Satish Patel wrote:

Thanks Subhachandra,

That is good point but how do i calculate that PG based on size?

On Thu, Aug 9, 2018 at 1:42 PM, Subhachandra Chandra
 wrote:

If pool1 is going to be much smaller than pool2, you may want more PGs in
pool2 for better distribution of data.




On Wed, Aug 8, 2018 at 12:40 AM, Sébastien VIGNERON
 wrote:


The formula seems correct for a 100 pg/OSD target.



On 8 August 2018 at 04:21, Satish Patel wrote:

Thanks!

Do you have any comments on Question: 1 ?

On Tue, Aug 7, 2018 at 10:59 AM, Sébastien VIGNERON
 wrote:

Question 2:

ceph osd pool set-quota  max_objects|max_bytes 
set object or byte limit on pool



On 7 August 2018 at 16:50, Satish Patel wrote:

Folks,

I am a little confused so just need clarification. I have 14 OSDs in my
cluster and I want to create two pools (pool-1 & pool-2). How do I
divide PGs between the two pools with replication 3?

Question: 1

Is this correct formula?

14 * 100 / 3 / 2 =  233  ( power of 2 would be 256)

So should i give 256 PG per pool right?

pool-1 = 256 pg & pgp
poo-2 = 256 pg & pgp


Question: 2

How do I set a limit on a pool? For example, if I want pool-1 to only use
500GB and pool-2 to use the rest of the space?


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] tcmalloc performance still relevant?

2018-07-17 Thread Uwe Sauter
I asked a similar question about 2 weeks ago, subject "jemalloc / Bluestore". 
Have a look at the archives.


Regards,

Uwe

On 17.07.2018 at 15:27, Robert Stanford wrote:
> Looking here: 
> https://ceph.com/geen-categorie/the-ceph-and-tcmalloc-performance-story/
> 
>  I see that it was a good idea to change to JEMalloc.  Is this still the 
> case, with up to date Linux and current Ceph?
> 
>  
> 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 12.2.6 CRC errors

2018-07-14 Thread Uwe Sauter

Hi Glen,

about 16h ago there has been a notice on this list with subject "IMPORTANT: broken luminous 12.2.6 release in repo, do 
not upgrade" from Sage Weil (main developer of Ceph).


Quote from this notice:

"tl;dr:  Please avoid the 12.2.6 packages that are currently present on
download.ceph.com.  We will have a 12.2.7 published ASAP (probably
Monday).

If you do not use bluestore or erasure-coded pools, none of the issues
affect you.


Details:

We built 12.2.6 and pushed it to the repos Wednesday, but as that was
happening realized there was a potentially dangerous regression in
12.2.5[1] that an upgrade might exacerbate.  While we sorted that issue
out, several people noticed the updated version in the repo and
upgraded.  That turned up two other regressions[2][3].  We have fixes for
those, but are working on an additional fix to make the damage from [3]
be transparently repaired."



Regards,

Uwe



On 14.07.2018 at 17:02, Glen Baars wrote:

Hello Ceph users!

Note to users, don't install new servers on Friday the 13th!

We added a new ceph node on Friday and it has received the latest 12.2.6 
update. I started to see CRC errors and investigated hardware issues. I have 
since found that it is caused by the 12.2.6 release. About 80TB copied onto 
this server.

I have set noout,noscrub,nodeepscrub and repaired the affected PGs ( ceph pg 
repair ) . This has cleared the errors.

*** No idea if this is a good way to fix the issue. From the bug report this 
issue is in the deep scrub and therefore I suppose stopping it will limit the 
issues. ***
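(For reference, the flags and repair step described above correspond roughly to the following; the pg ids are whatever "ceph health detail" lists as inconsistent:)

    ceph osd set noout
    ceph osd set noscrub
    ceph osd set nodeep-scrub

    ceph health detail | grep inconsistent   # find the affected pg ids
    ceph pg repair <pgid>                    # one repair per inconsistent pg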

Can anyone tell me what to do? Downgrade seems that it won't fix the issue. 
Maybe remove this node and rebuild with 12.2.5 and resync data? Wait a few days 
for 12.2.7?

Kind regards,
Glen Baars


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] jemalloc / Bluestore

2018-07-05 Thread Uwe Sauter

Ah, thanks…

I'm currently trying to diagnose a performance regression that occurs with the Ubuntu 4.15 kernel (on a Proxmox system) 
and thought that jemalloc, given the old reports, could help with that. But then I ran into that bug report.


I'll take from your info that I'm gonna stick to tcmalloc. You know, so much to 
test and benchmark, so little time…
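(Side note, in case it helps someone checking their own setup: a quick way to see which allocator a build is actually linked against:)

    ldd /usr/bin/ceph-osd | egrep 'tcmalloc|jemalloc'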


Regards,

Uwe

On 05.07.2018 at 19:08, Mark Nelson wrote:

Hi Uwe,


As luck would have it we were just looking at memory allocators again and ran some quick RBD and RGW tests that stress 
memory allocation:



https://drive.google.com/uc?export=download&id=1VlWvEDSzaG7fE4tnYfxYtzeJ8mwx4DFg


The gist of it is that tcmalloc looks like it's doing pretty well relative to the version of jemalloc and libc malloc 
tested (The jemalloc version here is pretty old though).  You are also correct that there have been reports of crashes 
with jemalloc, potentially related to rocksdb.  Right now it looks like our decision to stick with tcmalloc is still 
valid.  I wouldn't suggest switching unless you can find evidence that tcmalloc is behaving worse than the others (and 
please let me know if you do!).


Thanks,

Mark


On 07/05/2018 08:08 AM, Uwe Sauter wrote:

Hi all,

is using jemalloc still recommended for Ceph?

There are multiple sites (e.g. https://ceph.com/geen-categorie/the-ceph-and-tcmalloc-performance-story/) from 2015 
where jemalloc

is praised for higher performance but I found a bug report that Bluestore 
crashes when used with jemalloc.

Regards,

Uwe

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] jemalloc / Bluestore

2018-07-05 Thread Uwe Sauter
Hi all,

is using jemalloc still recommended for Ceph?

There are multiple sites (e.g. 
https://ceph.com/geen-categorie/the-ceph-and-tcmalloc-performance-story/) from 
2015 where jemalloc
is praised for higher performance but I found a bug report that Bluestore 
crashes when used with jemalloc.

Regards,

Uwe
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] in retrospect get OSD for "slow requests are blocked" ? / get detailed health status via librados?

2018-05-19 Thread Uwe Sauter



Am 19.05.2018 um 01:45 schrieb Brad Hubbard:

On Thu, May 17, 2018 at 6:06 PM, Uwe Sauter <uwe.sauter...@gmail.com> wrote:

Brad,

thanks for the bug report. This is exactly the problem I am having (log-wise).


You don't give any indication what version you are running but see
https://tracker.ceph.com/issues/23205



the cluster is a Proxmox installation, which is based on an Ubuntu kernel.

# ceph -v
ceph version 12.2.5 (dfcb7b53b2e4fcd2a5af0240d4975adc711ab96e) luminous
(stable)

The mystery is that these blocked requests occur frequently when at least
one of the 6 servers is booted with kernel 4.15.17; if all are running
4.13.16, the number of blocked requests is infrequent and low.


Sounds like you need to profile your two kernel versions and work out
why one is under-performing.



Well, the problem is that I see this behavior only in our production system (6 
hosts and 22 OSDs total). The test system I have is
a bit smaller (only 3 hosts with 12 OSDs on older hardware) and shows no sign 
of this possible regression…


Are you saying you can't gather performance data from your production system?


As far as I can tell the issue only occurs on the production cluster. Without a 
way to reproduce it on the test cluster I can't bisect the kernels, as the 
production cluster runs our central infrastructure, and each time the active 
LDAP is stuck, most of the other services are stuck as well…
My colleagues won't appreciate that.

What other kind of performance data would you have collected?

Uwe






Regards,

 Uwe





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] in retrospect get OSD for "slow requests are blocked" ? / get detailed health status via librados?

2018-05-17 Thread Uwe Sauter
Brad,

thanks for the bug report. This is exactly the problem I am having (log-wise).
>>>
>>> You don't give any indication what version you are running but see
>>> https://tracker.ceph.com/issues/23205
>>
>>
>> the cluster is a Proxmox installation, which is based on an Ubuntu kernel.
>>
>> # ceph -v
>> ceph version 12.2.5 (dfcb7b53b2e4fcd2a5af0240d4975adc711ab96e) luminous
>> (stable)
>>
>> The mystery is that these blocked requests occur frequently when at least
>> one of the 6 servers is booted with kernel 4.15.17; if all are running
>> 4.13.16, the number of blocked requests is infrequent and low.
> 
> Sounds like you need to profile your two kernel versions and work out
> why one is under-performing.
> 

Well, the problem is that I see this behavior only in our production system (6 
hosts and 22 OSDs total). The test system I have is
a bit smaller (only 3 hosts with 12 OSDs on older hardware) and shows no sign 
of this possible regression…


Regards,

Uwe
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] in retrospect get OSD for "slow requests are blocked" ? / get detailed health status via librados?

2018-05-17 Thread Uwe Sauter

Hi,


I'm currently chewing on an issue regarding "slow requests are blocked". I'd 
like to identify the OSD that is causing those events
once the cluster is back to HEALTH_OK (as I have no monitoring yet that would 
get this info in realtime).

Collecting this information could help identify aging disks if you were able to 
accumulate and analyze which OSD had blocking
requests in the past and how often those events occur.

My research so far lets me think that this information is only available as 
long as the requests are actually blocked. Is this correct?


You don't give any indication what version you are running but see
https://tracker.ceph.com/issues/23205


the cluster is a Proxmox installation, which is based on an Ubuntu kernel.

# ceph -v
ceph version 12.2.5 (dfcb7b53b2e4fcd2a5af0240d4975adc711ab96e) luminous (stable)

The mystery is that these blocked requests occur frequently when at least one of the 6 servers is booted with kernel 
4.15.17; if all are running 4.13.16, the number of blocked requests is infrequent and low.



Regards,

Uwe
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] in retrospect get OSD for "slow requests are blocked" ? / get detailed health status via librados?

2018-05-16 Thread Uwe Sauter
Hi Mohamad,


>> I'm currently chewing on an issue regarding "slow requests are blocked". I'd 
>> like to identify the OSD that is causing those events
>> once the cluster is back to HEALTH_OK (as I have no monitoring yet that 
>> would get this info in realtime).
>>
>> Collecting this information could help identify aging disks if you were able 
>> to accumulate and analyze which OSD had blocking
>> requests in the past and how often those events occur.
>>
>> My research so far lets me think that this information is only available as
>> long as the requests are actually blocked. Is this correct?
> 
> I think this is what you're looking for:
> 
> $> ceph daemon osd.X dump_historic_slow_ops
> 
> which gives you recent slow operations, as opposed to
> 
> $> ceph daemon osd.X dump_blocked_ops
> 
> which returns current blocked operations. You can also add a filter to
> those commands.

Thanks for these commands, I'll have a look into them. If I understand them 
correctly, it means that I need to run them on each server for each OSD instead 
of at a central location, is that correct?
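
A per-host loop like the one below is roughly what I have in mind (just a sketch; it assumes the default admin socket paths /var/run/ceph/ceph-osd.<id>.asok and that the JSON output carries an "ops" list):

    # Sketch: collect recent slow ops from every OSD admin socket on this host.
    import glob, json, re, subprocess

    for sock in sorted(glob.glob("/var/run/ceph/ceph-osd.*.asok")):
        osd_id = re.search(r"ceph-osd\.(\d+)\.asok", sock).group(1)
        out = subprocess.check_output(
            ["ceph", "daemon", "osd." + osd_id, "dump_historic_slow_ops"])
        ops = json.loads(out.decode("utf-8")).get("ops", [])
        print("osd.%s: %d recorded slow ops" % (osd_id, len(ops)))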

Regards,

Uwe
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] in retrospect get OSD for "slow requests are blocked" ? / get detailed health status via librados?

2018-05-16 Thread Uwe Sauter
Hi folks,

I'm currently chewing on an issue regarding "slow requests are blocked". I'd 
like to identify the OSD that is causing those events
once the cluster is back to HEALTH_OK (as I have no monitoring yet that would 
get this info in realtime).

Collecting this information could help identify aging disks if you were able to 
accumulate and analyze which OSD had blocking
requests in the past and how often those events occur.

My research so far lets me think that this information is only available as 
long as the requests are actually blocked. Is this correct?

MON logs only show that those events occur and how many requests are in a 
blocked state, but give no indication of which OSD is affected. Is there a way 
to identify blocking requests from the OSD log files?
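
Something like the following is what I would try for the log files, though both the log path and the "slow request" wording are assumptions about a default installation on my part:

    # Sketch: count "slow request" lines per OSD in the local OSD logs, after the fact.
    import glob, re
    from collections import Counter

    counts = Counter()
    for logfile in glob.glob("/var/log/ceph/ceph-osd.*.log"):
        osd = re.search(r"ceph-osd\.(\d+)\.log", logfile).group(1)
        with open(logfile) as f:
            for line in f:
                if "slow request" in line:
                    counts["osd." + osd] += 1

    for name, n in counts.most_common():
        print(name, n)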


On a side note: I was trying to write a small Python script that would extract 
this kind of information in real time. While I was able to register a MonitorLog 
callback that receives the same messages as you would get with "ceph -w", I 
haven't seen anything in the librados Python bindings documentation that would 
do the equivalent of "ceph health detail". Any suggestions on how to 
get the blocking OSDs via librados?
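
To make the question concrete, this is roughly what I'm attempting; a sketch only, as the exact mon command JSON for "health detail" and the shape of the reply are guesses on my part that I haven't verified:

    # Sketch: ask the monitors for "health detail" via librados instead of parsing "ceph -w".
    import json
    import rados

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    try:
        cmd = json.dumps({"prefix": "health", "detail": "detail", "format": "json"})
        ret, out, errs = cluster.mon_command(cmd, b"")
        if ret == 0:
            health = json.loads(out.decode("utf-8"))
            for name, check in health.get("checks", {}).items():
                print(name, check.get("summary", {}).get("message", ""))
    finally:
        cluster.shutdown()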


Thanks,

Uwe
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com