Re: [ceph-users] monitor ghosted

2020-01-09 Thread Peter Eisch
As oddly as it drifted away, it came back.  Next time, should there be a next 
time, I will snag logs as Sascha suggested.

The window for all this, in local time: 9:02 am - disassociated; 11:20 pm - 
reassociated.  No changes were made, though I did reboot the mon02 host at 1 pm.  No 
other network or host issues were observed in the rest of the cluster or at the 
site.

Thank you for your replies and I'll gather better logging next time.

peter




From: Brad Hubbard 
Date: Wednesday, January 8, 2020 at 6:21 PM
To: Peter Eisch 
Cc: "ceph-users@lists.ceph.com" 
Subject: Re: [ceph-users] monitor ghosted


On Thu, Jan 9, 2020 at 5:48 AM Peter Eisch <peter.ei...@virginpulse.com> wrote:
Hi,

This morning one of my three monitor hosts got booted from the Nautilus 14.2.4 
cluster and it won’t rejoin. There haven’t been any changes or events at this 
site at all. The conf file is unchanged and the same as on the other two 
monitors. The host is also running the MDS and MGR daemons without any issue. The 
ceph-mon log shows this repeating:

2020-01-08 13:33:29.403 7fec1a736700 1 mon.cephmon02@1(probing) e7 
handle_auth_request failed to assign global_id
2020-01-08 13:33:29.433 7fec1a736700 1 mon.cephmon02@1(probing) e7 
handle_auth_request failed to assign global_id
2020-01-08 13:33:29.541 7fec1a736700 1 mon.cephmon02@1(probing) e7 
handle_auth_request failed to assign global_id
...

Try gathering a log with debug_mon 20. That should provide more detail about 
why  AuthMonitor::_assign_global_id() didn't return an ID.
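If it helps, a minimal sketch of bumping that on the stuck mon and reverting it 
afterwards (assuming the default admin socket and log paths; injectargs works the 
same way):

# ceph daemon mon.cephmon02 config set debug_mon 20/20
# tail -f /var/log/ceph/ceph-mon.cephmon02.log
# ceph daemon mon.cephmon02 config set debug_mon 1/5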


There is nothing in the logs of the two remaining/healthy monitors. What is my 
best practice to get this host back in the cluster?

peter


--
Cheers,
Brad



[ceph-users] monitor ghosted

2020-01-08 Thread Peter Eisch
Hi,

This morning one of my three monitor hosts got booted from the Nautilus 14.2.4 
cluster and it won’t rejoin.  There haven’t been any changes or events at this 
site at all.  The conf file is unchanged and the same as on the other two 
monitors.  The host is also running the MDS and MGR daemons without any issue.  
The ceph-mon log shows this repeating:

2020-01-08 13:33:29.403 7fec1a736700  1 mon.cephmon02@1(probing) e7 
handle_auth_request failed to assign global_id
2020-01-08 13:33:29.433 7fec1a736700  1 mon.cephmon02@1(probing) e7 
handle_auth_request failed to assign global_id
2020-01-08 13:33:29.541 7fec1a736700  1 mon.cephmon02@1(probing) e7 
handle_auth_request failed to assign global_id
...

There is nothing in the logs of the two remaining/healthy monitors.  What is my 
best practice to get this host back in the cluster?
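For reference, a couple of read-only checks (assuming default admin socket paths) 
that show the ghosted mon sitting in "probing" while the other two hold quorum:

# ceph quorum_status --format json-pretty      (from one of the healthy mon hosts)
# ceph daemon mon.cephmon02 mon_status         (on the ghosted host)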

peter




Re: [ceph-users] RGW/swift segments

2019-10-31 Thread Peter Eisch
Removal hints are created correctly for the head object (of 0 bytes).  The 
expiry process is unaware of the relationship between the head object and the 
segments.  Likewise, no removal hint is created for the swift segments.

I’m awaiting a blessing to file this as a bug.  In the meantime, would any capable 
reader of this be willing to file it, please?

peter


From: ceph-users  on behalf of Peter Eisch 

Date: Thursday, October 31, 2019 at 10:27 AM
To: "ceph-users@lists.ceph.com" 
Subject: Re: [ceph-users] RGW/swift segments

I’ve verified this is still broken on Nautilus (14.2.4).  The code for deleting 
a swift object split into segments is correct.  I’m now trying to follow the 
object expirer code.  It seems it reads the metadata for the lead object 
(0 bytes in length) and expires that correctly, but never knows to look 
for its associated segments in the metadata.  It could be because these are 
stored in a different bucket – I’m not sure.  Once the lead object is deleted, 
the metadata is removed and all the segments are then orphaned.

Is anyone here familiar with the RGWObjectExpirer code with whom I could confer?

peter

From: ceph-users  on behalf of Peter Eisch 

Date: Monday, October 28, 2019 at 3:06 PM
To: "ceph-users@lists.ceph.com" 
Subject: Re: [ceph-users] RGW/swift segments

I should have noted this is with Luminous 12.2.12 and consistent across 
swiftclient versions from 3.0.0 to 3.8.1, which may not be relevant.  With a 
proper nod I can open a ticket for this – I just want to make sure it’s not a 
config issue.

[client.rgw.cephrgw-s01]
  host = cephrgw-s01
  keyring = /etc/ceph/ceph.client.rgw.cephrgw-s01
  rgw_zone = 
  rgw zonegroup = us
  rgw realm = 
  rgw dns name = rgw-s00.
  rgw dynamic resharding = false
  rgw swift account in url = true
  rgw swift url = https://rgw-s00./swift/v1
  rgw keystone make new tenants = true
  rgw keystone implicit tenants = true
  rgw enable usage log = true
  rgw keystone accepted roles = _member_,admin
  rgw keystone admin domain = Default
  rgw keystone admin password = 
  rgw keystone admin project = admin
  rgw keystone admin user = admin
  rgw keystone api version = 3
  rgw keystone url = https://keystone-s00.
  rgw relaxed s3 bucket names = true
  rgw s3 auth use keystone = true
  rgw thread pool size = 4096
  rgw keystone revocation interval = 300
  rgw keystone token cache size = 1
  rgw swift versioning enabled = true
  rgw log nonexistent bucket = true

All tips accepted…

peter

From: ceph-users  on behalf of Peter Eisch 

Date: Monday, October 28, 2019 at 9:28 AM
To: "ceph-users@lists.ceph.com" 
Subject: [ceph-users] RGW/swift segments


Hi,

When uploading to RGW via swift I can set an expiration time.  The files being 
uploaded are large.  We segment them using the swift upload ‘-S’ arg.  This 
results in a 0-byte file in the bucket and all the data frags landing in a 
*_segments bucket.

When the expiration passes, the 0-byte file is deleted but all the segments 
remain.  Am I misconfigured, or is this a bug where it won’t expire the actual 
data?  Shouldn’t RGW set the expiration on the uploaded segments too, if they’re 
managed separately?

Thanks,

peter


Re: [ceph-users] RGW/swift segments

2019-10-31 Thread Peter Eisch
I’ve verified this is still broken on Nautilus (14.2.4).  The code for deleting 
a swift object split into segments is correct.  I’m now trying to follow the 
object expirer code.  It seems it reads the metadata for the lead object 
(0 bytes in length) and expires that correctly, but never knows to look 
for its associated segments in the metadata.  It could be because these are 
stored in a different bucket – I’m not sure.  Once the lead object is deleted, 
the metadata is removed and all the segments are then orphaned.

Is anyone here familiar with the RGWObjectExpirer code with whom I could confer?

peter


From: ceph-users  on behalf of Peter Eisch 

Date: Monday, October 28, 2019 at 3:06 PM
To: "ceph-users@lists.ceph.com" 
Subject: Re: [ceph-users] RGW/swift segments

I should have noted this is with Luminous 12.2.12 and consistent across 
swiftclient versions from 3.0.0 to 3.8.1, which may not be relevant.  With a 
proper nod I can open a ticket for this – I just want to make sure it’s not a 
config issue.

[client.rgw.cephrgw-s01]
  host = cephrgw-s01
  keyring = /etc/ceph/ceph.client.rgw.cephrgw-s01
  rgw_zone = 
  rgw zonegroup = us
  rgw realm = 
  rgw dns name = rgw-s00.
  rgw dynamic resharding = false
  rgw swift account in url = true
  rgw swift url = https://rgw-s00./swift/v1
  rgw keystone make new tenants = true
  rgw keystone implicit tenants = true
  rgw enable usage log = true
  rgw keystone accepted roles = _member_,admin
  rgw keystone admin domain = Default
  rgw keystone admin password = 
  rgw keystone admin project = admin
  rgw keystone admin user = admin
  rgw keystone api version = 3
  rgw keystone url = https://keystone-s00.
  rgw relaxed s3 bucket names = true
  rgw s3 auth use keystone = true
  rgw thread pool size = 4096
  rgw keystone revocation interval = 300
  rgw keystone token cache size = 1
  rgw swift versioning enabled = true
  rgw log nonexistent bucket = true

All tips accepted…

peter

From: ceph-users  on behalf of Peter Eisch 

Date: Monday, October 28, 2019 at 9:28 AM
To: "ceph-users@lists.ceph.com" 
Subject: [ceph-users] RGW/swift segments


Hi,

When uploading to RGW via swift I can set an expiration time.  The files being 
uploaded are large.  We segment them using the swift upload ‘-S’ arg.  This 
results in a 0-byte file in the bucket and all the data frags landing in a 
*_segments bucket.

When the expiration passes, the 0-byte file is deleted but all the segments 
remain.  Am I misconfigured, or is this a bug where it won’t expire the actual 
data?  Shouldn’t RGW set the expiration on the uploaded segments too, if they’re 
managed separately?

Thanks,

peter


Re: [ceph-users] RGW/swift segments

2019-10-28 Thread Peter Eisch
I should have noted this is with Luminous 12.2.12 and consistent across 
swiftclient versions from 3.0.0 to 3.8.1, which may not be relevant.  With a 
proper nod I can open a ticket for this – I just want to make sure it’s not a 
config issue.

[client.rgw.cephrgw-s01]
  host = cephrgw-s01
  keyring = /etc/ceph/ceph.client.rgw.cephrgw-s01
  rgw_zone = 
  rgw zonegroup = us
  rgw realm = 
  rgw dns name = rgw-s00.
  rgw dynamic resharding = false
  rgw swift account in url = true
  rgw swift url = https://rgw-s00./swift/v1
  rgw keystone make new tenants = true
  rgw keystone implicit tenants = true
  rgw enable usage log = true
  rgw keystone accepted roles = _member_,admin
  rgw keystone admin domain = Default
  rgw keystone admin password = 
  rgw keystone admin project = admin
  rgw keystone admin user = admin
  rgw keystone api version = 3
  rgw keystone url = https://keystone-s00.
  rgw relaxed s3 bucket names = true
  rgw s3 auth use keystone = true
  rgw thread pool size = 4096
  rgw keystone revocation interval = 300
  rgw keystone token cache size = 1
  rgw swift versioning enabled = true
  rgw log nonexistent bucket = true

All tips accepted…

peter



From: ceph-users  on behalf of Peter Eisch 

Date: Monday, October 28, 2019 at 9:28 AM
To: "ceph-users@lists.ceph.com" 
Subject: [ceph-users] RGW/swift segments


Hi,

When uploading to RGW via swift I can set an expiration time.  The files being 
uploaded are large.  We segment them using the swift upload ‘-S’ arg.  This 
results in a 0-byte file in the bucket and all the data frags landing in a 
*_segments bucket.

When the expiration passes, the 0-byte file is deleted but all the segments 
remain.  Am I misconfigured, or is this a bug where it won’t expire the actual 
data?  Shouldn’t RGW set the expiration on the uploaded segments too, if they’re 
managed separately?

Thanks,

peter


[ceph-users] RGW/swift segments

2019-10-28 Thread Peter Eisch
Hi,

When uploading to RGW via swift I can set an expiration time.  The files being 
uploaded are large.  We segment them using the swift upload ‘-S’ arg.  This 
results in a 0-byte file in the bucket and all the data frags landing in a 
*_segments bucket.
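For concreteness, a sketch of the kind of upload we do (container and object 
names here are made up, and the sizes are just examples):

# swift upload -S 1073741824 -H "X-Delete-After: 604800" backups big-backup.tar

That splits big-backup.tar into 1 GiB segments stored in backups_segments, while 
the 0-byte manifest lands in "backups" with the expiry header on it.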

When the expiration passes, the 0-byte file is deleted but all the segments 
remain.  Am I misconfigured, or is this a bug where it won’t expire the actual 
data?  Shouldn’t RGW set the expiration on the uploaded segments too, if they’re 
managed separately?

Thanks,

peter



Re: [ceph-users] Ceph for "home lab" / hobbyist use?

2019-09-06 Thread Peter Woodman
2GB of RAM is gonna be really tight, probably. However, I do something similar
at home with a bunch of rock64 4GB boards, and it works well. There are
sometimes issues with the released ARM packages (frequently crc32 doesn't
work, which isn't great), so you may have to build your own on the board
you're targeting or on something like Scaleway; YMMV.
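One knob that might help on RAM-tight boards -- treat this as an untested sketch
-- is capping the BlueStore cache with osd_memory_target in ceph.conf, e.g.:

[osd]
    osd memory target = 1073741824

That asks each OSD to stay around 1 GiB of memory, trading cache hit rate for
headroom.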

On Fri, Sep 6, 2019 at 6:16 PM Cranage, Steve 
wrote:

> I use those HC2 nodes for my home Ceph cluster, but my setup only has to
> support the librados API; my software does HSM between regular XFS file
> systems and the RADOS API, so I don’t need the MDS and the rest, and I
> can’t tell you if you’ll be happy in your configuration.
>
>
>
> Steve Cranage
>
> Principal Architect, Co-Founder
>
> DeepSpace Storage
>
> 719-930-6960
>
>
> --
> *From:* ceph-users  on behalf of
> William Ferrell 
> *Sent:* Friday, September 6, 2019 3:16:30 PM
> *To:* ceph-users@lists.ceph.com 
> *Subject:* [ceph-users] Ceph for "home lab" / hobbyist use?
>
> Hello everyone!
>
> After years of running several ZFS pools on a home server and several
> disk failures along the way, I've decided that my current home storage
> setup stinks. So far there hasn't been any data loss, but
> recovering/"resilvering" a ZFS pool after a disk failure is a
> nail-biting experience. I also think the way things are set up now
> isn't making the best use of all the disks attached to the server;
> they were acquired over time instead of all at once, so I've got 4
> 4-disk raidz1 pools, each in their own enclosures. If any enclosure
> dies, all that pool's data is lost. Despite having a total of 16 disks
> in use for storage, the entire system can only "safely" lose one disk
> before there's a risk of a second failure taking a bunch of data with
> it.
>
> I'd like to ask the list's opinions on running a Ceph cluster in a
> home environment as a filer using cheap, low-power systems. I don't
> have any expectations for high performance (this will be built on a
> gigabit network, and just used for backups and streaming videos,
> music, etc. for two people); the main concern is resiliency if one or
> two disks fail, and the secondary concern is having a decent usable
> storage capacity. Being able to slowly add capacity to the cluster one
> disk at a time is a very appealing bonus.
>
> I'm interested in using these things as OSDs (and hopefully monitors
> and metadata servers):
> https://www.hardkernel.com/shop/odroid-hc2-home-cloud-two/
>
> They're about $50 each, can boot from MicroSD or eMMC flash (basically
> an SSD with a custom connector), and have one SATA port. They have
> 8-core 32-bit CPUs, 2GB of RAM and a gigabit ethernet port. Four of
> them (including disks) can run off a single 12V/8A power adapter
> (basically 100 watts per set of 4). The obvious appeal is price, plus
> they're stackable so they'd be easy to hide away in a closet.
>
> Is it feasible for these to work as OSDs at all? The Ceph hardware
> recommendations page suggests OSDs need 1GB per TB of space, so does
> this mean these wouldn't be suitable with, say, a 4TB or 8TB disk? Or
> would they work, but just more slowly?
>
> Pushing my luck further (assuming the HC2 can handle OSD duties at
> all), is that enough muscle to run the monitor and/or metadata
> servers? Should monitors and MDS's be run separately, or can/should
> they piggyback on hosts running OSDs?
>
> I'd be perfectly happy with a setup like this even if it could only
> achieve speeds in the 20-30MB/sec range.
>
> Is this a dumb idea, or could it actually work? Are there any other
> recommendations among Ceph users for low-end hardware to cobble
> together a working cluster?
>
> Any feedback is sincerely appreciated.
>
> Thanks!


Re: [ceph-users] health: HEALTH_ERR Module 'devicehealth' has failed: Failed to import _strptime because the import lockis held by another thread.

2019-08-28 Thread Peter Eisch

> Restart of single module is: `ceph mgr module disable devicehealth ; ceph mgr 
> module enable devicehealth`.

Thank you for your reply.  I receive an error, as the module can't be 
disabled.

I may have worked through this by restarting the nodes in rapid succession.
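In hindsight, restarting or failing over just the mgr daemons would probably have 
been a lighter touch than restarting nodes; a rough sketch of what I'd try first 
next time (cephmon01 being my active mgr):

# ceph mgr fail cephmon01              (hand the active role to a standby)
# systemctl restart ceph-mgr.target    (on the mgr host, if that isn't enough)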

peter





[ceph-users] health: HEALTH_ERR Module 'devicehealth' has failed: Failed to import _strptime because the import lockis held by another thread.

2019-08-27 Thread Peter Eisch
Hi,

What is the correct/best way to address this?  It seems like a Python issue; 
maybe it's time I learn how to "restart" modules?  The cluster seems to be 
working despite this.

    health: HEALTH_ERR
            Module 'devicehealth' has failed: Failed to import _strptime 
because the import lockis held by another thread.



CEPH: Nautilus 14.2.2
3 - mons
3 - mgrs.
3 - mds
Full status:

  cluster:
id: 2fdb5976-1a38-4b29-1234-1ca74a9466ec
health: HEALTH_ERR
Module 'devicehealth' has failed: Failed to import _strptime 
because the import lockis held by another thread.

  services:
mon: 3 daemons, quorum cephmon01,cephmon02,cephmon03 (age 33m)
mgr: cephmon01(active, since 2h), standbys: cephmon02, cephmon03
mds: cephfs1:1 {0=cephmds-a03=up:active} 2 up:standby
osd: 103 osds: 103 up, 103 in
rgw: 3 daemons active (cephrgw-a01, cephrgw-a02, cephrgw-a03)

  data:
pools:   18 pools, 4901 pgs
objects: 4.28M objects, 16 TiB
usage:   49 TiB used, 97 TiB / 146 TiB avail
pgs: 4901 active+clean

  io:
client:   7.4 KiB/s rd, 24 MiB/s wr, 7 op/s rd, 628 op/s wr





Re: [ceph-users] How to add 100 new OSDs...

2019-07-26 Thread Peter Sabaini
On 26.07.19 15:03, Stefan Kooman wrote:
> Quoting Peter Sabaini (pe...@sabaini.at):
>> What kind of commit/apply latency increases have you seen when adding a
>> large number of OSDs? I'm nervous about how sensitive workloads might react
>> here, esp. with spinners.
> 
> You mean when there is backfilling going on? Instead of doing "a big

Yes, exactly. I usually tune down the max rebalance and max recovery active
knobs to lessen the impact, but I still found that the additional write load can
substantially increase I/O latencies. Not all workloads like this.
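For concreteness, the knobs I mean, set at runtime; the values are just a
conservative starting point, not a recommendation:

# ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'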

> bang" you can also use Dan van der Ster's trick with upmap balancer:
> https://github.com/cernceph/ceph-scripts/blob/master/tools/upmap/upmap-remapped.py
> 
> See
> https://www.slideshare.net/Inktank_Ceph/ceph-day-berlin-mastering-ceph-operations-upmap-and-the-mgr-balancer

Thanks, that's interesting -- though I wish it weren't necessary.
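If I read the description below correctly, the rough flow would be something like
this; purely my reading of it, not something I've run:

# ceph osd set norebalance ; ceph osd set nobackfill ; ceph osd set norecover
# ceph balancer off
  ... add the new OSDs and let them peer ...
# ./upmap-remapped.py        (run as many times as needed, per the description below)
# ceph osd unset norecover ; ceph osd unset nobackfill ; ceph osd unset norebalance
# ceph balancer mode upmap
# ceph balancer on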


cheers,
peter.


> So you would still have norebalance / nobackfill / norecover and ceph
> balancer off. Then you run the script as many times as necessary to get
> "HEALTH_OK" again (on clusters other than nautilus) and there a no more
> PGs remapped. Unset the flags and enable the ceph balancer ... now the
> balancer will slowly move PGs to the new OSDs.
> 
> We've used this trick to increase the number of PGs on a pool, and will
> use this to expand the cluster in the near future.
> 
> This only works if you can use the balancer in "upmap" mode. Note that
> using upmap requires that all clients be Luminous or newer. If you are
> using cephfs kernel client it might report as not compatible (jewel) but
> recent linux distributions work well (Ubuntu 18.04 / CentOS 7).
> 
> Gr. Stefan
> 



Re: [ceph-users] How to add 100 new OSDs...

2019-07-26 Thread Peter Sabaini
What kind of commit/apply latency increases have you seen when adding a
large number of OSDs? I'm nervous about how sensitive workloads might react
here, esp. with spinners.

cheers,
peter.

On 24.07.19 20:58, Reed Dier wrote:
> Just chiming in to say that this too has been my preferred method for
> adding [large numbers of] OSDs.
> 
> Set the norebalance nobackfill flags.
> Create all the OSDs, and verify everything looks good.
> Make sure my max_backfills, recovery_max_active are as expected.
> Make sure everything has peered.
> Unset flags and let it run.
> 
> One crush map change, one data movement.
> 
> Reed
> 
>>
>> That works, but with newer releases I've been doing this:
>>
>> - Make sure cluster is HEALTH_OK
>> - Set the 'norebalance' flag (and usually nobackfill)
>> - Add all the OSDs
>> - Wait for the PGs to peer. I usually wait a few minutes
>> - Remove the norebalance and nobackfill flag
>> - Wait for HEALTH_OK
>>
>> Wido
>>



Re: [ceph-users] Upgrading and lost OSDs

2019-07-24 Thread Peter Eisch
Hi,

I appreciate the insistence that the directions be followed.  I wholly agree.  
The only liberty I took was to do a ‘yum update’ instead of just ‘yum update 
ceph-osd’ and then reboot.  (Also, my MDS runs on the MON hosts, so it got 
updated a step early.)

As for the logs:

[2019-07-24 15:07:22,713][ceph_volume.main][INFO  ] Running command: 
ceph-volume  simple scan
[2019-07-24 15:07:22,714][ceph_volume.process][INFO  ] Running command: 
/bin/systemctl show --no-pager --property=Id --state=running ceph-osd@*
[2019-07-24 15:07:27,574][ceph_volume.main][INFO  ] Running command: 
ceph-volume  simple activate --all
[2019-07-24 15:07:27,575][ceph_volume.devices.simple.activate][INFO  ] 
activating OSD specified in 
/etc/ceph/osd/0-93fb5f2f-0273-4c87-a718-886d7e6db983.json
[2019-07-24 15:07:27,576][ceph_volume.devices.simple.activate][ERROR ] Required 
devices (block and data) not present for bluestore
[2019-07-24 15:07:27,576][ceph_volume.devices.simple.activate][ERROR ] 
bluestore devices found: [u'data']
[2019-07-24 15:07:27,576][ceph_volume][ERROR ] exception caught by decorator
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ceph_volume/decorators.py", line 59, 
in newfunc
return f(*a, **kw)
  File "/usr/lib/python2.7/site-packages/ceph_volume/main.py", line 148, in main
terminal.dispatch(self.mapper, subcommand_args)
  File "/usr/lib/python2.7/site-packages/ceph_volume/terminal.py", line 182, in 
dispatch
instance.main()
  File "/usr/lib/python2.7/site-packages/ceph_volume/devices/simple/main.py", 
line 33, in main
terminal.dispatch(self.mapper, self.argv)
  File "/usr/lib/python2.7/site-packages/ceph_volume/terminal.py", line 182, in 
dispatch
instance.main()
  File 
"/usr/lib/python2.7/site-packages/ceph_volume/devices/simple/activate.py", line 
272, in main
self.activate(args)
  File "/usr/lib/python2.7/site-packages/ceph_volume/decorators.py", line 16, 
in is_root
return func(*a, **kw)
  File 
"/usr/lib/python2.7/site-packages/ceph_volume/devices/simple/activate.py", line 
131, in activate
self.validate_devices(osd_metadata)
  File 
"/usr/lib/python2.7/site-packages/ceph_volume/devices/simple/activate.py", line 
62, in validate_devices
raise RuntimeError('Unable to activate bluestore OSD due to missing 
devices')
RuntimeError: Unable to activate bluestore OSD due to missing devices

(this is repeated for each of the 16 drives)

Any other thoughts?  (I’ll delete/create the OSDs with ceph-deploy otherwise.)

peter



From: Alfredo Deza 
Date: Wednesday, July 24, 2019 at 3:02 PM
To: Peter Eisch 
Cc: Paul Emmerich , "ceph-users@lists.ceph.com" 

Subject: Re: [ceph-users] Upgrading and lost OSDs


On Wed, Jul 24, 2019 at 3:49 PM Peter Eisch <peter.ei...@virginpulse.com> wrote:

I’m at step 6.  I updated/rebooted the host to complete “installing the new 
packages and restarting the ceph-osd daemon” on the first OSD host.  All the 
systemctl definitions to start the OSDs were deleted, all the properties in 
/var/lib/ceph/osd/ceph-* directories were deleted.  All the files in 
/var/lib/ceph/osd-lockbox, for comparison, were untouched and still present.

Peeking into step 7 I can run ceph-volume:

# ceph-volume simple scan /dev/sda1
Running command: /usr/sbin/cryptsetup status /dev/sda1
Running command: /usr/sbin/cryptsetup status 
93fb5f2f-0273-4c87-a718-886d7e6db983
Running command: /bin/mount -v /dev/sda5 /tmp/tmpF5F8t2
stdout: mount: /dev/sda5 mounted on /tmp/tmpF5F8t2.
Running command: /usr/sbin/cryptsetup status /dev/sda5
Running command: /bin/ceph --cluster ceph --name 
client.osd-lockbox.93fb5f2f-0273-4c87-a718-886d7e6db983 --keyring 
/tmp/tmpF5F8t2/keyring config-key get 
dm-crypt/osd/93fb5f2f-0273-4c87-a718-886d7e6db983/luks
Running command: /bin/umount -v /tmp/tmpF5F8t2
stderr: umount: /tmp/tmpF5F8t2 (/dev/sda5) unmounted

Re: [ceph-users] Upgrading and lost OSDs

2019-07-24 Thread Peter Eisch

I’m at step 6.  I updated/rebooted the host to complete “installing the new 
packages and restarting the ceph-osd daemon” on the first OSD host.  All the 
systemctl definitions to start the OSDs were deleted, all the properties in 
/var/lib/ceph/osd/ceph-* directories were deleted.  All the files in 
/var/lib/ceph/osd-lockbox, for comparison, were untouched and still present.

Peeking into step 7 I can run ceph-volume:

# ceph-volume simple scan /dev/sda1
Running command: /usr/sbin/cryptsetup status /dev/sda1
Running command: /usr/sbin/cryptsetup status 
93fb5f2f-0273-4c87-a718-886d7e6db983
Running command: /bin/mount -v /dev/sda5 /tmp/tmpF5F8t2
stdout: mount: /dev/sda5 mounted on /tmp/tmpF5F8t2.
Running command: /usr/sbin/cryptsetup status /dev/sda5
Running command: /bin/ceph --cluster ceph --name 
client.osd-lockbox.93fb5f2f-0273-4c87-a718-886d7e6db983 --keyring 
/tmp/tmpF5F8t2/keyring config-key get 
dm-crypt/osd/93fb5f2f-0273-4c87-a718-886d7e6db983/luks
Running command: /bin/umount -v /tmp/tmpF5F8t2
stderr: umount: /tmp/tmpF5F8t2 (/dev/sda5) unmounted
Running command: /usr/sbin/cryptsetup --key-file - --allow-discards luksOpen 
/dev/sda1 93fb5f2f-0273-4c87-a718-886d7e6db983
Running command: /bin/mount -v /dev/mapper/93fb5f2f-0273-4c87-a718-886d7e6db983 
/tmp/tmpYK0WEV
stdout: mount: /dev/mapper/93fb5f2f-0273-4c87-a718-886d7e6db983 mounted on 
/tmp/tmpYK0WEV.
--> broken symlink found /tmp/tmpYK0WEV/block -> 
/dev/mapper/a05b447c-c901-4690-a249-cc1a2d62a110
Running command: /usr/sbin/cryptsetup status /tmp/tmpYK0WEV/block_dmcrypt
Running command: /usr/sbin/cryptsetup status 
/dev/mapper/93fb5f2f-0273-4c87-a718-886d7e6db983
Running command: /bin/umount -v /tmp/tmpYK0WEV
stderr: umount: /tmp/tmpYK0WEV 
(/dev/mapper/93fb5f2f-0273-4c87-a718-886d7e6db983) unmounted
Running command: /usr/sbin/cryptsetup remove 
/dev/mapper/93fb5f2f-0273-4c87-a718-886d7e6db983
--> OSD 0 got scanned and metadata persisted to file: 
/etc/ceph/osd/0-93fb5f2f-0273-4c87-a718-886d7e6db983.json
--> To take over management of this scanned OSD, and disable ceph-disk and 
udev, run:
--> ceph-volume simple activate 0 93fb5f2f-0273-4c87-a718-886d7e6db983
#
#
# ceph-volume simple activate 0 93fb5f2f-0273-4c87-a718-886d7e6db983
--> Required devices (block and data) not present for bluestore
--> bluestore devices found: [u'data']
-->  RuntimeError: Unable to activate bluestore OSD due to missing devices
#

Okay, this created /etc/ceph/osd/*.json.  This is cool.  Is there a command or 
option which will read these files and mount the devices?

peter




From: Alfredo Deza 
Date: Wednesday, July 24, 2019 at 2:20 PM
To: Peter Eisch 
Cc: Paul Emmerich , "ceph-users@lists.ceph.com" 

Subject: Re: [ceph-users] Upgrading and lost OSDs

On Wed, Jul 24, 2019 at 2:56 PM Peter Eisch <peter.ei...@virginpulse.com> wrote:
Hi Paul,

To better answer your question, I'm following: 
http://docs.ceph.com/docs/nautilus/releases/nautilus/

At step 6 (upgrade OSDs), I jumped on an OSD host, did a full 'yum update' to 
patch the host, and rebooted to pick up the current CentOS kernel.

If you are at Step 6 then it is *crucial* to understand that the tooling used 
to create the OSDs is no longer available and Step 7 *is absolutely required*.

ceph-volume has to scan the system and give you the output of all OSDs found so 
that it can persist them in /etc/ceph/osd/*.json files and then can later be
"activated".


I didn't run any commands specific to just updating the ceph RPMs in this 
process.

It is not clear if you are at Ste

Re: [ceph-users] Upgrading and lost OSDs

2019-07-24 Thread Peter Eisch
Hi Paul,

To better answer your question, I'm following: 
http://docs.ceph.com/docs/nautilus/releases/nautilus/

At step 6 (upgrade OSDs), I jumped on an OSD host, did a full 'yum update' to 
patch the host, and rebooted to pick up the current CentOS kernel.

I didn't run any commands specific to just updating the ceph RPMs in this 
process.

peter



From: Paul Emmerich 
Date: Wednesday, July 24, 2019 at 1:39 PM
To: Peter Eisch 
Cc: Xavier Trilla , "ceph-users@lists.ceph.com" 

Subject: Re: [ceph-users] Upgrading and lost OSDs

On Wed, Jul 24, 2019 at 8:36 PM Peter Eisch <peter.ei...@virginpulse.com> wrote:
# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 1.7T 0 disk
├─sda1 8:1 0 100M 0 part
├─sda2 8:2 0 1.7T 0 part
└─sda5 8:5 0 10M 0 part
sdb 8:16 0 1.7T 0 disk
├─sdb1 8:17 0 100M 0 part
├─sdb2 8:18 0 1.7T 0 part
└─sdb5 8:21 0 10M 0 part
sdc 8:32 0 1.7T 0 disk
├─sdc1 8:33 0 100M 0 part

That's ceph-disk which was removed, run "ceph-volume simple scan"
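Roughly, the takeover flow from the Nautilus notes (scan also accepts a single 
device, e.g. /dev/sda1, when the OSDs aren't running):

# ceph-volume simple scan
# ceph-volume simple activate --all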


--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
http://www.croit.io
Tel: +49 89 1896585 90

 
...
I'm thinking the OSDs would start (I can recreate the .service definitions in 
systemctl) if the above were mounted the way they are on another of my 
hosts:
# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 1.7T 0 disk
├─sda1 8:1 0 100M 0 part
│ └─97712be4-1234-4acc-8102-2265769053a5 253:17 0 98M 0 crypt 
/var/lib/ceph/osd/ceph-16
├─sda2 8:2 0 1.7T 0 part
│ └─049b7160-1234-4edd-a5dc-fe00faca8d89 253:16 0 1.7T 0 crypt
└─sda5 8:5 0 10M 0 part 
/var/lib/ceph/osd-lockbox/97712be4-9674-4acc-1234-2265769053a5
sdb 8:16 0 1.7T 0 disk
├─sdb1 8:17 0 100M 0 part
│ └─f03f0298-1234-42e9-8b28-f3016e44d1e2 253:26 0 98M 0 crypt 
/var/lib/ceph/osd/ceph-17
├─sdb2 8:18 0 1.7T 0 part
│ └─51177019-1234-4963-82d1-5006233f5ab2 253:30 0 1.7T 0 crypt
└─sdb5 8:21 0 10M 0 part 
/var/lib/ceph/osd-lockbox/f03f0298-1234-42e9-8b28-f3016e44d1e2
sdc 8:32 0 1.7T 0 disk
├─sdc1 8:33 0 100M 0 part
│ └─0184df0c-1234-404d-92de-cb71b1047abf 253:8 0 98M 0 crypt 
/var/lib/ceph/osd/ceph-18
├─sdc2 8:34 0 1.7T 0 part
│ └─fdad7618-1234-4021-a63e-40d973712e7b 253:13 0 1.7T 0 crypt
...

Thank you for your time on this,

peter

From: Xavier Trilla <xavier.tri...@clouding.io>
Date: Wednesday, July 24, 2019 at 1:25 PM
To: Peter Eisch <peter.ei...@virginpulse.com>
Cc: "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Upgrading and lost OSDs

Hi Peter,

I'm not sure, but maybe after some changes the OSDs are not being recognized by 
the ceph scripts.

Ceph used to use udev to detect the OSDs and then moved to LVM. Which kind of 
OSDs are you running, Bluestore or Filestore? Which version did you use to 
create them?

Cheers!

On 24 Jul 2019, at 20:04, Peter Eisch <peter.ei...@virginpulse.com> wrote:
Hi,

I’m working through updating from 12.2.12/luminous to 14.2.2/nautilus on 
CentOS 7.6. The managers are updated alright:

# ceph -s
  cluster:
    id:     2fdb5976-1234-4b29-ad9c-1ca74a9466ec
    health: HEALTH_WARN
            Degraded data redundancy: 24177/9555955 objects degraded (0.253%), 
7 pgs degraded, 1285 pgs undersized
            3 monitors have not enabled msgr2
 ...

I updated ceph on an OSD host with 'yum update'

Re: [ceph-users] Upgrading and lost OSDs

2019-07-24 Thread Peter Eisch
[2019-07-24 13:40:49,602][ceph_volume.process][INFO  ] Running command: 
/bin/systemctl show --no-pager --property=Id --state=running ceph-osd@*

This is the only log event.  At the prompt:

# ceph-volume simple scan
#

peter


From: Paul Emmerich 
Date: Wednesday, July 24, 2019 at 1:32 PM
To: Xavier Trilla 
Cc: Peter Eisch , "ceph-users@lists.ceph.com" 

Subject: Re: [ceph-users] Upgrading and lost OSDs

Did you use ceph-disk before?

Support for ceph-disk was removed, see Nautilus upgrade instructions. You'll 
need to run "ceph-volume simple scan" to convert them to ceph-volume

Paul

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
http://www.croit.io
Tel: +49 89 1896585 90


On Wed, Jul 24, 2019 at 8:25 PM Xavier Trilla <xavier.tri...@clouding.io> wrote:
Hi Peter,

I'm not sure, but maybe after some changes the OSDs are not being recognized by 
the ceph scripts.

Ceph used to use udev to detect the OSDs and then moved to LVM. Which kind of 
OSDs are you running, Bluestore or Filestore? Which version did you use to 
create them?

Cheers!

On 24 Jul 2019, at 20:04, Peter Eisch <peter.ei...@virginpulse.com> wrote:
Hi,

I’m working through updating from 12.2.12/luminous to 14.2.2/nautilus on 
CentOS 7.6. The managers are updated alright:

# ceph -s
  cluster:
    id:     2fdb5976-1234-4b29-ad9c-1ca74a9466ec
    health: HEALTH_WARN
            Degraded data redundancy: 24177/9555955 objects degraded (0.253%), 
7 pgs degraded, 1285 pgs undersized
            3 monitors have not enabled msgr2
 ...

I updated ceph on an OSD host with 'yum update' and then rebooted to grab the 
current kernel. Along the way, the contents of all the directories in 
/var/lib/ceph/osd/ceph-*/ were deleted. Thus I have 16 OSDs down from this. I 
can manage the undersized but I'd like to get these drives working again 
without deleting each OSD and recreating them.

So far I've pulled the respective cephx key into the 'keyring' file and 
populated 'bluestore' into the 'type' files but I'm unsure how to get the 
lockboxes mounted to where I can get the OSDs running. The osd-lockbox 
directory is otherwise untouched from when the OSDs were deployed.

Is there a way to run ceph-deploy or some other tool to rebuild the mounts for 
the drives?

peter



Re: [ceph-users] Upgrading and lost OSDs

2019-07-24 Thread Peter Eisch
Bluestore created with 12.2.10/luminous.

The OSD startup generates logs like:

2019-07-24 12:39:46.483 7f4b27649d80  0 set uid:gid to 167:167 (ceph:ceph)
2019-07-24 12:39:46.483 7f4b27649d80  0 ceph version 14.2.2 
(4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable), process ceph-osd, 
pid 48553
2019-07-24 12:39:46.483 7f4b27649d80  0 pidfile_write: ignore empty --pid-file
2019-07-24 12:39:46.483 7f4b27649d80  0 set uid:gid to 167:167 (ceph:ceph)
2019-07-24 12:39:46.483 7f4b27649d80  0 ceph version 14.2.2 
(4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable), process ceph-osd, 
pid 48553
2019-07-24 12:39:46.483 7f4b27649d80  0 pidfile_write: ignore empty --pid-file
2019-07-24 12:39:46.505 7f4b27649d80 -1 
bluestore(/var/lib/ceph/osd/ceph-0/block) _read_bdev_label failed to open 
/var/lib/ceph/osd/ceph-0/block: (2) No such file or directory
2019-07-24 12:39:46.505 7f4b27649d80 -1  ** ERROR: unable to open OSD 
superblock on /var/lib/ceph/osd/ceph-0: (2) No such file or directory
2019-07-24 12:39:46.505 7f4b27649d80 -1 
bluestore(/var/lib/ceph/osd/ceph-0/block) _read_bdev_label failed to open 
/var/lib/ceph/osd/ceph-0/block: (2) No such file or directory
2019-07-24 12:39:46.505 7f4b27649d80 -1 
bluestore(/var/lib/ceph/osd/ceph-0/block) _read_bdev_label failed to open 
/var/lib/ceph/osd/ceph-0/block: (2) No such file or directory
2019-07-24 12:39:46.505 7f4b27649d80 -1  ** ERROR: unable to open OSD 
superblock on /var/lib/ceph/osd/ceph-0: (2) No such file or directory
2019-07-24 12:39:46.505 7f4b27649d80 -1  ** ERROR: unable to open OSD 
superblock on /var/lib/ceph/osd/ceph-0: (2) No such file or directory
-
# lsblk
NAMEMAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda   8:00   1.7T  0 disk
├─sda18:10   100M  0 part
├─sda28:20   1.7T  0 part
└─sda58:5010M  0 part
sdb   8:16   0   1.7T  0 disk
├─sdb18:17   0   100M  0 part
├─sdb28:18   0   1.7T  0 part
└─sdb58:21   010M  0 part
sdc   8:32   0   1.7T  0 disk
├─sdc18:33   0   100M  0 part
...
I'm thinking the OSDs would start (I can recreate the .service definitions in 
systemctl) if the above were mounted the way they are on another of my 
hosts:
# lsblk
NAME MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda8:00   1.7T  0 disk
├─sda1 8:10   100M  0 part
│ └─97712be4-1234-4acc-8102-2265769053a5 253:17   098M  0 crypt 
/var/lib/ceph/osd/ceph-16
├─sda2 8:20   1.7T  0 part
│ └─049b7160-1234-4edd-a5dc-fe00faca8d89 253:16   0   1.7T  0 crypt
└─sda5 8:5010M  0 part  
/var/lib/ceph/osd-lockbox/97712be4-9674-4acc-1234-2265769053a5
sdb8:16   0   1.7T  0 disk
├─sdb1 8:17   0   100M  0 part
│ └─f03f0298-1234-42e9-8b28-f3016e44d1e2 253:26   098M  0 crypt 
/var/lib/ceph/osd/ceph-17
├─sdb2 8:18   0   1.7T  0 part
│ └─51177019-1234-4963-82d1-5006233f5ab2 253:30   0   1.7T  0 crypt
└─sdb5 8:21   010M  0 part  
/var/lib/ceph/osd-lockbox/f03f0298-1234-42e9-8b28-f3016e44d1e2
sdc8:32   0   1.7T  0 disk
├─sdc1 8:33   0   100M  0 part
│ └─0184df0c-1234-404d-92de-cb71b1047abf 253:8098M  0 crypt 
/var/lib/ceph/osd/ceph-18
├─sdc2 8:34   0   1.7T  0 part
│ └─fdad7618-1234-4021-a63e-40d973712e7b 253:13   0   1.7T  0 crypt
...

Thank you for your time on this,

peter



From: Xavier Trilla 
Date: Wednesday, July 24, 2019 at 1:25 PM
To: Peter Eisch 
Cc: "ceph-users@lists.ceph.com" 
Subject: Re: [ceph-users] Upgrading and lost OSDs

Hi Peter,

I'm not sure 

[ceph-users] Upgrading and lost OSDs

2019-07-24 Thread Peter Eisch
Hi,

I’m working through updating from 12.2.12/luminous to 14.2.2/nautilus on 
CentOS 7.6.  The managers are updated alright:

# ceph -s
  cluster:
    id:     2fdb5976-1234-4b29-ad9c-1ca74a9466ec
    health: HEALTH_WARN
            Degraded data redundancy: 24177/9555955 objects degraded (0.253%), 
7 pgs degraded, 1285 pgs undersized
            3 monitors have not enabled msgr2
 ...

I updated ceph on an OSD host with 'yum update' and then rebooted to grab the 
current kernel.  Along the way, the contents of all the directories in 
/var/lib/ceph/osd/ceph-*/ were deleted.  Thus I have 16 OSDs down from this.  I 
can manage the undersized but I'd like to get these drives working again 
without deleting each OSD and recreating them.

So far I've pulled the respective cephx key into the 'keyring' file and 
populated 'bluestore' into the 'type' files but I'm unsure how to get the 
lockboxes mounted to where I can get the OSDs running.  The osd-lockbox 
directory is otherwise untouched from when the OSDs were deployed.

Is there a way to run ceph-deploy or some other tool to rebuild the mounts for 
the drives?

peter




Re: [ceph-users] Multisite RGW - endpoints configuration

2019-07-17 Thread Peter Eisch
Hi,

I have also been looking for solutions to improve sync.  I have two clusters, 25 
ms RTT apart, with RGW multi-site configured and all nodes running 12.2.12.  I 
have three rgw nodes at each site, with the nodes behind haproxy at each site.  There 
is a 1G circuit between the sites and bandwidth usage averages 370Mb/s.  I can 
PUT [with swift] to the remote site at wire speed.

Logs on the receiving site show plenty of:
heartbeat_map is_healthy 'RGWAsyncRadosProcessor::m_tp thread 0x7f16e022d700' 
had timed out after 600

...but it all works, albeit slowly.  What should be my next move in researching a 
resolution for this?
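For reference, the read-only side of the sync diagnostics looks roughly like this 
(zone names elided; just a sketch of where I'd look first):

# radosgw-admin sync status
# radosgw-admin data sync status --source-zone=<zone>
# radosgw-admin sync error list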

peter


On 7/17/19, 8:44 AM, "ceph-users on behalf of Casey Bodley" 
 wrote:

On 7/17/19 8:04 AM, P. O. wrote:
> Hi,
> Is there any mechanism inside the rgw that can detect faulty endpoints
> for a configuration with multiple endpoints?

No, replication requests that fail just get retried using round robin
until they succeed. If an endpoint isn't available, we assume it will
come back eventually and keep trying.


> Is there any advantage related with the number of replication
> endpoints? Can I expect improved replication performance (the more
> synchronization rgws = the faster replication)?

These endpoints act as the server side of replication, and handle GET
requests from other zones to read replication logs and fetch objects. As
long as the number of gateways on the client side of replication (ie.
gateways on other zones that have rgw_run_sync_thread enabled, which is
on by default) scale along with these replication endpoints, you can
expect a modest improvement in replication, though it's limited by the
available bandwidth between sites. Spreading replication endpoints over
several gateways also helps to limit the impact of replication on the
local client workloads.
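
For reference, pointing a zone at a different set of replication endpoints is
roughly the following (a sketch; the zone name and URLs are placeholders):

  radosgw-admin zone modify --rgw-zone=us-east --endpoints=http://rgw1:8080,http://rgw2:8080
  radosgw-admin period update --commit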


>
>
> On Wednesday, 17 July 2019, P. O. <mailto:pos...@gmail.com> wrote:
>
> Hi,
>
> Is there any mechanism inside the rgw that can detect faulty
> endpoints for a configuration with multiple endpoints? Is there
> any advantage related with the number of replication endpoints?
> Can I expect improved replication performance (the more 
synchronization rgws = the faster replication)?
>
>
> On Tuesday, 16 July 2019, Casey Bodley <mailto:cbod...@redhat.com> wrote:
>
> We used to have issues when a load balancer was in front of
> the sync endpoints, because our http client didn't time out
> stalled connections. Those are resolved in luminous, but we
> still recommend using the radosgw addresses directly to avoid
> shoveling data through an extra proxy. Internally, sync is
> already doing a round robin over that list of endpoints. On
> the other hand, load balancers give you some extra
> flexibility, like adding/removing gateways without having to
> update the global multisite configuration.
>
> On 7/16/19 2:52 PM, P. O. wrote:
>
> Hi all,
>
> I have multisite RGW setup with one zonegroup and two
> zones. Each zone has one endpoint configured like below:
>
> "zonegroups": [
> {
>  ...
>  "is_master": "true",
>  "endpoints": ["http://192.168.100.1:80", ...

[ceph-users] RGW Multisite Q's

2019-06-12 Thread Peter Eisch
Hi,

Could someone be able to point me to a blog or documentation page which helps 
me resolve the issues noted below?

All nodes are Luminous, 12.2.12; one realm, one zonegroup (clustered haproxies 
fronting), two zones (three rgw in each); all endpoint references to each zone 
go through an haproxy.

Hoping to replace a swift config with RGW has been an interesting exercise.  Crafting 
a functional configuration from blog posts and documentation takes time.  It 
was crucial to find and use 
http://docs.ceph.com/docs/luminous/radosgw/multisite/ instead of 
http://docs.ceph.com/docs/master/radosgw/config-ref/ except parts suggest 
incorrect configurations.  I've submitted corrections to the former in #28517, 
for what it's worth.

Through this I'm now finding fewer resources to help explain the abundance of 
404's in the gateway logs:

  "GET 
/admin/log/?type=data&id=8&marker&extra-info=true&rgwx-zonegroup= 
HTTP/1.1" 404 0 - -
  "GET 
/admin/log/?type=data&id=8&marker&extra-info=true&rgwx-zonegroup= 
HTTP/1.1" 404 0 - -
  "GET 
/admin/log/?type=data&id=8&marker&extra-info=true&rgwx-zonegroup= 
HTTP/1.1" 404 0 - -
  "GET 
/admin/log/?type=data&id=8&marker&extra-info=true&rgwx-zonegroup= 
HTTP/1.1" 404 0 - -

These number in the hundreds of thousands.  The site seems to work in the 
minimal testing so far.  The 404s also seem to be limited to the data queries, 
while the metadata queries mostly succeed with 200s.

  "GET 
/admin/log?type=metadata&id=55&period=58b43d07-03e2-48e4-b2dc-74d64ef7f0c9&max-entries=100&&rgwx-zonegroup=
 HTTP/1.1" 200 0 - -
   "GET 
/admin/log?type=metadata&id=45&period=58b43d07-03e2-48e4-b2dc-74d64ef7f0c9&max-entries=100&&rgwx-zonegroup==
 HTTP/1.1" 200 0 - -
  "GET 
/admin/log?type=metadata&id=4&period=58b43d07-03e2-48e4-b2dc-74d64ef7f0c9&max-entries=100&&rgwx-zonegroup==
 HTTP/1.1" 200 0 - -
   "GET 
/admin/log?type=metadata&id=35&period=58b43d07-03e2-48e4-b2dc-74d64ef7f0c9&max-entries=100&&rgwx-zonegroup==
 HTTP/1.1" 200 0 - -

Q: How do I address the 404 events to help them succeed?


Other log events which I cannot resolve are the tens of thousands (even while 
no reads or writes are requested) of:

  ... meta sync: ERROR: RGWBackoffControlCR called coroutine returned -2
  ... meta sync: ERROR: RGWBackoffControlCR called coroutine returned -2
  ... data sync: ERROR: failed to read remote data log info: ret=-2
  ... data sync: ERROR: failed to read remote data log info: ret=-2
  ... meta sync: ERROR: RGWBackoffControlCR called coroutine returned -2
  ... meta sync: ERROR: RGWBackoffControlCR called coroutine returned -2
  ... data sync: ERROR: failed to read remote data log info: ret=-2
  ... data sync: ERROR: failed to read remote data log info: ret=-2
  ... data sync: ERROR: failed to read remote data log info: ret=-2
  ... meta sync: ERROR: RGWBackoffControlCR called coroutine returned -2
  ... etc.
These seem to fire off every 30 seconds but don't seem to be governed by the "rgw 
usage log tick interval" or "rgw init timeout" values.  Meanwhile the usage 
between the two zones matches for each bucket.

Q:  What are these log events indicating?
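
For context, the sync-state commands I have to dig further with are roughly
(a sketch; the source-zone name is a placeholder):

  radosgw-admin sync status
  radosgw-admin data sync status --source-zone=us-west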

Thanks,

peter



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Global Data Deduplication

2019-05-30 Thread Peter Wienemann
Hi Felix,

there is a seven year old open issue asking for this feature [0].

An alternative option would be using Benji [1].

Peter

[0] https://tracker.ceph.com/issues/1576
[1] https://benji-backup.me

On 29.05.19 10:25, Felix Hüttner wrote:
> Hi everyone,
> 
> We are currently using Ceph as the backend for our OpenStack
> blockstorage. For backup of these disks we thought about also using ceph
> (just with hdd’s instead of ssd’s). As we will have some volumes that
> will be backuped daily and that will probably not change too often I
> searched for any possible deduplication methods for ceph.
>  
> There I noticed this paper regarding “Global Data Deduplication”
> (https://ceph.com/wp-content/uploads/2018/07/ICDCS_2018_mwoh.pdf). It
> says “We implemented the proposed design upon open source distributed
> storage system, Ceph”.
> 
> Unfortunately I was not able to find any documentation for this
> anywhere. The only thing that seems related is the cephdeduptool.
> 
> Is there some something that I just missed? Or is it implicitly done in
> the background and I don’t need to care about it?
> 
> Thanks for your help
> 
> Felix
> 
> Hinweise zum Datenschutz finden Sie hier <https://www.datenschutz.schwarz>.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cephfs free space vs ceph df free space disparity

2019-05-28 Thread Peter Wienemann
On 27.05.19 09:08, Stefan Kooman wrote:
> Quoting Robert Ruge (robert.r...@deakin.edu.au):
>> Ceph newbie question.
>>
>> I have a disparity between the free space that my cephfs file system
>> is showing and what ceph df is showing.  As you can see below my
>> cephfs file system says there is 9.5TB free however ceph df says there
>> is 186TB which with replication size 3 should equate to 62TB free
>> space.  I guess the basic question is how can I get cephfs to see and
>> use all of the available space?  I recently changed my number of pg's
>> on the cephfs_data pool from 2048 to 4096 and this gave me another 8TB
>> so do I keep increasing the number of pg's or is there something else
>> that I am missing? I have only been running ceph for ~6 months so I'm
>> relatively new to it all and not being able to use all of the space is
>> just plain bugging me.
> 
> My guess here is you have a lot of small files in your cephfs, is that
> right? Do you have HDD or SDD/NVMe?
> 
> Mohamad Gebai gave a talk about this at Cephalocon 2019:
> https://static.sched.com/hosted_files/cephalocon2019/d2/cephalocon-2019-mohamad-gebai.pdf
> for the slides and the recording:
> https://www.youtube.com/watch?v=26FbUEbiUrw&list=PLrBUGiINAakNCnQUosh63LpHbf84vegNu&index=29&t=0s
> 
> TL;DR: there is a bluestore_min_alloc_size_ssd which is 16K default for
> SSD and 64K default for HDD. With lots of small objects this might add
> up to *a lot* of overhead. You can change that to 4k:
> 
> bluestore min alloc size ssd = 4096
> bluestore min alloc size hdd = 4096
> 
> You will have to rebuild _all_ of your OSDs though.
> 
> Here is another thread about this:
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-February/thread.html#24801
> 
> Gr. Stefan

Hi Robert,

some more questions: Are all your OSDs of equal size? If yes, have you
enabled balancing for your cluster (see [0])?
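
Enabling the balancer is roughly the following (a sketch; upmap mode assumes
all clients are at least luminous):

  ceph mgr module enable balancer
  ceph balancer mode upmap
  ceph balancer on
  ceph balancer status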

You might also be interested in this thread [1].

Peter

[0] http://docs.ceph.com/docs/master/rados/operations/balancer
[1]
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-October/030765.html
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [events] Ceph Day CERN September 17 - CFP now open!

2019-05-27 Thread Peter Wienemann
Hi Mike,

there is a date incompatibility between your announcement and Dan's
initial announcement [0]. Which date is correct: September 16 or
September 17?

Peter

[0]
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-April/034259.html

On 27.05.19 11:22, Mike Perez wrote:
> Hey everyone,
> 
> Ceph CERN Day will be a full-day event dedicated to fostering Ceph's
> research and non-profit user communities. The event is hosted by the
> Ceph team from the CERN IT department.
> 
> We invite this community to meet and discuss the status of the Ceph
> project, recent improvements, and roadmap, and to share practical
> experiences operating Ceph for their novel use-cases.
> 
> We also invite potential speakers to submit an abstract on any of the
> following topics:
> 
> * Ceph use-cases for scientific and research applications
> * Ceph deployments in academic or non-profit organizations
> * Applications of CephFS or Object Storage for HPC
> * Operational highlights, tools, procedures, or other tips you want to
> share with the community
> 
> The day will end with a cocktail reception.
> 
> Visitors may be interested in combining their visit to CERN with the
> CERN Open Days being held September 14-15.
> 
> All event information for CFP, registration, accommodations can be
> found on the CERN website:
> 
> https://indico.cern.ch/event/765214/
> 
> And thank you to Dan van der Ster for reaching out to organizer this event!
> 
> --
> Mike Perez (thingee)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pool migration for cephfs?

2019-05-15 Thread Peter Woodman
I actually made a dumb python script to do this. It's ugly and has a
lot of hardcoded things in it (like the mount location where i'm
copying things to in order to move pools, names of pools, the savings i was
expecting, etc) but should be easy to adapt to what you're trying to
do

https://gist.github.com/pjjw/b5fbee24c848661137d6ac09a3e0c980
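
the gist aside, the core of it is just the layout-xattr-and-copy dance,
something like this (sketch only; pool and path names are made up):

  ceph fs add_data_pool cephfs cephfs_data_new
  mkdir /mnt/cephfs/newdir
  setfattr -n ceph.dir.layout.pool -v cephfs_data_new /mnt/cephfs/newdir
  cp -a /mnt/cephfs/olddir/. /mnt/cephfs/newdir/   # rewriting the files lands the data in the new pool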

On Wed, May 15, 2019 at 1:45 PM Patrick Donnelly  wrote:
>
> On Wed, May 15, 2019 at 5:05 AM Lars Täuber  wrote:
> > is there a way to migrate a cephfs to a new data pool like it is for rbd on 
> > nautilus?
> > https://ceph.com/geen-categorie/ceph-pool-migration/
>
> No, this isn't possible.
>
> --
> Patrick Donnelly, Ph.D.
> He / Him / His
> Senior Software Engineer
> Red Hat Sunnyvale, CA
> GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to just delete PGs stuck incomplete on EC pool

2019-03-05 Thread Peter Woodman
Last time I had to do this, I used the command outlined here:
https://tracker.ceph.com/issues/10098
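
roughly along these lines (a sketch from memory; the osd id and pgid are
illustrative -- stop the osd and take a backup first):

  systemctl stop ceph-osd@32
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-32 --pgid 18.c --op mark-complete
  systemctl start ceph-osd@32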

On Mon, Mar 4, 2019 at 11:05 AM Daniel K  wrote:
>
> Thanks for the suggestions.
>
> I've tried both -- setting osd_find_best_info_ignore_history_les = true and 
> restarting all OSDs,  as well as 'ceph osd-force-create-pg' -- but both still 
> show incomplete
>
> PG_AVAILABILITY Reduced data availability: 2 pgs inactive, 2 pgs incomplete
> pg 18.c is incomplete, acting [32,48,58,40,13,44,61,59,30,27,43,37] 
> (reducing pool ec84-hdd-zm min_size from 8 may help; search ceph.com/docs for 
> 'incomplete')
> pg 18.1e is incomplete, acting [50,49,41,58,60,46,52,37,34,63,57,16] 
> (reducing pool ec84-hdd-zm min_size from 8 may help; search ceph.com/docs for 
> 'incomplete')
>
>
> The OSDs in down_osds_we_would_probe have already been marked lost
>
> When I ran  the force-create-pg command, they went to peering for a few 
> seconds, but then went back incomplete.
>
> Updated ceph pg 18.1e query https://pastebin.com/XgZHvJXu
> Updated ceph pg 18.c query https://pastebin.com/N7xdQnhX
>
> Any other suggestions?
>
>
>
> Thanks again,
>
> Daniel
>
>
>
> On Sat, Mar 2, 2019 at 3:44 PM Paul Emmerich  wrote:
>>
>> On Sat, Mar 2, 2019 at 5:49 PM Alexandre Marangone
>>  wrote:
>> >
>> > If you have no way to recover the drives, you can try to reboot the OSDs 
>> > with `osd_find_best_info_ignore_history_les = true` (revert it 
>> > afterwards), you'll lose data. If after this, the PGs are down, you can 
>> > mark the OSDs blocking the PGs from become active lost.
>>
>> this should work for PG 18.1e, but not for 18.c. Try running "ceph osd
>> force-create-pg " to reset the PGs instead.
>> Data will obviously be lost afterwards.
>>
>> Paul
>>
>> >
>> > On Sat, Mar 2, 2019 at 6:08 AM Daniel K  wrote:
>> >>
>> >> They all just started having read errors. Bus resets. Slow reads. Which 
>> >> is one of the reasons the cluster didn't recover fast enough to 
>> >> compensate.
>> >>
>> >> I tried to be mindful of the drive type and specifically avoided the 
>> >> larger capacity Seagates that are SMR. Used 1 SM863 for every 6 drives 
>> >> for the WAL.
>> >>
>> >> Not sure why they failed. The data isn't critical at this point, just 
>> >> need to get the cluster back to normal.
>> >>
>> >> On Sat, Mar 2, 2019, 9:00 AM  wrote:
>> >>>
>> >>> Did they break, or did something went wronng trying to replace them?
>> >>>
>> >>> Jespe
>> >>>
>> >>>
>> >>>
>> >>> Sent from myMail for iOS
>> >>>
>> >>>
>> >>> Saturday, 2 March 2019, 14.34 +0100 from Daniel K :
>> >>>
>> >>> I bought the wrong drives trying to be cheap. They were 2TB WD Blue 
>> >>> 5400rpm 2.5 inch laptop drives.
>> >>>
>> >>> They've been replace now with HGST 10K 1.8TB SAS drives.
>> >>>
>> >>>
>> >>>
>> >>> On Sat, Mar 2, 2019, 12:04 AM  wrote:
>> >>>
>> >>>
>> >>>
>> >>> Saturday, 2 March 2019, 04.20 +0100 from satha...@gmail.com 
>> >>> :
>> >>>
>> >>> 56 OSD, 6-node 12.2.5 cluster on Proxmox
>> >>>
>> >>> We had multiple drives fail(about 30%) within a few days of each other, 
>> >>> likely faster than the cluster could recover.
>> >>>
>> >>>
>> >>> How did so many drives break?
>> >>>
>> >>> Jesper
>> >>
>> >> ___
>> >> ceph-users mailing list
>> >> ceph-users@lists.ceph.com
>> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSDs crashing in EC pool (whack-a-mole)

2019-01-18 Thread Peter Woodman
At the risk of hijacking this thread, like I said I've ran into this
problem again, and have captured a log with debug_osd=20, viewable at
https://www.dropbox.com/s/8zoos5hhvakcpc4/ceph-osd.3.log?dl=0 - any
pointers?
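
for anyone wanting to capture the same thing, this is roughly the recipe
(osd id is whichever one is misbehaving; values are just what i used):

  # either set it persistently in ceph.conf on the osd host:
  #   [osd.3]
  #   debug osd = 20
  #   debug ms = 1
  # and restart the osd, or bump it at runtime if the daemon stays up:
  ceph tell osd.3 injectargs '--debug_osd 20 --debug_ms 1'
  # the log ends up in /var/log/ceph/ceph-osd.3.log; revert afterwards:
  ceph tell osd.3 injectargs '--debug_osd 1/5 --debug_ms 0/5'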

On Tue, Jan 8, 2019 at 11:31 AM Peter Woodman  wrote:
>
> For the record, in the linked issue, it was thought that this might be
> due to write caching. This seems not to be the case, as it happened
> again to me with write caching disabled.
>
> On Tue, Jan 8, 2019 at 11:15 AM Sage Weil  wrote:
> >
> > I've seen this on luminous, but not on mimic.  Can you generate a log with
> > debug osd = 20 leading up to the crash?
> >
> > Thanks!
> > sage
> >
> >
> > On Tue, 8 Jan 2019, Paul Emmerich wrote:
> >
> > > I've seen this before a few times but unfortunately there doesn't seem
> > > to be a good solution at the moment :(
> > >
> > > See also: http://tracker.ceph.com/issues/23145
> > >
> > > Paul
> > >
> > > --
> > > Paul Emmerich
> > >
> > > Looking for help with your Ceph cluster? Contact us at https://croit.io
> > >
> > > croit GmbH
> > > Freseniusstr. 31h
> > > 81247 München
> > > www.croit.io
> > > Tel: +49 89 1896585 90
> > >
> > > On Tue, Jan 8, 2019 at 9:37 AM David Young  
> > > wrote:
> > > >
> > > > Hi all,
> > > >
> > > > One of my OSD hosts recently ran into RAM contention (was swapping 
> > > > heavily), and after rebooting, I'm seeing this error on random OSDs in 
> > > > the cluster:
> > > >
> > > > ---
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  ceph version 13.2.4 
> > > > (b10be4d44915a4d78a8e06aa31919e74927b142e) mimic (stable)
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  1: /usr/bin/ceph-osd() 
> > > > [0xcac700]
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  2: (()+0x11390) 
> > > > [0x7f8fa5d0e390]
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  3: (gsignal()+0x38) 
> > > > [0x7f8fa5241428]
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  4: (abort()+0x16a) 
> > > > [0x7f8fa524302a]
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  5: 
> > > > (ceph::__ceph_assert_fail(char const*, char const*, int, char 
> > > > const*)+0x250) [0x7f8fa767c510]
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  6: (()+0x2e5587) 
> > > > [0x7f8fa767c587]
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  7: 
> > > > (BlueStore::_txc_add_transaction(BlueStore::TransContext*, 
> > > > ObjectStore::Transaction*)+0x923) [0xbab5e3]
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  8: 
> > > > (BlueStore::queue_transactions(boost::intrusive_ptr&,
> > > >  std::vector > > > std::allocator >&, 
> > > > boost::intrusive_ptr, ThreadPool::TPHandle*)+0x5c3) 
> > > > [0xbade03]
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  9: 
> > > > (ObjectStore::queue_transaction(boost::intrusive_ptr&,
> > > >  ObjectStore::Transaction&&, boost::intrusive_ptr, 
> > > > ThreadPool::TPHandle*)+0x82) [0x79c812]
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  10: 
> > > > (OSD::dispatch_context_transaction(PG::RecoveryCtx&, PG*, 
> > > > ThreadPool::TPHandle*)+0x58) [0x730ff8]
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  11: 
> > > > (OSD::dequeue_peering_evt(OSDShard*, PG*, 
> > > > std::shared_ptr, ThreadPool::TPHandle&)+0xfe) [0x759aae]
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  12: (PGPeeringItem::run(OSD*, 
> > > > OSDShard*, boost::intrusive_ptr&, ThreadPool::TPHandle&)+0x50) 
> > > > [0x9c5720]
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  13: 
> > > > (OSD::ShardedOpWQ::_process(unsigned int, 
> > > > ceph::heartbeat_handle_d*)+0x590) [0x769760]
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  14: 
> > > > (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x476) 
> > > > [0x7f8fa76824f6]
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  15: 
> > > > (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7f8fa76836b0]
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  16: (()+0x76ba) 
> > > > [0x7f8fa5d046ba]
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  17: (clone()+0x6d) 
> > >

Re: [ceph-users] OSDs crashing in EC pool (whack-a-mole)

2019-01-08 Thread Peter Woodman
For the record, in the linked issue, it was thought that this might be
due to write caching. This seems not to be the case, as it happened
again to me with write caching disabled.

On Tue, Jan 8, 2019 at 11:15 AM Sage Weil  wrote:
>
> I've seen this on luminous, but not on mimic.  Can you generate a log with
> debug osd = 20 leading up to the crash?
>
> Thanks!
> sage
>
>
> On Tue, 8 Jan 2019, Paul Emmerich wrote:
>
> > I've seen this before a few times but unfortunately there doesn't seem
> > to be a good solution at the moment :(
> >
> > See also: http://tracker.ceph.com/issues/23145
> >
> > Paul
> >
> > --
> > Paul Emmerich
> >
> > Looking for help with your Ceph cluster? Contact us at https://croit.io
> >
> > croit GmbH
> > Freseniusstr. 31h
> > 81247 München
> > www.croit.io
> > Tel: +49 89 1896585 90
> >
> > On Tue, Jan 8, 2019 at 9:37 AM David Young  
> > wrote:
> > >
> > > Hi all,
> > >
> > > One of my OSD hosts recently ran into RAM contention (was swapping 
> > > heavily), and after rebooting, I'm seeing this error on random OSDs in 
> > > the cluster:
> > >
> > > ---
> > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  ceph version 13.2.4 
> > > (b10be4d44915a4d78a8e06aa31919e74927b142e) mimic (stable)
> > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  1: /usr/bin/ceph-osd() 
> > > [0xcac700]
> > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  2: (()+0x11390) [0x7f8fa5d0e390]
> > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  3: (gsignal()+0x38) 
> > > [0x7f8fa5241428]
> > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  4: (abort()+0x16a) 
> > > [0x7f8fa524302a]
> > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  5: 
> > > (ceph::__ceph_assert_fail(char const*, char const*, int, char 
> > > const*)+0x250) [0x7f8fa767c510]
> > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  6: (()+0x2e5587) 
> > > [0x7f8fa767c587]
> > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  7: 
> > > (BlueStore::_txc_add_transaction(BlueStore::TransContext*, 
> > > ObjectStore::Transaction*)+0x923) [0xbab5e3]
> > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  8: 
> > > (BlueStore::queue_transactions(boost::intrusive_ptr&,
> > >  std::vector > > std::allocator >&, 
> > > boost::intrusive_ptr, ThreadPool::TPHandle*)+0x5c3) [0xbade03]
> > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  9: 
> > > (ObjectStore::queue_transaction(boost::intrusive_ptr&,
> > >  ObjectStore::Transaction&&, boost::intrusive_ptr, 
> > > ThreadPool::TPHandle*)+0x82) [0x79c812]
> > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  10: 
> > > (OSD::dispatch_context_transaction(PG::RecoveryCtx&, PG*, 
> > > ThreadPool::TPHandle*)+0x58) [0x730ff8]
> > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  11: 
> > > (OSD::dequeue_peering_evt(OSDShard*, PG*, 
> > > std::shared_ptr, ThreadPool::TPHandle&)+0xfe) [0x759aae]
> > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  12: (PGPeeringItem::run(OSD*, 
> > > OSDShard*, boost::intrusive_ptr&, ThreadPool::TPHandle&)+0x50) 
> > > [0x9c5720]
> > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  13: 
> > > (OSD::ShardedOpWQ::_process(unsigned int, 
> > > ceph::heartbeat_handle_d*)+0x590) [0x769760]
> > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  14: 
> > > (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x476) 
> > > [0x7f8fa76824f6]
> > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  15: 
> > > (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7f8fa76836b0]
> > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  16: (()+0x76ba) [0x7f8fa5d046ba]
> > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  17: (clone()+0x6d) 
> > > [0x7f8fa531341d]
> > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  NOTE: a copy of the executable, 
> > > or `objdump -rdS ` is needed to interpret this.
> > > Jan 08 03:34:36 prod1 systemd[1]: ceph-osd@43.service: Main process 
> > > exited, code=killed, status=6/ABRT
> > > ---
> > >
> > > I've restarted all the OSDs and the mons, but still encountering the 
> > > above.
> > >
> > > Any ideas / suggestions?
> > >
> > > Thanks!
> > > D
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mimic 13.2.3?

2019-01-04 Thread Peter Woodman
not to mention that the current released version of mimic (.2) has a
bug that is potentially catastrophic to cephfs, known about for
months, yet it's not in the release notes. would have upgraded and
destroyed data had i not caught a thread on this list.

hopefully crowing like this isn't coming off as too obnoxious, but
yeah, the release process seems quite brittle at the moment.

On Fri, Jan 4, 2019 at 1:25 PM Brady Deetz  wrote:
>
> I agree with the comments above. I don't feel comfortable upgrading because I 
> never know what's been deemed stable. We used to get an announcement at the 
> same times that the packages hit the repo. What's going on? Frankly, the 
> entire release cycle of Mimic has seemed very haphazard.
>
> On Fri, Jan 4, 2019 at 10:22 AM Daniel Baumann  wrote:
>>
>> On 01/04/2019 05:07 PM, Matthew Vernon wrote:
>> > how is it still the case that packages are being pushed onto the official 
>> > ceph.com repos that people
>> > shouldn't install?
>>
>> We're still on 12.2.5 because of this. Basically every 12.2.x after that
>> had notes on the mailinglist like "don't use, wait for ..."
>>
>> I don't dare updating to 13.2.
>>
>> For the 10.2.x and 11.2.x cycles, we upgraded our production cluster
>> within a matter of days after the release of an update. Since the second
>> half of the 12.2.x releases, this seems to be not possible anymore.
>>
>> Ceph is great and all, but this decrease of release quality seriously
>> harms the image and perception of Ceph as a stable software platform in
>> the enterprise environment and makes people do the wrong things (rotting
>> systems update-wise, for the sake of stability).
>>
>> Regards,
>> Daniel
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs speed

2018-08-31 Thread Peter Eisch
[replying to myself]

I set aside cephfs and created an rbd volume.  I get the same splotchy 
throughput with rbd as I was getting with cephfs.   (image attached)

So, withdrawing this as a question here as a cephfs issue.

#backingout

peter



Peter Eisch
virginpulse.com
|globalchallenge.virginpulse.com
Australia | Bosnia and Herzegovina | Brazil | Canada | Singapore | Switzerland 
| United Kingdom | USA

On 8/30/18, 12:25 PM, "Peter Eisch"  wrote:

Thanks for the thought.  It’s mounted with this entry in fstab (one line, 
if email wraps it):

cephmon-s01,cephmon-s02,cephmon-s03:/    /loam    ceph    noauto,name=clientname,secretfile=/etc/ceph/secret,noatime,_netdev    0    2

Pretty plain, but I'm open to tweaking!

peter

From: Gregory Farnum 
Date: Thursday, August 30, 2018 at 11:47 AM
To: Peter Eisch 
Cc: "ceph-users@lists.ceph.com" 
Subject: Re: [ceph-users] cephfs speed

How are you mounting CephFS? It may be that the cache settings are just set 
very badly for a 10G pipe. Plus rados bench is a very parallel large-IO 
benchmark and many benchmarks you might dump into a filesystem are definitely 
not.
-Greg

On Thu, Aug 30, 2018 at 7:54 AM Peter Eisch 
<mailto:peter.ei...@virginpulse.com> wrote:
Hi,

I have a cluster serving cephfs and it works. It’s just slow. Client is 
using the kernel driver. I can ‘rados bench’ writes to the cephfs_data pool at 
wire speeds (9580Mb/s on a 10G link) but when I copy data into cephfs it is 
rare to get above 100Mb/s. Large file writes may start fast (2Gb/s) but within 
a minute slows. In the dashboard at the OSDs I get lots of triangles (it 
doesn't stream) which seems to be lots of starts and stops. By contrast the 
graphs show constant flow when using 'rados bench.'

I feel like I'm missing something obvious. What can I do to help diagnose 
this better or resolve the issue?

Errata:
Version: 12.2.7 (on everything)
mon: 3 daemons, quorum cephmon-s01,cephmon-s03,cephmon-s02
mgr: cephmon-s02(active), standbys: cephmon-s01, cephmon-s03
mds: cephfs1-1/1/1 up {0=cephmon-s02=up:active}, 2 up:standby
osd: 70 osds: 70 up, 70 in
rgw: 3 daemons active

rados bench summary:
Total time run: 600.043733
Total writes made: 167725
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 1118.09
Stddev Bandwidth: 7.23868
Max bandwidth (MB/sec): 1140
Min bandwidth (MB/sec): 1084
Average IOPS: 279
Stddev IOPS: 1
Max IOPS: 285
Min IOPS: 271
Average Latency(s): 0.057239
Stddev Latency(s): 0.0354817
Max latency(s): 0.367037
    Min latency(s): 0.0120791

peter


Peter Eisch
https://www.virginpulse.com/ | https://globalchallenge.virginpulse.com/
Australia | Bosnia and Herzegovina | Brazil | Canada | Singapore | Switzerland | United Kingdom | USA


___
ceph-users mailing list
mailto:ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs speed

2018-08-30 Thread Peter Eisch
Thanks for the thought.  It’s mounted with this entry in fstab (one line, if 
email wraps it):

cephmon-s01,cephmon-s02,cephmon-s03:/     /loam    ceph    
noauto,name=clientname,secretfile=/etc/ceph/secret,noatime,_netdev    0       2

Pretty plain, but I'm open to tweaking!
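
If it helps, this is the kind of variation I can experiment with (illustrative
only -- the values are guesses, and rasize/wsize depend on the client kernel
supporting those mount options):

cephmon-s01,cephmon-s02,cephmon-s03:/    /loam    ceph    noauto,name=clientname,secretfile=/etc/ceph/secret,noatime,rasize=67108864,wsize=16777216,_netdev    0    2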

peter


Peter Eisch
virginpulse.com
|globalchallenge.virginpulse.com
Australia | Bosnia and Herzegovina | Brazil | Canada | Singapore | Switzerland 
| United Kingdom | USA
From: Gregory Farnum 
Date: Thursday, August 30, 2018 at 11:47 AM
To: Peter Eisch 
Cc: "ceph-users@lists.ceph.com" 
Subject: Re: [ceph-users] cephfs speed

How are you mounting CephFS? It may be that the cache settings are just set 
very badly for a 10G pipe. Plus rados bench is a very parallel large-IO 
benchmark and many benchmarks you might dump into a filesystem are definitely 
not.
-Greg

On Thu, Aug 30, 2018 at 7:54 AM Peter Eisch 
<mailto:peter.ei...@virginpulse.com> wrote:
Hi,

I have a cluster serving cephfs and it works. It’s just slow. Client is using 
the kernel driver. I can ‘rados bench’ writes to the cephfs_data pool at wire 
speeds (9580Mb/s on a 10G link) but when I copy data into cephfs it is rare to 
get above 100Mb/s. Large file writes may start fast (2Gb/s) but within a minute 
slows. In the dashboard at the OSDs I get lots of triangles (it doesn't stream) 
which seems to be lots of starts and stops. By contrast the graphs show 
constant flow when using 'rados bench.'

I feel like I'm missing something obvious. What can I do to help diagnose this 
better or resolve the issue?

Errata:
Version: 12.2.7 (on everything)
mon: 3 daemons, quorum cephmon-s01,cephmon-s03,cephmon-s02
mgr: cephmon-s02(active), standbys: cephmon-s01, cephmon-s03
mds: cephfs1-1/1/1 up {0=cephmon-s02=up:active}, 2 up:standby
osd: 70 osds: 70 up, 70 in
rgw: 3 daemons active

rados bench summary:
Total time run: 600.043733
Total writes made: 167725
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 1118.09
Stddev Bandwidth: 7.23868
Max bandwidth (MB/sec): 1140
Min bandwidth (MB/sec): 1084
Average IOPS: 279
Stddev IOPS: 1
Max IOPS: 285
Min IOPS: 271
Average Latency(s): 0.057239
Stddev Latency(s): 0.0354817
Max latency(s): 0.367037
Min latency(s): 0.0120791

peter


Peter Eisch
https://www.virginpulse.com/ | https://globalchallenge.virginpulse.com/
Australia | Bosnia and Herzegovina | Brazil | Canada | Singapore | Switzerland | United Kingdom | USA


___
ceph-users mailing list
mailto:ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] cephfs speed

2018-08-30 Thread Peter Eisch
Hi,

I have a cluster serving cephfs and it works.  It’s just slow.  Client is using 
the kernel driver.  I can ‘rados bench’ writes to the cephfs_data pool at wire 
speeds (9580Mb/s on a 10G link) but when I copy data into cephfs it is rare to 
get above 100Mb/s.  Large file writes may start fast (2Gb/s) but within a 
minute slows.  In the dashboard at the OSDs I get lots of triangles (it doesn't 
stream) which seems to be lots of starts and stops.  By contrast the graphs 
show constant flow when using 'rados bench.'

I feel like I'm missing something obvious.  What can I do to help diagnose this 
better or resolve the issue?

Errata:
Version: 12.2.7 (on everything)
mon: 3 daemons, quorum cephmon-s01,cephmon-s03,cephmon-s02
mgr: cephmon-s02(active), standbys: cephmon-s01, cephmon-s03
mds: cephfs1-1/1/1 up  {0=cephmon-s02=up:active}, 2 up:standby
osd: 70 osds: 70 up, 70 in
rgw: 3 daemons active

rados bench summary:
Total time run: 600.043733
Total writes made:  167725
Write size: 4194304
Object size:4194304
Bandwidth (MB/sec): 1118.09
Stddev Bandwidth:   7.23868
Max bandwidth (MB/sec): 1140
Min bandwidth (MB/sec): 1084
Average IOPS:   279
Stddev IOPS:1
Max IOPS:   285
Min IOPS:   271
Average Latency(s): 0.057239
Stddev Latency(s):  0.0354817
Max latency(s): 0.367037
Min latency(s): 0.0120791
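
One comparison I can run next, to separate parallelism from the filesystem path
(a sketch; same pool as above):

rados bench -p cephfs_data 60 write -t 16 -b 4M --no-cleanup
rados bench -p cephfs_data 60 write -t 1 -b 4M --no-cleanup
rados -p cephfs_data cleanup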

peter




Peter Eisch
virginpulse.com
|globalchallenge.virginpulse.com
Australia | Bosnia and Herzegovina | Brazil | Canada | Singapore | Switzerland 
| United Kingdom | USA
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD-primary crush rule doesn't work as intended

2018-05-24 Thread Peter Linder
It will also only work reliably if you use a single-level tree structure 
with failure domain "host". If you want, say, separate data center 
failure domains, you need extra steps to make sure an SSD host and an HDD 
host do not get selected from the same DC.


I have done such a layout so it is possible (see my older posts) but you 
need to be careful when you construct the additional trees that are 
needed in order to force the correct elections.


In reality, however, even if you force all reads to the SSDs using primary 
affinity, you will soon run out of write IOPS on the HDDs. To keep up 
with the SSDs for an average workload you will need so many HDDs that, 
in order to maintain performance, you will not save any money.
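
For completeness, forcing reads off the HDDs is just a matter of zeroing their
primary affinity, something like this (a sketch; osd ids are illustrative, and
older clusters may need mon_osd_allow_primary_affinity=true):

ceph osd primary-affinity osd.4 0
ceph osd primary-affinity osd.5 0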


Regards,

Peter



On 2018-05-23 at 14:37, Paul Emmerich wrote:

You can't mix HDDs and SSDs in a server if you want to use such a rule.
The new selection step after "emit" can't know what server was 
selected previously.


Paul

2018-05-23 11:02 GMT+02:00 Horace <mailto:hor...@hkisl.net>>:


Add to the info, I have a slightly modified rule to take advantage
of the new storage class.

rule ssd-hybrid {
        id 2
        type replicated
        min_size 1
        max_size 10
        step take default class ssd
        step chooseleaf firstn 1 type host
        step emit
        step take default class hdd
        step chooseleaf firstn -1 type host
        step emit
}

Regards,
Horace Ng

- Original Message -
From: "horace" mailto:hor...@hkisl.net>>
To: "ceph-users" mailto:ceph-users@lists.ceph.com>>
Sent: Wednesday, May 23, 2018 3:56:20 PM
Subject: [ceph-users] SSD-primary crush rule doesn't work as intended

I've set up the rule according to the doc, but some of the PGs are
still being assigned to the same host.

http://docs.ceph.com/docs/master/rados/operations/crush-map-edits/
<http://docs.ceph.com/docs/master/rados/operations/crush-map-edits/>

  rule ssd-primary {
              ruleset 5
              type replicated
              min_size 5
              max_size 10
              step take ssd
              step chooseleaf firstn 1 type host
              step emit
              step take platter
              step chooseleaf firstn -1 type host
              step emit
      }

Crush tree:

[root@ceph0 ~]#    ceph osd crush tree
ID CLASS WEIGHT   TYPE NAME
-1       58.63989 root default
-2       19.55095     host ceph0
 0   hdd  2.73000         osd.0
 1   hdd  2.73000         osd.1
 2   hdd  2.73000         osd.2
 3   hdd  2.73000         osd.3
12   hdd  4.54999         osd.12
15   hdd  3.71999         osd.15
18   ssd  0.2         osd.18
19   ssd  0.16100         osd.19
-3       19.55095     host ceph1
 4   hdd  2.73000         osd.4
 5   hdd  2.73000         osd.5
 6   hdd  2.73000         osd.6
 7   hdd  2.73000         osd.7
13   hdd  4.54999         osd.13
16   hdd  3.71999         osd.16
20   ssd  0.16100         osd.20
21   ssd  0.2         osd.21
-4       19.53799     host ceph2
 8   hdd  2.73000         osd.8
 9   hdd  2.73000         osd.9
10   hdd  2.73000         osd.10
11   hdd  2.73000         osd.11
14   hdd  3.71999         osd.14
17   hdd  4.54999         osd.17
22   ssd  0.18700         osd.22
23   ssd  0.16100         osd.23

#ceph pg ls-by-pool ssd-hybrid

27.8       1051                  0        0         0    0
4399733760 1581     1581               active+clean 2018-05-23
06:20:56.088216 27957'185553 27959:368828 [23,1,11]         23 
[23,1,11]             23 27953'182582 2018-05-23 06:20:56.088172 
  27843'162478 2018-05-20 18:28:20.118632

With osd.23 and osd.11 being assigned on the same host.

Regards,
Horace Ng
___
ceph-users mailing list
ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
<http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com>
___
ceph-users mailing list
ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
<http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com>




--
--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io <http://www.croit.io>
Tel: +49 89 1896585 90


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [SOLVED] Replicated pool with an even size - has min_size to be bigger than half the size?

2018-03-29 Thread Peter Linder



On 2018-03-29 at 14:26, David Rabel wrote:

On 29.03.2018 13:50, Peter Linder wrote:

On 2018-03-29 at 12:29, David Rabel wrote:

On 29.03.2018 12:25, Janne Johansson wrote:

2018-03-29 11:50 GMT+02:00 David Rabel :

You are right. But with my above example: If I have min_size 2 and size
4, and because of a network issue the 4 OSDs are split into 2 and 2, is
it possible that I have write operations on both sides and therefore
have inconsistent data?


You always write to the primary, which in turn sends copies to the 3
others,
so in the 2+2 split case, only one side can talk to the primary OSD for
that pg,
so writes will just happen on one side at most.

I'm not sure that this is true, will not the side that doesn't have the
primary simply elect a new one when min_size=2 and there are 2 of
[failure domain] available? This is assuming that there are enough mon's
also.

Even if this is the case, only half of the PGs would be available and
operations will stop.

Why is this? If min_size is 2 and 2 PGs are available, operations should
not stop. Or am I wrong here?
Yes, but there are 2 OSDs available per PG on each side of the partition, so 
you effectively get 2 separate active clusters. If a different write arrives on 
each side, both will be accepted, and it will not be possible to heal the cluster 
later when the network issue is resolved because of the inconsistency.


Even if it was a 50/50 chance on which side a PG would be active (going 
by the original primary) it would mean trouble as many writes could not 
complete, but I don't think this is the case.


You will have to take mon quorum into account as well, of course; that is 
outside the scope of my post. The best thing, I believe, is to have an uneven 
number of everything. I don't know if you can have 4 OSD hosts and a 5th node 
for quorum; I suppose it would be worth it if the extra quorum node could 
not fail at the same time as 2 of the hosts.




David




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [SOLVED] Replicated pool with an even size - has min_size to be bigger than half the size?

2018-03-29 Thread Peter Linder



On 2018-03-29 at 12:29, David Rabel wrote:

On 29.03.2018 12:25, Janne Johansson wrote:

2018-03-29 11:50 GMT+02:00 David Rabel :

You are right. But with my above example: If I have min_size 2 and size
4, and because of a network issue the 4 OSDs are split into 2 and 2, is
it possible that I have write operations on both sides and therefore
have inconsistent data?


You always write to the primary, which in turn sends copies to the 3 others,
so in the 2+2 split case, only one side can talk to the primary OSD for
that pg,
so writes will just happen on one side at most.
I'm not sure that this is true, will not the side that doesn't have the 
primary simply elect a new one when min_size=2 and there are 2 of 
[failure domain] available? This is assuming that there are enough mon's 
also.


Even if this is the case, only half of the PGs would be available and 
operations will stop.




Thanks for clarifying!

David




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PGs stuck activating after adding new OSDs

2018-03-27 Thread Peter Linder
0 9313G  7597G 1716G 81.57 1.23 171
 65   hdd 9.09560  1.0 9313G  6706G 2607G 72.00 1.08 151
 66   hdd 9.09560  0.95000 9313G  7820G 1493G 83.97 1.26 176
 67   hdd 9.09560  0.95000 9313G  8043G 1270G 86.36 1.30 181
 68   hdd 9.09560  1.0 9313G  7643G 1670G 82.07 1.23 172
 69   hdd 9.09560  1.0 9313G  6620G 2693G 71.08 1.07 149
 70   hdd 9.09560  1.0 9313G  7775G 1538G 83.48 1.26 175
 71   hdd 9.09560  1.0 9313G  7731G 1581G 83.02 1.25 174
 72   hdd 9.09560  1.0 9313G  7598G 1715G 81.58 1.23 171
 73   hdd 9.09560  1.0 9313G  6575G 2738G 70.60 1.06 148
 74   hdd 9.09560  1.0 9313G  7155G 2158G 76.83 1.16 161
 75   hdd 9.09560  1.0 9313G  6220G 3093G 66.79 1.00 140
 76   hdd 9.09560  1.0 9313G  6796G 2517G 72.97 1.10 153
 77   hdd 9.09560  1.0 9313G  7725G 1587G 82.95 1.25 174
 78   hdd 9.09560  1.0 9313G  7241G 2072G 77.75 1.17 163
 79   hdd 9.09560  1.0 9313G  7597G 1716G 81.57 1.23 171
 80   hdd 9.09560  1.0 9313G  7467G 1846G 80.18 1.21 168
 81   hdd 9.09560  1.0 9313G  7909G 1404G 84.92 1.28 178
 82   hdd 9.09560  1.0 9313G  7240G 2073G 77.74 1.17 163
 83   hdd 9.09560  1.0 9313G  7241G 2072G 77.75 1.17 163
 84   hdd 9.09560  1.0 9313G  7687G 1626G 82.54 1.24 173
 85   hdd 9.09560  1.0 9313G  7244G 2069G 77.78 1.17 163
 86   hdd 9.09560  1.0 9313G  7466G 1847G 80.16 1.21 168
 87   hdd 9.09560  1.0 9313G  7953G 1360G 85.39 1.28 179
 88   hdd 9.09569  1.0 9313G   144G 9169G  1.56 0.02  3
 89   hdd 9.09569  1.0 9313G   241G 9072G  2.59 0.04  5
 90   hdd       0  1.0 9313G  6975M 9307G  0.07 0.00  0
 91   hdd       0  1.0 9313G  1854M 9312G  0.02    0  0
 92   hdd       0  1.0 9313G  1837M 9312G  0.02    0  0
 93   hdd       0  1.0 9313G  2001M 9312G  0.02    0  0
 94   hdd       0  1.0 9313G  1829M 9312G  0.02    0  0
 95   hdd       0  1.0 9313G  1807M 9312G  0.02    0  0
 96   hdd       0  1.0 9313G  1850M 9312G  0.02    0  0
 97   hdd       0  1.0 9313G  1311M 9312G  0.01    0  0
 98   hdd       0  1.0 9313G  1287M 9312G  0.01    0  0
 99   hdd       0  1.0 9313G  1279M 9312G  0.01    0  0
100   hdd       0  1.0 9313G  1285M 9312G  0.01    0  0
101   hdd       0  1.0 9313G  1271M 9312G  0.01    0  0


On Tue, Mar 27, 2018 at 2:29 PM, Peter Linder 
mailto:peter.lin...@fiberdirekt.se>> wrote:


I've had similar issues, but I think your problem might be
something else. Could you send the output of "ceph osd df"?

Other people will probably be interested in what version you are
using as well.


On 2018-03-27 at 20:07, Jon Light wrote:

Hi all,

I'm adding a new OSD node with 36 OSDs to my cluster and have run
into some problems. Here are some of the details of the cluster:

1 OSD node with 80 OSDs
1 EC pool with k=10, m=3
pg_num 1024
osd failure domain

I added a second OSD node and started creating OSDs with
ceph-deploy, one by one. The first 2 added fine, but each
subsequent new OSD resulted in more and more PGs stuck
activating. I've added a total of 14 new OSDs, but had to set 12
of those with a weight of 0 to get the cluster healthy and usable
until I get it fixed.

I have read some things about similar behavior due to PG overdose
protection, but I don't think that's the case here because the
failure domain is set to osd. Instead, I think my CRUSH rule needs
some attention:

rule main-storage {
        id 1
        type erasure
        min_size 3
        max_size 13
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take default class hdd
        step choose indep 0 type osd
        step emit
}

I don't believe I have modified anything from the automatically
generated rule except for the addition of the hdd class.

I have been reading the documentation on CRUSH rules, but am
having trouble figuring out if the rule is setup properly. After
a few more nodes are added I do want to change the failure domain
to host, but osd is sufficient for now.

Can anyone help out to see if the rule is causing the problems or
if I should be looking at something else?


___
ceph-users mailing list
ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
<http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com>



___
ceph-users mailing list
ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
<http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com>




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PGs stuck activating after adding new OSDs

2018-03-27 Thread Peter Linder
I've had similar issues, but I think your problem might be something 
else. Could you send the output of "ceph osd df"?


Other people will probably be interested in what version you are using 
as well.
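
If you want to sanity-check the rule itself offline, something like this works
(a sketch; the rule id 1 and 13 copies follow the rule you posted):

ceph osd getcrushmap -o crushmap.bin
crushtool -i crushmap.bin --test --rule 1 --num-rep 13 --show-bad-mappings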



On 2018-03-27 at 20:07, Jon Light wrote:

Hi all,

I'm adding a new OSD node with 36 OSDs to my cluster and have run into 
some problems. Here are some of the details of the cluster:


1 OSD node with 80 OSDs
1 EC pool with k=10, m=3
pg_num 1024
osd failure domain

I added a second OSD node and started creating OSDs with ceph-deploy, 
one by one. The first 2 added fine, but each subsequent new OSD 
resulted in more and more PGs stuck activating. I've added a total of 
14 new OSDs, but had to set 12 of those with a weight of 0 to get the 
cluster healthy and usable until I get it fixed.


I have read some things about similar behavior due to PG overdose 
protection, but I don't think that's the case here because the failure 
domain is set to osd. Instead, I think my CRUSH rule needs some attention:


rule main-storage {
        id 1
        type erasure
        min_size 3
        max_size 13
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take default class hdd
        step choose indep 0 type osd
        step emit
}

I don't believe I have modified anything from the automatically 
generated rule except for the addition of the hdd class.


I have been reading the documentation on CRUSH rules, but am having 
trouble figuring out if the rule is setup properly. After a few more 
nodes are added I do want to change the failure domain to host, but 
osd is sufficient for now.


Can anyone help out to see if the rule is causing the problems or if I 
should be looking at something else?



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] XFS Metadata corruption while activating OSD

2018-03-12 Thread Peter Woodman
from what i've heard, xfs has problems on arm. use btrfs, or (i
believe?) ext4+bluestore will work.

On Sun, Mar 11, 2018 at 9:49 PM, Christian Wuerdig
 wrote:
> Hm, so you're running OSD nodes with 2GB of RAM and 2x10TB = 20TB of
> storage? Literally everything posted on this list in relation to HW
> requirements and related problems will tell you that this simply isn't going
> to work. The slightest hint of a problem will simply kill the OSD nodes with
> OOM. Have you tried with smaller disks - like 1TB models (or even smaller
> like 256GB SSDs) and see if the same problem persists?
>
>
> On Tue, 6 Mar 2018 at 10:51, 赵赵贺东  wrote:
>>
>> Hello ceph-users,
>>
>> It is a really really Really tough problem for our team.
>> We have investigated the problem for a long time and tried many things, but
>> can’t solve it; even the concrete cause of the problem is still
>> unclear to us!
>> So any solution/suggestion/opinion whatsoever will be highly
>> appreciated!!!
>>
>> Problem Summary:
>> When we activate osd, there will be  metadata corrupttion in the
>> activating disk, probability is 100% !
>>
>> Admin Nodes&MON node:
>> Platform: X86
>> OS: Ubuntu 16.04
>> Kernel: 4.12.0
>> Ceph: Luminous 12.2.2
>>
>> OSD nodes:
>> Platform: armv7
>> OS:   Ubuntu 14.04
>> Kernel:   4.4.39
>> Ceph: Lominous 12.2.2
>> Disk: 10T+10T
>> Memory: 2GB
>>
>> Deploy log:
>>
>>
>> dmesg log: (Sorry, the arms001-01 dmesg log has been lost, but the error
>> messages about metadata corruption on arms003-10 are the same as on
>> arms001-01)
>> Mar  5 11:08:49 arms003-10 kernel: [  252.534232] XFS (sda1): Unmount and
>> run xfs_repair
>> Mar  5 11:08:49 arms003-10 kernel: [  252.539100] XFS (sda1): First 64
>> bytes of corrupted metadata buffer:
>> Mar  5 11:08:49 arms003-10 kernel: [  252.545504] eb82f000: 58 46 53 42 00
>> 00 10 00 00 00 00 00 91 73 fe fb  XFSB.s..
>> Mar  5 11:08:49 arms003-10 kernel: [  252.553569] eb82f010: 00 00 00 00 00
>> 00 00 00 00 00 00 00 00 00 00 00  
>> Mar  5 11:08:49 arms003-10 kernel: [  252.561624] eb82f020: fc 4e e3 89 50
>> 8f 42 aa be bc 07 0c 6e fa 83 2f  .N..P.B.n../
>> Mar  5 11:08:49 arms003-10 kernel: [  252.569706] eb82f030: 00 00 00 00 80
>> 00 00 07 ff ff ff ff ff ff ff ff  
>> Mar  5 11:08:49 arms003-10 kernel: [  252.58] XFS (sda1): metadata I/O
>> error: block 0x48b9ff80 ("xfs_trans_read_buf_map") error 117 numblks 8
>> Mar  5 11:08:49 arms003-10 kernel: [  252.602944] XFS (sda1): Metadata
>> corruption detected at xfs_dir3_data_read_verify+0x58/0xd0, xfs_dir3_data
>> block 0x48b9ff80
>> Mar  5 11:08:49 arms003-10 kernel: [  252.614170] XFS (sda1): Unmount and
>> run xfs_repair
>> Mar  5 11:08:49 arms003-10 kernel: [  252.619030] XFS (sda1): First 64
>> bytes of corrupted metadata buffer:
>> Mar  5 11:08:49 arms003-10 kernel: [  252.625403] eb901000: 58 46 53 42 00
>> 00 10 00 00 00 00 00 91 73 fe fb  XFSB.s..
>> Mar  5 11:08:49 arms003-10 kernel: [  252.633441] eb901010: 00 00 00 00 00
>> 00 00 00 00 00 00 00 00 00 00 00  
>> Mar  5 11:08:49 arms003-10 kernel: [  252.641474] eb901020: fc 4e e3 89 50
>> 8f 42 aa be bc 07 0c 6e fa 83 2f  .N..P.B.n../
>> Mar  5 11:08:49 arms003-10 kernel: [  252.649519] eb901030: 00 00 00 00 80
>> 00 00 07 ff ff ff ff ff ff ff ff  
>> Mar  5 11:08:49 arms003-10 kernel: [  252.657554] XFS (sda1): metadata I/O
>> error: block 0x48b9ff80 ("xfs_trans_read_buf_map") error 117 numblks 8
>> Mar  5 11:08:49 arms003-10 kernel: [  252.675056] XFS (sda1): Metadata
>> corruption detected at xfs_dir3_data_read_verify+0x58/0xd0, xfs_dir3_data
>> block 0x48b9ff80
>> Mar  5 11:08:49 arms003-10 kernel: [  252.686228] XFS (sda1): Unmount and
>> run xfs_repair
>> Mar  5 11:08:49 arms003-10 kernel: [  252.691054] XFS (sda1): First 64
>> bytes of corrupted metadata buffer:
>> Mar  5 11:08:49 arms003-10 kernel: [  252.697425] eb901000: 58 46 53 42 00
>> 00 10 00 00 00 00 00 91 73 fe fb  XFSB.s..
>> Mar  5 11:08:49 arms003-10 kernel: [  252.705459] eb901010: 00 00 00 00 00
>> 00 00 00 00 00 00 00 00 00 00 00  
>> Mar  5 11:08:49 arms003-10 kernel: [  252.713489] eb901020: fc 4e e3 89 50
>> 8f 42 aa be bc 07 0c 6e fa 83 2f  .N..P.B.n../
>> Mar  5 11:08:49 arms003-10 kernel: [  252.721520] eb901030: 00 00 00 00 80
>> 00 00 07 ff ff ff ff ff ff ff ff  
>> Mar  5 11:08:49 arms003-10 kernel: [  252.729558] XFS (sda1): metadata I/O
>> error: block 0x48b9ff80 ("xfs_trans_read_buf_map") error 117 numblks 8
>> Mar  5 11:08:49 arms003-10 kernel: [  252.741953] XFS (sda1): Metadata
>> corruption detected at xfs_dir3_data_read_verify+0x58/0xd0, xfs_dir3_data
>> block 0x48b9ff80
>> Mar  5 11:08:49 arms003-10 kernel: [  252.753139] XFS (sda1): Unmount and
>> run xfs_repair
>> Mar  5 11:08:49 arms003-10 kernel: [  252.757955] XFS (sda1): First 64
>> bytes of corrupted metadata buffer:
>> Mar  5 11:08:49 arms003-10 kernel: [  252.7643

Re: [ceph-users] Weird issues related to (large/small) weights in mixed nvme/hdd pool

2018-01-31 Thread Peter Linder
Yes, this did turn out to be our main issue. We also had a smaller 
issue, but this was the one that caused parts of our pools to go offline 
for a short time. Or rather, the 'cause' was us adding some new NVMe drives 
that were much larger than the ones we already had, so too many PGs got mapped 
to them; we didn't realize at first that this was the problem. Taking 
those OSDs down again allowed us to recover quickly, though.


It was a little hard to figure out, mostly because we had two separate 
problems at the same time. Some kind of separate warning message would 
have been nice (we couldn't find anything in the logs), and perhaps the 
PGs could be allowed to activate anyway, putting the cluster in health_warn?


My colleague built a lab copy of our environment virtualized and we used 
that to recreate and then fix our issues.


We are also working on installing more OSDs, as was our original plan, 
so PGs per OSD will decrease over time. At the time we aimed 
for 300 PGs per OSD, which I realize now was probably not a great idea; 
something like 150 would have been better.
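
For anyone following along, the rough sizing rule we should have applied 
(only a rule of thumb; the OSD count below just uses our own lab cluster 
as an example):

# (target PGs per OSD x number of OSDs) / replica count, rounded to a
# nearby power of two and split across the pools
# e.g. 100 x 51 OSDs / 3 replicas ~= 1700  ->  roughly 2048 PGs in total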


/Peter

On 2018-01-31 at 13:42, Thomas Bennett wrote:

Hi Peter,

Relooking at your problem, you might want to keep track of this issue: 
http://tracker.ceph.com/issues/22440 
<http://tracker.ceph.com/issues/22440>


Regards,
Tom

On Wed, Jan 31, 2018 at 11:37 AM, Thomas Bennett <mailto:tho...@ska.ac.za>> wrote:


Hi Peter,

From your reply, I see that:

 1. pg 3.12c is part of pool 3.
 2. The osd's in the "up" for pg 3.12c  are: 6, 0, 12.


To check on this 'activating' issue, I suggest doing the following:

 1. What is the rule that pool 3 should follow: 'hybrid', 'nvme'
or 'hdd'? (Use the *ceph osd pool ls detail* command and look
at pool 3's crush rule.)
 2. Then check whether osds 6, 0 and 12 are backed by nvmes or hdds. (Use
the *ceph osd tree | grep nvme* command to find your nvme-backed
osds.)


If your problem is similar to mine, you will have osds that are
nvme-backed in a pool that should only be backed by hdds, which
causes a pg to go into the 'activating' state and stay there.
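
Put together, the checks look roughly like this (pool id and osd numbers
taken from your reply above; adjust to taste):

# which crush rule is pool 3 actually using?
ceph osd pool ls detail | grep "^pool 3 "

# are any of the osds in the up set (6, 0, 12) nvme-backed?
ceph osd tree | grep nvme | grep -E "osd\.(0|6|12) "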

Cheers,
Tom




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CRUSH straw2 can not handle big weight differences

2018-01-29 Thread Peter Linder
I realize we're probably kind of pushing it. It was, however, the only option I 
could think of that would satisfy the idea that:


Have separate servers for HDD and NVMe storage spread out in 3 data centers.
Always select 1 NVMe and 2 HDD, in separate data centers (make sure NVMe 
is primary)

If one data center goes down, we only lose 1/3 of the NVMes.

I tried making a ceph rule to first select an NVMe based on class, and 
then select 2 HDDs based on class. I couldn't make it guarantee, however, 
that they would be in separate data centers, probably because of the two 
separate chooseleaf statements. Sometimes one of the HDDs would end up 
being in the same data center as the NVMe. I did play around with this for some 
time.
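
From memory, the shape of the rule I tried was something like this (not the 
exact rule, just a sketch of the idea; the rule id is arbitrary):

rule hybrid_by_class {
    id 4
    type replicated
    min_size 1
    max_size 3
    # first pick one nvme osd, from whichever datacenter
    step take default class nvme
    step chooseleaf firstn 1 type datacenter
    step emit
    # then pick the remaining replicas from hdd osds
    step take default class hdd
    step chooseleaf firstn -1 type datacenter
    step emit
}

The two take/chooseleaf blocks are evaluated independently, so nothing stops 
the hdd half from landing in the datacenter that already got the nvme.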


Just selecting 3 separate ones instead sometimes resulted in 2 or 3 
NVMes, or no NVMes at all. In fact we do have a separate pool with 
3x NVMe for the high-performance requirements, but that uses a traditional 
"default" tree.


Rearranging the osd map and reducing the rule to a single chooseleaf 
seems to work though and we will manually alter the weights outside of 
the hosts to make life easier for CRUSH :).


If we want to add more servers we will just add another layer in between 
and make sure the weights there do not differ too much when we plan it out.


/Peter


On 2018-01-29 at 17:52, Gregory Farnum wrote:
CRUSH is a pseudorandom, probabilistic algorithm. That can lead to 
problems with extreme input.


In this case, you've given it a bucket in which one child contains 
~3.3% of the total weight, and there are only three weights. So on 
only 3% of "draws", as it tries to choose a child bucket to descend 
into, will it choose that small one first.
And then you've forced it to select...each of the hosts in that data 
center, for all inputs? How can that even work in terms of actual data 
storage, if some of them are an order of magnitude larger than the others?


Anyway, leaving that bit aside since it looks like you're mapping each 
host to multiple DCs, you're giving CRUSH a very difficult problem to 
solve. You can probably "fix" it by turning up the choose_retries 
value (or whatever it is) to a high enough level that trying to map a 
PG eventually actually grabs the small host. But I wouldn't be very 
confident in a solution like this; it seems very fragile and subject 
to input error.
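
For reference, the tunable is probably choose_total_tries in the map (or a 
per-rule "step set_choose_tries N"); a rough sketch of bumping it offline, 
untested against your particular map:

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# raise "tunable choose_total_tries 50" to e.g. 100 in crushmap.txt,
# or add "step set_choose_tries 100" as the first step of the rule
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new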

-Greg

On Mon, Jan 29, 2018 at 6:45 AM Peter Linder 
mailto:peter.lin...@fiberdirekt.se>> wrote:


We kind of turned the crushmap inside out a little bit.

Instead of the traditional "for 1 PG, select OSDs from 3 separate data
centers" we did "force selection from only one datacenter (out of
3) and
leave enough options only to make sure precisely 1 SSD and 2 HDD are
selected".

We then organized these "virtual datacenters" in the hierarchy so that
one of them in fact contains 3 options that lead to 3 physically
separate
servers in different locations.

Every physical datacenter has both SSD's and HDD's. The idea is
that if
one datacenter is lost, 2/3 of the SSD's still remain (and can be
mapped
to by marking the missing ones "out") so performance is maintained.





On 2018-01-29 at 13:35, Niklas wrote:
> Yes.
> It is a hybrid solution where a placement group is always located on
> one NVMe drive and two HDD drives. Advantage is great read
> performance
> and cost savings. The disadvantage is low write performance. Still, the
> write performance is good thanks to RocksDB on Intel Optane disks in
> HDD servers.
>
> Real world looks more like I described in a previous question
> (2018-01-23) here on ceph-users list, "Ruleset for optimized Ceph
> hybrid storage". Nobody answered so am guessing it is not
possible to
> create my wanted rule. Now am trying to solve it with virtual
> datacenters in the crush map. Which works but maybe the the most
> optimal solution.
>
>
> On 2018-01-29 13:21, Wido den Hollander wrote:
>>
>>
>> On 01/29/2018 01:14 PM, Niklas wrote:
>>> ...
>>>
>>
>> Is it your intention to put all copies of an object in only one DC?
>>
>> What is your exact idea behind this rule? What's the purpose?
>>
>> Wido
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
> http://lists.ceph.com

Re: [ceph-users] [Best practise] Adding new data center

2018-01-29 Thread Peter Linder
But the OSDs themselves introduce latency too, even if they are NVMe. 
We find that it is in the same ballpark. Latency does reduce I/O, but 
at sub-millisecond latencies it is still thousands of IOPS even for a single thread.


For a use case with many concurrent writers/readers (VMs), aggregated 
throughput can be quite high.



On 2018-01-29 at 19:32, Wido den Hollander wrote:
Although the difference between 0.4ms and 0.2ms is just, yes, 0.2ms, 
it's a 100% increase and halves the amount of I/O you can do in a second.


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Best practise] Adding new data center

2018-01-29 Thread Peter Linder
Your data centers seem to be pretty close, some 13-14km? If it is a more 
or less straight fiber run then latency should be 0.1-0.2ms or 
something, clearly not a problem for synchronous replication. It should 
work rather well.
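
Measuring it once the link is up is easy, by the way; the "8k ping" figures 
quoted below are just a large-payload ping, something along these lines 
(<remote-host> being whatever sits at the far end of the link):

ping -c 100 -s 8192 <remote-host>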


With "only" 2 data centers however, you need to manually decide if there 
is an outage how to proceed unless you have a third mediator node. 
Writing on both sides if the network is down will lead to inconsistent 
data so it is unsafe to automatically restart services and a stonith 
approach may just kill the working node instead of the faulty one. With 
that in mind a separate data center as a disaster recovery plan is a 
good idea (we have 3 data centers for ceph with similar distances, 
redundant network in between and really quite good write performance). 
Just plan for the most common failure scenarios and good luck, you seem 
to be doing well :)


/Peter





On 2018-01-29 at 19:26, Nico Schottelius wrote:

Hey Wido,


[...]
Like I said, latency, latency, latency. That's what matters. Bandwidth
usually isn't a real problem.

I imagined that.


What latency do you have with an 8k ping between hosts?

As the link will only be set up this week, I cannot tell yet.

However, we currently have a 65km link with ~2ms latency.
In our data center, we currently have ~0.4 ms latency.
(both 8k pings).

Do you see similar latencies in your setup?

Best,

Nico

--
Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CRUSH straw2 can not handle big weight differences

2018-01-29 Thread Peter Linder

We kind of turned the crushmap inside out a little bit.

Instead of the traditional "for 1 PG, select OSDs from 3 separate data 
centers" we did "force selection from only one datacenter (out of 3) and 
leave enough options only to make sure precisely 1 SSD and 2 HDD are 
selected".


We then organized these "virtual datacenters" in the hierarchy so that 
one of them in fact contains 3 options that lead to 3 physically separate 
servers in different locations.


Every physical datacenter has both SSD's and HDD's. The idea is that if 
one datacenter is lost, 2/3 of the SSD's still remain (and can be mapped 
to by marking the missing ones "out") so performance is maintained.






On 2018-01-29 at 13:35, Niklas wrote:

Yes.
It is a hybrid solution where a placement group is always located on 
one NVMe drive and two HDD drives. Advantage is great read performance 
and cost savings. The disadvantage is low write performance. Still, the 
write performance is good thanks to RocksDB on Intel Optane disks in 
HDD servers.


Real world looks more like I described in a previous question 
(2018-01-23) here on ceph-users list, "Ruleset for optimized Ceph 
hybrid storage". Nobody answered so am guessing it is not possible to 
create my wanted rule. Now am trying to solve it with virtual 
datacenters in the crush map. Which works but maybe the the most 
optimal solution.



On 2018-01-29 13:21, Wido den Hollander wrote:



On 01/29/2018 01:14 PM, Niklas wrote:

...



Is it your intention to put all copies of an object in only one DC?

What is your exact idea behind this rule? What's the purpose?

Wido

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Weird issues related to (large/small) weights in mixed nvme/hdd pool

2018-01-26 Thread Peter Linder
Ok, by randomly toggling settings *MOST* of the PGs in the test cluster 
are online, but a few are not. No matter how much I change, a few of them 
seem not to activate. They are running bluestore with version 12.2.2, I 
think created with ceph-volume.


Here is the output from 'ceph pg X query' for one that won't activate (it 
was active before but got remapped due to one of my changes). What should 
I look for, and where should I look next to understand this?



# ceph pg 3.12c query
{
    "state": "activating",
    "snap_trimq": "[]",
    "epoch": 918,
    "up": [
    6,
    0,
    12
    ],
    "acting": [
    6,
    0,
    12
    ],
    "actingbackfill": [
    "0",
    "6",
    "12"
    ],
    "info": {
    "pgid": "3.12c",
    "last_update": "0'0",
    "last_complete": "0'0",
    "log_tail": "0'0",
    "last_user_version": 0,
    "last_backfill": "MAX",
    "last_backfill_bitwise": 0,
    "purged_snaps": [],
    "history": {
    "epoch_created": 314,
    "epoch_pool_created": 314,
    "last_epoch_started": 862,
    "last_interval_started": 860,
    "last_epoch_clean": 862,
    "last_interval_clean": 860,
    "last_epoch_split": 0,
    "last_epoch_marked_full": 0,
    "same_up_since": 872,
    "same_interval_since": 915,
    "same_primary_since": 789,
    "last_scrub": "0'0",
    "last_scrub_stamp": "2018-01-26 13:29:35.010846",
    "last_deep_scrub": "0'0",
    "last_deep_scrub_stamp": "2018-01-26 13:29:35.010846",
    "last_clean_scrub_stamp": "2018-01-26 13:29:35.010846"
    },
    "stats": {
    "version": "0'0",
    "reported_seq": "427",
    "reported_epoch": "918",
    "state": "activating",
    "last_fresh": "2018-01-26 17:26:39.603121",
    "last_change": "2018-01-26 17:26:36.161131",
    "last_active": "2018-01-26 17:25:09.770406",
    "last_peered": "2018-01-26 17:24:17.510532",
    "last_clean": "2018-01-26 17:24:17.510532",
    "last_became_active": "2018-01-26 17:24:09.211916",
    "last_became_peered": "2018-01-26 17:24:09.211916",
    "last_unstale": "2018-01-26 17:26:39.603121",
    "last_undegraded": "2018-01-26 17:26:39.603121",
    "last_fullsized": "2018-01-26 17:26:39.603121",
    "mapping_epoch": 915,
    "log_start": "0'0",
    "ondisk_log_start": "0'0",
    "created": 314,
    "last_epoch_clean": 862,
    "parent": "0.0",
    "parent_split_bits": 0,
    "last_scrub": "0'0",
    "last_scrub_stamp": "2018-01-26 13:29:35.010846",
    "last_deep_scrub": "0'0",
    "last_deep_scrub_stamp": "2018-01-26 13:29:35.010846",
    "last_clean_scrub_stamp": "2018-01-26 13:29:35.010846",
    "log_size": 0,
    "ondisk_log_size": 0,
    "stats_invalid": false,
    "dirty_stats_invalid": false,
    "omap_stats_invalid": false,
    "hitset_stats_invalid": false,
    "hitset_bytes_stats_invalid": false,
    "pin_stats_invalid": false,
    "stat_sum": {
    "num_bytes": 0,
    "num_objects": 0,
    "num_object_clones": 0,
    "num_object_copies": 0,
    "num_objects_missing_on_primary": 0,
    "num_objects_missing": 0,
    "num_objects_degraded": 0,
    "num_objects_misplaced": 0,
    "num_objects_unfound": 0,
    "num_objects_dirty": 0,
    "num_whiteouts": 0,
    "num_read": 0,
    "num_read_kb": 0,
    "num_write": 0,
    "num_write_kb": 0,
    "num_scrub_errors": 0,
    "num_shallow_scrub_errors": 0,
    "num_deep_scrub_errors": 0,
    "num_objects_recovered": 0,
    "num_bytes_recovered": 0,
    "num_keys_recovered": 0,
    "num_objects_omap": 0,
    "num_objects_hit_set_archive": 0,
    "num_bytes_hit_set_archive": 0,
    "num_flush": 0,
    "num_flush_kb": 0,
    "num_evict": 0,
    "num_evict_kb": 0,
    "num_promote": 0,
    "num_flush_mode_high": 0,
    "num_flush_mode_low": 0,
    "num_evict_mode_some": 0,
    "num_evict_mode_full": 0,
    "num_objects_pinned": 0,
    "num_legacy_snapsets": 0
    },
    "up": [
    6,
    0,
    12
    ],
    "acting": [
    6,
    0,
    12
    ],
    "blocked_by": [],
    "up_primary": 6,
    "acting_primary": 6
    },
    "empt

Re: [ceph-users] Weird issues related to (large/small) weights in mixed nvme/hdd pool

2018-01-26 Thread Peter Linder
Ok, so after recreating our setup in the lab and adding the pools, our hybrid 
pool cannot even be created properly; around 1/3 of the PGs are stuck in 
various states:


  cluster:
    id: e07f568d-056c-4e01-9292-732c64ab4f8e
    health: HEALTH_WARN
    Reduced data availability: 1070 pgs inactive, 204 pgs peering
    Degraded data redundancy: 1087 pgs unclean, 69 pgs 
degraded, 69 pgs undersized

    too many PGs per OSD (215 > max 200)

  services:
    mon: 3 daemons, quorum s11,s12,s13
    mgr: s12(active), standbys: s11, s13
    osd: 51 osds: 51 up, 51 in

  data:
    pools:   3 pools, 4608 pgs
    objects: 0 objects, 0 bytes
    usage:   56598 MB used, 706 GB / 761 GB avail
    pgs: 17.643% pgs unknown
 5.577% pgs not active
 3521 active+clean
 813  unknown
 204  creating+peering
 46   undersized+degraded+peered
 17   active+undersized+degraded
 6    creating+activating+undersized+degraded
 1    creating+activating


It is stuck like this, and I can't query the problematic PGs:

# ceph pg 2.7cf query

Error ENOENT: i don't have pgid 2.7cf

So, so far, great success :). Now I only have to learn how to fix it, 
any ideas anyone?
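
In case it helps anyone point me in the right direction, this is roughly 
what I have been looking at so far (nothing conclusive yet):

# which pgs are stuck, and which osds do they want to map to?
ceph pg dump_stuck inactive
ceph pg dump_stuck unclean

# is any osd getting far more (or fewer) pgs than the others?
ceph osd df tree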




On 2018-01-26 at 12:59, Peter Linder wrote:


Well, we do, but our problem is with our hybrid setup (1 nvme and 2 
hdds). The other two (which we rarely use) are nvme only and hdd only; 
as far as I can tell they work, and the "take" command uses class to select 
only the relevant OSDs.


I'll just paste our entire crushmap dump here. This one starts working 
when changing the 1.7 weight to 1.0... crushtool --test doesn't show 
any errors in any case, all PGs seem to be properly assigned to osds.


# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class nvme
device 1 osd.1 class nvme
device 2 osd.2 class nvme
device 3 osd.3 class nvme
device 4 osd.4 class nvme
device 5 osd.5 class nvme
device 6 osd.6 class nvme
device 7 osd.7 class nvme
device 8 osd.8 class nvme
device 9 osd.9 class nvme
device 10 osd.10 class nvme
device 12 osd.12 class hdd
device 13 osd.13 class hdd
device 14 osd.14 class hdd
device 15 osd.15 class hdd
device 16 osd.16 class hdd
device 17 osd.17 class hdd
device 18 osd.18 class hdd
device 19 osd.19 class hdd
device 20 osd.20 class hdd
device 21 osd.21 class hdd
device 22 osd.22 class hdd
device 23 osd.23 class hdd
device 24 osd.24 class nvme
device 25 osd.25 class nvme
device 26 osd.26 class nvme
device 27 osd.27 class nvme
device 36 osd.36 class hdd
device 37 osd.37 class hdd
device 38 osd.38 class hdd
device 39 osd.39 class hdd
device 40 osd.40 class hdd
device 41 osd.41 class hdd
device 42 osd.42 class hdd
device 43 osd.43 class hdd
device 44 osd.44 class hdd
device 45 osd.45 class hdd
device 46 osd.46 class hdd
device 47 osd.47 class hdd
device 48 osd.48 class hdd
device 49 osd.49 class hdd
device 50 osd.50 class hdd
device 51 osd.51 class hdd
device 52 osd.52 class hdd
device 53 osd.53 class hdd
device 54 osd.54 class hdd
device 55 osd.55 class hdd
device 56 osd.56 class hdd
device 57 osd.57 class hdd
device 58 osd.58 class hdd
device 59 osd.59 class hdd

# types
type 0 osd
type 1 host
type 2 hostgroup
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host storage11 {
    id -5   # do not change unnecessarily
    id -6 class nvme    # do not change unnecessarily
    id -10 class hdd    # do not change unnecessarily
    # weight 4.612
    alg straw2
    hash 0  # rjenkins1
    item osd.0 weight 0.728
    item osd.3 weight 0.728
    item osd.6 weight 0.728
    item osd.7 weight 0.728
    item osd.10 weight 1.700
}
host storage21 {
    id -13  # do not change unnecessarily
    id -14 class nvme   # do not change unnecessarily
    id -15 class hdd    # do not change unnecessarily
    # weight 65.496
    alg straw2
    hash 0  # rjenkins1
    item osd.12 weight 5.458
    item osd.13 weight 5.458
    item osd.14 weight 5.458
    item osd.15 weight 5.458
    item osd.16 weight 5.458
    item osd.17 weight 5.458
    item osd.18 weight 5.458
    item osd.19 weight 5.458
    item osd.20 weight 5.458
    item osd.21 weight 5.458
    item osd.22 weight 5.458
    item osd.23 weight 5.458
}
datacenter HORN79 {
    id -19  # do not change unnecessarily
    id -26 class nvme   # do not change unnecessarily
    id -27 class hdd    # do not change unnecessarily
    # weight 70.108
    alg straw2

Re: [ceph-users] Weird issues related to (large/small) weights in mixed nvme/hdd pool

2018-01-26 Thread Peter Linder
age22 weight 100.000
}
datacenter ldc1 {
    id -39  # do not change unnecessarily
#   id -44 class nvme   # do not change unnecessarily
#   id -57 class hdd    # do not change unnecessarily
    # weight 30.000
    alg straw2
    hash 0  # rjenkins1
    item hg1-1 weight 100.000
    item hg1-2 weight 100.000
    item hg1-3 weight 100.000
}
datacenter ldc2 {
    id -40  # do not change unnecessarily
#   id -48 class nvme   # do not change unnecessarily
#   id -61 class hdd    # do not change unnecessarily
    # weight 30.000
    alg straw2
    hash 0  # rjenkins1
    item hg2-1 weight 100.000
    item hg2-2 weight 100.000
    item hg2-3 weight 100.000
}
datacenter ldc3 {
    id -41  # do not change unnecessarily
#   id -52 class nvme   # do not change unnecessarily
#   id -65 class hdd    # do not change unnecessarily
    # weight 30.000
    alg straw2
    hash 0  # rjenkins1
    item hg3-1 weight 100.000
    item hg3-2 weight 100.000
    item hg3-3 weight 100.000
}
root ldc {
    id -42  # do not change unnecessarily
#   id -53 class nvme   # do not change unnecessarily
#   id -66 class hdd    # do not change unnecessarily
    # weight 90.000
    alg straw2
    hash 0  # rjenkins1
    item ldc1 weight 300.000
    item ldc2 weight 300.000
    item ldc3 weight 300.000
}

# rules
rule hybrid {
    id 1
    type replicated
    min_size 1
    max_size 10
    step take ldc
    step choose indep 1 type datacenter
    step chooseleaf indep 0 type hostgroup
    step emit
}
rule hdd {
    id 2
    type replicated
    min_size 1
    max_size 3
    step take default class hdd
    step chooseleaf firstn 0 type datacenter
    step emit
}
rule nvme {
    id 3
    type replicated
    min_size 1
    max_size 3
    step take default class nvme
    step chooseleaf firstn 0 type datacenter
    step emit
}

# end crush map



On 2018-01-26 at 11:22, Thomas Bennett wrote:

Hi Peter,

Just to check if your problem is similar to mine:

  * Do you have any pools that follow a crush rule to only use osds
that are backed by hdds (i.e. not nvmes)?
  * Do these pools obey that rule? I.e. do they maybe have pgs that are
on nvmes?

Regards,
Tom

On Fri, Jan 26, 2018 at 11:48 AM, Peter Linder 
mailto:peter.lin...@fiberdirekt.se>> wrote:


Hi Thomas,

No, we haven't gotten any closer to resolving this, in fact we had
another issue again when we added a new nvme drive to our nvme
servers (storage11, storage12 and storage13) that had weight 1.7
instead of the usual 0.728 size. This (see below) is what a nvme
and hdd server pair at a site looks like, and it broke when adding
osd.10 (adding the nvme drive to storage12 and storage13 worked,
it failed when adding the last one to storage11). Changing
osd.10's weight to 1.0 instead and recompiling crushmap allowed
all PGs to activate.

Unfortunately this is a production cluster that we were hoping to
expand as needed, so if there is a problem we quickly have to
revert to the last working crushmap, so no time to debug :(

We are currently building a copy of the environment though
virtualized and I hope that we will be able to re-create the issue
there as we will be able to break it at will :)


host storage11 {
    id -5   # do not change unnecessarily
    id -6 class nvme    # do not change unnecessarily
    id -10 class hdd    # do not change unnecessarily
    # weight 4.612
    alg straw2
    hash 0  # rjenkins1
    item osd.0 weight 0.728
    item osd.3 weight 0.728
    item osd.6 weight 0.728
    item osd.7 weight 0.728
    item osd.10 weight 1.700
}
host storage21 {
    id -13  # do not change unnecessarily
    id -14 class nvme   # do not change unnecessarily
    id -15 class hdd    # do not change unnecessarily
    # weight 65.496
    alg straw2
    hash 0  # rjenkins1
    item osd.12 weight 5.458
    item osd.13 weight 5.458
    item osd.14 weight 5.458
    item osd.15 weight 5.458
    item osd.16 weight 5.458
    item osd.17 weight 5.458
    item osd.18 weight 5.458
    item osd.19 weight 5.458
    item osd.20 weight 5.458
    item osd.21 weight 5.458
    item osd.22 weight 5.458
    item osd.23 weight 5.458
}


On 2018-01-26 at 08:45, Thomas Bennett wrote:

Hi Peter,

Not sure if you have got to the bottom of your problem,  but I
seem t

Re: [ceph-users] Weird issues related to (large/small) weights in mixed nvme/hdd pool

2018-01-26 Thread Peter Linder

Hi Thomas,

No, we haven't gotten any closer to resolving this. In fact, we had 
another issue when we added a new nvme drive to our nvme servers 
(storage11, storage12 and storage13) that had weight 1.7 instead of the 
usual 0.728. This (see below) is what an nvme and hdd server pair at 
a site looks like, and it broke when adding osd.10 (adding the nvme 
drive to storage12 and storage13 worked; it failed when adding the last 
one to storage11). Changing osd.10's weight to 1.0 instead and 
recompiling the crushmap allowed all PGs to activate.
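
For the record, the same change can also be made on the live cluster 
without decompiling the map, something like:

# equivalent to editing the item weight in the crushmap by hand
ceph osd crush reweight osd.10 1.0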


Unfortunately this is a production cluster that we were hoping to expand 
as needed, so if there is a problem we quickly have to revert to the 
last working crushmap, so no time to debug :(


We are currently building a copy of the environment though virtualized 
and I hope that we will be able to re-create the issue there as we will 
be able to break it at will :)



host storage11 {
    id -5   # do not change unnecessarily
    id -6 class nvme    # do not change unnecessarily
    id -10 class hdd    # do not change unnecessarily
    # weight 4.612
    alg straw2
    hash 0  # rjenkins1
    item osd.0 weight 0.728
    item osd.3 weight 0.728
    item osd.6 weight 0.728
    item osd.7 weight 0.728
    item osd.10 weight 1.700
}
host storage21 {
    id -13  # do not change unnecessarily
    id -14 class nvme   # do not change unnecessarily
    id -15 class hdd    # do not change unnecessarily
    # weight 65.496
    alg straw2
    hash 0  # rjenkins1
    item osd.12 weight 5.458
    item osd.13 weight 5.458
    item osd.14 weight 5.458
    item osd.15 weight 5.458
    item osd.16 weight 5.458
    item osd.17 weight 5.458
    item osd.18 weight 5.458
    item osd.19 weight 5.458
    item osd.20 weight 5.458
    item osd.21 weight 5.458
    item osd.22 weight 5.458
    item osd.23 weight 5.458
}


On 2018-01-26 at 08:45, Thomas Bennett wrote:

Hi Peter,

Not sure if you have got to the bottom of your problem, but I seem to 
have found what might be a similar issue. I recommend reading 
below, as there could be a hidden problem.


Yesterday our cluster went into HEALTH_WARN state and I noticed 
that one of my pgs was listed as 'activating' and marked as 
'inactive' and 'unclean'.


We also have a mixed OSD system - 768 HDDs and 16 NVMEs with three 
crush rules for object placement: the default replicated_rule (I 
never deleted it) and then two new ones, replicate_rule_hdd and 
replicate_rule_nvme.


Running a query on the pg (in my case pg 15.792) did not yield 
anything out of place, except for it telling me that its state 
was 'activating' (which is not even a documented pg state: 
http://docs.ceph.com/docs/master/rados/operations/pg-states/), which 
made me slightly alarmed.


The bits of information that alerted me to the issue where:

1. Running 'ceph pg dump' and finding the 'activating' pg showed the 
following information:


15.792 activating [4,724,242] #for pool 15 pg there are osds 4,724,242


2. Running 'ceph osd tree | grep 'osd.4 ' and getting the following 
information:


4 nvme osd.4

3. Now checking what pool 15 is by running 'ceph osd pool ls detail':

pool 15 'default.rgw.data' replicated size 3 min_size 2 crush_rule 1


These three bits of information made me realise what was going on:

  * OSD 4,724,242 are all nvmes
  * Pool 15 should obey crush_rule 1 (replicate_rule_hdd)
  * Pool 15 has pgs that use nvmes!

I found the following really useful tool online which showed me the 
depth of the problem: Get the Number of Placement Groups Per Osd 
<http://cephnotes.ksperis.com/blog/2015/02/23/get-the-number-of-placement-groups-per-osd>


So it turns out that in my case pool 15 has pgs on all the nvmes!

To test a fix to mimic the problem again - I executed the following 
command: 'ceph osd pg-upmap-items 15.792 4 22 724 67 76 242'


It remapped the osds used by the 'activating' pg, my cluster status 
went back to HEALTH_OK, and the pg went back to normal, making the 
cluster appear healthy.


Luckily for me we've not put the cluster into production so I'll just 
blow away the pool and recreate it.


What I've not yet figured out is how this happened.

The steps (I think) I took were:

 1. Run ceph-ansible and  'default.rgw.data' pool was created
automatically.
 2. I think I then increased the pg count.
 3. Create a new rule: ceph osd crush rule create-replicated
replicated_rule_hdd default host hdd
 4. Move pool to new rule: ceph osd pool set
default.rgw.data crush_rule replicated_rule_hdd

I don't know what the expected behav

Re: [ceph-users] Stuck pgs (activating+remapped) and slow requests after adding OSD node via ceph-ansible

2018-01-22 Thread Peter Linder
Did you find out anything about this? We are also getting pgs stuck 
"activating+remapped". To fix the problem I have to manually alter bucket 
weights so that they are basically the same everywhere, even if the disks 
aren't the same size, but it is a real hassle every time we add a new 
node or disk.


See my email subject "Weird issues related to (large/small) weights in 
mixed nvme/hdd pool" from 2018-01-20 and see if there are some similarities?



Regards,
Peter

On 2018-01-07 at 12:17, Tzachi Strul wrote:

Hi all,
We have 5 node ceph cluster (Luminous 12.2.1) installed via ceph-ansible.
All servers have 16X1.5TB SSD disks.
3 of these servers are also acting as MON+MGRs.
We don't have a separate network for cluster and public; each node has 
4 NICs bonded together (40G) that serve cluster+public communication 
(we know it's not ideal and are planning to change it).


Last week we added another node to the cluster (another 16*1.5TB ssd).
We used the latest stable ceph-ansible release.
After OSD activation the cluster started rebalancing and the problems began:
1. Cluster entered HEALTH_ERROR state
2. 67 pgs stuck at activating+remapped
3. A lot of blocked slow requests.

This cluster serves OpenStack volumes, and almost all OpenStack 
instances got 100% disk utilization and hung; eventually, 
cinder-volume crashed.


Eventually, after restarting several OSDs, the problem was resolved and the cluster 
got back to HEALTH_OK.


Our configuration already has:
osd max backfills = 1
osd max scrubs = 1
osd recovery max active = 1
osd recovery op priority = 1

In addition, we see a lot of bad mappings:
for example: bad mapping rule 0 x 52 num_rep 8 result 
[32,5,78,25,96,59,80]


What could be the cause, and what can I do in order to avoid this 
situation? We need to add another 9 osd servers and can't afford downtime.


Any help would be appreciated. Thank you very much


Our ceph configuration:

[mgr]
mgr_modules = dashboard zabbix

[global]
cluster network = *removed for security reasons*
fsid =  *removed for security reasons*
mon host =  *removed for security reasons*
mon initial members =  *removed for security reasons*
mon osd down out interval = 900
osd pool default size = 3
public network =  *removed for security reasons*

[client.libvirt]
admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok # 
must be writable by QEMU and allowed by SELinux or AppArmor
log file = /var/log/ceph/qemu-guest-$pid.log # must be writable by 
QEMU and allowed by SELinux or AppArmor


[osd]
osd backfill scan max = 16
osd backfill scan min = 4
osd bluestore cache size = 104857600  **Due to 12.2.1 bluestore memory 
leak bug**

osd max backfills = 1
osd max scrubs = 1
osd recovery max active = 1
osd recovery max single start = 1
osd recovery op priority = 1
osd recovery threads = 1


--

*Tzachi Strul*

*Storage DevOps *// *Kenshoo*


This e-mail, as well as any attached document, may contain material 
which is confidential and privileged and may include trademark, 
copyright and other intellectual property rights that are proprietary 
to Kenshoo Ltd,  its subsidiaries or affiliates ("Kenshoo"). This 
e-mail and its attachments may be read, copied and used only by the 
addressee for the purpose(s) for which it was disclosed herein. If you 
have received it in error, please destroy the message and any 
attachment, and contact us immediately. If you are not the intended 
recipient, be aware that any review, reliance, disclosure, copying, 
distribution or use of the contents of this message without Kenshoo's 
express permission is strictly prohibited.



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Weird issues related to (large/small) weights in mixed nvme/hdd pool

2018-01-20 Thread peter . linder
 -48 class nvme   # do not change unnecessarily
    id -61 class hdd    # do not change unnecessarily
    id -82 class ssd    # do not change unnecessarily
    # weight 196.781
    alg straw2
    hash 0  # rjenkins1
    item hg2-1 weight 65.496
    item hg2-2 weight 65.496
    item hg2-3 weight 65.789
}
datacenter ldc3 {
    id -41  # do not change unnecessarily
    id -52 class nvme   # do not change unnecessarily
    id -65 class hdd    # do not change unnecessarily
    id -87 class ssd    # do not change unnecessarily
    # weight 197.197
    alg straw2
    hash 0  # rjenkins1
    item hg3-1 weight 65.912
    item hg3-2 weight 65.496
    item hg3-3 weight 65.789
}
root ldc {
    id -42  # do not change unnecessarily
    id -53 class nvme   # do not change unnecessarily
    id -66 class hdd    # do not change unnecessarily
    id -88 class ssd    # do not change unnecessarily

    # weight 528.881
    alg straw2
    hash 0  # rjenkins1
    item ldc1 weight 97.489
    item ldc2 weight 97.196
    item ldc3 weight 97.196
}

# rules
rule hybrid {
    id 1
    type replicated
    min_size 1
    max_size 10
    step take ldc
    step choose firstn 1 type datacenter
    step chooseleaf firstn 0 type hostgroup
    step emit
}


Ok, so there are 9 hostgroups (I changed "type 2"). Each hostgroup 
currently holds 1 server, but may in the future hold more. These are 
grouped in threes and called a "datacenter", even though each set is spread 
out onto 3 physical data centers. These are then put in a separate root 
called "ldc".


The "hybrid" rule then proceeds to select 1 datacenter, and then 3 osds 
from that datacenter. The end result is that 3 OSDs from different 
physical datacenters are selected, with 1 nvme and 2 hdd (hdds have 
reduced primary affinity to 0.00099, and yes this might be a problem?). 
If one datacenter is lost, only 1/3'rd of the nvmes are in fact offline 
so capacity loss is manageable compared to having all nvme's in one 
datacenter.
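
The primary affinity part is just the normal per-osd setting, for example:

# make the hdd osds very unlikely to be chosen as primary; older clusters
# may need "mon osd allow primary affinity = true" before this takes effect
ceph osd primary-affinity osd.12 0.00099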


Because nvmes are much smaller, after adding one the "datacenter" looks 
like this:


    item hg1-1 weight 2.911
    item hg1-2 weight 65.789
    item hg1-3 weight 65.789

This causes PGs to go into "active+clean+remapped" state forever. If I 
manually change the weights so that they are all almost the same, the 
problem goes away! I would have thought that the weights do not matter, 
since we have to choose 3 of these anyway. So I'm really confused about 
this.


Today I also had to change

    item ldc1 weight 197.489
    item ldc2 weight 197.196
    item ldc3 weight 197.196
to
    item ldc1 weight 97.489
    item ldc2 weight 97.196
    item ldc3 weight 97.196

or some PGs wouldn't activate at all! I'm not really aware of how the 
hashing/selection process works, though; it does somehow seem that if the 
values are too far apart, things break. crushtool --test seems 
to calculate my PGs correctly.
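
For completeness, this is roughly how I am testing the compiled map, in case 
I am holding it wrong (crushmap.compiled is just whatever the output of 
"crushtool -c" is called; rule 1 is the hybrid rule):

crushtool -i crushmap.compiled --test --rule 1 --num-rep 3 --show-mappings | head
crushtool -i crushmap.compiled --test --rule 1 --num-rep 3 --show-bad-mappings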


Basically when this happens I just randomly change some weights and most 
of the time it starts working. Why?


Regards,
Peter





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous on armhf

2017-12-18 Thread Peter Woodman
er, yeah, i didn't read before i replied. that's fair, though it is
only some of the integration test binaries that tax that limit in a
single compile step.

On Mon, Dec 18, 2017 at 4:52 PM, Peter Woodman  wrote:
> not the larger "intensive" instance types! they go up to 128gb ram.
>
> On Mon, Dec 18, 2017 at 4:46 PM, Ean Price  wrote:
>> The problem with the native build on armhf is the compilation exceeds the 2 
>> GB of memory that ARMv7 (armhf) supports. Scaleway is pretty awesome but 
>> their 32 bit ARM systems have the same 2 GB limit. I haven’t tried the 
>> cross-compile on the 64 bit ARMv8 they offer and that might be easier than 
>> trying to do it on x86_64.
>>
>>> On Dec 18, 2017, at 4:41 PM, Peter Woodman  wrote:
>>>
>>> https://www.scaleway.com/
>>>
>>> they rent access to arm servers with gobs of ram.
>>>
>>> i've been building my own, but with some patches (removal of some
>>> asserts that were unnecessarily causing crashes while i try and track
>>> down the bug) that make it unsuitable for public consumption
>>>
>>> On Mon, Dec 18, 2017 at 4:38 PM, Andrew Knapp  wrote:
>>>> I have no idea what this response means.
>>>>
>>>> I have tried building the armhf and arm64 package on my raspberry pi 3 to
>>>> no avail.  Would love to see someone post Debian packages for stretch on
>>>> arm64 or armhf.
>>>>
>>>> On Dec 18, 2017 4:12 PM, "Peter Woodman"  wrote:
>>>>>
>>>>> YMMV, but I've been using Scaleway instances to build packages for
>>>>> arm64- AFAIK you should be able to run any armhf distro on those
>>>>> machines as well.
>>>>>
>>>>> On Mon, Dec 18, 2017 at 4:02 PM, Andrew Knapp  wrote:
>>>>>> I would also love to see these packages!!!
>>>>>>
>>>>>> On Dec 18, 2017 3:46 PM, "Ean Price"  wrote:
>>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> I have a test cluster of armhf arch SoC systems running Xenial and Jewel
>>>>>> (10.2). I’m looking to do a clean rebuild with Luminous (12.2) but there
>>>>>> are
>>>>>> no 32 bit armhf binaries available. This is just a toy cluster and not
>>>>>> in
>>>>>> production.
>>>>>>
>>>>>> I have tried, unsuccessfully, to compile from source but they only have
>>>>>> 2GB
>>>>>> of memory and the system runs out of memory even with tuning and dialing
>>>>>> back compile options. I have tinkered around with cross compiling but
>>>>>> that
>>>>>> seems to land me in dependency hell on Xenial and I am a cross compile
>>>>>> newbie at any rate.
>>>>>>
>>>>>> Does anyone know of a source for a packaged version of Luminous for the
>>>>>> armhf architecture? Like I said, it’s just a test cluster so I’m not
>>>>>> overly
>>>>>> concerned about stability.
>>>>>>
>>>>>> Thanks in advance,
>>>>>> Ean
>>>>>> --
>>>>>> __
>>>>>>
>>>>>> This message contains information which may be confidential.  Unless you
>>>>>> are the addressee (or authorized to receive for the addressee), you may
>>>>>> not
>>>>>> use, copy, or disclose to anyone the message or any information
>>>>>> contained
>>>>>> in the message.  If you have received the message in error, please
>>>>>> advise
>>>>>> the sender by reply e-mail or contact the sender at Price Paper & Twine
>>>>>> Company by phone at (516) 378-7842 and delete the message.  Thank you
>>>>>> very
>>>>>> much.
>>>>>>
>>>>>> __
>>>>>> ___
>>>>>> ceph-users mailing list
>>>>>> ceph-users@lists.ceph.com
>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>>
>>>>>>
>>>>>>
>>>>>> ___
>>>>>> ceph-users mailing list
>>>>>> ceph-users@lists.ceph.com
>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>>
>>
>>
>> --
>> __
>>
>> This message contains information which may be confidential.  Unless you
>> are the addressee (or authorized to receive for the addressee), you may not
>> use, copy, or disclose to anyone the message or any information contained
>> in the message.  If you have received the message in error, please advise
>> the sender by reply e-mail or contact the sender at Price Paper & Twine
>> Company by phone at (516) 378-7842 and delete the message.  Thank you very
>> much.
>>
>> __
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous on armhf

2017-12-18 Thread Peter Woodman
not the larger "intensive" instance types! they go up to 128gb ram.

On Mon, Dec 18, 2017 at 4:46 PM, Ean Price  wrote:
> The problem with the native build on armhf is the compilation exceeds the 2 
> GB of memory that ARMv7 (armhf) supports. Scaleway is pretty awesome but 
> their 32 bit ARM systems have the same 2 GB limit. I haven’t tried the 
> cross-compile on the 64 bit ARMv8 they offer and that might be easier than 
> trying to do it on x86_64.
>
>> On Dec 18, 2017, at 4:41 PM, Peter Woodman  wrote:
>>
>> https://www.scaleway.com/
>>
>> they rent access to arm servers with gobs of ram.
>>
>> i've been building my own, but with some patches (removal of some
>> asserts that were unnecessarily causing crashes while i try and track
>> down the bug) that make it unsuitable for public consumption
>>
>> On Mon, Dec 18, 2017 at 4:38 PM, Andrew Knapp  wrote:
>>> I have no idea what this response means.
>>>
>>> I have tried building the armhf and arm64 package on my raspberry pi 3 to
>>> no avail.  Would love to see someone post Debian packages for stretch on
>>> arm64 or armhf.
>>>
>>> On Dec 18, 2017 4:12 PM, "Peter Woodman"  wrote:
>>>>
>>>> YMMV, but I've been using Scaleway instances to build packages for
>>>> arm64- AFAIK you should be able to run any armhf distro on those
>>>> machines as well.
>>>>
>>>> On Mon, Dec 18, 2017 at 4:02 PM, Andrew Knapp  wrote:
>>>>> I would also love to see these packages!!!
>>>>>
>>>>> On Dec 18, 2017 3:46 PM, "Ean Price"  wrote:
>>>>>
>>>>> Hi everyone,
>>>>>
>>>>> I have a test cluster of armhf arch SoC systems running Xenial and Jewel
>>>>> (10.2). I’m looking to do a clean rebuild with Luminous (12.2) but there
>>>>> are
>>>>> no 32 bit armhf binaries available. This is just a toy cluster and not
>>>>> in
>>>>> production.
>>>>>
>>>>> I have tried, unsuccessfully, to compile from source but they only have
>>>>> 2GB
>>>>> of memory and the system runs out of memory even with tuning and dialing
>>>>> back compile options. I have tinkered around with cross compiling but
>>>>> that
>>>>> seems to land me in dependency hell on Xenial and I am a cross compile
>>>>> newbie at any rate.
>>>>>
>>>>> Does anyone know of a source for a packaged version of Luminous for the
>>>>> armhf architecture? Like I said, it’s just a test cluster so I’m not
>>>>> overly
>>>>> concerned about stability.
>>>>>
>>>>> Thanks in advance,
>>>>> Ean
>>>>> --
>>>>> __
>>>>>
>>>>> This message contains information which may be confidential.  Unless you
>>>>> are the addressee (or authorized to receive for the addressee), you may
>>>>> not
>>>>> use, copy, or disclose to anyone the message or any information
>>>>> contained
>>>>> in the message.  If you have received the message in error, please
>>>>> advise
>>>>> the sender by reply e-mail or contact the sender at Price Paper & Twine
>>>>> Company by phone at (516) 378-7842 and delete the message.  Thank you
>>>>> very
>>>>> much.
>>>>>
>>>>> __
>>>>> ___
>>>>> ceph-users mailing list
>>>>> ceph-users@lists.ceph.com
>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>
>>>>>
>>>>>
>>>>> ___
>>>>> ceph-users mailing list
>>>>> ceph-users@lists.ceph.com
>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>
>
>
> --
> __
>
> This message contains information which may be confidential.  Unless you
> are the addressee (or authorized to receive for the addressee), you may not
> use, copy, or disclose to anyone the message or any information contained
> in the message.  If you have received the message in error, please advise
> the sender by reply e-mail or contact the sender at Price Paper & Twine
> Company by phone at (516) 378-7842 and delete the message.  Thank you very
> much.
>
> __
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous on armhf

2017-12-18 Thread Peter Woodman
https://www.scaleway.com/

they rent access to arm servers with gobs of ram.

i've been building my own, but with some patches (removal of some
asserts that were unnecessarily causing crashes while i try and track
down the bug) that make it unsuitable for public consumption

On Mon, Dec 18, 2017 at 4:38 PM, Andrew Knapp  wrote:
> I have no idea what this response means.
>
>  I have tried building the armhf and arm64 package on my raspberry pi 3 to
> no avail.  Would love to see someone post Debian packages for stretch on
> arm64 or armhf.
>
> On Dec 18, 2017 4:12 PM, "Peter Woodman"  wrote:
>>
>> YMMV, but I've been using Scaleway instances to build packages for
>> arm64- AFAIK you should be able to run any armhf distro on those
>> machines as well.
>>
>> On Mon, Dec 18, 2017 at 4:02 PM, Andrew Knapp  wrote:
>> > I would also love to see these packages!!!
>> >
>> > On Dec 18, 2017 3:46 PM, "Ean Price"  wrote:
>> >
>> > Hi everyone,
>> >
>> > I have a test cluster of armhf arch SoC systems running Xenial and Jewel
>> > (10.2). I’m looking to do a clean rebuild with Luminous (12.2) but there
>> > are
>> > no 32 bit armhf binaries available. This is just a toy cluster and not
>> > in
>> > production.
>> >
>> > I have tried, unsuccessfully, to compile from source but they only have
>> > 2GB
>> > of memory and the system runs out of memory even with tuning and dialing
>> > back compile options. I have tinkered around with cross compiling but
>> > that
>> > seems to land me in dependency hell on Xenial and I am a cross compile
>> > newbie at any rate.
>> >
>> > Does anyone know of a source for a packaged version of Luminous for the
>> > armhf architecture? Like I said, it’s just a test cluster so I’m not
>> > overly
>> > concerned about stability.
>> >
>> > Thanks in advance,
>> > Ean
>> > --
>> > __
>> >
>> > This message contains information which may be confidential.  Unless you
>> > are the addressee (or authorized to receive for the addressee), you may
>> > not
>> > use, copy, or disclose to anyone the message or any information
>> > contained
>> > in the message.  If you have received the message in error, please
>> > advise
>> > the sender by reply e-mail or contact the sender at Price Paper & Twine
>> > Company by phone at (516) 378-7842 and delete the message.  Thank you
>> > very
>> > much.
>> >
>> > __
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >
>> >
>> >
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous on armhf

2017-12-18 Thread Peter Woodman
YMMV, but I've been using Scaleway instances to build packages for
arm64- AFAIK you should be able to run any armhf distro on those
machines as well.

On Mon, Dec 18, 2017 at 4:02 PM, Andrew Knapp  wrote:
> I would also love to see these packages!!!
>
> On Dec 18, 2017 3:46 PM, "Ean Price"  wrote:
>
> Hi everyone,
>
> I have a test cluster of armhf arch SoC systems running Xenial and Jewel
> (10.2). I’m looking to do a clean rebuild with Luminous (12.2) but there are
> no 32 bit armhf binaries available. This is just a toy cluster and not in
> production.
>
> I have tried, unsuccessfully, to compile from source but they only have 2GB
> of memory and the system runs out of memory even with tuning and dialing
> back compile options. I have tinkered around with cross compiling but that
> seems to land me in dependency hell on Xenial and I am a cross compile
> newbie at any rate.
>
> Does anyone know of a source for a packaged version of Luminous for the
> armhf architecture? Like I said, it’s just a test cluster so I’m not overly
> concerned about stability.
>
> Thanks in advance,
> Ean
> --
> __
>
> This message contains information which may be confidential.  Unless you
> are the addressee (or authorized to receive for the addressee), you may not
> use, copy, or disclose to anyone the message or any information contained
> in the message.  If you have received the message in error, please advise
> the sender by reply e-mail or contact the sender at Price Paper & Twine
> Company by phone at (516) 378-7842 and delete the message.  Thank you very
> much.
>
> __
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Random checksum errors (bluestore on Luminous)

2017-12-10 Thread Peter Woodman
IIRC there was a bug related to bluestore compression fixed between
12.2.1 and 12.2.2

On Sun, Dec 10, 2017 at 5:04 PM, Martin Preuss  wrote:
> Hi,
>
>
> Am 10.12.2017 um 22:06 schrieb Peter Woodman:
>> Are you using bluestore compression?
> [...]
>
> As a matter of fact, I do. At least for one of the 5 pools, exclusively
> used with CephFS (I'm using CephFS as a way to achieve high availability
> while replacing an NFS server).
>
> However, I see these checksum errors also with uncompressed pools...
>
>
>
> Regards
> Martin
>
>
>
> --
> "Things are only impossible until they're not"
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Random checksum errors (bluestore on Luminous)

2017-12-10 Thread Peter Woodman
Are you using bluestore compression?

On Sun, Dec 10, 2017 at 1:45 PM, Martin Preuss  wrote:
> Hi (again),
>
> meanwhile I tried
>
> "ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-0"
>
> but that resulted in a segfault (please see attached console log).
>
>
> Regards
> Martin
>
>
> Am 10.12.2017 um 14:34 schrieb Martin Preuss:
>> Hi,
>>
>> I'm new to Ceph. I started a ceph cluster from scratch on DEbian 9,
>> consisting of 3 hosts, each host has 3-4 OSDs (using 4TB hdds, currently
>> totalling 10 hdds).
>>
>> Right from the start I always received random scrub errors telling me
>> that some checksums didn't match the expected value, fixable with "ceph
>> pg repair".
>>
>> I looked at the ceph-osd logfiles on each of the hosts and compared with
>> the corresponding syslogs. I never found any hardware error, so there
>> was no problem reading or writing a sector hardware-wise. Also there was
>> never any other suspicious syslog entry around the time of checksum
>> error reporting.
>>
>> When I looked at the checksum error entries I found that the reported
>> bad checksum always was "0x6706be76".
>>
>> Could someone please tell me where to look further for the source of the
>> problem?
>>
>> I appended an excerpt of the osd logs.
>>
>>
>> Kind regards
>> Martin
>>
>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
>
> --
> "Things are only impossible until they're not"
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] The way to minimize osd memory usage?

2017-12-10 Thread Peter Woodman
I've had some success in this configuration by cutting the bluestore
cache size down to 512mb and running only one OSD on an 8tb drive. Still get
occasional OOMs, but not terrible. Don't expect wonderful performance,
though.

Two OSDs would really be pushing it.

On Sun, Dec 10, 2017 at 10:05 AM, David Turner  wrote:
> The docs recommend 1GB/TB of OSDs. I saw people asking if this was still
> accurate for bluestore and the answer was that it is more true for bluestore
> than filestore. There might be a way to get this working at the cost of
> performance. I would look at Linux kernel memory settings as much as ceph
> and bluestore settings. Cache pressure is one that comes to mind that an
> aggressive setting might help.
>
>
> On Sat, Dec 9, 2017, 11:33 PM shadow_lin  wrote:
>>
>> The 12.2.1 (12.2.1-249-g42172a4 (42172a443183ffe6b36e85770e53fe678db293bf))
>> we are running includes the memory issue fix, and we are working on
>> upgrading to the 12.2.2 release to see if there is any further improvement.
>>
>> 2017-12-10
>> 
>> lin.yunfan
>> 
>>
>> From: Konstantin Shalygin
>> Sent: 2017-12-10 12:29
>> Subject: Re: [ceph-users] The way to minimize osd memory usage?
>> To: "ceph-users"
>> Cc: "shadow_lin"
>>
>>
>> > I am testing running ceph luminous(12.2.1-249-g42172a4
>> > (42172a443183ffe6b36e85770e53fe678db293bf) on ARM server.
>> Try new 12.2.2 - this release should fix memory issues with Bluestore.
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-disk removal roadmap (was ceph-disk is now deprecated)

2017-11-30 Thread Peter Woodman
How quickly are you planning to cut 12.2.3?

On Thu, Nov 30, 2017 at 4:25 PM, Alfredo Deza  wrote:
> Thanks all for your feedback on deprecating ceph-disk, we are very
> excited to be able to move forwards on a much more robust tool and
> process for deploying and handling activation of OSDs, removing the
> dependency on UDEV which has been a tremendous source of constant
> issues.
>
> Initially (see "killing ceph-disk" thread [0]) we planned for removal
> of Mimic, but we didn't want to introduce the deprecation warnings up
> until we had an out for those who had OSDs deployed in previous
> releases with ceph-disk (we are now able to handle those as well).
> That is the reason ceph-volume, although present since the first
> Luminous release, hasn't been pushed forward much.
>
> Now that we feel like we can cover almost all cases, we would really
> like to see a wider usage so that we can improve on issues/experience.
>
> Given that 12.2.2 is already in the process of getting released, we
> can't undo the deprecation warnings for that version, but we will
> remove them for 12.2.3, add them back again in Mimic, which will mean
> ceph-disk will be kept around a bit longer, and finally fully removed
> by N.
>
> To recap:
>
> * ceph-disk deprecation warnings will stay for 12.2.2
> * deprecation warnings will be removed in 12.2.3 (and from all later
> Luminous releases)
> * deprecation warnings will be added again in ceph-disk for all Mimic releases
> * ceph-disk will no longer be available for the 'N' release, along
> with the UDEV rules
>
> I believe these four points address most of the concerns voiced in
> this thread, and should give enough time to port clusters over to
> ceph-volume.
>
> [0] 
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-October/021358.html
>
> On Thu, Nov 30, 2017 at 8:22 AM, Daniel Baumann  wrote:
>> On 11/30/17 14:04, Fabian Grünbichler wrote:
>>> point is - you should not purposefully attempt to annoy users and/or
>>> downstreams by changing behaviour in the middle of an LTS release cycle,
>>
>> exactly. upgrading the patch level (x.y.z to x.y.z+1) should imho never
>> introduce a behaviour-change, regardless if it's "just" adding new
>> warnings or not.
>>
>> this is a stable update we're talking about, even more so since it's an
>> LTS release. you never know how people use stuff (e.g. by parsing stupid
>> things), so such behaviour-change will break stuff for *some* people
>> (granted, most likely a really low number).
>>
>> my expectation of a stable release is that it stays, literally, stable.
>> that's the whole point of having it in the first place. otherwise we
>> would all be running git snapshots and updating randomly to newer ones.
>>
>> adding deprecation messages in mimic makes sense, and getting rid of
>> it / not providing support for it in mimic+1 is reasonable.
>>
>> Regards,
>> Daniel
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] HW Raid vs. Multiple OSD

2017-11-13 Thread Peter Maloney
Once you've replaced an OSD, you'll see it is quite simple... doing it
for a few is not much more work (you've scripted it, right?). I don't
see RAID as giving any benefit here at all. It's not tricky...it's
perfectly normal operation. Just get used to ceph, and it'll be as
normal as replacing a RAID disk. And for performance degradation, maybe
it could be better on either... or better on ceph if you don't mind
setting the recovery/backfill rate to the lowest... but when the QoS functionality is
ready, probably ceph will be much better. Also RAID will cost you more
for hardware.

And raid5 is really bad for IOPS. And ceph already replicates, so you
will have 2 layers of redundancy... and ceph does it cluster wide, not
just one machine. Using ceph with replication is like having all your free
space as hot spares... you could lose 2 disks on all your machines, and
it can still run (assuming it had time to recover in between, and enough
space). And you don't want min_size=1, but if you have 2 layers of
redundancy you'll probably be tempted to set it that way.
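
As a pool-level sketch of the above (the pool name "rbd" is just an example):

  ceph osd pool set rbd size 3
  ceph osd pool set rbd min_size 2

With size 3 / min_size 2 a PG keeps serving I/O with one copy missing, but
stops before it is down to a single writable copy.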

But for some workloads, like RBD, ceph doesn't balance out the workload
very evenly for a specific client, only many clients at once... raid
might help solve that, but I don't see it as worth it.

I would just software RAID1 the OS and mons, and mds, not the OSDs.

On 11/13/17 12:26, Oscar Segarra wrote:
> Hi, 
>
> I'm designing my infraestructure. I want to provide 8TB (8 disks x 1TB
> each) of data per host just for Microsoft Windows 10 VDI. In each host
> I will have storage (ceph osd) and compute (on kvm).
>
> I'd like to hear your opinion about theese two configurations:
>
> 1.- RAID5 with 8 disks (I will have 7TB but for me it is enough) + 1
> OSD daemon
> 2.- 8 OSD daemons
>
> I'm a little bit worried that 8 osd daemons can affect performance
> because all jobs running and scrubbing.
>
> Another question is the procedure of a replacement of a failed disk.
> In case of a big RAID, replacement is direct. In case of many OSDs,
> the procedure is a little bit tricky.
>
> http://ceph.com/geen-categorie/admin-guide-replacing-a-failed-disk-in-a-ceph-cluster/
>
> What is your advice?
>
> Thanks a lot everybody in advance...
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


-- 


Peter Maloney
Brockmann Consult
Max-Planck-Str. 2
21502 Geesthacht
Germany
Tel: +49 4152 889 300
Fax: +49 4152 889 333
E-mail: peter.malo...@brockmann-consult.de
Internet: http://www.brockmann-consult.de


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cluster hang (deep scrub bug? "waiting for scrub")

2017-11-10 Thread Peter Maloney
I have often seen a problem where a single osd in an eternal deep scrub
will hang any client trying to connect. Stopping or restarting that
single OSD fixes the problem.

Do you use snapshots?

Here's what the scrub bug looks like (where that many seconds is 14 hours):

> ceph daemon "osd.$osd_number" dump_blocked_ops

>  {
>  "description": "osd_op(client.6480719.0:2000419292 4.a27969ae
> rbd_data.46820b238e1f29.aa70 [set-alloc-hint object_size
> 524288 write_size 524288,write 0~4096] snapc 16ec0=[16ec0]
> ack+ondisk+write+known_if_redirected e148441)",
>  "initiated_at": "2017-09-12 20:04:27.987814",
>  "age": 49315.666393,
>  "duration": 49315.668515,
>  "type_data": [
>  "delayed",
>  {
>  "client": "client.6480719",
>  "tid": 2000419292
>  },
>  [
>  {
>  "time": "2017-09-12 20:04:27.987814",
>  "event": "initiated"
>  },
>  {
>  "time": "2017-09-12 20:04:27.987862",
>  "event": "queued_for_pg"
>  },
>  {
>  "time": "2017-09-12 20:04:28.004142",
>  "event": "reached_pg"
>  },
>  {
>  "time": "2017-09-12 20:04:28.004219",
>  "event": "waiting for scrub"
>  }
>  ]
>  ]
>  }
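
If you want to check for this yourself, a rough sketch (assuming
systemd-managed OSDs; run the loop on each OSD host, since the admin socket
is only reachable locally):

  # count blocked ops per local OSD admin socket
  for sock in /var/run/ceph/ceph-osd.*.asok; do
      echo -n "$sock: "; ceph daemon "$sock" dump_blocked_ops | grep -c '"description"'
  done
  # once the offending OSD is found, restarting it clears the hang
  systemctl restart ceph-osd@<id>

(<id> is a placeholder for the OSD number you identified.)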






On 11/09/17 17:20, Matteo Dacrema wrote:
> Update:  I noticed that there was a pg that remained scrubbing from the first 
> day I found the issue to when I reboot the node and problem disappeared.
> Can this cause the behaviour I described before?
>
>
>> On 09 Nov 2017, at 15:55, Matteo Dacrema wrote:
>>
>> Hi all,
>>
>> I’ve experienced a strange issue with my cluster.
>> The cluster is composed of 10 HDD nodes with 20 OSDs + 4 journal devices each, 
>> plus 4 SSD nodes with 5 SSDs each.
>> All the nodes are behind 3 monitors and 2 different crush maps.
>> All the cluster is on 10.2.7 
>>
>> About 20 days ago I started to notice that long backups hang with "task 
>> jbd2/vdc1-8:555 blocked for more than 120 seconds” on the HDD crush map.
>> A few days ago another VM started to have high iowait without doing iops 
>> also on the HDD crush map.
>>
>> Today about a hundred VMs weren't able to read/write from many volumes, all 
>> of them on the HDD crush map. Ceph health was ok and no significant log entries 
>> were found.
>> Not all the VMs experienced this problem and in the meanwhile the iops on 
>> the journal and HDDs was very low even if I was able to do significant iops 
>> on the working VMs.
>>
>> After two hours of debug I decided to reboot one of the OSD nodes and the 
>> cluster start to respond again. Now the OSD node is back in the cluster and 
>> the problem is disappeared.
>>
>> Can someone help me to understand what happened?
>> I see strange entries in the log files like:
>>
>> accept replacing existing (lossy) channel (new one lossy=1)
>> fault with nothing to send, going to standby
>> leveldb manual compact 
>>
>> I can share all the logs that can help to identify the issue.
>>
>> Thank you.
>> Regards,
>>
>> Matteo
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


-- 


Peter Maloney
Brockmann Consult
Max-Planck-Str. 2
21502 Geesthacht
Germany
Tel: +49 4152 889 300
Fax: +49 4152 889 333
E-mail: peter.malo...@brockmann-consult.de
Internet: http://www.brockmann-consult.de


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Needed help to setup a 3-way replication between 2 datacenters

2017-11-10 Thread Peter Linder
On 11/10/2017 7:17 AM, Sébastien VIGNERON wrote:
> Hi everyone,
>
> Beginner with Ceph, i’m looking for a way to do a 3-way replication
> between 2 datacenters as mention in ceph docs (but not describe).
>
> My goal is to keep access to the data (at least read-only access) even
> when the link between the 2 datacenters is cut and make sure at least
> one copy of the data exists in each datacenter.
If that is your goal, then why 3 way replication?

>
> I’m not sure how to implement such 3-way replication. With a rule?
> Based on the CEPH docs, I think of a rule:
> rule 3-way-replication_with_2_DC {
> ruleset 1
> type replicated
> min_size 2
> max_size 3

min_size and max_size here don't do what you expect them to. You need to
set min_size = 1 for a 2-way replicated cluster (beware of
inconsistencies if the link between the DCs goes down) and = 2 for a 3-way
replicated cluster, but the setting is on the pool, not in the crush rule.
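
A sketch of applying that on the pool itself (using the rule name from your
draft; on pre-Luminous releases the option is crush_ruleset and takes the
ruleset number instead of the name):

  ceph osd pool set <pool> crush_rule 3-way-replication_with_2_DC
  ceph osd pool set <pool> size 3
  ceph osd pool set <pool> min_size 2

(<pool> is whichever pool you want the rule applied to.)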

> step take DC-1
> step choose firstn 1 type host
> step chooseleaf firstn 1 type osd
> step emit
> step take DC-2
> step choose firstn 1 type host
> step chooseleaf firstn 1 type osd
> step emit
> step take default
> step choose firstn 1 type host
> step chooseleaf firstn 1 type osd
> step emit
> }
> but what should happen if the link between the 2 datacenters is cut?
> If someone has a better solution, I interested by any resources about
> it (examples, …).

This seems to, for each pg, take an osd on a host in DC-1 and then an
osd on a host in DC-2, and then just a random osd on a random host
anywhere. 50% of the extra osds selected will be in DC1 and the rest in
DC2. When the link is cut, 50% of the pgs will not be able to fulfil the
min_size = 2 requirement (depending on whether the observer is in DC1 or
DC2 of course) and operations will stop on these. This should in practice
stop all operations, and I'm not even considering monitor quorum yet.

>
> The default rule (see below) keep the pool working when we mark each
> node of DC-2 as down (typically maintenance) but if we shut the link
> down between the 2 datacenters, the pool/rbd hangs (frozen writing dd
> tool for example).
> Does anyone have some insight on how to setup a 3-way replication
> between 2 datacenters?
I don't really know why there is a difference here. We opted for a 3 way
cluster in 3 separate datacenters though. Perhaps somehow you can
simulate 2 separate datacenters in one of yours, at least make sure they
are on different power circuits etc. Also, consider redundancy for your
network so that it does not go down. Spanning tree is a little slow, but
TRILL or SPB should work in your case.


>
> Thanks in advance for any advice on the topic.
>
> Current situation:
>
> Mons : host-1, host-2, host-4
>
> Quick network topology:
>
> USERS NETWORK
>      |
>    2x10G
>      |
>   DC-1-SWITCH <——— 40G ——> DC-2-SWITCH
>         | | |                   | | |
> host-1 _| | |           host-4 _| | |
> host-2 ___| |           host-5 ___| |
> host-3 _|           host-6 _|
>
>
>
> crushmap :
> # ceph osd tree
> ID  CLASS WEIGHT    TYPE NAME                   STATUS REWEIGHT PRI-AFF
>  -1       147.33325 root default
> -20        73.3     datacenter DC-1
> -15        73.3         rack DC-1-RACK-1
>  -9        24.4             host host-1
>  27   hdd   2.72839                 osd.27          up  1.0 1.0
>  28   hdd   2.72839                 osd.28          up  1.0 1.0
>  29   hdd   2.72839                 osd.29          up  1.0 1.0
>  30   hdd   2.72839                 osd.30          up  1.0 1.0
>  31   hdd   2.72839                 osd.31          up  1.0 1.0
>  32   hdd   2.72839                 osd.32          up  1.0 1.0
>  33   hdd   2.72839                 osd.33          up  1.0 1.0
>  34   hdd   2.72839                 osd.34          up  1.0 1.0
>  36   hdd   2.72839                 osd.36          up  1.0 1.0
> -11        24.4             host host-2
>  35   hdd   2.72839                 osd.35          up  1.0 1.0
>  37   hdd   2.72839                 osd.37          up  1.0 1.0
>  38   hdd   2.72839                 osd.38          up  1.0 1.0
>  39   hdd   2.72839                 osd.39          up  1.0 1.0
>  40   hdd   2.72839                 osd.40          up  1.0 1.0
>  41   hdd   2.72839                 osd.41          up  1.0 1.0
>  42   hdd   2.72839                 osd.42          up  1.0 1.0
>  43   hdd   2.72839                 osd.43          up  1.0 1.0
>  46   hdd   2.72839                 osd.46          up  1.0 1.0
> -13        24.4             host host-3
>  44   hdd   2.72839                 osd.44          up  1.0 1.0
>  45   hdd   2.72839                 osd.45          up  1.0 1.0
>  47   hdd   2.72839                 osd.47          up  1.0 1.0
>  48   hdd   2.72839                 osd.48          up  1.

Re: [ceph-users] ceph-backed VM drive became corrupted after unexpected VM termination

2017-11-07 Thread Peter Maloney
I see nobarrier in there... Try without that. (unless that's just the
bluestore xfs...then it probably won't change anything). And are the
osds using bluestore?

And what cache options did you set in the VM config? It's dangerous to
set writeback without also having this in the client-side ceph.conf:

rbd cache writethrough until flush = true
rbd_cache = true
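
In context, that usually means something like this on the hypervisor
(a sketch; the [client] section is the usual place for librbd options):

  [client]
  rbd cache = true
  rbd cache writethrough until flush = true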



On 11/07/17 14:36, Дробышевский, Владимир wrote:
> Hello!
>
>   I've got a weird situation with rbd drive image reliability. I found
> that after a hard reset, a VM with a ceph rbd drive from my new cluster
> becomes corrupted. I accidentally found it during HA tests of my new
> cloud cluster: after a host reset the VM was not able to boot again because
> of virtual drive errors. The same thing happens if you just kill the
> qemu process (as would happen at host crash time).
>
>   First of all I thought it was a guest OS problem. But then I tried
> RouterOS (linux based), Linux, FreeBSD - all options show the same
> behavior. 
>   Then I blamed the OpenNebula installation. For the test's sake I've
> installed the latest Proxmox (5.1-36) on another server. The first
> subtest: I've created a VM in OpenNebula from a predefined image, shut
> it down, then created a Proxmox VM and pointed it to the image that was
> created from OpenNebula.
> The second subtest: I've made a clean install from ISO from the
> Proxmox console, having previously created the Proxmox VM and drive
> image (of course, on the same ceph pool).
>   Both results: unbootable VMs.
>
>   Finally I've made a clean install to a fresh VM with a local
> LVM-backed drive image. And - guess what? - it survived the qemu process kill.
>   
>   This is the first situation of this kind in my practice so I would
> like to ask for guidance. I believe that it is a cache problem of some
> kind, but I haven't faced it with earlier releases.
>
>   Some cluster details:
>
>   It's a small test cluster with 4 nodes, each has:
>
>   2x CPU E5-2665,
>   128GB RAM
>   1 OSD with Samsung sm863 1.92TB drive
>   IB connection with IPoIB on QDR IB network
>
>   OS: Ubuntu 16.04 with 4.10 kernel
>   ceph: luminous 12.2.1
>
>   Client (kvm host) OSes: 
>   1. Ubuntu 16.04 (the same hosts as ceph cluster)
>   2. Debian 9.1 in case of Proxmox
>
>
> *ceph.conf:*
>
> [global]
> fsid = 6a8ffc55-fa2e-48dc-a71c-647e1fff749b
>
> public_network = 10.103.0.0/16 <http://10.103.0.0/16>
> cluster_network = 10.104.0.0/16 <http://10.104.0.0/16>
>
> mon_initial_members = e001n01, e001n02, e001n03
> mon_host = 10.103.0.1,10.103.0.2,10.103.0.3
>
> rbd default format = 2
>
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
>
> osd mount options = rw,noexec,nodev,noatime,nodiratime,nobarrier
> osd mount options xfs = rw,noexec,nodev,noatime,nodiratime,nobarrier
> osd_mkfs_type = xfs
>
> bluestore fsck on mount = true
>   
> debug_lockdep = 0/0
> debug_context = 0/0
> debug_crush = 0/0
> debug_buffer = 0/0
> debug_timer = 0/0
> debug_filer = 0/0
> debug_objecter = 0/0
> debug_rados = 0/0
> debug_rbd = 0/0
> debug_journaler = 0/0
> debug_objectcatcher = 0/0
> debug_client = 0/0
> debug_osd = 0/0
> debug_optracker = 0/0
> debug_objclass = 0/0
> debug_filestore = 0/0
> debug_journal = 0/0
> debug_ms = 0/0
> debug_monc = 0/0
> debug_tp = 0/0
> debug_auth = 0/0
> debug_finisher = 0/0
> debug_heartbeatmap = 0/0
> debug_perfcounter = 0/0
> debug_asok = 0/0
> debug_throttle = 0/0
> debug_mon = 0/0
> debug_paxos = 0/0
> debug_rgw = 0/0
>
> [osd]
> osd op threads = 4
> osd disk threads = 2
> osd max backfills = 1
> osd recovery threads = 1
> osd recovery max active = 1
>
> -- 
>
> Best regards,
> Vladimir
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


-- 


Peter Maloney
Brockmann Consult
Max-Planck-Str. 2
21502 Geesthacht
Germany
Tel: +49 4152 889 300
Fax: +49 4152 889 333
E-mail: peter.malo...@brockmann-consult.de
Internet: http://www.brockmann-consult.de


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph not recovering after osd/host failure

2017-10-16 Thread Peter Maloney
 device 65 osd.65 class hdd
> device 66 osd.66 class hdd
> device 67 osd.67 class hdd
> device 68 osd.68 class hdd
> device 69 osd.69 class hdd
> device 70 osd.70 class hdd
> device 71 osd.71 class hdd
> device 72 osd.72 class hdd
> device 73 osd.73 class hdd
> device 74 osd.74 class hdd
> device 75 osd.75 class hdd
> device 76 osd.76 class hdd
>
> # types
> type 0 osd
> type 1 host
> type 2 chassis
> type 3 rack
> type 4 row
> type 5 pdu
> type 6 pod
> type 7 room
> type 8 datacenter
> type 9 region
> type 10 root
>
> # buckets
> host storage004 {
>   id -3   # do not change unnecessarily
>   id -4 class hdd # do not change unnecessarily
>   # weight 20.003
>   alg straw2
>   hash 0  # rjenkins1
>   item osd.2 weight 1.818
>   item osd.4 weight 1.818
>   item osd.10 weight 1.818
>   item osd.16 weight 1.818
>   item osd.17 weight 1.818
>   item osd.19 weight 1.818
>   item osd.26 weight 1.818
>   item osd.29 weight 1.818
>   item osd.39 weight 1.818
>   item osd.43 weight 1.818
>   item osd.50 weight 1.818
> }
> host storage005 {
>   id -5   # do not change unnecessarily
>   id -6 class hdd # do not change unnecessarily
>   # weight 20.003
>   alg straw2
>   hash 0  # rjenkins1
>   item osd.3 weight 1.818
>   item osd.7 weight 1.818
>   item osd.12 weight 1.818
>   item osd.13 weight 1.818
>   item osd.15 weight 1.818
>   item osd.20 weight 1.818
>   item osd.25 weight 1.818
>   item osd.30 weight 1.818
>   item osd.34 weight 1.818
>   item osd.41 weight 1.818
>   item osd.47 weight 1.818
> }
> host storage002 {
>   id -7   # do not change unnecessarily
>   id -8 class hdd # do not change unnecessarily
>   # weight 20.003
>   alg straw2
>   hash 0  # rjenkins1
>   item osd.1 weight 1.818
>   item osd.5 weight 1.818
>   item osd.9 weight 1.818
>   item osd.21 weight 1.818
>   item osd.22 weight 1.818
>   item osd.23 weight 1.818
>   item osd.35 weight 1.818
>   item osd.36 weight 1.818
>   item osd.38 weight 1.818
>   item osd.42 weight 1.818
>   item osd.49 weight 1.818
> }
> host storage001 {
>   id -9   # do not change unnecessarily
>   id -10 class hdd# do not change unnecessarily
>   # weight 20.003
>   alg straw2
>   hash 0  # rjenkins1
>   item osd.0 weight 1.818
>   item osd.6 weight 1.818
>   item osd.8 weight 1.818
>   item osd.11 weight 1.818
>   item osd.14 weight 1.818
>   item osd.18 weight 1.818
>   item osd.24 weight 1.818
>   item osd.28 weight 1.818
>   item osd.33 weight 1.818
>   item osd.40 weight 1.818
>   item osd.45 weight 1.818
> }
> host storage003 {
>   id -11  # do not change unnecessarily
>   id -12 class hdd# do not change unnecessarily
>   # weight 20.003
>   alg straw2
>   hash 0  # rjenkins1
>   item osd.27 weight 1.818
>   item osd.31 weight 1.818
>   item osd.32 weight 1.818
>   item osd.37 weight 1.818
>   item osd.44 weight 1.818
>   item osd.46 weight 1.818
>   item osd.48 weight 1.818
>   item osd.54 weight 1.818
>   item osd.53 weight 1.818
>   item osd.59 weight 1.818
>   item osd.56 weight 1.818
> }
> host storage006 {
>   id -13  # do not change unnecessarily
>   id -14 class hdd# do not change unnecessarily
>   # weight 20.003
>   alg straw2
>   hash 0  # rjenkins1
>   item osd.51 weight 1.818
>   item osd.55 weight 1.818
>   item osd.58 weight 1.818
>   item osd.61 weight 1.818
>   item osd.63 weight 1.818
>   item osd.65 weight 1.818
>   item osd.66 weight 1.818
>   item osd.69 weight 1.818
>   item osd.71 weight 1.818
>   item osd.73 weight 1.818
>   item osd.75 weight 1.818
> }
> host storage007 {
>   id -15  # do not change unnecessarily
>   id -16 class hdd# do not change unnecessarily
>   # weight 20.003
>   alg straw2
>   hash 0  # rjenkins1
>   item osd.52 weight 1.818
>   item osd.57 weight 1.818
>   item osd.60 weight 1.818
>   item osd.62 weight 1.818
>   item osd.64 weight 1.818
>   item osd.67 weight 1.818
>   item osd.70 weight 1.818
>   item osd.68 weight 1.818
>   item osd.72 weight 1.818
>   item osd.74 weight 1.818
>   item osd.7

Re: [ceph-users] All replicas of pg 5.b got placed on the same host - how to correct?

2017-10-10 Thread Peter Linder
Probably chooseleaf also instead of choose. 

Konrad Riedel  wrote: (10 October 2017 17:05:52 CEST)
>Hello Ceph-users,
>
>after switching to luminous I was excited about the great
>crush-device-class feature - now we have 5 servers with 1x2TB NVMe
>based OSDs, 3 of them additionally with 4 HDDS per server. (we have
>only three 400G NVMe disks for block.wal and block.db and therefore
>can't distribute all HDDs evenly on all servers.)
>
>Output from "ceph pg dump" shows that some PGs end up on HDD OSDs on
>the same
>Host:
>
>ceph pg map 5.b
>osdmap e12912 pg 5.b (5.b) -> up [9,7,8] acting [9,7,8]
>
>(on rebooting this host I had 4 stale PGs)
>
>I've written a small perl script to add hostname after OSD number and
>got many PGs where
>ceph placed 2 replicas on the same host... :
>
>5.1e7: 8 - daniel 9 - daniel 11 - udo
>5.1eb: 10 - udo 7 - daniel 9 - daniel
>5.1ec: 10 - udo 11 - udo 7 - daniel
>5.1ed: 13 - felix 16 - felix 5 - udo
>
>
>Is there any way I can correct this?
>
>
>Please see crushmap below. Thanks for any help!
>
># begin crush map
>tunable choose_local_tries 0
>tunable choose_local_fallback_tries 0
>tunable choose_total_tries 50
>tunable chooseleaf_descend_once 1
>tunable chooseleaf_vary_r 1
>tunable chooseleaf_stable 1
>tunable straw_calc_version 1
>tunable allowed_bucket_algs 54
>
># devices
>device 0 osd.0 class hdd
>device 1 device1
>device 2 osd.2 class ssd
>device 3 device3
>device 4 device4
>device 5 osd.5 class hdd
>device 6 device6
>device 7 osd.7 class hdd
>device 8 osd.8 class hdd
>device 9 osd.9 class hdd
>device 10 osd.10 class hdd
>device 11 osd.11 class hdd
>device 12 osd.12 class hdd
>device 13 osd.13 class hdd
>device 14 osd.14 class hdd
>device 15 device15
>device 16 osd.16 class hdd
>device 17 device17
>device 18 device18
>device 19 device19
>device 20 device20
>device 21 device21
>device 22 device22
>device 23 device23
>device 24 osd.24 class hdd
>device 25 device25
>device 26 osd.26 class hdd
>device 27 osd.27 class hdd
>device 28 osd.28 class hdd
>device 29 osd.29 class hdd
>device 30 osd.30 class ssd
>device 31 osd.31 class ssd
>device 32 osd.32 class ssd
>device 33 osd.33 class ssd
>
># types
>type 0 osd
>type 1 host
>type 2 rack
>type 3 row
>type 4 room
>type 5 datacenter
>type 6 root
>
># buckets
>host daniel {
>   id -4   # do not change unnecessarily
>   id -2 class hdd # do not change unnecessarily
>   id -9 class ssd # do not change unnecessarily
>   # weight 3.459
>   alg straw2
>   hash 0  # rjenkins1
>   item osd.31 weight 1.819
>   item osd.7 weight 0.547
>   item osd.8 weight 0.547
>   item osd.9 weight 0.547
>}
>host felix {
>   id -5   # do not change unnecessarily
>   id -3 class hdd # do not change unnecessarily
>   id -10 class ssd# do not change unnecessarily
>   # weight 3.653
>   alg straw2
>   hash 0  # rjenkins1
>   item osd.33 weight 1.819
>   item osd.13 weight 0.547
>   item osd.14 weight 0.467
>   item osd.16 weight 0.547
>   item osd.0 weight 0.274
>}
>host udo {
>   id -6   # do not change unnecessarily
>   id -7 class hdd # do not change unnecessarily
>   id -11 class ssd# do not change unnecessarily
>   # weight 4.006
>   alg straw2
>   hash 0  # rjenkins1
>   item osd.32 weight 1.819
>   item osd.5 weight 0.547
>   item osd.10 weight 0.547
>   item osd.11 weight 0.547
>   item osd.12 weight 0.547
>}
>host moritz {
>   id -13  # do not change unnecessarily
>   id -14 class hdd# do not change unnecessarily
>   id -15 class ssd# do not change unnecessarily
>   # weight 1.819
>   alg straw2
>   hash 0  # rjenkins1
>   item osd.30 weight 1.819
>}
>host bruno {
>   id -16  # do not change unnecessarily
>   id -17 class hdd# do not change unnecessarily
>   id -18 class ssd# do not change unnecessarily
>   # weight 3.183
>   alg straw2
>   hash 0  # rjenkins1
>   item osd.24 weight 0.273
>   item osd.26 weight 0.273
>   item osd.27 weight 0.273
>   item osd.28 weight 0.273
>   item osd.29 weight 0.273
>   item osd.2 weight 1.819
>}
>root default {
>   id -1   # do not change unnecessarily
>   id -8 class hdd # do not change unnecessarily
>   id -12 class ssd# do not change unnecessarily
>   # weight 16.121
>   alg straw2
>   hash 0  # rjenkins1
>   item daniel weight 3.459
>   item felix weight 3.653
>   item udo weight 4.006
>   item moritz weight 1.819
>   item bruno weight 3.183
>}
>
># rules
>rule ssd {
>   id 0
>   type replicated
>   min_size 1
>   max_size 10
>   step take default class ssd
>   step choose firstn 0 type osd
>   step emit
>}
>rule hdd {
>   id 1
>   typ

Re: [ceph-users] All replicas of pg 5.b got placed on the same host - how to correct?

2017-10-10 Thread Peter Linder
I think your failure domain within your rules is wrong.

step choose firstn 0 type osd

Should be:

step choose firstn 0 type host
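
Something like this, as a sketch of the corrected hdd rule (combining it with
the chooseleaf suggestion from my other reply, so one OSD is picked per host
and no host ends up holding two replicas of the same PG):

rule hdd {
    id 1
    type replicated
    min_size 1
    max_size 10
    step take default class hdd
    step chooseleaf firstn 0 type host
    step emit
}

The ssd rule could be changed the same way; with one SSD per host it makes
little practical difference today, but it is safer if a host ever gets a
second SSD.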


On 10/10/2017 5:05 PM, Konrad Riedel wrote:
> Hello Ceph-users,
>
> after switching to luminous I was excited about the great
> crush-device-class feature - now we have 5 servers with 1x2TB NVMe
> based OSDs, 3 of them additionally with 4 HDDS per server. (we have
> only three 400G NVMe disks for block.wal and block.db and therefore
> can't distribute all HDDs evenly on all servers.)
>
> Output from "ceph pg dump" shows that some PGs end up on HDD OSDs on
> the same
> Host:
>
> ceph pg map 5.b
> osdmap e12912 pg 5.b (5.b) -> up [9,7,8] acting [9,7,8]
>
> (on rebooting this host I had 4 stale PGs)
>
> I've written a small perl script to add hostname after OSD number and
> got many PGs where
> ceph placed 2 replicas on the same host... :
>
> 5.1e7: 8 - daniel 9 - daniel 11 - udo
> 5.1eb: 10 - udo 7 - daniel 9 - daniel
> 5.1ec: 10 - udo 11 - udo 7 - daniel
> 5.1ed: 13 - felix 16 - felix 5 - udo
>
>
> Is there any way I can correct this?
>
>
> Please see crushmap below. Thanks for any help!
>
> # begin crush map
> tunable choose_local_tries 0
> tunable choose_local_fallback_tries 0
> tunable choose_total_tries 50
> tunable chooseleaf_descend_once 1
> tunable chooseleaf_vary_r 1
> tunable chooseleaf_stable 1
> tunable straw_calc_version 1
> tunable allowed_bucket_algs 54
>
> # devices
> device 0 osd.0 class hdd
> device 1 device1
> device 2 osd.2 class ssd
> device 3 device3
> device 4 device4
> device 5 osd.5 class hdd
> device 6 device6
> device 7 osd.7 class hdd
> device 8 osd.8 class hdd
> device 9 osd.9 class hdd
> device 10 osd.10 class hdd
> device 11 osd.11 class hdd
> device 12 osd.12 class hdd
> device 13 osd.13 class hdd
> device 14 osd.14 class hdd
> device 15 device15
> device 16 osd.16 class hdd
> device 17 device17
> device 18 device18
> device 19 device19
> device 20 device20
> device 21 device21
> device 22 device22
> device 23 device23
> device 24 osd.24 class hdd
> device 25 device25
> device 26 osd.26 class hdd
> device 27 osd.27 class hdd
> device 28 osd.28 class hdd
> device 29 osd.29 class hdd
> device 30 osd.30 class ssd
> device 31 osd.31 class ssd
> device 32 osd.32 class ssd
> device 33 osd.33 class ssd
>
> # types
> type 0 osd
> type 1 host
> type 2 rack
> type 3 row
> type 4 room
> type 5 datacenter
> type 6 root
>
> # buckets
> host daniel {
> id -4    # do not change unnecessarily
> id -2 class hdd    # do not change unnecessarily
> id -9 class ssd    # do not change unnecessarily
> # weight 3.459
> alg straw2
> hash 0    # rjenkins1
> item osd.31 weight 1.819
> item osd.7 weight 0.547
> item osd.8 weight 0.547
> item osd.9 weight 0.547
> }
> host felix {
> id -5    # do not change unnecessarily
> id -3 class hdd    # do not change unnecessarily
> id -10 class ssd    # do not change unnecessarily
> # weight 3.653
> alg straw2
> hash 0    # rjenkins1
> item osd.33 weight 1.819
> item osd.13 weight 0.547
> item osd.14 weight 0.467
> item osd.16 weight 0.547
> item osd.0 weight 0.274
> }
> host udo {
> id -6    # do not change unnecessarily
> id -7 class hdd    # do not change unnecessarily
> id -11 class ssd    # do not change unnecessarily
> # weight 4.006
> alg straw2
> hash 0    # rjenkins1
> item osd.32 weight 1.819
> item osd.5 weight 0.547
> item osd.10 weight 0.547
> item osd.11 weight 0.547
> item osd.12 weight 0.547
> }
> host moritz {
> id -13    # do not change unnecessarily
> id -14 class hdd    # do not change unnecessarily
> id -15 class ssd    # do not change unnecessarily
> # weight 1.819
> alg straw2
> hash 0    # rjenkins1
> item osd.30 weight 1.819
> }
> host bruno {
> id -16    # do not change unnecessarily
> id -17 class hdd    # do not change unnecessarily
> id -18 class ssd    # do not change unnecessarily
> # weight 3.183
> alg straw2
> hash 0    # rjenkins1
> item osd.24 weight 0.273
> item osd.26 weight 0.273
> item osd.27 weight 0.273
> item osd.28 weight 0.273
> item osd.29 weight 0.273
> item osd.2 weight 1.819
> }
> root default {
> id -1    # do not change unnecessarily
> id -8 class hdd    # do not change unnecessarily
> id -12 class ssd    # do not change unnecessarily
> # weight 16.121
> alg straw2
> hash 0    # rjenkins1
> item daniel weight 3.459
> item felix weight 3.653
> item udo weight 4.006
> item moritz weight 1.819
> item bruno weight 3.183
> }
>
> # rules
> rule ssd {
> id 0
> type replicated
> min_size 1
> max_size 10
> step take default class ssd
> step choose firstn 0 type osd
> step emit
> }
> rule hdd {
> id 1
> type replicated
> min_size 1
> max_

Re: [ceph-users] PGs get placed in the same datacenter (Trying to make a hybrid NVMe/HDD pool with 6 servers, 2 in each datacenter)

2017-10-09 Thread Peter Linder
I was able to get this working with the crushmap in my last post! I now
have the intended behavior together with the change of primary affinity
on the slow hdds. Very happy, performance is excellent.

One thing was a little weird though, I had to manually change the weight
of each hostgroup so that they are in the same ballpark. If they were
too far apart ceph couldn't properly allocate 3 buckets for each pg,
some ended up being in state "remapped" or "degraded".

When I changed the weights (The crush rule selects 3 out of 3 hostgroups
anyway so weight isn't even a consideration there) to similar values
that problem went away.

Perhaps that is a bug?
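
For reference, a sketch of how the hostgroup weights can be adjusted by
editing the crushmap directly (rather than reweighting individual OSDs):

  ceph osd getcrushmap -o crush.bin
  crushtool -d crush.bin -o crush.txt
  # edit the item weights of the hostgroup buckets in crush.txt
  crushtool -c crush.txt -o crush.new.bin
  ceph osd setcrushmap -i crush.new.bin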

/Peter

On 10/8/2017 3:22 PM, David Turner wrote:
>
> That's correct. It doesn't matter how many copies of the data you have
> in each datacenter. The mons control the maps and you should be good
> as long as you have 1 mon per DC. You should test this to see how the
> recovery goes, but there shouldn't be a problem.
>
>
> On Sat, Oct 7, 2017, 6:10 PM Дробышевский, Владимир  <mailto:v...@itgorod.ru>> wrote:
>
> 2017-10-08 2:02 GMT+05:00 Peter Linder
> mailto:peter.lin...@fiberdirekt.se>>:
>
>>
>> Then, I believe, the next best configuration would be to set
>> size for this pool to 4.  It would choose an NVMe as the
>> primary OSD, and then choose an HDD from each DC for the
>> secondary copies.  This will guarantee that a copy of the
>> data goes into each DC and you will have 2 copies in other
>> DCs away from the primary NVMe copy.  It wastes a copy of all
>> of the data in the pool, but that's on the much cheaper HDD
>> storage and can probably be considered acceptable losses for
>> the sake of having the primary OSD on NVMe drives.
> I have considered this, and it should of course work when it
> works so to say, but what if 1 datacenter is isolated while
> running? We would be left with 2 running copies on each side
> for all PGs, with no way of knowing what gets written where.
> In the end, data would be destroyed due to the split brain.
> Even being able to enforce quorum where the SSD is would mean
> a single point of failure.
>
>     In case you have one mon per DC all operations in the isolated DC
> will be frozen, so I believe you would not lose data.
>  
>
>
>
>>
>> On Sat, Oct 7, 2017 at 3:36 PM Peter Linder
>> > <mailto:peter.lin...@fiberdirekt.se>> wrote:
>>
>> On 10/7/2017 8:08 PM, David Turner wrote:
>>>
>>> Just to make sure you understand that the reads will
>>> happen on the primary osd for the PG and not the nearest
>>> osd, meaning that reads will go between the datacenters.
>>> Also that each write will not ack until all 3 writes
>>> happen adding the latency to the writes and reads both.
>>>
>>>
>>
>> Yes, I understand this. It is actually fine, the
>> datacenters have been selected so that they are about
>> 10-20km apart. This yields around a 0.1 - 0.2ms round
>> trip time due to speed of light being too low.
>> Nevertheless, latency due to network shouldn't be a
>> problem and it's all 40G (dedicated) TRILL network for
>> the moment.
>>
>> I just want to be able to select 1 SSD and 2 HDDs, all
>> spread out. I can do that, but one of the HDDs end up in
>> the same datacenter, probably because I'm using the
>> "take" command 2 times (resets selecting buckets?).
>>
>>
>>
>>> On Sat, Oct 7, 2017, 1:48 PM Peter Linder
>>> >> <mailto:peter.lin...@fiberdirekt.se>> wrote:
>>>
>>> On 10/7/2017 7:36 PM, Дробышевский, Владимир wrote:
>>>> Hello!
>>>>
>>>> 2017-10-07 19:12 GMT+05:00 Peter Linder
>>>> >>> <mailto:peter.lin...@fiberdirekt.se>>:
>>>>
>>>> The idea is to select an nvme osd, and
>>>> then select the rest from hdd osds in different
>>>> datacenters (see crush
>>>> map below for hierar

Re: [ceph-users] PGs get placed in the same datacenter (Trying to make a hybrid NVMe/HDD pool with 6 servers, 2 in each datacenter)

2017-10-08 Thread Peter Linder
cessarily
    id -65 class hdd    # do not change unnecessarily
    # weight 68.408
    alg straw2
    hash 0  # rjenkins1
    item hg3-1 weight 2.912
    item hg3-2 weight 65.496
    item hg3-3 weight 0.000
}
root ldc {
    id -42  # do not change unnecessarily
    id -53 class nvme   # do not change unnecessarily
    id -66 class hdd    # do not change unnecessarily
    # weight 270.721
    alg straw2
    hash 0  # rjenkins1
    item ldc1 weight 68.409
    item ldc2 weight 133.904
    item ldc3 weight 68.408
}

# rules
rule hybrid {
    id 1
    type replicated
    min_size 1
    max_size 10
    step take ldc
    step choose firstn 1 type datacenter
    step chooseleaf firstn 0 type hostgroup
    step emit
}
rule hdd {
    id 2
    type replicated
    min_size 1
    max_size 3
    step take default class hdd
    step chooseleaf firstn 0 type datacenter
    step emit
}
rule nvme {
    id 3
    type replicated
    min_size 1
    max_size 3
    step take default class nvme
    step chooseleaf firstn 0 type datacenter
    step emit
}

# end crush map





On 10/8/2017 3:22 PM, David Turner wrote:
>
> That's correct. It doesn't matter how many copies of the data you have
> in each datacenter. The mons control the maps and you should be good
> as long as you have 1 mon per DC. You should test this to see how the
> recovery goes, but there shouldn't be a problem.
>
>
> On Sat, Oct 7, 2017, 6:10 PM Дробышевский, Владимир  <mailto:v...@itgorod.ru>> wrote:
>
> 2017-10-08 2:02 GMT+05:00 Peter Linder
> mailto:peter.lin...@fiberdirekt.se>>:
>
>>
>> Then, I believe, the next best configuration would be to set
>> size for this pool to 4.  It would choose an NVMe as the
>> primary OSD, and then choose an HDD from each DC for the
>> secondary copies.  This will guarantee that a copy of the
>> data goes into each DC and you will have 2 copies in other
>> DCs away from the primary NVMe copy.  It wastes a copy of all
>> of the data in the pool, but that's on the much cheaper HDD
>> storage and can probably be considered acceptable losses for
>> the sake of having the primary OSD on NVMe drives.
> I have considered this, and it should of course work when it
> works so to say, but what if 1 datacenter is isolated while
> running? We would be left with 2 running copies on each side
> for all PGs, with no way of knowing what gets written where.
> In the end, data would be destroyed due to the split brain.
> Even being able to enforce quorum where the SSD is would mean
> a single point of failure.
>
>     In case you have one mon per DC all operations in the isolated DC
> will be frozen, so I believe you would not lose data.
>  
>
>
>
>>
>> On Sat, Oct 7, 2017 at 3:36 PM Peter Linder
>> > <mailto:peter.lin...@fiberdirekt.se>> wrote:
>>
>> On 10/7/2017 8:08 PM, David Turner wrote:
>>>
>>> Just to make sure you understand that the reads will
>>> happen on the primary osd for the PG and not the nearest
>>> osd, meaning that reads will go between the datacenters.
>>> Also that each write will not ack until all 3 writes
>>> happen adding the latency to the writes and reads both.
>>>
>>>
>>
>> Yes, I understand this. It is actually fine, the
>> datacenters have been selected so that they are about
>> 10-20km apart. This yields around a 0.1 - 0.2ms round
>> trip time due to speed of light being too low.
>> Nevertheless, latency due to network shouldn't be a
>> problem and it's all 40G (dedicated) TRILL network for
>> the moment.
>>
>> I just want to be able to select 1 SSD and 2 HDDs, all
>> spread out. I can do that, but one of the HDDs end up in
>> the same datacenter, probably because I'm using the
>> "take" command 2 times (resets selecting buckets?).
>>
>>
>>
>>> On Sat, Oct 7, 2017, 1:48 PM Peter Linder
>>> >> <mailto:peter.lin...@fiberdirekt.se>> wrote:
>>>
>>> On 10/7/2017 7:36 PM, Дробышевский, Владимир wrote:
>>>> Hello!
>>>>
>&

Re: [ceph-users] PGs get placed in the same datacenter (Trying to make a hybrid NVMe/HDD pool with 6 servers, 2 in each datacenter)

2017-10-07 Thread Peter Linder
On 10/7/2017 10:41 PM, David Turner wrote:
> Disclaimer, I have never attempted this configuration especially with
> Luminous. I doubt many have, but it's a curious configuration that I'd
> love to help see if it is possible.
Very generous of you :). (With that said, I suppose we are prepared to
pay for help to have this figured out. It makes me a little headachy and
there is budget space :)).
>
> There is 1 logical problem with your configuration (which you have
> most likely considered).  If you want all of your PGs to be primary on
> NVMe's across the 3 DC's, then you need to have 1/3 of your available
> storage (that you plan to use for this pool) be from NVMe's. 
> Otherwise they will fill up long before the HDDs and your cluster will
> be "full" while your HDDs are near empty.  I clarify "that you plan to
> use for this pool" because if you plan to put other stuff on just the
> HDDs, that is planning to utilize that extra space, then it's a part
> of the plan that your NVMe's don't total 1/3 of your storage.
We were going to use the left over HDD space for nearline archives,
intermediary backups etc.

>
> Second, I'm noticing that if a PG has a primary OSD in any datacenter
> other than TEG4, then it only has 1 other datacenter available to have
> its 2 HDD copies on.  If the rules were working properly, then I would
> expect the PG to be stuck undersized as opposed to choosing an OSD
> from a datacenter that it shouldn't be able to.  Potentially, you
> could test setting the size to 2 for this pool (while you're missing
> the third HDD node) to see if any PGs still end up on an HDD and NVMe
> in the same DC.  I think that likely you will find that PGs will still
> be able to use 2 copies in the same DC based on your current
> configuration.
I know. This server does not exist yet. It should be finished this
coming week (hardware is busy, task needs migrating). And yes, this does
not make testing this out any easier. It was an oversight to not have it
finished.

>
> Then, I believe, the next best configuration would be to set size for
> this pool to 4.  It would choose an NVMe as the primary OSD, and then
> choose an HDD from each DC for the secondary copies.  This will
> guarantee that a copy of the data goes into each DC and you will have
> 2 copies in other DCs away from the primary NVMe copy.  It wastes a
> copy of all of the data in the pool, but that's on the much cheaper
> HDD storage and can probably be considered acceptable losses for the
> sake of having the primary OSD on NVMe drives.
I have considered this, and it should of course work when it works so to
say, but what if 1 datacenter is isolated while running? We would be
left with 2 running copies on each side for all PGs, with no way of
knowing what gets written where. In the end, data would be destroyed due
to the split brain. Even being able to enforce quorum where the SSD is
would mean a single point of failure.

I was thinking instead I could define a crushmap where I make logical
datacenters that include the SSDs as they are spread out and the HDDs I
explicitly want to mirror each ssd set to, and make a crush rule to
enforce 3 copies of the data on 3 hosts within the "datacenter"
selected for each PG. I don't really know how to make such a "depth
first" rule though, but I will try tomorrow.

I was considering making 3 rules to map SSDs to HDDs and then 3 pools,
but that would leave me manually balancing load. And if one node went
down, some RBDs would completely lose their SSD read capability instead
of just 1/3 of it...  perhaps acceptable, but not optimal :)

/Peter


>
> On Sat, Oct 7, 2017 at 3:36 PM Peter Linder
> mailto:peter.lin...@fiberdirekt.se>> wrote:
>
> On 10/7/2017 8:08 PM, David Turner wrote:
>>
>> Just to make sure you understand that the reads will happen on
>> the primary osd for the PG and not the nearest osd, meaning that
>> reads will go between the datacenters. Also that each write will
>> not ack until all 3 writes happen adding the latency to the
>> writes and reads both.
>>
>>
>
> Yes, I understand this. It is actually fine, the datacenters have
> been selected so that they are about 10-20km apart. This yields
> around a 0.1 - 0.2ms round trip time due to speed of light being
> too low. Nevertheless, latency due to network shouldn't be a
> problem and it's all 40G (dedicated) TRILL network for the moment.
>
> I just want to be able to select 1 SSD and 2 HDDs, all spread out.
> I can do that, but one of the HDDs end up in the same datacenter,
> probably because I'm using the "take" command 2 times (resets
> selecti

Re: [ceph-users] PGs get placed in the same datacenter (Trying to make a hybrid NVMe/HDD pool with 6 servers, 2 in each datacenter)

2017-10-07 Thread Peter Linder
On 10/7/2017 8:08 PM, David Turner wrote:
>
> Just to make sure you understand that the reads will happen on the
> primary osd for the PG and not the nearest osd, meaning that reads
> will go between the datacenters. Also that each write will not ack
> until all 3 writes happen adding the latency to the writes and reads both.
>
>

Yes, I understand this. It is actually fine, the datacenters have been
selected so that they are about 10-20km apart. This yields around a 0.1
- 0.2ms round trip time due to speed of light being too low.
Nevertheless, latency due to network shouldn't be a problem and it's all
40G (dedicated) TRILL network for the moment.

I just want to be able to select 1 SSD and 2 HDDs, all spread out. I can
do that, but one of the HDDs end up in the same datacenter, probably
because I'm using the "take" command 2 times (resets selecting buckets?).


> On Sat, Oct 7, 2017, 1:48 PM Peter Linder  <mailto:peter.lin...@fiberdirekt.se>> wrote:
>
> On 10/7/2017 7:36 PM, Дробышевский, Владимир wrote:
>> Hello!
>>
>> 2017-10-07 19:12 GMT+05:00 Peter Linder
>> mailto:peter.lin...@fiberdirekt.se>>:
>>
>> The idea is to select an nvme osd, and
>> then select the rest from hdd osds in different datacenters
>> (see crush
>> map below for hierarchy). 
>>
>> It's a little bit aside of the question, but why do you want to
>> mix SSDs and HDDs in the same pool? Do you have read-intensive
>> workload and going to use primary-affinity to get all reads from
>> nvme?
>>  
>>
> Yes, this is pretty much the idea, getting the performance from
> NVMe reads, while still maintaining triple redundancy and a
> reasonable cost.
>
>
>> -- 
>> Regards,
>> Vladimir
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PGs get placed in the same datacenter (Trying to make a hybrid NVMe/HDD pool with 6 servers, 2 in each datacenter)

2017-10-07 Thread Peter Linder
Yes, I realized that, I updated it to 3.

On 10/7/2017 8:41 PM, Sinan Polat wrote:
> You are talking about the min_size, which should be 2 according to
> your text.
>
> Please be aware, the min_size in your CRUSH is _not_ the replica size.
> The replica size is set with your pools.
>
> Op 7 okt. 2017 om 19:39 heeft Peter Linder
> mailto:peter.lin...@fiberdirekt.se>> het
> volgende geschreven:
>
>> On 10/7/2017 7:36 PM, Дробышевский, Владимир wrote:
>>> Hello!
>>>
>>> 2017-10-07 19:12 GMT+05:00 Peter Linder >> <mailto:peter.lin...@fiberdirekt.se>>:
>>>
>>> The idea is to select an nvme osd, and
>>> then select the rest from hdd osds in different datacenters (see
>>> crush
>>> map below for hierarchy). 
>>>
>>> It's a little bit aside of the question, but why do you want to mix
>>> SSDs and HDDs in the same pool? Do you have read-intensive workload
>>> and going to use primary-affinity to get all reads from nvme?
>>>  
>>>
>> Yes, this is pretty much the idea, getting the performance from NVMe
>> reads, while still maintaining triple redundancy and a reasonable cost.
>>
>>
>>> -- 
>>> Regards,
>>> Vladimir
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PGs get placed in the same datacenter (Trying to make a hybrid NVMe/HDD pool with 6 servers, 2 in each datacenter)

2017-10-07 Thread Peter Linder
On 10/7/2017 7:36 PM, Дробышевский, Владимир wrote:
> Hello!
>
> 2017-10-07 19:12 GMT+05:00 Peter Linder  <mailto:peter.lin...@fiberdirekt.se>>:
>
> The idea is to select an nvme osd, and
> then select the rest from hdd osds in different datacenters (see crush
> map below for hierarchy). 
>
> It's a little bit aside of the question, but why do you want to mix
> SSDs and HDDs in the same pool? Do you have read-intensive workload
> and going to use primary-affinity to get all reads from nvme?
>  
>
Yes, this is pretty much the idea, getting the performance from NVMe
reads, while still maintaining triple redundancy and a reasonable cost.
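
For reference, pushing reads to the NVMe copy is done per OSD; a sketch, with
the HDD OSD ids as placeholders (older releases may need
mon_osd_allow_primary_affinity enabled before this takes effect):

  # make HDD OSDs ineligible to be primary, so the NVMe copy serves reads
  ceph osd primary-affinity osd.12 0
  ceph osd primary-affinity osd.13 0
  # ...and so on for the remaining HDD OSDs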


> -- 
> Regards,
> Vladimir


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] PGs get placed in the same datacenter (Trying to make a hybrid NVMe/HDD pool with 6 servers, 2 in each datacenter)

2017-10-07 Thread Peter Linder
Hello Ceph-users!

Ok, so I've got 3 separate datacenters (low latency network in between)
and I want to make a hybrid NVMe/HDD pool for performance and cost reasons.

There are 3 servers with NVMe based OSDs, and 2 servers with normal HDDs
(Yes, one is missing, will be 3 of course. It needs some more work and
will be added later), with 1 NVMe server and 1 HDD server in each
datacenter.

I've been trying to use a rule like this:

rule hybrid {
    id 1
    type replicated
    min_size 1
    max_size 3
    step take default class nvme
    step chooseleaf firstn 1 type datacenter
    step emit
    step take default class hdd
    step chooseleaf firstn -1 type datacenter
    step emit
}

(min_size should be 2, I know). The idea is to select an nvme osd, and
then select the rest from hdd osds in different datacenters (see crush
map below for hierarchy). This would work I think if each datacenter
only had nvme or hdd osds, but currently there are 2 servers of the
different kinds in each datacenter.

Output from "ceph pg dump" shows that some PGs end up in the same
datacenter:

2.6c8    47  0    0 0   0  197132288  613   
   613 active+clean 2017-10-07 14:27:33.943589  '613   :3446  [8,24]
  8  [8,24]

Here OSD 8 and OSD 24 are indeed of different types, but are in the same
datacenter so redundancy for this PG would be depending on a single
datacenter...

Is there any way I can rethink this?

Please see my full crushmap etc below. Thanks for any help!

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class nvme
device 1 osd.1 class nvme
device 2 osd.2 class nvme
device 3 osd.3 class nvme
device 4 osd.4 class nvme
device 5 osd.5 class nvme
device 6 osd.6 class nvme
device 7 osd.7 class nvme
device 8 osd.8 class nvme
device 9 osd.9 class nvme
device 10 osd.10 class nvme
device 11 osd.11 class nvme
device 12 osd.12 class hdd
device 13 osd.13 class hdd
device 14 osd.14 class hdd
device 15 osd.15 class hdd
device 16 osd.16 class hdd
device 17 osd.17 class hdd
device 18 osd.18 class hdd
device 19 osd.19 class hdd
device 20 osd.20 class hdd
device 21 osd.21 class hdd
device 22 osd.22 class hdd
device 23 osd.23 class hdd
device 24 osd.24 class hdd
device 25 osd.25 class hdd
device 26 osd.26 class hdd
device 27 osd.27 class hdd
device 28 osd.28 class hdd
device 29 osd.29 class hdd
device 30 osd.30 class hdd
device 31 osd.31 class hdd
device 32 osd.32 class hdd
device 33 osd.33 class hdd
device 34 osd.34 class hdd
device 35 osd.35 class hdd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host storage11 {
    id -5   # do not change unnecessarily
    id -6 class nvme    # do not change unnecessarily
    id -10 class hdd    # do not change unnecessarily
    # weight 2.912
    alg straw2
    hash 0  # rjenkins1
    item osd.0 weight 0.728
    item osd.3 weight 0.728
    item osd.6 weight 0.728
    item osd.9 weight 0.728
}
host storage21 {
    id -13  # do not change unnecessarily
    id -14 class nvme   # do not change unnecessarily
    id -15 class hdd    # do not change unnecessarily
    # weight 65.496
    alg straw2
    hash 0  # rjenkins1
    item osd.12 weight 5.458
    item osd.13 weight 5.458
    item osd.14 weight 5.458
    item osd.15 weight 5.458
    item osd.16 weight 5.458
    item osd.17 weight 5.458
    item osd.18 weight 5.458
    item osd.19 weight 5.458
    item osd.20 weight 5.458
    item osd.21 weight 5.458
    item osd.22 weight 5.458
    item osd.23 weight 5.458
}
datacenter HORN79 {
    id -19  # do not change unnecessarily
    id -26 class nvme   # do not change unnecessarily
    id -27 class hdd    # do not change unnecessarily
    # weight 68.406
    alg straw2
    hash 0  # rjenkins1
    item storage11 weight 2.911
    item storage21 weight 65.495
}
host storage13 {
    id -7   # do not change unnecessarily
    id -8 class nvme    # do not change unnecessarily
    id -11 class hdd    # do not change unnecessarily
    # weight 2.912
    alg straw2
    hash 0  # rjenkins1
    item osd.2 weight 0.728
    item osd.5 weight 0.728
    item osd.8 weight 0.728
    item osd.11 weight 0.728
}
host storage23 {
    id -16  # do not change unnecessarily
    id -17 class nvme   # do not change unnecessarily
    id -18 class hdd 

[ceph-users] bluestore compression statistics

2017-09-18 Thread Peter Gervai
Hello,

Is there any way to get compression stats of compressed bluestore storage?

Thanks,
Peter
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous CephFS on EC - how?

2017-08-30 Thread Peter Maloney
What kind of terrible mail client is this that sends a multipart message
where one part is blank and that's the one Thunderbird chooses to show?
(see blankness below)

Yes you're on the right track. As long as the main fs is on a replicated
pool (the one with omap), the ones below it (using file layouts) can be
EC without needing a cache pool.


a quote from your first url:
http://docs.ceph.com/docs/master/rados/operations/erasure-code/#erasure-coding-with-overwrites
>
> For Cephfs, using an erasure coded pool means setting that pool in a
> file layout .
>
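
For reference, a rough sketch of the steps on Luminous (pool name, PG count
and mount path are placeholders; EC overwrites require BlueStore OSDs):

  ceph osd pool create cephfs_ec 64 64 erasure
  ceph osd pool set cephfs_ec allow_ec_overwrites true
  ceph fs add_data_pool <fs_name> cephfs_ec
  # then point a directory at the EC pool via a file layout
  setfattr -n ceph.dir.layout.pool -v cephfs_ec /mnt/cephfs/archive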


On 08/30/17 08:21, Martin Millnert wrote:
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster in error state (full) with raw usage 32% of total capacity

2017-08-10 Thread Peter Maloney
I think a `ceph osd df` would be useful.

And how did you set up such a cluster? I don't see a root, and you have
each osd in there more than once...is that even possible?


On 08/10/17 08:46, Mandar Naik wrote:
> Hi,
>
> I am evaluating a Ceph cluster for a solution where Ceph could be used
> for provisioning pools which could be either stored local to a node or
> replicated across a cluster. This way Ceph could be used as a single
> solution for writing both local as well as replicated data. Local
> storage helps avoid the storage cost that comes with a replication
> factor of more than one, and still provides availability as long as the
> data host is alive.
>
> So I tried an experiment with a Ceph cluster where there is one crush
> rule which replicates data across nodes, and another one which only
> points to a crush bucket that has the local ceph osd. The cluster
> configuration is pasted below.
>
> Here I observed that if one of the disks is full (95%) the entire
> cluster goes into error state and stops accepting new writes from/to
> other nodes. So the ceph cluster became unusable even though it’s only
> 32% full. The writes are blocked even for pools which are not touching
> the full osd.
>
> I have tried playing around with the crush hierarchy but it did not
> help. So is it possible to store data in the above manner with Ceph?
> If yes, can we get the cluster back into a usable state after one of
> the nodes is full?
>
>
>
> # ceph df
>
>
> GLOBAL:
>
>SIZE AVAIL  RAW USED %RAW USED
>
>134G 94247M   43922M 31.79
>
>
> # ceph -s
>
>
>cluster ba658a02-757d-4e3c-7fb3-dc4bf944322f
>
> health HEALTH_ERR
>
>1 full osd(s)
>
>full,sortbitwise,require_jewel_osds flag(s) set
>
> monmap e3: 3 mons at
> {ip-10-0-9-122=10.0.9.122:6789/0,ip-10-0-9-146=10.0.9.146:6789/0,ip-10-0-9-210=10.0.9.210:6789/0
> <http://10.0.9.122:6789/0,ip-10-0-9-146=10.0.9.146:6789/0,ip-10-0-9-210=10.0.9.210:6789/0>}
>
>election epoch 14, quorum 0,1,2
> ip-10-0-9-122,ip-10-0-9-146,ip-10-0-9-210
>
> osdmap e93: 3 osds: 3 up, 3 in
>
>flags full,sortbitwise,require_jewel_osds
>
>  pgmap v630: 384 pgs, 6 pools, 43772 MB data, 18640 objects
>
>43922 MB used, 94247 MB / 134 GB avail
>
> 384 active+clean
>
>
> # ceph osd tree
>
>
> ID WEIGHT  TYPE NAME                   UP/DOWN REWEIGHT PRIMARY-AFFINITY
> -9 0.04399 rack ip-10-0-9-146-rack
> -8 0.04399     host ip-10-0-9-146
>  2 0.04399         osd.2                    up      1.0              1.0
> -7 0.04399 rack ip-10-0-9-210-rack
> -6 0.04399     host ip-10-0-9-210
>  1 0.04399         osd.1                    up      1.0              1.0
> -5 0.04399 rack ip-10-0-9-122-rack
> -3 0.04399     host ip-10-0-9-122
>  0 0.04399         osd.0                    up      1.0              1.0
> -4 0.13197 rack rep-rack
> -3 0.04399     host ip-10-0-9-122
>  0 0.04399         osd.0                    up      1.0              1.0
> -6 0.04399     host ip-10-0-9-210
>  1 0.04399         osd.1                    up      1.0              1.0
> -8 0.04399     host ip-10-0-9-146
>  2 0.04399         osd.2                    up      1.0              1.0
>
>
> # ceph osd crush rule list
>
> [
>
>"rep_ruleset",
>
>"ip-10-0-9-122_ruleset",
>
>"ip-10-0-9-210_ruleset",
>
>"ip-10-0-9-146_ruleset"
>
> ]
>
>
> # ceph osd crush rule dump rep_ruleset
>
> {
>
>"rule_id": 0,
>
>"rule_name": "rep_ruleset",
>
>"ruleset": 0,
>
>"type": 1,
>
>"min_size": 1,
>
>"max_size": 10,
>
>"steps": [
>
>{
>
>"op": "take",
>
>"item": -4,
>
>"item_name": "rep-rack"
>
>},
>
>{
>
>"op": "chooseleaf_firstn",
>
>"num": 0,
>
>"type": "host"
>
>},
>
>{
>
>"op": "emit"
>
>}
>
>]
>
> }
>
>
> # ceph osd crush rule dump ip-10-0-9-122_ruleset
>
> {
>
>"rule_id": 1,
>
>"rule_name": "ip-10-0-9-122_ruleset",
>
>"r

Re: [ceph-users] IO Error reaching client when primary osd get funky but secondaries are ok

2017-08-09 Thread Peter Gervai
Hello David,

On Wed, Aug 9, 2017 at 3:08 PM, David Turner  wrote:

> When exactly is the timeline of when the io error happened?

The timeline was included in the email, at hour:min:sec resolution. I
left out the milliseconds since they don't really change anything.

> If the primary
> osd was dead, but not marked down in the cluster yet,

The email showed when the osd went up, so before that it was supposed
to be down, as far as I can tell from the logs, unless there was an
up-down somewhere I have missed. I believe a boot-failed osd won't
come up.

> then the cluster would
> sit there and expect that osd to respond.

Suppose the osd would have been up and in (which I believe it wasn't),
and it fails to respond, what is supposed to happen? I thought
librados would see failure or timeout and would try to contact
secondaries, and definitely not send IO error upwards unless all
possibilities failed.

> If this definitely happened after
> the primary osd was marked down, then it's a different story.

Seems so, based on the logs I was able to correlate, but I cannot be
absolutely sure.

> I'm confused about you saying 1 osd was down/out and 2 other osds were down
> but not out.

Okay, that may have been a mistake on my part: there's 2 osds failed
and one was about to be replaced first, and since it have failed we
kind of hesitated to replace the other one. :-/ The email was heavily
trimmed to remove fluff, this info may have been missed. Sorry.

> Were these in the same host while you were replacing the disk?

The logs were gathered from many hosts and osds and mons, since events
happened simultaneously. The replacement happened on the same host, I
believe this is expected.

> Is your failure domain host or osd?

Host (and datacenter).

> What version of ceph are you running?

See the first line of my mail: version 0.94.10 (hammer)

Thanks,
Peter
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] IO Error reaching client when primary osd get funky but secondaries are ok

2017-08-09 Thread Peter Gervai
Hello,

ceph version 0.94.10 (b1e0532418e4631af01acbc0cedd426f1905f4af)

We had a few problems related to the simple operation of replacing a
failed OSD, and some clarification would be appreciated. It is not
very simple to observe what specifically happened (the timeline was
gathered from half a dozen logs), so apologies for any vagueness,
which could be fixed by looking at further specific logs on request.

The main problem was that we had to replace a failed OSD. There were 2
down+out but otherwise known (not deleted) OSDs. We removed (deleted)
one. That changes the CRUSH map, and rebalancing starts (no matter that
noout was set, since the OSD was already out anyway; it could only have
been stopped by the scary norecover flag, which was not set at the time;
I will check nobackfill/norebalance next time, which look safer).
Rebalancing finished fine (25% of objects were reported misplaced, which
is a PITA, but there were not many objects on that cluster). This is the
prologue; so far it's all fine.
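(For reference, a sketch of that flag juggling, using the standard cluster
flags around the disk replacement:)

  ceph osd set norebalance
  ceph osd set nobackfill
  # ... remove/recreate the osd, swap the disk ...
  ceph osd unset nobackfill
  ceph osd unset norebalance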

We plugged in (and created) the new OSD, but due to the environment
and some admin errors [wasn't me! :)] the OSD at start was not able
to umount its temporary filesystem, which seems to be used for initial
creation. What I observed [from the logs] is:

- 14:12:00, osd6 created, enters the osdmap, down+out
- 14:12:02, replaced osd6 started, boots, tries to create initial osd layout
- 14:12:03, osd6 crash due to failed umount / file not found
- 14:12:07, some other osds are logging warnings like (may not be important):
   misdirected client (some say that osd is not in the set, others just logged the pg)
- 14:12:07, one of the clients get IO error (this one was actually
pretty fatal):
   rbd: rbd1: write 1000 at 40779000 (379000)
   rbd: rbd1:   result -6 xferred 1000
   blk_update_request: I/O error, dev rbd1, sector 2112456
   EXT4-fs warning (device rbd1): ext4_end_bio:329: I/O error -6
writing to inode 399502 (offset 0 size 0 starting block 264058)
  Buffer I/O error on device rbd1, logical block 264057
- 14:12:17, other client gets IO error (this one's been lucky):
   rbd: rbd1: write 1000 at c84795000 (395000)
   rbd: rbd1:   result -6 xferred 1000
   blk_update_request: I/O error, dev rbd1, sector 105004200
- 14:12:27, libceph: osd6 weight 0x1 (in); in+down: the osd6 is
crashed at that point and hasn't been restarted yet

- 14:13:19, osd6 started again
- 14:13:22, libceph: osd6 up
- from this on everything's fine, apart from the crashed VM :-/

The main problem is of course the IO error which reached the client and
knocked out the FS while there were 2 replica osds active. I haven't
found the specifics of how this is handled when the primary fails, or
acts funky, which by my guess is what happened here.

I would like to understand why the IO error occurred, how to prevent it
if possible, and whether this is something which has already been taken
care of in later ceph versions.

Your shared wisdom would be appreciated.

Thanks,
Peter
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph osd safe to remove

2017-08-03 Thread Peter Maloney
On 08/03/17 11:05, Dan van der Ster wrote:
> On Fri, Jul 28, 2017 at 9:42 PM, Peter Maloney
>  wrote:
>> Hello Dan,
>>
>> Based on what I know and what people told me on IRC, this means basically the
>> condition that the osd is not acting nor up for any pg. And for one person
>> (fusl on irc) who said there was an unfound objects bug when he had size =
>> 1, he also said if reweight (and I assume crush weight) is 0, it will surely
>> be safe, but possibly it won't be otherwise.
>>
>> And so here I took my bc-ceph-reweight-by-utilization.py script that already
>> parses `ceph pg dump --format=json` (for up,acting,bytes,count of pgs) and
>> `ceph osd df --format=json` (for weight and reweight), and gutted out the
>> unneeded parts, and changed the report to show the condition I described as
>> True or False per OSD. So the ceph auth needs to allow ceph pg dump and ceph
>> osd df. The script is attached.
>>
>> The script doesn't assume you're ok with acting lower than size, or care
>> about min_size, and just assumes you want the OSD completely empty.
> Thanks for this script. In fact, I am trying to use the
> min_size/size-based removal heuristics. If we would be able to wait
> until an OSD is completely empty, then I suppose could just set the
> crush weight to 0 then wait for HEALTH_OK. For our procedures I'm
> trying to shortcut this with an earlier device removal.
>
> Cheers, Dan
Well, what this is intended for is: you can set some OSDs' weight to 0,
then later set others' weight to 0, etc., and before *all* of them are
done draining, you can already remove *some* that the script identifies
(no pgs are on that disk, even if other pgs are still being moved on
*other* disks). So it's a shortcut, but only by gaining knowledge, not by
sacrificing redundancy.

And I wasn't sure what you preferred... I definitely prefer to have my
full size achieved, not just min_size if I'm going to remove something.
Just like how you don't run raid5 on large disks, and instead use raid6,
and would only replace one disk at a time so you still have redundancy.

What do you use so keeping redundancy isn't important?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph osd safe to remove

2017-07-28 Thread Peter Maloney
Hello Dan,

Based on what I know and what people told me on IRC, this means basically
the condition that the osd is not acting nor up for any pg. And for one
person (fusl on irc) who said there was an unfound objects bug when he
had size = 1, he also said if reweight (and I assume crush weight) is 0,
it will surely be safe, but possibly it won't be otherwise.
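A rough one-liner equivalent of that check (assuming jq is available and
that the json pg dump has a "pg_stats" array with "up"/"acting" lists; the
exact layout can differ between releases):

  ceph pg dump --format=json 2>/dev/null \
    | jq -r --argjson id 7 '.pg_stats[] | select((.up + .acting) | index($id)) | .pgid'
  # no output (plus weight and reweight 0) means osd.7 holds no PGs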

And so here I took my bc-ceph-reweight-by-utilization.py script that
already parses `ceph pg dump --format=json` (for up,acting,bytes,count
of pgs) and `ceph osd df --format=json` (for weight and reweight), and
gutted out the unneeded parts, and changed the report to show the
condition I described as True or False per OSD. So the ceph auth needs
to allow ceph pg dump and ceph osd df. The script is attached.

The script doesn't assume you're ok with acting lower than size, or care
about min_size, and just assumes you want the OSD completely empty.

Sample output:

Real cluster:
> root@cephtest:~ # ./bc-ceph-empty-osds.py -a
> osd_id weight   reweight pgs_old bytes_old      pgs_new bytes_new      empty
>  0     4.00099  0.61998  38      1221853911536  38      1221853911536  False
>  1     4.00099  0.59834  43      1168531341347  43      1168531341347  False
>  2     4.00099  0.79213  44      1155260814435  44      1155260814435  False
> 27     4.00099  0.69459  39      1210145117377  39      1210145117377  False
> 30     6.00099  0.73933  56      1691992924542  56      1691992924542  False
> 31     6.00099  0.81180  64      1810503842054  64      1810503842054  False
> ...

Test cluster with some -nan and 0's in crush map:
> root@tceph1:~ # ceph osd df
> ID WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE VAR  PGS
>  4 1.00  0  0  0 -nan -nan   0
>  1 0.06439  1.0 61409M 98860k 61313M 0.16 0.93  47
>  0 0.06438  1.0 61409M   134M 61275M 0.22 1.29  59
>  2 0.06439  1.0 61409M 82300k 61329M 0.13 0.77  46
>  3   00  0  0  0 -nan -nan   0
>   TOTAL   179G   311M   179G 0.17  
> MIN/MAX VAR: 0.77/1.29  STDDEV: 0.04

> root@tceph1:~ # ./bc-ceph-empty-osds.py 
> osd_id weight   reweight pgs_old bytes_old pgs_new bytes_new empty
>  3     0.0      0.0      0       0         0       0         True
>  4     1.0      0.0      0       0         0       0         True
> root@tceph1:~ # ./bc-ceph-empty-osds.py -a
> osd_id weight   reweight pgs_old bytes_old pgs_new bytes_new empty
>  0     0.06438  1.0      59      46006167  59      46006167  False
>  1     0.06439  1.0      47      28792306  47      28792306  False
>  2     0.06439  1.0      46      17623485  46      17623485  False
>  3     0.0      0.0      0       0         0       0         True
>  4     1.0      0.0      0       0         0       0         True

The "old" vs "new" suffixes refer to the position of data now and after
recovery is complete, respectively. (the magic that made my reweight
script efficient compared to the official reweight script)

And I have not used such a method in the past... my cluster is small, so
I have always just let recovery completely finish instead. I hope you
find it useful and it develops from there.

Peter

On 07/28/17 15:36, Dan van der Ster wrote:
> Hi all,
>
> We are trying to outsource the disk replacement process for our ceph
> clusters to some non-expert sysadmins.
> We could really use a tool that reports if a Ceph OSD *would* or
> *would not* be safe to stop, e.g.
>
> # ceph-osd-safe-to-stop osd.X
> Yes it would be OK to stop osd.X
>
> (which of course means that no PGs would go inactive if osd.X were to
> be stopped).
>
> Does anyone have such a script that they'd like to share?
>
> Thanks!
>
> Dan
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


#!/usr/bin/env python3
#
# tells you if an osd is empty (no pgs up or acting, and no weight)
# (most of the code here was copied from bc-ceph-reweight-by-utilization.py)
#
# Author: Peter Maloney
# Licensed GNU GPLv2; if you did not recieve a copy of the license, get one at http://www.gnu.org/licenses/gpl-2.0.html

import sys
import subprocess
import re
import argparse
import time
import logging
import json

#
# global variables
#

osds = {}
health = ""
json_nan_regex = None

#
# logging
#

logging.VERBOSE = 15
def log_verbose(self, message, *args, **kws):
    if self.isEnabledFor(logging.VERBOSE):
        self.log(logging.VERBOSE, message, *args, **kws)

logging.addLevelName(logging.VERBOSE, "VERBOSE")

Re: [ceph-users] High iowait on OSD node

2017-07-27 Thread Peter Maloney
00 0.001.000.00 4.00 0.00
> 8.00 0.004.004.000.00   4.00   0.40
> sdp    0.00   0.00    0.00    0.00      0.00     0.00     0.00   0.00    0.00    0.00    0.00   0.00    0.00
> sdq    0.50   0.00  756.00    0.00  93288.00     0.00   246.79   1.47    1.95    1.95    0.00   1.17   88.60
> sdr    0.00   0.00    1.00    0.00      4.00     0.00     8.00   0.00    4.00    4.00    0.00   4.00    0.40
> sds    0.00   0.00    0.00    0.00      0.00     0.00     0.00   0.00    0.00    0.00    0.00   0.00    0.00
> sdt    0.00   0.00    0.00   36.50      0.00   643.50    35.26   3.49   95.73    0.00   95.73   2.63    9.60
> sdu    0.00   0.00    0.00   21.00      0.00   323.25    30.79   0.78   37.24    0.00   37.24   2.95    6.20
> sdv    0.00   0.00    0.00    0.00      0.00     0.00     0.00   0.00    0.00    0.00    0.00   0.00    0.00
> sdw    0.00   0.00    0.00   31.00      0.00   689.50    44.48   2.48   80.06    0.00   80.06   3.29   10.20
> sdx    0.00   0.00    0.00    0.00      0.00     0.00     0.00   0.00    0.00    0.00    0.00   0.00    0.00
> dm-0   0.00   0.00    0.00    0.50      0.00     6.00    24.00   0.00    8.00    0.00    8.00   8.00    0.40
> dm-1   0.00   0.00    0.00    0.00      0.00     0.00     0.00   0.00    0.00    0.00    0.00   0.00    0.00
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


-- 


Peter Maloney
Brockmann Consult
Max-Planck-Str. 2
21502 Geesthacht
Germany
Tel: +49 4152 889 300
Fax: +49 4152 889 333
E-mail: peter.malo...@brockmann-consult.de
Internet: http://www.brockmann-consult.de


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Adding multiple osd's to an active cluster

2017-07-19 Thread Peter Gervai
On Fri, Feb 17, 2017 at 10:42 AM, nigel davies  wrote:

> How is the best way to added multiple osd's to an active cluster?
> As the last time i done this i all most killed the VM's we had running on
> the cluster

You possibly mean that messing with OSDs caused the cluster to
reorganise the data, and the recovery/backfill slowed you down.

If that's the case see --osd-max-backfills and
--osd-recovery-max-active options, and use them like
$ ceph tell osd.* injectargs '--osd-max-backfills 1'
'--osd-recovery-max-active 1'
but be aware that slowing down recovery makes it take longer, and a
longer recovery means the cluster spends more time in a vulnerable state
(in case of any unfortunate event while recovering).

g
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Yet another performance tuning for CephFS

2017-07-18 Thread Peter Maloney
On 07/18/17 14:10, Gencer W. Genç wrote:
>>> Are you sure? Your config didn't show this.
> Yes. I have dedicated 10GbE network between ceph nodes. Each ceph node has 
> a separate network that has a 10GbE network card. Do I have to set 
> anything in the config for 10GbE?
Not for 10GbE, but for public vs cluster network, for example:

> public network = 10.10.10.0/24
> cluster network = 10.10.11.0/24

Mainly this is for replication performance.

And using jumbo frames (high MTU, like 9000, on hosts and higher on
switches) also increases performance a bit (especially on slow CPUs in
theory). That's also not in the ceph.conf.
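(Just as an illustration; interface names are placeholders and the switch
ports must allow the larger frames too:)

  ip link set dev eth2 mtu 9000
  ip link set dev eth3 mtu 9000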

>>> What kind of devices are they? did you do the journal test?
> They are neither NVMe nor SSDs. Each node has 10x 3TB SATA Hard 
> Disk Drives (HDD).
Then I'm not sure what to expect... probably poor performance with sync
writes on filestore, and not sure what would happen with bluestore...
probably much better than filestore though if you use a large block size.
>
>
> -Gencer.
>
>
> -Original Message-
> From: Peter Maloney [mailto:peter.malo...@brockmann-consult.de] 
> Sent: Tuesday, July 18, 2017 2:47 PM
> To: gen...@gencgiyen.com
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Yet another performance tuning for CephFS
>
> On 07/17/17 22:49, gen...@gencgiyen.com wrote:
>> I have a separate 10GbE network for ceph and another for public.
>>
> Are you sure? Your config didn't show this.
>
>> No they are not NVMe, unfortunately.
>>
> What kind of devices are they? did you do the journal test?
> http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
>
> Unlike most tests, with ceph journals, you can't look at the load on the 
> device and decide it's not the bottleneck; you have to test it another way. I 
> had some micron SSDs I tested which performed poorly, and that test showed 
> them performing poorly too. But from other benchmarks, and disk load during 
> journal tests, they looked ok, which was misleading.
>> Do you know any test command that i can try to see if this is the max.
>> Read speed from rsync?
> I don't know how you can improve your rsync test.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Yet another performance tuning for CephFS

2017-07-18 Thread Peter Maloney
On 07/17/17 22:49, gen...@gencgiyen.com wrote:
> I have a separate 10GbE network for ceph and another for public.
>
Are you sure? Your config didn't show this.

> No they are not NVMe, unfortunately.
>
What kind of devices are they? did you do the journal test?
http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/

Unlike most tests, with ceph journals, you can't look at the load on the
device and decide it's not the bottleneck; you have to test it another
way. I had some micron SSDs I tested which performed poorly, and that
test showed them performing poorly too. But from other benchmarks, and
disk load during journal tests, they looked ok, which was misleading.
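Roughly, the test from that link boils down to a small sync-write run
against the journal device; something like the following (device name is a
placeholder, and this destroys data on it, so use a scratch partition):

  fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k \
      --numjobs=1 --iodepth=1 --runtime=60 --time_based \
      --group_reporting --name=journal-test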
> Do you know any test command that i can try to see if this is the max.
> Read speed from rsync?
I don't know how you can improve your rsync test.
>
> Because I tried one thing a few minutes ago. I opened 4 ssh channels,
> ran rsync, and copied a big file to different targets in cephfs at the
> same time. Then I looked at the network graphs and saw numbers up to
> 1.09 gb/s. But why can't a single copy/rsync exceed 200mb/s? I really
> wonder what prevents it.
>
> Gencer.
>
> On 2017-07-17 23:24, Peter Maloney wrote:
>> You should have a separate public and cluster network. And journal or
>> wal/db performance is important... are the devices fast NVMe?
>>
>> On 07/17/17 21:31, gen...@gencgiyen.com wrote:
>>
>>> Hi,
>>>
>>> I located and applied almost every different tuning setting/config
>>> over the internet. I couldn’t manage to speed up my speed one byte
>>> further. It is always same speed whatever I do.
>>>
>>> I was on jewel, now I tried BlueStore on Luminous. Still exact same
>>> speed I gain from cephfs.
>>>
>>> It doesn’t matter if I disable debug log, or remove [osd] section
>>> as below and re-add as below (see .conf). Results are exactly the
>>> same. Not a single byte is gained from those tunings. I also did
>>> tuning for kernel (sysctl.conf).
>>>
>>> Basics:
>>>
>>> I have 2 nodes with 10 OSD each and each OSD is 3TB SATA drive. Each
>>> node has 24 cores and 64GB of RAM. Ceph nodes are connected via
>>> 10GbE NIC. No FUSE used. But tried that too. Same results.
>>>
>>> $ dd if=/dev/zero of=/mnt/c/testfile bs=100M count=10 oflag=direct
>>>
>>> 10+0 records in
>>>
>>> 10+0 records out
>>>
>>> 1048576000 bytes (1.0 GB, 1000 MiB) copied, 5.77219 s, 182 MB/s
>>>
>>> 182MB/s. This is the best speed i get so far. Usually 170~MB/s. Hm..
>>> I get much much much higher speeds on different filesystems. Even
>>> with glusterfs. Is there anything I can do or try?
>>>
>>> Read speed is also around 180-220MB/s but not higher.
>>>
>>> This is What I am using on ceph.conf:
>>>
>>> [global]
>>>
>>> fsid = d7163667-f8c5-466b-88df-8747b26c91df
>>>
>>> mon_initial_members = server1
>>>
>>> mon_host = 192.168.0.1
>>>
>>> auth_cluster_required = cephx
>>>
>>> auth_service_required = cephx
>>>
>>> auth_client_required = cephx
>>>
>>> osd mount options = rw,noexec,nodev,noatime,nodiratime,nobarrier
>>>
>>> osd mount options xfs = rw,noexec,nodev,noatime,nodiratime,nobarrier
>>>
>>>
>>> osd_mkfs_type = xfs
>>>
>>> osd pool default size = 2
>>>
>>> enable experimental unrecoverable data corrupting features =
>>> bluestore rocksdb
>>>
>>> bluestore fsck on mount = true
>>>
>>> rbd readahead disable after bytes = 0
>>>
>>> rbd readahead max bytes = 4194304
>>>
>>> log to syslog = false
>>>
>>> debug_lockdep = 0/0
>>>
>>> debug_context = 0/0
>>>
>>> debug_crush = 0/0
>>>
>>> debug_buffer = 0/0
>>>
>>> debug_timer = 0/0
>>>
>>> debug_filer = 0/0
>>>
>>> debug_objecter = 0/0
>>>
>>> debug_rados = 0/0
>>>
>>> debug_rbd = 0/0
>>>
>>> debug_journaler = 0/0
>>>
>>> debug_objectcatcher = 0/0
>>>
>>> debug_client = 0/0
>>>
>>> debug_osd = 0/0
>>>
>>> debug_optracker = 0/0
>>>
>>> debug_objclass = 0/0
>>>
>>> debug_filestore = 0/0
>>>
>>> debug_journal = 0/0
>>>
>>> debug_ms = 0/0

Re: [ceph-users] Yet another performance tuning for CephFS

2017-07-17 Thread Peter Maloney
You should have a separate public and cluster network. And journal or
wal/db performance is important... are the devices fast NVMe?


On 07/17/17 21:31, gen...@gencgiyen.com wrote:
>
> Hi,
>
>  
>
> I located and applied almost every different tuning setting/config
> over the internet. I couldn’t manage to speed up my speed one byte
> further. It is always same speed whatever I do.
>
>  
>
> I was on jewel, now I tried BlueStore on Luminous. Still exact same
> speed I gain from cephfs.
>
>  
>
> It doesn’t matter if I disable debug log, or remove [osd] section as
> below and re-add as below (see .conf). Results are exactly the same.
> Not a single byte is gained from those tunings. I also did tuning for
> kernel (sysctl.conf).
>
>  
>
> Basics:
>
>  
>
> I have 2 nodes with 10 OSD each and each OSD is 3TB SATA drive. Each
> node has 24 cores and 64GB of RAM. Ceph nodes are connected via 10GbE
> NIC. No FUSE used. But tried that too. Same results.
>
>  
>
> $ dd if=/dev/zero of=/mnt/c/testfile bs=100M count=10 oflag=direct
>
> 10+0 records in
>
> 10+0 records out
>
> 1048576000 bytes (1.0 GB, 1000 MiB) copied, 5.77219 s, 182 MB/s
>
>  
>
> 182MB/s. This is the best speed i get so far. Usually 170~MB/s. Hm.. I
> get much much much higher speeds on different filesystems. Even with
> glusterfs. Is there anything I can do or try?
>
>  
>
> Read speed is also around 180-220MB/s but not higher.
>
>  
>
> This is What I am using on ceph.conf:
>
>  
>
> [global]
>
> fsid = d7163667-f8c5-466b-88df-8747b26c91df
>
> mon_initial_members = server1
>
> mon_host = 192.168.0.1
>
> auth_cluster_required = cephx
>
> auth_service_required = cephx
>
> auth_client_required = cephx
>
>  
>
> osd mount options = rw,noexec,nodev,noatime,nodiratime,nobarrier
>
> osd mount options xfs = rw,noexec,nodev,noatime,nodiratime,nobarrier
>
> osd_mkfs_type = xfs
>
>  
>
> osd pool default size = 2
>
> enable experimental unrecoverable data corrupting features = bluestore
> rocksdb
>
> bluestore fsck on mount = true
>
> rbd readahead disable after bytes = 0
>
> rbd readahead max bytes = 4194304
>
>  
>
> log to syslog = false
>
> debug_lockdep = 0/0
>
> debug_context = 0/0
>
> debug_crush = 0/0
>
> debug_buffer = 0/0
>
> debug_timer = 0/0
>
> debug_filer = 0/0
>
> debug_objecter = 0/0
>
> debug_rados = 0/0
>
> debug_rbd = 0/0
>
> debug_journaler = 0/0
>
> debug_objectcatcher = 0/0
>
> debug_client = 0/0
>
> debug_osd = 0/0
>
> debug_optracker = 0/0
>
> debug_objclass = 0/0
>
> debug_filestore = 0/0
>
> debug_journal = 0/0
>
> debug_ms = 0/0
>
> debug_monc = 0/0
>
> debug_tp = 0/0
>
> debug_auth = 0/0
>
> debug_finisher = 0/0
>
> debug_heartbeatmap = 0/0
>
> debug_perfcounter = 0/0
>
> debug_asok = 0/0
>
> debug_throttle = 0/0
>
> debug_mon = 0/0
>
> debug_paxos = 0/0
>
> debug_rgw = 0/0
>
>  
>
>  
>
> [osd]
>
> osd max write size = 512
>
> osd client message size cap = 2147483648
>
> osd mount options xfs = rw,noexec,nodev,noatime,nodiratime,nobarrier
>
> filestore xattr use omap = true
>
> osd_op_threads = 8
>
> osd disk threads = 4
>
> osd map cache size = 1024
>
> filestore_queue_max_ops = 25000
>
> filestore_queue_max_bytes = 10485760
>
> filestore_queue_committing_max_ops = 5000
>
> filestore_queue_committing_max_bytes = 1048576
>
> journal_max_write_entries = 1000
>
> journal_queue_max_ops = 3000
>
> journal_max_write_bytes = 1048576000
>
> journal_queue_max_bytes = 1048576000
>
> filestore_max_sync_interval = 15
>
> filestore_merge_threshold = 20
>
> filestore_split_multiple = 2
>
> osd_enable_op_tracker = false
>
> filestore_wbthrottle_enable = false
>
> osd_client_message_size_cap = 0
>
> osd_client_message_cap = 0
>
> filestore_fd_cache_size = 64
>
> filestore_fd_cache_shards = 32
>
> filestore_op_threads = 12
>
>  
>
>  
>
> As I stated above, it doesn’t matter if I have this [osd] section or
> not. Results are same.
>
>  
>
> I am open to all suggestions.
>
>  
>
> Thanks,
>
> Gencer.
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] missing feature 400000000000000 ?

2017-07-14 Thread Peter Maloney

According to some slide in https://www.youtube.com/watch?v=gp6if858HUI
the support is:
> TUNABLE          RELEASE   CEPH_VERSION  KERNEL
> CRUSH_TUNABLES   argonaut  v0.48.1       v3.6
> CRUSH_TUNABLES2  bobtail   v0.55         v3.9
> CRUSH_TUNABLES3  firefly   v0.78         v3.15
> CRUSH_V4         hammer    v0.94         v4.1
> CRUSH_TUNABLES5  jewel     v10.0.2       v4.5

So go to hammer tunables:
> ceph osd crush tunables hammer
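You can check what the cluster currently advertises before and after the
change with:

  ceph osd crush show-tunables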


On 07/14/17 11:29, Riccardo Murri wrote:
> Hello,
>
> I am trying to install a test CephFS "Luminous" system on Ubuntu 16.04.
>
> Everything looks fine, but the `mount.ceph` command fails (error 110, 
> timeout);
> kernel logs show a number of messages like these before the `mount`
> prog gives up:
>
> libceph: ... feature set mismatch, my 107b84a842aca < server's
> 40107b84a842aca, missing 400
>
> I read in [1] that this is feature
> CEPH_FEATURE_NEW_OSDOPREPLY_ENCODING which is only supported in
> kernels 4.5 and up -- whereas Ubuntu 16.04 runs Linux 4.4.
>
> Is there some tunable or configuration file entry that I can set,
> which will make Luminous FS mounting work on the std Ubuntu 16.04
> Linux kernel?  I.e., is there a way I can avoid upgrading the kernel?
>
> [1]: 
> http://cephnotes.ksperis.com/blog/2014/01/21/feature-set-mismatch-error-on-ceph-kernel-client
>
> Thanks,
> Riccardo
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


-- 


Peter Maloney
Brockmann Consult
Max-Planck-Str. 2
21502 Geesthacht
Germany
Tel: +49 4152 889 300
Fax: +49 4152 889 333
E-mail: peter.malo...@brockmann-consult.de
Internet: http://www.brockmann-consult.de


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Specifying a cache tier for erasure-coding?

2017-07-07 Thread Peter Maloney
On 07/07/17 14:03, David Turner wrote:
>
> So many of your questions depends on what your cluster is used for. We
> don't even know rbd or cephfs from what you said and that still isn't
> enough to fully answer your questions. I have a much smaller 3 node
> cluster using Erasure coding for rbds as well as cephfs and it is fine
> speed-wise for my needs with the cache tier on the hdds. Luminous will
> remove the need for a cache tier to use Erasure coding if you can wait.
>
> Is your current cluster fast enough for your needs? Is Erasure coding
> just for additional space? If so, moving to Erasure coding requires
> you to copy your data from the replicated pool to the EC pool and you
> will have 2 copies of your data until you feel confident enough to
> delete the replicated copy.  Elaborate on what you mean when you ask
> how robust EC is, you then referred to replicated as simple.  Are you
> concerned it will add complexity or that it will be lacking features
> of a replicated pool?
>

You can even use your usual osds (HDDs and SSD journals?) as a cache
tier. I plan to do that for some low performance cold storage (where
bandwidth is of little importance, and iops matters not at all). It only has to
be replicated, and the rest depends on the performance you need.

And with only 3 hosts, EC won't save you as much space or be as fault
tolerant. With failure domain host, you might end up with k=2,m=1 and
then have poor redundancy and not much space saved.  And with failure
domain osd, you could save more space by increasing k, but even if you
increase m, you might end up with both copies on the same node, making
it not very fault tolerant.
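For illustration, that constrained 3-host case would look something like
this (Jewel-era parameter name; Luminous renames ruleset-failure-domain to
crush-failure-domain):

  ceph osd erasure-code-profile set ec-3host k=2 m=1 ruleset-failure-domain=host
  ceph osd erasure-code-profile get ec-3host
  ceph osd pool create ecpool 64 64 erasure ec-3host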

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Adding storage to exiting clusters with minimal impact

2017-07-06 Thread Peter Maloney
Here's my possibly unique method... I had 3 nodes with 12 disks each,
and when adding 2 more nodes, I had issues with the common method you
describe, totally blocking clients for minutes, but this worked great
for me:

> my own method
> - osd max backfills = 1 and osd recovery max active = 1
> - create them with crush weight 0 so no peering happens
> - (starting here the script below does it, eg. `ceph_activate_osds 6`
> will set weight 6)
> - after they're up, set them reweight 0
> - then set crush weight to the TB of the disk
> - peering starts, but reweight is 0 so it doesn't block clients
> - when that's done, reweight 1 and it should be faster than the
> previous peering and not bug clients as much
>
>
> # list osds with hosts next to them for easy filtering with awk
> (doesn't support chassis, rack, etc. buckets)
> ceph_list_osd() {
> ceph osd tree | awk '
> BEGIN {found=0; host=""};
> $3 == "host" {found=1; host=$4; getline};
> $3 == "host" {found=0}
> found || $3 ~ /osd\./ {print $0 " " host}'
> }
>
> peering_sleep() {
> echo "sleeping"
> sleep 2
> while ceph health | grep -q peer; do
> echo -n .
> sleep 1
> done
> echo
> sleep 5
> }
>
> # after an osd is already created, this reweights them to 'activate' them
> ceph_activate_osds() {
> weight="$1"
> host=$(hostname -s)
> 
> if [ -z "$weight" ]; then
> # TODO: somehow make this automatic...
> # This assumes all disks are the same weight.
> weight=6.00099
> fi
> 
> # for crush weight 0 osds, set reweight 0 so the crush weight
> non-zero won't cause as many blocked requests
> for id in $(ceph_list_osd | awk '$2 == 0 {print $1}'); do
> ceph osd reweight $id 0 &
> done
> wait
> peering_sleep
> 
> # the harsh reweight which we do slowly
> for id in $(ceph_list_osd | awk -v host="$host" '$5 == 0 && $7 ==
> host {print $1}'); do
> echo ceph osd crush reweight "osd.$id" "$weight"
> ceph osd crush reweight "osd.$id" "$weight"
> peering_sleep
> done
> 
> # the light reweight
> for id in $(ceph_list_osd | awk -v host="$host" '$5 == 0 && $7 ==
> host {print $1}'); do
> ceph osd reweight $id 1 &
> done
> wait
> }


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Watch for fstrim running on your Ubuntu systems

2017-07-06 Thread Peter Maloney
Hey,

I have some SAS Micron S630DC-400 which came with firmware M013 which
did the same or worse (takes very long... 100% blocked for about 5min
for 16GB trimmed), and works just fine with firmware M017 (4s for 32GB
trimmed). So maybe you just need an update.
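(A quick way to see what firmware a drive currently reports, device name
being a placeholder; SAS drives may show it as "Revision" instead:)

  smartctl -i /dev/sdX | grep -iE 'firmware|revision'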

Peter



On 07/06/17 18:39, Reed Dier wrote:
> Hi Wido,
>
> I came across this ancient ML entry with no responses and wanted to
> follow up with you to see if you recalled any solution to this.
> Copying the ceph-users list to preserve any replies that may result
> for archival.
>
> I have a couple of boxes with 10x Micron 5100 SATA SSD’s, journaled on
> Micron 9100 NVMe SSD’s; ceph 10.2.7; Ubuntu 16.04 4.8 kernel.
>
> I have noticed now twice that I’ve had SSD’s flapping due to the
> fstrim eating up the io 100%.
> It eventually righted itself after a little less than 8 hours.
> Noout flag was set, so it didn’t create any unnecessary rebalance or
> whatnot.
>
> Timeline showing that only 1 OSD ever went down at a time, but they
> seemed to go down in a rolling fashion during the fstrim session.
> You can actually see in the OSD graph all 10 OSD’s on this node go
> down 1 by 1 over time.
>
> And the OSD’s were going down because of:
>
>> 2017-07-02 13:47:32.618752 7ff612721700  1 heartbeat_map is_healthy
>> 'OSD::osd_op_tp thread 0x7ff5ecd0c700' had timed out after 15
>> 2017-07-02 13:47:32.618757 7ff612721700  1 heartbeat_map is_healthy
>> 'FileStore::op_tp thread 0x7ff608d9e700' had timed out after 60
>> 2017-07-02 13:47:32.618760 7ff612721700  1 heartbeat_map is_healthy
>> 'FileStore::op_tp thread 0x7ff608d9e700' had suicide timed out after 180
>> 2017-07-02 13:47:32.624567 7ff612721700 -1 common/HeartbeatMap.cc: In function
>> 'bool ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d*, const
>> char*, time_t)' thread 7ff612721700 time 2017-07-02 13:47:32.618784
>> common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")
>
> I am curious if you were able to nice it or something similar to
> mitigate this issue?
> Oddly, I have similar machines with Samsung SM863a’s with Intel P3700
> journals that do not appear to be affected by the fstrim load issue
> despite identical weekly cron jobs enabled. Only the Micron drives
> (newer) have had these issues.
>
> Appreciate any pointers,
>
> Reed
>
>> Wido den Hollander  wido at 42on.com
>> Tue Dec 9 01:21:16 PST 2014
>> Hi,
>>
>> Last sunday I got a call early in the morning that a Ceph cluster was
>> having some issues. Slow requests and OSDs marking each other down.
>>
>> Since this is a 100% SSD cluster I was a bit confused and started
>> investigating.
>>
>> It took me about 15 minutes to see that fstrim was running and was
>> utilizing the SSDs 100%.
>>
>> On Ubuntu 14.04 there is a weekly CRON which executes fstrim-all. It
>> detects all mountpoints which can be trimmed and starts to trim those.
>>
>> On the Intel SSDs used here it caused them to become 100% busy for a
>> couple of minutes. That was enough for them to no longer respond on
>> heartbeats, thus timing out and being marked down.
>>
>> Luckily we had the "out interval" set to 1800 seconds on that cluster,
>> so no OSD was marked as "out".
>>
>> fstrim-all does not execute fstrim with an ionice priority. From what I
>> understand, but haven't tested yet, is that running fstrim with ionice
>> -c Idle should solve this.
>>
>> It's weird that this issue didn't come up earlier on that cluster, but
>> after killing fstrim all problems were resolved and the cluster ran
>> happily again.
>>
>> So watch out for fstrim on early Sunday mornings on Ubuntu!
>>
>> -- 
>> Wido den Hollander
>> 42on B.V.
>> Ceph trainer and consultant
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] dropping filestore+btrfs testing for luminous

2017-06-30 Thread Peter Maloney
On 06/30/17 05:21, Sage Weil wrote:
> We're having a series of problems with the valgrind included in xenial[1] 
> that have led us to restrict all valgrind tests to centos nodes.  At the
> same time, we're also seeing spurious ENOSPC errors from btrfs on both
> centos and xenial kernels[2], making trusty the only distro where btrfs
> works reliably.
Do you guys know about balance filters and how to use them to prevent
ENOSPC?

see: https://btrfs.wiki.kernel.org/index.php/Balance_Filters

Basically it sometimes (when using snaps heavily) just has many
partially used chunks and so you rebalance the data inside them so it
can remove the fully empty ones and reuse the space. The above page says
to run commands like:

> btrfs balance start -dusage=50 /

where you start at 50 or so and raise it up and rerun if you want, until
you reclaimed enough space.
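To see how much space sits in allocated-but-underused chunks before and
after, something like (mount point is a placeholder):

  btrfs filesystem df /var/lib/ceph/osd/ceph-0
  btrfs filesystem usage /var/lib/ceph/osd/ceph-0   # newer btrfs-progs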

So, to make the automated tests eat less of your time, you could script
something that runs that after some number of unit tests, or the btrfs
filestore itself could do it after removing some number of snapshots, or
after hitting ENOSPC (and then retry). I don't know what's easy to
implement, just making sure you're aware of the option.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Very HIGH Disk I/O latency on instances

2017-06-29 Thread Peter Maloney
On 06/28/17 21:57, Gregory Farnum wrote:
>
>
> On Wed, Jun 28, 2017 at 9:17 AM Peter Maloney
>  <mailto:peter.malo...@brockmann-consult.de>> wrote:
>
> On 06/28/17 16:52, keynes_...@wistron.com
> <mailto:keynes_...@wistron.com> wrote:
>> [...]backup VMs is create a snapshot by Ceph commands (rbd
>> snapshot) then download (rbd export) it.
>>
>>  
>>
>> We found a very high Disk Read / Write latency during creating /
>> deleting snapshots, it will higher than 1 ms.
>>
>>  
>>
>> Even not during backup jobs, we often see a more than 4000 ms
>> latency occurred.
>>
>>  
>>
>> Users start to complain.
>>
>> Could you please help us to how to start the troubleshooting?
>>
>>  
>>
> For creating snaps and keeping them, this was marked wontfix
> http://tracker.ceph.com/issues/10823
>
> For deleting, see the recent "Snapshot removed, cluster thrashed"
> thread for some config to try.
>
>
> Given he says he's seeing 4 second IOs even without snapshot
> involvement, I think Keynes must be seeing something else in his cluster.

If you have few enough OSDs and slow enough journals that things seem ok
without snaps, then with snaps it can be much worse than 4s IOs if you
have any sync-heavy clients, like ganglia.

Before I figured out that it was exclusive-lock causing VMs to hang, I
tested many things and spent months on it. Also some people in the
freenode irc ##proxmox channel with cheap home ceph setups often
complain about such things.
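(What I'd check on an affected image, names being placeholders; note that
object-map, fast-diff and journaling depend on exclusive-lock and have to
be disabled first if they are enabled:)

  rbd info rbd/vm-disk-1 | grep features
  rbd feature disable rbd/vm-disk-1 fast-diff object-map journaling   # only those actually enabled
  rbd feature disable rbd/vm-disk-1 exclusive-lock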

>
>
> 
> https://storageswiss.com/2016/04/01/snapshot-101-copy-on-write-vs-redirect-on-write/
>> Consider a *copy-on-write* system, which /copies/ any blocks
>> before they are overwritten with new information (i.e. it copies
>> on writes). In other words, if a block in a protected entity is
>> to be modified, the system will copy that block to a separate
>> snapshot area before it is overwritten with the new information.
>> This approach requires three I/O operations for each write: one
>> read and two writes. [...] This decision process for each block
>> also comes with some computational overhead.
>
>> A *redirect-on-write* system uses pointers to represent all
>> protected entities. If a block needs modification, the storage
>> system merely /redirects/ the pointer for that block to another
>> block and writes the data there. [...] There is zero
>> computational overhead of reading a snapshot in a
>> redirect-on-write system.
>
>> The redirect-on-write system uses 1/3 the number of I/O
>> operations when modifying a protected block, and it uses no extra
>> computational overhead reading a snapshot. Copy-on-write systems
>> can therefore have a big impact on the performance of the
>> protected entity. The more snapshots are created and the longer
>> they are stored, the greater the impact to performance on the
>> protected entity.
>
>
> I wouldn't consider that a very realistic depiction of the tradeoffs
> involved in different snapshotting strategies[1], but BlueStore uses
> "redirect-on-write" under the formulation presented in those quotes.
> RBD clones of protected images will remain copy-on-write forever, I
> imagine.
> -Greg
It was simply the first link I found which I could quote, but I didn't
find it too bad... it's just that it describes it as if all
implementations are the same.
>
> [1]: There's no reason to expect a copy-on-write system will first
> copy the original data and then overwrite it with the new data when it
> can simply inject the new data along the way. *Some* systems will copy
> the "old" block into a new location and then overwrite in the existing
> location (it helps prevent fragmentation), but many don't. And a
> "redirect-on-write" system needs to persist all those block metadata
> pointers, which may be much cheaper or much, much more expensive than
> just duplicating the blocks.

After a snap is unprotected, will the clones be redirect-on-write? Or
after the image is flattened (like dd if=/dev/zero to the whole disk)?

Are there other cases where you get a copy-on-write behavior?

Glad to hear bluestore has something better. Is that available and the
default behavior on kraken (which I tested but where it didn't seem to be
fixed, although all storage backends were less prone to blocking on kraken)?

If it was a true redirect-on-write system, I would expect that when you
make a snap, there is just the overhead of organizing some metadata, 

Re: [ceph-users] Very HIGH Disk I/O latency on instances

2017-06-28 Thread Peter Maloney
On 06/28/17 16:52, keynes_...@wistron.com wrote:
>
> We were using HP Helion 2.1.5 ( OpenStack + Ceph )
>
> The OpenStack version is *Kilo* and Ceph version is *firefly*
>
>  
>
> The way we backup VMs is create a snapshot by Ceph commands (rbd
> snapshot) then download (rbd export) it.
>
>  
>
> We found a very high Disk Read / Write latency during creating /
> deleting snapshots, it will higher than 1 ms.
>
>  
>
> Even not during backup jobs, we often see a more than 4000 ms latency
> occurred.
>
>  
>
> Users start to complain.
>
> Could you please help us to how to start the troubleshooting?
>
>  
>
For creating snaps and keeping them, this was marked wontfix
http://tracker.ceph.com/issues/10823

For deleting, see the recent "Snapshot removed, cluster thrashed" thread
for some config to try.

And I find this to be a very severe problem. And you haven't even seen
the worst... also, make more snapshots and it gets slower and slower to
do many things (resize, clone, snap revert, etc.), though a fully
flattened image seen by a client usually seems as fast as normal.

Let's pool some money together as a reward for making snapshots work
properly/modern, like on ZFS and btrfs where they don't have to copy so
much... they "redirect on write" rather than literally "copy on write".
(what would be a good way to pool money like that?). If others are
interested, I surely am, but would have to ask the boss about money.
Even if it's only for bluestore, so only for future releases, that's ok
with me. And if it keeps the copy on the same osd/fs as the original,
that is acceptable too.


https://storageswiss.com/2016/04/01/snapshot-101-copy-on-write-vs-redirect-on-write/
> Consider a *copy-on-write* system, which /copies/ any blocks before
> they are overwritten with new information (i.e. it copies on writes).
> In other words, if a block in a protected entity is to be modified,
> the system will copy that block to a separate snapshot area before it
> is overwritten with the new information. This approach requires three
> I/O operations for each write: one read and two writes. [...] This
> decision process for each block also comes with some computational
> overhead.

> A *redirect-on-write* system uses pointers to represent all protected
> entities. If a block needs modification, the storage system merely
> /redirects/ the pointer for that block to another block and writes the
> data there. [...] There is zero computational overhead of reading a
> snapshot in a redirect-on-write system.

> The redirect-on-write system uses 1/3 the number of I/O operations
> when modifying a protected block, and it uses no extra computational
> overhead reading a snapshot. Copy-on-write systems can therefore have
> a big impact on the performance of the protected entity. The more
> snapshots are created and the longer they are stored, the greater the
> impact to performance on the protected entity.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Snapshot removed, cluster thrashed...

2017-06-26 Thread Peter Maloney
On 06/26/17 11:36, Marco Gaiarin wrote:
> ...
> Three question:
>
> a) while a 'snapshot remove' action put system on load?
>
> b) as for options like:
>
>   osd scrub during recovery = false
> osd recovery op priority = 1
> osd recovery max active = 5
> osd max backfills = 1
>
>  (for recovery), there are option to reduce the impact of a stapshot
>  remove?
>
> c) snapshot are handled differently from other IO ops, or doing some
>  similar things (eg, a restore from a backup) i've to expect some
>  similar result?
>
>
> Thanks.
>
You also have to set:

> osd_pg_max_concurrent_snap_trims=1
> osd_snap_trim_sleep=0
The 2nd is default 0, but just make sure. Or maybe it doesn't exist in
hammer. It's bugged in jewel (it holds a lock during the sleep), so you
have to set it to 0.
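They can be applied at runtime, e.g. (and put them in ceph.conf so they
survive restarts):

  ceph tell osd.* injectargs '--osd_pg_max_concurrent_snap_trims 1 --osd_snap_trim_sleep 0'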

And I think maybe this helps a little:

> filestore_split_multiple=8

And setting this one lower can reduce performance, but can also reduce
blocked requests (due to blocking locks):

> osd_op_threads = 2
(currently I have 8 set on the last one... long ago I found that the
lowest was the minimum bearable when doing snapshots and snap removal)


And keep in mind all the "priority" stuff possibly doesn't have any
effect without the cfq disk scheduler (at least in hammer... I think
I've heard different for jewel and later). Check with:

> grep . /sys/block/*/queue/scheduler


-- 


Peter Maloney
Brockmann Consult
Max-Planck-Str. 2
21502 Geesthacht
Germany
Tel: +49 4152 889 300
Fax: +49 4152 889 333
E-mail: peter.malo...@brockmann-consult.de
Internet: http://www.brockmann-consult.de


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] radosgw: scrub causing slow requests in the md log

2017-06-21 Thread Peter Maloney
On 06/14/17 11:59, Dan van der Ster wrote:
> Dear ceph users,
>
> Today we had O(100) slow requests which were caused by deep-scrubbing
> of the metadata log:
>
> 2017-06-14 11:07:55.373184 osd.155
> [2001:1458:301:24::100:d]:6837/3817268 7387 : cluster [INF] 24.1d
> deep-scrub starts
> ...
> 2017-06-14 11:22:04.143903 osd.155
> [2001:1458:301:24::100:d]:6837/3817268 8276 : cluster [WRN] slow
> request 480.140904 seconds old, received at 2017-06-14
> 11:14:04.002913: osd_op(client.3192010.0:11872455 24.be8b305d
> meta.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54 [call log.add] snapc
> 0=[] ondisk+write+known_if_redirected e7752) currently waiting for
> scrub
> ...
> 2017-06-14 11:22:06.729306 osd.155
> [2001:1458:301:24::100:d]:6837/3817268 8277 : cluster [INF] 24.1d
> deep-scrub ok

This looks just like my problem in my thread on ceph-devel "another
scrub bug? blocked for > 10240.948831 secs" except that your scrub
eventually finished (mine ran for hours before I stopped it manually),
and I'm not using rgw.

Sage commented that it is likely related to snaps being removed at some
point and interacting with scrub.

Restarting the osd that is mentioned there (osd.155 in  your case) will
fix it for now. And tuning scrub changes the way it behaves (defaults
make it happen more rarely than what I had before).
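As commands, the short-term workaround is just (osd id taken from the log,
and the restart method depends on how the daemons are managed):

  systemctl restart ceph-osd@155
  # optionally pause deep scrubs cluster-wide while investigating:
  ceph osd set nodeep-scrub
  ceph osd unset nodeep-scrub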


-- 


Peter Maloney
Brockmann Consult
Max-Planck-Str. 2
21502 Geesthacht
Germany
Tel: +49 4152 889 300
Fax: +49 4152 889 333
E-mail: peter.malo...@brockmann-consult.de
Internet: http://www.brockmann-consult.de


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Prioritise recovery on specific PGs/OSDs?

2017-06-20 Thread Peter Maloney
These settings can be applied to a specific OSD:
> osd recovery max active = 1
> osd max backfills = 1
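For example, applied at runtime to specific OSDs rather than cluster-wide
(osd ids are placeholders):

  ceph tell osd.3 injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'
  ceph tell osd.7 injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'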

I don't know if it will behave as you expect if you set 0... (I tested
setting 0 which didn't complain, but is 0 actually 0 or unlimited or an
error?)

Maybe you could parse the ceph pg dump, then look at the pgs that list
your special osds, then set all of the listed osds (not just special
ones) config to 1 and the rest 0. But this will not prioritize specific
pgs... or even specific osds, and maybe it'll end up being all osds.

To further add to your criteria, you could select ones where the
direction of movement is how you want it... like if up (where CRUSH
wants the data after recovery is done) says [1,2,3] and acting (where it
is now, even partial pgs I think) says [1,2,7] and you want to empty 7,
then you have to set the numbers non-zero for osd 3 and 7, but maybe not
1 or 2 (although these could be read as part of recovery).

I'm sure it's doomed to fail, but you can try it out on a test cluster.

My guess is it will either not accept 0 like you expect, or it will only
be a small fraction of your osds that you can set to 0.


On 06/20/17 14:44, Richard Hesketh wrote:
> Is there a way, either by individual PG or by OSD, I can prioritise 
> backfill/recovery on a set of PGs which are currently particularly important 
> to me?
>
> For context, I am replacing disks in a 5-node Jewel cluster, on a 
> node-by-node basis - mark out the OSDs on a node, wait for them to clear, 
> replace OSDs, bring up and in, mark out the OSDs on the next set, etc. I've 
> done my first node, but the significant CRUSH map changes means most of my 
> data is moving. I only currently care about the PGs on my next set of OSDs to 
> replace - the other remapped PGs I don't care about settling because they're 
> only going to end up moving around again after I do the next set of disks. I 
> do want the PGs specifically on the OSDs I am about to replace to backfill 
> because I don't want to compromise data integrity by downing them while they 
> host active PGs. If I could specifically prioritise the backfill on those 
> PGs/OSDs, I could get on with replacing disks without worrying about causing 
> degraded PGs.
>
> I'm in a situation right now where there is merely a couple of dozen PGs on 
> the disks I want to replace, which are all remapped and waiting to backfill - 
> but there are 2200 other PGs also waiting to backfill because they've moved 
> around too, and it's extremely frustating to be sat waiting to see when the 
> ones I care about will finally be handled so I can get on with replacing 
> those disks.
>
> Rich
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


-- 


Peter Maloney
Brockmann Consult
Max-Planck-Str. 2
21502 Geesthacht
Germany
Tel: +49 4152 889 300
Fax: +49 4152 889 333
E-mail: peter.malo...@brockmann-consult.de
Internet: http://www.brockmann-consult.de


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] FAILED assert(i.first <= i.last)

2017-06-19 Thread Peter Rosell
That sounds like an easy rule to follow. Thanks again for your reply.
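For reference, the usual wipe-and-re-add sequence looks roughly like this
(osd id 3 and /dev/sdX are placeholders; the exact steps differ per release
and per deployment, e.g. ceph-disk vs. containers):

  ceph osd out 3
  # stop the ceph-osd daemon/container for osd.3
  ceph osd crush remove osd.3
  ceph auth del osd.3
  ceph osd rm 3
  ceph-disk zap /dev/sdX
  ceph-disk prepare /dev/sdX   # Jewel-era tooling; newer releases use ceph-volume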

/Peter

On Mon, 19 June 2017 at 10:19, Wido den Hollander wrote:

>
> > On 19 June 2017 at 9:55, Peter Rosell wrote:
> >
> >
> > I have my servers on a UPS and I shut them down manually the way I usually turn
> > them off. There was enough power left in the UPS after the servers were
> > shut down, because it continued to beep. Anyway, I will wipe it and re-add
> > it. Thanks for your reply.
> >
>
> Ok, you didn't mention that in the first post. I assumed a sudden power
> failure.
>
> My general recommendation is to wipe a single OSD if it has issues. The
> reason is that I've seen many cases where people ran XFS repair, played
> with the files on the disk and then had data corruption.
>
> That's why I'd say that you should try to avoid fixing single OSDs when
> you don't need to.
>
> Wido
>
> > /Peter
> >
> On Mon, 19 June 2017 at 09:11, Wido den Hollander wrote:
> >
> > >
> > > > On 18 June 2017 at 16:21, Peter Rosell <peter.ros...@gmail.com> wrote:
> > > >
> > > >
> > > > Hi,
> > > > I have a small cluster with only three nodes, 4 OSDs + 3 OSDs. I have
> > > been
> > > > running version 0.87.2 (Giant) for over 2.5 year, but a couple of day
> > > ago I
> > > > upgraded to 0.94.10 (Hammer) and then up to 10.2.7 (Jewel). Both the
> > > > upgrades went great. Started with monitors, osd and finally mds. The
> log
> > > > shows all 448 pgs active+clean. I'm running all daemons inside
> docker and
> > > > ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)
> > > >
> > > > Today I had a power outage and I had to take down the servers. When
> I now
> > > > start the servers again one OSD daemon doesn't start properly. It
> keeps
> > > > crashing.
> > > >
> > > > I noticed that the two first restarts of the osd daemon crashed with
> this
> > > > error:
> > > > FAILED assert(rollback_info_trimmed_to_riter == log.rbegin())
> > > >
> > > > After that it always fails with "FAILED assert(i.first <= i.last)"
> > > >
> > > > I have 15 logs like this one:
> > > > Jun 18 08:56:18 island sh[27991]: 2017-06-18 08:56:18.300641
> 7f5c5e0ff8c0
> > > > -1 log_channel(cluster) log [ERR] : 2.38 log bound mismatch, info
> > > > (19544'666742,19691'671046] actual [19499'665843,19691'671046]
> > > >
> > > > I removed the directories _head, but that just removed these
> error
> > > > logs. It crashes anyway.
> > > >
> > > > Anyone has any suggestions what to do to make it start up correct. Of
> > > > course I can remove the OSD from the cluster and re-add it, but it
> feels
> > > > like a bug.
> > >
> > > Are you sure? Since you had a power failure it could be that certain parts
> > > weren't committed to disk/FS properly when the power failed. That really
> > > depends on the hardware and configuration.
> > >
> > > Please, do not try to repair this OSD. Wipe it and re-add it to the
> > > cluster.
> > >
> > > Wido
> > >
> > > > A small snippet from the logs is added below. I didn't include the event
> > > > list. If it will help I can send it too.
> > > >
> > > > Jun 18 13:52:23 island sh[7068]: osd/osd_types.cc: In function
> 'static
> > > bool
> > > > pg_interval_t::check_new_interval(int, int, const std::vector&,
> > > const
> > > > std::vector&, int, int, const std::vector&, const
> > > > std::vector&, epoch_t, epoch_t, OSDMapRef, OSDMapRef, pg_t,
> > > > IsPGRecoverablePredicate*, std::map*,
> > > > std::ostream*)' thread 7f4fc2500700 time 2017-06-18 13:52:23.593991
> > > > Jun 18 13:52:23 island sh[7068]: osd/osd_types.cc: 3132: FAILED
> > > > assert(i.first <= i.last)
> > > > Jun 18 13:52:23 island sh[7068]:  ceph version 10.2.7
> > > > (50e863e0f4bc8f4b9e31156de690d765af245185)
> > > > Jun 18 13:52:23 island sh[7068]:  1: (ceph::__ceph_assert_fail(char
> > > const*,
> > > > char const*, int, char const*)+0x80) [0x559fe4c14360]
> > > > Jun 18 13:52:23 island sh[7068]:  2:
> > > > (pg_interval_t::check_new_interval(int, int, std::vector > > > std::allocator > c

Re: [ceph-users] FAILED assert(i.first <= i.last)

2017-06-19 Thread Peter Rosell
I have my servers on a UPS and shut them down manually, the way I usually
turn them off. There was enough power left in the UPS after the servers were
shut down, because it continued to beep. Anyway, I will wipe it and re-add
it. Thanks for your reply.

/Peter

On Mon, 19 June 2017 at 09:11, Wido den Hollander wrote:

>
> > On 18 June 2017 at 16:21, Peter Rosell wrote:
> >
> >
> > Hi,
> > I have a small cluster with only three nodes, 4 OSDs + 3 OSDs. I have been
> > running version 0.87.2 (Giant) for over 2.5 years, but a couple of days ago I
> > upgraded to 0.94.10 (Hammer) and then up to 10.2.7 (Jewel). Both the
> > upgrades went great. Started with monitors, OSDs and finally the MDS. The log
> > shows all 448 pgs active+clean. I'm running all daemons inside docker and
> > ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)
> >
> > Today I had a power outage and I had to take down the servers. When I now
> > start the servers again, one OSD daemon doesn't start properly. It keeps
> > crashing.
> >
> > I noticed that the first two restarts of the osd daemon crashed with this
> > error:
> > FAILED assert(rollback_info_trimmed_to_riter == log.rbegin())
> >
> > After that it always fails with "FAILED assert(i.first <= i.last)"
> >
> > I have 15 logs like this one:
> > Jun 18 08:56:18 island sh[27991]: 2017-06-18 08:56:18.300641 7f5c5e0ff8c0
> > -1 log_channel(cluster) log [ERR] : 2.38 log bound mismatch, info
> > (19544'666742,19691'671046] actual [19499'665843,19691'671046]
> >
> > I removed the directories _head, but that just removed these error
> > logs. It crashes anyway.
> >
> > Does anyone have suggestions on what to do to make it start up correctly? Of
> > course I can remove the OSD from the cluster and re-add it, but it feels
> > like a bug.
>
> Are you sure? Since you had a power failure it could be that certain parts
> weren't committed to disk/FS properly when the power failed. That really
> depends on the hardware and configuration.
>
> Please, do not try to repair this OSD. Wipe it and re-add it to the
> cluster.
>
> Wido
>
> > A small snippet from the logs is added below. I didn't include the event
> > list. If it will help I can send it too.
> >
> > Jun 18 13:52:23 island sh[7068]: osd/osd_types.cc: In function 'static
> bool
> > pg_interval_t::check_new_interval(int, int, const std::vector&,
> const
> > std::vector&, int, int, const std::vector&, const
> > std::vector&, epoch_t, epoch_t, OSDMapRef, OSDMapRef, pg_t,
> > IsPGRecoverablePredicate*, std::map*,
> > std::ostream*)' thread 7f4fc2500700 time 2017-06-18 13:52:23.593991
> > Jun 18 13:52:23 island sh[7068]: osd/osd_types.cc: 3132: FAILED
> > assert(i.first <= i.last)
> > Jun 18 13:52:23 island sh[7068]:  ceph version 10.2.7
> > (50e863e0f4bc8f4b9e31156de690d765af245185)
> > Jun 18 13:52:23 island sh[7068]:  1: (ceph::__ceph_assert_fail(char
> const*,
> > char const*, int, char const*)+0x80) [0x559fe4c14360]
> > Jun 18 13:52:23 island sh[7068]:  2:
> > (pg_interval_t::check_new_interval(int, int, std::vector > std::allocator > const&, std::vector >
> > const&, int, int, std::vector > const&,
> > std::vector > const&, unsigned int, unsigned
> int,
> > std::shared_ptr, std::shared_ptr, pg_t,
> > IsPGRecoverablePredicate*, std::map > std::less, std::allocator > pg_interval_t> > >*, std::ostream*)+0x72c) [0x559fe47f723c]
> > Jun 18 13:52:23 island sh[7068]:  3:
> > (PG::start_peering_interval(std::shared_ptr,
> std::vector > std::allocator > const&, int, std::vector >
> > const&, int, ObjectStore::Transaction*)+0x3ff) [0x559fe461439f]
> > Jun 18 13:52:23 island sh[7068]:  4:
> > (PG::RecoveryState::Reset::react(PG::AdvMap const&)+0x478)
> [0x559fe4615828]
> > Jun 18 13:52:23 island sh[7068]:  5:
> > (boost::statechart::simple_state > PG::RecoveryState::RecoveryMachine, boost::mpl::list > mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
> > mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
> > mpl_::na, mpl_::na, mpl_::na, mpl_::na>,
> >
> (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base
> > const&, void const*)+0x176) [0x559fe4645b86]
> > Jun 18 13:52:23 island sh[7068]:  6:
> > (boost::statechart::state_machine > PG::RecoveryState::Initial, std::allocator,
> >
> boost::statechart::null_exception_translator>::process_event(boost::statechart::event_b

[ceph-users] FAILED assert(i.first <= i.last)

2017-06-18 Thread Peter Rosell
765af245185)
Jun 18 13:52:23 island sh[7068]:  1: (ceph::__ceph_assert_fail(char const*,
char const*, int, char const*)+0x80) [0x559fe4c14360]
Jun 18 13:52:23 island sh[7068]:  2:
(pg_interval_t::check_new_interval(int, int, std::vector > const&, std::vector >
const&, int, int, std::vector > const&,
std::vector > const&, unsigned int, unsigned int,
std::shared_ptr, std::shared_ptr, pg_t,
IsPGRecoverablePredicate*, std::map, std::allocator > >*, std::ostream*)+0x72c) [0x559fe47f723c]
Jun 18 13:52:23 island sh[7068]:  3:
(PG::start_peering_interval(std::shared_ptr, std::vector > const&, int, std::vector >
const&, int, ObjectStore::Transaction*)+0x3ff) [0x559fe461439f]
Jun 18 13:52:23 island sh[7068]:  4:
(PG::RecoveryState::Reset::react(PG::AdvMap const&)+0x478) [0x559fe4615828]
Jun 18 13:52:23 island sh[7068]:  5:
(boost::statechart::simple_state,
(boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base
const&, void const*)+0x176) [0x559fe4645b86]
Jun 18 13:52:23 island sh[7068]:  6:
(boost::statechart::state_machine,
boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base
const&)+0x69) [0x559fe4626d49]
Jun 18 13:52:23 island sh[7068]:  7:
(PG::handle_advance_map(std::shared_ptr,
std::shared_ptr, std::vector >&,
int, std::vector >&, int, PG::RecoveryCtx*)+0x49e)
[0x559fe45fa5ae]
Jun 18 13:52:23 island sh[7068]:  8: (OSD::advance_pg(unsigned int, PG*,
ThreadPool::TPHandle&, PG::RecoveryCtx*, std::set,
std::less >,
std::allocator > >*)+0x2f2) [0x559fe452c042]
Jun 18 13:52:23 island sh[7068]:  9:
(OSD::process_peering_events(std::__cxx11::list >
const&, ThreadPool::TPHandle&)+0x214) [0x559fe4546d34]
Jun 18 13:52:23 island sh[7068]:  10:
(ThreadPool::BatchWorkQueue::_void_process(void*,
ThreadPool::TPHandle&)+0x25) [0x559fe458f8e5]
Jun 18 13:52:23 island sh[7068]:  11:
(ThreadPool::worker(ThreadPool::WorkThread*)+0xdb1) [0x559fe4c06531]
Jun 18 13:52:23 island sh[7068]:  12:
(ThreadPool::WorkThread::entry()+0x10) [0x559fe4c07630]
Jun 18 13:52:23 island sh[7068]:  13: (()+0x76fa) [0x7f4fe256b6fa]
Jun 18 13:52:23 island sh[7068]:  14: (clone()+0x6d) [0x7f4fe05e3b5d]
Jun 18 13:52:23 island sh[7068]:  NOTE: a copy of the executable, or
`objdump -rdS ` is needed to interpret this.
Jun 18 13:52:23 island sh[7068]: --- begin dump of recent events ---
Jun 18 13:52:23 island sh[7068]:  -2051> 2017-06-18 13:50:36.086036
7f4fe36bb8c0  5 asok(0x559fef2d6000) register_command perfcounters_dump
hook 0x559fef216030


/Peter
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] removing cluster name support

2017-06-11 Thread Peter Maloney
On 06/08/17 21:37, Sage Weil wrote:
> Questions:
>
>  - Does anybody on the list use a non-default cluster name?
>  - If so, do you have a reason not to switch back to 'ceph'?
>
> Thanks!
> sage
Will it still be possible for clients to use multiple clusters?

Also how does this affect rbd mirroring?
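
A hedged illustration of one way a client can already talk to more than one
cluster without relying on an internal cluster name: point the tools at
per-cluster conf and keyring files (the paths below are hypothetical):

# Hypothetical per-cluster conf and keyring files on the client:
ceph --conf /etc/ceph/prod.conf   --keyring /etc/ceph/prod.client.admin.keyring   status
ceph --conf /etc/ceph/backup.conf --keyring /etc/ceph/backup.client.admin.keyring status

# The same flags work for rbd, e.g. to inspect images on the second cluster:
rbd --conf /etc/ceph/backup.conf --keyring /etc/ceph/backup.client.admin.keyring ls rbd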

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG that should not be on undersized+degraded on multi datacenter Ceph cluster

2017-06-07 Thread Peter Maloney
On 06/06/17 19:23, Alejandro Comisario wrote:
> Hi all, I have a multi-datacenter, 6-node (6 OSD) Ceph Jewel cluster.
> There are 3 pools in the cluster, all three with size 3 and min_size 2.
>
> Today, I shut down all three nodes (controlled and in order) in
> datacenter "CPD2", just to validate that everything keeps working on
> "CPD1", which it did (including rebalancing of the information).
>
> After everything was off on CPD2, the "osd tree" looks like this,
> which seems ok.
>
> root@oskceph01:~# ceph osd tree
> ID WEIGHT   TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY
> -1 30.0 root default
> -8 15.0 datacenter CPD1
> -2  5.0 host oskceph01
>  0  5.0 osd.0   up  1.0  1.0
> -6  5.0 host oskceph05
>  4  5.0 osd.4   up  1.0  1.0
> -4  5.0 host oskceph03
>  2  5.0 osd.2   up  1.0  1.0
> -9 15.0 datacenter CPD2
> -3  5.0 host oskceph02
>  1  5.0 osd.1 down0  1.0
> -5  5.0 host oskceph04
>  3  5.0 osd.3 down0  1.0
> -7  5.0 host oskceph06
>  5  5.0 osd.5 down0  1.0
>
> ...
>
> root@oskceph01:~# ceph pg dump | egrep degrad
> dumped all in format plain
> 8.1b3 178 0 178 0 0 1786814 3078 3078 active+undersized+degraded
> 2017-06-06 13:11:46.130567 2361'250952 2361:248472 [0,2] 0 [0,2] 0
> 1889'249956 2017-06-06 04:11:52.736214 1889'242115 2017-06-03
> 19:07:06.615674
>
> For some strange reason, I see that the acting set is [0,2]; I don't
> see osd.4 in the acting set, and honestly, I don't know why.
>
> ...
I'm assuming your failure domain is host, not datacenter? (Otherwise
you'd never get [0,2] ... and size 3 could never work either.)

So then it looks like a problem I had and solved this week... I had 60
osds with 19 down to be replaced, and one pg out of 1152 wouldn't peer.
By chance I realized what was wrong... there's a tunable,
"choose_total_tries", that you can increase so the pgs that tried to find an
osd that many times and failed will try more times:

> ceph osd getcrushmap -o crushmap
> crushtool -d crushmap -o crushmap.txt
> vim crushmap.txt
> here you change tunable choose_total_tries higher... default is
> 50. 100 worked for me the first time, and then later I changed it
> again to 200.
> crushtool -c crushmap.txt -o crushmap.new
> ceph osd setcrushmap -i crushmap.new

if anything goes wrong with the new crushmap, you can always set the old
again:
> ceph osd setcrushmap -i crushmap

Then you have to wait some time, maybe 30s before you have pgs peering.

Now if only there were a log message or warning in ceph -s saying the number
of tries was exceeded, this solution would be more obvious (and we
would know whether it applies to you).
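
Pulled together, a minimal sketch of that workflow, assuming the decompiled
map already contains a "tunable choose_total_tries" line (100 is just the
value from this thread; add the line by hand if it is missing):

ceph osd getcrushmap -o crushmap
crushtool -d crushmap -o crushmap.txt

# Raise the tunable (default is 50):
sed -i 's/^tunable choose_total_tries .*/tunable choose_total_tries 100/' crushmap.txt

crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new

# Roll back to the original map if anything misbehaves:
# ceph osd setcrushmap -i crushmap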

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD exclusive-lock and lqemu/librbd

2017-06-02 Thread Peter Maloney
On 06/02/17 12:25, koukou73gr wrote:
> On 2017-06-02 13:01, Peter Maloney wrote:
>>> Is it easy for you to reproduce it? I had the same problem, and the same
>>> solution. But it isn't easy to reproduce... Jason Dillaman asked me for
>>> a gcore dump of a hung process but I wasn't able to get one. Can you do
>>> that, and when you reply, CC  Jason Dillaman  ?
>> I mean a hung qemu process on the vm host (the one that uses librbd).
>> And I guess that should be TO rather than CC.
>>
> Peter,
>
> Can it be that my situation is different?
>
> In my case the guest/qemu process itself does not hang. The guest root
> filesystem resides in an rbd image w/o exclusive-lock enabled (the
> pre-existing kind I described).
Of course it could be different, but it seems the same so far... same
solution, and same warnings in the guest; it just takes some time before
the guest totally hangs.

Sometimes the OS seems ok but shows those warnings...
worse, you can see the disk looking 100% busy in iostat while doing
almost nothing, like 1 w/s...
and at worst you can't get anything to run, no screen output or keyboard
input at all, and kill on the qemu process won't even work at that point,
except with -9.

And sometimes you can get the exact same symptoms with a curable
problem... like if you stop too many osds and min_size is not reached
for just one pg that the image uses, then it looks like it works until
it hits that bad pg, and then the above symptoms happen. Most of the
time the VM recovers when the osds are up again, but sometimes not.
Since you mentioned exclusive-lock, though, I still think your case is the
same or highly related.

>
> The problem surfaced when additional storage was attached to the guest,
> through a new rbd image created with exclusive-lock as it is the default
> on Jewel.
>
> The problem is that when parted/fdisk is run on that device, they hang as
> reported. On the other hand,
>
> dd if=/dev/sdb of=/tmp/lala count=512
>
> has no problem completing, While the reverse,
>
> dd if=/tmp/lala of=/dev/sdb count=512
>
> hangs indefinately. While in this state, I can still,ssh to the guest
> and work as long as I don't touch the new device. It appears that when a
> write to the device backed by the exclusive-lock featured image hangs, a
> read to it will hang as well.
>
> -K.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


-- 


Peter Maloney
Brockmann Consult
Max-Planck-Str. 2
21502 Geesthacht
Germany
Tel: +49 4152 889 300
Fax: +49 4152 889 333
E-mail: peter.malo...@brockmann-consult.de
Internet: http://www.brockmann-consult.de


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD exclusive-lock and lqemu/librbd

2017-06-02 Thread Peter Maloney
On 06/02/17 12:06, koukou73gr wrote:
> Thanks for the reply.
>
> Easy?
> Sure, it happens reliably every time I boot the guest with
> exclusive-lock on :)
If it's that easy, also try with only exclusive-lock enabled, without
object-map or fast-diff, and then with one or the other of those as well.
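
A hedged sketch of how those features could be toggled on an existing image
for such a test (the pool/image name rbd/newimage is a placeholder):

# fast-diff and object-map have to be disabled before exclusive-lock can be.
rbd feature disable rbd/newimage object-map fast-diff
rbd feature disable rbd/newimage exclusive-lock

# ...and re-enable just exclusive-lock for the "exclusive-lock only" run:
rbd feature enable rbd/newimage exclusive-lock

rbd info rbd/newimage    # shows which features are currently set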

>
> I'll need some walkthrough on the gcore part though!
gcore is pretty easy... just do something like:
gcore -o "$outfile" "$pid"

And then to upload it to the devs in some *sorta private way
ceph-post-file -d "gcore dump of hung qemu process with
exclusive-lock" "$outfile"


* sorta private warning from ceph-post-file:
> WARNING:
>   Basic measures are taken to make posted data be visible only to
>   developers with access to ceph.com infrastructure. However, users
>   should think twice and/or take appropriate precautions before
>   posting potentially sensitive data (for example, logs or data
>   directories that contain Ceph secrets).
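
Put together, a minimal sketch of capturing and posting such a dump (the
process-name match and description string are just examples; gcore appends
the pid to the output prefix):

# Find the qemu process backing the affected guest (the name match is an example).
pid=$(pgrep -f 'qemu.*my-guest' | head -n1)

gcore -o /tmp/qemu-hang "$pid"          # writes /tmp/qemu-hang.<pid>

ceph-post-file -d "gcore dump of hung qemu process with exclusive-lock" \
    "/tmp/qemu-hang.${pid}"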



> -K.
>
>
> On 2017-06-02 12:59, Peter Maloney wrote:
>> On 06/01/17 17:12, koukou73gr wrote:
>>> Hello list,
>>>
>>> Today I had to create a new image for a VM. This was the first time
>>> since our cluster was updated from Hammer to Jewel. So far I had just been
>>> copying an existing golden image and resizing it as appropriate. But this
>>> time I used rbd create.
>>>
>>> So I "rbd create"d a 2T image and attached it to an existing VM guest
>>> with librbd using:
>>> 
>>>   
>>>   
>>> 
>>>   
>>>   
>>>   
>>>   
>>> 
>>>
>>>
>>> Booted the guest and tried to partition the new drive from inside the
>>> guest. That's it, parted (and anything else for that matter) that tried
>>> to access the new disk would freeze. After 2 minutes the kernel would
>>> start complaining:
>>>
>>> [  360.212391] INFO: task parted:1836 blocked for more than 120 seconds.
>>> [  360.216001]   Not tainted 4.4.0-78-generic #99-Ubuntu
>>> [  360.218663] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>>> disables this message.
>> Is it easy for you to reproduce it? I had the same problem, and the same
>> solution. But it isn't easy to reproduce... Jason Dillaman asked me for
>> a gcore dump of a hung process but I wasn't able to get one. Can you do
>> that, and when you reply, CC  Jason Dillaman  ?
>

-- 


Peter Maloney
Brockmann Consult
Max-Planck-Str. 2
21502 Geesthacht
Germany
Tel: +49 4152 889 300
Fax: +49 4152 889 333
E-mail: peter.malo...@brockmann-consult.de
Internet: http://www.brockmann-consult.de


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD exclusive-lock and lqemu/librbd

2017-06-02 Thread Peter Maloney
On 06/02/17 11:59, Peter Maloney wrote:
> On 06/01/17 17:12, koukou73gr wrote:
>> Hello list,
>>
>> Today I had to create a new image for a VM. This was the first time
>> since our cluster was updated from Hammer to Jewel. So far I had just been
>> copying an existing golden image and resizing it as appropriate. But this
>> time I used rbd create.
>>
>> So I "rbd create"d a 2T image and attached it to an existing VM guest
>> with librbd using:
>> 
>>   
>>   
>> 
>>   
>>   
>>   
>>   
>> 
>>
>>
>> Booted the guest and tried to partition the new drive from inside the
>> guest. That's it, parted (and anything else for that matter) that tried
>> to access the new disk would freeze. After 2 minutes the kernel would
>> start complaining:
>>
>> [  360.212391] INFO: task parted:1836 blocked for more than 120 seconds.
>> [  360.216001]   Not tainted 4.4.0-78-generic #99-Ubuntu
>> [  360.218663] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>> disables this message.
> Is it easy for you to reproduce it? I had the same problem, and the same
> solution. But it isn't easy to reproduce... Jason Dillaman asked me for
> a gcore dump of a hung process but I wasn't able to get one. Can you do
> that, and when you reply, CC  Jason Dillaman  ?
I mean a hung qemu process on the vm host (the one that uses librbd).
And I guess that should be TO rather than CC.

>> After much headbanging, trial and error, I finaly thought of checking
>> the enabled rbd features of an existing image versus the new one.
>>
>> pre-existing: layering, striping
>> new: layering, exclusive-lock, object-map, fast-diff, deep-flatten
>>
>> Disabling exclusive-lock (and fast-diff and object-map before that)
>> would allow the new image to become usable in the guest at last.
>>
>> This is with:
>>
>> ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
>> qemu-img version 2.6.0 (qemu-kvm-ev-2.6.0-28.el7_3.3.1), Copyright (c)
>> 2004-2008 Fabrice Bellard
>>
>> on a host running:
>> CentOS Linux release 7.3.1611 (Core)
>> Linux host-10-206-123-184.physics.auth.gr 3.10.0-327.36.2.el7.x86_64 #1
>> SMP Mon Oct 10 23:08:37 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
>>
>> and a guest
>> DISTRIB_ID=Ubuntu
>> DISTRIB_RELEASE=16.04
>> DISTRIB_CODENAME=xenial
>> DISTRIB_DESCRIPTION="Ubuntu 16.04.2 LTS"
>> Linux srv-10-206-123-87.physics.auth.gr 4.4.0-78-generic #99-Ubuntu SMP
>> Thu Apr 27 15:29:09 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
>>
>> I vaguely remember references to problems when exclusive-lock was
>> enabled on rbd images, but Google didn't reveal much to me.
>>
>> So what is it with exclusive lock? Why does it fail like this? Could you
>> please point me to some documentation on this behaviour?
>>
>> Thanks for any feedback.
>>
>> -K.
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

-- 


Peter Maloney
Brockmann Consult
Max-Planck-Str. 2
21502 Geesthacht
Germany
Tel: +49 4152 889 300
Fax: +49 4152 889 333
E-mail: peter.malo...@brockmann-consult.de
Internet: http://www.brockmann-consult.de


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


  1   2   3   >