Re: [ceph-users] mirror OSD configuration

2018-02-27 Thread Eino Tuominen
> Is it possible to configure crush map such that it will tolerate "room" 
> failure? In my case, there is one
> network switch per room and one power supply per room, which makes a single 
> point of (room) failure.

Hi,

You cannot achieve real room redundancy with just two rooms. At a minimum you'll 
need a third room (witness) from which you'll need independent network 
connections to the two server rooms; otherwise it's impossible to maintain monitor 
quorum when one of the two rooms fails. And then you'd need to consider OSD 
redundancy. You could do with replica size = 4, min_size = 2 (or any size = 2*n, 
min_size = n), but that's not perfect, as you lose exactly half of the 
replicas in case of a room failure. If you were able to use EC pools you'd have 
more options with LRC coding 
(http://docs.ceph.com/docs/master/rados/operations/erasure-code-lrc/). 
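
For reference, an LRC profile along those lines might look roughly like this (a 
sketch only; the parameter values are illustrative, not a recommendation for 
your cluster):

  ceph osd erasure-code-profile set lrc_example \
      plugin=lrc k=4 m=2 l=3 \
      crush-failure-domain=host crush-locality=room
  ceph osd pool create ec_lrc_pool 128 128 erasure lrc_example

With l=3 an additional local parity chunk is created for every three chunks, so 
a single lost chunk can usually be repaired from within its local group rather 
than from the whole stripe.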

We run ceph in a 3 room configuration with 3 monitors, size=3, min_size=2. It 
works, but it's not without hassle either.

-- 
  Eino Tuominen
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] mirror OSD configuration

2018-02-27 Thread Zoran Bošnjak
This is my planned OSD configuration:

root
  room1
    OSD host1
    OSD host2
  room2
    OSD host3
    OSD host4

There are 6 OSDs per host.

Is it possible to configure crush map such that it will tolerate "room" 
failure? In my case, there is one network switch per room and one power supply 
per room, which makes a single point of (room) failure. This is what I would 
like to mitigate.

I could not find any crush rule that would make this configuration redundant 
and safe.

Namely, to tolerate a sudden room (switch, power) failure, there must be a rule 
to "ack" a write only after BOTH rooms have acknowledged it. The problem is that 
such a rule only holds while both rooms are up. As soon as one room goes down 
(with a rule like this) the cluster won't be able to write any more data, since 
the "ack" is not allowed by the rule. It looks like an impossible task with a 
fixed crush map rule. The cluster would somehow need to switch rules to keep 
accepting writes after a room failure. What am I missing?

In general: can ceph tolerate sudden loss of half of the OSDs?
If not, what is the best redundancy I could get out of my configuration?
Is there any workaround with some external tools maybe to detect such failure 
and reconfigure ceph automatically?
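
For concreteness, the closest I could come up with is a rule along these lines 
(a rough sketch; bucket names follow the tree above), which places two copies in 
each room for a size=4 pool, but it still has the availability problem described 
above once a whole room is down:

  rule mirror_rooms {
      id 1
      type replicated
      min_size 2
      max_size 4
      step take root
      step choose firstn 2 type room
      step chooseleaf firstn 2 type host
      step emit
  }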

regards,
Zoran
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph Developer Monthly - March 2018

2018-02-27 Thread Leonardo Vaz
Hey Cephers,

This is just a friendly reminder that the next Ceph Developer Monthly
meeting is coming up:

 http://wiki.ceph.com/Planning

If you have work that you're doing that is feature work, significant
backports, or anything you would like to discuss with the core team,
please add it to the following page:

 http://wiki.ceph.com/CDM_07-MAR-2018

This edition happens on APAC friendly hours (21:00 EST) and we will
use the following Bluejeans URL for the video conference:

 https://bluejeans.com/9290089010/

If you have questions or comments, please let us know.

Kindest regards,

Leo

-- 
Leonardo Vaz
Ceph Community Manager
Open Source and Standards Team
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-deploy won't install luminous (but Jewel instead)

2018-02-27 Thread jorpilo
Try using: ceph-deploy --release luminous host1...
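(Presumably the full form is the install subcommand, something like the sketch 
below; adjust the host list to your nodes.)

  ceph-deploy install --release luminous host1 host2 host3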
 -------- Original message -------- From: Massimiliano Cuttini  
Date: 28/2/18 12:42 a.m. (GMT+01:00) To: ceph-users@lists.ceph.com 
Subject: [ceph-users] ceph-deploy won't install luminous (but Jewel instead) 

This is the 5th time that I install and then purge the installation.

ceph-deploy always installs Jewel instead of Luminous.

No way, even if I force the repo from default to luminous:
https://download.ceph.com/rpm-luminous/el7/noarch
It still installs Jewel; it's stuck.

I've already checked whether I had installed yum-plugin-priorities,
and I did.

Everything is exactly as the documentation requires.

But I still always get Jewel and not Luminous.

  





  ___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous | PG split causing slow requests

2018-02-27 Thread David C
This is super helpful, thanks for sharing, David. I need to do a bit more
reading on this.

On 26 Feb 2018 6:08 p.m., "David Turner"  wrote:

The slow requests are absolutely expected on filestore subfolder
splitting.  You can however stop an OSD, split its subfolders, and start
it back up.  I perform this maintenance once/month.  I changed my settings
to [1]these, but I only suggest doing something this drastic if you're
committed to manually splitting your PGs regularly.  In my environment that
needs to be once/month.

Along with those settings, I use [2]this script to perform the subfolder
splitting. It will change your config file to [3]these settings, perform
the subfolder splitting, change them back to what you currently have, and
start your OSDs back up.  Using a negative merge threshold prevents
subfolder merging, which is useful for some environments.

The script automatically sets noout and unsets it for you afterwards; as well,
it won't start unless the cluster is HEALTH_OK.  Feel free to use it as is
or pick from it what's useful for you.  I highly suggest that anyone
feeling the pains of subfolder splitting do some sort of offline
splitting to get through it.  If you're using some sort of config
management like salt or puppet, be sure to disable it so that the config
won't be overwritten while the subfolders are being split.


[1] filestore_merge_threshold = -16
 filestore_split_multiple = 256

[2] https://gist.github.com/drakonstein/cb76c7696e65522ab0e699b7ea1ab1c4

[3] filestore_merge_threshold = -1
 filestore_split_multiple = 1
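
For anyone who just wants the shape of the offline procedure without reading the 
script, it boils down to roughly this per OSD (a sketch; OSD id, data path and 
pool name are illustrative, and the gist above additionally handles the config 
file changes, health checks and noout for you):

  ceph osd set noout
  systemctl stop ceph-osd@12
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
      --op apply-layout-settings --pool cephfs_data
  systemctl start ceph-osd@12
  ceph osd unset noout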
On Mon, Feb 26, 2018 at 12:18 PM David C  wrote:

> Thanks, David. I think I've probably used the wrong terminology here, I'm
> not splitting PGs to create more PGs. This is the PG folder splitting that
> happens automatically, I believe it's controlled by the
> "filestore_split_multiple" setting (which is 8 on my OSDs, I believe that's
> the Luminous default...). Increasing heartbeat grace would probably still
> be a good idea to prevent the flapping. I'm trying to understand if the
> slow requests are to be expected or if I need to tune something or look at
> hardware.
>
> On Mon, Feb 26, 2018 at 4:19 PM, David Turner 
> wrote:
>
>> Splitting PG's is one of the most intensive and disruptive things you
>> can, and should, do to a cluster.  Tweaking recovery sleep, max backfills,
>> and heartbeat grace should help with this.  Heartbeat grace can be set high
>> enough to mitigate the OSDs flapping which slows things down by peering and
>> additional recovery, while still being able to detect OSDs that might fail
>> and go down.  The recovery sleep and max backfills are the settings you
>> want to look at for mitigating slow requests.  I generally tweak those
>> while watching iostat of some OSDs and ceph -s to make sure I'm not giving
>> too  much priority to the recovery operations so that client IO can still
>> happen.
>>
>> On Mon, Feb 26, 2018 at 11:10 AM David C  wrote:
>>
>>> Hi All
>>>
>>> I have a 12.2.1 cluster, all filestore OSDs, OSDs are spinners, journals
>>> on NVME. Cluster primarily used for CephFS, ~20M objects.
>>>
>>> I'm seeing some OSDs getting marked down, it appears to be related to PG
>>> splitting, e.g:
>>>
>>> 2018-02-26 10:27:27.935489 7f140dbe2700  1 _created [C,D] has 5121
 objects, starting split.

>>>
>>> Followed by:
>>>
>>> 2018-02-26 10:27:58.242551 7f141cc3f700  0 log_channel(cluster) log
 [WRN] : 9 slow requests, 5 included below; oldest blocked for > 30.308128
 secs
 2018-02-26 10:27:58.242563 7f141cc3f700  0 log_channel(cluster) log
 [WRN] : slow request 30.151105 seconds old, received at 2018-02-26
 10:27:28.091312: osd_op(mds.0.5339:811969 3.5c
 3:3bb9d743:::200.0018c6c4:head [write 73416~5897 [fadvise_dontneed]] snapc
 0=[] ondisk+write+known_if_redirected+full_force e13994) currently
 commit_sent
 2018-02-26 10:27:58.242569 7f141cc3f700  0 log_channel(cluster) log
 [WRN] : slow request 30.133441 seconds old, received at 2018-02-26
 10:27:28.108976: osd_op(mds.0.5339:811970 3.5c
 3:3bb9d743:::200.0018c6c4:head [write 79313~4866 [fadvise_dontneed]] snapc
 0=[] ondisk+write+known_if_redirected+full_force e13994) currently
 commit_sent
 2018-02-26 10:27:58.242574 7f141cc3f700  0 log_channel(cluster) log
 [WRN] : slow request 30.083401 seconds old, received at 2018-02-26
 10:27:28.159016: osd_op(mds.9174516.0:444202 3.5c
 3:3bb9d743:::200.0018c6c4:head [stat] snapc 0=[]
 ondisk+read+rwordered+known_if_redirected+full_force e13994) currently
 waiting for rw locks
 2018-02-26 10:27:58.242579 7f141cc3f700  0 log_channel(cluster) log
 [WRN] : slow request 30.072310 seconds old, received at 2018-02-26
 10:27:28.170107: osd_op(mds.0.5339:811971 3.5c
 3:3bb9d743:::200.0018c6c4:head [write 84179~1941 [fadvise_dontneed]] snapc
 0=[] 

Re: [ceph-users] OSD maintenance (ceph osd set noout)

2018-02-27 Thread John Spray
On Tue, Feb 27, 2018 at 6:37 PM, Andre Goree  wrote:
> Is it still considered best practice to set 'noout' for OSDs that will be
> going under maintenance, e.g., rebooting an OSD node for a kernel update?
>
> I ask, because I've set this twice now during times which the OSDs would
> only momentarily be 'out', however each time I've done this, the OSDs have
> become unusable and I've had to rebuild them.

Can you be more specific about "unusable"?  Marking an OSD noout is of
course not meant to harm it!

John

> Also, when I _do not_ set 'noout', it would seem that once the node reboots
> the OSDs come back online without issue _and_ there is very _little_
> recovery i/o -- I'd expect to see lots of recovery i/o if a node goes down
> as the cluster tries to replace the PGs on other OSD nodes.  This further
> makes me believe that setting 'noout' is no longer necessary.
>
> I'm running version 12.2.2-12.2.4 (in the middle of upgrading).
>
> Thanks in advance.
>
> --
> Andre Goree
> -=-=-=-=-=-
> Email - andre at drenet.net
> Website   - http://blog.drenet.net
> PGP key   - http://www.drenet.net/pubkey.html
> -=-=-=-=-=-
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Kubernetes support for rbd-nbd

2018-02-27 Thread Prashant Murthy
Hi all,

Heads up for any of you who are using Kubernetes or looking to use
Kubernetes to schedule Ceph daemons running within containers (for block
volumes). We have been working on adding support for rbd-nbd in the
Kubernetes provisioner (as context, the current Kubernetes provisioner
offers support only for the Ceph kernel client - krbd to provision Ceph
block volumes). The changes to support rbd-nbd were merged into the
upcoming 1.10 Kubernetes release.

Once this is released, Kubernetes will be able to use rbd-nbd (with Ceph's
user-space client librbd) to provision Ceph block volumes.

The PR that was merged is here:
https://github.com/kubernetes/kubernetes/pull/58916.
Kubernetes 1.10 release timelines are here:
https://github.com/kubernetes/sig-release/blob/master/releases/release-1.10/release-1.10.md
.

Thanks,
Prashant


-- 
Prashant Murthy
Sr Director, Software Engineering | Salesforce
Mobile: 919-961-3041

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OSD maintenance (ceph osd set noout)

2018-02-27 Thread Andre Goree
Is it still considered best practice to set 'noout' for OSDs that will 
be going under maintenance, e.g., rebooting an OSD node for a kernel 
update?


I ask because I've set this twice now during times when the OSDs would 
only momentarily be 'out'; however, each time I've done this, the OSDs 
have become unusable and I've had to rebuild them.


Also, when I _do not_ set 'noout', it would seem that once the node 
reboots the OSDs come back online without issue _and_ there is very 
_little_ recovery i/o -- I'd expect to see lots of recovery i/o if a 
node goes down as the cluster tries to replace the PGs on other OSD 
nodes.  This further makes me believe that setting 'noout' is no longer 
necessary.
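
For reference, the procedure I mean is roughly the usual one (a sketch):

  ceph osd set noout      # keep OSDs from being marked out during the reboot
  # ... reboot the node / do the maintenance ...
  ceph osd unset noout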


I'm running version 12.2.2-12.2.4 (in the middle of upgrading).

Thanks in advance.

--
Andre Goree
-=-=-=-=-=-
Email - andre at drenet.net
Website   - http://blog.drenet.net
PGP key   - http://www.drenet.net/pubkey.html
-=-=-=-=-=-
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cephfs fsal + nfs-ganesha + el7/centos7

2018-02-27 Thread Oliver Freyermuth
On 19.02.2018 17:22, Daniel Gryniewicz wrote:
> To my knowledge, no one has done any work on ganesha + ceph and selinux.  
> Fedora (and RHEL) includes config in its selinux package for ganesha + 
> gluster, but I'm sure there are missing bits for ceph.

Thanks!
I was asking here since, from the latest talks on Ceph, I would expect 
nfs-ganesha to become a major "supported feature", potentially starting even 
with Mimic. 

For anybody who is following / curious, I had to extend my manual SELinux 
module to fix kerberos ticket cache issues. 

I'm now using the following successfully: 

module nfs_ganesha-fix-perms 1.0;

require {
type proc_net_t;
type cyphesis_port_t;
type krb5_host_rcache_t;
type ganesha_t;
class capability setuid;
class capability setgid;
class capability dac_override;
class tcp_socket name_connect;
class file { getattr open read write };
}

#= ganesha_t ==
allow ganesha_t cyphesis_port_t:tcp_socket name_connect;
allow ganesha_t proc_net_t:file { getattr open read };
allow ganesha_t self:capability dac_override;
allow ganesha_t self:capability setuid;
allow ganesha_t self:capability setgid;
allow ganesha_t krb5_host_rcache_t:file write;
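
In case it is useful to someone, a module like this can be built and loaded with 
the standard SELinux toolchain, roughly as follows (file names are of course 
arbitrary):

  checkmodule -M -m -o nfs_ganesha-fix-perms.mod nfs_ganesha-fix-perms.te
  semodule_package -o nfs_ganesha-fix-perms.pp -m nfs_ganesha-fix-perms.mod
  semodule -i nfs_ganesha-fix-perms.pp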

Cheers,
Oliver

> 
> Daniel
> 
> On 02/17/2018 03:15 PM, Oliver Freyermuth wrote:
>> Hi together,
>>
>> many thanks for the RPMs provided at:
>>    http://download.ceph.com/nfs-ganesha/
>> They are very much appreciated!
>>
>>
>> Since the statement was that they will also be maintained in the future, and 
>> NFS Ganesha seems an important project for the future of Ceph,
>> let me do the first "packaging" bug report.
>>
>> It seems that the current packages do not play so well with SELinux. I'm 
>> currently using an SELinux module with the following allows, found by
>> iterative use of audit2allow (full ".te" module added at the end of the 
>> mail):
>>
>> allow ganesha_t cyphesis_port_t:tcp_socket name_connect;
>> allow ganesha_t proc_net_t:file { getattr open read };
>> allow ganesha_t self:capability dac_override;
>> allow ganesha_t self:capability setuid;
>> allow ganesha_t self:capability setgid;
>>
>> "cyphesis_port_t" is probably needed since its range (tcp: 6767, 6769, 
>> 6780-6799) overlaps with the default ports
>> recommended for use by OSDs and nfs-ganesha uses libcephfs to talk to them, 
>> the other caps appear to be needed by nfs-ganesha itself.
>>
>> With these in place, it seems my setup is working well. Without the "setgid" 
>> cap, for example, nfs-ganesha just segfaults after the permission denied 
>> failure.
>> Of course, it would be best if they were installed by the package 
>> (potentially, more restrictive allows are possible with some care).
>>
>>
>> Please include me in replies, I am not subscribed to the list.
>>
>> Cheers and all the best,
>> Oliver
>>
>> 
>>
>> module nfs_ganesha-fix-perms 1.0;
>>
>> require {
>>  type proc_net_t;
>>  type cyphesis_port_t;
>>  type ganesha_t;
>>  class capability setuid;
>>  class capability setgid;
>>  class capability dac_override;
>>  class tcp_socket name_connect;
>>  class file { getattr open read };
>> }
>>
>> #= ganesha_t ==
>> allow ganesha_t cyphesis_port_t:tcp_socket name_connect;
>> allow ganesha_t proc_net_t:file { getattr open read };
>> allow ganesha_t self:capability dac_override;
>> allow ganesha_t self:capability setuid;
>> allow ganesha_t self:capability setgid;
>>
>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Proper procedure to replace DB/WAL SSD

2018-02-27 Thread Dietmar Rieder
Thanks for making this clear.

Dietmar

On 02/27/2018 05:29 PM, Alfredo Deza wrote:
> On Tue, Feb 27, 2018 at 11:13 AM, Dietmar Rieder
>  wrote:
>> ... however, it would be nice if ceph-volume would also create the
>> partitions for the WAL and/or DB if needed. Is there a special reason,
>> why this is not implemented?
> 
> Yes, the reason is that this was one of the most painful points in
> ceph-disk (code and maintenance-wise): to be in the business of
> understanding partitions, sizes, requirements, and devices
> is non-trivial.
> 
> One of the reasons ceph-disk did this was because it required quite a
> hefty amount of "special sauce" on partitions so that these would be
> discovered later by mechanisms that included udev.
> 
> If an admin wanted more flexibility, we decided that it had to be up
> to configuration management system (or whatever deployment mechanism)
> to do so. For users that want a simplistic approach (in the case of
> bluestore)
> we have a 1:1 mapping for device->logical volume->OSD
> 
> On the ceph-volume side as well, implementing partitions meant to also
> have a similar support for logical volumes, which have lots of
> variations that can be supported and we were not willing to attempt to
> support them all.
> 
> Even a small subset would inevitably bring up the question of "why is
> setup X not supported by ceph-volume if setup Y is?"
> 
> Configuration management systems are better suited for handling these
> situations, and we would prefer to offload that responsibility to
> those systems.
> 
>>
>> Dietmar
>>
>>
>> On 02/27/2018 04:25 PM, David Turner wrote:
>>> Gotcha.  As a side note, that setting is only used by ceph-disk as
>>> ceph-volume does not create partitions for the WAL or DB.  You need to
>>> create those partitions manually if using anything other than a whole
>>> block device when creating OSDs with ceph-volume.
>>>
>>> On Tue, Feb 27, 2018 at 8:20 AM Caspar Smit >> > wrote:
>>>
>>> David,
>>>
>>> Yes i know, i use 20GB partitions for 2TB disks as journal. It was
>>> just to inform other people that Ceph's default of 1GB is pretty low.
>>> Now that i read my own sentence it indeed looks as if i was using
>>> 1GB partitions, sorry for the confusion.
>>>
>>> Caspar
>>>
>>> 2018-02-27 14:11 GMT+01:00 David Turner >> >:
>>>
>>> If you're only using a 1GB DB partition, there is a very real
>>> possibility it's already 100% full. The safe estimate for DB
>>> size seems to be 10GB/1TB so for a 4TB osd a 40GB DB should work
>>> for most use cases (except loads and loads of small files).
>>> There are a few threads that mention how to check how much of
>>> your DB partition is in use. Once it's full, it spills over to
>>> the HDD.
>>>
>>>
>>> On Tue, Feb 27, 2018, 6:19 AM Caspar Smit
>>> > wrote:
>>>
>>> 2018-02-26 23:01 GMT+01:00 Gregory Farnum
>>> >:
>>>
>>> On Mon, Feb 26, 2018 at 3:23 AM Caspar Smit
>>> >
>>> wrote:
>>>
>>> 2018-02-24 7:10 GMT+01:00 David Turner
>>> >:
>>>
>>> Caspar, it looks like your idea should work.
>>> Worst case scenario seems like the osd wouldn't
>>> start, you'd put the old SSD back in and go back
>>> to the idea to weight them to 0, backfilling,
>>> then recreate the osds. Definitely with a try in
>>> my opinion, and I'd love to hear your experience
>>> after.
>>>
>>>
>>> Hi David,
>>>
>>> First of all, thank you for ALL your answers on this
>>> ML, you're really putting a lot of effort into
>>> answering many questions asked here and very often
>>> they contain invaluable information.
>>>
>>>
>>> To follow up on this post i went out and built a
>>> very small (proxmox) cluster (3 OSD's per host) to
>>> test my suggestion of cloning the DB/WAL SDD. And it
>>> worked!
>>> Note: this was on Luminous v12.2.2 (all bluestore,
>>> ceph-disk based OSD's)
>>>
>>> Here's what i did on 1 node:
>>>
>>> 1) ceph osd set noout
>>> 2) systemctl stop osd.0; systemctl stop
>>> osd.1; systemctl stop osd.2
>>> 3) ddrescue -f 

Re: [ceph-users] Ceph luminous - Erasure code and iSCSI gateway

2018-02-27 Thread Steven Vacaroaia
Thanks - that worked


[root@osd01 ~]# rbd --image image_ec1 -p rbd info
rbd image 'image_ec1':
size 51200 MB in 12800 objects
order 22 (4096 kB objects)
data_pool: ec_k4_m2
block_name_prefix: rbd_data.1.fe0f643c9869
format: 2
features: layering, data-pool
flags:
create_timestamp: Tue Feb 27 09:49:35 2018

[root@osd01 ~]# rbd feature enable image_ec1 exclusive-lock

[root@osd01 ~]# rbd --image image_ec1 -p rbd info
rbd image 'image_ec1':
size 51200 MB in 12800 objects
order 22 (4096 kB objects)
data_pool: ec_k4_m2
block_name_prefix: rbd_data.1.fe0f643c9869
format: 2
features: layering, exclusive-lock, data-pool
flags:
create_timestamp: Tue Feb 27 09:49:35 2018

[root@osd01 ~]# gwcli
/disks> create pool=rbd image=image_ec1 size=120G
ok
/disks> ls
o- disks ................................................... [320G, Disks: 3]
  o- rbd.image_ec1 ......................................... [image_ec1 (120G)]
  o- rbd.vmware02 .......................................... [vmware02 (100G)]
  o- rbd.vmwware01 ......................................... [vmwware01 (100G)]


On 27 February 2018 at 11:17, Jason Dillaman  wrote:

> Do your pre-created images have the exclusive-lock feature enabled?
> That is required to utilize them for iSCSI.
>
> On Tue, Feb 27, 2018 at 11:09 AM, Steven Vacaroaia 
> wrote:
> > Hi Jason,
> >
> > Thanks for your prompt response
> >
> > I have not been able to find a way to add an existing image ... it looks
> > like I can just create new ones
> >
> >
> > Ill appreciate if you could provide details please
> >
> > For example how would I add the preexisting image named image_ec1 ?
> >
> >  rados -p rbd ls | grep rbd_id
> > rbd_id.image01
> > rbd_id.image_ec1
> > rbd_id.vmware02
> > rbd_id.vmwware01
> >
> > [root@osd01 ~]# gwcli
> > /disks> ls
> > o- disks
> > 
> ..
> > [200G, Disks: 2]
> >   o- rbd.vmware02
> > 
> 
> > [vmware02 (100G)]
> >   o- rbd.vmwware01
> > 
> ..
> > [vmwware01 (100G)]
> >
> > /disks> create pool=rbd image=image_ec1 size=120G
> > Failed : disk create/update failed on osd01. LUN allocation failure
> > /disks> exit
> >
> >
> > (LUN.allocate) rbd 'image_ec1' is not compatible with LIO
> > Only image features
> > RBD_FEATURE_LAYERING,RBD_FEATURE_EXCLUSIVE_LOCK,RBD_
> FEATURE_OBJECT_MAP,RBD_FEATURE_FAST_DIFF,RBD_FEATURE_DEEP_FLATTEN
> > are supported
> > 2018-02-27 11:06:23,424ERROR [rbd-target-api:731:_disk()] - LUN alloc
> > problem - (LUN.allocate) rbd 'image_ec1' is not compatible with LIO
> >
> >
> >
> > On 27 February 2018 at 10:52, Jason Dillaman 
> wrote:
> >>
> >> Your image does not live in the EC pool -- instead, only the data
> >> portion lives within the EC pool. Therefore, you would need to specify
> >> the replicated pool where the image lives when attaching it as a
> >> backing store for iSCSI (i.e. pre-create it via the rbd CLI):
> >>
> >> # gwcli
> >> /iscsi-target...sx01-657d71e0> cd /disks
> >> /disks> create pool=rbd image=image_ec1 size=XYZ
> >>
> >>
> >> On Tue, Feb 27, 2018 at 10:42 AM, Steven Vacaroaia 
> >> wrote:
> >> > Hi,
> >> >
> >> > I noticed it is possible to use erasure code pool for RBD and CEPHFS
> >> >
> >> > https://ceph.com/community/new-luminous-erasure-coding-rbd-cephfs/
> >> >
> >> > This got me thinking that I can deploy iSCSI luns on EC pools
> >> > However it appears it is not working
> >> >
> >> > Anyone able to do that or have I misunderstood ?
> >> >
> >> > Thanks
> >> > Steven
> >> >
> >> > Here is the pool
> >> >
> >> > ceph osd pool get ec_k4_m2 all
> >> > size: 6
> >> > min_size: 5
> >> > crash_replay_interval: 0
> >> > pg_num: 128
> >> > pgp_num: 128
> >> > crush_rule: ec_k4_m2
> >> > hashpspool: true
> >> > nodelete: false
> >> > nopgchange: false
> >> > nosizechange: false
> >> > write_fadvise_dontneed: false
> >> > noscrub: false
> >> > nodeep-scrub: false
> >> > use_gmt_hitset: 1
> >> > auid: 0
> >> > erasure_code_profile: EC_OSD
> >> > fast_read: 0
> >> >
> >> >
> >> > here is how I created an image just to make sure RBD is supported
> >> > rbd create rbd/image_ec1 --size 51200 --data-pool ec_k4_m2
> >> > --image-feature
> >> > layering
> >> >
> >> > here is what gwcli complains about
> >> > gwcli
> >> > /iscsi-target...sx01-657d71e0> cd /disks
> >> > /disks> create pool=ec_k4_m2 image=testec size=120G
> >> > 

Re: [ceph-users] fast_read in EC pools

2018-02-27 Thread Oliver Freyermuth
Dear Caspar,

many thanks for the link! 

Now I'm pondering - the problem is that we'd certainly want to keep m=2, but 
reducing k=4 to k=3 already means a significant reduction in usable storage. 
We'll look into how likely it is that new OSD hosts will be added in the near 
future and consider this before deciding on the final configuration. 

Many thanks again, this surely helps a lot!
Oliver

On 27.02.2018 14:45, Caspar Smit wrote:
> Oliver,
> 
> Here's the commit info:
> 
> https://github.com/ceph/ceph/commit/48e40fcde7b19bab98821ab8d604eab920591284
> 
> Caspar
> 
> 2018-02-27 14:28 GMT+01:00 Oliver Freyermuth  >:
> 
> On 27.02.2018 14:16, Caspar Smit wrote:
> > Oliver,
> >
> > Be aware that for k=4,m=2 the min_size will be 5 (k+1), so after a node 
> failure the min_size is already reached.
> > Any OSD failure beyond the node failure will probably result in some 
> PG's to be become incomplete (I/O Freeze) until the incomplete PG's data is 
> recovered to another OSD in that node.
> >
> > So please reconsider your statement "one host + x safety" as the x 
> safety (with I/O freeze) is probably not what you want.
> >
> > Forcing to run with min_size=4 could also be dangerous for other 
> reasons. (there's a reason why min_size = k+1)
> 
> Thanks for pointing this out!
> Yes, indeed, in case we need to take down a host for a longer period (we 
> would hope this never has to happen for > 24 hours... but you never know),
> and in case disks start to fail, we would indeed have to degrade to 
> min_size=4 to keep running.
> 
> What exactly are the implications?
> It should still be possible to ensure the data is not corrupt (with the 
> checksums), and recovery to k+1 copies should start automatically once a disk 
> fails -
> so what's the actual implication?
> Of course pg repair can not work in that case (if a PG for which the 
> additional disk failed is corrupted),
> but in general, when there's the need to reinstall a host, we'd try to 
> bring it back with OSD data intact -
> which should then allow to postpone the repair until that point.
> 
> Is there a danger I miss in my reasoning?
> 
> Cheers and many thanks!
>         Oliver
> 
> >
> > Caspar
> >
> > 2018-02-27 0:17 GMT+01:00 Oliver Freyermuth 
>  
>  >>:
> >
>     On 27.02.2018 00:10, Gregory Farnum wrote:
> >     > On Mon, Feb 26, 2018 at 2:59 PM Oliver Freyermuth 
>  
> > 
>  
>   >     >
> >     >
> >     >     >     Does this match expectations?
> >     >     >
> >     >     >
> >     >     > Can you get the output of eg "ceph pg 2.7cd query"? Want to 
> make sure the backfilling versus acting sets and things are correct.
> >     >
> >     >     You'll find attached:
> >     >     query_allwell)  Output of "ceph pg 2.7cd query" when all OSDs 
> are up and everything is healthy.
> >     >     query_one_host_out) Output of "ceph pg 2.7cd query" when OSDs 
> 164-195 (one host) are down and out.
> >     >
> >     >
> >     > Yep, that's what we want to see. So when everything's well, we 
> have OSDs 91, 63, 33, 163, 192, 103. That corresponds to chassis 3, 2, 1, 5, 
> 6, 4.
> >     >
> >     > When marking out a host, we have OSDs 91, 63, 33, 163, 123, 
> UNMAPPED. That corresponds to chassis 3, 2, 1, 5, 4, UNMAPPED.
> >     >
> >     > So what's happened is that with the new map, when choosing the 
> home for shard 4, we selected host 4 instead of host 6 (which is gone). And 
> now shard 5 can't map properly. But of course we still have shard 5 available 
> on host 4, so host 4 is going to end up properly owning shard 4, but also 
> just carrying that shard 5 around as a remapped location.
> >     >
> >     > So this is as we expect. Whew.
> >     > -Greg
> >
> >     Understood. Thanks for explaining step by step :-).
> >     It's of course a bit weird that this happens, since in the end, 
> this really means data is moved (or rather, a shard is recreated) and taking 
> up space without increasing redundancy
> >     (well, it might, if it lands on a different OSD than shard 5, but 
> that's not really ensured). I'm unsure if this can be solved "better" in any 
> way.
> >
> >     Anyways, it seems this would be another reason why running with 
> k+m=number of hosts should not be a general 

Re: [ceph-users] Proper procedure to replace DB/WAL SSD

2018-02-27 Thread Alfredo Deza
On Tue, Feb 27, 2018 at 11:13 AM, Dietmar Rieder
 wrote:
> ... however, it would be nice if ceph-volume would also create the
> partitions for the WAL and/or DB if needed. Is there a special reason,
> why this is not implemented?

Yes, the reason is that this was one of the most painful points in
ceph-disk (code and maintenance-wise): to be in the business of
understanding partitions, sizes, requirements, and devices
is non-trivial.

One of the reasons ceph-disk did this was because it required quite a
hefty amount of "special sauce" on partitions so that these would be
discovered later by mechanisms that included udev.

If an admin wanted more flexibility, we decided that it had to be up
to the configuration management system (or whatever deployment mechanism)
to do so. For users that want a simplistic approach (in the case of
bluestore) we have a 1:1 mapping for device -> logical volume -> OSD.

On the ceph-volume side as well, implementing partitions meant to also
have a similar support for logical volumes, which have lots of
variations that can be supported and we were not willing to attempt to
support them all.

Even a small subset would inevitably bring up the question of "why is
setup X not supported by ceph-volume if setup Y is?"

Configuration management systems are better suited for handling these
situations, and we would prefer to offload that responsibility to
those systems.
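
As a sketch of what that looks like in practice for bluestore (device, VG and LV 
names are made up): pre-create the DB device yourself, then hand it to
ceph-volume:

  lvcreate -L 40G -n db-osd0 nvme-vg
  ceph-volume lvm create --bluestore --data /dev/sdb --block.db nvme-vg/db-osd0

A GPT partition instead of a logical volume also works for --block.db.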

>
> Dietmar
>
>
> On 02/27/2018 04:25 PM, David Turner wrote:
>> Gotcha.  As a side note, that setting is only used by ceph-disk as
>> ceph-volume does not create partitions for the WAL or DB.  You need to
>> create those partitions manually if using anything other than a whole
>> block device when creating OSDs with ceph-volume.
>>
>> On Tue, Feb 27, 2018 at 8:20 AM Caspar Smit > > wrote:
>>
>> David,
>>
>> Yes i know, i use 20GB partitions for 2TB disks as journal. It was
>> just to inform other people that Ceph's default of 1GB is pretty low.
>> Now that i read my own sentence it indeed looks as if i was using
>> 1GB partitions, sorry for the confusion.
>>
>> Caspar
>>
>> 2018-02-27 14:11 GMT+01:00 David Turner > >:
>>
>> If you're only using a 1GB DB partition, there is a very real
>> possibility it's already 100% full. The safe estimate for DB
>> size seems to be 10GB/1TB so for a 4TB osd a 40GB DB should work
>> for most use cases (except loads and loads of small files).
>> There are a few threads that mention how to check how much of
>> your DB partition is in use. Once it's full, it spills over to
>> the HDD.
>>
>>
>> On Tue, Feb 27, 2018, 6:19 AM Caspar Smit
>> > wrote:
>>
>> 2018-02-26 23:01 GMT+01:00 Gregory Farnum
>> >:
>>
>> On Mon, Feb 26, 2018 at 3:23 AM Caspar Smit
>> >
>> wrote:
>>
>> 2018-02-24 7:10 GMT+01:00 David Turner
>> >:
>>
>> Caspar, it looks like your idea should work.
>> Worst case scenario seems like the osd wouldn't
>> start, you'd put the old SSD back in and go back
>> to the idea to weight them to 0, backfilling,
>> then recreate the osds. Definitely with a try in
>> my opinion, and I'd love to hear your experience
>> after.
>>
>>
>> Hi David,
>>
>> First of all, thank you for ALL your answers on this
>> ML, you're really putting a lot of effort into
>> answering many questions asked here and very often
>> they contain invaluable information.
>>
>>
>> To follow up on this post i went out and built a
>> very small (proxmox) cluster (3 OSD's per host) to
>> test my suggestion of cloning the DB/WAL SDD. And it
>> worked!
>> Note: this was on Luminous v12.2.2 (all bluestore,
>> ceph-disk based OSD's)
>>
>> Here's what i did on 1 node:
>>
>> 1) ceph osd set noout
>> 2) systemctl stop osd.0; systemctl stop
>> osd.1; systemctl stop osd.2
>> 3) ddrescue -f -n -vv  
>> /root/clone-db.log
>> 4) removed the old SSD physically from the node
>> 5) checked with "ceph -s" and already saw HEALTH_OK
>> and 

Re: [ceph-users] Ceph luminous - Erasure code and iSCSI gateway

2018-02-27 Thread Jason Dillaman
Do your pre-created images have the exclusive-lock feature enabled?
That is required to utilize them for iSCSI.
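
In other words, something along these lines (a sketch based on the image from 
the earlier mails; the feature list is the relevant part):

  rbd create rbd/image_ec1 --size 51200 --data-pool ec_k4_m2 \
      --image-feature layering,exclusive-lock

or, for an image that already exists:

  rbd feature enable rbd/image_ec1 exclusive-lock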

On Tue, Feb 27, 2018 at 11:09 AM, Steven Vacaroaia  wrote:
> Hi Jason,
>
> Thanks for your prompt response
>
> I have not been able to find a way to add an existing image ... it looks
> like I can just create new ones
>
>
> Ill appreciate if you could provide details please
>
> For example how would I add the preexisting image named image_ec1 ?
>
>  rados -p rbd ls | grep rbd_id
> rbd_id.image01
> rbd_id.image_ec1
> rbd_id.vmware02
> rbd_id.vmwware01
>
> [root@osd01 ~]# gwcli
> /disks> ls
> o- disks
> ..
> [200G, Disks: 2]
>   o- rbd.vmware02
> 
> [vmware02 (100G)]
>   o- rbd.vmwware01
> ..
> [vmwware01 (100G)]
>
> /disks> create pool=rbd image=image_ec1 size=120G
> Failed : disk create/update failed on osd01. LUN allocation failure
> /disks> exit
>
>
> (LUN.allocate) rbd 'image_ec1' is not compatible with LIO
> Only image features
> RBD_FEATURE_LAYERING,RBD_FEATURE_EXCLUSIVE_LOCK,RBD_FEATURE_OBJECT_MAP,RBD_FEATURE_FAST_DIFF,RBD_FEATURE_DEEP_FLATTEN
> are supported
> 2018-02-27 11:06:23,424ERROR [rbd-target-api:731:_disk()] - LUN alloc
> problem - (LUN.allocate) rbd 'image_ec1' is not compatible with LIO
>
>
>
> On 27 February 2018 at 10:52, Jason Dillaman  wrote:
>>
>> Your image does not live in the EC pool -- instead, only the data
>> portion lives within the EC pool. Therefore, you would need to specify
>> the replicated pool where the image lives when attaching it as a
>> backing store for iSCSI (i.e. pre-create it via the rbd CLI):
>>
>> # gwcli
>> /iscsi-target...sx01-657d71e0> cd /disks
>> /disks> create pool=rbd image=image_ec1 size=XYZ
>>
>>
>> On Tue, Feb 27, 2018 at 10:42 AM, Steven Vacaroaia 
>> wrote:
>> > Hi,
>> >
>> > I noticed it is possible to use erasure code pool for RBD and CEPHFS
>> >
>> > https://ceph.com/community/new-luminous-erasure-coding-rbd-cephfs/
>> >
>> > This got me thinking that I can deploy iSCSI luns on EC pools
>> > However it appears it is not working
>> >
>> > Anyone able to do that or have I misunderstood ?
>> >
>> > Thanks
>> > Steven
>> >
>> > Here is the pool
>> >
>> > ceph osd pool get ec_k4_m2 all
>> > size: 6
>> > min_size: 5
>> > crash_replay_interval: 0
>> > pg_num: 128
>> > pgp_num: 128
>> > crush_rule: ec_k4_m2
>> > hashpspool: true
>> > nodelete: false
>> > nopgchange: false
>> > nosizechange: false
>> > write_fadvise_dontneed: false
>> > noscrub: false
>> > nodeep-scrub: false
>> > use_gmt_hitset: 1
>> > auid: 0
>> > erasure_code_profile: EC_OSD
>> > fast_read: 0
>> >
>> >
>> > here is how I created an image just to make sure RBD is supported
>> > rbd create rbd/image_ec1 --size 51200 --data-pool ec_k4_m2
>> > --image-feature
>> > layering
>> >
>> > here is what gwcli complains about
>> > gwcli
>> > /iscsi-target...sx01-657d71e0> cd /disks
>> > /disks> create pool=ec_k4_m2 image=testec size=120G
>> > Invalid pool (ec_k4_m2). Must already exist and be replicated
>> >
>> >
>> >
>> >
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >
>>
>>
>>
>> --
>> Jason
>
>



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Proper procedure to replace DB/WAL SSD

2018-02-27 Thread Dietmar Rieder
... however, it would be nice if ceph-volume would also create the
partitions for the WAL and/or DB if needed. Is there a special reason,
why this is not implemented?

Dietmar


On 02/27/2018 04:25 PM, David Turner wrote:
> Gotcha.  As a side note, that setting is only used by ceph-disk as
> ceph-volume does not create partitions for the WAL or DB.  You need to
> create those partitions manually if using anything other than a whole
> block device when creating OSDs with ceph-volume.
> 
> On Tue, Feb 27, 2018 at 8:20 AM Caspar Smit  > wrote:
> 
> David,
> 
> Yes i know, i use 20GB partitions for 2TB disks as journal. It was
> just to inform other people that Ceph's default of 1GB is pretty low.
> Now that i read my own sentence it indeed looks as if i was using
> 1GB partitions, sorry for the confusion.
> 
> Caspar
> 
> 2018-02-27 14:11 GMT+01:00 David Turner  >:
> 
> If you're only using a 1GB DB partition, there is a very real
> possibility it's already 100% full. The safe estimate for DB
> size seems to be 10GB/1TB so for a 4TB osd a 40GB DB should work
> for most use cases (except loads and loads of small files).
> There are a few threads that mention how to check how much of
> your DB partition is in use. Once it's full, it spills over to
> the HDD.
> 
> 
> On Tue, Feb 27, 2018, 6:19 AM Caspar Smit
> > wrote:
> 
> 2018-02-26 23:01 GMT+01:00 Gregory Farnum
> >:
> 
> On Mon, Feb 26, 2018 at 3:23 AM Caspar Smit
> >
> wrote:
> 
> 2018-02-24 7:10 GMT+01:00 David Turner
> >:
> 
> Caspar, it looks like your idea should work.
> Worst case scenario seems like the osd wouldn't
> start, you'd put the old SSD back in and go back
> to the idea to weight them to 0, backfilling,
> then recreate the osds. Definitely with a try in
> my opinion, and I'd love to hear your experience
> after.
> 
> 
> Hi David,
> 
> First of all, thank you for ALL your answers on this
> ML, you're really putting a lot of effort into
> answering many questions asked here and very often
> they contain invaluable information.
> 
> 
> To follow up on this post i went out and built a
> very small (proxmox) cluster (3 OSD's per host) to
> test my suggestion of cloning the DB/WAL SDD. And it
> worked!
> Note: this was on Luminous v12.2.2 (all bluestore,
> ceph-disk based OSD's)
> 
> Here's what i did on 1 node:
> 
> 1) ceph osd set noout
> 2) systemctl stop osd.0; systemctl stop
> osd.1; systemctl stop osd.2
> 3) ddrescue -f -n -vv  
> /root/clone-db.log
> 4) removed the old SSD physically from the node
> 5) checked with "ceph -s" and already saw HEALTH_OK
> and all OSD's up/in
> 6) ceph osd unset noout
> 
> I assume that once the ddrescue step is finished a
> 'partprobe' or something similar is triggered and
> udev finds the DB partitions on the new SSD and
> starts the OSD's again (kind of what happens during
> hotplug)
> So it is probably better to clone the SSD in another
> (non-ceph) system to not trigger any udev events.
> 
> I also tested a reboot after this and everything
> still worked.
> 
> 
> The old SSD was 120GB and the new is 256GB (cloning
> took around 4 minutes)
> Delta of data was very low because it was a test
> cluster.
> 
> All in all the OSD's in question were 'down' for
> only 5 minutes (so i stayed within the
> ceph_osd_down_out interval of the default 10 minutes
> and didn't actually need to set noout :)
> 
> 
> I kicked off a brief discussion about this with some of
> the BlueStore guys and they're aware of the problem with
> migrating 

Re: [ceph-users] Ceph luminous - Erasure code and iSCSI gateway

2018-02-27 Thread Steven Vacaroaia
Hi Jason,

Thanks for your prompt response

I have not been able to find a way to add an existing image ... it looks
like I can just create new ones


I'll appreciate it if you could provide details, please

For example how would I add the preexisting image named image_ec1 ?

 rados -p rbd ls | grep rbd_id
rbd_id.image01
rbd_id.image_ec1
rbd_id.vmware02
rbd_id.vmwware01

[root@osd01 ~]# gwcli
/disks> ls
o- disks ................................................... [200G, Disks: 2]
  o- rbd.vmware02 .......................................... [vmware02 (100G)]
  o- rbd.vmwware01 ......................................... [vmwware01 (100G)]

/disks> create pool=rbd image=image_ec1 size=120G
Failed : disk create/update failed on osd01. LUN allocation failure
/disks> exit


(LUN.allocate) rbd 'image_ec1' is not compatible with LIO
Only image features
RBD_FEATURE_LAYERING,RBD_FEATURE_EXCLUSIVE_LOCK,RBD_FEATURE_OBJECT_MAP,RBD_FEATURE_FAST_DIFF,RBD_FEATURE_DEEP_FLATTEN
are supported
2018-02-27 11:06:23,424ERROR [rbd-target-api:731:_disk()] - LUN alloc
problem - (LUN.allocate) rbd 'image_ec1' is not compatible with LIO



On 27 February 2018 at 10:52, Jason Dillaman  wrote:

> Your image does not live in the EC pool -- instead, only the data
> portion lives within the EC pool. Therefore, you would need to specify
> the replicated pool where the image lives when attaching it as a
> backing store for iSCSI (i.e. pre-create it via the rbd CLI):
>
> # gwcli
> /iscsi-target...sx01-657d71e0> cd /disks
> /disks> create pool=rbd image=image_ec1 size=XYZ
>
>
> On Tue, Feb 27, 2018 at 10:42 AM, Steven Vacaroaia 
> wrote:
> > Hi,
> >
> > I noticed it is possible to use erasure code pool for RBD and CEPHFS
> >
> > https://ceph.com/community/new-luminous-erasure-coding-rbd-cephfs/
> >
> > This got me thinking that I can deploy iSCSI luns on EC pools
> > However it appears it is not working
> >
> > Anyone able to do that or have I misunderstood ?
> >
> > Thanks
> > Steven
> >
> > Here is the pool
> >
> > ceph osd pool get ec_k4_m2 all
> > size: 6
> > min_size: 5
> > crash_replay_interval: 0
> > pg_num: 128
> > pgp_num: 128
> > crush_rule: ec_k4_m2
> > hashpspool: true
> > nodelete: false
> > nopgchange: false
> > nosizechange: false
> > write_fadvise_dontneed: false
> > noscrub: false
> > nodeep-scrub: false
> > use_gmt_hitset: 1
> > auid: 0
> > erasure_code_profile: EC_OSD
> > fast_read: 0
> >
> >
> > here is how I created an image just to make sure RBD is supported
> > rbd create rbd/image_ec1 --size 51200 --data-pool ec_k4_m2
> --image-feature
> > layering
> >
> > here is what gwcli complains about
> > gwcli
> > /iscsi-target...sx01-657d71e0> cd /disks
> > /disks> create pool=ec_k4_m2 image=testec size=120G
> > Invalid pool (ec_k4_m2). Must already exist and be replicated
> >
> >
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
>
>
> --
> Jason
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph luminous - Erasure code and iSCSI gateway

2018-02-27 Thread Jason Dillaman
Your image does not live in the EC pool -- instead, only the data
portion lives within the EC pool. Therefore, you would need to specify
the replicated pool where the image lives when attaching it as a
backing store for iSCSI (i.e. pre-create it via the rbd CLI):

# gwcli
/iscsi-target...sx01-657d71e0> cd /disks
/disks> create pool=rbd image=image_ec1 size=XYZ


On Tue, Feb 27, 2018 at 10:42 AM, Steven Vacaroaia  wrote:
> Hi,
>
> I noticed it is possible to use erasure code pool for RBD and CEPHFS
>
> https://ceph.com/community/new-luminous-erasure-coding-rbd-cephfs/
>
> This got me thinking that I can deploy iSCSI luns on EC pools
> However it appears it is not working
>
> Anyone able to do that or have I misunderstood ?
>
> Thanks
> Steven
>
> Here is the pool
>
> ceph osd pool get ec_k4_m2 all
> size: 6
> min_size: 5
> crash_replay_interval: 0
> pg_num: 128
> pgp_num: 128
> crush_rule: ec_k4_m2
> hashpspool: true
> nodelete: false
> nopgchange: false
> nosizechange: false
> write_fadvise_dontneed: false
> noscrub: false
> nodeep-scrub: false
> use_gmt_hitset: 1
> auid: 0
> erasure_code_profile: EC_OSD
> fast_read: 0
>
>
> here is how I created an image just to make sure RBD is supported
> rbd create rbd/image_ec1 --size 51200 --data-pool ec_k4_m2 --image-feature
> layering
>
> here is what gwcli complains about
> gwcli
> /iscsi-target...sx01-657d71e0> cd /disks
> /disks> create pool=ec_k4_m2 image=testec size=120G
> Invalid pool (ec_k4_m2). Must already exist and be replicated
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Significance of the us-east-1 region when using S3 clients to talk to RGW

2018-02-27 Thread Casey Bodley
s3cmd does have special handling for 'US' and 'us-east-1' that skips the 
LocationConstraint on bucket creation:


https://github.com/s3tools/s3cmd/blob/master/S3/S3.py#L380
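
So in practice, anything other than 'US'/'us-east-1' is sent as a 
LocationConstraint and has to correspond to a zonegroup the gateway knows about 
(which is why 'local-atl' worked in David's test and arbitrary values did not). 
A sketch of the s3cmd side (zonegroup name illustrative):

  # ~/.s3cfg excerpt
  bucket_location = local-atl

  s3cmd mb s3://testbucket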


On 02/26/2018 05:16 PM, David Turner wrote:
I just realized the difference between the internal realm, local 
realm, and local-atl realm.  local-atl is a Luminous cluster while the 
other 2 are Jewel.  It looks like that option was completely ignored 
in Jewel and now Luminous is taking it into account (which is better 
imo).  I think you're right that 'us' is probably some sort of default 
in s3cmd that doesn't actually send the variable to the gateway.


Unfortunately we only allow https for rgw in the environments I have 
set up, but I think we found the cause of the initial randomness of 
things.  Thanks Yehuda.


On Mon, Feb 26, 2018 at 4:26 PM Yehuda Sadeh-Weinraub 
> wrote:


I don't know why 'us' works for you, but it could be that s3cmd is
just not sending any location constraint when 'us' is set. You can try
looking at the capture for this. You can try using wireshark for the
capture (assuming http endpoint and not https).

Yehuda

On Mon, Feb 26, 2018 at 1:21 PM, David Turner
> wrote:
> I set it to that for randomness.  I don't have a zonegroup named
'us'
> either, but that works fine.  I don't see why 'cn' should be any
different.
> The bucket_location that triggered me noticing this was 'gd1'. 
I don't know
> where that one came from, but I don't see why we should force
people setting
> it to 'us' when that has nothing to do with the realm. If it
needed to be
> set to 'local-atl' that would make sense, but 'us' works just
fine.  Perhaps
> 'us' working is what shouldn't work as opposed to allowing
whatever else to
> be able to work.
>
> I tested setting bucket_location to 'local-atl' and it did
successfully
> create the bucket.  So the question becomes, why do my other
realms not care
> what that value is set to and why does this realm allow 'us' to
be used when
> it isn't correct?
>
> On Mon, Feb 26, 2018 at 4:12 PM Yehuda Sadeh-Weinraub
>
> wrote:
>>
>> If that's what you set in the config file, I assume that's what
passed
>> in. Why did you set that in your config file? You don't have a
>> zonegroup named 'cn', right?
>>
>> On Mon, Feb 26, 2018 at 1:10 PM, David Turner
>
>> wrote:
>> > I'm also not certain how to do the tcpdump for this.  Do you
have any
>> > pointers to how to capture that for you?
>> >
>> > On Mon, Feb 26, 2018 at 4:09 PM David Turner
>
>> > wrote:
>> >>
>> >> That's what I set it to in the config file. I probably
should have
>> >> mentioned that.
>> >>
>> >> On Mon, Feb 26, 2018 at 4:07 PM Yehuda Sadeh-Weinraub
>> >> >
>> >> wrote:
>> >>>
>> >>> According to the log here, it says that the location
constraint it got
>> >>> is "cn", can you take a look at a tcpdump, see if that's
actually
>> >>> what's passed in?
>> >>>
>> >>> On Mon, Feb 26, 2018 at 12:02 PM, David Turner
>
>> >>> wrote:
>> >>> > I run with `debug rgw = 10` and was able to find these
lines at the
>> >>> > end
>> >>> > of a
>> >>> > request to create the bucket.
>> >>> >
>> >>> > Successfully creating a bucket with `bucket_location =
US` looks
>> >>> > like
>> >>> > [1]this.  Failing to create a bucket has "ERROR: S3
error: 400
>> >>> > (InvalidLocationConstraint): The specified
location-constraint is
>> >>> > not
>> >>> > valid"
>> >>> > on the CLI and [2]this (excerpt from the end of the
request) in the
>> >>> > rgw
>> >>> > log
>> >>> > (debug level 10).  "create bucket location constraint"
was not found
>> >>> > in
>> >>> > the
>> >>> > log for successfully creating the bucket.
>> >>> >
>> >>> >
>> >>> > [1]
>> >>> > 2018-02-26 19:52:36.419251 7f4bc9bc8700 10 cache put:
>> >>> >
>> >>> >
>> >>> >

name=local-atl.rgw.data.root++.bucket.meta.testerton:bef43c26-daf3-47ef-a3a5-e1167e3f88ac.39099765.1
>> >>> > info.flags=0x17
>> >>> > 2018-02-26 19:52:36.419262 7f4bc9bc8700 10 adding
>> >>> >
>> >>> >
>> >>> >

local-atl.rgw.data.root++.bucket.meta.testerton:bef43c26-daf3-47ef-a3a5-e1167e3f88ac.39099765.1
>> >>> > to cache LRU end
>> >>> > 2018-02-26 19:52:36.419266 7f4bc9bc8700 10 updating xattr:
>> >>> > name=user.rgw.acl
>> >>> > bl.length()=141
>> >>> > 

[ceph-users] Ceph luminous - Erasure code and iSCSI gateway

2018-02-27 Thread Steven Vacaroaia
Hi,

I noticed it is possible to use erasure code pool for RBD and CEPHFS

https://ceph.com/community/new-luminous-erasure-coding-rbd-cephfs/

This got me thinking that I can deploy iSCSI luns on EC pools
However it appears it is not working

Anyone able to do that or have I misunderstood ?

Thanks
Steven

Here is the pool

ceph osd pool get ec_k4_m2 all
size: 6
min_size: 5
crash_replay_interval: 0
pg_num: 128
pgp_num: 128
crush_rule: ec_k4_m2
hashpspool: true
nodelete: false
nopgchange: false
nosizechange: false
write_fadvise_dontneed: false
noscrub: false
nodeep-scrub: false
use_gmt_hitset: 1
auid: 0
erasure_code_profile: EC_OSD
fast_read: 0


here is how I created an image just to make sure RBD is supported
rbd create rbd/image_ec1 --size 51200 --data-pool ec_k4_m2 --image-feature
layering

here is what gwcli complains about
gwcli
/iscsi-target...sx01-657d71e0> cd /disks
/disks> create pool=ec_k4_m2 image=testec size=120G
Invalid pool (ec_k4_m2). Must already exist and be replicated
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Proper procedure to replace DB/WAL SSD

2018-02-27 Thread David Turner
Gotcha.  As a side note, that setting is only used by ceph-disk as
ceph-volume does not create partitions for the WAL or DB.  You need to
create those partitions manually if using anything other than a whole block
device when creating OSDs with ceph-volume.

On Tue, Feb 27, 2018 at 8:20 AM Caspar Smit  wrote:

> David,
>
> Yes i know, i use 20GB partitions for 2TB disks as journal. It was just to
> inform other people that Ceph's default of 1GB is pretty low.
> Now that i read my own sentence it indeed looks as if i was using 1GB
> partitions, sorry for the confusion.
>
> Caspar
>
> 2018-02-27 14:11 GMT+01:00 David Turner :
>
>> If you're only using a 1GB DB partition, there is a very real possibility
>> it's already 100% full. The safe estimate for DB size seems to be 10GB/1TB
>> so for a 4TB osd a 40GB DB should work for most use cases (except loads and
>> loads of small files). There are a few threads that mention how to check
>> how much of your DB partition is in use. Once it's full, it spills over to
>> the HDD.
>>
>>
>> On Tue, Feb 27, 2018, 6:19 AM Caspar Smit  wrote:
>>
>>> 2018-02-26 23:01 GMT+01:00 Gregory Farnum :
>>>
 On Mon, Feb 26, 2018 at 3:23 AM Caspar Smit 
 wrote:

> 2018-02-24 7:10 GMT+01:00 David Turner :
>
>> Caspar, it looks like your idea should work. Worst case scenario
>> seems like the osd wouldn't start, you'd put the old SSD back in and go
>> back to the idea to weight them to 0, backfilling, then recreate the 
>> osds.
>> Definitely with a try in my opinion, and I'd love to hear your experience
>> after.
>>
>>
> Hi David,
>
> First of all, thank you for ALL your answers on this ML, you're really
> putting a lot of effort into answering many questions asked here and very
> often they contain invaluable information.
>
>
> To follow up on this post i went out and built a very small (proxmox)
> cluster (3 OSD's per host) to test my suggestion of cloning the DB/WAL 
> SDD.
> And it worked!
> Note: this was on Luminous v12.2.2 (all bluestore, ceph-disk based
> OSD's)
>
> Here's what i did on 1 node:
>
> 1) ceph osd set noout
> 2) systemctl stop osd.0; systemctl stop osd.1; systemctl stop osd.2
> 3) ddrescue -f -n -vv   /root/clone-db.log
> 4) removed the old SSD physically from the node
> 5) checked with "ceph -s" and already saw HEALTH_OK and all OSD's up/in
> 6) ceph osd unset noout
>
> I assume that once the ddrescue step is finished a 'partprobe' or
> something similar is triggered and udev finds the DB partitions on the new
> SSD and starts the OSD's again (kind of what happens during hotplug)
> So it is probably better to clone the SSD in another (non-ceph) system
> to not trigger any udev events.
>
> I also tested a reboot after this and everything still worked.
>
>
> The old SSD was 120GB and the new is 256GB (cloning took around 4
> minutes)
> Delta of data was very low because it was a test cluster.
>
> All in all the OSD's in question were 'down' for only 5 minutes (so i
> stayed within the ceph_osd_down_out interval of the default 10 minutes and
> didn't actually need to set noout :)
>

 I kicked off a brief discussion about this with some of the BlueStore
 guys and they're aware of the problem with migrating across SSDs, but so
 far it's just a Trello card:
 https://trello.com/c/9cxTgG50/324-bluestore-add-remove-resize-wal-db
 They do confirm you should be okay with dd'ing things across, assuming
 symlinks get set up correctly as David noted.


>>> Great that it is on the radar to address. This method feels hacky.
>>>
>>>
 I've got some other bad news, though: BlueStore has internal metadata
 about the size of the block device it's using, so if you copy it onto a
 larger block device, it will not actually make use of the additional space.
 :(
 -Greg

>>>
>>> Yes, i was well aware of that, no problem. The reason was the smaller
>>> SSD sizes are simply not being made anymore or discontinued by the
>>> manufacturer.
>>> Would be nice though if the DB size could be resized in the future, the
>>> default 1GB DB size seems very small to me.
>>>
>>> Caspar
>>>
>>>


>
> Kind regards,
> Caspar
>
>
>
>> Nico, it is not possible to change the WAL or DB size, location, etc
>> after osd creation. If you want to change the configuration of the osd
>> after creation, you have to remove it from the cluster and recreate it.
>> There is no similar functionality to how you could move, recreate, etc
>> filesystem osd journals. I think this might be on the radar as a feature,
>> but I don't know for certain. I definitely consider 

Re: [ceph-users] cannot reboot one of 3 nodes without locking a cluster OSDs stay in...

2018-02-27 Thread David Turner
Can you start giving specifics?  Ceph version, were the disks created with
ceph-disk or ceph-volume, filestore/bluestore, upgraded from another
version, has anything changed recently (an upgrade, migrating some OSDs
from filestore to bluestore), etc, etc.

Sometimes I've found a node just fails to mark its OSDs down for no
apparent reason.  Perhaps a race condition where the networking stopped
before the OSD code got to the part where it wanted to tell the MONs it was
going down.  If you manually run `ceph osd down #` it'll mark it as down
and not interfere with cluster communication.  This has happened
sporadically and not in any way reproducible for me on occasion.  Have you
tested rebooting this server again to see if it continues to happen?  You
might be able to find some information towards the end of the OSD log about
it when it comes back up.  It would be easier to look through that log if
you disable the OSD from automatically starting with the server so you're
only looking at the end of the log.

On Tue, Feb 27, 2018 at 9:05 AM Philip Schroth 
wrote:

> They are stopped gracefully. I did a reboot 2 days ago, but now it doesn't
> work.
>
>
>
> 2018-02-27 14:24 GMT+01:00 David Turner :
>
>> `systemctl list-dependencies ceph.target`
>>
>> I'm guessing that you might need to enable your osds to be managed by
>> systemctl so that they can be stopped when the server goes down.
>>
>> `systemctl enable ceph-osd@{osd number}`
>>
>> On Tue, Feb 27, 2018, 4:13 AM Philip Schroth 
>> wrote:
>>
>>> I have a 3 node production cluster. All works fine. but i have one
>>> failing node. i replaced one disk on sunday. everyting went fine. last
>>> night there was another disk broken. Ceph nicely maks it as down. but when
>>> i wanted to reboot this node now. all remaining osd's are being kept in and
>>> not marked as down. and the whole cluster locks during the reboot of this
>>> node. once i reboot one of the other two nodes when the first failing node
>>> is back it works like charm. only this node i cannot reboot anymore
>>> without locking which i could still on sunday...
>>>
>>> --
>>> Met vriendelijke groet / With kind regards
>>>
>>> Philip Schroth
>>> E: i...@schroth.nl
>>> T: +31630973268 <+31%206%2030973268>
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
>
>
> --
> Met vriendelijke groet / With kind regards
>
> Philip Schroth
> E: i...@schroth.nl
> T: +31630973268 <+31%206%2030973268>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD Bluestore Backfills Slow

2018-02-27 Thread David Turner
I have 2 different configurations that are incorrectly showing rotational
for the OSDs.  The [1]first is a server with disks behind controllers and
an NVME riser card.  It has 2 different OSD types, one with the block on an
HDD and WAL on the NVME as well as a pure NVME OSD.  The Hybrid OSD seems
to be showing the correct configuration, but the pure NVME OSD is
incorrectly showing up with a rotational journal.

The [2]second configuration I have is with a new server configuration
without a controller and new NVME disks in 2.5" form factor.  It is also
showing a rotational journal. What I find most interesting between all of
these is that it doesn't appear that journal_rotational is being used by
the hybrid OSDs, while it's gumming up the works for the pure flash OSDs.
This seems to match what the others in this thread have seen.


[1] HDD + NVME WAL
"bluefs_db_rotational": "1",
"bluefs_wal_rotational": "0",
"bluestore_bdev_rotational": "1",
"journal_rotational": "0",
"rotational": "1"
Pure NVME
"bluefs_db_rotational": "0",
"bluestore_bdev_rotational": "0",
"journal_rotational": "1",
"rotational": "0"

[2]No controller NVME
"bluefs_db_rotational": "0",
"bluestore_bdev_rotational": "0",
"journal_rotational": "1",
"rotational": "0"

On Mon, Feb 26, 2018 at 5:54 PM Oliver Freyermuth <
freyerm...@physik.uni-bonn.de> wrote:

> Am 26.02.2018 um 23:29 schrieb Gregory Farnum:
> >
> >
> > On Mon, Feb 26, 2018 at 2:23 PM Reed Dier  > wrote:
> >
> > Quick turn around,
> >
> > Changing/injecting osd_recovery_sleep_hdd into the running SSD OSD’s
> on bluestore opened the floodgates.
> >
> >
> > Oh right, the OSD does not (think it can) have anything it can really do
> if you've got a rotational journal and an SSD main device, and since
> BlueStore was misreporting itself as having a rotational journal the OSD
> falls back to the hard drive settings. Sorry I didn't work through that
> ahead of time; glad this works around it for you!
> > -Greg
>
> To chime in, this also helps for me! Replication is much faster now.
> It's a bit strange though that for my metadata-OSDs I see the following
> with iostat now:
> Device:  rrqm/s  wrqm/s     r/s     w/s     rkB/s    wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
> sdb        0,00    0,00 1333,00  301,40 143391,20 42861,60   227,92    21,05  13,78    8,88   35,44   0,59  96,64
> sda        0,00    0,00 1283,40  258,20 139004,80   876,00   181,47     7,18   4,66    5,11    2,40   0,54  83,32
> (MDS should be doing nothing on it)
> while on the OSDs to which things are backfilled I see:
> Device:  rrqm/s  wrqm/s     r/s     w/s     rkB/s    wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
> sda        0,00    0,00   47,20  458,20    367,20  1628,00     7,90     0,92   1,82    6,95    1,29   1,18  59,86
> sdb        0,00    0,00   48,20  589,00    375,20  1892,00     7,12     0,40   0,63    0,78    0,62   0,59  37,32
>
> So it seems the "sending" OSDs are now finally taken to their limit (they
> read and write a lot), but the receiving side is rather bored.
> Maybe this strange effect (many writes when actually reading stuff for
> backfilling) is normal for metadata => RocksDB?
>
> In any case, glad this "rotational" issue is int he queue to be fixed in a
> future release ;-).
>
> Cheers,
> Oliver
>
> >
> >
> >
> >> pool objects-ssd id 20
> >>   recovery io 1512 MB/s, 21547 objects/s
> >>
> >> pool fs-metadata-ssd id 16
> >>   recovery io 0 B/s, 6494 keys/s, 271 objects/s
> >>   client io 82325 B/s rd, 68146 B/s wr, 1 op/s rd, 0 op/s wr
> >
> > Graph of performance jump. Extremely marked.
> > https://imgur.com/a/LZR9R
> >
> > So at least we now have the gun to go with the smoke.
> >
> > Thanks for the help and appreciate you pointing me in some
> directions that I was able to use to figure out the issue.
> >
> > Adding to ceph.conf for future OSD conversions.
> >
> > Thanks,
> >
> > Reed
> >
> >
> >> On Feb 26, 2018, at 4:12 PM, Reed Dier  > wrote:
> >>
> >> For the record, I am not seeing a demonstrative fix by injecting
> the value of 0 into the OSDs running.
> >>> osd_recovery_sleep_hybrid = '0.00' (not observed, change may
> require restart)
> >>
> >> If it does indeed need to be restarted, I will need to wait for the
> current backfills to finish their process as restarting an OSD would bring
> me under min_size.
> >>
> >> However, doing config show on the osd daemon appears to have taken
> the value of 0.
> >>
> >>> ceph daemon osd.24 config show | grep recovery_sleep
> >>> "osd_recovery_sleep": "0.00",
> >>> "osd_recovery_sleep_hdd": "0.10",
> >>> 

Re: [ceph-users] cannot reboot one of 3 nodes without locking a cluster OSDs stay in...

2018-02-27 Thread Philip Schroth
They are stopped gracefully. I did a reboot 2 days ago, but now it doesn't
work.



2018-02-27 14:24 GMT+01:00 David Turner :

> `systemctl list-dependencies ceph.target`
>
> I'm guessing that you might need to enable your osds to be managed by
> systemctl so that they can be stopped when the server goes down.
>
> `systemctl enable ceph-osd@{osd number}`
>
> On Tue, Feb 27, 2018, 4:13 AM Philip Schroth 
> wrote:
>
>> I have a 3 node production cluster. All works fine. but i have one
>> failing node. i replaced one disk on sunday. everyting went fine. last
>> night there was another disk broken. Ceph nicely maks it as down. but when
>> i wanted to reboot this node now. all remaining osd's are being kept in and
>> not marked as down. and the whole cluster locks during the reboot of this
>> node. once i reboot one of the other two nodes when the first failing node
>> is back it works like charm. only this node i cannot reboot anymore
>> without locking which i could still on sunday...
>>
>> --
>> Met vriendelijke groet / With kind regards
>>
>> Philip Schroth
>> E: i...@schroth.nl
>> T: +31630973268 <+31%206%2030973268>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>


-- 
Met vriendelijke groet / With kind regards

Philip Schroth
E: i...@schroth.nl
T: +31630973268
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] fast_read in EC pools

2018-02-27 Thread Caspar Smit
Oliver,

Here's the commit info:

https://github.com/ceph/ceph/commit/48e40fcde7b19bab98821ab8d604eab920591284

Caspar

2018-02-27 14:28 GMT+01:00 Oliver Freyermuth 
:

> Am 27.02.2018 um 14:16 schrieb Caspar Smit:
> > Oliver,
> >
> > Be aware that for k=4,m=2 the min_size will be 5 (k+1), so after a node
> failure the min_size is already reached.
> > Any OSD failure beyond the node failure will probably result in some
> PG's to be become incomplete (I/O Freeze) until the incomplete PG's data is
> recovered to another OSD in that node.
> >
> > So please reconsider your statement "one host + x safety" as the x
> safety (with I/O freeze) is probably not what you want.
> >
> > Forcing to run with min_size=4 could also be dangerous for other
> reasons. (there's a reason why min_size = k+1)
>
> Thanks for pointing this out!
> Yes, indeed, in case we need to take down a host for a longer period (we
> would hope this never has to happen for > 24 hours... but you never know),
> and in case disks start to fail, we would indeed have to degrade to
> min_size=4 to keep running.
>
> What exactly are the implications?
> It should still be possible to ensure the data is not corrupt (with the
> checksums), and recovery to k+1 copies should start automatically once a
> disk fails -
> so what's the actual implication?
> Of course pg repair can not work in that case (if a PG for which the
> additional disk failed is corrupted),
> but in general, when there's the need to reinstall a host, we'd try to
> bring it back with OSD data intact -
> which should then allow to postpone the repair until that point.
>
> Is there a danger I miss in my reasoning?
>
> Cheers and many thanks!
> Oliver
>
> >
> > Caspar
> >
> > 2018-02-27 0:17 GMT+01:00 Oliver Freyermuth <
> freyerm...@physik.uni-bonn.de >:
> >
> > Am 27.02.2018 um 00:10 schrieb Gregory Farnum:
> > > On Mon, Feb 26, 2018 at 2:59 PM Oliver Freyermuth <
> freyerm...@physik.uni-bonn.de 
>  bonn.de>>> wrote:
> > >
> > >
> > > > Does this match expectations?
> > > >
> > > >
> > > > Can you get the output of eg "ceph pg 2.7cd query"? Want to
> make sure the backfilling versus acting sets and things are correct.
> > >
> > > You'll find attached:
> > > query_allwell)  Output of "ceph pg 2.7cd query" when all OSDs
> are up and everything is healthy.
> > > query_one_host_out) Output of "ceph pg 2.7cd query" when OSDs
> 164-195 (one host) are down and out.
> > >
> > >
> > > Yep, that's what we want to see. So when everything's well, we
> have OSDs 91, 63, 33, 163, 192, 103. That corresponds to chassis 3, 2, 1,
> 5, 6, 4.
> > >
> > > When marking out a host, we have OSDs 91, 63, 33, 163, 123,
> UNMAPPED. That corresponds to chassis 3, 2, 1, 5, 4, UNMAPPED.
> > >
> > > So what's happened is that with the new map, when choosing the
> home for shard 4, we selected host 4 instead of host 6 (which is gone). And
> now shard 5 can't map properly. But of course we still have shard 5
> available on host 4, so host 4 is going to end up properly owning shard 4,
> but also just carrying that shard 5 around as a remapped location.
> > >
> > > So this is as we expect. Whew.
> > > -Greg
> >
> > Understood. Thanks for explaining step by step :-).
> > It's of course a bit weird that this happens, since in the end, this
> really means data is moved (or rather, a shard is recreated) and taking up
> space without increasing redundancy
> > (well, it might, if it lands on a different OSD than shard 5, but
> that's not really ensured). I'm unsure if this can be solved "better" in
> any way.
> >
> > Anyways, it seems this would be another reason why running with
> k+m=number of hosts should not be a general recommendation. For us, it's
> fine for now,
> > especially since we want to keep the cluster open for later
> extension with more OSDs, and we do now know the gotchas - and I don't see
> a better EC configuration at the moment
> > which would accomodate our wishes (one host + x safety, don't reduce
> space too much).
> >
> > So thanks again!
> >
> > Cheers,
> > Oliver
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com 
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com <
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com>
> >
> >
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-volume activation

2018-02-27 Thread Dan van der Ster
Hi Oliver,

No ticket yet... we were distracted.

I have the same observations as what you show below...

-- dan



On Tue, Feb 27, 2018 at 2:33 PM, Oliver Freyermuth
 wrote:
> Am 22.02.2018 um 09:44 schrieb Dan van der Ster:
>> On Wed, Feb 21, 2018 at 11:56 PM, Oliver Freyermuth
>>  wrote:
>>> Am 21.02.2018 um 15:58 schrieb Alfredo Deza:
 On Wed, Feb 21, 2018 at 9:40 AM, Dan van der Ster  
 wrote:
> On Wed, Feb 21, 2018 at 2:24 PM, Alfredo Deza  wrote:
>> On Tue, Feb 20, 2018 at 9:05 PM, Oliver Freyermuth
>>  wrote:
>>> Many thanks for your replies!
>>>
>>> Are there plans to have something like
>>> "ceph-volume discover-and-activate"
>>> which would effectively do something like:
>>> ceph-volume list and activate all OSDs which are re-discovered from LVM 
>>> metadata?
>>
>> This is a good idea, I think ceph-disk had an 'activate all', and it
>> would make it easier for the situation you explain with ceph-volume
>>
>> I've created http://tracker.ceph.com/issues/23067 to follow up on this
>> an implement it.
>
> +1 thanks a lot for this thread and clear answers!
> We were literally stuck today not knowing how to restart a ceph-volume
> lvm created OSD.
>
> (It seems that once you systemctl stop ceph-osd@* on a machine, the
> only way to get them back is ceph-volume lvm activate ... )
>
> BTW, ceph-osd.target now has less obvious functionality. For example,
> this works:
>
>   systemctl restart ceph-osd.target
>
> But if you stop ceph-osd.target, then you can no longer start 
> ceph-osd.target.
>
> Is this a regression or something we'll have to live with?

 This sounds surprising. Stopping a ceph-osd target should not do
 anything with the devices. All that 'activate' does when called in
 ceph-volume is to ensure that
 the devices are available and mounted in the right places so that the
 OSD can start.

 If you are experiencing a problem stopping an OSD that can't be
 started again, then something is going on. I would urge you to create
 a ticket with as many details as you can
 at http://tracker.ceph.com/projects/ceph-volume/issues/new
>>>
>>> I also see this - but it's not really that "the osd can not be started 
>>> again".
>>> The problem is that once the osd is stopped, e.g. via
>>> systemctl stop ceph.target
>>> doing a
>>> systemctl start ceph.target
>>> will not bring it up again.
>>>
>>> Doing a manual
>>> systemctl start ceph-osd@36.service
>>> will work, though.
>>
>> In our case even that does not work reliably.
>> We're gathering info to create a tracker ticket.
>>
>> Cheers, Dan
>
> Dear Dan,
>
> did you manage to come around to report a ticket? If so, could you share the 
> ticket number?
> Then I'd happily subscribe to it (with the flood of tickets it's hard to 
> find...).
>
> On related news, I observe this:
>   # systemctl list-dependencies ceph-osd.target
>   ceph-osd.target
>   ● ├─ceph-osd@2.service
>   ● └─ceph-osd@3.service
> on a host installed with ceph-deploy < 2.0 (i.e. using ceph-disk),
> while I observe this:
>   # systemctl list-dependencies ceph-osd.target
>   ceph-osd.target
> for a host installed with ceph-deploy 2.0, i.e. using ceph-volume.
>
> I think this is caused by the ceph-volume systemd-files triggering the 
> ceph-osd services,
> so they are not enabled at all, and hence not considered as dependencies of 
> the target.
>
> Unsure how to solve this cleanly without refactoring the concept, but again, 
> I'm no systemd expert ;-).
>
> Cheers,
> Oliver
>
>>
>>>
>>> The ceph-osd@36.service, in fact, is not enabled on my machine,
>>> which is likely why ceph.target will not cause it to come up.
>>>
>>> I am not a systemd expert, but I think the issue is that the 
>>> ceph-volume@-services which
>>> (I think) take care to activate the OSD services are not re-triggered.
>>>
>>> Cheers,
>>> Oliver
>>>

>
> Cheers, Dan
>>>
>>>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-volume activation

2018-02-27 Thread Oliver Freyermuth
Am 22.02.2018 um 09:44 schrieb Dan van der Ster:
> On Wed, Feb 21, 2018 at 11:56 PM, Oliver Freyermuth
>  wrote:
>> Am 21.02.2018 um 15:58 schrieb Alfredo Deza:
>>> On Wed, Feb 21, 2018 at 9:40 AM, Dan van der Ster  
>>> wrote:
 On Wed, Feb 21, 2018 at 2:24 PM, Alfredo Deza  wrote:
> On Tue, Feb 20, 2018 at 9:05 PM, Oliver Freyermuth
>  wrote:
>> Many thanks for your replies!
>>
>> Are there plans to have something like
>> "ceph-volume discover-and-activate"
>> which would effectively do something like:
>> ceph-volume list and activate all OSDs which are re-discovered from LVM 
>> metadata?
>
> This is a good idea, I think ceph-disk had an 'activate all', and it
> would make it easier for the situation you explain with ceph-volume
>
> I've created http://tracker.ceph.com/issues/23067 to follow up on this
> an implement it.

 +1 thanks a lot for this thread and clear answers!
 We were literally stuck today not knowing how to restart a ceph-volume
 lvm created OSD.

 (It seems that once you systemctl stop ceph-osd@* on a machine, the
 only way to get them back is ceph-volume lvm activate ... )

 BTW, ceph-osd.target now has less obvious functionality. For example,
 this works:

   systemctl restart ceph-osd.target

 But if you stop ceph-osd.target, then you can no longer start 
 ceph-osd.target.

 Is this a regression or something we'll have to live with?
>>>
>>> This sounds surprising. Stopping a ceph-osd target should not do
>>> anything with the devices. All that 'activate' does when called in
>>> ceph-volume is to ensure that
>>> the devices are available and mounted in the right places so that the
>>> OSD can start.
>>>
>>> If you are experiencing a problem stopping an OSD that can't be
>>> started again, then something is going on. I would urge you to create
>>> a ticket with as many details as you can
>>> at http://tracker.ceph.com/projects/ceph-volume/issues/new
>>
>> I also see this - but it's not really that "the osd can not be started 
>> again".
>> The problem is that once the osd is stopped, e.g. via
>> systemctl stop ceph.target
>> doing a
>> systemctl start ceph.target
>> will not bring it up again.
>>
>> Doing a manual
>> systemctl start ceph-osd@36.service
>> will work, though.
> 
> In our case even that does not work reliably.
> We're gathering info to create a tracker ticket.
> 
> Cheers, Dan

Dear Dan,

did you manage to get around to reporting the ticket? If so, could you share
the ticket number?
Then I'd happily subscribe to it (with the flood of tickets it's hard to
find...).

On related news, I observe this:
  # systemctl list-dependencies ceph-osd.target
  ceph-osd.target
  ● ├─ceph-osd@2.service
  ● └─ceph-osd@3.service
on a host installed with ceph-deploy < 2.0 (i.e. using ceph-disk),
while I observe this:
  # systemctl list-dependencies ceph-osd.target
  ceph-osd.target
for a host installed with ceph-deploy 2.0, i.e. using ceph-volume. 

I think this is caused by the ceph-volume systemd-files triggering the ceph-osd 
services,
so they are not enabled at all, and hence not considered as dependencies of the 
target. 
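
A rough way to check and work around this by hand (the ceph-volume unit names
are whatever was generated on the host, so treat them as an assumption):

  systemctl list-units 'ceph-osd@*' 'ceph-volume@*' --all
  systemctl enable ceph-osd@2.service          # creates the wants-link under ceph-osd.target
  systemctl list-dependencies ceph-osd.target  # the OSD should now show up as a dependency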

Unsure how to solve this cleanly without refactoring the concept, but again, 
I'm no systemd expert ;-). 

Cheers,
Oliver

> 
>>
>> The ceph-osd@36.service, in fact, is not enabled on my machine,
>> which is likely why ceph.target will not cause it to come up.
>>
>> I am not a systemd expert, but I think the issue is that the 
>> ceph-volume@-services which
>> (I think) take care to activate the OSD services are not re-triggered.
>>
>> Cheers,
>> Oliver
>>
>>>

 Cheers, Dan
>>
>>




smime.p7s
Description: S/MIME Cryptographic Signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] fast_read in EC pools

2018-02-27 Thread Oliver Freyermuth
Am 27.02.2018 um 14:16 schrieb Caspar Smit:
> Oliver,
> 
> Be aware that for k=4,m=2 the min_size will be 5 (k+1), so after a node 
> failure the min_size is already reached.
> Any OSD failure beyond the node failure will probably result in some PG's to 
> be become incomplete (I/O Freeze) until the incomplete PG's data is recovered 
> to another OSD in that node.
> 
> So please reconsider your statement "one host + x safety" as the x safety 
> (with I/O freeze) is probably not what you want.
> 
> Forcing to run with min_size=4 could also be dangerous for other reasons. 
> (there's a reason why min_size = k+1)

Thanks for pointing this out! 
Yes, indeed, in case we need to take down a host for a longer period (we would 
hope this never has to happen for > 24 hours... but you never know),
and in case disks start to fail, we would indeed have to degrade to min_size=4 
to keep running. 
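
Operationally that would just be a temporary change along these lines
("ecpool" being a placeholder name):

  ceph osd pool set ecpool min_size 4   # only while the host is down
  ceph osd pool set ecpool min_size 5   # back to k+1 once everything is recovered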

What exactly are the implications? 
It should still be possible to ensure the data is not corrupt (with the 
checksums), and recovery to k+1 copies should start automatically once a disk 
fails - 
so what's the actual implication? 
Of course pg repair can not work in that case (if a PG for which the additional 
disk failed is corrupted), 
but in general, when there's the need to reinstall a host, we'd try to bring it 
back with OSD data intact - 
which should then allow to postpone the repair until that point. 

Is there a danger I miss in my reasoning? 

Cheers and many thanks! 
Oliver

> 
> Caspar
> 
> 2018-02-27 0:17 GMT+01:00 Oliver Freyermuth  >:
> 
> Am 27.02.2018 um 00:10 schrieb Gregory Farnum:
> > On Mon, Feb 26, 2018 at 2:59 PM Oliver Freyermuth 
>  
>  >> wrote:
> >
> >
> >     >     Does this match expectations?
> >     >
> >     >
> >     > Can you get the output of eg "ceph pg 2.7cd query"? Want to make 
> sure the backfilling versus acting sets and things are correct.
> >
> >     You'll find attached:
> >     query_allwell)  Output of "ceph pg 2.7cd query" when all OSDs are 
> up and everything is healthy.
> >     query_one_host_out) Output of "ceph pg 2.7cd query" when OSDs 
> 164-195 (one host) are down and out.
> >
> >
> > Yep, that's what we want to see. So when everything's well, we have 
> OSDs 91, 63, 33, 163, 192, 103. That corresponds to chassis 3, 2, 1, 5, 6, 4.
> >
> > When marking out a host, we have OSDs 91, 63, 33, 163, 123, UNMAPPED. 
> That corresponds to chassis 3, 2, 1, 5, 4, UNMAPPED.
> >
> > So what's happened is that with the new map, when choosing the home for 
> shard 4, we selected host 4 instead of host 6 (which is gone). And now shard 
> 5 can't map properly. But of course we still have shard 5 available on host 
> 4, so host 4 is going to end up properly owning shard 4, but also just 
> carrying that shard 5 around as a remapped location.
> >
> > So this is as we expect. Whew.
> > -Greg
> 
> Understood. Thanks for explaining step by step :-).
> It's of course a bit weird that this happens, since in the end, this 
> really means data is moved (or rather, a shard is recreated) and taking up 
> space without increasing redundancy
> (well, it might, if it lands on a different OSD than shard 5, but that's 
> not really ensured). I'm unsure if this can be solved "better" in any way.
> 
> Anyways, it seems this would be another reason why running with 
> k+m=number of hosts should not be a general recommendation. For us, it's fine 
> for now,
> especially since we want to keep the cluster open for later extension 
> with more OSDs, and we do now know the gotchas - and I don't see a better EC 
> configuration at the moment
> which would accomodate our wishes (one host + x safety, don't reduce 
> space too much).
> 
> So thanks again!
> 
> Cheers,
>         Oliver
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 
> 
> 



smime.p7s
Description: S/MIME Cryptographic Signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cannot reboot one of 3 nodes without locking a cluster OSDs stay in...

2018-02-27 Thread David Turner
`systemctl list-dependencies ceph.target`

I'm guessing that you might need to enable your osds to be managed by
systemctl so that they can be stopped when the server goes down.

`systemctl enable ceph-osd@{osd number}`

On Tue, Feb 27, 2018, 4:13 AM Philip Schroth 
wrote:

> I have a 3 node production cluster. All works fine. but i have one failing
> node. i replaced one disk on sunday. everyting went fine. last night there
> was another disk broken. Ceph nicely maks it as down. but when i wanted to
> reboot this node now. all remaining osd's are being kept in and not marked
> as down. and the whole cluster locks during the reboot of this node. once i
> reboot one of the other two nodes when the first failing node is back it
> works like charm. only this node i cannot reboot anymore without locking
> which i could still on sunday...
>
> --
> Met vriendelijke groet / With kind regards
>
> Philip Schroth
> E: i...@schroth.nl
> T: +31630973268
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Proper procedure to replace DB/WAL SSD

2018-02-27 Thread Caspar Smit
David,

Yes, I know; I use 20GB partitions for 2TB disks as journal. It was just to
inform other people that Ceph's default of 1GB is pretty low.
Now that I read my own sentence it indeed looks as if I was using 1GB
partitions, sorry for the confusion.

Caspar

2018-02-27 14:11 GMT+01:00 David Turner :

> If you're only using a 1GB DB partition, there is a very real possibility
> it's already 100% full. The safe estimate for DB size seams to be 10GB/1TB
> so for a 4TB osd a 40GB DB should work for most use cases (except loads and
> loads of small files). There are a few threads that mention how to check
> how much of your DB partition is in use. Once it's full, it spills over to
> the HDD.
>
>
> On Tue, Feb 27, 2018, 6:19 AM Caspar Smit  wrote:
>
>> 2018-02-26 23:01 GMT+01:00 Gregory Farnum :
>>
>>> On Mon, Feb 26, 2018 at 3:23 AM Caspar Smit 
>>> wrote:
>>>
 2018-02-24 7:10 GMT+01:00 David Turner :

> Caspar, it looks like your idea should work. Worst case scenario seems
> like the osd wouldn't start, you'd put the old SSD back in and go back to
> the idea to weight them to 0, backfilling, then recreate the osds.
> Definitely with a try in my opinion, and I'd love to hear your experience
> after.
>
>
 Hi David,

 First of all, thank you for ALL your answers on this ML, you're really
 putting a lot of effort into answering many questions asked here and very
 often they contain invaluable information.


 To follow up on this post i went out and built a very small (proxmox)
 cluster (3 OSD's per host) to test my suggestion of cloning the DB/WAL SDD.
 And it worked!
 Note: this was on Luminous v12.2.2 (all bluestore, ceph-disk based
 OSD's)

 Here's what i did on 1 node:

 1) ceph osd set noout
 2) systemctl stop osd.0; systemctl stop osd.1; systemctl stop osd.2
 3) ddrescue -f -n -vv   /root/clone-db.log
 4) removed the old SSD physically from the node
 5) checked with "ceph -s" and already saw HEALTH_OK and all OSD's up/in
 6) ceph osd unset noout

 I assume that once the ddrescue step is finished a 'partprobe' or
 something similar is triggered and udev finds the DB partitions on the new
 SSD and starts the OSD's again (kind of what happens during hotplug)
 So it is probably better to clone the SSD in another (non-ceph) system
 to not trigger any udev events.

 I also tested a reboot after this and everything still worked.


 The old SSD was 120GB and the new is 256GB (cloning took around 4
 minutes)
 Delta of data was very low because it was a test cluster.

 All in all the OSD's in question were 'down' for only 5 minutes (so i
 stayed within the ceph_osd_down_out interval of the default 10 minutes and
 didn't actually need to set noout :)

>>>
>>> I kicked off a brief discussion about this with some of the BlueStore
>>> guys and they're aware of the problem with migrating across SSDs, but so
>>> far it's just a Trello card: https://trello.com/c/
>>> 9cxTgG50/324-bluestore-add-remove-resize-wal-db
>>> They do confirm you should be okay with dd'ing things across, assuming
>>> symlinks get set up correctly as David noted.
>>>
>>>
>> Great that it is on the radar to address. This method feels hacky.
>>
>>
>>> I've got some other bad news, though: BlueStore has internal metadata
>>> about the size of the block device it's using, so if you copy it onto a
>>> larger block device, it will not actually make use of the additional space.
>>> :(
>>> -Greg
>>>
>>
>> Yes, i was well aware of that, no problem. The reason was the smaller SSD
>> sizes are simply not being made anymore or discontinued by the manufacturer.
>> Would be nice though if the DB size could be resized in the future, the
>> default 1GB DB size seems very small to me.
>>
>> Caspar
>>
>>
>>>
>>>

 Kind regards,
 Caspar



> Nico, it is not possible to change the WAL or DB size, location, etc
> after osd creation. If you want to change the configuration of the osd
> after creation, you have to remove it from the cluster and recreate it.
> There is no similar functionality to how you could move, recreate, etc
> filesystem osd journals. I think this might be on the radar as a feature,
> but I don't know for certain. I definitely consider it to be a regression
> of bluestore.
>
>
>
>
> On Fri, Feb 23, 2018, 9:13 AM Nico Schottelius <
> nico.schottel...@ungleich.ch> wrote:
>
>>
>> A very interesting question and I would add the follow up question:
>>
>> Is there an easy way to add an external DB/WAL devices to an existing
>> OSD?
>>
>> I suspect that it might be something on the lines of:
>>
>> - stop osd
>> - create a link in 

Re: [ceph-users] fast_read in EC pools

2018-02-27 Thread Caspar Smit
Oliver,

Be aware that for k=4,m=2 the min_size will be 5 (k+1), so after a node
failure the min_size is already reached.
Any OSD failure beyond the node failure will probably result in some PG's
to be become incomplete (I/O Freeze) until the incomplete PG's data is
recovered to another OSD in that node.

So please reconsider your statement "one host + x safety" as the x safety
(with I/O freeze) is probably not what you want.

Forcing to run with min_size=4 could also be dangerous for other reasons.
(there's a reason why min_size = k+1)
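
To double-check what a given pool actually uses, something like this works
(pool and profile names are placeholders):

  ceph osd pool get ecpool min_size
  ceph osd pool get ecpool erasure_code_profile
  ceph osd erasure-code-profile get myprofile   # shows k and m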

Caspar

2018-02-27 0:17 GMT+01:00 Oliver Freyermuth :

> Am 27.02.2018 um 00:10 schrieb Gregory Farnum:
> > On Mon, Feb 26, 2018 at 2:59 PM Oliver Freyermuth <
> freyerm...@physik.uni-bonn.de >
> wrote:
> >
> >
> > > Does this match expectations?
> > >
> > >
> > > Can you get the output of eg "ceph pg 2.7cd query"? Want to make
> sure the backfilling versus acting sets and things are correct.
> >
> > You'll find attached:
> > query_allwell)  Output of "ceph pg 2.7cd query" when all OSDs are up
> and everything is healthy.
> > query_one_host_out) Output of "ceph pg 2.7cd query" when OSDs
> 164-195 (one host) are down and out.
> >
> >
> > Yep, that's what we want to see. So when everything's well, we have OSDs
> 91, 63, 33, 163, 192, 103. That corresponds to chassis 3, 2, 1, 5, 6, 4.
> >
> > When marking out a host, we have OSDs 91, 63, 33, 163, 123, UNMAPPED.
> That corresponds to chassis 3, 2, 1, 5, 4, UNMAPPED.
> >
> > So what's happened is that with the new map, when choosing the home for
> shard 4, we selected host 4 instead of host 6 (which is gone). And now
> shard 5 can't map properly. But of course we still have shard 5 available
> on host 4, so host 4 is going to end up properly owning shard 4, but also
> just carrying that shard 5 around as a remapped location.
> >
> > So this is as we expect. Whew.
> > -Greg
>
> Understood. Thanks for explaining step by step :-).
> It's of course a bit weird that this happens, since in the end, this
> really means data is moved (or rather, a shard is recreated) and taking up
> space without increasing redundancy
> (well, it might, if it lands on a different OSD than shard 5, but that's
> not really ensured). I'm unsure if this can be solved "better" in any way.
>
> Anyways, it seems this would be another reason why running with k+m=number
> of hosts should not be a general recommendation. For us, it's fine for now,
> especially since we want to keep the cluster open for later extension with
> more OSDs, and we do now know the gotchas - and I don't see a better EC
> configuration at the moment
> which would accomodate our wishes (one host + x safety, don't reduce space
> too much).
>
> So thanks again!
>
> Cheers,
> Oliver
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Proper procedure to replace DB/WAL SSD

2018-02-27 Thread David Turner
If you're only using a 1GB DB partition, there is a very real possibility
it's already 100% full. The safe estimate for DB size seems to be 10GB per
1TB, so for a 4TB OSD a 40GB DB should work for most use cases (except loads and
loads of small files). There are a few threads that mention how to check
how much of your DB partition is in use. Once it's full, it spills over to
the HDD.
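
One such check, as a sketch (the osd id is an assumption; the counters come
from the bluefs perf counters):

  ceph daemon osd.0 perf dump | jq '.bluefs | {db_total_bytes, db_used_bytes, slow_used_bytes}'

A non-zero slow_used_bytes means the DB has already spilled over onto the
slow (HDD) device.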

On Tue, Feb 27, 2018, 6:19 AM Caspar Smit  wrote:

> 2018-02-26 23:01 GMT+01:00 Gregory Farnum :
>
>> On Mon, Feb 26, 2018 at 3:23 AM Caspar Smit 
>> wrote:
>>
>>> 2018-02-24 7:10 GMT+01:00 David Turner :
>>>
 Caspar, it looks like your idea should work. Worst case scenario seems
 like the osd wouldn't start, you'd put the old SSD back in and go back to
 the idea to weight them to 0, backfilling, then recreate the osds.
 Definitely with a try in my opinion, and I'd love to hear your experience
 after.


>>> Hi David,
>>>
>>> First of all, thank you for ALL your answers on this ML, you're really
>>> putting a lot of effort into answering many questions asked here and very
>>> often they contain invaluable information.
>>>
>>>
>>> To follow up on this post i went out and built a very small (proxmox)
>>> cluster (3 OSD's per host) to test my suggestion of cloning the DB/WAL SDD.
>>> And it worked!
>>> Note: this was on Luminous v12.2.2 (all bluestore, ceph-disk based OSD's)
>>>
>>> Here's what i did on 1 node:
>>>
>>> 1) ceph osd set noout
>>> 2) systemctl stop osd.0; systemctl stop osd.1; systemctl stop osd.2
>>> 3) ddrescue -f -n -vv   /root/clone-db.log
>>> 4) removed the old SSD physically from the node
>>> 5) checked with "ceph -s" and already saw HEALTH_OK and all OSD's up/in
>>> 6) ceph osd unset noout
>>>
>>> I assume that once the ddrescue step is finished a 'partprobe' or
>>> something similar is triggered and udev finds the DB partitions on the new
>>> SSD and starts the OSD's again (kind of what happens during hotplug)
>>> So it is probably better to clone the SSD in another (non-ceph) system
>>> to not trigger any udev events.
>>>
>>> I also tested a reboot after this and everything still worked.
>>>
>>>
>>> The old SSD was 120GB and the new is 256GB (cloning took around 4
>>> minutes)
>>> Delta of data was very low because it was a test cluster.
>>>
>>> All in all the OSD's in question were 'down' for only 5 minutes (so i
>>> stayed within the ceph_osd_down_out interval of the default 10 minutes and
>>> didn't actually need to set noout :)
>>>
>>
>> I kicked off a brief discussion about this with some of the BlueStore
>> guys and they're aware of the problem with migrating across SSDs, but so
>> far it's just a Trello card:
>> https://trello.com/c/9cxTgG50/324-bluestore-add-remove-resize-wal-db
>> They do confirm you should be okay with dd'ing things across, assuming
>> symlinks get set up correctly as David noted.
>>
>>
> Great that it is on the radar to address. This method feels hacky.
>
>
>> I've got some other bad news, though: BlueStore has internal metadata
>> about the size of the block device it's using, so if you copy it onto a
>> larger block device, it will not actually make use of the additional space.
>> :(
>> -Greg
>>
>
> Yes, i was well aware of that, no problem. The reason was the smaller SSD
> sizes are simply not being made anymore or discontinued by the manufacturer.
> Would be nice though if the DB size could be resized in the future, the
> default 1GB DB size seems very small to me.
>
> Caspar
>
>
>>
>>
>>>
>>> Kind regards,
>>> Caspar
>>>
>>>
>>>
 Nico, it is not possible to change the WAL or DB size, location, etc
 after osd creation. If you want to change the configuration of the osd
 after creation, you have to remove it from the cluster and recreate it.
 There is no similar functionality to how you could move, recreate, etc
 filesystem osd journals. I think this might be on the radar as a feature,
 but I don't know for certain. I definitely consider it to be a regression
 of bluestore.




 On Fri, Feb 23, 2018, 9:13 AM Nico Schottelius <
 nico.schottel...@ungleich.ch> wrote:

>
> A very interesting question and I would add the follow up question:
>
> Is there an easy way to add an external DB/WAL devices to an existing
> OSD?
>
> I suspect that it might be something on the lines of:
>
> - stop osd
> - create a link in ...ceph/osd/ceph-XX/block.db to the target device
> - (maybe run some kind of osd mkfs ?)
> - start osd
>
> Has anyone done this so far or recommendations on how to do it?
>
> Which also makes me wonder: what is actually the format of WAL and
> BlockDB in bluestore? Is there any documentation available about it?
>
> Best,
>
> Nico
>
>
> Caspar Smit  writes:
>
> > Hi All,
> >

Re: [ceph-users] Small modifications to Bluestore migration documentation

2018-02-27 Thread Alfredo Deza
On Tue, Feb 27, 2018 at 6:55 AM, Alexander Kushnirenko
 wrote:
> Hello,
>
> Luminous 12.2.2
>
> There were several discussions on this list concerning Bluestore migration,
> as official documentation does not work quite well yet. In particular this
> one
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-January/024190.html
>
> Is it possible to modify official documentation
> http://docs.ceph.com/docs/master/rados/operations/bluestore-migration/
>
> Item 8
> ceph osd destroy $ID --yes-i-really-mean-it
> ADD COMMAND
> ceph osd purge $ID  --yes-i-really-mean-it

Why do you think it would be necessary to purge? Keeping the ID (which
destroy does) is what we want here.

>
> Item 9
> REPLACE (There is actually a typo - lvm is missing)
> ceph-volume create --bluestore --data $DEVICE --osd-id $ID
> WITH
> ceph-volume lvm create --bluestore --data $DEVICE
>
> ceph-volume will automatically pick previous osd-id

The missing lvm is a great catch, thanks for pointing that out. I have
created http://tracker.ceph.com/issues/23148 to follow up on this.

There is no longer a need to avoid using --osd-id as that bug has been
fixed in 12.2.3 and is available right now.
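
So with 12.2.3 onwards the documented step can simply read (sketch):

  ceph-volume lvm create --bluestore --data $DEVICE --osd-id $ID

i.e. re-using the id kept by 'ceph osd destroy' rather than purging it.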
>
> PLUS a note to ignore errors ( _read_bdev_label unable to decode label at
> offset 102) https://tracker.ceph.com/issues/22285

This has also been fixed and released.

>
> Alexander.
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Small modifications to Bluestore migration documentation

2018-02-27 Thread Alexander Kushnirenko
Hello,

Luminous 12.2.2

There were several discussions on this list concerning Bluestore migration,
as the official documentation does not quite work yet. In particular this
one:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-January/024190.html

Is it possible to modify official documentation
http://docs.ceph.com/docs/master/rados/operations/bluestore-migration/

Item 8
ceph osd destroy $ID --yes-i-really-mean-it
ADD COMMAND
ceph osd purge $ID  --yes-i-really-mean-it

Item 9
REPLACE (There is actually a typo - lvm is missing)
ceph-volume create --bluestore --data $DEVICE --osd-id $ID
WITH
ceph-volume lvm create --bluestore --data $DEVICE

ceph-volume will automatically pick previous osd-id

PLUS a note to ignore errors ( _read_bdev_label unable to decode label at
offset 102) https://tracker.ceph.com/issues/22285

Alexander.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Proper procedure to replace DB/WAL SSD

2018-02-27 Thread Caspar Smit
2018-02-26 23:01 GMT+01:00 Gregory Farnum :

> On Mon, Feb 26, 2018 at 3:23 AM Caspar Smit 
> wrote:
>
>> 2018-02-24 7:10 GMT+01:00 David Turner :
>>
>>> Caspar, it looks like your idea should work. Worst case scenario seems
>>> like the osd wouldn't start, you'd put the old SSD back in and go back to
>>> the idea to weight them to 0, backfilling, then recreate the osds.
>>> Definitely with a try in my opinion, and I'd love to hear your experience
>>> after.
>>>
>>>
>> Hi David,
>>
>> First of all, thank you for ALL your answers on this ML, you're really
>> putting a lot of effort into answering many questions asked here and very
>> often they contain invaluable information.
>>
>>
>> To follow up on this post i went out and built a very small (proxmox)
>> cluster (3 OSD's per host) to test my suggestion of cloning the DB/WAL SDD.
>> And it worked!
>> Note: this was on Luminous v12.2.2 (all bluestore, ceph-disk based OSD's)
>>
>> Here's what i did on 1 node:
>>
>> 1) ceph osd set noout
>> 2) systemctl stop osd.0; systemctl stop osd.1; systemctl stop osd.2
>> 3) ddrescue -f -n -vv   /root/clone-db.log
>> 4) removed the old SSD physically from the node
>> 5) checked with "ceph -s" and already saw HEALTH_OK and all OSD's up/in
>> 6) ceph osd unset noout
>>
>> I assume that once the ddrescue step is finished a 'partprobe' or
>> something similar is triggered and udev finds the DB partitions on the new
>> SSD and starts the OSD's again (kind of what happens during hotplug)
>> So it is probably better to clone the SSD in another (non-ceph) system to
>> not trigger any udev events.
>>
>> I also tested a reboot after this and everything still worked.
>>
>>
>> The old SSD was 120GB and the new is 256GB (cloning took around 4 minutes)
>> Delta of data was very low because it was a test cluster.
>>
>> All in all the OSD's in question were 'down' for only 5 minutes (so i
>> stayed within the ceph_osd_down_out interval of the default 10 minutes and
>> didn't actually need to set noout :)
>>
>
> I kicked off a brief discussion about this with some of the BlueStore guys
> and they're aware of the problem with migrating across SSDs, but so far
> it's just a Trello card: https://trello.com/c/9cxTgG50/324-bluestore-add-
> remove-resize-wal-db
> They do confirm you should be okay with dd'ing things across, assuming
> symlinks get set up correctly as David noted.
>
>
Great that it is on the radar to address. This method feels hacky.


> I've got some other bad news, though: BlueStore has internal metadata
> about the size of the block device it's using, so if you copy it onto a
> larger block device, it will not actually make use of the additional space.
> :(
> -Greg
>

Yes, I was well aware of that, no problem. The reason is that the smaller SSD
sizes are simply not being made anymore or have been discontinued by the
manufacturer. It would be nice though if the DB size could be resized in the
future; the default 1GB DB size seems very small to me.

Caspar


>
>
>>
>> Kind regards,
>> Caspar
>>
>>
>>
>>> Nico, it is not possible to change the WAL or DB size, location, etc
>>> after osd creation. If you want to change the configuration of the osd
>>> after creation, you have to remove it from the cluster and recreate it.
>>> There is no similar functionality to how you could move, recreate, etc
>>> filesystem osd journals. I think this might be on the radar as a feature,
>>> but I don't know for certain. I definitely consider it to be a regression
>>> of bluestore.
>>>
>>>
>>>
>>>
>>> On Fri, Feb 23, 2018, 9:13 AM Nico Schottelius <
>>> nico.schottel...@ungleich.ch> wrote:
>>>

 A very interesting question and I would add the follow up question:

 Is there an easy way to add an external DB/WAL devices to an existing
 OSD?

 I suspect that it might be something on the lines of:

 - stop osd
 - create a link in ...ceph/osd/ceph-XX/block.db to the target device
 - (maybe run some kind of osd mkfs ?)
 - start osd

 Has anyone done this so far or recommendations on how to do it?

 Which also makes me wonder: what is actually the format of WAL and
 BlockDB in bluestore? Is there any documentation available about it?

 Best,

 Nico


 Caspar Smit  writes:

 > Hi All,
 >
 > What would be the proper way to preventively replace a DB/WAL SSD
 (when it
 > is nearing it's DWPD/TBW limit and not failed yet).
 >
 > It hosts DB partitions for 5 OSD's
 >
 > Maybe something like:
 >
 > 1) ceph osd reweight 0 the 5 OSD's
 > 2) let backfilling complete
 > 3) destroy/remove the 5 OSD's
 > 4) replace SSD
 > 5) create 5 new OSD's with seperate DB partition on new SSD
 >
 > When these 5 OSD's are big HDD's (8TB) a LOT of data has to be moved
 so i
 > thought maybe the following would work:
 >
 > 

Re: [ceph-users] Proper procedure to replace DB/WAL SSD

2018-02-27 Thread Caspar Smit
2018-02-26 18:02 GMT+01:00 David Turner :

> I'm glad that I was able to help out.  I wanted to point out that the
> reason those steps worked for you as quickly as they did is likely that you
> configured your blocks.db to use the /dev/disk/by-partuuid/{guid} instead
> of /dev/sdx#.  Had you configured your osds with /dev/sdx#, then you would
> have needed to either modify them to point to the partuuid path or changed
> them to the new devices name (which is a bad name as it will likely change
> on reboot).  Changing your path for blocks.db is as simple as `ln -sf
> /var/lib/ceph/osd/ceph-#/blocks.db /dev/disk/by-partuuid/{uuid}` and then
> restarting the osd to make sure that it can read from the new symlink
> location.
>
>
Yes, I (proxmox) used /dev/disk/by-partuuid/{guid} style links.
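
For anyone following along: note that ln -sf takes the target first, and the
BlueStore symlink is named block.db, so re-pointing it would look roughly like
this (the partuuid and osd id are placeholders):

  ln -sfn /dev/disk/by-partuuid/NEW-DB-PARTUUID /var/lib/ceph/osd/ceph-0/block.db
  systemctl restart ceph-osd@0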


> I'm curious about your OSDs starting automatically after doing those steps
> as well.  I would guess you deployed them with ceph-disk instead of
> ceph-volume, is that right?  ceph-volume no longer uses udev rules and
> shouldn't have picked up these changes here.
>
>
Yes, ceph-disk based, so udev kicked in on the partprobe.

Caspar


> On Mon, Feb 26, 2018 at 6:23 AM Caspar Smit 
> wrote:
>
>> 2018-02-24 7:10 GMT+01:00 David Turner :
>>
>>> Caspar, it looks like your idea should work. Worst case scenario seems
>>> like the osd wouldn't start, you'd put the old SSD back in and go back to
>>> the idea to weight them to 0, backfilling, then recreate the osds.
>>> Definitely with a try in my opinion, and I'd love to hear your experience
>>> after.
>>>
>>>
>> Hi David,
>>
>> First of all, thank you for ALL your answers on this ML, you're really
>> putting a lot of effort into answering many questions asked here and very
>> often they contain invaluable information.
>>
>>
>> To follow up on this post i went out and built a very small (proxmox)
>> cluster (3 OSD's per host) to test my suggestion of cloning the DB/WAL SDD.
>> And it worked!
>> Note: this was on Luminous v12.2.2 (all bluestore, ceph-disk based OSD's)
>>
>> Here's what i did on 1 node:
>>
>> 1) ceph osd set noout
>> 2) systemctl stop osd.0; systemctl stop osd.1; systemctl stop osd.2
>> 3) ddrescue -f -n -vv   /root/clone-db.log
>> 4) removed the old SSD physically from the node
>> 5) checked with "ceph -s" and already saw HEALTH_OK and all OSD's up/in
>> 6) ceph osd unset noout
>>
>> I assume that once the ddrescue step is finished a 'partprobe' or
>> something similar is triggered and udev finds the DB partitions on the new
>> SSD and starts the OSD's again (kind of what happens during hotplug)
>> So it is probably better to clone the SSD in another (non-ceph) system to
>> not trigger any udev events.
>>
>> I also tested a reboot after this and everything still worked.
>>
>>
>> The old SSD was 120GB and the new is 256GB (cloning took around 4 minutes)
>> Delta of data was very low because it was a test cluster.
>>
>> All in all the OSD's in question were 'down' for only 5 minutes (so i
>> stayed within the ceph_osd_down_out interval of the default 10 minutes and
>> didn't actually need to set noout :)
>>
>> Kind regards,
>> Caspar
>>
>>
>>
>>> Nico, it is not possible to change the WAL or DB size, location, etc
>>> after osd creation. If you want to change the configuration of the osd
>>> after creation, you have to remove it from the cluster and recreate it.
>>> There is no similar functionality to how you could move, recreate, etc
>>> filesystem osd journals. I think this might be on the radar as a feature,
>>> but I don't know for certain. I definitely consider it to be a regression
>>> of bluestore.
>>>
>>>
>>>
>>>
>>> On Fri, Feb 23, 2018, 9:13 AM Nico Schottelius <
>>> nico.schottel...@ungleich.ch> wrote:
>>>

 A very interesting question and I would add the follow up question:

 Is there an easy way to add an external DB/WAL devices to an existing
 OSD?

 I suspect that it might be something on the lines of:

 - stop osd
 - create a link in ...ceph/osd/ceph-XX/block.db to the target device
 - (maybe run some kind of osd mkfs ?)
 - start osd

 Has anyone done this so far or recommendations on how to do it?

 Which also makes me wonder: what is actually the format of WAL and
 BlockDB in bluestore? Is there any documentation available about it?

 Best,

 Nico


 Caspar Smit  writes:

 > Hi All,
 >
 > What would be the proper way to preventively replace a DB/WAL SSD
 (when it
 > is nearing it's DWPD/TBW limit and not failed yet).
 >
 > It hosts DB partitions for 5 OSD's
 >
 > Maybe something like:
 >
 > 1) ceph osd reweight 0 the 5 OSD's
 > 2) let backfilling complete
 > 3) destroy/remove the 5 OSD's
 > 4) replace SSD
 > 5) create 5 new OSD's with seperate DB partition on new SSD
 >
 > When these 5 

[ceph-users] cannot reboot one of 3 nodes without locking a cluster OSDs stay in...

2018-02-27 Thread Philip Schroth
I have a 3-node production cluster. All works fine, but I have one failing
node. I replaced one disk on Sunday and everything went fine. Last night
another disk broke. Ceph nicely marks it as down, but when I want to reboot
this node now, all remaining OSDs are kept in and not marked as down, and the
whole cluster locks during the reboot of this node. Once the first failing
node is back I can reboot either of the other two nodes and it works like a
charm; only this node I can no longer reboot without locking, which I still
could on Sunday...

-- 
Met vriendelijke groet / With kind regards

Philip Schroth
E: i...@schroth.nl
T: +31630973268
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] (no subject)

2018-02-27 Thread Philip Schroth
I have a 3-node production cluster. All works fine, but I have one failing
node. I replaced one disk on Sunday and everything went fine. Last night
another disk broke. Ceph nicely marks it as down, but when I want to reboot
this node now, all remaining OSDs are kept in and not marked as down, and the
whole cluster locks during the reboot of this node. Once the first failing
node is back I can reboot either of the other two nodes and it works like a
charm; only this node I can no longer reboot without locking, which I still
could on Sunday...


-- 
Met vriendelijke groet / With kind regards

Philip Schroth
E: i...@schroth.nl
T: +31630973268
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com