[ceph-users] Too few PGs per OSD (autoscaler)

2019-08-01 Thread Jan Kasprzak
Hello, Ceph users,

TL;DR: PG autoscaler should not cause the "too few PGs per OSD" warning

Detailed:
Some time ago, I upgraded the HW in my virtualization+Ceph cluster,
replacing 30+ old servers with <10 modern servers. I immediately got
the "too many PGs per OSD" warning, so I had to add more OSDs, even though
I did not need the space at that time. So I eagerly waited for the PG
autoscaling feature in Nautilus.

Yesterday I upgraded to Nautilus and enabled the autoscaler on my RBD pool.
At first I got the "objects per pg (XX) is more than XX times cluster average"
warning for several hours, which was later replaced with
"too few PGs per OSD".

I will have to set the minimum number of PGs per pool, but anyway, I think
the autoscaler should not be too aggressive, and should not reduce the number
of PGs below the PGs-per-OSD limit.

(that said, the ability to reduce the number of PGs in a pool in Nautilus
works well for me, thanks for it!)
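
For the record, I assume the knob for the per-pool minimum is pg_num_min,
which the autoscaler should respect as a lower bound; something like the
following, where the pool name "rbd" and the value 128 are just examples:

# ceph osd pool set rbd pg_num_min 128
# ceph osd pool autoscale-status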

Thanks,

-Yenya

-- 
| Jan "Yenya" Kasprzak  |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
sir_clive> I hope you don't mind if I steal some of your ideas?
 laryross> As far as stealing... we call it sharing here.   --from rcgroups
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How do you deal with "clock skew detected"?

2019-05-16 Thread Jan Kasprzak
Konstantin Shalygin wrote:
: >how do you deal with the "clock skew detected" HEALTH_WARN message?
: >
: >I think the internal RTC in most x86 servers does have 1 second resolution
: >only, but Ceph skew limit is much smaller than that. So every time I reboot
: >one of my mons (for kernel upgrade or something), I have to wait for several
: >minutes for the system clock to synchronize over NTP, even though ntpd
: >has been running before reboot and was started during the system boot again.
: 
: Definitely you should use chrony with iburst.

OK, many responses (thanks for them!) suggest chrony, so I tried it:
With all three mons running chrony and being in sync with my NTP server
with offsets under 0.0001 second, I rebooted one of the mons:

There was still a HEALTH_WARN clock skew message as soon as
the rebooted mon started responding to ping. The cluster returned to
HEALTH_OK about 95 seconds later.

According to "ntpdate -q my.ntp.server", the initial offset
after reboot is about 0.6 s (which is the reason of HEALTH_WARN, I think),
but it gets under 0.0001 s in about 25 seconds. The remaining ~50 seconds
of HEALTH_WARN is inside Ceph, with mons being already synchronized.

So the result is that chrony indeed synchronizes faster,
but I nevertheless still get about 95 seconds of the "clock skew
detected" HEALTH_WARN.

I guess the workaround for now is to ignore the warning, and wait
for two minutes before rebooting another mon.
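
For completeness, a minimal chrony.conf sketch for fast initial
synchronization (using the same my.ntp.server as in the ntpdate test above)
might look like:

server my.ntp.server iburst
makestep 1.0 3

The makestep line lets chrony step the clock during the first few updates
instead of slewing slowly. The remaining warning window could probably also
be hidden by raising mon_clock_drift_allowed (default 0.05 s) in the [mon]
section of ceph.conf, but that masks real skew as well, so I am not sure it
is a good idea.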

-Yenya

-- 
| Jan "Yenya" Kasprzak  |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
sir_clive> I hope you don't mind if I steal some of your ideas?
 laryross> As far as stealing... we call it sharing here.   --from rcgroups
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Huge rebalance after rebooting OSD host (Mimic)

2019-05-15 Thread Jan Kasprzak
Hello, Ceph users,

I wanted to install the recent kernel update on my OSD hosts
with CentOS 7, Ceph 13.2.5 Mimic. So I set the noout flag and ran
"yum -y update" on the first OSD host. This host has 8 bluestore OSDs
with data on HDDs and the databases on LVs of two SSDs (each SSD has 4 LVs
for OSD metadata).

Everything went OK, so I rebooted this host. After the OSD host
went back online, the cluster went from HEALTH_WARN (noout flag set)
to HEALTH_ERR, and started to rebalance itself, with reportedly almost 60 % of
objects misplaced, and some of them degraded. And, of course, backfill_toofull:

  cluster:
health: HEALTH_ERR
2300616/3975384 objects misplaced (57.872%)
Degraded data redundancy: 74263/3975384 objects degraded (1.868%), 
146 pgs degraded, 122 pgs undersized
Degraded data redundancy (low space): 44 pgs backfill_toofull
 
  services:
mon: 3 daemons, quorum stratus1,stratus2,stratus3
mgr: stratus3(active), standbys: stratus1, stratus2
osd: 44 osds: 44 up, 44 in; 2022 remapped pgs
rgw: 1 daemon active
 
  data:
pools:   9 pools, 3360 pgs
objects: 1.33 M objects, 5.0 TiB
usage:   25 TiB used, 465 TiB / 490 TiB avail
pgs: 74263/3975384 objects degraded (1.868%)
 2300616/3975384 objects misplaced (57.872%)
 1739 active+remapped+backfill_wait
 1329 active+clean
 102  active+recovery_wait+remapped
 76   active+undersized+degraded+remapped+backfill_wait
 31   active+remapped+backfill_wait+backfill_toofull
 30   active+recovery_wait+undersized+degraded+remapped
 21   active+recovery_wait+degraded+remapped
 8    active+undersized+degraded+remapped+backfill_wait+backfill_toofull
 6    active+recovery_wait+degraded
 4    active+remapped+backfill_toofull
 3    active+recovery_wait+undersized+degraded
 3    active+remapped+backfilling
 2    active+recovery_wait
 2    active+recovering+undersized
 1    active+clean+remapped
 1    active+undersized+degraded+remapped+backfill_toofull
 1    active+undersized+degraded+remapped+backfilling
 1    active+recovering+undersized+remapped
 
  io:
client:   681 B/s rd, 1013 KiB/s wr, 0 op/s rd, 32 op/s wr
recovery: 142 MiB/s, 93 objects/s
 
(Note that I cleared the noout flag afterwards.) What is wrong here?
Why did the cluster decide to rebalance itself?

I am keeping the rest of the OSD hosts unrebooted for now.
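
For the remaining hosts, my plan (a sketch only, I have not tried it yet) is
to also set the norebalance flag around the reboot and to compare "ceph osd
tree" before and after, to see whether the rebooted OSDs come back with a
different CRUSH weight, device class or location:

# ceph osd tree > /tmp/tree.before
# ceph osd set noout
# ceph osd set norebalance
... reboot the host ...
# ceph osd unset norebalance
# ceph osd unset noout
# diff /tmp/tree.before <(ceph osd tree)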

Thanks,

-Yenya

-- 
| Jan "Yenya" Kasprzak  |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
sir_clive> I hope you don't mind if I steal some of your ideas?
 laryross> As far as stealing... we call it sharing here.   --from rcgroups
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] How do you deal with "clock skew detected"?

2019-05-15 Thread Jan Kasprzak
Hello, Ceph users,

how do you deal with the "clock skew detected" HEALTH_WARN message?

I think the internal RTC in most x86 servers has only 1-second resolution,
but the Ceph skew limit is much smaller than that. So every time I reboot
one of my mons (for a kernel upgrade or something), I have to wait several
minutes for the system clock to synchronize over NTP, even though ntpd
had been running before the reboot and was started again during system boot.

Thanks,

-Yenya

-- 
| Jan "Yenya" Kasprzak  |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
sir_clive> I hope you don't mind if I steal some of your ideas?
 laryross> As far as stealing... we call it sharing here.   --from rcgroups
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Radosgw object size limit?

2019-05-10 Thread Jan Kasprzak
Hello,

thanks for your help.

Casey Bodley wrote:
: It looks like the default.rgw.buckets.non-ec pool is missing, which
: is where we track in-progress multipart uploads. So I'm guessing
: that your perl client is not doing a multipart upload, where s3cmd
: does by default.
: 
: I'd recommend debugging this by trying to create the pool manually -
: the only requirement for this pool is that it not be erasure coded.
: See the docs for your ceph release for more information:
: 
: http://docs.ceph.com/docs/luminous/rados/operations/pools/#create-a-pool
: 
: http://docs.ceph.com/docs/luminous/rados/operations/placement-groups/

I use Mimic, FWIW. I created the pool in question manually:

# ceph osd pool create default.rgw.buckets.non-ec 32
pool 'default.rgw.buckets.non-ec' created
# 

and it finished without any error. Now I can do multipart uploads
using s3cmd.

What could the problem have been? Maybe the radosgw cephx user does not have
sufficient rights to create a pool? "ceph auth ls" shows the following
keys:

client.bootstrap-rgw
key: ...
caps: [mgr] allow r
caps: [mon] allow profile bootstrap-rgw
client.rgw.myrgwhost
key: ...
caps: [mon] allow rw
caps: [osd] allow rwx

Is this correct?
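
Just to rule out a permissions problem, I suppose one could test whether this
key can create pools at all by running a pool create as that user; the keyring
path below is a guess based on the usual ceph-deploy layout, and the test pool
is removed afterwards (which also needs mon_allow_pool_delete enabled):

# ceph -n client.rgw.myrgwhost \
      --keyring /var/lib/ceph/radosgw/ceph-rgw.myrgwhost/keyring \
      osd pool create caps-test 8
# ceph osd pool delete caps-test caps-test --yes-i-really-really-mean-it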

Thank you very much!

-Yenya


-- 
| Jan "Yenya" Kasprzak  |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
sir_clive> I hope you don't mind if I steal some of your ideas?
 laryross> As far as stealing... we call it sharing here.   --from rcgroups
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Radosgw object size limit?

2019-05-10 Thread Jan Kasprzak
Hello Casey (and the ceph-users list),

I am returning to my older problem to which you replied:

Casey Bodley wrote:
: There is a rgw_max_put_size which defaults to 5G, which limits the
: size of a single PUT request. But in that case, the http response
: would be 400 EntityTooLarge. For multipart uploads, there's also a
: rgw_multipart_part_upload_limit that defaults to 10000 parts, which
: would cause a 416 InvalidRange error. By default though, s3cmd does
: multipart uploads with 15MB parts, so your 11G object should only
: have ~750 parts.
: 
: Are you able to upload smaller objects successfully? These
: InvalidRange errors can also result from failures to create any
: rados pools that didn't exist already. If that's what you're
: hitting, you'd get the same InvalidRange errors for smaller object
: uploads, and you'd also see messages like this in your radosgw log:
: 
: > rgw_init_ioctx ERROR: librados::Rados::pool_create returned (34)
: Numerical result out of range (this can be due to a pool or
: placement group misconfiguration, e.g. pg_num < pgp_num or
: mon_max_pg_per_osd exceeded)

You are right. Now how do I find out which pool it is and what the
reason is?

Anyway, if I try to upload a CentOS 7 ISO image using
the Perl module Net::Amazon::S3, it works. I do something like this there:

my $bucket = $s3->add_bucket({
bucket => 'testbucket',
acl_short => 'private',
});
$bucket->add_key_filename("testdir/$dst", $file, {
content_type => 'application/octet-stream'
}) or die $s3->err . ': ' . $s3->errstr;

and I see the following in /var/log/ceph/ceph-client.rgwlog:

2019-05-10 15:55:28.394 7f4b859b8700  1 civetweb: 0x558108506000: 127.0.0.1 - - 
[10/May/2019:15:53:50 +0200] "PUT 
/testbucket/testdir/CentOS-7-x86_64-Everything-1810.iso HTTP/1.1" 200 234 - 
libwww-perl/6.38

I can see the uploaded object using "s3cmd ls", and I can download it back
using "s3cmd get", with matching sha1sum. When I do the same using
"s3cmd put" instead of Perl module, I indeed get the pool create failure:

2019-05-10 15:53:14.914 7f4b859b8700  1 == starting new request 
req=0x7f4b859af850 =
2019-05-10 15:53:15.492 7f4b859b8700  0 rgw_init_ioctx ERROR: 
librados::Rados::pool_create returned (34) Numerical result out of range (this 
can be due to a pool or placement group misconfiguration, e.g. pg_num < pgp_num 
or mon_max_pg_per_osd exceeded)
2019-05-10 15:53:15.492 7f4b859b8700  1 == req done req=0x7f4b859af850 op 
status=-34 http_status=416 ==
2019-05-10 15:53:15.492 7f4b859b8700  1 civetweb: 0x558108506000: 127.0.0.1 - - 
[10/May/2019:15:53:14 +0200] "POST /testbucket/testdir/c7.iso?uploads HTTP/1.0" 
416 469 - -

So maybe the Perl module is configured differently? But which pool or
other parameter is the problem? I have the following pools:

# ceph osd pool ls
one
.rgw.root
default.rgw.control
default.rgw.meta
default.rgw.log
default.rgw.buckets.index
default.rgw.buckets.data

(the "one" pool is unrelated to RadosGW, it contains OpenNebula RBD images).

Thanks,

-Yenya

: On 3/7/19 12:21 PM, Jan Kasprzak wrote:
: > Hello, Ceph users,
: >
: >does radosgw have an upper limit of object size? I tried to upload
: >a 11GB file using s3cmd, but it failed with InvalidRange error:
: >
: >$ s3cmd put --verbose 
centos/7/isos/x86_64/CentOS-7-x86_64-Everything-1810.iso s3://mybucket/
: >INFO: No cache file found, creating it.
: >INFO: Compiling list of local files...
: >INFO: Running stat() and reading/calculating MD5 values on 1 files, this may 
take some time...
: >INFO: Summary: 1 local files to upload
: >WARNING: CentOS-7-x86_64-Everything-1810.iso: Owner username not known. 
Storing UID=108 instead.
: >WARNING: CentOS-7-x86_64-Everything-1810.iso: Owner groupname not known. 
Storing GID=108 instead.
: >ERROR: S3 error: 416 (InvalidRange)
: >
: >$ ls -lh centos/7/isos/x86_64/CentOS-7-x86_64-Everything-1810.iso
: >-rw-r--r--. 1 108 108 11G Nov 26 15:28 
centos/7/isos/x86_64/CentOS-7-x86_64-Everything-1810.iso
: >
: >Thanks for any hint how to increase the limit.
: >
: >-Yenya
: >
: ___
: ceph-users mailing list
: ceph-users@lists.ceph.com
: http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

-- 
| Jan "Yenya" Kasprzak  |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
sir_clive> I hope you don't mind if I steal some of your ideas?
 laryross> As far as stealing... we call it sharing here.   --from rcgroups
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Radosgw object size limit?

2019-03-07 Thread Jan Kasprzak
Hello, Ceph users,

does radosgw have an upper limit on object size? I tried to upload
an 11 GB file using s3cmd, but it failed with an InvalidRange error:

$ s3cmd put --verbose centos/7/isos/x86_64/CentOS-7-x86_64-Everything-1810.iso 
s3://mybucket/
INFO: No cache file found, creating it.
INFO: Compiling list of local files...
INFO: Running stat() and reading/calculating MD5 values on 1 files, this may 
take some time...
INFO: Summary: 1 local files to upload
WARNING: CentOS-7-x86_64-Everything-1810.iso: Owner username not known. Storing 
UID=108 instead.
WARNING: CentOS-7-x86_64-Everything-1810.iso: Owner groupname not known. 
Storing GID=108 instead.
ERROR: S3 error: 416 (InvalidRange)

$ ls -lh centos/7/isos/x86_64/CentOS-7-x86_64-Everything-1810.iso
-rw-r--r--. 1 108 108 11G Nov 26 15:28 
centos/7/isos/x86_64/CentOS-7-x86_64-Everything-1810.iso
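
In case it is related to how s3cmd chunks the upload: as far as I can tell,
s3cmd does multipart uploads by default (15 MB parts), and its chunking can
be controlled from the command line, e.g.

$ s3cmd put --multipart-chunk-size-mb=100 bigfile.iso s3://mybucket/
$ s3cmd put --disable-multipart smallfile s3://mybucket/

where a plain non-multipart PUT is presumably still subject to radosgw's own
limit on a single PUT.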

Thanks for any hint on how to increase the limit.

-Yenya

-- 
| Jan "Yenya" Kasprzak  |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
 This is the world we live in: the way to deal with computers is to google
 the symptoms, and hope that you don't have to watch a video. --P. Zaitcev
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD image format v1 EOL ...

2019-02-20 Thread Jan Kasprzak
Hello,

Jason Dillaman wrote:
: For the future Ceph Octopus release, I would like to remove all
: remaining support for RBD image format v1 images barring any
: substantial pushback.
: 
: The image format for new images has been defaulted to the v2 image
: format since Infernalis, the v1 format was officially deprecated in
: Jewel, and creation of new v1 images was prohibited starting with
: Mimic.
: 
: The forthcoming Nautilus release will add a new image migration
: feature to help provide a low-impact conversion path forward for any
: legacy images in a cluster. The ability to migrate existing images off
: the v1 image format was the last known pain point that was highlighted
: the previous time I suggested removing support.
: 
: Please let me know if anyone has any major objections or concerns.

If I read the parallel thread about pool migration in ceph-users@
correctly, the ability to migrate to v2 would still require stopping the client
before "rbd migration prepare" can be executed.

On my OpenNebula/Ceph cluster, I still have several tens of images
in v1 format, so it would be moderately painful to figure out which VMs
are using them, how availability-critical they are, and finally to migrate
the images.
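
A crude way to enumerate the v1 images (the pool name "one" is my OpenNebula
pool; this is just a sketch and will be slow with many images) could be:

# for img in $(rbd ls one); do
>   rbd info one/"$img" | grep -q 'format: 1' && echo "$img"
> done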

But whatever, I guess I can cope with it :-)

-Yenya

-- 
| Jan "Yenya" Kasprzak  |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
 This is the world we live in: the way to deal with computers is to google
 the symptoms, and hope that you don't have to watch a video. --P. Zaitcev
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore increased disk usage

2019-02-18 Thread Jan Kasprzak
Jakub Jaszewski wrote:
: Hi Yenya,
: 
: I guess Ceph adds the size of all  your data.db devices to the cluster
: total used space.

Jakub,

thanks for the hint. The disk usage increase almost corresponds
to that - I have added about 7.5 TB of data.db devices with the last
batch of OSDs.
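
In case somebody wants to verify this on their own cluster: as far as I can
tell, the per-OSD RocksDB usage is visible in the bluefs perf counters on the
admin socket (osd.0 is just an example, and the command has to run on the
host where that OSD lives):

# ceph daemon osd.0 perf dump | grep db_

which should show db_total_bytes and db_used_bytes among other counters.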

Sincerely,

-Yenya

: pt., 8 lut 2019, 10:11: Jan Kasprzak  napisał(a):
: 
: > Hello, ceph users,
: >
: > I moved my cluster to bluestore (Ceph Mimic), and now I see the increased
: > disk usage. From ceph -s:
: >
: > pools:   8 pools, 3328 pgs
: > objects: 1.23 M objects, 4.6 TiB
: > usage:   23 TiB used, 444 TiB / 467 TiB avail
: >
: > I use 3-way replication of my data, so I would expect the disk usage
: > to be around 14 TiB. Which was true when I used filestore-based Luminous
: > OSDs
: > before. Why the disk usage now is 23 TiB?
: >
: > If I remember it correctly (a big if!), the disk usage was about the same
: > when I originally moved the data to empty bluestore OSDs by changing the
: > crush rule, but went up after I have added more bluestore OSDs and the
: > cluster
: > rebalanced itself.
: >
: > Could it be some miscalculation of free space in bluestore? Also, could it
: > be
: > related to the HEALTH_ERR backfill_toofull problem discused here in the
: > other
: > thread?
: >
: > Thanks,
: >
: > -Yenya
: >
: > --
: > | Jan "Yenya" Kasprzak 
: > |
: > | http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5
: > |
: >  This is the world we live in: the way to deal with computers is to google
: >  the symptoms, and hope that you don't have to watch a video. --P. Zaitcev
: > ___
: > ceph-users mailing list
: > ceph-users@lists.ceph.com
: > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
: >

-- 
| Jan "Yenya" Kasprzak  |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
 This is the world we live in: the way to deal with computers is to google
 the symptoms, and hope that you don't have to watch a video. --P. Zaitcev
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Bluestore increased disk usage

2019-02-08 Thread Jan Kasprzak
Hello, ceph users,

I moved my cluster to bluestore (Ceph Mimic), and now I see increased
disk usage. From "ceph -s":

pools:   8 pools, 3328 pgs
objects: 1.23 M objects, 4.6 TiB
usage:   23 TiB used, 444 TiB / 467 TiB avail

I use 3-way replication for my data, so I would expect the disk usage
to be around 14 TiB, which was true when I used filestore-based Luminous OSDs
before. Why is the disk usage now 23 TiB?

If I remember it correctly (a big if!), the disk usage was about the same
when I originally moved the data to empty bluestore OSDs by changing the
crush rule, but went up after I added more bluestore OSDs and the cluster
rebalanced itself.

Could it be some miscalculation of free space in bluestore? Also, could it be
related to the HEALTH_ERR backfill_toofull problem discussed here in the other
thread?

Thanks,

-Yenya

-- 
| Jan "Yenya" Kasprzak  |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
 This is the world we live in: the way to deal with computers is to google
 the symptoms, and hope that you don't have to watch a video. --P. Zaitcev
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Downsizing a cephfs pool

2019-02-08 Thread Jan Kasprzak
Hello,

Brian Topping wrote:
: Hi all, I created a problem when moving data to Ceph and I would be grateful 
for some guidance before I do something dumb.
[...]
: Do I need to create new pools and copy again using cpio? Is there a better 
way?

I think I will be facing the same problem soon (moving my cluster
from ~64 1-2TB OSDs to about 16 12TB OSDs). Maybe this is the way to go:

https://ceph.com/geen-categorie/ceph-pool-migration/

(I have not tested that myself, though.)
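
If I read that article correctly, the simplest variant is an offline copy
along these lines (pool names are placeholders; it requires stopping client
I/O to the pool, does not copy snapshots, and I am not sure it is safe for a
CephFS data pool, where the pool id is referenced by the filesystem - that is
definitely something to verify first):

# ceph osd pool create newpool 128
# rados cppool oldpool newpool
# ceph osd pool rename oldpool oldpool.backup
# ceph osd pool rename newpool oldpool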

-Yenya

-- 
| Jan "Yenya" Kasprzak  |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
 This is the world we live in: the way to deal with computers is to google
 the symptoms, and hope that you don't have to watch a video. --P. Zaitcev
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] pgs inactive after setting a new crush rule (Re: backfill_toofull after adding new OSDs)

2019-01-31 Thread Jan Kasprzak
Jan Kasprzak wrote:
:   OKay, now I changed the crush rule also on a pool with
: the real data, and it seems all the client i/o on that pool has stopped.
: The recovery continues, but things like qemu I/O, "rbd ls", and so on
: are just stuck doing nothing.
: 
:   Can I unstuck it somehow (faster than waiting for all the recovery
: to finish)? Thanks.

I was able to briefly reduce the "1721 pgs inactive" number
by restarting some of the original filestore OSDs, but after some time
the number increased back to 1721. Then the data recovery finished,
and 1721 PGs remained inactive (and, of course, I/O on this pool was stuck,
both qemu and "rbd ls").

So I reverted to the original crush rule; the data started
to migrate back to the original OSDs, and the client I/O got unstuck
(even though the data relocation is still in progress).

Where could the problem be? Could it be that I am hitting the limit
on the number of PGs per OSD, or something similar? I had 60 OSDs before, and want
to move it all to 20 new OSDs instead. The pool in question has 2048 PGs.
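
A rough back-of-the-envelope check of the PG-per-OSD budget (numbers taken
from the "ceph -s" output quoted below, so treat this as a guess rather than
a diagnosis): 5056 PGs in total, mostly 3-way replicated, squeezed onto 20
OSDs would be roughly 5056 * 3 / 20 ~ 758 PG instances per OSD, well above
the default mon_max_pg_per_osd (200 in Luminous, 250 later, IIRC), which as
far as I understand can leave PGs stuck "activating". Something like this
could be used to check and, if appropriate, temporarily raise the limit (the
mon name "mon1" is from my setup, the value 800 is an arbitrary example):

# ceph osd dump | grep 'replicated size'
# ceph daemon mon.mon1 config get mon_max_pg_per_osd
# ceph config set global mon_max_pg_per_osd 800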

Thanks,

-Yenya
: 
: # ceph -s
:   cluster:
: id: ... my-uuid ...
: health: HEALTH_ERR
: 3308311/3803892 objects misplaced (86.972%)
: Reduced data availability: 1721 pgs inactive
: Degraded data redundancy: 85361/3803892 objects degraded (2.244%),
: 139 pgs degraded, 139 pgs undersized
: Degraded data redundancy (low space): 25 pgs backfill_toofull
: 
:   services:
: mon: 3 daemons, quorum mon1,mon2,mon3
: mgr: mon2(active), standbys: mon1, mon3
: osd: 80 osds: 80 up, 80 in; 1868 remapped pgs
: rgw: 1 daemon active
: 
:   data:
: pools:   13 pools, 5056 pgs
: objects: 1.27 M objects, 4.8 TiB
: usage:   15 TiB used, 208 TiB / 224 TiB avail
: pgs: 34.039% pgs not active
:  85361/3803892 objects degraded (2.244%)
:  3308311/3803892 objects misplaced (86.972%)
:  3188 active+clean
:  1582 activating+remapped
:  139  activating+undersized+degraded+remapped
:  93   active+remapped+backfill_wait
:  29   active+remapped+backfilling
:  25   active+remapped+backfill_wait+backfill_toofull
: 
:   io:
: recovery: 174 MiB/s, 43 objects/s
: 
: 
: -Yenya
: 
: 
: Jan Kasprzak wrote:
: : : - Original Message -
: : : From: "Caspar Smit" 
: : : To: "Jan Kasprzak" 
: : : Cc: "ceph-users" 
: : : Sent: Thursday, 31 January, 2019 15:43:07
: : : Subject: Re: [ceph-users] backfill_toofull after adding new OSDs
: : : 
: : : Hi Jan, 
: : : 
: : : You might be hitting the same issue as Wido here: 
: : : 
: : : [ https://www.spinics.net/lists/ceph-users/msg50603.html | 
https://www.spinics.net/lists/ceph-users/msg50603.html ] 
: : : 
: : : Kind regards, 
: : : Caspar 
: : : 
: : : Op do 31 jan. 2019 om 14:36 schreef Jan Kasprzak < [ 
mailto:k...@fi.muni.cz | k...@fi.muni.cz ] >: 
: : : 
: : : 
: : : Hello, ceph users, 
: : : 
: : : I see the following HEALTH_ERR during cluster rebalance: 
: : : 
: : : Degraded data redundancy (low space): 8 pgs backfill_toofull 
: : : 
: : : Detailed description: 
: : : I have upgraded my cluster to mimic and added 16 new bluestore OSDs 
: : : on 4 hosts. The hosts are in a separate region in my crush map, and crush 
: : : rules prevented data to be moved on the new OSDs. Now I want to move 
: : : all data to the new OSDs (and possibly decomission the old filestore 
OSDs). 
: : : I have created the following rule: 
: : : 
: : : # ceph osd crush rule create-replicated on-newhosts newhostsroot host 
: : : 
: : : after this, I am slowly moving the pools one-by-one to this new rule: 
: : : 
: : : # ceph osd pool set test-hdd-pool crush_rule on-newhosts 
: : : 
: : : When I do this, I get the above error. This is misleading, because 
: : : ceph osd df does not suggest the OSDs are getting full (the most full 
: : : OSD is about 41 % full). After rebalancing is done, the HEALTH_ERR 
: : : disappears. Why am I getting this error? 
: : : 
: : : # ceph -s 
: : : cluster: 
: : : id: ...my UUID... 
: : : health: HEALTH_ERR 
: : : 1271/3803223 objects misplaced (0.033%) 
: : : Degraded data redundancy: 40124/3803223 objects degraded (1.055%), 65 pgs 
degraded, 67 pgs undersized 
: : : Degraded data redundancy (low space): 8 pgs backfill_toofull 
: : : 
: : : services: 
: : : mon: 3 daemons, quorum mon1,mon2,mon3 
: : : mgr: mon2(active), standbys: mon1, mon3 
: : : osd: 80 osds: 80 up, 80 in; 90 remapped pgs 
: : : rgw: 1 daemon active 
: : : 
: : : data: 
: : : pools: 13 pools, 5056 pgs 
: : : objects: 1.27 M objects, 4.8 TiB 
: : : usage: 15 TiB used, 208 TiB / 224 TiB avail 
: : : pgs: 40124/3803223 objects degraded (1.055%) 
: : : 1271/3803223 objects misplaced (0.033%) 
: : : 4963 active+clean 
: : : 41 active+recovery_wait+undersized+degraded+remapped 
: : : 21 

Re: [ceph-users] backfill_toofull after adding new OSDs

2019-01-31 Thread Jan Kasprzak
OK, now I have changed the crush rule also on a pool with
real data, and it seems all client I/O on that pool has stopped.
The recovery continues, but things like qemu I/O, "rbd ls", and so on
are just stuck doing nothing.

Can I unstick it somehow (faster than waiting for all the recovery
to finish)? Thanks.

# ceph -s
  cluster:
id: ... my-uuid ...
health: HEALTH_ERR
3308311/3803892 objects misplaced (86.972%)
Reduced data availability: 1721 pgs inactive
Degraded data redundancy: 85361/3803892 objects degraded (2.244%),
139 pgs degraded, 139 pgs undersized
Degraded data redundancy (low space): 25 pgs backfill_toofull

  services:
mon: 3 daemons, quorum mon1,mon2,mon3
mgr: mon2(active), standbys: mon1, mon3
osd: 80 osds: 80 up, 80 in; 1868 remapped pgs
rgw: 1 daemon active

  data:
pools:   13 pools, 5056 pgs
objects: 1.27 M objects, 4.8 TiB
usage:   15 TiB used, 208 TiB / 224 TiB avail
pgs: 34.039% pgs not active
 85361/3803892 objects degraded (2.244%)
 3308311/3803892 objects misplaced (86.972%)
 3188 active+clean
 1582 activating+remapped
 139  activating+undersized+degraded+remapped
 93   active+remapped+backfill_wait
 29   active+remapped+backfilling
 25   active+remapped+backfill_wait+backfill_toofull

  io:
recovery: 174 MiB/s, 43 objects/s


-Yenya


Jan Kasprzak wrote:
: : - Original Message -
: : From: "Caspar Smit" 
: : To: "Jan Kasprzak" 
: : Cc: "ceph-users" 
: : Sent: Thursday, 31 January, 2019 15:43:07
: : Subject: Re: [ceph-users] backfill_toofull after adding new OSDs
: : 
: : Hi Jan, 
: : 
: : You might be hitting the same issue as Wido here: 
: : 
: : [ https://www.spinics.net/lists/ceph-users/msg50603.html | 
https://www.spinics.net/lists/ceph-users/msg50603.html ] 
: : 
: : Kind regards, 
: : Caspar 
: : 
: : Op do 31 jan. 2019 om 14:36 schreef Jan Kasprzak < [ mailto:k...@fi.muni.cz 
| k...@fi.muni.cz ] >: 
: : 
: : 
: : Hello, ceph users, 
: : 
: : I see the following HEALTH_ERR during cluster rebalance: 
: : 
: : Degraded data redundancy (low space): 8 pgs backfill_toofull 
: : 
: : Detailed description: 
: : I have upgraded my cluster to mimic and added 16 new bluestore OSDs 
: : on 4 hosts. The hosts are in a separate region in my crush map, and crush 
: : rules prevented data to be moved on the new OSDs. Now I want to move 
: : all data to the new OSDs (and possibly decomission the old filestore OSDs). 
: : I have created the following rule: 
: : 
: : # ceph osd crush rule create-replicated on-newhosts newhostsroot host 
: : 
: : after this, I am slowly moving the pools one-by-one to this new rule: 
: : 
: : # ceph osd pool set test-hdd-pool crush_rule on-newhosts 
: : 
: : When I do this, I get the above error. This is misleading, because 
: : ceph osd df does not suggest the OSDs are getting full (the most full 
: : OSD is about 41 % full). After rebalancing is done, the HEALTH_ERR 
: : disappears. Why am I getting this error? 
: : 
: : # ceph -s 
: : cluster: 
: : id: ...my UUID... 
: : health: HEALTH_ERR 
: : 1271/3803223 objects misplaced (0.033%) 
: : Degraded data redundancy: 40124/3803223 objects degraded (1.055%), 65 pgs 
degraded, 67 pgs undersized 
: : Degraded data redundancy (low space): 8 pgs backfill_toofull 
: : 
: : services: 
: : mon: 3 daemons, quorum mon1,mon2,mon3 
: : mgr: mon2(active), standbys: mon1, mon3 
: : osd: 80 osds: 80 up, 80 in; 90 remapped pgs 
: : rgw: 1 daemon active 
: : 
: : data: 
: : pools: 13 pools, 5056 pgs 
: : objects: 1.27 M objects, 4.8 TiB 
: : usage: 15 TiB used, 208 TiB / 224 TiB avail 
: : pgs: 40124/3803223 objects degraded (1.055%) 
: : 1271/3803223 objects misplaced (0.033%) 
: : 4963 active+clean 
: : 41 active+recovery_wait+undersized+degraded+remapped 
: : 21 active+recovery_wait+undersized+degraded 
: : 17 active+remapped+backfill_wait 
: : 5 active+remapped+backfill_wait+backfill_toofull 
: : 3 active+remapped+backfill_toofull 
: : 2 active+recovering+undersized+remapped 
: : 2 active+recovering+undersized+degraded+remapped 
: : 1 active+clean+remapped 
: : 1 active+recovering+undersized+degraded 
: : 
: : io: 
: : client: 6.6 MiB/s rd, 2.7 MiB/s wr, 75 op/s rd, 89 op/s wr 
: : recovery: 2.0 MiB/s, 92 objects/s 
: : 
: : Thanks for any hint, 
: : 
: : -Yenya 
: : 
: : -- 
: : | Jan "Yenya" Kasprzak http://fi.muni.cz/ | fi.muni.cz ] - work 
| [ http://yenya.net/ | yenya.net ] - private}> | 
: : | [ http://www.fi.muni.cz/~kas/ | http://www.fi.muni.cz/~kas/ ] GPG: 
4096R/A45477D5 | 
: : This is the world we live in: the way to deal with computers is to google 
: : the symptoms, and hope that you don't have to watch a video. --P. Zaitcev 

Re: [ceph-users] backfill_toofull after adding new OSDs

2019-01-31 Thread Jan Kasprzak
Fyodor Ustinov wrote:
: Hi!
: 
: I saw the same several times when I added a new osd to the cluster. One-two 
pg in "backfill_toofull" state.
: 
: In all versions of mimic.

Yep. In my case it is not (only) after adding the new OSDs.
An hour or so ago my cluster reached the HEALTH_OK state, so I moved
another pool to the new hosts with "crush_rule on-newhosts". The result
was an immediate backfill_toofull on two PGs for about five minutes,
and then the cluster reached HEALTH_OK again.

So the PGs are not stuck in that state forever; they are there
only during the data reshuffle.

13.2.4 on CentOS 7.

-Yenya

: 
: - Original Message -
: From: "Caspar Smit" 
: To: "Jan Kasprzak" 
: Cc: "ceph-users" 
: Sent: Thursday, 31 January, 2019 15:43:07
: Subject: Re: [ceph-users] backfill_toofull after adding new OSDs
: 
: Hi Jan, 
: 
: You might be hitting the same issue as Wido here: 
: 
: [ https://www.spinics.net/lists/ceph-users/msg50603.html | 
https://www.spinics.net/lists/ceph-users/msg50603.html ] 
: 
: Kind regards, 
: Caspar 
: 
: Op do 31 jan. 2019 om 14:36 schreef Jan Kasprzak < [ mailto:k...@fi.muni.cz | 
k...@fi.muni.cz ] >: 
: 
: 
: Hello, ceph users, 
: 
: I see the following HEALTH_ERR during cluster rebalance: 
: 
: Degraded data redundancy (low space): 8 pgs backfill_toofull 
: 
: Detailed description: 
: I have upgraded my cluster to mimic and added 16 new bluestore OSDs 
: on 4 hosts. The hosts are in a separate region in my crush map, and crush 
: rules prevented data to be moved on the new OSDs. Now I want to move 
: all data to the new OSDs (and possibly decomission the old filestore OSDs). 
: I have created the following rule: 
: 
: # ceph osd crush rule create-replicated on-newhosts newhostsroot host 
: 
: after this, I am slowly moving the pools one-by-one to this new rule: 
: 
: # ceph osd pool set test-hdd-pool crush_rule on-newhosts 
: 
: When I do this, I get the above error. This is misleading, because 
: ceph osd df does not suggest the OSDs are getting full (the most full 
: OSD is about 41 % full). After rebalancing is done, the HEALTH_ERR 
: disappears. Why am I getting this error? 
: 
: # ceph -s 
: cluster: 
: id: ...my UUID... 
: health: HEALTH_ERR 
: 1271/3803223 objects misplaced (0.033%) 
: Degraded data redundancy: 40124/3803223 objects degraded (1.055%), 65 pgs 
degraded, 67 pgs undersized 
: Degraded data redundancy (low space): 8 pgs backfill_toofull 
: 
: services: 
: mon: 3 daemons, quorum mon1,mon2,mon3 
: mgr: mon2(active), standbys: mon1, mon3 
: osd: 80 osds: 80 up, 80 in; 90 remapped pgs 
: rgw: 1 daemon active 
: 
: data: 
: pools: 13 pools, 5056 pgs 
: objects: 1.27 M objects, 4.8 TiB 
: usage: 15 TiB used, 208 TiB / 224 TiB avail 
: pgs: 40124/3803223 objects degraded (1.055%) 
: 1271/3803223 objects misplaced (0.033%) 
: 4963 active+clean 
: 41 active+recovery_wait+undersized+degraded+remapped 
: 21 active+recovery_wait+undersized+degraded 
: 17 active+remapped+backfill_wait 
: 5 active+remapped+backfill_wait+backfill_toofull 
: 3 active+remapped+backfill_toofull 
: 2 active+recovering+undersized+remapped 
: 2 active+recovering+undersized+degraded+remapped 
: 1 active+clean+remapped 
: 1 active+recovering+undersized+degraded 
: 
: io: 
: client: 6.6 MiB/s rd, 2.7 MiB/s wr, 75 op/s rd, 89 op/s wr 
: recovery: 2.0 MiB/s, 92 objects/s 
: 
: Thanks for any hint, 
: 
: -Yenya 
: 
: -- 
: | Jan "Yenya" Kasprzak http://fi.muni.cz/ | fi.muni.cz ] - work | 
[ http://yenya.net/ | yenya.net ] - private}> | 
: | [ http://www.fi.muni.cz/~kas/ | http://www.fi.muni.cz/~kas/ ] GPG: 
4096R/A45477D5 | 
: This is the world we live in: the way to deal with computers is to google 
: the symptoms, and hope that you don't have to watch a video. --P. Zaitcev 
: ___ 
: ceph-users mailing list 
: [ mailto:ceph-users@lists.ceph.com | ceph-users@lists.ceph.com ] 
: [ http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com | 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ] 
: 
: ___
: ceph-users mailing list
: ceph-users@lists.ceph.com
: http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

-- 
| Jan "Yenya" Kasprzak  |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
 This is the world we live in: the way to deal with computers is to google
 the symptoms, and hope that you don't have to watch a video. --P. Zaitcev
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] backfill_toofull after adding new OSDs

2019-01-31 Thread Jan Kasprzak
Hello, ceph users,

I see the following HEALTH_ERR during cluster rebalance:

Degraded data redundancy (low space): 8 pgs backfill_toofull

Detailed description:
I have upgraded my cluster to mimic and added 16 new bluestore OSDs
on 4 hosts. The hosts are in a separate region in my crush map, and crush
rules prevented data from being moved to the new OSDs. Now I want to move
all data to the new OSDs (and possibly decommission the old filestore OSDs).
I have created the following rule:

# ceph osd crush rule create-replicated on-newhosts newhostsroot host

after this, I am slowly moving the pools one-by-one to this new rule:

# ceph osd pool set test-hdd-pool crush_rule on-newhosts

When I do this, I get the above error. This is misleading, because
"ceph osd df" does not suggest the OSDs are getting full (the fullest
OSD is about 41 % full). After rebalancing is done, the HEALTH_ERR
disappears. Why am I getting this error?

# ceph -s
  cluster:
id: ...my UUID...
health: HEALTH_ERR
1271/3803223 objects misplaced (0.033%)
Degraded data redundancy: 40124/3803223 objects degraded (1.055%), 
65 pgs degraded, 67 pgs undersized
Degraded data redundancy (low space): 8 pgs backfill_toofull
 
  services:
mon: 3 daemons, quorum mon1,mon2,mon3
mgr: mon2(active), standbys: mon1, mon3
osd: 80 osds: 80 up, 80 in; 90 remapped pgs
rgw: 1 daemon active
 
  data:
pools:   13 pools, 5056 pgs
objects: 1.27 M objects, 4.8 TiB
usage:   15 TiB used, 208 TiB / 224 TiB avail
pgs: 40124/3803223 objects degraded (1.055%)
 1271/3803223 objects misplaced (0.033%)
 4963 active+clean
 41   active+recovery_wait+undersized+degraded+remapped
 21   active+recovery_wait+undersized+degraded
 17   active+remapped+backfill_wait
 5    active+remapped+backfill_wait+backfill_toofull
 3    active+remapped+backfill_toofull
 2    active+recovering+undersized+remapped
 2    active+recovering+undersized+degraded+remapped
 1    active+clean+remapped
 1    active+recovering+undersized+degraded
 
  io:
client:   6.6 MiB/s rd, 2.7 MiB/s wr, 75 op/s rd, 89 op/s wr
recovery: 2.0 MiB/s, 92 objects/s
 
Thanks for any hint,

-Yenya

-- 
| Jan "Yenya" Kasprzak  |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
 This is the world we live in: the way to deal with computers is to google
 the symptoms, and hope that you don't have to watch a video. --P. Zaitcev
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Spec for Ceph Mon+Mgr?

2019-01-23 Thread Jan Kasprzak
jes...@krogh.cc wrote:
: Hi.
: 
: We're currently co-locating our mons with the head node of our Hadoop
: installation. That may be giving us some problems, we dont know yet, but
: thus I'm speculation about moving them to dedicated hardware.
: 
: It is hard to get specifications "small" engough .. the specs for the
: mon is where we usually virtualize our way out of if .. which seems very
: wrong here.
: 
: Are other people just co-locating it with something random or what are
: others typically using in a small ceph cluster (< 100 OSDs .. 7 OSD hosts)

Jesper,

FWIW, we colocate our mons/mgrs with the OpenNebula master node and minor
OpenNebula host nodes. As an example, one of them is an AMD Opteron 6134
(8 cores, 2.3 GHz), 16 GB RAM, 1 Gbit ethernet.

I want to keep this setup also in the future, but I may move the
OpenNebula virtualization off the mon hosts - not because the hosts are
overloaded, but because they are getting too old/slow/small for the VMs themselves :-).

We have three mons, all with a similar configuration.

-Yenya

-- 
| Jan "Yenya" Kasprzak  |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
 This is the world we live in: the way to deal with computers is to google
 the symptoms, and hope that you don't have to watch a video. --P. Zaitcev
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Migrating to a dedicated cluster network

2019-01-23 Thread Jan Kasprzak
Jakub Jaszewski wrote:
: Hi Yenya,
: 
: Can I ask how your cluster looks and  why you want to do the network
: splitting?

Jakub,

we originally deployed the Ceph cluster as a proof of concept for
a private cloud. We run OpenNebula and Ceph on about 30 old servers
with old HDDs (2 OSDs per host), all connected via 1 Gbit ethernet
with a 10 Gbit backbone. Since then our private cloud has become pretty popular
among our users, so we are planning to upgrade it to a smaller number
of modern servers. The new servers have two 10GbE interfaces, so the primary
reasoning behind it is "why not use them both when we already have them".
Of course, interface teaming/bonding is another option.

Currently I see the network being saturated only when doing a live
migration of a VM between physical hosts, and during a Ceph
cluster rebalance.

So, I don't think moving to a dedicated cluster network is a necessity for us.

Anyway, does anybody use the cluster network with larger MTU (jumbo frames)?

: We used to set up 9-12 OSD nodes (12-16 HDDs each) clusters using 2x10Gb
: for access and 2x10Gb for cluster network, however, I don't see the reasons
: to not use just one network for next cluster setup.


-Yenya

: śr., 23 sty 2019, 10:40: Jan Kasprzak  napisał(a):
: 
: > Hello, Ceph users,
: >
: > is it possible to migrate already deployed Ceph cluster, which uses
: > public network only, to a split public/dedicated networks? If so,
: > can this be done without service disruption? I have now got a new
: > hardware which makes this possible, but I am not sure how to do it.
: >
: > Another question is whether the cluster network can be done
: > solely on top of IPv6 link-local addresses without any public address
: > prefix.
: >
: > When deploying this cluster (Ceph Firefly, IIRC), I had problems
: > with mixed IPv4/IPv6 addressing, and ended up with ms_bind_ipv6 = false
: > in my Ceph conf.
: >
: > Thanks,
: >
: > -Yenya
: >
: > --
: > | Jan "Yenya" Kasprzak 
: > |
: > | http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5
: > |
: >  This is the world we live in: the way to deal with computers is to google
: >  the symptoms, and hope that you don't have to watch a video. --P. Zaitcev
: > ___
: > ceph-users mailing list
: > ceph-users@lists.ceph.com
: > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
: >

-- 
| Jan "Yenya" Kasprzak  |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
 This is the world we live in: the way to deal with computers is to google
 the symptoms, and hope that you don't have to watch a video. --P. Zaitcev
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Migrating to a dedicated cluster network

2019-01-23 Thread Jan Kasprzak
Hello, Ceph users,

is it possible to migrate an already deployed Ceph cluster, which uses
the public network only, to split public/cluster networks? If so,
can this be done without service disruption? I have now got new
hardware which makes this possible, but I am not sure how to do it.

Another question is whether the cluster network can be done
solely on top of IPv6 link-local addresses without any public address prefix.

When deploying this cluster (Ceph Firefly, IIRC), I had problems
with mixed IPv4/IPv6 addressing, and ended up with ms_bind_ipv6 = false
in my Ceph conf.
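
For reference, my understanding of what the end state would look like in
ceph.conf (the subnets are placeholders; as far as I understand, the mons
stay on the public network, and the OSDs have to be restarted to pick up the
cluster network):

[global]
public network  = 192.0.2.0/24
cluster network = 198.51.100.0/24
ms bind ipv6 = false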

Thanks,

-Yenya

-- 
| Jan "Yenya" Kasprzak  |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
 This is the world we live in: the way to deal with computers is to google
 the symptoms, and hope that you don't have to watch a video. --P. Zaitcev
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] block.db on a LV? (Re: Mixed SSD+HDD OSD setup recommendation)

2019-01-18 Thread Jan Kasprzak
Alfredo,

Alfredo Deza wrote:
: On Fri, Jan 18, 2019 at 7:21 AM Jan Kasprzak  wrote:
: > Eugen Block wrote:
: > :
: > : I think you're running into an issue reported a couple of times.
: > : For the use of LVM you have to specify the name of the Volume Group
: > : and the respective Logical Volume instead of the path, e.g.
: > :
: > : ceph-volume lvm prepare --bluestore --block.db ssd_vg/ssd00 --data 
/dev/sda
: > thanks, I will try it. In the meantime, I have discovered another way
: > how to get around it: convert my SSDs from MBR to GPT partition table,
: > and then create 15 additional GPT partitions for the respective block.dbs
: > instead of 2x15 LVs.
: 
: This is because ceph-volume can accept both LVs or GPT partitions for block.db
: 
: Another way around this, that doesn't require you to create the LVs is
: to use the `batch` sub-command, that will automatically
: detect your HDD and put data on it, and detect the SSD and create the
: block.db LVs. The command could look something like:
: 
: 
: ceph-volume lvm batch --bluestore /dev/sda /dev/sdb /dev/sdc /dev/sdd
: /dev/nvme0n1
: 
: Would create 4 OSDs, place data on: sda, sdb, sdc, and sdd. And create
: 4 block.db LVs on nvme0n1

Interesting. Thanks!

Can the batch command also accept partitions instead of a whole
device for block.db? I already have two partitions on my SSDs for
root and swap.

-Yenya

-- 
| Jan "Yenya" Kasprzak  |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
 This is the world we live in: the way to deal with computers is to google
 the symptoms, and hope that you don't have to watch a video. --P. Zaitcev
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] block.db on a LV? (Re: Mixed SSD+HDD OSD setup recommendation)

2019-01-18 Thread Jan Kasprzak
Eugen Block wrote:
: Hi Jan,
: 
: I think you're running into an issue reported a couple of times.
: For the use of LVM you have to specify the name of the Volume Group
: and the respective Logical Volume instead of the path, e.g.
: 
: ceph-volume lvm prepare --bluestore --block.db ssd_vg/ssd00 --data /dev/sda

Eugen,

thanks, I will try it. In the meantime, I have discovered another way
to get around it: convert my SSDs from MBR to a GPT partition table,
and then create 15 additional GPT partitions for the respective block.dbs
instead of 2x15 LVs.

-Yenya

-- 
| Jan "Yenya" Kasprzak  |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
 This is the world we live in: the way to deal with computers is to google
 the symptoms, and hope that you don't have to watch a video. --P. Zaitcev
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] block.db on a LV? (Re: Mixed SSD+HDD OSD setup recommendation)

2019-01-18 Thread Jan Kasprzak
Hello, Ceph users,

replying to my own post from several weeks ago:

Jan Kasprzak wrote:
: [...] I plan to add new OSD hosts,
: and I am looking for setup recommendations.
: 
: Intended usage:
: 
: - small-ish pool (tens of TB) for RBD volumes used by QEMU
: - large pool for object-based cold (or not-so-hot :-) data,
:   write-once read-many access pattern, average object size
:   10s or 100s of MBs, probably custom programmed on top of
:   libradosstriper.
: 
: Hardware:
: 
: The new OSD hosts have ~30 HDDs 12 TB each, and two 960 GB SSDs.
: There is a small RAID-1 root and RAID-1 swap volume spanning both SSDs,
: leaving about 900 GB free on each SSD.
: The OSD hosts have two CPU sockets (32 cores including SMT), 128 GB RAM.
: 
: My questions:
[...]
: - block.db on SSDs? The docs recommend about 4 % of the data size
:   for block.db, but my SSDs are only 0.6 % of total storage size.
: 
: - or would it be better to leave SSD caching on the OS and use LVMcache
:   or something?
: 
: - LVM or simple volumes?

I have a problem setting this up with ceph-volume: I want to have an OSD
on each HDD, with its block.db on an SSD. In order to set this up,
I created a VG on the two SSDs, created 30 LVs on top of it for the block.dbs,
and wanted to create an OSD using the following:

# ceph-volume lvm prepare --bluestore \
--block.db /dev/ssd_vg/ssd00 \
--data /dev/sda
[...]
--> blkid could not detect a PARTUUID for device: /dev/cbia_ssd_vg/ssd00
--> Was unable to complete a new OSD, will rollback changes
[...]

Then it failed, because deploying a volume used the client.bootstrap-osd user,
but rolling the changes back required the client.admin user,
which does not have a keyring on the OSD host. Never mind.

The problem is with determining the PARTUUID of the SSD LV for block.db.
How can I deploy an OSD which is on top of a bare HDD, but which also
has its block.db on an existing LV?
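
For completeness, the VG/LV notation suggested in the replies above would
make the whole sequence look roughly like this (device names and the LV size
are made up for the example):

# vgcreate ssd_vg /dev/sdab2 /dev/sdac2
# lvcreate -L 60G -n ssd00 ssd_vg
# ceph-volume lvm prepare --bluestore --data /dev/sda --block.db ssd_vg/ssd00

i.e. the --block.db argument is given as vg/lv, not as a /dev path.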

Thanks,

-Yenya

-- 
| Jan "Yenya" Kasprzak  |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
 This is the world we live in: the way to deal with computers is to google
 the symptoms, and hope that you don't have to watch a video. --P. Zaitcev
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Radosgw cannot create pool

2019-01-17 Thread Jan Kasprzak
Hello, Ceph users,

TL;DR: radosgw fails on me with the following message:

2019-01-17 09:34:45.247721 7f52722b3dc0  0 rgw_init_ioctx ERROR: 
librados::Rados::pool_create returned (34) Numerical result out of range (this 
can be due to a pool or placement group misconfiguration, e.g. pg_num < pgp_num 
or mon_max_pg_per_osd exceeded)

Detailed description:

I have a Ceph cluster installed a long time ago as Firefly on CentOS 7,
and now running Luminous. So far I have used it for RBD pools, but now
I want to try using radosgw as well.

I tried to deploy radosgw using

# ceph-deploy rgw create myhost

Which went well until it tried to start it up:

[myhost][INFO  ] Running command: service ceph-radosgw start
[myhost][WARNIN] Redirecting to /bin/systemctl start ceph-radosgw.service
[myhost][WARNIN] Failed to start ceph-radosgw.service: Unit not found.
[myhost][ERROR ] RuntimeError: command returned non-zero exit status: 5
[ceph_deploy.rgw][ERROR ] Failed to execute command: service ceph-radosgw start
[ceph_deploy][ERROR ] GenericError: Failed to create 1 RGWs

Comparing it to my testing deployment of Mimic, where radosgw works,
the problem was with the unit name; the correct way to start it up
apparently was

# systemctl start ceph-radosgw@rgw.myhost.service

Now it is apparently running:

/usr/bin/radosgw -f --cluster ceph --name client.rgw.myhost --setuser ceph 
--setgroup ceph

However, when I want to add the first user, radosgw-admin fails and
radosgw itself exits with a similar message:

# radosgw-admin user create --uid=kas --display-name="Jan Kasprzak"
2019-01-17 09:52:29.805828 7fea6cfd2dc0  0 rgw_init_ioctx ERROR: 
librados::Rados::pool_create returned (34) Numerical result out of range (this 
can be due to a pool or placement group misconfiguration, e.g. pg_num < pgp_num 
or mon_max_pg_per_osd exceeded)
2019-01-17 09:52:29.805957 7fea6cfd2dc0 -1 ERROR: failed to initialize watch: 
(34) Numerical result out of range
couldn't init storage provider

So I guess it is trying to create a pool for data, but it fails somehow.
Can I determine which pool it is and what parameters it tries to use?

I have looked at my testing mimic cluster, and radosgw there created the
following pools:

.rgw.root
default.rgw.control
default.rgw.meta
default.rgw.log
default.rgw.buckets.index
default.rgw.buckets.data

So I created these pools manually on my luminous cluster as well:

# ceph osd pool create .rgw.root 128
(repeat for all the above pool names)

That helped, and I am now able to create the user with radosgw-admin.
Now where should I look for the exact parameters radosgw is trying
to use when creating its pools?
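
As far as I can tell, the pool names come from the zone placement
configuration and the pg_num comes from the cluster-wide defaults, so these
seem to be the places worth checking (the mon name is a placeholder, and this
is an educated guess rather than something I have verified in the source):

# radosgw-admin zone get --rgw-zone=default
# ceph daemon mon.myhost config get osd_pool_default_pg_num
# ceph daemon mon.myhost config get osd_pool_default_size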

Thanks,

-Yenya

-- 
| Jan "Yenya" Kasprzak  |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
 This is the world we live in: the way to deal with computers is to google
 the symptoms, and hope that you don't have to watch a video. --P. Zaitcev
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Get packages - incorrect link

2019-01-10 Thread Jan Kasprzak
Hello, Ceph users,

I am not sure where to report the issue with the ceph.com website,
so I am posting to this list:

The https://ceph.com/use/ page has an incorrect link for getting
the packages:

"For packages, see http://ceph.com/docs/master/install/get-packages;

- the URL should be http://docs.ceph.com/docs/master/install/get-packages/
instead (docs.ceph.com instead of ceph.com).

Thanks in advance for fixing this.

-Yenya

-- 
| Jan "Yenya" Kasprzak  |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
 This is the world we live in: the way to deal with computers is to google
 the symptoms, and hope that you don't have to watch a video. --P. Zaitcev
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph blog RSS/Atom URL?

2019-01-04 Thread Jan Kasprzak
Gregory Farnum wrote:
: It looks like ceph.com/feed is the RSS url?

Close enough, thanks.

Comparing the above with the blog itself, there are some
posts in (apparently) Chinese in /feed, which are not present in
/community/blog. The first one being

https://ceph.com/planet/vdbench%e6%b5%8b%e8%af%95%e5%ae%9e%e6%97%b6%e5%8f%af%e8%a7%86%e5%8c%96%e6%98%be%e7%a4%ba/

-Yenya

: On Fri, Jan 4, 2019 at 5:52 AM Jan Kasprzak  wrote:
: > is there any RSS or Atom source for Ceph blog? I have looked inside
: > the https://ceph.com/community/blog/ HTML source, but there is no
: > <link> element or anything mentioning RSS or Atom.

-- 
| Jan "Yenya" Kasprzak  |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
 This is the world we live in: the way to deal with computers is to google
 the symptoms, and hope that you don't have to watch a video. --P. Zaitcev
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph health JSON format has changed

2019-01-04 Thread Jan Kasprzak
Gregory Farnum wrote:
: On Wed, Jan 2, 2019 at 5:12 AM Jan Kasprzak  wrote:
: 
: > Thomas Byrne - UKRI STFC wrote:
: > : I recently spent some time looking at this, I believe the 'summary' and
: > : 'overall_status' sections are now deprecated. The 'status' and 'checks'
: > : fields are the ones to use now.
: >
: > OK, thanks.
: >
: > : The 'status' field gives you the OK/WARN/ERR, but returning the most
: > : severe error condition from the 'checks' section is less trivial. AFAIK
: > : all health_warn states are treated as equally severe, and same for
: > : health_err. We ended up formatting our single line human readable output
: > : as something like:
: > :
: > : "HEALTH_ERR: 1 inconsistent pg, HEALTH_ERR: 1 scrub error, HEALTH_WARN:
: > 20 large omap objects"
: >
: > Speaking of scrub errors:
: >
: > In previous versions of Ceph, I was able to determine which PGs had
: > scrub errors, and then a cron.hourly script ran "ceph pg repair" for them,
: > provided that they were not already being scrubbed. In Luminous, the bad PG
: > is not visible in "ceph --status" anywhere. Should I use something like
: > "ceph health detail -f json-pretty" instead?
: >
: > Also, is it possible to configure Ceph to attempt repairing
: > the bad PGs itself, as soon as the scrub fails? I run most of my OSDs on
: > top
: > of a bunch of old spinning disks, and a scrub error almost always means
: > that there is a bad sector somewhere, which can easily be fixed by
: > rewriting the lost data using "ceph pg repair".
: >
: 
: It is possible. It's a lot safer than it used to be, but is still NOT
: RECOMMENDED for replicated pools.
: 
: But if you are very sure, you can use the options osd_scrub_auto_repair
: (default: false) and osd_scrub_auto_repair_num_errors (default:5, which
: will not auto-repair if scrub detects more errors than that value) to
: configure it.

OK, thanks. I just want to say that I am NOT very sure,
but this is about the only way I am aware of to
handle a scrub error. I have mail notification set up in smartd.conf,
and so far the scrub errors seem to correlate with new reallocated
or pending sectors.

What are the drawbacks of running "ceph pg repair" as soon
as the cluster enters the HEALTH_ERR state with a scrub error?
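
For reference, my understanding of what Greg describes would be something
like the following in ceph.conf on the OSD hosts - noted here for the
archives, not as a recommendation:

[osd]
osd scrub auto repair = true
osd scrub auto repair num errors = 5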

Thanks for explanation,

-Yenya

-- 
| Jan "Yenya" Kasprzak  |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
 This is the world we live in: the way to deal with computers is to google
 the symptoms, and hope that you don't have to watch a video. --P. Zaitcev
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph health JSON format has changed

2019-01-02 Thread Jan Kasprzak
Thomas Byrne - UKRI STFC wrote:
: I recently spent some time looking at this, I believe the 'summary' and
: 'overall_status' sections are now deprecated. The 'status' and 'checks'
: fields are the ones to use now.

OK, thanks.

: The 'status' field gives you the OK/WARN/ERR, but returning the most
: severe error condition from the 'checks' section is less trivial. AFAIK
: all health_warn states are treated as equally severe, and same for
: health_err. We ended up formatting our single line human readable output
: as something like:
: 
: "HEALTH_ERR: 1 inconsistent pg, HEALTH_ERR: 1 scrub error, HEALTH_WARN: 20 
large omap objects"

Speaking of scrub errors:

In previous versions of Ceph, I was able to determine which PGs had
scrub errors, and then a cron.hourly script ran "ceph pg repair" for them,
provided that they were not already being scrubbed. In Luminous, the bad PG
is not visible in "ceph --status" anywhere. Should I use something like
"ceph health detail -f json-pretty" instead?

Also, is it possible to configure Ceph to attempt repairing
the bad PGs itself, as soon as the scrub fails? I run most of my OSDs on top
of a bunch of old spinning disks, and a scrub error almost always means
that there is a bad sector somewhere, which can easily be fixed by
rewriting the lost data using "ceph pg repair".

Thanks,

-Yenya

-- 
| Jan "Yenya" Kasprzak  |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
 This is the world we live in: the way to deal with computers is to google
 the symptoms, and hope that you don't have to watch a video. --P. Zaitcev
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph health JSON format has changed sync?

2019-01-02 Thread Jan Kasprzak
Hello, Ceph users,

I am afraid the following question is a FAQ, but I still was not able
to find the answer:

I use ceph --status --format=json-pretty as a source of CEPH status
for my Nagios monitoring. After upgrading to Luminous, I see the following
in the JSON output when the cluster is not healthy:

"summary": [
{
"severity": "HEALTH_WARN",
"summary": "'ceph health' JSON format has changed in luminous. 
If you see this your monitoring system is scraping the wrong fields. Disable 
this with 'mon health preluminous compat warning = false'"
}
],

Apart from that, the JSON data seems reasonable. My question is which parts
of the JSON structure are the "wrong fields" I have to avoid. Is it just the
"summary" section, or some other parts as well? Or should I avoid
the whole ceph --status output and use something different instead?
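
For reference, what my Nagios check currently extracts is roughly the
following (shown with jq for brevity) - presumably these are exactly the
"wrong fields" the warning is about:

ceph --status --format=json-pretty | jq -r '.health.overall_status'
ceph --status --format=json-pretty | jq -r '.health.summary[].summary'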

What I want is a single machine-readable value with OK/WARNING/ERROR meaning,
and a single human-readable text line, describing the most severe
error condition which is currently present. What is the preferred way to
get this data in Luminous?

Thanks,

-Yenya

-- 
| Jan "Yenya" Kasprzak  |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
 This is the world we live in: the way to deal with computers is to google
 the symptoms, and hope that you don't have to watch a video. --P. Zaitcev
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Mixed SSD+HDD OSD setup recommendation

2018-12-05 Thread Jan Kasprzak
Hello, CEPH users,

having upgraded my CEPH cluster to Luminous, I plan to add new OSD hosts,
and I am looking for setup recommendations.

Intended usage:

- small-ish pool (tens of TB) for RBD volumes used by QEMU
- large pool for object-based cold (or not-so-hot :-) data,
write-once read-many access pattern, average object size
10s or 100s of MBs, probably custom programmed on top of
libradosstriper.

Hardware:

The new OSD hosts have ~30 HDDs 12 TB each, and two 960 GB SSDs.
There is a small RAID-1 root and RAID-1 swap volume spanning both SSDs,
leaving about 900 GB free on each SSD.
The OSD hosts have two CPU sockets (32 cores including SMT), 128 GB RAM.

My questions:

- Filestore or Bluestore? -> probably the latter, but I am also considering
using the OSD hosts for QEMU-based VMs which are not performance
critical, and then letting the kernel balance the memory usage
between ceph-osd and qemu processes (using Filestore) would
probably be better? Am I right?

- block.db on SSDs? The docs recommend about 4 % of the data size
for block.db, but my SSDs are only 0.6 % of total storage size.

- or would it be better to leave the SSD caching to the OS and use LVMcache
or something similar?

- LVM or simple volumes? I find it a bit strange and bloated to create
32 VGs, each VG for a single HDD or SSD, and have 30 VGs with only
one LV. Could I use /dev/disk/by-id/wwn-0x5000 symlinks to have
stable device names instead, and have only two VGs for two SSDs?

Thanks for any recommendations.
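
For illustration, the layout I have in mind for the last two questions would
be created roughly like this (device, VG and LV names are made up):

vgcreate ceph-db-0 /dev/sda4                # free space on the first SSD
lvcreate -L 50G -n db-sdc ceph-db-0         # one small block.db LV per HDD OSD
ceph-volume lvm create --bluestore --data /dev/sdc --block.db ceph-db-0/db-sdc

i.e. only two VGs (one per SSD), with the HDDs handed to ceph-volume directly.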

-Yenya

-- 
| Jan "Yenya" Kasprzak  |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
 This is the world we live in: the way to deal with computers is to google
 the symptoms, and hope that you don't have to watch a video. --P. Zaitcev
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Upgrade to Luminous (mon+osd)

2018-12-03 Thread Jan Kasprzak
Dan van der Ster wrote:
: It's not that simple see http://tracker.ceph.com/issues/21672
: 
: For the 12.2.8 to 12.2.10 upgrade it seems the selinux module was
: updated -- so the rpms restart the ceph.target.
: What's worse is that this seems to happen before all the new updated
: files are in place.
: 
: Our 12.2.8 to 12.2.10 upgrade procedure is:
: 
: systemctl stop ceph.target
: yum update
: systemctl start ceph.target

Yes, this looks reasonable. Except that when upgrading
from Jewel, even after the restart the OSDs do not work until
_all_ mons are upgraded. So effectively, if a PG happens to be placed
on the mon hosts only, there will be a service outage during the upgrade
from Jewel.

So I guess the upgrade procedure described here:

http://docs.ceph.com/docs/master/releases/luminous/#upgrade-from-jewel-or-kraken

is misleading - the mons and osds get restarted anyway by the package
upgrade itself. The user should be warned that for this reason the package
upgrades should be run one host at a time, and that the upgrade is not
possible without a service outage when there are OSDs on the mon hosts and
the cluster is running under SELinux.
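
With that in mind, the per-host procedure which seems safest to me is roughly
the following (illustrative only, not taken from the release notes):

ceph osd set noout                  # once, before the first host
systemctl stop ceph.target          # on the host being upgraded
yum --enablerepo Ceph update
systemctl start ceph.target
ceph -s                             # wait until the PGs are active+clean again
ceph osd unset noout                # once, after the last host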

Also, there is another important thing omitted by the above upgrade
procedure: after running "ceph osd require-osd-release luminous",
I got a HEALTH_WARN saying "application not enabled on X pool(s)".
I have fixed this by running the following scriptlet:

ceph osd pool ls | while read pool; do ceph osd pool application enable $pool rbd; done

(yes, all of my pools are used for rbd for now). Maybe this should be mentioned
in the release notes as well. Thanks,

-Yenya

: On Mon, Dec 3, 2018 at 12:42 PM Paul Emmerich  wrote:
: >
: > Upgrading Ceph packages does not restart the services -- exactly for
: > this reason.
: >
: > This means there's something broken with your yum setup if the
: > services are restarted when only installing the new version.
: >
: >
: > Paul
: >
: > --
: > Paul Emmerich
: >
: > Looking for help with your Ceph cluster? Contact us at https://croit.io
: >
: > croit GmbH
: > Freseniusstr. 31h
: > 81247 München
: > www.croit.io
: > Tel: +49 89 1896585 90
: >
: > On Mon, 3 Dec 2018 at 11:56, Jan Kasprzak wrote:
: > >
: > > Hello, ceph users,
: > >
: > > I have a small(-ish) Ceph cluster, where there are osds on each host,
: > > and in addition to that, there are mons on the first three hosts.
: > > Is it possible to upgrade the cluster to Luminous without service
: > > interruption?
: > >
: > > I have tested that when I run "yum --enablerepo Ceph update" on a
: > > mon host, the osds on that host remain down until all three mons
: > > are upgraded to Luminous. Is it possible to upgrade ceph-mon only,
: > > and keep ceph-osd running the old version (Jewel in my case) as long
: > > as possible? It seems RPM dependencies forbid this, but with --nodeps
: > > it could be done.
: > >
: > > Is there a supported way how to upgrade host running both mon and osd
: > > to Luminous?
: > >
: > > Thanks,
: > >
: > > -Yenya
: > >
: > > --
: > > | Jan "Yenya" Kasprzak  
|
: > > | http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 
|
: > >  This is the world we live in: the way to deal with computers is to google
: > >  the symptoms, and hope that you don't have to watch a video. --P. Zaitcev
: > > ___
: > > ceph-users mailing list
: > > ceph-users@lists.ceph.com
: > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
: > ___
: > ceph-users mailing list
: > ceph-users@lists.ceph.com
: > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

-- 
| Jan "Yenya" Kasprzak  |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
 This is the world we live in: the way to deal with computers is to google
 the symptoms, and hope that you don't have to watch a video. --P. Zaitcev
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Upgrade to Luminous (mon+osd)

2018-12-03 Thread Jan Kasprzak
Paul Emmerich wrote:
: Upgrading Ceph packages does not restart the services -- exactly for
: this reason.
: 
: This means there's something broken with your yum setup if the
: services are restarted when only installing the new version.

Interesting. I have verified that I have

CEPH_AUTO_RESTART_ON_UPGRADE=no

in my /etc/sysconfig/ceph, yet my ceph-osd daemons get restarted on upgrade.
I have watched "ps ax|grep ceph-osd" output during
"yum --enablerepo Ceph update", and it seems the OSDs got restarted
near the time ceph-selinux got upgraded:

  Updating   : 2:ceph-base-12.2.10-0.el7.x86_64  74/248 
  Updating   : 2:ceph-selinux-12.2.10-0.el7.x86_64   75/248
  Updating   : 2:ceph-mon-12.2.10-0.el7.x86_64   76/248 

And indeed, rpm -q --scripts ceph-selinux shows that this package restarts
the whole ceph.target when the labels have changed:

[...]
# Check whether the daemons are running
/usr/bin/systemctl status ceph.target > /dev/null 2>&1
STATUS=$?

# Stop the daemons if they were running
if test $STATUS -eq 0; then
/usr/bin/systemctl stop ceph.target > /dev/null 2>&1
fi
[...]

So maybe ceph-selinux should also honor CEPH_AUTO_RESTART_ON_UPGRADE=no
in /etc/sysconfig/ceph? But I am not sure whether that is possible at all
when the labels have changed.
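
For what it is worth, one can at least check in advance whether a given
package version would restart the daemons, by inspecting its scriptlets
before upgrading (yumdownloader is from yum-utils):

yumdownloader --enablerepo Ceph ceph-selinux
rpm -qp --scripts ceph-selinux-*.rpm | grep -B2 -A8 'ceph.target'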

-Yenya

: On Mon, 3 Dec 2018 at 11:56, Jan Kasprzak wrote:
: >
: > I have a small(-ish) Ceph cluster, where there are osds on each host,
: > and in addition to that, there are mons on the first three hosts.
: > Is it possible to upgrade the cluster to Luminous without service
: > interruption?
: >
: > I have tested that when I run "yum --enablerepo Ceph update" on a
: > mon host, the osds on that host remain down until all three mons
: > are upgraded to Luminous. Is it possible to upgrade ceph-mon only,
: > and keep ceph-osd running the old version (Jewel in my case) as long
: > as possible? It seems RPM dependencies forbid this, but with --nodeps
: > it could be done.
: >
: > Is there a supported way how to upgrade host running both mon and osd
: > to Luminous?

-- 
| Jan "Yenya" Kasprzak  |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
 This is the world we live in: the way to deal with computers is to google
 the symptoms, and hope that you don't have to watch a video. --P. Zaitcev
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Upgrade to Luminous (mon+osd)

2018-12-03 Thread Jan Kasprzak
Hello, ceph users,

I have a small(-ish) Ceph cluster, where there are osds on each host,
and in addition to that, there are mons on the first three hosts.
Is it possible to upgrade the cluster to Luminous without service
interruption?

I have tested that when I run "yum --enablerepo Ceph update" on a
mon host, the osds on that host remain down until all three mons
are upgraded to Luminous. Is it possible to upgrade ceph-mon only,
and keep ceph-osd running the old version (Jewel in my case) as long
as possible? It seems RPM dependencies forbid this, but with --nodeps
it could be done.

Is there a supported way to upgrade a host running both a mon and an osd
to Luminous?

Thanks,

-Yenya

-- 
| Jan "Yenya" Kasprzak  |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
 This is the world we live in: the way to deal with computers is to google
 the symptoms, and hope that you don't have to watch a video. --P. Zaitcev
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Atomic object replacement with libradosstriper

2017-08-15 Thread Jan Kasprzak
Hello, Ceph users,

I would like to use RADOS as an object storage (I have written about it
to this list a while ago), and I would like to use libradosstriper with C,
as has been suggested to me here.

My question is: when writing an object, is it possible to
do it so that either the old version as a whole or the new version
as a whole is visible to readers at all times? Also, when creating a new
object, only the fully written new object should be visible.

Is it possible to do this with libradosstriper?
With a POSIX filesystem, one would do write(tmpfile)+fsync()+rename()
to achieve a similar result.

Thanks!

-Yenya

-- 
| Jan "Yenya" Kasprzak  |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
> That's why this kind of vulnerability is a concern: deploying stuff is  <
> often about collecting an obscene number of .jar files and pushing them <
> up to the application server.  --pboddie at LWN <
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pgs stuck unclean after removing OSDs

2017-06-28 Thread Jan Kasprzak
David Turner wrote:
: A couple things.  You didn't `ceph osd crush remove osd.21` after doing the
: other bits.  Also you will want to remove the bucket (re: host) from the
: crush map as it will now be empty.  Right now you have a host in the crush
: map with a weight, but no osds to put that data on.  It has a weight
: because of the 2 OSDs that are still in it that were removed from the
: cluster but not from the crush map.  It's confusing to your cluster.

OK, this helped. I have removed osd.20 and osd.21 from the crush
map, as well as the bucket for the faulty host. PGs got unstuck, and after
some time, my system now reports HEALTH_OK.
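
For the record, the complete sequence for permanently removing the OSDs of a
dead host therefore seems to be (OSD numbers as in my case; the host bucket
name is whatever "ceph osd tree" shows for the failed host):

ceph osd out osd.20
ceph osd crush remove osd.20
ceph auth del osd.20
ceph osd rm osd.20
# ... the same for osd.21, and once the host bucket is empty:
ceph osd crush remove <failed-host-bucket>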

Thanks for the hint!

-Yenya

-- 
| Jan "Yenya" Kasprzak  |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
> That's why this kind of vulnerability is a concern: deploying stuff is  <
> often about collecting an obscene number of .jar files and pushing them <
> up to the application server.  --pboddie at LWN <
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] pgs stuck unclean after removing OSDs

2017-06-28 Thread Jan Kasprzak
Hello,

TL;DR: what to do when my cluster reports stuck unclean pgs?

Detailed description:

One of the nodes in my cluster died. CEPH correctly rebalanced itself,
and reached the HEALTH_OK state. I have looked at the failed server,
and decided to take it out of the cluster permanently, because the hardware
is indeed faulty. It used to host two OSDs, which were marked down and out
in "ceph osd dump".

So from the HEALTH_OK I ran the following commands:

# ceph auth del osd.20
# ceph auth del osd.21
# ceph osd rm osd.20
# ceph osd rm osd.21

After that, CEPH started to rebalance itself, but now it reports some PGs
as "stuck unclean", and there is no "recovery I/O" visible in "ceph -s":

# ceph -s
cluster 3065224c-ea2e-4558-8a81-8f935dde56e5
 health HEALTH_WARN
350 pgs stuck unclean
recovery 26/1596390 objects degraded (0.002%)
recovery 58772/1596390 objects misplaced (3.682%)
 monmap e16: 3 mons at {...}
election epoch 584, quorum 0,1,2 ...
 osdmap e61435: 58 osds: 58 up, 58 in; 350 remapped pgs
flags require_jewel_osds
  pgmap v35959908: 3776 pgs, 6 pools, 2051 GB data, 519 kobjects
6244 GB used, 40569 GB / 46814 GB avail
26/1596390 objects degraded (0.002%)
58772/1596390 objects misplaced (3.682%)
3426 active+clean
 349 active+remapped
   1 active
  client io 5818 B/s rd, 8457 kB/s wr, 0 op/s rd, 71 op/s wr

# ceph health detail
HEALTH_WARN 350 pgs stuck unclean; recovery 26/1596390 objects degraded 
(0.002%); recovery 58772/1596390 objects misplaced (3.682%)
pg 28.fa is stuck unclean for 14408925.966824, current state active+remapped, 
last acting [38,52,4]
pg 28.e7 is stuck unclean for 14408925.966886, current state active+remapped, 
last acting [29,42,22]
pg 23.dc is stuck unclean for 61698.641750, current state active+remapped, last 
acting [50,33,23]
pg 23.d9 is stuck unclean for 61223.093284, current state active+remapped, last 
acting [54,31,23]
pg 28.df is stuck unclean for 14408925.967120, current state active+remapped, 
last acting [33,7,15]
pg 34.38 is stuck unclean for 60904.322881, current state active+remapped, last 
acting [18,41,9]
pg 34.fe is stuck unclean for 60904.241762, current state active+remapped, last 
acting [58,1,44]
[...]
pg 28.8f is stuck unclean for 66102.059671, current state active, last acting 
[8,40,5]
[...]
recovery 26/1596390 objects degraded (0.002%)
recovery 58772/1596390 objects misplaced (3.682%)

Apart from that, the data stored in CEPH pools seems to be reachable
and usable as before.

The nodes run CentOS 7 and ceph 10.2.5 (RPMS downloaded from CEPH repository).

What other debugging info should I provide, or what to do in order
to unstuck the stuck pgs? Thanks!

-Yenya

-- 
| Jan "Yenya" Kasprzak  |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
> That's why this kind of vulnerability is a concern: deploying stuff is  <
> often about collecting an obscene number of .jar files and pushing them <
> up to the application server.  --pboddie at LWN <
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rados rm: device or resource busy

2017-06-09 Thread Jan Kasprzak
Hello,

Brad Hubbard wrote:
: I can reproduce this.
[...] 
: That's here where you will notice it is returning EBUSY which is error
: code 16, "Device or resource busy".
: 
: 
https://github.com/badone/ceph/blob/wip-ceph_test_admin_socket_output/src/cls/lock/cls_lock.cc#L189
: 
: In order to remove the existing parts of the file you should be able
: to just run "rados --pool testpool ls" and remove the listed objects
: belonging to "testfile".
: 
: Example:
: rados --pool testpool ls
: testfile.0004
: testfile.0001
: testfile.
: testfile.0003
: testfile.0005
: testfile.0002
: 
: rados --pool testpool rm testfile.
: rados --pool testpool rm testfile.0001
: ...

This works for me, thanks!
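
For the record, the cleanup can be scripted along these lines (pool and
object names as in the example above):

rados --pool testpool ls | grep '^testfile\.' | while read obj; do
    rados --pool testpool rm "$obj"
done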

: Please open a tracker for this so it can be investigated further.

Done: http://tracker.ceph.com/issues/20233

-Yenya

-- 
| Jan "Yenya" Kasprzak  |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
> That's why this kind of vulnerability is a concern: deploying stuff is  <
> often about collecting an obscene number of .jar files and pushing them <
> up to the application server.  --pboddie at LWN <
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rados rm: device or resource busy

2017-06-08 Thread Jan Kasprzak
Hello,

David Turner wrote:
: How long have you waited?

About a day.

: I don't do much with rados objects directly.  I usually use RBDs and
: cephfs.  If you just need to clean things up, you can delete the pool and
: recreate it since it looks like it's testing.  However this is probably a
: prime time to figure out how to get past this in case it happens in the
: future in production.

Yes. This is why I am asking now.

-Yenya

: On Thu, Jun 8, 2017 at 11:04 AM Jan Kasprzak <k...@fi.muni.cz> wrote:
: > I have created a RADOS striped object using
: >
: > $ dd someargs | rados --pool testpool --striper put testfile -
: >
: > and interrupted it in the middle of writing. Now I cannot remove this
: > object:
: >
: > $ rados --pool testpool --striper rm testfile
: > error removing testpool>testfile: (16) Device or resource busy
: >
: > How can I tell CEPH that the writer is no longer around and does not come
: > back,
: > so that I can remove the object "testfile"?

-- 
| Jan "Yenya" Kasprzak  |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
> That's why this kind of vulnerability is a concern: deploying stuff is  <
> often about collecting an obscene number of .jar files and pushing them <
> up to the application server.  --pboddie at LWN <
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] rados rm: device or resource busy

2017-06-08 Thread Jan Kasprzak
Hello,

I have created a RADOS striped object using

$ dd someargs | rados --pool testpool --striper put testfile -

and interrupted it in the middle of writing. Now I cannot remove this object:

$ rados --pool testpool --striper rm testfile
error removing testpool>testfile: (16) Device or resource busy

How can I tell CEPH that the writer is no longer around and will not come back,
so that I can remove the object "testfile"?

Thanks,

-Yenya

-- 
| Jan "Yenya" Kasprzak  |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
> That's why this kind of vulnerability is a concern: deploying stuff is  <
> often about collecting an obscene number of .jar files and pushing them <
> up to the application server.  --pboddie at LWN <
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RADOS as a simple object storage

2017-03-01 Thread Jan Kasprzak
Wido den Hollander wrote:
: 
: > On 27 February 2017 at 15:59, Jan Kasprzak <k...@fi.muni.cz> wrote:
: > : > : > Here is some statistics from our biggest instance of the object 
storage:
: > : > : >
: > : > : > objects stored: 100_000_000
: > : > : > < 1024 bytes:10_000_000
: > : > : > 1k-64k bytes:80_000_000
: > : > : > 64k-4M bytes:10_000_000
: > : > : > 4M-256M bytes:1_000_000
: > : > : >> 256M bytes:10_000
: > : > : > biggest object:   15 GBytes
: > : > : >
: > : > : > Would it be feasible to put 100M to 1G objects as a native RADOS 
objects
: > : > : > into a single pool?
[...]
: > 
https://github.com/ceph/ceph/blob/master/src/libradosstriper/RadosStriperImpl.cc#L33
: > 
: > If I understand it correctly, it looks like libradosstriper only splits
: > large stored objects into smaller pieces (RADOS objects), but does not
: > consolidate more small stored objects into larger RADOS objects.
: 
: Why would you want to do that? Yes, very small objects can be a problem if 
you have millions of them since it takes a bit more to replicate them and 
recover them.

Yes. This is what I was afraid of. The immutability of my objects
would allow me to consolidate smaller objects into larger bundles, but
if you say it is not necessary at my scale, I'll store them as
individual RADOS objects.
: 
: But overall I wouldn't bother about it too much.

OK, thanks!

: > So do you think I am ok with >10M tiny objects (smaller than 1KB)
: > and ~100,000,000 to 1,000,000,000 total objects, provided that I split
: > huge objects using libradosstriper?

-Yenya

-- 
| Jan "Yenya" Kasprzak  |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
Assuming that OpenSSL is written as carefully as Wietse's own code,
every 1000 lines introduce one additional bug into Postfix."   --TLS_README
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RADOS as a simple object storage

2017-02-27 Thread Jan Kasprzak
Hello,

Gregory Farnum wrote:
: On Mon, Feb 20, 2017 at 11:57 AM, Jan Kasprzak <k...@fi.muni.cz> wrote:
: > Gregory Farnum wrote:
: > : On Mon, Feb 20, 2017 at 6:46 AM, Jan Kasprzak <k...@fi.muni.cz> wrote:
: > : >
: > : > I have been using CEPH RBD for a year or so as a virtual machine storage
: > : > backend, and I am thinking about moving our another subsystem to CEPH:
[...]
: > : > Here is some statistics from our biggest instance of the object storage:
: > : >
: > : > objects stored: 100_000_000
: > : > < 1024 bytes:10_000_000
: > : > 1k-64k bytes:80_000_000
: > : > 64k-4M bytes:10_000_000
: > : > 4M-256M bytes:1_000_000
: > : >> 256M bytes:10_000
: > : > biggest object:   15 GBytes
: > : >
: > : > Would it be feasible to put 100M to 1G objects as a native RADOS objects
: > : > into a single pool?
: > :
: > : This is well outside the object size RADOS is targeted or tested with;
: > : I'd expect issues. You might want to look at libradosstriper from the
: > : requirements you've mentioned.
: >
: > OK, thanks! Is there any documentation for libradosstriper?
: > I am looking for something similar to librados documentation:
: > http://docs.ceph.com/docs/master/rados/api/librados/
: 
: Not that I see, and I haven't used it myself, but the header file (see
: ceph/src/libradosstriper) seems to have reasonable function docs. It's
: a fairly thin wrapper around librados AFAIK.

OK, I have read the docs in the header file and the comment
near the top of RadosStriperImpl.cc:

https://github.com/ceph/ceph/blob/master/src/libradosstriper/RadosStriperImpl.cc#L33

If I understand it correctly, it looks like libradosstriper only splits
large stored objects into smaller pieces (RADOS objects), but does not
consolidate multiple small stored objects into larger RADOS objects.
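
For example, with the rados CLI, storing a large object via the striper
produces several plain RADOS objects in the pool (names here are made up),
but nothing bundles several small stored objects together:

rados --pool testpool --striper put bigobject /path/to/large/file
rados --pool testpool ls | grep '^bigobject'    # lists the individual stripes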

So do you think I am ok with >10M tiny objects (smaller than 1KB)
and ~100,000,000 to 1,000,000,000 total objects, provided that I split
huge objects using libradosstriper?

Thanks,

-Yenya

-- 
| Jan "Yenya" Kasprzak  |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
Assuming that OpenSSL is written as carefully as Wietse's own code,
every 1000 lines introduce one additional bug into Postfix."   --TLS_README
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RADOS as a simple object storage

2017-02-20 Thread Jan Kasprzak
Gregory Farnum wrote:
: On Mon, Feb 20, 2017 at 6:46 AM, Jan Kasprzak <k...@fi.muni.cz> wrote:
: > Hello, world!\n
: >
: > I have been using CEPH RBD for a year or so as a virtual machine storage
: > backend, and I am thinking about moving our another subsystem to CEPH:
: >
: > The subsystem in question is a simple replicated object storage,
: > currently implemented by a custom C code by yours truly. My question
: > is whether implementing such a thing on top of a CEPH RADOS pool and 
librados
: > is feasible, and what layout and optimizations would you suggest.
: >
: > Our object storage indexes object with a numeric ID. The access methods
: > involve creating, reading and deleting objects. Objects are never modified
: > in place, they are instead deleted and an object with a new ID is created.
: > We also keep a hash of an object contents and use it to prevent bit rot
: > - the objects are scrubbed periodically, and if a checksum mismatch is
: > discovered, the object is restored from another replica.
: >
: > Here is some statistics from our biggest instance of the object storage:
: >
: > objects stored: 100_000_000
: > < 1024 bytes:10_000_000
: > 1k-64k bytes:80_000_000
: > 64k-4M bytes:10_000_000
: > 4M-256M bytes:1_000_000
: >> 256M bytes:10_000
: > biggest object:   15 GBytes
: >
: > Would it be feasible to put 100M to 1G objects as a native RADOS objects
: > into a single pool?
: 
: This is well outside the object size RADOS is targeted or tested with;
: I'd expect issues. You might want to look at libradosstriper from the
: requirements you've mentioned.

OK, thanks! Is there any documentation for libradosstriper?
I am looking for something similar to librados documentation:
http://docs.ceph.com/docs/master/rados/api/librados/

Thanks!

-Yenya


-- 
| Jan "Yenya" Kasprzak  |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
Assuming that OpenSSL is written as carefully as Wietse's own code,
every 1000 lines introduce one additional bug into Postfix."   --TLS_README
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RADOS as a simple object storage

2017-02-20 Thread Jan Kasprzak
Hello, world!\n

I have been using CEPH RBD for a year or so as a virtual machine storage
backend, and I am thinking about moving another of our subsystems to CEPH:

The subsystem in question is a simple replicated object storage,
currently implemented by a custom C code by yours truly. My question
is whether implementing such a thing on top of a CEPH RADOS pool and librados
is feasible, and what layout and optimizations would you suggest.

Our object storage indexes objects by a numeric ID. The access methods
involve creating, reading and deleting objects. Objects are never modified
in place; instead, they are deleted and an object with a new ID is created.
We also keep a hash of an object contents and use it to prevent bit rot
- the objects are scrubbed periodically, and if a checksum mismatch is
discovered, the object is restored from another replica.
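
For illustration, the access pattern is essentially the following (shown with
the rados CLI and made-up pool/object names; the real implementation is the
custom C code mentioned above, and the checksum-as-xattr part is just one
possible mapping):

rados --pool objstore put obj.12345678 /path/to/data     # written exactly once
rados --pool objstore setxattr obj.12345678 sha256 \
      "$(sha256sum /path/to/data | cut -d' ' -f1)"       # checksum for scrubbing
rados --pool objstore get obj.12345678 /tmp/out          # read many times
rados --pool objstore rm obj.12345678                    # delete; a replacement gets a new ID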

Here are some statistics from our biggest instance of the object storage:

objects stored:  100_000_000
  < 1024 bytes:   10_000_000
  1k-64k bytes:   80_000_000
  64k-4M bytes:   10_000_000
  4M-256M bytes:   1_000_000
  > 256M bytes:       10_000
biggest object:  15 GBytes

Would it be feasible to put 100M to 1G objects as native RADOS objects
into a single pool? Or should I take advantage of their read-only nature and
pack them into bigger objects/packs, with metadata stored in a tmap object,
and repack those packed objects periodically as older objects get deleted?

I have also considered rados-gw, but it looks like too big a hammer
for my nail :-)

Thanks for your suggestions,

-Yenya

-- 
| Jan "Yenya" Kasprzak  |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
Assuming that OpenSSL is written as carefully as Wietse's own code,
every 1000 lines introduce one additional bug into Postfix."   --TLS_README
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com