Re: [ceph-users] What is the meaning of size and min_size for erasure-coded pools?

2018-05-09 Thread Maciej Puzio
I still don't understand why I get any clean PGs in the erasure-coded
pool, when with two OSDs down there is no more redundancy, and
therefore all PGs should be undersized (or so I think).
I repeated the experiment by bringing the two downed OSDs back online
and then killing them again, and got results similar to the previous
test. But this time I observed the process more closely.

Example showing state changes for one of the PGs [OSD assignment in brackets]:
When all OSDs were online: [3,2,4,0,1], state: active+clean
Initially after OSDs 3 and 4 were killed: [x,2,x,0,1], state:
active+undersized+degraded
After some time (OSDs 3 and 4 still offline): [0,2,0,0,1], state: active+clean
('x' means that some large number was listed; I assume this means the
original OSD is unavailable)

Another PG:
When all OSDs were online: [0,3,2,1,4], state: active+clean
Initially after OSDs 3 and 4 were killed: [0,x,2,1,x], state:
active+undersized+degraded
After some time (OSDs 3 and 4 still offline): [0,1,2,1,1], state:
active+clean+remapped
Note: This PG became remapped, the previous one did not.

Does this mean that these PGs now have 5 chunks, of which 3 are stored
on one OSD?
Perhaps I am missing something, but could this arrangement be
redundant? And how can a non-redundant state be considered clean?
By the way, I am using crush-failure-domain=host, and I have one OSD per host.
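In case it is useful, this is roughly how I have been checking individual
PGs and the pool's EC profile (pool name, PG id and profile name below are
placeholders):

ceph pg map 2.1f                                    # up and acting OSD sets for one PG
ceph pg 2.1f query                                  # full peering detail for that PG
ceph osd pool get cephfs_data erasure_code_profile
ceph osd erasure-code-profile get myprofile         # shows k, m and crush-failure-domain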

On the good side, I have no complaints about how the replicated
metadata pool operates. Unfortunately, I will not be able to replicate
data in my future production cluster.
One more thing, I figured out that "degraded" means "undersized and
contains data".

Thanks

Maciej Puzio


On Wed, May 9, 2018 at 7:07 PM, Gregory Farnum  wrote:
> On Wed, May 9, 2018 at 4:37 PM, Maciej Puzio  wrote:
>> My setup consists of two pools on 5 OSDs, and is intended for cephfs:
>> 1. erasure-coded data pool: k=3, m=2, size=5, min_size=3 (originally
>> 4), number of PGs=128
>> 2. replicated metadata pool: size=3, min_size=2, number of PGs=100
>>
>> When all OSDs were online, all PGs from both pools had the status
>> active+clean. After killing two of five OSDs (and changing min_size to
>> 3), all metadata pool PGs remained active+clean, and of 128 data pool
>> PGs, 3 remained active+clean, 11 became active+clean+remapped, and the
>> rest became active+undersized, active+undersized+remapped,
>> active+undersized+degraded or active+undersized+degraded+remapped,
>> seemingly at random.
>>
>> After some time one of the remaining three OSD nodes lost network
>> connectivity (due to a ceph-unrelated bug in virtio_net; this toy setup
>> sure is becoming a bug motherlode!). The node was rebooted, the ceph
>> cluster became accessible again (with 3 out of 5 OSDs online, as
>> before), and the three active+clean data pool PGs now became
>> active+clean+remapped, while the rest of the PGs seem to have kept
>> their previous status.
>
> That collection makes sense if you have a replicated pool as well as
> an EC one, then. They represent different states for the PG; see
> http://docs.ceph.com/docs/jewel/rados/operations/pg-states/. They're
> not random: which PG ends up in which set of states is determined by
> how CRUSH placement and the failures interact, and CRUSH is a
> pseudo-random algorithm, so... ;)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Public network faster than cluster network

2018-05-09 Thread Christian Balzer
On Wed, 09 May 2018 20:46:19 + Gandalf Corvotempesta wrote:

> As per the subject, what would happen?
>

This cosmic imbalance would clearly lead to the end of the universe. 

Seriously, think it through: what do you _think_ will happen?

Depending on the actual networks in question, and on the configuration and
capabilities of your servers/OSDs and your clients, the answer for a
typical use case may be "nothing much".

Though a slower cluster network usually implies higher latencies and
_that_ part would hurt you more for smallish IOPS than any bandwidth
differences.

Again, you need to be more specific to get useful answers here.
For example, take a cluster with nodes that have 8 HDDs, a 10Gb/s
cluster network and a 25Gb/s public network. A single node can't even
saturate the cluster network, so the "nothing much" answer applies.

If you change the speeds to 1Gb/s and 10Gb/s respectively, you're now
limited to the roughly 100MB/s speed of the cluster network during large
writes and recovery/rebalancing traffic.
But with a typical VMs-on-RBD use case you're unlikely to run into that
limitation and will be hobbled more by the significantly larger latencies
introduced by the 1Gb/s links.
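Back-of-envelope numbers for the examples above, assuming roughly 150 MB/s
of sequential throughput per HDD:

echo $(( 8 * 150 ))     # ~1200 MB/s aggregate from 8 HDDs in one node
echo $(( 10000 / 8 ))   # a 10Gb/s link is ~1250 MB/s, so one node barely fits
echo $(( 1000 / 8 ))    # a 1Gb/s link is ~125 MB/s, a single HDD can saturate it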

Lastly, more often than not segregated networks are not needed; they add
unnecessary complexity, and the resources spent on them would be better
used on just one fast and redundant network instead.

Christian
-- 
Christian Balzer                Network/Systems Engineer
ch...@gol.com   Rakuten Communications
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What is the meaning of size and min_size for erasure-coded pools?

2018-05-09 Thread Gregory Farnum
On Wed, May 9, 2018 at 4:37 PM, Maciej Puzio  wrote:
> My setup consists of two pools on 5 OSDs, and is intended for cephfs:
> 1. erasure-coded data pool: k=3, m=2, size=5, min_size=3 (originally
> 4), number of PGs=128
> 2. replicated metadata pool: size=3, min_size=2, number of PGs=100
>
> When all OSDs were online, all PGs from both pools had the status
> active+clean. After killing two of five OSDs (and changing min_size to
> 3), all metadata pool PGs remained active+clean, and of 128 data pool
> PGs, 3 remained active+clean, 11 became active+clean+remapped, and the
> rest became active+undersized, active+undersized+remapped,
> active+undersized+degraded or active+undersized+degraded+remapped,
> seemingly at random.
>
> After some time one of the remaining three OSD nodes lost network
> connectivity (due to a ceph-unrelated bug in virtio_net; this toy setup
> sure is becoming a bug motherlode!). The node was rebooted, the ceph
> cluster became accessible again (with 3 out of 5 OSDs online, as
> before), and the three active+clean data pool PGs now became
> active+clean+remapped, while the rest of the PGs seem to have kept
> their previous status.

That collection makes sense if you have a replicated pool as well as
an EC one, then. They represent different states for the PG; see
http://docs.ceph.com/docs/jewel/rados/operations/pg-states/. They're
not random: which PG ends up in which set of states is determined by
how CRUSH placement and the failures interact, and CRUSH is a
pseudo-random algorithm, so... ;)
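If it helps to see the spread at a glance, something like this summarises
how many PGs are in each combination of states (the state column position
may differ slightly between versions):

ceph pg dump pgs_brief 2>/dev/null | awk 'NR > 1 { print $2 }' | sort | uniq -c | sort -rn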
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What is the meaning of size and min_size for erasure-coded pools?

2018-05-09 Thread Maciej Puzio
My setup consists of two pools on 5 OSDs, and is intended for cephfs:
1. erasure-coded data pool: k=3, m=2, size=5, min_size=3 (originally
4), number of PGs=128
2. replicated metadata pool: size=3, min_size=2, number of PGs=100

When all OSDs were online, all PGs from both pools had the status
active+clean. After killing two of five OSDs (and changing min_size to
3), all metadata pool PGs remained active+clean, and of 128 data pool
PGs, 3 remained active+clean, 11 became active+clean+remapped, and the
rest became active+undersized, active+undersized+remapped,
active+undersized+degraded or active+undersized+degraded+remapped,
seemingly at random.

After some time one of the remaining three OSD nodes lost network
connectivity (due to a ceph-unrelated bug in virtio_net; this toy setup
sure is becoming a bug motherlode!). The node was rebooted, the ceph
cluster became accessible again (with 3 out of 5 OSDs online, as
before), and the three active+clean data pool PGs now became
active+clean+remapped, while the rest of the PGs seem to have kept
their previous status.

Thanks

Maciej Puzio


On Wed, May 9, 2018 at 4:49 PM, Gregory Farnum  wrote:
> active+clean does not make a lot of sense if every PG really was 3+2. But
> perhaps you had a 3x replicated pool or something hanging out as well from
> your deployment tool?
> The active+clean+remapped means that a PG was somehow lucky enough to have
> an existing "stray" copy on one of the OSDs that it has decided to use to
> bring it back up to the right number of copies, even though they certainly
> won't match the proper failure domains.
> The min_size in relation to the k+m values won't have any direct impact
> here, although they might indirectly affect it by changing how quickly stray
> PGs get deleted.
> -Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What is the meaning of size and min_size for erasure-coded pools?

2018-05-09 Thread Gregory Farnum
On Tue, May 8, 2018 at 2:16 PM Maciej Puzio  wrote:

> Thank you everyone for your replies. However, I feel that at least
> part of the discussion deviated from the topic of my original post. As
> I wrote before, I am dealing with a toy cluster, whose purpose is not
> to provide a resilient storage, but to evaluate ceph and its behavior
> in the event of a failure, with particular attention paid to
> worst-case scenarios. This cluster is purposely minimal, and is built
> on VMs running on my workstation, all OSDs storing data on a single
> SSD. That's definitely not a production system.
>
> I am not asking for advice on how to build resilient clusters, not at
> this point. I asked some questions about specific things that I
> noticed during my tests, and that I was not able to find explained in
> ceph documentation. Dan van der Ster wrote:
> > See https://github.com/ceph/ceph/pull/8008 for the reason why min_size
> defaults to k+1 on ec pools.
> That's a good point, but I am wondering why reads are also blocked
> when the number of OSDs falls down to k? And what if the total number
> of OSDs in a pool (n) is larger than k+m, should min_size then be
> k(+1) or n-m(+1)?
> In any case, since min_size can be easily changed, I guess this is
> not an implementation issue, but rather a documentation issue.
>
> Which leaves these questions of mine still unanswered:
> After killing m OSDs and setting min_size=k, most of the PGs were now
> active+undersized, often with ...+degraded and/or remapped, but a few
> were active+clean or active+clean+remapped. Why? I would expect all
> PGs to be in the same state (perhaps active+undersized+degraded?).
> Is this mishmash of PG states normal? If not, would I have avoided it
> if I created the pool with min_size=k=3 from the start? In other
> words, does min_size influence the assignment of PGs to OSDs? Or is it
> only used to force I/O shutdown in the event of OSD failures?
>

active+clean does not make a lot of sense if every PG really was 3+2. But
perhaps you had a 3x replicated pool or something hanging out as well from
your deployment tool?
The active+clean+remapped means that a PG was somehow lucky enough to have
an existing "stray" copy on one of the OSDs that it has decided to use to
bring it back up to the right number of copies, even though they certainly
won't match the proper failure domains.
The min_size in relation to the k+m values won't have any direct impact
here, although they might indirectly affect it by changing how quickly
stray PGs get deleted.
-Greg


>
> Thank you very much
>
> Maciej Puzio
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Inconsistent PG automatically got "repaired" automatically?

2018-05-09 Thread Gregory Farnum
On Wed, May 9, 2018 at 8:21 AM Nikos Kormpakis  wrote:

> Hello,
>
> we operate a Ceph cluster running v12.2.4, on top of Debian Stretch,
> deployed
> with ceph-volume lvm with a default crushmap and a quite vanilla
> ceph.conf. OSDs
> live on single disks in JBOD mode, with a separate block.db LV on a
> shared SSD.
> We have a single pool (min_size=2, size=3) on this cluster used for RBD
> volumes.
>
> During deep-scrubbing, we ran across an inconsistent PG. Specifically:
>
> ```
> root@rd2-0427:~# rados list-inconsistent-obj 10.193c -f json-pretty
> {
>  "epoch": 10218,
>  "inconsistents": [
>  {
>  "object": {
>  "name": "rbd_data.a0a7c12ae8944a.8a89",
>  "nspace": "",
>  "locator": "",
>  "snap": "head",
>  "version": 8068
>  },
>  "errors": [],
>  "union_shard_errors": [
>  "read_error"
>  ],
>  "selected_object_info":
> "10:3c9c88a6:::rbd_data.a0a7c12ae8944a.8a89:head(11523'8068
> client.10490700.0:168994 dirty|omap_digest s 4194304 uv 8068 od 
> alloc_hint [4194304 4194304 0])",
>  "shards": [
>  {
>  "osd": 64,
>  "primary": false,
>  "errors": [],
>  "size": 4194304,
>  "omap_digest": "0x",
>  "data_digest": "0xfd46c017"
>  },
>  {
>  "osd": 84,
>  "primary": true,
>  "errors": [
>  "read_error"
>  ],
>  "size": 4194304
>  },
>  {
>  "osd": 122,
>  "primary": false,
>  "errors": [],
>  "size": 4194304,
>  "omap_digest": "0x",
>  "data_digest": "0xfd46c017"
>  }
>  ]
>  }
>  ]
> }
> ```
>
> This behavior is quite common. I/O errors happen often due to faulty
> disks,
> which we can confirm in this case:
>
> ```
> Apr 25 10:45:56 rd2-0530 kernel: [486531.949072] sd 1:0:6:0: [sdm] tag#1
> FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> Apr 25 10:45:56 rd2-0530 kernel: [486531.964262] sd 1:0:6:0: [sdm] tag#1
> Sense Key : Medium Error [current]
> Apr 25 10:45:56 rd2-0530 kernel: [486531.977582] sd 1:0:6:0: [sdm] tag#1
> Add. Sense: Read retries exhausted
> Apr 25 10:45:56 rd2-0530 kernel: [486531.990798] sd 1:0:6:0: [sdm] tag#1
> CDB: Read(16) 88 00 00 00 00 00 0c b1 be 80 00 00 02 00 00 00
> Apr 25 10:45:56 rd2-0530 kernel: [486532.006759] blk_update_request:
> critical medium error, dev sdm, sector 212975501
> Apr 25 10:45:56 rd2-0530 ceph-osd[6537]: 2018-04-25 10:45:56.857351
> 7f396a798700 -1 bdev(0x561b7fd14d80 /var/lib/ceph/osd/ceph-84/block)
> _aio_thread got (5) Input/output error
> Apr 25 10:45:58 rd2-0530 ceph-osd[6537]: 2018-04-25 10:45:58.730731
> 7f395af79700 -1 log_channel(cluster) log [ERR] : 10.193c shard 84: soid
> 10:3c9c88a6:::rbd_data.a0a7c12ae8944a.8a89:head candidate
> had a read error
> Apr 25 10:45:59 rd2-0530 ceph-osd[6537]: 2018-04-25 10:45:59.709004
> 7f395af79700 -1 log_channel(cluster) log [ERR] : 10.193c deep-scrub 0
> missing, 1 inconsistent objects
> Apr 25 10:45:59 rd2-0530 ceph-osd[6537]: 2018-04-25 10:45:59.709013
> 7f395af79700 -1 log_channel(cluster) log [ERR] : 10.193c deep-scrub 1
> errors
> ```
>
> Also, osd.84, the one with the faulty shard, appears to be the
> up_primary+acting_primary osd of this pg (10.193c).
>
> Before issuing `ceph pg repair` or setting the problematic osd out
> (osd.84),
> we wanted to observe how Ceph behaves when trying to read an object with
> a
> problematic primary shard. So, we issued from a cluster node:
>
> ```
> # rados -p rbd_vima get rbd_data.a0a7c12ae8944a.8a89
> foobartest
> ```
>
> The command completed successfully after 30-35 seconds and wrote the
> object
> correctly to the destination file. During these 30 seconds we observed
> the
> following:
>
> ```
> 2018-05-02 16:31:37.934812 mon.rd2-0427 mon.0
> [2001:648:2ffa:198::68]:6789/0 247500 : cluster [WRN] Health check
> update: 188 slow requests are blocked > 32 sec (REQUEST_SLOW)
> 2018-05-02 16:31:39.772436 mon.rd2-0427 mon.0
> [2001:648:2ffa:198::68]:6789/0 247501 : cluster [WRN] Health check
> failed: Degraded data redundancy: 1/3227247 objects degraded (0.000%), 1
> pg degraded (PG_DEGRADED)
> 2018-05-02 16:31:42.934839 mon.rd2-0427 mon.0
> [2001:648:2ffa:198::68]:6789/0 247504 : cluster [WRN] Health check
> update: 198 slow requests are blocked > 32 sec (REQUEST_SLOW)
> 2018-05-02 16:31:42.934893 mon.rd2-0427 mon.0
> [2001:648:2ffa:198::68]:6789/0 247505 : cluster [INF] Health check
> cleared: 

[ceph-users] Public network faster than cluster network

2018-05-09 Thread Gandalf Corvotempesta
As per the subject, what would happen?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph RBD trim performance

2018-05-09 Thread Jason Dillaman
It's not a zero-cost operation since it involves librbd issuing 1 to N
delete or truncate ops to the OSDs. The OSD will treat those ops just
like any other read or write op (i.e. they don't take lower priority).
Additionally, any discard smaller than the stripe period of the RBD
image will potentially result in "write-zero" operations if the extent
happens to be in the middle of a backing object.
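If you want to see the granularity involved for a given image, the object
size and any striping parameters are visible in its metadata (pool/image
name below is a placeholder):

rbd info libvirt-pool/guest-disk    # check "order"/object size, plus stripe_unit/stripe_count if set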

On Wed, May 9, 2018 at 10:25 AM, Andre Goree  wrote:
> Is there anything I should be concerned with performance-wise when having my
> libvirt-based guest systems run trim on their attached RBD devices?  I'd
> imagine it all happens in the background (on the Ceph cluster) with minimal
> performance hit, given the fact that normal removal/deletion of files
> usually consists in an asynchronous operation, but I thought I'd ask to be
> sure.
>
> --
> Andre Goree
> -=-=-=-=-=-
> Email - andre at drenet.net
> Website   - http://blog.drenet.net
> PGP key   - http://www.drenet.net/pubkey.html
> -=-=-=-=-=-
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Question: CephFS + Bluestore

2018-05-09 Thread Webert de Souza Lima
Hey Jon!

On Wed, May 9, 2018 at 12:11 PM, John Spray  wrote:

> It depends on the metadata intensity of your workload.  It might be
> quite interesting to gather some drive stats on how many IOPS are
> currently hitting your metadata pool over a week of normal activity.
>

Any ceph built-in tool for this? Maybe ceph daemonperf (although I'm not
sure what I should be looking at).
My current SSD disks have 2 partitions:
 - One is used for the cephfs cache tier pool,
 - The other is used for both the cephfs metadata pool and cephfs data-ssd
(an additional cephfs data pool on SSDs only, with a file layout pointing
a specific directory at it)

Because of this, iostat shows me peaks of 12k IOPS in the metadata
partition, but this could definitely be IO for the data-ssd pool.
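Maybe something like the following would already separate that out,
assuming the metadata pool here is called cephfs_metadata:

ceph osd pool stats cephfs_metadata     # point-in-time client op/s and throughput for that pool
while sleep 60; do date; ceph osd pool stats cephfs_metadata; done >> meta-iops.log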


> If you are doing large file workloads, and the metadata mostly fits in
> RAM, then the number of IOPS from the MDS can be very, very low.  On
> the other hand, if you're doing random metadata reads from a small
> file workload where the metadata does not fit in RAM, almost every
> client read could generate a read operation, and each MDS could easily
> generate thousands of ops per second.
>

I have yet to measure it the right way, but I'd assume my metadata fits in
RAM (only a few hundred MB).

This is an email hosting cluster with tens of thousands of users, so there
are a lot of random reads and writes, but not too many small files.
Email messages are concatenated together into files of up to 4MB in size
(when a rotation happens).
Most user operations are dovecot's INDEX operations, and I will keep the
index directory in an SSD-dedicated pool.



> Isolating metadata OSDs is useful if the data OSDs are going to be
> completely saturated: metadata performance will be protected even if
> clients are hitting the data OSDs hard.
>

This seems to be the case.


> If "heavy write" means completely saturating the cluster, then sharing
> the OSDs is risky.  If "heavy write" just means that there are more
> writes than reads, then it may be fine if the metadata workload is not
> heavy enough to make good use of SSDs.
>

Saturation will only happen during peak workloads, not often. By heavy
write I mean there are many more writes than reads, yes.
So I think I can start by sharing the OSDs, and if I find this is impacting
performance I can just change the ruleset and move metadata to an SSD-only
pool, right?
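For reference, my understanding is that with Luminous device classes that
move would be something like this (rule and pool names are only examples,
and the metadata objects would rebalance onto the SSD OSDs once the rule
changes):

ceph osd crush rule create-replicated meta-ssd default host ssd
ceph osd pool set cephfs_metadata crush_rule meta-ssd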


> The way I'd summarise this is: in the general case, dedicated SSDs are
> the safe way to go -- they're intrinsically better suited to metadata.
> However, in some quite common special cases, the overall number of
> metadata ops is so low that the device doesn't matter.



Thank you very much John!
Webert Lima
DevOps Engineer at MAV Tecnologia
Belo Horizonte - Brasil
IRC NICK - WebertRLZ
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] fstrim issue in VM for cloned rbd image with fast-diff feature

2018-05-09 Thread Jason Dillaman
On Wed, May 9, 2018 at 9:13 AM, Youzhong Yang  wrote:
> Thanks Jason.
>
> Yes, my concern is that fstrim increases clones' disk usage. The VM didn't
> use any additional space but fstrim caused its disk usage (in ceph) to go up
> significantly. Imagine when there are hundreds of VMs; it would soon cause
> a space issue.
>
> If this is expected behavior, does it mean it's better to disable fast-diff
> from rbd image? I am fine with that.

Since fast-diff can only provide an estimate, that is the current
expected behavior for cases like this. It might be nice to add an
option to "rbd disk-usage" to disable the fast-diff computation and
perform the potentially painful per-object space calculation if desired.
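In the meantime, a rough workaround for getting exact numbers on the clone
from your example (this assumes fast-diff can be disabled on its own, and
the per-object scan can be slow on large images):

rbd feature disable vms/debian93.dsk fast-diff
rbd du vms/debian93.dsk        # falls back to the slow but exact per-object calculation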

> There is an ugly part of this discard/fstrim feature. At one time while I
> was doing dd + rm file + fstrim repeatedly, it rendered my VM root file
> system corrupted. Sadly I couldn't reproduce it again.
>
> Thanks.
>
> On Wed, May 9, 2018 at 11:52 AM, Jason Dillaman  wrote:
>>
>> On Wed, May 9, 2018 at 11:39 AM, Youzhong Yang  wrote:
>> > This is what I did:
>> >
>> > # rbd import /var/tmp/debian93-raw.img images/debian93
>> > # rbd info images/debian93
>> > rbd image 'debian93':
>> >  size 81920 MB in 20480 objects
>> >  order 22 (4096 kB objects)
>> >  block_name_prefix: rbd_data.384b74b0dc51
>> >  format: 2
>> >  features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
>> >  flags:
>> >  create_timestamp: Wed May  9 09:31:24 2018
>> > # rbd snap create images/debian93@snap
>> > # rbd snap protect images/debian93@snap
>> > # rbd clone images/debian93@snap vms/debian93.dsk
>> > # rbd du vms/debian93.dsk
>> > NAME PROVISIONED USED
>> > debian93.dsk  81920M 336M
>> >
>> > --- Inside the VM ---
>> > # df -h /
>> > Filesystem  Size  Used Avail Use% Mounted on
>> > /dev/sda179G   10G   66G  14% /
>> > # fstrim -v /
>> > /: 36.6 GiB (39311650816 bytes) trimmed
>> >
>> > --- then rbd du reports ---
>> > # rbd du vms/debian93.dsk
>> > NAME PROVISIONED   USED
>> > debian93.dsk  81920M 76028M
>> >
>> > === If I disable fast-diff feature from images/debian93: ===
>> > # fstrim -v /
>> > /: 41 GiB (44059172864 bytes) trimmed
>> >
>> > # rbd du vms/debian93.dsk
>> > warning: fast-diff map is not enabled for debian93.dsk. operation may be
>> > slow.
>> > NAME PROVISIONED  USED
>> > debian93.dsk  81920M 8612M
>> >
>> > === or just flatten vms/debian93.dsk without disabling fast-diff ===
>> > # rbd du vms/debian93.dsk
>> > NAME PROVISIONED   USED
>> > debian93.dsk  81920M 11992M
>> >
>> > # fstrim -v /
>> > /: 68.7 GiB (73710755840 bytes) trimmed
>> >
>> > # rbd du vms/debian93.dsk
>> > NAME PROVISIONED   USED
>> > debian93.dsk  81920M 12000M
>> >
>> > Testing environment:
>> > Ceph: v12.2.5
>> > OS: Ubuntu 18.04
>> > QEMU: 2.11
>> > libvirt: 4.0.0
>> >
>> > Is this a known issue? or is the above behavior expected?
>>
>> What's your concern? I didn't see you state any potential problem. Are
>> you just concerned that "fstrim" appears to increase your clone's disk
>> usage? If that's the case, it's expected since "fast-diff" only tracks
>> the existence of objects (not the per-object usage) and since it's a
>> cloned image, a discard op results in the creation of a zero-byte
>> object to "hide" the associated extent within the parent image.
>>
>> > Thanks,
>> >
>> > --Youzhong
>> >
>> >
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >
>>
>>
>>
>> --
>> Jason
>
>



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] fstrim issue in VM for cloned rbd image with fast-diff feature

2018-05-09 Thread Youzhong Yang
Thanks Jason.

Yes, my concern is that fstrim increases clones' disk usage. The VM didn't
use any additional space but fstrim caused its disk usage (in ceph) to go
up significantly. Imagine when there are hundreds of VMs; it would soon
cause a space issue.

If this is expected behavior, does it mean it's better to disable fast-diff
from rbd image? I am fine with that.

There is an ugly part of this discard/fstrim feature. At one time while I
was doing dd + rm file + fstrim repeatedly, it rendered my VM root file
system corrupted. Sadly I couldn't reproduce it again.

Thanks.

On Wed, May 9, 2018 at 11:52 AM, Jason Dillaman  wrote:

> On Wed, May 9, 2018 at 11:39 AM, Youzhong Yang  wrote:
> > This is what I did:
> >
> > # rbd import /var/tmp/debian93-raw.img images/debian93
> > # rbd info images/debian93
> > rbd image 'debian93':
> >  size 81920 MB in 20480 objects
> >  order 22 (4096 kB objects)
> >  block_name_prefix: rbd_data.384b74b0dc51
> >  format: 2
> >  features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
> >  flags:
> >  create_timestamp: Wed May  9 09:31:24 2018
> > # rbd snap create images/debian93@snap
> > # rbd snap protect images/debian93@snap
> > # rbd clone images/debian93@snap vms/debian93.dsk
> > # rbd du vms/debian93.dsk
> > NAME PROVISIONED USED
> > debian93.dsk  81920M 336M
> >
> > --- Inside the VM ---
> > # df -h /
> > Filesystem  Size  Used Avail Use% Mounted on
> > /dev/sda179G   10G   66G  14% /
> > # fstrim -v /
> > /: 36.6 GiB (39311650816 bytes) trimmed
> >
> > --- then rbd du reports ---
> > # rbd du vms/debian93.dsk
> > NAME PROVISIONED   USED
> > debian93.dsk  81920M 76028M
> >
> > === If I disable fast-diff feature from images/debian93: ===
> > # fstrim -v /
> > /: 41 GiB (44059172864 bytes) trimmed
> >
> > # rbd du vms/debian93.dsk
> > warning: fast-diff map is not enabled for debian93.dsk. operation may be
> > slow.
> > NAME PROVISIONED  USED
> > debian93.dsk  81920M 8612M
> >
> > === or just flatten vms/debian93.dsk without disabling fast-diff ===
> > # rbd du vms/debian93.dsk
> > NAME PROVISIONED   USED
> > debian93.dsk  81920M 11992M
> >
> > # fstrim -v /
> > /: 68.7 GiB (73710755840 bytes) trimmed
> >
> > # rbd du vms/debian93.dsk
> > NAME PROVISIONED   USED
> > debian93.dsk  81920M 12000M
> >
> > Testing environment:
> > Ceph: v12.2.5
> > OS: Ubuntu 18.04
> > QEMU: 2.11
> > libvirt: 4.0.0
> >
> > Is this a known issue? or is the above behavior expected?
>
> What's your concern? I didn't see you state any potential problem. Are
> you just concerned that "fstrim" appears to increase your clone's disk
> usage? If that's the case, it's expected since "fast-diff" only tracks
> the existence of objects (not the per-object usage) and since it's a
> cloned image, a discard op results in the creation of a zero-byte
> object to "hide" the associated extent within the parent image.
>
> > Thanks,
> >
> > --Youzhong
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
>
>
> --
> Jason
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Inconsistent PG automatically got "repaired" automatically?

2018-05-09 Thread Nikos Kormpakis

Hello,

we operate a Ceph cluster running v12.2.4, on top of Debian Stretch, 
deployed
with ceph-volume lvm with a default crushmap and a quite vanilla 
ceph.conf. OSDs
live on single disks in JBOD mode, with a separate block.db LV on a 
shared SSD.
We have a single pool (min_size=2, size=3) on this cluster used for RBD 
volumes.


During deep-scrubbing, we ran across an inconsistent PG. Specifically:

```
root@rd2-0427:~# rados list-inconsistent-obj 10.193c -f json-pretty
{
    "epoch": 10218,
    "inconsistents": [
        {
            "object": {
                "name": "rbd_data.a0a7c12ae8944a.8a89",
                "nspace": "",
                "locator": "",
                "snap": "head",
                "version": 8068
            },
            "errors": [],
            "union_shard_errors": [
                "read_error"
            ],
            "selected_object_info": "10:3c9c88a6:::rbd_data.a0a7c12ae8944a.8a89:head(11523'8068 client.10490700.0:168994 dirty|omap_digest s 4194304 uv 8068 od  alloc_hint [4194304 4194304 0])",
            "shards": [
                {
                    "osd": 64,
                    "primary": false,
                    "errors": [],
                    "size": 4194304,
                    "omap_digest": "0x",
                    "data_digest": "0xfd46c017"
                },
                {
                    "osd": 84,
                    "primary": true,
                    "errors": [
                        "read_error"
                    ],
                    "size": 4194304
                },
                {
                    "osd": 122,
                    "primary": false,
                    "errors": [],
                    "size": 4194304,
                    "omap_digest": "0x",
                    "data_digest": "0xfd46c017"
                }
            ]
        }
    ]
}
```

This behavior is quite common. I/O errors happen often due to faulty
disks, which we can confirm in this case:

```
Apr 25 10:45:56 rd2-0530 kernel: [486531.949072] sd 1:0:6:0: [sdm] tag#1 
FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Apr 25 10:45:56 rd2-0530 kernel: [486531.964262] sd 1:0:6:0: [sdm] tag#1 
Sense Key : Medium Error [current]
Apr 25 10:45:56 rd2-0530 kernel: [486531.977582] sd 1:0:6:0: [sdm] tag#1 
Add. Sense: Read retries exhausted
Apr 25 10:45:56 rd2-0530 kernel: [486531.990798] sd 1:0:6:0: [sdm] tag#1 
CDB: Read(16) 88 00 00 00 00 00 0c b1 be 80 00 00 02 00 00 00
Apr 25 10:45:56 rd2-0530 kernel: [486532.006759] blk_update_request: 
critical medium error, dev sdm, sector 212975501
Apr 25 10:45:56 rd2-0530 ceph-osd[6537]: 2018-04-25 10:45:56.857351 
7f396a798700 -1 bdev(0x561b7fd14d80 /var/lib/ceph/osd/ceph-84/block) 
_aio_thread got (5) Input/output error
Apr 25 10:45:58 rd2-0530 ceph-osd[6537]: 2018-04-25 10:45:58.730731 
7f395af79700 -1 log_channel(cluster) log [ERR] : 10.193c shard 84: soid 
10:3c9c88a6:::rbd_data.a0a7c12ae8944a.8a89:head candidate 
had a read error
Apr 25 10:45:59 rd2-0530 ceph-osd[6537]: 2018-04-25 10:45:59.709004 
7f395af79700 -1 log_channel(cluster) log [ERR] : 10.193c deep-scrub 0 
missing, 1 inconsistent objects
Apr 25 10:45:59 rd2-0530 ceph-osd[6537]: 2018-04-25 10:45:59.709013 
7f395af79700 -1 log_channel(cluster) log [ERR] : 10.193c deep-scrub 1 
errors

```

Also, osd.84, the one with the faulty shard, appears to be the
up_primary+acting_primary osd of this pg (10.193c).
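One way to double-check that mapping (PG id as above; the output includes
the up and acting OSD sets):

```
# ceph pg map 10.193c
```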

Before issuing `ceph pg repair` or setting the problematic osd out (osd.84),
we wanted to observe how Ceph behaves when trying to read an object with a
problematic primary shard. So, we issued from a cluster node:

```
# rados -p rbd_vima get rbd_data.a0a7c12ae8944a.8a89 foobartest
```

The command completed successfully after 30-35 seconds and wrote the object
correctly to the destination file. During these 30 seconds we observed the
following:

```
2018-05-02 16:31:37.934812 mon.rd2-0427 mon.0 
[2001:648:2ffa:198::68]:6789/0 247500 : cluster [WRN] Health check 
update: 188 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-05-02 16:31:39.772436 mon.rd2-0427 mon.0 
[2001:648:2ffa:198::68]:6789/0 247501 : cluster [WRN] Health check 
failed: Degraded data redundancy: 1/3227247 objects degraded (0.000%), 1 
pg degraded (PG_DEGRADED)
2018-05-02 16:31:42.934839 mon.rd2-0427 mon.0 
[2001:648:2ffa:198::68]:6789/0 247504 : cluster [WRN] Health check 
update: 198 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-05-02 16:31:42.934893 mon.rd2-0427 mon.0 
[2001:648:2ffa:198::68]:6789/0 247505 : cluster [INF] Health check 
cleared: PG_DEGRADED (was: Degraded data redundancy: 1/3227247 objects 
degraded (0.000%), 1 pg degraded)
2018-05-02 16:31:37.384611 osd.84 osd.84 
[2001:648:2ffa:198::77]:6834/1031919 462 : cluster [ERR] 10.193c missing 
primary copy of 
10:3c9c88a6:::rbd_data.a0a7c12ae8944a.8a89:head, will try 
copies on 64,122
2018-05-02 

Re: [ceph-users] Question: CephFS + Bluestore

2018-05-09 Thread John Spray
On Wed, May 9, 2018 at 3:32 PM, Webert de Souza Lima
 wrote:
> Hello,
>
> Currently, I run Jewel + Filestore for cephfs, with SSD-only pools used for
> cephfs-metadata, and HDD-only pools for cephfs-data. The current
> metadata/data ratio is something like 0,25% (50GB metadata for 20TB data).
>
> Regarding bluestore architecture, assuming I have:
>
>  - SSDs for WAL+DB
>  - Spinning Disks for bluestore data.
>
> would you recommend still store metadata in SSD-Only OSD nodes?

It depends on the metadata intensity of your workload.  It might be
quite interesting to gather some drive stats on how many IOPS are
currently hitting your metadata pool over a week of normal activity.

The primary reason for using SSDs for metadata is the cost-per-IOP.
SSDs are generally cheaper per operation than HDDs, so if you've got
enough IOPS to occupy an SSD then it's a no-brainer cost saving to use
SSDs (performance benefits are just a bonus).

If you are doing large file workloads, and the metadata mostly fits in
RAM, then the number of IOPS from the MDS can be very, very low.  On
the other hand, if you're doing random metadata reads from a small
file workload where the metadata does not fit in RAM, almost every
client read could generate a read operation, and each MDS could easily
generate thousands of ops per second.

> If not, is it recommended to dedicate some OSDs (Spindle+SSD for WAL/DB) for
> cephfs-metadata?

Isolating metadata OSDs is useful if the data OSDs are going to be
completely saturated: metadata performance will be protected even if
clients are hitting the data OSDs hard.

However, if your OSDs outnumber clients such that the clients couldn't
possibly saturate the OSDs, then you don't have this issue.

> If I just have 2 pools (metadata and data) all sharing the same OSDs in the
> cluster, would it be enough for heavy-write cases?

If "heavy write" means completely saturating the cluster, then sharing
the OSDs is risky.  If "heavy write" just means that there are more
writes than reads, then it may be fine if the metadata workload is not
heavy enough to make good use of SSDs.

The way I'd summarise this is: in the general case, dedicated SSDs are
the safe way to go -- they're intrinsically better suited to metadata.
However, in some quite common special cases, the overall number of
metadata ops is so low that the device doesn't matter.

John

> Assuming min_size=2, size=3.
>
> Thanks for your thoughts.
>
> Regards,
>
> Webert Lima
> DevOps Engineer at MAV Tecnologia
> Belo Horizonte - Brasil
> IRC NICK - WebertRLZ
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to configure s3 bucket acl so that one user's bucket is visible to another.

2018-05-09 Thread Sean Purdy
The other way to do it is with policies.

e.g. a bucket owned by user1, but read access granted to user2:

{ 
  "Version":"2012-10-17",
  "Statement":[
{
  "Sid":"user2 policy",
  "Effect":"Allow",
  "Principal": {"AWS": ["arn:aws:iam:::user/user2"]},
  "Action":["s3:GetObject","s3:ListBucket"],
  "Resource":[
"arn:aws:s3:::example1/*",
"arn:aws:s3:::example1"
  ]
}
  ]
}

And set the policy with:
$ s3cmd setpolicy policy.json s3://example1/
or similar.

user2 won't see the bucket in their list of buckets, but will be able to read 
and list the bucket in this case.
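A quick way to verify from user2's side (config file names as above;
"somefile" is just a placeholder object):

s3cmd -c s3cfg_user2 ls s3://example1/            # listing allowed by s3:ListBucket
s3cmd -c s3cfg_user2 get s3://example1/somefile   # reading allowed by s3:GetObject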

More at 
https://docs.aws.amazon.com/AmazonS3/latest/dev/example-bucket-policies.html


Sean


On Tue,  8 May 2018, David Turner said:
> Sorry I've been on vacation, but I'm back now.  The command I use to create
> subusers for a rgw user is...
> 
> radosgw-admin user create --gen-access-key --gen-secret --uid=user_a
> --display_name="User A"
> radosgw-admin subuser create --gen-access-key --gen-secret
> --access={read,write,readwrite,full} --key-type=s3 --uid=user_a
> --subuser=subuser_1
> 
> Now all buckets created by user_a (or a subuser with --access=full) can
> be accessed by user_a and all user_a:subusers.  What you missed was
> changing the default subuser type from swift to s3.  --access=full is
> needed for any user needed to be able to create and delete buckets, the
> others are fairly self explanatory for what they can do inside of existing
> buckets.
> 
> There are 2 approaches to use with subusers depending on your use case.
> The first use case is what I use for buckets.  We create 1 user per bucket
> and create subusers when necessary.  Most of our buckets are used by a
> single service and that's all the service uses... so they get the keys for
> their bucket and that's it.  Subusers are created just for the single bucket
> that the original user is in charge of.
> 
> The second use case is where you want a lot of buckets accessed by a single
> set of keys, but you want multiple people to all be able to access the
> buckets.  In this case I would create a single user and use that user to
> create all of the buckets and then create the subusers for everyone to be
> able to access the various buckets.  Note that with this method you get no
> finer granularity than, say, subuser_2 having read access to
> every bucket.  You can't pick and choose which buckets a subuser has write
> access to; it's all or none.  That's why I use the first approach and call
> it "juggling" keys because if someone wants access to multiple buckets,
> they have keys for each individual bucket as a subuser.
> 
> On Sat, May 5, 2018 at 6:28 AM Marc Roos  wrote:
> 
> >
> > This 'juggle keys' is a bit cryptic to me. If I create a subuser, it
> > becomes a swift user, doesn't it? So how can that have access to S3 or
> > be used in an S3 client? In the client I have to put in an access and
> > secret key, but in the subuser I only have a secret key.
> >
> > Is this multi-tenant setup basically just limiting this bucket's
> > namespace to the tenant's users and nothing else?
> >
> >
> >
> >
> >
> > -Original Message-
> > From: David Turner [mailto:drakonst...@gmail.com]
> > Sent: zondag 29 april 2018 14:52
> > To: Yehuda Sadeh-Weinraub
> > Cc: ceph-users@lists.ceph.com; Безруков Илья Алексеевич
> > Subject: Re: [ceph-users] How to configure s3 bucket acl so that one
> > user's bucket is visible to another.
> >
> > You can create subuser keys to allow other users to have access to a
> > bucket. You have to juggle keys, but it works pretty well.
> >
> >
> > On Sun, Apr 29, 2018, 4:00 AM Yehuda Sadeh-Weinraub 
> > wrote:
> >
> >
> > You can't. A user can only list the buckets that it owns, it cannot
> > list other users' buckets.
> >
> > Yehuda
> >
> > On Sat, Apr 28, 2018 at 11:10 AM, Безруков Илья Алексеевич
> >  wrote:
> > > Hello,
> > >
> > > How to configure s3 bucket acl so that one user's bucket is
> > > visible to another.
> > >
> > >
> > > I can create a bucket, objects in it and give another user access
> > > to it.
> > > But another user does not see this bucket in the list of available
> > > buckets.
> > >
> > >
> > > ## User1
> > >
> > > ```
> > > s3cmd -c s3cfg_user1 ls s3://
> > >
> > > 2018-04-28 07:50  s3://example1
> > >
> > > #set ACL
> > > s3cmd -c s3cfg_user1 setacl --acl-grant=all:user2 s3://example1
> > > s3://example1/: ACL updated
> > >
> > > # Check
> > > s3cmd -c s3cfg_user1 info s3://example1
> > > s3://example1/ (bucket):
> > >Location:  us-east-1
> > >Payer: BucketOwner
> > >Expiration Rule: none
> > >Policy:none
> > >

Re: [ceph-users] Deleting an rbd image hangs

2018-05-09 Thread David Turner
Yeah, I was about to suggest looking up all currently existing rbd IDs and
snapshot IDs, comparing them to rados ls, and removing the objects that
exist for rbds and snapshots not reported by the cluster.

On Wed, May 9, 2018, 8:38 AM Jason Dillaman  wrote:

> On Tue, May 8, 2018 at 2:31 PM,   wrote:
> > Hello Jason,
> >
> >
> > Am 8. Mai 2018 15:30:34 MESZ schrieb Jason Dillaman  >:
> >>Perhaps the image had associated snapshots? Deleting the object
> >>doesn't delete the associated snapshots so those objects will remain
> >>until the snapshot is removed. However, if you have removed the RBD
> >>header, the snapshot id is now gone.
> >>
> >
> > Hmm... that makes me curious...
> >
> > So when I have a VM image (rbd) on Ceph and take one or more
> > snapshots of this image, I *must* delete the snapshot(s) completely
> > first before I delete the original image?
>
> Yup, the rbd CLI (and related librbd API helpers for removing images)
> will not let you delete an image that has snapshots for this very
> reason.
>
> > How can we then get rid of these orphaned objects when we have
> > accidentally deleted the original image first?
>
> Unfortunately, there isn't any available CLI tooling to let if you
> delete a "self-managed" snapshot even if you could determine the
> correct snapshot ids that are no longer in-use. If you are comfortable
> building a custom C/C++ librados application to clean up your pool,
> you could first generate a list of all known in-use snapshot IDs by
> collecting all the "snapshot_XYZ" keys on all "rbd_header.ABC" objects
> in the pool (where XYZ is the snapshot id in hex and ABC is each
> image's unique id) and cross-referencing them w/ the pool's
> "removed_snaps" output from "ceph osd pool ls detail".
>
> > Thanks if you have a bit of time to clarify this for me/us :)
> >
> > - Mehmet
> >
> >>On Tue, May 8, 2018 at 12:29 AM, Eugen Block  wrote:
> >>> Hi,
> >>>
> >>> I have a similar issue and would also need some advice how to get rid
> >>of the
> >>> already deleted files.
> >>>
> >>> Ceph is our OpenStack backend and there was a nova clone without
> >>parent
> >>> information. Apparently, the base image had been deleted without a
> >>warning
> >>> or anything although there were existing clones.
> >>> Anyway, I tried to delete the respective rbd_data and _header files
> >>as
> >>> described in [1]. There were about 700 objects to be deleted, but 255
> >>> objects remained according to the 'rados -p pool ls' command. The
> >>attempt to
> >>> delete the rest (again) resulted (and still results) in "No such file
> >>or
> >>> directory". After about half an hour later one more object vanished
> >>> (rbd_header file), there are now still 254 objects left in the pool.
> >>First I
> >>> thought maybe Ceph will cleanup itself, it just takes some time, but
> >>this
> >>> was weeks ago and the number of objects has not changed since then.
> >>>
> >>> I would really appreciate any help.
> >>>
> >>> Regards,
> >>> Eugen
> >>>
> >>>
> >>> Zitat von Jan Marquardt :
> >>>
> >>>
>  Am 30.04.18 um 09:26 schrieb Jan Marquardt:
> >
> > Am 27.04.18 um 20:48 schrieb David Turner:
> >>
> >> This old [1] blog post about removing super large RBDs is not
> >>relevant
> >> if you're using object map on the RBDs, however it's method to
> >>manually
> >> delete an RBD is still valid.  You can see if this works for you
> >>to
> >> manually remove the problem RBD you're having.
> >
> >
> > I followed the instructions, but it seems that 'rados -p rbd ls |
> >>grep
> > '^rbd_data.221bf2eb141f2.' | xargs -n 200  rados -p rbd rm' gets
> >>stuck,
> > too. It's running since Friday and still not finished. The rbd
> >>image
> > is/was about 1 TB large.
> >
> > Until now the only output was:
> > error removing rbd>rbd_data.221bf2eb141f2.51d2: (2) No
> >>such
> > file or directory
> > error removing rbd>rbd_data.221bf2eb141f2.e3f2: (2) No
> >>such
> > file or directory
> 
> 
>  I am still trying to get rid of this. 'rados -p rbd ls' still shows
> >>a
>  lot of objects beginning with rbd_data.221bf2eb141f2, but if I try
> >>to
>  delete them with 'rados -p rbd rm ' it says 'No such file or
>  directory'. This is not the behaviour I'd expect. Any ideas?
> 
>  Besides this rbd_data.221bf2eb141f2.00016379 is still
> >>causing
>  the OSDs crashing, which leaves the cluster unusable for us at the
>  moment. Even if it's just a proof of concept, I'd like to get this
> >>fixed
>  without destroying the whole cluster.
> 
> >>
> >> [1]
> >>http://cephnotes.ksperis.com/blog/2014/07/04/remove-big-rbd-image
> >>
> >> On Thu, Apr 26, 2018 at 9:25 AM Jan Marquardt  >> > wrote:
> >>
> >> Hi,
> >>
> >> I am 

Re: [ceph-users] Deleting an rbd image hangs

2018-05-09 Thread Jason Dillaman
On Tue, May 8, 2018 at 2:31 PM,   wrote:
> Hello Jason,
>
>
> Am 8. Mai 2018 15:30:34 MESZ schrieb Jason Dillaman :
>>Perhaps the image had associated snapshots? Deleting the object
>>doesn't delete the associated snapshots so those objects will remain
>>until the snapshot is removed. However, if you have removed the RBD
>>header, the snapshot id is now gone.
>>
>
> Hmm... that makes me curious...
>
> So when I have a VM image (rbd) on Ceph and take one or more snapshots
> of this image, I *must* delete the snapshot(s) completely first before
> I delete the original image?

Yup, the rbd CLI (and related librbd API helpers for removing images)
will not let you delete an image that has snapshots for this very
reason.

> How can we then get rid of these orphaned objects when we have
> accidentally deleted the original image first?

Unfortunately, there isn't any available CLI tooling to let you
delete a "self-managed" snapshot even if you could determine the
correct snapshot ids that are no longer in-use. If you are comfortable
building a custom C/C++ librados application to clean up your pool,
you could first generate a list of all known in-use snapshot IDs by
collecting all the "snapshot_XYZ" keys on all "rbd_header.ABC" objects
in the pool (where XYZ is the snapshot id in hex and ABC is each
image's unique id) and cross-referencing them w/ the pool's
"removed_snaps" output from "ceph osd pool ls detail".

> Thanks if you have a bit of time to clarify this for me/us :)
>
> - Mehmet
>
>>On Tue, May 8, 2018 at 12:29 AM, Eugen Block  wrote:
>>> Hi,
>>>
>>> I have a similar issue and would also need some advice how to get rid
>>of the
>>> already deleted files.
>>>
>>> Ceph is our OpenStack backend and there was a nova clone without
>>parent
>>> information. Apparently, the base image had been deleted without a
>>warning
>>> or anything although there were existing clones.
>>> Anyway, I tried to delete the respective rbd_data and _header files
>>as
>>> described in [1]. There were about 700 objects to be deleted, but 255
>>> objects remained according to the 'rados -p pool ls' command. The
>>attempt to
>>> delete the rest (again) resulted (and still results) in "No such file
>>or
>>> directory". After about half an hour later one more object vanished
>>> (rbd_header file), there are now still 254 objects left in the pool.
>>First I
>>> thought maybe Ceph will cleanup itself, it just takes some time, but
>>this
>>> was weeks ago and the number of objects has not changed since then.
>>>
>>> I would really appreciate any help.
>>>
>>> Regards,
>>> Eugen
>>>
>>>
>>> Zitat von Jan Marquardt :
>>>
>>>
 Am 30.04.18 um 09:26 schrieb Jan Marquardt:
>
> Am 27.04.18 um 20:48 schrieb David Turner:
>>
>> This old [1] blog post about removing super large RBDs is not
>>relevant
>> if you're using object map on the RBDs, however it's method to
>>manually
>> delete an RBD is still valid.  You can see if this works for you
>>to
>> manually remove the problem RBD you're having.
>
>
> I followed the instructions, but it seems that 'rados -p rbd ls |
>>grep
> '^rbd_data.221bf2eb141f2.' | xargs -n 200  rados -p rbd rm' gets
>>stuck,
> too. It's running since Friday and still not finished. The rbd
>>image
> is/was about 1 TB large.
>
> Until now the only output was:
> error removing rbd>rbd_data.221bf2eb141f2.51d2: (2) No
>>such
> file or directory
> error removing rbd>rbd_data.221bf2eb141f2.e3f2: (2) No
>>such
> file or directory


 I am still trying to get rid of this. 'rados -p rbd ls' still shows
>>a
 lot of objects beginning with rbd_data.221bf2eb141f2, but if I try
>>to
 delete them with 'rados -p rbd rm ' it says 'No such file or
 directory'. This is not the behaviour I'd expect. Any ideas?

 Besides this rbd_data.221bf2eb141f2.00016379 is still
>>causing
 the OSDs crashing, which leaves the cluster unusable for us at the
 moment. Even if it's just a proof of concept, I'd like to get this
>>fixed
 without destroying the whole cluster.

>>
>> [1]
>>http://cephnotes.ksperis.com/blog/2014/07/04/remove-big-rbd-image
>>
>> On Thu, Apr 26, 2018 at 9:25 AM Jan Marquardt > > wrote:
>>
>> Hi,
>>
>> I am currently trying to delete an rbd image which is
>>seemingly
>> causing
>> our OSDs to crash, but it always gets stuck at 3%.
>>
>> root@ceph4:~# rbd rm noc_tobedeleted
>> Removing image: 3% complete...
>>
>> Is there any way to force the deletion? Any other advices?
>>
>> Best Regards
>>
>> Jan




 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 

[ceph-users] Open-sourcing GRNET's Ceph-related tooling

2018-05-09 Thread Nikos Kormpakis

Hello,

I'm happy to announce that GRNET [1] is open-sourcing its Ceph-related
tooling on GitHub [2]. This repo includes multiple monitoring health
checks compatible with Luminous, and tooling to quickly deploy our
new Ceph clusters based on Luminous, ceph-volume lvm and BlueStore.

We hope that people may find these tools useful.

Best regards,
Nikos

[1] https://grnet.gr/en/
[2] https://github.com/grnet/cephtools

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph ObjectCacher FAILED assert (qemu/kvm)

2018-05-09 Thread Jason Dillaman
Any clue what Windows is doing issuing a discard against an extent
that has an in-flight read? If this is repeatable, can you add "debug
rbd = 20" and "debug objectcacher = 20" to your hypervisor's ceph.conf
and attach the Ceph log to a tracker ticket?
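For example, something along these lines on the hypervisor (the log path is
only a suggestion and must be writable by the qemu process; restart the
guest afterwards so librbd picks it up):

cat >> /etc/ceph/ceph.conf <<'EOF'
[client]
    debug rbd = 20
    debug objectcacher = 20
    log file = /var/log/ceph/qemu-guest.$pid.log
EOF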

On Tue, May 8, 2018 at 8:09 PM, Richard Bade  wrote:
> Hi Everyone,
> We run some hosts with Proxmox 4.4 connected to our ceph cluster for
> RBD storage. Occasionally a vm suddenly stops with no real
> explanation. The last time this happened to one particular vm I turned
> on some qemu logging via Proxmox Monitor tab for the vm and got this
> dump this time when the vm stopped again:
>
> osdc/ObjectCacher.cc: In function 'void
> ObjectCacher::Object::discard(loff_t, loff_t)' thread 7f1c6ebfd700
> time 2018-05-08 07:00:47.816114
> osdc/ObjectCacher.cc: 533: FAILED assert(bh->waitfor_read.empty())
>  ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
>  1: (()+0x2d0712) [0x7f1c8e093712]
>  2: (()+0x52c107) [0x7f1c8e2ef107]
>  3: (()+0x52c45f) [0x7f1c8e2ef45f]
>  4: (()+0x82107) [0x7f1c8de45107]
>  5: (()+0x83388) [0x7f1c8de46388]
>  6: (()+0x80e74) [0x7f1c8de43e74]
>  7: (()+0x86db0) [0x7f1c8de49db0]
>  8: (()+0x2c0ddf) [0x7f1c8e083ddf]
>  9: (()+0x2c1d00) [0x7f1c8e084d00]
>  10: (()+0x8064) [0x7f1c804e0064]
>  11: (clone()+0x6d) [0x7f1c8021562d]
>  NOTE: a copy of the executable, or `objdump -rdS ` is
> needed to interpret this.
>
> We're using virtio-scsi for the disk with discard option and writeback
> cache enabled. The vm is Win2012r2.
>
> Has anyone seen this before? Is there a resolution?
> I couldn't find any mention of this while googling for various key
> words in the dump.
>
> Regards,
> Richard
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs-data-scan safety on active filesystem

2018-05-09 Thread John Spray
On Tue, May 8, 2018 at 8:49 PM, Ryan Leimenstoll
 wrote:
> Hi Gregg, John,
>
> Thanks for the warning. It was definitely conveyed that they are dangerous. I 
> thought the online part was implied to be a bad idea, but just wanted to 
> verify.
>
> John,
>
> We were mostly operating off of what the mds logs reported. After bringing 
> the mds back online and active, we mounted the volume using the kernel driver 
> to one host and started a recursive ls through the root of the filesystem to 
> see what was broken. There were seemingly two main paths of the tree that 
> were affected initially, both reporting errors like the following in the mds 
> log (I’ve swapped out the paths):
>
> Group 1:
> 2018-05-04 12:04:38.004029 7fc81f69a700 -1 log_channel(cluster) log [ERR] : 
> dir 0x10011125556 object missing on disk; some files may be lost 
> (/cephfs/redacted1/path/dir1)
> 2018-05-04 12:04:38.028861 7fc81f69a700 -1 log_channel(cluster) log [ERR] : 
> dir 0x1001112bf14 object missing on disk; some files may be lost 
> (/cephfs/redacted1/path/dir2)
> 2018-05-04 12:04:38.030504 7fc81f69a700 -1 log_channel(cluster) log [ERR] : 
> dir 0x10011131118 object missing on disk; some files may be lost 
> (/cephfs/redacted1/path/dir3)
>
> Group 2:
> 2021-05-04 13:24:29.495892 7fc81f69a700 -1 log_channel(cluster) log [ERR] : 
> dir 0x1001102c5f6 object missing on disk; some files may be lost 
> (/cephfs/redacted2/path/dir1)
>
> Some of the paths it complained about were empty via ls, although trying
> to rm [-r] them via the mount failed with an error suggesting files still 
> exist in the directory. We removed the dir object in the metadata pool that 
> it was still warning about (rados -p metapool rm 10011125556., for 
> example). This cleaned up errors on this path. We then did the same for Group 
> 2.
>
> After this, we initiated a recursive scrub with the mds daemon on the root of 
> the filesystem to run over the weekend.
>
> In retrospect, we probably should have done the data scan steps mentioned in 
> the disaster recovery guide before bringing the system online. The cluster is 
> currently healthy (or, rather, reporting healthy) and has been for a while.
>
> My understanding here is that we would need something like the 
> cephfs-data-scan steps to recreate metadata or at least identify (for 
> cleanup) objects that may have been stranded in the data pool. Is there 
> anyway, likely with another tool, to do this for an active cluster? If not, 
> is this something that can be done with some amount of safety on an offline 
> system? (not sure how long it would take, data pool is ~100T large w/ 242 
> million objects, and downtime is a big pain point for our users with 
> deadlines).

When you do a forward scrub, there is an option to apply a "tag" (an
arbitrary string) to the data objects of files that are present in the
metadata tree (i.e. non-orphans).  There is then a "--filter-tag"
option to cephfs-data-scan, that enables skipping everything that's
tagged.  That enables you to then do a scan_extents,scan_inodes
process targeting only the orphans, which will recover them into a
lost+found directory.  If you ultimately just want to delete them, you
can then do that from lost+found once the filesystem is back online.
In that process, the forward scrub can happen while the cluster is
online, but the cephfs-data-scan bits happen while the MDS is *not*
running.
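A bare-bones sketch of that last part (tag and pool names are placeholders,
and the MDS must not be running while cephfs-data-scan runs):

cephfs-data-scan scan_extents --filter-tag mytag cephfs_data
cephfs-data-scan scan_inodes --filter-tag mytag cephfs_data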

The alternative way to do it is to write yourself a script that
recursively lists the filesystem, keeps a big index of which inodes
exist in the filesystem, and then scans through all the objects in the
data pool and removes anything with an inode prefix that isn't in the
set.  You could do that without stopping your MDS, although of course
you would need to either stop creating new files, or have your script
include any inode above a certain limit in your list of inodes that
exist.
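A very rough sketch of that kind of script, assuming the usual
<inode-hex>.<offset-hex> object naming in the data pool (mount point and
pool name are placeholders; it only reports, it does not delete anything):

# hex inode prefixes of files that exist in the filesystem
find /mnt/cephfs -printf '%i\n' | awk '{ printf "%x\n", $1 }' | sort -u > live_inodes
# inode prefixes present in the data pool
rados -p cephfs_data ls | cut -d. -f1 | sort -u > pool_inodes
# prefixes that are in the pool but have no matching inode in the filesystem
comm -13 live_inodes pool_inodes > orphan_prefixes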

In any case, I'd strongly recommend you create a separate filesystem
to play with first, to work out your procedure.  You can synthesise
the damage from a lost PG easily by random deleting some percentage of
the objects in your metadata pool.

John


> Thanks,
>
> Ryan
>
>> On May 8, 2018, at 5:05 AM, John Spray  wrote:
>>
>> On Mon, May 7, 2018 at 8:50 PM, Ryan Leimenstoll
>>  wrote:
>>> Hi All,
>>>
>>> We recently experienced a failure with our 12.2.4 cluster running a CephFS
>>> instance that resulted in some data loss due to a seemingly problematic OSD
>>> blocking IO on its PGs. We restarted the (single active) mds daemon during
>>> this, which caused damage due to the journal not having the chance to flush
>>> back. We reset the journal, session table, and fs to bring the filesystem
>>> online. We then removed some directories/inodes that were causing the
>>> cluster to report damaged metadata (and were otherwise visibly broken by
>>> navigating the filesystem).
>>
>> This may be over-optimistic of 

Re: [ceph-users] stale status from monitor?

2018-05-09 Thread John Spray
On Tue, May 8, 2018 at 9:50 PM, Bryan Henderson  wrote:
> My cluster got stuck somehow, and at one point in trying to recycle things to
> unstick it, I ended up shutting down everything, then bringing up just the
> monitors.  At that point, the cluster reported the status below.
>
> With nothing but the monitors running, I don't see how the status can say
> there are two OSDs and an MDS up and requests are blocked.  This was the
> status of the cluster when I previously shut down the monitors (which I
> probably shouldn't have done when there were still OSDs and MDSs up, but I
> did).

The mon learns about down OSDs from other OSD peers.  That means that
the last OSD or two to go down will have nobody to report their
absence, so they still appear up.  I thought we recently changed
something to add a special case for that, but perhaps not.

With the MDS, the monitor can't immediately distinguish between a
daemon that is really gone, vs. one that is slow responding.  If
another standby MDS is available, then the monitor will give up on the
laggy MDS eventually and have the standby take over.  However, if no
standbys are available, then there is nothing useful the monitor can
do about a down MDS, so it just takes the optimistic view that the MDS
is still running, and will report back eventually.

All that is the "how", not necessarily a claim that the resulting
health report is ideal!

John


> It stayed that way for about 20 minutes, and I finally brought up the OSDs and
> everything went back to normal.
>
> So my question is:  Is this normal and what has to happen for the status to be
> current?
>
> cluster 23352cdb-18fc-4efc-9d54-e72c000abfdb
>  health HEALTH_WARN
> 60 pgs peering
> 60 pgs stuck inactive
> 60 pgs stuck unclean
> 4 requests are blocked > 32 sec
> mds cluster is degraded
> mds a is laggy
>  monmap e3: 3 mons at 
> {a=192.168.1.16:6789/0,b=192.168.1.23:6789/0,c=192.168.1.20:6789/0}
> election epoch 202, quorum 0,1,2 a,c,b
>  mdsmap e315: 1/1/1 up {0=a=up:replay(laggy or crashed)}
>  osdmap e495: 2 osds: 2 up, 2 in
>  pgmap v33881: 160 pgs, 4 pools, 568 MB data, 14851 objects
>1430 MB used, 43704 MB / 45134 MB avail
> 100 active+clean
>  60 peering
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com