[ceph-users] Ceph ObjectCacher FAILED assert (qemu/kvm)

2018-05-08 Thread Richard Bade
Hi Everyone,
We run some hosts with Proxmox 4.4 connected to our ceph cluster for
RBD storage. Occasionally a VM suddenly stops with no real
explanation. The last time this happened to one particular VM, I turned
on some qemu logging via the Proxmox Monitor tab for that VM and got the
following dump when the VM stopped again:

osdc/ObjectCacher.cc: In function 'void
ObjectCacher::Object::discard(loff_t, loff_t)' thread 7f1c6ebfd700
time 2018-05-08 07:00:47.816114
osdc/ObjectCacher.cc: 533: FAILED assert(bh->waitfor_read.empty())
 ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
 1: (()+0x2d0712) [0x7f1c8e093712]
 2: (()+0x52c107) [0x7f1c8e2ef107]
 3: (()+0x52c45f) [0x7f1c8e2ef45f]
 4: (()+0x82107) [0x7f1c8de45107]
 5: (()+0x83388) [0x7f1c8de46388]
 6: (()+0x80e74) [0x7f1c8de43e74]
 7: (()+0x86db0) [0x7f1c8de49db0]
 8: (()+0x2c0ddf) [0x7f1c8e083ddf]
 9: (()+0x2c1d00) [0x7f1c8e084d00]
 10: (()+0x8064) [0x7f1c804e0064]
 11: (clone()+0x6d) [0x7f1c8021562d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>`, is
needed to interpret this.

We're using virtio-scsi for the disk with the discard option and writeback
cache enabled. The VM is Win2012r2.
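
For context, the "writeback cache" here is librbd's ObjectCacher (the
component named in the assert above), which is controlled by client-side
options along these lines. This is only a sketch of the defaults involved,
not our exact configuration:

[client]
    rbd cache = true
    rbd cache writethrough until flush = true
    rbd cache size = 33554432        # 32 MB default
    rbd cache max dirty = 25165824   # dirty bytes allowed before writeback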

Has anyone seen this before? Is there a resolution?
I couldn't find any mention of this while googling for various keywords
from the dump.

Regards,
Richard


[ceph-users] RGW (Swift) failures during upgrade from Jewel to Luminous

2018-05-08 Thread Bryan Stillwell
We recently began our upgrade testing for going from Jewel (10.2.10) to
Luminous (12.2.5) on our clusters.  The first part of the upgrade went
pretty smoothly (upgrading the mon nodes, adding the mgr nodes, upgrading
the OSD nodes); however, when we got to the RGWs we started seeing internal
server errors (500s) on the Jewel RGWs once the first RGW was upgraded to
Luminous.  Further testing found two different problems:

The first problem (internal server error) was seen when the container and
object were created by a Luminous RGW, but then a Jewel RGW attempted to
list the container.

The second problem (container appears to be empty) was seen when the
container was created by a Luminous RGW, an object was added using a Jewel
RGW, and then the container was listed by a Luminous RGW.

Here were all the tests I performed:

Test #1: Create container (Jewel),    Add object (Jewel),    List container (Jewel),    Result: Success
Test #2: Create container (Jewel),    Add object (Jewel),    List container (Luminous), Result: Success
Test #3: Create container (Jewel),    Add object (Luminous), List container (Jewel),    Result: Success
Test #4: Create container (Jewel),    Add object (Luminous), List container (Luminous), Result: Success
Test #5: Create container (Luminous), Add object (Jewel),    List container (Jewel),    Result: Success
Test #6: Create container (Luminous), Add object (Jewel),    List container (Luminous), Result: Failure (Container appears empty)
Test #7: Create container (Luminous), Add object (Luminous), List container (Jewel),    Result: Failure (Internal Server Error)
Test #8: Create container (Luminous), Add object (Luminous), List container (Luminous), Result: Success
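
(For reference, each test was essentially the following sequence with the
standard python-swiftclient CLI, pointing ST_AUTH at either a Jewel or a
Luminous RGW for each step. This is just a sketch; the endpoint and
credentials are placeholders, not our real ones.)

# point ST_AUTH at a Jewel or Luminous RGW endpoint, per test step
export ST_AUTH=http://rgw-a.example.com:8080/auth/v1.0
export ST_USER=testuser:swift
export ST_KEY=secretkey

swift post testcontainer               # create container
swift upload testcontainer testfile    # add object
swift list testcontainer               # list container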

It appears that we ran into these bugs because our load balancer was
alternating between the RGWs while they were running a mixture of the two
versions (as you would expect during an upgrade).

Has anyone else run into this problem?  Is there a way to work around it
besides disabling half the RGWs, upgrading that half, swinging all the
traffic to the upgraded RGWs, upgrading the other half, and then re-enabling
the second half?

Thanks,
Bryan



Re: [ceph-users] Deleting an rbd image hangs

2018-05-08 Thread ceph
Hello Jason,
  

On 8 May 2018 15:30:34 CEST, Jason Dillaman wrote:
>Perhaps the image had associated snapshots? Deleting the object
>doesn't delete the associated snapshots so those objects will remain
>until the snapshot is removed. However, if you have removed the RBD
>header, the snapshot id is now gone.
>

Hmm... that makes me curious...

So when I have a VM image (RBD) on Ceph and take one or more snapshots of
this image, I *must* delete the snapshot(s) completely first, before I
delete the origin image?

How can we then get rid of these orphaned objects when we have accidentally
deleted the origin image first?

Thanks if you have a bit of time to clarify this for me/us :)

- Mehmet

>On Tue, May 8, 2018 at 12:29 AM, Eugen Block  wrote:
>> Hi,
>>
>> I have a similar issue and would also need some advice how to get rid
>of the
>> already deleted files.
>>
>> Ceph is our OpenStack backend and there was a nova clone without
>parent
>> information. Apparently, the base image had been deleted without a
>warning
>> or anything although there were existing clones.
>> Anyway, I tried to delete the respective rbd_data and _header files
>as
>> described in [1]. There were about 700 objects to be deleted, but 255
>> objects remained according to the 'rados -p pool ls' command. The
>attempt to
>> delete the rest (again) resulted (and still results) in "No such file
>or
>> directory". After about half an hour later one more object vanished
>> (rbd_header file), there are now still 254 objects left in the pool.
>First I
>> thought maybe Ceph will cleanup itself, it just takes some time, but
>this
>> was weeks ago and the number of objects has not changed since then.
>>
>> I would really appreciate any help.
>>
>> Regards,
>> Eugen
>>
>>
>> Zitat von Jan Marquardt :
>>
>>
>>> Am 30.04.18 um 09:26 schrieb Jan Marquardt:

 Am 27.04.18 um 20:48 schrieb David Turner:
>
> This old [1] blog post about removing super large RBDs is not
>relevant
> if you're using object map on the RBDs, however it's method to
>manually
> delete an RBD is still valid.  You can see if this works for you
>to
> manually remove the problem RBD you're having.


 I followed the instructions, but it seems that 'rados -p rbd ls |
>grep
 '^rbd_data.221bf2eb141f2.' | xargs -n 200  rados -p rbd rm' gets
>stuck,
 too. It's running since Friday and still not finished. The rbd
>image
 is/was about 1 TB large.

 Until now the only output was:
 error removing rbd>rbd_data.221bf2eb141f2.51d2: (2) No
>such
 file or directory
 error removing rbd>rbd_data.221bf2eb141f2.e3f2: (2) No
>such
 file or directory
>>>
>>>
>>> I am still trying to get rid of this. 'rados -p rbd ls' still shows
>a
>>> lot of objects beginning with rbd_data.221bf2eb141f2, but if I try
>to
>>> delete them with 'rados -p rbd rm ' it says 'No such file or
>>> directory'. This is not the behaviour I'd expect. Any ideas?
>>>
>>> Besides this rbd_data.221bf2eb141f2.00016379 is still
>causing
>>> the OSDs crashing, which leaves the cluster unusable for us at the
>>> moment. Even if it's just a proof of concept, I'd like to get this
>fixed
>>> without destroying the whole cluster.
>>>
>
> [1]
>http://cephnotes.ksperis.com/blog/2014/07/04/remove-big-rbd-image
>
> On Thu, Apr 26, 2018 at 9:25 AM Jan Marquardt  > wrote:
>
> Hi,
>
> I am currently trying to delete an rbd image which is
>seemingly
> causing
> our OSDs to crash, but it always gets stuck at 3%.
>
> root@ceph4:~# rbd rm noc_tobedeleted
> Removing image: 3% complete...
>
> Is there any way to force the deletion? Any other advices?
>
> Best Regards
>
> Jan
>>>
>>>
>>>
>>>
>>
>>
>>
>>


Re: [ceph-users] What is the meaning of size and min_size for erasure-coded pools?

2018-05-08 Thread Maciej Puzio
Thank you everyone for your replies. However, I feel that at least
part of the discussion deviated from the topic of my original post. As
I wrote before, I am dealing with a toy cluster, whose purpose is not
to provide resilient storage, but to evaluate ceph and its behavior
in the event of a failure, with particular attention paid to
worst-case scenarios. This cluster is purposely minimal, and is built
on VMs running on my workstation, all OSDs storing data on a single
SSD. That's definitely not a production system.

I am not asking for advice on how to build resilient clusters, not at
this point. I asked some questions about specific things that I
noticed during my tests, and that I was not able to find explained in
ceph documentation. Dan van der Ster wrote:
> See https://github.com/ceph/ceph/pull/8008 for the reason why min_size 
> defaults to k+1 on ec pools.
That's a good point, but I am wondering why reads are also blocked
when the number of OSDs falls down to k. What if the total number of OSDs in a
pool (n) is larger than k+m? Should the min_size then be k(+1) or
n-m(+1)?
In any case, since min_size can be easily changed, then I guess this
is not an implementation issue, but rather a documentation issue.
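
(For reference, the commands involved are roughly the following; the pool
name is a placeholder:)

ceph osd pool get ecpool size        # k+m
ceph osd pool get ecpool min_size    # defaults to k+1 for EC pools
ceph osd pool set ecpool min_size 3  # relax to k, trading safety for availability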

Which leaves my questions still unanswered:
After killing m OSDs and setting min_size=k, most of the PGs were
active+undersized, often with ...+degraded and/or remapped, but a few
were active+clean or active+clean+remapped. Why? I would expect all
PGs to be in the same state (perhaps active+undersized+degraded?).
Is this mishmash of PG states normal? If not, would I have avoided it
if I had created the pool with min_size=k=3 from the start? In other
words, does min_size influence the assignment of PGs to OSDs? Or is it
only used to force an I/O shutdown in the event of OSD failures?

Thank you very much

Maciej Puzio


[ceph-users] stale status from monitor?

2018-05-08 Thread Bryan Henderson
My cluster got stuck somehow, and at one point in trying to recycle things to
unstick it, I ended up shutting down everything, then bringing up just the
monitors.  At that point, the cluster reported the status below.

With nothing but the monitors running, I don't see how the status can say
there are two OSDs and an MDS up and requests are blocked.  This was the
status of the cluster when I previously shut down the monitors (which I
probably shouldn't have done when there were still OSDs and MDSs up, but I
did).

It stayed that way for about 20 minutes, and I finally brought up the OSDs and
everything went back to normal.

So my question is: is this normal, and what has to happen for the status to be
current?

cluster 23352cdb-18fc-4efc-9d54-e72c000abfdb
 health HEALTH_WARN
60 pgs peering
60 pgs stuck inactive
60 pgs stuck unclean
4 requests are blocked > 32 sec
mds cluster is degraded
mds a is laggy
 monmap e3: 3 mons at 
{a=192.168.1.16:6789/0,b=192.168.1.23:6789/0,c=192.168.1.20:6789/0}
election epoch 202, quorum 0,1,2 a,c,b
 mdsmap e315: 1/1/1 up {0=a=up:replay(laggy or crashed)}
 osdmap e495: 2 osds: 2 up, 2 in
 pgmap v33881: 160 pgs, 4 pools, 568 MB data, 14851 objects
   1430 MB used, 43704 MB / 45134 MB avail
100 active+clean
 60 peering



Re: [ceph-users] cephfs-data-scan safety on active filesystem

2018-05-08 Thread Ryan Leimenstoll
Hi Gregg, John, 

Thanks for the warning. It was definitely conveyed that they are dangerous. I 
thought the online part was implied to be a bad idea, but just wanted to verify.

John,

We were mostly operating off of what the mds logs reported. After bringing the 
mds back online and active, we mounted the volume on one host using the kernel 
driver and started a recursive ls through the root of the filesystem to see 
what was broken. There were seemingly two main paths of the tree that were 
affected initially, both reporting errors like the following in the mds log 
(I’ve swapped out the paths):

Group 1:
2018-05-04 12:04:38.004029 7fc81f69a700 -1 log_channel(cluster) log [ERR] : dir 
0x10011125556 object missing on disk; some files may be lost 
(/cephfs/redacted1/path/dir1) 
2018-05-04 12:04:38.028861 7fc81f69a700 -1 log_channel(cluster) log [ERR] : dir 
0x1001112bf14 object missing on disk; some files may be lost 
(/cephfs/redacted1/path/dir2)
2018-05-04 12:04:38.030504 7fc81f69a700 -1 log_channel(cluster) log [ERR] : dir 
0x10011131118 object missing on disk; some files may be lost 
(/cephfs/redacted1/path/dir3) 

Group 2:
2021-05-04 13:24:29.495892 7fc81f69a700 -1 log_channel(cluster) log [ERR] : dir 
0x1001102c5f6 object missing on disk; some files may be lost 
(/cephfs/redacted2/path/dir1) 

Some of the paths it complained about appeared empty via ls, although trying to 
rm [-r] them via the mount failed with an error suggesting files still existed in 
the directory. We removed the dir object in the metadata pool that it was still 
warning about (rados -p metapool rm 10011125556., for example). This 
cleaned up the errors on this path. We then did the same for Group 2.

After this, we initiated a recursive scrub with the mds daemon on the root of 
the filesystem to run over the weekend.
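
(For the curious, the scrub was kicked off via the MDS admin socket with
something like the following; the daemon name is a placeholder:)

ceph daemon mds.mds1 scrub_path / recursive repair
ceph daemon mds.mds1 damage ls    # check what the scrub has flagged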

In retrospect, we probably should have done the data scan steps mentioned in 
the disaster recovery guide before bringing the system online. The cluster is 
currently healthy (or, rather, reporting healthy) and has been for a while.

My understanding here is that we would need something like the cephfs-data-scan 
steps to recreate metadata, or at least to identify (for cleanup) objects that may 
have been stranded in the data pool. Is there any way, likely with another tool, 
to do this on an active cluster? If not, is this something that can be done 
with some amount of safety on an offline system? (I'm not sure how long it would 
take; the data pool is ~100T with 242 million objects, and downtime is a big 
pain point for our users with deadlines.)

Thanks,

Ryan

> On May 8, 2018, at 5:05 AM, John Spray  wrote:
> 
> On Mon, May 7, 2018 at 8:50 PM, Ryan Leimenstoll
>  wrote:
>> Hi All,
>> 
>> We recently experienced a failure with our 12.2.4 cluster running a CephFS
>> instance that resulted in some data loss due to a seemingly problematic OSD
>> blocking IO on its PGs. We restarted the (single active) mds daemon during
>> this, which caused damage due to the journal not having the chance to flush
>> back. We reset the journal, session table, and fs to bring the filesystem
>> online. We then removed some directories/inodes that were causing the
>> cluster to report damaged metadata (and were otherwise visibly broken by
>> navigating the filesystem).
> 
> This may be over-optimistic of me, but is there any chance you kept a
> detailed record of exactly what damage was reported, and what you did
> to the filesystem so far?  It's hard to give any intelligent advice on
> repairing it, when we don't know exactly what was broken, and a bunch
> of unknown repair-ish things have already manipulated the metadata
> behind the scenes.
> 
> John
> 
>> With that, there are now some paths that seem to have been orphaned (which
>> we expected). We did not run the ‘cephfs-data-scan’ tool [0] in the name of
>> getting the system back online ASAP. Now that the filesystem is otherwise
>> stable, can we initiate a scan_links operation with the mds active safely?
>> 
>> [0]
>> http://docs.ceph.com/docs/luminous/cephfs/disaster-recovery/#recovery-from-missing-metadata-objects
>> 
>> Thanks much,
>> Ryan Leimenstoll
>> rleim...@umiacs.umd.edu
>> University of Maryland Institute for Advanced Computer Studies
>> 
>> 
>> 
>> 



Re: [ceph-users] What is the meaning of size and min_size for erasure-coded pools?

2018-05-08 Thread Vasu Kulkarni
On Tue, May 8, 2018 at 12:07 PM, Dan van der Ster  wrote:
> On Tue, May 8, 2018 at 7:35 PM, Vasu Kulkarni  wrote:
>> On Mon, May 7, 2018 at 2:26 PM, Maciej Puzio  wrote:
>>> I am an admin in a research lab looking for a cluster storage
>>> solution, and a newbie to ceph. I have setup a mini toy cluster on
>>> some VMs, to familiarize myself with ceph and to test failure
>>> scenarios. I am using ceph 12.2.4 on Ubuntu 18.04. I created 5 OSDs
>>> (one OSD per VM), an erasure-coded pool for data (k=3, m=2), a
>>> replicated pool for metadata, and CephFS on top of them, using default
>>> settings wherever possible. I mounted the filesystem on another
>>> machine and verified that it worked.
>>>
>>> I then killed two OSD VMs with an expectation that the data pool will
>>> still be available, even if in a degraded state, but I found that this
>>> was not the case, and that the pool became inaccessible for reading
>>> and writing. I listed PGs (ceph pg ls) and found the majority of PGs
>>> in an incomplete state. I then found that the pool had size=5 and
>>> min_size=4. Where did the value 4 come from, I do not know.
>>>
>>> This is what I found in the ceph documentation in relation to min_size
>>> and resiliency of erasure-coded pools:
>>>
>>> 1. According to
>>> http://docs.ceph.com/docs/luminous/rados/operations/pools/ the values
>>> size and min_size are for replicated pools only.
>>> 2. According to the same document, for erasure-coded pools the number
>>> of OSDs that are allowed to fail without losing data equals the number
>>> of coding chunks (m=2 in my case). Of course data loss is not the same
>>> thing as lack of access, but why these two things happen at different
>>> redundancy levels, by default?
>>> 3. The same document states that that no object in the data pool will
>>> receive I/O with fewer than min_size replicas. This refers to
>>> replicas, and taken together with #1, appear not to apply to
>>> erasured-coded pools. But in fact it does, and the default min_size !=
>>> k causes a surprising behavior.
>>> 4. According to
>>> http://docs.ceph.com/docs/master/rados/operations/pg-states/ ,
>>> reducing min_size may allow recovery of an erasure-coded pool. This
>>> advice was deemed unhelpful and removed from documentation (commit
>>> 9549943761d1cdc16d72e2b604bf1f89d12b5e13), but then re-added (commit
>>> ac6123d7a6d27775eec0a152c00e0ff75b36bd60). I guess I am not the only
>>> one confused.
>>
>>
>> you bring up good inconsistency that needs to be addressed, afaik,only
>> m value is important
>> for ec pools, i am not sure if the *replicated* metadata pool is
>> somehow causing min_size
>> variance in your experiment to work. when we create replicated pool it
>> has option for min size
>> and for ec pool it is the m value.
>
> See https://github.com/ceph/ceph/pull/8008 for the reason why min_size
> defaults to k+1 on ec pools.

So this looks like it's happening by default per EC pool, unless the user
changes the pool's min_size. Probably this should be left unchanged and we
could document it? It is a bit confusing with the coding chunks.

>
> Cheers, Dan


Re: [ceph-users] What is the meaning of size and min_size for erasure-coded pools?

2018-05-08 Thread Dan van der Ster
On Tue, May 8, 2018 at 7:35 PM, Vasu Kulkarni  wrote:
> On Mon, May 7, 2018 at 2:26 PM, Maciej Puzio  wrote:
>> I am an admin in a research lab looking for a cluster storage
>> solution, and a newbie to ceph. I have setup a mini toy cluster on
>> some VMs, to familiarize myself with ceph and to test failure
>> scenarios. I am using ceph 12.2.4 on Ubuntu 18.04. I created 5 OSDs
>> (one OSD per VM), an erasure-coded pool for data (k=3, m=2), a
>> replicated pool for metadata, and CephFS on top of them, using default
>> settings wherever possible. I mounted the filesystem on another
>> machine and verified that it worked.
>>
>> I then killed two OSD VMs with an expectation that the data pool will
>> still be available, even if in a degraded state, but I found that this
>> was not the case, and that the pool became inaccessible for reading
>> and writing. I listed PGs (ceph pg ls) and found the majority of PGs
>> in an incomplete state. I then found that the pool had size=5 and
>> min_size=4. Where did the value 4 come from, I do not know.
>>
>> This is what I found in the ceph documentation in relation to min_size
>> and resiliency of erasure-coded pools:
>>
>> 1. According to
>> http://docs.ceph.com/docs/luminous/rados/operations/pools/ the values
>> size and min_size are for replicated pools only.
>> 2. According to the same document, for erasure-coded pools the number
>> of OSDs that are allowed to fail without losing data equals the number
>> of coding chunks (m=2 in my case). Of course data loss is not the same
>> thing as lack of access, but why these two things happen at different
>> redundancy levels, by default?
>> 3. The same document states that that no object in the data pool will
>> receive I/O with fewer than min_size replicas. This refers to
>> replicas, and taken together with #1, appear not to apply to
>> erasured-coded pools. But in fact it does, and the default min_size !=
>> k causes a surprising behavior.
>> 4. According to
>> http://docs.ceph.com/docs/master/rados/operations/pg-states/ ,
>> reducing min_size may allow recovery of an erasure-coded pool. This
>> advice was deemed unhelpful and removed from documentation (commit
>> 9549943761d1cdc16d72e2b604bf1f89d12b5e13), but then re-added (commit
>> ac6123d7a6d27775eec0a152c00e0ff75b36bd60). I guess I am not the only
>> one confused.
>
>
> you bring up good inconsistency that needs to be addressed, afaik,only
> m value is important
> for ec pools, i am not sure if the *replicated* metadata pool is
> somehow causing min_size
> variance in your experiment to work. when we create replicated pool it
> has option for min size
> and for ec pool it is the m value.

See https://github.com/ceph/ceph/pull/8008 for the reason why min_size
defaults to k+1 on ec pools.

Cheers, Dan


Re: [ceph-users] What is the meaning of size and min_size for erasure-coded pools?

2018-05-08 Thread Vasu Kulkarni
On Mon, May 7, 2018 at 2:26 PM, Maciej Puzio  wrote:
> I am an admin in a research lab looking for a cluster storage
> solution, and a newbie to ceph. I have setup a mini toy cluster on
> some VMs, to familiarize myself with ceph and to test failure
> scenarios. I am using ceph 12.2.4 on Ubuntu 18.04. I created 5 OSDs
> (one OSD per VM), an erasure-coded pool for data (k=3, m=2), a
> replicated pool for metadata, and CephFS on top of them, using default
> settings wherever possible. I mounted the filesystem on another
> machine and verified that it worked.
>
> I then killed two OSD VMs with an expectation that the data pool will
> still be available, even if in a degraded state, but I found that this
> was not the case, and that the pool became inaccessible for reading
> and writing. I listed PGs (ceph pg ls) and found the majority of PGs
> in an incomplete state. I then found that the pool had size=5 and
> min_size=4. Where did the value 4 come from, I do not know.
>
> This is what I found in the ceph documentation in relation to min_size
> and resiliency of erasure-coded pools:
>
> 1. According to
> http://docs.ceph.com/docs/luminous/rados/operations/pools/ the values
> size and min_size are for replicated pools only.
> 2. According to the same document, for erasure-coded pools the number
> of OSDs that are allowed to fail without losing data equals the number
> of coding chunks (m=2 in my case). Of course data loss is not the same
> thing as lack of access, but why these two things happen at different
> redundancy levels, by default?
> 3. The same document states that that no object in the data pool will
> receive I/O with fewer than min_size replicas. This refers to
> replicas, and taken together with #1, appear not to apply to
> erasured-coded pools. But in fact it does, and the default min_size !=
> k causes a surprising behavior.
> 4. According to
> http://docs.ceph.com/docs/master/rados/operations/pg-states/ ,
> reducing min_size may allow recovery of an erasure-coded pool. This
> advice was deemed unhelpful and removed from documentation (commit
> 9549943761d1cdc16d72e2b604bf1f89d12b5e13), but then re-added (commit
> ac6123d7a6d27775eec0a152c00e0ff75b36bd60). I guess I am not the only
> one confused.


You bring up a good inconsistency that needs to be addressed. AFAIK, only
the m value is important for EC pools. I am not sure if the *replicated*
metadata pool is somehow causing the min_size variance in your experiment.
When we create a replicated pool there is an option for min_size, and for
an EC pool it is the m value.

>
> I followed the advice #4 and reduced min_size to 3. Lo and behold, the
> pool became accessible, and I could read the data previously stored,
> and write new one. This appears to contradict #1, but at least it
> works. The look at ceph pg ls revealed another mystery, though. Most
> of PGs were now active+undersized, often with ...+degraded and/or
> remapped, but a few were active+clean or active+clean+remapped. Why? I
> would expect all PGs to be in the same state (perhaps
> active+undersized+degraded?)
>
> I apologize if this behavior turns out to be expected and
> straightforward to experienced ceph users, or if I missed some
> documentation that explains this clearly. My goal is to put about 500
> TB on ceph or another cluster storage system, and I find these issues
> confusing and worrisome. Helpful and competent replies will be much
> appreciated. Please note that my questions are about erasure-coded
> pools, and not about replicated pools.
>
> Thank you
>
> Maciej Puzio


Re: [ceph-users] How to configure s3 bucket acl so that one user's bucket is visible to another.

2018-05-08 Thread David Turner
Sorry I've been on vacation, but I'm back now.  The commands I use to create
subusers for an RGW user are...

radosgw-admin user create --gen-access-key --gen-secret --uid=user_a
--display_name="User A"
radosgw-admin subuser create --gen-access-key --gen-secret
--access={read,write,readwrite,full} --key-type=s3 --uid=user_a
--subuser=subuser_1

Now all buckets created by user_a (or a subuser with --access=full) can
be accessed by user_a and all user_a:subusers.  What you missed was
changing the default subuser type from swift to s3.  --access=full is
needed for any user that needs to be able to create and delete buckets; the
others are fairly self-explanatory for what they can do inside of existing
buckets.

There are 2 approaches to use with subusers depending on your use case.
The first use case is what I use for buckets.  We create 1 user per bucket
and create subusers when necessary.  Most of our buckets are used by a
single service and that's all the service uses... so they get the keys for
their bucket and that's it.  Subusers are created just for the single bucket
that the original user is in charge of.

The second use case is where you want a lot of buckets accessed by a single
set of keys, but you want multiple people to all be able to access the
buckets.  In this case I would create a single user and use that user to
create all of the buckets, and then create the subusers for everyone to be
able to access the various buckets.  Note that with this method you get no
more granularity than, say, subuser_2 having read access to every bucket.
You can't pick and choose which buckets a subuser has write access to; it's
all or none.  That's why I use the first approach and call it "juggling"
keys: if someone wants access to multiple buckets, they have keys for each
individual bucket as a subuser.
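
To make the key juggling concrete, here is a rough sketch (user, bucket, and
key names are placeholders): each subuser gets its own S3 key pair, which is
what you hand out and drop into the client config.

# show the keys generated for the user and its subusers
radosgw-admin user info --uid=user_a

# the subuser's keys then go into the client, e.g. s3cmd
s3cmd --access_key=SUBUSER_ACCESS_KEY --secret_key=SUBUSER_SECRET_KEY ls s3://user_a_bucket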

On Sat, May 5, 2018 at 6:28 AM Marc Roos  wrote:

>
> This 'juggle keys' is a bit cryptic to me. If I create a subuser it
> becomes a swift user not? So how can that have access to the s3 or be
> used in a s3 client. I have to put in the client the access and secret
> key, in the subuser I only have a secret key.
>
> Is this multi tentant basically only limiting this buckets namespace to
> the tenants users and nothing else?
>
>
>
>
>
> -Original Message-
> From: David Turner [mailto:drakonst...@gmail.com]
> Sent: zondag 29 april 2018 14:52
> To: Yehuda Sadeh-Weinraub
> Cc: ceph-users@lists.ceph.com; Безруков Илья Алексеевич
> Subject: Re: [ceph-users] How to configure s3 bucket acl so that one
> user's bucket is visible to another.
>
> You can create subuser keys to allow other users to have access to a
> bucket. You have to juggle keys, but it works pretty well.
>
>
> On Sun, Apr 29, 2018, 4:00 AM Yehuda Sadeh-Weinraub 
> wrote:
>
>
> You can't. A user can only list the buckets that it owns, it cannot
> list other users' buckets.
>
> Yehuda
>
> On Sat, Apr 28, 2018 at 11:10 AM, Безруков Илья Алексеевич
>  wrote:
> > Hello,
> >
> > How to configure s3 bucket acl so that one user's bucket is
> visible to
> > another.
> >
> >
> > I can create a bucket, objects in it and give another user
> access
> to it.
> > But another user does not see this bucket in the list of
> available buckets.
> >
> >
> > ## User1
> >
> > ```
> > s3cmd -c s3cfg_user1 ls s3://
> >
> > 2018-04-28 07:50  s3://example1
> >
> > #set ACL
> > s3cmd -c s3cfg_user1 setacl --acl-grant=all:user2 s3://example1
> > s3://example1/: ACL updated
> >
> > # Check
> > s3cmd -c s3cfg_user1 info s3://example1
> > s3://example1/ (bucket):
> >Location:  us-east-1
> >Payer: BucketOwner
> >Expiration Rule: none
> >Policy:none
> >CORS:  none
> >ACL:   User1: FULL_CONTROL
> >ACL:   User2: FULL_CONTROL
> >
> > # Put some data
> > s3cmd -c s3cfg_user1 put /tmp/dmesg s3://example1
> > upload: '/tmp/dmesg' -> 's3://example1/dmesg'  [1 of 1]
> >  5305 of 5305   100% in0s27.28 kB/s  done
> >
> > #set ACL
> > s3cmd -c s3cfg_user1 setacl --acl-grant=all:bondarenko
> s3://example1/dmesg
> > s3://example1/dmesg: ACL updated
> >
> > ```
> >
> > ## User2
> > ```
> > s3cmd -c ~/.s3cfg_user2 ls s3://
> > 2018-04-27 14:23  s3://only_itself_dir
> >
> > # Check info
> > s3cmd -c ~/.s3cfg_user2 info s3://example1
> > ERROR: Access to bucket 'example1' was denied
> > ERROR: S3 error: 403 (AccessDenied)
> >
> > # ls bucket
> > s3cmd -c ~/.s3cfg_user2 ls s3://example1
> > 2018-04-28 07:58  5305   s3://example1/dmesg
> 

Re: [ceph-users] What is the meaning of size and min_size for erasure-coded pools?

2018-05-08 Thread David Turner
You talked about "using default settings wherever possible"... Well, Ceph's
default settings, everywhere they exist, are to not allow you to write while
you don't have at least 1 more copy that you can lose without data loss.
If your bosses require you to be able to lose 2 servers and still serve
customers, then tell them that Ceph requires you to have 3 parity copies of
the data.

Why do you want to change your one and only copy of the data while you
already have a degraded system?  And not just a degraded system, but a
system where 2/5 of your servers are down... That sounds awful, terrible,
and just plain bad.

To directly answer your question about min_size, min_size does not affect
where data is placed.  It only affects when a PG claims to not have enough
copies online to be able to receive read or write requests.

On Tue, May 8, 2018 at 7:47 AM Janne Johansson  wrote:

> 2018-05-08 1:46 GMT+02:00 Maciej Puzio :
>
>> Paul, many thanks for your reply.
>> Thinking about it, I can't decide if I'd prefer to operate the storage
>> server without redundancy, or have it automatically force a downtime,
>> subjecting me to a rage of my users and my boss.
>> But I think that the typical expectation is that system serves the
>> data while it is able to do so.
>
>
> If you want to prevent angry bosses, you would have made 10 OSD hosts
> or some other large number so that ceph cloud place PGs over more places
> so that 2 lost hosts would not impact so much, but also so it can recover
> into
> each PG into one of the 10 ( minus two broken minus the three that already
> hold data you want to spread out) other OSDs and get back into full service
> even with two lost hosts.
>
> It's fun to test assumptions and "how low can I go", but if you REALLY
> wanted
> a cluster with resilience to planned and unplanned maintenance,
> you would have redundancy, just like that Raid6 disk box would
> presumably have a fair amount of hot and perhaps cold spares nearby to kick
> in if lots of disks started go missing.
>
> --
> May the most significant bit of your life be positive.
>


Re: [ceph-users] Shutting down: why OSDs first?

2018-05-08 Thread David Turner
The mons work best when they know absolutely everything.  If they know that
osd.3 was down 40 seconds before osd.2, that means that if a write was
still happening while osd.2 was still up, the mons have a record of it in
the maps, and when osd.3 comes up it can get what it needs from the other
osds.  Mons are the keepers of maps, epochs, and everything important to
know about a cluster.  If you're using encryption on your OSDs, the mons
keep track of the keys to decrypt the osds, iirc.

Even if you aren't using encryption, the OSDs check with the mons when they
first start to learn what the most recent map is.  If they can't communicate
with a mon, they will fail to start and die.  Last down/first up ensures
that the mons know everything and is the safest way to handle a cluster
shutdown.  Yes, Ceph can usually handle full system power-offs with no
proper order, or having too many of something shut down while the rest of the
cluster is running, but most people try to avoid disaster scenarios if they
can help it.
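
For what it's worth, the flags people usually set around a full shutdown look
something like this (a sketch, not an official procedure):

# before shutting down: stop clients, then
ceph osd set noout
ceph osd set norebalance
ceph osd set norecover
ceph osd set nobackfill
# shut down OSDs, then MDS/MGR, then mons last

# after power-up (mons first, then OSDs), clear the flags
ceph osd unset nobackfill
ceph osd unset norecover
ceph osd unset norebalance
ceph osd unset noout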

On Mon, May 7, 2018 at 9:48 PM Bryan Henderson 
wrote:

> There is a lot of advice around on shutting down a Ceph cluster that says
> to shut down the OSDs before the monitors and bring up the monitors before
> the OSDs, but no one explains why.
>
> I would have thought it would be better to shut down the monitors first and
> bring them up last, so they don't have to witness all the interim states
> with
> OSDs down.  And it should make the noout, nodown, etc. settings
> unnecessary.
>
> So what am I missing?
>
> Also, how much difference does it really make?  Ceph is obviously designed
> to
> tolerate any sequence of failures and recoveries of nodes, so how much risk
> would I be taking if I just haphazardly killed everything instead of
> orchestrating a shutdown?
>
> --
> Bryan Henderson   San Jose, California
>


Re: [ceph-users] Object storage share 'archive' bucket best practice

2018-05-08 Thread David Turner
Didn't mean to hit send on that quite yet, but that's the gist of
everything you need to do.  There is nothing special about this for RGW vs
AWS, except that AWS can set this permission on a whole bucket while in RGW
you need to do this on each object when you upload it.
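
Concretely, that looks something like this with s3cmd (bucket and grantee
names are placeholders):

# grant read access per object at upload time
s3cmd put backup.tar.gz s3://archive/backup.tar.gz --acl-grant=read:readonly_user

# or fix up objects already in the bucket
s3cmd setacl s3://archive --acl-grant=read:readonly_user --recursive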

On Tue, May 8, 2018 at 12:42 PM David Turner  wrote:

> Something simple like `s3cmd put file s3://bucket/file --acl-public`
>
> On Sat, May 5, 2018 at 6:36 AM Marc Roos  wrote:
>
>>
>>
>> What would be the best way to implement a situation where:
>>
>> I would like to archive some files in lets say an archive bucket and use
>> a read/write account for putting the files. Then give other users only
>> read access to this bucket so they can download something if necessary?
>>
>> All using some basic s3 client.
>>
>>
>>
>>
>


Re: [ceph-users] Object storage share 'archive' bucket best practice

2018-05-08 Thread David Turner
Something simple like `s3cmd put file s3://bucket/file --acl-public`

On Sat, May 5, 2018 at 6:36 AM Marc Roos  wrote:

>
>
> What would be the best way to implement a situation where:
>
> I would like to archive some files in lets say an archive bucket and use
> a read/write account for putting the files. Then give other users only
> read access to this bucket so they can download something if necessary?
>
> All using some basic s3 client.
>
>
>
>


Re: [ceph-users] network change

2018-05-08 Thread John Spray
On Tue, May 8, 2018 at 3:50 PM, James Mauro  wrote:
> (newbie warning - my first go-round with ceph, doing a lot of reading)
>
> I have a small Ceph cluster, four storage nodes total, three dedicated to
> data (OSD’s) and one for metadata. One client machine.
>
> I made a network change. When I installed and configured the cluster, it was
> done
> using the system’s 10Gb interface information. I now have everything on a
> 100Gb network (IB in Ethernet mode).
>
> My question is, what is the most expedient way for me to change the ceph
> config
> such that all nodes are using the 100Gb network? Can I shut down the
> cluster,
> edit one or more .conf files and restart, or do I need to re-configure from
> scratch?

The key part is changing the monitors' addresses:
http://docs.ceph.com/docs/luminous/rados/operations/add-or-rm-mons/#changing-a-monitor-s-ip-address

Once your mons are happy with their new addresses, you can just update
ceph.conf for the rest of the services (including OSDs).
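
If you end up needing the "messy" variant from that page (all mons changing
address at once), it boils down to editing and re-injecting the monmap on
each mon host while the mons are stopped, roughly like this (a sketch; the
new address is a placeholder):

ceph-mon -i dgx-srv-04 --extract-monmap /tmp/monmap
monmaptool --print /tmp/monmap
monmaptool --rm dgx-srv-04 /tmp/monmap
monmaptool --add dgx-srv-04 192.168.100.46:6789 /tmp/monmap   # new 100Gb address (placeholder)
ceph-mon -i dgx-srv-04 --inject-monmap /tmp/monmap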

By the way, I notice you're on an 11.x version, which is EOL.  It would
be wise to update your cluster to the 12.x ("luminous") stable series
before doing the address updates; that way, if you run into any issues,
you'll be using a version that's better tested and more familiar to
everyone.

John

>
> Thanks
> Jim
>
>
> cepher@srv-01:~$ sudo ceph --version
> ceph version 11.2.1 (e0354f9d3b1eea1d75a7dd487ba8098311be38a7)
> cepher@srv-01:~$ sudo ceph -s
> cluster f201e454-9c73-4b29-abe1-48dd609266a6
>  health HEALTH_OK
>  monmap e4: 3 mons at
> {dgx-srv-04=10.33.3.46:6789/0,dgx-srv-05=10.33.3.48:6789/0,dgx-srv-06=10.33.3.50:6789/0}
> election epoch 12, quorum 0,1,2 dgx-srv-04,dgx-srv-05,dgx-srv-06
>   fsmap e5: 1/1/1 up {0=dgx-srv-03=up:active}
> mgr active: dgx-srv-06 standbys: dgx-srv-04, dgx-srv-05
>  osdmap e114: 18 osds: 18 up, 18 in
> flags sortbitwise,require_jewel_osds,require_kraken_osds
>   pgmap v7946: 3072 pgs, 3 pools, 2148 bytes data, 20 objects
> 99684 MB used, 26717 GB / 26814 GB avail
> 3072 active+clean
> cepher@srv-01:~$ uname -a
> Linux srv-01 4.4.0-116-generic #140-Ubuntu SMP Mon Feb 12 21:23:04 UTC 2018
> x86_64 x86_64 x86_64 GNU/Linux
> 
> 
>
>


[ceph-users] network change

2018-05-08 Thread James Mauro
(newbie warning - my first go-round with ceph, doing a lot of reading)

I have a small Ceph cluster, four storage nodes total: three dedicated to
data (OSDs) and one for metadata. One client machine.

I made a network change. When I installed and configured the cluster, it was 
done
using the system’s 10Gb interface information. I now have everything on a
100Gb network (IB in Ethernet mode).

My question is, what is the most expedient way for me to change the ceph config
such that all nodes are using the 100Gb network? Can I shut down the cluster,
edit one or more .conf files and restart, or do I need to re-configure from 
scratch?

Thanks
Jim


cepher@srv-01:~$ sudo ceph --version
ceph version 11.2.1 (e0354f9d3b1eea1d75a7dd487ba8098311be38a7)
cepher@srv-01:~$ sudo ceph -s
cluster f201e454-9c73-4b29-abe1-48dd609266a6
 health HEALTH_OK
 monmap e4: 3 mons at 
{dgx-srv-04=10.33.3.46:6789/0,dgx-srv-05=10.33.3.48:6789/0,dgx-srv-06=10.33.3.50:6789/0}
election epoch 12, quorum 0,1,2 dgx-srv-04,dgx-srv-05,dgx-srv-06
  fsmap e5: 1/1/1 up {0=dgx-srv-03=up:active}
mgr active: dgx-srv-06 standbys: dgx-srv-04, dgx-srv-05
 osdmap e114: 18 osds: 18 up, 18 in
flags sortbitwise,require_jewel_osds,require_kraken_osds
  pgmap v7946: 3072 pgs, 3 pools, 2148 bytes data, 20 objects
99684 MB used, 26717 GB / 26814 GB avail
3072 active+clean
cepher@srv-01:~$ uname -a
Linux srv-01 4.4.0-116-generic #140-Ubuntu SMP Mon Feb 12 21:23:04 UTC 2018 
x86_64 x86_64 x86_64 GNU/Linux



Re: [ceph-users] slow requests are blocked

2018-05-08 Thread Jean-Charles Lopez
Hi Grigory,

Are these lines the only lines in your log file for OSD 15?

Just for sanity, what log levels have you set, if any, in your config file away 
from the default? If you set all log levels to 0, like some people do, you may 
want to simply go back to the defaults by commenting out the debug_ lines in 
your config file. If you want to see something more detailed you can indeed 
increase the log level to 5 or 10.

What you can also do is use the admin socket on the machine to see which 
operations are actually blocked: ceph daemon osd.15 dump_ops_in_flight and ceph 
daemon osd.15 dump_historic_ops.

These two commands and their output will show you exactly which operations are 
blocked and will also point you to the other OSDs this OSD is working with to 
serve the IO. Maybe the culprit is actually one of the OSDs handling the 
subops, or it could be a network problem.
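
For example (a sketch, run on the host carrying osd.15; the JSON field names
shown are from Luminous and may vary by version):

ceph daemon osd.15 dump_ops_in_flight | grep -E '"(description|age|flag_point)"'
ceph daemon osd.15 dump_historic_ops

# equivalent form if the CLI cannot locate the admin socket on its own
ceph --admin-daemon /var/run/ceph/ceph-osd.15.asok dump_ops_in_flight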

Regards
JC

> On May 8, 2018, at 03:11, Grigory Murashov  wrote:
> 
> Hello Jean-Charles!
> 
> I have finally catch the problem, It was at 13-02.
> 
> [cephuser@storage-ru1-osd3 ~]$ ceph health detail
> HEALTH_WARN 18 slow requests are blocked > 32 sec
> REQUEST_SLOW 18 slow requests are blocked > 32 sec
> 3 ops are blocked > 65.536 sec
> 15 ops are blocked > 32.768 sec
> osd.15 has blocked requests > 65.536 sec
> [cephuser@storage-ru1-osd3 ~]$
> 
> 
> But surprise - there is no information in ceph-osd.15.log that time
> 
> 
> 2018-05-08 12:54:26.105919 7f003f5f9700  4 rocksdb: (Original Log Time 
> 2018/05/08-12:54:26.105843) EVENT_LOG_v1 {"time_micros": 1525773266105834, 
> "job": 2793, "event": "trivial_move", "dest
> ination_level": 3, "files": 1, "total_files_size": 68316970}
> 2018-05-08 12:54:26.105926 7f003f5f9700  4 rocksdb: (Original Log Time 
> 2018/05/08-12:54:26.105854) 
> [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABL
> E_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.4/rpm/el7/BUILD/ceph-12.2.4/src/rocksdb/db/db_impl_compaction_flush.cc:1537]
>  [default] Moved #1 files to level-3 68316970 bytes OK
> : base level 1 max bytes base 268435456 files[0 4 45 403 722 0 0] max score 
> 0.98
> 
> 2018-05-08 13:07:29.711425 7f004f619700  4 rocksdb: 
> [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/r
> elease/12.2.4/rpm/el7/BUILD/ceph-12.2.4/src/rocksdb/db/db_impl_write.cc:684] 
> reusing log 8051 from recycle list
> 
> 2018-05-08 13:07:29.711497 7f004f619700  4 rocksdb: 
> [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/r
> elease/12.2.4/rpm/el7/BUILD/ceph-12.2.4/src/rocksdb/db/db_impl_write.cc:725] 
> [default] New memtable created with log file: #8089. Immutable memtables: 0.
> 
> 2018-05-08 13:07:29.726107 7f003fdfa700  4 rocksdb: (Original Log Time 
> 2018/05/08-13:07:29.711524) 
> [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABL
> E_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.4/rpm/el7/BUILD/ceph-12.2.4/src/rocksdb/db/db_impl_compaction_flush.cc:1158]
>  Calling FlushMemTableToOutputFile with column family
> [default], flush slots available 1, compaction slots allowed 1, compaction 
> slots scheduled 1
> 2018-05-08 13:07:29.726124 7f003fdfa700  4 rocksdb: 
> [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/r
> elease/12.2.4/rpm/el7/BUILD/ceph-12.2.4/src/rocksdb/db/flush_job.cc:264] 
> [default] [JOB 2794] Flushing memtable with next log file: 8089
> 
> Should I have some deeply logging?
> 
> 
> Grigory Murashov
> Voximplant
> 
> 07.05.2018 18:59, Jean-Charles Lopez пишет:
>> Hi,
>> 
>> ceph health detail
>> 
>> This will tell you which OSDs are experiencing the problem so you can then 
>> go and inspect the logs and use the admin socket to find out which requests 
>> are at the source.
>> 
>> Regards
>> JC
>> 
>>> On May 7, 2018, at 03:52, Grigory Murashov  wrote:
>>> 
>>> Hello!
>>> 
>>> I'm not much experiensed in ceph troubleshouting that why I ask for help.
>>> 
>>> I have multiple warnings coming from zabbix as a result of ceph -s
>>> 
>>> REQUEST_SLOW: HEALTH_WARN : 21 slow requests are blocked > 32 sec
>>> 
>>> I don't see any hardware problems that time.
>>> 
>>> I'm able to find the same strings in ceph.log and ceph-mon.log like
>>> 
>>> 2018-05-07 12:37:57.375546 7f3037dae700  0 log_channel(cluster) log [WRN] : 
>>> Health check failed: 12 slow requests are blocked > 32 sec (REQUEST_SLOW)
>>> 
>>> Now It's important to find out the root of the problem.
>>> 
>>> How to find out:
>>> 
>>> 1. which OSDs are affected
>>> 
>>> 2. which particular requests were slowed and blocked?
>>> 
>>> I assume I need more detailed logging - how to do that?
>>> 
>>> Appreciate your help.
>>> 
>>> -- 
>>> Grigory Murashov
>>> 
>>> 

Re: [ceph-users] Deleting an rbd image hangs

2018-05-08 Thread Jason Dillaman
Perhaps the image had associated snapshots? Deleting the object
doesn't delete the associated snapshots, so those objects will remain
until the snapshot is removed. However, if you have removed the RBD
header, the snapshot id is now gone.
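
In other words, the safe order is to deal with snapshots while the image
header still exists, something like the following (a sketch; pool, image and
snapshot names are placeholders):

rbd snap ls rbd/myimage
rbd snap unprotect rbd/myimage@snap1   # only if the snapshot was protected for cloning
rbd snap purge rbd/myimage
rbd rm rbd/myimage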

On Tue, May 8, 2018 at 12:29 AM, Eugen Block  wrote:
> Hi,
>
> I have a similar issue and would also need some advice how to get rid of the
> already deleted files.
>
> Ceph is our OpenStack backend and there was a nova clone without parent
> information. Apparently, the base image had been deleted without a warning
> or anything although there were existing clones.
> Anyway, I tried to delete the respective rbd_data and _header files as
> described in [1]. There were about 700 objects to be deleted, but 255
> objects remained according to the 'rados -p pool ls' command. The attempt to
> delete the rest (again) resulted (and still results) in "No such file or
> directory". After about half an hour later one more object vanished
> (rbd_header file), there are now still 254 objects left in the pool. First I
> thought maybe Ceph will cleanup itself, it just takes some time, but this
> was weeks ago and the number of objects has not changed since then.
>
> I would really appreciate any help.
>
> Regards,
> Eugen
>
>
> Zitat von Jan Marquardt :
>
>
>> Am 30.04.18 um 09:26 schrieb Jan Marquardt:
>>>
>>> Am 27.04.18 um 20:48 schrieb David Turner:

 This old [1] blog post about removing super large RBDs is not relevant
 if you're using object map on the RBDs, however it's method to manually
 delete an RBD is still valid.  You can see if this works for you to
 manually remove the problem RBD you're having.
>>>
>>>
>>> I followed the instructions, but it seems that 'rados -p rbd ls | grep
>>> '^rbd_data.221bf2eb141f2.' | xargs -n 200  rados -p rbd rm' gets stuck,
>>> too. It's running since Friday and still not finished. The rbd image
>>> is/was about 1 TB large.
>>>
>>> Until now the only output was:
>>> error removing rbd>rbd_data.221bf2eb141f2.51d2: (2) No such
>>> file or directory
>>> error removing rbd>rbd_data.221bf2eb141f2.e3f2: (2) No such
>>> file or directory
>>
>>
>> I am still trying to get rid of this. 'rados -p rbd ls' still shows a
>> lot of objects beginning with rbd_data.221bf2eb141f2, but if I try to
>> delete them with 'rados -p rbd rm ' it says 'No such file or
>> directory'. This is not the behaviour I'd expect. Any ideas?
>>
>> Besides this rbd_data.221bf2eb141f2.00016379 is still causing
>> the OSDs crashing, which leaves the cluster unusable for us at the
>> moment. Even if it's just a proof of concept, I'd like to get this fixed
>> without destroying the whole cluster.
>>

 [1] http://cephnotes.ksperis.com/blog/2014/07/04/remove-big-rbd-image

 On Thu, Apr 26, 2018 at 9:25 AM Jan Marquardt >>> > wrote:

 Hi,

 I am currently trying to delete an rbd image which is seemingly
 causing
 our OSDs to crash, but it always gets stuck at 3%.

 root@ceph4:~# rbd rm noc_tobedeleted
 Removing image: 3% complete...

 Is there any way to force the deletion? Any other advices?

 Best Regards

 Jan
>>
>>
>>
>>
>
>
>
>



-- 
Jason


Re: [ceph-users] What is the meaning of size and min_size for erasure-coded pools?

2018-05-08 Thread Janne Johansson
2018-05-08 1:46 GMT+02:00 Maciej Puzio :

> Paul, many thanks for your reply.
> Thinking about it, I can't decide if I'd prefer to operate the storage
> server without redundancy, or have it automatically force a downtime,
> subjecting me to a rage of my users and my boss.
> But I think that the typical expectation is that system serves the
> data while it is able to do so.


If you want to prevent angry bosses, you would have made 10 OSD hosts
or some other large number so that ceph cloud place PGs over more places
so that 2 lost hosts would not impact so much, but also so it can recover
into
each PG into one of the 10 ( minus two broken minus the three that already
hold data you want to spread out) other OSDs and get back into full service
even with two lost hosts.

It's fun to test assumptions and "how low can I go", but if you REALLY
wanted
a cluster with resilience to planned and unplanned maintenance,
you would have redundancy, just like that Raid6 disk box would
presumably have a fair amount of hot and perhaps cold spares nearby to kick
in if lots of disks started go missing.

-- 
May the most significant bit of your life be positive.


Re: [ceph-users] slow requests are blocked

2018-05-08 Thread Grigory Murashov

Hello Jean-Charles!

I have finally caught the problem. It was at 13:02.

[cephuser@storage-ru1-osd3 ~]$ ceph health detail
HEALTH_WARN 18 slow requests are blocked > 32 sec
REQUEST_SLOW 18 slow requests are blocked > 32 sec
    3 ops are blocked > 65.536 sec
    15 ops are blocked > 32.768 sec
    osd.15 has blocked requests > 65.536 sec
[cephuser@storage-ru1-osd3 ~]$


But surprise - there is no information in ceph-osd.15.log for that time:


2018-05-08 12:54:26.105919 7f003f5f9700  4 rocksdb: (Original Log Time 
2018/05/08-12:54:26.105843) EVENT_LOG_v1 {"time_micros": 
1525773266105834, "job": 2793, "event": "trivial_move", "dest

ination_level": 3, "files": 1, "total_files_size": 68316970}
2018-05-08 12:54:26.105926 7f003f5f9700  4 rocksdb: (Original Log Time 
2018/05/08-12:54:26.105854) 
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABL
E_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.4/rpm/el7/BUILD/ceph-12.2.4/src/rocksdb/db/db_impl_compaction_flush.cc:1537] 
[default] Moved #1 files to level-3 68316970 bytes OK
: base level 1 max bytes base 268435456 files[0 4 45 403 722 0 0] max 
score 0.98


2018-05-08 13:07:29.711425 7f004f619700  4 rocksdb: 
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/r
elease/12.2.4/rpm/el7/BUILD/ceph-12.2.4/src/rocksdb/db/db_impl_write.cc:684] 
reusing log 8051 from recycle list


2018-05-08 13:07:29.711497 7f004f619700  4 rocksdb: 
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/r
elease/12.2.4/rpm/el7/BUILD/ceph-12.2.4/src/rocksdb/db/db_impl_write.cc:725] 
[default] New memtable created with log file: #8089. Immutable memtables: 0.


2018-05-08 13:07:29.726107 7f003fdfa700  4 rocksdb: (Original Log Time 
2018/05/08-13:07:29.711524) 
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABL
E_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.4/rpm/el7/BUILD/ceph-12.2.4/src/rocksdb/db/db_impl_compaction_flush.cc:1158] 
Calling FlushMemTableToOutputFile with column family
[default], flush slots available 1, compaction slots allowed 1, 
compaction slots scheduled 1
2018-05-08 13:07:29.726124 7f003fdfa700  4 rocksdb: 
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/r
elease/12.2.4/rpm/el7/BUILD/ceph-12.2.4/src/rocksdb/db/flush_job.cc:264] 
[default] [JOB 2794] Flushing memtable with next log file: 8089


Should I enable some deeper logging?


Grigory Murashov
Voximplant

On 07.05.2018 at 18:59, Jean-Charles Lopez wrote:

Hi,

ceph health detail

This will tell you which OSDs are experiencing the problem so you can then go 
and inspect the logs and use the admin socket to find out which requests are at 
the source.

Regards
JC


On May 7, 2018, at 03:52, Grigory Murashov  wrote:

Hello!

I'm not much experiensed in ceph troubleshouting that why I ask for help.

I have multiple warnings coming from zabbix as a result of ceph -s

REQUEST_SLOW: HEALTH_WARN : 21 slow requests are blocked > 32 sec

I don't see any hardware problems that time.

I'm able to find the same strings in ceph.log and ceph-mon.log like

2018-05-07 12:37:57.375546 7f3037dae700  0 log_channel(cluster) log [WRN] : Health 
check failed: 12 slow requests are blocked > 32 sec (REQUEST_SLOW)

Now It's important to find out the root of the problem.

How to find out:

1. which OSDs are affected

2. which particular requests were slowed and blocked?

I assume I need more detailed logging - how to do that?

Appreciate your help.

--
Grigory Murashov





Re: [ceph-users] cephfs-data-scan safety on active filesystem

2018-05-08 Thread John Spray
On Mon, May 7, 2018 at 8:50 PM, Ryan Leimenstoll
 wrote:
> Hi All,
>
> We recently experienced a failure with our 12.2.4 cluster running a CephFS
> instance that resulted in some data loss due to a seemingly problematic OSD
> blocking IO on its PGs. We restarted the (single active) mds daemon during
> this, which caused damage due to the journal not having the chance to flush
> back. We reset the journal, session table, and fs to bring the filesystem
> online. We then removed some directories/inodes that were causing the
> cluster to report damaged metadata (and were otherwise visibly broken by
> navigating the filesystem).

This may be over-optimistic of me, but is there any chance you kept a
detailed record of exactly what damage was reported, and what you did
to the filesystem so far?  It's hard to give any intelligent advice on
repairing it, when we don't know exactly what was broken, and a bunch
of unknown repair-ish things have already manipulated the metadata
behind the scenes.
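For reference, the MDS keeps a damage table that can be dumped before any repair is
attempted; something along these lines, where mds.<name> is the active daemon:

  ceph tell mds.<name> damage ls    # dump damaged dentries/dirfrags/backtraces as JSON

Saving that output, together with the exact cephfs-journal-tool / cephfs-table-tool
commands that were run, is the kind of record that makes later repair advice possible.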

John

> With that, there are now some paths that seem to have been orphaned (which
> we expected). We did not run the ‘cephfs-data-scan’ tool [0] in the name of
> getting the system back online ASAP. Now that the filesystem is otherwise
> stable, can we initiate a scan_links operation with the mds active safely?
>
> [0]
> http://docs.ceph.com/docs/luminous/cephfs/disaster-recovery/#recovery-from-missing-metadata-objects
>
> Thanks much,
> Ryan Leimenstoll
> rleim...@umiacs.umd.edu
> University of Maryland Institute for Advanced Computer Studies
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous : mark_unfound_lost for EC pool

2018-05-08 Thread nokia ceph
Thank you, it works.

On Tue, May 8, 2018 at 2:05 PM, Paul Emmerich 
wrote:

> EC pools only support deleting unfound objects as there aren't multiple
> copies around that could be reverted to.
>
> ceph pg <pgid> mark_unfound_lost delete
>
>
> Paul
>
> 2018-05-08 9:26 GMT+02:00 nokia ceph :
>
>> Hi Team,
>>
>> I was trying to forcefully mark the unfound objects as lost using the
>> commands below, as mentioned in the documentation, but it is not working in
>> the latest release. Are there any prerequisites for EC pools?
>>
>> cn1.chn6m1c1ru1c1.cdn ~# ceph pg 4.1206 mark_unfound_lost revert|delete
>> -bash: delete: command not found
>> Error EINVAL: mode must be 'delete' for ec pool
>> cn1.chn6m1c1ru1c1.cdn ~# ceph pg 4.1206 mark_unfound_lost
>>
>> Invalid command: missing required parameter mulcmd(revert|delete)
>> pg <pgid> mark_unfound_lost revert|delete :  mark all unfound objects
>> in this pg as lost, either removing or reverting to a prior version if one
>> is available
>> Error EINVAL: invalid command
>> cn1.chn6m1c1ru1c1.cdn ~#
>> cn1.chn6m1c1ru1c1.cdn ~# ceph pg 4.1206 mark_unfound_lost revert|delete
>> -bash: delete: command not found
>> Error EINVAL: mode must be 'delete' for ec pool
>>
>> Thanks,
>> Muthu
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
>
> --
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 
> 81247 München
> 
> www.croit.io
> Tel: +49 89 1896585 90
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What is the meaning of size and min_size for erasure-coded pools?

2018-05-08 Thread Paul Emmerich
It's a very bad idea to accept data if you can't guarantee that it will be
stored in a way that tolerates a disk outage
without data loss. Just don't.

Increase the number of coding chunks to 3 if you want to withstand two
simultaneous disk
failures without impacting availability.
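As a rough sketch (profile name, pool name and PG counts are just examples), a layout
that survives two failures while still accepting writes would look like:

  ceph osd erasure-code-profile set ec-3-3 k=3 m=3 crush-failure-domain=host
  ceph osd pool create ecpool 128 128 erasure ec-3-3
  ceph osd pool get ecpool min_size     # usually defaults to k+1, i.e. 4 here

With k=3/m=3 and min_size=4 you can lose two hosts and the pool keeps serving I/O;
a third simultaneous failure makes the affected PGs inactive rather than risking
data loss.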


Paul


2018-05-08 1:46 GMT+02:00 Maciej Puzio :

> Paul, many thanks for your reply.
> Thinking about it, I can't decide if I'd prefer to operate the storage
> server without redundancy, or have it automatically force a downtime,
> subjecting me to the rage of my users and my boss.
> But I think the typical expectation is that the system serves the
> data while it is able to do so. Since Ceph by default does otherwise,
> may I suggest that this be explained in the docs? As things are now, I
> needed a trial-and-error approach to figure out why ceph was not
> working in a setup that I think was hardly exotic, and in fact
> resembled an ordinary RAID 6.
>
> Which leaves us with a mishmash of PG states. Is it normal? If not,
> would I have avoided it if I created the pool with min_size=k=3 from
> the start? In other words, does min_size influence the assignment of
> PGs to OSDs? Or is it only used to force I/O shutdown in the event of
> OSD failures?
>
> Thank you very much
>
> Maciej Puzio
>
>
> On Mon, May 7, 2018 at 5:00 PM, Paul Emmerich 
> wrote:
> > The docs seem wrong here. min_size is available for erasure coded pools
> and
> > works like you'd expect it to work.
> > Still, it's not a good idea to reduce it to the number of data chunks.
> >
> >
> > Paul
> >
> > --
> > --
> > Paul Emmerich
> >
> > Looking for help with your Ceph cluster? Contact us at https://croit.io
> >
> > croit GmbH
> > Freseniusstr. 31h
> > 81247 München
> > www.croit.io
> > Tel: +49 89 1896585 90
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous : mark_unfound_lost for EC pool

2018-05-08 Thread Paul Emmerich
EC pools only support deleting unfound objects as there aren't multiple
copies around that could be reverted to.

ceph pg <pgid> mark_unfound_lost delete
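A minimal sequence would be something like this (4.1206 is just the PG from your
output; on Luminous the listing subcommand is list_missing, if I remember correctly):

  ceph health detail | grep unfound         # find the PGs reporting unfound objects
  ceph pg 4.1206 list_missing               # inspect which objects are unfound
  ceph pg 4.1206 mark_unfound_lost delete   # EC pools only accept 'delete' here

Also note that revert|delete in the help text is an either/or choice; typing the pipe
character literally into the shell is what produced the
"-bash: delete: command not found" message.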


Paul

2018-05-08 9:26 GMT+02:00 nokia ceph :

> Hi Team,
>
> I was trying to forcefully mark the unfound objects as lost using the
> commands below, as mentioned in the documentation, but it is not working in
> the latest release. Are there any prerequisites for EC pools?
>
> cn1.chn6m1c1ru1c1.cdn ~# ceph pg 4.1206 mark_unfound_lost revert|delete
> -bash: delete: command not found
> Error EINVAL: mode must be 'delete' for ec pool
> cn1.chn6m1c1ru1c1.cdn ~# ceph pg 4.1206 mark_unfound_lost
>
> Invalid command: missing required parameter mulcmd(revert|delete)
> pg <pgid> mark_unfound_lost revert|delete :  mark all unfound objects
> in this pg as lost, either removing or reverting to a prior version if one
> is available
> Error EINVAL: invalid command
> cn1.chn6m1c1ru1c1.cdn ~#
> cn1.chn6m1c1ru1c1.cdn ~# ceph pg 4.1206 mark_unfound_lost revert|delete
> -bash: delete: command not found
> Error EINVAL: mode must be 'delete' for ec pool
>
> Thanks,
> Muthu
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 
-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deleting an rbd image hangs

2018-05-08 Thread Eugen Block

Hi,

I have a similar issue and would also need some advice on how to get rid
of the already deleted objects.


Ceph is our OpenStack backend and there was a nova clone without
parent information. Apparently, the base image had been deleted
without any warning, although there were existing clones.
Anyway, I tried to delete the respective rbd_data and _header objects as
described in [1]. There were about 700 objects to be deleted, but 255
objects remained according to the 'rados -p pool ls' command. The
attempt to delete the rest (again) resulted (and still results) in "No
such file or directory". About half an hour later one more object
vanished (an rbd_header object), so there are now 254 objects left in
the pool. At first I thought Ceph would clean up by itself and just
needed some time, but that was weeks ago and the number of objects has
not changed since then.
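A sketch of what I plan to check next (pool name and object prefix are placeholders
for the real ones):

  rados -p <pool> ls | grep '^rbd_data.<image_prefix>.' | wc -l    # count the leftovers
  rados -p <pool> stat rbd_data.<image_prefix>.0000000000000000
  rados -p <pool> listsnaps rbd_data.<image_prefix>.0000000000000000

If stat also returns "No such file or directory" while ls keeps listing the names, I
suspect the remaining entries only exist as snapshot clones; listsnaps should show
whether that is the case.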


I would really appreciate any help.

Regards,
Eugen


Quoting Jan Marquardt :


Am 30.04.18 um 09:26 schrieb Jan Marquardt:

Am 27.04.18 um 20:48 schrieb David Turner:

This old [1] blog post about removing super large RBDs is not relevant
if you're using object map on the RBDs, however it's method to manually
delete an RBD is still valid.  You can see if this works for you to
manually remove the problem RBD you're having.


I followed the instructions, but it seems that 'rados -p rbd ls | grep
'^rbd_data.221bf2eb141f2.' | xargs -n 200  rados -p rbd rm' gets stuck,
too. It has been running since Friday and is still not finished. The rbd image
is/was about 1 TB in size.

Until now the only output was:
error removing rbd>rbd_data.221bf2eb141f2.51d2: (2) No such
file or directory
error removing rbd>rbd_data.221bf2eb141f2.e3f2: (2) No such
file or directory


I am still trying to get rid of this. 'rados -p rbd ls' still shows a
lot of objects beginning with rbd_data.221bf2eb141f2, but if I try to
delete them with 'rados -p rbd rm ' it says 'No such file or
directory'. This is not the behaviour I'd expect. Any ideas?

Besides this, rbd_data.221bf2eb141f2.00016379 is still causing
the OSDs to crash, which leaves the cluster unusable for us at the
moment. Even if it's just a proof of concept, I'd like to get this fixed
without destroying the whole cluster.



[1] http://cephnotes.ksperis.com/blog/2014/07/04/remove-big-rbd-image

On Thu, Apr 26, 2018 at 9:25 AM Jan Marquardt <j...@artfiles.de> wrote:

Hi,

I am currently trying to delete an rbd image which is seemingly causing
our OSDs to crash, but it always gets stuck at 3%.

root@ceph4:~# rbd rm noc_tobedeleted
Removing image: 3% complete...

Is there any way to force the deletion? Any other advices?

Best Regards

Jan




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Luminous : mark_unfound_lost for EC pool

2018-05-08 Thread nokia ceph
Hi Team,

I was trying to forcefully mark the unfound objects as lost using the
commands below, as mentioned in the documentation, but it is not working in the
latest release. Are there any prerequisites for EC pools?

cn1.chn6m1c1ru1c1.cdn ~# ceph pg 4.1206 mark_unfound_lost revert|delete
-bash: delete: command not found
Error EINVAL: mode must be 'delete' for ec pool
cn1.chn6m1c1ru1c1.cdn ~# ceph pg 4.1206 mark_unfound_lost

Invalid command: missing required parameter mulcmd(revert|delete)
pg <pgid> mark_unfound_lost revert|delete :  mark all unfound objects in
this pg as lost, either removing or reverting to a prior version if one is
available
Error EINVAL: invalid command
cn1.chn6m1c1ru1c1.cdn ~#
cn1.chn6m1c1ru1c1.cdn ~# ceph pg 4.1206 mark_unfound_lost revert|delete
-bash: delete: command not found
Error EINVAL: mode must be 'delete' for ec pool

Thanks,
Muthu
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com