Re: [ceph-users] Ceph pg active+clean+inconsistent

2016-12-22 Thread Shinobu Kinjo
Would you be able to execute ``ceph pg ${PG ID} query`` against that
particular PG?
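For reference, a sketch of the commands involved (the PG id 6.92c is taken from the repair log quoted below; "rados list-inconsistent-obj" is available in Jewel and only reports something after a scrub has recorded the error):

ceph pg 6.92c query
rados list-inconsistent-obj 6.92c --format=json-pretty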

On Wed, Dec 21, 2016 at 11:44 PM, Andras Pataki
 wrote:
> Yes, size = 3, and I have checked that all three replicas are the same zero
> length object on the disk.  I think some metadata info is mismatching what
> the OSD log refers to as "object info size".  But I'm not sure what to do
> about it.  pg repair does not fix it.  In fact, the file this object
> corresponds to in CephFS is shorter so this chunk shouldn't even exist I
> think (details are in the original email).  Although I may be understanding
> the situation wrong ...
>
> Andras
>
>
> On 12/21/2016 07:17 AM, Mehmet wrote:
>
> Hi Andras,
>
> I'm not the most experienced user, but I guess you could have a look at this
> object on each OSD related to the PG, compare them, and delete the differing
> object. I assume you have size = 3.
>
> Then run pg repair again.
>
> But be careful: IIRC the replica will be recovered from the primary PG.
>
> Hth
>
> On 20 December 2016 22:39:44 CET, Andras Pataki wrote:
>>
>> Hi cephers,
>>
>> Any ideas on how to proceed on the inconsistencies below?  At the moment
>> our ceph setup has 5 of these - in all cases it seems like some zero length
>> objects that match across the three replicas, but do not match the object
>> info size.  I tried running pg repair on one of them, but it didn't repair
>> the problem:
>>
>> 2016-12-20 16:24:40.870307 7f3e1a4b1700  0 log_channel(cluster) log [INF]
>> : 6.92c repair starts
>> 2016-12-20 16:27:06.183186 7f3e1a4b1700 -1 log_channel(cluster) log [ERR]
>> : repair 6.92c 6:34932257:::1000187bbb5.0009:head on disk size (0) does
>> not match object info size (3014656) adjusted for ondisk to (3014656)
>> 2016-12-20 16:27:35.885496 7f3e17cac700 -1 log_channel(cluster) log [ERR]
>> : 6.92c repair 1 errors, 0 fixed
>>
>>
>> Any help/hints would be appreciated.
>>
>> Thanks,
>>
>> Andras
>>
>>
>> On 12/15/2016 10:13 AM, Andras Pataki wrote:
>>
>> Hi everyone,
>>
>> Yesterday scrubbing turned up an inconsistency in one of our placement
>> groups.  We are running ceph 10.2.3, using CephFS and RBD for some VM
>> images.
>>
>> [root@hyperv017 ~]# ceph -s
>> cluster d7b33135-0940-4e48-8aa6-1d2026597c2f
>>  health HEALTH_ERR
>> 1 pgs inconsistent
>> 1 scrub errors
>> noout flag(s) set
>>  monmap e15: 3 mons at
>> {hyperv029=10.4.36.179:6789/0,hyperv030=10.4.36.180:6789/0,hyperv031=10.4.36.181:6789/0}
>> election epoch 27192, quorum 0,1,2
>> hyperv029,hyperv030,hyperv031
>>   fsmap e17181: 1/1/1 up {0=hyperv029=up:active}, 2 up:standby
>>  osdmap e342930: 385 osds: 385 up, 385 in
>> flags noout
>>   pgmap v37580512: 34816 pgs, 5 pools, 673 TB data, 198 Mobjects
>> 1583 TB used, 840 TB / 2423 TB avail
>>34809 active+clean
>>4 active+clean+scrubbing+deep
>>2 active+clean+scrubbing
>>1 active+clean+inconsistent
>>   client io 87543 kB/s rd, 671 MB/s wr, 23 op/s rd, 2846 op/s wr
>>
>> # ceph pg dump | grep inconsistent
>> 6.13f1  46920   0   0   0   16057314767 3087    3087
>> active+clean+inconsistent 2016-12-14 16:49:48.391572  342929'41011
>> 342929:43966 [158,215,364]   158 [158,215,364]   158 342928'40540
>> 2016-12-14 16:49:48.391511  342928'40540    2016-12-14 16:49:48.391511
>>
>> I tried a couple of other deep scrubs on pg 6.13f1 but got repeated
>> errors.  In the OSD logs:
>>
>> 2016-12-14 16:48:07.733291 7f3b56e3a700 -1 log_channel(cluster) log [ERR]
>> : deep-scrub 6.13f1 6:8fc91b77:::1000187bb70.0009:head on disk size (0)
>> does not match object info size (1835008) adjusted for ondisk to (1835008)
>> I looked at the objects on the 3 OSD's on their respective hosts and they
>> are the same, zero length files:
>>
>> # cd ~ceph/osd/ceph-158/current/6.13f1_head
>> # find . -name *1000187bb70* -ls
>> 6697380 -rw-r--r--   1 ceph ceph0 Dec 13 17:00
>> ./DIR_1/DIR_F/DIR_3/DIR_9/DIR_8/1000187bb70.0009__head_EED893F1__6
>>
>> # cd ~ceph/osd/ceph-215/current/6.13f1_head
>> # find . -name *1000187bb70* -ls
>> 5398156470 -rw-r--r--   1 ceph ceph0 Dec 13 17:00
>> ./DIR_1/DIR_F/DIR_3/DIR_9/DIR_8/1000187bb70.0009__head_EED893F1__6
>>
>> # cd ~ceph/osd/ceph-364/current/6.13f1_head
>> # find . -name *1000187bb70* -ls
>> 18814322150 -rw-r--r--   1 ceph ceph0 Dec 13 17:00
>> ./DIR_1/DIR_F/DIR_3/DIR_9/DIR_8/1000187bb70.0009__head_EED893F1__6
>>
>> At the time of the write, there wasn't anything unusual going on as far as
>> I can tell (no hardware/network issues, all processes were up, etc).
>>
>> This pool is a CephFS data pool, and the corresponding file (inode hex
>> 1000187bb70, decimal 1099537300336) looks like this:
>>
>> # ls -li chr4.tags.tsv
>> 1099537300336 -rw-r--r-- 1 

Re: [ceph-users] Can't create bucket (ERROR: endpoints not configured for upstream zone)

2016-12-22 Thread Ben Hines
FWIW, this is still required with Jewel 10.2.5. From the release notes it
sounded like it was finally fixed, but I had the same issue. Fortunately
Micha's steps are easy and fix it right up.

In my case I didn't think I had any mixed RGWs - I was planning to stop them
all first - but I had forgotten about my monitoring system, which runs
'radosgw-admin'; that part was upgraded first, before I'd stopped any of my
Infernalis RGWs.
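A rough sketch of the kind of pre-upgrade check that could have caught this (host names are placeholders; it assumes SSH access and that 'radosgw-admin --version' reports the locally installed version):

for h in rgw1 rgw2 monitoring1; do
    echo "== $h =="
    ssh "$h" 'pgrep -a radosgw; radosgw-admin --version'
done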

-Ben

On Thu, Jul 28, 2016 at 7:50 AM, Arvydas Opulskis <
arvydas.opuls...@adform.com> wrote:

> Hi,
>
> We solved it by running Micha's scripts, plus we needed to run the period update
> and commit commands (for some reason we had to do them as separate commands):
>
> radosgw-admin period update
> radosgw-admin period commit
>
> Btw, we added endpoints to json file, but I am not sure these are needed.
>
> And I agree with Micha - this should be noted in the upgrade instructions on
> the Ceph site. We ran into this trap on our prod env (upgrading Infernalis ->
> Jewel). Maybe we should test more next time..
>
> Br,
> Arvydas
>
>
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Micha Krause
> Sent: Wednesday, July 6, 2016 2:46 PM
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Can't create bucket (ERROR: endpoints not
> configured for upstream zone)
>
> Hi,
>
> I think I found a solution for my problem; here are my findings:
>
>
> This bug can easily be reproduced in a test environment:
>
> 1. Delete all rgw related pools.
> 2. Start infernalis radosgw to initialize them again.
> 3. Create user.
> 4. User creates bucket.
> 5. Upgrade radosgw to jewel
> 6. User creates bucket -> fail
>
> I found this scary script from Yehuda:
> https://raw.githubusercontent.com/yehudasa/ceph/wip-fix-default-zone/src/fix-zone
> which needs to be modified according to
> http://www.spinics.net/lists/ceph-users/msg27957.html.
>
> After the modification, a lot of the script becomes obsolete (in my
> opinion), and can be rewritten to this (less scary):
>
>
> #!/bin/sh
>
> set -x
>
> RADOSGW_ADMIN=radosgw-admin
>
> echo "Exercise initialization code"
> $RADOSGW_ADMIN user info --uid=foo # exercise init code (???)
>
> echo "Get default zonegroup"
> $RADOSGW_ADMIN zonegroup get --rgw-zonegroup=default | sed
> 's/"id":.*/"id": "default",/g' | sed 's/"master_zone.*/"master_zone":
> "default",/g' > default-zg.json
>
> echo "Get default zone"
> $RADOSGW_ADMIN zone get --zone-id=default > default-zone.json
>
> echo "Creating realm"
> $RADOSGW_ADMIN realm create --rgw-realm=myrealm
>
> echo "Creating default zonegroup"
> $RADOSGW_ADMIN zonegroup set --rgw-zonegroup=default < default-zg.json
>
> echo "Creating default zone"
> $RADOSGW_ADMIN zone set --rgw-zone=default < default-zone.json
>
> echo "Setting default zonegroup to 'default'"
> $RADOSGW_ADMIN zonegroup default --rgw-zonegroup=default
>
> echo "Setting default zone to 'default'"
> $RADOSGW_ADMIN zone default --rgw-zone=default
>
>
> My plan to do this in production is now:
>
> 1. Stop all rados-gateways
> 2. Upgrade rados-gateways to jewel
> 3. Run less scary script
> 4. Start rados-gateways
>
> This whole thing is a serious problem, there should at least be a clear
> notice in the Jewel release notes about this. I was lucky to catch this in
> my test-cluster, I'm sure a lot of people will run into this in production.
>
>
> Micha Krause
>
>
> On 05.07.2016 at 09:30, Micha Krause wrote:
> > *bump*
> >
> > On 01.07.2016 at 13:00, Micha Krause wrote:
> >> Hi,
> >>
> >>  > In Infernalis there was this command:
> >>>
> >>> radosgw-admin regions list
> >>>
> >>> But this is missing in Jewel.
> >>
> >> Ok, I just found out that this was renamed to zonegroup list:
> >>
> >> root@rgw01:~ # radosgw-admin --id radosgw.rgw zonegroup list
> >> read_default_id : -2 {
> >>  "default_info": "",
> >>  "zonegroups": [
> >>  "default"
> >>  ]
> >> }
> >>
> >> This looks to me like there is indeed only one zonegroup or region
> configured.
> >>
> >> Micha Krause
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] radosgw setup issue

2016-12-22 Thread Kamble, Nitin A
I am trying to set up radosgw on a ceph cluster, and I am seeing some issues
where Google is not helping. I hope some of the developers will be able to
help here.


I tried to create the radosgw as mentioned here [0] on a Jewel cluster, and it
gives the following error in the log file after starting radosgw.

 
2016-12-22 17:36:46.755786 7f084beeb9c0  0 set uid:gid to 167:167 (ceph:ceph)
2016-12-22 17:36:46.755849 7f084beeb9c0  0 ceph version 10.2.2-118-g894a5f8 
(894a5f8d878d4b267f80b90a4bffce157f2b4ba7), process radosgw, pid 10092
2016-12-22 17:36:46.763821 7f084beeb9c0  1 -- :/0 messenger.start
2016-12-22 17:36:46.764731 7f084beeb9c0  1 -- :/1011033520 --> 39.0.16.7:6789/0 
-- auth(proto 0 40 bytes epoch 0) v1 -- ?+0 0x7f084c8e9f60 con 0x7f084c8e9480
2016-12-22 17:36:46.765055 7f084beda700  1 -- 39.0.16.9:0/1011033520 learned my 
addr 39.0.16.9:0/1011033520
2016-12-22 17:36:46.765492 7f082a7fc700  1 -- 39.0.16.9:0/1011033520 <== mon.0 
39.0.16.7:6789/0 1  mon_map magic: 0 v1  195+0+0 (146652916 0 0) 
0x7f0814000a60 con 0x7f084c8e9480
2016-12-22 17:36:46.765562 7f082a7fc700  1 -- 39.0.16.9:0/1011033520 <== mon.0 
39.0.16.7:6789/0 2  auth_reply(proto 2 0 (0) Success) v1  33+0+0 
(1206278719 0 0) 0x7f0814000ee0 con 0x7f084c8e9480
2016-12-22 17:36:46.765697 7f082a7fc700  1 -- 39.0.16.9:0/1011033520 --> 
39.0.16.7:6789/0 -- auth(proto 2 32 bytes epoch 0) v1 -- ?+0 0x7f08180013b0 con 
0x7f084c8e9480
2016-12-22 17:36:46.765968 7f082a7fc700  1 -- 39.0.16.9:0/1011033520 <== mon.0 
39.0.16.7:6789/0 3  auth_reply(proto 2 0 (0) Success) v1  222+0+0 
(4230455906 0 0) 0x7f0814000ee0 con 0x7f084c8e9480
2016-12-22 17:36:46.766053 7f082a7fc700  1 -- 39.0.16.9:0/1011033520 --> 
39.0.16.7:6789/0 -- auth(proto 2 181 bytes epoch 0) v1 -- ?+0 0x7f0818001830 
con 0x7f084c8e9480
2016-12-22 17:36:46.766315 7f082a7fc700  1 -- 39.0.16.9:0/1011033520 <== mon.0 
39.0.16.7:6789/0 4  auth_reply(proto 2 0 (0) Success) v1  425+0+0 
(3179848142 0 0) 0x7f0814001180 con 0x7f084c8e9480
2016-12-22 17:36:46.766383 7f082a7fc700  1 -- 39.0.16.9:0/1011033520 --> 
39.0.16.7:6789/0 -- mon_subscribe({monmap=0+}) v2 -- ?+0 0x7f084c8ea440 con 
0x7f084c8e9480
2016-12-22 17:36:46.766452 7f084beeb9c0  1 -- 39.0.16.9:0/1011033520 --> 
39.0.16.7:6789/0 -- mon_subscribe({osdmap=0}) v2 -- ?+0 0x7f084c8ea440 con 
0x7f084c8e9480
2016-12-22 17:36:46.766518 7f082a7fc700  1 -- 39.0.16.9:0/1011033520 <== mon.0 
39.0.16.7:6789/0 5  mon_map magic: 0 v1  195+0+0 (146652916 0 0) 
0x7f0814001110 con 0x7f084c8e9480
2016-12-22 17:36:46.766671 7f08227fc700  2 
RGWDataChangesLog::ChangesRenewThread: start
2016-12-22 17:36:46.766691 7f084beeb9c0 20 get_system_obj_state: 
rctx=0x7ffec2850d00 obj=.rgw.root:default.realm state=0x7f084c8efdf8 
s->prefetch_data=0
2016-12-22 17:36:46.766750 7f082a7fc700  1 -- 39.0.16.9:0/1011033520 <== mon.0 
39.0.16.7:6789/0 6  osd_map(9506..9506 src has 8863..9506) v3  
66915+0+0 (689048617 0 0) 0x7f0814011680 con 0x7f084c8e9480
2016-12-22 17:36:46.767029 7f084beeb9c0  1 -- 39.0.16.9:0/1011033520 --> 
39.0.16.7:6789/0 -- mon_get_version(what=osdmap handle=1) v1 -- ?+0 
0x7f084c8f05f0 con 0x7f084c8e9480
2016-12-22 17:36:46.767163 7f082a7fc700  1 -- 39.0.16.9:0/1011033520 <== mon.0 
39.0.16.7:6789/0 7  mon_get_version_reply(handle=1 version=9506) v2  
24+0+0 (2817198406 0 0) 0x7f0814001110 con 0x7f084c8e9480
2016-12-22 17:36:46.767214 7f084beeb9c0 20 get_system_obj_state: 
rctx=0x7ffec2850210 obj=.rgw.root:default.realm state=0x7f084c8efdf8 
s->prefetch_data=0
2016-12-22 17:36:46.767231 7f084beeb9c0  1 -- 39.0.16.9:0/1011033520 --> 
39.0.16.7:6789/0 -- mon_get_version(what=osdmap handle=2) v1 -- ?+0 
0x7f084c8f0ac0 con 0x7f084c8e9480
2016-12-22 17:36:46.767341 7f082a7fc700  1 -- 39.0.16.9:0/1011033520 <== mon.0 
39.0.16.7:6789/0 8  mon_get_version_reply(handle=2 version=9506) v2  
24+0+0 (1826043941 0 0) 0x7f0814001110 con 0x7f084c8e9480
2016-12-22 17:36:46.767367 7f084beeb9c0 10 could not read realm id: (2) No such 
file or directory
2016-12-22 17:36:46.767390 7f084beeb9c0  1 -- 39.0.16.9:0/1011033520 --> 
39.0.16.7:6789/0 -- mon_get_version(what=osdmap handle=3) v1 -- ?+0 
0x7f084c8efe50 con 0x7f084c8e9480
2016-12-22 17:36:46.767496 7f082a7fc700  1 -- 39.0.16.9:0/1011033520 <== mon.0 
39.0.16.7:6789/0 9  mon_get_version_reply(handle=3 version=9506) v2  
24+0+0 (3600349867 0 0) 0x7f0814001110 con 0x7f084c8e9480
2016-12-22 17:36:46.767518 7f084beeb9c0 10 failed to list objects 
pool_iterate_begin() returned r=-2
2016-12-22 17:36:46.767542 7f084beeb9c0 20 get_system_obj_state: 
rctx=0x7ffec2850420 obj=.rgw.root:zone_names.default state=0x7f084c8f0f38 
s->prefetch_data=0
2016-12-22 17:36:46.767554 7f084beeb9c0  1 -- 39.0.16.9:0/1011033520 --> 
39.0.16.7:6789/0 -- mon_get_version(what=osdmap handle=4) v1 -- ?+0 
0x7f084c8f1630 con 0x7f084c8e9480
2016-12-22 17:36:46.767660 7f082a7fc700  1 -- 39.0.16.9:0/1011033520 <== mon.0 
39.0.16.7:6789/0 10  mon_get_version_reply(handle=4 

Re: [ceph-users] How exactly does rgw work?

2016-12-22 Thread Daniel Gryniewicz

Yes, this is common practice.

Daniel

On 12/22/2016 02:34 PM, Gerald Spencer wrote:

Wonderful, just as I expected. Do folks normally have several RGW
running on individual machines with a load balancer at larger scales?

On Wed, Dec 21, 2016 at 8:22 AM, LOPEZ Jean-Charles > wrote:

Hi Gerald,

for the s3 and swift case, the clients are not accessing the ceph
cluster. They are s3 and swift clients and only discuss with the RGW
over HTTP. The RGW is the ceph client that does all the interaction
with the ceph cluster.

Best
JC


On Dec 21, 2016, at 07:27, Gerald Spencer > wrote:

I was under the impression that when a client talks to the
cluster, it grabs the osd map and computes the crush algorithm to
determine where it stores the object. Does the rgw server do this
for clients? If I had 12 clients all talking through one gateway,
would that server have to pass all of the objects from the clients
to the cluster?


And 48 osd nodes, each with 12 x 6TB drives and a PCIe write
journal. That would be 576 osds in the cluster, with about 3.4PB
raw...


On Tue, Dec 20, 2016 at 1:12 AM Wido den Hollander > wrote:

> On 20 December 2016 at 3:24, Gerald Spencer wrote:
>
> Hello all,
>
> We're currently waiting on a delivery of equipment for a small 50TB proof
> of concept cluster, and I've been lurking/learning a ton from you. Thanks
> for how active everyone is.
>
> Question(s):
> How does the raids gateway work exactly?

The RGW doesn't do any RAID. It chunks up larger objects into
smaller RADOS chunks. The first chunk is always 512k (IIRC)
and then it chunks up into 4MB RADOS objects.

> Does it introduce a single point of failure?

It does if you deploy only one RGW. Always deploy multiple
with loadbalancing in front.

> Does all of the traffic go through the host running the rgw server?

Yes it does.

> I just don't fully understand that side of things. As for architecture our
> poc will have:
> - 1 monitor
> - 4 OSDs with 12 x 6TB drives, 1 x 800 PCIe journal

Underscaled machines, go for less disks per machine but more
machines. More, smaller machines work a lot better with Ceph
than a few big machines.

> If all goes as planned, this will scale up to:
> - 3 monitors

Always run with 3 MONs. Otherwise it is a serious SPOF.

> - 48 osds
>
> This should give us enough storage (~1.2PB) with enough throughput to handle
> the data requirements of our machines to saturate our 100Gb link...

That won't happen with just 4 machines. Replica 3x taken into
account as well. You will need a lot more machines to get the
100Gb link fully utilized.

Wido

> Cheers,
> G
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com






___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How exactly does rgw work?

2016-12-22 Thread Gerald Spencer
Wonderful, just as I expected. Do folks normally have several RGW running
on individual machines with a load balancer at larger scales?

On Wed, Dec 21, 2016 at 8:22 AM, LOPEZ Jean-Charles 
wrote:

> Hi Gerald,
>
> for the s3 and swift case, the clients are not accessing the ceph cluster.
> They are s3 and swift clients and only discuss with the RGW over HTTP. The
> RGW is the ceph client that does all the interaction with the ceph cluster.
>
> Best
> JC
>
> On Dec 21, 2016, at 07:27, Gerald Spencer  wrote:
>
> I was under the impression that when a client talks to the cluster, it
> grabs the osd map and computes the crush algorithm to determine where it
> stores the object. Does the rgw server do this for clients? If I had 12
> clients all talking through one gateway, would that server have to pass all
> of the objects from the clients to the cluster?
>
>
> And 48 osd nodes, each with 12 x 6TB drives and a PCIe write journal. That
> would be 576 osds in the cluster, with about 3.4PB raw...
>
>
> On Tue, Dec 20, 2016 at 1:12 AM Wido den Hollander  wrote:
>
>>
>>
>> > On 20 December 2016 at 3:24, Gerald Spencer <
>> ger.spenc...@gmail.com> wrote:
>>
>> >
>>
>> >
>>
>> > Hello all,
>>
>> >
>>
>> > We're currently waiting on a delivery of equipment for a small 50TB
>> proof
>>
>> > of concept cluster, and I've been lurking/learning a ton from you.
>> Thanks
>>
>> > for how active everyone is.
>>
>> >
>>
>> > Question(s):
>>
>> > How does the raids gateway work exactly?
>>
>>
>>
>> The RGW doesn't do any RAID. It chunks up larger objects into smaller
>> RADOS chunks. The first chunk is always 512k (IIRC) and then it chunks up
>> into 4MB RADOS objects.
>>
>>
>>
>> > Does it introduce a single point of failure?
>>
>>
>>
>> It does if you deploy only one RGW. Always deploy multiple with
>> loadbalancing in front.
>>
>>
>>
>> > Does all of the traffic go through the host running the rgw server?
>>
>>
>>
>> Yes it does.
>>
>>
>>
>> >
>>
>> > I just don't fully understand that side of things. As for architecture
>> our
>>
>> > poc will have:
>>
>> > - 1 monitor
>>
>> > - 4 OSDs with 12 x 6TB drives, 1 x 800 PCIe journal
>>
>> >
>>
>>
>>
>> Underscaled machines, go for less disks per machine but more machines.
>> More, smaller machines work a lot better with Ceph than a few big machines.
>>
>>
>>
>> > If all goes as planned, this will scale up to:
>>
>> > - 3 monitors
>>
>>
>>
>> Always run with 3 MONs. Otherwise it is a serious SPOF.
>>
>>
>>
>> > - 48 osds
>>
>> >
>>
>> > This should give us enough storage (~1.2PB) with enough throughput to
>> handle
>>
>> > the data requirements of our machines to saturate our 100Gb link...
>>
>> >
>>
>>
>>
>> That won't happen with just 4 machines. Replica 3x taken into account as
>> well. You will need a lot more machines to get the 100Gb link fully
>> utilized.
>>
>>
>>
>> Wido
>>
>>
>>
>> >
>>
>> >
>>
>> >
>>
>> >
>>
>> > Cheers,
>>
>> > G
>>
>> > ___
>>
>> > ceph-users mailing list
>>
>> > ceph-users@lists.ceph.com
>>
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cephalocon Sponsorships Open

2016-12-22 Thread Wes Dillingham
I / my group / our organization would be interested in discussing our
deployment of Ceph and how we are using it, deploying it, future plans etc.
This sounds like an exciting event. We look forward to hearing more
details.

On Thu, Dec 22, 2016 at 1:44 PM, Patrick McGarry 
wrote:

> Hey cephers,
>
> Just letting you know that we're opening the flood gates for
> sponsorship opportunities at Cephalocon next year (23-25 Aug 2017,
> Boston, MA). If you would be interested in sponsoring/exhibiting at
> our inaugural Ceph conference, please drop me a line. Thanks!
>
>
> --
>
> Best Regards,
>
> Patrick McGarry
> Director Ceph Community || Red Hat
> http://ceph.com  ||  http://community.redhat.com
> @scuttlemonkey || @ceph
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Respectfully,

Wes Dillingham
wes_dilling...@harvard.edu
Research Computing | Infrastructure Engineer
Harvard University | 38 Oxford Street, Cambridge, Ma 02138 | Room 210
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Cephalocon Sponsorships Open

2016-12-22 Thread Patrick McGarry
Hey cephers,

Just letting you know that we're opening the flood gates for
sponsorship opportunities at Cephalocon next year (23-25 Aug 2017,
Boston, MA). If you would be interested in sponsoring/exhibiting at
our inaugural Ceph conference, please drop me a line. Thanks!


-- 

Best Regards,

Patrick McGarry
Director Ceph Community || Red Hat
http://ceph.com  ||  http://community.redhat.com
@scuttlemonkey || @ceph
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rgw leaking data, orphan search loop

2016-12-22 Thread Orit Wasserman
Hi Marius,

On Thu, Dec 22, 2016 at 12:00 PM, Marius Vaitiekunas
 wrote:
> On Thu, Dec 22, 2016 at 11:58 AM, Marius Vaitiekunas
>  wrote:
>>
>> Hi,
>>
>> 1) I've written to the mailing list before, but one more time: we have had big
>> issues with rgw on Jewel recently because of leaked data - the rate is
>> about 50GB/hour.
>>
>> We've hit these bugs:
>> rgw: fix put_acls for objects starting and ending with underscore
>> (issue#17625, pr#11669, Orit Wasserman)
>>
>> Upgraded to jewel 10.2.5 - no luck.
>>
>> We've also hit this one:
>> rgw: RGW loses realm/period/zonegroup/zone data: period overwritten if
>> somewhere in the cluster is still running Hammer (issue#17371, pr#11519,
>> Orit Wasserman)
>>
>> Fixed zonemaps - also no luck.
>>
>> We do not use multisite - only default realm, zonegroup, zone.
>>
>> We have no more ideas, how these data leak could happen. gc is working -
>> we can see it in rgw logs.
>>
>> Maybe, someone could give any hint about this? Where should we look?
>>
>>
>> 2) Another story is about removing all the leaked/orphan objects.
>> radosgw-admin orphans find enters the loop state on stage when it starts
>> linking objects.
>>
>> We've tried to change the number of shards to 16, 64 (default), 512. At
>> the moment it's running with shards number 1.
>>
>> Again, any ideas how to make orphan search happen?
>>
>>
>> I could provide any logs, configs, etc. if someone is ready to help on
>> this case.
>>
>>

How many buckets do you have? How many objects are in each?
Can you provide the output of rados ls -p .rgw.buckets?
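A sketch of commands that could gather that information (the pool name .rgw.buckets comes from the question above; the bucket name is a placeholder):

radosgw-admin bucket list
radosgw-admin bucket stats --bucket=<bucket-name>
rados ls -p .rgw.buckets > rgw-buckets-objects.txt
wc -l rgw-buckets-objects.txt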

Orit

>
> Sorry. I forgot to mention, that we've registered two issues on tracker:
> http://tracker.ceph.com/issues/18331
> http://tracker.ceph.com/issues/18258
>
> --
> Marius Vaitiekūnas
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How can I debug "rbd list" hang?

2016-12-22 Thread Nick Fisk


From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Stéphane Klein
Sent: 22 December 2016 17:10
To: n...@fisk.me.uk
Cc: ceph-users 
Subject: Re: [ceph-users] How can I debug "rbd list" hang?

2016-12-22 18:07 GMT+01:00 Nick Fisk >:
I think you have probably just answered your previous question. I would guess 
pauserd and pausewr, pauses read and write IO, hence your command to list is 
being blocked on reads.


How can I fix that? Where is the documentation about these two flags?

Try:
ceph osd unset pauserd
ceph osd unset pausewr
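Afterwards, a quick way to verify the flags are gone:

ceph osd dump | grep flags
ceph -s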



Nick Fisk
Technical Support Engineer

System Professional Ltd
The Clove Gallery, Maguire Street,
London SE1 2NQ

tel: 01825 83
fax: 01825 830001
mail: nick.f...@sys-pro.co.uk
web: www.sys-pro.co.uk

IT SUPPORT SERVICES | VIRTUALISATION | STORAGE | BACKUP AND DR | IT CONSULTING

Registered Office:
Wilderness Barns, Wilderness Lane, Hadlow Down, East Sussex, TN22 4HU
Registered in England and Wales.
Company Number: 04754200


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How can I debug "rbd list" hang?

2016-12-22 Thread Stéphane Klein
2016-12-22 18:07 GMT+01:00 Nick Fisk :

> I think you have probably just answered your previous question. I would
> guess pauserd and pausewr, pauses read and write IO, hence your command to
> list is being blocked on reads.
>
>
>

How can I fix that? Where is the documentation about these two flags?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What is pauserd and pausewr status?

2016-12-22 Thread Wido den Hollander

> On 22 December 2016 at 17:55, Stéphane Klein wrote:
> 
> 
> Hi,
> 
> I have this status:
> 
> bash-4.2# ceph status
> cluster 7ecb6ebd-2e7a-44c3-bf0d-ff8d193e03ac
>  health HEALTH_WARN
> pauserd,pausewr,sortbitwise,require_jewel_osds flag(s) set
>  monmap e1: 3 mons at {ceph-mon-1=
> 172.28.128.2:6789/0,ceph-mon-2=172.28.128.3:6789/0,ceph-mon-3=172.28.128.4:6789/0
> }
> election epoch 12, quorum 0,1,2 ceph-mon-1,ceph-mon-2,ceph-mon-3
>  osdmap e49: 4 osds: 4 up, 4 in
> flags pauserd,pausewr,sortbitwise,require_jewel_osds
>   pgmap v263: 64 pgs, 1 pools, 77443 kB data, 35 objects
> 281 MB used, 1978 GB / 1979 GB avail
>   64 active+clean
> 
> where can I found document about:
> 
> * pauserd ?
> * pausewr ?
> 

pauserd: pause reads
pausewr: pause writes

When you set the 'pause' flag it sets both pauserd and pausewr.

When these flags are set, all client I/O (reads and/or writes) is blocked.
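For reference, a short sketch of how these flags are set and cleared (the 'pause' shorthand affects both):

ceph osd set pause       # sets pauserd and pausewr
ceph osd unset pause     # clears both
ceph osd unset pauserd   # or clear them individually
ceph osd unset pausewr
ceph osd dump | grep flags   # verify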

Wido

> Nothing in documentation search engine.
> 
> Best regards,
> Stéphane
> -- 
> Stéphane Klein 
> blog: http://stephane-klein.info
> cv : http://cv.stephane-klein.info
> Twitter: http://twitter.com/klein_stephane
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How can I debug "rbd list" hang?

2016-12-22 Thread Nick Fisk
I think you have probably just answered your previous question. I would guess 
pauserd and pausewr, pauses read and write IO, hence your command to list is 
being blocked on reads.

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Stéphane Klein
Sent: 22 December 2016 17:04
To: ceph-users 
Subject: [ceph-users] How can I debug "rbd list" hang?

 

Hi,

I have this status:

root@ceph-mon-1:/home/vagrant# ceph status
cluster 7ecb6ebd-2e7a-44c3-bf0d-ff8d193e03ac
 health HEALTH_WARN
pauserd,pausewr,sortbitwise,require_jewel_osds flag(s) set
 monmap e1: 3 mons at {ceph-mon-1=172.28.128.2:6789/0,ceph-mon-2=172.28.128.3:6789/0,ceph-mon-3=172.28.128.4:6789/0}
election epoch 12, quorum 0,1,2 ceph-mon-1,ceph-mon-2,ceph-mon-3
 osdmap e49: 4 osds: 4 up, 4 in
flags pauserd,pausewr,sortbitwise,require_jewel_osds
  pgmap v266: 64 pgs, 1 pools, 77443 kB data, 35 objects
281 MB used, 1978 GB / 1979 GB avail
  64 active+clean

Why "rbd list" command hang?

How can I debug that?

 

Best regards,

Stéphane

-- 

Stéphane Klein  >
blog: http://stephane-klein.info
cv : http://cv.stephane-klein.info
Twitter: http://twitter.com/klein_stephane

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] How can I debug "rbd list" hang?

2016-12-22 Thread Stéphane Klein
Hi,

I have this status:

root@ceph-mon-1:/home/vagrant# ceph status
cluster 7ecb6ebd-2e7a-44c3-bf0d-ff8d193e03ac
 health HEALTH_WARN
pauserd,pausewr,sortbitwise,require_jewel_osds flag(s) set
 monmap e1: 3 mons at {ceph-mon-1=
172.28.128.2:6789/0,ceph-mon-2=172.28.128.3:6789/0,ceph-mon-3=172.28.128.4:6789/0
}
election epoch 12, quorum 0,1,2 ceph-mon-1,ceph-mon-2,ceph-mon-3
 osdmap e49: 4 osds: 4 up, 4 in
flags pauserd,pausewr,sortbitwise,require_jewel_osds
  pgmap v266: 64 pgs, 1 pools, 77443 kB data, 35 objects
281 MB used, 1978 GB / 1979 GB avail
  64 active+clean

Why "rbd list" command hang?

How can I debug that?

Best regards,
Stéphane
-- 
Stéphane Klein 
blog: http://stephane-klein.info
cv : http://cv.stephane-klein.info
Twitter: http://twitter.com/klein_stephane
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] What is pauserd and pausewr status?

2016-12-22 Thread Stéphane Klein
Hi,

I have this status:

bash-4.2# ceph status
cluster 7ecb6ebd-2e7a-44c3-bf0d-ff8d193e03ac
 health HEALTH_WARN
pauserd,pausewr,sortbitwise,require_jewel_osds flag(s) set
 monmap e1: 3 mons at {ceph-mon-1=
172.28.128.2:6789/0,ceph-mon-2=172.28.128.3:6789/0,ceph-mon-3=172.28.128.4:6789/0
}
election epoch 12, quorum 0,1,2 ceph-mon-1,ceph-mon-2,ceph-mon-3
 osdmap e49: 4 osds: 4 up, 4 in
flags pauserd,pausewr,sortbitwise,require_jewel_osds
  pgmap v263: 64 pgs, 1 pools, 77443 kB data, 35 objects
281 MB used, 1978 GB / 1979 GB avail
  64 active+clean

where can I found document about:

* pauserd ?
* pausewr ?

Nothing in documentation search engine.

Best regards,
Stéphane
-- 
Stéphane Klein 
blog: http://stephane-klein.info
cv : http://cv.stephane-klein.info
Twitter: http://twitter.com/klein_stephane
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Orphaned objects after deleting rbd images

2016-12-22 Thread Ruben Kerkhof
On Wed, Dec 21, 2016 at 10:33 PM, Jason Dillaman  wrote:
> [moving to ceph-users ...]
>
> You should be able to use the rados CLI to list all the objects in
> your pool, excluding all objects associated with known, valid image
> ids:
>
> rados ls -p rbd | grep -vE "($(rados -p rbd ls | grep rbd_header |
> grep -o "\.[0-9a-f]*" | sed -e :a -e '$!N; s/\n/|/; ta' -e
> 's/\./\\./g'))" | grep -E '(rbd_data|journal|rbd_object_map)'
>
> Once you tweak / verify the list, you can pipe it to the rados rm command.

Perfect, thanks Jason!
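For readability, the quoted one-liner could be unrolled roughly like this (a sketch only, not a verified equivalent; the pool name 'rbd' is assumed, and the candidate list should be reviewed by hand before anything is removed):

#!/bin/sh
POOL=rbd

# image ids of images that still exist, taken from their rbd_header.<id> objects
ids=$(rados -p "$POOL" ls | grep '^rbd_header\.' | sed 's/^rbd_header\.//' | paste -sd'|' -)

# data / journal / object-map objects that reference none of those ids
rados -p "$POOL" ls \
  | grep -E '^(rbd_data|journal|rbd_object_map)' \
  | grep -vE "($ids)" > orphan_candidates.txt

wc -l orphan_candidates.txt
# review orphan_candidates.txt before feeding it to "rados -p $POOL rm"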
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] BlueStore with v11.1.0 Kraken

2016-12-22 Thread Eugen Leitl
Hi guys,

I'm building a first test cluster for homelab, and would like to start
using BlueStore since data loss is not critical. However, there are
obviously no official documentation for basic best usage online yet.

My original layout was using 2x single Xeon nodes with 24 GB RAM each
under Proxmox VE for the test application and two metadata servers, 
each as a VM guest. Each VM would be about 8 GB, 16 GB max.

Ceph OSD etc. was 7x dual-core Opteron with 8 GB RAM each, and some
2x2 to 2x1 TB SATA drives. Current total is 24 TB SATA, 56 GB RAM.

Each node has 4x Gbit NIC, so I have two local storage networks each on a
dedicated unmanaged switch and two NICs serving the app data, on two
dedicated managed ones. I guess up to 0.6 GB/s worst case is more than
enough for dual core Opterons, especially with crappy (nVidia/Broadcom) NICs.

Question is, how is Bluestore changing the picture?

E.g. looking at 
http://www.slideshare.net/sageweil1/bluestore-a-new-faster-storage-backend-for-ceph-63311181
They say things like "metadata is all in memory".
So how many GB of RAM per TB of disk, then? I'm assuming the default 4 MB
object size.

Slide 23 has four example cases. Assuming I have only two HDDs, I
guess my options are a small partition for Linux boot/root, and the
rest as raw partitions for rocksdb and object data. I could boot
the nodes from a USB memory stick, of course. Would that work,
or would there still be too much I/O on the slow USB device?

Before, I was limited to a max of 8 TB/node due to the 8 GB RAM,
so e.g. 2x 4 TB disks. Is this still the case for BlueStore?

Thanks!

Regards,
Eugen 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Clone data inconsistency in hammer

2016-12-22 Thread Sage Weil
On Thu, 22 Dec 2016, Bartłomiej Święcki wrote:
> Hi,
> 
> I have problems running Kraken tools on a Hammer/Jewel cluster (official 11.1.0
> debs),
> it asserts:
> 
> /build/ceph-11.1.0/src/mon/MonMap.cc: In function 'void 
> MonMap::sanitize_mons(std::map&)' 
> thread 7fffd37fe700 time 2016-12-22 12:26:23.457058
> /build/ceph-11.1.0/src/mon/MonMap.cc: 70: FAILED 
> assert(mon_info.count(p.first))

See http://tracker.ceph.com/issues/18265 and 
https://github.com/ceph/ceph/pull/12611

sage


> I tried to debug it a bit and it looks like mon_info has temporary mon names:
> 
> (gdb) p mon_info
> $1 = std::map with 3 elements = {["noname-a"] = {name = "noname-a", .
> 
> while it checks for a correct one:
> 
> (gdb) p p
> $2 = {first = "mon-01-690d38c0-2567-447b-bdfb-0edd137183db", 
> 
> 
> Anyway, I was thinking about the missing image problem - maybe it would be
> easier
> to recreate the removed image? Would restoring the rbd_header object be enough?
> 
> 
> P.S. Adding ceph-devel
> 
> On Thu, 22 Dec 2016 10:10:09 +0100
> Bartłomiej Święcki  wrote:
> 
> > Hi Jason,
> > 
> > I'll test kraken tools since it happened on production, everything works 
> > there
> > since the clone is flattened after being created and the production 
> > equivalent
> > of "test" user can access the image only after it has been flattened.
> > 
> > The issue happened when someone accidentally removed not-yet-flattened image
> > using the user with weaker permissions. Good to hear this has been spotted
> > already.
> > 
> > Thanks for help,
> > Bartek
> > 
> > 
> > 
> > On Wed, 21 Dec 2016 11:53:57 -0500
> > Jason Dillaman  wrote:
> > 
> > > You are unfortunately the second person today to hit an issue where
> > > "rbd remove" incorrectly proceeds when it hits a corner-case error.
> > > 
> > > First things first, when you configure your new user, you needed to
> > > give it "rx" permissions to the parent image's pool. If you attempted
> > > the clone operation using the "test" user, the clone would have
> > > immediately failed due to this issue.
> > > 
> > > Second, unless this is a test cluster where you can delete the
> > > "rbd_children" object in the "rbd" pool (i.e. you don't have any
> > > additional clones in the rbd pool) via the rados CLI, you will need to
> > > use the Kraken release candidate (or master branch) version of the
> > > rados CLI to manually manipulate the "rbd_children" object to remove
> > > the dangling reference to the deleted image.
> > > 
> > > On Wed, Dec 21, 2016 at 6:57 AM, Bartłomiej Święcki
> > >  wrote:
> > > > Hi,
> > > >
> > > > I'm currently investigating a case where Ceph cluster ended up with 
> > > > inconsistent clone information.
> > > >
> > > > Here's what I did to quickly reproduce:
> > > > * Created new cluster (tested in hammer 0.94.6 and jewel 10.2.3)
> > > > * Created two pools: test and rbd
> > > > * Created base image in pool test, created snapshot, protected it and 
> > > > created clone of this snapshot in pool rbd:
> > > > # rbd -p test create --size 10 --image-format 2 base
> > > > # rbd -p test snap create base@base
> > > > # rbd -p test snap protect base@base
> > > > # rbd clone test/base@base rbd/destination
> > > > * Created new user called "test" with rwx permissions to rbd pool only:
> > > > caps: [mon] allow r
> > > > caps: [osd] allow class-read object_prefix rbd_children, allow 
> > > > rwx pool=rbd
> > > > * Using this newly created user I removed the cloned image in rbd pool, 
> > > > had errors but finally removed the image:
> > > > # rbd --id test -p rbd rm destination
> > > > 2016-12-21 11:50:03.758221 7f32b7459700 -1 
> > > > librbd::image::OpenRequest: failed to retreive name: (1) Operation not 
> > > > permitted
> > > > 2016-12-21 11:50:03.758288 7f32b6c58700 -1 
> > > > librbd::image::RefreshParentRequest: failed to open parent image: (1) 
> > > > Operation not permitted
> > > > 2016-12-21 11:50:03.758312 7f32b6c58700 -1 
> > > > librbd::image::RefreshRequest: failed to refresh parent image: (1) 
> > > > Operation not permitted
> > > > 2016-12-21 11:50:03.758333 7f32b6c58700 -1 
> > > > librbd::image::OpenRequest: failed to refresh image: (1) Operation not 
> > > > permitted
> > > > 2016-12-21 11:50:03.759366 7f32b6c58700 -1 librbd::ImageState: 
> > > > failed to open image: (1) Operation not permitted
> > > > Removing image: 100% complete...done.
> > > >
> > > > At this point there's no cloned image but the original snapshot still 
> > > > has reference to it:
> > > >
> > > > # rbd -p test snap unprotect base@base
> > > > 2016-12-21 11:53:47.359060 7fee037fe700 -1 
> > > > librbd::SnapshotUnprotectRequest: cannot unprotect: at least 1 
> > > > child(ren) [29b0238e1f29] in pool 'rbd'
> > > > 2016-12-21 11:53:47.359678 7fee037fe700 -1 
> > 

Re: [ceph-users] Clone data inconsistency in hammer

2016-12-22 Thread Bartłomiej Święcki
Hi,

I have problems running Kraken tools on a Hammer/Jewel cluster (official 11.1.0
debs); it asserts:

/build/ceph-11.1.0/src/mon/MonMap.cc: In function 'void 
MonMap::sanitize_mons(std::map&)' 
thread 7fffd37fe700 time 2016-12-22 12:26:23.457058
/build/ceph-11.1.0/src/mon/MonMap.cc: 70: FAILED assert(mon_info.count(p.first))

I tried to debug it a bit and it looks like mon_info has temporary mon names:

(gdb) p mon_info
$1 = std::map with 3 elements = {["noname-a"] = {name = "noname-a", .

while it checks for a correct one:

(gdb) p p
$2 = {first = "mon-01-690d38c0-2567-447b-bdfb-0edd137183db", 


Anyway, I was thinking about the missing image problem - maybe it would be easier
to recreate the removed image? Would restoring the rbd_header object be enough?


P.S. Adding ceph-devel

On Thu, 22 Dec 2016 10:10:09 +0100
Bartłomiej Święcki  wrote:

> Hi Jason,
> 
> I'll test kraken tools since it happened on production, everything works there
> since the clone is flattened after being created and the production equivalent
> of "test" user can access the image only after it has been flattened.
> 
> The issue happened when someone accidentally removed not-yet-flattened image
> using the user with weaker permissions. Good to hear this has been spotted
> already.
> 
> Thanks for help,
> Bartek
> 
> 
> 
> On Wed, 21 Dec 2016 11:53:57 -0500
> Jason Dillaman  wrote:
> 
> > You are unfortunately the second person today to hit an issue where
> > "rbd remove" incorrectly proceeds when it hits a corner-case error.
> > 
> > First things first, when you configure your new user, you needed to
> > give it "rx" permissions to the parent image's pool. If you attempted
> > the clone operation using the "test" user, the clone would have
> > immediately failed due to this issue.
> > 
> > Second, unless this is a test cluster where you can delete the
> > "rbd_children" object in the "rbd" pool (i.e. you don't have any
> > additional clones in the rbd pool) via the rados CLI, you will need to
> > use the Kraken release candidate (or master branch) version of the
> > rados CLI to manually manipulate the "rbd_children" object to remove
> > the dangling reference to the deleted image.
> > 
> > On Wed, Dec 21, 2016 at 6:57 AM, Bartłomiej Święcki
> >  wrote:
> > > Hi,
> > >
> > > I'm currently investigating a case where Ceph cluster ended up with 
> > > inconsistent clone information.
> > >
> > > Here's what I did to quickly reproduce:
> > > * Created new cluster (tested in hammer 0.94.6 and jewel 10.2.3)
> > > * Created two pools: test and rbd
> > > * Created base image in pool test, created snapshot, protected it and 
> > > created clone of this snapshot in pool rbd:
> > > # rbd -p test create --size 10 --image-format 2 base
> > > # rbd -p test snap create base@base
> > > # rbd -p test snap protect base@base
> > > # rbd clone test/base@base rbd/destination
> > > * Created new user called "test" with rwx permissions to rbd pool only:
> > > caps: [mon] allow r
> > > caps: [osd] allow class-read object_prefix rbd_children, allow 
> > > rwx pool=rbd
> > > * Using this newly created user I removed the cloned image in rbd pool, 
> > > had errors but finally removed the image:
> > > # rbd --id test -p rbd rm destination
> > > 2016-12-21 11:50:03.758221 7f32b7459700 -1 
> > > librbd::image::OpenRequest: failed to retreive name: (1) Operation not 
> > > permitted
> > > 2016-12-21 11:50:03.758288 7f32b6c58700 -1 
> > > librbd::image::RefreshParentRequest: failed to open parent image: (1) 
> > > Operation not permitted
> > > 2016-12-21 11:50:03.758312 7f32b6c58700 -1 
> > > librbd::image::RefreshRequest: failed to refresh parent image: (1) 
> > > Operation not permitted
> > > 2016-12-21 11:50:03.758333 7f32b6c58700 -1 
> > > librbd::image::OpenRequest: failed to refresh image: (1) Operation not 
> > > permitted
> > > 2016-12-21 11:50:03.759366 7f32b6c58700 -1 librbd::ImageState: 
> > > failed to open image: (1) Operation not permitted
> > > Removing image: 100% complete...done.
> > >
> > > At this point there's no cloned image but the original snapshot still has 
> > > reference to it:
> > >
> > > # rbd -p test snap unprotect base@base
> > > 2016-12-21 11:53:47.359060 7fee037fe700 -1 
> > > librbd::SnapshotUnprotectRequest: cannot unprotect: at least 1 child(ren) 
> > > [29b0238e1f29] in pool 'rbd'
> > > 2016-12-21 11:53:47.359678 7fee037fe700 -1 
> > > librbd::SnapshotUnprotectRequest: encountered error: (16) Device or 
> > > resource busy
> > > 2016-12-21 11:53:47.359691 7fee037fe700 -1 
> > > librbd::SnapshotUnprotectRequest: 0x7fee39ae9340 should_complete_error: 
> > > ret_val=-16
> > > 2016-12-21 11:53:47.360627 7fee037fe700 -1 
> > > librbd::SnapshotUnprotectRequest: 0x7fee39ae9340 should_complete_error: 
> > > 

Re: [ceph-users] cannot commit period: period does not have a master zone of a master zonegroup

2016-12-22 Thread Wido den Hollander

> On 20 December 2016 at 18:06, Orit Wasserman wrote:
> 
> 
> On Tue, Dec 20, 2016 at 5:39 PM, Wido den Hollander  wrote:
> >
> >> On 15 December 2016 at 17:10, Orit Wasserman wrote:
> >>
> >>
> >> Hi Wido,
> >>
> >> This looks like you are hitting http://tracker.ceph.com/issues/17364
> >> The fix is being backported to jewel: 
> >> https://github.com/ceph/ceph/pull/12315
> >>
> >> A workaround:
> >> save the realm, zonegroup and zones json file
> >> make a copy of .rgw.root (the pool contain the multisite config)
> >> remove .rgw.root
> >> stop the gateway
> >> radosgw-admin realm set < json
> >> radosgw-admin zonegroup set < json
> >> raodsgw-admin zone set < json
> >> radosgw-admin period update --commit
> >> start the gateway
> >>
> >> If the realm set will give you problems you can create a new realm
> >> and will need to update the realm id in the zonegroup and zones json
> >> files before using them
> >>
> >
> > I eventually ended up doing that indeed. Setting a realm from a backup 
> > doesn't work.
> >
> 
> I suspect that, can you open an tracker issue?
> 

Sure, done: http://tracker.ceph.com/issues/18333

In short, a new period doesn't seem to be created when none exists at the time
the realm is set. Creating one afterwards doesn't seem to work either.
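For anyone following the workaround quoted above, a sketch of how the existing configuration could be saved first (zonegroup 'gn' and zone 'zm1' are taken from the commands below; the backup pool name is arbitrary):

radosgw-admin realm get > realm.json
radosgw-admin zonegroup get --rgw-zonegroup=gn > zonegroup.json
radosgw-admin zone get --rgw-zone=zm1 > zm1.json
# keep a copy of the multisite config pool as well
ceph osd pool create .rgw.root.backup 8
rados cppool .rgw.root .rgw.root.backup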

> > So my sequence in commands:
> >
> > NOTE THE UUID OF THE REALM AND CHANGE IN JSON FILES
> >
> > nano zm1.json
> > nano zonegroup.json
> >
> > radosgw-admin zonegroup set --rgw-zonegroup gn < zonegroup.json
> > radosgw-admin zone set --rgw-zonegroup gn --rgw-zone zm1 < zm1.json
> > radosgw-admin zonegroup default --rgw-zonegroup gn
> > radosgw-admin zone default --rgw-zone zm1
> > radosgw-admin period update
> > radosgw-admin period update --commit
> >
> > This eventually got things working again.
> >
> Good
> 
> > The only thing I keep seeing everywhere:
> >
> > root@alpha:~# radosgw-admin period update --commit
> > 2016-12-20 16:38:07.958860 7f9571697a00  0 error in read_id for id  : (2) 
> > No such file or directory
> > 2016-12-20 16:38:07.960035 7f9571697a00  0 error in read_id for id  : (2) 
> > No such file or directory
> >
> 
> I am guessing this is not an error just a message that should have
> higher log level,
> can you open an issue?

I updated issue #15776 since that still seems relevant.

See my update there: http://tracker.ceph.com/issues/15776

Thanks!

Wido

> 
> > Brought me to:
> >
> > - http://tracker.ceph.com/issues/15776
> > - https://github.com/ceph/ceph/pull/8994
> >
> > Doesn't seem to be backported to 10.2.5 however.
> >
> 
> Strange it should be part of jewel , I will look into it.
> 
> > Wido
> >
> >> Orit
> >>
> >>
> >> On Thu, Dec 15, 2016 at 4:47 PM, Wido den Hollander  wrote:
> >> > Hi,
> >> >
> >> > On a Ceph cluster running Jewel 10.2.5 I'm running into a problem.
> >> >
> >> > I want to change the amount of shards:
> >> >
> >> > # radosgw-admin zonegroup-map get > zonegroup.json
> >> > # nano zonegroup.json
> >> > # radosgw-admin zonegroup-map set --infile zonegroup.json
> >> > # radosgw-admin period update --commit
> >> >
> >> > Now, the error arrises:
> >> >
> >> > cannot commit period: period does not have a master zone of a master 
> >> > zonegroup
> >> > failed to commit period: (22) Invalid argument
> >> >
> >> > Looking at the output:
> >> >
> >> > # radosgw-admin period update
> >> >
> >> > {
> >> > ...
> >> > "master_zonegroup": "",
> >> > "master_zone": "",
> >> > ...
> >> > }
> >> >
> >> > # radosgw-admin zone list
> >> >
> >> > {
> >> > "default_info": "zm1",
> >> > "zones": [
> >> > "default",
> >> > "zm1"
> >> > ]
> >> > }
> >> >
> >> > To me it seems like there is something wrong with the period since there 
> >> > is no UUID present in master_zone/zonegroup.
> >> >
> >> > Any idea on how to fix this?
> >> >
> >> > Wido
> >> > ___
> >> > ceph-users mailing list
> >> > ceph-users@lists.ceph.com
> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] If I shutdown 2 osd / 3, Ceph Cluster say 2 osd UP, why?

2016-12-22 Thread Stéphane Klein
2016-12-22 12:30 GMT+01:00 Henrik Korkuc :

> try waiting a little longer. Mon needs multiple down reports to take OSD
> down. And as your cluster is very small there is small amount (1 in this
> case) of OSDs to report that others are down.
>
>
Why this limitation? Because my rbd mount on the ceph-client-1 host has been
hanging for 10 minutes already :(
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] If I shutdown 2 osd / 3, Ceph Cluster say 2 osd UP, why?

2016-12-22 Thread Henrik Korkuc

On 16-12-22 13:26, Stéphane Klein wrote:

Hi,

I have:

* 3 mon
* 3 osd

When I shut down one OSD, it works great:

cluster 7ecb6ebd-2e7a-44c3-bf0d-ff8d193e03ac
 health HEALTH_WARN
43 pgs degraded
43 pgs stuck unclean
43 pgs undersized
recovery 24/70 objects degraded (34.286%)
too few PGs per OSD (28 < min 30)
1/3 in osds are down
 monmap e1: 3 mons at 
{ceph-mon-1=172.28.128.2:6789/0,ceph-mon-2=172.28.128.3:6789/0,ceph-mon-3=172.28.128.4:6789/0 
}
election epoch 10, quorum 0,1,2 
ceph-mon-1,ceph-mon-2,ceph-mon-3

 osdmap e22: 3 osds: 2 up, 3 in; 43 remapped pgs
flags sortbitwise,require_jewel_osds
  pgmap v169: 64 pgs, 1 pools, 77443 kB data, 35 objects
252 MB used, 1484 GB / 1484 GB avail
24/70 objects degraded (34.286%)
  43 active+undersized+degraded
  21 active+clean

But when I shut down 2 OSDs, the Ceph cluster doesn't see that the second OSD
node is down :(


root@ceph-mon-1:/home/vagrant# ceph status
cluster 7ecb6ebd-2e7a-44c3-bf0d-ff8d193e03ac
 health HEALTH_WARN
clock skew detected on mon.ceph-mon-2
pauserd,pausewr,sortbitwise,require_jewel_osds flag(s) set
Monitor clock skew detected
 monmap e1: 3 mons at 
{ceph-mon-1=172.28.128.2:6789/0,ceph-mon-2=172.28.128.3:6789/0,ceph-mon-3=172.28.128.4:6789/0 
}
election epoch 10, quorum 0,1,2 
ceph-mon-1,ceph-mon-2,ceph-mon-3

 osdmap e26: 3 osds: 2 up, 2 in
flags pauserd,pausewr,sortbitwise,require_jewel_osds
  pgmap v203: 64 pgs, 1 pools, 77443 kB data, 35 objects
219 MB used, 989 GB / 989 GB avail
  64 active+clean

2 osd up ! why ?

root@ceph-mon-1:/home/vagrant# ping ceph-osd-1 -c1
--- ceph-osd-1 ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms

root@ceph-mon-1:/home/vagrant# ping ceph-osd-2 -c1
--- ceph-osd-2 ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms

root@ceph-mon-1:/home/vagrant# ping ceph-osd-3 -c1
--- ceph-osd-3 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.278/0.278/0.278/0.000 ms

My configuration:

ceph_conf_overrides:
   global:
  osd_pool_default_size: 2
  osd_pool_default_min_size: 1

Full Ansible configuration is here: 
https://github.com/harobed/poc-ceph-ansible/blob/master/vagrant-3mons-3osd/hosts/group_vars/all.yml#L11


What is my mistake? Is it Ceph bug?

Try waiting a little longer. The mon needs multiple down reports before it takes
an OSD down, and as your cluster is very small there is only a small number (1 in
this case) of OSDs left to report that the others are down.
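For reference, the relevant options can be inspected on the monitor host (a sketch; option names as in Jewel, mon.ceph-mon-1 is taken from the status output above, and defaults may differ between releases):

ceph daemon mon.ceph-mon-1 config get mon_osd_min_down_reporters   # distinct OSDs that must report a peer down
ceph daemon mon.ceph-mon-1 config get mon_osd_report_timeout       # fallback timeout (seconds) if nobody reports it
ceph daemon mon.ceph-mon-1 config get osd_heartbeat_grace          # OSD heartbeat grace period (seconds)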



Best regards,
Stéphane
--
Stéphane Klein >

blog: http://stephane-klein.info
cv : http://cv.stephane-klein.info
Twitter: http://twitter.com/klein_stephane


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] When I shutdown one osd node, where can I see the block movement?

2016-12-22 Thread Henrik Korkuc

On 16-12-22 13:20, Stéphane Klein wrote:



2016-12-22 12:18 GMT+01:00 Henrik Korkuc >:


On 16-12-22 13:12, Stéphane Klein wrote:

HEALTH_WARN 43 pgs degraded; 43 pgs stuck unclean; 43 pgs
undersized; recovery 24/70 objects degraded (34.286%); too few
PGs per OSD (28 < min 30); 1/3 in osds are down;


it says 1/3 OSDs are down. By default Ceph pools are setup with
size 3. If your setup is same it will not be able to restore to
normal status without size decrease or additional OSDs


I have this config:

ceph_conf_overrides:
   global:
  osd_pool_default_size: 2
  osd_pool_default_min_size: 1

see: 
https://github.com/harobed/poc-ceph-ansible/blob/master/vagrant-3mons-3osd/hosts/group_vars/all.yml#L11


Can you please provide outputs of "ceph -s" "ceph osd tree" and "ceph 
osd dump |grep size"?


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] If I shutdown 2 osd / 3, Ceph Cluster say 2 osd UP, why?

2016-12-22 Thread Stéphane Klein
Hi,

I have:

* 3 mon
* 3 osd

When I shut down one OSD, it works great:

cluster 7ecb6ebd-2e7a-44c3-bf0d-ff8d193e03ac
 health HEALTH_WARN
43 pgs degraded
43 pgs stuck unclean
43 pgs undersized
recovery 24/70 objects degraded (34.286%)
too few PGs per OSD (28 < min 30)
1/3 in osds are down
 monmap e1: 3 mons at {ceph-mon-1=
172.28.128.2:6789/0,ceph-mon-2=172.28.128.3:6789/0,ceph-mon-3=172.28.128.4:6789/0
}
election epoch 10, quorum 0,1,2 ceph-mon-1,ceph-mon-2,ceph-mon-3
 osdmap e22: 3 osds: 2 up, 3 in; 43 remapped pgs
flags sortbitwise,require_jewel_osds
  pgmap v169: 64 pgs, 1 pools, 77443 kB data, 35 objects
252 MB used, 1484 GB / 1484 GB avail
24/70 objects degraded (34.286%)
  43 active+undersized+degraded
  21 active+clean

But when I shut down 2 OSDs, the Ceph cluster doesn't see that the second OSD
node is down :(

root@ceph-mon-1:/home/vagrant# ceph status
cluster 7ecb6ebd-2e7a-44c3-bf0d-ff8d193e03ac
 health HEALTH_WARN
clock skew detected on mon.ceph-mon-2
pauserd,pausewr,sortbitwise,require_jewel_osds flag(s) set
Monitor clock skew detected
 monmap e1: 3 mons at {ceph-mon-1=
172.28.128.2:6789/0,ceph-mon-2=172.28.128.3:6789/0,ceph-mon-3=172.28.128.4:6789/0
}
election epoch 10, quorum 0,1,2 ceph-mon-1,ceph-mon-2,ceph-mon-3
 osdmap e26: 3 osds: 2 up, 2 in
flags pauserd,pausewr,sortbitwise,require_jewel_osds
  pgmap v203: 64 pgs, 1 pools, 77443 kB data, 35 objects
219 MB used, 989 GB / 989 GB avail
  64 active+clean

2 osd up ! why ?

root@ceph-mon-1:/home/vagrant# ping ceph-osd-1 -c1
--- ceph-osd-1 ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms

root@ceph-mon-1:/home/vagrant# ping ceph-osd-2 -c1
--- ceph-osd-2 ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms

root@ceph-mon-1:/home/vagrant# ping ceph-osd-3 -c1
--- ceph-osd-3 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.278/0.278/0.278/0.000 ms

My configuration:

ceph_conf_overrides:
   global:
  osd_pool_default_size: 2
  osd_pool_default_min_size: 1

Full Ansible configuration is here:
https://github.com/harobed/poc-ceph-ansible/blob/master/vagrant-3mons-3osd/hosts/group_vars/all.yml#L11

What is my mistake? Is it Ceph bug?

Best regards,
Stéphane
-- 
Stéphane Klein 
blog: http://stephane-klein.info
cv : http://cv.stephane-klein.info
Twitter: http://twitter.com/klein_stephane
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] When I shutdown one osd node, where can I see the block movement?

2016-12-22 Thread Henrik Korkuc

On 16-12-22 13:12, Stéphane Klein wrote:
HEALTH_WARN 43 pgs degraded; 43 pgs stuck unclean; 43 pgs undersized; 
recovery 24/70 objects degraded (34.286%); too few PGs per OSD (28 < 
min 30); 1/3 in osds are down;


It says 1/3 OSDs are down. By default Ceph pools are set up with size 3.
If your setup is the same, it will not be able to restore to a normal status
without decreasing the size or adding OSDs.
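A sketch of how the pool's replication settings can be checked and, if really desired, lowered (the pool name 'rbd' is an assumption, based on the single pool shown in the status output):

ceph osd pool get rbd size
ceph osd pool get rbd min_size
# only if two copies are really acceptable:
ceph osd pool set rbd size 2
ceph osd pool set rbd min_size 1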



Here Ceph say there are 24 objects to move?



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] When I shutdown one osd node, where can I see the block movement?

2016-12-22 Thread ceph
That's correct :)

On 22/12/2016 12:12, Stéphane Klein wrote:
> HEALTH_WARN 43 pgs degraded; 43 pgs stuck unclean; 43 pgs undersized;
> recovery 24/70 objects degraded (34.286%); too few PGs per OSD (28 < min
> 30); 1/3 in osds are down;
> 
> Here Ceph says there are 24 objects to move?
> 
> 
> 


Re: [ceph-users] When I shut down one OSD node, where can I see the block movement?

2016-12-22 Thread Stéphane Klein
HEALTH_WARN 43 pgs degraded; 43 pgs stuck unclean; 43 pgs undersized;
recovery 24/70 objects degraded (34.286%); too few PGs per OSD (28 < min
30); 1/3 in osds are down;

Here Ceph says there are 24 objects to move?


Re: [ceph-users] When I shut down one OSD node, where can I see the block movement?

2016-12-22 Thread ceph
As always: ceph status
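
For illustration, a minimal sketch of commands commonly used to follow recovery
from the command line (not specific to this cluster):

# ceph -w           (streams cluster events, including recovery/backfill progress)
# ceph status       (point-in-time summary, with degraded/misplaced object counts)
# ceph pg stat      (one-line summary of PG states)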

On 22/12/2016 11:53, Stéphane Klein wrote:
> Hi,
> 
> When I shut down one OSD node, where can I see the block movement?
> Where can I see the progress percentage?
> 
> Best regards,
> Stéphane
> 
> 
> 


[ceph-users] How can I ask the Ceph cluster to move blocks now when an OSD is down?

2016-12-22 Thread Stéphane Klein
Hi,

How can I ask the Ceph cluster to move blocks now when an OSD is down?
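
For reference, a minimal sketch, assuming the down OSD has id 1: marking it out
tells the cluster to start re-replicating its data immediately instead of waiting
for the automatic down-out timer.

# ceph osd out 1        (or: ceph osd out osd.1)
# ceph -w               (watch the resulting backfill/recovery)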

Best regards,
Stéphane
-- 
Stéphane Klein 
blog: http://stephane-klein.info
cv : http://cv.stephane-klein.info
Twitter: http://twitter.com/klein_stephane


[ceph-users] When I shut down one OSD node, where can I see the block movement?

2016-12-22 Thread Stéphane Klein
Hi,

When I shut down one OSD node, where can I see the block movement?
Where can I see the progress percentage?

Best regards,
Stéphane
-- 
Stéphane Klein 
blog: http://stephane-klein.info
cv : http://cv.stephane-klein.info
Twitter: http://twitter.com/klein_stephane


Re: [ceph-users] mount /dev/rbd0 /mnt/image2 + rm Python-2.7.13 -rf => freeze

2016-12-22 Thread Ilya Dryomov
On Thu, Dec 22, 2016 at 8:32 AM, Stéphane Klein
 wrote:
>
>
> 2016-12-21 23:39 GMT+01:00 Stéphane Klein :
>>
>>
>>
>> 2016-12-21 23:33 GMT+01:00 Ilya Dryomov :
>>>
>>> What if you boot ceph-client-3 with >512M memory, say 2G?
>>
>>
>> Success !
>
>
>
> Is it possible to add a warning message in rbd to say if memory is too low?

It's not rbd per se.  Those kernels are probably just missing
a backport.  Kernel 3.13 wasn't a regular LTS, so even though the Ubuntu
team generally does a good job and applies a lot of patches, something
may have been missed.

Can you try upgrading the kernel package and booting with 512M, just to
confirm?
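
A minimal sketch of that check and upgrade, assuming the client runs Ubuntu 14.04
(trusty) with the stock 3.13 kernel and that the lts-xenial HWE kernel package is
available:

# uname -r
# apt-get update && apt-get install linux-generic-lts-xenial
# reboot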

Thanks,

Ilya


Re: [ceph-users] rgw leaking data, orphan search loop

2016-12-22 Thread Marius Vaitiekunas
On Thu, Dec 22, 2016 at 11:58 AM, Marius Vaitiekunas <
mariusvaitieku...@gmail.com> wrote:

> Hi,
>
> 1) I've written to the mailing list before, but one more time: we have big
> issues recently with rgw on Jewel because of leaked data - the rate is
> about 50 GB/hour.
>
> We've hit these bugs:
> rgw: fix put_acls for objects starting and ending with underscore
> (issue#17625, pr#11669, Orit Wasserman)
>
> Upgraded to jewel 10.2.5 - no luck.
>
> We've also hit this one:
> rgw: RGW loses realm/period/zonegroup/zone data: period overwritten if
> somewhere in the cluster is still running Hammer (issue#17371, pr#11519,
> Orit Wasserman)
>
> Fixed zonemaps - also no luck.
>
> We do not use multisite - only default realm, zonegroup, zone.
>
> We have no more ideas how this data leak could happen. GC is working -
> we can see it in the rgw logs.
>
> Maybe someone could give a hint about this? Where should we look?
>
>
> 2) Another story is about removing all the leaked/orphan objects.
> radosgw-admin orphans find enters a loop at the stage when it starts
> linking objects.
>
> We've tried changing the number of shards to 16, 64 (the default), and 512. At
> the moment it's running with 1 shard.
>
> Again, any ideas on how to make the orphan search complete?
>
>
> I can provide any logs, configs, etc. if someone is ready to help with
> this case.
>
>
>
Sorry, I forgot to mention that we've registered two issues on the tracker:
http://tracker.ceph.com/issues/18331
http://tracker.ceph.com/issues/18258

-- 
Marius Vaitiekūnas


[ceph-users] rgw leaking data, orphan search loop

2016-12-22 Thread Marius Vaitiekunas
Hi,

1) I've written to the mailing list before, but one more time: we have big
issues recently with rgw on Jewel because of leaked data - the rate is
about 50 GB/hour.

We've hit these bugs:
rgw: fix put_acls for objects starting and ending with underscore
(issue#17625, pr#11669, Orit Wasserman)

Upgraded to jewel 10.2.5 - no luck.

We've also hit this one:
rgw: RGW loses realm/period/zonegroup/zone data: period overwritten if
somewhere in the cluster is still running Hammer (issue#17371, pr#11519,
Orit Wasserman)

Fixed zonemaps - also no luck.

We do not use multisite - only default realm, zonegroup, zone.

We have no more ideas how this data leak could happen. GC is working - we
can see it in the rgw logs.

Maybe someone could give a hint about this? Where should we look?


2) Another story is about removing all the leaked/orphan objects.
radosgw-admin orphans find enters a loop at the stage when it starts
linking objects.

We've tried changing the number of shards to 16, 64 (the default), and 512. At the
moment it's running with 1 shard.
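
The orphan scan is being driven with commands of this general shape (a sketch;
the pool and job names below are placeholders, not the values used on this
cluster):

# radosgw-admin orphans find --pool=default.rgw.buckets.data --job-id=orphan-scan-1 --num-shards=64
# radosgw-admin orphans list-jobs
# radosgw-admin orphans finish --job-id=orphan-scan-1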

Again, any ideas on how to make the orphan search complete?


I can provide any logs, configs, etc. if someone is ready to help with this
case.

-- 
Marius Vaitiekūnas


Re: [ceph-users] OSD will not start after heartbeat suicide timeout, assert error from PGLog

2016-12-22 Thread Nick Fisk
Hi,

I hit this a few weeks ago; here is the related tracker issue. You might want to
update it to reflect your case and upload logs.

http://tracker.ceph.com/issues/17916

Nick

> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> Trygve Vea
> Sent: 21 December 2016 20:18
> To: ceph-users 
> Subject: [ceph-users] OSD will not start after heartbeat suicide timeout, 
> assert error from PGLog
> 
> Hi,
> 
> One of our OSDs has gone into a mode where it throws an assert and dies
> shortly after it has been started.
> 
> The following assert is being thrown:
> https://github.com/ceph/ceph/blob/v10.2.5/src/osd/PGLog.cc#L1036-L1047
> 
> --- begin dump of recent events ---
>  0> 2016-12-21 17:05:57.975799 7f1d91d59800 -1 *** Caught signal 
> (Aborted) **  in thread 7f1d91d59800 thread_name:ceph-osd
> 
>  ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
>  1: (()+0x91875a) [0x7f1d9268975a]
>  2: (()+0xf100) [0x7f1d906ba100]
>  3: (gsignal()+0x37) [0x7f1d8ec7c5f7]
>  4: (abort()+0x148) [0x7f1d8ec7dce8]
>  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
> const*)+0x267) [0x7f1d927866c7]
>  6: (PGLog::read_log(ObjectStore*, coll_t, coll_t, ghobject_t, pg_info_t 
> const&, std::map std::less, std::allocator > >&, PGLog::IndexedLog&, pg_missing_t&,
> std::basic_ostringstream >&, DoutPrefixProvider const*, std::set std::less, std::allocator >*)+0xdc7) 
> [0x7f1d92371ae7]
>  7: (PG::read_state(ObjectStore*, ceph::buffer::list&)+0x490) [0x7f1d921cf440]
>  8: (OSD::load_pgs()+0x9b6) [0x7f1d92105056]
>  9: (OSD::init()+0x2086) [0x7f1d92117846]
>  10: (main()+0x2c55) [0x7f1d9207b595]
>  11: (__libc_start_main()+0xf5) [0x7f1d8ec68b15]
>  12: (()+0x3549b9) [0x7f1d920c59b9]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to 
> interpret this.
> 
> 
> It looks to me like prior to this, the osd died while hitting a suicide 
> timeout:
> 
> 7fafac213700 time 2016-12-21 16:50:13.038341
> common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")
> 
>  ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
> const*)+0x85) [0x7fb001b3c4e5]
>  2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, 
> long)+0x2e1) [0x7fb001a7bf21]
>  3: (ceph::HeartbeatMap::is_healthy()+0xde) [0x7fb001a7c77e]
>  4: (OSD::handle_osd_ping(MOSDPing*)+0x93f) [0x7fb0014b289f]
>  5: (OSD::heartbeat_dispatch(Message*)+0x3cb) [0x7fb0014b3acb]
>  6: (DispatchQueue::entry()+0x78a) [0x7fb001bfe45a]
>  7: (DispatchQueue::DispatchThread::entry()+0xd) [0x7fb001b17cdd]
>  8: (()+0x7dc5) [0x7fafffa68dc5]
>  9: (clone()+0x6d) [0x7faffe0f3ced]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to 
> interpret this.
> 
> 
> These timeouts started to occasionally occur after we upgraded to Jewel.  I 
> have saved a dump of the recent events prior to the
> suicide timeout here: 
> http://employee.tv.situla.bitbit.net/heartbeat_suicide.log
> 
> 
> If the Ceph project is interested in doing forensics on this, I still have 
> the OSD available in its current state.
> 
> My hypothesis is that some kind of inconsistency has occurred as a result
> of the first assert error.
> 
> Is this a bug?
> 
> 
> Regards
> --
> Trygve Vea
> Redpill Linpro AS


Re: [ceph-users] Clone data inconsistency in hammer

2016-12-22 Thread Bartłomiej Święcki
Hi Jason,

I'll test the Kraken tools, since this happened on production. Everything works
there because clones are flattened right after being created, and the production
equivalent of the "test" user can access an image only after it has been
flattened.

The issue happened when someone accidentally removed a not-yet-flattened image
using the user with weaker permissions. Good to hear this has already been
spotted.

Thanks for help,
Bartek



On Wed, 21 Dec 2016 11:53:57 -0500
Jason Dillaman  wrote:

> You are unfortunately the second person today to hit an issue where
> "rbd remove" incorrectly proceeds when it hits a corner-case error.
> 
> First things first, when you configure your new user, you needed to
> give it "rx" permissions to the parent image's pool. If you attempted
> the clone operation using the "test" user, the clone would have
> immediately failed due to this issue.
> 
> Second, unless this is a test cluster where you can delete the
> "rbd_children" object in the "rbd" pool (i.e. you don't have any
> additional clones in the rbd pool) via the rados CLI, you will need to
> use the Kraken release candidate (or master branch) version of the
> rados CLI to manually manipulate the "rbd_children" object to remove
> the dangling reference to the deleted image.
> 
> On Wed, Dec 21, 2016 at 6:57 AM, Bartłomiej Święcki
>  wrote:
> > Hi,
> >
> > I'm currently investigating a case where Ceph cluster ended up with 
> > inconsistent clone information.
> >
> > Here's a what I did to quickly reproduce:
> > * Created new cluster (tested in hammer 0.94.6 and jewel 10.2.3)
> > * Created two pools: test and rbd
> > * Created base image in pool test, created snapshot, protected it and 
> > created clone of this snapshot in pool rbd:
> > # rbd -p test create --size 10 --image-format 2 base
> > # rbd -p test snap create base@base
> > # rbd -p test snap protect base@base
> > # rbd clone test/base@base rbd/destination
> > * Created new user called "test" with rwx permissions to rbd pool only:
> > caps: [mon] allow r
> > caps: [osd] allow class-read object_prefix rbd_children, allow rwx 
> > pool=rbd
> > * Using this newly created user I removed the cloned image in the rbd pool, had 
> > errors but finally removed the image:
> > # rbd --id test -p rbd rm destination
> > 2016-12-21 11:50:03.758221 7f32b7459700 -1 
> > librbd::image::OpenRequest: failed to retreive name: (1) Operation not 
> > permitted
> > 2016-12-21 11:50:03.758288 7f32b6c58700 -1 
> > librbd::image::RefreshParentRequest: failed to open parent image: (1) 
> > Operation not permitted
> > 2016-12-21 11:50:03.758312 7f32b6c58700 -1 
> > librbd::image::RefreshRequest: failed to refresh parent image: (1) 
> > Operation not permitted
> > 2016-12-21 11:50:03.758333 7f32b6c58700 -1 
> > librbd::image::OpenRequest: failed to refresh image: (1) Operation not 
> > permitted
> > 2016-12-21 11:50:03.759366 7f32b6c58700 -1 librbd::ImageState: 
> > failed to open image: (1) Operation not permitted
> > Removing image: 100% complete...done.
> >
> > At this point there's no cloned image but the original snapshot still has 
> > reference to it:
> >
> > # rbd -p test snap unprotect base@base
> > 2016-12-21 11:53:47.359060 7fee037fe700 -1 
> > librbd::SnapshotUnprotectRequest: cannot unprotect: at least 1 child(ren) 
> > [29b0238e1f29] in pool 'rbd'
> > 2016-12-21 11:53:47.359678 7fee037fe700 -1 
> > librbd::SnapshotUnprotectRequest: encountered error: (16) Device or 
> > resource busy
> > 2016-12-21 11:53:47.359691 7fee037fe700 -1 
> > librbd::SnapshotUnprotectRequest: 0x7fee39ae9340 should_complete_error: 
> > ret_val=-16
> > 2016-12-21 11:53:47.360627 7fee037fe700 -1 
> > librbd::SnapshotUnprotectRequest: 0x7fee39ae9340 should_complete_error: 
> > ret_val=-16
> > rbd: unprotecting snap failed: (16) Device or resource busy
> >
> > # rbd -p test children base@base
> > rbd: listing children failed: (2) No such file or directory2016-12-21
> > 11:53:08.716987 7ff2b2eaad80 -1 librbd: Error looking up name for image
> > id 29b0238e1f29 in pool rbd
> >
> >
> > Any ideas on how this could be fixed?
> >
> >
> > Thanks,
> > Bartek
> 
> 
> 
> -- 
> Jason
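
Following up on the first point in the quoted reply, a minimal sketch of caps
that would also give the "test" user read access to the parent image's pool
(assuming the parent pool is named "test", as in the reproduction steps):

# ceph auth caps client.test \
    mon 'allow r' \
    osd 'allow class-read object_prefix rbd_children, allow rwx pool=rbd, allow rx pool=test'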


Re: [ceph-users] mount /dev/rbd0 /mnt/image2 + rm Python-2.7.13 -rf => freeze

2016-12-22 Thread Stéphane Klein
2016-12-21 23:33 GMT+01:00 Ilya Dryomov :

>
> What if you boot ceph-client-3 with >512M memory, say 2G?
>
>
With:

* 512 M memory => failed
* 1000 M memory => failed
* 1500 M memory => success