Re: [ceph-users] Some questions concerning filestore --> bluestore migration

2018-10-04 Thread ceph
Hello

On 4 October 2018 at 02:38:35 CEST, solarflow99 wrote:
>I use the same configuration you have, and I plan on using bluestore. 
>My
>SSDs are only 240GB and it worked with filestore all this time, I
>suspect
>bluestore should be fine too.
>
>
>On Wed, Oct 3, 2018 at 4:25 AM Massimo Sgaravatto <
>massimo.sgarava...@gmail.com> wrote:
>
>> Hi
>>
>> I have a ceph cluster, running luminous, composed of 5 OSD nodes,
>which is
>> using filestore.
>> Each OSD node has 2 E5-2620 v4 processors, 64 GB of RAM, 10x6TB SATA
>disk
>> + 2x200GB SSD disk (then I have 2 other disks in RAID for the OS), 10
>Gbps.
>> So each SSD disk is used for the journal for 5 OSDs. With this
>> configuration everything is running smoothly ...
>>
>>
>> We are now buying some new storage nodes, and I am trying to buy
>something
>> which is bluestore compliant. So the idea is to consider a
>configuration
>> something like:
>>
>> - 10 SATA disks (8TB / 10TB / 12TB each. TBD)
>> - 2 processor (~ 10 core each)
>> - 64 GB of RAM
>> - 2 SSD to be used for WAL+DB
>> - 10 Gbps
>>
>> For what concerns the size of the SSD disks I read in this mailing
>list
>> that it is suggested to have at least 10GB of SSD disk/10TB of SATA
>disk.
>>
>>
>> So, the questions:
>>
>> 1) Does this hardware configuration seem reasonable ?
>>
>> 2) Are there problems to live (forever, or until filestore
>deprecation)
>> with some OSDs using filestore (the old ones) and some OSDs using
>bluestore
>> (the old ones) ?
>>
>> 3) Would you suggest to update to bluestore also the old OSDs, even
>if the
>> available SSDs are too small (they don't satisfy the "10GB of SSD
>disk/10TB
>> of SATA disk" rule) ?

AFAIR the DB size should be 4% of the OSD in question.

So, for example, if the block size is 1TB, then block.db shouldn't be less than 40GB.

See: http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/
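A rough sketch of what that looks like at OSD creation time with ceph-volume
(device names here are placeholders):

# 4% of a 10TB data disk is roughly 400GB reserved for block.db
ceph-volume lvm create --bluestore --data /dev/sdc --block.db /dev/nvme0n1p1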

Hth
- Mehmet 

>>
>> Thanks, Massimo
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>


Re: [ceph-users] CephFS performance.

2018-10-04 Thread Patrick Donnelly
On Thu, Oct 4, 2018 at 2:10 AM Ronny Aasen  wrote:
> in rbd there is a fancy striping solution, by using --stripe-unit and
> --stripe-count. This would get more spindles running ; perhaps consider
> using rbd instead of cephfs if it fits the workload.

CephFS also supports custom striping via layouts:
http://docs.ceph.com/docs/master/cephfs/file-layouts/
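For example, since layouts are exposed as virtual xattrs, something along these
lines adjusts the striping (paths are placeholders, and a file layout can only
be changed while the file is still empty):

setfattr -n ceph.file.layout.stripe_unit -v 1048576 /mnt/cephfs/somefile
setfattr -n ceph.file.layout.stripe_count -v 8 /mnt/cephfs/somefile
# or set a default layout for new files created under a directory:
setfattr -n ceph.dir.layout.stripe_count -v 8 /mnt/cephfs/somedir
getfattr -n ceph.file.layout /mnt/cephfs/somefile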

-- 
Patrick Donnelly


[ceph-users] MDS hangs in "heartbeat_map" deadlock

2018-10-04 Thread Stefan Kooman
Dear list,

Today we hit our first Ceph MDS issue. Out of the blue the active MDS
stopped working:

mon.mon1 [WRN] daemon mds.mds1 is not responding, replacing it as rank 0 with 
standby
daemon mds.mds2.

Logging of ceph-mds1:

2018-10-04 10:50:08.524745 7fdd516bf700 1 mds.mds1 asok_command: status 
(starting...)
2018-10-04 10:50:08.524782 7fdd516bf700 1 mds.mds1 asok_command: status 
(complete)

^^ one of our monitoring health checks performing a "ceph daemon mds.mds1 
version", business as usual.

2018-10-04 10:52:36.712525 7fdd51ec0700 1 heartbeat_map is_healthy 'MDSRank' 
had timed out after 15
2018-10-04 10:52:36.747577 7fdd4deb8700 1 heartbeat_map is_healthy 'MDSRank' 
had timed out after 15
2018-10-04 10:52:36.747584 7fdd4deb8700 1 mds.beacon.mds1 _send skipping 
beacon, heartbeat map not healthy

^^ the unresponsive mds1 consumes 100% CPU and keeps on logging the above 
heartbeat_map messages.

In the meantime ceph-mds2 has transitioned from "standby-replay" to "active":

mon.mon1 [INF] daemon mds.mds2 is now active in filesystem BITED-153874-cephfs 
as rank 0

Logging:

replays, final replay as standby, reopen log

2018-10-04 10:52:53.268470 7fdb231d9700 1 mds.0.141 reconnect_done
2018-10-04 10:52:53.759844 7fdb231d9700 1 mds.mds2 Updating MDS map to version 
143 from mon.3
2018-10-04 10:52:53.759859 7fdb231d9700 1 mds.0.141 handle_mds_map i am now 
mds.0.141
2018-10-04 10:52:53.759862 7fdb231d9700 1 mds.0.141 handle_mds_map state change 
up:reconnect --> up:rejoin
2018-10-04 10:52:53.759868 7fdb231d9700 1 mds.0.141 rejoin_start
2018-10-04 10:52:53.759970 7fdb231d9700 1 mds.0.141 rejoin_joint_start
2018-10-04 10:52:53.760970 7fdb1d1cd700 0 mds.0.cache failed to open ino 
0x1cd95e9 err -5/0
2018-10-04 10:52:54.126658 7fdb1d1cd700 1 mds.0.141 rejoin_done
2018-10-04 10:52:54.770457 7fdb231d9700 1 mds.mds2 Updating MDS map to version 
144 from mon.3
2018-10-04 10:52:54.770484 7fdb231d9700 1 mds.0.141 handle_mds_map i am now 
mds.0.141
2018-10-04 10:52:54.770487 7fdb231d9700 1 mds.0.141 handle_mds_map state change 
up:rejoin --> up:clientreplay
2018-10-04 10:52:54.770494 7fdb231d9700 1 mds.0.141 recovery_done -- successful 
recovery!
2018-10-04 10:52:54.770617 7fdb231d9700 1 mds.0.141 clientreplay_start
2018-10-04 10:52:54.882995 7fdb1d1cd700 1 mds.0.141 clientreplay_done
2018-10-04 10:52:55.778598 7fdb231d9700 1 mds.mds2 Updating MDS map to version 
145 from mon.3
2018-10-04 10:52:55.778622 7fdb231d9700 1 mds.0.141 handle_mds_map i am now 
mds.0.141
2018-10-04 10:52:55.778628 7fdb231d9700 1 mds.0.141 handle_mds_map state change 
up:clientreplay --> up:active
2018-10-04 10:52:55.778638 7fdb231d9700 1 mds.0.141 active_start
2018-10-04 10:52:55.805206 7fdb231d9700 1 mds.0.141 cluster recovered.

And then it _also_ starts to log heartbeat_map messages (and consumes 100% CPU);
the following messages keep repeating while the CPU stays at 100%:
2018-10-04 10:53:41.550793 7fdb241db700 1 heartbeat_map is_healthy 'MDSRank' 
had timed out after 15
2018-10-04 10:53:42.884018 7fdb201d3700 1 heartbeat_map is_healthy 'MDSRank' 
had timed out after 15
2018-10-04 10:53:42.884024 7fdb201d3700 1 mds.beacon.mds2 _send skipping 
beacon, heartbeat map not healthy

At that point in time there is one active MDS according to ceph, but in reality 
it's
not functioning correctly (not serving clients at least).

... we stopped both daemons. Restarted one ... recovery ...
works for half a minute ... then starts logging heartbeat_map messages.
Restart again ... works for a little while ... starts logging
heartbeat_map messages again. We restart the mds with debug_mds=20 and
it keeps working fine. The other mds gets restarted and keeps on
working. We do a couple of failover tests ... it works flawlessly
(failover in < 1 second, clients reconnect instantly).
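For reference, the debug levels can also be raised on a running daemon via the
admin socket instead of a restart (a sketch, assuming the socket still responds):

ceph daemon mds.mds2 config set debug_mds 20
ceph daemon mds.mds2 config set debug_journaler 20
ceph daemon mds.mds2 ops                  # operations currently in flight
ceph daemon mds.mds2 objecter_requests    # outstanding RADOS requests from the MDS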

A couple of hours later we hit the same issue. We restarted with
debug_mds=20 and debug_journaler=20 on the standby-replay node. Eight
hours later (an hour ago) we hit the same issue. We captured ~4.7 GB of
logging; I skipped to the end of the log file just before the
"heartbeat_map" messages start:

2018-10-04 23:23:53.144644 7f415ebf4700 20 mds.0.locker  client.17079146 
pending pAsLsXsFscr allowed pAsLsXsFscr wanted pFscr
2018-10-04 23:23:53.144645 7f415ebf4700 10 mds.0.locker eval done
2018-10-04 23:23:55.088542 7f415bbee700 10 mds.beacon.mds2 _send up:active seq 
5021
2018-10-04 23:23:59.088602 7f415bbee700 10 mds.beacon.mds2 _send up:active seq 
5022
2018-10-04 23:24:03.088688 7f415bbee700 10 mds.beacon.mds2 _send up:active seq 
5023
2018-10-04 23:24:07.088775 7f415bbee700 10 mds.beacon.mds2 _send up:active seq 
5024
2018-10-04 23:24:11.088867 7f415bbee700  1 heartbeat_map is_healthy 'MDSRank' 
had timed out after 15
2018-10-04 23:24:11.088871 7f415bbee700  1 mds.beacon.mds2 _send skipping 
beacon, heartbeat map not healthy

As far as I can see just normal behaviour.

The big question is: what is happening when the MDS starts logging the 
heartbeat_map messages?
Why does 

[ceph-users] Ceph version upgrade with Juju

2018-10-04 Thread Fabio Abreu
 Hi Cephers,

I have a question about the migration of the Jewel version in a MAAS/Juju
deployment scenario.

Does anyone have experience with this in a production environment?

I am asking because we are mapping out all the challenges of this scenario.

Thanks and best Regards,
Fabio Abreu


[ceph-users] Ceph 13.2.2 on Ubuntu 18.04 arm64

2018-10-04 Thread Rob Raymakers
Hi,

I'm trying to get Ceph 13.2.2 running on Ubuntu 18.04 arm64, specifically a
Rock64 to build a mini cluster. But I can't figure out how to build Ceph
13.2.2 from github, as there is no Ceph 13.2.2 package available yet. I
tried to figure it out with the instructions on github, but no success.
Can anyone help me out or point me in the right direction? I got my setup
working on 6 VMs with Ubuntu 18.04 on amd64. I just need the arm64 packages.

Sorry for the noob question; I've never built a package before, and you've got
to start somewhere.
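For what it's worth, a rough sketch of one way to build the packages from the
v13.2.2 tag (helper scripts as shipped in the source tree; expect a very long
build and heavy memory use on a Rock64):

git clone -b v13.2.2 --depth 1 https://github.com/ceph/ceph.git
cd ceph
git submodule update --init --recursive
./install-deps.sh     # installs the build dependencies
./make-debs.sh        # builds the .deb packages locally

If make-debs.sh isn't usable on that release, dpkg-buildpackage from the in-tree
debian/ directory is another option.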

Thanks,

Rob


Re: [ceph-users] Erasure coding with more chunks than servers

2018-10-04 Thread Paul Emmerich
Yes, you can use a crush rule with two steps:

take default
chooseleaf indep 5
emit
take default
chooseleaf indep 2
emit

You'll have to adjust it when adding a server, so it's not a great
solution. I'm not sure if there's a way to do it without hardcoding
the number of servers (I don't think there is).
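For reference, spelled out in full crushmap syntax that approach might look
roughly like this (rule name and id are arbitrary, and 'host' is assumed to be
the server-level failure domain):

rule ec_k5m2_over_5_hosts {
        id 3
        type erasure
        min_size 7
        max_size 7
        step set_chooseleaf_tries 5
        step take default
        step chooseleaf indep 5 type host
        step emit
        step take default
        step chooseleaf indep 2 type host
        step emit
}

It's worth sanity-checking the resulting mappings with
crushtool -i crushmap --test --rule 3 --num-rep 7 --show-mappings
before injecting the edited map.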

Paul


On Thu, 4 Oct 2018 at 20:28, Vladimir Brik wrote:
>
> Hello
>
> I have a 5-server cluster and I am wondering if it's possible to create
> pool that uses k=5 m=2 erasure code. In my experiments, I ended up with
> pools whose pgs are stuck in creating+incomplete state even when I
> created the erasure code profile with --crush-failure-domain=osd.
>
> Assuming that what I want to do is possible, will CRUSH distribute
> chunks evenly among servers, so that if I need to bring one server down
> (e.g. reboot), clients' ability to write or read any object would not be
> disrupted? (I guess something would need to ensure that no server holds
> more than two chunks of an object)
>
> Thanks,
>
> Vlad
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


[ceph-users] Erasure coding with more chunks than servers

2018-10-04 Thread Vladimir Brik
Hello

I have a 5-server cluster and I am wondering if it's possible to create a
pool that uses a k=5 m=2 erasure code. In my experiments, I ended up with
pools whose pgs are stuck in creating+incomplete state even when I
created the erasure code profile with --crush-failure-domain=osd.
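For context, the profile and pool were created with commands along these lines
(profile and pool names here are placeholders):

ceph osd erasure-code-profile set ec-5-2 k=5 m=2 crush-failure-domain=osd
ceph osd pool create ecpool 64 64 erasure ec-5-2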

Assuming that what I want to do is possible, will CRUSH distribute
chunks evenly among servers, so that if I need to bring one server down
(e.g. reboot), clients' ability to write or read any object would not be
disrupted? (I guess something would need to ensure that no server holds
more than two chunks of an object)

Thanks,

Vlad


Re: [ceph-users] Resolving Large omap objects in RGW index pool

2018-10-04 Thread Chris Sarginson
Hi,

Thanks for the response - I am still unsure as to what will happen to the
"marker" reference in the bucket metadata, as this is the object that is
being detected as Large.  Will the bucket generate a new "marker" reference
in the bucket metadata?

I've been reading this page to try and get a better understanding of this
http://docs.ceph.com/docs/luminous/radosgw/layout/

However I'm no clearer on this (and what the "marker" is used for), or why
there are multiple separate "bucket_id" values (with different mtime
stamps) that all show as having the same number of shards.

If I were to remove the old bucket would I just be looking to execute

rados -p .rgw.buckets.index rm .dir.default.5689810.107

Is the differing marker/bucket_id in the other buckets that was found also
an indicator?  As I say, there's a good number of these, here's some
additional examples, though these aren't necessarily reporting as large
omap objects:

"BUCKET1", "default.281853840.479", "default.105206134.5",
"BUCKET2", "default.364663174.1", "default.349712129.3674",

Checking these other buckets, they are exhibiting the same sort of symptoms
as the first (multiple instances of radosgw-admin metadata get showing what
seem to be multiple resharding processes being run, with different mtimes
recorded).

Thanks
Chris

On Thu, 4 Oct 2018 at 16:21 Konstantin Shalygin  wrote:

> Hi,
>
> Ceph version: Luminous 12.2.7
>
> Following upgrading to Luminous from Jewel we have been stuck with a
> cluster in HEALTH_WARN state that is complaining about large omap objects.
> These all seem to be located in our .rgw.buckets.index pool.  We've
> disabled auto resharding on bucket indexes due to seeming looping issues
> after our upgrade.  We've reduced the number reported of reported large
> omap objects by initially increasing the following value:
>
> ~# ceph daemon mon.ceph-mon-1 config get
> osd_deep_scrub_large_omap_object_value_sum_threshold
> {
> "osd_deep_scrub_large_omap_object_value_sum_threshold": "2147483648 
> <(214)%20748-3648>"
> }
>
> However we're still getting a warning about a single large OMAP object,
> however I don't believe this is related to an unsharded index - here's the
> log entry:
>
> 2018-10-01 13:46:24.427213 osd.477 osd.477 172.26.216.6:6804/2311858 8482 :
> cluster [WRN] Large omap object found. Object:
> 15:333d5ad7:::.dir.default.5689810.107:head Key count: 17467251 Size
> (bytes): 4458647149
>
> The object in the logs is the "marker" object, rather than the bucket_id -
> I've put some details regarding the bucket here:
> https://pastebin.com/hW53kTxL
>
> The bucket limit check shows that the index is sharded, so I think this
> might be related to versioning, although I was unable to get confirmation
> that the bucket in question has versioning enabled through the aws
> cli(snipped debug output below)
>
> 2018-10-02 15:11:17,530 - MainThread - botocore.parsers - DEBUG - Response
> headers: {'date': 'Tue, 02 Oct 2018 14:11:17 GMT', 'content-length': '137',
> 'x-amz-request-id': 'tx0020e3b15-005bb37c85-15870fe0-default',
> 'content-type': 'application/xml'}
> 2018-10-02 15:11:17,530 - MainThread - botocore.parsers - DEBUG - Response
> body:
>  xmlns="http://s3.amazonaws.com/doc/2006-03-01/;>
>
> After dumping the contents of large omap object mentioned above into a file
> it does seem to be a simple listing of the bucket contents, potentially an
> old index:
>
> ~# wc -l omap_keys
> 17467251 omap_keys
>
> This is approximately 5 million below the currently reported number of
> objects in the bucket.
>
> When running the commands listed 
> here:http://tracker.ceph.com/issues/34307#note-1
>
> The problematic bucket is listed in the output (along with 72 other
> buckets):
> "CLIENTBUCKET", "default.294495648.690", "default.5689810.107"
>
> As this tests for bucket_id and marker fields not matching to print out the
> information, is the implication here that both of these should match in
> order to fully migrate to the new sharded index?
>
> I was able to do a "metadata get" using what appears to be the old index
> object ID, which seems to support this (there's a "new_bucket_instance_id"
> field, containing a newer "bucket_id" and reshard_status is 2, which seems
> to suggest it has completed).
>
> I am able to take the "new_bucket_instance_id" and get additional metadata
> about the bucket, each time I do this I get a slightly newer
> "new_bucket_instance_id", until it stops suggesting updated indexes.
>
> It's probably worth pointing out that when going through this process the
> final "bucket_id" doesn't match the one that I currently get when running
> 'radosgw-admin bucket stats --bucket "CLIENTBUCKET"', even though it also
> suggests that no further resharding has been done as "reshard_status" = 0
> and "new_bucket_instance_id" is blank.  The output is available to view
> here:
> https://pastebin.com/g1TJfKLU
>
> It would be useful if anyone can offer some clarification on 

Re: [ceph-users] Unfound object on erasure when recovering

2018-10-04 Thread Jan Pekař - Imatic
I thought that putting the disk back "in" would solve the problem, but it did not. The problem is still there, but the cluster doesn't see the object as unfound, so it reports an I/O 
error.


This is what I got when using rados get command -

error getting erasure_3_1/11eec49.: (5) Input/output error

So maybe the problem appeared before I tried to re-balance my cluster and was invisible to me. But it never happened before, and scrub and 
deep-scrub run regularly.


I don't know where to continue with debugging this problem.
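For anyone wanting to dig into something similar, the usual starting points would be along these lines (PG id 10.x is a placeholder):

ceph health detail                       # shows which PGs have unfound/missing objects
ceph pg 10.x list_missing                # the dump quoted below
ceph pg 10.x query                       # peering/recovery state and which OSDs were probed
ceph pg 10.x mark_unfound_lost delete    # last resort: gives up on the object's data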

JP

On 3.10.2018 08:47, Jan Pekař - Imatic wrote:

Hi all,

I'm playing with my testing cluster with ceph 12.2.8 installed.

It has happened to me for the second time that I have 1 unfound object on an erasure 
coded pool.

I have an erasure coded pool with a 3+1 configuration.

The first time, I was adding an additional disk. During the cluster rebalance I noticed one unfound object. I hoped that it would be fixed after 
the rebalance, but it was not.


I coped by marking the object as lost, because disk I/O on that object got stuck.

Yesterday I was trying to remove one disk so I marked it out.

After a few hours I again noticed one unfound object. This is a dump of pg 
list_missing.

There is a strange pool number and snapid (I'm not using snapshots on that pool, it is just a pool for CephFS data), and the locations array also looks 
strange.


I decided to put the disk I wanted to remove back "in", and the unfound object 
disappeared.

Can you give me additional information about this problem? Should I debug it more?

Thank you

{
    "offset": {
    "oid": "",
    "key": "",
    "snapid": 0,
    "hash": 0,
    "max": 0,
    "pool": -9223372036854775808,
    "namespace": ""
    },
    "num_missing": 0,
    "num_unfound": 1,
    "objects": [
    {
    "oid": {
    "oid": "11eec49.",
    "key": "",
    "snapid": -2,
    "hash": 586898362,
    "max": 0,
    "pool": 10,
    "namespace": ""
    },
    "need": "13528'6795",
    "have": "0'0",
    "flags": "none",
    "locations": [
    "7(3)"
    ]
    }
    ],
    "more": false
}




--

Ing. Jan Pekař
jan.pe...@imatic.cz | +420603811737

Imatic | Jagellonská 14 | Praha 3 | 130 00
http://www.imatic.cz

--



Re: [ceph-users] hardware heterogeneous in same pool

2018-10-04 Thread Brett Chancellor
You could also set osd_crush_initial_weight = 0. New OSDs will
automatically come up with a 0 weight and you won't have to race the clock.
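A minimal sketch of that combination (the OSD id and target weight are
placeholders):

# ceph.conf on the OSD hosts, set before creating the new OSDs:
[osd]
osd_crush_initial_weight = 0

# then ramp each new OSD up in steps, letting backfill settle in between:
for w in 1.0 2.0 3.0 4.0 5.458; do
    ceph osd crush reweight osd.42 $w
    sleep 3600
done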

-Brett

On Thu, Oct 4, 2018 at 3:50 AM Janne Johansson  wrote:

>
>
> On Thu, 4 Oct 2018 at 00:09, Bruno Carvalho wrote:
>
>> Hi Cephers, I would like to know how you are growing the cluster.
>> Using dissimilar hardware in the same pool or creating a pool for each
>> different hardware group.
>> What problems would I have using different hardware (CPU,
>> memory, disk) in the same pool?
>
>
> I don't think CPU and RAM (and other hw related things like HBA controller
> card brand) matters
> a lot, more is always nicer, but as long as you don't add worse machines
> like Jonathan wrote you
> should not see any degradation.
>
> What you might want to look out for is if the new disks are very uneven
> compared to the old
> setup, so if you used to have servers with 10x2TB drives and suddenly add
> one with 2x10TB,
> things might become very unbalanced, since those differences will not be
> handled seamlessly
> by the crush map.
>
> Apart from that, the only issues for us is "add drives, quickly set crush
> reweight to 0.0 before
> all existing OSD hosts shoot massive amounts of I/O on them, then script a
> slower raise of
> crush weight upto what they should end up at", to lessen the impact for
> our 24/7 operations.
>
> If you have weekends where noone accesses the cluster or night-time low-IO
> usage patterns,
> just upping the weight at the right hour might suffice.
>
> Lastly, for ssd/nvme setups with good networking, this is almost moot,
> they converge so fast
> its almost unfair. A real joy working with expanding flash-only
> pools/clusters.
>
> --
> May the most significant bit of your life be positive.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


[ceph-users] Cluster broken and ODSs crash with failed assertion in PGLog::merge_log

2018-10-04 Thread Jonas Jelten
Hello!

Unfortunately, our single-node "cluster" with 11 OSDs is broken because some 
OSDs crash when they start peering.
I'm on Ubuntu 18.04 with Ceph Mimic (13.2.2).

The problem was induced when RAM filled up and OSD processes then 
crashed because of memory allocation failures.

No weird commands (e.g. force_create_pg) were used on this cluster and it was 
set up with 13.2.1 initially.
The affected pool seems to be a replicated pool with size=3 and min_size=2 
(which haven't been changed).

Crash log of osd.4 (only the crashed thread):

99424: -1577> 2018-10-04 13:40:11.024 7f3838417700 10 log is not dirty
99425: -1576> 2018-10-04 13:40:11.024 7f3838417700 10 osd.4 1433 
queue_want_up_thru want 1433 <= queued 1433, currently 1426
99427: -1574> 2018-10-04 13:40:11.024 7f3838417700 20 osd.4 op_wq(3) _process 
3.8 to_process <> waiting <>
waiting_peering {}
99428: -1573> 2018-10-04 13:40:11.024 7f3838417700 20 osd.4 op_wq(3) _process 
OpQueueItem(3.8 PGPeeringEvent(epoch_sent:
1433 epoch_requested: 1433 MNotifyRec 3.8 from 2 notify: (query:1433 sent:1433 
3.8( v 866'122691 (569'119300,866'122691]
local-lis/les=1401/1402 n=54053 ec=126/126 lis/c 1401/859 les/c/f 1402/860/0 
1433/1433/1433)) features:
0x3ffddff8ffa4fffb ([859,1432] intervals=([1213,1215] acting 0,2),([1308,1311] 
acting 4,10),([1401,1403] acting
2,10),([1426,1428] acting 2,4)) +create_info) prio 255 cost 10 e1433) queued
99430: -1571> 2018-10-04 13:40:11.024 7f3838417700 20 osd.4 op_wq(3) _process 
3.8 to_process  waiting <>
waiting_peering {}
99433: -1568> 2018-10-04 13:40:11.024 7f3838417700 20 osd.4 op_wq(3) _process 
OpQueueItem(3.8 PGPeeringEvent(epoch_sent:
1433 epoch_requested: 1433 MNotifyRec 3.8 from 2 notify: (query:1433 sent:1433 
3.8( v 866'122691 (569'119300,866'122691]
local-lis/les=1401/1402 n=54053 ec=126/126 lis/c 1401/859 les/c/f 1402/860/0 
1433/1433/1433)) features:
0x3ffddff8ffa4fffb ([859,1432] intervals=([1213,1215] acting 0,2),([1308,1311] 
acting 4,10),([1401,1403] acting
2,10),([1426,1428] acting 2,4)) +create_info) prio 255 cost 10 e1433) pg 
0x56013bc87400
99437: -1564> 2018-10-04 13:40:11.024 7f3838417700 10 osd.4 pg_epoch: 1433 
pg[3.8( v 866'127774 (866'124700,866'127774]
local-lis/les=859/860 n=56570 ec=126/126 lis/c 1401/859 les/c/f 1402/860/0 
1433/1433/1433) [4,2] r=0 lpr=1433
pi=[859,1433)/4 crt=866'127774 lcod 0'0 mlcod 0'0 peering mbc={}] 
do_peering_event: epoch_sent: 1433 epoch_requested:
1433 MNotifyRec 3.8 from 2 notify: (query:1433 sent:1433 3.8( v 866'122691 
(569'119300,866'122691]
local-lis/les=1401/1402 n=54053 ec=126/126 lis/c 1401/859 les/c/f 1402/860/0 
1433/1433/1433)) features:
0x3ffddff8ffa4fffb ([859,1432] intervals=([1213,1215] acting 0,2),([1308,1311] 
acting 4,10),([1401,1403] acting
2,10),([1426,1428] acting 2,4)) +create_info
99440: -1561> 2018-10-04 13:40:11.024 7f3838417700  7 osd.4 pg_epoch: 1433 
pg[3.8( v 866'127774 (866'124700,866'127774]
local-lis/les=859/860 n=56570 ec=126/126 lis/c 1401/859 les/c/f 1402/860/0 
1433/1433/1433) [4,2] r=0 lpr=1433
pi=[859,1433)/4 crt=866'127774 lcod 0'0 mlcod 0'0 peering mbc={}] 
state: handle_pg_notify from osd.2
99444: -1557> 2018-10-04 13:40:11.024 7f3838417700 10 osd.4 pg_epoch: 1433 
pg[3.8( v 866'127774 (866'124700,866'127774]
local-lis/les=859/860 n=56570 ec=126/126 lis/c 1401/859 les/c/f 1402/860/0 
1433/1433/1433) [4,2] r=0 lpr=1433
pi=[859,1433)/4 crt=866'127774 lcod 0'0 mlcod 0'0 peering mbc={}]  got dup 
osd.2 info 3.8( v 866'122691
(569'119300,866'122691] local-lis/les=1401/1402 n=54053 ec=126/126 lis/c 
1401/859 les/c/f 1402/860/0 1433/1433/1433),
identical to ours
99445: -1556> 2018-10-04 13:40:11.024 7f3838417700 10 log is not dirty
99446: -1555> 2018-10-04 13:40:11.024 7f3838417700 10 osd.4 1433 
queue_want_up_thru want 1433 <= queued 1433, currently 1426
99448: -1553> 2018-10-04 13:40:11.024 7f3838417700 20 osd.4 op_wq(3) _process 
3.8 to_process <> waiting <>
waiting_peering {}
99450: -1551> 2018-10-04 13:40:11.024 7f3838417700 20 osd.4 op_wq(3) _process 
OpQueueItem(3.8 PGPeeringEvent(epoch_sent:
1433 epoch_requested: 1433 MLogRec from 2 +create_info) prio 255 cost 10 e1433) 
queued
99456: -1545> 2018-10-04 13:40:11.024 7f3838417700 20 osd.4 op_wq(3) _process 
3.8 to_process  waiting <>
waiting_peering {}
99458: -1543> 2018-10-04 13:40:11.024 7f3838417700 20 osd.4 op_wq(3) _process 
OpQueueItem(3.8 PGPeeringEvent(epoch_sent:
1433 epoch_requested: 1433 MLogRec from 2 +create_info) prio 255 cost 10 e1433) 
pg 0x56013bc87400
99461: -1540> 2018-10-04 13:40:11.024 7f3838417700 10 osd.4 pg_epoch: 1433 
pg[3.8( v 866'127774 (866'124700,866'127774]
local-lis/les=859/860 n=56570 ec=126/126 lis/c 1401/859 les/c/f 1402/860/0 
1433/1433/1433) [4,2] r=0 lpr=1433
pi=[859,1433)/4 crt=866'127774 lcod 0'0 mlcod 0'0 peering mbc={}] 
do_peering_event: epoch_sent: 1433 epoch_requested:
1433 MLogRec from 2 +create_info
99465: -1536> 2018-10-04 13:40:11.024 7f3838417700 10 osd.4 pg_epoch: 1433 
pg[3.8( v 866'127774 

Re: [ceph-users] RBD Mirror Question

2018-10-04 Thread Jason Dillaman
On Thu, Oct 4, 2018 at 11:15 AM Vikas Rana  wrote:
>
> Bummer.
>
> Our OSD is on 10G private network and MON is on 1G public network. I believe 
> this is reference architecture mentioned everywhere to separate MON and OSD.
>
> I believe the requirement for rbd-mirror for the secondary site MON to reach 
> the private OSD IPs on primary was never mentioned anywhere or may be i 
> missed it.
>
> Looks like if rbd-mirror needs to be used, we have to use 1 network for both 
> MON and the OSDs. No private addresseing  :-)
>
> Thanks a lot for your help. I won't have got this information without your 
> help.

Sorry about that, I thought it was documented but it turns out that
was only in the Red Hat block mirroring documentation. I've opened a
PR to explicitly state that the "rbd-mirror" daemon requires full
cluster connectivity [1].

> -Vikas
>
>
> On Thu, Oct 4, 2018 at 10:37 AM Jason Dillaman  wrote:
>>
>> On Thu, Oct 4, 2018 at 10:27 AM Vikas Rana  wrote:
>> >
>> > on Primary site, we have OSD's running on 192.168.4.x address.
>> >
>> > Similarly on Secondary site, we have OSD's running on 192.168.4.x address. 
>> > 192.168.3.x is the old MON network.on both site which was non route-able.
>> > So we renamed mon on primary site to 165.x.x and mon on secondary site to 
>> > 165.x.y. now primary and secondary can see each other.
>> >
>> >
>> > Do the OSD daemon from primary and secondary have to talk to each other? 
>> > we have same non routed networks for OSD.
>>
>> The secondary site needs to be able to communicate with all MON and
>> OSD daemons in the primary site.
>>
>> > Thanks,
>> > -Vikas
>> >
>> > On Thu, Oct 4, 2018 at 10:13 AM Jason Dillaman  wrote:
>> >>
>> >> On Thu, Oct 4, 2018 at 10:10 AM Vikas Rana  wrote:
>> >> >
>> >> > Thanks Jason for great suggestions.
>> >> >
>> >> > but somehow rbd mirror status not working from secondary to primary. 
>> >> > Here;s the status from both sides. cluster name is ceph on primary side 
>> >> > and cephdr on remote site. mirrordr is the user on DR side and 
>> >> > mirrorprod is on primary prod side.
>> >> >
>> >> > # rbd mirror pool info nfs
>> >> > Mode: image
>> >> > Peers:
>> >> >   UUID NAME   CLIENT
>> >> >   3ccd7a67-2343-44bf-960b-02d9b1258371 cephdr client.mirrordr.
>> >> >
>> >> > rbd --cluster cephdr mirror pool info nfs
>> >> > Mode: image
>> >> > Peers:
>> >> >   UUID NAME CLIENT
>> >> >   e6b9ba05-48de-462c-ad5f-0b51d0ee733f ceph client.mirrorprod
>> >> >
>> >> >
>> >> > From primary site, when i query the remote site, its looks good.
>> >> > # rbd --cluster cephdr --id mirrordr mirror pool status nfs
>> >> > health: OK
>> >> > images: 0 total
>> >> >
>> >> > but when i query from secondary site to primary side, I'm getting this 
>> >> > error
>> >> > # rbd  --cluster ceph --id mirrorprod mirror pool status nfs
>> >> > 2018-10-03 10:21:06.645903 7f27a44ed700  0 -- 165.x.x.202:0/1310074448 
>> >> > >> 192.168.3.21:6804/3835 pipe(0x55ed47daf480 sd=4 :0 s=1 pgs=0 cs=0 
>> >> > l=1 c=0x55ed47db0740).fault
>> >> >
>> >> >
>> >> > We were using 192.168.3.x for MON network before we renamed it to use 
>> >> > 165 address since its routeable. why its trying to connect to 192.x 
>> >> > address instead of 165.x.y address?
>> >>
>> >> Are your OSDs on that 192.168.3.x subnet? What daemons are running on
>> >> 192.168.3.21?
>> >>
>> >> > I could do ceph -s from both side and they can see each other. Only rbd 
>> >> > command is having issue.
>> >> >
>> >> > Thanks,
>> >> > -Vikas
>> >> >
>> >> >
>> >> >
>> >> >
>> >> > On Tue, Oct 2, 2018 at 5:14 PM Jason Dillaman  
>> >> > wrote:
>> >> >>
>> >> >> On Tue, Oct 2, 2018 at 4:47 PM Vikas Rana  wrote:
>> >> >> >
>> >> >> > Hi,
>> >> >> >
>> >> >> > We have a CEPH 3 node cluster at primary site. We created a RBD 
>> >> >> > image and the image has about 100TB of data.
>> >> >> >
>> >> >> > Now we installed another 3 node cluster on secondary site. We want 
>> >> >> > to replicate the image at primary site to this new cluster on 
>> >> >> > secondary site.
>> >> >> >
>> >> >> > As per documentation, we enabled journaling on primary site. We 
>> >> >> > followed all the procedure and peering looks good but the image is 
>> >> >> > not copying.
>> >> >> > The status is always showing down.
>> >> >>
>> >> >> Do you have an "rbd-mirror" daemon running on the secondary site? Are
>> >> >> you running "rbd mirror pool status" against the primary site or the
>> >> >> secondary site? The mirroring status is only available on the sites
>> >> >> running "rbd-mirror" daemon (the "down" means that the cluster you are
>> >> >> connected to doesn't have the daemon running).
>> >> >>
>> >> >> > So my question is, is it possible to replicate a image which already 
>> >> >> > have some data before enabling journalling?
>> >> >>
>> >> >> Indeed -- it will perform a full image sync to the secondary site.
>> >> >>
>> >> >> > We are using the image mirroring instead of 

Re: [ceph-users] Mimic upgrade 13.2.1 > 13.2.2 monmap changed

2018-10-04 Thread Paul Emmerich
You can manually extract, edit, and inject the mon map to fix it.
In this case you probably need to:

1. check what exactly is going on, inspect the mon map of all mons
2. maybe the IP addresses changed or something? see if you can fix it
somehow without editing the monmap
3. adjust the mon map accordingly and inject it back into 2 mons

You can use these instructions to get/set/edit the mon map:
http://docs.ceph.com/docs/mimic/rados/operations/add-or-rm-mons/#removing-monitors-from-an-unhealthy-cluster
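A rough sketch of what that boils down to (mon names and the address are taken
from the post above and are purely illustrative; stop the mon daemons before
extracting/injecting):

ceph-mon -i ceph02 --extract-monmap /tmp/monmap
monmaptool --print /tmp/monmap                            # inspect names and addresses
monmaptool --rm ceph01 /tmp/monmap                        # example edit: drop a stale entry
monmaptool --add ceph01 192.168.200.197:6789 /tmp/monmap  # re-add it with the right address
ceph-mon -i ceph02 --inject-monmap /tmp/monmap            # repeat the inject on each remaining mon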

Paul
On Thu, 4 Oct 2018 at 14:40, Nino Bosteels wrote:
>
> Hello list,
>
>
>
> I’m having a serious issue, since my ceph cluster has become unresponsive. I 
> was upgrading my cluster (3 servers, 3 monitors) from 13.2.1 to 13.2.2, which 
> shouldn’t be a problem.
>
>
>
> Though on reboot my first host reported:
>
>
>
> starting mon.ceph01 rank -1 at 192.168.200.197:6789/0 mon_data 
> /var/lib/ceph/mon/ceph-ceph01 fsid 27dd45f1-28b5-4ac6-81ab-c62bc581130c
>
> mon.cephxx@-1(probing) e5 preinit fsid 27dd45f1-28b5-4ac6-81ab-c62bc581130c
>
> mon.cephxx@-1(probing) e5 not in monmap and have been in a quorum before; 
> must have been removed
>
> -1 mon.cephxx@-1(probing) e5 commit suicide!
>
> -1 failed to initialize
>
>
>
> I thought, perhaps the monitor doesn’t want to accept the monmap of the other 
> 2, because of the version-difference. Sadly, I upgraded and rebooted the 
> second server.
>
>
>
> Since the cluster is unresponsive (because more than half of the monitors is 
> offline / out of quorum). The logs of my second host, it keeps spamming:
>
>
>
> 2018-10-04 14:39:06.802 7fed0058f700 -1 mon.ceph02@1(probing) e6 
> get_health_metrics reporting 14 slow ops, oldest is auth(proto 0 27 bytes 
> epoch 6)
>
>
>
> Any help VERY MUCH appreciated, this sucks.
>
>
>
> Thanks
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


Re: [ceph-users] Resolving Large omap objects in RGW index pool

2018-10-04 Thread Konstantin Shalygin

Hi,

Ceph version: Luminous 12.2.7

Following upgrading to Luminous from Jewel we have been stuck with a
cluster in HEALTH_WARN state that is complaining about large omap objects.
These all seem to be located in our .rgw.buckets.index pool.  We've
disabled auto resharding on bucket indexes due to seeming looping issues
after our upgrade.  We've reduced the number reported of reported large
omap objects by initially increasing the following value:

~# ceph daemon mon.ceph-mon-1 config get
osd_deep_scrub_large_omap_object_value_sum_threshold
{
 "osd_deep_scrub_large_omap_object_value_sum_threshold": "2147483648"
}

However we're still getting a warning about a single large OMAP object,
however I don't believe this is related to an unsharded index - here's the
log entry:

2018-10-01 13:46:24.427213 osd.477 osd.477 172.26.216.6:6804/2311858 8482 :
cluster [WRN] Large omap object found. Object:
15:333d5ad7:::.dir.default.5689810.107:head Key count: 17467251 Size
(bytes): 4458647149

The object in the logs is the "marker" object, rather than the bucket_id -
I've put some details regarding the bucket here:

https://pastebin.com/hW53kTxL

The bucket limit check shows that the index is sharded, so I think this
might be related to versioning, although I was unable to get confirmation
that the bucket in question has versioning enabled through the aws
cli(snipped debug output below)

2018-10-02 15:11:17,530 - MainThread - botocore.parsers - DEBUG - Response
headers: {'date': 'Tue, 02 Oct 2018 14:11:17 GMT', 'content-length': '137',
'x-amz-request-id': 'tx0020e3b15-005bb37c85-15870fe0-default',
'content-type': 'application/xml'}
2018-10-02 15:11:17,530 - MainThread - botocore.parsers - DEBUG - Response
body:
http://s3.amazonaws.com/doc/2006-03-01/;>

After dumping the contents of large omap object mentioned above into a file
it does seem to be a simple listing of the bucket contents, potentially an
old index:

~# wc -l omap_keys
17467251 omap_keys

This is approximately 5 million below the currently reported number of
objects in the bucket.

When running the commands listed here:
http://tracker.ceph.com/issues/34307#note-1

The problematic bucket is listed in the output (along with 72 other
buckets):
"CLIENTBUCKET", "default.294495648.690", "default.5689810.107"

As this tests for bucket_id and marker fields not matching to print out the
information, is the implication here that both of these should match in
order to fully migrate to the new sharded index?

I was able to do a "metadata get" using what appears to be the old index
object ID, which seems to support this (there's a "new_bucket_instance_id"
field, containing a newer "bucket_id" and reshard_status is 2, which seems
to suggest it has completed).

I am able to take the "new_bucket_instance_id" and get additional metadata
about the bucket, each time I do this I get a slightly newer
"new_bucket_instance_id", until it stops suggesting updated indexes.

It's probably worth pointing out that when going through this process the
final "bucket_id" doesn't match the one that I currently get when running
'radosgw-admin bucket stats --bucket "CLIENTBUCKET"', even though it also
suggests that no further resharding has been done as "reshard_status" = 0
and "new_bucket_instance_id" is blank.  The output is available to view
here:

https://pastebin.com/g1TJfKLU

It would be useful if anyone can offer some clarification on how to proceed
from this situation, identifying and removing any old/stale indexes from
the index pool (if that is the case), as I've not been able to spot
anything in the archives.

If there's any further information that is needed for additional context
please let me know.



Usually, when your bucket is automatically resharded, in some cases the old big 
index is not deleted - this is your large omap object.


This index is safe to delete. Also look at [1].


[1] https://tracker.ceph.com/issues/24457
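For example, verifying and then removing the stale instance discussed above
might look like this (object and instance names as reported in the large-omap
warning and bucket metadata):

rados -p .rgw.buckets.index listomapkeys .dir.default.5689810.107 | wc -l
rados -p .rgw.buckets.index rm .dir.default.5689810.107
# optionally also drop the stale bucket instance metadata:
radosgw-admin metadata rm bucket.instance:CLIENTBUCKET:default.5689810.107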



k



Re: [ceph-users] RBD Mirror Question

2018-10-04 Thread Vikas Rana
Bummer.

Our OSDs are on a 10G private network and the MONs are on a 1G public network. I
believe this is the reference architecture mentioned everywhere, to separate MON
and OSD traffic.

I believe the requirement for rbd-mirror that the secondary site be able to
reach the private OSD IPs on the primary was never mentioned anywhere, or maybe
I missed it.

Looks like if rbd-mirror is to be used, we have to use one network for
both the MONs and the OSDs. No private addressing  :-)

Thanks a lot for your help. I wouldn't have got this information without your
help.

-Vikas


On Thu, Oct 4, 2018 at 10:37 AM Jason Dillaman  wrote:

> On Thu, Oct 4, 2018 at 10:27 AM Vikas Rana  wrote:
> >
> > on Primary site, we have OSD's running on 192.168.4.x address.
> >
> > Similarly on Secondary site, we have OSD's running on 192.168.4.x
> address. 192.168.3.x is the old MON network.on both site which was non
> route-able.
> > So we renamed mon on primary site to 165.x.x and mon on secondary site
> to 165.x.y. now primary and secondary can see each other.
> >
> >
> > Do the OSD daemon from primary and secondary have to talk to each other?
> we have same non routed networks for OSD.
>
> The secondary site needs to be able to communicate with all MON and
> OSD daemons in the primary site.
>
> > Thanks,
> > -Vikas
> >
> > On Thu, Oct 4, 2018 at 10:13 AM Jason Dillaman 
> wrote:
> >>
> >> On Thu, Oct 4, 2018 at 10:10 AM Vikas Rana 
> wrote:
> >> >
> >> > Thanks Jason for great suggestions.
> >> >
> >> > but somehow rbd mirror status not working from secondary to primary.
> Here;s the status from both sides. cluster name is ceph on primary side and
> cephdr on remote site. mirrordr is the user on DR side and mirrorprod is on
> primary prod side.
> >> >
> >> > # rbd mirror pool info nfs
> >> > Mode: image
> >> > Peers:
> >> >   UUID NAME   CLIENT
> >> >   3ccd7a67-2343-44bf-960b-02d9b1258371 cephdr client.mirrordr.
> >> >
> >> > rbd --cluster cephdr mirror pool info nfs
> >> > Mode: image
> >> > Peers:
> >> >   UUID NAME CLIENT
> >> >   e6b9ba05-48de-462c-ad5f-0b51d0ee733f ceph client.mirrorprod
> >> >
> >> >
> >> > From primary site, when i query the remote site, its looks good.
> >> > # rbd --cluster cephdr --id mirrordr mirror pool status nfs
> >> > health: OK
> >> > images: 0 total
> >> >
> >> > but when i query from secondary site to primary side, I'm getting
> this error
> >> > # rbd  --cluster ceph --id mirrorprod mirror pool status nfs
> >> > 2018-10-03 10:21:06.645903 7f27a44ed700  0 --
> 165.x.x.202:0/1310074448 >> 192.168.3.21:6804/3835 pipe(0x55ed47daf480
> sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x55ed47db0740).fault
> >> >
> >> >
> >> > We were using 192.168.3.x for MON network before we renamed it to use
> 165 address since its routeable. why its trying to connect to 192.x address
> instead of 165.x.y address?
> >>
> >> Are your OSDs on that 192.168.3.x subnet? What daemons are running on
> >> 192.168.3.21?
> >>
> >> > I could do ceph -s from both side and they can see each other. Only
> rbd command is having issue.
> >> >
> >> > Thanks,
> >> > -Vikas
> >> >
> >> >
> >> >
> >> >
> >> > On Tue, Oct 2, 2018 at 5:14 PM Jason Dillaman 
> wrote:
> >> >>
> >> >> On Tue, Oct 2, 2018 at 4:47 PM Vikas Rana 
> wrote:
> >> >> >
> >> >> > Hi,
> >> >> >
> >> >> > We have a CEPH 3 node cluster at primary site. We created a RBD
> image and the image has about 100TB of data.
> >> >> >
> >> >> > Now we installed another 3 node cluster on secondary site. We want
> to replicate the image at primary site to this new cluster on secondary
> site.
> >> >> >
> >> >> > As per documentation, we enabled journaling on primary site. We
> followed all the procedure and peering looks good but the image is not
> copying.
> >> >> > The status is always showing down.
> >> >>
> >> >> Do you have an "rbd-mirror" daemon running on the secondary site? Are
> >> >> you running "rbd mirror pool status" against the primary site or the
> >> >> secondary site? The mirroring status is only available on the sites
> >> >> running "rbd-mirror" daemon (the "down" means that the cluster you
> are
> >> >> connected to doesn't have the daemon running).
> >> >>
> >> >> > So my question is, is it possible to replicate a image which
> already have some data before enabling journalling?
> >> >>
> >> >> Indeed -- it will perform a full image sync to the secondary site.
> >> >>
> >> >> > We are using the image mirroring instead of pool mirroring. Do we
> need to create the RBD image on secondary site? As per documentation, its
> not required.
> >> >>
> >> >> The only difference between the two modes is whether or not you need
> >> >> to run "rbd mirror image enable" or not.
> >> >> > Is there any other option to copy the image to the remote site?
> >> >>
> >> >> No other procedure should be required.
> >> >>
> >> >> > Thanks,
> >> >> > -Vikas
> >> >> > 

Re: [ceph-users] RBD Mirror Question

2018-10-04 Thread Jason Dillaman
On Thu, Oct 4, 2018 at 10:27 AM Vikas Rana  wrote:
>
> on Primary site, we have OSD's running on 192.168.4.x address.
>
> Similarly on Secondary site, we have OSD's running on 192.168.4.x address. 
> 192.168.3.x is the old MON network.on both site which was non route-able.
> So we renamed mon on primary site to 165.x.x and mon on secondary site to 
> 165.x.y. now primary and secondary can see each other.
>
>
> Do the OSD daemon from primary and secondary have to talk to each other? we 
> have same non routed networks for OSD.

The secondary site needs to be able to communicate with all MON and
OSD daemons in the primary site.

> Thanks,
> -Vikas
>
> On Thu, Oct 4, 2018 at 10:13 AM Jason Dillaman  wrote:
>>
>> On Thu, Oct 4, 2018 at 10:10 AM Vikas Rana  wrote:
>> >
>> > Thanks Jason for great suggestions.
>> >
>> > but somehow rbd mirror status not working from secondary to primary. 
>> > Here;s the status from both sides. cluster name is ceph on primary side 
>> > and cephdr on remote site. mirrordr is the user on DR side and mirrorprod 
>> > is on primary prod side.
>> >
>> > # rbd mirror pool info nfs
>> > Mode: image
>> > Peers:
>> >   UUID NAME   CLIENT
>> >   3ccd7a67-2343-44bf-960b-02d9b1258371 cephdr client.mirrordr.
>> >
>> > rbd --cluster cephdr mirror pool info nfs
>> > Mode: image
>> > Peers:
>> >   UUID NAME CLIENT
>> >   e6b9ba05-48de-462c-ad5f-0b51d0ee733f ceph client.mirrorprod
>> >
>> >
>> > From primary site, when i query the remote site, its looks good.
>> > # rbd --cluster cephdr --id mirrordr mirror pool status nfs
>> > health: OK
>> > images: 0 total
>> >
>> > but when i query from secondary site to primary side, I'm getting this 
>> > error
>> > # rbd  --cluster ceph --id mirrorprod mirror pool status nfs
>> > 2018-10-03 10:21:06.645903 7f27a44ed700  0 -- 165.x.x.202:0/1310074448 >> 
>> > 192.168.3.21:6804/3835 pipe(0x55ed47daf480 sd=4 :0 s=1 pgs=0 cs=0 l=1 
>> > c=0x55ed47db0740).fault
>> >
>> >
>> > We were using 192.168.3.x for MON network before we renamed it to use 165 
>> > address since its routeable. why its trying to connect to 192.x address 
>> > instead of 165.x.y address?
>>
>> Are your OSDs on that 192.168.3.x subnet? What daemons are running on
>> 192.168.3.21?
>>
>> > I could do ceph -s from both side and they can see each other. Only rbd 
>> > command is having issue.
>> >
>> > Thanks,
>> > -Vikas
>> >
>> >
>> >
>> >
>> > On Tue, Oct 2, 2018 at 5:14 PM Jason Dillaman  wrote:
>> >>
>> >> On Tue, Oct 2, 2018 at 4:47 PM Vikas Rana  wrote:
>> >> >
>> >> > Hi,
>> >> >
>> >> > We have a CEPH 3 node cluster at primary site. We created a RBD image 
>> >> > and the image has about 100TB of data.
>> >> >
>> >> > Now we installed another 3 node cluster on secondary site. We want to 
>> >> > replicate the image at primary site to this new cluster on secondary 
>> >> > site.
>> >> >
>> >> > As per documentation, we enabled journaling on primary site. We 
>> >> > followed all the procedure and peering looks good but the image is not 
>> >> > copying.
>> >> > The status is always showing down.
>> >>
>> >> Do you have an "rbd-mirror" daemon running on the secondary site? Are
>> >> you running "rbd mirror pool status" against the primary site or the
>> >> secondary site? The mirroring status is only available on the sites
>> >> running "rbd-mirror" daemon (the "down" means that the cluster you are
>> >> connected to doesn't have the daemon running).
>> >>
>> >> > So my question is, is it possible to replicate a image which already 
>> >> > have some data before enabling journalling?
>> >>
>> >> Indeed -- it will perform a full image sync to the secondary site.
>> >>
>> >> > We are using the image mirroring instead of pool mirroring. Do we need 
>> >> > to create the RBD image on secondary site? As per documentation, its 
>> >> > not required.
>> >>
>> >> The only difference between the two modes is whether or not you need
>> >> to run "rbd mirror image enable" or not.
>> >> > Is there any other option to copy the image to the remote site?
>> >>
>> >> No other procedure should be required.
>> >>
>> >> > Thanks,
>> >> > -Vikas
>> >> > ___
>> >> > ceph-users mailing list
>> >> > ceph-users@lists.ceph.com
>> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >>
>> >>
>> >>
>> >> --
>> >> Jason
>>
>>
>>
>> --
>> Jason



-- 
Jason


Re: [ceph-users] RBD Mirror Question

2018-10-04 Thread Vikas Rana
On the primary site, we have OSDs running on 192.168.4.x addresses.

Similarly, on the secondary site we have OSDs running on 192.168.4.x addresses.
192.168.3.x is the old MON network on both sites, which was non-routable.
So we renamed the MONs on the primary site to 165.x.x and the MONs on the
secondary site to 165.x.y; now the primary and secondary can see each other.


Do the OSD daemons from the primary and secondary have to talk to each other? We
have the same non-routed networks for the OSDs.

Thanks,
-Vikas

On Thu, Oct 4, 2018 at 10:13 AM Jason Dillaman  wrote:

> On Thu, Oct 4, 2018 at 10:10 AM Vikas Rana  wrote:
> >
> > Thanks Jason for great suggestions.
> >
> > but somehow rbd mirror status not working from secondary to primary.
> Here;s the status from both sides. cluster name is ceph on primary side and
> cephdr on remote site. mirrordr is the user on DR side and mirrorprod is on
> primary prod side.
> >
> > # rbd mirror pool info nfs
> > Mode: image
> > Peers:
> >   UUID NAME   CLIENT
> >   3ccd7a67-2343-44bf-960b-02d9b1258371 cephdr client.mirrordr.
> >
> > rbd --cluster cephdr mirror pool info nfs
> > Mode: image
> > Peers:
> >   UUID NAME CLIENT
> >   e6b9ba05-48de-462c-ad5f-0b51d0ee733f ceph client.mirrorprod
> >
> >
> > From primary site, when i query the remote site, its looks good.
> > # rbd --cluster cephdr --id mirrordr mirror pool status nfs
> > health: OK
> > images: 0 total
> >
> > but when i query from secondary site to primary side, I'm getting this
> error
> > # rbd  --cluster ceph --id mirrorprod mirror pool status nfs
> > 2018-10-03 10:21:06.645903 7f27a44ed700  0 -- 165.x.x.202:0/1310074448
> >> 192.168.3.21:6804/3835 pipe(0x55ed47daf480 sd=4 :0 s=1 pgs=0 cs=0 l=1
> c=0x55ed47db0740).fault
> >
> >
> > We were using 192.168.3.x for MON network before we renamed it to use
> 165 address since its routeable. why its trying to connect to 192.x address
> instead of 165.x.y address?
>
> Are your OSDs on that 192.168.3.x subnet? What daemons are running on
> 192.168.3.21?
>
> > I could do ceph -s from both side and they can see each other. Only rbd
> command is having issue.
> >
> > Thanks,
> > -Vikas
> >
> >
> >
> >
> > On Tue, Oct 2, 2018 at 5:14 PM Jason Dillaman 
> wrote:
> >>
> >> On Tue, Oct 2, 2018 at 4:47 PM Vikas Rana  wrote:
> >> >
> >> > Hi,
> >> >
> >> > We have a CEPH 3 node cluster at primary site. We created a RBD image
> and the image has about 100TB of data.
> >> >
> >> > Now we installed another 3 node cluster on secondary site. We want to
> replicate the image at primary site to this new cluster on secondary site.
> >> >
> >> > As per documentation, we enabled journaling on primary site. We
> followed all the procedure and peering looks good but the image is not
> copying.
> >> > The status is always showing down.
> >>
> >> Do you have an "rbd-mirror" daemon running on the secondary site? Are
> >> you running "rbd mirror pool status" against the primary site or the
> >> secondary site? The mirroring status is only available on the sites
> >> running "rbd-mirror" daemon (the "down" means that the cluster you are
> >> connected to doesn't have the daemon running).
> >>
> >> > So my question is, is it possible to replicate a image which already
> have some data before enabling journalling?
> >>
> >> Indeed -- it will perform a full image sync to the secondary site.
> >>
> >> > We are using the image mirroring instead of pool mirroring. Do we
> need to create the RBD image on secondary site? As per documentation, its
> not required.
> >>
> >> The only difference between the two modes is whether or not you need
> >> to run "rbd mirror image enable" or not.
> >>
> >> > Is there any other option to copy the image to the remote site?
> >>
> >> No other procedure should be required.
> >>
> >> > Thanks,
> >> > -Vikas
> >> > ___
> >> > ceph-users mailing list
> >> > ceph-users@lists.ceph.com
> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> >>
> >>
> >> --
> >> Jason
>
>
>
> --
> Jason
>


[ceph-users] Resolving Large omap objects in RGW index pool

2018-10-04 Thread Chris Sarginson
Hi,

Ceph version: Luminous 12.2.7

Following upgrading to Luminous from Jewel we have been stuck with a
cluster in HEALTH_WARN state that is complaining about large omap objects.
These all seem to be located in our .rgw.buckets.index pool.  We've
disabled auto resharding on bucket indexes due to seeming looping issues
after our upgrade.  We've reduced the number reported of reported large
omap objects by initially increasing the following value:

~# ceph daemon mon.ceph-mon-1 config get
osd_deep_scrub_large_omap_object_value_sum_threshold
{
"osd_deep_scrub_large_omap_object_value_sum_threshold": "2147483648"
}

However we're still getting a warning about a single large OMAP object,
however I don't believe this is related to an unsharded index - here's the
log entry:

2018-10-01 13:46:24.427213 osd.477 osd.477 172.26.216.6:6804/2311858 8482 :
cluster [WRN] Large omap object found. Object:
15:333d5ad7:::.dir.default.5689810.107:head Key count: 17467251 Size
(bytes): 4458647149

The object in the logs is the "marker" object, rather than the bucket_id -
I've put some details regarding the bucket here:

https://pastebin.com/hW53kTxL

The bucket limit check shows that the index is sharded, so I think this
might be related to versioning, although I was unable to get confirmation
that the bucket in question has versioning enabled through the aws
cli(snipped debug output below)

2018-10-02 15:11:17,530 - MainThread - botocore.parsers - DEBUG - Response
headers: {'date': 'Tue, 02 Oct 2018 14:11:17 GMT', 'content-length': '137',
'x-amz-request-id': 'tx0020e3b15-005bb37c85-15870fe0-default',
'content-type': 'application/xml'}
2018-10-02 15:11:17,530 - MainThread - botocore.parsers - DEBUG - Response
body:
http://s3.amazonaws.com/doc/2006-03-01/;>

After dumping the contents of large omap object mentioned above into a file
it does seem to be a simple listing of the bucket contents, potentially an
old index:

~# wc -l omap_keys
17467251 omap_keys

This is approximately 5 million below the currently reported number of
objects in the bucket.

When running the commands listed here:
http://tracker.ceph.com/issues/34307#note-1

The problematic bucket is listed in the output (along with 72 other
buckets):
"CLIENTBUCKET", "default.294495648.690", "default.5689810.107"

As this tests for bucket_id and marker fields not matching to print out the
information, is the implication here that both of these should match in
order to fully migrate to the new sharded index?

I was able to do a "metadata get" using what appears to be the old index
object ID, which seems to support this (there's a "new_bucket_instance_id"
field, containing a newer "bucket_id" and reshard_status is 2, which seems
to suggest it has completed).

I am able to take the "new_bucket_instance_id" and get additional metadata
about the bucket, each time I do this I get a slightly newer
"new_bucket_instance_id", until it stops suggesting updated indexes.

It's probably worth pointing out that when going through this process the
final "bucket_id" doesn't match the one that I currently get when running
'radosgw-admin bucket stats --bucket "CLIENTBUCKET"', even though it also
suggests that no further resharding has been done as "reshard_status" = 0
and "new_bucket_instance_id" is blank.  The output is available to view
here:

https://pastebin.com/g1TJfKLU

It would be useful if anyone can offer some clarification on how to proceed
from this situation, identifying and removing any old/stale indexes from
the index pool (if that is the case), as I've not been able to spot
anything in the archives.

If there's any further information that is needed for additional context
please let me know.

Thanks
Chris


Re: [ceph-users] RBD Mirror Question

2018-10-04 Thread Jason Dillaman
On Thu, Oct 4, 2018 at 10:10 AM Vikas Rana  wrote:
>
> Thanks Jason for great suggestions.
>
> but somehow rbd mirror status not working from secondary to primary. Here;s 
> the status from both sides. cluster name is ceph on primary side and cephdr 
> on remote site. mirrordr is the user on DR side and mirrorprod is on primary 
> prod side.
>
> # rbd mirror pool info nfs
> Mode: image
> Peers:
>   UUID NAME   CLIENT
>   3ccd7a67-2343-44bf-960b-02d9b1258371 cephdr client.mirrordr.
>
> rbd --cluster cephdr mirror pool info nfs
> Mode: image
> Peers:
>   UUID NAME CLIENT
>   e6b9ba05-48de-462c-ad5f-0b51d0ee733f ceph client.mirrorprod
>
>
> From primary site, when i query the remote site, its looks good.
> # rbd --cluster cephdr --id mirrordr mirror pool status nfs
> health: OK
> images: 0 total
>
> but when i query from secondary site to primary side, I'm getting this error
> # rbd  --cluster ceph --id mirrorprod mirror pool status nfs
> 2018-10-03 10:21:06.645903 7f27a44ed700  0 -- 165.x.x.202:0/1310074448 >> 
> 192.168.3.21:6804/3835 pipe(0x55ed47daf480 sd=4 :0 s=1 pgs=0 cs=0 l=1 
> c=0x55ed47db0740).fault
>
>
> We were using 192.168.3.x for MON network before we renamed it to use 165 
> address since its routeable. why its trying to connect to 192.x address 
> instead of 165.x.y address?

Are your OSDs on that 192.168.3.x subnet? What daemons are running on
192.168.3.21?
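A quick way to see which addresses the primary cluster is actually advertising
(and therefore what a remote rbd client will try to reach) is something like:

ceph mon dump                      # monitor addresses in the monmap
ceph osd dump | grep '^osd\.'      # per-OSD public/cluster addresses in the osdmap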

> I could do ceph -s from both side and they can see each other. Only rbd 
> command is having issue.
>
> Thanks,
> -Vikas
>
>
>
>
> On Tue, Oct 2, 2018 at 5:14 PM Jason Dillaman  wrote:
>>
>> On Tue, Oct 2, 2018 at 4:47 PM Vikas Rana  wrote:
>> >
>> > Hi,
>> >
>> > We have a CEPH 3 node cluster at primary site. We created a RBD image and 
>> > the image has about 100TB of data.
>> >
>> > Now we installed another 3 node cluster on secondary site. We want to 
>> > replicate the image at primary site to this new cluster on secondary site.
>> >
>> > As per documentation, we enabled journaling on primary site. We followed 
>> > all the procedure and peering looks good but the image is not copying.
>> > The status is always showing down.
>>
>> Do you have an "rbd-mirror" daemon running on the secondary site? Are
>> you running "rbd mirror pool status" against the primary site or the
>> secondary site? The mirroring status is only available on the sites
>> running "rbd-mirror" daemon (the "down" means that the cluster you are
>> connected to doesn't have the daemon running).
>>
>> > So my question is, is it possible to replicate a image which already have 
>> > some data before enabling journalling?
>>
>> Indeed -- it will perform a full image sync to the secondary site.
>>
>> > We are using the image mirroring instead of pool mirroring. Do we need to 
>> > create the RBD image on secondary site? As per documentation, its not 
>> > required.
>>
>> The only difference between the two modes is whether or not you need
>> to run "rbd mirror image enable" or not.
>>
>> > Is there any other option to copy the image to the remote site?
>>
>> No other procedure should be required.
>>
>> > Thanks,
>> > -Vikas
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>>
>> --
>> Jason



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD Mirror Question

2018-10-04 Thread Vikas Rana
Thanks Jason for great suggestions.

but somehow rbd mirror status is not working from secondary to primary. Here's
the status from both sides. The cluster name is ceph on the primary side and cephdr
on the remote site. mirrordr is the user on the DR side and mirrorprod is on the
primary prod side.

# rbd mirror pool info nfs
Mode: image
Peers:
  UUID NAME   CLIENT
  3ccd7a67-2343-44bf-960b-02d9b1258371 cephdr client.mirrordr.

rbd --cluster cephdr mirror pool info nfs
Mode: image
Peers:
  UUID NAME CLIENT
  e6b9ba05-48de-462c-ad5f-0b51d0ee733f ceph client.mirrorprod


From the primary site, when I query the remote site, it looks good.
# rbd --cluster cephdr --id mirrordr mirror pool status nfs
health: OK
images: 0 total

but when I query from the secondary site to the primary side, I'm getting this error:
# rbd  --cluster ceph --id mirrorprod mirror pool status nfs
2018-10-03 10:21:06.645903 7f27a44ed700  0 -- 165.x.x.202:0/1310074448 >>
192.168.3.21:6804/3835 pipe(0x55ed47daf480 sd=4 :0 s=1 pgs=0 cs=0 l=1
c=0x55ed47db0740).fault


We were using 192.168.3.x for the MON network before we renamed it to use the 165
address since it's routable. Why is it trying to connect to the 192.x address
instead of the 165.x.y address?

I could do ceph -s from both sides and they can see each other. Only the rbd
command is having an issue.

Thanks,
-Vikas




On Tue, Oct 2, 2018 at 5:14 PM Jason Dillaman  wrote:

> On Tue, Oct 2, 2018 at 4:47 PM Vikas Rana  wrote:
> >
> > Hi,
> >
> > We have a CEPH 3 node cluster at primary site. We created a RBD image
> and the image has about 100TB of data.
> >
> > Now we installed another 3 node cluster on secondary site. We want to
> replicate the image at primary site to this new cluster on secondary site.
> >
> > As per documentation, we enabled journaling on primary site. We followed
> all the procedure and peering looks good but the image is not copying.
> > The status is always showing down.
>
> Do you have an "rbd-mirror" daemon running on the secondary site? Are
> you running "rbd mirror pool status" against the primary site or the
> secondary site? The mirroring status is only available on the sites
> running "rbd-mirror" daemon (the "down" means that the cluster you are
> connected to doesn't have the daemon running).
>
> > So my question is, is it possible to replicate a image which already
> have some data before enabling journalling?
>
> Indeed -- it will perform a full image sync to the secondary site.
>
> > We are using the image mirroring instead of pool mirroring. Do we need
> to create the RBD image on secondary site? As per documentation, its not
> required.
>
> The only difference between the two modes is whether or not you need
> to run "rbd mirror image enable" or not.
>
> > Is there any other option to copy the image to the remote site?
>
> No other procedure should be required.
>
> > Thanks,
> > -Vikas
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> --
> Jason
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD Mirror Question

2018-10-04 Thread ceph
Hello Vikas,

Could you please tell us which commands you used to set up rbd-mirror?

Would be great if you could provide a short howto :)
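
In the meantime, the upstream docs outline a one-way, image-mode setup roughly
like this - only a generic sketch using the pool/cluster/user names from this
thread, with a placeholder image name, not necessarily the exact commands used
here:

# enable per-image mirroring for the pool on both clusters
rbd --cluster ceph mirror pool enable nfs image
rbd --cluster cephdr mirror pool enable nfs image
# on the DR cluster, add the primary cluster as a peer
rbd --cluster cephdr mirror pool peer add nfs client.mirrorprod@ceph
# on the primary, enable journaling (needs exclusive-lock) and mirroring
rbd --cluster ceph feature enable nfs/someimage journaling
rbd --cluster ceph mirror image enable nfs/someimage
# and run the rbd-mirror daemon on the DR cluster (instance name is a guess)
systemctl enable --now ceph-rbd-mirror@mirrordr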

Thanks in advance
 - Mehmet 

Am 2. Oktober 2018 22:47:08 MESZ schrieb Vikas Rana :
>Hi,
>
>We have a CEPH 3 node cluster at primary site. We created a RBD image
>and
>the image has about 100TB of data.
>
>Now we installed another 3 node cluster on secondary site. We want to
>replicate the image at primary site to this new cluster on secondary
>site.
>
>As per documentation, we enabled journaling on primary site. We
>followed
>all the procedure and peering looks good but the image is not copying.
>The status is always showing down.
>
>
>So my question is, is it possible to replicate a image which already
>have
>some data before enabling journalling?
>
>We are using the image mirroring instead of pool mirroring. Do we need
>to
>create the RBD image on secondary site? As per documentation, its not
>required.
>
>Is there any other option to copy the image to the remote site?
>
>Thanks,
>-Vikas
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] dovecot + cephfs - sdbox vs mdbox

2018-10-04 Thread Webert de Souza Lima
Hi, bringing this up again to ask one more question:

What would be the recommended locking strategy for dovecot against cephfs?
This is a load-balanced setup using independent director instances, but all
dovecot instances on each node share the same storage system (cephfs).
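
For concreteness, a configuration along these lines is the kind of thing in
question (setting names are from the dovecot docs; the values are only
assumptions to be validated, not a recommendation):

# dovecot.conf (sketch)
lock_method = fcntl        # the cephfs kernel client supports fcntl locks
mmap_disable = yes         # avoid mmap on a shared/distributed filesystem
mail_fsync = optimized     # or "always" if index corruption shows up
# mail_nfs_index / mail_nfs_storage left at "no", since cephfs is
# cache-coherent, unlike NFS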

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*


On Wed, May 16, 2018 at 5:15 PM Webert de Souza Lima 
wrote:

> Thanks Jack.
>
> That's good to know. It is definitely something to consider.
> In a distributed storage scenario we might build a dedicated pool for that
> and tune the pool as more capacity or performance is needed.
>
> Regards,
>
> Webert Lima
> DevOps Engineer at MAV Tecnologia
> *Belo Horizonte - Brasil*
> *IRC NICK - WebertRLZ*
>
>
> On Wed, May 16, 2018 at 4:45 PM Jack  wrote:
>
>> On 05/16/2018 09:35 PM, Webert de Souza Lima wrote:
>> > We'll soon do benchmarks of sdbox vs mdbox over cephfs with bluestore
>> > backend.
>> > We'll have to do some work on how to simulate user traffic, for
>> writes
>> > and reads. That seems troublesome.
>> I would appreciate seeing these results !
>>
>> > Thanks for the plugin recommendations. I'll take the chance and ask you
>> > how is the SIS status? We have used it in the past and we've had some
>> > problems with it.
>>
>> I am using it since Dec 2016 with mdbox, with no issue at all (I am
>> currently using Dovecot 2.2.27-3 from Debian Stretch)
>> The only config I use is mail_attachment_dir, the rest is left at the defaults
>> (mail_attachment_min_size = 128k, mail_attachment_fs = sis posix,
>> mail_attachment_hash = %{sha1})
>> The backend storage is a local filesystem, and there is only one Dovecot
>> instance
>>
>> >
>> > Regards,
>> >
>> > Webert Lima
>> > DevOps Engineer at MAV Tecnologia
>> > *Belo Horizonte - Brasil*
>> > *IRC NICK - WebertRLZ*
>> >
>> >
>> > On Wed, May 16, 2018 at 4:19 PM Jack  wrote:
>> >
>> >> Hi,
>> >>
>> >> Many (most ?) filesystems do not store multiple files on the same
>> block
>> >>
>> >> Thus, with sdbox, every single mail (you know, that kind of mail with
>> 10
>> >> lines in it) will eat an inode, and a block (4k here)
>> >> mdbox is more compact on this way
>> >>
>> >> Another difference: sdbox removes the message, mdbox does not : a
>> single
>> >> metadata update is performed, which may be packed with others if many
>> >> files are deleted at once
>> >>
>> >> That said, I do not have experience with dovecot + cephfs, nor have
>> made
>> >> tests for sdbox vs mdbox
>> >>
>> >> However, and this is a bit out of topic, I recommend you look at the
>> >> following dovecot's features (if not already done), as they are awesome
>> >> and will help you a lot:
>> >> - Compression (classic, https://wiki.dovecot.org/Plugins/Zlib)
>> >> - Single-Instance-Storage (aka sis, aka "attachment deduplication" :
>> >> https://www.dovecot.org/list/dovecot/2013-December/094276.html)
>> >>
>> >> Regards,
>> >> On 05/16/2018 08:37 PM, Webert de Souza Lima wrote:
>> >>> I'm sending this message to both dovecot and ceph-users ML so please
>> >> don't
>> >>> mind if something seems too obvious for you.
>> >>>
>> >>> Hi,
>> >>>
>> >>> I have a question for both dovecot and ceph lists and below I'll
>> explain
>> >>> what's going on.
>> >>>
>> >>> Regarding dbox format (https://wiki2.dovecot.org/MailboxFormat/dbox),
>> >> when
>> >>> using sdbox, a new file is stored for each email message.
>> >>> When using mdbox, multiple messages are appended to a single file
>> until
>> >> it
>> >>> reaches/passes the rotate limit.
>> >>>
>> >>> I would like to understand better how the mdbox format impacts on IO
>> >>> performance.
>> >>> I think it's generally expected that fewer, larger files translate to
>> less
>> >> IO
>> >>> and more throughput when compared to many small files, but how does
>> >> dovecot
>> >>> handle that with mdbox?
>> >>> If dovecot does flush data to storage upon each and every new email is
>> >>> arrived and appended to the corresponding file, would that mean that
>> it
>> >>> generate the same ammount of IO as it would do with one file per
>> message?
>> >>> Also, if using mdbox many messages will be appended to a said file
>> >> before a
>> >>> new file is created. That should mean that a file descriptor is kept
>> open
>> >>> for sometime by dovecot process.
>> >>> Using cephfs as backend, how would this impact cluster performance
>> >>> regarding MDS caps and inodes cached when files from thousands of
>> users
>> >> are
>> >>> opened and appended all over?
>> >>>
>> >>> I would like to understand this better.
>> >>>
>> >>> Why?
>> >>> We are a small Business Email Hosting provider with bare metal, self
>> >> hosted
>> >>> systems, using dovecot for servicing mailboxes and cephfs for email
>> >> storage.
>> >>>
>> >>> We are currently working on dovecot and storage redesign to be in
>> >>> production ASAP. The main objective is to serve more users with better
>> >>> performance, high 

[ceph-users] Mimic upgrade 13.2.1 > 13.2.2 monmap changed

2018-10-04 Thread Nino Bosteels
Hello list,

I'm having a serious issue: my ceph cluster has become unresponsive. I 
was upgrading my cluster (3 servers, 3 monitors) from 13.2.1 to 13.2.2, which 
shouldn't be a problem.

Though on reboot my first host reported:

starting mon.ceph01 rank -1 at 192.168.200.197:6789/0 mon_data 
/var/lib/ceph/mon/ceph-ceph01 fsid 27dd45f1-28b5-4ac6-81ab-c62bc581130c
mon.cephxx@-1(probing) e5 preinit fsid 27dd45f1-28b5-4ac6-81ab-c62bc581130c
mon.cephxx@-1(probing) e5 not in monmap and have been in a quorum before; must 
have been removed
-1 mon.cephxx@-1(probing) e5 commit suicide!
-1 failed to initialize

I thought perhaps the monitor didn't want to accept the monmap of the other 
two because of the version difference. Sadly, I upgraded and rebooted the second 
server.

Now the cluster is unresponsive (more than half of the monitors are 
offline / out of quorum). The log of my second host keeps spamming:

2018-10-04 14:39:06.802 7fed0058f700 -1 mon.ceph02@1(probing) e6 
get_health_metrics reporting 14 slow ops, oldest is auth(proto 0 27 bytes epoch 
6)

Any help VERY MUCH appreciated, this sucks.
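
(For the archives: the usual way to bring back a monitor that thinks it was
removed is to inject a current monmap taken from a surviving peer while the
mons are stopped - a rough sketch, with the mon IDs from above and paths as
assumptions:)

# on a surviving monitor (daemon stopped), extract its monmap
ceph-mon -i ceph02 --extract-monmap /tmp/monmap
# inspect it, and re-add the missing mon if it is not listed
monmaptool --print /tmp/monmap
monmaptool --add ceph01 192.168.200.197:6789 /tmp/monmap
# inject it into the broken monitor (also stopped), then start the mons again
ceph-mon -i ceph01 --inject-monmap /tmp/monmap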

Thanks
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Mimic 13.2.2 SCST or ceph-iscsi ?

2018-10-04 Thread Steven Vacaroaia
Hi,

Which implementation of iSCSI is recommended for Mimic 13.2.2, and why?
Is multipathing supported by both in a VMware environment?
Anyone willing to share performance details?

Many thanks
Steven
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] deep scrub error caused by missing object

2018-10-04 Thread Roman Steinhart
Hi all,

for some weeks now we have had a small problem with one of the PGs on our ceph 
cluster.
Every time PG 2.10d is deep scrubbed it fails because of this:
2018-08-06 19:36:28.080707 osd.14 osd.14 *.*.*.110:6809/3935 133 : cluster 
[ERR] 2.10d scrub stat mismatch, got 397/398 objects, 0/0 clones, 397/398 
dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 
2609281919/2609293215 bytes, 0/0 hit_set_archive bytes.
2018-08-06 19:36:28.080905 osd.14 osd.14 *.*.*.110:6809/3935 134 : cluster 
[ERR] 2.10d scrub 1 errors
As far as I understand, ceph is missing an object on osd.14 which should be 
stored on that OSD. A simple 'ceph pg repair 2.10d' fixes the problem, but as soon 
as a deep scrub job runs for that PG again (manually or automatically) 
the problem is back.
I tried to find out which object is missing, but a quick search led me to the 
conclusion that there is no obvious way to list which objects are stored in this 
PG or which object exactly is missing.
That's why I've gone for some "unconventional" methods.
I completely removed OSD.14 from the cluster. I waited until everything was 
balanced and then added the OSD again.
Unfortunately the problem is still there.

Some weeks later we added a large number of OSDs to our cluster, which had a 
big impact on the crush map.
Since then PG 2.10d has been running on two other OSDs -> [119,93] (we have a 
replica count of 2).
Still the same error message, but another OSD:
2018-10-03 03:39:22.776521 7f12d9979700 -1 log_channel(cluster) log [ERR] : 
2.10d scrub stat mismatch, got 728/729 objects, 0/0 clones, 728/729 dirty, 0/0 
omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 7281369687/7281381269 
bytes, 0/0 hit_set_archive bytes.

As a first step it would be enough for me to find out which object is the 
problematic one. Then I can check whether the object is critical, whether any 
recovery is required, or whether I can just drop that object (that would be 90% 
of the cases).
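
(If it helps frame an answer: would comparing per-replica object listings be a
sane approach? A sketch, with the OSDs stopped one at a time and the data
paths as assumptions:)

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-119 --pgid 2.10d --op list > /tmp/pg2.10d.osd119
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-93 --pgid 2.10d --op list > /tmp/pg2.10d.osd93
# the object that only shows up on one replica should be the mismatching one
diff <(sort /tmp/pg2.10d.osd119) <(sort /tmp/pg2.10d.osd93)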
I hope someone is able to help me get rid of this.
It's not really a problem for us. Ceph runs despite this message without 
further problems.
It's just a bit annoying that every time the error occurs our monitoring 
triggers a big alarm because Ceph is in ERROR status. :)

Thanks in advance,
Roman

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] bcache, dm-cache support

2018-10-04 Thread Maged Mokhtar


Hello all,

Do bcache and dm-cache work well with Ceph? Is one recommended over the 
other? Are there any issues?
There are a few posts on this list about them, but I could not 
determine whether they are ready for mainstream use or not.


Appreciate any clarifications.  /Maged
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mimic offline problem

2018-10-04 Thread Goktug Yildirim
These are the ceph-objectstore-tool logs for OSD.0.

https://paste.ubuntu.com/p/jNwf4DC46H/

There is something wrong, but we are not sure whether we are using the tool 
incorrectly or whether there is something wrong with the OSD.
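
(For the BlueStore failures Sage mentions further down, the fsck would be run
roughly like this with the OSD stopped - the path for OSD.0 is an assumption:)

ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-0
# and, if fsck reports repairable damage:
ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-0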


> On 4 Oct 2018, at 06:17, Sage Weil  wrote:
> 
> On Thu, 4 Oct 2018, Goktug Yildirim wrote:
>> This is our cluster state right now. I can reach rbd list and that's good! 
>> Thanks a lot Sage!!!
>> ceph -s: https://paste.ubuntu.com/p/xBNPr6rJg2/
> 
> Progress!  Not out of the woods yet, though...
> 
>> As you can see we have 2 unfound PGs since some of our OSDs cannot start. 58 
>> OSDs give different errors.
>> How can I fix these OSDs? If I remember correctly it should not be so much 
>> trouble.
>> 
>> These are OSDs' failed logs.
>> https://paste.ubuntu.com/p/ZfRD5ZtvpS/
>> https://paste.ubuntu.com/p/pkRdVjCH4D/
> 
> These are both failing in rocksdb code, with something like
> 
> Can't access /032949.sst: NotFound:
> 
> Can you check whether that .sst file actually exists?  Might be a 
> weird path issue.
> 
>> https://paste.ubuntu.com/p/zJTf2fzSj9/
>> https://paste.ubuntu.com/p/xpJRK6YhRX/
> 
> These are failing in the rocksdb CheckConsistency code.  Not sure what to 
> make of that.
> 
>> https://paste.ubuntu.com/p/SY3576dNbJ/
>> https://paste.ubuntu.com/p/smyT6Y976b/
> 
> These are failing in BlueStore code.  The ceph-bluestore-tool fsck may help 
> here, can you give it a shot?
> 
> sage
> 
> 
>> 
>>> On 3 Oct 2018, at 21:37, Sage Weil  wrote:
>>> 
>>> On Wed, 3 Oct 2018, Göktuğ Yıldırım wrote:
 I'm so sorry about that I missed "out" parameter. My bad..
 This is the output: https://paste.ubuntu.com/p/KwT9c8F6TF/
>>> 
>>> Excellent, thanks.  That looks like it confirms the problem is that the 
>>> recovery tool didn't repopulate the creating pgs properly.
>>> 
>>> If you take that 30 byte file I sent earlier (as hex) and update the 
>>> osdmap epoch to the latest on the mon, confirm it decodes and dumps 
>>> properly, and then inject it on the 3 mons, that should get you past this 
>>> hump (and hopefully back up!).
>>> 
>>> sage
>>> 
>>> 
 
 Sage Weil  şunları yazdı (3 Eki 2018 21:13):
 
> I bet the kvstore tool outputs it in a hexdump format?  There is another option 
> to get the raw data iirc
> 
> 
> 
>> On October 3, 2018 3:01:41 PM EDT, Goktug YILDIRIM 
>>  wrote:
>> I changed the file name to make it clear.
>> When I use your command with "+decode"  I'm getting an error like this:
>> 
>> ceph-dencoder type creating_pgs_t import DUMPFILE decode dump_json
>> error: buffer::malformed_input: void 
>> creating_pgs_t::decode(ceph::buffer::list::iterator&) no longer 
>> understand old encoding version 2 < 111
>> 
>> My ceph version: 13.2.2
>> 
>> 3 Eki 2018 Çar, saat 20:46 tarihinde Sage Weil  şunu 
>> yazdı:
>>> On Wed, 3 Oct 2018, Göktuğ Yıldırım wrote:
 If I didn't do it wrong, I got the output as below.
 
 ceph-kvstore-tool rocksdb 
 /var/lib/ceph/mon/ceph-SRV-SBKUARK14/store.db/ get osd_pg_creating 
 creating > dump
 2018-10-03 20:08:52.070 7f07f5659b80  1 rocksdb: do_open column 
 families: [default]
 
 ceph-dencoder type creating_pgs_t import dump dump_json
>>> 
>>> Sorry, should be
>>> 
>>> ceph-dencoder type creating_pgs_t import dump decode dump_json
>>> 
>>> s
>>> 
 {
   "last_scan_epoch": 0,
   "creating_pgs": [],
   "queue": [],
   "created_pools": []
 }
 
 You can find the "dump" link below.
 
 dump: 
 https://drive.google.com/file/d/1ZLUiQyotQ4-778wM9UNWK_TLDAROg0yN/view?usp=sharing
 
 
 Sage Weil  şunları yazdı (3 Eki 2018 18:45):
 
>> On Wed, 3 Oct 2018, Goktug Yildirim wrote:
>> We are starting to work on it. First step is getting the structure 
>> out and dumping the current value as you say.
>> 
>> And you were correct we did not run force_create_pg.
> 
> Great.
> 
> So, eager to see what the current structure is... please attach once 
> you 
> have it.
> 
> The new replacement one should look like this (when hexdump -C'd):
> 
>   02 01 18 00 00 00 10 00  00 00 00 00 00 00 01 00  
> ||
> 0010  00 00 42 00 00 00 00 00  00 00 00 00 00 00
> |..B...|
> 001e
> 
> ...except that from byte 6 you want to put in a recent OSDMap epoch, 
> in 
> hex, little endian (least significant byte first), in place of the 
> 0x10 
> that is there now.  It should dump like this:
> 
> $ ceph-dencoder type creating_pgs_t import myfile decode dump_json
> {
>  "last_scan_epoch": 16,   <--- but with a recent epoch here
>  

Re: [ceph-users] CephFS performance.

2018-10-04 Thread Ronny Aasen

On 10/4/18 7:04 AM, jes...@krogh.cc wrote:

Hi All.

First thanks for the good discussion and strong answer's I've gotten so far.

Current cluster setup is 4 x 10 x 12TB 7.2K RPM drives, all on 10GbitE,
with metadata on rotating drives - 3x replication - 256GB memory in the
OSD hosts and 32+ cores. Behind a PERC controller with each disk as RAID0 and BBWC.

Planned changes:
- get 1-2 more OSD hosts
- experiment with EC pools for CephFS
- MDS onto a separate host and metadata onto SSDs.

I'm still struggling to get "non-cached" performance up to "hardware"
speed - whatever that means. I do a "fio" benchmark using 10GB files, 16
threads, 4M block size -- at which I can "almost" keep the 10GbitE NIC
filled. In this configuration I would have expected it to be "way above"
10Gbit speed and thus have the NIC fully filled rather than "almost"
filled - could that be the metadata activity? But on "big files" and
reads that should not be much - right?

Above is actually ok for production, thus .. not a big issue, just
information.

Single threaded performance is still struggling

Cold HDD (read from disk on the NFS-server end) / NFS performance:

jk@zebra01:~$ pipebench < /nfs/16GB.file > /dev/null
Summary:
Piped   15.86 GB in 00h00m27.53s:  589.88 MB/second


Local page cache (just to say it isn't the profiling tool delivering
limitations):
jk@zebra03:~$ pipebench < /nfs/16GB.file > /dev/null
Summary:
Piped   29.24 GB in 00h00m09.15s:3.19 GB/second
jk@zebra03:~$

Now from the Ceph system:
jk@zebra01:~$ pipebench < /ceph/bigfile.file> /dev/null
Summary:
Piped   36.79 GB in 00h03m47.66s:  165.49 MB/second

Can block/stripe-size be tuned? Does it make sense?
Does read-ahead on the CephFS kernel-client need tuning?
What performance are other people seeing?
Other thoughts - recommendations?

On some of the shares we're storing pretty large files (GB size) and
need the backup to move them to tape - so it is preferred to be capable
of filling an LTO6 drive's write speed to capacity with a single thread.

40'ish 7.2K RPM drives - should - add up to more than above.. right?
This is the only current load being put on the cluster - + 100MB/s
recovery traffic.




The problem with single-threaded performance in ceph is that it reads 
the spindles serially. So you are practically reading one drive at a 
time, and see a single disk's performance, minus all the overheads 
from ceph, network, mds, etc.
So you do not get the combined performance of the drives, only one drive 
at a time. The trick for ceph performance is therefore to get more spindles 
working for you at the same time.



There are ways to get more performance out of a single thread:
- faster components in the path, i.e. faster disk/network/CPU/memory
- larger pre-fetching/read-ahead: with a large enough read-ahead, more 
OSDs will participate in reading simultaneously (see the mount example 
below). [1] shows a table of benchmarks with different read-ahead sizes.
- erasure coding: while erasure coding does add latency vs. replicated 
pools, you will get more spindles involved in reading in parallel, so 
for large sequential loads erasure coding can be a benefit.
- some sort of extra caching scheme; I have not looked at cachefiles, 
but it may provide some benefit.
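
For the read-ahead point, with the cephfs kernel client the read-ahead
window is set at mount time via the rasize option; a sketch (the 128 MB
value is just something to experiment with, not a recommendation):

mount -t ceph mon1:6789:/ /ceph \
    -o name=admin,secretfile=/etc/ceph/admin.secret,rasize=134217728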



You can also play with different cephfs client implementations; there is a 
fuse client where you can experiment with different cache solutions, but 
generally the kernel client is faster.


In rbd there is a fancy striping solution, using --stripe-unit and 
--stripe-count (see the sketch below). This would get more spindles running; 
perhaps consider using rbd instead of cephfs if it fits the workload.
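
A sketch of what that looks like at image creation time (pool/image name
and the sizes/counts are only example values):

rbd create mypool/bigimage --size 10T \
    --object-size 4M --stripe-unit 64K --stripe-count 8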



[1] 
https://tracker.ceph.com/projects/ceph/wiki/Kernel_client_read_ahead_optimization


good luck
Ronny Aasen
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Best handling network maintenance

2018-10-04 Thread Paul Emmerich
Mons are also on a 30s timeout.
Even a short loss of quorum isn't noticeable for ongoing IO.

Paul

> Am 04.10.2018 um 11:03 schrieb Martin Palma :
> 
> Also monitor elections? That is our biggest fear, since the monitor
> nodes will not see each other for that timespan...
>> On Thu, Oct 4, 2018 at 10:21 AM Paul Emmerich  wrote:
>> 
>> 10 seconds is far below any relevant timeout values (generally 20-30 
>> seconds); so you will be fine without any special configuration.
>> 
>> Paul
>> 
>> Am 04.10.2018 um 09:38 schrieb Konstantin Shalygin :
>> 
 What can we do of best handling this scenario to have minimal or no
 impact on Ceph?
 
 We plan to set "noout", "nobackfill", "norebalance", "noscrub",
 "nodeep",  "scrub" are there any other suggestions?
>>> 
>>> ceph osd set noout
>>> 
>>> ceph osd pause
>>> 
>>> 
>>> 
>>> k
>>> 
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Best handling network maintenance

2018-10-04 Thread Martin Palma
Also monitor elections? That is our biggest fear, since the monitor
nodes will not see each other for that timespan...
On Thu, Oct 4, 2018 at 10:21 AM Paul Emmerich  wrote:
>
> 10 seconds is far below any relevant timeout values (generally 20-30 
> seconds); so you will be fine without any special configuration.
>
> Paul
>
> Am 04.10.2018 um 09:38 schrieb Konstantin Shalygin :
>
> >> What can we do of best handling this scenario to have minimal or no
> >> impact on Ceph?
> >>
> >> We plan to set "noout", "nobackfill", "norebalance", "noscrub",
> >> "nodeep",  "scrub" are there any other suggestions?
> >
> > ceph osd set noout
> >
> > ceph osd pause
> >
> >
> >
> > k
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Best handling network maintenance

2018-10-04 Thread Paul Emmerich
10 seconds is far below any relevant timeout values (generally 20-30 seconds); 
so you will be fine without any special configuration.
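
For anyone who wants to double-check the timeouts on their own cluster, the
values can be read from a daemon's admin socket on its host; the option names
below are the ones I assume to be relevant:

ceph daemon osd.0 config get osd_heartbeat_grace
# assuming the mons are named after the short hostname
ceph daemon mon.$(hostname -s) config get mon_lease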

Paul

Am 04.10.2018 um 09:38 schrieb Konstantin Shalygin :

>> What can we do of best handling this scenario to have minimal or no
>> impact on Ceph?
>> 
>> We plan to set "noout", "nobackfill", "norebalance", "noscrub",
>> "nodeep",  "scrub" are there any other suggestions?
> 
> ceph osd set noout
> 
> ceph osd pause
> 
> 
> 
> k
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] hardware heterogeneous in same pool

2018-10-04 Thread Janne Johansson
Den tors 4 okt. 2018 kl 00:09 skrev Bruno Carvalho :

> Hi Cephers, I would like to know how you are growing the cluster.
> Using dissimilar hardware in the same pool or creating a pool for each
> different hardware group.
> What problems would I have using different hardware (CPU,
> memory, disk) in the same pool?


I don't think CPU and RAM (and other hw-related things like HBA controller
card brand) matter
a lot; more is always nicer, but as long as you don't add worse machines,
as Jonathan wrote, you
should not see any degradation.

What you might want to look out for is if the new disks are very uneven
compared to the old
setup, so if you used to have servers with 10x2TB drives and suddenly add
one with 2x10TB,
things might become very unbalanced, since those differences will not be
handled seamlessly
by the crush map.

Apart from that, the only issue for us is "add drives, quickly set crush
reweight to 0.0 before
all existing OSD hosts shoot massive amounts of I/O at them, then script a
slower raise of
the crush weight up to what they should end up at", to lessen the impact on our
24/7 operations (a sketch of this is below).
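
Roughly like this (the OSD id and the target weight are placeholders):

# right after the new OSD is created, park it
ceph osd crush reweight osd.123 0.0
# then raise it in steps over time until it reaches its final weight
ceph osd crush reweight osd.123 1.0
ceph osd crush reweight osd.123 2.0
# ... and so on, up to e.g. 7.28 for an 8 TB drive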

If you have weekends where no one accesses the cluster or night-time low-IO
usage patterns,
just upping the weight at the right hour might suffice.

Lastly, for SSD/NVMe setups with good networking, this is almost moot; they
converge so fast
it's almost unfair. A real joy working with expanding flash-only
pools/clusters.

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Best handling network maintenance

2018-10-04 Thread Konstantin Shalygin

What can we do of best handling this scenario to have minimal or no
impact on Ceph?

We plan to set "noout", "nobackfill", "norebalance", "noscrub",
"nodeep",  "scrub" are there any other suggestions?


ceph osd set noout

ceph osd pause
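
and once the link is back (just noting the reverse, assuming the same flags):

ceph osd unpause

ceph osd unset noout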



k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Best handling network maintenance

2018-10-04 Thread Martin Palma
Hi all,

our Ceph cluster is distributed across two datacenters. Due to network
maintenance the link between the two datacenters will be down for ca. 8
- 10 seconds. During this time the public network of Ceph between the two
DCs will also be down.

What is the best way of handling this scenario to have minimal or no
impact on Ceph?

We plan to set "noout", "nobackfill", "norebalance", "noscrub",
"nodeep",  "scrub" are there any other suggestions?

Best,
Martin
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com