Re: [ceph-users] Questions regarding hardware design of an SSD only cluster

2018-04-24 Thread Christian Balzer

Hello,

On Tue, 24 Apr 2018 11:39:33 +0200 Florian Florensa wrote:

> 2018-04-24 3:24 GMT+02:00 Christian Balzer :
> > Hello,
> >  
> 
> Hi Christian, and thanks for your detailed answer.
> 
> > On Mon, 23 Apr 2018 17:43:03 +0200 Florian Florensa wrote:
> >  
> >> Hello everyone,
> >>
> >> I am in the process of designing a Ceph cluster, that will contain
> >> only SSD OSDs, and I was wondering how should I size my cpu.  
> > Several threads about this around here, but first things first.
> > Any specifics about the storage needs, i.e. do you think you need the SSDs
> > for bandwidth or for IOPS reasons primarily?
> > Lots of smallish writes or large reads/writes?
> >  
> >> The cluster will only be used for block storage.
> >> The OSDs will be Samsung PM863 (2Tb or 4Tb, this will be determined  
> >
> > I assume PM863a, the non "a" model seems to be gone.
> > And that's a 1.3 DWPD drive, with a collocated journal or lots of small
> > writes and a collocated WAL/DB it will be half of that.
> > So run the numbers and make sure this is actually a good fit in the
> > endurance area.
> > Of course depending on your needs, journals or WAL/DB on higher endurance
> > NVMes might be a much better fit anyway.
> >  
> 
> Well, if it makes sense, maybe a few NVMes for WAL/DB would be worth it;
> how many SSDs should I put behind a single NVMe?
> And how should I size them? AFAIK it's ~1.6% of drive capacity for
> WAL+DB using default values.
> 
There've been numerous sizing threads here and the conclusion was an
unsurprising "it depends". That said, they will give you a good basis on
where to start.
Size will also depend on what kind of NVMe you can/will deploy. If you go
with a smallish one that has n times the endurance of the SSDs behind it,
you will need to be more precise; if you can't get the endurance and instead
compensate for it with a larger NVMe, you can obviously go all out with the
WAL/DB size here.

Given the number of hosts you plan to deploy initially, a ratio of up to
1:5 seems sensible. 

Of course the speed of the NVMe factors in here as well, not as
predictable as with filestore journals, but still, esp. with regards to
IOPS. 

Again, this depends on your use case, but a 1.3 DWPD endurance that could
be as low as 0.65 DWPD strikes me as rather low. 
The SM variant might be a better fit if you go for a design w/o NVMes. 
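
Back of the envelope, with the ~1.6% default you quote and the 1:5 ratio
above (illustrative numbers only, adjust to your actual drives):

  2TB SSD  -> ~32GB WAL+DB per OSD;  5 x 32GB = ~160GB of NVMe
  4TB SSD  -> ~64GB WAL+DB per OSD;  5 x 64GB = ~320GB of NVMe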

> >> when we will set the total capacity in stone), and it will be in 2U
> >> 24-SSD servers
> > How many servers are you thinking about?
> > Because the fact that you're willing to double the SSD size but not the
> > number of servers suggests that you're thinking about a small number of
> > servers.
> > And while dense servers will save you space and money, more and smaller
> > servers are generally a better fit for Ceph, not least when considering
> > failure domains (a host typically).
> >  
> The cluster should start somewhere between 20-30 OSD nodes, and 3 monitors.
> And it should grow in the foreseeable future to up to 50 OSD nodes,
> while keeping 3 monitors, but that would be in a while (like 2-3 years).
> Of course this number of nodes depends on the usable storage and IOPS.
> The goal is to replace the SANs for hypervisors and some diskless baremetal
> servers.
> 
20-30 OSD nodes with 24 SSDs each, no wonder you didn't blink at the price
tag of the Xeon 2699v4.
Anyway, a decent number that will mitigate the impact of a node loss
nicely.

If IOPS are crucial, the aforementioned fast CPUs on top of fast storage
are as well.

> >> Those server will probably be either Supermicro 2029U-E1CR4T or
> >> Supermicro 2028R-E1CR48L.
> >> I’ve read quite a lot of documentation regarding hardware choices, and
> >> I can’t find a ‘guideline’ for OSDs on SSD with colocated journal.  
> > If this is a new cluster, that would be collocated WAL/DB and Bluestore.
> > Never mind my misgivings about Bluestore, at this point in time you
> > probably don't want to deploy a new cluster with filestore, unless you
> > have very specific needs and know what you're doing.
> >  
> 
> Yup the goal was to go for bluestore, as in an RBD workload it seems to
> be the better option (avoiding the write amplification and its induced 
> latency)
> 
Ah, but the filestore journal does _improve_ latency usually when on SSD.
That's why for small writes the WAL/DB gets used with bluestore for
journaling/coalescing as well, otherwise it would be slower than
the same setup with filestore.
And that is why you need to keep this little detail in mind when looking
at endurance, DWPD figures.

For larger writes bluestore wins out and does single writes indeed.
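
If memory serves, the threshold for that deferred/WAL path is controlled by
the bluestore_prefer_deferred_size options (with _hdd/_ssd variants), i.e.
something like the following in ceph.conf; the value below is purely for
illustration, not a recommendation:

  [osd]
  # illustrative only: writes at or below this size take the deferred
  # path through the WAL/RocksDB, larger writes go straight to the
  # data device
  bluestore_prefer_deferred_size_ssd = 16384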


> >> I was pointing for either dual ‘Xeon gold 6146’ or dual ‘Xeon 2699v4’
> >> for the cpus, depending on the chassis.  
> > The first one is a much better fit in terms of the "a fast core for each
> > OSD" philosophy needed for low latency and high IOPS. The 2nd is just
> > overkill, 24 real cores will do and for extreme cases I'm sure I can still
> > whip a fio 

Re: [ceph-users] cephfs luminous 12.2.4 - multi-active MDSes with manual pinning

2018-04-24 Thread Linh Vu
Thanks Patrick! Good to know that it's nothing and will be fixed soon :)


From: Patrick Donnelly 
Sent: Wednesday, 25 April 2018 5:17:57 AM
To: Linh Vu
Cc: ceph-users
Subject: Re: [ceph-users] cephfs luminous 12.2.4 - multi-active MDSes with 
manual pinning

Hello Linh,

On Tue, Apr 24, 2018 at 12:34 AM, Linh Vu  wrote:
> However, on our production cluster, with more powerful MDSes (10 cores
> 3.4GHz, 256GB RAM, much faster networking), I get this in the logs
> constantly:
>
> 2018-04-24 16:29:21.998261 7f02d1af9700  0 mds.1.migrator nicely exporting
> to mds.0 [dir 0x110cd91.1110* /home/ [2,head] auth{0=1017} v=5632699
> cv=5632651/5632651 dir_auth=1 state=1611923458|complete|auxsubtree f(v84
> 55=0+55) n(v245771 rc2018-04-24 16:28:32.830971 b233439385711
> 423085=383063+40022) hs=55+0,ss=0+0 dirty=1 | child=1 frozen=0 subtree=1
> replicated=1 dirty=1 authpin=0 0x55691ccf1c00]
>
> To clarify, /home is pinned to mds.1, so there is no reason it should export
> this to mds.0, and the loads on both MDSes (req/s, network load, CPU load)
> are fairly low, lower than those on the test MDS VMs.

As Dan said, this is simply a spurious log message. Nothing is being
exported. This will be fixed in 12.2.6 as part of several fixes to the
load balancer:

https://github.com/ceph/ceph/pull/21412/commits/cace918dd044b979cd0d54b16a6296094c8a9f90

--
Patrick Donnelly

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Have an inconsistent PG, repair not working

2018-04-24 Thread David Turner
Neither the issue I created nor Michael's [1] ticket that it was rolled
into are getting any traction.  How are y'all faring with your clusters?
I've had 3 PGs inconsistent with 5 scrub errors for a few weeks now.  I
assumed that the third PG was just like the first 2 in that it couldn't be
scrubbed, but I just checked the last scrub timestamp of the 3 PGs and the
third one is able to run scrubs.  I'm going to increase the logging on it
after I finish a round of maintenance we're performing on some OSDs.
Hopefully I'll find something more about these objects.


[1] http://tracker.ceph.com/issues/23576
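
(For the logging bump I'll probably do something along these lines on the
acting primary; osd.234 is just the primary from the health detail quoted
below, and the debug levels are a guess at what will be useful:)

  ceph tell osd.234 injectargs '--debug_osd 20 --debug_ms 1'
  ceph pg deep-scrub 145.2e3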

On Fri, Apr 6, 2018 at 12:30 PM David Turner  wrote:

> I'm using filestore.  I think the root cause is something getting stuck in
> the code.  As such I went ahead and created a [1] bug tracker for this.
> Hopefully it gets some traction as I'm not particularly looking forward to
> messing with deleting PGs with the ceph-objectstore-tool in production.
>
> [1] http://tracker.ceph.com/issues/23577
>
> On Fri, Apr 6, 2018 at 11:40 AM Michael Sudnick 
> wrote:
>
>> I've tried a few more things to get a deep-scrub going on my PG. I tried
>> instructing the involved osds to scrub all their PGs and it looks like that
>> didn't do it.
>>
>> Do you have any documentation on the object-store-tool? What I've found
>> online talks about filestore and not bluestore.
>>
>> On 6 April 2018 at 09:27, David Turner  wrote:
>>
>>> I'm running into this exact same situation.  I'm running 12.2.2 and I
>>> have an EC PG with a scrub error.  It has the same output for [1] rados
>>> list-inconsistent-obj as mentioned before.  This is the [2] full health
>>> detail.  This is the [3] excerpt from the log from the deep-scrub that
>>> marked the PG inconsistent.  The scrub happened when the PG was starting up
>>> after using ceph-objectstore-tool to split its filestore subfolders.  This
>>> is using a script that I've used for months without any side effects.
>>>
>>> I have tried quite a few things to get this PG to deep-scrub or repair,
>>> but to no avail.  It will not do anything.  I have set every osd's
>>> osd_max_scrubs to 0 in the cluster, waited for all scrubbing and deep
>>> scrubbing to finish, then increased the 11 OSDs for this PG to 1 before
>>> issuing a deep-scrub.  And it will sit there for over an hour without
>>> deep-scrubbing.  My current testing of this is to set all osds to 1,
>>> increase all of the osds for this PG to 4, and then issue the repair... but
>>> similarly nothing happens.  Each time I issue the deep-scrub or repair, the
>>> output correctly says 'instructing pg 145.2e3 on osd.234 to repair', but
>>> nothing shows up in the log for the OSD and the PG state stays
>>> 'active+clean+inconsistent'.
>>>
>>> My next step, unless anyone has a better idea, is to find the exact copy
>>> of the PG with the missing object, use object-store-tool to back up that
>>> copy of the PG and remove it.  Then starting the OSD back up should
>>> backfill the full copy of the PG and be healthy again.
>>>
>>>
>>>
>>> [1] $ rados list-inconsistent-obj 145.2e3
>>> No scrub information available for pg 145.2e3
>>> error 2: (2) No such file or directory
>>>
>>> [2] $ ceph health detail
>>> HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
>>> OSD_SCRUB_ERRORS 1 scrub errors
>>> PG_DAMAGED Possible data damage: 1 pg inconsistent
>>> pg 145.2e3 is active+clean+inconsistent, acting
>>> [234,132,33,331,278,217,55,358,79,3,24]
>>>
>>> [3] 2018-04-04 15:24:53.603380 7f54d1820700  0 log_channel(cluster) log
>>> [DBG] : 145.2e3 deep-scrub starts
>>> 2018-04-04 17:32:37.916853 7f54d1820700 -1 log_channel(cluster) log
>>> [ERR] : 145.2e3s0 deep-scrub 1 missing, 0 inconsistent objects
>>> 2018-04-04 17:32:37.916865 7f54d1820700 -1 log_channel(cluster) log
>>> [ERR] : 145.2e3 deep-scrub 1 errors
>>>
>>> On Mon, Apr 2, 2018 at 4:51 PM Michael Sudnick <
>>> michael.sudn...@gmail.com> wrote:
>>>
 Hi Kjetil,

 I've tried to get the pg scrubbing/deep scrubbing and nothing seems to
 be happening. I've tried it a few times over the last few days. My cluster
 is recovering from a failed disk (which was probably the reason for the
 inconsistency), do I need to wait for the cluster to heal before
 repair/deep scrub works?

 -Michael

 On 2 April 2018 at 14:13, Kjetil Joergensen 
 wrote:

> Hi,
>
> scrub or deep-scrub the pg, that should in theory get you back to
> list-inconsistent-obj spitting out what's wrong, then mail that info to 
> the
> list.
>
> -KJ
>
> On Sun, Apr 1, 2018 at 9:17 AM, Michael Sudnick <
> michael.sudn...@gmail.com> wrote:
>
>> Hello,
>>
>> I have a small cluster with an inconsistent pg. I've tried ceph pg
>> repair multiple times to no luck. rados list-inconsistent-obj 49.11c
>> returns:
>>
>> # rados list-inconsistent-obj 49.11c
>> No scrub information available for pg 49.

[ceph-users] v12.2.5 Luminous released

2018-04-24 Thread Abhishek

Hello cephers,

We're glad to announce the fifth bugfix release of the Luminous v12.2.x
long term stable release series. This release contains a range of bug fixes
across all components of Ceph. We recommend that all users of the 12.2.x
series update.

Notable Changes
---

* MGR

  The ceph-rest-api command-line tool included in the ceph-mon
  package has been obsoleted by the MGR "restful" module. The
  ceph-rest-api tool is hereby declared deprecated and will be dropped
  in Mimic.

  The MGR "restful" module provides similar functionality via a "pass 
through"
  method. See http://docs.ceph.com/docs/luminous/mgr/restful for 
details.


* CephFS

  Upgrading an MDS cluster to 12.2.3+ will result in all active MDS
  exiting due to feature incompatibilities once an upgraded MDS comes
  online (even as standby). Operators may ignore the error messages
  and continue upgrading/restarting or follow this upgrade sequence:

  Reduce the number of ranks to 1 (`ceph fs set  max_mds 1`),
  wait for all other MDS to deactivate, leaving the one active MDS,
  upgrade the single active MDS, then upgrade/start standbys. Finally,
  restore the previous max_mds.

  See also: https://tracker.ceph.com/issues/23172
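
  As an illustrative sequence only (substitute your filesystem name and the
  original max_mds value):

    ceph fs set <fs_name> max_mds 1
    # wait for the other ranks to deactivate, upgrade the single
    # active MDS, then upgrade/start the standbys, and finally:
    ceph fs set <fs_name> max_mds <original value>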


Other Notable Changes
-

* add --add-bucket and --move options to crushtool (issue#23472, 
issue#23471, pr#21079, Kefu Chai)
* BlueStore.cc: _balance_bluefs_freespace: assert(0 == "allocate failed, 
wtf") (issue#23063, pr#21394, Igor Fedotov, xie xingguo, Sage Weil, Zac 
Medico)
* bluestore: correctly check all block devices to decide if journal 
is\_… (issue#23173, issue#23141, pr#20651, Greg Farnum)
* bluestore: statfs available can go negative (issue#23074, pr#20554, 
Igor Fedotov, Sage Weil)
* build Debian installation packages failure (issue#22856, issue#22828, 
pr#20250, Tone Zhang)
* build/ops: deb: move python-jinja2 dependency to mgr (issue#22457, 
pr#20748, Nathan Cutler)
* build/ops: deb: move python-jinja2 dependency to mgr (issue#22457, 
pr#21233, Nathan Cutler)
* build/ops: run-make-check.sh: fix SUSE support (issue#22875, 
issue#23178, pr#20737, Nathan Cutler)
* cephfs-journal-tool: Fix Dumper destroyed before shutdown 
(issue#22862, issue#22734, pr#20251, dongdong tao)
* ceph.in: print all matched commands if arg missing (issue#22344, 
issue#23186, pr#20664, Luo Kexue, Kefu Chai)
* ceph-objectstore-tool command to trim the pg log (issue#23242, 
pr#20803, Josh Durgin, David Zafman)
* ceph osd force-create-pg cause all ceph-mon to crash and unable to 
come up again (issue#22942, pr#20399, Sage Weil)
* ceph-volume: adds raw device support to 'lvm list' (issue#23140, 
pr#20647, Andrew Schoen)
* ceph-volume: allow parallel creates (issue#23757, pr#21509, Theofilos 
Mouratidis)
* ceph-volume: allow skipping systemd interactions on activate/create 
(issue#23678, pr#21538, Alfredo Deza)
* ceph-volume: automatic VDO detection (issue#23581, pr#21505, Alfredo 
Deza)

* ceph-volume be resilient to $PATH issues (pr#20716, Alfredo Deza)
* ceph-volume: fix action plugins path in tox (pr#20923, Guillaume 
Abrioux)
* ceph-volume Implement an 'activate all' to help with dense servers or 
migrating OSDs (pr#21533, Alfredo Deza)
* ceph-volume improve robustness when reloading vms in tests (pr#21072, 
Alfredo Deza)
* ceph-volume lvm.activate error if no bluestore OSDs are found 
(issue#23644, pr#21335, Alfredo Deza)

* ceph-volume: Nits noticed while studying code (pr#21565, Dan Mick)
* ceph-volume tests alleviate libvirt timeouts when reloading 
(issue#23163, pr#20754, Alfredo Deza)
* ceph-volume update man page for prepare/activate flags (pr#21574, 
Alfredo Deza)
* ceph-volume: Using --readonly for {vg|pv|lv}s commands (pr#21519, 
Erwan Velu)
* client: allow client to use caps that are revoked but not yet returned 
(issue#23028, issue#23314, pr#20904, Jeff Layton)

* : Client:Fix readdir bug (issue#22936, pr#20356, dongdong tao)
* client: release revoking Fc after invalidate cache (issue#22652, 
pr#20342, "Yan, Zheng")
* Client: setattr should drop "Fs" rather than "As" for mtime and size 
(issue#22935, pr#20354, dongdong tao)
* client: use either dentry_invalidate_cb or remount_cb to invalidate k… 
(issue#23355, pr#20960, Zhi Zhang)
* cls/rbd: group_image_list incorrectly flagged as RW (issue#23407, 
issue#23388, pr#20967, Jason Dillaman)
* cls/rgw: fix bi_log_iterate_entries return wrong truncated 
(issue#22737, issue#23225, pr#21054, Tianshan Qu)
* cmake: rbd resource agent needs to be executable (issue#22980, 
pr#20617, Tim Bishop)
* common/dns_resolv.cc: Query for -record if ms_bind_ipv6 is True 
(issue#23078, issue#23174, pr#20710, Wido den Hollander)
* common/ipaddr: Do not select link-local IPv6 addresses (issue#21813, 
pr#2, Willem Jan Withagen)
* common: omit short option for id in help for clients (issue#23156, 
issue#23041, pr#20654, Patrick Donnelly)
* common: should not check for VERSION_ID (issue#23477, issue#23478, 
pr#21090, Kefu Chai, Shengjing Zhu)
* config:

Re: [ceph-users] cephfs luminous 12.2.4 - multi-active MDSes with manual pinning

2018-04-24 Thread Patrick Donnelly
Hello Linh,

On Tue, Apr 24, 2018 at 12:34 AM, Linh Vu  wrote:
> However, on our production cluster, with more powerful MDSes (10 cores
> 3.4GHz, 256GB RAM, much faster networking), I get this in the logs
> constantly:
>
> 2018-04-24 16:29:21.998261 7f02d1af9700  0 mds.1.migrator nicely exporting
> to mds.0 [dir 0x110cd91.1110* /home/ [2,head] auth{0=1017} v=5632699
> cv=5632651/5632651 dir_auth=1 state=1611923458|complete|auxsubtree f(v84
> 55=0+55) n(v245771 rc2018-04-24 16:28:32.830971 b233439385711
> 423085=383063+40022) hs=55+0,ss=0+0 dirty=1 | child=1 frozen=0 subtree=1
> replicated=1 dirty=1 authpin=0 0x55691ccf1c00]
>
> To clarify, /home is pinned to mds.1, so there is no reason it should export
> this to mds.0, and the loads on both MDSes (req/s, network load, CPU load)
> are fairly low, lower than those on the test MDS VMs.

As Dan said, this is simply a spurious log message. Nothing is being
exported. This will be fixed in 12.2.6 as part of several fixes to the
load balancer:

https://github.com/ceph/ceph/pull/21412/commits/cace918dd044b979cd0d54b16a6296094c8a9f90

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Poor read performance.

2018-04-24 Thread Jonathan Proulx
Hi All,

I seem to be seeing consistently poor read performance on my cluster
relative to both write performance and the read performance of a single
backend disk, by quite a lot.

The cluster is Luminous with 174 7.2k SAS drives across 12 storage servers
with 10G ethernet and jumbo frames.  Drives are a mix of 4T and 2T,
bluestore with DB on SSD.

The performance I really care about is over rbd for VMs in my
OpenStack, but 'rbd bench' seems to line up pretty well with 'fio' tests
inside VMs, so here is a more or less typical random write rbd bench (from a
monitor node with a 10G connection on the same net as the osds):

rbd bench  --io-total=4G --io-size 4096 --io-type write \
--io-pattern rand --io-threads 16 mypool/myvol



elapsed:   361  ops:  1048576  ops/sec:  2903.82  bytes/sec: 11894034.98

The same for random read is an order of magnitude lower:

rbd bench  --io-total=4G --io-size 4096 --io-type read \
--io-pattern rand --io-threads 16  mypool/myvol

elapsed:  3354  ops:  1048576  ops/sec:   312.60  bytes/sec: 1280403.47

(sequential reads and a bigger io-size help, but not a lot)
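
For reference, an fio run along these lines inside a VM gives numbers in the
same ballpark (illustrative invocation, not the exact job file I use):

  fio --name=randread --rw=randread --bs=4k --ioengine=libaio --direct=1 \
      --iodepth=16 --size=4G --filename=/var/tmp/fio.test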

ceph -s from during the read bench, to get a sense of the relative traffic:

  cluster:
id: 
health: HEALTH_OK
 
  services:
mon: 3 daemons, quorum ceph-mon0,ceph-mon1,ceph-mon2
mgr: ceph-mon0(active), standbys: ceph-mon2, ceph-mon1
osd: 174 osds: 174 up, 174 in
rgw: 3 daemon active
 
  data:
pools:   19 pools, 10240 pgs
objects: 17342k objects, 80731 GB
usage:   240 TB used, 264 TB / 505 TB avail
pgs: 10240 active+clean
 
  io:
client:   4296 kB/s rd, 417 MB/s wr, 1635 op/s rd, 3518 op/s wr


During deep-scrubs overnight I can see the disks doing >500MBps reads
and ~150 read IOPS (each at peak), while during the read bench (including all
traffic from ~1k VMs) individual osd data partitions peak around 25
read IOPS and 1.5MBps read bandwidth, so it seems like there should be
performance to spare.

Obviously, given my disk choices this isn't designed as a particularly
high performance setup, but I do expect a bit more performance out of
it.

Are my expectations wrong? If not, any clues what I've done (or failed
to do) that is wrong?

Pretty sure read/write was much more symmetric in earlier versions (a subset
of the same hardware with the filestore backend), but I used a different perf
tool so I don't want to make direct comparisons.

-Jon

-- 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cephalocon APAC 2018 report, videos and slides

2018-04-24 Thread Ronny Aasen

On 24.04.2018 17:30, Leonardo Vaz wrote:

Hi,

Last night I posted the Cephalocon 2018 conference report on the Ceph
blog[1], published the video recordings from the sessions on
YouTube[2] and the slide decks on Slideshare[3].

[1] https://ceph.com/community/cephalocon-apac-2018-report/
[2] https://www.youtube.com/playlist?list=PLrBUGiINAakNgeLvjald7NcWps_yDCblr
[3] https://www.slideshare.net/Inktank_Ceph/tag/cephalocon-apac-2018

I'd like to take the opportunity to apologize for the flood of posts on
Twitter and Google+ about the video uploads last night. It seems that
even though I disabled the checkbox to make posts announcing the new uploads
on social media, YouTube decided to post them anyway. Sorry for the
inconvenience.

Kindest regards,

Leo



Thanks to the presenters and yourself for your awesome work.
This is a goldmine for those of us who could not attend. :)


kind regards

Ronny Aasen




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] configuration section for each host

2018-04-24 Thread Ronny Aasen

On 24.04.2018 18:24, Robert Stanford wrote:


 In examples I see that each host has a section in ceph.conf, on every 
host (host-a has a section in its conf on host-a, but there's also a 
host-a section in the ceph.conf on host-b, etc.) Is this really 
necessary?  I've been using just generic osd and monitor sections, and 
that has worked out fine so far.  Am I setting myself up for 
unexpected problems?


Only if you want to override default values for that individual host. I
have never had anything but generic sections.
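
i.e. a minimal ceph.conf along these lines is enough (values are
placeholders):

  [global]
  fsid = <your cluster fsid>
  mon_host = mon-a, mon-b, mon-c

  [mon]
  # generic monitor settings

  [osd]
  # generic osd settings, applied on every host

  [osd.12]
  # only needed if this one daemon must override something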


Ceph is moving more and more away from must-have information in the
configuration file.
In the next version you will probably not need initial monitors either, since
they can be discovered via SRV DNS records.


kind regards
Ronny Aasen
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] configuration section for each host

2018-04-24 Thread Robert Stanford
 In examples I see that each host has a section in ceph.conf, on every host
(host-a has a section in its conf on host-a, but there's also a host-a
section in the ceph.conf on host-b, etc.)  Is this really necessary?  I've
been using just generic osd and monitor sections, and that has worked out
fine so far.  Am I setting myself up for unexpected problems?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cephalocon APAC 2018 report, videos and slides

2018-04-24 Thread kefu chai
On Tue, Apr 24, 2018 at 11:30 PM, Leonardo Vaz  wrote:
> Hi,
>
> Last night I posted the Cephalocon 2018 conference report on the Ceph
> blog[1], published the video recordings from the sessions on
> YouTube[2] and the slide decks on Slideshare[3].
>
> [1] https://ceph.com/community/cephalocon-apac-2018-report/
> [2] https://www.youtube.com/playlist?list=PLrBUGiINAakNgeLvjald7NcWps_yDCblr
> [3] https://www.slideshare.net/Inktank_Ceph/tag/cephalocon-apac-2018
>
> I'd like to take the opportunity to apologize for the flood of posts on
> Twitter and Google+ about the video uploads last night. It seems that
> even though I disabled the checkbox to make posts announcing the new uploads
> on social media, YouTube decided to post them anyway. Sorry for the
> inconvenience.

Thank you Leo! It's a ton of work. I really love these slides!


-- 
Regards
Kefu Chai
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW GC Processing Stuck

2018-04-24 Thread Sean Redmond
Hi,

Sure, no problem, I posted it here:

http://tracker.ceph.com/issues/23839

On Tue, 24 Apr 2018, 16:04 Matt Benjamin,  wrote:

> Hi Sean,
>
> Could you create an issue in tracker.ceph.com with this info?  That
> would make it easier to iterate on.
>
> thanks and regards,
>
> Matt
>
> On Tue, Apr 24, 2018 at 10:45 AM, Sean Redmond 
> wrote:
> > Hi,
> > We are currently using Jewel 10.2.7 and recently, we have been
> experiencing
> > some issues with objects being deleted using the gc. After a bucket was
> > unsuccessfully deleted using --purge-objects (first error next discussed
> > occurred), all of the rgw's are occasionally becoming unresponsive and
> > require a restart of the processes before they will accept requests
> again.
> > On investigation of the garbage collection, it has an enormous list
> which we
> > are struggling to count the length of, but seem stuck on a particular
> object
> > which is not updating, shown in the logs below:
> >
> >
> >
> > 2018-04-23 15:16:04.101660 7f1fdcc29a00  0 gc::process: removing
> > .rgw.buckets:default.290071.4_XXX//XX/XX/XXX.ZIP
> >
> > 2018-04-23 15:16:04.104231 7f1fdcc29a00  0 gc::process: removing
> >
> .rgw.buckets:default.175209462.16__shadow_.bxz6tqhZzqZozTFkxPVspHfIhhVxaj5_1
> >
> > 2018-04-23 15:16:04.105541 7f1fdcc29a00  0 gc::process: removing
> >
> .rgw.buckets:default.175209462.16__shadow_.bxz6tqhZzqZozTFkxPVspHfIhhVxaj5_2
> >
> > 2018-04-23 15:16:04.176235 7f1fdcc29a00  0 gc::process: removing
> >
> .rgw.buckets:default.175209462.16__shadow_.bxz6tqhZzqZozTFkxPVspHfIhhVxaj5_3
> >
> > 2018-04-23 15:16:04.178435 7f1fdcc29a00  0 gc::process: removing
> >
> .rgw.buckets:default.175209462.16__shadow_.bxz6tqhZzqZozTFkxPVspHfIhhVxaj5_4
> >
> > 2018-04-23 15:16:04.250883 7f1fdcc29a00  0 gc::process: removing
> >
> .rgw.buckets:default.175209462.16__shadow_.bxz6tqhZzqZozTFkxPVspHfIhhVxaj5_5
> >
> > 2018-04-23 15:16:04.297912 7f1fdcc29a00  0 gc::process: removing
> >
> .rgw.buckets:default.175209462.16__shadow_.bxz6tqhZzqZozTFkxPVspHfIhhVxaj5_6
> >
> > 2018-04-23 15:16:04.298803 7f1fdcc29a00  0 gc::process: removing
> >
> .rgw.buckets:default.175209462.16__shadow_.bxz6tqhZzqZozTFkxPVspHfIhhVxaj5_7
> >
> > 2018-04-23 15:16:04.320202 7f1fdcc29a00  0 gc::process: removing
> >
> .rgw.buckets:default.175209462.16__shadow_.bxz6tqhZzqZozTFkxPVspHfIhhVxaj5_8
> >
> > 2018-04-23 15:16:04.340124 7f1fdcc29a00  0 gc::process: removing
> >
> .rgw.buckets:default.175209462.16__shadow_.bxz6tqhZzqZozTFkxPVspHfIhhVxaj5_9
> >
> > 2018-04-23 15:16:04.383924 7f1fdcc29a00  0 gc::process: removing
> >
> .rgw.buckets:default.175209462.16__shadow_.bxz6tqhZzqZozTFkxPVspHfIhhVxaj5_10
> >
> > 2018-04-23 15:16:04.386865 7f1fdcc29a00  0 gc::process: removing
> >
> .rgw.buckets:default.175209462.16__shadow_.bxz6tqhZzqZozTFkxPVspHfIhhVxaj5_11
> >
> > 2018-04-23 15:16:04.389067 7f1fdcc29a00  0 gc::process: removing
> >
> .rgw.buckets:default.175209462.16__shadow_.bxz6tqhZzqZozTFkxPVspHfIhhVxaj5_12
> >
> > 2018-04-23 15:16:04.413938 7f1fdcc29a00  0 gc::process: removing
> >
> .rgw.buckets:default.175209462.16__shadow_.bxz6tqhZzqZozTFkxPVspHfIhhVxaj5_13
> >
> > 2018-04-23 15:16:04.487977 7f1fdcc29a00  0 gc::process: removing
> >
> .rgw.buckets:default.175209462.16__shadow_.bxz6tqhZzqZozTFkxPVspHfIhhVxaj5_14
> >
> > 2018-04-23 15:16:04.544235 7f1fdcc29a00  0 gc::process: removing
> >
> .rgw.buckets:default.175209462.16__shadow_.06ry24pXQW8yH8EJpoqjEtZF6M6tiUv_1
> >
> > 2018-04-23 15:16:04.546978 7f1fdcc29a00  0 gc::process: removing
> >
> .rgw.buckets:default.175209462.16__shadow_.06ry24pXQW8yH8EJpoqjEtZF6M6tiUv_2
> >
> > 2018-04-23 15:16:04.598644 7f1fdcc29a00  0 gc::process: removing
> >
> .rgw.buckets:default.175209462.16__shadow_.06ry24pXQW8yH8EJpoqjEtZF6M6tiUv_3
> >
> > 2018-04-23 15:16:04.629519 7f1fdcc29a00  0 gc::process: removing
> >
> .rgw.buckets:default.175209462.16__shadow_.06ry24pXQW8yH8EJpoqjEtZF6M6tiUv_4
> >
> > 2018-04-23 15:16:04.700492 7f1fdcc29a00  0 gc::process: removing
> >
> .rgw.buckets:default.175209462.16__shadow_.06ry24pXQW8yH8EJpoqjEtZF6M6tiUv_5
> >
> > 2018-04-23 15:16:04.765798 7f1fdcc29a00  0 gc::process: removing
> >
> .rgw.buckets:default.175209462.16__shadow_.06ry24pXQW8yH8EJpoqjEtZF6M6tiUv_6
> >
> > 2018-04-23 15:16:04.772774 7f1fdcc29a00  0 gc::process: removing
> >
> .rgw.buckets:default.175209462.16__shadow_.06ry24pXQW8yH8EJpoqjEtZF6M6tiUv_7
> >
> > 2018-04-23 15:16:04.846379 7f1fdcc29a00  0 gc::process: removing
> >
> .rgw.buckets:default.175209462.16__shadow_.06ry24pXQW8yH8EJpoqjEtZF6M6tiUv_8
> >
> > 2018-04-23 15:16:04.935023 7f1fdcc29a00  0 gc::process: removing
> >
> .rgw.buckets:default.175209462.16__shadow_.06ry24pXQW8yH8EJpoqjEtZF6M6tiUv_9
> >
> > 2018-04-23 15:16:04.937229 7f1fdcc29a00  0 gc::process: removing
> >
> .rgw.buckets:default.175209462.16__shadow_.06ry24pXQW8yH8EJpoqjEtZF6M6tiUv_10
> >
> > 2018-04-23 15:16:04.968289 7f1fdcc29a00  0 gc::process: removing
> >
> .rgw.buckets:default.175209462.16__shadow_.06ry24pXQW8yH8EJpoqjE

[ceph-users] Cephalocon APAC 2018 report, videos and slides

2018-04-24 Thread Leonardo Vaz
Hi,

Last night I posted the Cephalocon 2018 conference report on the Ceph
blog[1], published the video recordings from the sessions on
YouTube[2] and the slide decks on Slideshare[3].

[1] https://ceph.com/community/cephalocon-apac-2018-report/
[2] https://www.youtube.com/playlist?list=PLrBUGiINAakNgeLvjald7NcWps_yDCblr
[3] https://www.slideshare.net/Inktank_Ceph/tag/cephalocon-apac-2018

I'd like to take the opportunity to apologize for the flood of posts on
Twitter and Google+ about the video uploads last night. It seems that
even though I disabled the checkbox to make posts announcing the new uploads
on social media, YouTube decided to post them anyway. Sorry for the
inconvenience.

Kindest regards,

Leo

-- 
Leonardo Vaz
Ceph Community Manager
Open Source and Standards Team
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] [rgw] user stats understanding

2018-04-24 Thread Rudenko Aleksandr
Hi, friends.

We use RGW user stats in our billing.

Example on Luminous:

radosgw-admin usage show --uid 5300c830-82e2-4dce-ac6d-1d97a65def33

{
"entries": [
{
"user": "5300c830-82e2-4dce-ac6d-1d97a65def33",
"buckets": [
{
"bucket": "",
"time": "2018-04-06 19:00:00.00Z",
"epoch": 1523041200,
"owner": "5300c830-82e2-4dce-ac6d-1d97a65def33",
"categories": [
{
"category": "list_buckets",
"bytes_sent": 141032,
"bytes_received": 0,
"ops": 402,
"successful_ops": 402
}
]
},
{
"bucket": "-",
"time": "2018-04-24 13:00:00.00Z",
"epoch": 1524574800,
"owner": "5300c830-82e2-4dce-ac6d-1d97a65def33",
"categories": [
{
"category": "get_obj",
"bytes_sent": 422,
"bytes_received": 0,
"ops": 2,
"successful_ops": 0
}
]
},
{
"bucket": "test",
"time": "2018-04-06 19:00:00.00Z",
"epoch": 1523041200,
"owner": "5300c830-82e2-4dce-ac6d-1d97a65def33",
"categories": [
…

...
{
"category": "get_obj",
"bytes_sent": 642,
"bytes_received": 0,
"ops": 3,
"successful_ops": 0
},
…

...
 ]
}
]
}
],
"summary": [
{
"user": "5300c830-82e2-4dce-ac6d-1d97a65def33",
"categories": [
…

...
{
"category": "get_obj",
"bytes_sent": 2569,
"bytes_received": 0,
"ops": 12,
"successful_ops": 0
},
{
"category": "list_bucket",
"bytes_sent": 185537,
"bytes_received": 0,
"ops": 302,
"successful_ops": 302
},
{
"category": "list_buckets",
"bytes_sent": 141032,
"bytes_received": 0,
"ops": 402,
"successful_ops": 402
},
…

...
],
"total": {
"bytes_sent": 884974,
"bytes_received": 0,
"ops": 1521,
"successful_ops": 1507
}
}
]
}

What statistics are in the dictionaries with "bucket": "" and "bucket": "-"?
 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Dying OSDs

2018-04-24 Thread Jan Marquardt
Hi,

It's been a while, but we are still fighting with this issue.

As suggested we deleted all snapshots, but the errors still occur.

We were able to gather some more information:

The reason why they are crashing is this assert:
https://github.com/ceph/ceph/blob/luminous/src/osd/PrimaryLogPG.cc#L353

With debug 20 we see this right before the OSD crashes:

2018-04-24 13:59:38.047697 7f929ba0d700 20 osd.4 pg_epoch: 144994
pg[0.103( v 140091'469328 (125640'467824,140091'469328] lb
0:c0e04acc:::rbd_data.221bf2eb141f2.00016379:head (bitwise)
local-lis/les=137681/137682 n=9535 ec=115/115 lis/c 144979/49591 les/c/f
144980/49596/0 144978/144979/144979) [4,17,2]/[2,17] r=-1 lpr=144979
pi=[49591,144979)/3 luod=0'0 crt=140091'469328 lcod 0'0 active+remapped]
 snapset 0=[]:[] legacy_snaps []

2018-04-24 16:34:54.558159 7f1c40e32700 20 osd.11 pg_epoch: 145549
pg[0.103( v 140091'469328 (125640'467824,140091'469328] lb
0:c0e04acc:::rbd_data.221bf2eb141f2.00016379:head (bitwise)
local-lis/les=138310/138311 n=9535 ec=115/115 lis/c 145548/49591 les/c/f
145549/49596/0 145547/145548/145548) [11,17,2]/[2,17] r=-1 lpr=145548
pi=[49591,145548)/3 luod=0'0 crt=140091'469328 lcod 0'0 active+remapped]
 snapset 0=[]:[] legacy_snaps []

Which is caused from this code:
https://github.com/ceph/ceph/blob/luminous/src/osd/PrimaryLogPG.cc#L349-L350

Any help would really be appreciated.

Best Regards

Jan


Am 12.04.18 um 10:53 schrieb Paul Emmerich:
> Hi,
> 
> thanks, but unfortunately it's not the thing I suspected :(
> Anyways, there's something wrong with your snapshots, the log also
> contains a lot of entries like this:
> 
> 2018-04-09 06:58:53.703353 7fb8931a0700 -1 osd.28 pg_epoch: 88438
> pg[0.5d( v 88438'223279 (86421'221681,88438'223279]
> local-lis/les=87450/87451 n=5634 ec=115/115 lis/c 87450/87450 les/c/f
> 87451/87451/0 87352/87450/87450) [37,6,28] r=2 lpr=87450 luod=0'0
> crt=88438'223279 lcod 88438'223278 active] _scan_snaps no head for
> 0:ba087b0f:::rbd_data.221bf2eb141f2.1436:46aa (have MIN)
> 
> The cluster I've debugged with the same crash also got a lot of snapshot
> problems including this one.
> In the end, only manually marking all snap_ids as deleted in the pool
> helped.
> 
> 
> Paul
> 
> 2018-04-10 21:48 GMT+02:00 Jan Marquardt  >:
> 
> Am 10.04.18 um 20:22 schrieb Paul Emmerich:
> > Hi,
> > 
> > I encountered the same crash a few months ago, see
> > https://tracker.ceph.com/issues/23030
> 
> > 
> > Can you post the output of
> > 
> >    ceph osd pool ls detail -f json-pretty
> > 
> > 
> > Paul
> 
> Yes, of course.
> 
> # ceph osd pool ls detail -f json-pretty
> 
> [
>     {
>         "pool_name": "rbd",
>         "flags": 1,
>         "flags_names": "hashpspool",
>         "type": 1,
>         "size": 3,
>         "min_size": 2,
>         "crush_rule": 0,
>         "object_hash": 2,
>         "pg_num": 768,
>         "pg_placement_num": 768,
>         "crash_replay_interval": 0,
>         "last_change": "91256",
>         "last_force_op_resend": "0",
>         "last_force_op_resend_preluminous": "0",
>         "auid": 0,
>         "snap_mode": "selfmanaged",
>         "snap_seq": 35020,
>         "snap_epoch": 91219,
>         "pool_snaps": [],
>         "removed_snaps":
> 
> "[1~4562,47f1~58,484a~9,4854~70,48c5~36,48fc~48,4945~d,4953~1,4957~1,495a~3,4960~1,496e~3,497a~1,4980~2,4983~3,498b~1,4997~1,49a8~1,49ae~1,49b1~2,49b4~1,49b7~1,49b9~3,49bd~5,49c3~6,49ca~5,49d1~4,49d6~1,49d8~2,49df~2,49e2~1,49e4~2,49e7~5,49ef~2,49f2~2,49f5~6,49fc~1,49fe~3,4a05~9,4a0f~4,4a14~4,4a1a~6,4a21~6,4a29~2,4a2c~3,4a30~1,4a33~5,4a39~3,4a3e~b,4a4a~1,4a4c~2,4a50~1,4a52~7,4a5a~1,4a5c~2,4a5f~4,4a64~1,4a66~2,4a69~2,4a6c~4,4a72~1,4a74~2,4a78~3,4a7c~6,4a84~2,4a87~b,4a93~4,4a99~1,4a9c~4,4aa1~7,4aa9~1,4aab~6,4ab2~2,4ab5~5,4abb~2,4abe~9,4ac8~a,4ad3~4,4ad8~13,4aec~16,4b03~6,4b0a~c,4b17~2,4b1a~3,4b1f~4,4b24~c,4b31~d,4b3f~13,4b53~1,4bfc~13ed,61e1~4a,622c~8,6235~a0,62d6~ac,63a6~2,63b2~2,63d0~2,63f7~2,6427~2,6434~10f]",
>         "quota_max_bytes": 0,
>         "quota_max_objects": 0,
>         "tiers": [],
>         "tier_of": -1,
>         "read_tier": -1,
>         "write_tier": -1,
>         "cache_mode": "none",
>         "target_max_bytes": 0,
>         "target_max_objects": 0,
>         "cache_target_dirty_ratio_micro": 0,
>         "cache_target_dirty_high_ratio_micro": 0,
>         "cache_target_full_ratio_micro": 0,
>         "cache_min_flush_age": 0,
>         "cache_min_evict_age": 0,
>         "erasure_code_profile": "",
>         "hit_set_params": {
>             "type": "none"
>         },
>         "hit_set_period": 0,
>         "hit_set_count": 0,
>         "use_gmt_hitset": true,
>         "min_read_r

Re: [ceph-users] RGW GC Processing Stuck

2018-04-24 Thread Matt Benjamin
Hi Sean,

Could you create an issue in tracker.ceph.com with this info?  That
would make it easier to iterate on.

thanks and regards,

Matt

On Tue, Apr 24, 2018 at 10:45 AM, Sean Redmond  wrote:
> Hi,
> We are currently using Jewel 10.2.7 and recently, we have been experiencing
> some issues with objects being deleted using the gc. After a bucket was
> unsuccessfully deleted using --purge-objects (first error next discussed
> occurred), all of the rgw's are occasionally becoming unresponsive and
> require a restart of the processes before they will accept requests again.
> On investigation of the garbage collection, it has an enormous list which we
> are struggling to count the length of, but seem stuck on a particular object
> which is not updating, shown in the logs below:
>
>
>
> 2018-04-23 15:16:04.101660 7f1fdcc29a00  0 gc::process: removing
> .rgw.buckets:default.290071.4_XXX//XX/XX/XXX.ZIP
>
> 2018-04-23 15:16:04.104231 7f1fdcc29a00  0 gc::process: removing
> .rgw.buckets:default.175209462.16__shadow_.bxz6tqhZzqZozTFkxPVspHfIhhVxaj5_1
>
> 2018-04-23 15:16:04.105541 7f1fdcc29a00  0 gc::process: removing
> .rgw.buckets:default.175209462.16__shadow_.bxz6tqhZzqZozTFkxPVspHfIhhVxaj5_2
>
> 2018-04-23 15:16:04.176235 7f1fdcc29a00  0 gc::process: removing
> .rgw.buckets:default.175209462.16__shadow_.bxz6tqhZzqZozTFkxPVspHfIhhVxaj5_3
>
> 2018-04-23 15:16:04.178435 7f1fdcc29a00  0 gc::process: removing
> .rgw.buckets:default.175209462.16__shadow_.bxz6tqhZzqZozTFkxPVspHfIhhVxaj5_4
>
> 2018-04-23 15:16:04.250883 7f1fdcc29a00  0 gc::process: removing
> .rgw.buckets:default.175209462.16__shadow_.bxz6tqhZzqZozTFkxPVspHfIhhVxaj5_5
>
> 2018-04-23 15:16:04.297912 7f1fdcc29a00  0 gc::process: removing
> .rgw.buckets:default.175209462.16__shadow_.bxz6tqhZzqZozTFkxPVspHfIhhVxaj5_6
>
> 2018-04-23 15:16:04.298803 7f1fdcc29a00  0 gc::process: removing
> .rgw.buckets:default.175209462.16__shadow_.bxz6tqhZzqZozTFkxPVspHfIhhVxaj5_7
>
> 2018-04-23 15:16:04.320202 7f1fdcc29a00  0 gc::process: removing
> .rgw.buckets:default.175209462.16__shadow_.bxz6tqhZzqZozTFkxPVspHfIhhVxaj5_8
>
> 2018-04-23 15:16:04.340124 7f1fdcc29a00  0 gc::process: removing
> .rgw.buckets:default.175209462.16__shadow_.bxz6tqhZzqZozTFkxPVspHfIhhVxaj5_9
>
> 2018-04-23 15:16:04.383924 7f1fdcc29a00  0 gc::process: removing
> .rgw.buckets:default.175209462.16__shadow_.bxz6tqhZzqZozTFkxPVspHfIhhVxaj5_10
>
> 2018-04-23 15:16:04.386865 7f1fdcc29a00  0 gc::process: removing
> .rgw.buckets:default.175209462.16__shadow_.bxz6tqhZzqZozTFkxPVspHfIhhVxaj5_11
>
> 2018-04-23 15:16:04.389067 7f1fdcc29a00  0 gc::process: removing
> .rgw.buckets:default.175209462.16__shadow_.bxz6tqhZzqZozTFkxPVspHfIhhVxaj5_12
>
> 2018-04-23 15:16:04.413938 7f1fdcc29a00  0 gc::process: removing
> .rgw.buckets:default.175209462.16__shadow_.bxz6tqhZzqZozTFkxPVspHfIhhVxaj5_13
>
> 2018-04-23 15:16:04.487977 7f1fdcc29a00  0 gc::process: removing
> .rgw.buckets:default.175209462.16__shadow_.bxz6tqhZzqZozTFkxPVspHfIhhVxaj5_14
>
> 2018-04-23 15:16:04.544235 7f1fdcc29a00  0 gc::process: removing
> .rgw.buckets:default.175209462.16__shadow_.06ry24pXQW8yH8EJpoqjEtZF6M6tiUv_1
>
> 2018-04-23 15:16:04.546978 7f1fdcc29a00  0 gc::process: removing
> .rgw.buckets:default.175209462.16__shadow_.06ry24pXQW8yH8EJpoqjEtZF6M6tiUv_2
>
> 2018-04-23 15:16:04.598644 7f1fdcc29a00  0 gc::process: removing
> .rgw.buckets:default.175209462.16__shadow_.06ry24pXQW8yH8EJpoqjEtZF6M6tiUv_3
>
> 2018-04-23 15:16:04.629519 7f1fdcc29a00  0 gc::process: removing
> .rgw.buckets:default.175209462.16__shadow_.06ry24pXQW8yH8EJpoqjEtZF6M6tiUv_4
>
> 2018-04-23 15:16:04.700492 7f1fdcc29a00  0 gc::process: removing
> .rgw.buckets:default.175209462.16__shadow_.06ry24pXQW8yH8EJpoqjEtZF6M6tiUv_5
>
> 2018-04-23 15:16:04.765798 7f1fdcc29a00  0 gc::process: removing
> .rgw.buckets:default.175209462.16__shadow_.06ry24pXQW8yH8EJpoqjEtZF6M6tiUv_6
>
> 2018-04-23 15:16:04.772774 7f1fdcc29a00  0 gc::process: removing
> .rgw.buckets:default.175209462.16__shadow_.06ry24pXQW8yH8EJpoqjEtZF6M6tiUv_7
>
> 2018-04-23 15:16:04.846379 7f1fdcc29a00  0 gc::process: removing
> .rgw.buckets:default.175209462.16__shadow_.06ry24pXQW8yH8EJpoqjEtZF6M6tiUv_8
>
> 2018-04-23 15:16:04.935023 7f1fdcc29a00  0 gc::process: removing
> .rgw.buckets:default.175209462.16__shadow_.06ry24pXQW8yH8EJpoqjEtZF6M6tiUv_9
>
> 2018-04-23 15:16:04.937229 7f1fdcc29a00  0 gc::process: removing
> .rgw.buckets:default.175209462.16__shadow_.06ry24pXQW8yH8EJpoqjEtZF6M6tiUv_10
>
> 2018-04-23 15:16:04.968289 7f1fdcc29a00  0 gc::process: removing
> .rgw.buckets:default.175209462.16__shadow_.06ry24pXQW8yH8EJpoqjEtZF6M6tiUv_11
>
> 2018-04-23 15:16:05.005194 7f1fdcc29a00  0 gc::process: removing
> .rgw.buckets:default.175209462.16__shadow_.06ry24pXQW8yH8EJpoqjEtZF6M6tiUv_12
>
>
>
> We seem completely unable to get this deleted, and nothing else of immediate
> concern is flagging up as a potential cause of all RGWs become unresponsive
> at the same time. On the bucket containing this object 

Re: [ceph-users] Questions regarding hardware design of an SSD only cluster

2018-04-24 Thread Wido den Hollander


On 04/24/2018 05:01 AM, Mohamad Gebai wrote:
> 
> 
> On 04/23/2018 09:24 PM, Christian Balzer wrote:
>> 
>>> If anyone has some ideas/thoughts/pointers, I would be glad to hear them.
>>>
>> RAM, you'll need a lot of it, even more with Bluestore given the current
>> caching.
>> I'd say 1GB per TB storage as usual and 1-2GB extra per OSD.
> 
> Does that still stand? I was under the impression that with Bluestore,
> the required RAM is mostly a function of the Bluestore cache size rather
> than raw storage size (we're currently in the process of confirming this).
> 

It is. The amount of storage doesn't matter, but the BlueStore cache is
a per-OSD config value indeed. So a 1TB or a 4TB OSD won't have a big
difference in the amount of memory it uses in most situations.

Setting the BlueStore cache to a higher value will improve performance
though as RocksDB lookups are faster.
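
The relevant knobs are the bluestore_cache_size* options, e.g. (sizes below
are just an example, not a recommendation):

  [osd]
  # per-OSD BlueStore cache; the _hdd/_ssd variants apply when the
  # generic bluestore_cache_size is left at 0
  bluestore_cache_size_ssd = 3221225472  # 3GiB
  bluestore_cache_size_hdd = 1073741824  # 1GiB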

Wido

> Mohamad
> 
>>> Regards,
>>>
>>> Florian
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RGW GC Processing Stuck

2018-04-24 Thread Sean Redmond
Hi,
We are currently using Jewel 10.2.7 and recently, we have been experiencing
some issues with objects being deleted using the gc. After a bucket was
unsuccessfully deleted using --purge-objects (first error next discussed
occurred), all of the rgw's are occasionally becoming unresponsive and
require a restart of the processes before they will accept requests again.
On investigation of the garbage collection, it has an enormous list which
we are struggling to count the length of, but seem stuck on a particular
object which is not updating, shown in the logs below:



2018-04-23 15:16:04.101660 7f1fdcc29a00  0 gc::process: removing
.rgw.buckets:default.290071.4_XXX//XX/XX/XXX.ZIP

2018-04-23 15:16:04.104231 7f1fdcc29a00  0 gc::process: removing
.rgw.buckets:default.175209462.16__shadow_.bxz6tqhZzqZozTFkxPVspHfIhhVxaj5_1

2018-04-23 15:16:04.105541 7f1fdcc29a00  0 gc::process: removing
.rgw.buckets:default.175209462.16__shadow_.bxz6tqhZzqZozTFkxPVspHfIhhVxaj5_2

2018-04-23 15:16:04.176235 7f1fdcc29a00  0 gc::process: removing
.rgw.buckets:default.175209462.16__shadow_.bxz6tqhZzqZozTFkxPVspHfIhhVxaj5_3

2018-04-23 15:16:04.178435 7f1fdcc29a00  0 gc::process: removing
.rgw.buckets:default.175209462.16__shadow_.bxz6tqhZzqZozTFkxPVspHfIhhVxaj5_4

2018-04-23 15:16:04.250883 7f1fdcc29a00  0 gc::process: removing
.rgw.buckets:default.175209462.16__shadow_.bxz6tqhZzqZozTFkxPVspHfIhhVxaj5_5

2018-04-23 15:16:04.297912 7f1fdcc29a00  0 gc::process: removing
.rgw.buckets:default.175209462.16__shadow_.bxz6tqhZzqZozTFkxPVspHfIhhVxaj5_6

2018-04-23 15:16:04.298803 7f1fdcc29a00  0 gc::process: removing
.rgw.buckets:default.175209462.16__shadow_.bxz6tqhZzqZozTFkxPVspHfIhhVxaj5_7

2018-04-23 15:16:04.320202 7f1fdcc29a00  0 gc::process: removing
.rgw.buckets:default.175209462.16__shadow_.bxz6tqhZzqZozTFkxPVspHfIhhVxaj5_8

2018-04-23 15:16:04.340124 7f1fdcc29a00  0 gc::process: removing
.rgw.buckets:default.175209462.16__shadow_.bxz6tqhZzqZozTFkxPVspHfIhhVxaj5_9

2018-04-23 15:16:04.383924 7f1fdcc29a00  0 gc::process: removing
.rgw.buckets:default.175209462.16__shadow_.bxz6tqhZzqZozTFkxPVspHfIhhVxaj5_10

2018-04-23 15:16:04.386865 7f1fdcc29a00  0 gc::process: removing
.rgw.buckets:default.175209462.16__shadow_.bxz6tqhZzqZozTFkxPVspHfIhhVxaj5_11

2018-04-23 15:16:04.389067 7f1fdcc29a00  0 gc::process: removing
.rgw.buckets:default.175209462.16__shadow_.bxz6tqhZzqZozTFkxPVspHfIhhVxaj5_12

2018-04-23 15:16:04.413938 7f1fdcc29a00  0 gc::process: removing
.rgw.buckets:default.175209462.16__shadow_.bxz6tqhZzqZozTFkxPVspHfIhhVxaj5_13

2018-04-23 15:16:04.487977 7f1fdcc29a00  0 gc::process: removing
.rgw.buckets:default.175209462.16__shadow_.bxz6tqhZzqZozTFkxPVspHfIhhVxaj5_14

2018-04-23 15:16:04.544235 7f1fdcc29a00  0 gc::process: removing
.rgw.buckets:default.175209462.16__shadow_.06ry24pXQW8yH8EJpoqjEtZF6M6tiUv_1

2018-04-23 15:16:04.546978 7f1fdcc29a00  0 gc::process: removing
.rgw.buckets:default.175209462.16__shadow_.06ry24pXQW8yH8EJpoqjEtZF6M6tiUv_2

2018-04-23 15:16:04.598644 7f1fdcc29a00  0 gc::process: removing
.rgw.buckets:default.175209462.16__shadow_.06ry24pXQW8yH8EJpoqjEtZF6M6tiUv_3

2018-04-23 15:16:04.629519 7f1fdcc29a00  0 gc::process: removing
.rgw.buckets:default.175209462.16__shadow_.06ry24pXQW8yH8EJpoqjEtZF6M6tiUv_4

2018-04-23 15:16:04.700492 7f1fdcc29a00  0 gc::process: removing
.rgw.buckets:default.175209462.16__shadow_.06ry24pXQW8yH8EJpoqjEtZF6M6tiUv_5

2018-04-23 15:16:04.765798 7f1fdcc29a00  0 gc::process: removing
.rgw.buckets:default.175209462.16__shadow_.06ry24pXQW8yH8EJpoqjEtZF6M6tiUv_6

2018-04-23 15:16:04.772774 7f1fdcc29a00  0 gc::process: removing
.rgw.buckets:default.175209462.16__shadow_.06ry24pXQW8yH8EJpoqjEtZF6M6tiUv_7

2018-04-23 15:16:04.846379 7f1fdcc29a00  0 gc::process: removing
.rgw.buckets:default.175209462.16__shadow_.06ry24pXQW8yH8EJpoqjEtZF6M6tiUv_8

2018-04-23 15:16:04.935023 7f1fdcc29a00  0 gc::process: removing
.rgw.buckets:default.175209462.16__shadow_.06ry24pXQW8yH8EJpoqjEtZF6M6tiUv_9

2018-04-23 15:16:04.937229 7f1fdcc29a00  0 gc::process: removing
.rgw.buckets:default.175209462.16__shadow_.06ry24pXQW8yH8EJpoqjEtZF6M6tiUv_10

2018-04-23 15:16:04.968289 7f1fdcc29a00  0 gc::process: removing
.rgw.buckets:default.175209462.16__shadow_.06ry24pXQW8yH8EJpoqjEtZF6M6tiUv_11

2018-04-23 15:16:05.005194 7f1fdcc29a00  0 gc::process: removing
.rgw.buckets:default.175209462.16__shadow_.06ry24pXQW8yH8EJpoqjEtZF6M6tiUv_12



We seem completely unable to get this deleted, and nothing else of
immediate concern is flagging up as a potential cause of all RGWs become
unresponsive at the same time. On the bucket containing this object (the
one we originally tried to purge), I have attempted a further purge passing
the "--bypass-gc" parameter to it, but this also resulted in all rgws
becoming unresponsive within 30 minutes and so I terminated the operation
and restarted the rgws again.
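
(A rough way to get an idea of the gc queue length would be something like
the following, though it is painfully slow on a list this size and I'm not
certain of the exact field name in the Jewel output:)

  radosgw-admin gc list --include-all > /tmp/gc-list.json
  grep -c '"oid"' /tmp/gc-list.json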



The bucket we attempted to remove has no shards and I have attached the
details below. 90% of the conten

Re: [ceph-users] Questions regarding hardware design of an SSD only cluster

2018-04-24 Thread Florian Florensa
2018-04-24 3:24 GMT+02:00 Christian Balzer :
> Hello,
>

Hi Christian, and thanks for your detailed answer.

> On Mon, 23 Apr 2018 17:43:03 +0200 Florian Florensa wrote:
>
>> Hello everyone,
>>
>> I am in the process of designing a Ceph cluster, that will contain
>> only SSD OSDs, and I was wondering how should I size my cpu.
> Several threads about this around here, but first things first.
> Any specifics about the storage needs, i.e. do you think you need the SSDs
> for bandwidth or for IOPS reasons primarily?
> Lots of smallish writes or large reads/writes?
>
>> The cluster will only be used for block storage.
>> The OSDs will be Samsung PM863 (2Tb or 4Tb, this will be determined
>
> I assume PM863a, the non "a" model seems to be gone.
> And that's a 1.3 DWPD drive, with a collocated journal or lots of small
> writes and a collocated WAL/DB it will be half of that.
> So run the numbers and make sure this is actually a good fit in the
> endurance area.
> Of course depending on your needs, journals or WAL/DB on higher endurance
> NVMes might be a much better fit anyway.
>

Well, if it makes sense, maybe a few NVMes for WAL/DB would be worth it;
how many SSDs should I put behind a single NVMe?
And how should I size them? AFAIK it's ~1.6% of drive capacity for
WAL+DB using default values.

>> when we will set the total capacity in stone), and it will be in 2U
>> 24-SSD servers
> How many servers are you thinking about?
> Because the fact that you're willing to double the SSD size but not the
> number of servers suggests that you're thinking about a small number of
> servers.
> And while dense servers will save you space and money, more and smaller
> servers are generally a better fit for Ceph, not least when considering
> failure domains (a host typically).
>
The cluster should start somewhere between 20-30 OSD nodes, and 3 monitors.
And it should grow in the foreseeable future to up to 50 OSD nodes,
while keeping 3 monitors, but that would be in a while (like 2-3 years).
Of course this number of nodes depends on the usable storage and IOPS.
The goal is to replace the SANs for hypervisors and some diskless baremetal
servers.

>> Those server will probably be either Supermicro 2029U-E1CR4T or
>> Supermicro 2028R-E1CR48L.
>> I’ve read quite a lot of documentation regarding hardware choices, and
>> I can’t find a ‘guideline’ for OSDs on SSD with colocated journal.
> If this is a new cluster, that would be collocated WAL/DB and Bluestore.
> Never mind my misgivings about Bluestore, at this point in time you
> probably don't want to deploy a new cluster with filestore, unless you
> have very specific needs and know what you're doing.
>

Yup the goal was to go for bluestore, as in an RBD workload it seems to
be the better option (avoiding the write amplification and its induced latency)

>> I was pointing for either dual ‘Xeon gold 6146’ or dual ‘Xeon 2699v4’
>> for the cpus, depending on the chassis.
> The first one is a much better fit in terms of the "a fast core for each
> OSD" philosophy needed for low latency and high IOPS. The 2nd is just
> overkill, 24 real cores will do and for extreme cases I'm sure I can still
> whip a fio setting that will saturate the 44 real cores of the 2nd setup.
> Of course dual CPU configurations like this come with a potential latency
> penalty for NUMA misses.
>

Shouldn't I be able to 'pin' OSD daemons to specific CPUs to avoid NUMA
zone crossing?
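
(I'm thinking of something like a per-OSD systemd drop-in, e.g. the
following; the OSD id and core list are just an example and would depend on
the actual topology:)

  # /etc/systemd/system/ceph-osd@12.service.d/cpuaffinity.conf
  [Service]
  CPUAffinity=0-5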

> Unfortunately Supermicro didn't release my suggested Epyc based Ceph
> storage node (yet?).
> I was mentioning a single socket 1U (or 2U double) with 10 2.5 bays, with
> up to 2 NVMe in those bays.
> But even dual CPU Epyc based systems have a clear speed advantage when it
> comes to NUMA misses due to the socket interconnect (Infinity Fabric).
>
> Do consider this alternative setup:
> https://www.supermicro.com.tw/Aplus/system/1U/1123/AS-1123US-TR4.cfm
> With either 8 SSDs and 2 NVMes or 10 SSDs and either
> 2x Epyc 7251 (adequate core ratio and speed, cheap) or
> 2x Epyc 7351 (massive overkill, but still 1/4 of the Intel price tag).
>
> The unreleased AS-2123US-TN24R25 with 2x Epyc 7351 might be a good fit as
> well.
>

I was also considering Epyc, but I was considering using EC on my RBD pools
to maximize the available capacity. Does anyone have experience using EC on
RBD, with or without Epyc CPUs?
Also, if I am able to go for single-CPU chassis, that would decrease the
electrical footprint of each node, thus allowing me to put more of them per
rack.

>> For the network part, I was thinking of using two Dual port connectx4
>> Lx from mellanox per servers.
>>
> Going to what kind of network/switches?
>

I was thinking of having up to 4x25Gb ethernet for each node, going to
a pair of switches, to be able to withstand the loss of a switch or a
network card.

>> If anyone has some ideas/thoughts/pointers, I would be glad to hear them.
>>
> RAM, you'll need a lot of it, even more with Bluestore given the current
> caching.
> I'd say

Re: [ceph-users] Ceph 12.2.4 MGR spams syslog with "mon failed to return metadata for mds"

2018-04-24 Thread John Spray
On Fri, Apr 20, 2018 at 11:29 AM, Charles Alva  wrote:
> Marc,
>
> Thanks.
>
> The mgr log spam occurs even without dashboard module enabled. I never
> checked the ceph mgr log before because the ceph cluster is always healthy.
> Based on the ceph mgr logs in syslog, the spam occurred long before and
> after I enabled the dashboard module.
>
>> # ceph -s
>>   cluster:
>> id: xxx
>> health: HEALTH_OK
>>
>>   services:
>> mon: 3 daemons, quorum mon1,mon2,mon3
>> mgr: mon1(active), standbys: mon2, mon3
>> mds: cephfs-1/1/1 up  {0=mds1=up:active}, 2 up:standby
>> osd: 14 osds: 14 up, 14 in
>> rgw: 3 daemons active
>>
>>   data:
>> pools:   10 pools, 248 pgs
>> objects: 546k objects, 2119 GB
>> usage:   6377 GB used, 6661 GB / 13039 GB avail
>> pgs: 248 active+clean
>>
>>   io:
>> client:   25233 B/s rd, 1409 kB/s wr, 6 op/s rd, 59 op/s wr
>
>
>
> My ceph mgr log is spam with following log every second. This happens on 2
> separate Ceph 12.2.4 clusters.

(I assume that the mon, mgr and mds are all 12.2.4)

The "failed to return metadata" part is kind of mysterious.  Do you
also get an error if you try to do "ceph mds metadata mds1" by hand?
(that's what the mgr is trying to do).

If the metadata works when using the CLI by hand, you may have an
issue with the mgr's auth caps, check that its mon caps are set to
"allow profile mgr".

The "unhandled message" part is from a path where the mgr code is
ignoring messages from services that don't have any metadata (I think
this is actually a bug, as we should be considering these messages as
handled even if we're ignoring them).

John

>> # less +F /var/log/ceph/ceph-mgr.mon1.log
>>
>>  ...
>>
>> 2018-04-20 06:21:18.782861 7fca238ff700  1 mgr send_beacon active
>> 2018-04-20 06:21:19.050671 7fca14809700  0 ms_deliver_dispatch: unhandled
>> message 0x55bf897d1c00 mgrreport(mds.mds1 +24-0 packed 214) v5 from mds.0
>> 10.100.100.114:6800/4132681434
>> 2018-04-20 06:21:19.051047 7fca25102700  1 mgr finish mon failed to return
>> metadata for mds.mds1: (2) No such file or directory
>> 2018-04-20 06:21:20.050889 7fca14809700  0 ms_deliver_dispatch: unhandled
>> message 0x55bf897eac00 mgrreport(mds.mds1 +24-0 packed 214) v5 from mds.0
>> 10.100.100.114:6800/4132681434
>> 2018-04-20 06:21:20.051351 7fca25102700  1 mgr finish mon failed to return
>> metadata for mds.mds1: (2) No such file or directory
>> 2018-04-20 06:21:20.784455 7fca238ff700  1 mgr send_beacon active
>> 2018-04-20 06:21:21.050968 7fca14809700  0 ms_deliver_dispatch: unhandled
>> message 0x55bf897d0d00 mgrreport(mds.mds1 +24-0 packed 214) v5 from mds.0
>> 10.100.100.114:6800/4132681434
>> 2018-04-20 06:21:21.051441 7fca25102700  1 mgr finish mon failed to return
>> metadata for mds.mds1: (2) No such file or directory
>> 2018-04-20 06:21:22.051254 7fca14809700  0 ms_deliver_dispatch: unhandled
>> message 0x55bf897ec100 mgrreport(mds.mds1 +24-0 packed 214) v5 from mds.0
>> 10.100.100.114:6800/4132681434
>> 2018-04-20 06:21:22.051704 7fca25102700  1 mgr finish mon failed to return
>> metadata for mds.mds1: (2) No such file or directory
>> 2018-04-20 06:21:22.786656 7fca238ff700  1 mgr send_beacon active
>> 2018-04-20 06:21:23.051235 7fca14809700  0 ms_deliver_dispatch: unhandled
>> message 0x55bf897d0400 mgrreport(mds.mds1 +24-0 packed 214) v5 from mds.0
>> 10.100.100.114:6800/4132681434
>> 2018-04-20 06:21:23.051712 7fca25102700  1 mgr finish mon failed to return
>> metadata for mds.mds1: (2) No such file or directory
>> 2018-04-20 06:21:24.051353 7fca14809700  0 ms_deliver_dispatch: unhandled
>> message 0x55bf897e6000 mgrreport(mds.mds1 +24-0 packed 214) v5 from mds.0
>> 10.100.100.114:6800/4132681434
>> 2018-04-20 06:21:24.051971 7fca25102700  1 mgr finish mon failed to return
>> metadata for mds.mds1: (2) No such file or directory
>> 2018-04-20 06:21:24.788228 7fca238ff700  1 mgr send_beacon active
>> 2018-04-20 06:21:25.051642 7fca14809700  0 ms_deliver_dispatch: unhandled
>> message 0x55bf897d1900 mgrreport(mds.mds1 +24-0 packed 214) v5 from mds.0
>> 10.100.100.114:6800/4132681434
>> 2018-04-20 06:21:25.052182 7fca25102700  1 mgr finish mon failed to return
>> metadata for mds.mds1: (2) No such file or directory
>> 2018-04-20 06:21:26.051641 7fca14809700  0 ms_deliver_dispatch: unhandled
>> message 0x55bf89835600 mgrreport(mds.mds1 +24-0 packed 214) v5 from mds.0
>> 10.100.100.114:6800/4132681434
>> 2018-04-20 06:21:26.052169 7fca25102700  1 mgr finish mon failed to return
>> metadata for mds.mds1: (2) No such file or directory
>> ...
>
>
> Kind regards,
>
> Charles Alva
> Sent from Gmail Mobile
>
> On Fri, Apr 20, 2018 at 10:57 AM, Marc Roos 
> wrote:
>>
>>
>> Hi Charles,
>>
>> I am more or less responding to your syslog issue. I don't have the
>> experience on cephfs to give you reliable advice, so let's wait for the
>> experts to reply. But I guess you have to give a little more background
>> info, like:
>>
>> This happened to running cluster, you d

Re: [ceph-users] cephfs luminous 12.2.4 - multi-active MDSes with manual pinning

2018-04-24 Thread Linh Vu
Hi Dan,


Thanks! Ah so the "nicely exporting" thing is just a distraction, that's good 
to know.


I did bump mds log max segments and max expiring to 240 after reading the 
previous discussion. It seemed to help when there was just 1 active MDS. It 
doesn't really do much at the moment, although the load remains roughly the 
same as before.
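
For reference, I bumped them at runtime roughly like this on each MDS (plus
the matching entries in ceph.conf so they survive a restart):

  # raise the journal trimming limits via the MDS admin socket
  ceph daemon mds.$mds_hostname config set mds_log_max_segments 240
  ceph daemon mds.$mds_hostname config set mds_log_max_expiring 240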


I also saw messages about old clients failing to release caps, but wasn't sure 
what caused them because of all the "nicely exporting" noise. Evicting old 
clients only cleared the issue for about 10s, then more clients joined the 
warning list. Only restarting mds.0 so that the standby mds replaced it 
restored cluster health.
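
The eviction and failover themselves were just the usual commands, roughly:

  # list client sessions (ids, addresses, client versions) on an MDS
  ceph daemon mds.$mds_hostname session ls
  # evict a specific client by id
  ceph tell mds.0 client evict id=<client id>
  # fail rank 0 so the standby takes over
  ceph mds fail 0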


Cheers,

Linh


From: Dan van der Ster 
Sent: Tuesday, 24 April 2018 6:20:18 PM
To: Linh Vu
Cc: ceph-users
Subject: Re: [ceph-users] cephfs luminous 12.2.4 - multi-active MDSes with 
manual pinning

That "nicely exporting" thing is a logging issue that was apparently
fixed in https://github.com/ceph/ceph/pull/19220. I'm not sure if that
will be backported to luminous.

Otherwise the slow requests could be due to either slow trimming (see
previous discussions about mds log max expiring and mds log max
segments options) or old clients failing to release caps correctly
(you would see appropriate warnings about this).
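
A quick way to tell which of the two it is, assuming you have admin socket
access on the MDS hosts:

  # confirm what the trimming options are actually set to on the running MDS
  ceph daemon mds.<name> config get mds_log_max_segments
  ceph daemon mds.<name> config get mds_log_max_expiring
  # the cap-release case shows up in the health output, naming the clients
  ceph health detail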

-- Dan


On Tue, Apr 24, 2018 at 9:34 AM, Linh Vu  wrote:
> Hi all,
>
>
> I have a cluster running cephfs on Luminous 12.2.4, using 2 active MDSes + 1
> standby. I have 3 shares: /projects, /home and /scratch, and I've decided to
> try manual pinning as described here:
> http://docs.ceph.com/docs/master/cephfs/multimds/
>
>
> /projects is pinned to mds.0 (rank 0)
>
> /home and /scratch are pinned to mds.1 (rank 1)
>
> Pinning is verified by `ceph daemon mds.$mds_hostname get subtrees | jq '.[]
> | [.dir.path, .auth_first, .export_pin]'`
>
>
> Clients mount either via ceph-fuse 12.2.4, or kernel client 4.15.13.
>
>
> On our test cluster (same version and setup), it works as I think it should.
> I simulate metadata load via mdtest (up to around 2000 req/s on each mds,
> which is a VM with 4 cores, 16GB RAM), and loads on /projects go to mds.0,
> loads on the other shares go to mds.1. Nothing pops up in the logs. I can
> also successfully reset to no pinning (i.e. using the default load balancing)
> via setting the ceph.dir.pin value to -1, and vice versa. All that happens
> is this shows up in the logs:
>
>   mds.mds1-test-ceph2 asok_command: get subtrees (starting...)
>
>   mds.mds1-test-ceph2 asok_command: get subtrees (complete)
>
> However, on our production cluster, with more powerful MDSes (10 cores
> 3.4GHz, 256GB RAM, much faster networking), I get this in the logs
> constantly:
>
> 2018-04-24 16:29:21.998261 7f02d1af9700  0 mds.1.migrator nicely exporting
> to mds.0 [dir 0x110cd91.1110* /home/ [2,head] auth{0=1017} v=5632699
> cv=5632651/5632651 dir_auth=1 state=1611923458|complete|auxsubtree f(v84
> 55=0+55) n(v245771 rc2018-04-24 16:28:32.830971 b233439385711
> 423085=383063+40022) hs=55+0,ss=0+0 dirty=1 | child=1 frozen=0 subtree=1
> replicated=1 dirty=1 authpin=0 0x55691ccf1c00]
>
> To clarify, /home is pinned to mds.1, so there is no reason it should export
> this to mds.0, and the loads on both MDSes (req/s, network load, CPU load)
> are fairly low, lower than those on the test MDS VMs.
>
> Sometimes (depending on which mds starts first), I would get the same
> message but the other way around i.e "mds.0.migrator nicely exporting to
> mds.1" the workload that mds.0 should be doing. This only appears on one
> mds, never the other, until one is restarted.
>
> And we've had a couple of occasions where we get this sort of slow requests:
>
> 7fd401126700  0 log_channel(cluster) log [WRN] : slow request 7681.127406
> seconds old, received at 2018-04-20 08:17:35.970498:
> client_request(client.875554:238655 lookup #0x10038ff1eab/punim0116
> 2018-04-20 08:17:35.970319 caller_uid=10171, caller_gid=1{1,10123,})
> currently failed to authpin local pins
>
> Which then seems to snowball into thousands of slow requests, until mds.0 is
> restarted. When these slow requests happen, loads are fairly low on the
> active MDSes, although it is possible that the users could be doing
> something funky with metadata on production that I can't reproduce with
> mdtest.
>
> I thought the manual pinning likely isn't working as intended due to the
> "mds.1.migrator nicely exporting to mds.0" messages in the logs (to me it
> seems to indicate that we have a bad load balancing situation) but I can't
> seem to replicate this issue in test. Test cluster seems to be working as
> intended.
>
> Am I doing manual pinning right? Should I even be using it?
>
> Cheers,
> Linh
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

[ceph-users] Fixing Remapped PG's

2018-04-24 Thread Dilip Renkila
Hi all,

We have a Ceph Kraken cluster. Last week we lost an OSD server, and then we
added one more OSD server with the same configuration. We let the cluster
recover, but I don't think it has: most PGs are still stuck in a remapped
and degraded state. When I restart all OSD daemons, some PGs get fixed by
themselves, but not all.
I think there is a bug in Ceph. Please tell me if you want more info to
debug this.
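
For whoever picks this up, these are the sort of commands I can run to
provide more detail:

  # more detail on the stuck PGs and health warnings
  ceph health detail
  ceph pg dump_stuck unclean
  # per-OSD utilisation and weights, to check the new host is taking data
  ceph osd df tree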

root@node16:~# ceph -v
ceph version 11.2.1 (e0354f9d3b1eea1d75a7dd487ba8098311be38a7)

root@node15:~# ceph -s
cluster 7c75f6e9-b858-4ac4-aa26-48ae1f33eda2
 health HEALTH_WARN
361 pgs backfill_wait
399 pgs degraded
1 pgs recovering
38 pgs recovery_wait
399 pgs stuck degraded
400 pgs stuck unclean
361 pgs stuck undersized
361 pgs undersized
recovery 98076/465244 objects degraded (21.081%)
recovery 102362/465244 objects misplaced (22.002%)
recovery 1/153718 unfound (0.001%)
pool cinder-volumes pg_num 300 > pgp_num 128
pool ephemeral-vms pg_num 300 > pgp_num 128
1 mons down, quorum 0,1 node15,node16
 monmap e2: 3 mons at {node15=
10.0.5.15:6789/0,node16=10.0.5.16:6789/0,node17=10.0.5.17:6789/0}
election epoch 1230, quorum 0,1 node15,node16
mgr active: node15 standbys: node16
 osdmap e7924: 6 osds: 6 up, 6 in; 362 remapped pgs
flags sortbitwise,require_jewel_osds,require_kraken_osds
  pgmap v16920461: 600 pgs, 2 pools, 586 GB data, 150 kobjects
1400 GB used, 4165 GB / 5566 GB avail
98076/465244 objects degraded (21.081%)
102362/465244 objects misplaced (22.002%)
1/153718 unfound (0.001%)
 360 active+undersized+degraded+remapped+backfill_wait
 200 active+clean
  38 active+recovery_wait+degraded
   1 active+recovering+undersized+degraded+remapped
   1 active+remapped+backfill_wait
  client io 147 kB/s wr, 0 op/s rd, 21 op/s wr

root@node16:~# ceph osd tree
ID WEIGHT  TYPE NAME   UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 5.81839 root default
-2 1.81839 host node9
11 0.90919 osd.11   up  1.0  1.0
 1 0.90919 osd.1up  1.0  1.0
-3 2.0 host node10
 0 1.0 osd.0up  1.0  1.0
 2 1.0 osd.2up  1.0  1.0
-4 2.0 host node8
 3 1.0 osd.3up  1.0  1.0
 6 1.0 osd.6up  1.0  1.0
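
One thing I noticed in the ceph -s output above: both pools still have
pgp_num 128 while pg_num is 300. As far as I understand, data is not
rebalanced across the newer PGs until pgp_num is raised to match, e.g.:

  # let placement actually use all 300 PGs in each pool
  ceph osd pool set cinder-volumes pgp_num 300
  ceph osd pool set ephemeral-vms pgp_num 300

Is raising pgp_num the right next step here, or will it just add more
backfill on top of the current recovery?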
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs luminous 12.2.4 - multi-active MDSes with manual pinning

2018-04-24 Thread Dan van der Ster
That "nicely exporting" thing is a logging issue that was apparently
fixed in https://github.com/ceph/ceph/pull/19220. I'm not sure if that
will be backported to luminous.

Otherwise the slow requests could be due to either slow trimming (see
previous discussions about mds log max expiring and mds log max
segments options) or old clients failing to release caps correctly
(you would see appropriate warnings about this).

-- Dan


On Tue, Apr 24, 2018 at 9:34 AM, Linh Vu  wrote:
> Hi all,
>
>
> I have a cluster running cephfs on Luminous 12.2.4, using 2 active MDSes + 1
> standby. I have 3 shares: /projects, /home and /scratch, and I've decided to
> try manual pinning as described here:
> http://docs.ceph.com/docs/master/cephfs/multimds/
>
>
> /projects is pinned to mds.0 (rank 0)
>
> /home and /scratch are pinned to mds.1 (rank 1)
>
> Pinning is verified by `ceph daemon mds.$mds_hostname get subtrees | jq '.[]
> | [.dir.path, .auth_first, .export_pin]'`
>
>
> Clients mount either via ceph-fuse 12.2.4, or kernel client 4.15.13.
>
>
> On our test cluster (same version and setup), it works as I think it should.
> I simulate metadata load via mdtest (up to around 2000 req/s on each mds,
> which is a VM with 4 cores, 16GB RAM), and loads on /projects go to mds.0,
> loads on the other shares go to mds.1. Nothing pops up in the logs. I can
> also successfully reset to no pinning (i.e. using the default load balancing)
> via setting the ceph.dir.pin value to -1, and vice versa. All that happens
> is this shows up in the logs:
>
>   mds.mds1-test-ceph2 asok_command: get subtrees (starting...)
>
>   mds.mds1-test-ceph2 asok_command: get subtrees (complete)
>
> However, on our production cluster, with more powerful MDSes (10 cores
> 3.4GHz, 256GB RAM, much faster networking), I get this in the logs
> constantly:
>
> 2018-04-24 16:29:21.998261 7f02d1af9700  0 mds.1.migrator nicely exporting
> to mds.0 [dir 0x110cd91.1110* /home/ [2,head] auth{0=1017} v=5632699
> cv=5632651/5632651 dir_auth=1 state=1611923458|complete|auxsubtree f(v84
> 55=0+55) n(v245771 rc2018-04-24 16:28:32.830971 b233439385711
> 423085=383063+40022) hs=55+0,ss=0+0 dirty=1 | child=1 frozen=0 subtree=1
> replicated=1 dirty=1 authpin=0 0x55691ccf1c00]
>
> To clarify, /home is pinned to mds.1, so there is no reason it should export
> this to mds.0, and the loads on both MDSes (req/s, network load, CPU load)
> are fairly low, lower than those on the test MDS VMs.
>
> Sometimes (depending on which mds starts first), I would get the same
> message but the other way around i.e "mds.0.migrator nicely exporting to
> mds.1" the workload that mds.0 should be doing. This only appears on one
> mds, never the other, until one is restarted.
>
> And we've had a couple of occasions where we get this sort of slow requests:
>
> 7fd401126700  0 log_channel(cluster) log [WRN] : slow request 7681.127406
> seconds old, received at 2018-04-20 08:17:35.970498:
> client_request(client.875554:238655 lookup #0x10038ff1eab/punim0116
> 2018-04-20 08:17:35.970319 caller_uid=10171, caller_gid=1{1,10123,})
> currently failed to authpin local pins
>
> Which then seems to snowball into thousands of slow requests, until mds.0 is
> restarted. When these slow requests happen, loads are fairly low on the
> active MDSes, although it is possible that the users could be doing
> something funky with metadata on production that I can't reproduce with
> mdtest.
>
> I thought the manual pinning likely isn't working as intended due to the
> "mds.1.migrator nicely exporting to mds.0" messages in the logs (to me it
> seems to indicate that we have a bad load balancing situation) but I can't
> seem to replicate this issue in test. Test cluster seems to be working as
> intended.
>
> Am I doing manual pinning right? Should I even be using it?
>
> Cheers,
> Linh
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] cephfs luminous 12.2.4 - multi-active MDSes with manual pinning

2018-04-24 Thread Linh Vu
Hi all,


I have a cluster running cephfs on Luminous 12.2.4, using 2 active MDSes + 1 
standby. I have 3 shares: /projects, /home and /scratch, and I've decided to 
try manual pinning as described here: 
http://docs.ceph.com/docs/master/cephfs/multimds/


/projects is pinned to mds.0 (rank 0)

/home and /scratch are pinned to mds.1 (rank 1)

Pinning is verified by `ceph daemon mds.$mds_hostname get subtrees | jq '.[] | 
[.dir.path, .auth_first, .export_pin]'`
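
(The pins themselves were set with the ceph.dir.pin xattr as described in
that doc, roughly like below; the /mnt/cephfs mount point is just for
illustration:)

  # pin /projects to rank 0, /home and /scratch to rank 1
  setfattr -n ceph.dir.pin -v 0 /mnt/cephfs/projects
  setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/home
  setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/scratch
  # -v -1 goes back to the default balancer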


Clients mount either via ceph-fuse 12.2.4, or kernel client 4.15.13.


On our test cluster (same version and setup), it works as I think it should. I 
simulate metadata load via mdtest (up to around 2000 req/s on each mds, each of 
which is a VM with 4 cores, 16GB RAM), and loads on /projects go to mds.0, loads 
on the other shares go to mds.1. Nothing pops up in the logs. I can also 
successfully reset to no pinning (i.e. using the default load balancing) by 
setting the ceph.dir.pin value to -1, and vice versa. All that happens is this 
shows up in the logs:

  mds.mds1-test-ceph2 asok_command: get subtrees (starting...)

  mds.mds1-test-ceph2 asok_command: get subtrees (complete)

However, on our production cluster, with more powerful MDSes (10 cores 3.4GHz, 
256GB RAM, much faster networking), I get this in the logs constantly:

2018-04-24 16:29:21.998261 7f02d1af9700  0 mds.1.migrator nicely exporting to 
mds.0 [dir 0x110cd91.1110* /home/ [2,head] auth{0=1017} v=5632699 
cv=5632651/5632651 dir_auth=1 state=1611923458|complete|auxsubtree f(v84 
55=0+55) n(v245771 rc2018-04-24 16:28:32.830971 b233439385711 
423085=383063+40022) hs=55+0,ss=0+0 dirty=1 | child=1 frozen=0 subtree=1 
replicated=1 dirty=1 authpin=0 0x55691ccf1c00]

To clarify, /home is pinned to mds.1, so there is no reason it should export 
this to mds.0, and the loads on both MDSes (req/s, network load, CPU load) are 
fairly low, lower than those on the test MDS VMs.

Sometimes (depending on which mds starts first), I would get the same message 
but the other way around, i.e. "mds.0.migrator nicely exporting to mds.1" for 
the workload that mds.0 should be doing. This only appears on one mds, never 
on the other, until one is restarted.

And we've had a couple of occasions where we get this sort of slow requests:

7fd401126700  0 log_channel(cluster) log [WRN] : slow request 7681.127406 
seconds old, received at 2018-04-20 08:17:35.970498: 
client_request(client.875554:238655 lookup #0x10038ff1eab/punim0116 2018-04-20 
08:17:35.970319 caller_uid=10171, caller_gid=1{1,10123,}) currently 
failed to authpin local pins

Which then seems to snowball into thousands of slow requests, until mds.0 is 
restarted. When these slow requests happen, loads are fairly low on the active 
MDSes, although it is possible that the users could be doing something funky 
with metadata on production that I can't reproduce with mdtest.
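
Next time it happens I can also dump the in-flight ops on the MDS to see what
they are actually blocked on, something like:

  # show current and recent requests together with the point they are stuck at
  ceph daemon mds.$mds_hostname dump_ops_in_flight
  ceph daemon mds.$mds_hostname dump_historic_ops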

I thought the manual pinning likely isn't working as intended because of the 
"mds.1.migrator nicely exporting to mds.0" messages in the logs (to me they seem 
to indicate a bad load balancing situation), but I can't seem to replicate this 
issue in test; the test cluster appears to be working as intended.

Am I doing manual pinning right? Should I even be using it?

Cheers,
Linh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com