Re: [ceph-users] MDS: obscene buffer_anon memory use when scanning lots of files

2020-01-21 Thread Dan van der Ster
On Wed, Jan 22, 2020 at 12:24 AM Patrick Donnelly 
wrote:

> On Tue, Jan 21, 2020 at 8:32 AM John Madden  wrote:
> >
> > On 14.2.5 but also present in Luminous, buffer_anon memory use spirals
> > out of control when scanning many thousands of files. The use case is
> > more or less "look up this file and if it exists append this chunk to
> > it, otherwise create it with this chunk." The memory is recovered as
> > soon as the workload stops, and at most only 20-100 files are ever
> > open at one time.
> >
> > Cache gets oversized, but that's more or less expected; it's pretty
> > much always/immediately in some warn state, which makes me wonder whether
> > a much larger cache might help buffer_anon use. Looking for advice
> > there. This is on a deeply-hashed directory, but overall very little
> > data (<20GB), lots of tiny files.
> >
> > As I typed this post the pool went from ~60GB to ~110GB. I've resorted
> > to a cronjob that restarts the active MDS when it reaches swap just to
> > keep the cluster alive.
>
> This looks like it will be fixed by
>
> https://tracker.ceph.com/issues/42943
>
> That will be available in v14.2.7.
>

Couldn't John confirm that this is the issue by checking the heap stats and
triggering the release via

  ceph tell mds.mds1 heap stats
  ceph tell mds.mds1 heap release

(this would be much less disruptive than restarting the MDS)
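
For example, something along these lines (just a sketch; mds1 stands in for
the actual active MDS name) would show whether the memory is sitting in
unreleased tcmalloc heap rather than in the buffer_anon mempool itself:

  # tcmalloc heap usage before the release
  ceph tell mds.mds1 heap stats

  # per-mempool accounting, including buffer_anon (run on the MDS host)
  ceph daemon mds.mds1 dump_mempools

  # return freed memory to the OS, then compare heap stats again
  ceph tell mds.mds1 heap release
  ceph tell mds.mds1 heap stats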

-- Dan



>
> --
> Patrick Donnelly, Ph.D.
> He / Him / His
> Senior Software Engineer
> Red Hat Sunnyvale, CA
> GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
>


Re: [ceph-users] OSD's hang after network blip

2020-01-16 Thread Dan van der Ster
We upgraded to 14.2.4 back in October and this week to v14.2.6.
But I don't think the cluster had a network outage until yesterday, so I
wouldn't have thought this is a .6 regression.

If it happens again I'll look for the 'waiting for map' message.
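
For the record, next time this happens I'd poke the stuck daemon over its
admin socket before restarting it; a rough sketch (osd.43 is just the example
from Nick's log):

  # what the daemon itself thinks: state, oldest_map, newest_map
  ceph daemon osd.43 status

  # compare with the cluster's current osdmap epoch
  ceph osd dump | head -1

  # raise verbosity to see whether it's stuck catching up on maps
  ceph daemon osd.43 config set debug_osd 10
  ceph daemon osd.43 config set debug_ms 1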

-- dan


On Thu, Jan 16, 2020 at 12:08 PM Nick Fisk  wrote:

> On Thursday, January 16, 2020 09:15 GMT, Dan van der Ster <
> d...@vanderster.com> wrote:
>
> > Hi Nick,
> >
> > We saw the exact same problem yesterday after a network outage -- a few
> of
> > our down OSDs were stuck down until we restarted their processes.
> >
> > -- Dan
> >
> >
> > On Wed, Jan 15, 2020 at 3:37 PM Nick Fisk  wrote:
> >
> > > Hi All,
> > >
> > > Running 14.2.5, currently experiencing some network blips isolated to a
> > > single rack which is under investigation. However, it appears that following
> > > a network blip, random OSDs in unaffected racks are sometimes not recovering
> > > from the incident and are left running in a zombie state. The OSDs appear
> > > to be running from a process perspective, but the cluster thinks they are
> > > down and will not rejoin the cluster until the OSD process is restarted,
> > > which incidentally takes a lot longer than usual (the systemctl command
> > > takes a couple of minutes to complete).
> > >
> > > If the OSD is left in this state, CPU and memory usage of the process
> > > appears to climb, but it never rejoins, at least for the several hours that
> > > I have left them. Not exactly sure what the OSD is trying to do during this
> > > period. There's nothing in the logs during this hung state to indicate that
> > > anything is happening, but I will try and inject more verbose logging next
> > > time it occurs.
> > >
> > > Not sure if anybody has come across this before or has any ideas? In the
> > > past, as long as OSDs have been running they have always rejoined following
> > > any network issues.
> > >
> > > Nick
> > >

Re: [ceph-users] OSD's hang after network blip

2020-01-16 Thread Dan van der Ster
Hi Nick,

We saw the exact same problem yesterday after a network outage -- a few of
our down OSDs were stuck down until we restarted their processes.

-- Dan


On Wed, Jan 15, 2020 at 3:37 PM Nick Fisk  wrote:

> Hi All,
>
> Running 14.2.5, currently experiencing some network blips isolated to a
> single rack which is under investigation. However, it appears that following a
> network blip, random OSDs in unaffected racks are sometimes not recovering
> from the incident and are left running in a zombie state. The OSDs
> appear to be running from a process perspective, but the cluster thinks
> they are down and will not rejoin the cluster until the OSD process is
> restarted, which incidentally takes a lot longer than usual (the systemctl
> command takes a couple of minutes to complete).
>
> If the OSD is left in this state, CPU and memory usage of the process
> appears to climb, but it never rejoins, at least for the several hours that I
> have left them. Not exactly sure what the OSD is trying to do during this
> period. There's nothing in the logs during this hung state to indicate that
> anything is happening, but I will try and inject more verbose logging next
> time it occurs.
>
> Not sure if anybody has come across this before or has any ideas? In the past,
> as long as OSDs have been running they have always rejoined following any
> network issues.
>
> Nick
>
> Sample from OSD and cluster logs below. Blip happened at 12:06, I
> restarted OSD at 12:26
>
> OSD Logs from OSD that hung (Note this OSD was not directly affected by
> network outage)
> 2020-01-15 12:06:32.234 7f41a1023700 -1 osd.43 2342991 heartbeat_check: no
> reply from [*:*:*:5::14]:6838 osd.71 ever on either front or back, first
> ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> 2020-01-15 12:06:33.194 7f41a1023700 -1 osd.43 2342991 heartbeat_check: no
> reply from [*:*:*:5::13]:6854 osd.49 ever on either front or back, first
> ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> 2020-01-15 12:06:33.194 7f41a1023700 -1 osd.43 2342991 heartbeat_check: no
> reply from [*:*:*:5::13]:6834 osd.51 ever on either front or back, first
> ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> 2020-01-15 12:06:33.194 7f41a1023700 -1 osd.43 2342991 heartbeat_check: no
> reply from [*:*:*:5::13]:6862 osd.52 ever on either front or back, first
> ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> 2020-01-15 12:06:33.194 7f41a1023700 -1 osd.43 2342991 heartbeat_check: no
> reply from [*:*:*:5::13]:6875 osd.53 ever on either front or back, first
> ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> 2020-01-15 12:06:33.194 7f41a1023700 -1 osd.43 2342991 heartbeat_check: no
> reply from [*:*:*:5::13]:6894 osd.54 ever on either front or back, first
> ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> 2020-01-15 12:06:33.194 7f41a1023700 -1 osd.43 2342991 heartbeat_check: no
> reply from [*:*:*:5::14]:6838 osd.71 ever on either front or back, first
> ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> 2020-01-15 12:06:34.034 7f419480a700  0 log_channel(cluster) log [WRN] :
> Monitor daemon marked osd.43 down, but it is still running
> 2020-01-15 12:06:34.034 7f419480a700  0 log_channel(cluster) log [DBG] :
> map e2342992 wrongly marked me down at e2342992
> 2020-01-15 12:06:34.034 7f419480a700  1 osd.43 2342992
> start_waiting_for_healthy
> 2020-01-15 12:06:34.198 7f41a1023700 -1 osd.43 2342992 heartbeat_check: no
> reply from [*:*:*:5::13]:6854 osd.49 ever on either front or back, first
> ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> 2020-01-15 12:06:34.198 7f41a1023700 -1 osd.43 2342992 heartbeat_check: no
> reply from [*:*:*:5::13]:6834 osd.51 ever on either front or back, first
> ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> 2020-01-15 12:06:34.198 7f41a1023700 -1 osd.43 2342992 heartbeat_check: no
> reply from [*:*:*:5::13]:6862 osd.52 ever on either front or back, first
> ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> 2020-01-15 12:06:34.198 7f41a1023700 -1 osd.43 2342992 heartbeat_check: no
> reply from [*:*:*:5::13]:6875 osd.53 ever on either front or back, first
> ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> 2020-01-15 12:06:34.198 7f41a1023700 -1 osd.43 2342992 heartbeat_check: no
> reply from [*:*:*:5::13]:6894 osd.54 ever on either front or back, first
> ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> 2020-01-15 12:06:34.198 7f41a1023700 -1 osd.43 2342992 heartbeat_check: no
> reply from [*:*:*:5::14]:6838 osd.71 ever on either front or back, first
> ping sent 2020-01-15 12:06:11.411216 (oldest deadline 

Re: [ceph-users] Acting sets sometimes may violate crush rule ?

2020-01-13 Thread Dan van der Ster
Hi,

One way this can happen is if you change the crush rule of a pool after the
balancer has been running for a while.
This is because the balancer upmaps are only validated against the crush rule
when they are initially created.

ceph osd dump | grep upmap

Does that explain your issue?
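
If that's the case, the stale entry can be removed by hand, roughly like this
(pg 1.7 is taken from your output below, so treat it as a sketch):

  # list the pg_upmap_items entries currently in the osdmap
  ceph osd dump | grep upmap

  # drop the stale exception; the pg then goes back to what crush computes
  ceph osd rm-pg-upmap-items 1.7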

.. Dan


On Tue, 14 Jan 2020, 04:17 Yi-Cian Pu,  wrote:

> Hi all,
>
> We sometimes can observe that acting set seems to violate crush rule. For
> example, we had an environment before:
>
> [root@Ann-per-R7-3 /]# ceph -s
>   cluster:
> id: 248ce880-f57b-4a4c-a53a-3fc2b3eb142a
> health: HEALTH_WARN
> 34/8019 objects misplaced (0.424%)
>
>   services:
> mon: 3 daemons, quorum Ann-per-R7-3,Ann-per-R7-7,Ann-per-R7-1
> mgr: Ann-per-R7-3(active), standbys: Ann-per-R7-7, Ann-per-R7-1
> mds: cephfs-1/1/1 up  {0=qceph-mds-Ann-per-R7-1=up:active}, 2 up:standby
> osd: 7 osds: 7 up, 7 in; 1 remapped pgs
>
>   data:
> pools:   7 pools, 128 pgs
> objects: 2.67 k objects, 10 GiB
> usage:   107 GiB used, 3.1 TiB / 3.2 TiB avail
> pgs: 34/8019 objects misplaced (0.424%)
>  127 active+clean
>  1   active+clean+remapped
>
> [root@Ann-per-R7-3 /]# ceph pg ls remapped
> PG   OBJECTS  DEGRADED  MISPLACED  UNFOUND  BYTES      LOG  STATE                  STATE_STAMP                 VERSION  REPORTED  UP       ACTING     SCRUB_STAMP                 DEEP_SCRUB_STAMP
> 1.7  34       0         34         0        134217728  42   active+clean+remapped  2019-11-05 10:39:58.639533  144'42   229:407   [6,1]p6  [6,1,2]p6  2019-11-04 10:36:19.519820  2019-11-04 10:36:19.519820
>
>
> [root@Ann-per-R7-3 /]# ceph osd tree
> ID CLASS WEIGHT  TYPE NAME STATUS REWEIGHT PRI-AFF
> -2 0 root perf_osd
> -1   3.10864 root default
> -7   0.44409 host Ann-per-R7-1
>  5   hdd 0.44409 osd.5 up  1.0 1.0
> -3   1.33228 host Ann-per-R7-3
>  0   hdd 0.44409 osd.0 up  1.0 1.0
>  1   hdd 0.44409 osd.1 up  1.0 1.0
>  2   hdd 0.44409 osd.2 up  1.0 1.0
> -9   1.33228 host Ann-per-R7-7
>  6   hdd 0.44409 osd.6 up  1.0 1.0
>  7   hdd 0.44409 osd.7 up  1.0 1.0
>  8   hdd 0.44409 osd.8 up  1.0 1.0
>
>
> [root@Ann-per-R7-3 /]# ceph osd df
> ID CLASS WEIGHT  REWEIGHT SIZE    USE    AVAIL   %USE VAR  PGS
>  5   hdd 0.44409  1.0 465 GiB  21 GiB 444 GiB 4.49 1.36 127
>  0   hdd 0.44409  1.0 465 GiB  15 GiB 450 GiB 3.16 0.96  44
>  1   hdd 0.44409  1.0 465 GiB  15 GiB 450 GiB 3.14 0.95  52
>  2   hdd 0.44409  1.0 465 GiB  14 GiB 451 GiB 2.98 0.91  33
>  6   hdd 0.44409  1.0 465 GiB  14 GiB 451 GiB 2.97 0.90  43
>  7   hdd 0.44409  1.0 465 GiB  15 GiB 450 GiB 3.19 0.97  41
>  8   hdd 0.44409  1.0 465 GiB  14 GiB 450 GiB 3.09 0.94  44
> TOTAL 3.2 TiB 107 GiB 3.1 TiB 3.29
> MIN/MAX VAR: 0.90/1.36  STDDEV: 0.49
>
>
> Based on our crush map, the crush rule should select 1 OSD from each host.
> However, from the above log, we can see that an acting set is [6,1,2], and osd.1
> and osd.2 are in the same host, which seems to violate the crush rule. So my
> question is: how does this happen...? Any enlightenment is much appreciated.
>
> Best
> Cian


Re: [ceph-users] v13.2.7 osds crash in build_incremental_map_msg

2019-12-04 Thread Dan van der Ster
My advice is to wait.

We built 13.2.7 with https://github.com/ceph/ceph/pull/26448 cherry-picked
and the OSDs no longer crash.

My vote would be for a quick 13.2.8.

-- Dan

On Wed, Dec 4, 2019 at 2:41 PM Frank Schilder  wrote:
>
> Is this issue now a no-go for updating to 13.2.7 or are there only some 
> specific unsafe scenarios?
>
> Best regards,
>
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ____
> From: ceph-users  on behalf of Dan van der 
> Ster 
> Sent: 03 December 2019 16:42:45
> To: ceph-users
> Subject: Re: [ceph-users] v13.2.7 osds crash in build_incremental_map_msg
>
> I created https://tracker.ceph.com/issues/43106 and we're downgrading
> our osds back to 13.2.6.
>
> -- dan
>
> On Tue, Dec 3, 2019 at 4:09 PM Dan van der Ster  wrote:
> >
> > Hi all,
> >
> > We're midway through an update from 13.2.6 to 13.2.7 and started
> > getting OSDs crashing regularly like this [1].
> > Does anyone obviously know what the issue is? (Maybe
> > https://github.com/ceph/ceph/pull/26448/files ?)
> > Or is it some temporary problem while we still have v13.2.6 and
> > v13.2.7 osds running concurrently?
> >
> > Thanks!
> >
> > Dan
> >
> > [1]
> >
> > 2019-12-03 15:53:51.817 7ff3a3d39700 -1 osd.1384 2758889
> > build_incremental_map_msg missing incremental map 2758889
> > 2019-12-03 15:53:51.817 7ff3a453a700 -1 osd.1384 2758889
> > build_incremental_map_msg missing incremental map 2758889
> > 2019-12-03 15:53:51.817 7ff3a453a700 -1 osd.1384 2758889
> > build_incremental_map_msg unable to load latest map 2758889
> > 2019-12-03 15:53:51.822 7ff3a453a700 -1 *** Caught signal (Aborted) **
> >  in thread 7ff3a453a700 thread_name:tp_osd_tp
> >
> >  ceph version 13.2.7 (71bd687b6e8b9424dd5e5974ed542595d8977416) mimic 
> > (stable)
> >  1: (()+0xf5f0) [0x7ff3c620b5f0]
> >  2: (gsignal()+0x37) [0x7ff3c522b337]
> >  3: (abort()+0x148) [0x7ff3c522ca28]
> >  4: (OSDService::build_incremental_map_msg(unsigned int, unsigned int,
> > OSDSuperblock&)+0x767) [0x555d60e8d797]
> >  5: (OSDService::send_incremental_map(unsigned int, Connection*,
> > std::shared_ptr&)+0x39e) [0x555d60e8dbee]
> >  6: (OSDService::share_map_peer(int, Connection*,
> > std::shared_ptr)+0x159) [0x555d60e8eda9]
> >  7: (OSDService::send_message_osd_cluster(int, Message*, unsigned
> > int)+0x1a5) [0x555d60e8f085]
> >  8: (ReplicatedBackend::issue_op(hobject_t const&, eversion_t const&,
> > unsigned long, osd_reqid_t, eversion_t, eversion_t, hobject_t,
> > hobject_t, std::vector
> > > const&, boost::optional&,
> > ReplicatedBackend::InProgressOp*, ObjectStore::Transaction&)+0x452)
> > [0x555d6116e522]
> >  9: (ReplicatedBackend::submit_transaction(hobject_t const&,
> > object_stat_sum_t const&, eversion_t const&,
> > std::unique_ptr >&&,
> > eversion_t const&, eversion_t const&, std::vector > std::allocator > const&,
> > boost::optional&, Context*, unsigned long,
> > osd_reqid_t, boost::intrusive_ptr)+0x6f5) [0x555d6117ed85]
> >  10: (PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*,
> > PrimaryLogPG::OpContext*)+0xd62) [0x555d60ff5142]
> >  11: (PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0xf12)
> > [0x555d61035902]
> >  12: (PrimaryLogPG::do_op(boost::intrusive_ptr&)+0x3679)
> > [0x555d610397a9]
> >  13: (PrimaryLogPG::do_request(boost::intrusive_ptr&,
> > ThreadPool::TPHandle&)+0xc99) [0x555d6103d869]
> >  14: (OSD::dequeue_op(boost::intrusive_ptr,
> > boost::intrusive_ptr, ThreadPool::TPHandle&)+0x1b7)
> > [0x555d60e8e8a7]
> >  15: (PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr&,
> > ThreadPool::TPHandle&)+0x62) [0x555d611144c2]
> >  16: (OSD::ShardedOpWQ::_process(unsigned int,
> > ceph::heartbeat_handle_d*)+0x592) [0x555d60eb25f2]
> >  17: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x3d3)
> > [0x7ff3c929f5b3]
> >  18: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7ff3c92a01a0]
> >  19: (()+0x7e65) [0x7ff3c6203e65]
> >  20: (clone()+0x6d) [0x7ff3c52f388d]
> >  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
> > needed to interpret this.


Re: [ceph-users] v13.2.7 osds crash in build_incremental_map_msg

2019-12-03 Thread Dan van der Ster
I created https://tracker.ceph.com/issues/43106 and we're downgrading
our osds back to 13.2.6.

-- dan

On Tue, Dec 3, 2019 at 4:09 PM Dan van der Ster  wrote:
>
> Hi all,
>
> We're midway through an update from 13.2.6 to 13.2.7 and started
> getting OSDs crashing regularly like this [1].
> Does anyone obviously know what the issue is? (Maybe
> https://github.com/ceph/ceph/pull/26448/files ?)
> Or is it some temporary problem while we still have v13.2.6 and
> v13.2.7 osds running concurrently?
>
> Thanks!
>
> Dan
>
> [1]
>
> 2019-12-03 15:53:51.817 7ff3a3d39700 -1 osd.1384 2758889
> build_incremental_map_msg missing incremental map 2758889
> 2019-12-03 15:53:51.817 7ff3a453a700 -1 osd.1384 2758889
> build_incremental_map_msg missing incremental map 2758889
> 2019-12-03 15:53:51.817 7ff3a453a700 -1 osd.1384 2758889
> build_incremental_map_msg unable to load latest map 2758889
> 2019-12-03 15:53:51.822 7ff3a453a700 -1 *** Caught signal (Aborted) **
>  in thread 7ff3a453a700 thread_name:tp_osd_tp
>
>  ceph version 13.2.7 (71bd687b6e8b9424dd5e5974ed542595d8977416) mimic (stable)
>  1: (()+0xf5f0) [0x7ff3c620b5f0]
>  2: (gsignal()+0x37) [0x7ff3c522b337]
>  3: (abort()+0x148) [0x7ff3c522ca28]
>  4: (OSDService::build_incremental_map_msg(unsigned int, unsigned int,
> OSDSuperblock&)+0x767) [0x555d60e8d797]
>  5: (OSDService::send_incremental_map(unsigned int, Connection*,
> std::shared_ptr&)+0x39e) [0x555d60e8dbee]
>  6: (OSDService::share_map_peer(int, Connection*,
> std::shared_ptr)+0x159) [0x555d60e8eda9]
>  7: (OSDService::send_message_osd_cluster(int, Message*, unsigned
> int)+0x1a5) [0x555d60e8f085]
>  8: (ReplicatedBackend::issue_op(hobject_t const&, eversion_t const&,
> unsigned long, osd_reqid_t, eversion_t, eversion_t, hobject_t,
> hobject_t, std::vector
> > const&, boost::optional&,
> ReplicatedBackend::InProgressOp*, ObjectStore::Transaction&)+0x452)
> [0x555d6116e522]
>  9: (ReplicatedBackend::submit_transaction(hobject_t const&,
> object_stat_sum_t const&, eversion_t const&,
> std::unique_ptr >&&,
> eversion_t const&, eversion_t const&, std::vector std::allocator > const&,
> boost::optional&, Context*, unsigned long,
> osd_reqid_t, boost::intrusive_ptr)+0x6f5) [0x555d6117ed85]
>  10: (PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*,
> PrimaryLogPG::OpContext*)+0xd62) [0x555d60ff5142]
>  11: (PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0xf12)
> [0x555d61035902]
>  12: (PrimaryLogPG::do_op(boost::intrusive_ptr&)+0x3679)
> [0x555d610397a9]
>  13: (PrimaryLogPG::do_request(boost::intrusive_ptr&,
> ThreadPool::TPHandle&)+0xc99) [0x555d6103d869]
>  14: (OSD::dequeue_op(boost::intrusive_ptr,
> boost::intrusive_ptr, ThreadPool::TPHandle&)+0x1b7)
> [0x555d60e8e8a7]
>  15: (PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr&,
> ThreadPool::TPHandle&)+0x62) [0x555d611144c2]
>  16: (OSD::ShardedOpWQ::_process(unsigned int,
> ceph::heartbeat_handle_d*)+0x592) [0x555d60eb25f2]
>  17: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x3d3)
> [0x7ff3c929f5b3]
>  18: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7ff3c92a01a0]
>  19: (()+0x7e65) [0x7ff3c6203e65]
>  20: (clone()+0x6d) [0x7ff3c52f388d]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
> needed to interpret this.


[ceph-users] v13.2.7 osds crash in build_incremental_map_msg

2019-12-03 Thread Dan van der Ster
Hi all,

We're midway through an update from 13.2.6 to 13.2.7 and started
getting OSDs crashing regularly like this [1].
Does anyone obviously know what the issue is? (Maybe
https://github.com/ceph/ceph/pull/26448/files ?)
Or is it some temporary problem while we still have v13.2.6 and
v13.2.7 osds running concurrently?
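
For reference, a quick way to see how many daemons are still on each version
while the cluster is in this mixed state:

  ceph versions
  ceph osd versions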

Thanks!

Dan

[1]

2019-12-03 15:53:51.817 7ff3a3d39700 -1 osd.1384 2758889
build_incremental_map_msg missing incremental map 2758889
2019-12-03 15:53:51.817 7ff3a453a700 -1 osd.1384 2758889
build_incremental_map_msg missing incremental map 2758889
2019-12-03 15:53:51.817 7ff3a453a700 -1 osd.1384 2758889
build_incremental_map_msg unable to load latest map 2758889
2019-12-03 15:53:51.822 7ff3a453a700 -1 *** Caught signal (Aborted) **
 in thread 7ff3a453a700 thread_name:tp_osd_tp

 ceph version 13.2.7 (71bd687b6e8b9424dd5e5974ed542595d8977416) mimic (stable)
 1: (()+0xf5f0) [0x7ff3c620b5f0]
 2: (gsignal()+0x37) [0x7ff3c522b337]
 3: (abort()+0x148) [0x7ff3c522ca28]
 4: (OSDService::build_incremental_map_msg(unsigned int, unsigned int,
OSDSuperblock&)+0x767) [0x555d60e8d797]
 5: (OSDService::send_incremental_map(unsigned int, Connection*,
std::shared_ptr&)+0x39e) [0x555d60e8dbee]
 6: (OSDService::share_map_peer(int, Connection*,
std::shared_ptr)+0x159) [0x555d60e8eda9]
 7: (OSDService::send_message_osd_cluster(int, Message*, unsigned
int)+0x1a5) [0x555d60e8f085]
 8: (ReplicatedBackend::issue_op(hobject_t const&, eversion_t const&,
unsigned long, osd_reqid_t, eversion_t, eversion_t, hobject_t,
hobject_t, std::vector
> const&, boost::optional&,
ReplicatedBackend::InProgressOp*, ObjectStore::Transaction&)+0x452)
[0x555d6116e522]
 9: (ReplicatedBackend::submit_transaction(hobject_t const&,
object_stat_sum_t const&, eversion_t const&,
std::unique_ptr >&&,
eversion_t const&, eversion_t const&, std::vector > const&,
boost::optional&, Context*, unsigned long,
osd_reqid_t, boost::intrusive_ptr)+0x6f5) [0x555d6117ed85]
 10: (PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*,
PrimaryLogPG::OpContext*)+0xd62) [0x555d60ff5142]
 11: (PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0xf12)
[0x555d61035902]
 12: (PrimaryLogPG::do_op(boost::intrusive_ptr&)+0x3679)
[0x555d610397a9]
 13: (PrimaryLogPG::do_request(boost::intrusive_ptr&,
ThreadPool::TPHandle&)+0xc99) [0x555d6103d869]
 14: (OSD::dequeue_op(boost::intrusive_ptr,
boost::intrusive_ptr, ThreadPool::TPHandle&)+0x1b7)
[0x555d60e8e8a7]
 15: (PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr&,
ThreadPool::TPHandle&)+0x62) [0x555d611144c2]
 16: (OSD::ShardedOpWQ::_process(unsigned int,
ceph::heartbeat_handle_d*)+0x592) [0x555d60eb25f2]
 17: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x3d3)
[0x7ff3c929f5b3]
 18: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7ff3c92a01a0]
 19: (()+0x7e65) [0x7ff3c6203e65]
 20: (clone()+0x6d) [0x7ff3c52f388d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.


Re: [ceph-users] ceph mdss keep on crashing after update to 14.2.3

2019-09-19 Thread Dan van der Ster
You were running v14.2.2 before?

It seems that the ceph_assert you're hitting was indeed added
between v14.2.2 and v14.2.3 in this commit:
https://github.com/ceph/ceph/commit/12f8b813b0118b13e0cdac15b19ba8a7e127730b

There's a comment in the tracker for that commit which says the
original fix was incomplete
(https://tracker.ceph.com/issues/39987#note-5)

So perhaps nautilus needs
https://github.com/ceph/ceph/pull/28459/commits/0a1e92abf1cfc8bddf526cbf5bceea7b854dcfe8
??

Did you already try going back to v14.2.2 (on the MDS's only) ??
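
If you do go that route, a rough sketch of the order I'd use (assuming the
filesystem is named cephfs; the package downgrade itself is distro-specific):

  # with two active MDSs, reduce to a single rank first
  ceph fs set cephfs max_mds 1
  ceph status    # wait until only rank 0 remains active

  # then downgrade and restart the standbys, and finally fail the
  # remaining active over to a downgraded standby
  ceph mds fail 0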

-- dan

On Thu, Sep 19, 2019 at 4:59 PM Kenneth Waegeman
 wrote:
>
> Hi all,
>
> I updated our ceph cluster to 14.2.3 yesterday, and today the mds are 
> crashing one after another. I'm using two active mds.
>
> I've made a tracker ticket, but I was wondering if someone else also has seen 
> this issue yet?
>
>-27> 2019-09-19 15:42:00.196 7f036c2f0700  4 mds.1.server 
> handle_client_request client_request(client.37865333:8887 lookup 
> #0x100166004d4/WindowsPhone-MSVC-CXX.cmake 2019-09-19 15:42:00.203132 
> caller_uid=0, caller_gid=0{0,}) v4
>-26> 2019-09-19 15:42:00.196 7f036c2f0700  4 mds.1.server 
> handle_client_request client_request(client.37865372:5815 lookup 
> #0x20005a6eb3a/selectable.cpython-37.pyc 2019-09-19 15:42:00.204970 
> caller_uid=0, caller_gid=0{0,}) v4
>-25> 2019-09-19 15:42:00.196 7f036c2f0700  4 mds.1.server 
> handle_client_request client_request(client.37865333: lookup 
> #0x100166004d4/WindowsPhone.cmake 2019-09-19 15:42:00.206381 caller_uid=0, 
> caller_gid=0{0,}) v4
>-24> 2019-09-19 15:42:00.206 7f036c2f0700  4 mds.1.server 
> handle_client_request client_request(client.37865333:8889 lookup 
> #0x100166004d4/WindowsStore-MSVC-C.cmake 2019-09-19 15:42:00.209703 
> caller_uid=0, caller_gid=0{0,}) v4
>-23> 2019-09-19 15:42:00.206 7f036c2f0700  4 mds.1.server 
> handle_client_request client_request(client.37865333:8890 lookup 
> #0x100166004d4/WindowsStore-MSVC-CXX.cmake 2019-09-19 15:42:00.213200 
> caller_uid=0, caller_gid=0{0,}) v4
>-22> 2019-09-19 15:42:00.216 7f036c2f0700  4 mds.1.server 
> handle_client_request client_request(client.37865333:8891 lookup 
> #0x100166004d4/WindowsStore.cmake 2019-09-19 15:42:00.216577 caller_uid=0, 
> caller_gid=0{0,}) v4
>-21> 2019-09-19 15:42:00.216 7f036c2f0700  4 mds.1.server 
> handle_client_request client_request(client.37865333:8892 lookup 
> #0x100166004d4/Xenix.cmake 2019-09-19 15:42:00.220230 caller_uid=0, 
> caller_gid=0{0,}) v4
>-20> 2019-09-19 15:42:00.216 7f0369aeb700  2 mds.1.cache Memory usage:  
> total 4603496, rss 4167920, heap 323836, baseline 323836, 501 / 1162471 
> inodes have caps, 506 caps, 0.00043528 caps per inode
>-19> 2019-09-19 15:42:00.216 7f03652e2700  5 mds.1.log _submit_thread 
> 30520209420029~9062 : EUpdate scatter_writebehind [metablob 0x1000bd8ac7b, 2 
> dirs]
>-18> 2019-09-19 15:42:00.216 7f03652e2700  5 mds.1.log _submit_thread 
> 30520209429111~10579 : EUpdate scatter_writebehind [metablob 0x1000bf26309, 9 
> dirs]
>-17> 2019-09-19 15:42:00.216 7f03652e2700  5 mds.1.log _submit_thread 
> 30520209439710~2305 : EUpdate scatter_writebehind [metablob 
> 0x1000bf2745b.001*, 2 dirs]
>-16> 2019-09-19 15:42:00.216 7f03652e2700  5 mds.1.log _submit_thread 
> 30520209442035~1845 : EUpdate scatter_writebehind [metablob 0x1000c233753, 2 
> dirs]
>-15> 2019-09-19 15:42:00.216 7f036c2f0700  4 mds.1.server 
> handle_client_request client_request(client.37865333:8893 lookup 
> #0x100166004d4/eCos.cmake 2019-09-19 15:42:00.223360 caller_uid=0, 
> caller_gid=0{0,}) v4
>-14> 2019-09-19 15:42:00.216 7f036c2f0700  4 mds.1.server 
> handle_client_request client_request(client.37865319:2381 lookup 
> #0x1001172f39d/microsoft-cp1251 2019-09-19 15:42:00.224940 caller_uid=0, 
> caller_gid=0{0,}) v4
>-13> 2019-09-19 15:42:00.226 7f036c2f0700  4 mds.1.server 
> handle_client_request client_request(client.37865333:8894 lookup 
> #0x100166004d4/gas.cmake 2019-09-19 15:42:00.226624 caller_uid=0, 
> caller_gid=0{0,}) v4
>-12> 2019-09-19 15:42:00.226 7f036c2f0700  4 mds.1.server 
> handle_client_request client_request(client.37865319:2382 readdir 
> #0x1001172f3d7 2019-09-19 15:42:00.228673 caller_uid=0, caller_gid=0{0,}) v4
>-11> 2019-09-19 15:42:00.226 7f036c2f0700  4 mds.1.server 
> handle_client_request client_request(client.37865333:8895 lookup 
> #0x100166004d4/kFreeBSD.cmake 2019-09-19 15:42:00.229668 caller_uid=0, 
> caller_gid=0{0,}) v4
>-10> 2019-09-19 15:42:00.226 7f036c2f0700  4 mds.1.server 
> handle_client_request client_request(client.37865333:8896 lookup 
> #0x100166004d4/syllable.cmake 2019-09-19 15:42:00.232746 caller_uid=0, 
> caller_gid=0{0,}) v4
> -9> 2019-09-19 15:42:00.236 7f036c2f0700  4 mds.1.server 
> handle_client_request client_request(client.37865333:8897 readdir 
> #0x10016601379 2019-09-19 15:42:00.240672 caller_uid=0, caller_gid=0{0,}) v4
> -8> 2019-09-19 15:42:00.236 7f036c2f0700  4 

Re: [ceph-users] cephfs full, 2/3 Raw capacity used

2019-08-26 Thread Dan van der Ster
Thanks. The version and balancer config look good.

So you can try `ceph osd reweight osd.10 0.8` to see if it helps to
get you out of this.
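
i.e. something like the following, repeating for the next most-full OSDs if
needed (the exact weights are just a starting point):

  # nudge data off the most-full osds
  ceph osd reweight osd.10 0.8
  ceph osd reweight osd.51 0.9
  ceph osd reweight osd.53 0.9

  # then watch the spread
  ceph osd df tree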

-- dan

On Mon, Aug 26, 2019 at 11:35 AM Simon Oosthoek
 wrote:
>
> On 26-08-19 11:16, Dan van der Ster wrote:
> > Hi,
> >
> > Which version of ceph are you using? Which balancer mode?
>
> Nautilus (14.2.2), balancer is in upmap mode.
>
> > The balancer score isn't a percent-error or anything humanly usable.
> > `ceph osd df tree` can better show you exactly which osds are
> > over/under utilized and by how much.
> >
>
> Aha, I ran this and sorted on the %full column:
>
>   81   hdd   10.81149  1.0  11 TiB 5.2 TiB 5.1 TiB   4 KiB  14 GiB
> 5.6 TiB 48.40 0.73  96 up osd.81
>   48   hdd   10.81149  1.0  11 TiB 5.3 TiB 5.2 TiB  15 KiB  14 GiB
> 5.5 TiB 49.08 0.74  95 up osd.48
> 154   hdd   10.81149  1.0  11 TiB 5.5 TiB 5.4 TiB 2.6 GiB  15 GiB
> 5.3 TiB 50.95 0.76  96 up osd.154
> 129   hdd   10.81149  1.0  11 TiB 5.5 TiB 5.4 TiB 5.1 GiB  16 GiB
> 5.3 TiB 51.33 0.77  96 up osd.129
>   42   hdd   10.81149  1.0  11 TiB 5.6 TiB 5.5 TiB 2.6 GiB  14 GiB
> 5.2 TiB 51.81 0.78  96 up osd.42
> 122   hdd   10.81149  1.0  11 TiB 5.7 TiB 5.6 TiB  16 KiB  14 GiB
> 5.1 TiB 52.47 0.79  96 up osd.122
> 120   hdd   10.81149  1.0  11 TiB 5.7 TiB 5.6 TiB 2.6 GiB  15 GiB
> 5.1 TiB 52.92 0.79  95 up osd.120
>   96   hdd   10.81149  1.0  11 TiB 5.8 TiB 5.7 TiB 2.6 GiB  15 GiB
> 5.0 TiB 53.58 0.80  96 up osd.96
>   26   hdd   10.81149  1.0  11 TiB 5.8 TiB 5.7 TiB  20 KiB  15 GiB
> 5.0 TiB 53.68 0.80  97 up osd.26
> ...
>6   hdd   10.81149  1.0  11 TiB 8.3 TiB 8.2 TiB  88 KiB  18 GiB
> 2.5 TiB 77.14 1.16  96 up osd.6
>   16   hdd   10.81149  1.0  11 TiB 8.4 TiB 8.3 TiB  28 KiB  18 GiB
> 2.4 TiB 77.56 1.16  95 up osd.16
>0   hdd   10.81149  1.0  11 TiB 8.6 TiB 8.4 TiB  48 KiB  17 GiB
> 2.2 TiB 79.24 1.19  96 up osd.0
> 144   hdd   10.81149  1.0  11 TiB 8.6 TiB 8.5 TiB 2.6 GiB  18 GiB
> 2.2 TiB 79.57 1.19  95 up osd.144
> 136   hdd   10.81149  1.0  11 TiB 8.6 TiB 8.5 TiB  48 KiB  17 GiB
> 2.2 TiB 79.60 1.19  95 up osd.136
>   63   hdd   10.81149  1.0  11 TiB 8.6 TiB 8.5 TiB 2.6 GiB  17 GiB
> 2.2 TiB 79.60 1.19  95 up osd.63
> 155   hdd   10.81149  1.0  11 TiB 8.6 TiB 8.5 TiB   8 KiB  19 GiB
> 2.2 TiB 79.85 1.20  95 up osd.155
>   89   hdd   10.81149  1.0  11 TiB 8.7 TiB 8.5 TiB  12 KiB  20 GiB
> 2.2 TiB 80.04 1.20  96 up osd.89
> 106   hdd   10.81149  1.0  11 TiB 8.8 TiB 8.7 TiB  64 KiB  19 GiB
> 2.0 TiB 81.38 1.22  96 up osd.106
>   94   hdd   10.81149  1.0  11 TiB 9.0 TiB 8.9 TiB 0 B  19 GiB
> 1.8 TiB 83.53 1.25  96 up osd.94
>   33   hdd   10.81149  1.0  11 TiB 9.1 TiB 9.0 TiB  44 KiB  19 GiB
> 1.7 TiB 84.40 1.27  96 up osd.33
>   15   hdd   10.81149  1.0  11 TiB  10 TiB 9.8 TiB  16 KiB  20 GiB
> 877 GiB 92.08 1.38  96 up osd.15
>   53   hdd   10.81149  1.0  11 TiB  10 TiB  10 TiB 2.6 GiB  20 GiB
> 676 GiB 93.90 1.41  96 up osd.53
>   51   hdd   10.81149  1.0  11 TiB  10 TiB  10 TiB 2.6 GiB  20 GiB
> 666 GiB 93.98 1.41  96 up osd.51
>   10   hdd   10.81149  1.0  11 TiB  10 TiB  10 TiB  40 KiB  22 GiB
> 552 GiB 95.01 1.42  97 up osd.10
>
> So the fullest one is at 95.01%, the emptiest one at 48.4%, so there's
> some balancing to be done.
>
> > You might be able to manually fix things by using `ceph osd reweight
> > ...` on the most full osds to move data elsewhere.
>
> I'll look into this, but I was hoping that the balancer module would
> take care of this...
>
> >
> > Otherwise, in general, it's good to set up monitoring so you notice and
> > take action well before the osds fill up.
>
> Yes, I'm still working on this, I want to add some checks to our
> check_mk+icinga setup using native plugins, but my python skills are not
> quite up to the task, at least, not yet ;-)
>
> Cheers
>
> /Simon
>
> >
> > Cheers, Dan
> >
> > On Mon, Aug 26, 2019 at 11:09 AM Simon Oosthoek
> >  wrote:
> >>
> >> Hi all,
> >>
> >> we're building up our experience with our ceph cluster before we take it
> >> into production. I've now tried to fill up the cluster with cephfs

Re: [ceph-users] cephfs full, 2/3 Raw capacity used

2019-08-26 Thread Dan van der Ster
Hi,

Which version of ceph are you using? Which balancer mode?
The balancer score isn't a percent-error or anything humanly usable.
`ceph osd df tree` can better show you exactly which osds are
over/under utilized and by how much.

You might be able to manually fix things by using `ceph osd reweight
...` on the most full osds to move data elsewhere.

Otherwise, in general, it's good to set up monitoring so you notice and
take action well before the osds fill up.
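
Even something very simple helps; e.g. a cron'd check along these lines (a
sketch -- it assumes jq is available, and the 80% threshold is arbitrary):

  # print any OSD that is more than 80% full
  ceph osd df -f json | jq -r '.nodes[] | select(.utilization > 80) | "\(.name) \(.utilization)"'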

Cheers, Dan

On Mon, Aug 26, 2019 at 11:09 AM Simon Oosthoek
 wrote:
>
> Hi all,
>
> we're building up our experience with our ceph cluster before we take it
> into production. I've now tried to fill up the cluster with cephfs,
> which we plan to use for about 95% of all data on the cluster.
>
> The cephfs pools are full when the cluster reports 67% raw capacity
> used. There are 4 pools we use for cephfs data, 3-copy, 4-copy, EC 8+3
> and EC 5+7. The balancer module is turned on and `ceph balancer eval`
> gives `current cluster score 0.013255 (lower is better)`, so well within
> the default 5% margin. Is there a setting we can tweak to increase the
> usable RAW capacity to say 85% or 90%, or is this the most we can expect
> to store on the cluster?
>
> [root@cephmon1 ~]# ceph df
> RAW STORAGE:
>     CLASS     SIZE        AVAIL       USED        RAW USED     %RAW USED
>     hdd       1.8 PiB     605 TiB     1.2 PiB     1.2 PiB      66.71
>     TOTAL     1.8 PiB     605 TiB     1.2 PiB     1.2 PiB      66.71
>
> POOLS:
>     POOL                   ID     STORED      OBJECTS     USED        %USED      MAX AVAIL
>     cephfs_data             1     111 MiB      79.26M     1.2 GiB     100.00           0 B
>     cephfs_metadata         2      52 GiB       4.91M      52 GiB     100.00           0 B
>     cephfs_data_4copy       3     106 TiB      46.36M     428 TiB     100.00           0 B
>     cephfs_data_3copy       8      93 TiB      42.08M     282 TiB     100.00           0 B
>     cephfs_data_ec83       13     106 TiB      50.11M     161 TiB     100.00           0 B
>     rbd                    14      21 GiB       5.62k      63 GiB     100.00           0 B
>     .rgw.root              15     1.2 KiB           4       1 MiB     100.00           0 B
>     default.rgw.control    16         0 B           8         0 B          0           0 B
>     default.rgw.meta       17       765 B           4       1 MiB     100.00           0 B
>     default.rgw.log        18         0 B         207         0 B          0           0 B
>     scbench                19     133 GiB      34.14k     400 GiB     100.00           0 B
>     cephfs_data_ec57       20     126 TiB      51.84M     320 TiB     100.00           0 B
> [root@cephmon1 ~]# ceph balancer eval
> current cluster score 0.013255 (lower is better)
>
>
> Being full at 2/3 raw used is a bit too "pretty" to be accidental; it
> seems like this could be a parameter for cephfs. However, I couldn't
> find anything like this in the documentation for Nautilus.
>
>
> The logs in the dashboard show this:
> 2019-08-26 11:00:00.000630
> [ERR]
> overall HEALTH_ERR 3 backfillfull osd(s); 1 full osd(s); 12 pool(s) full
>
> 2019-08-26 10:57:44.539964
> [INF]
> Health check cleared: POOL_BACKFILLFULL (was: 12 pool(s) backfillfull)
>
> 2019-08-26 10:57:44.539944
> [WRN]
> Health check failed: 12 pool(s) full (POOL_FULL)
>
> 2019-08-26 10:57:44.539926
> [ERR]
> Health check failed: 1 full osd(s) (OSD_FULL)
>
> 2019-08-26 10:57:44.539899
> [WRN]
> Health check update: 3 backfillfull osd(s) (OSD_BACKFILLFULL)
>
> 2019-08-26 10:00:00.88
> [WRN]
> overall HEALTH_WARN 4 backfillfull osd(s); 12 pool(s) backfillfull
>
> So it seems that ceph is completely stuck at 2/3 full, while we
> anticipated being able to fill up the cluster to at least 85-90% of the
> raw capacity. Or at least so that we would keep a functioning cluster
> when we have a single osd node fail.
>
> Cheers
>
> /Simon


Re: [ceph-users] loaded dup inode (but no mds crash)

2019-07-29 Thread Dan van der Ster
On Mon, Jul 29, 2019 at 3:47 PM Yan, Zheng  wrote:
>
> On Mon, Jul 29, 2019 at 9:13 PM Dan van der Ster  wrote:
> >
> > On Mon, Jul 29, 2019 at 2:52 PM Yan, Zheng  wrote:
> > >
> > > On Fri, Jul 26, 2019 at 4:45 PM Dan van der Ster  
> > > wrote:
> > > >
> > > > Hi all,
> > > >
> > > > Last night we had 60 ERRs like this:
> > > >
> > > > 2019-07-26 00:56:44.479240 7efc6cca1700  0 mds.2.cache.dir(0x617)
> > > > _fetched  badness: got (but i already had) [inode 0x
> > > > [...2,head] ~mds2/stray1/10006289992 auth v14438219972 dirtyparent
> > > > s=116637332 nl=8 n(v0 rc2019-07-26 00:56:17.199090 b116637332 1=1+0)
> > > > (iversion lock) | request=0 lock=0 caps=0 remoteparent=0 dirtyparent=1
> > > > dirty=1 authpin=0 0x5561321eee00] mode 33188 mtime 2017-07-11
> > > > 16:20:50.00
> > > > 2019-07-26 00:56:44.479333 7efc6cca1700 -1 log_channel(cluster) log
> > > > [ERR] : loaded dup inode 0x10006289992 [2,head] v14437387948 at
> > > > ~mds2/stray3/10006289992, but inode 0x10006289992.head v14438219972
> > > > already exists at ~mds2/stray1/10006289992
> > > >
> > > > Looking through this ML this often corresponds to crashing MDS's and
> > > > needing a disaster recovery procedure to follow.
> > > > We haven't had any crash
> > > >
> > > > Is there something we should do *now* to fix these before any assert
> > > > is triggered?
> > >
> > > you can use rados rmomapkey to delete inode with smaller version. For
> > > above case:
> > >
> > > rados -p cephfs_metadata rmomapkey 617. 10006289992_head.
> >
> > I just checked and all of those inodes are no longer stray.
> >
> > # rados -p cephfs_metadata listomapkeys 617. | grep 10006289992
> > #
> >
> > They were originally from hardlink deletion, and another link has been
> > stat'ed in the meanwhile.
> > I also double checked the parent xattr on the inodes in cephfs_data
> > and they refer to a real parent dir, not stray.
> >
> > So it looks like all those dup inodes have been reintegrated. Am I safe ?
> >
>
> check if 10006289992_head is in 615. (mds2/stray1). If it is, delete 
> it.

Not there:

# rados -p cephfs_metadata listomapkeys 615. | grep 10006289992
#

It's not present in any stray omap (600...61d).

-- dan




> > -- dan
> >
> > > I suggest running 'cephfs-data-scan scan_links' after taking down cephfs
> > > (either use 'mds set  down true' or 'flush all journals and
> > > kill all mds')
> > >
> > >
> > > Regards
> > > Yan, Zheng
> > >
> > >
> > >
> > > >
> > > > Thanks!
> > > >
> > > > Dan


Re: [ceph-users] loaded dup inode (but no mds crash)

2019-07-29 Thread Dan van der Ster
On Mon, Jul 29, 2019 at 2:52 PM Yan, Zheng  wrote:
>
> On Fri, Jul 26, 2019 at 4:45 PM Dan van der Ster  wrote:
> >
> > Hi all,
> >
> > Last night we had 60 ERRs like this:
> >
> > 2019-07-26 00:56:44.479240 7efc6cca1700  0 mds.2.cache.dir(0x617)
> > _fetched  badness: got (but i already had) [inode 0x
> > [...2,head] ~mds2/stray1/10006289992 auth v14438219972 dirtyparent
> > s=116637332 nl=8 n(v0 rc2019-07-26 00:56:17.199090 b116637332 1=1+0)
> > (iversion lock) | request=0 lock=0 caps=0 remoteparent=0 dirtyparent=1
> > dirty=1 authpin=0 0x5561321eee00] mode 33188 mtime 2017-07-11
> > 16:20:50.00
> > 2019-07-26 00:56:44.479333 7efc6cca1700 -1 log_channel(cluster) log
> > [ERR] : loaded dup inode 0x10006289992 [2,head] v14437387948 at
> > ~mds2/stray3/10006289992, but inode 0x10006289992.head v14438219972
> > already exists at ~mds2/stray1/10006289992
> >
> > Looking through this ML this often corresponds to crashing MDS's and
> > needing a disaster recovery procedure to follow.
> > We haven't had any crash
> >
> > Is there something we should do *now* to fix these before any assert
> > is triggered?
>
> you can use rados rmomapkey to delete inode with smaller version. For
> above case:
>
> rados -p cephfs_metadata rmomapkey 617. 10006289992_head.

I just checked and all of those inodes are no longer stray.

# rados -p cephfs_metadata listomapkeys 617. | grep 10006289992
#

They were originally from hardlink deletion, and another link has been
stat'ed in the meanwhile.
I also double checked the parent xattr on the inodes in cephfs_data
and they refer to a real parent dir, not stray.

So it looks like all those dup inodes have been reintegrated. Am I safe ?

-- dan

> I suggest running 'cephfs-data-scan scan_links' after taking down cephfs
> (either use 'mds set  down true' or 'flush all journals and
> kill all mds')
>
>
> Regards
> Yan, Zheng
>
>
>
> >
> > Thanks!
> >
> > Dan


[ceph-users] loaded dup inode (but no mds crash)

2019-07-26 Thread Dan van der Ster
Hi all,

Last night we had 60 ERRs like this:

2019-07-26 00:56:44.479240 7efc6cca1700  0 mds.2.cache.dir(0x617)
_fetched  badness: got (but i already had) [inode 0x10006289992
[...2,head] ~mds2/stray1/10006289992 auth v14438219972 dirtyparent
s=116637332 nl=8 n(v0 rc2019-07-26 00:56:17.199090 b116637332 1=1+0)
(iversion lock) | request=0 lock=0 caps=0 remoteparent=0 dirtyparent=1
dirty=1 authpin=0 0x5561321eee00] mode 33188 mtime 2017-07-11
16:20:50.00
2019-07-26 00:56:44.479333 7efc6cca1700 -1 log_channel(cluster) log
[ERR] : loaded dup inode 0x10006289992 [2,head] v14437387948 at
~mds2/stray3/10006289992, but inode 0x10006289992.head v14438219972
already exists at ~mds2/stray1/10006289992

Looking through this ML, this often corresponds to crashing MDSs and
needing to follow a disaster recovery procedure.
We haven't had any crashes.

Is there something we should do *now* to fix these before any assert
is triggered?

Thanks!

Dan


[ceph-users] how to power off a cephfs cluster cleanly

2019-07-25 Thread Dan van der Ster
Hi all,

In September we'll need to power down a CephFS cluster (currently
mimic) for a several-hour electrical intervention.

Having never done this before, I thought I'd check with the list.
Here's our planned procedure:

1. umount /cephfs on all HPC clients.
2. ceph osd set noout
3. wait until there is zero IO on the cluster
4. stop all MDSs (active + standby)
5. stop all OSDs
(6. we won't stop the mons as they are not affected by that
electrical intervention)
7. power off the cluster.
...
8. power on the cluster, OSDs first, then MDSs. wait for health_ok.
9. ceph osd unset noout

Seems pretty simple... Are there any gotchas I'm missing? Maybe
there's some special procedure to stop the mds's cleanly?
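
In command form the plan would look roughly like this (a sketch assuming
systemd-managed daemons; unit names may differ per deployment):

  # on every client
  umount /cephfs

  ceph osd set noout

  # on each MDS host (actives and standbys)
  systemctl stop ceph-mds.target
  # on each OSD host
  systemctl stop ceph-osd.target

  # ... power off, electrical intervention, power on ...

  # on each OSD host, then each MDS host
  systemctl start ceph-osd.target
  systemctl start ceph-mds.target

  ceph status          # wait for HEALTH_OK
  ceph osd unset noout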

Cheers, dan


Re: [ceph-users] Erasure Coding performance for IO < stripe_width

2019-07-08 Thread Dan van der Ster
On Mon, Jul 8, 2019 at 1:02 PM Lars Marowsky-Bree  wrote:
>
> On 2019-07-08T12:25:30, Dan van der Ster  wrote:
>
> > Is there a specific bench result you're concerned about?
>
> We're seeing ~5800 IOPS, ~23 MiB/s on 4 KiB IO (stripe_width 8192) on a
> pool that could do 3 GiB/s with 4M blocksize. So, yeah, well, that is
> rather harsh, even for EC.

How does that pool manage with the same client pattern but 3x replication?

The difference between 4kB and 4MB writes could be many things.
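
One way to separate the EC overhead from everything else is to run the same
small-write pattern against both pools with rados bench; a sketch (the pool
names are placeholders):

  rados bench -p testpool_ec 60 write -b 4096 -t 16 --no-cleanup
  rados bench -p testpool_rep3 60 write -b 4096 -t 16 --no-cleanup

Note that rados bench creates new 4 KiB objects rather than doing 4 KiB
overwrites inside larger objects, so an fio run against an RBD image on each
pool would be closer to your real workload.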

-- dan


>
> > I would think that small write perf could be kept reasonable thanks to
> > bluestore's deferred writes.
>
> I believe we're being hit by the EC read-modify-write cycle on
> overwrites.
>
> > FWIW, our bench results (all flash cluster) didn't show a massive
> > performance difference between 3 replica and 4+2 EC.
>
> I'm guessing that this was not 4 KiB but a more reasonable blocksize
> that was a multiple of stripe_width?
>
>
> Regards,
> Lars
>
> --
> SUSE Linux GmbH, GF: Felix Imendörffer, Mary Higgins, Sri Rasiah, HRB 21284 
> (AG Nürnberg)
> "Experience is the name everyone gives to their mistakes." -- Oscar Wilde
>


Re: [ceph-users] Erasure Coding performance for IO < stripe_width

2019-07-08 Thread Dan van der Ster
Hi Lars,

Is there a specific bench result you're concerned about?
I would think that small write perf could be kept reasonable thanks to
bluestore's deferred writes.
FWIW, our bench results (all flash cluster) didn't show a massive
performance difference between 3 replica and 4+2 EC.
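
If you want to check whether those small writes are actually taking the
deferred path, the OSDs expose both the threshold and counters for it; for
example (osd.0 is just an arbitrary daemon):

  # writes at or below this size go via the WAL (deferred) first
  ceph daemon osd.0 config get bluestore_prefer_deferred_size_ssd
  ceph daemon osd.0 config get bluestore_prefer_deferred_size_hdd

  # deferred_write_ops / deferred_write_bytes counters
  ceph daemon osd.0 perf dump | grep deferred_write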

I agree about not needing to read the parity during a write though.
Hopefully that's just a typo? (Or maybe there's a fast way to update
EC chunks without communicating across OSDs ?)

-- dan



On Mon, Jul 8, 2019 at 10:47 AM Lars Marowsky-Bree  wrote:
>
> Morning all,
>
> since Luminous/Mimic, Ceph supports allow_ec_overwrites. However, this has a
> performance impact that looks even worse than what I'd expect from a
> Read-Modify-Write cycle.
>
> https://ceph.com/community/new-luminous-erasure-coding-rbd-cephfs/ also
> mentions that the small writes would read the previous value from all
> k+m OSDs; shouldn't the k stripes be sufficient (assuming we're not
> currently degraded)?
>
> Is there any suggestion on how to make this go faster, or suggestions on
> which solution one could implement going forward?
>
>
> Regards,
> Lars
>
> --
> SUSE Linux GmbH, GF: Felix Imendörffer, Mary Higgins, Sri Rasiah, HRB 21284 
> (AG Nürnberg)
> "Architects should open possibilities and not determine everything." (Ueli 
> Zbinden)
>


Re: [ceph-users] understanding the bluestore blob, chunk and compression params

2019-06-21 Thread Dan van der Ster
http://tracker.ceph.com/issues/40480

On Thu, Jun 20, 2019 at 9:12 PM Dan van der Ster  wrote:
>
> I will try to reproduce with logs and create a tracker once I find the
> smoking gun...
>
> It's very strange -- I had the osd mode set to 'passive', and pool
> option set to 'force', and the osd was compressing objects for around
> 15 minutes. Then suddenly it just stopped compressing, until I did
> 'ceph daemon osd.130 config set bluestore_compression_mode force',
> where it restarted immediately.
>
> FTR, it *should* compress with osd bluestore_compression_mode=none and
> the pool's compression_mode=force, right?
>
> -- dan
>
>
> On Thu, Jun 20, 2019 at 6:57 PM Igor Fedotov  wrote:
> >
> > I'd like to see more details (preferably backed with logs) on this...
> >
> > On 6/20/2019 6:23 PM, Dan van der Ster wrote:
> > > P.S. I know this has been discussed before, but the
> > > compression_(mode|algorithm) pool options [1] seem completely broken
> > > -- With the pool mode set to force, we see that sometimes the
> > > compression is invoked and sometimes it isn't. AFAICT,
> > > the only way to compress every object is to set
> > > bluestore_compression_mode=force on the osd.
> > >
> > > -- dan
> > >
> > > [1] 
> > > http://docs.ceph.com/docs/master/rados/operations/pools/#set-pool-values
> > >
> > >
> > > On Thu, Jun 20, 2019 at 4:33 PM Dan van der Ster  
> > > wrote:
> > >> Hi all,
> > >>
> > >> I'm trying to compress an rbd pool via backfilling the existing data,
> > >> and the allocated space doesn't match what I expect.
> > >>
> > >> Here is the test: I marked osd.130 out and waited for it to erase all 
> > >> its data.
> > >> Then I set (on the pool) compression_mode=force and 
> > >> compression_algorithm=zstd.
> > >> Then I marked osd.130 to get its PGs/objects back (this time compressing 
> > >> them).
> > >>
> > >> After a few 10s of minutes we have:
> > >>  "bluestore_compressed": 989250439,
> > >>  "bluestore_compressed_allocated": 3859677184,
> > >>  "bluestore_compressed_original": 7719354368,
> > >>
> > >> So, the allocated is exactly 50% of original, but we are wasting space
> > >> because compressed is 12.8% of original.
> > >>
> > >> I don't understand why...
> > >>
> > >> The rbd images all use 4MB objects, and we use the default chunk and
> > >> blob sizes (in v13.2.6):
> > >> osd_recovery_max_chunk = 8MB
> > >> bluestore_compression_max_blob_size_hdd = 512kB
> > >> bluestore_compression_min_blob_size_hdd = 128kB
> > >> bluestore_max_blob_size_hdd = 512kB
> > >> bluestore_min_alloc_size_hdd = 64kB
> > >>
> > >>  From my understanding, backfilling should read a whole 4MB object from
> > >> the src osd, then write it to osd.130's bluestore, compressing in
> > >> 512kB blobs. Those compress on average at 12.8% so I would expect to
> > >> see allocated being closer to bluestore_min_alloc_size_hdd /
> > >> bluestore_compression_max_blob_size_hdd = 12.5%.
> > >>
> > >> Does someone understand where the 0.5 ratio is coming from?
> > >>
> > >> Thanks!
> > >>
> > >> Dan


Re: [ceph-users] understanding the bluestore blob, chunk and compression params

2019-06-20 Thread Dan van der Ster
I will try to reproduce with logs and create a tracker once I find the
smoking gun...

It's very strange -- I had the osd mode set to 'passive' and the pool
option set to 'force', and the osd was compressing objects for around
15 minutes. Then it suddenly stopped compressing, until I did
'ceph daemon osd.130 config set bluestore_compression_mode force',
at which point it immediately started compressing again.

FTR, it *should* compress with osd bluestore_compression_mode=none and
the pool's compression_mode=force, right?
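
For the record, this is roughly how I'm enabling it per pool and then watching
the counters on the OSD being backfilled (the pool name here is a placeholder;
osd.130 is the one I marked back in):

  ceph osd pool set mypool compression_mode force
  ceph osd pool set mypool compression_algorithm zstd

  # the counters should grow as backfill writes land on the osd
  ceph daemon osd.130 perf dump | grep bluestore_compressed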

-- dan


On Thu, Jun 20, 2019 at 6:57 PM Igor Fedotov  wrote:
>
> I'd like to see more details (preferably backed with logs) on this...
>
> On 6/20/2019 6:23 PM, Dan van der Ster wrote:
> > P.S. I know this has been discussed before, but the
> > compression_(mode|algorithm) pool options [1] seem completely broken
> > -- With the pool mode set to force, we see that sometimes the
> > compression is invoked and sometimes it isn't. AFAICT,
> > the only way to compress every object is to set
> > bluestore_compression_mode=force on the osd.
> >
> > -- dan
> >
> > [1] http://docs.ceph.com/docs/master/rados/operations/pools/#set-pool-values
> >
> >
> > On Thu, Jun 20, 2019 at 4:33 PM Dan van der Ster  
> > wrote:
> >> Hi all,
> >>
> >> I'm trying to compress an rbd pool via backfilling the existing data,
> >> and the allocated space doesn't match what I expect.
> >>
> >> Here is the test: I marked osd.130 out and waited for it to erase all its 
> >> data.
> >> Then I set (on the pool) compression_mode=force and 
> >> compression_algorithm=zstd.
> >> Then I marked osd.130 to get its PGs/objects back (this time compressing 
> >> them).
> >>
> >> After a few 10s of minutes we have:
> >>  "bluestore_compressed": 989250439,
> >>  "bluestore_compressed_allocated": 3859677184,
> >>  "bluestore_compressed_original": 7719354368,
> >>
> >> So, the allocated is exactly 50% of original, but we are wasting space
> >> because compressed is 12.8% of original.
> >>
> >> I don't understand why...
> >>
> >> The rbd images all use 4MB objects, and we use the default chunk and
> >> blob sizes (in v13.2.6):
> >> osd_recovery_max_chunk = 8MB
> >> bluestore_compression_max_blob_size_hdd = 512kB
> >> bluestore_compression_min_blob_size_hdd = 128kB
> >> bluestore_max_blob_size_hdd = 512kB
> >> bluestore_min_alloc_size_hdd = 64kB
> >>
> >>  From my understanding, backfilling should read a whole 4MB object from
> >> the src osd, then write it to osd.130's bluestore, compressing in
> >> 512kB blobs. Those compress on average at 12.8% so I would expect to
> >> see allocated being closer to bluestore_min_alloc_size_hdd /
> >> bluestore_compression_max_blob_size_hdd = 12.5%.
> >>
> >> Does someone understand where the 0.5 ratio is coming from?
> >>
> >> Thanks!
> >>
> >> Dan


Re: [ceph-users] understanding the bluestore blob, chunk and compression params

2019-06-20 Thread Dan van der Ster
On Thu, Jun 20, 2019 at 6:55 PM Igor Fedotov  wrote:
>
> Hi Dan,
>
> bluestore_compression_max_blob_size is applied for objects marked with
> some additional hints only:
>
>if ((alloc_hints & CEPH_OSD_ALLOC_HINT_FLAG_SEQUENTIAL_READ) &&
>(alloc_hints & CEPH_OSD_ALLOC_HINT_FLAG_RANDOM_READ) == 0 &&
>(alloc_hints & (CEPH_OSD_ALLOC_HINT_FLAG_IMMUTABLE |
>CEPH_OSD_ALLOC_HINT_FLAG_APPEND_ONLY)) &&
>(alloc_hints & CEPH_OSD_ALLOC_HINT_FLAG_RANDOM_WRITE) == 0) {
>
>  dout(20) << __func__ << " will prefer large blob and csum sizes" <<
> dendl;
>
>
> For regular objects "bluestore_compression_max_blob_size" is used. Which
> results in minimum ratio  = 0.5

I presume you mean bluestore_compression_min_blob_size ...

Going back to the thread Frank linked later in this thread, I now
understand that I can double bluestore_compression_min_blob_size to get
0.25, or halve bluestore_min_alloc_size_hdd (at OSD creation time) to
get 0.25. That seems clear now (though I wonder if the option names
are slightly misleading ...)
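
In other words, with the defaults each 128kB blob compresses down to roughly
16kB but still allocates a full 64kB min_alloc unit, hence 64/128 = 0.5; at a
256kB min blob size that becomes 64/256 = 0.25. A sketch of the change (I'd
test it on one OSD first, and it likely only affects newly written or
backfilled data):

  # cluster-wide, via the config db
  ceph config set osd bluestore_compression_min_blob_size_hdd 262144

  # or on a single osd for testing
  ceph daemon osd.130 config set bluestore_compression_min_blob_size_hdd 262144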

Now I'll try to observe any performance impact of increased
min_blob_size... Do you recall if there were some benchmarks done to
pick those current defaults?

Thanks!

Dan



>
>
> Thanks,
>
> Igor
>
> On 6/20/2019 5:33 PM, Dan van der Ster wrote:
> > Hi all,
> >
> > I'm trying to compress an rbd pool via backfilling the existing data,
> > and the allocated space doesn't match what I expect.
> >
> > Here is the test: I marked osd.130 out and waited for it to erase all its 
> > data.
> > Then I set (on the pool) compression_mode=force and 
> > compression_algorithm=zstd.
> > Then I marked osd.130 to get its PGs/objects back (this time compressing 
> > them).
> >
> > After a few 10s of minutes we have:
> >  "bluestore_compressed": 989250439,
> >  "bluestore_compressed_allocated": 3859677184,
> >  "bluestore_compressed_original": 7719354368,
> >
> > So, the allocated is exactly 50% of original, but we are wasting space
> > because compressed is 12.8% of original.
> >
> > I don't understand why...
> >
> > The rbd images all use 4MB objects, and we use the default chunk and
> > blob sizes (in v13.2.6):
> > osd_recovery_max_chunk = 8MB
> > bluestore_compression_max_blob_size_hdd = 512kB
> > bluestore_compression_min_blob_size_hdd = 128kB
> > bluestore_max_blob_size_hdd = 512kB
> > bluestore_min_alloc_size_hdd = 64kB
> >
> >  From my understanding, backfilling should read a whole 4MB object from
> > the src osd, then write it to osd.130's bluestore, compressing in
> > 512kB blobs. Those compress on average at 12.8% so I would expect to
> > see allocated being closer to bluestore_min_alloc_size_hdd /
> > bluestore_compression_max_blob_size_hdd = 12.5%.
> >
> > Does someone understand where the 0.5 ratio is coming from?
> >
> > Thanks!
> >
> > Dan
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] understanding the bluestore blob, chunk and compression params

2019-06-20 Thread Dan van der Ster
P.S. I know this has been discussed before, but the
compression_(mode|algorithm) pool options [1] seem completely broken
-- With the pool mode set to force, we see that sometimes the
compression is invoked and sometimes it isn't. AFAICT,
the only way to compress every object is to set
bluestore_compression_mode=force on the osd.
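
For the record, that osd-level setting is just something like the following
(on a mimic-era cluster the config db works; plain ceph.conf plus an OSD
restart does the same):

  ceph config set osd bluestore_compression_mode force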

-- dan

[1] http://docs.ceph.com/docs/master/rados/operations/pools/#set-pool-values


On Thu, Jun 20, 2019 at 4:33 PM Dan van der Ster  wrote:
>
> Hi all,
>
> I'm trying to compress an rbd pool via backfilling the existing data,
> and the allocated space doesn't match what I expect.
>
> Here is the test: I marked osd.130 out and waited for it to erase all its 
> data.
> Then I set (on the pool) compression_mode=force and 
> compression_algorithm=zstd.
> Then I marked osd.130 to get its PGs/objects back (this time compressing 
> them).
>
> After a few 10s of minutes we have:
> "bluestore_compressed": 989250439,
> "bluestore_compressed_allocated": 3859677184,
> "bluestore_compressed_original": 7719354368,
>
> So, the allocated is exactly 50% of original, but we are wasting space
> because compressed is 12.8% of original.
>
> I don't understand why...
>
> The rbd images all use 4MB objects, and we use the default chunk and
> blob sizes (in v13.2.6):
>osd_recovery_max_chunk = 8MB
>bluestore_compression_max_blob_size_hdd = 512kB
>bluestore_compression_min_blob_size_hdd = 128kB
>bluestore_max_blob_size_hdd = 512kB
>bluestore_min_alloc_size_hdd = 64kB
>
> From my understanding, backfilling should read a whole 4MB object from
> the src osd, then write it to osd.130's bluestore, compressing in
> 512kB blobs. Those compress on average at 12.8% so I would expect to
> see allocated being closer to bluestore_min_alloc_size_hdd /
> bluestore_compression_max_blob_size_hdd = 12.5%.
>
> Does someone understand where the 0.5 ratio is coming from?
>
> Thanks!
>
> Dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] understanding the bluestore blob, chunk and compression params

2019-06-20 Thread Dan van der Ster
Hi all,

I'm trying to compress an rbd pool via backfilling the existing data,
and the allocated space doesn't match what I expect.

Here is the test: I marked osd.130 out and waited for it to erase all its data.
Then I set (on the pool) compression_mode=force and compression_algorithm=zstd.
Then I marked osd.130 to get its PGs/objects back (this time compressing them).
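
In command form the test went roughly like this (the pool name is a
placeholder, and the waits for draining/backfill are manual):

  ceph osd out 130
  # ...wait for osd.130 to empty...
  ceph osd pool set <rbd-pool> compression_mode force
  ceph osd pool set <rbd-pool> compression_algorithm zstd
  ceph osd in 130
  # ...backfill rewrites the objects, compressing them...
  # the counters below come from the osd's perf counters:
  ceph daemon osd.130 perf dump | grep bluestore_compressed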

After a few 10s of minutes we have:
"bluestore_compressed": 989250439,
"bluestore_compressed_allocated": 3859677184,
"bluestore_compressed_original": 7719354368,

So, the allocated is exactly 50% of original, but we are wasting space
because compressed is 12.8% of original.

I don't understand why...

The rbd images all use 4MB objects, and we use the default chunk and
blob sizes (in v13.2.6):
   osd_recovery_max_chunk = 8MB
   bluestore_compression_max_blob_size_hdd = 512kB
   bluestore_compression_min_blob_size_hdd = 128kB
   bluestore_max_blob_size_hdd = 512kB
   bluestore_min_alloc_size_hdd = 64kB

From my understanding, backfilling should read a whole 4MB object from
the src osd, then write it to osd.130's bluestore, compressing in
512kB blobs. Those compress on average at 12.8% so I would expect to
see allocated being closer to bluestore_min_alloc_size_hdd /
bluestore_compression_max_blob_size_hdd = 12.5%.

Does someone understand where the 0.5 ratio is coming from?

Thanks!

Dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rocksdb corruption, stale pg, rebuild bucket index

2019-06-17 Thread Dan van der Ster
We have resharded a bucket with 60 million objects from 32 to 64
shards without any problem. (Though there were several slow ops at the
"stalls after counting the objects phase", so I set nodown as a
precaution).
We're now resharding that bucket from 64 to 1024.
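
For reference, the reshard itself is the usual radosgw-admin flow, roughly
(bucket name is a placeholder):

  ceph osd set nodown          # the precaution mentioned above; unset it afterwards
  radosgw-admin bucket reshard --bucket=<bucket> --num-shards=1024
  ceph osd unset nodown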

In your case I wonder if it was the large step up to 1024 shards that
caused the crashes somehow? Or maybe your bluefs didn't have enough
free space for the compaction after the large omaps were removed?

-- dan

On Mon, Jun 17, 2019 at 11:14 AM Harald Staub  wrote:
>
> We received the large omap warning before, but for some reasons we could
> not react quickly. We accepted the risk of the bucket becoming slow, but
> had not thought of further risks ...
>
> On 17.06.19 10:15, Dan van der Ster wrote:
> > Nice to hear this was resolved in the end.
> >
> > Coming back to the beginning -- is it clear to anyone what was the
> > root cause and how other users can avoid this from happening? Maybe
> > some better default configs to warn users earlier about too-large
> > omaps?
> >
> > Cheers, Dan
> >
> > On Thu, Jun 13, 2019 at 7:36 PM Harald Staub  wrote:
> >>
> >> Looks fine (at least so far), thank you all!
> >>
> >> After having exported all 3 copies of the bad PG, we decided to try it
> >> in-place. We also set norebalance to make sure that no data is moved.
> >> When the PG was up, the resharding finished with a "success" message.
> >> The large omap warning is gone after deep-scrubbing the PG.
> >>
> >> Then we set the 3 OSDs to out. Soon after, one after the other was down
> >> (maybe for 2 minutes) and we got degraded PGs, but only once.
> >>
> >> Thank you!
> >>Harry
> >>
> >> On 13.06.19 16:14, Sage Weil wrote:
> >>> On Thu, 13 Jun 2019, Harald Staub wrote:
> >>>> On 13.06.19 15:52, Sage Weil wrote:
> >>>>> On Thu, 13 Jun 2019, Harald Staub wrote:
> >>>> [...]
> >>>>> I think that increasing the various suicide timeout options will allow
> >>>>> it to stay up long enough to clean up the ginormous objects:
> >>>>>
> >>>>> ceph config set osd.NNN osd_op_thread_suicide_timeout 2h
> >>>>
> >>>> ok
> >>>>
> >>>>>> It looks healthy so far:
> >>>>>> ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-266 fsck
> >>>>>> fsck success
> >>>>>>
> >>>>>> Now we have to choose how to continue, trying to reduce the risk of 
> >>>>>> losing
> >>>>>> data (most bucket indexes are intact currently). My guess would be to 
> >>>>>> let
> >>>>>> this
> >>>>>> OSD (which was not the primary) go in and hope that it recovers. In 
> >>>>>> case
> >>>>>> of a
> >>>>>> problem, maybe we could still use the other OSDs "somehow"? In case of
> >>>>>> success, we would bring back the other OSDs as well?
> >>>>>>
> >>>>>> OTOH we could try to continue with the key dump from earlier today.
> >>>>>
> >>>>> I would start all three osds the same way, with 'noout' set on the
> >>>>> cluster.  You should try to avoid triggering recovery because it will 
> >>>>> have
> >>>>> a hard time getting through the big index object on that bucket (i.e., 
> >>>>> it
> >>>>> will take a long time, and might trigger some blocked ios and so forth).
> >>>>
> >>>> This I do not understand, how would I avoid recovery?
> >>>
> >>> Well, simply doing 'ceph osd set noout' is sufficient to avoid
> >>> recovery, I suppose.  But in any case, getting at least 2 of the
> >>> existing copies/OSDs online (assuming your pool's min_size=2) will mean
> >>> you can finish the reshard process and clean up the big object without
> >>> copying the PG anywhere.
> >>>
> >>> I think you may as well do all 3 OSDs this way, then clean up the big
> >>> object--that way in the end no data will have to move.
> >>>
> >>> This is Nautilus, right?  If you scrub the PGs in question, that will also
> >>> now raise the health alert if there are any remaining big omap objects...
> >>>> if that warning goes away you'll know you're done cleaning up.  A final
> >>>> rocksdb compaction should then be enough to remove any remaining weirdness
> >>> from rocksdb's internal layout.
> >>>
> >>>>> (Side note that since you started the OSD read-write using the internal
> >>>>> copy of rocksdb, don't forget that the external copy you extracted
> >>>>> (/mnt/ceph/db?) is now stale!)
> >>>>
> >>>> As suggested by Paul Emmerich (see next E-mail in this thread), I 
> >>>> exported
> >>>> this PG. It took not that long (20 minutes).
> >>>
> >>> Great :)
> >>>
> >>> sage
> >>>
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rocksdb corruption, stale pg, rebuild bucket index

2019-06-17 Thread Dan van der Ster
Nice to hear this was resolved in the end.

Coming back to the beginning -- is it clear to anyone what was the
root cause and how other users can avoid this from happening? Maybe
some better default configs to warn users earlier about too-large
omaps?

Cheers, Dan

On Thu, Jun 13, 2019 at 7:36 PM Harald Staub  wrote:
>
> Looks fine (at least so far), thank you all!
>
> After having exported all 3 copies of the bad PG, we decided to try it
> in-place. We also set norebalance to make sure that no data is moved.
> When the PG was up, the resharding finished with a "success" message.
> The large omap warning is gone after deep-scrubbing the PG.
>
> Then we set the 3 OSDs to out. Soon after, one after the other was down
> (maybe for 2 minutes) and we got degraded PGs, but only once.
>
> Thank you!
>   Harry
>
> On 13.06.19 16:14, Sage Weil wrote:
> > On Thu, 13 Jun 2019, Harald Staub wrote:
> >> On 13.06.19 15:52, Sage Weil wrote:
> >>> On Thu, 13 Jun 2019, Harald Staub wrote:
> >> [...]
> >>> I think that increasing the various suicide timeout options will allow
> >>> it to stay up long enough to clean up the ginormous objects:
> >>>
> >>>ceph config set osd.NNN osd_op_thread_suicide_timeout 2h
> >>
> >> ok
> >>
>  It looks healthy so far:
>  ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-266 fsck
>  fsck success
> 
>  Now we have to choose how to continue, trying to reduce the risk of 
>  losing
>  data (most bucket indexes are intact currently). My guess would be to let
>  this
>  OSD (which was not the primary) go in and hope that it recovers. In case
>  of a
>  problem, maybe we could still use the other OSDs "somehow"? In case of
>  success, we would bring back the other OSDs as well?
> 
>  OTOH we could try to continue with the key dump from earlier today.
> >>>
> >>> I would start all three osds the same way, with 'noout' set on the
> >>> cluster.  You should try to avoid triggering recovery because it will have
> >>> a hard time getting through the big index object on that bucket (i.e., it
> >>> will take a long time, and might trigger some blocked ios and so forth).
> >>
> >> This I do not understand, how would I avoid recovery?
> >
> > Well, simply doing 'ceph osd set noout' is sufficient to avoid
> > recovery, I suppose.  But in any case, getting at least 2 of the
> > existing copies/OSDs online (assuming your pool's min_size=2) will mean
> > you can finish the reshard process and clean up the big object without
> > copying the PG anywhere.
> >
> > I think you may as well do all 3 OSDs this way, then clean up the big
> > object--that way in the end no data will have to move.
> >
> > This is Nautilus, right?  If you scrub the PGs in question, that will also
> > now raise the health alert if there are any remaining big omap objects...
> > if that warning goes away you'll know you're done cleaning up.  A final
> > rocksdb compaction should then be enough to remove any remaining weirdness
> > from rocksdb's internal layout.
> >
> >>> (Side note that since you started the OSD read-write using the internal
> >>> copy of rocksdb, don't forget that the external copy you extracted
> >>> (/mnt/ceph/db?) is now stale!)
> >>
> >> As suggested by Paul Emmerich (see next E-mail in this thread), I exported
> >> this PG. It took not that long (20 minutes).
> >
> > Great :)
> >
> > sage
> >
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] problem with degraded PG

2019-06-14 Thread Dan van der Ster
Ahh I was thinking of chooseleaf_vary_r, which you already have.
So probably not related to tunables. What is your `ceph osd tree` ?

By the way, 12.2.9 has an unrelated bug (details
http://tracker.ceph.com/issues/36686)
AFAIU you will just need to update to v12.2.11 or v12.2.12 for that fix.

-- Dan

On Fri, Jun 14, 2019 at 11:29 AM Luk  wrote:
>
> Hi,
>
> here is the output:
>
> ceph osd crush show-tunables
> {
> "choose_local_tries": 0,
> "choose_local_fallback_tries": 0,
> "choose_total_tries": 100,
> "chooseleaf_descend_once": 1,
> "chooseleaf_vary_r": 1,
> "chooseleaf_stable": 0,
> "straw_calc_version": 1,
> "allowed_bucket_algs": 22,
> "profile": "unknown",
> "optimal_tunables": 0,
> "legacy_tunables": 0,
> "minimum_required_version": "hammer",
> "require_feature_tunables": 1,
> "require_feature_tunables2": 1,
> "has_v2_rules": 0,
> "require_feature_tunables3": 1,
> "has_v3_rules": 0,
> "has_v4_buckets": 1,
> "require_feature_tunables5": 0,
> "has_v5_rules": 0
> }
>
> [root@ceph-mon-01 ~]#
>
> --
> Regards
> Lukasz
>
> > Hi,
> > This looks like a tunables issue.
> > What is the output of `ceph osd crush show-tunables `
>
> > -- Dan
>
> > On Fri, Jun 14, 2019 at 11:19 AM Luk  wrote:
> >>
> >> Hello,
> >>
> >> Maybe  someone  was  fighting  with this kind of stuck state in ceph already.
> >> This  is  production  cluster,  can't/don't  want to make wrong steps,
> >> please advice, what to do.
> >>
> >> After  changing  one failed disk (it was osd-7) on our cluster, ceph
> >> didn't recover to HEALTH_OK; it stopped in this state:
> >>
> >> [root@ceph-mon-01 ~]# ceph -s
> >>   cluster:
> >> id: b6f23cff-7279-f4b0-ff91-21fadac95bb5
> >> health: HEALTH_WARN
> >> noout,noscrub,nodeep-scrub flag(s) set
> >> Degraded data redundancy: 24761/45994899 objects degraded 
> >> (0.054%), 8 pgs degraded, 8 pgs undersized
> >>
> >>   services:
> >> mon:3 daemons, quorum ceph-mon-01,ceph-mon-02,ceph-mon-03
> >> mgr:ceph-mon-03(active), standbys: ceph-mon-02, ceph-mon-01
> >> osd:144 osds: 144 up, 144 in
> >> flags noout,noscrub,nodeep-scrub
> >> rbd-mirror: 3 daemons active
> >> rgw:6 daemons active
> >>
> >>   data:
> >> pools:   18 pools, 2176 pgs
> >> objects: 15.33M objects, 49.3TiB
> >> usage:   151TiB used, 252TiB / 403TiB avail
> >> pgs: 24761/45994899 objects degraded (0.054%)
> >>  2168 active+clean
> >>  8    active+undersized+degraded
> >>
> >>   io:
> >> client:   435MiB/s rd, 415MiB/s wr, 7.94kop/s rd, 2.96kop/s wr
> >>
> >> Restart of the OSD didn't help, changing choose_total_tries from 50 to 100
> >> didn't help either.
> >>
> >> I checked one of degraded PG, 10.3c4
> >>
> >> [root@ceph-mon-01 ~]# ceph pg dump 2>&1 | grep -w 10.3c4
> >> 10.3c4 3593  0 3593 0   0 14769891858 
> >> 1007610076active+undersized+degraded 2019-06-13 08:19:39.802219  
> >> 37380'71900564 37380:119411139   [9,109]  9   [9,109]  
> >> 9  33550'69130424 2019-06-08 02:28:40.508790  33550'69130424 
> >> 2019-06-08 02:28:40.50879018
> >>
> >>
> >> [root@ceph-mon-01 ~]# ceph pg 10.3c4 query | jq '.["peer_info"][] | {peer: 
> >> .peer, last_update:.last_update}'
> >> {
> >>   "peer": "0",
> >>   "last_update": "36847'71412720"
> >> }
> >> {
> >>   "peer": "109",
> >>   "last_update": "37380'71900570"
> >> }
> >> {
> >>   "peer": "117",
> >>   "last_update": "0'0"
> >> }
> >>
> >>
> >> [root@ceph-mon-01 ~]#
> >> I have checked space taken for this PG on storage nodes:
> >> here is how to check where is particular OSD (on which physical storage 
> >> node):
> >> [root@ceph-mon-01 ~]# ceph osd status 2>&1 | grep " 9 "
> >> |  9  |   stor-a02  | 2063G | 5386G |   52   |  1347k  |   53   |   292k  
> >> | exists,up |
> >> [root@ceph-mon-01 ~]# ceph osd status 2>&1 | grep " 109 "
> >> | 109 |   stor-a01  | 1285G | 4301G |5   |  31.0k  |6   |  59.2k  
> >> | exists,up |
> >> [root@ceph-mon-01 ~]# watch ceph -s
> >> [root@ceph-mon-01 ~]# ceph osd status 2>&1 | grep " 117 "
> >> | 117 |   stor-b02  | 1334G | 4252G |   54   |  1216k  |   13   |  27.4k  
> >> | exists,up |
> >> [root@ceph-mon-01 ~]# ceph osd status 2>&1 | grep " 0 "
> >> |  0  |   stor-a01  | 2156G | 5293G |   58   |   387k  |   29   |  30.7k  
> >> | exists,up |
> >> [root@ceph-mon-01 ~]#
> >> and checking sizes on servers:
> >> stor-a01 (this PG shouldn't be on the same host):
> >> [root@stor-a01 /var/lib/ceph/osd/ceph-0/current]# du -sh 10.3c4_*
> >> 2.4G10.3c4_head
> >> 0   10.3c4_TEMP
> >> [root@stor-a01 /var/lib/ceph/osd/ceph-109/current]# du -sh 10.3c4_*
> >> 14G 10.3c4_head
> >> 0   10.3c4_TEMP
> >> [root@stor-a01 /var/lib/ceph/osd/ceph-109/current]#
> >> stor-a02:
> >> [root@stor-a02 /var/lib/ceph/osd/ceph-9/current]# du -sh 10.3c4_*
> >> 14G 

Re: [ceph-users] problem with degraded PG

2019-06-14 Thread Dan van der Ster
Hi,
This looks like a tunables issue.
What is the output of `ceph osd crush show-tunables `

-- Dan

On Fri, Jun 14, 2019 at 11:19 AM Luk  wrote:
>
> Hello,
>
> Maybe  someone  was  fighting  with this kind of stuck state in ceph already.
> This  is  production  cluster,  can't/don't  want to make wrong steps,
> please advice, what to do.
>
> After  changing  one failed disk (it was osd-7) on our cluster, ceph
> didn't recover to HEALTH_OK; it stopped in this state:
>
> [root@ceph-mon-01 ~]# ceph -s
>   cluster:
> id: b6f23cff-7279-f4b0-ff91-21fadac95bb5
> health: HEALTH_WARN
> noout,noscrub,nodeep-scrub flag(s) set
> Degraded data redundancy: 24761/45994899 objects degraded 
> (0.054%), 8 pgs degraded, 8 pgs undersized
>
>   services:
> mon:3 daemons, quorum ceph-mon-01,ceph-mon-02,ceph-mon-03
> mgr:ceph-mon-03(active), standbys: ceph-mon-02, ceph-mon-01
> osd:144 osds: 144 up, 144 in
> flags noout,noscrub,nodeep-scrub
> rbd-mirror: 3 daemons active
> rgw:6 daemons active
>
>   data:
> pools:   18 pools, 2176 pgs
> objects: 15.33M objects, 49.3TiB
> usage:   151TiB used, 252TiB / 403TiB avail
> pgs: 24761/45994899 objects degraded (0.054%)
>  2168 active+clean
>  8    active+undersized+degraded
>
>   io:
> client:   435MiB/s rd, 415MiB/s wr, 7.94kop/s rd, 2.96kop/s wr
>
> Restart of the OSD didn't help, changing choose_total_tries from 50 to 100
> didn't help either.
>
> I checked one of degraded PG, 10.3c4
>
> [root@ceph-mon-01 ~]# ceph pg dump 2>&1 | grep -w 10.3c4
> 10.3c4  3593  0  3593  0  0  14769891858  10076  10076  active+undersized+degraded  2019-06-13 08:19:39.802219  37380'71900564  37380:119411139  [9,109]  9  [9,109]  9  33550'69130424  2019-06-08 02:28:40.508790  33550'69130424  2019-06-08 02:28:40.508790  18
>
>
> [root@ceph-mon-01 ~]# ceph pg 10.3c4 query | jq '.["peer_info"][] | {peer: 
> .peer, last_update:.last_update}'
> {
>   "peer": "0",
>   "last_update": "36847'71412720"
> }
> {
>   "peer": "109",
>   "last_update": "37380'71900570"
> }
> {
>   "peer": "117",
>   "last_update": "0'0"
> }
>
>
> [root@ceph-mon-01 ~]#
> I have checked space taken for this PG on storage nodes:
> here is how to check where is particular OSD (on which physical storage node):
> [root@ceph-mon-01 ~]# ceph osd status 2>&1 | grep " 9 "
> |  9  |   stor-a02  | 2063G | 5386G |   52   |  1347k  |   53   |   292k  | 
> exists,up |
> [root@ceph-mon-01 ~]# ceph osd status 2>&1 | grep " 109 "
> | 109 |   stor-a01  | 1285G | 4301G |5   |  31.0k  |6   |  59.2k  | 
> exists,up |
> [root@ceph-mon-01 ~]# watch ceph -s
> [root@ceph-mon-01 ~]# ceph osd status 2>&1 | grep " 117 "
> | 117 |   stor-b02  | 1334G | 4252G |   54   |  1216k  |   13   |  27.4k  | 
> exists,up |
> [root@ceph-mon-01 ~]# ceph osd status 2>&1 | grep " 0 "
> |  0  |   stor-a01  | 2156G | 5293G |   58   |   387k  |   29   |  30.7k  | 
> exists,up |
> [root@ceph-mon-01 ~]#
> and checking sizes on servers:
> stor-a01 (this PG shouldn't be on the same host):
> [root@stor-a01 /var/lib/ceph/osd/ceph-0/current]# du -sh 10.3c4_*
> 2.4G10.3c4_head
> 0   10.3c4_TEMP
> [root@stor-a01 /var/lib/ceph/osd/ceph-109/current]# du -sh 10.3c4_*
> 14G 10.3c4_head
> 0   10.3c4_TEMP
> [root@stor-a01 /var/lib/ceph/osd/ceph-109/current]#
> stor-a02:
> [root@stor-a02 /var/lib/ceph/osd/ceph-9/current]# du -sh 10.3c4_*
> 14G 10.3c4_head
> 0   10.3c4_TEMP
> [root@stor-a02 /var/lib/ceph/osd/ceph-9/current]#
> stor-b02:
> [root@stor-b02 /var/lib/ceph/osd/ceph-117/current]# du -sh 10.3c4_*
> zsh: no matches found: 10.3c4_*
>
> information about ceph:
> [root@ceph-mon-01 ~]# ceph versions
> {
> "mon": {
> "ceph version 12.2.9 (9e300932ef8a8916fb3fda78c58691a6ab0f4217) 
> luminous (stable)": 3
> },
> "mgr": {
> "ceph version 12.2.9 (9e300932ef8a8916fb3fda78c58691a6ab0f4217) 
> luminous (stable)": 3
> },
> "osd": {
> "ceph version 12.2.9 (9e300932ef8a8916fb3fda78c58691a6ab0f4217) 
> luminous (stable)": 144
> },
> "mds": {},
> "rbd-mirror": {
> "ceph version 12.2.9 (9e300932ef8a8916fb3fda78c58691a6ab0f4217) 
> luminous (stable)": 3
> },
> "rgw": {
> "ceph version 12.2.9 (9e300932ef8a8916fb3fda78c58691a6ab0f4217) 
> luminous (stable)": 6
> },
> "overall": {
> "ceph version 12.2.9 (9e300932ef8a8916fb3fda78c58691a6ab0f4217) 
> luminous (stable)": 159
> }
> }
>
> crushmap: https://pastebin.com/cpC2WmyS
> ceph osd tree: https://pastebin.com/XvZ2cNZZ
>
> I'm  cross-posting this do devel because maybe there is some known bug
> in  this  particular  version  of  ceph,  and  You  could  point  some
> directions to fix this problem.
>
> --
> Regards
> Lukasz
>
___
ceph-users mailing list

Re: [ceph-users] typical snapmapper size

2019-06-07 Thread Dan van der Ster
On Thu, Jun 6, 2019 at 8:00 PM Sage Weil  wrote:
>
> Hello RBD users,
>
> Would you mind running this command on a random OSD on your RBD-oriented
> cluster?
>
> ceph-objectstore-tool \
>  --data-path /var/lib/ceph/osd/ceph-NNN \
>  
> '["meta",{"oid":"snapmapper","key":"","snapid":0,"hash":2758339587,"max":0,"pool":-1,"namespace":"","max":0}]'
>  \
>  list-omap | wc -l
>
> ...and share the number of lines along with the overall size and
> utilization % of the OSD?  The OSD needs to be stopped, then run that
> command, then start it up again.
>

6872

ID   CLASS WEIGHT  REWEIGHT SIZEUSE AVAIL   %USE  VAR  PGS
769   hdd 5.45798  1.0 5.46TiB 2.89TiB 2.57TiB 52.98 1.00  34

Not sure how to classify heavy or light use of snapshots; the ceph osd
pool ls detail output is here: https://pastebin.com/CpPwUQgR

-- Dan

> I'm trying to guage how much snapmapper metadata there is in a "typical"
> RBD environment.  If you have some sense of whether your users make
> relatively heavy or light use of snapshots, that would be helpful too!
>
> Thanks!
> sage
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] v12.2.12 mds FAILED assert(session->get_nref() == 1)

2019-06-07 Thread Dan van der Ster
Hi all,

Just a quick heads up, and maybe a check if anyone else is affected.

After upgrading our MDS's from v12.2.11 to v12.2.12, we started
getting crashes with

 /builddir/build/BUILD/ceph-12.2.12/src/mds/MDSRank.cc: 1304:
FAILED assert(session->get_nref() == 1)

I opened a ticket here with more details: http://tracker.ceph.com/issues/40200

Thanks,

Dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] obj_size_info_mismatch error handling

2019-06-03 Thread Dan van der Ster
Hi Reed and Brad,

Did you ever learn more about this problem?
We currently have a few inconsistencies arriving with the same env
(cephfs, v13.2.5) and symptoms.

PG Repair doesn't fix the inconsistency, nor does Brad's omap
workaround earlier in the thread.
In our case, we can fix by cp'ing the file to a new inode, deleting
the inconsistent file, then scrubbing the PG.
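
Concretely, the workaround looks roughly like this (the path and pgid are
placeholders for whatever rados list-inconsistent-obj points at):

  cp -a /cephfs/some/dir/file /cephfs/some/dir/file.new   # new inode, new objects
  rm /cephfs/some/dir/file
  mv /cephfs/some/dir/file.new /cephfs/some/dir/file
  ceph pg deep-scrub <pgid>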

-- Dan


On Fri, May 3, 2019 at 3:18 PM Reed Dier  wrote:
>
> Just to follow up for the sake of the mailing list,
>
> I had not had a chance to attempt your steps yet, but things appear to have 
> worked themselves out on their own.
>
> Both scrub errors cleared without intervention, and I'm not sure if it is the 
> results of that object getting touched in CephFS that triggered the update of 
> the size info, or if something else was able to clear it.
>
> Didn't see anything relating to the clearing in mon, mgr, or osd logs.
>
> So, not entirely sure what fixed it, but it is resolved on its own.
>
> Thanks,
>
> Reed
>
> On Apr 30, 2019, at 8:01 PM, Brad Hubbard  wrote:
>
> On Wed, May 1, 2019 at 10:54 AM Brad Hubbard  wrote:
>
>
> Which size is correct?
>
>
> Sorry, accidental discharge =D
>
> If the object info size is *incorrect* try forcing a write to the OI
> with something like the following.
>
> 1. rados -p [name_of_pool_17] setomapval 10008536718.
> temporary-key anything
> 2. ceph pg deep-scrub 17.2b9
> 3. Wait for the scrub to finish
> 4. rados -p [name_of_pool_2] rmomapkey 10008536718. temporary-key
>
> If the object info size is *correct* you could try just doing a rados
> get followed by a rados put of the object to see if the size is
> updated correctly.
>
> It's more likely the object info size is wrong IMHO.
>
>
> On Tue, Apr 30, 2019 at 1:06 AM Reed Dier  wrote:
>
>
> Hi list,
>
> Woke up this morning to two PG's reporting scrub errors, in a way that I 
> haven't seen before.
>
> $ ceph versions
> {
>"mon": {
>"ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic 
> (stable)": 3
>},
>"mgr": {
>"ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic 
> (stable)": 3
>},
>"osd": {
>"ceph version 13.2.4 (b10be4d44915a4d78a8e06aa31919e74927b142e) mimic 
> (stable)": 156
>},
>"mds": {
>"ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic 
> (stable)": 2
>},
>"overall": {
>"ceph version 13.2.4 (b10be4d44915a4d78a8e06aa31919e74927b142e) mimic 
> (stable)": 156,
>"ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic 
> (stable)": 8
>}
> }
>
>
> OSD_SCRUB_ERRORS 8 scrub errors
> PG_DAMAGED Possible data damage: 2 pgs inconsistent
>pg 17.72 is active+clean+inconsistent, acting [3,7,153]
>pg 17.2b9 is active+clean+inconsistent, acting [19,7,16]
>
>
> Here is what $rados list-inconsistent-obj 17.2b9 --format=json-pretty yields:
>
> {
>"epoch": 134582,
>"inconsistents": [
>{
>"object": {
>"name": "10008536718.",
>"nspace": "",
>"locator": "",
>"snap": "head",
>"version": 0
>},
>"errors": [],
>"union_shard_errors": [
>"obj_size_info_mismatch"
>],
>"shards": [
>{
>"osd": 7,
>"primary": false,
>"errors": [
>"obj_size_info_mismatch"
>],
>"size": 5883,
>"object_info": {
>"oid": {
>"oid": "10008536718.",
>"key": "",
>"snapid": -2,
>"hash": 1752643257,
>"max": 0,
>"pool": 17,
>"namespace": ""
>},
>"version": "134599'448331",
>"prior_version": "134599'448330",
>"last_reqid": "client.1580931080.0:671854",
>"user_version": 448331,
>"size": 3505,
>"mtime": "2019-04-28 15:32:20.003519",
>"local_mtime": "2019-04-28 15:32:25.991015",
>"lost": 0,
>"flags": [
>"dirty",
>"data_digest",
>"omap_digest"
>],
>"truncate_seq": 899,
>"truncate_size": 0,
>"data_digest": "0xf99a3bd3",
>"omap_digest": "0x",
>"expected_object_size": 0,
>"expected_write_size": 0,
>"alloc_hint_flags": 0,
>   

Re: [ceph-users] [events] Ceph Day CERN September 17 - CFP now open!

2019-05-27 Thread Dan van der Ster
Tuesday Sept 17 is indeed the correct day!

We had to move it by one day to get a bigger room... sorry for the confusion.

-- dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Quotas with Mimic (CephFS-FUSE) clients in a Luminous Cluster

2019-05-27 Thread Dan van der Ster
On Mon, May 27, 2019 at 11:54 AM Oliver Freyermuth
 wrote:
>
> Dear Dan,
>
> thanks for the quick reply!
>
> Am 27.05.19 um 11:44 schrieb Dan van der Ster:
> > Hi Oliver,
> >
> > We saw the same issue after upgrading to mimic.
> >
> > IIRC we could make the max_bytes xattr visible by touching an empty
> > file in the dir (thereby updating the dir inode).
> >
> > e.g. touch  /cephfs/user/freyermu/.quota; rm  /cephfs/user/freyermu/.quota
>
> sadly, no, not even with sync's in between:
> -
> $ touch /cephfs/user/freyermu/.quota; sync; rm -f 
> /cephfs/user/freyermu/.quota; sync; getfattr --absolute-names --only-values 
> -n ceph.quota.max_bytes /cephfs/user/freyermu/
> /cephfs/user/freyermu/: ceph.quota.max_bytes: No such attribute
> -
> Also restarting the FUSE client after that does not change it. Maybe this 
> requires the rest of the cluster to be upgraded to work?
> I'm just guessing here, but maybe the MDS needs the file creation / update of 
> the directory inode to "update" the way the quota attributes are exported. If 
> something changed here with Mimic,
> this would explain why the "touch" is needed. And this would also explain why 
> this might only help if the MDS is upgraded to Mimic, too.
>

I think the relevant change which is causing this is the new_snaps in mimic.

Did you already enable them? `ceph fs set cephfs allow_new_snaps 1`

-- dan


> We have scheduled the remaining parts of the upgrade for Wednesday, and worst 
> case could survive until then without quota enforcement, but it's a really 
> strange and unexpected incompatibility.
>
> Cheers,
> Oliver
>
> >
> > Does that work?
> >
> > -- dan
> >
> >
> > On Mon, May 27, 2019 at 11:36 AM Oliver Freyermuth
> >  wrote:
> >>
> >> Dear Cephalopodians,
> >>
> >> in the process of migrating a cluster from Luminous (12.2.12) to Mimic 
> >> (13.2.5), we have upgraded the FUSE clients first (we took the chance 
> >> during a time of low activity),
> >> thinking that this should not cause any issues. All MDS+MON+OSDs are still 
> >> on Luminous, 12.2.12.
> >>
> >> However, it seems quotas have stopped working - with a (FUSE) Mimic client 
> >> (13.2.5), I see:
> >> $ getfattr --absolute-names --only-values -n ceph.quota.max_bytes 
> >> /cephfs/user/freyermu/
> >> /cephfs/user/freyermu/: ceph.quota.max_bytes: No such attribute
> >>
> >> A Luminous client (12.2.12) on the same cluster sees:
> >> $ getfattr --absolute-names --only-values -n ceph.quota.max_bytes 
> >> /cephfs/user/freyermu/
> >> 5
> >>
> >> It does not seem as if the attribute has been renamed (e.g. 
> >> https://github.com/ceph/ceph/blob/mimic/qa/tasks/cephfs/test_quota.py 
> >> still references it, same for the docs),
> >> and I have to assume the clients also do not enforce quota if they do not 
> >> see it.
> >>
> >> Is this a known incompatibility between Mimic clients and a Luminous 
> >> cluster?
> >> The release notes of Mimic only mention that quota support was added to 
> >> the kernel client, but nothing else quota related catches my eye.
> >>
> >> Cheers,
> >>  Oliver
> >>
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> --
> Oliver Freyermuth
> Universität Bonn
> Physikalisches Institut, Raum 1.047
> Nußallee 12
> 53115 Bonn
> --
> Tel.: +49 228 73 2367
> Fax:  +49 228 73 7869
> --
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Quotas with Mimic (CephFS-FUSE) clients in a Luminous Cluster

2019-05-27 Thread Dan van der Ster
Hi Oliver,

We saw the same issue after upgrading to mimic.

IIRC we could make the max_bytes xattr visible by touching an empty
file in the dir (thereby updating the dir inode).

e.g. touch  /cephfs/user/freyermu/.quota; rm  /cephfs/user/freyermu/.quota

Does that work?

-- dan


On Mon, May 27, 2019 at 11:36 AM Oliver Freyermuth
 wrote:
>
> Dear Cephalopodians,
>
> in the process of migrating a cluster from Luminous (12.2.12) to Mimic 
> (13.2.5), we have upgraded the FUSE clients first (we took the chance during 
> a time of low activity),
> thinking that this should not cause any issues. All MDS+MON+OSDs are still on 
> Luminous, 12.2.12.
>
> However, it seems quotas have stopped working - with a (FUSE) Mimic client 
> (13.2.5), I see:
> $ getfattr --absolute-names --only-values -n ceph.quota.max_bytes 
> /cephfs/user/freyermu/
> /cephfs/user/freyermu/: ceph.quota.max_bytes: No such attribute
>
> A Luminous client (12.2.12) on the same cluster sees:
> $ getfattr --absolute-names --only-values -n ceph.quota.max_bytes 
> /cephfs/user/freyermu/
> 5
>
> It does not seem as if the attribute has been renamed (e.g. 
> https://github.com/ceph/ceph/blob/mimic/qa/tasks/cephfs/test_quota.py still 
> references it, same for the docs),
> and I have to assume the clients also do not enforce quota if they do not see 
> it.
>
> Is this a known incompatibility between Mimic clients and a Luminous cluster?
> The release notes of Mimic only mention that quota support was added to the 
> kernel client, but nothing else quota related catches my eye.
>
> Cheers,
> Oliver
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Major ceph disaster

2019-05-23 Thread Dan van der Ster
I think those osds (1, 11, 21, 32, ...) need a little kick to re-peer
their degraded PGs.

Open a window with `watch ceph -s`, then in another window slowly do

ceph osd down 1
# then wait a minute or so for that osd.1 to re-peer fully.
ceph osd down 11
...

Continue that for each of the osds with stuck requests, or until there
are no more recovery_wait/degraded PGs.

After each `ceph osd down...`, you should expect to see several PGs
re-peer, and then ideally the slow requests will disappear and the
degraded PGs will become active+clean.
If anything else happens, you should stop and let us know.


-- dan

On Thu, May 23, 2019 at 10:59 AM Kevin Flöh  wrote:
>
> This is the current status of ceph:
>
>
>cluster:
>  id: 23e72372-0d44-4cad-b24f-3641b14b86f4
>  health: HEALTH_ERR
>  9/125481144 objects unfound (0.000%)
>  Degraded data redundancy: 9/497011417 objects degraded
> (0.000%), 7 pgs degraded
>  9 stuck requests are blocked > 4096 sec. Implicated osds
> 1,11,21,32,43,50,65
>
>services:
>  mon: 3 daemons, quorum ceph-node03,ceph-node01,ceph-node02
>  mgr: ceph-node01(active), standbys: ceph-node01.etp.kit.edu
>  mds: cephfs-1/1/1 up  {0=ceph-node03.etp.kit.edu=up:active}, 3
> up:standby
>  osd: 96 osds: 96 up, 96 in
>
>data:
>  pools:   2 pools, 4096 pgs
>  objects: 125.48M objects, 259TiB
>  usage:   370TiB used, 154TiB / 524TiB avail
>  pgs: 9/497011417 objects degraded (0.000%)
>   9/125481144 objects unfound (0.000%)
>   4078 active+clean
>   11   active+clean+scrubbing+deep
>   7    active+recovery_wait+degraded
>
>io:
>      client:   211KiB/s rd, 46.0KiB/s wr, 158op/s rd, 0op/s wr
>
> On 23.05.19 10:54 vorm., Dan van der Ster wrote:
> > What's the full ceph status?
> > Normally recovery_wait just means that the relevant osd's are busy
> > recovering/backfilling another PG.
> >
> > On Thu, May 23, 2019 at 10:53 AM Kevin Flöh  wrote:
> >> Hi,
> >>
> >> we have set the PGs to recover and now they are stuck in 
> >> active+recovery_wait+degraded and instructing them to deep-scrub does not 
> >> change anything. Hence, the rados report is empty. Is there a way to stop 
> >> the recovery wait to start the deep-scrub and get the output? I guess the 
> >> recovery_wait might be caused by missing objects. Do we need to delete 
> >> them first to get the recovery going?
> >>
> >> Kevin
> >>
> >> On 22.05.19 6:03 nachm., Robert LeBlanc wrote:
> >>
> >> On Wed, May 22, 2019 at 4:31 AM Kevin Flöh  wrote:
> >>> Hi,
> >>>
> >>> thank you, it worked. The PGs are not incomplete anymore. Still we have
> >>> another problem, there are 7 PGs inconsistent and a cpeh pg repair is
> >>> not doing anything. I just get "instructing pg 1.5dd on osd.24 to
> >>> repair" and nothing happens. Does somebody know how we can get the PGs
> >>> to repair?
> >>>
> >>> Regards,
> >>>
> >>> Kevin
> >>
> >> Kevin,
> >>
> >> I just fixed an inconsistent PG yesterday. You will need to figure out why 
> >> they are inconsistent. Do these steps and then we can figure out how to 
> >> proceed.
> >> 1. Do a deep-scrub on each PG that is inconsistent. (This may fix some of 
> >> them)
> >> 2. Print out the inconsistent report for each inconsistent PG. `rados 
> >> list-inconsistent-obj  --format=json-pretty`
> >> 3. You will want to look at the error messages and see if all the shards 
> >> have the same data.
> >>
> >> Robert LeBlanc
> >>
> >>
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Major ceph disaster

2019-05-23 Thread Dan van der Ster
What's the full ceph status?
Normally recovery_wait just means that the relevant osd's are busy
recovering/backfilling another PG.

On Thu, May 23, 2019 at 10:53 AM Kevin Flöh  wrote:
>
> Hi,
>
> we have set the PGs to recover and now they are stuck in 
> active+recovery_wait+degraded and instructing them to deep-scrub does not 
> change anything. Hence, the rados report is empty. Is there a way to stop the 
> recovery wait to start the deep-scrub and get the output? I guess the 
> recovery_wait might be caused by missing objects. Do we need to delete them 
> first to get the recovery going?
>
> Kevin
>
> On 22.05.19 6:03 nachm., Robert LeBlanc wrote:
>
> On Wed, May 22, 2019 at 4:31 AM Kevin Flöh  wrote:
>>
>> Hi,
>>
>> thank you, it worked. The PGs are not incomplete anymore. Still we have
>> another problem, there are 7 PGs inconsistent and a cpeh pg repair is
>> not doing anything. I just get "instructing pg 1.5dd on osd.24 to
>> repair" and nothing happens. Does somebody know how we can get the PGs
>> to repair?
>>
>> Regards,
>>
>> Kevin
>
>
> Kevin,
>
> I just fixed an inconsistent PG yesterday. You will need to figure out why 
> they are inconsistent. Do these steps and then we can figure out how to 
> proceed.
> 1. Do a deep-scrub on each PG that is inconsistent. (This may fix some of 
> them)
> 2. Print out the inconsistent report for each inconsistent PG. `rados 
> list-inconsistent-obj  --format=json-pretty`
> 3. You will want to look at the error messages and see if all the shards have 
> the same data.
>
> Robert LeBlanc
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Crush rule for "ssd first" but without knowing how much

2019-05-23 Thread Dan van der Ster
Did I understand correctly: you have a crush tree with both ssd and
hdd devices, and you want to direct PGs to the ssds, until they reach
some fullness threshold, and only then start directing PGs to the
hdds?

I can't think of a crush rule alone to achieve that. But something you
could do is add all the ssds & hdds to the crush tree, set the hdd
crush weights to 0.0, then start increasing those weights manually
once the ssd's reach 80% full or whatever.
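
A sketch of that manual ramp-up (osd id and weights are made up; the crush
weight is normally the device size in TiB):

  ceph osd crush reweight osd.24 0      # hdd takes no data initially
  # ...later, once the ssds approach your threshold...
  ceph osd crush reweight osd.24 2.0    # increase in steps, watching backfill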

-- dan

On Thu, May 23, 2019 at 10:29 AM Florent B  wrote:
>
> Hi everyone,
>
> I would like to create a crush rule saying to store as much data as
> possible on ssd class OSDs first (then hdd), but without entering how
> much OSDs in the rule (I don't know in advance how much there will be).
>
> Is it possible ? All examples seen on the web are always writing the
> number of OSD to select.
>
> Thank you.
>
> Florent
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Erasure code profiles and crush rules. Missing link...?

2019-05-22 Thread Dan van der Ster
On Wed, May 22, 2019 at 3:03 PM Rainer Krienke  wrote:
>
> Hello,
>
> I created an erasure code profile named ecprofile-42 with the following
> parameters:
>
> $ ceph osd erasure-code-profile set ecprofile-42 plugin=jerasure k=4 m=2
>
> Next I created a new pool using the ec profile from above:
>
> $ ceph osd pool create my_erasure_pool 64 64  erasure ecprofile-42
>
> The pool created then has an autogenerated crush rule with the contents
> as shown at the end of this mail (see: ceph osd crush rule dump
> my_erasure_pool).
>
> What I am missing in the output of the crush rule dump below are the k,m
> values used for this pool or a "link" from the crushrule to the erasure
> code profile that contains these settings and was used creating the pool
> and thus the ec crushrule.  If I had several ec profiles and pools
> created with the different ec profiles how else could I see which k,m
> values were used for the different pools?
>
> For a replicated crush rule there is the size parameter which is part of
> the crush-rule and indirectly tells you the number of replicas, but what
> about erasure coded pools?

Is this what you're looking for?

# ceph osd pool ls detail  -f json | jq .[0].erasure_code_profile
"jera_4plus2"

-- Dan


>
> Probably the link I am looking for is there somewhere, but I didn't find
> it yet...
>
> Thanks Rainer
>
> #
> # autogenerated crush rule my_erasure_pool:
> #
> $ ceph osd crush rule dump my_erasure_pool
> {
> "rule_id": 1,
> "rule_name": "my_erasure_pool",
> "ruleset": 1,
> "type": 3,
> "min_size": 3,
> "max_size": 6,
> "steps": [
> {
> "op": "set_chooseleaf_tries",
> "num": 5
> },
> {
> "op": "set_choose_tries",
> "num": 100
> },
> {
> "op": "take",
> "item": -1,
> "item_name": "default"
> },
> {
> "op": "chooseleaf_indep",
> "num": 0,
> "type": "host"
> },
> {
> "op": "emit"
> }
> ]
> }
>
> --
> Rainer Krienke, Uni Koblenz, Rechenzentrum, A22, Universitaetsstrasse 1
> 56070 Koblenz, Tel: +49261287 1312 Fax +49261287 100 1312
> Web: http://userpages.uni-koblenz.de/~krienke
> PGP: http://userpages.uni-koblenz.de/~krienke/mypgp.html
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Major ceph disaster

2019-05-14 Thread Dan van der Ster
On Tue, May 14, 2019 at 5:13 PM Kevin Flöh  wrote:
>
> ok, so now we see at least a diffrence in the recovery state:
>
>  "recovery_state": [
>  {
>  "name": "Started/Primary/Peering/Incomplete",
>  "enter_time": "2019-05-14 14:15:15.650517",
>  "comment": "not enough complete instances of this PG"
>  },
>  {
>  "name": "Started/Primary/Peering",
>  "enter_time": "2019-05-14 14:15:15.243756",
>  "past_intervals": [
>  {
>  "first": "49767",
>  "last": "59580",
>  "all_participants": [
>  {
>  "osd": 2,
>  "shard": 0
>  },
>  {
>  "osd": 4,
>  "shard": 1
>  },
>  {
>  "osd": 23,
>  "shard": 2
>  },
>  {
>  "osd": 24,
>  "shard": 0
>  },
>  {
>  "osd": 72,
>  "shard": 1
>  },
>  {
>  "osd": 79,
>  "shard": 3
>  }
>  ],
>  "intervals": [
>  {
>  "first": "59562",
>  "last": "59563",
>  "acting": "4(1),24(0),79(3)"
>  },
>  {
>  "first": "59564",
>  "last": "59567",
>  "acting": "23(2),24(0),79(3)"
>  },
>  {
>  "first": "59570",
>  "last": "59574",
>  "acting": "4(1),23(2),79(3)"
>  },
>  {
>  "first": "59577",
>  "last": "59580",
>  "acting": "4(1),23(2),24(0)"
>  }
>  ]
>  }
>  ],
>  "probing_osds": [
>      "2(0)",
>  "4(1)",
>  "23(2)",
>  "24(0)",
>  "72(1)",
>  "79(3)"
>  ],
>  "down_osds_we_would_probe": [],
>  "peering_blocked_by": []
>  },
>  {
>  "name": "Started",
>  "enter_time": "2019-05-14 14:15:15.243663"
>  }
>  ],
>
> the peering does not seem to be blocked anymore. But still there is no
> recovery going on. Is there anything else we can try?

What is the state of the hdd's which had osds 4 & 23?
You may be able to use ceph-objectstore-tool to export those PG shards
and import to another operable OSD.
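
Roughly, and untested here (osd ids, the EC shard suffix and the file path
are assumptions; the source and destination OSDs must be stopped while the
tool runs):

  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-4 \
      --op export --pgid 1.5dds1 --file /tmp/1.5dds1.export
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-NN \
      --op import --file /tmp/1.5dds1.export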

-- dan



>
>
> On 14.05.19 11:02 vorm., Dan van der Ster wrote:
> > On Tue, May 14, 2019 at 10:59 AM Kevin Flöh  wrote:
> >>
> >> On 14.05.19 10:08 vorm., Dan van der Ster wrote:
> >>
> >> On Tue, May 14, 2019 at 10:02 AM Kevin Flöh  wrote:
> >>
> >> On 13.05.19 10:51 nachm., Lionel Bouton wrote:
> >>
> >> Le 13/05/2019 à 16:20, Kevin Flöh a écrit :
> >>
> >> Dear ceph experts,
> >>
> >> [...] We have 4 nodes with 24 osds each and use 3+1 erasure coding. [...]
> >> Here is what happened: One osd daemon could not be started and
> >> therefore we decided to mark the osd as lost and set it up from
> >> scratch. Ceph started recovering and then we lost anothe

Re: [ceph-users] Major ceph disaster

2019-05-14 Thread Dan van der Ster
On Tue, May 14, 2019 at 10:59 AM Kevin Flöh  wrote:
>
>
> On 14.05.19 10:08 vorm., Dan van der Ster wrote:
>
> On Tue, May 14, 2019 at 10:02 AM Kevin Flöh  wrote:
>
> On 13.05.19 10:51 nachm., Lionel Bouton wrote:
>
> Le 13/05/2019 à 16:20, Kevin Flöh a écrit :
>
> Dear ceph experts,
>
> [...] We have 4 nodes with 24 osds each and use 3+1 erasure coding. [...]
> Here is what happened: One osd daemon could not be started and
> therefore we decided to mark the osd as lost and set it up from
> scratch. Ceph started recovering and then we lost another osd with
> the same behavior. We did the same as for the first osd.
>
> With 3+1 you only allow a single OSD failure per pg at a given time.
> You have 4096 pgs and 96 osds, having 2 OSD fail at the same time on 2
> separate servers (assuming standard crush rules) is a death sentence
> for the data on some pgs using both of those OSD (the ones not fully
> recovered before the second failure).
>
> OK, so the 2 OSDs (4,23) failed shortly one after the other but we think
> that the recovery of the first was finished before the second failed.
> Nonetheless, both problematic pgs have been on both OSDs. We think, that
> we still have enough shards left. For one of the pgs, the recovery state
> looks like this:
>
>  "recovery_state": [
>  {
>  "name": "Started/Primary/Peering/Incomplete",
>  "enter_time": "2019-05-09 16:11:48.625966",
>  "comment": "not enough complete instances of this PG"
>  },
>  {
>  "name": "Started/Primary/Peering",
>  "enter_time": "2019-05-09 16:11:48.611171",
>  "past_intervals": [
>  {
>  "first": "49767",
>  "last": "59313",
>  "all_participants": [
>  {
>  "osd": 2,
>  "shard": 0
>  },
>  {
>  "osd": 4,
>  "shard": 1
>  },
>  {
>  "osd": 23,
>  "shard": 2
>  },
>  {
>  "osd": 24,
>  "shard": 0
>  },
>  {
>  "osd": 72,
>  "shard": 1
>  },
>  {
>  "osd": 79,
>  "shard": 3
>  }
>  ],
>  "intervals": [
>  {
>  "first": "58860",
>  "last": "58861",
>  "acting": "4(1),24(0),79(3)"
>  },
>  {
>  "first": "58875",
>  "last": "58877",
>  "acting": "4(1),23(2),24(0)"
>  },
>  {
>  "first": "59002",
>  "last": "59009",
>  "acting": "4(1),23(2),79(3)"
>  },
>  {
>  "first": "59010",
>  "last": "59012",
>  "acting": "2(0),4(1),23(2),79(3)"
>  },
>  {
>  "first": "59197",
>  "last": "59233",
>  "acting": "23(2),24(0),79(3)"
>  },
>  {
>  "first": "59234",
>  "last": "59313",
>  "acting": "23(2),24(0),72(1),79(3)"
>  

Re: [ceph-users] Major ceph disaster

2019-05-14 Thread Dan van der Ster
On Tue, May 14, 2019 at 10:02 AM Kevin Flöh  wrote:
>
> On 13.05.19 10:51 nachm., Lionel Bouton wrote:
> > Le 13/05/2019 à 16:20, Kevin Flöh a écrit :
> >> Dear ceph experts,
> >>
> >> [...] We have 4 nodes with 24 osds each and use 3+1 erasure coding. [...]
> >> Here is what happened: One osd daemon could not be started and
> >> therefore we decided to mark the osd as lost and set it up from
> >> scratch. Ceph started recovering and then we lost another osd with
> >> the same behavior. We did the same as for the first osd.
> >
> > With 3+1 you only allow a single OSD failure per pg at a given time.
> > You have 4096 pgs and 96 osds, having 2 OSD fail at the same time on 2
> > separate servers (assuming standard crush rules) is a death sentence
> > for the data on some pgs using both of those OSD (the ones not fully
> > recovered before the second failure).
>
> OK, so the 2 OSDs (4,23) failed shortly one after the other but we think
> that the recovery of the first was finished before the second failed.
> Nonetheless, both problematic pgs have been on both OSDs. We think, that
> we still have enough shards left. For one of the pgs, the recovery state
> looks like this:
>
>  "recovery_state": [
>  {
>  "name": "Started/Primary/Peering/Incomplete",
>  "enter_time": "2019-05-09 16:11:48.625966",
>  "comment": "not enough complete instances of this PG"
>  },
>  {
>  "name": "Started/Primary/Peering",
>  "enter_time": "2019-05-09 16:11:48.611171",
>  "past_intervals": [
>  {
>  "first": "49767",
>  "last": "59313",
>  "all_participants": [
>  {
>  "osd": 2,
>  "shard": 0
>  },
>  {
>  "osd": 4,
>  "shard": 1
>  },
>  {
>  "osd": 23,
>  "shard": 2
>  },
>  {
>  "osd": 24,
>  "shard": 0
>  },
>  {
>  "osd": 72,
>  "shard": 1
>  },
>  {
>  "osd": 79,
>  "shard": 3
>  }
>  ],
>  "intervals": [
>  {
>  "first": "58860",
>  "last": "58861",
>  "acting": "4(1),24(0),79(3)"
>  },
>  {
>  "first": "58875",
>  "last": "58877",
>  "acting": "4(1),23(2),24(0)"
>  },
>  {
>  "first": "59002",
>  "last": "59009",
>  "acting": "4(1),23(2),79(3)"
>  },
>  {
>  "first": "59010",
>  "last": "59012",
>  "acting": "2(0),4(1),23(2),79(3)"
>  },
>  {
>  "first": "59197",
>  "last": "59233",
>  "acting": "23(2),24(0),79(3)"
>  },
>  {
>  "first": "59234",
>  "last": "59313",
>  "acting": "23(2),24(0),72(1),79(3)"
>  }
>  ]
>  }
>  ],
>  "probing_osds": [
>  "2(0)",
>  "4(1)",
>  "23(2)",
>  "24(0)",
>  "72(1)",
>  "79(3)"
>  ],
>  "down_osds_we_would_probe": [],
>  "peering_blocked_by": [],
>  "peering_blocked_by_detail": [
>  {
>  "detail": "peering_blocked_by_history_les_bound"
>  }
>  ]
>  },
>  {
>  "name": "Started",
>  "enter_time": "2019-05-09 16:11:48.611121"
>  }
>  ],
> Is there a chance to recover this pg from the shards on OSDs 2, 72, 79?
> ceph pg repair/deep-scrub/scrub did not work.

repair/scrub are not related to this problem so they won't help.

How exactly did you use the osd_find_best_info_ignore_history_les option?

One correct procedure would be to set it to true in ceph.conf, then
restart each of 

Re: [ceph-users] Major ceph disaster

2019-05-13 Thread Dan van der Ster
Presumably the 2 OSDs you marked as lost were hosting those incomplete PGs?
It would be useful to double confirm that: check with `ceph pg 
query` and `ceph pg dump`.
(If so, this is why the ignore history les thing isn't helping; you
don't have the minimum 3 stripes up for those 3+1 PGs.)

If those "lost" OSDs by some miracle still have the PG data, you might
be able to export the relevant PG stripes with the
ceph-objectstore-tool. I've never tried this myself, but there have
been threads in the past where people export a PG from a nearly dead
hdd, import to another OSD, then backfilling works.

If OTOH those PGs are really lost forever, and someone else should
confirm what I say here, I think the next step would be to force
recreate the incomplete PGs then run a set of cephfs scrub/repair
disaster recovery cmds to recover what you can from the cephfs.
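
As a very rough last-resort sketch (the pgid and mds name are placeholders;
read the cephfs disaster-recovery docs before running any of this):

  ceph osd force-create-pg 1.5dd
  # then let cephfs find and repair whatever referenced the lost objects
  ceph daemon mds.<name> scrub_path / recursive repair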

-- dan


On Mon, May 13, 2019 at 4:20 PM Kevin Flöh  wrote:
>
> Dear ceph experts,
>
> we have several (maybe related) problems with our ceph cluster, let me
> first show you the current ceph status:
>
>cluster:
>  id: 23e72372-0d44-4cad-b24f-3641b14b86f4
>  health: HEALTH_ERR
>  1 MDSs report slow metadata IOs
>  1 MDSs report slow requests
>  1 MDSs behind on trimming
>  1/126319678 objects unfound (0.000%)
>  19 scrub errors
>  Reduced data availability: 2 pgs inactive, 2 pgs incomplete
>  Possible data damage: 7 pgs inconsistent
>  Degraded data redundancy: 1/500333881 objects degraded
> (0.000%), 1 pg degraded
>  118 stuck requests are blocked > 4096 sec. Implicated osds
> 24,32,91
>
>services:
>  mon: 3 daemons, quorum ceph-node03,ceph-node01,ceph-node02
>  mgr: ceph-node01(active), standbys: ceph-node01.etp.kit.edu
>  mds: cephfs-1/1/1 up  {0=ceph-node02.etp.kit.edu=up:active}, 3
> up:standby
>  osd: 96 osds: 96 up, 96 in
>
>data:
>  pools:   2 pools, 4096 pgs
>  objects: 126.32M objects, 260TiB
>  usage:   372TiB used, 152TiB / 524TiB avail
>  pgs: 0.049% pgs not active
>   1/500333881 objects degraded (0.000%)
>   1/126319678 objects unfound (0.000%)
>   4076 active+clean
>   10   active+clean+scrubbing+deep
>   7    active+clean+inconsistent
>   2    incomplete
>   1    active+recovery_wait+degraded
>
>io:
>  client:   449KiB/s rd, 42.9KiB/s wr, 152op/s rd, 0op/s wr
>
>
> and ceph health detail:
>
>
> HEALTH_ERR 1 MDSs report slow metadata IOs; 1 MDSs report slow requests;
> 1 MDSs behind on trimming; 1/126319687 objects unfound (0.000%); 19
> scrub errors; Reduced data availability: 2 pgs inactive, 2 pgs
> incomplete; Possible data damage: 7 pgs inconsistent; Degraded data
> redundancy: 1/500333908 objects degraded (0.000%), 1 pg degraded; 118
> stuck requests are blocked > 4096 sec. Implicated osds 24,32,91
> MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs
>  mdsceph-node02.etp.kit.edu(mds.0): 100+ slow metadata IOs are
> blocked > 30 secs, oldest blocked for 351193 secs
> MDS_SLOW_REQUEST 1 MDSs report slow requests
>  mdsceph-node02.etp.kit.edu(mds.0): 4 slow requests are blocked > 30 sec
> MDS_TRIM 1 MDSs behind on trimming
>  mdsceph-node02.etp.kit.edu(mds.0): Behind on trimming (46034/128)
> max_segments: 128, num_segments: 46034
> OBJECT_UNFOUND 1/126319687 objects unfound (0.000%)
>  pg 1.24c has 1 unfound objects
> OSD_SCRUB_ERRORS 19 scrub errors
> PG_AVAILABILITY Reduced data availability: 2 pgs inactive, 2 pgs incomplete
>  pg 1.5dd is incomplete, acting [24,4,23,79] (reducing pool ec31
> min_size from 3 may help; search ceph.com/docs for 'incomplete')
>  pg 1.619 is incomplete, acting [91,23,4,81] (reducing pool ec31
> min_size from 3 may help; search ceph.com/docs for 'incomplete')
> PG_DAMAGED Possible data damage: 7 pgs inconsistent
>  pg 1.17f is active+clean+inconsistent, acting [65,49,25,4]
>  pg 1.1e0 is active+clean+inconsistent, acting [11,32,4,81]
>  pg 1.203 is active+clean+inconsistent, acting [43,49,4,72]
>  pg 1.5d3 is active+clean+inconsistent, acting [37,27,85,4]
>  pg 1.779 is active+clean+inconsistent, acting [50,4,77,62]
>  pg 1.77c is active+clean+inconsistent, acting [21,49,40,4]
>  pg 1.7c3 is active+clean+inconsistent, acting [1,14,68,4]
> PG_DEGRADED Degraded data redundancy: 1/500333908 objects degraded
> (0.000%), 1 pg degraded
>  pg 1.24c is active+recovery_wait+degraded, acting [32,4,61,36], 1
> unfound
> REQUEST_STUCK 118 stuck requests are blocked > 4096 sec. Implicated osds
> 24,32,91
>  118 ops are blocked > 536871 sec
>  osds 24,32,91 have stuck requests > 536871 sec
>
>
> Let me briefly summarize the setup: We have 4 nodes with 24 osds each
> and use 3+1 erasure coding. The nodes run on centos7 and we use, due to
> a major mistake when setting up the cluster, more than one 

Re: [ceph-users] co-located cephfs client deadlock

2019-05-02 Thread Dan van der Ster
On the stuck client:

  cat /sys/kernel/debug/ceph/*/osdc

REQUESTS 0 homeless 0
LINGER REQUESTS
BACKOFFS
REQUESTS 1 homeless 0
245540 osd100 1.9443e2a5 1.2a5 [100,1,75]/100 [100,1,75]/100 e74658
fsvolumens_393f2dcc-6b09-44d7-8d20-0e84b072ed26/2000b2f5905.0001
0x400024 1 write
LINGER REQUESTS
BACKOFFS

osd.100 is clearly there ^^
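
The second column of that stuck request names the OSD, which is the one we
then restarted, i.e. something like:

  systemctl restart ceph-osd@100      # on the host running osd.100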

-- dan

On Thu, May 2, 2019 at 9:25 AM Marc Roos  wrote:
>
>
> How did you retrieve which osd nr to restart?
>
> Just for future reference, in case I run into a similar situation: if
> you have a client hang on an osd node, can this be resolved by
> restarting the osd that it is reading from?
>
>
>
>
> -----Original Message-
> From: Dan van der Ster [mailto:d...@vanderster.com]
> Sent: donderdag 2 mei 2019 8:51
> To: Yan, Zheng
> Cc: ceph-users; pablo.llo...@cern.ch
> Subject: Re: [ceph-users] co-located cephfs client deadlock
>
> On Mon, Apr 1, 2019 at 1:46 PM Yan, Zheng  wrote:
> >
> > On Mon, Apr 1, 2019 at 6:45 PM Dan van der Ster 
> wrote:
> > >
> > > Hi all,
> > >
> > > We have been benchmarking a hyperconverged cephfs cluster (kernel
> > > clients + osd on same machines) for awhile. Over the weekend (for
> > > the first time) we had one cephfs mount deadlock while some clients
> > > were running ior.
> > >
> > > All the ior processes are stuck in D state with this stack:
> > >
> > > [] wait_on_page_bit+0x83/0xa0
> > > [] __filemap_fdatawait_range+0x111/0x190
> > > [] filemap_fdatawait_range+0x14/0x30
> > > [] filemap_write_and_wait_range+0x56/0x90
> > > [] ceph_fsync+0x55/0x420 [ceph]
> > > [] do_fsync+0x67/0xb0
> > > [] SyS_fsync+0x10/0x20
> > > [] system_call_fastpath+0x22/0x27
> > > [] 0x
> > >
> >
> > are there hang osd requests in /sys/kernel/debug/ceph/xxx/osdc?
>
> We never managed to reproduce on this cluster.
>
> But on a separate (not co-located) cluster we had a similar issue. A
> client was stuck like this for several hours:
>
> HEALTH_WARN 1 clients failing to respond to capability release; 1 MDSs
> report slow requests MDS_CLIENT_LATE_RELEASE 1 clients failing to
> respond to capability release
> mdscephflax-mds-2a4cfd0e2c(mds.1): Client hpc070.cern.ch:hpcscid02
> failing to respond to capability release client_id: 69092525
> MDS_SLOW_REQUEST 1 MDSs report slow requests
> mdscephflax-mds-2a4cfd0e2c(mds.1): 1 slow requests are blocked > 30
> sec
>
>
> Indeed there was a hung write on hpc070.cern.ch:
>
> 245540  osd100  1.9443e2a5 1.2a5   [100,1,75]/100  [100,1,75]/100
> e74658
> fsvolumens_393f2dcc-6b09-44d7-8d20-0e84b072ed26/2000b2f5905.0001
> 0x4000241 write
>
> I restarted osd.100 and the deadlocked request went away.
> Does this sound like a known issue?
>
> Thanks, Dan
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] co-located cephfs client deadlock

2019-05-02 Thread Dan van der Ster
On Mon, Apr 1, 2019 at 1:46 PM Yan, Zheng  wrote:
>
> On Mon, Apr 1, 2019 at 6:45 PM Dan van der Ster  wrote:
> >
> > Hi all,
> >
> > We have been benchmarking a hyperconverged cephfs cluster (kernel
> > clients + osd on same machines) for awhile. Over the weekend (for the
> > first time) we had one cephfs mount deadlock while some clients were
> > running ior.
> >
> > All the ior processes are stuck in D state with this stack:
> >
> > [] wait_on_page_bit+0x83/0xa0
> > [] __filemap_fdatawait_range+0x111/0x190
> > [] filemap_fdatawait_range+0x14/0x30
> > [] filemap_write_and_wait_range+0x56/0x90
> > [] ceph_fsync+0x55/0x420 [ceph]
> > [] do_fsync+0x67/0xb0
> > [] SyS_fsync+0x10/0x20
> > [] system_call_fastpath+0x22/0x27
> > [] 0x
> >
>
> are there hang osd requests in /sys/kernel/debug/ceph/xxx/osdc?

We never managed to reproduce on this cluster.

But on a separate (not co-located) cluster we had a similar issue. A
client was stuck like this for several hours:

HEALTH_WARN 1 clients failing to respond to capability release; 1 MDSs
report slow requests
MDS_CLIENT_LATE_RELEASE 1 clients failing to respond to capability release
mdscephflax-mds-2a4cfd0e2c(mds.1): Client hpc070.cern.ch:hpcscid02
failing to respond to capability release client_id: 69092525
MDS_SLOW_REQUEST 1 MDSs report slow requests
mdscephflax-mds-2a4cfd0e2c(mds.1): 1 slow requests are blocked > 30 sec


Indeed there was a hung write on hpc070.cern.ch:

245540  osd100  1.9443e2a5 1.2a5   [100,1,75]/100  [100,1,75]/100
e74658  fsvolumens_393f2dcc-6b09-44d7-8d20-0e84b072ed26/2000b2f5905.0001
0x4000241 write

I restarted osd.100 and the deadlocked request went away.
Does this sound like a known issue?

Thanks, Dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Data distribution question

2019-04-30 Thread Dan van der Ster
On Tue, Apr 30, 2019 at 9:01 PM Igor Podlesny  wrote:
>
> On Wed, 1 May 2019 at 01:26, Igor Podlesny  wrote:
> > On Wed, 1 May 2019 at 01:01, Dan van der Ster  wrote:
> > >> > The upmap balancer in v12.2.12 works really well... Perfectly uniform 
> > >> > on our clusters.
> > >>
> > >> mode upmap ?
> > >
> > > yes, mgr balancer, mode upmap.
>
> Also -- do your CEPHs have single root hierarchy pools (like
> "default"), or there're some pools that use non-default ones?
>
> Looking through docs I didn't find a way to narrow balancer's scope
> down to specific pool(s), although personally I would prefer it to
> operate on a small set of them.
>

We have a mix of both single and dual root hierarchies -- the upmap
balancer works for all.
(E.g. this works: pool A with 3 replicas in root A, pool B with 3
replicas in root B.
However if you have a cluster with two roots, and a pool that does
something complex like put 2 replicas in root A and 1 replica in root
B -- I haven't tested that recently).

In luminous and mimic there isn't a way to scope the auto balancing
down to limited pools.
In practice that doesn't really matter, because of how it works, roughly:

while true:
   select a random pool
   get the pg distribution for that pool
   create upmaps (or remove existing upmaps) to balance the pgs for that pool
   sleep 60s

Eventually it attacks all pools and gets them fully balanced. (It
anyway spends most of the time balancing the pools that matter,
because the ones that don't have data get "balanced" quickly).
If you absolutely must limit the pools, you have to script something
to loop on `ceph balancer optimize myplan ; ceph balancer exec
myplan`
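
Something like this rough sketch (untested; note the full subcommand
name is `execute`, and `reset` just discards the pending plan):

    while true ; do
        ceph balancer optimize myplan
        ceph balancer show myplan      # eyeball the plan here and skip it if it touches pools you don't care about
        ceph balancer execute myplan
        ceph balancer reset
        sleep 60
    done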

Something to reiterate: v12.2.12 has the latest upmap balancing
heuristics, which are miles better than 12.2.11. (Big thanks to Xie
Xingguo who worked hard to get this right!!!)
Mimic v13.2.5 doesn't have those fixes (maybe in the pipeline for
13.2.6?) and I haven't checked Nautilus.
If you're on mimic, then it's upmap balancer heuristics are better
than nothing, but it might be imperfect or not work in certain cases
(e.g. multi-root).

-- Dan


> --
> End of message. Next message?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Data distribution question

2019-04-30 Thread Dan van der Ster
On Tue, Apr 30, 2019 at 8:26 PM Igor Podlesny  wrote:
>
> On Wed, 1 May 2019 at 01:01, Dan van der Ster  wrote:
> >> > The upmap balancer in v12.2.12 works really well... Perfectly uniform on 
> >> > our clusters.
> >>
> >> mode upmap ?
> >
> > yes, mgr balancer, mode upmap.
>
> I see. Was it a matter of just:
>
> 1) ceph balancer mode upmap
> 2) ceph balancer on
>
> or were there any other steps?

All of the clients need to be luminous or newer:

# ceph osd set-require-min-compat-client luminous

You need to enable the module:

# ceph mgr module enable balancer

You probably don't want it to run 24/7:

# ceph config-key set mgr/balancer/begin_time 0800
# ceph config-key set mgr/balancer/end_time 1800

The default rate at which it balances things is a bit too high for my taste:

# ceph config-key set mgr/balancer/max_misplaced 0.005
# ceph config-key set mgr/balancer/upmap_max_iterations 2

(Those above are optional... YMMV)

Now fail the active mgr so that the new one reads those new options above.

# ceph mgr fail 

Enable the upmap mode:

# ceph balancer mode upmap

Test it once to see that it works at all:

# ceph balancer optimize myplan
# ceph balancer show myplan
# ceph balancer reset

(any errors, start debugging -- use debug_mgr = 4/5 and check the
active mgr's log for the balancer details.)

# ceph balancer on

Now it'll start moving the PGs around until things are quite well balanced.
In our clusters that process takes a week or two... it depends on
cluster size, numpgs, etc...
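
You can keep an eye on progress with the usual suspects, e.g.:

    ceph balancer status
    ceph -s | grep misplaced
    ceph osd df     # watch the MIN/MAX VAR and STDDEV line at the end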

Hope that helps!

Dan

>
> --
> End of message. Next message?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Data distribution question

2019-04-30 Thread Dan van der Ster
Removing pools won't make a difference.

Read up to slide 22 here:
https://www.slideshare.net/mobile/Inktank_Ceph/ceph-day-berlin-mastering-ceph-operations-upmap-and-the-mgr-balancer

..
Dan

(Apologies for terseness, I'm mobile)



On Tue, 30 Apr 2019, 20:02 Shain Miley,  wrote:

> Here is the per pool pg_num info:
>
> 'data' pg_num 64
> 'metadata' pg_num 64
> 'rbd' pg_num 64
> 'npr_archive' pg_num 6775
> '.rgw.root' pg_num 64
> '.rgw.control' pg_num 64
> '.rgw' pg_num 64
> '.rgw.gc' pg_num 64
> '.users.uid' pg_num 64
> '.users.email' pg_num 64
> '.users' pg_num 64
> '.usage' pg_num 64
> '.rgw.buckets.index' pg_num 128
> '.intent-log' pg_num 8
> '.rgw.buckets' pg_num 64
> 'kube' pg_num 512
> '.log' pg_num 8
>
> Here is the df output:
>
> GLOBAL:
>  SIZE     AVAIL    RAW USED  %RAW USED
>  1.06PiB  306TiB   778TiB    71.75
> POOLS:
>  NAME                ID  USED     %USED  MAX AVAIL  OBJECTS
>  data                0   11.7GiB  0.14   8.17TiB    3006
>  metadata            1   0B       0      8.17TiB    0
>  rbd                 2   43.2GiB  0.51   8.17TiB    11147
>  npr_archive         3   258TiB   97.93  5.45TiB    82619649
>  .rgw.root           4   1001B    0      8.17TiB    5
>  .rgw.control        5   0B       0      8.17TiB    8
>  .rgw                6   6.16KiB  0      8.17TiB    35
>  .rgw.gc             7   0B       0      8.17TiB    32
>  .users.uid          8   0B       0      8.17TiB    0
>  .users.email        9   0B       0      8.17TiB    0
>  .users              10  0B       0      8.17TiB    0
>  .usage              11  0B       0      8.17TiB    1
>  .rgw.buckets.index  12  0B       0      8.17TiB    26
>  .intent-log         17  0B       0      5.45TiB    0
>  .rgw.buckets        18  24.2GiB  0.29   8.17TiB    6622
>  kube                21  1.82GiB  0.03   5.45TiB    550
>  .log                22  0B       0      5.45TiB    176
>
>
> The stuff in the data pool and the rwg pools is old data that we used
> for testing...if you guys think that removing everything outside of rbd
> and npr_archive would make a significant impact I will give it a try.
>
> Thanks,
>
> Shain
>
>
>
> On 4/30/19 1:15 PM, Jack wrote:
> > Hi,
> >
> > I see that you are using rgw
> > RGW comes with many pools, yet most of them are used for metadata and
> > configuration, those do not store many data
> > Such pools do not need more than a couple PG, each (I use pg_num = 8)
> >
> > You need to allocate your pg on pool that actually stores the data
> >
> > Please do the following, to let us know more:
> > Print the pg_num per pool:
> > for i in $(rados lspools); do echo -n "$i: "; ceph osd pool get $i
> > pg_num; done
> >
> > Print the usage per pool:
> > ceph df
> >
> > Also, instead of doing a "ceph osd reweight-by-utilization", check out
> > the balancer plugin :
> > http://docs.ceph.com/docs/mimic/mgr/balancer/
> >
> > Finally, in nautilus, the pg can now upscale and downscale automatically
> > See
> > https://ceph.com/rados/new-in-nautilus-pg-merging-and-autotuning/
> >
> >
> > On 04/30/2019 06:34 PM, Shain Miley wrote:
> >> Hi,
> >>
> >> We have a cluster with 235 osd's running version 12.2.11 with a
> >> combination of 4 and 6 TB drives.  The data distribution across osd's
> >> varies from 52% to 94%.
> >>
> >> I have been trying to figure out how to get this a bit more balanced as
> >> we are running into 'backfillfull' issues on a regular basis.
> >>
> >> I've tried adding more pgs...but this did not seem to do much in terms
> >> of the imbalance.
> >>
> >> Here is the end output from 'ceph osd df':
> >>
> >> MIN/MAX VAR: 0.73/1.31  STDDEV: 7.73
> >>
> >> We have 8199 pgs total with 6775 of them in the pool that has 97% of the
> >> data.
> >>
> >> The other pools are not really used (data, metadata, .rgw.root,
> >> .rgw.control, etc).  I have thought about deleting those unused pools so
> >> that most if not all the pgs are being used by the pool with the
> >> majority of the data.
> >>
> >> However...before I do that...there anything else I can do or try in
> >> order to see if I can balance out the data more uniformly?
> >>
> >> Thanks in advance,
> >>
> >> Shain
> >>
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> >

Re: [ceph-users] Data distribution question

2019-04-30 Thread Dan van der Ster
On Tue, 30 Apr 2019, 19:32 Igor Podlesny,  wrote:

> On Wed, 1 May 2019 at 00:24, Dan van der Ster  wrote:
> >
> > The upmap balancer in v12.2.12 works really well... Perfectly uniform on
> our clusters.
> >
> > .. Dan
>
> mode upmap ?
>

yes, mgr balancer, mode upmap.

..  Dan



> --
> End of message. Next message?
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Data distribution question

2019-04-30 Thread Dan van der Ster
The upmap balancer in v12.2.12 works really well... Perfectly uniform on
our clusters.

.. Dan


On Tue, 30 Apr 2019, 19:22 Kenneth Van Alstyne, 
wrote:

> Unfortunately it looks like he’s still on Luminous, but if upgrading is an
> option, the options are indeed significantly better.  If I recall
> correctly, at least the balancer module is available in Luminous.
>
> Thanks,
>
> --
> Kenneth Van Alstyne
> Systems Architect
> Knight Point Systems, LLC
> Service-Disabled Veteran-Owned Business
> 1775 Wiehle Avenue Suite 101 | Reston, VA 20190
> c: 228-547-8045 f: 571-266-3106
> www.knightpoint.com
> DHS EAGLE II Prime Contractor: FC1 SDVOSB Track
> GSA Schedule 70 SDVOSB: GS-35F-0646S
> GSA MOBIS Schedule: GS-10F-0404Y
> ISO 9001 / ISO 2 / ISO 27001 / CMMI Level 3
>
> Notice: This e-mail message, including any attachments, is for the sole
> use of the intended recipient(s) and may contain confidential and
> privileged information. Any unauthorized review, copy, use, disclosure,
> or distribution is STRICTLY prohibited. If you are not the intended
> recipient, please contact the sender by reply e-mail and destroy all copies
> of the original message.
>
> On Apr 30, 2019, at 12:15 PM, Jack  wrote:
>
> Hi,
>
> I see that you are using rgw
> RGW comes with many pools, yet most of them are used for metadata and
> configuration, those do not store many data
> Such pools do not need more than a couple PG, each (I use pg_num = 8)
>
> You need to allocate your pg on pool that actually stores the data
>
> Please do the following, to let us know more:
> Print the pg_num per pool:
> for i in $(rados lspools); do echo -n "$i: "; ceph osd pool get $i
> pg_num; done
>
> Print the usage per pool:
> ceph df
>
> Also, instead of doing a "ceph osd reweight-by-utilization", check out
> the balancer plugin : http://docs.ceph.com/docs/mimic/mgr/balancer/
>
> Finally, in nautilus, the pg can now upscale and downscale automatically
> See https://ceph.com/rados/new-in-nautilus-pg-merging-and-autotuning/
>
>
> On 04/30/2019 06:34 PM, Shain Miley wrote:
>
> Hi,
>
> We have a cluster with 235 osd's running version 12.2.11 with a
> combination of 4 and 6 TB drives.  The data distribution across osd's
> varies from 52% to 94%.
>
> I have been trying to figure out how to get this a bit more balanced as
> we are running into 'backfillfull' issues on a regular basis.
>
> I've tried adding more pgs...but this did not seem to do much in terms
> of the imbalance.
>
> Here is the end output from 'ceph osd df':
>
> MIN/MAX VAR: 0.73/1.31  STDDEV: 7.73
>
> We have 8199 pgs total with 6775 of them in the pool that has 97% of the
> data.
>
> The other pools are not really used (data, metadata, .rgw.root,
> .rgw.control, etc).  I have thought about deleting those unused pools so
> that most if not all the pgs are being used by the pool with the
> majority of the data.
>
> However...before I do that...there anything else I can do or try in
> order to see if I can balance out the data more uniformly?
>
> Thanks in advance,
>
> Shain
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Were fixed CephFS lock ups when it's running on nodes with OSDs?

2019-04-23 Thread Dan van der Ster
On Mon, 22 Apr 2019, 22:20 Gregory Farnum,  wrote:

> On Sat, Apr 20, 2019 at 9:29 AM Igor Podlesny  wrote:
> >
> > I remember seeing reports in this regard but it's been a while now.
> > Can anyone tell?
>
> No, this hasn't changed. It's unlikely it ever will; I think NFS
> resolved the issue but it took a lot of ridiculous workarounds and
> imposes a permanent memory cost on the client.
>

On the other hand, we've been running osds and local kernel mounts through
some ior stress testing and managed to lock up only one node, only once
(and that was with a 2TB shared output file).

Maybe the necessary memory pressure conditions get less likely as the
number of clients and osds gets larger? (i.e. it's probably easy to trigger
with one single node/osd because all IO is local, but for large clusters
most IO is remote).

.. Dan


-Greg
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Save the date: Ceph Day for Research @ CERN -- Sept 16, 2019

2019-04-15 Thread Dan van der Ster
Hey Cephalopods!

This is an early heads up that we are planning a Ceph Day event at
CERN in Geneva, Switzerland on September 16, 2019 [1].

For this Ceph Day, we want to focus on use-cases and solutions for
research, academia, or other non-profit applications [2].

Registration and call for proposals will be available by mid-May.

All the Best,

Dan van der Ster
CERN IT Department
Ceph Governing Board, Academic Liaison

[1] Sept 16 is the day after CERN Open Days, where there will be
plenty to visit on our campus if you arrive a couple of days before
https://home.cern/news/news/cern/cern-open-days-explore-future-us

[2] Marine biologists studying actual Cephalopods with Ceph are
especially welcome ;-)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how to trigger offline filestore merge

2019-04-09 Thread Dan van der Ster
Hi again,

Thanks to a hint from another user I seem to have gotten past this.

The trick was to restart the osds with a positive merge threshold (10)
then cycle through rados bench several hundred times, e.g.

   while true ; do rados bench -p default.rgw.buckets.index 10 write -b 4096 -t 128 ; sleep 5 ; done
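
The positive merge threshold itself is just a config change picked up at
restart, e.g. something like this in ceph.conf:

   [osd]
   filestore merge threshold = 10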

After running that for awhile the PG filestore structure has merged
down and now listing the pool and backfilling are back to normal.

Thanks!

Dan


On Tue, Apr 9, 2019 at 7:05 PM Dan van der Ster  wrote:
>
> Hi all,
>
> We have a slight issue while trying to migrate a pool from filestore
> to bluestore.
>
> This pool used to have 20 million objects in filestore -- it now has
> 50,000. During its life, the filestore pgs were internally split
> several times, but never merged. Now the pg _head dirs have mostly
> empty directories.
> This creates some problems:
>
>   1. rados ls -p  hangs a long time, eventually triggering slow
> requests while the filestore_op threads time out. (They time out while
> listing the collections).
>   2. backfilling from these PGs is impossible, similarly because
> listing the objects to backfill eventually leads to the osd flapping.
>
> So I want to merge the filestore pgs.
>
> I tried ceph-objectstore-tool --op apply-layout-settings, but it seems
> that this only splits, not merges?
>
> Does someone have a better idea?
>
> Thanks!
>
> Dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] how to trigger offline filestore merge

2019-04-09 Thread Dan van der Ster
Hi all,

We have a slight issue while trying to migrate a pool from filestore
to bluestore.

This pool used to have 20 million objects in filestore -- it now has
50,000. During its life, the filestore pgs were internally split
several times, but never merged. Now the pg _head dirs have mostly
empty directories.
This creates some problems:

  1. rados ls -p  hangs a long time, eventually triggering slow
requests while the filestore_op threads time out. (They time out while
listing the collections).
  2. backfilling from these PGs is impossible, similarly because
listing the objects to backfill eventually leads to the osd flapping.

So I want to merge the filestore pgs.

I tried ceph-objectstore-tool --op apply-layout-settings, but it seems
that this only splits, not merges?

Does someone have a better idea?

Thanks!

Dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd_memory_target exceeding on Luminous OSD BlueStore

2019-04-08 Thread Dan van der Ster
Which OS are you using?
With CentOS we find that the heap is not always automatically
released. (You can check the heap freelist with `ceph tell osd.0 heap
stats`).
As a workaround we run this hourly:

ceph tell mon.* heap release
ceph tell osd.* heap release
ceph tell mds.* heap release
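
E.g. roughly this in a cron file, on a node with a client.admin keyring
(path and user here are just illustrative):

    # /etc/cron.d/ceph-heap-release
    0 * * * * root ceph tell mon.'*' heap release; ceph tell osd.'*' heap release; ceph tell mds.'*' heap release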

-- Dan

On Sat, Apr 6, 2019 at 1:30 PM Olivier Bonvalet  wrote:
>
> Hi,
>
> on a Luminous 12.2.11 deploiement, my bluestore OSD exceed the
> osd_memory_target :
>
> daevel-ob@ssdr712h:~$ ps auxw | grep ceph-osd
> ceph3646 17.1 12.0 6828916 5893136 ? Ssl  mars29 1903:42 
> /usr/bin/ceph-osd -f --cluster ceph --id 143 --setuser ceph --setgroup ceph
> ceph3991 12.9 11.2 6342812 5485356 ? Ssl  mars29 1443:41 
> /usr/bin/ceph-osd -f --cluster ceph --id 144 --setuser ceph --setgroup ceph
> ceph4361 16.9 11.8 6718432 5783584 ? Ssl  mars29 1889:41 
> /usr/bin/ceph-osd -f --cluster ceph --id 145 --setuser ceph --setgroup ceph
> ceph4731 19.7 12.2 6949584 5982040 ? Ssl  mars29 2198:47 
> /usr/bin/ceph-osd -f --cluster ceph --id 146 --setuser ceph --setgroup ceph
> ceph5073 16.7 11.6 6639568 5701368 ? Ssl  mars29 1866:05 
> /usr/bin/ceph-osd -f --cluster ceph --id 147 --setuser ceph --setgroup ceph
> ceph5417 14.6 11.2 6386764 5519944 ? Ssl  mars29 1634:30 
> /usr/bin/ceph-osd -f --cluster ceph --id 148 --setuser ceph --setgroup ceph
> ceph5760 16.9 12.0 6806448 5879624 ? Ssl  mars29 1882:42 
> /usr/bin/ceph-osd -f --cluster ceph --id 149 --setuser ceph --setgroup ceph
> ceph6105 16.0 11.6 6576336 5694556 ? Ssl  mars29 1782:52 
> /usr/bin/ceph-osd -f --cluster ceph --id 150 --setuser ceph --setgroup ceph
>
> daevel-ob@ssdr712h:~$ free -m
>               total        used        free      shared  buff/cache   available
> Mem:          47771       45210        1643          17         917       43556
> Swap:             0           0           0
>
> # ceph daemon osd.147 config show | grep memory_target
> "osd_memory_target": "4294967296",
>
>
> And there is no recovery / backfilling, the cluster is fine :
>
>$ ceph status
>  cluster:
>id: de035250-323d-4cf6-8c4b-cf0faf6296b1
>health: HEALTH_OK
>
>  services:
>mon: 5 daemons, quorum tolriq,tsyne,olkas,lorunde,amphel
>mgr: tsyne(active), standbys: olkas, tolriq, lorunde, amphel
>osd: 120 osds: 116 up, 116 in
>
>  data:
>pools:   20 pools, 12736 pgs
>objects: 15.29M objects, 31.1TiB
>usage:   101TiB used, 75.3TiB / 177TiB avail
>pgs: 12732 active+clean
> 4 active+clean+scrubbing+deep
>
>  io:
>client:   72.3MiB/s rd, 26.8MiB/s wr, 2.30kop/s rd, 1.29kop/s wr
>
>
>On an other host, in the same pool, I see also high memory usage :
>
>daevel-ob@ssdr712g:~$ ps auxw | grep ceph-osd
>ceph6287  6.6 10.6 6027388 5190032 ? Ssl  mars21 1511:07 
> /usr/bin/ceph-osd -f --cluster ceph --id 131 --setuser ceph --setgroup ceph
>ceph6759  7.3 11.2 6299140 5484412 ? Ssl  mars21 1665:22 
> /usr/bin/ceph-osd -f --cluster ceph --id 132 --setuser ceph --setgroup ceph
>ceph7114  7.0 11.7 6576168 5756236 ? Ssl  mars21 1612:09 
> /usr/bin/ceph-osd -f --cluster ceph --id 133 --setuser ceph --setgroup ceph
>ceph7467  7.4 11.1 6244668 5430512 ? Ssl  mars21 1704:06 
> /usr/bin/ceph-osd -f --cluster ceph --id 134 --setuser ceph --setgroup ceph
>ceph7821  7.7 11.1 6309456 5469376 ? Ssl  mars21 1754:35 
> /usr/bin/ceph-osd -f --cluster ceph --id 135 --setuser ceph --setgroup ceph
>ceph8174  6.9 11.6 6545224 5705412 ? Ssl  mars21 1590:31 
> /usr/bin/ceph-osd -f --cluster ceph --id 136 --setuser ceph --setgroup ceph
>ceph8746  6.6 11.1 6290004 5477204 ? Ssl  mars21 1511:11 
> /usr/bin/ceph-osd -f --cluster ceph --id 137 --setuser ceph --setgroup ceph
>ceph9100  7.7 11.6 6552080 5713560 ? Ssl  mars21 1757:22 
> /usr/bin/ceph-osd -f --cluster ceph --id 138 --setuser ceph --setgroup ceph
>
>But ! On a similar host, in a different pool, the problem is less visible :
>
>daevel-ob@ssdr712i:~$ ps auxw | grep ceph-osd
>ceph3617  2.8  9.9 5660308 4847444 ? Ssl  mars29 313:05 
> /usr/bin/ceph-osd -f --cluster ceph --id 151 --setuser ceph --setgroup ceph
>ceph3958  2.3  9.8 5661936 4834320 ? Ssl  mars29 256:55 
> /usr/bin/ceph-osd -f --cluster ceph --id 152 --setuser ceph --setgroup ceph
>ceph4299  2.3  9.8 5620616 4807248 ? Ssl  mars29 266:26 
> /usr/bin/ceph-osd -f --cluster ceph --id 153 --setuser ceph --setgroup ceph
>ceph4643  2.3  9.6 5527724 4713572 ? Ssl  mars29 262:50 
> /usr/bin/ceph-osd -f --cluster ceph --id 154 --setuser ceph --setgroup ceph
>ceph5016  2.2  9.7 5597504 4783412 ? Ssl  mars29 248:37 
> /usr/bin/ceph-osd -f --cluster ceph --id 155 --setuser ceph 

Re: [ceph-users] co-located cephfs client deadlock

2019-04-01 Thread Dan van der Ster
It's the latest CentOS 7.6 kernel. Known pain there?

The user was running a 1.95TiB ior benchmark -- so, trying to do
parallel writes to one single 1.95TiB file.
We have
  max_file_size 2199023255552  (exactly 2 TiB)
so it should fit.

Thanks!
Dan


On Mon, Apr 1, 2019 at 1:06 PM Paul Emmerich  wrote:
>
> Which kernel version are you using? We've had lots of problems with
> random deadlocks in kernels with cephfs but 4.19 seems to be pretty
> stable.
>
>
> Paul
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
>
> On Mon, Apr 1, 2019 at 12:45 PM Dan van der Ster  wrote:
> >
> > Hi all,
> >
> > We have been benchmarking a hyperconverged cephfs cluster (kernel
> > clients + osd on same machines) for awhile. Over the weekend (for the
> > first time) we had one cephfs mount deadlock while some clients were
> > running ior.
> >
> > All the ior processes are stuck in D state with this stack:
> >
> > [] wait_on_page_bit+0x83/0xa0
> > [] __filemap_fdatawait_range+0x111/0x190
> > [] filemap_fdatawait_range+0x14/0x30
> > [] filemap_write_and_wait_range+0x56/0x90
> > [] ceph_fsync+0x55/0x420 [ceph]
> > [] do_fsync+0x67/0xb0
> > [] SyS_fsync+0x10/0x20
> > [] system_call_fastpath+0x22/0x27
> > [] 0x
> >
> > We tried restarting the co-located OSDs, and tried evicting the
> > client, but the processes stay deadlocked.
> >
> > We've seen the recent issue related to co-location
> > (https://bugzilla.redhat.com/show_bug.cgi?id=1665248) but we don't
> > have the `usercopy` warning in dmesg.
> >
> > Are there other known issues related to co-locating?
> >
> > Thanks!
> > Dan
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] co-located cephfs client deadlock

2019-04-01 Thread Dan van der Ster
Hi all,

We have been benchmarking a hyperconverged cephfs cluster (kernel
clients + osd on same machines) for awhile. Over the weekend (for the
first time) we had one cephfs mount deadlock while some clients were
running ior.

All the ior processes are stuck in D state with this stack:

[] wait_on_page_bit+0x83/0xa0
[] __filemap_fdatawait_range+0x111/0x190
[] filemap_fdatawait_range+0x14/0x30
[] filemap_write_and_wait_range+0x56/0x90
[] ceph_fsync+0x55/0x420 [ceph]
[] do_fsync+0x67/0xb0
[] SyS_fsync+0x10/0x20
[] system_call_fastpath+0x22/0x27
[] 0x

We tried restarting the co-located OSDs, and tried evicting the
client, but the processes stay deadlocked.

We've seen the recent issue related to co-location
(https://bugzilla.redhat.com/show_bug.cgi?id=1665248) but we don't
have the `usercopy` warning in dmesg.

Are there other known issues related to co-locating?

Thanks!
Dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] "No space left on device" when deleting a file

2019-03-26 Thread Dan van der Ster
See http://tracker.ceph.com/issues/38849

As an immediate workaround you can increase `mds bal fragment size
max` to 200000 (which will increase the max number of strays to 2
million.)
(Try injecting that option to the mds's -- I think it is read at runtime).
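
E.g. something along these lines:

    ceph tell mds.* injectargs '--mds_bal_fragment_size_max 200000'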

And you don't need to stop the mds's and flush the journal/scan_links/etc...
Just find where the old rm'd hardlinks are and ls -l their directories
-- that should be enough to remove the files from stray.

-- Dan


On Tue, Mar 26, 2019 at 5:50 PM Toby Darling  wrote:
>
> Hi
>
> [root@ceph1 ~]# ceph version
> ceph version 13.2.4 (b10be4d44915a4d78a8e06aa31919e74927b142e) mimic
> (stable)
>
> We've run into a "No space left on device" issue when trying to delete a
> file, despite there being free space:
>
> [root@ceph1 ~]# ceph df
> GLOBAL:
> SIZE     AVAIL    RAW USED  %RAW USED
> 3.2 PiB  871 TiB  2.3 PiB   73.42
> POOLS:
> NAME                   ID  USED     %USED  MAX AVAIL  OBJECTS
> mds_nvme               3   2.7 MiB  0      527 GiB    2636225
> compressed_ecpool      4   2.3 PiB  92.13  197 TiB    858679630
> ecpool_comp_TEST_ONLY  5   0 B      0      197 TiB    0
>
> Creating files/directories is fine.
>
> We do have 1M strays:
>
> [root@ceph1 ~]# ceph daemon mds.ceph1 perf dump | grep num_strays
> "num_strays": 100,
> "num_strays_delayed": 0,
> "num_strays_enqueuing": 0,
>
> and I found a post from 2016
> (http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-October/013646.html)
> that suggests:
>
>   run ' ceph daemon mds.xxx flush journal' to flush MDS journal
>   stop all mds
>   run 'cephfs-data-scan scan_links'
>   restart mds
>   run 'ceph daemon mds.x scrub_path / recursive repair'
>
> That was for jewel, is this still the recommended action for mimic?
>
> Cheers
> Toby
> --
> Toby Darling, Scientific Computing (2N249)
> MRC Laboratory of Molecular Biology
> Francis Crick Avenue
> Cambridge Biomedical Campus
> Cambridge CB2 0QH
> Phone 01223 267070
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs manila snapshots best practices

2019-03-22 Thread Dan van der Ster
Perfect, thanks for that input.
So, in your experience the async snaptrim mechanism doesn't make it as
transparent as scrubbing etc.?
On the rbd side, the impact of snaptrim seems invisible to us.

-- dan


On Fri, Mar 22, 2019 at 3:42 PM Paul Emmerich 
wrote:

> I wouldn't give users the ability to perform snapshots directly on
> Ceph unless you have full control over your users or fully trust them.
> Too easy to ruin your day by creating lots of small files and lots of
> snapshots that will wreck your performance...
>
> Also, snapshots aren't really accounted in their quota by CephFS :/
>
>
> Paul
> On Wed, Mar 20, 2019 at 4:34 PM Dan van der Ster 
> wrote:
> >
> > Hi all,
> >
> > We're currently upgrading our cephfs (managed by OpenStack Manila)
> > clusters to Mimic, and want to start enabling snapshots of the file
> > shares.
> > There are different ways to approach this, and I hope someone can
> > share their experiences with:
> >
> > 1. Do you give users the 's' flag in their cap, so that they can
> > create snapshots themselves? We're currently planning *not* to do this
> > -- we'll create snapshots for the users.
> > 2. We want to create periodic snaps for all cephfs volumes. I can see
> > pros/cons to creating the snapshots in /volumes/.snap or in
> > /volumes/_nogroup//.snap. Any experience there? Or maybe even
> > just an fs-wide snap in /.snap is the best approach ?
> > 3. I found this simple cephfs-snap script which should do the job:
> > http://images.45drives.com/ceph/cephfs/cephfs-snap  Does anyone have a
> > different recommendation?
> >
> > Thanks!
> >
> > Dan
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-volume lvm batch OSD replacement

2019-03-21 Thread Dan van der Ster
On Thu, Mar 21, 2019 at 12:14 PM Eugen Block  wrote:
>
> Hi Dan,
>
> I don't know about keeping the osd-id but I just partially recreated
> your scenario. I wiped one OSD and recreated it. You are trying to
> re-use the existing block.db-LV with the device path (--block.db
> /dev/vg-name/lv-name) instead the lv notation (--block.db
> vg-name/lv-name):
>
> > # ceph-volume lvm create --data /dev/sdq --block.db
> > /dev/ceph-094c06db-98dc-47f6-a7e5-1092b099b372/osd-block-db-fa0e7927-dc3e-44d0-a8ce-1d8202fa75dd
> > --osd-id 240
>
> This fails in my test, too. But if I use the LV notation it works:
>
> ceph-2:~ # ceph-volume lvm create --data /dev/sda --block.db
> ceph-journals/journal-osd3
> [...]
> Running command: /bin/systemctl enable --runtime ceph-osd@3
> Running command: /bin/systemctl start ceph-osd@3
> --> ceph-volume lvm activate successful for osd ID: 3
> --> ceph-volume lvm create successful for: /dev/sda
>

Yes that's it! Worked for me too.
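
i.e. in our case the same create command as before, just without the
/dev/ prefix on the db LV -- something like:

    ceph-volume lvm create --data /dev/sdq \
        --block.db ceph-094c06db-98dc-47f6-a7e5-1092b099b372/osd-block-db-fa0e7927-dc3e-44d0-a8ce-1d8202fa75dd \
        --osd-id 240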

Thanks!

Dan


> This is a Nautilus test cluster, but I remember having this on a
> Luminous cluster, too. I hope this helps.
>
> Regards,
> Eugen
>
>
> Zitat von Dan van der Ster :
>
> > On Tue, Mar 19, 2019 at 12:25 PM Dan van der Ster  
> > wrote:
> >>
> >> On Tue, Mar 19, 2019 at 12:17 PM Alfredo Deza  wrote:
> >> >
> >> > On Tue, Mar 19, 2019 at 7:00 AM Alfredo Deza  wrote:
> >> > >
> >> > > On Tue, Mar 19, 2019 at 6:47 AM Dan van der Ster
> >>  wrote:
> >> > > >
> >> > > > Hi all,
> >> > > >
> >> > > > We've just hit our first OSD replacement on a host created with
> >> > > > `ceph-volume lvm batch` with mixed hdds+ssds.
> >> > > >
> >> > > > The hdd /dev/sdq was prepared like this:
> >> > > ># ceph-volume lvm batch /dev/sd[m-r] /dev/sdac --yes
> >> > > >
> >> > > > Then /dev/sdq failed and was then zapped like this:
> >> > > >   # ceph-volume lvm zap /dev/sdq --destroy
> >> > > >
> >> > > > The zap removed the pv/vg/lv from sdq, but left behind the db on
> >> > > > /dev/sdac (see P.S.)
> >> > >
> >> > > That is correct behavior for the zap command used.
> >> > >
> >> > > >
> >> > > > Now we're replaced /dev/sdq and we're wondering how to proceed. We 
> >> > > > see
> >> > > > two options:
> >> > > >   1. reuse the existing db lv from osd.240 (Though the osd fsid will
> >> > > > change when we re-create, right?)
> >> > >
> >> > > This is possible but you are right that in the current state, the FSID
> >> > > and other cluster data exist in the LV metadata. To reuse this LV for
> >> > > a new (replaced) OSD
> >> > > then you would need to zap the LV *without* the --destroy flag, which
> >> > > would clear all metadata on the LV and do a wipefs. The command would
> >> > > need the full path to
> >> > > the LV associated with osd.240, something like:
> >> > >
> >> > > ceph-volume lvm zap /dev/ceph-osd-lvs/db-lv-240
> >> > >
> >> > > >   2. remove the db lv from sdac then run
> >> > > > # ceph-volume lvm batch /dev/sdq /dev/sdac
> >> > > >  which should do the correct thing.
> >> > >
> >> > > This would also work if the db lv is fully removed with --destroy
> >> > >
> >> > > >
> >> > > > This is all v12.2.11 btw.
> >> > > > If (2) is the prefered approached, then it looks like a bug that the
> >> > > > db lv was not destroyed by lvm zap --destroy.
> >> > >
> >> > > Since /dev/sdq was passed in to zap, just that one device was removed,
> >> > > so this is working as expected.
> >> > >
> >> > > Alternatively, zap has the ability to destroy or zap LVs associated
> >> > > with an OSD ID. I think this is not released yet for Luminous but
> >> > > should be in the next release (which seems to be what you want)
> >> >
> >> > Seems like 12.2.11 was released with the ability to zap by OSD ID. You
> >> > can also zap by OSD FSID, both way will zap (and optionally destroy if
> >> > using --destroy)
> >> > all LVs associated with the OSD.
> >> >
> >

Re: [ceph-users] cephfs manila snapshots best practices

2019-03-21 Thread Dan van der Ster
On Thu, Mar 21, 2019 at 1:50 PM Tom Barron  wrote:
>
> On 20/03/19 16:33 +0100, Dan van der Ster wrote:
> >Hi all,
> >
> >We're currently upgrading our cephfs (managed by OpenStack Manila)
> >clusters to Mimic, and want to start enabling snapshots of the file
> >shares.
> >There are different ways to approach this, and I hope someone can
> >share their experiences with:
> >
> >1. Do you give users the 's' flag in their cap, so that they can
> >create snapshots themselves? We're currently planning *not* to do this
> >-- we'll create snapshots for the users.
> >2. We want to create periodic snaps for all cephfs volumes. I can see
> >pros/cons to creating the snapshots in /volumes/.snap or in
> >/volumes/_nogroup//.snap. Any experience there? Or maybe even
> >just an fs-wide snap in /.snap is the best approach ?
> >3. I found this simple cephfs-snap script which should do the job:
> >http://images.45drives.com/ceph/cephfs/cephfs-snap  Does anyone have a
> >different recommendation?
> >
> >Thanks!
> >
> >Dan
> >___
> >ceph-users mailing list
> >ceph-users@lists.ceph.com
> >http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> Dan,
>
> Manila of course provides users with self-service file share snapshot
> capability with quota control of the snapshots.  I'm sure you are
> aware of this but just wanted to get it on record in this thread.
>
> Snapshots are not enabled by default for cephfs native or cephfs with
> nfs in Manila because cephfs snapshots were experimental when the
> cephfs driver was added and we maintain backwards compatability in
> the Manila configuration.  To enable, one sets:
>
>cephfs_enable_snapshots = True
>
> in the configuration stanza for cephfsnative or cephfsnfs back end.
>
> Also, the ``share_type`` referenced when creating shares (either
> explicitly or the default one) needs to have the snapshot_support
> capability enabled -- e.g. the cloud admin would (one time) issue a
> command like the following:
>
>   $ manila type-key  set snapshot_support=True
>
> With this approach either the user or the administrator can create
> snapshots of file shares.
>
> Dan, I expect you have your reasons for choosing to control snapshots
> via a script that calls cephfs-snap directly rather than using Manila
> -- and of course that's fine -- but if you'd share them it will help
> us Manila developers consider whether there are use cases that we are
> not currently addressing that we should consider.
>

Hi Tom, Thanks for the detailed response.
The majority of our users are coming from ZFS/NFS Filers, where
they've gotten used to zfs-auto-snapshots, which we create for them
periodically with some retention. So accidental deletions or
overwrites are never a problem because they can quickly access
yesterday's files.
So our initial idea was to replicate this with CephFS/Manila.
I hadn't thought of using the Manila managed snapshots for these
auto-snaps -- it is indeed another option. Have you already considered
Manila-managed auto-snapshots?

Otherwise, I wonder if CephFS would work well with both the fs-wide
auto-snaps *and* user-managed Manila snapshots. Has anyone tried such
a thing?

Thanks!

dan



> Thanks,
>
> -- Tom Barron
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS: effects of using hard links

2019-03-21 Thread Dan van der Ster
On Thu, Mar 21, 2019 at 8:51 AM Gregory Farnum  wrote:
>
> On Wed, Mar 20, 2019 at 6:06 PM Dan van der Ster  wrote:
>>
>> On Tue, Mar 19, 2019 at 9:43 AM Erwin Bogaard  
>> wrote:
>> >
>> > Hi,
>> >
>> >
>> >
>> > For a number of applications we use, there is a lot of file duplication. 
>> > This wastes precious storage space, which I would like to avoid.
>> >
>> > When using a local disk, I can use a hard link to let all duplicate files 
>> > point to the same inode (use “rdfind”, for example).
>> >
>> >
>> >
>> > As there isn’t any deduplication in Ceph(FS) I’m wondering if I can use 
>> > hard links on CephFS in the same way as I use for ‘regular’ file systems 
>> > like ext4 and xfs.
>> >
>> > 1. Is it advisable to use hard links on CephFS? (It isn’t in the ‘best 
>> > practices’: http://docs.ceph.com/docs/master/cephfs/app-best-practices/)
>> >
>> > 2. Is there any performance (dis)advantage?
>> >
>> > 3. When using hard links, is there an actual space savings, or is there 
>> > some trickery happening?
>> >
>> > 4. Are there any issues (other than the regular hard link ‘gotcha’s’) I 
>> > need to keep in mind combining hard links with CephFS?
>>
>> The only issue we've seen is if you hardlink b to a, then rm a, then
>> never stat b, the inode is added to the "stray" directory. By default
>> there is a limit of 1 million stray entries -- so if you accumulate
>> files in this state eventually users will be unable to rm any files,
>> until you stat the `b` files.
>
>
> Eek. Do you know if we have any tickets about that issue? It's easy to see 
> how that happens but definitely isn't a good user experience!

I'm not aware of a ticket -- I had thought it was just a fact of life
with hardlinks and cephfs.
After hitting this issue in prod, we found the explanation here in
this old thread (with your useful post ;) ):

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-October/013621.html

Our immediate workaround was to increase mds bal fragment size max
(e.g. to 200000).
In our env we now monitor num_strays in case these get out of control again.
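
E.g. something simple along the lines of:

    ceph daemon mds.$name perf dump | grep num_strays

run against the active MDS (with $name being whatever your mds is
called) and fed into our monitoring.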

BTW, now thinking about this more... isn't directory fragmentation
supposed to let the stray dir grow to unlimited shards? (on our side
it seems limited to 10 shards). Maybe this is just some configuration
issue on our side?

-- dan



> -Greg
>
>>
>>
>> -- dan
>>
>>
>> -- dan
>>
>>
>> >
>> >
>> >
>> > Thanks
>> >
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] cephfs manila snapshots best practices

2019-03-20 Thread Dan van der Ster
Hi all,

We're currently upgrading our cephfs (managed by OpenStack Manila)
clusters to Mimic, and want to start enabling snapshots of the file
shares.
There are different ways to approach this, and I hope someone can
share their experiences with:

1. Do you give users the 's' flag in their cap, so that they can
create snapshots themselves? We're currently planning *not* to do this
-- we'll create snapshots for the users.
2. We want to create periodic snaps for all cephfs volumes. I can see
pros/cons to creating the snapshots in /volumes/.snap or in
/volumes/_nogroup//.snap. Any experience there? Or maybe even
just an fs-wide snap in /.snap is the best approach ?
3. I found this simple cephfs-snap script which should do the job:
http://images.45drives.com/ceph/cephfs/cephfs-snap  Does anyone have a
different recommendation?
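
(For context, such a script basically boils down to a mkdir in the .snap
directory plus some retention logic -- e.g. roughly:

    mkdir /cephfs/volumes/.snap/scheduled-$(date +%Y-%m-%d-%H%M)

with the mount point and naming here just illustrative.)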

Thanks!

Dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS: effects of using hard links

2019-03-20 Thread Dan van der Ster
On Tue, Mar 19, 2019 at 9:43 AM Erwin Bogaard  wrote:
>
> Hi,
>
>
>
> For a number of applications we use, there is a lot of file duplication. This 
> wastes precious storage space, which I would like to avoid.
>
> When using a local disk, I can use a hard link to let all duplicate files 
> point to the same inode (use “rdfind”, for example).
>
>
>
> As there isn’t any deduplication in Ceph(FS) I’m wondering if I can use hard 
> links on CephFS in the same way as I use for ‘regular’ file systems like ext4 
> and xfs.
>
> 1. Is it advisable to use hard links on CephFS? (It isn’t in the ‘best 
> practices’: http://docs.ceph.com/docs/master/cephfs/app-best-practices/)
>
> 2. Is there any performance (dis)advantage?
>
> 3. When using hard links, is there an actual space savings, or is there some 
> trickery happening?
>
> 4. Are there any issues (other than the regular hard link ‘gotcha’s’) I need 
> to keep in mind combining hard links with CephFS?

The only issue we've seen is if you hardlink b to a, then rm a, then
never stat b, the inode is added to the "stray" directory. By default
there is a limit of 1 million stray entries -- so if you accumulate
files in this state eventually users will be unable to rm any files,
until you stat the `b` files.

-- dan


-- dan


>
>
>
> Thanks
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-volume lvm batch OSD replacement

2019-03-20 Thread Dan van der Ster
On Tue, Mar 19, 2019 at 12:25 PM Dan van der Ster  wrote:
>
> On Tue, Mar 19, 2019 at 12:17 PM Alfredo Deza  wrote:
> >
> > On Tue, Mar 19, 2019 at 7:00 AM Alfredo Deza  wrote:
> > >
> > > On Tue, Mar 19, 2019 at 6:47 AM Dan van der Ster  
> > > wrote:
> > > >
> > > > Hi all,
> > > >
> > > > We've just hit our first OSD replacement on a host created with
> > > > `ceph-volume lvm batch` with mixed hdds+ssds.
> > > >
> > > > The hdd /dev/sdq was prepared like this:
> > > ># ceph-volume lvm batch /dev/sd[m-r] /dev/sdac --yes
> > > >
> > > > Then /dev/sdq failed and was then zapped like this:
> > > >   # ceph-volume lvm zap /dev/sdq --destroy
> > > >
> > > > The zap removed the pv/vg/lv from sdq, but left behind the db on
> > > > /dev/sdac (see P.S.)
> > >
> > > That is correct behavior for the zap command used.
> > >
> > > >
> > > > Now we're replaced /dev/sdq and we're wondering how to proceed. We see
> > > > two options:
> > > >   1. reuse the existing db lv from osd.240 (Though the osd fsid will
> > > > change when we re-create, right?)
> > >
> > > This is possible but you are right that in the current state, the FSID
> > > and other cluster data exist in the LV metadata. To reuse this LV for
> > > a new (replaced) OSD
> > > then you would need to zap the LV *without* the --destroy flag, which
> > > would clear all metadata on the LV and do a wipefs. The command would
> > > need the full path to
> > > the LV associated with osd.240, something like:
> > >
> > > ceph-volume lvm zap /dev/ceph-osd-lvs/db-lv-240
> > >
> > > >   2. remove the db lv from sdac then run
> > > > # ceph-volume lvm batch /dev/sdq /dev/sdac
> > > >  which should do the correct thing.
> > >
> > > This would also work if the db lv is fully removed with --destroy
> > >
> > > >
> > > > This is all v12.2.11 btw.
> > > > If (2) is the prefered approached, then it looks like a bug that the
> > > > db lv was not destroyed by lvm zap --destroy.
> > >
> > > Since /dev/sdq was passed in to zap, just that one device was removed,
> > > so this is working as expected.
> > >
> > > Alternatively, zap has the ability to destroy or zap LVs associated
> > > with an OSD ID. I think this is not released yet for Luminous but
> > > should be in the next release (which seems to be what you want)
> >
> > Seems like 12.2.11 was released with the ability to zap by OSD ID. You
> > can also zap by OSD FSID, both way will zap (and optionally destroy if
> > using --destroy)
> > all LVs associated with the OSD.
> >
> > Full examples on this can be found here:
> >
> > http://docs.ceph.com/docs/luminous/ceph-volume/lvm/zap/#removing-devices
> >
> >
>
> Ohh that's an improvement! (Our goal is outsourcing the failure
> handling to non-ceph experts, so this will help simplify things.)
>
> In our example, the operator needs to know the osd id, then can do:
>
> 1. ceph-volume lvm zap --destroy --osd-id 240 (wipes sdq and removes
> the lvm from sdac for osd.240)
> 2. replace the hdd
> 3. ceph-volume lvm batch /dev/sdq /dev/sdac --osd-ids 240
>
> But I just remembered that the --osd-ids flag hasn't been backported
> to luminous, so we can't yet do that. I guess we'll follow the first
> (1) procedure to re-use the existing db lv.

Hmm... re-using the db lv didn't work.

We zapped it (see https://pastebin.com/N6PwpbYu) then got this error
when trying to create:

# ceph-volume lvm create --data /dev/sdq --block.db
/dev/ceph-094c06db-98dc-47f6-a7e5-1092b099b372/osd-block-db-fa0e7927-dc3e-44d0-a8ce-1d8202fa75dd
--osd-id 240
Running command: /bin/ceph-authtool --gen-print-key
Running command: /bin/ceph --cluster ceph --name client.bootstrap-osd
--keyring /var/lib/ceph/bootstrap-osd/ceph.keyring osd tree -f json
Running command: /bin/ceph --cluster ceph --name client.bootstrap-osd
--keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new
9f63b457-37e0-4e33-971e-c0fc24658b65 240
Running command: vgcreate --force --yes
ceph-8ef05e54-8909-49f8-951d-0f9d37aeba45 /dev/sdq
 stdout: Physical volume "/dev/sdq" successfully created.
 stdout: Volume group "ceph-8ef05e54-8909-49f8-951d-0f9d37aeba45"
successfully created
Running command: lvcreate --yes -l 100%FREE -n
osd-block-9f63b457-37e0-4e33-971e-c0fc24658b65
ceph-8ef05e54-8909-49f8-951d-0f9d37aeba45
 stdout: Logic

Re: [ceph-users] ceph-volume lvm batch OSD replacement

2019-03-19 Thread Dan van der Ster
On Tue, Mar 19, 2019 at 1:05 PM Alfredo Deza  wrote:
>
> On Tue, Mar 19, 2019 at 7:26 AM Dan van der Ster  wrote:
> >
> > On Tue, Mar 19, 2019 at 12:17 PM Alfredo Deza  wrote:
> > >
> > > On Tue, Mar 19, 2019 at 7:00 AM Alfredo Deza  wrote:
> > > >
> > > > On Tue, Mar 19, 2019 at 6:47 AM Dan van der Ster  
> > > > wrote:
> > > > >
> > > > > Hi all,
> > > > >
> > > > > We've just hit our first OSD replacement on a host created with
> > > > > `ceph-volume lvm batch` with mixed hdds+ssds.
> > > > >
> > > > > The hdd /dev/sdq was prepared like this:
> > > > ># ceph-volume lvm batch /dev/sd[m-r] /dev/sdac --yes
> > > > >
> > > > > Then /dev/sdq failed and was then zapped like this:
> > > > >   # ceph-volume lvm zap /dev/sdq --destroy
> > > > >
> > > > > The zap removed the pv/vg/lv from sdq, but left behind the db on
> > > > > /dev/sdac (see P.S.)
> > > >
> > > > That is correct behavior for the zap command used.
> > > >
> > > > >
> > > > > Now we're replaced /dev/sdq and we're wondering how to proceed. We see
> > > > > two options:
> > > > >   1. reuse the existing db lv from osd.240 (Though the osd fsid will
> > > > > change when we re-create, right?)
> > > >
> > > > This is possible but you are right that in the current state, the FSID
> > > > and other cluster data exist in the LV metadata. To reuse this LV for
> > > > a new (replaced) OSD
> > > > then you would need to zap the LV *without* the --destroy flag, which
> > > > would clear all metadata on the LV and do a wipefs. The command would
> > > > need the full path to
> > > > the LV associated with osd.240, something like:
> > > >
> > > > ceph-volume lvm zap /dev/ceph-osd-lvs/db-lv-240
> > > >
> > > > >   2. remove the db lv from sdac then run
> > > > > # ceph-volume lvm batch /dev/sdq /dev/sdac
> > > > >  which should do the correct thing.
> > > >
> > > > This would also work if the db lv is fully removed with --destroy
> > > >
> > > > >
> > > > > This is all v12.2.11 btw.
> > > > > If (2) is the prefered approached, then it looks like a bug that the
> > > > > db lv was not destroyed by lvm zap --destroy.
> > > >
> > > > Since /dev/sdq was passed in to zap, just that one device was removed,
> > > > so this is working as expected.
> > > >
> > > > Alternatively, zap has the ability to destroy or zap LVs associated
> > > > with an OSD ID. I think this is not released yet for Luminous but
> > > > should be in the next release (which seems to be what you want)
> > >
> > > Seems like 12.2.11 was released with the ability to zap by OSD ID. You
> > > can also zap by OSD FSID, both way will zap (and optionally destroy if
> > > using --destroy)
> > > all LVs associated with the OSD.
> > >
> > > Full examples on this can be found here:
> > >
> > > http://docs.ceph.com/docs/luminous/ceph-volume/lvm/zap/#removing-devices
> > >
> > >
> >
> > Ohh that's an improvement! (Our goal is outsourcing the failure
> > handling to non-ceph experts, so this will help simplify things.)
> >
> > In our example, the operator needs to know the osd id, then can do:
> >
> > 1. ceph-volume lvm zap --destroy --osd-id 240 (wipes sdq and removes
> > the lvm from sdac for osd.240)
> > 2. replace the hdd
> > 3. ceph-volume lvm batch /dev/sdq /dev/sdac --osd-ids 240
> >
> > But I just remembered that the --osd-ids flag hasn't been backported
> > to luminous, so we can't yet do that. I guess we'll follow the first
> > (1) procedure to re-use the existing db lv.
>
> It has! (I initially thought it wasn't). Check if `ceph-volume lvm zap
> --help` has the flags available, I think they should appear for
> 12.2.11

Is it there? Indeed I see zap --osd-id, but for the recreation I'm
referring to batch --osd-ids, which afaict is only in nautilus:

https://github.com/ceph/ceph/blob/nautilus/src/ceph-volume/ceph_volume/devices/lvm/batch.py#L248

-- dan


> >
> > -- dan
> >
> > > >
> > > > >
> > > > > Once we sort this out, we'd be happy to contribute to the ceph-volume
> > > > > lv

Re: [ceph-users] ceph-volume lvm batch OSD replacement

2019-03-19 Thread Dan van der Ster
On Tue, Mar 19, 2019 at 12:17 PM Alfredo Deza  wrote:
>
> On Tue, Mar 19, 2019 at 7:00 AM Alfredo Deza  wrote:
> >
> > On Tue, Mar 19, 2019 at 6:47 AM Dan van der Ster  
> > wrote:
> > >
> > > Hi all,
> > >
> > > We've just hit our first OSD replacement on a host created with
> > > `ceph-volume lvm batch` with mixed hdds+ssds.
> > >
> > > The hdd /dev/sdq was prepared like this:
> > ># ceph-volume lvm batch /dev/sd[m-r] /dev/sdac --yes
> > >
> > > Then /dev/sdq failed and was then zapped like this:
> > >   # ceph-volume lvm zap /dev/sdq --destroy
> > >
> > > The zap removed the pv/vg/lv from sdq, but left behind the db on
> > > /dev/sdac (see P.S.)
> >
> > That is correct behavior for the zap command used.
> >
> > >
> > > Now we're replaced /dev/sdq and we're wondering how to proceed. We see
> > > two options:
> > >   1. reuse the existing db lv from osd.240 (Though the osd fsid will
> > > change when we re-create, right?)
> >
> > This is possible but you are right that in the current state, the FSID
> > and other cluster data exist in the LV metadata. To reuse this LV for
> > a new (replaced) OSD
> > then you would need to zap the LV *without* the --destroy flag, which
> > would clear all metadata on the LV and do a wipefs. The command would
> > need the full path to
> > the LV associated with osd.240, something like:
> >
> > ceph-volume lvm zap /dev/ceph-osd-lvs/db-lv-240
> >
> > >   2. remove the db lv from sdac then run
> > > # ceph-volume lvm batch /dev/sdq /dev/sdac
> > >  which should do the correct thing.
> >
> > This would also work if the db lv is fully removed with --destroy
> >
> > >
> > > This is all v12.2.11 btw.
> > > If (2) is the prefered approached, then it looks like a bug that the
> > > db lv was not destroyed by lvm zap --destroy.
> >
> > Since /dev/sdq was passed in to zap, just that one device was removed,
> > so this is working as expected.
> >
> > Alternatively, zap has the ability to destroy or zap LVs associated
> > with an OSD ID. I think this is not released yet for Luminous but
> > should be in the next release (which seems to be what you want)
>
> Seems like 12.2.11 was released with the ability to zap by OSD ID. You
> can also zap by OSD FSID; both ways will zap (and optionally destroy if
> using --destroy)
> all LVs associated with the OSD.
>
> Full examples on this can be found here:
>
> http://docs.ceph.com/docs/luminous/ceph-volume/lvm/zap/#removing-devices
>
>

Ohh that's an improvement! (Our goal is outsourcing the failure
handling to non-ceph experts, so this will help simplify things.)

In our example, the operator needs to know the osd id, then can do:

1. ceph-volume lvm zap --destroy --osd-id 240 (wipes sdq and removes
the lvm from sdac for osd.240)
2. replace the hdd
3. ceph-volume lvm batch /dev/sdq /dev/sdac --osd-ids 240

But I just remembered that the --osd-ids flag hasn't been backported
to luminous, so we can't yet do that. I guess we'll follow the first
(1) procedure to re-use the existing db lv.

-- dan

> >
> > >
> > > Once we sort this out, we'd be happy to contribute to the ceph-volume
> > > lvm batch doc.
> > >
> > > Thanks!
> > >
> > > Dan
> > >
> > > P.S:
> > >
> > > = osd.240 ==
> > >
> > >   [  db]
> > > /dev/ceph-094c06db-98dc-47f6-a7e5-1092b099b372/osd-block-db-fa0e7927-dc3e-44d0-a8ce-1d8202fa75dd
> > >
> > >   type  db
> > >   osd id240
> > >   cluster fsid  b4f463a0-c671-43a8-bd36-e40ab8d233d2
> > >   cluster name  ceph
> > >   osd fsid  d4d1fb15-a30a-4325-8628-706772ee4294
> > >   db device
> > > /dev/ceph-094c06db-98dc-47f6-a7e5-1092b099b372/osd-block-db-fa0e7927-dc3e-44d0-a8ce-1d8202fa75dd
> > >   encrypted 0
> > >   db uuid   iWWdyU-UhNu-b58z-ThSp-Bi3B-19iA-06iJIc
> > >   cephx lockbox secret
> > >   block uuidu4326A-Q8bH-afPb-y7Y6-ftNf-TE1X-vjunBd
> > >   block device
> > > /dev/ceph-f78ff8a3-803d-4b6d-823b-260b301109ac/osd-data-9e4bf34d-1aa3-4c0a-9655-5dba52dcfcd7
> > >   vdo   0
> > >   crush device classNone
> > >   devices   /dev/sdac
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph-volume lvm batch OSD replacement

2019-03-19 Thread Dan van der Ster
Hi all,

We've just hit our first OSD replacement on a host created with
`ceph-volume lvm batch` with mixed hdds+ssds.

The hdd /dev/sdq was prepared like this:
   # ceph-volume lvm batch /dev/sd[m-r] /dev/sdac --yes

Then /dev/sdq failed and was then zapped like this:
  # ceph-volume lvm zap /dev/sdq --destroy

The zap removed the pv/vg/lv from sdq, but left behind the db on
/dev/sdac (see P.S.)

Now we've replaced /dev/sdq and we're wondering how to proceed. We see
two options:
  1. reuse the existing db lv from osd.240 (Though the osd fsid will
change when we re-create, right?)
  2. remove the db lv from sdac then run
# ceph-volume lvm batch /dev/sdq /dev/sdac
 which should do the correct thing.

This is all v12.2.11 btw.
If (2) is the preferred approach, then it looks like a bug that the
db lv was not destroyed by lvm zap --destroy.

Once we sort this out, we'd be happy to contribute to the ceph-volume
lvm batch doc.

Thanks!

Dan

P.S:

= osd.240 ==

  [  db]
/dev/ceph-094c06db-98dc-47f6-a7e5-1092b099b372/osd-block-db-fa0e7927-dc3e-44d0-a8ce-1d8202fa75dd

  type  db
  osd id240
  cluster fsid  b4f463a0-c671-43a8-bd36-e40ab8d233d2
  cluster name  ceph
  osd fsid  d4d1fb15-a30a-4325-8628-706772ee4294
  db device
/dev/ceph-094c06db-98dc-47f6-a7e5-1092b099b372/osd-block-db-fa0e7927-dc3e-44d0-a8ce-1d8202fa75dd
  encrypted 0
  db uuid   iWWdyU-UhNu-b58z-ThSp-Bi3B-19iA-06iJIc
  cephx lockbox secret
  block uuidu4326A-Q8bH-afPb-y7Y6-ftNf-TE1X-vjunBd
  block device
/dev/ceph-f78ff8a3-803d-4b6d-823b-260b301109ac/osd-data-9e4bf34d-1aa3-4c0a-9655-5dba52dcfcd7
  vdo   0
  crush device classNone
  devices   /dev/sdac
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph osd pg-upmap-items not working

2019-03-18 Thread Dan van der Ster
>>>>
>>>>
>>>>
>>>> From: Dan van der Ster
>>>> To: Kári Bertilsson ;
>>>> Cc: ceph-users ; Xie Xingguo (谢型果) 10072465;
>>>> Date: 2019-03-01 14:48
>>>> Subject: Re: [ceph-users] ceph osd pg-upmap-items not working
>>>> It looks like that somewhat unusual crush rule is confusing the new
>>>> upmap cleaning.
>>>> (debug_mon 10 on the active mon should show those cleanups).
>>>>
>>>> I'm copying Xie Xingguo, and probably you should create a tracker for this.
>>>>
>>>> -- dan
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, Mar 1, 2019 at 3:12 AM Kári Bertilsson  
>>>> wrote:
>>>> >
>>>> > This is the pool
>>>> > pool 41 'ec82_pool' erasure size 10 min_size 8 crush_rule 1 object_hash 
>>>> > rjenkins pg_num 512 pgp_num 512 last_change 63794 lfor 21731/21731 flags 
>>>> > hashpspool,ec_overwrites stripe_width 32768 application cephfs
>>>> >removed_snaps [1~5]
>>>> >
>>>> > Here is the relevant crush rule:
>>>> > rule ec_pool { id 1 type erasure min_size 3 max_size 10 step 
>>>> > set_chooseleaf_tries 5 step set_choose_tries 100 step take default class 
>>>> > hdd step choose indep 5 type host step choose indep 2 type osd step emit 
>>>> > }
>>>> >
>>>> > Both OSD 23 and 123 are in the same host. So this change should be 
>>>> > perfectly acceptable by the rule set.
>>>> > Something must be blocking the change, but i can't find anything about 
>>>> > it in any logs.
>>>> >
>>>> > - Kári
>>>> >
>>>> > On Thu, Feb 28, 2019 at 8:07 AM Dan van der Ster  
>>>> > wrote:
>>>> >>
>>>> >> Hi,
>>>> >>
>>>> >> pg-upmap-items became more strict in v12.2.11 when validating upmaps.
>>>> >> E.g., it now won't let you put two PGs in the same rack if the crush
>>>> >> rule doesn't allow it.
>>>> >>
>>>> >> Where are OSDs 23 and 123 in your cluster? What is the relevant crush 
>>>> >> rule?
>>>> >>
>>>> >> -- dan
>>>> >>
>>>> >>
>>>> >> On Wed, Feb 27, 2019 at 9:17 PM Kári Bertilsson  
>>>> >> wrote:
>>>> >> >
>>>> >> > Hello
>>>> >> >
>>>> >> > I am trying to diagnose why upmap stopped working where it was 
>>>> >> > previously working fine.
>>>> >> >
>>>> >> > Trying to move pg 41.1 to 123 has no effect and seems to be ignored.
>>>> >> >
>>>> >> > # ceph osd pg-upmap-items 41.1 23 123
>>>> >> > set 41.1 pg_upmap_items mapping to [23->123]
>>>> >> >
>>>> >> > No rebalacing happens and if i run it again it shows the same output 
>>>> >> > every time.
>>>> >> >
>>>> >> > I have in config
>>>> >> > debug mgr = 4/5
>>>> >> > debug mon = 4/5
>>>> >> >
>>>> >> > Paste from mon & mgr logs. Also output from "ceph osd dump"
>>>> >> > https://pastebin.com/9VrT4YcU
>>>> >> >
>>>> >> >
>>>> >> > I have run "ceph osd set-require-min-compat-client luminous" long 
>>>> >> > time ago. And all servers running ceph have been rebooted numerous 
>>>> >> > times since then.
>>>> >> > But somehow i am still seeing "min_compat_client jewel". I believe 
>>>> >> > that upmap was previously working anyway with that "jewel" line 
>>>> >> > present.
>>>> >> >
>>>> >> > I see no indication in any logs why the upmap commands are being 
>>>> >> > ignored.
>>>> >> >
>>>> >> > Any suggestions on how to debug further or what could be the issue ?
>>>> >> > ___
>>>> >> > ceph-users mailing list
>>>> >> > ceph-users@lists.ceph.com
>>>> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>> >
>>>> > ___
>>>> > ceph-users mailing list
>>>> > ceph-users@lists.ceph.com
>>>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>
>>>>
>>>> ___
>>>> ceph-users mailing list
>>>> ceph-users@lists.ceph.com
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Safe to remove objects from default.rgw.meta ?

2019-03-12 Thread Dan van der Ster
Answering my own question (getting help from Pavan), I see that all
the details are in this PR: https://github.com/ceph/ceph/pull/11051

So, the zone was updated to set metadata_heap: "" with

$ radosgw-admin zone get --rgw-zone=default > zone.json
[edit zone.json]
$ radosgw-admin zone set --rgw-zone=default --infile=zone.json

and now I can safely remove the default.rgw.meta pool.
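
For anyone following along: the leftover heap objects can then be cleaned
out with plain rados, or the pool dropped entirely -- but only if
`radosgw-admin zone get` shows nothing else pointing at default.rgw.meta
in your zone. An untested sketch:

  # remove the objects one by one (slow for ~10M objects, but safe to interrupt)
  rados -p default.rgw.meta ls | while read -r obj; do
      rados -p default.rgw.meta rm "$obj"
  done

  # or drop the whole pool (requires mon_allow_pool_delete=true on the mons)
  ceph osd pool rm default.rgw.meta default.rgw.meta --yes-i-really-really-mean-it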

-- Dan


On Tue, Mar 12, 2019 at 3:17 PM Dan van der Ster  wrote:
>
> Hi all,
>
> We have an S3 cluster with >10 million objects in default.rgw.meta.
>
> # radosgw-admin zone get | jq .metadata_heap
> "default.rgw.meta"
>
> In these old tickets I realized that this setting is obsolete, and
> those objects are probably useless:
>http://tracker.ceph.com/issues/17256
>http://tracker.ceph.com/issues/18174
>
> We will clear the metadata_heap setting in the zone json, but then can
> we simply `rados rm` all the objects in the default.rgw.meta pool?
>
> The objects seem to come in three flavours:
>
>.meta:user:dvanders:_KpWMw94jrX75PgAfhDymKTo:2
>.meta:bucket:atlas-eventservice:_byPmpJS9V9l7DULEVxlDC2A:1
>
> .meta:bucket.instance:atlas-eventservice:61c59385-085d-4caa-9070-63a3868dccb6.3191998.599860:_PQCKPJVTzvtwgU41Dw0Cdx6:1
>
> Cheers, Dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Safe to remove objects from default.rgw.meta ?

2019-03-12 Thread Dan van der Ster
Hi all,

We have an S3 cluster with >10 million objects in default.rgw.meta.

# radosgw-admin zone get | jq .metadata_heap
"default.rgw.meta"

In these old tickets I realized that this setting is obsolete, and
those objects are probably useless:
   http://tracker.ceph.com/issues/17256
   http://tracker.ceph.com/issues/18174

We will clear the metadata_heap setting in the zone json, but then can
we simply `rados rm` all the objects in the default.rgw.meta pool?

The objects seem to come in three flavours:

   .meta:user:dvanders:_KpWMw94jrX75PgAfhDymKTo:2
   .meta:bucket:atlas-eventservice:_byPmpJS9V9l7DULEVxlDC2A:1
   
.meta:bucket.instance:atlas-eventservice:61c59385-085d-4caa-9070-63a3868dccb6.3191998.599860:_PQCKPJVTzvtwgU41Dw0Cdx6:1

Cheers, Dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph osd pg-upmap-items not working

2019-02-28 Thread Dan van der Ster
It looks like that somewhat unusual crush rule is confusing the new
upmap cleaning.
(debug_mon 10 on the active mon should show those cleanups).

I'm copying Xie Xingguo, and probably you should create a tracker for this.
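
To catch them in the act, something like this on the mons should do (log
path and grep pattern are just examples; turn debug back down afterwards):

  ceph tell mon.* injectargs '--debug-mon 10'
  # watch the active mon's log while the upmaps get validated/cleaned
  grep -i upmap /var/log/ceph/ceph-mon.$(hostname -s).log | tail -50
  ceph tell mon.* injectargs '--debug-mon 1/5'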

-- dan




On Fri, Mar 1, 2019 at 3:12 AM Kári Bertilsson  wrote:
>
> This is the pool
> pool 41 'ec82_pool' erasure size 10 min_size 8 crush_rule 1 object_hash 
> rjenkins pg_num 512 pgp_num 512 last_change 63794 lfor 21731/21731 flags 
> hashpspool,ec_overwrites stripe_width 32768 application cephfs
>removed_snaps [1~5]
>
> Here is the relevant crush rule:
> rule ec_pool { id 1 type erasure min_size 3 max_size 10 step 
> set_chooseleaf_tries 5 step set_choose_tries 100 step take default class hdd 
> step choose indep 5 type host step choose indep 2 type osd step emit }
>
> Both OSD 23 and 123 are in the same host. So this change should be perfectly 
> acceptable by the rule set.
> Something must be blocking the change, but i can't find anything about it in 
> any logs.
>
> - Kári
>
> On Thu, Feb 28, 2019 at 8:07 AM Dan van der Ster  wrote:
>>
>> Hi,
>>
>> pg-upmap-items became more strict in v12.2.11 when validating upmaps.
>> E.g., it now won't let you put two PGs in the same rack if the crush
>> rule doesn't allow it.
>>
>> Where are OSDs 23 and 123 in your cluster? What is the relevant crush rule?
>>
>> -- dan
>>
>>
>> On Wed, Feb 27, 2019 at 9:17 PM Kári Bertilsson  
>> wrote:
>> >
>> > Hello
>> >
>> > I am trying to diagnose why upmap stopped working where it was previously 
>> > working fine.
>> >
>> > Trying to move pg 41.1 to 123 has no effect and seems to be ignored.
>> >
>> > # ceph osd pg-upmap-items 41.1 23 123
>> > set 41.1 pg_upmap_items mapping to [23->123]
>> >
>> > No rebalacing happens and if i run it again it shows the same output every 
>> > time.
>> >
>> > I have in config
>> > debug mgr = 4/5
>> > debug mon = 4/5
>> >
>> > Paste from mon & mgr logs. Also output from "ceph osd dump"
>> > https://pastebin.com/9VrT4YcU
>> >
>> >
>> > I have run "ceph osd set-require-min-compat-client luminous" long time 
>> > ago. And all servers running ceph have been rebooted numerous times since 
>> > then.
>> > But somehow i am still seeing "min_compat_client jewel". I believe that 
>> > upmap was previously working anyway with that "jewel" line present.
>> >
>> > I see no indication in any logs why the upmap commands are being ignored.
>> >
>> > Any suggestions on how to debug further or what could be the issue ?
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph osd pg-upmap-items not working

2019-02-28 Thread Dan van der Ster
Hi,

pg-upmap-items became more strict in v12.2.11 when validating upmaps.
E.g., it now won't let you put two PGs in the same rack if the crush
rule doesn't allow it.

Where are OSDs 23 and 123 in your cluster? What is the relevant crush rule?
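
For reference, the quick way to answer both questions (osd ids from your
mail; the rule and pg names are as they appear later in the thread):

  ceph osd find 23          # shows the crush location (host/rack) of each OSD
  ceph osd find 123
  ceph osd crush rule dump ec_pool
  ceph pg map 41.1          # current up/acting set of the PG you tried to upmap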

-- dan


On Wed, Feb 27, 2019 at 9:17 PM Kári Bertilsson  wrote:
>
> Hello
>
> I am trying to diagnose why upmap stopped working where it was previously 
> working fine.
>
> Trying to move pg 41.1 to 123 has no effect and seems to be ignored.
>
> # ceph osd pg-upmap-items 41.1 23 123
> set 41.1 pg_upmap_items mapping to [23->123]
>
> No rebalancing happens and if I run it again it shows the same output every 
> time.
>
> I have in config
> debug mgr = 4/5
> debug mon = 4/5
>
> Paste from mon & mgr logs. Also output from "ceph osd dump"
> https://pastebin.com/9VrT4YcU
>
>
> I have run "ceph osd set-require-min-compat-client luminous" long time ago. 
> And all servers running ceph have been rebooted numerous times since then.
> But somehow i am still seeing "min_compat_client jewel". I believe that upmap 
> was previously working anyway with that "jewel" line present.
>
> I see no indication in any logs why the upmap commands are being ignored.
>
> Any suggestions on how to debug further or what could be the issue ?
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph mon_data_size_warn limits for large cluster

2019-02-18 Thread Dan van der Ster
Not really.

You should just restart your mons though -- if done one at a time it
has zero impact on your clients.

-- dan


On Mon, Feb 18, 2019 at 12:11 PM M Ranga Swami Reddy
 wrote:
>
> Hi Sage - If the mon data increases, is this impacts the ceph cluster
> performance (ie on ceph osd bench, etc)?
>
> On Fri, Feb 15, 2019 at 3:13 PM M Ranga Swami Reddy
>  wrote:
> >
> > today I again hit the warn with 30G also...
> >
> > On Thu, Feb 14, 2019 at 7:39 PM Sage Weil  wrote:
> > >
> > > On Thu, 7 Feb 2019, Dan van der Ster wrote:
> > > > On Thu, Feb 7, 2019 at 12:17 PM M Ranga Swami Reddy
> > > >  wrote:
> > > > >
> > > > > Hi Dan,
> > > > > >During backfilling scenarios, the mons keep old maps and grow quite
> > > > > >quickly. So if you have balancing, pg splitting, etc. ongoing for
> > > > > >awhile, the mon stores will eventually trigger that 15GB alarm.
> > > > > >But the intended behavior is that once the PGs are all active+clean,
> > > > > >the old maps should be trimmed and the disk space freed.
> > > > >
> > > > > old maps not trimmed after cluster reached to "all+clean" state for 
> > > > > all PGs.
> > > > > Is there (known) bug here?
> > > > > As the size of dB showing > 15G, do I need to run the compact commands
> > > > > to do the trimming?
> > > >
> > > > Compaction isn't necessary -- you should only need to restart all
> > > > peon's then the leader. A few minutes later the db's should start
> > > > trimming.
> > >
> > > The next time someone sees this behavior, can you please
> > >
> > > - enable debug_mon = 20 on all mons (*before* restarting)
> > >ceph tell mon.* injectargs '--debug-mon 20'
> > > - wait for 10 minutes or so to generate some logs
> > > - add 'debug mon = 20' to ceph.conf (on mons only)
> > > - restart the monitors
> > > - wait for them to start trimming
> > > - remove 'debug mon = 20' from ceph.conf (on mons only)
> > > - tar up the log files, ceph-post-file them, and share them with ticket
> > > http://tracker.ceph.com/issues/38322
> > >
> > > Thanks!
> > > sage
> > >
> > >
> > >
> > >
> > > > -- dan
> > > >
> > > >
> > > > >
> > > > > Thanks
> > > > > Swami
> > > > >
> > > > > On Wed, Feb 6, 2019 at 6:24 PM Dan van der Ster  
> > > > > wrote:
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > With HEALTH_OK a mon data dir should be under 2GB for even such a 
> > > > > > large cluster.
> > > > > >
> > > > > > During backfilling scenarios, the mons keep old maps and grow quite
> > > > > > quickly. So if you have balancing, pg splitting, etc. ongoing for
> > > > > > awhile, the mon stores will eventually trigger that 15GB alarm.
> > > > > > But the intended behavior is that once the PGs are all active+clean,
> > > > > > the old maps should be trimmed and the disk space freed.
> > > > > >
> > > > > > However, several people have noted that (at least in luminous
> > > > > > releases) the old maps are not trimmed until after HEALTH_OK *and* 
> > > > > > all
> > > > > > mons are restarted. This ticket seems related:
> > > > > > http://tracker.ceph.com/issues/37875
> > > > > >
> > > > > > (Over here we're restarting mons every ~2-3 weeks, resulting in the
> > > > > > mon stores dropping from >15GB to ~700MB each time).
> > > > > >
> > > > > > -- Dan
> > > > > >
> > > > > >
> > > > > > On Wed, Feb 6, 2019 at 1:26 PM Sage Weil  wrote:
> > > > > > >
> > > > > > > Hi Swami
> > > > > > >
> > > > > > > The limit is somewhat arbitrary, based on cluster sizes we had 
> > > > > > > seen when
> > > > > > > we picked it.  In your case it should be perfectly safe to 
> > > > > > > increase it.
> > > > > > >
> > > > > > > sage
> > > > > > >
> > > > > > >
> > > > > > > On Wed, 6 Feb 2019, M Ranga Swami Reddy wrote:
> > > > > > >
> > > > > > > > Hello -  Are the any limits for mon_data_size for cluster with 
> > > > > > > > 2PB
> > > > > > > > (with 2000+ OSDs)?
> > > > > > > >
> > > > > > > > Currently it set as 15G. What is logic behind this? Can we 
> > > > > > > > increase
> > > > > > > > when we get the mon_data_size_warn messages?
> > > > > > > >
> > > > > > > > I am getting the mon_data_size_warn message even though there a 
> > > > > > > > ample
> > > > > > > > of free space on the disk (around 300G free disk)
> > > > > > > >
> > > > > > > > Earlier thread on the same discusion:
> > > > > > > > https://www.spinics.net/lists/ceph-users/msg42456.html
> > > > > > > >
> > > > > > > > Thanks
> > > > > > > > Swami
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > ___
> > > > > > > ceph-users mailing list
> > > > > > > ceph-users@lists.ceph.com
> > > > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > >
> > > >
> > > >
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph mon_data_size_warn limits for large cluster

2019-02-18 Thread Dan van der Ster
On Thu, Feb 14, 2019 at 2:31 PM Sage Weil  wrote:
>
> On Thu, 7 Feb 2019, Dan van der Ster wrote:
> > On Thu, Feb 7, 2019 at 12:17 PM M Ranga Swami Reddy
> >  wrote:
> > >
> > > Hi Dan,
> > > >During backfilling scenarios, the mons keep old maps and grow quite
> > > >quickly. So if you have balancing, pg splitting, etc. ongoing for
> > > >awhile, the mon stores will eventually trigger that 15GB alarm.
> > > >But the intended behavior is that once the PGs are all active+clean,
> > > >the old maps should be trimmed and the disk space freed.
> > >
> > > old maps not trimmed after cluster reached to "all+clean" state for all 
> > > PGs.
> > > Is there (known) bug here?
> > > As the size of dB showing > 15G, do I need to run the compact commands
> > > to do the trimming?
> >
> > Compaction isn't necessary -- you should only need to restart all
> > peon's then the leader. A few minutes later the db's should start
> > trimming.
>
> The next time someone sees this behavior, can you please
>
> - enable debug_mon = 20 on all mons (*before* restarting)
>ceph tell mon.* injectargs '--debug-mon 20'
> - wait for 10 minutes or so to generate some logs
> - add 'debug mon = 20' to ceph.conf (on mons only)
> - restart the monitors
> - wait for them to start trimming
> - remove 'debug mon = 20' from ceph.conf (on mons only)
> - tar up the log files, ceph-post-file them, and share them with ticket
> http://tracker.ceph.com/issues/38322
>

Not sure if you noticed, but we sent some logs Friday.

-- dan

> Thanks!
> sage
>
>
>
>
> > -- dan
> >
> >
> > >
> > > Thanks
> > > Swami
> > >
> > > On Wed, Feb 6, 2019 at 6:24 PM Dan van der Ster  
> > > wrote:
> > > >
> > > > Hi,
> > > >
> > > > With HEALTH_OK a mon data dir should be under 2GB for even such a large 
> > > > cluster.
> > > >
> > > > During backfilling scenarios, the mons keep old maps and grow quite
> > > > quickly. So if you have balancing, pg splitting, etc. ongoing for
> > > > awhile, the mon stores will eventually trigger that 15GB alarm.
> > > > But the intended behavior is that once the PGs are all active+clean,
> > > > the old maps should be trimmed and the disk space freed.
> > > >
> > > > However, several people have noted that (at least in luminous
> > > > releases) the old maps are not trimmed until after HEALTH_OK *and* all
> > > > mons are restarted. This ticket seems related:
> > > > http://tracker.ceph.com/issues/37875
> > > >
> > > > (Over here we're restarting mons every ~2-3 weeks, resulting in the
> > > > mon stores dropping from >15GB to ~700MB each time).
> > > >
> > > > -- Dan
> > > >
> > > >
> > > > On Wed, Feb 6, 2019 at 1:26 PM Sage Weil  wrote:
> > > > >
> > > > > Hi Swami
> > > > >
> > > > > The limit is somewhat arbitrary, based on cluster sizes we had seen 
> > > > > when
> > > > > we picked it.  In your case it should be perfectly safe to increase 
> > > > > it.
> > > > >
> > > > > sage
> > > > >
> > > > >
> > > > > On Wed, 6 Feb 2019, M Ranga Swami Reddy wrote:
> > > > >
> > > > > > Hello -  Are the any limits for mon_data_size for cluster with 2PB
> > > > > > (with 2000+ OSDs)?
> > > > > >
> > > > > > Currently it set as 15G. What is logic behind this? Can we increase
> > > > > > when we get the mon_data_size_warn messages?
> > > > > >
> > > > > > I am getting the mon_data_size_warn message even though there a 
> > > > > > ample
> > > > > > of free space on the disk (around 300G free disk)
> > > > > >
> > > > > > Earlier thread on the same discusion:
> > > > > > https://www.spinics.net/lists/ceph-users/msg42456.html
> > > > > >
> > > > > > Thanks
> > > > > > Swami
> > > > > >
> > > > > >
> > > > > >
> > > > > ___
> > > > > ceph-users mailing list
> > > > > ceph-users@lists.ceph.com
> > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> >
> >
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Nautilus Release T-shirt Design

2019-02-15 Thread Dan van der Ster
On Fri, Feb 15, 2019 at 12:01 PM Willem Jan Withagen  wrote:
>
> On 15/02/2019 11:56, Dan van der Ster wrote:
> > On Fri, Feb 15, 2019 at 11:40 AM Willem Jan Withagen  
> > wrote:
> >>
> >> On 15/02/2019 10:39, Ilya Dryomov wrote:
> >>> On Fri, Feb 15, 2019 at 12:05 AM Mike Perez  wrote:
> >>>>
> >>>> Hi Marc,
> >>>>
> >>>> You can see previous designs on the Ceph store:
> >>>>
> >>>> https://www.proforma.com/sdscommunitystore
> >>>
> >>> Hi Mike,
> >>>
> >>> This site stopped working during DevConf and hasn't been working since.
> >>> I think Greg has contacted some folks about this, but it would be great
> >>> if you could follow up because it's been a couple of weeks now...
> >>
> >> Ilya,
> >>
> >> The site is working for me.
> >> It only does not contain the Nautilus shirts (yet)
> >
> > I found in the past that the http redirection for www.proforma.com
> > doesn't work from over here in Europe.
> > If someone can post the redirection target then we can access it directly.
>
> Like:
>
> https://proformaprostores.com/Category
>
>
> at least, that is where I get directed to.

Exactly! That URL works here at CERN... www.proforma.com is stuck forever.

-- dan


>
> --WjW
>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Nautilus Release T-shirt Design

2019-02-15 Thread Dan van der Ster
On Fri, Feb 15, 2019 at 11:40 AM Willem Jan Withagen  wrote:
>
> On 15/02/2019 10:39, Ilya Dryomov wrote:
> > On Fri, Feb 15, 2019 at 12:05 AM Mike Perez  wrote:
> >>
> >> Hi Marc,
> >>
> >> You can see previous designs on the Ceph store:
> >>
> >> https://www.proforma.com/sdscommunitystore
> >
> > Hi Mike,
> >
> > This site stopped working during DevConf and hasn't been working since.
> > I think Greg has contacted some folks about this, but it would be great
> > if you could follow up because it's been a couple of weeks now...
>
> Ilya,
>
> The site is working for me.
> It only does not contain the Nautilus shirts (yet)

I found in the past that the http redirection for www.proforma.com
doesn't work from over here in Europe.
If someone can post the redirection target then we can access it directly.

-- dan


>
> --WjW
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] HDD OSD 100% busy reading OMAP keys RGW

2019-02-14 Thread Dan van der Ster
On Thu, Feb 14, 2019 at 12:07 PM Wido den Hollander  wrote:
>
>
>
> On 2/14/19 11:26 AM, Dan van der Ster wrote:
> > On Thu, Feb 14, 2019 at 11:13 AM Wido den Hollander  wrote:
> >>
> >> On 2/14/19 10:20 AM, Dan van der Ster wrote:
> > >>> On Thu., Feb. 14, 2019, 6:17 a.m. Wido den Hollander wrote:
> >>>> Hi,
> >>>>
> >>>> On a cluster running RGW only I'm running into BlueStore 12.2.11 OSDs
> >>>> being 100% busy sometimes.
> >>>>
> >>>> This cluster has 85k stale indexes (stale-instances list) and I've been
> >>>> slowly trying to remove them.
> >>>>
> >>>
> >>> Is your implication here that 'stale-instances rm' isn't going to work
> >>> for your case?
> >>>
> >>
> >> No, just saying that I've been running it to remove stale indexes. But
> >> suddenly the OSDs are spiking to 100% busy when reading from BlueFS.
> >
> > How do you "slowly" rm the stale-instances? If you called
> > stale-instances rm, doesn't that mean there's some rgw in your cluster
> > busily now trying to delete 85k indices? I imagine that could be
> > triggering this load spike.
>
> I ran it with:
>
> $ timeout 900 
>
> Just to see what would happen.
>
> The problem is that even without this running I now see OSDs spiking to
> 100% busy.
>
> The bluefs debug logs tell me that when a particular object is queried
> it will cause the OSD to scan it's RocksDB database and render the OSD
> useless for a few minutes.
>
> > (I'm trying to grok the stale-instances rm implementation -- it seems
> > to do the work 1000 keys at a time, but will nevertheless queue up the
> > work to rm all indices from one call to the command <--- please
> > correct me if I'm wrong).
> >
> > -- dan
> >
> >
> >>
> >>> (We haven't updated to 12.2.11 yet but have a similar number of stale
> >>> instances to remove).
> >>>
> >>
> >> In this case the index pool already went from 222k objects to 186k and
> >> is still going down if we run the GC.
> >>
> >>> For the rest below, I didn't understand *when* are the osd's getting
> >>> 100% busy -- is that just during normal operations (after the 12.2.11
> >>> upgrade) or is it while listing the indexes?
> >>>
> >>
> >> Not sure either. For certain Objects this can be triggered by just
> >> listing the OMAP keys. This OSD eats all I/O at that moment.
> >>
> >>> Also, have you already tried compacting the omap on the relevant osds?
> >>>
> >>
> >> I tried that, but on a offline OSD I cancelled it after 25 minutes as it
> >> was still running.
> >>
> >> Not sure yet why this happens on those OSDs.
> >>

Do the osd ops show anything suspicious? Are these coming from some
rgw iterating over keys or some internal osd process not resulting
from an op?

It reminds me of this old rgw scrub issue, which I resolved by simply
rados rm'ing the object with millions of keys:
   http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-June/018565.html

(Symptom is similar, though I doubt the cause is related).
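
If you want to dig further, the in-flight ops and the omap sizes are easy
to sample -- the osd id and object name below are taken from your log
snippet, the pool name is a guess (substitute your index pool):

  ceph daemon osd.266 dump_ops_in_flight
  ceph daemon osd.266 dump_historic_ops
  # how many keys does the suspect bucket index object actually hold?
  rados -p default.rgw.buckets.index listomapkeys .dir.ams02.36062237.821.79 | wc -l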

-- dan


> >> Wido
> >>
> >>> -- Dan
> >>>
> >>>
> >>>>
> >>>>
> >>>> I noticed that regularly OSDs read their HDD heavily and that device
> >>>> then becomes 100% busy. (iostat)
> >>>>
> >>>> $ radosgw-admin reshard stale-instances list > stale.json
> >>>> $ cat stale.json|jq -r '.[]'|wc -l
> >>>>
> >>>> I increased debug_bluefs and debug_bluestore to 10 and I found:
> >>>>
> >>>> 2019-02-14 05:11:18.417097 7f627732d700 10
> >>>> bluestore(/var/lib/ceph/osd/ceph-266) omap_get_header 13.205_head oid
> >>>> #13:a05231a1:::.dir.ams02.36062237.821.79:head# = 0
> >>>> 2019-02-14 05:11:18.417127 7f627732d700 10
> >>>> bluestore(/var/lib/ceph/osd/ceph-266) get_omap_iterator 13.205_head
> >>>> #13:a05231a1:::.dir.ams02.36062237.821.79:head#
> >>>> 2019-02-14 05:11:18.417133 7f627732d700 10
> >>>> bluestore(/var/lib/ceph/osd/ceph-266) get_omap_iterator has_omap = 1
> >>>>
> >>>> 2019-02-14 05:11:18.417169 7f627732d700 10 bluefs _read_random h
> >>>> 0x560cb77c5080 0x17a8cd0~fba from file(ino 71562 size 0x2d96a43 mtime
> >>>> 2019-02-14 

Re: [ceph-users] HDD OSD 100% busy reading OMAP keys RGW

2019-02-14 Thread Dan van der Ster
On Thu, Feb 14, 2019 at 11:13 AM Wido den Hollander  wrote:
>
> On 2/14/19 10:20 AM, Dan van der Ster wrote:
> > On Thu., Feb. 14, 2019, 6:17 a.m. Wido den Hollander wrote:
> >> Hi,
> >>
> >> On a cluster running RGW only I'm running into BlueStore 12.2.11 OSDs
> >> being 100% busy sometimes.
> >>
> >> This cluster has 85k stale indexes (stale-instances list) and I've been
> >> slowly trying to remove them.
> >>
> >
> > Is your implication here that 'stale-instances rm' isn't going to work
> > for your case?
> >
>
> No, just saying that I've been running it to remove stale indexes. But
> suddenly the OSDs are spiking to 100% busy when reading from BlueFS.

How do you "slowly" rm the stale-instances? If you called
stale-instances rm, doesn't that mean there's some rgw in your cluster
busily now trying to delete 85k indices? I imagine that could be
triggering this load spike.
(I'm trying to grok the stale-instances rm implementation -- it seems
to do the work 1000 keys at a time, but will nevertheless queue up the
work to rm all indices from one call to the command <--- please
correct me if I'm wrong).

-- dan


>
> > (We haven't updated to 12.2.11 yet but have a similar number of stale
> > instances to remove).
> >
>
> In this case the index pool already went from 222k objects to 186k and
> is still going down if we run the GC.
>
> > For the rest below, I didn't understand *when* are the osd's getting
> > 100% busy -- is that just during normal operations (after the 12.2.11
> > upgrade) or is it while listing the indexes?
> >
>
> Not sure either. For certain Objects this can be triggered by just
> listing the OMAP keys. This OSD eats all I/O at that moment.
>
> > Also, have you already tried compacting the omap on the relevant osds?
> >
>
> I tried that, but on a offline OSD I cancelled it after 25 minutes as it
> was still running.
>
> Not sure yet why this happens on those OSDs.
>
> Wido
>
> > -- Dan
> >
> >
> >>
> >>
> >> I noticed that regularly OSDs read their HDD heavily and that device
> >> then becomes 100% busy. (iostat)
> >>
> >> $ radosgw-admin reshard stale-instances list > stale.json
> >> $ cat stale.json|jq -r '.[]'|wc -l
> >>
> >> I increased debug_bluefs and debug_bluestore to 10 and I found:
> >>
> >> 2019-02-14 05:11:18.417097 7f627732d700 10
> >> bluestore(/var/lib/ceph/osd/ceph-266) omap_get_header 13.205_head oid
> >> #13:a05231a1:::.dir.ams02.36062237.821.79:head# = 0
> >> 2019-02-14 05:11:18.417127 7f627732d700 10
> >> bluestore(/var/lib/ceph/osd/ceph-266) get_omap_iterator 13.205_head
> >> #13:a05231a1:::.dir.ams02.36062237.821.79:head#
> >> 2019-02-14 05:11:18.417133 7f627732d700 10
> >> bluestore(/var/lib/ceph/osd/ceph-266) get_omap_iterator has_omap = 1
> >>
> >> 2019-02-14 05:11:18.417169 7f627732d700 10 bluefs _read_random h
> >> 0x560cb77c5080 0x17a8cd0~fba from file(ino 71562 size 0x2d96a43 mtime
> >> 2019-02-14 02:52:16.370746 bdev 1 allocated 2e0 extents
> >> [1:0x3228f0+2e0])
> >> 2019-02-14 05:11:23.129645 7f627732d700 10 bluefs _read_random h
> >> 0x560c14167780 0x17bb6b7~f52 from file(ino 68900 size 0x41919ef mtime
> >> 2019-02-01 01:19:59.216218 bdev 1 allocated 420 extents
> >> [1:0x8b31a0+20,1:0x8b31e0+e0,1:0x8b32d0+170,1:0x8b3ce0+1b0])
> >> 2019-02-14 05:11:23.144550 7f627732d700 10 bluefs _read_random h
> >> 0x560c14c86b80 0x96d020~ef3 from file(ino 67189 size 0x419b603 mtime
> >> 2019-02-01 00:45:12.743836 bdev 1 allocated 420 extents
> >> [1:0x53da9a0+420])
> >>
> >> 2019-02-14 05:11:23.149958 7f627732d700 10
> >> bluestore(/var/lib/ceph/osd/ceph-266) omap_get_header 13.e8_head oid
> >> #13:171bcbd3:::.dir.ams02.39023047.682.114:head# = 0
> >> 2019-02-14 05:11:23.149975 7f627732d700 10
> >> bluestore(/var/lib/ceph/osd/ceph-266) get_omap_iterator 13.e8_head
> >> #13:171bcbd3:::.dir.ams02.39023047.682.114:head#
> >> 2019-02-14 05:11:23.149981 7f627732d700 10
> >> bluestore(/var/lib/ceph/osd/ceph-266) get_omap_iterator has_omap = 1
> >>
> >> 2019-02-14 05:11:23.150012 7f627732d700 10 bluefs _read_random h
> >> 0x560c14e42500 0x1a18670~ff0 from file(ino 71519 size 0x417a60f mtime
> >> 2019-02-14 02:51:35.125629 bdev 1 allocated 420 extents
> >> [1:0x1c30d0+420])
> >> 2019-02-14 05:11:23.155679 7f627732d700 10 bluefs _read_random h

Re: [ceph-users] v12.2.11 Luminous released

2019-02-07 Thread Dan van der Ster
On Fri, Feb 1, 2019 at 10:18 PM Neha Ojha  wrote:
>
> On Fri, Feb 1, 2019 at 1:09 PM Robert Sander
>  wrote:
> >
> > Am 01.02.19 um 19:06 schrieb Neha Ojha:
> >
> > > If you would have hit the bug, you should have seen failures like
> > > https://tracker.ceph.com/issues/36686.
> > > Yes, pglog_hardlimit is off by default in 12.2.11. Since you are
> > > running 12.2.9(which has the patch that allows you to limit the length
> > > of the pg log), you could follow the steps and upgrade to 12.2.11 and
> > > set this flag.
> >
> > The question is: If I am now on 12.2.9 and see no issues, do I have to
> > set this flag after upgrading to 12.2.11?
> You don't have to.
> This flag lets you restrict the length of your pg logs, so if you do
> not want to use this functionality, no need to set this.

I guess that a 12.2.11 cluster with pglog_hardlimit enabled cannot
upgrade to mimic until 13.2.5 is released?
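
(For anyone wondering whether their cluster is affected: the flag is
visible in the osdmap once set, and it is only ever set explicitly --
e.g., something like:)

  ceph osd dump | grep ^flags        # look for pglog_hardlimit in the flag list
  # it only gets there via the following, once every OSD runs a release that understands it:
  ceph osd set pglog_hardlimit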


-- Dan


>
> >
> > Regards
> > --
> > Robert Sander
> > Heinlein Support GmbH
> > Schwedter Str. 8/9b, 10119 Berlin
> >
> > http://www.heinlein-support.de
> >
> > Tel: 030 / 405051-43
> > Fax: 030 / 405051-19
> >
> > Zwangsangaben lt. §35a GmbHG:
> > HRB 93818 B / Amtsgericht Berlin-Charlottenburg,
> > Geschäftsführer: Peer Heinlein -- Sitz: Berlin
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph mon_data_size_warn limits for large cluster

2019-02-07 Thread Dan van der Ster
On Thu, Feb 7, 2019 at 4:12 PM M Ranga Swami Reddy  wrote:
>
> >Compaction isn't necessary -- you should only need to restart all
> >peon's then the leader. A few minutes later the db's should start
> >trimming.
>
> As we are on a production cluster, it may not be safe to restart the
> ceph-mon, so we would prefer to do the compact on non-leader mons instead.
> Is this ok?
>

Compaction doesn't solve this particular problem, because the maps
have not yet been deleted by the ceph-mon process.

-- dan


> Thanks
> Swami
>
> On Thu, Feb 7, 2019 at 6:30 PM Dan van der Ster  wrote:
> >
> > On Thu, Feb 7, 2019 at 12:17 PM M Ranga Swami Reddy
> >  wrote:
> > >
> > > Hi Dan,
> > > >During backfilling scenarios, the mons keep old maps and grow quite
> > > >quickly. So if you have balancing, pg splitting, etc. ongoing for
> > > >awhile, the mon stores will eventually trigger that 15GB alarm.
> > > >But the intended behavior is that once the PGs are all active+clean,
> > > >the old maps should be trimmed and the disk space freed.
> > >
> > > old maps not trimmed after cluster reached to "all+clean" state for all 
> > > PGs.
> > > Is there (known) bug here?
> > > As the size of dB showing > 15G, do I need to run the compact commands
> > > to do the trimming?
> >
> > Compaction isn't necessary -- you should only need to restart all
> > peon's then the leader. A few minutes later the db's should start
> > trimming.
> >
> > -- dan
> >
> >
> > >
> > > Thanks
> > > Swami
> > >
> > > On Wed, Feb 6, 2019 at 6:24 PM Dan van der Ster  
> > > wrote:
> > > >
> > > > Hi,
> > > >
> > > > With HEALTH_OK a mon data dir should be under 2GB for even such a large 
> > > > cluster.
> > > >
> > > > During backfilling scenarios, the mons keep old maps and grow quite
> > > > quickly. So if you have balancing, pg splitting, etc. ongoing for
> > > > awhile, the mon stores will eventually trigger that 15GB alarm.
> > > > But the intended behavior is that once the PGs are all active+clean,
> > > > the old maps should be trimmed and the disk space freed.
> > > >
> > > > However, several people have noted that (at least in luminous
> > > > releases) the old maps are not trimmed until after HEALTH_OK *and* all
> > > > mons are restarted. This ticket seems related:
> > > > http://tracker.ceph.com/issues/37875
> > > >
> > > > (Over here we're restarting mons every ~2-3 weeks, resulting in the
> > > > mon stores dropping from >15GB to ~700MB each time).
> > > >
> > > > -- Dan
> > > >
> > > >
> > > > On Wed, Feb 6, 2019 at 1:26 PM Sage Weil  wrote:
> > > > >
> > > > > Hi Swami
> > > > >
> > > > > The limit is somewhat arbitrary, based on cluster sizes we had seen 
> > > > > when
> > > > > we picked it.  In your case it should be perfectly safe to increase 
> > > > > it.
> > > > >
> > > > > sage
> > > > >
> > > > >
> > > > > On Wed, 6 Feb 2019, M Ranga Swami Reddy wrote:
> > > > >
> > > > > > Hello -  Are the any limits for mon_data_size for cluster with 2PB
> > > > > > (with 2000+ OSDs)?
> > > > > >
> > > > > > Currently it set as 15G. What is logic behind this? Can we increase
> > > > > > when we get the mon_data_size_warn messages?
> > > > > >
> > > > > > I am getting the mon_data_size_warn message even though there a 
> > > > > > ample
> > > > > > of free space on the disk (around 300G free disk)
> > > > > >
> > > > > > Earlier thread on the same discusion:
> > > > > > https://www.spinics.net/lists/ceph-users/msg42456.html
> > > > > >
> > > > > > Thanks
> > > > > > Swami
> > > > > >
> > > > > >
> > > > > >
> > > > > ___
> > > > > ceph-users mailing list
> > > > > ceph-users@lists.ceph.com
> > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph mon_data_size_warn limits for large cluster

2019-02-07 Thread Dan van der Ster
On Thu, Feb 7, 2019 at 12:17 PM M Ranga Swami Reddy
 wrote:
>
> Hi Dan,
> >During backfilling scenarios, the mons keep old maps and grow quite
> >quickly. So if you have balancing, pg splitting, etc. ongoing for
> >awhile, the mon stores will eventually trigger that 15GB alarm.
> >But the intended behavior is that once the PGs are all active+clean,
> >the old maps should be trimmed and the disk space freed.
>
> old maps not trimmed after cluster reached to "all+clean" state for all PGs.
> Is there (known) bug here?
> As the size of dB showing > 15G, do I need to run the compact commands
> to do the trimming?

Compaction isn't necessary -- you should only need to restart all
peons, then the leader. A few minutes later the dbs should start
trimming.
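
I.e. something along these lines, one mon at a time and the leader last
(hostnames/paths are examples; wait for quorum to re-form before moving on):

  ceph quorum_status -f json-pretty | grep quorum_leader_name   # restart this one last
  # on each peon host, then finally on the leader:
  systemctl restart ceph-mon.target
  ceph -s        # confirm the mon is back in quorum before touching the next one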

-- dan


>
> Thanks
> Swami
>
> On Wed, Feb 6, 2019 at 6:24 PM Dan van der Ster  wrote:
> >
> > Hi,
> >
> > With HEALTH_OK a mon data dir should be under 2GB for even such a large 
> > cluster.
> >
> > During backfilling scenarios, the mons keep old maps and grow quite
> > quickly. So if you have balancing, pg splitting, etc. ongoing for
> > awhile, the mon stores will eventually trigger that 15GB alarm.
> > But the intended behavior is that once the PGs are all active+clean,
> > the old maps should be trimmed and the disk space freed.
> >
> > However, several people have noted that (at least in luminous
> > releases) the old maps are not trimmed until after HEALTH_OK *and* all
> > mons are restarted. This ticket seems related:
> > http://tracker.ceph.com/issues/37875
> >
> > (Over here we're restarting mons every ~2-3 weeks, resulting in the
> > mon stores dropping from >15GB to ~700MB each time).
> >
> > -- Dan
> >
> >
> > On Wed, Feb 6, 2019 at 1:26 PM Sage Weil  wrote:
> > >
> > > Hi Swami
> > >
> > > The limit is somewhat arbitrary, based on cluster sizes we had seen when
> > > we picked it.  In your case it should be perfectly safe to increase it.
> > >
> > > sage
> > >
> > >
> > > On Wed, 6 Feb 2019, M Ranga Swami Reddy wrote:
> > >
> > > > Hello -  Are the any limits for mon_data_size for cluster with 2PB
> > > > (with 2000+ OSDs)?
> > > >
> > > > Currently it set as 15G. What is logic behind this? Can we increase
> > > > when we get the mon_data_size_warn messages?
> > > >
> > > > I am getting the mon_data_size_warn message even though there a ample
> > > > of free space on the disk (around 300G free disk)
> > > >
> > > > Earlier thread on the same discusion:
> > > > https://www.spinics.net/lists/ceph-users/msg42456.html
> > > >
> > > > Thanks
> > > > Swami
> > > >
> > > >
> > > >
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Need help with upmap feature on luminous

2019-02-06 Thread Dan van der Ster
Note that there are some improved upmap balancer heuristics in
development here: https://github.com/ceph/ceph/pull/26187
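
For the offline route you mentioned, osdmaptool can also be limited to a
single pool, which keeps it away from the three "10k" OSDs entirely --
roughly (pool name from your earlier mail, tune --upmap-max as needed):

  ceph osd getmap -o om
  osdmaptool om --upmap out.txt --upmap-pool ec82_pool --upmap-max 100
  # review out.txt, then apply the generated pg-upmap-items commands:
  source out.txt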

-- dan

On Tue, Feb 5, 2019 at 10:18 PM Kári Bertilsson  wrote:
>
> Hello
>
> I previously enabled upmap and used automatic balancing with "ceph balancer 
> on". I got very good results and OSD's ended up with perfectly distributed 
> pg's.
>
> Now after adding several new OSD's, auto balancing does not seem to be 
> working anymore. OSD's have 30-50% usage where previously all had almost the 
> same %.
>
> I turned off auto balancer and tried manually running a plan
>
> # ceph balancer reset
> # ceph balancer optimize myplan
> # ceph balancer show myplan
> ceph osd pg-upmap-items 41.1 106 125 95 121 84 34 36 99 72 126
> ceph osd pg-upmap-items 41.5 12 121 65 3 122 52 5 126
> ceph osd pg-upmap-items 41.b 117 99 65 125
> ceph osd pg-upmap-items 41.c 49 121 81 131
> ceph osd pg-upmap-items 41.e 61 82 73 52 122 46 84 118
> ceph osd pg-upmap-items 41.f 71 127 15 121 56 82
> ceph osd pg-upmap-items 41.12 81 92
> ceph osd pg-upmap-items 41.17 35 127 71 44
> ceph osd pg-upmap-items 41.19 81 131 21 119 18 52
> ceph osd pg-upmap-items 41.25 18 52 37 125 40 3 41 34 71 127 4 128
>
>
> After running this plan there's no difference and still a huge imbalance on the 
> OSD's. Creating a new plan gives the same plan again.
>
> # ceph balancer eval
> current cluster score 0.015162 (lower is better)
>
> Balancer eval shows quite low number, so it seems to think the pg 
> distribution is already optimized ?
>
> Since I'm not getting this working again, I looked into the offline 
> optimization at http://docs.ceph.com/docs/mimic/rados/operations/upmap/
>
> I have 2 pools.
> Replicated pool using 3 OSD's with "10k" device class.
> And remaining OSD's have "hdd" device class.
>
> The resulting out.txt creates a much larger plan, but would map a lot of PG's 
> to the "10k" OSD's (where they should not be). And I can't seem to find any 
> way to exclude these 3 OSD's.
>
> Any ideas how to proceed ?
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph mon_data_size_warn limits for large cluster

2019-02-06 Thread Dan van der Ster
Hi,

With HEALTH_OK a mon data dir should be under 2GB for even such a large cluster.

During backfilling scenarios, the mons keep old maps and grow quite
quickly. So if you have balancing, pg splitting, etc. ongoing for
awhile, the mon stores will eventually trigger that 15GB alarm.
But the intended behavior is that once the PGs are all active+clean,
the old maps should be trimmed and the disk space freed.

However, several people have noted that (at least in luminous
releases) the old maps are not trimmed until after HEALTH_OK *and* all
mons are restarted. This ticket seems related:
http://tracker.ceph.com/issues/37875

(Over here we're restarting mons every ~2-3 weeks, resulting in the
mon stores dropping from >15GB to ~700MB each time).
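
If the warning itself is the main nuisance, the threshold is just
mon_data_size_warn (in bytes), and the store size is easy to keep an eye
on -- e.g.:

  # current store size on a mon host
  du -sh /var/lib/ceph/mon/*/store.db
  # raise the warning threshold to e.g. 30 GB in ceph.conf on the mons:
  #   [mon]
  #   mon data size warn = 32212254720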

-- Dan


On Wed, Feb 6, 2019 at 1:26 PM Sage Weil  wrote:
>
> Hi Swami
>
> The limit is somewhat arbitrary, based on cluster sizes we had seen when
> we picked it.  In your case it should be perfectly safe to increase it.
>
> sage
>
>
> On Wed, 6 Feb 2019, M Ranga Swami Reddy wrote:
>
> > Hello -  Are the any limits for mon_data_size for cluster with 2PB
> > (with 2000+ OSDs)?
> >
> > Currently it set as 15G. What is logic behind this? Can we increase
> > when we get the mon_data_size_warn messages?
> >
> > I am getting the mon_data_size_warn message even though there a ample
> > of free space on the disk (around 300G free disk)
> >
> > Earlier thread on the same discusion:
> > https://www.spinics.net/lists/ceph-users/msg42456.html
> >
> > Thanks
> > Swami
> >
> >
> >
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous 12.2.10 update send to 12.2.11

2019-02-05 Thread Dan van der Ster
No idea, but maybe this commit which landed in v12.2.11 is relevant:

commit 187bc76957dcd8a46a839707dea3c26b3285bd8f
Author: runsisi 
Date:   Mon Nov 12 20:01:32 2018 +0800

librbd: fix missing unblock_writes if shrink is not allowed

Fixes: http://tracker.ceph.com/issues/36778

Signed-off-by: runsisi 
(cherry picked from commit 3899bee9f5ea2c4b19fb1266a8b59f6e04e99926)



On Tue, Feb 5, 2019 at 9:53 AM Marc Roos  wrote:
>
>
> Has some protocol or so changed? I am resizing an rbd device on a
> luminous 12.2.10 cluster and a 12.2.11 client does not respond (all
> centos7)
>
> 2019-02-05 09:46:27.336885 7f9227fff700 -1 librbd::Operations: update
> notification timed-out
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Broken CephFS stray entries?

2019-01-22 Thread Dan van der Ster
On Tue, Jan 22, 2019 at 3:33 PM Yan, Zheng  wrote:
>
> On Tue, Jan 22, 2019 at 9:08 PM Dan van der Ster  wrote:
> >
> > Hi Zheng,
> >
> > We also just saw this today and got a bit worried.
> > Should we change to:
> >
>
> What is the error message (on stray dir or other dir)? does the
> cluster ever enable multi-active mds?
>

It was during an upgrade from v12.2.8 to v12.2.10. 5 active MDS's
during the upgrade.

2019-01-22 10:08:22.629545 mds.p01001532184554 mds.2
128.142.39.144:6800/268398 36 : cluster [WRN]  replayed op
client.54045065:2282648,2282514 used ino 0x3001c85b193 but session
next is 0x3001c28f018
2019-01-22 10:08:22.629617 mds.p01001532184554 mds.2
128.142.39.144:6800/268398 37 : cluster [WRN]  replayed op
client.54045065:2282649,2282514 used ino 0x3001c85b194 but session
next is 0x3001c28f018
2019-01-22 10:08:22.629652 mds.p01001532184554 mds.2
128.142.39.144:6800/268398 38 : cluster [WRN]  replayed op
client.54045065:2282650,2282514 used ino 0x3001c85b195 but session
next is 0x3001c28f018
2019-01-22 10:08:37.373704 mon.cephflax-mon-9b406e0261 mon.0
137.138.121.135:6789/0 2748 : cluster [INF] daemon mds.p01001532184554
is now active in filesystem cephfs as rank 2
2019-01-22 10:08:37.805675 mon.cephflax-mon-9b406e0261 mon.0
137.138.121.135:6789/0 2749 : cluster [INF] Health check cleared:
FS_DEGRADED (was: 1 filesystem is degraded)
2019-01-22 10:08:39.784260 mds.p01001532184554 mds.2
128.142.39.144:6800/268398 547 : cluster [ERR] bad/negative dir
size on 0x61b f(v27 m2019-01-22 10:07:38.509466 0=-1+1)
2019-01-22 10:08:39.784271 mds.p01001532184554 mds.2
128.142.39.144:6800/268398 548 : cluster [ERR] unmatched fragstat
on 0x61b, inode has f(v28 m2019-01-22 10:07:38.509466 0=-1+1),
dirfrags have f(v0 m2019-01-22 10:07:38.509466 1=0+1)
2019-01-22 10:10:02.605036 mon.cephflax-mon-9b406e0261 mon.0
137.138.121.135:6789/0 2803 : cluster [INF] Health check cleared:
MDS_INSUFFICIENT_STANDBY (was: insufficient standby MDS daemons
available)
2019-01-22 10:10:02.605089 mon.cephflax-mon-9b406e0261 mon.0
137.138.121.135:6789/0 2804 : cluster [INF] Cluster is now healthy
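
(In case the fragstats don't clear up on their own, they can also be
repaired online with a recursive scrub -- mds name taken from the log
above; check the admin socket help for the exact flags on your version:)

  ceph daemon mds.p01001532184554 scrub_path / recursive repair
  ceph daemon mds.p01001532184554 damage ls    # list any recorded metadata damage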





> > diff --git a/src/mds/CInode.cc b/src/mds/CInode.cc
> > index e8c1bc8bc1..e2539390fb 100644
> > --- a/src/mds/CInode.cc
> > +++ b/src/mds/CInode.cc
> > @@ -2040,7 +2040,7 @@ void CInode::finish_scatter_gather_update(int type)
> >
> > if (pf->fragstat.nfiles < 0 ||
> > pf->fragstat.nsubdirs < 0) {
> > - clog->error() << "bad/negative dir size on "
> > + clog->warn() << "bad/negative dir size on "
> >   << dir->dirfrag() << " " << pf->fragstat;
> >   assert(!"bad/negative fragstat" == g_conf->mds_verify_scatter);
> >
> > @@ -2077,7 +2077,7 @@ void CInode::finish_scatter_gather_update(int type)
> >   if (state_test(CInode::STATE_REPAIRSTATS)) {
> > dout(20) << " dirstat mismatch, fixing" << dendl;
> >   } else {
> > -   clog->error() << "unmatched fragstat on " << ino() << ", inode 
> > has "
> > +   clog->warn() << "unmatched fragstat on " << ino() << ", inode 
> > has "
> >   << pi->dirstat << ", dirfrags have " << dirstat;
> > assert(!"unmatched fragstat" == g_conf->mds_verify_scatter);
> >   }
> >
> >
> > Cheers, Dan
> >
> >
> > On Sat, Oct 20, 2018 at 2:33 AM Yan, Zheng  wrote:
> >>
> >> no action is required. mds fixes this type of error atomically.
> >> On Fri, Oct 19, 2018 at 6:59 PM Burkhard Linke
> >>  wrote:
> >> >
> >> > Hi,
> >> >
> >> >
> >> > upon failover or restart, or MDS complains that something is wrong with
> >> > one of the stray directories:
> >> >
> >> >
> >> > 2018-10-19 12:56:06.442151 7fc908e2d700 -1 log_channel(cluster) log
> >> > [ERR] : bad/negative dir size on 0x607 f(v133 m2018-10-19
> >> > 12:51:12.016360 -4=-5+1)
> >> > 2018-10-19 12:56:06.442182 7fc908e2d700 -1 log_channel(cluster) log
> >> > [ERR] : unmatched fragstat on 0x607, inode has f(v134 m2018-10-19
> >> > 12:51:12.016360 -4=-5+1), dirfrags have f(v0 m2018-10-19 12:51:12.016360
> >> > 1=0+1)
> >> >
> >> >
> >> > How do we handle this problem?
> >> >
> >> >
> >> > Regards,
> >> >
> >> > Burkhard
> >> >
> >> >
> >> > ___
> >> > ceph-users mailing list
> >> > ceph-users@lists.ceph.com
> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Broken CephFS stray entries?

2019-01-22 Thread Dan van der Ster
Hi Zheng,

We also just saw this today and got a bit worried.
Should we change to:

diff --git a/src/mds/CInode.cc b/src/mds/CInode.cc
index e8c1bc8bc1..e2539390fb 100644
--- a/src/mds/CInode.cc
+++ b/src/mds/CInode.cc
@@ -2040,7 +2040,7 @@ void CInode::finish_scatter_gather_update(int type)

if (pf->fragstat.nfiles < 0 ||
pf->fragstat.nsubdirs < 0) {
- clog->error() << "bad/negative dir size on "
+ clog->warn() << "bad/negative dir size on "
  << dir->dirfrag() << " " << pf->fragstat;
  assert(!"bad/negative fragstat" == g_conf->mds_verify_scatter);

@@ -2077,7 +2077,7 @@ void CInode::finish_scatter_gather_update(int type)
  if (state_test(CInode::STATE_REPAIRSTATS)) {
dout(20) << " dirstat mismatch, fixing" << dendl;
  } else {
-   clog->error() << "unmatched fragstat on " << ino() << ", inode
has "
+   clog->warn() << "unmatched fragstat on " << ino() << ", inode
has "
  << pi->dirstat << ", dirfrags have " << dirstat;
assert(!"unmatched fragstat" == g_conf->mds_verify_scatter);
  }


Cheers, Dan


On Sat, Oct 20, 2018 at 2:33 AM Yan, Zheng  wrote:

> no action is required. mds fixes this type of error atomically.
> On Fri, Oct 19, 2018 at 6:59 PM Burkhard Linke
>  wrote:
> >
> > Hi,
> >
> >
> > upon failover or restart, or MDS complains that something is wrong with
> > one of the stray directories:
> >
> >
> > 2018-10-19 12:56:06.442151 7fc908e2d700 -1 log_channel(cluster) log
> > [ERR] : bad/negative dir size on 0x607 f(v133 m2018-10-19
> > 12:51:12.016360 -4=-5+1)
> > 2018-10-19 12:56:06.442182 7fc908e2d700 -1 log_channel(cluster) log
> > [ERR] : unmatched fragstat on 0x607, inode has f(v134 m2018-10-19
> > 12:51:12.016360 -4=-5+1), dirfrags have f(v0 m2018-10-19 12:51:12.016360
> > 1=0+1)
> >
> >
> > How do we handle this problem?
> >
> >
> > Regards,
> >
> > Burkhard
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Multi-filesystem wthin a cluster

2019-01-17 Thread Dan van der Ster
On Wed, Jan 16, 2019 at 11:17 PM Patrick Donnelly  wrote:
>
> On Wed, Jan 16, 2019 at 1:21 AM Marvin Zhang  wrote:
> > Hi CephFS experts,
> > From the documentation, I know multi-fs within a cluster is still an experimental feature.
> > 1. Is there any estimation about stability and performance for this feature?
>
> Remaining blockers [1] need completed. No developer has yet taken on
> this task. Perhaps by O release.
>
> > 2. It seems that each FS will consume at least 1 active MDS and
> > different FS can't share MDS. Suppose I want to create 10 FS , I need
> > at least 10 MDS. Is it right? Is ther any limit number for MDS within
> > a cluster?
>
> No limit on number of MDS but there is a limit on the number of
> actives (multimds).

TIL...
What is the max number of actives in a single FS?

Cheers, Dan

> In the not-to-distant future, container
> orchestration platforms (e.g. Rook) underneath Ceph would provide a
> way to dynamically spin up new MDSs in response to the creation of a
> file system.
>
> [1] http://tracker.ceph.com/issues/22477
>
> --
> Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rocksdb mon stores growing until restart

2019-01-15 Thread Dan van der Ster
On Wed, Sep 19, 2018 at 7:01 PM Bryan Stillwell  wrote:
>
> > On 08/30/2018 11:00 AM, Joao Eduardo Luis wrote:
> > > On 08/30/2018 09:28 AM, Dan van der Ster wrote:
> > > Hi,
> > > Is anyone else seeing rocksdb mon stores slowly growing to >15GB,
> > > eventually triggering the 'mon is using a lot of disk space' warning?
> > > Since upgrading to luminous, we've seen this happen at least twice.
> > > Each time, we restart all the mons and then stores slowly trim down to
> > > <500MB. We have 'mon compact on start = true', but it's not the
> > > compaction that's shrinking the rockdb's -- the space used seems to
> > > decrease over a few minutes only after *all* mons have been restarted.
> > > This reminds me of a hammer-era issue where references to trimmed maps
> > > were leaking -- I can't find that bug at the moment, though.
> >
> > Next time this happens, mind listing the store contents and check if you
> > are holding way too many osdmaps? You shouldn't be holding more osdmaps
> > than the default IF the cluster is healthy and all the pgs are clean.
> >
> > I've chased a bug pertaining this last year, even got a patch, but then
> > was unable to reproduce it. Didn't pursue merging the patch any longer
> > (I think I may still have an open PR for it though), simply because it
> > was no longer clear if it was needed.
>
> I just had this happen to me while using ceph-gentle-split on a 12.2.5
> cluster with 1,370 OSDs.  Unfortunately, I restarted the mon nodes which
> fixed the problem before finding this thread.  I'm only halfway done
> with the split, so I'll see if the problem resurfaces again.
>

I think I've understood what's causing this -- it's related to the
issue we've seen where osdmaps are not being trimmed on osds.
It seems that once the oldest_map and newest_map are within 500 of each
other, they are never trimmed again until the mons are restarted.
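
A quick way to see where each daemon stands (osd.0 is just an example, and
jq is assumed to be installed):

  ceph report 2>/dev/null | jq '.osdmap_first_committed, .osdmap_last_committed'
  ceph daemon osd.0 status      # shows oldest_map / newest_map for that osd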

I updated this tracker: http://tracker.ceph.com/issues/37875

-- dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-osd processes restart during Luminous -> Mimic upgrade on CentOS 7

2019-01-15 Thread Dan van der Ster
Hi Wido,

`rpm -q --scripts ceph-selinux` will tell you why.

It was the same from 12.2.8 to 12.2.10: http://tracker.ceph.com/issues/21672

And the problem is worse than you described, because the daemons are
even restarted before all the package files have been updated.

Our procedure on these upgrades is systemctl stop ceph.target; yum
update; systemctl start ceph.target (or ceph-volume lvm activate
--all).

Cheers, Dan

On Tue, Jan 15, 2019 at 11:33 AM Wido den Hollander  wrote:
>
> Hi,
>
> I'm in the middle of upgrading a 12.2.8 cluster to 13.2.4 and I've
> noticed that during the Yum/RPM upgrade the OSDs are being restarted.
>
> Jan 15 11:24:25 x yum[2348259]: Updated: 2:ceph-base-13.2.4-0.el7.x86_64
> Jan 15 11:24:47 x systemd[1]: Stopped target ceph target allowing to
> start/stop all ceph*@.service instances at once.
> Jan 15 11:24:47 x systemd[1]: Stopped target ceph target allowing to
> start/stop all ceph-osd@.service instances at once.
> Jan 15 11:24:47 x systemd[1]: Stopping Ceph object storage daemon
> osd.267...
> 
> 
> Jan 15 11:24:54 x systemd[1]: Started Ceph object storage daemon
> osd.143.
> Jan 15 11:24:54 x systemd[1]: Started Ceph object storage daemon
> osd.1156.
> Jan 15 11:24:54 x systemd[1]: Reached target ceph target allowing to
> start/stop all ceph-osd@.service instances at once.
> Jan 15 11:24:54 x systemd[1]: Reached target ceph target allowing to
> start/stop all ceph*@.service instances at once.
> Jan 15 11:24:54 x yum[2348259]: Updated:
> 2:ceph-selinux-13.2.4-0.el7.x86_64
> Jan 15 11:24:59 x yum[2348259]: Updated: 2:ceph-osd-13.2.4-0.el7.x86_64
>
> In /etc/sysconfig/ceph there is CEPH_AUTO_RESTART_ON_UPGRADE=no
>
> So this makes me wonder, what causes the OSDs to be restarted after the
> package upgrade as we are not allowing this restart.
>
> Checking ceph.spec.in in both the Luminous and Mimic branch I can't
> find a good reason why this is happening because it checks for
> 'CEPH_AUTO_RESTART_ON_UPGRADE' which isn't set to 'yes'.
>
> In addition, ceph.spec.in never restarts 'ceph.target' which is being
> restarted.
>
> Could it be that the SELinux upgrade initiates the restart of these daemons?
>
> CentOS Linux release 7.6.1810 (Core)
> Luminous 12.2.8
> Mimic 13.2.4
>
> Has anybody seen this before?
>
> Wido
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Problems after migrating to straw2 (to enable the balancer)

2019-01-14 Thread Dan van der Ster
Your crush rule is ok:

step chooseleaf firstn 0 type host

You are replicating host-wise, not rack wise.

This is what I would suggest for your cluster, but keep in mind that a
whole-rack outage will leave some PGs incomplete.
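
(For comparison, a rack-level rule would just use

  step chooseleaf firstn 0 type rack

but with only 3 racks -- one of them half the size of the others -- that
isn't a good fit here.)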

Regarding the straw2 change causing 12% data movement -- in this case
it is a bit more than I would have expected.

-- dan



On Mon, Jan 14, 2019 at 3:40 PM Massimo Sgaravatto
 wrote:
>
> Hi Dan
>
> I have indeed at the moment only 5 OSD nodes on 3 racks.
> The crush-map is attached.
> Are you suggesting to replicate only between nodes and not between racks 
> (since the very few resources) ?
> Thanks, Massimo
>
> On Mon, Jan 14, 2019 at 3:29 PM Dan van der Ster  wrote:
>>
>> On Mon, Jan 14, 2019 at 3:18 PM Massimo Sgaravatto
>>  wrote:
>> >
>> > Thanks for the prompt reply
>> >
>> > Indeed I have different racks with different weights.
>>
>> Are you sure you're replicating across racks? You have only 3 racks,
>> one of which is half the size of the other two -- if yes, then your
>> cluster will be full once that rack is full.
>>
>> -- dan
>>
>>
>> > Below the ceph osd tree" output
>> >
>> > [root@ceph-mon-01 ~]# ceph osd tree
>> > ID CLASS WEIGHTTYPE NAME STATUS REWEIGHT PRI-AFF
>> > -1   272.80426 root default
>> > -7   109.12170 rack Rack11-PianoAlto
>> > -854.56085 host ceph-osd-04
>> > 30   hdd   5.45609 osd.30up  1.0 1.0
>> > 31   hdd   5.45609 osd.31up  1.0 1.0
>> > 32   hdd   5.45609 osd.32up  1.0 1.0
>> > 33   hdd   5.45609 osd.33up  1.0 1.0
>> > 34   hdd   5.45609 osd.34up  1.0 1.0
>> > 35   hdd   5.45609 osd.35up  1.0 1.0
>> > 36   hdd   5.45609 osd.36up  1.0 1.0
>> > 37   hdd   5.45609 osd.37up  1.0 1.0
>> > 38   hdd   5.45609 osd.38up  1.0 1.0
>> > 39   hdd   5.45609 osd.39up  1.0 1.0
>> > -954.56085 host ceph-osd-05
>> > 40   hdd   5.45609 osd.40up  1.0 1.0
>> > 41   hdd   5.45609 osd.41up  1.0 1.0
>> > 42   hdd   5.45609 osd.42up  1.0 1.0
>> > 43   hdd   5.45609 osd.43up  1.0 1.0
>> > 44   hdd   5.45609 osd.44up  1.0 1.0
>> > 45   hdd   5.45609 osd.45up  1.0 1.0
>> > 46   hdd   5.45609 osd.46up  1.0 1.0
>> > 47   hdd   5.45609 osd.47up  1.0 1.0
>> > 48   hdd   5.45609 osd.48up  1.0 1.0
>> > 49   hdd   5.45609 osd.49up  1.0 1.0
>> > -6   109.12170 rack Rack15-PianoAlto
>> > -354.56085 host ceph-osd-02
>> > 10   hdd   5.45609 osd.10up  1.0 1.0
>> > 11   hdd   5.45609 osd.11up  1.0 1.0
>> > 12   hdd   5.45609 osd.12up  1.0 1.0
>> > 13   hdd   5.45609 osd.13up  1.0 1.0
>> > 14   hdd   5.45609 osd.14up  1.0 1.0
>> > 15   hdd   5.45609 osd.15up  1.0 1.0
>> > 16   hdd   5.45609 osd.16up  1.0 1.0
>> > 17   hdd   5.45609 osd.17up  1.0 1.0
>> > 18   hdd   5.45609 osd.18up  1.0 1.0
>> > 19   hdd   5.45609 osd.19up  1.0 1.0
>> > -454.56085 host ceph-osd-03
>> > 20   hdd   5.45609 osd.20up  1.0 1.0
>> > 21   hdd   5.45609 osd.21up  1.0 1.0
>> > 22   hdd   5.45609 osd.22up  1.0 1.0
>> > 23   hdd   5.45609 osd.23up  1.0 1.0
>> > 24   hdd   5.45609 osd.24up  1.0 1.0
>> > 25   hdd   5.45609 osd.25up  1.0 1.0
>> > 26   hdd   5.45609 osd.26up  1.0 1.0
>> > 27   hdd   5.45609 osd.27up  1.0 1.0
>> > 28   hd

Re: [ceph-users] Problems after migrating to straw2 (to enable the balancer)

2019-01-14 Thread Dan van der Ster
On Mon, Jan 14, 2019 at 3:18 PM Massimo Sgaravatto
 wrote:
>
> Thanks for the prompt reply
>
> Indeed I have different racks with different weights.

Are you sure you're replicating across racks? You have only 3 racks,
one of which is half the size of the other two -- if yes, then your
cluster will be full once that rack is full.

-- dan


> Below the ceph osd tree" output
>
> [root@ceph-mon-01 ~]# ceph osd tree
> ID CLASS WEIGHTTYPE NAME STATUS REWEIGHT PRI-AFF
> -1   272.80426 root default
> -7   109.12170 rack Rack11-PianoAlto
> -854.56085 host ceph-osd-04
> 30   hdd   5.45609 osd.30up  1.0 1.0
> 31   hdd   5.45609 osd.31up  1.0 1.0
> 32   hdd   5.45609 osd.32up  1.0 1.0
> 33   hdd   5.45609 osd.33up  1.0 1.0
> 34   hdd   5.45609 osd.34up  1.0 1.0
> 35   hdd   5.45609 osd.35up  1.0 1.0
> 36   hdd   5.45609 osd.36up  1.0 1.0
> 37   hdd   5.45609 osd.37up  1.0 1.0
> 38   hdd   5.45609 osd.38up  1.0 1.0
> 39   hdd   5.45609 osd.39up  1.0 1.0
> -954.56085 host ceph-osd-05
> 40   hdd   5.45609 osd.40up  1.0 1.0
> 41   hdd   5.45609 osd.41up  1.0 1.0
> 42   hdd   5.45609 osd.42up  1.0 1.0
> 43   hdd   5.45609 osd.43up  1.0 1.0
> 44   hdd   5.45609 osd.44up  1.0 1.0
> 45   hdd   5.45609 osd.45up  1.0 1.0
> 46   hdd   5.45609 osd.46up  1.0 1.0
> 47   hdd   5.45609 osd.47up  1.0 1.0
> 48   hdd   5.45609 osd.48up  1.0 1.0
> 49   hdd   5.45609 osd.49up  1.0 1.0
> -6   109.12170 rack Rack15-PianoAlto
> -354.56085 host ceph-osd-02
> 10   hdd   5.45609 osd.10up  1.0 1.0
> 11   hdd   5.45609 osd.11up  1.0 1.0
> 12   hdd   5.45609 osd.12up  1.0 1.0
> 13   hdd   5.45609 osd.13up  1.0 1.0
> 14   hdd   5.45609 osd.14up  1.0 1.0
> 15   hdd   5.45609 osd.15up  1.0 1.0
> 16   hdd   5.45609 osd.16up  1.0 1.0
> 17   hdd   5.45609 osd.17up  1.0 1.0
> 18   hdd   5.45609 osd.18up  1.0 1.0
> 19   hdd   5.45609 osd.19up  1.0 1.0
> -454.56085 host ceph-osd-03
> 20   hdd   5.45609 osd.20up  1.0 1.0
> 21   hdd   5.45609 osd.21up  1.0 1.0
> 22   hdd   5.45609 osd.22up  1.0 1.0
> 23   hdd   5.45609 osd.23up  1.0 1.0
> 24   hdd   5.45609 osd.24up  1.0 1.0
> 25   hdd   5.45609 osd.25up  1.0 1.0
> 26   hdd   5.45609 osd.26up  1.0 1.0
> 27   hdd   5.45609 osd.27up  1.0 1.0
> 28   hdd   5.45609 osd.28up  1.0 1.0
> 29   hdd   5.45609 osd.29up  1.0 1.0
> -554.56085 rack Rack17-PianoAlto
> -254.56085 host ceph-osd-01
>  0   hdd   5.45609 osd.0 up  1.0 1.0
>  1   hdd   5.45609 osd.1 up  1.0 1.0
>  2   hdd   5.45609 osd.2 up  1.0 1.0
>  3   hdd   5.45609 osd.3 up  1.0 1.0
>  4   hdd   5.45609 osd.4 up  1.0 1.0
>  5   hdd   5.45609 osd.5 up  1.0 1.0
>  6   hdd   5.45609 osd.6 up  1.0 1.0
>  7   hdd   5.45609 osd.7 up  1.0 1.0
>  8   hdd   5.45609 osd.8 up  1.0 1.0
>  9   hdd   5.45609 osd.9 up  1.0 1.0
> [root@ceph-mon-01 ~]#
>
> On Mon, Jan 14, 2019 at 3:13 PM Dan van der Ster  wrote:
>>
>> On Mon, Jan 14, 2019 at 3:06 PM Massimo Sgaravatto
>>  wrote:
>> >
>> > I have a ceph luminous cluster running on CentOS7 nodes.
>> > This cluster has 50 OSDs, all with the same size and all with the same

Re: [ceph-users] Problems after migrating to straw2 (to enable the balancer)

2019-01-14 Thread Dan van der Ster
On Mon, Jan 14, 2019 at 3:06 PM Massimo Sgaravatto
 wrote:
>
> I have a ceph luminous cluster running on CentOS7 nodes.
> This cluster has 50 OSDs, all with the same size and all with the same weight.
>
> Since I noticed that there was a quite "unfair" usage of OSD nodes (some used 
> at 30 %, some used at 70 %) I tried to activate the balancer.
>
> But the balancer doesn't start I guess because of this problem:
>
> [root@ceph-mon-01 ~]# ceph osd crush weight-set create-compat
> Error EPERM: crush map contains one or more bucket(s) that are not straw2
>
>
> So I issued the command to convert from straw to straw2 (all the clients are 
> running luminous):
>
>
> [root@ceph-mon-01 ~]# ceph osd crush set-all-straw-buckets-to-straw2
> Error EINVAL: new crush map requires client version hammer but 
> require_min_compat_client is firefly
> [root@ceph-mon-01 ~]# ceph osd set-require-min-compat-client jewel
> set require_min_compat_client to jewel
> [root@ceph-mon-01 ~]# ceph osd crush set-all-straw-buckets-to-straw2
> [root@ceph-mon-01 ~]#
>
>
> After having issued the command, the cluster went in WARNING state because ~ 
> 12 % objects were misplaced.
>
> Is this normal ?
> I read somewhere that the migration from straw to straw2 should trigger a 
> data migration only if the OSDs have different sizes, which is not my case.

The relevant sizes to compare are the crush buckets across which you
are replicating.
Are you replicating host-wise or rack-wise?
Do you have hosts/racks with a different crush weight (e.g. different
crush size).
Maybe share your `ceph osd tree`.
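
It's also worth double-checking which rule the pool actually uses, e.g.
(pool name is a placeholder):

  ceph osd pool get <your-pool> crush_rule
  ceph osd crush rule dump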

Cheers, dan



>
>
> The cluster is still recovering, but what is worrying me is that it looks 
> like that data are being moved to the most used OSDs and the MAX_AVAIL value 
> is decreasing quite quickly.
>
> I hope that the recovery can finish without causing problems: then I will 
> immediately activate the balancer.
>
> But, if some OSDs are getting too full, is it safe to decrease their weights  
> while the cluster is still being recovered ?
>
> Thanks a lot for your help
> Of course I can provide other info, if needed
>
>
> Cheers, Massimo
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] recovering vs backfilling

2019-01-10 Thread Dan van der Ster
Hi Caspar,

On Thu, Jan 10, 2019 at 1:31 PM Caspar Smit  wrote:
>
> Hi all,
>
> I wanted to test Dan's upmap-remapped script for adding new osd's to a 
> cluster. (Then letting the balancer gradually move pgs to the new OSD 
> afterwards)

Cool. Insert "no guarantees or warranties" comment here.
And btw, I have noticed that the method doesn't always work in this
use case: sometimes the upmap balancer will not remove pg-upmap-items
entries if there are very few severely underloaded OSDs. The
calc_pg_upmaps might need some code changes to fully work in this
scenario.

> I've created a fresh (virtual) 12.2.10 4-node cluster with very small disks 
> (16GB each). 2 OSD's per node.
> Put ~20GB of data on the cluster.
>
> Now when i set the norebalance flag and add a new OSD, 99% of pgs end up 
> recovering or in recovery_wait. Only a few will be backfill_wait.
>
> The recovery starts as expected (norebalance only stops backfilling pgs) and 
> finished eventually
>
> The upmap-remapped script only works with pgs which need to be backfilled.

The script looks for pgs which are remapped but not degraded, then
uses pg-upmap-items to move pgs from the state active+remapped to
active+clean.

> it does work for the handful of pgs in backfill_wait status but my question 
> is:
>
> When is ceph doing recovery instead of backfilling? Only when the cluster is
> rather empty or what is the criteria? Are the OSD's too small?

Roughly speaking: recovery is done when the pg log still holds all of the
changes made on a pg while it was degraded.
Once that pg log overflows, the osd needs to backfill.
In other words: recovery is a process to replay a log of ops. But that
log has a size limit, so backfilling is the fallback: scan all
objects for changes.
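
The relevant knobs, if you want to check what your OSDs are using (osd.0 is
just an example):

  ceph daemon osd.0 config get osd_min_pg_log_entries
  ceph daemon osd.0 config get osd_max_pg_log_entries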

Cheers, Dan
>
> Kind regards,
> Caspar
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osdmaps not being cleaned up in 12.2.8

2019-01-10 Thread Dan van der Ster
Hi Bryan,

I think this is the old hammer thread you refer to:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-September/013060.html

We also have osdmaps accumulating on v12.2.8 -- ~12000 per osd at the moment.

I'm trying to churn the osdmaps like before, but our maps are not being trimmed.

Did you need to restart the osd's before the churn trick would work?
If so, it seems that something is holding references to old maps, much
like that old hammer issue.

Cheers, Dan


On Tue, Jan 8, 2019 at 5:39 PM Bryan Stillwell  wrote:
>
> I was able to get the osdmaps to slowly trim (maybe 50 would trim with each 
> change) by making small changes to the CRUSH map like this:
>
>
>
> for i in {1..100}; do
>   ceph osd crush reweight osd.1754 4.1
>   sleep 5
>   ceph osd crush reweight osd.1754 4
>   sleep 5
> done
>
>
>
> I believe this was the solution Dan came across back in the hammer days.  It 
> works, but not ideal for sure.  Across the cluster it freed up around 50TB of 
> data!
>
>
>
> Bryan
>
>
>
> From: ceph-users  on behalf of Bryan 
> Stillwell 
> Date: Monday, January 7, 2019 at 2:40 PM
> To: ceph-users 
> Subject: [ceph-users] osdmaps not being cleaned up in 12.2.8
>
>
>
> I have a cluster with over 1900 OSDs running Luminous (12.2.8) that isn't 
> cleaning up old osdmaps after doing an expansion.  This is even after the 
> cluster became 100% active+clean:
>
>
>
> # find /var/lib/ceph/osd/ceph-1754/current/meta -name 'osdmap*' | wc -l
>
> 46181
>
>
>
> With the osdmaps being over 600KB in size this adds up:
>
>
>
> # du -sh /var/lib/ceph/osd/ceph-1754/current/meta
>
> 31G/var/lib/ceph/osd/ceph-1754/current/meta
>
>
>
> I remember running into this during the hammer days:
>
>
>
> http://tracker.ceph.com/issues/13990
>
>
>
> Did something change recently that may have broken this fix?
>
>
>
> Thanks,
>
> Bryan
>
>
>
> ___
>
> ceph-users mailing list
>
> ceph-users@lists.ceph.com
>
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Is it possible to increase Ceph Mon store?

2019-01-08 Thread Dan van der Ster
On Tue, Jan 8, 2019 at 12:48 PM Thomas Byrne - UKRI STFC
 wrote:
>
> For what it's worth, I think the behaviour Pardhiv and Bryan are describing 
> is not quite normal, and sounds similar to something we see on our large 
> luminous cluster with elderly (created as jewel?) monitors. After large 
> operations which result in the mon stores growing to 20GB+, leaving the 
> cluster with all PGs active+clean for days/weeks will usually not result in 
> compaction, and the store sizes will slowly grow.
>
> I've played around with restarting monitors with and without 
> mon_compact_on_start set, and using 'ceph tell mon.[id] compact'. For this 
> cluster, I found the most reliable way to trigger a compaction was to restart 
> all monitors daemons, one at a time, *without* compact_on_start set. The 
> stores rapidly compact down to ~1GB in a minute or less after the last mon 
> restarts.

+1, exactly the same issue and workaround here. See this thread, which
had no resolution:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-August/029423.html
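
For the record, our workaround is roughly the following rolling restart --
hostnames are placeholders, and mon_compact_on_start is left unset:

  for m in mon1 mon2 mon3; do
      ssh "$m" systemctl restart ceph-mon.target
      sleep 60   # better: wait until `ceph quorum_status` shows the mon back in quorum
  done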

-- dan

>
>
> It's worth noting that occasionally (1 out of every 10 times, or fewer) the 
> stores will compact without prompting after all PGs become active+clean.
>
> I haven't put much time into this as I am planning on reinstalling the 
> monitors to get rocksDB mon stores. If the problem persists with the new 
> monitors I'll have another look at it.
>
> Cheers
> Tom
>
> > -Original Message-
> > From: ceph-users  On Behalf Of Wido
> > den Hollander
> > Sent: 08 January 2019 08:28
> > To: Pardhiv Karri ; Bryan Stillwell
> > 
> > Cc: ceph-users 
> > Subject: Re: [ceph-users] Is it possible to increase Ceph Mon store?
> >
> >
> >
> > On 1/7/19 11:15 PM, Pardhiv Karri wrote:
> > > Thank you Bryan, for the information. We have 816 OSDs of size 2TB each.
> > > The mon store too big popped up when no rebalancing happened in that
> > > month. It is slightly above the 15360 threshold around 15900 or 16100
> > > and stayed there for more than a week. We ran the "ceph tell mon.[ID]
> > > compact" to get it back earlier this week. Currently the mon store is
> > > around 12G on each monitor. If it doesn't grow then I won't change the
> > > value but if it grows and gives the warning then I will increase it
> > > using "mon_data_size_warn".
> > >
> >
> > This is normal. The MONs will keep a history of OSDMaps if one or more PGs
> > are not active+clean
> >
> > They will trim after all the PGs are clean again, nothing to worry about.
> >
> > You can increase the setting for the warning, but that will not shrink the
> > database.
> >
> > Just make sure your monitors have enough free space.
> >
> > Wido
> >
> > > Thanks,
> > > Pardhiv Karri
> > >
> > >
> > >
> > > On Mon, Jan 7, 2019 at 1:55 PM Bryan Stillwell  wrote:
> > >
> > > I believe the option you're looking for is mon_data_size_warn.  The
> > > default is set to 16106127360.
> > >
> > > __ __
> > >
> > > I've found that sometimes the mons need a little help getting
> > > started with trimming if you just completed a large expansion.
> > > Earlier today I had a cluster where the mon's data directory was
> > > over 40GB on all the mons.  When I restarted them one at a time with
> > > 'mon_compact_on_start = true' set in the '[mon]' section of
> > > ceph.conf, they stayed around 40GB in size.   However, when I was
> > > about to hit send on an email to the list about this very topic, the
> > > warning cleared up and now the data directory is now between 1-3GB
> > > on each of the mons.  This was on a cluster with >1900 OSDs.
> > >
> > > __ __
> > >
> > > Bryan
> > >
> > > __ __
> > >
> > > *From: *ceph-users on behalf of Pardhiv Karri
> > > *Date: *Monday, January 7, 2019 at 11:08 AM
> > > *To: *ceph-users
> > > *Subject: *[ceph-users] Is it possible to increase Ceph Mon store?
> > >
> > > __ __
> > >
> > > Hi, __ __
> > >
> > > __ __
> > >
> > > We have a large Ceph cluster (Hammer version). We recently saw its
> > > mon store growing too big > 15GB on all 3 monitors without any
> > > rebalancing happening for quiet sometime. We have compacted the DB
> > > using  "#ceph tell mon.[ID] compact" for now. But is there a way to
> > > increase the size of the mon store to 32GB or something to avoid
> > > getting the Ceph health to warning state due to Mon store growing
> > > too big?
> > >
> > > __ __
> > >
> > > -- 
> > >
> > > Thanks,
> > >
> > > *P**ardhiv **K**arri*
> > >
> > >
> > > 
> > >
> > > __ __
> > >
> > >
> > >
> > > --
> > > *Pardhiv Karri*
> > > "Rise and Rise again untilLAMBSbecome LIONS"
> > >
> > >
> > >
> > > 

Re: [ceph-users] Ceph monitors overloaded on large cluster restart

2018-12-19 Thread Dan van der Ster
Hey Andras,

Three mons is possibly too few for such a large cluster. We've had lots of
good stable experience with 5-mon clusters. I've never tried 7, so I can't
say if that would lead to other problems (e.g. leader/peon sync
scalability).

That said, our 1-osd bigbang tests managed with only 3 mons, and I
assume that outside of this full system reboot scenario your 3 cope well
enough. You should probably add 2 more, but I wouldn't expect that alone to
solve this problem in the future.

Instead, with a slightly tuned procedure and a bit of osd log grepping, I
think you could've booted this cluster more quickly than 4 hours with those
mere 3 mons.

As you know, each osd's boot process requires the downloading of all known
osdmaps. If all osds are booting together, and the mons are saturated, the
osds can become sluggish when responding to their peers, which could lead
to the flapping scenario you saw. Flapping leads to new osdmap epochs that
then need to be distributed, worsening the issue. It's good that you used
nodown and noout, because without these the boot time would've been even
longer. Next time also set noup and noin to further reduce the osdmap churn.

One other thing: there's a debug_osd level -- 10 or 20, I forget exactly --
that you can set to watch the maps sync up on each osd. Grep the osd logs
for some variations on "map" and "epoch".

In short, here's what I would've done:

0. boot the mons, waiting until they have a full quorum.
1. set nodown, noup, noin, noout   <-- with these, there should be zero new
osdmaps generated while the osds boot.
2. start booting osds. set the necessary debug_osd level to see the osdmap
sync progress in the ceph-osd logs.
3. if the mons are over saturated, boot progressively -- one rack at a
time, for example.
4. once all osds have caught up to the current osdmap, unset noup. The osds
should then all "boot" (as far as the mons are concerned) and be marked up.
(this might be sluggish on a 3400 osd cluster, perhaps taking a few 10s of
seconds). the pgs should be active+clean at this point.
5. unset nodown, noin, noout. which should change nothing provided all went
well.
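
In command form, that's roughly the following sketch (debug level and log
grepping to taste):

  ceph osd set nodown; ceph osd set noup; ceph osd set noin; ceph osd set noout
  # set debug_osd = 10 in ceph.conf (or per osd: ceph daemon osd.N config set debug_osd 10),
  # start the osds, and grep their logs for the osdmap epoch progress
  ceph osd unset noup       # once all osds have caught up to the current osdmap
  ceph osd unset nodown; ceph osd unset noin; ceph osd unset noout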

Hope that helps for next time!

Dan


On Wed, Dec 19, 2018 at 11:39 PM Andras Pataki <
apat...@flatironinstitute.org> wrote:

> Forgot to mention: all nodes are on Luminous 12.2.8 currently on CentOS
> 7.5.
>
> On 12/19/18 5:34 PM, Andras Pataki wrote:
> > Dear ceph users,
> >
> > We have a large-ish ceph cluster with about 3500 osds.  We run 3 mons
> > on dedicated hosts, and the mons typically use a few percent of a
> > core, and generate about 50Mbits/sec network traffic.  They are
> > connected at 20Gbits/sec (bonded dual 10Gbit) and are running on 2x14
> > core servers.
> >
> > We recently had to shut ceph down completely for maintenance (which we
> > rarely do), and had significant difficulties starting it up.  The
> > symptoms included OSDs hanging on startup, being marked down, flapping
> > and all that bad stuff.  After some investigation we found that the
> > 20Gbit/sec network interfaces of the monitors were completely
> > saturated as the OSDs were starting, while the monitor processes were
> > using about 3 cores (300% CPU).  We ended up having to start the OSDs
> > up super slow to make sure that the monitors could keep up - it took
> > about 4 hours to start 3500 OSDs (at a rate about 4 seconds per OSD).
> > We've tried setting noout and nodown, but that didn't really help
> > either.  A few questions that would be good to understand in order to
> > move to a better configuration.
> >
> > 1. How does the monitor traffic scale with the number of OSDs?
> > Presumably the traffic comes from distributing cluster maps as the
> > cluster changes on OSD starts.  The cluster map is perhaps O(N) for N
> > OSDs, and each OSD needs an update on a cluster change so that would
> > make one change an O(N^2) traffic.  As OSDs start, the cluster changes
> > quite a lot (N times?), so would that make the startup traffic
> > O(N^3)?  If so, that sounds pretty scary for scalability.
> >
> > 2. Would adding more monitors help here?  I.e. presumably each OSD
> > gets its maps from one monitor, so they would share the traffic. Would
> > the inter-monitor communication/elections/etc. be problematic for more
> > monitors (5, 7 or even more)?  Would more monitors be recommended?  If
> > so, how many is practical?
> >
> > 3. Are there any config parameters useful for tuning the traffic
> > (perhaps send mon updates less frequently, or something along those
> > lines)?
> >
> > Any other advice on this topic would also be helpful.
> >
> > Thanks,
> >
> > Andras
> >
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous (12.2.8 on CentOS), recover or recreate incomplete PG

2018-12-19 Thread Dan van der Ster
Glad to hear it helped.

This particular option is ultra dangerous, so imho its obfuscated name is
just perfect!

Finally, since I didn't mention it earlier, don't forget to disable the
option and restart the relevant OSDs now that they're active again. And it
would be sensible to deep scrub that PG now.
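
i.e. something along these lines, using the osd ids and pgid from your
earlier mail:

  # after removing osd_find_best_info_ignore_history_les from ceph.conf on those hosts:
  systemctl restart ceph-osd@41    # and likewise for the other osds where it was set
  ceph pg deep-scrub 107.33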

Cheers,

Dan




On Wed, Dec 19, 2018, 5:46 PM Fulvio Galeazzi  wrote:
> Ciao Dan,
>  thanks a lot for your message!  :-)
>
>Indeed, the procedure you outlined did the trick and I am now back to
> healthy state.
> --yes-i-really-really-love-ceph-parameter-names !!!
>
>Ciao ciao
>
> Fulvio
>
>  Original Message 
> Subject: Re: [ceph-users] Luminous (12.2.8 on CentOS), recover or
> recreate incomplete PG
> From: Dan van der Ster 
> To: fulvio.galea...@garr.it
> CC: ceph-users 
> Date: 12/18/2018 11:38 AM
>
> > Hi Fulvio!
> >
> > Are you able to query that pg -- which osd is it waiting for?
> >
> > Also, since you're prepared for data loss anyway, you might have
> > success setting osd_find_best_info_ignore_history_les=true on the
> > relevant osds (set it conf, restart those osds).
> >
> > -- dan
> >
> >
> > -- dan
> >
> > On Tue, Dec 18, 2018 at 11:31 AM Fulvio Galeazzi
> >  wrote:
> >>
> >> Hallo Cephers,
> >>   I am stuck with an incomplete PG and am seeking help.
> >>
> >> At some point I had a bad configuration for gnocchi which caused a
> >> flooding of tiny objects to the backend Ceph rados pool. While cleaning
> >> things up, the load on the OSD disks was such that 3 of them "commited
> >> suicide" and were marked down.
> >> Now that the situation is calm, I am left with one stubborn
> >> incomplete PG.
> >>
> >> PG_AVAILABILITY Reduced data availability: 1 pg inactive, 1 pg
> incomplete
> >>pg 107.33 is incomplete, acting [41,22,156] (reducing pool
> >> gnocchi-ct1-cl1 min_size from 2 may help; search ceph.com/docs for
> >> 'incomplete')
> >>  (by the way, reducing min_size did not help)
> >>
> >> I found this page and tried to follow the procedure outlined:
> >>
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-July/019674.html
> >>
> >> On one of the 3 replicas, the "PG export" produced some decently
> >> sized file, but when I tried to import it on the acting OSD I got error:
> >>
> >> [root@r1srv07.ct1 ~]# ceph-objectstore-tool --data-path
> >> /var/lib/ceph/osd/ceph-41 --op import --file /tmp/recover.107.33 --force
> >> pgid 107.33 already exists
> >>
> >>
> >> Questions now is: could anyone please suggest a recovery procedure? Note
> >> that for this specific case I would not mind wiping the PG.
> >>
> >> Thanks for your help!
> >>
> >>  Fulvio
> >>
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] IRC channels now require registered and identified users

2018-12-18 Thread Dan van der Ster
Hi Joao,

Has that broken the Slack connection? I can't tell if its broken or
just quiet... last message on #ceph-devel was today at 1:13am.

-- Dan


On Tue, Dec 18, 2018 at 12:11 PM Joao Eduardo Luis  wrote:
>
> All,
>
>
> Earlier this week our IRC channels were set to require users to be
> registered and identified before being allowed to join a channel. This
> looked like the most reasonable option to combat the onslaught of spam
> bots we've been getting in the last weeks/months.
>
> As of today, this is in effect for #ceph, #ceph-devel, #ceph-dashboard,
> and #ceph-orchestrators.
>
> If you are unable to join a channel because of this requirement, you
> should first register your username with
>
>   /msg nickserv register  
>
> and you'll be able to identify yourself with
>
>   /msg nickserv identify 
>
> please see [1] for more information on how to use OFTC's services.
>
>
>   -Joao
>
>
> [1] https://www.oftc.net/Services/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous (12.2.8 on CentOS), recover or recreate incomplete PG

2018-12-18 Thread Dan van der Ster
Hi Fulvio!

Are you able to query that pg -- which osd is it waiting for?

Also, since you're prepared for data loss anyway, you might have
success setting osd_find_best_info_ignore_history_les=true on the
relevant osds (set it in their conf, then restart those osds).
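
Concretely, something along these lines (osd.41 is just one of the osds
that should be in the acting set, and the query field name is from memory):

  ceph pg 107.33 query      # look at recovery_state, e.g. down_osds_we_would_probe

  # in ceph.conf on the host(s) of the relevant osds:
  [osd.41]
      osd find best info ignore history les = true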

-- dan


-- dan

On Tue, Dec 18, 2018 at 11:31 AM Fulvio Galeazzi
 wrote:
>
> Hallo Cephers,
>  I am stuck with an incomplete PG and am seeking help.
>
>At some point I had a bad configuration for gnocchi which caused a
> flooding of tiny objects to the backend Ceph rados pool. While cleaning
> things up, the load on the OSD disks was such that 3 of them "commited
> suicide" and were marked down.
>Now that the situation is calm, I am left with one stubborn
> incomplete PG.
>
> PG_AVAILABILITY Reduced data availability: 1 pg inactive, 1 pg incomplete
>   pg 107.33 is incomplete, acting [41,22,156] (reducing pool
> gnocchi-ct1-cl1 min_size from 2 may help; search ceph.com/docs for
> 'incomplete')
> (by the way, reducing min_size did not help)
>
>I found this page and tried to follow the procedure outlined:
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-July/019674.html
>
>On one of the 3 replicas, the "PG export" produced some decently
> sized file, but when I tried to import it on the acting OSD I got error:
>
> [root@r1srv07.ct1 ~]# ceph-objectstore-tool --data-path
> /var/lib/ceph/osd/ceph-41 --op import --file /tmp/recover.107.33 --force
> pgid 107.33 already exists
>
>
> Questions now is: could anyone please suggest a recovery procedure? Note
> that for this specific case I would not mind wiping the PG.
>
>Thanks for your help!
>
> Fulvio
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous radosgw S3/Keystone integration issues

2018-12-17 Thread Dan van der Ster
Hi all,

Bringing up this old thread with a couple questions:

1. Did anyone ever follow up on the 2nd part of this thread? -- is
there any way to cache keystone EC2 credentials?

2. A question for Valery: could you please explain exactly how you
added the EC2 credentials to the local backend (your workaround)? Did
you add the key to the existing uid with type=keystone? or did you
create a new user (radosgw-admin user create ...) with the needed EC2 creds?
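
I imagine it was something roughly like the following, with a placeholder
uid and the keystone EC2 key pair filled in -- but please correct me if I'm
off:

  radosgw-admin user create --uid=heavy-user --display-name="Heavy user"
  radosgw-admin key create --uid=heavy-user --key-type=s3 \
      --access-key=<keystone EC2 access key> --secret=<keystone EC2 secret>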

Cheers, Dan


On Thu, Feb 1, 2018 at 4:45 PM Valery Tschopp  wrote:
>
> Hi,
>
> We are operating a Luminous 12.2.2 radosgw, with the S3 Keystone
> authentication enabled.
>
> Some customers are uploading millions of objects per bucket at once,
> therefore the radosgw is doing millions of s3tokens POST requests to the
> Keystone. All those s3tokens requests to Keystone are the same (same
> customer, same EC2 credentials). But because there is no cache in
> radosgw for the EC2 credentials, every incoming S3 operation generates a
> call to the external auth Keystone. It can generate hundreds of s3tokens
> requests per second to Keystone.
>
> We had already this problem with Jewel, but we implemented a workaround.
> The EC2 credentials of the customer were added directly in the local
> auth engine of radosgw. So for this particular heavy user, the radosgw
> local authentication was checked first, and no external auth request to
> Keystone was necessary.
>
> But the default behavior for the S3 authentication have change in Luminous.
>
> In Luminous, if you enable the S3 Keystone authentication, every
> incoming S3 operation will first check for anonymous authentication,
> then external authentication (Keystone and/or LDAP), and only then local
> authentication.
> See https://github.com/ceph/ceph/blob/master/src/rgw/rgw_auth_s3.h#L113-L141
>
> Is there a way to get the old authentication behavior (anonymous ->
> local -> external) to work again?
>
> Or is it possible to implement a caching mechanism (similar to the Token
> cache) for the EC2 credentials?
>
> Cheers,
> Valery
>
> --
> SWITCH
> Valéry Tschopp, Software Engineer
> Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
> email: valery.tsch...@switch.ch phone: +41 44 268 1544
>
> 30 years of pioneering the Swiss Internet.
> Celebrate with us at https://swit.ch/30years
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Scheduling deep-scrub operations

2018-12-14 Thread Dan van der Ster
Luminous has:

osd_scrub_begin_week_day
osd_scrub_end_week_day

Maybe these aren't documented. I usually check here for available options:

https://github.com/ceph/ceph/blob/luminous/src/common/options.cc#L2533
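
For your schedule that would be roughly the following in ceph.conf -- though
I haven't double-checked whether the begin/end week day values are inclusive
in 12.2.x, so test it on one OSD first (0 = Sunday):

  [osd]
      osd scrub begin week day = 1   # Monday
      osd scrub end week day = 5     # around Friday; check inclusive/exclusive
      osd scrub begin hour = 7
      osd scrub end hour = 15
      # note: scrubs overdue past osd_scrub_max_interval may still run outside the window (iirc)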

-- Dan


On Fri, Dec 14, 2018 at 12:25 PM Caspar Smit  wrote:
>
> Hi all,
>
> We have operating hours from 4 pm until 7 am each weekday and 24 hour days in 
> the weekend.
>
> I was wondering if it's possible to allow deep-scrubbing from 7 am until 15 
> pm only on weekdays and prevent any deep-scrubbing in the weekend.
>
> I've seen the osd scrub begin/end hour settings but that doesn't allow for 
> preventing deep-scrubs in the weekend.
>
> Kind regards,
> Caspar
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous v12.2.10 released

2018-12-12 Thread Dan van der Ster
Hey Abhishek,

We just noticed that the debuginfo is missing for 12.2.10:
http://download.ceph.com/rpm-luminous/el7/x86_64/ceph-debuginfo-12.2.10-0.el7.x86_64.rpm

Did something break in the publishing?

Cheers, Dan

On Tue, Nov 27, 2018 at 3:50 PM Abhishek Lekshmanan  wrote:
>
>
> We're happy to announce the tenth bug fix release of the Luminous
> v12.2.x long term stable release series. The previous release, v12.2.9,
> introduced the PG hard-limit patches which were found to cause an issue
> in certain upgrade scenarios, and this release was expedited to revert
> those patches. If you already successfully upgraded to v12.2.9, you
> should **not** upgrade to v12.2.10, but rather **wait** for a release in
> which http://tracker.ceph.com/issues/36686 is addressed. All other users
> are encouraged to upgrade to this release.
>
> Notable Changes
> ---
>
> * This release reverts the PG hard-limit patches added in v12.2.9 in which,
>   a partial upgrade during a recovery/backfill, can cause the osds on the
>   previous version, to fail with assert(trim_to <= info.last_complete). The
>   workaround for users is to upgrade and restart all OSDs to a version with 
> the
>   pg hard limit, or only upgrade when all PGs are active+clean.
>
>   See also: http://tracker.ceph.com/issues/36686
>
>   As mentioned above if you've successfully upgraded to v12.2.9 DO NOT
>   upgrade to v12.2.10 until the linked tracker issue has been fixed.
>
> * The bluestore_cache_* options are no longer needed. They are replaced
>   by osd_memory_target, defaulting to 4GB. BlueStore will expand
>   and contract its cache to attempt to stay within this
>   limit. Users upgrading should note this is a higher default
>   than the previous bluestore_cache_size of 1GB, so OSDs using
>   BlueStore will use more memory by default.
>
>   For more details, see BlueStore docs[1]
>
>
> For the complete release notes with changelog, please check out the
> release blog entry at:
> http://ceph.com/releases/v12-2-10-luminous-released
>
> Getting ceph:
> 
> * Git at git://github.com/ceph/ceph.git
> * Tarball at http://download.ceph.com/tarballs/ceph-12.2.10.tar.gz
> * For packages, see http://docs.ceph.com/docs/master/install/get-packages/
> * Release git sha1: 177915764b752804194937482a39e95e0ca3de94
>
>
> [1]: 
> http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#cache-size
>
> --
> Abhishek Lekshmanan
> SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton,
> HRB 21284 (AG Nürnberg)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Upgrade to Luminous (mon+osd)

2018-12-03 Thread Dan van der Ster
On Mon, Dec 3, 2018 at 5:00 PM Jan Kasprzak  wrote:
>
> Dan van der Ster wrote:
> : It's not that simple see http://tracker.ceph.com/issues/21672
> :
> : For the 12.2.8 to 12.2.10 upgrade it seems the selinux module was
> : updated -- so the rpms restart the ceph.target.
> : What's worse is that this seems to happen before all the new updated
> : files are in place.
> :
> : Our 12.2.8 to 12.2.10 upgrade procedure is:
> :
> : systemctl stop ceph.target
> : yum update
> : systemctl start ceph.target
>
> Yes, this looks reasonable. Except that when upgrading
> from Jewel, even after the restart the OSDs do not work until
> _all_ mons are upgraded. So effectively if a PG happens to be placed
> on the mon hosts only, there will be service outage during upgrade
> from Jewel.
>
> So I guess the upgrade procedure described here:
>
> http://docs.ceph.com/docs/master/releases/luminous/#upgrade-from-jewel-or-kraken
>
> is misleading - the mons and osds get restarted anyway by the package
> upgrade itself. The user should be warned that for this reason the package
> upgrades should be run sequentially. And that the upgrade is not possible
> without service outage, when there are OSDs on the mon hosts and when
> the cluster is running under SELinux.

Note that ceph-selinux will only restart ceph.target if selinux is enabled.

So probably you could set /etc/selinux/config ... SELINUX=disabled,
reboot, then upgrade the rpms and restart the daemons selectively.
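
Roughly, per host (an untested sketch -- and only if disabling selinux is
acceptable for you):

  sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config
  reboot
  # then, after the reboot:
  systemctl stop ceph-mon.target     # or only the daemons you want to take down
  yum update
  systemctl start ceph-mon.target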

And BTW, setenforce 0 apparently doesn't disable enough of selinux --
you really do need to reboot.

# setenforce 0
# /usr/sbin/selinuxenabled
# echo $?
0

-- dan

>
> Also, there is another important thing omitted by the above upgrade
> procedure: After "ceph osd require-osd-release luminous"
> I have got HEALTH_WARN saying "application not enabled on X pool(s)".
> I have fixed this by running the following scriptlet:
>
> ceph osd pool ls | while read pool; do ceph osd pool application enable $pool 
> rbd; done
>
> (yes, all of my pools are used for rbd for now). Maybe this should be fixed
> in the release notes as well. Thanks,
>
> -Yenya
>
> : On Mon, Dec 3, 2018 at 12:42 PM Paul Emmerich  
> wrote:
> : >
> : > Upgrading Ceph packages does not restart the services -- exactly for
> : > this reason.
> : >
> : > This means there's something broken with your yum setup if the
> : > services are restarted when only installing the new version.
> : >
> : >
> : > Paul
> : >
> : > --
> : > Paul Emmerich
> : >
> : > Looking for help with your Ceph cluster? Contact us at https://croit.io
> : >
> : > croit GmbH
> : > Freseniusstr. 31h
> : > 81247 München
> : > www.croit.io
> : > Tel: +49 89 1896585 90
> : >
> : > Am Mo., 3. Dez. 2018 um 11:56 Uhr schrieb Jan Kasprzak :
> : > >
> : > > Hello, ceph users,
> : > >
> : > > I have a small(-ish) Ceph cluster, where there are osds on each host,
> : > > and in addition to that, there are mons on the first three hosts.
> : > > Is it possible to upgrade the cluster to Luminous without service
> : > > interruption?
> : > >
> : > > I have tested that when I run "yum --enablerepo Ceph update" on a
> : > > mon host, the osds on that host remain down until all three mons
> : > > are upgraded to Luminous. Is it possible to upgrade ceph-mon only,
> : > > and keep ceph-osd running the old version (Jewel in my case) as long
> : > > as possible? It seems RPM dependencies forbid this, but with --nodeps
> : > > it could be done.
> : > >
> : > > Is there a supported way how to upgrade host running both mon and osd
> : > > to Luminous?
> : > >
> : > > Thanks,
> : > >
> : > > -Yenya
> : > >
> : > > --
> : > > | Jan "Yenya" Kasprzak  private}> |
> : > > | http://www.fi.muni.cz/~kas/ GPG: 
> 4096R/A45477D5 |
> : > >  This is the world we live in: the way to deal with computers is to 
> google
> : > >  the symptoms, and hope that you don't have to watch a video. --P. 
> Zaitcev
> : > > ___
> : > > ceph-users mailing list
> : > > ceph-users@lists.ceph.com
> : > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> : > ___
> : > ceph-users mailing list
> : > ceph-users@lists.ceph.com
> : > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> --
> | Jan "Yenya" Kasprzak  |
> | http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
>  This is the world we live in: the way to deal with computers is to google
>  the symptoms, and hope that you don't have to watch a video. --P. Zaitcev
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

