[ceph-users] Announcing Ceph Day NYC 2024 - April 26th!

2024-03-07 Thread Dan van der Ster
Hi everyone,

Ceph Days are coming to New York City again this year, co-hosted by
Bloomberg Engineering and Clyso!

We're planning a full day of Ceph content, well timed to learn about the
latest and greatest Squid release.

https://ceph.io/en/community/events/2024/ceph-days-nyc/

We're opening the CFP for presenters today -- it will close on March 26th
so please get your proposals in quickly!

Registration is also open now and space is limited, so book now to reserve
your seat!

Looking forward to seeing you there!

-- dan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Performance improvement suggestion

2024-02-20 Thread Dan van der Ster
Hi,

I just want to echo what the others are saying.

Keep in mind that RADOS needs to guarantee read-after-write consistency for
the higher level apps to work (RBD, RGW, CephFS). If you corrupt VM block
devices, S3 objects or bucket metadata/indexes, or CephFS metadata, you're
going to suffer some long days and nights recovering.

Anyway, I think that what you proposed has at best a similar reliability to
min_size=1. And note that min_size=1 is strongly discouraged because of the
very high likelihood that a device/network/power failure turns into a
visible outage. In short: your idea would turn every OSD into a SPoF.

How would you handle this very common scenario: a power outage followed by
at least one device failing to start afterwards?

1. Write object A from client.
2. Fsync to primary device completes.
3. Ack to client.
4. Writes sent to replicas.
5. Cluster wide power outage (before replicas committed).
6. Power restored, but the primary osd does not start (e.g. permanent hdd
failure).
7. Client tries to read object A.

Today, with min_size=1 such a scenario manifests as data loss: you get
either a down PG (with many objects offline and IO blocked until you
manually decide which data loss mode to accept) or unfound objects (with
IO blocked until you accept data loss). With min_size=2 the likelihood of
data loss is dramatically reduced.
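
As a side note, a quick hedged way to check for and raise that setting on
any pool still running with the risky value (the pool name is a placeholder):

ceph osd pool ls detail | grep min_size
ceph osd pool set <pool> min_size 2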

Another thing about that power loss scenario is that all dirty PGs would
need to be recovered when the cluster reboots. You'd lose all the writes in
transit and have to replay them from the primary's pg_log, or backfill if
the pg_log was too short. Again, any failure during that recovery would
lead to data loss.

So I think that to maintain any semblance of reliability, you'd need to at
least wait for a commit ack from the first replica (i.e. min_size=2). But
since the replica writes are dispatched in parallel, your speedup would
evaporate.

Another thing: I suspect this idea would result in many inconsistencies
from transient issues. You'd need to ramp up the number of parallel
deep-scrubs to look for those inconsistencies quickly, which would also
work against any potential speedup.

Cheers, Dan

--
Dan van der Ster
CTO

Clyso GmbH
w: https://clyso.com | e: dan.vanders...@clyso.com

Try our Ceph Analyzer!: https://analyzer.clyso.com/
We are hiring: https://www.clyso.com/jobs/


On Wed, Jan 31, 2024, 11:49 quag...@bol.com.br  wrote:

> Hello everybody,
>  I would like to make a suggestion for improving performance in Ceph
> architecture.
>  I don't know if this group would be the best place or if my proposal
> is correct.
>
>  My suggestion would be in the item
> https://docs.ceph.com/en/latest/architecture/, at the end of the topic
> "Smart Daemons Enable Hyperscale".
>
>  The client needs to "wait" for the configured number of replicas to
> be written (so that the client receives an ok and continues). This way, if
> there is slowness on any of the disks on which the PG will be updated, the
> client is left waiting.
>
>  It would be possible:
>
>  1-) Only record on the primary OSD
>  2-) Write the other replicas in the background (the same way as when an
> OSD fails: "degraded").
>
>  This way, the client gets a faster response when writing to storage:
> improving latency and performance (throughput and IOPS).
>
>  I would find it acceptable to have a period of time (seconds) until
> all replicas are ok (written asynchronously) in exchange for improved
> performance.
>
>  Could you evaluate this scenario?
>
>
> Rafael.
>
>  ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Help: Balancing Ceph OSDs with different capacity

2024-02-07 Thread Dan van der Ster
Hi Jasper,

I suggest to disable all the crush-compat and reweighting approaches.
They rarely work out.

The state of the art is:

ceph balancer on
ceph balancer mode upmap
ceph config set mgr mgr/balancer/upmap_max_deviation 1
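
After applying that, the result can be checked with the balancer's own reporting:

ceph balancer status
ceph balancer eval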

Cheers, Dan

--
Dan van der Ster
CTO

Clyso GmbH
p: +49 89 215252722 | a: Vancouver, Canada
w: https://clyso.com | e: dan.vanders...@clyso.com

Try our Ceph Analyzer!: https://analyzer.clyso.com/
We are hiring: https://www.clyso.com/jobs/




On Wed, Feb 7, 2024 at 11:05 AM Jasper Tan  wrote:
>
> Hi
>
> I have recently onboarded new OSDs into my Ceph Cluster. Previously, I had
> 44 OSDs of 1.7TiB each and was using it for about a year. About 1 year ago,
> we onboarded an additional 20 OSDs of 14TiB each.
>
> However, I observed that much of the data was still being written onto the
> original 1.7TiB OSDs instead of the 14TiB ones. Over time, this caused a
> bottleneck as the 1.7TiB OSDs reached nearfull capacity.
>
> I have tried to perform a reweight (both crush reweight and reweight) to
> reduce the number of PGs on each 1.7TiB. This worked temporarily but
> resulted in many objects being misplaced and PGs being in a Warning state.
>
> Subsequently I have also tried using crush-compat balancer mode instead of
> upmap but did not see significant improvement. The latest changes I made
> was to change backfill-threshold to 0.85, hoping that PGs would no longer be
> assigned to OSDs that are above 85% utilization. However, this did not change
> the situation much, as I see many OSDs above 85% utilization today.
>
> Attached is a report from ceph report command. For now I have OUT-ed two of
> my OSDs which have reached 95% capacity. I would greatly appreciate it if
> someone can provide advice on this matter.
>
> Thanks
> Jasper Tan
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph Nautilous 14.2.22 slow OSD memory leak?

2024-01-10 Thread Dan van der Ster
Hi Samuel,

It can be a few things. A good place to start is to dump_mempools of one of
those bloated OSDs:

`ceph daemon osd.123 dump_mempools`
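
To interpret the result, it can help to compare the mempool totals against
the configured target, roughly like this (osd.123 stands in for one of the
bloated OSDs):

ceph config get osd.123 osd_memory_target
ceph tell osd.123 heap stats   # tcmalloc view, if the OSD is built with tcmalloc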

Cheers, Dan


--
Dan van der Ster
CTO

Clyso GmbH
p: +49 89 215252722 | a: Vancouver, Canada
w: https://clyso.com | e: dan.vanders...@clyso.com

We are hiring: https://www.clyso.com/jobs/



On Wed, Jan 10, 2024 at 10:20 AM huxia...@horebdata.cn <
huxia...@horebdata.cn> wrote:

> Dear Ceph folks,
>
> I am responsible for two Ceph clusters running Nautilus 14.2.22,
> one with replication 3, and the other with EC 4+2. After around 400 days
> running quietly and smoothly, the two clusters recently developed similar
> problems: some OSDs consume ca. 18 GB while the memory target is set
> to 2 GB.
>
> What could be going wrong in the background? Is there a slow OSD memory leak
> issue in 14.2.22 that I do not know about yet?
>
> I would highly appreciate it if someone could provide any clues, ideas, or
> comments.
>
> best regards,
>
> Samuel
>
>
>
> huxia...@horebdata.cn
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How balancer module balance data

2023-11-27 Thread Dan van der Ster
Hi,

For the reason you observed, I normally set upmap_max_deviation = 1 on
all clusters I get my hands on.
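
For reference, that is the same knob shown elsewhere on this list; a hedged
example of applying it and re-checking the distribution score:

ceph config set mgr mgr/balancer/upmap_max_deviation 1
ceph balancer eval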

Cheers, Dan

--
Dan van der Ster
CTO

Clyso GmbH
p: +49 89 215252722 | a: Vancouver, Canada
w: https://clyso.com | e: dan.vanders...@clyso.com

We are hiring: https://www.clyso.com/jobs/
Try our Ceph Analyzer! https://analyzer.clyso.com/

On Mon, Nov 27, 2023 at 10:30 AM  wrote:
>
> Hello,
>
> We are running a pacific 16.2.10 cluster and enabled the balancer module, 
> here is the configuration:
>
> [root@ceph-1 ~]# ceph balancer status
> {
> "active": true,
> "last_optimize_duration": "0:00:00.052548",
> "last_optimize_started": "Fri Nov 17 17:09:57 2023",
> "mode": "upmap",
> "optimize_result": "Unable to find further optimization, or pool(s) 
> pg_num is decreasing, or distribution is already perfect",
> "plans": []
> }
> [root@ceph-1 ~]# ceph balancer eval
> current cluster score 0.017742 (lower is better)
>
> Here is the balancer configuration of upmap_max_deviation:
> # ceph config get mgr mgr/balancer/upmap_max_deviation
> 5
>
> We have two different types of OSDs, one is 7681G and the other is 3840G. When
> I checked the PG distribution on each type of OSD, I found it is not even:
> for the 7681G OSDs the PG count varies from 136 to 158, while for the 3840G
> OSDs it varies from 60 to 83, so the deviation is almost +/- 10. I am
> wondering whether this is expected or whether I need to change
> upmap_max_deviation to a smaller value.
>
> Thanks for answering my question.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Issue with CephFS (mds stuck in clientreplay status) since upgrade to 18.2.0.

2023-11-27 Thread Dan van der Ster
Hi Giuseppe,

There are likely one or two clients whose op is blocking the reconnect/replay.
If you increase debug_mds perhaps you can find the guilty client and
disconnect it / block it from mounting.
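
A rough sketch of that approach (rank 1 is only an example because it reported
the most slow requests in your output; the client id is a placeholder):

ceph config set mds debug_mds 10
ceph tell mds.cephfs:1 client ls
ceph tell mds.cephfs:1 client evict id=<client-id>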

Or, for a more disruptive recovery, you can try the "Deny all reconnect
to clients" option:
https://docs.ceph.com/en/reef/cephfs/troubleshooting/#avoiding-recovery-roadblocks

Hope that helps,

Dan


On Mon, Nov 27, 2023 at 1:08 AM Lo Re Giuseppe  wrote:
>
> Hi,
> We have upgraded one ceph cluster from 17.2.7 to 18.2.0. Since then we are 
> having CephFS issues.
> For example this morning:
> “””
> [root@naret-monitor01 ~]# ceph -s
>   cluster:
> id: 63334166-d991-11eb-99de-40a6b72108d0
> health: HEALTH_WARN
> 1 filesystem is degraded
> 3 clients failing to advance oldest client/flush tid
> 3 MDSs report slow requests
> 6 pgs not scrubbed in time
> 29 daemons have recently crashed
> …
> “””
>
> The ceph orch, ceph crash and ceph fs status commands were hanging.
>
> After a “ceph mgr fail” those commands started to respond.
> Then I have noticed that there was one mds with most of the slow operations,
>
> “””
> [WRN] MDS_SLOW_REQUEST: 3 MDSs report slow requests
> mds.cephfs.naret-monitor01.nuakzo(mds.0): 18 slow requests are blocked > 30 secs
> mds.cephfs.naret-monitor01.uvevbf(mds.1): 1683 slow requests are blocked > 30 secs
> mds.cephfs.naret-monitor02.exceuo(mds.2): 1 slow requests are blocked > 30 secs
> “””
>
> Then I tried to restart it with
>
> “””
> [root@naret-monitor01 ~]# ceph orch daemon restart mds.cephfs.naret-monitor01.uvevbf
> Scheduled to restart mds.cephfs.naret-monitor01.uvevbf on host 'naret-monitor01'
> “””
>
> After that, the cephfs entered this situation:
> “””
> [root@naret-monitor01 ~]# ceph fs status
> cephfs - 198 clients
> ==
> RANK   STATE          MDS                            ACTIVITY     DNS    INOS   DIRS   CAPS
>  0     active         cephfs.naret-monitor01.nuakzo  Reqs: 0 /s   17.2k  16.2k  1892   14.3k
>  1     active         cephfs.naret-monitor02.ztdghf  Reqs: 0 /s   28.1k  10.3k   752   6881
>  2     clientreplay   cephfs.naret-monitor02.exceuo               63.0k  6491    541   66
>  3     active         cephfs.naret-monitor03.lqppte  Reqs: 0 /s   16.7k  13.4k  8233990
>            POOL             TYPE      USED   AVAIL
>    cephfs.cephfs.meta      metadata   5888M  18.5T
>    cephfs.cephfs.data        data      119G   215T
> cephfs.cephfs.data.e_4_2     data     2289G  3241T
> cephfs.cephfs.data.e_8_3     data     9997G   470T
>  STANDBY MDS
> cephfs.naret-monitor03.eflouf
> cephfs.naret-monitor01.uvevbf
> MDS version: ceph version 18.2.0 (5dd24139a1eada541a3bc16b6941c5dde975e26d) reef (stable)
> “””
>
> The file system is totally unresponsive (we can mount it on client nodes but 
> any operations like a simple ls hangs).
>
> During the night we had a lot of mds crashes, I can share the content.
>
> Does anybody have an idea on how to tackle this problem?
>
> Best,
>
> Giuseppe
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MDS_DAMAGE in 17.2.7 / Cannot delete affected files

2023-11-24 Thread Dan van der Ster
Hi Sebastian,

You can find some more discussion and fixes for this type of fs
corruption here:
https://www.spinics.net/lists/ceph-users/msg76952.html

--
Dan van der Ster
CTO

Clyso GmbH
p: +49 89 215252722 | a: Vancouver, Canada
w: https://clyso.com | e: dan.vanders...@clyso.com

We are hiring: https://www.clyso.com/jobs/

On Fri, Nov 24, 2023 at 5:48 AM Sebastian Knust
 wrote:
>
> Hi,
>
> After updating from 17.2.6 to 17.2.7 with cephadm, our cluster went into
> MDS_DAMAGE state. We had some prior issues with faulty kernel clients
> not releasing capabilities, therefore the update might just be a
> coincidence.
>
> `ceph tell mds.cephfs:0 damage ls` lists 56 affected files all with
> these general details:
>
> {
>  "damage_type": "dentry",
>  "id": 123456,
>  "ino": 1234567890,
>  "frag": "*",
>  "dname": "some-filename.ext",
>  "snap_id": "head",
>  "path": "/full/path/to/file"
> }
>
> The behaviour upon trying to access file information in the (Kernel
> mounted) filesystem is a bit inconsistent. Generally, the first `stat`
> call seems to result in "Input/output error", the next call provides all
> `stat` data as expected from an undamaged file. The file can be read
> with `cat` with full and correct content (verified with backup) once the
> stat call succeeds.
>
> Scrubbing the affected subdirectories with `ceph tell mds.cephfs:0 scrub
> start /path/to/dir/ recursive,repair,force` does not fix the issue.
>
> Trying to delete the file results in an "Input/output error". If the
> stat calls beforehand succeeded, this also crashes the active MDS with
> these messages in the system journal:
> > Nov 24 14:21:15 iceph-18.servernet ceph-mds[1946861]: 
> > mds.0.cache.den(0x10012271195 DisplaySettings.json) newly corrupt dentry to 
> > be committed: [dentry 
> > #0x1/homes/huser/d3data/transfer/hortkrass/FLIMSIM/2023-04-12-irf-characterization/2-qwp-no-extra-filter-pc-off-tirf-94-tirf-cursor/DisplaySettings.json
> >  [1000275c4a0,head] auth (dversion lock) pv=0 v=225 ino=0x10012271197 
> > state=1073741824 | inodepin=1 0x56413e1e2780]
> > Nov 24 14:21:15 iceph-18.servernet ceph-mds[1946861]: log_channel(cluster) 
> > log [ERR] : MDS abort because newly corrupt dentry to be committed: [dentry 
> > #0x1/homes/huser/d3data/transfer/hortkrass/FLIMSIM/2023-04-12-irf-characterization/2-qwp-no-extra-filter-pc-off-tirf-94-tirf-cursor/DisplaySettings.json
> >  [1000275c4a0,head] auth (dversion lock) pv=0 v=225 ino=0x10012271197 
> > state=1073741824 | inodepin=1 0x56413e1e2780]
> > Nov 24 14:21:15 iceph-18.servernet 
> > ceph-eafd0514-3644-11eb-bc6a-3cecef2330fa-mds-cephfs-iceph-18-ujfqnd[1946838]:
> >  2023-11-24T13:21:15.654+ 7f3fdcde0700 -1 mds.0.cache.den(0x10012271195 
> > DisplaySettings.json) newly corrupt dentry to be committed: [dentry 
> > #0x1/homes/huser/d3data/transfer/hortkrass/FLIMSIM/2023-04-12-irf-characterization/2-qwp-no-extra-filter-pc-off-tirf-94-tirf-cursor/DisplaySettings.json
> >  [1000275c4a0,head] auth (dversion lock) pv=0 v=225 ino=0x1001>
> > Nov 24 14:21:15 iceph-18.servernet 
> > ceph-eafd0514-3644-11eb-bc6a-3cecef2330fa-mds-cephfs-iceph-18-ujfqnd[1946838]:
> >  2023-11-24T13:21:15.654+ 7f3fdcde0700 -1 log_channel(cluster) log 
> > [ERR] : MDS abort because newly corrupt dentry to be committed: [dentry 
> > #0x1/homes/huser/d3data/transfer/hortkrass/FLIMSIM/2023-04-12-irf-characterization/2-qwp-no-extra-filter-pc-off-tirf-94-tirf-cursor/DisplaySettings.json
> >  [1000275c4a0,head] auth (dversion lock) pv=0 v=225 ino=0x10012>
> > Nov 24 14:21:15 iceph-18.servernet 
> > ceph-eafd0514-3644-11eb-bc6a-3cecef2330fa-mds-cephfs-iceph-18-ujfqnd[1946838]:
> >  
> > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/mds/MDSRank.cc:
> >  In function 'void MDSRank::abort(std::string_view)' thread 7f3fdcde0700 
> > time 2023-11-24T13:21:15.655088+
> > Nov 24 14:21:15 iceph-18.servernet ceph-mds[1946861]: 
> > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/mds/MDSRank.cc:
> >  In function 'void MDSRank::abort(std::string_view)' thread 7f3fdcde0700 
> > time 2023-11-24T13:21:15.655088+
> >   
> > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABL

[ceph-users] Re: RGW access logs with bucket name

2023-11-02 Thread Dan van der Ster
Using the ops log is a good option -- I had missed that it can now log
to a file. In Quincy:

# ceph config set global rgw_ops_log_rados false
# ceph config set global rgw_ops_log_file_path '/var/log/ceph/ops-log-$cluster-$name.log'
# ceph config set global rgw_enable_ops_log true

Then restart all RGWs.

Thanks!

Dan

--
Dan van der Ster
CTO

Clyso GmbH
p: +49 89 215252722 | a: Vancouver, Canada
w: https://clyso.com | e: dan.vanders...@clyso.com

We are hiring: https://www.clyso.com/jobs/

On Mon, Oct 30, 2023 at 7:19 AM Casey Bodley  wrote:
>
> another option is to enable the rgw ops log, which includes the bucket
> name for each request
>
> the http access log line that's visible at log level 1 follows a known
> apache format that users can scrape, so i've resisted adding extra
> s3-specific stuff like bucket/object names there. there was some
> recent discussion around this in
> https://github.com/ceph/ceph/pull/50350, which had originally extended
> that access log line
>
> On Mon, Oct 30, 2023 at 6:03 AM Boris Behrens  wrote:
> >
> > Hi Dan,
> >
> > we are currently moving all the logging into lua scripts, so it is not an
> > issue anymore for us.
> >
> > Thanks
> >
> > ps: the ceph analyzer is really cool. plusplus
> >
> > Am Sa., 28. Okt. 2023 um 22:03 Uhr schrieb Dan van der Ster <
> > dan.vanders...@clyso.com>:
> >
> > > Hi Boris,
> > >
> > > I found that you need to use debug_rgw=10 to see the bucket name :-/
> > >
> > > e.g.
> > > 2023-10-28T19:55:42.288+ 7f34dde06700 10 req 3268931155513085118
> > > 0.0s s->object=... s->bucket=xyz-bucket-123
> > >
> > > Did you find a more convenient way in the meantime? I think we should
> > > log bucket name at level 1.
> > >
> > > Cheers, Dan
> > >
> > > --
> > > Dan van der Ster
> > > CTO
> > >
> > > Clyso GmbH
> > > p: +49 89 215252722 | a: Vancouver, Canada
> > > w: https://clyso.com | e: dan.vanders...@clyso.com
> > >
> > > Try our Ceph Analyzer: https://analyzer.clyso.com
> > >
> > > On Thu, Mar 30, 2023 at 4:15 AM Boris Behrens  wrote:
> > > >
> > > > Sadly not.
> > > > I only see the path/query of a request, but not the hostname.
> > > > So when a bucket is accessed via hostname (https://bucket.TLD/object?query)
> > > > I only see the object and the query (GET /object?query).
> > > > When a bucket is accessed via path (https://TLD/bucket/object?query) I can
> > > > also see the bucket in the log (GET bucket/object?query)
> > > >
> > > > Am Do., 30. März 2023 um 12:58 Uhr schrieb Szabo, Istvan (Agoda) <
> > > > istvan.sz...@agoda.com>:
> > > >
> > > > > It has the full url begins with the bucket name in the beast logs http
> > > > > requests, hasn’t it?
> > > > >
> > > > > Istvan Szabo
> > > > > Staff Infrastructure Engineer
> > > > > ---
> > > > > Agoda Services Co., Ltd.
> > > > > e: istvan.sz...@agoda.com
> > > > > ---
> > > > >
> > > > > On 2023. Mar 30., at 17:44, Boris Behrens  wrote:
> > > > >
> > > > >
> > > > > Bringing up that topic again:
> > > > > is it possible to log the bucket name in the rgw client logs?
> > > > >
> > > > > currently I am only able to know the bucket name when someone accesses
> > > > > the bucket via https://TLD/bucket/object instead of https://bucket.TLD/object.
> > > > >
> > > > > Am Di., 3. Jan. 2023 um 10:25 Uhr schrieb Boris Behrens 
> > > > >  > > >:
> > > > >
> > > > > Hi,
> > > > >
> > > > > I am looking to move our logs from
> > > > > /var/log/ceph/ceph-client...log to our log aggregator.
> > > > >
> > > > >
> > > > > Is there a way to have the bucket name in the log file?
> > > > >
> > > > >
> > > > > Or can I write the rgw_enable_ops_log into a file? Maybe I

[ceph-users] Re: RGW access logs with bucket name

2023-10-28 Thread Dan van der Ster
Hi Boris,

I found that you need to use debug_rgw=10 to see the bucket name :-/

e.g.
2023-10-28T19:55:42.288+ 7f34dde06700 10 req 3268931155513085118
0.0s s->object=... s->bucket=xyz-bucket-123
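
A hedged example of applying that level cluster-wide (you may prefer to scope
it to just the RGW daemons) and reverting to the default afterwards:

ceph config set global debug_rgw 10
ceph config rm global debug_rgw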

Did you find a more convenient way in the meantime? I think we should
log bucket name at level 1.

Cheers, Dan

--
Dan van der Ster
CTO

Clyso GmbH
p: +49 89 215252722 | a: Vancouver, Canada
w: https://clyso.com | e: dan.vanders...@clyso.com

Try our Ceph Analyzer: https://analyzer.clyso.com

On Thu, Mar 30, 2023 at 4:15 AM Boris Behrens  wrote:
>
> Sadly not.
> I only see the path/query of a request, but not the hostname.
> So when a bucket is accessed via hostname (https://bucket.TLD/object?query)
> I only see the object and the query (GET /object?query).
> When a bucket is accessed via path (https://TLD/bucket/object?query) I can
> also see the bucket in the log (GET bucket/object?query)
>
> Am Do., 30. März 2023 um 12:58 Uhr schrieb Szabo, Istvan (Agoda) <
> istvan.sz...@agoda.com>:
>
> > It has the full url begins with the bucket name in the beast logs http
> > requests, hasn’t it?
> >
> > Istvan Szabo
> > Staff Infrastructure Engineer
> > ---
> > Agoda Services Co., Ltd.
> > e: istvan.sz...@agoda.com
> > ---
> >
> > On 2023. Mar 30., at 17:44, Boris Behrens  wrote:
> >
> >
> > Bringing up that topic again:
> > is it possible to log the bucket name in the rgw client logs?
> >
> > currently I am only able to know the bucket name when someone accesses the
> > bucket via https://TLD/bucket/object instead of https://bucket.TLD/object.
> >
> > Am Di., 3. Jan. 2023 um 10:25 Uhr schrieb Boris Behrens :
> >
> > Hi,
> >
> > I am looking to move our logs from
> > /var/log/ceph/ceph-client...log to our log aggregator.
> >
> >
> > Is there a way to have the bucket name in the log file?
> >
> >
> > Or can I write the rgw_enable_ops_log into a file? Maybe I could work with
> >
> > this.
> >
> >
> > Cheers and happy new year
> >
> > Boris
> >
> >
> >
> >
> > --
> > Die Selbsthilfegruppe "UTF-8-Probleme" trifft sich diesmal abweichend im
> > groüen Saal.
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
> >
> >
>
>
> --
> Die Selbsthilfegruppe "UTF-8-Probleme" trifft sich diesmal abweichend im
> groüen Saal.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Ceph Leadership Team notes 10/25

2023-10-25 Thread Dan van der Ster
Hi all,

Here are this week's notes from the CLT:

* Collective review of the Reef/Squid "State of Cephalopod" slides.
* Smoke test suite was unscheduled but it's back on now.
* Releases:
   * 17.2.7: about to start building last week, delayed by a few
issues (https://tracker.ceph.com/issues/63257,
https://tracker.ceph.com/issues/63305,
https://github.com/ceph/ceph/pull/54169). ceph_exporter test coverage
will be prioritized.
   * 18.2.1: all PRs in testing or merged.
* Ceph Board approved a new Foundation member tiers model, Silver,
Gold, Platinum, Diamond. Working on implementation with LF.

-- dan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: index object in shard begins with hex 80

2023-07-18 Thread Dan van der Ster
Hi Chris,

Those objects are in the so-called "ugly namespace" of RGW, used to
prefix special bucket index entries.

// No UTF-8 character can begin with 0x80, so this is a safe indicator
// of a special bucket-index entry for the first byte. Note: although
// it has no impact, the 2nd, 3rd, or 4th byte of a UTF-8 character
// may be 0x80.
#define BI_PREFIX_CHAR 0x80

You can use --omap-key-file and some sed magic to interact with those keys,
e.g. like this example from my archives [1].
(In my example I needed to remove orphaned olh entries -- in your case you
can generate uglykeys.txt in whichever way is meaningful for your
situation.)

BTW, to be clear, I'm not suggesting you blindly delete those keys. You
would need to confirm that they are not needed by a current bucket instance
before deleting, lest some index get corrupted.
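
A hedged way to inspect those entries before deciding anything: listomapvals
dumps every key and value of the shard object, and if your rados build supports
--omap-key-file for getomapval (as it does for rmomapkey in the example below)
you can fetch a single entry via a binary key file:

rados -p pool.index listomapvals .dir.zone.bucketid.xx.indexshardnumber
printf '\x800_4771163.3444695458.6' > mykey
rados -p pool.index getomapval .dir.zone.bucketid.xx.indexshardnumber --omap-key-file mykey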

Cheers, Dan

__
Clyso GmbH | Ceph Support and Consulting | https://www.clyso.com


[1]

# radosgw-admin bi list --bucket=xxx --shard-id=0 > xxx.bilist.0
# cat xxx.bilist.0 | jq -r '.[]|select(.type=="olh" and .entry.key.name=="") | .idx' > uglykeys.txt
# head -n2 uglykeys.txt
�1001_00/2a/002a985cc73a01ce738da460b990e9b2fa849eb4411efb0a4598876c2859d444/2018_12_11/2893439/3390300/metadata.gz
�1001_02/5f/025f8e0fc8234530d6ae7302adf682509f0f7fb68666391122e16d00bd7107e3/2018_11_14/2625203/3034777/metadata.gz

# cat do_remove.sh

# usage: "bash do_remove.sh | sh -x"
while read f;
do
echo -n $f | sed 's/^.1001_/echo -n -e \\x801001_/'; echo ' > mykey && rados rmomapkey -p default.rgw.buckets.index .dir.zone.bucketid.xx.indexshardnumber --omap-key-file mykey';
done < uglykeys.txt




On Tue, Jul 18, 2023 at 9:27 AM Christopher Durham 
wrote:

> Hi,
> I am using ceph 17.2.6 on rocky linux 8.
> I got a large omap object warning today.
> Ok, So I tracked it down to a shard for a bucket in the index pool of an
> s3 pool.
>
> However, when listing the omap keys with:
> # rados -p pool.index listomapkeys .dir.zone.bucketid.xx.indexshardnumber
> it is clear that the problem is caused by many omapkeys with the following
> name format:
>
> <80>0_4771163.3444695458.6
> A hex dump of the output of the listomapkeys command above indicates that
> the first 'character' is indeed hex 80, but as there is no equivalent ascii
> for hex 80, I am not sure how to 'get at' those keys to see the values,
> delete them, etc. The index keys not of the format above appear to be fine,
> indicating s3 object names as expected.
>
> The rest of the index shards for the bucket are reasonable and have less
> than  osd_deep_scrub_large_omap_object_key_threshold index objects , and
> the overall total of objects in the bucket is way less than
> osd_deep_scrub_large_omap_object_key_threshold*num_shards.
>
> These weird objects seem to be created occasionally. Yes, the
> bucket is used heavily.
>
> Any advice here?
> -Chris
>
>
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Cluster down after network outage

2023-07-12 Thread Dan van der Ster
On Wed, Jul 12, 2023 at 1:26 AM Frank Schilder  wrote:

Hi all,
>
> one problem solved, another coming up. For everyone ending up in the same
> situation, the trick seems to be to get all OSDs marked up and then allow
> recovery. Steps to take:
>
> - set noout, nodown, norebalance, norecover
> - wait patiently until all OSDs are shown as up
> - unset norebalance, norecover
> - wait wait wait, PGs will eventually become active as OSDs become
> responsive
> - unset nodown, noout
>

Nice work bringing the cluster back up.
Looking into an OSD log would give more detail about why they were
flapping. Are these HDDs? Are the block.dbs on flash?

Generally, I've found that on clusters having OSDs which are slow to boot
and flapping up and down, "nodown" is sufficient to recover from such
issues.
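
For reference, a minimal sketch of the flag sequence described above:

ceph osd set noout
ceph osd set nodown
ceph osd set norebalance
ceph osd set norecover
# wait until all OSDs are reported up, then allow data movement again
ceph osd unset norebalance
ceph osd unset norecover
# once PGs are active and OSDs are stable
ceph osd unset nodown
ceph osd unset noout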

Cheers, Dan

__ Clyso GmbH | Ceph
Support and Consulting | https://www.clyso.com





>
> Now the new problem. I now have an ever growing list of OSDs listed as
> rebalancing, but nothing is actually rebalancing. How can I stop this
> growth and how can I get rid of this list:
>
> [root@gnosis ~]# ceph status
>   cluster:
> id: XXX
> health: HEALTH_WARN
> noout flag(s) set
> Slow OSD heartbeats on back (longest 634775.858ms)
> Slow OSD heartbeats on front (longest 635210.412ms)
> 1 pools nearfull
>
>   services:
> mon: 5 daemons, quorum ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 (age 6m)
> mgr: ceph-25(active, since 57m), standbys: ceph-26, ceph-01, ceph-02,
> ceph-03
> mds: con-fs2:8 4 up:standby 8 up:active
> osd: 1260 osds: 1258 up (since 24m), 1258 in (since 45m)
>  flags noout
>
>   data:
> pools:   14 pools, 25065 pgs
> objects: 1.97G objects, 3.5 PiB
> usage:   4.4 PiB used, 8.7 PiB / 13 PiB avail
> pgs: 25028 active+clean
>  30active+clean+scrubbing+deep
>  7 active+clean+scrubbing
>
>   io:
> client:   1.3 GiB/s rd, 718 MiB/s wr, 7.71k op/s rd, 2.54k op/s wr
>
>   progress:
> Rebalancing after osd.135 marked in (1s)
>   [=...]
> Rebalancing after osd.69 marked in (2s)
>   []
> Rebalancing after osd.75 marked in (2s)
>   [===.]
> Rebalancing after osd.173 marked in (2s)
>   []
> Rebalancing after osd.42 marked in (1s)
>   [=...] (remaining: 2s)
> Rebalancing after osd.104 marked in (2s)
>   []
> Rebalancing after osd.82 marked in (2s)
>   []
> Rebalancing after osd.107 marked in (2s)
>   [===.]
> Rebalancing after osd.19 marked in (2s)
>   [===.]
> Rebalancing after osd.67 marked in (2s)
>   [=...]
> Rebalancing after osd.46 marked in (2s)
>   [===.] (remaining: 1s)
> Rebalancing after osd.123 marked in (2s)
>   [===.]
> Rebalancing after osd.66 marked in (2s)
>   []
> Rebalancing after osd.12 marked in (2s)
>   [==..] (remaining: 2s)
> Rebalancing after osd.95 marked in (2s)
>   [=...]
> Rebalancing after osd.134 marked in (2s)
>   [===.]
> Rebalancing after osd.14 marked in (1s)
>   [===.]
> Rebalancing after osd.56 marked in (2s)
>   [=...]
> Rebalancing after osd.143 marked in (1s)
>   []
> Rebalancing after osd.118 marked in (2s)
>   [===.]
> Rebalancing after osd.96 marked in (2s)
>   []
> Rebalancing after osd.105 marked in (2s)
>   [===.]
> Rebalancing after osd.44 marked in (1s)
>   [===.] (remaining: 5s)
> Rebalancing after osd.41 marked in (1s)
>   [==..] (remaining: 1s)
> Rebalancing after osd.9 marked in (2s)
>   [=...] (remaining: 37s)
> Rebalancing after osd.58 marked in (2s)
>   [==..] (remaining: 8s)
> Rebalancing after osd.140 marked in (1s)
>   [===.]
> Rebalancing after osd.132 marked in (2s)
>   []
> Rebalancing after osd.31 marked in (1s)
>   [=...]
> Rebalancing after osd.110 marked in (2s)
>   []
> Rebalancing after osd.21 marked in (2s)
>   [=...]
> Rebalancing after osd.114 marked in (2s)
>   [===.]
> Rebalancing after osd.83 marked in (2s)
>   

[ceph-users] Re: MON sync time depends on outage duration

2023-07-10 Thread Dan van der Ster
Oh yes, sounds like purging the rbd trash will be the real fix here!
Good luck!

__
Clyso GmbH | Ceph Support and Consulting | https://www.clyso.com




On Mon, Jul 10, 2023 at 6:10 AM Eugen Block  wrote:

> Hi,
> I got a customer response with payload size 4096, that made things
> even worse. The mon startup time was now around 40 minutes. My doubts
> wrt decreasing the payload size seem confirmed. Then I read Dan's
> response again which also mentions that the default payload size could
> be too small. So I asked them to double the default (2M instead of 1M)
> and am now waiting for a new result. I'm still wondering why this only
> happens when the mon is down for more than 5 minutes. Does anyone have
> an explanation for that time factor?
> Another thing they're going to do is to remove lots of snapshot
> tombstones (rbd mirroring snapshots in the trash namespace), maybe
> that will reduce the osd_snap keys in the mon db, which should then
> reduce the startup time. We'll see...
>
> Zitat von Eugen Block :
>
> > Thanks, Dan!
> >
> >> Yes that sounds familiar from the luminous and mimic days.
> >> The workaround for zillions of snapshot keys at that time was to use:
> >>   ceph config set mon mon_sync_max_payload_size 4096
> >
> > I actually did search for mon_sync_max_payload_keys, not bytes so I
> > missed your thread, it seems. Thanks for pointing that out. So the
> > defaults seem to be these in Octopus:
> >
> > "mon_sync_max_payload_keys": "2000",
> > "mon_sync_max_payload_size": "1048576",
> >
> >> So it could be in your case that the sync payload is just too small to
> >> efficiently move 42 million osd_snap keys? Using debug_paxos and
> debug_mon
> >> you should be able to understand what is taking so long, and tune
> >> mon_sync_max_payload_size and mon_sync_max_payload_keys accordingly.
> >
> > I'm confused, if the payload size is too small, why would decreasing
> > it help? Or am I misunderstanding something? But it probably won't
> > hurt to try it with 4096 and see if anything changes. If not we can
> > still turn on debug logs and take a closer look.
> >
> >> And additional to Dan suggestion, the HDD is not a good choices for
> >> RocksDB, which is most likely the reason for this thread, I think
> >> that from the 3rd time the database just goes into compaction
> >> maintenance
> >
> > Believe me, I know... but there's not much they can currently do
> > about it, quite a long story... But I have been telling them that
> > for months now. Anyway, I will make some suggestions and report back
> > if it worked in this case as well.
> >
> > Thanks!
> > Eugen
> >
> > Zitat von Dan van der Ster :
> >
> >> Hi Eugen!
> >>
> >> Yes that sounds familiar from the luminous and mimic days.
> >>
> >> Check this old thread:
> >>
> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/F3W2HXMYNF52E7LPIQEJFUTAD3I7QE25/
> >> (that thread is truncated but I can tell you that it worked for Frank).
> >> Also the even older referenced thread:
> >>
> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/M5ZKF7PTEO2OGDDY5L74EV4QS5SDCZTH/
> >>
> >> The workaround for zillions of snapshot keys at that time was to use:
> >>   ceph config set mon mon_sync_max_payload_size 4096
> >>
> >> That said, that sync issue was supposed to be fixed by way of adding the
> >> new option mon_sync_max_payload_keys, which has been around since
> nautilus.
> >>
> >> So it could be in your case that the sync payload is just too small to
> >> efficiently move 42 million osd_snap keys? Using debug_paxos and
> debug_mon
> >> you should be able to understand what is taking so long, and tune
> >> mon_sync_max_payload_size and mon_sync_max_payload_keys accordingly.
> >>
> >> Good luck!
> >>
> >> Dan
> >>
> >> __
> >> Clyso GmbH | Ceph Support and Consulting | https://www.clyso.com
> >>
> >>
> >>
> >> On Thu, Jul 6, 2023 at 1:47 PM Eugen Block  wrote:
> >>
> >>> Hi *,
> >>>
> >>> I'm investigating an interesting issue on two customer clusters (used
> >>> for mirroring) I've not solved yet, but today we finally made some
> >>> progress. Maybe someone has an idea where to look next, I'd appreciate
> >>

[ceph-users] Re: Planning cluster

2023-07-10 Thread Dan van der Ster
Hi Jan,

On Sun, Jul 9, 2023 at 11:17 PM Jan Marek  wrote:

> Hello,
>
> I have a cluster, which have this configuration:
>
> osd pool default size = 3
> osd pool default min size = 1
>

Don't use min_size = 1 during regular stable operations. Instead, use
min_size = 2 to ensure data safety, and then you can set the pool to
min_size = 1 manually in the case of an emergency. (E.g. in case the 2
copies fail and will not be recoverable).
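
A hedged sketch of that procedure (the pool name is a placeholder):

ceph osd pool set <pool> min_size 2   # normal operation
ceph osd pool set <pool> min_size 1   # emergency only, after the two-replica DC is lost and PGs are inactive
ceph osd pool set <pool> min_size 2   # revert as soon as the PGs are recovered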


> I have 5 monitor nodes and 7 OSD nodes.
>

3 monitors is probably enough. Put 2 in the same DC with 2 replicas, and
the other in the DC with 1 replica.


> I have changed a crush map to divide ceph cluster to two
> datacenters - in the first one will be a part of cluster with 2
> copies of data and in the second one will be part of cluster
> with one copy - only emergency.
>
> I still have this cluster in one
>
> This cluster has 1 PiB of raw data capacity, thus it is very
> expensive to add a further 300TB of capacity to get 2+2 data redundancy.
>
> Will it works?
>
> If I turn off the 1/3 location, will it be operational?


Yes the PGs should be active and accept IO. But the cluster will be
degraded, it cannot stay in this state permanently. (You will need to
recover the 3rd replica or change the crush map).



> I
> believe it will, and that it is the better choice. And what if the 2/3
> location "dies"?


With min_size = 2, the PGs will be inactive, but the data will be safe. If
this happens, then set min_size=1 to activate the PGs.
Mon will not have quorum though -- you need a plan for that. And also plan
where you put your MDSs.

-- dan

__
Clyso GmbH | Ceph Support and Consulting | https://www.clyso.com




> On this cluster is pool with cephfs - this is a main
> part of CEPH.
>
> Many thanks for your notices.
>
> Sincerely
> Jan Marek
> --
> Ing. Jan Marek
> University of South Bohemia
> Academic Computer Centre
> Phone: +420389032080
> http://www.gnu.org/philosophy/no-word-attachments.cs.html
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS snapshots: impact of moving data

2023-07-06 Thread Dan van der Ster
Hi Mathias,

Provided that both subdirs are within the same snap context (subdirs below
where the .snap is created), I would assume that in the mv case, the space
usage is not doubled: the snapshots point at the same inode and it is just
linked at different places in the filesystem.

However, if your cluster and livelihood depends on this being true, I
suggest making a small test in a tiny empty cephfs, listing the rados pools
before and after mv and snapshot operations to find out exactly which data
objects are created.
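
A minimal sketch of such a test, assuming a tiny test fs mounted at
/mnt/testfs (all paths and names are placeholders):

mkdir /mnt/testfs/test/.snap/snapA   # snapshot before the move
rados df                             # note the object counts of the fs pools
mv /mnt/testfs/test/sub1/XYZ /mnt/testfs/test/sub2/
mkdir /mnt/testfs/test/.snap/snapB
rados df                             # a real copy would show up as extra data objects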

Cheers, Dan

__
Clyso GmbH | Ceph Support and Consulting | https://www.clyso.com




On Thu, Jun 22, 2023 at 8:54 AM Kuhring, Mathias <
mathias.kuhr...@bih-charite.de> wrote:

> Dear Ceph community,
>
> We want to restructure (i.e. move around) a lot of data (hundreds of
> terabyte) in our CephFS.
> And now I was wondering what happens within snapshots when I move data
> around within a snapshotted folder.
> I.e. do I need to account for a lot increased storage usage due to older
> snapshots differing from the new restructured state?
> In the end it is just metadata changes. Are the snapshots aware of this?
>
> Consider the following examples.
>
> Copying data:
> Let's say I have a folder /test, with a file XYZ in sub-folder
> /test/sub1 and an empty sub-folder /test/sub2.
> I create snapshot snapA in /test/.snap, copy XYZ to sub-folder
> /test/sub2, delete it from /test/sub1 and create another snapshot snapB.
> I would have two snapshots each with distinct copies of XYZ, hence using
> double the space in the FS:
> /test/.snap/snapA/sub1/XYZ <-- copy 1
> /test/.snap/snapA/sub2/
> /test/.snap/snapB/sub1/
> /test/.snap/snapB/sub2/XYZ <-- copy 2
>
> Moving data:
> Let's assume the same structure.
> But now after creating snapshot snapA, I move XYZ to sub-folder
> /test/sub2 and then create the other snapshot snapB.
> The directory tree will look the same. But how is this treated internally?
> Once I move the data, will there be an actually copy created in snapA to
> represent the old state?
> Or will this remain the same data (like a link to the inode or so)?
> And hence not double the storage used for that file.
>
> I couldn't find (or understand) anything related to this in the docs.
> The closest seems to be the hard-link section here:
> https://docs.ceph.com/en/quincy/dev/cephfs-snapshots/#hard-links
> Which unfortunately goes a bit over my head.
> So I'm not sure if this answers my question.
>
> Thank you all for your help. Appreciate it.
>
> Best Wishes,
> Mathias Kuhring
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph Quarterly (CQ) - Issue #1

2023-07-06 Thread Dan van der Ster
Thanks Zac!

I only see the txt attachment here. Where can we get the PDF A4 and letter
renderings?

Cheers, Dan

__
Clyso GmbH | Ceph Support and Consulting | https://www.clyso.com




On Mon, Jul 3, 2023 at 10:29 AM Zac Dover  wrote:

> The first issue of "Ceph Quarterly" is attached to this email. Ceph
> Quarterly (or "CQ") is an overview of the past three months of upstream
> Ceph development. We provide CQ in three formats: A4, letter, and plain
> text wrapped at 80 columns.
>
> Zac Dover
> Upstream Documentation
> Ceph Foundation___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Cannot get backfill speed up

2023-07-06 Thread Dan van der Ster
Hi Jesper,

Indeed many users reported slow backfilling and recovery with the mclock
scheduler. This is supposed to be fixed in the latest quincy but clearly
something is still slowing things down.
Some clusters have better luck reverting to osd_op_queue = wpq.
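
If you want to try that, a hedged sketch (the scheduler is only picked at OSD
start, so the OSDs need a restart afterwards):

ceph config set osd osd_op_queue wpq
# then restart the OSD daemons host by host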

(I'm hoping by proposing this someone who tuned mclock recently will chime
in with better advice).

Cheers, Dan

__
Clyso GmbH | Ceph Support and Consulting | https://www.clyso.com




On Wed, Jul 5, 2023 at 10:28 PM Jesper Krogh  wrote:

>
> Hi.
>
> Fresh cluster - but despite setting:
> jskr@dkcphhpcmgt028:/$ sudo ceph config show osd.0 | grep recovery_max_active_ssd
> osd_recovery_max_active_ssd   50    mon   default[20]
> jskr@dkcphhpcmgt028:/$ sudo ceph config show osd.0 | grep osd_max_backfills
> osd_max_backfills             100   mon   default[10]
>
> I still get
> jskr@dkcphhpcmgt028:/$ sudo ceph status
>cluster:
>  id: 5c384430-da91-11ed-af9c-c780a5227aff
>  health: HEALTH_OK
>
>services:
>  mon: 3 daemons, quorum dkcphhpcmgt031,dkcphhpcmgt029,dkcphhpcmgt028
> (age 16h)
>  mgr: dkcphhpcmgt031.afbgjx(active, since 33h), standbys:
> dkcphhpcmgt029.bnsegi, dkcphhpcmgt028.bxxkqd
>  mds: 2/2 daemons up, 1 standby
>  osd: 40 osds: 40 up (since 45h), 40 in (since 39h); 21 remapped pgs
>
>data:
>  volumes: 2/2 healthy
>  pools:   9 pools, 495 pgs
>  objects: 24.85M objects, 60 TiB
>  usage:   117 TiB used, 159 TiB / 276 TiB avail
>  pgs: 10655690/145764002 objects misplaced (7.310%)
>   474 active+clean
>   15  active+remapped+backfilling
>   6   active+remapped+backfill_wait
>
>io:
>  client:   0 B/s rd, 1.4 MiB/s wr, 0 op/s rd, 116 op/s wr
>  recovery: 328 MiB/s, 108 objects/s
>
>progress:
>  Global Recovery Event (9h)
>[==..] (remaining: 25m)
>
> With these numbers for the settings, I would expect to get more than 15
> PGs actively backfilling... (and based on SSDs and a 2x25gbit network, I can
> also spend more resources on recovery than 328 MiB/s).
>
> Thanks, .
>
> --
> Jesper Krogh
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: pg_num != pgp_num - and unable to change.

2023-07-06 Thread Dan van der Ster
Hi Jesper,

> In earlier versions of ceph (without autoscaler) I have only experienced
> that setting pg_num and pgp_num took immediate effect?

That's correct -- in recent Ceph (since nautilus) you cannot manipulate
pgp_num directly anymore. There is a backdoor setting (set pgp_num_actual
...) but I don't really recommend that.

Since nautilus, pgp_num (and pg_num) will be increased by the mgr
automatically to reach your pg_num_target over time. (If you're a source
code reader check DaemonServer::adjust_pgs for how this works).

In short, the mgr is throttled by the target_max_misplaced_ratio, which
defaults to 5%.

So if you want to split more aggressively,
increase target_max_misplaced_ratio.
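
For example, something along these lines (0.10 is only an illustrative value):

ceph config get mgr target_max_misplaced_ratio   # default 0.05
ceph config set mgr target_max_misplaced_ratio 0.10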

Cheers, Dan

__
Clyso GmbH | Ceph Support and Consulting | https://www.clyso.com



On Wed, Jul 5, 2023 at 9:41 PM Jesper Krogh  wrote:

> Hi.
>
> Fresh cluster - after a dance where the autoscaler did not work
> (returned blank) as described in the doc - I now seemingly have it
> working. It has bumped the target to something reasonable -- and is slowly
> incrementing pg_num and pgp_num by 2 over time (hope this is correct?)
>
> But .
> jskr@dkcphhpcmgt028:/$ sudo ceph osd pool ls detail | grep 62
> pool 22 'cephfs.archive.ec62data' erasure profile ecprof62 size 8
> min_size 7 crush_rule 3 object_hash rjenkins pg_num 150 pgp_num 22
> pg_num_target 512 pgp_num_target 512 autoscale_mode on last_change 9159
> lfor 0/0/9147 flags hashpspool,ec_overwrites,selfmanaged_snaps,bulk
> stripe_width 24576 pg_num_min 128 target_size_ratio 0.4 application
> cephfs
>
> pg_num = 150
> pgp_num = 22
>
> and setting pgp_num seemingly has zero effect on the system, not even
> with autoscaling set to off.
>
> jskr@dkcphhpcmgt028:/$ sudo ceph osd pool set cephfs.archive.ec62data
> pg_autoscale_mode off
> set pool 22 pg_autoscale_mode to off
> jskr@dkcphhpcmgt028:/$ sudo ceph osd pool set cephfs.archive.ec62data
> pgp_num 150
> set pool 22 pgp_num to 150
> jskr@dkcphhpcmgt028:/$ sudo ceph osd pool set cephfs.archive.ec62data
> pg_num_min 128
> set pool 22 pg_num_min to 128
> jskr@dkcphhpcmgt028:/$ sudo ceph osd pool set cephfs.archive.ec62data
> pg_num 150
> set pool 22 pg_num to 150
> jskr@dkcphhpcmgt028:/$ sudo ceph osd pool set cephfs.archive.ec62data
> pg_autoscale_mode on
> set pool 22 pg_autoscale_mode to on
> jskr@dkcphhpcmgt028:/$ sudo ceph progress
> PG autoscaler increasing pool 22 PGs from 150 to 512 (14s)
>  []
> jskr@dkcphhpcmgt028:/$ sudo ceph osd pool ls detail | grep 62
> pool 22 'cephfs.archive.ec62data' erasure profile ecprof62 size 8
> min_size 7 crush_rule 3 object_hash rjenkins pg_num 150 pgp_num 22
> pg_num_target 512 pgp_num_target 512 autoscale_mode on last_change 9159
> lfor 0/0/9147 flags hashpspool,ec_overwrites,selfmanaged_snaps,bulk
> stripe_width 24576 pg_num_min 128 target_size_ratio 0.4 application
> cephfs
>
> pgp_num != pg_num ?
>
> In earlier versions of ceph (without autoscaler) I have only experienced
> that setting pg_num and pgp_num took immediate effect?
>
> Jesper
>
> jskr@dkcphhpcmgt028:/$ sudo ceph version
> ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy
> (stable)
> jskr@dkcphhpcmgt028:/$ sudo ceph health
> HEALTH_OK
> jskr@dkcphhpcmgt028:/$ sudo ceph status
>cluster:
>  id: 5c384430-da91-11ed-af9c-c780a5227aff
>  health: HEALTH_OK
>
>services:
>  mon: 3 daemons, quorum dkcphhpcmgt031,dkcphhpcmgt029,dkcphhpcmgt028
> (age 15h)
>  mgr: dkcphhpcmgt031.afbgjx(active, since 32h), standbys:
> dkcphhpcmgt029.bnsegi, dkcphhpcmgt028.bxxkqd
>  mds: 2/2 daemons up, 1 standby
>  osd: 40 osds: 40 up (since 44h), 40 in (since 39h); 33 remapped pgs
>
>data:
>  volumes: 2/2 healthy
>  pools:   9 pools, 495 pgs
>  objects: 24.85M objects, 60 TiB
>  usage:   117 TiB used, 158 TiB / 276 TiB avail
>  pgs: 13494029/145763897 objects misplaced (9.257%)
>   462 active+clean
>   23  active+remapped+backfilling
>   10  active+remapped+backfill_wait
>
>io:
>  client:   0 B/s rd, 1.1 MiB/s wr, 0 op/s rd, 94 op/s wr
>  recovery: 705 MiB/s, 208 objects/s
>
>progress:
>
>
> --
> Jesper Krogh
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MON sync time depends on outage duration

2023-07-06 Thread Dan van der Ster
Hi Eugen!

Yes that sounds familiar from the luminous and mimic days.

Check this old thread:
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/F3W2HXMYNF52E7LPIQEJFUTAD3I7QE25/
(that thread is truncated but I can tell you that it worked for Frank).
Also the even older referenced thread:
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/M5ZKF7PTEO2OGDDY5L74EV4QS5SDCZTH/

The workaround for zillions of snapshot keys at that time was to use:
   ceph config set mon mon_sync_max_payload_size 4096

That said, that sync issue was supposed to be fixed by way of adding the
new option mon_sync_max_payload_keys, which has been around since nautilus.

So it could be in your case that the sync payload is just too small to
efficiently move 42 million osd_snap keys? Using debug_paxos and debug_mon
you should be able to understand what is taking so long, and tune
mon_sync_max_payload_size and mon_sync_max_payload_keys accordingly.
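
A hedged sketch of that debugging/tuning loop (the values are only starting points):

ceph config set mon debug_mon 10
ceph config set mon debug_paxos 10
# watch the sync progress in the mon logs during a resync, then adjust, e.g.:
ceph config set mon mon_sync_max_payload_keys 4096      # default 2000
ceph config set mon mon_sync_max_payload_size 4194304   # default 1048576 (1 MiB)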

Good luck!

Dan

__
Clyso GmbH | Ceph Support and Consulting | https://www.clyso.com



On Thu, Jul 6, 2023 at 1:47 PM Eugen Block  wrote:

> Hi *,
>
> I'm investigating an interesting issue on two customer clusters (used
> for mirroring) I've not solved yet, but today we finally made some
> progress. Maybe someone has an idea where to look next, I'd appreciate
> any hints or comments.
> These are two (latest) Octopus clusters, main usage currently is RBD
> mirroring with snapshot mode (around 500 RBD images are synced every
> 30 minutes). They noticed very long startup times of MON daemons after
> reboot, times between 10 and 30 minutes (reboot time already
> subtracted). These delays are present on both sites. Today we got a
> maintenance window and started to check in more detail by just
> restarting the MON service (joins quorum within seconds), then
> stopping the MON service and wait a few minutes (still joins quorum
> within seconds). And then we stopped the service and waited for more
> than 5 minutes, simulating a reboot, and then we were able to
> reproduce it. The sync then takes around 15 minutes, we verified with
> other MONs as well. The MON store is around 2 GB of size (on HDD), I
> understand that the sync itself can take some time, but what is the
> threshold here? I tried to find a hint in the MON config, searching
> for timeouts with 300 seconds, there were only a few matches
> (mon_session_timeout is one of them), but I'm not sure if they can
> explain this behavior.
> Investigating the MON store (ceph-monstore-tool dump-keys) I noticed
> that there were more than 42 Million osd_snap keys, which is quite a
> lot and would explain the size of the MON store. But I'm also not sure
> if it's related to the long syncing process.
> Does that sound familiar to anyone?
>
> Thanks,
> Eugen
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS metadata pool grows by two orders of magnitude while trimming (?) snapshots

2023-05-31 Thread Dan van der Ster
Hi Janek,

A few questions and suggestions:
- Do you have multi-active MDS? In my experience back in nautilus if
something went wrong with mds export between mds's, the mds
log/journal could grow unbounded like you observed until that export
work was done. Static pinning could help if you are not using it
already.
- You definitely should disable the pg autoscaling on the mds metadata
pool (and other pools imho) -- decide the correct number of PGs for
your pools and leave it (a minimal example follows after this list).
- Which version are you running? You said nautilus but wrote 16.2.12
which is pacific... If you're running nautilus v14 then I recommend
disabling pg autoscaling completely -- IIRC it does not have a fix for
the OSD memory growth "pg dup" issue which can occur during PG
splitting/merging.
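
A minimal example of the autoscaler suggestion, using the metadata pool name
from your output (the pg_num value is a placeholder you should size yourself):

ceph osd pool set cephfs.storage.meta pg_autoscale_mode off
ceph osd pool set cephfs.storage.meta pg_num <N>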

Cheers, Dan

__
Clyso GmbH | https://www.clyso.com


On Wed, May 31, 2023 at 4:03 AM Janek Bevendorff
 wrote:
>
> I checked our logs from yesterday, the PG scaling only started today,
> perhaps triggered by the snapshot trimming. I disabled it, but it didn't
> change anything.
>
> What did change something was restarting the MDS one by one, which had
> got far behind with trimming their caches and with a bunch of stuck ops.
> After restarting them, the pool size decreased quickly to 600GiB. I
> noticed the same behaviour yesterday, though yesterday is was more
> extreme and restarting the MDS took about an hour and I had to increase
> the heartbeat timeout. This time, it took only half a minute per MDS,
> probably because it wasn't that extreme yet and I had reduced the
> maximum cache size. Still looks like a bug to me.
>
>
> On 31/05/2023 11:18, Janek Bevendorff wrote:
> > Another thing I just noticed is that the auto-scaler is trying to
> > scale the pool down to 128 PGs. That could also result in large
> > fluctuations, but this big?? In any case, it looks like a bug to me.
> > Whatever is happening here, there should be safeguards with regard to
> > the pool's capacity.
> >
> > Here's the current state of the pool in ceph osd pool ls detail:
> >
> > pool 110 'cephfs.storage.meta' replicated size 4 min_size 3 crush_rule
> > 5 object_hash rjenkins pg_num 495 pgp_num 471 pg_num_target 128
> > pgp_num_target 128 autoscale_mode on last_change 1359013 lfor
> > 0/1358620/1358618 flags hashpspool,nodelete stripe_width 0
> > expected_num_objects 300 recovery_op_priority 5 recovery_priority
> > 2 application cephfs
> >
> > Janek
> >
> >
> > On 31/05/2023 10:10, Janek Bevendorff wrote:
> >> Forgot to add: We are still on Nautilus (16.2.12).
> >>
> >>
> >> On 31/05/2023 09:53, Janek Bevendorff wrote:
> >>> Hi,
> >>>
> >>> Perhaps this is a known issue and I was simply too dumb to find it,
> >>> but we are having problems with our CephFS metadata pool filling up
> >>> over night.
> >>>
> >>> Our cluster has a small SSD pool of around 15TB which hosts our
> >>> CephFS metadata pool. Usually, that's more than enough. The normal
> >>> size of the pool ranges between 200 and 800GiB (which is quite a lot
> >>> of fluctuation already). Yesterday, we had suddenly had the pool
> >>> fill up entirely and they only way to fix it was to add more
> >>> capacity. I increased the pool size to 18TB by adding more SSDs and
> >>> could resolve the problem. After a couple of hours of reshuffling,
> >>> the pool size finally went back to 230GiB.
> >>>
> >>> But then we had another fill-up tonight to 7.6TiB. Luckily, I had
> >>> adjusted the weights so that not all disks could fill up entirely
> >>> like last time, so it ended there.
> >>>
> >>> I wasn't really able to identify the problem yesterday, but under
> >>> the more controllable scenario today, I could check the MDS logs at
> >>> debug_mds=10 and to me it seems like the problem is caused by
> >>> snapshot trimming. The logs contain a lot of snapshot-related
> >>> messages for paths that haven't been touched in a long time. The
> >>> messages all look something like this:
> >>>
> >>> May 31 09:16:48 XXX ceph-mds[2947525]: 2023-05-31T09:16:48.292+0200
> >>> 7f7ce1bd9700 10 mds.1.cache.ino(0x1000b3c3670) add_client_cap first
> >>> cap, joining realm snaprealm(0x100 seq 1b1c lc 1b1b cr 1
> >>> b1b cps 2 snaps={185f=snap(185f 0x100 'monthly_20221201'
> >>> 2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x100
> >>> 'monthly_20230101' 2023-01-01T00:00:04.657252+0100),1941=snap(1941
> >>> 0x100 ...
> >>>
> >>> May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.396+0200
> >>> 7f0e6a6ca700 10 mds.0.cache | |__ 3 rep [dir
> >>> 0x10218fe.10101* /storage/REDACTED/| ptrwaiter=0 request=0
> >>> child=0 frozen=0 subtree=1 replicated=0 dirty=0 waiter=0 authpin=0
> >>> tempexporting=0 0x5607759d9600]
> >>>
> >>> May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.452+0200
> >>> 7f0e6a6ca700 10 mds.0.cache | | | 4 rep [dir
> >>> 0x10ff904.10001010* /storage/REDACTED/| ptrwaiter=0
> >>> request=0 child=0 frozen=0 subtree=1 

[ceph-users] Re: Newer linux kernel cephfs clients is more trouble?

2023-05-29 Thread Dan van der Ster
Hi,

Sorry for poking this old thread, but does this issue still persist in
the 6.3 kernels?

Cheers, Dan

__
Clyso GmbH | https://www.clyso.com


On Wed, Dec 7, 2022 at 3:42 AM William Edwards  wrote:
>
>
> > Op 7 dec. 2022 om 11:59 heeft Stefan Kooman  het volgende 
> > geschreven:
> >
> > On 5/13/22 09:38, Xiubo Li wrote:
> >>> On 5/12/22 12:06 AM, Stefan Kooman wrote:
> >>> Hi List,
> >>>
> >>> We have quite a few linux kernel clients for CephFS. One of our customers 
> >>> has been running mainline kernels (CentOS 7 elrepo) for the past two 
> >>> years. They started out with 3.x kernels (default CentOS 7), but upgraded 
> >>> to mainline when those kernels would frequently generate MDS warnings 
> >>> like "failing to respond to capability release". That worked fine until 
> >>> 5.14 kernel. 5.14 and up would use a lot of CPU and *way* more bandwidth 
> >>> on CephFS than older kernels (order of magnitude). After the MDS was 
> >>> upgraded from Nautilus to Octopus that behavior is gone (comparable CPU / 
> >>> bandwidth usage as older kernels). However, the newer kernels are now the 
> >>> ones that give "failing to respond to capability release", and worse, 
> >>> clients get evicted (unresponsive as far as the MDS is concerned). Even 
> >>> the latest 5.17 kernels have that. No difference is observed between 
> >>> using messenger v1 or v2. MDS version is 15.2.16.
> >>> Surprisingly the latest stable kernels from CentOS 7 work flawlessly now. 
> >>> Although that is good news, newer operating systems come with newer 
> >>> kernels.
> >>>
> >>> Does anyone else observe the same behavior with newish kernel clients?
> >> There have some known bugs, which have been fixed or under fixing 
> >> recently, even in the mainline and, not sure whether are they related. 
> >> Such as [1][2][3][4]. More detail please see ceph-client repo testing 
> >> branch [5].
> >
> > None of the issues you mentioned were related. We gained some more 
> > experience with newer kernel clients, specifically on Ubuntu Focal / Jammy 
> > (5.15). Performance issues seem to arise in certain workloads, specifically 
> > load-balanced Apache shared web hosting clusters with CephFS. We have 
> > tested linux kernel clients from 5.8 up to and including 6.0 with a 
> > production workload and the short summary is:
> >
> > < 5.13, everything works fine
> > 5.13 and up is giving issues
>
> I see this issue on 6.0.0 as well.
>
> >
> > We tested the 5.13.-rc1 as well, and already that kernel is giving issues. 
> > So something has changed in 5.13 that results in performance regression in 
> > certain workloads. And I wonder if it has something to do with the changes 
> > related to fscache that have, and are, happening in the kernel. These web 
> > servers might access the same directories / files concurrently.
> >
> > Note: we have quite a few 5.15 kernel clients not doing any (load-balanced) 
> > web based workload (container clusters on CephFS) that don't have any 
> > performance issue running these kernels.
> >
> > Issue: poor CephFS performance
> > Symptom / result: excessive CephFS network usage (order of magnitude higher 
> > than for older kernels not having this issue), within a minute there are a 
> > bunch of slow web service processes, claiming loads of virtual memory, that 
> > result in heavy swap usage, basically rendering the node unusably slow.
> >
> > Other users that replied to this thread experienced similar symptoms. It is 
> > reproducible on both CentOS (EPEL mainline kernels) as well as on Ubuntu 
> > (hwe as well as default release kernel).
> >
> > MDS version used: 15.2.16 (with a backported patch from 15.2.17) (single 
> > active / standby-replay)
> >
> > Does this ring a bell?
> >
> > Gr. Stefan
> >
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Pacific - MDS behind on trimming

2023-05-26 Thread Dan van der Ster
Hi Emmanuel,

In my experience MDS getting behind on trimming normally happens for
one of two reasons. Either your client workload is simply too
expensive for your metadata pool OSDs to keep up (and btw some ops are
known to be quite expensive such as setting xattrs or deleting files).
Or I've seen this during massive exports of subtrees between
multi-active MDS.

If you're using a single active MDS, you can exclude the 2nd case.

So if it's the former, then it would be useful to know exactly how
many log segments your MDS is accumulating. Is it going in short
bursts then coming back to normal? Or is it stuck at a very high
value?
Injecting mds_log_max_segments=400000 is indeed a very large, unusual
amount -- you definitely don't want to leave it like this long term.
(And silencing the warning for bursty client IO is better achieved by
increasing the mds_log_warn_factor e.g. to 5 or 10.)
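
For example, one way to apply that (assuming you manage the setting via the
config database rather than injectargs):

  # put the journal length back to its default
  ceph config rm mds mds_log_max_segments
  # and raise the warning threshold instead
  ceph config set mds mds_log_warn_factor 5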

Cheers, Dan




__
Clyso GmbH | https://www.clyso.com

On Fri, May 26, 2023 at 1:29 AM Emmanuel Jaep  wrote:
>
> Hi,
>
> lately, we have had some issues with our MDSs (Ceph version 16.2.10
> Pacific).
>
> Part of them are related to MDS being behind on trimming.
>
> I checked the documentation and found the following information (
> https://docs.ceph.com/en/pacific/cephfs/health-messages/):
> > CephFS maintains a metadata journal that is divided into *log segments*.
> The length of journal (in number of segments) is controlled by the setting
> mds_log_max_segments, and when the number of segments exceeds that setting
> the MDS starts writing back metadata so that it can remove (trim) the
> oldest segments. If this writeback is happening too slowly, or a software
> bug is preventing trimming, then this health message may appear. The
> threshold for this message to appear is controlled by the config option
> mds_log_warn_factor, the default is 2.0.
>
>
> Some resources on the web (https://www.suse.com/support/kb/doc/?id=19740)
> indicated that a solution would be to change the `mds_log_max_segments`.
> Which I did:
> ```
> ceph --cluster floki tell mds.* injectargs '--mds_log_max_segments=400000'
> ```
>
> Of course, the warning disappeared, but I have a feeling that I just hid
> the problem. Pushing a value to 400'000 when the default value is 512 is a
> lot.
>  Why is the trimming not taking place? How can I troubleshoot this further?
>
> Best,
>
> Emmanuel
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: pg upmap primary

2023-05-04 Thread Dan van der Ster
Hello,

After you delete the OSD, the now "invalid" upmap rule will be
automatically removed.
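
If you want to clean it up yourself beforehand, the mappings are visible in
the osdmap and can be removed explicitly (the pg id below is just a
placeholder):

  ceph osd dump | grep pg_upmap
  ceph osd rm-pg-upmap-primary 1.2f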

Cheers, Dan

__
Clyso GmbH | https://www.clyso.com


On Wed, May 3, 2023 at 10:13 PM Nguetchouang Ngongang Kevin
 wrote:
>
> Hello, I have a question: what happens when I delete a PG on which I
> set a particular OSD as primary using the pg-upmap-primary command?
>
> --
> Nguetchouang Ngongang Kevin
> ENS de Lyon
> https://perso.ens-lyon.fr/kevin.nguetchouang/
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MDS "newly corrupt dentry" after patch version upgrade

2023-05-02 Thread Dan van der Ster
Hi Janek,

That assert is part of a new corruption check added in 16.2.12 -- see
https://github.com/ceph/ceph/commit/1771aae8e79b577acde749a292d9965264f20202

The abort is controlled by a new option:

+Option("mds_abort_on_newly_corrupt_dentry", Option::TYPE_BOOL,
Option::LEVEL_ADVANCED)
+.set_default(true)
+.set_description("MDS will abort if dentry is detected newly corrupted."),

So in theory you could switch that off, but it is concerning that the
metadata is corrupted already.
I'm cc'ing Patrick who has been working on this issue.
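
If you do need to switch it off temporarily while investigating, it would be
something like:

  ceph config set mds mds_abort_on_newly_corrupt_dentry false

but treat that as a last resort -- it only masks the corruption check, it does
not fix anything.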

Cheers, Dan

__
Clyso GmbH | https://www.clyso.com

On Tue, May 2, 2023 at 7:32 AM Janek Bevendorff
 wrote:
>
> Hi,
>
> After a patch version upgrade from 16.2.10 to 16.2.12, our rank 0 MDS
> fails to start. After replaying the journal, it just crashes with
>
> [ERR] : MDS abort because newly corrupt dentry to be committed: [dentry
> #0x1/storage [2,head] auth (dversion lock)
>
> Immediately after the upgrade, I had it running shortly, but then it
> decided to crash for unknown reasons and I cannot get it back up.
>
> We have five ranks in total, the other four seem to be fine. I backed up
> the journal and tried to run cephfs-journal-tool --rank=cephfs.storage:0
> event recover_dentries summary, but it never finishes only eats up a lot
> of RAM. I stopped it after an hour and 50GB RAM.
>
> Resetting the journal makes the MDS crash with a missing inode error on
> another top-level directory, so I re-imported the backed-up journal. Is
> there any way to recover from this without rebuilding the whole file system?
>
> Thanks
> Janek
>
>
> Here's the full crash log:
>
>
> May 02 16:16:53 xxx077 ceph-mds[3047358]:-29>
> 2023-05-02T16:16:52.761+0200 7f51f878b700  1 mds.0.1711712 Finished
> replaying journal
> May 02 16:16:53 xxx077 ceph-mds[3047358]:-28>
> 2023-05-02T16:16:52.761+0200 7f51f878b700  1 mds.0.1711712 making mds
> journal writeable
> May 02 16:16:53 xxx077 ceph-mds[3047358]:-27>
> 2023-05-02T16:16:52.761+0200 7f51f878b700  1 mds.0.journaler.mdlog(ro)
> set_writeable
> May 02 16:16:53 xxx077 ceph-mds[3047358]:-26>
> 2023-05-02T16:16:52.761+0200 7f51f878b700  2 mds.0.1711712 i am not
> alone, moving to state resolve
> May 02 16:16:53 xxx077 ceph-mds[3047358]:-25>
> 2023-05-02T16:16:52.761+0200 7f51f878b700  3 mds.0.1711712 request_state
> up:resolve
> May 02 16:16:53 xxx077 ceph-mds[3047358]:-24>
> 2023-05-02T16:16:52.761+0200 7f51f878b700  5 mds.beacon.xxx077
> set_want_state: up:replay -> up:resolve
> May 02 16:16:53 xxx077 ceph-mds[3047358]:-23>
> 2023-05-02T16:16:52.761+0200 7f51f878b700  5 mds.beacon.xxx077 Sending
> beacon up:resolve seq 15
> May 02 16:16:53 xxx077 ceph-mds[3047358]:-22>
> 2023-05-02T16:16:52.761+0200 7f51f878b700 10 monclient:
> _send_mon_message to mon.xxx056 at v2:141.54.133.56:3300/0
> May 02 16:16:53 xxx077 ceph-mds[3047358]:-21>
> 2023-05-02T16:16:53.113+0200 7f51fef98700 10 monclient: tick
> May 02 16:16:53 xxx077 ceph-mds[3047358]:-20>
> 2023-05-02T16:16:53.113+0200 7f51fef98700 10 monclient:
> _check_auth_rotating have uptodate secrets (they expire after
> 2023-05-02T16:16:23.118186+0200)
> May 02 16:16:53 xxx077 ceph-mds[3047358]:-19>
> 2023-05-02T16:16:53.373+0200 7f51fff9a700  1 mds.xxx077 Updating MDS map
> to version 1711713 from mon.1
> May 02 16:16:53 xxx077 ceph-mds[3047358]:-18>
> 2023-05-02T16:16:53.373+0200 7f51fff9a700  1 mds.0.1711712
> handle_mds_map i am now mds.0.1711712
> May 02 16:16:53 xxx077 ceph-mds[3047358]:-17>
> 2023-05-02T16:16:53.373+0200 7f51fff9a700  1 mds.0.1711712
> handle_mds_map state change up:replay --> up:resolve
> May 02 16:16:53 xxx077 ceph-mds[3047358]:-16>
> 2023-05-02T16:16:53.373+0200 7f51fff9a700  1 mds.0.1711712 resolve_start
> May 02 16:16:53 xxx077 ceph-mds[3047358]:-15>
> 2023-05-02T16:16:53.373+0200 7f51fff9a700  1 mds.0.1711712 reopen_log
> May 02 16:16:53 xxx077 ceph-mds[3047358]:-14>
> 2023-05-02T16:16:53.373+0200 7f51fff9a700  1 mds.0.1711712 recovery set
> is 1,2,3,4
> May 02 16:16:53 xxx077 ceph-mds[3047358]:-13>
> 2023-05-02T16:16:53.373+0200 7f51fff9a700  1 mds.0.1711712 recovery set
> is 1,2,3,4
> May 02 16:16:53 xxx077 ceph-mds[3047358]:-12>
> 2023-05-02T16:16:53.373+0200 7f5202fa0700 10 monclient: get_auth_request
> con 0x5574fe74c400 auth_method 0
> May 02 16:16:53 xxx077 ceph-mds[3047358]:-11>
> 2023-05-02T16:16:53.373+0200 7f52037a1700 10 monclient: get_auth_request
> con 0x5574fe40fc00 auth_method 0
> May 02 16:16:53 xxx077 ceph-mds[3047358]:-10>
> 2023-05-02T16:16:53.373+0200 7f520279f700 10 monclient: get_auth_request
> con 0x5574f932fc00 auth_method 0
> May 02 16:16:53 xxx077 ceph-mds[3047358]: -9>
> 2023-05-02T16:16:53.373+0200 7f520279f700 10 monclient: get_auth_request
> con 0x5574ffce2000 auth_method 0
> May 02 16:16:53 xxx077 ceph-mds[3047358]: -8>
> 2023-05-02T16:16:53.377+0200 7f5202fa0700  5 mds.beacon.xxx077 received
> beacon reply 

[ceph-users] Re: How to find the bucket name from Radosgw log?

2023-04-26 Thread Dan van der Ster
Hi,

Your cluster probably has dns-style (virtual-hosted-style) buckets enabled.
In that case the path does not include the bucket name, and neither
does the rgw log.
Do you have a frontend lb like haproxy? You'll find the bucket names there.
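
For illustration (hypothetical hostnames): with dns-style requests the bucket
travels in the Host header, e.g.

  GET /photos/shares/ HTTP/1.1
  Host: mybucket.rgw.example.com

so the first path component is part of the object key, not the bucket name,
and only a frontend that logs the Host header will show which bucket was
accessed.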

-- Dan

__
Clyso GmbH | https://www.clyso.com


On Tue, Apr 25, 2023 at 2:34 PM  wrote:
>
> I found a log entry like this, and I thought the bucket name should be "photos":
>
> [2023-04-19 15:48:47.0.5541s] "GET /photos/shares/
>
> But I can not find it:
>
> radosgw-admin bucket stats --bucket photos
> failure: 2023-04-19 15:48:53.969 7f69dce49a80  0 could not get bucket info 
> for bucket=photos
> (2002) Unknown error 2002
>
> How does this happen? Thanks
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How to control omap capacity?

2023-04-26 Thread Dan van der Ster
Hi,

Simplest solution would be to add a few OSDs.
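
For reference, these show where the space is going and what the ratios are set
to (on recent releases "ceph osd df" also breaks out OMAP/META usage per OSD):

  ceph osd df tree
  ceph osd dump | grep full_ratio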

-- dan

__
Clyso GmbH | https://www.clyso.com


On Tue, Apr 25, 2023 at 2:58 PM WeiGuo Ren  wrote:
>
> I have two OSDs; these OSDs are used for the RGW index pool. After a lot of
> stress tests, these two OSDs were written to 99.90%. Why did the full ratio
> (95%) not take effect? I don't know much about this. Could it be that when an
> OSD is filled up by omap, it cannot be limited by the full ratio?
> Also, I tried to use ceph-bluestore-tool to expand it after adding a
> partition, but that failed and I don't know why.
> In my cluster every OSD has 55GB (DB/WAL and data on the same device); ceph
> -v is 14.2.5. Can anyone give me some ideas on how to fix this?
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: For suggestions and best practices on expanding Ceph cluster and removing old nodes

2023-04-26 Thread Dan van der Ster
Thanks Tom, this is a very useful post!
I've added our docs guy Zac in cc: IMHO this would be useful in a
"Tips & Tricks" section of the docs.

-- dan

__
Clyso GmbH | https://www.clyso.com




On Wed, Apr 26, 2023 at 7:46 AM Thomas Bennett  wrote:
>
> I would second Joachim's suggestion - this is exactly what we're in the
> process of doing for a client, i.e migrating from Luminous to Quincy.
> However below would also work if you're moving to Nautilus.
>
> The only catch with this plan would be if you plan to reuse any hardware -
> i.e the hosts running rados gateways and mons, etc. If you have enough
> hardware to spare this is a good plan.
>
> My process:
>
>1. Stand a new Quincy cluster and tune the cluster.
>2. Migrate user information, secrets and access keys (using
>radosg-admin in a script).
>3. Using a combination of rclone and parallel to push data across from
>the old cluster to the new cluster.
>
>
> Below is a bash script I used to capture all the user information on the
> old cluster and I ran it on the new cluster to create users and keep their
> secrets and keys the same.
>
> #
> for i in $(radosgw-admin user list | jq -r .[]); do
> USER_INFO=$(radosgw-admin user info --uid=$i)
> USER_ID=$(echo $USER_INFO | jq -r '.user_id')
> DISPLAY_NAME=$(echo $USER_INFO | jq '.display_name')
> EMAIL=$(echo $USER_INFO | jq '.email')
> MAX_BUCKETS=$(echo $USER_INFO | jq -r '(.max_buckets|tostring)')
> ACCESS=$(echo $USER_INFO | jq -r '.keys[].access_key')
> SECRET=$(echo $USER_INFO | jq -r '.keys[].secret_key')
> echo "radosgw-admin user create --uid=$USER_ID
> --display-name=$DISPLAY_NAME --email=$EMAIL --max-buckets=$MAX_BUCKETS
> --access-key=$ACCESS --secret-key=$SECRET" | tee -a
> generated.radosgw-admin-user-create.sh
> done
> #
>
> Rclone is a really powerful tool! I lazily set up backends for each user
> by appending the below to the for loop in the above script. The script below
> is not pretty, but it does the job:
> #
> echo "" >> generated.rclone.conf
> echo [old-cluster-$USER_ID] >> generated.rclone.conf
> echo type = s3 >> generated.rclone.conf
> echo provider = Ceph >> generated.rclone.conf
> echo env_auth = false >> generated.rclone.conf
> echo access_key_id = $ACCESS >> generated.rclone.conf
> echo secret_access_key = $SECRET >> generated.rclone.conf
> echo endpoint = http://xx.xx.xx.xx: >> generated.rclone.conf
> echo acl = public-read >> generated.rclone.conf
> echo "" >> generated.rclone.conf
> echo [new-cluster-$USER_ID] >> generated.rclone.conf
> echo type = s3 >> generated.rclone.conf
> echo provider = Ceph >> generated.rclone.conf
> echo env_auth = false >> generated.rclone.conf
> echo access_key_id = $ACCESS >> generated.rclone.conf
> echo secret_access_key = $SECRET >> generated.rclone.conf
> echo endpoint = http://yy.yy.yy.yy: >> generated.rclone.conf
> echo acl = public-read >> generated.rclone.conf
> echo "" >> generated.rclone.conf
> #
>
> Copy the generated.rclone.conf to the node that is going to act as the
> transfer node (I just used the new rados gateway node) into
> ~/.config/rclone/rclone.conf
>
> Now if you run rclone lsd old-cluster-{user}: (it even tab completes!)
> you'll get a list of all the buckets for that user.
>
> You could even simply rclone sync old-cluster-{user}: new-cluster-{user}: and
> it should sync all buckets for a user.
>
> Catches:
>
>- Use the scripts carefully - our buckets for this one user are set
>public-read - you might want to check each line of the script if you use 
> it.
>- Quincy bucket naming convention is stricter than Luminous. I've had to
>catch some '_' and upper cases and fix them in the command line I generate
>for copying each bucket.
>- Using rclone will take a long time. Feeding a script into parallel sped
>things up for me:
>   - # parallel -j 10 < sync-script
>- Watch out for lifecycling! Not sure how to handle this to make sure
>it's captured correctly.
>
> Cheers,
> Tom
>
> On Tue, 25 Apr 2023 at 22:36, Marc  wrote:
>
> >
> > Maybe he is limited by the supported OS
> >
> >
> > >
> > > I would create a new cluster with Quincy and would migrate the data from
> > > the old to the new cluster bucket by bucket. Nautilus is out of support
> > > and
> > > I would recommend at least to use a ceph version that is receiving
> > > Backports.
> > >
> > > huxia...@horebdata.cn  schrieb am Di., 25. Apr.
> > > 2023, 18:30:
> > >
> > > > Dear Ceph folks,
> > > >
> > > > I would like to listen to your advice on the following topic: We have
> > > a
> > > > 6-node Ceph cluster (for RGW usage only ) running on Luminous 12.2.12,
> > > and
> > > > now will add 10 new nodes. Our plan is to phase out the old 6 nodes,
> > > and
> > > > run RGW Ceph cluster with the new 10 nodes on Nautilus version。
> > > >
> > > > I can think of two ways to achieve the above 

[ceph-users] Re: Massive OMAP remediation

2023-04-26 Thread Dan van der Ster
Hi Ben,

Are you compacting the relevant osds periodically? ceph tell osd.x
compact (for the three osds holding the bilog) would help reshape the
rocksdb levels to at least perform better for a little while until the
next round of bilog trims.

Otherwise, I have experience deleting ~50M object indices in one step
in the past, probably back in the luminous days IIRC. It will likely
lock up the relevant osds for a while while the omap is removed. If you
dare take that step, it might help to set nodown; that might prevent
other osds from flapping and creating more work.
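
As a concrete sketch (the osd ids are placeholders for the three OSDs hosting
that index PG):

  ceph tell osd.10 compact
  ceph tell osd.11 compact
  ceph tell osd.12 compact

  # and if you do attempt the big removal:
  ceph osd set nodown
  # ...trim / remove the index...
  ceph osd unset nodown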

Cheers, Dan

__
Clyso GmbH | https://www.clyso.com


On Tue, Apr 25, 2023 at 2:45 PM Ben.Zieglmeier
 wrote:
>
> Hi All,
>
> We have a RGW cluster running Luminous (12.2.11) that has one object with an 
> extremely large OMAP database in the index pool. Listomapkeys on the object 
> returned 390 Million keys to start. Through bilog trim commands, we’ve 
> whittled that down to about 360 Million. This is a bucket index for a 
> regrettably unsharded bucket. There are only about 37K objects actually in 
> the bucket, but through years of neglect, the bilog grown completely out of 
> control. We’ve hit some major problems trying to deal with this particular 
> OMAP object. We just crashed 4 OSDs when a bilog trim caused enough churn to 
> knock one of the OSDs housing this PG out of the cluster temporarily. The OSD 
> disks are 6.4TB NVMe, but are split into 4 partitions, each housing their own 
> OSD daemon (collocated journal).
>
> We want to be rid of this large OMAP object, but are running out of options 
> to deal with it. Reshard outright does not seem like a viable option, as we 
> believe the deletion would deadlock OSDs and could cause impact. Continuing 
> to run `bilog trim` 1000 records at a time has been what we’ve done, but this 
> also seems to be creating impacts to performance/stability. We are seeking 
> options to remove this problematic object without creating additional 
> problems. It is quite likely this bucket is abandoned, so we could remove the 
> data, but I fear even the deletion of such a large OMAP could bring OSDs down 
> and cause potential for metadata loss (the other bucket indexes on that same 
> PG).
>
> Any insight available would be highly appreciated.
>
> Thanks.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: mons excessive writes to local disk and SSD wearout

2023-02-24 Thread Dan van der Ster
Hi Andrej,

That doesn't sound right -- I checked a couple of our clusters just
now and the mon filesystem is writing at just a few 100kBps.

debug_mon = 10 should clarify the root cause. Perhaps it's logm from
some persistent slow ops?
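
For example (reverting afterwards so the extra logging doesn't add to the
writes):

  ceph config set mon debug_mon 10
  # ...reproduce for a few minutes, check /var/log/ceph/ceph-mon.*.log
  #    (default log location assumed)...
  ceph config rm mon debug_mon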

Cheers, Dan



On Fri, Feb 24, 2023 at 7:36 AM Andrej Filipcic  wrote:
>
>
> Hi,
>
> on our large ceph cluster with 60 servers, 1600 OSDs, we have observed
> that small system nvmes are wearing out rapidly. Our monitoring shows
> mon writes on average about 10MB/s to store.db. For small system nvmes
> of 250GB and DWPD of ~1, this turns out to be too much, 0.8TB/day or
> 1.5PB in 5 years, too much even for 3DWPD of the same capacity.
>
> Apart from replacing the drives with larger ones, more durable ones, or
> preferably both, do you have any suggestions on whether these writes can be
> reduced? Actually, the mon writes match a 0.15 Hz rate of 64 MB .sst file
> creation.
>
> Best regards,
> Andrej
>
> --
> _
> prof. dr. Andrej Filipcic,   E-mail: andrej.filip...@ijs.si
> Department of Experimental High Energy Physics - F9
> Jozef Stefan Institute, Jamova 39, P.o.Box 3000
> SI-1001 Ljubljana, Slovenia
> Tel.: +386-1-477-3674Fax: +386-1-425-7074
> -
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph noout vs ceph norebalance, which is better for minor maintenance

2023-02-15 Thread Dan van der Ster
Sorry -- Let me rewrite that second paragraph without overloading the
term "rebalancing", which I recognize is confusing.

...

In your case, where you want to perform a quick firmware update on the
drive, you should just use noout.

Without noout, the OSD will be marked out after 5 minutes and objects
will be re-replicated to other OSDs -- those degraded PGs will move to
"backfilling" state and copy the objects on new OSDs.

With noout, the cluster won't start backfilling/recovering, but don't
worry -- this won't block IO. What happens is the disk that is having
its firmware upgraded will be marked "down", and IO will be accepted
and logged by its peers, so that when the disk is back "up" it can
replay ("recover") those writes to catch up.


The norebalance flag only impacts data movement for PGs that are not
degraded -- no OSDs are down. This can be useful to pause backfilling
e.g. when you are adding or removing hosts to a cluster.

-- dan

On Wed, Feb 15, 2023 at 2:58 PM Dan van der Ster  wrote:
>
> Hi Will,
>
> There are some misconceptions in your mail.
>
> 1. "noout" is a flag used to prevent the down -> out transition after
> an osd is down for several minutes. (Default 5 minutes).
> 2. "norebalance" is a flag used to prevent objects from being
> backfilling to a different OSD *if the PG is not degraded*.
>
> In your case, where you want to perform a quick firmware update on the
> drive, you should just use noout.
> Without noout, the OSD will be marked out after 5 minutes and data
> will start rebalancing to other OSDs.
> With noout, the cluster won't start rebalancing. But this won't block
> IO -- the disk being repaired will be "down" and IO will be accepted
> and logged by its peers, so that when the disk is back "up" it can
> replay those writes to catch up.
>
> Hope that helps!
>
> Dan
>
>
>
> On Wed, Feb 15, 2023 at 1:12 PM  wrote:
> >
> > Hi,
> >
> > We have a discussion going on about which is the correct flag to use for 
> > some maintenance on an OSD, should it be "noout" or "norebalance"? This was 
> > sparked because we need to take an OSD out of service for a short while to 
> > upgrade the firmware.
> >
> > One school of thought is:
> > - "ceph norebalance" prevents automatic rebalancing of data between OSDs, 
> > which Ceph does to ensure all OSDs have roughly the same amount of data.
> > - "ceph noout" on the other hand prevents OSDs from being marked as 
> > out-of-service during maintenance, which helps maintain cluster performance 
> > and availability.
> > - Additionally, if another OSD fails while the "norebalance" flag is set, 
> > the data redundancy and fault tolerance of the Ceph cluster may be 
> > compromised.
> > - So if we're going to maintain the performance and reliability we need to 
> > set the "ceph noout" flag to prevent the OSD from being marked as OOS 
> > during maintenance and allow the automatic data redistribution feature of 
> > Ceph to work as intended.
> >
> > The other opinion is:
> > - With the noout flag set, Ceph clients are forced to think that OSD exists 
> > and is accessible - so they continue sending requests to such OSD. The OSD 
> > also remains in the crush map without any signs that it is actually out. If 
> > an additional OSD fails in the cluster with the noout flag set, Ceph is 
> > forced to continue thinking that this new failed OSD is OK. It leads to 
> > stalled or delayed response from the OSD side to clients.
> > - Norebalance instead takes into account the in/out OSD status, but 
> > prevents data rebalance. Clients are also aware of the real OSD status, so 
> > no requests go to the OSD which is actually out. If an additional OSD fails 
> > - only the required temporary PG are created to maintain at least 2 
> > existing copies of the same data (well, generally it is set by the pool min 
> > size).
> >
> > The upstream docs seem pretty clear that noout should be used for 
> > maintenance 
> > (https://docs.ceph.com/en/quincy/rados/troubleshooting/troubleshooting-osd/),
> >  but the second opinion strongly suggests that norebalance is actually 
> > better and the Ceph docs are out of date.
> >
> > So what is the feedback from the wider community?
> >
> > Thanks,
> > Will
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph noout vs ceph norebalance, which is better for minor maintenance

2023-02-15 Thread Dan van der Ster
Hi Will,

There are some misconceptions in your mail.

1. "noout" is a flag used to prevent the down -> out transition after
an osd is down for several minutes. (Default 5 minutes).
2. "norebalance" is a flag used to prevent objects from being
backfilling to a different OSD *if the PG is not degraded*.

In your case, where you want to perform a quick firmware update on the
drive, you should just use noout.
Without noout, the OSD will be marked out after 5 minutes and data
will start rebalancing to other OSDs.
With noout, the cluster won't start rebalancing. But this won't block
IO -- the disk being repaired will be "down" and IO will be accepted
and logged by its peers, so that when the disk is back "up" it can
replay those writes to catch up.

Hope that helps!

Dan



On Wed, Feb 15, 2023 at 1:12 PM  wrote:
>
> Hi,
>
> We have a discussion going on about which is the correct flag to use for some 
> maintenance on an OSD, should it be "noout" or "norebalance"? This was 
> sparked because we need to take an OSD out of service for a short while to 
> upgrade the firmware.
>
> One school of thought is:
> - "ceph norebalance" prevents automatic rebalancing of data between OSDs, 
> which Ceph does to ensure all OSDs have roughly the same amount of data.
> - "ceph noout" on the other hand prevents OSDs from being marked as 
> out-of-service during maintenance, which helps maintain cluster performance 
> and availability.
> - Additionally, if another OSD fails while the "norebalance" flag is set, the 
> data redundancy and fault tolerance of the Ceph cluster may be compromised.
> - So if we're going to maintain the performance and reliability we need to 
> set the "ceph noout" flag to prevent the OSD from being marked as OOS during 
> maintenance and allow the automatic data redistribution feature of Ceph to 
> work as intended.
>
> The other opinion is:
> - With the noout flag set, Ceph clients are forced to think that OSD exists 
> and is accessible - so they continue sending requests to such OSD. The OSD 
> also remains in the crush map without any signs that it is actually out. If 
> an additional OSD fails in the cluster with the noout flag set, Ceph is 
> forced to continue thinking that this new failed OSD is OK. It leads to 
> stalled or delayed response from the OSD side to clients.
> - Norebalance instead takes into account the in/out OSD status, but prevents 
> data rebalance. Clients are also aware of the real OSD status, so no requests 
> go to the OSD which is actually out. If an additional OSD fails - only the 
> required temporary PG are created to maintain at least 2 existing copies of 
> the same data (well, generally it is set by the pool min size).
>
> The upstream docs seem pretty clear that noout should be used for maintenance 
> (https://docs.ceph.com/en/quincy/rados/troubleshooting/troubleshooting-osd/), 
> but the second opinion strongly suggests that norebalance is actually better 
> and the Ceph docs are out of date.
>
> So what is the feedback from the wider community?
>
> Thanks,
> Will
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Frequent calling monitor election

2023-02-09 Thread Dan van der Ster
Hi Frank,

Check the mon logs with some increased debug levels to find out what
the leader is busy with.
We have a similar issue (though, daily) and it turned out to be
related to the mon leader timing out doing a SMART check.
See https://tracker.ceph.com/issues/54313 for how I debugged that.
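
As a starting point, something like this on the current leader (mon.ceph-01 in
your log excerpt) should show what it is busy with right before an election:

  ceph tell mon.ceph-01 config set debug_mon 10
  # wait for the next election, inspect that mon's log, then revert:
  ceph tell mon.ceph-01 config set debug_mon 1/5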

Cheers, Dan

On Thu, Feb 9, 2023 at 7:56 AM Frank Schilder  wrote:
>
> Hi all,
>
> our monitors have enjoyed democracy since the beginning. However, I don't 
> share a sudden excitement about voting:
>
> 2/9/23 4:42:30 PM[INF]overall HEALTH_OK
> 2/9/23 4:42:30 PM[INF]mon.ceph-01 is new leader, mons 
> ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 in quorum (ranks 0,1,2,3,4)
> 2/9/23 4:42:26 PM[INF]mon.ceph-01 calling monitor election
> 2/9/23 4:42:26 PM[INF]mon.ceph-26 calling monitor election
> 2/9/23 4:42:26 PM[INF]mon.ceph-25 calling monitor election
> 2/9/23 4:42:26 PM[INF]mon.ceph-02 calling monitor election
> 2/9/23 4:40:00 PM[INF]overall HEALTH_OK
> 2/9/23 4:30:00 PM[INF]overall HEALTH_OK
> 2/9/23 4:24:34 PM[INF]overall HEALTH_OK
> 2/9/23 4:24:34 PM[INF]mon.ceph-01 is new leader, mons 
> ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 in quorum (ranks 0,1,2,3,4)
> 2/9/23 4:24:29 PM[INF]mon.ceph-01 calling monitor election
> 2/9/23 4:24:29 PM[INF]mon.ceph-02 calling monitor election
> 2/9/23 4:24:29 PM[INF]mon.ceph-03 calling monitor election
> 2/9/23 4:24:29 PM[INF]mon.ceph-01 calling monitor election
> 2/9/23 4:24:29 PM[INF]mon.ceph-26 calling monitor election
> 2/9/23 4:24:29 PM[INF]mon.ceph-25 calling monitor election
> 2/9/23 4:24:29 PM[INF]mon.ceph-02 calling monitor election
> 2/9/23 4:24:04 PM[INF]overall HEALTH_OK
> 2/9/23 4:24:03 PM[INF]mon.ceph-01 is new leader, mons 
> ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 in quorum (ranks 0,1,2,3,4)
> 2/9/23 4:23:59 PM[INF]mon.ceph-01 calling monitor election
> 2/9/23 4:23:59 PM[INF]mon.ceph-02 calling monitor election
> 2/9/23 4:20:00 PM[INF]overall HEALTH_OK
> 2/9/23 4:10:00 PM[INF]overall HEALTH_OK
> 2/9/23 4:00:00 PM[INF]overall HEALTH_OK
> 2/9/23 3:50:00 PM[INF]overall HEALTH_OK
> 2/9/23 3:43:13 PM[INF]overall HEALTH_OK
> 2/9/23 3:43:13 PM[INF]mon.ceph-01 is new leader, mons 
> ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 in quorum (ranks 0,1,2,3,4)
> 2/9/23 3:43:08 PM[INF]mon.ceph-01 calling monitor election
> 2/9/23 3:43:08 PM[INF]mon.ceph-26 calling monitor election
> 2/9/23 3:43:08 PM[INF]mon.ceph-25 calling monitor election
>
> We moved a switch from one rack to another and after the switch came back up, 
> the monitors frequently bitch about who is the alpha. How do I get them to 
> focus more on their daily duties again?
>
> Thanks for any help!
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: mon scrub error (scrub mismatch)

2023-01-03 Thread Dan van der Ster
Hi Frank,

Can you work backwards in the logs to when this first appeared?
The scrub error is showing that mon.0 has 78 auth keys and the other
two have 77. So you'd have query the auth keys of each mon to see if
you get a different response each time (e.g. ceph auth list), and
compare with what you expect.

Cheers, Dan

On Tue, Jan 3, 2023 at 9:29 AM Frank Schilder  wrote:
>
> Hi Eugen,
>
> thanks for your answer. All our mons use rocksdb.
>
> I found some old threads, but they never really explained anything. What 
> irritates me is that this is a silent corruption. If you don't read the logs 
> every day, you will not see it, ceph status reports health ok. That's also 
> why I'm wondering if this is a real issue or not.
>
> It would be great if someone could shed light on (1) how serious this is, (2) 
> why it doesn't trigger a health warning/error and (3) why the affected mon 
> doesn't sync back from the majority right away.
>
> Thanks and best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Eugen Block 
> Sent: 03 January 2023 15:04:34
> To: ceph-users@ceph.io
> Subject: [ceph-users] Re: mon scrub error (scrub mismatch)
>
> Hi Frank,
>
> I had this a few years back and ended up recreating the MON with the
> scrub mismatch, so in your case it probably would be mon.0. To test if
> the problem still exists you can trigger a mon scrub manually:
>
> ceph mon scrub
>
> Are all MONs on rocksdb back end in this cluster? I didn't check back
> then if this was the case in our cluster, so I'm just wondering if
> that could be an explanation.
>
> Regards,
> Eugen
>
> Zitat von Frank Schilder :
>
> > Hi all,
> >
> > we have these messages in our logs daily:
> >
> > 1/3/23 12:20:00 PM[INF]overall HEALTH_OK
> > 1/3/23 12:19:46 PM[ERR] mon.2 ScrubResult(keys
> > {auth=77,config=2,health=11,logm=10} crc
> > {auth=688385498,config=4279003239,health=3522308637,logm=132403602})
> > 1/3/23 12:19:46 PM[ERR] mon.0 ScrubResult(keys
> > {auth=78,config=2,health=11,logm=9} crc
> > {auth=325876668,config=4279003239,health=3522308637,logm=1083913445})
> > 1/3/23 12:19:46 PM[ERR]scrub mismatch
> > 1/3/23 12:19:46 PM[ERR] mon.1 ScrubResult(keys
> > {auth=77,config=2,health=11,logm=10} crc
> > {auth=688385498,config=4279003239,health=3522308637,logm=132403602})
> > 1/3/23 12:19:46 PM[ERR] mon.0 ScrubResult(keys
> > {auth=78,config=2,health=11,logm=9} crc
> > {auth=325876668,config=4279003239,health=3522308637,logm=1083913445})
> > 1/3/23 12:19:46 PM[ERR]scrub mismatch
> > 1/3/23 12:17:04 PM[INF]Cluster is now healthy
> > 1/3/23 12:17:04 PM[INF]Health check cleared: MON_CLOCK_SKEW (was:
> > clock skew detected on mon.tceph-02)
> >
> > Cluster is health OK:
> >
> > # ceph status
> >   cluster:
> > id: bf1f51f5-b381-4cf7-b3db-88d044c1960c
> > health: HEALTH_OK
> >
> >   services:
> > mon: 3 daemons, quorum tceph-01,tceph-02,tceph-03 (age 3M)
> > mgr: tceph-01(active, since 8w), standbys: tceph-03, tceph-02
> > mds: fs:1 {0=tceph-02=up:active} 2 up:standby
> > osd: 9 osds: 9 up (since 3M), 9 in
> >
> >   task status:
> >
> >   data:
> > pools:   4 pools, 321 pgs
> > objects: 9.94M objects, 336 GiB
> > usage:   1.6 TiB used, 830 GiB / 2.4 TiB avail
> > pgs: 321 active+clean
> >
> > Unfortunately, google wasn't of too much help. Is this scrub error
> > something to worry about?
> >
> > Thanks and best regards,
> > =
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephfs ceph.dir.rctime decrease

2022-12-19 Thread Dan van der Ster
Hi,

Yes this is a known issue -- an mtime can be in the future, and an
rctime won't go backwards. There was an earlier attempt to allow
fixing the rctimes but this got stuck and needs effort to bring it up
to date: https://github.com/ceph/ceph/pull/37938
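
For reference, you can read the current values with something like this
(placeholder mount point and paths):

  getfattr -n ceph.dir.rctime /mnt/cephfs/some/dir
  stat /mnt/cephfs/some/dir/suspect-file

but until that work lands there is no supported way to make an rctime go
backwards, even after fixing the underlying ctimes.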

Cheers, dan

On Sun, Dec 18, 2022 at 12:23 PM Stolte, Felix  wrote:
>
> Hi guys,
>
> i want to use ceph.dir.rctime for backup purposes. Unfortunately there are 
> some files in our filesystem which have a ctime of years in the future. This 
> is reflected correctly by ceph.dir.rctime. I changed the the time of this 
> files to now (just did a touch on the file), but rctime stays the same. I 
> waited one day and remounted the filesytem, but the value of rctime stays the 
> same.
>
> Is it possible to update ceph.dir.rctime in some way, or is it hard-coded
> that rctime will never be decreased?
>
> Regards
> Felix
>
> -
> -
> Forschungszentrum Juelich GmbH
> 52425 Juelich
> Sitz der Gesellschaft: Juelich
> Eingetragen im Handelsregister des Amtsgerichts Dueren Nr. HR B 3498
> Vorsitzender des Aufsichtsrats: MinDir Volker Rieke
> Geschaeftsfuehrung: Prof. Dr.-Ing. Wolfgang Marquardt (Vorsitzender),
> Karsten Beneke (stellv. Vorsitzender), Prof. Dr.-Ing. Harald Bolt,
> Dr. Astrid Lambrecht, Prof. Dr. Frauke Melchior
> -
> -
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: osd set-require-min-compat-client

2022-11-30 Thread Dan van der Ster
Hi Felix,

With `ceph balancer off` the upmap balancer will not move any PGs around.

https://docs.ceph.com/en/latest/rados/operations/balancer/
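
So one cautious sequence (plain upstream commands, nothing cluster-specific)
would be:

  ceph balancer off
  ceph osd set-require-min-compat-client luminous
  ceph balancer mode upmap
  ceph balancer on   # re-enable only when you're ready for upmaps to apply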

Cheers, Dan

On Wed, Nov 30, 2022 at 1:20 PM Stolte, Felix  wrote:
>
> Hi Dan,
>
> thanks for your reply. I wasn’t worried about the setting itself, but about 
> the balancer starting to use the pg-upmap feature (which currently fails, 
> because of the jewel setting). I would assume though, that the balancer is 
> using pg-upmap in a throttled way to avoid performance issues.
>
> I will execute the command on the weekend, just to be safe.
>
> Best regards
> Felix
>
>
> -
> -
> Forschungszentrum Juelich GmbH
> 52425 Juelich
> Sitz der Gesellschaft: Juelich
> Eingetragen im Handelsregister des Amtsgerichts Dueren Nr. HR B 3498
> Vorsitzender des Aufsichtsrats: MinDir Volker Rieke
> Geschaeftsfuehrung: Prof. Dr.-Ing. Wolfgang Marquardt (Vorsitzender),
> Karsten Beneke (stellv. Vorsitzender), Prof. Dr.-Ing. Harald Bolt,
> Dr. Astrid Lambrecht, Prof. Dr. Frauke Melchior
> -
> -----------------
>
> Am 30.11.2022 um 12:48 schrieb Dan van der Ster :
>
> Hi Felix,
>
> This change won't trigger any rebalancing. It will prevent older clients from 
> connecting, but since this isn't a crush tunable it won't directly affect 
> data placement.
>
> Best,
>
> Dan
>
>
> On Wed, Nov 30, 2022, 12:33 Stolte, Felix  wrote:
>>
>> Hey guys,
>>
>> our ceph cluster is on pacific, but started on jewel years ago. While I was 
>> going through the logs of the mgr daemon I stumbled upon the following 
>> entry:
>>
>> =
>> [balancer ERROR root] execute error: r = -1, detail = min_compat_client 
>> jewel < luminous, which is required for pg-upmap. Try 'ceph osd 
>> set-require-min-compat-client luminous' before using the new interface
>> =
>>
>> I could confirm that with `ceph osd get-require-min-compat-client` my value 
>> is still jewel. Reading the docs it looks to me like we really want to set this 
>> to luminous to benefit from a better pg distribution. My question for you is 
>> the following:
>>
>> Do I have to expect a major rebalancing after applying the 'ceph osd 
>> set-require-min-compat-client luminous‘  command, affecting my cluster IO?
>>
>> All my daemons are on pacific and all clients at least on nautilus.
>>
>> Thanks in advance and best regards
>> Felix
>>
>> -
>> -
>> Forschungszentrum Juelich GmbH
>> 52425 Juelich
>> Sitz der Gesellschaft: Juelich
>> Eingetragen im Handelsregister des Amtsgerichts Dueren Nr. HR B 3498
>> Vorsitzender des Aufsichtsrats: MinDir Volker Rieke
>> Geschaeftsfuehrung: Prof. Dr.-Ing. Wolfgang Marquardt (Vorsitzender),
>> Karsten Beneke (stellv. Vorsitzender), Prof. Dr.-Ing. Harald Bolt,
>> Dr. Astrid Lambrecht, Prof. Dr. Frauke Melchior
>> -
>> -
>>
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: osd set-require-min-compat-client

2022-11-30 Thread Dan van der Ster
Hi Felix,

This change won't trigger any rebalancing. It will prevent older clients
from connecting, but since this isn't a crush tunable it won't directly
affect data placement.

Best,

Dan


On Wed, Nov 30, 2022, 12:33 Stolte, Felix  wrote:

> Hey guys,
>
> our ceph cluster is on pacific, but started on jewel years ago. While I
> was going through the logs of the mgr daemon I stumbled upon the following
> entry:
>
> =
> [balancer ERROR root] execute error: r = -1, detail = min_compat_client
> jewel < luminous, which is required for pg-upmap. Try 'ceph osd
> set-require-min-compat-client luminous' before using the new interface
> =
>
> I could confirm that with `ceph osd get-require-min-compat-client` my
> value is still jewel. Reading the docs it looks to me like we really want to
> set this to luminous to benefit from a better pg distribution. My question
> for you is the following:
>
> Do I have to expect a major rebalancing after applying the 'ceph osd
> set-require-min-compat-client luminous‘  command, affecting my cluster IO?
>
> All my daemons are on pacific and all clients at least on nautilus.
>
> Thanks in advance and best regards
> Felix
>
>
> -
>
> -
> Forschungszentrum Juelich GmbH
> 52425 Juelich
> Sitz der Gesellschaft: Juelich
> Eingetragen im Handelsregister des Amtsgerichts Dueren Nr. HR B 3498
> Vorsitzender des Aufsichtsrats: MinDir Volker Rieke
> Geschaeftsfuehrung: Prof. Dr.-Ing. Wolfgang Marquardt (Vorsitzender),
> Karsten Beneke (stellv. Vorsitzender), Prof. Dr.-Ing. Harald Bolt,
> Dr. Astrid Lambrecht, Prof. Dr. Frauke Melchior
>
> -
>
> -
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: PGs stuck down

2022-11-30 Thread Dan van der Ster
Hi all,

It's difficult to say exactly what happened here without cluster logs.
Dale, would you be able to share the ceph.log showing the start of the
incident?

Cheers, dan

On Wed, Nov 30, 2022 at 10:30 AM Frank Schilder  wrote:
>
> Hi Eugen,
>
> power outage is one thing, a cable cut is another. With power outages you 
> will have OSDs down and only one sub-cluster up at a time. OSD's will peer 
> locally on a single DC and stuff moves on.
>
> With a cable cut you have a split brain. Have you actually tested your setup 
> with everything up except the network connection between OSDs on your 2 DCs? 
> My bet is that it goes into standstill just like in Dale's case because OSDs 
> are up on both sides and, therefore, PGs will want to peer cross-site but 
> can't.
>
> I don't think it is possible to do a 2DC stretched setup unless you have 2 or 
> more physically separated possibilities for network routing that will never 
> go down simultaneously. If just network link goes between OSDs on both sides 
> access will be down.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Eugen Block 
> Sent: 30 November 2022 10:07:13
> To: ceph-users@ceph.io
> Subject: [ceph-users] Re: PGs stuck down
>
> Hi,
>
> while I basically agree with Frank's response (e. g. min_size = 2) I
> disagree that it won't work without the stretch mode. We have a
> customer with a similar setup, two datacenters and a third mon in a
> different location. And this setup has proven multiple times the
> resiliency of ceph. Due to hardware issues in the power supplies they
> experienced two or three power outages in one DC without data loss.
> They use an erasure coded pool stretched across these two DCs, the
> third mon is reachable both ways around the DCs, of course. But this
> works quite well, they were very happy with ceph's resiliency. The
> cluster is still running on Nautilus.
>
> Regards,
> Eugen
>
> Zitat von Frank Schilder :
>
> > Hi Dale,
> >
> >> we thought we had set it up to prevent.. and with size = 4 and
> >> min_size set = 1
> >
> > I'm afraid this is exactly what you didn't. Firstly, min_size=1 is
> > always a bad idea. Secondly, if you have 2 data centres, the only
> > way to get this to work is to use stretch mode. Even if you had
> > min_size=2 (which, by the way you should have in any case), without
> > stretch mode you would not be guaranteed that you have all PGs
> > active+clean after one DC goes down (or cable gets cut). There is a
> > quite long and very detailed explanation of why this is the case and
> > with min_size=1 you are very certain to hit one of these cases or
> > even lose data.
> >
> > What you could check in your situation are these two:
> >
> > mon_osd_min_up_ratio
> > mon_osd_min_in_ratio
> >
> > My guess is that these prevented the mons from marking sufficiently
> > many OSDs as out and therefore they got stuck peering (maybe even
> > nothing was marked down?). The other thing is that you almost
> > certainly had exactly the split brain situation that stretch mode is
> > there to prevent. You probably ended up with 2 sub-clusters with 2
> > mons each and now what? If the third mon could still see the other 2
> > I don't think you get a meaningful quorum. Stretch mode will
> > actually change the crush rule depending on a decision by the
> > tie-breaking monitor to re-configure the pool to use only OSDs in
> > one of the 2 DCs so that no cross-site peering happens.
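> >
> > A rough sketch of what enabling stretch mode looks like (the monitor names,
> > the "stretch_rule" crush rule and the datacenter labels are placeholders --
> > see the stretch mode docs for the full procedure):
> >
> >   ceph mon set election_strategy connectivity
> >   ceph mon set_location mon-a datacenter=dc1
> >   ceph mon set_location mon-b datacenter=dc2
> >   ceph mon set_location mon-tiebreaker datacenter=dc3
> >   ceph mon enable_stretch_mode mon-tiebreaker stretch_rule datacenter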
> >
> > Maybe if you explicitly shut down one of the DC-mons you get stuff
> > to work in one of the DCs?
> >
> > Without stretch mode you need 3 DCs and a geo-replicated 3(2) pool.
> >
> > Best regards,
> > =
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > 
> > From: Wolfpaw - Dale Corse 
> > Sent: 29 November 2022 07:20:20
> > To: 'ceph-users'
> > Subject: [ceph-users] PGs stuck down
> >
> > Hi All,
> >
> >
> >
> > We had a fiber cut tonight between 2 data centers, and a ceph cluster didn't
> > do very well :( We ended up with 98% of PGs as down.
> >
> >
> >
> > This setup has 2 data centers defined, with 4 copies across both, and a
> > minimum size of 1.  We have 1 mon/mgr in each DC, with one in a 3rd data
> > center connected to each of the other 2 by VPN.
> >
> >
> >
> > When I did a pg query on the PG's that were stuck it said they were blocked
> > from coming up because they couldn't contact 2 of the OSDs (located in the
> > other data center that it was unable to reach).. but the other 2 were fine.
> >
> >
> >
> > I'm at a loss because this was exactly the thing we thought we had set it up
> > to prevent.. and with size = 4 and min_size set = 1 I understood that it
> > would continue without a problem? :(
> >
> >
> >
> > Crush map is below .. if anyone has any ideas? I would sincerely appreciate
> > it :)
> >
> >
> >
> > Thanks!
> >
> > Dale
> >
> >

[ceph-users] Re: LVM osds loose connection to disk

2022-11-18 Thread Dan van der Ster
Hi Frank,

bfq was definitely broken, deadlocking io for a few CentOS Stream 8
kernels between EL 8.5 and 8.6 -- we also hit that in production and
switched over to `none`.

I don't recall exactly when the upstream kernel was also broken but
apparently this was the fix:
https://marc.info/?l=linux-block=164366111512992=2

In any case you might want to just use `none` with flash devs -- I'm
not sure fair scheduling and mq are very meaningful anymore for
high iops devices and ceph.
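
For example, one way to make that persistent for non-rotational devices is a
small udev rule (the path and match pattern are just an illustration, adjust
for your distro and device naming):

  # /etc/udev/rules.d/99-ceph-iosched.rules
  ACTION=="add|change", KERNEL=="sd[a-z]*", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="none"

or simply echo "none" into /sys/block/<dev>/queue/scheduler as you did below.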

Cheers, Dan



On Thu, Nov 17, 2022 at 1:23 PM Frank Schilder  wrote:
>
> Hi Igor,
>
> I might have a smoking gun. Could it be that ceph (the kernel??) has issues 
> with certain disk schedulers? There was a recommendation on this list to use 
> bfq with bluestore. This was actually the one change other than ceph version 
> during upgrade: to make bfq default. Now, this might be a problem with 
> certain drives that have a preferred scheduler different than bfq. Here my 
> observation:
>
> I managed to get one of the OSDs to hang today. It was not the usual abort, I 
> don't know why the op_thread_timeout and suicide_timeout didn't trigger. The 
> OSD's worker thread was unresponsive for a bit more than 10 minutes before I 
> took action. Hence, nothing in the log (should maybe have used kill 
> sigabort). Now, this time I tried to check if I can access the disk with dd. 
> And, I could not. A
>
> dd if=/dev/sdn of=disk-dump bs=4096 count=100
>
> got stuck right away in D-state:
>
> 1652472 D+   dd if=/dev/sdn of=disk-dump bs=4096 count=100
>
> This time, since I was curious about the disk scheduler, I went to another 
> terminal on the same machine and did:
>
> # cat /sys/block/sdn/queue/scheduler
> mq-deadline kyber [bfq] none
> # echo none >> /sys/block/sdn/queue/scheduler
> # cat /sys/block/sdn/queue/scheduler
> [none] mq-deadline kyber bfq
>
> Going back to the stuck session, I see now (you can see my attempts to 
> interrupt the dd):
>
> # dd if=/dev/sdn of=disk-dump bs=4096 count=100
> ^C^C3+0 records in
> 2+0 records out
> 8192 bytes (8.2 kB) copied, 336.712 s, 0.0 kB/s
>
> Suddenly, the disk responds again! Also, the ceph container stopped (a docker 
> stop of the container had returned earlier without the container stopping - as 
> before in this situation).
>
> Could it be that recommendations for disk scheduler choice should be 
> reconsidered, or is this pointing towards a bug in either how ceph or the 
> kernel schedules disk IO? To confirm this hypothesis, I will retry the stress 
> test with the scheduler set to the default kernel choice.
>
> I did day-long fio benchmarks with all schedulers and all sorts of workloads 
> on our drives and could not find anything like that. It looks like it is very 
> difficult to impossible to reproduce a realistic ceph-osd IO pattern for 
> testing. Is there any tool available for this?
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Frank Schilder 
> Sent: 14 November 2022 13:03:58
> To: Igor Fedotov; ceph-users@ceph.io
> Subject: [ceph-users] Re: LVM osds loose connection to disk
>
> I can't reproduce the problem with artificial workloads, I need to get one of 
> these OSDs running in the meta-data pool until it crashes. My plan is to 
> reduce time-outs and increase log level for these specific OSDs to capture 
> what happened before an abort in the memory log. I can spare about 100G of 
> RAM for log entries. I found the following relevant options with settings I 
> think will work for my case:
>
> osd_op_thread_suicide_timeout 30 # default 150
> osd_op_thread_timeout 10 # default 15
> debug_bluefs 1/20 # default 1/5
> debug_bluestore 1/20 # default 1/5
> bluestore_kv_sync_util_logging_s 3 # default 10
> log_max_recent 10 # default 1
>
> It would be great if someone could confirm that these settings will achieve 
> what I want (or what is missing). I would like to capture at least 1 minute 
> worth of log entries in RAM with high debug settings. Does anyone have a good 
> estimate for how many log-entries are created per second with these settings 
> for tuning log_max_recent?
>
> Thanks for your help!
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Frank Schilder 
> Sent: 11 November 2022 10:25:17
> To: Igor Fedotov; ceph-users@ceph.io
> Subject: [ceph-users] Re: LVM osds loose connection to disk
>
> Hi Igor,
>
> thanks for your reply. We only exchanged the mimic containers with the 
> octopus ones. We didn't even reboot the servers during upgrade, only later 
> for trouble shooting. The only change since the upgrade is the ceph container.
>
> I'm trying to go down the stack and run a benchmark on the OSD directly. 
> Unfortunately, it seems that osd bench is too simplistic. I don't think we 
> have a problem with the disk, I rather think there is a race condition that 
> gets the bstore_kv_sync thread 

[ceph-users] Re: OSDs down after reweight

2022-11-15 Thread Dan van der Ster
Hi Frank,

Just a guess, but I wonder if for small values rounding/precision
start to impact the placement like you observed.

Do you see the same issue if you reweight to 2x the original?

-- Dan

On Tue, Nov 15, 2022 at 10:09 AM Frank Schilder  wrote:
>
> Hi all,
>
> I re-weighted all OSDs in a pool down from 1.0 to the same value 0.052 (see 
> reason below). After this, all hell broke loose. OSDs were marked down, slow 
> OPS all over the place and the MDSes started complaining about slow 
> ops/requests. Basically all PGs were remapped. After setting all re-weights 
> back to 1.0 the situation went back to normal.
>
> Expected behaviour: No (!!!) PGs are remapped and everything continues to 
> work. Why did things go down?
>
> More details: We have 24 OSDs with weight=1.74699 in a pool. I wanted to add 
> OSDs with weight=0.09099 in such a way that the small OSDs receive 
> approximately the same number of PGs as the large ones. Setting a re-weight 
> factor of 0.052 for the large ones should achieve just that: 
> 1.74699*0.052=0.09084. So, the procedure was:
>
> - ceph osd crush reweight osd.N 0.052 for all OSDs in that pool
> - add the small disks and re-balance
>
> I would expect that the crush mapping is invariant under a uniform change of 
> weight. That is, if I apply the same relative weight-change to all OSDs 
> (new_weight=old_weight*common_factor) in a pool, the mappings should be 
> preserved. However, this is not what I observed. How is it possible that PG 
> mappings change if the relative weight of all OSDs to each other stays the 
> same (the probabilities of picking an OSD are unchanged over all OSDs)?
>
> Thanks for any hints.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Temporary shutdown of subcluster and cephfs

2022-10-19 Thread Dan van der Ster
Hi Frank,

fs fail isn't ideal -- there's an 'fs down' command for this.

Here's a procedure we used, last used in the nautilus days:

1. If possible, umount fs from all the clients, so that all dirty
pages are flushed.
2. Prepare the ceph cluster: ceph osd set noout/noin
3. Wait until there is zero IO on the cluster, unmount any leftover clients.
4. ceph fs set cephfs down true
5. Stop all the ceph-osd's.
6. Power off the cluster.
(At this point we had only the ceph-mon's and ceph-mgr's running -- you
can shut those down too).
7. Power on the cluster, wait for mon/mgr/osds/mds to power-up.
8. ceph fs set cephfs down false
9. Reconnect and test clients.
10. ceph osd unset noout/noin
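Spelled out as commands, the flag and fs handling in steps 2, 4, 8 and 10 would be roughly (assuming the filesystem really is named "cephfs"):

ceph osd set noout
ceph osd set noin
ceph fs set cephfs down true
# ... power off / power on ...
ceph fs set cephfs down false
ceph osd unset noout
ceph osd unset noin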

-- Dan

On Wed, Oct 19, 2022 at 12:43 PM Frank Schilder  wrote:
>
> Hi all,
>
> we need to prepare for temporary shut-downs of a part of our ceph cluster. I 
> have 2 questions:
>
> 1) What is the recommended procedure to temporarily shut down a ceph fs 
> quickly?
> 2) How to avoid MON store log spam overflow (on octopus 15.2.17)?
>
> To 1: Currently, I'm thinking about:
>
> - fs fail 
> - shut down all MDS daemons
> - shut down all OSDs in that sub-cluster
> - shut down MGRs and MONs in that sub-cluster
> - power servers down
> - mark out OSDs manually (the number will exceed the MON limit for auto-out)
>
> - power up
> - wait a bit
> - do I need to mark OSDs in again or will they join automatically after 
> manual out and restart (maybe just temporarily increase the MON limit at end 
> of procedure above)?
> - fs set  joinable true
>
> Is this a safe procedure? The documentation calls this a procedure for 
> "Taking the cluster down rapidly for deletion or disaster recovery", neither 
> of the two is our intent. We need to have a fast *reversable* procedure, 
> because an "fs set down true" simply takes too long.
>
> There will be ceph fs clients remaining up. Desired behaviour is that 
> client-IO stalls until fs comes back up and then just continues as if nothing 
> had happened.
>
> To 2: We will have a sub-cluster down for an extended period of time. There 
> have been cases where such a situation killed MONs due to an excessive amount of 
> non-essential logs accumulating in the MON store. Is this still a problem 
> with 15.2.17 and what can I do to reduce this problem?
>
> Thanks for any hints/corrections/confirmations!
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Status of Quincy 17.2.5 ?

2022-10-19 Thread Dan van der Ster
There was a mail on d...@ceph.io that 17.2.4 missed a few backports, so
I presume 17.2.5 is a hotfix -- it's what 17.2.4 was supposed to be.
(And clearly the announcement is pending)

https://github.com/ceph/ceph/commits/v17.2.5

-- dan

On Wed, Oct 19, 2022 at 11:46 AM Christian Rohmann
 wrote:
>
> On 19/10/2022 11:26, Chris Palmer wrote:
> > I've noticed that packages for Quincy 17.2.5 appeared in the debian 11
> > repo a few days ago. However I haven't seen any mention of it
> > anywhere, can't find any release notes, and the documentation still
> > shows 17.2.4 as the latest version.
> >
> > Is 17.2.5 documented and ready for use yet? It's a bit risky having it
> > sitting undocumented in the repo for any length of time when it might
> > inadvertently be applied when doing routine patching... (I spotted it,
> > but one day someone might not).
>
> I believe the upload of a new release to the repo prior to the
> announcement happens quite regularly - it might just be due to the
> technical process of releasing.
> But I agree it would be nice to have a more "bit flip" approach to new
> releases in the repo and not have the packages appear as updates prior
> to the announcement and final release and update notes.
>
>
> Regards
>
> Christian
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 1 OSD laggy: log_latency_fn slow; heartbeat_map is_healthy had timed out after 15

2022-10-16 Thread Dan van der Ster
Hi Michel,

Are you sure there isn't a hardware problem with the disk? E.g. maybe you
have SCSI timeouts in dmesg or high ioutil with iostat?

Anyway I don't think there's a big risk related to draining and stopping
the osd. Just consider this a disk failure, which can happen at any time
anyway.

Start by marking it out. If there are still too many slow requests or laggy
PGs, try setting primary affinity to zero.
And if that still doesn't work, I wouldn't hesitate to stop that sick osd
so objects backfill from the replicas.
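In command form that escalation might look like this (osd.123 is a placeholder, and the systemd unit name depends on how the OSD was deployed):

ceph osd out 123
ceph osd primary-affinity osd.123 0
systemctl stop ceph-osd@123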

(We had a somewhat similar issue today, btw .. some brand of SSDs
occasionally hangs IO across a whole SCSI bus when failing. Stopping the
osd revives the rest of the disks on the box).

Cheers, Dan



On Sun, Oct 16, 2022, 22:08 Michel Jouvin  wrote:

> Hi,
>
> We have a production cluster made of 12 OSD servers with 16 OSD each
> (all the same HW) which has been running fine for 5 years (initially
> installed with Luminous) and which has been running Octopus (15.2.16)
> for 1 year and was recently upgraded to 15.2.17 (1 week before the
> problem started but doesn't seem to be linked with this upgrade). Since
> beginning of October, we started to see PGs in state "active+laggy" and
> slow requests always related to the same OSD and looking at its log, we
> saw "log_latency_fn slow" messages. There was no disk error logged in
> any system log file. Restarting the OSD didn't really help but no
> functional problems were seen.
>
> Looking again at the problem in the last days, we saw that the cluster
> was in HEALTH_WARN state because several PGs were not deep-scrubbed in
> time. In the logs we saw also (but may be we just missed them initially)
> "heartbeat_map is_healthy 'OSD::osd_op_tp thread...' had timed out after
> 15" messages. This number increased days after days and is now almost 3
> times the number of PGs hosted by the laggy OSD (despite hundreds of
> deep scrubs running successfully, the cluster has 4297 PGs). It seems
> that in the list we find all PGs that have a replica (all the pools are
> with 3 replica, no EC) on the laggy OSD. We confirmed that there is no
> detected disk error in the system.
>
> Today we restarted the server hosting this OSD, without much hope. It
> didn't help and the same OSD (and only this one) continues to have the
> same problem. In addition to the messages mentioned, the admin socket
> for this OSD became unresponsive: despite commands being executed (see
> below), they were not returning in a decent amount of time (several minutes).
>
> As the OSD RocksDB has probably never been compacted, we decided to
> compact the laggy OSD. Despite the "ceph tell osd.10 compact" never
> returned (it was killed after a few hours as the OSD had been marked
> down for a few seconds), the compaction started and lasted ~5
> hours... but completed successfully. But the only improvement that was
> seen after the compaction was that the admin socket is now responsive
> (though a bit slow). The messages about log_latency_fn and
> heartbeat_map are still present (and frequent) and the deep scrubs are
> still blocked.
>
> We are looking for advice on what to do to fix this issue. We had in mind
> to stop this OSD, zap it and reinstall it, but we are worried it may be
> risky to do this with an OSD that has not been deep scrubbed for a long
> time. And we are sure there is a better solution! Understanding the
> cause would be a much better approach!
>
> Thanks in advance for any help. Best regards,
>
> Michel
>
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: crush hierarchy backwards and upmaps ...

2022-10-14 Thread Dan van der Ster
Hi,

On Thu, Oct 13, 2022 at 8:14 PM Christopher Durham  wrote:
>
>
> Dan,
>
> Again i am using 16.2.10 on rocky 8
>
> I decided to take a step back and check a variety of options before I do 
> anything. Here are my results.
>
> If I use this rule:
>
> rule mypoolname {
>  id -5
> type erasure
> step take myroot
> step choose indep 4 type rack
> step choose indep 2 type chassis
> step chooseleaf indep 1 type host
> step emit
>
> This is changing the pod definitions all to type chassis. I get NO moves when 
> running osdmaptool --test-pg-upmap-items
> and comparing to the current. But --upmap-cleanup gives:
>
> check_pg_upmaps verify upmap of pool.pgid returning -22
> verify_upmap number of buckets 8 exceeds desired 2
>
> for each of my existing upmaps. And it wants to remove them all.
>
> If I use the rule:
>
> rule mypoolname {
>  id -5
> type erasure
> step take myroot
> step choose indep 4 type rack
> step chooseleaf indep 2 type chassis
> step emit
>
> I get almost 1/2 my data moving as per osdmaptool --test-pg-upmap-items.
>
> With --upmap-cleanup I get:
>
> verify_upmap multiple osds N,M come from the same failure domain -382
> check_pg_upmap verify upmap of pg poolid.pgid returning -22.
>
> For about 1/8 of my upmaps. And it wants to remove these and add about 
> 100 more.
> Although I suspect that this will be rectified after things are moved and 
> such. Am I correct?

Yes, that should be corrected as you suspect. The tracker I linked you
to earlier was indeed pointing out that with unordered crush buckets,
the balancer would create some upmaps which break the crush failure
domains.
It's good that those are detected and cleaned now -- leaving them in
would lead to unexpected cluster outages (e.g. rebooting a "pod" would
take down a PG because more than the expected 2 shards would have been
present).
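For completeness, a single offending mapping can also be dropped by hand; the pgid here is only a placeholder:

ceph osd rm-pg-upmap-items 20.3f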

> If I use the rule: (after changing my rack definition to only contain hosts 
> that were previously a part of the
> pods or chassis):
>
> rule mypoolname {
>  id -5
> type erasure
> step take myroot
> step choose indep 4 type rack
> step chooseleaf indep 2 type host
> step emit
>
> I get almost all my data moving as per osdmaptool --test-pg-upmap-items.
>
> With --upmap-cleanup, I get only 10 of these:
>
> verify_upmap multiple osds N,M come from the same failure domain -382
> check_pg_upmap verify upmap of pg poolid.pgid returning -22.
>
> But upmap-cleanup wants to remove all my upmaps, which may actually make 
> sense if we
> redo the entire map this way.
>
> I am curious, for the first rule where I am getting the "expected 8 got 2" error, whether 
> I am hitting this bug, which seems to
> suggest that I am having a problem because I have a multi-level (>2 levels) 
> rule for an EC pool:
>
> https://tracker.ceph.com/issues/51729
>
>
> This bug appears to be on 14.x, but perhaps it exists on pacific as well.

There appears to be no progress on that bug for a year -- there's no
reason to think pacific has fixed it, and your observations seems to
confirm that.
I suggest you post to that ticket with your info.

Cheers, Dan

> It would be great if I could use the first rule, except for this bug. Perhaps 
> the second rule is best at this point.
>
> Any other thoughts would be appreciated.
>
> -Chris
>
>
> -Original Message-
> From: Dan van der Ster 
> To: Christopher Durham 
> Cc: Ceph Users 
> Sent: Tue, Oct 11, 2022 11:39 am
> Subject: [ceph-users] Re: crush hierarchy backwards and upmaps ...
>
> Hi Chris,
>
> Just curious, does this rule make sense and help with the multi level crush
> map issue?
> (Maybe it also results in zero movement, or at least less then the
> alternative you proposed?)
>
> step choose indep 4 type rack
> step chooseleaf indep 2 type chassis
>
> Cheers, Dan
>
>
>
>
> On Tue, Oct 11, 2022, 19:29 Christopher Durham  wrote:
>
> > Dan,
> >
> > Thank you.
> >
> > I did what you said regarding --test-map-pgs-dump and it wants to move 3
> > OSDs in every PG. Yuk.
> >
> > So before i do that, I tried this rule, after changing all my 'pod' bucket
> > definitions to 'chassis', and compiling and
> > injecting the new crushmap to an osdmap:
> >
> >
> > rule mypoolname {
> >id -5
> >type erasure
> >step take myroot
> >step choose indep 4 type rack
> >step choose indep 2 type chassis
> >step chooseleaf indep 1 type host
> >step emit
> >
> > }
> >
> > --test-pg-upmap-entries

[ceph-users] Re: crush hierarchy backwards and upmaps ...

2022-10-11 Thread Dan van der Ster
Hi Chris,

Just curious, does this rule make sense and help with the multi level crush
map issue?
(Maybe it also results in zero movement, or at least less then the
alternative you proposed?)

step choose indep 4 type rack
step chooseleaf indep 2 type chassis

Cheers, Dan




On Tue, Oct 11, 2022, 19:29 Christopher Durham  wrote:

> Dan,
>
> Thank you.
>
> I did what you said regarding --test-map-pgs-dump and it wants to move 3
> OSDs in every PG. Yuk.
>
> So before i do that, I tried this rule, after changing all my 'pod' bucket
> definitions to 'chassis', and compiling and
> injecting the new crushmap to an osdmap:
>
>
> rule mypoolname {
> id -5
> type erasure
> step take myroot
> step choose indep 4 type rack
> step choose indep 2 type chassis
> step chooseleaf indep 1 type host
> step emit
>
> }
>
> --test-pg-upmap-entries shows there were NO changes to be done after
> comparing it with the original!!!
>
> However, --upmap-cleanup says:
>
> verify_upmap number of buckets 8 exceeds desired number of 2
> check_pg_upmaps verify_upmap of poolid.pgid returning -22
>
> This is output for every current upmap, but I really do want 8 total
> buckets per PG, as my pool is a 6+2.
>
> The upmap-cleanup output wants me to remove all of my upmaps.
>
> This seems consistent with a bug report that says that there is a problem
> with the balancer on a
> multi-level rule such as the above, albeit on 14.2.x. Any thoughts?
>
> https://tracker.ceph.com/issues/51729
>
> I am leaning towards just eliminating the middle rule and going directly from
> rack to host, even though
> it wants to move a LARGE amount of data according to  a diff before and
> after of --test-pg-upmap-entries.
> In this scenario, I don't see any unexpected errors with --upmap-cleanup
> and I do not want to get stuck
>
> rule mypoolname {
> id -5
> type erasure
> step take myroot
> step choose indep 4 type rack
> step chooseleaf indep 2 type host
> step emit
> }
>
> -Chris
>
>
> -Original Message-
> From: Dan van der Ster 
> To: Christopher Durham 
> Cc: Ceph Users 
> Sent: Mon, Oct 10, 2022 12:22 pm
> Subject: [ceph-users] Re: crush hierarchy backwards and upmaps ...
>
> Hi,
>
> Here's a similar bug: https://tracker.ceph.com/issues/47361
>
> Back then, upmap would generate mappings that invalidate the crush rule. I
> don't know if that is still the case, but indeed you'll want to correct
> your rule.
>
> Something else you can do before applying the new crush map is use
> osdmaptool to compare the PGs placement before and after, something like:
>
> osdmaptool --test-map-pgs-dump osdmap.before > before.txt
>
> osdmaptool --test-map-pgs-dump osdmap.after > after.txt
>
> diff -u before.txt after.txt
>
> The above will help you estimate how much data will move after injecting
> the fixed crush map. So depending on the impact you can schedule the change
> appropriately.
>
> I also recommend to keep a backup of the previous crushmap so that you can
> quickly restore it if anything goes wrong.
>
> Cheers, Dan
>
>
>
>
>
> On Mon, Oct 10, 2022, 19:31 Christopher Durham  wrote:
>
> > Hello,
> > I am using pacific 16.2.10 on Rocky 8.6 Linux.
> >
> > After setting upmap_max_deviation to 1 on the ceph balancer in ceph-mgr,
> I
> > achieved a near perfect balance of PGs and space on my OSDs. This is
> great.
> >
> > However, I started getting the following errors on my ceph-mon logs,
> every
> > three minutes, for each of the OSDs that had been mapped by the balancer:
> >2022-10-07T17:10:39.619+ 7f7c2786d700 1 verify_upmap unable to get
> > parent of osd.497, skipping for now
> >
> > After banging my head against the wall for a bit trying to figure this
> > out, I think I have discovered the issue:
> >
> > Currently, I have my pool EC Pool configured with the following crush
> rule:
> >
> > rule mypoolname {
> >id -5
> >type erasure
> >step take myroot
> >step choose indep 4 type rack
> >step choose indep 2 type pod
> >step chooseleaf indep 1 type host
> >step emit
> > }
> >
> > Basically, pick 4 racks, then 2 pods in each rack, and then one host in
> > each pod, for a total of
> > 8 chunks. (The pool is a 6+2). The 4 racks are chosen from the
> myroot
> > root entry, which is as follows.
> >
> >
> > root myroot {
> >id -400
> >item rack1 weight N
> >item rack2 weight N
> >item rack3 weight N
> >

[ceph-users] Re: crush hierarchy backwards and upmaps ...

2022-10-10 Thread Dan van der Ster
Hi,

Here's a similar bug: https://tracker.ceph.com/issues/47361

Back then, upmap would generate mappings that invalidate the crush rule. I
don't know if that is still the case, but indeed you'll want to correct
your rule.

Something else you can do before applying the new crush map is use
osdmaptool to compare the PGs placement before and after, something like:

osdmaptool --test-map-pgs-dump osdmap.before > before.txt

osdmaptool --test-map-pgs-dump osdmap.after > after.txt

diff -u before.txt after.txt

The above will help you estimate how much data will move after injecting
the fixed crush map. So depending on the impact you can schedule the change
appropriately.

I also recommend to keep a backup of the previous crushmap so that you can
quickly restore it if anything goes wrong.
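For the backup itself something like this is usually enough (the file name is arbitrary):

ceph osd getcrushmap -o crushmap.backup
# and, only if you need to roll back:
ceph osd setcrushmap -i crushmap.backup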

Cheers, Dan





On Mon, Oct 10, 2022, 19:31 Christopher Durham  wrote:

> Hello,
> I am using pacific 16.2.10 on Rocky 8.6 Linux.
>
> After setting upmap_max_deviation to 1 on the ceph balancer in ceph-mgr, I
> achieved a near perfect balance of PGs and space on my OSDs. This is great.
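For reference, that deviation is typically set via the mgr config; a sketch (verify the option path on your release):

ceph config set mgr mgr/balancer/upmap_max_deviation 1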
>
> However, I started getting the following errors on my ceph-mon logs, every
> three minutes, for each of the OSDs that had been mapped by the balancer:
> 2022-10-07T17:10:39.619+ 7f7c2786d700 1 verify_upmap unable to get
> parent of osd.497, skipping for now
>
> After banging my head against the wall for a bit trying to figure this
> out, I think I have discovered the issue:
>
> Currently, I have my pool EC Pool configured with the following crush rule:
>
> rule mypoolname {
> id -5
> type erasure
> step take myroot
> step choose indep 4 type rack
> step choose indep 2 type pod
> step chooseleaf indep 1 type host
> step emit
> }
>
> Basically, pick 4 racks, then 2 pods in each rack, and then one host in
> each pod, for a total of
> 8 chunks. (The pool is a 6+2). The 4 racks are chosen from the myroot
> root entry, which is as follows.
>
>
> root myroot {
> id -400
> item rack1 weight N
> item rack2 weight N
> item rack3 weight N
> item rack4 weight N
> }
>
> This has worked fine since inception, over a year ago. And the PGs are all
> as I expect with OSDs from the 4 racks and not on the same host or pod.
>
> The errors above, verify_upmap, started after I had the upmap_
> max_deviation set to 1 in the balancer and having it
> move things around, creating pg_upmap entries.
>
> I then discovered, while trying to figure this out, that the device types
> are:
>
> type 0 osd
> type 1 host
> type 2 chassis
> type 3 rack
> ...
> type 6 pod
>
> So pod is HIGHER on the hierarchy than rack. I have it as lower on my
> rule.
>
> What I want to do is remove the pods completely to work around this.
> Something like:
>
> rule mypoolname {
> id -5
> type erasure
> step take myroot
> step choose indep 4 type rack
> step chooseleaf indep 2 type host
> step emit
> }
>
> This will pick 4 racks and then 2 hosts in each rack. Will this cause any
> problems? I can add the pod stuff back later as 'chassis' instead. I can
> live without the 'pod' separation if needed.
>
> To test this, I tried doing something like this:
>
> 1. grab the osdmap:
> ceph osd getmap -o /tmp/om
> 2. pull out the crushmap:
> osdmaptool /tmp/om --export-crush /tmp/crush.bin
> 3. convert it to text:
> crushtool -d /tmp/crush.bin -o /tmp/crush.txt
>
> I then edited the rule for this pool as above, to remove the pod and go
> directly
> to pulling from 4 racks then 2 hosts in each rack. I then compiled up the
> crush map
> and then imported it into the extracted osdmap:
>
> crushtool -c /tmp/crush.txt -o /tmp/crush.bin
> osdmaptool /tmp/om --import-crush /tmp/crush.bin
>
> I then ran upmap-cleanup on the new osdmap:
>
> osdmaptool /tmp/om --upmap-cleanup
>
> I did NOT get any of the verify_upmap messages (but it did generate some
> rm-pg-upmap-items and some new upmaps in the list of commands to execute).
>
> When I did the extraction of the osdmap WITHOUT any changes to it, and
> then ran the upmap-cleanup, I got the same verify_upmap errors I am now
> seeing in the ceph-mon logs.
>
> So, should I just change the crushmap to remove the wrong rack->pod->host
> hierarchy, making it rack->host ?
> Will I have other issues? I am surprised that crush allowed me to create
> this out of order rule to begin with.
>
> Thanks for any suggestions.
>
> -Chris
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: recurring stat mismatch on PG

2022-10-08 Thread Dan van der Ster
It's not necessarily a bug... Running deep scrub again will just tell you
the current state of the PG. That's safe any time.

If it comes back inconsistent again, I'd repair the PG again, let it
finish completely, then scrub once again to double check that the repair
worked.
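Concretely, for the PG in this thread that cycle would be something like (pgid taken from the logs below):

ceph pg deep-scrub 19.1fff
# wait for the deep-scrub to finish, then if the error is still reported:
ceph pg repair 19.1fff
ceph pg deep-scrub 19.1fff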

Thinking back, I've seen PG 1fff have scrub errors like this in the past,
but not recently, indicating it was a listing bug of some sort. Perhaps
this is just a leftover stats error from a bug in mimic, and the complete
repair will fix this fully for you.

(Btw, I've never had a stats error like this result in a visible issue.
Repair should probably fix this transparently).

.. Dan



On Sat, Oct 8, 2022, 11:27 Frank Schilder  wrote:

> Yes, primary OSD. Extracted with grep -e scrub -e repair -e 19.1fff
> /var/log/ceph/ceph-osd.338.log and then only relevant lines copied.
>
> Yes, according to the case I should just run a deep-scrub and should see.
> I guess if this error was cleared on an aborted repair, this would be a new
> bug? I will do a deep-scrub and report back.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________
> From: Dan van der Ster 
> Sent: 08 October 2022 11:18:37
> To: Frank Schilder
> Cc: Ceph Users
> Subject: Re: [ceph-users] recurring stat mismatch on PG
>
> Is that the log from the primary OSD?
>
> About the restart, you should probably just deep-scrub again to see the
> current state.
>
>
> .. Dan
>
>
>
> On Sat, Oct 8, 2022, 11:14 Frank Schilder  wrote:
> Hi Dan,
>
> yes, 15.2.17. I remember that case and was expecting it to be fixed. Here
> a relevant extract from the log:
>
> 2022-10-08T10:06:22.206+0200 7fa3c48c7700  0 log_channel(cluster) log
> [DBG] : 19.1fff deep-scrub starts
> 2022-10-08T10:22:33.049+0200 7fa3c48c7700 -1 log_channel(cluster) log
> [ERR] : 19.1fffs0 deep-scrub : stat mismatch, got 64532/64531 objects,
> 1243/1243 clones, 64532/64531 dirty, 0/0 omap, 0/0 pinned, 0/0
> hit_set_archive, 1215/1215 whiteouts, 170978253582/170974059278 bytes, 0/0
> manifest objects, 0/0 hit_set_archive bytes.
> 2022-10-08T10:22:33.049+0200 7fa3c48c7700 -1 log_channel(cluster) log
> [ERR] : 19.1fff deep-scrub 1 errors
> 2022-10-08T10:38:20.618+0200 7fa3c48c7700  0 log_channel(cluster) log
> [DBG] : 19.1fff repair starts
> 2022-10-08T10:54:25.801+0200 7fa3c48c7700 -1 log_channel(cluster) log
> [ERR] : 19.1fffs0 repair : stat mismatch, got 64532/64531 objects,
> 1243/1243 clones, 64532/64531 dirty, 0/0 omap, 0/0 pinned, 0/0
> hit_set_archive, 1215/1215 whiteouts, 170978253582/170974059278 bytes, 0/0
> manifest objects, 0/0 hit_set_archive bytes.
> 2022-10-08T10:54:25.802+0200 7fa3c48c7700 -1 log_channel(cluster) log
> [ERR] : 19.1fff repair 1 errors, 1 fixed
>
> Just completed a repair and it's gone for now. As an alternative
> explanation, we had this scrub error, I started a repair but then OSDs in
> that PG were shut down and restarted. Is it possible that the repair was
> cancelled and the error cleared erroneously?
>
> Thanks and best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Dan van der Ster 
> Sent: 08 October 2022 11:03:05
> To: Frank Schilder
> Cc: Ceph Users
> Subject: Re: [ceph-users] recurring stat mismatch on PG
>
> Hi,
>
> Is that 15.2.17? It reminds me of this bug -
> https://tracker.ceph.com/issues/52705 - where an object with a particular
> name would hash to  and cause a stat mismatch during scrub. But
> 15.2.17 should have the fix for that.
>
>
> Can you find the relevant osd log for more info?
>
> .. Dan
>
>
>
> On Sat, Oct 8, 2022, 10:42 Frank Schilder  wrote:
> Hi all,
>
> I seem to observe something strange on an octopus(latest) cluster. We have
> a PG with a stat mismatch:
>
> 2022-10-08T10:06:22.206+0200 7fa3c48c7700  0 log_channel(cluster) log
> [DBG] : 19.1fff deep-scrub starts
> 2022-10-08T10:22:33.049+0200 7fa3c48c7700 -1 log_channel(cluster) log
> [ERR] : 19.1fffs0 deep-scrub : stat mismatch, got 64532/64531 objects,
> 1243/1243 clones, 64532/64531 dirty, 0/0 omap, 0/0 pinned, 0/0
> hit_set_archive, 1215/1215 whiteouts, 170978253582/170974059278 bytes, 0/0
> manifest objects, 0/0 hit_set_archive bytes.
> 2022-10-08T10:22:33.049+0200 7fa3c48c7700 -1 log_channel(cluster) log
> [ERR] : 19.1fff deep-scrub 1 errors
>
> This exact same mismatch was found before and I executed a pg-repair that
> fixed it. Now it's back. Does anyone have an idea why this might be 
> happening and how to deal with it?

[ceph-users] Re: recurring stat mismatch on PG

2022-10-08 Thread Dan van der Ster
Is that the log from the primary OSD?

About the restart, you should probably just deep-scrub again to see the
current state.


.. Dan



On Sat, Oct 8, 2022, 11:14 Frank Schilder  wrote:

> Hi Dan,
>
> yes, 15.2.17. I remember that case and was expecting it to be fixed. Here
> a relevant extract from the log:
>
> 2022-10-08T10:06:22.206+0200 7fa3c48c7700  0 log_channel(cluster) log
> [DBG] : 19.1fff deep-scrub starts
> 2022-10-08T10:22:33.049+0200 7fa3c48c7700 -1 log_channel(cluster) log
> [ERR] : 19.1fffs0 deep-scrub : stat mismatch, got 64532/64531 objects,
> 1243/1243 clones, 64532/64531 dirty, 0/0 omap, 0/0 pinned, 0/0
> hit_set_archive, 1215/1215 whiteouts, 170978253582/170974059278 bytes, 0/0
> manifest objects, 0/0 hit_set_archive bytes.
> 2022-10-08T10:22:33.049+0200 7fa3c48c7700 -1 log_channel(cluster) log
> [ERR] : 19.1fff deep-scrub 1 errors
> 2022-10-08T10:38:20.618+0200 7fa3c48c7700  0 log_channel(cluster) log
> [DBG] : 19.1fff repair starts
> 2022-10-08T10:54:25.801+0200 7fa3c48c7700 -1 log_channel(cluster) log
> [ERR] : 19.1fffs0 repair : stat mismatch, got 64532/64531 objects,
> 1243/1243 clones, 64532/64531 dirty, 0/0 omap, 0/0 pinned, 0/0
> hit_set_archive, 1215/1215 whiteouts, 170978253582/170974059278 bytes, 0/0
> manifest objects, 0/0 hit_set_archive bytes.
> 2022-10-08T10:54:25.802+0200 7fa3c48c7700 -1 log_channel(cluster) log
> [ERR] : 19.1fff repair 1 errors, 1 fixed
>
> Just completed a repair and it's gone for now. As an alternative
> explanation, we had this scrub error, I started a repair but then OSDs in
> that PG were shut down and restarted. Is it possible that the repair was
> cancelled and the error cleared erroneously?
>
> Thanks and best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Dan van der Ster 
> Sent: 08 October 2022 11:03:05
> To: Frank Schilder
> Cc: Ceph Users
> Subject: Re: [ceph-users] recurring stat mismatch on PG
>
> Hi,
>
> Is that 15.2.17? It reminds me of this bug -
> https://tracker.ceph.com/issues/52705 - where an object with a particular
> name would hash to  and cause a stat mismatch during scrub. But
> 15.2.17 should have the fix for that.
>
>
> Can you find the relevant osd log for more info?
>
> .. Dan
>
>
>
> On Sat, Oct 8, 2022, 10:42 Frank Schilder  wrote:
> Hi all,
>
> I seem to observe something strange on an octopus(latest) cluster. We have
> a PG with a stat mismatch:
>
> 2022-10-08T10:06:22.206+0200 7fa3c48c7700  0 log_channel(cluster) log
> [DBG] : 19.1fff deep-scrub starts
> 2022-10-08T10:22:33.049+0200 7fa3c48c7700 -1 log_channel(cluster) log
> [ERR] : 19.1fffs0 deep-scrub : stat mismatch, got 64532/64531 objects,
> 1243/1243 clones, 64532/64531 dirty, 0/0 omap, 0/0 pinned, 0/0
> hit_set_archive, 1215/1215 whiteouts, 170978253582/170974059278 bytes, 0/0
> manifest objects, 0/0 hit_set_archive bytes.
> 2022-10-08T10:22:33.049+0200 7fa3c48c7700 -1 log_channel(cluster) log
> [ERR] : 19.1fff deep-scrub 1 errors
>
> This exact same mismatch was found before and I executed a pg-repair that
> fixed it. Now it's back. Does anyone have an idea why this might be
> happening and how to deal with it?
>
> Thanks!
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: recurring stat mismatch on PG

2022-10-08 Thread Dan van der Ster
Hi,

Is that 15.2.17? It reminds me of this bug -
https://tracker.ceph.com/issues/52705 - where an object with a particular
name would hash to  and cause a stat mismatch during scrub. But
15.2.17 should have the fix for that.


Can you find the relevant osd log for more info?

.. Dan



On Sat, Oct 8, 2022, 10:42 Frank Schilder  wrote:

> Hi all,
>
> I seem to observe something strange on an octopus(latest) cluster. We have
> a PG with a stat mismatch:
>
> 2022-10-08T10:06:22.206+0200 7fa3c48c7700  0 log_channel(cluster) log
> [DBG] : 19.1fff deep-scrub starts
> 2022-10-08T10:22:33.049+0200 7fa3c48c7700 -1 log_channel(cluster) log
> [ERR] : 19.1fffs0 deep-scrub : stat mismatch, got 64532/64531 objects,
> 1243/1243 clones, 64532/64531 dirty, 0/0 omap, 0/0 pinned, 0/0
> hit_set_archive, 1215/1215 whiteouts, 170978253582/170974059278 bytes, 0/0
> manifest objects, 0/0 hit_set_archive bytes.
> 2022-10-08T10:22:33.049+0200 7fa3c48c7700 -1 log_channel(cluster) log
> [ERR] : 19.1fff deep-scrub 1 errors
>
> This exact same mismatch was found before and I executed a pg-repair that
> fixed it. Now it's back. Does anyone have an idea why this might be
> happening and how to deal with it?
>
> Thanks!
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 16.2.10: ceph osd perf always shows high latency for a specific OSD

2022-10-07 Thread Dan van der Ster
Hi Zakhar,

I can back up what Konstantin has reported -- we occasionally have
HDDs performing very slowly even though all smart tests come back
clean. Besides ceph osd perf showing a high latency, you could see
high ioutil% with iostat.
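To confirm that from the OS side, a quick look at per-device utilization is usually enough -- watch %util and await for the suspect drive:

iostat -x 1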

We normally replace those HDDs -- usually by draining and zeroing
them, then putting them back in prod (e.g. in a different cluster or
some other service). I don't have statistics on how often those sick
drives come back to full performance or not -- that could indicate it
was a poor physical connection, vibrations, ... , for example. But I
do recall some drives came back repeatedly as "sick" but not dead w/
clean SMART tests.

If you have time you can dig deeper with increased bluestore debug
levels. In our environment this happens often enough that we simply
drain, replace, move on.

Cheers, dan




On Fri, Oct 7, 2022 at 9:41 AM Zakhar Kirpichenko  wrote:
>
> Unfortunately, that isn't the case: the drive is perfectly healthy and,
> according to all measurements I did on the host itself, it isn't any
> different from any other drive on that host size-, health- or
> performance-wise.
>
> The only difference I noticed is that this drive sporadically does more I/O
> than other drives for a split second, probably due to specific PGs placed
> on its OSD, but the average I/O pattern is very similar to other drives and
> OSDs, so it's somewhat unclear why the specific OSD is consistently showing
> much higher latency. It would be good to figure out what exactly is causing
> these I/O spikes, but I'm not yet sure how to do that.
>
> /Z
>
> On Fri, 7 Oct 2022 at 09:24, Konstantin Shalygin  wrote:
>
> > Hi,
> >
> > When you see that one of 100 drives' perf is unusually different, it may mean
> > 'this drive is not like the others' and it should be replaced
> >
> >
> > k
> >
> > Sent from my iPhone
> >
> > > On 7 Oct 2022, at 07:33, Zakhar Kirpichenko  wrote:
> > >
> > > Anyone, please?
> >
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Stuck in upgrade

2022-10-07 Thread Dan van der Ster
Hi Jan,

It looks like you got into this situation by not setting
require-osd-release to pacific while you were running 16.2.7.
The code has that expectation, and unluckily for you if you had
upgraded to 16.2.8 you would have had a HEALTH_WARN that pointed out
the mismatch between require_osd_release and the running version:
https://tracker.ceph.com/issues/53551
https://github.com/ceph/ceph/pull/44259
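For anyone hitting this later: once all OSDs really do run the newer release, the flag is normally raised and checked like this (use whichever release name actually matches your OSDs; in Jan's case this is exactly the step that trips the assert):

ceph osd require-osd-release pacific
ceph osd dump | grep require_osd_release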

Cheers, Dan

On Fri, Oct 7, 2022 at 10:05 AM Jan Marek  wrote:
>
> Hello,
>
> I've now cluster healthy.
>
> I've studied the OSDMonitor.cc file and found that there is
> some problematic logic.
>
> Assumptions:
>
> 1) require_osd_release can only be raised.
>
> 2) ceph-mon in version 17.2.3 can set require_osd_release to
> minimal value 'octopus'.
>
> I have two variants:
>
> 1) If I am to be able to set require_osd_release to octopus, then
> require_osd_release must currently be set to 'nautilus' (I will raise
> require_osd_release from nautilus to octopus). In that case, line 11618
> in OSDMonitor.cc has to be this line:
>
> ceph_assert(osdmap.require_osd_release >= ceph_release_t::nautilus);
>
> 2) If, instead, I have to preserve on line 11618 of
> OSDMonitor.cc the line:
>
> ceph_assert(osdmap.require_osd_release >= ceph_release_t::octopus);
>
> it makes no sense to be able to set the require_osd_release parameter to
> 'octopus', because this line ensures that I have already set the
> require_osd_release parameter to octopus.
>
> I suggest using variant 1), and I'm sending the attached patch.
>
> There is another question: whether the MON daemon has to check
> require_osd_release when it is joining the cluster, given that it
> cannot raise its value.
>
> It is potentially dangerous situation, see my old e-mail below...
>
> Sincerely
> Jan Marek
>
> Dne Po, říj 03, 2022 at 11:26:51 CEST napsal Jan Marek:
> > Hello,
> >
> > I have a problem with our ceph cluster - I'm stuck in the upgrade
> > process between versions 16.2.7 and 17.2.3.
> >
> > My problem is that I have upgraded the MON, MGR, and MDS processes, and
> > when I started upgrading the OSDs, ceph tells me that I cannot add an OSD
> > with that version to the cluster, because I have a problem with
> > require_osd_release.
> >
> > In my osdmap I have:
> >
> > # ceph osd dump | grep require_osd_release
> > require_osd_release nautilus
> >
> > When I tried set this to octopus or pacific, my MON daemon crashed with
> > assertion:
> >
> > ceph_assert(osdmap.require_osd_release >= ceph_release_t::octopus);
> >
> > in OSDMonitor.cc on line 11618.
> >
> > Please, is there a way to repair it?
> >
> > Can I (temporary) change ceph_assert to this line:
> >
> > ceph_assert(osdmap.require_osd_release >= ceph_release_t::nautilus);
> >
> > and set require_osd_release to, say, pacific?
> >
> > I've tried to downgrade ceph-mon process back to version 16.2,
> > but it cannot join the cluster...
> >
> > Sincerely
> > Jan Marek
> > --
> > Ing. Jan Marek
> > University of South Bohemia
> > Academic Computer Centre
> > Phone: +420389032080
> > http://www.gnu.org/philosophy/no-word-attachments.cs.html
>
>
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
> --
> Ing. Jan Marek
> University of South Bohemia
> Academic Computer Centre
> Phone: +420389032080
> http://www.gnu.org/philosophy/no-word-attachments.cs.html
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: osd_memory_target for low-memory machines

2022-10-03 Thread Dan van der Ster
Hi,

384MB is far too low for a Ceph OSD. The warning is telling you that
it's below the min.
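If you want to see the enforced bounds for yourself, the option metadata shows them:

ceph config help osd_memory_target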

Cheers, Dan


On Sun, Oct 2, 2022 at 11:10 AM Nicola Mori  wrote:
>
> Dear Ceph users,
>
> I put together a cluster by reusing some (very) old machines with low
> amounts of RAM, as low as 4 GB for the worst case. I'd need to set
> osd_memory_target properly to avoid going OOM, but it seems there is a
> lower limit preventing me to do so consistently:
>
> 1) in the cluster logs I continuously get this message:
>
>Unable to set osd_memory_target on ceph01 to 402653184: error parsing
> value: Value '831882342' is below minimum 939524096
>
> 2) if I try to set it manually I get this error:
>
># ceph config set osd.7 osd_memory_target 402653184
>Error EINVAL: error parsing value: Value '402653184' is below minimum
> 939524096
>
> 3) if I use the --force option then the value is set, but it is lost
> after a osd restart
>
> How can I fix this issue?
> Thanks,
>
> Nicola
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Almost there - trying to recover cephfs from power outage

2022-09-21 Thread Dan van der Ster
Hi Jorge,

There was an older procedure before the --recover flag.
You can find that here:
   https://github.com/ceph/ceph/pull/42295/files
It was the result of this tracker: https://tracker.ceph.com/issues/51341

Also, here was the change which added the --recover flag:
https://tracker.ceph.com/issues/51716
There you can see the old process described again.

Good luck,

Dan


On Tue, Sep 20, 2022 at 9:00 PM Jorge Garcia  wrote:
>
> I have been trying to recover our ceph cluster from a power outage. I
> was able to recover most of the cluster using the data from the OSDs.
> But the MDS maps were gone, and now I'm trying to recover that. I was
> looking around and found a section in the Quincy manual titled
> RECOVERING THE FILE SYSTEM AFTER CATASTROPHIC MONITOR STORE LOSS. It
> talks about using the following command:
>
>ceph fs new--force --recover
>
> The problem is that we're running Nautilus, and it seems that the
> "--recover" flag doesn't exist yet. Are we out of luck? At the moment,
> the cluster has only one mds (currently reporting as standby) and one
> cephfs filesystem (currently not enabled, according to "ceph fs ls").
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Nautilus: PGs stuck "activating" after adding OSDs. Please help!

2022-09-15 Thread Dan van der Ster
Another common config to workaround this pg num limit is:

ceph config set osd osd_max_pg_per_osd_hard_ratio 10

(Then possibly the repeer step on each activating pg)
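If many PGs are stuck activating, a rough loop over the stuck list can save typing. This is only a sketch that assumes the plain-text dump format, so eyeball the pgids before running it:

ceph pg dump_stuck inactive 2>/dev/null | awk '/activating/ {print $1}' | \
  while read pg; do ceph pg repeer "$pg"; done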

.. Dan



On Thu, Sept 15, 2022, 17:47 Josh Baergen  wrote:

> Hi Fulvio,
>
> I've seen this in the past when a CRUSH change temporarily resulted in
> too many PGs being mapped to an OSD, exceeding mon_max_pg_per_osd. You
> can try increasing that setting to see if it helps, then setting it
> back to default once backfill completes. You may also need to "ceph pg
> repeer $pgid" for each of the PGs stuck activating.
>
> Josh
>
> On Thu, Sep 15, 2022 at 8:42 AM Fulvio Galeazzi 
> wrote:
> >
> >
> > Hallo,
> > I am on Nautilus and today, after upgrading the operating system
> (from
> > CentOS 7 to CentOS 8 Stream) on a couple OSD servers and adding them
> > back to the cluster, I noticed some PGs are still "activating".
> > The upgraded servers are from the same "rack", and I have replica-3
> > pools with 1-per-rack rule, and 6+4 EC pools (in some cases, with SSD
> > pool for metadata).
> >
> > More details:
> > - on the two OSD servers I upgraded, I ran "systemctl stop ceph.target"
> > and waited a while, to verify all PGs would remain "active"
> > - went on with the upgrade and ceph-ansible reconfig
> > - as soon as I started adding OSDs I saw "slow ops"
> > - to exclude possible effect of updated packages, I ran "yum update" on
> > all OSD servers, and rebooted them one by one
> > - after 2-3 hours, the last OSD disks finally came up
> > - I am left with:
> > about 1k "slow ops" (if I pause recovery, number ~stable but max
> > age increasing)
> > ~200 inactive PGs
> >
> > Most of the inactive PGs are from the object store pool:
> >
> > [cephmgr@cephAdmCT1.cephAdmCT1 ~]$ ceph osd pool get
> > default.rgw.buckets.data crush_rule
> > crush_rule: default.rgw.buckets.data
> >
> > rule default.rgw.buckets.data {
> >   id 6
> >   type erasure
> >   min_size 3
> >   max_size 10
> >   step set_chooseleaf_tries 5
> >   step set_choose_tries 100
> >   step take default class big
> >   step chooseleaf indep 0 type host
> >   step emit
> > }
> >
> > But "ceph pg dump_stuck inactive" also shows 4 lines for the glance
> > replicated pool, like:
> >
> > 82.34   activating+remapped  [139,50,207]  139
> > [139,50,284]  139
> > 82.54   activating+undersized+degraded+remapped[139,86,5]  139
> > [139,74]  139
> >
> >
> > Need your help please:
> >
> > - any idea what was the root cause for all this?
> >
> > - and now, how can I help OSDs complete their activation?
> > + does the procedure differ for EC or replicated pools, by the way?
> > + or may be I should first get rid of the "slow ops" issue?
> >
> > I am pasting:
> > ceph osd df tree
> > https://pastebin.ubuntu.com/p/VWhT7FWf6m/
> >
> > ceph osd lspools ; ceph pg dump_stuck inactive
> > https://pastebin.ubuntu.com/p/9f6rXRYMh4/
> >
> > Thanks a lot!
> >
> > Fulvio
> >
> > --
> > Fulvio Galeazzi
> > GARR-CSD Department
> > tel.: +39-334-6533-250
> > skype: fgaleazzi70
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Bug in crush algorithm? 1 PG with the same OSD twice.

2022-08-30 Thread Dan van der Ster
>> Note: "step choose" was selected by creating the crush rule with ceph on pool 
>> creation. If the default should be "step chooseleaf" (with OSD buckets), then 
>> the automatic crush rule generation in ceph ought to be fixed for EC 
>> profiles.
> Interesting. Which exact command was used to create the pool?

I can reproduce. By default with "host" failure domain, the resulting
rule will "chooseleaf indep host".
But if you create an ec profile with crush-failure-domain=osd, then
resulting rules will "choose indep osd".

We should open a tracker for this.

Either "choose indep osd" and "chooseleaf indep osd" should be give
the same result, or the pool creation should use "chooseleaf indep
osd" in this case.

-- dan


On Tue, Aug 30, 2022 at 1:43 PM Dan van der Ster  wrote:
>
> > Note: "step choose" was selected by creating the crush rule with ceph on 
> > pool creation. If the default should be "step chooseleaf" (with OSD 
> > buckets), then the automatic crush rule generation in ceph ought to be 
> > fixed for EC profiles.
>
> Interesting. Which exact command was used to create the pool?
>
> > These experiments indicate that there is a very weird behaviour 
> > implemented, I would actually call this a serious bug.
>
> I don't think this is a bug. Each of your attempts with different
> _tries values changed the max iterations of the various loops in
> crush. Since this takes crush on different "paths" to find a valid
> OSD, the output is going to be different.
>
> > The resulting mapping should be independent of the maximum number of trials
>
> No this is wrong.. the "tunables" change the mapping. The important
> thing is that every node + client in the cluster agrees on the mapping
> -- and indeed since they all use the same tunables, including the
> values for *_tries, they will all agree on the up/acting set.
>
> Cheers, Dan
>
> On Tue, Aug 30, 2022 at 1:10 PM Frank Schilder  wrote:
> >
> > Hi Dan,
> >
> > thanks a lot for looking into this. I can't entirely reproduce your 
> > results. Maybe we are using different versions and there was a change? I'm 
> > testing with the octopus 15.2.16 image: quay.io/ceph/ceph:v15.2.16.
> >
> > Note: "step choose" was selected by creating the crush rule with ceph on 
> > pool creation. If the default should be "step chooseleaf" (with OSD 
> > buckets), then the automatic crush rule generation in ceph ought to be 
> > fixed for EC profiles.
> >
> > My results with the same experiments as you did, I can partly confirm and 
> > partly I see oddness that I would consider a bug (reported at the very end):
> >
> > rule fs-data {
> > id 1
> > type erasure
> > min_size 3
> > max_size 6
> > step take default
> > step choose indep 0 type osd
> > step emit
> > }
> >
> > # osdmaptool --test-map-pg 4.1c osdmap.bin
> > osdmaptool: osdmap file 'osdmap.bin'
> >  parsed '4.1c' -> 4.1c
> > 4.1c raw ([6,1,4,5,3,2147483647], p6) up ([6,1,4,5,3,2147483647], p6) 
> > acting ([6,1,4,5,3,1], p6)
> >
> > rule fs-data {
> > id 1
> > type erasure
> > min_size 3
> > max_size 6
> > step take default
> > step chooseleaf indep 0 type osd
> > step emit
> > }
> >
> > # osdmaptool --test-map-pg 4.1c osdmap.bin
> > osdmaptool: osdmap file 'osdmap.bin'
> >  parsed '4.1c' -> 4.1c
> > 4.1c raw ([6,1,4,5,3,2], p6) up ([6,1,4,5,3,2], p6) acting ([6,1,4,5,3,1], 
> > p6)
> >
> > So far, so good. Now the oddness:
> >
> > rule fs-data {
> > id 1
> > type erasure
> > min_size 3
> > max_size 6
> > step set_chooseleaf_tries 5
> > step set_choose_tries 100
> > step take default
> > step chooseleaf indep 0 type osd
> > step emit
> > }
> >
> > # osdmaptool --test-map-pg 4.1c osdmap.bin
> > osdmaptool: osdmap file 'osdmap.bin'
> >  parsed '4.1c' -> 4.1c
> > 4.1c raw ([6,1,4,5,3,8], p6) up ([6,1,4,5,3,8], p6) acting ([6,1,4,5,3,1], 
> > p6)
> >
> > How can this be different?? I thought crush returns on the first successful 
> > mapping. This ought to be identical to the previous one. It gets even more 
> > weird:
> >
> > rule fs-data {
> > id 1
> > type erasure
> > min_size 3
> > m

[ceph-users] Re: Bug in crush algorithm? 1 PG with the same OSD twice.

2022-08-30 Thread Dan van der Ster
> Note: "step choose" was selected by creating the crush rule with ceph on pool 
> creation. If the default should be "step chooseleaf" (with OSD buckets), then 
> the automatic crush rule generation in ceph ought to be fixed for EC profiles.

Interesting. Which exact command was used to create the pool?

> These experiments indicate that there is a very weird behaviour implemented, 
> I would actually call this a serious bug.

I don't think this is a bug. Each of your attempts with different
_tries values changed the max iterations of the various loops in
crush. Since this takes crush on different "paths" to find a valid
OSD, the output is going to be different.

> The resulting mapping should be independent of the maximum number of trials

No this is wrong.. the "tunables" change the mapping. The important
thing is that every node + client in the cluster agrees on the mapping
-- and indeed since they all use the same tunables, including the
values for *_tries, they will all agree on the up/acting set.

Cheers, Dan

On Tue, Aug 30, 2022 at 1:10 PM Frank Schilder  wrote:
>
> Hi Dan,
>
> thanks a lot for looking into this. I can't entirely reproduce your results. 
> Maybe we are using different versions and there was a change? I'm testing 
> with the octopus 15.2.16 image: quay.io/ceph/ceph:v15.2.16.
>
> Note: "step choose" was selected by creating the crush rule with ceph on pool 
> creation. If the default should be "step chooseleaf" (with OSD buckets), then 
> the automatic crush rule generation in ceph ought to be fixed for EC profiles.
>
> My results with the same experiments as you did, I can partly confirm and 
> partly I see oddness that I would consider a bug (reported at the very end):
>
> rule fs-data {
> id 1
> type erasure
> min_size 3
> max_size 6
> step take default
> step choose indep 0 type osd
> step emit
> }
>
> # osdmaptool --test-map-pg 4.1c osdmap.bin
> osdmaptool: osdmap file 'osdmap.bin'
>  parsed '4.1c' -> 4.1c
> 4.1c raw ([6,1,4,5,3,2147483647], p6) up ([6,1,4,5,3,2147483647], p6) acting 
> ([6,1,4,5,3,1], p6)
>
> rule fs-data {
> id 1
> type erasure
> min_size 3
> max_size 6
> step take default
> step chooseleaf indep 0 type osd
> step emit
> }
>
> # osdmaptool --test-map-pg 4.1c osdmap.bin
> osdmaptool: osdmap file 'osdmap.bin'
>  parsed '4.1c' -> 4.1c
> 4.1c raw ([6,1,4,5,3,2], p6) up ([6,1,4,5,3,2], p6) acting ([6,1,4,5,3,1], p6)
>
> So far, so good. Now the oddness:
>
> rule fs-data {
> id 1
> type erasure
> min_size 3
> max_size 6
> step set_chooseleaf_tries 5
> step set_choose_tries 100
> step take default
> step chooseleaf indep 0 type osd
> step emit
> }
>
> # osdmaptool --test-map-pg 4.1c osdmap.bin
> osdmaptool: osdmap file 'osdmap.bin'
>  parsed '4.1c' -> 4.1c
> 4.1c raw ([6,1,4,5,3,8], p6) up ([6,1,4,5,3,8], p6) acting ([6,1,4,5,3,1], p6)
>
> How can this be different?? I thought crush returns on the first successful 
> mapping. This ought to be identical to the previous one. It gets even more 
> weird:
>
> rule fs-data {
> id 1
> type erasure
> min_size 3
> max_size 6
> step set_chooseleaf_tries 50
> step set_choose_tries 200
> step take default
> step chooseleaf indep 0 type osd
> step emit
> }
>
> # osdmaptool --test-map-pg 4.1c osdmap.bin
> osdmaptool: osdmap file 'osdmap.bin'
>  parsed '4.1c' -> 4.1c
> 4.1c raw ([6,1,4,5,3,7], p6) up ([6,1,4,5,3,2147483647], p6) acting 
> ([6,1,4,5,3,1], p6)
>
> > What?! We increase the maximum number of trials for searching and we 
> end up with an invalid mapping??
>
> These experiments indicate that there is a very weird behaviour implemented, 
> I would actually call this a serious bug. The resulting mapping should be 
> independent of the maximum number of trials (if I understood the crush 
> algorithm correctly). In any case, a valid mapping should never be replaced 
> in favour of an invalid one (containing a down+out OSD).
>
> For now there is a happy end on my test cluster:
>
> # ceph pg dump pgs_brief | grep 4.1c
> dumped pgs_brief
> 4.1c active+remapped+backfilling  [6,1,4,5,3,8]   6  
> [6,1,4,5,3,1]   6
>
> Please look into the extremely odd behaviour reported above. I'm quite 
> confident that this is unintended if not dangerous behaviour and should be 
> corrected. I'm willing to file a tracker item with the data above. I'm 
> actually wo

[ceph-users] Re: Bug in crush algorithm? 1 PG with the same OSD twice.

2022-08-30 Thread Dan van der Ster
BTW, the defaults for _tries seem to work too:


# diff -u crush.txt crush.txt2
--- crush.txt 2022-08-30 11:27:41.941836374 +0200
+++ crush.txt2 2022-08-30 11:55:45.601891010 +0200
@@ -90,10 +90,10 @@
  type erasure
  min_size 3
  max_size 6
- step set_chooseleaf_tries 50
- step set_choose_tries 200
+ step set_chooseleaf_tries 5
+ step set_choose_tries 100
  step take default
- step choose indep 0 type osd
+ step chooseleaf indep 0 type osd
  step emit
 }

# osdmaptool --test-map-pg 4.1c osdmap.bin2
osdmaptool: osdmap file 'osdmap.bin2'
 parsed '4.1c' -> 4.1c
4.1c raw ([6,1,4,5,3,8], p6) up ([6,1,4,5,3,8], p6) acting ([6,1,4,5,3,1], p6)


-- dan

On Tue, Aug 30, 2022 at 11:50 AM Dan van der Ster  wrote:
>
> BTW, I vaguely recalled seeing this before. Yup, found it:
> https://tracker.ceph.com/issues/55169
>
> On Tue, Aug 30, 2022 at 11:46 AM Dan van der Ster  wrote:
> >
> > > 2. osd.7 is destroyed but still "up" in the osdmap.
> >
> > Oops, you can ignore this point -- this was an observation I had while
> > playing with the osdmap -- your osdmap.bin has osd.7 down correctly.
> >
> > In case you're curious, here was what confused me:
> >
> > # osdmaptool osdmap.bin2  --mark-up-in --mark-out 7 --dump plain
> > osd.7 up   out weight 0 up_from 3846 up_thru 3853 down_at 3855
> > last_clean_interval [0,0)
> > [v2:10.41.24.15:6810/1915819,v1:10.41.24.15:6811/1915819]
> > [v2:192.168.0.15:6808/1915819,v1:192.168.0.15:6809/1915819]
> > destroyed,exists,up
> >
> > Just ignore this ...
> >
> >
> >
> > -- dan
> >
> > On Tue, Aug 30, 2022 at 11:41 AM Dan van der Ster  
> > wrote:
> > >
> > > Hi Frank,
> > >
> > > I suspect this is a combination of issues.
> > > 1. You have "choose" instead of "chooseleaf" in rule 1.
> > > 2. osd.7 is destroyed but still "up" in the osdmap.
> > > 3. The _tries settings in rule 1 are not helping.
> > >
> > > Here are my tests:
> > >
> > > # osdmaptool --test-map-pg 4.1c osdmap.bin
> > > osdmaptool: osdmap file 'osdmap.bin'
> > >  parsed '4.1c' -> 4.1c
> > > 4.1c raw ([6,1,4,5,3,2147483647], p6) up ([6,1,4,5,3,2147483647], p6)
> > > acting ([6,1,4,5,3,1], p6)
> > >
> > > ^^ This is what you observe now.
> > >
> > > # diff -u crush.txt crush.txt2
> > > --- crush.txt 2022-08-30 11:27:41.941836374 +0200
> > > +++ crush.txt2 2022-08-30 11:31:29.631491424 +0200
> > > @@ -93,7 +93,7 @@
> > >   step set_chooseleaf_tries 50
> > >   step set_choose_tries 200
> > >   step take default
> > > - step choose indep 0 type osd
> > > + step chooseleaf indep 0 type osd
> > >   step emit
> > >  }
> > > # crushtool -c crush.txt2 -o crush.map2
> > > # cp osdmap.bin osdmap.bin2
> > > # osdmaptool --import-crush crush.map2 osdmap.bin2
> > > osdmaptool: osdmap file 'osdmap.bin2'
> > > osdmaptool: imported 1166 byte crush map from crush.map2
> > > osdmaptool: writing epoch 4990 to osdmap.bin2
> > > # osdmaptool --test-map-pg 4.1c osdmap.bin2
> > > osdmaptool: osdmap file 'osdmap.bin2'
> > >  parsed '4.1c' -> 4.1c
> > > 4.1c raw ([6,1,4,5,3,7], p6) up ([6,1,4,5,3,2147483647], p6) acting
> > > ([6,1,4,5,3,1], p6)
> > >
> > > ^^ The mapping is now "correct" in that it doesn't duplicate the
> > > mapping to osd.1. However it tries to use osd.7 which is destroyed but
> > > up.
> > >
> > > You might be able to fix that by fully marking osd.7 out.
> > > I can also get a good mapping by removing the *_tries settings from rule 
> > > 1:
> > >
> > > # diff -u crush.txt crush.txt2
> > > --- crush.txt 2022-08-30 11:27:41.941836374 +0200
> > > +++ crush.txt2 2022-08-30 11:38:14.068102835 +0200
> > > @@ -90,10 +90,8 @@
> > >   type erasure
> > >   min_size 3
> > >   max_size 6
> > > - step set_chooseleaf_tries 50
> > > - step set_choose_tries 200
> > >   step take default
> > > - step choose indep 0 type osd
> > > + step chooseleaf indep 0 type osd
> > >   step emit
> > >  }
> > > ...
> > > # osdmaptool --test-map-pg 4.1c osdmap.bin2
> > > osdmaptool: osdmap file 'osdmap.bin2'
> > >  parsed '4.1c' -> 4.1c
> > > 4.1c raw ([6,1,4,5,3,2], p6) up ([6,1,4,5,3,2], p6) acting 
> > > ([6,1,4,5,3,1], p6)
> > >
> > > Note that I didn't need to adjust the reweights:
> > >
> >

[ceph-users] Re: Bug in crush algorithm? 1 PG with the same OSD twice.

2022-08-30 Thread Dan van der Ster
BTW, I vaguely recalled seeing this before. Yup, found it:
https://tracker.ceph.com/issues/55169

On Tue, Aug 30, 2022 at 11:46 AM Dan van der Ster  wrote:
>
> > 2. osd.7 is destroyed but still "up" in the osdmap.
>
> Oops, you can ignore this point -- this was an observation I had while
> playing with the osdmap -- your osdmap.bin has osd.7 down correctly.
>
> In case you're curious, here was what confused me:
>
> # osdmaptool osdmap.bin2  --mark-up-in --mark-out 7 --dump plain
> osd.7 up   out weight 0 up_from 3846 up_thru 3853 down_at 3855
> last_clean_interval [0,0)
> [v2:10.41.24.15:6810/1915819,v1:10.41.24.15:6811/1915819]
> [v2:192.168.0.15:6808/1915819,v1:192.168.0.15:6809/1915819]
> destroyed,exists,up
>
> Just ignore this ...
>
>
>
> -- dan
>
> On Tue, Aug 30, 2022 at 11:41 AM Dan van der Ster  wrote:
> >
> > Hi Frank,
> >
> > I suspect this is a combination of issues.
> > 1. You have "choose" instead of "chooseleaf" in rule 1.
> > 2. osd.7 is destroyed but still "up" in the osdmap.
> > 3. The _tries settings in rule 1 are not helping.
> >
> > Here are my tests:
> >
> > # osdmaptool --test-map-pg 4.1c osdmap.bin
> > osdmaptool: osdmap file 'osdmap.bin'
> >  parsed '4.1c' -> 4.1c
> > 4.1c raw ([6,1,4,5,3,2147483647], p6) up ([6,1,4,5,3,2147483647], p6)
> > acting ([6,1,4,5,3,1], p6)
> >
> > ^^ This is what you observe now.
> >
> > # diff -u crush.txt crush.txt2
> > --- crush.txt 2022-08-30 11:27:41.941836374 +0200
> > +++ crush.txt2 2022-08-30 11:31:29.631491424 +0200
> > @@ -93,7 +93,7 @@
> >   step set_chooseleaf_tries 50
> >   step set_choose_tries 200
> >   step take default
> > - step choose indep 0 type osd
> > + step chooseleaf indep 0 type osd
> >   step emit
> >  }
> > # crushtool -c crush.txt2 -o crush.map2
> > # cp osdmap.bin osdmap.bin2
> > # osdmaptool --import-crush crush.map2 osdmap.bin2
> > osdmaptool: osdmap file 'osdmap.bin2'
> > osdmaptool: imported 1166 byte crush map from crush.map2
> > osdmaptool: writing epoch 4990 to osdmap.bin2
> > # osdmaptool --test-map-pg 4.1c osdmap.bin2
> > osdmaptool: osdmap file 'osdmap.bin2'
> >  parsed '4.1c' -> 4.1c
> > 4.1c raw ([6,1,4,5,3,7], p6) up ([6,1,4,5,3,2147483647], p6) acting
> > ([6,1,4,5,3,1], p6)
> >
> > ^^ The mapping is now "correct" in that it doesn't duplicate the
> > mapping to osd.1. However it tries to use osd.7 which is destroyed but
> > up.
> >
> > You might be able to fix that by fully marking osd.7 out.
> > I can also get a good mapping by removing the *_tries settings from rule 1:
> >
> > # diff -u crush.txt crush.txt2
> > --- crush.txt 2022-08-30 11:27:41.941836374 +0200
> > +++ crush.txt2 2022-08-30 11:38:14.068102835 +0200
> > @@ -90,10 +90,8 @@
> >   type erasure
> >   min_size 3
> >   max_size 6
> > - step set_chooseleaf_tries 50
> > - step set_choose_tries 200
> >   step take default
> > - step choose indep 0 type osd
> > + step chooseleaf indep 0 type osd
> >   step emit
> >  }
> > ...
> > # osdmaptool --test-map-pg 4.1c osdmap.bin2
> > osdmaptool: osdmap file 'osdmap.bin2'
> >  parsed '4.1c' -> 4.1c
> > 4.1c raw ([6,1,4,5,3,2], p6) up ([6,1,4,5,3,2], p6) acting ([6,1,4,5,3,1], 
> > p6)
> >
> > Note that I didn't need to adjust the reweights:
> >
> > # osdmaptool osdmap.bin2 --tree
> > osdmaptool: osdmap file 'osdmap.bin2'
> > ID CLASS WEIGHT  TYPE NAME STATUSREWEIGHT PRI-AFF
> > -1   2.44798 root default
> > -7   0.81599 host tceph-01
> >  0   hdd 0.27199 osd.0up  0.87999 1.0
> >  3   hdd 0.27199 osd.3up  0.98000 1.0
> >  6   hdd 0.27199 osd.6up  0.92999 1.0
> > -3   0.81599 host tceph-02
> >  2   hdd 0.27199 osd.2up  0.95999 1.0
> >  4   hdd 0.27199 osd.4up  0.8 1.0
> >  8   hdd 0.27199 osd.8up  0.8 1.0
> > -5   0.81599 host tceph-03
> >  1   hdd 0.27199 osd.1up  0.8 1.0
> >  5   hdd 0.27199 osd.5up  1.0 1.0
> >  7   hdd 0.27199 osd.7 destroyed0 1.0
> >
> >
> > Does this work in real life?
> >
> > Cheers, Dan
> >
> >
> > On Mon, Aug 29, 2022 at 7:38 PM Frank Schilder  wrote:
> > >
> > > Hi Dan,
> > >
> > > please find

[ceph-users] Re: Bug in crush algorithm? 1 PG with the same OSD twice.

2022-08-30 Thread Dan van der Ster
> 2. osd.7 is destroyed but still "up" in the osdmap.

Oops, you can ignore this point -- this was an observation I had while
playing with the osdmap -- your osdmap.bin has osd.7 down correctly.

In case you're curious, here was what confused me:

# osdmaptool osdmap.bin2  --mark-up-in --mark-out 7 --dump plain
osd.7 up   out weight 0 up_from 3846 up_thru 3853 down_at 3855
last_clean_interval [0,0)
[v2:10.41.24.15:6810/1915819,v1:10.41.24.15:6811/1915819]
[v2:192.168.0.15:6808/1915819,v1:192.168.0.15:6809/1915819]
destroyed,exists,up

Just ignore this ...



-- dan

On Tue, Aug 30, 2022 at 11:41 AM Dan van der Ster  wrote:
>
> Hi Frank,
>
> I suspect this is a combination of issues.
> 1. You have "choose" instead of "chooseleaf" in rule 1.
> 2. osd.7 is destroyed but still "up" in the osdmap.
> 3. The _tries settings in rule 1 are not helping.
>
> Here are my tests:
>
> # osdmaptool --test-map-pg 4.1c osdmap.bin
> osdmaptool: osdmap file 'osdmap.bin'
>  parsed '4.1c' -> 4.1c
> 4.1c raw ([6,1,4,5,3,2147483647], p6) up ([6,1,4,5,3,2147483647], p6)
> acting ([6,1,4,5,3,1], p6)
>
> ^^ This is what you observe now.
>
> # diff -u crush.txt crush.txt2
> --- crush.txt 2022-08-30 11:27:41.941836374 +0200
> +++ crush.txt2 2022-08-30 11:31:29.631491424 +0200
> @@ -93,7 +93,7 @@
>   step set_chooseleaf_tries 50
>   step set_choose_tries 200
>   step take default
> - step choose indep 0 type osd
> + step chooseleaf indep 0 type osd
>   step emit
>  }
> # crushtool -c crush.txt2 -o crush.map2
> # cp osdmap.bin osdmap.bin2
> # osdmaptool --import-crush crush.map2 osdmap.bin2
> osdmaptool: osdmap file 'osdmap.bin2'
> osdmaptool: imported 1166 byte crush map from crush.map2
> osdmaptool: writing epoch 4990 to osdmap.bin2
> # osdmaptool --test-map-pg 4.1c osdmap.bin2
> osdmaptool: osdmap file 'osdmap.bin2'
>  parsed '4.1c' -> 4.1c
> 4.1c raw ([6,1,4,5,3,7], p6) up ([6,1,4,5,3,2147483647], p6) acting
> ([6,1,4,5,3,1], p6)
>
> ^^ The mapping is now "correct" in that it doesn't duplicate the
> mapping to osd.1. However it tries to use osd.7 which is destroyed but
> up.
>
> You might be able to fix that by fully marking osd.7 out.
> I can also get a good mapping by removing the *_tries settings from rule 1:
>
> # diff -u crush.txt crush.txt2
> --- crush.txt 2022-08-30 11:27:41.941836374 +0200
> +++ crush.txt2 2022-08-30 11:38:14.068102835 +0200
> @@ -90,10 +90,8 @@
>   type erasure
>   min_size 3
>   max_size 6
> - step set_chooseleaf_tries 50
> - step set_choose_tries 200
>   step take default
> - step choose indep 0 type osd
> + step chooseleaf indep 0 type osd
>   step emit
>  }
> ...
> # osdmaptool --test-map-pg 4.1c osdmap.bin2
> osdmaptool: osdmap file 'osdmap.bin2'
>  parsed '4.1c' -> 4.1c
> 4.1c raw ([6,1,4,5,3,2], p6) up ([6,1,4,5,3,2], p6) acting ([6,1,4,5,3,1], p6)
>
> Note that I didn't need to adjust the reweights:
>
> # osdmaptool osdmap.bin2 --tree
> osdmaptool: osdmap file 'osdmap.bin2'
> ID CLASS WEIGHT  TYPE NAME STATUSREWEIGHT PRI-AFF
> -1   2.44798 root default
> -7   0.81599 host tceph-01
>  0   hdd 0.27199 osd.0up  0.87999 1.0
>  3   hdd 0.27199 osd.3up  0.98000 1.0
>  6   hdd 0.27199 osd.6up  0.92999 1.0
> -3   0.81599 host tceph-02
>  2   hdd 0.27199 osd.2up  0.95999 1.0
>  4   hdd 0.27199 osd.4up  0.8 1.0
>  8   hdd 0.27199 osd.8up  0.8 1.0
> -5   0.81599 host tceph-03
>  1   hdd 0.27199 osd.1up  0.8 1.0
>  5   hdd 0.27199 osd.5up  1.0 1.0
>  7   hdd 0.27199 osd.7 destroyed0 1.0
>
>
> Does this work in real life?
>
> Cheers, Dan
>
>
> On Mon, Aug 29, 2022 at 7:38 PM Frank Schilder  wrote:
> >
> > Hi Dan,
> >
> > please find attached (only 7K, so I hope it goes through). 
> > md5sum=1504652f1b95802a9f2fe3725bf1336e
> >
> > I was playing a bit around with the crush map and found out the following:
> >
> > 1) Setting all re-weights to 1 does produce valid mappings. However, it 
> > will lead to large imbalances and is impractical in operations.
> >
> > 2) Doing something as simple/stupid as the following also results in valid 
> > mappings without having to change the weights:
> >
> > rule fs-data {
> > id 1
> > type erasure
> > min_size 3
> > max_size 6
> > step set_chooseleaf_tries 50
> > step set_choose_tries 200
> > 

[ceph-users] Re: Bug in crush algorithm? 1 PG with the same OSD twice.

2022-08-30 Thread Dan van der Ster
> but bad distribution of data if a disk goes down (considering the tiny host- 
> and disk numbers). The second rule seems to be almost as good or bad as the 
> default one (step choose indep 0 type osd), except that it does produce valid 
> mappings where the default rule fails.
>
> I will wait with changing the rule in the hope that you find a more elegant 
> solution to this riddle.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Dan van der Ster 
> Sent: 29 August 2022 19:13
> To: Frank Schilder
> Subject: Re: [ceph-users] Bug in crush algorithm? 1 PG with the same OSD 
> twice.
>
> Hi Frank,
>
> Could you share the osdmap so I can try to solve this riddle?
>
> Cheers , Dan
>
>
> On Mon, Aug 29, 2022, 17:26 Frank Schilder 
> mailto:fr...@dtu.dk>> wrote:
> Hi Dan,
>
> thanks for your answer. I'm not really convinced that we hit a corner case 
> here and even if its one, it seems quite relevant for production clusters. 
> The usual way to get a valid mapping is to increase the number of tries. I 
> increased the following max trial numbers, which I would expect to produce a 
> mapping for all PGs:
>
> # diff map-now.txt map-new.txt
> 4c4
> < tunable choose_total_tries 50
> ---
> > tunable choose_total_tries 250
> 93,94c93,94
> <   step set_chooseleaf_tries 5
> <   step set_choose_tries 100
> ---
> >   step set_chooseleaf_tries 50
> >   step set_choose_tries 200
>
> When I test the map with crushtool it does not report bad mappings. Am I 
> looking at the wrong tunables to increase? It should be possible to get valid 
> mappings without having to modify the re-weights.
>
> Thanks again for your help!
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Dan van der Ster mailto:dvand...@gmail.com>>
> Sent: 29 August 2022 16:52:52
> To: Frank Schilder
> Cc: ceph-users@ceph.io<mailto:ceph-users@ceph.io>
> Subject: Re: [ceph-users] Bug in crush algorithm? 1 PG with the same OSD 
> twice.
>
> Hi Frank,
>
> CRUSH can only find 5 OSDs, given your current tree, rule, and
> reweights. This is why there is a NONE in the UP set for shard 6.
> But in ACTING we see that it is refusing to remove shard 6 from osd.1
> -- that is the only copy of that shard, so in this case it's helping
> you rather than deleting the shard altogether.
> ACTING == what the OSDs are serving now.
> UP == where CRUSH wants to place the shards.
>
> I suspect that this is a case of CRUSH tunables + your reweights
> putting CRUSH in a corner case of not finding 6 OSDs for that
> particular PG.
> If you set the reweights all back to 1, it probably finds 6 OSDs?
>
> Cheers, Dan
>
>
> On Mon, Aug 29, 2022 at 4:44 PM Frank Schilder 
> mailto:fr...@dtu.dk>> wrote:
> >
> > Hi all,
> >
> > I'm investigating a problem with a degenerated PG on an octopus 15.2.16 
> > test cluster. It has 3Hosts x 3OSDs and a 4+2 EC pool with failure domain 
> > OSD. After simulating a disk fail by removing an OSD and letting the 
> > cluster recover (all under load), I end up with a PG with the same OSD 
> > allocated twice:
> >
> > PG 4.1c, UP: [6,1,4,5,3,NONE] ACTING: [6,1,4,5,3,1]
> >
> > OSD 1 is allocated twice. How is this even possible?
> >
> > Here the OSD tree:
> >
> > ID  CLASS  WEIGHT   TYPE NAME  STATUS REWEIGHT  PRI-AFF
> > -1 2.44798  root default
> > -7 0.81599  host tceph-01
> >  0hdd  0.27199  osd.0 up   0.87999  1.0
> >  3hdd  0.27199  osd.3 up   0.98000  1.0
> >  6hdd  0.27199  osd.6 up   0.92999  1.0
> > -3 0.81599  host tceph-02
> >  2hdd  0.27199  osd.2 up   0.95999  1.0
> >  4hdd  0.27199  osd.4 up   0.8  1.0
> >  8hdd  0.27199  osd.8 up   0.8  1.0
> > -5 0.81599  host tceph-03
> >  1hdd  0.27199  osd.1 up   0.8  1.0
> >  5hdd  0.27199  osd.5 up   1.0  1.0
> >  7hdd  0.27199  osd.7  destroyed 0  1.0
> >
> > I tried already to change some tunables thinking about 
> > https://docs.ceph.com/en/octopus/rados/troubleshooting/troubleshooting-pg/#crush-gives-up-too-soon,
> >  but giving up too soon is obviously not the problem. It is accepting a 
> > wrong mapping.
> >
> > Is there a way out of this? Clearly this is calling for trouble if not data 
> > loss and should not happen at all.
> >
> > Best regards,
> > =
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io<mailto:ceph-users@ceph.io>
> > To unsubscribe send an email to 
> > ceph-users-le...@ceph.io<mailto:ceph-users-le...@ceph.io>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Bug in crush algorithm? 1 PG with the same OSD twice.

2022-08-29 Thread Dan van der Ster
Hi Frank,

CRUSH can only find 5 OSDs, given your current tree, rule, and
reweights. This is why there is a NONE in the UP set for shard 6.
But in ACTING we see that it is refusing to remove shard 6 from osd.1
-- that is the only copy of that shard, so in this case it's helping
you rather than deleting the shard altogether.
ACTING == what the OSDs are serving now.
UP == where CRUSH wants to place the shards.

I suspect that this is a case of CRUSH tunables + your reweights
putting CRUSH in a corner case of not finding 6 OSDs for that
particular PG.
If you set the reweights all back to 1, it probably finds 6 OSDs?
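
(A quick way to compare the two sets for a given PG -- 4.1c here is just your
example PG -- is:

# ceph pg map 4.1c
# ceph pg dump pgs_brief | grep 4.1c

both print the up and acting sets side by side.)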

Cheers, Dan


On Mon, Aug 29, 2022 at 4:44 PM Frank Schilder  wrote:
>
> Hi all,
>
> I'm investigating a problem with a degenerated PG on an octopus 15.2.16 test 
> cluster. It has 3Hosts x 3OSDs and a 4+2 EC pool with failure domain OSD. 
> After simulating a disk fail by removing an OSD and letting the cluster 
> recover (all under load), I end up with a PG with the same OSD allocated 
> twice:
>
> PG 4.1c, UP: [6,1,4,5,3,NONE] ACTING: [6,1,4,5,3,1]
>
> OSD 1 is allocated twice. How is this even possible?
>
> Here the OSD tree:
>
> ID  CLASS  WEIGHT   TYPE NAME  STATUS REWEIGHT  PRI-AFF
> -1 2.44798  root default
> -7 0.81599  host tceph-01
>  0hdd  0.27199  osd.0 up   0.87999  1.0
>  3hdd  0.27199  osd.3 up   0.98000  1.0
>  6hdd  0.27199  osd.6 up   0.92999  1.0
> -3 0.81599  host tceph-02
>  2hdd  0.27199  osd.2 up   0.95999  1.0
>  4hdd  0.27199  osd.4 up   0.8  1.0
>  8hdd  0.27199  osd.8 up   0.8  1.0
> -5 0.81599  host tceph-03
>  1hdd  0.27199  osd.1 up   0.8  1.0
>  5hdd  0.27199  osd.5 up   1.0  1.0
>  7hdd  0.27199  osd.7  destroyed 0  1.0
>
> I tried already to change some tunables thinking about 
> https://docs.ceph.com/en/octopus/rados/troubleshooting/troubleshooting-pg/#crush-gives-up-too-soon,
>  but giving up too soon is obviously not the problem. It is accepting a wrong 
> mapping.
>
> Is there a way out of this? Clearly this is calling for trouble if not data 
> loss and should not happen at all.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Some odd results while testing disk performance related to write caching

2022-08-15 Thread Dan van der Ster
Hi,

We have some docs about this in the Ceph hardware recommendations:
https://docs.ceph.com/en/latest/start/hardware-recommendations/#write-caches

I added some responses inline..

On Fri, Aug 5, 2022 at 7:23 PM Torbjörn Jansson  wrote:
>
> Hello
>
> I got a small 3 node ceph cluster and I'm doing some benchmarking related to
> performance with drive write caching.
>
> the reason i started was because i wanted to test the SSDs i have for their
> performance for use as db device for the osds and make sure they are setup as
> good as i can get it.
>
> i read that turning off write cache can be beneficial even when it sounds
> backwards.

"write cache" is a volatile cache -- so when it is enabled, Linux
knows that it is writing to a volatile area on the device and
therefore it needs to issue flushes to persist data. Linux considers
these devices to be in "write back" mode.
When the write cache is disabled, then Linux knows it is writing to a
persisted area, and therefore doesn't bother sending flushes anymore
-- these devices are in "write through" mode.
And btw, new data centre class devices have firmware and special
hardware to accelerate those persisted writes when the volatile cache
is disabled. This is the so-called media cache.

> this seems to be true.
> i used mainly fio and "iostat -x" to test using something like:
> fio --filename=/dev/ceph-db-0/bench --direct=1 --sync=1 --rw=write --bs=4k
> --numjobs=5 --iodepth=1 --runtime=60 --time_based --group_reporting
>
> and then testing this with write cache turned off and on to compare the 
> results.
> also with and without sync in fio command above.
>
> one thing i observed related to turning off the write cache on drives was that
> it appears a reboot is needed for it to have any effect.

This depends on the OS -- if you set the cache using the approach
mentioned in the docs above, then in all distros we tested it keeps
WCE and "write through" consistent with each other.

> and this is where it gets strange and the part i don't get.
>
> the disks i have, seagate nytro sas3 ssd, according to the drive manual the
> drive don't care what you set the WCE bit to and it will do write caching
> internally regardless.
> most likely because it is an enterprise disk with built in power loss 
> protection.
>
> BUT it makes a big difference to the performance and the flushes per second in
> iostat.
> so it appears that if you boot and the drive got its write cache disabled 
> right
> from the start (dmesg contains stuff like: "sd 0:0:0:0: [sda] Write cache:
> disabled") then linux wont send any flush to the drive and you get good
> performance.
> if you change the write caching on a drive during runtime (sdparm for sas or
> hdparm for sata) then it wont change anything.

Check the cache_type at e.g. /sys/class/scsi_disk/0\:0\:0\:0/cache_type
"write back" -> flush is sent
"write through" -> flush not sent

> why is that? why do i have to do a reboot?
> i mean, lets say you boot with write cache disabled, linux decides to never
> send flush and you change it after boot to enable the cache, if there is no
> flush then you risk your data in case of a power loss, or?

On all devices we have, if we have "write through" at boot, then set
(with hdparm or sdparm) WCE=1 or echo "write back" > ...
then the cache_type is automatically set correctly to "write back" and
flushes are sent.

There is another /sys/ entry to toggle flush behaviour: echo "write
through" > /sys/block/sda/queue/write_cache
This is apparently a way to lie to the OS so it stops sending flushes
(without manipulating the WCE mode of the underlying device).
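
To keep the knobs from this thread in one place, a minimal sketch (sda and
0:0:0:0 are placeholders for your own device):

# cat /sys/class/scsi_disk/0\:0\:0\:0/cache_type  # "write back" or "write through"
# cat /sys/block/sda/queue/write_cache            # whether the block layer will send flushes
# sdparm -s WCE=0 --save /dev/sda                 # SAS: persistently disable the volatile cache
# hdparm -W 0 /dev/sda                            # SATA: disable the volatile cache

After changing WCE, re-check cache_type to confirm the kernel picked it up.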

Cheers, Dan

> this is not very obvious or good behavior i think (i hope i'm wrong and some
> one can enlighten me)
>
>
> for sas drives sdparm -s WCE=0 --save /dev/sdX appears to do the right thing
> and it survives a reboot.
> but for sata disks hdparm -W 0 -K 1 /dev/sdX makes the change but as long as
> drive is connected to sas controller it still gets the write cache enabled at
> boot so i bet sas controller also messes with the write cache setting on the
> drives.
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: mgr service restarted by package install?

2022-07-18 Thread Dan van der Ster
Hi,

It probably wasn't restarted by the package, but the mgr itself
respawned because the set of enabled modules changed.
E.g. this happens when upgrading from octopus to pacific, just after
the pacific mons get a quorum:

2022-07-13T11:43:41.308+0200 7f71c0c86700  1 mgr handle_mgr_map
respawning because set of enabled modules changed!
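
To confirm that's what happened on your nodes, you can grep the active mgr's
log for that line -- e.g., assuming a systemd deployment and default log paths:

# journalctl -u ceph-mgr@$(hostname -s) | grep 'respawning because set of enabled modules changed'
# grep -H respawn /var/log/ceph/ceph-mgr.*.log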

Cheers, Dan


On Sat, Jul 16, 2022 at 4:34 PM Matthias Ferdinand  wrote:
>
> Hi,
>
> while updating a test cluster (Ubuntu 20.04) from octopus (ubuntu repos)
> to quincy (ceph repos), I noticed that mgr service gets restarted during
> package install.
>
> Right after package install (no manual restarts yet) on 3 combined
> mon/mgr hosts:
>
> # ceph versions
> {
> "mon": {
> "ceph version 15.2.16 (d46a73d6d0a67a79558054a3a5a72cb561724974) 
> octopus (stable)": 3
> },
> "mgr": {
> "ceph version 17.2.1 (ec95624474b1871a821a912b8c3af68f8f8e7aa1) 
> quincy (stable)": 3
> },
> "osd": {
> "ceph version 15.2.16 (d46a73d6d0a67a79558054a3a5a72cb561724974) 
> octopus (stable)": 8
> },
> "mds": {},
> "overall": {
> "ceph version 15.2.16 (d46a73d6d0a67a79558054a3a5a72cb561724974) 
> octopus (stable)": 11,
> "ceph version 17.2.1 (ec95624474b1871a821a912b8c3af68f8f8e7aa1) 
> quincy (stable)": 3
> }
> }
>
>
> Not sure how problematic this is, but AFAIK it was claimed that ceph
> package installs would not restart ceph services by themselves.
>
>
> Regards
> Matthias Ferdinand
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Slow osdmaptool upmap performance

2022-07-18 Thread Dan van der Ster
Hi,

Can you try with the fix for this? https://tracker.ceph.com/issues/54180
(https://github.com/ceph/ceph/pull/44925)

It hasn't been backported to any releases, but we could request that
if it looks useful.
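
For reference, a rough sketch of the offline workflow around it (pool and file
names taken from your example), so nothing touches the cluster until you apply
the output:

# ceph osd getmap -o om.hdd
# osdmaptool om.hdd --upmap upmap.out --upmap-pool fs.data.mirror.hdd.ec --upmap-max 1000 --upmap-deviation 1
# less upmap.out      # review the generated 'ceph osd pg-upmap-items ...' commands
# source upmap.out    # apply them when happy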

Cheers, Dan


On Mon, Jul 18, 2022 at 12:44 AM stuart.anderson
 wrote:
>
> I am seeing very long run times for osdmaptool --upmap running Ceph 15.2.16 
> and I am wondering how to speed that up?
>
> When run on a large pool (PG=8k and OSD=725) I am seeing the following run 
> times for,
> # osdmaptool om.hdd --upmap upmap.mirror.hdd.ec.8 --upmap-pool 
> fs.data.mirror.hdd.ec --upmap-max 1000 --upmap-deviation 8
>
> upmap-deviation time
> --- 
> 15  6m
> 10  11m
> 9   12m
> 8   76m/19m
> 5   killed after 30 hours
>
> I really want to run with --upmap-deviation 1 to balance this unbalanced 
> pool that currently ranges from 25-90% utilization on individual OSD, 
> however, it is not clear osdmaptool would ever complete.
>
> By comparison another similarly sized pool (PG=4k OSD=1538) in the same 
> cluster only takes a few seconds to run with upmap-deviation=1.
>
> Any suggestions on how to get osdmaptool --upmap to run faster?
>
> Thanks.
>
> ---
> ander...@ligo.caltech.edu
>
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: pacific doesn't defer small writes for pre-pacific hdd osds

2022-07-14 Thread Dan van der Ster
OK I recreated one OSD. It now has 4k min_alloc_size:

2022-07-14T10:52:58.382+0200 7fe5ec0aa200  1
bluestore(/var/lib/ceph/osd/ceph-0/) _open_super_meta min_alloc_size
0x1000

and I tested all these bluestore_prefer_deferred_size_hdd values:

4096: not deferred
4097: "_do_alloc_write deferring 0x1000 write via deferred"
65536: "_do_alloc_write deferring 0x1000 write via deferred"
65537: "_do_alloc_write deferring 0x1000 write via deferred"

With bluestore_prefer_deferred_size_hdd = 64k, I see that writes up to
0xf000 are deferred, e.g.:

 _do_alloc_write deferring 0xf000 write via deferred
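
(For anyone who wants to check this on their own OSDs: the "deferring ... via
deferred" lines only appear with bluestore debug logging turned up. Something
like the following works -- log paths depend on your deployment:

# ceph tell osd.0 config set debug_bluestore 20
# grep 'via deferred' /var/log/ceph/ceph-osd.0.log
# ceph tell osd.0 config set debug_bluestore 1/5   # back to the default afterwards

Don't forget the last step, the level-20 logging is very chatty.)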

Cheers, Dan

On Thu, Jul 14, 2022 at 9:37 AM Konstantin Shalygin  wrote:
>
> Dan, did you test redeploying one of your OSDs with the default pacific 
> bluestore_min_alloc_size_hdd (4096)?
> Would that also resolve this issue (i.e. it simply isn't triggered when all options are 
> at their defaults)?
>
>
> Thanks,
> k
>
> On 14 Jul 2022, at 08:43, Dan van der Ster  wrote:
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: pacific doesn't defer small writes for pre-pacific hdd osds

2022-07-13 Thread Dan van der Ster
Yes, that is correct. No need to restart the osds.
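
For example, something like this is all that's needed (the value and osd id are
just examples):

# ceph config set osd bluestore_prefer_deferred_size_hdd 65537
# ceph daemon osd.0 config get bluestore_prefer_deferred_size_hdd   # on the OSD host, to verify

The running OSDs pick up the new value without a restart.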

.. Dan


On Thu., Jul. 14, 2022, 07:04 Zakhar Kirpichenko,  wrote:

> Hi!
>
> My apologies for butting in. Please confirm
> that bluestore_prefer_deferred_size_hdd is a runtime option, which doesn't
> require OSDs to be stopped or rebuilt?
>
> Best regards,
> Zakhar
>
> On Tue, 12 Jul 2022 at 14:46, Dan van der Ster  wrote:
>
>> Hi Igor,
>>
>> Thank you for the reply and information.
>> I confirm that `ceph config set osd bluestore_prefer_deferred_size_hdd
>> 65537` correctly defers writes in my clusters.
>>
>> Best regards,
>>
>> Dan
>>
>>
>>
>> On Tue, Jul 12, 2022 at 1:16 PM Igor Fedotov 
>> wrote:
>> >
>> > Hi Dan,
>> >
>> > I can confirm this is a regression introduced by
>> https://github.com/ceph/ceph/pull/42725.
>> >
>> > Indeed strict comparison is a key point in your specific case but
>> generally  it looks like this piece of code needs more redesign to better
>> handle fragmented allocations (and issue deferred write for every short
>> enough fragment independently).
>> >
>> > So I'm looking for a way to improve that at the moment. Will fallback
>> to trivial comparison fix if I fail to do find better solution.
>> >
>> > Meanwhile you can adjust bluestore_min_alloc_size_hdd indeed but I'd
>> prefer not to raise it that high as 128K to avoid too many writes being
>> deferred (and hence DB overburden).
>> >
>> > IMO setting the parameter to 64K+1 should be fine.
>> >
>> >
>> > Thanks,
>> >
>> > Igor
>> >
>> > On 7/7/2022 12:43 AM, Dan van der Ster wrote:
>> >
>> > Hi Igor and others,
>> >
>> > (apologies for html, but i want to share a plot ;) )
>> >
>> > We're upgrading clusters to v16.2.9 from v15.2.16, and our simple
>> "rados bench -p test 10 write -b 4096 -t 1" latency probe showed something
>> is very wrong with deferred writes in pacific.
>> > Here is an example cluster, upgraded today:
>> >
>> >
>> >
>> > The OSDs are 12TB HDDs, formatted in nautilus with the default
>> bluestore_min_alloc_size_hdd = 64kB, and each have a large flash block.db.
>> >
>> > I found that the performance issue is because 4kB writes are no longer
>> deferred from those pre-pacific hdds to flash in pacific with the default
>> config !!!
>> > Here are example bench writes from both releases:
>> https://pastebin.com/raw/m0yL1H9Z
>> >
>> > I worked out that the issue is fixed if I set
>> bluestore_prefer_deferred_size_hdd = 128k (up from the 64k pacific default.
>> Note the default was 32k in octopus).
>> >
>> > I think this is related to the fixes in
>> https://tracker.ceph.com/issues/52089 which landed in 16.2.6 --
>> _do_alloc_write is comparing the prealloc size 0x10000 with
>> bluestore_prefer_deferred_size_hdd (0x10000) and the "strictly less than"
>> condition prevents deferred writes from ever happening.
>> >
>> > So I think this would impact anyone upgrading clusters with hdd/ssd
>> mixed osds ... surely we must not be the only clusters impacted by this?!
>> >
>> > Should we increase the default bluestore_prefer_deferred_size_hdd up to
>> 128kB or is there in fact a bug here?
>> >
>> > Best Regards,
>> >
>> > Dan
>> >
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: pacific doesn't defer small writes for pre-pacific hdd osds

2022-07-12 Thread Dan van der Ster
Hi Igor,

Thank you for the reply and information.
I confirm that `ceph config set osd bluestore_prefer_deferred_size_hdd
65537` correctly defers writes in my clusters.

Best regards,

Dan



On Tue, Jul 12, 2022 at 1:16 PM Igor Fedotov  wrote:
>
> Hi Dan,
>
> I can confirm this is a regression introduced by 
> https://github.com/ceph/ceph/pull/42725.
>
> Indeed strict comparison is a key point in your specific case but generally  
> it looks like this piece of code needs more redesign to better handle 
> fragmented allocations (and issue deferred write for every short enough 
> fragment independently).
>
> So I'm looking for a way to improve that at the moment. Will fallback to 
> trivial comparison fix if I fail to do find better solution.
>
> Meanwhile you can adjust bluestore_min_alloc_size_hdd indeed but I'd prefer 
> not to raise it that high as 128K to avoid too many writes being deferred 
> (and hence DB overburden).
>
> IMO setting the parameter to 64K+1 should be fine.
>
>
> Thanks,
>
> Igor
>
> On 7/7/2022 12:43 AM, Dan van der Ster wrote:
>
> Hi Igor and others,
>
> (apologies for html, but i want to share a plot ;) )
>
> We're upgrading clusters to v16.2.9 from v15.2.16, and our simple "rados 
> bench -p test 10 write -b 4096 -t 1" latency probe showed something is very 
> wrong with deferred writes in pacific.
> Here is an example cluster, upgraded today:
>
>
>
> The OSDs are 12TB HDDs, formatted in nautilus with the default 
> bluestore_min_alloc_size_hdd = 64kB, and each have a large flash block.db.
>
> I found that the performance issue is because 4kB writes are no longer 
> deferred from those pre-pacific hdds to flash in pacific with the default 
> config !!!
> Here are example bench writes from both releases: 
> https://pastebin.com/raw/m0yL1H9Z
>
> I worked out that the issue is fixed if I set 
> bluestore_prefer_deferred_size_hdd = 128k (up from the 64k pacific default. 
> Note the default was 32k in octopus).
>
> I think this is related to the fixes in https://tracker.ceph.com/issues/52089 
> which landed in 16.2.6 -- _do_alloc_write is comparing the prealloc size 
> 0x10000 with bluestore_prefer_deferred_size_hdd (0x10000) and the "strictly 
> less than" condition prevents deferred writes from ever happening.
>
> So I think this would impact anyone upgrading clusters with hdd/ssd mixed 
> osds ... surely we must not be the only clusters impacted by this?!
>
> Should we increase the default bluestore_prefer_deferred_size_hdd up to 128kB 
> or is there in fact a bug here?
>
> Best Regards,
>
> Dan
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: pacific doesn't defer small writes for pre-pacific hdd osds

2022-07-07 Thread Dan van der Ster
Hi,

On Thu, Jul 7, 2022 at 2:37 PM Konstantin Shalygin  wrote:
>
> Hi,
>
> On 7 Jul 2022, at 13:04, Dan van der Ster  wrote:
>
> I'm not sure the html mail made it to the lists -- resending in plain text.
> I've also opened https://tracker.ceph.com/issues/56488
>
>
> I think with pacific you need to redeploy all OSD's to respect the new 
> default bluestore_min_alloc_size_hdd = 4096 [1]
> Or not? 
>

Understood, yes, that is another "solution". But it is incredibly
impractical, I would say impossible, for loaded production
installations.
(How is one supposed to redeploy OSDs on a multi-PB cluster while the
performance is degraded?)
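
(To at least see which OSDs are affected without redeploying anything: the
min_alloc_size an OSD was formatted with is printed at startup, so something
like the following works -- log path depends on your deployment:

# grep '_open_super_meta min_alloc_size' /var/log/ceph/ceph-osd.*.log

0x10000 means the old 64K hdd format, 0x1000 means 4K.)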

-- Dan

>
> [1] https://github.com/ceph/ceph/pull/34588
>
> k
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: pacific doesn't defer small writes for pre-pacific hdd osds

2022-07-07 Thread Dan van der Ster
Hi again,

I'm not sure the html mail made it to the lists -- resending in plain text.
I've also opened https://tracker.ceph.com/issues/56488

Cheers, Dan


On Wed, Jul 6, 2022 at 11:43 PM Dan van der Ster  wrote:
>
> Hi Igor and others,
>
> (apologies for html, but i want to share a plot ;) )
>
> We're upgrading clusters to v16.2.9 from v15.2.16, and our simple "rados 
> bench -p test 10 write -b 4096 -t 1" latency probe showed something is very 
> wrong with deferred writes in pacific.
> Here is an example cluster, upgraded today:
>
>
>
> The OSDs are 12TB HDDs, formatted in nautilus with the default 
> bluestore_min_alloc_size_hdd = 64kB, and each have a large flash block.db.
>
> I found that the performance issue is because 4kB writes are no longer 
> deferred from those pre-pacific hdds to flash in pacific with the default 
> config !!!
> Here are example bench writes from both releases: 
> https://pastebin.com/raw/m0yL1H9Z
>
> I worked out that the issue is fixed if I set 
> bluestore_prefer_deferred_size_hdd = 128k (up from the 64k pacific default. 
> Note the default was 32k in octopus).
>
> I think this is related to the fixes in https://tracker.ceph.com/issues/52089 
> which landed in 16.2.6 -- _do_alloc_write is comparing the prealloc size 
> 0x10000 with bluestore_prefer_deferred_size_hdd (0x10000) and the "strictly 
> less than" condition prevents deferred writes from ever happening.
>
> So I think this would impact anyone upgrading clusters with hdd/ssd mixed 
> osds ... surely we must not be the only clusters impacted by this?!
>
> Should we increase the default bluestore_prefer_deferred_size_hdd up to 128kB 
> or is there in fact a bug here?
>
> Best Regards,
>
> Dan
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Inconsistent PGs after upgrade to Pacific

2022-06-24 Thread Dan van der Ster
Hi,

From what I can tell, the ceph osd pool mksnap command is indeed the same as
rados mksnap.

But bizarrely I just created a new snapshot, changed max_mds, then
removed the snap -- this time I can't manage to "fix" the
inconsistency.
It may be that my first test was so simple (no client IO, no fs
snapshots) that removing the snap fixed it.

In this case, the inconsistent object appears to be an old version of
mds0_openfiles.0

# rados list-inconsistent-obj 3.6 | jq .
{
  "epoch": 7754,
  "inconsistents": [
{
  "object": {
"name": "mds0_openfiles.0",
"nspace": "",
"locator": "",
"snap": 3,
"version": 2467
  },

I tried modifying the current version of that with setomapval, but the
object stays inconsistent.
I even removed it from the pool (head version) and somehow that old
snapshotted version remains with the wrong checksum even though the
snap no longer exists.

# rados rm -p cephfs.cephfs.meta mds0_openfiles.0
#

# ceph pg ls inconsistent
PG   OBJECTS  DEGRADED  MISPLACED  UNFOUND  BYTES OMAP_BYTES*
OMAP_KEYS*  LOG  STATE  SINCE  VERSIONREPORTED
   UP ACTING SCRUB_STAMP
DEEP_SCRUB_STAMP
3.6   13 0  00  209715200
 0   41  active+clean+inconsistent 2m  7852'2479  7852:12048
[0,3,2]p0  [0,3,2]p0  2022-06-24T11:31:05.605434+0200
2022-06-24T11:31:05.605434+0200

# rados lssnap -p cephfs.cephfs.meta
0 snaps

This is getting super weird (I can list the object but not stat it):

# rados ls -p cephfs.cephfs.meta | grep open
mds1_openfiles.0
mds3_openfiles.0
mds0_openfiles.0
mds2_openfiles.0

# rados stat -p cephfs.cephfs.meta mds0_openfiles.0
 error stat-ing cephfs.cephfs.meta/mds0_openfiles.0: (2) No such file
or directory

I then failed over the mds to a standby so mds0_openfiles.0 exists
again, but the PG remains inconsistent with that old version of the
object.

I will add this to the tracker.

Clearly the objects are not all trimmed correctly when the pool
snapshot is removed.

-- dan



On Fri, Jun 24, 2022 at 11:10 AM Pascal Ehlert  wrote:
>
> Hi Dan,
>
> Just a quick addition here:
>
> I have not used the rados command to create the snapshot but "ceph osd
> pool mksnap $POOL $SNAPNAME" - which I think is the same internally?
>
> And yes, our CephFS has numerous snapshots itself for backup purposes.
>
>
> Cheers,
> Pascal
>
>
>
> Dan van der Ster wrote on 24.06.22 11:06:
> > Hi Pascal,
> >
> > I'm not sure why you don't see that snap, and I'm also not sure if you
> > can just delete the objects directly.
> > BTW, does your CephFS have snapshots itself (e.g. create via mkdir
> > .snap/foobar)?
> >
> > Cheers, Dan
> >
> > On Fri, Jun 24, 2022 at 10:34 AM Pascal Ehlert  wrote:
> >> Hi Dan,
> >>
> >> Thank you so much for going through the effort of reproducing this!
> >> I was just about to plan how to bring up a test cluster but it would've
> >> taken me much longer.
> >>
> >> While I totally assume this is the root cause for our issues, there is
> >> one small difference.
> >> rados lssnap does not list any snapshots for me:
> >>
> >> root@srv01:~# rados lssnap -p kubernetes_cephfs_metadata
> >> 0 snaps
> >>
> >> I do definitely recall having made a snapshot and apparently there are
> >> snapshot objects present in the pool.
> >> Not sure how the reference seemingly got lost.
> >>
> >> Do you have any ideas how I could anyway remove the broken snapshot 
> >> objects?
> >>
> >>
> >> Cheers,
> >>
> >> Pascal
> >>
> >>
> >> Dan van der Ster wrote on 24.06.22 09:27:
> >>> Hi,
> >>>
> >>> It's trivial to reproduce. Running 16.2.9 with max_mds=2, take a pool
> >>> snapshot of the meta pool, then decrease to max_mds=1, then deep scrub
> >>> each meta pg.
> >>>
> >>> In my test I could list and remove the pool snap, then deep-scrub
> >>> again cleared the inconsistencies.
> >>>
> >>> https://tracker.ceph.com/issues/56386
> >>>
> >>> Cheers, Dan
> >>>
> >>> On Fri, Jun 24, 2022 at 8:41 AM Ansgar Jazdzewski
> >>>  wrote:
> >>>> Hi,
> >>>>
> >>>> I would say yes but it would be nice if other people can confirm it too.
> >>>>
> >>>> also can you create a test cluster and do the same tasks
> >>>> * create it with octopus
> >>>> 

[ceph-users] Re: Inconsistent PGs after upgrade to Pacific

2022-06-24 Thread Dan van der Ster
Hi Pascal,

I'm not sure why you don't see that snap, and I'm also not sure if you
can just delete the objects directly.
BTW, does your CephFS have snapshots itself (e.g. create via mkdir
.snap/foobar)?

Cheers, Dan

On Fri, Jun 24, 2022 at 10:34 AM Pascal Ehlert  wrote:
>
> Hi Dan,
>
> Thank you so much for going through the effort of reproducing this!
> I was just about to plan how to bring up a test cluster but it would've
> taken me much longer.
>
> While I totally assume this is the root cause for our issues, there is
> one small difference.
> rados lssnap does not list any snapshots for me:
>
> root@srv01:~# rados lssnap -p kubernetes_cephfs_metadata
> 0 snaps
>
> I do definitely recall having made a snapshot and apparently there are
> snapshot objects present in the pool.
> Not sure how the reference seemingly got lost.
>
> Do you have any ideas how I could anyway remove the broken snapshot objects?
>
>
> Cheers,
>
> Pascal
>
>
> Dan van der Ster wrote on 24.06.22 09:27:
> > Hi,
> >
> > It's trivial to reproduce. Running 16.2.9 with max_mds=2, take a pool
> > snapshot of the meta pool, then decrease to max_mds=1, then deep scrub
> > each meta pg.
> >
> > In my test I could list and remove the pool snap, then deep-scrub
> > again cleared the inconsistencies.
> >
> > https://tracker.ceph.com/issues/56386
> >
> > Cheers, Dan
> >
> > On Fri, Jun 24, 2022 at 8:41 AM Ansgar Jazdzewski
> >  wrote:
> >> Hi,
> >>
> >> I would say yes but it would be nice if other people can confirm it too.
> >>
> >> also can you create a test cluster and do the same tasks
> >> * create it with octopus
> >> * create snapshot
> >> * reduce rank to 1
> >> * upgrade to pacific
> >>
> >> and then try to fix the PG, assuming that you will have the same
> >> issues in your test-cluster,
> >>
> >> cheers,
> >> Ansgar
> >>
> >> Am Do., 23. Juni 2022 um 22:12 Uhr schrieb Pascal Ehlert 
> >> :
> >>> Hi,
> >>>
> >>> I have now tried to "ceph osd pool rmsnap $POOL beforefixes" and it says 
> >>> the snapshot could not be found although I have definitely run "ceph osd 
> >>> pool mksnap $POOL beforefixes" about three weeks ago.
> >>> When running rados list-inconsistent-obj $PG on one of the affected PGs, 
> >>> all of the objects returned have "snap" set to 1:
> >>>
> >>> root@srv01:~# for i in $(rados list-inconsistent-pg $POOL | jq -er .[]); 
> >>> do rados list-inconsistent-obj $i | jq -er .inconsistents[].object; done
> >>> [..]
> >>> {
> >>>"name": "200020744f4.",
> >>>"nspace": "",
> >>>"locator": "",
> >>>"snap": 1,
> >>>"version": 5704208
> >>> }
> >>> {
> >>>"name": "200021aeb16.",
> >>>"nspace": "",
> >>>"locator": "",
> >>>"snap": 1,
> >>>"version": 6189078
> >>> }
> >>> [..]
> >>>
> >>> Running listsnaps on any of them then looks like this:
> >>>
> >>> root@srv01:~# rados listsnaps 200020744f4. -p $POOL
> >>> 200020744f4.:
> >>> cloneidsnapssizeoverlap
> >>> 110[]
> >>> head-0
> >>>
> >>>
> >>> Is it safe to assume that these objects belong to a somewhat broken 
> >>> snapshot and can be removed safely without causing further damage?
> >>>
> >>>
> >>> Thanks,
> >>>
> >>> Pascal
> >>>
> >>>
> >>>
> >>> Ansgar Jazdzewski wrote on 23.06.22 20:36:
> >>>
> >>> Hi,
> >>>
> >>> we could identify the rbd images that wehre affected and did an export 
> >>> before, but in the case of cephfs metadata i have no plan that will work.
> >>>
> >>> can you try to delete the snapshot?
> >>> also if the filesystem can be shutdown? try to do a backup of the 
> >>> metadatapool
> >>>
> >>> hope you will have some luck, let me know if I can help,
> >>> Ansgar
> >>>
> >>> Pascal Ehlert  schrieb am Do., 23

[ceph-users] Re: Inconsistent PGs after upgrade to Pacific

2022-06-24 Thread Dan van der Ster
Hi,

It's trivial to reproduce. Running 16.2.9 with max_mds=2, take a pool
snapshot of the meta pool, then decrease to max_mds=1, then deep scrub
each meta pg.

In my test I could list and remove the pool snap, then deep-scrub
again cleared the inconsistencies.

https://tracker.ceph.com/issues/56386
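
Roughly, the reproducer looks like this -- fs and pool names are from my test
cluster, adjust to taste:

# ceph fs set cephfs max_mds 2
# ceph osd pool mksnap cephfs.cephfs.meta testsnap
# ceph fs set cephfs max_mds 1
# ceph pg deep-scrub 3.6                    # repeat for each metadata PG
# rados list-inconsistent-obj 3.6 | jq .    # shows the inconsistent snap objects
# ceph osd pool rmsnap cephfs.cephfs.meta testsnap
# ceph pg deep-scrub 3.6                    # inconsistencies clear again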

Cheers, Dan

On Fri, Jun 24, 2022 at 8:41 AM Ansgar Jazdzewski
 wrote:
>
> Hi,
>
> I would say yes but it would be nice if other people can confirm it too.
>
> also can you create a test cluster and do the same tasks
> * create it with octopus
> * create snapshot
> * reduce rank to 1
> * upgrade to pacific
>
> and then try to fix the PG, assuming that you will have the same
> issues in your test-cluster,
>
> cheers,
> Ansgar
>
> Am Do., 23. Juni 2022 um 22:12 Uhr schrieb Pascal Ehlert 
> :
> >
> > Hi,
> >
> > I have now tried to "ceph osd pool rmsnap $POOL beforefixes" and it says 
> > the snapshot could not be found although I have definitely run "ceph osd 
> > pool mksnap $POOL beforefixes" about three weeks ago.
> > When running rados list-inconsistent-obj $PG on one of the affected PGs, 
> > all of the objects returned have "snap" set to 1:
> >
> > root@srv01:~# for i in $(rados list-inconsistent-pg $POOL | jq -er .[]); do 
> > rados list-inconsistent-obj $i | jq -er .inconsistents[].object; done
> > [..]
> > {
> >   "name": "200020744f4.",
> >   "nspace": "",
> >   "locator": "",
> >   "snap": 1,
> >   "version": 5704208
> > }
> > {
> >   "name": "200021aeb16.",
> >   "nspace": "",
> >   "locator": "",
> >   "snap": 1,
> >   "version": 6189078
> > }
> > [..]
> >
> > Running listsnaps on any of them then looks like this:
> >
> > root@srv01:~# rados listsnaps 200020744f4. -p $POOL
> > 200020744f4.:
> > cloneidsnapssizeoverlap
> > 110[]
> > head-0
> >
> >
> > Is it safe to assume that these objects belong to a somewhat broken 
> > snapshot and can be removed safely without causing further damage?
> >
> >
> > Thanks,
> >
> > Pascal
> >
> >
> >
> > Ansgar Jazdzewski wrote on 23.06.22 20:36:
> >
> > Hi,
> >
> > we could identify the rbd images that wehre affected and did an export 
> > before, but in the case of cephfs metadata i have no plan that will work.
> >
> > can you try to delete the snapshot?
> > also if the filesystem can be shutdown? try to do a backup of the 
> > metadatapool
> >
> > hope you will have some luck, let me know if I can help,
> > Ansgar
> >
> > Pascal Ehlert  schrieb am Do., 23. Juni 2022, 16:45:
> >>
> >> Hi Ansgar,
> >>
> >> Thank you very much for the response.
> >> Running your first command to obtain inconsistent objects, I retrieve a
> >> total of 23114 only some of which are snaps.
> >>
> >> You mentioning snapshots did remind me of the fact however that I
> >> created a snapshot on the Ceph metadata pool via "ceph osd pool $POOL
> >> mksnap" before I reduced the number of ranks.
> >> Maybe that has causes the inconsistencies and would explain why the
> >> actual file system appears unaffected?
> >>
> >> Is there any way to validate that theory? I am a bit hesitant to just
> >> run "rmsnap". Could that cause inconsistent data to be written back to
> >> the actual objects?
> >>
> >>
> >> Best regards,
> >>
> >> Pascal
> >>
> >>
> >>
> >> Ansgar Jazdzewski wrote on 23.06.22 16:11:
> >> > Hi Pascal,
> >> >
> >> > We just had a similar situation on our RBD and had found some bad data
> >> > in RADOS here is How we did it:
> >> >
> >> > for i in $(rados list-inconsistent-pg $POOL | jq -er .[]); do rados
> >> > list-inconsistent-obj $i | jq -er .inconsistents[].object.name| awk
> >> > -F'.' '{print $2}'; done
> >> >
> >> > we than found inconsistent snaps on the Object:
> >> >
> >> > rados list-inconsistent-snapset $PG --format=json-pretty | jq
> >> > .inconsistents[].name
> >> >
> >> > List the data on the OSD's (ceph pg map $PG)
> >> >
> >> > ceph-objectstore

[ceph-users] Re: Inconsistent PGs after upgrade to Pacific

2022-06-23 Thread Dan van der Ster
Hi Pascal,

It's not clear to me how the upgrade procedure you described would
lead to inconsistent PGs.

Even if you didn't record every step, do you have the ceph.log, the
mds logs, perhaps some osd logs from this time?
And which versions did you upgrade from / to ?

Cheers, Dan

On Wed, Jun 22, 2022 at 7:41 PM Pascal Ehlert  wrote:
>
> Hi all,
>
> I am currently battling inconsistent PGs after a far-reaching mistake
> during the upgrade from Octopus to Pacific.
> While otherwise following the guide, I restarted the Ceph MDS daemons
> (and this started the Pacific daemons) without previously reducing the
> ranks to 1 (from 2).
>
> This resulted in daemons not coming up and reporting inconsistencies.
> After later reducing the ranks and bringing the MDS back up (I did not
> record every step as this was an emergency situation), we started seeing
> health errors on every scrub.
>
> Now after three weeks, while our CephFS is still working fine and we
> haven't noticed any data damage, we realized that every single PG of the
> cephfs metadata pool is affected.
> Below you can find some information on the actual status and a detailed
> inspection of one of the affected pgs. I am happy to provide any other
> information that could be useful of course.
>
> A repair of the affected PGs does not resolve the issue.
> Does anyone else here have an idea what we could try apart from copying
> all the data to a new CephFS pool?
>
>
>
> Thank you!
>
> Pascal
>
>
>
>
> root@srv02:~# ceph status
>cluster:
>  id: f0d6d4d0-8c17-471a-9f95-ebc80f1fee78
>  health: HEALTH_ERR
>  insufficient standby MDS daemons available
>  69262 scrub errors
>  Too many repaired reads on 2 OSDs
>  Possible data damage: 64 pgs inconsistent
>
>services:
>  mon: 3 daemons, quorum srv02,srv03,srv01 (age 3w)
>  mgr: srv03(active, since 3w), standbys: srv01, srv02
>  mds: 2/2 daemons up, 1 hot standby
>  osd: 44 osds: 44 up (since 3w), 44 in (since 10M)
>
>data:
>  volumes: 2/2 healthy
>  pools:   13 pools, 1217 pgs
>  objects: 75.72M objects, 26 TiB
>  usage:   80 TiB used, 42 TiB / 122 TiB avail
>  pgs: 1153 active+clean
>   55   active+clean+inconsistent
>   9active+clean+inconsistent+failed_repair
>
>io:
>  client:   2.0 MiB/s rd, 21 MiB/s wr, 240 op/s rd, 1.75k op/s wr
>
>
> {
>"epoch": 4962617,
>"inconsistents": [
>  {
>"object": {
>  "name": "100cc8e.",
>  "nspace": "",
>  "locator": "",
>  "snap": 1,
>  "version": 4253817
>},
>"errors": [],
>"union_shard_errors": [
>  "omap_digest_mismatch_info"
>],
>"selected_object_info": {
>  "oid": {
>"oid": "100cc8e.",
>"key": "",
>"snapid": 1,
>"hash": 1369745244,
>"max": 0,
>"pool": 7,
>"namespace": ""
>  },
>  "version": "4962847'6209730",
>  "prior_version": "3916665'4306116",
>  "last_reqid": "osd.27.0:757107407",
>  "user_version": 4253817,
>  "size": 0,
>  "mtime": "2022-02-26T12:56:55.612420+0100",
>  "local_mtime": "2022-02-26T12:56:55.614429+0100",
>  "lost": 0,
>  "flags": [
>"dirty",
>"omap",
>"data_digest",
>"omap_digest"
>  ],
>  "truncate_seq": 0,
>  "truncate_size": 0,
>  "data_digest": "0x",
>  "omap_digest": "0xe5211a9e",
>  "expected_object_size": 0,
>  "expected_write_size": 0,
>  "alloc_hint_flags": 0,
>  "manifest": {
>"type": 0
>  },
>  "watchers": {}
>},
>"shards": [
>  {
>"osd": 20,
>"primary": false,
>"errors": [
>  "omap_digest_mismatch_info"
>],
>"size": 0,
>"omap_digest": "0x",
>"data_digest": "0x"
>  },
>  {
>"osd": 27,
>"primary": true,
>"errors": [
>  "omap_digest_mismatch_info"
>],
>"size": 0,
>"omap_digest": "0x",
>"data_digest": "0x"
>  },
>  {
>"osd": 43,
>"primary": false,
>"errors": [
>  "omap_digest_mismatch_info"
>],
>"size": 0,
>"omap_digest": "0x",
>"data_digest": "0x"
>  }
>]
>  },
>
>
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an 

[ceph-users] Re: Tuning for cephfs backup client?

2022-06-23 Thread Dan van der Ster
Hi,

If the single backup client is iterating through the entire fs, its
local dentry cache will probably be thrashing, rendering it quite
useless.
And that dentry cache will be constantly hitting the mds caps per
client limit, so the mds will be busy asking it to release caps (to
invalidate cached dentries).

Since you know this up front, you might want to just have a cronjob on
the backup client like

*/2 * * * * echo 2 > /proc/sys/vm/drop_caches

This will keep things simple for the mds.
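
You can watch the effect from the MDS side -- each session's cap count shows up
in the session listing, e.g. (field names may vary slightly between releases):

# ceph tell mds.0 session ls | jq '.[] | {id, num_caps}'

The backup client should drop back down shortly after each drop_caches.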

(Long ago when caps recall was slightly buggy, we used to run this [1]
cron on *all* kernel cephfs clients.)

Cheers, Dan

[1]
#!/bin/bash

# random sleep to avoid thundering herd
sleep $[ ( $RANDOM % 30 ) + 1 ]s

# sum the caps held across all cephfs mounts (from debugfs)
if ls /sys/kernel/debug/ceph/*/caps 1> /dev/null 2>&1; then
  CAPS=`cat /sys/kernel/debug/ceph/*/caps | grep total | awk '{sum += $2} END {print sum}'`
else
  CAPS=0
fi

# if caps are held, drop the kernel's dentry/inode caches and log it
if [ "${CAPS}" -gt 1 ]; then
  logger -t ceph-drop-caps "Dropping ${CAPS} caps..."
  echo 2 > /proc/sys/vm/drop_caches
  logger -t ceph-drop-caps "Done"
fi



On Thu, Jun 23, 2022 at 10:41 AM Burkhard Linke
 wrote:
>
> Hi,
>
>
> we are using cephfs with currently about 200 million files and a single
> hosts running nightly backups. This setup works fine, except the cephfs
> caps management. Since the single host has to examine a lot of files, it
> will soon run into the mds caps per client limit, and processing will
> slow down due to extra caps request/release round trips to the mds. This
> problem will probably affect all cephfs users who are running a similar
> setup.
>
>
> Are there any tuning knobs on client side we can use to optimize this
> kind of workload? We have already raised the mds caps limit and memory
> limit, but these are global settings for all clients. We only need to
> optimize the single backup client. I'm thinking about:
>
> - earlier release of unused caps
>
> - limiting caps on client in addition to mds
>
> - shorter metadata caching (should also result in earlier release)
>
> - anything else that will result in a better metadata throughput
>
>
> The amount of data backed up nightly is manageable (< 10TB / night), so
> the backup is currently only limited by metadata checks. Given the trend
> of growing data in all fields, backup solution might run into problems
> in the long run...
>
>
> Best regards,
>
> Burkhard Linke
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Drained OSDs are still ACTIVE_PRIMARY - casuing high IO latency on clients

2022-05-20 Thread Dan van der Ster
Hi,

Just a curiosity... It looks like you're comparing an EC pool in octopus to
a replicated pool in nautilus. Does primary affinity work for you in
octopus on a replicated pool? And does a nautilus EC pool work?

.. Dan



On Fri., May 20, 2022, 13:53 Denis Polom,  wrote:

> Hi
>
> I observed high latencies and mount points hanging since Octopus release
> and it's still observed on Pacific latest while draining OSD.
>
> Cluster setup:
>
> Ceph Pacific 16.2.7
>
> Cephfs with EC data pool
>
> EC profile setup:
>
> crush-device-class=
> crush-failure-domain=host
> crush-root=default
> jerasure-per-chunk-alignment=false
> k=10
> m=2
> plugin=jerasure
> technique=reed_sol_van
> w=8
>
> Description:
>
> If we have broken drive, we are removing it from Ceph cluster by
> draining it first. That means changing its crush weight to 0
>
> ceph osd crush reweight osd.1 0
>
> Normally on Nautilus it didn't affect clients. But after upgrade to
> Octopus (and since Octopus till current Pacific release) I can observe
> very high IO latencies on clients while OSD being drained (10sec and
> higher).
>
> By debugging I found out that drained OSD is still listed as
> ACTIVE_PRIMARY and that happens only on EC pools and only since Octopus.
> I tested it back on Nautilus, to be sure, where behavior is correct and
> drained OSD is not listed under UP and ACTIVE OSDs for PGs.
>
> Even if setting up primary-affinity for given OSD to 0 this doesn't have
> any effect on EC pool.
>
> Bellow are my debugs:
>
> Buggy behavior on Octopus and Pacific:
>
> Before draining osd.70:
>
> PG_STAT  OBJECTS  MISSING_ON_PRIMARY  DEGRADED  MISPLACED UNFOUND
> BYTES   OMAP_BYTES*  OMAP_KEYS*  LOG   DISK_LOG
> STATE  STATE_STAMP VERSION
> REPORTED   UP UP_PRIMARY  ACTING
> ACTING_PRIMARY LAST_SCRUB SCRUB_STAMP LAST_DEEP_SCRUB
> DEEP_SCRUB_STAMP SNAPTRIMQ_LEN
> 16.1fff 2269   0 0  0 0
> 89552977270   0  2449 2449
> active+clean 2022-05-19T08:41:55.241734+020019403690'275685
> 19407588:19607199[70,206,216,375,307,57]  70
> [70,206,216,375,307,57]  7019384365'275621
> 2022-05-19T08:41:55.241493+020019384365'275621
> 2022-05-19T08:41:55.241493+0200  0
> dumped pgs
>
>
> after setting osd.70 crush weight to 0 (osd.70 is still acting primary):
>
>   UP UP_PRIMARY ACTING
> ACTING_PRIMARY  LAST_SCRUB SCRUB_STAMP
> LAST_DEEP_SCRUB DEEP_SCRUB_STAMP SNAPTRIMQ_LEN
> 16.1fff 2269   0 0   2269 0
> 89552977270   0  2449  2449
> active+remapped+backfill_wait  2022-05-20T08:51:54.249071+0200
> 19403690'275685  19407668:19607289 [71,206,216,375,307,57]  71
> [70,206,216,375,307,57]  7019384365'275621
> 2022-05-19T08:41:55.241493+020019384365'275621
> 2022-05-19T08:41:55.241493+0200  0
> dumped pgs
>
>
> Correct behavior on Nautilus:
>
> Before draining osd.10:
>
> PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES
> OMAP_BYTES* OMAP_KEYS* LOG DISK_LOG STATE STATE_STAMP
> VERSION REPORTED UP UP_PRIMARY ACTING ACTING_PRIMARY
> LAST_SCRUB SCRUB_STAMP LAST_DEEP_SCRUB DEEP_SCRUB_STAMP
> SNAPTRIMQ_LEN
> 2.4e  2  00 0   0
> 8388608   0  0   22 active+clean 2022-05-20
> 02:13:47.43210461'275:40   [10,0,7] 10   [10,0,7]
> 100'0 2022-05-20 01:44:36.217286 0'0 2022-05-20
> 01:44:36.217286 0
>
> after setting osd.10 crush weight to 0 (behavior is correct, osd.10 is
> not listed, not used):
>
>
> root@nautilus1:~# ceph pg dump pgs | head -2
> PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES
> OMAP_BYTES* OMAP_KEYS* LOG DISK_LOG STATE
> STATE_STAMPVERSION REPORTED UP UP_PRIMARY
> ACTING ACTING_PRIMARY LAST_SCRUB SCRUB_STAMP
> LAST_DEEP_SCRUB DEEP_SCRUB_STAMP   SNAPTRIMQ_LEN
> 2.4e 14  00 0   0
> 58720256   0  0  18   18 active+clean 2022-05-20
> 02:18:59.414812   75'1880:43 [22,0,7] 22
> [22,0,7] 220'0 2022-05-20
> 01:44:36.217286 0'0 2022-05-20 01:44:36.217286 0
>
>
> Now question is if is it some implemented feature?
>
> Or is it a bug?
>
> Thank you!
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: No rebalance after ceph osd crush unlink

2022-05-18 Thread Dan van der Ster
Hi,

It's interesting that crushtool doesn't include the shadow tree -- I
am pretty sure that used to be included. I don't suggest editing the
crush map, compiling, then re-injecting -- I don't know what it will
do in this case.

What you could do instead is something like:
* ceph osd getcrushmap -o crush.map # backup the map
* ceph osd set norebalance # disable rebalancing while we experiment
* ceph osd crush reweight-all # see if this fixes the crush shadow tree

Then unset norebalance if the crush tree looks good. Or if the crush
tree isn't what you expect, revert to your backup with `ceph osd
setcrushmap -i crush.map`.
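
For example, roughly like this -- the grep pattern is just my guess at the
shadow bucket name, based on your "MultiSite" bucket and the rbd_data device
class, so adjust it to whatever --show-shadow actually prints:

ceph osd crush tree --show-shadow | grep -A12 'MultiSite~rbd_data'   # before
ceph osd crush reweight-all
ceph osd crush tree --show-shadow | grep -A12 'MultiSite~rbd_data'   # ceph-18 should be gone now
ceph osd unset norebalance   # only once the tree looks right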

-- dan



On Wed, May 18, 2022 at 12:47 PM Frank Schilder  wrote:
>
> Hi Dan,
>
> thanks for pointing me to this. Yes, it looks like a/the bug, the shadow tree 
> is not changed although it should be updated as well. This is not even shown 
> in the crush map I exported with getcrushmap. The option --show-shadow did 
> the trick.
>
> Will `ceph osd crush reweight-all` actually remove these shadow leaves or just 
> set the weight to 0? I need to link this host later again and I would like a 
> solution as clean as possible. What would, for example, happen if I edit the 
> crush map and execute setcrushmap? Will it recompile the correct crush map 
> from the textual definition, or will these hanging leafs persist?
>
> Thanks!
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ____
> From: Dan van der Ster 
> Sent: 18 May 2022 12:04:07
> To: Frank Schilder
> Cc: ceph-users
> Subject: Re: [ceph-users] No rebalance after ceph osd crush unlink
>
> Hi Frank,
>
> Did you check the shadow tree (the one with tilde's in the name, seen
> with `ceph osd crush tree --show-shadow`)? Maybe the host was removed
> in the outer tree, but not the one used for device-type selection.
> There were bugs in this area before, e.g. 
> https://tracker.ceph.com/issues/48065
> In those cases, the way to make the crush tree consistent again was
> `ceph osd crush reweight-all`.
>
> Cheers, Dan
>
>
>
> On Wed, May 18, 2022 at 11:51 AM Frank Schilder  wrote:
> >
> > Dear all,
> >
> > I have a strange problem. I have some hosts linked under an additional 
> > logical data center and needed to unlink two of the hosts. After unlinking 
> > the first host with
> >
> > ceph osd crush unlink ceph-18 MultiSite
> >
> > the crush map for this data center is updated correctly:
> >
> > datacenter MultiSite {
> > id -148 # do not change unnecessarily
> > id -149 class hdd   # do not change unnecessarily
> > id -150 class ssd   # do not change unnecessarily
> > id -236 class rbd_meta  # do not change unnecessarily
> > id -200 class rbd_data  # do not change unnecessarily
> > id -320 class rbd_perf  # do not change unnecessarily
> > # weight 643.321
> > alg straw2
> > hash 0  # rjenkins1
> > item ceph-04 weight 79.691
> > item ceph-05 weight 81.474
> > item ceph-06 weight 79.691
> > item ceph-07 weight 79.691
> > item ceph-19 weight 81.695
> > item ceph-20 weight 81.695
> > item ceph-21 weight 79.691
> > item ceph-22 weight 79.691
> > }
> >
> > The host is gone. However, nothing happened. The pools with the crush rule
> >
> > rule ms-ssd {
> > id 12
> > type replicated
> > min_size 1
> > max_size 10
> > step take MultiSite class rbd_data
> > step chooseleaf firstn 0 type host
> > step emit
> > }
> >
> > should now move data away from OSDs on this host, but nothing is happening. 
> > A pool with crush rule ms-ssd is:
> >
> > # ceph osd pool get sr-rbd-meta-one all
> > size: 3
> > min_size: 2
> > pg_num: 128
> > pgp_num: 128
> > crush_rule: ms-ssd
> > hashpspool: true
> > nodelete: true
> > nopgchange: false
> > nosizechange: false
> > write_fadvise_dontneed: false
> > noscrub: false
> > nodeep-scrub: false
> > use_gmt_hitset: 1
> > auid: 0
> > fast_read: 0
> >
> However, it's happily keeping data on the OSDs of host ceph-18. For example, 
> > one of the OSDs on this host has ID 1076. There are 4 PGs using this OSD:
> >
> > # ceph pg ls-by-pool sr-rbd-meta-one | grep 1076
> > 1.33 2500 0   0 7561564817834125 
> > 3073 active+clean 2022-05-18 10:54:41.84

[ceph-users] Re: No rebalance after ceph osd crush unlink

2022-05-18 Thread Dan van der Ster
Hi Frank,

Did you check the shadow tree (the one with tilde's in the name, seen
with `ceph osd crush tree --show-shadow`)? Maybe the host was removed
in the outer tree, but not the one used for device-type selection.
There were bugs in this area before, e.g. https://tracker.ceph.com/issues/48065
In those cases, the way to make the crush tree consistent again was
`ceph osd crush reweight-all`.

Cheers, Dan



On Wed, May 18, 2022 at 11:51 AM Frank Schilder  wrote:
>
> Dear all,
>
> I have a strange problem. I have some hosts linked under an additional 
> logical data center and needed to unlink two of the hosts. After unlinking 
> the first host with
>
> ceph osd crush unlink ceph-18 MultiSite
>
> the crush map for this data center is updated correctly:
>
> datacenter MultiSite {
> id -148 # do not change unnecessarily
> id -149 class hdd   # do not change unnecessarily
> id -150 class ssd   # do not change unnecessarily
> id -236 class rbd_meta  # do not change unnecessarily
> id -200 class rbd_data  # do not change unnecessarily
> id -320 class rbd_perf  # do not change unnecessarily
> # weight 643.321
> alg straw2
> hash 0  # rjenkins1
> item ceph-04 weight 79.691
> item ceph-05 weight 81.474
> item ceph-06 weight 79.691
> item ceph-07 weight 79.691
> item ceph-19 weight 81.695
> item ceph-20 weight 81.695
> item ceph-21 weight 79.691
> item ceph-22 weight 79.691
> }
>
> The host is gone. However, nothing happened. The pools with the crush rule
>
> rule ms-ssd {
> id 12
> type replicated
> min_size 1
> max_size 10
> step take MultiSite class rbd_data
> step chooseleaf firstn 0 type host
> step emit
> }
>
> should now move data away from OSDs on this host, but nothing is happening. A 
> pool with crush rule ms-ssd is:
>
> # ceph osd pool get sr-rbd-meta-one all
> size: 3
> min_size: 2
> pg_num: 128
> pgp_num: 128
> crush_rule: ms-ssd
> hashpspool: true
> nodelete: true
> nopgchange: false
> nosizechange: false
> write_fadvise_dontneed: false
> noscrub: false
> nodeep-scrub: false
> use_gmt_hitset: 1
> auid: 0
> fast_read: 0
>
> However, it's happily keeping data on the OSDs of host ceph-18. For example, 
> one of the OSDs on this host has ID 1076. There are 4 PGs using this OSD:
>
> # ceph pg ls-by-pool sr-rbd-meta-one | grep 1076
> 1.33 2500 0   0 7561564817834125 3073 
> active+clean 2022-05-18 10:54:41.840097 757122'10112944  757122:84604327
> [574,286,1076]p574[574,286,1076]p574 2022-05-18 04:24:32.900261 
> 2022-05-11 19:56:32.781889
> 1.3d 2590 0   0 7962393603380 64 3006 
> active+clean 2022-05-18 10:54:41.749090 757122'24166942  757122:57010202 
> [1074,1076,1052]p1074 [1074,1076,1052]p1074 2022-05-18 06:16:35.605026 
> 2022-05-16 19:37:56.829763
> 1.4d 2490 0   0 7136789485690105 3070 
> active+clean 2022-05-18 10:54:41.738918  757119'5861104  757122:45718157  
> [1072,262,1076]p1072  [1072,262,1076]p1072 2022-05-18 06:50:04.731194 
> 2022-05-18 06:50:04.731194
> 1.70 2720 0   0 8143173984591 76 3007 
> active+clean 2022-05-18 10:54:41.743604 757122'11849453  757122:72537747
> [268,279,1076]p268[268,279,1076]p268 2022-05-17 15:43:46.512941 
> 2022-05-17 15:43:46.512941
>
> I don't understand why these are not remapped and rebalancing. Any ideas?
>
> Version is mimic latest.
>
> Thanks and best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: v16.2.8 Pacific released

2022-05-17 Thread Dan van der Ster
On Tue, May 17, 2022 at 1:14 PM Cory Snyder  wrote:
>
> Hi all,
>
> Unfortunately, we experienced some issues with the upgrade to 16.2.8
> on one of our larger clusters. Within a few hours of the upgrade, all
> 5 of our managers had become unavailable. We found that they were all
> deadlocked due to (what appears to be) a regression with GIL and mutex
> handling. See https://tracker.ceph.com/issues/39264 and
> https://github.com/ceph/ceph/pull/38677 for context on previous
> manifestations of the issue.
>
> I discovered some mistakes within a recent Pacific backport that seem
> to be responsible. Here is the tracker for the regression:
> https://tracker.ceph.com/issues/55687. Here is an open PR that should
> resolve the problem: https://github.com/ceph/ceph/pull/38677.

I guess you mean https://github.com/ceph/ceph/pull/46302 ?

Thanks

.. dan

>
> Note that this is a sort of race condition, and the issue tends to
> manifest itself more frequently in larger clusters. Enabling certain
> modules may also make it more likely to occur. On our cluster, MGRs
> are consistently deadlocking within about an hour.
>
> Hopefully this is useful to others who are considering an upgrade!
>
> Thanks,
>
> Cory Snyder
>
>
>
>
>
>
> On Mon, May 16, 2022 at 3:46 PM David Galloway  wrote:
> >
> > We're happy to announce the 8th backport release in the Pacific series.
> > We recommend users to update to this release. For a detailed release
> > notes with links & changelog please refer to the official blog entry at
> > https://ceph.io/en/news/blog/2022/v16-2-8-pacific-released
> >
> > Notable Changes
> > ---
> >
> > * MON/MGR: Pools can now be created with `--bulk` flag. Any pools
> > created with `bulk` will use a profile of the `pg_autoscaler` that
> > provides more performance from the start. However, any pools created
> > without the `--bulk` flag will remain using its old behavior by
> > default. For more details, see:
> > https://docs.ceph.com/en/latest/rados/operations/placement-groups/
> >
> > * MGR: The pg_autoscaler can now be turned `on` and `off` globally with
> > the `noautoscale` flag. By default this flag is unset and the default
> > pg_autoscale mode remains the same. For more details, see:
> > https://docs.ceph.com/en/latest/rados/operations/placement-groups/
> >
> > * A health warning will now be reported if the ``require-osd-release``
> > flag is not set to the appropriate release after a cluster upgrade.
> >
> > * CephFS: Upgrading Ceph Metadata Servers when using multiple active
> > MDSs requires ensuring no pending stray entries which are directories
> > are present for active ranks except rank 0. See
> > https://docs.ceph.com/en/latest/releases/pacific/#upgrading-from-octopus-or-nautilus.
> >
> > Getting Ceph
> > 
> > * Git at git://github.com/ceph/ceph.git
> > * Tarball at https://download.ceph.com/tarballs/ceph-16.2.8.tar.gz
> > * Containers at https://quay.io/repository/ceph/ceph
> > * For packages, see https://docs.ceph.com/docs/master/install/get-packages/
> > * Release git sha1: 209e51b856505df4f2f16e54c0d7a9e070973185
> >
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Reasonable MDS rejoin time?

2022-05-17 Thread Dan van der Ster
Hi Felix,

"rejoin" took awhile in the past because the MDS needs to reload all
inodes for all the open directories at boot time.
In our experience this can take ~10 minutes on the most active clusters.
In your case, I wonder if the MDS was going OOM in a loop while
recovering? This was happening to us before -- there are recipes on
this ML describing how to remove the "openfiles" objects to get out of that
situation.

Anyway, Octopus now has a feature to skip preloading the direntries at
rejoin time: https://github.com/ceph/ceph/pull/44667
This will become the default soon, but you can switch off the preload
already now. In our experience, rejoin is now taking under a minute on
even the busiest clusters.
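
For reference, this is roughly what both of those look like. The option name,
pool name and object names here are from memory, so double check them on your
version, and only ever touch the openfiles objects while that MDS rank is
stopped:

# once on Octopus or later, disable the open file table prefetch:
ceph config set mds mds_oft_prefetch_dirfrags false

# the "remove openfiles" recipe boils down to deleting the per-rank objects
# from the metadata pool, e.g. for rank 0:
rados -p cephfs_metadata ls | grep openfiles
rados -p cephfs_metadata rm mds0_openfiles.0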

Cheers, Dan

On Tue, May 17, 2022 at 11:45 AM Felix Lee  wrote:
>
> Yes, we do plan to upgrade Ceph in the near future for sure.
> In any case, I used a brutal way (kinda) to kick rejoin to active by
> setting "mds_wipe_sessions = true" on all MDSs.
> Still, the entire MDS recovery process leaves us unable to estimate the
> service downtime. So, I am wondering if there is any way for us to
> estimate the rejoin time, so that we can decide whether to wait or take
>
>
>
> Best regards,
> Felix Lee ~
>
> On 5/17/22 16:15, Jos Collin wrote:
> > I suggest you to upgrade the cluster to the latest release [1], as
> > nautilus reached EOL.
> >
> > [1] https://docs.ceph.com/en/latest/releases/
> >
> > On 16/05/22 13:29, Felix Lee wrote:
> >> Hi, Jos,
> >> Many thanks for your reply.
> >> And sorry, I missed to mention the version, which is 14.2.22.
> >>
> >> Here is the log:
> >> https://drive.google.com/drive/folders/1qzPf64qw16VJDKSzcDoixZ690KL8XSoc?usp=sharing
> >>
> >>
> >> Here, ceph01 (active) and ceph11 (standby-replay) were the ones that
> >> suffered the crash. The log didn't tell us much, but several slow requests
> >> were occurring. And ceph11 had a "cache is too large" warning by
> >> the time it crashed; I suppose that can happen during recovery.
> >> (each MDS has 64GB memory, BTW )
> >> The ceph16 is current rejoin one, I've turned debug_mds to 20 for a
> >> while as ceph-mds.ceph16.log-20220516.gz
> >>
> >>
> >> Thanks
> >> &
> >> Best regards,
> >> Felix Lee ~
> >>
> >>
> >>
> >> On 5/16/22 14:45, Jos Collin wrote:
> >>> It's hard to suggest without the logs. Do verbose logging
> >>> debug_mds=20. What's the ceph version? Do you have the logs why the
> >>> MDS crashed?
> >>>
> >>> On 16/05/22 11:20, Felix Lee wrote:
>  Dear all,
>  We currently have 7 multi-active MDS, with another 7 standby-replay.
>  We thought this should cover most disasters, and it actually did.
>  But things happened anyway; here is the story:
>  One of the MDSs crashed and its standby-replay took over, but got stuck in
>  the resolve state.
>  Then the other two MDSs (rank 0 and 5) received tons of slow
>  requests, and my colleague restarted them, thinking the
>  standby-replay would take over immediately (this seems to have been a wrong,
>  or at least unnecessary, action, I guess...). That left three
>  of them in the resolve state...
>  In the meanwhile, I realized that the first failed rank (rank 2) had
>  abnormal memory usage and kept crashing; after a couple of
>  restarts, the memory usage was back to normal, and then those
>  three MDSs entered the rejoin state.
>  Now, this rejoin state has been there for three days and keeps going as
>  we speak. No significant error message shows up even
>  with "debug_mds 10", so we have no idea when it's going to end or whether
>  it's really on track.
>  So, I am wondering how we can check MDS rejoin progress/status to
>  make sure it's running normally. Or, how do we estimate the
>  rejoin time and maybe improve it? We always need to give
>  users an estimate of the recovery time.
> 
> 
>  Thanks
>  &
>  Best regards,
>  Felix Lee ~
> 
> >>>
> >>> ___
> >>> ceph-users mailing list -- ceph-users@ceph.io
> >>> To unsubscribe send an email to ceph-users-le...@ceph.io
> >>>
> >>
> >
> >
>
> --
> Felix H.T Lee   Academia Sinica Grid & Cloud.
> Tel: +886-2-27898308
> Office: Room P111, Institute of Physics, 128 Academia Road, Section 2,
> Nankang, Taipei 115, Taiwan
>
> --
> This message has been scanned for viruses and
> dangerous content by MailScanner, and is
> believed to be clean.
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Stop Rebalancing

2022-04-13 Thread Dan van der Ster
On Wed, Apr 13, 2022 at 7:07 PM Gregory Farnum  wrote:
>
> On Wed, Apr 13, 2022 at 10:01 AM Dan van der Ster  wrote:
> >
> > I would set the pg_num, not pgp_num. In older versions of ceph you could
> > manipulate these things separately, but in pacific I'm not confident about
> > what setting pgp_num directly will do in this exact scenario.
> >
> > To understand, the difference between these two depends on if you're
> > splitting or merging.
> > First, definitions: pg_num is the number of PGs and pgp_num is the number
> > used for placing objects.
> >
> > So if pgp_num < pg_num, then at steady state only pgp_num pgs actually
> > store data, and the other pg_num-pgp_num PGs are sitting empty.
>
> Wait, what? That's not right! pgp_num is pg *placement* number; it
> controls how we map PGs to OSDs. But the full pg still exists as its
> own thing on the OSD and has its own data structures and objects. If
> currently the cluster has reduced pgp_num it has changed the locations
> of PGs, but it hasn't merged any PGs together. Changing the pg_num and
> causing merges will invoke a whole new workload which can be pretty
> substantial.

Eek, yes, I got this wrong. Somehow I imagined some orthogonal
implementation based on how it appears to work in practice.

In any case, isn't this still the best approach to make all PGs go
active+clean ASAP in this scenario?

1. turn off the autoscaler (for those pools, or fully)
2. for any pool with pg_num_target or pgp_num_target values, get the
current pgp_num X and use it to `ceph osd pool set <pool> pg_num X` (rough sketch below).

Can someone confirm that or recommend something different?
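
i.e. something along these lines (untested sketch, pool names to be filled in):

for pool in pool1 pool7; do
  x=$(ceph osd pool get $pool pgp_num | awk '{print $2}')
  ceph osd pool set $pool pg_num $x
done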

Cheers, Dan



> -Greg
>
> >
> > To merge PGs, Ceph decreases pgp_num to squeeze the objects into fewer pgs,
> > then decreases pg_num as the PGs are emptied to actually delete the now
> > empty PGs.
> >
> > Splitting is similar but in reverse: first, Ceph creates new empty PGs by
> > increasing pg_num. Then it gradually increases pgp_num to start sending
> > data to the new PGs.
> >
> > That's the general idea, anyway.
> >
> > Long story short, set pg_num to something close to the current
> > pgp_num_target.
> >
> > .. Dan
> >
> >
> > On Wed., Apr. 13, 2022, 18:43 Ray Cunningham, 
> > 
> > wrote:
> >
> > > Thank you so much, Dan!
> > >
> > > Can you confirm for me that for pool7, which has 2048/2048 for pg_num and
> > > 883/2048 for pgp_num, we should change pg_num or pgp_num? And can they be
> > > different for a single pool, or does pg_num and pgp_num have to always be
> > > the same?
> > >
> > > IF we just set pgp_num to 890 we will have pg_num at 2048 and pgp_num at
> > > 890, is that ok? Because if we reduce the pg_num by 1200 it will just 
> > > start
> > > a whole new load of misplaced object rebalancing. Won't it?
> > >
> > > Thank you,
> > > Ray
> > >
> > >
> > > -Original Message-
> > > From: Dan van der Ster 
> > > Sent: Wednesday, April 13, 2022 11:11 AM
> > > To: Ray Cunningham 
> > > Cc: ceph-users@ceph.io
> > > Subject: Re: [ceph-users] Stop Rebalancing
> > >
> > > Hi, Thanks.
> > >
> > > norebalance/nobackfill are useful to pause ongoing backfilling, but aren't
> > > the best option now to get the PGs to go active+clean and let the mon db
> > > come back under control. Unset those before continuing.
> > >
> > > I think you need to set the pg_num for pool1 to something close to but
> > > less than 926. (Or whatever the pg_num_target is when you run the command
> > > below).
> > > The idea is to let a few more merges complete successfully but then once
> > > all PGs are active+clean to take a decision about the other interventions
> > > you want to carry out.
> > > So this ought to be good:
> > > ceph osd pool set pool1 pg_num 920
> > >
> > > Then for pool7 this looks like splitting is ongoing. You should be able to
> > > pause that by setting the pg_num to something just above 883.
> > > I would do:
> > > ceph osd pool set pool7 pg_num 890
> > >
> > > It may even be fastest to just set those pg_num values to exactly what the
> > > current pgp_num_target is. You can try it.
> > >
> > > Once your cluster is stable again, then you should set those to the
> > > nearest power of two.
> > > Personally I would wait for #53729 to be fixed before embarking on future
> > > pg_num changes.
> > > (You'll hav

[ceph-users] Re: Stop Rebalancing

2022-04-13 Thread Dan van der Ster
I would set the pg_num, not pgp_num. In older versions of ceph you could
manipulate these things separately, but in pacific I'm not confident about
what setting pgp_num directly will do in this exact scenario.

To understand, the difference between these two depends on if you're
splitting or merging.
First, definitions: pg_num is the number of PGs and pgp_num is the number
used for placing objects.

So if pgp_num < pg_num, then at steady state only pgp_num pgs actually
store data, and the other pg_num-pgp_num PGs are sitting empty.

To merge PGs, Ceph decreases pgp_num to squeeze the objects into fewer pgs,
then decreases pg_num as the PGs are emptied to actually delete the now
empty PGs.

Splitting is similar but in reverse: first, Ceph creates new empty PGs by
increasing pg_num. Then it gradually increases pgp_num to start sending
data to the new PGs.

That's the general idea, anyway.
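
You can watch where a pool currently is in that process with, for example:

ceph osd pool ls detail | grep -E 'pg_num|pgp_num'
ceph osd pool get <pool> pg_num
ceph osd pool get <pool> pgp_num

(If I remember right, the pg_num_target/pgp_num_target fields only show up in
`ls detail` while a change is still in progress.)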

Long story short, set pg_num to something close to the current
pgp_num_target.

.. Dan


On Wed., Apr. 13, 2022, 18:43 Ray Cunningham, 
wrote:

> Thank you so much, Dan!
>
> Can you confirm for me that for pool7, which has 2048/2048 for pg_num and
> 883/2048 for pgp_num, we should change pg_num or pgp_num? And can they be
> different for a single pool, or does pg_num and pgp_num have to always be
> the same?
>
> IF we just set pgp_num to 890 we will have pg_num at 2048 and pgp_num at
> 890, is that ok? Because if we reduce the pg_num by 1200 it will just start
> a whole new load of misplaced object rebalancing. Won't it?
>
> Thank you,
> Ray
>
>
> -Original Message-
> From: Dan van der Ster 
> Sent: Wednesday, April 13, 2022 11:11 AM
> To: Ray Cunningham 
> Cc: ceph-users@ceph.io
> Subject: Re: [ceph-users] Stop Rebalancing
>
> Hi, Thanks.
>
> norebalance/nobackfill are useful to pause ongoing backfilling, but aren't
> the best option now to get the PGs to go active+clean and let the mon db
> come back under control. Unset those before continuing.
>
> I think you need to set the pg_num for pool1 to something close to but
> less than 926. (Or whatever the pg_num_target is when you run the command
> below).
> The idea is to let a few more merges complete successfully but then once
> all PGs are active+clean to take a decision about the other interventions
> you want to carry out.
> So this ought to be good:
> ceph osd pool set pool1 pg_num 920
>
> Then for pool7 this looks like splitting is ongoing. You should be able to
> pause that by setting the pg_num to something just above 883.
> I would do:
> ceph osd pool set pool7 pg_num 890
>
> It may even be fastest to just set those pg_num values to exactly what the
> current pgp_num_target is. You can try it.
>
> Once your cluster is stable again, then you should set those to the
> nearest power of two.
> Personally I would wait for #53729 to be fixed before embarking on future
> pg_num changes.
> (You'll have to mute a warning in the meantime -- check the docs after the
> warning appears).
>
> Cheers, dan
>
> On Wed, Apr 13, 2022 at 5:16 PM Ray Cunningham <
> ray.cunning...@keepertech.com> wrote:
> >
> > Perfect timing, I was just about to reply. We have disabled autoscaler
> on all pools now.
> >
> > Unfortunately, I can't just copy and paste from this system...
> >
> > `ceph osd pool ls detail` only 2 pools have any difference.
> > pool1:  pgnum 940, pgnum target 256, pgpnum 926 pgpnum target 256
> > pool7:  pgnum 2048, pgnum target 2048, pgpnum 883, pgpnum target 2048
> >
> > ` ceph osd pool autoscale-status`
> > Size is defined
> > target size is empty
> > Rate is 7 for all pools except pool7, which is 1.333730697632 Raw
> > capacity is defined Ratio for pool1 is .0177, pool7 is .4200 and all
> > others is 0 Target and Effective Ratio is empty Bias is 1.0 for all
> > PG_NUM: pool1 is 256, pool7 is 2048 and all others are 32.
> > New PG_NUM is empty
> > Autoscale is now off for all
> > Profile is scale-up
> >
> >
> > We have set norebalance and nobackfill and are watching to see what
> happens.
> >
> > Thank you,
> > Ray
> >
> > -Original Message-
> > From: Dan van der Ster 
> > Sent: Wednesday, April 13, 2022 10:00 AM
> > To: Ray Cunningham 
> > Cc: ceph-users@ceph.io
> > Subject: Re: [ceph-users] Stop Rebalancing
> >
> > One more thing, could you please also share the `ceph osd pool
> autoscale-status` ?
> >
> >
> > On Tue, Apr 12, 2022 at 9:50 PM Ray Cunningham <
> ray.cunning...@keepertech.com> wrote:
> > >
> > > Thank you Dan! I will definitely disable autoscaler on the rest of our
> pools. I can't get the PG

[ceph-users] Re: Stop Rebalancing

2022-04-13 Thread Dan van der Ster
Hi, Thanks.

norebalance/nobackfill are useful to pause ongoing backfilling, but
aren't the best option now to get the PGs to go active+clean and let
the mon db come back under control. Unset those before continuing.

I think you need to set the pg_num for pool1 to something close to but
less than 926. (Or whatever the pg_num_target is when you run the
command below).
The idea is to let a few more merges complete successfully, and then,
once all PGs are active+clean, decide on the other
interventions you want to carry out.
So this ought to be good:
ceph osd pool set pool1 pg_num 920

Then for pool7 this looks like splitting is ongoing. You should be
able to pause that by setting the pg_num to something just above 883.
I would do:
ceph osd pool set pool7 pg_num 890

It may even be fastest to just set those pg_num values to exactly what
the current pgp_num_target is. You can try it.

Once your cluster is stable again, then you should set those to the
nearest power of two.
Personally I would wait for #53729 to be fixed before embarking on
future pg_num changes.
(You'll have to mute a warning in the meantime -- check the docs after
the warning appears).
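
(From memory the mute looks something like this -- take the exact code from
`ceph health detail` once it shows up:)

ceph health mute POOL_PG_NUM_NOT_POWER_OF_TWO 4w   # or however long you expect to wait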

Cheers, dan

On Wed, Apr 13, 2022 at 5:16 PM Ray Cunningham
 wrote:
>
> Perfect timing, I was just about to reply. We have disabled autoscaler on all 
> pools now.
>
> Unfortunately, I can't just copy and paste from this system...
>
> `ceph osd pool ls detail` only 2 pools have any difference.
> pool1:  pgnum 940, pgnum target 256, pgpnum 926 pgpnum target 256
> pool7:  pgnum 2048, pgnum target 2048, pgpnum 883, pgpnum target 2048
>
> ` ceph osd pool autoscale-status`
> Size is defined
> target size is empty
> Rate is 7 for all pools except pool7, which is 1.333730697632
> Raw capacity is defined
> Ratio for pool1 is .0177, pool7 is .4200 and all others is 0
> Target and Effective Ratio is empty
> Bias is 1.0 for all
> PG_NUM: pool1 is 256, pool7 is 2048 and all others are 32.
> New PG_NUM is empty
> Autoscale is now off for all
> Profile is scale-up
>
>
> We have set norebalance and nobackfill and are watching to see what happens.
>
> Thank you,
> Ray
>
> -Original Message-
> From: Dan van der Ster 
> Sent: Wednesday, April 13, 2022 10:00 AM
> To: Ray Cunningham 
> Cc: ceph-users@ceph.io
> Subject: Re: [ceph-users] Stop Rebalancing
>
> One more thing, could you please also share the `ceph osd pool 
> autoscale-status` ?
>
>
> On Tue, Apr 12, 2022 at 9:50 PM Ray Cunningham 
>  wrote:
> >
> > Thank you Dan! I will definitely disable autoscaler on the rest of our 
> > pools. I can't get the PG numbers today, but I will try to get them 
> > tomorrow. We definitely want to get this under control.
> >
> > Thank you,
> > Ray
> >
> >
> > -Original Message-
> > From: Dan van der Ster 
> > Sent: Tuesday, April 12, 2022 2:46 PM
> > To: Ray Cunningham 
> > Cc: ceph-users@ceph.io
> > Subject: Re: [ceph-users] Stop Rebalancing
> >
> > Hi Ray,
> >
> > Disabling the autoscaler on all pools is probably a good idea. At least 
> > until https://tracker.ceph.com/issues/53729 is fixed. (You are likely not 
> > susceptible to that -- but better safe than sorry).
> >
> > To pause the ongoing PG merges, you can indeed set the pg_num to the 
> > current value. This will allow the ongoing merge complete and prevent 
> > further merges from starting.
> > From `ceph osd pool ls detail` you'll see pg_num, pgp_num, pg_num_target, 
> > pgp_num_target... If you share the current values of those we can help 
> > advise what you need to set the pg_num to to effectively pause things where 
> > they are.
> >
> > BTW -- I'm going to create a request in the tracker that we improve the pg 
> > autoscaler heuristic. IMHO the autoscaler should estimate the time to carry 
> > out a split/merge operation and avoid taking one-way decisions without 
> > permission from the administrator. The autoscaler is meant to be helpful, 
> > not degrade a cluster for 100 days!
> >
> > Cheers, Dan
> >
> >
> >
> > On Tue, Apr 12, 2022 at 9:04 PM Ray Cunningham 
> >  wrote:
> > >
> > > Hi Everyone,
> > >
> > > We just upgraded our 640 OSD cluster to Ceph 16.2.7 and the resulting 
> > > rebalancing of misplaced objects is overwhelming the cluster and 
> > > impacting MON DB compaction, deep scrub repairs and us upgrading legacy 
> > > bluestore OSDs. We have to pause the rebalancing if misplaced objects or 
> > > we're going to fall over.
> > >
> > > Autoscaler-status tells us that we are reducing our PGs by 700'ish wh

[ceph-users] Re: Stop Rebalancing

2022-04-13 Thread Dan van der Ster
One more thing, could you please also share the `ceph osd pool
autoscale-status` ?


On Tue, Apr 12, 2022 at 9:50 PM Ray Cunningham
 wrote:
>
> Thank you Dan! I will definitely disable autoscaler on the rest of our pools. 
> I can't get the PG numbers today, but I will try to get them tomorrow. We 
> definitely want to get this under control.
>
> Thank you,
> Ray
>
>
> -Original Message-
> From: Dan van der Ster 
> Sent: Tuesday, April 12, 2022 2:46 PM
> To: Ray Cunningham 
> Cc: ceph-users@ceph.io
> Subject: Re: [ceph-users] Stop Rebalancing
>
> Hi Ray,
>
> Disabling the autoscaler on all pools is probably a good idea. At least until 
> https://tracker.ceph.com/issues/53729 is fixed. (You are likely not 
> susceptible to that -- but better safe than sorry).
>
> To pause the ongoing PG merges, you can indeed set the pg_num to the current 
> value. This will allow the ongoing merge complete and prevent further merges 
> from starting.
> From `ceph osd pool ls detail` you'll see pg_num, pgp_num, pg_num_target, 
> pgp_num_target... If you share the current values of those we can help advise 
> what you need to set the pg_num to to effectively pause things where they are.
>
> BTW -- I'm going to create a request in the tracker that we improve the pg 
> autoscaler heuristic. IMHO the autoscaler should estimate the time to carry 
> out a split/merge operation and avoid taking one-way decisions without 
> permission from the administrator. The autoscaler is meant to be helpful, not 
> degrade a cluster for 100 days!
>
> Cheers, Dan
>
>
>
> On Tue, Apr 12, 2022 at 9:04 PM Ray Cunningham 
>  wrote:
> >
> > Hi Everyone,
> >
> > We just upgraded our 640 OSD cluster to Ceph 16.2.7 and the resulting 
> > rebalancing of misplaced objects is overwhelming the cluster and impacting 
> > MON DB compaction, deep scrub repairs and us upgrading legacy bluestore 
> > OSDs. We have to pause the rebalancing if misplaced objects or we're going 
> > to fall over.
> >
> > Autoscaler-status tells us that we are reducing our PGs by 700'ish which 
> > will take us over 100 days to complete at our current recovery speed. We 
> > disabled autoscaler on our biggest pool, but I'm concerned that it's 
> > already on the path to the lower PG count and won't stop adding to our 
> > misplaced count after drop below 5%. What can we do to stop the cluster 
> > from finding more misplaced objects to rebalance? Should we set the PG num 
> > manually to what our current count is? Or will that cause even more havoc?
> >
> > Any other thoughts or ideas? My goals are to stop the rebalancing 
> > temporarily so we can deep scrub and repair inconsistencies, upgrade legacy 
> > bluestore OSDs and compact our MON DBs (supposedly MON DBs don't compact 
> > when you aren't 100% active+clean).
> >
> > Thank you,
> > Ray
> >
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an
> > email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Stop Rebalancing

2022-04-12 Thread Dan van der Ster
OK -- here's the tracker for what I mentioned:
https://tracker.ceph.com/issues/55303

On Tue, Apr 12, 2022 at 9:50 PM Ray Cunningham
 wrote:
>
> Thank you Dan! I will definitely disable autoscaler on the rest of our pools. 
> I can't get the PG numbers today, but I will try to get them tomorrow. We 
> definitely want to get this under control.
>
> Thank you,
> Ray
>
>
> -Original Message-
> From: Dan van der Ster 
> Sent: Tuesday, April 12, 2022 2:46 PM
> To: Ray Cunningham 
> Cc: ceph-users@ceph.io
> Subject: Re: [ceph-users] Stop Rebalancing
>
> Hi Ray,
>
> Disabling the autoscaler on all pools is probably a good idea. At least until 
> https://tracker.ceph.com/issues/53729 is fixed. (You are likely not 
> susceptible to that -- but better safe than sorry).
>
> To pause the ongoing PG merges, you can indeed set the pg_num to the current 
> value. This will allow the ongoing merge complete and prevent further merges 
> from starting.
> From `ceph osd pool ls detail` you'll see pg_num, pgp_num, pg_num_target, 
> pgp_num_target... If you share the current values of those we can help advise 
> what you need to set the pg_num to to effectively pause things where they are.
>
> BTW -- I'm going to create a request in the tracker that we improve the pg 
> autoscaler heuristic. IMHO the autoscaler should estimate the time to carry 
> out a split/merge operation and avoid taking one-way decisions without 
> permission from the administrator. The autoscaler is meant to be helpful, not 
> degrade a cluster for 100 days!
>
> Cheers, Dan
>
>
>
> On Tue, Apr 12, 2022 at 9:04 PM Ray Cunningham 
>  wrote:
> >
> > Hi Everyone,
> >
> > We just upgraded our 640 OSD cluster to Ceph 16.2.7 and the resulting 
> > rebalancing of misplaced objects is overwhelming the cluster and impacting 
> > MON DB compaction, deep scrub repairs and us upgrading legacy bluestore 
> > OSDs. We have to pause the rebalancing if misplaced objects or we're going 
> > to fall over.
> >
> > Autoscaler-status tells us that we are reducing our PGs by 700'ish which 
> > will take us over 100 days to complete at our current recovery speed. We 
> > disabled autoscaler on our biggest pool, but I'm concerned that it's 
> > already on the path to the lower PG count and won't stop adding to our 
> > misplaced count after drop below 5%. What can we do to stop the cluster 
> > from finding more misplaced objects to rebalance? Should we set the PG num 
> > manually to what our current count is? Or will that cause even more havoc?
> >
> > Any other thoughts or ideas? My goals are to stop the rebalancing 
> > temporarily so we can deep scrub and repair inconsistencies, upgrade legacy 
> > bluestore OSDs and compact our MON DBs (supposedly MON DBs don't compact 
> > when you aren't 100% active+clean).
> >
> > Thank you,
> > Ray
> >
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an
> > email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Stop Rebalancing

2022-04-12 Thread Dan van der Ster
Hi Ray,

Disabling the autoscaler on all pools is probably a good idea. At
least until https://tracker.ceph.com/issues/53729 is fixed. (You are
likely not susceptible to that -- but better safe than sorry).

To pause the ongoing PG merges, you can indeed set the pg_num to the
current value. This will allow the ongoing merge to complete and prevent
further merges from starting.
From `ceph osd pool ls detail` you'll see pg_num, pgp_num,
pg_num_target, pgp_num_target... If you share the current values of
those we can help advise what you need to set the pg_num to to
effectively pause things where they are.
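
For example, a rough sketch (adjust the pool selection to taste):

# turn the autoscaler off per pool
for p in $(ceph osd pool ls); do ceph osd pool set $p pg_autoscale_mode off; done

# then check / share the current vs target pg counts
ceph osd pool ls detail | grep -E 'pg_num|pgp_num'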

BTW -- I'm going to create a request in the tracker that we improve
the pg autoscaler heuristic. IMHO the autoscaler should estimate the
time to carry out a split/merge operation and avoid taking one-way
decisions without permission from the administrator. The autoscaler is
meant to be helpful, not degrade a cluster for 100 days!

Cheers, Dan



On Tue, Apr 12, 2022 at 9:04 PM Ray Cunningham
 wrote:
>
> Hi Everyone,
>
> We just upgraded our 640 OSD cluster to Ceph 16.2.7 and the resulting 
> rebalancing of misplaced objects is overwhelming the cluster and impacting 
> MON DB compaction, deep scrub repairs and us upgrading legacy bluestore OSDs. 
> We have to pause the rebalancing if misplaced objects or we're going to fall 
> over.
>
> Autoscaler-status tells us that we are reducing our PGs by 700'ish which will 
> take us over 100 days to complete at our current recovery speed. We disabled 
> autoscaler on our biggest pool, but I'm concerned that it's already on the 
> path to the lower PG count and won't stop adding to our misplaced count after 
> drop below 5%. What can we do to stop the cluster from finding more misplaced 
> objects to rebalance? Should we set the PG num manually to what our current 
> count is? Or will that cause even more havoc?
>
> Any other thoughts or ideas? My goals are to stop the rebalancing temporarily 
> so we can deep scrub and repair inconsistencies, upgrade legacy bluestore 
> OSDs and compact our MON DBs (supposedly MON DBs don't compact when you 
> aren't 100% active+clean).
>
> Thank you,
> Ray
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Successful Upgrade from 14.2.18 to 15.2.16

2022-04-12 Thread Dan van der Ster
Hi Stefan,

Thanks for the report. 9 hours fsck is the longest I've heard about
yet -- and on NVMe, that's quite surprising!

Which firmware are you running on those Samsungs? For a different
reason Mark and we have been comparing performance of that drive
between what's in his lab vs what we have in our data centre. We have
no obvious perf issues running EDA5702Q; Mark has some issue with the
Quincy RC running FW EDA53W0Q. I'm not sure if it's related, but worth
checking...

In any case, I'm also surprised you decided to drain the boxes before
fsck. Wouldn't 9 hours of down OSDs, with noout set, be less
invasive?
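
What I had in mind is basically this, per host (paths assume the default OSD
layout -- just a sketch, not something I've run against your setup):

ceph osd set noout
systemctl stop ceph-osd@"$ID"
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-"$ID"   # or 'repair'
systemctl start ceph-osd@"$ID"
ceph osd unset noout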

Cheers, Dan


On Mon, Apr 11, 2022 at 10:56 AM Stefan Kooman  wrote:
>
> Hi All,
>
> Last week we succesfully upgraded our 14.2.18 cluster to 15.2.16.
> According to "ceph crash ls" we did not have a single crash while
> running Nautilus \o/. Unlike releases before Nautilus we occasionally
> had issues with MDS (hitting bugs) but since Nautilus this is no longer
> the case. And hopefully it stays like this. So kudos to all Ceph devs
> and contributors!
>
> One thing that took *way* longer than expected was the bluestore fsck.
> We did a "offline" and a "online" approach on one host. Both took the
> same amount of time (unlike previous releases, where online fsck would
> take way longer) ... about *9 hours* on NVMe disks (Samsung PM-983,
> SAMSUNG MZQLB3T8HALS-7).
> Note that we have a relatively big CephFS workload (with lots of small
> files and deep directory hierarchies), so your mileage may vary. Also
> note that "online" does not mean that our OSDs are UP ... they are not.
> They only "boot" when this process has finished. So the
> "bluestore_fsck_quick_fix_on_mount" parameter is misleading here.
> We decided to not proceed with bluestore fsck and first upgrade all
> storage nodes. We are now planning on redeploying the remaining OSDs and
> use "pgremapper" to drain hosts to new storage servers one by one: less
> risk (no degraded data for a prolonged period of time) and potentially
> even quicker.
>
> FYI,
>
> Gr. Stefan
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: PG down, due to 3 OSD failing

2022-04-04 Thread Dan van der Ster
BTW -- i've created https://tracker.ceph.com/issues/55169 to ask that
we add some input validation. Injecting such a crush map would ideally
not be possible.

-- dan

On Mon, Apr 4, 2022 at 11:02 AM Dan van der Ster  wrote:
>
> Excellent news!
> After everything is back to active+clean, don't forget to set min_size to 4 :)
>
> have a nice day
>
> On Mon, Apr 4, 2022 at 10:59 AM Fulvio Galeazzi  
> wrote:
> >
> > Yesss! Fixing the choose/chooseleaf thing did make the magic.  :-)
> >
> >Thanks a lot for your support Dan. Lots of lessons learned from my
> > side, I'm really grateful.
> >
> >All PGs are now active, will let Ceph rebalance.
> >
> >    Ciao ciao
> >
> > Fulvio
> >
> > On 4/4/22 10:50, Dan van der Ster wrote:
> > > Hi Fulvio,
> > >
> > > Yes -- that choose/chooseleaf thing is definitely a problem.. Good catch!
> > > I suggest to fix it and inject the new crush map and see how it goes.
> > >
> > >
> > > Next, in your crush map for the storage type, you have an error:
> > >
> > > # types
> > > type 0 osd
> > > type 1 host
> > > type 2 chassis
> > > type 3 rack
> > > type 4 row
> > > type 5 pdu
> > > type 6 pod
> > > type 7 room
> > > type 8 datacenter
> > > type 9 region
> > > type 10 root
> > > type 11 storage
> > >
> > > The *order* of types is very important in crush -- they must be nested
> > > in the order they appear in the tree. "storage" should therefore be
> > > something between host and osd.
> > > If not, and if you use that type, it can break things.
> > > But since you're not actually using "storage" at the moment, it
> > > probably isn't causing any issue.
> > >
> > > So -- could you go ahead with that chooseleaf fix then let us know how it 
> > > goes?
> > >
> > > Cheers, Dan
> > >
> > >
> > >
> > >
> > >
> > > On Mon, Apr 4, 2022 at 10:01 AM Fulvio Galeazzi  
> > > wrote:
> > >>
> > >> Hi again Dan!
> > >> Things are improving, all OSDs are up, but still that one PG is down.
> > >> More info below.
> > >>
> > >> On 4/1/22 19:26, Dan van der Ster wrote:
> > >>>>>> Here is the output of "pg 85.12 query":
> > >>>>>>https://pastebin.ubuntu.com/p/ww3JdwDXVd/
> > >>>>>>  and its status (also showing the other 85.XX, for reference):
> > >>>>>
> > >>>>> This is very weird:
> > >>>>>
> > >>>>>"up": [
> > >>>>>2147483647,
> > >>>>>2147483647,
> > >>>>>2147483647,
> > >>>>>2147483647,
> > >>>>>2147483647
> > >>>>>],
> > >>>>>"acting": [
> > >>>>>67,
> > >>>>>91,
> > >>>>>82,
> > >>>>>2147483647,
> > >>>>>112
> > >>>>>],
> > >>
> > >> Meanwhile, since a random PG still shows an output like the above one, I
> > >> think I found the problem with the crush rule: it syas "choose" rather
> > >> than "chooseleaf"!
> > >>
> > >> rule csd-data-pool {
> > >>   id 5
> > >>   type erasure
> > >>   min_size 3
> > >>   max_size 5
> > >>   step set_chooseleaf_tries 5
> > >>   step set_choose_tries 100
> > >>   step take default class big
> > >>   step choose indep 0 type host<--- HERE!
> > >>   step emit
> > >> }
> > >>
> > >> ...relic of a more complicated, two-step rule... sigh!
> > >>
> > >>> PGs are active if at least 3 shards are up.
> > >>> Our immediate goal remains to get 3 shards up for PG 85.25 (I'm
> > >>> assuming 85.25 remains the one and only PG which is down?)
> > >>
> > >> Yes, 85.25 is still the single 'down' PG.
> > >>
> > >>>> pool 85 'csd-dataonly-ec-pool' erasure 

[ceph-users] Re: PG down, due to 3 OSD failing

2022-04-04 Thread Dan van der Ster
Excellent news!
After everything is back to active+clean, don't forget to set min_size to 4 :)

have a nice day

On Mon, Apr 4, 2022 at 10:59 AM Fulvio Galeazzi  wrote:
>
> Yesss! Fixing the choose/chooseleaf thing did make the magic.  :-)
>
>Thanks a lot for your support Dan. Lots of lessons learned from my
> side, I'm really grateful.
>
>All PGs are now active, will let Ceph rebalance.
>
>Ciao ciao
>
>     Fulvio
>
> On 4/4/22 10:50, Dan van der Ster wrote:
> > Hi Fulvio,
> >
> > Yes -- that choose/chooseleaf thing is definitely a problem.. Good catch!
> > I suggest to fix it and inject the new crush map and see how it goes.
> >
> >
> > Next, in your crush map for the storage type, you have an error:
> >
> > # types
> > type 0 osd
> > type 1 host
> > type 2 chassis
> > type 3 rack
> > type 4 row
> > type 5 pdu
> > type 6 pod
> > type 7 room
> > type 8 datacenter
> > type 9 region
> > type 10 root
> > type 11 storage
> >
> > The *order* of types is very important in crush -- they must be nested
> > in the order they appear in the tree. "storage" should therefore be
> > something between host and osd.
> > If not, and if you use that type, it can break things.
> > But since you're not actually using "storage" at the moment, it
> > probably isn't causing any issue.
> >
> > So -- could you go ahead with that chooseleaf fix then let us know how it 
> > goes?
> >
> > Cheers, Dan
> >
> >
> >
> >
> >
> > On Mon, Apr 4, 2022 at 10:01 AM Fulvio Galeazzi  
> > wrote:
> >>
> >> Hi again Dan!
> >> Things are improving, all OSDs are up, but still that one PG is down.
> >> More info below.
> >>
> >> On 4/1/22 19:26, Dan van der Ster wrote:
> >>>>>> Here is the output of "pg 85.12 query":
> >>>>>>https://pastebin.ubuntu.com/p/ww3JdwDXVd/
> >>>>>>  and its status (also showing the other 85.XX, for reference):
> >>>>>
> >>>>> This is very weird:
> >>>>>
> >>>>>"up": [
> >>>>>2147483647,
> >>>>>2147483647,
> >>>>>2147483647,
> >>>>>2147483647,
> >>>>>2147483647
> >>>>>],
> >>>>>"acting": [
> >>>>>67,
> >>>>>91,
> >>>>>82,
> >>>>>2147483647,
> >>>>>112
> >>>>>],
> >>
> >> Meanwhile, since a random PG still shows an output like the above one, I
> >> think I found the problem with the crush rule: it says "choose" rather
> >> than "chooseleaf"!
> >>
> >> rule csd-data-pool {
> >>   id 5
> >>   type erasure
> >>   min_size 3
> >>   max_size 5
> >>   step set_chooseleaf_tries 5
> >>   step set_choose_tries 100
> >>   step take default class big
> >>   step choose indep 0 type host<--- HERE!
> >>   step emit
> >> }
> >>
> >> ...relic of a more complicated, two-step rule... sigh!
> >>
> >>> PGs are active if at least 3 shards are up.
> >>> Our immediate goal remains to get 3 shards up for PG 85.25 (I'm
> >>> assuming 85.25 remains the one and only PG which is down?)
> >>
> >> Yes, 85.25 is still the single 'down' PG.
> >>
> >>>> pool 85 'csd-dataonly-ec-pool' erasure size 5 min_size 3 crush_rule 5
> >>>> object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode warn
> >>>> last_change 616460 flags
> >>>> hashpspool,ec_overwrites,nodelete,selfmanaged_snaps stripe_width 12288
> >>>> application rbd
> >>>
> >>> Yup okay, we need to fix that later to make this cluster correctly
> >>> configured. To be followed up.
> >>
> >> At some point, need to update min_size to 4.
> >>
> >>>> If I understand correctly, it should now be safe (but I will wait for
> >>>> your green light) to repeat the same for:
> >>>> osd.121 chunk 85.11s0
> >>>> osd.145 chunk 85.33s0
> >>

[ceph-users] Re: PG down, due to 3 OSD failing

2022-04-04 Thread Dan van der Ster
Could you share the output of `ceph pg 85.25 query`.

Then increase the crush weights of those three osds to 0.1, then check
if the PG goes active.
(It is possible that the OSDs are not registering as active while they
have weight zero).
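
i.e. something like this, filling in the ids of the three previously-down OSDs:

ceph osd crush reweight osd.<id> 0.1   # repeat for each of the three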

-- dan


On Mon, Apr 4, 2022 at 10:01 AM Fulvio Galeazzi  wrote:
>
> Hi again Dan!
> Things are improving, all OSDs are up, but still that one PG is down.
> More info below.
>
> On 4/1/22 19:26, Dan van der Ster wrote:
> >>>> Here is the output of "pg 85.12 query":
> >>>>   https://pastebin.ubuntu.com/p/ww3JdwDXVd/
> >>>> and its status (also showing the other 85.XX, for reference):
> >>>
> >>> This is very weird:
> >>>
> >>>   "up": [
> >>>   2147483647,
> >>>   2147483647,
> >>>   2147483647,
> >>>   2147483647,
> >>>   2147483647
> >>>   ],
> >>>   "acting": [
> >>>   67,
> >>>   91,
> >>>   82,
> >>>   2147483647,
> >>>   112
> >>>   ],
>
> Meanwhile, since a random PG still shows an output like the above one, I
> think I found the problem with the crush rule: it says "choose" rather
> than "chooseleaf"!
>
> rule csd-data-pool {
>  id 5
>  type erasure
>  min_size 3
>  max_size 5
>  step set_chooseleaf_tries 5
>  step set_choose_tries 100
>  step take default class big
>  step choose indep 0 type host<--- HERE!
>  step emit
> }
>
> ...relic of a more complicated, two-step rule... sigh!
>
> > PGs are active if at least 3 shards are up.
> > Our immediate goal remains to get 3 shards up for PG 85.25 (I'm
> > assuming 85.25 remains the one and only PG which is down?)
>
> Yes, 85.25 is still the single 'down' PG.
>
> >> pool 85 'csd-dataonly-ec-pool' erasure size 5 min_size 3 crush_rule 5
> >> object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode warn
> >> last_change 616460 flags
> >> hashpspool,ec_overwrites,nodelete,selfmanaged_snaps stripe_width 12288
> >> application rbd
> >
> > Yup okay, we need to fix that later to make this cluster correctly
> > configured. To be followed up.
>
> At some point, need to update min_size to 4.
>
> >> If I understand correctly, it should now be safe (but I will wait for
> >> your green light) to repeat the same for:
> >> osd.121 chunk 85.11s0
> >> osd.145 chunk 85.33s0
> >>so they can also start.
> >
> > Yes, please go ahead and do the same.
> > I expect that your PG 85.25 will go active as soon as both those OSDs
> > start correctly.
>
> Hmmm, unfortunately not. All OSDs are up, but 85.25 is still down.
> Its chunks are in:
>
> 85.25s0: osd.64
> 85.25s1: osd.140 osd.159
> 85.25s2: osd.96
> 85.25s3: osd.121 osd.176
> 85.25s4: osd.159 osd.56
>
> > BTW, I also noticed in your crush map below that the down osds have
> > crush weight zero!
> > So -- this means they are the only active OSDs for a PG, and they are
> > all set to be drained.
> > How did this happen? It is also surely part of the root cause here!
> >
> > I suggest to reset the crush weight of those back to what it was
> > before, probably 1 ?
>
> At some point I changed those weight to 0., but this was well after the
> beginning of the problem: this helped, at least, healing a lot of
> degraded/undersized.
>
> > After you have all the PGs active, we need to find out why their "up"
> > set is completely bogus.
> > This is evidence that your crush rule is broken.
> > If a PG doesn't have a complete "up" set, then it can never stop being
> > degraded -- the PGs don't know where to go.
>
> Do you think the choose-chooseleaf issue mentioned above, could be the
> culprit?
>
> > I'm curious about that "storage" type you guys invented.
>
> Oh, nothing too fancy... foreword, we happen to be using (and are
> currently finally replacing) hardware (based on FiberChannel-SAN) which
> is not the first choice in the Ceph world: but purchase happened before
> we turned to Ceph as our storage solution. Each OSD server has access to
> 2 such distinct storage systems, hence the idea to describe these
> failure domains in the crush rule.
>
> > Could you please copy to pastebin and share the crush.txt from
> >
> > ceph osd getcrushmap -o crush.map
> > crushtool 

[ceph-users] Re: PG down, due to 3 OSD failing

2022-04-04 Thread Dan van der Ster
Hi Fulvio,

Yes -- that choose/chooseleaf thing is definitely a problem.. Good catch!
I suggest to fix it and inject the new crush map and see how it goes.


Next, in your crush map for the storage type, you have an error:

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root
type 11 storage

The *order* of types is very important in crush -- they must be nested
in the order they appear in the tree. "storage" should therefore be
something between host and osd.
If not, and if you use that type, it can break things.
But since you're not actually using "storage" at the moment, it
probably isn't causing any issue.

So -- could you go ahead with that chooseleaf fix then let us know how it goes?
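
For the record, the usual round trip for that edit (the --test line is an
optional sanity check; rule id 5 is the csd-data-pool rule from your paste):

ceph osd getcrushmap -o crush.map
crushtool -d crush.map -o crush.txt
# edit crush.txt: "step choose indep 0 type host" -> "step chooseleaf indep 0 type host"
crushtool -c crush.txt -o crush.new
crushtool -i crush.new --test --rule 5 --num-rep 5 --show-mappings | head
ceph osd setcrushmap -i crush.new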

Cheers, Dan





On Mon, Apr 4, 2022 at 10:01 AM Fulvio Galeazzi  wrote:
>
> Hi again Dan!
> Things are improving, all OSDs are up, but still that one PG is down.
> More info below.
>
> On 4/1/22 19:26, Dan van der Ster wrote:
> >>>> Here is the output of "pg 85.12 query":
> >>>>   https://pastebin.ubuntu.com/p/ww3JdwDXVd/
> >>>> and its status (also showing the other 85.XX, for reference):
> >>>
> >>> This is very weird:
> >>>
> >>>   "up": [
> >>>   2147483647,
> >>>   2147483647,
> >>>   2147483647,
> >>>   2147483647,
> >>>   2147483647
> >>>   ],
> >>>   "acting": [
> >>>   67,
> >>>   91,
> >>>   82,
> >>>   2147483647,
> >>>   112
> >>>   ],
>
> Meanwhile, since a random PG still shows an output like the above one, I
> think I found the problem with the crush rule: it says "choose" rather
> than "chooseleaf"!
>
> rule csd-data-pool {
>  id 5
>  type erasure
>  min_size 3
>  max_size 5
>  step set_chooseleaf_tries 5
>  step set_choose_tries 100
>  step take default class big
>  step choose indep 0 type host<--- HERE!
>  step emit
> }
>
> ...relic of a more complicated, two-step rule... sigh!
>
> > PGs are active if at least 3 shards are up.
> > Our immediate goal remains to get 3 shards up for PG 85.25 (I'm
> > assuming 85.25 remains the one and only PG which is down?)
>
> Yes, 85.25 is still the single 'down' PG.
>
> >> pool 85 'csd-dataonly-ec-pool' erasure size 5 min_size 3 crush_rule 5
> >> object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode warn
> >> last_change 616460 flags
> >> hashpspool,ec_overwrites,nodelete,selfmanaged_snaps stripe_width 12288
> >> application rbd
> >
> > Yup okay, we need to fix that later to make this cluster correctly
> > configured. To be followed up.
>
> At some point, need to update min_size to 4.
>
> >> If I understand correctly, it should now be safe (but I will wait for
> >> your green light) to repeat the same for:
> >> osd.121 chunk 85.11s0
> >> osd.145 chunk 85.33s0
> >>so they can also start.
> >
> > Yes, please go ahead and do the same.
> > I expect that your PG 85.25 will go active as soon as both those OSDs
> > start correctly.
>
> Hmmm, unfortunately not. All OSDs are up, but 85.25 is still down.
> Its chunks are in:
>
> 85.25s0: osd.64
> 85.25s1: osd.140 osd.159
> 85.25s2: osd.96
> 85.25s3: osd.121 osd.176
> 85.25s4: osd.159 osd.56
>
> > BTW, I also noticed in your crush map below that the down osds have
> > crush weight zero!
> > So -- this means they are the only active OSDs for a PG, and they are
> > all set to be drained.
> > How did this happen? It is also surely part of the root cause here!
> >
> > I suggest to reset the crush weight of those back to what it was
> > before, probably 1 ?
>
> At some point I changed those weight to 0., but this was well after the
> beginning of the problem: this helped, at least, healing a lot of
> degraded/undersized.
>
> > After you have all the PGs active, we need to find out why their "up"
> > set is completely bogus.
> > This is evidence that your crush rule is broken.
> > If a PG doesn't have a complete "up" set, then it can never stop being
> > degraded -- the PGs don't know where to go.
>
> Do you think the choose-chooseleaf issue mentioned above, could be the
> culprit?
>
> > I'm curious about that "storage" type you gu

[ceph-users] Re: PG down, due to 3 OSD failing

2022-04-01 Thread Dan van der Ster
We're on the right track!

On Fri, Apr 1, 2022 at 6:57 PM Fulvio Galeazzi  wrote:
>
> Ciao Dan, thanks for your messages!
>
> On 4/1/22 11:25, Dan van der Ster wrote:
> > The PGs are stale, down, inactive *because* the OSDs don't start.
> > Your main efforts should be to bring OSDs up, without purging or
> > zapping or anyting like that.
> > (Currently your cluster is down, but there are hopes to recover. If
> > you start purging things that can result in permanent data loss.).
>
> Sure, will not do anything like purge/whatever, as long as I can abuse
> your patience...
>
>
> >> Looking for the string 'start interval does not contain the required
> >> bound' I found similar errors in the three OSDs:
> >> osd.158: 85.12s0
> >> osd.145: 85.33s0
> >> osd.121: 85.11s0
> >
> > Is that log also for PG 85.12 on the other OSDs?
>
> Not sure I am getting your point here, sorry. I grep'ed that string in
> the above logs, and only found the occurrences I mentioned. To be
> specific, reference to 85.12 was found only on osd.158 and not on the
> other 'down' OSDs.
>

Sorry, my question was confusing, because I didn't notice you had already
mentioned which PG shards are responsible for crashing each OSD.
Just ignore the question. More below...

> >> Here is the output of "pg 85.12 query":
> >>  https://pastebin.ubuntu.com/p/ww3JdwDXVd/
> >>and its status (also showing the other 85.XX, for reference):
> >
> > This is very weird:
> >
> >  "up": [
> >  2147483647,
> >  2147483647,
> >  2147483647,
> >  2147483647,
> >  2147483647
> >  ],
> >  "acting": [
> >  67,
> >  91,
> >  82,
> >  2147483647,
> >  112
> >  ],
> >
> > Right now, do the following:
> >ceph osd set norebalance
> > That will prevent PGs moving from one OSD to another *unless* they are 
> > degraded.
>
> Done

Great, keep it like that for a while until we understand the "crush"
issue, which is different from the OSD crashing issue.
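For later, once everything is back to active+clean, the flag can be cleared again with:

    ceph osd unset norebalance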

>
> > 2. My theory about what happened here. Your crush rule change "osd ->
> > host" below basically asked all PGs to be moved.
> > Some glitch happened and some broken parts of PG 85.12 ended up on
> > some OSDs, now causing those OSDs to crash.
> > 85.12 is "fine", I mean active, now because there are enough complete
> > parts of it on other osds.
> > The fact that "up" above is listing '2147483647' for every osd means
> > your new crush rule is currently broken. Let's deal with fixing that
> > later.
>
> Hmm, in theory, it looks correct, but I see your point and in fact I am
> stuck with some 1-3% fraction of the objects misplaced/degraded, all of
> them in pool 85
>


PGs are active if at least 3 shards are up.
Our immediate goal remains to get 3 shards up for PG 85.25 (I'm
assuming 85.25 remains the one and only PG which is down?)

> > 3. Question -- what is the output of `ceph osd pool ls detail | grep
> > csd-dataonly-ec-pool` ? If you have `min_size 3` there, then this is
> > part of the root cause of the outage here. At the end of this thread,
> > *only after everything is recovered and no PGs are
> > undersized/degraded* , you will need to set it `ceph osd pool set
> > csd-dataonly-ec-pool min_size 4`
>
> Indeed, it's 3. Connected to your last point below (never mess with
> crush rules if there is anything ongoing), during rebalancing there was
> something which was stuck and I think "health detail" was suggesting
> that reducing min-size would help. I took note of the pools for which I
> updated the parameter, and will go back to the proper values once the
> situation is clean again.
>
> pool 85 'csd-dataonly-ec-pool' erasure size 5 min_size 3 crush_rule 5
> object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode warn
> last_change 616460 flags
> hashpspool,ec_overwrites,nodelete,selfmanaged_snaps stripe_width 12288
> application rbd

Yup okay, we need to fix that later to make this cluster correctly
configured. To be followed up.

>
> > 4. The immediate goal should be to try to get osd.158 to start up, by
> > "removing" the corrupted part of PG 85.12 from it.
> > IF we can get osd.158 started, then the same approach should work for
> > the other OSDs.
> >  From your previous log, osd.158 has a broken piece of pg 85.12. Let's
> > export-remove it:
> >
> > ceph-objectstore-tool  --data-path /var/lib/ceph/osd/cephpa1-158/
> > --op ex
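A minimal reconstruction of that export-remove step, for reference (the shard name 85.12s0 comes from the osd.158 log above; the output file name is illustrative; run it only while the osd.158 daemon is not running):

    ceph-objectstore-tool --data-path /var/lib/ceph/osd/cephpa1-158/ \
        --pgid 85.12s0 --op export-remove --file /tmp/pg85.12s0.export

Keep the export file somewhere safe in case the shard turns out to be needed after all.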

[ceph-users] Re: PG down, due to 3 OSD failing

2022-04-01 Thread Dan van der Ster
relevant rule to:
>
> ~]$ ceph --cluster cephpa1 osd lspools | grep 85
> 85 csd-dataonly-ec-pool
> ~]$ ceph --cluster cephpa1 osd pool get csd-dataonly-ec-pool crush_rule
> crush_rule: csd-data-pool
>
> rule csd-data-pool {
>  id 5
>  type erasure
>  min_size 3
>  max_size 5
>  step set_chooseleaf_tries 5
>  step set_choose_tries 100
>  step take default class big
>  step choose indep 0 type host  <--- this was "osd", before
>  step emit
> }

Can you please share the output of `ceph osd tree` ?

We need to understand why crush is not working any more for your pool.
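In the meantime, one possible way (untested here) to check what mappings rule 5 currently produces from the live map, with the rule id and 5-wide EC profile taken from the pool details quoted above:

    ceph --cluster cephpa1 osd getcrushmap -o /tmp/crushmap.bin
    crushtool -i /tmp/crushmap.bin --test --rule 5 --num-rep 5 --show-bad-mappings

If crushtool reports bad (short or empty) mappings, that matches the all-2147483647 "up" sets seen earlier in the thread.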

>
> At the time I changed the rule, there was no 'down' PG, all PGs in the
> cluster were 'active' plus possibly some other state (remapped,
> degraded, whatever) as I had added some new disk servers a few days before.

Never make crush rule changes when any PG is degraded, remapped, or whatever!
They must all be active+clean to consider big changes like injecting a
new crush rule!!

Cheers, Dan

> The rule change, of course, caused some data movement and after a while
> I found those three OSDs down.
>
>Thanks!
>
> Fulvio
>
>
> On 3/30/22 16:48, Dan van der Ster wrote:
> > Hi Fulvio,
> >
> > I'm not sure why that PG doesn't register.
> > But let's look into your log. The relevant lines are:
> >
> >-635> 2022-03-30 14:49:57.810 7ff904970700 -1 log_channel(cluster)
> > log [ERR] : 85.12s0 past_intervals [616435,616454) start interval does
> > not contain the required bound [605868,616454) start
> >
> >-628> 2022-03-30 14:49:57.810 7ff904970700 -1 osd.158 pg_epoch:
> > 616454 pg[85.12s0( empty local-lis/les=0/0 n=0 ec=616435/616435 lis/c
> > 605866/605866 les/c/f 605867/605868/0 616453/616454/616454)
> > [158,168,64,102,156]/[67,91,82,121,112]p67(0) r=-1 lpr=616454
> > pi=[616435,616454)/0 crt=0'0 remapped NOTIFY mbc={}] 85.12s0
> > past_intervals [616435,616454) start interval does not contain the
> > required bound [605868,616454) start
> >
> >-355> 2022-03-30 14:49:57.816 7ff904970700 -1
> > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.22/rpm/el7/BUILD/ceph-14.2.22/src/osd/PG.cc:
> > In function 'void PG::check_past_interval_bounds() const' thread
> > 7ff904970700 time 2022-03-30 14:49:57.811165
> >
> > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.22/rpm/el7/BUILD/ceph-14.2.22/src/osd/PG.cc:
> > 956: ceph_abort_msg("past_interval start interval mismatch")
> >
> >
> > What is the output of `ceph pg 85.12 query` ?
> >
> > What's the history of that PG? was it moved around recently prior to this 
> > crash?
> > Are the other down osds also hosting broken parts of PG 85.12 ?
> >
> > Cheers, Dan
> >
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: PG down, due to 3 OSD failing

2022-04-01 Thread Dan van der Ster
Don't purge anything!

On Fri, Apr 1, 2022 at 9:38 AM Fulvio Galeazzi  wrote:
>
> Ciao Dan,
>  thanks for your time!
>
> So you are suggesting that my problems with PG 85.25 may somehow resolve
> if I manage to bring up the three OSDs currently "down" (possibly due to
> PG 85.12, and other PGs)?
>
> Looking for the string 'start interval does not contain the required
> bound' I found similar errors in the three OSDs:
> osd.158: 85.12s0
> osd.145: 85.33s0
> osd.121: 85.11s0
>
> Here is the output of "pg 85.12 query":
> https://pastebin.ubuntu.com/p/ww3JdwDXVd/
>   and its status (also showing the other 85.XX, for reference):
>
> 85.11  39501 objects, 0 degraded, 0 misplaced, 0 unfound, 165479411712 bytes,
>        0 omap bytes, 0 omap keys, log 3000, stale+active+clean, since 3d,
>        v606021'532631, reported 617659:1827554,
>        up [124,157,68,72,102]p124, acting [124,157,68,72,102]p124,
>        scrub 2022-03-28 07:21:00.566032, deep-scrub 2022-03-28 07:21:00.566032
> 85.12  39704 objects, 39704 degraded, 158816 misplaced, 0 unfound, 166350008320 bytes,
>        0 omap bytes, 0 omap keys, log 3028, active+undersized+degraded+remapped, since 3d,
>        v606021'573200, reported 620336:1839924,
>        up [2147483647,2147483647,2147483647,2147483647,2147483647]p-1,
>        acting [67,91,82,2147483647,112]p67,
>        scrub 2022-03-15 03:25:28.478280, deep-scrub 2022-03-12 19:10:45.866650
> 85.25  39402 objects, 0 degraded, 0 misplaced, 0 unfound, 165108592640 bytes,
>        0 omap bytes, 0 omap keys, log 3098, stale+down+remapped, since 3d,
>        v606021'521273, reported 618930:1734492,
>        up [2147483647,2147483647,2147483647,2147483647,2147483647]p-1,
>        acting [2147483647,2147483647,96,2147483647,2147483647]p96,
>        scrub 2022-03-15 04:08:42.561720, deep-scrub 2022-03-09 17:05:34.205121
> 85.33  39319 objects, 0 degraded, 0 misplaced, 0 unfound, 164740796416 bytes,
>        0 omap bytes, 0 omap keys, log 3000, stale+active+clean, since 3d,
>        v606021'513259, reported 617659:2125167,
>        up [174,112,85,102,124]p174, acting [174,112,85,102,124]p174,
>        scrub 2022-03-28 07:21:12.097873, deep-scrub 2022-03-28 07:21:12.097873
>
> So 85.11 and 85.33 do not look bad, after all: why are the relevant OSDs
> complaining? Is there a way to force them (OSDs) to forget about the
> chunks they possess, as apparently those have already safely migrated
> elsewhere?
>
> Indeed 85.12 is not really healthy...
> As for chunks of 85.12 and 85.25, the 3 down OSDs have:
> osd.121
> 85.12s3
> 85.25s3
> osd.158
> 85.12s0
> osd.145
> none
> I guess I can safely purge osd.145 and re-create it, then.
>
>
> As for the history of the pool, this is an EC pool with metadata in a
> SSD-backed replicated pool. At some point I realized I had made a
> mistake in the allocation rule for the "data" part, so I changed the
> relevant rule to:
>
> ~]$ ceph --cluster cephpa1 osd lspools | grep 85
> 85 csd-dataonly-ec-pool
> ~]$ ceph --cluster cephpa1 osd pool get csd-dataonly-ec-pool crush_rule
> crush_rule: csd-data-pool
>
> rule csd-data-pool {
>  id 5
>  type erasure
>  min_size 3
>  max_size 5
>  step set_chooseleaf_tries 5
>  step set_choose_tries 100
>  step take default class big
>  step choose indep 0 type host  <--- this was "osd", before
>  step emit
> }
>
> At the time I changed the rule, there was no 'down' PG, all PGs in the
> cluster were 'active' plus possibly some other state (remapped,
> degraded, whatever) as I had added some new disk servers a few days before.
> The rule change, of course, caused some data movement and after a while
> I found those three OSDs down.
>
>Thanks!
>
> Fulvio
>
>
> On 3/30/22 16:48, Dan van der Ster wrote:
> > Hi Fulvio,
> >
> > I'm not sure why that PG doesn't register.
> > But let's look into your log. The relevant lines are:
> >
> >-635> 2022-03-30 14:49:57.810 7ff904970700 -1 log_channel(cluster)
> > log [ERR] : 85.12s0 past_intervals [616435,616454) start interval does
> > not contain the required bound [605868,616454) start
> >
> >-628> 2022-03-30 14:49:57.810 7ff904970700 -1 osd.158 pg_epoch:
> > 616454 pg[85.12s0( empty local-lis/les=0/0 n=0 ec=616435/616435 lis/c
> > 605866/605866 les/c/f 605867/605868/0 616453/616454/616454)
> > [158,168,64,102,156]/[67,91,82,121,112]p67(0) r=-1 lpr=616454
> > pi=[616435,616454)/0 crt=0'0 remapped NOTIFY mbc={}] 85.12s0
> > past_intervals [616435,616454) start interval does not contain the
> > required bound [605868,616454) start
> >
> >-355> 2022-03-30 14:49:57.816 7ff904970700 -1
> > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.22/rpm/el7/BUILD/ceph-14.2.22/src/osd/P

[ceph-users] Re: PG down, due to 3 OSD failing

2022-03-30 Thread Dan van der Ster
Hi Fulvio,

I'm not sure why that PG doesn't register.
But let's look into your log. The relevant lines are:

  -635> 2022-03-30 14:49:57.810 7ff904970700 -1 log_channel(cluster)
log [ERR] : 85.12s0 past_intervals [616435,616454) start interval does
not contain the required bound [605868,616454) start

  -628> 2022-03-30 14:49:57.810 7ff904970700 -1 osd.158 pg_epoch:
616454 pg[85.12s0( empty local-lis/les=0/0 n=0 ec=616435/616435 lis/c
605866/605866 les/c/f 605867/605868/0 616453/616454/616454)
[158,168,64,102,156]/[67,91,82,121,112]p67(0) r=-1 lpr=616454
pi=[616435,616454)/0 crt=0'0 remapped NOTIFY mbc={}] 85.12s0
past_intervals [616435,616454) start interval does not contain the
required bound [605868,616454) start

  -355> 2022-03-30 14:49:57.816 7ff904970700 -1
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.22/rpm/el7/BUILD/ceph-14.2.22/src/osd/PG.cc:
In function 'void PG::check_past_interval_bounds() const' thread
7ff904970700 time 2022-03-30 14:49:57.811165

/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.22/rpm/el7/BUILD/ceph-14.2.22/src/osd/PG.cc:
956: ceph_abort_msg("past_interval start interval mismatch")


What is the output of `ceph pg 85.12 query` ?

What's the history of that PG? was it moved around recently prior to this crash?
Are the other down osds also hosting broken parts of PG 85.12 ?

Cheers, Dan

On Wed, Mar 30, 2022 at 3:00 PM Fulvio Galeazzi  wrote:
>
> Ciao Dan,
>  this is what I did with chunk s3, copying it from osd.121 to
> osd.176 (which is managed by the same host).
>
> But still
> pg 85.25 is stuck stale for 85029.707069, current state
> stale+down+remapped, last acting
> [2147483647,2147483647,96,2147483647,2147483647]
>
> So "health detail" apparently plainly ignores osd.176: moreover, its
> output only shows OSD 96, but I just checked again and the other chunks
> are still on OSDs 56,64,140,159 which are all "up".
>
> By the way, you talk about a "bug" in your message: do you have any
> specific one in mind, or was it just a generic synonym for "problem"?
> By the way, I uploaded here:
> https://pastebin.ubuntu.com/p/dTfPkMb7mD/
> a few hundred lines from one of the failed OSDs upon "activate --all".
>
>Thanks
>
> Fulvio
>
> On 29/03/2022 10:53, Dan van der Ster wrote:
> > Hi Fulvio,
> >
> > I don't think upmap will help -- that is used to remap where data
> > should be "up", but your problem is more that the PG chunks are not
> > going active due to the bug.
> >
> > What happens if you export one of the PG chunks then import it to
> > another OSD -- does that chunk become active?
> >
> > -- dan
> >
> >
> >
> > On Tue, Mar 29, 2022 at 10:51 AM Fulvio Galeazzi
> >  wrote:
> >>
> >> Hallo again Dan, I am afraid I'd need a little more help, please...
> >>
> >> Current status is as follows.
> >>
> >> This is where I moved the chunk which was on osd.121:
> >> ~]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/cephpa1-176
> >> --no-mon-config --op list-pgs  | grep ^85\.25
> >> 85.25s3
> >>
> >> while other chunks are (server - osd.id):
> >> === r2srv101.pa1.box.garr - 96
> >> 85.25s2
> >> === r2srv100.pa1.box.garr - 121  <-- down, chunk is on osd.176
> >> 85.25s3
> >> === r2srv100.pa1.box.garr - 159
> >> 85.25s1
> >> 85.25s4
> >> === r1-sto09.pa1.box.garr - 56
> >> 85.25s4
> >> === r1-sto09.pa1.box.garr - 64
> >> 85.25s0
> >> === r3srv15.pa1.box.garr - 140
> >> 85.25s1
> >>
> >> Health detail shows that just one chunk can be found (if I understand
> >> the output correctly):
> >>
> >> ~]# ceph health detail | grep 85\.25
> >>   pg 85.25 is stuck stale for 5680.315732, current state
> >> stale+down+remapped, last acting
> >> [2147483647,2147483647,96,2147483647,2147483647]
> >>
> >> Can I run some magic upmap command to explain my cluster where all the
> >> chunks are? What would be the right syntax?
> >> Little additional problem: I see s1 and s4 twice... I guess this was due
> >> to remapping, as I was adding disks to the cluster: which one is the
> >> right copy?
> >>
> >> Thanks!
> >>
> >>  Fulvio
> >>
> >> Il 3/29/2022 9:35 AM, Fulvio Galeazzi ha s

[ceph-users] Re: PG down, due to 3 OSD failing

2022-03-29 Thread Dan van der Ster
Hi Fulvio,

I don't think upmap will help -- that is used to remap where data
should be "up", but your problem is more that the PG chunks are not
going active due to the bug.

What happens if you export one of the PG chunks then import it to
another OSD -- does that chunk become active?

-- dan
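A minimal sketch of such an export/import test, using the shard and OSD ids mentioned elsewhere in this thread (85.25s3 on osd.121, imported into osd.176); both OSDs must be stopped while ceph-objectstore-tool runs, and the file has to be copied over if the OSDs live on different hosts:

    # on the source OSD's host:
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/cephpa1-121 \
        --no-mon-config --pgid 85.25s3 --op export --file /tmp/pg85.25s3.export
    # on the target OSD's host:
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/cephpa1-176 \
        --no-mon-config --op import --file /tmp/pg85.25s3.export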



On Tue, Mar 29, 2022 at 10:51 AM Fulvio Galeazzi
 wrote:
>
> Hallo again Dan, I am afraid I'd need a little more help, please...
>
> Current status is as follows.
>
> This is where I moved the chunk which was on osd.121:
> ~]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/cephpa1-176
> --no-mon-config --op list-pgs  | grep ^85\.25
> 85.25s3
>
> while other chunks are (server - osd.id):
> === r2srv101.pa1.box.garr - 96
> 85.25s2
> === r2srv100.pa1.box.garr - 121  <-- down, chunk is on osd.176
> 85.25s3
> === r2srv100.pa1.box.garr - 159
> 85.25s1
> 85.25s4
> === r1-sto09.pa1.box.garr - 56
> 85.25s4
> === r1-sto09.pa1.box.garr - 64
> 85.25s0
> === r3srv15.pa1.box.garr - 140
> 85.25s1
>
> Health detail shows that just one chunk can be found (if I understand
> the output correctly):
>
> ~]# ceph health detail | grep 85\.25
>  pg 85.25 is stuck stale for 5680.315732, current state
> stale+down+remapped, last acting
> [2147483647,2147483647,96,2147483647,2147483647]
>
> Can I run some magic upmap command to explain my cluster where all the
> chunks are? What would be the right syntax?
> Little additional problem: I see s1 and s4 twice... I guess this was due
> to remapping, as I was adding disks to the cluster: which one is the
> right copy?
>
>Thanks!
>
> Fulvio
>
> Il 3/29/2022 9:35 AM, Fulvio Galeazzi ha scritto:
> > Thanks a lot, Dan!
> >
> >  > The EC pgs have a naming convention like 85.25s1 etc.. for the various
> >  > k/m EC shards.
> >
> > That was the bit of information I was missing... I was looking for the
> > wrong object.
> > I can now go on and export/import that one PGid chunk.
> >
> >Thanks again!
> >
> >  Fulvio
> >
> > On 28/03/2022 16:27, Dan van der Ster wrote:
> >> Hi Fulvio,
> >>
> >> You can check (offline) which PGs are on an OSD with the list-pgs op,
> >> e.g.
> >>
> >> ceph-objectstore-tool  --data-path /var/lib/ceph/osd/cephpa1-158/
> >> --op list-pgs
> >>
> >>
> >> -- dan
> >>
> >>
> >> On Mon, Mar 28, 2022 at 2:29 PM Fulvio Galeazzi
> >>  wrote:
> >>>
> >>> Hallo,
> >>>   all of a sudden, 3 of my OSDs failed, showing similar messages in
> >>> the log:
> >>>
> >>> .
> >>>   -5> 2022-03-28 14:19:02.451 7fc20fe99700  5 osd.145 pg_epoch:
> >>> 616454 pg[70.2c6s1( empty local-lis/les=612106/612107 n=0
> >>> ec=148456/148456 lis/c
> >>>612106/612106 les/c/f 612107/612107/0 612106/612106/612101)
> >>> [168,145,102,96,112,124,128,134,56,34]p168(0) r=1 lpr=616429 crt=0'0
> >>> unknown mbc={}]
> >>> enter Started
> >>>   -4> 2022-03-28 14:19:02.451 7fc20fe99700  5 osd.145 pg_epoch:
> >>> 616454 pg[70.2c6s1( empty local-lis/les=612106/612107 n=0
> >>> ec=148456/148456 lis/c
> >>>612106/612106 les/c/f 612107/612107/0 612106/612106/612101)
> >>> [168,145,102,96,112,124,128,134,56,34]p168(0) r=1 lpr=616429 crt=0'0
> >>> unknown mbc={}]
> >>> enter Start
> >>>   -3> 2022-03-28 14:19:02.451 7fc20fe99700  1 osd.145 pg_epoch:
> >>> 616454 pg[70.2c6s1( empty local-lis/les=612106/612107 n=0
> >>> ec=148456/148456 lis/c
> >>>612106/612106 les/c/f 612107/612107/0 612106/612106/612101)
> >>> [168,145,102,96,112,124,128,134,56,34]p168(0) r=1 lpr=616429 crt=0'0
> >>> unknown mbc={}]
> >>> state: transitioning to Stray
> >>>   -2> 2022-03-28 14:19:02.451 7fc20fe99700  5 osd.145 pg_epoch:
> >>> 616454 pg[70.2c6s1( empty local-lis/les=612106/612107 n=0
> >>> ec=148456/148456 lis/c
> >>>612106/612106 les/c/f 612107/612107/0 612106/612106/612101)
> >>> [168,145,102,96,112,124,128,134,56,34]p168(0) r=1 lpr=616429 crt=0'0
> >>> unknown mbc={}]
> >>> exit Start 0.08 0 0.00
> >>>   -1> 2022-03-28 14:19:02.451 7fc20fe99700  5 osd.145 pg_epoch:
> >>> 616454 pg[70.2c6s1( empty local-lis/les=612106/612107 n=0
> >>> ec=148456/148456 lis/c
> >>>612106/612106 les/c/f 612107/612107/0 612106/612106/612

[ceph-users] Re: PG down, due to 3 OSD failing

2022-03-28 Thread Dan van der Ster
Hi Fulvio,

You can check (offline) which PGs are on an OSD with the list-pgs op, e.g.

ceph-objectstore-tool  --data-path /var/lib/ceph/osd/cephpa1-158/  --op list-pgs

The EC pgs have a naming convention like 85.25s1 etc.. for the various
k/m EC shards.

-- dan


On Mon, Mar 28, 2022 at 2:29 PM Fulvio Galeazzi  wrote:
>
> Hallo,
>  all of a sudden, 3 of my OSDs failed, showing similar messages in
> the log:
>
> .
>  -5> 2022-03-28 14:19:02.451 7fc20fe99700  5 osd.145 pg_epoch:
> 616454 pg[70.2c6s1( empty local-lis/les=612106/612107 n=0
> ec=148456/148456 lis/c
>   612106/612106 les/c/f 612107/612107/0 612106/612106/612101)
> [168,145,102,96,112,124,128,134,56,34]p168(0) r=1 lpr=616429 crt=0'0
> unknown mbc={}]
> enter Started
>  -4> 2022-03-28 14:19:02.451 7fc20fe99700  5 osd.145 pg_epoch:
> 616454 pg[70.2c6s1( empty local-lis/les=612106/612107 n=0
> ec=148456/148456 lis/c
>   612106/612106 les/c/f 612107/612107/0 612106/612106/612101)
> [168,145,102,96,112,124,128,134,56,34]p168(0) r=1 lpr=616429 crt=0'0
> unknown mbc={}]
> enter Start
>  -3> 2022-03-28 14:19:02.451 7fc20fe99700  1 osd.145 pg_epoch:
> 616454 pg[70.2c6s1( empty local-lis/les=612106/612107 n=0
> ec=148456/148456 lis/c
>   612106/612106 les/c/f 612107/612107/0 612106/612106/612101)
> [168,145,102,96,112,124,128,134,56,34]p168(0) r=1 lpr=616429 crt=0'0
> unknown mbc={}]
> state: transitioning to Stray
>  -2> 2022-03-28 14:19:02.451 7fc20fe99700  5 osd.145 pg_epoch:
> 616454 pg[70.2c6s1( empty local-lis/les=612106/612107 n=0
> ec=148456/148456 lis/c
>   612106/612106 les/c/f 612107/612107/0 612106/612106/612101)
> [168,145,102,96,112,124,128,134,56,34]p168(0) r=1 lpr=616429 crt=0'0
> unknown mbc={}]
> exit Start 0.08 0 0.00
>  -1> 2022-03-28 14:19:02.451 7fc20fe99700  5 osd.145 pg_epoch:
> 616454 pg[70.2c6s1( empty local-lis/les=612106/612107 n=0
> ec=148456/148456 lis/c
>   612106/612106 les/c/f 612107/612107/0 612106/612106/612101)
> [168,145,102,96,112,124,128,134,56,34]p168(0) r=1 lpr=616429 crt=0'0
> unknown mbc={}]
> enter Started/Stray
>   0> 2022-03-28 14:19:02.451 7fc20f698700 -1 *** Caught signal
> (Aborted) **
>   in thread 7fc20f698700 thread_name:tp_osd_tp
>
>   ceph version 14.2.22 (ca74598065096e6fcbd8433c8779a2be0c889351)
> nautilus (stable)
>   1: (()+0x12ce0) [0x7fc2327dcce0]
>   2: (gsignal()+0x10f) [0x7fc231452a4f]
>   3: (abort()+0x127) [0x7fc231425db5]
>   4: (ceph::__ceph_abort(char const*, int, char const*,
> std::__cxx11::basic_string,
> std::allocator > const&)+0x1b4) [0x55b8139cb671]
>   5: (PG::check_past_interval_bounds() const+0xc16) [0x55b813b586f6]
>   6: (PG::RecoveryState::Reset::react(PG::AdvMap const&)+0x3e8)
> [0x55b813b963d8]
>   7: (boost::statechart::simple_state PG::RecoveryState::RecoveryMachine, boost::mpl::list mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
> mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
> mpl_::na, mpl_::na, mpl_::na, mpl_::na>,
> (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base
> const&, void const*)+0x7d) [0x55b813bdd32d]
>   8: (PG::handle_advance_map(std::shared_ptr,
> std::shared_ptr, std::vector >&,
> int, std::vector >&, int,
> PG::RecoveryCtx*)+0x39d) [0x55b813b7b5fd]
>   9: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&,
> PG::RecoveryCtx*)+0x2e9) [0x55b813ad14e9]
>   10: (OSD::dequeue_peering_evt(OSDShard*, PG*,
> std::shared_ptr, ThreadPool::TPHandle&)+0xaa)
> [0x55b813ae345a]
>   11: (PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr&,
> ThreadPool::TPHandle&)+0x55) [0x55b813d66c15]
>   12: (OSD::ShardedOpWQ::_process(unsigned int,
> ceph::heartbeat_handle_d*)+0x1366) [0x55b813adff46]
>   13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4)
> [0x55b8140dc944]
>   14: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x55b8140df514]
>   15: (()+0x81cf) [0x7fc2327d21cf]
>   16: (clone()+0x43) [0x7fc23143dd83]
>
> Trying to "activate --all", rebotting server, and such, did not help.
>
> I am now stuck with one PG (85.25) down, find below the output from "query".
>
> The PG belongs to a 3+2 erasure-coded pool.
> As the devices corresponding to the 3 down OSDs are properly mounted, is
> there a way to get PG.ID=85.25 from the devices and copy it elsewhere?
> Actually, I tried to find 85.25 in the 3 down OSDs with command:
> ~]# ceph-objectstore-tool  --data-path /var/lib/ceph/osd/cephpa1-158/
> --no-mon-config --pgid 85.25 --op export --file /tmp/pg_85-25
> PG '85.25' not found
>which puzzled me... is there a way to search such PG.ID over the
> whole cluster?
>
>Thanks for your help!
>
> Fulvio
>
> 
>
> ~]# ceph --cluster cephpa1 health detail | grep down
> .
> PG_AVAILABILITY Reduced data availability: 1 pg inactive, 1 pg down
>  pg 85.25 is down+remapped, acting
> [2147483647,2147483647,96,2147483647,2147483647]
> ~]# ceph --cluster cephpa1 pg 85.25 query
> {

[ceph-users] Re: ceph mon failing to start

2022-03-28 Thread Dan van der Ster
Are the two running mons also running 14.2.9 ?

--- dan
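A quick way to check the running daemon versions (the mon names are taken from the status output below):

    ceph versions
    ceph tell mon.pxmx1 version
    ceph tell mon.pxmx3 version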

On Mon, Mar 28, 2022 at 8:27 AM Tomáš Hodek  wrote:
>
> Hi, I have a 3-node ceph cluster (managed via Proxmox). I got a fatal
> single-node failure and replaced the node. The OS boots correctly, but the
> monitor on the failed node does not start successfully; the other 2
> monitors are OK, and ceph status is healthy:
>
> ceph -s
> cluster:
> id: 845868a1-9902-4b61-aa06-0767cb09f1c2
> health: HEALTH_OK
>
> services:
> mon: 2 daemons, quorum pxmx1,pxmx3 (age 2h)
> mgr: pxmx1(active, since 56m), standbys: pxmx3
> osd: 18 osds: 18 up (since 111m), 18 in (since 3h)
>
> data:
> pools: 1 pools, 256 pgs
> objects: 2.12M objects, 8.1 TiB
> usage: 24 TiB used, 21 TiB / 45 TiB avail
> pgs: 256 active+clean
>
> content of ceph.conf
>
> [global]
> auth_client_required = cephx
> auth_cluster_required = cephx
> auth_service_required = cephx
> cluster_network = 10.60.10.1/24
> fsid = 845868a1-9902-4b61-aa06-0767cb09f1c2
> mon_allow_pool_delete = true
> mon_host = 10.60.10.1 10.60.10.3 10.60.10.2
> osd_pool_default_min_size = 2
> osd_pool_default_size = 3
> public_network = 10.60.10.1/24
>
> [client]
> keyring = /etc/pve/priv/$cluster.$name.keyring
>
> [mds]
> keyring = /var/lib/ceph/mds/ceph-$id/keyring
>
> The monitor is failing (at least as I understand the problem) with the
> following logged error:
>
> mon.pxmx2@-1(probing) e4 handle_auth_request failed to assign global_id
>
> whole mon log attached.
>
> I have tried to scrap the dead monitor and recreate it via the Proxmox GUI
> and shell, and I have even created the content of /var/lib/ceph/mon/
> manually and tried to run the monitor from the terminal. It starts, listens
> for connections on ports 3300 and 6789, but does not communicate properly
> with the other remaining mons.
>
> thanks for info
>
> Tomas Hodek___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD(s) reporting legacy (not per-pool) BlueStore omap usage stats

2022-03-10 Thread Dan van der Ster
Hi,

After Nautilus there were two omap usage stats upgrades:
Octopus (v15) fsck (on by default) enables per-pool omap usage stats.
Pacific (v16) fsck (off by default) enables per-pg omap usage stats.
(fsck is off by default in pacific because it takes quite some time to
update the on-disk metadata, and btw the pacific fsck had a data
corrupting bug until 16.2.7 [1]).

You're getting a warning because you skipped over Octopus (which is
ok!) but this jump means you miss the per-pool omap stats upgrade.
Confusingly, the per-pool omap warning is *on* by default, hence the
warning message.

You can disable the per-pool warning with:
ceph config set global bluestore_warn_on_no_per_pool_omap false

Or you can decide to fsck the OSDs now with 16.2.7. This will add
per-pg omap stats and clear the warning.

This is documented here:

https://docs.ceph.com/en/latest/rados/operations/health-checks/#bluestore-no-per-pool-omap

Cheers, Dan

On Thu, Mar 10, 2022 at 12:43 PM Claas Goltz  wrote:
>
> Hi,
>
> I’m in the process of upgrading all our ceph servers from 14.2.9 to 16.2.7.
>
> Two of three monitors are on 16.2.6 and one is 16.2.7. I will update them
> soon.
>
>
>
> Before updating to 16.2.6/7 I set bluestore_fsck_quick_fix_on_mount to
> false, and I have already upgraded more than half of my OSD hosts (10 so
> far) to the latest version without any problems. My health check now
> says:
>
> “92 OSD(s) reporting legacy (not per-pool) BlueStore omap usage stats”
>
>
>
> How should I handle the warning now?
>
> Thanks!
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: octopus (15.2.16) OSDs crash or don't answer heathbeats (and get marked as down)

2022-03-08 Thread Dan van der Ster
Here's the reason they exit:

7f1605dc9700 -1 osd.97 486896 _committed_osd_maps marked down 6 >
osd_max_markdown_count 5 in last 600.00 seconds, shutting down

If an osd flaps (marked down, then up) 6 times in 10 minutes, it
exits. (This is a safety measure).

It's normally caused by a network issue -- other OSDs are telling the
mon that he is down, but then the OSD himself tells the mon that he's
up!
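The thresholds behind that safety measure are tunable if you need more headroom while debugging (an illustrative example; the values in the log are the defaults, count 5 within 600 seconds):

    ceph config set osd osd_max_markdown_count 10
    ceph config set osd osd_max_markdown_period 600

Raising them only hides the symptom, though -- the underlying heartbeat/network problem still needs fixing.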

Cheers, Dan

On Mon, Mar 7, 2022 at 10:36 PM Boris Behrens  wrote:
>
> Hi,
>
> we've had the problem with OSDs being marked as offline since we updated to
> octopus and hoped the problem would be fixed with the latest patch. We have
> this kind of problem only with octopus and there only with the big s3
> cluster.
> * Hosts are all Ubuntu 20.04 and we've set the txqueuelen to 10k
> * Network interfaces are 20gbit (2x10 in a 802.3ad encap3+4 bond)
> * We only use the frontend network.
> * All disks are spinning, some have block.db devices.
> * All disks are bluestore
> * configs are mostly defaults
> * we've set the OSDs to restart=always without a limit, because we had the
> problem with unavailable PGs when two OSDs are marked as offline and they
> share PGs.
>
> But since we installed the latest patch we are experiencing more OSD downs
> and even crashes.
> I tried to remove as many duplicated lines as possible.
>
> Is the numa error a problem?
> Why do OSD daemons not respond to heartbeats? I mean, even when the disk is
> totally loaded with IO, the system itself should still answer heartbeats, or am I
> missing something?
>
> I really hope some of you can point me in the right direction to solve this
> nasty problem.
>
> This is how the latest crash looks like
> Mar 07 17:44:15 s3db18 ceph-osd[4530]: 2022-03-07T17:44:15.099+
> 7f5f05d2a700 -1 osd.161 489755 set_numa_affinity unable to identify public
> interface '' numa node: (2) No such file or directory
> ...
> Mar 07 17:49:07 s3db18 ceph-osd[4530]: 2022-03-07T17:49:07.678+
> 7f5f05d2a700 -1 osd.161 489774 set_numa_affinity unable to identify public
> interface '' numa node: (2) No such file or directory
> Mar 07 17:53:07 s3db18 ceph-osd[4530]: *** Caught signal (Aborted) **
> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  in thread 7f5ef1501700
> thread_name:tp_osd_tp
> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  ceph version 15.2.16
> (d46a73d6d0a67a79558054a3a5a72cb561724974) octopus (stable)
> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  1: (()+0x143c0) [0x7f5f0d4623c0]
> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  2: (pthread_kill()+0x38)
> [0x7f5f0d45ef08]
> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  3:
> (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*,
> unsigned long)+0x471) [0x55a699a01201]
> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  4:
> (ceph::HeartbeatMap::reset_timeout(ceph::heartbeat_handle_d*, unsigned
> long, unsigned long)+0x8e) [0x55a699a0199e]
> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  5:
> (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x3f0)
> [0x55a699a224b0]
> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  6:
> (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x55a699a252c4]
> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  7: (()+0x8609) [0x7f5f0d456609]
> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  8: (clone()+0x43) [0x7f5f0cfc0163]
> Mar 07 17:53:07 s3db18 ceph-osd[4530]: 2022-03-07T17:53:07.387+
> 7f5ef1501700 -1 *** Caught signal (Aborted) **
> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  in thread 7f5ef1501700
> thread_name:tp_osd_tp
> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  ceph version 15.2.16
> (d46a73d6d0a67a79558054a3a5a72cb561724974) octopus (stable)
> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  1: (()+0x143c0) [0x7f5f0d4623c0]
> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  2: (pthread_kill()+0x38)
> [0x7f5f0d45ef08]
> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  3:
> (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*,
> unsigned long)+0x471) [0x55a699a01201]
> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  4:
> (ceph::HeartbeatMap::reset_timeout(ceph::heartbeat_handle_d*, unsigned
> long, unsigned long)+0x8e) [0x55a699a0199e]
> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  5:
> (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x3f0)
> [0x55a699a224b0]
> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  6:
> (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x55a699a252c4]
> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  7: (()+0x8609) [0x7f5f0d456609]
> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  8: (clone()+0x43) [0x7f5f0cfc0163]
> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  NOTE: a copy of the executable, or
> `objdump -rdS ` is needed to interpret this.
> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  -5246> 2022-03-07T17:49:07.678+
> 7f5f05d2a700 -1 osd.161 489774 set_numa_affinity unable to identify public
> interface '' numa node: (2) No such file or directory
> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  0> 2022-03-07T17:53:07.387+
> 7f5ef1501700 -1 *** Caught signal (Aborted) **
> Mar 07 17:53:07 s3db18 ceph-osd[4530]:  in thread 

[ceph-users] Re: Retrieving cephx key from ceph-fuse

2022-03-07 Thread Dan van der Ster
On Fri, Mar 4, 2022 at 2:07 PM Robert Vasek  wrote:
>
> Is there a way for an attacker with sufficient privileges to retrieve the
> key by somehow mining it off of the process memory of ceph-fuse which is
> now maintaining the volume mount?

Yes, one should assume that if they can gcore dump the ceph-fuse
process, they can extract the cephx key.
In normal deployment scenarios only a root user should be able to do
this, and they would normally have the key anyway.

Cheers, Dan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Errors when scrub ~mdsdir and lots of num_strays

2022-03-01 Thread Dan van der Ster
Hi,

There was a recent (long) thread about this. It might give you some hints:
   
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/2NT55RUMD33KLGQCDZ74WINPPQ6WN6CW/

And about the crash, it could be related to
https://tracker.ceph.com/issues/51824

Cheers, dan
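To see how the ~2M strays are spread over rank 0's ten stray directories, a minimal sketch (assuming the CephFS metadata pool is named cephfs_metadata; rank 0's stray dirfrags are the objects 600.00000000 through 609.00000000, and fragmented stray dirs will have additional objects not counted here):

    for i in 0 1 2 3 4 5 6 7 8 9; do
      echo -n "stray$i: "
      rados -p cephfs_metadata listomapkeys 60$i.00000000 | wc -l
    done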


On Tue, Mar 1, 2022 at 11:30 AM Arnaud M  wrote:
>
> Hello Dan
>
> Thanks a lot for the answer
>
> I do remove the snaps every day (I keep them for one month),
> but the "num_strays" value never seems to decrease.
>
> I know I can do a listing of the folder with "find . -ls".
>
> So my question is: is there a way to find the directories causing the strays, so
> I can "find . -ls" them? I would prefer not to do it on my whole cluster, as
> it will take time (several days, and more if I need to do it on every
> snap as well) and will certainly overload the MDS.
>
> Please let me know if there is a way to spot the source of the strays, so I can
> find the folder/snap contributing the most strays.
>
> And what about the scrub of ~mdsdir who crashes every times with the error:
>
> {
> "damage_type": "dir_frag",
> "id": 3776355973,
> "ino": 1099567262916,
> "frag": "*",
>     "path": "~mds0/stray3/1000350ecc4"
> },
>
> Again, thanks for your help, that is really appreciated
>
> All the best
>
> Arnaud
>
> Le mar. 1 mars 2022 à 11:02, Dan van der Ster  a écrit :
>>
>> Hi,
>>
>> stray files are created when you have hardlinks to deleted files, or
>> snapshots of deleted files.
>> You need to delete the snapshots, or "reintegrate" the hardlinks by
>> recursively listing the relevant files.
>>
>> BTW, in pacific there isn't a big problem with accumulating lots of
>> stray files. (Before pacific there was a default limit of 1M strays,
>> but that is now removed).
>>
>> Cheers, dan
>>
>> On Tue, Mar 1, 2022 at 1:04 AM Arnaud M  wrote:
>> >
>> > Hello to everyone
>> >
>> > Our ceph cluster is healthy and everything seems to go well but we have a
>> > lot of num_strays
>> >
>> > ceph tell mds.0 perf dump | grep stray
>> > "num_strays": 1990574,
>> > "num_strays_delayed": 0,
>> > "num_strays_enqueuing": 0,
>> > "strays_created": 3,
>> > "strays_enqueued": 17,
>> > "strays_reintegrated": 0,
>> > "strays_migrated": 0,
>> >
>> > And num_strays doesn't seems to reduce whatever we do (scrub / or scrub
>> > ~mdsdir)
>> > And when we scrub ~mdsdir (force,recursive,repair) we get thoses error
>> >
>> > {
>> > "damage_type": "dir_frag",
>> > "id": 3775653237,
>> > "ino": 1099569233128,
>> > "frag": "*",
>> > "path": "~mds0/stray3/100036efce8"
>> > },
>> > {
>> > "damage_type": "dir_frag",
>> > "id": 3776355973,
>> > "ino": 1099567262916,
>> > "frag": "*",
>> > "path": "~mds0/stray3/1000350ecc4"
>> > },
>> > {
>> > "damage_type": "dir_frag",
>> > "id": 3776485071,
>> > "ino": 1099559071399,
>> > "frag": "*",
>> > "path": "~mds0/stray4/10002d3eea7"
>> > },
>> >
>> > And just before the end of the ~mdsdir scrub the mds crashes and I have to
>> > do a
>> >
>> > ceph mds repaired 0 to have the filesystem back online
>> >
>> > A lot of them. Do you have any ideas of what those errors are and how
>> > should I handle them ?
>> >
>> > We have a lot of data in our cephfs cluster 350 TB+ and we takes snapshot
>> > everyday of / and keep them for 1 month (rolling)
>> >
>> > here is our cluster state
>> >
>> > ceph -s
>> >   cluster:
>> > id: 817b5736-84ae-11eb-bf7b-c9513f2d60a9
>> > health: HEALTH_WARN
>> > 78 pgs not deep-scrubbed in time
>> > 70 pgs not scrubbed in time
>> >
>> >   services:
>> > mon: 3 daemons, quorum ceph-r-112-1,ceph-g

  1   2   3   4   5   6   >