[ceph-users] ceph tell mds.0 dirfrag split - syntax of the "frag" argument

2024-05-15 Thread Alexander E. Patrakov
Hello,

In the context of https://tracker.ceph.com/issues/64298, I decided to
try splitting directory fragments manually. In the help output of "ceph
tell" for an MDS, I found these possibly useful commands:

dirfrag ls <path> : List fragments in directory
dirfrag merge <path> <frag> : De-fragment directory by path
dirfrag split <path> <frag> <bits> : Fragment directory by path

They accept a "frag" argument that is underdocumented. The test suite
does use these commands, and it seems that this argument accepts some
notation containing a slash, which "dirfrag ls" also produces as its
"str" field.

Can anyone explain the meaning of the parts before and after the
slash? What is the relation between the accepted values for "dirfrag
split" and the output of "dirfrag ls" - do I just feed the fragment
from "dirfrag ls" to "dirfrag split" as-is? Is running "dirfrag split"
manually safe on a production cluster?
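
For reference, my reading of the test suite (not verified on a
production cluster, so please correct me if this is wrong) is that the
frag string from "dirfrag ls" is passed back verbatim, plus the number
of extra bits to split by, e.g.:

ceph tell mds.0 dirfrag ls /some/dir
ceph tell mds.0 dirfrag split /some/dir 0/0 1

where "0/0" would be the single fragment covering the whole directory.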

Thanks in advance.

-- 
Alexander Patrakov


[ceph-users] Re: Forcing Posix Permissions On New CephFS Files

2024-05-09 Thread Alexander E. Patrakov
Hello Matthew,

You can inherit the group, but not the user, of the containing folder.
This can be achieved by making the folder setgid and then making sure
that the client systems have a proper umask. See the attached PDF for
a presentation I gave on this topic to my students a while back.
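
In short, something along these lines (hypothetical paths, assuming the
share is mounted at /mnt/music and the group is "music"):

# once, on any client:
chgrp -R music /mnt/music
find /mnt/music -type d -exec chmod 2770 {} +   # setgid => new entries inherit the group

# on every client, e.g. via /etc/profile.d/umask.sh:
umask 007

With the setgid bit set on the directories, new files get the "music"
group automatically, and the 007 umask turns the default 666/777
creation modes into 660/770.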

On Thu, May 9, 2024 at 2:03 PM duluxoz  wrote:
>
> Hi All,
>
> I've gone and gotten myself into a "can't see the forest for the trees"
> state, so I'm hoping someone can take pity on me and answer a really dumb Q.
>
> So I've got a CephFS system happily bubbling along and a bunch of
> (linux) workstations connected to a number of common shares/folders. To
> take a single one of these folders as an example ("music") the
> sub-folders and files of that share all belong to root:music with
> permissions of 2770 (folders) and 0660 (files). The "music" folder is
> then connected to (as per the Ceph Doco: mount.ceph) via each
> workstation's fstab file - all good, all working, everyone's happy.
>
> What I'm trying to achieve is that when a new piece of music (a file) is
> uploaded to the Ceph Cluster the file inherits the music share's default
> ownership (root:music) and permissions (0660). What is happening at the
> moment is I'm getting permissions of 644 (and 755 for new folders).
>
> I've been looking for a way to do what I want but, as I said, I've gone
> and gotten myself thoroughly mixed-up.
>
> Could someone please point me in the right direction on how to achieve
> what I'm after - thanks
>
> Cheers
>
> Dulux-Oz
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io



-- 
Alexander Patrakov


[ceph-users] Re: RGW multisite slowness issue due to the "304 Not Modified" responses on primary zone

2024-05-01 Thread Alexander E. Patrakov
Hello Saif,

Unfortunately, I have no other ideas that could help you.

On Wed, May 1, 2024 at 4:48 PM Saif Mohammad  wrote:
>
> Hi Alexander,
>
> We have configured the parameters below in our infrastructure to fix the issue,
> and despite tuning them or even setting them to higher levels, the issue still
> persists. We have shared the latency between the DC and DR site for your
> reference. Please advise on alternative solutions to resolve this issue, as
> this is very crucial for us.
>
> - rgw_bucket_sync_spawn_window
> - rgw_data_sync_spawn_window
> - rgw_meta_sync_spawn_window
>
> root@host-01:~# ping 
> PING  (ip) 56(84) bytes of data.
> 64 bytes from : icmp_seq=1 ttl=60 time=41.9 ms
> 64 bytes from : icmp_seq=2 ttl=60 time=41.5 ms
> 64 bytes from : icmp_seq=3 ttl=60 time=41.6 ms
> 64 bytes from  : icmp_seq=4 ttl=60 time=50.8 ms
>
>
> Any guidance would be greatly appreciated.
>
> Regards,
> Mohammad Saif
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io



-- 
Alexander E. Patrakov


[ceph-users] Re: Best practice and expected benefits of using separate WAL and DB devices with Bluestore

2024-04-21 Thread Alexander E. Patrakov
Hello Anthony,

Do you have any data on the reliability of QLC NVMe drives? How old is
your deep archive cluster, how many NVMes it has, and how many did you
have to replace?

On Sun, Apr 21, 2024 at 11:06 PM Anthony D'Atri  wrote:
>
> A deep archive cluster benefits from NVMe too.  You can use QLC up to 60TB in 
> size, 32 of those in one RU makes for a cluster that doesn’t take up the 
> whole DC.
>
> > On Apr 21, 2024, at 5:42 AM, Darren Soothill  
> > wrote:
> >
> > Hi Niklaus,
> >
> > Lots of questions here but let me try to get through some of them.
> >
> > Personally, unless a cluster is for deep archive, I would never suggest
> > configuring or deploying a cluster without RocksDB and WAL on NVMe.
> > There are a number of benefits to this in terms of performance and 
> > recovery. Small writes go to the NVME first before being written to the HDD 
> > and it makes many recovery operations far more efficient.
> >
> > As to how much faster it makes things that very much depends on the type of 
> > workload you have on the system. Lots of small writes will make a 
> > significant difference. Very large writes not as much of a difference.
> > Things like compactions of the RocksDB database are a lot faster as they 
> > are now running from NVME and not from the HDD.
> >
> > We normally work with up to a 1:12 ratio, so 1 NVMe for every 12 HDDs. This
> > is assuming the NVMes being used are good mixed-use enterprise NVMes with
> > power-loss protection.
> >
> > As to failures yes a failure of the NVME would mean a loss of 12 OSD’s but 
> > this is no worse than a failure of an entire node. This is something Ceph 
> > is designed to handle.
> >
> > I certainly wouldn’t be thinking about putting the NVME’s into raid sets as 
> > that will degrade the performance of them when you are trying to get better 
> > performance.
> >
> >
> >
> > Darren Soothill
> >
> >
> > Looking for help with your Ceph cluster? Contact us at https://croit.io/
> >
> > croit GmbH, Freseniusstr. 31h, 81247 Munich
> > CEO: Martin Verges - VAT-ID: DE310638492
> > Com. register: Amtsgericht Munich HRB 231263
> > Web: https://croit.io/ | YouTube: https://goo.gl/PGE1Bx
> >
> >
> >
> >
> > _______
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io



-- 
Alexander E. Patrakov


[ceph-users] Re: MDS Behind on Trimming...

2024-04-07 Thread Alexander E. Patrakov
Hi Erich,

On Mon, Apr 8, 2024 at 11:51 AM Erich Weiler  wrote:
>
> Hi Xiubo,
>
> > Thanks for your logs, and it should be the same issue with
> > https://tracker.ceph.com/issues/62052, could you try to test with this
> > fix again ?
>
> This sounds good - but I'm not clear on what I should do?  I see a patch
> in that tracker page, is that what you are referring to?  If so, how
> would I apply such a patch?  Or is there simply a binary update I can
> apply somehow to the MDS server software?

The backport of this patch (https://github.com/ceph/ceph/pull/53241)
was merged on October 18, 2023, and Ceph 18.2.1 was released on
December 18, 2023. Therefore, if you are running Ceph 18.2.1 or 18.2.2
on the server side (and you should upgrade to 18.2.2 anyway), you
already have the fix. If you still see this bug on those versions,
please complain, as the purported fix is then ineffective.
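
To double-check what the server daemons actually run, something like
this should work:

ceph versions
ceph tell mds.* version   # per-daemon, in case of a partial upgrade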

>
> Thanks for helping!
>
> -erich
>
> > Please let me know if you still could see this bug then it should be the
> > locker order bug as https://tracker.ceph.com/issues/62123.
> >
> > Thanks
> >
> > - Xiubo
> >
> >
> > On 3/28/24 04:03, Erich Weiler wrote:
> >> Hi All,
> >>
> >> I've been battling this for a while and I'm not sure where to go from
> >> here.  I have a Ceph health warning as such:
> >>
> >> # ceph -s
> >>   cluster:
> >> id: 58bde08a-d7ed-11ee-9098-506b4b4da440
> >> health: HEALTH_WARN
> >> 1 MDSs report slow requests
> >> 1 MDSs behind on trimming
> >>
> >>   services:
> >> mon: 5 daemons, quorum
> >> pr-md-01,pr-md-02,pr-store-01,pr-store-02,pr-md-03 (age 5d)
> >> mgr: pr-md-01.jemmdf(active, since 3w), standbys: pr-md-02.emffhz
> >> mds: 1/1 daemons up, 2 standby
> >> osd: 46 osds: 46 up (since 9h), 46 in (since 2w)
> >>
> >>   data:
> >> volumes: 1/1 healthy
> >> pools:   4 pools, 1313 pgs
> >> objects: 260.72M objects, 466 TiB
> >> usage:   704 TiB used, 424 TiB / 1.1 PiB avail
> >> pgs: 1306 active+clean
> >>  4    active+clean+scrubbing+deep
> >>  3    active+clean+scrubbing
> >>
> >>   io:
> >> client:   123 MiB/s rd, 75 MiB/s wr, 109 op/s rd, 1.40k op/s wr
> >>
> >> And the specifics are:
> >>
> >> # ceph health detail
> >> HEALTH_WARN 1 MDSs report slow requests; 1 MDSs behind on trimming
> >> [WRN] MDS_SLOW_REQUEST: 1 MDSs report slow requests
> >> mds.slugfs.pr-md-01.xdtppo(mds.0): 99 slow requests are blocked >
> >> 30 secs
> >> [WRN] MDS_TRIM: 1 MDSs behind on trimming
> >> mds.slugfs.pr-md-01.xdtppo(mds.0): Behind on trimming (13884/250)
> >> max_segments: 250, num_segments: 13884
> >>
> >> That "num_segments" number slowly keeps increasing.  I suspect I just
> >> need to tell the MDS servers to trim faster but after hours of
> >> googling around I just can't figure out the best way to do it. The
> >> best I could come up with was to decrease "mds_cache_trim_decay_rate"
> >> from 1.0 to .8 (to start), based on this page:
> >>
> >> https://www.suse.com/support/kb/doc/?id=19740
> >>
> >> But it doesn't seem to help, maybe I should decrease it further? I am
> >> guessing this must be a common issue...?  I am running Reef on the MDS
> >> servers, but most clients are on Quincy.
> >>
> >> Thanks for any advice!
> >>
> >> cheers,
> >> erich
> >> ___
> >> ceph-users mailing list -- ceph-users@ceph.io
> >> To unsubscribe send an email to ceph-users-le...@ceph.io
> >>
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io



-- 
Alexander E. Patrakov


[ceph-users] Re: MDS Behind on Trimming...

2024-03-28 Thread Alexander E. Patrakov
Hello Erich,

What you are experiencing is definitely a bug - but possibly a client
bug. Not sure. Upgrading Ceph packages on the clients, though, will
not help, because the actual CephFS client is the kernel. You can try
upgrading it to the latest 6.8.x (or, better, trying the same workload
from different hosts with the upgraded kernel), but I doubt that it
will help.
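
As a starting point, the MDS session list should show what each client
runs - a rough sketch, assuming the kernel clients report
"kernel_version" in their session metadata as usual:

ceph tell mds.0 session ls | grep -E '"hostname"|"kernel_version"'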

On Fri, Mar 29, 2024 at 6:17 AM Erich Weiler  wrote:
>
> Could there be an issue with the fact that the servers (MDS, MGR, MON,
> OSD) are running reef and all the clients are running quincy?
>
> I can easily enough get the new reef repo in for all our clients (Ubuntu
> 22.04) and upgrade the clients to reef if that might help..?
>
> On 3/28/24 3:05 PM, Erich Weiler wrote:
> > I asked the user and they said no, no rsync involved.  Although I
> > rsync'd 500TB into this filesystem in the beginning without incident, so
> > hopefully it's not a big deal here.
> >
> > I'm asking the user what their workflow does to try and pin this down.
> >
> > Are there any other known reason why a slow request would start on a
> > certain inode, then block a bunch of cache segments behind it, until the
> > MDS is restarted?
> >
> > Once I restart the MDS daemon that is slow, it shows the cache segments
> > transfer to the other MDS server and very quickly drop to zero, then
> > everything is healthy again, the stuck directory in question responds
> > again and all is well.  Then a few hours later it started happening
> > again (not always the same directory).
> >
> > I hope I'm not experiencing a bug, but I can't see what would be causing
> > this...
> >
> > On 3/28/24 2:37 PM, Alexander E. Patrakov wrote:
> >> Hello Erich,
> >>
> >> Does the workload, by any chance, involve rsync? It is unfortunately
> >> well-known for triggering such issues. A workaround is to export the
> >> directory via NFS and run rsync against the NFS mount instead of
> >> directly against CephFS.
> >>
> >> On Fri, Mar 29, 2024 at 4:58 AM Erich Weiler  wrote:
> >>>
> >>> MDS logs show:
> >>>
> >>> Mar 28 13:42:29 pr-md-02.prism ceph-mds[1464328]: log_channel(cluster)
> >>> log [WRN] : 16 slow requests, 0 included below; oldest blocked for >
> >>> 3676.400077 secs
> >>> Mar 28 13:42:30 pr-md-02.prism ceph-mds[1464328]:
> >>> mds.slugfs.pr-md-02.sbblqq Updating MDS map to version 22775 from mon.3
> >>> Mar 28 13:42:34 pr-md-02.prism ceph-mds[1464328]: log_channel(cluster)
> >>> log [WRN] : 320 slow requests, 5 included below; oldest blocked for >
> >>> 3681.400104 secs
> >>> Mar 28 13:42:34 pr-md-02.prism ceph-mds[1464328]: log_channel(cluster)
> >>> log [WRN] : slow request 3668.805732 seconds old, received at
> >>> 2024-03-28T19:41:25.772531+: client_request(client.99375:574268
> >>> getattr AsXsFs #0x1000c097307 2024-03-28T19:41:25.770954+
> >>> caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch
> >>> getattr
> >>> Mar 28 13:42:34 pr-md-02.prism ceph-mds[1464328]: log_channel(cluster)
> >>> log [WRN] : slow request 3667.883853 seconds old, received at
> >>> 2024-03-28T19:41:26.694410+: client_request(client.99390:374844
> >>> getattr AsXsFs #0x1000c097307 2024-03-28T19:41:26.696172+
> >>> caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch
> >>> getattr
> >>> Mar 28 13:42:34 pr-md-02.prism ceph-mds[1464328]: log_channel(cluster)
> >>> log [WRN] : slow request 3663.724571 seconds old, received at
> >>> 2024-03-28T19:41:30.853692+: client_request(client.99390:375258
> >>> getattr AsXsFs #0x1000c097307 2024-03-28T19:41:30.852166+
> >>> caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch
> >>> getattr
> >>> Mar 28 13:42:34 pr-md-02.prism ceph-mds[1464328]: log_channel(cluster)
> >>> log [WRN] : slow request 3681.399582 seconds old, received at
> >>> 2024-03-28T19:41:13.178681+: client_request(client.99385:11712080
> >>> getattr AsXsFs #0x1000c097307 2024-03-28T19:41:13.178764+
> >>> caller_uid=30150, caller_gid=600{600,608,999,}) currently failed to
> >>> rdlock, waiting
> >>> Mar 28 13:42:34 pr-md-02.prism ceph-mds[1464328]: log_channel(cluster)
> >>> log [WRN] : slow request 3680.508972 seconds old, received at
> >>> 2024-03-28T19:41:14.069291+: client_request(client.99385:11712556
> >>> getattr A

[ceph-users] Re: MDS Behind on Trimming...

2024-03-28 Thread Alexander E. Patrakov
ncluded below; oldest blocked for >
> > 724.386184 secs
> > Mar 27 11:58:57 pr-md-01.prism ceph-mds[1296468]:
> > mds.slugfs.pr-md-01.xdtppo Updating MDS map to version 16637 from mon.0
> > Mar 27 11:59:00 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster)
> > log [WRN] : 16 slow requests, 0 included below; oldest blocked for >
> > 729.386333 secs
> > Mar 27 11:59:02 pr-md-01.prism ceph-mds[1296468]:
> > mds.slugfs.pr-md-01.xdtppo Updating MDS map to version 16638 from mon.0
> > Mar 27 11:59:05 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster)
> > log [WRN] : 53 slow requests, 5 included below; oldest blocked for >
> > 734.386400 secs
> > Mar 27 11:59:05 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster)
> > log [WRN] : slow request 730.190197 seconds old, received at
> > 2024-03-27T18:46:55.137022+: client_request(client.99445:4189994
> > getattr AsXsFs #0x100081b9ceb 2024-03-27T18:46:55.134857+
> > caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch
> > getattr
> >
> > Can we tell which client the slow requests are coming from?  It says
> > stuff like "client.99445:4189994" but I don't know how to map that to a
> > client...
> >
> > Thanks for the response!
> >
> > -erich
> >
> > On 3/27/24 21:28, Xiubo Li wrote:
> >>
> >> On 3/28/24 04:03, Erich Weiler wrote:
> >>> Hi All,
> >>>
> >>> I've been battling this for a while and I'm not sure where to go from
> >>> here.  I have a Ceph health warning as such:
> >>>
> >>> # ceph -s
> >>>   cluster:
> >>> id: 58bde08a-d7ed-11ee-9098-506b4b4da440
> >>> health: HEALTH_WARN
> >>> 1 MDSs report slow requests
> >>
> >> There had slow requests. I just suspect the behind on trimming was
> >> caused by this.
> >>
> >> Could you share the logs about the slow requests ? What are they ?
> >>
> >> Thanks
> >>
> >>
> >>> 1 MDSs behind on trimming
> >>>
> >>>   services:
> >>> mon: 5 daemons, quorum
> >>> pr-md-01,pr-md-02,pr-store-01,pr-store-02,pr-md-03 (age 5d)
> >>> mgr: pr-md-01.jemmdf(active, since 3w), standbys: pr-md-02.emffhz
> >>> mds: 1/1 daemons up, 2 standby
> >>> osd: 46 osds: 46 up (since 9h), 46 in (since 2w)
> >>>
> >>>   data:
> >>> volumes: 1/1 healthy
> >>> pools:   4 pools, 1313 pgs
> >>> objects: 260.72M objects, 466 TiB
> >>> usage:   704 TiB used, 424 TiB / 1.1 PiB avail
> >>> pgs: 1306 active+clean
> >>>  4    active+clean+scrubbing+deep
> >>>  3    active+clean+scrubbing
> >>>
> >>>   io:
> >>> client:   123 MiB/s rd, 75 MiB/s wr, 109 op/s rd, 1.40k op/s wr
> >>>
> >>> And the specifics are:
> >>>
> >>> # ceph health detail
> >>> HEALTH_WARN 1 MDSs report slow requests; 1 MDSs behind on trimming
> >>> [WRN] MDS_SLOW_REQUEST: 1 MDSs report slow requests
> >>> mds.slugfs.pr-md-01.xdtppo(mds.0): 99 slow requests are blocked >
> >>> 30 secs
> >>> [WRN] MDS_TRIM: 1 MDSs behind on trimming
> >>> mds.slugfs.pr-md-01.xdtppo(mds.0): Behind on trimming (13884/250)
> >>> max_segments: 250, num_segments: 13884
> >>>
> >>> That "num_segments" number slowly keeps increasing.  I suspect I just
> >>> need to tell the MDS servers to trim faster but after hours of
> >>> googling around I just can't figure out the best way to do it. The
> >>> best I could come up with was to decrease "mds_cache_trim_decay_rate"
> >>> from 1.0 to .8 (to start), based on this page:
> >>>
> >>> https://www.suse.com/support/kb/doc/?id=19740
> >>>
> >>> But it doesn't seem to help, maybe I should decrease it further? I am
> >>> guessing this must be a common issue...?  I am running Reef on the
> >>> MDS servers, but most clients are on Quincy.
> >>>
> >>> Thanks for any advice!
> >>>
> >>> cheers,
> >>> erich
> >>> ___
> >>> ceph-users mailing list -- ceph-users@ceph.io
> >>> To unsubscribe send an email to ceph-users-le...@ceph.io
> >>>
> >>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io



-- 
Alexander E. Patrakov


[ceph-users] Re: MDS Behind on Trimming...

2024-03-28 Thread Alexander E. Patrakov
35 from mon.0
> Mar 27 11:58:50 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster)
> log [WRN] : 16 slow requests, 0 included below; oldest blocked for >
> 719.386116 secs
> Mar 27 11:58:53 pr-md-01.prism ceph-mds[1296468]:
> mds.slugfs.pr-md-01.xdtppo Updating MDS map to version 16636 from mon.0
> Mar 27 11:58:55 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster)
> log [WRN] : 16 slow requests, 0 included below; oldest blocked for >
> 724.386184 secs
> Mar 27 11:58:57 pr-md-01.prism ceph-mds[1296468]:
> mds.slugfs.pr-md-01.xdtppo Updating MDS map to version 16637 from mon.0
> Mar 27 11:59:00 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster)
> log [WRN] : 16 slow requests, 0 included below; oldest blocked for >
> 729.386333 secs
> Mar 27 11:59:02 pr-md-01.prism ceph-mds[1296468]:
> mds.slugfs.pr-md-01.xdtppo Updating MDS map to version 16638 from mon.0
> Mar 27 11:59:05 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster)
> log [WRN] : 53 slow requests, 5 included below; oldest blocked for >
> 734.386400 secs
> Mar 27 11:59:05 pr-md-01.prism ceph-mds[1296468]: log_channel(cluster)
> log [WRN] : slow request 730.190197 seconds old, received at
> 2024-03-27T18:46:55.137022+: client_request(client.99445:4189994
> getattr AsXsFs #0x100081b9ceb 2024-03-27T18:46:55.134857+
> caller_uid=30150, caller_gid=600{600,608,999,}) currently joining batch
> getattr
>
> Can we tell which client the slow requests are coming from?  It says
> stuff like "client.99445:4189994" but I don't know how to map that to a
> client...
>
> Thanks for the response!
>
> -erich
>
> On 3/27/24 21:28, Xiubo Li wrote:
> >
> > On 3/28/24 04:03, Erich Weiler wrote:
> >> Hi All,
> >>
> >> I've been battling this for a while and I'm not sure where to go from
> >> here.  I have a Ceph health warning as such:
> >>
> >> # ceph -s
> >>   cluster:
> >> id: 58bde08a-d7ed-11ee-9098-506b4b4da440
> >> health: HEALTH_WARN
> >> 1 MDSs report slow requests
> >
> > There had slow requests. I just suspect the behind on trimming was
> > caused by this.
> >
> > Could you share the logs about the slow requests ? What are they ?
> >
> > Thanks
> >
> >
> >> 1 MDSs behind on trimming
> >>
> >>   services:
> >> mon: 5 daemons, quorum
> >> pr-md-01,pr-md-02,pr-store-01,pr-store-02,pr-md-03 (age 5d)
> >> mgr: pr-md-01.jemmdf(active, since 3w), standbys: pr-md-02.emffhz
> >> mds: 1/1 daemons up, 2 standby
> >> osd: 46 osds: 46 up (since 9h), 46 in (since 2w)
> >>
> >>   data:
> >> volumes: 1/1 healthy
> >> pools:   4 pools, 1313 pgs
> >> objects: 260.72M objects, 466 TiB
> >> usage:   704 TiB used, 424 TiB / 1.1 PiB avail
> >> pgs: 1306 active+clean
> >>  4    active+clean+scrubbing+deep
> >>  3    active+clean+scrubbing
> >>
> >>   io:
> >> client:   123 MiB/s rd, 75 MiB/s wr, 109 op/s rd, 1.40k op/s wr
> >>
> >> And the specifics are:
> >>
> >> # ceph health detail
> >> HEALTH_WARN 1 MDSs report slow requests; 1 MDSs behind on trimming
> >> [WRN] MDS_SLOW_REQUEST: 1 MDSs report slow requests
> >> mds.slugfs.pr-md-01.xdtppo(mds.0): 99 slow requests are blocked >
> >> 30 secs
> >> [WRN] MDS_TRIM: 1 MDSs behind on trimming
> >> mds.slugfs.pr-md-01.xdtppo(mds.0): Behind on trimming (13884/250)
> >> max_segments: 250, num_segments: 13884
> >>
> >> That "num_segments" number slowly keeps increasing.  I suspect I just
> >> need to tell the MDS servers to trim faster but after hours of
> >> googling around I just can't figure out the best way to do it. The
> >> best I could come up with was to decrease "mds_cache_trim_decay_rate"
> >> from 1.0 to .8 (to start), based on this page:
> >>
> >> https://www.suse.com/support/kb/doc/?id=19740
> >>
> >> But it doesn't seem to help, maybe I should decrease it further? I am
> >> guessing this must be a common issue...?  I am running Reef on the MDS
> >> servers, but most clients are on Quincy.
> >>
> >> Thanks for any advice!
> >>
> >> cheers,
> >> erich
> >> ___
> >> ceph-users mailing list -- ceph-users@ceph.io
> >> To unsubscribe send an email to ceph-users-le...@ceph.io
> >>
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io



-- 
Alexander E. Patrakov


[ceph-users] Re: Call for Interest: Managed SMB Protocol Support

2024-03-28 Thread Alexander E. Patrakov
On Thu, Mar 28, 2024 at 9:17 AM Angelo Hongens  wrote:
> According to 45drives, saving the CTDB lock file in CephFS is a bad idea

Could you please share a link to their page that says this?

-- 
Alexander E. Patrakov


[ceph-users] Re: Erasure Code with Autoscaler and Backfill_toofull

2024-03-27 Thread Alexander E. Patrakov
> >> crush_rule  0  object_hash  rjenkins  pg_num  32  pgp_num  32
> >> autoscale_mode  on
> >>
> >> # ceph osd erasure-code-profile get 6.3
> >> crush-device-class=
> >> crush-failure-domain=host
> >> crush-root=default
> >> jerasure-per-chunk-alignment=false
> >> k=6
> >> m=3
> >> plugin=jerasure
> >> technique=reed_sol_van
> >> w=8
> >>
> >> # ceph pg ls | awk 'NR==1 || /backfill_toofull/' | awk '{print $1" "$2"
> >> "$4" "$6" "$11" "$15" "$16}' | column -t
> >> PG OBJECTS  MISPLACED  BYTES STATE
> >> UP  ACTING
> >> 36.f   222077   141392 953817797727  active+remapped+backfill_toofull
> >>  [1,27,41,8,36,17,14,40,32]p1[33,32,29,23,16,17,28,1,14]p33
> >> 36.5c  221761   147015 950692130045  active+remapped+backfill_toofull
> >>  [26,27,40,29,1,37,39,11,42]p26  [12,24,4,2,31,25,17,33,8]p12
> >> 36.60  222710   0  957109050809  active+remapped+backfill_toofull
> >>  [41,34,22,3,1,35,9,39,29]p41[2,34,22,3,27,32,28,24,1]p2
> >> 36.6b  02   427168 953843892012  active+remapped+backfill_toofull
> >>  [20,15,7,21,37,1,38,17,32]p20   [7,2,32,26,5,35,24,17,23]p7
> >> 36.74  222681   777546 957679960067  active+remapped+backfill_toofull
> >>  [42,24,12,34,38,10,27,1,25]p42  [34,33,12,0,19,14,17,30,25]p34
> >> 36.7b  222974   1560818  957691042940  active+remapped+backfill_toofull
> >>  [2,35,27,1,20,18,19,12,8]p2 [31,23,21,24,35,18,19,33,25]p31
> >> 36.82  222362   1998670  954507657022  active+remapped+backfill_toofull
> >>  [37,22,1,38,11,23,27,32,33]p37  [27,33,0,32,5,25,20,13,15]p27
> >> 36.b5  221676   1330056  953443725830  active+remapped+backfill_toofull
> >>  [6,8,38,12,21,1,39,34,27]p6 [33,8,26,12,3,10,22,34,1]p33
> >> 36.b6  222669   1335327  956973704883  active+remapped+backfill_toofull
> >>  [11,13,41,4,12,34,29,6,1]p11[2,29,34,4,12,9,15,6,28]p2
> >> 36.e0  221518   1772144  952581426388  active+remapped+backfill_toofull
> >>  [1,27,21,31,30,23,37,13,28]p1   [25,21,14,31,1,2,34,17,24]p25
> >>
> >> ceph pg ls | awk 'NR==1 || /backfilling/' | grep -e BYTES -e '\[1' -e
> >> ',1,'
> >> -e '1\]' | awk '{print $1" "$2" "$4" "$6" "$11" "$15" "$16}' | column -t
> >> PG OBJECTS  MISPLACED  BYTES STATEUP
> >>ACTING
> >> 36.4a  221508   89144  951346455917  active+remapped+backfilling
> >>  [40,43,33,32,30,38,22,35,9]p40  [27,10,20,7,30,21,1,28,31]p27
> >> 36.79  222315   575  955797107713  active+remapped+backfilling
> >>  [1,36,31,33,25,23,14,3,13]p1[27,6,31,23,25,5,14,29,13]p27
> >> 36.8d  29   1284156    955234423342  active+remapped+backfilling
> >>  [35,34,27,37,38,36,43,3,16]p35  [35,34,15,26,1,11,27,18,16]p35
> >> 36.ba  222039   0  952547107971  active+remapped+backfilling
> >>  [0,40,33,23,41,4,27,22,28]p0[0,35,33,27,1,3,30,22,28]p0
> >> 36.da  221607   277464 951599928383  active+remapped+backfilling
> >>  [21,31,8,9,11,25,36,23,28]p21   [0,10,1,22,33,11,35,15,28]p0
> >> 36.db  221685   58816  951420054091  active+remapped+backfilling
> >>  [3,28,12,13,1,38,40,35,43]p3[27,20,17,21,1,23,28,24,31]p27
> >>
> >> # ceph osd df | sort -nk 17 | tail -n 5
> >> 21  hdd  9.09598  1.0  9.1 TiB  7.7 TiB  7.7 TiB     0 B  31 GiB   1.4 TiB  84.62  1.16  68  up
> >> 24  hdd  9.09598  1.0  9.1 TiB  7.7 TiB  7.7 TiB   1 KiB  25 GiB   1.4 TiB  84.98  1.16  69  up
> >> 29  hdd  9.09569  1.0  9.1 TiB  8.0 TiB  8.0 TiB  72 MiB  23 GiB   1.1 TiB  88.42  1.21  73  up
> >> 13  hdd  9.09569  1.0  9.1 TiB  8.1 TiB  8.1 TiB   1 KiB  22 GiB  1023 GiB  89.02  1.22  76  up
> >>  1  hdd  7.27698  1.0  7.3 TiB  6.8 TiB  6.8 TiB  27 MiB  18 GiB   451 GiB  93.94  1.28  64  up
> >>
> >> # cat /etc/ceph/ceph.conf | grep full
> >> mon_osd_full_ratio = .98
> >> mon_osd_nearfull_ratio = .96
> >> mon_osd_backfillfull_ratio = .97
> >> osd_backfill_full_ratio = .97
> >> osd_failsafe_full_ratio = .99
> >> ___
> >> ceph-users mailing list -- ceph-users@ceph.io
> >> To unsubscribe send an email to ceph-users-le...@ceph.io
> >>
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io



--
Alexander E. Patrakov


[ceph-users] Re: cephfs client not released caps when running rsync

2024-03-26 Thread Alexander E. Patrakov
h-users/msg50158.html
> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/B7K6B5VXM3I7TODM4GRF3N7S254O5ETY/
>
> It turns out that the problem is in rsync, in the way it works?
>
> The only "solution" is to do it on the client according to a schedule
> (or upon reaching a certain number of open caps) “echo 2 >
> /proc/sys/vm/drop_caches”. After this command, the cephfs client
> releases the cached caps. And if there were a lot of them, then MDS
> becomes slow again.
>
> We also tried to mount cephfs with the option "caps_max=1" so that
> the client would do a forced release when the specified value is
> reached, but this did not help.
>
> We can limit mds_max_caps_per_client (not tested), but this also affects
> all clients at once.
>
> The command "ceph daemon mds.cephfs.X cache drop" (with or without an
> additional parameter) does not help
>
> Tested on Linux kernels (client side): 5.10 and 6.1
>
> Did I understand everything correctly? is this the expected behavior
> when running rsync?
>
>
> And one more problem (I don’t know if it’s related or not), when rsync
> finishes copying, all caps are freed except the last two (pinned i_caps
> / total inodes 2 / 2)
>
> At this moment a warning appears (or remains after releasing a large
> number of caps): 1 clients failing to advance oldest client/flush tid
> But then it doesn't disappear. I waited 12 hours.
> Warning disappears only after executing the "sync" command on the
> client. and in the client metrics you can see "pinned i_caps / total
> inodes 1 / 1"
>
> Note: running "echo 2 > /proc/sys/vm/drop_caches" does not help in this
> case.
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io



-- 
Alexander E. Patrakov


[ceph-users] Re: Large number of misplaced PGs but little backfill going on

2024-03-25 Thread Alexander E. Patrakov
On Mon, Mar 25, 2024 at 7:37 PM Torkil Svensgaard  wrote:
>
>
>
> On 24/03/2024 01:14, Torkil Svensgaard wrote:
> > On 24-03-2024 00:31, Alexander E. Patrakov wrote:
> >> Hi Torkil,
> >
> > Hi Alexander
> >
> >> Thanks for the update. Even though the improvement is small, it is
> >> still an improvement, consistent with the osd_max_backfills value, and
> >> it proves that there are still unsolved peering issues.
> >>
> >> I have looked at both the old and the new state of the PG, but could
> >> not find anything else interesting.
> >>
> >> I also looked again at the state of PG 37.1. It is known what blocks
> >> the backfill of this PG; please search for "blocked_by." However, this
> >> is just one data point, which is insufficient for any conclusions. Try
> >> looking at other PGs. Is there anything too common in the non-empty
> >> "blocked_by" blocks?
> >
> > I'll take a look at that tomorrow, perhaps we can script something
> > meaningful.
>
> Hi Alexander
>
> While working on a script querying all PGs and making a list of all OSDs
> found in a blocked_by list, and how many times for each, I discovered
> something odd about pool 38:
>
> "
> [root@lazy blocked_by]# sh blocked_by.sh 38 |tee pool38
> OSDs blocking other OSDs:


> All PGs in the pool are active+clean so why are there any blocked_by at
> all? One example attached.

I don't know. In any case, it doesn't match the "one OSD blocks them
all" scenario that I was looking for. I think this is something bogus
that can probably be cleared in your example by restarting osd.89
(i.e., the one being blocked).
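
Something like one of these, depending on how the OSDs are deployed
(adjust to your environment):

ceph orch daemon restart osd.89     # cephadm-managed cluster
systemctl restart ceph-osd@89       # classic package-based deployment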

>
> Mvh.
>
> Torkil
>
> >> I think we have to look for patterns in other ways, too. One tool that
> >> produces good visualizations is TheJJ balancer. Although it is called
> >> a "balancer," it can also visualize the ongoing backfills.
> >>
> >> The tool is available at
> >> https://raw.githubusercontent.com/TheJJ/ceph-balancer/master/placementoptimizer.py
> >>
> >> Run it as follows:
> >>
> >> ./placementoptimizer.py showremapped --by-osd | tee remapped.txt
> >
> > Output attached.
> >
> > Thanks again.
> >
> > Mvh.
> >
> > Torkil
> >
> >> On Sun, Mar 24, 2024 at 5:50 AM Torkil Svensgaard 
> >> wrote:
> >>>
> >>> Hi Alex
> >>>
> >>> New query output attached after restarting both OSDs. OSD 237 is no
> >>> longer mentioned but it unfortunately made no difference for the number
> >>> of backfills which went 59->62->62.
> >>>
> >>> Mvh.
> >>>
> >>> Torkil
> >>>
> >>> On 23-03-2024 22:26, Alexander E. Patrakov wrote:
> >>>> Hi Torkil,
> >>>>
> >>>> I have looked at the files that you attached. They were helpful: pool
> >>>> 11 is problematic, it complains about degraded objects for no obvious
> >>>> reason. I think that is the blocker.
> >>>>
> >>>> I also noted that you mentioned peering problems, and I suspect that
> >>>> they are not completely resolved. As a somewhat-irrational move, to
> >>>> confirm this theory, you can restart osd.237 (it is mentioned at the
> >>>> end of query.11.fff.txt, although I don't understand why it is there)
> >>>> and then osd.298 (it is the primary for that pg) and see if any
> >>>> additional backfills are unblocked after that. Also, please re-query
> >>>> that PG again after the OSD restart.
> >>>>
> >>>> On Sun, Mar 24, 2024 at 4:56 AM Torkil Svensgaard 
> >>>> wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 23-03-2024 21:19, Alexander E. Patrakov wrote:
> >>>>>> Hi Torkil,
> >>>>>
> >>>>> Hi Alexander
> >>>>>
> >>>>>> I have looked at the CRUSH rules, and the equivalent rules work on my
> >>>>>> test cluster. So this cannot be the cause of the blockage.
> >>>>>
> >>>>> Thank you for taking the time =)
> >>>>>
> >>>>>> What happens if you increase the osd_max_backfills setting
> >>>>>> temporarily?
> >>>>>
> >>>>> We already had the mclock override option in place and I re-enabled
> >>>

[ceph-users] Re: Call for Interest: Managed SMB Protocol Support

2024-03-25 Thread Alexander E. Patrakov
On Mon, Mar 25, 2024 at 11:01 PM John Mulligan
 wrote:
>
> On Friday, March 22, 2024 2:56:22 PM EDT Alexander E. Patrakov wrote:
> > Hi John,
> >
> > > A few major features we have planned include:
> > > * Standalone servers (internally defined users/groups)
> >
> > No concerns here
> >
> > > * Active Directory Domain Member Servers
> >
> > In the second case, what is the plan regarding UID mapping? Is NFS
> > coexistence planned, or a concurrent mount of the same directory using
> > CephFS directly?
>
> In the immediate future the plan is to have a very simple, fairly
> "opinionated" idmapping scheme based on the autorid backend.

OK, the docs for clustered SAMBA do mention the autorid backend in
examples. It's a shame that the manual page does not explicitly list
it as compatible with clustered setups.

However, please consider that the majority of Linux distributions
(tested: CentOS, Fedora, Alt Linux, Ubuntu, OpenSUSE) use "realmd" to
join AD domains by default (where "default" means a pointy-clicky way
in a workstation setup), which uses SSSD, and therefore, by this
opinionated choice of the autorid backend, you create mappings that
disagree with the supposed majority and the default. This will create
problems in the future when you do consider NFS coexistence.

It is a different matter that most organizations I have seen seem to
ignore this default - although maybe those that have no problems simply
have no reason to talk to me. I think more research is needed here on
whether Red Hat's and GNOME's push of SSSD is something not yet ready
or indeed the de-facto standard setup.

Even if you don't want to use SSSD, providing an option to provision a
few domains with the idmap rid backend and statically configured ranges
(as an override to autorid) would be a good step forward, as this can
be made compatible with the default Red Hat setup.
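
For illustration only - a minimal sketch of the kind of static
configuration I mean, with a made-up domain name and ranges:

cat >> /etc/samba/smb.conf <<'EOF'
# allocating backend for the default domain (BUILTIN and one-off SIDs)
idmap config * : backend = tdb
idmap config * : range = 3000-7999
# purely algorithmic mapping for the AD domain; the range would have to
# be chosen to match whatever the other hosts use
idmap config EXAMPLE : backend = rid
idmap config EXAMPLE : range = 100000-999999
EOF

The point is that the rid backend always maps the same SID to the same
UID/GID on every node, so it can be made to agree with an existing
SSSD-based setup.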

> Sharing the same directories over both NFS and SMB at the same time, also
> known as "multi-protocol", is not planned for now, however we're all aware
> that there's often a demand for this feature and we're aware of the complexity
> it brings. I expect we'll work on that at some point but not initially.
> Similarly, sharing the same directories over a SMB share and directly on a
> cephfs mount won't be blocked but we won't recommend it.

OK. Feature request: in the case if there are several CephFS
filesystems, support configuration of which one to serve.

>
> >
> > In fact, I am quite skeptical, because, at least in my experience,
> > every customer's SAMBA configuration as a domain member is a unique
> > snowflake, and cephadm would need an ability to specify arbitrary UID
> > mapping configuration to match what the customer uses elsewhere - and
> > the match must be precise.
> >
>
> I agree - our initial use case is something along the lines:
> Users of a Ceph Cluster that have Windows systems, Mac systems, or appliances
> that are joined to an existing AD
> but are not currently interoperating with the Ceph cluster.
>
> I expect to add some idpapping configuration and agility down the line,
> especially supporting some form of rfc2307 idmapping (where unix IDs are
> stored in AD).

Yes, for whatever reason, people do this, even though it is cumbersome
to manage.

>
> But those who already have idmapping schemes and samba accessing ceph will
> probably need to just continue using the existing setups as we don't have an
> immediate plan for migrating those users.
>
> > Here is what I have seen or was told about:
> >
> > 1. We don't care about interoperability with NFS or CephFS, so we just
> > let SAMBA invent whatever UIDs and GIDs it needs using the "tdb2"
> > idmap backend. It's completely OK that workstations get different UIDs
> > and GIDs, as only SIDs traverse the wire.
>
> This is pretty close to our initial plan but I'm not clear why you'd think
> that "workstations get different UIDs and GIDs". For all systems acessing the
> (same) ceph cluster the id mapping should be consistent.
> You did make me consider multi-cluster use cases with something like cephfs
> volume mirroring - that's something that I hadn't thought of before *but*
> using an algorithmic mapping backend like autorid (and testing) I think we're
> mostly OK there.

The tdb2 backend (used in my example) is not algorithmic, it is
allocating. That is, it sequentially allocates IDs on the
first-seen-first-allocated basis. Yet this is what this customer uses,
presumably because it is the only backend that explicitly specifies
clustering operation in its manual page.

And the "autorid" backend is also not fully algorithmic, it allocates
ranges to domains on the same 

[ceph-users] Re: ceph cluster extremely unbalanced

2024-03-25 Thread Alexander E. Patrakov
Hi Denis,

As the vast majority of OSDs have bluestore_min_alloc_size = 65536, I
think you can safely ignore https://tracker.ceph.com/issues/64715. The
only consequence will be that 58 OSDs will be less full than others.
In other words, please use either the hybrid approach or the built-in
balancer right away.

As for migrating to the modern defaults for bluestore_min_alloc_size,
yes, recreating OSDs host-by-host (once you have the cluster balanced)
is the only way. You can keep using the built-in balancer while doing
that.
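
For the balancer part, a rough sketch (please double-check the option
names against your release):

ceph osd metadata | grep -c '"bluestore_min_alloc_size": "65536"'   # how many 64K OSDs remain
ceph config set mgr target_max_misplaced_ratio 0.01                 # throttle the balancer
ceph balancer mode upmap
ceph balancer on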

On Mon, Mar 25, 2024 at 5:04 PM Denis Polom  wrote:
>
> Hi Alexander,
>
> that sounds pretty promising to me.
>
> I've checked bluestore_min_alloc_size and most 1370 OSDs have value 65536.
>
> You mentioned: "You will have to do that weekly until you redeploy all
> OSDs that were created with 64K bluestore_min_alloc_size"
>
> Is it the only way to approach this, that each OSD has to be recreated?
>
> Thank you for reply
>
> dp
>
> On 3/24/24 12:44 PM, Alexander E. Patrakov wrote:
> > Hi Denis,
> >
> > My approach would be:
> >
> > 1. Run "ceph osd metadata" and see if you have a mix of 64K and 4K
> > bluestore_min_alloc_size. If so, you cannot really use the built-in
> > balancer, as it would result in a bimodal distribution instead of a
> > proper balance, see https://tracker.ceph.com/issues/64715, but let's
> > ignore this little issue if you have enough free space.
> > 2. Change the weights as appropriate. Make absolutely sure that there
> > are no reweights other than 1.0. Delete all dead or destroyed OSDs
> > from the CRUSH map by purging them. Ignore any PG_BACKFILL_FULL
> > warnings that appear, they will be gone during the next step.
> > 3. Run this little script from Cern to stop the data movement that was
> > just initiated:
> > https://raw.githubusercontent.com/cernceph/ceph-scripts/master/tools/upmap/upmap-remapped.py,
> > pipe its output to bash. This should cancel most of the data movement,
> > but not all - the script cannot stop the situation when two OSDs want
> > to exchange their erasure-coded shards, like this: [1,2,3,4] ->
> > [1,3,2,4].
> > 4. Set the "target max misplaced ratio" option for MGR to what you
> > think is appropriate. The default is 0.05, and this means that the
> > balancer will enable at most 5% of the PGs to participate in the data
> > movement. I suggest starting with 0.01 and increasing if there is no
> > visible impact of the balancing on the client traffic.
> > 5. Enable the balancer.
> >
> > If you think that https://tracker.ceph.com/issues/64715 is a problem
> > that would prevent you from using the built-in balancer:
> >
> > 4. Download this script:
> > https://raw.githubusercontent.com/TheJJ/ceph-balancer/master/placementoptimizer.py
> > 5. Run it as follows: ./placementoptimizer.py -v balance --osdsize
> > device --osdused delta --max-pg-moves 500 --osdfrom fullest | bash
> >
> > This will move at most 500 PGs to better places, starting with the
> > fullest OSDs. All weights are ignored, and the switches take care of
> > the bluestore_min_alloc_size overhead mismatch. You will have to do
> > that weekly until you redeploy all OSDs that were created with 64K
> > bluestore_min_alloc_size.
> >
> > A hybrid approach (initial round of balancing with TheJJ, then switch
> > to the built-in balancer) may also be viable.
> >
> > On Sun, Mar 24, 2024 at 7:09 PM Denis Polom  wrote:
> >> Hi guys,
> >>
> >> recently I took over a care of Ceph cluster that is extremely
> >> unbalanced. Cluster is running on Quincy 17.2.7 (upgraded Nautilus ->
> >> Octopus -> Quincy) and has 1428 OSDs (HDDs). We are running CephFS on it.
> >>
> >> Crush failure domain is datacenter (there are 3), data pool is EC 3+3.
> >>
> >> This cluster had and has balancer disabled for years. And was "balanced"
> >> manually by changing OSDs crush weights. So now it is complete mess and
> >> I would like to change it to have OSDs crush weight same (3.63898)  and
> >> to enable balancer with upmap.
> >>
> >>   From `ceph osd df ` sorted from the least used to most used OSDs:
> >>
> >> ID  CLASS  WEIGHT   REWEIGHT  SIZE  RAW USE  DATA  OMAP  META
> >> AVAIL %USE   VAR   PGS  STATUS
> >> MIN/MAX VAR: 0.76/1.16  STDDEV: 5.97
> >>TOTAL  5.1 PiB  3.7 PiB  3.7 PiB  2.9 MiB  8.5
> >> TiB   1.5 PiB  71.50
> >> 428  hdd  3.63898   1.0  3.6 TiB  2.0 TiB  2.0 TiB  1 KiB  5.6
> >> G

[ceph-users] Re: Mounting A RBD Via Kernal Modules

2024-03-24 Thread Alexander E. Patrakov
Hello Matthew,

Are overwrites enabled on the erasure-coded pool? If not, here is
how to fix that:

ceph osd pool set my_pool.data allow_ec_overwrites true
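
To check the current value first (pool name as in your rbd info output):

ceph osd pool get my_pool.data allow_ec_overwrites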

On Mon, Mar 25, 2024 at 11:17 AM duluxoz  wrote:
>
> Hi Curt,
>
> Blockdev --getbsz: 4096
>
> Rbd info my_pool.meta/my_image:
>
> ~~~
>
> rbd image 'my_image':
>  size 4 TiB in 1048576 objects
>  order 22 (4 MiB objects)
>  snapshot_count: 0
>  id: 294519bf21a1af
>  data_pool: my_pool.data
>  block_name_prefix: rbd_data.30.294519bf21a1af
>  format: 2
>  features: layering, exclusive-lock, object-map, fast-diff,
> deep-flatten, data-pool
>  op_features:
>  flags:
>  create_timestamp: Sun Mar 24 17:44:33 2024
>  access_timestamp: Sun Mar 24 17:44:33 2024
>  modify_timestamp: Sun Mar 24 17:44:33 2024
> ~~~
>
> On 24/03/2024 21:10, Curt wrote:
> > Hey Mathew,
> >
> > One more thing out of curiosity can you send the output of blockdev
> > --getbsz on the rbd dev and rbd info?
> >
> > I'm using 16TB rbd images without issue, but I haven't updated to reef
> > .2 yet.
> >
> > Cheers,
> > Curt
>


-- 
Alexander E. Patrakov


[ceph-users] Re: ceph cluster extremely unbalanced

2024-03-24 Thread Alexander E. Patrakov
  "op": "choose_indep",
>  "num": 2,
>  "type": "osd"
>  },
>  {
>  "op": "emit"
>  }
>  ]
> }
>
> My question is what would be proper and most safer way to make it happen.
>
> * should I first enable balancer and let it do its work and after that
> change the OSDs crush weights to be even?
>
> * or should it otherwise - first to make crush weights even and then
> enable the balancer?
>
> * or is there another safe(r) way?
>
> What are the ideal balancer settings for that?
>
> I'm expecting a large data movement, and this is production cluster.
>
> I'm also afraid that during the balancing or changing crush weights some
> OSDs become full. I've tried that already and had to move some PGs
> manually to another OSDs in the same failure domain.
>
>
> I would appreciate any suggestion on that.
>
> Thank you!
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io



-- 
Alexander E. Patrakov


[ceph-users] Re: Mounting A RBD Via Kernal Modules

2024-03-24 Thread Alexander E. Patrakov
Hi,

Please test again, it must have been some network issue. A 10 TB RBD
image is used here without any problems.

On Sun, Mar 24, 2024 at 1:01 PM duluxoz  wrote:
>
> Hi Alexander,
>
> DOH!
>
> Thanks for pointing out my typo - I missed it, and yes, it was my
> issue.  :-)
>
> New issue (sort of): The requirement of the new RBD Image is 2 TB in
> size (it's for a MariaDB Database/Data Warehouse). However, I'm getting
> the following errors:
>
> ~~~
>
> mkfs.xfs: pwrite failed: Input/output error
> libxfs_bwrite: write failed on (unknown) bno 0x7f00/0x100, err=5
> mkfs.xfs: Releasing dirty buffer to free list!
> found dirty buffer (bulk) on free list!
> ~~~
>
> I tested with a 100 GB image in the same pool and was 100% successful,
> so I'm now wondering if there is some sort of Ceph RBD Image size limit
> - although, honestly, that seems to be counter-intuitive to me
> considering CERN uses Ceph for their data storage needs.
>
> Any ideas / thoughts?
>
> Cheers
>
> Dulux-Oz
>
> On 23/03/2024 18:52, Alexander E. Patrakov wrote:
> > Hello Dulux-Oz,
> >
> > Please treat the RBD as a normal block device. Therefore, "mkfs" needs
> > to be run before mounting it.
> >
> > The mistake is that you run "mkfs xfs" instead of "mkfs.xfs" (space vs
> > dot). And, you are not limited to xfs, feel free to use ext4 or btrfs
> > or any other block-based filesystem.
> >
>


-- 
Alexander E. Patrakov


[ceph-users] Re: Large number of misplaced PGs but little backfill going on

2024-03-23 Thread Alexander E. Patrakov
Hi Torkil,

Thanks for the update. Even though the improvement is small, it is
still an improvement, consistent with the osd_max_backfills value, and
it proves that there are still unsolved peering issues.

I have looked at both the old and the new state of the PG, but could
not find anything else interesting.

I also looked again at the state of PG 37.1. It is known what blocks
the backfill of this PG; please search for "blocked_by." However, this
is just one data point, which is insufficient for any conclusions. Try
looking at other PGs. Is there anything too common in the non-empty
"blocked_by" blocks?

I think we have to look for patterns in other ways, too. One tool that
produces good visualizations is TheJJ balancer. Although it is called
a "balancer," it can also visualize the ongoing backfills.

The tool is available at
https://raw.githubusercontent.com/TheJJ/ceph-balancer/master/placementoptimizer.py

Run it as follows:

./placementoptimizer.py showremapped --by-osd | tee remapped.txt

On Sun, Mar 24, 2024 at 5:50 AM Torkil Svensgaard  wrote:
>
> Hi Alex
>
> New query output attached after restarting both OSDs. OSD 237 is no
> longer mentioned but it unfortunately made no difference for the number
> of backfills which went 59->62->62.
>
> Mvh.
>
> Torkil
>
> On 23-03-2024 22:26, Alexander E. Patrakov wrote:
> > Hi Torkil,
> >
> > I have looked at the files that you attached. They were helpful: pool
> > 11 is problematic, it complains about degraded objects for no obvious
> > reason. I think that is the blocker.
> >
> > I also noted that you mentioned peering problems, and I suspect that
> > they are not completely resolved. As a somewhat-irrational move, to
> > confirm this theory, you can restart osd.237 (it is mentioned at the
> > end of query.11.fff.txt, although I don't understand why it is there)
> > and then osd.298 (it is the primary for that pg) and see if any
> > additional backfills are unblocked after that. Also, please re-query
> > that PG again after the OSD restart.
> >
> > On Sun, Mar 24, 2024 at 4:56 AM Torkil Svensgaard  wrote:
> >>
> >>
> >>
> >> On 23-03-2024 21:19, Alexander E. Patrakov wrote:
> >>> Hi Torkil,
> >>
> >> Hi Alexander
> >>
> >>> I have looked at the CRUSH rules, and the equivalent rules work on my
> >>> test cluster. So this cannot be the cause of the blockage.
> >>
> >> Thank you for taking the time =)
> >>
> >>> What happens if you increase the osd_max_backfills setting temporarily?
> >>
> >> We already had the mclock override option in place and I re-enabled our
> >> babysitter script which sets osd_max_backfills pr OSD to 1-3 depending
> >> on how full they are. Active backfills went from 16 to 53 which is
> >> probably because default osd_max_backfills for mclock is 1.
> >>
> >> I think 53 is still a low number of active backfills given the large
> >> percentage misplaced.
> >>
> >>> It may be a good idea to investigate a few of the stalled PGs. Please
> >>> run commands similar to this one:
> >>>
> >>> ceph pg 37.0 query > query.37.0.txt
> >>> ceph pg 37.1 query > query.37.1.txt
> >>> ...
> >>> and the same for the other affected pools.
> >>
> >> A few samples attached.
> >>
> >>> Still, I must say that some of your rules are actually unsafe.
> >>>
> >>> The 4+2 rule as used by rbd_ec_data will not survive a
> >>> datacenter-offline incident. Namely, for each PG, it chooses OSDs from
> >>> two hosts in each datacenter, so 6 OSDs total. When a datacenter is
> >>> offline, you will, therefore, have only 4 OSDs up, which is exactly
> >>> the number of data chunks. However, the pool requires min_size 5, so
> >>> all PGs will be inactive (to prevent data corruption) and will stay
> >>> inactive until the datacenter comes up again. However, please don't
> >>> set min_size to 4 - then, any additional incident (like a defective
> >>> disk) will lead to data loss, and the shards in the datacenter which
> >>> went offline would be useless because they do not correspond to the
> >>> updated shards written by the clients.
> >>
> >> Thanks for the explanation. This is an old pool predating the 3 DC setup
> >> and we'll migrate the data to a 4+5 pool when we can.
> >>
> >>> The 4+5 rule as used by cephfs.hdd.data has min_size equal to the
> >>> number of

[ceph-users] Re: Large number of misplaced PGs but little backfill going on

2024-03-23 Thread Alexander E. Patrakov
Hi Torkil,

I have looked at the files that you attached. They were helpful: pool
11 is problematic, it complains about degraded objects for no obvious
reason. I think that is the blocker.

I also noted that you mentioned peering problems, and I suspect that
they are not completely resolved. As a somewhat-irrational move, to
confirm this theory, you can restart osd.237 (it is mentioned at the
end of query.11.fff.txt, although I don't understand why it is there)
and then osd.298 (it is the primary for that pg) and see if any
additional backfills are unblocked after that. Also, please re-query
that PG again after the OSD restart.
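
For example, using the same PG as in your attachment:

ceph pg 11.fff query > query.11.fff.after-restart.txt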

On Sun, Mar 24, 2024 at 4:56 AM Torkil Svensgaard  wrote:
>
>
>
> On 23-03-2024 21:19, Alexander E. Patrakov wrote:
> > Hi Torkil,
>
> Hi Alexander
>
> > I have looked at the CRUSH rules, and the equivalent rules work on my
> > test cluster. So this cannot be the cause of the blockage.
>
> Thank you for taking the time =)
>
> > What happens if you increase the osd_max_backfills setting temporarily?
>
> We already had the mclock override option in place and I re-enabled our
> babysitter script which sets osd_max_backfills pr OSD to 1-3 depending
> on how full they are. Active backfills went from 16 to 53 which is
> probably because default osd_max_backfills for mclock is 1.
>
> I think 53 is still a low number of active backfills given the large
> percentage misplaced.
>
> > It may be a good idea to investigate a few of the stalled PGs. Please
> > run commands similar to this one:
> >
> > ceph pg 37.0 query > query.37.0.txt
> > ceph pg 37.1 query > query.37.1.txt
> > ...
> > and the same for the other affected pools.
>
> A few samples attached.
>
> > Still, I must say that some of your rules are actually unsafe.
> >
> > The 4+2 rule as used by rbd_ec_data will not survive a
> > datacenter-offline incident. Namely, for each PG, it chooses OSDs from
> > two hosts in each datacenter, so 6 OSDs total. When a datacenter is
> > offline, you will, therefore, have only 4 OSDs up, which is exactly
> > the number of data chunks. However, the pool requires min_size 5, so
> > all PGs will be inactive (to prevent data corruption) and will stay
> > inactive until the datacenter comes up again. However, please don't
> > set min_size to 4 - then, any additional incident (like a defective
> > disk) will lead to data loss, and the shards in the datacenter which
> > went offline would be useless because they do not correspond to the
> > updated shards written by the clients.
>
> Thanks for the explanation. This is an old pool predating the 3 DC setup
> and we'll migrate the data to a 4+5 pool when we can.
>
> > The 4+5 rule as used by cephfs.hdd.data has min_size equal to the
> > number of data chunks. See above why it is bad. Please set min_size to
> > 5.
>
> Thanks, that was a leftover for getting the PGs to peer (stuck at
> creating+incomplete) when we created the pool. It's back to 5 now.
>
> > The rbd.ssd.data pool seems to be OK - and, by the way, its PGs are
> > 100% active+clean.
>
> There is very little data in this pool, that is probably the main reason.
>
> > Regarding the mon_max_pg_per_osd setting, you have a few OSDs that
> > have 300+ PGs, the observed maximum is 347. Please set it to 400.
>
> Copy that. Didn't seem to make a difference though, and we have
> osd_max_pg_per_osd_hard_ratio set to 5.00.
>
> Mvh.
>
> Torkil
>
> > On Sun, Mar 24, 2024 at 3:16 AM Torkil Svensgaard  wrote:
> >>
> >>
> >>
> >> On 23-03-2024 19:05, Alexander E. Patrakov wrote:
> >>> Sorry for replying to myself, but "ceph osd pool ls detail" by itself
> >>> is insufficient. For every erasure code profile mentioned in the
> >>> output, please also run something like this:
> >>>
> >>> ceph osd erasure-code-profile get prf-for-ec-data
> >>>
> >>> ...where "prf-for-ec-data" is the name that appears after the words
> >>> "erasure profile" in the "ceph osd pool ls detail" output.
> >>
> >> [root@lazy ~]# ceph osd pool ls detail | grep erasure
> >> pool 11 'rbd_ec_data' erasure profile DRCMR_k4m2 size 6 min_size 5
> >> crush_rule 0 object_hash rjenkins pg_num 4096 pgp_num 4096
> >> autoscale_mode off last_change 2257933 lfor 0/1291190/1755101 flags
> >> hashpspool,ec_overwrites,selfmanaged_snaps,bulk stripe_width 16384
> >> fast_read 1 compression_algorithm snappy compression_mode aggressive
> >> application rbd
> >> pool 37 'cephfs.hdd.data'

[ceph-users] Re: Large number of misplaced PGs but little backfill going on

2024-03-23 Thread Alexander E. Patrakov
Hi Torkil,

I have looked at the CRUSH rules, and the equivalent rules work on my
test cluster. So this cannot be the cause of the blockage.

What happens if you increase the osd_max_backfills setting temporarily?
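
For example (option names as in recent releases, please verify against
yours; with the mClock scheduler the override flag is needed for the
value to take effect):

ceph config set osd osd_mclock_override_recovery_settings true
ceph config set osd osd_max_backfills 3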

It may be a good idea to investigate a few of the stalled PGs. Please
run commands similar to this one:

ceph pg 37.0 query > query.37.0.txt
ceph pg 37.1 query > query.37.1.txt
...
and the same for the other affected pools.

Still, I must say that some of your rules are actually unsafe.

The 4+2 rule as used by rbd_ec_data will not survive a
datacenter-offline incident. Namely, for each PG, it chooses OSDs from
two hosts in each datacenter, so 6 OSDs total. When a datacenter is
offline, you will, therefore, have only 4 OSDs up, which is exactly
the number of data chunks. However, the pool requires min_size 5, so
all PGs will be inactive (to prevent data corruption) and will stay
inactive until the datacenter comes up again. However, please don't
set min_size to 4 - then, any additional incident (like a defective
disk) will lead to data loss, and the shards in the datacenter which
went offline would be useless because they do not correspond to the
updated shards written by the clients.

The 4+5 rule as used by cephfs.hdd.data has min_size equal to the
number of data chunks. See above why it is bad. Please set min_size to
5.
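
For example, something like:

ceph osd pool set cephfs.hdd.data min_size 5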

The rbd.ssd.data pool seems to be OK - and, by the way, its PGs are
100% active+clean.

Regarding the mon_max_pg_per_osd setting, you have a few OSDs that
have 300+ PGs, the observed maximum is 347. Please set it to 400.
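
For example, assuming central config is in use:

ceph config set global mon_max_pg_per_osd 400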

On Sun, Mar 24, 2024 at 3:16 AM Torkil Svensgaard  wrote:
>
>
>
> On 23-03-2024 19:05, Alexander E. Patrakov wrote:
> > Sorry for replying to myself, but "ceph osd pool ls detail" by itself
> > is insufficient. For every erasure code profile mentioned in the
> > output, please also run something like this:
> >
> > ceph osd erasure-code-profile get prf-for-ec-data
> >
> > ...where "prf-for-ec-data" is the name that appears after the words
> > "erasure profile" in the "ceph osd pool ls detail" output.
>
> [root@lazy ~]# ceph osd pool ls detail | grep erasure
> pool 11 'rbd_ec_data' erasure profile DRCMR_k4m2 size 6 min_size 5
> crush_rule 0 object_hash rjenkins pg_num 4096 pgp_num 4096
> autoscale_mode off last_change 2257933 lfor 0/1291190/1755101 flags
> hashpspool,ec_overwrites,selfmanaged_snaps,bulk stripe_width 16384
> fast_read 1 compression_algorithm snappy compression_mode aggressive
> application rbd
> pool 37 'cephfs.hdd.data' erasure profile DRCMR_k4m5_datacenter_hdd size
> 9 min_size 4 crush_rule 7 object_hash rjenkins pg_num 2048 pgp_num 2048
> autoscale_mode off last_change 2257933 lfor 0/0/2139486 flags
> hashpspool,ec_overwrites,bulk stripe_width 16384 fast_read 1
> compression_algorithm zstd compression_mode aggressive application cephfs
> pool 38 'rbd.ssd.data' erasure profile DRCMR_k4m5_datacenter_ssd size 9
> min_size 5 crush_rule 8 object_hash rjenkins pg_num 32 pgp_num 32
> autoscale_mode warn last_change 2198930 lfor 0/2198930/2198928 flags
> hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 16384
> compression_algorithm zstd compression_mode aggressive application rbd
>
> [root@lazy ~]# ceph osd erasure-code-profile get DRCMR_k4m2
> crush-device-class=hdd
> crush-failure-domain=host
> crush-root=default
> jerasure-per-chunk-alignment=false
> k=4
> m=2
> plugin=jerasure
> technique=reed_sol_van
> w=8
> [root@lazy ~]# ceph osd erasure-code-profile get DRCMR_k4m5_datacenter_hdd
> crush-device-class=hdd
> crush-failure-domain=datacenter
> crush-root=default
> jerasure-per-chunk-alignment=false
> k=4
> m=5
> plugin=jerasure
> technique=reed_sol_van
> w=8
> [root@lazy ~]# ceph osd erasure-code-profile get DRCMR_k4m5_datacenter_ssd
> crush-device-class=ssd
> crush-failure-domain=datacenter
> crush-root=default
> jerasure-per-chunk-alignment=false
> k=4
> m=5
> plugin=jerasure
> technique=reed_sol_van
> w=8
>
> But as I understand it those profiles are only used to create the
> initial crush rule for the pool, and we have manually edited those along
> the way. Here are the 3 rules in use for the 3 EC pools:
>
> rule rbd_ec_data {
>  id 0
>  type erasure
>  step set_chooseleaf_tries 5
>  step set_choose_tries 100
>  step take default class hdd
>  step choose indep 0 type datacenter
>  step chooseleaf indep 2 type host
>  step emit
> }
> rule cephfs.hdd.data {
>  id 7
>  type erasure
>  step set_chooseleaf_tries 5
>  step set_choose_tries 100
>  step take default class hdd
>  step choose indep 0 type datacenter

[ceph-users] Re: Large number of misplaced PGs but little backfill going on

2024-03-23 Thread Alexander E. Patrakov
Sorry for replying to myself, but "ceph osd pool ls detail" by itself
is insufficient. For every erasure code profile mentioned in the
output, please also run something like this:

ceph osd erasure-code-profile get prf-for-ec-data

...where "prf-for-ec-data" is the name that appears after the words
"erasure profile" in the "ceph osd pool ls detail" output.

On Sun, Mar 24, 2024 at 1:56 AM Alexander E. Patrakov
 wrote:
>
> Hi Torkil,
>
> I take my previous response back.
>
> You have an erasure-coded pool with nine shards but only three
> datacenters. This, in general, cannot work. You need either nine
> datacenters or a very custom CRUSH rule. The second option may not be
> available if the current EC setup is already incompatible, as there is
> no way to change the EC parameters.
>
> It would help if you provided the output of "ceph osd pool ls detail".
>
> On Sun, Mar 24, 2024 at 1:43 AM Alexander E. Patrakov
>  wrote:
> >
> > Hi Torkil,
> >
> > Unfortunately, your files contain nothing obviously bad or suspicious,
> > except for two things: more PGs than usual and bad balance.
> >
> > What's your "mon max pg per osd" setting?
> >
> > On Sun, Mar 24, 2024 at 1:08 AM Torkil Svensgaard  wrote:
> > >
> > > On 2024-03-23 17:54, Kai Stian Olstad wrote:
> > > > On Sat, Mar 23, 2024 at 12:09:29PM +0100, Torkil Svensgaard wrote:
> > > >>
> > > >> The other output is too big for pastebin and I'm not familiar with
> > > >> paste services, any suggestion for a preferred way to share such
> > > >> output?
> > > >
> > > > You can attached files to the mail here on the list.
> > >
> > > Doh, for some reason I was sure attachments would be stripped. Thanks,
> > > attached.
> > >
> > > Mvh.
> > >
> > > Torkil
> >
> >
> >
> > --
> > Alexander E. Patrakov
>
>
>
> --
> Alexander E. Patrakov



-- 
Alexander E. Patrakov
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Large number of misplaced PGs but little backfill going on

2024-03-23 Thread Alexander E. Patrakov
Hi Torkil,

I take my previous response back.

You have an erasure-coded pool with nine shards but only three
datacenters. This, in general, cannot work. You need either nine
datacenters or a very custom CRUSH rule. The second option may not be
available if the current EC setup is already incompatible, as there is
no way to change the EC parameters.

It would help if you provided the output of "ceph osd pool ls detail".

On Sun, Mar 24, 2024 at 1:43 AM Alexander E. Patrakov
 wrote:
>
> Hi Torkil,
>
> Unfortunately, your files contain nothing obviously bad or suspicious,
> except for two things: more PGs than usual and bad balance.
>
> What's your "mon max pg per osd" setting?
>
> On Sun, Mar 24, 2024 at 1:08 AM Torkil Svensgaard  wrote:
> >
> > On 2024-03-23 17:54, Kai Stian Olstad wrote:
> > > On Sat, Mar 23, 2024 at 12:09:29PM +0100, Torkil Svensgaard wrote:
> > >>
> > >> The other output is too big for pastebin and I'm not familiar with
> > >> paste services, any suggestion for a preferred way to share such
> > >> output?
> > >
> > > You can attached files to the mail here on the list.
> >
> > Doh, for some reason I was sure attachments would be stripped. Thanks,
> > attached.
> >
> > Mvh.
> >
> > Torkil
>
>
>
> --
> Alexander E. Patrakov



-- 
Alexander E. Patrakov
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Large number of misplaced PGs but little backfill going on

2024-03-23 Thread Alexander E. Patrakov
Hi Torkil,

Unfortunately, your files contain nothing obviously bad or suspicious,
except for two things: more PGs than usual and bad balance.

What's your "mon max pg per osd" setting?

On Sun, Mar 24, 2024 at 1:08 AM Torkil Svensgaard  wrote:
>
> On 2024-03-23 17:54, Kai Stian Olstad wrote:
> > On Sat, Mar 23, 2024 at 12:09:29PM +0100, Torkil Svensgaard wrote:
> >>
> >> The other output is too big for pastebin and I'm not familiar with
> >> paste services, any suggestion for a preferred way to share such
> >> output?
> >
> > You can attached files to the mail here on the list.
>
> Doh, for some reason I was sure attachments would be stripped. Thanks,
> attached.
>
> Mvh.
>
> Torkil



-- 
Alexander E. Patrakov
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Large number of misplaced PGs but little backfill going on

2024-03-23 Thread Alexander E. Patrakov
Hello Torkil,

It would help if you provided the whole "ceph osd df tree" and "ceph
pg ls" outputs.

On Sat, Mar 23, 2024 at 4:26 PM Torkil Svensgaard  wrote:
>
> Hi
>
> We have this after adding some hosts and changing crush failure domain
> to datacenter:
>
> pgs: 1338512379/3162732055 objects misplaced (42.321%)
>   5970active+remapped+backfill_wait
>   4853 active+clean
>   11   active+remapped+backfilling
>
> We have 3 datacenters each with 6 hosts and ~400 HDD OSDs with DB/WAL on
> NVMe. Using mclock with high_recovery_ops profile.
>
> What is the bottleneck here? I would have expected a huge number of
> simultaneous backfills. Backfill reservation logjam?
>
> Mvh.
>
> Torkil
>
> --
> Torkil Svensgaard
> Systems Administrator
> Danish Research Centre for Magnetic Resonance DRCMR, Section 714
> Copenhagen University Hospital Amager and Hvidovre
> Kettegaard Allé 30, 2650 Hvidovre, Denmark
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io



-- 
Alexander E. Patrakov
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Mounting A RBD Via Kernal Modules

2024-03-23 Thread Alexander E. Patrakov
Hello Dulux-Oz,

Please treat the RBD as a normal block device. Therefore, "mkfs" needs
to be run before mounting it.

The mistake is that you ran "mkfs xfs" instead of "mkfs.xfs" (space vs
dot). And, you are not limited to xfs, feel free to use ext4 or btrfs
or any other block-based filesystem.
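
Concretely, with the device path from your message, something like this
(run the mkfs only once, before the first mount - it destroys whatever is
already on the image):

mkfs.xfs /dev/rbd/my_pool.meta/my_image
mount /mnt/my_image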

On Sat, Mar 23, 2024 at 2:28 PM duluxoz  wrote:
>
> Hi All,
>
> I'm trying to mount a Ceph Reef (v18.2.2 - latest version) RBD Image as
> a 2nd HDD on a Rocky Linux v9.3 (latest version) host.
>
> The EC pool has been created and initialised and the image has been
> created.
>
> The ceph-common package has been installed on the host.
>
> The correct keyring has been added to the host (with a chmod of 600) and
> the host has been configure with an rbdmap file as follows:
> `my_pool.meta/my_image
> id=ceph_user,keyring=/etc/ceph/ceph.client.ceph_user.keyring`.
>
> When running the rbdmap.service the image appears as both `/dev/rbd0`
> and `/dev/rbd/my_pool.meta/my_image`, exactly as the Ceph Doco says it
> should.
>
> So everything *appears* AOK up to this point.
>
> My question now is: Should I run `mkfs xfs` on `/dev/rbd0` *before* or
> *after* I try to mount the image (via fstab:
> `/dev/rbd/my_pool.meta/my_image  /mnt/my_image  xfs  noauto  0 0` - as
> per the Ceph doco)?
>
> The reason I ask is that I've tried this *both* ways and all I get is an
> error message (sorry, can't remember the exact messages and I'm not
> currently in front of the host to confirm it  :-) - but from memory it
> was something about not being able to recognise the 1st block - or
> something like that).
>
> So, I'm obviously doing something wrong, but I can't work out what
> exactly (and the logs don't show any useful info).
>
> Do I, for instance, have the process wrong / don't understand the exact
> process, or is there something else wrong?
>
> All comments/suggestions/etc greatly appreciated - thanks in advance
>
> Cheers
>
> Dulux-Oz
> _______
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io



-- 
Alexander E. Patrakov
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Laptop Losing Connectivity To CephFS On Sleep/Hibernation

2024-03-23 Thread Alexander E. Patrakov
On Sat, Mar 23, 2024 at 3:08 PM duluxoz  wrote:
>
>
> On 23/03/2024 18:00, Alexander E. Patrakov wrote:
> > Hi Dulux-Oz,
> >
> > CephFS is not designed to deal with mobile clients such as laptops
> > that can lose connectivity at any time. And I am not talking about the
> > inconveniences on the laptop itself, but about problems that your
> > laptop would cause to other clients. The problems stem from the fact
> > that MDSes give out "caps" to clients, which are, essentially,
> > licenses to do local caching. If another client wants to access the
> > same file, the MDS would need to contact the laptop and tell it to
> > release the caps - which is no longer possible. Result: a health
> > warning and delays/hangs on other clients.
> >
> > The proper solution here is to use NFSv3 (ideally with a userspace
> > client instead of a kernel mount). NFSv3, because v4 has leases which
> > bring the problem back. And this means that you cannot use cephadm to
> > deploy this NFS server, as cephadm-deployed NFS-Ganesha is hard-coded
> > to speak only NFSv4.
> >
> > SAMBA server with oplocks disabled, and, again, a userspace client
> > could be another solution.
> >
> > If you decide to disregard this advice, here are some tips.
> >
> > With systemd, configuring autofs is as easy as adding
> > "x-systemd.automount,x-systemd.idle-timeout=1min,noauto,nofail,_netdev"
> > to your /etc/fstab line. This applies both to CephFS and NFS.
> >
> > For kernel-based NFSv3 mounts, consider adding "nolock".
> >
> > Another CephFS-specific mount option that somewhat helps with
> > reconnects is "recover_session=clean".
> >
> > --
> > Alexander E. Patrakov
> Hi Alex, and thanks for getting back to me so quickly (I really
> appreciate it),
>
> So from what you said it looks like we've got the wrong solution.
> Instead, (if I'm understanding things correctly) we may be better off
> setting up a dedicated Samba server with the CephFS mounts, and then
> using Samba to share those out - is that right?

Almost right. Please set up a cluster of two SAMBA servers with CTDB,
for high availability.
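
As a minimal sketch (the addresses and the interface name are placeholders,
and you will also need a CTDB recovery lock on shared storage - a file on
CephFS works for that):

/etc/ctdb/nodes (one private address per node):
192.168.10.11
192.168.10.12

/etc/ctdb/public_addresses (floating address that clients connect to):
192.168.10.100/24 eth0

smb.conf additions:
[global]
    clustering = yes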

-- 
Alexander E. Patrakov
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Laptop Losing Connectivity To CephFS On Sleep/Hibernation

2024-03-23 Thread Alexander E. Patrakov
Hi Dulux-Oz,

CephFS is not designed to deal with mobile clients such as laptops
that can lose connectivity at any time. And I am not talking about the
inconveniences on the laptop itself, but about problems that your
laptop would cause to other clients. The problems stem from the fact
that MDSes give out "caps" to clients, which are, essentially,
licenses to do local caching. If another client wants to access the
same file, the MDS would need to contact the laptop and tell it to
release the caps - which is no longer possible. Result: a health
warning and delays/hangs on other clients.

The proper solution here is to use NFSv3 (ideally with a userspace
client instead of a kernel mount). NFSv3, because v4 has leases which
bring the problem back. And this means that you cannot use cephadm to
deploy this NFS server, as cephadm-deployed NFS-Ganesha is hard-coded
to speak only NFSv4.

SAMBA server with oplocks disabled, and, again, a userspace client
could be another solution.

If you decide to disregard this advice, here are some tips.

With systemd, configuring autofs is as easy as adding
"x-systemd.automount,x-systemd.idle-timeout=1min,noauto,nofail,_netdev"
to your /etc/fstab line. This applies both to CephFS and NFS.

For kernel-based NFSv3 mounts, consider adding "nolock".

Another CephFS-specific mount option that somewhat helps with
reconnects is "recover_session=clean".
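
For example, the fstab line from your message below would become something
like this (one line, untested):

ceph_user@.cephfs=/my_folder  /mnt/my_folder  ceph  noatime,x-systemd.automount,x-systemd.idle-timeout=1min,noauto,nofail,_netdev,recover_session=clean  0 0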


On Sat, Mar 23, 2024 at 2:12 PM duluxoz  wrote:
>
> Hi All,
>
> I'm looking for some help/advice to solve the issue outlined in the heading.
>
> I'm running CepfFS (name: cephfs) on a Ceph Reef (v18.2.2 - latest
> update) cluster, connecting from a laptop running Rocky Linux v9.3
> (latest update) with KDE v5 (latest update).
>
> I've set up the laptop to connect to a number of directories on CephFS
> via the `/etc/fstab' folder, an example of such is:
> `ceph_user@.cephfs=/my_folder  /mnt/my_folder  ceph noatime,_netdev  0 0`.
>
> Everything is working great; the required Ceph Key is on the laptop
> (with a chmod of 600), I can access the files on the Ceph Cluster, etc,
> etc, etc - all good.
>
> However, whenever the laptop is in sleep or hibernate mode (ie when I
> close the laptop's lid) and then bring the laptop out of
> sleep/hibernation (ie I open the laptop's lid) I've lost the CephFS
> mountings. The only way to bring them back is to run `mount -a` as root
> (or sudo). This is, as I'm sure you'll agree, not a long-term viable
> options - especially as this is a running as a pilot-project and the
> eventual end-users won't have access to root/sudo.
>
> So I'm seeking the collective wisdom of the community in how to solve
> this issue.
>
> I've taken a brief look at autofs, and even half-heartedly had a go at
> configuring it, but it didn't seem to work - honestly, it was late and I
> wanted to get home after a long day.  :-)
>
> Is this the solution to my issue, or is there a better way to construct
> the fstab entries, or is there another solution I haven't found yet in
> the doco or via google-foo?
>
> All help and advice greatly appreciated - thanks in advance
>
> Cheers
>
> Dulux-Oz
> _______
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io



--
Alexander E. Patrakov
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Call for Interest: Managed SMB Protocol Support

2024-03-22 Thread Alexander E. Patrakov
Hi John,
> A few major features we have planned include:
> * Standalone servers (internally defined users/groups)

No concerns here

> * Active Directory Domain Member Servers

In the second case, what is the plan regarding UID mapping? Is NFS
coexistence planned, or a concurrent mount of the same directory using
CephFS directly?

In fact, I am quite skeptical, because, at least in my experience,
every customer's SAMBA configuration as a domain member is a unique
snowflake, and cephadm would need an ability to specify arbitrary UID
mapping configuration to match what the customer uses elsewhere - and
the match must be precise.

Here is what I have seen or was told about:

1. We don't care about interoperability with NFS or CephFS, so we just
let SAMBA invent whatever UIDs and GIDs it needs using the "tdb2"
idmap backend. It's completely OK that workstations get different UIDs
and GIDs, as only SIDs traverse the wire.
2. [not seen in the wild, the customer did not actually implement it,
it's a product of internal miscommunication, and I am not sure if it
is valid at all] We don't care about interoperability with CephFS,
and, while we have NFS, security guys would not allow running NFS
non-kerberized. Therefore, no UIDs or GIDs traverse the wire, only
SIDs and names. Therefore, all we need is to allow both SAMBA and NFS
to use a shared UID mapping allocated on an as-needed basis using the
"tdb2" idmap module, and it doesn't matter that these UIDs and GIDs
are inconsistent with what clients choose.
3. We don't care about ACLs at all, and don't care about CephFS
interoperability. We set ownership of all new files to root:root 0666
using whatever options are available [well, I would rather use a
dedicated nobody-style uid/gid here]. All we care about is that only
authorized workstations or authorized users can connect to each NFS or
SMB share, and we absolutely don't want them to be able to set custom
ownership or ACLs.
4. We care about NFS and CephFS file ownership being consistent with
what Windows clients see. We store all UIDs and GIDs in Active
Directory using the rfc2307 schema, and it's mandatory that all
servers (especially SAMBA - thanks to the "ad" idmap backend) respect
that and don't try to invent anything [well, they do - BUILTIN/Users
gets its GID through tdb2]. Oh, and by the way, we have this strangely
low-numbered group that everybody gets wrong unless they set "idmap
config CORP : range = 500-99".
5. We use a few static ranges for algorithmic ID translation using the
idmap rid backend. Everything works.
6. We use SSSD, which provides consistent IDs everywhere, and for a
few devices which can't use it, we configured compatible idmap rid
ranges for use with winbindd. The only problem is that we like
user-private groups, and only SSSD has support for them (although we
admit it's our fault that we enabled this non-default option).
7. We store ID mappings in non-AD LDAP and use winbindd with the
"ldap" idmap backend.

I am sure other weird but valid setups exist - please extend the list
if you can.

Which of the above scenarios would be supportable without resorting to
the old way of installing SAMBA manually alongside the cluster?

-- 
Alexander E. Patrakov
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: log_latency slow operation observed for submit_transact, latency = 22.644258499s

2024-03-22 Thread Alexander E. Patrakov
 objects, 1.1 PiB
> >> usage:   1.9 PiB used, 2.3 PiB / 4.2 PiB avail
> >> pgs: 1425479651/3163081036 objects misplaced (45.066%)
> >>  6224 active+remapped+backfill_wait
> >>  4516 active+clean
> >>  67   active+clean+scrubbing
> >>  25   active+remapped+backfilling
> >>  16   active+clean+scrubbing+deep
> >>  1active+remapped+backfill_wait+backfill_toofull
> >>
> >>   io:
> >> client:   117 MiB/s rd, 68 MiB/s wr, 274 op/s rd, 183 op/s wr
> >> recovery: 438 MiB/s, 192 objects/s
> >> "
> >>
> >> Anyone know what the issue might be? Given that is happens on and off
> >> with large periods of time in between with normal low latencies I
> >> think it unlikely that it is just because the cluster is busy.
> >>
> >> Also, how come there's only a small amount of PGs doing backfill when
> >> we have such a large misplaced percentage? Can this be just from
> >> backfill reservation logjam?
> >>
> >> Mvh.
> >>
> >> Torkil
> >>
>
> --
> Torkil Svensgaard
> Systems Administrator
> Danish Research Centre for Magnetic Resonance DRCMR, Section 714
> Copenhagen University Hospital Amager and Hvidovre
> Kettegaard Allé 30, 2650 Hvidovre, Denmark
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io



-- 
Alexander E. Patrakov
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD does not die when disk has failures

2024-03-21 Thread Alexander E. Patrakov
Hi Robert,

One of the theoretically possible (but not implemented in Ceph)
benefits of not crashing would be that an OSD could request the
errored piece of data from other OSDs and rewrite the data on the disk
in place. When a defective sector is rewritten, most disks and SSDs
mark the original one as still bad but reassign a spare to serve in
its place. The end result is that the block device no longer has bad
sectors visible to applications. Doing so, instead of just throwing an
SSD with just one defective block into the trash can, could reduce the
amount of digital waste. Note that this is not a good approach for
HDDs, where defects tend to multiply.

Source: I still have an OCZ Intrepid 3700 SSD with 18 remapped
sectors. All of them appeared during a misguided test through a
USB-to-SATA adapter, which apparently could not provide enough power.
Eight years later, it still works and still has only these 18 remapped
sectors.

Anyway, all of the above is of only theoretical importance, as the
code to hide/cure disk defects that way does not exist.

On Thu, Mar 21, 2024 at 5:15 AM Igor Fedotov  wrote:
>
> Hi Robert,
>
> I presume the plan was to support handling EIO at upper layers. But
> apparently that hasn't been completed. Or there are some bugs...
>
> Will take a look.
>
>
> Thanks,
>
> Igor
>
> On 3/19/2024 3:36 PM, Robert Sander wrote:
> > Hi,
> >
> > On 3/19/24 13:00, Igor Fedotov wrote:
> >>
> >> translating EIO to upper layers rather than crashing an OSD is a
> >> valid default behavior. One can alter this by setting
> >> bluestore_fail_eio parameter to true.
> >
> > What benefit lies in this behavior when in the end client IO stalls?
> >
> > Regards
>
> --
> Igor Fedotov
> Ceph Lead Developer
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH, Freseniusstr. 31h, 81247 Munich
> CEO: Martin Verges - VAT-ID: DE310638492
> Com. register: Amtsgericht Munich HRB 231263
> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
> _______
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io



-- 
Alexander E. Patrakov
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS space usage

2024-03-20 Thread Alexander E. Patrakov
Hi Thorne,

The idea is quite simple. By retesting the leak with a separate pool used
by nobody except you, and assuming the leak exists and is reproducible
(which is not a given), you can pinpoint it definitively, without leaving
any room for the alternative hypothesis "somebody wrote some data in
parallel". And then, even if each individual leak is small, one can argue
that many such events accumulated into the 10 TB of garbage in the
original pool.

On Wed, Mar 20, 2024 at 7:29 PM Thorne Lawler  wrote:

> Alexander,
>
> I'm happy to create a new pool if it will help, but I don't presently see
> how creating a new pool will help us to identify the source of the 10TB
> discrepancy in this original cephfs pool.
>
> Please help me to understand what you are hoping to find...?
> On 20/03/2024 6:35 pm, Alexander E. Patrakov wrote:
>
> Thorne,
>
> That's why I asked you to create a separate pool. All writes go to the
> original pool, and it is possible to see object counts per-pool.
>
> On Wed, Mar 20, 2024 at 6:32 AM Thorne Lawler  wrote:
>
>> Alexander,
>>
>> Thank you, but as I said to Igor: The 5.5TB of files on this filesystem
>> are virtual machine disks. They are under constant, heavy write load. There
>> is no way to turn this off.
>> On 19/03/2024 9:36 pm, Alexander E. Patrakov wrote:
>>
>> Hello Thorne,
>>
>> Here is one more suggestion on how to debug this. Right now, there is
>> uncertainty on whether there is really a disk space leak or if
>> something simply wrote new data during the test.
>>
>> If you have at least three OSDs you can reassign, please set their
>> CRUSH device class to something different than before. E.g., "test".
>> Then, create a new pool that targets this device class and add it to
>> CephFS. Then, create an empty directory on CephFS and assign this pool
>> to it using setfattr. Finally, try reproducing the issue using only
>> files in this directory. This way, you will be sure that nobody else
>> is writing any data to the new pool.
>>
>> On Tue, Mar 19, 2024 at 5:40 PM Igor Fedotov  
>>  wrote:
>>
>> Hi Thorn,
>>
>> given the amount of files at CephFS volume I presume you don't have
>> severe write load against it. Is that correct?
>>
>> If so we can assume that the numbers you're sharing mostly refer to
>> your experiment. At peak I can see bytes_used increase = 629,461,893,120
>> bytes (45978612027392  - 45349150134272). With replica factor = 3 this
>> roughly matches your written data (200GB I presume?).
>>
>>
>> More interesting is that after the file's removal we can see 419,450,880
>> bytes delta (=45349569585152 - 45349150134272). I could see two options
>> (apart that someone else wrote additional stuff to CephFS during the
>> experiment) to explain this:
>>
>> 1. File removal wasn't completed at the last probe half an hour after
>> file's removal. Did you see stale object counter when making that probe?
>>
>> 2. Some space is leaking. If that's the case this could be a reason for
>> your issue if huge(?) files at CephFS are created/removed periodically.
>> So if we're certain that the leak really occurred (and option 1. above
>> isn't the case) it makes sense to run more experiments with
>> writing/removing a bunch of huge files to the volume to confirm space
>> leakage.
>>
>> On 3/18/2024 3:12 AM, Thorne Lawler wrote:
>>
>> Thanks Igor,
>>
>> I have tried that, and the number of objects and bytes_used took a
>> long time to drop, but they seem to have dropped back to almost the
>> original level:
>>
>>   * Before creating the file:
>>   o 3885835 objects
>>   o 45349150134272 bytes_used
>>   * After creating the file:
>>   o 3931663 objects
>>   o 45924147249152 bytes_used
>>   * Immediately after deleting the file:
>>   o 3935995 objects
>>   o 45978612027392 bytes_used
>>   * Half an hour after deleting the file:
>>   o 3886013 objects
>>   o 45349569585152 bytes_used
>>
>> Unfortunately, this is all production infrastructure, so there is
>> always other activity taking place.
>>
>> What tools are there to visually inspect the object map and see how it
>> relates to the filesystem?
>>
>>
>> Not sure if there is anything like that at CephFS level but you can use
>> rados tool to view objects in cephfs data pool and try to build some
>> mapping between them and CephFS file list. Could be a bit tricky though.
>>
>> On 15/03/2024 7:18 pm, Igor Fedotov wrote:
>>

[ceph-users] Re: CephFS space usage

2024-03-20 Thread Alexander E. Patrakov
Thorne,

That's why I asked you to create a separate pool. All writes go to the
original pool, and it is possible to see object counts per-pool.
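
For example, either of these shows per-pool object counts:

ceph df detail
rados df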

On Wed, Mar 20, 2024 at 6:32 AM Thorne Lawler  wrote:

> Alexander,
>
> Thank you, but as I said to Igor: The 5.5TB of files on this filesystem
> are virtual machine disks. They are under constant, heavy write load. There
> is no way to turn this off.
> On 19/03/2024 9:36 pm, Alexander E. Patrakov wrote:
>
> Hello Thorne,
>
> Here is one more suggestion on how to debug this. Right now, there is
> uncertainty on whether there is really a disk space leak or if
> something simply wrote new data during the test.
>
> If you have at least three OSDs you can reassign, please set their
> CRUSH device class to something different than before. E.g., "test".
> Then, create a new pool that targets this device class and add it to
> CephFS. Then, create an empty directory on CephFS and assign this pool
> to it using setfattr. Finally, try reproducing the issue using only
> files in this directory. This way, you will be sure that nobody else
> is writing any data to the new pool.
>
> On Tue, Mar 19, 2024 at 5:40 PM Igor Fedotov  
>  wrote:
>
> Hi Thorn,
>
> given the amount of files at CephFS volume I presume you don't have
> severe write load against it. Is that correct?
>
> If so we can assume that the numbers you're sharing mostly refer to
> your experiment. At peak I can see bytes_used increase = 629,461,893,120
> bytes (45978612027392  - 45349150134272). With replica factor = 3 this
> roughly matches your written data (200GB I presume?).
>
>
> More interesting is that after the file's removal we can see 419,450,880
> bytes delta (=45349569585152 - 45349150134272). I could see two options
> (apart that someone else wrote additional stuff to CephFS during the
> experiment) to explain this:
>
> 1. File removal wasn't completed at the last probe half an hour after
> file's removal. Did you see stale object counter when making that probe?
>
> 2. Some space is leaking. If that's the case this could be a reason for
> your issue if huge(?) files at CephFS are created/removed periodically.
> So if we're certain that the leak really occurred (and option 1. above
> isn't the case) it makes sense to run more experiments with
> writing/removing a bunch of huge files to the volume to confirm space
> leakage.
>
> On 3/18/2024 3:12 AM, Thorne Lawler wrote:
>
> Thanks Igor,
>
> I have tried that, and the number of objects and bytes_used took a
> long time to drop, but they seem to have dropped back to almost the
> original level:
>
>   * Before creating the file:
>   o 3885835 objects
>   o 45349150134272 bytes_used
>   * After creating the file:
>   o 3931663 objects
>   o 45924147249152 bytes_used
>   * Immediately after deleting the file:
>   o 3935995 objects
>   o 45978612027392 bytes_used
>   * Half an hour after deleting the file:
>   o 3886013 objects
>   o 45349569585152 bytes_used
>
> Unfortunately, this is all production infrastructure, so there is
> always other activity taking place.
>
> What tools are there to visually inspect the object map and see how it
> relates to the filesystem?
>
>
> Not sure if there is anything like that at CephFS level but you can use
> rados tool to view objects in cephfs data pool and try to build some
> mapping between them and CephFS file list. Could be a bit tricky though.
>
> On 15/03/2024 7:18 pm, Igor Fedotov wrote:
>
> ceph df detail --format json-pretty
>
> --
>
> Regards,
>
> Thorne Lawler - Senior System Administrator
> *DDNS* | ABN 76 088 607 265
> First registrar certified ISO 27001-2013 Data Security Standard ITGOV40172
> P +61 499 449 170
>
> _DDNS
>
> /_*Please note:* The information contained in this email message and
> any attached files may be confidential information, and may also be
> the subject of legal professional privilege. _If you are not the
> intended recipient any use, disclosure or copying of this email is
> unauthorised. _If you received this email in error, please notify
> Discount Domain Name Services Pty Ltd on 03 9815 6868 to report this
> matter and delete all copies of this transmission together with any
> attachments. /
>
>
> --
> Igor Fedotov
> Ceph Lead Developer
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH, Freseniusstr. 31h, 81247 Munich
> CEO: Martin Verges - VAT-ID: DE310638492
> Com. register: Amtsgericht Munich HRB 231263
> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
> ___
> ceph-users mailing list -- ceph-u

[ceph-users] Re: CephFS space usage

2024-03-19 Thread Alexander E. Patrakov
Hello Thorne,

Here is one more suggestion on how to debug this. Right now, there is
uncertainty on whether there is really a disk space leak or if
something simply wrote new data during the test.

If you have at least three OSDs you can reassign, please set their
CRUSH device class to something different than before. E.g., "test".
Then, create a new pool that targets this device class and add it to
CephFS. Then, create an empty directory on CephFS and assign this pool
to it using setfattr. Finally, try reproducing the issue using only
files in this directory. This way, you will be sure that nobody else
is writing any data to the new pool.
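
A rough sketch of the commands involved (the OSD ids, pool name, filesystem
name, directory and CRUSH rule name are placeholders - adjust them to your
cluster):

ceph osd crush rm-device-class osd.10 osd.11 osd.12
ceph osd crush set-device-class test osd.10 osd.11 osd.12
ceph osd crush rule create-replicated test_rule default host test
ceph osd pool create cephfs_leak_test 32 32 replicated test_rule
ceph fs add_data_pool cephfs cephfs_leak_test
mkdir /mnt/cephfs/leak_test
setfattr -n ceph.dir.layout.pool -v cephfs_leak_test /mnt/cephfs/leak_test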

On Tue, Mar 19, 2024 at 5:40 PM Igor Fedotov  wrote:
>
> Hi Thorn,
>
> given the amount of files at CephFS volume I presume you don't have
> severe write load against it. Is that correct?
>
> If so we can assume that the numbers you're sharing mostly refer to
> your experiment. At peak I can see bytes_used increase = 629,461,893,120
> bytes (45978612027392  - 45349150134272). With replica factor = 3 this
> roughly matches your written data (200GB I presume?).
>
>
> More interesting is that after the file's removal we can see 419,450,880
> bytes delta (=45349569585152 - 45349150134272). I could see two options
> (apart that someone else wrote additional stuff to CephFS during the
> experiment) to explain this:
>
> 1. File removal wasn't completed at the last probe half an hour after
> file's removal. Did you see stale object counter when making that probe?
>
> 2. Some space is leaking. If that's the case this could be a reason for
> your issue if huge(?) files at CephFS are created/removed periodically.
> So if we're certain that the leak really occurred (and option 1. above
> isn't the case) it makes sense to run more experiments with
> writing/removing a bunch of huge files to the volume to confirm space
> leakage.
>
> On 3/18/2024 3:12 AM, Thorne Lawler wrote:
> >
> > Thanks Igor,
> >
> > I have tried that, and the number of objects and bytes_used took a
> > long time to drop, but they seem to have dropped back to almost the
> > original level:
> >
> >   * Before creating the file:
> >   o 3885835 objects
> >   o 45349150134272 bytes_used
> >   * After creating the file:
> >   o 3931663 objects
> >   o 45924147249152 bytes_used
> >   * Immediately after deleting the file:
> >   o 3935995 objects
> >   o 45978612027392 bytes_used
> >   * Half an hour after deleting the file:
> >   o 3886013 objects
> >   o 45349569585152 bytes_used
> >
> > Unfortunately, this is all production infrastructure, so there is
> > always other activity taking place.
> >
> > What tools are there to visually inspect the object map and see how it
> > relates to the filesystem?
> >
> Not sure if there is anything like that at CephFS level but you can use
> rados tool to view objects in cephfs data pool and try to build some
> mapping between them and CephFS file list. Could be a bit tricky though.
> >
> > On 15/03/2024 7:18 pm, Igor Fedotov wrote:
> >> ceph df detail --format json-pretty
> > --
> >
> > Regards,
> >
> > Thorne Lawler - Senior System Administrator
> > *DDNS* | ABN 76 088 607 265
> > First registrar certified ISO 27001-2013 Data Security Standard ITGOV40172
> > P +61 499 449 170
> >
> > _DDNS
> >
> > /_*Please note:* The information contained in this email message and
> > any attached files may be confidential information, and may also be
> > the subject of legal professional privilege. _If you are not the
> > intended recipient any use, disclosure or copying of this email is
> > unauthorised. _If you received this email in error, please notify
> > Discount Domain Name Services Pty Ltd on 03 9815 6868 to report this
> > matter and delete all copies of this transmission together with any
> > attachments. /
> >
> --
> Igor Fedotov
> Ceph Lead Developer
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH, Freseniusstr. 31h, 81247 Munich
> CEO: Martin Verges - VAT-ID: DE310638492
> Com. register: Amtsgericht Munich HRB 231263
> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io



-- 
Alexander E. Patrakov
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Robust cephfs design/best practice

2024-03-15 Thread Alexander E. Patrakov
Hi Istvan,

I would like to add a few notes to what Burkhard mentioned already.

First, CephFS has a built-in feature that allows restricting access to
a certain directory:

ceph fs authorize cephfs client.public-only /public rw

This creates a key with the following caps:

caps mds = "allow rw path=/public"
caps mon = "allow r"
caps osd = "allow rw tag cephfs data=cephfs"

This is still a problem if a malicious client guesses the inode number
and tries to access it through OSDs directly, by predicting object
names. This can be thwarted by using namespaces, but it's a bit
cumbersome. First, before placing any single file into this directory,
you have to assign a namespace (here "public", but it's just a string,
you can use any other string) to it:

setfattr -n ceph.dir.layout.pool_namespace -v public /mnt/cephfs/public

Then, adjust the key to have the following caps:

caps mds = "allow rw path=/public"
caps mon = "allow r"
caps osd "allow rw pool=cephfs_data namespace=public,allow rw
pool=cephfs_data namespace=public"
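
If you prefer to apply the adjusted caps with a single command (using the
client and pool names from the example above):

ceph auth caps client.public-only mds 'allow rw path=/public' mon 'allow r' osd 'allow rw pool=cephfs_data namespace=public'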

Second, CephFS has a feature where any directory can be snapshotted.
There is also an mgr module "snap_schedule" that allows scheduling
snapshot creation - e.g., creating daily or weekly snapshots of your
data. This module also allows setting retention policies. However, a
lot of users who tried it ended up with unsatisfactory performance:
when removing snapshots (scheduled or not), and actually during any
big removals, the MR_Finisher thread of the MDS often ends up
consuming 100% of a CPU core, sometimes impacting other operations.
The only advice that I have here is to snapshot only what you need and
only as often as you need.
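
For reference, a minimal sketch of the snap_schedule commands (the path,
schedule and retention below are just examples):

ceph mgr module enable snap_schedule
ceph fs snap-schedule add /public 1d
ceph fs snap-schedule retention add /public d 7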

On Fri, Mar 15, 2024 at 4:28 PM Burkhard Linke
 wrote:
>
> Hi,
>
> On 15.03.24 08:57, Szabo, Istvan (Agoda) wrote:
> > Hi,
> >
> > I'd like to add cephfs to our production objectstore/block storage cluster 
> > so I'd like to collect hands on experiences like, good to know/be 
> > careful/avoid etc ... other than ceph documentation.
>
>
> Just some aspects that might not be obvious on first sight:
>
>
> 1. CephFS does not perform authorization on the MDS. You have to trust
> all your clients to behave correctly. If this is not possible you can
> have a NFS server export CephFS (and use kerberos authentication, GID
> management e.g. from LDAP/AD etc.)
>
>
> 2. CephFS behaves different than e.g. NFS. It has a much stronger
> consistency model. Certain operations which are fine on NFS are an anti
> pattern for CephFS, e.g. running hundreds of jobs on a compute cluster
> and redirect output to a single file. Your users will have to adopt.
>
>
> 3. CephFS maintains an object for each file in the first data pool. This
> object has at least one xattr value attached that is crucial for
> desaster recovery. You first data pool thus cannot be an EC pool. I'm
> not sure how often this value is updated (e.g. does it contain mtime?).
> If you plan to use an EC pool for data storage, you need three pools:
> metadata, replicated data pool as first pool, EC pool as third pool. You
> can use filesystem attributes to control which pool is used for data
> storage. This is the setup of our main filesystem (using only the EC
> pool for data):
>
> --- POOLS ---
> POOL ID   PGS   STORED  OBJECTS USED %USED  MAX
> AVAIL
> xxx_metadata  7464  203 GiB   27.87M  608 GiB 3.056.3 TiB
> xxx_data_rep  76   256  0 B  357.92M  0 B 06.3 TiB
> xxx_data_ec   77  4096  2.0 PiB  852.56M  2.4 PiB 50.971.8 PiB
>
>
> 4. MDS is a memory hog and mostly single threaded. Metadata performance
> depends on cpu speed, and especially on the amount of RAM available.
> More RAM, more cache inode information.
>
>
> 5. Avoid workloads having too many files open at the same time. Each
> file being access require a capability reservation on the MDS, which
> consumes a certain amount of memory. More client with more open files ->
> more RAM needed.
>
>
> 6. Finally: In case of a failover, the then-active MDS has to be
> reconnected by all clients. It will collect inode information for all
> open files during this phase. This can consume a lot of memory, and it
> will definitely take some time depending on the performance of the ceph
> cluster. If you have too many files open, the MDS may run into a timeout
> and restart, resulting in a restart loop. I fixed this problem in the
> past by extending the timeout.
>
>
> Overall CephFS is a great system, but you need to know your current and
> future workloads to configure it accordingly. This is also true for any
> other shared filesystem.
>
>
> Best regards,
>
> Burkhard Linke
>
> ___

[ceph-users] Re: bluestore_min_alloc_size and bluefs_shared_alloc_size

2024-03-11 Thread Alexander E. Patrakov
Hello Joel,

Please be aware that it is not recommended to keep a mix of OSDs
created with different bluestore_min_alloc_size values within the same
CRUSH device class. The consequence of such a mix is that the balancer
will not work properly - instead of evening out the OSD space
utilization, it will create a distribution with two bands.

This is a bug in the balancer. A ticket has been filed already:
https://tracker.ceph.com/issues/64715
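
To check which value an existing OSD was built with (osd.36 is just an
example; on releases that carry the backport mentioned below, the value
also shows up in the OSD metadata):

ceph osd metadata 36 | grep -i alloc
ceph-bluestore-tool show-label --path /var/lib/ceph/osd/ceph-36 | grep bfm_bytes_per_block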

On Tue, Mar 12, 2024 at 4:45 AM Joel Davidow  wrote:
>
> For osds that are added new, bfm_bytes_per_block is 4096. However, for osds
> that were added when the cluster was running octopus, bfm_bytes_per_block
> remains 65535.
>
> Based on
> https://github.com/ceph/ceph/blob/1c349451176cc5b4ebfb24b22eaaa754e05cff6c/src/os/bluestore/BitmapFreelistManager.cc
> and the space allocation section on page 360 of
> https://pdl.cmu.edu/PDL-FTP/Storage/ceph-exp-sosp19.pdf, it appears
> bfm_bytes_per_block is the bluestore_min_alloc_size that the osd was built
> with.
>
> Below is a sanitized example of what I was referring to as the osd label
> (which includes bfm_bytes_per_block) that was run on an osd built under
> octopus. The cluster was later upgraded to pacific.
>
> user@osd-host:/# ceph-bluestore-tool show-label --path
> /var/lib/ceph/osd/ceph-36
> inferring bluefs devices from bluestore path
> {
> "/var/lib/ceph/osd/ceph-36/block": {
> "osd_uuid": "",
> "size": 4000783007744,
> "btime": "2021-09-14T15:16:55.605860+",
> "description": "main",
> "bfm_blocks": "61047168",
> "bfm_blocks_per_key": "128",
> "bfm_bytes_per_block": "65536",
> "bfm_size": "4000783007744",
> "bluefs": "1",
> "ceph_fsid": "",
> "kv_backend": "rocksdb",
> "magic": "ceph osd volume v026",
> "mkfs_done": "yes",
> "osd_key": "",
> "osdspec_affinity": "",
> "ready": "ready",
> "require_osd_release": "16",
> "whoami": "36"
> }
> }
>
> I'm really interested in learning the answers to the questions in the
> original post.
>
> Thanks,
> Joel
>
> On Wed, Mar 6, 2024 at 12:11 PM Anthony D'Atri 
> wrote:
>
> >
> >
> > On Feb 28, 2024, at 17:55, Joel Davidow  wrote:
> >
> > Current situation
> > -
> > We have three Ceph clusters that were originally built via cephadm on
> > octopus and later upgraded to pacific. All osds are HDD (will be moving to
> > wal+db on SSD) and were resharded after the upgrade to enable rocksdb
> > sharding.
> >
> > The value for bluefs_shared_alloc_size has remained unchanged at 65535.
> >
> > The value for bluestore_min_alloc_size_hdd was 65535 in octopus but is
> > reported as 4096 by ceph daemon osd. config show in pacific.
> >
> >
> > min_alloc_size is baked into a given OSD when it is created.  The central
> > config / runtime value does not affect behavior for existing OSDs.  The
> > only way to change it is to destroy / redeploy the OSD.
> >
> > There was a succession of PRs in the Octopus / Pacific timeframe around
> > default min_alloc_size for HDD and SSD device classes, including IIRC one
> > temporary reversion.
> >
> > However, the osd label after upgrading to pacific retains the value of
> > 65535 for bfm_bytes_per_block.
> >
> >
> > OSD label?
> >
> > I'm not sure if your Pacific release has the back port, but not that along
> > ago `ceph osd metadata` was amended to report the min_alloc_size that a
> > given OSD was built with.  If you don't have that, the OSD's startup log
> > should report it.
> >
> > -- aad
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io



-- 
Alexander E. Patrakov
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How can I clone data from a faulty bluestore disk?

2024-02-03 Thread Alexander E. Patrakov
Hi,

I think that the approach with exporting and importing PGs would be
a priori more successful than the one based on pvmove or ddrescue. The
reason is that you don't need to export/import all data that the
failed disk holds, but only the PGs that Ceph cannot recover
otherwise. The logic here is that these are, likely, not the same PGs
due to which tools are crashing.

Note that after the export/import operation Ceph might still think "I
need a copy from that failed disk and not the one that you gave me". In
this case, just export a copy of the same PG from the other failed OSD
and import it elsewhere, up to the total number of copies. If even
that doesn't help, "ceph osd lost XX" would be the last (very
dangerous) words to convince Ceph that osd.XX will not be seen in the
future.
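
A rough sketch with placeholder OSD ids and PG id (both OSDs must be
stopped while ceph-objectstore-tool runs, and the target OSD must not
already hold that PG):

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-XX --pgid 1.2f --op export --file /root/pg1.2f.export
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-YY --op import --file /root/pg1.2f.export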

On Sat, Feb 3, 2024 at 5:35 AM Eugen Block  wrote:
>
> Hi,
>
> if the OSDs are deployed as LVs (by ceph-volume) you could try to do a
> pvmove to a healthy disk. There was a thread here a couple of weeks
> ago explaining the steps. I don’t have it at hand right now, but it
> should be easy to find.
> Of course, there’s no guarantee that this will be successful. I also
> can’t tell if Igor‘s approach is more promising.
>
> Zitat von Igor Fedotov :
>
> > Hi Carl,
> >
> > you might want to use ceph-objectstore-tool to export PGs from
> > faulty OSDs and import them back to healthy ones.
> >
> > The process could be quite tricky though.
> >
> > There is also pending PR (https://github.com/ceph/ceph/pull/54991)
> > to make the tool more tolerant to disk errors.
> >
> > The patch worth trying in some cases, not a silver bullet though.
> >
> > And generally whether the recovery doable greatly depends on the
> > actual error(s).
> >
> >
> > Thanks,
> >
> > Igor
> >
> > On 02/02/2024 19:03, Carl J Taylor wrote:
> >> Hi,
> >> I have a small cluster with some faulty disks within it and I want to clone
> >> the data from the faulty disks onto new ones.
> >>
> >> The cluster is currently down and I am unable to do things like
> >> ceph-bluestore-fsck but ceph-bluestore-tool  bluefs-export does appear to
> >> be working.
> >>
> >> Any help would be appreciated
> >>
> >> Many thanks
> >> Carl
> >> ___
> >> ceph-users mailing list -- ceph-users@ceph.io
> >> To unsubscribe send an email to ceph-users-le...@ceph.io
> > _______
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io



-- 
Alexander E. Patrakov
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Changing A Ceph Cluster's Front- And/Or Back-End Networks IP Address(es)

2024-01-31 Thread Alexander E. Patrakov
Hi.

On Tue, Jan 30, 2024 at 5:24 PM duluxoz  wrote:
>
> Hi All,
>
> Quick Q: How easy/hard is it to change the IP networks of:
>
> 1) A Ceph Cluster's "Front-End" Network?

This is hard, but doable. The problem is that the MON database
contains the expected addresses of all MONs, and therefore, you cannot
just change them. What you can do is:

* Make sure that routing between the old network and the new one is functional
* Set at least "noout"
* Change the cluster config so that it lists two public networks (old and new)
* Remove one MON
* Adjust ceph.conf on all MONs to point it to the new MON address
* Add a new MON on the new network, wait for it to join the quorum
* Repeat the process for other MONs, one by one
* Do another rolling restart of all MONs with the correct ceph.conf,
just for good measure

Then change the network config and ceph.conf on OSD nodes and clients.
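
A sketch of the relevant commands (the subnets, MON name and host are
placeholders, and the exact way to deploy the new MON depends on whether
you use cephadm or manage the daemons by hand):

ceph config set global public_network "192.168.1.0/24,10.10.10.0/24"
ceph mon remove mon-a
# with cephadm, something like:
ceph orch daemon add mon newhost:10.10.10.5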

> 2) A Ceph Cluster's "Back-End" Network?

I have never tried this. In fact, any cluster that has a back-end
network not over the same physical interfaces as the front-end network
will not pass my audit. VLANs are OK. A failure of a back-end network
card on a single OSD would bring the whole cluster to a halt.

The problem is described at
https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-osd/#flapping-osds

>
> Is it a "simply" matter of:
>
> a) Placing the Nodes in maintenance mode
>
> b) Changing a config file (I assume it's /etc/ceph/ceph.conf) on each Node
>
> c) Rebooting the Nodes
>
> d) Taking each Node out of Maintenance Mode
>
> Thanks in advance
>
> Cheers
>
> Dulux-Oz
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io



-- 
Alexander E. Patrakov
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Network Flapping Causing Slow Ops and Freezing VMs

2024-01-06 Thread Alexander E. Patrakov
Hello Mahnoosh,

Just to double check, can you confirm that you are NOT using a cluster
(private) network that is physically separate from the public network? A
configuration with such physically separate networks is inherently
vulnerable and therefore cannot be recommended. VLANs on the same
physical interface are probably acceptable, but I have never seen a
cluster configured like this.

https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-osd/#flapping-osds
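
For reference, if the networks are defined via the central config, the
following shows what the OSDs are using (they may also be set in ceph.conf
on the hosts):

ceph config get osd public_network
ceph config get osd cluster_network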

On Sat, Jan 6, 2024 at 9:28 PM mahnoosh shahidi  wrote:
>
> Hi all,
>
> I hope this message finds you well. We recently encountered an issue on one
> of our OSD servers, leading to network flapping and subsequently causing
> significant performance degradation across our entire cluster. Although the
> OSDs were correctly marked as down in the monitor, slow ops persisted until
> we resolved the network issue. This incident resulted in a major
> disruption, especially affecting VMs with mapped RBD images, leading to
> their freezing.
>
> In light of this, I have two key questions for the community:
>
> 1. Why did slow ops persist even after marking the affected server as down
> in the monitor?
>
> 2.Are there any recommended configurations for OSD suicide or OSD down
> reports that could help us better handle similar network-related issues in
> the future?
>
> Best Regards,
> Mahnoosh
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io



-- 
Alexander E. Patrakov
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CEPH Cluster performance review

2023-11-13 Thread Alexander E. Patrakov
Hello Mosharaf,

There is an automated service available that will criticize your cluster:

https://analyzer.clyso.com/#/analyzer

On Sun, Nov 12, 2023 at 12:03 PM Mosharaf Hossain <
mosharaf.hoss...@bol-online.com> wrote:

> Hello Community
>
> Currently, I operate a CEPH Cluster utilizing Ceph Octopus version 1.5.2.7,
> installed through Ansible. The challenge I'm encountering is that, during
> scrubbing, OSD latency spikes to 300-600 ms, resulting in sluggish
> performance for all VMs.
> Additionally, some OSDs fail during the scrubbing process. In such
> instances, promptly halting the scrubbing resolves the issue.
>
> *Summary of CEPH Version*:
> CEPH Version: 15.2.7
> Number of Nodes: 12 (6 SSD nodes + 6 HDD nodes)
> Node Networking: all nodes are connected through a 10G bonded link,
> i.e. 10Gx2=20GB for each node
> OSDs: 64 SSD + 42 HDD (106 total)
> Pools (PGs, state): one-ssd 256 active+clean, one-hdd 512 active+clean,
> cloudstack.hdd 512 active+clean
>
> I intend to enlarge the PG size for the "one-ssd" configuration. Please
> provide the PG number, and suggest the optimal approach to increase the PG
> size without causing any slowdown or service disruptions to the VMs.
>
> Your expertise and guidance on this matter would be highly valuable, and
> I'm eager to benefit from the collective knowledge of the Ceph community.
>
> Thank you in advance for your time and assistance. I look forward to
> hearing from you.
>
> *CEPH Health status:*
> root@mon1:~# ceph -s
>   cluster:
> id: f8096ec7-51db-4557-85e6-57d7fdfe9423
> health: HEALTH_WARN
> nodeep-scrub flag(s) set
> 656 pgs not deep-scrubbed in time
>
>   services:
> mon: 3 daemons, quorum ceph2,mon1,ceph6 (age 4d)
> mgr: ceph4(active, since 2w), standbys: mon1, ceph3, ceph6, ceph1
> mds: cephfs:1 {0=ceph8=up:active} 1 up:standby
> osd: 107 osds: 105 up (since 3h), 105 in (since 3d)
>  flags nodeep-scrub
> rgw: 4 daemons active (ceph10.rgw0, ceph7.rgw0, ceph9.rgw0,
> mon1.rgw0)
> rgw-nfs: 2 daemons active (ceph7, ceph9)
>
>   task status:
>
>   data:
> pools:   13 pools, 2057 pgs
> objects: 9.40M objects, 35 TiB
> usage:   106 TiB used, 154 TiB / 259 TiB avail
> pgs: 2057 active+clean
>
>   io:
> client:   14 MiB/s rd, 30 MiB/s wr, 1.50k op/s rd, 1.53k op/s wr
>
>
> root@ceph1:~# ceph df
> --- RAW STORAGE ---
> CLASS  SIZE AVAILUSED RAW USED  %RAW USED
> hdd151 TiB   78 TiB   72 TiB73 TiB  48.04
> ssd110 TiB   78 TiB   32 TiB32 TiB  29.42
> TOTAL  261 TiB  156 TiB  104 TiB   105 TiB  40.19
>
> --- POOLS ---
> POOLID  PGS  STORED   OBJECTS  USED %USED  MAX
> AVAIL
> cephfs_data  1   64  3.8 KiB0   11 KiB  0
> 23 TiB
> cephfs_metadata  28  228 MiB   79  685 MiB  0
> 23 TiB
> .rgw.root3   32  6.0 KiB8  1.5 MiB  0
> 23 TiB
> default.rgw.control  4   32  0 B8  0 B  0
> 23 TiB
> default.rgw.meta 5   32   12 KiB   48  7.5 MiB  0
> 23 TiB
> default.rgw.log  6   32  4.8 KiB  207  6.0 MiB  0
> 23 TiB
> default.rgw.buckets.index7   32  410 MiB   15  1.2 GiB  0
> 23 TiB
> default.rgw.buckets.data 8  512  4.6 TiB1.29M   14 TiB  16.59
> 23 TiB
> default.rgw.buckets.non-ec   9   32  1.0 MiB  676  130 MiB  0
> 23 TiB
> one-hdd 10  512  9.2 TiB2.45M   28 TiB  28.69
> 23 TiB
> device_health_metrics   111  9.5 MiB  113   28 MiB  0
> 23 TiB
> one-ssd 12  256   11 TiB2.88M   32 TiB  31.37
> 23 TiB
> cloudstack.hdd  15  512   10 TiB2.72M   31 TiB  30.94
> 23 TiB
>
>
>
> Regards
> Mosharaf Hossain
> Manager, Product Development
> IT Division
>
> Bangladesh Export Import Company Ltd.
>
> Level-8, SAM Tower, Plot #4, Road #22, Gulshan-1, Dhaka-1212,Bangladesh
>
> Tel: +880 9609 000 999, +880 2 5881 5559, Ext: 14191, Fax: +880 2 9895757
>
> Cell: +8801787680828, Email: mosharaf.hoss...@bol-online.com, Web:
> www.bol-online.com
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>


-- 
Alexander E. Patrakov
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MDS_CACHE_OVERSIZED, what is this a symptom of?

2023-09-19 Thread Alexander E. Patrakov
Hello Pedro,

This is a known bug in standby-replay MDS. Please see the links below and
patiently wait for the resolution. Restarting the standby-replay MDS will
clear the warning with zero client impact, and realistically, that's the
only thing (besides disabling the standby-replay MDS completely) that you
can do.

https://tracker.ceph.com/issues/40213
https://github.com/ceph/ceph/pull/48483
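
As an example of what "restarting" could look like - the exact names depend
on how the daemon was deployed, so treat this as a sketch and substitute the
names from your own "ceph fs status" / "ceph orch ps" output:

# package-based deployment (the MDS name is the part after "mds."):
systemctl restart ceph-mds@storefs-b.service

# cephadm-managed cluster:
ceph orch ps --daemon-type mds          # find the exact daemon name
ceph orch daemon restart mds.<daemon-name-from-the-output-above>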


On Tue, Sep 19, 2023 at 6:51 AM Pedro Lopes  wrote:

> So I'm getting this warning (although there are no noticeable problems in
> the cluster):
>
> $ ceph health detail
> HEALTH_WARN 1 MDSs report oversized cache
> [WRN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache
> mds.storefs-b(mds.0): MDS cache is too large (7GB/4GB); 0 inodes in
> use by clients, 0 stray files
>
> Ceph FS status:
>
> $ ceph fs status
> storefs - 20 clients
> ===
> RANK  STATE  MDSACTIVITY DNSINOS   DIRS
>  CAPS
>  0active  storefs-a  Reqs:0 /s  1385k  1385k   113k
>  193k
> 0-s   standby-replay  storefs-b  Evts:0 /s  3123k  3123k  33.5k 0
>
>   POOL  TYPE USED  AVAIL
> storefs-metadata  metadata  19.4G  12.6T
>  storefs-pool4x data4201M  9708G
>  storefs-pool2x data2338G  18.9T
> MDS version: ceph version 17.2.5
> (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)
>
> What is telling me? Is it just that case of the cache size needing to be
> bigger? Is it a problem with the clients holding onto some kind of
> reference (documentation says this can be a cause, but now how to check for
> it).
>
> Thanks in advance,
> Pedro Lopes
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>


-- 
Alexander E. Patrakov
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Best practices regarding MDS node restart

2023-09-09 Thread Alexander E. Patrakov
Hello,

I am interested in the best-practice guidance for the following situation.

There is a Ceph cluster with CephFS deployed. There are three servers
dedicated to running MDS daemons: one active, one standby-replay, and one
standby. There is only a single rank.

Sometimes, servers need to be rebooted for reasons unrelated to Ceph.
What's the proper procedure to follow when restarting a node that currently
contains an active MDS server? The goal is to minimize the client downtime.
Ideally, they should not notice even if they play MP3s from the CephFS
filesystem (note that I haven't tested this exact scenario) - is this
achievable?

I tried to use the "ceph mds fail mds02" command while mds02 was active and
mds03 was standby-replay, to force the fail-over to mds03 so that I could
reboot mds02. Result: mds02 became standby, while mds03 went through
reconnect (30 seconds), rejoin (another 30 seconds), and replay (5 seconds)
phases. During the "reconnect" and "rejoin" phases, the "Activity" column
of "ceph fs status" is empty, which concerns me. It looks like I just
caused a 65-second downtime. After all of that, mds02 became
standby-replay, as expected.

Is there a better way? Or, should I have rebooted mds02 without much
thinking?

-- 
Alexander E. Patrakov
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Unhappy Cluster

2023-09-08 Thread Alexander E. Patrakov
Hello Dave,

I think your data is still intact. Nautilus, indeed, had issues when
recovering erasure-coded pools. You can try temporarily setting min_size to
4. This bug has been fixed in Octopus or later releases. From the release
notes at https://docs.ceph.com/en/latest/releases/octopus/:

> Ceph will allow recovery below min_size for Erasure coded pools,
> wherever possible.
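
A minimal sketch of that, assuming the pool is called "ecpool" (substitute
your real pool name), and remembering to restore the value once recovery
completes:

ceph osd pool set ecpool min_size 4   # temporarily allow recovery with only k shards
# ... wait for recovery/backfill to finish ...
ceph osd pool set ecpool min_size 5   # back to k+1 for a 4+2 pool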



On Sat, Sep 9, 2023 at 5:58 AM Dave S  wrote:

> Hi Everyone,
> I've been fighting with a ceph cluster that we have recently
> physically relocated and lost 2 OSDs during the ensuing power down and
> relocation. After powering everything back on we have
>  3   incomplete
>  3   remapped+incomplete
> And indeed we have 2 OSDs that died along the way.
> The reason I'm contacting the list is that I'm surprised that these
> PGs are incomplete.  We're running Erasure coding with K=4, M=2 which
> in my understanding we should be able to lose 2 OSDs without an issue.
> Am I mis-understanding this or does m=2 mean you can lose m-1 OSDs?
>
> Also, these two OSDs happened to be in the same server (#3 of 8 total
> servers).
>
> This is an older cluster running Nautilus 14.2.9.
>
> Any thoughts?
> Thanks
> -Dave
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>


-- 
Alexander E. Patrakov
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Rocksdb compaction and OSD timeout

2023-09-07 Thread Alexander E. Patrakov
  0 log_channel(cluster) log
[WRN] : Health check failed: 1 osds down (OSD_DOWN)
Wed 2023-09-06 22:42:26 UTC mds01 ceph-mon@mds01.service[3258]:
2023-09-06T22:42:26.453+ 7f8f70212700  1 mon.mds01@0(leader).osd
e1031872 e1031872: 459 total, 437 up, 438 in
Wed 2023-09-06 22:42:26 UTC mds01 ceph-mon@mds01.service[3258]:
2023-09-06T22:42:26.453+ 7f8f70212700  0 log_channel(cluster) log
[DBG] : osdmap e1031872: 459 total, 437 up, 438 in

Wed 2023-09-06 22:42:26 UTC ceph-osd09 ceph-osd@39.service[5574]:
2023-09-06T22:42:26.532+ 7f83d813d700  0 --1-
v1:10.3.0.9:6862/5574 >> v1:10.3.0.7:6857/5579 conn(0x55cd63342000
0x55cd63344000 :-1 s=OPENED pgs=12691 cs=1079 l=0).fault initiating
reconnect
Wed 2023-09-06 22:42:26 UTC ceph-osd09 ceph-osd@39.service[5574]:
2023-09-06T22:42:26.532+ 7f83d813d700  0 --1-
v1:10.3.0.9:6862/5574 >> v1:10.3.0.7:6857/5579 conn(0x55cd63342000
0x55cd63344000 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=12691 cs=1080
l=0).handle_connect_reply_2 connect got RESETSESSION
Wed 2023-09-06 22:42:26 UTC ceph-osd07 ceph-osd@91.service[5579]:
2023-09-06T22:42:26.529+ 7fbf8f3ed700  0 --1-
v1:10.3.0.7:6857/5579 >> v1:10.3.0.9:6862/5574 conn(0x5588d67cac00
0x5588f210b000 :6857 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
l=0).handle_connect_message_2 accept we reset (peer sent cseq 1080),
sending RESETSESSION

Wed 2023-09-06 22:42:26 UTC mds01 ceph-mon@mds01.service[3258]:
2023-09-06T22:42:26.769+ 7f8f72216700  0 log_channel(cluster) log
[INF] : osd.39 marked itself dead as of e1031872
Wed 2023-09-06 22:42:26 UTC ceph-osd09 ceph-osd@39.service[5574]:
2023-09-06T22:42:26.768+ 7f83c4ed7700  0 log_channel(cluster) log
[WRN] : Monitor daemon marked osd.39 down, but it is still running
Wed 2023-09-06 22:42:26 UTC ceph-osd09 ceph-osd@39.service[5574]:
2023-09-06T22:42:26.768+ 7f83c4ed7700  0 log_channel(cluster) log
[DBG] : map e1031872 wrongly marked me down at e1031872
Wed 2023-09-06 22:42:26 UTC ceph-osd09 ceph-osd@39.service[5574]:
2023-09-06T22:42:26.768+ 7f83c4ed7700  1 osd.39 1031872
start_waiting_for_healthy
Wed 2023-09-06 22:42:26 UTC ceph-osd09 ceph-osd@39.service[5574]:
2023-09-06T22:42:26.772+ 7f83c4ed7700  1 osd.39 1031872 is_healthy
false -- only 0/12 up peers (less than 33%)
Wed 2023-09-06 22:42:26 UTC ceph-osd09 ceph-osd@39.service[5574]:
2023-09-06T22:42:26.772+ 7f83c4ed7700  1 osd.39 1031872 not
healthy; waiting to boot
Wed 2023-09-06 22:42:27 UTC mds01 ceph-mon@mds01.service[3258]:
2023-09-06T22:42:27.481+ 7f8f70212700  1 mon.mds01@0(leader).osd
e1031873 e1031873: 459 total, 437 up, 438 in

Then osd.39 comes up by itself:

Wed 2023-09-06 22:42:28 UTC ceph-osd09 ceph-osd@39.service[5574]:
2023-09-06T22:42:28.516+ 7f83d20fe700  1 osd.39 1031873 tick
checking mon for new map
Wed 2023-09-06 22:42:28 UTC mds01 ceph-mon@mds01.service[3258]:
2023-09-06T22:42:28.541+ 7f8f70212700  0 log_channel(cluster) log
[INF] : osd.39 v1:10.3.0.9:6860/5574 boot
Wed 2023-09-06 22:42:28 UTC mds01 ceph-mon@mds01.service[3258]:
2023-09-06T22:42:28.541+ 7f8f70212700  0 log_channel(cluster) log
[DBG] : osdmap e1031874: 459 total, 438 up, 438 in

1) Is this the same?
2) It's strange that the majority (but not all) of OSDs saying "no
reply" come from the old part of the cluster. Is there any debug
option that would allow us to discriminate between the automatic
compaction and network issues? At which debug level are automatic
compactions announced?

I am asking the question (2) because, prior to the first osd_op_tp
timeout, there is nothing that announces a compaction.

On Thu, Sep 7, 2023 at 5:08 PM J-P Methot  wrote:
>
> We're talking about automatic online compaction here, not running the
> command.
>
> On 9/7/23 04:04, Konstantin Shalygin wrote:
> > Hi,
> >
> >> On 7 Sep 2023, at 10:05, J-P Methot  wrote:
> >>
> >> We're running latest Pacific on our production cluster and we've been
> >> seeing the dreaded 'OSD::osd_op_tp thread 0x7f346aa64700' had timed
> >> out after 15.00954s' error. We have reasons to believe this
> >> happens each time the RocksDB compaction process is launched on an
> >> OSD. My question is, does the cluster detecting that an OSD has timed
> >> out interrupt the compaction process? This seems to be what's
> >> happening, but it's not immediately obvious. We are currently facing
> >> an infinite loop of random OSDs timing out and if the compaction
> >> process is interrupted without finishing, it may explain that.
> >
> > You run the online compacting for this OSD's (`ceph osd compact
> > ${osd_id}` command), right?
> >
> >
> >
> > k
>
> --
> Jean-Philippe Méthot
> Senior Openstack system administrator
> Administrateur système Openstack sénior
> PlanetHoster inc.
> ___
> ceph

[ceph-users] Re: librbd 4k read/write?

2023-08-11 Thread Alexander E. Patrakov
Hello Murilo,

This is an expected result, and it is not specific to Ceph. Any
storage that consists of multiple disks will produce a performance
gain over a single disk only if the workload allows for concurrent use
of these disks - which is not the case with your 4K benchmark due to
the de-facto missing readahead. The default readahead in Linux is just
128 kilobytes, and it means that even in a linear read scenario the
benchmark has no way to hit multiple RADOS objects at once. Reminder:
they are 4 megabytes in size by default with RBD.

To allow for faster linear reads and writes, please create a file,
/etc/udev/rules.d/80-rbd.rules, with the following contents (assuming
that the VM sees the RBD as /dev/sda):

KERNEL=="sda", ENV{DEVTYPE}=="disk", ACTION=="add|change",
ATTR{bdi/read_ahead_kb}="32768"

Or test it without any udev rule like this:

blockdev --setra 65536 /dev/sda

The difference in numbers is because one is in kilobytes and one is in
512-byte sectors.

Mandatory warning: this setting can hurt other workloads.

On Thu, Aug 10, 2023 at 11:37 PM Murilo Morais  wrote:
>
> Good afternoon everybody!
>
> I have the following scenario:
> Pool RBD replication x3
> 5 hosts with 12 SAS spinning disks each
>
> I'm using exactly the following line with FIO to test:
> fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4M -size=10G
> -iodepth=16 -rw=write -filename=./test.img
>
> If I increase the blocksize I can easily reach 1.5 GBps or more.
>
> But when I use blocksize in 4K I get a measly 12 Megabytes per second,
> which is quite annoying. I achieve the same rate if rw=read.
>
> If I use librbd's cache I get a considerable improvement in writing, but
> reading remains the same.
>
> I already tested with rbd_read_from_replica_policy=balance but I didn't
> notice any difference. I tried to leave readahead enabled by setting
> rbd_readahead_disable_after_bytes=0 but I didn't see any difference in
> sequential reading either.
>
> Note: I tested it on another smaller cluster, with 36 SAS disks and got the
> same result.
>
> I don't know exactly what to look for or configure to have any improvement.
> _______
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io



-- 
Alexander E. Patrakov
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [multisite] The purpose of zonegroup

2023-06-30 Thread Alexander E. Patrakov
Thanks! This is something that should be copy-pasted at the top of
https://docs.ceph.com/en/latest/radosgw/multisite/

Actually, I reported a documentation bug for something very similar.

On Fri, Jun 30, 2023 at 11:30 PM Casey Bodley  wrote:
>
> you're correct that the distinction is between metadata and data;
> metadata like users and buckets will replicate to all zonegroups,
> while object data only replicates within a single zonegroup. any given
> bucket is 'owned' by the zonegroup that creates it (or overridden by
> the LocationConstraint on creation). requests for data in that bucket
> sent to other zonegroups should redirect to the zonegroup where it
> resides
>
> the ability to create multiple zonegroups can be useful in cases where
> you want some isolation for the datasets, but a shared namespace of
> users and buckets. you may have several connected sites sharing
> storage, but only require a single backup for purposes of disaster
> recovery. there it could make sense to create several zonegroups with
> only two zones each to avoid replicating all objects to all zones
>
> in other cases, it could make more sense to isolate things in separate
> realms with a single zonegroup each. zonegroups just provide some
> flexibility to control the isolation of data and metadata separately
>
> On Thu, Jun 29, 2023 at 5:48 PM Yixin Jin  wrote:
> >
> > Hi folks,
> > In the multisite environment, we can get one realm that contains multiple 
> > zonegroups, each in turn can have multiple zones. However, the purpose of 
> > zonegroup isn't clear to me. It seems that when a user is created, its 
> > metadata is synced to all zones within the same realm, regardless whether 
> > they are in different zonegroups or not. The same happens to buckets. 
> > Therefore, what is the purpose of having zonegroups? Wouldn't it be easier 
> > to just have realm and zones?
> > Thanks,Yixin
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
> _______
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io



-- 
Alexander E. Patrakov
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 1 pg inconsistent and does not recover

2023-06-27 Thread Alexander E. Patrakov
;candidate had a read error" on OSD "33" mean that a BlueStore checksum 
> error was detected on OSD "33" at the same time as the OSD "2" disk failed?
> If yes, maybe that is the explanation:
>
> * pg 2.87 is backed by OSDs [33,2,20]; OSD 2's hardware broke during the 
> scrub, OSD 33 detected a checksum error during the scrub, and thus we have 2 
> OSDs left (33 and 20) whose checksums disagree.
>
> I am just guessing this, though.
> Also, if this is correct, the next question would be: What is with OSD 20?
> Since there is no error reported at all for OSD 20, I assume that its 
> checksum agrees with its data.
> Now, can I find out whether OSD 20's checksum agrees with OSD 33's data?
>
> (Side note: The disk of OSD 33 looks fine in smartctl.)
>
> Thanks,
> Niklas
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io



-- 
Alexander E. Patrakov
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [rgw multisite] Perpetual behind

2023-06-17 Thread Alexander E. Patrakov
On Sat, Jun 17, 2023 at 4:41 AM Yixin Jin  wrote:
>
> Hi ceph gurus,
>
> I am experimenting with rgw multisite sync feature using Quincy release 
> (17.2.5). I am using the zone-level sync, not bucket-level sync policy. 
> During my experiment, somehow my setup got into a situation that it doesn't 
> seem to get out of. One zone is perpetually behind the other, although there 
> is no ongoing client request.
>
> Here is the output of my "sync status":
>
> root@mon1-z1:~# radosgw-admin sync status
>   realm f90e4356-3aa7-46eb-a6b7-117dfa4607c4 (test-realm)
>   zonegroup a5f23c9c-0640-41f2-956f-a8523eccecb3 (zg)
>zone bbe3e2a1-bdba-4977-affb-80596a6fe2b9 (z1)
>   metadata sync no sync (zone is master)
>   data sync source: 9645a68b-012e-4889-bf24-096e7478f786 (z2)
> syncing
> full sync: 0/128 shards
> incremental sync: 128/128 shards
> data is behind on 14 shards
> behind shards: 
> [56,61,63,107,108,109,110,111,112,113,114,115,116,117]
>
>
> It stays behind forever while rgw is almost completely idle (1% of CPU).
>
> Any suggestion on how to drill deeper to see what happened?

Hello!

I have no idea what has happened, but it would be helpful if you
confirm the latency between the two clusters. In other words, please
don't expect the sync between e.g. Germany and Singapore to catch up
fast. It will be limited by the amount of data that can be synced in
one request and the hard-coded maximum number of requests in flight.

In Reef, there are new tunables that help on high-latency links:
rgw_data_sync_spawn_window, rgw_bucket_sync_spawn_window.
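
If you do move to Reef, raising them could look roughly like this - the
option names are the ones above, but the values here are purely
illustrative and need tuning against your actual link latency:

ceph config set client.rgw rgw_data_sync_spawn_window 64
ceph config set client.rgw rgw_bucket_sync_spawn_window 40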

-- 
Alexander E. Patrakov
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How to release the invalid tcp connection under radosgw?

2023-06-13 Thread Alexander E. Patrakov
Hello Louis,

On Tue, Jun 13, 2023 at 11:51 AM Louis Koo  wrote:
>
> connections:
> [root@et-uos-warm02 deeproute]# netstat -anltp | grep rados | grep 
> 10.x.x.x:7480 | grep ESTAB | grep 10.12 | wc -l
> 6650
>
> The prints:
> tcp0  0 10.x.x.x:7480   10.x.x.12:40210   ESTABLISHED 
> 76749/radosgw
> tcp0  0 10.x.x.x:7480   10.x.x.12:33218ESTABLISHED 
> 76749/radosgw
> tcp0  0 10.x.x.x:7480   10.x.x.12:33546ESTABLISHED 
> 76749/radosgw
> tcp0  0 10.x.x.x:7480   10.x.x.12:50024ESTABLISHED 
> 76749/radosgw
> 
>
> but client  ip 10.x.x.12 is unreachable(because the node was shutdown), the  
> status of the tcp connections is always "ESTABLISHED", how to fix it?

Please use this guide:
https://www.cyberciti.biz/tips/cutting-the-tcpip-network-connection-with-cutter.html
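
If I remember the syntax from that guide correctly (please double-check it
there), cutting everything that involves the dead client would be roughly:

cutter 10.x.x.12    # drop all tracked connections to/from the unreachable client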


-- 
Alexander E. Patrakov
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Encryption per user Howto

2023-06-02 Thread Alexander E. Patrakov
Hello Stefan,

On Fri, Jun 2, 2023 at 11:12 PM Stefan Kooman  wrote:

> On 6/2/23 16:33, Anthony D'Atri wrote:
> > Stefan, how do you have this implemented? Earlier this year I submitted
> > https://tracker.ceph.com/issues/58569
> > <https://tracker.ceph.com/issues/58569> asking to enable just this.
>
> Lol, I have never seen that tracker otherwise I would have informed you
> about it. I see the PR and tracker are updated by you / Joshua, thanks
> for that..
>
> So yes, we have this implemented and running in production (currently
> re-provisioning all OSDs). It's a locally patched 16.2.11 ceph-volume
> for that matter. The PR [1] needs some fixing (I need to sit down and
> make it happen, just so many other things that take up my time). But
> then this would be enabled by default for flash devices
> (non-rotational). If used with cryptsetup 2.4.x also the appropriate
> sector size is used (based on the physical sector size). We use 4K on NVMe.
>
> Added benefit of using cryptsetup 2.4.x is that is uses Argon2id as
> PBKDF for LUKS2.
>
> We created a backport of cryptsetup 2.4.3 for use in Ubuntu Focal (based
> on Jammy) [2].
>
> We are converting our whole cluster using LUKS2 with the work queues
> bypassed. For the nodes that have been converted already it works just
> fine. So, as multiple users seem to be waiting for this to be available
> in Ceph ... I should hurry up and make sure the PR gets in proper shape
> and merged in main.
>

Thanks for the report.

However, I would like to take back a part of my previous response, where I
informed you about the "xtsproxy" kernel module. Please don't try to use
it. The reason: I recently filed an issue asking for its inclusion into
the Zen kernel (available for Arch Linux users), and the resulting kernel
stopped booting for some users. So a proper backport is required, even
though the Cloudflare patch applies as-is.

https://github.com/zen-kernel/zen-kernel/issues/306
https://github.com/zen-kernel/zen-kernel/issues/310

-- 
Alexander E. Patrakov
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph iscsi gateway semi deprecation warning?

2023-05-26 Thread Alexander E. Patrakov
On Sat, May 27, 2023 at 12:21 AM Mark Kirkwood
 wrote:
>
> I am looking at using an iscsi gateway in front of a ceph setup. However
> the warning in the docs is concerning:
>
> The iSCSI gateway is in maintenance as of November 2022. This means that
> it is no longer in active development and will not be updated to add new
> features.
>
> Does this mean I should be wary of using it, or is it simply that it
> does all the stuff it needs to and no further development is needed?


Hello Mark,

The planned replacement is based on the newer NVMe-oF protocol and
SPDK. See this presentation for the purported performance benefits:
https://ci.spdk.io/download/2022-virtual-forum-prc/D2_4_Yue_A_Performance_Study_for_Ceph_NVMeoF_Gateway.pdf

The git repository is here: https://github.com/ceph/ceph-nvmeof.
However, this is not yet something recommended for a production-grade
setup. At the very least, wait until this subproject makes it into
Ceph documentation and becomes available as RPMs and DEBs.

For now, you can still use ceph-iscsi - assuming that you need it,
i.e. that raw RBD is not an option.

-- 
Alexander E. Patrakov
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Encryption per user Howto

2023-05-26 Thread Alexander E. Patrakov
On Sat, May 27, 2023 at 5:09 AM Alexander E. Patrakov
 wrote:
>
> Hello Frank,
>
> On Fri, May 26, 2023 at 6:27 PM Frank Schilder  wrote:
> >
> > Hi all,
> >
> > jumping on this thread as we have requests for which per-client fs mount 
> > encryption makes a lot of sense:
> >
> > > What kind of security to you want to achieve with encryption keys stored
> > > on the server side?
> >
> > One of the use cases is if a user requests a share with encryption at rest. 
> > Since encryption has an unavoidable performance impact, it is impractical 
> > to make 100% of users pay for the requirements that only 1% of users really 
> > have. Instead of all-OSD back-end encryption hitting everyone for little 
> > reason, encrypting only some user-buckets/fs-shares on the front-end 
> > application level will ensure that the data is encrypted at rest.
> >
>
> I would disagree about the unavoidable performance impact of at-rest
> encryption of OSDs. Read the CloudFlare blog article which shows how
> they make the encryption impact on their (non-Ceph) drives negligible:
> https://blog.cloudflare.com/speeding-up-linux-disk-encryption/. The
> main part of their improvements (the ability to disable dm-crypt
> workqueues) is already in the mainline kernel. There is also a Ceph
> pull request that disables dm-crypt workqueues on certain drives:
> https://github.com/ceph/ceph/pull/49554
>
> While the other part of the performance enhancements authored by
> CloudFlare (namely, the "xtsproxy" module) is not mainlined yet, I
> hope that some equivalent solution will find its way into the official
> kernel sooner or later.
>
> In summary: just encrypt everything.

As a follow-up, if you disagree with the advice to encrypt everything,
please note that CephFS allows one to place certain directories on a
separate pool. Therefore, you can create a separate device class for
encrypted OSDs, create a pool that uses this device class, and put the
directories owned by your premium users onto this pool.

Documentation: https://docs.ceph.com/en/latest/cephfs/file-layouts/
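
A rough sketch of the mechanics (all names below are placeholders, and it
assumes you give the encrypted OSDs their own device class):

ceph osd crush rm-device-class osd.10 osd.11 osd.12      # only if a class is already set
ceph osd crush set-device-class ssd-enc osd.10 osd.11 osd.12
ceph osd crush rule create-replicated enc-rule default host ssd-enc
ceph osd pool create cephfs_data_enc 64 64 replicated enc-rule
ceph fs add_data_pool cephfs cephfs_data_enc
setfattr -n ceph.dir.layout.pool -v cephfs_data_enc /mnt/cephfs/premium
# new files created under /mnt/cephfs/premium now land in the encrypted pool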

>
> > It may very well not serve any other purpose, but these are requests we 
> > get. If I could provide an encryption key to a ceph-fs kernel at mount 
> > time, this requirement could be solved very elegantly on a per-user 
> > (request) basis and only making users who want it pay with performance 
> > penalties.
> >
> > Best regards,
> > =
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > 
> > From: Robert Sander 
> > Sent: Tuesday, May 23, 2023 6:35 PM
> > To: ceph-users@ceph.io
> > Subject: [ceph-users] Re: Encryption per user Howto
> >
> > On 23.05.23 08:42, huxia...@horebdata.cn wrote:
> > > Indeed, the question is on  server-side encryption with keys managed by 
> > > ceph on a per-user basis
> >
> > What kind of security to you want to achieve with encryption keys stored
> > on the server side?
> >
> > Regards
> > --
> > Robert Sander
> > Heinlein Support GmbH
> > Linux: Akademie - Support - Hosting
> > http://www.heinlein-support.de
> >
> > Tel: 030-405051-43
> > Fax: 030-405051-19
> >
> > Zwangsangaben lt. §35a GmbHG:
> > HRB 93818 B / Amtsgericht Berlin-Charlottenburg,
> > Geschäftsführer: Peer Heinlein  -- Sitz: Berlin
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
>
> --
> Alexander E. Patrakov



-- 
Alexander E. Patrakov
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Encryption per user Howto

2023-05-26 Thread Alexander E. Patrakov
Hello Frank,

On Fri, May 26, 2023 at 6:27 PM Frank Schilder  wrote:
>
> Hi all,
>
> jumping on this thread as we have requests for which per-client fs mount 
> encryption makes a lot of sense:
>
> > What kind of security to you want to achieve with encryption keys stored
> > on the server side?
>
> One of the use cases is if a user requests a share with encryption at rest. 
> Since encryption has an unavoidable performance impact, it is impractical to 
> make 100% of users pay for the requirements that only 1% of users really 
> have. Instead of all-OSD back-end encryption hitting everyone for little 
> reason, encrypting only some user-buckets/fs-shares on the front-end 
> application level will ensure that the data is encrypted at rest.
>

I would disagree about the unavoidable performance impact of at-rest
encryption of OSDs. Read the CloudFlare blog article which shows how
they make the encryption impact on their (non-Ceph) drives negligible:
https://blog.cloudflare.com/speeding-up-linux-disk-encryption/. The
main part of their improvements (the ability to disable dm-crypt
workqueues) is already in the mainline kernel. There is also a Ceph
pull request that disables dm-crypt workqueues on certain drives:
https://github.com/ceph/ceph/pull/49554

While the other part of the performance enhancements authored by
CloudFlare (namely, the "xtsproxy" module) is not mainlined yet, I
hope that some equivalent solution will find its way into the official
kernel sooner or later.

In summary: just encrypt everything.
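
For reference, on an already-running dm-crypt OSD the kind of change that
the pull request automates can be tested by hand roughly like this (the
mapping name is made up, it needs cryptsetup >= 2.3.4 plus a recent kernel,
and for a LUKS device cryptsetup will ask for the key unless --key-file is
given):

cryptsetup refresh --persistent \
    --perf-no_read_workqueue --perf-no_write_workqueue \
    <dm-crypt-mapping-name-of-the-osd>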

> It may very well not serve any other purpose, but these are requests we get. 
> If I could provide an encryption key to a ceph-fs kernel at mount time, this 
> requirement could be solved very elegantly on a per-user (request) basis and 
> only making users who want it pay with performance penalties.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Robert Sander 
> Sent: Tuesday, May 23, 2023 6:35 PM
> To: ceph-users@ceph.io
> Subject: [ceph-users] Re: Encryption per user Howto
>
> On 23.05.23 08:42, huxia...@horebdata.cn wrote:
> > Indeed, the question is on  server-side encryption with keys managed by 
> > ceph on a per-user basis
>
> What kind of security to you want to achieve with encryption keys stored
> on the server side?
>
> Regards
> --
> Robert Sander
> Heinlein Support GmbH
> Linux: Akademie - Support - Hosting
> http://www.heinlein-support.de
>
> Tel: 030-405051-43
> Fax: 030-405051-19
>
> Zwangsangaben lt. §35a GmbHG:
> HRB 93818 B / Amtsgericht Berlin-Charlottenburg,
> Geschäftsführer: Peer Heinlein  -- Sitz: Berlin
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io



-- 
Alexander E. Patrakov
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Encryption per user Howto

2023-05-21 Thread Alexander E. Patrakov
Hello Samuel,

On Sun, May 21, 2023 at 3:48 PM huxia...@horebdata.cn
 wrote:
>
> Dear Ceph folks,
>
> Recently one of our clients approached us with a request on encrpytion per 
> user, i.e. using individual encrytion key for each user and encryption  files 
> and object store.
>
> Does anyone know (or have experience) how to do with CephFS and Ceph RGW?

For CephFS, this is unachievable.

For RGW, please use Vault for storing encryption keys. Don't forget
about the proper high-availability setup. Use an AppRole to manage
tokens. Use Vault Agent as a proxy that adds the token to requests
issued by RGWs. Then create a bucket for each user and set the
encryption policy for this bucket using the PutBucketEncryption API
that is available through AWS CLI. Either SSE-S3 or SSE-KMS will work
for you. SSE-S3 is easier to manage. Each object will then be
encrypted using a different key derived from its name and a per-bucket
master key which never leaves Vault.
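
For SSE-S3, the per-bucket policy can be applied roughly like this (the
endpoint and bucket name are placeholders):

aws --endpoint-url https://rgw.example.com s3api put-bucket-encryption \
    --bucket user1-bucket \
    --server-side-encryption-configuration \
    '{"Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}]}'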

Note that users will be able to create additional buckets by
themselves, and they won't be encrypted, so tell them either not to do
that or to encrypt the new buckets similarly.

-- 
Alexander E. Patrakov
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Disks are filling up even if there is not a single placement group on them

2023-04-10 Thread Alexander E. Patrakov
On Sat, Apr 8, 2023 at 2:26 PM Michal Strnad  wrote:
>cluster:
>  id: a12aa2d2-fae7-df35-ea2f-3de23100e345
>  health: HEALTH_WARN
...
>  pgs: 1656117639/32580808518 objects misplaced (5.083%)

That's why the space is eaten. The stuff that eats the disk space on
MONs is osdmaps, and the MONs have to keep old osdmaps back to the
moment in the past when the cluster was 100% healthy. Note that
osdmaps are also copied to all OSDs and eat space there, which is what
you have seen.

The relevant (but dangerous) configuration parameter is
"mon_osd_force_trim_to". Better don't use it, and let your ceph
cluster recover. If you can't wait, try to use upmaps to say that all
PGs are fine where they are now, i.e that they are not misplaced.
There is a script somewhere on GitHub that does this, but
unfortunately I can't find it right now.
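
The idea behind that script is to tell the cluster that the current
(actual) placement is the desired one, so the PGs stop being "misplaced"
and the old osdmaps can be trimmed. It emits commands of roughly this form
(the PG and OSD IDs here are made up):

ceph osd pg-upmap-items 10.7f 33 57   # map PG 10.7f to osd.57 (where the data already is) instead of osd.33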


--
Alexander E. Patrakov
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: compiling Nautilus for el9

2023-04-02 Thread Alexander E. Patrakov
On Sat, Apr 1, 2023 at 9:00 PM Marc  wrote:
>
>
> >
> > Is it possible to compile Nautilus for el9? Or maybe just the osd's?
> >
>
>
> I was thinking of updating first to el9/centos9/rocky9 one node at a time, 
> and after that do the ceph upgrade(s). I think this will give me the least 
> intrusive upgrade path.
>
> However that requires the availability of Nautilus el9 rpms. I have been 
> trying to build them via https://github.com/ceph/ceph and via rpmbuild 
> ceph.spec. And without applying to many 'hacks', currently I seem to get 
> around 30%-50%(?) of the code build[1] (albeit with quite some warnings[2]). 
> This leads me to believe that it is most likely possible to build these for 
> el9.
>
> Obviously I prefer to have someone with experience do this. Is it possible 
> someone from the ceph development team can build these rpms for el9? Or are 
> there serious issues that prevent this?

I would say, any such untested hacks of an EOL release are too risky
from the sysadmin perspective. My preferred approach would be to
migrate to containerized Ceph Nautilus at least temporarily
(ceph-ansible can do it), then upgrade the hosts to EL9 while still
keeping Nautilus, then, still containerized, upgrade to a more recent
Ceph release (but note that you can't upgrade from nautilus to Quincy
directly, you need Octopus or Pacific as a middle step), and then
optionally undo the containerization (but I have never tried this on
Quincy).

-- 
Alexander E. Patrakov
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-12 Thread Alexander E. Patrakov
пт, 7 окт. 2022 г. в 19:50, Frank Schilder :
> For the interested future reader, we have subdivided 400G high-performance 
> SSDs into 4x100G OSDs for our FS meta data pool. The increased concurrency 
> improves performance a lot. But yes, we are on the edge. OMAP+META is almost 
> 50%.

Please be careful with that. In the past, I had to help a customer who
ran out of disk space on small SSD partitions. This has happened
because MONs keep a history of all OSD and PG maps until at least the
clean state. So, during a prolonged semi-outage (when the cluster is
not healthy) they will slowly creep and accumulate and eat disk space
- and the problematic part is that this creepage is replicated to
OSDs.


-- 
Alexander E. Patrakov
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: encrypt OSDs after creation

2022-10-11 Thread Alexander E. Patrakov
ср, 12 окт. 2022 г. в 00:32, Ali Akil :
>
> Hallo folks,
>
> i created before couple of months a quincy ceph cluster with cephadm. I
> didn't encpryt the OSDs at that time.
> What would be the process to encrypt these OSDs afterwards?
> The documentation states only adding `encrypted: true` to the osd
> manifest, which will work only upon creation.

There is no such process. Destroy one OSD, recreate it with the same
ID but with the encryption on, wait for the cluster to heal itself,
then do the same with the next OSD, rinse, repeat. You may want to set
the norebalance flag during the operation.
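
With a cephadm-managed Quincy cluster, one iteration could look roughly
like this (osd.12 is just an example, and it assumes the OSD service spec
has already been updated with "encrypted: true"):

ceph osd set norebalance
ceph orch osd rm 12 --replace --zap   # destroy osd.12, keep the ID marked for reuse
# once the device is zapped, cephadm redeploys osd.12 from the updated
# spec, this time encrypted; wait for HEALTH_OK before the next OSD
ceph osd unset norebalance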

-- 
Alexander E. Patrakov
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 3 OSDs can not be started after a server reboot - rocksdb Corruption

2022-02-22 Thread Alexander E. Patrakov
s.
> >>>
> >>> I redeployed the OSD.7 after the crash from 2 days ago. And I started 
> >>> this new shutdown and boot series shortly after ceph had finished writing 
> >>> everything back to OSD.7, earlier today.
> >>>
> >>> The corrupted RocksDB file (crash) is again only 2KB in size.
> >>> You can download the RocksDB file with the bad  table magic number and 
> >>> the log of the OSD.7 under this link: https://we.tl/t-e0NqjpSmaQ
> >>> What else do you want?
> >>>
> >>>  From the log of the OSD.7:
> >>> —
> >>> 2022-02-21T13:47:39.945+0100 7f6fa3f91700 20 bdev(0x55f088a27400 
> >>> /var/lib/ceph/osd/ceph-7/block) _aio_log_finish 1 0x96d000~1000
> >>> 2022-02-21T13:47:39.945+0100 7f6fa3f91700 10 bdev(0x55f088a27400 
> >>> /var/lib/ceph/osd/ceph-7/block) _aio_thread finished aio 0x55f0b8c7c910 r 
> >>> 4096 ioc 0x55f0b8dbdd18 with 0 aios left
> >>> 2022-02-21T13:49:28.452+0100 7f6fa8a34700 -1 received  signal: Terminated 
> >>> from /sbin/init  (PID: 1) UID: 0
> >>> 2022-02-21T13:49:28.452+0100 7f6fa8a34700 -1 osd.7 4711 *** Got signal 
> >>> Terminated ***
> >>> 2022-02-21T13:49:28.452+0100 7f6fa8a34700 -1 osd.7 4711 *** Immediate 
> >>> shutdown (osd_fast_shutdown=true) ***
> >>> 2022-02-21T13:53:40.455+0100 7fc9645f4f00  0 set uid:gid to 64045:64045 
> >>> (ceph:ceph)
> >>> 2022-02-21T13:53:40.455+0100 7fc9645f4f00  0 ceph version 16.2.6 
> >>> (1a6b9a05546f335eeeddb460fdc89caadf80ac7a) pacific (stable), process 
> >>> ceph-osd, pid 1967
> >>> 2022-02-21T13:53:40.455+0100 7fc9645f4f00  0 pidfile_write: ignore empty 
> >>> --pid-file
> >>> 2022-02-21T13:53:40.459+0100 7fc9645f4f00  1 bdev(0x55bd400a0800 
> >>> /var/lib/ceph/osd/ceph-7/block) open path /var/lib/ceph/osd/ceph-7/block
> >>> —
> >>>
> >>> For me this looks like that the OSD did nothing for nearly 2 minutes 
> >>> before it receives the termination request. Shouldn't this be enough time 
> >>> for flushing every imaginable write cache?
> >>>
> >>>
> >>> I hope this helps you.
> >>>
> >>>
> >>> Best wishes,
> >>> Sebastian
> >>>
> >> --
> >> Igor Fedotov
> >> Ceph Lead Developer
> >>
> >> Looking for help with your Ceph cluster? Contact us at https://croit.io
> >>
> >> croit GmbH, Freseniusstr. 31h, 81247 Munich
> >> CEO: Martin Verges - VAT-ID: DE310638492
> >> Com. register: Amtsgericht Munich HRB 231263
> >> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
>
> --
> Igor Fedotov
> Ceph Lead Developer
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH, Freseniusstr. 31h, 81247 Munich
> CEO: Martin Verges - VAT-ID: DE310638492
> Com. register: Amtsgericht Munich HRB 231263
> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io



-- 
Alexander E. Patrakov
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Advice on enabling autoscaler

2022-02-07 Thread Alexander E. Patrakov
пн, 7 февр. 2022 г. в 17:30, Robert Sander :
> And keep in mind that when PGs are increased that you also may need to
> increase the number of OSDs as one OSD should carry a max of around 200
> PGs. But I do not know if that is still the case with current Ceph versions.

This is just the default limit. Even Nautilus can do 400 PGs per OSD,
given "mon max pg per osd = 400" in ceph.conf. Of course it doesn't
mean that you should allow this.

-- 
Alexander E. Patrakov
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Correct Usage of the ceph-objectstore-tool??

2022-01-06 Thread Alexander E. Patrakov
пт, 7 янв. 2022 г. в 06:21, Lee :

> Hello,
>
> As per another post I been having a huge issue since a PGNUM increase took
> my cluster offline..
>
> I have got to a point where I have just 20 PG's Down / Unavailable due to
> not being able to start a OSD
>
> I have been able to export the PG from the the offline OSD
>
> I then import to a clean / new OSD which is set to weight 0 in Crush using
> the following command.
>
> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-17 --no-mon-config
> --op import --file /mnt/croit/backup/pg10.0
>

If you did this with the OSD being offline, and then started it, then you
did everything correctly. OTOH my preferred approach would be not to set
the weight to 0, but to create a separate otherwise-unused CRUSH pool and
assign the OSDs with extra data to it, but your approach is also valid.


> When I start the OSD I see in the log loads of stuff being transitioned to
> Stray.
>

This is an indicator that you did everything right. Stray means an extra
copy of data in a place where it is not supposed to be - but that's exactly
what you did and what you were supposed to do!


> Do I need to tell CEPH to used the pg on the OSD to rebuild? When I query
> the PG at the end it complains about marking the offline OSD as offline?
>

We need to understand why this happens. The usual scenario where the
automatic rebuild doesn't start is when some of the PGs that you exported
and imported do not represent the known latest copy of the data. Maybe
there is another copy on another dead OSD; try exporting and importing it,
too. Basically, you want to inject all copies of all PGs that are
unhealthy. A copy of "ceph pg dump" output (as an attached file, not
inline) might be helpful. Also, run "ceph pg 1.456 query" where 1.456 is
the PG ID that you have imported - for a few problematic PGs.


> I have looked online and cannot find a definitive guide on how the process
> / steps that should be taken.
>

There is no single definitive guide. My approach would be to treat the
broken OSDs as broken for good, but without using any command that includes
the word "lost" (because this actually loses data). You can mark the dead
OSDs as "out", if Ceph didn't do it for you already. Then add enough
capacity with non-zero weight in the correct pool (or maybe do nothing if
you already have enough space). Ceph will rebalance the data automatically
when it obtains a proof that it really has the latest data.

When I encountered this "dead osd" issue last time, I found it useful to
compare the "ceph health detail" output over time, and with/without the
OSDs with injected PGs running. At the very least, it provides a useful
metric of what is remaining to do.

Also an interesting read-only command (but maybe for later) would be: "ceph
osd safe-to-destroy 123" where 123 is the dead OSD id.

-- 
Alexander E. Patrakov
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Help - Multiple OSD's Down

2022-01-06 Thread Alexander E. Patrakov
пт, 7 янв. 2022 г. в 00:50, Alexander E. Patrakov :

> чт, 6 янв. 2022 г. в 12:21, Lee :
>
>> I've tried add a swap and that fails also.
>>
>
> How exactly did it fail? Did you put it on some disk, or in zram?
>
> In the past I had to help a customer who hit memory over-use when
> upgrading Ceph (due to shallow_fsck), and we were able to fix it by adding
> 64 GB GB of zram-based swap on each server (with 128 GB of physical RAM in
> this type of server).
>
>
On the other hand, if you have some spare disks for temporary storage and
for new OSDs, and this failed OSD is not a part of an erasure-coded pool,
another approach might be to export all PGs using ceph-objectstore-tool as
files onto the temporary storage (in hope that it doesn't suffer from the
same memory explosion), and then import them all into a new temporary OSD.
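
A sketch of the export step (the OSD ID, PG ID and target path are
examples):

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-21 --op list-pgs
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-21 \
    --op export --pgid 10.0 --file /mnt/temp/osd21-pg10.0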

-- 
Alexander E. Patrakov
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Help - Multiple OSD's Down

2022-01-06 Thread Alexander E. Patrakov
чт, 6 янв. 2022 г. в 12:21, Lee :

> I've tried add a swap and that fails also.
>

How exactly did it fail? Did you put it on some disk, or in zram?

In the past I had to help a customer who hit memory over-use when upgrading
Ceph (due to shallow_fsck), and we were able to fix it by adding 64 GB
of zram-based swap on each server (with 128 GB of physical RAM in this type
of server).
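
A zram swap of that size can be set up roughly like this (64G matches what
we used there; size it to your own RAM):

modprobe zram                          # creates /dev/zram0 by default
echo 64G > /sys/block/zram0/disksize
mkswap /dev/zram0
swapon -p 100 /dev/zram0               # prefer it over any disk-based swap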

-- 
Alexander E. Patrakov
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Can we deprecate FileStore in Quincy?

2021-06-26 Thread Alexander E. Patrakov
сб, 26 июн. 2021 г. в 10:54, Stuart Longland :
>
> On Tue, 1 Jun 2021 12:24:12 -0700
> Neha Ojha  wrote:
>
> > Given that BlueStore has been the default and more widely used
> > objectstore since quite some time, we would like to understand whether
> > we can consider deprecating FileStore in our next release, Quincy and
> > remove it in the R release. There is also a proposal [0] to add a
> > health warning to report FileStore OSDs.
>
> I'd consder this:
>
> - Bluestore requires OSD hosts with 8GB+ of RAM,
...
> There are very few single-board computers that have 8GB+ of RAM

I have mixed feelings about this.

That 8GB+ figure is not really true. Yes, the default OSD memory
target is 4 GB, and it sometimes overshoots. It especially likes to
overshoot during the shallow fsck, e.g. during (or after) upgrades, and
the amount of RAM consumed by the shallow fsck does not really depend
on the OSD memory target. With a 14TB HDD, it could easily eat 10GB of
RAM.

But still - you could set the OSD memory target lower than the default
(performance will be limited then, due to insufficient caching, but
the OSD will still work). And for the ShallowFSCK phase to complete
successfully, you could add swap during upgrades. In fact, I had to do
so (with zram) during the upgrade of a Luminous cluster to Nautilus
some time before, and that with a beefy server with 128 GB of RAM and
16 OSDs, serving a lot of CephFS. So I don't really see this as an
obstacle, because swap/zram is needed only during upgrades, and
anyway, I wouldn't connect a 14TB drive to a Raspberry Pi, because of
its slow Ethernet and a huge time required to resync a new HDD.
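
For illustration, lowering the target looks like this (2 GiB is only an
example figure):

ceph config set osd osd_memory_target 2147483648   # roughly 2 GiB per OSD daemon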

So, a 4GB board could still be made to work as a cheap and bad OSD,
and even survive upgrades, but I wouldn't do it at home. Simply
because Ceph never made sense for small clusters, no matter what the
hardware is - for such use cases, you could always do software RAID
over iSCSI or over AoE, with less overhead.

-- 
Alexander E. Patrakov
CV: http://u.pc.cd/wT8otalK
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RBD migration between 2 EC pools : very slow

2021-06-23 Thread Alexander E. Patrakov
вт, 22 июн. 2021 г. в 23:22, Gilles Mocellin :
>
> Hello Cephers,
>
>
> On a capacitive Ceph cluster (13 nodes, 130 OSDs 8To HDD), I'm migrating a 40
> To image from a 3+2 EC pool to a 8+2 one.
>
> The use case is Veeam backup on XFS filesystems, mounted via KRBD.
>
>
> Backups are running, and I can see 200MB/s Throughput.
>
>
> But my migration (rbd migrate prepare / execute) is staling at 4% for 6h now.
>
> When the backups are not running, I can see a little 20MB/s of throughput,
> certainly my migration.
>
>
> I need a month to migrate 40 to at that speed !
>
>
> As I use a KRBD client, I cannot remap the rbd image straight after the rbd
> prepare. So the filesystem is not usable until the migration is completed.
>
> Not really usable for me...
>
>
> Is anyone has a clue either to speed up the rbd migration, or another method
> to move/copy an image between 2 pools, with the minimum downtime ?
>
>
> I thought of rbd export-diff | rbd import-diff, while mounted, and another
> unmapped before switching...
>
> But, it forces me to rename my image, because if I use another data pool, the
> metadata pool stays the same.
>
>
> Can you see another method ?

I suggest that you cancel the migration and don't ever attempt it
again because big EC setups are very easy to overload with IOPS.
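
If you cancel it, note that a prepared-but-unfinished migration has to be
rolled back explicitly, roughly like this (the pool/image spec is a
placeholder, and the command is run against the migration target image):

rbd migration abort newpool/backup-image   # rolls the data back to the source image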

When I worked at croit GmbH, we had a very unhappy customer with
almost the same setup as you are trying to achieve: Veeam Backup, XFS
on rbd on a 8+3 EC pool of HDDs. Their complaint was that both the
backup and restore were extremely slow, ~3 MB/s, and with 200 ms of
latency, but I would call their cluster overloaded due to too many
concurrent backups. We tried, unsuccessfully, to tune their setup, but
our final recommendation (successfully benchmarked but rejected due to
costs) was to create a separate replica 3 pool for new backups.

-- 
Alexander E. Patrakov
CV: http://u.pc.cd/wT8otalK
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph-ansible in Pacific and beyond?

2021-03-17 Thread Alexander E. Patrakov
I agree with this sentiment.

Please do not make a containerized and orchestrated deployment
mandatory until all of the documentation is rewritten to take this
deployment scenario into account.

Also, in the past year, I have personally tested three Ceph training
courses from various vendors. They all share the same weakness: they
explain how to deal with failed OSD disks in a non-containerized
scenario and how to redeploy the OSD after replacing the disk, and then,
at the end, how to take the cluster over using cephadm - at which point
the "how to replace a disk and redeploy the OSD" knowledge suddenly
becomes inapplicable.

ср, 17 мар. 2021 г. в 22:39, Teoman Onay :
>
> A containerized environment just makes troubleshooting more difficult,
> getting access and retrieving details on Ceph processes isn't as
> straightforward as with a non containerized infrastructure. I am still not
> convinced that containerizing everything brings any benefits except the
> collocation of services.
>
> On Wed, Mar 17, 2021 at 6:27 PM Matthew H  wrote:
>
> > There should not be any performance difference between an un-containerized
> > version and a containerized one.
> >
> > The shift to containers makes sense, as this is the general direction that
> > the industry as a whole is taking. I would suggest giving cephadm a try,
> > it's relatively straight forward and significantly faster for deployments
> > then ceph-ansible is.
> >
> > 
> > From: Matthew Vernon 
> > Sent: Wednesday, March 17, 2021 12:50 PM
> > To: ceph-users 
> > Subject: [ceph-users] ceph-ansible in Pacific and beyond?
> >
> > Hi,
> >
> > I caught up with Sage's talk on what to expect in Pacific (
> > https://www.youtube.com/watch?v=PVtn53MbxTc ) and there was no mention
> > of ceph-ansible at all.
> >
> > Is it going to continue to be supported? We use it (and uncontainerised
> > packages) for all our clusters, so I'd be a bit alarmed if it was going
> > to go away...
> >
> > Regards,
> >
> > Matthew
> >
> >
> > --
> >  The Wellcome Sanger Institute is operated by Genome Research
> >  Limited, a charity registered in England with number 1021457 and a
> >  company registered in England with number 2742969, whose registered
> >  office is 215 Euston Road, London, NW1 2BE.
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io



-- 
Alexander E. Patrakov
CV: http://u.pc.cd/wT8otalK
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Best practices for OSD on bcache

2021-03-04 Thread Alexander E. Patrakov
вт, 2 мар. 2021 г. в 13:52, James Page :

(Disclaimer: I have never tried to run Ceph on bcache in production,
and the test cluster was destroyed before reaching its first deep
scrub)

> b) turn off the sequential cutoff
>
> sequential_cutoff = 0
>
> This means that sequential writes will also always go to the cache device
> rather than the backing device

Could you please explain the exact mechanics of the sequential cutoff?
Does it only affect big sequential writes, or big sequential reads
too? I am asking because of the potential of deep scrubs being cached
instead of the "real" hot data.

-- 
Alexander E. Patrakov
CV: http://u.pc.cd/wT8otalK
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Questions RE: Ceph/CentOS/IBM

2021-03-03 Thread Alexander E. Patrakov
ср, 3 мар. 2021 г. в 20:45, Drew Weaver :
>
> Howdy,
>
> After the IBM acquisition of RedHat the landscape for CentOS quickly changed.
>
> As I understand it right now Ceph 14 is the last version that will run on 
> CentOS/EL7 but CentOS8 was "killed off".

This is wrong. Ceph 15 runs on CentOS 7 just fine, but without the dashboard.

-- 
Alexander E. Patrakov
CV: http://u.pc.cd/wT8otalK
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Worst thing that can happen if I have size= 2

2021-02-04 Thread Alexander E. Patrakov
There is a big difference between traditional RAID1 and Ceph. Namely, with
Ceph, there are nodes where OSDs are running, and these nodes need
maintenance. You want to be able to perform maintenance even if you have
one broken OSD; that's why the recommendation is to have three copies with
Ceph. There is no such "maintenance" consideration with traditional RAID1,
so two copies are OK there.

чт, 4 февр. 2021 г. в 00:49, Mario Giammarco :

> Thanks Simon and thanks to other people that have replied.
> Sorry but I try to explain myself better.
> It is evident to me that if I have two copies of data, one brokes and while
> ceph creates again a new copy of the data also the disk with the second
> copy brokes you lose the data.
> It is obvious and a bit paranoid because many servers on many customers run
> on raid1 and so you are saying: yeah you have two copies of the data but
> you can broke both. Consider that in ceph recovery is automatic, with raid1
> some one must manually go to the customer and change disks. So ceph is
> already an improvement in this case even with size=2. With size 3 and min 2
> it is a bigger improvement I know.
>
> What I ask is this: what happens with min_size=1 and split brain, network
> down or similar things: do ceph block writes because it has no quorum on
> monitors? Are there some failure scenarios that I have not considered?
> Thanks again!
> Mario
>
>
>
> Il giorno mer 3 feb 2021 alle ore 17:42 Simon Ironside <
> sirons...@caffetine.org> ha scritto:
>
> > On 03/02/2021 09:24, Mario Giammarco wrote:
> > > Hello,
> > > Imagine this situation:
> > > - 3 servers with ceph
> > > - a pool with size 2 min 1
> > >
> > > I know perfectly the size 3 and min 2 is better.
> > > I would like to know what is the worst thing that can happen:
> >
> > Hi Mario,
> >
> > This thread is worth a read, it's an oldie but a goodie:
> >
> >
> >
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-December/014846.html
> >
> > Especially this post, which helped me understand the importance of
> > min_size=2
> >
> >
> >
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-December/014892.html
> >
> > Cheers,
> > Simon
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>


-- 
Alexander E. Patrakov
CV: http://u.pc.cd/wT8otalK
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: PG inconsistent with empty inconsistent objects

2021-01-16 Thread Alexander E. Patrakov
For a start, please post the "ceph health detail" output.

сб, 19 дек. 2020 г. в 23:48, Seena Fallah :
>
> Hi,
>
> I'm facing something strange! One of the PGs in my pool got inconsistent
> and when I run `rados list-inconsistent-obj $PG_ID --format=json-pretty`
> the `inconsistents` key was empty! What is this? Is it a bug in Ceph or..?
>
> Thanks.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io



-- 
Alexander E. Patrakov
CV: http://u.pc.cd/wT8otalK
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: PGs down

2020-12-20 Thread Alexander E. Patrakov
On Mon, Dec 21, 2020 at 4:57 AM Jeremy Austin  wrote:
>
> On Sun, Dec 20, 2020 at 2:25 PM Jeremy Austin  wrote:
>
> > Will attempt to disable compaction and report.
> >
>
> Not sure I'm doing this right. In [osd] section of ceph.conf, I added
> periodic_compaction_seconds=0
>
> and attempted to start the OSDs in question. Same error as before. Am I
> setting compaction options correctly?

You may also want this:

bluefs_log_compact_min_size=999G
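That is, as a minimal sketch for ceph.conf (the exact value does not matter
much; it just has to be large enough that BlueFS log compaction never
triggers):

[osd]
    bluefs_log_compact_min_size = 999G

and then try starting the OSDs again.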

-- 
Alexander E. Patrakov
CV: http://pc.cd/PLz7
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Possibly unused client

2020-12-16 Thread Alexander E. Patrakov
Yes, thanks. This client was indeed unused.
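For the archives: a grep along these lines over the audit log is enough to
see whether a given entity shows up ("client.foobar" is just a placeholder
for the suspicious entry from "ceph auth list"):

grep "entity='client.foobar'" /var/log/ceph/ceph.audit.log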

On Wed, Dec 16, 2020 at 5:54 PM Eugen Block  wrote:
>
> Hi,
>
> the /var/log/ceph/ceph.audit.log file contains the client names:
>
> 2020-12-16 13:51:11.534010 mgr. (mgr.897778001) 1089671 : audit
> [DBG] from='client.908207535 v1::0/3495403341'
> entity='client.admin' cmd=[{"prefix": "pg stat", "target": ["mgr",
> ""]}]: dispatch
>
> Does that help?
>
> Regards,
> Eugen
>
>
> Zitat von "Alexander E. Patrakov" :
>
> > Hello,
> >
> > While working with a customer, I went through the output of "ceph auth
> > list" and found a client entry whose purpose nobody could explain. There
> > is a strong suspicion that it is an unused left-over from old times, but
> > again, nobody is sure.
> >
> > How can I confirm that it was not used for, say, the past week? Or,
> > what logs should I turn on so that if it is used during the next week,
> > it is mentioned there?
> >
> > --
> > Alexander E. Patrakov
> > CV: http://pc.cd/PLz7
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io



-- 
Alexander E. Patrakov
CV: http://pc.cd/PLz7
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Possibly unused client

2020-12-16 Thread Alexander E. Patrakov
Hello,

While working with a customer, I went through the output of "ceph auth
list" and found a client entry whose purpose nobody could explain. There
is a strong suspicion that it is an unused left-over from old times, but
again, nobody is sure.

How can I confirm that it was not used for, say, the past week? Or,
what logs should I turn on so that if it is used during the next week,
it is mentioned there?

-- 
Alexander E. Patrakov
CV: http://pc.cd/PLz7
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph flash deployment

2020-11-03 Thread Alexander E. Patrakov
With the latest kernels, this part is not valid for all-flash clusters,
simply because CFQ is not an option there at all, and readahead
usefulness depends on your workload (in other words, it can help or
hurt), so it cannot be included in a universally applicable set
of tuning recommendations. Also, look again: the title talks about
all-flash deployments, while the context of that benchmark talks about
7200 RPM HDDs!
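If you still want to experiment with them, both knobs are easy to inspect and
change at runtime (the device name below is just an example):

cat /sys/block/nvme0n1/queue/scheduler
echo none > /sys/block/nvme0n1/queue/scheduler       # or mq-deadline / kyber / bfq
cat /sys/block/nvme0n1/queue/read_ahead_kb
echo 128 > /sys/block/nvme0n1/queue/read_ahead_kb    # benchmark before and after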

On Wed, Nov 4, 2020 at 12:37 AM Seena Fallah  wrote:
>
> Thanks for your useful information.
>
> Can you please also say whether the kernel and disk configuration
> recommendations are still valid for BlueStore? I mean read_ahead_kb and the
> disk scheduler.
>
> Thanks.
>
> On Tue, Nov 3, 2020 at 10:55 PM Alexander E. Patrakov  
> wrote:
>>
>> On Tue, Nov 3, 2020 at 6:30 AM Seena Fallah  wrote:
>> >
>> > Hi all,
>> >
>> > Is this guide still valid for a BlueStore deployment with Nautilus or
>> > Octopus?
>> > https://tracker.ceph.com/projects/ceph/wiki/Tuning_for_All_Flash_Deployments
>>
>> Some of the guidance is of course outdated.
>>
>> E.g., at the time of that writing, 1x 40GbE was indeed state of the
>> art in the networking world, but now 100GbE network cards are
>> affordable, and with 6 NVMe drives per server, even that might be a
>> bottleneck if the clients use a large block size (>64KB) and do an
>> fsync() only at the end.
>>
>> Regarding NUMA tuning, Ceph made some progress. If it finds that your
>> NVMe and your network card are on the same NUMA node, then, with
>> Nautilus or later, the OSD will pin itself to that NUMA node
>> automatically. I.e.: choose strategically which PCIe slots to use,
>> maybe use two network cards, and you will not have to do any tuning or
>> manual pinning.
>>
>> Partitioning the NVMe was also popular advice in the past, but now
>> that there are "osd op num shards" and "osd op num threads per shard"
>> parameters, with sensible default values, this is something that tends
>> not to help.
>>
>> Filesystem considerations in that document obviously apply only to
>> Filestore, which is something you should not use.
>>
>> A large PG number per OSD helps achieve a more uniform data distribution,
>> but actually hurts performance a little bit.
>>
>> The advice regarding the "performance" cpufreq governor is valid, but
>> you might also look at (i.e. benchmark for your workload specifically)
>> disabling the deepest idle states.
>>
>> --
>> Alexander E. Patrakov
>> CV: http://pc.cd/PLz7



-- 
Alexander E. Patrakov
CV: http://pc.cd/PLz7
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph flash deployment

2020-11-03 Thread Alexander E. Patrakov
On Tue, Nov 3, 2020 at 6:30 AM Seena Fallah  wrote:
>
> Hi all,
>
> Is this guide still valid for a BlueStore deployment with Nautilus or
> Octopus?
> https://tracker.ceph.com/projects/ceph/wiki/Tuning_for_All_Flash_Deployments

Some of the guidance is of course outdated.

E.g., at the time of that writing, 1x 40GbE was indeed state of the
art in the networking world, but now 100GbE network cards are
affordable, and with 6 NVMe drives per server, even that might be a
bottleneck if the clients use a large block size (>64KB) and do an
fsync() only at the end.

Regarding NUMA tuning, Ceph made some progress. If it finds that your
NVMe and your network card are on the same NUMA node, then, with
Nautilus or later, the OSD will pin itself to that NUMA node
automatically. I.e.: choose strategically which PCIe slots to use,
maybe use two network cards, and you will not have to do any tuning or
manual pinning.
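A quick way to check whether this automatic pinning is even possible on a
given box (the interface and device names below are just examples):

cat /sys/class/net/ens1f0/device/numa_node
cat /sys/class/nvme/nvme0/device/numa_node
ceph osd numa-status    # shows what the running OSDs actually detected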

Partitioning the NVMe was also popular advice in the past, but now
that there are "osd op num shards" and "osd op num threads per shard"
parameters, with sensible default values, this is something that tends
not to help.

Filesystem considerations in that document obviously apply only to
Filestore, which is something you should not use.

A large PG number per OSD helps achieve a more uniform data distribution,
but actually hurts performance a little bit.

The advice regarding the "performance" cpufreq governor is valid, but
you might also look at (i.e. benchmark for your workload specifically)
disabling the deepest idle states.
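As a rough sketch of what to benchmark (cpupower ships with the kernel tools
packages; the 10 us latency threshold is only an example):

cpupower frequency-set -g performance
cpupower idle-set -D 10    # disable idle states with exit latency above 10 us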

-- 
Alexander E. Patrakov
CV: http://pc.cd/PLz7
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: The feasibility of mixed SSD and HDD replicated pool

2020-10-25 Thread Alexander E. Patrakov
On Sun, Oct 25, 2020 at 12:11 PM huw...@outlook.com  wrote:
>
> Hi all,
>
> We are planning for a new pool to store our dataset using CephFS. These data 
> are almost read-only (but not guaranteed) and consist of a lot of small 
> files. Each node in our cluster has 1 * 1T SSD and 2 * 6T HDD, and we will 
> deploy about 10 such nodes. We aim at getting the highest read throughput.
>
> If we just use a replicated pool of size 3 on SSD, we should get the best 
> performance, however, that only leave us 1/3 of usable SSD space. And EC 
> pools are not friendly to such small object read workload, I think.
>
> Now I’m evaluating a mixed SSD and HDD replication strategy. Ideally, I want 
> 3 data replications, each on a different host (fail domain). 1 of them on 
> SSD, the other 2 on HDD. And normally every read request is directed to SSD. 
> So, if every SSD OSD is up, I’d expect the same read throughout as the all 
> SSD deployment.
>
> I’ve read the documents and did some tests. Here is the crush rule I’m 
> testing with:
>
> rule mixed_replicated_rule {
> id 3
> type replicated
> min_size 1
> max_size 10
> step take default class ssd
> step chooseleaf firstn 1 type host
> step emit
> step take default class hdd
> step chooseleaf firstn -1 type host
> step emit
> }
>
> Now I have the following conclusions, but I’m not very sure:
> * The first OSD produced by crush will be the primary OSD (at least if I 
> don’t change the “primary affinity”). So, the above rule is guaranteed to map 
> SSD OSD as primary in pg. And every read request will read from SSD if it is 
> up.
> * It is currently not possible to enforce SSD and HDD OSD to be chosen from 
> different hosts. So, if I want to ensure data availability even if 2 hosts 
> fail, I need to choose 1 SSD and 3 HDD OSD. That means setting the 
> replication size to 4, instead of the ideal value 3, on the pool using the 
> above crush rule.
>
> Am I correct about the above statements? How would this work from your 
> experience? Thanks.

This works (i.e. guards against host failures) only if you have
strictly separate sets of hosts with SSDs and hosts with HDDs.
I.e., there should be no host that has both; otherwise there is a
chance that one HDD and one SSD from that host will be picked.
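One way to check a rule like this before trusting it is to run it through
crushtool against the real map (rule id 3 and four replicas are taken from the
example above). You still have to map the OSD ids back to hosts yourself, but
it makes such overlaps easy to spot:

ceph osd getcrushmap -o crushmap.bin
crushtool -i crushmap.bin --test --rule 3 --num-rep 4 --show-mappings | head -n 20
crushtool -i crushmap.bin --test --rule 3 --num-rep 4 --show-bad-mappings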

-- 
Alexander E. Patrakov
CV: http://pc.cd/PLz7
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Is cephfs multi-volume support stable?

2020-10-10 Thread Alexander E. Patrakov
Hello,

I have found that the documentation available on the Internet is inconsistent
on the question of whether I can safely have two instances of CephFS in my
cluster. For the record, I don't use snapshots.

FOSDEM 19 presentation by Sage Weil:
https://archive.fosdem.org/2019/schedule/event/ceph_project_status_update/attachments/slides/3251/export/events/attachments/ceph_project_status_update/slides/3251/ceph_new_in_nautilus.pdf

Slide 25 is specifically devoted to this topic, and declares
multi-volume support as stable.

But, https://docs.ceph.com/en/nautilus/cephfs/experimental-features/
declares that multiple filesystems in the same cluster are an
experimental feature, and the "latest" version of the same doc makes
the same claim.

What should I believe - the presentation or the official docs?

-- 
Alexander E. Patrakov
CV: http://pc.cd/PLz7
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: NVMe's

2020-09-23 Thread Alexander E. Patrakov
On Wed, Sep 23, 2020 at 8:12 PM Anthony D'Atri  wrote:

> With today’s networking, _maybe_ a super-dense NVMe box needs 100Gb/s where a 
> less-dense probably is fine with 25Gb/s. And of course PCI lanes.
>
> https://cephalocon2019.sched.com/event/M7uJ/affordable-nvme-performance-on-ceph-ceph-on-nvme-true-unbiased-story-to-fast-ceph-wido-den-hollander-42on-piotr-dalek-ovh

I was able to reach 35 Gb/s of network traffic on each server (5 servers,
with 6 NVMes per server, one OSD per NVMe) during a read benchmark
from CephFS, and I wouldn't treat that as a super-dense box. So 25 Gb/s
may be a bit too tight.


--
Alexander E. Patrakov
CV: http://pc.cd/PLz7
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Low level bluestore usage

2020-09-22 Thread Alexander E. Patrakov
On Wed, Sep 23, 2020 at 3:03 AM Ivan Kurnosov  wrote:
>
> Hi,
>
> this morning I woke up to a degraded test ceph cluster (managed by rook,
> but it does not really change anything for the question I'm about to ask).
>
> After checking logs I have found that bluestore on one of the OSDs run out
> of space.

I think this is a consequence, and the real error is something else
that happened before.

The problem is that, if the cluster is unhealthy, the MON storage
accumulates a lot of osdmaps and pgmaps, and is not cleaned up
automatically, because the MONs think that these old versions might be
needed. OSDs also get a copy of these osdmaps and pgmaps, if I
understand correctly; that's why small OSDs quickly fill up if
the cluster stays unhealthy for a few hours.

> So, my question would be: how could I have prevented that? From monitoring
> I have (prometheus) - OSDs are healthy, have plenty of space, yet they are
> not.
>
> What command (and prometheus metric) would help me understand the actual
> real bluestore use? Or am I missing something?

You can fix monitoring by setting the "mon data size warn" to
something like 1 GB or even less.
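For example (the threshold is in bytes, so this sets it to 1 GiB; on releases
without the centralized config database, put mon_data_size_warn into the [mon]
section of ceph.conf instead):

ceph config set mon mon_data_size_warn 1073741824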

> Oh, and I "fixed" the cluster by expanding the broken osd.0 with a larger
> 15GB volume. And 2 other OSDs still run on 10GB volumes.

Sometimes this doesn't help. For data recovery purposes, the most
helpful step if you get the "bluefs enospc" error is to add a separate
db device, like this:

systemctl disable --now ceph-osd@${OSDID}
truncate -s 32G /junk/osd.${OSDID}-recover/block.db
sgdisk -n 0:0:0 /junk/osd.${OSDID}-recover/block.db
ceph-bluestore-tool \
bluefs-bdev-new-db --path /var/lib/ceph/osd/ceph-${OSDID} \
--dev-target /junk/osd.${OSDID}-recover/block.db \
--bluestore-block-db-size=31G --bluefs-log-compact-min-size=31G

Of course you can use a real block device instead of just a file.

After that, export all PGs using ceph-objectstore-tool and re-import them
into a fresh OSD, then destroy or purge the full one.

Here is why these options are needed:

--bluestore-block-db-size=31G: ceph-bluestore-tool refuses to do
anything unless this option is set to some value
--bluefs-log-compact-min-size=31G: make absolutely sure that log
compaction doesn't happen, because it would hit "bluefs enospc" again.
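Since the export/import step is only mentioned in passing above, here is a
rough sketch (the PG id, the file path and the destination OSD id are
placeholders; both OSDs must be stopped while ceph-objectstore-tool runs, and
the export/import pair has to be repeated for every PG on the full OSD):

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-${OSDID} \
    --op export --pgid 2.7 --file /junk/pg.2.7.export
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-${NEWID} \
    --op import --file /junk/pg.2.7.export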

-- 
Alexander E. Patrakov
CV: http://pc.cd/PLz7
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Vitastor, a fast Ceph-like block storage for VMs

2020-09-22 Thread Alexander E. Patrakov
On Wed, Sep 23, 2020 at 3:44 AM  wrote:
>
> Hi!
>
> After almost a year of development in my spare time I present my own 
> software-defined block storage system: Vitastor - https://vitastor.io
>
> I designed it similar to Ceph in many ways, it also has Pools, PGs, OSDs, 
> different coding schemes, rebalancing and so on. However it's much simpler 
> and much faster. In a test cluster with SATA SSDs it achieved Q1T1 latency of 
> 0.14ms which is especially great compared to Ceph RBD's 1ms for writes and 
> 0.57ms for reads. In an "iops saturation" parallel load benchmark it reached 
> 895k read / 162k write iops, compared to Ceph's 480k / 100k on the same 
> hardware, but the most interesting part was CPU usage: Ceph OSDs were using 
> 40 CPU cores out of 64 on each node and Vitastor was only using 4.
>
> Of course it's an early pre-release which means that, for example, it lacks 
> snapshot support and other useful features. However the base is finished - it 
> works and runs QEMU VMs. I like the design and I plan to develop it further.
>
> There are more details in the README file which currently opens from the 
> domain https://vitastor.io

Very interesting.

Could you please add more details to the README file, as listed below?

1. Network benchmarks, in terms of achievable throughput and latency.
2. The type of the switch you used, and if there was any latency
tuning, please state it.
3. The network MTU.
4. The utilization figures for SSDs and network interfaces during each test.

Also, given that the scope of the project only includes block storage,
I think it would be fair to ask for a comparison with DRBD 9 and
possibly Linstor, not only with Ceph.

-- 
Alexander E. Patrakov
CV: http://pc.cd/PLz7
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: rbd-nbd stuck request

2020-07-24 Thread Alexander E. Patrakov
On Fri, Jul 24, 2020 at 7:43 PM Herbert Alexander Faleiros
 wrote:
>
> On Fri, Jul 24, 2020 at 07:28:07PM +0500, Alexander E. Patrakov wrote:
> > On Fri, Jul 24, 2020 at 6:01 PM Herbert Alexander Faleiros
> >  wrote:
> > >
> > > Hi,
> > >
> > > is there any way to fix it instead a reboot?
> > >
> > > [128632.995249] block nbd0: Possible stuck request b14a04af: 
> > > control (read@2097152,4096B). Runtime 9540 seconds
> > > [128663.718993] block nbd0: Possible stuck request b14a04af: 
> > > control (read@2097152,4096B). Runtime 9570 seconds
> > > [128694.434774] block nbd0: Possible stuck request b14a04af: 
> > > control (read@2097152,4096B). Runtime 9600 seconds
> > > [128725.154515] block nbd0: Possible stuck request b14a04af: 
> > > control (read@2097152,4096B). Runtime 9630 seconds
> > >
> > > # ceph -v
> > > ceph version 12.2.13 (584a20eb0237c657dc0567da126be145106aa47e) luminous 
> > > (stable)
> > >
> > > # rbd-nbd list-mapped
> > > #
> > >
> > > # uname -r
> > > 5.4.52-050452-generic
> >
> > Not enough data to troubleshoot this. Is the rbd-nbd process running?
> >
> > I.e.:
> >
> > # cat /proc/partitions
> > # ps axww | grep nbd
>
> no nbd on /proc/partitions, ps shows only:
>
> root  192324  0.0  0.0  0 0 ?I<   07:12   0:00 
> [knbd0-recv]

Try (not sure if it will help):

# nbd-client -d /dev/nbd0

This is part of the "nbd" package (or whatever is the name in your
distribution), not of ceph.

-- 
Alexander E. Patrakov
CV: http://pc.cd/PLz7
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: rbd-nbd stuck request

2020-07-24 Thread Alexander E. Patrakov
On Fri, Jul 24, 2020 at 6:01 PM Herbert Alexander Faleiros
 wrote:
>
> Hi,
>
> is there any way to fix it instead a reboot?
>
> [128632.995249] block nbd0: Possible stuck request b14a04af: control 
> (read@2097152,4096B). Runtime 9540 seconds
> [128663.718993] block nbd0: Possible stuck request b14a04af: control 
> (read@2097152,4096B). Runtime 9570 seconds
> [128694.434774] block nbd0: Possible stuck request b14a04af: control 
> (read@2097152,4096B). Runtime 9600 seconds
> [128725.154515] block nbd0: Possible stuck request b14a04af: control 
> (read@2097152,4096B). Runtime 9630 seconds
>
> # ceph -v
> ceph version 12.2.13 (584a20eb0237c657dc0567da126be145106aa47e) luminous 
> (stable)
>
> # rbd-nbd list-mapped
> #
>
> # uname -r
> 5.4.52-050452-generic

Not enough data to troubleshoot this. Is the rbd-nbd process running?

I.e.:

# cat /proc/partitions
# ps axww | grep nbd

-- 
Alexander E. Patrakov
CV: http://pc.cd/PLz7
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: rbd map image with journaling

2020-07-24 Thread Alexander E. Patrakov
On Fri, Jul 24, 2020 at 6:10 PM Herbert Alexander Faleiros
 wrote:
>
> Hi,
>
> is there any way to do that without disabling journaling?
>
> # rbd map image@snap
> rbd: sysfs write failed
> RBD image feature set mismatch. You can disable features unsupported
> by the kernel with "rbd feature disable image@snap journaling".
> In some cases useful info is found in syslog - try "dmesg | tail".
> rbd: map failed: (6) No such device or address
>
> # ceph -v
> ceph version 12.2.13 (584a20eb0237c657dc0567da126be145106aa47e) luminous 
> (stable)
>
> # uname -r
> 5.4.52-050452-generic

You could use rbd-nbd:

# rbd-nbd map image@snap

-- 
Alexander E. Patrakov
CV: http://pc.cd/PLz7
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Octopus upgrade breaks Ubuntu 18.04 libvirt

2020-07-08 Thread Alexander E. Patrakov
> >> > virEventPollMakePollFDs:401 :
> >> >> > Prepare n=19 w=1001, f=30 e=1 d=0
> >> >> > 2020-07-06 18:18:36.932+: 3273: debug : 
> >> >> > virEventPollMakePollFDs:401 :
> >> >> > Prepare n=20 w=1002, f=33 e=1 d=0
> >> >> > 2020-07-06 18:18:36.932+: 3273: debug : 
> >> >> > virEventPollMakePollFDs:401 :
> >> >> > Prepare n=21 w=1004, f=83 e=1 d=0
> >> >> > 2020-07-06 18:18:36.933+: 3273: debug : 
> >> >> > virEventPollCalculateTimeout:338 :
> >> >> > Calculate expiry of 8 timers
> >> >> > 2020-07-06 18:18:36.933+: 3273: debug : 
> >> >> > virEventPollCalculateTimeout:346 :
> >> >> > Got a timeout scheduled for 1594059521930
> >> >> > 2020-07-06 18:18:36.933+: 3273: debug : 
> >> >> > virEventPollCalculateTimeout:359 :
> >> >> > Schedule timeout then=1594059521930 now=1594059516933
> >> >> > 2020-07-06 18:18:36.933+: 3273: debug : 
> >> >> > virEventPollCalculateTimeout:369 :
> >> >> > Timeout at 1594059521930 due in 4997 ms
> >> >> > 2020-07-06 18:18:36.933+: 3273: info : virEventPollRunOnce:640 :
> >> >> > EVENT_POLL_RUN: nhandles=21 timeout=4997
> >> >> >
> >> >> >
> >> >> > The ceph itself seems to work, i.e. I can execute ceph -s / rbd -p 
> >> >> >  ls -l,
> >> >> > etc. That produces the output. It's just the libvirt seems to be no 
> >> >> > joy.
> >> >> >
> >> >> > The version of libvirt installed is:
> >> >> >
> >> >> > libvirt-bin 4.0.0-1ubuntu8.1
> >> >> >
> >> >> >
> >> >> >
> >> >> > Any idea how I can make ceph Octopus to play nicely with libvirt?
> >> >> >
> >> >> > Cheers
> >> >> >
> >> >> > Andrei
> >> >> >
> >> >> > - Original Message -
> >> >> >> From: "Andrei Mikhailovsky" 
> >> >> >> To: "ceph-users" 
> >> >> >> Sent: Monday, 29 June, 2020 20:40:01
> >> >> >> Subject: [ceph-users] Octopus upgrade breaks Ubuntu 18.04 libvirt
> >> >> >
> >> >> >> Hello,
> >> >> >>
> >> >> >> I've upgraded ceph to Octopus (15.2.3 from repo) on one of the 
> >> >> >> Ubuntu 18.04 host
> >> >> >> servers. The update caused problem with libvirtd which hangs when it 
> >> >> >> tries to
> >> >> >> access the storage pools. The problem doesn't exist on Nautilus. The 
> >> >> >> libvirtd
> >> >> >> process simply hangs. Nothing seem to happen. The log file for the 
> >> >> >> libvirtd
> >> >> >> shows:
> >> >> >>
> >> >> >> 2020-06-29 19:30:51.556+: 12040: debug : 
> >> >> >> virNetlinkEventCallback:707 :
> >> >> >> dispatching to max 0 clients, called from event watch 11
> >> >> >> 2020-06-29 19:30:51.556+: 12040: debug : 
> >> >> >> virNetlinkEventCallback:720 : event
> >> >> >> not handled.
> >> >> >> 2020-06-29 19:30:51.556+: 12040: debug : 
> >> >> >> virNetlinkEventCallback:707 :
> >> >> >> dispatching to max 0 clients, called from event watch 11
> >> >> >> 2020-06-29 19:30:51.556+: 12040: debug : 
> >> >> >> virNetlinkEventCallback:720 : event
> >> >> >> not handled.
> >> >> >> 2020-06-29 19:30:51.557+: 12040: debug : 
> >> >> >> virNetlinkEventCallback:707 :
> >> >> >> dispatching to max 0 clients, called from event watch 11
> >> >> >> 2020-06-29 19:30:51.557+: 12040: debug : 
> >> >> >> virNetlinkEventCallback:720 : event
> >> >> >> not handled.
> >> >> >> 2020-06-29 19:30:51.591+: 12040: debug : 
> >> >> >> virNetlinkEventCallback:707 :
> >> >> >> dispatching to max 0 clients, called from event watch 11
> >> >> >> 2020-06-29 19:30:51.591+: 12040: debug : 
> >> >> >> virNetlinkEventCallback:720 : event
> >> >> >> not handled.
> >> >> >>
> >> >> >> Running strace on the libvirtd process shows:
> >> >> >>
> >> >> >> root@ais-cloudhost1:/home/andrei# strace -p 12040
> >> >> >> strace: Process 12040 attached
> >> >> >> restart_syscall(<... resuming interrupted poll ...>
> >> >> >>
> >> >> >>
> >> >> >> Nothing happens after that point.
> >> >> >>
> >> >> >> The same host server can get access to the ceph cluster and the 
> >> >> >> pools by running
> >> >> >> ceph -s or rbd -p  ls -l commands for example.
> >> >> >>
> >> >> >> Need some help to get the host servers working again with Octopus.
> >> >> >>
> >> >> >> Cheers
> >> >> >> ___
> >> >> >> ceph-users mailing list -- ceph-users@ceph.io
> >> >> >> To unsubscribe send an email to ceph-users-le...@ceph.io
> >> >> > ___
> >> >> > ceph-users mailing list -- ceph-users@ceph.io
> >> >> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >> >> ___
> >> >> ceph-users mailing list -- ceph-users@ceph.io
> >> >> To unsubscribe send an email to ceph-users-le...@ceph.io
> >> >>
> >> >
> >> > [1] https://wiki.libvirt.org/page/DebugLogs
> >> >
> >> > --
> >> > Jason
> >>
> >
> >
> > --
> > Jason
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io



-- 
Alexander E. Patrakov
CV: http://pc.cd/PLz7
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Placement of block/db and WAL on SSD?

2020-07-05 Thread Alexander E. Patrakov
On Sun, Jul 5, 2020 at 6:57 AM Lindsay Mathieson
 wrote:
>
> Nautilus install.
>
> Documentation seems a bit ambiguous to me - this is for a spinner + SSD,
> using ceph-volume
>
> If I put the block.db on the SSD with
>
>  "ceph-volume lvm create --bluestore --data /dev/sdd --block.db
> /dev/sdc1"
>
> does the wal exist on the ssd (/dev/sdc1) as well, or does it remain on
> the hdd (/dev/sdd)?

If the wal location is not explicitly specified, it goes together with
the db. So it is on the SSD.

>
>
> Conversely, what happens with the block.db if I place the wal with
> --block.wal

The db then stays with the data.

> Or do I have to setup separate partitions for the block.db and wal?

You can, in theory, provide all three devices, but nobody does that in practice.

Common setups are:

1) just --data, then the db and its wal are located on the same device;
2) --data on HDD and --block.db on a partition on the SSD (the wal
automatically goes together with the db). The partition needs to be 30
or 300 GB in size (this requirement was relaxed only very recently, so
let's not count on this), but not smaller than 1-4% of the data
device.
3) --data on something (then the db goes there as well) and
--block.wal on a small (i.e. not large enough to use as a db device)
but very fast nvdimm.
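For completeness, the three setups above roughly correspond to the following
ceph-volume invocations (the device names are only examples):

ceph-volume lvm create --bluestore --data /dev/sdd
ceph-volume lvm create --bluestore --data /dev/sdd --block.db /dev/sdc1
ceph-volume lvm create --bluestore --data /dev/sdd --block.wal /dev/pmem0p1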

-- 
Alexander E. Patrakov
CV: http://pc.cd/PLz7
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Cannot remove cache tier

2020-07-03 Thread Alexander E. Patrakov
Hello.

I have tried to follow the documented writeback cache tier
removal procedure
(https://docs.ceph.com/docs/master/rados/operations/cache-tiering/#removing-a-writeback-cache)
on a test cluster, and failed.

I have successfully executed this command:

ceph osd tier cache-mode alex-test-rbd-cache proxy

Next, I am supposed to run this:

rados -p alex-test-rbd-cache ls
rados -p alex-test-rbd-cache cache-flush-evict-all

The failure mode is that, while the client I/O is still going on, I
cannot get zero objects in the cache pool, even with the help of
"rados -p alex-test-rbd-cache cache-flush-evict-all". And yes, I have
waited more than 20 minutes (my cache tier has hit_set_count 10 and
hit_set_period 120).

I also tried to set both cache_target_dirty_ratio and
cache_target_full_ratio to 0; it didn't help.
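For reference, that was done with commands along these lines:

ceph osd pool set alex-test-rbd-cache cache_target_dirty_ratio 0
ceph osd pool set alex-test-rbd-cache cache_target_full_ratio 0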

Here is the relevant part of the pool setup:

# ceph osd pool ls detail
pool 25 'alex-test-rbd-metadata' replicated size 3 min_size 2
crush_rule 9 object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode
warn last_change 10973111 lfor 0/10971347/10971345 flags
hashpspool,nodelete stripe_width 0 application rbd
pool 26 'alex-test-rbd-data' erasure size 6 min_size 5 crush_rule 12
object_hash rjenkins pg_num 1024 pgp_num 1024 autoscale_mode warn
last_change 10973112 lfor 10971705/10971705/10971705 flags
hashpspool,ec_overwrites,nodelete,selfmanaged_snaps tiers 27 read_tier
27 write_tier 27 stripe_width 16384 application rbd
removed_snaps [1~3]
pool 27 'alex-test-rbd-cache' replicated size 3 min_size 2 crush_rule
9 object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode warn
last_change 10973113 lfor 10971705/10971705/10971705 flags
hashpspool,incomplete_clones,nodelete,selfmanaged_snaps tier_of 26
cache_mode proxy target_bytes 100 hit_set
bloom{false_positive_probability: 0.05, target_size: 0, seed: 0} 120s
x10 decay_rate 0 search_last_n 0 stripe_width 0 application rbd
removed_snaps [1~3]

The relevant crush rules are selecting ssds for the
alex-test-rbd-cache and alex-test-rbd-metadata pools (plain old
"replicated size 3" pools), and hdds for alex-test-rbd-data (which is
EC 4+2).

The client workload, which seemingly outpaces the eviction and flushing, is:

for a in `seq 1000 2000` ; do
    time rbd import --data-pool alex-test-rbd-data \
        ./Fedora-Cloud-Base-32-1.6.x86_64.raw \
        alex-test-rbd-metadata/Fedora-copy-$a
done

The ceph version is "ceph version 14.2.9
(2afdc1f644870fb6315f25a777f9e4126dacc32d) nautilus (stable)" on all
osds.

The relevant part of "ceph df" is:

RAW STORAGE:
    CLASS     SIZE      AVAIL     USED      RAW USED   %RAW USED
    hdd       23 TiB    20 TiB    2.9 TiB   3.0 TiB        12.99
    ssd       1.7 TiB   1.7 TiB   19 GiB    23 GiB          1.28
    TOTAL     25 TiB    22 TiB    2.9 TiB   3.0 TiB        12.17

POOLS:
    POOL                    ID  STORED   OBJECTS  USED     %USED  MAX AVAIL
    alex-test-rbd-metadata  25  237 KiB    2.37k   59 MiB      0    564 GiB
    alex-test-rbd-data      26  691 GiB  198.57k  1.0 TiB   6.52    9.7 TiB
    alex-test-rbd-cache     27  5.1 GiB    2.99k   15 GiB   0.90    564 GiB

The total size and the number of stored objects in the
alex-test-rbd-cache pool oscillate around 5 GB and 3K, respectively,
while "rados -p alex-test-rbd-cache cache-flush-evict-all" is running
in a loop. Without it, the size grows to 6 GB and stays there.

# ceph -s
  cluster:
id: 
health: HEALTH_WARN
1 cache pools at or near target size

  services:
mon: 3 daemons, quorum xx-4a,xx-3a,xx-2a (age 10d)
mgr: xx-3a(active, since 5w), standbys: xx-2b, xx-2a, xx-4a
mds: cephfs:1 {0=xx-4b=up:active} 2 up:standby
osd: 89 osds: 89 up (since 7d), 89 in (since 7d)
rgw: 3 daemons active (xx-2b, xx-3b, xx-4b)
tcmu-runner: 6 daemons active ()

  data:
pools:   15 pools, 1976 pgs
objects: 6.64M objects, 1.3 TiB
usage:   3.1 TiB used, 22 TiB / 25 TiB avail
pgs: 1976 active+clean

  io:
client:   290 KiB/s rd, 251 MiB/s wr, 366 op/s rd, 278 op/s wr
cache:123 MiB/s flush, 72 MiB/s evict, 31 op/s promote, 3 PGs
flushing, 1 PGs evicting

Is there any workaround, short of somehow telling the client to stop
creating new rbds?

-- 
Alexander E. Patrakov
CV: http://pc.cd/PLz7
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io