Re: [ceph-users] Urgent Help Needed (regarding rbd cache)

2019-07-31 Thread Janne Johansson
Den tors 1 aug. 2019 kl 07:31 skrev Muhammad Junaid :

> Your email has cleared many things up for me. Let me repeat my understanding.
> All critical data writes (like Oracle or any other DB) will be done with
> sync/fsync flags, meaning they will only be confirmed to the DB/app after
> the data is actually written to hard drives/OSDs. Any other application can
> do this as well.
> All other writes, like OS logs etc., will be confirmed immediately to the
> app/user but written later, passing through the kernel, the RBD cache, the
> physical drive cache (if any) and then to the disks. These are susceptible
> to power-failure loss, but overall things are recoverable/non-critical.
>

That last part is probably a bit simplified. Between a program in a guest
sending its data to the virtualised device, running in a KVM guest on top
of an OS that has remote storage over the network, to a storage server with
its own OS and drive controller chip, and lastly the physical drive(s) that
store the write, there will be something like ~10 layers of write caching
possible, of which the RBD cache you were asking about is just one.

It is just located very conveniently before the I/O has to leave the KVM
host and go back and forth over the network, so it is the last place where
you can see huge gains in the guests' I/O response time. At the same time
it can be shared between lots of guests on the KVM host, which should have
tons of RAM available compared to any single guest, so it is a nice way to
get a large cache for outgoing writes.

Also, to answer your first part: yes, all critical software that depends
heavily on write ordering and integrity is hopefully already doing write
operations that way, asking for sync(), fsync(), fdatasync() and similar
calls, but I can't produce a list of all programs that do. Since there
already are many layers of delayed cached writes even without
virtualisation and/or ceph, applications that are important have mostly
learned their lessons by now, so chances are very high that all your
important databases and similar programs are doing the right thing.
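
A quick way to see the difference from inside a guest (just a sketch using
plain dd; conv=fsync makes dd call fsync() before reporting completion, so
the timing then includes flushing all the cache layers below):

  dd if=/dev/zero of=/tmp/testfile bs=1M count=256
  dd if=/dev/zero of=/tmp/testfile bs=1M count=256 conv=fsync

The first write returns as soon as the page cache has accepted the data;
the second is only done once the data has been pushed down towards the
OSDs, which is the behaviour a database asks for on its critical writes.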

But if the guest is instead running a mail filter that does antivirus
checks, spam checks and so on, operating on files that live on the machine
for something like one second, and then either get dropped or sent to the
destination mailbox somewhere else, then having aggressive write caches
would be very useful, since the effects of a crash would still mostly mean
"the emails that were in the queue were lost, not acked by the final
mailserver and will probably be resent by the previous smtp server". For
such a guest VM, forcing sync writes would only be a net loss; it would
gain a lot from having large RAM write caches.


-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Reading a crushtool compare output

2019-07-31 Thread Linh Vu
Hi all,

I'd like to update the tunables on our older ceph cluster, created with firefly 
and now on luminous. I need to update two tunables, chooseleaf_vary_r from 2 to 
1, and chooseleaf_stable from 0 to 1. I'm going to do 1 tunable update at a 
time.

With the first one, I've dumped the current crushmap out and compared it to the 
proposed updated crushmap with chooseleaf_vary_r set to 1 instead of 2. I need 
some help to understand the output:

# ceph osd getcrushmap -o crushmap-20190801-chooseleaf-vary-r-2
# crushtool -i crushmap-20190801-chooseleaf-vary-r-2 --set-chooseleaf-vary-r 1 -o crushmap-20190801-chooseleaf-vary-r-1
# crushtool -i crushmap-20190801-chooseleaf-vary-r-2 --compare crushmap-20190801-chooseleaf-vary-r-1
rule 0 had 9137/10240 mismatched mappings (0.892285)
rule 1 had 9152/10240 mismatched mappings (0.89375)
rule 4 had 9173/10240 mismatched mappings (0.895801)
rule 5 had 0/7168 mismatched mappings (0)
rule 6 had 0/7168 mismatched mappings (0)
warning: maps are NOT equivalent

So I've learned in the past doing this sort of stuff that if the maps are 
equivalent then there is no data movement. In this case, obviously I'm 
expecting data movement, but by how much? Rules 0, 1 and 4 are about our 3 
different device classes in this cluster.

Does that mean I'm going to expect almost 90% mismatched based on the above 
output? That's much bigger than I expected, as in the previous steps of 
changing the chooseleaf-vary-r from 0 to 5 then down to 2 by 1 at a time 
(before knowing anything about this crushtool --compare command) I had only up 
to about 28% mismatched objects.
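
One way to get a more direct per-PG estimate (a sketch, assuming your
release's osdmaptool supports --import-crush and --test-map-pgs-dump) is to
diff the actual PG mappings of the old and new maps:

# ceph osd getmap -o osdmap.cur
# cp osdmap.cur osdmap.new
# osdmaptool osdmap.new --import-crush crushmap-20190801-chooseleaf-vary-r-1
# osdmaptool osdmap.cur --test-map-pgs-dump > mappings.before
# osdmaptool osdmap.new --test-map-pgs-dump > mappings.after
# diff mappings.before mappings.after | grep -c '^<'

The last number is a rough count of PGs whose OSD mapping would change,
which maps more directly to expected data movement than the per-rule
mismatch percentages above.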

Also, if you've done a similar change, please let me know how much data 
movement you encountered. Thanks!

Cheers,
Linh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] details about cloning objects using librados

2019-07-31 Thread nokia ceph
Thank you Greg,

Another question: we need to give a new destination object name, so that we
can read it separately, in parallel with the src object. This function
resides in Objecter.h, which seems to be internal; can it be used at the
interface level, and can we use it in our client? Currently we use
librados.h in our client to communicate with the ceph cluster.
Also, is there any equivalent librados api for the command rados -p poolname ...?

Thanks,
Muthu

On Wed, Jul 31, 2019 at 11:13 PM Gregory Farnum  wrote:

>
>
> On Wed, Jul 31, 2019 at 1:32 AM nokia ceph 
> wrote:
>
>> Hi Greg,
>>
>> We were trying to implement this however having issues in assigning the
>> destination object name with this api.
>> There is a rados command "rados -p  cp  " ,
>> is there any librados api equivalent to this ?
>>
>
> The copyfrom operation, like all other ops, is directed to a specific
> object. The object you run it on is the destination; it copies the
> specified “src” object into itself.
> -Greg
>
>
>> Thanks,
>> Muthu
>>
>> On Fri, Jul 5, 2019 at 4:00 PM nokia ceph 
>> wrote:
>>
>>> Thank you Greg, we will try this out .
>>>
>>> Thanks,
>>> Muthu
>>>
>>> On Wed, Jul 3, 2019 at 11:12 PM Gregory Farnum 
>>> wrote:
>>>
 Well, the RADOS interface doesn't have a great deal of documentation
 so I don't know if I can point you at much.

 But if you look at Objecter.h, you see that the ObjectOperation has
 this function:
 void copy_from(object_t src, snapid_t snapid, object_locator_t
 src_oloc, version_t src_version, unsigned flags, unsigned
 src_fadvise_flags)

 src: the object to copy from
 snapid: if you want to copy a specific snap instead of HEAD
 src_oloc: the object locator for the object
 src_version: the version of the object to copy from (helps identify if
 it was updated in the meantime)
 flags: probably don't want to set these, but see
 PrimaryLogPG::_copy_some for the choices
 src_fadvise_flags: these are the fadvise flags we have in various
 places that let you specify things like not to cache the data.
 Probably leave them unset.

 -Greg



 On Wed, Jul 3, 2019 at 2:47 AM nokia ceph 
 wrote:
 >
 > Hi Greg,
 >
 > Can you please share the api details  for COPY_FROM or any reference
 document?
 >
 > Thanks ,
 > Muthu
 >
 > On Wed, Jul 3, 2019 at 4:12 AM Brad Hubbard 
 wrote:
 >>
 >> On Wed, Jul 3, 2019 at 4:25 AM Gregory Farnum 
 wrote:
 >> >
 >> > I'm not sure how or why you'd get an object class involved in doing
 >> > this in the normal course of affairs.
 >> >
 >> > There's a copy_from op that a client can send and which copies an
 >> > object from another OSD into the target object. That's probably the
 >> > primitive you want to build on. Note that the OSD doesn't do much
 >>
 >> Argh! yes, good idea. We really should document that!
 >>
 >> > consistency checking (it validates that the object version matches
 an
 >> > input, but if they don't it just returns an error) so the client
 >> > application is responsible for any locking needed.
 >> > -Greg
 >> >
 >> > On Tue, Jul 2, 2019 at 3:49 AM Brad Hubbard 
 wrote:
 >> > >
 >> > > Yes, this should be possible using an object class which is also
 a
 >> > > RADOS client (via the RADOS API). You'll still have some client
 >> > > traffic as the machine running the object class will still need
 to
 >> > > connect to the relevant primary osd and send the write
 (presumably in
 >> > > some situations though this will be the same machine).
 >> > >
 >> > > On Tue, Jul 2, 2019 at 4:08 PM nokia ceph <
 nokiacephus...@gmail.com> wrote:
 >> > > >
 >> > > > Hi Brett,
 >> > > >
 >> > > > I think I was wrong here in the requirement description. It is
 not about data replication , we need same content stored in different
 object/name.
 >> > > > We store video contents inside the ceph cluster. And our new
 requirement is we need to store same content for different users , hence
 need same content in different object name . if client sends write request
 for object x and sets number of copies as 100, then cluster has to clone
 100 copies of object x and store it as object x1, objectx2,etc. Currently
 this is done in the client side where objectx1, object x2...objectx100 are
 cloned inside the client and write request sent for all 100 objects which
 we want to avoid to reduce network consumption.
 >> > > >
 >> > > > Similar usecases are rbd snapshot , radosgw copy .
 >> > > >
 >> > > > Is this possible in object class ?
 >> > > >
 >> > > > thanks,
 >> > > > Muthu
 >> > > >
 >> > > >
 >> > > > On Mon, Jul 1, 2019 at 7:58 PM Brett Chancellor <
 bchancel...@salesforce.com> wrote:
 >> > > >>
 >> > > >> Ceph already d

Re: [ceph-users] Urgent Help Needed (regarding rbd cache)

2019-07-31 Thread Muhammad Junaid
Thanks Paul and Janne

Your email has cleared many things up for me. Let me repeat my understanding.
All critical data writes (like Oracle or any other DB) will be done with
sync/fsync flags, meaning they will only be confirmed to the DB/app after
the data is actually written to hard drives/OSDs. Any other application can
do this as well.
All other writes, like OS logs etc., will be confirmed immediately to the
app/user but written later, passing through the kernel, the RBD cache, the
physical drive cache (if any) and then to the disks. These are susceptible
to power-failure loss, but overall things are recoverable/non-critical.

Please confirm if my understanding is correct. Best Regards.

Muhammad Junaid



On Wed, Jul 31, 2019 at 7:12 PM Paul Emmerich 
wrote:

> Yes, this is power-failure safe. It behaves exactly the same as a real
> disk's write cache.
>
> It's really a question about semantics: what does it mean to write
> data? Should the data still be guaranteed to be there after a power
> failure?
> The answer for most writes is: no, such a guarantee is neither
> necessary nor helpful.
>
> If you really want to write data then you'll have to specify that
> during the write (sync or similar commands). It doesn't matter if you
> are using
> a networked file system with a cache or a real disk with a cache
> underneath, the behavior will be the same. Data not flushed out
> explicitly
> or written with a sync flag will be lost after a power outage. But
> that's fine because all reasonable file systems, applications, and
> operating
> systems are designed to handle exactly this case (disk caches aren't a
> new thing).
>
>
> Paul
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
>
> On Wed, Jul 31, 2019 at 1:08 PM Janne Johansson 
> wrote:
> >
> > Den ons 31 juli 2019 kl 06:55 skrev Muhammad Junaid <
> junaid.fsd...@gmail.com>:
> >>
> >> The question is about RBD Cache in write-back mode using KVM/libvirt.
> If we enable this, it uses local KVM Host's RAM as cache for VM's write
> requests. And KVM Host immediately responds to VM's OS that data has been
> written to Disk (Actually it is still not on OSD's yet). Then how can be it
> power failure safe?
> >>
> > It is not. Nothing is power-failure safe. However you design things, you
> will always have the possibility of some long write being almost done when
> the power goes away, and that write (and perhaps others in caches) will be
> lost. Different filesystems handle losses good or bad, databases running on
> those filesystems will have their I/O interrupted and not acknowledged
> which may be not-so-bad or very bad.
> >
> > The write-caches you have in the KVM guest and this KVM RBD cache will
> make the guest I/Os faster, at the expense of higher risk of losing data in
> a power outage, but there will be some kind of roll-back, some kind of
> fsck/trash somewhere to clean up if a KVM host dies with guests running.
> > In 99% of the cases, this is ok, the only lost things are "last lines to
> this log file" or "those 3 temp files for the program that ran" and in the
> last 1% you need to pull out your backups like when the physical disks die.
> >
> > If someone promises "power failure safe" then I think they are
> misguided. Chances may be super small for bad things to happen, but it will
> just never be 0%. Also, I think the guest VMs have the possibility to ask
> kvm-rbd to flush data out, and as such take the "pain" of waiting for real
> completion when it is actually needed, so that other operation can go fast
> (and slightly less safe) and the IO that needs harder guarantees can call
> for flushing and know when data actually is on disk for real.
> >
> > --
> > May the most significant bit of your life be positive.
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph+docker

2019-07-31 Thread Paul Emmerich
https://hub.docker.com/r/ceph/ceph/tags

> Am 01.08.2019 um 04:16 schrieb "zhanrzh...@teamsun.com.cn" 
> :
> 
> Hello Paul,
> Thank you a lot for the reply. Are there other versions of ceph ready to run on docker?
> 
> --
> zhanrzh...@teamsun.com.cn
>> Please don't start new deployments with Luminous, it's EOL since last
>> month: https://docs.ceph.com/docs/master/releases/schedule/
>> (but it'll probably still receive critical updates if anything happens
>> as there are lots of deployments out there)
>> 
>> But there's no good reason to not start with Nautilus at the moment.
>> 
>> Paul
>> 
>> --
>> Paul Emmerich
>> 
>> Looking for help with your Ceph cluster? Contact us at https://croit.io
>> 
>> croit GmbH
>> Freseniusstr. 31h
>> 81247 München
>> www.croit.io
>> Tel: +49 89 1896585 90
>> 
>> On Wed, Jul 31, 2019 at 4:04 AM zhanrzh...@teamsun.com.cn
>>  wrote:
>>> 
>>> 
>>> Hello,cephers
>>>  We are researching ceph (Luminous) and docker, getting ready to use them
>>> in our future production environment.
>>> Is anyone using ceph and docker in production? Any suggestions would be
>>> appreciated.
>>> 
>>> --
>>> zhanrzh...@teamsun.com.cn
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Please help: change IP address of a cluster

2019-07-31 Thread ST Wong (ITSC)
Hi,  noted and thanks a lot.

Best Rgds
/stwong

-Original Message-
From: Ricardo Dias  
Sent: Thursday, July 25, 2019 8:47 PM
To: ST Wong (ITSC) ; Manuel Lausch 
; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Please help: change IP address of a cluster

Hi,

The monmaptool has a different option to add an address vector. To add an 
address vector do this:

monmaptool --addv node1 [v2:10.0.1.1:3330,v1:10.0.1.1:6789] {tmp}/{filename}

The --addv option does not appear in the usage description but the fix is 
already under way: https://github.com/ceph/ceph/pull/29307

Cheers,
--
Ricardo Dias
Senior Software Engineer - Storage Team
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 
(AG Nürnberg)

On 24/07/19 07:20,  ST Wong (ITSC)  wrote:
> Hi,
> 
> Thanks for your help.
> I changed IP addresses of OSD nodes successfully.  When changing IP address 
> on MON, it works except that the MON only listens on v2 port 3300 after 
> adding the MON back to the cluster.  Previously the MON listens on both v1 
> (6789) and v2 (3300).  
> Besides, I can't add both v1 and v2 entries manually using monmaptool like 
> the following.  Only the 2nd one will get added.
> 
> monmaptool -add node1  v1:10.0.1.1:6789, v2:10.0.1.1:3330  
> {tmp}/{filename}
> 
> the monmap now looks like following:
> 
> min_mon_release 14 (nautilus)
> 0: [v2:10.0.1.92:3300/0,v1:10.0.1.92:6789/0] mon.cmon2
> 1: [v2:10.0.1.93:3300/0,v1:10.0.1.93:6789/0] mon.cmon2
> 2: [v2:10.0.1.94:3300/0,v1:10.0.1.94:6789/0] mon.cmon3
> 3: [v2:10.0.1.95:3300/0,v1:10.0.1.95:6789/0] mon.cmon4
> 4: v2:10.0.1.97:3300/0 mon.cmon5  <--- the MON 
> being removed/added 
> 
> Although it's okay to use v2 only, I'm afraid I've missed some steps and 
> messed the cluster up.Any advice?
> Thanks again.
> 
> Best Regards,
> /stwong
> 
> -Original Message-
> From: ceph-users  On Behalf Of 
> Manuel Lausch
> Sent: Tuesday, July 23, 2019 7:32 PM
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Please help: change IP address of a cluster
> 
> Hi,
> 
> I had to change the IPs of my cluster some time ago. The process was quite 
> easy.
> 
> I don't understand what you mean with configuring and deleting static 
routes. The easiest way is if the router allows (at least for the
> change) all traffic between the old and the new network. 
> 
> I did the following steps.
> 
> 1. Add the new IP network, space separated, to the "public network" line 
> in your ceph.conf
> 
> 2. OSDS: stop you OSDs on the first node. Reconfigure the host network 
> and start your OSDs again. Repeat this for all hosts one by one
> 
> 3. MON: stop and remove one mon from the cluster, delete all data in 
> /var/ceph/mon/mon. Reconfigure the host network. Create the new mon 
> instance (don't forget the "mon host" entries in your ceph.conf and your 
> clients as well). Of course this requires at least 3 mons in your cluster!
> After 2 of 5 Mons in my cluster I added the new mon adresses to my clients 
> and restarted them. 
> 
> 4. MGR: stop the mgr daemon. reconfigure the host network. Start the 
> mgr daemon one by one
> 
> 
> I wouldn't recommend the "messy way" to reconfigure your mons. Removing and 
> adding mons to the cluster is quite easy and in my opinion the most secure.
> 
> The complete IP change in our cluster worked without an outage while the 
> cluster was in production.
> 
> I hope I could help you.
> 
> Regards
> Manuel
> 
> 
> 
> On Fri, 19 Jul 2019 10:22:37 +
> "ST Wong (ITSC)"  wrote:
> 
>> Hi all,
>>
>> Our cluster has to change to new IP range in same VLAN:  10.0.7.0/24
>> -> 10.0.18.0/23, while IP address on private network for OSDs
>> remains unchanged. I wonder if we can do that in either one following
>> ways:
>>
>> =
>>
>> 1.
>>
>> a.   Define static route for 10.0.18.0/23 on each node
>>
>> b.   Do it one by one:
>>
>> For each monitor/mgr:
>>
>> -  remove from cluster
>>
>> -  change IP address
>>
>> -  add static route to original IP range 10.0.7.0/24
>>
>> -  delete static route for 10.0.18.0/23
>>
>> -  add back to cluster
>>
>> For each OSD:
>>
>> -  stop OSD daemons
>>
>> -   change IP address
>>
>> -  add static route to original IP range 10.0.7.0/24
>>
>> -  delete static route for 10.0.18.0/23
>>
>> -  start OSD daemons
>>
>> c.   Clean up all static routes defined.
>>
>>
>>
>> 2.
>>
>> a.   Export and update monmap using the messy way as described in
>> http://docs.ceph.com/docs/mimic/rados/operations/add-or-rm-mons/
>>
>>
>>
>> ceph mon getmap -o {tmp}/{filename}
>>
>> monmaptool -rm node1 -rm node2 ... --rm node n {tmp}/{filename}
>>
>> monmaptool -add node1 v2:10.0.18.1:3330,v1:10.0.18.1:6789 -add node2
>> v2:10.0.18.2:3330,v1:10.0.18.2:6789 ... --add nodeN
>> v2:10.0.18.N:3330,v1:10.0.18.N:6789  {tmp}/{filename}
>>
>>
>>
>> b.   stop entire cluster daemons and change IP addre

Re: [ceph-users] ceph+docker

2019-07-31 Thread zhanrzh...@teamsun.com.cn
Hello Paul,
Thank you a lot for the reply. Are there other versions of ceph ready to run on docker?

--
zhanrzh...@teamsun.com.cn
>Please don't start new deployments with Luminous, it's EOL since last
>month: https://docs.ceph.com/docs/master/releases/schedule/
>(but it'll probably still receive critical updates if anything happens
>as there are lots of deployments out there)
>
>But there's no good reason to not start with Nautilus at the moment.
>
>Paul
>
>--
>Paul Emmerich
>
>Looking for help with your Ceph cluster? Contact us at https://croit.io
>
>croit GmbH
>Freseniusstr. 31h
>81247 München
>www.croit.io
>Tel: +49 89 1896585 90
>
>On Wed, Jul 31, 2019 at 4:04 AM zhanrzh...@teamsun.com.cn
> wrote:
>>
>>
>> Hello,cephers
>> We are researching ceph (Luminous) and docker, getting ready to use them in 
>> our future production environment.
>> Is anyone using ceph and docker in production? Any suggestions would be 
>> appreciated.
>>
>> --
>> zhanrzh...@teamsun.com.cn
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Wrong ceph df result

2019-07-31 Thread Sylvain PORTIER

Hi Igor,

Thank you so much for the intel, I will now repair all OSDs.

Kind regards,

Sylvain.

Le 30/07/2019 à 11:22, Igor Fedotov a écrit :

Pool stats issue with upgrades to nautilus



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Large OMAP Objects in zone.rgw.log pool

2019-07-31 Thread Brett Chancellor
I was able to answer my own question. For future interested parties, I
initiated a deep scrub on the placement group, which cleared the error.
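
For reference, a minimal sketch of that (the pg id below is a placeholder;
the pool id is the number before the colon in the logged object name, and
'ceph pg ls-by-pool us-prd-1.rgw.log' lists the candidate PGs):

$ grep 'Large omap object' /var/log/ceph/ceph.log
$ ceph pg deep-scrub 51.1a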

On Tue, Jul 30, 2019 at 1:48 PM Brett Chancellor 
wrote:

> I was able to remove the meta objects, but the cluster is still in WARN
> state
> HEALTH_WARN 1 large omap objects
> LARGE_OMAP_OBJECTS 1 large omap objects
> 1 large objects found in pool 'us-prd-1.rgw.log'
> Search the cluster log for 'Large omap object found' for more details.
>
> How do I go about clearing it out? I don't see any other references to
> large omap in any of the logs.  I've tried restarted the mgr's, the
> monitors, and even the osd that reported the issue.
>
> -Brett
>
> On Thu, Jul 25, 2019 at 2:55 PM Brett Chancellor <
> bchancel...@salesforce.com> wrote:
>
>> 14.2.1
>> Thanks, I'll try that.
>>
>> On Thu, Jul 25, 2019 at 2:54 PM Casey Bodley  wrote:
>>
>>> What ceph version is this cluster running? Luminous or later should not
>>> be writing any new meta.log entries when it detects a single-zone
>>> configuration.
>>>
>>> I'd recommend editing your zonegroup configuration (via 'radosgw-admin
>>> zonegroup get' and 'put') to set both log_meta and log_data to false,
>>> then commit the change with 'radosgw-admin period update --commit'.
>>>
>>> You can then delete any meta.log.* and data_log.* objects from your log
>>> pool using the rados tool.
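
[A rough sketch of the sequence Casey describes; double-check the JSON
before the 'set', and note the object name patterns are taken from the
listing further down:

$ radosgw-admin zonegroup get > zonegroup.json
$ # edit zonegroup.json: set "log_meta": "false" and "log_data": "false"
$ radosgw-admin zonegroup set < zonegroup.json
$ radosgw-admin period update --commit
$ rados -p us-prd-1.rgw.log ls | grep -E '^(meta\.log|data_log)' \
    | xargs -r -n1 rados -p us-prd-1.rgw.log rm ]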
>>>
>>> On 7/25/19 2:30 PM, Brett Chancellor wrote:
>>> > Casey,
>>> >   These clusters were setup with the intention of one day doing multi
>>> > site replication. That has never happened. The cluster has a single
>>> > realm, which contains a single zonegroup, and that zonegroup contains
>>> > a single zone.
>>> >
>>> > -Brett
>>> >
>>> > On Thu, Jul 25, 2019 at 2:16 PM Casey Bodley >> > > wrote:
>>> >
>>> > Hi Brett,
>>> >
>>> > These meta.log objects store the replication logs for metadata
>>> > sync in
>>> > multisite. Log entries are trimmed automatically once all other
>>> zones
>>> > have processed them. Can you verify that all zones in the multisite
>>> > configuration are reachable and syncing? Does 'radosgw-admin sync
>>> > status' on any zone show that it's stuck behind on metadata sync?
>>> > That
>>> > would prevent these logs from being trimmed and result in these
>>> large
>>> > omap warnings.
>>> >
>>> > On 7/25/19 1:59 PM, Brett Chancellor wrote:
>>> > > I'm having an issue similar to
>>> > >
>>> >
>>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-March/033611.html
>>>  .
>>> >
>>> > > I don't see where any solution was proposed.
>>> > >
>>> > > $ ceph health detail
>>> > > HEALTH_WARN 1 large omap objects
>>> > > LARGE_OMAP_OBJECTS 1 large omap objects
>>> > > 1 large objects found in pool 'us-prd-1.rgw.log'
>>> > > Search the cluster log for 'Large omap object found' for
>>> > more details.
>>> > >
>>> > > $ grep "Large omap object" /var/log/ceph/ceph.log
>>> > > 2019-07-25 14:58:21.758321 osd.3 (osd.3) 15 : cluster [WRN]
>>> > Large omap
>>> > > object found. Object:
>>> > >
>>> 51:61eb35fe:::meta.log.e557cf47-46df-4b45-988e-9a94c5004a2e.19:head
>>> > > Key count: 3382154 Size (bytes): 611384043
>>> > >
>>> > > $ rados -p us-prd-1.rgw.log listomapkeys
>>> > > meta.log.e557cf47-46df-4b45-988e-9a94c5004a2e.19 |wc -l
>>> > > 3382154
>>> > >
>>> > > $ rados -p us-prd-1.rgw.log listomapvals
>>> > > meta.log.e557cf47-46df-4b45-988e-9a94c5004a2e.19
>>> > > This returns entries from almost every bucket, across multiple
>>> > > tenants. Several of the entries are from buckets that no longer
>>> > exist
>>> > > on the system.
>>> > >
>>> > > $ ceph df |egrep 'OBJECTS|.rgw.log'
>>> > > POOLID  STORED  OBJECTS USED%USED MAX
>>> > > AVAIL
>>> > > us-prd-1.rgw.log 51 758 MiB 228   758 MiB
>>> > >   0   102 TiB
>>> > >
>>> > > Thanks,
>>> > >
>>> > > -Brett
>>> > >
>>> > > ___
>>> > > ceph-users mailing list
>>> > > ceph-users@lists.ceph.com 
>>> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> > ___
>>> > ceph-users mailing list
>>> > ceph-users@lists.ceph.com 
>>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> >
>>>
>>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Adventures with large RGW buckets

2019-07-31 Thread Paul Emmerich
Hi,

We are seeing a trend towards rather large RGW S3 buckets lately. We've
worked on several clusters with 100 - 500 million objects in a single
bucket, and we've been asked about the possibilities of buckets with
several billion objects more than once.

From our experience: buckets with tens of million objects work just fine with
no big problems usually. Buckets with hundreds of million objects require some
attention. Buckets with billions of objects? "How about indexless buckets?" -
"No, we need to list them".


A few stories and some questions:


1. The recommended number of objects per shard is 100k. Why? How was this
default configuration derived?

It doesn't really match my experiences. We know a few clusters running with
larger shards because resharding isn't possible for various reasons at the
moment. They sometimes work better than buckets with lots of shards.

So we've been considering to at least double that 100k target shard size
for large buckets, that would make the following point far less annoying.



2. Many shards + ordered object listing = lots of IO

Unfortunately telling people to not use ordered listings when they don't really
need them doesn't really work as their software usually just doesn't support
that :(

A listing request for X objects will retrieve up to X objects from each shard
for ordering them. That will lead to quite a lot of traffic between the OSDs
and the radosgw instances, even for relatively innocent simple queries as X
defaults to 1000 usually.

Simple example: just getting the first page of a bucket listing with 4096
shards fetches around 1 GB of data from the OSD to return ~300kb or so to the
S3 client.

I've got two clusters here that are only used for some relatively low-bandwidth
backup use case here. However, there are a few buckets with hundreds of millions
of objects that are sometimes being listed by the backup system.

The result is that this cluster has an average read IO of 1-2 GB/s, all going
to the index pool. Not a big deal since that's coming from SSDs and goes over
80 Gbit/s LACP bonds. But it does pose the question about scalability
as the user-
visible load created by the S3 clients is quite low.



3. Deleting large buckets

Someone accidentally put 450 million small objects into a bucket and only noticed
when the cluster ran full. The bucket isn't needed, so just delete it and case
closed?

Deleting is unfortunately far slower than adding objects; also,
radosgw-admin leaks
memory during deletion: https://tracker.ceph.com/issues/40700

Increasing --max-concurrent-ios helps with deletion speed (the option does
affect deletion concurrency, even though the documentation says it's only for
other specific commands).

Since the deletion is going faster than new data is being added to that cluster
the "solution" was to run the deletion command in a memory-limited cgroup and
restart it automatically after it gets killed due to leaking.
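
A sketch of that workaround (the bucket name is a placeholder; MemoryMax
needs systemd 231+, and --max-concurrent-ios is the option discussed above):

until systemd-run --scope -p MemoryMax=4G -- \
      radosgw-admin bucket rm --bucket=big-bucket \
        --purge-objects --max-concurrent-ios=128; do
    echo "radosgw-admin was killed, restarting..."
done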


How could the bucket deletion of the future look like? Would it be possible
to put all objects in buckets into RADOS namespaces and implement some kind
of efficient namespace deletion on the OSD level similar to how pool deletions
are handled at a lower level?



4. Common prefixes could be filtered in the rgw class on the OSD instead
of in radosgw

Consider a bucket with 100 folders with 1000 objects in each and only one shard

/p1/1, /p1/2, ..., /p1/1000, /p2/1, /p2/2, ..., /p2/1000, ... /p100/1000


Now a user wants to list / with aggregating common prefixes on the
delimiter / and
wants up to 1000 results.
So there'll be 100 results returned to the client: the common prefixes
p1 to p100.

How much data will be transfered between the OSDs and radosgw for this request?
How many omap entries does the OSD scan?

radosgw will ask the (single) index object to list the first 1000 objects. It'll
return 1000 objects in a quite unhelpful way: /p1/1, /p1/2, ..., /p1/1000

radosgw will discard 999 of these and detect one common prefix and continue the
iteration at /p1/\xFF to skip the remaining entries in /p1/ if there are any.
The OSD will then return everything in /p2/ in that next request and so on.

So it'll internally list every single object in that bucket. That will
be a problem
for large buckets and having lots of shards doesn't help either.


This shouldn't be too hard to fix: add an option "aggregate prefixes" to the RGW
class method and duplicate the fast-forward logic from radosgw in
cls_rgw. It doesn't
even need to change the response type or anything, it just needs to
limit entries in
common prefixes to one result.
Is this a good idea or am I missing something?

IO would be reduced by a factor of 100 for that particular
pathological case. I've
unfortunately seen a real-world setup that I think hits a case like that.


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
__

Re: [ceph-users] details about cloning objects using librados

2019-07-31 Thread Gregory Farnum
On Wed, Jul 31, 2019 at 1:32 AM nokia ceph  wrote:

> Hi Greg,
>
> We were trying to implement this however having issues in assigning the
> destination object name with this api.
> There is a rados command "rados -p  cp  " , is
> there any librados api equivalent to this ?
>

The copyfrom operation, like all other ops, is directed to a specific
object. The object you run it on is the destination; it copies the
specified “src” object into itself.
-Greg


> Thanks,
> Muthu
>
> On Fri, Jul 5, 2019 at 4:00 PM nokia ceph 
> wrote:
>
>> Thank you Greg, we will try this out .
>>
>> Thanks,
>> Muthu
>>
>> On Wed, Jul 3, 2019 at 11:12 PM Gregory Farnum 
>> wrote:
>>
>>> Well, the RADOS interface doesn't have a great deal of documentation
>>> so I don't know if I can point you at much.
>>>
>>> But if you look at Objecter.h, you see that the ObjectOperation has
>>> this function:
>>> void copy_from(object_t src, snapid_t snapid, object_locator_t
>>> src_oloc, version_t src_version, unsigned flags, unsigned
>>> src_fadvise_flags)
>>>
>>> src: the object to copy from
>>> snapid: if you want to copy a specific snap instead of HEAD
>>> src_oloc: the object locator for the object
>>> src_version: the version of the object to copy from (helps identify if
>>> it was updated in the meantime)
>>> flags: probably don't want to set these, but see
>>> PrimaryLogPG::_copy_some for the choices
>>> src_fadvise_flags: these are the fadvise flags we have in various
>>> places that let you specify things like not to cache the data.
>>> Probably leave them unset.
>>>
>>> -Greg
>>>
>>>
>>>
>>> On Wed, Jul 3, 2019 at 2:47 AM nokia ceph 
>>> wrote:
>>> >
>>> > Hi Greg,
>>> >
>>> > Can you please share the api details  for COPY_FROM or any reference
>>> document?
>>> >
>>> > Thanks ,
>>> > Muthu
>>> >
>>> > On Wed, Jul 3, 2019 at 4:12 AM Brad Hubbard 
>>> wrote:
>>> >>
>>> >> On Wed, Jul 3, 2019 at 4:25 AM Gregory Farnum 
>>> wrote:
>>> >> >
>>> >> > I'm not sure how or why you'd get an object class involved in doing
>>> >> > this in the normal course of affairs.
>>> >> >
>>> >> > There's a copy_from op that a client can send and which copies an
>>> >> > object from another OSD into the target object. That's probably the
>>> >> > primitive you want to build on. Note that the OSD doesn't do much
>>> >>
>>> >> Argh! yes, good idea. We really should document that!
>>> >>
>>> >> > consistency checking (it validates that the object version matches
>>> an
>>> >> > input, but if they don't it just returns an error) so the client
>>> >> > application is responsible for any locking needed.
>>> >> > -Greg
>>> >> >
>>> >> > On Tue, Jul 2, 2019 at 3:49 AM Brad Hubbard 
>>> wrote:
>>> >> > >
>>> >> > > Yes, this should be possible using an object class which is also a
>>> >> > > RADOS client (via the RADOS API). You'll still have some client
>>> >> > > traffic as the machine running the object class will still need to
>>> >> > > connect to the relevant primary osd and send the write
>>> (presumably in
>>> >> > > some situations though this will be the same machine).
>>> >> > >
>>> >> > > On Tue, Jul 2, 2019 at 4:08 PM nokia ceph <
>>> nokiacephus...@gmail.com> wrote:
>>> >> > > >
>>> >> > > > Hi Brett,
>>> >> > > >
>>> >> > > > I think I was wrong here in the requirement description. It is
>>> not about data replication , we need same content stored in different
>>> object/name.
>>> >> > > > We store video contents inside the ceph cluster. And our new
>>> requirement is we need to store same content for different users , hence
>>> need same content in different object name . if client sends write request
>>> for object x and sets number of copies as 100, then cluster has to clone
>>> 100 copies of object x and store it as object x1, objectx2,etc. Currently
>>> this is done in the client side where objectx1, object x2...objectx100 are
>>> cloned inside the client and write request sent for all 100 objects which
>>> we want to avoid to reduce network consumption.
>>> >> > > >
>>> >> > > > Similar usecases are rbd snapshot , radosgw copy .
>>> >> > > >
>>> >> > > > Is this possible in object class ?
>>> >> > > >
>>> >> > > > thanks,
>>> >> > > > Muthu
>>> >> > > >
>>> >> > > >
>>> >> > > > On Mon, Jul 1, 2019 at 7:58 PM Brett Chancellor <
>>> bchancel...@salesforce.com> wrote:
>>> >> > > >>
>>> >> > > >> Ceph already does this by default. For each replicated pool,
>>> you can set the 'size' which is the number of copies you want Ceph to
>>> maintain. The accepted norm for replicas is 3, but you can set it higher if
>>> you want to incur the performance penalty.
>>> >> > > >>
>>> >> > > >> On Mon, Jul 1, 2019, 6:01 AM nokia ceph <
>>> nokiacephus...@gmail.com> wrote:
>>> >> > > >>>
>>> >> > > >>> Hi Brad,
>>> >> > > >>>
>>> >> > > >>> Thank you for your response , and we will check this video as
>>> well.
>>> >> > > >>> Our requirement is while writing an object into the cluster ,
>>> if we can provide number of copies to be made , the network consumption

Re: [ceph-users] cephfs quota setfattr permission denied

2019-07-31 Thread Mattia Belluco
Hi Nathan,

Indeed that was the reason. With your hint I was able to find the
relevant documentation:

https://docs.ceph.com/docs/master/cephfs/client-auth/

that is completely absent from:

https://docs.ceph.com/docs/master/cephfs/quota/#configuration

I will send a pull request to include the link to client-auth in the
CephFS quota page.

thanks
Mattia

On 7/31/19 2:45 PM, Nathan Fish wrote:
> The client key which is used to mount the FS needs the 'p' permission
> to set xattrs. eg:
> 
> ceph fs authorize cephfs client.foo / rwsp
> 
> That might be your problem.
> 
> On Wed, Jul 31, 2019 at 5:43 AM Mattia Belluco  wrote:
>>
>> Dear ceph users,
>>
>> We have been recently trying to use the two quota attributes:
>>
>> - ceph.quota.max_files
>> - ceph.quota.max_bytes
>>
>> to prepare for quota enforcing.
>>
>> While the idea is quite straightforward we found out we cannot set any
>> additional file attribute (we tried with the directory pinning, too): we
>> get a straight "permission denied" each time.
>>
>> We are running the latest version of Mimic on a cluster used exclusively
>> for cephfs.
>>
>> This is the output of a "ceph fs dump"
>>
>> Filesystem 'spindlefs' (4)
>> fs_name spindlefs
>> epoch   18062
>> flags   12
>> created 2019-02-21 17:53:48.240659
>> modified2019-07-30 16:52:13.141688
>> tableserver 0
>> root0
>> session_timeout 60
>> session_autoclose   300
>> max_file_size   1099511627776
>> min_compat_client   -1 (unspecified)
>> last_failure0
>> last_failure_osd_epoch  31203
>> compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable
>> ranges,3=default file layouts on dirs,4=dir inode in separate
>> object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no
>> anchor table,9=file layout v2,10=snaprealm v2}
>> max_mds 1
>> in  0
>> up  {0=10744488}
>> failed
>> damaged
>> stopped
>> data_pools  [11]
>> metadata_pool   10
>> inline_data disabled
>> balancer
>> standby_count_wanted1
>> 10744488:   10.129.48.46:6800/549232138 'mds-l15-34' mds.0.18058 
>> up:active
>> seq 49
>>
>> and the command we are trying to use:
>>
>> setfattr -n ceph.quota.max_bytes -v 1 /user/dir
>>
>> It should be all pretty much default and we could find no reference
>> online regarding settings to allow the setfattr to succeed.
>>
>> Any hint is welcome.
>>
>> Kind regards
>> mattia
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


-- 
Mattia Belluco
S3IT Services and Support for Science IT
Office Y11 F 52
University of Zürich
Winterthurerstrasse 190, CH-8057 Zürich (Switzerland)
Tel: +41 44 635 42 22
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph+docker

2019-07-31 Thread Paul Emmerich
Please don't start new deployments with Luminous, it's EOL since last
month: https://docs.ceph.com/docs/master/releases/schedule/
(but it'll probably still receive critical updates if anything happens
as there are lots of deployments out there)

But there's no good reason to not start with Nautilus at the moment.

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Wed, Jul 31, 2019 at 4:04 AM zhanrzh...@teamsun.com.cn
 wrote:
>
>
> Hello,cephers
> We are researching ceph (Luminous) and docker, getting ready to use them in 
> our future production environment.
> Is anyone using ceph and docker in production? Any suggestions would be 
> appreciated.
>
> --
> zhanrzh...@teamsun.com.cn
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Urgent Help Needed (regarding rbd cache)

2019-07-31 Thread Paul Emmerich
Yes, this is power-failure safe. It behaves exactly the same as a real
disk's write cache.

It's really a question about semantics: what does it mean to write
data? Should the data still be guaranteed to be there after a power
failure?
The answer for most writes is: no, such a guarantee is neither
necessary nor helpful.

If you really want to write data then you'll have to specify that
during the write (sync or similar commands). It doesn't matter if you
are using
a networked file system with a cache or a real disk with a cache
underneath, the behavior will be the same. Data not flushed out
explicitly
or written with a sync flag will be lost after a power outage. But
that's fine because all reasonable file systems, applications, and
operating
systems are designed to handle exactly this case (disk caches aren't a
new thing).
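
For reference, the client-side options involved look roughly like this in
ceph.conf (the values are the upstream defaults as far as I remember; check
the docs for your release):

[client]
rbd cache = true
# stay in writethrough until the guest sends its first flush, which
# protects guests that never flush at all
rbd cache writethrough until flush = true
rbd cache size = 33554432        # 32 MiB of cache per volume
rbd cache max dirty = 25165824   # up to 24 MiB may be dirty (writeback)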


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Wed, Jul 31, 2019 at 1:08 PM Janne Johansson  wrote:
>
> Den ons 31 juli 2019 kl 06:55 skrev Muhammad Junaid :
>>
>> The question is about RBD Cache in write-back mode using KVM/libvirt. If we 
>> enable this, it uses local KVM Host's RAM as cache for VM's write requests. 
>> And KVM Host immediately responds to VM's OS that data has been written to 
>> Disk (Actually it is still not on OSD's yet). Then how can be it power 
>> failure safe?
>>
> It is not. Nothing is power-failure safe. However you design things, you will 
> always have the possibility of some long write being almost done when the 
> power goes away, and that write (and perhaps others in caches) will be lost. 
> Different filesystems handle losses good or bad, databases running on those 
> filesystems will have their I/O interrupted and not acknowledged which may be 
> not-so-bad or very bad.
>
> The write-caches you have in the KVM guest and this KVM RBD cache will make 
> the guest I/Os faster, at the expense of higher risk of losing data in a 
> power outage, but there will be some kind of roll-back, some kind of 
> fsck/trash somewhere to clean up if a KVM host dies with guests running.
> In 99% of the cases, this is ok, the only lost things are "last lines to this 
> log file" or "those 3 temp files for the program that ran" and in the last 1% 
> you need to pull out your backups like when the physical disks die.
>
> If someone promises "power failure safe" then I think they are misguided. 
> Chances may be super small for bad things to happen, but it will just never 
> be 0%. Also, I think the guest VMs have the possibility to ask kvm-rbd to 
> flush data out, and as such take the "pain" of waiting for real completion 
> when it is actually needed, so that other operation can go fast (and slightly 
> less safe) and the IO that needs harder guarantees can call for flushing and 
> know when data actually is on disk for real.
>
> --
> May the most significant bit of your life be positive.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW - Multisite setup -> question about Bucket - Sharding, limitations and synchronization

2019-07-31 Thread Eric Ivancich

> On Jul 30, 2019, at 7:49 AM, Mainor Daly  
> wrote:
> 
> Hello,
> 
> (everything in context of S3)
> 
> 
> I'm currently trying to better understand bucket sharding in combination with 
> a multisite rgw setup and its possible limitations.
> 
> At the moment I understand that a bucket has a bucket index, which is a list 
> of objects within the bucket.
> 
> There are also indexless buckets, but those are not usable for cases like a 
> multisite rgw bucket, where you need a [delayed] consistent relation/state 
> between bucket n [zone a] and bucket n [zone b].
> 
> Those bucket indexes are stored in "shards" and shards get distributed over 
> the whole zone cluster for scaling purposes.
> Redhat recommends a maximum of 102,400 objects per shard and recommends this 
> formula to determine the right shard count for a cluster:
> 
> number of objects expected in a bucket / 100,000 
> The max number of supported shards (or tested limit) is 7877 shards.

Back in 2017 this maximum number of shards changed to 65521. This change is in 
luminous, mimic, and nautilus.

> That results in a total limit of 787,700,000 objects, as long as you want to 
> stay in known and tested waters.
> 
> Now some of the things I did not 100% understand:
> 
> = QUESTION 1 =
> 
> Does each bucket have its own shards? E.g.
> 
> if bucket 1 reached its shard limit at 7877 shards, can I then create other 
> buckets which start with their own fresh sets of shards?
> OR is it the other way around, which would mean all buckets save their index 
> in the same shards, and if I reach the shard limit I need to create a 
> second cluster?

Correct, each bucket has its own bucket index. And each bucket index can be 
sharded.
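
A practical way to look at this per bucket (a sketch; these subcommands
should exist on luminous and later, the bucket name is a placeholder, and
note that resharding had limitations with multisite on these releases):

$ radosgw-admin bucket limit check    # objects and fill level per shard
$ radosgw-admin reshard add --bucket=mybucket --num-shards=128
$ radosgw-admin reshard process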

> = QUESTION 2 =
> How are these shards distributed over the cluster? I expect they are just 
> objects in the rgw.buckets.index pool, is that correct?
> So. those one:
> rados ls -p a.rgw.buckets.index 
> .dir.3638e3a4-8dde-42ee-812a-f98e266548a4.274451.1
> .dir.3638e3a4-8dde-42ee-812a-f98e266548a4.87683.1
> .dir.3638e3a4-8dde-42ee-812a-f98e266548a4.64716.1
> .dir.3638e3a4-8dde-42ee-812a-f98e266548a4.78046.2

They are just objects and distributed via the CRUSH algorithm.

> = QUESTION 3 = 
> 
> 
> Do these bucket index shards have any relation to the RGW sync shards in an 
> rgw multisite setup?
> E.g. if I have a ton of bucket index shards or buckets, do they have any 
> impact on the sync shards? 

They’re separate.

> radosgw-admin sync status
> realm f0019e09-c830-4fe8-a992-435e6f463b7c (mumu_1)
> zonegroup 307a1bb5-4d93-4a01-af21-0d8467b9bdfe (EU_1)
> zone 5a9c4d16-27a6-4721-aeda-b1a539b3d73a (b)
> metadata sync syncing
> full sync: 0/64 shards<= this ones I mean
> incremental sync: 64/64 shards
> metadata is caught up with master
> data sync source: 3638e3a4-8dde-42ee-812a-f98e266548a4 (a)
> syncing
> full sync: 0/128 shards   <= and this ones
> incremental sync: 128/128 shards <= and this ones
> data is caught up with source
> 
> 
> = QUESTION 4 = 
> (switching to sync shard related topics)
> 
> 
> What is the exact function and purpose of the sync shards? Do they implement 
> any limit? E.g. maybe a maximum number of object entries that wait for 
> synchronization to zone b.

They contain logs of items that need to be synced between zones. RGWs will look 
at them and sync objects. These logs are sharded so different RGWs can take on 
different shards and work on syncing in parallel.

> = QUESTION 5 = 
> Are those sync shards processed in parallel or sequentially? And where are 
> those shards stored?

They’re sharded to allow parallelism. At any given moment, each shard is 
claimed by (locked by) one RGW. And each RGW may be claiming multiple shards. 
Collectively, all RGW are claiming all shards. Each RGW is syncing multiple 
shards in parallel and all RGWs are doing this in parallel. So in some sense 
there are two levels of parallelism.

> = QUESTION 6 = 
> As far as I have experienced the sync process pretty much works like that:
> 
> 1.) The client sends a object or a operation to a rados gateway A (RGW A)
> 2.) RGW A logs this operation into one of it's sync shards and execute the 
> operation to it's local storage pool
> 3.) RGW B checks via get requests at a regular interval whether any new 
> entries appear in the RGW A log 
> 4.) If a new entry exists RGW B it's execute the operation to it's local pool 
> or pulls the new object from RGW A
> 
> Did I understand that correctly? (For my rough description of this 
> functionality, I want to apologize to the developers, who surely invested 
> much time and effort into the design and building of that sync process.)

That’s about right.

> And if I understand it correctly, what would the exact strategy look like in 
> a multisite setup to resync e.g. a single bucket where one zone got 
> corrupted and must be brought back into a synchronous state?

Be aware that there are full syncs and incremental syncs. Full syncs just copy 

Re: [ceph-users] cephfs quota setfattr permission denied

2019-07-31 Thread Nathan Fish
The client key which is used to mount the FS needs the 'p' permission
to set xattrs. eg:

ceph fs authorize cephfs client.foo / rwsp

That might be your problem.
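
A minimal end-to-end sketch (using the fs name and path from your mail; the
quota value here is just an example of 100 GiB):

$ ceph fs authorize spindlefs client.foo / rwsp
$ # remount /user with client.foo, then:
$ setfattr -n ceph.quota.max_bytes -v 107374182400 /user/dir
$ getfattr -n ceph.quota.max_bytes /user/dir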

On Wed, Jul 31, 2019 at 5:43 AM Mattia Belluco  wrote:
>
> Dear ceph users,
>
> We have been recently trying to use the two quota attributes:
>
> - ceph.quota.max_files
> - ceph.quota.max_bytes
>
> to prepare for quota enforcing.
>
> While the idea is quite straightforward we found out we cannot set any
> additional file attribute (we tried with the directory pinning, too): we
> get a straight "permission denied" each time.
>
> We are running the latest version of Mimic on a cluster used exclusively
> for cephfs.
>
> This is the output of a "ceph fs dump"
>
> Filesystem 'spindlefs' (4)
> fs_name spindlefs
> epoch   18062
> flags   12
> created 2019-02-21 17:53:48.240659
> modified2019-07-30 16:52:13.141688
> tableserver 0
> root0
> session_timeout 60
> session_autoclose   300
> max_file_size   1099511627776
> min_compat_client   -1 (unspecified)
> last_failure0
> last_failure_osd_epoch  31203
> compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable
> ranges,3=default file layouts on dirs,4=dir inode in separate
> object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no
> anchor table,9=file layout v2,10=snaprealm v2}
> max_mds 1
> in  0
> up  {0=10744488}
> failed
> damaged
> stopped
> data_pools  [11]
> metadata_pool   10
> inline_data disabled
> balancer
> standby_count_wanted1
> 10744488:   10.129.48.46:6800/549232138 'mds-l15-34' mds.0.18058 up:active
> seq 49
>
> and the command we are trying to use:
>
> setfattr -n ceph.quota.max_bytes -v 1 /user/dir
>
> It should be all pretty much default and we could find no reference
> online regarding settings to allow the setfattr to succeed.
>
> Any hint is welcome.
>
> Kind regards
> mattia
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] reproducible rbd-nbd crashes

2019-07-31 Thread Jason Dillaman
On Wed, Jul 31, 2019 at 6:20 AM Marc Schöchlin  wrote:
>
> Hello Jason,
>
> it seems that there is something wrong in the rbd-nbd implementation.
> (added this information also at  https://tracker.ceph.com/issues/40822)
>
> The problem does not seem to be related to kernel releases, filesystem types 
> or the ceph and network setup.
> Release 12.2.5 seems to work properly, and at least releases >= 12.2.10 seem 
> to have the described problem.

Here is the complete delta between the two releases in rbd-nbd:

$ git diff v12.2.5..v12.2.12 -- .
diff --git a/src/tools/rbd_nbd/rbd-nbd.cc b/src/tools/rbd_nbd/rbd-nbd.cc
index 098d9925ca..aefdbd36e0 100644
--- a/src/tools/rbd_nbd/rbd-nbd.cc
+++ b/src/tools/rbd_nbd/rbd-nbd.cc
@@ -595,14 +595,13 @@ static int do_map(int argc, const char *argv[], Config *cfg)
   cerr << err << std::endl;
   return r;
 }
-
 if (forker.is_parent()) {
-  global_init_postfork_start(g_ceph_context);
   if (forker.parent_wait(err) != 0) {
 return -ENXIO;
   }
   return 0;
 }
+global_init_postfork_start(g_ceph_context);
   }

   common_init_finish(g_ceph_context);
@@ -724,8 +723,8 @@ static int do_map(int argc, const char *argv[], Config *cfg)

   if (info.size > ULONG_MAX) {
 r = -EFBIG;
-cerr << "rbd-nbd: image is too large (" << prettybyte_t(info.size)
- << ", max is " << prettybyte_t(ULONG_MAX) << ")" << std::endl;
+cerr << "rbd-nbd: image is too large (" << byte_u_t(info.size)
+ << ", max is " << byte_u_t(ULONG_MAX) << ")" << std::endl;
 goto close_nbd;
   }

@@ -761,9 +760,8 @@ static int do_map(int argc, const char *argv[], Config *cfg)
 cout << cfg->devpath << std::endl;

 if (g_conf->daemonize) {
-  forker.daemonize();
-  global_init_postfork_start(g_ceph_context);
   global_init_postfork_finish(g_ceph_context);
+  forker.daemonize();
 }

 {

It's basically just a log message tweak and some changes to how the
process is daemonized. If you could re-test w/ each release after
12.2.5 and pin-point where the issue starts occurring, we would have
something more to investigate.
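
Something like this could drive the per-release test (a sketch only: the
package version suffixes are placeholders that depend on your distro, and
run-io-test.sh stands in for the gzip/gunzip loop below):

# look up exact version strings with 'apt-cache madison rbd-nbd' first
for v in 12.2.6 12.2.7 12.2.8 12.2.9 12.2.10 12.2.11 12.2.12; do
    apt-get install -y --allow-downgrades rbd-nbd=$v-1xenial ceph-common=$v-1xenial
    rbd-nbd map pool/image
    ./run-io-test.sh
    rbd-nbd unmap /dev/nbd0
done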

> Tonight an 18-hour test run with the following procedure was successful:
> -
> #!/bin/bash
> set -x
> while true; do
>date
>find /srv_ec -type f -name "*.MYD" -print0 |head -n 50|xargs -0 -P 10 -n 2 gzip -v
>date
>find /srv_ec -type f -name "*.MYD.gz" -print0 |head -n 50|xargs -0 -P 10 -n 2 gunzip -v
> done
> -
> Previous tests crashed in a reproducible manner with "-P 1" (single io 
> gzip/gunzip) after a few minutes up to 45 minutes.
>
> Overview of my tests:
>
> - SUCCESSFUL: kernel 4.15, ceph 12.2.5, 1TB ec-volume, ext4 file system, 120s 
> device timeout
>   -> 18 hour testrun was successful, no dmesg output
> - FAILED: kernel 4.4, ceph 12.2.11, 2TB ec-volume, xfs file system, 120s 
> device timeout
>   -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io 
> errors, map/mount can be re-created without reboot
>   -> parallel krbd device usage with 99% io usage worked without a problem 
> while running the test
> - FAILED: kernel 4.15, ceph 12.2.11, 2TB ec-volume, xfs file system, 120s 
> device timeout
>   -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io 
> errors, map/mount can be re-created
>   -> parallel krbd device usage with 99% io usage worked without a problem 
> while running the test
> - FAILED: kernel 4.4, ceph 12.2.11, 2TB ec-volume, xfs file system, no timeout
>   -> failed after < 10 minutes
>   -> system runs in a high system load, system is almost unusable, unable to 
> shutdown the system, hard reset of vm necessary, manual exclusive lock 
> removal is necessary before remapping the device
> - FAILED: kernel 4.4, ceph 12.2.11, 2TB 3-replica-volume, xfs file system, 
> 120s device timeout
>   -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io 
> errors, map/mount can be re-created
>
> All device timeouts were set separately set by the nbd_set_ioctl tool because 
> luminous rbd-nbd does not provide the possibility to define timeouts.
>
> What's next? Is it a good idea to do a binary search between 12.2.12 and 12.2.5?
>
> From my point of view (without in-depth knowledge of rbd-nbd/librbd) my 
> assumption is that this problem might be caused by rbd-nbd code and not by 
> librbd.
> The probability that a bug like this survives undiscovered in librbd for such 
> a long time seems low to me :-)
>
> Regards
> Marc
>
> Am 29.07.19 um 22:25 schrieb Marc Schöchlin:
> > Hello Jason,
> >
> > i updated the ticket https://tracker.ceph.com/issues/40822
> >
> > Am 24.07.19 um 19:20 schrieb Jason Dillaman:
> >> On Wed, Jul 24, 2019 at 12:47 PM Marc Schöchlin  wrote:
> >>> Testing with a 10.2.5 librbd/rbd-nbd is currently not that easy for me, 
> >>> because the ceph apt source does not contain that version.
> >>> Do you know a package source?
> >> All the upstream packages should be available here [1], includi

Re: [ceph-users] RGW 4 MiB objects

2019-07-31 Thread Aleksey Gutikov

Hi Thomas,

We did some investigation a while ago and came up with several rules for how 
to configure rgw and osd for big files stored on an erasure-coded pool.

Hope it will be useful.
And if I have made any mistakes, please let me know.

S3 object saving pipeline:

- S3 object is divided into multipart shards by client.
- Rgw shards each multipart shard into rados objects of size 
rgw_obj_stripe_size.
- Primary osd stripes rados object into ec stripes of width == 
ec.k*profile.stripe_unit, ec code them and send units into secondary 
osds and write into object store (bluestore).

- Each subobject of rados object has size == (rados object size)/k.
- Then while writing into disk bluestore can divide rados subobject into 
extents of minimal size == bluestore_min_alloc_size_hdd.


The following rules can save some space and iops:

- rgw_multipart_min_part_size SHOULD be a multiple of rgw_obj_stripe_size
(the client can use a different, greater value)

- rgw_obj_stripe_size MUST be equal to rgw_max_chunk_size
- ec stripe == osd_pool_erasure_code_stripe_unit or profile.stripe_unit
- rgw_obj_stripe_size SHOULD be a multiple of profile.stripe_unit*ec.k
- bluestore_min_alloc_size_hdd MAY be equal to bluefs_alloc_size (to
avoid fragmentation)
- rgw_obj_stripe_size/ec.k SHOULD be a multiple of
bluestore_min_alloc_size_hdd

- bluestore_min_alloc_size_hdd MAY be a multiple of profile.stripe_unit

For example, if ec.k=5:

- rgw_multipart_min_part_size = rgw_obj_stripe_size = rgw_max_chunk_size
= 20M

- rados object size == 20M
- profile.stripe_unit = 256k
- rados subobject size == 4M (20M / 5), i.e. 16 ec stripe units
- bluestore_min_alloc_size_hdd = bluefs_alloc_size = 1M
- each rados subobject can be written as 4 extents, each containing 4 ec
stripe units
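
As a concrete illustration, this is roughly how those example values could be
expressed (a hedged sketch, not a tested configuration: the config section
names, the profile name "myprofile" and m=3 are assumptions, and option names
and defaults should be checked against your release):

# ceph.conf fragment
[client.rgw]
rgw_max_chunk_size = 20971520            # 20M
rgw_obj_stripe_size = 20971520           # 20M
rgw_multipart_min_part_size = 20971520   # 20M

[osd]
bluestore_min_alloc_size_hdd = 1048576   # 1M
bluefs_alloc_size = 1048576              # 1M

# EC profile with k=5 and a 256k stripe unit:
# ceph osd erasure-code-profile set myprofile k=5 m=3 stripe_unit=256K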




On 30.07.19 17:35, Thomas Bennett wrote:

Hi,

Does anyone out there use bigger than default values for 
rgw_max_chunk_size and rgw_obj_stripe_size?


I'm planning to set rgw_max_chunk_size and rgw_obj_stripe_size to
20MiB, as it suits our use case and from our testing we can't see any
obvious reason not to.


Is there any convincing experience suggesting that we should stick with 4 MiB?

Regards,
Tom

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




--

Best regards!
Aleksei Gutikov | Ceph storage engineer
synesis.ru | Minsk. BY
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Urgent Help Needed (regarding rbd cache)

2019-07-31 Thread Janne Johansson
Den ons 31 juli 2019 kl 06:55 skrev Muhammad Junaid :

> The question is about RBD Cache in write-back mode using KVM/libvirt. If
> we enable this, it uses local KVM Host's RAM as cache for VM's write
> requests. And KVM Host immediately responds to VM's OS that data has been
> written to disk (actually it is still not on the OSDs yet). Then how can it
> be power-failure safe?
>
It is not. Nothing is power-failure safe. However you design things, you
will always have the possibility of some long write being almost done when
the power goes away, and that write (and perhaps others in caches) will be
lost. Different filesystems handle losses well or badly; databases running on
those filesystems will have their I/O interrupted and not acknowledged,
which may be not-so-bad or very bad.

The write caches you have in the KVM guest and this KVM RBD cache will make
the guest I/Os faster, at the expense of a higher risk of losing data in a
power outage, but there will be some kind of rollback, some kind of
fsck/trash somewhere to clean up if a KVM host dies with guests running.
In 99% of the cases this is OK; the only things lost are "the last lines of
this log file" or "those 3 temp files for the program that ran", and in the
last 1% you need to pull out your backups, just as when the physical disks die.

If someone promises "power-failure safe" then I think they are misguided.
The chance of bad things happening may be super small, but it will just never
be 0%. Also, the guest VMs have the possibility to ask kvm-rbd to flush data
out, and as such take the "pain" of waiting for real completion only when it
is actually needed, so that other operations can go fast (and be slightly
less safe) while the I/O that needs harder guarantees can ask for a flush and
know when the data actually is on disk for real.
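
As a rough illustration from inside a guest (a hedged sketch; the path is just
an example, and a real application would call fsync()/fdatasync() from code
rather than shell out to dd):

# cached write: completes as soon as the page cache / RBD cache accepts it
dd if=/dev/zero of=/var/tmp/testfile bs=4k count=1000
# durable write: O_DSYNC makes every write wait for real completion
dd if=/dev/zero of=/var/tmp/testfile bs=4k count=1000 oflag=dsync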

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] reproducible rbd-nbd crashes

2019-07-31 Thread Marc Schöchlin
Hello Jason,

it seems that there is something wrong in the rbd-nbd implementation.
(added this information also at  https://tracker.ceph.com/issues/40822)

The problem does not seem to be related to kernel releases, filesystem types, or
the ceph and network setup.
Release 12.2.5 seems to work properly, and at least releases >= 12.2.10 seem to
have the described problem.

Last night an 18-hour testrun with the following procedure was successful:
-
#!/bin/bash
set -x
while true; do
   date
   # NUL-delimited pipeline; head needs -z (GNU coreutils) to count
   # NUL-terminated records, since find -print0 emits no newlines
   find /srv_ec -type f -name "*.MYD" -print0 | head -z -n 50 | xargs -0 -P 10 -n 2 gzip -v
   date
   find /srv_ec -type f -name "*.MYD.gz" -print0 | head -z -n 50 | xargs -0 -P 10 -n 2 gunzip -v
done
-
Previous tests crashed in a reproducible manner with "-P 1" (a single
gzip/gunzip process) after anywhere from a few minutes up to 45 minutes.

Overview of my tests:

- SUCCESSFUL: kernel 4.15, ceph 12.2.5, 1TB ec-volume, ext4 file system, 120s 
device timeout
  -> 18 hour testrun was successful, no dmesg output
- FAILED: kernel 4.4, ceph 12.2.11, 2TB ec-volume, xfs file system, 120s device 
timeout
  -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io errors, 
map/mount can be re-created without reboot
  -> parallel krbd device usage with 99% io usage worked without a problem 
while running the test
- FAILED: kernel 4.15, ceph 12.2.11, 2TB ec-volume, xfs file system, 120s 
device timeout
  -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io errors, 
map/mount can be re-created
  -> parallel krbd device usage with 99% io usage worked without a problem 
while running the test
- FAILED: kernel 4.4, ceph 12.2.11, 2TB ec-volume, xfs file system, no timeout
  -> failed after < 10 minutes
  -> the system runs under high load and is almost unusable; it cannot be shut
down, a hard reset of the vm is necessary, and manual exclusive lock removal is
necessary before remapping the device
- FAILED: kernel 4.4, ceph 12.2.11, 2TB 3-replica-volume, xfs file system, 120s 
device timeout
  -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io errors, 
map/mount can be re-created

All device timeouts were set separately via the nbd_set_ioctl tool, because the
luminous rbd-nbd does not provide a way to define timeouts.
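
For reference, the kernel interface such a tool drives looks roughly like this
(a hedged sketch of the NBD_SET_TIMEOUT ioctl from <linux/nbd.h>; the device
path is an example and I have not checked this against the nbd_set_ioctl
sources):

// set a 120s I/O timeout on an nbd device via NBD_SET_TIMEOUT
#include <fcntl.h>
#include <linux/nbd.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <cstdio>

int main() {
    int fd = open("/dev/nbd0", O_RDWR);           // example device node
    if (fd < 0) { perror("open"); return 1; }
    if (ioctl(fd, NBD_SET_TIMEOUT, 120UL) < 0) {  // timeout in seconds
        perror("ioctl(NBD_SET_TIMEOUT)");
        close(fd);
        return 1;
    }
    close(fd);
    return 0;
}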

What's next? Is it a good idea to do a binary search between 12.2.5 and 12.2.12?

From my point of view (without in-depth knowledge of rbd-nbd/librbd), my
assumption is that this problem might be caused by the rbd-nbd code and not by
librbd.
The probability that a bug like this survives undiscovered in librbd for such a
long time seems low to me :-)

Regards
Marc

On 29.07.19 at 22:25, Marc Schöchlin wrote:
> Hello Jason,
>
> i updated the ticket https://tracker.ceph.com/issues/40822
>
> On 24.07.19 at 19:20, Jason Dillaman wrote:
>> On Wed, Jul 24, 2019 at 12:47 PM Marc Schöchlin  wrote:
>>> Testing with a 12.2.5 librbd/rbd-nbd is currently not that easy for me,
>>> because the ceph apt source does not contain that version.
>>> Do you know a package source?
>> All the upstream packages should be available here [1], including 12.2.5.
> Ah okay, i will test this tommorow.
>> Did you pull the OSD blocked ops stats to figure out what is going on
>> with the OSDs?
> Yes, see referenced data in the ticket 
> https://tracker.ceph.com/issues/40822#note-15
>
> Regards
> Marc
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] cephfs quota setfattr permission denied

2019-07-31 Thread Mattia Belluco
Dear ceph users,

We have recently been trying to use the two quota attributes:

- ceph.quota.max_files
- ceph.quota.max_bytes

to prepare for quota enforcing.

While the idea is quite straightforward, we found out that we cannot set any
additional file attributes (we tried with directory pinning, too): we get a
straight "permission denied" each time.

We are running the latest version of Mimic on a cluster used exclusively
for cephfs.

This is the output of a "ceph fs dump"

Filesystem 'spindlefs' (4)
fs_name spindlefs
epoch   18062
flags   12
created 2019-02-21 17:53:48.240659
modified        2019-07-30 16:52:13.141688
tableserver 0
root0
session_timeout 60
session_autoclose   300
max_file_size   1099511627776
min_compat_client   -1 (unspecified)
last_failure0
last_failure_osd_epoch  31203
compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable
ranges,3=default file layouts on dirs,4=dir inode in separate
object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no
anchor table,9=file layout v2,10=snaprealm v2}
max_mds 1
in  0
up  {0=10744488}
failed  
damaged 
stopped 
data_pools  [11]
metadata_pool   10
inline_data disabled
balancer
standby_count_wanted    1
10744488:   10.129.48.46:6800/549232138 'mds-l15-34' mds.0.18058 up:active
seq 49

and the command we are trying to use:

setfattr -n ceph.quota.max_bytes -v 1 /user/dir
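
(For completeness, the matching read-back and the max_files variant, as a
hedged sketch reusing the same example path:)

getfattr -n ceph.quota.max_bytes /user/dir
setfattr -n ceph.quota.max_files -v 10000 /user/dir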

Everything should be pretty much at its default, and we could find no reference
online regarding which settings would allow the setfattr to succeed.

Any hint is welcome.

Kind regards
mattia
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] details about cloning objects using librados

2019-07-31 Thread nokia ceph
Hi Greg,

We were trying to implement this, however we are having issues assigning the
destination object name with this api.
There is a rados command "rados -p <pool> cp <src-obj> <dest-obj>"; is there
any librados api equivalent to this?

Thanks,
Muthu
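
(For what it's worth, this is the shape of what we are trying, as a minimal
sketch. It assumes a librados build that exposes
ObjectWriteOperation::copy_from; the exact overload and its availability are
assumptions that should be checked against the librados.hpp of your release:)

#include <rados/librados.hpp>

// hedged sketch: OSD-side copy of object "src" to object "dst" within the
// same pool; the object data does not travel through the client
int copy_object(librados::IoCtx& io, const std::string& src, const std::string& dst)
{
    librados::ObjectWriteOperation op;
    // src_version 0, no fadvise flags: copy whatever content is current
    op.copy_from(src, io, 0, 0);
    return io.operate(dst, &op);
}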

On Fri, Jul 5, 2019 at 4:00 PM nokia ceph  wrote:

> Thank you Greg, we will try this out .
>
> Thanks,
> Muthu
>
> On Wed, Jul 3, 2019 at 11:12 PM Gregory Farnum  wrote:
>
>> Well, the RADOS interface doesn't have a great deal of documentation
>> so I don't know if I can point you at much.
>>
>> But if you look at Objecter.h, you see that the ObjectOperation has
>> this function:
>> void copy_from(object_t src, snapid_t snapid, object_locator_t
>> src_oloc, version_t src_version, unsigned flags, unsigned
>> src_fadvise_flags)
>>
>> src: the object to copy from
>> snapid: if you want to copy a specific snap instead of HEAD
>> src_oloc: the object locator for the object
>> src_version: the version of the object to copy from (helps identify if
>> it was updated in the meantime)
>> flags: probably don't want to set these, but see
>> PrimaryLogPG::_copy_some for the choices
>> src_fadvise_flags: these are the fadvise flags we have in various
>> places that let you specify things like not to cache the data.
>> Probably leave them unset.
>>
>> -Greg
>>
>>
>>
>> On Wed, Jul 3, 2019 at 2:47 AM nokia ceph 
>> wrote:
>> >
>> > Hi Greg,
>> >
>> > Can you please share the api details  for COPY_FROM or any reference
>> document?
>> >
>> > Thanks ,
>> > Muthu
>> >
>> > On Wed, Jul 3, 2019 at 4:12 AM Brad Hubbard 
>> wrote:
>> >>
>> >> On Wed, Jul 3, 2019 at 4:25 AM Gregory Farnum 
>> wrote:
>> >> >
>> >> > I'm not sure how or why you'd get an object class involved in doing
>> >> > this in the normal course of affairs.
>> >> >
>> >> > There's a copy_from op that a client can send and which copies an
>> >> > object from another OSD into the target object. That's probably the
>> >> > primitive you want to build on. Note that the OSD doesn't do much
>> >>
>> >> Argh! yes, good idea. We really should document that!
>> >>
>> >> > consistency checking (it validates that the object version matches an
>> >> > input, but if they don't it just returns an error) so the client
>> >> > application is responsible for any locking needed.
>> >> > -Greg
>> >> >
>> >> > On Tue, Jul 2, 2019 at 3:49 AM Brad Hubbard 
>> wrote:
>> >> > >
>> >> > > Yes, this should be possible using an object class which is also a
>> >> > > RADOS client (via the RADOS API). You'll still have some client
>> >> > > traffic as the machine running the object class will still need to
>> >> > > connect to the relevant primary osd and send the write (presumably
>> in
>> >> > > some situations though this will be the same machine).
>> >> > >
>> >> > > On Tue, Jul 2, 2019 at 4:08 PM nokia ceph <
>> nokiacephus...@gmail.com> wrote:
>> >> > > >
>> >> > > > Hi Brett,
>> >> > > >
>> >> > > > I think I was wrong here in the requirement description. It is
>> not about data replication; we need the same content stored under different
>> object names.
>> >> > > > We store video content inside the ceph cluster. And our new
>> requirement is that we need to store the same content for different users,
>> hence we need the same content under different object names. If a client
>> sends a write request for object x and sets the number of copies to 100, then
>> the cluster has to clone 100 copies of object x and store them as object x1,
>> object x2, etc. Currently this is done on the client side, where object x1,
>> object x2 ... object x100 are cloned inside the client and a write request is
>> sent for all 100 objects, which we want to avoid in order to reduce network
>> consumption.
>> >> > > >
>> >> > > > Similar usecases are rbd snapshot , radosgw copy .
>> >> > > >
>> >> > > > Is this possible in object class ?
>> >> > > >
>> >> > > > thanks,
>> >> > > > Muthu
>> >> > > >
>> >> > > >
>> >> > > > On Mon, Jul 1, 2019 at 7:58 PM Brett Chancellor <
>> bchancel...@salesforce.com> wrote:
>> >> > > >>
>> >> > > >> Ceph already does this by default. For each replicated pool,
>> you can set the 'size' which is the number of copies you want Ceph to
>> maintain. The accepted norm for replicas is 3, but you can set it higher if
>> you want to incur the performance penalty.
>> >> > > >>
>> >> > > >> On Mon, Jul 1, 2019, 6:01 AM nokia ceph <
>> nokiacephus...@gmail.com> wrote:
>> >> > > >>>
>> >> > > >>> Hi Brad,
>> >> > > >>>
>> >> > > >>> Thank you for your response , and we will check this video as
>> well.
>> >> > > >>> Our requirement is that, while writing an object into the cluster,
>> if we can provide the number of copies to be made, the network consumption
>> between client and cluster will be only for one object write. The cluster
>> will then clone/copy multiple objects and store them inside the cluster.
>> >> > > >>>
>> >> > > >>> Thanks,
>> >> > > >>> Muthu
>> >> > > >>>
>> >> > > >>> On Fri, Jun 28, 2019 at 9:23 AM Brad Hubbard <
>> bhubb...@redhat.com> wrote:
>> >> > > 
>> >> > >  On Thu, Jun 27, 2019 at 8:58 P