Re: [ceph-users] Balancer in HEALTH_ERR

2019-08-01 Thread Konstantin Shalygin

Two weeks ago, we started a data migration from one old ceph node to a new
one.

For this task we added a 120 TB host to the cluster and evacuated the old one with
ceph osd crush reweight osd.X 0.0, which moves roughly 15 TB per day.

  


After a week and a few days we found that the balancer module doesn't work well
in this situation: it doesn't redistribute data between OSDs if the cluster is not
in HEALTH_OK status.

  


The current situation: some OSDs are at 96% and others at 75%, causing some
pools to get very nearfull (99%).

  


I read several posts saying the balancer only works when the cluster is HEALTHY,
and that's the problem: Ceph doesn't distribute data equally between OSDs on its
own, which causes huge problems in this "Evacuate + Add" scenario.

  


Info: https://pastebin.com/HuEt5Ukn

  


Right now, as a workaround, we are manually changing the weight of the most used OSDs.

  


Has anyone else hit this problem?



You can determine your biggest pools like this:


```

(header="pool objects bytes_used max_avail"; echo "$header"; echo 
"$header" | tr '[[:alpha:]_]' '-'; ceph df --format=json | jq 
'.pools[]|(.name,.stats.objects,.stats.bytes_used,.stats.max_avail)' | 
paste - - - -) | column -t


```


Then you can select your PGs for this pool:


```

(header="pg_id pg_used_mbytes pg_used_objects" ; echo "$header" ; echo 
"$header" | tr '[[:alpha:]_]' '-' ; ceph pg ls-by-pool  
--format=json | jq 'sort_by(.stat_sum.num_bytes) | .[] | (.pgid, 
.stat_sum.num_bytes/1024/1024, .stat_sum.num_objects)' | paste - - -) | 
column -t


```


And then upmap your biggest PGs to less-filled OSDs.

Or, another way: list the PGs of your already-nearfull OSDs with `ceph 
pg ls-by-osd osd.0` and upmap them from that OSD to less-filled OSDs.
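
A minimal sketch of the upmap step (the PG id and OSD numbers are placeholders;
this assumes the cluster can already require luminous-or-newer clients):

```
# allow upmap (all clients must be luminous or newer)
ceph osd set-require-min-compat-client luminous

# move PG 11.2f off the overfull osd.7 onto the less-filled osd.23
# (arguments are <pgid> followed by <from-osd> <to-osd> pairs)
ceph osd pg-upmap-items 11.2f 7 23

# verify the mapping, and drop it later if it is no longer needed
ceph osd dump | grep pg_upmap_items
ceph osd rm-pg-upmap-items 11.2f
```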




gl,

k


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Adventures with large RGW buckets

2019-08-01 Thread EDH - Manuel Rios Fernandez
Hi Greg / Eric,

What about allowing bucket objects to be deleted with a lifecycle policy?

You can set an object lifetime of 1 day; that task is done at the cluster level. 
Then delete the objects younger than 1 day and remove the bucket.

That sometimes speeds up deletes, since the work is done by the rgw daemons.

It would effectively be a background delete option, because deleting a bucket with 
millions of objects takes weeks.
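
A minimal sketch of what that could look like with the AWS CLI against an RGW
endpoint (bucket name and endpoint URL are placeholders, not from this thread):

```
# write a lifecycle rule that expires every object after 1 day
cat > expire.json <<'EOF'
{
  "Rules": [
    { "ID": "expire-everything",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "Expiration": { "Days": 1 } }
  ]
}
EOF

aws --endpoint-url http://rgw.example.com:7480 \
    s3api put-bucket-lifecycle-configuration \
    --bucket big-bucket \
    --lifecycle-configuration file://expire.json

# lifecycle processing is done by the rgw daemons in the background;
# progress can be checked with:
radosgw-admin lc list
```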

Regards





-----Original Message-----
From: ceph-users  On Behalf Of Gregory Farnum
Sent: Thursday, August 1, 2019 22:48
To: Eric Ivancich 
Cc: Ceph Users ; d...@ceph.io
Subject: Re: [ceph-users] Adventures with large RGW buckets

On Thu, Aug 1, 2019 at 12:06 PM Eric Ivancich  wrote:
>
> Hi Paul,
>
> I’ll interleave responses below.
>
> On Jul 31, 2019, at 2:02 PM, Paul Emmerich  wrote:
>
> How could the bucket deletion of the future look like? Would it be 
> possible to put all objects in buckets into RADOS namespaces and 
> implement some kind of efficient namespace deletion on the OSD level 
> similar to how pool deletions are handled at a lower level?
>
> I’ll raise that with other RGW developers. I’m unfamiliar with how RADOS 
> namespaces are handled.

I expect RGW could do this, but unfortunately deleting namespaces at the RADOS 
level is not practical. People keep asking and maybe in some future world it 
will be cheaper, but a namespace is effectively just part of the object name 
(and I don't think it's even the first thing they sort by for the key entries 
in metadata tracking!), so deleting a namespace would be equivalent to deleting 
a snapshot[1] but with the extra cost that namespaces can be created 
arbitrarily on every write operation (so our solutions for handling snapshots 
without it being ludicrously expensive wouldn't apply). Deleting a namespace 
from the OSD-side using map updates would require the OSD to iterate through 
just about all the objects they have and examine them for deletion.

Is it cheaper than doing it over the network? Sure. Is it cheap enough we're 
willing to let a single user request generate that kind of cluster IO on an 
unconstrained interface? Absolutely not.
-Greg
[1]: Deleting snapshots is only feasible because every OSD maintains a sorted 
secondary index from snapid to the set of objects in that snapshot. This is only possible because 
snapids are issued by the monitors and clients cooperate in making sure they 
can't get reused after being deleted.
Namespaces are generated by clients and there are no constraints on their use, 
reuse, or relationship to each other. We could maybe work around these 
problems, but it'd be building a fundamentally different interface than what 
namespaces currently are.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Adventures with large RGW buckets

2019-08-01 Thread Gregory Farnum
On Thu, Aug 1, 2019 at 12:06 PM Eric Ivancich  wrote:
>
> Hi Paul,
>
> I’ll interleave responses below.
>
> On Jul 31, 2019, at 2:02 PM, Paul Emmerich  wrote:
>
> How could the bucket deletion of the future look like? Would it be possible
> to put all objects in buckets into RADOS namespaces and implement some kind
> of efficient namespace deletion on the OSD level similar to how pool deletions
> are handled at a lower level?
>
> I’ll raise that with other RGW developers. I’m unfamiliar with how RADOS 
> namespaces are handled.

I expect RGW could do this, but unfortunately deleting namespaces at
the RADOS level is not practical. People keep asking and maybe in some
future world it will be cheaper, but a namespace is effectively just
part of the object name (and I don't think it's even the first thing
they sort by for the key entries in metadata tracking!), so deleting a
namespace would be equivalent to deleting a snapshot[1] but with the
extra cost that namespaces can be created arbitrarily on every write
operation (so our solutions for handling snapshots without it being
ludicrously expensive wouldn't apply). Deleting a namespace from the
OSD-side using map updates would require the OSD to iterate through
just about all the objects they have and examine them for deletion.

Is it cheaper than doing it over the network? Sure. Is it cheap enough
we're willing to let a single user request generate that kind of
cluster IO on an unconstrained interface? Absolutely not.
-Greg
[1]: Deleting snapshots is only feasible because every OSD maintains a
sorted secondary index from snapid to the set of objects in that snapshot. This is only
possible because snapids are issued by the monitors and clients
cooperate in making sure they can't get reused after being deleted.
Namespaces are generated by clients and there are no constraints on
their use, reuse, or relationship to each other. We could maybe work
around these problems, but it'd be building a fundamentally different
interface than what namespaces currently are.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Balancer in HEALTH_ERR

2019-08-01 Thread EDH - Manuel Rios Fernandez
Hi Eric,

 

CEPH006 is the node that we're evacuating; for that task we added CEPH005.

 

Thanks 

 

From: Smith, Eric 
Sent: Thursday, August 1, 2019 20:12
To: EDH - Manuel Rios Fernandez ; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Balancer in HEALTH_ERR

 

>From your pastebin data – it appears you need to change the crush weight of 
>the OSDs on CEPH006? They all have crush weight of 0, when other OSDs seem to 
>have a crush weight of 10.91309. You might look into the ceph osd crush 
>reweight-subtree command.

 

Eric

 

From: ceph-users <ceph-users-boun...@lists.ceph.com> on behalf of EDH - Manuel Rios 
Fernandez <mrios...@easydatahost.com>
Date: Thursday, August 1, 2019 at 1:52 PM
To: "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
Subject: [ceph-users] Balancer in HEALTH_ERR
Subject: [ceph-users] Balancer in HEALTH_ERR

 

Hi ,

 

Two weeks ago, we started a data migration from one old ceph node to a new one.

For this task we added a 120 TB host to the cluster and evacuated the old one with 
ceph osd crush reweight osd.X 0.0, which moves roughly 15 TB per day.

 

After a week and a few days we found that the balancer module doesn't work well under 
this situation: it doesn't redistribute data between OSDs if the cluster is not in 
HEALTH_OK status.

 

The current situation: some OSDs are at 96% and others at 75%, causing some 
pools to get very nearfull (99%).

 

I read several posts saying the balancer only works when the cluster is HEALTHY, 
and that's the problem: Ceph doesn't distribute data equally between OSDs on its 
own, which causes huge problems in this "Evacuate + Add" scenario.

 

Info: https://pastebin.com/HuEt5Ukn

 

Right now, as a workaround, we are manually changing the weight of the most used OSDs.

 

Has anyone else hit this problem?

 

Regards

 

Manuel

 

 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] details about cloning objects using librados

2019-08-01 Thread Gregory Farnum
On Wed, Jul 31, 2019 at 10:31 PM nokia ceph  wrote:
>
> Thank you Greg,
>
> Another question: we need to give a new destination object, so that we can 
> read it separately, in parallel with the src object. This function resides in 
> Objecter.h, which seems to be internal; can it be used at the interface level, 
> and can we use it in our client? Currently we use librados.h in our client 
> to communicate with the ceph cluster.

copy_from is an ObjectOperation and is exposed via the librados C++ API
like all the others. It may not be in the simple
(, , ) interfaces. It may also not be
in the C API?

> Also any equivalent librados api for the command rados -p poolname  object> 

It's using the copy_from command we're discussing here. You can look
at the source as an example:
https://github.com/ceph/ceph/blob/master/src/tools/rados/rados.cc#L497
-Greg
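
(For reference, the CLI invocation in question looks roughly like this; pool and
object names are placeholders:)

```
# server-side copy: dst-object is written from src-object inside the cluster,
# using the same copy_from machinery discussed above
rados -p mypool cp src-object dst-object
```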

>
> Thanks,
> Muthu
>
> On Wed, Jul 31, 2019 at 11:13 PM Gregory Farnum  wrote:
>>
>>
>>
>> On Wed, Jul 31, 2019 at 1:32 AM nokia ceph  wrote:
>>>
>>> Hi Greg,
>>>
>>> We were trying to implement this however having issues in assigning the 
>>> destination object name with this api.
>>> There is a rados command "rados -p  cp  " , is 
>>> there any librados api equivalent to this ?
>>
>>
>> The copyfrom operation, like all other ops, is directed to a specific 
>> object. The object you run it on is the destination; it copies the specified 
>> “src” object into itself.
>> -Greg
>>
>>>
>>> Thanks,
>>> Muthu
>>>
>>> On Fri, Jul 5, 2019 at 4:00 PM nokia ceph  wrote:

 Thank you Greg, we will try this out .

 Thanks,
 Muthu

 On Wed, Jul 3, 2019 at 11:12 PM Gregory Farnum  wrote:
>
> Well, the RADOS interface doesn't have a great deal of documentation
> so I don't know if I can point you at much.
>
> But if you look at Objecter.h, you see that the ObjectOperation has
> this function:
> void copy_from(object_t src, snapid_t snapid, object_locator_t
> src_oloc, version_t src_version, unsigned flags, unsigned
> src_fadvise_flags)
>
> src: the object to copy from
> snapid: if you want to copy a specific snap instead of HEAD
> src_oloc: the object locator for the object
> src_version: the version of the object to copy from (helps identify if
> it was updated in the meantime)
> flags: probably don't want to set these, but see
> PrimaryLogPG::_copy_some for the choices
> src_fadvise_flags: these are the fadvise flags we have in various
> places that let you specify things like not to cache the data.
> Probably leave them unset.
>
> -Greg
>
>
>
> On Wed, Jul 3, 2019 at 2:47 AM nokia ceph  
> wrote:
> >
> > Hi Greg,
> >
> > Can you please share the api details  for COPY_FROM or any reference 
> > document?
> >
> > Thanks ,
> > Muthu
> >
> > On Wed, Jul 3, 2019 at 4:12 AM Brad Hubbard  wrote:
> >>
> >> On Wed, Jul 3, 2019 at 4:25 AM Gregory Farnum  
> >> wrote:
> >> >
> >> > I'm not sure how or why you'd get an object class involved in doing
> >> > this in the normal course of affairs.
> >> >
> >> > There's a copy_from op that a client can send and which copies an
> >> > object from another OSD into the target object. That's probably the
> >> > primitive you want to build on. Note that the OSD doesn't do much
> >>
> >> Argh! yes, good idea. We really should document that!
> >>
> >> > consistency checking (it validates that the object version matches an
> >> > input, but if they don't it just returns an error) so the client
> >> > application is responsible for any locking needed.
> >> > -Greg
> >> >
> >> > On Tue, Jul 2, 2019 at 3:49 AM Brad Hubbard  
> >> > wrote:
> >> > >
> >> > > Yes, this should be possible using an object class which is also a
> >> > > RADOS client (via the RADOS API). You'll still have some client
> >> > > traffic as the machine running the object class will still need to
> >> > > connect to the relevant primary osd and send the write (presumably 
> >> > > in
> >> > > some situations though this will be the same machine).
> >> > >
> >> > > On Tue, Jul 2, 2019 at 4:08 PM nokia ceph 
> >> > >  wrote:
> >> > > >
> >> > > > Hi Brett,
> >> > > >
> >> > > > I think I was wrong here in the requirement description. It is 
> >> > > > not about data replication , we need same content stored in 
> >> > > > different object/name.
> >> > > > We store video contents inside the ceph cluster. And our new 
> >> > > > requirement is we need to store same content for different users 
> >> > > > , hence need same content in different object name . if client 
> >> > > > sends write request for object x and sets number of copies as 
> >> > > > 100, then cluster has to clone 100 copies of object x and store 
> >> > > > it as object x1, 

Re: [ceph-users] Adventures with large RGW buckets

2019-08-01 Thread Eric Ivancich
Hi Paul,

I’ve turned the following idea of yours into a tracker:

https://tracker.ceph.com/issues/41051 


> 4. Common prefixes could filtered in the rgw class on the OSD instead
> of in radosgw
> 
> Consider a bucket with 100 folders with 1000 objects in each and only one 
> shard
> 
> /p1/1, /p1/2, ..., /p1/1000, /p2/1, /p2/2, ..., /p2/1000, ... /p100/1000
> 
> 
> Now a user wants to list / with aggregating common prefixes on the
> delimiter / and
> wants up to 1000 results.
> So there'll be 100 results returned to the client: the common prefixes
> p1 to p100.
> 
> How much data will be transfered between the OSDs and radosgw for this 
> request?
> How many omap entries does the OSD scan?
> 
> radosgw will ask the (single) index object to list the first 1000 objects. 
> It'll
> return 1000 objects in a quite unhelpful way: /p1/1, /p1/2, ..., /p1/1000
> 
> radosgw will discard 999 of these and detect one common prefix and continue 
> the
> iteration at /p1/\xFF to skip the remaining entries in /p1/ if there are any.
> The OSD will then return everything in /p2/ in that next request and so on.
> 
> So it'll internally list every single object in that bucket. That will
> be a problem
> for large buckets and having lots of shards doesn't help either.
> 
> 
> This shouldn't be too hard to fix: add an option "aggregate prefixes" to the 
> RGW
> class method and duplicate the fast-forward logic from radosgw in
> cls_rgw. It doesn't
> even need to change the response type or anything, it just needs to
> limit entries in
> common prefixes to one result.
> Is this a good idea or am I missing something?
> 
> IO would be reduced by a factor of 100 for that particular
> pathological case. I've
> unfortunately seen a real-world setup that I think hits a case like that.

Eric

--
J. Eric Ivancich
he/him/his
Red Hat Storage
Ann Arbor, Michigan, USA
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Adventures with large RGW buckets

2019-08-01 Thread Eric Ivancich
Hi Paul,

I’ll interleave responses below.

> On Jul 31, 2019, at 2:02 PM, Paul Emmerich  wrote:
> 
> we are seeing a trend towards rather large RGW S3 buckets lately.
> we've worked on
> several clusters with 100 - 500 million objects in a single bucket, and we've
> been asked about the possibilities of buckets with several billion objects 
> more
> than once.
> 
> From our experience: buckets with tens of million objects work just fine with
> no big problems usually. Buckets with hundreds of million objects require some
> attention. Buckets with billions of objects? "How about indexless buckets?" -
> "No, we need to list them".
> 
> 
> A few stories and some questions:
> 
> 
> 1. The recommended number of objects per shard is 100k. Why? How was this
> default configuration derived?
> 
> It doesn't really match my experiences. We know a few clusters running with
> larger shards because resharding isn't possible for various reasons at the
> moment. They sometimes work better than buckets with lots of shards.
> 
> So we've been considering to at least double that 100k target shard size
> for large buckets, that would make the following point far less annoying.

I believe the 100,000 objects per shard was done with a little bit of 
experience and some back-of-the-envelope calculations. Please keep us updated 
as to what you find for 200,000 objects per shard.

> 2. Many shards + ordered object listing = lots of IO
> 
> Unfortunately telling people to not use ordered listings when they don't 
> really
> need them doesn't really work as their software usually just doesn't support
> that :(

We are exploring sharding schemes that maintain ordering, and that would really 
help here.

> A listing request for X objects will retrieve up to X objects from each shard
> for ordering them. That will lead to quite a lot of traffic between the OSDs
> and the radosgw instances, even for relatively innocent simple queries as X
> defaults to 1000 usually.

What you say is correct. And it gets worse, because we have to go through all 
the returned lists and select the, say, 1000 earliest to return. And then we 
throw the rest away.

> Simple example: just getting the first page of a bucket listing with 4096
> shards fetches around 1 GB of data from the OSD to return ~300kb or so to the
> S3 client.

Correct.

> I've got two clusters here that are only used for some relatively 
> low-bandwidth
> backup use case here. However, there are a few buckets with hundreds of 
> millions
> of objects that are sometimes being listed by the backup system.
> 
> The result is that this cluster has an average read IO of 1-2 GB/s, all going
> to the index pool. Not a big deal since that's coming from SSDs and goes over
> 80 Gbit/s LACP bonds. But it does pose the question about scalability
> as the user-
> visible load created by the S3 clients is quite low.
> 
> 
> 
> 3. Deleting large buckets
> 
> Someone accidentally put 450 million small objects into a bucket and only 
> noticed
> when the cluster ran full. The bucket isn't needed, so just delete it and case
> closed?
> 
> Deleting is unfortunately far slower than adding objects, also
> radosgw-admin leaks
> memory during deletion: https://tracker.ceph.com/issues/40700
> 
> Increasing --max-concurrent-ios helps with deletion speed (option does effect
> deletion concurrency, documentation says it's only for other specific 
> commands).
> 
> Since the deletion is going faster than new data is being added to that 
> cluster
> the "solution" was to run the deletion command in a memory-limited cgroup and
> restart it automatically after it gets killed due to leaking.

That tracker is being investigated.

> How could the bucket deletion of the future look like? Would it be possible
> to put all objects in buckets into RADOS namespaces and implement some kind
> of efficient namespace deletion on the OSD level similar to how pool deletions
> are handled at a lower level?

I’ll raise that with other RGW developers. I’m unfamiliar with how RADOS 
namespaces are handled.

> 4. Common prefixes could filtered in the rgw class on the OSD instead
> of in radosgw
> 
> Consider a bucket with 100 folders with 1000 objects in each and only one 
> shard
> 
> /p1/1, /p1/2, ..., /p1/1000, /p2/1, /p2/2, ..., /p2/1000, ... /p100/1000
> 
> 
> Now a user wants to list / with aggregating common prefixes on the
> delimiter / and
> wants up to 1000 results.
> So there'll be 100 results returned to the client: the common prefixes
> p1 to p100.
> 
> How much data will be transfered between the OSDs and radosgw for this 
> request?
> How many omap entries does the OSD scan?
> 
> radosgw will ask the (single) index object to list the first 1000 objects. 
> It'll
> return 1000 objects in a quite unhelpful way: /p1/1, /p1/2, ..., /p1/1000
> 
> radosgw will discard 999 of these and detect one common prefix and continue 
> the
> iteration at /p1/\xFF to skip the remaining entries in /p1/ if there are any.
> The OSD will 

Re: [ceph-users] Balancer in HEALTH_ERR

2019-08-01 Thread Smith, Eric
From your pastebin data – it appears you need to change the crush weight of the 
OSDs on CEPH006? They all have crush weight of 0, when other OSDs seem to have 
a crush weight of 10.91309. You might look into the ceph osd crush 
reweight-subtree command.
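
(A hedged sketch of that command, using the host bucket name and weight from the
pastebin; only applicable if CEPH006 is actually meant to keep receiving data:)

```
# restore a non-zero crush weight for every OSD under the CEPH006 host bucket
ceph osd crush reweight-subtree CEPH006 10.91309
```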

Eric

From: ceph-users  on behalf of EDH - Manuel 
Rios Fernandez 
Date: Thursday, August 1, 2019 at 1:52 PM
To: "ceph-users@lists.ceph.com" 
Subject: [ceph-users] Balancer in HEALTH_ERR

Hi ,

Two weeks ago, we started a data migration from one old ceph node to a new one.
For this task we added a 120 TB host to the cluster and evacuated the old one with 
ceph osd crush reweight osd.X 0.0, which moves roughly 15 TB per day.

After a week and a few days we found that the balancer module doesn't work well under 
this situation: it doesn't redistribute data between OSDs if the cluster is not in 
HEALTH_OK status.

The current situation: some OSDs are at 96% and others at 75%, causing some 
pools to get very nearfull (99%).

I read several posts saying the balancer only works when the cluster is HEALTHY, 
and that's the problem: Ceph doesn't distribute data equally between OSDs on its 
own, which causes huge problems in this "Evacuate + Add" scenario.

Info: https://pastebin.com/HuEt5Ukn

Right now, as a workaround, we are manually changing the weight of the most used OSDs.

Has anyone else hit this problem?

Regards

Manuel


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Balancer in HEALTH_ERR

2019-08-01 Thread EDH - Manuel Rios Fernandez
Hi ,

 

Two weeks ago, we started a data migration from one old ceph node to a new
one.

For this task we added a 120 TB host to the cluster and evacuated the old one with
ceph osd crush reweight osd.X 0.0, which moves roughly 15 TB per day.

 

After a week and a few days we found that the balancer module doesn't work well
in this situation: it doesn't redistribute data between OSDs if the cluster is not
in HEALTH_OK status.

 

The current situation: some OSDs are at 96% and others at 75%, causing some
pools to get very nearfull (99%).

 

I read several posts saying the balancer only works when the cluster is HEALTHY,
and that's the problem: Ceph doesn't distribute data equally between OSDs on its
own, which causes huge problems in this "Evacuate + Add" scenario.

 

Info: https://pastebin.com/HuEt5Ukn

 

Right now, as a workaround, we are manually changing the weight of the most used OSDs.

 

Has anyone else hit this problem?

 

Regards

 

Manuel

 

 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph nfs ganesha exports

2019-08-01 Thread Lee Norvall
Hi Jeff

Thanks for the pointer on this.  I found some details on this the other day and 
your link is a big help.  I will get this updated in my ansible playbook and 
test.

Rgds

Lee

On 01/08/2019 17:03, Jeff Layton wrote:

On Sun, 2019-07-28 at 18:20 +, Lee Norvall wrote:


Update to this I found that you cannot create a 2nd files system as yet and it 
is still experimental.  So I went down this route:

Added a pool to the existing cephfs and then setfattr -n ceph.dir.layout.pool 
-v SSD-NFS /mnt/cephfs/ssdnfs/ from a ceph-fuse client.
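
(Roughly, the steps sketched above look like this; pool name, PG count and mount
path are placeholders matching the description:)

```
# create the extra pool and attach it to the existing file system
ceph osd pool create SSD-NFS 64
ceph fs add_data_pool cephfs SSD-NFS     # the pool must be added to the fs first

# pin a directory's file layout to the new pool (run on a ceph-fuse client)
mkdir /mnt/cephfs/ssdnfs
setfattr -n ceph.dir.layout.pool -v SSD-NFS /mnt/cephfs/ssdnfs
```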

I then nfs mounted from another box. I can see the files and dir etc from the 
nfs client but my issue now is that I do not have permission to write, create 
dir etc.  The same goes for the default setup after running the ansible 
playbook even when setting export to no_root_squash.  Am I missing a chain of 
permissions?  ganesha-nfs is using the admin userid; is this the same as 
client.admin, or is this a user I need to create?  Any info appreciated.

Ceph is on CentOS 7 and SELinux is currently off as well.

Copy of the ganesha conf below.  Is secType correct or is it missing something?

RADOS_URLS {
   ceph_conf = '/etc/ceph/ceph.conf';
   userid = "admin";
}
%url rados://cephfs_data/ganesha-export-index

NFSv4 {
RecoveryBackend = 'rados_kv';
}



In your earlier email, you mentioned that you had more than one NFS
server, but rados_kv is not safe in a multi-server configuration. The
servers will be competing to store recovery information in the same
objects, and won't honor each other's grace periods.

You may want to explore using "RecoveryBackend = rados_cluster" instead,
which should handle that situation better. See this writeup, for some
guidelines:


https://jtlayton.wordpress.com/2018/12/10/deploying-an-active-active-nfs-cluster-over-cephfs/

Much of this is already automated too if you use k8s+rook.



RADOS_KV {
ceph_conf = '/etc/ceph/ceph.conf';
userid = "admin";
pool = "cephfs_data";
}

EXPORT
{
Export_id=20133;
Path = "/";
Pseudo = /cephfile;
Access_Type = RW;
Protocols = 3,4;
Transports = TCP;
SecType = sys,krb5,krb5i,krb5p;
Squash = Root_Squash;
Attr_Expiration_Time = 0;

FSAL {
Name = CEPH;
User_Id = "admin";
}


}










On 28/07/2019 12:11, Lee Norvall wrote:


Hi

I am using ceph-ansible to deploy and just looking for best way/tips on
how to export multiple pools/fs.

Ceph: nautilus (14.2.2)
NFS-Ganesha v 2.8
ceph-ansible stable 4.0

I have 3 x osd/NFS gateways running and NFS on the dashboard can see
them in the cluster.  I have managed to export for cephfs / and mounted
it on another box.

1) can I add a new pool/fs to the export under that same NFS gateway
cluster, or

2) do I have to do something like add a new pool to the fs and then
setfattr to make the layout /newfs_dir point to /new_pool?  does this
cause issues and false object count?

3) any other better ways...

Rgds

Lee

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



--


Lee Norvall | CEO / Founder
Mob. +44 (0)7768 201884
Tel. +44 (0)20 3026 8930
Web. www.blocz.io

Enterprise Cloud | Private Cloud | Hybrid/Multi Cloud | Cloud Backup



This e-mail (and any attachment) has been sent from a PC belonging to My Mind 
(Holdings) Limited. If you receive it in error, please tell us by return and 
then delete it from your system; you may not rely on its contents nor 
copy/disclose it to anyone. Opinions, conclusions and statements of intent in 
this e-mail are those of the sender and will not bind My Mind (Holdings) 
Limited unless confirmed by an authorised representative independently of this 
message. We do not accept responsibility for viruses; you must scan for these. 
Please note that e-mails sent to and from blocz IO Limited are routinely 
monitored for record keeping, quality control and training purposes, to ensure 
regulatory compliance and to prevent viruses and unauthorised use of our 
computer systems. My Mind (Holdings) Limited is registered in England & Wales 
under company number 10186410. Registered office: 1st Floor Offices, 2a 
Highfield Road, Ringwood, Hampshire, United Kingdom, BH24 1RQ. VAT Registration 
GB 244 9628 77
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





--

Lee Norvall | CEO / Founder
Mob. +44 (0)7768 201884
Tel. +44 (0)20 3026 8930
Web. www.blocz.io

Enterprise Cloud | Private Cloud | Hybrid/Multi Cloud | Cloud Backup


This e-mail 

Re: [ceph-users] Ceph nfs ganesha exports

2019-08-01 Thread Jeff Layton
On Sun, 2019-07-28 at 18:20 +, Lee Norvall wrote:
> Update to this I found that you cannot create a 2nd files system as yet and 
> it is still experimental.  So I went down this route:
> 
> Added a pool to the existing cephfs and then setfattr -n ceph.dir.layout.pool 
> -v SSD-NFS /mnt/cephfs/ssdnfs/ from a ceph-fuse client.
> 
> I then nfs mounted from another box. I can see the files and dir etc from the 
> nfs client but my issue now is that I do not have permission to write, create 
> dir etc.  The same goes for the default setup after running the ansible 
> playbook even when setting export to no_root_squash.  Am I missing a chain of 
> permissions?  ganesha-nfs is using the admin userid; is this the same as 
> client.admin, or is this a user I need to create?  Any info appreciated.
> 
> Ceph is on CentOS 7 and SELinux is currently off as well.
> 
> Copy of the ganesha conf below.  Is secType correct or is it missing 
> something?
> 
> RADOS_URLS {
>ceph_conf = '/etc/ceph/ceph.conf';
>userid = "admin";
> }
> %url rados://cephfs_data/ganesha-export-index
> 
> NFSv4 {
> RecoveryBackend = 'rados_kv';
> }

In your earlier email, you mentioned that you had more than one NFS
server, but rados_kv is not safe in a multi-server configuration. The
servers will be competing to store recovery information in the same
objects, and won't honor each other's grace periods.

You may want to explore using "RecoveryBackend = rados_cluster" instead,
which should handle that situation better. See this writeup, for some
guidelines:


https://jtlayton.wordpress.com/2018/12/10/deploying-an-active-active-nfs-cluster-over-cephfs/

Much of this is already automated too if you use k8s+rook.

> RADOS_KV {
> ceph_conf = '/etc/ceph/ceph.conf';
> userid = "admin";
> pool = "cephfs_data";
> }
> 
> EXPORT
> {
> Export_id=20133;
> Path = "/";
> Pseudo = /cephfile;
> Access_Type = RW;
> Protocols = 3,4;
> Transports = TCP;
> SecType = sys,krb5,krb5i,krb5p;
> Squash = Root_Squash;
> Attr_Expiration_Time = 0;
> 
> FSAL {
> Name = CEPH;
> User_Id = "admin";
> }
> 
> 
> }
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> On 28/07/2019 12:11, Lee Norvall wrote:
> > Hi
> > 
> > I am using ceph-ansible to deploy and just looking for best way/tips on 
> > how to export multiple pools/fs.
> > 
> > Ceph: nautilus (14.2.2)
> > NFS-Ganesha v 2.8
> > ceph-ansible stable 4.0
> > 
> > I have 3 x osd/NFS gateways running and NFS on the dashboard can see 
> > them in the cluster.  I have managed to export for cephfs / and mounted 
> > it on another box.
> > 
> > 1) can I add a new pool/fs to the export under that same NFS gateway 
> > cluster, or
> > 
> > 2) do I have to do something like add a new pool to the fs and then 
> > setfattr to make the layout /newfs_dir point to /new_pool?  does this 
> > cause issues and false object count?
> > 
> > 3) any other better ways...
> > 
> > Rgds
> > 
> > Lee
> > 
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> -- 
>  
> 
> Lee Norvall | CEO / Founder 
> Mob. +44 (0)7768 201884 
> Tel. +44 (0)20 3026 8930 
> Web. www.blocz.io 
> 
> Enterprise Cloud | Private Cloud | Hybrid/Multi Cloud | Cloud Backup 
> 
> 
> 
> This e-mail (and any attachment) has been sent from a PC belonging to My Mind 
> (Holdings) Limited. If you receive it in error, please tell us by return and 
> then delete it from your system; you may not rely on its contents nor 
> copy/disclose it to anyone. Opinions, conclusions and statements of intent in 
> this e-mail are those of the sender and will not bind My Mind (Holdings) 
> Limited unless confirmed by an authorised representative independently of 
> this message. We do not accept responsibility for viruses; you must scan for 
> these. Please note that e-mails sent to and from blocz IO Limited are 
> routinely monitored for record keeping, quality control and training 
> purposes, to ensure regulatory compliance and to prevent viruses and 
> unauthorised use of our computer systems. My Mind (Holdings) Limited is 
> registered in England & Wales under company number 10186410. Registered 
> office: 1st Floor Offices, 2a Highfield Road, Ringwood, Hampshire, United 
> Kingdom, BH24 1RQ. VAT Registration GB 244 9628 77
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

-- 
Jeff Layton 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Adventures with large RGW buckets [EXT]

2019-08-01 Thread Matthew Vernon

Hi,

On 31/07/2019 19:02, Paul Emmerich wrote:

Some interesting points here, thanks for raising them :)


 From our experience: buckets with tens of million objects work just fine with
no big problems usually. Buckets with hundreds of million objects require some
attention. Buckets with billions of objects? "How about indexless buckets?" -
"No, we need to list them".


We've had some problems with large buckets (from around the 70M-object 
mark).


One issue you don't mention is that multipart uploads break during resharding 
- so if our users are filling up a bucket with many writers uploading 
multipart objects, some of these will fail (rather than blocking) when 
the bucket is resharded.



1. The recommended number of objects per shard is 100k. Why? How was this
default configuration derived?


I don't know what a good number is, but by the time you get into O(10M) 
objects, some sharding does seem to help - we've found a particular OSD 
getting really hammered by heavy updates on large buckets (in Jewel, 
before we had online resharding).



3. Deleting large buckets

Someone accidentally put 450 million small objects into a bucket and only noticed
when the cluster ran full. The bucket isn't needed, so just delete it and case
closed?

Deleting is unfortunately far slower than adding objects, also
radosgw-admin leaks
memory during deletion:


We've also seen bucket deletion via radosgw-admin failing because of 
oddities in the bucket itself (e.g. missing shadow objects, omap objects 
that still exist when the related object is gone); sorting that was a 
bit fiddly (with some help from Canonical, who I think are working on 
patches).



Increasing --max-concurrent-ios helps with deletion speed (option does effect
deletion concurrency, documentation says it's only for other specific commands).


Yes, we found increasing max-concurrent-ios helped.
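
(For reference, a sketch of the deletion command and option discussed, with a
placeholder bucket name:)

```
# delete the bucket and purge all of its objects from the RGW side;
# raising --max-concurrent-ios increases deletion parallelism
radosgw-admin bucket rm --bucket=big-backup-bucket --purge-objects --max-concurrent-ios=128
```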

Regards,

Matthew


--
The Wellcome Sanger Institute is operated by Genome Research 
Limited, a charity registered in England with number 1021457 and a 
company registered in England with number 2742969, whose registered 
office is 215 Euston Road, London, NW1 2BE. 
___

ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Urgent Help Needed (regarding rbd cache)

2019-08-01 Thread Oliver Freyermuth

Hi together,

Am 01.08.19 um 08:45 schrieb Janne Johansson:

On Thu, 1 Aug 2019 at 07:31, Muhammad Junaid <junaid.fsd...@gmail.com> wrote:

Your email has cleared many things to me. Let me repeat my understanding. 
Every Critical data (Like Oracle/Any Other DB) writes will be done with sync, 
fsync flags, meaning they will be only confirmed to DB/APP after it is actually 
written to Hard drives/OSD's. Any other application can do it also.
All other writes, like OS logs etc will be confirmed immediately to app/user but later on written  passing through kernel, RBD Cache, Physical drive Cache (If any)  and then to disks. These are susceptible to power-failure-loss but overall things are recoverable/non-critical. 



That last part is probably simplified a bit, I suspect between a program in a 
guest sending its data to the virtualised device, running in a KVM on top of an 
OS that has remote storage over network, to a storage server with its own OS 
and drive controller chip and lastly physical drive(s) to store the write, 
there will be something like ~10 layers of write caching possible, out of which 
the RBD you were asking about, is just one.

It is just located very conveniently before the I/O has to leave the KVM host 
and go back and forth over the network, so it is the last place where you can 
see huge gains in the guests I/O response time, but at the same time possible 
to share between lots of guests on the KVM host which should have tons of RAM 
available compared to any single guest so it is a nice way to get a large cache 
for outgoing writes.

Also, to answer your first part, yes all critical software that depend heavily 
on write ordering and integrity is hopefully already doing write operations 
that way, asking for sync(), fsync() or fdatasync() and similar calls, but I 
can't produce a list of all programs that do. Since there already are many 
layers of delayed cached writes even without virtualisation and/or ceph, 
applications that are important have mostly learned their lessons by now, so 
chances are very high that all your important databases and similar program are 
doing the right thing.


Just to add on this: One such software, for which people cared a lot, is of 
course a file system itself. BTRFS is notably a candidate very sensitive to 
broken flush / FUA ( 
https://en.wikipedia.org/wiki/Disk_buffer#Force_Unit_Access_(FUA) ) 
implementations at any layer of the I/O path due to the rather complicated 
metadata structure.
While for in-kernel and other open source software (such as librbd), there are 
usually a lot of people checking the code for a correct implementation and 
testing things, there is also broken hardware
(or rather, firmware) in the wild.

But there are even software issues around, if you think more general and strive 
for data correctness (since also corruption can happen at any layer):
I was hit by an in-kernel issue in the past (network driver writing network statistics 
via DMA to the wrong memory location - "sometimes")
corrupting two BTRFS partitions of mine, and causing random crashes in browsers 
and mail client apps. BTRFS has been hardened only in kernel 5.2 to check the 
metadata tree before flushing it to disk.

If you are curious about known hardware issues, check out this lengthy, but 
very insightful mail on the linux-btrfs list:
https://lore.kernel.org/linux-btrfs/20190623204523.gc11...@hungrycats.org/
As you can learn there, there are many drive and firmware combinations out 
there which do not implement flush / FUA correctly and your BTRFS may be 
corrupted after a power failure. The very same thing can happen to Ceph,
but with replication across several OSDs and lower probability to have broken 
disks in all hosts makes this issue less likely.

For what it is worth, we also use writeback caching for our virtualization 
cluster and are very happy with it - we also tried pulling power plugs on 
hypervisors, MONs and OSDs at random times during writes and ext4 could always 
recover easily with an fsck
making use of the journal.

Cheers and HTH,
Oliver



But if the guest is instead running a mail filter that does antivirus checks, spam checks 
and so on, operating on files that live on the machine for something like one second, and 
then either get dropped or sent to the destination mailbox somewhere else, then having 
aggressive write caches would be very useful, since the effects of a crash would still 
mostly mean "the emails that were in the queue were lost, not acked by the final 
mailserver and will probably be resent by the previous smtp server". For such a 
guest VM, forcing sync writes would only be a net loss, it would gain much by having 
large ram write caches.

--
May the most significant bit of your life be positive.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com






Re: [ceph-users] High memory usage OSD with BlueStore

2019-08-01 Thread Mark Nelson

Hi Danny,


Are your arm binaries built using tcmalloc?  At least on x86 we saw 
significantly higher memory fragmentation and memory usage with glibc 
malloc.



First, you can look at the mempool stats which may provide a hint:


ceph daemon osd.NNN dump_mempools


Assuming you are using tcmalloc and have the cache autotuning enabled, 
you can also enable debug_bluestore = "5" and debug_prioritycache = "5" 
on one of the OSDs that using lots of memory.  Look for the lines 
containing "cache_size" "tune_memory target".  Those will tell you how 
much of your memory is being devoted for bluestore caches and how it's 
being divided up between kv, buffer, and rocksdb block cache.
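
Roughly, that inspection could look like this (the OSD id and log path are
placeholders):

```
# raise cache-related debug levels on one of the large-memory OSDs
ceph tell osd.12 injectargs '--debug_bluestore 5 --debug_prioritycache 5'

# watch how the autotuner sizes the caches (kv / buffer / rocksdb block cache)
grep -E 'tune_memory target|cache_size' /var/log/ceph/ceph-osd.12.log | tail -n 20

# and dump the mempools for a breakdown of current allocations
ceph daemon osd.12 dump_mempools
```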



Mark

On 8/1/19 4:25 AM, dannyyang(杨耿丹) wrote:


Hi all:
we have a cephfs env, ceph version 12.2.10, the servers are ARM but the fuse 
clients are x86, the osd disk size is 8T, and some osds use 12 GB of memory; 
is that normal?




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] High memory usage OSD with BlueStore

2019-08-01 Thread Janne Johansson
On Thu, 1 Aug 2019 at 11:31, dannyyang(杨耿丹) wrote:

> Hi all:
>
> we have a cephfs env, ceph version 12.2.10, the servers are ARM but the fuse
> clients are x86, the osd disk size is 8T, and some osds use 12 GB of memory;
> is that normal?
>
>
For bluestore, there are certain tuneables you can use to limit memory a
bit. For filestore it would not be "normal" but very much possible in
recovery scenarios for memory to shoot up like that.

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] High memory usage OSD with BlueStore

2019-08-01 Thread 杨耿丹
Hi all:
we have a cephfs env, ceph version 12.2.10, the servers are ARM but the fuse clients are 
x86, the osd disk size is 8T, and some osds use 12 GB of memory; is that normal?


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW 4 MiB objects

2019-08-01 Thread Thomas Bennett
Hi Aleksey,

Thanks for the detailed breakdown!

We're currently using replication pools but will be testing ec pools soon
enough and this is a useful set of parameters to look at. Also, I had not
considered the bluestore parameters, thanks for pointing that out.

Kind regards

On Wed, Jul 31, 2019 at 2:36 PM Aleksey Gutikov 
wrote:

> Hi Thomas,
>
> We did some investigations some time before and got several rules how to
> configure rgw and osd for big files stored on erasure-coded pool.
> Hope it will be useful.
> And if I have any mistakes, please let me know.
>
> S3 object saving pipeline:
>
> - S3 object is divided into multipart shards by client.
> - Rgw shards each multipart shard into rados objects of size
> rgw_obj_stripe_size.
> - Primary osd stripes rados object into ec stripes of width ==
> ec.k*profile.stripe_unit, ec code them and send units into secondary
> osds and write into object store (bluestore).
> - Each subobject of rados object has size == (rados object size)/k.
> - Then while writing into disk bluestore can divide rados subobject into
> extents of minimal size == bluestore_min_alloc_size_hdd.
>
> Next rules can save some space and iops:
>
> - rgw_multipart_min_part_size SHOULD be multiple of rgw_obj_stripe_size
> (client can use different value greater than)
> - MUST rgw_obj_stripe_size == rgw_max_chunk_size
> - ec stripe == osd_pool_erasure_code_stripe_unit or profile.stripe_unit
> - rgw_obj_stripe_size SHOULD be multiple of profile.stripe_unit*ec.k
> - bluestore_min_alloc_size_hdd MAY be equal to bluefs_alloc_size (to
> avoid fragmentation)
> - rgw_obj_stripe_size/ec.k SHOULD be multiple of
> bluestore_min_alloc_size_hdd
> - bluestore_min_alloc_size_hdd MAY be multiple of profile.stripe_unit
>
> For example, if ec.k=5:
>
> - rgw_multipart_min_part_size = rgw_obj_stripe_size = rgw_max_chunk_size
> = 20M
> - rados object size == 20M
> - profile.stripe_unit = 256k
> - rados subobject size == 4M, 16 ec stripe units (20M / 5)
> - bluestore_min_alloc_size_hdd = bluefs_alloc_size = 1M
> - rados subobject can be written in 4 extents each containing 4 ec
> stripe units
>
>
>
> On 30.07.19 17:35, Thomas Bennett wrote:
> > Hi,
> >
> > Does anyone out there use bigger than default values for
> > rgw_max_chunk_size and rgw_obj_stripe_size?
> >
> > I'm planning to set rgw_max_chunk_size and rgw_obj_stripe_size  to
> > 20MiB, as it suits our use case and from our testing we can't see any
> > obvious reason not to.
> >
> > Is there some convincing experience that we should stick with 4MiBs?
> >
> > Regards,
> > Tom
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
>
> --
>
> Best regards!
> Aleksei Gutikov | Ceph storage engeneer
> synesis.ru | Minsk. BY
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


-- 
Thomas Bennett

Storage Engineer at SARAO
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Error ENOENT: problem getting command descriptions from mon.5

2019-08-01 Thread Christoph Adomeit
Hi there,

I have updated my ceph cluster from luminous to 14.2.1, and whenever I run a 
"ceph tell mon.* version"
I get the correct versions from all monitors except mon.5

For mon.5 I get the error:
Error ENOENT: problem getting command descriptions from mon.5
mon.5: problem getting command descriptions from mon.5

The ceph status command is healthy and all monitors seem to work perfectly, 
also mon.5.


I have verified that the running mon binaries are the same and that the permissions in
/var/lib/ceph are also the same as on the other monitor hosts.

I am also not sure if the error appeared after the update to nautilus; I think it 
was there before.

Any idea what the "problem getting command descriptions" error might be, where I 
can look, and what to fix? The monmap and ceph.conf also seem to be okay, and the 
monitor logfile does not show anything unusual.

Thanks
  Christoph
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to deal with slow requests related to OSD bugs

2019-08-01 Thread Thomas Bennett
Hi Xavier,

We have had OSDs backed with Samsung SSD 960 PRO 512GB nvmes which started
generating slow requests.

After running:

ceph osd tree up | grep nvme | awk '{print $4}' | xargs -P 10 -I _OSD sh -c 'BPS=$(ceph tell _OSD bench | jq -r .bytes_per_sec); MBPS=$(echo "scale=2; $BPS/1000000" | bc -l); echo _OSD $MBPS MB/s' | sort -n -k 2 | column -t

I noticed that the data rate had dropped significantly on some of my NVMEs
(some were down from ~1000 MB/s to ~300 MB/s). This pointed me to the fact
that the NVMes were not behaving as expected.

I thought it may be worth asking if you perhaps seeing something similar.

Cheers,
Tom

On Wed, Jul 24, 2019 at 6:39 PM Xavier Trilla 
wrote:

> Hi,
>
>
>
> We had an strange issue while adding a new OSD to our Ceph Luminous 12.2.8
> cluster. Our cluster has > 300 OSDs based on SSDs and NVMe.
>
>
>
> After adding a new OSD to the Ceph cluster one of the already running OSDs
> started to give us slow queries warnings.
>
>
>
> We checked the OSD and it was working properly, nothing strange on the
> logs and also it has disk activity. Looks like it stopped serving requests
> just for one PG.
>
>
>
> Request were just piling up, and the number of slow queries was just
> growing constantly till we restarted the OSD (All our OSDs are running
> bluestore).
>
>
>
> We’ve been checking out everything in our setup, and everything is
> properly configured (This cluster has been running for >5 years and it
> hosts several thousand VMs.)
>
>
>
> Beyond finding the real source of the issue -I guess I’ll have to add more
> OSDs and if it happens again I could just dump the stats of the OSD (ceph
> daemon osd.X dump_historic_slow_ops) – what I would like to find is a way
> to protect the cluster from this kind of issues.
>
>
>
> I mean, in some scenarios OSDs just suicide -actually I fixed the issue
> just restarting the offending OSD- but how can we deal with this kind of
> situation? I’ve been checking around, but I could not find anything
> (Obviously we could set our monitoring software to restart any OSD which
> has more than N slow queries, but I find that a little bit too aggressive).
>
>
>
> Is there anything build in Ceph to deal with these situations? A OSD
> blocking queries in a RBD scenario is a big deal, as plenty of VMs will
> have disk timeouts which can lead to the VM just panicking.
>
>
>
> Thanks!
>
> Xavier
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


-- 
Thomas Bennett

Storage Engineer at SARAO
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Too few PGs per OSD (autoscaler)

2019-08-01 Thread Jan Kasprzak
Helo, Ceph users,

TL;DR: PG autoscaler should not cause the "too few PGs per OSD" warning

Detailed:
Some time ago, I upgraded the HW in my virtualization+Ceph cluster,
replacing 30+ old servers with <10 modern servers. I immediately got
"Too much PGs per OSD" warning, so I had to add more OSDs, even though
I did not need the space at that time. So I eagerly waited for the PG
autoscaling feature in Nautilus.

Yesterday I upgraded to Nautilus and enabled the autoscaler on my RBD pool.
Firstly I got the "objects per pg (XX) is more than XX times cluster average"
warning for several hours, which has been replaced with
"too few PGs per OSD" later on.

I have to set the minimum number of PGs per pool, but anyway, I think
the autoscaler should not be too aggressive, and should not reduce the number
of PGs below the PGs per OSD limit.
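
For what it's worth, a minimal sketch of pinning a floor under a pool while keeping
the autoscaler on (pool name and value are placeholders):

```
# stop the autoscaler from shrinking this pool below 128 PGs
ceph osd pool set rbd pg_num_min 128

# review what the autoscaler intends to do per pool
ceph osd pool autoscale-status
```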

(that said, the ability to reduce the number of PGs in a pool in Nautilus
works well for me, thanks for it!)

Thanks,

-Yenya

-- 
| Jan "Yenya" Kasprzak  |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
sir_clive> I hope you don't mind if I steal some of your ideas?
 laryross> As far as stealing... we call it sharing here.   --from rcgroups
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Urgent Help Needed (regarding rbd cache)

2019-08-01 Thread Janne Johansson
On Thu, 1 Aug 2019 at 07:31, Muhammad Junaid wrote:

> Your email has cleared many things to me. Let me repeat my understanding.
> Every Critical data (Like Oracle/Any Other DB) writes will be done with
> sync, fsync flags, meaning they will be only confirmed to DB/APP after it
> is actually written to Hard drives/OSD's. Any other application can do it
> also.
> All other writes, like OS logs etc will be confirmed immediately to
> app/user but later on written  passing through kernel, RBD Cache, Physical
> drive Cache (If any)  and then to disks. These are susceptible to
> power-failure-loss but overall things are recoverable/non-critical.
>

That last part is probably simplified a bit, I suspect between a program in
a guest sending its data to the virtualised device, running in a KVM on top
of an OS that has remote storage over network, to a storage server with its
own OS and drive controller chip and lastly physical drive(s) to store the
write, there will be something like ~10 layers of write caching possible,
out of which the RBD you were asking about, is just one.

It is just located very conveniently before the I/O has to leave the KVM
host and go back and forth over the network, so it is the last place where
you can see huge gains in the guests I/O response time, but at the same
time possible to share between lots of guests on the KVM host which should
have tons of RAM available compared to any single guest so it is a nice way
to get a large cache for outgoing writes.

Also, to answer your first part, yes all critical software that depend
heavily on write ordering and integrity is hopefully already doing write
operations that way, asking for sync(), fsync() or fdatasync() and similar
calls, but I can't produce a list of all programs that do. Since there
already are many layers of delayed cached writes even without
virtualisation and/or ceph, applications that are important have mostly
learned their lessons by now, so chances are very high that all your
important databases and similar program are doing the right thing.
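
To make that distinction concrete, here is a hedged sketch of measuring it inside
a guest with fio (file path and sizes are placeholders): the fsync=1 run shows the
latency paid when every write must reach the OSDs, while the buffered run benefits
from the RBD and page caches.

```
# buffered writes: acknowledged once they land in the cache layers
fio --name=buffered --filename=/var/tmp/fio.dat --size=512m \
    --bs=4k --rw=randwrite --ioengine=libaio --iodepth=16 \
    --runtime=30 --time_based

# durable writes: fsync after every write, as a database would do
fio --name=durable --filename=/var/tmp/fio.dat --size=512m \
    --bs=4k --rw=randwrite --fsync=1 \
    --runtime=30 --time_based
```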

But if the guest is instead running a mail filter that does antivirus
checks, spam checks and so on, operating on files that live on the machine
for something like one second, and then either get dropped or sent to the
destination mailbox somewhere else, then having aggressive write caches
would be very useful, since the effects of a crash would still mostly mean
"the emails that were in the queue were lost, not acked by the final
mailserver and will probably be resent by the previous smtp server". For
such a guest VM, forcing sync writes would only be a net loss, it would
gain much by having large ram write caches.


-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Reading a crushtool compare output

2019-08-01 Thread Linh Vu
Hi all,

I'd like to update the tunables on our older ceph cluster, created with firefly 
and now on luminous. I need to update two tunables, chooseleaf_vary_r from 2 to 
1, and chooseleaf_stable from 0 to 1. I'm going to do 1 tunable update at a 
time.

With the first one, I've dumped the current crushmap out and compared it to the 
proposed updated crushmap with chooseleaf_vary_r set to 1 instead of 2. I need 
some help to understand the output:

# ceph osd getcrushmap -o crushmap-20190801-chooseleaf-vary-r-2
# crushtool -i crushmap-20190801-chooseleaf-vary-r-2 --set-chooseleaf-vary-r 1 
-o crushmap-20190801-chooseleaf-vary-r-1
# crushtool -i crushmap-20190801-chooseleaf-vary-r-2 --compare 
crushmap-20190801-chooseleaf-vary-r-1
rule 0 had 9137/10240 mismatched mappings (0.892285)
rule 1 had 9152/10240 mismatched mappings (0.89375)
rule 4 had 9173/10240 mismatched mappings (0.895801)
rule 5 had 0/7168 mismatched mappings (0)
rule 6 had 0/7168 mismatched mappings (0)
warning: maps are NOT equivalent

So I've learned in the past doing this sort of stuff that if the maps are 
equivalent then there is no data movement. In this case, obviously I'm 
expecting data movement, but by how much? Rules 0, 1 and 4 are about our 3 
different device classes in this cluster.

Does that mean I'm going to expect almost 90% mismatched based on the above 
output? That's much bigger than I expected, as in the previous steps of 
changing the chooseleaf-vary-r from 0 to 5 then down to 2 by 1 at a time 
(before knowing anything about this crushtool --compare command) I had only up 
to about 28% mismatched objects.
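
(For what it's worth, one hedged way to estimate movement against the cluster's
actual PG mappings, rather than crushtool's synthetic samples, assuming
osdmaptool's --import-crush and --test-map-pgs-dump options:)

```
ceph osd getmap -o osdmap.bin
cp osdmap.bin osdmap.new

osdmaptool osdmap.bin --test-map-pgs-dump > mappings-before.txt
osdmaptool osdmap.new --import-crush crushmap-20190801-chooseleaf-vary-r-1
osdmaptool osdmap.new --test-map-pgs-dump > mappings-after.txt

# a rough count of PGs whose up/acting set would change
diff mappings-before.txt mappings-after.txt | grep -c '^>'
```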

Also, if you've done a similar change, please let me know how much data 
movement you encountered. Thanks!

Cheers,
Linh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com