Re: [ceph-users] PGs per OSD guidance

2017-07-19 Thread David Turner
Here are a few thoughts. The more PGs, the higher the memory requirement for
the osd process. If scrubs are causing problems with customer IO, check some
of the IO priority settings that received a big overhaul with Jewel and again
with 10.2.9. The more PGs you have, the smaller each one will be, so
individual scrubs will finish faster... but you'll have that many more scrubs
to do, so it will end up taking about the same amount of time to scrub
everything because you have the same amount of data.
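
For reference, these are the sort of knobs I mean - a rough sketch only,
assuming Jewel-era option names, so check what your version actually exposes
(e.g. "ceph daemon osd.0 config show | grep scrub") before copying it:

[osd]
# drop the disk-thread IO priority below client IO (needs CFQ on the data disks)
osd_disk_thread_ioprio_class = idle
osd_disk_thread_ioprio_priority = 7
# sleep between scrub chunks so client IO can get a look in
osd_scrub_sleep = 0.1
# one scrub per OSD at a time, and only during quiet hours
osd_max_scrubs = 1
osd_scrub_begin_hour = 22
osd_scrub_end_hour = 6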

Generally, increasing PG counts is aimed at improving the distribution of
data between osds, maintaining a desired PG size as data grows (not a common
concern), or maintaining a desired number of PGs per osd when you add osds to
your cluster. Outside of those reasons, I don't know of any benefits to
increasing PG counts.
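
As a rough worked example of the usual sizing rule of thumb (numbers here are
hypothetical, not taken from your cluster):

total PGs across all pools ~= (num_osds * ~100 target PGs per OSD) / replica size,
rounded to a power of two
e.g. (72 * 100) / 3 = 2400  ->  2048 (or 4096 if significant growth is expected)

That total then gets split between pools, weighted by how much data you expect
each pool to hold.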

On Wed, Jul 19, 2017, 9:57 PM Adrian Saul 
wrote:

>
> Anyone able to offer any advice on this?
>
> Cheers,
>  Adrian
>
>
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> > Adrian Saul
> > Sent: Friday, 14 July 2017 6:05 PM
> > To: 'ceph-users@lists.ceph.com'
> > Subject: [ceph-users] PGs per OSD guidance
> >
> > Hi All,
> >I have been reviewing the sizing of our PGs with a view to some
> > intermittent performance issues.  When we have scrubs running, even when
> > only a few are, we can sometimes get severe impacts on the performance of
> > RBD images, enough to start causing VMs to appear stalled or unresponsive.
> > When some of these scrubs are running I can see very high latency on some
> > disks which I suspect is what is impacting the performance.  We currently
> > have around 70 PGs per SATA OSD, and 140 PGs per SSD OSD.   These
> > numbers are probably not really reflective as most of the data is in only
> > really half of the pools, so some PGs would be fairly heavy while others are
> > practically empty.  From what I have read we should be able to go
> > significantly higher though.  We are running 10.2.1 if that matters in this
> > context.
> >
> >  My question is if we increase the numbers of PGs, is that likely to help
> > reduce the scrub impact or spread it wider?  For example, does the mere act
> > of scrubbing one PG mean the underlying disk is going to be hammered and
> > so we will impact more PGs with that load, or would having more PGs mean
> > the time to scrub the PG should be reduced and so the impact will be more
> > dispersed?
> >
> > I am also curious, from a performance standpoint, are we better off with
> > more PGs to reduce PG lock contention etc?
> >
> > Cheers,
> >  Adrian
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Problems getting nfs-ganesha with cephfs backend to work.

2017-07-19 Thread Ramana Raja
On 07/20/2017 at 12:02 AM, Daniel Gryniewicz  wrote:
> On 07/19/2017 05:27 AM, Micha Krause wrote:
> > Hi,
> > 
> >> Ganesha version 2.5.0.1 from the nfs-ganesha repo hosted on
> >> download.ceph.com 
> > 
> > I didn't know about that repo, and compiled ganesha myself. The
> > developers in the #ganesha IRC channel pointed me to
> > the libcephfs version.
> > After recompiling ganesha with a kraken libcephfs instead of a jewel
> > version both errors went away.
> > 
> > I'm sure using a compiled Version from the repo you mention would have
> > worked out of the box.
> > 
> > Micha Krause
> > 
> 
> These packages aren't quite ready for use yet, the packaging work is
> still underway.  CCing Ali, who's doing the work.
> 
> Daniel

Ali told me that the rpm for Ganesha's CephFS FSAL (driver),
nfs-ganesha-ceph v2.5.0.1 (28th June), available at download.ceph.com,
was built using libcephfs2 from the Ceph Luminous release v12.0.3.

AFAIK Ali's working on building the latest nfs-ganesha and nfs-ganesha-ceph FSAL
(v2.5.0.4) rpms and debs using libcephfs2 from the latest Luminous release, 12.1.1.
You can expect them to be at download.ceph.com/nfs-ganesha sometime soon.

-Ramana
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 答复: 答复: How's cephfs going?

2017-07-19 Thread Blair Bethwaite
On 20 July 2017 at 12:23, 许雪寒  wrote:
> May I ask how many users do you have on cephfs? And how much data does the 
> cephfs store?

https://www.redhat.com/en/resources/monash-university-improves-research-ceph-storage-case-study

As I said, we don't yet have CephFS in production, just finalising our
PoC setup to let some initial users have a go at it in the coming week
or two.

> interesting xattr acl behaviours (e.g. ACL'd dir writable through one gateway 
> node but not another), other colleagues looking at that and poking Red Hat 
> for assistance...

This turned out to be something about having different versions
of the FUSE client on different Samba hosts - we hadn't finished
upgrading/restarting them since upgrading to 10.2.7 (RHCS 2.3).

Cheers,


> On 17 July 2017 at 13:27, 许雪寒  wrote:
>> Hi, thanks for the quick reply:-)
>>
>> May I ask which company you are with? I'm asking this because we are 
>> collecting cephfs usage information as the basis of our judgement about 
>> whether to use cephfs. And also, how are you using it? Are you using a 
>> single MDS, the so-called active-standby mode? And could you give some 
>> information about your cephfs usage pattern - for example, do your client 
>> nodes mount cephfs directly, or through NFS or something like it 
>> re-exporting a directory that is mounted with cephfs, and are you using 
>> ceph-fuse?
>>
>> -Original Message-
>> From: Blair Bethwaite [mailto:blair.bethwa...@gmail.com]
>> Sent: 17 July 2017 11:14
>> To: 许雪寒
>> Cc: ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] How's cephfs going?
>>
>> It works and can reasonably be called "production ready". However in Jewel 
>> there are still some features (e.g. directory sharding, multi active MDS, 
>> and some security constraints) that may limit widespread usage. Also note 
>> that userspace client support in e.g. nfs-ganesha and samba is a mixed bag 
>> across distros and you may find yourself having to resort to re-exporting 
>> ceph-fuse or kernel mounts in order to provide those gateway services. We 
>> haven't tried Luminous CephFS yet as still waiting for the first full 
>> (non-RC) release to drop, but things seem very positive there...
>>
>> On 17 July 2017 at 12:59, 许雪寒  wrote:
>>> Hi, everyone.
>>>
>>>
>>>
>>> We intend to use cephfs of Jewel version, however, we don’t know its status.
>>> Is it production ready in Jewel? Does it still have lots of bugs? Is
>>> it a major effort of the current ceph development? And who are using cephfs 
>>> now?
>>>
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
>>
>>
>> --
>> Cheers,
>> ~Blairo
>
>
>
> --
> Cheers,
> ~Blairo



-- 
Cheers,
~Blairo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] 答复: 答复: How's cephfs going?

2017-07-19 Thread 许雪寒
Hi, sir, thanks for sharing.

May I ask how many users do you have on cephfs? And how much data does the 
cephfs store?

Thanks:-)

-Original Message-
From: Blair Bethwaite [mailto:blair.bethwa...@gmail.com]
Sent: 17 July 2017 11:51
To: 许雪寒
Cc: ceph-users@lists.ceph.com
Subject: Re: 答复: [ceph-users] How's cephfs going?

I work at Monash University. We are using active-standby MDS. We don't yet have 
it in full production as we need some of the newer Luminous features before we 
can roll it out more broadly, however we are moving towards letting a subset of 
users on (just slowly ticking off related work like putting external backup 
system in-place, writing some janitor scripts to check quota enforcement, and 
so on). Our HPC folks are quite keen for more as it has proved very useful for 
shunting a bit of data around between disparate systems.

We're also testing NFS and CIFS gateways, after some initial issues with the 
CTDB setup that part seems to be working, but now hitting some interesting 
xattr acl behaviours (e.g. ACL'd dir writable through one gateway node but not 
another), other colleagues looking at that and poking Red Hat for assistance...

On 17 July 2017 at 13:27, 许雪寒  wrote:
> Hi, thanks for the quick reply:-)
>
> May I ask which company you are with? I'm asking this because we are collecting 
> cephfs usage information as the basis of our judgement about whether to use 
> cephfs. And also, how are you using it? Are you using a single MDS, the 
> so-called active-standby mode? And could you give some information about your 
> cephfs usage pattern - for example, do your client nodes mount cephfs directly, 
> or through NFS or something like it re-exporting a directory that is mounted 
> with cephfs, and are you using ceph-fuse?
>
> -Original Message-
> From: Blair Bethwaite [mailto:blair.bethwa...@gmail.com]
> Sent: 17 July 2017 11:14
> To: 许雪寒
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] How's cephfs going?
>
> It works and can reasonably be called "production ready". However in Jewel 
> there are still some features (e.g. directory sharding, multi active MDS, and 
> some security constraints) that may limit widespread usage. Also note that 
> userspace client support in e.g. nfs-ganesha and samba is a mixed bag across 
> distros and you may find yourself having to resort to re-exporting ceph-fuse 
> or kernel mounts in order to provide those gateway services. We haven't tried 
> Luminous CephFS yet as still waiting for the first full (non-RC) release to 
> drop, but things seem very positive there...
>
> On 17 July 2017 at 12:59, 许雪寒  wrote:
>> Hi, everyone.
>>
>>
>>
>> We intend to use cephfs of Jewel version, however, we don’t know its status.
>> Is it production ready in Jewel? Does it still have lots of bugs? Is 
>> it a major effort of the current ceph development? And who are using cephfs 
>> now?
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
>
>
> --
> Cheers,
> ~Blairo



--
Cheers,
~Blairo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PGs per OSD guidance

2017-07-19 Thread Adrian Saul

Anyone able to offer any advice on this?

Cheers,
 Adrian


> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Adrian Saul
> Sent: Friday, 14 July 2017 6:05 PM
> To: 'ceph-users@lists.ceph.com'
> Subject: [ceph-users] PGs per OSD guidance
>
> Hi All,
>I have been reviewing the sizing of our PGs with a view to some
> intermittent performance issues.  When we have scrubs running, even when
> only a few are, we can sometimes get severe impacts on the performance of
> RBD images, enough to start causing VMs to appear stalled or unresponsive.
> When some of these scrubs are running I can see very high latency on some
> disks which I suspect is what is impacting the performance.  We currently
> have around 70 PGs per SATA OSD, and 140 PGs per SSD OSD.   These
> numbers are probably not really reflective as most of the data is in only
> really half of the pools, so some PGs would be fairly heavy while others are
> practically empty.  From what I have read we should be able to go
> significantly higher though.  We are running 10.2.1 if that matters in this
> context.
>
>  My question is if we increase the numbers of PGs, is that likely to help
> reduce the scrub impact or spread it wider?  For example, does the mere act
> of scrubbing one PG mean the underlying disk is going to be hammered and
> so we will impact more PGs with that load, or would having more PGs mean
> the time to scrub the PG should be reduced and so the impact will be more
> dispersed?
>
> I am also curious, from a performance standpoint, are we better off with
> more PGs to reduce PG lock contention etc?
>
> Cheers,
>  Adrian
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pgs not deep-scrubbed for 86400

2017-07-19 Thread Brad Hubbard
This code shows how that all works (part of some new health reporting code).

https://github.com/ceph/ceph/blob/master/src/mon/PGMap.cc#L3188-L3203

So the last_deep_scrub_stamp of the pg is compared to deep_cutoff, which is the
time now minus (mon_warn_not_deep_scrubbed + osd_deep_scrub_interval).

By default mon_warn_not_deep_scrubbed = 0
OPTION(mon_warn_not_deep_scrubbed, OPT_INT, 0)

By default osd_deep_scrub_interval = 1 week
OPTION(osd_deep_scrub_interval, OPT_FLOAT, 60*60*24*7) // once a week

So by default if a pg has not been deep scrubbed in more than a week you will
get this warning. I believe the value reported, 86400, is incorrect in the case
of a deep scrub as it should be 604800 and this looks like a copy and paste
error. I'll submit a patch to rectify that.

The calculations for scrub (as opposed to deep scrub) are very similar except
the value involved is 86400, or one day (mon_scrub_interval).
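
For example, with the defaults above (a worked example only - substitute
whatever you have set locally):

deep scrub cutoff = now - (mon_warn_not_deep_scrubbed + osd_deep_scrub_interval)
                  = now - (0 + 604800s), i.e. warn once a PG's last deep scrub
                    is more than 7 days old
scrub cutoff      = now - (0 + 86400s),  i.e. warn once a PG's last scrub is
                    more than 1 day old

You can confirm the values your daemons are actually running with via the
admin socket, e.g.:

$ ceph daemon osd.0 config get osd_deep_scrub_interval
$ ceph daemon mon.$(hostname -s) config get mon_warn_not_deep_scrubbed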

I'll leave it as an exercise for you to do the math in your cases using the
values you have set locally.

HTH.

On Thu, Jul 20, 2017 at 5:10 AM, Gencer W. Genç  wrote:
> I have exactly this issue (or not?) at the moment. Mine says “906 pgs not
> scrubbed for 86400”. But it is decrementing slowly (very slowly).
>
>  
>
> I cannot find any documentation for the exact “pgs not scrubbed for” phrase on
> the web, only this.
>
>  
>
> Log like this:
>
>  
>
> 2017-07-19 15:05:10.125041 [INF]  3.5e scrub ok
>
> 2017-07-19 15:05:10.123522 [INF]  3.5e scrub starts
>
> 2017-07-19 15:05:14.613124 [WRN]  Health check update: 914 pgs not scrubbed
> for 86400 (PG_NOT_SCRUBBED)
>
> 2017-07-19 15:05:07.433748 [INF]  1.c4 scrub ok
>
> ...
>
>  
>
> Should this scare us?
>
>  
>
> Gencer.
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Cheers,
Brad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 答复: How's cephfs going?

2017-07-19 Thread Kjetil Jørgensen
Hi,

While not necessarily CephFS specific - we somehow seem to manage to
frequently end up with objects that have inconsistent omaps. This seems to
be replication related (anecdotally it's a replica that ends up diverging,
and at least a few times it's something that happened after the osd that
held that replica was restarted). (I had hoped
http://tracker.ceph.com/issues/17177 would solve this - but it doesn't
appear to have solved it completely).

We also have one workload which we'd need to re-engineer in order to be a
good fit for CephFS: we do a lot of hardlinks where there's no clear
"origin" file, which is slightly at odds with the hardlink implementation.
If I understand correctly, unlink is a move from the directory tree into the
stray directories and a decrement of the link count; if the link count is 0,
purge, if not, keep it around until another link to it is encountered and
re-integrate it back in again. This netted us hilariously large stray
directories, which combined with the above were less than ideal.

Beyond that - there have been other small(-ish) bugs we've encountered, but
they've either been solvable by cherry-picking fixes, upgrading, or using the
available tools for doing surgery guided either by the internet and/or an
approximate understanding of how it's supposed to work/be.

-KJ

On Wed, Jul 19, 2017 at 11:20 AM, Brady Deetz  wrote:

> Thanks Greg. I thought it was impossible when I reported 34MB for 52
> million files.
>
> On Jul 19, 2017 1:17 PM, "Gregory Farnum"  wrote:
>
>>
>>
>> On Wed, Jul 19, 2017 at 10:25 AM David  wrote:
>>
>>> On Tue, Jul 18, 2017 at 6:54 AM, Blair Bethwaite <
>>> blair.bethwa...@gmail.com> wrote:
>>>
 We are a data-intensive university, with an increasingly large fleet
 of scientific instruments capturing various types of data (mostly
 imaging of one kind or another). That data typically needs to be
 stored, protected, managed, shared, connected/moved to specialised
 compute for analysis. Given the large variety of use-cases we are
 being somewhat more circumspect in our CephFS adoption and really only
 dipping toes in the water, ultimately hoping it will become a
 long-term default NAS choice from Luminous onwards.

 On 18 July 2017 at 15:21, Brady Deetz  wrote:
 > All of that said, you could also consider using rbd and zfs or
 whatever filesystem you like. That would allow you to gain the benefits of
 scaleout while still getting a feature rich fs. But, there are some down
 sides to that architecture too.

 We do this today (KVMs with a couple of large RBDs attached via
 librbd+QEMU/KVM), but the throughput able to be achieved this way is
 nothing like native CephFS - adding more RBDs doesn't seem to help
 increase overall throughput. Also, if you have NFS clients you will
 absolutely need SSD ZIL. And of course you then have a single point of
 failure and downtime for regular updates etc.

 In terms of small file performance I'm interested to hear about
 experiences with in-line file storage on the MDS.

 Also, while we're talking about CephFS - what size metadata pools are
 people seeing on their production systems with 10s-100s millions of
 files?

>>>
>>> On a system with 10.1 million files, metadata pool is 60MB
>>>
>>>
>> Unfortunately that's not really an accurate assessment, for good but
>> terrible reasons:
>> 1) CephFS metadata is principally stored via the omap interface (which is
>> designed for handling things like the directory storage CephFS needs)
>> 2) omap is implemented via LevelDB/RocksDB
>> 3) there is not a good way to determine which pool is responsible for
>> which portion of RocksDBs data
>> 4) So the pool stats do not incorporate omap data usage at all in their
>> reports (it's part of the overall space used, and is one of the things that
>> can make that larger than the sum of the per-pool spaces)
>>
>> You could try and estimate it by looking at how much "lost" space there
>> is (and subtracting out journal sizes and things, depending on setup). But
>> I promise there's more than 60MB of CephFS metadata for 10.1 million files!
>> -Greg
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 
Kjetil Joergensen 
SRE, Medallia Inc
Phone: +1 (650) 739-6580
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pgs not deep-scrubbed for 86400

2017-07-19 Thread Gencer W . Genç
I have exactly this issue (or not?) at the moment. Mine says "906 pgs not
scrubbed for 86400". But it is decrementing slowly (very slowly).

 

I cannot find any documentation for the exact "pgs not scrubbed for" phrase on
the web, only this.

 

Log like this:

 

2017-07-19 15:05:10.125041 [INF]  3.5e scrub ok 

2017-07-19 15:05:10.123522 [INF]  3.5e scrub starts 

2017-07-19 15:05:14.613124 [WRN]  Health check update: 914 pgs not scrubbed
for 86400 (PG_NOT_SCRUBBED) 

2017-07-19 15:05:07.433748 [INF]  1.c4 scrub ok

...

 

Should this scare us?

 

Gencer.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 答复: How's cephfs going?

2017-07-19 Thread Brady Deetz
Thanks Greg. I thought it was impossible when I reported 34MB for 52
million files.

On Jul 19, 2017 1:17 PM, "Gregory Farnum"  wrote:

>
>
> On Wed, Jul 19, 2017 at 10:25 AM David  wrote:
>
>> On Tue, Jul 18, 2017 at 6:54 AM, Blair Bethwaite <
>> blair.bethwa...@gmail.com> wrote:
>>
>>> We are a data-intensive university, with an increasingly large fleet
>>> of scientific instruments capturing various types of data (mostly
>>> imaging of one kind or another). That data typically needs to be
>>> stored, protected, managed, shared, connected/moved to specialised
>>> compute for analysis. Given the large variety of use-cases we are
>>> being somewhat more circumspect in our CephFS adoption and really only
>>> dipping toes in the water, ultimately hoping it will become a
>>> long-term default NAS choice from Luminous onwards.
>>>
>>> On 18 July 2017 at 15:21, Brady Deetz  wrote:
>>> > All of that said, you could also consider using rbd and zfs or
>>> whatever filesystem you like. That would allow you to gain the benefits of
>>> scaleout while still getting a feature rich fs. But, there are some down
>>> sides to that architecture too.
>>>
>>> We do this today (KVMs with a couple of large RBDs attached via
>>> librbd+QEMU/KVM), but the throughput able to be achieved this way is
>>> nothing like native CephFS - adding more RBDs doesn't seem to help
>>> increase overall throughput. Also, if you have NFS clients you will
>>> absolutely need SSD ZIL. And of course you then have a single point of
>>> failure and downtime for regular updates etc.
>>>
>>> In terms of small file performance I'm interested to hear about
>>> experiences with in-line file storage on the MDS.
>>>
>>> Also, while we're talking about CephFS - what size metadata pools are
>>> people seeing on their production systems with 10s-100s millions of
>>> files?
>>>
>>
>> On a system with 10.1 million files, metadata pool is 60MB
>>
>>
> Unfortunately that's not really an accurate assessment, for good but
> terrible reasons:
> 1) CephFS metadata is principally stored via the omap interface (which is
> designed for handling things like the directory storage CephFS needs)
> 2) omap is implemented via LevelDB/RocksDB
> 3) there is not a good way to determine which pool is responsible for
> which portion of RocksDBs data
> 4) So the pool stats do not incorporate omap data usage at all in their
> reports (it's part of the overall space used, and is one of the things that
> can make that larger than the sum of the per-pool spaces)
>
> You could try and estimate it by looking at how much "lost" space there is
> (and subtracting out journal sizes and things, depending on setup). But I
> promise there's more than 60MB of CephFS metadata for 10.1 million files!
> -Greg
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 答复: How's cephfs going?

2017-07-19 Thread Gregory Farnum
On Wed, Jul 19, 2017 at 10:25 AM David  wrote:

> On Tue, Jul 18, 2017 at 6:54 AM, Blair Bethwaite <
> blair.bethwa...@gmail.com> wrote:
>
>> We are a data-intensive university, with an increasingly large fleet
>> of scientific instruments capturing various types of data (mostly
>> imaging of one kind or another). That data typically needs to be
>> stored, protected, managed, shared, connected/moved to specialised
>> compute for analysis. Given the large variety of use-cases we are
>> being somewhat more circumspect in our CephFS adoption and really only
>> dipping toes in the water, ultimately hoping it will become a
>> long-term default NAS choice from Luminous onwards.
>>
>> On 18 July 2017 at 15:21, Brady Deetz  wrote:
>> > All of that said, you could also consider using rbd and zfs or whatever
>> filesystem you like. That would allow you to gain the benefits of scaleout
>> while still getting a feature rich fs. But, there are some down sides to
>> that architecture too.
>>
>> We do this today (KVMs with a couple of large RBDs attached via
>> librbd+QEMU/KVM), but the throughput able to be achieved this way is
>> nothing like native CephFS - adding more RBDs doesn't seem to help
>> increase overall throughput. Also, if you have NFS clients you will
>> absolutely need SSD ZIL. And of course you then have a single point of
>> failure and downtime for regular updates etc.
>>
>> In terms of small file performance I'm interested to hear about
>> experiences with in-line file storage on the MDS.
>>
>> Also, while we're talking about CephFS - what size metadata pools are
>> people seeing on their production systems with 10s-100s millions of
>> files?
>>
>
> On a system with 10.1 million files, metadata pool is 60MB
>
>
Unfortunately that's not really an accurate assessment, for good but
terrible reasons:
1) CephFS metadata is principally stored via the omap interface (which is
designed for handling things like the directory storage CephFS needs)
2) omap is implemented via LevelDB/RocksDB
3) there is not a good way to determine which pool is responsible for which
portion of RocksDBs data
4) So the pool stats do not incorporate omap data usage at all in their
reports (it's part of the overall space used, and is one of the things that
can make that larger than the sum of the per-pool spaces)

You could try and estimate it by looking at how much "lost" space there is
(and subtracting out journal sizes and things, depending on setup). But I
promise there's more than 60MB of CephFS metadata for 10.1 million files!
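
Another very crude check, assuming FileStore OSDs with the default paths, is to
look directly at the size of each OSD's omap directory on disk - only a rough
lower bound across everything using omap, not a per-pool figure:

$ du -sh /var/lib/ceph/osd/ceph-*/current/omap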
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 答复: How's cephfs going?

2017-07-19 Thread David
On Tue, Jul 18, 2017 at 6:54 AM, Blair Bethwaite 
wrote:

> We are a data-intensive university, with an increasingly large fleet
> of scientific instruments capturing various types of data (mostly
> imaging of one kind or another). That data typically needs to be
> stored, protected, managed, shared, connected/moved to specialised
> compute for analysis. Given the large variety of use-cases we are
> being somewhat more circumspect in our CephFS adoption and really only
> dipping toes in the water, ultimately hoping it will become a
> long-term default NAS choice from Luminous onwards.
>
> On 18 July 2017 at 15:21, Brady Deetz  wrote:
> > All of that said, you could also consider using rbd and zfs or whatever
> filesystem you like. That would allow you to gain the benefits of scaleout
> while still getting a feature rich fs. But, there are some down sides to
> that architecture too.
>
> We do this today (KVMs with a couple of large RBDs attached via
> librbd+QEMU/KVM), but the throughput able to be achieved this way is
> nothing like native CephFS - adding more RBDs doesn't seem to help
> increase overall throughput. Also, if you have NFS clients you will
> absolutely need SSD ZIL. And of course you then have a single point of
> failure and downtime for regular updates etc.
>
> In terms of small file performance I'm interested to hear about
> experiences with in-line file storage on the MDS.
>
> Also, while we're talking about CephFS - what size metadata pools are
> people seeing on their production systems with 10s-100s millions of
> files?
>

On a system with 10.1 million files, metadata pool is 60MB



> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 答复: How's cephfs going?

2017-07-19 Thread David
On Wed, Jul 19, 2017 at 4:47 AM, 许雪寒  wrote:

> Is there anyone else willing to share some usage information of cephfs?
>

I look after 2 CephFS deployments, both Jewel; they've been in production since
Jewel went stable, so just over a year I think. We've had a really positive
experience: I've not experienced any MDS crashes or read-only operation
(touch wood!). The majority of clients are accessing through gateway
servers re-exporting over SMB and NFS. Data is mixed but lots of image
sequences/video.

Workload is primarily large reads but clients are all 1GbE currently
(gateway servers are on faster links) so I'd say our performance
requirements are modest. If/when we get clients on 10GbE we'll probably
need to start looking at performance more closely, definitely playing
around with stripe settings.

As I think someone already mentioned, the recursive stats are awesome. I
use the Python xattr module to grab the stats and format with the
prettytable library; it's a real pleasure to not have to wait for du to
stat through the directory tree. Thinking about doing something cool with
Kibana in the future.
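
If you want the same numbers without Python, they're also exposed directly as
virtual xattrs - the mount point and path below are just an example:

$ getfattr -n ceph.dir.rbytes /mnt/cephfs/some/dir
$ getfattr -n ceph.dir.rentries /mnt/cephfs/some/dir

ceph.dir.rfiles and ceph.dir.rsubdirs work the same way.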

The main issues we've had are with kernel NFS: writes are currently slow in
Jewel, see http://tracker.ceph.com/issues/17563. That was fixed in master (
https://github.com/ceph/ceph/pull/11710) but I don't think that will make
its way into Jewel so I'm eagerly awaiting stable Luminous. I've also
experienced nfsd lock ups when OSDs fail.

Hope that helps a bit

> Could developers tell whether cephfs is a major effort in the whole ceph
> development?
>

I'm not a dev but I can confidently say this is very actively being worked
on.


>
> From: 许雪寒
> Sent: 17 July 2017 11:00
> To: ceph-users@lists.ceph.com
> Subject: How's cephfs going?
>
> Hi, everyone.
>
> We intend to use cephfs of Jewel version, however, we don’t know its
> status. Is it production ready in Jewel? Does it still have lots of bugs?
> Is it a major effort of the current ceph development? And who are using
> cephfs now?
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] best practices for expanding hammer cluster

2017-07-19 Thread David Turner
The main setting you can control is osd_max_backfills. Its default is 1. I
watch iostat on my osds as I slowly increment that setting, to leave enough
overhead on the disks for client activity while the cluster moves all of
its data around.
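
Roughly like this - the value of 2 is just an illustration, not a
recommendation; watch the disks and adjust:

$ iostat -x 5
$ ceph tell osd.* injectargs '--osd-max-backfills 2'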

On Wed, Jul 19, 2017, 11:45 AM Richard Hesketh 
wrote:

> In my case my cluster is under very little active load and so I have never
> had to be concerned about recovery operations impacting on client traffic.
> In fact, I generally tune up from the defaults (increase osd max backfills)
> to improve recovery speed when I'm doing major changes, because there's
> plenty of spare capacity in the cluster; and either way I'm in the
> fortunate position where I can place a higher value on having a HEALTH_OK
> cluster ASAP than on the client I/O being consistent.
>
> Rich
>
> On 19/07/17 16:27, Laszlo Budai wrote:
> > Hi Rich,
> >
> > Thank you for your answer. This is good news to hear :)
> > Regarding the reconfiguration you've done: if I understand correctly,
> you have changed it all at once (like download the crush map, edit it - add
> all the new OSDs, and upload the new map to the cluster). How did you
> controlled the impact of the recovery/refilling operation on your clients'
> data traffic? What setting have you used to avoid slow requests?
> >
> > Kind regards,
> > Laszlo
> >
> >
> > On 19.07.2017 17:40, Richard Hesketh wrote:
> >> On 19/07/17 15:14, Laszlo Budai wrote:
> >>> Hi David,
> >>>
> >>> Thank you for that reference about CRUSH. It's a nice one.
> >>> There I could read about expanding the cluster, but in one of my cases
> we want to do more: we want to move from host failure domain to chassis
> failure domain. Our concern is: how will ceph behave for those PGs where
> all the three replicas currently are in the same chassis? Because in this
> case according to the new CRUSH map two replicas are in the wrong place.
> >>>
> >>> Kind regards,
> >>> Laszlo
> >>
> >> Changing crush rules resulting in PGs being remapped works exactly the
> same way as changes in crush weights causing remapped data. The PGs will be
> remapped in accordance with the new crushmap/rules and then recovery
> operations will copy them over to the new OSDs as usual. Even if a PG is
> entirely remapped, the OSDs that were originally hosting it will operate as
> an acting set and continue to serve I/O and replicate data until copies on
> the new OSDs are ready to take over - ceph won't throw an upset because the
> acting set doesn't comply with the crush rules. I have done, for instance,
> a crush rule change which resulted in an entire pool being entirely
> remapped - switching the cephfs metadata pool from an HDD root to an SSD
> root rule, so every single PG was moved to a completely different set of
> OSDs - and it all continued to work fine while recovery took place.
> >>
> >> Rich
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] iSCSI production ready?

2017-07-19 Thread Alex Gorbachev
On Sat, Jul 15, 2017 at 11:02 PM Alvaro Soto  wrote:

> Hi guys,
> does anyone know in what release the iSCSI interface is going
> to be production ready, if it isn't already?
>
> I mean without the use of a gateway, like a different endpoint connector
> to a CEPH cluster.
>

We very successfully use SCST with Pacemaker HA.



> Thanks in advance.
> Best.
>
> --
>
> ATTE. Alvaro Soto Escobar
>
> --
> Great people talk about ideas,
> average people talk about things,
> small people talk ... about other people.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
-- 
--
Alex Gorbachev
Storcium
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] pgs not deep-scrubbed for 86400

2017-07-19 Thread Roger Brown
I just upgraded from Luminous 12.1.0 to 12.1.1 and was greeted with this
new "pgs not deep-scrubbed for" warning. Should this resolve itself, or
should I get scrubbing?

$ ceph health detail
HEALTH_WARN 4 pgs not deep-scrubbed for 86400; 15 pgs not scrubbed for 86400
PG_NOT_DEEP_SCRUBBED 4 pgs not deep-scrubbed for 86400
pg 85.0 not deep-scrubbed since 2017-07-12 09:08:04.611006
pg 85.6 not deep-scrubbed since 2017-07-12 07:06:09.706367
pg 91.0 not deep-scrubbed since 2017-07-12 06:49:35.241460
pg 100.2 not deep-scrubbed since 2017-07-11 22:50:05.534546
PG_NOT_SCRUBBED 15 pgs not scrubbed for 86400
pg 101.7 not scrubbed since 2017-07-18 08:24:03.332885
pg 97.1 not scrubbed since 2017-07-18 06:24:20.317393
pg 96.0 not scrubbed since 2017-07-18 04:32:11.542037
pg 98.2 not scrubbed since 2017-07-18 02:19:48.638410
pg 85.0 not scrubbed since 2017-07-18 03:12:32.014656
pg 83.3 not scrubbed since 2017-07-18 07:55:48.96
pg 83.4 not scrubbed since 2017-07-18 10:14:33.595142
pg 91.2 not scrubbed since 2017-07-18 06:35:24.338351
pg 90.0 not scrubbed since 2017-07-18 05:48:33.604003
pg 89.3 not scrubbed since 2017-07-18 08:21:03.806888
pg 92.7 not scrubbed since 2017-07-18 09:20:27.170389
pg 91.7 not scrubbed since 2017-07-18 10:22:41.240209
pg 93.2 not scrubbed since 2017-07-18 09:24:26.186475
pg 89.6 not scrubbed since 2017-07-18 05:37:45.016260
pg 100.3 not scrubbed since 2017-07-18 09:53:10.459155
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph kraken: Calamari Centos7

2017-07-19 Thread Oscar Segarra
Hi,

Has anybody been able to set up Calamari on CentOS 7?

I've done a lot of Googling but haven't found any good documentation...

The command "ceph-deploy calamari connect"  does not work!

Thanks a lot for your help!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Problems getting nfs-ganesha with cephfs backend to work.

2017-07-19 Thread Daniel Gryniewicz

On 07/19/2017 05:27 AM, Micha Krause wrote:

Hi,

Ganesha version 2.5.0.1 from the nfs-ganesha repo hosted on 
download.ceph.com 


I didn't know about that repo, and compiled ganesha myself. The 
developers in the #ganesha IRC channel pointed me to

the libcephfs version.
After recompiling ganesha with a kraken libcephfs instead of a jewel 
version both errors went away.


I'm sure using a compiled Version from the repo you mention would have 
worked out of the box.


Micha Krause



These packages aren't quite ready for use yet, the packaging work is 
still underway.  CCing Ali, who's doing the work.


Daniel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How's cephfs going?

2017-07-19 Thread Anish Gupta
Hello,

Can anyone share their experience with the built-in FSCache support with or
without CephFS?

Interested in knowing the following:
- Are you using FSCache in a production environment?
- How large is your Ceph deployment?
- If with CephFS, how many Ceph clients are using FSCache?
- Which version of Ceph and Linux kernel?

Thank you.
Anish Gupta


On Wednesday, July 19, 2017, 6:06:57 AM PDT, Donny Davis  
wrote:

I had a corruption issue with the FUSE client on Jewel. I use CephFS for a 
samba share with a light load, and I was using the FUSE client. I had a power 
flap and didn't realize my UPS batteries had gone bad, so the MDS servers were 
cycled a couple of times and somehow the file system had become corrupted. I 
moved to the kernel client and after the FUSE experience I put it through 
horrible things. 
I had every client connected start copying over their user profiles, and then I 
started pulling and restarting MDS servers. I saw very few errors, and only 
blips in the copy processes. My experience with the kernel client has been very 
positive and I would say stable. Nothing replaces a solid backup copy of your 
data if you care about it. 
I am still currently on Jewel, and my CephFS is daily driven and I can barely 
notice the difference between it and the past setups I have had. 



On Wed, Jul 19, 2017 at 7:02 AM, Дмитрий Глушенок  wrote:

Unfortunately no. Using FUSE was discarded due to poor performance.

On 19 July 2017 at 13:45, Blair Bethwaite wrote:
Interesting. Any FUSE client data-points?

On 19 July 2017 at 20:21, Дмитрий Глушенок  wrote:

RBD (via krbd) was in action at the same time - no problems.

On 19 July 2017 at 12:54, Blair Bethwaite wrote:

It would be worthwhile repeating the first test (crashing/killing an
OSD host) again with just plain rados clients (e.g. rados bench)
and/or rbd. It's not clear whether your issue is specifically related
to CephFS or actually something else.

Cheers,

On 19 July 2017 at 19:32, Дмитрий Глушенок  wrote:

Hi,

I can share negative test results (on Jewel 10.2.6). All tests were
performed while actively writing to CephFS from single client (about 1300
MB/sec). Cluster consists of 8 nodes, 8 OSD each (2 SSD for journals and
metadata, 6 HDD RAID6 for data), MON/MDS are on dedicated nodes. 2 MDS in
total, active/standby.
- Crashing one node resulted in write hangs for 17 minutes. Repeating the
test resulted in CephFS hangs forever.
- Restarting active MDS resulted in successful failover to standby. Then,
after standby became active and the restarted MDS became standby the new
active was restarted. CephFS hanged for 12 minutes.

P.S. Planning to repeat the tests again on 10.2.7 or higher

On 19 July 2017 at 6:47, 许雪寒 wrote:

Is there anyone else willing to share some usage information of cephfs?
Could developers tell whether cephfs is a major effort in the whole ceph
development?

From: 许雪寒
Sent: 17 July 2017 11:00
To: ceph-users@lists.ceph.com
Subject: How's cephfs going?

Hi, everyone.

We intend to use cephfs of Jewel version, however, we don’t know its status.
Is it production ready in Jewel? Does it still have lots of bugs? Is it a
major effort of the current ceph development? And who are using cephfs now?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


--
Dmitry Glushenok
Jet Infosystems


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




--
Cheers,
~Blairo


--
Dmitry Glushenok
Jet Infosystems





-- 
Cheers,
~Blairo


--
Dmitry Glushenok
Jet Infosystems


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] best practices for expanding hammer cluster

2017-07-19 Thread Richard Hesketh
In my case my cluster is under very little active load and so I have never had 
to be concerned about recovery operations impacting on client traffic. In fact, 
I generally tune up from the defaults (increase osd max backfills) to improve 
recovery speed when I'm doing major changes, because there's plenty of spare 
capacity in the cluster; and either way I'm in the fortunate position where I 
can place a higher value on having a HEALTH_OK cluster ASAP than on the client 
I/O being consistent.

Rich

On 19/07/17 16:27, Laszlo Budai wrote:
> Hi Rich,
> 
> Thank you for your answer. This is good news to hear :)
> Regarding the reconfiguration you've done: if I understand correctly, you 
> have changed it all at once (like download the crush map, edit it - add all 
> the new OSDs, and upload the new map to the cluster). How did you control 
> the impact of the recovery/refilling operation on your clients' data traffic? 
> What setting have you used to avoid slow requests?
> 
> Kind regards,
> Laszlo
> 
> 
> On 19.07.2017 17:40, Richard Hesketh wrote:
>> On 19/07/17 15:14, Laszlo Budai wrote:
>>> Hi David,
>>>
>>> Thank you for that reference about CRUSH. It's a nice one.
>>> There I could read about expanding the cluster, but in one of my cases we 
>>> want to do more: we want to move from host failure domain to chassis 
>>> failure domain. Our concern is: how will ceph behave for those PGs where 
>>> all the three replicas currently are in the same chassis? Because in this 
>>> case according to the new CRUSH map two replicas are in the wrong place.
>>>
>>> Kind regards,
>>> Laszlo
>>
>> Changing crush rules resulting in PGs being remapped works exactly the same 
>> way as changes in crush weights causing remapped data. The PGs will be 
>> remapped in accordance with the new crushmap/rules and then recovery 
>> operations will copy them over to the new OSDs as usual. Even if a PG is 
>> entirely remapped, the OSDs that were originally hosting it will operate as 
>> an acting set and continue to serve I/O and replicate data until copies on 
>> the new OSDs are ready to take over - ceph won't throw an upset because the 
>> acting set doesn't comply with the crush rules. I have done, for instance, a 
>> crush rule change which resulted in an entire pool being entirely remapped - 
>> switching the cephfs metadata pool from an HDD root to an SSD root rule, so 
>> every single PG was moved to a completely different set of OSDs - and it all 
>> continued to work fine while recovery took place.
>>
>> Rich



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Adding multiple osd's to an active cluster

2017-07-19 Thread Peter Gervai
On Fri, Feb 17, 2017 at 10:42 AM, nigel davies  wrote:

> How is the best way to added multiple osd's to an active cluster?
> As the last time i done this i all most killed the VM's we had running on
> the cluster

You possibly mean that messing with OSDs caused the cluster to
reorganise the data and the recovery/backfill slowed you down.

If that's the case see --osd-max-backfills and
--osd-recovery-max-active options, and use them like
$ ceph tell osd.* injectargs '--osd-max-backfills 1' '--osd-recovery-max-active 1'
but be aware that slowing down recovery makes it take longer, and a longer
recovery means a longer time in a dangerous state for the cluster (in case of
any unfortunate event while recovering).

g
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] best practices for expanding hammer cluster

2017-07-19 Thread Laszlo Budai

Hi Rich,

Thank you for your answer. This is good news to hear :)
Regarding the reconfiguration you've done: if I understand correctly, you have 
changed it all at once (like download the crush map, edit it - add all the new 
OSDs, and upload the new map to the cluster). How did you controlled the impact 
of the recovery/refilling operation on your clients' data traffic? What setting 
have you used to avoid slow requests?

Kind regards,
Laszlo


On 19.07.2017 17:40, Richard Hesketh wrote:

On 19/07/17 15:14, Laszlo Budai wrote:

Hi David,

Thank you for that reference about CRUSH. It's a nice one.
There I could read about expanding the cluster, but in one of my cases we want 
to do more: we want to move from host failure domain to chassis failure domain. 
Our concern is: how will ceph behave for those PGs where all the three replicas 
currently are in the same chassis? Because in this case according to the new 
CRUSH map two replicas are in the wrong place.

Kind regards,
Laszlo


Changing crush rules resulting in PGs being remapped works exactly the same way 
as changes in crush weights causing remapped data. The PGs will be remapped in 
accordance with the new crushmap/rules and then recovery operations will copy 
them over to the new OSDs as usual. Even if a PG is entirely remapped, the OSDs 
that were originally hosting it will operate as an acting set and continue to 
serve I/O and replicate data until copies on the new OSDs are ready to take 
over - ceph won't throw an upset because the acting set doesn't comply with the 
crush rules. I have done, for instance, a crush rule change which resulted in 
an entire pool being entirely remapped - switching the cephfs metadata pool 
from an HDD root to an SSD root rule, so every single PG was moved to a 
completely different set of OSDs - and it all continued to work fine while 
recovery took place.

Rich



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Writing data to pools other than filesystem

2017-07-19 Thread c . monty
Hello!

I want to organize data in pools and therefore created additional pools:
ceph osd lspools
0 rbd,1 templates,2 hdb-backup,3 cephfs_data,4 cephfs_metadata,

As you can see, pools "cephfs_data" and "cephfs_metadata" belong to a Ceph 
filesystem.

Question:
How can I write data to other pools, e.g. hdb-backup?

THX
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] best practices for expanding hammer cluster

2017-07-19 Thread Richard Hesketh
On 19/07/17 15:14, Laszlo Budai wrote:
> Hi David,
> 
> Thank you for that reference about CRUSH. It's a nice one.
> There I could read about expanding the cluster, but in one of my cases we 
> want to do more: we want to move from host failure domain to chassis failure 
> domain. Our concern is: how will ceph behave for those PGs where all the 
> three replicas currently are in the same chassis? Because in this case 
> according to the new CRUSH map two replicas are in the wrong place.
> 
> Kind regards,
> Laszlo

Changing crush rules resulting in PGs being remapped works exactly the same way 
as changes in crush weights causing remapped data. The PGs will be remapped in 
accordance with the new crushmap/rules and then recovery operations will copy 
them over to the new OSDs as usual. Even if a PG is entirely remapped, the OSDs 
that were originally hosting it will operate as an acting set and continue to 
serve I/O and replicate data until copies on the new OSDs are ready to take 
over - ceph won't throw an upset because the acting set doesn't comply with the 
crush rules. I have done, for instance, a crush rule change which resulted in 
an entire pool being entirely remapped - switching the cephfs metadata pool 
from an HDD root to an SSD root rule, so every single PG was moved to a 
completely different set of OSDs - and it all continued to work fine while 
recovery took place.
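
For reference, the switch itself is a one-liner - the pool name and rule number
here are only illustrative, and note the pool option is called crush_ruleset
pre-Luminous (it becomes crush_rule in Luminous):

$ ceph osd pool set cephfs_metadata crush_ruleset 1

Everything after that is just normal recovery traffic.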

Rich



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] undersized pgs after removing smaller OSDs

2017-07-19 Thread Roger Brown
David,

So as I look at logs, it was originally 9.0956 for the 10TB drives and
0.9096 for the 1TB drives.

# zgrep -i weight /var/log/ceph/*.log*gz
/var/log/ceph/ceph.audit.log.4.gz:...cmd=[{"prefix": "osd crush
create-or-move", "id": 4, "weight":9.0956,...
/var/log/ceph/ceph.audit.log.4.gz:...cmd=[{"prefix": "osd crush
create-or-move", "id": 1, "weight":0.9096,...

With that, I updated the crushmap with the 9.0956 weights for 10TB drives:

$ ceph osd tree
ID WEIGHT   TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 27.28679 root default
-5  9.09560 host osd1
 3  9.09560 osd.3  up  1.0  1.0
-6  9.09560 host osd2
 4  9.09560 osd.4  up  1.0  1.0
-2  9.09560 host osd3
 0  9.09560 osd.0  up  1.0  1.0
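
For anyone finding this later: the same change can also be made without
hand-editing the map, e.g. (osd ids as in the tree above):

$ ceph osd crush reweight osd.3 9.09560
$ ceph osd crush reweight osd.4 9.09560
$ ceph osd crush reweight osd.0 9.09560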

Thanks much!

Roger


On Wed, Jul 19, 2017 at 7:34 AM David Turner  wrote:

> I would go with the weight that was originally assigned to them. That way
> it is in line with what new osds will be weighted.
>
> On Wed, Jul 19, 2017, 9:17 AM Roger Brown  wrote:
>
>> David,
>>
>> Thank you. I have it currently as...
>>
>> $ ceph osd df
>> ID WEIGHT   REWEIGHT SIZE   USE    AVAIL  %USE VAR  PGS
>>  3 10.0  1.0  9313G 44404M  9270G 0.47 1.00 372
>>  4 10.0  1.0  9313G 46933M  9268G 0.49 1.06 372
>>  0 10.0  1.0  9313G 41283M  9273G 0.43 0.93 372
>>TOTAL 27941G   129G 27812G 0.46
>> MIN/MAX VAR: 0.93/1.06  STDDEV: 0.02
>>
>> The above output shows size not as 10TB but as 9313G. So should I
>> reweight each as 9.313? Or as the TiB value 9.09560?
>>
>>
>> On Tue, Jul 18, 2017 at 11:18 PM David Turner 
>> wrote:
>>
>>> I would recommend sticking with the weight of 9.09560 for the osds as
>>> that is the TiB size of the osds that ceph defaults to, as opposed to the TB
>>> size of the osds. New osds will have their weights based on the TiB value.
>>> What is your `ceph osd df` output just to see what things look like?
>>> Hopefully very healthy.
>>>
>>> On Tue, Jul 18, 2017, 11:16 PM Roger Brown 
>>> wrote:
>>>
 Resolution confirmed!

 $ ceph -s
   cluster:
 id: eea7b78c-b138-40fc-9f3e-3d77afb770f0
 health: HEALTH_OK

   services:
 mon: 3 daemons, quorum desktop,mon1,nuc2
 mgr: desktop(active), standbys: mon1
 osd: 3 osds: 3 up, 3 in

   data:
 pools:   19 pools, 372 pgs
 objects: 54243 objects, 71722 MB
 usage:   129 GB used, 27812 GB / 27941 GB avail
 pgs: 372 active+clean


 On Tue, Jul 18, 2017 at 8:47 PM Roger Brown 
 wrote:

> Ah, that was the problem!
>
> So I edited the crushmap (
> http://docs.ceph.com/docs/master/rados/operations/crush-map/) with a
> weight of 10.000 for all three 10TB OSD hosts. The instant result was all
> those pgs with only 2 OSDs were replaced with 3 OSDs while the cluster
> started rebalancing the data. I trust it will complete with time and I'll
> be good to go!
>
> New OSD tree:
> $ ceph osd tree
> ID WEIGHT   TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
> -1 30.0 root default
> -5 10.0 host osd1
>  3 10.0 osd.3  up  1.0  1.0
> -6 10.0 host osd2
>  4 10.0 osd.4  up  1.0  1.0
> -2 10.0 host osd3
>  0 10.0 osd.0  up  1.0  1.0
>
> Kudos to Brad Hubbard for steering me in the right direction!
>
>
> On Tue, Jul 18, 2017 at 8:27 PM Brad Hubbard 
> wrote:
>
>> ID WEIGHT   TYPE NAME
>> -5  1.0 host osd1
>> -6  9.09560 host osd2
>> -2  9.09560 host osd3
>>
>> The weight allocated to host "osd1" should presumably be the same as
>> the other two hosts?
>>
>> Dump your crushmap and take a good look at it, specifically the
>> weighting of "osd1".
>>
>>
>> On Wed, Jul 19, 2017 at 11:48 AM, Roger Brown 
>> wrote:
>> > I also tried ceph pg query, but it gave no helpful recommendations
>> for any
>> > of the stuck pgs.
>> >
>> >
>> > On Tue, Jul 18, 2017 at 7:45 PM Roger Brown 
>> wrote:
>> >>
>> >> Problem:
>> >> I have some pgs with only two OSDs instead of 3 like all the other
>> pgs
>> >> have. This is causing active+undersized+degraded status.
>> >>
>> >> History:
>> >> 1. I started with 3 hosts, each with 1 OSD process (min_size 2)
>> for a 1TB
>> >> drive.
>> >> 2. Added 3 more hosts, each with 1 OSD process for a 10TB drive.
>> >> 3. Removed the original 3 1TB OSD hosts from the osd tree
>> (reweight 0,
>> >> wait, stop, remove, del osd, rm).
>> >> 4. The last OSD to be 

Re: [ceph-users] best practices for expanding hammer cluster

2017-07-19 Thread Laszlo Budai

Hi David,

Thank you for that reference about CRUSH. It's a nice one.
There I could read about expanding the cluster, but in one of my cases we want 
to do more: we want to move from host failure domain to chassis failure domain. 
Our concern is: how will ceph behave for those PGs where all three replicas 
are currently in the same chassis? Because in this case, according to the new 
CRUSH map, two replicas are in the wrong place.

Kind regards,
Laszlo


On 19.07.2017 14:06, David Turner wrote:

One of the things you need to be aware of when doing this is that the crush map 
is, more or less, ignorant of your network setup.  You can configure your 
crush map with racks, datacenters, etc, but it has no idea where anything is. 
You have to tell it. You can use placement rules to help when adding things to 
the cluster, but for now you just need to create the buckets and move the hosts 
into them.
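
For example, something along these lines - bucket and host names here are made
up, adjust to your layout:

$ ceph osd crush add-bucket chassis1 chassis
$ ceph osd crush move chassis1 root=default
$ ceph osd crush move node1 chassis=chassis1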

Sage explains a lot of the crush map here.

https://www.slideshare.net/mobile/sageweil1/a-crash-course-in-crush


On Wed, Jul 19, 2017, 2:43 AM Laszlo Budai wrote:

Hi David,

thank you for pointing this out. Google wasn't able to find it ...

As far as I understand that thread is talking about a situation when you 
add hosts to an existing CRUSH bucket. That sounds good, and probably that will 
be our solution for cluster2.
I wonder whether there are any recommendations how to perform a migration 
from a CRUSH map that only has OSD-host-root to a new one that has 
OSD-host-chassis-root?

Thank you,
Laszlo


On 18.07.2017 20:05, David Turner wrote:
 > This was recently covered on the mailing list. I believe this will cover 
all of your questions.
 >
 > https://www.spinics.net/lists/ceph-users/msg37252.html
 >
 >
 > On Tue, Jul 18, 2017, 9:07 AM Laszlo Budai wrote:
 >
 > Dear all,
 >
 > we are planning to add new hosts to our existing hammer clusters, 
and I'm looking for best practices recommendations.
 >
 > currently we have 2 clusters with 72 OSDs and 6 nodes each. We want 
to add 3 more nodes (36 OSDs) to each cluster, and we have some questions about 
what would be the best way to do it. Currently the two clusters have different 
CRUSH maps.
 >
 > Cluster 1
 > The CRUSH map only has OSDs, hosts and the root bucket. Failure 
domain is host.
 > Our final desired state would be:
 > OSD - hosts - chassis - root where each chassis has 3 hosts, each 
host has 12 OSDs, and the failure domain would be chassis.
 >
 > What would be the recommended way to achieve this without downtime 
for client operations?
 > I have read about the possibility to throttle down the 
recovery/backfill using
 > osd max backfills = 1
 > osd recovery max active = 1
 > osd recovery max single start = 1
 > osd recovery op priority = 1
 > osd recovery threads = 1
 > osd backfill scan max = 16
 > osd backfill scan min = 4
 >
 > but we wonder about the situation when, in a worst case scenario, 
all the replicas belonging to one pg have to be migrated to new locations 
according to the new CRUSH map. How will ceph behave in such situation?
 >
 >
 > Cluster 2
 > the crush map already contains chassis. Currently we have 3 chassis 
(c1, c2, c3) and 6 hosts:
 > - x1, x2 in chassis c1
 > - y1, y2 in chassis c2
 > - x3, y3 in chassis c3
 >
 > We are adding hosts z1, z2, z3 and our desired CRUSH map would look 
like this:
 > - x1, x2, x3 in c1
 > - y1, y2, y3 in c2
 > - z1, z2, z3 in c3
 >
 > Again, what would be the recommended way to achieve this while the 
clients are still accessing the data?
 >
 > Is it safe to add more OSDs at a time? or we should add them one by 
one?
 >
 > Thank you in advance for any suggestions, recommendations.
 >
 > Kind regards,
 > Laszlo
 > ___
 > ceph-users mailing list
 > ceph-users@lists.ceph.com  
>
 > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 >


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] undersized pgs after removing smaller OSDs

2017-07-19 Thread David Turner
I would go with the weight that was originally assigned to them. That way
it is in line with what new osds will be weighted.
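
For example, with the osd ids from the tree below, a minimal sketch would be:

  ceph osd crush reweight osd.3 9.09560
  ceph osd crush reweight osd.4 9.09560
  ceph osd crush reweight osd.0 9.09560

Each reweight moves a bit of data around, so doing them one at a time and
letting the cluster settle in between is the gentler route.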

On Wed, Jul 19, 2017, 9:17 AM Roger Brown  wrote:

> David,
>
> Thank you. I have it currently as...
>
> $ ceph osd df
> ID WEIGHT   REWEIGHT SIZE   USE    AVAIL  %USE VAR  PGS
>  3 10.0  1.0  9313G 44404M  9270G 0.47 1.00 372
>  4 10.0  1.0  9313G 46933M  9268G 0.49 1.06 372
>  0 10.0  1.0  9313G 41283M  9273G 0.43 0.93 372
>TOTAL 27941G   129G 27812G 0.46
> MIN/MAX VAR: 0.93/1.06  STDDEV: 0.02
>
> The above output shows size not as 10TB but as 9313G. So should I reweight
> each as 9.313? Or as the TiB value 9.09560?
>
>
> On Tue, Jul 18, 2017 at 11:18 PM David Turner 
> wrote:
>
>> I would recommend sticking with the weight of 9.09560 for the osds as that
>> is the TiB size of the osds that ceph defaults to, as opposed to the TB size
>> of the osds. New osds will have their weights based on the TiB value. What
>> is your `ceph osd df` output just to see what things look like? Hopefully
>> very healthy.
>>
>> On Tue, Jul 18, 2017, 11:16 PM Roger Brown  wrote:
>>
>>> Resolution confirmed!
>>>
>>> $ ceph -s
>>>   cluster:
>>> id: eea7b78c-b138-40fc-9f3e-3d77afb770f0
>>> health: HEALTH_OK
>>>
>>>   services:
>>> mon: 3 daemons, quorum desktop,mon1,nuc2
>>> mgr: desktop(active), standbys: mon1
>>> osd: 3 osds: 3 up, 3 in
>>>
>>>   data:
>>> pools:   19 pools, 372 pgs
>>> objects: 54243 objects, 71722 MB
>>> usage:   129 GB used, 27812 GB / 27941 GB avail
>>> pgs: 372 active+clean
>>>
>>>
>>> On Tue, Jul 18, 2017 at 8:47 PM Roger Brown 
>>> wrote:
>>>
 Ah, that was the problem!

 So I edited the crushmap (
 http://docs.ceph.com/docs/master/rados/operations/crush-map/) with a
 weight of 10.000 for all three 10TB OSD hosts. The instant result was all
 those pgs with only 2 OSDs were replaced with 3 OSDs while the cluster
 started rebalancing the data. I trust it will complete with time and I'll
 be good to go!

 New OSD tree:
 $ ceph osd tree
 ID WEIGHT   TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
 -1 30.0 root default
 -5 10.0 host osd1
  3 10.0 osd.3  up  1.0  1.0
 -6 10.0 host osd2
  4 10.0 osd.4  up  1.0  1.0
 -2 10.0 host osd3
  0 10.0 osd.0  up  1.0  1.0

 Kudos to Brad Hubbard for steering me in the right direction!


 On Tue, Jul 18, 2017 at 8:27 PM Brad Hubbard 
 wrote:

> ID WEIGHT   TYPE NAME
> -5  1.0 host osd1
> -6  9.09560 host osd2
> -2  9.09560 host osd3
>
> The weight allocated to host "osd1" should presumably be the same as
> the other two hosts?
>
> Dump your crushmap and take a good look at it, specifically the
> weighting of "osd1".
>
>
> On Wed, Jul 19, 2017 at 11:48 AM, Roger Brown 
> wrote:
> > I also tried ceph pg query, but it gave no helpful recommendations
> for any
> > of the stuck pgs.
> >
> >
> > On Tue, Jul 18, 2017 at 7:45 PM Roger Brown 
> wrote:
> >>
> >> Problem:
> >> I have some pgs with only two OSDs instead of 3 like all the other
> pgs
> >> have. This is causing active+undersized+degraded status.
> >>
> >> History:
> >> 1. I started with 3 hosts, each with 1 OSD process (min_size 2) for
> a 1TB
> >> drive.
> >> 2. Added 3 more hosts, each with 1 OSD process for a 10TB drive.
> >> 3. Removed the original 3 1TB OSD hosts from the osd tree (reweight
> 0,
> >> wait, stop, remove, del osd, rm).
> >> 4. The last OSD to be removed would never return to active+clean
> after
> >> reweight 0. It returned undersized instead, but I went on with
> removal
> >> anyway, leaving me stuck with 5 undersized pgs.
> >>
> >> Things tried that didn't help:
> >> * give it time to go away on its own
> >> * Replace replicated default.rgw.buckets.data pool with
> erasure-code 2+1
> >> version.
> >> * ceph osd lost 1 (and 2)
> >> * ceph pg repair (pgs from dump_stuck)
> >> * googled 'ceph pg undersized' and similar searches for help.
> >>
> >> Current status:
> >> $ ceph osd tree
> >> ID WEIGHT   TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
> >> -1 19.19119 root default
> >> -5  1.0 host osd1
> >>  3  1.0 osd.3  up  1.0  1.0
> >> -6  9.09560 host osd2
> >>  4  9.09560 osd.4  up  1.0  1.0
> >> -2  9.09560 host osd3
> >>  0  9.09560 osd.0  up  1.0  1.0

Re: [ceph-users] undersized pgs after removing smaller OSDs

2017-07-19 Thread Roger Brown
David,

Thank you. I have it currently as...

$ ceph osd df
ID WEIGHT   REWEIGHT SIZE   USE    AVAIL  %USE VAR  PGS
 3 10.0  1.0  9313G 44404M  9270G 0.47 1.00 372
 4 10.0  1.0  9313G 46933M  9268G 0.49 1.06 372
 0 10.0  1.0  9313G 41283M  9273G 0.43 0.93 372
   TOTAL 27941G   129G 27812G 0.46
MIN/MAX VAR: 0.93/1.06  STDDEV: 0.02

The above output shows size not as 10TB but as 9313G. So should I reweight
each as 9.313? Or as the TiB value 9.09560?


On Tue, Jul 18, 2017 at 11:18 PM David Turner  wrote:

> I would recommend sticking with the weight of 9.09560 for the osds as that
> is the TiB size of the osds that ceph defaults to, as opposed to the TB size
> of the osds. New osds will have their weights based on the TiB value. What
> is your `ceph osd df` output just to see what things look like? Hopefully
> very healthy.
>
> On Tue, Jul 18, 2017, 11:16 PM Roger Brown  wrote:
>
>> Resolution confirmed!
>>
>> $ ceph -s
>>   cluster:
>> id: eea7b78c-b138-40fc-9f3e-3d77afb770f0
>> health: HEALTH_OK
>>
>>   services:
>> mon: 3 daemons, quorum desktop,mon1,nuc2
>> mgr: desktop(active), standbys: mon1
>> osd: 3 osds: 3 up, 3 in
>>
>>   data:
>> pools:   19 pools, 372 pgs
>> objects: 54243 objects, 71722 MB
>> usage:   129 GB used, 27812 GB / 27941 GB avail
>> pgs: 372 active+clean
>>
>>
>> On Tue, Jul 18, 2017 at 8:47 PM Roger Brown 
>> wrote:
>>
>>> Ah, that was the problem!
>>>
>>> So I edited the crushmap (
>>> http://docs.ceph.com/docs/master/rados/operations/crush-map/) with a
>>> weight of 10.000 for all three 10TB OSD hosts. The instant result was all
>>> those pgs with only 2 OSDs were replaced with 3 OSDs while the cluster
>>> started rebalancing the data. I trust it will complete with time and I'll
>>> be good to go!
>>>
>>> New OSD tree:
>>> $ ceph osd tree
>>> ID WEIGHT   TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
>>> -1 30.0 root default
>>> -5 10.0 host osd1
>>>  3 10.0 osd.3  up  1.0  1.0
>>> -6 10.0 host osd2
>>>  4 10.0 osd.4  up  1.0  1.0
>>> -2 10.0 host osd3
>>>  0 10.0 osd.0  up  1.0  1.0
>>>
>>> Kudos to Brad Hubbard for steering me in the right direction!
>>>
>>>
>>> On Tue, Jul 18, 2017 at 8:27 PM Brad Hubbard 
>>> wrote:
>>>
 ID WEIGHT   TYPE NAME
 -5  1.0 host osd1
 -6  9.09560 host osd2
 -2  9.09560 host osd3

 The weight allocated to host "osd1" should presumably be the same as
 the other two hosts?

 Dump your crushmap and take a good look at it, specifically the
 weighting of "osd1".


 On Wed, Jul 19, 2017 at 11:48 AM, Roger Brown 
 wrote:
 > I also tried ceph pg query, but it gave no helpful recommendations
 for any
 > of the stuck pgs.
 >
 >
 > On Tue, Jul 18, 2017 at 7:45 PM Roger Brown 
 wrote:
 >>
 >> Problem:
 >> I have some pgs with only two OSDs instead of 3 like all the other
 pgs
 >> have. This is causing active+undersized+degraded status.
 >>
 >> History:
 >> 1. I started with 3 hosts, each with 1 OSD process (min_size 2) for
 a 1TB
 >> drive.
 >> 2. Added 3 more hosts, each with 1 OSD process for a 10TB drive.
 >> 3. Removed the original 3 1TB OSD hosts from the osd tree (reweight
 0,
 >> wait, stop, remove, del osd, rm).
 >> 4. The last OSD to be removed would never return to active+clean
 after
 >> reweight 0. It returned undersized instead, but I went on with
 removal
 >> anyway, leaving me stuck with 5 undersized pgs.
 >>
 >> Things tried that didn't help:
 >> * give it time to go away on its own
 >> * Replace replicated default.rgw.buckets.data pool with erasure-code
 2+1
 >> version.
 >> * ceph osd lost 1 (and 2)
 >> * ceph pg repair (pgs from dump_stuck)
 >> * googled 'ceph pg undersized' and similar searches for help.
 >>
 >> Current status:
 >> $ ceph osd tree
 >> ID WEIGHT   TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
 >> -1 19.19119 root default
 >> -5  1.0 host osd1
 >>  3  1.0 osd.3  up  1.0  1.0
 >> -6  9.09560 host osd2
 >>  4  9.09560 osd.4  up  1.0  1.0
 >> -2  9.09560 host osd3
 >>  0  9.09560 osd.0  up  1.0  1.0
 >> $ ceph pg dump_stuck
 >> ok
 >> PG_STAT STATE  UPUP_PRIMARY ACTING
 ACTING_PRIMARY
 >> 88.3active+undersized+degraded [4,0]  4  [4,0]
 4
 >> 97.3active+undersized+degraded [4,0]  4  [4,0]
 4
 >> 85.6active+undersized+degraded [4,0]  4  [4,0]

Re: [ceph-users] iSCSI production ready?

2017-07-19 Thread Lenz Grimmer
On 07/17/2017 10:15 PM, Alvaro Soto wrote:

> The second part, nevermind, now I see that the solution is to use
> the TCMU daemon. I was thinking of an out-of-the-box iSCSI endpoint
> directly from Ceph, sorry, I don't have too much expertise in this area.

There is no "native" iSCSI support built into Ceph directly. In addition
to TCMU mentioned by Jason, there also is "lrbd":
https://github.com/SUSE/lrbd

Both use Ceph RBDs as storage format in the background.

Lenz



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How's cephfs going?

2017-07-19 Thread Donny Davis
I had a corruption issue with the FUSE client on Jewel. I use CephFS for a
samba share with a light load, and I was using the FUSE client. I had a
power flap and didn't realize my UPS batteries had gone bad, so the MDS
servers were cycled a couple of times and somehow the file system became
corrupted. I moved to the kernel client, and after the FUSE experience I put
it through horrible things.

I had every client connected start copying over their user profiles, and
then I started pulling and restarting MDS servers. I saw very few errors,
and only blips in the copy processes. My experience with the kernel client
has been very positive and I would say stable. Nothing replaces a solid
backup copy of your data if you care about it.

I am still currently on Jewel, and my CephFS is driven daily; I can barely
notice any difference between it and the past setups I have had.
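
For reference, the two clients are mounted along these lines (mon address,
paths and keyring location are placeholders):

  # kernel client
  mount -t ceph 192.168.1.10:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret

  # FUSE client
  ceph-fuse -m 192.168.1.10:6789 /mnt/cephfs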



On Wed, Jul 19, 2017 at 7:02 AM, Дмитрий Глушенок  wrote:

> Unfortunately no. Using FUSE was discarded due to poor performance.
>
> On 19 July 2017, at 13:45, Blair Bethwaite wrote:
>
> Interesting. Any FUSE client data-points?
>
> On 19 July 2017 at 20:21, Дмитрий Глушенок  wrote:
>
> RBD (via krbd) was in action at the same time - no problems.
>
> On 19 July 2017, at 12:54, Blair Bethwaite wrote:
>
> It would be worthwhile repeating the first test (crashing/killing an
> OSD host) again with just plain rados clients (e.g. rados bench)
> and/or rbd. It's not clear whether your issue is specifically related
> to CephFS or actually something else.
>
> Cheers,
>
> On 19 July 2017 at 19:32, Дмитрий Глушенок  wrote:
>
> Hi,
>
> I can share negative test results (on Jewel 10.2.6). All tests were
> performed while actively writing to CephFS from single client (about 1300
> MB/sec). Cluster consists of 8 nodes, 8 OSD each (2 SSD for journals and
> metadata, 6 HDD RAID6 for data), MON/MDS are on dedicated nodes. 2 MDS at
> all, active/standby.
> - Crashing one node resulted in write hangs for 17 minutes. Repeating the
> test resulted in CephFS hangs forever.
> - Restarting active MDS resulted in successful failover to standby. Then,
> after standby became active and the restarted MDS became standby the new
> active was restarted. CephFS hanged for 12 minutes.
>
> P.S. Planning to repeat the tests again on 10.2.7 or higher
>
> On 19 July 2017, at 6:47, 许雪寒 wrote:
>
> Is there anyone else willing to share some usage information of cephfs?
> Could developers tell whether cephfs is a major effort in the whole ceph
> development?
>
> From: 许雪寒
> Sent: 17 July 2017 11:00
> To: ceph-users@lists.ceph.com
> Subject: How's cephfs going?
>
> Hi, everyone.
>
> We intend to use cephfs of Jewel version, however, we don’t know its
> status.
> Is it production ready in Jewel? Does it still have lots of bugs? Is it a
> major effort of the current ceph development? And who are using cephfs now?
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> --
> Dmitry Glushenok
> Jet Infosystems
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
>
> --
> Cheers,
> ~Blairo
>
>
> --
> Dmitry Glushenok
> Jet Infosystems
>
>
>
>
> --
> Cheers,
> ~Blairo
>
>
> --
> Dmitry Glushenok
> Jet Infosystems
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous RC OSD Crashing

2017-07-19 Thread Ashley Merrick
Logged a bug ticket, let me know if you need anything further: 
http://tracker.ceph.com/issues/20687

From: Ashley Merrick
Sent: Wednesday, 19 July 2017 8:05 PM
To: ceph-us...@ceph.com
Subject: RE: Luminous RC OSD Crashing

Also found this error on some of the OSD's crashing:

2017-07-19 12:50:57.587194 7f19348f1700 -1 
/build/ceph-12.1.1/src/osd/PrimaryLogPG.cc: In function 'virtual void 
C_CopyFrom_AsyncReadCb::finish(int)' thread 7f19348f1700 time 2017-07-19 
12:50:57.583192
/build/ceph-12.1.1/src/osd/PrimaryLogPG.cc: 7585: FAILED assert(len <= 
reply_obj.data.length())

ceph version 12.1.1 (f3e663a190bf2ed12c7e3cda288b9a159572c800) luminous (rc)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) 
[0x55f1c67bfe32]
2: (C_CopyFrom_AsyncReadCb::finish(int)+0x131) [0x55f1c63ec9e1]
3: (Context::complete(int)+0x9) [0x55f1c626b8b9]
4: (()+0x79bc70) [0x55f1c650fc70]
5: (ECBackend::kick_reads()+0x48) [0x55f1c651f908]
6: (CallClientContexts::finish(std::pair&)+0x562) [0x55f1c652e162]
7: (ECBackend::complete_read_op(ECBackend::ReadOp&, RecoveryMessages*)+0x7f) 
[0x55f1c650495f]
8: (ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&, 
RecoveryMessages*, ZTracer::Trace const&)+0x1077) [0x55f1c6519da7]
9: (ECBackend::handle_message(boost::intrusive_ptr)+0x2a6) 
[0x55f1c651a946]
10: (PrimaryLogPG::do_request(boost::intrusive_ptr&, 
ThreadPool::TPHandle&)+0x5e7) [0x55f1c638f667]
11: (OSD::dequeue_op(boost::intrusive_ptr, boost::intrusive_ptr, 
ThreadPool::TPHandle&)+0x3f7) [0x55f1c622fb07]
12: (PGQueueable::RunVis::operator()(boost::intrusive_ptr 
const&)+0x57) [0x55f1c648a0a7]
13: (OSD::ShardedOpWQ::_process(unsigned int, 
ceph::heartbeat_handle_d*)+0x108c) [0x55f1c625b34c]
14: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x93d) 
[0x55f1c67c5add]
15: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55f1c67c7d00]
16: (()+0x8064) [0x7f194cf89064]
17: (clone()+0x6d) [0x7f194c07d62d]
NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
interpret this.

--- begin dump of recent events ---
-1> 2017-07-19 12:50:46.691617 7f194a0ec700  1 -- 172.16.3.3:6806/3482 <== 
osd.28 172.16.3.4:6800/27027 18606  MOSDECSubOpRead(6.71s2 102354/102344 
ECSubRead(tid=605721, 
to_read={6:8e0c91b4:::rbd_data.61c662238e1f29.$
-> 2017-07-19 12:50:46.692100 7f19330ee700  1 -- 172.16.3.3:6806/3482 --> 
172.16.3.4:6800/27027 -- MOSDECSubOpReadReply(6.71s0 102354/102344 
ECSubReadReply(tid=605720, attrs_read=0)) v2 -- 0x55f1d5083180 con 0
-9998> 2017-07-19 12:50:46.692388 7f19330ee700  1 -- 172.16.3.3:6806/3482 --> 
172.16.3.4:6800/27027 -- MOSDECSubOpReadReply(6.71s0 102354/102344 
ECSubReadReply(tid=605721, attrs_read=0)) v2 -- 0x55f2412c1700 con 0

,Ashley

From: Ashley Merrick
Sent: Wednesday, 19 July 2017 7:08 PM
To: Ashley Merrick; ceph-us...@ceph.com
Subject: RE: Luminous RC OSD Crashing

I have just found : http://tracker.ceph.com/issues/20167

It looks to be the same error as in an earlier release (12.0.2-1883-gb3f5819); it 
was marked as resolved a month ago by Sage, however I am unable to see how and by 
what. I would guess that fix would have made it into the latest RC?

,Ashley

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Ashley 
Merrick
Sent: Wednesday, 19 July 2017 5:47 PM
To: ceph-us...@ceph.com
Subject: [ceph-users] Luminous RC OSD Crashing

Hello,

Getting the following on random OSD's crashing during a backfill/rebuilding on 
the latest RC, from the log's so far I have seen the following:

172.16.3.10:6802/21760 --> 172.16.3.6:6808/15997 -- 
pg_update_log_missing(6.19ds12 epoch 101931/101928 rep_tid 59 entries 
101931'55683 (0'0) error
6:b984d72a:::rbd_data.a1d870238e1f29.7c0b:head by 
client.30604127.0:31963 0.00 -2) v2 -- 0x55bea0faefc0 con 0

log_channel(cluster) log [ERR] : 4.11c required past_interval bounds are empty 
[101500,100085) but past_intervals is not: ([90726,100084...0083] acting 28)

failed to decode message of type 70 v3: buffer::malformed_input: void 
osd_peer_stat_t::decode(ceph::buffer::list::iterator&) no longer u...1 < 
struct_compat

Let me know if need anything else.

,Ashley
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Updating 12.1.0 -> 12.1.1

2017-07-19 Thread Marc Roos
 

Thanks! updating all indeed resolved this.



-Original Message-
From: Gregory Farnum [mailto:gfar...@redhat.com] 
Sent: dinsdag 18 juli 2017 23:01
To: Marc Roos; ceph-users
Subject: Re: [ceph-users] Updating 12.1.0 -> 12.1.1

Yeah, some of the message formats changed (incompatibly) during 
development. If you update all your nodes it should go away; that one I 
think is just ephemeral state.

On Tue, Jul 18, 2017 at 3:09 AM Marc Roos  
wrote:



I just updated packages on one CentOS7 node and getting these 
errors:

Jul 18 12:03:34 c01 ceph-mon: 2017-07-18 12:03:34.537510 
7f4fa1c14e40 -1
WARNING: the following dangerous and experimental features are 
enabled:
bluestore
Jul 18 12:03:34 c01 ceph-mon: 2017-07-18 12:03:34.537510 
7f4fa1c14e40 -1
WARNING: the following dangerous and experimental features are 
enabled:
bluestore
Jul 18 12:03:34 c01 ceph-mon: 2017-07-18 12:03:34.537725 
7f4fa1c14e40 -1
WARNING: the following dangerous and experimental features are 
enabled:
bluestore
Jul 18 12:03:34 c01 ceph-mon: 2017-07-18 12:03:34.537725 
7f4fa1c14e40 -1
WARNING: the following dangerous and experimental features are 
enabled:
bluestore
Jul 18 12:03:34 c01 ceph-mon: 2017-07-18 12:03:34.567250 
7f4fa1c14e40 -1
WARNING: the following dangerous and experimental features are 
enabled:
bluestore
Jul 18 12:03:34 c01 ceph-mon: 2017-07-18 12:03:34.567250 
7f4fa1c14e40 -1
WARNING: the following dangerous and experimental features are 
enabled:
bluestore
Jul 18 12:03:34 c01 ceph-mon: 2017-07-18 12:03:34.589008 
7f4fa1c14e40 -1
mon.a@-1(probing).mgrstat failed to decode mgrstat state; luminous 
dev
version?
Jul 18 12:03:34 c01 ceph-mon: 2017-07-18 12:03:34.589008 
7f4fa1c14e40 -1
mon.a@-1(probing).mgrstat failed to decode mgrstat state; luminous 
dev
version?
Jul 18 12:03:34 c01 ceph-mon: 2017-07-18 12:03:34.724836 
7f4f977d9700 -1
mon.a@0(synchronizing).mgrstat failed to decode mgrstat state; 
luminous
dev version?
Jul 18 12:03:34 c01 ceph-mon: 2017-07-18 12:03:34.724836 
7f4f977d9700 -1
mon.a@0(synchronizing).mgrstat failed to decode mgrstat state; 
luminous
dev version?
Jul 18 12:03:34 c01 ceph-mon:
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABL
E_ARC
H/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/rele
ase/1
2.1.1/rpm/el7/BUILD/ceph-12.1.1/src/messages/MForward.h: In 
function
'PaxosServiceMessage* MForward::claim_message()' thread 
7f4f977d9700
time 2017-07-18 12:03:34.870230
Jul 18 12:03:34 c01 ceph-mon:
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABL
E_ARC
H/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/rele
ase/1
2.1.1/rpm/el7/BUILD/ceph-12.1.1/src/messages/MForward.h: 100: 
FAILED
assert(msg)
Jul 18 12:03:34 c01 ceph-mon:
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABL
E_ARC
H/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/rele
ase/1
2.1.1/rpm/el7/BUILD/ceph-12.1.1/src/messages/MForward.h: In 
function
'PaxosServiceMessage* MForward::claim_message()' thread 
7f4f977d9700
time 2017-07-18 12:03:34.870230
Jul 18 12:03:34 c01 ceph-mon:
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABL
E_ARC
H/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/rele
ase/1
2.1.1/rpm/el7/BUILD/ceph-12.1.1/src/messages/MForward.h: 100: 
FAILED
assert(msg)
Jul 18 12:03:34 c01 ceph-mon: ceph version 12.1.1
(f3e663a190bf2ed12c7e3cda288b9a159572c800) luminous (rc)
Jul 18 12:03:34 c01 ceph-mon: 1: (ceph::__ceph_assert_fail(char 
const*,
char const*, int, char const*)+0x110) [0x7f4fa21f4310]
Jul 18 12:03:34 c01 ceph-mon: 2:
(Monitor::handle_forward(boost::intrusive_ptr)+0xd70)
[0x7f4fa1fddcd0]
Jul 18 12:03:34 c01 ceph-mon: 3:
(Monitor::dispatch_op(boost::intrusive_ptr)+0xd8d)
[0x7f4fa1fdb29d]
Jul 18 12:03:34 c01 ceph-mon: 4: 
(Monitor::_ms_dispatch(Message*)+0x7de)
[0x7f4fa1fdc06e]
Jul 18 12:03:34 c01 ceph-mon: 5: 
(Monitor::ms_dispatch(Message*)+0x23)
[0x7f4fa2004303]
Jul 18 12:03:34 c01 ceph-mon: 6: (DispatchQueue::entry()+0x792)
[0x7f4fa242c812]
Jul 18 12:03:34 c01 ceph-mon: 7:
(DispatchQueue::DispatchThread::entry()+0xd) [0x7f4fa229a3cd]
Jul 18 12:03:34 c01 ceph-mon: 8: (()+0x7dc5) [0x7f4fa0fbedc5]
Jul 18 12:03:34 c01 ceph-mon: 9: (clone()+0x6d) [0x7f4f9e34a76d]
Jul 18 12:03:34 c01 ceph-mon: NOTE: a copy of the executable, or
`objdump -rdS ` is needed to interpret this.
  

Re: [ceph-users] Luminous RC OSD Crashing

2017-07-19 Thread Ashley Merrick
Also found this error on some of the OSD's crashing:

2017-07-19 12:50:57.587194 7f19348f1700 -1 
/build/ceph-12.1.1/src/osd/PrimaryLogPG.cc: In function 'virtual void 
C_CopyFrom_AsyncReadCb::finish(int)' thread 7f19348f1700 time 2017-07-19 
12:50:57.583192
/build/ceph-12.1.1/src/osd/PrimaryLogPG.cc: 7585: FAILED assert(len <= 
reply_obj.data.length())

ceph version 12.1.1 (f3e663a190bf2ed12c7e3cda288b9a159572c800) luminous (rc)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) 
[0x55f1c67bfe32]
2: (C_CopyFrom_AsyncReadCb::finish(int)+0x131) [0x55f1c63ec9e1]
3: (Context::complete(int)+0x9) [0x55f1c626b8b9]
4: (()+0x79bc70) [0x55f1c650fc70]
5: (ECBackend::kick_reads()+0x48) [0x55f1c651f908]
6: (CallClientContexts::finish(std::pair&)+0x562) [0x55f1c652e162]
7: (ECBackend::complete_read_op(ECBackend::ReadOp&, RecoveryMessages*)+0x7f) 
[0x55f1c650495f]
8: (ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&, 
RecoveryMessages*, ZTracer::Trace const&)+0x1077) [0x55f1c6519da7]
9: (ECBackend::handle_message(boost::intrusive_ptr)+0x2a6) 
[0x55f1c651a946]
10: (PrimaryLogPG::do_request(boost::intrusive_ptr&, 
ThreadPool::TPHandle&)+0x5e7) [0x55f1c638f667]
11: (OSD::dequeue_op(boost::intrusive_ptr, boost::intrusive_ptr, 
ThreadPool::TPHandle&)+0x3f7) [0x55f1c622fb07]
12: (PGQueueable::RunVis::operator()(boost::intrusive_ptr 
const&)+0x57) [0x55f1c648a0a7]
13: (OSD::ShardedOpWQ::_process(unsigned int, 
ceph::heartbeat_handle_d*)+0x108c) [0x55f1c625b34c]
14: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x93d) 
[0x55f1c67c5add]
15: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55f1c67c7d00]
16: (()+0x8064) [0x7f194cf89064]
17: (clone()+0x6d) [0x7f194c07d62d]
NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
interpret this.

--- begin dump of recent events ---
-1> 2017-07-19 12:50:46.691617 7f194a0ec700  1 -- 172.16.3.3:6806/3482 <== 
osd.28 172.16.3.4:6800/27027 18606  MOSDECSubOpRead(6.71s2 102354/102344 
ECSubRead(tid=605721, 
to_read={6:8e0c91b4:::rbd_data.61c662238e1f29.$
-> 2017-07-19 12:50:46.692100 7f19330ee700  1 -- 172.16.3.3:6806/3482 --> 
172.16.3.4:6800/27027 -- MOSDECSubOpReadReply(6.71s0 102354/102344 
ECSubReadReply(tid=605720, attrs_read=0)) v2 -- 0x55f1d5083180 con 0
-9998> 2017-07-19 12:50:46.692388 7f19330ee700  1 -- 172.16.3.3:6806/3482 --> 
172.16.3.4:6800/27027 -- MOSDECSubOpReadReply(6.71s0 102354/102344 
ECSubReadReply(tid=605721, attrs_read=0)) v2 -- 0x55f2412c1700 con 0

,Ashley

From: Ashley Merrick
Sent: Wednesday, 19 July 2017 7:08 PM
To: Ashley Merrick ; ceph-us...@ceph.com
Subject: RE: Luminous RC OSD Crashing

I have just found : http://tracker.ceph.com/issues/20167

It looks to be the same error as in an earlier release (12.0.2-1883-gb3f5819); it 
was marked as resolved a month ago by Sage, however I am unable to see how and by 
what. I would guess that fix would have made it into the latest RC?

,Ashley

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Ashley 
Merrick
Sent: Wednesday, 19 July 2017 5:47 PM
To: ceph-us...@ceph.com
Subject: [ceph-users] Luminous RC OSD Crashing

Hello,

Getting the following on random OSD's crashing during a backfill/rebuilding on 
the latest RC, from the log's so far I have seen the following:

172.16.3.10:6802/21760 --> 172.16.3.6:6808/15997 -- 
pg_update_log_missing(6.19ds12 epoch 101931/101928 rep_tid 59 entries 
101931'55683 (0'0) error
6:b984d72a:::rbd_data.a1d870238e1f29.7c0b:head by 
client.30604127.0:31963 0.00 -2) v2 -- 0x55bea0faefc0 con 0

log_channel(cluster) log [ERR] : 4.11c required past_interval bounds are empty 
[101500,100085) but past_intervals is not: ([90726,100084...0083] acting 28)

failed to decode message of type 70 v3: buffer::malformed_input: void 
osd_peer_stat_t::decode(ceph::buffer::list::iterator&) no longer u...1 < 
struct_compat

Let me know if need anything else.

,Ashley
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ipv6 monclient

2017-07-19 Thread Wido den Hollander

> Op 19 juli 2017 om 10:36 schreef Dan van der Ster :
> 
> 
> Hi Wido,
> 
> Quick question about IPv6 clusters which you may have already noticed.
> We have an IPv6 cluster and clients use this as the ceph.conf:
> 
> [global]
>   mon host = cephv6.cern.ch
> 
> cephv6 is an alias to our three mons, which are listening on their v6
> addrs (ms bind ipv6 = true). But those mon hosts are in fact dual
> stack -- our network infrastructure does not (yet) allow IPv6-only
> hosts. And due to limitations in our DNS service, cephv6.cern.ch is
> therefore an alias to 6 addresses -- the three IPv6 + three IPv4
> addresses of the mon hosts.
> 
> Now, when clients connect to this cluster, we have the annoying
> behaviour that the Ceph monclient will construct the initial mon hosts
> with those 6 addrs, then consult them in random order to find the mon
> map. The attempts to the IPv4 addrs fail of course, so the connections
> are delayed until they attempt one of the v6 addrs.
> 
> Do you also suffer from this? Did you find a workaround to encourage
> the ceph clients to try the IPv6 addrs first? Unfortunately ms bind
> ipv6 is not used for clients, and even though getaddrinfo returns IPv6
> addrs first, Ceph randomizes the mon addrs before connecting.
> 

No, sorry, I haven't seen that! When I deploy IPv6 with Ceph it's IPv6-only and 
those DNS records will then only contain AAAA records.

Never tried this nor tested it :)

Btw, with newer versions you can also use SRV records, might be useful: 
http://docs.ceph.com/docs/master/rados/configuration/mon-lookup-dns/
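
As a sketch, something along these lines in the zone lets clients drop the mon
host line entirely (host names are placeholders; the service name defaults to
ceph-mon and can be changed with mon_dns_srv_name):

  _ceph-mon._tcp.cern.ch. 3600 IN SRV 10 60 6789 mon1.cern.ch.
  _ceph-mon._tcp.cern.ch. 3600 IN SRV 10 60 6789 mon2.cern.ch.
  _ceph-mon._tcp.cern.ch. 3600 IN SRV 10 60 6789 mon3.cern.ch.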

Wido

> Cheers, Dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous RC OSD Crashing

2017-07-19 Thread Ashley Merrick
I have just found : http://tracker.ceph.com/issues/20167

It looks to be the same error as in an earlier release (12.0.2-1883-gb3f5819); it 
was marked as resolved a month ago by Sage, however I am unable to see how and by 
what. I would guess that fix would have made it into the latest RC?

,Ashley

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Ashley 
Merrick
Sent: Wednesday, 19 July 2017 5:47 PM
To: ceph-us...@ceph.com
Subject: [ceph-users] Luminous RC OSD Crashing

Hello,

Getting the following on random OSD's crashing during a backfill/rebuilding on 
the latest RC, from the log's so far I have seen the following:

172.16.3.10:6802/21760 --> 172.16.3.6:6808/15997 -- 
pg_update_log_missing(6.19ds12 epoch 101931/101928 rep_tid 59 entries 
101931'55683 (0'0) error
6:b984d72a:::rbd_data.a1d870238e1f29.7c0b:head by 
client.30604127.0:31963 0.00 -2) v2 -- 0x55bea0faefc0 con 0

log_channel(cluster) log [ERR] : 4.11c required past_interval bounds are empty 
[101500,100085) but past_intervals is not: ([90726,100084...0083] acting 28)

failed to decode message of type 70 v3: buffer::malformed_input: void 
osd_peer_stat_t::decode(ceph::buffer::list::iterator&) no longer u...1 < 
struct_compat

Let me know if need anything else.

,Ashley
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] best practices for expanding hammer cluster

2017-07-19 Thread David Turner
One of the things you need to be aware of when doing this is that the crush
map is, more or less, stupid about your network setup.  You can
configure your crush map with racks, datacenters, etc, but it has no idea
where anything is. You have to tell it. You can use placement rules to help
when adding things to the cluster, but for now you just need to create the
buckets and move the hosts into them.

Sage explains a lot of the crush map here.

https://www.slideshare.net/mobile/sageweil1/a-crash-course-in-crush
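
As a rough sketch of that last step (bucket and host names here are made up,
and moving hosts will trigger data movement):

  # create a chassis bucket and hang it off the root
  ceph osd crush add-bucket chassis1 chassis
  ceph osd crush move chassis1 root=default
  # move the existing host buckets under it
  ceph osd crush move host1 chassis=chassis1
  ceph osd crush move host2 chassis=chassis1

Switching the failure domain itself is a separate step: the rule's chooseleaf
line has to change from type host to type chassis, which you can do by pulling
the crush map with ceph osd getcrushmap, decompiling it with crushtool -d,
editing the rule, recompiling with crushtool -c and injecting it with ceph osd
setcrushmap. That edit is the one that can remap a lot of PGs at once, so plan
for it.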

On Wed, Jul 19, 2017, 2:43 AM Laszlo Budai  wrote:

> Hi David,
>
> thank you for pointing this out. Google wasn't able to find it ...
>
> As far as I understand that thread is talking about a situation when you
> add hosts to an existing CRUSH bucket. That sounds good, and probably that
> will be our solution for cluster2.
> I wonder whether there are any recommendations how to perform a migration
> from a CRUSH map that only has OSD-host-root to a new one that has
> OSD-host-chassis-root?
>
> Thank you,
> Laszlo
>
>
> On 18.07.2017 20:05, David Turner wrote:
> > This was recently covered on the mailing list. I believe this will cover
> all of your questions.
> >
> > https://www.spinics.net/lists/ceph-users/msg37252.html
> >
> >
> > On Tue, Jul 18, 2017, 9:07 AM Laszlo Budai wrote:
> >
> > Dear all,
> >
> > we are planning to add new hosts to our existing hammer clusters,
> and I'm looking for best practices recommendations.
> >
> > currently we have 2 clusters with 72 OSDs and 6 nodes each. We want
> to add 3 more nodes (36 OSDs) to each cluster, and we have some questions
> about what would be the best way to do it. Currently the two clusters have
> different CRUSH maps.
> >
> > Cluster 1
> > The CRUSH map only has OSDs, hosts and the root bucket. Failure
> domain is host.
> > Our final desired state would be:
> > OSD - hosts - chassis - root where each chassis has 3 hosts, each
> host has 12 OSDs, and the failure domain would be chassis.
> >
> > What would be the recommended way to achieve this without downtime
> for client operations?
> > I have read about the possibility to throttle down the
> recovery/backfill using
> > osd max backfills = 1
> > osd recovery max active = 1
> > osd recovery max single start = 1
> > osd recovery op priority = 1
> > osd recovery threads = 1
> > osd backfill scan max = 16
> > osd backfill scan min = 4
> >
> > but we wonder about the situation when, in a worst case scenario,
> all the replicas belonging to one pg have to be migrated to new locations
> according to the new CRUSH map. How will ceph behave in such situation?
> >
> >
> > Cluster 2
> > the crush map already contains chassis. Currently we have 3 chassis
> (c1, c2, c3) and 6 hosts:
> > - x1, x2 in chassis c1
> > - y1, y2 in chassis c2
> > - x3, y3 in chassis c3
> >
> > We are adding hosts z1, z2, z3 and our desired CRUSH map would look
> like this:
> > - x1, x2, x3 in c1
> > - y1, y2, y3 in c2
> > - z1, z2, z3 in c3
> >
> > Again, what would be the recommended way to achieve this while the
> clients are still accessing the data?
> >
> > Is it safe to add more OSDs at a time? or we should add them one by
> one?
> >
> > Thank you in advance for any suggestions, recommendations.
> >
> > Kind regards,
> > Laszlo
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com 
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How's cephfs going?

2017-07-19 Thread Дмитрий Глушенок
Unfortunately no. Using FUSE was discarded due to poor performance.

> On 19 July 2017, at 13:45, Blair Bethwaite wrote:
> 
> Interesting. Any FUSE client data-points?
> 
> On 19 July 2017 at 20:21, Дмитрий Глушенок  wrote:
>> RBD (via krbd) was in action at the same time - no problems.
>> 
>> On 19 July 2017, at 12:54, Blair Bethwaite wrote:
>> 
>> It would be worthwhile repeating the first test (crashing/killing an
>> OSD host) again with just plain rados clients (e.g. rados bench)
>> and/or rbd. It's not clear whether your issue is specifically related
>> to CephFS or actually something else.
>> 
>> Cheers,
>> 
>> On 19 July 2017 at 19:32, Дмитрий Глушенок  wrote:
>> 
>> Hi,
>> 
>> I can share negative test results (on Jewel 10.2.6). All tests were
>> performed while actively writing to CephFS from single client (about 1300
>> MB/sec). Cluster consists of 8 nodes, 8 OSD each (2 SSD for journals and
>> metadata, 6 HDD RAID6 for data), MON/MDS are on dedicated nodes. 2 MDS at
>> all, active/standby.
>> - Crashing one node resulted in write hangs for 17 minutes. Repeating the
>> test resulted in CephFS hangs forever.
>> - Restarting active MDS resulted in successful failover to standby. Then,
>> after standby became active and the restarted MDS became standby the new
>> active was restarted. CephFS hanged for 12 minutes.
>> 
>> P.S. Planning to repeat the tests again on 10.2.7 or higher
>> 
>> On 19 July 2017, at 6:47, 许雪寒 wrote:
>> 
>> Is there anyone else willing to share some usage information of cephfs?
>> Could developers tell whether cephfs is a major effort in the whole ceph
>> development?
>> 
>> From: 许雪寒
>> Sent: 17 July 2017 11:00
>> To: ceph-users@lists.ceph.com
>> Subject: How's cephfs going?
>> 
>> Hi, everyone.
>> 
>> We intend to use cephfs of Jewel version, however, we don’t know its status.
>> Is it production ready in Jewel? Does it still have lots of bugs? Is it a
>> major effort of the current ceph development? And who are using cephfs now?
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
>> 
>> --
>> Dmitry Glushenok
>> Jet Infosystems
>> 
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
>> 
>> 
>> 
>> --
>> Cheers,
>> ~Blairo
>> 
>> 
>> --
>> Dmitry Glushenok
>> Jet Infosystems
>> 
> 
> 
> 
> -- 
> Cheers,
> ~Blairo

--
Dmitry Glushenok
Jet Infosystems

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How's cephfs going?

2017-07-19 Thread Blair Bethwaite
Interesting. Any FUSE client data-points?

On 19 July 2017 at 20:21, Дмитрий Глушенок  wrote:
> RBD (via krbd) was in action at the same time - no problems.
>
> On 19 July 2017, at 12:54, Blair Bethwaite wrote:
>
> It would be worthwhile repeating the first test (crashing/killing an
> OSD host) again with just plain rados clients (e.g. rados bench)
> and/or rbd. It's not clear whether your issue is specifically related
> to CephFS or actually something else.
>
> Cheers,
>
> On 19 July 2017 at 19:32, Дмитрий Глушенок  wrote:
>
> Hi,
>
> I can share negative test results (on Jewel 10.2.6). All tests were
> performed while actively writing to CephFS from single client (about 1300
> MB/sec). Cluster consists of 8 nodes, 8 OSD each (2 SSD for journals and
> metadata, 6 HDD RAID6 for data), MON/MDS are on dedicated nodes. 2 MDS at
> all, active/standby.
> - Crashing one node resulted in write hangs for 17 minutes. Repeating the
> test resulted in CephFS hangs forever.
> - Restarting active MDS resulted in successful failover to standby. Then,
> after standby became active and the restarted MDS became standby the new
> active was restarted. CephFS hanged for 12 minutes.
>
> P.S. Planning to repeat the tests again on 10.2.7 or higher
>
> On 19 July 2017, at 6:47, 许雪寒 wrote:
>
> Is there anyone else willing to share some usage information of cephfs?
> Could developers tell whether cephfs is a major effort in the whole ceph
> development?
>
> From: 许雪寒
> Sent: 17 July 2017 11:00
> To: ceph-users@lists.ceph.com
> Subject: How's cephfs going?
>
> Hi, everyone.
>
> We intend to use cephfs of Jewel version, however, we don’t know its status.
> Is it production ready in Jewel? Does it still have lots of bugs? Is it a
> major effort of the current ceph development? And who are using cephfs now?
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> --
> Dmitry Glushenok
> Jet Infosystems
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
>
> --
> Cheers,
> ~Blairo
>
>
> --
> Dmitry Glushenok
> Jet Infosystems
>



-- 
Cheers,
~Blairo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How's cephfs going?

2017-07-19 Thread Дмитрий Глушенок
RBD (via krbd) was in action at the same time - no problems.

> On 19 July 2017, at 12:54, Blair Bethwaite wrote:
> 
> It would be worthwhile repeating the first test (crashing/killing an
> OSD host) again with just plain rados clients (e.g. rados bench)
> and/or rbd. It's not clear whether your issue is specifically related
> to CephFS or actually something else.
> 
> Cheers,
> 
> On 19 July 2017 at 19:32, Дмитрий Глушенок  wrote:
>> Hi,
>> 
>> I can share negative test results (on Jewel 10.2.6). All tests were
>> performed while actively writing to CephFS from single client (about 1300
>> MB/sec). Cluster consists of 8 nodes, 8 OSD each (2 SSD for journals and
>> metadata, 6 HDD RAID6 for data), MON/MDS are on dedicated nodes. 2 MDS at
>> all, active/standby.
>> - Crashing one node resulted in write hangs for 17 minutes. Repeating the
>> test resulted in CephFS hangs forever.
>> - Restarting active MDS resulted in successful failover to standby. Then,
>> after standby became active and the restarted MDS became standby the new
>> active was restarted. CephFS hanged for 12 minutes.
>> 
>> P.S. Planning to repeat the tests again on 10.2.7 or higher
>> 
>> On 19 July 2017, at 6:47, 许雪寒 wrote:
>> 
>> Is there anyone else willing to share some usage information of cephfs?
>> Could developers tell whether cephfs is a major effort in the whole ceph
>> development?
>> 
>> From: 许雪寒
>> Sent: 17 July 2017 11:00
>> To: ceph-users@lists.ceph.com
>> Subject: How's cephfs going?
>> 
>> Hi, everyone.
>> 
>> We intend to use cephfs of Jewel version, however, we don’t know its status.
>> Is it production ready in Jewel? Does it still have lots of bugs? Is it a
>> major effort of the current ceph development? And who are using cephfs now?
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
>> 
>> --
>> Dmitry Glushenok
>> Jet Infosystems
>> 
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
> 
> 
> 
> -- 
> Cheers,
> ~Blairo

--
Dmitry Glushenok
Jet Infosystems

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Re: How's cephfs going?

2017-07-19 Thread 许雪寒
I got it, thank you☺

From: Дмитрий Глушенок [mailto:gl...@jet.msk.su]
Sent: 19 July 2017 18:20
To: 许雪寒
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] How's cephfs going?

You're right. Forgot to mention that the client was using kernel 4.9.9.

On 19 July 2017, at 12:36, 许雪寒 wrote:

Hi, thanks for your sharing:-)

So I guess you have not put cephfs into real production environment, and it's 
still in test phase, right?

Thanks again:-)

From: Дмитрий Глушенок [mailto:gl...@jet.msk.su]
Sent: 19 July 2017 17:33
To: 许雪寒
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] How's cephfs going?

Hi,

I can share negative test results (on Jewel 10.2.6). All tests were performed 
while actively writing to CephFS from single client (about 1300 MB/sec). 
Cluster consists of 8 nodes, 8 OSD each (2 SSD for journals and metadata, 6 HDD 
RAID6 for data), MON/MDS are on dedicated nodes. 2 MDS at all, active/standby.
- Crashing one node resulted in write hangs for 17 minutes. Repeating the test 
resulted in CephFS hangs forever.
- Restarting active MDS resulted in successful failover to standby. Then, after 
standby became active and the restarted MDS became standby the new active was 
restarted. CephFS hanged for 12 minutes.

P.S. Planning to repeat the tests again on 10.2.7 or higher

On 19 July 2017, at 6:47, 许雪寒 wrote:

Is there anyone else willing to share some usage information of cephfs?
Could developers tell whether cephfs is a major effort in the whole ceph 
development?

From: 许雪寒
Sent: 17 July 2017 11:00
To: ceph-users@lists.ceph.com
Subject: How's cephfs going?

Hi, everyone.

We intend to use cephfs of Jewel version, however, we don’t know its status. Is 
it production ready in Jewel? Does it still have lots of bugs? Is it a major 
effort of the current ceph development? And who are using cephfs now?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Dmitry Glushenok
Jet Infosystems

--
Дмитрий Глушенок
Инфосистемы Джет
+7-910-453-2568

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How's cephfs going?

2017-07-19 Thread Дмитрий Глушенок
You're right. Forgot to mention that the client was using kernel 4.9.9.

> On 19 July 2017, at 12:36, 许雪寒 wrote:
> 
> Hi, thanks for your sharing:-)
> 
> So I guess you have not put cephfs into real production environment, and it's 
> still in test phase, right?
> 
> Thanks again:-)
> 
> From: Дмитрий Глушенок [mailto:gl...@jet.msk.su]
> Sent: 19 July 2017 17:33
> To: 许雪寒
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] How's cephfs going?
> 
> Hi,
> 
> I can share negative test results (on Jewel 10.2.6). All tests were performed 
> while actively writing to CephFS from single client (about 1300 MB/sec). 
> Cluster consists of 8 nodes, 8 OSD each (2 SSD for journals and metadata, 6 
> HDD RAID6 for data), MON/MDS are on dedicated nodes. 2 MDS at all, 
> active/standby.
> - Crashing one node resulted in write hangs for 17 minutes. Repeating the 
> test resulted in CephFS hangs forever.
> - Restarting active MDS resulted in successful failover to standby. Then, 
> after standby became active and the restarted MDS became standby the new 
> active was restarted. CephFS hanged for 12 minutes.
> 
> P.S. Planning to repeat the tests again on 10.2.7 or higher
> 
> On 19 July 2017, at 6:47, 许雪寒 wrote:
> 
> Is there anyone else willing to share some usage information of cephfs?
> Could developers tell whether cephfs is a major effort in the whole ceph 
> development?
> 
> From: 许雪寒
> Sent: 17 July 2017 11:00
> To: ceph-users@lists.ceph.com
> Subject: How's cephfs going?
> 
> Hi, everyone.
> 
> We intend to use cephfs of Jewel version, however, we don’t know its status. 
> Is it production ready in Jewel? Does it still have lots of bugs? Is it a 
> major effort of the current ceph development? And who are using cephfs now?
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> --
> Dmitry Glushenok
> Jet Infosystems
> 

--
Дмитрий Глушенок
Инфосистемы Джет
+7-910-453-2568

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous RC OSD Crashing

2017-07-19 Thread Ashley Merrick
Hello,

Recovery seems fine; this only happens when I do ceph osd unset nobackfill, 
after which random OSDs rapidly start to fail (I am guessing they are backfill 
sources, but I am unable to catch one because of how fast it happens).

The backfilling OSD is a recently re-created OSD using Bluestore.
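
For reference, the flag toggling is just:

  # pause backfill while poking around
  ceph osd set nobackfill
  # let it resume
  ceph osd unset nobackfill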

,Ashley

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Ashley 
Merrick
Sent: Wednesday, 19 July 2017 5:47 PM
To: ceph-us...@ceph.com
Subject: [ceph-users] Luminous RC OSD Crashing

Hello,

Getting the following on random OSD's crashing during a backfill/rebuilding on 
the latest RC, from the log's so far I have seen the following:

172.16.3.10:6802/21760 --> 172.16.3.6:6808/15997 -- 
pg_update_log_missing(6.19ds12 epoch 101931/101928 rep_tid 59 entries 
101931'55683 (0'0) error
6:b984d72a:::rbd_data.a1d870238e1f29.7c0b:head by 
client.30604127.0:31963 0.00 -2) v2 -- 0x55bea0faefc0 con 0

log_channel(cluster) log [ERR] : 4.11c required past_interval bounds are empty 
[101500,100085) but past_intervals is not: ([90726,100084...0083] acting 28)

failed to decode message of type 70 v3: buffer::malformed_input: void 
osd_peer_stat_t::decode(ceph::buffer::list::iterator&) no longer u...1 < 
struct_compat

Let me know if need anything else.

,Ashley
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] upgrade ceph from 10.2.7 to 10.2.9

2017-07-19 Thread Ansgar Jazdzewski
hi *,

we are facing an issue with the upgrade of our OSDs.

The update process on Ubuntu 16.04 stops at:


Setting system user ceph properties..usermod: no changes
..done
Fixing /var/run/ceph ownershipdone


No more output is given, and my permissions look OK, so how should I go ahead?
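
(By permissions being OK I mean they already match the usual Jewel ownership
fix, i.e. roughly:

  chown -R ceph:ceph /var/lib/ceph /var/run/ceph

so I don't think that is what the postinst is stuck on.)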

Thanks,
Ansgar
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How's cephfs going?

2017-07-19 Thread Blair Bethwaite
It would be worthwhile repeating the first test (crashing/killing an
OSD host) again with just plain rados clients (e.g. rados bench)
and/or rbd. It's not clear whether your issue is specifically related
to CephFS or actually something else.
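
A quick sketch of such a test (pool name is a placeholder; write cleans up its
objects unless --no-cleanup is passed):

  # 60 second write test, keeping the objects for the read tests
  rados bench -p testpool 60 write --no-cleanup
  rados bench -p testpool 60 seq
  rados bench -p testpool 60 rand
  # remove the benchmark objects afterwards
  rados -p testpool cleanup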

Cheers,

On 19 July 2017 at 19:32, Дмитрий Глушенок  wrote:
> Hi,
>
> I can share negative test results (on Jewel 10.2.6). All tests were
> performed while actively writing to CephFS from single client (about 1300
> MB/sec). Cluster consists of 8 nodes, 8 OSD each (2 SSD for journals and
> metadata, 6 HDD RAID6 for data), MON/MDS are on dedicated nodes. 2 MDS at
> all, active/standby.
> - Crashing one node resulted in write hangs for 17 minutes. Repeating the
> test resulted in CephFS hangs forever.
> - Restarting active MDS resulted in successful failover to standby. Then,
> after standby became active and the restarted MDS became standby the new
> active was restarted. CephFS hanged for 12 minutes.
>
> P.S. Planning to repeat the tests again on 10.2.7 or higher
>
> On 19 July 2017, at 6:47, 许雪寒 wrote:
>
> Is there anyone else willing to share some usage information of cephfs?
> Could developers tell whether cephfs is a major effort in the whole ceph
> development?
>
> From: 许雪寒
> Sent: 17 July 2017 11:00
> To: ceph-users@lists.ceph.com
> Subject: How's cephfs going?
>
> Hi, everyone.
>
> We intend to use cephfs of Jewel version, however, we don’t know its status.
> Is it production ready in Jewel? Does it still have lots of bugs? Is it a
> major effort of the current ceph development? And who are using cephfs now?
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> --
> Dmitry Glushenok
> Jet Infosystems
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Cheers,
~Blairo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Luminous RC OSD Crashing

2017-07-19 Thread Ashley Merrick
Hello,

Getting the following on random OSD's crashing during a backfill/rebuilding on 
the latest RC, from the log's so far I have seen the following:

172.16.3.10:6802/21760 --> 172.16.3.6:6808/15997 -- 
pg_update_log_missing(6.19ds12 epoch 101931/101928 rep_tid 59 entries 
101931'55683 (0'0) error
6:b984d72a:::rbd_data.a1d870238e1f29.7c0b:head by 
client.30604127.0:31963 0.00 -2) v2 -- 0x55bea0faefc0 con 0

log_channel(cluster) log [ERR] : 4.11c required past_interval bounds are empty 
[101500,100085) but past_intervals is not: ([90726,100084...0083] acting 28)

failed to decode message of type 70 v3: buffer::malformed_input: void 
osd_peer_stat_t::decode(ceph::buffer::list::iterator&) no longer u...1 < 
struct_compat
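
(If fuller traces would help, I can bump the debug level on the affected OSDs
before re-enabling backfill, e.g. something like
ceph tell osd.* injectargs '--debug-osd 20 --debug-ms 1'.)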

Let me know if need anything else.

,Ashley
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Re: How's cephfs going?

2017-07-19 Thread 许雪寒
Hi, thanks for your sharing:-)

So I guess you have not put cephfs into real production environment, and it's 
still in test phase, right?

Thanks again:-)

From: Дмитрий Глушенок [mailto:gl...@jet.msk.su]
Sent: 19 July 2017 17:33
To: 许雪寒
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] How's cephfs going?

Hi,

I can share negative test results (on Jewel 10.2.6). All tests were performed 
while actively writing to CephFS from single client (about 1300 MB/sec). 
Cluster consists of 8 nodes, 8 OSD each (2 SSD for journals and metadata, 6 HDD 
RAID6 for data), MON/MDS are on dedicated nodes. 2 MDS at all, active/standby.
- Crashing one node resulted in write hangs for 17 minutes. Repeating the test 
resulted in CephFS hangs forever.
- Restarting active MDS resulted in successful failover to standby. Then, after 
standby became active and the restarted MDS became standby the new active was 
restarted. CephFS hanged for 12 minutes.

P.S. Planning to repeat the tests again on 10.2.7 or higher

On 19 July 2017, at 6:47, 许雪寒 wrote:

Is there anyone else willing to share some usage information of cephfs?
Could developers tell whether cephfs is a major effort in the whole ceph 
development?

From: 许雪寒
Sent: 17 July 2017 11:00
To: ceph-users@lists.ceph.com
Subject: How's cephfs going?

Hi, everyone.

We intend to use cephfs of Jewel version, however, we don’t know its status. Is 
it production ready in Jewel? Does it still have lots of bugs? Is it a major 
effort of the current ceph development? And who are using cephfs now?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Dmitry Glushenok
Jet Infosystems

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How's cephfs going?

2017-07-19 Thread Дмитрий Глушенок
Hi,

I can share negative test results (on Jewel 10.2.6). All tests were performed 
while actively writing to CephFS from single client (about 1300 MB/sec). 
Cluster consists of 8 nodes, 8 OSD each (2 SSD for journals and metadata, 6 HDD 
RAID6 for data), MON/MDS are on dedicated nodes. 2 MDS at all, active/standby.
- Crashing one node resulted in write hangs for 17 minutes. Repeating the test 
resulted in CephFS hangs forever.
- Restarting active MDS resulted in successful failover to standby. Then, after 
standby became active and the restarted MDS became standby the new active was 
restarted. CephFS hanged for 12 minutes.

P.S. Planning to repeat the tests again on 10.2.7 or higher

> On 19 July 2017, at 6:47, 许雪寒 wrote:
> 
> Is there anyone else willing to share some usage information of cephfs?
> Could developers tell whether cephfs is a major effort in the whole ceph 
> development?
> 
> From: 许雪寒
> Sent: 17 July 2017 11:00
> To: ceph-users@lists.ceph.com
> Subject: How's cephfs going?
> 
> Hi, everyone.
> 
> We intend to use cephfs of Jewel version, however, we don’t know its status. 
> Is it production ready in Jewel? Does it still have lots of bugs? Is it a 
> major effort of the current ceph development? And who are using cephfs now?
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Dmitry Glushenok
Jet Infosystems

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Problems getting nfs-ganesha with cephfs backend to work.

2017-07-19 Thread Micha Krause

Hi,


Ganesha version 2.5.0.1 from the nfs-ganesha repo hosted on download.ceph.com 



I didn't know about that repo, and compiled ganesha myself. The developers in 
the #ganesha IRC channel pointed me to
the libcephfs version.
After recompiling ganesha with a kraken libcephfs instead of a jewel version 
both errors went away.

I'm sure using a compiled version from the repo you mention would have worked 
out of the box.
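
For anyone else trying this, a minimal FSAL_CEPH export block (values are just
examples) looks roughly like:

  EXPORT {
      Export_ID = 1;
      Path = /;
      Pseudo = /cephfs;
      Access_Type = RW;
      FSAL {
          Name = CEPH;
      }
  }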

Micha Krause

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ipv6 monclient

2017-07-19 Thread Dan van der Ster
Hi Wido,

Quick question about IPv6 clusters which you may have already noticed.
We have an IPv6 cluster and clients use this as the ceph.conf:

[global]
  mon host = cephv6.cern.ch

cephv6 is an alias to our three mons, which are listening on their v6
addrs (ms bind ipv6 = true). But those mon hosts are in fact dual
stack -- our network infrastructure does not (yet) allow IPv6-only
hosts. And due to limitations in our DNS service, cephv6.cern.ch is
therefore an alias to 6 addresses -- the three IPv6 + three IPv4
addresses of the mon hosts.

Now, when clients connect to this cluster, we get the annoying behaviour that
the Ceph monclient constructs its initial list of mon hosts from those 6 addrs
and then consults them in random order to find the mon map. The attempts on the
IPv4 addrs fail, of course, so connections are delayed until one of the v6
addrs is tried.

Do you also suffer from this? Did you find a workaround to encourage
the ceph clients to try the IPv6 addrs first? Unfortunately ms bind
ipv6 is not used for clients, and even though getaddrinfo returns IPv6
addrs first, Ceph randomizes the mon addrs before connecting.
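
(The only fallback we can think of is to skip the DNS alias on the client side
and list the IPv6 mon addrs explicitly in ceph.conf, roughly like the sketch
below with placeholder addresses, but we would prefer to keep the single alias.)

[global]
  # placeholder IPv6 addresses of the three mons
  mon host = [2001:db8::a]:6789, [2001:db8::b]:6789, [2001:db8::c]:6789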

Cheers, Dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] best practices for expanding hammer cluster

2017-07-19 Thread Laszlo Budai

Hi David,

thank you for pointing this out. Google wasn't able to find it ...

As far as I understand, that thread is talking about a situation where you add
hosts to an existing CRUSH bucket. That sounds good, and it will probably be our
solution for cluster 2.
I wonder whether there are any recommendations on how to perform a migration
from a CRUSH map that only has OSD-host-root to a new one that has
OSD-host-chassis-root.
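
(For what it's worth, below is a rough sketch of the commands I imagine such a
restructuring would involve, with made-up bucket names and assuming the default
root; I'd still like to hear whether it is safe to run this live:)

  # create a chassis bucket and hang it under the root
  ceph osd crush add-bucket chassis1 chassis
  ceph osd crush move chassis1 root=default
  # move each host under its chassis
  ceph osd crush move host1 chassis=chassis1
  # the CRUSH rule(s) then need to use chassis as the failure domain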

Thank you,
Laszlo


On 18.07.2017 20:05, David Turner wrote:

This was recently discussed on the mailing list. I believe this thread will
answer all of your questions.

https://www.spinics.net/lists/ceph-users/msg37252.html


On Tue, Jul 18, 2017, 9:07 AM Laszlo Budai wrote:

Dear all,

we are planning to add new hosts to our existing Hammer clusters, and I'm
looking for best-practice recommendations.

Currently we have 2 clusters, each with 72 OSDs across 6 nodes. We want to add
3 more nodes (36 OSDs) to each cluster, and we have some questions about what
would be the best way to do it. The two clusters currently have different CRUSH
maps.

Cluster 1
The CRUSH map only has OSDs, hosts and the root bucket. The failure domain is
host.
Our desired final state would be:
OSD - host - chassis - root, where each chassis has 3 hosts, each host has
12 OSDs, and the failure domain would be chassis.

What would be the recommended way to achieve this without downtime for
client operations?
I have read about the possibility of throttling down the recovery/backfill
using the following settings:
osd max backfills = 1
osd recovery max active = 1
osd recovery max single start = 1
osd recovery op priority = 1
osd recovery threads = 1
osd backfill scan max = 16
osd backfill scan min = 4

but we wonder about the worst-case scenario, in which all the replicas
belonging to one PG have to be migrated to new locations according to the new
CRUSH map. How will Ceph behave in such a situation?
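
(For illustration, assuming those values, I would expect the throttling to be
applied at runtime with something like the following so that no OSD restart is
needed; please correct me if that is not how you would do it:)

  ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'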


Cluster 2
The CRUSH map already contains chassis buckets. Currently we have 3 chassis (c1,
c2, c3) and 6 hosts:
- x1, x2 in chassis c1
- y1, y2 in chassis c2
- x3, y3 in chassis c3

We are adding hosts z1, z2, z3 and our desired CRUSH map would look like 
this:
- x1, x2, x3 in c1
- y1, y2, y3 in c2
- z1, z2, z3 in c3
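
(Roughly, with the bucket names above, I imagine the rearrangement comes down to
something like this, once the new z hosts exist in the CRUSH map:)

  # rehome the existing hosts
  ceph osd crush move x3 chassis=c1
  ceph osd crush move y3 chassis=c2
  # place the new hosts under c3
  ceph osd crush move z1 chassis=c3
  ceph osd crush move z2 chassis=c3
  ceph osd crush move z3 chassis=c3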

Again, what would be the recommended way to achieve this while the clients 
are still accessing the data?

Is it safe to add several OSDs at a time, or should we add them one by one?

Thank you in advance for any suggestions, recommendations.

Kind regards,
Laszlo
___
ceph-users mailing list
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Long OSD restart after upgrade to 10.2.9

2017-07-19 Thread Anton Dmitriev

root@storage07:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:Ubuntu 14.04.5 LTS
Release:14.04
Codename:   trusty

root@storage07:~$ uname -a
Linux storage07 4.4.0-83-generic #106~14.04.1-Ubuntu SMP Mon Jun 26 
18:10:19 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux


root@storage07:~$ dpkg -l | grep leveld
ii  libleveldb1:amd64 1.15.0-2   
amd64fast key-value storage library


According to /var/log/apt/history.log, nothing else was installed or upgraded
in the last few months except what is listed below. leveldb was not upgraded
recently.


Start-Date: 2017-07-15  17:35:25
Commandline: apt upgrade
Install: linux-headers-4.4.0-83-generic:amd64 (4.4.0-83.106~14.04.1, 
automatic), linux-image-4.4.0-83-generic:amd64 (4.4.0-83.106~14.04.1, 
automatic), linux-image-3.13.0-123-generic:amd64 (3.13.0-123.172, 
automatic), linux-image-extra-3.13.0-123-generic:amd64 (3.13.0-123.172, 
automatic), linux-headers-4.4.0-83:amd64 (4.4.0-83.106~14.04.1, 
automatic), linux-image-extra-4.4.0-83-generic:amd64 
(4.4.0-83.106~14.04.1, automatic), 
linux-headers-3.13.0-123-generic:amd64 (3.13.0-123.172, automatic), 
linux-headers-3.13.0-123:amd64 (3.13.0-123.172, automatic)
Upgrade: bind9-host:amd64 (9.9.5.dfsg-3ubuntu0.14, 
9.9.5.dfsg-3ubuntu0.15), python3-problem-report:amd64 
(2.14.1-0ubuntu3.23, 2.14.1-0ubuntu3.24), liblwres90:amd64 
(9.9.5.dfsg-3ubuntu0.14, 9.9.5.dfsg-3ubuntu0.15), libnl-genl-3-200:amd64 
(3.2.21-1ubuntu4, 3.2.21-1ubuntu4.1), linux-headers-generic:amd64 
(3.13.0.119.129, 3.13.0.123.133), libgnutls-openssl27:amd64 
(2.12.23-12ubuntu2.7, 2.12.23-12ubuntu2.8), multiarch-support:amd64 
(2.19-0ubuntu6.11, 2.19-0ubuntu6.13), libdns100:amd64 
(9.9.5.dfsg-3ubuntu0.14, 9.9.5.dfsg-3ubuntu0.15), radosgw:amd64 
(10.2.7-1trusty, 10.2.9-1trusty), libisccfg90:amd64 
(9.9.5.dfsg-3ubuntu0.14, 9.9.5.dfsg-3ubuntu0.15), isc-dhcp-common:amd64 
(4.2.4-7ubuntu12.8, 4.2.4-7ubuntu12.10), libbind9-90:amd64 
(9.9.5.dfsg-3ubuntu0.14, 9.9.5.dfsg-3ubuntu0.15), python-cephfs:amd64 
(10.2.7-1trusty, 10.2.9-1trusty), librbd1:amd64 (10.2.7-1trusty, 
10.2.9-1trusty), sudo:amd64 (1.8.9p5-1ubuntu1.3, 1.8.9p5-1ubuntu1.4), 
libradosstriper1:amd64 (10.2.7-1trusty, 10.2.9-1trusty), 
python3-software-properties:amd64 (0.92.37.7, 0.92.37.8), 
linux-tools-common:amd64 (3.13.0-119.166, 3.13.0-123.172), 
libc-dev-bin:amd64 (2.19-0ubuntu6.11, 2.19-0ubuntu6.13), librados2:amd64 
(10.2.7-1trusty, 10.2.9-1trusty), libc-bin:amd64 (2.19-0ubuntu6.11, 
2.19-0ubuntu6.13), libc6:amd64 (2.19-0ubuntu6.11, 2.19-0ubuntu6.13), 
linux-generic-lts-xenial:amd64 (4.4.0.78.63, 4.4.0.83.68), 
dnsutils:amd64 (9.9.5.dfsg-3ubuntu0.14, 9.9.5.dfsg-3ubuntu0.15), 
ceph-base:amd64 (10.2.7-1trusty, 10.2.9-1trusty), libnl-3-200:amd64 
(3.2.21-1ubuntu4, 3.2.21-1ubuntu4.1), klibc-utils:amd64 
(2.0.3-0ubuntu1.14.04.2, 2.0.3-0ubuntu1.14.04.3), ceph-osd:amd64 
(10.2.7-1trusty, 10.2.9-1trusty), gnutls-bin:amd64 
(3.0.11+really2.12.23-12ubuntu2.7, 3.0.11+really2.12.23-12ubuntu2.8), 
libldap-2.4-2:amd64 (2.4.31-1+nmu2ubuntu8.3, 2.4.31-1+nmu2ubuntu8.4), 
libnss3-nssdb:amd64 (3.28.4-0ubuntu0.14.04.1, 3.28.4-0ubuntu0.14.04.2), 
linux-headers-generic-lts-xenial:amd64 (4.4.0.78.63, 4.4.0.83.68), 
ntp:amd64 (4.2.6.p5+dfsg-3ubuntu2.14.04.10, 
4.2.6.p5+dfsg-3ubuntu2.14.04.11), ceph-mds:amd64 (10.2.7-1trusty, 
10.2.9-1trusty), ceph:amd64 (10.2.7-1trusty, 10.2.9-1trusty), 
ceph-common:amd64 (10.2.7-1trusty, 10.2.9-1trusty), librgw2:amd64 
(10.2.7-1trusty, 10.2.9-1trusty), vlan:amd64 (1.9-3ubuntu10.1, 
1.9-3ubuntu10.4), isc-dhcp-client:amd64 (4.2.4-7ubuntu12.8, 
4.2.4-7ubuntu12.10), libnss3:amd64 (3.28.4-0ubuntu0.14.04.1, 
3.28.4-0ubuntu0.14.04.2), python-rbd:amd64 (10.2.7-1trusty, 
10.2.9-1trusty), libgnutls26:amd64 (2.12.23-12ubuntu2.7, 
2.12.23-12ubuntu2.8), apport:amd64 (2.14.1-0ubuntu3.23, 
2.14.1-0ubuntu3.24), libklibc:amd64 (2.0.3-0ubuntu1.14.04.2, 
2.0.3-0ubuntu1.14.04.3), ceph-mon:amd64 (10.2.7-1trusty, 
10.2.9-1trusty), libcephfs1:amd64 (10.2.7-1trusty, 10.2.9-1trusty), 
linux-image-generic-lts-xenial:amd64 (4.4.0.78.63, 4.4.0.83.68), 
python3-apport:amd64 (2.14.1-0ubuntu3.23, 2.14.1-0ubuntu3.24), 
software-properties-common:amd64 (0.92.37.7, 0.92.37.8), 
linux-libc-dev:amd64 (3.13.0-119.166, 3.13.0-123.172), libtasn1-6:amd64 
(3.4-3ubuntu0.4, 3.4-3ubuntu0.5), linux-image-generic:amd64 
(3.13.0.119.129, 3.13.0.123.133), libisccc90:amd64 
(9.9.5.dfsg-3ubuntu0.14, 9.9.5.dfsg-3ubuntu0.15), libgcrypt11:amd64 
(1.5.3-2ubuntu4.4, 1.5.3-2ubuntu4.5), libc6-dev:amd64 (2.19-0ubuntu6.11, 
2.19-0ubuntu6.13), libisc95:amd64 (9.9.5.dfsg-3ubuntu0.14, 
9.9.5.dfsg-3ubuntu0.15), python-rados:amd64 (10.2.7-1trusty, 
10.2.9-1trusty), ntpdate:amd64 (4.2.6.p5+dfsg-3ubuntu2.14.04.10, 
4.2.6.p5+dfsg-3ubuntu2.14.04.11), linux-generic:amd64 (3.13.0.119.129, 
3.13.0.123.133)

End-Date: 2017-07-15  17:39:43

root@storage07:~$ ceph --admin-daemon /var/run/ceph/ceph-osd.195.asok 
config show | grep leveldb

"debug_leveldb": "20\/20",