[ceph-users] librbd compatibility

2016-06-20 Thread min fang
Hi, is there a document describing librbd compatibility?  For example,
something like this: librbd from Ceph 0.88 can also be used with
0.90, 0.91, and so on.

I hope librbd can be kept relatively stable, so we can avoid extra code
iteration and testing.

Thanks.


[ceph-users] delete all pool,but the data is still exist.

2016-06-20 Thread Leo Yu
Hi,
I deleted all pools with this script:

arr=( $(rados lspools) )
for key in "${!arr[@]}"; do
    ceph osd pool delete "${arr[$key]}" "${arr[$key]}" --yes-i-really-really-mean-it
done

Here is the output of ceph df after deleting all pools. It seems there are
no pools any more, but 251M of disk space is still used.
[root@ceph03 ~]# ceph df
GLOBAL:
SIZE   AVAIL  RAW USED %RAW USED
92093M 91842M 251M  0.27
POOLS:
NAME ID USED %USED MAX AVAIL OBJECTS
[root@ceph03 ~]#


and the output of ceph osd df
[root@ceph03 ~]# ceph osd df
ID WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE VAR  PGS
 0 0.01459  1.0 15348M 44384k 15305M 0.28 1.03   0
 3 0.01459  1.0 15348M 41936k 15308M 0.27 0.98   0
 1 0.01459  1.0 15348M 44144k 15305M 0.28 1.03   0
 4 0.01459  1.0 15348M 41724k 15308M 0.27 0.97   0
 2 0.01459  1.0 15348M 45492k 15304M 0.29 1.06   0
 5 0.01459  1.0 15348M 39724k 15310M 0.25 0.93   0
  TOTAL 92093M   251M 91842M 0.27

So why is the data still on the OSDs? Does it disappear after some time
(e.g. 10 min), or will a GC thread process it?


Re: [ceph-users] CEPH with NVMe SSDs and Caching vs Journaling on SSDs

2016-06-20 Thread Christian Balzer

Hello,

On Mon, 20 Jun 2016 15:12:49 + Tim Gipson wrote:

> Christian,
> 
> Thanks for all the info. I’ve been looking over the mailing lists.
> There is so much info there and from the looks of it, setting up a cache
> tier is much more complex than I had originally thought.  
> 
More complex, yes.
But depending on your use case also potentially very rewarding.

> Moving the journals to OSDs was much simpler for me because you can just
> use ceph-deploy and point the journal to the device you want.
> 
SSD journals for HDD OSDs is always a good first step.

> I do understand the difference between the cache tier and journaling.
> 
> As per your comment about the monitor nodes, the extra monitor nodes are
> for the purpose of resiliency.  We are trying to build our storage and
> compute clusters with lots of failure in mind.
> 
Since you have the HW already, not much of a point, but things would have
been more resilient and performant with 1 dedicated monitor node and 4
storage nodes also running MONs.

Note that more than 5 monitors is considered counterproductive in nearly
all cases.

> Our NVME drives are only the 800GB 3600 series.
> 
That's 2.4TB per day, or a mere 28MB/s when looking at the endurance of
these NVMes. 
Not accounting for any write amplification (which should be negligible
with journals).

Wouldn't be an issue in my use case, but YMMV, so monitor the wearout.
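
For reference, the arithmetic behind that figure, assuming the 800GB P3600's
3 drive-writes-per-day rating (and, as an example only, nvme-cli for checking
the wearout; the device name is a placeholder):

echo '800 * 3' | bc                        # -> 2400 GB/day within the endurance rating
echo 'scale=1; 2400 * 1024 / 86400' | bc   # -> ~28.4 MB/s sustained, before write amplification
nvme smart-log /dev/nvme0 | grep percentage_used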

> As to our networking setup: The OSD nodes have 4 x 10G nics, a bonded
> pair for front end traffic and a bonded pair for cluster traffic.  The
> monitor nodes have a bonded pair of 1Gig nics.  Our clients have 4 x 10G
> nics as well with a bonded pair dedicated to storage front end traffic
> connected to the ceph cluster.
> 
Overkill, as your storage nodes are limited by the 1GB/s of the P3600s.

In short, a split network in your case is a bit of a waste, as your reads
are potentially hampered by it.

Read the recent "Best Network Switches for Redundancy" for example.

> The single NVMe for journaling was a concern but as you mentioned
> before, a host is our failure domain at this point.
> 
And with that in mind, don't fill your OSDs more than 60%.

Christian

> I did find your comments to another user about having to add multiple
> roots per node because their NVMe drives were on different nodes.  That
> is the case for our gear as well.
> 
> Also, my gear is already in house so I’ve got what I’ve got to work with
> at this point, for good for ill.
> 
> Tim Gipson
> 
> 
> On 6/16/16, 7:47 PM, "Christian Balzer"  wrote:
> 
> 
> Hello,
> 
> On Thu, 16 Jun 2016 15:31:13 + Tim Gipson wrote:
> 
> > A few questions.
> > 
> > First, is there a good step by step to setting up a caching tier with
> > NVMe SSDs that are on separate hosts?  Is that even possible?
> > 
> Yes. And with a cluster of your size that's the way I'd do it.
> Larger cluster (dozen plus nodes) are likely to be better suited with
> storage nodes that have shared HDD OSDs for slow storage and SSD OSDs for
> cache pools.
> 
> It would behoove you to scour this ML for the dozens of threads covering
> this and other aspects, like:
> "journal or cache tier on SSDs ?"
> "Steps for Adding Cache Tier"
> and even yesterdays:
> "Is Dynamic Cache tiering supported in Jewel"
> 
> > Second, what sort of performance are people seeing from caching
> > tiers/journaling on SSDs in Jewel?
> > 
> Not using Jewel, but it's bound to be better than Hammer.
> 
> Performance will depend on a myriad of things, including CPU, SSD/NVMe
> models, networking, tuning, etc.
> It would be better if you had a performance target and a budget to see if
> they can be matched up.
> 
> Cache tiering and journaling are very different things, don't mix them
> up.
> 
> > Right now I am working on trying to find best practice for a CEPH
> > cluster with 3 monitor nodes, and 3 OSDs with 1 800GB NVMe drive and 12
> > 6TB drives.
> > 
> No need for dedicated monitor notes (definitely not 3 and with cluster of
> that size) if your storage nodes are designed correctly, see for example:
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-April/008879.html
> 
> > My goal is reliable/somewhat fast performance.
> >
> Well, for starters this cluster will give you the space of one of these
> nodes and worse performance than a single node due to the 3x replication.
> 
> What NVMe did you have in mind, a DC P3600 will give you 1GB/s writes
> (and 3DWPD endurance), a P3700 2GB/s (and 10DWPD endurance).
> 
> What about your network?
> 
> Since the default failure domain in Ceph is the host, a single NVMe as
> journal for all HDD OSDs isn't particular risky, but it's something to
> keep in mind.
>  
> Christian
> > Any help would be greatly appreciated!
> > 
> > Tim Gipson
> > Systems Engineer
> > 
> > 618 Grassmere Park Drive, Suite 12
> > Nashville, TN 37211

Re: [ceph-users] Ceph OSD journal utilization

2016-06-20 Thread Christian Balzer

Hello,

On Mon, 20 Jun 2016 12:20:30 -0400 Jonathan Proulx wrote:

> On Mon, Jun 20, 2016 at 04:02:04PM +, David Turner wrote:
> :If you want to watch what a disk is doing while you watch it, use
> iostat on the journal device.  If you want to see it's patterns at all
> times of the day, use sar.  Neither of these are ceph specific commands,
> just Linux tools that can watch your disk utilization, speeds, etc
> (among other things.  Both tools are well documented and easy to use.
> 
> For spot testing, like watching disk I/O during stress testing iostat
> as mentioned above is something I use frequently.
> 
> Sar is a simple way to pull historic load info from a single host, but
> I find it a bit combersome at even my modest scale.
> 
> We use http://munin-monitoring.org/ to do visual trending of many
> different server statistics including disk throughput, latency,
> utilization etc...this is a bit old school and there are plenty of
> other opensource ways of gathering and dispalying performance metrics. 
> 
> None of this is ceph specific and I agree there's no reason it should
> be.
>
The OP asked specifically about provisioning, so Nick's answer about using
the Ceph counters to see how much of the journal is used is the correct
one.
And incidentally, the current "Criteria for Ceph journal sizing" thread has
some pertinent points about this.

As for utilization in terms of IOPS and bandwidth, I find atop a very good
tool for spot (live) monitoring/analysis.
For long-term (and retrospective) analysis, collectd with graphing via
graphite/grafana or munin is quite useful.
 
Christian

> -Jon
> 
> :
> :From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of EP
> Komarla [ep.koma...@flextronics.com] :Sent: Friday, June 17, 2016 5:13 PM
> :To: ceph-users@lists.ceph.com
> :Subject: [ceph-users] Ceph OSD journal utilization
> :
> :Hi,
> :
> :I am looking for a way to monitor the utilization of OSD journals – by
> observing the utilization pattern over time, I can determine if I have
> over provisioned them or not. Is there a way to do this? : :When I
> googled on this topic, I saw one similar request about 4 years back.  I
> am wondering if there is some traction on this topic since
> then. : :Thanks a lot. :
> :- epk
> :
> :Legal Disclaimer:
> :The information contained in this message may be privileged and
> confidential. It is intended to be read only by the individual or entity
> to whom it is addressed or by their designee. If the reader of this
> message is not the intended recipient, you are on notice that any
> distribution of this message, in any form, is strictly prohibited. If
> you have received this message in error, please immediately notify the
> sender and delete or destroy any copy of this message!
> 
> :___
> :ceph-users mailing list
> :ceph-users@lists.ceph.com
> :http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/


Re: [ceph-users] Criteria for Ceph journal sizing

2016-06-20 Thread Christian Balzer

Hello,

On Mon, 20 Jun 2016 21:15:47 +0200 Michael Hanscho wrote:

> Hi!
> On 2016-06-20 14:32, Daleep Singh Bais wrote:
> > Dear All,
> > 
> > Is their some criteria for deciding on Ceph journal size to be used,
> > whether in respect to Data partition size etc? I have noticed that if
> > not specified, it takes the journal size to be 5GB.
> > 
> > Any insight in this regard will be helpful for my understanding.
> 
> See documentation:
> http://docs.ceph.com/docs/master/rados/configuration/osd-config-ref/
> 
> osd journal size = {2 * (expected throughput * filestore max sync
> interval)}
> 
> http://comments.gmane.org/gmane.comp.file-systems.ceph.user/28433
> 
Thanks for quoting that thread. ^o^

For the OP, read it, because while the above formula certainly is correct,
large journals are nearly always a waste.

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/


Re: [ceph-users] Dramatic performance drop at certain number of objects in pool

2016-06-20 Thread Christian Balzer

Hello,

On Mon, 20 Jun 2016 20:47:32 + Warren Wang - ISD wrote:

> Sorry, late to the party here. I agree, up the merge and split
> thresholds. We're as high as 50/12. I chimed in on an RH ticket here.
> One of those things you just have to find out as an operator since it's
> not well documented :(
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=1219974
> 
> We have over 200 million objects in this cluster, and it's still doing
> over 15000 write IOPS all day long with 302 spinning drives + SATA SSD
> journals. Having enough memory and dropping your vfs_cache_pressure
> should also help.
> 
Indeed.

Since it was asked in that bug report and was also my first suspicion, it
would probably be a good time to clarify that it isn't the splits that cause
the performance degradation, but the resulting inflation of dir entries
and exhaustion of SLAB, and thus having to go to disk for things that
normally would be in memory.

Looking at Blair's graph from yesterday pretty much makes that clear: a
purely split-caused degradation should have relented much more quickly.
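
For anyone wanting to see this on their own OSD nodes, a quick way to watch
the directory-lookup and SLAB side of it (assuming XFS-backed filestore OSDs)
is something like:

# xs_dir_lookup is the first number on the "dir" line; watch it climb under load
grep ^dir /proc/fs/xfs/stat
# dentry / XFS inode slab usage -- once these get evicted, lookups start hitting disk
slabtop -o | grep -E 'dentry|xfs_inode'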


> Keep in mind that if you change the values, it won't take effect
> immediately. It only merges them back if the directory is under the
> calculated threshold and a write occurs (maybe a read, I forget).
> 
If it's a read, a plain scrub might do the trick.

Christian
> Warren
> 
> 
> From: ceph-users on behalf of Wade Holler
> Date: Monday, June 20, 2016 at 2:48 PM
> To: Blair Bethwaite, Wido den Hollander
> Cc: Ceph Development, "ceph-users@lists.ceph.com"
> Subject: Re: [ceph-users] Dramatic performance drop at certain number of objects in pool
> 
> Thanks everyone for your replies.  I sincerely appreciate it. We are
> testing with different pg_num and filestore_split_multiple settings.
> Early indications are  well not great. Regardless it is nice to
> understand the symptoms better so we try to design around it.
> 
> Best Regards,
> Wade
> 
> 
> On Mon, Jun 20, 2016 at 2:32 AM Blair Bethwaite wrote:
> On 20 June 2016 at 09:21, Blair Bethwaite wrote:
> > slow request issues). If you watch your xfs stats you'll likely get
> > further confirmation. In my experience xs_dir_lookups balloons (which
> > means directory lookups are missing cache and going to disk).
> 
> Murphy's a bitch. Today we upgraded a cluster to latest Hammer in
> preparation for Jewel/RHCS2. Turns out when we last hit this very
> problem we had only ephemerally set the new filestore merge/split
> values - oops. Here's what started happening when we upgraded and
> restarted a bunch of OSDs:
> https://au-east.erc.monash.edu.au/swift/v1/public/grafana-ceph-xs_dir_lookup.png
> 
> Seemed to cause lots of slow requests :-/. We corrected it about
> 12:30, then still took a while to settle.
> 
> --
> Cheers,
> ~Blairo
> 
> This email and any files transmitted with it are confidential and
> intended solely for the individual or entity to whom they are addressed.
> If you have received this email in error destroy it immediately. ***
> Walmart Confidential ***


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/


[ceph-users] Bluestore Talk

2016-06-20 Thread Patrick McGarry
Hey cephers,

Just a reminder that this is the 2-for-1 Ceph Tech Talk week. Tomorrow
Sage will be giving a Bluestore talk at 4p EDT and Thursday at 1p EDT
(the usual time) Lenz Grimmer will be talking about the OpenATTIC/Ceph
integration work.

http://ceph.com/ceph-tech-talks/

If you have any questions please feel free to shoot them my way. Thanks!

-- 

Best Regards,

Patrick McGarry
Director Ceph Community || Red Hat
http://ceph.com  ||  http://community.redhat.com
@scuttlemonkey || @ceph


Re: [ceph-users] Dramatic performance drop at certain number of objects in pool

2016-06-20 Thread Warren Wang - ISD
Sorry, late to the party here. I agree, up the merge and split thresholds.
We're as high as 50/12. I chimed in on an RH ticket here. One of those things
you just have to find out as an operator since it's not well documented :(

https://bugzilla.redhat.com/show_bug.cgi?id=1219974

We have over 200 million objects in this cluster, and it's still doing over
15000 write IOPS all day long with 302 spinning drives + SATA SSD journals.
Having enough memory and dropping your vfs_cache_pressure should also help.
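
For reference, those thresholds are plain ceph.conf options. A sketch of
settings in that spirit (values are illustrative only, and it is an assumption
here that "50/12" means split multiple 50 / merge threshold 12 -- test against
your own workload before copying them):

[osd]
filestore split multiple = 50
filestore merge threshold = 12

# and on the OSD hosts, make the kernel hold on to dentries/inodes longer
# (10 is just an example value; the kernel default is 100):
sysctl -w vm.vfs_cache_pressure=10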

Keep in mind that if you change the values, it won't take effect immediately. 
It only merges them back if the directory is under the calculated threshold and 
a write occurs (maybe a read, I forget).

Warren


From: ceph-users on behalf of Wade Holler
Date: Monday, June 20, 2016 at 2:48 PM
To: Blair Bethwaite, Wido den Hollander
Cc: Ceph Development, "ceph-users@lists.ceph.com"
Subject: Re: [ceph-users] Dramatic performance drop at certain number of objects in pool

Thanks everyone for your replies.  I sincerely appreciate it. We are testing 
with different pg_num and filestore_split_multiple settings.  Early indications 
are  well not great. Regardless it is nice to understand the symptoms 
better so we try to design around it.

Best Regards,
Wade


On Mon, Jun 20, 2016 at 2:32 AM Blair Bethwaite wrote:
On 20 June 2016 at 09:21, Blair Bethwaite wrote:
> slow request issues). If you watch your xfs stats you'll likely get
> further confirmation. In my experience xs_dir_lookups balloons (which
> means directory lookups are missing cache and going to disk).

Murphy's a bitch. Today we upgraded a cluster to latest Hammer in
preparation for Jewel/RHCS2. Turns out when we last hit this very
problem we had only ephemerally set the new filestore merge/split
values - oops. Here's what started happening when we upgraded and
restarted a bunch of OSDs:
https://au-east.erc.monash.edu.au/swift/v1/public/grafana-ceph-xs_dir_lookup.png

Seemed to cause lots of slow requests :-/. We corrected it about
12:30, then still took a while to settle.

--
Cheers,
~Blairo

This email and any files transmitted with it are confidential and intended 
solely for the individual or entity to whom they are addressed. If you have 
received this email in error destroy it immediately. *** Walmart Confidential 
***


Re: [ceph-users] cluster down during backfilling, Jewel tunables and client IO optimisations

2016-06-20 Thread Andrei Mikhailovsky
Hi Josef, 

are you saying that there is no ceph config option that can be used to provide
IO to the vms while the ceph cluster is doing a heavy data move? I am really
struggling to understand how this could be the case. I've read so much about
ceph being the solution to modern storage needs and that all of its
components were designed to be redundant, to provide always-on availability
of the storage in case of upgrades and hardware failures. Has something been
overlooked?

Also, judging by the low number of people with similar issues, I am thinking that
there are a lot of ceph users who are still using a non-optimal profile, either
because they don't want to risk the downtime or simply don't know about
the latest crush tunables.

For any future updates, should I be scheduling a maintenance day or two and
shutting down all vms prior to upgrading the cluster? That seems like the backwards
approach of the 90s and early 2000s (((

Cheers 

Andrei 

> From: "Josef Johansson" 
> To: "Gregory Farnum" , "Daniel Swarbrick"
> 
> Cc: "ceph-users" , "ceph-devel"
> 
> Sent: Monday, 20 June, 2016 20:22:02
> Subject: Re: [ceph-users] cluster down during backfilling, Jewel tunables and
> client IO optimisations

> Hi,

> People ran into this when there were some changes in tunables that caused
> 70-100% movement, the solution was to find out what values that changed and
> increment them in the smallest steps possible.

> I've found that with major rearrangement in ceph the VMs does not neccesarily
> survive ( last time on a ssd cluster ), so linux and timeouts doesn't work 
> well
> os my assumption. Which is true with any other storage backend out there ;)

> Regards,
> Josef
> On Mon, 20 Jun 2016, 19:51 Gregory Farnum, < gfar...@redhat.com > wrote:

>> On Mon, Jun 20, 2016 at 8:33 AM, Daniel Swarbrick
>> < daniel.swarbr...@profitbricks.com > wrote:
>> > We have just updated our third cluster from Infernalis to Jewel, and are
>> > experiencing similar issues.

>> > We run a number of KVM virtual machines (qemu 2.5) with RBD images, and
>> > have seen a lot of D-state processes and even jbd/2 timeouts and kernel
>> > stack traces inside the guests. At first I thought the VMs were being
>> > starved of IO, but this is still happening after throttling back the
>> > recovery with:

>> > osd_max_backfills = 1
>> > osd_recovery_max_active = 1
>> > osd_recovery_op_priority = 1

>> > After upgrading the cluster to Jewel, I changed our crushmap to use the
>> > newer straw2 algorithm, which resulted in a little data movment, but no
>> > problems at that stage.

>> > Once the cluster had settled down again, I set tunables to optimal
>> > (hammer profile -> jewel profile), which has triggered between 50% and
>> > 70% misplaced PGs on our clusters. This is when the trouble started each
>> > time, and when we had cascading failures of VMs.

>> > However, after performing hard shutdowns on the VMs and restarting them,
>> > they seemed to be OK.

>> > At this stage, I have a strong suspicion that it is the introduction of
>> > "require_feature_tunables5 = 1" in the tunables. This seems to require
>> > all RADOS connections to be re-established.

>> Do you have any evidence of that besides the one restart?

>> I guess it's possible that we aren't kicking requests if the crush map
>> but not the rest of the osdmap changes, but I'd be surprised.
>> -Greg



>> > On 20/06/16 13:54, Andrei Mikhailovsky wrote:
>> >> Hi Oliver,

>> >> I am also seeing this as a strange behaviour indeed! I was going through
>> >> the logs and I was not able to find any errors or issues. There was also no
>> >> slow/blocked requests that I could see during the recovery process.

>> >> Does anyone have an idea what could be the issue here? I don't want to shut
>> >> down all vms every time there is a new release with updated tunable values.


>> >> Andrei



>> >> - Original Message -
>> >>> From: "Oliver Dzombic" < i...@ip-interactive.de >
>> >>> To: "andrei" < and...@arhont.com >, "ceph-users" < 
>> >>> ceph-users@lists.ceph.com >
>> >>> Sent: Sunday, 19 June, 2016 10:14:35
> Subject: Re: [ceph-users] cluster down during backfilling, Jewel tunables 
> and
>> >>> client IO optimisations

>> >>> Hi,

>> >>> so far the key values for that are:

>> >>> osd_client_op_priority = 63 ( anyway default, but i set it to remember 
>> >>> it )
>> >>> osd_recovery_op_priority = 1


>> >>> In addition i set:

>> >>> osd_max_backfills = 1
>> >>> osd_recovery_max_active = 1




> 

Re: [ceph-users] cluster down during backfilling, Jewel tunables and client IO optimisations

2016-06-20 Thread Andrei Mikhailovsky
Hi Daniel,


> 
> After upgrading the cluster to Jewel, I changed our crushmap to use the
> newer straw2 algorithm, which resulted in a little data movment, but no
> problems at that stage.


I've not done that; instead I switched the profile to optimal right away.


> 
> Once the cluster had settled down again, I set tunables to optimal
> (hammer profile -> jewel profile), which has triggered between 50% and
> 70% misplaced PGs on our clusters. This is when the trouble started each
> time, and when we had cascading failures of VMs.
> 
> However, after performing hard shutdowns on the VMs and restarting them,
> they seemed to be OK.
> 
> At this stage, I have a strong suspicion that it is the introduction of
> "require_feature_tunables5 = 1" in the tunables. This seems to require
> all RADOS connections to be re-established.
> 


In my experience, shutting down the vm and restarting it didn't help. I waited
about 30+ minutes for the vm to start, but it was still unable to start.

I've also noticed that it took a while for vms to start failing: initially the
IO wait on the vms went up just a bit, and it slowly kept increasing over the
course of about an hour. At the end there was 100% iowait on all vms. If this
were the case, wouldn't I see iowait jumping to 100% pretty quickly? Also, I
wasn't able to start any of my vms until I rebooted one of my osd / mon
servers following the successful PG rebuild.








> 
> On 20/06/16 13:54, Andrei Mikhailovsky wrote:
>> Hi Oliver,
>> 
>> I am also seeing this as a strange behavriour indeed! I was going through the
>> logs and I was not able to find any errors or issues. There was also no
>> slow/blocked requests that I could see during the recovery process.
>> 
>> Does anyone has an idea what could be the issue here? I don't want to shut 
>> down
>> all vms every time there is a new release with updated tunable values.
>> 
>> 
>> Andrei
>> 
>> 
>> 
>> - Original Message -
>>> From: "Oliver Dzombic" 
>>> To: "andrei" , "ceph-users" 
>>> Sent: Sunday, 19 June, 2016 10:14:35
>>> Subject: Re: [ceph-users] cluster down during backfilling, Jewel tunables 
>>> and
>>> client IO optimisations
>> 
>>> Hi,
>>>
>>> so far the key values for that are:
>>>
>>> osd_client_op_priority = 63 ( anyway default, but i set it to remember it )
>>> osd_recovery_op_priority = 1
>>>
>>>
>>> In addition i set:
>>>
>>> osd_max_backfills = 1
>>> osd_recovery_max_active = 1
>>>
> 
> 


Re: [ceph-users] cluster down during backfilling, Jewel tunables and client IO optimisations

2016-06-20 Thread Josef Johansson
Hi,

People ran into this when there were some changes in tunables that caused
70-100% data movement; the solution was to find out which values changed and
increment them in the smallest steps possible.
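
One way to do that incrementally (file names below are placeholders) is to
decompile the CRUSH map, bump a single tunable per iteration, and re-inject
it, letting the cluster settle in between:

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# edit the "tunable ..." lines at the top of crushmap.txt, one change at a time
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new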

I've found that with a major rearrangement in ceph the VMs do not necessarily
survive (last time it was on an SSD cluster), so my assumption is that Linux
and its timeouts don't cope well. Which is true with any other storage
backend out there ;)

Regards,
Josef

On Mon, 20 Jun 2016, 19:51 Gregory Farnum,  wrote:

> On Mon, Jun 20, 2016 at 8:33 AM, Daniel Swarbrick
>  wrote:
> > We have just updated our third cluster from Infernalis to Jewel, and are
> > experiencing similar issues.
> >
> > We run a number of KVM virtual machines (qemu 2.5) with RBD images, and
> > have seen a lot of D-state processes and even jbd/2 timeouts and kernel
> > stack traces inside the guests. At first I thought the VMs were being
> > starved of IO, but this is still happening after throttling back the
> > recovery with:
> >
> > osd_max_backfills = 1
> > osd_recovery_max_active = 1
> > osd_recovery_op_priority = 1
> >
> > After upgrading the cluster to Jewel, I changed our crushmap to use the
> > newer straw2 algorithm, which resulted in a little data movment, but no
> > problems at that stage.
> >
> > Once the cluster had settled down again, I set tunables to optimal
> > (hammer profile -> jewel profile), which has triggered between 50% and
> > 70% misplaced PGs on our clusters. This is when the trouble started each
> > time, and when we had cascading failures of VMs.
> >
> > However, after performing hard shutdowns on the VMs and restarting them,
> > they seemed to be OK.
> >
> > At this stage, I have a strong suspicion that it is the introduction of
> > "require_feature_tunables5 = 1" in the tunables. This seems to require
> > all RADOS connections to be re-established.
>
> Do you have any evidence of that besides the one restart?
>
> I guess it's possible that we aren't kicking requests if the crush map
> but not the rest of the osdmap changes, but I'd be surprised.
> -Greg
>
> >
> >
> > On 20/06/16 13:54, Andrei Mikhailovsky wrote:
> >> Hi Oliver,
> >>
> >> I am also seeing this as a strange behavriour indeed! I was going
> through the logs and I was not able to find any errors or issues. There was
> also no slow/blocked requests that I could see during the recovery process.
> >>
> >> Does anyone has an idea what could be the issue here? I don't want to
> shut down all vms every time there is a new release with updated tunable
> values.
> >>
> >>
> >> Andrei
> >>
> >>
> >>
> >> - Original Message -
> >>> From: "Oliver Dzombic" 
> >>> To: "andrei" , "ceph-users" <
> ceph-users@lists.ceph.com>
> >>> Sent: Sunday, 19 June, 2016 10:14:35
> >>> Subject: Re: [ceph-users] cluster down during backfilling, Jewel
> tunables and client IO optimisations
> >>
> >>> Hi,
> >>>
> >>> so far the key values for that are:
> >>>
> >>> osd_client_op_priority = 63 ( anyway default, but i set it to remember
> it )
> >>> osd_recovery_op_priority = 1
> >>>
> >>>
> >>> In addition i set:
> >>>
> >>> osd_max_backfills = 1
> >>> osd_recovery_max_active = 1
> >>>
> >
> >


Re: [ceph-users] Criteria for Ceph journal sizing

2016-06-20 Thread Michael Hanscho
Hi!
On 2016-06-20 14:32, Daleep Singh Bais wrote:
> Dear All,
> 
> Is their some criteria for deciding on Ceph journal size to be used,
> whether in respect to Data partition size etc? I have noticed that if
> not specified, it takes the journal size to be 5GB.
> 
> Any insight in this regard will be helpful for my understanding.

See documentation:
http://docs.ceph.com/docs/master/rados/configuration/osd-config-ref/

osd journal size = {2 * (expected throughput * filestore max sync interval)}

http://comments.gmane.org/gmane.comp.file-systems.ceph.user/28433
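
As a worked example with assumed values -- say the backing disks sustain about
100 MB/s and filestore max sync interval is left at its default of 5 seconds:

osd journal size = 2 * (100 MB/s * 5 s) = 1000 MB    # i.e. roughly 1 GB (the option is in MB)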

Regards
Michael




Re: [ceph-users] Dramatic performance drop at certain number of objects in pool

2016-06-20 Thread Wade Holler
Thanks everyone for your replies.  I sincerely appreciate it. We are
testing with different pg_num and filestore_split_multiple settings.  Early
indications are  well not great. Regardless it is nice to understand
the symptoms better so we try to design around it.

Best Regards,
Wade


On Mon, Jun 20, 2016 at 2:32 AM Blair Bethwaite 
wrote:

> On 20 June 2016 at 09:21, Blair Bethwaite 
> wrote:
> > slow request issues). If you watch your xfs stats you'll likely get
> > further confirmation. In my experience xs_dir_lookups balloons (which
> > means directory lookups are missing cache and going to disk).
>
> Murphy's a bitch. Today we upgraded a cluster to latest Hammer in
> preparation for Jewel/RHCS2. Turns out when we last hit this very
> problem we had only ephemerally set the new filestore merge/split
> values - oops. Here's what started happening when we upgraded and
> restarted a bunch of OSDs:
>
> https://au-east.erc.monash.edu.au/swift/v1/public/grafana-ceph-xs_dir_lookup.png
>
> Seemed to cause lots of slow requests :-/. We corrected it about
> 12:30, then still took a while to settle.
>
> --
> Cheers,
> ~Blairo
>


Re: [ceph-users] cluster down during backfilling, Jewel tunables and client IO optimisations

2016-06-20 Thread Gregory Farnum
On Mon, Jun 20, 2016 at 8:33 AM, Daniel Swarbrick
 wrote:
> We have just updated our third cluster from Infernalis to Jewel, and are
> experiencing similar issues.
>
> We run a number of KVM virtual machines (qemu 2.5) with RBD images, and
> have seen a lot of D-state processes and even jbd/2 timeouts and kernel
> stack traces inside the guests. At first I thought the VMs were being
> starved of IO, but this is still happening after throttling back the
> recovery with:
>
> osd_max_backfills = 1
> osd_recovery_max_active = 1
> osd_recovery_op_priority = 1
>
> After upgrading the cluster to Jewel, I changed our crushmap to use the
> newer straw2 algorithm, which resulted in a little data movment, but no
> problems at that stage.
>
> Once the cluster had settled down again, I set tunables to optimal
> (hammer profile -> jewel profile), which has triggered between 50% and
> 70% misplaced PGs on our clusters. This is when the trouble started each
> time, and when we had cascading failures of VMs.
>
> However, after performing hard shutdowns on the VMs and restarting them,
> they seemed to be OK.
>
> At this stage, I have a strong suspicion that it is the introduction of
> "require_feature_tunables5 = 1" in the tunables. This seems to require
> all RADOS connections to be re-established.

Do you have any evidence of that besides the one restart?

I guess it's possible that we aren't kicking requests if the crush map
but not the rest of the osdmap changes, but I'd be surprised.
-Greg

>
>
> On 20/06/16 13:54, Andrei Mikhailovsky wrote:
>> Hi Oliver,
>>
>> I am also seeing this as a strange behavriour indeed! I was going through 
>> the logs and I was not able to find any errors or issues. There was also no 
>> slow/blocked requests that I could see during the recovery process.
>>
>> Does anyone has an idea what could be the issue here? I don't want to shut 
>> down all vms every time there is a new release with updated tunable values.
>>
>>
>> Andrei
>>
>>
>>
>> - Original Message -
>>> From: "Oliver Dzombic" 
>>> To: "andrei" , "ceph-users" 
>>> Sent: Sunday, 19 June, 2016 10:14:35
>>> Subject: Re: [ceph-users] cluster down during backfilling, Jewel tunables 
>>> and client IO optimisations
>>
>>> Hi,
>>>
>>> so far the key values for that are:
>>>
>>> osd_client_op_priority = 63 ( anyway default, but i set it to remember it )
>>> osd_recovery_op_priority = 1
>>>
>>>
>>> In addition i set:
>>>
>>> osd_max_backfills = 1
>>> osd_recovery_max_active = 1
>>>
>
>


Re: [ceph-users] Issue while building Jewel on ARM

2016-06-20 Thread Gregory Farnum
On Mon, Jun 20, 2016 at 5:28 AM, Daleep Singh Bais  wrote:
> Dear All,
>
> I am getting below error message while trying to build Jewel on ARM. Any
> help / suggestion will be appreciated.
>
> g++: error: unrecognized command line option '-momit-leaf-frame-pointer'
> g++: error: unrecognized command line option '-momit-leaf-frame-pointer'
>   CC   db/builder.o
> g++: error: unrecognized command line option '-momit-leaf-frame-pointer'
> Makefile:1189: recipe for target 'db/builder.o' failed
> make[5]: *** [db/builder.o] Error 1
> make[5]: *** Waiting for unfinished jobs
>   CC   db/c.o
> g++: error: unrecognized command line option '-momit-leaf-frame-pointer'
> Makefile:1189: recipe for target 'db/c.o' failed
> make[5]: *** [db/c.o] Error 1
> make[5]: Leaving directory '/ceph_build/ceph/src/rocksdb'
> Makefile:32669: recipe for target 'rocksdb/librocksdb.a' failed
> make[4]: *** [rocksdb/librocksdb.a] Error 2
> make[4]: *** Waiting for unfinished jobs

Well, that's in rocksdb, and googling for "momit-leaf-frame-pointer"
the fourth result I see is
https://github.com/facebook/rocksdb/issues/810

So check that out?
-Greg
PS: And my fifth result is http://tracker.ceph.com/issues/15692, so
update that if you figure something out please! :)


Re: [ceph-users] RGW memory usage

2016-06-20 Thread Василий Ангапов
We have OSD and RGW processes collocated on all Ceph nodes. There is
an Apache load balancer in front of them.
Currently we have something like 18 million objects in RGW and each
OSD consumes 2.5 GB of memory on average.
And yes, RGW data is stored in an EC pool. The Ceph version is 10.2.1.
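
For reference, a rough way to get that average on a node (resident set size
only) is something like:

ps -C ceph-osd -o rss= | awk '{sum+=$1; n++} END {printf "%.1f GB average over %d OSDs\n", sum/n/1024/1024, n}'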

2016-06-20 19:37 GMT+03:00 Abhishek Varshney :
> Hi,
>
> Is the memory issue seen on OSD nodes or on the RGW nodes? We
> encountered memory issues on OSD nodes with EC pools. Here is the mail
> thread : http://www.spinics.net/lists/ceph-devel/msg30597.html
>
> Hope this helps.
>
> Thanks
> Abhishek
>
> On Mon, Jun 20, 2016 at 9:59 PM, Василий Ангапов  wrote:
>> Hello,
>>
>> I'm sorry, can anyone share something on this matter?
>>
>> Regards, Vasily.
>>
>> 2016-06-09 16:14 GMT+03:00 Василий Ангапов :
>>> Hello!
>>>
>>> I have a question regarding Ceph RGW memory usage.
>>> We currently have 10 node 1.5 PB raw space cluster with EC profile
>>> 6+3. Every node has 29x6TB OSDs and 64 GB of RAM.
>>> Recently I've noticed that nodes are starting to suffer from RAM
>>> insufficiency. There is currently about 2.6 million files in RGW.
>>> Each OSD consumes 1-2 GB of RAM.
>>>
>>> Our plan is to store something like 200-300 million files of average
>>> size about 5 MB. How much RAM may I need approximately?
>>> Do somebody else having such cluster with many files?
>>>
>>> Thanks for help!


Re: [ceph-users] RGW memory usage

2016-06-20 Thread Abhishek Varshney
Hi,

Is the memory issue seen on OSD nodes or on the RGW nodes? We
encountered memory issues on OSD nodes with EC pools. Here is the mail
thread : http://www.spinics.net/lists/ceph-devel/msg30597.html

Hope this helps.

Thanks
Abhishek

On Mon, Jun 20, 2016 at 9:59 PM, Василий Ангапов  wrote:
> Hello,
>
> I'm sorry, can anyone share something on this matter?
>
> Regards, Vasily.
>
> 2016-06-09 16:14 GMT+03:00 Василий Ангапов :
>> Hello!
>>
>> I have a question regarding Ceph RGW memory usage.
>> We currently have 10 node 1.5 PB raw space cluster with EC profile
>> 6+3. Every node has 29x6TB OSDs and 64 GB of RAM.
>> Recently I've noticed that nodes are starting to suffer from RAM
>> insufficiency. There is currently about 2.6 million files in RGW.
>> Each OSD consumes 1-2 GB of RAM.
>>
>> Our plan is to store something like 200-300 million files of average
>> size about 5 MB. How much RAM may I need approximately?
>> Do somebody else having such cluster with many files?
>>
>> Thanks for help!


Re: [ceph-users] RGW memory usage

2016-06-20 Thread Василий Ангапов
Hello,

I'm sorry, can anyone share something on this matter?

Regards, Vasily.

2016-06-09 16:14 GMT+03:00 Василий Ангапов :
> Hello!
>
> I have a question regarding Ceph RGW memory usage.
> We currently have 10 node 1.5 PB raw space cluster with EC profile
> 6+3. Every node has 29x6TB OSDs and 64 GB of RAM.
> Recently I've noticed that nodes are starting to suffer from RAM
> insufficiency. There is currently about 2.6 million files in RGW.
> Each OSD consumes 1-2 GB of RAM.
>
> Our plan is to store something like 200-300 million files of average
> size about 5 MB. How much RAM may I need approximately?
> Do somebody else having such cluster with many files?
>
> Thanks for help!


Re: [ceph-users] Ceph OSD journal utilization

2016-06-20 Thread Jonathan Proulx
On Mon, Jun 20, 2016 at 04:02:04PM +, David Turner wrote:
:If you want to watch what a disk is doing while you watch it, use iostat on 
the journal device.  If you want to see it's patterns at all times of the day, 
use sar.  Neither of these are ceph specific commands, just Linux tools that 
can watch your disk utilization, speeds, etc (among other things.  Both tools 
are well documented and easy to use.

For spot testing, like watching disk I/O during stress testing iostat
as mentioned above is something I use frequently.

Sar is a simple way to pull historic load info from a single host, but
I find it a bit cumbersome at even my modest scale.

We use http://munin-monitoring.org/ to do visual trending of many
different server statistics including disk throughput, latency,
utilization etc. This is a bit old school and there are plenty of
other open-source ways of gathering and displaying performance metrics.

None of this is ceph specific and I agree there's no reason it should
be.

-Jon

:
:From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of EP Komarla 
[ep.koma...@flextronics.com]
:Sent: Friday, June 17, 2016 5:13 PM
:To: ceph-users@lists.ceph.com
:Subject: [ceph-users] Ceph OSD journal utilization
:
:Hi,
:
:I am looking for a way to monitor the utilization of OSD journals – by 
observing the utilization pattern over time, I can determine if I have over 
provisioned them or not. Is there a way to do this?
:
:When I googled on this topic, I saw one similar request about 4 years back.  I 
am wondering if there is some traction on this topic since then.
:
:Thanks a lot.
:
:- epk
:
:Legal Disclaimer:
:The information contained in this message may be privileged and confidential. 
It is intended to be read only by the individual or entity to whom it is 
addressed or by their designee. If the reader of this message is not the 
intended recipient, you are on notice that any distribution of this message, in 
any form, is strictly prohibited. If you have received this message in error, 
please immediately notify the sender and delete or destroy any copy of this 
message!



Re: [ceph-users] Ceph OSD journal utilization

2016-06-20 Thread Nick Fisk
There is a journal_bytes counter for the OSD, or the journal_full counter
will show if you have ever filled the journal.
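
A minimal way to read those from a running OSD (assuming the default admin
socket setup, with osd.0 here as an example):

ceph daemon osd.0 perf dump | python -m json.tool | grep -E '"journal_(bytes|full)"'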

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
David Turner
Sent: 20 June 2016 17:02
To: EP Komarla ; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph OSD journal utilization

 

If you want to watch what a disk is doing while you watch it, use iostat on
the journal device.  If you want to see it's patterns at all times of the
day, use sar.  Neither of these are ceph specific commands, just Linux tools
that can watch your disk utilization, speeds, etc (among other things.  Both
tools are well documented and easy to use.

  _  

From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of EP Komarla
[ep.koma...@flextronics.com]
Sent: Friday, June 17, 2016 5:13 PM
To: ceph-users@lists.ceph.com  
Subject: [ceph-users] Ceph OSD journal utilization

Hi,

 

I am looking for a way to monitor the utilization of OSD journals - by
observing the utilization pattern over time, I can determine if I have over
provisioned them or not. Is there a way to do this?  

 

When I googled on this topic, I saw one similar request about 4 years back.
I am wondering if there is some traction on this topic since then.

 

Thanks a lot.

 

- epk


Legal Disclaimer:
The information contained in this message may be privileged and
confidential. It is intended to be read only by the individual or entity to
whom it is addressed or by their designee. If the reader of this message is
not the intended recipient, you are on notice that any distribution of this
message, in any form, is strictly prohibited. If you have received this
message in error, please immediately notify the sender and delete or destroy
any copy of this message!



Re: [ceph-users] Ceph OSD journal utilization

2016-06-20 Thread Benjeman Meekhof
For automatically collecting stats like this you might also look into
collectd.  It has many plugins for different system statistics
including one for collecting stats from Ceph daemon admin sockets.
There are several ways to collect and view the data from collectd.  We
are pointing clients at Influxdb and then viewing with Grafana.  There
are many small tutorials on this combination of tools if you google
and the docs for the tools cover how to configure each end.

In particular this plugin will get disk utilization counters:
https://collectd.org/wiki/index.php/Plugin:Disk

This combination is how we are monitoring OSD journal utilization
among other things.
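
For anyone wanting to replicate this, a minimal collectd sketch of those two
pieces might look like the following (plugin options per the collectd docs;
the admin socket path assumes a default Ceph install and osd.0 on that host):

LoadPlugin disk
LoadPlugin ceph

<Plugin disk>
  # match the journal devices; adjust the regex to your layout
  Disk "/^nvme/"
  IgnoreSelected false
</Plugin>

<Plugin ceph>
  <Daemon "osd.0">
    SocketPath "/var/run/ceph/ceph-osd.0.asok"
  </Daemon>
</Plugin>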

thanks,
Ben





On Mon, Jun 20, 2016 at 12:02 PM, David Turner
 wrote:
> If you want to watch what a disk is doing while you watch it, use iostat on
> the journal device.  If you want to see it's patterns at all times of the
> day, use sar.  Neither of these are ceph specific commands, just Linux tools
> that can watch your disk utilization, speeds, etc (among other things.  Both
> tools are well documented and easy to use.
> 
> From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of EP Komarla
> [ep.koma...@flextronics.com]
> Sent: Friday, June 17, 2016 5:13 PM
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] Ceph OSD journal utilization
>
> Hi,
>
>
>
> I am looking for a way to monitor the utilization of OSD journals – by
> observing the utilization pattern over time, I can determine if I have over
> provisioned them or not. Is there a way to do this?
>
>
>
> When I googled on this topic, I saw one similar request about 4 years back.
> I am wondering if there is some traction on this topic since then.
>
>
>
> Thanks a lot.
>
>
>
> - epk
>
>
> Legal Disclaimer:
> The information contained in this message may be privileged and
> confidential. It is intended to be read only by the individual or entity to
> whom it is addressed or by their designee. If the reader of this message is
> not the intended recipient, you are on notice that any distribution of this
> message, in any form, is strictly prohibited. If you have received this
> message in error, please immediately notify the sender and delete or destroy
> any copy of this message!
>


Re: [ceph-users] Ceph OSD journal utilization

2016-06-20 Thread David Turner
If you want to watch what a disk is doing in real time, use iostat on the
journal device.  If you want to see its patterns at all times of the day, use
sar.  Neither of these are ceph-specific commands, just Linux tools that can
watch your disk utilization, speeds, etc. (among other things).  Both tools are
well documented and easy to use.

From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of EP Komarla 
[ep.koma...@flextronics.com]
Sent: Friday, June 17, 2016 5:13 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Ceph OSD journal utilization

Hi,

I am looking for a way to monitor the utilization of OSD journals – by 
observing the utilization pattern over time, I can determine if I have over 
provisioned them or not. Is there a way to do this?

When I googled on this topic, I saw one similar request about 4 years back.  I 
am wondering if there is some traction on this topic since then.

Thanks a lot.

- epk

Legal Disclaimer:
The information contained in this message may be privileged and confidential. 
It is intended to be read only by the individual or entity to whom it is 
addressed or by their designee. If the reader of this message is not the 
intended recipient, you are on notice that any distribution of this message, in 
any form, is strictly prohibited. If you have received this message in error, 
please immediately notify the sender and delete or destroy any copy of this 
message!


Re: [ceph-users] cluster down during backfilling, Jewel tunables and client IO optimisations

2016-06-20 Thread Daniel Swarbrick
We have just updated our third cluster from Infernalis to Jewel, and are
experiencing similar issues.

We run a number of KVM virtual machines (qemu 2.5) with RBD images, and
have seen a lot of D-state processes and even jbd/2 timeouts and kernel
stack traces inside the guests. At first I thought the VMs were being
starved of IO, but this is still happening after throttling back the
recovery with:

osd_max_backfills = 1
osd_recovery_max_active = 1
osd_recovery_op_priority = 1

After upgrading the cluster to Jewel, I changed our crushmap to use the
newer straw2 algorithm, which resulted in a little data movement, but no
problems at that stage.

Once the cluster had settled down again, I set tunables to optimal
(hammer profile -> jewel profile), which has triggered between 50% and
70% misplaced PGs on our clusters. This is when the trouble started each
time, and when we had cascading failures of VMs.

However, after performing hard shutdowns on the VMs and restarting them,
they seemed to be OK.

At this stage, I have a strong suspicion that it is the introduction of
"require_feature_tunables5 = 1" in the tunables. This seems to require
all RADOS connections to be re-established.


On 20/06/16 13:54, Andrei Mikhailovsky wrote:
> Hi Oliver,
> 
> I am also seeing this as a strange behavriour indeed! I was going through the 
> logs and I was not able to find any errors or issues. There was also no 
> slow/blocked requests that I could see during the recovery process.
> 
> Does anyone has an idea what could be the issue here? I don't want to shut 
> down all vms every time there is a new release with updated tunable values.
> 
> 
> Andrei
> 
> 
> 
> - Original Message -
>> From: "Oliver Dzombic" 
>> To: "andrei" , "ceph-users" 
>> Sent: Sunday, 19 June, 2016 10:14:35
>> Subject: Re: [ceph-users] cluster down during backfilling, Jewel tunables 
>> and client IO optimisations
> 
>> Hi,
>>
>> so far the key values for that are:
>>
>> osd_client_op_priority = 63 ( anyway default, but i set it to remember it )
>> osd_recovery_op_priority = 1
>>
>>
>> In addition i set:
>>
>> osd_max_backfills = 1
>> osd_recovery_max_active = 1
>>




Re: [ceph-users] CEPH with NVMe SSDs and Caching vs Journaling on SSDs

2016-06-20 Thread Tim Gipson
Christian,

Thanks for all the info. I’ve been looking over the mailing lists.  There is so 
much info there and from the looks of it, setting up a cache tier is much more 
complex than I had originally thought.  
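
For reference, once a cache pool exists on a CRUSH rule that only targets the
NVMe OSDs (that is the part the separate-roots discussion below is about), the
core commands boil down to something like this -- pool names are placeholders
and the sizing/hit_set values absolutely need tuning for your workload:

ceph osd tier add rbd-data nvme-cache
ceph osd tier cache-mode nvme-cache writeback
ceph osd tier set-overlay rbd-data nvme-cache
ceph osd pool set nvme-cache hit_set_type bloom
ceph osd pool set nvme-cache target_max_bytes 400000000000
ceph osd pool set nvme-cache cache_target_dirty_ratio 0.4
ceph osd pool set nvme-cache cache_target_full_ratio 0.8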

Moving the journals to OSDs was much simpler for me because you can just use 
ceph-deploy and point the journal to the device you want.
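
(For comparison, that ceph-deploy form is roughly the following -- host, data
disk and journal partition are placeholders:)

ceph-deploy osd create osd-node1:sdb:/dev/nvme0n1p1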

I do understand the difference between the cache tier and journaling.

As per your comment about the monitor nodes, the extra monitor nodes are for 
the purpose of resiliency.  We are trying to build our storage and compute 
clusters with lots of failure in mind.

Our NVME drives are only the 800GB 3600 series.

As to our networking setup: The OSD nodes have 4 x 10G nics, a bonded pair for 
front end traffic and a bonded pair for cluster traffic.  The monitor nodes 
have a bonded pair of 1Gig nics.  Our clients have 4 x 10G nics as well with a 
bonded pair dedicated to storage front end traffic connected to the ceph 
cluster.

The single NVMe for journaling was a concern but as you mentioned before, a 
host is our failure domain at this point.

I did find your comments to another user about having to add multiple roots per 
node because their NVMe drives were on different nodes.  That is the case for 
our gear as well.

Also, my gear is already in house so I’ve got what I’ve got to work with at 
this point, for good for ill.

Tim Gipson


On 6/16/16, 7:47 PM, "Christian Balzer"  wrote:


Hello,

On Thu, 16 Jun 2016 15:31:13 + Tim Gipson wrote:

> A few questions.
> 
> First, is there a good step by step to setting up a caching tier with
> NVMe SSDs that are on separate hosts?  Is that even possible?
> 
Yes. And with a cluster of your size that's the way I'd do it.
Larger clusters (a dozen plus nodes) are likely to be better suited with
storage nodes that have shared HDD OSDs for slow storage and SSD OSDs for
cache pools.

It would behoove you to scour this ML for the dozens of threads covering
this and other aspects, like:
"journal or cache tier on SSDs ?"
"Steps for Adding Cache Tier"
and even yesterday's:
"Is Dynamic Cache tiering supported in Jewel"

> Second, what sort of performance are people seeing from caching
> tiers/journaling on SSDs in Jewel?
> 
Not using Jewel, but it's bound to be better than Hammer.

Performance will depend on a myriad of things, including CPU, SSD/NVMe
models, networking, tuning, etc.
It would be better if you had a performance target and a budget to see if
they can be matched up.

Cache tiering and journaling are very different things, don't mix them up.

> Right now I am working on trying to find best practice for a CEPH
> cluster with 3 monitor nodes, and 3 OSDs with 1 800GB NVMe drive and 12
> 6TB drives.
> 
No need for dedicated monitor nodes (definitely not 3 with a cluster of
that size) if your storage nodes are designed correctly; see for example:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-April/008879.html

> My goal is reliable/somewhat fast performance.
>
Well, for starters this cluster will give you the space of one of these
nodes and worse performance than a single node due to the 3x replication.

What NVMe did you have in mind, a DC P3600 will give you 1GB/s writes
(and 3DWPD endurance), a P3700 2GB/s (and 10DWPD endurance).

What about your network?

Since the default failure domain in Ceph is the host, a single NVMe as
journal for all HDD OSDs isn't particularly risky, but it's something to
keep in mind.
 
Christian
> Any help would be greatly appreciated!
> 
> Tim Gipson
> Systems Engineer
> 
> 618 Grassmere Park Drive, Suite 12
> Nashville, TN 37211
> 
> website | blog | support


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/




Re: [ceph-users] IOPS requirements

2016-06-20 Thread Gandalf Corvotempesta
On 18 Jun 2016 07:10, "Christian Balzer"  wrote:
> That sounds extremely high, is that more or less consistent?
> How many VMs is that for?
> What are you looking at, as in are those individual disks/SSDs, a raid
> (what kind)?

800-1000 was a peak over about 5 minutes. It was just a test to see some
data.

This server has about 20 VMs on a SAS 15K RAID6 (8 disks).


[ceph-users] Criteria for Ceph journal sizing

2016-06-20 Thread Daleep Singh Bais
Dear All,

Is there some criterion for deciding on the Ceph journal size to be used,
for example with respect to the data partition size? I have noticed that if
not specified, the journal size defaults to 5GB.

Any insight in this regard will be helpful for my understanding.

Thanks.

Daleep Singh Bais


[ceph-users] Issue while building Jewel on ARM

2016-06-20 Thread Daleep Singh Bais
Dear All,

I am getting below error message while trying to build Jewel on ARM. Any
help / suggestion will be appreciated.

g++: error: unrecognized command line option '-momit-leaf-frame-pointer'
g++: error: unrecognized command line option '-momit-leaf-frame-pointer'
  CC   db/builder.o
g++: error: unrecognized command line option '-momit-leaf-frame-pointer'
Makefile:1189: recipe for target 'db/builder.o' failed
make[5]: *** [db/builder.o] Error 1
make[5]: *** Waiting for unfinished jobs
  CC   db/c.o
g++: error: unrecognized command line option '-momit-leaf-frame-pointer'
Makefile:1189: recipe for target 'db/c.o' failed
make[5]: *** [db/c.o] Error 1
make[5]: Leaving directory '/ceph_build/ceph/src/rocksdb'
Makefile:32669: recipe for target 'rocksdb/librocksdb.a' failed
make[4]: *** [rocksdb/librocksdb.a] Error 2
make[4]: *** Waiting for unfinished jobs


Thanks,

Daleep Singh Bais


Re: [ceph-users] MDS failover, how to speed it up?

2016-06-20 Thread Yan, Zheng
On Mon, Jun 20, 2016 at 7:04 PM, Brian Lagoni  wrote:
> Are anyone here able to help us with a question about mds failover?
>
> The case is that we are hitting a bug in ceph which requires us to restart
> the mds every week.
> There is a bug and PR for it here - https://github.com/ceph/ceph/pull/9456
> but until this have been resolved we need to do a restart. Unless there are
> a better workaround for this bug?
>
> The issue we are having are when we do a failover, the time it takes for the
> cephfs kernel client to recover are high enough so that the vm guests using
> this cephfs are having timeouts to they storage and therefor enters readonly
> mode.
>
> We have tried with making a failover to another mds or restarting the mds
> while it's the only mds in the cluser and in both cases our cephfs kernel
> client are taking too long to recover.
> We have also tried to set the failover MDS into "MDS_STANDBY_REPLAY" mode
> which didn't help on this matter.
>
> When doing a failover all IOPS against ceph are being blocked for 2-5 min
> until the kernel cephfs clients recovers after some timeouts messages like
> these:
> "2016-06-19 19:09:55.573739 7faaf8f48700  0 log_channel(cluster) log [WRN] :
> slow request 75.141028 seconds old, received at 2016-06-19 19:08:40.432655:
> client_request(client.4283066:4164703242 getattr pAsLsXsFs #1fe
> 2016-06-19 19:08:40.429496) currently failed to rdlock, waiting"
> After this there is a huge spike i IOPS data starts to being processed
> again.
>
> I'm not sure if any of this can be related to this warning which are present
> 90% of the day.
> "mds0: Behind on trimming (94/30)"?
> I have searched the mailing list for clues and answers on what to do about
> this but haven't found anything which have helped us.
> We have move/isolated the MDS service to it's own VM with the fastest
> processor we having, without any real changes to this warning.
>
>  Our infrastructure is the following:
>  - We use CEPH/CEPHFS (10.2.1)
>  - We have 3 mons and 6 storage servers with a total of 36 OSDs (~4160 PGs).
>  - We have one main mds and one standby mds.
>  - The primary MDS is a virtual machine with 8 core E5-2643 v3 @
> 3.40GHz(steal time=0), 16G mem
>  - We are using ceph kernel client to mount cephfs.
>  - Ubuntu 16.04 (4.4.0-22-generic kernel)
>  - The OSD's are physical machines with 8 cores & 32GB memory
>  - All networking is 10Gb
>
> So at the end are there anything we can do to make the failover and recovery
> to go faster?

I guess your MDS is very busy; there are lots of inodes in the client
cache. Please run 'ceph daemon mds.xxx session ls' before restarting
the MDS, and send the output to us.
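
(A minimal sketch of capturing that on the MDS host - "mymds" stands in for
your MDS daemon name:)

# run on the host with the active MDS, via its admin socket
ceph daemon mds.mymds session ls > mds-sessions-$(date +%F).json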

Regards
Yan, Zheng


>
> Regards,
> Brian Lagoni
> System administrator, Engineering Tools
> Unity Technologies
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cluster down during backfilling, Jewel tunables and client IO optimisations

2016-06-20 Thread Andrei Mikhailovsky
Hi Oliver,

I also see this as strange behaviour indeed! I was going through the
logs and I was not able to find any errors or issues. There were also no
slow/blocked requests that I could see during the recovery process.

Does anyone have an idea what the issue could be here? I don't want to shut down
all VMs every time there is a new release with updated tunable values.
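
For reference, the runtime equivalent of the settings Oliver lists below can
be injected without restarting anything - a sketch, with illustrative values:

# throttle recovery/backfill so client I/O keeps priority during the data move
ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'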


Andrei



- Original Message -
> From: "Oliver Dzombic" 
> To: "andrei" , "ceph-users" 
> Sent: Sunday, 19 June, 2016 10:14:35
> Subject: Re: [ceph-users] cluster down during backfilling, Jewel tunables and 
> client IO optimisations

> Hi,
> 
> so far the key values for that are:
> 
> osd_client_op_priority = 63 ( anyway default, but i set it to remember it )
> osd_recovery_op_priority = 1
> 
> 
> In addition i set:
> 
> osd_max_backfills = 1
> osd_recovery_max_active = 1
> 
> 
> ---
> 
> 
> But according to your settings its all ok.
> 
> According to what you described, the problem was not the backfilling but
> something else inside the cluster. Maybe something was blocked somewhere
> and only a reset could help. The logs might have given an answer
> about that.
> 
> --
> Mit freundlichen Gruessen / Best regards
> 
> Oliver Dzombic
> IP-Interactive
> 
> mailto:i...@ip-interactive.de
> 
> Anschrift:
> 
> IP Interactive UG ( haftungsbeschraenkt )
> Zum Sonnenberg 1-3
> 63571 Gelnhausen
> 
> HRB 93402 beim Amtsgericht Hanau
> Geschäftsführung: Oliver Dzombic
> 
> Steuer Nr.: 35 236 3622 1
> UST ID: DE274086107
> 
> 
> Am 18.06.2016 um 18:04 schrieb Andrei Mikhailovsky:
>> Hello ceph users,
>> 
>> I've recently upgraded my ceph cluster from Hammer to Jewel (10.2.1 and
>> then 10.2.2). The cluster was running okay after the upgrade. I've
>> decided to use the optimal tunables for Jewel as the ceph status was
>> complaining about the straw version and my cluster settings were not
>> optimal for jewel. I've not touched tunables since the Firefly release I
>> think. After reading the release notes and the tunables section I have
>> decided to set the crush tunables value to optimal. Taking into account
>> that a few weeks ago I have done a /reweight/-by-/utilization /which has
>> moved around about 8% of my cluster objects. This process has not caused
>> any downtime and IO to the virtual machines was available. I have also
>> altered several settings to prioritise client IO in case of repair and
>> backfilling (see config show output below).
>> 
>> Right, so, after i've set tunables to optimal value my cluster indicated
>> that it needs to move around 61% of data in the cluster. The process
>> started and I was seeing speeds of between 800MB/s - 1.5GB/s for
>> recovery. My cluster is pretty small (3 osd servers with 30 osds in
>> total). The load on the osd servers was pretty low. I was seeing a
>> typical load of 4 spiking to around 10. The IO wait values on the osd
>> servers were also pretty reasonable - around 5-15%. There were around
>> 10-15 backfilling processes.
>> 
>> About 10 minutes after the optimal tunables were set i've noticed that
>> IO wait on the vms started to increase. Initially it was 15%, after
>> another 10 mins or so it increased to around 50% and about 30-40 minutes
>> later the iowait became 95-100% on all vms. Shortly after that the vms
>> showed a bunch of hang tasks in dmesg output and shorly stopped
>> responding all together. This kind of behaviour didn't happen after
>> doing reweight-by-utilization, which i've done a few weeks prior. The
>> vms IO wait during the reweithing was around 15-20% and there were no
>> hanged tasks and all vms were running pretty well.
>> 
>> I wasn't sure how to resolve the problem. On one hand I know that
>> recovery and backfilling cause extra load on the cluster, but it should
>> never break client IO. Afterall, this seems to negate one of the key
>> points behind ceph - resilient storage cluster. Looking at the ceph -w
>> output the client IO has decreased to 0-20 IOPs, where as a typical load
>> that I see at that time of the day is around 700-1000 IOPs.
>> 
>> The strange thing is that after the cluster has finished with data move
>> (it took around 11 hours) the client IO was still not available! I was
>> not able to start any new vms despite having OK health status and all
>> PGs in active + clean state. This was pretty strange. All osd servers
>> having almost 0 load, all PGs are active + clean, all osds are up and
>> all mons are up, yet no client IO. The cluster became operational once
>> again after a reboot of one of the osd servers, which seem to have
>> brought the cluster to life.
>> 
>> My question to the community is what ceph options should be implemented
>> to make sure the client IO is _always_ available and has the highest
>> priority during any recovery/migration/backfilling operations?
>> 
>> My current settings, which i've gathered over the years 

[ceph-users] New Ceph mirror

2016-06-20 Thread Tim Bishop
Hi Wido,

Six months or so ago you asked for new Ceph mirrors. I saw there wasn't
currently one in the UK, so I've set one up following your guidelines
here:

https://github.com/ceph/ceph/tree/master/mirroring

The mirror is available at:

http://ceph.mirrorservice.org/

Over both IPv4 and IPv6, and rsync is enabled too. The vhost is
configured to respond to uk.ceph.com should you feel it's appropriate to
use that for our site.

Our service is hosted on the UK academic network and has an 8 Gbit
uplink speed. We're also users of Ceph ourselves within the School of
Computing here at the University, so I'm following the Ceph developments
closely.

If you need any further information from us please let me know.

Tim.

-- 
Tim Bishop,
Computing Officer, School of Computing, University of Kent.
PGP Key: 0x6C226B37FDF38D55

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS failover, how to speed it up?

2016-06-20 Thread John Spray
On Mon, Jun 20, 2016 at 12:04 PM, Brian Lagoni  wrote:
> Are anyone here able to help us with a question about mds failover?
>
> The case is that we are hitting a bug in ceph which requires us to restart
> the mds every week.
> There is a bug and PR for it here - https://github.com/ceph/ceph/pull/9456
> but until this have been resolved we need to do a restart. Unless there are
> a better workaround for this bug?
>
> The issue we are having are when we do a failover, the time it takes for the
> cephfs kernel client to recover are high enough so that the vm guests using
> this cephfs are having timeouts to they storage and therefor enters readonly
> mode.
>
> We have tried with making a failover to another mds or restarting the mds
> while it's the only mds in the cluser and in both cases our cephfs kernel
> client are taking too long to recover.
> We have also tried to set the failover MDS into "MDS_STANDBY_REPLAY" mode
> which didn't help on this matter.
>
> When doing a failover all IOPS against ceph are being blocked for 2-5 min
> until the kernel cephfs clients recovers after some timeouts messages like
> these:

Sounds like we need to investigate why it's taking 2-5 minutes.

You should be seeing an initial 30s delay while the mons decide that
the dead MDS is dead (you can skip this by explicitly doing "ceph mds
fail ", which you might already be doing).

Then the new MDS will proceed through a series of states (replay,
clientreplay, etc).  Your cluster log should have messages showing the
MDS state changes (mdsmap updates), so hopefully you can identify
which phase is taking unexpectedly long.  Then, you can turn up the
MDS log level, and get some insight into what it's actually doing
during that phase.
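
(A rough sketch of that in practice, assuming a single active MDS at rank 0:)

ceph mds fail 0            # skip the ~30s mon grace period and force the takeover
ceph -w | grep -i mdsmap   # watch the replacement move through replay/clientreplay/active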

John

> "2016-06-19 19:09:55.573739 7faaf8f48700  0 log_channel(cluster) log [WRN] :
> slow request 75.141028 seconds old, received at 2016-06-19 19:08:40.432655:
> client_request(client.4283066:4164703242 getattr pAsLsXsFs #1fe
> 2016-06-19 19:08:40.429496) currently failed to rdlock, waiting"
> After this there is a huge spike i IOPS data starts to being processed
> again.
>
> I'm not sure if any of this can be related to this warning which are present
> 90% of the day.
> "mds0: Behind on trimming (94/30)"?
> I have searched the mailing list for clues and answers on what to do about
> this but haven't found anything which have helped us.
> We have move/isolated the MDS service to it's own VM with the fastest
> processor we having, without any real changes to this warning.
>
>  Our infrastructure is the following:
>  - We use CEPH/CEPHFS (10.2.1)
>  - We have 3 mons and 6 storage servers with a total of 36 OSDs (~4160 PGs).
>  - We have one main mds and one standby mds.
>  - The primary MDS is a virtual machine with 8 core E5-2643 v3 @
> 3.40GHz(steal time=0), 16G mem
>  - We are using ceph kernel client to mount cephfs.
>  - Ubuntu 16.04 (4.4.0-22-generic kernel)
>  - The OSD's are physical machines with 8 cores & 32GB memory
>  - All networking is 10Gb
>
> So at the end are there anything we can do to make the failover and recovery
> to go faster?
>
> Regards,
> Brian Lagoni
> System administrator, Engineering Tools
> Unity Technologies
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] MDS failover, how to speed it up?

2016-06-20 Thread Brian Lagoni
Is anyone here able to help us with a question about MDS failover?

The case is that we are hitting a bug in ceph which requires us to restart
the mds every week.
There is a bug and PR for it here - https://github.com/ceph/ceph/pull/9456
but until this has been resolved we need to do a restart, unless there is
a better workaround for this bug?

The issue we are having is that when we do a failover, the time it takes for
the cephfs kernel client to recover is long enough that the VM guests using
this cephfs hit timeouts to their storage and therefore enter read-only
mode.

We have tried failing over to another mds, and restarting the mds while it
is the only mds in the cluster, and in both cases our cephfs kernel client
takes too long to recover.
We have also tried to set the failover MDS into "MDS_STANDBY_REPLAY" mode,
which didn't help on this matter.

When doing a failover, all IOPS against ceph are blocked for 2-5 min
until the kernel cephfs clients recover, after some timeout messages like
these:
"2016-06-19 19:09:55.573739 7faaf8f48700  0 log_channel(cluster) log [WRN]
: slow request 75.141028 seconds old, received at 2016-06-19
19:08:40.432655: client_request(client.4283066:4164703242 getattr pAsLsXsFs
#1fe 2016-06-19 19:08:40.429496) currently failed to rdlock,
waiting"
After this there is a huge spike in IOPS and data starts being processed
again.

I'm not sure if any of this can be related to this warning, which is
present 90% of the day:
"mds0: Behind on trimming (94/30)"?
I have searched the mailing list for clues and answers on what to do about
this but haven't found anything which has helped us.
We have moved/isolated the MDS service to its own VM with the fastest
processor we have, without any real change to this warning.

 Our infrastructure is the following:
 - We use CEPH/CEPHFS (10.2.1)
 - We have 3 mons and 6 storage servers with a total of 36 OSDs (~4160 PGs).
 - We have one main mds and one standby mds.
 - The primary MDS is a virtual machine with 8 core E5-2643 v3 @
3.40GHz(steal time=0), 16G mem
 - We are using ceph kernel client to mount cephfs.
 - Ubuntu 16.04 (4.4.0-22-generic kernel)
 - The OSD's are physical machines with 8 cores & 32GB memory
 - All networking is 10Gb

So, in the end, is there anything we can do to make the failover and
recovery go faster?

Regards,
Brian Lagoni
System administrator, Engineering Tools
Unity Technologies
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph rgw federated (multi site)

2016-06-20 Thread fridifree
Hi everybody,

Does anyone know if there is a detailed guide on how to set up multi-site
replication with RGW? I have looked on ceph.com and the guide there is not
detailed.

Thank you for your help ☺
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] heartbeat_check failures

2016-06-20 Thread Peter Kerdisle
Hey guys,

Today I noticed when adding new monitors to the cluster that two OSD
servers couldn't talk to each other for some reason. I am not sure if
adding the monitors caused this issue or whether the issue was always there
but adding the monitor showed it. After removing the new monitor the
cluster went back to healthy but the following errors are still being
spewed.

On both servers all the OSD logs show various messages like:

2016-06-20 12:51:32.148682 7f6d24024700 -1 osd.102 17667 heartbeat_check:
no reply from osd.89 ever on either front or back, first ping sent
2016-06-20 11:12:47.527049 (cutoff 2016-06-20 12:51:12.148679)
2016-06-20 12:51:32.148699 7f6d24024700 -1 osd.102 17667 heartbeat_check:
no reply from osd.90 ever on either front or back, first ping sent
2016-06-20 11:12:47.527049 (cutoff 2016-06-20 12:51:12.148679)
2016-06-20 12:51:32.148708 7f6d24024700 -1 osd.102 17667 heartbeat_check:
no reply from osd.91 ever on either front or back, first ping sent
2016-06-20 11:12:47.527049 (cutoff 2016-06-20 12:51:12.148679)
2016-06-20 12:51:32.148717 7f6d24024700 -1 osd.102 17667 heartbeat_check:
no reply from osd.92 ever on either front or back, first ping sent
2016-06-20 11:12:47.527049 (cutoff 2016-06-20 12:51:12.148679)
2016-06-20 12:51:32.148724 7f6d24024700 -1 osd.102 17667 heartbeat_check:
no reply from osd.93 ever on either front or back, first ping sent
2016-06-20 11:12:47.527049 (cutoff 2016-06-20 12:51:12.148679)
2016-06-20 12:51:32.148763 7f6d24024700 -1 osd.102 17667 heartbeat_check:
no reply from osd.95 ever on either front or back, first ping sent
2016-06-20 11:12:47.527049 (cutoff 2016-06-20 12:51:12.148679)
2016-06-20 12:51:32.148770 7f6d24024700 -1 osd.102 17667 heartbeat_check:
no reply from osd.96 ever on either front or back, first ping sent
2016-06-20 11:12:47.527049 (cutoff 2016-06-20 12:51:12.148679)

On Server A these errors are all generated mentioning Server B's OSDs and
on Server B it's reported on Server A's OSDs. None of the other 10 servers
have any of these issues.

I confirmed using telnet that the OSD ports are reachable.

I'm using a cluster and public network, one of the things I did notice is
this error: "0 -- private-ip-server-a:0/15329 >>
public-ip-server-b:6806/6465 pipe(0x7f9910761000 sd=64 :0 s=1 pgs=0 cs=0
l=1 c=0x7f9910f7e100).fault"

This seems to imply that server A is trying to connect to server B from
its cluster IP to the public IP. Could this be the root cause? And if so,
how can I prevent that from happening?
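
For what it's worth, OSD heartbeats use both networks (the "front" ping goes
over the public network, the "back" ping over the cluster network), so both
subnets have to be reachable between the two OSD servers. A quick check -
a sketch, with the peer addresses left as placeholders:

# on the host carrying osd.102: confirm which addresses/networks it is bound to
ceph daemon osd.102 config show | grep -E 'public_addr|cluster_addr|public_network|cluster_network'
# then verify plain reachability in both directions on both subnets
ping -c1 <server-B-cluster-ip>
ping -c1 <server-B-public-ip>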

Thanks,

Peter
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] New Ceph mirror

2016-06-20 Thread Wido den Hollander
Hey Tim,

> Op 20 juni 2016 om 12:42 schreef Tim Bishop :
> 
> 
> Hi Wido,
> 
> Six months or so ago you asked for new Ceph mirrors. I saw there wasn't
> currently one in the UK, so I've set one up following your guidelines
> here:
> 
> https://github.com/ceph/ceph/tree/master/mirroring
> 
> The mirror is available at:
> 
> http://ceph.mirrorservice.org/
> 
> Over both IPv4 and IPv6, and rsync is enabled too. The vhost is
> configured to respond to uk.ceph.com should you feel it's appropriate to
> use that for our site.
> 
> Our service is hosted on the UK academic network and has an 8 Gbit
> uplink speed. We're also users of Ceph ourselves within the School of
> Computing here at the University, so I'm following the Ceph developments
> closely.
> 

That's great! I am about to go on holiday, so it might take a bit longer.

Somebody at RedHat will have to create the CNAME for uk.ceph.com to make this 
work.

I will invite you to the ceph-mirrors mailinglist.

Wido

> If you need any further information from us please let me know.
> 
> Tim.
> 
> -- 
> Tim Bishop,
> Computing Officer, School of Computing, University of Kent.
> PGP Key: 0x6C226B37FDF38D55
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cosbench with ceph s3

2016-06-20 Thread Kanchana. P
Sorry - the java version is "1.7.0_101"


On Mon, Jun 20, 2016 at 3:46 PM, Kanchana. P 
wrote:

> HI,
>
> My configuration:
> --
> Ceph with jewel 10.2.2 on RHEl 7.2
> Radosgw is configured on another RHEL node.
> Client node has ubuntu with 14.02 version, Installed Java 1.17 version and
> cosbench 0.4.2.c4
>
> Modified the objects to 20 from 8192. workload failed in main stage, with
> the below error. Can you please let me know what I am missing.
>
> FreeMarker template error: The following has evaluated to null or missing:
> ==> info.errorStatistics.stackTraceAndMessage[trace] [in template
> "mission.ftl" at line 239, column 48] Tip: If the failing expression is
> known to be legally null/missing, either specify a default value with
> myOptionalVar!myDefault, or use <#if
> myOptionalVar??>when-present<#else>when-missing. (These only cover the last
> step of the expression; to cover the whole expression, use parenthessis:
> (myOptionVar.foo)!myDefault, (myOptionVar.foo)?? The failing instruction
> (FTL stack trace): -- ==> ${info.errorStatistics.stackTraceAndM...
> [in template "mission.ftl" at line 239, column 46] #list
> info.errorStatistics.stackTrace... [in template "mission.ftl" at line 232,
> column 9] #if showErrorStatistics [in template "mission.ftl" at line 223,
> column 5] -- Java stack trace (for programmers): --
> freemarker.core.InvalidReferenceException: [... Exception message was
> already printed; see it above ...] at
> freemarker.core.InvalidReferenceException.getInstance(InvalidReferenceException.java:98)
> at freemarker.core.EvalUtil.coerceModelToString(EvalUtil.java:382) at
> freemarker.core.Expression.evalAndCoerceToString(Expression.java:115) at
> freemarker.core.DollarVariable.accept(DollarVariable.java:76) at
> freemarker.core.Environment.visit(Environment.java:265) at
> freemarker.core.MixedContent.accept(MixedContent.java:93) at
> freemarker.core.Environment.visitByHiddingParent(Environment.java:286) at
> freemarker.core.ConditionalBlock.accept(ConditionalBlock.java:86) at
> freemarker.core.Environment.visit(Environment.java:265) at
> freemarker.core.MixedContent.accept(MixedContent.java:93) at
> freemarker.core.Environment.visit(Environment.java:265) at
> freemarker.core.IteratorBlock$Context.runLoop(IteratorBlock.java:181) at
> freemarker.core.Environment.visitIteratorBlock(Environment.java:509) at
> freemarker.core.IteratorBlock.accept(IteratorBlock.java:103) at
> freemarker.core.Environment.visit(Environment.java:265) at
> freemarker.core.MixedContent.accept(MixedContent.java:93) at
> freemarker.core.Environment.visit(Environment.java:265) at
> freemarker.core.IfBlock.accept(IfBlock.java:84) at
> freemarker.core.Environment.visitByHiddingParent(Environment.java:286) at
> freemarker.core.ConditionalBlock.accept(ConditionalBlock.java:86) at
> freemarker.core.Environment.visit(Environment.java:265) at
> freemarker.core.MixedContent.accept(MixedContent.java:93) at
> freemarker.core.Environment.visit(Environment.java:265) at
> freemarker.core.Environment.process(Environment.java:243) at
> freemarker.template.Template.process(Template.java:277) at
> org.springframework.web.servlet.view.freemarker.FreeMarkerView.processTemplate(FreeMarkerView.java:366)
> at
> org.springframework.web.servlet.view.freemarker.FreeMarkerView.doRender(FreeMarkerView.java:283)
> at
> org.springframework.web.servlet.view.freemarker.FreeMarkerView.renderMergedTemplateModel(FreeMarkerView.java:233)
> at
> org.springframework.web.servlet.view.AbstractTemplateView.renderMergedOutputModel(AbstractTemplateView.java:167)
> at
> org.springframework.web.servlet.view.AbstractView.render(AbstractView.java:250)
> at
> org.springframework.web.servlet.DispatcherServlet.render(DispatcherServlet.java:1047)
> at
> org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:817)
> at
> org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:719)
> at
> org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:644)
> at
> org.springframework.web.servlet.FrameworkServlet.doGet(FrameworkServlet.java:549)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:690) at
> javax.servlet.http.HttpServlet.service(HttpServlet.java:803) at
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
> at
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
> at
> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
> at
> org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
> at
> org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
> at
> org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
> at
> org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
> at
> 

Re: [ceph-users] Cosbench with ceph s3

2016-06-20 Thread Kanchana. P
HI,

My configuration:
--
Ceph with Jewel 10.2.2 on RHEL 7.2
Radosgw is configured on another RHEL node.
The client node runs Ubuntu 14.02, with Java 1.17 and cosbench 0.4.2.c4
installed.

I modified the object count to 20 from 8192. The workload failed in the main
stage with the error below. Can you please let me know what I am missing?

FreeMarker template error: The following has evaluated to null or missing:
==> info.errorStatistics.stackTraceAndMessage[trace] [in template
"mission.ftl" at line 239, column 48] Tip: If the failing expression is
known to be legally null/missing, either specify a default value with
myOptionalVar!myDefault, or use <#if
myOptionalVar??>when-present<#else>when-missing. (These only cover the last
step of the expression; to cover the whole expression, use parenthessis:
(myOptionVar.foo)!myDefault, (myOptionVar.foo)?? The failing instruction
(FTL stack trace): -- ==> ${info.errorStatistics.stackTraceAndM...
[in template "mission.ftl" at line 239, column 46] #list
info.errorStatistics.stackTrace... [in template "mission.ftl" at line 232,
column 9] #if showErrorStatistics [in template "mission.ftl" at line 223,
column 5] -- Java stack trace (for programmers): --
freemarker.core.InvalidReferenceException: [... Exception message was
already printed; see it above ...] at
freemarker.core.InvalidReferenceException.getInstance(InvalidReferenceException.java:98)
at freemarker.core.EvalUtil.coerceModelToString(EvalUtil.java:382) at
freemarker.core.Expression.evalAndCoerceToString(Expression.java:115) at
freemarker.core.DollarVariable.accept(DollarVariable.java:76) at
freemarker.core.Environment.visit(Environment.java:265) at
freemarker.core.MixedContent.accept(MixedContent.java:93) at
freemarker.core.Environment.visitByHiddingParent(Environment.java:286) at
freemarker.core.ConditionalBlock.accept(ConditionalBlock.java:86) at
freemarker.core.Environment.visit(Environment.java:265) at
freemarker.core.MixedContent.accept(MixedContent.java:93) at
freemarker.core.Environment.visit(Environment.java:265) at
freemarker.core.IteratorBlock$Context.runLoop(IteratorBlock.java:181) at
freemarker.core.Environment.visitIteratorBlock(Environment.java:509) at
freemarker.core.IteratorBlock.accept(IteratorBlock.java:103) at
freemarker.core.Environment.visit(Environment.java:265) at
freemarker.core.MixedContent.accept(MixedContent.java:93) at
freemarker.core.Environment.visit(Environment.java:265) at
freemarker.core.IfBlock.accept(IfBlock.java:84) at
freemarker.core.Environment.visitByHiddingParent(Environment.java:286) at
freemarker.core.ConditionalBlock.accept(ConditionalBlock.java:86) at
freemarker.core.Environment.visit(Environment.java:265) at
freemarker.core.MixedContent.accept(MixedContent.java:93) at
freemarker.core.Environment.visit(Environment.java:265) at
freemarker.core.Environment.process(Environment.java:243) at
freemarker.template.Template.process(Template.java:277) at
org.springframework.web.servlet.view.freemarker.FreeMarkerView.processTemplate(FreeMarkerView.java:366)
at
org.springframework.web.servlet.view.freemarker.FreeMarkerView.doRender(FreeMarkerView.java:283)
at
org.springframework.web.servlet.view.freemarker.FreeMarkerView.renderMergedTemplateModel(FreeMarkerView.java:233)
at
org.springframework.web.servlet.view.AbstractTemplateView.renderMergedOutputModel(AbstractTemplateView.java:167)
at
org.springframework.web.servlet.view.AbstractView.render(AbstractView.java:250)
at
org.springframework.web.servlet.DispatcherServlet.render(DispatcherServlet.java:1047)
at
org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:817)
at
org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:719)
at
org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:644)
at
org.springframework.web.servlet.FrameworkServlet.doGet(FrameworkServlet.java:549)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:690) at
javax.servlet.http.HttpServlet.service(HttpServlet.java:803) at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
at
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
at
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
at
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844)
at
org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
at 

Re: [ceph-users] Cosbench with ceph s3

2016-06-20 Thread Jaroslaw Owsiewski
Hi,

attached.

Regards,
-- 
Jarek

-- 
Jarosław Owsiewski

2016-06-20 11:01 GMT+02:00 Kanchana. P :

> Hi,
>
> Do anyone have a working configuration of ceph s3 to run with cosbench
> tool.
>
> Thanks,
> Kanchana.
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>



[Attachment: a cosbench workload XML; the list archive stripped the tags,
leaving only the storage endpoint references to http://s3.domain.]





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Cosbench with ceph s3

2016-06-20 Thread Kanchana. P
Hi,

Does anyone have a working configuration of ceph S3 to run with the
cosbench tool?

Thanks,
Kanchana.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Jewel Multisite RGW Memory Issues

2016-06-20 Thread Ben Agricola
I have 2 distinct clusters configured, in 2 different locations, and 1
zonegroup.

Cluster 1 has ~11TB of data currently on it, S3 / Swift backups via the
duplicity backup tool - each file is 25MB and probably 20% are multipart
uploads from S3 (so 4MB stripes) - 3217k objects in total. This cluster has
been running for months (without RGW replication) with no issue. Each site
has 1 RGW instance at the moment.

I recently set up the second cluster on identical hardware in a secondary
site. I configured a multi-site setup, with both of these sites in an
active-active configuration. The second cluster has no active data set, so
I would expect site 1 to start mirroring to site 2 - and it does.

Unfortunately as soon as the RGW syncing starts to run, the resident memory
usage of radosgw instances on both clusters balloons massively until the
process is OOMed. This isn't a slow leak - when testing I've found that the
radosgw processes on either side can consume up to 300MB of extra RSS per
*second*, completely OOMing a machine with 96GB of RAM in approximately 20
minutes.

If I stop the radosgw processes on one cluster (i.e. breaking replication)
then the memory usage of the radosgw processes on the other cluster stays
at around 100-500MB and does not really increase over time.

Obviously this makes multi-site replication completely unusable, so I'm
wondering if anyone has a fix or workaround. I noticed some pull requests
have been merged into the master branch for RGW memory leak fixes so I
switched to v10.2.0-2453-g94fac96 from autobuild packages, it seems like
this slows the memory increase slightly but not enough to make replication
usable yet.

I've tried valgrinding the radosgw process but doesn't come up with
anything obviously leaking (I could be doing it wrong), but an example of
the memory ballooning is captured by collectd:
http://i.imgur.com/jePYnwz.png - this memory usage is *all* on the radosgw
process RSS.
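
(For anyone trying to reproduce, a crude sketch of capturing the same RSS
curve without collectd:)

# sample radosgw RSS every 5 seconds while the multisite sync is running
while sleep 5; do
    date +%T
    ps -C radosgw -o pid=,rss=,etime=
done | tee radosgw-rss.log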

Anyone else seen this?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cluster ceph -s error

2016-06-20 Thread 施柏安
Hi,
It seems that one of your OSD servers is down. If you use the default
settings of Ceph (size=3, min_size=2), there should be three OSD nodes to
distribute the objects' replicas across. The important point is that you
only have one OSD node alive, so the surviving replica count is 1
(< min_size), which is why those PGs show as inactive.
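
A quick way to confirm - a sketch, substitute your own pool names:

ceph osd pool get rbd size       # replica count the pool wants
ceph osd pool get rbd min_size   # replicas required before I/O is served
ceph osd dump | grep pool        # shows size/min_size for every pool at once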

2016-06-20 14:55 GMT+08:00 Ishmael Tsoaela :

> Hi David,
>
> Apologies for the late response.
>
> NodeB is mon+client, nodeC is client:
>
>
>
> Cheph health details:
>
> HEALTH_ERR 819 pgs are stuck inactive for more than 300 seconds; 883 pgs
> degraded; 64 pgs stale; 819 pgs stuck inactive; 1064 pgs stuck unclean; 883
> pgs undersized; 22 requests are blocked > 32 sec; 3 osds have slow
> requests; recovery 2/8 objects degraded (25.000%); recovery 2/8 objects
> misplaced (25.000%); crush map has legacy tunables (require argonaut, min
> is firefly); crush map has straw_calc_version=0
> pg 2.fc is stuck inactive since forever, current state
> undersized+degraded+peered, last acting [2]
> pg 2.fd is stuck inactive since forever, current state
> undersized+degraded+peered, last acting [0]
> pg 2.fe is stuck inactive since forever, current state
> undersized+degraded+peered, last acting [2]
> pg 2.ff is stuck inactive since forever, current state
> undersized+degraded+peered, last acting [1]
> pg 1.fb is stuck inactive for 493857.572982, current state
> undersized+degraded+peered, last acting [4]
> pg 2.f8 is stuck inactive since forever, current state
> undersized+degraded+peered, last acting [3]
> pg 1.fa is stuck inactive for 492185.443146, current state
> undersized+degraded+peered, last acting [0]
> pg 2.f9 is stuck inactive since forever, current state
> undersized+degraded+peered, last acting [0]
> pg 1.f9 is stuck inactive for 492185.452890, current state
> undersized+degraded+peered, last acting [2]
> pg 2.fa is stuck inactive since forever, current state
> undersized+degraded+peered, last acting [3]
> pg 1.f8 is stuck inactive for 492185.443324, current state
> undersized+degraded+peered, last acting [0]
> pg 2.fb is stuck inactive since forever, current state
> undersized+degraded+peered, last acting [2]
> .
> .
> .
>
> pg 1.fb is undersized+degraded+peered, acting [4]
> pg 2.ff is undersized+degraded+peered, acting [1]
> pg 2.fe is undersized+degraded+peered, acting [2]
> pg 2.fd is undersized+degraded+peered, acting [0]
> pg 2.fc is undersized+degraded+peered, acting [2]
> 3 ops are blocked > 536871 sec on osd.4
> 15 ops are blocked > 268435 sec on osd.4
> 1 ops are blocked > 262.144 sec on osd.4
> 2 ops are blocked > 268435 sec on osd.3
> 1 ops are blocked > 268435 sec on osd.1
> 3 osds have slow requests
> recovery 2/8 objects degraded (25.000%)
> recovery 2/8 objects misplaced (25.000%)
> crush map has legacy tunables (require argonaut, min is firefly); see
> http://ceph.com/docs/master/rados/operations/crush-map/#tunables
> crush map has straw_calc_version=0; see
> http://ceph.com/docs/master/rados/operations/crush-map/#tunables
>
>
> ceph osd stat
>
> cluster-admin@nodeB:~/.ssh/ceph-cluster$ cat ceph_osd_stat.txt
>  osdmap e80: 10 osds: 5 up, 5 in; 558 remapped pgs
> flags sortbitwise
>
>
> ceph osd tree:
>
> cluster-admin@nodeB:~/.ssh/ceph-cluster$ ceph osd tree
> ID WEIGHT  TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY
> -1 9.08691 root default
> -2 4.54346 host nodeB
>  5 0.90869 osd.5 down0  1.0
>  6 0.90869 osd.6 down0  1.0
>  7 0.90869 osd.7 down0  1.0
>  8 0.90869 osd.8 down0  1.0
>  9 0.90869 osd.9 down0  1.0
> -3 4.54346 host nodeC
>  0 0.90869 osd.0   up  1.0  1.0
>  1 0.90869 osd.1   up  1.0  1.0
>  2 0.90869 osd.2   up  1.0  1.0
>  3 0.90869 osd.3   up  1.0  1.0
>  4 0.90869 osd.4   up  1.0  1.0
>
>
>
>
> CrushMap:
>
>
> # begin crush map
>
> # devices
> device 0 osd.0
> device 1 osd.1
> device 2 osd.2
> device 3 osd.3
> device 4 osd.4
> device 5 osd.5
> device 6 osd.6
> device 7 osd.7
> device 8 osd.8
> device 9 osd.9
>
> # types
> type 0 osd
> type 1 host
> type 2 chassis
> type 3 rack
> type 4 row
> type 5 pdu
> type 6 pod
> type 7 room
> type 8 datacenter
> type 9 region
> type 10 root
>
> # buckets
> host nodeB {
> id -2   # do not change unnecessarily
> # weight 4.543
> alg straw
> hash 0  # rjenkins1
> item osd.5 weight 0.909
> item osd.6 weight 0.909
> item osd.7 weight 0.909
> item osd.8 weight 0.909
> item osd.9 weight 0.909
> }
> host nodeC {
> id -3   # do not change unnecessarily
> # weight 4.543
> alg straw
> hash 0  # rjenkins1
> item osd.0 weight 0.909
> item osd.1 weight 

[ceph-users] Chown / symlink issues on download.ceph.com

2016-06-20 Thread Wido den Hollander
Hi Dan,

There seems to be a symlink issue on download.ceph.com:

# rsync -4 -avrn download.ceph.com::ceph /tmp|grep 'rpm-hammer/rhel7'
rpm-hammer/rhel7 -> /home/dhc-user/repos/rpm-hammer/el7

Could you take a quick look at that? It breaks the syncs for all the other 
mirrors who sync from download.ceph.com

Maybe do a chown (automated, cron?) as well to make sure all the files are 
readable by rsync?

Thanks!

Wido
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cluster ceph -s error

2016-06-20 Thread Ishmael Tsoaela
Hi David,

Apologies for the late response.

NodeB is mon+client, nodeC is client:



Ceph health details:

HEALTH_ERR 819 pgs are stuck inactive for more than 300 seconds; 883 pgs
degraded; 64 pgs stale; 819 pgs stuck inactive; 1064 pgs stuck unclean; 883
pgs undersized; 22 requests are blocked > 32 sec; 3 osds have slow
requests; recovery 2/8 objects degraded (25.000%); recovery 2/8 objects
misplaced (25.000%); crush map has legacy tunables (require argonaut, min
is firefly); crush map has straw_calc_version=0
pg 2.fc is stuck inactive since forever, current state
undersized+degraded+peered, last acting [2]
pg 2.fd is stuck inactive since forever, current state
undersized+degraded+peered, last acting [0]
pg 2.fe is stuck inactive since forever, current state
undersized+degraded+peered, last acting [2]
pg 2.ff is stuck inactive since forever, current state
undersized+degraded+peered, last acting [1]
pg 1.fb is stuck inactive for 493857.572982, current state
undersized+degraded+peered, last acting [4]
pg 2.f8 is stuck inactive since forever, current state
undersized+degraded+peered, last acting [3]
pg 1.fa is stuck inactive for 492185.443146, current state
undersized+degraded+peered, last acting [0]
pg 2.f9 is stuck inactive since forever, current state
undersized+degraded+peered, last acting [0]
pg 1.f9 is stuck inactive for 492185.452890, current state
undersized+degraded+peered, last acting [2]
pg 2.fa is stuck inactive since forever, current state
undersized+degraded+peered, last acting [3]
pg 1.f8 is stuck inactive for 492185.443324, current state
undersized+degraded+peered, last acting [0]
pg 2.fb is stuck inactive since forever, current state
undersized+degraded+peered, last acting [2]
.
.
.

pg 1.fb is undersized+degraded+peered, acting [4]
pg 2.ff is undersized+degraded+peered, acting [1]
pg 2.fe is undersized+degraded+peered, acting [2]
pg 2.fd is undersized+degraded+peered, acting [0]
pg 2.fc is undersized+degraded+peered, acting [2]
3 ops are blocked > 536871 sec on osd.4
15 ops are blocked > 268435 sec on osd.4
1 ops are blocked > 262.144 sec on osd.4
2 ops are blocked > 268435 sec on osd.3
1 ops are blocked > 268435 sec on osd.1
3 osds have slow requests
recovery 2/8 objects degraded (25.000%)
recovery 2/8 objects misplaced (25.000%)
crush map has legacy tunables (require argonaut, min is firefly); see
http://ceph.com/docs/master/rados/operations/crush-map/#tunables
crush map has straw_calc_version=0; see
http://ceph.com/docs/master/rados/operations/crush-map/#tunables


ceph osd stat

cluster-admin@nodeB:~/.ssh/ceph-cluster$ cat ceph_osd_stat.txt
 osdmap e80: 10 osds: 5 up, 5 in; 558 remapped pgs
flags sortbitwise


ceph osd tree:

cluster-admin@nodeB:~/.ssh/ceph-cluster$ ceph osd tree
ID WEIGHT  TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 9.08691 root default
-2 4.54346 host nodeB
 5 0.90869 osd.5 down0  1.0
 6 0.90869 osd.6 down0  1.0
 7 0.90869 osd.7 down0  1.0
 8 0.90869 osd.8 down0  1.0
 9 0.90869 osd.9 down0  1.0
-3 4.54346 host nodeC
 0 0.90869 osd.0   up  1.0  1.0
 1 0.90869 osd.1   up  1.0  1.0
 2 0.90869 osd.2   up  1.0  1.0
 3 0.90869 osd.3   up  1.0  1.0
 4 0.90869 osd.4   up  1.0  1.0




CrushMap:


# begin crush map

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8
device 9 osd.9

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host nodeB {
id -2   # do not change unnecessarily
# weight 4.543
alg straw
hash 0  # rjenkins1
item osd.5 weight 0.909
item osd.6 weight 0.909
item osd.7 weight 0.909
item osd.8 weight 0.909
item osd.9 weight 0.909
}
host nodeC {
id -3   # do not change unnecessarily
# weight 4.543
alg straw
hash 0  # rjenkins1
item osd.0 weight 0.909
item osd.1 weight 0.909
item osd.2 weight 0.909
item osd.3 weight 0.909
item osd.4 weight 0.909
}
root default {
id -1   # do not change unnecessarily
# weight 9.087
alg straw
hash 0  # rjenkins1
item nodeB weight 4.543
item nodeC weight 4.543
}

# rules
rule replicated_ruleset {
ruleset 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}

# end crush map



ceph.conf


cluster-admin@nodeB:~/.ssh/ceph-cluster$ cat /etc/ceph/ceph.conf
[global]
fsid = 

Re: [ceph-users] Wrong Content-Range for zero size object

2016-06-20 Thread Victor Efimov
Reported http://tracker.ceph.com/issues/16388
ceph version 10.2.1
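
(For anyone wanting to reproduce, a curl sketch against a zero-byte key -
the host, bucket and object names are placeholders, and the signed-URL query
string is elided:)

curl -sD- -o /dev/null -H 'Range: bytes=0-5242880' \
     'http://rgw.example.com/test-bucket/empty-object'
# a zero-length object should answer with Content-Length: 0 (or 416 Range Not
# Satisfiable), not "Content-Range: bytes 0-5242880/0" with a ~5MB Content-Length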


2016-06-19 20:54 GMT+03:00 Victor Efimov :

> That was 5 megabytes in size. I tried 6 megabytes and 600 bytes, same
> story, so it seems unrelated to size. I think the important things here are:
> 1) the actual object size is zero, 2) it is a Range request.
> I'll ask my sysadmin team for the version; they'll answer tomorrow. I'll
> report the issue tomorrow.
>
>
> 2016-06-19 18:28 GMT+03:00 Wido den Hollander :
>
>>
>> > Op 19 juni 2016 om 12:21 schreef Victor Efimov :
>> >
>> >
>> > When I submit request to zero-size object with Range header, I am
>> getting
>> > wrong Content-Length and Content-Range.
>> >
>> > See "Content-Range: bytes 0-5242880/0" and "Content-Length: 5242881"
>> below.
>> >
>>
>> That sounds like a bug! Good catch :) Which version of Ceph did you test
>> this against?
>>
>> Could you be so kind to report this in tracker.ceph.com as a issue?
>>
>> The 512k seems to be related to the first HEAD object, an internal detail
>> of how RGW stores objects; see rgw_rados.cc:
>>
>> #define HEAD_SIZE 512 * 1024
>>
>> Wido
>>
>> > GET
>> >
>> http:///test-vsespb-1/mykey?AWSAccessKeyId=XXX=1467330825=XXX
>> > Range: bytes=0-5242880
>> >
>> > HTTP/1.1 206 Partial Content
>> > Connection: close
>> > Date: Sun, 19 Jun 2016 10:12:40 GMT
>> > Accept-Ranges: bytes
>> > ETag: "d41d8cd98f00b204e9800998ecf8427e"
>> > Server: Apache/2.4.7 (Ubuntu)
>> > Content-Length: 5242881
>> > Content-Range: bytes 0-5242880/0
>> > Content-Type: binary/octet-stream
>> > Last-Modified: Sun, 19 Jun 2016 09:51:13 GMT
>> > Client-Date: Sun, 19 Jun 2016 10:12:41 GMT
>> > Client-Peer: 31.31.205.50:80
>> > Client-Response-Num: 1
>> > X-Amz-Meta-Md5: d41d8cd98f00b204e9800998ecf8427e
>> > X-Amz-Request-Id: XXX
>> >
>> > Without request Range everything is ok:
>> >
>> > GET
>> >
>> http://XXX/test-vsespb-1/mykey?AWSAccessKeyId=XXX=1467330825=XXX
>> >
>> > HTTP/1.1 200 OK
>> > Connection: close
>> > Date: Sun, 19 Jun 2016 10:18:58 GMT
>> > Accept-Ranges: bytes
>> > ETag: "d41d8cd98f00b204e9800998ecf8427e"
>> > Server: Apache/2.4.7 (Ubuntu)
>> > Content-Length: 0
>> > Content-Type: binary/octet-stream
>> > Last-Modified: Sun, 19 Jun 2016 09:51:13 GMT
>> > Client-Date: Sun, 19 Jun 2016 10:18:58 GMT
>> > Client-Peer: 31.31.205.50:80
>> > Client-Response-Num: 1
>> > X-Amz-Meta-Md5: d41d8cd98f00b204e9800998ecf8427e
>> > X-Amz-Request-Id: XXX
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Dramatic performance drop at certain number of objects in pool

2016-06-20 Thread Blair Bethwaite
On 20 June 2016 at 09:21, Blair Bethwaite  wrote:
> slow request issues). If you watch your xfs stats you'll likely get
> further confirmation. In my experience xs_dir_lookups balloons (which
> means directory lookups are missing cache and going to disk).

Murphy's a bitch. Today we upgraded a cluster to latest Hammer in
preparation for Jewel/RHCS2. Turns out when we last hit this very
problem we had only ephemerally set the new filestore merge/split
values - oops. Here's what started happening when we upgraded and
restarted a bunch of OSDs:
https://au-east.erc.monash.edu.au/swift/v1/public/grafana-ceph-xs_dir_lookup.png

It seemed to cause lots of slow requests :-/. We corrected it at about
12:30, and then it still took a while to settle.
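
(For anyone else bitten by this, a sketch of applying the split/merge values
at runtime as well as persisting them - the numbers are illustrative, not a
recommendation:)

# inject now, and put the same keys under [osd] in ceph.conf so they survive restarts
ceph tell osd.* injectargs '--filestore-merge-threshold 40 --filestore-split-multiple 8'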

-- 
Cheers,
~Blairo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com