Re: [ceph-users] 1256 OSD/21 server ceph cluster performance issues.

2014-12-23 Thread Andrew Cowie
On Mon, 2014-12-22 at 15:26 -0800, Craig Lewis wrote:

> My problems were memory pressure plus an XFS bug, so it took a while
> to manifest. 


The following (long, ongoing) thread on linux-mm discusses our [severe]
problems with memory pressure taking out entire OSD servers. The
upstream problems are still unresolved as of Linux 3.18, but anyone
running Ceph on XFS, especially over Infiniband or *anything* else that
does custom allocation in the kernel, should probably be aware of this.
http://marc.info/?l=linux-mm&m=141605213522925&w=2

AfC
Sydney


-- 
Andrew Frederick Cowie
Head of Engineering
Anchor Systems

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 1256 OSD/21 server ceph cluster performance issues.

2014-12-23 Thread Sean Sullivan
I am trying to understand the drive throttle markers that were mentioned
earlier, to get an idea of why these drives are being marked as slow::

Here is the iostat output for the drive /dev/sdbm::
http://paste.ubuntu.com/9607168/
 
An iowait of 0.79 doesn't seem bad, but a write await of 21.52 seems
really high. Looking at the ops in flight::
http://paste.ubuntu.com/9607253/


If we check against all of the osds on this node, this seems strange::
http://paste.ubuntu.com/9607331/
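
(In case it is useful to anyone reproducing this: that per-OSD view can be
gathered with a loop over the admin sockets on the node, roughly as below.
The socket glob is the stock /var/run/ceph layout; adjust it if your paths
differ.)

for sock in /var/run/ceph/ceph-osd.*.asok; do
    echo "== ${sock} =="
    # count of ops currently queued/executing on this OSD
    ceph --admin-daemon "${sock}" dump_ops_in_flight | grep num_ops
done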

I do not understand why this node has ops in flight while the
remainder seem to be performing without issue. The load on the node is
pretty light as well, with an average CPU load of 16 and an average
iowait of 0.79::

---
/var/run/ceph# iostat -xm /dev/sdbm
Linux 3.13.0-40-generic (kh10-4) 12/23/2014 _x86_64_(40 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.94    0.00   23.30    0.79    0.00   71.97

Device:  rrqm/s  wrqm/s    r/s    w/s  rMB/s  wMB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
sdbm       0.09    0.25   5.03   3.42   0.55   0.63   288.02     0.09  10.56    2.55   22.32   2.54   2.15
---

I am still trying to understand the osd throttle perfdump, so if anyone
can help shed some light on this that would be rad. From what I can tell
from the perfdump, 4 osds stand out (the last one, 228, being the one
currently marked slow). I ended up pulling .228 from the cluster and I
have yet to see another slow/blocked osd in the output of ceph -s. The
cluster is still rebuilding since I just pulled .228 out, but I am still
getting at least 200MB/s via bonnie while the rebuild is occurring.
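
(For reference, the throttle counters mentioned above come off the OSD's
admin socket; a rough way to pull just the throttle sections for the suspect
OSD, assuming the default socket path, is::)

# pretty-print the perfcounters and keep a few lines after each throttle-* section
ceph --admin-daemon /var/run/ceph/ceph-osd.228.asok perf dump | \
    python -m json.tool | grep -A 7 'throttle-'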

Finally, in case this helps anyone: a single 1 x 1GB upload takes around
2.0 - 2.5 minutes, but if we split a 10GB file into 100 x 100MB parts we
get a completion time of about 1 minute. That works out to a 10GB file
in about 1-1.5 minutes, or 166.66MB/s, versus the 8MB/s I was getting
before with sequential uploads. All of these are coming from a single
client via boto, which leads me to think that this is a radosgw issue
specifically.
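
(A rough sketch of that split-and-parallel-upload test, for anyone who wants
to repeat it -- the container name and part size are just examples, and it
assumes the swift CLI from python-swiftclient is already configured through
the usual ST_AUTH/ST_USER/ST_KEY environment variables::)

split -b 100M big10G.bin part.                                      # 100 x ~100MB pieces
ls part.* | xargs -n 1 -P 16 -I {} swift upload testcontainer {}    # 16 uploads in parallel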

This again makes me think that this is not a slow disk issue but an
overall radosgw issue. If it were structural in any way I would expect
all of rados/ceph's faculties to be hit, and the 8MBps limit per client
would come from client throttling because some ceiling was being
reached. As it turns out I am not hitting any ceiling; some other aspect
of radosgw or boto is limiting my throughput. Is this logic not correct?
I feel like I am missing something.

Thanks for the help everyone!


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 1256 OSD/21 server ceph cluster performance issues.

2014-12-22 Thread Sean Sullivan
Awesome! I have yet to hear any ZFS-on-Ceph chat, nor have I seen it
come up on the mailing lists that I follow. I would assume it would
function pretty well considering how long it has been in use in some
production systems I have seen, though I have little to no experience
with it personally.

I thought the rados issue was weird as well. Even with a degraded
cluster I feel like I should be getting better throughput, unless I hit
an object with a bunch of bad PGs or something. We are using 2x 2x10G
cards in LACP to get over 10G on average and have separate gateway nodes
(we went with the Supermicro kit after all), so CPU on those nodes
shouldn't be an issue; it is extremely low right now, which is again
surprising.

I honestly think that this is some kind of radosgw bug in giant, as I
have another giant cluster with the exact same config that is performing
much better with much less hardware. Hopefully it is indeed a bug of
some sort and not yet another screw-up on my end. Furthermore, hopefully
I find the bug and fix it for others to find and profit from ^_^.

Thanks for all of your help!


On 12/22/2014 05:26 PM, Craig Lewis wrote:
>
>
> On Mon, Dec 22, 2014 at 2:57 PM, Sean Sullivan <seapasu...@uchicago.edu> wrote:
>
> Thanks Craig!
>
> I think that this may very well be my issue with osds dropping out
> but I am still not certain as I had the cluster up for a small
> period while running rados bench for a few days without any status
> changes.
>
>
> Mine were fine for a while too, through several benchmarks and a large
> RadosGW import.  My problems were memory pressure plus an XFS bug, so
> it took a while to manifest.  When it did, all of the ceph-osd
> processes on that node would have periods of ~30 seconds with 100%
> CPU.  Some OSDs would get kicked out.  Once that started, it was a
> downward spiral of recovery causing increasing load causing more OSDs
> to get kicked out...
>
> Once I found the memory problem, I cronned a buffer flush, and that
> usually kept things from getting too bad.
>
> I was able to see on the CPU graphs that CPU was increasing before the
> problems started.  Once CPU got close to 100% usage on all cores,
> that's when the OSDs started dropping out.  Hard to say if it was the
> CPU itself, or if the CPU was just a symptom of the memory pressure
> plus XFS bug.
>
>
>  
>
> The real big issue that I have is the radosgw one currently. After
> I figure out the root cause of the slow radosgw performance and
> correct that, it should hopefully buy me enough time to figure out
> the osd slow issue.
>
> It just doesn't make sense that I am getting 8mbps per client no
> matter 1 or 60 clients while rbd and rados shoot well above 600MBs
> (above 1000 as well).
>
>
> That is strange.  I was able to get >300 Mbps per client, on a 3 node
> cluster with GigE.  I expected that each client would saturate the
> GigE on their own, but 300 Mbps is more than enough for now.
>
> I am using the Ceph apache and fastcgi module, but otherwise it's a
> pretty standard apache setup.  My RadosGW processes are using a fair
> amount of CPU, but as long as you have some idle CPU, that shouldn't
> be the bottleneck.
>  
>
>  
>
>
> May I ask how you are monitoring your clusters logs? Are you just
> using rsyslog or do you have a logstash type system set up? Load
> wise I do not see a spike until I pull an osd out of the cluster
> or stop then start an osd without marking nodown.
>
>
> I'm monitoring the cluster with Zabbix, and that gives me pretty much
> the same info that I'd get in the logs.  I am planning to start
> pushing the logs to Logstash soon, as soon as my logstash is able to
> handle the extra load.
>  
>
>
> I do think that CPU is probably the cause of the osd slow issue
> though as it makes the most logical sense. Did you end up dropping
> ceph and moving to zfs or did you stick with it and try to
> mitigate it via file flusher/ other tweaks?
>
>
> I'm still on Ceph.  I worked around the memory pressure by
> reformatting my XFS filesystems to use regular sized inodes.  It was a
> rough couple of months, but everything has been stable for the last
> two months.
>
> I do still want to use ZFS on my OSDs.  It's got all the features of
> BtrFS, with the extra feature of being production ready.  It's just
> not production ready in Ceph yet.  It's coming along nicely though,
> and I hope to reformat one node to be all ZFS sometime next year.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 1256 OSD/21 server ceph cluster performance issues.

2014-12-22 Thread Craig Lewis
On Mon, Dec 22, 2014 at 2:57 PM, Sean Sullivan 
wrote:

>  Thanks Craig!
>
> I think that this may very well be my issue with osds dropping out but I
> am still not certain as I had the cluster up for a small period while
> running rados bench for a few days without any status changes.
>

Mine were fine for a while too, through several benchmarks and a large
RadosGW import.  My problems were memory pressure plus an XFS bug, so it
took a while to manifest.  When it did, all of the ceph-osd processes on
that node would have periods of ~30 seconds with 100% CPU.  Some OSDs would
get kicked out.  Once that started, it was a downward spiral of recovery
causing increasing load causing more OSDs to get kicked out...

Once I found the memory problem, I cronned a buffer flush, and that usually
kept things from getting too bad.
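
(The exact cron job isn't shown here, but a cronned flush of that sort is
usually something along these lines in root's crontab -- the interval and
the drop_caches value are a matter of taste::)

# sync dirty data, then drop clean page cache every 15 minutes
*/15 * * * *  /bin/sync && /bin/echo 1 > /proc/sys/vm/drop_caches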

I was able to see on the CPU graphs that CPU was increasing before the
problems started.  Once CPU got close to 100% usage on all cores, that's
when the OSDs started dropping out.  Hard to say if it was the CPU itself,
or if the CPU was just a symptom of the memory pressure plus XFS bug.




> The real big issue that I have is the radosgw one currently. After I
> figure out the root cause of the slow radosgw performance and correct that,
> it should hopefully buy me enough time to figure out the osd slow issue.
>
> It just doesn't make sense that I am getting 8mbps per client no matter 1
> or 60 clients while rbd and rados shoot well above 600MBs (above 1000 as
> well).
>

That is strange.  I was able to get >300 Mbps per client, on a 3 node
cluster with GigE.  I expected that each client would saturate the GigE on
their own, but 300 Mbps is more than enough for now.

I am using the Ceph apache and fastcgi module, but otherwise it's a pretty
standard apache setup.  My RadosGW processes are using a fair amount of
CPU, but as long as you have some idle CPU, that shouldn't be the
bottleneck.




>
> May I ask how you are monitoring your clusters logs? Are you just using
> rsyslog or do you have a logstash type system set up? Load wise I do not
> see a spike until I pull an osd out of the cluster or stop then start an
> osd without marking nodown.
>

I'm monitoring the cluster with Zabbix, and that gives me pretty much the
same info that I'd get in the logs.  I am planning to start pushing the
logs to Logstash soon, as soon as my logstash is able to handle the
extra load.


>
> I do think that CPU is probably the cause of the osd slow issue though as
> it makes the most logical sense. Did you end up dropping ceph and moving to
> zfs or did you stick with it and try to mitigate it via file flusher/ other
> tweaks?
>
>
I'm still on Ceph.  I worked around the memory pressure by reformatting my
XFS filesystems to use regular sized inodes.  It was a rough couple of
months, but everything has been stable for the last two months.
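
(The exact procedure isn't spelled out above, but the relevant mkfs
difference is roughly the one below: dropping the large-inode option
(-i size=2048) that was commonly recommended for Ceph on XFS at the time
and letting mkfs.xfs pick its regular default. Device and mount point are
placeholders, and each OSD obviously has to be taken out and drained before
being reformatted and re-added::)

mkfs.xfs -f /dev/sdX1                # no "-i size=2048", so regular-sized inodes
mount -o noatime,inode64 /dev/sdX1 /var/lib/ceph/osd/ceph-NN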

I do still want to use ZFS on my OSDs.  It's got all the features of BtrFS,
with the extra feature of being production ready.  It's just not production
ready in Ceph yet.  It's coming along nicely though, and I hope to reformat
one node to be all ZFS sometime next year.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 1256 OSD/21 server ceph cluster performance issues.

2014-12-22 Thread Sean Sullivan
Hello Christian,

Sorry for the long wait. I actually did run a rados bench earlier on in
the cluster's life without any failures, but it did take a while. That,
and there is a lot of data being downloaded into the cluster right now.
Here are the rados bench results for 100 seconds::
http://pastebin.com/q5E6JjkG

On 12/19/2014 08:10 PM, Christian Balzer wrote:
> Hello Sean,
>
> On Fri, 19 Dec 2014 02:47:41 -0600 Sean Sullivan wrote:
>
>> Hello Christian,
>>
>> Thanks again for all of your help! I started a bonnie test using the 
>> following::
>> bonnie -d /mnt/rbd/scratch2/  -m $(hostname) -f -b
>>
> While that gives you a decent idea of what the limitations of kernelspace
> mounted RBD images are, it won't tell you what your cluster is actually
> capable of in raw power.
Indeed, I agree here, and I am not interested in raw power at this point
as I am a bit past that. I performed a rados bench earlier and it seemed
to do pretty well, or at least as expected. What I have noticed in rados
bench tests is that a single test can only go as fast as the client's
network allows, and the results above seem to demonstrate this as well.
If I were to start two rados bench tests from two different hosts, I am
confident I could push above 1100 Mbps without any issue.



>
> For that use rados bench, however if your cluster is as brittle as it
> seems, this may very well cause OSDs to flop, so look out for that.
> Observe your nodes (a bit tricky with 21, but try) while this is going on.
>
> To test the write throughput, do something like this:
> "rados -p rbd bench 60 write  -t 64"
>  
> To see your CPUs melt and get an idea of the IOPS capability with 4k
> blocks, do this:
>
> "rados -p rbd bench 60 write  -t 64 -b 4096"
>  
I will try with 4k blocks next to see how that works out. I honestly
think the cluster will be stressed but should be able to handle it. A
rebuild on failure will be scary, however.

>> Hopefully it completes in the next hour or so. A reboot of the slow OSDs 
>> clears the slow marker for now
>>
>> kh10-9$ ceph -w
>>  cluster 9ea4d9d9-04e4-42fe-835a-34e4259cf8ec
>>   health HEALTH_OK
>>   monmap e1: 3 mons at
> 3 monitors, another recommendation/default that isn't really adequate for
> a cluster of this size and magnitude. Because it means you can only lose
> ONE monitor before the whole thing seizes up. I'd get 2 more (with DC
> S3700, 100 or 200GB will do fine) and spread them among the racks. 
The plan is to scale out to two more monitors; they have not arrived
yet, but that is the plan, and I agree about the number of monitors. I
talked to Inktank/Redhat about this when I was testing the 36-disk
storage node cluster, though they said something along the lines of not
needing 2 more until we have a much larger cluster. Just know that two
more monitors are indeed on the way and that this is a known issue.
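
(Once the new boxes arrive, growing the quorum from 3 to 5 is roughly the
following -- the hostnames are hypothetical and this assumes ceph-deploy is
already managing the cluster::)

ceph-deploy mon add kh11-8
ceph-deploy mon add kh12-8
ceph quorum_status --format json-pretty    # confirm all five monitors are in quorum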

>  
>> {kh08-8=10.64.64.108:6789/0,kh09-8=10.64.64.117:6789/0,kh10-8=10.64.64.125:6789/0},
>>  
>> election epoch 338, quorum 0,1,2 kh08-8,kh09-8,kh10-8
>>   osdmap e15356: 1256 osds: 1256 up, 1256 in
>>pgmap v788798: 87560 pgs, 18 pools, 187 TB data, 47919 kobjects
>>  566 TB used, 4001 TB / 4567 TB avail
> That's a lot of objects and data, was your cluster that full before
> it started to have problems?
This is due to the rados benches I ran as well as the massive amount of
data we are transferring to the current cluster.
We have 20 pools currently:
1 data, 2 rbd, 3 .rgw, 4 .rgw.root, 5 .rgw.control, 6 .rgw.gc,
7 .rgw.buckets, 8 .rgw.buckets.index, 9 .log, 10 .intent-log, 11 .usage,
12 .users, 13 .users.email, 14 .users.swift, 15 .users.uid, 16 volumes,
18 vms, 19 .rgw.buckets.extra, 20 images

data and rbd will be removed once I am done testing; those were the test
pools I created. The rest are the standard S3/Swift and OpenStack pools.

>> 87560 active+clean
> Odd number of PGs, it makes for 71 per OSD, a bit on the low side. OTOH
> you're already having scaling issues of sorts, so probably leave it be for
> now. How many pools?
20 pools, but we will only have 18 once I delete data and rbd (these
were just testing pools to begin with).
>
>>client io 542 MB/s rd, 1548 MB/s wr, 7552 op/s
>>
> Is that a typical, idle, steady state example or is this while you're
> running bonnie and pushing things into radosgw?
I am doing both actually. The downloads into radosgw can't be stopped
right now but I can stop the bonnie tests.


>
>> 2014-12-19 01:27:28.547884 mon.0 [INF] pgmap v788797: 87560 pgs: 87560 
>> active+clean; 187 TB data, 566 TB used, 4001 TB / 4567 TB avail; 433 
>> MB/s rd, 1090 MB/s wr, 5774 op/s
>> 2014-12-19 01:27:29.581955 mon.0 [INF] pgmap v788798: 87560 pgs: 87560 
>> active+clean; 187 TB data, 566 TB used, 4001 TB / 4567 TB avail; 542 
>> MB/s rd, 1548 MB/s wr, 7552 op/s
>> 2014-12-19 01:27:30.638744 mon.0 [INF] pgmap v788799: 87560 pgs: 87560 
>> active+clean; 187 TB data, 566 TB used, 4001 TB / 4567 TB avail; 726 
>> MB/s rd, 2284 MB/s wr, 10451 op/

Re: [ceph-users] 1256 OSD/21 server ceph cluster performance issues.

2014-12-19 Thread Christian Balzer

Hello Sean,

On Fri, 19 Dec 2014 02:47:41 -0600 Sean Sullivan wrote:

> Hello Christian,
> 
> Thanks again for all of your help! I started a bonnie test using the 
> following::
> bonnie -d /mnt/rbd/scratch2/  -m $(hostname) -f -b
>
While that gives you a decent idea of what the limitations of kernelspace 
mounted RBD images are, it won't tell you what your cluster is actually
capable of in raw power.

For that use rados bench, however if your cluster is as brittle as it
seems, this may very well cause OSDs to flop, so look out for that.
Observe your nodes (a bit tricky with 21, but try) while this is going on.

To test the write throughput, do something like this:
"rados -p rbd bench 60 write  -t 64"
 
To see your CPUs melt and get an idea of the IOPS capability with 4k
blocks, do this:

"rados -p rbd bench 60 write  -t 64 -b 4096"
 
> Hopefully it completes in the next hour or so. A reboot of the slow OSDs 
> clears the slow marker for now
> 
> kh10-9$ ceph -w
>  cluster 9ea4d9d9-04e4-42fe-835a-34e4259cf8ec
>   health HEALTH_OK
>   monmap e1: 3 mons at

3 monitors, another recommendation/default that isn't really adequate for
a cluster of this size and magnitude. Because it means you can only lose
ONE monitor before the whole thing seizes up. I'd get 2 more (with DC
S3700, 100 or 200GB will do fine) and spread them among the racks. 
 
> {kh08-8=10.64.64.108:6789/0,kh09-8=10.64.64.117:6789/0,kh10-8=10.64.64.125:6789/0},
>  
> election epoch 338, quorum 0,1,2 kh08-8,kh09-8,kh10-8
>   osdmap e15356: 1256 osds: 1256 up, 1256 in
>pgmap v788798: 87560 pgs, 18 pools, 187 TB data, 47919 kobjects
>  566 TB used, 4001 TB / 4567 TB avail
That's a lot of objects and data, was your cluster that full before
it started to have problems?

> 87560 active+clean
Odd number of PGs, it makes for 71 per OSD, a bit on the low side. OTOH
you're already having scaling issues of sorts, so probably leave it be for
now. How many pools?

>client io 542 MB/s rd, 1548 MB/s wr, 7552 op/s
> 
Is that a typical, idle, steady state example or is this while you're
running bonnie and pushing things into radosgw?

> 2014-12-19 01:27:28.547884 mon.0 [INF] pgmap v788797: 87560 pgs: 87560 
> active+clean; 187 TB data, 566 TB used, 4001 TB / 4567 TB avail; 433 
> MB/s rd, 1090 MB/s wr, 5774 op/s
> 2014-12-19 01:27:29.581955 mon.0 [INF] pgmap v788798: 87560 pgs: 87560 
> active+clean; 187 TB data, 566 TB used, 4001 TB / 4567 TB avail; 542 
> MB/s rd, 1548 MB/s wr, 7552 op/s
> 2014-12-19 01:27:30.638744 mon.0 [INF] pgmap v788799: 87560 pgs: 87560 
> active+clean; 187 TB data, 566 TB used, 4001 TB / 4567 TB avail; 726 
> MB/s rd, 2284 MB/s wr, 10451 op/s
> 
> Once the next slow osd comes up I guess I can tell it to bump it's log 
> up to 5 and see what may be going on.
> 
> That said I didn't see much last time.
>
You may be looking at something I saw the other day; find my
"Unexplainable slow request" thread, which unfortunately still remains a
mystery. But the fact that it worked until recently suggests loading issues.

> On 12/19/2014 12:17 AM, Christian Balzer wrote:
> > Hello,
> >
> > On Thu, 18 Dec 2014 23:45:57 -0600 Sean Sullivan wrote:
> >
> >> Wow Christian,
> >>
> >> Sorry I missed these in line replies. Give me a minute to gather some
> >> data. Thanks a million for the in depth responses!
> >>
> > No worries.
> >
> >> I thought about raiding it but I needed the space unfortunately. I
> >> had a 3x60 osd node test cluster that we tried before this and it
> >> didn't have this flopping issue or rgw issue I am seeing .
> >>
> > I think I remember that...
> 
> I  hope not. I don't think I posted about it at all. I only had it for a 
> short period before it was repurposed. I did post about a cluster 
> before that with 32 osds per node though. That one had tons of issues 
> but now seems to be running relatively smoothly.
> 
Might have been that then.

> >
> > You do realize that the RAID6 configuration option I mentioned would
> > actually give you MORE space (replication of 2 is sufficient with
> > reliable OSDs) than what you have now?
> > Albeit probably at reduced performance, how much would also depend on
> > the controllers used, but at worst the RAID6 OSD performance would be
> > equivalent to that of single disk.
> > So a Cluster (performance wise) with 21 nodes and 8 disks each.
> 
> Ah I must have misread, I thought you said raid 10, which would halve the 
> storage and add a small write penalty. For a raid 6 of 4 drives I would get 
> something like 160 iops (assuming each drive is 75) which may be worth 
> it. I would just hate to have 2+ failures and lose 4-5 drives as opposed 
> to 2 and the rebuild for a raid 6 always left a sour taste in my mouth. 
> Still 4 slow drives is better than 4TB of data over the network slowing 
> down the whole cluster.
> 
Please re-read what I wrote, I suggested 2 alternatives, one with RAID10
that would indeed reduce your space by 25% (

Re: [ceph-users] 1256 OSD/21 server ceph cluster performance issues.

2014-12-19 Thread Gregory Farnum
On Thu, Dec 18, 2014 at 8:44 PM, Sean Sullivan  wrote:
> Thanks for the reply Gegory,
>
> Sorry if this is in the wrong direction or something. Maybe I do not
> understand
>
> To test uploads I use bash time and either python-swiftclient or boto
> key.set_contents_from_filename to the radosgw. I was unaware that radosgw
> had any type of throttle settings in the configuration (I can't seem to find
> any either).  As for rbd mounts I test by creating a 1TB mount and writing a
> file to it through time+cp or dd. Not the most accurate test but I think
> should be good enough as a quick functionality test. So for writes, it's
> more for functionality than performance. I would think a basic functionality
> test should yield more than 8mb/s though.
>
> As for checking admin sockets: I have, actually; I set the 3rd gateway's
> debug_civetweb to 10 as well as debug_rgw to 5 but I still do not see
> anything that stands out. The snippet of the log I pasted has these values
> set. I did the same for an osd that is marked as slow (1112). All I can see
> in the log for the osd are ticks and heartbeat responses though, nothing
> that shows any issues. Finally I did it for the primary monitor node to see
> if I would see anything there with debug_mon set to 5
> (http://pastebin.com/hhnaFac1). I do not really see anything that would
> stand out as a failure (like a fault or timeout error).
>
> What kind of throttler limits do you mean? I didn't/don't see any mention of
> rgw throttler limits in the ceph.com docs or the admin socket, just osd/
> filesystem throttles like the inode/flusher limits; do you mean those? I have not
> messed with these limits yet on this cluster, do you think it would help?

The admin socket is different from the logs:
http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/#admin-socket

You should run the "help" command against it to see what you can do,
but in particular it lets you see the status of running operations and
history of slow ones, in addition to "perfcounters" which will let you
see things like "throttler" counts. If any of those are persistently
full on your OSDs at the same time as you have operations which seem
to be stuck waiting to get in, that would be a hint. :)
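
(Concretely, on the host where a given daemon runs, that boils down to
commands along these lines -- the osd id here is just an example::)

ceph daemon osd.1112 help                   # list everything the socket supports
ceph daemon osd.1112 dump_ops_in_flight     # ops currently queued or executing
ceph daemon osd.1112 dump_historic_ops      # recent slow ops with per-step timings
ceph daemon osd.1112 perf dump              # perfcounters, including the throttlers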
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 1256 OSD/21 server ceph cluster performance issues.

2014-12-19 Thread Sean Sullivan

Hello Christian,

Thanks again for all of your help! I started a bonnie test using the 
following::

bonnie -d /mnt/rbd/scratch2/  -m $(hostname) -f -b

Hopefully it completes in the next hour or so. A reboot of the slow OSDs 
clears the slow marker for now


kh10-9$ ceph -w
cluster 9ea4d9d9-04e4-42fe-835a-34e4259cf8ec
 health HEALTH_OK
 monmap e1: 3 mons at 
{kh08-8=10.64.64.108:6789/0,kh09-8=10.64.64.117:6789/0,kh10-8=10.64.64.125:6789/0}, 
election epoch 338, quorum 0,1,2 kh08-8,kh09-8,kh10-8

 osdmap e15356: 1256 osds: 1256 up, 1256 in
  pgmap v788798: 87560 pgs, 18 pools, 187 TB data, 47919 kobjects
566 TB used, 4001 TB / 4567 TB avail
   87560 active+clean
  client io 542 MB/s rd, 1548 MB/s wr, 7552 op/s

2014-12-19 01:27:28.547884 mon.0 [INF] pgmap v788797: 87560 pgs: 87560 
active+clean; 187 TB data, 566 TB used, 4001 TB / 4567 TB avail; 433 
MB/s rd, 1090 MB/s wr, 5774 op/s
2014-12-19 01:27:29.581955 mon.0 [INF] pgmap v788798: 87560 pgs: 87560 
active+clean; 187 TB data, 566 TB used, 4001 TB / 4567 TB avail; 542 
MB/s rd, 1548 MB/s wr, 7552 op/s
2014-12-19 01:27:30.638744 mon.0 [INF] pgmap v788799: 87560 pgs: 87560 
active+clean; 187 TB data, 566 TB used, 4001 TB / 4567 TB avail; 726 
MB/s rd, 2284 MB/s wr, 10451 op/s


Once the next slow osd comes up I guess I can tell it to bump it's log 
up to 5 and see what may be going on.


That said I didn't see much last time.

On 12/19/2014 12:17 AM, Christian Balzer wrote:

Hello,

On Thu, 18 Dec 2014 23:45:57 -0600 Sean Sullivan wrote:


Wow Christian,

Sorry I missed these in line replies. Give me a minute to gather some
data. Thanks a million for the in depth responses!


No worries.


I thought about raiding it but I needed the space unfortunately. I had a
3x60 osd node test cluster that we tried before this and it didn't have
this flopping issue or rgw issue I am seeing .


I think I remember that...


I  hope not. I don't think I posted about it at all. I only had it for a 
short period before it was repurposed. I did post about a cluster 
before that with 32 osds per node though. That one had tons of issues 
but now seems to be running relatively smoothly.




You do realize that the RAID6 configuration option I mentioned would
actually give you MORE space (replication of 2 is sufficient with reliable
OSDs) than what you have now?
Albeit probably at reduced performance, how much would also depend on the
controllers used, but at worst the RAID6 OSD performance would be
equivalent to that of single disk.
So a Cluster (performance wise) with 21 nodes and 8 disks each.


Ah I must have misread, I thought you said raid 10, which would halve the 
storage and add a small write penalty. For a raid 6 of 4 drives I would get 
something like 160 iops (assuming each drive is 75) which may be worth 
it. I would just hate to have 2+ failures and lose 4-5 drives as opposed 
to 2 and the rebuild for a raid 6 always left a sour taste in my mouth. 
Still 4 slow drives is better than 4TB of data over the network slowing 
down the whole cluster.


I knew about the 40 cores being low, but I thought at 2.7 we might be fine 
as the docs recommend 1 x 1GHz Xeon per osd. The cluster hovers around a 
load of 15-18, but with the constant flapping disks I am seeing it bump up 
as high as 120 when a disk is marked out of the cluster.


kh10-3$ cat /proc/loadavg
14.35 29.50 66.06 14/109434 724476





  
No need, now that strange monitor configuration makes sense, you (or
whoever spec'ed this) went for the Supermicro Ceph solution, right?

indeed.

In my not so humble opinion, this is the worst storage chassis ever designed
by a long shot and totally unsuitable for Ceph.
I told the Supermicro GM for Japan as much. ^o^
Well, it looks like I done goofed. I thought it was odd that they went 
against most of what the ceph documentation says about recommended hardware, 
but I read/heard from them that they worked with Inktank on this, so I 
was swayed. Besides that, we really needed the density per rack due to 
limited floor space. As I said, in capable hands this cluster would work, 
but by some stroke of luck it's in mine..




Every time a HDD dies, you will have to go and shut down the other OSD
that resides on the same tray (and set the cluster to noout).
Even worse of course if a SSD should fail.
And if somebody should just go and hotswap things w/o that step first,
hello data movement storm (2 or 10 OSDs instead of 1 or 5 respectively).

Christian
Thanks for your help and insight on this! I am going to take a nap and 
hope the cluster doesn't set fire before I wake up o_o

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 1256 OSD/21 server ceph cluster performance issues.

2014-12-18 Thread Christian Balzer

Hello,

On Thu, 18 Dec 2014 23:45:57 -0600 Sean Sullivan wrote:

> Wow Christian,
> 
> Sorry I missed these in line replies. Give me a minute to gather some
> data. Thanks a million for the in depth responses!
> 
No worries.

> I thought about raiding it but I needed the space unfortunately. I had a 
> 3x60 osd node test cluster that we tried before this and it didn't have 
> this flopping issue or rgw issue I am seeing .
>
I think I remember that...

You do realize that the RAID6 configuration option I mentioned would
actually give you MORE space (replication of 2 is sufficient with reliable
OSDs) than what you have now? 
Albeit probably at reduced performance, how much would also depend on the
controllers used, but at worst the RAID6 OSD performance would be
equivalent to that of single disk. 
So a Cluster (performance wise) with 21 nodes and 8 disks each.
 
> I can quickly answer the case/make questions, the model will need to
> wait till I get home :)
> 
> Case is a 72 disk supermicro chassis, I'll grab the exact model in my
> next reply.
>
No need, now that strange monitor configuration makes sense, you (or
whoever spec'ed this) went for the Supermicro Ceph solution, right?

In my not so humble opinion, this is the worst storage chassis ever designed
by a long shot and totally unsuitable for Ceph. 
I told the Supermicro GM for Japan as much. ^o^

Every time a HDD dies, you will have to go and shut down the other OSD
that resides on the same tray (and set the cluster to noout).
Even worse of course if a SSD should fail.
And if somebody should just go and hotswap things w/o that step first,
hello data movement storm (2 or 10 OSDs instead of 1 or 5 respectively).
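
(In other words, the swap dance looks roughly like this -- the osd id is
hypothetical, it would be the healthy tray-mate of the dead disk::)

ceph osd set noout
stop ceph-osd id=43        # Upstart syntax on Ubuntu 14.04; adjust for your init system
# ... pull the tray, replace the failed drive, reseat it ...
start ceph-osd id=43
ceph osd unset noout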

Christian
 
> Drives are HGST 4TB drives, I'll grab the model once I get home as well.
> 
> The 300 was completely incorrect and it can push more, it was just meant 
> for a quick comparison but I agree it should be higher.
> 
> Thank you so much. Please hold up and I'll grab the extra info ^~^
> 
> 
> 


-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 1256 OSD/21 server ceph cluster performance issues.

2014-12-18 Thread Sean Sullivan

Wow Christian,

Sorry I missed these in line replies. Give me a minute to gather some data. 
Thanks a million for the in depth responses!


I thought about raiding it but I needed the space unfortunately. I had a 
3x60 osd node test cluster that we tried before this and it didn't have 
this flopping issue or rgw issue I am seeing .


I can quickly answer the case/make questions, the model will need to wait 
till I get home :)


Case is a 72 disk supermicro chassis, I'll grab the exact model in my next 
reply.


Drives are HGST 4TB drives, I'll grab the model once I get home as well.

The 300 was completely incorrect and it can push more, it was just meant 
for a quick comparison but I agree it should be higher.


Thank you so much. Please hold up and I'll grab the extra info ^~^


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 1256 OSD/21 server ceph cluster performance issues.

2014-12-18 Thread Sean Sullivan

thanks!
It would be really great in the right hands; through some stroke of luck 
it's in mine. The flapping osd issue is becoming a real problem at this 
point, as it is the only possible lead I have for why the gateways are 
transferring so slowly. The weird thing is that I can have 8 or 60 transfers 
going to the radosgw and they are all at roughly 8MBps. To work around this 
right now I am starting 60+ clients across 10 boxes to get roughly 1Gbps per 
gateway across gw1 and gw2.


I have been staring at logs for hours trying to get a handle on what the 
issue may be, with no luck.


The third gateway was made last minute to test and rule out the hardware.


On December 18, 2014 10:57:41 PM Christian Balzer  wrote:



Hello,

Nice cluster, I wouldn't mind getting my hands on her ample nacelles, er,
wrong movie. ^o^

On Thu, 18 Dec 2014 21:35:36 -0600 Sean Sullivan wrote:

> Hello Yall!
>
> I can't figure out why my gateways are performing so poorly and I am not
> sure where to start looking. My RBD mounts seem to be performing fine
> (over 300 MB/s)
>
I wouldn't call 300MB/s writes fine with a cluster of this size.
How are you testing this (which tool, settings, from where)?

> while uploading a 5G file to Swift/S3 takes 2m32s
> (32MBps I believe). If we try a 1G file it's closer to 8MBps. Testing
> with nuttcp shows that I can transfer from a client with 10G interface
> to any node on the ceph cluster at the full 10G and ceph can transfer
> close to 20G between itself. I am not really sure where to start looking
> as outside of another issue which I will mention below I am clueless.
>
I know nuttin about radosgw, but I wouldn't be surprised that the
difference you see here is based on how that is eventually written to the
storage (smaller chunks than what you're using to test RBD performance).

> I have a weird setup
I'm always interested in monster storage nodes, care to share what case
this is?

> [osd nodes]
> 60 x 4TB 7200 RPM SATA Drives
What maker/model?

> 12 x  400GB s3700 SSD drives
Journals, one assumes.

> 3 x SAS2308 PCI-Express Fusion-MPT cards (drives are split evenly across
> the 3 cards)
I smell a port-expander or 3 on your backplane.
And while making sure that your SSDs get undivided 6Gb/s love would
probably help, you still have plenty of bandwidth here (4.5Gb/s per
drive), so no real issue.

> 512 GB of RAM
Sufficient.

> 2 x CPU E5-2670 v2 @ 2.50GHz
Vastly, and I mean VASTLY insufficient.
It would still be 10GHz short of the (optimistic IMHO) recommendation of
1GHz per OSD w/o SSD journals.
With SSD journals my experience shows that with certain write patterns
even 3.5GHz per OSD isn't sufficient. (there are several threads
about this here)

> 2 x 10G interfaces  LACP bonded for cluster traffic
> 2 x 10G interfaces LACP bonded for public traffic (so a total of 4 10G
> ports)
>
Your journals could handle 5.5GB/s, so you're limiting yourself here a
bit, but not too horribly.

If I had been given this hardware, I would have RAIDed things (different
controller) to keep the number of OSDs per node to something the CPUs (any
CPU really!) can handle.
Something like 16 x 4HDD RAID10 + SSDs +spares (if possible) for
performance and  8 x 8HDD RAID6 + SSDs +spares for capacity.
That still gives you 336 or 168 OSDs, allows for a replication size of 2
and as bonus you'll probably never have to deal with a failed OSD. ^o^

> [monitor nodes and gateway nodes]
> 4 x 300G 1500RPM SAS drives in raid 10
I would have used Intel DC S3700s here as well, mons love their leveldb to
be fast but
> 1 x SAS 2208
combined with this it should be fine.

> 64G of RAM
> 2 x CPU E5-2630 v2
> 2 x 10G interfaces LACP bonded for public traffic (total of 2 10G ports)
>
>
> Here is a pastebin dump of my details, I am running ceph giant 0.87
> (c51c8f9d80fa4e0168aa52685b8de40e42758578) and kernel 3.13.0-40-generic
> across the entire cluster.
>
> http://pastebin.com/XQ7USGUz -- ceph health detail
That looks positively scary, blocked requests for hours...

> http://pastebin.com/8DCzrnq1 -- /etc/ceph/ceph.conf
> http://pastebin.com/BC3gzWhT -- ceph osd tree
scroll, scroll, woah! ^o^

> http://pastebin.com/eRyY4H4c -- /var/log/radosgw/client.radosgw.rgw03.log
> http://paste.ubuntu.com/9565385/ -- crushmap (pastebin wouldn't let me)
>
>
> We ran into a few issues with density (conntrack limits, pid limit, and
> number of open files) all of which I adjusted by bumping the ulimits in
> /etc/security/limits.d/ceph.conf or sysctl. I am no longer seeing any
> signs of these limits being hit so I have not included my limits or
> sysctl conf. If you like this as well let me know and I can include it.
>
> One of the issues I am seeing is that OSDs have started to flop/ be
> marked as slow. The cluster was HEALTH_OK with all of the disks added
> for over 3 weeks before this behaviour started.
Anything changed?
In particular I assume this is a new cluster, has much data been added?
A "ceph -s" output would be nice and educational.

Can you correlate th

Re: [ceph-users] 1256 OSD/21 server ceph cluster performance issues.

2014-12-18 Thread Christian Balzer

Hello,

Nice cluster, I wouldn't mind getting my hands on her ample nacelles, er,
wrong movie. ^o^

On Thu, 18 Dec 2014 21:35:36 -0600 Sean Sullivan wrote:

> Hello Yall!
> 
> I can't figure out why my gateways are performing so poorly and I am not
> sure where to start looking. My RBD mounts seem to be performing fine
> (over 300 MB/s) 
>
I wouldn't call 300MB/s writes fine with a cluster of this size. 
How are you testing this (which tool, settings, from where)?

> while uploading a 5G file to Swift/S3 takes 2m32s
> (32MBps I believe). If we try a 1G file it's closer to 8MBps. Testing
> with nuttcp shows that I can transfer from a client with 10G interface
> to any node on the ceph cluster at the full 10G and ceph can transfer
> close to 20G between itself. I am not really sure where to start looking
> as outside of another issue which I will mention below I am clueless.
> 
I know nuttin about radosgw, but I wouldn't be surprised that the
difference you see here is based on how that is eventually written to the
storage (smaller chunks than what you're using to test RBD performance).

> I have a weird setup
I'm always interested in monster storage nodes, care to share what case
this is?

> [osd nodes]
> 60 x 4TB 7200 RPM SATA Drives
What maker/model?

> 12 x  400GB s3700 SSD drives
Journals, one assumes. 

> 3 x SAS2308 PCI-Express Fusion-MPT cards (drives are split evenly across
> the 3 cards)
I smell a port-expander or 3 on your backplane. 
And while making sure that your SSDs get undivided 6Gb/s love would
probably help, you still have plenty of bandwidth here (4.5Gb/s per
drive), so no real issue.

> 512 GB of RAM
Sufficient.

> 2 x CPU E5-2670 v2 @ 2.50GHz
Vastly, and I mean VASTLY insufficient.
It would still be 10GHz short of the (optimistic IMHO) recommendation of
1GHz per OSD w/o SSD journals. 
With SSD journals my experience shows that with certain write patterns
even 3.5GHz per OSD isn't sufficient. (there are several threads
about this here)

> 2 x 10G interfaces  LACP bonded for cluster traffic
> 2 x 10G interfaces LACP bonded for public traffic (so a total of 4 10G
> ports)
> 
Your journals could handle 5.5GB/s, so you're limiting yourself here a
bit, but not too horribly.

If I had been given this hardware, I would have RAIDed things (different
controller) to keep the number of OSDs per node to something the CPUs (any
CPU really!) can handle. 
Something like 16 x 4HDD RAID10 + SSDs +spares (if possible) for
performance and  8 x 8HDD RAID6 + SSDs +spares for capacity.
That still gives you 336 or 168 OSDs, allows for a replication size of 2
and as bonus you'll probably never have to deal with a failed OSD. ^o^

> [monitor nodes and gateway nodes]
> 4 x 300G 1500RPM SAS drives in raid 10
I would have used Intel DC S3700s here as well, mons love their leveldb to
be fast but
> 1 x SAS 2208
combined with this it should be fine.

> 64G of RAM
> 2 x CPU E5-2630 v2
> 2 x 10G interfaces LACP bonded for public traffic (total of 2 10G ports)
> 
> 
> Here is a pastebin dump of my details, I am running ceph giant 0.87 
> (c51c8f9d80fa4e0168aa52685b8de40e42758578) and kernel 3.13.0-40-generic
> across the entire cluster.
> 
> http://pastebin.com/XQ7USGUz -- ceph health detail
That looks positively scary, blocked requests for hours...

> http://pastebin.com/8DCzrnq1 -- /etc/ceph/ceph.conf
> http://pastebin.com/BC3gzWhT -- ceph osd tree
scroll, scroll, woah! ^o^

> http://pastebin.com/eRyY4H4c -- /var/log/radosgw/client.radosgw.rgw03.log
> http://paste.ubuntu.com/9565385/ -- crushmap (pastebin wouldn't let me)
> 
> 
> We ran into a few issues with density (conntrack limits, pid limit, and
> number of open files) all of which I adjusted by bumping the ulimits in
> /etc/security/limits.d/ceph.conf or sysctl. I am no longer seeing any
> signs of these limits being hit so I have not included my limits or
> sysctl conf. If you like this as well let me know and I can include it.
> 
> One of the issues I am seeing is that OSDs have started to flop/ be
> marked as slow. The cluster was HEALTH_OK with all of the disks added
> for over 3 weeks before this behaviour started. 
Anything changed? 
In particular I assume this is a new cluster, has much data been added?
A "ceph -s" output would be nice and educational.

Can you correlate the time when you start seeing slow, blocked requests
with scrubs or deep-scrubs? If so try setting your cluster temporarily to
noscrub and nodeep-scrub and see if that helps.  In case it does, setting  
"osd_scrub_sleep" (start with something high like 1.0 or 0.5 and lower until it 
hurts again) should help permanently.
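
(That is, roughly, the knobs involved::)

ceph osd set noscrub
ceph osd set nodeep-scrub
# if that helps, re-enable scrubbing and throttle it instead:
ceph osd unset noscrub
ceph osd unset nodeep-scrub
ceph tell osd.\* injectargs '--osd_scrub_sleep 0.5'   # runtime only; persist it in ceph.conf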

I have a cluster that could scrub things in minutes until the amount of
objects/data and steady load reached a threshold, and now it's hours.

In this context, check the fragmentation of your OSDs.

How busy (ceph.log ops/s) is your cluster at these times?

> RBD transfers seem to be
> fine for the most part which makes me think that this has little bearing
> on the gateway issue 

Re: [ceph-users] 1256 OSD/21 server ceph cluster performance issues.

2014-12-18 Thread Sean Sullivan
Thanks for the reply Gegory,

Sorry if this is in the wrong direction or something. Maybe I do not
understand

To test uploads I use bash time and either python-swiftclient or
boto key.set_contents_from_filename to the radosgw. I was unaware that
radosgw had any type of throttle settings in the configuration (I can't
seem to find any either).  As for rbd mounts I test by creating a 1TB
mount and writing a file to it through time+cp or dd. Not the most
accurate test but I think should be good enough as a quick functionality
test. So for writes, it's more for functionality than performance. I
would think a basic functionality test should yield more than 8mb/s though.
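
(For what it's worth, that quick check is along these lines -- image name,
size and mount point are just examples, and rbd sizes here are given in MB::)

rbd create scratch2 --size 1048576           # 1TB image
rbd map rbd/scratch2                         # prints the block device, e.g. /dev/rbd0
mkfs.xfs /dev/rbd0 && mkdir -p /mnt/rbd/scratch2 && mount /dev/rbd0 /mnt/rbd/scratch2
time dd if=/dev/zero of=/mnt/rbd/scratch2/testfile bs=4M count=2560 oflag=direct   # 10GB write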

As for checking admin sockets: I have, actually; I set the 3rd gateway's
debug_civetweb to 10 as well as debug_rgw to 5 but I still do not see
anything that stands out. The snippet of the log I pasted has these
values set. I did the same for an osd that is marked as slow (1112). All
I can see in the log for the osd are ticks and heartbeat responses
though, nothing that shows any issues. Finally I did it for the primary
monitor node to see if I would see anything there with debug_mon set to
5 (http://pastebin.com/hhnaFac1). I do not really see anything that
would stand out as a failure (like a fault or timeout error).

What kind of throttler limits do you mean? I didn't/don't see any
mention of rgw throttler limits in the ceph.com docs or the admin socket,
just osd/filesystem throttles like the inode/flusher limits, do you mean
those? I have not messed with these limits yet on this cluster, do you
think it would help?

On 12/18/2014 10:24 PM, Gregory Farnum wrote:
> What kind of uploads are you performing? How are you testing?
> Have you looked at the admin sockets on any daemons yet? Examining the
> OSDs to see if they're behaving differently on the different requests
> is one angle of attack. The other is look into is if the RGW daemons
> are hitting throttler limits or something that the RBD clients aren't.
> -Greg
> On Thu, Dec 18, 2014 at 7:35 PM Sean Sullivan wrote:
>
> Hello Yall!
>
> I can't figure out why my gateways are performing so poorly and I
> am not
> sure where to start looking. My RBD mounts seem to be performing fine
> (over 300 MB/s) while uploading a 5G file to Swift/S3 takes 2m32s
> (32MBps I believe). If we try a 1G file it's closer to 8MBps. Testing
> with nuttcp shows that I can transfer from a client with 10G interface
> to any node on the ceph cluster at the full 10G and ceph can transfer
> close to 20G between itself. I am not really sure where to start
> looking
> as outside of another issue which I will mention below I am clueless.
>
> I have a weird setup
> [osd nodes]
> 60 x 4TB 7200 RPM SATA Drives
> 12 x  400GB s3700 SSD drives
> 3 x SAS2308 PCI-Express Fusion-MPT cards (drives are split evenly
> across
> the 3 cards)
> 512 GB of RAM
> 2 x CPU E5-2670 v2 @ 2.50GHz
> 2 x 10G interfaces  LACP bonded for cluster traffic
> 2 x 10G interfaces LACP bonded for public traffic (so a total of 4 10G
> ports)
>
> [monitor nodes and gateway nodes]
> 4 x 300G 1500RPM SAS drives in raid 10
> 1 x SAS 2208
> 64G of RAM
> 2 x CPU E5-2630 v2
> 2 x 10G interfaces LACP bonded for public traffic (total of 2 10G
> ports)
>
>
> Here is a pastebin dump of my details, I am running ceph giant 0.87
> (c51c8f9d80fa4e0168aa52685b8de40e42758578) and kernel
> 3.13.0-40-generic
> across the entire cluster.
>
> http://pastebin.com/XQ7USGUz -- ceph health detail
> http://pastebin.com/8DCzrnq1 -- /etc/ceph/ceph.conf
> http://pastebin.com/BC3gzWhT -- ceph osd tree
> http://pastebin.com/eRyY4H4c --
> /var/log/radosgw/client.radosgw.rgw03.log
> http://paste.ubuntu.com/9565385/ -- crushmap (pastebin wouldn't
> let me)
>
>
> We ran into a few issues with density (conntrack limits, pid
> limit, and
> number of open files) all of which I adjusted by bumping the
> ulimits in
> /etc/security/limits.d/ceph.conf or sysctl. I am no longer seeing any
> signs of these limits being hit so I have not included my limits or
> sysctl conf. If you like this as well let me know and I can
> include it.
>
> One of the issues I am seeing is that OSDs have started to flop/ be
> marked as slow. The cluster was HEALTH_OK with all of the disks added
> for over 3 weeks before this behaviour started. RBD transfers seem
> to be
> fine for the most part which makes me think that this has little
> bearing
> on the gateway issue but it may be related. Rebooting the OSD seems to
> fix this issue.
>
> I would like to figure out the root cause of both of these issues and
> post the results back here if possible (perhaps it can help other
> people). I am really looking for a place to start looking at as the
> gatewa

Re: [ceph-users] 1256 OSD/21 server ceph cluster performance issues.

2014-12-18 Thread Gregory Farnum
What kind of uploads are you performing? How are you testing?
Have you looked at the admin sockets on any daemons yet? Examining the OSDs
to see if they're behaving differently on the different requests is one
angle of attack. The other is look into is if the RGW daemons are hitting
throttler limits or something that the RBD clients aren't.
-Greg
On Thu, Dec 18, 2014 at 7:35 PM Sean Sullivan 
wrote:

> Hello Yall!
>
> I can't figure out why my gateways are performing so poorly and I am not
> sure where to start looking. My RBD mounts seem to be performing fine
> (over 300 MB/s) while uploading a 5G file to Swift/S3 takes 2m32s
> (32MBps I believe). If we try a 1G file it's closer to 8MBps. Testing
> with nuttcp shows that I can transfer from a client with 10G interface
> to any node on the ceph cluster at the full 10G and ceph can transfer
> close to 20G between itself. I am not really sure where to start looking
> as outside of another issue which I will mention below I am clueless.
>
> I have a weird setup
> [osd nodes]
> 60 x 4TB 7200 RPM SATA Drives
> 12 x  400GB s3700 SSD drives
> 3 x SAS2308 PCI-Express Fusion-MPT cards (drives are split evenly across
> the 3 cards)
> 512 GB of RAM
> 2 x CPU E5-2670 v2 @ 2.50GHz
> 2 x 10G interfaces  LACP bonded for cluster traffic
> 2 x 10G interfaces LACP bonded for public traffic (so a total of 4 10G
> ports)
>
> [monitor nodes and gateway nodes]
> 4 x 300G 1500RPM SAS drives in raid 10
> 1 x SAS 2208
> 64G of RAM
> 2 x CPU E5-2630 v2
> 2 x 10G interfaces LACP bonded for public traffic (total of 2 10G ports)
>
>
> Here is a pastebin dump of my details, I am running ceph giant 0.87
> (c51c8f9d80fa4e0168aa52685b8de40e42758578) and kernel 3.13.0-40-generic
> across the entire cluster.
>
> http://pastebin.com/XQ7USGUz -- ceph health detail
> http://pastebin.com/8DCzrnq1 -- /etc/ceph/ceph.conf
> http://pastebin.com/BC3gzWhT -- ceph osd tree
> http://pastebin.com/eRyY4H4c -- /var/log/radosgw/client.radosgw.rgw03.log
> http://paste.ubuntu.com/9565385/ -- crushmap (pastebin wouldn't let me)
>
>
> We ran into a few issues with density (conntrack limits, pid limit, and
> number of open files) all of which I adjusted by bumping the ulimits in
> /etc/security/limits.d/ceph.conf or sysctl. I am no longer seeing any
> signs of these limits being hit so I have not included my limits or
> sysctl conf. If you like this as well let me know and I can include it.
>
> One of the issues I am seeing is that OSDs have started to flop/ be
> marked as slow. The cluster was HEALTH_OK with all of the disks added
> for over 3 weeks before this behaviour started. RBD transfers seem to be
> fine for the most part which makes me think that this has little bearing
> on the gateway issue but it may be related. Rebooting the OSD seems to
> fix this issue.
>
> I would like to figure out the root cause of both of these issues and
> post the results back here if possible (perhaps it can help other
> people). I am really looking for a place to start looking at as the
> gateway just outputs that it is posting data and all of the logs
> (outside of the monitors reporting down osds) seem to show a fully
> functioning cluster.
>
> Please help. I am in the #ceph room on OFTC every day as 'seapasulli' as
> well.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] 1256 OSD/21 server ceph cluster performance issues.

2014-12-18 Thread Sean Sullivan
Hello Yall!

I can't figure out why my gateways are performing so poorly and I am not
sure where to start looking. My RBD mounts seem to be performing fine
(over 300 MB/s) while uploading a 5G file to Swift/S3 takes 2m32s
(32MBps I believe). If we try a 1G file it's closer to 8MBps. Testing
with nuttcp shows that I can transfer from a client with 10G interface
to any node on the ceph cluster at the full 10G and ceph can transfer
close to 20G between itself. I am not really sure where to start looking
as outside of another issue which I will mention below I am clueless.

I have a weird setup
[osd nodes]
60 x 4TB 7200 RPM SATA Drives
12 x  400GB s3700 SSD drives
3 x SAS2308 PCI-Express Fusion-MPT cards (drives are split evenly across
the 3 cards)
512 GB of RAM
2 x CPU E5-2670 v2 @ 2.50GHz
2 x 10G interfaces  LACP bonded for cluster traffic
2 x 10G interfaces LACP bonded for public traffic (so a total of 4 10G
ports)

[monitor nodes and gateway nodes]
4 x 300G 1500RPM SAS drives in raid 10
1 x SAS 2208
64G of RAM
2 x CPU E5-2630 v2
2 x 10G interfaces LACP bonded for public traffic (total of 2 10G ports)


Here is a pastebin dump of my details, I am running ceph giant 0.87 
(c51c8f9d80fa4e0168aa52685b8de40e42758578) and kernel 3.13.0-40-generic
across the entire cluster.

http://pastebin.com/XQ7USGUz -- ceph health detail
http://pastebin.com/8DCzrnq1 -- /etc/ceph/ceph.conf
http://pastebin.com/BC3gzWhT -- ceph osd tree
http://pastebin.com/eRyY4H4c -- /var/log/radosgw/client.radosgw.rgw03.log
http://paste.ubuntu.com/9565385/ -- crushmap (pastebin wouldn't let me)


We ran into a few issues with density (conntrack limits, pid limit, and
number of open files) all of which I adjusted by bumping the ulimits in
/etc/security/limits.d/ceph.conf or sysctl. I am no longer seeing any
signs of these limits being hit so I have not included my limits or
sysctl conf. If you like this as well let me know and I can include it.
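
(The exact values used aren't in this thread; purely as an illustration, the
knobs named above usually end up looking something like this -- the numbers
below are placeholders, not the ones from this cluster::)

# /etc/sysctl.d/90-ceph-density.conf  (apply with `sysctl --system`)
net.netfilter.nf_conntrack_max = 1048576
kernel.pid_max = 4194303
fs.file-max = 26234859

# /etc/security/limits.d/ceph.conf  (giant-era OSDs run as root)
root  soft  nofile  1048576
root  hard  nofile  1048576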

One of the issues I am seeing is that OSDs have started to flop/ be
marked as slow. The cluster was HEALTH_OK with all of the disks added
for over 3 weeks before this behaviour started. RBD transfers seem to be
fine for the most part which makes me think that this has little bearing
on the gateway issue but it may be related. Rebooting the OSD seems to
fix this issue.

I would like to figure out the root cause of both of these issues and
post the results back here if possible (perhaps it can help other
people). I am really looking for a place to start looking at as the
gateway just outputs that it is posting data and all of the logs
(outside of the monitors reporting down osds) seem to show a fully
functioning cluster.

Please help. I am in the #ceph room on OFTC every day as 'seapasulli' as
well.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com