Re: [ceph-users] PCIE-SSD OSD bottom performance issue

2015-08-22 Thread Wang, Warren
Are you running fio against a sparse file, prepopulated file, or a raw device?

Warren

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
scott_tan...@yahoo.com
Sent: Thursday, August 20, 2015 3:48 AM
To: ceph-users 
Cc: liuxy666 
Subject: [ceph-users] PCIE-SSD OSD bottom performance issue

Dear all:
I am using a PCIe SSD as an OSD disk, but I found its performance to be very poor.
I have two hosts, each with one PCIe SSD, so I created two OSDs backed by the PCIe SSDs.

ID WEIGHT  TYPE NAME           UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0.35999 root default
-2 0.17999     host tds_node03
 0 0.17999         osd.0            up      1.0              1.0
-3 0.17999     host tds_node04
 1 0.17999         osd.1            up      1.0              1.0

I created a pool and an RBD device.
I ran an fio 8K randrw (70%) test against the RBD device, and the result is only
about 1W (~10,000) IOPS. I have tried many OSD thread parameters, but with no effect.
But when I tested 8K randrw (70%) against a single PCIe SSD directly, it reached 10W (~100,000) IOPS.
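
For reference, the kind of fio job I mean is roughly the following sketch (the pool
and image names are just examples, and rwmixread=70 assumes the 70% is the read share):

[rbd-8k-randrw]
ioengine=rbd
clientname=admin
pool=rbd
rbdname=test
rw=randrw
rwmixread=70
bs=8k
iodepth=32
runtime=120
time_based
group_reporting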

Is there any way to improve the PCIe SSD OSD performance?




scott_tan...@yahoo.com


Re: [ceph-users] Discuss: New default recovery config settings

2015-08-22 Thread Shinobu
Based on the original purpose of *osd_max_backfills*, which is to prevent the
following:

"*If all of these backfills happen simultaneously, it would put
excessive load on the osd.*"

the value of "osd_max_backfills" can be important in some situations, so it is
hard to say in general just how important it is.

From my experience, a big cluster can easily become complicated. I know of
some automobile manufacturers that faced performance issues, and their Ceph
clusters are actually not that big, so -;

*"dropping down max backfills is more important than reducing max recovery
(gathering recovery metadata happens largely in memory)"*

As Jan said,

"*increasing the number of PGs helped with this as the “blocks” of work are
much smaller than before.*"

The number of PGs is thus also one of the factors that affect performance, and
needs to be considered.

From the messages from Huang and Jan, we may need to keep in mind that the total
number of PGs is not always equal to the following formula:

"*Total PGs = (OSDs * 100) / pool size*"

So the ideas I like, and would like to try, are:

"
*What I would be happy to see is more of a QOS style tunable along the
lines of networking traffic shaping.*"
 - Milosz Tanski

"
*Another idea would be to have a better way to prioritize recovery traffic
to an*
*even lower priority level by setting the ionice value to 'idle' in the CFQ
scheduler*"
 - Bryan Stillwell
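
(On Bryan's idea: as far as I know, Ceph already exposes something part of the
way there for the OSD's disk thread - it only takes effect when the data disk is
using the Linux CFQ scheduler, and it covers the disk thread's work such as
scrubbing rather than the recovery ops themselves. A rough sketch:

# in ceph.conf; effective only with the CFQ I/O scheduler on the data disks
osd_disk_thread_ioprio_class = idle
osd_disk_thread_ioprio_priority = 7

# runtime equivalent:
ceph tell osd.* injectargs '--osd_disk_thread_ioprio_class idle --osd_disk_thread_ioprio_priority 7'
)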

 Shinobu


On Fri, Jun 5, 2015 at 8:24 AM, Scottix  wrote:

> From an ease-of-use standpoint, and depending on the situation in which you
> are setting up your environment, the idea is as follows:
>
> It seems like it would be nice to have some easy on demand control where
> you don't have to think a whole lot other than knowing how it is going to
> affect your cluster in a general sense.
>
> The two extremes and a general limitation would be:
> 1. Prioritize data recovery
> 2. Prioritize client usability
> A third might be hardware-related, like a 1Gb connection
>
> With predefined settings you can set up different levels that have sensible
> settings, and maybe one that is custom for the advanced user.
> Example command (Caveat: I don't fully know how your configs work):
> ceph osd set priority 
> *With priority set it would lock certain attributes
> **With priority unset it would unlock certain attributes
>
> In our use case basically after 8pm the activity goes way down. Here I can
> up the priority to medium or high, then at 6 am I can adjust it back to low.
>
> With cron I can easily schedule that or depending on the current situation
> I can schedule maintenance and change the priority to fit my needs.
>
>
>
> On Thu, Jun 4, 2015 at 2:01 PM Mike Dawson 
> wrote:
>
>> With a write-heavy RBD workload, I add the following to ceph.conf:
>>
>> osd_max_backfills = 2
>> osd_recovery_max_active = 2
>>
>> If things are going well during recovery (i.e. guests happy and no slow
>> requests), I will often bump both up to three:
>>
>> # ceph tell osd.* injectargs '--osd-max-backfills 3
>> --osd-recovery-max-active 3'
>>
>> If I see slow requests, I drop them down.
>>
>> The biggest downside to setting either to 1 seems to be the long tail
>> issue detailed in:
>>
>> http://tracker.ceph.com/issues/9566
>>
>> Thanks,
>> Mike Dawson
>>
>>
>> On 6/3/2015 6:44 PM, Sage Weil wrote:
>> > On Mon, 1 Jun 2015, Gregory Farnum wrote:
>> >> On Mon, Jun 1, 2015 at 6:39 PM, Paul Von-Stamwitz
>> >>  wrote:
>> >>> On Fri, May 29, 2015 at 4:18 PM, Gregory Farnum 
>> wrote:
>>  On Fri, May 29, 2015 at 2:47 PM, Samuel Just 
>> wrote:
>> > Many people have reported that they need to lower the osd recovery
>> config options to minimize the impact of recovery on client io.  We are
>> talking about changing the defaults as follows:
>> >
>> > osd_max_backfills to 1 (from 10)
>> > osd_recovery_max_active to 3 (from 15)
>> > osd_recovery_op_priority to 1 (from 10)
>> > osd_recovery_max_single_start to 1 (from 5)
>> 
>>  I'm under the (possibly erroneous) impression that reducing the
>> number of max backfills doesn't actually reduce recovery speed much (but
>> will reduce memory use), but that dropping the op priority can. I'd rather
>> we make users manually adjust values which can have a material impact on
>> their data safety, even if most of them choose to do so.
>> 
>>  After all, even under our worst behavior we're still doing a lot
>> better than a resilvering RAID array. ;) -Greg
>>  --
>> >>>
>> >>>
>> >>> Greg,
>> >>> When we set...
>> >>>
>> >>> osd recovery max active = 1
>> >>> osd max backfills = 1
>> >>>
>> >>> We see rebalance times go down by more than half and client write
>> performance increase significantly while rebalancing. We initially played
>> with these settings to improve client IO expecting recovery time to get
>> worse, but we got a 2-for-1.
>> >>> This was with firefly using replication, downing an entire node with
>> lots of SAS drives. We left osd_recovery_threads, osd_recovery_op_priority,
>> and osd_recovery_max_single_sta

[ceph-users] Slow responding OSDs are not OUTed and cause RBD client IO hangs

2015-08-22 Thread Alex Gorbachev
Hello, this is an issue we have been suffering from and researching
along with a good number of other Ceph users, as evidenced by the
recent posts.  In our specific case, these issues manifest themselves
in an RBD -> iSCSI LIO -> ESXi configuration, but the problem is more
general.

When there is an issue on OSD nodes (examples: network hangs/blips,
disk HBAs failing, driver issues, page cache/XFS issues), some OSDs
respond slowly or with significant delays.  ceph osd perf does not
show this, neither does ceph osd tree, ceph -s / ceph -w.  Instead,
the RBD IO hangs to a point where the client times out, crashes or
displays other unsavory behavior - operationally this crashes
production processes.

Today in our lab we had a disk controller issue, which brought an OSD
node down.  Upon restart, the OSDs started up and rejoined into the
cluster.  However, immediately all IOs started hanging for a long time
and aborts from ESXi -> LIO were not succeeding in canceling these
IOs.  The only warning I could see was:

root@lab2-mon1:/var/log/ceph# ceph health detail
HEALTH_WARN 30 requests are blocked > 32 sec; 1 osds have slow requests
30 ops are blocked > 2097.15 sec
30 ops are blocked > 2097.15 sec on osd.4
1 osds have slow requests

However, ceph osd perf is not showing high latency on osd 4:

root@lab2-mon1:/var/log/ceph# ceph osd perf
osd fs_commit_latency(ms) fs_apply_latency(ms)
  0                     0                   13
  1                     0                    0
  2                     0                    0
  3                   172                  208
  4                     0                    0
  5                     0                    0
  6                     0                    1
  7                     0                    0
  8                   174                  819
  9                     6                   10
 10                     0                    1
 11                     0                    1
 12                     3                    5
 13                     0                    1
 14                     7                   23
 15                     0                    1
 16                     0                    0
 17                     5                    9
 18                     0                    1
 19                    10                   18
 20                     0                    0
 21                     0                    0
 22                     0                    1
 23                     5                   10

The SMART state of the osd.4 disk is OK.  The OSD is up and in:

root@lab2-mon1:/var/log/ceph# ceph osd tree
ID WEIGHT   TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY
-8        0 root ssd
-7 14.71997 root platter
-3  7.12000 host croc3
22  0.89000 osd.22  up  1.0  1.0
15  0.89000 osd.15  up  1.0  1.0
16  0.89000 osd.16  up  1.0  1.0
13  0.89000 osd.13  up  1.0  1.0
18  0.89000 osd.18  up  1.0  1.0
 8  0.89000 osd.8   up  1.0  1.0
11  0.89000 osd.11  up  1.0  1.0
20  0.89000 osd.20  up  1.0  1.0
-4  0.47998 host croc2
10  0.06000 osd.10  up  1.0  1.0
12  0.06000 osd.12  up  1.0  1.0
14  0.06000 osd.14  up  1.0  1.0
17  0.06000 osd.17  up  1.0  1.0
19  0.06000 osd.19  up  1.0  1.0
21  0.06000 osd.21  up  1.0  1.0
 9  0.06000 osd.9   up  1.0  1.0
23  0.06000 osd.23  up  1.0  1.0
-2  7.12000 host croc1
 7  0.89000 osd.7   up  1.0  1.0
 2  0.89000 osd.2   up  1.0  1.0
 6  0.89000 osd.6   up  1.0  1.0
 1  0.89000 osd.1   up  1.0  1.0
 5  0.89000 osd.5   up  1.0  1.0
 0  0.89000 osd.0   up  1.0  1.0
 4  0.89000 osd.4   up  1.0  1.0
 3  0.89000 osd.3   up  1.0  1.0

How can we proactively detect this condition?  Is there anything I can
run that will output all slow OSDs?
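
Something along the lines of the sketch below is what I am after - going from
the health warning straight to the misbehaving OSD (osd.4 here is just taken
from the example above):

# which OSDs currently have blocked ops, according to the monitors
ceph health detail | grep 'ops are blocked' | grep 'osd\.'

# then, on the node hosting the flagged OSD, look at what it is stuck on
ceph daemon osd.4 dump_ops_in_flight
ceph daemon osd.4 dump_historic_ops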

Regards,
Alex


Re: [ceph-users] OSD GHz vs. Cores Question

2015-08-22 Thread Luis Periquito
I've been meaning to write an email about the experience we had at the company
I work for. For lack of a more complete write-up, I'll just share some of the
findings. Please note these are my experiences, and are correct for my
environment. The clients are running on OpenStack, and all servers are
Trusty. Tests were made with Hammer (0.94.2).

TL;DR: if performance is your objective, buy single-socket boxes with
high-frequency CPUs, good journal SSDs, and not too many SSDs per box. Also
change the CPU frequency governor to performance instead of the default
ondemand. And don't forget that 10GbE is a must. Replicated pools are also a
must for performance.

We wanted to have a small cluster (30TB raw); performance was important
(IOPS and latency); the network was designed to be 10G copper with BGP-attached
hosts. There was complete leeway in the design and some in the budget.

Starting with the network: that design only required us to create a single
network, but both links are usable - iperf between boxes usually shows around
17-19 Gbit/s.

We could choose the nodes, so we evaluated dual-CPU and single-CPU options. The
dual-CPU nodes would have had 24 2.5'' drive bays in a 2U chassis, whereas the
single-CPU nodes had 8 2.5'' drive bays in a 1U chassis. Long story short, we
chose the single CPU (E3-1241 v3). On the CPU side, all the tests we did with
the scaling governors showed that "performance" would give us a 30-50% boost in
IOPS. Latency also improved, but not by much. The downside was that each system
increased power usage by 5W (!?).
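
For reference, switching the governor is just something like the following on
each node (a sketch; the sysfs path is the generic one, and cpupower works too
if it is installed):

# set every core to the 'performance' frequency governor (run as root)
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > "$g"
done

# or, with the cpupower tool:
cpupower -c all frequency-set -g performance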

For the difference in price (£80) we bought the boxes with 32G of ram.

As for the disks, since we wanted fast IO we had to go with SSDs. Due to the
budget we had, we went with 4x Samsung 850 PRO + 1x Intel S3710 200G. We also
tested the P3600, but one of the critical-IO clients had far worse performance
with it. From benchmarking, the write performance is essentially that of the
Intel journal SSD. We ran tests with an Intel SSD for the journal + a different
Intel SSD for data, and performance was, within the margin of error, the same
as with an Intel SSD for the journal + a Samsung SSD for data. Performance with
a single SSD (journal and data together) was slightly lower with either one
(around 10%).

From what I've seen: on very big sequential reads and writes I can get up to
700-800 MB/s. On random IO (8k random writes, reads, or mixed workloads) we
still haven't finished all the tests, but so far it indicates the SSDs are the
bottleneck on writes, and Ceph latency on reads. However, we've been able to
extract 400 MB/s of read IO with 4 clients, each running 32 threads. I don't
have the numbers here, but that represents around 50k IOPS out of a smallish
cluster.

Stuff we still have to do revolves around jemalloc vs. tcmalloc - Trusty's
tcmalloc has the bug where the thread cache bytes variable isn't honoured. We
also still have to test various tunable options, like threads, caches, etc.
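
For anyone wanting to try the tcmalloc tuning: the variable in question is
TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES, which needs to be in the environment the
ceph-osd processes start with (exactly where to put it depends on the init
setup; the value below is just illustrative, and on Trusty's gperftools the
variable is reportedly ignored, which is the bug mentioned above):

# give tcmalloc a 128MB total thread cache instead of the 32MB default
export TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728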

Hope this helps.


On Sat, Aug 22, 2015 at 4:45 PM, Nick Fisk  wrote:

> Another thing that is probably worth considering is the practical side as
> well. A lot of the Xeon E5 boards tend to have more SAS/SATA ports and
> onboard 10GB, this can make quite a difference to the overall cost of the
> solution if you need to buy extra PCI-E cards.
>
> Unless I've missed one, I've not spotted a Xeon-D board with a large amount
> of onboard sata/sas ports. Please let me know if such a system exists as I
> would be very interested.
>
> We settled on the Hadoop version of the Supermicro Fat Twin. 12 x 3.5"
> disks
> + 2x 2.5 SSD's per U, onboard 10GB-T and the fact they share chassis and
> PSU's keeps the price down. For bulk storage one of these with a single 8
> core low clocked E5 Xeon is ideal in my mind. I did a spreadsheet working
> out U space, power and cost per GB for several different types of server,
> this solution came out ahead in nearly every category.
>
> If there is a requirement for a high perf SSD tier I would probably look at
> dedicated SSD nodes as I doubt you could cram enough CPU power into a
> single
> server to drive 12xSSD's.
>
> You mentioned low latency was a key requirement, is this always going to be
> at low queue depths? If you just need very low latency but won't actually
> be
> driving the SSD's very hard you will probably find a very highly clocked E3
> is the best bet with 2-4 SSD's per node. However if you drive the SSD's
> hard, a single one can easily max out several cores.
>
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> > Mark Nelson
> > Sent: 22 August 2015 00:00
> > To: ceph-users@lists.ceph.com
> > Subject: Re: [ceph-users] OSD GHz vs. Cores Question
> >
> > FWIW, we recently were looking at a couple of different options for the
> > machines in our test lab that run the nightly QA suite jobs via
> teuthology.
> >
> >  From a cost/benefit perspective, I think it really comes down to
> something
> > like a XEON E3-12XXv3 or the new XEON D-1540, each of which have
> > advantages/disadvantages.
> >
> > We were very tempted by the Xeon D but it was still just a little too new
> for
> > us so we ended up going with s

Re: [ceph-users] OSD GHz vs. Cores Question

2015-08-22 Thread Nick Fisk
Another thing that is probably worth considering is the practical side as
well. A lot of the Xeon E5 boards tend to have more SAS/SATA ports and
onboard 10GB, this can make quite a difference to the overall cost of the
solution if you need to buy extra PCI-E cards.

Unless I've missed one, I've not spotted a Xeon-D board with a large amount
of onboard sata/sas ports. Please let me know if such a system exists as I
would be very interested.

We settled on the Hadoop version of the Supermicro Fat Twin. 12 x 3.5" disks
+ 2x 2.5 SSD's per U, onboard 10GB-T and the fact they share chassis and
PSU's keeps the price down. For bulk storage one of these with a single 8
core low clocked E5 Xeon is ideal in my mind. I did a spreadsheet working
out U space, power and cost per GB for several different types of server,
this solution came out ahead in nearly every category.

If there is a requirement for a high perf SSD tier I would probably look at
dedicated SSD nodes as I doubt you could cram enough CPU power into a single
server to drive 12xSSD's.

You mentioned low latency was a key requirement, is this always going to be
at low queue depths? If you just need very low latency but won't actually be
driving the SSD's very hard you will probably find a very highly clocked E3
is the best bet with 2-4 SSD's per node. However if you drive the SSD's
hard, a single one can easily max out several cores.
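
(To be clear on what I mean by low queue depth: essentially the latency a
single outstanding IO sees, i.e. what a QD=1 fio run measures - roughly the
sketch below, with the target device purely illustrative.)

# 4k random-read latency at queue depth 1; average/percentile latency is the
# number that matters here, not throughput
fio --name=qd1-latency --filename=/dev/rbd0 --ioengine=libaio --direct=1 \
    --rw=randread --bs=4k --iodepth=1 --numjobs=1 --time_based --runtime=60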

> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Mark Nelson
> Sent: 22 August 2015 00:00
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] OSD GHz vs. Cores Question
> 
> FWIW, we recently were looking at a couple of different options for the
> machines in our test lab that run the nightly QA suite jobs via
teuthology.
> 
>  From a cost/benefit perspective, I think it really comes down to
something
> like a XEON E3-12XXv3 or the new XEON D-1540, each of which have
> advantages/disadvantages.
> 
> We were very tempted by the Xeon D but it was still just a little too new
for
> us so we ended up going with servers using more standard E3 processors.
> The Xeon D setup was slightly cheaper, offers more theoretical
performance,
> and is way lower power, but at a much slower per-core clock speed.  It's
likely
> that for our functional tests that clock speed may be more important than
> the cores (but on these machines we'll only have 4 OSDs per server).
> 
> Anyway, I suspect that either setup will probably work fairly well for
> spinners.  SSDs get trickier.
> 
> Mark
> 
> On 08/21/2015 05:46 PM, Robert LeBlanc wrote:
> >
> > We are looking to purchase our next round of Ceph hardware and based
> > off the work by Nick Fisk [1] our previous thought of cores over clock
> > is being revisited.
> >
> > I have two camps of thoughts and would like to get some feedback, even
> > if it is only theoretical. We currently have 12 disks per node (2
> > SSD/10 4TB spindle), but we may adjust that to 4/8. SSD would be used
> > for journals and cache tier (when [2] and fstrim are resolved). We
> > also want to stay with a single processor for cost, power and NUMA
> > considerations.
> >
> > 1. For 12 disks with three threads each (2 client and 1 background),
> > lots of slower cores would allow I/O (ceph code) to be scheduled as
> > soon as a core is available.
> >
> > 2. Faster cores would get through the Ceph code faster but there would
> > be less cores and so some I/O may have to wait to be scheduled.
> >
> > I'm leaning towards #2 for these reasons, please expose anything I may
> > be missing:
> > * The latency will only really be improved in the SSD I/O with faster
> > clock speed, all writes and any reads from the cache tier. So 8 fast
> > cores might be sufficient, reading from spindle and flushing the
> > journal will have a "substantial" amount of sleep to allow other Ceph
> > I/O to be hyperthreaded.
> > * Even though SSDs are much faster than spindles they are still orders
> > of magnitude slower than the processor, so it is still possible to get
> > more lines of code executed between SSD I/O with a faster processor
> > even with less cores.
> > * As the Ceph code is improved through optimization and less code has
> > to be executed for each I/O, faster clock speeds will only provide
> > even more benefit (lower latency, less waiting for cores) as the delay
> > shifts more from CPU to disk.
> >
> > Since our workload is typically small I/O 12K-18K, latency means a lot
> > to our performance.
> >
> > Our current processors are Intel(R) Xeon(R) CPU E5-2640 v2 @ 2.00GHz
> >
> > [1] http://www.spinics.net/lists/ceph-users/msg19305.html
> > [2] http://article.gmane.org/gmane.comp.file-systems.ceph.user/22713
> >
> > Thanks,
> > - 
> > Robert LeBlanc
> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

Re: [ceph-users] Object Storage and POSIX Mix

2015-08-22 Thread Sage Weil
On Fri, 21 Aug 2015, Robert LeBlanc wrote:
> 
> Shouldn't this already be possible with HTTP Range requests? I don't
> work with RGW or S3 so please ignore me if I'm talking crazy.

Yup.

sage
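
(For example - host, bucket and object names below are made up, and a real
request against a non-public object would also need the usual S3
authentication headers:)

# fetch only the first 1 KB of an object through the RGW S3 API
curl -H "Range: bytes=0-1023" http://rgw.example.com/mybucket/myobject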


> - 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> 
> 
> On Fri, Aug 21, 2015 at 3:27 PM, Scottix  wrote:
> > I saw this article on Linux Today and immediately thought of Ceph.
> >
> > http://www.enterprisestorageforum.com/storage-management/object-storage-vs.-posix-storage-something-in-the-middle-please-1.html
> >
> > I was thinking: would it theoretically be possible with RGW to do a GET and
> > set a BEGIN_SEEK and OFFSET to only retrieve a specific portion of the file?
> >
> > The other option would be to append data to an RGW object instead of
> > rewriting the entire object.
> > And so on...
> >
> > Just food for thought.