Re: [ceph-users] Hardware configuration for OSD in a new all flash Ceph cluster

2018-08-02 Thread Warren Wang
I would recommend a dedicated, faster, low-latency device for journaling. If cost 
is an issue, you can try to swap the 2 CPUs for a single CPU, like the 5120. 
This also gets you out of any NUMA-related issues. The reigning king for Ceph 
journals (the Intel P3700) is dead, but there are a few other options out there, 
including some NVMe options from Micron, like the 9200 Max.
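
For illustration, a minimal sketch of how such a layout could be provisioned
on Luminous with ceph-volume (device names are examples, not from this thread):

# FileStore with the journal on a separate NVMe partition
ceph-volume lvm create --filestore --data /dev/sdb --journal /dev/nvme0n1p1
# or BlueStore, with RocksDB/WAL on the NVMe instead of a journal
ceph-volume lvm create --bluestore --data /dev/sdc --block.db /dev/nvme0n1p2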

Warren

From: ceph-users  on behalf of Réal Waite 

Date: Thursday, August 2, 2018 at 2:02 PM
To: "ceph-users@lists.ceph.com" 
Subject: EXT: [ceph-users] Hardware configuration for OSD in a new all flash 
Ceph cluster

Hello,

We'd like to set up a Ceph cluster for IOPS-optimized workloads. Our needs are 
object storage (S3A for Spark, Boto for Python notebooks, …), RBD and, 
eventually, CephFS workloads.
Through various readings on IOPS-optimized Ceph workloads, we are thinking of 
buying this kind of server for the OSDs:

  *   Dell R740 with Chassis with up to 16 X 2.5” drive
  *   2 x Intel® Xeon® Silver 4116 2.1G, 12C/24T, 9.6GT/s, 16M Cache, Turbo, HT 
(85W) DDR4-2400
  *   8 x 16GB RDIMM, 2666MT/s, Dual Rank
  *   HBA330 Controller, 12Gbps Adapter, Low Profile
  *   16 x 1.92TB SSD SAS Mix Use 12Gbps 512n 2.5in Hot-plug Drive, PX05SV, 3 
DWPD, 10512 TBW
  *   OS disk = BOSS controller card with 2 x 120GB M.2 sticks (RAID 1), FH
  *   Broadcom 5720 QP 1Gb Network Daughter Card (Configuration interface)
  *   Mellanox ConnectX-3 Pro Dual Port 40 GbE QSFP+ PCIE Adapter Full Height 
(Cluster and Client Interface)
We will use the latest stable Luminous Ceph release supported by Red Hat. 
Therefore, we will use the XFS filesystem (FileStore) with the journal 
co-located with the OSDs on the same SSDs.

We will begin with 9 OSD servers and we will use a 3x or, maybe, a 2x 
replication factor since it is an all-flash Ceph cluster.

What do you think of this configuration?

Réal Waite
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Why the change from ceph-disk to ceph-volume and lvm? (and just not stick with direct disk access)

2018-06-11 Thread Warren Wang
I'll chime in as a large-scale operator, and a strong proponent of ceph-volume.
Ceph-disk wasn't accomplishing what was needed for anything other than 
vanilla use cases (and even then it was still kind of broken). I'm not going to 
re-hash Sage's valid points too much, but trying to manipulate the old ceph-disk 
to work with your own LVM (or other block manager) was painful. As far as the 
pain of doing something new goes, yes, sometimes moving to newer, more flexible 
methods results in a large amount of work. Trust me, I feel that pain when 
we're talking about things like ceph-volume, bluestore, etc., but these 
changes are not made without reason.

As far as LVM performance goes, I think that's well understood in the larger
Linux community. We accept that minimal overhead to accomplish some 
of the setups that we're interested in, such as encrypted, lvm-cached 
OSDs. The above is not a trivial thing to do using ceph-disk. We know, we 
run that in production, at large scale. It's plagued with problems, and since 
it's done outside Ceph itself, it is difficult to tie the two together. Having 
it managed directly by Ceph, via ceph-volume, makes much more sense. 
We're not alone in this, so I know it will benefit others as well, at the cost 
of some technical expertise.
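
As a rough sketch only (VG/LV names and sizes are made up, not our production
layout), the kind of encrypted, lvm-cached OSD described above can be built
with stock LVM plus ceph-volume:

vgcreate vg0 /dev/sdb /dev/nvme0n1p1                       # slow disk + fast partition in one VG
lvcreate -n osd0 -l 95%FREE vg0 /dev/sdb                   # data LV on the slow disk
lvcreate --type cache-pool -n osd0cache -L 50G vg0 /dev/nvme0n1p1
lvconvert --type cache --cachepool vg0/osd0cache vg0/osd0  # attach the cache
ceph-volume lvm create --bluestore --data vg0/osd0 --dmcrypt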

There are maintainers now for ceph-volume, so if there's something you 
don't like, I suggest proposing a change. 

Warren Wang

On 6/8/18, 11:05 AM, "ceph-users on behalf of Konstantin Shalygin" 
 wrote:

> - ceph-disk was replaced for two reasons: (1) Its design was
> centered around udev, and it was terrible.  We have been plagued for years
> with bugs due to race conditions in the udev-driven activation of OSDs,
> mostly variations of "I rebooted and not all of my OSDs started."  It's
> horrible to observe and horrible to debug. (2) It was based on GPT
> partitions, lots of people had block layer tools they wanted to use
> that were LVM-based, and the two didn't mix (no GPT partitions on top of
> LVs).
>
> - We designed ceph-volume to be *modular* because we anticipate that there
> are going to be lots of ways that people provision the hardware devices
> that we need to consider.  There are already two: legacy ceph-disk devices
> that are still in use and have GPT partitions (handled by 'simple'), and
> lvm.  SPDK devices where we manage NVMe devices directly from userspace
> are on the immediate horizon--obviously LVM won't work there since the
> kernel isn't involved at all.  We can add any other schemes we like.
>
> - If you don't like LVM (e.g., because you find that there is a measurable
> overhead), let's design a new approach!  I wouldn't bother unless you can
> actually measure an impact.  But if you can demonstrate a measurable cost,
> let's do it.
>
> - LVM was chosen as the default approach for new devices for a few
> reasons:
>- It allows you to attach arbitrary metadata to each device, like which
> cluster uuid it belongs to, which osd uuid it belongs to, which type of
> device it is (primary, db, wal, journal), any secrets needed to fetch its
> decryption key from a keyserver (the mon by default), and so on.
>- One of the goals was to enable lvm-based block layer modules beneath
> OSDs (dm-cache).  All of the other devicemapper-based tools we are
> aware of work with LVM.  It was a hammer that hit all nails.
>
> - The 'simple' mode is the current 'out' that avoids using LVM if it's not
> an option for you.  We only implemented scan and activate because that was
> all that we saw a current need for.  It should be quite easy to add the
> ability to create new OSDs.
>
> I would caution you, though, that simple relies on a file in /etc/ceph
> that has the metadata about the devices.  If you lose that file you need
> to have some way to rebuild it or we won't know what to do with your
> devices.  That means you should make the devices self-describing in some
> way... not, say, a raw device with dm-crypt layered directly on top, or
> some other option that makes it impossible to tell what it is.  As long as
> you can implement 'scan' and get any other info you need (e.g., whatever
> is necessary to fetch decryption keys) then great.
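
For reference, the 'simple' workflow described above boils down to something
like this (path and IDs are only examples):

ceph-volume simple scan /var/lib/ceph/osd/ceph-0   # writes /etc/ceph/osd/0-<osd-fsid>.json
ceph-volume simple activate --all                  # or: ceph-volume simple activate 0 <osd-fsid>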


Thanks, I got what I wanted. This is the form in which deprecations should be 
presented to the community: "why we are doing this, and what it will give us." 
Instead it came across as: "we are killing the tool along with its 
functionality; use the new one as-is, even if you do not know what it does."

Thanks again, Sage. I think this post should be in ceph blog.




Re: [ceph-users] OSD servers swapping despite having free memory capacity

2018-01-24 Thread Warren Wang
Forgot to mention another hint. If kswapd is constantly using CPU, and your sar 
-r ALL and sar -B stats look like it's thrashing, kswapd is probably busy 
evicting things from memory in order to make a larger-order allocation.
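
The kind of thing to watch for (a sketch; the intervals are arbitrary):

sar -B 1 10                  # paging: high pgscank/s / pgscand/s with low %vmeff means reclaim is struggling
sar -r ALL 1 10              # overall memory picture over time
top -b -n1 | grep kswapd     # is kswapd burning CPU?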

The other thing I can think of is if you have OSDs locking up and getting 
corrupted, there is a severe XFS bug where the kernel will throw a NULL pointer 
dereference under heavy memory pressure. Again, it's due to memory issues, but 
you will see the message in your kernel logs. It's fixed in upstream kernels as 
of this month. I forget what version exactly. 4.4.0-102? 
https://launchpad.net/bugs/1729256 

Warren Wang

On 1/23/18, 11:01 PM, "Blair Bethwaite" <blair.bethwa...@gmail.com> wrote:

+1 to Warren's advice on checking for memory fragmentation. Are you
seeing kmem allocation failures in dmesg on these hosts?

On 24 January 2018 at 10:44, Warren Wang <warren.w...@walmart.com> wrote:
> Check /proc/buddyinfo for memory fragmentation. We have some pretty 
severe memory frag issues with Ceph to the point where we keep excessive 
min_free_kbytes configured (8GB), and are starting to order more memory than we 
actually need. If you have a lot of objects, you may find that you need to 
increase vfs_cache_pressure as well, to something like the default of 100.
>
> In your buddyinfo, the columns represent the quantity of each page size 
available. So if you only see numbers in the first 2 columns, you only have 4K 
and 8K pages available, and will fail any allocations larger than that. The 
problem is so severe for us that we have stopped using jumbo frames due to 
dropped packets as a result of not being able to DMA map pages that will fit 9K 
frames.
>
> In short, you might have enough memory, but not contiguous. It's even 
worse on RGW nodes.
>
> Warren Wang
>
> On 1/23/18, 2:56 PM, "ceph-users on behalf of Samuel Taylor Liston" 
<ceph-users-boun...@lists.ceph.com on behalf of sam.lis...@utah.edu> wrote:
>
> We have a 9 - node (16 - 8TB OSDs per node) running jewel on centos 
7.4.  The OSDs are configured with encryption.  The cluster is accessed via two 
- RGWs  and there are 3 - mon servers.  The data pool is using 6+3 erasure 
coding.
>
> About 2 weeks ago I found two of the nine servers wedged and had to 
hard power cycle them to get them back.  In this hard reboot 22 - OSDs came 
back with either a corrupted encryption or data partitions.  These OSDs were 
removed and recreated, and the resultant rebalance moved along just fine for 
about a week.  At the end of that week two different nodes were unresponsive 
complaining of page allocation failures.  This is when I realized the nodes 
were heavy into swap.  These nodes were configured with 64GB of RAM as a cost 
saving going against the 1GB per 1TB recommendation.  We have since then 
doubled the RAM in each of the nodes giving each of them more than the 1GB per 
1TB ratio.
>
> The issue I am running into is that these nodes are still swapping; a 
lot, and over time becoming unresponsive, or throwing page allocation failures. 
 As an example, “free” will show 15GB of RAM usage (out of 128GB) and 32GB of 
swap.  I have configured swappiness to 0 and and also turned up the 
vm.min_free_kbytes to 4GB to try to keep the kernel happy, and yet I am still 
filling up swap.  It only occurs when the OSDs have mounted partitions and 
ceph-osd daemons active.
>
> Anyone have an idea where this swap usage might be coming from?
> Thanks for any insight,
>
> Sam Liston (sam.lis...@utah.edu)
> 
> Center for High Performance Computing
> 155 S. 1452 E. Rm 405
> Salt Lake City, Utah 84112 (801)232-6932
> 
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Cheers,
~Blairo


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD servers swapping despite having free memory capacity

2018-01-23 Thread Warren Wang
Check /proc/buddyinfo for memory fragmentation. We have some pretty severe 
memory frag issues with Ceph to the point where we keep excessive 
min_free_kbytes configured (8GB), and are starting to order more memory than we 
actually need. If you have a lot of objects, you may find that you need to 
increase vfs_cache_pressure as well, to something like the default of 100.

In your buddyinfo, the columns represent the quantity of each page size 
available. So if you only see numbers in the first 2 columns, you only have 4K 
and 8K pages available, and will fail any allocations larger than that. The 
problem is so severe for us that we have stopped using jumbo frames due to 
dropped packets as a result of not being able to DMA map pages that will fit 9K 
frames.

In short, you might have enough memory, but not contiguous. It's even worse on 
RGW nodes.
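
Concretely, the checks and knobs mentioned above look like this (the 8GB value
is the one we run; tune to taste):

cat /proc/buddyinfo                     # columns are free blocks of order 0..10, i.e. 4K, 8K, ... 4M
sysctl -w vm.min_free_kbytes=8388608    # 8GB reserve, as described above
sysctl -w vm.vfs_cache_pressure=100     # back to the default if inode/dentry cache is crowding you out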

Warren Wang

On 1/23/18, 2:56 PM, "ceph-users on behalf of Samuel Taylor Liston" 
<ceph-users-boun...@lists.ceph.com on behalf of sam.lis...@utah.edu> wrote:

We have a 9-node cluster (16 x 8TB OSDs per node) running Jewel on CentOS 7.4.  
The OSDs are configured with encryption.  The cluster is accessed via two 
RGWs and there are 3 mon servers.  The data pool is using 6+3 erasure coding.

About 2 weeks ago I found two of the nine servers wedged and had to hard 
power cycle them to get them back.  In this hard reboot, 22 OSDs came back 
with either corrupted encryption or corrupted data partitions.  These OSDs were 
removed and recreated, and the resultant rebalance moved along just fine for 
about a week.  At the end of that week two different nodes were unresponsive, 
complaining of page allocation failures.  This is when I realized the nodes 
were heavy into swap.  These nodes were configured with 64GB of RAM as a cost 
saving, going against the 1GB per 1TB recommendation.  We have since 
doubled the RAM in each of the nodes, giving each of them more than the 1GB per 
1TB ratio.  

The issue I am running into is that these nodes are still swapping; a lot, 
and over time becoming unresponsive, or throwing page allocation failures.  As 
an example, “free” will show 15GB of RAM usage (out of 128GB) and 32GB of swap. 
I have configured swappiness to 0 and also turned up 
vm.min_free_kbytes to 4GB to try to keep the kernel happy, and yet I am still 
filling up swap.  It only occurs when the OSDs have mounted partitions and 
ceph-osd daemons active. 

Anyone have an idea where this swap usage might be coming from? 
Thanks for any insight,

Sam Liston (sam.lis...@utah.edu)

Center for High Performance Computing
155 S. 1452 E. Rm 405
Salt Lake City, Utah 84112 (801)232-6932




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What is the should be the expected latency of 10Gbit network connections

2018-01-22 Thread Warren Wang
25GbE network. Servers have ConnectX-4 Pro NICs, across a router, since L2 is 
terminated at the ToR:

10 packets transmitted, 10 received, 0% packet loss, time 1926ms
rtt min/avg/max/mdev = 0.013/0.013/0.205/0.004 ms, ipg/ewma 0.019/0.014 ms
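
(For reference, output like the above comes from a flood ping; the peer address
is a placeholder:)

ping -c 10 -f <peer>         # flood ping; the ipg/ewma line appears in the summary
ping -M do -s 8972 <peer>    # don't-fragment probe for 9000-byte jumbo frames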

Warren Wang
 
On 1/22/18, 4:06 PM, "ceph-users on behalf of Marc Roos" 
<ceph-users-boun...@lists.ceph.com on behalf of m.r...@f1-outsourcing.eu> wrote:


ping -c 10 -f 
ping -M do -s 8972 
 
10Gb ConnectX-3 Pro, DAC + Vlan
rtt min/avg/max/mdev = 0.010/0.013/0.200/0.003 ms, ipg/ewma 0.025/0.014 
ms

8980 bytes from 10.0.0.11: icmp_seq=3 ttl=64 time=0.144 ms
8980 bytes from 10.0.0.11: icmp_seq=4 ttl=64 time=0.205 ms
8980 bytes from 10.0.0.11: icmp_seq=5 ttl=64 time=0.248 ms
8980 bytes from 10.0.0.11: icmp_seq=6 ttl=64 time=0.281 ms
8980 bytes from 10.0.0.11: icmp_seq=7 ttl=64 time=0.187 ms
8980 bytes from 10.0.0.11: icmp_seq=8 ttl=64 time=0.121 ms

I350 Gigabit + bond
rtt min/avg/max/mdev = 0.027/0.038/0.211/0.006 ms, ipg/ewma 0.050/0.041 
ms

8980 bytes from 192.168.0.11: icmp_seq=1 ttl=64 time=0.555 ms
8980 bytes from 192.168.0.11: icmp_seq=2 ttl=64 time=0.508 ms
8980 bytes from 192.168.0.11: icmp_seq=3 ttl=64 time=0.514 ms
8980 bytes from 192.168.0.11: icmp_seq=4 ttl=64 time=0.555 ms



-Original Message-
From: Nick Fisk [mailto:n...@fisk.me.uk] 
Sent: maandag 22 januari 2018 12:38
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] What is the should be the expected latency of 
10Gbit network connections

Anyone with 25G ethernet willing to do the test? Would love to see what 
the latency figures are for that.

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Maged Mokhtar
Sent: 22 January 2018 11:28
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] What is the should be the expected latency of 
10Gbit network connections

 

On 2018-01-22 08:39, Wido den Hollander wrote:



On 01/20/2018 02:02 PM, Marc Roos wrote: 

  If I test my connections with sockperf via a 1Gbit switch I get around
25usec; when I test the 10Gbit connection via the switch I have around
12usec. Is that normal? Or should there be a difference of 10x?


No, that's normal.

Tests with 8k ping packets over different links I did:

1GbE:  0.800ms
10GbE: 0.200ms
40GbE: 0.150ms

Wido




sockperf ping-pong

sockperf: Warmup stage (sending a few dummy messages)...
sockperf: Starting test...
sockperf: Test end (interrupted by timer)
sockperf: Test ended
sockperf: [Total Run] RunTime=10.100 sec; SentMessages=432875;
ReceivedMessages=432874
sockperf: = Printing statistics for Server No: 0
sockperf: [Valid Duration] RunTime=10.000 sec; 
SentMessages=428640;
ReceivedMessages=428640
sockperf: > avg-lat= 11.609 (std-dev=1.684)
sockperf: # dropped messages = 0; # duplicated messages = 0; #
out-of-order messages = 0
sockperf: Summary: Latency is 11.609 usec
sockperf: Total 428640 observations; each percentile contains 
4286.40
observations
sockperf: ---> <MAX> observation =  856.944
sockperf: ---> percentile  99.99 =   39.789
sockperf: ---> percentile  99.90 =   20.550
sockperf: ---> percentile  99.50 =   17.094
sockperf: ---> percentile  99.00 =   15.578
sockperf: ---> percentile  95.00 =   12.838
sockperf: ---> percentile  90.00 =   12.299
sockperf: ---> percentile  75.00 =   11.844
sockperf: ---> percentile  50.00 =   11.409
sockperf: ---> percentile  25.00 =   11.124
sockperf: ---> <MIN> observation =    8.888

sockperf: Warmup stage (sending a few dummy messages)...
sockperf: Starting test...
sockperf: Test end (interrupted by timer)
sockperf: Test ended
sockperf: [Total Run] RunTime=1.100 sec; SentMessages=22065;
ReceivedMessages=22064
sockperf: = Printing statistics for Server No: 0
sockperf: [Valid Duration] RunTime=1.000 sec; 
SentMessages=20056;
ReceivedMessages=20056
sockperf: > avg-lat= 24.861 (std-dev=1.774)
sockperf: # dropped messages = 0; # duplicate

Re: [ceph-users] EXT: ceph-lvm - a tool to deploy OSDs from LVM volumes

2017-06-16 Thread Warren Wang - ISD
I would prefer that this is something more generic, to possibly support other 
backends one day, like ceph-volume. Creating one tool per backend seems silly.

Also, ceph-lvm seems to imply that ceph itself has something to do with lvm, 
which it really doesn’t. This is simply to deal with the underlying disk. If 
there’s resistance to something more generic like ceph-volume, then it should 
at least be called something like ceph-disk-lvm.

2 cents from one of the LVM for Ceph users,
Warren Wang
Walmart ✻

On 6/16/17, 10:25 AM, "ceph-users on behalf of Alfredo Deza" 
<ceph-users-boun...@lists.ceph.com on behalf of ad...@redhat.com> wrote:

Hello,

At the last CDM [0] we talked about `ceph-lvm` and the ability to
deploy OSDs from logical volumes. We have now an initial draft for the
documentation [1] and would like some feedback.

The important features for this new tool are:

* parting ways with udev (new approach will rely on LVM functionality
for discovery)
* compatibility/migration for existing LVM volumes deployed as directories
* dmcache support

By documenting the API and workflows first we are making sure that
those look fine before starting on actual development.

It would be great to get some feedback, specially if you are currently
using LVM with ceph (or planning to!).

Please note that the documentation is not complete and is missing
content on some parts.

[0] http://tracker.ceph.com/projects/ceph/wiki/CDM_06-JUN-2017
[1] http://docs.ceph.com/ceph-lvm/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] EXT: Re: Intel power tuning - 30% throughput performance increase

2017-05-08 Thread Warren Wang - ISD
We also noticed a tremendous gain in latency performance by limiting C-states 
with processor.max_cstate=1 intel_idle.max_cstate=0. We went from over 1ms 
latency for 4KB writes to well under that (0.7ms? going off memory). I will note 
that we did not have as much of a problem on Intel v3 procs, but on v4 procs our 
low-QD, single-threaded write perf dropped tremendously. I don't recall the 
exact number, but it was much worse than just a 30% loss in perf compared to a 
v3 proc that had default C-states set. We only saw a small bump in power usage 
as well.

Bumping the CPU frequency up also offered a small performance improvement as well.
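
A compact sketch of how to check and apply this (the kernel parameters are the
ones named above; writing to /dev/cpu_dma_latency is the PM QoS route discussed
further down, and the value only holds while the file descriptor stays open):

cat /sys/module/intel_idle/parameters/max_cstate   # >1 means the OS is still using deep C-states
# boot-time approach, via the kernel command line, then reboot:
#   processor.max_cstate=1 intel_idle.max_cstate=0
# runtime approach: hold the PM QoS device open with a 32-bit 0 (microseconds)
exec 3<> /dev/cpu_dma_latency && printf '\x00\x00\x00\x00' >&3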

Warren Wang
Walmart ✻

On 5/3/17, 3:43 AM, "ceph-users on behalf of Dan van der Ster" 
<ceph-users-boun...@lists.ceph.com on behalf of d...@vanderster.com> wrote:

Hi Blair,

We use cpu_dma_latency=1, because it was in the latency-performance profile.
And indeed by setting cpu_dma_latency=0 on one of our OSD servers,
powertop now shows the package as 100% in turbo mode.

So I suppose we'll pay for this performance boost in energy.
But more importantly, can the CPU survive being in turbo 100% of the time?

-- Dan



On Wed, May 3, 2017 at 9:13 AM, Blair Bethwaite
<blair.bethwa...@gmail.com> wrote:
> Hi all,
>
> We recently noticed that despite having BIOS power profiles set to
> performance on our RHEL7 Dell R720 Ceph OSD nodes, that CPU frequencies
> never seemed to be getting into the top of the range, and in fact spent a
> lot of time in low C-states despite that BIOS option supposedly disabling
> C-states.
>
> After some investigation this C-state issue seems to be relatively common,
> apparently the BIOS setting is more of a config option that the OS can
> choose to ignore. You can check this by examining
> /sys/module/intel_idle/parameters/max_cstate - if this is >1 and you 
*think*
> C-states are disabled then your system is messing with you.
>
> Because the contemporary Intel power management driver
> (https://www.kernel.org/doc/Documentation/cpu-freq/intel-pstate.txt) now
> limits the proliferation of OS level CPU power profiles/governors, the 
only
> way to force top frequencies is to either set kernel boot command line
> options or use the /dev/cpu_dma_latency, aka pmqos, interface.
>
> We did the latter using the pmqos_static.py, which was previously part of
> the RHEL6 tuned latency-performance profile, but seems to have been 
dropped
> in RHEL7 (don't yet know why), and in any case the default tuned profile 
is
> throughput-performance (which does not change cpu_dma_latency). You can 
find
> the pmqos-static.py script here
> 
https://github.com/NetSys/NetBricks/blob/master/scripts/tuning/pmqos-static.py.
>
> After setting `./pmqos-static.py cpu_dma_latency=0` across our OSD nodes 
we
> saw a conservative 30% increase in backfill and recovery throughput - now
> when our main RBD pool of 900+ OSDs is backfilling we expect to see 
~22GB/s,
> previously that was ~15GB/s.
>
> We have just got around to opening a case with Red Hat regarding this as 
at
> minimum Ceph should probably be actively using the pmqos interface and 
tuned
> should be setting this with recommendations for the latency-performance
> profile in the RHCS install guide. We have done no characterisation of it 
on
> Ubuntu yet, however anecdotally it looks like it has similar issues on the
> same hardware.
>
> Merry xmas.
>
> Cheers,
> Blair
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Migrate OSD Journal to SSD

2016-12-02 Thread Warren Wang - ISD
I’ve actually had to migrate every single journal in many clusters from one 
(horrible) SSD model to a better SSD. It went smoothly. You’ll also need to 
update your /var/lib/ceph/osd/ceph-*/journal_uuid file. 

Honestly, the only challenging part was mapping and automating the back and 
forth conversion from /dev/sd* to the uuid for the corresponding osd.  I would 
share the script, but it was at my previous employer.
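
Not the original script, but a rough sketch of the uuid-to-device mapping it
had to do:

for d in /var/lib/ceph/osd/ceph-*; do
    uuid=$(cat "$d/journal_uuid")
    printf '%s journal_uuid=%s -> %s\n' "$d" "$uuid" "$(readlink -f /dev/disk/by-partuuid/$uuid)"
done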

Warren Wang
Walmart ✻

On 12/1/16, 7:26 PM, "ceph-users on behalf of Christian Balzer" 
<ceph-users-boun...@lists.ceph.com on behalf of ch...@gol.com> wrote:

On Thu, 1 Dec 2016 18:06:38 -0600 Reed Dier wrote:

> Apologies if this has been asked dozens of times before, but most answers 
are from pre-Jewel days, and want to double check that the methodology still 
holds.
> 
It does.

> Currently have 16 OSD’s across 8 machines with on-disk journals, created 
using ceph-deploy.
> 
> These machines have NVMe storage (Intel P3600 series) for the system 
volume, and am thinking about carving out a partition for SSD journals for the 
OSD’s. The drives don’t make tons of use of the local storage, so should have 
plenty of io overhead to support the OSD journaling, as well as the P3600 
should have the endurance to handle the added write wear.
>
Slight disconnect there, money for a NVMe (which size?) and on disk
journals? ^_-
 
> From what I’ve read, you need a partition per OSD journal, so with the 
probability of a third (and final) OSD being added to each node, I should 
create 3 partitions, each ~8GB in size (is this a good value? 8TB OSD’s, is the 
journal size based on size of data or number of objects, or something else?).
> 
Journal size is unrelated to the OSD per se, with default parameters and
HDDs for OSDs a size of 10GB would be more than adequate, the default of
5GB would do as well.

> So:
> {create partitions}
> set noout
> service ceph stop osd.$i
> ceph-osd -i $i --flush-journal
> rm -f /var/lib/ceph/osd/ceph-$i/journal
Typo and there should be no need for -f. ^_^

> ln -s /dev/<new journal partition> /var/lib/ceph/osd/ceph-$i/journal
Even though in your case with a single(?) NVMe there is little chance for
confusion, ALWAYS reference to devices by their UUID or similar, I prefer
the ID:
---
lrwxrwxrwx   1 root root  44 May 21  2015 journal -> 
/dev/disk/by-id/wwn-0x55cd2e404b73d570-part4
---


> ceph-osd -i $i --mkjournal
> service ceph start osd.$i
> ceph osd unset noout
> 
> Does this logic appear to hold up?
> 
Yup.

Christian

> Appreciate the help.
> 
> Thanks,
> 
> Reed

-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Adding second interface to storage network - issue

2016-12-01 Thread Warren Wang - ISD
Jumbo frames on the cluster network have been run by quite a few operators 
without any problems. Admittedly, I've not run it that way in a year now, but 
we plan on switching back to jumbo for the cluster.

I do agree that jumbo on the public network could result in poor behavior from 
clients, if you're not careful.

Warren Wang
Walmart ✻

From: ceph-users <ceph-users-boun...@lists.ceph.com> on behalf of John Petrini 
<jpetr...@coredial.com>
Date: Wednesday, November 30, 2016 at 1:09 PM
To: Mike Jacobacci <mi...@flowjo.com>
Cc: ceph-users <ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Adding second interface to storage network - issue

Yes, that should work. Though I'd be wary of increasing the MTU to 9000, as this 
could introduce other issues. Jumbo frames don't provide a very significant 
performance increase, so I wouldn't recommend it unless you have a very good 
reason to make the change. If you do want to go down that path, I'd suggest 
getting LACP configured on all of the nodes before upping the MTU, and even then 
make sure you understand the requirements of a larger MTU size before 
introducing it on your network.
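
As a rough illustration only (interface names and addressing are placeholders,
and your distro's persistent network config is the right place for this in
practice), the iproute2 equivalent of such a bond would be:

ip link add bond0 type bond mode 802.3ad miimon 100
ip link set eth2 down; ip link set eth2 master bond0
ip link set eth3 down; ip link set eth3 master bond0
ip link set bond0 mtu 9000 up        # only once every node and switch port in the path is jumbo-clean
ip addr add 192.0.2.11/24 dev bond0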


___

John Petrini

NOC Systems Administrator   //   CoreDial, LLC   //   coredial.com

On Wed, Nov 30, 2016 at 1:01 PM, Mike Jacobacci 
<mi...@flowjo.com<mailto:mi...@flowjo.com>> wrote:
Hi John,

Thanks that makes sense... So I take it if I use the same IP for the bond, I 
shouldn't run into the issues I ran into last night?

Cheers,
Mike

On Wed, Nov 30, 2016 at 9:55 AM, John Petrini 
<jpetr...@coredial.com<mailto:jpetr...@coredial.com>> wrote:
For redundancy I would suggest bonding the interfaces using LACP that way both 
ports are combined under the same interface with the same IP. They will both 
send and receive traffic and if one link goes down the other continues to work. 
The ports will need to be configured for LACP on the switch as well.


___

John Petrini

NOC Systems Administrator   //   CoreDial, LLC   //   coredial.com

On Wed, Nov 30, 2016 at 12:15 PM, Mike Jacobacci 
<mi...@flowjo.com<mailto:mi...@flowjo.com>> wrote:
I ran into an interesting issue last night when I tried to add a second storage 
interface.  The original 10gb storage interface on the OSD node was only set at 
1500 MTU, so the plan was to bump it to 9000 and configure the second interface 
the same way with a diff IP and reboot. Once I did that, for some reason the 
original interface showed active but would not respond to ping from the other 
OSD nodes, the second interface I added came up and was reachable.  So even 
though the node could still communicate to the others on the second interface, 
PG's would start remapping and would get stuck at about 300 (of 1024).  I 
resolved the issue by changing the config back on the original interface and 
disabling the

Re: [ceph-users] osd crash - disk hangs

2016-12-01 Thread Warren Wang - ISD
You’ll need to upgrade your kernel. It’s a terrible div by zero bug that occurs 
while trying to calculate load. You can still use “top -b -n1” instead of ps, 
but ultimately the kernel update fixed it for us. You can’t kill procs that are 
in uninterruptible wait.

Here’s the Ubuntu version: 
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1568729

Warren Wang
Walmart ✻

From: ceph-users <ceph-users-boun...@lists.ceph.com> on behalf of VELARTIS 
Philipp Dürhammer <p.duerham...@velartis.at>
Date: Thursday, December 1, 2016 at 7:19 AM
To: "'ceph-users@lists.ceph.com'" <ceph-users@lists.ceph.com>
Subject: [ceph-users] osd crash - disk hangs

Hello!

Tonight I had an OSD crash. See the dump below. Also, this OSD is still mounted. 
What's the cause? A bug? What to do next? I can't run lsof or ps ax because it 
hangs.

Thank You!

Dec  1 00:31:30 ceph2 kernel: [17314369.493029] divide error:  [#1] SMP
Dec  1 00:31:30 ceph2 kernel: [17314369.493062] Modules linked in: act_police 
cls_basic sch_ingress sch_htb vhost_net vhost macvtap macvlan 8021q garp mrp 
veth nfsv3 softdog ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 
ip6table_filter ip6_tables xt_mac ipt_REJECT nf_reject_ipv4 xt_NFLOG 
nfnetlink_log xt_physdev nf_conntrack_ipv4 nf_defrag_ipv4 xt_comment xt_tcpudp 
xt_addrtype xt_multiport xt_conntrack xt_set xt_mark ip_set_hash_net ip_set 
nfnetlink iptable_filter ip_tables x_tables nfsd auth_rpcgss nfs_acl nfs lockd 
grace fscache sunrpc ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr 
iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi bonding xfs libcrc32c 
ipmi_ssif mxm_wmi x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm 
irqbypass crct10dif_pclmul crc32_pclmul aesni_intel aes_x86_64 lrw gf128mul 
glue_helper ablk_helper cryptd snd_pcm snd_timer snd soundcore pcspkr 
input_leds sb_edac shpchp edac_core mei_me ioatdma mei lpc_ich i2c_i801 ipmi_si 
8250_fintek wmi ipmi_msghandler mac_hid nf_conntrack_ftp nf_conntrack autofs4 
ses enclosure hid_generic usbmouse usbkbd usbhid hid ixgbe(O) vxlan 
ip6_udp_tunnel megaraid_sas udp_tunnel isci ahci libahci libsas igb(O) 
scsi_transport_sas dca ptp pps_core fjes
Dec  1 00:31:30 ceph2 kernel: [17314369.493708] CPU: 1 PID: 17291 Comm: 
ceph-osd Tainted: G   O4.4.8-1-pve #1
Dec  1 00:31:30 ceph2 kernel: [17314369.493754] Hardware name: Thomas-Krenn.AG 
X9DR3-F/X9DR3-F, BIOS 3.0a 07/31/2013
Dec  1 00:31:30 ceph2 kernel: [17314369.493799] task: 881f6ff05280 ti: 
880037c4c000 task.ti: 880037c4c000
Dec  1 00:31:30 ceph2 kernel: [17314369.493843] RIP: 0010:[]  
[] task_numa_find_cpu+0x23d/0x710
Dec  1 00:31:30 ceph2 kernel: [17314369.493893] RSP: :880037c4fbd8  
EFLAGS: 00010257
Dec  1 00:31:30 ceph2 kernel: [17314369.493919] RAX:  RBX: 
880037c4fc80 RCX: 
Dec  1 00:31:30 ceph2 kernel: [17314369.493962] RDX:  RSI: 
88103fa4 RDI: 881033f50c00
Dec  1 00:31:30 ceph2 kernel: [17314369.494006] RBP: 880037c4fc48 R08: 
000202046ea8 R09: 036b
Dec  1 00:31:30 ceph2 kernel: [17314369.494049] R10: 007c R11: 
0540 R12: 88064fbd
Dec  1 00:31:30 ceph2 kernel: [17314369.494093] R13: 0250 R14: 
0540 R15: 0009
Dec  1 00:31:30 ceph2 kernel: [17314369.494136] FS:  7ff17dd6c700() 
GS:88103fa4() knlGS:
Dec  1 00:31:30 ceph2 kernel: [17314369.494182] CS:  0010 DS:  ES:  
CR0: 80050033
Dec  1 00:31:30 ceph2 kernel: [17314369.494209] CR2: 7ff17dd6aff8 CR3: 
001025e4b000 CR4: 001426e0
Dec  1 00:31:30 ceph2 kernel: [17314369.494252] Stack:
Dec  1 00:31:30 ceph2 kernel: [17314369.494273]  880037c4fbe8 
81038219 003f 00017180
Dec  1 00:31:30 ceph2 kernel: [17314369.494323]  881f6ff05280 
00017180 0251 ffe7
Dec  1 00:31:30 ceph2 kernel: [17314369.494374]  0251 
881f6ff05280 880037c4fc80 00cb
Dec  1 00:31:30 ceph2 kernel: [17314369.494424] Call Trace:
Dec  1 00:31:30 ceph2 kernel: [17314369.494449]  [] ? 
sched_clock+0x9/0x10
Dec  1 00:31:30 ceph2 kernel: [17314369.494476]  [] 
task_numa_migrate+0x4e6/0xa00
Dec  1 00:31:30 ceph2 kernel: [17314369.494506]  [] ? 
copy_to_iter+0x7c/0x260
Dec  1 00:31:30 ceph2 kernel: [17314369.494534]  [] 
numa_migrate_preferred+0x79/0x80
Dec  1 00:31:30 ceph2 kernel: [17314369.494563]  [] 
task_numa_fault+0x848/0xd10
Dec  1 00:31:30 ceph2 kernel: [17314369.494591]  [] ? 
should_numa_migrate_memory+0x59/0x130
Dec  1 00:31:30 ceph2 kernel: [17314369.494623]  [] 
handle_mm_fault+0xc64/0x1a20
Dec  1 00:31:30 ceph2 kernel: [17314369.494654]  [] ? 
SYSC_recvfrom+0x144/0x160
Dec  1 00:31:30 ceph2 kernel: [17314369.494684]  [] 
__do_page_fault+0x19d/0x410
Dec  1 00:31:30 ceph2 kernel: [17314369.494713]  [] ? 
exit_to_usermode_loop+0xb0/0xd0
Dec  1 00:31:30 ce

Re: [ceph-users] osd down detection broken in jewel?

2016-11-30 Thread Warren Wang - ISD
FYI - Setting min down reporters to 10 is somewhat risky. Unless you have a 
really large cluster, I would advise turning that down to 5 or lower. In a past 
life, we used to run that number higher on super-dense nodes, but we found that 
it would result in some instances where legitimately down OSDs did not have 
enough peers to exceed the min down reporters threshold.
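
For reference, the knob in question, as a sketch (5 is the value suggested
above):

# ceph.conf, [global] or [mon]:
#   mon osd min down reporters = 5
# or at runtime on a Jewel cluster:
ceph tell mon.* injectargs '--mon_osd_min_down_reporters 5'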

Warren Wang
Walmart ✻


From: ceph-users <ceph-users-boun...@lists.ceph.com> on behalf of John Petrini 
<jpetr...@coredial.com>
Date: Wednesday, November 30, 2016 at 9:24 AM
To: Manuel Lausch <manuel.lau...@1und1.de>
Cc: Ceph Users <ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] osd down detection broken in jewel?

It's right there in your config.

mon osd report timeout = 900

See: http://docs.ceph.com/docs/jewel/rados/configuration/mon-osd-interaction/


___

John Petrini

NOC Systems Administrator   //   CoreDial, LLC   //   coredial.com

On Wed, Nov 30, 2016 at 6:39 AM, Manuel Lausch 
<manuel.lau...@1und1.de<mailto:manuel.lau...@1und1.de>> wrote:
Hi,

In a test with ceph jewel we tested how long the cluster needs to detect and 
mark down OSDs after they are killed (with kill -9). The result -> 900 seconds.

In Hammer this took about 20 - 30 seconds.

In the Logfile from the leader monitor are a lot of messeages like
2016-11-30 11:32:20.966567 7f158f5ab700  0 log_channel(cluster) log [DBG] : 
osd.7 10.78.43.141:8120/106673<http://10.78.43.141:8120/106673> reported failed 
by osd.272 10.78.43.145:8106/117053<http://10.78.43.145:8106/117053>
A deeper look at this: a lot of OSDs reported this exactly one time. In Hammer, 
the OSDs reported a down OSD a few more times.

Finaly there is the following and the osd is marked down.
2016-11-30 11:36:22.633253 7f158fdac700  0 log_channel(cluster) log [INF] : 
osd.7 marked down after no pg stats for 900.982893seconds

In my ceph.conf I have the following lines in the global section
mon osd min down reporters = 10
mon osd min down reports = 3
mon osd report timeout = 900

It seems the parameter "mon osd min down reports" is removed in jewel but the 
documentation is not updated -> 
http://docs.ceph.com/docs/jewel/rados/configuration/mon-osd-interaction/


Can someone tell me how Ceph Jewel detects down OSDs and marks them down in an 
appropriate time?


The Cluster:
ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
24 hosts á 60 OSDs -> 1440 OSDs
2 pool with replication factor 4
65536 PGs
5 Mons

--
Manuel Lausch

Systemadministrator
Cloud Services

1&1 Mail & Media Development & Technology GmbH | Brauerstraße 48 | 76135 
Karlsruhe | Germany
Phone: +49 721 91374-1847<tel:%2B49%20721%2091374-1847>
E-Mail: manuel.lau...@1und1.de<mailto:manuel.lau...@1und1.de> | Web: 
www.1und1.de<http://www.1und1.de>

Amtsgericht Montabaur, HRB 5452

Geschäftsführer: Frank Einhellinger, Thomas Ludwig, Jan Oetjen


Member of United Internet



___
ceph-users mailing list
ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



Re: [ceph-users] OSDs going down during radosbench benchmark

2016-09-12 Thread Warren Wang - ISD
Hi Tom, a few things you can check into. Some of these depend on how many
OSDs you're trying to run on a single chassis.

# up PIDs, otherwise you may run out of the ability to spawn new threads
kernel.pid_max=4194303

# up available mem for sudden bursts, like during benchmarking
vm.min_free_kbytes = 

In ceph.conf:

max_open_files = <32K or more>

# make sure you have enough ephemeral port range for the number of OSDs
ms bind port min = 6800
ms bind port max = 9000

You may need to up your network tuning as well, but it's less likely to
cause these sorts of problems. Watch your netstat -s for clues.
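
Collected in one place as they would actually be set (the blank and <...>
values above are deliberately left open; the numbers here are only
placeholders):

# /etc/sysctl.d/99-ceph.conf
kernel.pid_max = 4194303
vm.min_free_kbytes = 4194304     # placeholder; size it for your burst behaviour

# ceph.conf, [global]
max open files = 131072          # "<32K or more>"
ms bind port min = 6800
ms bind port max = 9000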

Warren Wang



On 9/12/16, 12:44 PM, "ceph-users on behalf of Deneau, Tom"
<ceph-users-boun...@lists.ceph.com on behalf of tom.den...@amd.com> wrote:

>Trying to understand why some OSDs (6 out of 21) went down in my cluster
>while running a CBT radosbench benchmark.  From the logs below, is this a
>networking problem between systems, or is it some kind of FileStore
>problem.
>
>Looking at one crashed OSD log, I see the following crash error:
>
>2016-09-09 21:30:29.757792 7efc6f5f1700 -1 FileStore: sync_entry timed
>out after 600 seconds.
> ceph version 10.2.1-13.el7cp (f15ca93643fee5f7d32e62c3e8a7016c1fc1e6f4)
>
>just before that I see things like:
>
>2016-09-09 21:18:07.391760 7efc755fd700 -1 osd.12 165 heartbeat_check: no
>reply from osd.6 since back 2016-09-09 21:17:47.261601 front 2016-09-09
>21:17:47.261601 (cutoff 2016-09-09 21:17:47.391758)
>
>and also
>
>2016-09-09 19:03:45.788327 7efc53905700  0 -- 10.0.1.2:6826/58682 >>
>10.0.1.1:6832/19713 pipe(0x7efc8bfbc800 sd=65 :52000 s=1 pgs=12 cs=1 l=0\
> c=0x7efc8bef5b00).connect got RESETSESSION
>
>and many warnings for slow requests.
>
>
>All the other osds that died seem to have died with:
>
>2016-09-09 19:11:01.663262 7f2157e65700 -1 common/HeartbeatMap.cc: In
>function 'bool ceph::HeartbeatMap::_check(const
>ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f2157e65700 time
>2016-09-09 19:11:01.660671
>common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")
>
>
>-- Tom Deneau, AMD
>
>
>
>
>
>___
>ceph-users mailing list
>ceph-users@lists.ceph.com
>http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rgw meta pool

2016-09-09 Thread Warren Wang - ISD
A little extra context here. Currently the metadata pool looks like it is
on track to exceed the number of objects in the data pool, over time. In a
brand new cluster, we're already up to almost 2 million in each pool.

NAME                      ID  USED   %USED  MAX AVAIL  OBJECTS
default.rgw.buckets.data  17  3092G  0.86   345T       2013585
default.rgw.meta          25  743M   0      172T       1975937

We're concerned this will be unmanageable over time.
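
What we watch to track the growth, for what it's worth:

ceph df detail | grep rgw
rados -p default.rgw.meta ls | wc -l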

Warren Wang


On 9/9/16, 10:54 AM, "ceph-users on behalf of Pavan Rallabhandi"
<ceph-users-boun...@lists.ceph.com on behalf of
prallabha...@walmartlabs.com> wrote:

>Any help on this is much appreciated. I am considering fixing this, given
>it's confirmed an issue, unless I am missing something obvious.
>
>Thanks,
>-Pavan.
>
>On 9/8/16, 5:04 PM, "ceph-users on behalf of Pavan Rallabhandi"
><ceph-users-boun...@lists.ceph.com on behalf of
>prallabha...@walmartlabs.com> wrote:
>
>Trying it one more time on the users list.
>
>In our clusters running Jewel 10.2.2, I see default.rgw.meta pool
>running into a large number of objects, potentially approaching the number of
>objects contained in the data pool.
>
>I understand that the immutable metadata entries are now stored in
>this heap pool, but I couldn't reason out why the metadata objects are
>left in this pool even after the actual bucket/object/user deletions.
>
>The put_entry() promptly seems to be storing the same in the heap
>pool 
>https://github.com/ceph/ceph/blob/master/src/rgw/rgw_metadata.cc#L880,
>but I do not see them to be reaped ever. Are they left there for some
>reason?
>
>Thanks,
>-Pavan.
>
>
>___
>ceph-users mailing list
>ceph-users@lists.ceph.com
>http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
>___
>ceph-users mailing list
>ceph-users@lists.ceph.com
>http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] radosgw ignores rgw_frontends? (10.2.2)

2016-08-05 Thread Warren Wang - ISD
It works for us. Here's what ours looks like:

rgw frontends = civetweb port=80 num_threads=50

From netstat:
tcp0  0 0.0.0.0:80  0.0.0.0:*   LISTEN
 4010203/radosgw
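
One quick way to confirm what the running daemon actually loaded, using the
admin socket path that shows up in the output below (a sketch, not from the
original mail):

ceph --admin-daemon /var/run/ceph/ceph-client.rgw.c11n1.asok config show | grep rgw_frontends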

Warren Wang



On 7/28/16, 7:20 AM, "ceph-users on behalf of Zoltan Arnold Nagy"
<ceph-users-boun...@lists.ceph.com on behalf of zol...@linux.vnet.ibm.com>
wrote:

>Hi,
>
>I just did a test deployment using ceph-deploy rgw create 
>after which I've added
>
>[client.rgw.c11n1]
>rgw_frontends = "civetweb port=80"
>
>to the config.
>
>Using show-config I can see that it's there:
>
>root@c11n1:~# ceph --id rgw.c11n1 --show-config | grep civet
>debug_civetweb = 1/10
>rgw_frontends = civetweb port=80
>root@c11n1:~#
>
>However, radosgw ignores it:
>
>root@c11n1:~# netstat -anlp | grep radosgw
>tcp0  0 IP:48514   IP:6800ESTABLISHED
>29879/radosgw
>tcp0  0 IP:47484   IP:6789 ESTABLISHED
>29879/radosgw
>unix  2  [ ACC ] STREAM LISTENING 720517   29879/radosgw
> /var/run/ceph/ceph-client.rgw.c11n1.asok
>root@c11n1:~#
>
>I've removed the key under /var/lib/ceph and copied it under /etc/ceph
>then added the keyring configuration after, which is read and is used by
>radosgw.
>
>Any ideas how I could debug this further?
>Is there a debug option that shows me which configuration settings it is
>reading from the configuration file?
>
>I've been launching it for debugging purposes like this:
>usr/bin/radosgw --cluster=ceph -c /etc/ceph/ceph.conf --id rgw.c11n1 -d
>--setuser ceph --setgroup ceph --debug_rgw='20/20' --debug_client='20/20'
>--debug_civetweb='20/20' --debug_asok='20/20' --debug_auth='20/20'
>--debug-rgw=20/20
>
>Thanks,
>Zoltan
>___
>ceph-users mailing list
>ceph-users@lists.ceph.com
>http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bad performance when two fio write to the same image

2016-08-04 Thread Warren Wang - ISD
Wow, thanks. I think that's the tidbit of info I needed to explain why
increasing numjobs doesn't (anymore) scale performance as expected.
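
For reference, the workaround Jason describes below looks like this (pool and
image names are placeholders; object-map, fast-diff and journaling have to be
disabled first if they are enabled, since they depend on the lock):

rbd feature disable rbd/myimage exclusive-lock
rbd create rbd/shared-image --size 10G --image-shared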

Warren Wang



On 8/4/16, 7:49 AM, "ceph-users on behalf of Jason Dillaman"
<ceph-users-boun...@lists.ceph.com on behalf of jdill...@redhat.com> wrote:

>With exclusive-lock, only a single client can have write access to the
>image at a time. Therefore, if you are using multiple fio processes
>against the same image, they will be passing the lock back and forth
>between each other and you can expect bad performance.
>
>If you have a use-case where you really need to share the same image
>between multiple concurrent clients, you will need to disable the
>exclusive-lock feature (this can be done with the RBD cli on existing
>images or by passing "--image-shared" when creating new images).
>
>On Thu, Aug 4, 2016 at 5:52 AM, Alexandre DERUMIER <aderum...@odiso.com>
>wrote:
>> Hi,
>>
>> I think this is because of the exclusive-lock feature, enabled by default
>>since jewel on rbd images
>>
>>
>> - Mail original -
>> De: "Zhiyuan Wang" <zhiyuan.w...@istuary.com>
>> À: "ceph-users" <ceph-users@lists.ceph.com>
>> Envoyé: Jeudi 4 Août 2016 11:37:04
>> Objet: [ceph-users] Bad performance when two fio write to the same image
>>
>>
>>
>> Hi Guys
>>
>> I am testing the performance of Jewel (10.2.2) with FIO, but found the
>>performance would drop dramatically when two process write to the same
>>image.
>>
>> My environment:
>>
>> 1. Server:
>>
>> One mon and four OSDs running on the same server.
>>
>> Intel P3700 400GB SSD which have 4 partitions, and each for one osd
>>journal (journal size is 10GB)
>>
>> Inter P3700 400GB SSD which have 4 partitions, and each format to XFS
>>for one osd data (each data is 90GB)
>>
>> 10GB network
>>
>> CPU: Intel(R) Xeon(R) CPU E5-2660 (it is not the bottleneck)
>>
>> Memory: 256GB (it is not the bottleneck)
>>
>> 2. Client
>>
>> 10GB network
>>
>> CPU: Intel(R) Xeon(R) CPU E5-2660 (it is not the bottleneck)
>>
>> Memory: 256GB (it is not the bottleneck)
>>
>> 3. Ceph
>>
>> Default configuration expect use async messager (have tried simple
>>messager, got nearly the same result)
>>
>> 10GB image with 256 pg num
>>
>> Test Case
>>
>> 1. One Fio process: bs 4KB; iodepth 256; direct 1; ioengine rbd;
>>randwrite
>>
>> The performance is nearly 60MB/s and IOPS is nearly 15K
>>
>> Four osd are nearly the same busy
>>
>> 2. Two Fio process: bs 4KB; iodepth 256; direct 1; ioengine rbd;
>>randwrite (write to the same image)
>>
>> The performance is nearly 4MB/s each, and IOPS is nearly 1.5K each
>>Terrible
>>
>> And I found that only one osd is busy, the other three are much more
>>idle on CPU
>>
>> And I also run FIO on two clients, the same result
>>
>> 3. Two Fio process: bs 4KB; iodepth 256; direct 1; ioengine rbd
>>randwrite (one to image1, one to image2)
>>
>> The performance is nearly 35MB/s each and IOPS is nearly 8.5K each
>>Reasonable
>>
>> Four osd are nearly the same busy
>>
>>
>>
>>
>>
>> Could someone help to explain the reason of TEST 2
>>
>>
>>
>> Thanks
>>
>>
>> Email Disclaimer & Confidentiality Notice
>>
>> This message is confidential and intended solely for the use of the
>>recipient to whom they are addressed. If you are not the intended
>>recipient you should not deliver, distribute or copy this e-mail. Please
>>notify the sender immediately by e-mail and delete this e-mail from your
>>system. Copyright © 2016 by Istuary Innovation Labs, Inc. All rights
>>reserved.
>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
>-- 
>Jason
>___
>ceph-users mailing list
>ceph-users@lists.ceph.com
>http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] I use fio with randwrite io to ceph image , it's run 2000 IOPS in the first time , and run 6000 IOPS in second time

2016-08-03 Thread Warren Wang - ISD
It's probably rbd cache taking effect. If you know all your clients are
well behaved, you could set "rbd cache writethrough until flush" to false,
instead of the default true, but understand the ramifications. You could
also just do it during benchmarking.
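
As a client-side ceph.conf sketch; only flip this if you trust every client to
send flushes:

[client]
rbd cache = true
rbd cache writethrough until flush = false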

Warren Wang



From:  ceph-users <ceph-users-boun...@lists.ceph.com> on behalf of
"m13913886...@yahoo.com" <m13913886...@yahoo.com>
Reply-To:  "m13913886...@yahoo.com" <m13913886...@yahoo.com>
Date:  Monday, August 1, 2016 at 11:30 PM
To:  Ceph-users <ceph-users@lists.ceph.com>
Subject:  [ceph-users] I use fio with randwrite io to ceph image , it's
run 2000 IOPS in the first time , and run 6000 IOPS in second time



In version 10.2.2, fio initially runs at 2000 IOPS; then I interrupt fio and
run it again, and it runs at 6000 IOPS.

But in version 0.94, fio always runs at 6000 IOPS, with or without
restarting fio.


What is the difference between these two versions in this respect?


my config is that :

I have three nodes, and two OSDs per node, for a total of six OSDs.
All OSDs are SSDs.


Here is my ceph.conf of osd:

[osd]

osd mkfs type=xfs
osd data = /data/$name
osd_journal_size = 8
filestore xattr use omap = true
filestore min sync interval = 10
filestore max sync interval = 15
filestore queue max ops = 25000
filestore queue max bytes = 10485760
filestore queue committing max ops = 5000
filestore queue committing max bytes = 1048576

journal max write bytes = 1073714824
journal max write entries = 1
journal queue max ops = 5
journal queue max bytes = 1048576

osd max write size = 512
osd client message size cap = 2147483648
osd deep scrub stride = 131072
osd op threads = 8
osd disk threads = 4
osd map cache size = 1024
osd map cache bl size = 128
osd mount options xfs = "rw,noexec,nodev,noatime,nodiratime,nobarrier"
osd recovery op priority = 4
osd recovery max active = 10
osd max backfills = 4


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 40Gb fileserver/NIC suggestions

2016-07-13 Thread Warren Wang - ISD
I've run the Mellanox 40 gig card, the ConnectX-3 Pro, but that's old now.
Back when I ran it, the drivers were kind of a pain to deal with in
Ubuntu, primarily during PXE. It should be better now though.

If you have the network to support it, 25GbE is quite a bit cheaper per
port, and won't be so hard to drive. 40GbE is very hard to fill. I
personally probably would not do 40 again.

Warren Wang



On 7/13/16, 9:10 AM, "ceph-users on behalf of Götz Reinicke - IT
Koordinator" <ceph-users-boun...@lists.ceph.com on behalf of
goetz.reini...@filmakademie.de> wrote:

>Am 13.07.16 um 14:59 schrieb Joe Landman:
>>
>>
>> On 07/13/2016 08:41 AM, c...@jack.fr.eu.org wrote:
>>> 40Gbps can be used as 4*10Gbps
>>>
>>> I guess welcome feedbacks should not be stuck by "usage of a 40Gbps
>>> ports", but extented to "usage of more than a single 10Gbps port, eg
>>> 20Gbps etc too"
>>>
>>> Is there people here that are using more than 10G on an ceph server ?
>>
>> We have built, and are building Ceph units for some of our customers
>> with dual 100Gb links.  The storage box was one of our all flash
>> Unison units for OSDs.  Similarly, we have several customers actively
>> using multiple 40GbE links on our 60 bay Unison spinning rust disk
>> (SRD) box.
>>
>Now we get closer. Can you tell me which 40G NIC you use?
>
>/götz
>

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Quick short survey which SSDs

2016-07-12 Thread Warren Wang - ISD
Our testing so far shows that it's a pretty good drive. We use it for the
actual backing OSD, but the journal is on NVMe. The raw results indicate
that it's a reasonable journal too, if you need to colocate, but you'll
exhaust write performance pretty quickly depending on your workload. We
also have them in large numbers. So far, so good.

Warren Wang





On 7/8/16, 1:37 PM, "ceph-users on behalf of Carlos M. Perez"
<ceph-users-boun...@lists.ceph.com on behalf of cpe...@cmpcs.com> wrote:

>I posted a bunch of the more recent numbers in the specs.  Had some down
>time and had a bunch of SSD's lying around and just curious if any were
>hidden gems... Interestingly, the Intel drives seem to not require the
>write cache off, while other drives had to be "forced" off using the
>hdparm -W0 /dev/sdx to make sure it was off.
>
>The machine we tested on is a Dell C2100 Dual x5560, 96GB ram, LSI2008 IT
>mode controller
>
>intel Dc S3700 200GB
>Model Number:   INTEL SSDSC2BA200G3L
>Firmware Revision:  5DV10265
>
>1 - io=4131.2MB, bw=70504KB/s, iops=17626, runt= 60001msec
>5 - io=9529.1MB, bw=162627KB/s, iops=40656, runt= 60001msec
>10 - io=7130.5MB, bw=121684KB/s, iops=30421, runt= 60004msec
>
>Samsung SM863
>Model Number:   SAMSUNG MZ7KM240HAGR-0E005
>Firmware Revision:  GXM1003Q
>
>1 - io=2753.1MB, bw=47001KB/s, iops=11750, runt= 6msec
>5 - io=6248.8MB, bw=106643KB/s, iops=26660, runt= 60001msec
>10 - io=8084.1MB, bw=137981KB/s, iops=34495, runt= 60001msec
>
>We decided to go with Intel model.  The Samsung was impressive on the
>higher end with multiple threads, but figured for most of our nodes with
>4-6 OSD's the intel were a bit more proven and had better "light-medium"
>load numbers.  
>
>Carlos M. Perez
>CMP Consulting Services
>305-669-1515
>
>-Original Message-
>From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>Dan van der Ster
>Sent: Tuesday, July 5, 2016 4:23 AM
>To: Christian Balzer <ch...@gol.com>
>Cc: ceph-users <ceph-users@lists.ceph.com>
>Subject: Re: [ceph-users] Quick short survey which SSDs
>
>On Tue, Jul 5, 2016 at 10:04 AM, Dan van der Ster <d...@vanderster.com>
>wrote:
>> On Tue, Jul 5, 2016 at 9:53 AM, Christian Balzer <ch...@gol.com> wrote:
>>>> Unfamiliar: Samsung SM863
>>>>
>>> You might want to read the thread here:
>>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-February/007
>>> 871.html
>>>
>>> And google "ceph SM863".
>>>
>>> However I'm still waiting for somebody to confirm that these perform
>>> (as one would expect from DC level SSDs) at full speed with sync
>>> writes, which is the only important factor for journals.
>>
>> Tell me the fio options you're interested in and I'll run it right now.
>
>Using the options from Sebastien's blog I get:
>
>1 job: write: io=5863.3MB, bw=100065KB/s, iops=25016, runt= 60001msec
>5 jobs: write: io=11967MB, bw=204230KB/s, iops=51057, runt= 60001msec
>10 jobs: write: io=13760MB, bw=234829KB/s, iops=58707, runt= 60001msec
>
>Drive is model MZ7KM240 with firmware GXM1003Q.
>
>--
>Dan
>
>
>[1] fio --filename=/dev/sdc --direct=1 --sync=1 --rw=write --bs=4k
>--numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting
>--name=journal-test



Re: [ceph-users] Dramatic performance drop at certain number of objects in pool

2016-06-24 Thread Warren Wang - ISD
Oops, that reminds me, do you have min_free_kbytes set to something
reasonable like at least 2-4GB?
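
Something along these lines is what I have in mind (example values only,
sized for a box with a lot of RAM; min_free_kbytes is in KB):

# /etc/sysctl.d/90-ceph-osd.conf -- example values, not a universal recommendation
vm.min_free_kbytes = 4194304      # keep ~4 GB free for the kernel / network stack
vm.vfs_cache_pressure = 10        # hold on to inode/dentry caches longer
# load without a reboot:
sysctl -p /etc/sysctl.d/90-ceph-osd.conf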

Warren Wang



On 6/24/16, 10:23 AM, "Wade Holler" <wade.hol...@gmail.com> wrote:

>On the vm.vfs_cace_pressure = 1 :   We had this initially and I still
>think it is the best choice for most configs.  However with our large
>memory footprint, vfs_cache_pressure=1 increased the likelihood of
>hitting an issue where our write response time would double; then a
>drop of caches would return response time to normal.  I don't claim to
>totally understand this and I only have speculation at the moment.
>Again thanks for this suggestion, I do think it is best for boxes that
>don't have very large memory.
>
>@ Christian - reformatting to btrfs or ext4 is an option in my test
>cluster.  I thought about that but needed to sort xfs first. (thats
>what production will run right now) You all have helped me do that and
>thank you again.  I will circle back and test btrfs under the same
>conditions.  I suspect that it will behave similarly but it's only a
>day and half's work or so to test.
>
>Best Regards,
>Wade
>
>
>On Thu, Jun 23, 2016 at 8:09 PM, Somnath Roy <somnath@sandisk.com>
>wrote:
>> Oops , typo , 128 GB :-)...
>>
>> -Original Message-
>> From: Christian Balzer [mailto:ch...@gol.com]
>> Sent: Thursday, June 23, 2016 5:08 PM
>> To: ceph-users@lists.ceph.com
>> Cc: Somnath Roy; Warren Wang - ISD; Wade Holler; Blair Bethwaite; Ceph
>>Development
>> Subject: Re: [ceph-users] Dramatic performance drop at certain number
>>of objects in pool
>>
>>
>> Hello,
>>
>> On Thu, 23 Jun 2016 22:24:59 + Somnath Roy wrote:
>>
>>> Or even vm.vfs_cache_pressure = 0 if you have sufficient memory to
>>> *pin* inode/dentries in memory. We are using that for long now (with
>>> 128 TB node memory) and it seems helping specially for the random
>>> write workload and saving xattrs read in between.
>>>
>> 128TB node memory, really?
>> Can I have some of those, too? ^o^
>> And here I was thinking that Wade's 660GB machines were on the
>>excessive side.
>>
>> There's something to be said (and optimized) when your storage nodes
>>have the same or more RAM as your compute nodes...
>>
>> As for Warren, well spotted.
>> I personally use vm.vfs_cache_pressure = 1, this avoids the potential
>>fireworks if your memory is really needed elsewhere, while keeping
>>things in memory normally.
>>
>> Christian
>>
>>> Thanks & Regards
>>> Somnath
>>>
>>> -Original Message-
>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
>>> Of Warren Wang - ISD Sent: Thursday, June 23, 2016 3:09 PM
>>> To: Wade Holler; Blair Bethwaite
>>> Cc: Ceph Development; ceph-users@lists.ceph.com
>>> Subject: Re: [ceph-users] Dramatic performance drop at certain number
>>> of objects in pool
>>>
>>> vm.vfs_cache_pressure = 100
>>>
>>> Go the other direction on that. You'll want to keep it low to help
>>> keep inode/dentry info in memory. We use 10, and haven't had a problem.
>>>
>>>
>>> Warren Wang
>>>
>>>
>>>
>>>
>>> On 6/22/16, 9:41 PM, "Wade Holler" <wade.hol...@gmail.com> wrote:
>>>
>>> >Blairo,
>>> >
>>> >We'll speak in pre-replication numbers, replication for this pool is
>>>3.
>>> >
>>> >23.3 Million Objects / OSD
>>> >pg_num 2048
>>> >16 OSDs / Server
>>> >3 Servers
>>> >660 GB RAM Total, 179 GB Used (free -t) / Server vm.swappiness = 1
>>> >vm.vfs_cache_pressure = 100
>>> >
>>> >Workload is native librados with python.  ALL 4k objects.
>>> >
>>> >Best Regards,
>>> >Wade
>>> >
>>> >
>>> >On Wed, Jun 22, 2016 at 9:33 PM, Blair Bethwaite
>>> ><blair.bethwa...@gmail.com> wrote:
>>> >> Wade, good to know.
>>> >>
>>> >> For the record, what does this work out to roughly per OSD? And how
>>> >> much RAM and how many PGs per OSD do you have?
>>> >>
>>> >> What's your workload? I wonder whether for certain workloads (e.g.
>>> >> RBD) it's better to increase default object size somewhat before
>>> >> pushing the split/merge up a lot...
>>> >>
>>> >> Cheers,
>>> >>
>>> >>

Re: [ceph-users] Dramatic performance drop at certain number of objects in pool

2016-06-23 Thread Warren Wang - ISD
vm.vfs_cache_pressure = 100

Go the other direction on that. You'll want to keep it low to help keep
inode/dentry info in memory. We use 10, and haven't had a problem.


Warren Wang




On 6/22/16, 9:41 PM, "Wade Holler" <wade.hol...@gmail.com> wrote:

>Blairo,
>
>We'll speak in pre-replication numbers, replication for this pool is 3.
>
>23.3 Million Objects / OSD
>pg_num 2048
>16 OSDs / Server
>3 Servers
>660 GB RAM Total, 179 GB Used (free -t) / Server
>vm.swappiness = 1
>vm.vfs_cache_pressure = 100
>
>Workload is native librados with python.  ALL 4k objects.
>
>Best Regards,
>Wade
>
>
>On Wed, Jun 22, 2016 at 9:33 PM, Blair Bethwaite
><blair.bethwa...@gmail.com> wrote:
>> Wade, good to know.
>>
>> For the record, what does this work out to roughly per OSD? And how
>> much RAM and how many PGs per OSD do you have?
>>
>> What's your workload? I wonder whether for certain workloads (e.g.
>> RBD) it's better to increase default object size somewhat before
>> pushing the split/merge up a lot...
>>
>> Cheers,
>>
>> On 23 June 2016 at 11:26, Wade Holler <wade.hol...@gmail.com> wrote:
>>> Based on everyones suggestions; The first modification to 50 / 16
>>> enabled our config to get to ~645Mill objects before the behavior in
>>> question was observed (~330 was the previous ceiling).  Subsequent
>>> modification to 50 / 24 has enabled us to get to 1.1 Billion+
>>>
>>> Thank you all very much for your support and assistance.
>>>
>>> Best Regards,
>>> Wade
>>>
>>>
>>> On Mon, Jun 20, 2016 at 6:58 PM, Christian Balzer <ch...@gol.com>
>>>wrote:
>>>>
>>>> Hello,
>>>>
>>>> On Mon, 20 Jun 2016 20:47:32 + Warren Wang - ISD wrote:
>>>>
>>>>> Sorry, late to the party here. I agree, up the merge and split
>>>>> thresholds. We're as high as 50/12. I chimed in on an RH ticket here.
>>>>> One of those things you just have to find out as an operator since
>>>>>it's
>>>>> not well documented :(
>>>>>
>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1219974
>>>>>
>>>>> We have over 200 million objects in this cluster, and it's still
>>>>>doing
>>>>> over 15000 write IOPS all day long with 302 spinning drives + SATA
>>>>>SSD
>>>>> journals. Having enough memory and dropping your vfs_cache_pressure
>>>>> should also help.
>>>>>
>>>> Indeed.
>>>>
>>>> Since it was asked in that bug report and also my first suspicion, it
>>>> would probably be good time to clarify that it isn't the splits that
>>>>cause
>>>> the performance degradation, but the resulting inflation of dir
>>>>entries
>>>> and exhaustion of SLAB and thus having to go to disk for things that
>>>> normally would be in memory.
>>>>
>>>> Looking at Blair's graph from yesterday pretty much makes that clear,
>>>>a
>>>> purely split caused degradation should have relented much quicker.
>>>>
>>>>
>>>>> Keep in mind that if you change the values, it won't take effect
>>>>> immediately. It only merges them back if the directory is under the
>>>>> calculated threshold and a write occurs (maybe a read, I forget).
>>>>>
>>>> If it's a read a plain scrub might do the trick.
>>>>
>>>> Christian
>>>>> Warren
>>>>>
>>>>>
>>>>> From: ceph-users
>>>>> 
>>>>><ceph-users-boun...@lists.ceph.com<mailto:ceph-users-boun...@lists.cep
>>>>>h.com>>
>>>>> on behalf of Wade Holler
>>>>> <wade.hol...@gmail.com<mailto:wade.hol...@gmail.com>> Date: Monday,
>>>>>June
>>>>> 20, 2016 at 2:48 PM To: Blair Bethwaite
>>>>> <blair.bethwa...@gmail.com<mailto:blair.bethwa...@gmail.com>>, Wido
>>>>>den
>>>>> Hollander <w...@42on.com<mailto:w...@42on.com>> Cc: Ceph Development
>>>>> <ceph-de...@vger.kernel.org<mailto:ceph-de...@vger.kernel.org>>,
>>>>> "ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>"
>>>>> <ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>>
>>>>>Subject:
>>>>&

Re: [ceph-users] Dramatic performance drop at certain number of objects in pool

2016-06-20 Thread Warren Wang - ISD
Sorry, late to the party here. I agree, up the merge and split thresholds. 
We're as high as 50/12. I chimed in on an RH ticket here. One of those things 
you just have to find out as an operator since it's not well documented :(

https://bugzilla.redhat.com/show_bug.cgi?id=1219974

We have over 200 million objects in this cluster, and it's still doing over 
15000 write IOPS all day long with 302 spinning drives + SATA SSD journals. 
Having enough memory and dropping your vfs_cache_pressure should also help.

Keep in mind that if you change the values, it won't take effect immediately. 
It only merges them back if the directory is under the calculated threshold and 
a write occurs (maybe a read, I forget).
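
For reference, the two knobs I mean live in ceph.conf on the OSD nodes.
Reading "50/12" as merge threshold / split multiple, a sketch looks like
this (values taken from this thread, not a universal recommendation):

[osd]
filestore merge threshold = 50
filestore split multiple = 12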

Warren


From: ceph-users 
> 
on behalf of Wade Holler >
Date: Monday, June 20, 2016 at 2:48 PM
To: Blair Bethwaite 
>, Wido den 
Hollander >
Cc: Ceph Development 
>, 
"ceph-users@lists.ceph.com" 
>
Subject: Re: [ceph-users] Dramatic performance drop at certain number of 
objects in pool

Thanks everyone for your replies.  I sincerely appreciate it. We are testing 
with different pg_num and filestore_split_multiple settings.  Early indications 
are  well not great. Regardless it is nice to understand the symptoms 
better so we try to design around it.

Best Regards,
Wade


On Mon, Jun 20, 2016 at 2:32 AM Blair Bethwaite 
> wrote:
On 20 June 2016 at 09:21, Blair Bethwaite 
> wrote:
> slow request issues). If you watch your xfs stats you'll likely get
> further confirmation. In my experience xs_dir_lookups balloons (which
> means directory lookups are missing cache and going to disk).

Murphy's a bitch. Today we upgraded a cluster to latest Hammer in
preparation for Jewel/RHCS2. Turns out when we last hit this very
problem we had only ephemerally set the new filestore merge/split
values - oops. Here's what started happening when we upgraded and
restarted a bunch of OSDs:
https://au-east.erc.monash.edu.au/swift/v1/public/grafana-ceph-xs_dir_lookup.png

Seemed to cause lots of slow requests :-/. We corrected it about
12:30, then still took a while to settle.

--
Cheers,
~Blairo



[ceph-users] CFQ changes affect Ceph priority?

2016-02-05 Thread Warren Wang - ISD
Not sure how many folks use the CFQ scheduler to use Ceph IO priority, but 
there’s a CFQ change that probably needs to be evaluated for Ceph purposes.

http://lkml.iu.edu/hypermail/linux/kernel/1602.0/00820.html

This might be a better question for the dev list.
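
For anyone who wants to check, a sketch of the pieces involved (option names
as of Hammer-era releases, so verify against your version; the ioprio options
only matter when the OSD data disk is actually using CFQ):

# see which scheduler an OSD data disk is using, and switch it if needed
cat /sys/block/sda/queue/scheduler
echo cfq > /sys/block/sda/queue/scheduler

# ceph.conf [osd]: run the disk (scrub) threads at idle priority under CFQ
osd disk thread ioprio class = idle
osd disk thread ioprio priority = 7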

Warren Wang



Re: [ceph-users] sync writes - expected performance?

2015-12-14 Thread Warren Wang - ISD
Which SSD are you using? Dsync flag will dramatically slow down most SSDs.
You've got to be very careful about the SSD you pick.

Warren Wang




On 12/14/15, 5:49 AM, "Nikola Ciprich" <nikola.cipr...@linuxbox.cz> wrote:

>Hello,
>
>i'm doing some measuring on test (3 nodes) cluster and see strange
>performance
>drop for sync writes..
>
>I'm using SSD for both journalling and OSD. It should be suitable for
>journal, giving about 16.1KIOPS (67MB/s) for sync IO.
>
>(measured using fio --filename=/dev/xxx --direct=1 --sync=1 --rw=write
>--bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based
>--group_reporting --name=journal-test)
>
>On top of this cluster, I have running KVM guest (using qemu librbd
>backend).
>Overall performance seems to be quite good, but the problem is when I try
>to measure sync IO performance inside the guest.. I'm getting only about
>600IOPS,
>which I think is quite poor.
>
>The problem is, I don't see any bottlenect, OSD daemons don't seem to be
>hanging on
>IO, neither hogging CPU, qemu process is also not somehow too much
>loaded..
>
>I'm using hammer 0.94.5 on top of centos 6 (4.1 kernel), all debugging
>disabled,
>
>my question is, what results I can expect for synchronous writes? I
>understand
>there will always be some performance drop, but 600IOPS on top of storage
>which
>can give as much as 16K IOPS seems to little..
>
>Has anyone done similar measuring?
>
>thanks a lot in advance!
>
>BR
>
>nik
>
>
>-- 
>-
>Ing. Nikola CIPRICH
>LinuxBox.cz, s.r.o.
>28.rijna 168, 709 00 Ostrava
>
>tel.:   +420 591 166 214
>fax:+420 596 621 273
>mobil:  +420 777 093 799
>www.linuxbox.cz
>
>mobil servis: +420 737 238 656
>email servis: ser...@linuxbox.cz
>-



Re: [ceph-users] sync writes - expected performance?

2015-12-14 Thread Warren Wang - ISD
Whoops, I misread Nikola's original email, sorry!

If all your SSDs are all performing at that level for sync IO, then I
agree that it's down to other things, like network latency and PG locking.
Sequential 4K writes with 1 thread and 1 qd is probably the worst
performance you'll see. Is there a router between your VM and the Ceph
cluster, or one between Ceph nodes for the cluster network?

Are you using dsync at the VM level to simulate what a database or other
app would do? If you can switch to directIO, you'll likely get far better
performance.
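
Concretely, the difference inside the guest is something like this (the
device path is a placeholder; sync=1 adds an O_DSYNC flush per write, while
direct=1 alone only bypasses the page cache):

# per-write flush -- worst case for a replicated, networked store
fio --filename=/dev/vdb --rw=write --bs=4k --numjobs=1 --iodepth=1 \
    --direct=1 --sync=1 --runtime=60 --time_based --name=dsync-test

# direct IO without the per-write flush -- usually far higher IOPS
fio --filename=/dev/vdb --rw=write --bs=4k --numjobs=1 --iodepth=1 \
    --direct=1 --runtime=60 --time_based --name=direct-test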

Warren Wang




On 12/14/15, 12:03 PM, "Mark Nelson" <mnel...@redhat.com> wrote:

>
>
>On 12/14/2015 04:49 AM, Nikola Ciprich wrote:
>> Hello,
>>
>> i'm doing some measuring on test (3 nodes) cluster and see strange
>>performance
>> drop for sync writes..
>>
>> I'm using SSD for both journalling and OSD. It should be suitable for
>> journal, giving about 16.1KIOPS (67MB/s) for sync IO.
>>
>> (measured using fio --filename=/dev/xxx --direct=1 --sync=1 --rw=write
>>--bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based
>>--group_reporting --name=journal-test)
>>
>> On top of this cluster, I have running KVM guest (using qemu librbd
>>backend).
>> Overall performance seems to be quite good, but the problem is when I
>>try
>> to measure sync IO performance inside the guest.. I'm getting only
>>about 600IOPS,
>> which I think is quite poor.
>>
>> The problem is, I don't see any bottlenect, OSD daemons don't seem to
>>be hanging on
>> IO, neither hogging CPU, qemu process is also not somehow too much
>>loaded..
>>
>> I'm using hammer 0.94.5 on top of centos 6 (4.1 kernel), all debugging
>>disabled,
>>
>> my question is, what results I can expect for synchronous writes? I
>>understand
>> there will always be some performance drop, but 600IOPS on top of
>>storage which
>> can give as much as 16K IOPS seems to little..
>
>So basically what this comes down to is latency.  Since you get 16K IOPS
>for O_DSYNC writes on the SSD, there's a good chance that it has a
>super-capacitor on board and can basically acknowledge a write as
>complete as soon as it hits the on-board cache rather than when it's
>written to flash.  Figure that for 16K O_DSYNC IOPs means that each IO
>is completing in around 0.06ms on average.  That's very fast!  At 600
>IOPs for O_DSYNC writes on your guest, you're looking at about 1.6ms per
>IO on average.
>
>So how do we account for the difference?  Let's start out by looking at
>a quick example of network latency (This is between two random machines
>in one of our labs at Red Hat):
>
>> 64 bytes from gqas008: icmp_seq=1 ttl=64 time=0.583 ms
>> 64 bytes from gqas008: icmp_seq=2 ttl=64 time=0.219 ms
>> 64 bytes from gqas008: icmp_seq=3 ttl=64 time=0.224 ms
>> 64 bytes from gqas008: icmp_seq=4 ttl=64 time=0.200 ms
>> 64 bytes from gqas008: icmp_seq=5 ttl=64 time=0.196 ms
>
>now consider that when you do a write in ceph, you write to the primary
>OSD which then writes out to the replica OSDs.  Every replica IO has to
>complete before the primary will send the acknowledgment to the client
>(ie you have to add the latency of the worst of the replica writes!).
>In your case, the network latency alone is likely dramatically
>increasing IO latency vs raw SSD O_DSYNC writes.  Now add in the time to
>process crush mappings, look up directory and inode metadata on the
>filesystem where objects are stored (assuming it's not cached), and
>other processing time, and the 1.6ms latency for the guest writes starts
>to make sense.
>
>Can we improve things?  Likely yes.  There's various areas in the code
>where we can trim latency away, implement alternate OSD backends, and
>potentially use alternate network technology like RDMA to reduce network
>latency.  The thing to remember is that when you are talking about
>O_DSYNC writes, even very small increases in latency can have dramatic
>effects on performance.  Every fraction of a millisecond has huge
>ramifications.
>
>>
>> Has anyone done similar measuring?
>>
>> thanks a lot in advance!
>>
>> BR
>>
>> nik
>>
>>
>>
>>



Re: [ceph-users] sync writes - expected performance?

2015-12-14 Thread Warren Wang - ISD
I get where you are coming from, Jan, but for a test this small, I still
think checking network latency first for a single op is a good idea.

Given that the cluster is not being stressed, CPUs may be running slow. It
may also benefit the test to turn CPU governors to performance for all
cores.
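
The two checks look roughly like this (host names are placeholders; cpupower
usually comes from the kernel-tools / linux-tools package):

# round-trip time between the client and the OSD hosts; at queue depth 1
# this is a hard floor on per-op latency
ping -c 10 osd-host-1

# check and pin the CPU frequency governor on the Ceph nodes
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
cpupower frequency-set -g performance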

Warren Wang




On 12/14/15, 5:07 PM, "Jan Schermer" <j...@schermer.cz> wrote:

>Even with 10G ethernet, the bottleneck is not the network, nor the drives
>(assuming they are datacenter-class). The bottleneck is the software.
>The only way to improve that is to either increase CPU speed (more GHz
>per core) or to simplify the datapath IO has to take before it is
>considered durable.
>Stuff like RDMA will help only if there is zero-copy between the (RBD)
>client and the drive, or if the write is acknowledged when in the remote
>buffers of replicas (but it still has to come from client directly or
>RDMA becomes a bit pointless, IMHO).
>
>Databases do sync writes for a reason, O_DIRECT doesn't actually make
>strong guarantees on ordering or buffering, though in practice the race
>condition is negligible.
>
>Your 600 IOPS are pretty good actually.
>
>Jan
>
>
>> On 14 Dec 2015, at 22:58, Warren Wang - ISD <warren.w...@walmart.com>
>>wrote:
>> 
>> Whoops, I misread Nikola's original email, sorry!
>> 
>> If all your SSDs are all performing at that level for sync IO, then I
>> agree that it's down to other things, like network latency and PG
>>locking.
>> Sequential 4K writes with 1 thread and 1 qd is probably the worst
>> performance you'll see. Is there a router between your VM and the Ceph
>> cluster, or one between Ceph nodes for the cluster network?
>> 
>> Are you using dsync at the VM level to simulate what a database or other
>> app would do? If you can switch to directIO, you'll likely get far
>>better
>> performance. 
>> 
>> Warren Wang
>> 
>> 
>> 
>> 
>> On 12/14/15, 12:03 PM, "Mark Nelson" <mnel...@redhat.com> wrote:
>> 
>>> 
>>> 
>>> On 12/14/2015 04:49 AM, Nikola Ciprich wrote:
>>>> Hello,
>>>> 
>>>> i'm doing some measuring on test (3 nodes) cluster and see strange
>>>> performance
>>>> drop for sync writes..
>>>> 
>>>> I'm using SSD for both journalling and OSD. It should be suitable for
>>>> journal, giving about 16.1KIOPS (67MB/s) for sync IO.
>>>> 
>>>> (measured using fio --filename=/dev/xxx --direct=1 --sync=1 --rw=write
>>>> --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based
>>>> --group_reporting --name=journal-test)
>>>> 
>>>> On top of this cluster, I have running KVM guest (using qemu librbd
>>>> backend).
>>>> Overall performance seems to be quite good, but the problem is when I
>>>> try
>>>> to measure sync IO performance inside the guest.. I'm getting only
>>>> about 600IOPS,
>>>> which I think is quite poor.
>>>> 
>>>> The problem is, I don't see any bottlenect, OSD daemons don't seem to
>>>> be hanging on
>>>> IO, neither hogging CPU, qemu process is also not somehow too much
>>>> loaded..
>>>> 
>>>> I'm using hammer 0.94.5 on top of centos 6 (4.1 kernel), all debugging
>>>> disabled,
>>>> 
>>>> my question is, what results I can expect for synchronous writes? I
>>>> understand
>>>> there will always be some performance drop, but 600IOPS on top of
>>>> storage which
>>>> can give as much as 16K IOPS seems to little..
>>> 
>>> So basically what this comes down to is latency.  Since you get 16K
>>>IOPS
>>> for O_DSYNC writes on the SSD, there's a good chance that it has a
>>> super-capacitor on board and can basically acknowledge a write as
>>> complete as soon as it hits the on-board cache rather than when it's
>>> written to flash.  Figure that for 16K O_DSYNC IOPs means that each IO
>>> is completing in around 0.06ms on average.  That's very fast!  At 600
>>> IOPs for O_DSYNC writes on your guest, you're looking at about 1.6ms
>>>per
>>> IO on average.
>>> 
>>> So how do we account for the difference?  Let's start out by looking at
>>> a quick example of network latency (This is between two random machines
>>> in one of our labs at Red Hat):
>>> 
>>>> 64 bytes from gqas008: icmp_seq=1 ttl=64 time=0.583 ms
>>>> 64 bytes from gqas008: icmp_seq=2 ttl=

Re: [ceph-users] Ceph Sizing

2015-12-03 Thread Warren Wang - ISD
I would be a lot more conservative in terms of what a spinning drive can
do. The Mirantis presentation has pretty high expectations out of a
spinning drive, as they're somewhat ignoring latency (until the last few
slides). Look at the max latencies for anything above 1 QD on a spinning
drive.

If you factor in a latency requirement, the capability of the drives falls
dramatically. You might be able to offset this by using NVMe or something
as a cache layer between the journal and the OSD, using bcache, LVM cache,
etc. In much of the performance testing that we've done, the average isn't
too bad, but 90th percentile numbers tend to be quite bad. Part of it is
probably from locking PGs during a flush, and the other part is just the
nature of spinning drives.

I'd try to get a handle on expected workloads before picking the gear, but
if you have to pick before that, go SSD if you have the budget :) You can
offset it a little by using erasure coding for the RGW portion, or using
spinning drives for that.

I think picking gear for Ceph is tougher than running an actual cluster :)
Best of luck. I think you're still starting with better, and more info,
than some of us did years ago.
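
To make the replication question in the quoted mail below concrete, here is
a rough worked example (the disk count and per-disk IOPS are made-up numbers,
and it ignores journal double-writes and latency):

# using the quoted formula: cluster IOPS = per-disk IOPS * disk count * 0.88
echo $(( 150 * 90 * 88 / 100 ))            # ~11880 aggregate read IOPS
# writes are roughly divided by the replica count (3x here); reads are not
echo $(( 150 * 90 * 88 / 100 / 3 ))        # ~3960 aggregate write IOPS
echo $(( 150 * 90 * 88 / 100 / 3 / 700 ))  # ~5 write IOPS per VM across 700 VMs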

Warren Wang




From:  Sam Huracan <nowitzki.sa...@gmail.com>
Date:  Thursday, December 3, 2015 at 4:01 AM
To:  Srinivasula Maram <srinivasula.ma...@sandisk.com>
Cc:  Nick Fisk <n...@fisk.me.uk>, "ceph-us...@ceph.com"
<ceph-us...@ceph.com>
Subject:  Re: [ceph-users] Ceph Sizing


I'm following this presentation of Mirantis team:
http://www.slideshare.net/mirantis/ceph-talk-vancouver-20

They calculate CEPH IOPS = Disk IOPS * HDD Quantity * 0.88 (4-8k random
read proportion)


And  VM IOPS = CEPH IOPS / VM Quantity

But if I use replication of 3, Would VM IOPS be divided by 3?


2015-12-03 7:09 GMT+07:00 Sam Huracan <nowitzki.sa...@gmail.com>:

IO size is 4 KB, and I need a minimum sizing, cost optimized.
I intend to use SuperMicro devices:
http://www.supermicro.com/solutions/storage_Ceph.cfm


What do you think?


2015-12-02 23:17 GMT+07:00 Srinivasula Maram
<srinivasula.ma...@sandisk.com>:

One more factor we need to consider here is IO size(block size) to get
required IOPS, based on this we can calculate the bandwidth and design the
solution.

Thanks
Srinivas

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
Nick Fisk
Sent: Wednesday, December 02, 2015 9:28 PM
To: 'Sam Huracan'; ceph-us...@ceph.com
Subject: Re: [ceph-users] Ceph Sizing

You've left out an important factor: cost. Otherwise I would just say
buy enough SSD to cover the capacity.

> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> Of Sam Huracan
> Sent: 02 December 2015 15:46
> To: ceph-us...@ceph.com
> Subject: [ceph-users] Ceph Sizing
>
> Hi,
> I'm building a storage structure for OpenStack cloud System, input:
> - 700 VM
> - 150 IOPS per VM
> - 20 Storage per VM (boot volume)
> - Some VM run database (SQL or MySQL)
>
> I want to ask a sizing plan for Ceph to satisfy the IOPS requirement,
> I list some factors considered:
> - Amount of OSD (SAS Disk)
> - Amount of Journal (SSD)
> - Amount of OSD Servers
> - Amount of MON Server
> - Network
> - Replica ( default is 3)
>
> I will divide to 3 pool with 3 Disk types: SSD, SAS 15k and SAS 10k
> Should I use all 3 disk types in one server or build dedicated servers
> for every pool? Example: 3 15k servers for Pool-1, 3 10k Servers for
>Pool-2.
>
> Could you help me a formula to calculate the minimum devices needed
> for above input.
>
> Thanks and regards.






















Re: [ceph-users] Upgrade to hammer, crush tuneables issue

2015-11-24 Thread Warren Wang - ISD
You upgraded (and restarted as appropriate) all the clients first, right?

Warren Wang





On 11/24/15, 10:52 AM, "Joe Ryner" <jry...@cait.org> wrote:

>Hi,
>
>Last night I upgraded my cluster from Centos 6.5 -> Centos 7.1 and in the
>process upgraded from Emperor -> Firefly -> Hammer
>
>When I finished I changed the crush tunables from
>ceph osd crush tunables legacy -> ceph osd crush tunables optimal
>
>I knew this would cause data movement.  But the IO for my clients is
>unacceptable.  Can any please tell what the best settings are for my
>configuration.  I have 2 Dell R720 Servers and 2 Dell R730 servers.  I
>have 36 1TB SATA SSD Drives in my cluster.  The servers have 128 GB of
>RAM.
>
>Below is some detail the might help.  According to my calculations the
>rebalance will take over a day.
>
>I would greatly appreciate some help on this.
>
>Thank you,
>
>Joe
>
>BEGIN --
>NODE: gold.sys.cu.cait.org
>CMD : free -m
>              total        used        free      shared  buff/cache   available
>Mem:         128726       71316        2077          20       55332       56767
>Swap:             0           0           0
>END   --
>BEGIN --
>NODE: gallo.sys.cu.cait.org
>CMD : free -m
>              total        used        free      shared  buff/cache   available
>Mem:         128726       79489         462          36       48774       48547
>Swap:          8191           0        8191
>END   --
>BEGIN --
>NODE: hamms.sys.cu.cait.org
>CMD : free -m
>              total        used        free      shared  buff/cache   available
>Mem:         128536       69412         659          19       58464       58342
>Swap:         16383           0       16383
>END   --
>BEGIN --
>NODE: helm.sys.cu.cait.org
>CMD : free -m
>              total        used        free      shared  buff/cache   available
>Mem:         128536       66216        8799          26       53520       61603
>Swap:         16383        1739       14644
>END   --
>
>ceph osd tree
>ID  WEIGHT   TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY
> -1 36.0 root default
> -3 36.0 rack curack
>-10  9.0 host helm
>  0  1.0 osd.0   up  1.0  1.0
>  1  1.0 osd.1   up  1.0  1.0
>  2  1.0 osd.2   up  1.0  1.0
>  3  1.0 osd.3   up  1.0  1.0
>  4  1.0 osd.4   up  1.0  1.0
>  5  1.0 osd.5   up  1.0  1.0
>  6  1.0 osd.6   up  1.0  1.0
>  7  1.0 osd.7   up  1.0  1.0
>  8  1.0 osd.8   up  1.0  1.0
> -7  9.0 host gold
> 16  1.0 osd.16  up  1.0  1.0
> 17  1.0 osd.17  up  1.0  1.0
> 18  1.0 osd.18  up  1.0  1.0
> 19  1.0 osd.19  up  1.0  1.0
> 20  1.0 osd.20  up  1.0  1.0
> 21  1.0 osd.21  up  1.0  1.0
>  9  1.0 osd.9   up  1.0  1.0
> 10  1.0 osd.10  up  1.0  1.0
> 34  1.0 osd.34  up  1.0  1.0
> -8  9.0 host gallo
> 22  1.0 osd.22  up  1.0  1.0
> 23  1.0 osd.23  up  1.0  1.0
> 24  1.0 osd.24  up  1.0  1.0
> 25  1.0 osd.25  up  1.0  1.0
> 26  1.0 osd.26  up  1.0  1.0
> 27  1.0 osd.27  up  1.0  1.0
> 11  1.0 osd.11  up  1.0  1.0
> 12  1.0 osd.12  up  1.0  1.0
> 35  1.0 osd.35  up  1.0  1.0
> -9  9.0 host hamms
> 13  1.0 osd.13  up  1.0  1.0
> 14  1.0 osd.14  up  1.0  1.0
> 15  1.0 

Re: [ceph-users] All SSD Pool - Odd Performance

2015-11-18 Thread Warren Wang - ISD
What were you using for iodepth and numjobs? If you’re getting an average of 
2ms per operation, and you’re single threaded, I’d expect about 500 IOPS / 
thread, until you hit the limit of your QEMU setup, which may be a single IO 
thread. That’s also what I think Mike is alluding to.

Warren

From: Sean Redmond >
Date: Wednesday, November 18, 2015 at 6:39 AM
To: "ceph-us...@ceph.com" 
>
Subject: [ceph-users] All SSD Pool - Odd Performance

Hi,

I have a performance question for anyone running an SSD only pool. Let me 
detail the setup first.

12 X Dell PowerEdge R630 ( 2 X 2620v3 64Gb RAM)
8 X intel DC 3710 800GB
Dual port Solarflare 10GB/s NIC (one front and one back)
Ceph 0.94.5
Ubuntu 14.04 (3.13.0-68-generic)

The above is in one pool that is used for QEMU guests. A 4k fio test on the SSD 
directly yields around 55k IOPS; the same test inside a QEMU guest seems to hit 
a limit around 4k IOPS. If I deploy multiple guests they can all reach 4k IOPS 
simultaneously.

I don't see any evidence of a bottleneck on the OSD hosts. Is this limit inside 
the guest expected, or am I just not looking deep enough yet?

Thanks



Re: [ceph-users] Advised Ceph release

2015-11-18 Thread Warren Wang - ISD
If it’s your first prod cluster, and you have no hard requirements for 
Infernalis features, I would say stick with Hammer.

Warren

From: Bogdan SOLGA >
Date: Wednesday, November 18, 2015 at 1:58 PM
To: ceph-users >
Cc: Calin Fatu >
Subject: [ceph-users] Advised Ceph release

Hello, everyone!

We have recently set up a Ceph cluster running on the Hammer release (v0.94.5), 
and we would like to know what is the advised release for preparing a 
production-ready cluster - the LTS version (Hammer) or the latest stable 
version (Infernalis)?

The cluster works properly (so far), and we're still not sure whether we should 
upgrade to Infernalis or not.

Thank you!

Regards,
Bogdan




Re: [ceph-users] ceph and upgrading OS version

2015-10-21 Thread Warren Wang - ISD
Depending on how busy your cluster is, I’d nuke and pave node by node. You can 
slow the data movement off the old box, and also slow it on the way back in 
with weighting. My own personal preference, if you have performance overhead to 
spare.
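
A sketch of what that looks like in practice (the OSD id and weights are
placeholders):

# drain an OSD slowly before rebuilding its node
ceph osd crush reweight osd.12 1.0
ceph osd crush reweight osd.12 0.5
ceph osd crush reweight osd.12 0.0
# after the reinstall, ramp it back in the same way
ceph osd crush reweight osd.12 0.5
ceph osd crush reweight osd.12 1.82   # final weight, typically the drive size in TB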

Warren

From: Andrei Mikhailovsky >
Date: Tuesday, October 20, 2015 at 3:05 PM
To: "ceph-us...@ceph.com" 
>
Subject: [ceph-users] ceph and upgrading OS version

Hello everyone

I am planning to upgrade my ceph servers from Ubuntu 12.04 to 14.04 and I am 
wondering if you have a recommended process of upgrading the OS version without 
causing any issues to the ceph cluster?

Many thanks

Andrei



Re: [ceph-users] Ceph, SSD, and NVMe

2015-10-02 Thread Warren Wang - ISD
Since you didn't hear much from the successful crowd, I'll chime in. At my
previous employer, we ran some pretty large clusters (over 1PB)
successfully on Hammer. Some were upgraded from Firefly, and by no means
do I consider myself to be a developer. We totaled over 15 production
clusters. I'm not saying there weren't some rocky times, but they were
generally not directly due to Ceph code, but things ancillary to it, like
kernel bugs, customers driving traffic, hardware selection/failures, or
minor config issues. We never lost a cluster, though we did lose access to
them on occasion.

It does require you to stay up to date on what's going on with the
community, but I don't think that it's too different from OpenStack in
that regard. If support is a concern, there's always the Red Hat option,
or purchase a Ceph appliance like the Sandisk Infiniflash, which comes
with solid support from folks like Somnath.

FWIW, Hammer's write performance isn't awful. My coworker borrowed some
compute nodes, and ran a pretty large scale test with 400 SSDs across 50
nodes, and the results were pretty encouraging.

Warren

On 10/1/15, 10:01 PM, "J David"  wrote:

>This is all very helpful feedback, thanks so much.
>
>Also it sounds like you guys have done a lot of work on this, so
>thanks for that as well!
>
>Is Hammer generally considered stable enough for production in an
>RBD-only environment?  The perception around here is that the number
>of people who report lost data or inoperable clusters due to bugs in
>Hammer on this list is troubling enough to cause hesitation.  There's
>a specific term for overweighting the probability of catastrophic
>negative outcomes, and maybe that's what's happening.  People tend not
>to post to the list "Hey we have a cluster, it's running great!"
>instead waiting until things are not great, so the list paints an
>artificially depressing picture of stability.  But when we ask around
>quietly to other places we know running Ceph in production, which is
>admittedly a very small sample, they're all also still running
>Firefly.
>
>Admittedly, it doesn't help that "On my recommendation, we performed a
>non-reversible upgrade on the production cluster which, despite our
>best testing efforts, wrecked things causing us to lose 4 hours of
>data and requiring 2 days of downtime while we rebuilt the cluster and
>restored the backups" is pretty much guaranteed to be followed by,
>"You're fired."
>
>So, do medium-sized IT organizations (i.e. those without the resources
>to have a Ceph developer on staff) run Hammer-based deployments in
>production successfully?
>
>Please understand this is not meant to be sarcastic or critical of the
>project in any way.  Ceph is amazing, and we love it.  Some features
>of Ceph, like CephFS, have been considered not-production-quality for
>a long time, and that is to be expected.  These things are incredibly
>complex and take time to get right.  So organizations in our position
>just don't use that stuff.  As a relative outsider for whom the Ceph
>source code is effectively a foreign language, it's just *really* hard
>to tell if Hammer in general is in that same "still baking" category.
>
>Thanks!
>
>
>On Wed, Sep 30, 2015 at 3:33 PM, Somnath Roy 
>wrote:
>> David,
>> You should move to Hammer to get all the benefits of performance. It's
>>all added to Giant and migrated to the present hammer LTS release.
>> FYI, focus was so far with read performance improvement and what we saw
>>in our environment with 6Gb SAS SSDs so far that we are able to saturate
>>drives BW wise with 64K onwards. But, with smaller block like 4K we are
>>not able to saturate the SAS SSD drives yet.
>> But, considering Ceph's scale out nature you can get some very good
>>numbers out of a cluster. For example, with 8 SAS SSD drives (in a JBOF)
>>and having 2 heads in front (So, a 2 node Ceph cluster) we are able to
>>hit ~300K Random read iops while 8 SSD aggregated performance would be
>>~400K. Not too bad. At this point we are saturating host cpus.
>> We have seen almost linear scaling if you add similar setups i.e adding
>>say ~3 of the above setup, you could hit ~900K RR iops. So, I would say
>>it is definitely there in terms read iops and more improvement are
>>coming.
>> But, write path is very awful compare to read and that's where the
>>problem is. Because, in the mainstream, no workload is 100% RR (IMO).
>>So,  even if you have say 90-10 read/write the performance numbers would
>>be  ~6/7 X slower.
>> So, it is very much dependent on your workload/application access
>>pattern and obviously the cost you are willing to spend.
>>
>> Thanks & Regards
>> Somnath
>>
>> -Original Message-
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
>>Of Mark Nelson
>> Sent: Wednesday, September 30, 2015 12:04 PM
>> To: ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] Ceph, SSD, and NVMe
>>
>> On 09/30/2015 09:34 AM, J 

Re: [ceph-users] pgs stuck unclean on a new pool despite the pool size reconfiguration

2015-10-02 Thread Warren Wang - ISD
You probably don’t want hashpspool automatically set, since your clients may 
still not understand that crush map feature. You can try to unset it for that 
pool and see what happens, or create a new pool without hashpspool enabled from 
the start.  Just a guess.
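
If it helps, a sketch of both options (syntax from memory, so check the
'ceph osd pool set' help output on your release; some versions want an extra
confirmation flag when toggling pool flags):

# unset the flag on the existing pool
ceph osd pool set bench2 hashpspool false

# or keep new pools from getting it, via ceph.conf on the monitors
[global]
osd pool default flag hashpspool = false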

Warren

From: Giuseppe Civitella 
>
Date: Friday, October 2, 2015 at 10:05 AM
To: ceph-users >
Subject: [ceph-users] pgs stuck unclean on a new pool despite the pool size 
reconfiguration

Hi all,
I have a Firefly cluster which has been upgraded from Emperor.
It has 2 OSD hosts and 3 monitors.
The cluster has default values for what concerns size and min_size of the pools.
Once upgraded to Firefly, I created a new pool called bench2:
ceph osd pool create bench2 128 128
and set its sizes:
ceph osd pool set bench2 size 2
ceph osd pool set bench2 min_size 1

this is the state of the pools:
pool 0 'data' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins 
pg_num 64 pgp_num 64 last_change 1 crash_replay_interval 45 stripe_width 0
pool 1 'metadata' replicated size 2 min_size 1 crush_ruleset 1 object_hash 
rjenkins pg_num 64 pgp_num 64 last_change 1 stripe_width 0
pool 2 'rbd' replicated size 2 min_size 1 crush_ruleset 2 object_hash rjenkins 
pg_num 64 pgp_num 64 last_change 1 stripe_width 0
pool 3 'volumes' replicated size 2 min_size 1 crush_ruleset 0 object_hash 
rjenkins pg_num 384 pgp_num 384 last_change 2568 stripe_width 0
removed_snaps [1~75]
pool 4 'images' replicated size 2 min_size 1 crush_ruleset 0 object_hash 
rjenkins pg_num 384 pgp_num 384 last_change 1895 stripe_width 0
pool 8 'bench2' replicated size 2 min_size 1 crush_ruleset 0 object_hash 
rjenkins pg_num 128 pgp_num 128 last_change 2580 flags hashpspool stripe_width 0

despite this I still get a warning about 128 pgs stuck unclean.
The "ceph health detail" shows me the stuck PGs. So i take one to get the 
involved OSDs:

pg 8.38 is stuck unclean since forever, current state active, last acting [22,7]

if I restart the OSD with id 22, the PG 8.38 gets an active+clean state.

This is incorrect behavior, AFAIK. The cluster should pick up the new 
size and min_size values without any manual intervention. So my question is: 
any idea about why this happens and how to restore the default behavior? Do I 
need to restart all of the OSDs to restore an healthy state?

thanks a lot
Giuseppe



Re: [ceph-users] About Ceph SSD and HDD strategy

2013-10-09 Thread Warren Wang
While in theory this should be true, I'm not finding it to be the case for a 
typical enterprise LSI card with 24 drives attached. We tried a variety of 
ratios and went back to collocated journals on the spinning drives. 

Eagerly awaiting the tiered performance changes to implement a faster tier via 
SSD. 

--
Warren

On Oct 9, 2013, at 5:52 PM, Kyle Bader kyle.ba...@gmail.com wrote:

 Journal on SSD should effectively double your throughput because data will 
 not be written to the same device twice to ensure transactional integrity. 
 Additionally, by placing the OSD journal on an SSD you should see less 
 latency, the disk head no longer has to seek back and forth between the 
 journal and data partitions. For large writes it's not as critical to have a 
 device that supports high IOPs or throughput because large writes are striped 
 across many 4MB rados objects, relatively evenly distributed across the 
 cluster. Small write operations will benefit the most from an OSD data 
 partition with a writeback cache like btier/flashcache because it can absorbs 
 an order of magnitude more IOPs and allow a slower spinning device catch up 
 when there is less activity.
 
 
 On Tue, Oct 8, 2013 at 12:09 AM, Robert van Leeuwen 
 robert.vanleeu...@spilgames.com wrote:
  I tried putting Flashcache on my spindle OSDs using an Intel SSL and it 
  works great.  
  This is getting me read and write SSD caching instead of just write 
  performance on the journal.  
  It should also allow me to protect the OSD journal on the same drive as 
  the OSD data and still get benefits of SSD caching for writes.
 
 Small note that on Red Hat based distro's + Flashcache + XFS:
 There is a major issue (kernel panics) running xfs + flashcache on a 6.4 
 kernel. (anything higher then 2.6.32-279) 
 It should be fixed in kernel 2.6.32-387.el6 which, I assume, will be 6.5 
 which only just entered Beta.
 
 For more info, take a look here:
 https://github.com/facebook/flashcache/issues/113
 
 Since I've hit this issue (thankfully in our dev environment) we are 
 slightly less enthusiastic about running flashcache :(
 It also adds a layer of complexity so I would rather just run the journals 
 on SSD, at least on Redhat.
 I'm not sure about the performance difference of just journals v.s. 
 Flashcache but I'd be happy to read any such comparison :)
 
 Also, if you want to make use of the SSD trim func
 
 P.S. My experience with Flashcache is on Openstack Swift  Nova not Ceph.
 
 
 
 
 
 
 
 -- 
 
 Kyle


[ceph-users] Rados gw upload problems

2013-10-03 Thread Warren Wang
Hi all, I'm having a problem uploading through the Rados GW.  I'm getting
the following error, and searches haven't lead me to a solution.

[Fri Oct 04 04:05:11 2013] [error] [client xxx.xxx.xxx.xxx] chunked
Transfer-Encoding forbidden: /swift/v1/wwang-container/test

FastCGI version:
ii  libapache2-mod-fastcgi
2.4.7~0910052141-1 amd64Apache 2 FastCGI module
for long-running CGI scripts

Auth works properly through keystone.  Getting hung up on this final part.

Thanks for any help,
Warren


Re: [ceph-users] Poor performance with three nodes

2013-10-02 Thread Warren Wang
I agree with Greg that this isn't a great test.  You'll need multiple
clients to push the Ceph cluster, and you have to use oflag=direct if
you're using dd.
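
For example, something along these lines (the device path matches the quoted
test further down; run several of these from different clients to see the
aggregate throughput):

dd if=/dev/zero of=/dev/rbd/data/test1 bs=4M count=2048 oflag=direct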

The OSDs should be individual drives, not part of a RAID set, otherwise
you're just creating extra work, unless you've reduced the number of copies
to 1 in your ceph config.

What I've seen is that a single threaded Ceph client maxes out around 50
MB/s for us, but the overall capacity is much, much higher.

Warren

Warren


On Wed, Oct 2, 2013 at 5:24 PM, Gregory Farnum g...@inktank.com wrote:

 On Wed, Oct 2, 2013 at 1:59 PM, Eric Lee Green eric.lee.gr...@gmail.com
 wrote:
  I have three storage servers that provide NFS and iSCSI services to my
  network, which serve data to four virtual machine compute hosts (two
 ESXi,
  two libvirt/kvm) with several dozen virtual machines . I decided to test
 out
  a Ceph deployment to see whether it could replace iSCSI as the primary
 way
  to provide block stores to my virtual machines, since this would allow
  better redundancy and better distribution of the load across the storage
  servers.
 
  I used ceph version 0.67.3 from RPM's. Because these are live servers
  providing NFS and iSCSI data they aren't a clean slate, so the Ceph
  datastores were created on XFS partitions. Each partition is on a single
  diskgroup (12-disk RAID6), of which there are two on each server, each
  connected to its own 3Gbit/sec SAS channel. The servers are all connected
  together with 10 gigabit Ethernet. The redundancy factor was set to 3
 (three
  copies of each chunk of data) so that a chunk would be guaranteed to
 reside
  on at least two servers (since each server has two chunkstores).
 
  My experience with doing streaming writes via NFS or iSCSI to these
 servers
  is that the limiting factor is the performance of the SAS bus. That is,
 on
  the client side I top out at 240 megabytes per second on writes to a
 single
  disk group, a bit higher on reads, due to the 3 gigabit/sec SAS bus.
 When I
  am exercising both disk groups at once I am maxing out both SAS buses for
  double the performance. The 10 gigabit Ethernet w/9000 MTU apparently has
  plenty of bandwidth to saturate two 3 gigabit SAS buses.
 
  My first test of ceph was to create a 'test1' volume that was around 8
  gigabytes in size (or roughly the size of the root partition of one of my
  virtual machines), then test streaming reads and writes. The test for
  streaming reads and writes was simple:
 
  [root@stack1 ~]# dd if=/dev/zero of=/dev/rbd/data/test1 bs=524288
  dd: error writing ‘/dev/rbd/data/test1’: No space left on device
  16193+0 records in
  16192+0 records out
  8489271296 bytes (8.5 GB) copied, 172.71 s, 49.2 MB/s
 
  [root@stack1 ~]# dd if=/dev/rbd/data/test1 of=/dev/null bs=524288
  16192+0 records in
  16192+0 records out
  8489271296 bytes (8.5 GB) copied, 25.2494 s, 336 MB/s
 
  So:
 
  1) Writes are truly appalling. They are not going at the speed of even a
  single disk drive (my disk drives are capable of streaming approximately
 120
  megabytes per second).
 
  2) Reads are more acceptable. I am getting better throughput than with a
  single SAS channel, as you would expect with reads striped across three
 SAS
  channels. Still, reads are slower than I expected given the speed of my
  infrastructure.
 
  Compared to Amazon EBS, reads appear roughly the same as EBS on an
  IO-enhanced instance, and writes are *much* slower.
 
  What this seems to indicate is either a) inherent Ceph performance issues
  for writes, or b) I have something misconfigured. There's simply too
 much of
  a mismatch between what the underlying hardware does with NFS and iSCSI,
 and
  what it does with Ceph, to consider this to be appropriate performance.
 My
  guess is (b), that I have something misconfigured. Any ideas what I
 should
  look for?

 There's a couple things here:
 1) You aren't accounting for Ceph's journaling. Unlike a system such
 as NFS, Ceph provides *very* strong data integrity guarantees under
 failure conditions, and in order to do so it does full data
 journaling. So, yes, cut your total disk bandwidth in half. (There's
 also a lot of syncing which it manages carefully to reduce the cost,
 but if you had other writes happening via your NFS/iSCSI setups that
 might have been hit by the OSD running a sync on its disk, that could
 be dramatically impacting the perceived throughput.)
 2) Placing an OSD (with its journal) on a RAID-6 is about the worst
 thing you can do for Ceph's performance; it does a lot of small
 flushed-to-disk IOs in the journal in between the full data writes.
 Try some other configuration?
 3) Did you explicitly set your PG counts at any point? They default to
 8, which is entirely too low; given your setup you should have
 400-1000 per pool.
 4) There could have been something wrong/going on with the system;
 though I doubt it. But if you can provide the output of ceph -s
 that'll let us check the basics.

 

Re: [ceph-users] PG distribution scattered

2013-09-19 Thread Warren Wang
Is this safe to enable on a running cluster?

--
Warren

On Sep 19, 2013, at 9:43 AM, Mark Nelson mark.nel...@inktank.com wrote:

 On 09/19/2013 08:36 AM, Niklas Goerke wrote:
 Hi there
 
 I'm currently evaluating ceph and started filling my cluster for the
 first time. After filling it up to about 75%, it reported some OSDs
 being near-full.
 After some evaluation I found that the PGs are not distributed evenly
 over all the osds.
 
 My Setup:
 * Two Hosts with 45 Disks each -- 90 OSDs
 * Only one newly created pool with 4500 PGs and a Replica Size of 2 --
 should be about 100 PGs per OSD
 
 What I found was that one OSD only had 72 PGs, while another had 123 PGs
 [1]. That means that - if I did the math correctly - I can only fill the
 cluster to about 81%, because that's when the first OSD is completely
 full[2].
 
 Does distribution improve if you make a pool with significantly more PGs?
 
 
 I did some experimenting and found, that if I add another pool with 4500
 PGs, each OSD will have exacly doubled the amount of PGs as with one
 pool. So this is not an accident (tried it multiple times). On another
 test-cluster with 4 Hosts and 15 Disks each, the Distribution was
 similarly worse.
 
 This is a bug that causes each pool to more or less be distributed the same 
 way on the same hosts.  We have a fix, but it impacts backwards compatibility 
 so it's off by default.  If you set:
 
 osd pool default flag hashpspool = true
 
 Theoretically that will cause different pools to be distributed more randomly.
 
 
 To me it looks like the rjenkins algorithm is not working as it - in my
 opinion - should be.
 
 Am I doing anything wrong?
 Is this behaviour to be expected?
 Can I do something about it?
 
 
 Thank you very much in advance
 Niklas
 
 
 [1] I built a small script that will parse pgdump and output the amount
 of pgs on each osd: http://pastebin.com/5ZVqhy5M
 [2] I know I should not fill my cluster completely but I'm talking about
 theory and adding a margin only makes it worse.
 


Re: [ceph-users] PG distribution scattered

2013-09-19 Thread Warren Wang
Good timing then. I just fired up the cluster 2 days ago. Thanks. 

--
Warren

On Sep 19, 2013, at 12:34 PM, Gregory Farnum g...@inktank.com wrote:

 It will not lose any of your data. But it will try and move pretty much all 
 of it, which will probably send performance down the toilet.
 -Greg
 
 On Thursday, September 19, 2013, Mark Nelson wrote:
 Honestly I don't remember, but I would be wary if it's not a test system. :)
 
 Mark
 
 On 09/19/2013 11:28 AM, Warren Wang wrote:
 Is this safe to enable on a running cluster?
 
 --
 Warren
 
 
 
 -- 
 Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph-deploy host as admin host

2013-09-18 Thread Warren Wang
Just got done deploying the largest Ceph install I've had yet (9 boxes,
179TB), and I used ceph-deploy, but not without much consternation.  I
have a question before I file a bug report.

Is the expectation that the deploy host will never be used as the admin
host?  I ran into various issues related to this.  For instance, if you run
ceph-deploy to (re)create OSDs, it will pull the bootstrap keyring from
/etc/ceph, if available, instead of from your deployment dir.  However, if you
create a mon host, it assumes the bootstrap keyring must be generated fresh, so
when you gather keys, you end up with a mismatch between the mon and OSD boxes.

Should we not be using admin and deploy on the same box?  I don't see a
good reason why we shouldn't.  I went through the changelog for today's
release, but I don't see anything that addresses this.
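
For reference, the workflow I assumed would be safe is to keep everything in a
dedicated deployment directory and only push config and keys to /etc/ceph
deliberately. A rough sketch from memory (hostnames and devices are made up):

   mkdir ~/my-cluster && cd ~/my-cluster    # ceph.conf and keyrings live here
   ceph-deploy new mon1 mon2 mon3
   ceph-deploy mon create mon1 mon2 mon3
   ceph-deploy gatherkeys mon1              # keys land in the working dir
   ceph-deploy osd create osd1:sdb osd1:sdc
   ceph-deploy admin admin-host             # explicitly pushes conf/keys to /etc/ceph

The trouble seems to start when the deploy box already has keyrings sitting in
/etc/ceph, which is exactly the mismatch described above.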

Thanks,
Warren
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD and Journal Files

2013-09-18 Thread Warren Wang
FWIW, we ran into this same issue and could not get a good enough
SSD-to-spinner ratio, so we decided to simply run the journals on each
(spinning) drive for hosts that have 24 slots.  The problem gets even
worse when we're talking about some of the newer boxes.
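
For what it's worth, with ceph-deploy the difference is just whether you pass a
separate journal device. Hostnames and devices below are made up:

   # journal co-located as a second partition on the data disk:
   ceph-deploy osd create storage01:sdb
   # journal carved out of a shared SSD, when the SSD:spinner ratio allows it:
   ceph-deploy osd create storage01:sdb:sdm1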

Warren


On Wed, Sep 18, 2013 at 1:56 PM, Mike Dawson mike.daw...@cloudapt.com wrote:

 Joseph,

 With properly architected failure domains and replication in a Ceph
 cluster, RAID1 has diminishing returns.

 A well-designed CRUSH map should allow for failures at any level of your
 hierarchy (OSDs, hosts, racks, rows, etc) while protecting the data with a
 configurable number of copies.
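 
 The relevant bit is the placement rule. Roughly, in a decompiled CRUSH map
 (bucket and rule names here are the stock defaults - adjust to your own
 hierarchy), host-level protection looks like:
 
    rule replicated_across_hosts {
            ruleset 1
            type replicated
            min_size 1
            max_size 10
            step take default
            # place each replica under a different host
            step chooseleaf firstn 0 type host
            step emit
    }
 
 Swap host for rack or row once the map actually contains those bucket types.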

 That being said, losing a series of six OSDs is certainly a hassle, and
 journals on a RAID1 set could help prevent that scenario.

 But where do you stop? 3 monitors, 5, 7? RAID1 for OSDs, too? 3x
 replication, 4x, 10x? I suppose each operator gets to decide how far to
 chase the diminishing returns.


 Thanks,

 Mike Dawson
 Co-Founder & Director of Cloud Architecture
 Cloudapt LLC

 On 9/18/2013 1:27 PM, Gruher, Joseph R wrote:



  -Original Message-
 From: ceph-users-boun...@lists.ceph.com [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Mike Dawson

 you need to understand losing an SSD will cause
 the loss of ALL of the OSDs which had their journal on the failed SSD.

 First, you probably don't want RAID1 for the journal SSDs. It isn't
 particularly
 needed for resiliency and certainly isn't beneficial from a throughput
 perspective.


 Sorry, can you clarify this further for me?  If losing the SSD would
 cause losing all the OSDs journaling on it, why would you not want to RAID
 it?


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com