[ceph-users] `ceph health` == HEALTH_GOOD_ENOUGH?

2017-02-19 Thread Tim Serong
Hi All,

Pretend I'm about to upgrade from one Ceph release to another.  I want
to know that the cluster is healthy enough to sanely upgrade (MONs
quorate, no OSDs actually on fire), but don't care about HEALTH_WARN
issues like "too many PGs per OSD" or "crush map has legacy tunables".

In this case, if I run `ceph health` and it says HEALTH_OK, I'm good,
and if it says HEALTH_ERR, I know things are bad.  But if it says
HEALTH_WARN, I might be OK, or I might not.

Is there any way to get a status like HEALTH_GOOD_ENOUGH? ;-)  Think:
some `ceph health` invocation a machine can parse to know whether or not
to allow upgrading the cluster.
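
What I'm after is roughly the check sketched below (a minimal Python sketch of
what I'd like to script, assuming the one-line `ceph health` output of a status
word followed by semicolon-separated summaries; the TOLERATED list is only an
example):

#!/usr/bin/env python
# Minimal sketch: exit 0 if the cluster is "healthy enough" to upgrade.
# Assumes the one-line `ceph health` output of the form
# "HEALTH_WARN summary one; summary two"; the TOLERATED list is an example.
import subprocess
import sys

TOLERATED = [
    "too many PGs per OSD",
    "crush map has legacy tunables",
]

def good_enough():
    out = subprocess.check_output(["ceph", "health"]).decode("utf-8").strip()
    status, _, summary = out.partition(" ")
    if status == "HEALTH_OK":
        return True
    if status != "HEALTH_WARN":
        return False  # HEALTH_ERR (or anything unexpected) blocks the upgrade
    warnings = [w.strip() for w in summary.split(";") if w.strip()]
    # Every warning must match one of the tolerated patterns.
    return all(any(pat in w for pat in TOLERATED) for w in warnings)

if __name__ == "__main__":
    sys.exit(0 if good_enough() else 1)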

Thanks,

Tim
-- 
Tim Serong
Senior Clustering Engineer
SUSE
tser...@suse.com


Re: [ceph-users] kraken-bluestore 11.2.0 memory leak issue

2017-02-19 Thread Jay Linux
Hello Shinobu,

We have already raised a ticket for this issue. FYI -
http://tracker.ceph.com/issues/18924

Thanks
Jayaram


On Mon, Feb 20, 2017 at 12:36 AM, Shinobu Kinjo  wrote:

> Please open ticket at http://tracker.ceph.com, if you haven't yet.
>
> On Thu, Feb 16, 2017 at 6:07 PM, Muthusamy Muthiah
>  wrote:
> > Hi Wido,
> >
> > Thanks for the information; please let us know if this turns out to be a bug.
> > As a workaround we will go with a small bluestore_cache_size of 100MB.
> >
> > Thanks,
> > Muthu
> >
> > On 16 February 2017 at 14:04, Wido den Hollander  wrote:
> >>
> >>
> >> > On 16 February 2017 at 7:19, Muthusamy Muthiah wrote:
> >> >
> >> >
> >> > Thanks Ilya Letkowski for the information, we will change this value
> >> > accordingly.
> >> >
> >>
> >> What I understand from yesterday's performance meeting is that this seems
> >> like a bug. Lowering this buffer reduces memory usage, but the root cause
> >> seems to be memory not being freed: a few bytes of a larger allocation
> >> remain allocated, which prevents the whole buffer from being freed.
> >>
> >> Tried:
> >>
> >> debug_mempools = true
> >>
> >> $ ceph daemon osd.X dump_mempools
> >>
> >> Might want to view the YouTube video of yesterday when it's online:
> >> https://www.youtube.com/channel/UCno-Fry25FJ7B4RycCxOtfw/videos
> >>
> >> Wido
> >>
> >> > Thanks,
> >> > Muthu
> >> >
> >> > On 15 February 2017 at 17:03, Ilya Letkowski  >
> >> > wrote:
> >> >
> >> > > Hi, Muthusamy Muthiah
> >> > >
> >> > > I'm not totally sure that this is a memory leak.
> >> > > We had the same problems with bluestore on ceph v11.2.0.
> >> > > Reducing the bluestore cache helped us to solve it and stabilize OSD
> >> > > memory consumption at the 3GB level.
> >> > >
> >> > > Perhaps this will help you:
> >> > >
> >> > > bluestore_cache_size = 104857600
> >> > >
> >> > >
> >> > >
> >> > > On Tue, Feb 14, 2017 at 11:52 AM, Muthusamy Muthiah <
> >> > > muthiah.muthus...@gmail.com> wrote:
> >> > >
> >> > >> Hi All,
> >> > >>
> >> > >> On all nodes of our 5 node cluster with ceph 11.2.0 we encounter memory
> >> > >> leak issues.
> >> > >>
> >> > >> Cluster details: 5 nodes with 24/68 disks per node, EC 4+1, RHEL 7.2
> >> > >>
> >> > >> Some traces using sar are below, and the memory utilisation graph is
> >> > >> attached.
> >> > >>
> >> > >> (16:54:42)[cn2.c1 sa] # sar -r
> >> > >> 07:50:01 kbmemfree kbmemused %memused kbbuffers kbcached kbcommit %commit kbactive kbinact kbdirty
> >> > >> 10:20:01 32077264 132754368 80.54 16176 3040244 77767024 47.18 51991692 2676468 260
> >> > >> 10:30:01 32208384 132623248 80.46 16176 3048536 77832312 47.22 51851512 2684552 12
> >> > >> 10:40:01 32067244 132764388 80.55 16176 3059076 77832316 47.22 51983332 2694708 264
> >> > >> 10:50:01 30626144 134205488 81.42 16176 3064340 78177232 47.43 53414144 2693712 4
> >> > >> 11:00:01 28927656 135903976 82.45 16176 3074064 78958568 47.90 55114284 2702892 12
> >> > >> 11:10:01 27158548 137673084 83.52 16176 3080600 80553936 48.87 56873664 2708904 12
> >> > >> 11:20:01 2646 138376076 83.95 16176 3080436 81991036 49.74 57570280 2708500 8
> >> > >> 11:30:01 26002252 138829380 84.22 16176 3090556 82223840 49.88 58015048 2718036 16
> >> > >> 11:40:01 25965924 138865708 84.25 16176 3089708 83734584 50.80 58049980 2716740 12
> >> > >> 11:50:01 26142888 138688744 84.14 16176 3089544 83800100 50.84 57869628 2715400 16
> >> > >>
> >> > >> ...
> >> > >> ...
> >> > >>
> >> > >> In the attached graph there is an increase in memory utilisation by
> >> > >> ceph-osd during the soak test. When it reaches the system limit of 128GB
> >> > >> RAM, we can see the dmesg logs below related to running out of memory.
> >> > >> OSD.3 was killed due to Out of Memory and started again.
> >> > >>
> >> > >> [Tue Feb 14 03:51:02 2017] *tp_osd_tp invoked oom-killer:
> >> > >> gfp_mask=0x280da, order=0, oom_score_adj=0*
> >> > >> [Tue Feb 14 03:51:02 2017] tp_osd_tp cpuset=/ mems_allowed=0-1
> >> > >> [Tue Feb 14 03:51:02 2017] CPU: 20 PID: 11864 Comm: tp_osd_tp Not
> >> > >> tainted
> >> > >> 3.10.0-327.13.1.el7.x86_64 #1
> >> > >> [Tue Feb 14 03:51:02 2017] Hardware name: HP ProLiant XL420
> >> > >> Gen9/ProLiant
> >> > >> XL420 Gen9, BIOS U19 09/12/2016
> >> > >> [Tue Feb 14 03:51:02 2017]  8819ccd7a280 30e84036
> >> > >> 881fa58f7528 816356f4
> >> > >> [Tue Feb 14 03:51:02 2017]  881fa58f75b8 8163068f
> >> > >> 881fa3478360 881fa3478378
> >> > >> [Tue Feb 14 03:51:02 2017]  881fa58f75e8 8819ccd7a280
> >> > >> 0001 0001f65f
> >> > >> [Tue Feb 14 

Re: [ceph-users] Jewel + kernel 4.4 Massive performance regression (-50%)

2017-02-19 Thread Christian Balzer

Hello,

On Thu, 16 Feb 2017 17:51:18 +0200 Kostis Fardelas wrote:

> Hello,
> we are on Debian Jessie and Hammer 0.94.9 and recently we decided to
> upgrade our kernel from 3.16 to 4.9 (jessie-backports). We experience
> the same regression but with some shiny points

Same OS, kernels and Ceph version here, but I can't reproduce this for
the most part, probably because of other differences.

4 nodes, 
2 with 4 HDD based and SSD journal OSDs,
2 with 4 SSD based OSDs (cache-tier),
replication 2.
Half of the nodes/OSDs are using XFS, the other half EXT4.

> -- ceph tell osd average across the cluster --
> 3.16.39-1: 204MB/s
> 4.9.0-0: 158MB/s
> 
The "ceph osd tell bench" is really way too imprecise and all over the
place for me, but the average of the HDD based OSDs doesn't differ
noticeably.
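
If you want to smooth out that noise, something along these lines does the
job for me; it's a rough sketch only, and it assumes the bench command's JSON
output ("-f json") contains a "bytes_per_sec" field, so adjust parse_bench()
to whatever your version actually prints:
---
#!/usr/bin/env python
# Rough sketch: average several "ceph tell osd.N bench" runs per OSD to
# smooth out the noise. Assumes "-f json" output with a "bytes_per_sec"
# field -- adjust parse_bench() if your version prints something else.
import json
import subprocess

OSDS = range(0, 8)   # the OSD ids to test (adjust to your cluster)
RUNS = 5

def parse_bench(raw):
    return float(json.loads(raw)["bytes_per_sec"])

def bench(osd_id):
    out = subprocess.check_output(
        ["ceph", "tell", "osd.%d" % osd_id, "bench", "-f", "json"])
    return parse_bench(out.decode("utf-8"))

for osd_id in OSDS:
    results = [bench(osd_id) for _ in range(RUNS)]
    avg = sum(results) / len(results)
    print("osd.%d: %.1f MB/s" % (osd_id, avg / (1024 * 1024)))
---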

> -- 1 rados bench client 4K 2048 threads avg IOPS --
> 3.16.39-1: 1604
> 4.9.0-0: 451
> 
I'd think 32-64 threads will do nicely.
As discussed on the ML before, this test is also not particularly realistic
when it comes to actual client performance, but still, a data point is a
data point. 

And incidentally this is the only test where I can clearly see something
similar, with 64 threads and 4K:

3400 IOPS 3.16
2600 IOPS 4.9

So where you are seeing a 70% reduction, I'm seeing "only" 25% less.

Which is of course a perfect match for my XFS vs. EXT4 OSD ratio.

Thus I turned off the XFS node and ran the test again with just the EXT4
node active. And this time 4.9 came out (slightly) ahead:

3645 IOPS 3.16
3970 IOPS 4.9

So this looks like a regression when it comes to Ceph interacting with XFS.
Probably aggravated by how the "bench" tests work (lots of object
creation), as opposed to normal usage with existing objects as tested
below.

> -- 1 rados bench client 64K 512 threads avg BW MB/s--
> 3.16.39-1: 78
> 4.9.0-0: 31
>
With the default 4MB block size, no relevant difference here again.
But then again, this creates only a few objects compared to 4KB.
 
I've run fio (4M write, 4K write, 4k randwrite) from within a VM against
the cluster with both kernel versions, no noticeable difference there
either.

Just to compare this to the rados bench tests above:
---
root@tvm-01:~# fio --size=18G --ioengine=libaio --invalidate=1 --direct=1 
--numjobs=1 --rw=write --name=fiojob --blocksize=4M --iodepth=64

fiojob: (g=0): rw=write, bs=4M-4M/4M-4M/4M-4M, ioengine=libaio, iodepth=64
fio-2.1.11
  write: io=18432MB, bw=359772KB/s, iops=87, runt= 52462msec
---
OSD processes are at about 35% CPU usage (100% = 1 core), SSDs are at about
85% utilization. 

---
root@tvm-01:~# fio --size=4G --ioengine=libaio --invalidate=1 --direct=1 
--numjobs=1 --rw=write --name=fiojob --blocksize=4K --iodepth=64

fiojob: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=64
fio-2.1.11
  write: io=4096.0MB, bw=241984KB/s, iops=60495, runt= 17333msec
---
OSD processes are at about 20% CPU  usage, SSDs are at 50%
utilization.

---
root@tvm-01:~# fio --size=2G --ioengine=libaio --invalidate=1 --direct=1 
--numjobs=1 --rw=randwrite --name=fiojob --blocksize=4K --iodepth=64

fiojob: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=64
fio-2.1.11
  write: io=2048.0MB, bw=36086KB/s, iops=9021, runt= 58115msec
---
OSD processes are at 300% (and likely wanting more) CPU usage, SSDs at
about 25% utilization

Christian

> The shiny points are on the following tests:
> 1 rados bench client 4K 512 threads avg IOPS
> 1 rados bench client 64K 2048 threads avg BW MB/s
> 
> where machines with kernel 4.9 seem to perform slightly better. The
> overall impression though is that there is a serious regression or
> something that should be tuned to get the same performance out of the
> cluster.
> 
> Our demo cluster is 4 nodes X 12 OSDs, separate journal on SSD,
> firefly tunables and everything else default considering our Ceph
> installation and Debian OS. Each rados bench was run 5 times to get an
> average, and caches were dropped before each test.
> 
> I wonder if anyone has discovered the culprit so far? Any hints from
> others to focus our investigation on?
> 
> Best regards,
> Kostis
> 
> On 19 December 2016 at 17:17, Yoann Moulin  wrote:
> > Hello,
> >
> > Finally, I found time to do some new benchmarks with the latest jewel 
> > release (10.2.5) on 4 nodes. Each node has 10 OSDs.
> >
> > I ran "ceph tell osd.* bench" twice over 40 OSDs; here is the average speed:
> >
> > 4.2.0-42-generic  97.45 MB/s
> > 4.4.0-53-generic  55.73 MB/s
> > 4.8.15-040815-generic 62.41 MB/s
> > 4.9.0-040900-generic  60.88 MB/s
> >
> > I have the same behaviour with at least 35 to 40% performance drop between 
> > kernel 4.2 and kernel > 4.4
> >
> > I can do further benches if needed.
> >
> > Yoann
> >
> > On 26/07/2016 at 09:09, Lomayani S. Laizer wrote:
> >> Hello,
> >> do you have journal on disk too ?
> >>
> >> Yes, I am having the journal on the same hard disk.
> >>
> >> ok and could you do bench with kernel 4.2 ? just to see 

Re: [ceph-users] Rbd export-diff bug? rbd export-diff generates different incremental files

2017-02-19 Thread Zhongyan Gu
BTW, we use the hammer version with the following fix; the issue was also
reported by us during earlier backup testing.
https://github.com/ceph/ceph/pull/12218/files
librbd: diffs to clone's first snapshot should include parent diffs


Zhongyan

On Mon, Feb 20, 2017 at 11:13 AM, Zhongyan Gu  wrote:

>
> Hi Sage and Jason,
>
> My company is building a backup system based on the rbd export-diff and
> import-diff commands.
>
> However, in recent tests we found some strange behavior in the export-diff
> command. Long story short: repeatedly executing "rbd export-diff
> --from-snap snap1 image@snap2 - | md5sum" sometimes returns different md5
> values.
>
> The details are:
>
> We used two ceph rbd clusters: A for online vms usage and B for backup
> usage.
>
> A specific vm image is cloned from a parent image. Initially our backup
> system does a full backup with the rbd export/import commands, and then
> every day we do an incremental backup with rbd export-diff/import-diff.
>
> To make sure of data consistency, we also do an md5 comparison of the
> online vm image@snapN and the backup vm image@snapN.
>
> Our tests found that sometimes, for some vm images, the md5 check fails:
> the online vm image@snapN doesn't match the backup vm image@snapN.
>
> To narrow down this issue, we manually generated the incremental file with
> rbd export-diff between the specific snaps and found its md5 didn't match
> the file generated by our backup scripts.
>
> Comparing those two binary files, we found only a small difference: some
> bytes are not the same.
>
> Could this be an export-diff bug? As far as I know, once two snaps have
> been created, the diff between them should always be the same. So why does
> export-diff not work as expected and return different md5 checksums? Is
> there some corner case not well considered, or has anyone else had the
> same experience? BTW, we ran some fio IO workloads in the vms for 24 hours
> during the backup test.
>
>
>
> Thanks,
>
> Zhongyan
>


[ceph-users] Rbd export-diff bug? rbd export-diff generates different incremental files

2017-02-19 Thread Zhongyan Gu
Hi Sage and Jason,

My company is building a backup system based on the rbd export-diff and
import-diff commands.

However, in recent tests we found some strange behavior in the export-diff
command. Long story short: repeatedly executing "rbd export-diff --from-snap
snap1 image@snap2 - | md5sum" sometimes returns different md5 values.

The details are:

We used two ceph rbd clusters: A for online vms usage and B for backup
usage.

A specific vm image is cloned from a parent image. Initially our backup
system does a full backup with the rbd export/import commands, and then every
day we do an incremental backup with rbd export-diff/import-diff.

To make sure of data consistency, we also do an md5 comparison of the online
vm image@snapN and the backup vm image@snapN.

Our tests found that sometimes, for some vm images, the md5 check fails: the
online vm image@snapN doesn't match the backup vm image@snapN.

To narrow down this issue, we manually generated the incremental file with
rbd export-diff between the specific snaps and found its md5 didn't match the
file generated by our backup scripts.

Comparing those two binary files, we found only a small difference: some
bytes are not the same.

Could this be an export-diff bug? As far as I know, once two snaps have been
created, the diff between them should always be the same. So why does
export-diff not work as expected and return different md5 checksums? Is there
some corner case not well considered, or has anyone else had the same
experience? BTW, we ran some fio IO workloads in the vms for 24 hours during
the backup test.
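
To reproduce, we basically loop the export and compare checksums, roughly
like the sketch below (pool, image and snapshot names are placeholders):

#!/usr/bin/env python
# Sketch of our repeatability check: run "rbd export-diff" between the same
# two snapshots several times and compare the md5 of the output. Pool,
# image and snapshot names below are placeholders.
import hashlib
import subprocess

POOL, IMAGE = "rbd", "vm-image"
FROM_SNAP, TO_SNAP = "snap1", "snap2"
RUNS = 5

def export_diff_md5():
    cmd = ["rbd", "export-diff", "--from-snap", FROM_SNAP,
           "%s/%s@%s" % (POOL, IMAGE, TO_SNAP), "-"]
    return hashlib.md5(subprocess.check_output(cmd)).hexdigest()

sums = set(export_diff_md5() for _ in range(RUNS))
if len(sums) == 1:
    print("stable md5: %s" % sums.pop())
else:
    print("MISMATCH: %d different md5 sums: %s" % (len(sums), sorted(sums)))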



Thanks,

Zhongyan


Re: [ceph-users] Passing LUA script via python rados execute

2017-02-19 Thread Patrick Donnelly
On Sat, Feb 18, 2017 at 2:55 PM, Noah Watkins  wrote:
> The least intrusive solution is to simply change the sandbox to allow
> the standard file system module loading function as expected. Then any
> user would need to make sure that every OSD had consistent versions of
> dependencies installed using something like LuaRocks. This is simple,
> but could make debugging and deployment a major headache.

A locked-down require which doesn't load C bindings (i.e. loads only
.lua files) would probably be alright.

> A more ambitious version would be to create an interface for users to
> upload scripts and dependencies into objects, and support referencing
> those objects as standard dependencies in Lua scripts as if they were
> standard modules on the file system. Each OSD could then cache scripts
> and dependencies, allowing applications to use references to scripts
> instead of sending a script with every request.

This is very doable. I imagine we'd just put all of the Lua modules in
a flattened hierarchy under a RADOS namespace? The potentially
annoying nit in this is writing some kind of mechanism for installing
a Lua module tree into RADOS. Users would install locally and then
upload the tree through some tool.
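
Roughly like the sketch below with python-rados (the pool name, the
namespace, and the dots-for-slashes object naming are just assumptions for
illustration, not an agreed-on convention):

#!/usr/bin/env python
# Rough sketch of the "upload a local Lua module tree into RADOS" tool.
# Pool name, namespace and the module-path -> object-name mapping are
# illustrative assumptions only.
import os
import rados

LUA_TREE = "/usr/local/share/lua/5.3"   # local module tree to upload
POOL = "lua_modules"
NAMESPACE = "lua"

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
try:
    ioctx = cluster.open_ioctx(POOL)
    ioctx.set_namespace(NAMESPACE)
    for root, _, files in os.walk(LUA_TREE):
        for name in files:
            if not name.endswith(".lua"):
                continue
            path = os.path.join(root, name)
            # foo/bar.lua -> object "foo.bar", matching require("foo.bar")
            rel = os.path.relpath(path, LUA_TREE)
            objname = rel[:-len(".lua")].replace(os.sep, ".")
            with open(path, "rb") as f:
                ioctx.write_full(objname, f.read())
            print("uploaded %s as %s" % (rel, objname))
    ioctx.close()
finally:
    cluster.shutdown()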

-- 
Patrick Donnelly


Re: [ceph-users] kraken-bluestore 11.2.0 memory leak issue

2017-02-19 Thread Shinobu Kinjo
Please open ticket at http://tracker.ceph.com, if you haven't yet.

On Thu, Feb 16, 2017 at 6:07 PM, Muthusamy Muthiah
 wrote:
> Hi Wido,
>
> Thanks for the information; please let us know if this turns out to be a bug.
> As a workaround we will go with a small bluestore_cache_size of 100MB.
>
> Thanks,
> Muthu
>
> On 16 February 2017 at 14:04, Wido den Hollander  wrote:
>>
>>
>> > On 16 February 2017 at 7:19, Muthusamy Muthiah wrote:
>> >
>> >
>> > Thanks Ilya Letkowski for the information, we will change this value
>> > accordingly.
>> >
>>
>> What I understand from yesterday's performance meeting is that this seems
>> like a bug. Lowering this buffer reduces memory usage, but the root cause
>> seems to be memory not being freed: a few bytes of a larger allocation
>> remain allocated, which prevents the whole buffer from being freed.
>>
>> Tried:
>>
>> debug_mempools = true
>>
>> $ ceph daemon osd.X dump_mempools
>>
>> Might want to view the YouTube video of yesterday when it's online:
>> https://www.youtube.com/channel/UCno-Fry25FJ7B4RycCxOtfw/videos
>>
>> Wido
>>
>> > Thanks,
>> > Muthu
>> >
>> > On 15 February 2017 at 17:03, Ilya Letkowski 
>> > wrote:
>> >
>> > > Hi, Muthusamy Muthiah
>> > >
>> > > I'm not totally sure that this is a memory leak.
>> > > We had the same problems with bluestore on ceph v11.2.0.
>> > > Reducing the bluestore cache helped us to solve it and stabilize OSD
>> > > memory consumption at the 3GB level.
>> > >
>> > > Perhaps this will help you:
>> > >
>> > > bluestore_cache_size = 104857600
>> > >
>> > >
>> > >
>> > > On Tue, Feb 14, 2017 at 11:52 AM, Muthusamy Muthiah <
>> > > muthiah.muthus...@gmail.com> wrote:
>> > >
>> > >> Hi All,
>> > >>
>> > >> On all nodes of our 5 node cluster with ceph 11.2.0 we encounter memory
>> > >> leak issues.
>> > >>
>> > >> Cluster details: 5 nodes with 24/68 disks per node, EC 4+1, RHEL 7.2
>> > >>
>> > >> Some traces using sar are below, and the memory utilisation graph is
>> > >> attached.
>> > >>
>> > >> (16:54:42)[cn2.c1 sa] # sar -r
>> > >> 07:50:01 kbmemfree kbmemused %memused kbbuffers kbcached kbcommit %commit kbactive kbinact kbdirty
>> > >> 10:20:01 32077264 132754368 80.54 16176 3040244 77767024 47.18 51991692 2676468 260
>> > >> 10:30:01 32208384 132623248 80.46 16176 3048536 77832312 47.22 51851512 2684552 12
>> > >> 10:40:01 32067244 132764388 80.55 16176 3059076 77832316 47.22 51983332 2694708 264
>> > >> 10:50:01 30626144 134205488 81.42 16176 3064340 78177232 47.43 53414144 2693712 4
>> > >> 11:00:01 28927656 135903976 82.45 16176 3074064 78958568 47.90 55114284 2702892 12
>> > >> 11:10:01 27158548 137673084 83.52 16176 3080600 80553936 48.87 56873664 2708904 12
>> > >> 11:20:01 2646 138376076 83.95 16176 3080436 81991036 49.74 57570280 2708500 8
>> > >> 11:30:01 26002252 138829380 84.22 16176 3090556 82223840 49.88 58015048 2718036 16
>> > >> 11:40:01 25965924 138865708 84.25 16176 3089708 83734584 50.80 58049980 2716740 12
>> > >> 11:50:01 26142888 138688744 84.14 16176 3089544 83800100 50.84 57869628 2715400 16
>> > >>
>> > >> ...
>> > >> ...
>> > >>
>> > >> In the attached graph there is an increase in memory utilisation by
>> > >> ceph-osd during the soak test. When it reaches the system limit of 128GB
>> > >> RAM, we can see the dmesg logs below related to running out of memory.
>> > >> OSD.3 was killed due to Out of Memory and started again.
>> > >>
>> > >> [Tue Feb 14 03:51:02 2017] *tp_osd_tp invoked oom-killer:
>> > >> gfp_mask=0x280da, order=0, oom_score_adj=0*
>> > >> [Tue Feb 14 03:51:02 2017] tp_osd_tp cpuset=/ mems_allowed=0-1
>> > >> [Tue Feb 14 03:51:02 2017] CPU: 20 PID: 11864 Comm: tp_osd_tp Not
>> > >> tainted
>> > >> 3.10.0-327.13.1.el7.x86_64 #1
>> > >> [Tue Feb 14 03:51:02 2017] Hardware name: HP ProLiant XL420
>> > >> Gen9/ProLiant
>> > >> XL420 Gen9, BIOS U19 09/12/2016
>> > >> [Tue Feb 14 03:51:02 2017]  8819ccd7a280 30e84036
>> > >> 881fa58f7528 816356f4
>> > >> [Tue Feb 14 03:51:02 2017]  881fa58f75b8 8163068f
>> > >> 881fa3478360 881fa3478378
>> > >> [Tue Feb 14 03:51:02 2017]  881fa58f75e8 8819ccd7a280
>> > >> 0001 0001f65f
>> > >> [Tue Feb 14 03:51:02 2017] Call Trace:
>> > >> [Tue Feb 14 03:51:02 2017]  [] dump_stack+0x19/0x1b
>> > >> [Tue Feb 14 03:51:02 2017]  []
>> > >> dump_header+0x8e/0x214
>> > >> [Tue Feb 14 03:51:02 2017]  []
>> > >> oom_kill_process+0x24e/0x3b0
>> > >> [Tue Feb 14 03:51:02 2017]  [] ?
>> > >> find_lock_task_mm+0x56/0xc0
>> > >> [Tue Feb 14 03:51:02 2017]  []
>> > >> *out_of_memory+0x4b6/0x4f0*
>> > >> [Tue Feb 14 03:51:02 2017]  []
>> > >> __alloc_pages_nodemask+0xa95/0xb90
>> > >> [Tue Feb 

Re: [ceph-users] Experience with 5k RPM/archive HDDs

2017-02-19 Thread Wido den Hollander

> On 18 February 2017 at 17:03, rick stehno wrote:
> 
> 
> I work for Seagate and have done over a hundred tests using 8TB SMR disks
> in a cluster. Whether an SMR HDD is the best choice depends entirely on your
> access pattern. Remember SMR HDDs don't perform well doing random writes,
> but are excellent for reads and sequential writes.
> In many of my tests I added an SSD or PCIe flash card to hold the journals,
> and SMR performed better than a typical CMR disk and was overall cheaper
> than using all CMR HDDs. You can also use some type of caching, like a Ceph
> cache tier or other caching, with very good results.
> By placing the journals on flash or adopting some type of caching you
> eliminate the double writes to the SMR HDD and performance should be fine.
> I have test results if you would like to see them.

I am really keen on seeing those numbers. The blogpost ( 
https://blog.widodh.nl/2017/02/do-not-use-smr-disks-with-ceph/ ) I wrote is 
based on two occasions where people bought 6TB and 8TB Seagate SMR disks and 
used them in Ceph.

One use-case was with an application which would write natively to RADOS and
the other with CephFS.

On both occasions the journals were on SSD, but the backing disk would just be
saturated very easily. Ceph still does random writes on the disk for updating
things like PG logs, writing new OSDMaps, etc.

A large sequential write into Ceph might be split up by either CephFS or RBD
into smaller writes to various RADOS objects.

I haven't seen a use-case where SMR disks perform 'OK' at all with Ceph. That's
why my advice is still to stay away from those disks for Ceph.

In both cases my customers had to spend a lot of money on buying new disks to
make it work. The first case was actually somebody who bought 1000 SMR disks
and then found out they didn't work with Ceph.

Wido 

> 
> Rick 
> Sent from my iPhone, please excuse any typing errors.
> 
> > On Feb 17, 2017, at 8:49 PM, Mike Miller  wrote:
> > 
> > Hi,
> > 
> > don't go there, we tried this with SMR drives, which will slow down to 
> > somewhere around 2-3 IOPS during backfilling/recovery and that renders the 
> > cluster useless for client IO. Things might change in the future, but for 
> > now, I would strongly recommend against SMR.
> > 
> > Go for normal SATA drives with only slightly higher price/capacity ratios.
> > 
> > - mike
> > 
> >> On 2/3/17 2:46 PM, Stillwell, Bryan J wrote:
> >> On 2/3/17, 3:23 AM, "ceph-users on behalf of Wido den Hollander"
> >>  wrote:
> >>> 
>  On 3 February 2017 at 11:03, Maxime Guyot wrote:
>  
>  
>  Hi,
>  
>  Interesting feedback!
>  
>   > In my opinion the SMR can be used exclusively for the RGW.
>   > Unless it's something like a backup/archive cluster or pool with
>  little to none concurrent R/W access, you're likely to run out of IOPS
>  (again) long before filling these monsters up.
>  
>  That's exactly the use case I am considering those archive HDDs for:
>  something like AWS Glacier, a form of offsite backup probably via
>  radosgw. The classic Seagate enterprise class HDD provide "too much"
>  performance for this use case, I could live with 1/4 of the performance
>  for that price point.
>  
> >>> 
> >>> If you go down that route I suggest that you make a mixed cluster for RGW.
> >>> 
> >>> A (small) set of OSDs running on top of proper SSDs, e.g. Samsung SM863 or
> >>> PM863 or an Intel DC series.
> >>> 
> >>> All pools by default should go to those OSDs.
> >>> 
> >>> Only the RGW buckets data pool should go to the big SMR drives. However,
> >>> again, expect very, very low performance of those disks.
> >> One of the other concerns you should think about is recovery time when one
> >> of these drives fails.  The more OSDs you have, the less this becomes an
> >> issue, but on a small cluster it might take over a day to fully recover
> >> from an OSD failure.  Which is a decent amount of time to have degraded
> >> PGs.
> >> Bryan