Re: [ceph-users] Experience with 5k RPM/archive HDDs

2017-02-20 Thread Mike Miller

Hi,

again, as I said, in normal operation everything is fine with SMR. They 
perform well, in particular for large sequential writes, because of the 
on-platter cache (20 GB, I think). All tests we have done were with good 
SSDs for the OSD cache.


Things blow up during backfill / recovery because the SMR disks saturate 
and then slow down to some 2-3 IOPS.


Cache tiering will not work either, because if one or more of the SMR 
disks in the backend fail, backfilling / recovery will again slow things 
down to almost no client I/O.


And yes, we tried all sorts of throttling mechanisms during backfilling 
/ recovery.
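For reference, the throttles we experimented with were roughly of this kind 
(a sketch only; the values are examples, not a recommendation):

[osd]
osd max backfills = 1
osd recovery max active = 1
osd recovery op priority = 1
osd client op priority = 63

# or at runtime, e.g.:
ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'

Even with everything dialed down like this, the SMR disks still saturated.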


In all cases we tested, the cluster is useless from the client side 
during backfilling / recovery.


- mike

On 2/19/17 9:54 AM, Wido den Hollander wrote:



On 18 February 2017 at 17:03, rick stehno <rs3...@me.com> wrote:


I work for Seagate and have done over a hundred tests using SMR 8TB disks in 
a cluster. Whether SMR HDDs are the best choice depends entirely on your access 
pattern. Remember that SMR HDDs don't perform well doing random writes, but are 
excellent for reads and sequential writes.
In many tests I added an SSD or PCIe flash card to hold the journals, and SMR 
performed better than a typical CMR disk while being cheaper overall than using 
all CMR HDDs. You can also use some type of caching, like a Ceph cache tier or 
other caching, with very good results.
By placing the journals on flash or adopting some type of caching you 
eliminate the double writes to the SMR HDDs, and performance should be fine. I 
have test results if you would like to see them.
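(For illustration, putting the journal on a separate flash device at OSD 
creation time is just the usual data:journal syntax; host and device names 
below are made up:

ceph-deploy osd prepare node1:sdb:/dev/nvme0n1

ceph-disk then creates a journal partition on the flash device and the 
OSD's journal symlink points at it.)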


I am really keen on seeing those numbers. The blogpost ( 
https://blog.widodh.nl/2017/02/do-not-use-smr-disks-with-ceph/ ) I wrote is 
based on two occasions where people bought 6TB and 8TB Seagate SMR disks and 
used them in Ceph.

One use case was an application that writes natively to RADOS, the other 
was CephFS.

On both occasions the journals were on SSD, but the backing disk would just be 
saturated very easily. Ceph still does random writes on the disk for updating 
things like PG logs, writing new OSDMaps, etc.

A large sequential write into Ceph may be split up by either CephFS or RBD 
into smaller writes to various RADOS objects.
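As a rough illustration (assuming the default 4 MB object size, which 'rbd 
info <image>' shows as order 22):

  1 GiB sequential client write / 4 MiB per RADOS object = 256 object writes,

each placed independently by CRUSH, on top of the PG log / OSDMap bookkeeping 
mentioned above.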

I haven't seen a use case where SMR disks perform 'OK' at all with Ceph. That's 
why my advice is still to stay away from those disks for Ceph.

In both cases my customers had to spend a lot of money on buying new disks to 
make it work. The first case was actually somebody who bought 1000 SMR disks 
and then found out they didn't work with Ceph.

Wido



Rick
Sent from my iPhone, please excuse any typing errors.


On Feb 17, 2017, at 8:49 PM, Mike Miller <millermike...@gmail.com> wrote:

Hi,

don't go there, we tried this with SMR drives, which will slow down to 
somewhere around 2-3 IOPS during backfilling/recovery and that renders the 
cluster useless for client IO. Things might change in the future, but for now, 
I would strongly recommend against SMR.

Go for normal SATA drives with only slightly higher price/capacity ratios.

- mike


On 2/3/17 2:46 PM, Stillwell, Bryan J wrote:
On 2/3/17, 3:23 AM, "ceph-users on behalf of Wido den Hollander"
<ceph-users-boun...@lists.ceph.com on behalf of w...@42on.com> wrote:



On 3 February 2017 at 11:03, Maxime Guyot
<maxime.gu...@elits.com> wrote:


Hi,

Interesting feedback!

  > In my opinion the SMR can be used exclusively for the RGW.
  > Unless it's something like a backup/archive cluster or pool with
little to none concurrent R/W access, you're likely to run out of IOPS
(again) long before filling these monsters up.

That's exactly the use case I am considering those archive HDDs for:
something like AWS Glacier, a form of offsite backup, probably via
radosgw. The classic Seagate enterprise-class HDDs provide "too much"
performance for this use case; I could live with 1/4 of the performance
at that price point.



If you go down that route I suggest that you make a mixed cluster for RGW.

A (small) set of OSDs running on top of proper SSDs, e.g. Samsung SM863 or
PM863, or an Intel DC series.

All pools by default should go to those OSDs.

Only the RGW bucket data pool should go to the big SMR drives. However,
again, expect very, very low performance from those disks.
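As a very rough sketch of what that split can look like with pre-Luminous 
CRUSH rules (bucket, rule and pool names here are placeholders; the exact 
RGW pool names depend on your zone configuration):

# separate CRUSH roots for the SSD OSDs and the SMR OSDs
ceph osd crush add-bucket ssd root
ceph osd crush add-bucket smr root
# ... move the respective hosts/OSDs under those roots ...

# one rule per root
ceph osd crush rule create-simple ssd-rule ssd host
ceph osd crush rule create-simple smr-rule smr host

# keep all pools on the SSD rule by default; point only the bucket data
# pool at the SMR rule (crush_ruleset is the pre-Luminous name)
ceph osd pool set default.rgw.buckets.data crush_ruleset <smr-rule-id>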

One of the other concerns you should think about is recovery time when one
of these drives fails.  The more OSDs you have, the less of an issue this
becomes, but on a small cluster it might take over a day to fully recover
from an OSD failure, which is a decent amount of time to have degraded
PGs.
Bryan

Re: [ceph-users] Experience with 5k RPM/archive HDDs

2017-02-17 Thread Mike Miller

Hi,

don't go there, we tried this with SMR drives, which will slow down to 
somewhere around 2-3 IOPS during backfilling/recovery and that renders 
the cluster useless for client IO. Things might change in the future, 
but for now, I would strongly recommend against SMR.


Go for normal SATA drives with only slightly higher price/capacity ratios.

- mike

On 2/3/17 2:46 PM, Stillwell, Bryan J wrote:

On 2/3/17, 3:23 AM, "ceph-users on behalf of Wido den Hollander"
 wrote:




On 3 February 2017 at 11:03, Maxime Guyot wrote:


Hi,

Interesting feedback!

  > In my opinion the SMR can be used exclusively for the RGW.
  > Unless it's something like a backup/archive cluster or pool with
little to none concurrent R/W access, you're likely to run out of IOPS
(again) long before filling these monsters up.

That's exactly the use case I am considering those archive HDDs for:
something like AWS Glacier, a form of offsite backup, probably via
radosgw. The classic Seagate enterprise-class HDDs provide "too much"
performance for this use case; I could live with 1/4 of the performance
at that price point.



If you go down that route I suggest that you make a mixed cluster for RGW.

A (small) set of OSDs running on top of proper SSDs, e.g. Samsung SM863 or
PM863, or an Intel DC series.

All pools by default should go to those OSDs.

Only the RGW buckets data pool should go to the big SMR drives. However,
again, expect very, very low performance of those disks.


One of the other concerns you should think about is recovery time when one
of these drives fails.  The more OSDs you have, the less of an issue this
becomes, but on a small cluster it might take over a day to fully recover
from an OSD failure, which is a decent amount of time to have degraded
PGs.

Bryan


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Migrate cephfs metadata to SSD in running cluster

2017-02-16 Thread Mike Miller

Hi,

thanks all; still, I would appreciate hints on a concrete procedure for 
migrating the cephfs metadata to an SSD pool, the SSDs being on the same 
hosts as the spinning disks.


This is the reference I have read:
https://www.sebastien-han.fr/blog/2014/08/25/ceph-mix-sata-and-ssd-within-the-same-box/
Are there alternatives to the configuration suggested there?

I am a little paranoid about starting to play around with CRUSH rules 
in the running system.
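For what it's worth, the hammer-era version of that approach boils down to 
something like the following (only a sketch, not a tested recipe; bucket and 
rule names are placeholders):

# 1. add an SSD-only root and move the SSD OSDs under it
ceph osd crush add-bucket ssd root
# ... e.g. 'ceph osd crush create-or-move osd.N <weight> root=ssd host=<host>-ssd' per SSD OSD ...

# 2. create a rule that only selects OSDs from that root
ceph osd crush rule create-simple ssd-rule ssd host

# 3. point the metadata pool at the new rule (pre-Luminous syntax)
ceph osd pool set cephfs_metadata crush_ruleset <ssd-rule-id>

The metadata pool is small, so the resulting data movement should be over 
quickly.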


Regards,

Mike

On 1/5/17 11:40 PM, jiajia zhong wrote:



2017-01-04 23:52 GMT+08:00 Mike Miller <millermike...@gmail.com>:


Wido, all,

can you point me to the "recent benchmarks" so I can have a look?
How do you define "performance"? I would not expect cephFS
throughput to change, but it is surprising to me that metadata on
SSD will have no measurable effect on latency.

- mike


Operations like "ls", "stat", and "find" would become faster; the bottleneck 
is the slow OSDs which store the file data.



On 1/3/17 10:49 AM, Wido den Hollander wrote:


On 3 January 2017 at 2:49, Mike Miller
<millermike...@gmail.com> wrote:


will metadata on SSD improve latency significantly?


No, as I said in my previous e-mail, recent benchmarks showed
that storing CephFS metadata on SSD does not improve performance.

It still might be good to do since it's not that much data thus
recovery will go quickly, but don't expect a CephFS performance
improvement.

Wido

Mike

On 1/2/17 11:50 AM, Wido den Hollander wrote:


On 2 January 2017 at 10:33, Shinobu Kinjo
<ski...@redhat.com> wrote:


I've never done migration of cephfs_metadata from
spindle disks to
ssds. But logically you could achieve this through 2
phases.

  #1 Configure CRUSH rule including spindle disks
and ssds
  #2 Configure CRUSH rule for just pointing to ssds
   * This would cause massive data shuffling.


Not really, usually the CephFS metadata isn't that much
data.

Recent benchmarks (can't find them now) show that
storing CephFS metadata on SSD doesn't really improve
performance though.

    Wido



On Mon, Jan 2, 2017 at 2:36 PM, Mike Miller
<millermike...@gmail.com> wrote:

Hi,

Happy New Year!

Can anyone point me to specific walkthrough /
howto instructions how to move
cephfs metadata to SSD in a running cluster?

How is crush to be modified step by step such
that the metadata migrate to
SSD?

Thanks and regards,

Mike
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Migrate cephfs metadata to SSD in running cluster

2017-01-04 Thread Mike Miller

Wido, all,

can you point me to the "recent benchmarks" so I can have a look?
How do you define "performance"? I would not expect cephFS throughput to 
change, but it is surprising to me that metadata on SSD will have no 
measurable effect on latency.


- mike

On 1/3/17 10:49 AM, Wido den Hollander wrote:



On 3 January 2017 at 2:49, Mike Miller <millermike...@gmail.com> wrote:


will metadata on SSD improve latency significantly?



No, as I said in my previous e-mail, recent benchmarks showed that storing 
CephFS metadata on SSD does not improve performance.

It still might be good to do since it's not that much data thus recovery will 
go quickly, but don't expect a CephFS performance improvement.

Wido


Mike

On 1/2/17 11:50 AM, Wido den Hollander wrote:



On 2 January 2017 at 10:33, Shinobu Kinjo <ski...@redhat.com> wrote:


I've never done migration of cephfs_metadata from spindle disks to
ssds. But logically you could achieve this through 2 phases.

 #1 Configure CRUSH rule including spindle disks and ssds
 #2 Configure CRUSH rule for just pointing to ssds
  * This would cause massive data shuffling.


Not really, usually the CephFS metadata isn't that much data.

Recent benchmarks (can't find them now) show that storing CephFS metadata on 
SSD doesn't really improve performance though.

Wido




On Mon, Jan 2, 2017 at 2:36 PM, Mike Miller <millermike...@gmail.com> wrote:

Hi,

Happy New Year!

Can anyone point me to specific walkthrough / howto instructions how to move
cephfs metadata to SSD in a running cluster?

How is crush to be modified step by step such that the metadata migrate to
SSD?

Thanks and regards,

Mike
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Migrate cephfs metadata to SSD in running cluster

2017-01-02 Thread Mike Miller

will metadata on SSD improve latency significantly?

Mike

On 1/2/17 11:50 AM, Wido den Hollander wrote:



On 2 January 2017 at 10:33, Shinobu Kinjo <ski...@redhat.com> wrote:


I've never done migration of cephfs_metadata from spindle disks to
ssds. But logically you could achieve this through 2 phases.

 #1 Configure CRUSH rule including spindle disks and ssds
 #2 Configure CRUSH rule for just pointing to ssds
  * This would cause massive data shuffling.


Not really, usually the CephFS metadata isn't that much data.

Recent benchmarks (can't find them now) show that storing CephFS metadata on 
SSD doesn't really improve performance though.

Wido




On Mon, Jan 2, 2017 at 2:36 PM, Mike Miller <millermike...@gmail.com> wrote:

Hi,

Happy New Year!

Can anyone point me to specific walkthrough / howto instructions how to move
cephfs metadata to SSD in a running cluster?

How is crush to be modified step by step such that the metadata migrate to
SSD?

Thanks and regards,

Mike
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Migrate cephfs metadata to SSD in running cluster

2017-01-01 Thread Mike Miller

Hi,

Happy New Year!

Can anyone point me to specific walkthrough / howto instructions how to 
move cephfs metadata to SSD in a running cluster?


How is crush to be modified step by step such that the metadata migrate 
to SSD?


Thanks and regards,

Mike
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [EXTERNAL] Ceph performance is too good (impossible..)...

2016-12-16 Thread Mike Miller

Hi,

you need to flush all caches before starting read tests. With fio you 
can probably do this if you keep the files that it creates.


as root on all clients and all osd nodes run:

echo 3 > /proc/sys/vm/drop_caches

But fio is a little problematic for ceph because of the caches in the 
clients and the osd nodes. If you really need to know the read rates, go 
for large files, write them with dd, flush all caches, and then read the 
files with dd. A single-threaded dd read shows less throughput than 
multiple dd reads in parallel. The readahead size also matters.
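A minimal sketch of that kind of test (file name and mount point made up):

# write a large file once
dd if=/dev/zero of=/mnt/cephfs/bigfile bs=4M count=12800   # ~50 GB
sync
# drop the page cache on the client and on every osd node
echo 3 > /proc/sys/vm/drop_caches
# then measure the read
dd if=/mnt/cephfs/bigfile of=/dev/null bs=4M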


Good luck testing.

- mike


On 12/13/16 2:37 AM, V Plus wrote:

The same..
see:
A: (g=0): rw=read, bs=5M-5M/5M-5M/5M-5M, ioengine=*libaio*, iodepth=1
...
fio-2.2.10
Starting 16 processes

A: (groupid=0, jobs=16): err= 0: pid=27579: Mon Dec 12 20:36:10 2016
  mixed: io=122515MB, bw=6120.3MB/s, iops=1224, runt= 20018msec

I think in the end the only way to solve this issue is to write the
image before the read test, as suggested.

I have no clue why rbd engine does not work...

On Mon, Dec 12, 2016 at 4:23 PM, Will.Boege wrote:

Try adding --ioengine=libaio


From: V Plus
Date: Monday, December 12, 2016 at 2:40 PM
To: "Will.Boege"
Subject: Re: [EXTERNAL] [ceph-users] Ceph performance is too good
(impossible..)...


Hi Will, 

thanks very much..

However, I tried with your suggestions.

Both are *not* working...

1. with FIO rbd engine:
[RBD_TEST]
ioengine=rbd
clientname=admin
pool=rbd
rbdname=fio_test
invalidate=1
direct=1
group_reporting=1
unified_rw_reporting=1
time_based=1
rw=read
bs=4MB
numjobs=16
ramp_time=10
runtime=20

then I run "sudo fio rbd.job" and  got:


RBD_TEST: (g=0): rw=read, bs=4M-4M/4M-4M/4M-4M, ioengine=rbd, iodepth=1
...
fio-2.2.10
Starting 16 processes
rbd engine: RBD version: 0.1.10
rbd engine: RBD version: 0.1.10
rbd engine: RBD version: 0.1.10
rbd engine: RBD version: 0.1.10
rbd engine: RBD version: 0.1.10
rbd engine: RBD version: 0.1.10
rbd engine: RBD version: 0.1.10
rbd engine: RBD version: 0.1.10
rbd engine: RBD version: 0.1.10
rbd engine: RBD version: 0.1.10
rbd engine: RBD version: 0.1.10
rbd engine: RBD version: 0.1.10
rbd engine: RBD version: 0.1.10
rbd engine: RBD version: 0.1.10
rbd engine: RBD version: 0.1.10
rbd engine: RBD version: 0.1.10
Jobs: 12 (f=7): [R(5),_(1),R(4),_(1),R(2),_(2),R(1)] [100.0% done]
[11253MB/0KB/0KB /s] [2813/0/0 iops] [eta 00m:00s]
RBD_TEST: (groupid=0, jobs=16): err= 0: pid=17504: Mon Dec 12
15:32:52 2016
  mixed: io=212312MB, *bw=10613MB/s*, iops=2653, runt= 20005msec

2. with blockalign
[A]
direct=1
group_reporting=1
unified_rw_reporting=1
size=100%
time_based=1
filename=/dev/rbd0
rw=read
bs=5MB
numjobs=16
ramp_time=5
runtime=20
blockalign=512b

[B]
direct=1
group_reporting=1
unified_rw_reporting=1
size=100%
time_based=1
filename=/dev/rbd1
rw=read
bs=5MB
numjobs=16
ramp_time=5
runtime=20
blockalign=512b

sudo fio fioA.job --output a.txt & sudo fio fioB.job --output b.txt
& wait

Then I got:
A: (groupid=0, jobs=16): err= 0: pid=19320: Mon Dec 12 15:35:32 2016
  mixed: io=88590MB, bw=4424.7MB/s, iops=884, runt= 20022msec

B: (groupid=0, jobs=16): err= 0: pid=19324: Mon Dec 12 15:35:32 2016
  mixed: io=88020MB, *bw=4395.6MB/s*, iops=879, runt= 20025msec

..


On Mon, Dec 12, 2016 at 10:45 AM, Will.Boege wrote:

My understanding is that when using direct=1 on a raw block
device FIO (aka-you) will have to handle all the sector
alignment or the request will get buffered to perform the
alignment.  

 

Try adding the --blockalign=512b option to your jobs, or better
yet just use the native FIO RBD engine.

 

Something like this (untested) - 

 

[A]

ioengine=rbd

clientname=admin

pool=rbd

rbdname=fio_test

direct=1

group_reporting=1

unified_rw_reporting=1

time_based=1

rw=read

bs=4MB

numjobs=16

ramp_time=10

runtime=20

 

From: ceph-users on behalf of V Plus

Re: [ceph-users] rsync kernel client cepfs mkstemp no space left on device

2016-12-12 Thread Mike Miller

John,

thanks for emphasizing this. Before this workaround we tried many 
different kernel versions, including 4.5.x, all with the same result. The 
problem might be particular to our environment, as most of the client 
machines (compute servers) have large RAM and therefore plenty of cache 
space for inodes/dentries.


Cheers,

Mike
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rsync kernel client cepfs mkstemp no space left on device

2016-12-11 Thread Mike Miller

Hi,

you have given up too early. rsync is not a nice workload for cephfs; in 
particular, most Linux kernel cephfs clients will end up caching all 
inodes/dentries. The result is that the mds servers crash due to memory 
limitations. And rsync basically scans all inodes/dentries, so it is 
the perfect application to gobble up all inode caps.


We run a cronjob script flush_cache every few (2-5) minutes:

#!/bin/bash
echo 2 > /proc/sys/vm/drop_caches

on all machines that mount cephfs. There is no performance drop on the 
client machines, and happily the mds congestion is solved by this.
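(For completeness, the cron entry itself is nothing more than something 
like the following; the script path is ours and arbitrary:

*/3 * * * * root /usr/local/sbin/flush_cache

placed in /etc/cron.d on every cephfs client.)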


We also went the rbd way before this, but for large rbd images we much 
prefer cephfs instead.


Regards,

Mike
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Sandisk SSDs

2016-12-11 Thread Mike Miller

Hi,

some time ago, when starting a Ceph evaluation cluster, I used SSDs with 
similar specs. I would strongly recommend against it: during normal 
operation things might be fine, but wait until the first disk fails and 
things have to be backfilled.


If you still try, please let me know how things turn out for you.

Regards,

Mike
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs (rbd) read performance low - where is the bottleneck?

2016-11-23 Thread Mike Miller

JiaJia, all,

thanks, yes, I have the mount opts in mtab, and correct: if I leave out 
the "-v" option, there are no complaints.


mtab:
mounted ... type ceph (name=cephfs,rasize=134217728,key=client.cephfs)

It has to be rasize (rsize will not work).
One can check here:

cat /sys/class/bdi/ceph-*/read_ahead_kb
-> 131072
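For reference, the mount line that produces this looks roughly like 
(monitor and secret file as in my other posts, rasize in bytes):

mount.ceph m2:6789:/ /foo2 -o name=cephfs,secretfile=secret.key,rasize=134217728

134217728 bytes = 131072 KB, which is what shows up in read_ahead_kb above.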

And YES! I am so happy: dd on a 40 GB file now does a lot more 
single-threaded throughput, much better.

rasize= 67108864  222 MB/s
rasize=134217728  360 MB/s
rasize=268435456  474 MB/s

Thank you all very much for bringing me on the right track, highly 
appreciated.


Regards,

Mike

On 11/23/16 5:55 PM, JiaJia Zhong wrote:

Mike,
if you run mount.ceph with the "-v" option, you may get "ceph: Unknown
mount option rsize". Actually, you can ignore this; both rsize and rasize
will be passed on to the mount syscall.

I believe that you have the cephfs mounted successfully;
run "mount" in a terminal to check the actual mount opts in mtab.

-- Original --
*From: * "Mike Miller"<millermike...@gmail.com>;
*Date: * Wed, Nov 23, 2016 02:38 PM
*To: * "Eric Eastman"<eric.east...@keepertech.com>;
*Cc: * "Ceph Users"<ceph-users@lists.ceph.com>;
*Subject: * Re: [ceph-users] cephfs (rbd) read performance low - where
is thebottleneck?

Hi,

did some testing with multithreaded access and dd; performance scales as it
should.

Any ideas to improve single-threaded read performance further would be
highly appreciated. Some of our use cases require reading large files
with a single thread.

I have tried changing the readahead on the kernel client cephfs mount
too, rsize and rasize.

mount.ceph ... -o name=cephfs,secretfile=secret.key,rsize=67108864

Doing this on kernel 4.5.2 gives the error message:
"ceph: Unknown mount option rsize"
or unknown rasize.

Can someone explain to me how I can experiment with readahead on cephfs?

Mike

On 11/21/16 12:33 PM, Eric Eastman wrote:

Have you looked at your file layout?

On a test cluster running 10.2.3 I created a 5GB file and then looked
at the layout:

# ls -l test.dat
  -rw-r--r-- 1 root root 524288 Nov 20 23:09 test.dat
# getfattr -n ceph.file.layout test.dat
  # file: test.dat
  ceph.file.layout="stripe_unit=4194304 stripe_count=1
object_size=4194304 pool=cephfs_data"

From what I understand, with this layout you are reading 4MB of data
from 1 OSD at a time, so I think you are seeing the overall speed of a
single SATA drive.  I do not think increasing your MON/MDS links to
10Gb will help, nor, for a single file read, will moving the metadata
to SSD help.

To test this, you may want to try creating 10 x 50GB files, and then
read them in parallel and see if your overall throughput increases.
If so, take a look at the layout parameters and see if you can change
the file layout to get more parallelization.

https://github.com/ceph/ceph/blob/master/doc/dev/file-striping.rst
https://github.com/ceph/ceph/blob/master/doc/cephfs/file-layouts.rst

Regards,
Eric

On Sun, Nov 20, 2016 at 3:24 AM, Mike Miller <millermike...@gmail.com>

wrote:

Hi,

reading a big file 50 GB (tried more too)

dd if=bigfile of=/dev/zero bs=4M

in a cluster with 112 SATA disks in 10 osd (6272 pgs, replication 3)

gives

me only about *122 MB/s* read speed in single thread. Scrubbing

turned off

during measurement.

I have been searching for possible bottlenecks. The network is not the
problem, the machine running dd is connected to the cluster public

network

with a 20 GBASE-T bond. osd dual network: cluster public 10 GBASE-T,

private

10 GBASE-T.

The osd SATA disks are utilized only up until about 10% or 20%, not more
than that. CPUs on osd idle too. CPUs on mon idle, mds usage about 1.0 (1
core is used on this 6-core machine). mon and mds connected with only

1 GbE

(I would expect some latency from that, but no bandwidth issues; in fact
network bandwidth is about 20 Mbit max).

If I read a file with 50 GB, then clear the cache on the reading machine
(but not the osd caches), I get much better reading performance of about
*620 MB/s*. That seems logical to me as much (most) of the data is

still in

the osd cache buffers. But still the read performance is not super
considered that the reading machine is connected to the cluster with a 20
Gbit/s bond.

How can I improve? I am not really sure, but from my understanding 2
possible bottlenecks come to mind:

1) 1 GbE connection to mon / mds

Is this the reason why reads are slow and osd disks are not hammered

by read

requests and therewith fully utilized?

2) Move metadata to SSD

Currently, cephfs_metadata is on the same pool as the data on the

spinning

SATA disks. Is this the bottleneck? Is the move of metadata to SSD a
solution?

Or is it both?

Your experience and insight are highly appreciated.

Thanks,

Mike
___

Re: [ceph-users] cephfs (rbd) read performance low - where is the bottleneck?

2016-11-22 Thread Mike Miller

Hi,

did some testing with multithreaded access and dd; performance scales as it
should.


Any ideas to improve single-threaded read performance further would be
highly appreciated. Some of our use cases require reading large files
with a single thread.


I have tried changing the readahead on the kernel client cephfs mount 
too, rsize and rasize.


mount.ceph ... -o name=cephfs,secretfile=secret.key,rsize=67108864

Doing this on kernel 4.5.2 gives the error message:
"ceph: Unknown mount option rsize"
or unknown rasize.

Can someone explain to me how I can experiment with readahead on cephfs?

Mike

On 11/21/16 12:33 PM, Eric Eastman wrote:

Have you looked at your file layout?

On a test cluster running 10.2.3 I created a 5GB file and then looked
at the layout:

# ls -l test.dat
  -rw-r--r-- 1 root root 524288 Nov 20 23:09 test.dat
# getfattr -n ceph.file.layout test.dat
  # file: test.dat
  ceph.file.layout="stripe_unit=4194304 stripe_count=1
object_size=4194304 pool=cephfs_data"

From what I understand, with this layout you are reading 4MB of data
from 1 OSD at a time, so I think you are seeing the overall speed of a
single SATA drive.  I do not think increasing your MON/MDS links to
10Gb will help, nor, for a single file read, will moving the metadata
to SSD help.

To test this, you may want to try creating 10 x 50GB files, and then
read them in parallel and see if your overall throughput increases.
If so, take a look at the layout parameters and see if you can change
the file layout to get more parallelization.

https://github.com/ceph/ceph/blob/master/doc/dev/file-striping.rst
https://github.com/ceph/ceph/blob/master/doc/cephfs/file-layouts.rst

Regards,
Eric

On Sun, Nov 20, 2016 at 3:24 AM, Mike Miller <millermike...@gmail.com> wrote:

Hi,

reading a big file 50 GB (tried more too)

dd if=bigfile of=/dev/zero bs=4M

in a cluster with 112 SATA disks in 10 osd (6272 pgs, replication 3) gives
me only about *122 MB/s* read speed in single thread. Scrubbing turned off
during measurement.

I have been searching for possible bottlenecks. The network is not the
problem, the machine running dd is connected to the cluster public network
with a 20 GBASE-T bond. osd dual network: cluster public 10 GBASE-T, private
10 GBASE-T.

The osd SATA disks are utilized only up until about 10% or 20%, not more
than that. CPUs on osd idle too. CPUs on mon idle, mds usage about 1.0 (1
core is used on this 6-core machine). mon and mds connected with only 1 GbE
(I would expect some latency from that, but no bandwidth issues; in fact
network bandwidth is about 20 Mbit max).

If I read a file with 50 GB, then clear the cache on the reading machine
(but not the osd caches), I get much better reading performance of about
*620 MB/s*. That seems logical to me as much (most) of the data is still in
the osd cache buffers. But still the read performance is not super
considered that the reading machine is connected to the cluster with a 20
Gbit/s bond.

How can I improve? I am not really sure, but from my understanding 2
possible bottlenecks come to mind:

1) 1 GbE connection to mon / mds

Is this the reason why reads are slow and osd disks are not hammered by read
requests and therewith fully utilized?

2) Move metadata to SSD

Currently, cephfs_metadata is on the same pool as the data on the spinning
SATA disks. Is this the bottleneck? Is the move of metadata to SSD a
solution?

Or is it both?

Your experience and insight are highly appreciated.

Thanks,

Mike
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs (rbd) read performance low - where is the bottleneck?

2016-11-22 Thread Mike Miller

thank you very much for this info.

On 11/21/16 12:33 PM, Eric Eastman wrote:

Have you looked at your file layout?

On a test cluster running 10.2.3 I created a 5GB file and then looked
at the layout:

# ls -l test.dat
  -rw-r--r-- 1 root root 524288 Nov 20 23:09 test.dat
# getfattr -n ceph.file.layout test.dat
  # file: test.dat
  ceph.file.layout="stripe_unit=4194304 stripe_count=1
object_size=4194304 pool=cephfs_data"


The file layout looks the same in my case.


From what I understand, with this layout you are reading 4MB of data
from 1 OSD at a time, so I think you are seeing the overall speed of a
single SATA drive.  I do not think increasing your MON/MDS links to
10Gb will help, nor, for a single file read, will moving the metadata
to SSD help.


Really? Does ceph really wait until each of the stripe_unit reads has 
finished before starting the next one?



To test this, you may want to try creating 10 x 50GB files, and then
read them in parallel and see if your overall throughput increases.


Scaling through parallelism works as expected, no problem there.


If so, take a look at the layout parameters and see if you can change
the file layout to get more parallelization.

https://github.com/ceph/ceph/blob/master/doc/dev/file-striping.rst
https://github.com/ceph/ceph/blob/master/doc/cephfs/file-layouts.rst


Interesting. But how would I change this to improve single threaded read 
speed?


And how would I do the changes to already existing files?
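In case it helps, layouts are set through the ceph.* virtual xattrs; a 
sketch (directory name made up):

# new files created under this directory inherit the wider striping
setfattr -n ceph.dir.layout.stripe_count -v 4 /mnt/cephfs/somedir
setfattr -n ceph.dir.layout.stripe_unit -v 1048576 /mnt/cephfs/somedir

A layout can only be changed on files that are still empty, so already 
existing files have to be copied/rewritten to pick up the new layout.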

Regards,

Mike


Regards,
Eric

On Sun, Nov 20, 2016 at 3:24 AM, Mike Miller <millermike...@gmail.com> wrote:

Hi,

reading a big file 50 GB (tried more too)

dd if=bigfile of=/dev/zero bs=4M

in a cluster with 112 SATA disks in 10 osd (6272 pgs, replication 3) gives
me only about *122 MB/s* read speed in single thread. Scrubbing turned off
during measurement.

I have been searching for possible bottlenecks. The network is not the
problem, the machine running dd is connected to the cluster public network
with a 20 GBASE-T bond. osd dual network: cluster public 10 GBASE-T, private
10 GBASE-T.

The osd SATA disks are utilized only up until about 10% or 20%, not more
than that. CPUs on osd idle too. CPUs on mon idle, mds usage about 1.0 (1
core is used on this 6-core machine). mon and mds connected with only 1 GbE
(I would expect some latency from that, but no bandwidth issues; in fact
network bandwidth is about 20 Mbit max).

If I read a file with 50 GB, then clear the cache on the reading machine
(but not the osd caches), I get much better reading performance of about
*620 MB/s*. That seems logical to me as much (most) of the data is still in
the osd cache buffers. But still the read performance is not super
considered that the reading machine is connected to the cluster with a 20
Gbit/s bond.

How can I improve? I am not really sure, but from my understanding 2
possible bottlenecks come to mind:

1) 1 GbE connection to mon / mds

Is this the reason why reads are slow and osd disks are not hammered by read
requests and therewith fully utilized?

2) Move metadata to SSD

Currently, cephfs_metadata is on the same pool as the data on the spinning
SATA disks. Is this the bottleneck? Is the move of metadata to SSD a
solution?

Or is it both?

Your experience and insight are highly appreciated.

Thanks,

Mike
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] cephfs - mds hardware recommendation for 40 million files and 500 users

2016-07-26 Thread Mike Miller

Hi,

we have started to migrate user homes to cephfs, with the MDS server 
having 32 GB RAM. With multiple rsync threads copying, this seems to be 
undersized; the mds process consumes all 32 GB of memory, which fits 
about 4 million caps.


Any hardware recommendation for about 40 million files and about 500 users?

Currently, we are on hammer 0.94.5 and linux ubuntu, kernel 3.13.

Thanks and regards,

Mike

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd current.remove.me.somenumber

2016-06-30 Thread Mike Miller

Hi Greg,

thanks, highly appreciated. And yes, that was on an osd with btrfs. We 
switched back to xfs because of btrfs instabilities.


Regards,

-Mike

On 6/27/16 10:13 PM, Gregory Farnum wrote:

On Sat, Jun 25, 2016 at 11:22 AM, Mike Miller <millermike...@gmail.com> wrote:

Hi,

what is the meaning of the directory "current.remove.me.846930886" in
/var/lib/ceph/osd/ceph-14?


If you're using btrfs, I believe that's a no-longer-required snapshot
of the current state of the system. If you're not, I've no idea what
creates directories named like that.
-Greg


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] osd current.remove.me.somenumber

2016-06-25 Thread Mike Miller

Hi,

what is the meaning of the directory "current.remove.me.846930886" in 
/var/lib/ceph/osd/ceph-14?


Thanks and regards,

Mike
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Slow read on RBD mount, Hammer 0.94.5

2016-04-27 Thread Mike Miller

Nick, all,

fantastic, that did it!

I installed kernel 4.5.2 on the client; now the single-threaded read 
performance using a krbd mount is up to about 370 MB/s with the standard 
readahead size of 256, and touching 400 MB/s with larger readahead sizes.

All single threaded.

Multi-threaded krbd reads on the same mount also improved; a 10 Gbit/s 
network connection is easily saturated.


Thank you all so much for the discussion and the hints.

Regards,

Mike

On 4/23/16 9:51 AM, n...@fisk.me.uk wrote:

I've just looked through github for the Linux kernel and it looks like
that read ahead fix was introduced in 4.4, so I'm not sure if it's worth
trying a slightly newer kernel?

Sent from Nine

*From:* Mike Miller <millermike...@gmail.com>
*Sent:* 21 Apr 2016 2:20 pm
*To:* ceph-users@lists.ceph.com
*Subject:* Re: [ceph-users] Slow read on RBD mount, Hammer 0.94.5

Hi Udo,

thanks, just to make sure, further increased the readahead:

$ sudo blockdev --getra /dev/rbd0
1048576

$ cat /sys/block/rbd0/queue/read_ahead_kb
524288

No difference here. First one is sectors (512 bytes), second one KB.

The second read (after drop cache) is somewhat faster (10%-20%) but not
much.

I also found this info
http://tracker.ceph.com/issues/9192

Maybe Ilya can help us, he knows probably best how this can be improved.

Thanks and cheers,

Mike


On 4/21/16 4:32 PM, Udo Lembke wrote:

Hi Mike,

On 21.04.2016 at 09:07, Mike Miller wrote:

Hi Nick and Udo,

thanks, very helpful, I tweaked some of the config parameters along
the line Udo suggests, but still only some 80 MB/s or so.

this means you have reached a factor of 3 (this is roughly the value I
see with a single thread on RBD too). Better than nothing.



Kernel 4.3.4 running on the client machine and comfortable readahead
configured

$ sudo blockdev --getra /dev/rbd0
262144

Still not more than about 80-90 MB/s.

there are two possibilities for read-ahead.
Take a look here (and change it with echo):
cat /sys/block/rbd0/queue/read_ahead_kb

Perhaps there are slightly differences?



For writing the parallelization is amazing and I see very impressive
speeds, but why is reading performance so much behind? Why is it not
parallelized the same way writing is? Is this something coming up in
the jewel release? Or is it planned further down the road?

If you read a big file and clear your cache ("echo 3 >
/proc/sys/vm/drop_caches") on the client, is the second read very fast?
I assume yes.
In this case the read data is in the cache on the osd-nodes... so the
tuning must be done there (and I'm very interested in improvements).

Udo

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Slow read on RBD mount, Hammer 0.94.5

2016-04-21 Thread Mike Miller

Hi Udo,

thanks, just to make sure, further increased the readahead:

$ sudo blockdev --getra /dev/rbd0
1048576

$ cat /sys/block/rbd0/queue/read_ahead_kb
524288

No difference here. First one is sectors (512 bytes), second one KB.

The second read (after drop cache) is somewhat faster (10%-20%) but not 
much.


I also found this info
http://tracker.ceph.com/issues/9192

Maybe Ilya can help us, he knows probably best how this can be improved.

Thanks and cheers,

Mike


On 4/21/16 4:32 PM, Udo Lembke wrote:

Hi Mike,

On 21.04.2016 at 09:07, Mike Miller wrote:

Hi Nick and Udo,

thanks, very helpful, I tweaked some of the config parameters along
the line Udo suggests, but still only some 80 MB/s or so.

this means you have reached a factor of 3 (this is roughly the value I
see with a single thread on RBD too). Better than nothing.



Kernel 4.3.4 running on the client machine and comfortable readahead
configured

$ sudo blockdev --getra /dev/rbd0
262144

Still not more than about 80-90 MB/s.

there are two possibilities for read-ahead.
Take a look here (and change it with echo):
cat /sys/block/rbd0/queue/read_ahead_kb

Perhaps there are slightly differences?



For writing the parallelization is amazing and I see very impressive
speeds, but why is reading performance so much behind? Why is it not
parallelized the same way writing is? Is this something coming up in
the jewel release? Or is it planned further down the road?

If you read a big file and clear your cache ("echo 3 >
/proc/sys/vm/drop_caches") on the client, is the second read very fast?
I assume yes.
In this case the read data is in the cache on the osd-nodes... so the
tuning must be done there (and I'm very interested in improvements).

Udo

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Slow read on RBD mount, Hammer 0.94.5

2016-04-21 Thread Mike Miller

Hi Nick and Udo,

thanks, very helpful, I tweaked some of the config parameters along the 
line Udo suggests, but still only some 80 MB/s or so.


Kernel 4.3.4 running on the client machine and comfortable readahead 
configured


$ sudo blockdev --getra /dev/rbd0
262144

Still not more than about 80-90 MB/s.

For writing the parallelization is amazing and I see very impressive 
speeds, but why is reading performance so much behind? Why is it not 
parallelized the same way writing is? Is this something coming up in the 
jewel release? Or is it planned further down the road?


Please let me know if there is a way to enable clients better single 
threaded read performance for large files.


Thanks and regards,

Mike

On 4/20/16 10:43 PM, Nick Fisk wrote:




-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
Udo Lembke
Sent: 20 April 2016 07:21
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Slow read on RBD mount, Hammer 0.94.5

Hi Mike,
I don't have experiences with RBD mounts, but see the same effect with
RBD.

You can do some tuning to get better results (disable debug and so on).

As hint some values from a ceph.conf:
[osd]
 debug asok = 0/0
 debug auth = 0/0
 debug buffer = 0/0
 debug client = 0/0
 debug context = 0/0
 debug crush = 0/0
 debug filer = 0/0
 debug filestore = 0/0
 debug finisher = 0/0
 debug heartbeatmap = 0/0
 debug journal = 0/0
 debug journaler = 0/0
 debug lockdep = 0/0
 debug mds = 0/0
 debug mds balancer = 0/0
 debug mds locker = 0/0
 debug mds log = 0/0
 debug mds log expire = 0/0
 debug mds migrator = 0/0
 debug mon = 0/0
 debug monc = 0/0
 debug ms = 0/0
 debug objclass = 0/0
 debug objectcacher = 0/0
 debug objecter = 0/0
 debug optracker = 0/0
 debug osd = 0/0
 debug paxos = 0/0
 debug perfcounter = 0/0
 debug rados = 0/0
 debug rbd = 0/0
 debug rgw = 0/0
 debug throttle = 0/0
 debug timer = 0/0
 debug tp = 0/0
 filestore_op_threads = 4
 osd max backfills = 1
 osd mount options xfs =
"rw,noatime,inode64,logbufs=8,logbsize=256k,allocsize=4M"
 osd mkfs options xfs = "-f -i size=2048"
 osd recovery max active = 1
 osd_disk_thread_ioprio_class = idle
 osd_disk_thread_ioprio_priority = 7
 osd_disk_threads = 1
 osd_enable_op_tracker = false
 osd_op_num_shards = 10
 osd_op_num_threads_per_shard = 1
 osd_op_threads = 4

Udo

On 19.04.2016 11:21, Mike Miller wrote:

Hi,

RBD mount
ceph v0.94.5
6 OSD with 9 HDD each
10 GBit/s public and private networks
3 MON nodes 1Gbit/s network

A rbd mounted with btrfs filesystem format performs really badly when
reading. Tried readahead in all combinations but that does not help in
any way.

Write rates are very good in excess of 600 MB/s up to 1200 MB/s,
average 800 MB/s Read rates on the same mounted rbd are about 10-30
MB/s !?


What kernel are you running? Older kernels had an issue where readahead was
capped at 2 MB. In order to get good read speeds you need readahead set to
about 32 MB+.
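For example, to raise the readahead on a krbd device to 32 MB (assuming
/dev/rbd0, as above):

blockdev --setra 65536 /dev/rbd0      # 65536 x 512-byte sectors = 32 MB
# or equivalently via sysfs:
echo 32768 > /sys/block/rbd0/queue/read_ahead_kb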




Of course, both writes and reads are from a single client machine with
a single write/read command. So I am looking at single threaded
performance.
Actually, I was hoping to see at least 200-300 MB/s when reading, but
I am seeing 10% of that at best.

Thanks for your help.

Mike
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Question about cache tier and backfill/recover

2016-03-25 Thread Mike Miller

Hi,

in case of a failure in the storage tier, say single OSD disk failure or 
complete system failure with several OSD disks, will the remaining cache 
tier (on other nodes) be used for rapid backfilling/recovering first 
until it is full? Or is backfill/recovery done directly to the storage tier?


Thanks and regards,

Mike
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS memory sizing

2016-03-05 Thread Mike Miller

Hi Dietmar,

it all depends on how many inodes have caps on the mds. I have run a very 
similar configuration with 0.5 TB raw and about 200 million files, mds 
collocated with mon and 32 GB RAM.
When rsyncing files from other servers onto cephfs I have observed that 
the mds sometimes runs out of memory and hangs.

All on hammer 0.94.

Regards,

Mike

On 3/1/16 8:13 AM, Yan, Zheng wrote:

On Tue, Mar 1, 2016 at 7:28 PM, Dietmar Rieder
 wrote:

Dear ceph users,


I'm in the very initial phase of planning a ceph cluster an have a
question regarding the RAM recommendation for an MDS.

According to the ceph docs the minimum amount of RAM should be "1 GB
minimum per daemon". Is this per OSD in the cluster or per MDS in the
cluster?

I plan to run 3 ceph-mon on 3 dedicated machines and would like to run 3
ceph-msd on these  machines as well. The raw capacity of the cluster
should be ~1.9PB. Would 64GB of RAM then be enough for the
ceph-mon/ceph-msd nodes?



Each file inode in the MDS uses about 2k of memory (it does not depend on file
size).  MDS memory usage depends on how large the active file set is.

Regards
Yan, Zheng


Thanks
  Dietmar

--
_
D i e t m a r  R i e d e r


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] HBA - PMC Adaptec HBA 1000

2016-03-02 Thread Mike Miller

Hi,

can someone report their experiences with the PMC Adaptec HBA 1000 
series of controllers?

https://www.adaptec.com/en-us/smartstorage/hba/

Thanks and regards,

Mike
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] mount.ceph not accepting options, please help

2015-12-16 Thread Mike Miller

Hi,

sorry, the question might seem very easy, probably my bad, but can you 
please help me understand why I am unable to change the readahead size 
and other options when mounting cephfs?


mount.ceph m2:6789:/ /foo2 -v -o name=cephfs,secret=,rsize=1024000

the result is:

ceph: Unknown mount option rsize

I am using hammer 0.94.5 and ubuntu trusty.

Thanks for your help!

Mike
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Debug / monitor osd journal usage

2015-12-14 Thread Mike Miller

Hi,

is there a way to debug / monitor the osd journal usage?

Thanks and regards,

Mike
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Mix of SATA and SSD

2015-12-11 Thread Mike Miller

Hi,

can you please help me with the question I am currently thinking about.

I am entertaining an OSD node design with a mixture of SATA-spinner-based 
OSD daemons and SSD-based OSD daemons.


Is it possible to have incoming write traffic go to the SSDs first and 
then, when the write traffic becomes less intense, redistribute it from 
the SSDs to the spinners?


Is this possible using a suitable crushmap?

Is this thought equivalent to having large SSD journals?
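The closest built-in mechanism to what is described above is a writeback 
cache tier in front of the spinner-backed pool; as a rough sketch (pool 
names are placeholders, and the usual cache-tiering caveats apply):

ceph osd tier add sata-pool ssd-pool
ceph osd tier cache-mode ssd-pool writeback
ceph osd tier set-overlay sata-pool ssd-pool
# dirty/full ratios control when data gets flushed down to the spinners
ceph osd pool set ssd-pool cache_target_dirty_ratio 0.4
ceph osd pool set ssd-pool cache_target_full_ratio 0.8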

Thanks and regards,

Mike
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph-deploy osd prepare for journal size 0

2015-12-03 Thread Mike Miller

Hi,

for testing I would like to create some OSDs in the hammer release with 
journal size 0.


I included this in ceph.conf:
[osd]
osd journal size = 0

Then I zapped the disk in question ('ceph-deploy disk zap o1:sda') and 
ran 'ceph-deploy disk prepare o1:sda' (full log below).

Thank you for your advice on how to prepare an OSD without a journal / 
with journal size 0.


Thanks and regards,

Mike

---
[ceph_deploy.conf][DEBUG ] found configuration file at: 
/home/ceph/.cephdeploy.conf
[ceph_deploy.cli][INFO  ] Invoked (1.5.22): /usr/bin/ceph-deploy disk 
prepare o1:sda

[ceph_deploy.osd][DEBUG ] Preparing cluster ceph disks o1:/dev/sda:
[o1][DEBUG ] connection detected need for sudo
[o1][DEBUG ] connected to host: o1
[o1][DEBUG ] detect platform information from remote host
[o1][DEBUG ] detect machine type
[ceph_deploy.osd][INFO  ] Distro info: Ubuntu 14.04 trusty
[ceph_deploy.osd][DEBUG ] Deploying osd to o1
[o1][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf
[o1][INFO  ] Running command: sudo udevadm trigger 
--subsystem-match=block --action=add
[ceph_deploy.osd][DEBUG ] Preparing host o1 disk /dev/sda journal None 
activate False
[o1][INFO  ] Running command: sudo ceph-disk -v prepare --fs-type xfs 
--cluster ceph -- /dev/sda
[o1][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-osd 
--cluster=ceph --show-config-value=fsid
[o1][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf 
--cluster=ceph --name=osd. --lookup osd_mkfs_options_xfs
[o1][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf 
--cluster=ceph --name=osd. --lookup osd_fs_mkfs_options_xfs
[o1][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf 
--cluster=ceph --name=osd. --lookup osd_mount_options_xfs
[o1][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf 
--cluster=ceph --name=osd. --lookup osd_fs_mount_options_xfs
[o1][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-osd 
--cluster=ceph --show-config-value=osd_journal_size
[o1][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf 
--cluster=ceph --name=osd. --lookup osd_cryptsetup_parameters
[o1][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf 
--cluster=ceph --name=osd. --lookup osd_dmcrypt_key_size
[o1][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf 
--cluster=ceph --name=osd. --lookup osd_dmcrypt_type

[o1][WARNIN] INFO:ceph-disk:Will colocate journal with data on /dev/sda
[o1][WARNIN] DEBUG:ceph-disk:Creating journal partition num 2 size 0 on 
/dev/sda
[o1][WARNIN] INFO:ceph-disk:Running command: /sbin/sgdisk --new=2:0:0M 
--change-name=2:ceph journal 
--partition-guid=2:ded83283-2023-4c8e-93ae-b33341710bde 
--typecode=2:45b0969e-9b03-4f30-b4c6-b4b80ceff106 --mbrtogpt -- /dev/sda

[o1][DEBUG ] The operation has completed successfully.
[o1][WARNIN] DEBUG:ceph-disk:Calling partprobe on prepared device /dev/sda
[o1][WARNIN] INFO:ceph-disk:Running command: /sbin/partprobe /dev/sda
[o1][WARNIN] INFO:ceph-disk:Running command: /sbin/udevadm settle
[o1][WARNIN] DEBUG:ceph-disk:Journal is GPT partition 
/dev/disk/by-partuuid/ded83283-2023-4c8e-93ae-b33341710bde
[o1][WARNIN] DEBUG:ceph-disk:Journal is GPT partition 
/dev/disk/by-partuuid/ded83283-2023-4c8e-93ae-b33341710bde

[o1][WARNIN] DEBUG:ceph-disk:Creating osd partition on /dev/sda
[o1][WARNIN] INFO:ceph-disk:Running command: /sbin/sgdisk 
--largest-new=1 --change-name=1:ceph data 
--partition-guid=1:87ab533b-e530-4fa3-bfad-8a157a88cc88 
--typecode=1:89c57f98-2fe5-4dc0-89c1-f3ad0ceff2be -- /dev/sda

[o1][DEBUG ] The operation has completed successfully.
[o1][WARNIN] DEBUG:ceph-disk:Calling partprobe on created device /dev/sda
[o1][WARNIN] INFO:ceph-disk:Running command: /sbin/partprobe /dev/sda
[o1][WARNIN] INFO:ceph-disk:Running command: /sbin/udevadm settle
[o1][WARNIN] DEBUG:ceph-disk:Creating xfs fs on /dev/sda1
[o1][WARNIN] INFO:ceph-disk:Running command: /sbin/mkfs -t xfs -f -i 
size=2048 -- /dev/sda1

[o1][WARNIN] warning: device is not properly aligned /dev/sda1
[o1][WARNIN] agsize (251 blocks) too small, need at least 4096 blocks
[o1][WARNIN] Usage: mkfs.xfs
[o1][WARNIN] /* blocksize */[-b log=n|size=num]
[o1][WARNIN] /* data subvol */  [-d 
agcount=n,agsize=n,file,name=xxx,size=num,
[o1][WARNIN] 
(sunit=value,swidth=value|su=num,sw=num),

[o1][WARNIN]sectlog=n|sectsize=num
[o1][WARNIN] /* inode size */   [-i 
log=n|perblock=n|size=num,maxpct=n,attr=0|1|2,

[o1][WARNIN]projid32bit=0|1]
[o1][WARNIN] /* log subvol */   [-l 
agnum=n,internal,size=num,logdev=xxx,version=n
[o1][WARNIN] 
sunit=value|su=num,sectlog=n|sectsize=num,

[o1][WARNIN]lazy-count=0|1]
[o1][WARNIN] /* label */[-L label (maximum 12 characters)]
[o1][WARNIN] /* naming */   [-n log=n|size=num,version=2|ci]
[o1][WARNIN] /* prototype file */   [-p fname]
[o1][WARNIN] /* quiet */[-q]
[o1][WARNIN] /* realtime subvol */  [-r 

[ceph-users] Infernalis Error EPERM: problem getting command descriptions from mds.0

2015-11-25 Thread Mike Miller

Hi,

can someone please help me with this error?

$ ceph tell mds.0 version
Error EPERM: problem getting command descriptions from mds.0

Tell is not working for me on mds.

Version: infernalis - trusty

Thanks and regards,

Mike
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] MDS memory usage

2015-11-24 Thread Mike Miller

Hi,

in my cluster with 16 OSD daemons and more than 20 million files on 
cephfs, the memory usage on MDS is around 16 GB. It seems that 'mds 
cache size' has no real influence on the memory usage of the MDS.


Is there a formula that relates 'mds cache size' directly to memory 
consumption on the MDS?


In the documentation (and other posts on the mailing list) it is said 
that the MDS needs 1 GB per daemon. I am observing that the MDS uses 
almost exactly 1 GB per OSD daemon (I have 16 OSD and 16 GB memory usage 
on the MDS). Is this the correct formula?


Or is it 1 GB per MDS daemon?

In my case, the standard 'mds cache size 10' makes MDS crash and/or 
the cephfs is unresponsive. Larger values for 'mds cache size' seem to 
work really well.


Version trusty 14.04 and hammer.

Thanks and kind regards,

Mike


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS memory usage

2015-11-24 Thread Mike Miller

Hi Greg,

thanks very much. This is clear to me now.

As for an 'MDS cluster', I thought that this was not recommended at this 
stage? I would very much like to have more than one MDS in my cluster, as 
this would probably help a lot to balance the load. But I am wary of what 
everybody says about stability issues.


Is more than one MDS considered stable enough with hammer?

Thanks and regards,

Mike

On 11/25/15 12:51 PM, Gregory Farnum wrote:

On Tue, Nov 24, 2015 at 10:26 PM, Mike Miller <millermike...@gmail.com> wrote:

Hi,

in my cluster with 16 OSD daemons and more than 20 million files on cephfs,
the memory usage on MDS is around 16 GB. It seems that 'mds cache size' has
no real influence on the memory usage of the MDS.

Is there a formula that relates 'mds cache size' directly to memory
consumption on the MDS?


The dominant factor should be the number of inodes in cache, although
there are other things too. Depending on version I think it was ~2KB
of memory for each inode+dentry at last count.
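As a rough back-of-the-envelope against the numbers earlier in this thread
(overheads vary, so treat this as an estimate only):

  16 GB observed MDS memory / ~2 KB per inode+dentry  =  roughly 8 million cached entries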


In the documentation (and other posts on the mailing list) it is said that
the MDS needs 1 GB per daemon. I am observing that the MDS uses almost
exactly 1 GB per OSD daemon (I have 16 OSD and 16 GB memory usage on the
MDS). Is this the correct formula?

Or is it 1 GB per MDS daemon?


It's got nothing to do with the number of OSDs. I'm not sure where 1GB
per MDS came from, although you can certainly run a reasonable
low-intensity cluster on that.



In my case, the standard 'mds cache size 10' makes MDS crash and/or the
cephfs is unresponsive. Larger values for 'mds cache size' seem to work
really well.


Right. You need the total cache size of your MDS "cluster" (which is
really just 1) to be larger than your working set size or you'll have
trouble. Similarly if you have any individual directories which are a
significant portion of your total cache it might cause issues.
-Greg


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Question about hardware and CPU selection

2015-10-24 Thread Mike Miller

Hi,

as I am planning to set up a ceph cluster with 6 OSD nodes with 10 
harddisks in each node, could you please give me some advice about 
hardware selection? CPU? RAM?

I am planning a 10 GBit/s public and a separate 10 GBit/s private network.

For a smaller test cluster with 5 OSD nodes and 4 hard disks each, 2 
GBit/s public and 4 GBit/s private network, I already tested this using 
Core i5 boxes with 16 GB RAM installed. In most of my test scenarios, 
including load, node failure, backfilling, etc., the CPU usage was not the 
bottleneck at all, with a maximum of about 25% load per core. The private 
network was also far from being fully loaded.


It would be really great to get some advice about hardware choices for 
my newly planned setup.


Thanks very much and regards,

Mike
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com