Re: [ceph-users] Performance counters oddities, cache tier and otherwise

2016-04-06 Thread Shinobu Kinjo
It seems that the temperature / recency estimation hasn't worked properly
at some point.
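
For reference, promotion and flush decisions hang off the cache pool's hit
set configuration. A quick way to see what the pool is actually set to (a
sketch only; "cache" stands in for the cache pool name, and the recency
knob needs a release that supports it):

---
ceph osd pool get cache hit_set_type
ceph osd pool get cache hit_set_count
ceph osd pool get cache hit_set_period
ceph osd pool get cache min_read_recency_for_promote
---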

Cheers,
Shinobu

- Original Message -
From: "Christian Balzer" 
To: ceph-users@lists.ceph.com
Sent: Thursday, April 7, 2016 11:51:38 AM
Subject: [ceph-users] Performance counters oddities, cache tier and otherwise


Hello,

Ceph 0.94.5 for the record.

As some may remember, I phased in a 2TB cache tier 5 weeks ago.

By now it has reached about 60% usage, which is what I have
cache_target_dirty_ratio set to.

And for the last 3 days I could see some writes (op_in_bytes) to the
backing storage (aka HDD pool), which hadn't seen any write action for the
aforementioned 5 weeks.
Alas my graphite dashboard showed no flushes (tier_flush), whereas
tier_promote on the cache pool could always be matched more or less to
op_out_bytes on the HDD pool.

The documentation (RH site) just parrots the names of the various perf
counters, so no help there. OK, let's look at what we've got:
---
"tier_promote": 49776,
"tier_flush": 0,
"tier_flush_fail": 0,
"tier_try_flush": 558,
"tier_try_flush_fail": 0,
"agent_flush": 558,
"tier_evict": 0,
"agent_evict": 0,
---
Lots of promotions, that's fine.
Not a single tier_flush, er, wot? So what does this counter denote then?
OK, clearly tier_try_flush and agent_flush are where the flushing is
actually recorded (in my test cluster they differ, as I have run that
cluster against the wall several times).
No evictions yet; those will happen at 90% usage.

So now I changed the graph data source for flushes to tier_try_flush;
however, that does not match most of the op_in_bytes (or any other counter
I tried!) on the HDDs.
As in, there are flushes but no activity on the HDD OSDs as far as Ceph
is concerned.
I can, however, match the flushes to actual disk activity on the HDDs
(gathered by collectd), which are otherwise totally dormant.

Can somebody shed some light on this? Is it a known problem, or in need of
a bug report?

Christian
-- 
Christian Balzer            Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/


Re: [ceph-users] adding cache tier in productive hammer environment

2016-04-06 Thread Christian Balzer

Hello,

On Wed, 6 Apr 2016 20:35:20 +0200 Oliver Dzombic wrote:

> Hi,
> 
> i have some IO issues, and after Christian's great article/hint about
> caches i plan to add caches too.
> 
Thanks, version 2 is still a work in progress, as I keep running into
unknowns. 

IO issues in what sense, like in too many write IOPS for the current HW to
sustain? 
Also, what are you using Ceph for, RBD hosting VM images?

It will help you a lot if you can identify and quantify the usage patterns
(including a rough idea on how many hot objects you have) and where you
run into limits.

> So now comes the troublesome question:
> 
> How dangerous is it to add cache tiers to an existing cluster with
> around 30 OSDs and 40 TB of data on 3-6 (currently reducing) nodes?
> 
You're reducing nodes? Why? 
More nodes/OSDs equates to more IOPS in general.

40TB is a sizable amount of data; how many objects does your cluster hold?
Also, is that raw data or after replication (size 3?)?
In short, "ceph -s" output please. ^.^

> I mean, will everything just explode, or what is the road map for
> introducing this once you already have a running cluster?
> 
That's pretty much straightforward from the Ceph docs at:
http://docs.ceph.com/docs/master/rados/operations/cache-tiering/
(replace master with hammer if you're running that)

Nothing happens until the "set-overlay" bit and you will want to configure
all the pertinent bits before that.
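
For reference, the sequence boils down to something like the sketch below.
Pool names ("rbd" as base, "cache" as the SSD pool) and all sizing values
are placeholders you must adapt to your setup:

---
ceph osd tier add rbd cache
ceph osd tier cache-mode cache writeback
ceph osd pool set cache hit_set_type bloom
ceph osd pool set cache hit_set_count 1
ceph osd pool set cache hit_set_period 3600
ceph osd pool set cache target_max_bytes 2000000000000
ceph osd pool set cache cache_target_dirty_ratio 0.6
ceph osd pool set cache cache_target_full_ratio 0.9
ceph osd tier set-overlay rbd cache
---

Everything before the final set-overlay is inert; that last command is
what actually puts the tier into the IO path.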

A basic question is if you will have dedicated SSD cache tier hosts or
have the SSDs holding the cache pool in your current hosts.
Dedicated hosts have the advantages of matched HW (CPU power sized to the
SSDs) and simpler configuration; shared hosts can have the advantage of
spreading the network load further out instead of having everything go
through the cache tier nodes.

The size and length of the explosion will entirely depend on:
1) how capable your current cluster is, how (over)loaded it is.
2) the actual load/usage at the time you phase the cache tier in
3) the amount of "truly hot" objects you have.

As I wrote here:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-February/007933.html

In my case, with a BADLY overloaded base pool and a constant stream of
log/status writes (4-5MB/s, 1000 IOPS) from 200 VMs, it all stabilized
after 10 minutes.

Truly hot objects as mentioned above will be those (in the case of VM
images) holding active directory inodes and files.


> Anything that needs to be considered? Dangerous no-nos?
> 
> It will also happen that I have to add the cache tier servers one by
> one, and not all at the same time.
> 
You want at least 2 cache tier servers from the start and well known,
well tested (LSI timeouts!) SSDs in them.

Christian

> I am happy for any kind of advice.
> 
> Thank you !
> 


-- 
Christian Balzer            Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/


[ceph-users] Performance counters oddities, cache tier and otherwise

2016-04-06 Thread Christian Balzer

Hello,

Ceph 0.94.5 for the record.

As some may remember, I phased in a 2TB cache tier 5 weeks ago.

By now it has reached about 60% usage, which is what I have
cache_target_dirty_ratio set to.
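
For reference, that ratio is the usual pool knob ("cache" standing in for
my actual cache pool name):

---
ceph osd pool set cache cache_target_dirty_ratio 0.6
---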

And for the last 3 days I could see some writes (op_in_bytes) to the
backing storage (aka HDD pool), which hadn't seen any write action for the
aforementioned 5 weeks.
Alas my graphite dashboard showed no flushes (tier_flush), whereas
tier_promote on the cache pool could always be matched more or less to
op_out_bytes on the HDD pool.

The documentation (RH site) just parrots the names of the various perf
counters, so no help there. OK, let's look at what we've got:
---
"tier_promote": 49776,
"tier_flush": 0,
"tier_flush_fail": 0,
"tier_try_flush": 558,
"tier_try_flush_fail": 0,
"agent_flush": 558,
"tier_evict": 0,
"agent_evict": 0,
---
Lots of promotions, that's fine.
Not a single tier_flush, er, wot? So what does this counter denote then?
OK, clearly tier_try_flush and agent_flush are where the flushing is
actually recorded (in my test cluster they differ, as I have run that
cluster against the wall several times).
No evictions yet; those will happen at 90% usage.
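
For anyone wanting to pull the same numbers: they come from the OSD admin
socket, with osd.0 standing in for one of the cache tier OSDs:

---
ceph daemon osd.0 perf dump | grep -E '"(tier|agent)'
---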

So now I changed the graph data source for flushes to tier_try_flush;
however, that does not match most of the op_in_bytes (or any other counter
I tried!) on the HDDs.
As in, there are flushes but no activity on the HDD OSDs as far as Ceph
is concerned.
I can, however, match the flushes to actual disk activity on the HDDs
(gathered by collectd), which are otherwise totally dormant.

Can somebody shed some light on this? Is it a known problem, or in need of
a bug report?

Christian
-- 
Christian Balzer            Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/


Re: [ceph-users] Maximizing OSD to PG quantity

2016-04-06 Thread Christian Balzer

Hello,

On Wed, 6 Apr 2016 18:15:57 + David Turner wrote:

> You can mitigate how much it affects the IO, but at the cost of how long
> it will take to complete.
> 
> ceph tell osd.* injectargs '--osd-max-backfills #'
> 
Also have a read of:
https://www.mail-archive.com/ceph-users@lists.ceph.com/msg27970.html
for more knobs to twiddle.
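
A minimal example of the runtime throttling (the values are illustrative,
not recommendations, and need to go into ceph.conf to survive a restart):

---
ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'
# equivalent persistent settings in the [osd] section of ceph.conf:
#   osd max backfills = 1
#   osd recovery max active = 1
---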

> Where # is the maximum number of PGs any OSD will backfill at any
> given time.  This is the same setting that is used when you add, remove,
> lose, or reweight osds in your cluster.  The lower the number, the less
> impact to cluster IO but the longer it will take to finish the task.
> Max-backfills of 5 seems to work out well enough to get through things
> in a timely manner while not critically impacting IO.  I do up that to
> 20 if I need speed more than IO.  These numbers are very dependent on
> your individual hardware and configuration.
>
Very, very true words.

Which brings me to the OP: you haven't told us your cluster details.
12 OSDs sounds like 2 hosts with 6 OSDs each to me.
If that's the case, you'll need/want a 3rd host. 

If you already have 3 or more storage nodes, you can go ahead with the
replica increase, but note that this will not only reduce your storage
capacity accordingly but also have an impact on performance, as one more
OSD will have to ACK each write. This will be particularly noticeable with
non-SSD journals, but the additional network latency will be there in any
case.
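
If you do go ahead, the change itself is a single pool setting per pool;
just throttle backfills first. A sketch, with "rbd" as a placeholder pool
name:

---
ceph tell osd.* injectargs '--osd-max-backfills 1'
ceph osd pool set rbd size 3
ceph osd pool set rbd min_size 2
---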

Christian

> From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of
> Oliver Dzombic [i...@ip-interactive.de]
> Sent: Wednesday, April 06, 2016 11:45 AM
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Maximizing OSD to PG quantity
> 
> Hi,
> 
> huge, deadly, IO :-)
> 
> Imagine, everything has to be replicated one more time. That's nothing
> that will go smoothly :-)
> 
> --
> Mit freundlichen Gruessen / Best regards
> 
> Oliver Dzombic
> IP-Interactive
> 
> mailto:i...@ip-interactive.de
> 
> Anschrift:
> 
> IP Interactive UG ( haftungsbeschraenkt )
> Zum Sonnenberg 1-3
> 63571 Gelnhausen
> 
> HRB 93402 beim Amtsgericht Hanau
> Geschäftsführung: Oliver Dzombic
> 
> Steuer Nr.: 35 236 3622 1
> UST ID: DE274086107
> 
> 
> Am 06.04.2016 um 16:41 schrieb d...@integrityhost.com:
> > Will changing the replication size from 2 to 3 cause huge I/O resources
> > to be used, or does this happen quietly in the background?
> >
> >
> > On 2016-04-06 00:40, Christian Balzer wrote:
> >> Hello,
> >>
> >> Brian already mentioned a number of very pertinent things, I've got a few
> >> more:
> >>
> >> On Tue, 05 Apr 2016 10:48:49 -0400 d...@integrityhost.com wrote:
> >>
> >>> In a 12 OSD setup, the following config is there:
> >>>
> >>>                (OSDs * 100)
> >>> Total PGs = ----------------
> >>>                  pool size
> >>>
> >>
> >> The PGcalc page at http://ceph.com/pgcalc/ is quite helpful and
> >> contains a
> >> lot of background info as well.
> >>
> >> As Brian said, you can never decrease PG count, but growing it is
> >> also a very I/O intensive operation and you want to avoid that as
> >> much as possible.
> >>
> >>>
> >>> So with 12 OSD's and a pool size of 2 replicas, this would equal
> >>> Total PGs of 600 as per this url:
> >> PGcalc with a target of 200 PGs per OSD (doubling of cluster size
> >> expected) gives us 1024, which is also what I would go for myself.
> >>
> >> However, if this is a production cluster and your OSDs are NOT RAID1 or
> >> very, very reliable, fast and well monitored SSDs, you're basically
> >> asking Murphy to come visit, destroying your data while eating babies
> >> and washing them down with bath water.
> >>
> >> The default replication size was changed to 3 for a very good reason,
> >> there are plenty of threads in this ML about failure scenarios and
> >> probabilities.
> >>
> >> Christian
> >>
> >>>
> >>> http://docs.ceph.com/docs/master/rados/operations/placement-groups/#preselection
> >>>
> >>>
> >>> Yet in the same page, at the top it says:
> >>>
> >>> Between 10 and 50 OSDs set pg_num to 4096
> >>>
> >>> Our use is for shared hosting so there are lots of small writes and
> >>> reads.  Which of these would be correct?
> >>>
> >>> Also is it a simple process to update PGs on a live system without
> >>> affecting service?



Re: [ceph-users] cephfs rm -rf on directory of 160TB /40M files

2016-04-06 Thread John Spray
On Wed, Apr 6, 2016 at 10:42 PM, Scottix  wrote:
> I have been running some speed tests on POSIX file operations and I noticed
> even just listing files can take a while compared to an attached HDD. I am
> wondering whether there is a reason it takes so long to even just list files.

If you're running comparisons, it would really be more instructive to
compare ceph with something like an NFS server, rather than a local
filesystem.

> Here is the test I ran
>
> time for i in {1..10}; do touch $i; done
>
> Internal HDD:
> real 4m37.492s
> user 0m18.125s
> sys 1m5.040s
>
> Ceph Dir
> real 12m30.059s
> user 0m16.749s
> sys 0m53.451s
>
> ~300% faster on HDD
>
> *I am actually OK with this, but it would be nice if it were quicker.
>
> When I am listing the directory it is taking a lot longer compared to an
> attached HDD
>
> time ls -1
>
> Internal HDD
> real 0m2.112s
> user 0m0.560s
> sys 0m0.440s
>
> Ceph Dir
> real 3m35.982s
> user 0m2.788s
> sys 0m4.580s
>
> ~1000% faster on HDD
>
> *I understand there is some time spent in the display, so what really
> makes it odd is the following test.
>
> time ls -1 > /dev/null
>
> Internal HDD
> real 0m0.367s
> user 0m0.324s
> sys 0m0.040s
>
> Ceph Dir
> real 0m2.807s
> user 0m0.128s
> sys 0m0.052s

If the difference when sending to /dev/null is reproducible (not just
a cache artifact), I would suspect that your `ls` is noticing that
it's not talking to a tty, so it's not bothering to color things, so
it's not individually statting each file to decide what color to make
it.  On network filesystems, "ls -l" (or colored ls) is often much
slower than a straight directory listing.
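
An easy way to test that theory (standard GNU ls options; the backslash
bypasses any ls alias):

---
time \ls --color=never -1 > /dev/null    # pure directory listing
time \ls --color=always -1 > /dev/null   # stats every entry for coloring
time \ls -l > /dev/null                  # also stats every entry
---

If only the colored/long variants are slow, it's the per-entry stats that
hurt, not the readdir itself.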

Cheers,
John

> ~700% faster on HDD
>
> My guess is that the performance issue is with the batched requests, as
> you stated. So I am wondering if the deletion of the 40M files is not
> just the unlinking itself; even just traversing that many files takes a
> while.

It's an unhappy feedback combination of listing them, sending N
individual unlink operations, and then the MDS getting bogged down in
the resulting purges while it's still trying to handle incoming unlink
requests.

> I am running this on 0.94.6 with Ceph Fuse Client
> And config
> fuse multithreaded = false
>
> Since multithreaded crashes in hammer.
>
> It would be interesting to see the performance on newer versions.
>
> Any thoughts or comments would be good.
>
> On Tue, Apr 5, 2016 at 9:22 AM Gregory Farnum  wrote:
>>
>> On Mon, Apr 4, 2016 at 9:55 AM, Gregory Farnum  wrote:
>> > Deletes are just slow right now. You can look at the ops in flight on
>> > your client or MDS admin socket to see how far along it is and watch
>> > them to see how long stuff is taking -- I think it's a sync disk commit
>> > for each unlink though, so at 40M it's going to be a good looong
>> > while. :/
>> > -Greg
>>
>> Oh good, I misremembered — it's a synchronous request to the MDS, but
>> it's not a synchronous disk commit. They get batched up normally in
>> the metadata log. :)
>> Still, a sync MDS request can take a little bit of time. Someday we
>> will make the client able to respond to these more quickly locally and
>> batch up MDS requests or something, but it'll be tricky. Faster file
>> creates will probably come first. (If we're lucky they can use some of
>> the same client-side machinery.)
>> -Greg
>>
>> >
>> >
>> > On Monday, April 4, 2016, Kenneth Waegeman 
>> > wrote:
>> >>
>> >> Hi all,
>> >>
>> >> I want to remove a large directory containing +- 40M files / 160TB of
>> >> data in CephFS by running rm -rf on the directory via the ceph kernel
>> >> client.
>> >> After 7h, the rm command is still running. I checked the rados df
>> >> output, and saw that only about 2TB and 2M files are gone.
>> >> I know this output of rados df can be confusing because ceph should
>> >> delete objects asynchronously, but then I don't know why the rm command
>> >> still hangs.
>> >> Is there some way to speed this up? And is there a way to check how far
>> >> the marking for deletion has progressed?
>> >>
>> >> Thank you very much!
>> >>
>> >> Kenneth
>> >>


Re: [ceph-users] cephfs rm -rf on directory of 160TB /40M files

2016-04-06 Thread Gregory Farnum
On Wed, Apr 6, 2016 at 2:42 PM, Scottix  wrote:
> I have been running some speed tests on POSIX file operations and I noticed
> even just listing files can take a while compared to an attached HDD. I am
> wondering whether there is a reason it takes so long to even just list files.
>
> Here is the test I ran
>
> time for i in {1..10}; do touch $i; done
>
> Internal HDD:
> real 4m37.492s
> user 0m18.125s
> sys 1m5.040s
>
> Ceph Dir
> real 12m30.059s
> user 0m16.749s
> sys 0m53.451s
>
> ~300% faster on HDD
>
> *I am actually OK with this, but it would be nice if it were quicker.
>
> When I am listing the directory it is taking a lot longer compared to an
> attached HDD
>
> time ls -1
>
> Internal HDD
> real 0m2.112s
> user 0m0.560s
> sys 0m0.440s
>
> Ceph Dir
> real 3m35.982s
> user 0m2.788s
> sys 0m4.580s
>
> ~1000% faster on HDD

This might be a bad interaction between your MDS cache size and the
size of the directory. The subsequent run is a lot faster because
after running an "ls" once you've got most of the information you need
for it cached locally on the client (but perhaps not all of it,
depending on various things).
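
You can check (and carefully raise) the cache size via the MDS admin
socket; "mds.a" below is a placeholder for your MDS id, and the value is a
count of inodes, not bytes:

---
ceph daemon mds.a config get mds_cache_size
ceph daemon mds.a config set mds_cache_size 1000000
---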

>
> *I understand there is some time spent in the display, so what really
> makes it odd is the following test.
>
> time ls -1 > /dev/null
>
> Internal HDD
> real 0m0.367s
> user 0m0.324s
> sys 0m0.040s
>
> Ceph Dir
> real 0m2.807s
> user 0m0.128s
> sys 0m0.052s
>
> ~700% faster on HDD
>
> My guess is that the performance issue is with the batched requests, as
> you stated. So I am wondering if the deletion of the 40M files is not
> just the unlinking itself; even just traversing that many files takes a
> while.
>
> I am running this on 0.94.6 with Ceph Fuse Client
> And config
> fuse multithreaded = false
>
> Since multithreaded crashes in hammer.

Oh, that's probably hurting things in various ways. The fix for
http://tracker.ceph.com/issues/13729 ended up getting into the hammer
branch after all and should go out whenever there's another stable
release, FYI.


Re: [ceph-users] cephfs rm -rf on directory of 160TB /40M files

2016-04-06 Thread Scottix
I have been running some speed tests on POSIX file operations and I noticed
even just listing files can take a while compared to an attached HDD. I am
wondering whether there is a reason it takes so long to even just list files.

Here is the test I ran

time for i in {1..10}; do touch $i; done

Internal HDD:
real 4m37.492s
user 0m18.125s
sys 1m5.040s

Ceph Dir
real 12m30.059s
user 0m16.749s
sys 0m53.451s

~300% faster on HDD

*I am actually OK with this, but it would be nice if it were quicker.

When I am listing the directory it is taking a lot longer compared to an
attached HDD

time ls -1

Internal HDD
real 0m2.112s
user 0m0.560s
sys 0m0.440s

Ceph Dir
real 3m35.982s
user 0m2.788s
sys 0m4.580s

~1000% faster on HDD

*I understand there is some time spent in the display, so what really
makes it odd is the following test.

time ls -1 > /dev/null

Internal HDD
real 0m0.367s
user 0m0.324s
sys 0m0.040s

Ceph Dir
real 0m2.807s
user 0m0.128s
sys 0m0.052s

~700% faster on HDD

My guess is that the performance issue is with the batched requests, as you
stated. So I am wondering if the deletion of the 40M files is not just the
unlinking itself; even just traversing that many files takes a while.

I am running this on 0.94.6 with Ceph Fuse Client
And config
fuse multithreaded = false

Since multithreaded crashes in hammer.

It would be interesting to see the performance on newer versions.

Any thoughts or comments would be good.

On Tue, Apr 5, 2016 at 9:22 AM Gregory Farnum  wrote:

> On Mon, Apr 4, 2016 at 9:55 AM, Gregory Farnum  wrote:
> > Deletes are just slow right now. You can look at the ops in flight on
> > your client or MDS admin socket to see how far along it is and watch
> > them to see how long stuff is taking -- I think it's a sync disk commit
> > for each unlink though, so at 40M it's going to be a good looong
> > while. :/
> > -Greg
>
> Oh good, I misremembered — it's a synchronous request to the MDS, but
> it's not a synchronous disk commit. They get batched up normally in
> the metadata log. :)
> Still, a sync MDS request can take a little bit of time. Someday we
> will make the client able to respond to these more quickly locally and
> batch up MDS requests or something, but it'll be tricky. Faster file
> creates will probably come first. (If we're lucky they can use some of
> the same client-side machinery.)
> -Greg
>
> >
> >
> > On Monday, April 4, 2016, Kenneth Waegeman 
> > wrote:
> >>
> >> Hi all,
> >>
> >> I want to remove a large directory containing +- 40M files / 160TB of
> >> data in CephFS by running rm -rf on the directory via the ceph kernel
> >> client.
> >> After 7h, the rm command is still running. I checked the rados df
> >> output, and saw that only about 2TB and 2M files are gone.
> >> I know this output of rados df can be confusing because ceph should
> >> delete objects asynchronously, but then I don't know why the rm command
> >> still hangs.
> >> Is there some way to speed this up? And is there a way to check how far
> >> the marking for deletion has progressed?
> >>
> >> Thank you very much!
> >>
> >> Kenneth
> >>


[ceph-users] adding cache tier in productive hammer environment

2016-04-06 Thread Oliver Dzombic
Hi,

I have some IO issues, and after Christian's great article/hint about
caches, I plan to add caches too.

So now comes the troublesome question:

How dangerous is it to add cache tiers to an existing cluster with
around 30 OSDs and 40 TB of data on 3-6 (currently reducing) nodes?

I mean, will everything just explode, or what is the road map for
introducing this once you already have a running cluster?

Anything that needs to be considered? Dangerous no-nos?

It will also happen that I have to add the cache tier servers one by
one, and not all at the same time.

I am happy for any kind of advice.

Thank you !

-- 
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:i...@ip-interactive.de

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107



Re: [ceph-users] Maximizing OSD to PG quantity

2016-04-06 Thread David Turner
You can mitigate how much it affects the IO, but at the cost of how long it
will take to complete.

ceph tell osd.* injectargs '--osd-max-backfills #'

Where # is the maximum number of PGs any OSD will backfill at any given
time.  This is the same setting that is used when you add, remove, lose, or 
reweight osds in your cluster.  The lower the number, the less impact to 
cluster IO but the longer it will take to finish the task.  Max-backfills of 5 
seems to work out well enough to get through things in a timely manner while 
not critically impacting IO.  I do up that to 20 if I need speed more than IO.  
These numbers are very dependent on your individual hardware and configuration.
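
To verify what a given OSD actually ended up with after the injectargs
(osd.0 being just an example):

ceph daemon osd.0 config show | grep osd_max_backfills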

From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Oliver 
Dzombic [i...@ip-interactive.de]
Sent: Wednesday, April 06, 2016 11:45 AM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Maximizing OSD to PG quantity

Hi,

huge, deadly, IO :-)

Imagine, everything has to be replicated one more time. That's nothing
that will go smoothly :-)

--
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:i...@ip-interactive.de

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107


Am 06.04.2016 um 16:41 schrieb d...@integrityhost.com:
> Will changing the replication size from 2 to 3 cause huge I/O resources
> to be used, or does this happen quietly in the background?
>
>
> On 2016-04-06 00:40, Christian Balzer wrote:
>> Hello,
>>
>> Brian already mentioned a number of very pertinent things, I've got a few
>> more:
>>
>> On Tue, 05 Apr 2016 10:48:49 -0400 d...@integrityhost.com wrote:
>>
>>> In a 12 OSD setup, the following config is there:
>>>
>>>                (OSDs * 100)
>>> Total PGs = ----------------
>>>                  pool size
>>>
>>
>> The PGcalc page at http://ceph.com/pgcalc/ is quite helpful and
>> contains a
>> lot of background info as well.
>>
>> As Brian said, you can never decrease PG count, but growing it is also a
>> very I/O intensive operation and you want to avoid that as much as
>> possible.
>>
>>>
>>> So with 12 OSD's and a pool size of 2 replicas, this would equal Total
>>> PGs of 600 as per this url:
>> PGcalc with a target of 200 PGs per OSD (doubling of cluster size
>> expected) gives us 1024, which is also what I would go for myself.
>>
>> However, if this is a production cluster and your OSDs are NOT RAID1 or
>> very, very reliable, fast and well monitored SSDs, you're basically
>> asking Murphy to come visit, destroying your data while eating babies
>> and washing them down with bath water.
>>
>> The default replication size was changed to 3 for a very good reason,
>> there are plenty of threads in this ML about failure scenarios and
>> probabilities.
>>
>> Christian
>>
>>>
>>> http://docs.ceph.com/docs/master/rados/operations/placement-groups/#preselection
>>>
>>>
>>> Yet in the same page, at the top it says:
>>>
>>> Between 10 and 50 OSDs set pg_num to 4096
>>>
>>> Our use is for shared hosting so there are lots of small writes and
>>> reads.  Which of these would be correct?
>>>
>>> Also is it a simple process to update PGs on a live system without
>>> affecting service?


[ceph-users] Ceph Day Sunnyvale Presentations

2016-04-06 Thread Patrick McGarry
Hey cephers,

I have all but one of the presentations from Ceph Day Sunnyvale, so
rather than wait for a full hand I went ahead and posted the link to
the slides on the event page:

http://ceph.com/cephdays/ceph-day-sunnyvale/

The videos probably won't be processed until after next week, but I'll
add those once we get them. Thanks to all of the presenters and
attendees that made this another great event.


-- 

Best Regards,

Patrick McGarry
Director Ceph Community || Red Hat
http://ceph.com  ||  http://community.redhat.com
@scuttlemonkey || @ceph


Re: [ceph-users] Maximizing OSD to PG quantity

2016-04-06 Thread dan
Will changing the replication size from 2 to 3 cause huge I/O resources 
to be used, or does this happen quietly in the background?



On 2016-04-06 00:40, Christian Balzer wrote:

Hello,

Brian already mentioned a number of very pertinent things, I've got a few
more:

On Tue, 05 Apr 2016 10:48:49 -0400 d...@integrityhost.com wrote:


In a 12 OSD setup, the following config is there:

               (OSDs * 100)
Total PGs = ----------------
                 pool size



The PGcalc page at http://ceph.com/pgcalc/ is quite helpful and contains a
lot of background info as well.

As Brian said, you can never decrease PG count, but growing it is also a
very I/O intensive operation and you want to avoid that as much as
possible.



So with 12 OSD's and a pool size of 2 replicas, this would equal Total
PGs of 600 as per this url:

PGcalc with a target of 200 PGs per OSD (doubling of cluster size
expected) gives us 1024, which is also what I would go for myself.
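
Spelled out (assuming 12 OSDs now, doubling expected, size 2, and "rbd" as
a placeholder pool name):

(12 OSDs * 200 PGs/OSD) / 2 replicas = 1200, rounded to a power of two: 1024

ceph osd pool set rbd pg_num 1024
ceph osd pool set rbd pgp_num 1024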

However, if this is a production cluster and your OSDs are NOT RAID1 or
very, very reliable, fast and well monitored SSDs, you're basically asking
Murphy to come visit, destroying your data while eating babies and washing
them down with bath water.

The default replication size was changed to 3 for a very good reason,
there are plenty of threads in this ML about failure scenarios and
probabilities.

Christian



http://docs.ceph.com/docs/master/rados/operations/placement-groups/#preselection

Yet in the same page, at the top it says:

Between 10 and 50 OSDs set pg_num to 4096

Our use is for shared hosting so there are lots of small writes and
reads.  Which of these would be correct?

Also is it a simple process to update PGs on a live system without
affecting service?


Re: [ceph-users] ceph rbd object write is atomic?

2016-04-06 Thread Jason Dillaman
If you can guarantee that your write will be wholly contained within an
object (and within a stripe), you should be able to consider the writes to
be atomic between two clients, since the OSD will process the two writes in
sequence (all ops are executed in order for a given placement group).
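
To see what "wholly contained within an object" works out to for a given
image, check its layout; pool and image names below are placeholders:

rbd info rbd/myimage

An "order 22" image uses 4 MiB objects, so absent fancy striping, a write
that stays inside a single 4 MiB-aligned window lands in one object.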

-- 

Jason Dillaman 


- Original Message -

> From: "min fang" 
> To: "Jason Dillaman" 
> Cc: "ceph-users" 
> Sent: Wednesday, April 6, 2016 9:37:06 AM
> Subject: Re: [ceph-users] ceph rbd object write is atomic?

> Thanks Jason, yes, I also do not think they can guarantee atomicity at the
> extent level. But for a stripe unit within an object, can the atomic write
> be guaranteed? thanks.

> 2016-04-06 19:53 GMT+08:00 Jason Dillaman < dilla...@redhat.com > :

> > It's possible for a write to span one or more blocks -- it just depends
> > on the write address/size and the RBD image layout (object size, "fancy"
> > striping, etc). Regardless, however, RBD cannot provide any ordering
> > guarantees when two clients are writing to the same image at the same
> > extent. To safely use two or more clients concurrently on the same image
> > you need a clustering filesystem on top of RBD (e.g. GFS2) or the
> > application needs to provide its own coordination to avoid concurrent
> > writes to the same image extents.

> > --

> > Jason Dillaman

> > - Original Message -

> > > From: "min fang" < louisfang2...@gmail.com >
> > > To: "ceph-users" < ceph-users@lists.ceph.com >
> > > Sent: Tuesday, April 5, 2016 10:11:10 PM
> > > Subject: [ceph-users] ceph rbd object write is atomic?

> > > Hi, as my understanding, a ceph rbd image will be divided into
> > > multiple objects based on LBA address.

> > > My question here is:

> > > if two clients write to the same LBA address, such as client A writes
> > > "aaaa" to LBA 0x123456 and client B writes "bbbb" to the same LBA.

> > > The LBA address and data will only be in one object, not across two
> > > objects.

> > > Will ceph guarantee the object data must be "aaaa" or "bbbb"? "aabb",
> > > "bbaa" will not happen even in a striped data layout model?

> > > thanks.


Re: [ceph-users] ceph rbd object write is atomic?

2016-04-06 Thread min fang
Thanks Jason, yes, I also do not think they can guarantee atomicity at the
extent level. But for a stripe unit within an object, can the atomic write
be guaranteed? thanks.

2016-04-06 19:53 GMT+08:00 Jason Dillaman :

> It's possible for a write to span one or more blocks -- it just depends on
> the write address/size and the RBD image layout (object size, "fancy"
> striping, etc).  Regardless, however, RBD cannot provide any ordering
> guarantees when two clients are writing to the same image at the same
> extent.  To safely use two or more clients concurrently on the same image
> you need a clustering filesystem on top of RBD (e.g. GFS2) or the
> application needs to provide its own coordination to avoid concurrent
> writes to the same image extents.
>
> --
>
> Jason Dillaman
>
>
> - Original Message -
>
> > From: "min fang" 
> > To: "ceph-users" 
> > Sent: Tuesday, April 5, 2016 10:11:10 PM
> > Subject: [ceph-users] ceph rbd object write is atomic?
>
> > Hi, as my understanding, a ceph rbd image will be divided into multiple
> > objects based on LBA address.
>
> > My question here is:
>
> > if two clients write to the same LBA address, such as client A writes
> > "aaaa" to LBA 0x123456 and client B writes "bbbb" to the same LBA.
>
> > The LBA address and data will only be in one object, not across two
> > objects.
>
> > Will ceph guarantee the object data must be "aaaa" or "bbbb"? "aabb",
> > "bbaa" will not happen even in a striped data layout model?
>
> > thanks.