Re: [ceph-users] Replication vs Erasure Coding with only 2 elements in the failure-domain.

2017-03-07 Thread Burkhard Linke

Hi,


On 03/07/2017 05:53 PM, Francois Blondel wrote:


Hi all,


We have (only) 2 separate "rooms" (crush bucket) and would like to 
build a cluster being able to handle the complete loss of one room.




*snipsnap*


Second idea would be to use Erasure Coding, as it fits our performance 
requirements and would use less raw space.



Creating an EC profile like:

   “ceph osd erasure-code-profile set eck2m2room k=2 m=2 
ruleset-failure-domain=room”


and a pool using that EC profile, with “ceph osd pool create ecpool 
128 128 erasure eck2m2room” of course leads to having “128 
creating+incomplete” PGs, as we only have 2 rooms.



Is there somehow a way to store the “parity chuncks” (m) on both 
rooms, so that the loss of a room would be possible ?



If I understood correctly, an Erasure Coding of for example k=2, m=2, 
would use the same space as a replication with a size of 2, but be 
more reliable, as we could afford the loss of more OSDs at the same time.


Would it be possible to instruct the crush rule to store the first k 
and m chuncks in room 1, and the second k and m chuncks in room 2 ?




As far as I understand erasure coding, there's no special handling for 
parity versus data chunks. To assemble an EC object you just need any k 
chunks, regardless of whether they are data or parity chunks.


You should be able to distribute the chunks among the two rooms by creating 
a new crush rule:


- min_size 4
- max_size 4
- step take <room1>
- step chooseleaf firstn 2 type host
- step emit
- step take <room2>
- step chooseleaf firstn 2 type host
- step emit

I'm not 100% sure whether chooseleaf is correct here or whether another 
choose step is necessary to ensure that two OSDs from different hosts are 
chosen. The important point is using two choose-emit cycles with the 
correct starting buckets. Just insert the crush labels of your rooms.
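
Putting that together, a rough sketch of such a rule in crushmap syntax 
(room names and the ruleset id are placeholders, untested; for EC pools 
'indep' is usually used instead of 'firstn'):

rule eck2m2_two_rooms {
        ruleset 2
        type erasure
        min_size 4
        max_size 4
        step set_chooseleaf_tries 5
        step take room1
        step chooseleaf indep 2 type host
        step emit
        step take room2
        step chooseleaf indep 2 type host
        step emit
}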


This approach should work, but it has two drawbacks:

- crash handling
If a host in a room fails, the PGs from that host will be replicated to 
other hosts in the same room. You have to ensure that there's enough 
capacity in each room (as opposed to just enough capacity in the complete 
cluster), which might be tricky.


- bandwidth / host utilization
Almost all Ceph-based applications/libraries use the 'primary' OSD for 
accessing data in a PG. The primary OSD is the first one generated by the 
crush rule. In the example above, the primary OSDs will all be located in 
the first room, so all client traffic will be directed to hosts in that 
room. Depending on your setup this might not be desirable.


Unfortunately I'm not aware of a solution for this. It would require some 
PGs to start with 'step take <room1>' and others with 'step take <room2>', 
i.e. swapping the two take steps on a per-PG basis. That kind of iteration 
is not part of crush as far as I know. Maybe someone else can give some 
more insight into this.


Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG active+remapped even I have three hosts

2017-03-07 Thread Stefan Lissmats
Hello!

To me it looks like you have only one OSD on host Ceph-Stress-02, and therefore a 
weight of 1 on that host versus 7 on the others. If you want three replicas across 
only three hosts, you need about the same storage capacity on all hosts.
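
You can verify this offline with crushtool (rule id 1 taken from your mail; the 
bad mappings should disappear once the three hosts have comparable weights):

ceph osd getcrushmap -o crushmap
crushtool --test -i crushmap --rule 1 --num-rep 3 --show-bad-mappings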




On Wed, Mar 8, 2017 at 4:50 AM +0100, "TYLin" <wooer...@gmail.com> wrote:


Hi all,

We got 4 PGs active+remapped in our cluster. With the pool's ruleset set to 
ruleset 0 we got HEALTH_OK. After we set it to ruleset 1, 4 PGs are 
active+remapped. The test result from crushtool also shows some bad mappings 
exist. Does anyone happen to know the reason?



pool 0 'rbd' replicated size 3 min_size 2 crush_ruleset 1 object_hash rjenkins 
pg_num 256 pgp_num 256 last_change 421 flags hashpspool stripe_width 0

[root@Ceph-Stress-01 ~]# ceph pg ls remapped
PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES LOG DISK_LOG STATE           STATE_STAMP                VERSION REPORTED UP     UP_PRIMARY ACTING    ACTING_PRIMARY LAST_SCRUB SCRUB_STAMP                LAST_DEEP_SCRUB DEEP_SCRUB_STAMP
0.33          0                  0        0         0       0     0   0        0 active+remapped 2017-03-07 22:53:02.453447 0'0     419:110  [3,22]          3 [3,22,4]               3 0'0        2017-03-07 13:29:32.110280 0'0             2017-03-07 13:29:32.110280
0.3b          0                  0        0         0       0     0   0        0 active+remapped 2017-03-07 22:53:02.619526 0'0     419:110  [2,20]          2 [2,20,17]              2 0'0        2017-03-07 13:29:32.110287 0'0             2017-03-07 13:29:32.110287
0.49          0                  0        0         0       0     0   0        0 active+remapped 2017-03-07 22:53:02.453239 0'0     419:104  [4,20]          4 [4,20,19]              4 0'0        2017-03-07 13:29:32.110257 0'0             2017-03-07 13:29:32.110257
0.54          0                  0        0         0       0     0   0        0 active+remapped 2017-03-07 22:53:02.101725 0'0     419:207  [19,3]         19 [19,3,20]             19 0'0        2017-03-07 13:29:32.110262 0'0             2017-03-07 13:29:32.110262


rule replicated_ruleset_one_host {
ruleset 0
type replicated
min_size 1
max_size 10
step take default
step choose firstn 0 type osd
step emit
}
rule replicated_ruleset_multi_host {
ruleset 1
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}

ID WEIGHT TYPE NAME               UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 15.0   root default
-2  7.0       host Ceph-Stress-01
 0  1.0           osd.0                up      1.0              1.0
 1  1.0           osd.1                up      1.0              1.0
 2  1.0           osd.2                up      1.0              1.0
 3  1.0           osd.3                up      1.0              1.0
 4  1.0           osd.4                up      1.0              1.0
 5  1.0           osd.5                up      1.0              1.0
 6  1.0           osd.6                up      1.0              1.0
-4  7.0       host Ceph-Stress-03
16  1.0           osd.16               up      1.0              1.0
17  1.0           osd.17               up      1.0              1.0
18  1.0           osd.18               up      1.0              1.0
19  1.0           osd.19               up      1.0              1.0
20  1.0           osd.20               up      1.0              1.0
21  1.0           osd.21               up      1.0              1.0
22  1.0           osd.22               up      1.0              1.0
-3  1.0       host Ceph-Stress-02
 7  1.0           osd.7                up      1.0              1.0


[root@Ceph-Stress-02 ~]# crushtool --test -i crushmap --rule 1 --min-x 1 
--max-x 5 --num-rep 3 --show-utilization --show-mappings --show-bad-mappings
rule 1 (replicated_ruleset_multi_host), x = 1..5, numrep = 3..3
CRUSH rule 1 x 1 [5,7,21]
CRUSH rule 1 x 2 [17,6]
bad mapping rule 1 x 2 num_rep 3 result [17,6]
CRUSH rule 1 x 3 [19,7,1]
CRUSH rule 1 x 4 [5,22,7]
CRUSH rule 1 x 5 [21,0]
bad mapping rule 1 x 5 num_rep 3 result [21,0]
rule 1 (replicated_ruleset_multi_host) num_rep 3 result size == 2:  2/5
rule 1 (replicated_ruleset_multi_host) num_rep 3 result size == 3:  3/5
  device 0:  stored : 1  expected : 1
  device 1:  stored : 1  expected : 1
  device 5:  stored : 2  expected : 1
  device 6:  stored : 1  expected : 1
  device 7:  stored : 3  expected : 1
  device 17: stored : 1  expected : 1
  device 19: stored : 1  expected : 1
  device 21: stored : 2  expected : 1
  device 22: stored : 1  expected : 1

Thanks,
Tim
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

[ceph-users] PG active+remapped even I have three hosts

2017-03-07 Thread TYLin
Hi all,

We got 4 PGs active+remapped in our cluster. With the pool's ruleset set to 
ruleset 0 we got HEALTH_OK. After we set it to ruleset 1, 4 PGs are 
active+remapped. The test result from crushtool also shows some bad mappings 
exist. Does anyone happen to know the reason?



pool 0 'rbd' replicated size 3 min_size 2 crush_ruleset 1 object_hash rjenkins 
pg_num 256 pgp_num 256 last_change 421 flags hashpspool stripe_width 0

[root@Ceph-Stress-01 ~]# ceph pg ls remapped
PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES LOG DISK_LOG STATE           STATE_STAMP                VERSION REPORTED UP     UP_PRIMARY ACTING    ACTING_PRIMARY LAST_SCRUB SCRUB_STAMP                LAST_DEEP_SCRUB DEEP_SCRUB_STAMP
0.33          0                  0        0         0       0     0   0        0 active+remapped 2017-03-07 22:53:02.453447 0'0     419:110  [3,22]          3 [3,22,4]               3 0'0        2017-03-07 13:29:32.110280 0'0             2017-03-07 13:29:32.110280
0.3b          0                  0        0         0       0     0   0        0 active+remapped 2017-03-07 22:53:02.619526 0'0     419:110  [2,20]          2 [2,20,17]              2 0'0        2017-03-07 13:29:32.110287 0'0             2017-03-07 13:29:32.110287
0.49          0                  0        0         0       0     0   0        0 active+remapped 2017-03-07 22:53:02.453239 0'0     419:104  [4,20]          4 [4,20,19]              4 0'0        2017-03-07 13:29:32.110257 0'0             2017-03-07 13:29:32.110257
0.54          0                  0        0         0       0     0   0        0 active+remapped 2017-03-07 22:53:02.101725 0'0     419:207  [19,3]         19 [19,3,20]             19 0'0        2017-03-07 13:29:32.110262 0'0             2017-03-07 13:29:32.110262


rule replicated_ruleset_one_host {
ruleset 0
type replicated
min_size 1
max_size 10
step take default
step choose firstn 0 type osd
step emit
}
rule replicated_ruleset_multi_host {
ruleset 1
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}

ID WEIGHT TYPE NAME               UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 15.0   root default
-2  7.0       host Ceph-Stress-01
 0  1.0           osd.0                up      1.0              1.0
 1  1.0           osd.1                up      1.0              1.0
 2  1.0           osd.2                up      1.0              1.0
 3  1.0           osd.3                up      1.0              1.0
 4  1.0           osd.4                up      1.0              1.0
 5  1.0           osd.5                up      1.0              1.0
 6  1.0           osd.6                up      1.0              1.0
-4  7.0       host Ceph-Stress-03
16  1.0           osd.16               up      1.0              1.0
17  1.0           osd.17               up      1.0              1.0
18  1.0           osd.18               up      1.0              1.0
19  1.0           osd.19               up      1.0              1.0
20  1.0           osd.20               up      1.0              1.0
21  1.0           osd.21               up      1.0              1.0
22  1.0           osd.22               up      1.0              1.0
-3  1.0       host Ceph-Stress-02
 7  1.0           osd.7                up      1.0              1.0


[root@Ceph-Stress-02 ~]# crushtool --test -i crushmap --rule 1 --min-x 1 
--max-x 5 --num-rep 3 --show-utilization --show-mappings --show-bad-mappings
rule 1 (replicated_ruleset_multi_host), x = 1..5, numrep = 3..3
CRUSH rule 1 x 1 [5,7,21]
CRUSH rule 1 x 2 [17,6]
bad mapping rule 1 x 2 num_rep 3 result [17,6]
CRUSH rule 1 x 3 [19,7,1]
CRUSH rule 1 x 4 [5,22,7]
CRUSH rule 1 x 5 [21,0]
bad mapping rule 1 x 5 num_rep 3 result [21,0]
rule 1 (replicated_ruleset_multi_host) num_rep 3 result size == 2:  2/5
rule 1 (replicated_ruleset_multi_host) num_rep 3 result size == 3:  3/5
  device 0:  stored : 1  expected : 1
  device 1:  stored : 1  expected : 1
  device 5:  stored : 2  expected : 1
  device 6:  stored : 1  expected : 1
  device 7:  stored : 3  expected : 1
  device 17: stored : 1  expected : 1
  device 19: stored : 1  expected : 1
  device 21: stored : 2  expected : 1
  device 22: stored : 1  expected : 1

Thanks,
Tim
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] MDS assert failed when shutting down

2017-03-07 Thread Xusangdi
Hi Cephers,

We occasionally hit an assertion failure when trying to shut down an MDS, as 
follows:

   -14> 2017-01-22 14:13:46.833804 7fd210c58700  2 -- 
192.168.36.11:6801/2188363 >> 192.168.36.48:6800/42546 pipe(0x558ff3803400 
sd=17 :52412 s=4 pgs=227 cs=1 l=1 c=0x558ff3758900).fault (0) Success
   -13> 2017-01-22 14:13:46.833802 7fd2092e6700  2 -- 
192.168.36.11:6801/2188363 >> 192.168.36.12:6813/4037017 pipe(0x558ff3802000 
sd=72 :32894 s=4 pgs=24 cs=1 l=1 c=0x558ffc199200).reader couldn't read tag, 
(0) Success
   -12> 2017-01-22 14:13:46.833831 7fd2092e6700  2 -- 
192.168.36.11:6801/2188363 >> 192.168.36.12:6813/4037017 pipe(0x558ff3802000 
sd=72 :32894 s=4 pgs=24 cs=1 l=1 c=0x558ffc199200).fault (0) Success
   -11> 2017-01-22 14:13:46.833884 7fd213861700  5 asok(0x558ff373a000) 
unregister_command objecter_requests
   -10> 2017-01-22 14:13:46.833896 7fd213861700 10 monclient: shutdown
-9> 2017-01-22 14:13:46.833901 7fd213861700  1 -- 
192.168.36.11:6801/2188363 mark_down 0x558ffc198600 -- 0x558ffa52e000
-8> 2017-01-22 14:13:46.833922 7fd2080d4700  2 -- 
192.168.36.11:6801/2188363 >> 192.168.36.12:6819/4037213 pipe(0x558ff500e800 
sd=82 :32834 s=4 pgs=25 cs=1 l=1 c=0x558ffc19bc00).reader couldn't read tag, 
(0) Success
-7> 2017-01-22 14:13:46.833943 7fd2080d4700  2 -- 
192.168.36.11:6801/2188363 >> 192.168.36.12:6819/4037213 pipe(0x558ff500e800 
sd=82 :32834 s=4 pgs=25 cs=1 l=1 c=0x558ffc19bc00).fault (0) Success
-6> 2017-01-22 14:13:46.833937 7fd214964700  2 -- 
192.168.36.11:6801/2188363 >> 192.168.36.11:6789/0 pipe(0x558ffa52e000 sd=8 
:52298 s=4 pgs=31815 cs=1 l=1 c=0x558ffc198600).reader couldn't read tag, (0) 
Success
-5> 2017-01-22 14:13:46.833954 7fd214964700  2 -- 
192.168.36.11:6801/2188363 >> 192.168.36.11:6789/0 pipe(0x558ffa52e000 sd=8 
:52298 s=4 pgs=31815 cs=1 l=1 c=0x558ffc198600).fault (0) Success
-4> 2017-01-22 14:13:46.833959 7fd210b57700  2 -- 
192.168.36.11:6801/2188363 >> 192.168.36.11:6800/678824 pipe(0x558ff3804800 
sd=18 :45286 s=4 pgs=198 cs=1 l=1 c=0x558ff3758c00).reader couldn't read tag, 
(0) Success
-3> 2017-01-22 14:13:46.833972 7fd210b57700  2 -- 
192.168.36.11:6801/2188363 >> 192.168.36.11:6800/678824 pipe(0x558ff3804800 
sd=18 :45286 s=4 pgs=198 cs=1 l=1 c=0x558ff3758c00).fault (0) Success
-2> 2017-01-22 14:13:46.834029 7fd20e437700  2 -- 
192.168.36.11:6801/2188363 >> 192.168.36.48:6804/42771 pipe(0x558ff5034000 
sd=33 :35778 s=4 pgs=300 cs=1 l=1 c=0x558ff375ba80).reader couldn't read tag, 
(0) Success
-1> 2017-01-22 14:13:46.834062 7fd20e437700  2 -- 
192.168.36.11:6801/2188363 >> 192.168.36.48:6804/42771 pipe(0x558ff5034000 
sd=33 :35778 s=4 pgs=300 cs=1 l=1 c=0x558ff375ba80).fault (0) Success
 0> 2017-01-22 14:13:46.836775 7fd21285f700 -1 osdc/Objecter.cc: In 
function 'void Objecter::_op_submit_with_budget(Objecter::Op*, 
Objecter::shunique_lock&, ceph_tid_t*, int*)' thread 7fd21285f700 time 
2017-01-22 14:13:46.834106
osdc/Objecter.cc: 2145: FAILED assert(initialized.read())

 ceph version 10.2.5 (53ded15a3fab78780028baa5681f578254e2b9df)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x88) 
[0x558fe7a1ca18]
 2: (Objecter::_op_submit_with_budget(Objecter::Op*, 
ceph::shunique_lock&, unsigned long*, int*)+0x3ad) 
[0x558fe78b068d]
 3: (Objecter::op_submit(Objecter::Op*, unsigned long*, int*)+0x6e) 
[0x558fe78b07ae]
 4: (Filer::_probe(Filer::Probe*, std::unique_lock&)+0xbea) 
[0x558fe788524a]
 5: (Filer::_probed(Filer::Probe*, object_t const&, unsigned long, 
std::chrono::time_point > >, 
std::unique_lock&)+0x9bb) [0x558fe788671b]
 6: (Filer::C_Probe::finish(int)+0x6c) [0x558fe7888dac]
 7: (Context::complete(int)+0x9) [0x558fe7606be9]
 8: (Finisher::finisher_thread_entry()+0x4c5) [0x558fe793e305]
 9: (()+0x8182) [0x7fd21d371182]
 10: (clone()+0x6d) [0x7fd21b8ba47d]
 NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
interpret this.

I have also opened a ticket here: tracker.ceph.com/issues/19204 . Any advice on 
how to fix this?

Regards,
--- Sandy
-
本邮件及其附件含有杭州华三通信技术有限公司的保密信息,仅限于发送给上面地址中列出
的个人或群组。禁止任何其他人以任何形式使用(包括但不限于全部或部分地泄露、复制、
或散发)本邮件中的信息。如果您错收了本邮件,请您立即电话或邮件通知发件人并删除本
邮件!
This e-mail and its attachments contain confidential information from H3C, 
which is
intended only for the person or entity whose address is listed above. Any use 
of the
information contained herein in any way (including, but not limited to, total 
or partial
disclosure, reproduction, or dissemination) by persons other than the intended
recipient(s) is prohibited. If you receive this e-mail in error, please notify 
the sender
by phone or email immediately and delete it!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] hammer to jewel upgrade experiences? cache tier experience?

2017-03-07 Thread Christian Balzer

[re-adding ML, so others may benefit]

On Tue, 7 Mar 2017 13:14:14 -0700 Mike Lovell wrote:

> On Mon, Mar 6, 2017 at 8:18 PM, Christian Balzer  wrote:
> 
> > On Mon, 6 Mar 2017 19:57:11 -0700 Mike Lovell wrote:
> >  
> > > has anyone on the list done an upgrade from hammer (something later than
> > > 0.94.6) to jewel with a cache tier configured? i tried doing one last  
> > week  
> > > and had a hiccup with it. i'm curious if others have been able to
> > > successfully do the upgrade and, if so, did they take any extra steps
> > > related to the cache tier?
> > >  
> > It would be extremely helpful for everybody involved if you could be bit
> > more specific than "hiccup".
> >  
> 
> the problem we had was osds in the cache tier were crashing and it made the
> cluster unusable for a while. http://tracker.ceph.com/issues/19185 is a
> tracker issue i made for it. i'm guessing not many others have seen the
> same issue. i'm just wondering if others have successfully done an upgrade
> with an active cache tier and how things went.
>
Yeah, I saw that a bit later, looks like you found/hit a genuine bug.
 
> I've upgraded one crappy test cluster from hammer to jewel w/o issues and
> > am about to do that on a more realistic, busier test cluster as well.
> >
I did upgrade the other test cluster, that had actual traffic (to/through
the cache) going on during the upgrade without any issues.

Maybe Kefu Chai can comment on why this is not something seen by everyone.
One thing I can think of is that I didn't change any defaults, in
particular "hit_set_period".

> > OTOH, I have no plans to upgrade my production Hammer cluster with a cache
> > tier at this point.
> >  
> 
> interesting. do you not have plans just because you are still testing? or
> is there just no desire or need to upgrade?
> 
All of the above.
That (small) cluster is serving 9 compute nodes and that whole
installation has reached its max build-out; it will NOT grow any further.
Hammer is working fine, and nobody involved is interested in upgrading things
willy-nilly (which would involve the compute nodes at some point as well)
for a service that needs to be as close to 24/7 as possible.

While I would like to eventually replace old HW consecutively about 3-4
years down the line and thus require "current" SW, migrating everything
off that installation and starting fresh is also an option.

If you do an upgrade of a compute node, you can live migrate things away
from it first and if it doesn't pan out, no harm done.

If you run into a "hiccup" with a Ceph upgrade (especially one that
doesn't manifest itself immediately on the first MON/OSD being upgraded),
your whole installation with (in my case) hundreds of VMs is dead in the
water, depending on the exact circumstances possibly for a prolonged period.

Not a particularly sunny or career-enhancing prospect.

Christian
-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MySQL and ceph volumes

2017-03-07 Thread Christian Balzer

Hello,

as Adrian pointed out, this is not really Ceph specific.

That being said, there are literally dozen of threads in this ML about
this issue and speeding up things in general, use your google-foo.
In particular Nick Fisk's articles are a good source for understanding
what is happening and how to minimize this within the limits by the laws of
physics.

Christian 

On Tue, 7 Mar 2017 23:37:46 + Adrian Saul wrote:

> The problem is not so much ceph, but the fact that sync workloads tend to 
> mean you have an effective queue depth of 1 because it serialises the IO from 
> the application, as it waits for the last write to complete before issuing 
> the next one.
> 
> 
> From: Matteo Dacrema [mailto:mdacr...@enter.eu]
> Sent: Wednesday, 8 March 2017 10:36 AM
> To: Adrian Saul
> Cc: ceph-users
> Subject: Re: [ceph-users] MySQL and ceph volumes
> 
> Thank you Adrian!
> 
> I’ve forgot this option and I can reproduce the problem.
> 
> Now, what could be the problem on ceph side with O_DSYNC writes?
> 
> Regards
> Matteo
> 
> 
> 
> This email and any files transmitted with it are confidential and intended 
> solely for the use of the individual or entity to whom they are addressed. If 
> you have received this email in error please notify the system manager. This 
> message contains confidential information and is intended only for the 
> individual named. If you are not the named addressee you should not 
> disseminate, distribute or copy this e-mail. Please notify the sender 
> immediately by e-mail if you have received this e-mail by mistake and delete 
> this e-mail from your system. If you are not the intended recipient you are 
> notified that disclosing, copying, distributing or taking any action in 
> reliance on the contents of this information is strictly prohibited.
> 
> On 8 Mar 2017, at 00:25, Adrian Saul <adrian.s...@tpgtelecom.com.au> wrote:
> 
> 
> Possibly MySQL is doing sync writes, where as your FIO could be doing 
> buffered writes.
> 
> Try enabling the sync option on fio and compare results.
> 
> 
> 
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Matteo Dacrema
> Sent: Wednesday, 8 March 2017 7:52 AM
> To: ceph-users
> Subject: [ceph-users] MySQL and ceph volumes
> 
> Hi All,
> 
> I have a galera cluster running on openstack with data on ceph volumes
> capped at 1500 iops for read and write ( 3000 total ).
> I can’t understand why with fio I can reach 1500 iops without IOwait and
> MySQL can reach only 150 iops both read or writes showing 30% of IOwait.
> 
> I tried with fio 64k block size and various io depth ( 1.2.4.8.16….128) and I
> can’t reproduce the problem.
> 
> Anyone can tell me where I’m wrong?
> 
> Thank you
> Regards
> Matteo
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> Confidentiality: This email and any attachments are confidential and may be 
> subject to copyright, legal or some other professional privilege. They are 
> intended solely for the attention and use of the named addressee(s). They may 
> only be copied, distributed or disclosed with the consent of the copyright 
> owner. If you have received this email by mistake or by breach of the 
> confidentiality clause, please notify the sender immediately by return email 
> and delete or destroy all copies of the email. Any confidentiality, privilege 
> or copyright is not waived or lost because this email has been sent to you by 
> mistake.
> 
> --
> Questo messaggio e' stato analizzato con Libra ESVA ed e' risultato non 
> infetto.
> Seguire il link qui sotto per segnalarlo come spam:
> http://mx01.enter.it/cgi-bin/learn-msg.cgi?id=13CCD402D0.AA534
> 
> 
> Confidentiality: This email and any attachments are confidential and may be 
> subject to copyright, legal or some other professional privilege. They are 
> intended solely for the attention and use of the named addressee(s). They may 
> only be copied, distributed or disclosed with the consent of the copyright 
> owner. If you have received this email by mistake or by breach of the 
> confidentiality clause, please notify the sender immediately by return email 
> and delete or destroy all copies of the email. Any confidentiality, privilege 
> or copyright is not waived or lost because this email has been sent to you by 
> mistake.


-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MySQL and ceph volumes

2017-03-07 Thread Adrian Saul

The problem is not so much ceph, but the fact that sync workloads tend to mean 
you have an effective queue depth of 1 because it serialises the IO from the 
application, as it waits for the last write to complete before issuing the next 
one.


From: Matteo Dacrema [mailto:mdacr...@enter.eu]
Sent: Wednesday, 8 March 2017 10:36 AM
To: Adrian Saul
Cc: ceph-users
Subject: Re: [ceph-users] MySQL and ceph volumes

Thank you Adrian!

I’ve forgot this option and I can reproduce the problem.

Now, what could be the problem on ceph side with O_DSYNC writes?

Regards
Matteo



This email and any files transmitted with it are confidential and intended 
solely for the use of the individual or entity to whom they are addressed. If 
you have received this email in error please notify the system manager. This 
message contains confidential information and is intended only for the 
individual named. If you are not the named addressee you should not 
disseminate, distribute or copy this e-mail. Please notify the sender 
immediately by e-mail if you have received this e-mail by mistake and delete 
this e-mail from your system. If you are not the intended recipient you are 
notified that disclosing, copying, distributing or taking any action in 
reliance on the contents of this information is strictly prohibited.

On 8 Mar 2017, at 00:25, Adrian Saul <adrian.s...@tpgtelecom.com.au> wrote:


Possibly MySQL is doing sync writes, where as your FIO could be doing buffered 
writes.

Try enabling the sync option on fio and compare results.



-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
Matteo Dacrema
Sent: Wednesday, 8 March 2017 7:52 AM
To: ceph-users
Subject: [ceph-users] MySQL and ceph volumes

Hi All,

I have a galera cluster running on openstack with data on ceph volumes
capped at 1500 iops for read and write ( 3000 total ).
I can’t understand why with fio I can reach 1500 iops without IOwait and
MySQL can reach only 150 iops both read or writes showing 30% of IOwait.

I tried with fio 64k block size and various io depth ( 1.2.4.8.16….128) and I
can’t reproduce the problem.

Anyone can tell me where I’m wrong?

Thank you
Regards
Matteo

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Confidentiality: This email and any attachments are confidential and may be 
subject to copyright, legal or some other professional privilege. They are 
intended solely for the attention and use of the named addressee(s). They may 
only be copied, distributed or disclosed with the consent of the copyright 
owner. If you have received this email by mistake or by breach of the 
confidentiality clause, please notify the sender immediately by return email 
and delete or destroy all copies of the email. Any confidentiality, privilege 
or copyright is not waived or lost because this email has been sent to you by 
mistake.

--
Questo messaggio e' stato analizzato con Libra ESVA ed e' risultato non infetto.
Seguire il link qui sotto per segnalarlo come spam:
http://mx01.enter.it/cgi-bin/learn-msg.cgi?id=13CCD402D0.AA534


Confidentiality: This email and any attachments are confidential and may be 
subject to copyright, legal or some other professional privilege. They are 
intended solely for the attention and use of the named addressee(s). They may 
only be copied, distributed or disclosed with the consent of the copyright 
owner. If you have received this email by mistake or by breach of the 
confidentiality clause, please notify the sender immediately by return email 
and delete or destroy all copies of the email. Any confidentiality, privilege 
or copyright is not waived or lost because this email has been sent to you by 
mistake.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MySQL and ceph volumes

2017-03-07 Thread Matteo Dacrema
Thank you Adrian!

I’ve forgot this option and I can reproduce the problem.

Now, what could be the problem on ceph side with O_DSYNC writes?

Regards
Matteo



This email and any files transmitted with it are confidential and intended 
solely for the use of the individual or entity to whom they are addressed. If 
you have received this email in error please notify the system manager. This 
message contains confidential information and is intended only for the 
individual named. If you are not the named addressee you should not 
disseminate, distribute or copy this e-mail. Please notify the sender 
immediately by e-mail if you have received this e-mail by mistake and delete 
this e-mail from your system. If you are not the intended recipient you are 
notified that disclosing, copying, distributing or taking any action in 
reliance on the contents of this information is strictly prohibited.

> On 8 Mar 2017, at 00:25, Adrian Saul wrote:
> 
> 
> Possibly MySQL is doing sync writes, where as your FIO could be doing 
> buffered writes.
> 
> Try enabling the sync option on fio and compare results.
> 
> 
>> -Original Message-
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>> Matteo Dacrema
>> Sent: Wednesday, 8 March 2017 7:52 AM
>> To: ceph-users
>> Subject: [ceph-users] MySQL and ceph volumes
>> 
>> Hi All,
>> 
>> I have a galera cluster running on openstack with data on ceph volumes
>> capped at 1500 iops for read and write ( 3000 total ).
>> I can’t understand why with fio I can reach 1500 iops without IOwait and
>> MySQL can reach only 150 iops both read or writes showing 30% of IOwait.
>> 
>> I tried with fio 64k block size and various io depth ( 1.2.4.8.16….128) and I
>> can’t reproduce the problem.
>> 
>> Anyone can tell me where I’m wrong?
>> 
>> Thank you
>> Regards
>> Matteo
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> Confidentiality: This email and any attachments are confidential and may be 
> subject to copyright, legal or some other professional privilege. They are 
> intended solely for the attention and use of the named addressee(s). They may 
> only be copied, distributed or disclosed with the consent of the copyright 
> owner. If you have received this email by mistake or by breach of the 
> confidentiality clause, please notify the sender immediately by return email 
> and delete or destroy all copies of the email. Any confidentiality, privilege 
> or copyright is not waived or lost because this email has been sent to you by 
> mistake.
> 
> --
> Questo messaggio e' stato analizzato con Libra ESVA ed e' risultato non 
> infetto.
> Seguire il link qui sotto per segnalarlo come spam: 
> http://mx01.enter.it/cgi-bin/learn-msg.cgi?id=13CCD402D0.AA534
> 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MySQL and ceph volumes

2017-03-07 Thread Adrian Saul

Possibly MySQL is doing sync writes, where as your FIO could be doing buffered 
writes.

Try enabling the sync option on fio and compare results.
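
For example, something along these lines (file path, size and runtime are only 
placeholders) should get much closer to what MySQL actually does:

# buffered/async pattern (roughly what the earlier fio runs measured)
fio --name=async --filename=/mnt/test/fiofile --size=2G --rw=randwrite --bs=16k \
    --ioengine=libaio --iodepth=32 --direct=1 --runtime=60 --time_based

# O_SYNC at queue depth 1 -- closer to MySQL's redo log / fsync behaviour
fio --name=sync --filename=/mnt/test/fiofile --size=2G --rw=randwrite --bs=16k \
    --ioengine=sync --sync=1 --iodepth=1 --direct=1 --runtime=60 --time_based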


> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Matteo Dacrema
> Sent: Wednesday, 8 March 2017 7:52 AM
> To: ceph-users
> Subject: [ceph-users] MySQL and ceph volumes
>
> Hi All,
>
> I have a galera cluster running on openstack with data on ceph volumes
> capped at 1500 iops for read and write ( 3000 total ).
> I can’t understand why with fio I can reach 1500 iops without IOwait and
> MySQL can reach only 150 iops both read or writes showing 30% of IOwait.
>
> I tried with fio 64k block size and various io depth ( 1.2.4.8.16….128) and I
> can’t reproduce the problem.
>
> Anyone can tell me where I’m wrong?
>
> Thank you
> Regards
> Matteo
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Confidentiality: This email and any attachments are confidential and may be 
subject to copyright, legal or some other professional privilege. They are 
intended solely for the attention and use of the named addressee(s). They may 
only be copied, distributed or disclosed with the consent of the copyright 
owner. If you have received this email by mistake or by breach of the 
confidentiality clause, please notify the sender immediately by return email 
and delete or destroy all copies of the email. Any confidentiality, privilege 
or copyright is not waived or lost because this email has been sent to you by 
mistake.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MySQL and ceph volumes

2017-03-07 Thread Deepak Naidu
I hope you used a 1-minute interval for iostat. Based on your iostat & disk info:


* avgrq-sz is showing 750.49 and avgqu-sz is showing 17.39.

* 375.245 KB is your average block size (750.49 sectors x 512 bytes).

* That said, your disk is showing a queue length of 17.39. Typically a higher 
queue length will increase your disk IO wait, whether it's read or write.

Hope you have the picture of your IO now & hope this info helps.

>> I tried with fio 64k block size and various io depth ( 1.2.4.8.16….128) and 
>> I can’t reproduce the problem.
Try approx. 375.245 KB block size & a queue depth of 32 and see what your iostat 
looks like; if it's the same, then that's what your disk can do.

Now if you want to compare Ceph RBD performance, do the same on a normal block device.
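
As a rough sketch (test file path and runtime are placeholders; bs ~384k matches 
the observed avgrq-sz, and ~25% reads matches the r/s vs w/s ratio):

fio --name=mysql-like --filename=/mnt/test/fiofile --size=2G \
    --rw=randrw --rwmixread=25 --bs=384k --ioengine=libaio --iodepth=32 \
    --direct=1 --runtime=120 --time_based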

--
Deepak



From: Matteo Dacrema [mailto:mdacr...@enter.eu]
Sent: Tuesday, March 07, 2017 1:17 PM
To: Deepak Naidu
Cc: ceph-users
Subject: Re: [ceph-users] MySQL and ceph volumes

Hi Deepak,

thank you.

Here an example of iostat

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.16    0.00    2.64   15.74    0.00   76.45

Device:  rrqm/s  wrqm/s    r/s     w/s    rkB/s      wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
vda        0.00    0.00   0.00    0.00     0.00       0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
vdb        0.00    1.00  96.00  292.00  4944.00  140652.00   750.49    17.39  43.89   17.79   52.47   2.58 100.00

vdb is the ceph volumes with xfs fs.


Disk /dev/vdb: 2199.0 GB, 2199023255552 bytes
255 heads, 63 sectors/track, 267349 cylinders, total 4294967296 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x

   Device Boot  Start End  Blocks   Id  System
/dev/vdb1   1  4294967295  2147483647+  ee  GPT

Regards
Matteo

On 7 Mar 2017, at 22:08, Deepak Naidu <dna...@nvidia.com> wrote:

My response is without any context to ceph or any SDS, purely how to check the 
IO bottleneck. You can then determine if its Ceph or any other process or disk.

>> MySQL can reach only 150 iops both read or writes showing 30% of IOwait.
Lower IOPS is not an issue in itself, as your block size might be higher. But 
whether MySQL is doing larger blocks, I'm not sure. You can check the iostat 
metrics below to see why the IO wait is higher.

*  avgqu-sz (avg queue length) -->  the higher the queue length, the more IO wait
*  avgrq-sz [the average size (in sectors)] -->  shows the IO block size (check 
this when using mysql). [You need to calculate this based on your FS block 
size in KB & don't just use the avgrq-sz #]


--
Deepak



-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Matteo 
Dacrema
Sent: Tuesday, March 07, 2017 12:52 PM
To: ceph-users
Subject: [ceph-users] MySQL and ceph volumes

Hi All,

I have a galera cluster running on openstack with data on ceph volumes capped 
at 1500 iops for read and write ( 3000 total ).
I can’t understand why with fio I can reach 1500 iops without IOwait and MySQL 
can reach only 150 iops both read or writes showing 30% of IOwait.

I tried with fio 64k block size and various io depth ( 1.2.4.8.16….128) and I 
can’t reproduce the problem.

Anyone can tell me where I’m wrong?

Thank you
Regards
Matteo

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

This email message is for the sole use of the intended recipient(s) and may 
contain confidential information.  Any unauthorized review, use, disclosure or 
distribution is prohibited.  If you are not the intended recipient, please 
contact the sender by reply email and destroy all copies of the original 
message.


--
Questo messaggio e' stato analizzato con Libra ESVA ed e' risultato non infetto.
Clicca qui per segnalarlo come 
spam.
Clicca qui per metterlo in 
blacklist

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Snapshot Costs (Was: Re: Pool Sizes)

2017-03-07 Thread Kent Borg

On 03/07/2017 04:35 PM, Gregory Farnum wrote:

Creating a snapshot generally involves a round-trip to the monitor,
which requires a new OSDMap epoch (although it can coalesce) — ie, the
monitor paxos commit and processing the new map on all the OSDs/PGs.
Destroying a snapshot involves adding the snapshot ID to an
interval_set in a new OSDMap epoch; and then going through the snap
trimming process (which can be fairly expensive).
If you send a write to a snapshotted object, it is (for
FileStore-on-xfs) copied on write. (FileStore-on-xfs does
filesystem-level copy-on-write, which is one reason we kept hoping it
would be our stable future...) I think BlueStore also does block-level
copy-on-write. It's a one-time penalty.


Concise answer. Makes sense. I'm slowly getting this stuff.

My take away? Snapshots are *way* more affordable than creating or 
deleting pools. But significantly more expensive than just reading and 
writing objects. So use snapshots for human scale stuff. Big-ish things 
humans care about, things that happen at time scales humans can 
participate in. Don't be too cute and clever here. Keep my clever to 
what I read and write, from and to, zillions of objects. (Playing on the 
zillions-of-scale is still a kick!)


Thanks,

-kb
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Snapshot Costs (Was: Re: Pool Sizes)

2017-03-07 Thread Gregory Farnum
On Tue, Mar 7, 2017 at 12:43 PM, Kent Borg  wrote:
> On 01/04/2017 03:41 PM, Brian Andrus wrote:
>>
>> Think "many objects, few pools". The number of pools do not scale well
>> because of PG limitations. Keep a small number of pools with the proper
>> number of PGs.
>
>
> I finally got it through my head, seems the larger answer is: Not only it is
> okay to have a (properly configured) pool grow to insane numbers of objects,
> the inverse is also true; keep the number of pools not just small, but to a
> very bare minimum. For example, Cephfs, which aspires to scale to crazy
> sizes, only uses two pools. And when Cephfs picks up the ability to offer
> multiple Cephfs file systems in of a single cluster...it will probably still
> only be using two pools.
>
>
> Continuing along with my theme of trying to understand Ceph (specifically
> RADOS, if that matters): Snapshots!
>
> What does a snapshot cost? In time? In other resources? When do those costs
> hit? What does it cost to destroy a snapshot? What does it cost to
> accumulate multiple snapshots? What does it cost to alter a snapshotted
> object? (Does that alteration cost hit only once or does it linger?)
> Whatever their costs, what makes them greater and what makes them smaller?
> It is sensible to make snapshots programmatically? If so, how rapidly?

Creating a snapshot generally involves a round-trip to the monitor,
which requires a new OSDMap epoch (although it can coalesce) — ie, the
monitor paxos commit and processing the new map on all the OSDs/PGs.
Destroying a snapshot involves adding the snapshot ID to an
interval_set in a new OSDMap epoch; and then going through the snap
trimming process (which can be fairly expensive).
If you send a write to a snapshotted object, it is (for
FileStore-on-xfs) copied on write. (FileStore-on-xfs does
filesystem-level copy-on-write, which is one reason we kept hoping it
would be our stable future...) I think BlueStore also does block-level
copy-on-write. It's a one-time penalty.
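
A quick way to watch the OSDMap side of this from the CLI (pool/image names 
here are made up):

ceph osd dump | grep ^epoch              # note the current OSDMap epoch
rbd snap create rbd/myimage@test-snap    # epoch bumps for the new snapshot
ceph osd dump | grep ^epoch
rbd snap rm rbd/myimage@test-snap
ceph osd dump | grep removed_snaps       # deleted snapid lands in the pool's interval_set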

> For example, one idea in the back of my mind is whether there would be a way
> to use snapshots as a way to kinda fake transactions. I have no idea whether
> that might be clever or an abuse of the feature...

I don't really think so — they're read-only so it's a linear structure.

>
> I would love it if someone could toss out some examples of the sorts of
> things snapshots are good for and the sorts of things they are terrible for.
> (And some hints as to why, please.)

They're good for CephFS snapshots. They're good at RBD snapshots as
long as you don't take them too frequently. In general if you're using
self-managed snapshots and *can* reuse snapids across objects, that
better mimics their original design goal (CephFS subtree snapshots)
and minimizes the associated costs.

I'll be giving a developer-focused talk on this at Vault (and it looks
like an admin-focused one at the OpenStack Boston Ceph day) which will
involve gathering up the data in one place and presenting it more
accessibly, so keep an eye out for those if you're interested. :)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] replica questions

2017-03-07 Thread Matteo Dacrema
Hi,

thank you all.

I’m using Mellanox switches with connectX-3 40 gbit pro NIC.
Bond balance-xor with policy layer3+4 

It’s a bit expensive but it’s very hard to saturate.
I’m using one single nic for both replica and access network. 
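
For reference, a minimal sketch of that bonding setup with ifupdown on Ubuntu 
(interface names and the address are placeholders):

# /etc/network/interfaces
auto bond0
iface bond0 inet static
    address 10.0.0.11
    netmask 255.255.255.0
    bond-slaves enp4s0 enp4s0d1
    bond-mode balance-xor
    bond-xmit-hash-policy layer3+4
    bond-miimon 100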


> On 3 Mar 2017, at 14:52, Vy Nguyen Tan wrote:
> 
> Hi,
> 
> You should read email from Wido den Hollander:
> "Hi,
> 
> As a Ceph consultant I get numerous calls throughout the year to help people 
> with getting their broken Ceph clusters back online.
> 
> The causes of downtime vary vastly, but one of the biggest causes is that 
> people use replication 2x. size = 2, min_size = 1.
> 
> In 2016 the amount of cases I have where data was lost due to these settings 
> grew exponentially.
> 
> Usually a disk failed, recovery kicks in and while recovery is happening a 
> second disk fails. Causing PGs to become incomplete.
> 
> There have been to many times where I had to use xfs_repair on broken disks 
> and use ceph-objectstore-tool to export/import PGs.
> 
> I really don't like these cases, mainly because they can be prevented easily 
> by using size = 3 and min_size = 2 for all pools.
> 
> With size = 2 you go into the danger zone as soon as a single disk/daemon 
> fails. With size = 3 you always have two additional copies left thus keeping 
> your data safe(r).
> 
> If you are running CephFS, at least consider running the 'metadata' pool with 
> size = 3 to keep the MDS happy.
> 
> Please, let this be a big warning to everybody who is running with size = 2. 
> The downtime and problems caused by missing objects/replicas are usually big 
> and it takes days to recover from those. But very often data is lost and/or 
> corrupted which causes even more problems.
> 
> I can't stress this enough. Running with size = 2 in production is a SERIOUS 
> hazard and should not be done imho.
> 
> To anyone out there running with size = 2, please reconsider this!
> 
> Thanks,
> 
> Wido"
> 
> Btw, could you please share your experience about HA network for Ceph ? What 
> type of bonding do you have? are you using stackable switches?
> 
> 
> 
> On Fri, Mar 3, 2017 at 6:24 PM, Maxime Guyot  > wrote:
> Hi Henrik and Matteo,
> 
>  
> 
> While I agree with Henrik: increasing your replication factor won’t improve 
> recovery or read performance on its own. If you are changing from replica 2 
> to replica 3, you might need to scale-out your cluster to have enough space 
> for the additional replica, and that would improve the recovery and read 
> performance.
> 
>  
> 
> Cheers,
> 
> Maxime
> 
>  
> 
> From: ceph-users on behalf of Henrik Korkuc <li...@kirneh.eu>
> Date: Friday 3 March 2017 11:35
> To: "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
> Subject: Re: [ceph-users] replica questions
> 
>  
> 
> On 17-03-03 12:30, Matteo Dacrema wrote:
> 
> Hi All,
> 
>  
> 
> I’ve a production cluster made of 8 nodes, 166 OSDs and 4 Journal SSD every 5 
> OSDs with replica 2 for a total RAW space of 150 TB.
> 
> I’ve few question about it:
> 
>  
> 
> It’s critical to have replica 2? Why?
> Replica size 3 is highly recommended. I do not know the exact numbers, but it 
> decreases the chance of data loss, as 2 disk failures appear to be quite a frequent 
> thing, especially in larger clusters.
> 
> 
> Does replica 3 makes recovery faster?
> no
> 
> 
> Does replica 3 makes rebalancing and recovery less heavy for customers? If I 
> lose 1 node does replica 3 reduce the IO impact respect a replica 2?
> no
> 
> 
> Does read performance increase with replica 3?
> no
> 
> 
>  
> 
> Thank you
> 
> Regards
> 
> Matteo
> 
>  
> 
> 
> 
> This email and any files transmitted with it are confidential and intended 
> solely for the use of the individual or entity to whom they are addressed. If 
> you have received this email in error please notify the system manager. This 
> message contains confidential information and is intended only for the 
> individual named. If you are not the named addressee you should not 
> disseminate, distribute or copy this e-mail. Please notify the sender 
> immediately by e-mail if you have received this e-mail by mistake and delete 
> this e-mail from your system. If you are not the intended recipient you are 
> notified that disclosing, copying, distributing or taking any action in 
> reliance on the contents of this information is strictly prohibited.
> 
>  
> 
> 
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 
>  
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] MySQL and ceph volumes

2017-03-07 Thread Matteo Dacrema
Hi Deepak,

thank you.

Here an example of iostat

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.16    0.00    2.64   15.74    0.00   76.45

Device:  rrqm/s  wrqm/s    r/s     w/s    rkB/s      wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
vda        0.00    0.00   0.00    0.00     0.00       0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
vdb        0.00    1.00  96.00  292.00  4944.00  140652.00   750.49    17.39  43.89   17.79   52.47   2.58 100.00

vdb is the ceph volumes with xfs fs.


Disk /dev/vdb: 2199.0 GB, 2199023255552 bytes
255 heads, 63 sectors/track, 267349 cylinders, total 4294967296 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x

   Device Boot  Start End  Blocks   Id  System
/dev/vdb1   1  4294967295  2147483647+  ee  GPT

Regards
Matteo

> On 7 Mar 2017, at 22:08, Deepak Naidu wrote:
> 
> My response is without any context to ceph or any SDS, purely how to check 
> the IO bottleneck. You can then determine if its Ceph or any other process or 
> disk.
>  
> >> MySQL can reach only 150 iops both read or writes showing 30% of IOwait.
> Lower IOPS is not an issue in itself, as your block size might be higher. But 
> whether MySQL is doing larger blocks, I'm not sure. You can check the iostat 
> metrics below to see why the IO wait is higher.
>  
> *  avgqu-sz (avg queue length) -->  the higher the queue length, the more IO wait
> *  avgrq-sz [the average size (in sectors)] -->  shows the IO block size (check 
> this when using mysql). [You need to calculate this based on your FS block 
> size in KB & don't just use the avgrq-sz #]
>  
>  
> --
> Deepak
>  
>  
>  
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com 
> ] On Behalf Of Matteo Dacrema
> Sent: Tuesday, March 07, 2017 12:52 PM
> To: ceph-users
> Subject: [ceph-users] MySQL and ceph volumes
>  
> Hi All,
>  
> I have a galera cluster running on openstack with data on ceph volumes capped 
> at 1500 iops for read and write ( 3000 total ).
> I can’t understand why with fio I can reach 1500 iops without IOwait and 
> MySQL can reach only 150 iops both read or writes showing 30% of IOwait.
>  
> I tried with fio 64k block size and various io depth ( 1.2.4.8.16….128) and I 
> can’t reproduce the problem.
>  
> Anyone can tell me where I’m wrong?
>  
> Thank you
> Regards
> Matteo
>  
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 
> This email message is for the sole use of the intended recipient(s) and may 
> contain confidential information.  Any unauthorized review, use, disclosure 
> or distribution is prohibited.  If you are not the intended recipient, please 
> contact the sender by reply email and destroy all copies of the original 
> message. 
> 
> -- 
> Questo messaggio e' stato analizzato con Libra ESVA ed e' risultato non 
> infetto. 
> Clicca qui per segnalarlo come spam. 
>  
> Clicca qui per metterlo in blacklist 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MySQL and ceph volumes

2017-03-07 Thread Deepak Naidu
My response is without any context to ceph or any SDS, purely how to check the 
IO bottleneck. You can then determine if its Ceph or any other process or disk.



>> MySQL can reach only 150 iops both read or writes showing 30% of IOwait.

Lower IOPS is not an issue in itself, as your block size might be higher. But 
whether MySQL is doing larger blocks, I'm not sure. You can check the iostat 
metrics below to see why the IO wait is higher.



*  avgqu-sz (avg queue length) -->  the higher the queue length, the more IO wait

*  avgrq-sz [the average size (in sectors)] -->  shows the IO block size (check 
this when using mysql). [You need to calculate this based on your FS block 
size in KB & don't just use the avgrq-sz #]
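
For example (device name and interval are placeholders):

iostat -xk vdb 60    # 60-second interval; watch avgqu-sz and avgrq-sz (in sectors)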





--

Deepak







-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Matteo 
Dacrema
Sent: Tuesday, March 07, 2017 12:52 PM
To: ceph-users
Subject: [ceph-users] MySQL and ceph volumes



Hi All,



I have a galera cluster running on openstack with data on ceph volumes capped 
at 1500 iops for read and write ( 3000 total ).

I can’t understand why with fio I can reach 1500 iops without IOwait and MySQL 
can reach only 150 iops both read or writes showing 30% of IOwait.



I tried with fio 64k block size and various io depth ( 1.2.4.8.16….128) and I 
can’t reproduce the problem.



Anyone can tell me where I’m wrong?



Thank you

Regards

Matteo



___

ceph-users mailing list

ceph-users@lists.ceph.com

http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

---
This email message is for the sole use of the intended recipient(s) and may 
contain
confidential information.  Any unauthorized review, use, disclosure or 
distribution
is prohibited.  If you are not the intended recipient, please contact the 
sender by
reply email and destroy all copies of the original message.
---
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] MySQL and ceph volumes

2017-03-07 Thread Matteo Dacrema
Hi All,

I have a galera cluster running on openstack with data on ceph volumes capped 
at 1500 iops for read and write ( 3000 total ).
I can’t understand why with fio I can reach 1500 iops without IOwait, while MySQL 
can reach only 150 iops for both reads and writes, showing 30% IOwait.

I tried with fio 64k block size and various io depths ( 1,2,4,8,16…128 ) and I 
can’t reproduce the problem.

Anyone can tell me where I’m wrong?

Thank you
Regards
Matteo

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Strange read results using FIO inside RBD QEMU VM ...

2017-03-07 Thread Xavier Trilla
Hi,

We have a pure SSD based Ceph cluster (+100 OSDs with enterprise SSDs and IT 
mode cards) running Hammer 0.94.9 over 10G. It's really stable and we are really 
happy with the performance we are getting. But after a customer ran some tests, 
we noticed something quite strange. Our user did some tests using FIO, and the 
strange thing is that the write tests worked as expected, but some of the read 
tests did not.

The VM he used was artificially limited via QEMU to 3200  read and 3200  write 
IOPS. In the write department everything works more or less as expected. The 
results get close to 3200 IOPS but the read tests are the ones we don't really 
understand.

We ran tests using different IO engines: sync, libaio and POSIX AIO. During the 
write tests the three of them perform quite similarly -which is something I did 
not really expect- but on the read side there is a huge difference:

Read Results (Random Read - Buffered: No - Direct: Yes - Block Size: 4KB):

LibAIO - Average: 3196 IOPS
POSIX AIO - Average: 878 IOPS
Sync -   Average: 929 IOPS

Write Results (Random Write - Buffered: No - Direct: Yes - Block Size: 4KB):

LibAIO -Average: 2741 IOPS
POSIX AIO -Average: 2673 IOPS
Sync -  Average: 2795 IOPS

I would expect a difference when using LibAIO or POSIX AIO, but I would expect 
it in both read and write results,  not only during reads.
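
For reference, invocations along these lines (file path, size and iodepth are 
just examples, not necessarily the exact commands our customer ran) would 
compare the three engines on reads:

fio --name=rr --filename=/mnt/test/fiofile --size=4G --rw=randread --bs=4k \
    --direct=1 --runtime=60 --time_based --ioengine=libaio --iodepth=32
fio --name=rr --filename=/mnt/test/fiofile --size=4G --rw=randread --bs=4k \
    --direct=1 --runtime=60 --time_based --ioengine=posixaio --iodepth=32
fio --name=rr --filename=/mnt/test/fiofile --size=4G --rw=randread --bs=4k \
    --direct=1 --runtime=60 --time_based --ioengine=sync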

So, I'm quite disoriented with this one... Does anyone have an idea about what 
might be going on?

Thanks!

Saludos Cordiales,
Xavier Trilla P.
Clouding.io

¿Un Servidor Cloud con SSDs, redundado
y disponible en menos de 30 segundos?

¡Pruébalo ahora en Clouding.io!

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Snapshot Costs (Was: Re: Pool Sizes)

2017-03-07 Thread Kent Borg

On 01/04/2017 03:41 PM, Brian Andrus wrote:
Think "many objects, few pools". The number of pools do not scale well 
because of PG limitations. Keep a small number of pools with the 
proper number of PGs.


I finally got it through my head, seems the larger answer is: Not only 
it is okay to have a (properly configured) pool grow to insane numbers 
of objects, the inverse is also true; keep the number of pools not just 
small, but to a very bare minimum. For example, Cephfs, which aspires to 
scale to crazy sizes, only uses two pools. And when Cephfs picks up the 
ability to offer multiple Cephfs file systems inside a single 
cluster...it will probably still only be using two pools.



Continuing along with my theme of trying to understand Ceph 
(specifically RADOS, if that matters): Snapshots!


What does a snapshot cost? In time? In other resources? When do those 
costs hit? What does it cost to destroy a snapshot? What does it cost to 
accumulate multiple snapshots? What does it cost to alter a snapshotted 
object? (Does that alteration cost hit only once or does it linger?) 
Whatever their costs, what makes them greater and what makes them 
smaller? Is it sensible to make snapshots programmatically? If so, how 
rapidly?


For example, one idea in the back of my mind is whether there would be a 
way to use snapshots as a way to kinda fake transactions. I have no idea 
whether that might be clever or an abuse of the feature...


I would love it if someone could toss out some examples of the sorts of 
things snapshots are good for and the sorts of things they are terrible 
for. (And some hints as to why, please.)


Thanks,

-kb


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] can a OSD affect performance from pool X when blocking/slow requests PGs from pool Y ?

2017-03-07 Thread Alejandro Comisario
Gregory, thanks for the response. What you've said is by far the most
enlightening thing I've learned about ceph in a long time.

What raises even greater doubt is that this "non-functional" pool was
only 1.5GB large, vs 50-150GB on the other affected pools. The tiny pool
was still being used, and just because that pool was blocking requests, the
whole cluster was unresponsive.

So, what do you mean by "non-functional" pool? How can a pool become
non-functional? And what assures me that tomorrow (just because I deleted
the 1.5GB pool to fix the whole problem) another pool doesn't become
non-functional?

Ceph Bug ?
Another Bug ?
Something than can be avoided ?


On Tue, Mar 7, 2017 at 2:11 PM, Gregory Farnum  wrote:

> Some facts:
> The OSDs use a lot of gossip protocols to distribute information.
> The OSDs limit how many client messages they let in to the system at a
> time.
> The OSDs do not distinguish between client ops for different pools (the
> blocking happens before they have any idea what the target is).
>
> So, yes: if you have a non-functional pool and clients keep trying to
> access it, those requests can fill up the OSD memory queues and block
> access to other pools as it cascades across the system.
>
> On Sun, Mar 5, 2017 at 6:22 PM Alejandro Comisario 
> wrote:
>
>> Hi, we have a 7 nodes ubuntu ceph hammer pool (78 OSD to be exact).
>> This weekend we'be experienced a huge outage from our customers vms
>> (located on pool CUSTOMERS, replica size 3 ) when lots of OSD's
>> started to slow request/block PG's on pool PRIVATE ( replica size 1 )
>> basically all PG's blocked where just one OSD in the acting set, but
>> all customers on the other pool got their vms almost freezed.
>>
>> while trying to do basic troubleshooting like doing noout and then
>> bringing down the OSD that slowed/blocked the most, inmediatelly
>> another OSD slowed/locked iops on pgs from the same PRIVATE pool, so
>> we rolled back that change and started to move data around with the
>> same logic (reweighting down those OSD) with exactly the same result.
>>
>> So we made a decision: we deleted the pool whose PGs were always the
>> ones slowed/locked, regardless of the OSD.
>>
>> Not even 10 seconds after the pool deletion, not only
>> were there no more degraded PGs, but also ALL slow iops disappeared
>> for good, and performance of hundreds of vms came back to normal
>> immediately.
>>
>> I must say that i was kinda scared to see that happen, basically
>> because only ONE pool's PGs were always slowed, but the performance
>> hit the other pool, so ... aren't the PGs that exist in one pool
>> separate from those of the other ?
>> If my assertion is true, why did OSDs locking iops on one pool's PGs
>> slow down all the PGs of the other pools ?
>>
>> again, i just deleted a pool that had almost no traffic, because its
>> pgs were locked and were affecting pgs in another pool, and as soon as that
>> happened, the whole cluster came back to normal (and of course,
>> HEALTH_OK and no slow transactions whatsoever)
>>
>> please, someone help me understand the gap where i'm missing something,
>> since this, as far as my ceph knowledge is concerned, makes no
>> sense.
>>
>> PS: i have found someone that, it looks like, went through the same here:
>> https://forum.proxmox.com/threads/ceph-osd-failure-causing-proxmox-node-to-crash.20781/
>> but i still don't understand what happened.
>>
>> hoping to get the help from the community.
>>
>> --
>> Alejandrito.
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>


-- 
*Alejandro Comisario*
*CTO | NUBELIU*
E-mail: alejandro@nubeliu.com  Cell: +54 9 11 3770 1857
_
www.nubeliu.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] purging strays faster

2017-03-07 Thread Patrick Donnelly
Hi Dan,

On Tue, Mar 7, 2017 at 11:10 AM, Daniel Davidson
 wrote:
> When I try this command, I still get errors:
>
> ceph daemon mds.0 config show
> admin_socket: exception getting command descriptions: [Errno 2] No such file
> or directory
> admin_socket: exception getting command descriptions: [Errno 2] No such file
> or directory
>
> I am guessing there is a path set up incorrectly somewhere, but I do not
> know where to look.

You need to run the command on the machine where the daemon is running.
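
For example, if the daemon id is ceph-0 (adjust to whatever your configs actually call it; the socket path below is the stock default and is an assumption on my part):

  # on the host where that MDS runs
  ceph daemon mds.ceph-0 config show

  # or point at the admin socket explicitly
  ceph --admin-daemon /var/run/ceph/ceph-mds.ceph-0.asok config show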

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] can a OSD affect performance from pool X when blocking/slow requests PGs from pool Y ?

2017-03-07 Thread Gregory Farnum
Some facts:
The OSDs use a lot of gossip protocols to distribute information.
The OSDs limit how many client messages they let in to the system at a time.
The OSDs do not distinguish between client ops for different pools (the
blocking happens before they have any idea what the target is).

So, yes: if you have a non-functional pool and clients keep trying to
access it, those requests can fill up the OSD memory queues and block
access to other pools as it cascades across the system.

On Sun, Mar 5, 2017 at 6:22 PM Alejandro Comisario 
wrote:

> Hi, we have a 7 node ubuntu ceph hammer cluster (78 OSDs to be exact).
> This weekend we've experienced a huge outage on our customers' vms
> (located on pool CUSTOMERS, replica size 3) when lots of OSDs
> started to slow request/block PGs on pool PRIVATE (replica size 1);
> basically all the blocked PGs had just one OSD in the acting set, but
> all customers on the other pool got their vms almost frozen.
>
> while trying to do basic troubleshooting like doing noout and then
> bringing down the OSD that slowed/blocked the most, immediately
> another OSD slowed/locked iops on pgs from the same PRIVATE pool, so
> we rolled back that change and started to move data around with the
> same logic (reweighting down those OSDs) with exactly the same result.
>
> So we made a decision: we deleted the pool whose PGs were always the
> ones slowed/locked, regardless of the OSD.
>
> Not even 10 seconds after the pool deletion, not only
> were there no more degraded PGs, but also ALL slow iops disappeared
> for good, and performance of hundreds of vms came back to normal
> immediately.
>
> I must say that i was kinda scared to see that happen, basically
> because only ONE pool's PGs were always slowed, but the performance
> hit the other pool, so ... aren't the PGs that exist in one pool
> separate from those of the other ?
> If my assertion is true, why did OSDs locking iops on one pool's PGs
> slow down all the PGs of the other pools ?
>
> again, i just deleted a pool that had almost no traffic, because its
> pgs were locked and were affecting pgs in another pool, and as soon as that
> happened, the whole cluster came back to normal (and of course,
> HEALTH_OK and no slow transactions whatsoever)
>
> please, someone help me understand the gap where i'm missing something,
> since this, as far as my ceph knowledge is concerned, makes no
> sense.
>
> PS: i have found someone that, it looks like, went through the same here:
>
> https://forum.proxmox.com/threads/ceph-osd-failure-causing-proxmox-node-to-crash.20781/
> but i still don't understand what happened.
>
> hoping to get the help from the community.
>
> --
> Alejandrito.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Replication vs Erasure Coding with only 2 elements in the failure-domain.

2017-03-07 Thread Francois Blondel
Hi all,


We have (only) 2 separate "rooms" (crush bucket) and would like to build a 
cluster being able to handle the complete loss of one room.


First idea would be to use replication:

-> As we read the mail thread "2x replication: A BIG warning", we would choose a 
replication size of 3.

-> We need to change the default ruleset {bucket-type} to room. ( as described 
here http://docs.ceph.com/docs/master/rados/operations/crush-map/#crushmaprules 
)

For that we created a new crush rule:


rule replicated_ruleset_new {
        ruleset 3
        type replicated
        min_size 1
        max_size 10
        step take default
        step choose firstn 2 type room
        step chooseleaf firstn 2 type host
        step emit
}



Second idea would be to use Erasure Coding, as it fits our performance 
requirements and would use less raw space.


Creating an EC profile like:

   “ceph osd erasure-code-profile set eck2m2room k=2 m=2 
ruleset-failure-domain=room”

and a pool using that EC profile, with “ceph osd pool create ecpool 128 128 
erasure eck2m2room” of course leads to having “128 creating+incomplete” PGs, as 
we only have 2 rooms.


Is there somehow a way to store the “parity chunks” (m) in both rooms, so that 
the loss of a room would be survivable ?


If I understood correctly, an erasure coding of, for example, k=2, m=2 would use 
the same space as replication with a size of 2, but be more reliable, as we 
could afford the loss of more OSDs at the same time.
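
(To check my own math on the space claim: with k=2, m=2 a 4 MB object is cut into 2 data chunks of 2 MB plus 2 coding chunks of 2 MB, so 8 MB of raw space for 4 MB of data -- an overhead of (k+m)/k = 2x, the same raw usage as size=2 replication, while tolerating the loss of any 2 chunks.)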

Would it be possible to instruct the crush rule to store the first k and m 
chunks in room 1, and the second k and m chunks in room 2 ?



Many thanks for your feedback !


Thanks !

François
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] purging strays faster

2017-03-07 Thread Daniel Davidson

When I try this command, I still get errors:

ceph daemon mds.0 config show
admin_socket: exception getting command descriptions: [Errno 2] No such 
file or directory
admin_socket: exception getting command descriptions: [Errno 2] No such 
file or directory


I am guessing there is a path set up incorrectly somewhere, but I do not 
know where to look.


Dan


On 03/06/2017 09:05 AM, John Spray wrote:

On Mon, Mar 6, 2017 at 3:03 PM, Daniel Davidson
 wrote:

Thanks for the suggestion, however I think my more immediate problem is the
ms_handle_reset messages. I do not think the mds are getting the updates
when I send them.

I wouldn't assume that.  You can check the current config state to see
that your values got through by using "ceph daemon mds.<id> config
show".

John


Dan


On 03/04/2017 09:08 AM, John Spray wrote:

On Fri, Mar 3, 2017 at 9:48 PM, Daniel Davidson
 wrote:

ceph daemonperf mds.ceph-0
-mds-- --mds_server-- ---objecter--- -mds_cache- ---mds_log
rlat inos caps|hsr  hcs  hcr |writ read actv|recd recy stry purg|segs evts subm|
  0  336k  97k|  0    0    0 |  0    0   20 |  0    0 246k   0 | 31  27k    0
  0  336k  97k|  0    0    0 |112    0   20 |  0    0 246k  55 | 31  26k   55
  0  336k  97k|  0    1    0 | 90    0   20 |  0    0 246k  45 | 31  26k   45
  0  336k  97k|  0    0    0 |  2    0   20 |  0    0 246k   1 | 31  26k    1
  0  336k  97k|  0    0    0 |166    0   21 |  0    0 246k  83 | 31  26k   83

I have too many strays that seem to be causing disk full errors when
deleting many files (hundreds of thousands); the number here is down from
over 400k.  I have been trying to up the number of purge operations to do
this, but it is not happening:

ceph tell mds.ceph-0 injectargs --mds-max-purge-ops-per-pg 2
2017-03-03 15:44:00.606548 7fd96400a700  0 client.225772 ms_handle_reset
on
172.16.31.1:6800/55710
2017-03-03 15:44:00.618556 7fd96400a700  0 client.225776 ms_handle_reset
on
172.16.31.1:6800/55710
mds_max_purge_ops_per_pg = '2'

ceph tell mds.ceph-0 injectargs --mds-max-purge-ops 16384
2017-03-03 15:45:27.256132 7ff6d900c700  0 client.225808 ms_handle_reset
on
172.16.31.1:6800/55710
2017-03-03 15:45:27.268302 7ff6d900c700  0 client.225812 ms_handle_reset
on
172.16.31.1:6800/55710
mds_max_purge_ops = '16384'

I do have a backfill running as I also have a new node that is almost
done.
Any ideas as to what is going on here?

Try also increasing mds_max_purge_files.  If your files are small
then that is likely to be the bottleneck.

John


Dan

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Much more dentries than inodes, is that normal?

2017-03-07 Thread Xiaoxi Chen
Thanks John.

Very likely, note that mds_mem::ino + mds_cache::strays_created ~=
mds::inodes; plus the MDS was the active-standby one, and became
active days ago due to a failover.

"mds": {
"inodes": 1291393,
}
"mds_cache": {
"num_strays": 3559,
"strays_created": 706120,
"strays_purged": 702561
}
"mds_mem": {
"ino": 584974,
}

I do have a cache dump from the mds via admin socket; is there
anything I can get from it to make 100% sure?


Xiaoxi

2017-03-07 22:20 GMT+08:00 John Spray :
> On Tue, Mar 7, 2017 at 9:17 AM, Xiaoxi Chen  wrote:
>> Hi,
>>
>>   From the admin socket of mds, I got following data on our
>> production cephfs env, roughly we have 585K inodes and almost same
>> amount of caps, but we have >2x as many dentries as inodes.
>>
>>   I am pretty sure we don't use hard links intensively (if any).
>> And the #ino matches "rados ls --pool $my_data_pool".
>>
>>   Thanks for any explanations, appreciate it.
>>
>>
>> "mds_mem": {
>> "ino": 584974,
>> "ino+": 1290944,
>> "ino-": 705970,
>> "dir": 25750,
>> "dir+": 25750,
>> "dir-": 0,
>> "dn": 1291393,
>> "dn+": 1997517,
>> "dn-": 706124,
>> "cap": 584560,
>> "cap+": 2657008,
>> "cap-": 2072448,
>> "rss": 24599976,
>> "heap": 166284,
>> "malloc": 18446744073708721289,
>> "buf": 0
>> },
>>
>
> One possibility is that you have many "null" dentries, which are
> created when we do a lookup and a file is not found -- we create a
> special dentry to remember that that filename does not exist, so that
> we can return ENOENT quickly next time.  On pre-Kraken versions, null
> dentries can also be left behind after file deletions when the
> deletion is replayed on a standbyreplay MDS
> (http://tracker.ceph.com/issues/16919)
>
> John
>
>
>
>>
>> Xiaoxi
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Much more dentries than inodes, is that normal?

2017-03-07 Thread John Spray
On Tue, Mar 7, 2017 at 9:17 AM, Xiaoxi Chen  wrote:
> Hi,
>
>   From the admin socket of mds, I got following data on our
> production cephfs env, roughly we have 585K inodes and almost same
> amount of caps, but we have >2x as many dentries as inodes.
>
>   I am pretty sure we don't use hard links intensively (if any).
> And the #ino matches "rados ls --pool $my_data_pool".
>
>   Thanks for any explanations, appreciate it.
>
>
> "mds_mem": {
> "ino": 584974,
> "ino+": 1290944,
> "ino-": 705970,
> "dir": 25750,
> "dir+": 25750,
> "dir-": 0,
> "dn": 1291393,
> "dn+": 1997517,
> "dn-": 706124,
> "cap": 584560,
> "cap+": 2657008,
> "cap-": 2072448,
> "rss": 24599976,
> "heap": 166284,
> "malloc": 18446744073708721289,
> "buf": 0
> },
>

One possibility is that you have many "null" dentries, which are
created when we do a lookup and a file is not found -- we create a
special dentry to remember that that filename does not exist, so that
we can return ENOENT quickly next time.  On pre-Kraken versions, null
dentries can also be left behind after file deletions when the
deletion is replayed on a standbyreplay MDS
(http://tracker.ceph.com/issues/16919)
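
If you want to check whether null dentries account for the gap, one option is to dump the cache to a file and look for them (the dump cache command is standard; I'm going from memory on how null dentries are marked in the dump, so treat the grep pattern as an assumption):

  ceph daemon mds.<id> dump cache /tmp/mds-cache.txt
  grep -c NULL /tmp/mds-cache.txt    # null dentries should show up with a NULL linkage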

John



>
> Xiaoxi
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osds crashing during hit_set_trim and hit_set_remove_all

2017-03-07 Thread kefu chai
On Tue, Mar 7, 2017 at 3:30 PM, kefu chai  wrote:
> On Fri, Mar 3, 2017 at 11:40 PM, Sage Weil  wrote:
>> On Fri, 3 Mar 2017, Mike Lovell wrote:
>>> i started an upgrade process to go from 0.94.7 to 10.2.5 on a production
>>> cluster that is using cache tiering. this cluster has 3 monitors, 28 storage
>>> nodes, around 370 osds. the upgrade of the monitors completed without issue.
>>> i then upgraded 2 of the storage nodes, and after the restarts, the osds
>>> started crashing during hit_set_trim. here is some of the output from the
>>> log.
>>> 2017-03-02 22:41:32.338290 7f8bfd6d7700 -1 osd/ReplicatedPG.cc: In function
>>> 'void ReplicatedPG::hit_set_trim(ReplicatedPG::RepGather*, unsigned int)'
>>> thread 7f8bfd6d7700 time 2017-03-02 22:41:32.335020
>>> osd/ReplicatedPG.cc: 10514: FAILED assert(obc)
>>>
>>>  ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
>>>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>>> const*)+0x85) [0xbddac5]
>>>  2: (ReplicatedPG::hit_set_trim(ReplicatedPG::RepGather*, unsigned
>>> int)+0x75f) [0x87e48f]
>>>  3: (ReplicatedPG::hit_set_persist()+0xedb) [0x87f4ab]
>>>  4: (ReplicatedPG::do_op(std::tr1::shared_ptr&)+0xe3a) [0x8a0d1a]
>>>  5: (ReplicatedPG::do_request(std::tr1::shared_ptr&,
>>> ThreadPool::TPHandle&)+0x68a) [0x83be4a]
>>>  6: (OSD::dequeue_op(boost::intrusive_ptr,
>>> std::tr1::shared_ptr, ThreadPool::TPHandle&)+0x405) [0x69a5c5]
>>>  7: (OSD::ShardedOpWQ::_process(unsigned int,
>>> ceph::heartbeat_handle_d*)+0x333) [0x69ab33]
>>>  8: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f)
>>> [0xbcd1cf]
>>>  9: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xbcf300]
>>>  10: (()+0x7dc5) [0x7f8c1c209dc5]
>>>  11: (clone()+0x6d) [0x7f8c1aceaced]
>>>
>>> it started on just one osd and then spread to others until most of the osds
>>> that are part of the cache tier were crashing. that was happening on both
>>> the osds that were running jewel and on the ones running hammer. in the
>>> process of trying to sort this out, the use_gmt_hitset option was set to
>>> true and all of the osds were upgraded to jewel. we still have not been
>>> able to determine a cause or a fix.
>>>
>>> it looks like when hit_set_trim and hit_set_remove_all are being called,
>>> they are calling hit_set_archive_object() to generate a name based on a
>>> timestamp and then calling get_object_context() which then returns nothing
>>> and triggers an assert.
>>>
>>> i raised the debug_osd to 10/10 and then analyzed the logs after the crash.
>>> i found the following in the ceph osd log afterwards.
>>>
>>> 2017-03-03 03:10:31.918470 7f218c842700 10 osd.146 pg_epoch: 266043
>>> pg[19.5d4( v 264786'61233923 (262173'61230715,264786'61233923]
>>> local-les=266043 n=393 ec=83762 les/c/f 266043/264767/0
>>> 266042/266042/266042) [146,116,179] r=0 lpr=266042
>>>  pi=264766-266041/431 crt=262323'61233250 lcod 0'0 mlcod 0'0 active+degraded
>>> NIBBLEWISE] get_object_context: no obc for soid
>>> 19:2ba0:.ceph-internal::hit_set_19.5d4_archive_2017-03-03
>>> 05%3a55%3a58.459084Z_2017-03-03 05%3a56%3a58.98101
>>> 6Z:head and !can_create
>>> 2017-03-03 03:10:31.921064 7f2194051700 10 osd.146 266043 do_waiters --
>>> start
>>> 2017-03-03 03:10:31.921072 7f2194051700 10 osd.146 266043 do_waiters --
>>> finish
>>> 2017-03-03 03:10:31.921076 7f2194051700  7 osd.146 266043 handle_pg_notify
>>> from osd.255
>>> 2017-03-03 03:10:31.921096 7f2194051700 10 osd.146 266043 do_waiters --
>>> start
>>> 2017-03-03 03:10:31.921099 7f2194051700 10 osd.146 266043 do_waiters --
>>> finish
>>> 2017-03-03 03:10:31.925858 7f218c041700 -1 osd/ReplicatedPG.cc: In function
>>> 'void ReplicatedPG::hit_set_remove_all()' thread 7f218c041700 time
>>> 2017-03-03 03:10:31.918201
>>> osd/ReplicatedPG.cc: 11494: FAILED assert(obc)
>>>
>>>  ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
>>>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>>> const*)+0x85) [0x7f21acee9425]
>>>  2: (ReplicatedPG::hit_set_remove_all()+0x412) [0x7f21ac9cba92]
>>>  3: (ReplicatedPG::on_activate()+0x6dd) [0x7f21ac9f73fd]
>>>  4: (PG::RecoveryState::Active::react(PG::AllReplicasActivated const&)+0xac)
>>> [0x7f21ac916adc]
>>>  5: (boost::statechart::simple_state>> PG::RecoveryState::Primary, 
>>> PG::RecoveryState::Activating,(boost::statechart::history_mode)0>::react_impl(boost::statechart::event_ba
>>> se const&, void const*)+0x179) [0x7f21a
>>> c974909]
>>>  6: (boost::statechart::simple_state>> PG::RecoveryState::Active, boost::mpl::list>> mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
>>> mpl_::na, mpl_::na, mpl_:
>>> :na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, 
>>> mpl_::na>,(boost::statechart::history_mode)0>::react_impl(boost::statechart::event_ba
>>> se const&, void const*)+0xcd) [0x7f21ac977ccd]
>>>  7: (boost::statechart::state_machine>> PG::RecoveryState::Initial, 
>>> std::allocator,boost::statechart::null_exception_transla

Re: [ceph-users] RBD device on Erasure Coded Pool with kraken and Ubuntu Xenial.

2017-03-07 Thread Francois Blondel
Am Dienstag, den 07.03.2017, 11:14 +0100 schrieb Ilya Dryomov:

On Tue, Mar 7, 2017 at 10:27 AM, Francois Blondel <fblon...@intelliad.de> wrote:


Hi all,

I have been trying to use RBD devices on an Erasure Coded data-pool on Ubuntu
Xenial.

I created my block device "blockec2" with :
rbd create blockec2 --size 300G --data-pool ecpool --image-feature
layering,data-pool
(same issue with "rbd create blockec2 --size 300G --data-pool ecpool" )

I mapped this block device to a "4.4.0-65-generic #86-Ubuntu SMP Thu Feb 23
17:49:58 UTC 2017 x86_64" Kernel using "rbd-nbd map rbd/blockec2"

When using the block device (mkfs.ext4 for example), i get the following errors
in dmesg:

[Fri Mar  3 10:05:25 2017] nbd: registered device at major 43
[Fri Mar  3 10:05:53 2017] block nbd0: Other side returned error (95)
[Fri Mar  3 10:05:53 2017] blk_update_request: I/O error, dev nbd0, sector 0
[Fri Mar  3 10:05:56 2017] block nbd0: Other side returned error (95)
[Fri Mar  3 10:05:56 2017] blk_update_request: I/O error, dev nbd0, sector
8388607
[Fri Mar  3 10:06:06 2017] block nbd0: Other side returned error (95)
[Fri Mar  3 10:06:06 2017] blk_update_request: I/O error, dev nbd0, sector
16777214



Be aware that this functionality is experimental and may eat your data.
If you still want to try it, you need to whitelist it in ceph.conf:

enable experimental unrecoverable data corrupting features =
debug_white_box_testing_ec_overwrites

and do

$ ceph osd pool set ecpool debug_white_box_testing_ec_overwrites true


Thanks a lot, it works :-)

Had to force it with   --yes-i-really-mean-it :


ceph osd pool set ecpool debug_white_box_testing_ec_overwrites true 
--yes-i-really-mean-it
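
For the archives, the full sequence that ended up working for me, put together from the steps above (still an experimental feature, so handle with care):

  # in ceph.conf, as Ilya described:
  enable experimental unrecoverable data corrupting features = debug_white_box_testing_ec_overwrites

  ceph osd pool set ecpool debug_white_box_testing_ec_overwrites true --yes-i-really-mean-it
  rbd create blockec2 --size 300G --data-pool ecpool --image-feature layering,data-pool
  rbd-nbd map rbd/blockec2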


Thanks,


François



Thanks,

Ilya


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD device on Erasure Coded Pool with kraken and Ubuntu Xenial.

2017-03-07 Thread Ilya Dryomov
On Tue, Mar 7, 2017 at 10:27 AM, Francois Blondel  wrote:
> Hi all,
>
> I have been trying to use RBD devices on an Erasure Coded data-pool on Ubuntu
> Xenial.
>
> I created my block device "blockec2" with :
> rbd create blockec2 --size 300G --data-pool ecpool --image-feature
> layering,data-pool
> (same issue with "rbd create blockec2 --size 300G --data-pool ecpool" )
>
> I mapped this block device to a "4.4.0-65-generic #86-Ubuntu SMP Thu Feb 23
> 17:49:58 UTC 2017 x86_64" Kernel using "rbd-nbd map rbd/blockec2"
>
> When using the block device (mkfs.ext4 for example), i get the following errors
> in dmesg:
>
> [Fri Mar  3 10:05:25 2017] nbd: registered device at major 43
> [Fri Mar  3 10:05:53 2017] block nbd0: Other side returned error (95)
> [Fri Mar  3 10:05:53 2017] blk_update_request: I/O error, dev nbd0, sector 0
> [Fri Mar  3 10:05:56 2017] block nbd0: Other side returned error (95)
> [Fri Mar  3 10:05:56 2017] blk_update_request: I/O error, dev nbd0, sector
> 8388607
> [Fri Mar  3 10:06:06 2017] block nbd0: Other side returned error (95)
> [Fri Mar  3 10:06:06 2017] blk_update_request: I/O error, dev nbd0, sector
> 16777214

Be aware that this functionality is experimental and may eat your data.
If you still want to try it, you need to whitelist it in ceph.conf:

enable experimental unrecoverable data corrupting features =
debug_white_box_testing_ec_overwrites

and do

$ ceph osd pool set ecpool debug_white_box_testing_ec_overwrites true

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph/hammer - debian7/wheezy repository doesnt work correctly

2017-03-07 Thread linux-ml

They are there; as a workaround, download them manually from:

http://download.ceph.com/debian-hammer/pool/main/c/ceph/

*0.94.9-1~bpo70+1

http://download.ceph.com/debian-hammer/pool/main/c/curl/

*7.29.0-1~bpo70+1

and install what you need with dpkg -i *.deb
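
A minimal sketch of that workaround (the .deb file name is my best guess from the version string above -- check the directory listing for the exact names and any dependencies):

  wget http://download.ceph.com/debian-hammer/pool/main/c/ceph/ceph_0.94.9-1~bpo70+1_amd64.deb
  dpkg -i ceph_0.94.9-1~bpo70+1_amd64.deb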


best regards rainer


On 06/03/17 22:50, Smart Weblications GmbH - Florian Wiessner wrote:

On 28.02.2017 at 09:48, linux...@boku.ac.at wrote:

Hi,

actually i can't install hammer on wheezy:

~# cat /etc/apt/sources.list.d/ceph.list
deb http://download.ceph.com/debian-hammer/ wheezy main

~# cat /etc/issue
Debian GNU/Linux 7 \n \l

~# apt-cache search ceph
ceph-deploy - Ceph-deploy is an easy to use configuration tool

~# apt-cache policy ceph-deploy
ceph-deploy:
   Installed: 1.5.37
   Candidate: 1.5.37
   Version table:
  *** 1.5.37 0
 999 http://download.ceph.com/debian-hammer/ wheezy/main amd64 Packages
 100 /var/lib/dpkg/status

~# ceph-deploy --version
1.5.37

~# ceph-deploy install --release=hammer MYHOST

[WARNIN] E: Unable to locate package ceph-osd
[WARNIN] E: Unable to locate package ceph-mds
[WARNIN] E: Unable to locate package ceph-mon
[WARNIN] E: Unable to locate package radosgw
[ERROR ] RuntimeError: command returned non-zero exit status: 100
[ceph_deploy][ERROR ] RuntimeError: Failed to execute command: env
DEBIAN_FRONTEND=noninteractive DEBIAN_PRIORITY=critical apt-get --assume-yes -q
--no-install-recommends install -o Dpkg::Options::=--force-confnew ceph-osd
ceph-mds ceph-mon radosgw


any ideas ?

thx & greez Rainer


Cannot install hammer on wheezy either. It seems the packages have been removed?





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] A Jewel in the rough? (cache tier bugs and documentation omissions)

2017-03-07 Thread Nick Fisk
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of John 
> Spray
> Sent: 07 March 2017 01:45
> To: Christian Balzer 
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] A Jewel in the rough? (cache tier bugs and 
> documentation omissions)
> 
> On Tue, Mar 7, 2017 at 12:28 AM, Christian Balzer  wrote:
> >
> >
> > Hello,
> >
> > It's now 10 months after this thread:
> >
> > http://www.spinics.net/lists/ceph-users/msg27497.html (plus next
> > message)
> >
> > and we're at the fifth iteration of Jewel and still
> >
> > osd_tier_promote_max_objects_sec
> > and
> > osd_tier_promote_max_bytes_sec
> >
> > are neither documented (master or jewel), nor mentioned in the
> > changelogs and most importantly STILL default to the broken reverse 
> > settings above.
> 
> Is there a pull request?

Mark fixed it in this commit, but it looks like it was never marked for backport 
to Jewel.

https://github.com/ceph/ceph/commit/793ceac2f3d5a2c404ac50569c44a21de6001b62

I will look into getting the documentation updated for these settings.
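
In the meantime, anyone hitting this can override the two settings in their own config; the values below are only placeholders to show the mechanism, not tuned recommendations:

  # ceph.conf
  [osd]
      osd_tier_promote_max_objects_sec = 25
      osd_tier_promote_max_bytes_sec = 5242880

  # or at runtime
  ceph tell osd.* injectargs '--osd_tier_promote_max_objects_sec 25 --osd_tier_promote_max_bytes_sec 5242880'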

> 
> John
> 
> > Anybody coming from Hammer or even starting with Jewel and using cache
> > tiering will be having a VERY bad experience.
> >
> > Christian
> > --
> > Christian BalzerNetwork/Systems Engineer
> > ch...@gol.com   Global OnLine Japan/Rakuten Communications
> > http://www.gol.com/
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RBD device on Erasure Coded Pool with kraken and Ubuntu Xenial.

2017-03-07 Thread Francois Blondel
Hi all,

I have been trying to use RBD devices on an Erasure Coded data-pool on Ubuntu 
Xenial.

I created my block device "blockec2" with :
rbd create blockec2 --size 300G --data-pool ecpool --image-feature 
layering,data-pool
(same issue with "rbd create blockec2 --size 300G --data-pool ecpool" )

I mapped this block device to a "4.4.0-65-generic #86-Ubuntu SMP Thu Feb 23 
17:49:58 UTC 2017 x86_64" Kernel using "rbd-nbd map rbd/blockec2"

When using the block device (mkfs.ext4 for example), i get the following errors in 
dmesg:

[Fri Mar  3 10:05:25 2017] nbd: registered device at major 43
[Fri Mar  3 10:05:53 2017] block nbd0: Other side returned error (95)
[Fri Mar  3 10:05:53 2017] blk_update_request: I/O error, dev nbd0, sector 0
[Fri Mar  3 10:05:56 2017] block nbd0: Other side returned error (95)
[Fri Mar  3 10:05:56 2017] blk_update_request: I/O error, dev nbd0, sector 
8388607
[Fri Mar  3 10:06:06 2017] block nbd0: Other side returned error (95)
[Fri Mar  3 10:06:06 2017] blk_update_request: I/O error, dev nbd0, sector 
16777214

I tested with 2 versions of ceph, same issue : rbd-nbd -v :
ceph version 12.0.0 (b7d9d6eb542e2b946ac778bd3a381ce466f60f6a)
ceph version 11.2.0 (f223e27eeb35991352ebc1f67423d4ebc252adb7)

The ceph cluster is running ceph version 11.2.0 
(f223e27eeb35991352ebc1f67423d4ebc252adb7) with bluestore OSDs.

Am I doing something wrong ?
Is the kernel of my RBD client "too old" ?
Should I open a ticket/bug report ?

Thanks !

François Blondel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Much more dentries than inodes, is that normal?

2017-03-07 Thread Xiaoxi Chen
Hi,

  From the admin socket of mds, I got following data on our
production cephfs env, roughly we have 585K inodes and almost same
amount of caps, but we have >2x as many dentries as inodes.

  I am pretty sure we don't use hard links intensively (if any).
And the #ino matches "rados ls --pool $my_data_pool".

  Thanks for any explanations, appreciate it.


"mds_mem": {
"ino": 584974,
"ino+": 1290944,
"ino-": 705970,
"dir": 25750,
"dir+": 25750,
"dir-": 0,
"dn": 1291393,
"dn+": 1997517,
"dn-": 706124,
"cap": 584560,
"cap+": 2657008,
"cap-": 2072448,
"rss": 24599976,
"heap": 166284,
"malloc": 18446744073708721289,
"buf": 0
},



Xiaoxi
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com