[ceph-users] Wrong PG information after increase pg_num

2015-07-14 Thread Luke Kao
Hello all,

I am testing a cluster with mixed OSD types on the same data node (yes, it's the idea
from:
http://www.sebastien-han.fr/blog/2014/08/25/ceph-mix-sata-and-ssd-within-the-same-box/),
and I ran into a strange status:

ceph -s and ceph pg dump show incorrect PG information after setting pg_num on a pool
that uses a different ruleset to select the faster OSDs.



Please advise what's wrong and whether I can fix this without recreating the pool
with the final pg_num directly:





Some more detail:

1) Update the crushmap to have a different root and ruleset that select different OSDs,
like this:

rule replicated_ruleset_ssd {
ruleset 50
type replicated
min_size 1
max_size 10
step take sdd
step chooseleaf firstn 0 type host
step emit
}
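
(For reference, this is roughly how I edit and sanity-check the map before injecting it
back; a sketch only, the file names are just examples, and check crushtool --help on
your version for whether --rule takes the rule number or the ruleset id:)

$ ceph osd getcrushmap -o crushmap.bin
$ crushtool -d crushmap.bin -o crushmap.txt
# add the rule above to crushmap.txt, then recompile and dry-run the new rule
$ crushtool -c crushmap.txt -o crushmap.new
$ crushtool -i crushmap.new --test --rule 50 --num-rep 3 --show-utilization
$ ceph osd setcrushmap -i crushmap.new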

2) Create a new pool and set crush_ruleset to use this new rule:



$ ceph osd pool create ssd 64 64 replicated replicated_ruleset_ssd

(however, after this command the pool is still using the default ruleset 0)

$ ceph osd pool set ssd crush_ruleset 50

3) It looks good now:
$ ceph osd dump | grep pool
pool 0 'data' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins 
pg_num 64 pgp_num 64 last_change 1 flags hashpspool crash_replay_interval 45 
stripe_width 0
pool 1 'metadata' replicated size 3 min_size 2 crush_ruleset 0 object_hash 
rjenkins pg_num 64 pgp_num 64 last_change 1 flags hashpspool stripe_width 0
pool 2 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins 
pg_num 256 pgp_num 256 last_change 50 flags hashpspool stripe_width 0
pool 8 'xfs' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins 
pg_num 1024 pgp_num 1024 last_change 1570 flags hashpspool stripe_width 0
pool 9 'ssd' replicated size 3 min_size 2 crush_ruleset 50 object_hash rjenkins 
pg_num 64 pgp_num 64 last_change 1574 flags hashpspool stripe_width 0

$ ceph -s
cluster 5f8ae2a8-f143-42d9-b50d-246ac0874569
 health HEALTH_OK
 monmap e2: 3 mons at 
{DEV-rhel7-vildn1=10.0.2.156:6789/0,DEV-rhel7-vildn2=10.0.2.157:6789/0,DEV-rhel7-vildn3=10.0.2.158:6789/0},
 election epoch 84, quorum 0,1,2 
DEV-rhel7-vildn1,DEV-rhel7-vildn2,DEV-rhel7-vildn3
 osdmap e1578: 21 osds: 15 up, 15 in
  pgmap v560681: 1472 pgs, 5 pools, 285 GB data, 73352 objects
80151 MB used, 695 GB / 779 GB avail
1472 active+clean
4) Increase pg_num and pgp_num, but the total PG count is still 1472 in ceph -s:
$ ceph osd pool set ssd pg_num 128
set pool 9 pg_num to 128
$ ceph osd pool set ssd pgp_num 128
set pool 9 pgp_num to 128
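
(A couple of quick checks I use to confirm what the cluster thinks the pool has; the
grep pattern assumes the pool id is 9, as in the dump below:)

$ ceph osd pool get ssd pg_num
$ ceph osd pool get ssd pgp_num
$ ceph pg dump pgs_brief | grep -c '^9\.'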

$ ceph osd dump | grep pool
pool 0 'data' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins 
pg_num 64 pgp_num 64 last_change 1 flags hashpspool crash_replay_interval 45 
stripe_width 0
pool 1 'metadata' replicated size 3 min_size 2 crush_ruleset 0 object_hash 
rjenkins pg_num 64 pgp_num 64 last_change 1 flags hashpspool stripe_width 0
pool 2 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins 
pg_num 256 pgp_num 256 last_change 50 flags hashpspool stripe_width 0
pool 8 'xfs' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins 
pg_num 1024 pgp_num 1024 last_change 1570 flags hashpspool stripe_width 0
pool 9 'ssd' replicated size 3 min_size 2 crush_ruleset 50 object_hash rjenkins 
pg_num 128 pgp_num 128 last_change 1581 flags hashpspool stripe_width 0

$ ceph -s
cluster 5f8ae2a8-f143-42d9-b50d-246ac0874569
 health HEALTH_OK
 monmap e2: 3 mons at 
{DEV-rhel7-vildn1=10.0.2.156:6789/0,DEV-rhel7-vildn2=10.0.2.157:6789/0,DEV-rhel7-vildn3=10.0.2.158:6789/0},
 election epoch 84, quorum 0,1,2 
DEV-rhel7-vildn1,DEV-rhel7-vildn2,DEV-rhel7-vildn3
 osdmap e1582: 21 osds: 15 up, 15 in
  pgmap v560709: 1472 pgs, 5 pools, 285 GB data, 73352 objects
80158 MB used, 695 GB / 779 GB avail
1472 active+clean

5) Same problem with pg dump:
$ ceph pg dump | grep '^9\.' | wc
dumped all in format plain
     64    1472   10288

6) It looks like the PGs are created under the OSDs' /var/lib/ceph/osd/ceph-*/current folders:



$ ls -ld /var/lib/ceph/osd/ceph-15/current/9.* | wc
     74     666    6133

$ ls -ld /var/lib/ceph/osd/ceph-16/current/9.* | wc
     54     486    4475



6 OSDs serve this ruleset, so 128 PGs * 3 replicas / 6 OSDs ~= 64 PG directories per OSD
on average, which roughly matches the counts above.





Thanks a lot





BR,

Luke Kao

MYCOM-OSI




Re: [ceph-users] Ceph on RHEL7.0

2015-05-28 Thread Luke Kao
Hi Bruce,
The RHEL 7.0 kernel has many issues in its filesystem submodules, and most of them were
fixed only in RHEL 7.1. So you should consider going to RHEL 7.1 directly and upgrading
to at least kernel 3.10.0-229.1.2.
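
(For reference, this is how we would confirm what is actually running after the upgrade;
the two output lines are only examples of what we would expect to see:)

$ cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.1 (Maipo)
$ uname -r
3.10.0-229.1.2.el7.x86_64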


BR,
Luke


From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Bruce 
McFarland [bruce.mcfarl...@taec.toshiba.com]
Sent: Friday, May 29, 2015 5:13 AM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Ceph on RHEL7.0

We’re planning on moving from Centos6.5 to RHEL7.0 for Ceph storage and monitor 
nodes. Are there any known issues using RHEL7.0?
Thanks





[ceph-users] Linux block device tuning on Kernel RBD device

2015-04-02 Thread Luke Kao
Hello everyone,
Does anyone have experience tuning a kernel RBD device by changing the scheduler and
other settings?

Currently we are trying it with the rbd module bundled in RHEL 7.1, changing the
following settings under /sys/block/rbdX/queue:
1) scheduler: noop vs deadline; deadline seems better
2) nr_requests: default 128; tried 64 / 256 / 1024, no clear difference between the values
3) rotational: as a network-based device, should this be set to 0 for rbd? Tried it,
no clear difference
4) read_ahead_kb: default 128; 4096 is much better, but we also see a lot of extra
network bandwidth used
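
(For reference, this is roughly how we apply the settings at runtime; a sketch only,
rbd0 is an example device name, and the sysfs values do not persist across reboot, so a
udev rule or rc script would be needed to make them permanent:)

DEV=rbd0
echo deadline > /sys/block/$DEV/queue/scheduler
echo 128      > /sys/block/$DEV/queue/nr_requests
echo 0        > /sys/block/$DEV/queue/rotational
echo 4096     > /sys/block/$DEV/queue/read_ahead_kb
cat /sys/block/$DEV/queue/scheduler      # the active scheduler is shown in brackets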

We are now putting together a plan to measure the change in IOPS, throughput and the
side effects in a quantitative way, and would like to know whether anyone can share
experience: is there already a known optimal setting, and are there other parameters
I should try, such as the tunables available for the deadline scheduler?

Thanks in advance,



Luke

MYCOM OSI
http://www.mycom-osi.com





Re: [ceph-users] Monitor stay in synchronizing state for over 24hour

2015-03-13 Thread Luke Kao
Hi all,
Does anyone have any ideas?

Or perhaps some direction on which debug logs I can enable to get information about the
progress of the synchronization. Currently I have set
debug_mon=20
mon_sync_debug=true

But I am not sure which log entry I should check.
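
(A sketch of how I could raise the debug level at runtime through the admin socket of
the syncing mon; the daemon name is the one from the log below, adjust it to yours:)

$ ceph daemon mon.NVMBD1CIF290D00 config set debug_mon 20/20
$ ceph daemon mon.NVMBD1CIF290D00 config set debug_ms 1/1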


Thanks in advance

BR,
Luke
MYCOM-OSI


From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Luke Kao 
[luke@mycom-osi.com]
Sent: Thursday, March 12, 2015 5:22 PM
To: ceph-us...@ceph.com
Subject: [ceph-users] Monitor stay in synchronizing state for over 24hour

Hello everyone,
I am currently trying to recover a ceph cluster from a disaster. I now have enough OSDs
(171 up and in, out of 195) and am left with 2 incomplete PGs.

However, the question now is not the incomplete PGs; it is that one mon service fails to
start because a strange, wrong monmap is used.  After injecting a monmap exported from
the cluster, the mon comes up and enters the synchronizing state, but it has not come
back after several hours.  I originally guessed this was normal because the whole
cluster is still busy recovering and backfilling, but it is over 24 hours now and there
is no hint of when the sync will finish or whether it is still healthy.

The log says it is still synchronizing, and I can see the files under store.db keep
being updated.


A small piece of the log for reference:
2015-03-12 03:20:15.025048 7f3cb6c48700 10 
mon.NVMBD1CIF290D00@0(synchronizing).data_health(0) service_tick
2015-03-12 03:20:15.025075 7f3cb6c48700  0 
mon.NVMBD1CIF290D00@0(synchronizing).data_health(0) update_stats avail 71% 
total 103080888 used 24281956 avail 73539668
2015-03-12 03:20:30.460672 7f3cb4b43700 10 -- 10.137.36.30:6789/0 >> 10.137.36.31:6789/0 pipe(0x3528280 sd=9 :57111 s=2 pgs=30630 cs=15 l=0 c=0x34b1760).aborted = 0
2015-03-12 03:20:30.460923 7f3cb4b43700 10 -- 10.137.36.30:6789/0 >> 10.137.36.31:6789/0 pipe(0x3528280 sd=9 :57111 s=2 pgs=30630 cs=15 l=0 c=0x34b1760).reader got message 1466470577 0x45b3c80 mon_sync(chunk cookie 37950063980 lc 12343379 bl 791970 bytes last_key logm,full_5120265) v2
2015-03-12 03:20:30.460963 7f3cbc783700 10 -- 10.137.36.30:6789/0 >> 10.137.36.31:6789/0 pipe(0x3528280 sd=9 :57111 s=2 pgs=30630 cs=15 l=0 c=0x34b1760).writer: state = open policy.server=0
2015-03-12 03:20:30.460988 7f3cbc783700 10 -- 10.137.36.30:6789/0 >> 10.137.36.31:6789/0 pipe(0x3528280 sd=9 :57111 s=2 pgs=30630 cs=15 l=0 c=0x34b1760).write_ack 1466470577
2015-03-12 03:20:30.461011 7f3cbc783700 10 -- 10.137.36.30:6789/0 >> 10.137.36.31:6789/0 pipe(0x3528280 sd=9 :57111 s=2 pgs=30630 cs=15 l=0 c=0x34b1760).writer: state = open policy.server=0
2015-03-12 03:20:30.461030 7f3cb6447700  1 -- 10.137.36.30:6789/0 <== mon.1 10.137.36.31:6789/0 1466470577 ==== mon_sync(chunk cookie 37950063980 lc 12343379 bl 791970 bytes last_key logm,full_5120265) v2 ==== 792163+0+0 (2147002791 0 0) 0x45b3c80 con 0x34b1760
2015-03-12 03:20:30.461048 7f3cb6447700 10 mon.NVMBD1CIF290D00@0(synchronizing) 
e1 handle_sync mon_sync(chunk cookie 37950063980 lc 12343379 bl 791970 bytes 
last_key logm,full_5120265) v2
2015-03-12 03:20:30.461052 7f3cb6447700 10 mon.NVMBD1CIF290D00@0(synchronizing) 
e1 handle_sync_chunk mon_sync(chunk cookie 37950063980 lc 12343379 bl 791970 
bytes last_key logm,full_5120265) v2
2015-03-12 03:20:30.463832 7f3cb6447700 10 mon.NVMBD1CIF290D00@0(synchronizing) 
e1 sync_reset_timeout


I am also wondering whether some OSDs fail to join the cluster because of this.  Some
OSD processes are up without errors, but after loading PGs they cannot proceed to boot,
and their status is still down and out.
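
(A sketch of how the state of such an OSD could be checked from its admin socket; osd.12
is just an example id:)

$ ceph daemon osd.12 status               # shows the internal state, e.g. booting
$ ceph daemon osd.12 dump_ops_in_flight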

Please advise, thanks



Luke Kao

MYCOM OSI
http://www.mycom-osi.com





[ceph-users] Monitor stay in synchronizing state for over 24hour

2015-03-12 Thread Luke Kao
Hello everyone,
I am currently trying to recover a ceph cluster from a disaster. I now have enough OSDs
(171 up and in, out of 195) and am left with 2 incomplete PGs.

However, the question now is not the incomplete PGs; it is that one mon service fails to
start because a strange, wrong monmap is used.  After injecting a monmap exported from
the cluster, the mon comes up and enters the synchronizing state, but it has not come
back after several hours.  I originally guessed this was normal because the whole
cluster is still busy recovering and backfilling, but it is over 24 hours now and there
is no hint of when the sync will finish or whether it is still healthy.

The log says it is still synchronizing, and I can see the files under store.db keep
being updated.


A small piece of the log for reference:
2015-03-12 03:20:15.025048 7f3cb6c48700 10 
mon.NVMBD1CIF290D00@0(synchronizing).data_health(0) service_tick
2015-03-12 03:20:15.025075 7f3cb6c48700  0 
mon.NVMBD1CIF290D00@0(synchronizing).data_health(0) update_stats avail 71% 
total 103080888 used 24281956 avail 73539668
2015-03-12 03:20:30.460672 7f3cb4b43700 10 -- 10.137.36.30:6789/0 >> 10.137.36.31:6789/0 pipe(0x3528280 sd=9 :57111 s=2 pgs=30630 cs=15 l=0 c=0x34b1760).aborted = 0
2015-03-12 03:20:30.460923 7f3cb4b43700 10 -- 10.137.36.30:6789/0 >> 10.137.36.31:6789/0 pipe(0x3528280 sd=9 :57111 s=2 pgs=30630 cs=15 l=0 c=0x34b1760).reader got message 1466470577 0x45b3c80 mon_sync(chunk cookie 37950063980 lc 12343379 bl 791970 bytes last_key logm,full_5120265) v2
2015-03-12 03:20:30.460963 7f3cbc783700 10 -- 10.137.36.30:6789/0 >> 10.137.36.31:6789/0 pipe(0x3528280 sd=9 :57111 s=2 pgs=30630 cs=15 l=0 c=0x34b1760).writer: state = open policy.server=0
2015-03-12 03:20:30.460988 7f3cbc783700 10 -- 10.137.36.30:6789/0 >> 10.137.36.31:6789/0 pipe(0x3528280 sd=9 :57111 s=2 pgs=30630 cs=15 l=0 c=0x34b1760).write_ack 1466470577
2015-03-12 03:20:30.461011 7f3cbc783700 10 -- 10.137.36.30:6789/0 >> 10.137.36.31:6789/0 pipe(0x3528280 sd=9 :57111 s=2 pgs=30630 cs=15 l=0 c=0x34b1760).writer: state = open policy.server=0
2015-03-12 03:20:30.461030 7f3cb6447700  1 -- 10.137.36.30:6789/0 <== mon.1 10.137.36.31:6789/0 1466470577 ==== mon_sync(chunk cookie 37950063980 lc 12343379 bl 791970 bytes last_key logm,full_5120265) v2 ==== 792163+0+0 (2147002791 0 0) 0x45b3c80 con 0x34b1760
2015-03-12 03:20:30.461048 7f3cb6447700 10 mon.NVMBD1CIF290D00@0(synchronizing) 
e1 handle_sync mon_sync(chunk cookie 37950063980 lc 12343379 bl 791970 bytes 
last_key logm,full_5120265) v2
2015-03-12 03:20:30.461052 7f3cb6447700 10 mon.NVMBD1CIF290D00@0(synchronizing) 
e1 handle_sync_chunk mon_sync(chunk cookie 37950063980 lc 12343379 bl 791970 
bytes last_key logm,full_5120265) v2
2015-03-12 03:20:30.463832 7f3cb6447700 10 mon.NVMBD1CIF290D00@0(synchronizing) 
e1 sync_reset_timeout


I am also wondering whether some OSDs fail to join the cluster because of this.  Some
OSD processes are up without errors, but after loading PGs they cannot proceed to boot,
and their status is still down and out.

Please advise, thanks



Luke Kao

MYCOM OSI
http://www.mycom-osi.com





Re: [ceph-users] CRUSHMAP for chassis balance

2015-02-14 Thread Luke Kao
Hi Gregory,
Thanks for the direction. I ended up with 3 different rules in one ruleset, one per
replication size. Tested: no bad mappings, and hosts / OSDs are correctly balanced
between the 2 chassis.

Not sure whether it can be optimized further, but I am happy with the current result:
rule rule_rep2 {
ruleset 0
type replicated
min_size 2
max_size 2
step take chassis1
step chooseleaf firstn 1 type host
step emit
step take chassis2
step chooseleaf firstn 1 type host
step emit
}
rule rule_rep34 {
ruleset 0
type replicated
min_size 3
max_size 4
step take default
step choose firstn 2 type chassis
step chooseleaf firstn 2 type host
step emit
}
rule rule_rep56 {
ruleset 0
type replicated
min_size 5
max_size 6
step take default
step choose firstn 3 type chassis
step chooseleaf firstn 3 type host
step emit
}
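
(For anyone else trying this, a sketch of how I check the mappings before injecting the
map; file names are examples, and check crushtool --help on your version for whether
--rule takes the rule number or the ruleset id:)

$ crushtool -c crushmap.txt -o crushmap.bin
$ crushtool -i crushmap.bin --test --rule 0 --num-rep 2 --show-bad-mappings
$ crushtool -i crushmap.bin --test --rule 0 --num-rep 4 --show-utilization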


Luke

From: Gregory Farnum [mailto:g...@gregs42.com]
Sent: Friday, February 13, 2015 11:01 PM
To: Luke Kao; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] CRUSHMAP for chassis balance

With sufficiently new CRUSH versions (all the latest point releases on LTS?) I think you
can simply have the rule return extra IDs, which are dropped if they exceed the number
required. So you can choose two chassis, then have those both choose two OSDs, and
return those 4 from the rule.
-Greg
On Fri, Feb 13, 2015 at 6:13 AM Luke Kao <luke@mycom-osi.com> wrote:
Dear cepher,
Currently I am working on a crushmap to try to make sure that at least one copy goes to
a different chassis.
Say chassis1 has host1,host2,host3, and chassis2 has host4,host5,host6.

With replication = 2, it's not a problem; I can use the following steps in the rule:
step take chassis1
step chooseleaf firstn 1 type host
step emit
step take chassis2
step chooseleaf firstn 1 type host
step emit

But for replication = 3, I tried:
step take chassis1
step chooseleaf firstn 1 type host
step emit
step take chassis2
step chooseleaf firstn 1 type host
step emit
step take default
step chooseleaf firstn 1 type host
step emit

In the end, the 3rd OSD returned by the rule test is always a duplicate of the first
or second one.

Any idea, or what direction should I take to move forward?
Thanks in advance

BR,
Luke
MYCOM-OSI









[ceph-users] CRUSHMAP for chassis balance

2015-02-13 Thread Luke Kao
Dear cepher,
Currently I am working on a crushmap to try to make sure that at least one copy goes to
a different chassis.
Say chassis1 has host1,host2,host3, and chassis2 has host4,host5,host6.

With replication = 2, it's not a problem; I can use the following steps in the rule:
step take chassis1
step chooseleaf firstn 1 type host
step emit
step take chassis2
step chooseleaf firstn 1 type host
step emit

But for replication = 3, I tried:
step take chassis1
step chooseleaf firstn 1 type host
step emit
step take chassis2
step chooseleaf firstn 1 type host
step emit
step take default
step chooseleaf firstn 1 type host
step emit

In the end, the 3rd OSD returned by the rule test is always a duplicate of the first
or second one.

Any idea, or what direction should I take to move forward?
Thanks in advance

BR,
Luke
MYCOM-OSI






Re: [ceph-users] btrfs backend with autodefrag mount option

2015-01-30 Thread Luke Kao
Thanks Lionel, we are using btrfs compression and it's also stable in our cluster.

Currently, another minor problem with btrfs fragmentation is that we sometimes see the
btrfs-transacti process pause the whole OSD node's I/O for seconds, impacting all OSDs
on the server, especially when doing recovery / backfill.

However, I worry that an OSD restart taking 30 minutes may become a problem for
maintenance.

I will share if we have any result on testing different settings.


BR,
Luke



From: Lionel Bouton [lionel-subscript...@bouton.name]
Sent: Saturday, January 31, 2015 2:29 AM
To: Luke Kao; ceph-us...@ceph.com
Subject: Re: [ceph-users] btrfs backend with autodefrag mount option

On 01/30/15 14:24, Luke Kao wrote:

Dear ceph users,

Has anyone tried adding the autodefrag mount option when using btrfs as the OSD
storage?



In some previous discussions it was mentioned that btrfs OSD startup becomes very slow
after being used for some time, so we are thinking that adding autodefrag may help.



We will add on our test cluster first to see if there is any difference.

We used autodefrag but it didn't help: performance degrades over time. One 
possibility raised in previous discussions here is that BTRFS's autodefrag 
isn't smart enough when snapshots are heavily used as is the case with Ceph OSD 
by default.

There are some tunings available that we have yet to test :

filestore btrfs snap
filestore btrfs clone range
filestore journal parallel



All are enabled by default for BTRFS backends. snap is probably the first you 
might want to disable and check how autodefrag and defrag behave. It might be 
possible to use snap and defrag, BTRFS was quite stable for us (but all our 
OSDs are on systems with at least 72GB RAM which have enough CPU power so 
memory wasn't much of an issue).
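
(A sketch of how these could be turned off in ceph.conf for testing; this is just an
assumption on my side, not something I have validated, and the OSDs need a restart to
pick it up:)

[osd]
    filestore btrfs snap = false          ; first candidate to disable, per the above
    filestore btrfs clone range = false
    ; filestore journal parallel is left at its default here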

Best regards,

Lionel Bouton





[ceph-users] btrfs backend with autodefrag mount option

2015-01-30 Thread Luke Kao
Dear ceph users,

Has anyone tried adding the autodefrag mount option when using btrfs as the OSD
storage?



In some previous discussions it was mentioned that btrfs OSD startup becomes very slow
after being used for some time, so we are thinking that adding autodefrag may help.



We will add on our test cluster first to see if there is any difference.





Please kindly share experience if available, thanks





Luke Kao

MYCOM OSI





Re: [ceph-users] How to do maintenance without falling out of service?

2015-01-21 Thread Luke Kao
Hi David,
What are your pools' size and min_size settings?
In your cluster, you may need to set min_size=1 on all pools before shutting down a
server.
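
(A rough sketch of what I would run around the maintenance window; the pool name is an
example, adjust it to your pools, and remember to revert the settings afterwards:)

$ ceph osd dump | grep 'replicated size'     # check size / min_size of each pool
$ ceph osd pool set rbd min_size 1           # allow I/O with only one replica up
$ ceph osd set noout                         # avoid rebalancing while the host is down
  ... do the maintenance, bring the host back ...
$ ceph osd unset noout
$ ceph osd pool set rbd min_size 2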


BR,
Luke
MYCOM-OSI

From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of J David 
[j.david.li...@gmail.com]
Sent: Tuesday, January 20, 2015 12:40 AM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] How to do maintenance without falling out of service?

A couple of weeks ago, we had some involuntary maintenance come up
that required us to briefly turn off one node of a three-node ceph
cluster.

To our surprise, this resulted in failure to write on the VM's on that
ceph cluster, even though we set noout before the maintenance.

This cluster is for bulk storage, it has copies=1 (2 total) and very
large SATA drives.  The OSD tree looks like this:

# id weight type name up/down reweight
-1 127.1 root default
-2 18.16 host f16
0 4.54 osd.0 up 1
1 4.54 osd.1 up 1
2 4.54 osd.2 up 1
3 4.54 osd.3 up 1
-3 54.48 host f17
4 4.54 osd.4 up 1
5 4.54 osd.5 up 1
6 4.54 osd.6 up 1
7 4.54 osd.7 up 1
8 4.54 osd.8 up 1
9 4.54 osd.9 up 1
10 4.54 osd.10 up 1
11 4.54 osd.11 up 1
12 4.54 osd.12 up 1
13 4.54 osd.13 up 1
14 4.54 osd.14 up 1
15 4.54 osd.15 up 1
-4 54.48 host f18
16 4.54 osd.16 up 1
17 4.54 osd.17 up 1
18 4.54 osd.18 up 1
19 4.54 osd.19 up 1
20 4.54 osd.20 up 1
21 4.54 osd.21 up 1
22 4.54 osd.22 up 1
23 4.54 osd.23 up 1
24 4.54 osd.24 up 1
25 4.54 osd.25 up 1
26 4.54 osd.26 up 1
27 4.54 osd.27 up 1

The host that was turned off was f18.  f16 does have a handful of
OSDs, but it is mostly there to provide an odd number of monitors.
The cluster is very lightly used, here is the current status:

cluster e9c32e63-f3eb-4c25-b172-4815ed566ec7
 health HEALTH_OK
 monmap e3: 3 mons at
{f16=192.168.19.216:6789/0,f17=192.168.19.217:6789/0,f18=192.168.19.218:6789/0},
election epoch 28, quorum 0,1,2 f16,f17,f18
 osdmap e1674: 28 osds: 28 up, 28 in
  pgmap v12965109: 1152 pgs, 3 pools, 11139 GB data, 2784 kobjects
22314 GB used, 105 TB / 127 TB avail
1152 active+clean
  client io 38162 B/s wr, 9 op/s

Where did we go wrong last time?  How can we do the same maintenance
to f17 (taking it offline for about 15-30 minutes) without repeating
our mistake?

As it stands, it seems like we have inadvertently created a cluster
with three single points of failure, rather than none.  That has not
been our experience with our other clusters, so we're really confused
at present.

Thanks for any advice!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





Re: [ceph-users] any workaround for FAILED assert(p != snapset.clones.end())

2015-01-14 Thread Luke Kao
Hi Sam and Greg,
No, we are not using a cache tier.
Just for your information, the backend filestore is btrfs with zlib compression.

Do I need to provide any more information?
Thanks.


BR,
Luke


From: Samuel Just [sam.j...@inktank.com]
Sent: Wednesday, January 14, 2015 1:22 AM
To: Luke Kao
Cc: ceph-us...@ceph.com
Subject: Re: [ceph-users] any workaround for FAILED assert(p != 
snapset.clones.end())

Are you using a cache tier?
-Sam

On Mon, Jan 12, 2015 at 11:37 PM, Luke Kao luke@mycom-osi.com wrote:
 Hello community,
 We have a cluster using v0.80.5, and recently several OSDs go down with an
 error when removing an rbd snapshot:
 osd/ReplicatedPG.cc: 2352: FAILED assert(p != snapset.clones.end())

 and after restarting those OSDs, they go down again soon with the same error.
 It looks linked to bug #8629, but before upgrading to the patched version,
 is there any workaround other than reformatting the disks and recreating the OSDs?

 Also, a side question: I don't find this bug fix in the release notes of v0.80.6
 or v0.80.7, so should I assume the patch is not yet released?

 Thanks

 BR,
 Luke Kao
 MYCOM-OSI



 







[ceph-users] any workaround for FAILED assert(p != snapset.clones.end())

2015-01-13 Thread Luke Kao
Hello community,
We have a cluster using v0.80.5, and recently several OSDs go down with an error
when removing an rbd snapshot:
osd/ReplicatedPG.cc: 2352: FAILED assert(p != snapset.clones.end())

and after restarting those OSDs, they go down again soon with the same error.
It looks linked to bug #8629, but before upgrading to the patched version, is there any
workaround other than reformatting the disks and recreating the OSDs?

Also, a side question: I don't find this bug fix in the release notes of v0.80.6 or
v0.80.7, so should I assume the patch is not yet released?

Thanks

BR,
Luke Kao
MYCOM-OSI







[ceph-users] RBD pool with unfound objects

2014-12-23 Thread Luke Kao
Hi all,
I have some questions about unfound objects in an rbd pool: what is the real impact on
the rbd image?

Currently our cluster (running v0.80.5) has 25 unfound objects due to recent OSD
crashes, and we cannot mark them as lost yet (bug #10405 created for this).
So far it seems we can still mount the rbd image (the filesystem is xfs), but I would
like to know the real impact:
1. My guess is that it should be like a bad sector on a real hard disk?
2. Is there any way to identify which files on the RBD disk are impacted? (See the
sketch after this list for how we are thinking about it.)
3. What happens if we mark them as lost using ceph pg {pgid} mark_unfound_lost
revert / delete?
4. Is it better to copy the current rbd image to another new one and use the new one
instead?
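
(On question 2, a sketch of how we are thinking about it; the object name below is
hypothetical, real names would come from "ceph pg {pgid} list_missing", and the 4 MB
object size assumes the default rbd order of 22:)

OBJ=rb.0.1234.238e1f29.000000000456        # hypothetical unfound object name
IDX=$((16#${OBJ##*.}))                     # trailing hex suffix = object index in the image
echo "impacted range starts at byte offset $((IDX * 4 * 1024 * 1024))"

With that offset, something like xfs_bmap -v on candidate files inside the mounted
filesystem could in principle tell which file covers the damaged range, though we have
not tried this yet.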

Any suggestion for the current situation is also welcome; we need to keep the data
inside this RBD.

Thanks in advance,



BR,
Luke



