Re: [ceph-users] ceph-deploy with --release (--stable) for dumpling?

2014-08-26 Thread Konrad Gutkowski
Ceph-deploy should set a priority for the Ceph repository, but it doesn't; as a
result apt usually installs the highest version available from any repository.
I don't know if this is intentional, but you can change this yourself -
google "apt repository priority" and set it on all your nodes.


On 26.08.2014 at 02:52, Nigel Williams wrote:



ceph-deploy --release dumpling or previously ceph-deploy --stable
dumpling now results in Firefly (0.80.1) being installed, is this
intentional?

I'm adding another host with more OSDs and guessing it is preferable
to deploy the same version.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



--

Konrad Gutkowski
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] question about getting rbd.ko and ceph.ko

2014-08-26 Thread yuelongguang
hi,all
 
Is there a way to get rbd.ko and ceph.ko for CentOS 6.x?
 
Or do I have to build them from source code? What is the minimum kernel version?
 
thanks
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] question about getting rbd.ko and ceph.ko

2014-08-26 Thread Irek Fasikhov
Hi,

There is no module support before kernel 2.6.37, and using a kernel that old is not recommended.

But you can use http://elrepo.org/tiki/kernel-ml
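
Roughly, for CentOS 6.x (a sketch - install the elrepo-release package from
elrepo.org first and double-check the current instructions there):

# enable the elrepo-kernel repository and install the mainline kernel
yum --enablerepo=elrepo-kernel install kernel-ml
# make the new kernel the default in /boot/grub/grub.conf, then reboot
# afterwards the modules should be loadable:
modprobe rbd
modprobe ceph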


2014-08-26 11:56 GMT+04:00 yuelongguang :

> hi,all
>
> is there a way to get rbd,ko and ceph.ko for centos 6.X.
>
> or  i have to build them from source code?  which is the least kernel
> version?
>
> thanks
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 
Best regards, Irek Fasikhov
Mobile: +79229045757
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] enrich ceph test methods, what is your concern about ceph. thanks

2014-08-26 Thread yuelongguang
Hi all,
 
I am planning to run a set of tests on Ceph covering performance, throughput,
scalability and availability.
To get a complete picture I hope you can give me some advice; I can share the
results with you if you like.
For each test category (performance, throughput, scalability, availability),
do you have test ideas and test tools to suggest?
I already know some tools for testing throughput and IOPS, but please tell me
which tools you prefer and what results you would expect.
 
thanks very much
 ___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph monitor load, low performance

2014-08-26 Thread Mateusz Skała


Hi, thanks for the reply.



From the top of my head, it is recommended to use 3 mons in
production. Also, for the 22 osds your number of PGs looks a bit low,
you should look at that.
I get it from 
http://ceph.com/docs/master/rados/operations/placement-groups/


(22 OSDs * 100) / 3 replicas = 733, rounded up to the next power of two: ~1024 PGs.
Please correct me if I'm wrong.
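
(If the pools were created with fewer PGs, they can be raised afterwards - a
sketch, assuming the pool in question is named "rbd"; set pg_num first, then
pgp_num, and raise them gradually on a loaded cluster:

ceph osd pool get rbd pg_num
ceph osd pool set rbd pg_num 1024
ceph osd pool set rbd pgp_num 1024
)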

There will be 5 mons (on 6 hosts), but first we must migrate some data off the
servers that are still in use.





The performance of the cluster is poor - this is too vague. What is
your current performance, what benchmarks have you tried, what is your
data workload and most importantly, how is your cluster setup. what
disks, ssds, network, ram, etc.

Please provide more information so that people could help you.

Andrei


Hardware informations:
ceph15:
RAM: 4GB
Network: 4x 1GB NIC
OSD disks:
2x SATA Seagate ST31000524NS
2x SATA WDC WD1003FBYX-18Y7B0

ceph25:
RAM: 16GB
Network: 4x 1GB NIC
OSD disks:
2x SATA WDC WD7500BPKX-7
2x SATA WDC WD7500BPKX-2
2x SATA SSHD ST1000LM014-1EJ164

ceph30
RAM: 16GB
Network: 4x 1GB NIC
OSD disks:
6x SATA SSHD ST1000LM014-1EJ164

ceph35:
RAM: 16GB
Network: 4x 1GB NIC
OSD disks:
6x SATA SSHD ST1000LM014-1EJ164


All journals are on the OSD disks. 2 NICs are for the backend network (10.20.4.0/22)
and 2 NICs are for the frontend (10.20.8.0/22).


We use this cluster as the storage backend for <100 VMs on KVM. I haven't run
benchmarks, but all VMs were migrated from Xen+GlusterFS(NFS). Before the
migration every VM ran fine; now each VM hangs for a few seconds from time to
time, and applications installed on the VMs take much longer to load. GlusterFS
was running on 2 servers with 1x 1GB NIC and 2x8 WDC WD7500BPKX-7 disks.


I ran one test with recovery: if a disk is marked out, recovery I/O is
150-200MB/s, but all VMs hang until the recovery ends.


The biggest load is on ceph35: IOPS on each disk are near 150 and CPU load is
~4-5.

On the other hosts CPU load is <2, with 120~130 IOPS per disk.

Our ceph.conf

===
[global]

fsid=a9d17295-62f2-46f6-8325-1cad7724e97f
mon initial members = ceph35, ceph30, ceph25, ceph15
mon host = 10.20.8.35, 10.20.8.30, 10.20.8.25, 10.20.8.15
public network = 10.20.8.0/22
cluster network = 10.20.4.0/22
osd journal size = 1024
filestore xattr use omap = true
osd pool default size = 3
osd pool default min size = 1
osd pool default pg num = 1024
osd pool default pgp num = 1024
osd crush chooseleaf type = 1
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
rbd default format = 2

##ceph35 osds
[osd.0]
cluster addr = 10.20.4.35
[osd.1]
cluster addr = 10.20.4.35
[osd.2]
cluster addr = 10.20.4.35
[osd.3]
cluster addr = 10.20.4.36
[osd.4]
cluster addr = 10.20.4.36
[osd.5]
cluster addr = 10.20.4.36

##ceph25 osds
[osd.6]
cluster addr = 10.20.4.25
public addr = 10.20.8.25
[osd.7]
cluster addr = 10.20.4.25
public addr = 10.20.8.25
[osd.8]
cluster addr = 10.20.4.25
public addr = 10.20.8.25
[osd.9]
cluster addr = 10.20.4.26
public addr = 10.20.8.26
[osd.10]
cluster addr = 10.20.4.26
public addr = 10.20.8.26
[osd.11]
cluster addr = 10.20.4.26
public addr = 10.20.8.26

##ceph15 osds
[osd.12]
cluster addr = 10.20.4.15
public addr = 10.20.8.15
[osd.13]
cluster addr = 10.20.4.15
public addr = 10.20.8.15
[osd.14]
cluster addr = 10.20.4.15
public addr = 10.20.8.15
[osd.15]
cluster addr = 10.20.4.16
public addr = 10.20.8.16

##ceph30 osds
[osd.16]
cluster addr = 10.20.4.30
public addr = 10.20.8.30
[osd.17]
cluster addr = 10.20.4.30
public addr = 10.20.8.30
[osd.18]
cluster addr = 10.20.4.30
public addr = 10.20.8.30
[osd.19]
cluster addr = 10.20.4.31
public addr = 10.20.8.31
[osd.20]
cluster addr = 10.20.4.31
public addr = 10.20.8.31
[osd.21]
cluster addr = 10.20.4.31
public addr = 10.20.8.31

[mon.ceph35]
host = ceph35
mon addr = 10.20.8.35:6789
[mon.ceph30]
host = ceph30
mon addr = 10.20.8.30:6789
[mon.ceph25]
host = ceph25
mon addr = 10.20.8.25:6789
[mon.ceph15]
host = ceph15
mon addr = 10.20.8.15:6789


Regards,
Mateusz


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] enrich ceph test methods, what is your concern about ceph. thanks

2014-08-26 Thread Irek Fasikhov
Hi.
I, like many people, use fio.
For Ceph RBD, fio has a special engine:
https://telekomcloud.github.io/ceph/2014/02/26/ceph-performance-analysis_fio_rbd.html
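
A minimal job file looks roughly like this (a sketch - it assumes fio was built
with the rbd engine and that a test image named "fio-test" already exists in the
"rbd" pool, e.g. created with "rbd create fio-test --size 10240"):

; fio-rbd.fio - 4k random writes against an RBD image via librbd
[global]
ioengine=rbd
clientname=admin
pool=rbd
rbdname=fio-test
bs=4k

[rand-write]
rw=randwrite
iodepth=32
runtime=60
time_based

Run it with "fio fio-rbd.fio" on a host that has ceph.conf and the admin keyring.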


2014-08-26 12:15 GMT+04:00 yuelongguang :

> hi,all
>
> i am planning to do a test on ceph, include performance, throughput,
> scalability,availability.
> in order to get a full test result, i  hope you all can give me some
> advice. meanwhile i can send the result to you,if you like.
> as for each category test( performance, throughput,
> scalability,availability)  ,  do you have some some test idea and test
> tools?
> basicly i have know some tools to test throughtput,iops .  but you can
> tell the tools you prefer and the result you expect.
>
> thanks very much
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 
Best regards, Irek Fasikhov
Mobile: +79229045757
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph cluster inconsistency?

2014-08-26 Thread Kenneth Waegeman


Hi,

In the meantime I tried upgrading the cluster to 0.84 to see if that made a
difference, and it seems it does.

I can no longer reproduce the crashing OSDs by doing a 'rados -p ecdata ls'.

But now the cluster detects that it is inconsistent:

  cluster 82766e04-585b-49a6-a0ac-c13d9ffd0a7d
   health HEALTH_ERR 40 pgs inconsistent; 40 scrub errors; too  
few pgs per osd (4 < min 20); mon.ceph002 low disk space
   monmap e3: 3 mons at  
{ceph001=10.141.8.180:6789/0,ceph002=10.141.8.181:6789/0,ceph003=10.141.8.182:6789/0}, election epoch 30, quorum 0,1,2  
ceph001,ceph002,ceph003

   mdsmap e78951: 1/1/1 up {0=ceph003.cubone.os=up:active}, 3 up:standby
   osdmap e145384: 78 osds: 78 up, 78 in
pgmap v247095: 320 pgs, 4 pools, 15366 GB data, 3841 kobjects
  1502 GB used, 129 TB / 131 TB avail
   279 active+clean
40 active+clean+inconsistent
 1 active+clean+scrubbing+deep


I tried to do ceph pg repair for all the inconsistent pgs:

  cluster 82766e04-585b-49a6-a0ac-c13d9ffd0a7d
   health HEALTH_ERR 40 pgs inconsistent; 1 pgs repair; 40 scrub  
errors; too few pgs per osd (4 < min 20); mon.ceph002 low disk space
   monmap e3: 3 mons at  
{ceph001=10.141.8.180:6789/0,ceph002=10.141.8.181:6789/0,ceph003=10.141.8.182:6789/0}, election epoch 30, quorum 0,1,2  
ceph001,ceph002,ceph003

   mdsmap e79486: 1/1/1 up {0=ceph003.cubone.os=up:active}, 3 up:standby
   osdmap e146452: 78 osds: 78 up, 78 in
pgmap v248520: 320 pgs, 4 pools, 15366 GB data, 3841 kobjects
  1503 GB used, 129 TB / 131 TB avail
   279 active+clean
39 active+clean+inconsistent
 1 active+clean+scrubbing+deep
 1 active+clean+scrubbing+deep+inconsistent+repair
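
(The repairs were kicked off with a loop along these lines - a rough sketch,
taking the PG ids from the pg dump:

ceph pg dump | awk '/inconsistent/ {print $1}' | while read pg; do
    ceph pg repair "$pg"
done
)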

I let it recover through the night, but this morning the mons were all gone,
with nothing to see in the log files. The OSDs were all still up!


cluster 82766e04-585b-49a6-a0ac-c13d9ffd0a7d
 health HEALTH_ERR 36 pgs inconsistent; 1 pgs repair; 36 scrub  
errors; too few pgs per osd (4 < min 20)
 monmap e7: 3 mons at  
{ceph001=10.141.8.180:6789/0,ceph002=10.141.8.181:6789/0,ceph003=10.141.8.182:6789/0}, election epoch 44, quorum 0,1,2  
ceph001,ceph002,ceph003

 mdsmap e109481: 1/1/1 up {0=ceph003.cubone.os=up:active}, 3 up:standby
 osdmap e203410: 78 osds: 78 up, 78 in
  pgmap v331747: 320 pgs, 4 pools, 15251 GB data, 3812 kobjects
1547 GB used, 129 TB / 131 TB avail
   1 active+clean+scrubbing+deep+inconsistent+repair
 284 active+clean
  35 active+clean+inconsistent

I have restarted the monitors now; I will let you know when I see something more.



- Message from Haomai Wang  -
 Date: Sun, 24 Aug 2014 12:51:41 +0800
 From: Haomai Wang 
Subject: Re: [ceph-users] ceph cluster inconsistency?
   To: Kenneth Waegeman ,  
ceph-users@lists.ceph.com




It's really strange! I write a test program according the key ordering
you provided and parse the corresponding value. It's true!

I have no idea now. If free, could you add this debug code to
"src/os/GenericObjectMap.cc" and insert *before* "assert(start <=
header.oid);":

  dout(0) << "start: " << start << "header.oid: " << header.oid << dendl;

Then you need to recompile ceph-osd and run it again. The output log
can help it!

On Tue, Aug 19, 2014 at 10:19 PM, Haomai Wang  wrote:

I feel a little embarrassed, 1024 rows still true for me.

I was wondering if you could give your all keys via
""ceph-kvstore-tool /var/lib/ceph/osd/ceph-67/current/ list
_GHOBJTOSEQ_ > keys.log“.

thanks!

On Tue, Aug 19, 2014 at 4:58 PM, Kenneth Waegeman
 wrote:


- Message from Haomai Wang  -
 Date: Tue, 19 Aug 2014 12:28:27 +0800

 From: Haomai Wang 
Subject: Re: [ceph-users] ceph cluster inconsistency?
   To: Kenneth Waegeman 
   Cc: Sage Weil , ceph-users@lists.ceph.com



On Mon, Aug 18, 2014 at 7:32 PM, Kenneth Waegeman
 wrote:



- Message from Haomai Wang  -
 Date: Mon, 18 Aug 2014 18:34:11 +0800

 From: Haomai Wang 
Subject: Re: [ceph-users] ceph cluster inconsistency?
   To: Kenneth Waegeman 
   Cc: Sage Weil , ceph-users@lists.ceph.com




On Mon, Aug 18, 2014 at 5:38 PM, Kenneth Waegeman
 wrote:



Hi,

I tried this after restarting the osd, but I guess that was not the aim
(
# ceph-kvstore-tool /var/lib/ceph/osd/ceph-67/current/ list
_GHOBJTOSEQ_|
grep 6adb1100 -A 100
IO error: lock /var/lib/ceph/osd/ceph-67/current//LOCK: Resource
temporarily
unavailable
tools/ceph_kvstore_tool.cc: In function 'StoreTool::StoreTool(const
string&)' thread 7f8fecf7d780 time 2014-08-18 11:12:29.551780
tools/ceph_kvstore_tool.cc: 38: FAILED assert(!db_ptr->open(std::cerr))
..
)

When I run it after bringing the osd down, it takes a while, but it has
no
output.. (When running it without the grep, I'm getting a huge list )




Oh

Re: [ceph-users] v0.84 released

2014-08-26 Thread Stijn De Weirdt

hi all,


there are a zillion OSD bug fixes. Things are looking pretty good for the
Giant release that is coming up in the next month.
Any chance of having a compilable cephfs kernel module for el7 for the
next major release?




stijn
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph monitor load, low performance

2014-08-26 Thread Irek Fasikhov
Move the logs to SSD and you will immediately increase performance; you are
losing about 50% of performance on them. And for three replicas, more than 5
hosts are recommended.


2014-08-26 12:17 GMT+04:00 Mateusz Skała :

>
> Hi thanks for reply.
>
>
>
>  From the top of my head, it is recommended to use 3 mons in
>> production. Also, for the 22 osds your number of PGs look a bug low,
>> you should look at that.
>>
> I get it from http://ceph.com/docs/master/rados/operations/placement-
> groups/
>
> (22osd's * 100)/3 replicas = 733, ~1024 pgs
> Please correct me if I'm wrong.
>
> It will be 5 mons (on 6 hosts) but now we must migrate some data from used
> servers.
>
>
>
>
>> The performance of the cluster is poor - this is too vague. What is
>> your current performance, what benchmarks have you tried, what is your
>> data workload and most importantly, how is your cluster setup. what
>> disks, ssds, network, ram, etc.
>>
>> Please provide more information so that people could help you.
>>
>> Andrei
>>
>
> Hardware informations:
> ceph15:
> RAM: 4GB
> Network: 4x 1GB NIC
> OSD disk's:
> 2x SATA Seagate ST31000524NS
> 2x SATA WDC WD1003FBYX-18Y7B0
>
> ceph25:
> RAM: 16GB
> Network: 4x 1GB NIC
> OSD disk's:
> 2x SATA WDC WD7500BPKX-7
> 2x SATA WDC WD7500BPKX-2
> 2x SATA SSHD ST1000LM014-1EJ164
>
> ceph30
> RAM: 16GB
> Network: 4x 1GB NIC
> OSD disks:
> 6x SATA SSHD ST1000LM014-1EJ164
>
> ceph35:
> RAM: 16GB
> Network: 4x 1GB NIC
> OSD disks:
> 6x SATA SSHD ST1000LM014-1EJ164
>
>
> All journals are on OSD's. 2 NIC are for backend network (10.20.4.0/22)
> and 2 NIC are for frontend (10.20.8.0/22).
>
> This cluster we use as storage backend for <100VM's on KVM. I don't make
> benchmarks but all vm's are migrated from Xen+GlusterFS(NFS), before
> migration every VM are running fine, now each VM  from time to time hangs
> for few seconds, apps installed on VM's loading much more time. GlusterFS
> are running on 2 servers with 1x 1GB NIC and 2x8 disks WDC WD7500BPKX-7.
>
> I make one test with recovery, if disk marks out, then recovery io is
> 150-200MB/s but all vm's hangs until recovery ends.
>
> Biggest load is on ceph35, IOps on each disk are near 150, cpu load ~4-5.
> On other hosts cpu load <2, 120~130iops
>
> Our ceph.conf
>
> ===
> [global]
>
> fsid=a9d17295-62f2-46f6-8325-1cad7724e97f
> mon initial members = ceph35, ceph30, ceph25, ceph15
> mon host = 10.20.8.35, 10.20.8.30, 10.20.8.25, 10.20.8.15
> public network = 10.20.8.0/22
> cluster network = 10.20.4.0/22
> osd journal size = 1024
> filestore xattr use omap = true
> osd pool default size = 3
> osd pool default min size = 1
> osd pool default pg num = 1024
> osd pool default pgp num = 1024
> osd crush chooseleaf type = 1
> auth cluster required = cephx
> auth service required = cephx
> auth client required = cephx
> rbd default format = 2
>
> ##ceph35 osds
> [osd.0]
> cluster addr = 10.20.4.35
> [osd.1]
> cluster addr = 10.20.4.35
> [osd.2]
> cluster addr = 10.20.4.35
> [osd.3]
> cluster addr = 10.20.4.36
> [osd.4]
> cluster addr = 10.20.4.36
> [osd.5]
> cluster addr = 10.20.4.36
>
> ##ceph25 osds
> [osd.6]
> cluster addr = 10.20.4.25
> public addr = 10.20.8.25
> [osd.7]
> cluster addr = 10.20.4.25
> public addr = 10.20.8.25
> [osd.8]
> cluster addr = 10.20.4.25
> public addr = 10.20.8.25
> [osd.9]
> cluster addr = 10.20.4.26
> public addr = 10.20.8.26
> [osd.10]
> cluster addr = 10.20.4.26
> public addr = 10.20.8.26
> [osd.11]
> cluster addr = 10.20.4.26
> public addr = 10.20.8.26
>
> ##ceph15 osds
> [osd.12]
> cluster addr = 10.20.4.15
> public addr = 10.20.8.15
> [osd.13]
> cluster addr = 10.20.4.15
> public addr = 10.20.8.15
> [osd.14]
> cluster addr = 10.20.4.15
> public addr = 10.20.8.15
> [osd.15]
> cluster addr = 10.20.4.16
> public addr = 10.20.8.16
>
> ##ceph30 osds
> [osd.16]
> cluster addr = 10.20.4.30
> public addr = 10.20.8.30
> [osd.17]
> cluster addr = 10.20.4.30
> public addr = 10.20.8.30
> [osd.18]
> cluster addr = 10.20.4.30
> public addr = 10.20.8.30
> [osd.19]
> cluster addr = 10.20.4.31
> public addr = 10.20.8.31
> [osd.20]
> cluster addr = 10.20.4.31
> public addr = 10.20.8.31
> [osd.21]
> cluster addr = 10.20.4.31
> public addr = 10.20.8.31
>
> [mon.ceph35]
> host = ceph35
> mon addr = 10.20.8.35:6789
> [mon.ceph30]
> host = ceph30
> mon addr = 10.20.8.30:6789
> [mon.ceph25]
> host = ceph25
> mon addr = 10.20.8.25:6789
> [mon.ceph15]
> host = ceph15
> mon addr = 10.20.8.15:6789
> 
>
> Regards,
>
> Mateusz
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Best regards, Irek Fasikhov
Mobile: +79229045757
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph cluster inconsistency?

2014-08-26 Thread Haomai Wang
Hmm, it looks like you hit this bug (http://tracker.ceph.com/issues/9223).

Sorry for the late message, I forgot that this fix was merged into 0.84.

Thanks for your patience :-)

On Tue, Aug 26, 2014 at 4:39 PM, Kenneth Waegeman
 wrote:
>
> Hi,
>
> In the meantime I already tried with upgrading the cluster to 0.84, to see
> if that made a difference, and it seems it does.
> I can't reproduce the crashing osds by doing a 'rados -p ecdata ls' anymore.
>
> But now the cluster detect it is inconsistent:
>
>   cluster 82766e04-585b-49a6-a0ac-c13d9ffd0a7d
>health HEALTH_ERR 40 pgs inconsistent; 40 scrub errors; too few pgs
> per osd (4 < min 20); mon.ceph002 low disk space
>monmap e3: 3 mons at
> {ceph001=10.141.8.180:6789/0,ceph002=10.141.8.181:6789/0,ceph003=10.141.8.182:6789/0},
> election epoch 30, quorum 0,1,2 ceph001,ceph002,ceph003
>mdsmap e78951: 1/1/1 up {0=ceph003.cubone.os=up:active}, 3 up:standby
>osdmap e145384: 78 osds: 78 up, 78 in
> pgmap v247095: 320 pgs, 4 pools, 15366 GB data, 3841 kobjects
>   1502 GB used, 129 TB / 131 TB avail
>279 active+clean
> 40 active+clean+inconsistent
>  1 active+clean+scrubbing+deep
>
>
> I tried to do ceph pg repair for all the inconsistent pgs:
>
>   cluster 82766e04-585b-49a6-a0ac-c13d9ffd0a7d
>health HEALTH_ERR 40 pgs inconsistent; 1 pgs repair; 40 scrub errors;
> too few pgs per osd (4 < min 20); mon.ceph002 low disk space
>monmap e3: 3 mons at
> {ceph001=10.141.8.180:6789/0,ceph002=10.141.8.181:6789/0,ceph003=10.141.8.182:6789/0},
> election epoch 30, quorum 0,1,2 ceph001,ceph002,ceph003
>mdsmap e79486: 1/1/1 up {0=ceph003.cubone.os=up:active}, 3 up:standby
>osdmap e146452: 78 osds: 78 up, 78 in
> pgmap v248520: 320 pgs, 4 pools, 15366 GB data, 3841 kobjects
>   1503 GB used, 129 TB / 131 TB avail
>279 active+clean
> 39 active+clean+inconsistent
>  1 active+clean+scrubbing+deep
>  1 active+clean+scrubbing+deep+inconsistent+repair
>
> I let it recovering through the night, but this morning the mons were all
> gone, nothing to see in the log files.. The osds were all still up!
>
> cluster 82766e04-585b-49a6-a0ac-c13d9ffd0a7d
>  health HEALTH_ERR 36 pgs inconsistent; 1 pgs repair; 36 scrub errors;
> too few pgs per osd (4 < min 20)
>  monmap e7: 3 mons at
> {ceph001=10.141.8.180:6789/0,ceph002=10.141.8.181:6789/0,ceph003=10.141.8.182:6789/0},
> election epoch 44, quorum 0,1,2 ceph001,ceph002,ceph003
>  mdsmap e109481: 1/1/1 up {0=ceph003.cubone.os=up:active}, 3 up:standby
>  osdmap e203410: 78 osds: 78 up, 78 in
>   pgmap v331747: 320 pgs, 4 pools, 15251 GB data, 3812 kobjects
> 1547 GB used, 129 TB / 131 TB avail
>1 active+clean+scrubbing+deep+inconsistent+repair
>  284 active+clean
>   35 active+clean+inconsistent
>
> I restarted the monitors now, I will let you know when I see something
> more..
>
>
>
>
> - Message from Haomai Wang  -
>  Date: Sun, 24 Aug 2014 12:51:41 +0800
>
>  From: Haomai Wang 
> Subject: Re: [ceph-users] ceph cluster inconsistency?
>To: Kenneth Waegeman ,
> ceph-users@lists.ceph.com
>
>
>> It's really strange! I write a test program according the key ordering
>> you provided and parse the corresponding value. It's true!
>>
>> I have no idea now. If free, could you add this debug code to
>> "src/os/GenericObjectMap.cc" and insert *before* "assert(start <=
>> header.oid);":
>>
>>   dout(0) << "start: " << start << "header.oid: " << header.oid << dendl;
>>
>> Then you need to recompile ceph-osd and run it again. The output log
>> can help it!
>>
>> On Tue, Aug 19, 2014 at 10:19 PM, Haomai Wang 
>> wrote:
>>>
>>> I feel a little embarrassed, 1024 rows still true for me.
>>>
>>> I was wondering if you could give your all keys via
>>> ""ceph-kvstore-tool /var/lib/ceph/osd/ceph-67/current/ list
>>> _GHOBJTOSEQ_ > keys.log“.
>>>
>>> thanks!
>>>
>>> On Tue, Aug 19, 2014 at 4:58 PM, Kenneth Waegeman
>>>  wrote:


 - Message from Haomai Wang  -
  Date: Tue, 19 Aug 2014 12:28:27 +0800

  From: Haomai Wang 
 Subject: Re: [ceph-users] ceph cluster inconsistency?
To: Kenneth Waegeman 
Cc: Sage Weil , ceph-users@lists.ceph.com


> On Mon, Aug 18, 2014 at 7:32 PM, Kenneth Waegeman
>  wrote:
>>
>>
>>
>> - Message from Haomai Wang  -
>>  Date: Mon, 18 Aug 2014 18:34:11 +0800
>>
>>  From: Haomai Wang 
>> Subject: Re: [ceph-users] ceph cluster inconsistency?
>>To: Kenneth Waegeman 
>>Cc: Sage Weil , ceph-users@lists.ceph.com
>>
>>
>>
>>> On Mon, Aug 18, 2014 at 5:38 PM, Kenneth Waegeman
>>>  wrote:



 Hi,


Re: [ceph-users] Ceph monitor load, low performance

2014-08-26 Thread Mateusz Skała

Do you mean moving /var/log/ceph/* to an SSD disk?

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-26 Thread Christian Balzer

Hello,

On Tue, 26 Aug 2014 10:23:43 +1000 Blair Bethwaite wrote:

> > Message: 25
> > Date: Fri, 15 Aug 2014 15:06:49 +0200
> > From: Loic Dachary 
> > To: Erik Logtenberg , ceph-users@lists.ceph.com
> > Subject: Re: [ceph-users] Best practice K/M-parameters EC pool
> > Message-ID: <53ee05e9.1040...@dachary.org>
> > Content-Type: text/plain; charset="iso-8859-1"
> > ...
> > Here is how I reason about it, roughly:
> >
> > If the probability of loosing a disk is 0.1%, the probability of
> > loosing two disks simultaneously (i.e. before the failure can be
> > recovered) would be 0.1*0.1 = 0.01% and three disks becomes
> > 0.1*0.1*0.1 = 0.001% and four disks becomes 0.0001%
> 
> I watched this conversation and an older similar one (Failure
> probability with largish deployments) with interest as we are in the
> process of planning a pretty large Ceph cluster (~3.5 PB), so I have
> been trying to wrap my head around these issues.
>
As the OP of the "Failure probability with largish deployments" thread I
have to thank Blair for raising this issue again and doing the hard math
below. Which looks fine to me.

At the end of that slightly inconclusive thread I walked away with the
same impression as Blair, namely that the survival of PGs is the key
factor and that they will likely be spread out over most, if not all the
OSDs.

Which in turn did reinforce my decision to deploy our first production
Ceph cluster based on nodes with 2 OSDs backed by 11 disk RAID6 sets behind
a HW RAID controller with 4GB cache AND SSD journals.
I can live with the reduced performance (which is caused by the OSD code
running out of steam long before the SSDs or the RAIDs do), because not
only do I save 1/3rd of the space and 1/4th of the cost compared to a
replication 3 cluster, the total of disks that need to fail within the
recovery window to cause data loss is now 4.

The next cluster I'm currently building is a classic Ceph design,
replication of 3, 8 OSD HDDs and 4 journal SSDs per node, because with
this cluster I won't have predictable I/O patterns and loads.
OTOH, I don't see it growing much beyond 48 OSDs, so I'm happy enough with
the odds here.

I think doing the exact maths for a cluster of the size you're planning
would be very interesting and also very much needed. 
3.5PB usable space would be close to 3000 disks with a replication of 3,
but even if you meant that as gross value it would probably mean that
you're looking at frequent, if not daily disk failures.


Regards,

Christian
> Loic's reasoning (above) seems sound as a naive approximation assuming
> independent probabilities for disk failures, which may not be quite
> true given potential for batch production issues, but should be okay
> for other sorts of correlations (assuming a sane crushmap that
> eliminates things like controllers and nodes as sources of
> correlation).
> 
> One of the things that came up in the "Failure probability with
> largish deployments" thread and has raised its head again here is the
> idea that striped data (e.g., RADOS-GW objects and RBD volumes) might
> be somehow more prone to data-loss than non-striped. I don't think
> anyone has so far provided an answer on this, so here's my thinking...
> 
> The level of atomicity that matters when looking at durability &
> availability in Ceph is the Placement Group. For any non-trivial RBD
> it is likely that many RBDs will span all/most PGs, e.g., even a
> relatively small 50GiB volume would (with default 4MiB object size)
> span 12800 PGs - more than there are in many production clusters
> obeying the 100-200 PGs per drive rule of thumb. Losing any
> one PG will cause data-loss. The failure-probability effects of
> striping across multiple PGs are immaterial considering that loss of
> any single PG is likely to damage all your RBDs. This
> might be why the reliability calculator doesn't consider total number
> of disks.
> 
> Related to all this is the durability of 2 versus 3 replicas (or e.g.
> M>=1 for Erasure Coding). It's easy to get caught up in the worrying
> fallacy that losing any M OSDs will cause data-loss, but this isn't
> true - they have to be members of the same PG for data-loss to occur.
> So then it's tempting to think the chances of that happening are so
> slim as to not matter and why would we ever even need 3 replicas. I
> mean, what are the odds of exactly those 2 drives, out of the
> 100,200... in my cluster, failing in ?! But therein
> lays the rub - you should be thinking about PGs. If a drive fails then
> the chance of a data-loss event resulting are dependent on the chances
> of losing further drives from the affected/degraded PGs.
> 
> I've got a real cluster at hand, so let's use that as an example. We
> have 96 drives/OSDs - 8 nodes, 12 OSDs per node, 2 replicas, top-down
> failure domains: rack pairs (x2), nodes, OSDs... Let's say OSD 15
> dies. How many PGs are now at risk:
> $ grep "^10\." pg.dump | awk '{print $15}' | grep 15 | wc
> 109 109 861
> (

Re: [ceph-users] Ceph monitor load, low performance

2014-08-26 Thread Irek Fasikhov
I'm sorry, of course I meant the journals :)
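
A rough sketch of moving one OSD's journal to an SSD partition (assuming
sysvinit, OSD id 0 and an SSD partition /dev/sdg1 - adjust both to your layout,
and try it on a single OSD first):

service ceph stop osd.0
ceph-osd -i 0 --flush-journal
# point the journal at the SSD, e.g. in ceph.conf under [osd.0]:
#   osd journal = /dev/sdg1
# (or replace the journal symlink in /var/lib/ceph/osd/ceph-0/)
ceph-osd -i 0 --mkjournal
service ceph start osd.0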


2014-08-26 13:16 GMT+04:00 Mateusz Skała :

> You mean to move /var/log/ceph/* to SSD disk?
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Best regards, Irek Fasikhov
Mobile: +79229045757
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph monitor load, low performance

2014-08-26 Thread pawel . orzechowski
 

Hello gentlemen :-)

Let me point out one important aspect of this "low performance" problem:
of all 4 nodes in our ceph cluster only one node shows bad metrics, that
is, very high latency on its OSDs (200-600ms), while the other three
nodes behave normally - the latency of their OSDs is between 1-10ms.

So, the idea of putting journals on SSD is something that we are looking
at, but we think that we have some general problem with that particular
node, which affects the whole cluster.

So can the number of hosts (4) be a reason for that? Any other hints?
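
(For reference, the per-OSD latencies can be compared with checks along these
lines - a sketch, run on the suspect host and a healthy one:

ceph osd perf    # per-OSD commit/apply latency, if your version supports it
iostat -x 5      # per-disk await and %util on the node in question
)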

Thanks 

Pawel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] enrich ceph test methods, what is your concern about ceph. thanks

2014-08-26 Thread yuelongguang

Thanks, Irek Fasikhov.
Is fio the only way to test ceph-rbd? An important aim of the test is to find
where the bottleneck is: qemu, librbd or ceph itself.
Could you share your test results with me?
 
 
 
thanks




 


At 2014-08-26 04:22:22, "Irek Fasikhov" wrote:

Hi.
I and many people use fio. 
For ceph rbd has a special engine: 
https://telekomcloud.github.io/ceph/2014/02/26/ceph-performance-analysis_fio_rbd.html



2014-08-26 12:15 GMT+04:00 yuelongguang :

hi,all
 
i am planning to do a test on ceph, include performance, throughput, 
scalability,availability.
in order to get a full test result, i  hope you all can give me some advice. 
meanwhile i can send the result to you,if you like.
as for each category test( performance, throughput, scalability,availability)  
,  do you have some some test idea and test tools?
basicly i have know some tools to test throughtput,iops .  but you can tell the 
tools you prefer and the result you expect.  
 
thanks very much
 



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com







--

Best regards, Irek Fasikhov
Mobile: +79229045757
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] enrich ceph test methods, what is your concern about ceph. thanks

2014-08-26 Thread Irek Fasikhov
For me, the bottleneck is single-threaded operation. Writes are more or less
solved by enabling the rbd cache, but there are still problems with reads. I
think those problems can be addressed with a cache pool, but I have not tested
that.

It follows that the more threads, the greater the read and write speed - but in
reality it varies.

The speed and the number of operations depend on many factors, such as network
latency.

Example tests - pay special attention to the charts:

https://software.intel.com/en-us/blogs/2013/10/25/measure-ceph-rbd-performance-in-a-quantitative-way-part-i
and
https://software.intel.com/en-us/blogs/2013/11/20/measure-ceph-rbd-performance-in-a-quantitative-way-part-ii
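
Enabling the rbd cache is a client-side setting, roughly like this in ceph.conf
on the hypervisors (a sketch - qemu additionally needs cache=writeback on the
drive for the cache to be used safely):

[client]
rbd cache = true
rbd cache writethrough until flush = true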


2014-08-26 15:11 GMT+04:00 yuelongguang :

>
> thanks Irek Fasikhov.
> is it the only way to test ceph-rbd?  and an important aim of the test is
> to find where  the bottleneck is.   qemu/librbd/ceph.
> could you share your test result with me?
>
>
>
> thanks
>
>
>
>
>
>
> At 2014-08-26 04:22:22, "Irek Fasikhov" wrote:
>
> Hi.
> I and many people use fio.
> For ceph rbd has a special engine:
> https://telekomcloud.github.io/ceph/2014/02/26/ceph-performance-analysis_fio_rbd.html
>
>
> 2014-08-26 12:15 GMT+04:00 yuelongguang :
>
>> hi,all
>>
>> i am planning to do a test on ceph, include performance, throughput,
>> scalability,availability.
>> in order to get a full test result, i  hope you all can give me some
>> advice. meanwhile i can send the result to you,if you like.
>> as for each category test( performance, throughput,
>> scalability,availability)  ,  do you have some some test idea and test
>> tools?
>> basicly i have know some tools to test throughtput,iops .  but you can
>> tell the tools you prefer and the result you expect.
>>
>> thanks very much
>>
>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
>
> --
> С уважением, Фасихов Ирек Нургаязович
> Моб.: +79229045757
>
>
>
>


-- 
Best regards, Irek Fasikhov
Mobile: +79229045757
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] enrich ceph test methods, what is your concern about ceph. thanks

2014-08-26 Thread Irek Fasikhov
Sorry, Enter pressed too soon :)

Continuing:
No, it's not the only way to test, but it depends on what you want to use Ceph for.
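
For raw RADOS throughput (below the qemu/librbd layers), rados bench is another
option - a sketch, using a throwaway pool named "bench":

ceph osd pool create bench 128 128
rados bench -p bench 60 write --no-cleanup
rados bench -p bench 60 seq
ceph osd pool delete bench bench --yes-i-really-really-mean-it

Comparing that against fio inside a VM helps narrow down whether the bottleneck
is in the cluster itself or in the qemu/librbd path.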


2014-08-26 15:22 GMT+04:00 Irek Fasikhov :

> For me, the bottleneck is single-threaded operation. The recording will
> have more or less solved with the inclusion of rbd cache, but there are
> problems with reading. But I think that these problems can be solved cache
> pool, but have not tested.
>
> It follows that the more threads, the greater the speed of reading and
> writing. But in reality it is different.
>
> The speed and number of operations, depending on many factors, such as
> network latency.
>
> Examples testing, special attention to the charts:
>
>
> https://software.intel.com/en-us/blogs/2013/10/25/measure-ceph-rbd-performance-in-a-quantitative-way-part-i
> and
>
> https://software.intel.com/en-us/blogs/2013/11/20/measure-ceph-rbd-performance-in-a-quantitative-way-part-ii
>
>
> 2014-08-26 15:11 GMT+04:00 yuelongguang :
>
>
>> thanks Irek Fasikhov.
>> is it the only way to test ceph-rbd?  and an important aim of the test is
>> to find where  the bottleneck is.   qemu/librbd/ceph.
>> could you share your test result with me?
>>
>>
>>
>> thanks
>>
>>
>>
>>
>>
>>
>> At 2014-08-26 04:22:22, "Irek Fasikhov" wrote:
>>
>> Hi.
>> I and many people use fio.
>> For ceph rbd has a special engine:
>> https://telekomcloud.github.io/ceph/2014/02/26/ceph-performance-analysis_fio_rbd.html
>>
>>
>> 2014-08-26 12:15 GMT+04:00 yuelongguang :
>>
>>> hi,all
>>>
>>> i am planning to do a test on ceph, include performance, throughput,
>>> scalability,availability.
>>> in order to get a full test result, i  hope you all can give me some
>>> advice. meanwhile i can send the result to you,if you like.
>>> as for each category test( performance, throughput,
>>> scalability,availability)  ,  do you have some some test idea and test
>>> tools?
>>> basicly i have know some tools to test throughtput,iops .  but you can
>>> tell the tools you prefer and the result you expect.
>>>
>>> thanks very much
>>>
>>>
>>>
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>>
>>
>>
>> --
>> Best regards, Irek Fasikhov
>> Mobile: +79229045757
>>
>>
>>
>>
>
>
> --
> Best regards, Irek Fasikhov
> Mobile: +79229045757
>



-- 
Best regards, Irek Fasikhov
Mobile: +79229045757
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph can not repair itself after accidental power down, half of pgs are peering

2014-08-26 Thread yuelongguang
hi,all
 
I have 5 OSDs and 3 mons, and its status was OK before.
 
To be clear, this cluster holds no data at all; I just deployed it to get
familiar with some of the command-line tools.
What is the problem, and how do I fix it?
 
thanks
 
 
---environment-
ceph-release-1-0.el6.noarch
ceph-deploy-1.5.11-0.noarch
ceph-0.81.0-5.el6.x86_64
ceph-libs-0.81.0-5.el6.x86_64
-ceph -s --
[root@cephosd1-mona ~]# ceph -s
cluster 508634f6-20c9-43bb-bc6f-b777f4bb1651
 health HEALTH_WARN 183 pgs peering; 183 pgs stuck inactive; 183 pgs stuck 
unclean; clock skew detected on mon.cephosd2-monb, mon.cephosd3-monc
 monmap e13: 3 mons at 
{cephosd1-mona=10.154.249.3:6789/0,cephosd2-monb=10.154.249.4:6789/0,cephosd3-monc=10.154.249.5:6789/0},
 election epoch 74, quorum 0,1,2 cephosd1-mona,cephosd2-monb,cephosd3-monc
 osdmap e151: 5 osds: 5 up, 5 in
  pgmap v499: 384 pgs, 4 pools, 0 bytes data, 0 objects
201 MB used, 102143 MB / 102344 MB avail
 167 peering
 201 active+clean
  16 remapped+peering
 
 
--log--osd.0
2014-08-26 19:16:13.926345 7f114a8d2700  0 cephx: verify_authorizer could not 
decrypt ticket info: error: decryptor.MessageEnd::Exception: 
StreamTransformationFilter: invalid PKCS #7 block padding found
2014-08-26 19:16:13.926355 7f114a8d2700  0 -- 11.154.249.2:6800/1667 >> 
11.154.249.7:6800/1599 pipe(0x4dc2a80 sd=25 :6800 s=0 pgs=0 cs=0 l=0 
c=0x45d5960).accept: got bad authorizer
2014-08-26 19:16:28.928023 7f114a8d2700  0 cephx: verify_authorizer could not 
decrypt ticket info: error: decryptor.MessageEnd::Exception: 
StreamTransformationFilter: invalid PKCS #7 block padding found
2014-08-26 19:16:28.928050 7f114a8d2700  0 -- 11.154.249.2:6800/1667 >> 
11.154.249.7:6800/1599 pipe(0x4dc2800 sd=25 :6800 s=0 pgs=0 cs=0 l=0 
c=0x45d56a0).accept: got bad authorizer
2014-08-26 19:16:28.929139 7f114c009700  0 cephx: verify_reply couldn't decrypt 
with error: error decoding block for decryption
2014-08-26 19:16:28.929237 7f114c009700  0 -- 11.154.249.2:6800/1667 >> 
11.154.249.7:6800/1599 pipe(0x3edb700 sd=24 :38071 s=1 pgs=0 cs=0 l=0 
c=0x45d23c0).failed verifying authorize reply
2014-08-26 19:16:43.930846 7f114a8d2700  0 cephx: verify_authorizer could not 
decrypt ticket info: error: decryptor.MessageEnd::Exception: 
StreamTransformationFilter: invalid PKCS #7 block padding found
2014-08-26 19:16:43.930899 7f114a8d2700  0 -- 11.154.249.2:6800/1667 >> 
11.154.249.7:6800/1599 pipe(0x4dc2580 sd=25 :6800 s=0 pgs=0 cs=0 l=0 
c=0x45d0b00).accept: got bad authorizer
2014-08-26 19:16:43.932204 7f114c009700  0 cephx: verify_reply couldn't decrypt 
with error: error decoding block for decryption
2014-08-26 19:16:43.932230 7f114c009700  0 -- 11.154.249.2:6800/1667 >> 
11.154.249.7:6800/1599 pipe(0x3edb700 sd=24 :38073 s=1 pgs=0 cs=0 l=0 
c=0x45d23c0).failed verifying authorize reply
2014-08-26 19:16:58.933526 7f114a8d2700  0 cephx: verify_authorizer could not 
decrypt ticket info: error: decryptor.MessageEnd::Exception: 
StreamTransformationFilter: invalid PKCS #7 block padding found
2014-08-26 19:16:58.935094 7f114a8d2700  0 -- 11.154.249.2:6800/1667 >> 
11.154.249.7:6800/1599 pipe(0x4dc2300 sd=25 :6800 s=0 pgs=0 cs=0 l=0 
c=0x45d0840).accept: got bad authorizer
2014-08-26 19:16:58.936239 7f114c009700  0 cephx: verify_reply couldn't decrypt 
with error: error decoding block for decryption
2014-08-26 19:16:58.936261 7f114c009700  0 -- 11.154.249.2:6800/1667 >> 
11.154.249.7:6800/1599 pipe(0x3edb700 sd=24 :38074 s=1 pgs=0 cs=0 l=0 
c=0x45d23c0).failed verifying authorize reply
2014-08-26 19:17:13.937335 7f114a8d2700  0 cephx: verify_authorizer could not 
decrypt ticket info: error: decryptor.MessageEnd::Exception: 
StreamTransformationFilter: invalid PKCS #7 block padding found
2014-08-26 19:17:13.937368 7f114a8d2700  0 -- 11.154.249.2:6800/1667 >> 
11.154.249.7:6800/1599 pipe(0x4dc2080 sd=25 :6800 s=0 pgs=0 cs=0 l=0 
c=0x45d1b80).accept: got bad authorizer
2014-08-26 19:17:13.937923 7f114c009700  0 cephx: verify_reply couldn't decrypt 
with error: error decoding block for decryption
2014-08-26 19:17:13.937933 7f114c009700  0 -- 11.154.249.2:6800/1667 >> 
11.154.249.7:6800/1599 pipe(0x3edb700 sd=24 :38075 s=1 pgs=0 cs=0 l=0 
c=0x45d23c0).failed verifying authorize reply
2014-08-26 19:17:28.939439 7f114a8d2700  0 cephx: verify_authorizer could not 
decrypt ticket info: error: decryptor.MessageEnd::Exception: 
StreamTransformationFilter: invalid PKCS #7 block padding found
2014-08-26 19:17:28.939455 7f114a8d2700  0 -- 11.154.249.2:6800/1667 >> 
11.154.249.7:6800/1599 pipe(0x4dc1e00 sd=25 :6800 s=0 pgs=0 cs=0 l=0 
c=0x45d5540).accept: got bad authorizer
2014-08-26 19:17:28.939716 7f114c009700  0 cephx: verify_reply couldn't decrypt 
with error: error decoding block for decryption
2014-08-26 19:17:28.939731 7f114c009700  0 -- 11.154.249.2:6800/1667 >> 
11.154.249.7:6800/1599 pipe(0x3edb700 sd=24 :3807

Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-26 Thread Loic Dachary
Hi Blair,

Assuming that:

* The pool is configured for three replicas (size = 3 which is the default)
* It takes one hour for Ceph to recover from the loss of a single OSD
* Any other disk has a 0.001% chance to fail within the hour following the 
failure of the first disk (assuming AFR 
https://en.wikipedia.org/wiki/Annualized_failure_rate of every disk is 10%, 
divided by the number of hours during a year).
* A given disk does not participate in more than 100 PG

Each time an OSD is lost, there is a 0.001*0.001 = 0.01% chance that two 
other disks are lost before recovery. Since the disk that failed initially 
participates in 100 PG, that is 0.01% x 100 = 0.0001% chance that a PG is 
lost. Or the entire pool if it is used in a way that losing a PG means losing 
all data in the pool (as in your example, where it contains RBD volumes and 
each of the RBD volumes uses all the available PG).

If the pool is using at least two datacenters operated by two different 
organizations, this calculation makes sense to me. However, if the cluster is 
in a single datacenter, isn't it possible that some event independent of Ceph 
has a greater probability of permanently destroying the data ? A month ago I 
lost three machines in a Ceph cluster and realized on that occasion that the 
crushmap was not configured properly and that PG were lost as a result. 
Fortunately I was able to recover the disks and plug them in another machine to 
recover the lost PGs. I'm not a system administrator and the probability of me 
failing to do the right thing is higher than normal: this is just an example of 
a high probability event leading to data loss. In other words, I wonder if this 
0.0001% chance of losing a PG within the hour following a disk failure matters 
or if it is dominated by other factors. What do you think ?

Cheers

On 26/08/2014 02:23, Blair Bethwaite wrote:
>> Message: 25
>> Date: Fri, 15 Aug 2014 15:06:49 +0200
>> From: Loic Dachary 
>> To: Erik Logtenberg , ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] Best practice K/M-parameters EC pool
>> Message-ID: <53ee05e9.1040...@dachary.org>
>> Content-Type: text/plain; charset="iso-8859-1"
>> ...
>> Here is how I reason about it, roughly:
>>
>> If the probability of loosing a disk is 0.1%, the probability of loosing two 
>> disks simultaneously (i.e. before the failure can be recovered) would be 
>> 0.1*0.1 = 0.01% and three disks becomes 0.1*0.1*0.1 = 0.001% and four disks 
>> becomes 0.0001%
> 
> I watched this conversation and an older similar one (Failure
> probability with largish deployments) with interest as we are in the
> process of planning a pretty large Ceph cluster (~3.5 PB), so I have
> been trying to wrap my head around these issues.
> 
> Loic's reasoning (above) seems sound as a naive approximation assuming
> independent probabilities for disk failures, which may not be quite
> true given potential for batch production issues, but should be okay
> for other sorts of correlations (assuming a sane crushmap that
> eliminates things like controllers and nodes as sources of
> correlation).
> 
> One of the things that came up in the "Failure probability with
> largish deployments" thread and has raised its head again here is the
> idea that striped data (e.g., RADOS-GW objects and RBD volumes) might
> be somehow more prone to data-loss than non-striped. I don't think
> anyone has so far provided an answer on this, so here's my thinking...
> 
> The level of atomicity that matters when looking at durability &
> availability in Ceph is the Placement Group. For any non-trivial RBD
> it is likely that many RBDs will span all/most PGs, e.g., even a
> relatively small 50GiB volume would (with default 4MiB object size)
> span 12800 PGs - more than there are in many production clusters
> obeying the 100-200 PGs per drive rule of thumb. Losing any
> one PG will cause data-loss. The failure-probability effects of
> striping across multiple PGs are immaterial considering that loss of
> any single PG is likely to damage all your RBDs. This
> might be why the reliability calculator doesn't consider total number
> of disks.
> 
> Related to all this is the durability of 2 versus 3 replicas (or e.g.
> M>=1 for Erasure Coding). It's easy to get caught up in the worrying
> fallacy that losing any M OSDs will cause data-loss, but this isn't
> true - they have to be members of the same PG for data-loss to occur.
> So then it's tempting to think the chances of that happening are so
> slim as to not matter and why would we ever even need 3 replicas. I
> mean, what are the odds of exactly those 2 drives, out of the
> 100,200... in my cluster, failing in ?! But therein
> lays the rub - you should be thinking about PGs. If a drive fails then
> the chance of a data-loss event resulting are dependent on the chances
> of losing further drives from the affected/degraded PGs.
> 
> I've got a real cluster at hand, so let's use that as an example. We
> have 96 drives/OSDs - 8

Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-26 Thread Loic Dachary
Using percentages instead of plain numbers led me to calculation errors. Here it
is again using fractions instead of % for clarity ;-)

Assuming that:

* The pool is configured for three replicas (size = 3 which is the default)
* It takes one hour for Ceph to recover from the loss of a single OSD
* Any other disk has a 1/100,000 chance to fail within the hour following the 
failure of the first disk (assuming the AFR 
(https://en.wikipedia.org/wiki/Annualized_failure_rate) of every disk is 8%, 
divided by the number of hours in a year: 0.08 / 8760 ~= 1/100,000)
* A given disk does not participate in more than 100 PG

Each time an OSD is lost, there is a 1/100,000 * 1/100,000 = 1/10,000,000,000 
chance that two other disks are lost before recovery. Since the disk that 
failed initially participates in 100 PG, that is 1/10,000,000,000 x 100 = 
1/100,000,000 chance that a PG is lost. Or the entire pool if it is used in a 
way that losing a PG means losing all data in the pool (as in your example, 
where it contains RBD volumes and each of the RBD volumes uses all the 
available PG).
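
Written compactly (LaTeX notation), with AFR the annualized failure rate, T the
recovery window in hours, R the replica count and N_{PG} the number of PGs the
failed disk participates in, the estimate above is:

P(\text{PG lost}) \approx N_{PG} \times \left(\frac{\text{AFR} \times T}{8760}\right)^{R-1}
                 \approx 100 \times \left(\tfrac{1}{100{,}000}\right)^{2} = \frac{1}{100{,}000{,}000}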

If the pool is using at least two datacenters operated by two different 
organizations, this calculation makes sense to me. However, if the cluster is 
in a single datacenter, isn't it possible that some event independent of Ceph 
has a greater probability of permanently destroying the data ? A month ago I 
lost three machines in a Ceph cluster and realized on that occasion that the 
crushmap was not configured properly and that PG were lost as a result. 
Fortunately I was able to recover the disks and plug them in another machine to 
recover the lost PGs. I'm not a system administrator and the probability of me 
failing to do the right thing is higher than normal: this is just an example of 
a high probability event leading to data loss. Another example would be if all 
disks in the same PG are part of the same batch and therefore likely to fail at 
the same time. In other words, I wonder if this 0.0001% chance of losing a PG 
within the hour following a disk failure matters or if it is dominate
d by other factors. What do you think ?

Cheers

> 
> Assuming that:
> 
> * The pool is configured for three replicas (size = 3 which is the default)
> * It takes one hour for Ceph to recover from the loss of a single OSD
> * Any other disk has a 0.001% chance to fail within the hour following the 
> failure of the first disk (assuming AFR 
> https://en.wikipedia.org/wiki/Annualized_failure_rate of every disk is 10%, 
> divided by the number of hours during a year).
> * A given disk does not participate in more than 100 PG
> 
> Each time an OSD is lost, there is a 0.001*0.001 = 0.01% chance that two 
> other disks are lost before recovery. Since the disk that failed initialy 
> participates in 100 PG, that is 0.01% x 100 = 0.0001% chance that a PG is 
> lost. Or the entire pool if it is used in a way that loosing a PG means 
> loosing all data in the pool (as in your example, where it contains RBD 
> volumes and each of the RBD volume uses all the available PG).
> 
> If the pool is using at least two datacenters operated by two different 
> organizations, this calculation makes sense to me. However, if the cluster is 
> in a single datacenter, isn't it possible that some event independent of Ceph 
> has a greater probability of permanently destroying the data ? A month ago I 
> lost three machines in a Ceph cluster and realized on that occasion that the 
> crushmap was not configured properly and that PG were lost as a result. 
> Fortunately I was able to recover the disks and plug them in another machine 
> to recover the lost PGs. I'm not a system administrator and the probability 
> of me failing to do the right thing is higher than normal: this is just an 
> example of a high probability event leading to data loss. In other words, I 
> wonder if this 0.0001% chance of losing a PG within the hour following a disk 
> failure matters or if it is dominated by other factors. What do you think ?
> 
> Cheers

On 26/08/2014 15:25, Loic Dachary wrote:
> Hi Blair,
> 
> Assuming that:
> 
> * The pool is configured for three replicas (size = 3 which is the default)
> * It takes one hour for Ceph to recover from the loss of a single OSD
> * Any other disk has a 0.001% chance to fail within the hour following the 
> failure of the first disk (assuming AFR 
> https://en.wikipedia.org/wiki/Annualized_failure_rate of every disk is 10%, 
> divided by the number of hours during a year).
> * A given disk does not participate in more than 100 PG
> 
> Each time an OSD is lost, there is a 0.001*0.001 = 0.01% chance that two 
> other disks are lost before recovery. Since the disk that failed initialy 
> participates in 100 PG, that is 0.01% x 100 = 0.0001% chance that a PG is 
> lost. Or the entire pool if it is used in a way that loosing a PG means 
> loosing all data in the pool (as in your example, where it contains RBD 
> volumes and each of the

[ceph-users] MDS dying on Ceph 0.67.10

2014-08-26 Thread MinhTien MinhTien
Hi all,

I have a cluster of 2 nodes on CentOS 6.5 with Ceph 0.67.10 (replica count = 2).

When I added a 3rd node to the Ceph cluster, Ceph started rebalancing.

I have 3 MDSs on the 3 nodes; the MDS process dies after a while with a
stack trace:

---

 2014-08-26 17:08:34.362901 7f1c2c704700  1 -- 10.20.0.21:6800/22154
<== osd.10 10.20.0.21:6802/15917 1  osd_op_reply(230
10003f6. [tmapup 0~0] ondisk = 0) v4  119+0+0
(1770421071 0 0) 0x2aece00 con 0x2aa4200
   -54> 2014-08-26 17:08:34.362942 7f1c2c704700  1 --
10.20.0.21:6800/22154 <== osd.55 10.20.0.23:6800/2407 10 
osd_op_reply(263 100048a. [getxattr] ack = -2 (No such
file or directory)) v4  119+0+0 (3908997833 0 0) 0x1e63000 con
0x1e7aaa0
   -53> 2014-08-26 17:08:34.363001 7f1c2c704700  5 mds.0.log
submit_entry 427629603~1541 : EUpdate purge_stray truncate [metablob
100, 2 dirs]
   -52> 2014-08-26 17:08:34.363022 7f1c2c704700  1 --
10.20.0.21:6800/22154 <== osd.37 10.20.0.22:6898/11994 6 
osd_op_reply(226 1. [tmapput 0~7664] ondisk = 0) v4 
109+0+0 (1007110430 0 0) 0x1e64800 con 0x1e7a7e0
   -51> 2014-08-26 17:08:34.363092 7f1c2c704700  5 mds.0.log _expired
segment 293601899 2548 events
   -50> 2014-08-26 17:08:34.363117 7f1c2c704700  1 --
10.20.0.21:6800/22154 <== osd.17 10.20.0.21:6941/17572 9 
osd_op_reply(264 1000489. [getxattr] ack = -2 (No such
file or directory)) v4  119+0+0 (1979034473 0 0) 0x1e62200 con
0x1e7b180
   -49> 2014-08-26 17:08:34.363177 7f1c2c704700  5 mds.0.log
submit_entry 427631148~1541 : EUpdate purge_stray truncate [metablob
100, 2 dirs]
   -48> 2014-08-26 17:08:34.363197 7f1c2c704700  1 --
10.20.0.21:6800/22154 <== osd.1 10.20.0.21:6872/13227 6 
osd_op_reply(265 1000491. [getxattr] ack = -2 (No such
file or directory)) v4  119+0+0 (1231782695 0 0) 0x1e63400 con
0x1e7ac00
   -47> 2014-08-26 17:08:34.363255 7f1c2c704700  5 mds.0.log
submit_entry 427632693~1541 : EUpdate purge_stray truncate [metablob
100, 2 dirs]
   -46> 2014-08-26 17:08:34.363274 7f1c2c704700  1 --
10.20.0.21:6800/22154 <== osd.11 10.20.0.21:6884/7018 5 
osd_op_reply(266 100047d. [getxattr] ack = -2 (No such
file or directory)) v4  119+0+0 (2737916920 0 0) 0x1e61e00 con
0x1e7bc80

-
I tried restarting the MDSs, but after a few seconds in the "active" state, the
MDS switches to "laggy or crashed". I have a lot of important data on it.
I do not want to use the command:
ceph mds newfs   --yes-i-really-mean-it

:(

Tien Bui.



-- 
Bui Minh Tien
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph can not repair itself after accidental power down, half of pgs are peering

2014-08-26 Thread Michael
How far out are your clocks? It's showing a clock skew; if they're too 
far out it can cause issues with cephx.

Otherwise you're probably going to need to check your cephx auth keys.
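
A quick sketch for checking both suspects (run on each mon/OSD host, adjusting
the OSD id):

ntpq -p              # peer offsets should be within a few milliseconds
ceph auth get osd.0  # compare with /var/lib/ceph/osd/ceph-0/keyring
ceph auth list       # and with /etc/ceph/ceph.client.admin.keyring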

-Michael

On 26/08/2014 12:26, yuelongguang wrote:

hi,all
i have 5 osds and 3 mons. its status is ok then.
to be mentioned , this cluster has no any data.  i just deploy it and 
to be familar with some command lines.

what is the probpem and how to fix?
thanks
---environment-
ceph-release-1-0.el6.noarch
ceph-deploy-1.5.11-0.noarch
ceph-0.81.0-5.el6.x86_64
ceph-libs-0.81.0-5.el6.x86_64
-ceph -s --
[root@cephosd1-mona ~]# ceph -s
cluster 508634f6-20c9-43bb-bc6f-b777f4bb1651
 health HEALTH_WARN 183 pgs peering; 183 pgs stuck inactive; 183 
pgs stuck unclean; clock skew detected on mon.cephosd2-monb, 
mon.cephosd3-monc
 monmap e13: 3 mons at 
{cephosd1-mona=10.154.249.3:6789/0,cephosd2-monb=10.154.249.4:6789/0,cephosd3-monc=10.154.249.5:6789/0}, 
election epoch 74, quorum 0,1,2 cephosd1-mona,cephosd2-monb,cephosd3-monc

 osdmap e151: 5 osds: 5 up, 5 in
  pgmap v499: 384 pgs, 4 pools, 0 bytes data, 0 objects
201 MB used, 102143 MB / 102344 MB avail
 167 peering
 201 active+clean
  16 remapped+peering
--log--osd.0
2014-08-26 19:16:13.926345 7f114a8d2700  0 cephx: verify_authorizer 
could not decrypt ticket info: error: decryptor.MessageEnd::Exception: 
StreamTransformationFilter: invalid PKCS #7 block padding found
2014-08-26 19:16:13.926355 7f114a8d2700  0 -- 11.154.249.2:6800/1667 
>> 11.154.249.7:6800/1599 pipe(0x4dc2a80 sd=25 :6800 s=0 pgs=0 cs=0 
l=0 c=0x45d5960).accept: got bad authorizer
2014-08-26 19:16:28.928023 7f114a8d2700  0 cephx: verify_authorizer 
could not decrypt ticket info: error: decryptor.MessageEnd::Exception: 
StreamTransformationFilter: invalid PKCS #7 block padding found
2014-08-26 19:16:28.928050 7f114a8d2700  0 -- 11.154.249.2:6800/1667 
>> 11.154.249.7:6800/1599 pipe(0x4dc2800 sd=25 :6800 s=0 pgs=0 cs=0 
l=0 c=0x45d56a0).accept: got bad authorizer
2014-08-26 19:16:28.929139 7f114c009700  0 cephx: verify_reply 
couldn't decrypt with error: error decoding block for decryption
2014-08-26 19:16:28.929237 7f114c009700  0 -- 11.154.249.2:6800/1667 
>> 11.154.249.7:6800/1599 pipe(0x3edb700 sd=24 :38071 s=1 pgs=0 cs=0 
l=0 c=0x45d23c0).failed verifying authorize reply
2014-08-26 19:16:43.930846 7f114a8d2700  0 cephx: verify_authorizer 
could not decrypt ticket info: error: decryptor.MessageEnd::Exception: 
StreamTransformationFilter: invalid PKCS #7 block padding found
2014-08-26 19:16:43.930899 7f114a8d2700  0 -- 11.154.249.2:6800/1667 
>> 11.154.249.7:6800/1599 pipe(0x4dc2580 sd=25 :6800 s=0 pgs=0 cs=0 
l=0 c=0x45d0b00).accept: got bad authorizer
2014-08-26 19:16:43.932204 7f114c009700  0 cephx: verify_reply 
couldn't decrypt with error: error decoding block for decryption
2014-08-26 19:16:43.932230 7f114c009700  0 -- 11.154.249.2:6800/1667 
>> 11.154.249.7:6800/1599 pipe(0x3edb700 sd=24 :38073 s=1 pgs=0 cs=0 
l=0 c=0x45d23c0).failed verifying authorize reply
2014-08-26 19:16:58.933526 7f114a8d2700  0 cephx: verify_authorizer 
could not decrypt ticket info: error: decryptor.MessageEnd::Exception: 
StreamTransformationFilter: invalid PKCS #7 block padding found
2014-08-26 19:16:58.935094 7f114a8d2700  0 -- 11.154.249.2:6800/1667 
>> 11.154.249.7:6800/1599 pipe(0x4dc2300 sd=25 :6800 s=0 pgs=0 cs=0 
l=0 c=0x45d0840).accept: got bad authorizer
2014-08-26 19:16:58.936239 7f114c009700  0 cephx: verify_reply 
couldn't decrypt with error: error decoding block for decryption
2014-08-26 19:16:58.936261 7f114c009700  0 -- 11.154.249.2:6800/1667 
>> 11.154.249.7:6800/1599 pipe(0x3edb700 sd=24 :38074 s=1 pgs=0 cs=0 
l=0 c=0x45d23c0).failed verifying authorize reply
2014-08-26 19:17:13.937335 7f114a8d2700  0 cephx: verify_authorizer 
could not decrypt ticket info: error: decryptor.MessageEnd::Exception: 
StreamTransformationFilter: invalid PKCS #7 block padding found
2014-08-26 19:17:13.937368 7f114a8d2700  0 -- 11.154.249.2:6800/1667 
>> 11.154.249.7:6800/1599 pipe(0x4dc2080 sd=25 :6800 s=0 pgs=0 cs=0 
l=0 c=0x45d1b80).accept: got bad authorizer
2014-08-26 19:17:13.937923 7f114c009700  0 cephx: verify_reply 
couldn't decrypt with error: error decoding block for decryption
2014-08-26 19:17:13.937933 7f114c009700  0 -- 11.154.249.2:6800/1667 
>> 11.154.249.7:6800/1599 pipe(0x3edb700 sd=24 :38075 s=1 pgs=0 cs=0 
l=0 c=0x45d23c0).failed verifying authorize reply
2014-08-26 19:17:28.939439 7f114a8d2700  0 cephx: verify_authorizer 
could not decrypt ticket info: error: decryptor.MessageEnd::Exception: 
StreamTransformationFilter: invalid PKCS #7 block padding found
2014-08-26 19:17:28.939455 7f114a8d2700  0 -- 11.154.249.2:6800/1667 
>> 11.154.249.7:6800/1599 pipe(0x4dc1e00 sd=25 :6800 s=0 pgs=0 cs=0 
l=0 c=0x45d5540).accept: got bad authorizer
2014-08-26 19:17:28.93971

Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-26 Thread Craig Lewis
My OSD rebuild time is more like 48 hours (4TB disks, >60% full, osd max
backfills = 1).   I believe that increases my risk of failure by 48^2 .
 Since your numbers are failure rate per hour per disk, I need to consider
the risk for the whole time for each disk.  So more formally, rebuild time
to the power of (replicas -1).

So I'm at 2304/100,000,000, or  approximately 1/43,000.  That's a much
higher risk than 1 / 10^8.


A risk of 1/43,000 means that I'm more likely to lose data due to human
error than disk failure.  Still, I can put a small bit of effort in to
optimize recovery speed, and lower this number.  Managing human error is
much harder.
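
(Quick sanity check of that arithmetic - 1/100,000 per disk per hour, a 48 hour
rebuild window, 100 PGs per disk:)

echo '48^2 * 100 / 100000^2' | bc -l    # ~0.000023, i.e. roughly 1 in 43,000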






On Tue, Aug 26, 2014 at 7:12 AM, Loic Dachary  wrote:

> Using percentages instead of numbers lead me to calculations errors. Here
> it is again using 1/100 instead of % for clarity ;-)
>
> Assuming that:
>
> * The pool is configured for three replicas (size = 3 which is the default)
> * It takes one hour for Ceph to recover from the loss of a single OSD
> * Any other disk has a 1/100,000 chance to fail within the hour following
> the failure of the first disk (assuming AFR
> https://en.wikipedia.org/wiki/Annualized_failure_rate of every disk is
> 8%, divided by the number of hours during a year == (0.08 / 8760) ~=
> 1/100,000
> * A given disk does not participate in more than 100 PG
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph monitor load, low performance

2014-08-26 Thread Craig Lewis
I had a similar problem once.  I traced my problem to a failed battery
on my RAID card, which disabled write caching.  One of the many things I
need to add to monitoring.



On Tue, Aug 26, 2014 at 3:58 AM,  wrote:

>  Hello Gentlemen :-)
>
> Let me point out one important aspect of this "low performance" problem: of
> all 4 nodes of our ceph cluster, only one node shows bad metrics, that is,
> very high latency on its OSDs (200-600ms), while the other three nodes
> behave normally, i.e. the latency of their OSDs is between 1-10ms.
>
> So, the idea of putting journals on SSD is something that we are looking
> at, but we think that we have in general some problem with that particular
> node, what affects whole cluster.
>
> So can the number (4) of hosts be a reason for that? Any other hints?
>
> Thanks
>
> Pawel
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-26 Thread Loic Dachary
Hi Craig,

I assume the reason for the 48 hours recovery time is to keep the cost of the 
cluster low ? I wrote "1h recovery time" because it is roughly the time it 
would take to move 4TB over a 10Gb/s link. Could you upgrade your hardware to 
reduce the recovery time to less than two hours ? Or are there factors other 
than cost that prevent this ?
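
(The 1h figure is just 4TB pushed over a dedicated 10Gb/s link, ignoring every
other bottleneck:)

echo '4 * 8 * 1000 / 10 / 3600' | bc -l    # ~0.9 hours for 4TB at 10Gb/s line rate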

Cheers

On 26/08/2014 19:37, Craig Lewis wrote:
> My OSD rebuild time is more like 48 hours (4TB disks, >60% full, osd max 
> backfills = 1).   I believe that increases my risk of failure by 48^2 .  
> Since your numbers are failure rate per hour per disk, I need to consider the 
> risk for the whole time for each disk.  So more formally, rebuild time to the 
> power of (replicas -1).
> 
> So I'm at 2304/100,000,000, or  approximately 1/43,000.  That's a much higher 
> risk than 1 / 10^8.
> 
> 
> A risk of 1/43,000 means that I'm more likely to lose data due to human error 
> than disk failure.  Still, I can put a small bit of effort in to optimize 
> recovery speed, and lower this number.  Managing human error is much harder.
> 
> 
> 
> 
> 
> 
> On Tue, Aug 26, 2014 at 7:12 AM, Loic Dachary  > wrote:
> 
> Using percentages instead of numbers lead me to calculations errors. Here 
> it is again using 1/100 instead of % for clarity ;-)
> 
> Assuming that:
> 
> * The pool is configured for three replicas (size = 3 which is the 
> default)
> * It takes one hour for Ceph to recover from the loss of a single OSD
> * Any other disk has a 1/100,000 chance to fail within the hour following 
> the failure of the first disk (assuming AFR 
> https://en.wikipedia.org/wiki/Annualized_failure_rate of every disk is 8%, 
> divided by the number of hours during a year == (0.08 / 8760) ~= 1/100,000
> * A given disk does not participate in more than 100 PG
> 

-- 
Loïc Dachary, Artisan Logiciel Libre



signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] slow read speeds from kernel rbd (Firefly 0.80.4)

2014-08-26 Thread Steve Anthony
Ok, after some delays and the move to new network hardware I have an
update. I'm still seeing the same low bandwidth and high retransmissions
from iperf after moving to the Cisco 6001 (10Gb) and 2960 (1Gb). I've
narrowed it down to transmissions from a 10Gb connected host to a 1Gb
connected host. Taking a more targeted tcpdump, I discovered that there
are multiple duplicate ACKs, triggering fast retransmissions between the
two test hosts.

There are several websites/articles which suggest that mixing 10Gb and
1Gb hosts causes performance issues, but no concrete explanation of why
that's the case, and if it can be avoided without moving everything to
10Gb, eg.

http://blogs.technet.com/b/networking/archive/2011/05/16/tcp-dupacks-and-tcp-fast-retransmits.aspx
http://en.community.dell.com/dell-groups/dtcmedia/m/mediagallery/19856911/download.aspx
[PDF]
http://packetpushers.net/flow-control-storm-%E2%80%93-ip-storage-performance-effects/

I verified that it's not a flow control storm (the pause frame counters
along the network path are zero), so assuming it might be bandwidth
related I installed trickle and used it to limit the bandwidth of iperf
to 1Gb; no change. I further restricted it down to 100Kbps, and was
*still* seeing high retransmission. This seems to imply it's not purely
bandwidth related.

After further research, I noticed a difference of about 0.1ms in the RTT
between two 10Gb hosts (intra-switch) and the 10Gb and 1Gb host
(inter-switch). I theorized this may be affecting the retransmission
timeout counter calculations, per:

http://sgros.blogspot.com/2012/02/calculating-tcp-rto.html

so I used ethtool to set the link plugged into the 10Gb 6001 to 1Gb;
this immediately fixed the issue. After this change the difference in
RTTs moved to about .025ms. Plugging another host into the old 10Gb FEX,
I have 10Gb to 10Gb RTTs within .001ms of 6001 to 2960 RTTs, and don't
see the high retransmissions with iperf between those 10Gb hosts.
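
(Roughly the commands involved, as a sketch - the interface name and target host
are placeholders, and the netstat counters are system-wide, so watch the deltas
during a test run:)

ethtool -s eth0 speed 1000 duplex full    # force the 10Gb-facing port down to 1Gb
ping -c 100 -q <1Gb host>                 # compare average RTT before/after
netstat -s | grep -i retrans              # TCP retransmit counters during an iperf3 run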


 tldr 

So, right now I don't see retransmissions between hosts on the same
switch (even if speeds are mixed), but I do across switches when the
hosts are mixed 10Gb/1Gb. Also, I wonder what the difference is between
process-level bandwidth limiting and 1Gb link negotiation that leads to
the differences observed. I checked the link per Mark's suggestion
below, but all the values they increase in that old post are already
lower than the defaults set on my hosts.

If anyone has any ideas or explanations, I'd appreciate it. Otherwise,
I'll keep the list posted if I uncover a solution or make more progress.
Thanks.

-Steve

On 07/28/2014 01:21 PM, Mark Nelson wrote:
> On 07/28/2014 11:28 AM, Steve Anthony wrote:
>> While searching for more information I happened across the following
>> post (http://dachary.org/?p=2961) which vaguely resembled the symptoms
>> I've been experiencing. I ran tcpdump and noticed what appeared to be a
>> high number of retransmissions on the host where the images are mounted
>> during a read from a Ceph rbd, so I ran iperf3 to get some concrete
>> numbers:
>
> Very interesting that you are seeing retransmissions.
>
>>
>> Server: nas4 (where rbd images are mapped)
>> Client: ceph2 (currently not in the cluster, but configured
>> identically to the other nodes)
>>
>> Start server on nas4:
>> iperf3 -s
>>
>> On ceph2, connect to server nas4, send 4096MB of data, report on 1
>> second intervals. Add -R to reverse the client/server roles.
>> iperf3 -c nas4 -n 4096M -i 1
>>
>> Summary of traffic going out the 1Gb interface to a switch
>>
>> [ ID] Interval   Transfer Bandwidth   Retr
>> [  5]   0.00-36.53  sec  4.00 GBytes   941 Mbits/sec   15
>> sender
>> [  5]   0.00-36.53  sec  4.00 GBytes   940 Mbits/sec
>> receiver
>>
>> Reversed, summary of traffic going over the fabric extender
>>
>> [ ID] Interval   Transfer Bandwidth   Retr
>> [  5]   0.00-80.84  sec  4.00 GBytes   425 Mbits/sec  30756
>> sender
>> [  5]   0.00-80.84  sec  4.00 GBytes   425 Mbits/sec
>> receiver
>
> Definitely looks suspect!
>
>>
>>
>> It appears that the issue is related to the network topology employed.
>> The private cluster network and nas4's public interface are both
>> connected to a 10Gb Cisco Fabric Extender (FEX), in turn connected to a
>> Nexus 7000. This was meant as a temporary solution until our network
>> team could finalize their design and bring up the Nexus 6001 for the
>> cluster. From what our network guys have said, the FEX has been much
>> more limited than they anticipated and they haven't been pleased with it
>> as a solution in general. The 6001 is supposed be ready this week, so
>> once it's online I'll move the cluster to that switch and re-test to see
>> if this fixes the issues I've been experiencing.
>
> If it's not the hardware, one other thing you might want to test is to
> make sure it's not something similar to the autotuning issues we used
> to see.  I don't think this should be an issue at this po

Re: [ceph-users] Two osds are spaming dmesg every 900 seconds

2014-08-26 Thread Gregory Farnum
This is being output by one of the kernel clients, and it's just
saying that the connections to those two OSDs have died from
inactivity. Either the other OSD connections are used a lot more, or
aren't used at all.

In any case, it's not a problem; just a noisy notification. There's
not much you can do about it; sorry.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Mon, Aug 25, 2014 at 12:01 PM, Andrei Mikhailovsky  wrote:
> Hello
>
> I am seeing this message every 900 seconds on the osd servers. My dmesg 
> output is all filled with:
>
> [256627.683702] libceph: osd3 192.168.168.200:6821 socket closed (con state 
> OPEN)
> [256627.687663] libceph: osd6 192.168.168.200:6841 socket closed (con state 
> OPEN)
>
>
> Looking at the ceph-osd logs I see the following at the same time:
>
> 2014-08-25 19:48:14.869145 7f0752125700  0 -- 192.168.168.200:6821/4097 >> 
> 192.168.168.200:0/2493848861 pipe(0x13b43c80 sd=92 :6821 s=0 pgs=0 cs=0 l=0 
> c=0x16a606e0).accept peer addr is really 192.168.168.200:0/2493848861 (socket 
> is 192.168.168.200:54457/0)
>
>
> This happens only on two osds and the rest of osds seem fine. Does anyone 
> know why am I seeing this and how to correct it?
>
> Thanks
>
> Andrei
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph-fuse fails to mount

2014-08-26 Thread Gregory Farnum
In particular, we changed things post-Firefly so that the filesystem
isn't created automatically. You'll need to set it up (and its pools,
etc) explicitly to use it.
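
(A sketch of what that looks like on recent versions - pool names and PG counts
here are only examples:)

ceph osd pool create cephfs_data 128
ceph osd pool create cephfs_metadata 128
ceph fs new cephfs cephfs_metadata cephfs_data
ceph mds stat        # should now report an active MDS
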
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Mon, Aug 25, 2014 at 2:40 PM, Sean Crosby
 wrote:
> Hi James,
>
>
> On 26 August 2014 07:17, LaBarre, James (CTR) A6IT 
> wrote:
>>
>>
>>
>> [ceph@first_cluster ~]$ ceph -s
>>
>> cluster e0433b49-d64c-4c3e-8ad9-59a47d84142d
>>
>>  health HEALTH_OK
>>
>>  monmap e1: 1 mons at {first_cluster=10.25.164.192:6789/0}, election
>> epoch 2, quorum 0 first_cluster
>>
>>  mdsmap e4: 1/1/1 up {0=first_cluster=up:active}
>>
>>  osdmap e13: 3 osds: 3 up, 3 in
>>
>>   pgmap v480: 192 pgs, 3 pools, 1417 MB data, 4851 objects
>>
>> 19835 MB used, 56927 MB / 76762 MB avail
>>
>>  192 active+clean
>
>
> This cluster has an MDS. It should mount.
>
>>
>>
>>
>> [ceph@second_cluster ~]$ ceph -s
>>
>> cluster 06f655b7-e147-4790-ad52-c57dcbf160b7
>>
>>  health HEALTH_OK
>>
>>  monmap e1: 1 mons at {second_cluster=10.25.165.91:6789/0}, election
>> epoch 1, quorum 0 cilsdbxd1768
>>
>>  osdmap e16: 7 osds: 7 up, 7 in
>>
>>   pgmap v539: 192 pgs, 3 pools, 0 bytes data, 0 objects
>>
>> 252 MB used, 194 GB / 194 GB avail
>>
>>  192 active+clean
>
>
> No mdsmap line for this cluster, and therefore the filesystem won't mount.
> Have you added an MDS for this cluster, or has the mds daemon died? You'll
> have to get the mdsmap line to show before it will mount
>
> Sean
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS dying on Ceph 0.67.10

2014-08-26 Thread Gregory Farnum
I don't think the log messages you're showing are the actual cause of
the failure. The log file should have a proper stack trace (with
specific function references and probably a listed assert failure),
can you find that?
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Tue, Aug 26, 2014 at 9:11 AM, MinhTien MinhTien
 wrote:
> Hi all,
>
> I have a cluster of 2 nodes on Centos 6.5 with ceph 0.67.10 (replicate =  2)
>
> When I add the 3rd node to the Ceph cluster, Ceph performs load balancing.
>
> I have 3 MDSs on 3 nodes; the MDS process dies after a while with a stack
> trace:
>
> ---
>
>  2014-08-26 17:08:34.362901 7f1c2c704700  1 -- 10.20.0.21:6800/22154 <==
> osd.10 10.20.0.21:6802/15917 1  osd_op_reply(230 10003f6.
> [tmapup 0~0] ondisk = 0) v4  119+0+0 (1770421071 0 0) 0x2aece00 con
> 0x2aa4200
>-54> 2014-08-26 17:08:34.362942 7f1c2c704700  1 -- 10.20.0.21:6800/22154
> <== osd.55 10.20.0.23:6800/2407 10  osd_op_reply(263
> 100048a. [getxattr] ack = -2 (No such file or directory)) v4
>  119+0+0 (3908997833 0 0) 0x1e63000 con 0x1e7aaa0
>-53> 2014-08-26 17:08:34.363001 7f1c2c704700  5 mds.0.log submit_entry
> 427629603~1541 : EUpdate purge_stray truncate [metablob 100, 2 dirs]
>-52> 2014-08-26 17:08:34.363022 7f1c2c704700  1 -- 10.20.0.21:6800/22154
> <== osd.37 10.20.0.22:6898/11994 6  osd_op_reply(226 1. [tmapput
> 0~7664] ondisk = 0) v4  109+0+0 (1007110430 0 0) 0x1e64800 con 0x1e7a7e0
>-51> 2014-08-26 17:08:34.363092 7f1c2c704700  5 mds.0.log _expired
> segment 293601899 2548 events
>-50> 2014-08-26 17:08:34.363117 7f1c2c704700  1 -- 10.20.0.21:6800/22154
> <== osd.17 10.20.0.21:6941/17572 9  osd_op_reply(264
> 1000489. [getxattr] ack = -2 (No such file or directory)) v4
>  119+0+0 (1979034473 0 0) 0x1e62200 con 0x1e7b180
>-49> 2014-08-26 17:08:34.363177 7f1c2c704700  5 mds.0.log submit_entry
> 427631148~1541 : EUpdate purge_stray truncate [metablob 100, 2 dirs]
>-48> 2014-08-26 17:08:34.363197 7f1c2c704700  1 -- 10.20.0.21:6800/22154
> <== osd.1 10.20.0.21:6872/13227 6  osd_op_reply(265 1000491.
> [getxattr] ack = -2 (No such file or directory)) v4  119+0+0 (1231782695
> 0 0) 0x1e63400 con 0x1e7ac00
>-47> 2014-08-26 17:08:34.363255 7f1c2c704700  5 mds.0.log submit_entry
> 427632693~1541 : EUpdate purge_stray truncate [metablob 100, 2 dirs]
>-46> 2014-08-26 17:08:34.363274 7f1c2c704700  1 -- 10.20.0.21:6800/22154
> <== osd.11 10.20.0.21:6884/7018 5  osd_op_reply(266 100047d.
> [getxattr] ack = -2 (No such file or directory)) v4  119+0+0 (2737916920
> 0 0) 0x1e61e00 con 0x1e7bc80
>
> -
> I tried to restart the MDSs, but after a few seconds in the "active" state, the
> MDS switches to "laggy or crashed". I have a lot of important data on it.
> I do not want to use the command:
> ceph mds newfs   --yes-i-really-mean-it
>
> :(
>
> Tien Bui.
>
>
>
> --
> Bui Minh Tien
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fresh Firefly install degraded without modified default tunables

2014-08-26 Thread Gregory Farnum
Hmm, that all looks basically fine. But why did you decide not to
segregate OSDs across hosts (according to your CRUSH rules)? I think
maybe it's the interaction of your map, setting choose_local_tries to
0, and trying to go straight to the OSDs instead of choosing hosts.
But I'm not super familiar with how the tunables would act under these
exact conditions.
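
(To see exactly what the cluster is doing, it may help to dump the live tunables
and the decompiled map:)

ceph osd crush show-tunables
ceph osd getcrushmap -o crush.bin
crushtool -d crush.bin -o crush.txt    # inspect the chooseleaf steps and tunable values
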
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Mon, Aug 25, 2014 at 12:59 PM, Ripal Nathuji  wrote:
> Hi Greg,
>
> Thanks for helping to take a look. Please find your requested outputs below.
>
> ceph osd tree:
>
> # id weight type name up/down reweight
> -1 0 root default
> -2 0 host osd1
> 0 0 osd.0 up 1
> 4 0 osd.4 up 1
> 8 0 osd.8 up 1
> 11 0 osd.11 up 1
> -3 0 host osd0
> 1 0 osd.1 up 1
> 3 0 osd.3 up 1
> 6 0 osd.6 up 1
> 9 0 osd.9 up 1
> -4 0 host osd2
> 2 0 osd.2 up 1
> 5 0 osd.5 up 1
> 7 0 osd.7 up 1
> 10 0 osd.10 up 1
>
>
> ceph -s:
>
> cluster 4a158d27-f750-41d5-9e7f-26ce4c9d2d45
>  health HEALTH_WARN 832 pgs degraded; 832 pgs stuck unclean; recovery
> 43/86 objects degraded (50.000%)
>  monmap e1: 1 mons at {ceph-mon0=192.168.2.10:6789/0}, election epoch 2,
> quorum 0 ceph-mon0
>  osdmap e34: 12 osds: 12 up, 12 in
>   pgmap v61: 832 pgs, 8 pools, 840 bytes data, 43 objects
> 403 MB used, 10343 MB / 10747 MB avail
> 43/86 objects degraded (50.000%)
>  832 active+degraded
>
>
> Thanks,
> Ripal
>
> On Aug 25, 2014, at 12:45 PM, Gregory Farnum  wrote:
>
> What's the output of "ceph osd tree"? And the full output of "ceph -s"?
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
>
> On Mon, Aug 18, 2014 at 8:07 PM, Ripal Nathuji  wrote:
>
> Hi folks,
>
> I've come across an issue which I found a "fix" for, but I'm not sure
> whether it's correct or if there is some other misconfiguration on my end
> and this is merely a symptom. I'd appreciate any insights anyone could
> provide based on the information below, and happy to provide more details as
> necessary.
>
> Summary: A fresh install of Ceph 0.80.5 comes up with all pgs marked as
> active+degraded. This reproduces on 12.04 as well as CentOS 7 with a varying
> number of OSD hosts (1, 2, 3), where each OSD host has four storage drives.
The configuration file defines a default replica size of 2, and allows leaves
> of type 0. Specific snippet:
>
> [global]
>  ...
>  osd pool default size = 2
>  osd crush chooseleaf type = 0
>
>
> I verified the crush rules were as expected:
>
>  "rules": [
>{ "rule_id": 0,
>  "rule_name": "replicated_ruleset",
>  "ruleset": 0,
>  "type": 1,
>  "min_size": 1,
>  "max_size": 10,
>  "steps": [
>{ "op": "take",
>  "item": -1,
>  "item_name": "default"},
>{ "op": "choose_firstn",
>  "num": 0,
>  "type": "osd"},
>{ "op": "emit"}]}],
>
>
> Inspecting the pg dump I observed that all pgs had a single osd in the
> up/acting sets. That seemed to explain why the pgs were degraded, but it was
> unclear to me why a second OSD wasn't in the set. After trying a variety of
> things, I noticed that there was a difference between Emperor (which works
> fine in these configurations) and Firefly with the default tunables, where
> Firefly comes up with the bobtail profile. The setting
> choose_local_fallback_tries is 0 in this profile while it used to default to
> 5 on Emperor. Sure enough, if I modify my crush map and set the parameter to
> a non-zero value, the cluster remaps and goes healthy with all pgs
> active+clean.
>
> The documentation states the optimal value of choose_local_fallback_tries is
> 0 for FF, so I'd like to get a better understanding of this parameter and
> why modifying the default value moves the pgs to a clean state in my
> scenarios.
>
> Thanks,
> Ripal
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Ceph-community] ceph replication and striping

2014-08-26 Thread Aaron Ten Clay
On Tue, Aug 26, 2014 at 5:07 AM,  wrote:

>  Hello all,
>
>
>
> I have configured a ceph storage cluster.
>
>
>
> 1. I created the volume. I would like to know whether replication of data
> will happen automatically in ceph?
>
> 2. How do I configure a striped volume using ceph?
>
>
>
>
>
> Regards,
>
> Malleshi CN
>
>
If I understand your position and questions correctly... the replication
level is configured per-pool, so whatever your "size" parameter is set to
for the pool you created the volume in will dictate how many copies are
stored. (Default is 3, IIRC.)

RADOS block device volumes are always striped across 4 MiB objects. I don't
believe that is configurable (at least not yet.)
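
(A quick way to check both on your own pool and image - the names here are just
placeholders:)

ceph osd pool get rbd size      # current replica count for the pool
ceph osd pool set rbd size 3    # change it if needed
rbd info rbd/myimage            # "order 22" corresponds to 4 MiB objects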


FYI, this list is intended for discussion of Ceph community concerns. These
kinds of questions are better handled on the ceph-users list, and I've
forwarded your message accordingly.

-Aaron
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph-fuse fails to mount

2014-08-26 Thread Gregory Farnum
[Re-added the list.]

I believe you'll find everything you need at
http://ceph.com/docs/master/cephfs/createfs/
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Tue, Aug 26, 2014 at 1:25 PM, LaBarre, James  (CTR)  A6IT
 wrote:
> So is there a link for documentation on the newer versions?  (we're doing 
> evaluations at present, so I had wanted to work with newer versions, since it 
> would be closer to what we would end up using).
>
>
> -Original Message-
> From: Gregory Farnum [mailto:g...@inktank.com]
> Sent: Tuesday, August 26, 2014 4:05 PM
> To: Sean Crosby
> Cc: LaBarre, James (CTR) A6IT; ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Ceph-fuse fails to mount
>
> In particular, we changed things post-Firefly so that the filesystem isn't 
> created automatically. You'll need to set it up (and its pools,
> etc) explicitly to use it.
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
>
> On Mon, Aug 25, 2014 at 2:40 PM, Sean Crosby  
> wrote:
>> Hi James,
>>
>>
>> On 26 August 2014 07:17, LaBarre, James (CTR) A6IT
>> 
>> wrote:
>>>
>>>
>>>
>>> [ceph@first_cluster ~]$ ceph -s
>>>
>>> cluster e0433b49-d64c-4c3e-8ad9-59a47d84142d
>>>
>>>  health HEALTH_OK
>>>
>>>  monmap e1: 1 mons at {first_cluster=10.25.164.192:6789/0},
>>> election epoch 2, quorum 0 first_cluster
>>>
>>>  mdsmap e4: 1/1/1 up {0=first_cluster=up:active}
>>>
>>>  osdmap e13: 3 osds: 3 up, 3 in
>>>
>>>   pgmap v480: 192 pgs, 3 pools, 1417 MB data, 4851 objects
>>>
>>> 19835 MB used, 56927 MB / 76762 MB avail
>>>
>>>  192 active+clean
>>
>>
>> This cluster has an MDS. It should mount.
>>
>>>
>>>
>>>
>>> [ceph@second_cluster ~]$ ceph -s
>>>
>>> cluster 06f655b7-e147-4790-ad52-c57dcbf160b7
>>>
>>>  health HEALTH_OK
>>>
>>>  monmap e1: 1 mons at {second_cluster=10.25.165.91:6789/0},
>>> election epoch 1, quorum 0 cilsdbxd1768
>>>
>>>  osdmap e16: 7 osds: 7 up, 7 in
>>>
>>>   pgmap v539: 192 pgs, 3 pools, 0 bytes data, 0 objects
>>>
>>> 252 MB used, 194 GB / 194 GB avail
>>>
>>>  192 active+clean
>>
>>
>> No mdsmap line for this cluster, and therefore the filesystem won't mount.
>> Have you added an MDS for this cluster, or has the mds daemon died?
>> You'll have to get the mdsmap line to show before it will mount
>>
>> Sean
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> --
> CONFIDENTIALITY NOTICE: If you have received this email in error,
> please immediately notify the sender by e-mail at the address shown.
> This email transmission may contain confidential information.  This
> information is intended only for the use of the individual(s) or entity to
> whom it is intended even if addressed incorrectly.  Please delete it from
> your files if you are not the intended recipient.  Thank you for your
> compliance.  Copyright (c) 2014 Cigna
> ==
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fresh Firefly install degraded without modified default tunables

2014-08-26 Thread Ripal Nathuji
Hi Greg,

Good question: I started with a single-node test and had just left the setting
in across the larger configs, since in earlier versions (e.g. Emperor) it didn't
seem to matter. I also had the same thought that it could be causing an issue
with the new default tunables in Firefly and did try removing it for multi-host
(all things the same except for omitting "osd crush chooseleaf type = 0" in
ceph.conf). However, I observed the same behavior in both cases.

Thanks,
Ripal

On Aug 26, 2014, at 3:04 PM, Gregory Farnum  wrote:

> Hmm, that all looks basically fine. But why did you decide not to
> segregate OSDs across hosts (according to your CRUSH rules)? I think
> maybe it's the interaction of your map, setting choose_local_tries to
> 0, and trying to go straight to the OSDs instead of choosing hosts.
> But I'm not super familiar with how the tunables would act under these
> exact conditions.
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
> 
> 
> On Mon, Aug 25, 2014 at 12:59 PM, Ripal Nathuji  wrote:
>> Hi Greg,
>> 
>> Thanks for helping to take a look. Please find your requested outputs below.
>> 
>> ceph osd tree:
>> 
>> # id weight type name up/down reweight
>> -1 0 root default
>> -2 0 host osd1
>> 0 0 osd.0 up 1
>> 4 0 osd.4 up 1
>> 8 0 osd.8 up 1
>> 11 0 osd.11 up 1
>> -3 0 host osd0
>> 1 0 osd.1 up 1
>> 3 0 osd.3 up 1
>> 6 0 osd.6 up 1
>> 9 0 osd.9 up 1
>> -4 0 host osd2
>> 2 0 osd.2 up 1
>> 5 0 osd.5 up 1
>> 7 0 osd.7 up 1
>> 10 0 osd.10 up 1
>> 
>> 
>> ceph -s:
>> 
>>cluster 4a158d27-f750-41d5-9e7f-26ce4c9d2d45
>> health HEALTH_WARN 832 pgs degraded; 832 pgs stuck unclean; recovery
>> 43/86 objects degraded (50.000%)
>> monmap e1: 1 mons at {ceph-mon0=192.168.2.10:6789/0}, election epoch 2,
>> quorum 0 ceph-mon0
>> osdmap e34: 12 osds: 12 up, 12 in
>>  pgmap v61: 832 pgs, 8 pools, 840 bytes data, 43 objects
>>403 MB used, 10343 MB / 10747 MB avail
>>43/86 objects degraded (50.000%)
>> 832 active+degraded
>> 
>> 
>> Thanks,
>> Ripal
>> 
>> On Aug 25, 2014, at 12:45 PM, Gregory Farnum  wrote:
>> 
>> What's the output of "ceph osd tree"? And the full output of "ceph -s"?
>> -Greg
>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>> 
>> 
>> On Mon, Aug 18, 2014 at 8:07 PM, Ripal Nathuji  wrote:
>> 
>> Hi folks,
>> 
>> I've come across an issue which I found a "fix" for, but I'm not sure
>> whether it's correct or if there is some other misconfiguration on my end
>> and this is merely a symptom. I'd appreciate any insights anyone could
>> provide based on the information below, and happy to provide more details as
>> necessary.
>> 
>> Summary: A fresh install of Ceph 0.80.5 comes up with all pgs marked as
>> active+degraded. This reproduces on 12.04 as well as CentOS 7 with a varying
>> number of OSD hosts (1, 2, 3), where each OSD host has four storage drives.
>> The configuration file defines a default replica size of 2, and allows leafs
>> of type 0. Specific snippet:
>> 
>> [global]
>> ...
>> osd pool default size = 2
>> osd crush chooseleaf type = 0
>> 
>> 
>> I verified the crush rules were as expected:
>> 
>> "rules": [
>>   { "rule_id": 0,
>> "rule_name": "replicated_ruleset",
>> "ruleset": 0,
>> "type": 1,
>> "min_size": 1,
>> "max_size": 10,
>> "steps": [
>>   { "op": "take",
>> "item": -1,
>> "item_name": "default"},
>>   { "op": "choose_firstn",
>> "num": 0,
>> "type": "osd"},
>>   { "op": "emit"}]}],
>> 
>> 
>> Inspecting the pg dump I observed that all pgs had a single osd in the
>> up/acting sets. That seemed to explain why the pgs were degraded, but it was
>> unclear to me why a second OSD wasn't in the set. After trying a variety of
>> things, I noticed that there was a difference between Emperor (which works
>> fine in these configurations) and Firefly with the default tunables, where
>> Firefly comes up with the bobtail profile. The setting
>> choose_local_fallback_tries is 0 in this profile while it used to default to
>> 5 on Emperor. Sure enough, if I modify my crush map and set the parameter to
>> a non-zero value, the cluster remaps and goes healthy with all pgs
>> active+clean.
>> 
>> The documentation states the optimal value of choose_local_fallback_tries is
>> 0 for FF, so I'd like to get a better understanding of this parameter and
>> why modifying the default value moves the pgs to a clean state in my
>> scenarios.
>> 
>> Thanks,
>> Ripal
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
>> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] do RGW have billing feature? If have, how do we use it ?

2014-08-26 Thread baijia...@126.com





baijia...@126.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-26 Thread Christian Balzer

Hello,

On Tue, 26 Aug 2014 20:21:39 +0200 Loic Dachary wrote:

> Hi Craig,
> 
> I assume the reason for the 48 hours recovery time is to keep the cost
> of the cluster low ? I wrote "1h recovery time" because it is roughly
> the time it would take to move 4TB over a 10Gb/s link. Could you upgrade
> your hardware to reduce the recovery time to less than two hours ? Or
> are there factors other than cost that prevent this ?
> 

I doubt Craig is operating on a shoestring budget.
And even if his network were to be just GbE, that would still make it only
10 hours according to your wishful thinking formula.

He probably has set the max_backfills to 1 because that is the level of
I/O his OSDs can handle w/o degrading cluster performance too much.
The network is unlikely to be the limiting factor.

The way I see it most Ceph clusters are in sort of steady state when
operating normally, i.e. a few hundred VM RBD images ticking over, most
actual OSD disk ops are writes, as nearly all hot objects that are being
read are in the page cache of the storage nodes.
Easy peasy.

Until something happens that breaks this routine, like a deep scrub, all
those VMs rebooting at the same time or a backfill caused by a failed OSD.
Now all of a sudden client ops compete with the backfill ops, page caches
are no longer hot, the spinners are seeking left and right. 
Pandemonium.

I doubt very much that even with a SSD backed cluster you would get away
with less than 2 hours for 4TB.

To give you some real life numbers, I currently am building a new cluster
but for the time being have only one storage node to play with.
It consists of 32GB RAM, plenty of CPU oomph, 4 journal SSDs  and 8 actual
OSD HDDs (3TB, 7200RPM). 90GB of (test) data on it.

So I took out one OSD (reweight 0 first, then the usual removal steps)
because the actual disk was wonky. Replaced the disk and re-added the OSD.
Both operations took about the same time, 4 minutes for evacuating the OSD
(having 7 write targets clearly helped) for a measly 12GB or about 50MB/s
and 5 minutes or about 35MB/s for refilling the OSD. 
And that is on one node (thus no network latency) that has the default
parameters (so a max_backfill of 10) which was otherwise totally idle. 

In other words, in this pretty ideal case it would have taken 22 hours
to re-distribute 4TB.
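
(Same arithmetic scaled up, at the ~50MB/s observed above:)

echo '4 * 1000 * 1000 / 50 / 3600' | bc -l    # ~22 hours for 4TB at 50MB/s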

More in another reply.

> Cheers
> 
> On 26/08/2014 19:37, Craig Lewis wrote:
> > My OSD rebuild time is more like 48 hours (4TB disks, >60% full, osd
> > max backfills = 1).   I believe that increases my risk of failure by
> > 48^2 .  Since your numbers are failure rate per hour per disk, I need
> > to consider the risk for the whole time for each disk.  So more
> > formally, rebuild time to the power of (replicas -1).
> > 
> > So I'm at 2304/100,000,000, or  approximately 1/43,000.  That's a much
> > higher risk than 1 / 10^8.
> > 
> > 
> > A risk of 1/43,000 means that I'm more likely to lose data due to
> > human error than disk failure.  Still, I can put a small bit of effort
> > in to optimize recovery speed, and lower this number.  Managing human
> > error is much harder.
> > 
> > 
> > 
> > 
> > 
> > 
> > On Tue, Aug 26, 2014 at 7:12 AM, Loic Dachary  > > wrote:
> > 
> > Using percentages instead of numbers lead me to calculations
> > errors. Here it is again using 1/100 instead of % for clarity ;-)
> > 
> > Assuming that:
> > 
> > * The pool is configured for three replicas (size = 3 which is the
> > default)
> > * It takes one hour for Ceph to recover from the loss of a single
> > OSD
> > * Any other disk has a 1/100,000 chance to fail within the hour
> > following the failure of the first disk (assuming AFR
> > https://en.wikipedia.org/wiki/Annualized_failure_rate of every disk is
> > 8%, divided by the number of hours during a year == (0.08 / 8760) ~=
> > 1/100,000
> > * A given disk does not participate in more than 100 PG
> > 
> 


-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-deploy with --release (--stable) for dumpling?

2014-08-26 Thread Nigel Williams
On Tue, Aug 26, 2014 at 5:10 PM, Konrad Gutkowski
 wrote:
> Ceph-deploy should set priority for ceph repository, which it doesn't, this
> usually installs the best available version from any repository.

Thanks Konrad for the tip. It took several goes (notably ceph-deploy
purge did not, for me at least, seem to be removing librbd1 cleanly),
but I managed to get 0.67.10 to be preferred; basically I did this:

root@ceph12:~# ceph -v
ceph version 0.67.10
root@ceph12:~# cat /etc/apt/preferences
Package: *
Pin: origin ceph.com
Pin-priority: 900

Package: *
Pin: origin ceph.newdream.net
Pin-priority: 900
root@ceph12:~#
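
(It may also be worth confirming the pin actually wins before reinstalling:)

apt-cache policy ceph librbd1
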
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-26 Thread Christian Balzer

Hello,

On Tue, 26 Aug 2014 16:12:11 +0200 Loic Dachary wrote:

> Using percentages instead of numbers lead me to calculations errors.
> Here it is again using 1/100 instead of % for clarity ;-)
> 
> Assuming that:
> 
> * The pool is configured for three replicas (size = 3 which is the
> default)
> * It takes one hour for Ceph to recover from the loss of a single OSD
I think Craig and I have debunked that number.
It will be something like "that depends on many things starting with the
amount of data, the disk speeds, the contention (client and other ops),
the network speed/utilization, the actual OSD process and queue handling
speed, etc.".
If you want to make an assumption that's not an order of magnitude wrong,
start with 24 hours.

It would be nice to hear from people with really huge clusters like Dan at
CERN how their recovery speeds are.

> * Any other disk has a 1/100,000 chance to fail within the hour
> following the failure of the first disk (assuming AFR
> https://en.wikipedia.org/wiki/Annualized_failure_rate of every disk is
> 8%, divided by the number of hours during a year == (0.08 / 8760) ~=
> 1/100,000 
> * A given disk does not participate in more than 100 PG
> 
You will find that the smaller the cluster, the more likely it is to be
higher than 100, due to rounding up or just upping things because the
distribution is too uneven otherwise.


> Each time an OSD is lost, there is a 1/100,000*1/100,000 =
> 1/10,000,000,000 chance that two other disks are lost before recovery.
> Since the disk that failed initialy participates in 100 PG, that is
> 1/10,000,000,000 x 100 = 1/100,000,000 chance that a PG is lost. Or the
> entire pool if it is used in a way that loosing a PG means loosing all
> data in the pool (as in your example, where it contains RBD volumes and
> each of the RBD volume uses all the available PG).
> 
> If the pool is using at least two datacenters operated by two different
> organizations, this calculation makes sense to me. However, if the
> cluster is in a single datacenter, isn't it possible that some event
> independent of Ceph has a greater probability of permanently destroying
> the data ? A month ago I lost three machines in a Ceph cluster and
> realized on that occasion that the crushmap was not configured properly
> and that PG were lost as a result. Fortunately I was able to recover the
> disks and plug them in another machine to recover the lost PGs. I'm not
> a system administrator and the probability of me failing to do the right
> thing is higher than normal: this is just an example of a high
> probability event leading to data loss. Another example would be if all
> disks in the same PG are part of the same batch and therefore likely to
> fail at the same time. In other words, I wonder if this 0.0001% chance
> of losing a PG within the hour following a disk failure matters or if it
> is dominated by other factors. What do you think ?
>

Batch failures are real, I'm seeing that all the time. 
However they still tend to be spaced out widely enough most of the time.
Still something to consider in a complete calculation.

As for failures other than disks, these tend to be recoverable, as you
experienced yourself. A node, rack, whatever failure might make your
cluster temporarily inaccessible (and thus should be avoided by proper
CRUSH maps and other precautions), but it will not lead to actual data
loss.
  
Regards,

Christian
 
> Cheers
> 
> > 
> > Assuming that:
> > 
> > * The pool is configured for three replicas (size = 3 which is the
> > default)
> > * It takes one hour for Ceph to recover from the loss of a single OSD
> > * Any other disk has a 0.001% chance to fail within the hour following
> > the failure of the first disk (assuming AFR
> > https://en.wikipedia.org/wiki/Annualized_failure_rate of every disk is
> > 10%, divided by the number of hours during a year).
> > * A given disk does not participate in more than 100 PG
> > 
> > Each time an OSD is lost, there is a 0.001*0.001 = 0.01% chance
> > that two other disks are lost before recovery. Since the disk that
> > failed initialy participates in 100 PG, that is 0.01% x 100 =
> > 0.0001% chance that a PG is lost. Or the entire pool if it is used in
> > a way that loosing a PG means loosing all data in the pool (as in your
> > example, where it contains RBD volumes and each of the RBD volume uses
> > all the available PG).
> > 
> > If the pool is using at least two datacenters operated by two
> > different organizations, this calculation makes sense to me. However,
> > if the cluster is in a single datacenter, isn't it possible that some
> > event independent of Ceph has a greater probability of permanently
> > destroying the data ? A month ago I lost three machines in a Ceph
> > cluster and realized on that occasion that the crushmap was not
> > configured properly and that PG were lost as a result. Fortunately I
> > was able to recover the disks and plug them in another machine to
> > recover th

Re: [ceph-users] MDS dying on Ceph 0.67.10

2014-08-26 Thread MinhTien MinhTien
Hi Gregory Farnum,

Thank you for your reply!
This is the log:

2014-08-26 16:22:39.103461 7f083752f700 -1 mds/CDir.cc: In function 'void
CDir::_committed(version_t)' thread 7f083752f700 time 2014-08-26
16:22:39.075809
mds/CDir.cc: 2071: FAILED assert(in->is_dirty() || in->last < ((__u64)(-2)))

 ceph version 0.67.10 (9d446bd416c52cd785ccf048ca67737ceafcdd7f)
 1: (CDir::_committed(unsigned long)+0xc4e) [0x74d9ee]
 2: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xe8d) [0x7d09bd]
 3: (MDS::handle_core_message(Message*)+0x987) [0x57c457]
 4: (MDS::_dispatch(Message*)+0x2f) [0x57c50f]
 5: (MDS::ms_dispatch(Message*)+0x19b) [0x57dfbb]
 6: (DispatchQueue::entry()+0x5a2) [0x904732]
 7: (DispatchQueue::DispatchThread::entry()+0xd) [0x8afdbd]
 8: (()+0x79d1) [0x7f083c2979d1]
 9: (clone()+0x6d) [0x7f083afb6b5d]
 NOTE: a copy of the executable, or `objdump -rdS ` is needed
to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 hadoop
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent 1
  max_new 1000
  log_file /var/log/ceph/ceph-mds.Ceph01-dc5k3u0104.log
--- end dump of recent events ---
2014-08-26 16:22:39.134173 7f083752f700 -1 *** Caught signal (Aborted) **
 in thread 7f083752f700




On Wed, Aug 27, 2014 at 3:09 AM, Gregory Farnum  wrote:

> I don't think the log messages you're showing are the actual cause of
> the failure. The log file should have a proper stack trace (with
> specific function references and probably a listed assert failure),
> can you find that?
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
>
> On Tue, Aug 26, 2014 at 9:11 AM, MinhTien MinhTien
>  wrote:
> > Hi all,
> >
> > I have a cluster of 2 nodes on Centos 6.5 with ceph 0.67.10 (replicate
> =  2)
> >
> > When I add the 3rd node in the Ceph Cluster, CEPH perform load balancing.
> >
> > I have 3 MDS in 3 nodes,the MDS process is dying after a while with a
> stack
> > trace:
> >
> >
> ---
> >
> >  2014-08-26 17:08:34.362901 7f1c2c704700  1 -- 10.20.0.21:6800/22154 <==
> > osd.10 10.20.0.21:6802/15917 1  osd_op_reply(230
> 10003f6.
> > [tmapup 0~0] ondisk = 0) v4  119+0+0 (1770421071 0 0) 0x2aece00 con
> > 0x2aa4200
> >-54> 2014-08-26 17:08:34.362942 7f1c2c704700  1 --
> 10.20.0.21:6800/22154
> > <== osd.55 10.20.0.23:6800/2407 10  osd_op_reply(263
> > 100048a. [getxattr] ack = -2 (No such file or directory)) v4
> >  119+0+0 (3908997833 0 0) 0x1e63000 con 0x1e7aaa0
> >-53> 2014-08-26 17:08:34.363001 7f1c2c704700  5 mds.0.log submit_entry
> > 427629603~1541 : EUpdate purge_stray truncate [metablob 100, 2 dirs]
> >-52> 2014-08-26 17:08:34.363022 7f1c2c704700  1 --
> 10.20.0.21:6800/22154
> > <== osd.37 10.20.0.22:6898/11994 6  osd_op_reply(226 1.
> [tmapput
> > 0~7664] ondisk = 0) v4  109+0+0 (1007110430 0 0) 0x1e64800 con
> 0x1e7a7e0
> >-51> 2014-08-26 17:08:34.363092 7f1c2c704700  5 mds.0.log _expired
> > segment 293601899 2548 events
> >-50> 2014-08-26 17:08:34.363117 7f1c2c704700  1 --
> 10.20.0.21:6800/22154
> > <== osd.17 10.20.0.21:6941/17572 9  osd_op_reply(264
> > 1000489. [getxattr] ack = -2 (No such file or directory)) v4
> >  119+0+0 (1979034473 0 0) 0x1e62200 con 0x1e7b180
> >-49> 2014-08-26 17:08:34.363177 7f1c2c704700  5 mds.0.log submit_entry
> > 427631148~1541 : EUpdate purge_stray truncate [metablob 100, 2 dirs]
> >-48> 2014-08-26 17:08:34.363197 7f1c2c704700  1 --
> 10.20.0.21:6800/22154
> > <== osd.1 10.20.0.21:6872/13227 6  osd_op_reply(265
> 1000491.
> > [getxattr] ack = -2 (No such file or directory)) v4  119+0+0
> (1231782695
> > 0 0) 0x1e63400 con 0x1e7ac00
> >-47> 2014-08-26 17:08:34.363255 7f1c2c704700  5 mds.0.log submit_entry
> > 427632693~1541 : EUpdate purge_stray truncate [metablob 100, 2 dirs]
> >-46> 2014-08-26 17:08:34.363274 7f1c2c704700  1 --
> 10.20.0.21:6800/22154
> > <== osd.11 10.20.0.21:6884/7018 5  osd_op_reply(266
> 100047d.
> > [getxattr] ack = -2 (No such file or directory)) v4  119+0+0
> (2737916920
> > 0 0) 0x1e61e00 con 0x1e7bc80
> >
> >
> 

[ceph-users] error ioctl(BTRFS_IOC_SNAP_CREATE) failed: (17) File exists

2014-08-26 Thread John Morris
During reorganization of the Ceph system, including an updated CRUSH
map and moving to btrfs, some PGs became stuck incomplete+remapped.
Before that was resolved, a restart of osd.1 failed while creating a
btrfs snapshot.  A 'ceph-osd -i 1 --flush-journal' fails with the same
error.  See the below pasted log.

This is a Bad Thing, because two PGs are now stuck down+peering.  A
'ceph pg 2.74 query' shows they had been stuck on osd.1 before the
btrfs problem, despite what the 'last acting' field shows in the below
'ceph health detail' output.

Is there any way to recover from this?  Judging from Google searches
on the list archives, nobody has run into this problem before, so I'm
quite worried that this spells backup recovery exercises for the next
few days.

Related question:  Are outright OSD crashes the reason btrfs is
discouraged for production use?
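
(For what it's worth, the leftover snapshots mentioned in the mount log below can
be listed directly from the OSD's data dir - the path is taken from this setup,
and the delete is only a guess at a possible next step:)

btrfs subvolume list /ceph/osd.1
# btrfs subvolume delete /ceph/osd.1/snap_6009082    (only if known to be safe)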

Thanks-

John



pg 2.74 is stuck inactive since forever, current state down+peering, last 
acting [3,7,0,6]
pg 3.73 is stuck inactive since forever, current state down+peering, last 
acting [3,7,0,6]
pg 2.74 is stuck unclean since forever, current state down+peering, last acting 
[3,7,0,6]
pg 3.73 is stuck unclean since forever, current state down+peering, last acting 
[3,7,0,6]
pg 2.74 is down+peering, acting [3,7,0,6]
pg 3.73 is down+peering, acting [3,7,0,6]


2014-08-26 22:36:12.641585 7f5b38e507a0  0 ceph version 0.67.10 
(9d446bd416c52cd785ccf048ca67737ceafcdd7f), process ceph-osd, pid 10281
2014-08-26 22:36:12.717100 7f5b38e507a0  0 filestore(/ceph/osd.1) mount FIEMAP 
ioctl is supported and appears to work
2014-08-26 22:36:12.717121 7f5b38e507a0  0 filestore(/ceph/osd.1) mount FIEMAP 
ioctl is disabled via 'filestore fiemap' config option
2014-08-26 22:36:12.717434 7f5b38e507a0  0 filestore(/ceph/osd.1) mount 
detected btrfs
2014-08-26 22:36:12.717471 7f5b38e507a0  0 filestore(/ceph/osd.1) mount btrfs 
CLONE_RANGE ioctl is supported
2014-08-26 22:36:12.765009 7f5b38e507a0  0 filestore(/ceph/osd.1) mount btrfs 
SNAP_CREATE is supported
2014-08-26 22:36:12.765335 7f5b38e507a0  0 filestore(/ceph/osd.1) mount btrfs 
SNAP_DESTROY is supported
2014-08-26 22:36:12.765541 7f5b38e507a0  0 filestore(/ceph/osd.1) mount btrfs 
START_SYNC is supported (transid 3118)
2014-08-26 22:36:12.789600 7f5b38e507a0  0 filestore(/ceph/osd.1) mount btrfs 
WAIT_SYNC is supported
2014-08-26 22:36:12.808287 7f5b38e507a0  0 filestore(/ceph/osd.1) mount btrfs 
SNAP_CREATE_V2 is supported
2014-08-26 22:36:12.834144 7f5b38e507a0  0 filestore(/ceph/osd.1) mount 
syscall(SYS_syncfs, fd) fully supported
2014-08-26 22:36:12.834377 7f5b38e507a0  0 filestore(/ceph/osd.1) mount found 
snaps <6009082,6009083>
2014-08-26 22:36:12.834427 7f5b38e507a0 -1 filestore(/ceph/osd.1) 
FileStore::mount: error removing old current subvol: (22) Invalid argument
2014-08-26 22:36:12.861045 7f5b38e507a0 -1 filestore(/ceph/osd.1) mount initial 
op seq is 0; something is wrong
2014-08-26 22:36:12.861428 7f5b38e507a0 -1 ^[[0;31m ** ERROR: error converting 
store /ceph/osd.1: (22) Invalid argument^[[0m
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] 'incomplete' PGs: what does it mean?

2014-08-26 Thread John Morris
In the docs [1], 'incomplete' is defined thusly:

  Ceph detects that a placement group is missing a necessary period of
  history from its log. If you see this state, report a bug, and try
  to start any failed OSDs that may contain the needed information.

However, during an extensive review of list postings related to
incomplete PGs, an alternate and oft-repeated definition is something
like 'the number of existing replicas is less than the min_size of the
pool'.  In no list posting was there any acknowledgement of the
definition from the docs.

While trying to understand what 'incomplete' PGs are, I simply set
min_size = 1 on this cluster with incomplete PGs, and they continue to
be 'incomplete'.  Does this mean that definition #2 is incorrect?
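
(For reference, the min_size experiment was simply along these lines - the pool
name is a placeholder:)

ceph osd pool set data min_size 1
ceph pg <pgid> query    # the peering/incomplete details show up in this output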

In case #1 is correct, how can the cluster be told to forget the lapse
in history?  In our case, there was nothing writing to the cluster
during the OSD reorganization that could have caused this lapse.

[1] http://ceph.com/docs/master/rados/operations/pg-states/

John
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com