[ceph-users] RGW performance test , put 30 thousands objects to one bucket, average latency 3 seconds

2014-07-03 Thread baijia...@126.com
hi, everyone

when I use rest-bench to test RGW with the command: rest-bench --access-key=ak 
--secret=sk  --bucket=bucket --seconds=360 -t 200  -b 524288  --no-cleanup 
write 

I found that the method bucket_prepare_op which RGW calls is very slow, so I 
looked at 'dump_historic_ops' and saw:
{ "description": "osd_op(client.4211.0:265984 .dir.default.4148.1 [call 
rgw.bucket_prepare_op] 3.b168f3d0 e37)",
  "received_at": "2014-07-03 11:07:02.465700",
  "age": "308.315230",
  "duration": "3.401743",
  "type_data": [
    "commit sent; apply or cleanup",
    { "client": "client.4211",
      "tid": 265984},
    [
      { "time": "2014-07-03 11:07:02.465852",
        "event": "waiting_for_osdmap"},
      { "time": "2014-07-03 11:07:02.465875",
        "event": "queue op_wq"},
      { "time": "2014-07-03 11:07:03.729087",
        "event": "reached_pg"},
      { "time": "2014-07-03 11:07:03.729120",
        "event": "started"},
      { "time": "2014-07-03 11:07:03.729126",
        "event": "started"},
      { "time": "2014-07-03 11:07:03.804366",
        "event": "waiting for subops from [19,9]"},
      { "time": "2014-07-03 11:07:03.804431",
        "event": "commit_queued_for_journal_write"},
      { "time": "2014-07-03 11:07:03.804509",
        "event": "write_thread_in_journal_buffer"},
      { "time": "2014-07-03 11:07:03.934419",
        "event": "journaled_completion_queued"},
      { "time": "2014-07-03 11:07:05.297282",
        "event": "sub_op_commit_rec"},
      { "time": "2014-07-03 11:07:05.297319",
        "event": "sub_op_commit_rec"},
      { "time": "2014-07-03 11:07:05.311217",
        "event": "op_applied"},
      { "time": "2014-07-03 11:07:05.867384",
        "event": "op_commit finish lock"},
      { "time": "2014-07-03 11:07:05.867385",
        "event": "op_commit"},
      { "time": "2014-07-03 11:07:05.867424",
        "event": "commit_sent"},
      { "time": "2014-07-03 11:07:05.867428",
        "event": "op_commit finish"},
      { "time": "2014-07-03 11:07:05.867443",
        "event": "done"}]]}]}

So I see two performance degradations: one is from "queue op_wq" to "reached_pg", 
the other is from "journaled_completion_queued" to "op_commit".
And I must stress that there are very many ops writing to the one bucket object, so how 
can I reduce the latency?
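
(For reference, the dump above is the kind of output you get from the OSD admin
socket; a minimal sketch of how to pull it, assuming the default socket path and
osd.12 as an example id:)

  ceph daemon osd.12 dump_historic_ops
  # or, equivalently, pointing at the socket file directly:
  ceph --admin-daemon /var/run/ceph/ceph-osd.12.asok dump_historic_ops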





baijia...@126.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph RBD and Backup.

2014-07-03 Thread Wolfgang Hennerbichler
If the RBD filesystem ‘belongs’ to you, you can do something like this:

http://www.wogri.com/linux/ceph-vm-backup/
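
A minimal sketch of the snapshot-based variant, assuming a pool named rbd and an
image named vm1 (both placeholders; adjust to your setup):

  # quiesce/freeze the guest filesystem first if you can, then:
  rbd snap create rbd/vm1@backup-20140703
  # full export of the snapshot to a file (or pipe it into your backup storage)
  rbd export rbd/vm1@backup-20140703 /backup/vm1-20140703.img
  # or an incremental diff against the previous day's snapshot
  rbd export-diff --from-snap backup-20140702 rbd/vm1@backup-20140703 /backup/vm1-20140703.diff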

On Jul 3, 2014, at 7:21 AM, Irek Fasikhov malm...@gmail.com wrote:

 
 Hi,All.
 
 Dear community. How do you make backups of Ceph RBD?
 
 Thanks
 
 -- 
 Fasihov Irek (aka Kataklysm).
 Best regards, Fasikhov Irek Nurgayazovich
 Mob.: +79229045757
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph RBD and Backup.

2014-07-03 Thread Christian Kauhaus
On 03.07.2014 07:21, Irek Fasikhov wrote:
 Dear community. How do you make backups of Ceph RBD?

We @ gocept are currently in the process of developing backy, a new-style
backup tool that works directly with block level snapshots / diffs.

The tool is not quite finished, but it is making rapid progress. It would be
great if you'd try it, spot bugs, contribute code etc. Help is appreciated. :-)

PyPI page: https://pypi.python.org/pypi/backy/

Pull requests go here: https://bitbucket.org/ctheune/backy

Christian Theune c...@gocept.com is the primary contact.

HTH

Christian

-- 
Dipl.-Inf. Christian Kauhaus  · k...@gocept.com · systems administration
gocept gmbh & co. kg · Forsterstraße 29 · 06112 Halle (Saale) · Germany
http://gocept.com · tel +49 345 219401-11
Python, Pyramid, Plone, Zope · consulting, development, hosting, operations
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] release date for 0.80.2

2014-07-03 Thread Andrei Mikhailovsky
Hi guys, 

Was wondering if 0.80.2 is coming out any time soon? I am planning an upgrade from 
Emperor and was wondering if I should wait for 0.80.2 to come out, if the 
release date is pretty soon. Otherwise, I will go for 0.80.1. 

Cheers 
Andrei 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mixing CEPH versions on new ceph nodes...

2014-07-03 Thread Andrija Panic
Hi Wido, thanks for the answers - I have mons and OSDs on each host... server1:
mon + 2 OSDs, same for server2 and server3.

Any proposed upgrade path, or just start with 1 server and move along to
the others?

Thanks again.
Andrija


On 2 July 2014 16:34, Wido den Hollander w...@42on.com wrote:

 On 07/02/2014 04:08 PM, Andrija Panic wrote:

 Hi,

 I have an existing CEPH cluster of 3 nodes, version 0.72.2

 I'm in the process of installing CEPH on a 4th node, but now the CEPH version is
 0.80.1

 Will this cause problems running mixed CEPH versions ?


 No, but the recommendation is not to have this running for a very long
 period. Try to upgrade all nodes to the same version within a reasonable
 amount of time.


  I intend to upgrade CEPH on the existing 3 nodes anyway.
 Recommended steps ?


 Always upgrade the monitors first! Then to the OSDs one by one.

  Thanks

 --

 Andrija Panić


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



 --
 Wido den Hollander
 42on B.V.
 Ceph trainer and consultant

 Phone: +31 (0)20 700 9902
 Skype: contact42on
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




-- 

Andrija Panić
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] release date for 0.80.2

2014-07-03 Thread Wido den Hollander

On 07/03/2014 10:27 AM, Andrei Mikhailovsky wrote:

Hi guys,

Was wondering if 0.80.2 is coming any time soon? I am planning na
upgrade from Emperor and was wondering if I should wait for 0.80.2 to
come out if the release date is pretty soon. Otherwise, I will go for
the 0.80.1.



Why bother? Upgrading from 0.80.1 to .2 is not that much work.

Or is there a specific bug in 0.80.1 which you don't want to run into?


Cheers
Andrei


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




--
Wido den Hollander
Ceph consultant and trainer
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mixing CEPH versions on new ceph nodes...

2014-07-03 Thread Wido den Hollander

On 07/03/2014 10:59 AM, Andrija Panic wrote:

Hi Wido, thanks for answers - I have mons and OSD on each host...
server1: mon + 2 OSDs, same for server2 and server3.

Any Proposed upgrade path, or just start with 1 server and move along to
others ?



Upgrade the packages, but don't restart the daemons yet, then:

1. Restart the mon leader
2. Restart the two other mons
3. Restart all the OSDs one by one

I suggest that you wait for the cluster to become fully healthy again 
before restarting the next OSD.
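
A minimal sketch of that sequence (sysvinit-style commands; the package manager
and daemon ids are placeholders for your environment):

  # upgrade packages on every node first, without restarting the daemons
  apt-get install ceph ceph-common      # or the yum equivalent
  # restart the mon leader, then the other two mons
  service ceph restart mon
  # then the OSDs, one at a time, e.g.:
  service ceph restart osd.0
  ceph health                           # wait for HEALTH_OK before the next one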


Wido


Thanks again.
Andrija


On 2 July 2014 16:34, Wido den Hollander w...@42on.com
mailto:w...@42on.com wrote:

On 07/02/2014 04:08 PM, Andrija Panic wrote:

Hi,

I have existing CEPH cluster of 3 nodes, versions 0.72.2

I'm in a process of installing CEPH on 4th node, but now CEPH
version is
0.80.1

Will this make problems running mixed CEPH versions ?


No, but the recommendation is not to have this running for a very
long period. Try to upgrade all nodes to the same version within a
reasonable amount of time.


I intend to upgrade CEPH on exsiting 3 nodes anyway ?
Recommended steps ?


Always upgrade the monitors first! Then to the OSDs one by one.

Thanks

--

Andrija Panić


_
ceph-users mailing list
ceph-users@lists.ceph.com mailto:ceph-users@lists.ceph.com
http://lists.ceph.com/__listinfo.cgi/ceph-users-ceph.__com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



--
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902 tel:%2B31%20%280%2920%20700%209902
Skype: contact42on
_
ceph-users mailing list
ceph-users@lists.ceph.com mailto:ceph-users@lists.ceph.com
http://lists.ceph.com/__listinfo.cgi/ceph-users-ceph.__com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




--

Andrija Panić



--
Wido den Hollander
Ceph consultant and trainer
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Pools do not respond

2014-07-03 Thread Iban Cabrillo
Hi folks,
  I am following the test installation step by step, and checking some
configuration before trying to deploy a production cluster.

  Now I have a healthy cluster with 3 mons + 4 OSDs.
  I have created one pool containing all osd.x, and two more: one for two of
the servers and the other for the other two.

  The general pool works fine (I can create images and mount them on remote
machines).

  But the other two do not work (the commands rados put, or rbd ls pool,
hang forever).

  this is the tree:

   [ceph@cephadm ceph-cloud]$ sudo ceph osd tree
# id weight type name up/down reweight
-7 5.4 root 4x1GbFCnlSAS
-3 2.7 host node04
1 2.7 osd.1 up 1
-4 2.7 host node03
2 2.7 osd.2 up 1
-6 8.1 root 4x4GbFCnlSAS
-5 5.4 host node01
3 2.7 osd.3 up 1
4 2.7 osd.4 up 1
-2 2.7 host node04
0 2.7 osd.0 up 1
-1 13.5 root default
-2 2.7 host node04
0 2.7 osd.0 up 1
-3 2.7 host node04
1 2.7 osd.1 up 1
-4 2.7 host node03
2 2.7 osd.2 up 1
-5 5.4 host node01
3 2.7 osd.3 up 1
4 2.7 osd.4 up 1


And this is the crushmap:

...
root 4x4GbFCnlSAS {
id -6 #do not change unnecessarily
alg straw
hash 0  # rjenkins1
item node01 weight 5.400
item node04 weight 2.700
}
root 4x1GbFCnlSAS {
id -7 #do not change unnecessarily
alg straw
hash 0  # rjenkins1
item node04 weight 2.700
item node03 weight 2.700
}
# rules
rule 4x4GbFCnlSAS {
ruleset 1
type replicated
min_size 1
max_size 10
step take 4x4GbFCnlSAS
step choose firstn 0 type host
step emit
}
rule 4x1GbFCnlSAS {
ruleset 2
type replicated
min_size 1
max_size 10
step take 4x1GbFCnlSAS
step choose firstn 0 type host
step emit
}
..
I of course set the crush_rules:
sudo ceph osd pool set cloud-4x1GbFCnlSAS crush_ruleset 2
sudo ceph osd pool set cloud-4x4GbFCnlSAS crush_ruleset 1

but it seems something is wrong (4x4GbFCnlSAS.pool is a 512MB file):
   sudo rados -p cloud-4x1GbFCnlSAS put 4x4GbFCnlSAS.object
4x4GbFCnlSAS.pool
!! HANGS forever !!

from the ceph-client the same thing happens:
 rbd ls cloud-4x1GbFCnlSAS
 !! HANGS forever !!


[root@cephadm ceph-cloud]# ceph osd map cloud-4x1GbFCnlSAS
4x1GbFCnlSAS.object
osdmap e49 pool 'cloud-4x1GbFCnlSAS' (3) object '4x1GbFCnlSAS.object' -> pg
3.114ae7a9 (3.29) -> up ([], p-1) acting ([], p-1)

Any idea what i am doing wrong??

Thanks in advance, I
Bertrand Russell:
"The problem with the world is that the stupid are cocksure of everything and the
intelligent are full of doubt"
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Some OSD and MDS crash

2014-07-03 Thread Joao Eduardo Luis

On 07/03/2014 12:15 AM, Pierre BLONDEAU wrote:

On 03/07/2014 00:55, Samuel Just wrote:

Ah,

~/logs » for i in 20 23; do ../ceph/src/osdmaptool --export-crush
/tmp/crush$i osd-$i*; ../ceph/src/crushtool -d /tmp/crush$i > 
/tmp/crush$i.d; done; diff /tmp/crush20.d /tmp/crush23.d
../ceph/src/osdmaptool: osdmap file
'osd-20_osdmap.13258__0_4E62BB79__none'
../ceph/src/osdmaptool: exported crush map to /tmp/crush20
../ceph/src/osdmaptool: osdmap file
'osd-23_osdmap.13258__0_4E62BB79__none'
../ceph/src/osdmaptool: exported crush map to /tmp/crush23
6d5
< tunable chooseleaf_vary_r 1

 Looks like the chooseleaf_vary_r tunable somehow ended up divergent?


The only thing that comes to mind that could cause this is if we changed 
the leader's in-memory map, proposed it, it failed, and only the leader 
got to write the map to disk somehow.  This happened once on a totally 
different issue (although I can't pinpoint right now which).


In such a scenario, the leader would serve the incorrect osdmap to 
whoever asked osdmaps from it, the remaining quorum would serve the 
correct osdmaps to all the others.  This could cause this divergence. 
Or it could be something else.


Are there logs for the monitors for the timeframe this may have happened in?

  -Joao



Pierre: do you recall how and when that got set?


I am not sure I understand, but if I remember correctly, after the update to
firefly I was in the state: HEALTH_WARN crush map has legacy tunables, and
I saw "feature set mismatch" in the logs.

So, if I remember correctly, I ran "ceph osd crush tunables optimal" for the
crush map problem, and I updated my client and server kernels to 3.16rc.

Could it be that?

Pierre


-Sam

On Wed, Jul 2, 2014 at 3:43 PM, Samuel Just sam.j...@inktank.com wrote:

Yeah, divergent osdmaps:
555ed048e73024687fc8b106a570db4f  osd-20_osdmap.13258__0_4E62BB79__none
6037911f31dc3c18b05499d24dcdbe5c  osd-23_osdmap.13258__0_4E62BB79__none

Joao: thoughts?
-Sam

On Wed, Jul 2, 2014 at 3:39 PM, Pierre BLONDEAU
pierre.blond...@unicaen.fr wrote:

The files

When I upgraded:
  ceph-deploy install --stable firefly servers...
  on each server: service ceph restart mon
  on each server: service ceph restart osd
  on each server: service ceph restart mds

I upgraded from emperor to firefly. After repair, remap, replace,
etc ... I
have some PGs which go into the peering state.

I thought why not try version 0.82, it could solve my problem (
it's my mistake ). So, I upgraded from firefly to 0.83 with :
  ceph-deploy install --testing servers...
  ..

Now, all programs are at version 0.82.
I have 3 mons, 36 OSDs and 3 mds.

Pierre

PS : I also find inc\uosdmap.13258__0_469271DE__none in each meta
directory.

On 03/07/2014 00:10, Samuel Just wrote:


Also, what version did you upgrade from, and how did you upgrade?
-Sam

On Wed, Jul 2, 2014 at 3:09 PM, Samuel Just sam.j...@inktank.com
wrote:


Ok, in current/meta on osd 20 and osd 23, please attach all files
matching

^osdmap.13258.*

There should be one such file on each osd. (should look something
like
osdmap.6__0_FD6E4C01__none, probably hashed into a subdirectory,
you'll want to use find).
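
For example, a sketch assuming the default OSD data path:

  find /var/lib/ceph/osd/ceph-20/current/meta -name 'osdmap.13258*'
  find /var/lib/ceph/osd/ceph-23/current/meta -name 'osdmap.13258*'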

What version of ceph is running on your mons?  How many mons do
you have?
-Sam

On Wed, Jul 2, 2014 at 2:21 PM, Pierre BLONDEAU
pierre.blond...@unicaen.fr wrote:


Hi,

I do it, the log files are available here :
https://blondeau.users.greyc.fr/cephlog/debug20/

The OSD's files are really big +/- 80M .

After starting osd.20 some other OSDs crash. I go from 31 OSDs up
to 16.
I notice that after this the number of down+peering PGs decreases
from 367 to 248. Is that normal? Maybe it's temporary, the time that the
cluster needs to verify all the PGs ?

Regards
Pierre

On 02/07/2014 19:16, Samuel Just wrote:


You should add

debug osd = 20
debug filestore = 20
debug ms = 1

to the [osd] section of the ceph.conf and restart the osds.  I'd
like
all three logs if possible.

Thanks
-Sam

On Wed, Jul 2, 2014 at 5:03 AM, Pierre BLONDEAU
pierre.blond...@unicaen.fr wrote:



Yes, but how do I do that ?

With a command like this ?

ceph tell osd.20 injectargs '--debug-osd 20 --debug-filestore 20
--debug-ms
1'

By modifying /etc/ceph/ceph.conf ? This file is really sparse
because I
use
udev detection.

When I have made these changes, do you want the three log files or
only
osd.20's ?

Thank you so much for the help

Regards
Pierre

On 01/07/2014 23:51, Samuel Just wrote:


Can you reproduce with
debug osd = 20
debug filestore = 20
debug ms = 1
?
-Sam

On Tue, Jul 1, 2014 at 1:21 AM, Pierre BLONDEAU
pierre.blond...@unicaen.fr wrote:




Hi,

I attach:
 - osd.20 is one of the OSDs that I detected makes other
OSDs crash.
 - osd.23 is one of the OSDs which crash when I start osd.20
 - mds is one of my MDSes

I cut the log files because they are too big. Everything is here :
https://blondeau.users.greyc.fr/cephlog/

Regards

On 30/06/2014 17:35, Gregory Farnum wrote:


What's the backtrace from the crashing OSDs?

Keep in mind that as a dev release, it's 

Re: [ceph-users] Mixing CEPH versions on new ceph nodes...

2014-07-03 Thread Andrija Panic
Thanks a lot Wido, will do...

Andrija


On 3 July 2014 13:12, Wido den Hollander w...@42on.com wrote:

 On 07/03/2014 10:59 AM, Andrija Panic wrote:

 Hi Wido, thanks for answers - I have mons and OSD on each host...
 server1: mon + 2 OSDs, same for server2 and server3.

 Any Proposed upgrade path, or just start with 1 server and move along to
 others ?


 Upgrade the packages, but don't restart the daemons yet, then:

 1. Restart the mon leader
 2. Restart the two other mons
 3. Restart all the OSDs one by one

 I suggest that you wait for the cluster to become fully healthy again
 before restarting the next OSD.

 Wido

  Thanks again.
 Andrija


 On 2 July 2014 16:34, Wido den Hollander w...@42on.com
 mailto:w...@42on.com wrote:

 On 07/02/2014 04:08 PM, Andrija Panic wrote:

 Hi,

 I have existing CEPH cluster of 3 nodes, versions 0.72.2

 I'm in a process of installing CEPH on 4th node, but now CEPH
 version is
 0.80.1

 Will this make problems running mixed CEPH versions ?


 No, but the recommendation is not to have this running for a very
 long period. Try to upgrade all nodes to the same version within a
 reasonable amount of time.


 I intend to upgrade CEPH on exsiting 3 nodes anyway ?
 Recommended steps ?


 Always upgrade the monitors first! Then to the OSDs one by one.

 Thanks

 --

 Andrija Panić


 _
 ceph-users mailing list
 ceph-users@lists.ceph.com mailto:ceph-users@lists.ceph.com
 http://lists.ceph.com/__listinfo.cgi/ceph-users-ceph.__com

 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



 --
 Wido den Hollander
 42on B.V.
 Ceph trainer and consultant

 Phone: +31 (0)20 700 9902 tel:%2B31%20%280%2920%20700%209902
 Skype: contact42on
 _
 ceph-users mailing list
 ceph-users@lists.ceph.com mailto:ceph-users@lists.ceph.com
 http://lists.ceph.com/__listinfo.cgi/ceph-users-ceph.__com

 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




 --

 Andrija Panić



 --
 Wido den Hollander
 Ceph consultant and trainer
 42on B.V.


 Phone: +31 (0)20 700 9902
 Skype: contact42on




-- 

Andrija Panić
--
  http://admintweets.com
--
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mixing CEPH versions on new ceph nodes...

2014-07-03 Thread Andrija Panic
Wido,
one final question:
since I compiled libvirt 1.2.3 using ceph-devel 0.72 - do I need to
recompile libvirt again now with ceph-devel 0.80 ?

Perhaps not a smart question, but I need to make sure I don't screw something up...
Thanks for your time,
Andrija


On 3 July 2014 14:27, Andrija Panic andrija.pa...@gmail.com wrote:

 Thanks a lot Wido, will do...

 Andrija


 On 3 July 2014 13:12, Wido den Hollander w...@42on.com wrote:

 On 07/03/2014 10:59 AM, Andrija Panic wrote:

 Hi Wido, thanks for answers - I have mons and OSD on each host...
 server1: mon + 2 OSDs, same for server2 and server3.

 Any Proposed upgrade path, or just start with 1 server and move along to
 others ?


 Upgrade the packages, but don't restart the daemons yet, then:

 1. Restart the mon leader
 2. Restart the two other mons
 3. Restart all the OSDs one by one

 I suggest that you wait for the cluster to become fully healthy again
 before restarting the next OSD.

 Wido

  Thanks again.
 Andrija


 On 2 July 2014 16:34, Wido den Hollander w...@42on.com
 mailto:w...@42on.com wrote:

 On 07/02/2014 04:08 PM, Andrija Panic wrote:

 Hi,

 I have existing CEPH cluster of 3 nodes, versions 0.72.2

 I'm in a process of installing CEPH on 4th node, but now CEPH
 version is
 0.80.1

 Will this make problems running mixed CEPH versions ?


 No, but the recommendation is not to have this running for a very
 long period. Try to upgrade all nodes to the same version within a
 reasonable amount of time.


 I intend to upgrade CEPH on exsiting 3 nodes anyway ?
 Recommended steps ?


 Always upgrade the monitors first! Then to the OSDs one by one.

 Thanks

 --

 Andrija Panić


 _
 ceph-users mailing list
 ceph-users@lists.ceph.com mailto:ceph-users@lists.ceph.com
 http://lists.ceph.com/__listinfo.cgi/ceph-users-ceph.__com

 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



 --
 Wido den Hollander
 42on B.V.
 Ceph trainer and consultant

 Phone: +31 (0)20 700 9902 tel:%2B31%20%280%2920%20700%209902
 Skype: contact42on
 _
 ceph-users mailing list
 ceph-users@lists.ceph.com mailto:ceph-users@lists.ceph.com
 http://lists.ceph.com/__listinfo.cgi/ceph-users-ceph.__com

 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




 --

 Andrija Panić



 --
 Wido den Hollander
 Ceph consultant and trainer
 42on B.V.


 Phone: +31 (0)20 700 9902
 Skype: contact42on




 --

 Andrija Panić
 --
   http://admintweets.com
 --




-- 

Andrija Panić
--
  http://admintweets.com
--
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] write performance per disk

2014-07-03 Thread VELARTIS Philipp Dürhammer
Hi,

I have a ceph cluster setup (45 SATA disks, journals on the disks) and get only 
450 MB/sec sequential writes (the maximum, playing around with threads in rados 
bench) with a replica count of 2.
That is about ~20 MB of writes per disk (which is what I see in atop as well).
Theoretically, with replica 2 and journals on the disks, it should be 45 x 100 MB 
(SATA) / 2 (replica) / 2 (journal writes), which makes 1125.
SATA disks in reality do 120 MB/sec, so the theoretical output should be even more.
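
(A quick sanity check of that arithmetic - just a sketch, and the 100 MB/s
per-disk figure is an assumption:)

  disks=45; per_disk=100; replicas=2; journal_penalty=2
  echo $(( disks * per_disk / replicas / journal_penalty ))   # -> 1125 MB/s aggregate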

I would expect to have between 40-50 MB/sec for each SATA disk.

Can somebody confirm that they can reach this speed with a setup with journals on 
the SATA disks (with journals on SSD the speed should be 100 MB per disk)?
Or does ceph only give about ¼ of the speed of a disk (and not the ½ 
expected because of the journals)?


My setup is 3 servers, each with: 2 x 2.6 GHz Xeons, 128 GB RAM, 15 SATA disks for 
ceph (and SSDs for the system), 1 x 10 GigE for external traffic, 1 x 10 GigE for OSD traffic.
With reads I can saturate the network, but writes are far from that. And I would 
expect at least to saturate the 10 GigE with sequential writes as well.

Thank you
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mixing CEPH versions on new ceph nodes...

2014-07-03 Thread Wido den Hollander

On 07/03/2014 03:07 PM, Andrija Panic wrote:

Wido,
one final question:
since I compiled libvirt1.2.3 usinfg ceph-devel 0.72 - do I need to
recompile libvirt again now with ceph-devel 0.80 ?

Perhaps not smart question, but need to make sure I don't screw something...


No, no need to. The librados API didn't change in case you are using RBD 
storage pool support.


Otherwise it just talks to Qemu and that talks to librbd/librados.

Wido


Thanks for your time,
Andrija


On 3 July 2014 14:27, Andrija Panic andrija.pa...@gmail.com
mailto:andrija.pa...@gmail.com wrote:

Thanks a lot Wido, will do...

Andrija


On 3 July 2014 13:12, Wido den Hollander w...@42on.com
mailto:w...@42on.com wrote:

On 07/03/2014 10:59 AM, Andrija Panic wrote:

Hi Wido, thanks for answers - I have mons and OSD on each
host...
server1: mon + 2 OSDs, same for server2 and server3.

Any Proposed upgrade path, or just start with 1 server and
move along to
others ?


Upgrade the packages, but don't restart the daemons yet, then:

1. Restart the mon leader
2. Restart the two other mons
3. Restart all the OSDs one by one

I suggest that you wait for the cluster to become fully healthy
again before restarting the next OSD.

Wido

Thanks again.
Andrija


On 2 July 2014 16:34, Wido den Hollander w...@42on.com
mailto:w...@42on.com
mailto:w...@42on.com mailto:w...@42on.com wrote:

 On 07/02/2014 04:08 PM, Andrija Panic wrote:

 Hi,

 I have existing CEPH cluster of 3 nodes, versions
0.72.2

 I'm in a process of installing CEPH on 4th node,
but now CEPH
 version is
 0.80.1

 Will this make problems running mixed CEPH versions ?


 No, but the recommendation is not to have this running
for a very
 long period. Try to upgrade all nodes to the same
version within a
 reasonable amount of time.


 I intend to upgrade CEPH on exsiting 3 nodes anyway ?
 Recommended steps ?


 Always upgrade the monitors first! Then to the OSDs one
by one.

 Thanks

 --

 Andrija Panić


 ___
 ceph-users mailing list
ceph-users@lists.ceph.com mailto:ceph-users@lists.ceph.com
mailto:ceph-us...@lists.ceph.__com
mailto:ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
http://lists.ceph.com/__listinfo.cgi/ceph-users-ceph.__com


http://lists.ceph.com/__listinfo.cgi/ceph-users-ceph.__com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



 --
 Wido den Hollander
 42on B.V.
 Ceph trainer and consultant

 Phone: +31 (0)20 700 9902
tel:%2B31%20%280%2920%20700%209902
tel:%2B31%20%280%2920%20700%__209902
 Skype: contact42on
 ___
 ceph-users mailing list
ceph-users@lists.ceph.com mailto:ceph-users@lists.ceph.com
mailto:ceph-us...@lists.ceph.__com
mailto:ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
http://lists.ceph.com/__listinfo.cgi/ceph-users-ceph.__com


http://lists.ceph.com/__listinfo.cgi/ceph-users-ceph.__com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




--

Andrija Panić



--
Wido den Hollander
Ceph consultant and trainer
42on B.V.


Phone: +31 (0)20 700 9902 tel:%2B31%20%280%2920%20700%209902
Skype: contact42on




--

Andrija Panić
--
http://admintweets.com
--




--

Andrija Panić
--
http://admintweets.com
--



--
Wido den Hollander
Ceph consultant and trainer
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mixing CEPH versions on new ceph nodes...

2014-07-03 Thread Andrija Panic
Thanks again a lot.


On 3 July 2014 15:20, Wido den Hollander w...@42on.com wrote:

 On 07/03/2014 03:07 PM, Andrija Panic wrote:

 Wido,
 one final question:
 since I compiled libvirt1.2.3 usinfg ceph-devel 0.72 - do I need to
 recompile libvirt again now with ceph-devel 0.80 ?

 Perhaps not smart question, but need to make sure I don't screw
 something...


 No, no need to. The librados API didn't change in case you are using RBD
 storage pool support.

 Otherwise it just talks to Qemu and that talks to librbd/librados.

 Wido

  Thanks for your time,
 Andrija


 On 3 July 2014 14:27, Andrija Panic andrija.pa...@gmail.com
 mailto:andrija.pa...@gmail.com wrote:

 Thanks a lot Wido, will do...

 Andrija


 On 3 July 2014 13:12, Wido den Hollander w...@42on.com
 mailto:w...@42on.com wrote:

 On 07/03/2014 10:59 AM, Andrija Panic wrote:

 Hi Wido, thanks for answers - I have mons and OSD on each
 host...
 server1: mon + 2 OSDs, same for server2 and server3.

 Any Proposed upgrade path, or just start with 1 server and
 move along to
 others ?


 Upgrade the packages, but don't restart the daemons yet, then:

 1. Restart the mon leader
 2. Restart the two other mons
 3. Restart all the OSDs one by one

 I suggest that you wait for the cluster to become fully healthy
 again before restarting the next OSD.

 Wido

 Thanks again.
 Andrija


 On 2 July 2014 16:34, Wido den Hollander w...@42on.com
 mailto:w...@42on.com
 mailto:w...@42on.com mailto:w...@42on.com wrote:

  On 07/02/2014 04:08 PM, Andrija Panic wrote:

  Hi,

  I have existing CEPH cluster of 3 nodes, versions
 0.72.2

  I'm in a process of installing CEPH on 4th node,
 but now CEPH
  version is
  0.80.1

  Will this make problems running mixed CEPH versions ?


  No, but the recommendation is not to have this running
 for a very
  long period. Try to upgrade all nodes to the same
 version within a
  reasonable amount of time.


  I intend to upgrade CEPH on exsiting 3 nodes anyway ?
  Recommended steps ?


  Always upgrade the monitors first! Then to the OSDs one
 by one.

  Thanks

  --

  Andrija Panić


  ___

  ceph-users mailing list
 ceph-users@lists.ceph.com mailto:ceph-users@lists.ceph.com
 mailto:ceph-us...@lists.ceph.__com
 mailto:ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph._
 ___com
 http://lists.ceph.com/__listinfo.cgi/ceph-users-ceph.__com



 http://lists.ceph.com/__listinfo.cgi/ceph-users-ceph.__com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



  --
  Wido den Hollander
  42on B.V.
  Ceph trainer and consultant

  Phone: +31 (0)20 700 9902
 tel:%2B31%20%280%2920%20700%209902
 tel:%2B31%20%280%2920%20700%__209902
  Skype: contact42on
  ___

  ceph-users mailing list
 ceph-users@lists.ceph.com mailto:ceph-users@lists.ceph.com
 mailto:ceph-us...@lists.ceph.__com
 mailto:ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph._
 ___com
 http://lists.ceph.com/__listinfo.cgi/ceph-users-ceph.__com



 http://lists.ceph.com/__listinfo.cgi/ceph-users-ceph.__com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




 --

 Andrija Panić



 --
 Wido den Hollander
 Ceph consultant and trainer
 42on B.V.


 Phone: +31 (0)20 700 9902 tel:%2B31%20%280%2920%20700%209902

 Skype: contact42on




 --

 Andrija Panić
 --
 http://admintweets.com
 --




 --

 Andrija Panić
 --
 http://admintweets.com
 --



 --
 Wido den Hollander
 Ceph consultant and trainer
 42on B.V.

 Phone: +31 (0)20 700 9902
 Skype: contact42on




-- 

Andrija Panić
--
  http://admintweets.com
--
___
ceph-users mailing list

Re: [ceph-users] write performance per disk

2014-07-03 Thread Wido den Hollander

On 07/03/2014 03:11 PM, VELARTIS Philipp Dürhammer wrote:

Hi,

I have a ceph cluster setup (with 45 sata disk journal on disks) and get
only 450mb/sec writes seq (maximum playing around with threads in rados
bench) with replica of 2



How many threads?


Which is about ~20Mb writes per disk (what y see in atop also)
theoretically with replica2 and having journals on disk should be 45 X
100mb (sata) / 2 (replica) / 2 (journal writes) which makes it 1125
satas in reality have 120mb/sec so the theoretical output should be more.

I would expect to have between 40-50mb/sec for each sata disk

Can somebody confirm that he can reach this speed with a setup with
journals on the satas (with journals on ssd speed should be 100mb per disk)?
or does ceph only give about ¼ of the speed for a disk? (and not the ½
as expected because of journals)



Did you verify how much each machine is doing? It could be that the data 
is not distributed evenly and that on a certain machine the drives are 
doing 50MB/sec.



My setup is 3 servers with: 2 x 2.6ghz xeons, 128gb ram 15 satas for
ceph (and ssds for system) 1 x 10gig for external traffic, 1 x 10gig for
osd traffic
with reads I can saturate the network but writes is far away. And I
would expect at least to saturate the 10gig with sequential writes also



Should be possible, but with 3 servers the data distribution might not 
be optimal causing a lower write performance.


I've seen 10Gbit write performance on multiple clusters without any 
problems.



Thank you



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




--
Wido den Hollander
Ceph consultant and trainer
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] what is the difference between snapshot and clone in theory?

2014-07-03 Thread yalogr
hi,all
 
what is the difference between snapshot and clone in theory?
 
 
 
thanks
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] write performance per disk

2014-07-03 Thread VELARTIS Philipp Dürhammer
HI,

Ceph.conf:
   osd journal size = 15360
   rbd cache = true
rbd cache size = 2147483648
rbd cache max dirty = 1073741824
rbd cache max dirty age = 100
osd recovery max active = 1
 osd max backfills = 1
 osd mkfs options xfs = -f -i size=2048
 osd mount options xfs = 
rw,noatime,nobarrier,logbsize=256k,logbufs=8,inode64,allocsize=4M
 osd op threads = 8

so it should be 8 threads?

All 3 machines have more or less the same disk load at the same time.
also the disks:
sdb  35.5687.10  6849.09 617310   48540806
sdc  26.7572.62  5148.58 514701   36488992
sdd  35.1553.48  6802.57 378993   48211141
sde  31.0479.04  6208.48 560141   44000710
sdf  32.7938.35  6238.28 271805   44211891
sdg  31.6777.84  5987.45 551680   42434167
sdh  32.9551.29  6315.76 363533   44761001
sdi  31.6756.93  5956.29 403478   42213336
sdj  35.8377.82  6929.31 551501   49109354
sdk  36.8673.84  7291.00 523345   51672704
sdl  36.02   112.90  7040.47 800177   49897132
sdm  33.2538.02  6455.05 269446   45748178
sdn  33.5239.10  6645.19 277101   47095696
sdo  33.2646.22  6388.20 327541   45274394
sdp  33.3874.12  6480.62 525325   45929369


The question is: is it poor performance to get a maximum of 500 MB/s of writes with 45 
disks and replica 2, or should I expect this?


-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Wido 
den Hollander
Sent: Thursday, 03 July 2014 15:22
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] write performance per disk

On 07/03/2014 03:11 PM, VELARTIS Philipp Dürhammer wrote:
 Hi,

 I have a ceph cluster setup (with 45 sata disk journal on disks) and 
 get only 450mb/sec writes seq (maximum playing around with threads in 
 rados
 bench) with replica of 2


How many threads?

 Which is about ~20Mb writes per disk (what y see in atop also) 
 theoretically with replica2 and having journals on disk should be 45 X 
 100mb (sata) / 2 (replica) / 2 (journal writes) which makes it 1125 
 satas in reality have 120mb/sec so the theoretical output should be more.

 I would expect to have between 40-50mb/sec for each sata disk

 Can somebody confirm that he can reach this speed with a setup with 
 journals on the satas (with journals on ssd speed should be 100mb per disk)?
 or does ceph only give about ¼ of the speed for a disk? (and not the ½ 
 as expected because of journals)


Did you verify how much each machine is doing? It could be that the data is not 
distributed evenly and that on a certain machine the drives are doing 50MB/sec.

 My setup is 3 servers with: 2 x 2.6ghz xeons, 128gb ram 15 satas for 
 ceph (and ssds for system) 1 x 10gig for external traffic, 1 x 10gig 
 for osd traffic with reads I can saturate the network but writes is 
 far away. And I would expect at least to saturate the 10gig with 
 sequential writes also


Should be possible, but with 3 servers the data distribution might not be 
optimal causing a lower write performance.

I've seen 10Gbit write performance on multiple clusters without any problems.

 Thank you



 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



--
Wido den Hollander
Ceph consultant and trainer
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] why lock th whole osd handle thread

2014-07-03 Thread baijia...@126.com
When I look at the function OSD::OpWQ::_process, I find that the pg lock locks the whole 
function. So when I use multiple threads to write to the same object, must they 
serialize all the way from the osd handler thread to the journal write thread?



baijia...@126.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Multipart upload on ceph 0.8 doesn't work?

2014-07-03 Thread Patrycja Szabłowska
Hi,

I'm trying to make multipart upload work. I'm using ceph
0.80-702-g9bac31b (from ceph's github).

I've tried the code provided by Mark Kirkwood here:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-October/034940.html


But unfortunately, it gives me the error:

(multitest)pszablow@pat-desktop:~/$ python boto_multi.py
  begin upload of abc.yuv
  size 746496, 7 parts
Traceback (most recent call last):
  File "boto_multi.py", line 36, in <module>
    part = bucket.initiate_multipart_upload(objname)
  File "/home/pszablow/venvs/multitest/local/lib/python2.7/site-packages/boto/s3/bucket.py",
line 1742, in initiate_multipart_upload
    response.status, response.reason, body)
boto.exception.S3ResponseError: S3ResponseError: 403 Forbidden
<?xml version="1.0" encoding="UTF-8"?><Error><Code>AccessDenied</Code></Error>


The single-part upload works for me. I am able to create buckets and objects.
I've also tried other similar examples, but none of them work.


Any ideas what's wrong? Does ceph's multipart upload actually
work for anybody?


Thanks,

Patrycja Szabłowska
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Multipart upload on ceph 0.8 doesn't work?

2014-07-03 Thread Luis Periquito
I was looking at this issue this morning. It seems radosgw requires you to have a
pool named '' to work with multipart. I just created a pool with that name:
rados mkpool ''

either that, or allow the pool to be created by the radosgw...


On 3 July 2014 16:27, Patrycja Szabłowska szablowska.patry...@gmail.com
wrote:

 Hi,

 I'm trying to make multi part upload work. I'm using ceph
 0.80-702-g9bac31b (from the ceph's github).

 I've tried the code provided by Mark Kirkwood here:


 http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-October/034940.html


 But unfortunately, it gives me the error:

 (multitest)pszablow@pat-desktop:~/$ python boto_multi.py
   begin upload of abc.yuv
   size 746496, 7 parts
 Traceback (most recent call last):
   File boto_multi.py, line 36, in module
 part = bucket.initiate_multipart_upload(objname)
   File
 /home/pszablow/venvs/multitest/local/lib/python2.7/site-packages/boto/s3/bucket.py,
 line 1742, in initiate_multipart_upload
 response.status, response.reason, body)
 boto.exception.S3ResponseError: S3ResponseError: 403 Forbidden
 ?xml version=1.0
 encoding=UTF-8?ErrorCodeAccessDenied/Code/Error


 The single part upload works for me. I am able to create buckets and
 objects.
 I've tried also other similar examples, but none of them works.


 Any ideas what's wrong? Does the ceph's multi part upload actually
 work for anybody?


 Thanks,

 Patrycja Szabłowska
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




-- 

Luis Periquito

Unix Engineer

Ocado.com http://www.ocado.com/

Head Office, Titan Court, 3 Bishop Square, Hatfield Business Park,
Hatfield, Herts AL10 9NE

-- 


Notice:  This email is confidential and may contain copyright material of 
members of the Ocado Group. Opinions and views expressed in this message 
may not necessarily reflect the opinions and views of the members of the 
Ocado Group.

If you are not the intended recipient, please notify us immediately and 
delete all copies of this message. Please note that it is your 
responsibility to scan this message for viruses.  

References to the “Ocado Group” are to Ocado Group plc (registered in 
England and Wales with number 7098618) and its subsidiary undertakings (as 
that expression is defined in the Companies Act 2006) from time to time.  
The registered office of Ocado Group plc is Titan Court, 3 Bishops Square, 
Hatfield Business Park, Hatfield, Herts. AL10 9NE.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Some OSD and MDS crash

2014-07-03 Thread Pierre BLONDEAU

On 03/07/2014 13:49, Joao Eduardo Luis wrote:

On 07/03/2014 12:15 AM, Pierre BLONDEAU wrote:

On 03/07/2014 00:55, Samuel Just wrote:

Ah,

~/logs » for i in 20 23; do ../ceph/src/osdmaptool --export-crush
/tmp/crush$i osd-$i*; ../ceph/src/crushtool -d /tmp/crush$i 
/tmp/crush$i.d; done; diff /tmp/crush20.d /tmp/crush23.d
../ceph/src/osdmaptool: osdmap file
'osd-20_osdmap.13258__0_4E62BB79__none'
../ceph/src/osdmaptool: exported crush map to /tmp/crush20
../ceph/src/osdmaptool: osdmap file
'osd-23_osdmap.13258__0_4E62BB79__none'
../ceph/src/osdmaptool: exported crush map to /tmp/crush23
6d5
 tunable chooseleaf_vary_r 1

 Looks like the chooseleaf_vary_r tunable somehow ended up divergent?


The only thing that comes to mind that could cause this is if we changed
the leader's in-memory map, proposed it, it failed, and only the leader
got to write the map to disk somehow.  This happened once on a totally
different issue (although I can't pinpoint right now which).

In such a scenario, the leader would serve the incorrect osdmap to
whoever asked osdmaps from it, the remaining quorum would serve the
correct osdmaps to all the others.  This could cause this divergence. Or
it could be something else.

Are there logs for the monitors for the timeframe this may have happened
in?


Exactly which timeframe do you want? I have 7 days of logs; I should have 
information about the upgrade from firefly to 0.82.

Which mon's log do you want? All three?

Regards


   -Joao



Pierre: do you recall how and when that got set?


I am not sure to understand, but if I good remember after the update in
firefly, I was in state : HEALTH_WARN crush map has legacy tunables and
I see feature set mismatch in log.

So if I good remeber, i do : ceph osd crush tunables optimal for the
problem of crush map and I update my client and server kernel to
3.16rc.

It's could be that ?

Pierre


-Sam

On Wed, Jul 2, 2014 at 3:43 PM, Samuel Just sam.j...@inktank.com
wrote:

Yeah, divergent osdmaps:
555ed048e73024687fc8b106a570db4f  osd-20_osdmap.13258__0_4E62BB79__none
6037911f31dc3c18b05499d24dcdbe5c  osd-23_osdmap.13258__0_4E62BB79__none

Joao: thoughts?
-Sam

On Wed, Jul 2, 2014 at 3:39 PM, Pierre BLONDEAU
pierre.blond...@unicaen.fr wrote:

The files

When I upgrade :
  ceph-deploy install --stable firefly servers...
  on each servers service ceph restart mon
  on each servers service ceph restart osd
  on each servers service ceph restart mds

I upgraded from emperor to firefly. After repair, remap, replace,
etc ... I
have some PG which pass in peering state.

I thought why not try the version 0.82, it could solve my problem. (
It's my mistake ). So, I upgrade from firefly to 0.83 with :
  ceph-deploy install --testing servers...
  ..

Now, all programs are in version 0.82.
I have 3 mons, 36 OSD and 3 mds.

Pierre

PS : I find also inc\uosdmap.13258__0_469271DE__none on each meta
directory.

On 03/07/2014 00:10, Samuel Just wrote:


Also, what version did you upgrade from, and how did you upgrade?
-Sam

On Wed, Jul 2, 2014 at 3:09 PM, Samuel Just sam.j...@inktank.com
wrote:


Ok, in current/meta on osd 20 and osd 23, please attach all files
matching

^osdmap.13258.*

There should be one such file on each osd. (should look something
like
osdmap.6__0_FD6E4C01__none, probably hashed into a subdirectory,
you'll want to use find).

What version of ceph is running on your mons?  How many mons do
you have?
-Sam

On Wed, Jul 2, 2014 at 2:21 PM, Pierre BLONDEAU
pierre.blond...@unicaen.fr wrote:


Hi,

I do it, the log files are available here :
https://blondeau.users.greyc.fr/cephlog/debug20/

The OSD's files are really big +/- 80M .

After starting the osd.20 some other osd crash. I pass from 31
osd up to
16.
I remark that after this the number of down+peering PG decrease
from 367
to
248. It's normal ? May be it's temporary, the time that the
cluster
verifies all the PG ?

Regards
Pierre

On 02/07/2014 19:16, Samuel Just wrote:


You should add

debug osd = 20
debug filestore = 20
debug ms = 1

to the [osd] section of the ceph.conf and restart the osds.  I'd
like
all three logs if possible.

Thanks
-Sam

On Wed, Jul 2, 2014 at 5:03 AM, Pierre BLONDEAU
pierre.blond...@unicaen.fr wrote:



Yes, but how i do that ?

With a command like that ?

ceph tell osd.20 injectargs '--debug-osd 20 --debug-filestore 20
--debug-ms
1'

By modify the /etc/ceph/ceph.conf ? This file is really poor
because I
use
udev detection.

When I have made these changes, you want the three log files or
only
osd.20's ?

Thank you so much for the help

Regards
Pierre

On 01/07/2014 23:51, Samuel Just wrote:


Can you reproduce with
debug osd = 20
debug filestore = 20
debug ms = 1
?
-Sam

On Tue, Jul 1, 2014 at 1:21 AM, Pierre BLONDEAU
pierre.blond...@unicaen.fr wrote:




Hi,

I join :
 - osd.20 is one of osd that I detect which makes crash
other
OSD.
 - osd.23 is one of osd which crash when i start osd.20
 - mds, is one of my MDS

I cut log file because 

Re: [ceph-users] RGW performance test , put 30 thousands objects to one bucket, average latency 3 seconds

2014-07-03 Thread Gregory Farnum
It looks like you're just putting in data faster than your cluster can
handle (in terms of IOPS).
The first big hole (queue_op_wq -> reached_pg) is it sitting in a queue
and waiting for processing. The second parallel blocks are
1) write_thread_in_journal_buffer -> journaled_completion_queued, and
that is again a queue while it's waiting to be written to disk,
2) waiting for subops from [19,9] -> sub_op_commit_received (x2), which is
waiting for the replica OSDs to write the transaction to disk.

You might be able to tune it a little, but right now bucket indices
live in one object, so every write has to touch the same set of OSDs
(twice! to mark an object as putting, and then put). 2*30,000/360 = 166,
which is probably past what those disks can do, and artificially
increasing the latency.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Wed, Jul 2, 2014 at 11:24 PM, baijia...@126.com baijia...@126.com wrote:
 hi, everyone

 when I user rest bench testing RGW with cmd : rest-bench --access-key=ak
 --secret=sk  --bucket=bucket --seconds=360 -t 200  -b 524288  --no-cleanup
 write

 I found when RGW call the method bucket_prepare_op  is very slow. so I
 observed from 'dump_historic_ops',to see:
 { description: osd_op(client.4211.0:265984 .dir.default.4148.1 [call
 rgw.bucket_prepare_op] 3.b168f3d0 e37),
   received_at: 2014-07-03 11:07:02.465700,
   age: 308.315230,
   duration: 3.401743,
   type_data: [
 commit sent; apply or cleanup,
 { client: client.4211,
   tid: 265984},
 [
 { time: 2014-07-03 11:07:02.465852,
   event: waiting_for_osdmap},
 { time: 2014-07-03 11:07:02.465875,
   event: queue op_wq},
 { time: 2014-07-03 11:07:03.729087,
   event: reached_pg},
 { time: 2014-07-03 11:07:03.729120,
   event: started},
 { time: 2014-07-03 11:07:03.729126,
   event: started},
 { time: 2014-07-03 11:07:03.804366,
   event: waiting for subops from [19,9]},
 { time: 2014-07-03 11:07:03.804431,
   event: commit_queued_for_journal_write},
 { time: 2014-07-03 11:07:03.804509,
   event: write_thread_in_journal_buffer},
 { time: 2014-07-03 11:07:03.934419,
   event: journaled_completion_queued},
 { time: 2014-07-03 11:07:05.297282,
   event: sub_op_commit_rec},
 { time: 2014-07-03 11:07:05.297319,
   event: sub_op_commit_rec},
 { time: 2014-07-03 11:07:05.311217,
   event: op_applied},
 { time: 2014-07-03 11:07:05.867384,
   event: op_commit finish lock},
 { time: 2014-07-03 11:07:05.867385,
   event: op_commit},
 { time: 2014-07-03 11:07:05.867424,
   event: commit_sent},
 { time: 2014-07-03 11:07:05.867428,
   event: op_commit finish},
 { time: 2014-07-03 11:07:05.867443,
   event: done}]]}]}

 so I find 2 performance degradation. one is from queue op_wq to
 reached_pg , anothor is from journaled_completion_queued to op_commit.
 and I must stess that there are so many ops write to one bucket object, so
 how to reduce Latency ?


 
 baijia...@126.com

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Pools do not respond

2014-07-03 Thread Gregory Farnum
The PG in question isn't being properly mapped to any OSDs. There's a
good chance that those trees (with 3 OSDs in 2 hosts) aren't going to
map well anyway, but the immediate problem should resolve itself if
you change the choose to chooseleaf in your rules.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
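
A minimal sketch of that change (decompile the CRUSH map, edit the rules,
re-inject; the file names are placeholders):

  ceph osd getcrushmap -o crush.bin
  crushtool -d crush.bin -o crush.txt
  # in both rules change:  step choose firstn 0 type host
  #                  to:   step chooseleaf firstn 0 type host
  crushtool -c crush.txt -o crush.new
  ceph osd setcrushmap -i crush.new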


On Thu, Jul 3, 2014 at 4:17 AM, Iban Cabrillo cabri...@ifca.unican.es wrote:
 Hi folk,
   I am following step by step the test intallation, and checking some
 configuration before try to deploy a production cluster.

   Now I have a Health cluster with 3 mons + 4 OSDs.
   I have created a pool with belonging all osd.x and two more one for two
 servers o the other for the other two.

   The general pool work fine (I can create images and mount it on remote
 machines).

   But the other two does not work (the commands rados put, or rbd ls pool
 hangs for ever).

   this is the tree:

[ceph@cephadm ceph-cloud]$ sudo ceph osd tree
 # id weight type name up/down reweight
 -7 5.4 root 4x1GbFCnlSAS
 -3 2.7 host node04
 1 2.7 osd.1 up 1
 -4 2.7 host node03
 2 2.7 osd.2 up 1
 -6 8.1 root 4x4GbFCnlSAS
 -5 5.4 host node01
 3 2.7 osd.3 up 1
 4 2.7 osd.4 up 1
 -2 2.7 host node04
 0 2.7 osd.0 up 1
 -1 13.5 root default
 -2 2.7 host node04
 0 2.7 osd.0 up 1
 -3 2.7 host node04
 1 2.7 osd.1 up 1
 -4 2.7 host node03
 2 2.7 osd.2 up 1
 -5 5.4 host node01
 3 2.7 osd.3 up 1
 4 2.7 osd.4 up 1


 And this is the crushmap:

 ...
 root 4x4GbFCnlSAS {
 id -6 #do not change unnecessarily
 alg straw
 hash 0  # rjenkins1
 item node01 weight 5.400
 item node04 weight 2.700
 }
 root 4x1GbFCnlSAS {
 id -7 #do not change unnecessarily
 alg straw
 hash 0  # rjenkins1
 item node04 weight 2.700
 item node03 weight 2.700
 }
 # rules
 rule 4x4GbFCnlSAS {
 ruleset 1
 type replicated
 min_size 1
 max_size 10
 step take 4x4GbFCnlSAS
 step choose firstn 0 type host
 step emit
 }
 rule 4x1GbFCnlSAS {
 ruleset 2
 type replicated
 min_size 1
 max_size 10
 step take 4x1GbFCnlSAS
 step choose firstn 0 type host
 step emit
 }
 ..
 I of course set the crush_rules:
 sudo ceph osd pool set cloud-4x1GbFCnlSAS crush_ruleset 2
 sudo ceph osd pool set cloud-4x4GbFCnlSAS crush_ruleset 1

 but seems that are something wrong (4x4GbFCnlSAS.pool is 512MB file):
sudo rados -p cloud-4x1GbFCnlSAS put 4x4GbFCnlSAS.object
 4x4GbFCnlSAS.pool
 !!HANGS for eve!

 from the ceph-client happen the same
  rbd ls cloud-4x1GbFCnlSAS
  !!HANGS for eve!


 [root@cephadm ceph-cloud]# ceph osd map cloud-4x1GbFCnlSAS
 4x1GbFCnlSAS.object
 osdmap e49 pool 'cloud-4x1GbFCnlSAS' (3) object '4x1GbFCnlSAS.object' - pg
 3.114ae7a9 (3.29) - up ([], p-1) acting ([], p-1)

 Any idea what i am doing wrong??

 Thanks in advance, I
 Bertrand Russell:
 El problema con el mundo es que los estúpidos están seguros de todo y los
 inteligentes están llenos de dudas

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bypass Cache-Tiering for special reads (Backups)

2014-07-03 Thread Gregory Farnum
On Wed, Jul 2, 2014 at 3:06 PM, Marc m...@shoowin.de wrote:
 Hi,

 I was wondering, having a cache pool in front of an RBD pool is all fine
 and dandy, but imagine you want to pull backups of all your VMs (or one
 of them, or multiple...). Going to the cache for all those reads isn't
 only pointless, it'll also potentially fill up the cache and possibly
 evict actually frequently used data. Which got me thinking... wouldn't
 it be nifty if there was a special way of doing specific backup reads
 where you'd bypass the cache, ensuring the dirty cache contents get
 written to cold pool first? Or at least doing special reads where a
 cache-miss won't actually cache the requested data?

Yeah, these are nifty features but the cache coherency implications
are a bit difficult. More options will come as we are able to develop
and (more importantly, by far) validate them.
-Greg


 AFAIK the backup routine for an RBD-backed KVM usually involves creating
 a snapshot of the RBD and putting that into a backup storage/tape, all
 done via librbd/API.

 Maybe something like that even already exists?


 KR,
 Marc
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] why lock th whole osd handle thread

2014-07-03 Thread Gregory Farnum
On Thu, Jul 3, 2014 at 8:24 AM, baijia...@126.com baijia...@126.com wrote:
 when I see the function OSD::OpWQ::_process . I find pg lock locks the
 whole function. so when I  use multi-thread write the same object , so are
 they must
 serialize from osd handle thread to journal write thread ?

It's serialized while processing the write, but that doesn't include
the wait time for the data to be placed on disk — merely sequencing it
and feeding it into the journal queue. Writes have to be ordered, so
that's not likely to change.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW performance test , put 30 thousands objects to one bucket, average latency 3 seconds

2014-07-03 Thread baijia...@126.com
I find that the function OSD::OpWQ::_process uses the pg lock to lock the whole 
function, so this means that osd threads can't handle ops which write to the same 
object concurrently.
By adding logging to ReplicatedPG::op_commit, I find that the pg lock sometimes is held for a long 
time, but I don't know where the pg is locked.
Where is the pg locked for such a long time?

thanks



baijia...@126.com

From: Gregory Farnum
Date: 2014-07-04 01:02
To: baijia...@126.com
CC: ceph-users
Subject: Re: [ceph-users] RGW performance test , put 30 thousands objects to 
one bucket, average latency 3 seconds
It looks like you're just putting in data faster than your cluster can
handle (in terms of IOPS).
The first big hole (queue_op_wq-reached_pg) is it sitting in a queue
and waiting for processing. The second parallel blocks are
1) write_thread_in_journal_buffer-journaled_completion_queued, and
that is again a queue while it's waiting to be written to disk,
2) waiting for subops from [19,9]-sub_op_commit_received(x2) is
waiting for the replica OSDs to write the transaction to disk.

You might be able to tune it a little, but right now bucket indices
live in one object, so every write has to touch the same set of OSDs
(twice! to mark an object as putting, and put). 2*3/360 = 166,
which is probably past what those disks can do, and artificially
increasing the latency.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Wed, Jul 2, 2014 at 11:24 PM, baijia...@126.com baijia...@126.com wrote:
 hi, everyone

 when I user rest bench testing RGW with cmd : rest-bench --access-key=ak
 --secret=sk  --bucket=bucket --seconds=360 -t 200  -b 524288  --no-cleanup
 write

 I found when RGW call the method bucket_prepare_op  is very slow. so I
 observed from 'dump_historic_ops',to see:
 { description: osd_op(client.4211.0:265984 .dir.default.4148.1 [call
 rgw.bucket_prepare_op] 3.b168f3d0 e37),
   received_at: 2014-07-03 11:07:02.465700,
   age: 308.315230,
   duration: 3.401743,
   type_data: [
 commit sent; apply or cleanup,
 { client: client.4211,
   tid: 265984},
 [
 { time: 2014-07-03 11:07:02.465852,
   event: waiting_for_osdmap},
 { time: 2014-07-03 11:07:02.465875,
   event: queue op_wq},
 { time: 2014-07-03 11:07:03.729087,
   event: reached_pg},
 { time: 2014-07-03 11:07:03.729120,
   event: started},
 { time: 2014-07-03 11:07:03.729126,
   event: started},
 { time: 2014-07-03 11:07:03.804366,
   event: waiting for subops from [19,9]},
 { time: 2014-07-03 11:07:03.804431,
   event: commit_queued_for_journal_write},
 { time: 2014-07-03 11:07:03.804509,
   event: write_thread_in_journal_buffer},
 { time: 2014-07-03 11:07:03.934419,
   event: journaled_completion_queued},
 { time: 2014-07-03 11:07:05.297282,
   event: sub_op_commit_rec},
 { time: 2014-07-03 11:07:05.297319,
   event: sub_op_commit_rec},
 { time: 2014-07-03 11:07:05.311217,
   event: op_applied},
 { time: 2014-07-03 11:07:05.867384,
   event: op_commit finish lock},
 { time: 2014-07-03 11:07:05.867385,
   event: op_commit},
 { time: 2014-07-03 11:07:05.867424,
   event: commit_sent},
 { time: 2014-07-03 11:07:05.867428,
   event: op_commit finish},
 { time: 2014-07-03 11:07:05.867443,
   event: done}]]}]}

 so I find 2 performance degradation. one is from queue op_wq to
 reached_pg , anothor is from journaled_completion_queued to op_commit.
 and I must stess that there are so many ops write to one bucket object, so
 how to reduce Latency ?


 
 baijia...@126.com

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW performance test , put 30 thousands objects to one bucket, average latency 3 seconds

2014-07-03 Thread baijia...@126.com
I put the .rgw.buckets.index pool on SSD OSDs, so the bucket index object must be 
written to the SSDs, and disk utilization is less than 50%. So I don't think the 
disk is the bottleneck.




baijia...@126.com
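
For what it's worth, here is one way to confirm which OSDs actually serve that
index object and to look at its slowest ops (the osd id in the second command
is only an example; use the primary reported by the first):

ceph osd map .rgw.buckets.index .dir.default.4148.1
ceph --admin-daemon /var/run/ceph/ceph-osd.12.asok dump_historic_ops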

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Pools do not respond

2014-07-03 Thread Iban Cabrillo
Hi Gregory,
  Thanks a lot, I am beginning to understand how Ceph works.
  I added a couple of OSD servers and balanced the disks between them.

[ceph@cephadm ceph-cloud]$ sudo ceph osd tree
# id weight type name up/down reweight
-7 16.2 root 4x1GbFCnlSAS
-9 5.4 host node02
7 2.7 osd.7 up 1
8 2.7 osd.8 up 1
-4 5.4 host node03
2 2.7 osd.2 up 1
9 2.7 osd.9 up 1
-3 5.4 host node04
1 2.7 osd.1 up 1
10 2.7 osd.10 up 1
-6 16.2 root 4x4GbFCnlSAS
-5 5.4 host node01
3 2.7 osd.3 up 1
4 2.7 osd.4 up 1
-8 5.4 host node02
5 2.7 osd.5 up 1
6 2.7 osd.6 up 1
-2 5.4 host node04
0 2.7 osd.0 up 1
11 2.7 osd.11 up 1
-1 32.4 root default
-2 5.4 host node04
0 2.7 osd.0 up 1
11 2.7 osd.11 up 1
-3 5.4 host node04
1 2.7 osd.1 up 1
10 2.7 osd.10 up 1
-4 5.4 host node03
2 2.7 osd.2 up 1
9 2.7 osd.9 up 1
-5 5.4 host node01
3 2.7 osd.3 up 1
4 2.7 osd.4 up 1
-8 5.4 host node02
5 2.7 osd.5 up 1
6 2.7 osd.6 up 1
-9 5.4 host node02
7 2.7 osd.7 up 1
8 2.7 osd.8 up 1

The idea is to have at least 4 servers, with 3 disks (2.7 TB, SAN attached) per
server, per pool.
Now I have to adjust pg_num and pgp_num and run some performance tests.
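
For example, with 6 OSDs under each of those roots and the default of 3
replicas, the usual rule of thumb (OSDs x 100 / replicas, rounded up to a
power of two) would suggest something on the order of 256 PGs per pool;
pg_num has to be raised first, pgp_num afterwards:

sudo ceph osd pool set cloud-4x1GbFCnlSAS pg_num 256
sudo ceph osd pool set cloud-4x1GbFCnlSAS pgp_num 256
sudo ceph osd pool set cloud-4x4GbFCnlSAS pg_num 256
sudo ceph osd pool set cloud-4x4GbFCnlSAS pgp_num 256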

PS: what is the difference between choose and chooseleaf?

Thanks a lot!


2014-07-03 19:06 GMT+02:00 Gregory Farnum g...@inktank.com:

 The PG in question isn't being properly mapped to any OSDs. There's a
 good chance that those trees (with 3 OSDs in 2 hosts) aren't going to
 map well anyway, but the immediate problem should resolve itself if
 you change the choose to chooseleaf in your rules.
 -Greg
 Software Engineer #42 @ http://inktank.com | http://ceph.com


 On Thu, Jul 3, 2014 at 4:17 AM, Iban Cabrillo cabri...@ifca.unican.es
 wrote:
  Hi folk,
   I am following the test installation step by step, and checking some
  configuration before trying to deploy a production cluster.
 
Now I have a Health cluster with 3 mons + 4 OSDs.
   I have created a pool containing all the osd.x, and two more: one for two of
  the servers and the other for the other two.
 
   The general pool works fine (I can create images and mount them on remote
  machines).
 
   But the other two do not work (the commands rados put or rbd ls pool
  hang forever).
 
this is the tree:
 
 [ceph@cephadm ceph-cloud]$ sudo ceph osd tree
  # id weight type name up/down reweight
  -7 5.4 root 4x1GbFCnlSAS
  -3 2.7 host node04
  1 2.7 osd.1 up 1
  -4 2.7 host node03
  2 2.7 osd.2 up 1
  -6 8.1 root 4x4GbFCnlSAS
  -5 5.4 host node01
  3 2.7 osd.3 up 1
  4 2.7 osd.4 up 1
  -2 2.7 host node04
  0 2.7 osd.0 up 1
  -1 13.5 root default
  -2 2.7 host node04
  0 2.7 osd.0 up 1
  -3 2.7 host node04
  1 2.7 osd.1 up 1
  -4 2.7 host node03
  2 2.7 osd.2 up 1
  -5 5.4 host node01
  3 2.7 osd.3 up 1
  4 2.7 osd.4 up 1
 
 
  And this is the crushmap:
 
  ...
  root 4x4GbFCnlSAS {
  id -6 #do not change unnecessarily
  alg straw
  hash 0  # rjenkins1
  item node01 weight 5.400
  item node04 weight 2.700
  }
  root 4x1GbFCnlSAS {
  id -7 #do not change unnecessarily
  alg straw
  hash 0  # rjenkins1
  item node04 weight 2.700
  item node03 weight 2.700
  }
  # rules
  rule 4x4GbFCnlSAS {
  ruleset 1
  type replicated
  min_size 1
  max_size 10
  step take 4x4GbFCnlSAS
  step choose firstn 0 type host
  step emit
  }
  rule 4x1GbFCnlSAS {
  ruleset 2
  type replicated
  min_size 1
  max_size 10
  step take 4x1GbFCnlSAS
  step choose firstn 0 type host
  step emit
  }
  ..
  I of course set the crush_rules:
  sudo ceph osd pool set cloud-4x1GbFCnlSAS crush_ruleset 2
  sudo ceph osd pool set cloud-4x4GbFCnlSAS crush_ruleset 1
 
  but it seems something is wrong (4x4GbFCnlSAS.pool is a 512 MB file):
 sudo rados -p cloud-4x1GbFCnlSAS put 4x4GbFCnlSAS.object
  4x4GbFCnlSAS.pool
   !!HANGS for ever!
 
   from the ceph-client the same thing happens
   rbd ls cloud-4x1GbFCnlSAS
    !!HANGS for ever!
 
 
  [root@cephadm ceph-cloud]# ceph osd map cloud-4x1GbFCnlSAS
  4x1GbFCnlSAS.object
  osdmap e49 pool 'cloud-4x1GbFCnlSAS' (3) object '4x1GbFCnlSAS.object' -> pg
  3.114ae7a9 (3.29) -> up ([], p-1) acting ([], p-1)
 
  Any idea what i am doing wrong??
 
  Thanks in advance, I
  Bertrand Russell:
  The problem with the world is that the stupid 

Re: [ceph-users] Pools do not respond

2014-07-03 Thread Gregory Farnum
On Thu, Jul 3, 2014 at 11:17 AM, Iban Cabrillo cabri...@ifca.unican.es wrote:
 Hi Gregory,
   Thanks a lot, I am beginning to understand how Ceph works.
   I added a couple of OSD servers and balanced the disks between them.



 The idea is to have at least 4 servers, with 3 disks (2.7 TB, SAN attached) per
 server, per pool.
 Now I have to adjust pg_num and pgp_num and run some performance tests.

 PS: what is the difference between choose and chooseleaf?

choose instructs the system to choose N different buckets of the
given type (where N is specified by the firstn 0 block to be the
replication level, but could be 1: firstn 1, or replication - 1:
firstn -1). Since you're saying choose firstn 0 type host, that's
what you're getting out, and then you're emitting those 3 (by default)
hosts. But they aren't valid devices (OSDs), so it's not a valid
mapping; you're supposed to then say choose firstn 1 device or
similar.
chooseleaf instead tells the system to choose N different buckets,
and then descend from each of those buckets to a leaf (device) in
the CRUSH hierarchy. It's a little more robust against different
mappings and failure conditions, so generally a better choice than
choose if you don't need the finer granularity provided by choose.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
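
Applied to the 4x1GbFCnlSAS rule posted earlier in this thread, only the
choose step changes (a sketch; the 4x4GbFCnlSAS rule would be edited the
same way):

rule 4x1GbFCnlSAS {
        ruleset 2
        type replicated
        min_size 1
        max_size 10
        step take 4x1GbFCnlSAS
        step chooseleaf firstn 0 type host
        step emit
}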
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Some OSD and MDS crash

2014-07-03 Thread Joao Luis
Do those logs have a higher debugging level than the default? If not, never
mind, as they will not have enough information. If they do, however,
we'd be interested in the portion around the moment you set the tunables.
Say, before the upgrade and a bit after you set the tunable. If you want to
be finer grained, then ideally it would be the moment when those maps were
created, but you'd have to grep the logs for that.

Or drop the logs somewhere and I'll take a look.

  -Joao
On Jul 3, 2014 5:48 PM, Pierre BLONDEAU pierre.blond...@unicaen.fr
wrote:

 On 03/07/2014 13:49, Joao Eduardo Luis wrote:

 On 07/03/2014 12:15 AM, Pierre BLONDEAU wrote:

  On 03/07/2014 00:55, Samuel Just wrote:

 Ah,

 ~/logs » for i in 20 23; do ../ceph/src/osdmaptool --export-crush
 /tmp/crush$i osd-$i*; ../ceph/src/crushtool -d /tmp/crush$i 
 /tmp/crush$i.d; done; diff /tmp/crush20.d /tmp/crush23.d
 ../ceph/src/osdmaptool: osdmap file
 'osd-20_osdmap.13258__0_4E62BB79__none'
 ../ceph/src/osdmaptool: exported crush map to /tmp/crush20
 ../ceph/src/osdmaptool: osdmap file
 'osd-23_osdmap.13258__0_4E62BB79__none'
 ../ceph/src/osdmaptool: exported crush map to /tmp/crush23
 6d5
  tunable chooseleaf_vary_r 1

  Looks like the chooseleaf_vary_r tunable somehow ended up divergent?


 The only thing that comes to mind that could cause this is if we changed
 the leader's in-memory map, proposed it, it failed, and only the leader
 got to write the map to disk somehow.  This happened once on a totally
 different issue (although I can't pinpoint right now which).

 In such a scenario, the leader would serve the incorrect osdmap to
 whoever asked osdmaps from it, the remaining quorum would serve the
 correct osdmaps to all the others.  This could cause this divergence. Or
 it could be something else.

 Are there logs for the monitors for the timeframe this may have happened
 in?


  Exactly which timeframe do you want? I have 7 days of logs; I should have
  information about the upgrade from firefly to 0.82.
  Which mon's logs do you want? All three?

 Regards

 -Joao


 Pierre: do you recall how and when that got set?


  I am not sure I understand, but if I remember correctly, after the update
  to firefly I was in the state HEALTH_WARN crush map has legacy tunables and
  I saw feature set mismatch in the logs.

  So, if I remember correctly, I ran ceph osd crush tunables optimal for the
  crush map problem and I updated my client and server kernels to
  3.16rc.

  Could that be it?

 Pierre

  -Sam

 On Wed, Jul 2, 2014 at 3:43 PM, Samuel Just sam.j...@inktank.com
 wrote:

 Yeah, divergent osdmaps:
 555ed048e73024687fc8b106a570db4f  osd-20_osdmap.13258__0_
 4E62BB79__none
 6037911f31dc3c18b05499d24dcdbe5c  osd-23_osdmap.13258__0_
 4E62BB79__none

 Joao: thoughts?
 -Sam

 On Wed, Jul 2, 2014 at 3:39 PM, Pierre BLONDEAU
 pierre.blond...@unicaen.fr wrote:

 The files

  When I upgraded:
    ceph-deploy install --stable firefly servers...
    on each server: service ceph restart mon
    on each server: service ceph restart osd
    on each server: service ceph restart mds

  I upgraded from emperor to firefly. After repair, remap, replace,
  etc. I have some PGs that stay in the peering state.

  I thought: why not try version 0.82, it could solve my problem
  (that was my mistake). So I upgraded from firefly to 0.82 with:
    ceph-deploy install --testing servers...
    ..

 Now, all programs are in version 0.82.
 I have 3 mons, 36 OSD and 3 mds.

 Pierre

  PS: I also find inc\uosdmap.13258__0_469271DE__none in each meta
  directory.

  On 03/07/2014 00:10, Samuel Just wrote:

  Also, what version did you upgrade from, and how did you upgrade?
 -Sam

 On Wed, Jul 2, 2014 at 3:09 PM, Samuel Just sam.j...@inktank.com
 wrote:


 Ok, in current/meta on osd 20 and osd 23, please attach all files
 matching

 ^osdmap.13258.*

 There should be one such file on each osd. (should look something
 like
 osdmap.6__0_FD6E4C01__none, probably hashed into a subdirectory,
 you'll want to use find).
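
  For example, assuming the default OSD data path, something like:

  find /var/lib/ceph/osd/ceph-20/current/meta -name 'osdmap.13258*'
  find /var/lib/ceph/osd/ceph-23/current/meta -name 'osdmap.13258*'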

 What version of ceph is running on your mons?  How many mons do
 you have?
 -Sam

 On Wed, Jul 2, 2014 at 2:21 PM, Pierre BLONDEAU
 pierre.blond...@unicaen.fr wrote:


 Hi,

  I did it; the log files are available here:
  https://blondeau.users.greyc.fr/cephlog/debug20/

  The OSDs' log files are really big, +/- 80 MB.

  After starting osd.20, some other OSDs crashed. The number of OSDs up
  went from 31 to 16.
  I noticed that after this the number of down+peering PGs decreased
  from 367 to 248. Is that normal? Maybe it's temporary, just the time
  the cluster needs to verify all the PGs?

 Regards
 Pierre

  On 02/07/2014 19:16, Samuel Just wrote:

  You should add

 debug osd = 20
 debug filestore = 20
 debug ms = 1

 to the [osd] section of the ceph.conf and restart the osds.  I'd
 like
 all three logs if possible.

 Thanks
 -Sam

 On Wed, Jul 2, 2014 at 5:03 AM, Pierre BLONDEAU
 pierre.blond...@unicaen.fr wrote:



  Yes, but how do I do that?

  With a command like this?

 ceph tell osd.20 injectargs '--debug-osd 20 
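
  The full command, with the three debug options Sam listed, would presumably
  look like:

  ceph tell osd.20 injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1'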

[ceph-users] mon: leveldb checksum mismatch

2014-07-03 Thread Jason Harley
Hi list —

I’ve got a small dev. cluster: 3 OSD nodes with 6 disks/OSDs each and a single 
monitor (this, it seems, was my mistake).  The monitor node went down hard and 
it looks like the monitor’s db is in a funny state.  Running ‘ceph-mon’ 
manually with ‘debug_mon 20’ and ‘debug_ms 20’ gave the following:

 /usr/bin/ceph-mon -i monhost --mon-data /var/lib/ceph/mon/ceph-monhost 
 --debug_mon 20 --debug_ms 20 -d
 2014-07-03 23:20:55.800512 7f973918e7c0  0 ceph version 0.67.7 
 (d7ab4244396b57aac8b7e80812115bbd079e6b73), process ceph-mon, pid 24930
 Corruption: checksum mismatch
 Corruption: checksum mismatch
 2014-07-03 23:20:56.455797 7f973918e7c0 -1 failed to create new leveldb store

I attempted to make use of the leveldb Python library’s ‘RepairDB’ function, 
which just moves enough files into ‘lost’ that when running the monitor again 
I’m asked if I ran mkcephfs.

Any insight into resolving these two checksum mismatches so I can access my OSD 
data would be greatly appreciated.

Thanks,
./JRH

p.s. I’m assuming that without the maps from the monitor, my OSD data is 
unrecoverable also.

  
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mon: leveldb checksum mismatch

2014-07-03 Thread Joao Eduardo Luis

On 07/04/2014 12:29 AM, Jason Harley wrote:



Hello Jason,

We don't have a way to repair leveldb.  Having multiple monitors usually 
helps with such tricky situations.


According to this [1], the python bindings you're using may not be linked 
against snappy, which we were using (mistakenly until recently) to compress 
data as it goes into leveldb.  Not having those snappy bindings may be 
what's causing all those files to be moved to lost instead.


The suggestion that the thread in [1] offers is to have the repair 
functionality directly in the 'application' itself.  We could do this by 
adding a repair option to ceph-kvstore-tool -- which could help.


I'll be happy to get that into ceph-kvstore-tool tomorrow and push a 
branch for you to compile and test.


  -Joao


[1] - https://groups.google.com/forum/#!topic/leveldb/YvszWNio2-Q

--
Joao Eduardo Luis
Software Engineer | http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mon: leveldb checksum mismatch

2014-07-03 Thread Jason Harley
Hi Joao,

On Jul 3, 2014, at 7:57 PM, Joao Eduardo Luis joao.l...@inktank.com wrote:

 We don't have a way to repair leveldb.  Having multiple monitors usually helps 
 with such tricky situations.

I know this, but for this small dev cluster I wasn’t thinking about corruption 
of my mon’s backing store.  Silly me :)

 
 According to this [1], the python bindings you're using may not be linked against 
 snappy, which we were using (mistakenly until recently) to compress data as 
 it goes into leveldb.  Not having those snappy bindings may be what's causing 
 all those files to be moved to lost instead.

I found the same posting, and confirmed that the 'leveldb.so' that ships with 
the 'python-leveldb' package on Ubuntu 13.10 links against 'snappy'.

 The suggestion that the thread in [1] offers is to have the repair 
 functionality directly in the 'application' itself.  We could do this by 
 adding a repair option to ceph-kvstore-tool -- which could help.
 
 I'll be happy to get that into ceph-kvstore-tool tomorrow and push a branch 
 for you to compile and test.

I would be more than happy to try this out.  Without fixing these checksums, I 
think I’m reinitializing my cluster. :\

Thank you,
./JRH
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com