Re: [ceph-users] Persistent Write Back Cache
Hi Nick, Christian,

This is something we've discussed a bit but hasn't made it to the top of the list. I think having a single persistent copy on the client has *some* value, although it's limited because it's a single point of failure. The simplest scenario would be to use it as a write-through cache that accelerates reads only. Another option would be to have a shared but local device (like an SSD that is connected to a pair of client hosts, or has fast access within a rack--a scenario that I've heard a few vendors talk about). It still leaves a host pair or rack as a failure zone, but there are times where that's appropriate.

In either case, though, I think the key RBD feature that would make it much more valuable would be if RBD (librbd presumably) could maintain the writeback cache with some sort of checkpoints or journal internally, such that writes that get flushed back to the cluster are always *crash consistent*. So even if you lose the client cache entirely, your disk image still holds a valid file system that just looks a little bit stale. If the client-side writeback cache were structured as a data journal this would be pretty straightforward... it might even mesh well with the RBD mirroring?

sage

On Wed, 4 Mar 2015, Nick Fisk wrote:

Hi Christian,

Yes that's correct, it's on the client side. I don't see this as much different from a battery-backed RAID controller: if you lose power, the data stays in the cache until power resumes, when it is flushed. If you are going to have the same RBD accessed by multiple servers/clients then you need to make sure the SSD is accessible to both (e.g. DRBD / dual-port SAS), but then something like Pacemaker would be responsible for ensuring the RBD and the cache device are both present before allowing client access. When I wrote this I was thinking more about 2 HA iSCSI servers with RBDs; I can understand that this feature would prove more of a challenge if you are using Qemu and RBD.

Nick

-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Christian Balzer
Sent: 04 March 2015 08:40
To: ceph-users@lists.ceph.com
Cc: Nick Fisk
Subject: Re: [ceph-users] Persistent Write Back Cache

Hello,

If I understand you correctly, you're talking about the rbd cache on the client side. So assume that host, or the cache SSD in it, fails terminally. The client thinks its sync'ed writes are on the permanent storage (the actual Ceph storage cluster), while they are only present locally. Restarting that service or VM on a different host now has to deal with likely crippling data corruption.

Regards,

Christian

On Wed, 4 Mar 2015 08:26:52 - Nick Fisk wrote:

Hi All,

Is there anything in the pipeline to add the ability to write the librbd cache to SSD so that it can safely ignore sync requests? I have seen a thread a few years back where Sage was discussing something similar, but I can't find anything more recent discussing it.

I've been running lots of tests on our new cluster; buffered/parallel performance is amazing (40K read / 10K write iops), very impressed. However sync writes are actually quite disappointing. Running fio with a 128k block size and depth=1 normally only gives me about 300 iops or 30MB/s. I'm seeing 2-3ms latency writing to SSD OSDs, and from what I hear that's about normal, so I don't think I have a Ceph config problem. For applications which do a lot of syncs, like ESXi over iSCSI or SQL databases, this has a major performance impact.
Traditional storage arrays work around this problem by having a battery-backed cache whose latency is 10-100 times lower than what you can currently achieve with Ceph and an SSD. Whilst librbd does have a writeback cache, from what I understand it will not cache syncs, so in my use case it effectively acts like a write-through cache.

To illustrate the difference a proper write-back cache can make, I put a 1GB (512MB dirty threshold) flashcache in front of my RBD and tweaked the flush parameters to flush dirty blocks at a large queue depth. The same fio test (128k, iodepth=1) now runs at 120MB/s and is limited by the performance of the SSD used by flashcache, as everything is stored as 4k blocks on the SSD. In fact, since everything is stored as 4k blocks, pretty much all IO sizes are accelerated to the max speed of the SSD. Looking at iostat I can see all the IOs are getting coalesced into nice large 512kb IOs at a high queue depth, which Ceph easily swallows.

If librbd could support writing its cache out to SSD it would hopefully achieve the same level of performance, and having it integrated would be really neat.

Nick

--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine
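[Editor's note: for anyone wanting to reproduce the sync-write numbers Nick quotes, a fio job along these lines should do it. This is only a sketch; the device path, runtime and job name are assumptions, not taken from the original mail.]

  ; rbd-sync-write.fio -- hypothetical job file
  [rbd-sync-write]
  ioengine=libaio        ; or fio's rbd engine (pool=/rbdname=) on builds that have it
  direct=1
  sync=1                 ; O_SYNC: every write must be acknowledged as stable
  rw=write
  bs=128k
  iodepth=1
  runtime=60
  filename=/dev/rbd0     ; assumed krbd mapping; adjust to your device

Run with "fio rbd-sync-write.fio"; with sync=1 and iodepth=1 the result is dominated by per-write round-trip latency, which is the behaviour being discussed above.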
Re: [ceph-users] Perf problem after upgrade from dumpling to firefly
Yes, good idea. I was looking at the «WBThrottle» feature, but I'll go for logging instead.

Le mercredi 04 mars 2015 à 17:10 +0100, Alexandre DERUMIER a écrit :

Only writes ;)

OK, so maybe some background operations (snap trimming, scrubbing...). Maybe debug_osd=20 could give you more logs?

- Mail original -
De: Olivier Bonvalet ceph.l...@daevel.fr
À: aderumier aderum...@odiso.com
Cc: ceph-users ceph-users@lists.ceph.com
Envoyé: Mercredi 4 Mars 2015 16:42:13
Objet: Re: [ceph-users] Perf problem after upgrade from dumpling to firefly

Only writes ;)

Le mercredi 04 mars 2015 à 16:19 +0100, Alexandre DERUMIER a écrit :

The change is only on the OSDs (and not on the OSD journals).

Do you see twice the iops for both read and write? If only reads, maybe a readahead bug could explain this.

- Mail original -
De: Olivier Bonvalet ceph.l...@daevel.fr
À: aderumier aderum...@odiso.com
Cc: ceph-users ceph-users@lists.ceph.com
Envoyé: Mercredi 4 Mars 2015 15:13:30
Objet: Re: [ceph-users] Perf problem after upgrade from dumpling to firefly

Ceph health is OK, yes. The «firefly-upgrade-cluster-IO.png» graph shows the IO stats seen by Ceph: there is no change between dumpling and firefly. The change is only on the OSDs (and not on the OSD journals).

Le mercredi 04 mars 2015 à 15:05 +0100, Alexandre DERUMIER a écrit :

The load problem is permanent: I have twice the IO/s on HDD since firefly.

Oh, permanent, that's strange. (If you don't see more traffic coming from clients, I don't understand...) Do you also see twice the ios/ops in "ceph -w" stats? Is the ceph health OK?

- Mail original -
De: Olivier Bonvalet ceph.l...@daevel.fr
À: aderumier aderum...@odiso.com
Cc: ceph-users ceph-users@lists.ceph.com
Envoyé: Mercredi 4 Mars 2015 14:49:41
Objet: Re: [ceph-users] Perf problem after upgrade from dumpling to firefly

Thanks Alexandre. The load problem is permanent: I have twice the IO/s on HDD since firefly. And yes, the problem hangs production at night during snap trimming. I suppose there is a new OSD parameter which changes the behaviour of the journal, or something like that, but I didn't find anything about that.

Olivier

Le mercredi 04 mars 2015 à 14:44 +0100, Alexandre DERUMIER a écrit :

Hi, maybe this is related?:

http://tracker.ceph.com/issues/9503
Dumpling: removing many snapshots in a short time makes OSDs go berserk

http://tracker.ceph.com/issues/9487
dumpling: snaptrimmer causes slow requests while backfilling. osd_snap_trim_sleep not helping

http://lists.opennebula.org/pipermail/ceph-users-ceph.com/2014-December/045116.html

I think it's already backported in dumpling, not sure it's already done for firefly.

Alexandre

- Mail original -
De: Olivier Bonvalet ceph.l...@daevel.fr
À: ceph-users ceph-users@lists.ceph.com
Envoyé: Mercredi 4 Mars 2015 12:10:30
Objet: [ceph-users] Perf problem after upgrade from dumpling to firefly

Hi,

Last Saturday I upgraded my production cluster from dumpling to emperor (since we were successfully using it on a test cluster). A couple of hours later, we had failing OSDs: some of them were marked down by Ceph, probably because of IO starvation. I set the cluster to «noout», started the downed OSDs, then let them recover. 24h later, same problem (at nearly the same hour). So I chose to upgrade directly to firefly, which is maintained. Things are better, but the cluster is slower than with dumpling.
The main problem seems to be that the OSDs do twice as many write operations per second:
https://daevel.fr/img/firefly/firefly-upgrade-OSD70-IO.png
https://daevel.fr/img/firefly/firefly-upgrade-OSD71-IO.png

But the journal doesn't change (SSD dedicated to OSD 70+71+72):
https://daevel.fr/img/firefly/firefly-upgrade-OSD70+71-journal.png

Neither does node bandwidth:
https://daevel.fr/img/firefly/firefly-upgrade-dragan-bandwidth.png

Or whole-cluster IO activity:
https://daevel.fr/img/firefly/firefly-upgrade-cluster-IO.png

Some background: the cluster is split into pools with «full SSD» OSDs and «HDD+SSD journal» OSDs. Only the «HDD+SSD» OSDs seem to be affected. I have 9 OSDs per «HDD+SSD» node (9 HDD and 3 SSD), and only 3 «HDD+SSD» nodes (so a total of 27 «HDD+SSD» OSDs).

The IO peak between 03h00 and 09h00 corresponds to snapshot rotation (= «rbd snap rm» operations). osd_snap_trim_sleep has been set to 0.8 for months. Yesterday I tried to reduce osd_pg_max_concurrent_snap_trims to 1; it doesn't seem to really help. The only thing which seems to help is to reduce osd_disk_threads from 8 to 1.
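[Editor's note: the settings discussed in this thread can be changed at runtime without restarting the OSDs. A sketch only -- the values are the ones mentioned above, not a recommendation, and osd.70 is used purely because it is the OSD shown in the graphs:]

  # raise the OSD debug level temporarily to capture the extra write activity
  ceph tell osd.70 injectargs '--debug-osd 20'

  # throttle snapshot trimming across all OSDs
  ceph tell osd.* injectargs '--osd-snap-trim-sleep 0.8 --osd-pg-max-concurrent-snap-trims 1'

  # revert the debug level once the logs have been collected
  ceph tell osd.70 injectargs '--debug-osd 0/5'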
Re: [ceph-users] Implement replication network with live cluster
If I remember right, someone has done this on a live cluster without any issues. I seem to remember that there is a fallback mechanism: if the OSDs can't be reached on the cluster network, they are contacted on the public network. You could test it pretty easily without much impact: take one OSD that has both networks, configure it, and restart the process. If all the nodes (specifically the old ones with only one network) are able to connect to it, then you are good to go by restarting one OSD at a time.

On Wed, Mar 4, 2015 at 4:17 AM, Andrija Panic andrija.pa...@gmail.com wrote:

Hi,

I'm running a live cluster with only a public network (so no explicit network configuration in the ceph.conf file). I'm wondering what the procedure is to implement a dedicated replication/private network alongside the public one. I've read the manual and know how to do it in ceph.conf, but since this is an already running cluster - what should I do after I change ceph.conf on all nodes? Restart OSDs one by one, or...? Is there any downtime expected before the replication network is completely implemented?

Another related question: I'm also demoting some old OSDs, on old servers. I will have them all stopped, but I would like to implement the replication network before actually removing the old OSDs from the crush map, since a lot of data will be moved around. My old nodes/OSDs (which will be stopped before I implement the replication network) do NOT have a dedicated NIC for the replication network, in contrast to the new nodes/OSDs. So there will still be references to these old OSDs in the crush map. Will it be a problem that the replication network WILL work on the new nodes/OSDs, but not on the old ones, since they don't have a dedicated NIC? I guess not, since the old OSDs are stopped anyway, but I would like an opinion. Or perhaps I should remove the OSDs from the crush map first, with nobackfill and norecover set (so no rebalancing happens), and then implement the replication network?

Sorry for the long post, but... Thanks,

--
Andrija Panić

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
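[Editor's note: for reference, the change being discussed is just the two network directives in ceph.conf, pushed to every node before the rolling OSD restarts. A sketch -- the subnets below are placeholders, not Andrija's actual addressing:]

  [global]
      # existing client-facing network
      public network  = 192.168.1.0/24
      # new dedicated replication/heartbeat network
      cluster network = 10.10.10.0/24

OSDs only pick up the cluster address when their daemon restarts, so restarting them one at a time keeps the cluster available throughout.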
Re: [ceph-users] Perf problem after upgrade from dumpling to firefly
Only writes ;)

Le mercredi 04 mars 2015 à 16:19 +0100, Alexandre DERUMIER a écrit :

The change is only on the OSDs (and not on the OSD journals).

Do you see twice the iops for both read and write? If only reads, maybe a readahead bug could explain this.

- Mail original -
De: Olivier Bonvalet ceph.l...@daevel.fr
À: aderumier aderum...@odiso.com
Cc: ceph-users ceph-users@lists.ceph.com
Envoyé: Mercredi 4 Mars 2015 15:13:30
Objet: Re: [ceph-users] Perf problem after upgrade from dumpling to firefly

Ceph health is OK, yes. The «firefly-upgrade-cluster-IO.png» graph shows the IO stats seen by Ceph: there is no change between dumpling and firefly. The change is only on the OSDs (and not on the OSD journals).

Le mercredi 04 mars 2015 à 15:05 +0100, Alexandre DERUMIER a écrit :

The load problem is permanent: I have twice the IO/s on HDD since firefly.

Oh, permanent, that's strange. (If you don't see more traffic coming from clients, I don't understand...) Do you also see twice the ios/ops in "ceph -w" stats? Is the ceph health OK?

- Mail original -
De: Olivier Bonvalet ceph.l...@daevel.fr
À: aderumier aderum...@odiso.com
Cc: ceph-users ceph-users@lists.ceph.com
Envoyé: Mercredi 4 Mars 2015 14:49:41
Objet: Re: [ceph-users] Perf problem after upgrade from dumpling to firefly

Thanks Alexandre. The load problem is permanent: I have twice the IO/s on HDD since firefly. And yes, the problem hangs production at night during snap trimming. I suppose there is a new OSD parameter which changes the behaviour of the journal, or something like that, but I didn't find anything about that.

Olivier

Le mercredi 04 mars 2015 à 14:44 +0100, Alexandre DERUMIER a écrit :

Hi, maybe this is related?:

http://tracker.ceph.com/issues/9503
Dumpling: removing many snapshots in a short time makes OSDs go berserk

http://tracker.ceph.com/issues/9487
dumpling: snaptrimmer causes slow requests while backfilling. osd_snap_trim_sleep not helping

http://lists.opennebula.org/pipermail/ceph-users-ceph.com/2014-December/045116.html

I think it's already backported in dumpling, not sure it's already done for firefly.

Alexandre

- Mail original -
De: Olivier Bonvalet ceph.l...@daevel.fr
À: ceph-users ceph-users@lists.ceph.com
Envoyé: Mercredi 4 Mars 2015 12:10:30
Objet: [ceph-users] Perf problem after upgrade from dumpling to firefly

Hi,

Last Saturday I upgraded my production cluster from dumpling to emperor (since we were successfully using it on a test cluster). A couple of hours later, we had failing OSDs: some of them were marked down by Ceph, probably because of IO starvation. I set the cluster to «noout», started the downed OSDs, then let them recover. 24h later, same problem (at nearly the same hour). So I chose to upgrade directly to firefly, which is maintained. Things are better, but the cluster is slower than with dumpling.

The main problem seems to be that the OSDs do twice as many write operations per second:
https://daevel.fr/img/firefly/firefly-upgrade-OSD70-IO.png
https://daevel.fr/img/firefly/firefly-upgrade-OSD71-IO.png

But the journal doesn't change (SSD dedicated to OSD 70+71+72):
https://daevel.fr/img/firefly/firefly-upgrade-OSD70+71-journal.png

Neither does node bandwidth:
https://daevel.fr/img/firefly/firefly-upgrade-dragan-bandwidth.png

Or whole-cluster IO activity:
https://daevel.fr/img/firefly/firefly-upgrade-cluster-IO.png

Some background: the cluster is split into pools with «full SSD» OSDs and «HDD+SSD journal» OSDs. Only the «HDD+SSD» OSDs seem to be affected. I have 9 OSDs per «HDD+SSD» node (9 HDD and 3 SSD), and only 3 «HDD+SSD» nodes (so a total of 27 «HDD+SSD» OSDs).
The IO peak between 03h00 and 09h00 corresponds to snapshot rotation (= «rbd snap rm» operations). osd_snap_trim_sleep has been set to 0.8 for months. Yesterday I tried to reduce osd_pg_max_concurrent_snap_trims to 1; it doesn't seem to really help. The only thing which seems to help is to reduce osd_disk_threads from 8 to 1.

So. Any idea about what's happening? Thanks for any help,

Olivier

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Perf problem after upgrade from dumpling to firefly
Only writes ;)

OK, so maybe some background operations (snap trimming, scrubbing...). Maybe debug_osd=20 could give you more logs?

- Mail original -
De: Olivier Bonvalet ceph.l...@daevel.fr
À: aderumier aderum...@odiso.com
Cc: ceph-users ceph-users@lists.ceph.com
Envoyé: Mercredi 4 Mars 2015 16:42:13
Objet: Re: [ceph-users] Perf problem after upgrade from dumpling to firefly

Only writes ;)

Le mercredi 04 mars 2015 à 16:19 +0100, Alexandre DERUMIER a écrit :

The change is only on the OSDs (and not on the OSD journals).

Do you see twice the iops for both read and write? If only reads, maybe a readahead bug could explain this.

- Mail original -
De: Olivier Bonvalet ceph.l...@daevel.fr
À: aderumier aderum...@odiso.com
Cc: ceph-users ceph-users@lists.ceph.com
Envoyé: Mercredi 4 Mars 2015 15:13:30
Objet: Re: [ceph-users] Perf problem after upgrade from dumpling to firefly

Ceph health is OK, yes. The «firefly-upgrade-cluster-IO.png» graph shows the IO stats seen by Ceph: there is no change between dumpling and firefly. The change is only on the OSDs (and not on the OSD journals).

Le mercredi 04 mars 2015 à 15:05 +0100, Alexandre DERUMIER a écrit :

The load problem is permanent: I have twice the IO/s on HDD since firefly.

Oh, permanent, that's strange. (If you don't see more traffic coming from clients, I don't understand...) Do you also see twice the ios/ops in "ceph -w" stats? Is the ceph health OK?

- Mail original -
De: Olivier Bonvalet ceph.l...@daevel.fr
À: aderumier aderum...@odiso.com
Cc: ceph-users ceph-users@lists.ceph.com
Envoyé: Mercredi 4 Mars 2015 14:49:41
Objet: Re: [ceph-users] Perf problem after upgrade from dumpling to firefly

Thanks Alexandre. The load problem is permanent: I have twice the IO/s on HDD since firefly. And yes, the problem hangs production at night during snap trimming. I suppose there is a new OSD parameter which changes the behaviour of the journal, or something like that, but I didn't find anything about that.

Olivier

Le mercredi 04 mars 2015 à 14:44 +0100, Alexandre DERUMIER a écrit :

Hi, maybe this is related?:

http://tracker.ceph.com/issues/9503
Dumpling: removing many snapshots in a short time makes OSDs go berserk

http://tracker.ceph.com/issues/9487
dumpling: snaptrimmer causes slow requests while backfilling. osd_snap_trim_sleep not helping

http://lists.opennebula.org/pipermail/ceph-users-ceph.com/2014-December/045116.html

I think it's already backported in dumpling, not sure it's already done for firefly.

Alexandre

- Mail original -
De: Olivier Bonvalet ceph.l...@daevel.fr
À: ceph-users ceph-users@lists.ceph.com
Envoyé: Mercredi 4 Mars 2015 12:10:30
Objet: [ceph-users] Perf problem after upgrade from dumpling to firefly

Hi,

Last Saturday I upgraded my production cluster from dumpling to emperor (since we were successfully using it on a test cluster). A couple of hours later, we had failing OSDs: some of them were marked down by Ceph, probably because of IO starvation. I set the cluster to «noout», started the downed OSDs, then let them recover. 24h later, same problem (at nearly the same hour). So I chose to upgrade directly to firefly, which is maintained. Things are better, but the cluster is slower than with dumpling.
The main problem seems to be that the OSDs do twice as many write operations per second:
https://daevel.fr/img/firefly/firefly-upgrade-OSD70-IO.png
https://daevel.fr/img/firefly/firefly-upgrade-OSD71-IO.png

But the journal doesn't change (SSD dedicated to OSD 70+71+72):
https://daevel.fr/img/firefly/firefly-upgrade-OSD70+71-journal.png

Neither does node bandwidth:
https://daevel.fr/img/firefly/firefly-upgrade-dragan-bandwidth.png

Or whole-cluster IO activity:
https://daevel.fr/img/firefly/firefly-upgrade-cluster-IO.png

Some background: the cluster is split into pools with «full SSD» OSDs and «HDD+SSD journal» OSDs. Only the «HDD+SSD» OSDs seem to be affected. I have 9 OSDs per «HDD+SSD» node (9 HDD and 3 SSD), and only 3 «HDD+SSD» nodes (so a total of 27 «HDD+SSD» OSDs).

The IO peak between 03h00 and 09h00 corresponds to snapshot rotation (= «rbd snap rm» operations). osd_snap_trim_sleep has been set to 0.8 for months. Yesterday I tried to reduce osd_pg_max_concurrent_snap_trims to 1; it doesn't seem to really help. The only thing which seems to help is to reduce osd_disk_threads from 8 to 1.

So. Any idea about what's happening? Thanks for any help,

Olivier

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Implement replication network with live cluster
That was my thought, yes - I found this blog that confirms what you are saying, I guess: http://www.sebastien-han.fr/blog/2012/07/29/tip-ceph-public-slash-private-network-configuration/

I will do that... Thx. I guess it doesn't matter, since my crush map will still reference old OSDs that are stopped (and the cluster resynced after that)?

Thx again for the help

On 4 March 2015 at 17:44, Robert LeBlanc rob...@leblancnet.us wrote:

If I remember right, someone has done this on a live cluster without any issues. I seem to remember that there is a fallback mechanism: if the OSDs can't be reached on the cluster network, they are contacted on the public network. You could test it pretty easily without much impact: take one OSD that has both networks, configure it, and restart the process. If all the nodes (specifically the old ones with only one network) are able to connect to it, then you are good to go by restarting one OSD at a time.

On Wed, Mar 4, 2015 at 4:17 AM, Andrija Panic andrija.pa...@gmail.com wrote:

Hi,

I'm running a live cluster with only a public network (so no explicit network configuration in the ceph.conf file). I'm wondering what the procedure is to implement a dedicated replication/private network alongside the public one. I've read the manual and know how to do it in ceph.conf, but since this is an already running cluster - what should I do after I change ceph.conf on all nodes? Restart OSDs one by one, or...? Is there any downtime expected before the replication network is completely implemented?

Another related question: I'm also demoting some old OSDs, on old servers. I will have them all stopped, but I would like to implement the replication network before actually removing the old OSDs from the crush map, since a lot of data will be moved around. My old nodes/OSDs (which will be stopped before I implement the replication network) do NOT have a dedicated NIC for the replication network, in contrast to the new nodes/OSDs. So there will still be references to these old OSDs in the crush map. Will it be a problem that the replication network WILL work on the new nodes/OSDs, but not on the old ones, since they don't have a dedicated NIC? I guess not, since the old OSDs are stopped anyway, but I would like an opinion. Or perhaps I should remove the OSDs from the crush map first, with nobackfill and norecover set (so no rebalancing happens), and then implement the replication network?

Sorry for the long post, but... Thanks,

--
Andrija Panić

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Andrija Panić

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] CEPH hardware recommendations and cluster design questions
Hi!

I've seen the documentation at http://ceph.com/docs/master/start/hardware-recommendations/ but those minimum requirements without some recommendations don't tell me much...

So, from what I've seen, for mon and mds any cheap 6-core, 16+ GB RAM AMD box would do... what puzzles me is the per-daemon construct... Why would I need/require multiple daemons? With separate servers (3 mon + 1 mds - I understood that this is the requirement) I imagine that each will run a single type of daemon.. did I miss something? (Unless maybe there is a relation between daemons and block devices, and for each block device there should be a daemon?)

For mon and mds: would it help the clients if these are on 10 GbE?

For osd: I plan to use a 36-disk server as the OSD server (ZFS RAIDZ3 across all disks + 2 SSDs mirrored for ZIL and L2ARC) - that would give me ~132 TB. How much RAM would I really need? (128 GB would be way too much I think.) (That RAIDZ3 over 36 disks is just a thought - I also have choices like: 2 x 18-disk RAIDZ2; 34-disk RAIDZ3 + 2 hot spares.)

Regarding journal and scrubbing: by using ZFS I would think that I can safely skip the Ceph ones... is this OK?

Do you have any other advice and recommendations for me? (The read:write ratio will be 10:1.)

Thank you!!
Adrian

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph Cluster Address
On Tue, Mar 3, 2015 at 9:26 AM, Garg, Pankaj pankaj.g...@caviumnetworks.com wrote:

Hi,

I have a ceph cluster that is contained within a rack (1 monitor and 5 OSD nodes). I kept the same public and private address in the configuration. I do have 2 NICs and 2 valid IP addresses (one internal-only and one external) for each machine. Is it possible now, to change the public network address after the cluster is up and running? I used ceph-deploy for the cluster. If I change the address of the public network in ceph.conf, do I need to propagate it to all the machines in the cluster, or is the monitor node enough?

You'll need to change the config on each node and then restart it so that the OSDs will bind to the new location. The OSDs will let you do this on a rolling basis, but the networks will need to be routable to each other.

Note that changing the addresses on the monitors (I can't tell if you want to do that) is much more difficult; it's probably easiest to remove one at a time from the cluster and then recreate it with its new IP. (There are docs on how to do this.)
-Greg

Thanks
Pankaj

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
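[Editor's note: a sketch of the rolling change Greg describes, per OSD node. The subnet, OSD id and init command are placeholders/assumptions -- adjust for your distro (sysvinit shown here) and your actual addressing:]

  # 1. update ceph.conf on the node
  [global]
      public network = 203.0.113.0/24    ; new external subnet (placeholder)

  # 2. restart the OSDs on that node one at a time
  service ceph restart osd.3

  # 3. confirm the OSD re-registered with its new address before moving on
  ceph osd dump | grep "^osd.3 "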
Re: [ceph-users] cephfs filesystem layouts : authentication gotchas ?
Just to get more specific: the reason you can apparently write stuff to a file when you can't write to the pool it's stored in is because the file data is initially stored in cache. The flush out to RADOS, when it happens, will fail. It would definitely be preferable if there were some way to immediately return a permission or IO error in this case, but so far we haven't found one; the relevant interfaces just aren't present and it's unclear how to propagate the data back to users in a way that makes sense even if they were. :/
-Greg

On Wed, Mar 4, 2015 at 3:37 AM, SCHAER Frederic frederic.sch...@cea.fr wrote:

Hi,

Many thanks for the explanations. I haven't used the nodcache option when mounting cephfs; it actually got there by default. My mount command is/was:

# mount -t ceph 1.2.3.4:6789:/ /mnt -o name=puppet,secretfile=./puppet.secret

I don't know what causes this option to be the default, maybe it's the kernel module I compiled from git (because there is no kmod-ceph or kmod-rbd in any RHEL-like distribution except RHEV), I'll try to update/check...

Concerning the rados pool ls, indeed: I created empty files in the pool, and they were not showing up, probably because they were just empty - but when I create a non-empty file, I do see things in rados ls...

Thanks again
Frederic

-----Message d'origine-----
De : ceph-users [mailto:ceph-users-boun...@lists.ceph.com] De la part de John Spray
Envoyé : mardi 3 mars 2015 17:15
À : ceph-users@lists.ceph.com
Objet : Re: [ceph-users] cephfs filesystem layouts : authentication gotchas ?

On 03/03/2015 15:21, SCHAER Frederic wrote:

By the way: it looks like the ceph fs ls command is inconsistent when the cephfs is mounted (I used a locally compiled kmod-ceph rpm):

[root@ceph0 ~]# ceph fs ls
name: cephfs_puppet, metadata pool: puppet_metadata, data pools: [puppet ]

(umount /mnt ...)

[root@ceph0 ~]# ceph fs ls
name: cephfs_puppet, metadata pool: puppet_metadata, data pools: [puppet root ]

This is probably #10288, which was fixed in 0.87.1

So, I have this pool named root that I added to the cephfs filesystem. I then edited the filesystem xattrs:

[root@ceph0 ~]# getfattr -n ceph.dir.layout /mnt/root
getfattr: Removing leading '/' from absolute path names
# file: mnt/root
ceph.dir.layout="stripe_unit=4194304 stripe_count=1 object_size=4194304 pool=root"

I'm therefore assuming client.puppet should not be allowed to write or read anything in /mnt/root, which belongs to the root pool... but that is not the case. On another machine where I mounted cephfs using the client.puppet key, I can do this (the mount was done with the client.puppet key, not the admin one, which is not deployed on that node):

1.2.3.4:6789:/ on /mnt type ceph (rw,relatime,name=puppet,secret=hidden,nodcache)

[root@dev7248 ~]# echo not allowed > /mnt/root/secret.notfailed
[root@dev7248 ~]#
[root@dev7248 ~]# cat /mnt/root/secret.notfailed
not allowed

This is data you're seeing from the page cache, it hasn't been written to RADOS. You have used the nodcache setting, but that doesn't mean what you think it does (it was about caching dentries, not data). It's actually not even used in recent kernels (http://tracker.ceph.com/issues/11009). You could try the nofsc option, but I don't know exactly how much caching that turns off -- the safer approach here is probably to do your testing using I/Os that have O_DIRECT set.
And I can even see the xattrs inherited from the parent dir:

[root@dev7248 ~]# getfattr -n ceph.file.layout /mnt/root/secret.notfailed
getfattr: Removing leading '/' from absolute path names
# file: mnt/root/secret.notfailed
ceph.file.layout="stripe_unit=4194304 stripe_count=1 object_size=4194304 pool=root"

Whereas on the node where I mounted cephfs as the ceph admin, I get nothing:

[root@ceph0 ~]# cat /mnt/root/secret.notfailed
[root@ceph0 ~]# ls -l /mnt/root/secret.notfailed
-rw-r--r-- 1 root root 12 Mar 3 15:27 /mnt/root/secret.notfailed

After some time, the file also becomes empty on the puppet client host:

[root@dev7248 ~]# cat /mnt/root/secret.notfailed
[root@dev7248 ~]#

(but the metadata remained ?)

Right -- eventually the cache goes away, and you see the true (empty) state of the file.

Also, as an unprivileged user, I can take ownership of a secret file by changing the extended attribute:

[root@dev7248 ~]# setfattr -n ceph.file.layout.pool -v puppet /mnt/root/secret.notfailed
[root@dev7248 ~]# getfattr -n ceph.file.layout /mnt/root/secret.notfailed
getfattr: Removing leading '/' from absolute path names
# file: mnt/root/secret.notfailed
ceph.file.layout="stripe_unit=4194304 stripe_count=1 object_size=4194304 pool=puppet"

Well, you're not really getting ownership of anything here: you're modifying the file's metadata, which you are entitled to do (pool permissions have nothing to do with file metadata). There was a recent bug where a file's pool layout could
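[Editor's note: a quick way to follow John's suggestion and test with O_DIRECT rather than through the page cache -- a sketch only, reusing the path from Frederic's example:]

  # write bypassing the client page cache; with no write caps on the pool this
  # should fail rather than falsely appear to succeed from cache
  dd if=/dev/zero of=/mnt/root/secret.notfailed bs=4M count=1 oflag=direct

  # read back bypassing the cache as well
  dd if=/mnt/root/secret.notfailed of=/dev/null bs=4M iflag=direct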
Re: [ceph-users] Implement replication network with live cluster
On 03/04/2015 05:44 PM, Robert LeBlanc wrote:

If I remember right, someone has done this on a live cluster without any issues. I seem to remember that there is a fallback mechanism: if the OSDs can't be reached on the cluster network, they are contacted on the public network. You could test it pretty easily without much impact: take one OSD that has both networks, configure it, and restart the process. If all the nodes (specifically the old ones with only one network) are able to connect to it, then you are good to go by restarting one OSD at a time.

In the OSDMap each OSD has a public and a cluster network address. If the cluster network address is not set, replication to that OSD will be done over the public network. So you can push a new configuration to all OSDs and restart them one by one. Make sure the network is of course up and running and it should work.

On Wed, Mar 4, 2015 at 4:17 AM, Andrija Panic andrija.pa...@gmail.com wrote:

Hi,

I'm running a live cluster with only a public network (so no explicit network configuration in the ceph.conf file). I'm wondering what the procedure is to implement a dedicated replication/private network alongside the public one. I've read the manual and know how to do it in ceph.conf, but since this is an already running cluster - what should I do after I change ceph.conf on all nodes? Restart OSDs one by one, or...? Is there any downtime expected before the replication network is completely implemented?

Another related question: I'm also demoting some old OSDs, on old servers. I will have them all stopped, but I would like to implement the replication network before actually removing the old OSDs from the crush map, since a lot of data will be moved around. My old nodes/OSDs (which will be stopped before I implement the replication network) do NOT have a dedicated NIC for the replication network, in contrast to the new nodes/OSDs. So there will still be references to these old OSDs in the crush map. Will it be a problem that the replication network WILL work on the new nodes/OSDs, but not on the old ones, since they don't have a dedicated NIC? I guess not, since the old OSDs are stopped anyway, but I would like an opinion. Or perhaps I should remove the OSDs from the crush map first, with nobackfill and norecover set (so no rebalancing happens), and then implement the replication network?

Sorry for the long post, but... Thanks,

--
Andrija Panić

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Rebalance/Backfill Throtling - anything missing here?
You will most likely have a very high relocation percentage. Backfills are always more impactful on smaller clusters, but osd max backfills should be what you need to help reduce the impact. The default is 10; you will want to use 1. I didn't catch which version of Ceph you are running, but I think there was some priority work done in firefly to help make backfills lower priority. I think it has gotten better in later versions.

On Wed, Mar 4, 2015 at 1:35 AM, Andrija Panic andrija.pa...@gmail.com wrote:

Thank you Robert - I'm wondering, when I remove a total of 7 OSDs from the crush map, whether that will cause more than 37% of the data to be moved (80% or whatever).

I'm also wondering if the throttling that I applied is fine or not - I will introduce osd_recovery_delay_start 10 sec as Irek said. I'm just wondering how big the performance impact will be, because:
- when stopping an OSD, the impact while backfilling was fine, more or less - I can live with this
- when I removed the OSD from the crush map - for the first 1h or so the impact was tremendous, and later on during the recovery process the impact was much less but still noticeable...

Thanks for the tip of course!

Andrija

On 3 March 2015 at 18:34, Robert LeBlanc rob...@leblancnet.us wrote:

I would be inclined to shut down both OSDs in a node and let the cluster recover. Once it is recovered, shut down the next two, let it recover. Repeat until all the OSDs are taken out of the cluster. Then I would set nobackfill and norecover, remove the hosts/disks from the CRUSH map, then unset nobackfill and norecover. That should give you a few small changes (when you shut down OSDs) and then one big one to get everything into its final place. If you are still adding new nodes, you can add them in while nobackfill and norecover are set, so that the one big relocation fills the new drives too.

On Tue, Mar 3, 2015 at 5:58 AM, Andrija Panic andrija.pa...@gmail.com wrote:

Thx Irek. The number of replicas is 3.

I have 3 servers with 2 OSDs each on a 1G switch (1 OSD already decommissioned), which is further connected to a new 10G switch/network with 3 servers of 12 OSDs each. I'm decommissioning the old 3 nodes on the 1G network...

So you suggest removing the whole node with its 2 OSDs manually from the crush map? To my knowledge, Ceph never places 2 replicas on 1 node; all 3 replicas were originally distributed over all 3 nodes. So it should anyway be safe to remove 2 OSDs at once together with the node itself... since the replica count is 3...?

Thx again for your time

On Mar 3, 2015 1:35 PM, Irek Fasikhov malm...@gmail.com wrote:

Since you have only three nodes in the cluster, I recommend you add the new nodes to the cluster first, and then delete the old ones.

2015-03-03 15:28 GMT+03:00 Irek Fasikhov malm...@gmail.com:

What is your replica count?

2015-03-03 15:14 GMT+03:00 Andrija Panic andrija.pa...@gmail.com:

Hi Irek,

yes, stopping the OSD (or setting it OUT) resulted in only 3% of data degraded and moved/recovered. It was when I afterwards removed it from the crush map (ceph osd crush rm id) that the thing with 37% happened.

And thanks Irek for the help - could you kindly just let me know the preferred steps when removing a whole node? Do you mean I should first stop all OSDs again, or just remove each OSD from the crush map, or perhaps just decompile the crush map, delete the node completely, compile it back in, and let it heal/recover? Do you think this would result in less data being misplaced and moved around?

Sorry for bugging you, I really appreciate your help.
Thanks

On 3 March 2015 at 12:58, Irek Fasikhov malm...@gmail.com wrote:

A large percentage of the cluster map gets rebuilt (but a low percentage of degradation). If you had not done ceph osd crush rm id, the percentage would be low. In your case, the correct option is to remove the entire node, rather than each disk individually.

2015-03-03 14:27 GMT+03:00 Andrija Panic andrija.pa...@gmail.com:

Another question - I mentioned 37% of objects being moved around here - these are MISPLACED objects (degraded objects were 0.001%), after I removed 1 OSD from the crush map (out of 44 OSDs or so).

Can anybody confirm this is normal behaviour - and are there any workarounds?

I understand this is because of Ceph's object placement algorithm, but still, 37% of objects misplaced just by removing 1 OSD out of 44 from the crush map makes me wonder why the percentage is so large. It seems not good to me, and I have to remove another 7 OSDs (we are demoting some old hardware nodes). This means I can potentially end up with 7 x the same number of misplaced objects...?

Any thoughts?

Thanks

On 3 March 2015 at 12:14, Andrija Panic andrija.pa...@gmail.com wrote:

Thanks Irek. Does this mean that after peering for each PG, there will be a delay of 10 sec, meaning that every once in a while, I will have 10 sec of the cluster
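[Editor's note: for anyone applying Robert's advice, a sketch of the throttling and flag handling discussed above, run from a monitor node. The numbers are the ones mentioned in the thread, not a tuning recommendation:]

  # throttle recovery/backfill on all OSDs without restarting them
  ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-delay-start 10'

  # before editing the CRUSH map, stop data movement
  ceph osd set nobackfill
  ceph osd set norecover

  # ... remove the decommissioned hosts/OSDs from the CRUSH map here ...

  # then let the single large rebalance start
  ceph osd unset nobackfill
  ceph osd unset norecover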
Re: [ceph-users] Implement replication network with live cluster
If the data has been replicated to the new OSDs, the cluster will be able to function properly even with the old ones down, or reachable only on the public network.

On Wed, Mar 4, 2015 at 9:49 AM, Andrija Panic andrija.pa...@gmail.com wrote:

I guess it doesn't matter, since my Crush Map will still reference old OSDs that are stopped (and the cluster resynced after that)?

I wanted to say: it doesn't matter (I guess?) that my crush map is still referencing old OSD nodes that are already stopped. Tired, sorry...

On 4 March 2015 at 17:48, Andrija Panic andrija.pa...@gmail.com wrote:

That was my thought, yes - I found this blog that confirms what you are saying, I guess: http://www.sebastien-han.fr/blog/2012/07/29/tip-ceph-public-slash-private-network-configuration/

I will do that... Thx. I guess it doesn't matter, since my Crush Map will still reference old OSDs that are stopped (and the cluster resynced after that)?

Thx again for the help

On 4 March 2015 at 17:44, Robert LeBlanc rob...@leblancnet.us wrote:

If I remember right, someone has done this on a live cluster without any issues. I seem to remember that there is a fallback mechanism: if the OSDs can't be reached on the cluster network, they are contacted on the public network. You could test it pretty easily without much impact: take one OSD that has both networks, configure it, and restart the process. If all the nodes (specifically the old ones with only one network) are able to connect to it, then you are good to go by restarting one OSD at a time.

On Wed, Mar 4, 2015 at 4:17 AM, Andrija Panic andrija.pa...@gmail.com wrote:

Hi,

I'm running a live cluster with only a public network (so no explicit network configuration in the ceph.conf file). I'm wondering what the procedure is to implement a dedicated replication/private network alongside the public one. I've read the manual and know how to do it in ceph.conf, but since this is an already running cluster - what should I do after I change ceph.conf on all nodes? Restart OSDs one by one, or...? Is there any downtime expected before the replication network is completely implemented?

Another related question: I'm also demoting some old OSDs, on old servers. I will have them all stopped, but I would like to implement the replication network before actually removing the old OSDs from the crush map, since a lot of data will be moved around. My old nodes/OSDs (which will be stopped before I implement the replication network) do NOT have a dedicated NIC for the replication network, in contrast to the new nodes/OSDs. So there will still be references to these old OSDs in the crush map. Will it be a problem that the replication network WILL work on the new nodes/OSDs, but not on the old ones, since they don't have a dedicated NIC? I guess not, since the old OSDs are stopped anyway, but I would like an opinion. Or perhaps I should remove the OSDs from the crush map first, with nobackfill and norecover set (so no rebalancing happens), and then implement the replication network?

Sorry for the long post, but... Thanks,

--
Andrija Panić

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Andrija Panić

--
Andrija Panić

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] CEPH hardware recommendations and cluster design questions
On Wed, 4 Mar 2015, Adrian Sevcenco wrote:

Hi! I've seen the documentation at http://ceph.com/docs/master/start/hardware-recommendations/ but those minimum requirements without some recommendations don't tell me much ... So, from what I've seen, for mon and mds any cheap 6-core, 16+ GB RAM AMD box would do ... what puzzles me is the per-daemon construct ... Why would I need/require multiple daemons? With separate servers (3 mon + 1 mds - I understood that this is the requirement) I imagine that each will run a single type of daemon.. did I miss something? (Unless maybe there is a relation between daemons and block devices, and for each block device there should be a daemon?)

There is normally a ceph-osd daemon per disk.

for mon and mds : would it help the clients if these are on 10 GbE?

For the MDS latency is important, so possibly!

for osd : I plan to use a 36-disk server as the OSD server (ZFS RAIDZ3 across all disks + 2 SSDs mirrored for ZIL and L2ARC) - that would give me ~132 TB. How much RAM would I really need? (128 GB would be way too much I think.) (That RAIDZ3 over 36 disks is just a thought - I also have choices like: 2 x 18-disk RAIDZ2; 34-disk RAIDZ3 + 2 hot spares.)

Usually Ceph is deployed without RAID underneath. You can use it, though--Ceph doesn't really care. Performance just tends to be lower compared to a ceph-osd daemon per disk. Note that there is some support for ZFS but it is not tested by us at all, so you'll be mostly on your own. I know a few users have had success here but I have no idea how busy their clusters are. Be careful!

Regarding journal and scrubbing : by using ZFS I would think that I can safely skip the Ceph ones ... is this OK?

You still want Ceph scrubbing, as it verifies that the replicas don't get out of sync. Maybe you could forgo deep scrubbing, but it may make more sense to disable ZFS scrubbing and let Ceph drive it, so you get things verified through the whole stack...

sage

Do you have any other advice and recommendations for me? (The read:write ratio will be 10:1.)

Thank you!!
Adrian

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
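[Editor's note: if someone does want to experiment with the "forgo deep scrubbing" idea, a sketch of the knobs involved -- these are examples of the options, not a recommendation from the thread, and the interval value is arbitrary:]

  # disable deep scrubs cluster-wide (regular scrubs keep running)
  ceph osd set nodeep-scrub

  # or, instead, stretch the deep-scrub interval via ceph.conf, e.g.:
  [osd]
      osd deep scrub interval = 2592000    ; 30 days in seconds (default is weekly)

  # re-enable later with
  ceph osd unset nodeep-scrub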
Re: [ceph-users] Persistent Write Back Cache
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of John Spray
Sent: 04 March 2015 11:34
To: Nick Fisk; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Persistent Write Back Cache

On 04/03/2015 08:26, Nick Fisk wrote:

To illustrate the difference a proper write-back cache can make, I put a 1GB (512MB dirty threshold) flashcache in front of my RBD and tweaked the flush parameters to flush dirty blocks at a large queue depth. The same fio test (128k iodepth=1) now runs at 120MB/s and is limited by the performance of the SSD used by flashcache, as everything is stored as 4k blocks on the SSD. In fact, since everything is stored as 4k blocks, pretty much all IO sizes are accelerated to the max speed of the SSD. Looking at iostat I can see all the IOs are getting coalesced into nice large 512kb IOs at a high queue depth, which Ceph easily swallows. If librbd could support writing its cache out to SSD it would hopefully achieve the same level of performance, and having it integrated would be really neat.

What are you hoping to gain from building something into Ceph instead of using flashcache/bcache/dm-cache on top of it? It seems like since you would anyway need to handle your HA configuration, setting up the actual cache device would be the simple part.

Cheers,
John

Hi John,

I guess it's to make things easier, rather than having to run a huge stack of different technologies to achieve the same goal, especially when half of the caching logic is already in Ceph. It would be really nice, and would drive adoption, if you could add an SSD, set a config option, and suddenly have a storage platform that performs 10x faster.

Another way of handling it might be for librbd to be pointed at a UUID instead of a /dev/sd* device. That way librbd knows which cache device to look for and will error out if the cache device is missing. These cache devices could then be presented to all the necessary servers via iSCSI or something similar if the RBD needs to move around.

Nick

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
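[Editor's note: for anyone wanting to reproduce Nick's flashcache experiment, the setup is roughly as follows. This is only a sketch: the device names are placeholders, the sysctl naming pattern is from memory of the flashcache docs (check your version), and the exact flush/dirty tuning Nick used is not given in the thread.]

  # create a 1GB writeback cache on an SSD partition in front of a mapped RBD
  flashcache_create -p back -s 1g rbd_cache /dev/sdb1 /dev/rbd0
  # the cached device then appears as /dev/mapper/rbd_cache

  # raise the dirty threshold towards the ~50% (512MB) figure Nick mentions;
  # sysctl nodes are named dev.flashcache.<ssd>+<disk>.* in flashcache
  sysctl -w dev.flashcache.sdb1+rbd0.dirty_thresh_pct=50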
Re: [ceph-users] Implement replication network with live cluster
Thx Wido, I needed this confirmations - thanks! On 4 March 2015 at 17:49, Wido den Hollander w...@42on.com wrote: On 03/04/2015 05:44 PM, Robert LeBlanc wrote: If I remember right, someone has done this on a live cluster without any issues. I seem to remember that it had a fallback mechanism if the OSDs couldn't be reached on the cluster network to contact them on the public network. You could test it pretty easily without much impact. Take one OSD that has both networks and configure it and restart the process. If all the nodes (specifically the old ones with only one network) is able to connect to it, then you are good to go by restarting one OSD at a time. In the OSDMap each OSD has a public and cluster network address. If the cluster network address is not set, replication to that OSD will be done over the public network. So you can push a new configuration to all OSDs and restart them one by one. Make sure the network is ofcourse up and running and it should work. On Wed, Mar 4, 2015 at 4:17 AM, Andrija Panic andrija.pa...@gmail.com wrote: Hi, I'm having a live cluster with only public network (so no explicit network configuraion in the ceph.conf file) I'm wondering what is the procedure to implement dedicated Replication/Private and Public network. I've read the manual, know how to do it in ceph.conf, but I'm wondering since this is already running cluster - what should I do after I change ceph.conf on all nodes ? Restarting OSDs one by one, or... ? Is there any downtime expected ? - for the replication network to actually imlemented completely. Another related quetion: Also, I'm demoting some old OSDs, on old servers, I will have them all stoped, but would like to implement replication network before actually removing old OSDs from crush map - since lot of data will be moved arround. My old nodes/OSDs (that will be stoped before I implement replication network) - do NOT have dedicated NIC for replication network, in contrast to new nodes/OSDs. So there will be still reference to these old OSD in the crush map. Will this be a problem - me changing/implementing replication network that WILL work on new nodes/OSDs, but not on old ones since they don't have dedicated NIC ? I guess not since old OSDs are stoped anyway, but would like opinion. Or perhaps i might remove OSD from crush map with prior seting of nobackfill and norecover (so no rebalancing happens) and then implement replication netwotk? Sorry for old post, but... Thanks, -- Andrija Panić ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Wido den Hollander 42on B.V. Ceph trainer and consultant Phone: +31 (0)20 700 9902 Skype: contact42on ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Andrija Panić ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Implement replication network with live cluster
Thx again - I really appreciatethe help guys ! On 4 March 2015 at 17:51, Robert LeBlanc rob...@leblancnet.us wrote: If the data have been replicated to new OSDs, it will be able to function properly even them them down or only on the public network. On Wed, Mar 4, 2015 at 9:49 AM, Andrija Panic andrija.pa...@gmail.com wrote: I guess it doesnt matter, since my Crush Map will still refernce old OSDs, that are stoped (and cluster resynced after that) ? I wanted to say: it doesnt matter (I guess?) that my Crush map is still referencing old OSD nodes that are already stoped. Tired, sorry... On 4 March 2015 at 17:48, Andrija Panic andrija.pa...@gmail.com wrote: That was my thought, yes - I found this blog that confirms what you are saying I guess: http://www.sebastien-han.fr/blog/2012/07/29/tip-ceph-public-slash-private-network-configuration/ I will do that... Thx I guess it doesnt matter, since my Crush Map will still refernce old OSDs, that are stoped (and cluster resynced after that) ? Thx again for the help On 4 March 2015 at 17:44, Robert LeBlanc rob...@leblancnet.us wrote: If I remember right, someone has done this on a live cluster without any issues. I seem to remember that it had a fallback mechanism if the OSDs couldn't be reached on the cluster network to contact them on the public network. You could test it pretty easily without much impact. Take one OSD that has both networks and configure it and restart the process. If all the nodes (specifically the old ones with only one network) is able to connect to it, then you are good to go by restarting one OSD at a time. On Wed, Mar 4, 2015 at 4:17 AM, Andrija Panic andrija.pa...@gmail.com wrote: Hi, I'm having a live cluster with only public network (so no explicit network configuraion in the ceph.conf file) I'm wondering what is the procedure to implement dedicated Replication/Private and Public network. I've read the manual, know how to do it in ceph.conf, but I'm wondering since this is already running cluster - what should I do after I change ceph.conf on all nodes ? Restarting OSDs one by one, or... ? Is there any downtime expected ? - for the replication network to actually imlemented completely. Another related quetion: Also, I'm demoting some old OSDs, on old servers, I will have them all stoped, but would like to implement replication network before actually removing old OSDs from crush map - since lot of data will be moved arround. My old nodes/OSDs (that will be stoped before I implement replication network) - do NOT have dedicated NIC for replication network, in contrast to new nodes/OSDs. So there will be still reference to these old OSD in the crush map. Will this be a problem - me changing/implementing replication network that WILL work on new nodes/OSDs, but not on old ones since they don't have dedicated NIC ? I guess not since old OSDs are stoped anyway, but would like opinion. Or perhaps i might remove OSD from crush map with prior seting of nobackfill and norecover (so no rebalancing happens) and then implement replication netwotk? Sorry for old post, but... Thanks, -- Andrija Panić ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Andrija Panić -- Andrija Panić -- Andrija Panić ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] v0.93 Hammer release candidate released
On Wed, 4 Mar 2015, Thomas Lemarchand wrote: Thanks to all Ceph developers for the good work ! I see some love given to CephFS. When will you consider CephFS to be production ready ? The key missing piece is fsck (check and repair). That's where our efforts are focused now. I think infernalis will have something pretty reasonable? I use CephFS in production since Giant, and apart from the cache pressure health warning bug, now resolved, I didn't have a single problem. That's great to hear! sage -- Thomas Lemarchand Cloud Solutions SAS - Responsable des systèmes d'information On ven., 2015-02-27 at 14:10 -0800, Sage Weil wrote: This is the first release candidate for Hammer, and includes all of the features that will be present in the final release. We welcome and encourage any and all testing in non-production clusters to identify any problems with functionality, stability, or performance before the final Hammer release. We suggest some caution in one area: librbd. There is a lot of new functionality around object maps and locking that is disabled by default but may still affect stability for existing images. We are continuing to shake out those bugs so that the final Hammer release (probably v0.94) will be stable. Major features since Giant include: * cephfs: journal scavenger repair tool (John Spray) * crush: new and improved straw2 bucket type (Sage Weil, Christina Anderson, Xiaoxi Chen) * doc: improved guidance for CephFS early adopters (John Spray) * librbd: add per-image object map for improved performance (Jason Dillaman) * librbd: copy-on-read (Min Chen, Li Wang, Yunchuan Wen, Cheng Cheng) * librados: fadvise-style IO hints (Jianpeng Ma) * mds: many many snapshot-related fixes (Yan, Zheng) * mon: new 'ceph osd df' command (Mykola Golub) * mon: new 'ceph pg ls ...' command (Xinxin Shu) * osd: improved performance for high-performance backends * osd: improved recovery behavior (Samuel Just) * osd: improved cache tier behavior with reads (Zhiqiang Wang) * rgw: S3-compatible bucket versioning support (Yehuda Sadeh) * rgw: large bucket index sharding (Guang Yang, Yehuda Sadeh) * RDMA xio messenger support (Matt Benjamin, Vu Pham) Upgrading - * No special restrictions when upgrading from firefly or giant Notable Changes --- * build: CMake support (Ali Maredia, Casey Bodley, Adam Emerson, Marcus Watts, Matt Benjamin) * ceph-disk: do not re-use partition if encryption is required (Loic Dachary) * ceph-disk: support LUKS for encrypted partitions (Andrew Bartlett, Loic Dachary) * ceph-fuse,libcephfs: add support for O_NOFOLLOW and O_PATH (Greg Farnum) * ceph-fuse,libcephfs: resend requests before completing cap reconnect (#10912 Yan, Zheng) * ceph-fuse: select kernel cache invalidation mechanism based on kernel version (Greg Farnum) * ceph-objectstore-tool: improved import (David Zafman) * ceph-objectstore-tool: misc improvements, fixes (#9870 #9871 David Zafman) * ceph: add 'ceph osd df [tree]' command (#10452 Mykola Golub) * ceph: fix 'ceph tell ...' 
command validation (#10439 Joao Eduardo Luis) * ceph: improve 'ceph osd tree' output (Mykola Golub) * cephfs-journal-tool: add recover_dentries function (#9883 John Spray) * common: add newline to flushed json output (Sage Weil) * common: filtering for 'perf dump' (John Spray) * common: fix Formatter factory breakage (#10547 Loic Dachary) * common: make json-pretty output prettier (Sage Weil) * crush: new and improved straw2 bucket type (Sage Weil, Christina Anderson, Xiaoxi Chen) * crush: update tries stats for indep rules (#10349 Loic Dachary) * crush: use larger choose_tries value for erasure code rulesets (#10353 Loic Dachary) * debian,rpm: move RBD udev rules to ceph-common (#10864 Ken Dreyer) * debian: split python-ceph into python-{rbd,rados,cephfs} (Boris Ranto) * doc: CephFS disaster recovery guidance (John Spray) * doc: CephFS for early adopters (John Spray) * doc: fix OpenStack Glance docs (#10478 Sebastien Han) * doc: misc updates (#9793 #9922 #10204 #10203 Travis Rhoden, Hazem, Ayari, Florian Coste, Andy Allan, Frank Yu, Baptiste Veuillez-Mainard, Yuan Zhou, Armando Segnini, Robert Jansen, Tyler Brekke, Viktor Suprun) * doc: replace cloudfiles with swiftclient Python Swift example (Tim Freund) * erasure-code: add mSHEC erasure code support (Takeshi Miyamae) * erasure-code: improved docs (#10340 Loic Dachary) * erasure-code: set max_size to 20 (#10363 Loic Dachary) * libcephfs,ceph-fuse: fix getting zero-length xattr (#10552 Yan, Zheng) * librados: add blacklist_add convenience method (Jason Dillaman) * librados: expose rados_{read|write}_op_assert_version in C API (Kim Vandry) * librados: fix pool name caching (#10458 Radoslaw Zarzynski) * librados: fix resource leak, misc bugs
Re: [ceph-users] CEPH hardware recommendations and cluster design questions
Hi, for hardware, Inktank has good guides here: http://www.inktank.com/resource/inktank-hardware-selection-guide/ http://www.inktank.com/resource/inktank-hardware-configuration-guide/ Ceph works well with multiple OSD daemons (one OSD per disk), so you should not use RAID. (xfs is the recommended fs for osd daemons.) You don't need spare disks either, just enough disk space to handle a disk failure. (Data is replicated/rebalanced onto the other disks/OSDs in case of a disk failure.)
- Mail original - De: Adrian Sevcenco adrian.sevce...@cern.ch À: ceph-users ceph-users@lists.ceph.com Envoyé: Mercredi 4 Mars 2015 18:30:31 Objet: [ceph-users] CEPH hardware recommendations and cluster design questions Hi! I seen the documentation http://ceph.com/docs/master/start/hardware-recommendations/ but those minimum requirements without some recommendations don't tell me much ... So, from what i seen for mon and mds any cheap 6 core 16+ gb ram amd would do ... what puzzles me is that per daemon construct ... Why would i need/require to have multiple daemons? with separate servers (3 mon + 1 mds - i understood that this is the requirement) i imagine that each will run a single type of daemon.. did i miss something? (beside that maybe is a relation between daemons and block devices and for each block device should be a daemon?) for mon and mds : would help the clients if these are on 10 GbE? for osd : i plan to use a 36 disk server as osd server (ZFS RAIDZ3 all disks + 2 ssds mirror for ZIL and L2ARC) - that would give me ~ 132 TB how much ram i would really need? (128 gb would be way to much i think) (that RAIDZ3 for 36 disks is just a thought - i have also choices like: 2 X 18 RAIDZ2 ; 34 disks RAIDZ3 + 2 hot spare) Regarding journal and scrubbing : by using ZFS i would think that i can safely not use the CEPH ones ... is this ok? Do you have some other advises and recommendations for me? (the read:writes ratios will be 10:1) Thank you!! Adrian ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Persistent Write Back Cache
On 03/04/2015 05:34 AM, John Spray wrote: On 04/03/2015 08:26, Nick Fisk wrote: To illustrate the difference a proper write back cache can make, I put a 1GB (512mb dirty threshold) flashcache in front of my RBD and tweaked the flush parameters to flush dirty blocks at a large queue depth. The same fio test (128k iodepth=1) now runs at 120MB/s and is limited by the performance of SSD used by flashcache, as everything is stored as 4k blocks on the ssd. In fact since everything is stored as 4k blocks, pretty much all IO sizes are accelerated to max speed of the SSD. Looking at iostat I can see all the IO’s are getting coalesced into nice large 512kb IO’s at a high queue depth, which Ceph easily swallows. If librbd could support writing its cache out to SSD it would hopefully achieve the same level of performance and having it integrated would be really neat. What are you hoping to gain from building something into ceph instead of using flashcache/bcache/dm-cache on top of it? It seems like since you would anyway need to handle your HA configuration, setting up the actual cache device would be the simple part. Agreed regarding flashcache/bcache/dm-cache. I suspect improving an existing project rather than reinventing it ourselves would be the way to go. It may also be worth looking at Luis's work, though I note that he specifically says write-through: http://vault2015.sched.org/event/6cc56a5b8a95ead46961697028b59c39#.VPc0uX-etWQ https://github.com/pblcache/pblcache Cheers, John ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Implement replication network with live cluster
I guess it doesn't matter, since my Crush Map will still reference old OSDs that are stopped (and the cluster resynced after that)? I wanted to say: it doesn't matter (I guess?) that my Crush map is still referencing old OSD nodes that are already stopped. Tired, sorry... On 4 March 2015 at 17:48, Andrija Panic andrija.pa...@gmail.com wrote: That was my thought, yes - I found this blog that confirms what you are saying I guess: http://www.sebastien-han.fr/blog/2012/07/29/tip-ceph-public-slash-private-network-configuration/ I will do that... Thx I guess it doesn't matter, since my Crush Map will still reference old OSDs that are stopped (and the cluster resynced after that)? Thx again for the help On 4 March 2015 at 17:44, Robert LeBlanc rob...@leblancnet.us wrote: If I remember right, someone has done this on a live cluster without any issues. I seem to remember that it had a fallback mechanism if the OSDs couldn't be reached on the cluster network to contact them on the public network. You could test it pretty easily without much impact. Take one OSD that has both networks, configure it and restart the process. If all the nodes (specifically the old ones with only one network) are able to connect to it, then you are good to go by restarting one OSD at a time. On Wed, Mar 4, 2015 at 4:17 AM, Andrija Panic andrija.pa...@gmail.com wrote: Hi, I have a live cluster with only a public network (so no explicit network configuration in the ceph.conf file). I'm wondering what the procedure is to implement a dedicated Replication/Private and Public network. I've read the manual and know how to do it in ceph.conf, but I'm wondering, since this is an already running cluster - what should I do after I change ceph.conf on all nodes? Restart OSDs one by one, or...? Is there any downtime expected for the replication network to actually be implemented completely? Another related question: I'm also demoting some old OSDs on old servers. I will have them all stopped, but would like to implement the replication network before actually removing the old OSDs from the crush map - since a lot of data will be moved around. My old nodes/OSDs (that will be stopped before I implement the replication network) do NOT have a dedicated NIC for the replication network, in contrast to the new nodes/OSDs. So there will still be references to these old OSDs in the crush map. Will this be a problem - me changing/implementing a replication network that WILL work on the new nodes/OSDs, but not on the old ones since they don't have a dedicated NIC? I guess not, since the old OSDs are stopped anyway, but I would like an opinion. Or perhaps I might remove the OSDs from the crush map with prior setting of nobackfill and norecover (so no rebalancing happens) and then implement the replication network? Sorry for the long post, but... Thanks, -- Andrija Panić ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Andrija Panić -- Andrija Panić ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
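For reference, a minimal sketch of the kind of ceph.conf change being discussed (the subnets below are made-up placeholders, not taken from Andrija's setup); once it is in place on every node, OSDs can be restarted one at a time so they re-bind to the cluster network:

    [global]
        public network  = 192.168.0.0/24
        cluster network = 10.10.0.0/24

OSDs with no NIC on the cluster network should, per the fallback behaviour Robert describes above, still be reachable over the public network, but as he suggests it is worth testing on a single OSD first.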
Re: [ceph-users] Rebalance/Backfill Throtling - anything missing here?
Hi Robert, I already have this stuff set. Ceph is 0.87.0 now... Thanks, will schedule this for the weekend; with the 10G network and 36 OSDs it should move the data in less than 8h - per my last experience it was around 8h, but some 1G OSDs were included... Thx! On 4 March 2015 at 17:49, Robert LeBlanc rob...@leblancnet.us wrote: You will most likely have a very high relocation percentage. Backfills are always more impactful on smaller clusters, but osd max backfills should be what you need to help reduce the impact. The default is 10, you will want to use 1. I didn't catch which version of Ceph you are running, but I think there was some priority work done in firefly to help make backfills lower priority. I think it has gotten better in later versions. On Wed, Mar 4, 2015 at 1:35 AM, Andrija Panic andrija.pa...@gmail.com wrote: Thank you Robert - I'm wondering, when I remove a total of 7 OSDs from the crush map, whether that will cause more than 37% of data to be moved (80% or whatever). I'm also wondering if the throttling that I applied is fine or not - I will introduce osd_recovery_delay_start 10sec as Irek said. I'm just wondering how much the performance impact will be, because: - when stopping an OSD, the impact while backfilling was fine more or less - I can live with this - when I removed an OSD from the crush map - for the first 1h or so the impact was tremendous, and later on during the recovery process the impact was much less but still noticeable... Thanks for the tip of course! Andrija On 3 March 2015 at 18:34, Robert LeBlanc rob...@leblancnet.us wrote: I would be inclined to shut down both OSDs in a node, let the cluster recover. Once it is recovered, shut down the next two, let it recover. Repeat until all the OSDs are taken out of the cluster. Then I would set nobackfill and norecover. Then remove the hosts/disks from the CRUSH map, then unset nobackfill and norecover. That should give you a few small changes (when you shut down OSDs) and then one big one to get everything in the final place. If you are still adding new nodes, when nobackfill and norecover is set, you can add them in so that the one big relocate fills the new drives too. On Tue, Mar 3, 2015 at 5:58 AM, Andrija Panic andrija.pa...@gmail.com wrote: Thx Irek. The number of replicas is 3. I have 3 servers with 2 OSDs on them on a 1G switch (1 OSD already decommissioned), which is further connected to a new 10G switch/network with 3 servers on it with 12 OSDs each. I'm decommissioning the old 3 nodes on the 1G network... So you suggest removing the whole node with 2 OSDs manually from the crush map? To my knowledge, ceph never places 2 replicas on 1 node; all 3 replicas were originally distributed over all 3 nodes. So anyway it should be safe to remove 2 OSDs at once together with the node itself... since the replica count is 3...? Thx again for your time On Mar 3, 2015 1:35 PM, Irek Fasikhov malm...@gmail.com wrote: Since you have only three nodes in the cluster, I recommend you add the new nodes to the cluster first, and then delete the old ones. 2015-03-03 15:28 GMT+03:00 Irek Fasikhov malm...@gmail.com: How many replicas do you have? 2015-03-03 15:14 GMT+03:00 Andrija Panic andrija.pa...@gmail.com: Hi Irek, yes, stopping an OSD (or setting it to OUT) resulted in only 3% of data degraded and moved/recovered. When I then removed it from the Crush map with ceph osd crush rm id, that's when the stuff with 37% happened. And thanks Irek for the help - could you kindly just let me know the preferred steps when removing a whole node?
Do you mean I first stop all OSDs again, or just remove each OSD from the crush map, or perhaps just decompile the crush map, delete the node completely, compile it back in, and let it heal/recover? Do you think this would result in less data misplaced and moved around? Sorry for bugging you, I really appreciate your help. Thanks On 3 March 2015 at 12:58, Irek Fasikhov malm...@gmail.com wrote: A large percentage of the rebuild comes from the change to the cluster map (but a low percentage of degradation). If you had not done ceph osd crush rm id, the percentage would be low. In your case, the correct option is to remove the entire node, rather than each disk individually 2015-03-03 14:27 GMT+03:00 Andrija Panic andrija.pa...@gmail.com : Another question - I mentioned here 37% of objects being moved around - these are MISPLACED objects (degraded objects were 0.001%), after I removed 1 OSD from the crush map (out of 44 OSDs or so). Can anybody confirm this is normal behaviour - and are there any workarounds? I understand this is because of the object placement algorithm of CEPH, but still, 37% of objects misplaced just by removing 1 OSD out of 44 from the crush map makes me wonder why the percentage is so large? Seems not good to
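For anyone wanting to try the throttling discussed in this thread, a rough sketch of the knobs and flags mentioned above (the values are the ones from the thread, not general recommendations; injectargs applies them at runtime):

    ceph tell osd.* injectargs '--osd-max-backfills 1'
    ceph tell osd.* injectargs '--osd-recovery-delay-start 10'
    # before removing whole hosts from the CRUSH map:
    ceph osd set nobackfill
    ceph osd set norecover
    # ... remove the stopped OSDs/hosts from the crush map ...
    ceph osd unset nobackfill
    ceph osd unset norecover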
Re: [ceph-users] CEPH hardware recommendations and cluster design questions
To expand upon this, the very nature and existence of Ceph is to replace RAID. The FS itself replicates data and handles the HA functionality that you're looking for. If you're going to build a single server with all those disks, backed by a ZFS RAID setup, you're going to be much better suited with an iSCSI setup. The idea of ceph is that it takes the place of all the ZFS bells and whistles. A CEPH cluster that only has one OSD backed by that huge ZFS setup becomes just a wire-protocol to speak to the server. The magic in ceph comes from the replication and distribution of the data across many OSDs, hopefully living in many hosts. My own setup for instance uses 96 OSDs that are spread across 4 hosts (I know I know guys - CPU is a big deal with SSDs so 24 per host is a tall order - didn't know that when we built it - been working ok so far) that is then distributed between 2 cabinets on 2 separate cooling/power/data zones in our datacenter. My CRUSH map is currently setup for 3 copies of all data, and laid out so that at least one copy is located in each cabinet, and then the cab that gets the 2 copies also makes sure that each copy is on a different host. No RAID needed because ceph makes sure that I have a safe amount of copies of the data, in a distribution layout that allows us to sleep at night. In my opinion, ceph is much more pleasant, powerful, and versatile to deal with than both hardware RAID and ZFS (both of which we have instances of deployed as well from previous iterations of infrastructure deployments). Now, you could always create small little zRAID clusters using ZFS, and then give an OSD to each of those, if you wanted even an additional layer of safety. Heck, you could even have hardware RAID behind the zRAID, for even another layer. Where YOU need to make the decision is the trade-off between HA functionality/peace of mind, performance, and usability/maintainability. Would be happy to answer any questions you still have... Cheers, -- Stephen Mercier Senior Systems Architect Attainia, Inc. Phone: 866-288-2464 ext. 727 Email: stephen.merc...@attainia.com Web: www.attainia.com Capital equipment lifecycle planning budgeting solutions for healthcare On Mar 4, 2015, at 10:42 AM, Alexandre DERUMIER wrote: Hi for hardware, inktank have good guides here: http://www.inktank.com/resource/inktank-hardware-selection-guide/ http://www.inktank.com/resource/inktank-hardware-configuration-guide/ ceph works well with multiple osd daemon (1 osd by disk), so you should not use raid. (xfs is the recommended fs for osd daemons). you don't need disk spare too, juste enough disk space to handle a disk failure. (datas are replicated-rebalanced on other disks/osd in case of disk failure) - Mail original - De: Adrian Sevcenco adrian.sevce...@cern.ch À: ceph-users ceph-users@lists.ceph.com Envoyé: Mercredi 4 Mars 2015 18:30:31 Objet: [ceph-users] CEPH hardware recommendations and cluster design questions Hi! I seen the documentation http://ceph.com/docs/master/start/hardware-recommendations/ but those minimum requirements without some recommendations don't tell me much ... So, from what i seen for mon and mds any cheap 6 core 16+ gb ram amd would do ... what puzzles me is that per daemon construct ... Why would i need/require to have multiple daemons? with separate servers (3 mon + 1 mds - i understood that this is the requirement) i imagine that each will run a single type of daemon.. did i miss something?
(beside that maybe is a relation between daemons and block devices and for each block device should be a daemon?) for mon and mds : would help the clients if these are on 10 GbE? for osd : i plan to use a 36 disk server as osd server (ZFS RAIDZ3 all disks + 2 ssds mirror for ZIL and L2ARC) - that would give me ~ 132 TB how much ram i would really need? (128 gb would be way to much i think) (that RAIDZ3 for 36 disks is just a thought - i have also choices like: 2 X 18 RAIDZ2 ; 34 disks RAIDZ3 + 2 hot spare) Regarding journal and scrubbing : by using ZFS i would think that i can safely not use the CEPH ones ... is this ok? Do you have some other advises and recommendations for me? (the read:writes ratios will be 10:1) Thank you!! Adrian ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
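As a rough illustration of the layout Stephen describes above (one copy in each of two cabinets, with the cabinet that holds two copies keeping them on separate hosts), a CRUSH rule along these lines would produce it. This is only a sketch, assuming a bucket type for the cabinet level (called rack here) exists in the hierarchy; it is not Stephen's actual map:

    rule replicated_two_cabs {
        ruleset 1
        type replicated
        min_size 3
        max_size 3
        step take default
        step choose firstn 2 type rack
        step chooseleaf firstn 2 type host
        step emit
    }

With pool size 3, CRUSH picks two racks, then two hosts in each, and uses the first three OSDs of that list: two copies on different hosts in the first rack and one copy in the second.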
Re: [ceph-users] v0.80.8 and librbd performance
On 03/03/2015 03:28 PM, Ken Dreyer wrote: On 03/03/2015 04:19 PM, Sage Weil wrote: Hi, This is just a heads up that we've identified a performance regression in v0.80.8 from previous firefly releases. A v0.80.9 is working its way through QA and should be out in a few days. If you haven't upgraded yet you may want to wait. Thanks! sage Hi Sage, I've seen a couple of Redmine tickets on this (e.g. http://tracker.ceph.com/issues/9854 , http://tracker.ceph.com/issues/10956). It's not totally clear to me which of the 70+ unreleased commits on the firefly branch fix this librbd issue. Is it only the three commits in https://github.com/ceph/ceph/pull/3410 , or are there more? Those are the only ones needed to fix the librbd performance regression, yes. Josh ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] The project of ceph client file system porting from Linux to AIX
I'd like to see a Solaris client. -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Dennis Chen Sent: Wednesday, March 04, 2015 2:00 AM To: ceph-devel; ceph-users; Sage Weil; Loic Dachary Subject: [ceph-users] The project of ceph client file system porting from Linux to AIX Hello, The ceph cluster can currently only be used from Linux systems AFAICT, so I planned to port the ceph client file system from Linux to AIX as a tiered storage solution on that platform. Below is the source code repository I've been working on, which is still in progress. There are 3 important modules: 1. aixker: maintains a uniform kernel API between Linux and AIX 2. net: acts as the data transfer layer between the client and the cluster 3. fs: acts as an adaptor so that AIX can recognize the Linux file system. https://github.com/Dennis-Chen1977/aix-cephfs Any comments are welcome... -- Den ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] New EC pool undersized
Last night I blew away my previous ceph configuration (this environment is pre-production) and have 0.87.1 installed. I've manually edited the crushmap so it now looks like https://dpaste.de/OLEa I currently have 144 OSDs on 8 nodes. After increasing pg_num and pgp_num to a more suitable 1024 (due to the high number of OSDs), everything looked happy. So, now I'm trying to play with an erasure-coded pool. I did:
ceph osd erasure-code-profile set ec44profile k=4 m=4 ruleset-failure-domain=rack
ceph osd pool create ec44pool 8192 8192 erasure ec44profile
After settling for a bit 'ceph status' gives
    cluster 196e5eb8-d6a7-4435-907e-ea028e946923
     health HEALTH_WARN 7 pgs degraded; 7 pgs stuck degraded; 7 pgs stuck unclean; 7 pgs stuck undersized; 7 pgs undersized
     monmap e1: 4 mons at {hobbit01=10.5.38.1:6789/0,hobbit02=10.5.38.2:6789/0,hobbit13=10.5.38.13:6789/0,hobbit14=10.5.38.14:6789/0}, election epoch 6, quorum 0,1,2,3 hobbit01,hobbit02,hobbit13,hobbit14
     osdmap e409: 144 osds: 144 up, 144 in
      pgmap v6763: 12288 pgs, 2 pools, 0 bytes data, 0 objects
            90598 MB used, 640 TB / 640 TB avail
                   7 active+undersized+degraded
               12281 active+clean
So to troubleshoot the undersized pgs, I issued 'ceph pg dump_stuck'
ok
pg_stat objects mip degr misp unf bytes log disklog state state_stamp v reported up up_primary acting acting_primary last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp
1.d77 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-03-04 11:33:57.502849 0'0 408:12 [15,95,58,73,52,31,116,2147483647] 15 [15,95,58,73,52,31,116,2147483647] 15 0'0 2015-03-04 11:33:42.100752 0'0 2015-03-04 11:33:42.100752
1.10fa 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-03-04 11:34:29.362554 0'0 408:12 [23,12,99,114,132,53,56,2147483647] 23 [23,12,99,114,132,53,56,2147483647] 23 0'0 2015-03-04 11:33:42.168571 0'0 2015-03-04 11:33:42.168571
1.1271 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-03-04 11:33:48.795742 0'0 408:12 [135,112,69,4,22,95,2147483647,83] 135 [135,112,69,4,22,95,2147483647,83] 135 0'0 2015-03-04 11:33:42.139555 0'0 2015-03-04 11:33:42.139555
1.2b5 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-03-04 11:34:32.189738 0'0 408:12 [11,115,139,19,76,52,94,2147483647] 11 [11,115,139,19,76,52,94,2147483647] 11 0'0 2015-03-04 11:33:42.079673 0'0 2015-03-04 11:33:42.079673
1.7ae 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-03-04 11:34:26.848344 0'0 408:12 [27,5,132,119,94,56,52,2147483647] 27 [27,5,132,119,94,56,52,2147483647] 27 0'0 2015-03-04 11:33:42.109832 0'0 2015-03-04 11:33:42.109832
1.1a97 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-03-04 11:34:25.457454 0'0 408:12 [20,53,14,54,102,118,2147483647,72] 20 [20,53,14,54,102,118,2147483647,72] 20 0'0 2015-03-04 11:33:42.833850 0'0 2015-03-04 11:33:42.833850
1.10a6 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-03-04 11:34:30.059936 0'0 408:12 [136,22,4,2147483647,72,52,101,55] 136 [136,22,4,2147483647,72,52,101,55] 136 0'0 2015-03-04 11:33:42.125871 0'0 2015-03-04 11:33:42.125871
This appears to have a number on all these (2147483647) that is way out of line from what I would expect. Thoughts? ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] New EC pool undersized
Oh duh… OK, then given a 4+4 erasure coding scheme, 14400/8 is 1800, so try 2048. -don- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Don Doerner Sent: 04 March, 2015 12:14 To: Kyle Hutson; Ceph Users Subject: Re: [ceph-users] New EC pool undersized In this case, that number means that there is not an OSD that can be assigned. What’s your k, m from you erasure coded pool? You’ll need approximately (14400/(k+m)) PGs, rounded up to the next power of 2… -don- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Kyle Hutson Sent: 04 March, 2015 12:06 To: Ceph Users Subject: [ceph-users] New EC pool undersized Last night I blew away my previous ceph configuration (this environment is pre-production) and have 0.87.1 installed. I've manually edited the crushmap so it down looks like https://dpaste.de/OLEahttps://urldefense.proofpoint.com/v1/url?u=https://dpaste.de/OLEak=8F5TVnBDKF32UabxXsxZiA%3D%3D%0Ar=klXZewu0kUquU7GVFsSHwpsWEaffmLRymeSfL%2FX1EJo%3D%0Am=JSfAuDHRgKln0yM%2FQGMT3hZb3rVLUpdn2wGdV3C0Rbk%3D%0As=c1bd46dcd96e656554817882d7f6581903b1e3c6a50313f4bf7494acfd12b442 I currently have 144 OSDs on 8 nodes. After increasing pg_num and pgp_num to a more suitable 1024 (due to the high number of OSDs), everything looked happy. So, now I'm trying to play with an erasure-coded pool. I did: ceph osd erasure-code-profile set ec44profile k=4 m=4 ruleset-failure-domain=rack ceph osd pool create ec44pool 8192 8192 erasure ec44profile After settling for a bit 'ceph status' gives cluster 196e5eb8-d6a7-4435-907e-ea028e946923 health HEALTH_WARN 7 pgs degraded; 7 pgs stuck degraded; 7 pgs stuck unclean; 7 pgs stuck undersized; 7 pgs undersized monmap e1: 4 mons at {hobbit01=10.5.38.1:6789/0,hobbit02=10.5.38.2:6789/0,hobbit13=10.5.38.13:6789/0,hobbit14=10.5.38.14:6789/0https://urldefense.proofpoint.com/v1/url?u=http://10.5.38.1:6789/0%2Chobbit02%3D10.5.38.2:6789/0%2Chobbit13%3D10.5.38.13:6789/0%2Chobbit14%3D10.5.38.14:6789/0k=8F5TVnBDKF32UabxXsxZiA%3D%3D%0Ar=klXZewu0kUquU7GVFsSHwpsWEaffmLRymeSfL%2FX1EJo%3D%0Am=JSfAuDHRgKln0yM%2FQGMT3hZb3rVLUpdn2wGdV3C0Rbk%3D%0As=6fe07b47a00235857630057e09cfb702dcddcea1d3f98d81a574020ee95dee44}, election epoch 6, quorum 0,1,2,3 hobbit01,hobbit02,hobbit13,hobbit14 osdmap e409: 144 osds: 144 up, 144 in pgmap v6763: 12288 pgs, 2 pools, 0 bytes data, 0 objects 90598 MB used, 640 TB / 640 TB avail 7 active+undersized+degraded 12281 active+clean So to troubleshoot the undersized pgs, I issued 'ceph pg dump_stuck' ok pg_stat objects mip degr misp unf bytes log disklog state state_stampvreported up up_primary actingacting_primary last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp 1.d77 00000000 active+undersized+degraded 2015-03-04 11:33:57.502849 0'0 408:12 [15,95,58,73,52,31,116,2147483647] 15 [15,95,58,73,52,31,116,2147483647] 15 0'0 2015-03-04 11:33:42.100752 0'0 2015-03-04 11:33:42.100752 1.10fa00000000 active+undersized+degraded 2015-03-04 11:34:29.362554 0'0 408:12 [23,12,99,114,132,53,56,2147483647] 23 [23,12,99,114,132,53,56,2147483647] 23 0'0 2015-03-04 11:33:42.168571 0'0 2015-03-04 11:33:42.168571 1.127100000000 active+undersized+degraded 2015-03-04 11:33:48.795742 0'0 408:12 [135,112,69,4,22,95,2147483647,83] 135 [135,112,69,4,22,95,2147483647,83] 135 0'0 2015-03-04 11:33:42.139555 0'0 2015-03-04 11:33:42.139555 1.2b5 00000000 active+undersized+degraded 2015-03-04 11:34:32.189738 0'0 408:12 [11,115,139,19,76,52,94,2147483647] 11 [11,115,139,19,76,52,94,2147483647] 11 0'0 2015-03-04 11:33:42.079673 0'0 2015-03-04 
11:33:42.079673 1.7ae 00000000 active+undersized+degraded 2015-03-04 11:34:26.848344 0'0 408:12 [27,5,132,119,94,56,52,2147483647] 27 [27,5,132,119,94,56,52,2147483647] 27 0'0 2015-03-04 11:33:42.109832 0'0 2015-03-04 11:33:42.109832 1.1a9700000000 active+undersized+degraded 2015-03-04 11:34:25.457454 0'0 408:12 [20,53,14,54,102,118,2147483647,72] 20 [20,53,14,54,102,118,2147483647,72] 20 0'0 2015-03-04 11:33:42.833850 0'0 2015-03-04 11:33:42.833850 1.10a600000000 active+undersized+degraded 2015-03-04 11:34:30.059936 0'0 408:12 [136,22,4,2147483647,72,52,101,55] 136 [136,22,4,2147483647,72,52,101,55] 136 0'0 2015-03-04 11:33:42.125871 0'0 2015-03-04 11:33:42.125871 This appears to have a number on all these (2147483647) that is way out of line from what I would expect. Thoughts? The information contained in
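For reference, the rule of thumb being applied here, worked through for this cluster (assuming the usual target of roughly 100 PGs per OSD):

    PGs ~ (number of OSDs x 100) / (K+M)
        = (144 x 100) / (4+4)
        = 14400 / 8
        = 1800, rounded up to the next power of two: 2048

The same arithmetic with the 384 OSDs mentioned later in the thread gives 38400 / 8 = 4800, which is where the 8192 figure further down comes from.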
[ceph-users] Ceph User Teething Problems
I have been following ceph for a long time. I have yet to put it into service, and I keep coming back as btrfs improves and ceph reaches higher version numbers. I am now trying ceph 0.93 and kernel 4.0-rc1. Q1) Is it still considered that btrfs is not robust enough, and that xfs should be used instead? [I am trying with btrfs]. I followed the manual deployment instructions on the web site (http://ceph.com/docs/master/install/manual-deployment/) and I managed to get a monitor and several osds running and apparently working. The instructions fizzle out without explaining how to set up mds. I went back to mkcephfs and got things set up that way. The mds starts. [Please don't mention ceph-deploy] The first thing that I noticed is that (whether I set up mon and osds by following the manual deployment, or using mkcephfs), the correct default pools were not created. bash-4.3# ceph osd lspools 0 rbd, bash-4.3# I get only 'rbd' created automatically. I deleted this pool, and re-created data, metadata and rbd manually. When doing this, I had to juggle with the pg-num in order to avoid the 'too many pgs for osd' warning. I have three osds running at the moment, but intend to add to these when I have some experience of things working reliably. I am puzzled, because I seem to have to set the pg-num for the pool to a number that makes (N-pools x pg-num)/N-osds come to the right kind of number. So this implies that I can't really expand a set of pools by adding osds at a later date. Q2) Is there any obvious reason why my default pools are not getting created automatically as expected? Q3) Can pg-num be modified for a pool later? (If the number of osds is increased dramatically). Finally, when I try to mount cephfs, I get a mount 5 error. A mount 5 error typically occurs if an MDS server is laggy or if it crashed. Ensure at least one MDS is up and running, and the cluster is active + healthy. My mds is running, but its log is not terribly active: 2015-03-04 17:47:43.177349 7f42da2c47c0 0 ceph version 0.93 (bebf8e9a830d998eeaab55f86bb256d4360dd3c4), process ceph-mds, pid 4110 2015-03-04 17:47:43.182716 7f42da2c47c0 -1 mds.-1.0 log_to_monitors {default=true} (This is all there is in the log). I think that a key indicator of the problem must be this from the monitor log: 2015-03-04 16:53:20.715132 7f3cd0014700 1 mon.ceph-mon-00@0(leader).mds e1 warning, MDS mds.? [2001:8b0::5fb3::1fff::9054]:6800/4036 up but filesystem disabled (I have added the '' sections to obscure my ip address) Q4) Can you give me an idea of what is wrong that causes the mds to not play properly? I think that there are some typos on the manual deployment pages, for example: ceph-osd id={osd-num} This is not right. As far as I am aware it should be: ceph-osd -i {osd-num} An observation. In principle, setting things up manually is not all that complicated, provided that clear and unambiguous instructions are provided. This simple piece of documentation is very important. My view is that the existing manual deployment instructions get a bit confused and confusing when they get to the osd setup, and the mds setup is completely absent. For someone who knows, it would be a fairly simple and fairly quick operation to review and revise this part of the documentation. I suspect that this part suffers from being really obvious stuff to the well initiated. For those of us closer to the start, this forms the ends of the threads that have to be picked up before the journey can be made.
Very best regards, David ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] qemu-kvm and cloned rbd image
On 03/03/2015 05:53 PM, Jason Dillaman wrote: Your procedure appears correct to me. Would you mind re-running your cloned image VM with the following ceph.conf properties: [client] rbd cache off debug rbd = 20 log file = /path/writeable/by/qemu.$pid.log If you recreate the issue, would you mind opening a ticket at http://tracker.ceph.com/projects/rbd/issues? Jason, Thanks for the reply. Recreating the issue is not a problem, I can reproduce it any time. The log file was getting a bit large, I destroyed the guest after letting it thrash for about ~3 minutes, plenty of time to hit the problem. I've uploaded it at: http://paste.scsys.co.uk/468868 (~19MB) Do you really think this is a bug and not an err on my side? -K. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] New EC pool undersized
Sorry, I missed your other questions, down at the bottom. See herehttp://ceph.com/docs/master/rados/operations/placement-groups/ (look for “number of replicas for replicated pools or the K+M sum for erasure coded pools”) for the formula; 38400/8 probably implies 8192. The thing is, you’ve got to think about how many ways you can form combinations of 8 unique OSDs (with replacement) that match your failure domain rules. If you’ve only got 8 hosts, and your failure domain is hosts, it severely limits this number. And I have read that too many isn’t good either – a serialization issue, I believe. -don- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Don Doerner Sent: 04 March, 2015 12:49 To: Kyle Hutson Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] New EC pool undersized Hmmm, I just struggled through this myself. How many racks do you have? If not more than 8, you might want to make your failure domain smaller? I.e., maybe host? That, at least, would allow you to debug the situation… -don- From: Kyle Hutson [mailto:kylehut...@ksu.edu] Sent: 04 March, 2015 12:43 To: Don Doerner Cc: Ceph Users Subject: Re: [ceph-users] New EC pool undersized It wouldn't let me simply change the pg_num, giving Error EEXIST: specified pg_num 2048 = current 8192 But that's not a big deal, I just deleted the pool and recreated with 'ceph osd pool create ec44pool 2048 2048 erasure ec44profile' ...and the result is quite similar: 'ceph status' is now ceph status cluster 196e5eb8-d6a7-4435-907e-ea028e946923 health HEALTH_WARN 4 pgs degraded; 4 pgs stuck unclean; 4 pgs undersized monmap e1: 4 mons at {hobbit01=10.5.38.1:6789/0,hobbit02=10.5.38.2:6789/0,hobbit13=10.5.38.13:6789/0,hobbit14=10.5.38.14:6789/0https://urldefense.proofpoint.com/v1/url?u=http://10.5.38.1:6789/0%2Chobbit02%3D10.5.38.2:6789/0%2Chobbit13%3D10.5.38.13:6789/0%2Chobbit14%3D10.5.38.14:6789/0k=8F5TVnBDKF32UabxXsxZiA%3D%3D%0Ar=klXZewu0kUquU7GVFsSHwpsWEaffmLRymeSfL%2FX1EJo%3D%0Am=fHQcjtxx3uADdikQAQAh65Z0s%2FzNFIj544bRY5zThgI%3D%0As=01b7463be37041310163f5d75abc634fab3280633eaef2158ed6609c6f3978d8}, election epoch 6, quorum 0,1,2,3 hobbit01,hobbit02,hobbit13,hobbit14 osdmap e412: 144 osds: 144 up, 144 in pgmap v6798: 6144 pgs, 2 pools, 0 bytes data, 0 objects 90590 MB used, 640 TB / 640 TB avail 4 active+undersized+degraded 6140 active+clean 'ceph pg dump_stuck results' in ok pg_stat objects mip degr misp unf bytes log disklog state state_stampvreported up up_primary actingacting_primary last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp 2.296 00000000 active+undersized+degraded 2015-03-04 14:33:26.672224 0'0 412:9 [5,55,91,2147483647,83,135,53,26] 5 [5,55,91,2147483647,83,135,53,26] 5 0'0 2015-03-04 14:33:15.649911 0'0 2015-03-04 14:33:15.649911 2.69c 00000000 active+undersized+degraded 2015-03-04 14:33:24.984802 0'0 412:9 [93,134,1,74,112,28,2147483647,60] 93 [93,134,1,74,112,28,2147483647,60] 93 0'0 2015-03-04 14:33:15.695747 0'0 2015-03-04 14:33:15.695747 2.36d 00000000 active+undersized+degraded 2015-03-04 14:33:21.937620 0'0 412:9 [12,108,136,104,52,18,63,2147483647]12 [12,108,136,104,52,18,63,2147483647]12 0'0 2015-03-04 14:33:15.652480 0'0 2015-03-04 14:33:15.652480 2.5f7 00000000 active+undersized+degraded 2015-03-04 14:33:26.169242 0'0 412:9 [94,128,73,22,4,60,2147483647,113] 94 [94,128,73,22,4,60,2147483647,113] 94 0'0 2015-03-04 14:33:15.687695 0'0 2015-03-04 14:33:15.687695 I do have questions for you, even at this point, though. 1) Where did you find the formula (14400/(k+m))? 
2) I was really trying to size this for when it goes to production, at which point it may have as many as 384 OSDs. Doesn't that imply I should have even more pgs? On Wed, Mar 4, 2015 at 2:15 PM, Don Doerner don.doer...@quantum.commailto:don.doer...@quantum.com wrote: Oh duh… OK, then given a 4+4 erasure coding scheme, 14400/8 is 1800, so try 2048. -don- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.commailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Don Doerner Sent: 04 March, 2015 12:14 To: Kyle Hutson; Ceph Users Subject: Re: [ceph-users] New EC pool undersized In this case, that number means that there is not an OSD that can be assigned. What’s your k, m from you erasure coded pool? You’ll need approximately (14400/(k+m)) PGs, rounded up to the next power of 2… -don- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Kyle Hutson Sent: 04 March, 2015 12:06 To: Ceph Users Subject: [ceph-users] New EC pool undersized Last night I blew away my previous ceph configuration (this
Re: [ceph-users] New EC pool undersized
So it sounds like I should figure out at 'how many nodes' do I need to increase pg_num to 4096, and again for 8192, and increase those incrementally when as I add more hosts, correct? On Wed, Mar 4, 2015 at 3:04 PM, Don Doerner don.doer...@quantum.com wrote: Sorry, I missed your other questions, down at the bottom. See here http://ceph.com/docs/master/rados/operations/placement-groups/ (look for “number of replicas for replicated pools or the K+M sum for erasure coded pools”) for the formula; 38400/8 probably implies 8192. The thing is, you’ve got to think about how many ways you can form combinations of 8 unique OSDs (with replacement) that match your failure domain rules. If you’ve only got 8 hosts, and your failure domain is hosts, it severely limits this number. And I have read that too many isn’t good either – a serialization issue, I believe. -don- *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf Of *Don Doerner *Sent:* 04 March, 2015 12:49 *To:* Kyle Hutson *Cc:* ceph-users@lists.ceph.com *Subject:* Re: [ceph-users] New EC pool undersized Hmmm, I just struggled through this myself. How many racks do you have? If not more than 8, you might want to make your failure domain smaller? I.e., maybe host? That, at least, would allow you to debug the situation… -don- *From:* Kyle Hutson [mailto:kylehut...@ksu.edu kylehut...@ksu.edu] *Sent:* 04 March, 2015 12:43 *To:* Don Doerner *Cc:* Ceph Users *Subject:* Re: [ceph-users] New EC pool undersized It wouldn't let me simply change the pg_num, giving Error EEXIST: specified pg_num 2048 = current 8192 But that's not a big deal, I just deleted the pool and recreated with 'ceph osd pool create ec44pool 2048 2048 erasure ec44profile' ...and the result is quite similar: 'ceph status' is now ceph status cluster 196e5eb8-d6a7-4435-907e-ea028e946923 health HEALTH_WARN 4 pgs degraded; 4 pgs stuck unclean; 4 pgs undersized monmap e1: 4 mons at {hobbit01= 10.5.38.1:6789/0,hobbit02=10.5.38.2:6789/0,hobbit13=10.5.38.13:6789/0,hobbit14=10.5.38.14:6789/0 https://urldefense.proofpoint.com/v1/url?u=http://10.5.38.1:6789/0%2Chobbit02%3D10.5.38.2:6789/0%2Chobbit13%3D10.5.38.13:6789/0%2Chobbit14%3D10.5.38.14:6789/0k=8F5TVnBDKF32UabxXsxZiA%3D%3D%0Ar=klXZewu0kUquU7GVFsSHwpsWEaffmLRymeSfL%2FX1EJo%3D%0Am=fHQcjtxx3uADdikQAQAh65Z0s%2FzNFIj544bRY5zThgI%3D%0As=01b7463be37041310163f5d75abc634fab3280633eaef2158ed6609c6f3978d8}, election epoch 6, quorum 0,1,2,3 hobbit01,hobbit02,hobbit13,hobbit14 osdmap e412: 144 osds: 144 up, 144 in pgmap v6798: 6144 pgs, 2 pools, 0 bytes data, 0 objects 90590 MB used, 640 TB / 640 TB avail 4 active+undersized+degraded 6140 active+clean 'ceph pg dump_stuck results' in ok pg_stat objects mip degr misp unf bytes log disklog state state_stampvreported up up_primary actingacting_primary last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp 2.296 00000000 active+undersized+degraded2015-03-04 14:33:26.672224 0'0 412:9 [5,55,91,2147483647,83,135,53,26] 5 [5,55,91,2147483647,83,135,53,26] 50'0 2015-03-04 14:33:15.649911 0'0 2015-03-04 14:33:15.649911 2.69c 00000000 active+undersized+degraded2015-03-04 14:33:24.984802 0'0 412:9 [93,134,1,74,112,28,2147483647,60] 93 [93,134,1,74,112,28,2147483647 ,60] 93 0'0 2015-03-04 14:33:15.695747 0'0 2015-03-04 14:33:15.695747 2.36d 00000000 active+undersized+degraded2015-03-04 14:33:21.937620 0'0 412:9 [12,108,136,104,52,18,63,2147483647]12 [12,108,136,104,52,18,63, 2147483647]12 0'0 2015-03-04 14:33:15.6524800'0 2015-03-04 14:33:15.652480 2.5f7 00000000 
active+undersized+degraded2015-03-04 14:33:26.169242 0'0 412:9 [94,128,73,22,4,60,2147483647,113] 94 [94,128,73,22,4,60,2147483647 ,113] 94 0'0 2015-03-04 14:33:15.687695 0'0 2015-03-04 14:33:15.687695 I do have questions for you, even at this point, though. 1) Where did you find the formula (14400/(k+m))? 2) I was really trying to size this for when it goes to production, at which point it may have as many as 384 OSDs. Doesn't that imply I should have even more pgs? On Wed, Mar 4, 2015 at 2:15 PM, Don Doerner don.doer...@quantum.com wrote: Oh duh… OK, then given a 4+4 erasure coding scheme, 14400/8 is 1800, so try 2048. -don- *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf Of *Don Doerner *Sent:* 04 March, 2015 12:14 *To:* Kyle Hutson; Ceph Users *Subject:* Re: [ceph-users] New EC pool undersized In this case, that number means that there is not an OSD that can be assigned. What’s your k, m from you erasure coded pool? You’ll need
Re: [ceph-users] New EC pool undersized
That did it. 'step set_choose_tries 200' fixed the problem right away. Thanks Yann! On Wed, Mar 4, 2015 at 2:59 PM, Yann Dupont y...@objoo.org wrote: Le 04/03/2015 21:48, Don Doerner a écrit : Hmmm, I just struggled through this myself. How many racks do you have? If not more than 8, you might want to make your failure domain smaller? I.e., maybe host? That, at least, would allow you to debug the situation… -don- Hello, I think I already had this problem. It's explained here http://tracker.ceph.com/issues/10350 And solution is probably here : http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/ Section : CRUSH gives up too soon Cheers, Yann ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
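For anyone hitting the same thing, a rough sketch of how that tunable gets applied, following the troubleshooting doc linked above (rule and file names here are illustrative):

    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt
    # in the erasure rule, before the 'step take' line, add e.g.:
    #     step set_choose_tries 200
    crushtool -c crushmap.txt -o crushmap.new
    ceph osd setcrushmap -i crushmap.new

The new mappings can also be sanity-checked offline first with something like 'crushtool -i crushmap.new --test --show-bad-mappings --rule <ruleset> --num-rep 8'.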
Re: [ceph-users] Ceph User Teething Problems
On 04/03/2015 20:27, Datatone Lists wrote: I have been following ceph for a long time. I have yet to put it into service, and I keep coming back as btrfs improves and ceph reaches higher version numbers. I am now trying ceph 0.93 and kernel 4.0-rc1. Q1) Is it still considered that btrfs is not robust enough, and that xfs should be used instead? [I am trying with btrfs]. XFS is still the recommended default backend (http://ceph.com/docs/master/rados/configuration/filesystem-recommendations/#filesystems) I followed the manual deployment instructions on the web site (http://ceph.com/docs/master/install/manual-deployment/) and I managed to get a monitor and several osds running and apparently working. The instructions fizzle out without explaining how to set up mds. I went back to mkcephfs and got things set up that way. The mds starts. [Please don't mention ceph-deploy] This kind of comment isn't very helpful unless there is a specific issue with ceph-deploy that is preventing you from using it, and causing you to resort to manual steps. I happen to find ceph-deploy very useful, so I'm afraid I'm going to mention it anyway :-) The first thing that I noticed is that (whether I set up mon and osds by following the manual deployment, or using mkcephfs), the correct default pools were not created. This is not a bug. The 'data' and 'metadata' pools are no longer created by default. http://docs.ceph.com/docs/master/cephfs/createfs/ I get only 'rbd' created automatically. I deleted this pool, and re-created data, metadata and rbd manually. When doing this, I had to juggle with the pg-num in order to avoid the 'too many pgs for osd'. I have three osds running at the moment, but intend to add to these when I have some experience of things working reliably. I am puzzled, because I seem to have to set the pg-num for the pool to a number that makes (N-pools x pg-num)/N-osds come to the right kind of number. So this implies that I can't really expand a set of pools by adding osds at a later date. You should pick an appropriate number of PGs for the number of OSDs you have at the present time. When you add more OSDs, you can increase the number of PGs. You would not want to create the larger number of PGs initially, as they could exceed the resources available on your initial small number of OSDs. Q4) Can you give me an idea of what is wrong that causes the mds to not play properly? You have to explicitly enable the filesystem now (also http://docs.ceph.com/docs/master/cephfs/createfs/) I think that there are some typos on the manual deployment pages, for example: ceph-osd id={osd-num} This is not right. As far as I am aware it should be: ceph-osd -i {osd-num} ceph-osd id={osd-num} is an upstart invocation (i.e. it's prefaced with sudo start on the manual deployment page). In that context it's correct afaik, unless you're finding otherwise? John ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
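For completeness, a minimal sketch of the explicit filesystem creation step John refers to (pool names and PG counts here are just placeholders; see the createfs doc he links for details):

    ceph osd pool create cephfs_data 256
    ceph osd pool create cephfs_metadata 256
    ceph fs new cephfs cephfs_metadata cephfs_data
    ceph mds stat    # the MDS should now go active instead of sitting idle

Once the filesystem exists, the 'up but filesystem disabled' warning in the monitor log should go away, and with it the mount 5 error.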
Re: [ceph-users] Perf problem after upgrade from dumpling to firefly
The change is only on the OSDs (and not on the OSD journal). Do you see twice the IOPS for both reads and writes? If only reads, maybe a read-ahead bug could explain this. - Mail original - De: Olivier Bonvalet ceph.l...@daevel.fr À: aderumier aderum...@odiso.com Cc: ceph-users ceph-users@lists.ceph.com Envoyé: Mercredi 4 Mars 2015 15:13:30 Objet: Re: [ceph-users] Perf problem after upgrade from dumpling to firefly Ceph health is OK, yes. The «firefly-upgrade-cluster-IO.png» graph is about IO stats seen by ceph: there is no change between dumpling and firefly. The change is only on the OSDs (and not on the OSD journal). Le mercredi 04 mars 2015 à 15:05 +0100, Alexandre DERUMIER a écrit : The load problem is permanent: I have twice the IO/s on the HDDs since firefly. Oh, permanent, that's strange. (If you don't see more traffic coming from clients, I don't understand...) Do you also see twice the IOs/ops in the ceph -w stats? Is the ceph health OK? - Mail original - De: Olivier Bonvalet ceph.l...@daevel.fr À: aderumier aderum...@odiso.com Cc: ceph-users ceph-users@lists.ceph.com Envoyé: Mercredi 4 Mars 2015 14:49:41 Objet: Re: [ceph-users] Perf problem after upgrade from dumpling to firefly Thanks Alexandre. The load problem is permanent: I have twice the IO/s on the HDDs since firefly. And yes, the problem hangs production at night during snap trimming. I suppose there is a new OSD parameter which changes the behavior of the journal, or something like that. But I didn't find anything about that. Olivier Le mercredi 04 mars 2015 à 14:44 +0100, Alexandre DERUMIER a écrit : Hi, maybe this is related?: http://tracker.ceph.com/issues/9503 Dumpling: removing many snapshots in a short time makes OSDs go berserk http://tracker.ceph.com/issues/9487 dumpling: snaptrimmer causes slow requests while backfilling. osd_snap_trim_sleep not helping http://lists.opennebula.org/pipermail/ceph-users-ceph.com/2014-December/045116.html I think it's already backported in dumpling, not sure it's already done for firefly Alexandre - Mail original - De: Olivier Bonvalet ceph.l...@daevel.fr À: ceph-users ceph-users@lists.ceph.com Envoyé: Mercredi 4 Mars 2015 12:10:30 Objet: [ceph-users] Perf problem after upgrade from dumpling to firefly Hi, last Saturday I upgraded my production cluster from dumpling to emperor (since we were successfully using it on a test cluster). A couple of hours later, we had OSDs falling over: some of them were marked as down by Ceph, probably because of IO starvation. I marked the cluster «noout», started the downed OSDs, then let it recover. 24h later, same problem (at nearly the same hour). So I chose to upgrade directly to firefly, which is maintained. Things are better, but the cluster is slower than with dumpling. The main problem seems to be that the OSDs have twice as many write operations per second: https://daevel.fr/img/firefly/firefly-upgrade-OSD70-IO.png https://daevel.fr/img/firefly/firefly-upgrade-OSD71-IO.png But the journal doesn't change (SSD dedicated to OSD70+71+72): https://daevel.fr/img/firefly/firefly-upgrade-OSD70+71-journal.png Nor does node bandwidth: https://daevel.fr/img/firefly/firefly-upgrade-dragan-bandwidth.png Or whole-cluster IO activity: https://daevel.fr/img/firefly/firefly-upgrade-cluster-IO.png Some background: the cluster is split into pools with «full SSD» OSDs and «HDD+SSD journal» OSDs. Only the «HDD+SSD» OSDs seem to be affected. I have 9 OSDs per «HDD+SSD» node (9 HDDs and 3 SSDs), and only 3 «HDD+SSD» nodes (so a total of 27 «HDD+SSD» OSDs). The IO peak between 03h00 and 09h00 corresponds to snapshot rotation (= «rbd snap rm» operations).
osd_snap_trim_sleep has been set to 0.8 for months. Yesterday I tried to reduce osd_pg_max_concurrent_snap_trims to 1. It doesn't seem to really help. The only thing which seems to help is to reduce osd_disk_threads from 8 to 1. So... any idea about what's happening? Thanks for any help, Olivier ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
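For reference, a rough sketch of where the settings Olivier mentions live (the values are the ones he quotes, not recommendations; osd_disk_threads generally needs an OSD restart to take effect, while the snap trim settings can usually be changed at runtime):

    [osd]
        osd snap trim sleep = 0.8
        osd pg max concurrent snap trims = 1
        osd disk threads = 1

or applied on a running cluster:

    ceph tell osd.* injectargs '--osd-snap-trim-sleep 0.8 --osd-pg-max-concurrent-snap-trims 1'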
Re: [ceph-users] qemu-kvm and cloned rbd image
On 03/02/2015 04:16 AM, koukou73gr wrote: Hello, Today I thought I'd experiment with snapshots and cloning. So I did: rbd import --image-format=2 vm-proto.raw rbd/vm-proto rbd snap create rbd/vm-proto@s1 rbd snap protect rbd/vm-proto@s1 rbd clone rbd/vm-proto@s1 rbd/server And then proceeded to create a qemu-kvm guest with rbd/server as its backing store. The guest booted but as soon as it got to mount the root fs, things got weird: What does the qemu command line look like? [...] scsi2 : Virtio SCSI HBA scsi 2:0:0:0: Direct-Access QEMU QEMU HARDDISK1.5. PQ: 0 ANSI: 5 sd 2:0:0:0: [sda] 20971520 512-byte logical blocks: (10.7 GB/10.0 GiB) sd 2:0:0:0: [sda] Write Protect is off sd 2:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA sda: sda1 sda2 sd 2:0:0:0: [sda] Attached SCSI disk dracut: Scanning devices sda2 for LVM logical volumes vg_main/lv_swap vg_main/lv_root dracut: inactive '/dev/vg_main/lv_swap' [1.00 GiB] inherit dracut: inactive '/dev/vg_main/lv_root' [6.50 GiB] inherit EXT4-fs (dm-1): INFO: recovery required on readonly filesystem This suggests the disk is being exposed as read-only via QEMU, perhaps via qemu's snapshot or other options. You can use a clone in exactly the same way as any other rbd image. If you're running QEMU manually, for example, something like: qemu-kvm -drive file=rbd:rbd/server,format=raw,cache=writeback is fine for using the clone. QEMU is supposed to be unaware of any snapshots, parents, etc. at the rbd level. Josh ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] New EC pool undersized
It wouldn't let me simply change the pg_num, giving Error EEXIST: specified pg_num 2048 = current 8192 But that's not a big deal, I just deleted the pool and recreated with 'ceph osd pool create ec44pool 2048 2048 erasure ec44profile' ...and the result is quite similar: 'ceph status' is now ceph status cluster 196e5eb8-d6a7-4435-907e-ea028e946923 health HEALTH_WARN 4 pgs degraded; 4 pgs stuck unclean; 4 pgs undersized monmap e1: 4 mons at {hobbit01= 10.5.38.1:6789/0,hobbit02=10.5.38.2:6789/0,hobbit13=10.5.38.13:6789/0,hobbit14=10.5.38.14:6789/0}, election epoch 6, quorum 0,1,2,3 hobbit01,hobbit02,hobbit13,hobbit14 osdmap e412: 144 osds: 144 up, 144 in pgmap v6798: 6144 pgs, 2 pools, 0 bytes data, 0 objects 90590 MB used, 640 TB / 640 TB avail 4 active+undersized+degraded 6140 active+clean 'ceph pg dump_stuck results' in ok pg_stat objects mip degr misp unf bytes log disklog state state_stamp v reported up up_primary acting acting_primary last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp 2.296 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-03-04 14:33:26.672224 0'0 412:9 [5,55,91,2147483647,83,135,53,26] 5 [5,55,91,2147483647,83,135,53,26] 5 0'0 2015-03-04 14:33:15.649911 0'0 2015-03-04 14:33:15.649911 2.69c 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-03-04 14:33:24.984802 0'0 412:9 [93,134,1,74,112,28,2147483647,60] 93 [93,134,1,74,112,28,2147483647,60] 93 0'0 2015-03-04 14:33:15.695747 0'0 2015-03-04 14:33:15.695747 2.36d 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-03-04 14:33:21.937620 0'0 412:9 [12,108,136,104,52,18,63,2147483647] 12 [12,108,136,104,52,18,63,2147483647] 12 0'0 2015-03-04 14:33:15.652480 0'0 2015-03-04 14:33:15.652480 2.5f7 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-03-04 14:33:26.169242 0'0 412:9 [94,128,73,22,4,60,2147483647,113] 94 [94,128,73,22,4,60,2147483647,113] 94 0'0 2015-03-04 14:33:15.687695 0'0 2015-03-04 14:33:15.687695 I do have questions for you, even at this point, though. 1) Where did you find the formula (14400/(k+m))? 2) I was really trying to size this for when it goes to production, at which point it may have as many as 384 OSDs. Doesn't that imply I should have even more pgs? On Wed, Mar 4, 2015 at 2:15 PM, Don Doerner don.doer...@quantum.com wrote: Oh duh… OK, then given a 4+4 erasure coding scheme, 14400/8 is 1800, so try 2048. -don- *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf Of *Don Doerner *Sent:* 04 March, 2015 12:14 *To:* Kyle Hutson; Ceph Users *Subject:* Re: [ceph-users] New EC pool undersized In this case, that number means that there is not an OSD that can be assigned. What’s your k, m from you erasure coded pool? You’ll need approximately (14400/(k+m)) PGs, rounded up to the next power of 2… -don- *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com ceph-users-boun...@lists.ceph.com] *On Behalf Of *Kyle Hutson *Sent:* 04 March, 2015 12:06 *To:* Ceph Users *Subject:* [ceph-users] New EC pool undersized Last night I blew away my previous ceph configuration (this environment is pre-production) and have 0.87.1 installed. I've manually edited the crushmap so it down looks like https://dpaste.de/OLEa https://urldefense.proofpoint.com/v1/url?u=https://dpaste.de/OLEak=8F5TVnBDKF32UabxXsxZiA%3D%3D%0Ar=klXZewu0kUquU7GVFsSHwpsWEaffmLRymeSfL%2FX1EJo%3D%0Am=JSfAuDHRgKln0yM%2FQGMT3hZb3rVLUpdn2wGdV3C0Rbk%3D%0As=c1bd46dcd96e656554817882d7f6581903b1e3c6a50313f4bf7494acfd12b442 I currently have 144 OSDs on 8 nodes. 
After increasing pg_num and pgp_num to a more suitable 1024 (due to the high number of OSDs), everything looked happy. So, now I'm trying to play with an erasure-coded pool. I did: ceph osd erasure-code-profile set ec44profile k=4 m=4 ruleset-failure-domain=rack ceph osd pool create ec44pool 8192 8192 erasure ec44profile After settling for a bit 'ceph status' gives cluster 196e5eb8-d6a7-4435-907e-ea028e946923 health HEALTH_WARN 7 pgs degraded; 7 pgs stuck degraded; 7 pgs stuck unclean; 7 pgs stuck undersized; 7 pgs undersized monmap e1: 4 mons at {hobbit01= 10.5.38.1:6789/0,hobbit02=10.5.38.2:6789/0,hobbit13=10.5.38.13:6789/0,hobbit14=10.5.38.14:6789/0 https://urldefense.proofpoint.com/v1/url?u=http://10.5.38.1:6789/0%2Chobbit02%3D10.5.38.2:6789/0%2Chobbit13%3D10.5.38.13:6789/0%2Chobbit14%3D10.5.38.14:6789/0k=8F5TVnBDKF32UabxXsxZiA%3D%3D%0Ar=klXZewu0kUquU7GVFsSHwpsWEaffmLRymeSfL%2FX1EJo%3D%0Am=JSfAuDHRgKln0yM%2FQGMT3hZb3rVLUpdn2wGdV3C0Rbk%3D%0As=6fe07b47a00235857630057e09cfb702dcddcea1d3f98d81a574020ee95dee44}, election epoch 6, quorum 0,1,2,3 hobbit01,hobbit02,hobbit13,hobbit14 osdmap e409: 144 osds: 144 up, 144 in pgmap v6763: 12288 pgs, 2 pools, 0 bytes data, 0 objects 90598 MB used, 640 TB / 640 TB avail 7 active+undersized+degraded 12281 active+clean So to
Re: [ceph-users] New EC pool undersized
Hmmm, I just struggled through this myself. How many racks do you have? If not more than 8, you might want to make your failure domain smaller? I.e., maybe host? That, at least, would allow you to debug the situation… -don- From: Kyle Hutson [mailto:kylehut...@ksu.edu] Sent: 04 March, 2015 12:43 To: Don Doerner Cc: Ceph Users Subject: Re: [ceph-users] New EC pool undersized It wouldn't let me simply change the pg_num, giving Error EEXIST: specified pg_num 2048 = current 8192 But that's not a big deal, I just deleted the pool and recreated with 'ceph osd pool create ec44pool 2048 2048 erasure ec44profile' ...and the result is quite similar: 'ceph status' is now ceph status cluster 196e5eb8-d6a7-4435-907e-ea028e946923 health HEALTH_WARN 4 pgs degraded; 4 pgs stuck unclean; 4 pgs undersized monmap e1: 4 mons at {hobbit01=10.5.38.1:6789/0,hobbit02=10.5.38.2:6789/0,hobbit13=10.5.38.13:6789/0,hobbit14=10.5.38.14:6789/0https://urldefense.proofpoint.com/v1/url?u=http://10.5.38.1:6789/0%2Chobbit02%3D10.5.38.2:6789/0%2Chobbit13%3D10.5.38.13:6789/0%2Chobbit14%3D10.5.38.14:6789/0k=8F5TVnBDKF32UabxXsxZiA%3D%3D%0Ar=klXZewu0kUquU7GVFsSHwpsWEaffmLRymeSfL%2FX1EJo%3D%0Am=fHQcjtxx3uADdikQAQAh65Z0s%2FzNFIj544bRY5zThgI%3D%0As=01b7463be37041310163f5d75abc634fab3280633eaef2158ed6609c6f3978d8}, election epoch 6, quorum 0,1,2,3 hobbit01,hobbit02,hobbit13,hobbit14 osdmap e412: 144 osds: 144 up, 144 in pgmap v6798: 6144 pgs, 2 pools, 0 bytes data, 0 objects 90590 MB used, 640 TB / 640 TB avail 4 active+undersized+degraded 6140 active+clean 'ceph pg dump_stuck results' in ok pg_stat objects mip degr misp unf bytes log disklog state state_stampvreported up up_primary actingacting_primary last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp 2.296 00000000 active+undersized+degraded 2015-03-04 14:33:26.672224 0'0 412:9 [5,55,91,2147483647,83,135,53,26] 5 [5,55,91,2147483647,83,135,53,26] 5 0'0 2015-03-04 14:33:15.649911 0'0 2015-03-04 14:33:15.649911 2.69c 00000000 active+undersized+degraded 2015-03-04 14:33:24.984802 0'0 412:9 [93,134,1,74,112,28,2147483647,60] 93 [93,134,1,74,112,28,2147483647,60] 93 0'0 2015-03-04 14:33:15.695747 0'0 2015-03-04 14:33:15.695747 2.36d 00000000 active+undersized+degraded 2015-03-04 14:33:21.937620 0'0 412:9 [12,108,136,104,52,18,63,2147483647]12 [12,108,136,104,52,18,63,2147483647]12 0'0 2015-03-04 14:33:15.652480 0'0 2015-03-04 14:33:15.652480 2.5f7 00000000 active+undersized+degraded 2015-03-04 14:33:26.169242 0'0 412:9 [94,128,73,22,4,60,2147483647,113] 94 [94,128,73,22,4,60,2147483647,113] 94 0'0 2015-03-04 14:33:15.687695 0'0 2015-03-04 14:33:15.687695 I do have questions for you, even at this point, though. 1) Where did you find the formula (14400/(k+m))? 2) I was really trying to size this for when it goes to production, at which point it may have as many as 384 OSDs. Doesn't that imply I should have even more pgs? On Wed, Mar 4, 2015 at 2:15 PM, Don Doerner don.doer...@quantum.commailto:don.doer...@quantum.com wrote: Oh duh… OK, then given a 4+4 erasure coding scheme, 14400/8 is 1800, so try 2048. -don- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.commailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Don Doerner Sent: 04 March, 2015 12:14 To: Kyle Hutson; Ceph Users Subject: Re: [ceph-users] New EC pool undersized In this case, that number means that there is not an OSD that can be assigned. What’s your k, m from you erasure coded pool? 
You'll need approximately (14400/(k+m)) PGs, rounded up to the next power of 2… -don- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Kyle Hutson Sent: 04 March, 2015 12:06 To: Ceph Users Subject: [ceph-users] New EC pool undersized Last night I blew away my previous ceph configuration (this environment is pre-production) and have 0.87.1 installed. I've manually edited the crushmap so it now looks like https://dpaste.de/OLEa I currently have 144 OSDs on 8 nodes. After increasing pg_num and pgp_num to a more suitable 1024 (due to the high number of OSDs), everything looked happy. So, now I'm trying to play with an erasure-coded pool. I did: ceph osd erasure-code-profile set ec44profile k=4 m=4 ruleset-failure-domain=rack ceph osd pool create ec44pool 8192 8192 erasure ec44profile After settling for a bit 'ceph status' gives cluster
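For anyone following along, the sizing rule Don is quoting works out like this (a rough sketch, assuming the usual target of around 100 PGs per OSD from the placement-groups documentation; the exact target is a judgment call):

  total PGs ~= (number of OSDs x 100) / (k + m)
  144 OSDs: (144 x 100) / (4 + 4) = 1800, rounded up to the next power of 2 = 2048
  384 OSDs: (384 x 100) / (4 + 4) = 4800, rounded up to the next power of 2 = 8192

which matches the 14400/8 and 38400/8 figures used elsewhere in the thread.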
Re: [ceph-users] Ceph User Teething Problems
I can't help much on the MDS front, but here is some answers and my view on some of it. On Wed, Mar 4, 2015 at 1:27 PM, Datatone Lists li...@datatone.co.uk wrote: I have been following ceph for a long time. I have yet to put it into service, and I keep coming back as btrfs improves and ceph reaches higher version numbers. I am now trying ceph 0.93 and kernel 4.0-rc1. Q1) Is it still considered that btrfs is not robust enough, and that xfs should be used instead? [I am trying with btrfs]. We are moving forward with btrfs on our production cluster aware that there may be performance issues. So far, it seems the later kernels have resolved the issues we've seen with snapshots. As the system grows we will keep an eye on it and are prepared to move to XFS if needed. I followed the manual deployment instructions on the web site (http://ceph.com/docs/master/install/manual-deployment/) and I managed to get a monitor and several osds running and apparently working. The instructions fizzle out without explaining how to set up mds. I went back to mkcephfs and got things set up that way. The mds starts. [Please don't mention ceph-deploy] The first thing that I noticed is that (whether I set up mon and osds by following the manual deployment, or using mkcephfs), the correct default pools were not created. bash-4.3# ceph osd lspools 0 rbd, bash-4.3# I get only 'rbd' created automatically. I deleted this pool, and re-created data, metadata and rbd manually. When doing this, I had to juggle with the pg- num in order to avoid the 'too many pgs for osd'. I have three osds running at the moment, but intend to add to these when I have some experience of things working reliably. I am puzzled, because I seem to have to set the pg-num for the pool to a number that makes (N-pools x pg-num)/N-osds come to the right kind of number. So this implies that I can't really expand a set of pools by adding osds at a later date. Q2) Is there any obvious reason why my default pools are not getting created automatically as expected? Since Giant, these pools are not automatically created, only the rbd pool is. Q3) Can pg-num be modified for a pool later? (If the number of osds is increased dramatically). pg_num and pgp_num can be increased (not decreased) on the fly later to expand with more OSDs. Finally, when I try to mount cephfs, I get a mount 5 error. A mount 5 error typically occurs if a MDS server is laggy or if it crashed. Ensure at least one MDS is up and running, and the cluster is active + healthy. My mds is running, but its log is not terribly active: 2015-03-04 17:47:43.177349 7f42da2c47c0 0 ceph version 0.93 (bebf8e9a830d998eeaab55f86bb256d4360dd3c4), process ceph-mds, pid 4110 2015-03-04 17:47:43.182716 7f42da2c47c0 -1 mds.-1.0 log_to_monitors {default=true} (This is all there is in the log). I think that a key indicator of the problem must be this from the monitor log: 2015-03-04 16:53:20.715132 7f3cd0014700 1 mon.ceph-mon-00@0(leader).mds e1 warning, MDS mds.? [2001:8b0::5fb3::1fff::9054]:6800/4036 up but filesystem disabled (I have added the '' sections to obscure my ip address) Q4) Can you give me an idea of what is wrong that causes the mds to not play properly? I think that there are some typos on the manual deployment pages, for example: ceph-osd id={osd-num} This is not right. As far as I am aware it should be: ceph-osd -i {osd-num} There are a few of these, usually running --help for the command gives you the right syntax needed for the version you have installed. But it is still very confusing. 
An observation. In principle, setting things up manually is not all that complicated, provided that clear and unambiguous instructions are provided. This simple piece of documentation is very important. My view is that the existing manual deployment instructions gets a bit confused and confusing when it gets to the osd setup, and the mds setup is completely absent. For someone who knows, this would be a fairly simple and fairly quick operation to review and revise this part of the documentation. I suspect that this part suffers from being really obvious stuff to the well initiated. For those of us closer to the start, this forms the ends of the threads that have to be picked up before the journey can be made. Very best regards, David ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
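One note on Q4 above: on current releases (the default data/metadata pools are no longer created since Giant, as mentioned earlier), the monitor message "up but filesystem disabled" normally just means that no CephFS filesystem has been created yet, so the MDS has nothing to serve and clients get mount error 5. A minimal sketch of the missing step, assuming the 'metadata' and 'data' pools that were recreated by hand earlier in this message (the filesystem name 'cephfs' is arbitrary):

  ceph fs new cephfs metadata data
  ceph mds stat   # should eventually report the MDS as up:active

mkcephfs and the manual deployment page pre-date this requirement, which is probably why the step is missing from both.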
Re: [ceph-users] New EC pool undersized
Le 04/03/2015 21:48, Don Doerner a écrit : Hmmm, I just struggled through this myself.How many racks do you have?If not more than 8, you might want to make your failure domain smaller?I.e., maybe host?That, at least, would allow you to debug the situation… -don- Hello, I think I already had this problem. It's explained here http://tracker.ceph.com/issues/10350 And solution is probably here : http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/ Section : CRUSH gives up too soon Cheers, Yann ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
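For reference, the "CRUSH gives up too soon" workaround from that troubleshooting page boils down to raising the number of placement retries in the erasure-coded rule. A rough sketch (the value 100 is the one suggested in the documentation, not something tuned for this cluster):

  ceph osd getcrushmap -o /tmp/crushmap
  crushtool -d /tmp/crushmap -o /tmp/crushmap.txt
  # edit the ec44pool rule and add, before the 'step take' line:
  #   step set_choose_tries 100
  crushtool -c /tmp/crushmap.txt -o /tmp/crushmap.new
  ceph osd setcrushmap -i /tmp/crushmap.new

When the number of failure-domain buckets is equal to (or barely larger than) the number of chunks, CRUSH occasionally fails to find a distinct bucket for every chunk within the default number of tries, which is what produces the handful of undersized PGs and the 2147483647 (ITEM_NONE) entries in the acting sets.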
Re: [ceph-users] New EC pool undersized
My lowest level (other than OSD) is 'disktype' (based on the crushmaps at http://www.sebastien-han.fr/blog/2014/08/25/ceph-mix-sata-and-ssd-within-the-same-box/ ) since I have SSDs and HDDs on the same host. I just made that change (deleted the pool, deleted the profile, deleted the crush ruleset), then re-created using ruleset-failure-domain=disktype. Very similar results. health HEALTH_WARN 3 pgs degraded; 3 pgs stuck unclean; 3 pgs undersized 'ceph pg dump stuck' looks very similar to the last one I posted. On Wed, Mar 4, 2015 at 2:48 PM, Don Doerner don.doer...@quantum.com wrote: Hmmm, I just struggled through this myself. How many racks do you have? If not more than 8, you might want to make your failure domain smaller? I.e., maybe host? That, at least, would allow you to debug the situation… -don- *From:* Kyle Hutson [mailto:kylehut...@ksu.edu] *Sent:* 04 March, 2015 12:43 *To:* Don Doerner *Cc:* Ceph Users *Subject:* Re: [ceph-users] New EC pool undersized It wouldn't let me simply change the pg_num, giving Error EEXIST: specified pg_num 2048 = current 8192 But that's not a big deal, I just deleted the pool and recreated with 'ceph osd pool create ec44pool 2048 2048 erasure ec44profile' ...and the result is quite similar: 'ceph status' is now ceph status cluster 196e5eb8-d6a7-4435-907e-ea028e946923 health HEALTH_WARN 4 pgs degraded; 4 pgs stuck unclean; 4 pgs undersized monmap e1: 4 mons at {hobbit01=10.5.38.1:6789/0,hobbit02=10.5.38.2:6789/0,hobbit13=10.5.38.13:6789/0,hobbit14=10.5.38.14:6789/0}, election epoch 6, quorum 0,1,2,3 hobbit01,hobbit02,hobbit13,hobbit14 osdmap e412: 144 osds: 144 up, 144 in pgmap v6798: 6144 pgs, 2 pools, 0 bytes data, 0 objects 90590 MB used, 640 TB / 640 TB avail 4 active+undersized+degraded 6140 active+clean 'ceph pg dump_stuck results' in ok pg_stat objects mip degr misp unf bytes log disklog state state_stamp v reported up up_primary acting acting_primary last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp 2.296 00000000 active+undersized+degraded 2015-03-04 14:33:26.672224 0'0 412:9 [5,55,91,2147483647,83,135,53,26] 5 [5,55,91,2147483647,83,135,53,26] 5 0'0 2015-03-04 14:33:15.649911 0'0 2015-03-04 14:33:15.649911 2.69c 00000000 active+undersized+degraded 2015-03-04 14:33:24.984802 0'0 412:9 [93,134,1,74,112,28,2147483647,60] 93 [93,134,1,74,112,28,2147483647,60] 93 0'0 2015-03-04 14:33:15.695747 0'0 2015-03-04 14:33:15.695747 2.36d 00000000 active+undersized+degraded 2015-03-04 14:33:21.937620 0'0 412:9 [12,108,136,104,52,18,63,2147483647] 12 [12,108,136,104,52,18,63,2147483647] 12 0'0 2015-03-04 14:33:15.652480 0'0 2015-03-04 14:33:15.652480 2.5f7 00000000 active+undersized+degraded 2015-03-04 14:33:26.169242 0'0 412:9 [94,128,73,22,4,60,2147483647,113] 94 [94,128,73,22,4,60,2147483647,113] 94 0'0 2015-03-04 14:33:15.687695 0'0 2015-03-04 14:33:15.687695 I do have questions for you, even at this point, though. 1) Where did you find the formula (14400/(k+m))? 2) I was really trying to size this for when it goes to production, at which point it may have as many as 384 OSDs. Doesn't that imply I should have even more pgs?
On Wed, Mar 4, 2015 at 2:15 PM, Don Doerner don.doer...@quantum.com wrote: Oh duh… OK, then given a 4+4 erasure coding scheme, 14400/8 is 1800, so try 2048. -don- *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf Of *Don Doerner *Sent:* 04 March, 2015 12:14 *To:* Kyle Hutson; Ceph Users *Subject:* Re: [ceph-users] New EC pool undersized In this case, that number means that there is not an OSD that can be assigned. What’s your k, m from you erasure coded pool? You’ll need approximately (14400/(k+m)) PGs, rounded up to the next power of 2… -don- *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com ceph-users-boun...@lists.ceph.com] *On Behalf Of *Kyle Hutson *Sent:* 04 March, 2015 12:06 *To:* Ceph Users *Subject:* [ceph-users] New EC pool undersized Last night I blew away my previous ceph configuration (this environment is pre-production) and have 0.87.1 installed. I've manually edited the crushmap so it down looks like https://dpaste.de/OLEa
Re: [ceph-users] Hammer sharded radosgw bucket indexes question
- Original Message - From: Ben Hines bhi...@gmail.com To: ceph-users ceph-users@lists.ceph.com Sent: Wednesday, March 4, 2015 1:03:16 PM Subject: [ceph-users] Hammer sharded radosgw bucket indexes question Hi, These questions were asked previously but perhaps lost: We have some large buckets. - When upgrading to Hammer (0.93 or later), is it necessary to recreate the buckets to get a sharded index? - What parameters does the system use for deciding when to shard the index? The system does not re-shard the bucket index, it will only affect new buckets. There is a per-zone configurable that specifies num of shards for buckets created in that zone (by default it's disabled). There's also a ceph.conf configurable that can be set to override that value. Yehuda ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
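To make that concrete, the two knobs Yehuda describes look roughly like this (a sketch only; the client section name and the value 8 are just examples, and as he says they only affect buckets created after the change). The global override in ceph.conf:

  [client.radosgw.gateway]
      rgw override bucket index max shards = 8

The per-zone equivalent is the bucket_index_max_shards field in the zone entries of the region configuration, which can be edited by dumping the config with 'radosgw-admin region get', changing the value, and loading it back with 'radosgw-admin region set'.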
Re: [ceph-users] Ceph User Teething Problems
On Wed, Mar 4, 2015 at 4:43 PM, Lionel Bouton lionel-subscript...@bouton.name wrote: On 03/04/15 22:18, John Spray wrote: On 04/03/2015 20:27, Datatone Lists wrote: [...] [Please don't mention ceph-deploy] This kind of comment isn't very helpful unless there is a specific issue with ceph-deploy that is preventing you from using it, and causing you to resort to manual steps. As a new maintainer of ceph-deploy, I'm happy to hear all gripes. :) ceph-deploy is a subject I never took the time to give feedback on. We can't use it (we use Gentoo which isn't supported by ceph-deploy) and even if we could I probably wouldn't allow it: I believe that for important pieces of infrastructure like Ceph you have to understand its inner workings to the point where you can hack your way out in cases of problems and build tools to integrate them better with your environment (you can understand one of the reasons why we use Gentoo in production with other distributions...). I believe using ceph-deploy makes it more difficult to acquire the knowledge to do so. For example we have a script to replace a defective OSD (destroying an existing one and replacing with a new one) locking data in place as long as we can to avoid crush map changes to trigger movements until the map reaches its original state again which minimizes the total amount of data copied around. It might have been possible to achieve this with ceph-deploy, but I doubt we would have achieved it as easily (from understanding the causes of data movements through understanding the osd identifiers allocation process to implementing the script) if we hadn't created the OSD by hand repeatedly before scripting some processes. Thanks for this feedback. I share a lot of your sentiments, especially that it is good to understand as much of the system as you can. Everyone's skill level and use-case is different, and ceph-deploy is targeted more towards PoC use-cases. It tries to make things as easy as possible, but that necessarily abstracts most of the details away. Last time I searched for documentation on manual configuration it was much harder to find (mds manual configuration was indeed something I didn't find at all too). Best regards, Lionel ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] qemu-kvm and cloned rbd image
Hi Josh, Thanks for taking a look at this. I 'm answering your questions inline. On 03/04/2015 10:01 PM, Josh Durgin wrote: [...] And then proceeded to create a qemu-kvm guest with rbd/server as its backing store. The guest booted but as soon as it got to mount the root fs, things got weird: What does the qemu command line look like? I am using libvirt, so I'll be copy-pasting from the log file: LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin /usr/libexec/qemu-kvm -name server -S -machine rhel6.5.0,accel=kvm,usb=off -cpu Penryn,+dca,+pdcm,+xtpr,+tm2,+est,+vmx,+ds_cpl,+monitor,+dtes64,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme -m 1024 -realtime mlock=off -smp 1,sockets=1,cores=1,threads=1 -uuid ee13f9a0-b7eb-93fd-aa8c-18da9e23ba5c -nographic -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/server.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -boot order=nc,menu=on,strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x4 -drive file=rbd:libvirt-pool/server:id=libvirt:key=AQAeDqRTQEknIhAA5Gqfl/CkWIfh+nR01hEgzA==:auth_supported=cephx\;none,if=none,id=drive-scsi0-0-0-0 -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0 -netdev tap,fd=23,id=hostnet0,vhost=on,vhostfd=24 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:73:98:a9,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -device usb-tablet,id=input0 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 [...] scsi2 : Virtio SCSI HBA scsi 2:0:0:0: Direct-Access QEMU QEMU HARDDISK1.5. PQ: 0 ANSI: 5 sd 2:0:0:0: [sda] 20971520 512-byte logical blocks: (10.7 GB/10.0 GiB) sd 2:0:0:0: [sda] Write Protect is off sd 2:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA sda: sda1 sda2 sd 2:0:0:0: [sda] Attached SCSI disk dracut: Scanning devices sda2 for LVM logical volumes vg_main/lv_swap vg_main/lv_root dracut: inactive '/dev/vg_main/lv_swap' [1.00 GiB] inherit dracut: inactive '/dev/vg_main/lv_root' [6.50 GiB] inherit EXT4-fs (dm-1): INFO: recovery required on readonly filesystem This suggests the disk is being exposed as read-only via QEMU, perhaps via qemu's snapshot or other options. You're right, the disk does seem R/O but also corrupt. The disk image was cleanly unmounted before creating the snapshot and cloning it. What is more, if I just flatten the image and start the guest again it boots fine and there is no recovery needed on the fs. There are also a some: block I/O error in device 'drive-scsi0-0-0-0': Operation not permitted (1) messages logged in /var/log/libvirt/qemu/server.log You can use a clone in exactly the same way as any other rbd image. If you're running QEMU manually, for example, something like: qemu-kvm -drive file=rbd:rbd/server,format=raw,cache=writeback is fine for using the clone. QEMU is supposed to be unaware of any snapshots, parents, etc. at the rbd level. In a sense, the parameters passed to QEMU from libvirt boil down to your suggested command line. I think it should work as well, it is written all over the place :) I'm a still a newbie wrt ceph, maybe I am missing something flat-out obvious. Thanks for your time, -K. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph User Teething Problems
On 03/04/15 22:50, Travis Rhoden wrote: [...] Thanks for this feedback. I share a lot of your sentiments, especially that it is good to understand as much of the system as you can. Everyone's skill level and use-case is different, and ceph-deploy is targeted more towards PoC use-cases. It tries to make things as easy as possible, but that necessarily abstracts most of the details away. To follow up on this subject, assuming ceph-deploy worked with Gentoo, one feature which would make it really useful to us would be for it to dump each and every one of the commands it uses so that they might be replicated manually. Documentation might be inaccurate or hard to browse for various reasons, but a tool which achieves its purpose can't be wrong about the command it uses (assuming it simply calls standard command-line tools and not some API over a socket...). There might be a way to do it already (seems something you would want at least when developing it) but obviously I didn't check. Lionel ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] qemu-kvm and cloned rbd image
On 03/04/2015 01:36 PM, koukou73gr wrote: On 03/03/2015 05:53 PM, Jason Dillaman wrote: Your procedure appears correct to me. Would you mind re-running your cloned image VM with the following ceph.conf properties: [client] rbd cache off debug rbd = 20 log file = /path/writeable/by/qemu.$pid.log If you recreate the issue, would you mind opening a ticket at http://tracker.ceph.com/projects/rbd/issues? Jason, Thanks for the reply. Recreating the issue is not a problem, I can reproduce it any time. The log file was getting a bit large, I destroyed the guest after letting it thrash for about ~3 minutes, plenty of time to hit the problem. I've uploaded it at: http://paste.scsys.co.uk/468868 (~19MB) It looks like your libvirt rados user doesn't have access to whatever pool the parent image is in: librbd::AioRequest: write 0x7f1ec6ad6960 rbd_data.24413d1b58ba.0186 1523712~4096 should_complete: r = -1 -1 is EPERM, for operation not permitted. Check the libvirt user capabilites shown in ceph auth list - it should have at least r and class-read access to the pool storing the parent image. You can update it via the 'ceph auth caps' command. Josh ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
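For anyone hitting the same thing, the fix looks something like the following (a sketch only; take the existing caps from your own 'ceph auth list' output and adjust pool names, here assuming the parent image lives in the default 'rbd' pool while the clone lives in 'libvirt-pool' as in the qemu command line earlier in this thread):

  ceph auth caps client.libvirt mon 'allow r' osd 'allow class-read object_prefix rbd_children, allow rwx pool=libvirt-pool, allow r class-read pool=rbd'

Without at least read and class-read on the parent's pool, reads that fall through the clone to the parent fail with EPERM, which is exactly the -1 seen in the librbd log.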
Re: [ceph-users] New EC pool undersized
I don't know – I am playing with crush; someday I may fully comprehend it. Not today. I think you have to look at it like this: if your possible failure domain options are OSDs, hosts, racks, …, and you choose racks as your failure domain, and you have exactly as many racks as your pool size (and it can't be any smaller, right?), then each PG has to have an OSD from each rack. If your 144 OSDs are split evenly across 8 racks, then you have 18 OSDs in each rack (presumably distributed over the hosts in that rack, though I don't think that distribution is important for this calculation). And so your total number of choices is 18 to the 8th power, or just over 11 billion (actually, 11,019,960,576 :-) ). So probably the only thing you have to worry about is "crush giving up too soon", and Yann's resolution. -don- From: Kyle Hutson [mailto:kylehut...@ksu.edu] Sent: 04 March, 2015 13:15 To: Don Doerner Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] New EC pool undersized So it sounds like I should figure out at 'how many nodes' do I need to increase pg_num to 4096, and again for 8192, and increase those incrementally as I add more hosts, correct? On Wed, Mar 4, 2015 at 3:04 PM, Don Doerner don.doer...@quantum.com wrote: Sorry, I missed your other questions, down at the bottom. See here: http://ceph.com/docs/master/rados/operations/placement-groups/ (look for "number of replicas for replicated pools or the K+M sum for erasure coded pools") for the formula; 38400/8 probably implies 8192. The thing is, you've got to think about how many ways you can form combinations of 8 unique OSDs (with replacement) that match your failure domain rules. If you've only got 8 hosts, and your failure domain is hosts, it severely limits this number. And I have read that too many isn't good either – a serialization issue, I believe. -don- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Don Doerner Sent: 04 March, 2015 12:49 To: Kyle Hutson Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] New EC pool undersized Hmmm, I just struggled through this myself. How many racks do you have? If not more than 8, you might want to make your failure domain smaller? I.e., maybe host?
That, at least, would allow you to debug the situation… -don- From: Kyle Hutson [mailto:kylehut...@ksu.edu] Sent: 04 March, 2015 12:43 To: Don Doerner Cc: Ceph Users Subject: Re: [ceph-users] New EC pool undersized It wouldn't let me simply change the pg_num, giving Error EEXIST: specified pg_num 2048 = current 8192 But that's not a big deal, I just deleted the pool and recreated with 'ceph osd pool create ec44pool 2048 2048 erasure ec44profile' ...and the result is quite similar: 'ceph status' is now ceph status cluster 196e5eb8-d6a7-4435-907e-ea028e946923 health HEALTH_WARN 4 pgs degraded; 4 pgs stuck unclean; 4 pgs undersized monmap e1: 4 mons at {hobbit01=10.5.38.1:6789/0,hobbit02=10.5.38.2:6789/0,hobbit13=10.5.38.13:6789/0,hobbit14=10.5.38.14:6789/0}, election epoch 6, quorum 0,1,2,3 hobbit01,hobbit02,hobbit13,hobbit14 osdmap e412: 144 osds: 144 up, 144 in pgmap v6798: 6144 pgs, 2 pools, 0 bytes data, 0 objects 90590 MB used, 640 TB / 640 TB avail 4 active+undersized+degraded 6140 active+clean 'ceph pg dump_stuck results' in ok pg_stat objects mip degr misp unf bytes log disklog state state_stamp v reported up up_primary acting acting_primary last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp 2.296 00000000 active+undersized+degraded 2015-03-04 14:33:26.672224 0'0 412:9 [5,55,91,2147483647,83,135,53,26] 5 [5,55,91,2147483647,83,135,53,26] 5 0'0 2015-03-04 14:33:15.649911 0'0 2015-03-04 14:33:15.649911 2.69c 00000000 active+undersized+degraded 2015-03-04 14:33:24.984802 0'0 412:9 [93,134,1,74,112,28,2147483647,60] 93 [93,134,1,74,112,28,2147483647,60] 93 0'0 2015-03-04 14:33:15.695747 0'0 2015-03-04 14:33:15.695747 2.36d 00000000
Re: [ceph-users] Persistent Write Back Cache
Hello Nick, On Wed, 4 Mar 2015 08:49:22 - Nick Fisk wrote: Hi Christian, Yes that's correct, it's on the client side. I don't see this much different to a battery backed Raid controller, if you lose power, the data is in the cache until power resumes when it is flushed. If you are going to have the same RBD accessed by multiple servers/clients then you need to make sure the SSD is accessible to both (eg DRBD / Dual Port SAS). But then something like pacemaker would be responsible for ensuring the RBD and cache device are both present before allowing client access. Which is pretty much any and all use cases I can think about. Because it's not only concurrent (active/active) accesses, but you really need to have things consistent across all possible client hosts in case of a node failure. I'm no stranger to DRBD and Pacemaker (which incidentally didn't make it into Debian Jessie, queue massive laughter and ridicule), btw. When I wrote this I was thinking more about 2 HA iSCSI servers with RBD's, however I can understand that this feature would prove more of a challenge if you are using Qemu and RBD. One of the reasons I'm using Ceph/RBD instead of DRBD (which is vastly more suited for some use cases) is that it allows me n+1 instead of n+n redundancy when it comes to consumers (compute nodes in my case). Now for your iSCSI head (looking forward to your results and any config recipes) that limitation to a pair may be just as well, but as others wrote it might be best to go forward with this outside of Ceph. Especially since you're already dealing with a HA cluster/pacemaker in that scenario. Christian Nick -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Christian Balzer Sent: 04 March 2015 08:40 To: ceph-users@lists.ceph.com Cc: Nick Fisk Subject: Re: [ceph-users] Persistent Write Back Cache Hello, If I understand you correctly, you're talking about the rbd cache on the client side. So assume that host or the cache SSD on if fail terminally. The client thinks its sync'ed are on the permanent storage (the actual ceph storage cluster), while they are only present locally. So restarting that service or VM on a different host now has to deal with likely crippling data corruption. Regards, Christian On Wed, 4 Mar 2015 08:26:52 - Nick Fisk wrote: Hi All, Is there anything in the pipeline to add the ability to write the librbd cache to ssd so that it can safely ignore sync requests? I have seen a thread a few years back where Sage was discussing something similar, but I can't find anything more recent discussing it. I've been running lots of tests on our new cluster, buffered/parallel performance is amazing (40K Read 10K write iops), very impressed. However sync writes are actually quite disappointing. Running fio with 128k block size and depth=1, normally only gives me about 300iops or 30MB/s. I'm seeing 2-3ms latency writing to SSD OSD's and from what I hear that's about normal, so I don't think I have a ceph config problem. For applications which do a lot of sync's, like ESXi over iSCSI or SQL databases, this has a major performance impact. Traditional storage arrays work around this problem by having a battery backed cache which has latency 10-100 times less than what you can currently achieve with Ceph and an SSD . Whilst librbd does have a writeback cache, from what I understand it will not cache syncs and so in my usage case, it effectively acts like a write through cache. 
To illustrate the difference a proper write back cache can make, I put a 1GB (512mb dirty threshold) flashcache in front of my RBD and tweaked the flush parameters to flush dirty blocks at a large queue depth. The same fio test (128k iodepth=1) now runs at 120MB/s and is limited by the performance of SSD used by flashcache, as everything is stored as 4k blocks on the ssd. In fact since everything is stored as 4k blocks, pretty much all IO sizes are accelerated to max speed of the SSD. Looking at iostat I can see all the IO's are getting coalesced into nice large 512kb IO's at a high queue depth, which Ceph easily swallows. If librbd could support writing its cache out to SSD it would hopefully achieve the same level of performance and having it integrated would be really neat. Nick -- Christian BalzerNetwork/Systems Engineer ch...@gol.com Global OnLine Japan/Fusion Communications http://www.gol.com/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] v0.93: Bucket removal with data purge
Ah, never mind - I had to pass the --bucket=bucketname argument. You'd think the command would print an error if the critical argument is missing. -Ben On Wed, Mar 4, 2015 at 6:06 PM, Ben Hines bhi...@gmail.com wrote: One of the release notes says: rgw: fix bucket removal with data purge (Yehuda Sadeh) Just tried this and it didn't seem to work: bash-4.1$ time radosgw-admin bucket rm mike-cache2 --purge-objects real 0m7.711s user 0m0.109s sys 0m0.072s Yet the bucket was not deleted, nor purged: -bash-4.1$ radosgw-admin bucket stats [ mike-cache2, { bucket: mike-cache2, pool: .rgw.buckets, index_pool: .rgw.buckets.index, id: default.2769570.4, marker: default.2769570.4, owner: smbuildmachine, ver: 0#329, master_ver: 0#0, mtime: 2014-11-11 16:10:31.00, max_marker: 0#, usage: { rgw.main: { size_kb: 223355, size_kb_actual: 223768, num_objects: 164 } }, bucket_quota: { enabled: false, max_size_kb: -1, max_objects: -1 } }, ] -Ben ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] v0.93: Bucket removal with data purge
One of the release notes says: rgw: fix bucket removal with data purge (Yehuda Sadeh) Just tried this and it didn't seem to work: bash-4.1$ time radosgw-admin bucket rm mike-cache2 --purge-objects real 0m7.711s user 0m0.109s sys 0m0.072s Yet the bucket was not deleted, nor purged: -bash-4.1$ radosgw-admin bucket stats [ mike-cache2, { bucket: mike-cache2, pool: .rgw.buckets, index_pool: .rgw.buckets.index, id: default.2769570.4, marker: default.2769570.4, owner: smbuildmachine, ver: 0#329, master_ver: 0#0, mtime: 2014-11-11 16:10:31.00, max_marker: 0#, usage: { rgw.main: { size_kb: 223355, size_kb_actual: 223768, num_objects: 164 } }, bucket_quota: { enabled: false, max_size_kb: -1, max_objects: -1 } }, ] -Ben ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] pool distribution quality report script
Hi All, Recently some folks showed interest in gathering pool distribution statistics and I remembered I wrote a script to do that a while back. It was broken due to a change in the ceph pg dump output format that was committed a while back, so I cleaned the script up, added detection of header fields, automatic json support, and also added in calculation of expected max and min PGs per OSD and std deviation. The script is available here: https://github.com/ceph/ceph-tools/blob/master/cbt/tools/readpgdump.py Some general comments: 1) Expected numbers are derived by treating PGs and OSDs as a balls-in-buckets problem ala Raab Steger: http://www14.in.tum.de/personen/raab/publ/balls.pdf 2) You can invoke it either by passing it a file or stdout, ie: ceph pg dump -f json | ./readpgdump.py or ./readpgdump.py ~/pgdump.out 3) Here's a snippet of some of some sample output from a 210 OSD cluster. Does this output make sense to people? Is it useful? [nhm@burnupiX tools]$ ./readpgdump.py ~/pgdump.out ++ | Detected input as plain| ++ ++ | Pool ID: 681 | ++ | Participating OSDs: 210| | Participating PGs: 4096| ++ | OSDs in Primary Role (Acting) | | Expected PGs Per OSD: Min: 4, Max: 33, Mean: 19.5, Std Dev: 7.2| | Actual PGs Per OSD: Min: 7, Max: 43, Mean: 19.5, Std Dev: 6.5 | | 5 Most Subscribed OSDs: 199(43), 175(36), 149(34), 167(32), 20(31) | | 5 Least Subscribed OSDs: 121(7), 46(7), 70(8), 94(9), 122(9) | | Avg Deviation from Most Subscribed OSD: 54.6% | ++ | OSDs in Secondary Role (Acting)| | Expected PGs Per OSD: Min: 18, Max: 59, Mean: 39.0, Std Dev: 10.2 | | Actual PGs Per OSD: Min: 17, Max: 61, Mean: 39.0, Std Dev: 9.7 | | 5 Most Subscribed OSDs: 44(61), 14(60), 2(59), 167(59), 164(57)| | 5 Least Subscribed OSDs: 35(17), 31(20), 37(20), 145(20), 16(20) | | Avg Deviation from Most Subscribed OSD: 36.0% | ++ | OSDs in All Roles (Acting) | | Expected PGs Per OSD: Min: 32, Max: 83, Mean: 58.5, Std Dev: 12.5 | | Actual PGs Per OSD: Min: 29, Max: 93, Mean: 58.5, Std Dev: 14.6| | 5 Most Subscribed OSDs: 199(93), 175(92), 44(92), 167(91), 14(91) | | 5 Least Subscribed OSDs: 121(29), 35(30), 47(30), 131(32), 145(32) | | Avg Deviation from Most Subscribed OSD: 37.1% | ++ | OSDs in Primary Role (Up) | | Expected PGs Per OSD: Min: 4, Max: 33, Mean: 19.5, Std Dev: 7.2| | Actual PGs Per OSD: Min: 7, Max: 43, Mean: 19.5, Std Dev: 6.5 | | 5 Most Subscribed OSDs: 199(43), 175(36), 149(34), 167(32), 20(31) | | 5 Least Subscribed OSDs: 121(7), 46(7), 70(8), 94(9), 122(9) | | Avg Deviation from Most Subscribed OSD: 54.6% | ++ | OSDs in Secondary Role (Up)| | Expected PGs Per OSD: Min: 18, Max: 59, Mean: 39.0, Std Dev: 10.2 | | Actual PGs Per OSD: Min: 17, Max: 61, Mean: 39.0, Std Dev: 9.7 | | 5 Most Subscribed OSDs: 44(61), 14(60), 2(59), 167(59), 164(57)| | 5 Least Subscribed OSDs: 35(17), 31(20), 37(20), 145(20), 16(20) | | Avg Deviation from Most Subscribed OSD: 36.0% | ++ | OSDs in All Roles (Up) | | Expected PGs Per OSD: Min: 32, Max: 83, Mean: 58.5, Std Dev: 12.5 | | Actual PGs Per OSD: Min: 29, Max: 93, Mean: 58.5, Std Dev: 14.6| | 5 Most Subscribed OSDs: 199(93), 175(92), 44(92), 167(91), 14(91) | | 5 Least Subscribed OSDs: 121(29), 35(30), 47(30), 131(32), 145(32) | | Avg Deviation from Most Subscribed OSD: 37.1% |
Re: [ceph-users] Unexpected OSD down during deep-scrub
New issue created - http://tracker.ceph.com/issues/11027 Regards. Italo Santos http://italosantos.com.br/ On Tuesday, March 3, 2015 at 9:23 PM, Loic Dachary wrote: Hi Yann, That seems related to http://tracker.ceph.com/issues/10536 which seems to be resolved. Could you create a new issue with a link to 10536 ? More logs and ceph report would also be useful to figure out why it resurfaced. Thanks ! On 04/03/2015 00:04, Yann Dupont wrote: Le 03/03/2015 22:03, Italo Santos a écrit : I realised that when the first OSD goes down, the cluster was performing a deep-scrub and I found the bellow trace on the logs of osd.8, anyone can help me understand why the osd.8, and other osds, unexpected goes down? I'm afraid I've seen this this afternoon too on my test cluster, just after upgrading from 0.87 to 0.93. After an initial migration success, some OSD started to go down : All presented similar stack traces , with magic word scrub in it : ceph version 0.93 (bebf8e9a830d998eeaab55f86bb256d4360dd3c4) 1: /usr/bin/ceph-osd() [0xbeb3dc] 2: (()+0xf0a0) [0x7f8f3ca130a0] 3: (gsignal()+0x35) [0x7f8f3b37d165] 4: (abort()+0x180) [0x7f8f3b3803e0] 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f8f3bbd389d] 6: (()+0x63996) [0x7f8f3bbd1996] 7: (()+0x639c3) [0x7f8f3bbd19c3] 8: (()+0x63bee) [0x7f8f3bbd1bee] 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x220) [0xcd74f0] 10: (ReplicatedPG::issue_repop(ReplicatedPG::RepGather*, utime_t)+0x1fc) [0x97259c] 11: (ReplicatedPG::simple_repop_submit(ReplicatedPG::RepGather*)+0x7a) [0x97344a] 12: (ReplicatedPG::_scrub(ScrubMap, std::maphobject_t, std::pairunsigned int, unsigned int, std::lesshobject_t, std::allocatorstd::pairhobject_t const, std::pa irunsigned int, unsigned intconst)+0x2e4d) [0x9a5ded] 13: (PG::scrub_compare_maps()+0x658) [0x916378] 14: (PG::chunky_scrub(ThreadPool::TPHandle)+0x202) [0x917ee2] 15: (PG::scrub(ThreadPool::TPHandle)+0x3a3) [0x919f83] 16: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle)+0x13) [0x7eff93] 17: (ThreadPool::worker(ThreadPool::WorkThread*)+0x629) [0xcc8c49] 18: (ThreadPool::WorkThread::entry()+0x10) [0xccac40] 19: (()+0x6b50) [0x7f8f3ca0ab50] 20: (clone()+0x6d) [0x7f8f3b42695d] As a temporary measure, noscrub and nodeep-scrub are now set for this cluster, and all is working fine right now. So there is probably something wrong here. Need to investigate further. Cheers, ___ ceph-users mailing list ceph-users@lists.ceph.com (mailto:ceph-users@lists.ceph.com) http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Loïc Dachary, Artisan Logiciel Libre ___ ceph-users mailing list ceph-users@lists.ceph.com (mailto:ceph-users@lists.ceph.com) http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
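For anyone needing the same stop-gap, the flags Yann mentions are cluster-wide and are set and cleared like this (a temporary measure only; scrubbing should be re-enabled once the underlying bug is fixed):

  ceph osd set noscrub
  ceph osd set nodeep-scrub
  # later, to re-enable:
  ceph osd unset noscrub
  ceph osd unset nodeep-scrub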
Re: [ceph-users] pool distribution quality report script
Hi Mark, Cool, that looks handy. Though it'd be even better if it could go a step further and recommend re-weighting values to balance things out (or increased PG counts where needed). Cheers, On 5 March 2015 at 15:11, Mark Nelson mnel...@redhat.com wrote: Hi All, Recently some folks showed interest in gathering pool distribution statistics and I remembered I wrote a script to do that a while back. It was broken due to a change in the ceph pg dump output format that was committed a while back, so I cleaned the script up, added detection of header fields, automatic json support, and also added in calculation of expected max and min PGs per OSD and std deviation. The script is available here: https://github.com/ceph/ceph-tools/blob/master/cbt/tools/readpgdump.py Some general comments: 1) Expected numbers are derived by treating PGs and OSDs as a balls-in-buckets problem ala Raab Steger: http://www14.in.tum.de/personen/raab/publ/balls.pdf 2) You can invoke it either by passing it a file or stdout, ie: ceph pg dump -f json | ./readpgdump.py or ./readpgdump.py ~/pgdump.out 3) Here's a snippet of some of some sample output from a 210 OSD cluster. Does this output make sense to people? Is it useful? [nhm@burnupiX tools]$ ./readpgdump.py ~/pgdump.out ++ | Detected input as plain | ++ ++ | Pool ID: 681 | ++ | Participating OSDs: 210 | | Participating PGs: 4096 | ++ | OSDs in Primary Role (Acting) | | Expected PGs Per OSD: Min: 4, Max: 33, Mean: 19.5, Std Dev: 7.2 | | Actual PGs Per OSD: Min: 7, Max: 43, Mean: 19.5, Std Dev: 6.5 | | 5 Most Subscribed OSDs: 199(43), 175(36), 149(34), 167(32), 20(31) | | 5 Least Subscribed OSDs: 121(7), 46(7), 70(8), 94(9), 122(9) | | Avg Deviation from Most Subscribed OSD: 54.6% | ++ | OSDs in Secondary Role (Acting) | | Expected PGs Per OSD: Min: 18, Max: 59, Mean: 39.0, Std Dev: 10.2 | | Actual PGs Per OSD: Min: 17, Max: 61, Mean: 39.0, Std Dev: 9.7 | | 5 Most Subscribed OSDs: 44(61), 14(60), 2(59), 167(59), 164(57) | | 5 Least Subscribed OSDs: 35(17), 31(20), 37(20), 145(20), 16(20) | | Avg Deviation from Most Subscribed OSD: 36.0% | ++ | OSDs in All Roles (Acting) | | Expected PGs Per OSD: Min: 32, Max: 83, Mean: 58.5, Std Dev: 12.5 | | Actual PGs Per OSD: Min: 29, Max: 93, Mean: 58.5, Std Dev: 14.6 | | 5 Most Subscribed OSDs: 199(93), 175(92), 44(92), 167(91), 14(91) | | 5 Least Subscribed OSDs: 121(29), 35(30), 47(30), 131(32), 145(32) | | Avg Deviation from Most Subscribed OSD: 37.1% | ++ | OSDs in Primary Role (Up) | | Expected PGs Per OSD: Min: 4, Max: 33, Mean: 19.5, Std Dev: 7.2 | | Actual PGs Per OSD: Min: 7, Max: 43, Mean: 19.5, Std Dev: 6.5 | | 5 Most Subscribed OSDs: 199(43), 175(36), 149(34), 167(32), 20(31) | | 5 Least Subscribed OSDs: 121(7), 46(7), 70(8), 94(9), 122(9) | | Avg Deviation from Most Subscribed OSD: 54.6% | ++ | OSDs in Secondary Role (Up) | | Expected PGs Per OSD: Min: 18, Max: 59, Mean: 39.0, Std Dev: 10.2 | | Actual PGs Per OSD: Min: 17, Max: 61, Mean: 39.0, Std Dev: 9.7 | | 5 Most Subscribed OSDs: 44(61), 14(60), 2(59), 167(59), 164(57) | | 5 Least Subscribed OSDs: 35(17), 31(20), 37(20), 145(20), 16(20) | | Avg Deviation from Most Subscribed OSD: 36.0% | ++ | OSDs in All Roles (Up) | | Expected PGs Per OSD: Min: 32, Max: 83, Mean: 58.5, Std Dev: 12.5 | | Actual PGs Per OSD: Min: 29, Max: 93, Mean: 58.5, Std Dev: 14.6 | | 5 Most Subscribed OSDs: 199(93), 175(92), 44(92), 167(91), 14(91) | | 5 Least Subscribed OSDs: 121(29), 35(30), 47(30), 131(32), 145(32) | | Avg Deviation from Most Subscribed OSD: 37.1% | ++ This is shown for all 
the pools, followed by the totals: ++ | Pool ID: Totals (All Pools) | ++ | Participating OSDs: 210 | | Participating PGs: 131072 | ++ | OSDs in Primary Role (Acting) | | Expected PGs Per OSD: Min: 542, Max: 705, Mean:
[ceph-users] Inkscope packages and blog
Hi everyone, I'm proud to announce that DEB and RPM packages for Inkscope V1.1 are available on github (https://github.com/inkscope/inkscope-packaging). Inkscope has also its blog : http://inkscope.blogspot.fr. You will find there how to install Inkscope on debian servers (http://inkscope.blogspot.fr/2015/03/inkscope-installation-on-debian-servers.html) Feedback is welcome. Cheers, Alain ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Rebalance/Backfill Throtling - anything missing here?
Thank you Robert - I'm wondering, when I do remove the total of 7 OSDs from the crush map, whether that will cause more than 37% of data to be moved (80% or whatever). I'm also wondering if the throttling that I applied is fine or not - I will introduce the osd_recovery_delay_start 10sec as Irek said. I'm just wondering how much the performance impact will be, because: - when stopping an OSD, the impact while backfilling was fine more or less - I can live with this - when I removed an OSD from the crush map - for the first 1h or so the impact was tremendous, and later on during the recovery process the impact was much less but still noticeable... Thanks for the tip of course ! Andrija On 3 March 2015 at 18:34, Robert LeBlanc rob...@leblancnet.us wrote: I would be inclined to shut down both OSDs in a node, let the cluster recover. Once it is recovered, shut down the next two, let it recover. Repeat until all the OSDs are taken out of the cluster. Then I would set nobackfill and norecover. Then remove the hosts/disks from the CRUSH map, then unset nobackfill and norecover. That should give you a few small changes (when you shut down OSDs) and then one big one to get everything in the final place. If you are still adding new nodes, when nobackfill and norecover is set, you can add them in so that the one big relocate fills the new drives too. On Tue, Mar 3, 2015 at 5:58 AM, Andrija Panic andrija.pa...@gmail.com wrote: Thx Irek. Number of replicas is 3. I have 3 servers with 2 OSDs on them on a 1G switch (1 OSD already decommissioned), which is further connected to a new 10G switch/network with 3 servers on it with 12 OSDs each. I'm decommissioning the old 3 nodes on the 1G network... So you suggest removing the whole node with 2 OSDs manually from the crush map? To my knowledge, ceph never places 2 replicas on 1 node; all 3 replicas were originally distributed over all 3 nodes. So anyway it should be safe to remove 2 OSDs at once together with the node itself... since the replica count is 3... ? Thx again for your time On Mar 3, 2015 1:35 PM, Irek Fasikhov malm...@gmail.com wrote: Once you have only three nodes in the cluster, I recommend you add the new nodes to the cluster first, and then delete the old ones. 2015-03-03 15:28 GMT+03:00 Irek Fasikhov malm...@gmail.com: What is your number of replicas? 2015-03-03 15:14 GMT+03:00 Andrija Panic andrija.pa...@gmail.com: Hi Irek, yes, stopping an OSD (or setting it to OUT) resulted in only 3% of data degraded and moved/recovered. When I afterwards removed it from the crush map with ceph osd crush rm id, that's when the stuff with 37% happened. And thanks Irek for the help - could you kindly just let me know the preferred steps when removing a whole node? Do you mean I first stop all OSDs again, or just remove each OSD from the crush map, or perhaps just decompile the crush map, delete the node completely, compile it back in, and let it heal/recover ? Do you think this would result in less data misplaced and moved around ? Sorry for bugging you, I really appreciate your help. Thanks On 3 March 2015 at 12:58, Irek Fasikhov malm...@gmail.com wrote: A large percentage of the rebuild of the cluster map (but a low percentage of degradation). If you had not run ceph osd crush rm id, the percentage would be low. In your case, the correct option is to remove the entire node, rather than each disk individually 2015-03-03 14:27 GMT+03:00 Andrija Panic andrija.pa...@gmail.com: Another question - I mentioned here 37% of objects being moved around - these are MISPLACED objects (degraded objects were 0.001%), after I removed 1 OSD from the crush map (out of 44 OSDs or so).
Can anybody confirm this is normal behaviour - and are there any workarounds ? I understand this is because of the object placement algorithm of CEPH, but still, 37% of objects misplaced just by removing 1 OSD out of 44 from the crush map makes me wonder why the percentage is so large ? Seems not good to me, and I have to remove another 7 OSDs (we are demoting some old hardware nodes). This means I could potentially see 7 x the same number of misplaced objects...? Any thoughts ? Thanks On 3 March 2015 at 12:14, Andrija Panic andrija.pa...@gmail.com wrote: Thanks Irek. Does this mean that after peering for each PG there will be a delay of 10sec, meaning that every once in a while I will have 10sec of the cluster NOT being stressed/overloaded, and then the recovery takes place for that PG, and then for another 10sec the cluster is fine, and then stressed again ? I'm trying to understand the process before actually doing stuff (the config reference is there on ceph.com but I don't fully understand the process) Thanks, Andrija On 3 March 2015 at 11:32, Irek Fasikhov malm...@gmail.com wrote: Hi. Use value osd_recovery_delay_start example: [root@ceph08 ceph]# ceph --admin-daemon /var/run/ceph/ceph-osd.94.asok config show |
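Putting Robert's and Irek's suggestions together, the sequence looks roughly like this (a sketch; the throttle values are only examples, and osd_max_backfills / osd_recovery_max_active are the other knobs commonly lowered alongside osd_recovery_delay_start):

  ceph tell osd.* injectargs '--osd-recovery-delay-start 10'
  ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'
  ceph osd set nobackfill
  ceph osd set norecover
  ceph osd crush rm osd.N      # repeat for every OSD being retired; N is a placeholder
  ceph osd unset nobackfill
  ceph osd unset norecover

so that all the CRUSH removals are merged into one large data movement instead of seven separate ones.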
Re: [ceph-users] Persistent Write Back Cache
Hi Christian, Yes that's correct, it's on the client side. I don't see this much different to a battery backed Raid controller, if you lose power, the data is in the cache until power resumes when it is flushed. If you are going to have the same RBD accessed by multiple servers/clients then you need to make sure the SSD is accessible to both (eg DRBD / Dual Port SAS). But then something like pacemaker would be responsible for ensuring the RBD and cache device are both present before allowing client access. When I wrote this I was thinking more about 2 HA iSCSI servers with RBD's, however I can understand that this feature would prove more of a challenge if you are using Qemu and RBD. Nick -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Christian Balzer Sent: 04 March 2015 08:40 To: ceph-users@lists.ceph.com Cc: Nick Fisk Subject: Re: [ceph-users] Persistent Write Back Cache Hello, If I understand you correctly, you're talking about the rbd cache on the client side. So assume that host or the cache SSD on if fail terminally. The client thinks its sync'ed are on the permanent storage (the actual ceph storage cluster), while they are only present locally. So restarting that service or VM on a different host now has to deal with likely crippling data corruption. Regards, Christian On Wed, 4 Mar 2015 08:26:52 - Nick Fisk wrote: Hi All, Is there anything in the pipeline to add the ability to write the librbd cache to ssd so that it can safely ignore sync requests? I have seen a thread a few years back where Sage was discussing something similar, but I can't find anything more recent discussing it. I've been running lots of tests on our new cluster, buffered/parallel performance is amazing (40K Read 10K write iops), very impressed. However sync writes are actually quite disappointing. Running fio with 128k block size and depth=1, normally only gives me about 300iops or 30MB/s. I'm seeing 2-3ms latency writing to SSD OSD's and from what I hear that's about normal, so I don't think I have a ceph config problem. For applications which do a lot of sync's, like ESXi over iSCSI or SQL databases, this has a major performance impact. Traditional storage arrays work around this problem by having a battery backed cache which has latency 10-100 times less than what you can currently achieve with Ceph and an SSD . Whilst librbd does have a writeback cache, from what I understand it will not cache syncs and so in my usage case, it effectively acts like a write through cache. To illustrate the difference a proper write back cache can make, I put a 1GB (512mb dirty threshold) flashcache in front of my RBD and tweaked the flush parameters to flush dirty blocks at a large queue depth. The same fio test (128k iodepth=1) now runs at 120MB/s and is limited by the performance of SSD used by flashcache, as everything is stored as 4k blocks on the ssd. In fact since everything is stored as 4k blocks, pretty much all IO sizes are accelerated to max speed of the SSD. Looking at iostat I can see all the IO's are getting coalesced into nice large 512kb IO's at a high queue depth, which Ceph easily swallows. If librbd could support writing its cache out to SSD it would hopefully achieve the same level of performance and having it integrated would be really neat. 
Nick -- Christian BalzerNetwork/Systems Engineer ch...@gol.com Global OnLine Japan/Fusion Communications http://www.gol.com/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] cephfs filesystem layouts : authentication gotchas ?
Hi, Many thanks for the explanations. I haven't used the nodcache option when mounting cephfs, it actually got there by default My mount command is/was : # mount -t ceph 1.2.3.4:6789:/ /mnt -o name=puppet,secretfile=./puppet.secret I don't know what causes this option to be default, maybe it's the kernel module I compiled from git (because there is no kmod-ceph or kmod-rbd in any RHEL-like distributions except RHEV), I'll try to update/check ... Concerning the rados pool ls, indeed : I created empty files in the pool, and they were not showing up probably because they were just empty - but when I create a non empty file, I see things in rados ls... Thanks again Frederic -Message d'origine- De : ceph-users [mailto:ceph-users-boun...@lists.ceph.com] De la part de John Spray Envoyé : mardi 3 mars 2015 17:15 À : ceph-users@lists.ceph.com Objet : Re: [ceph-users] cephfs filesystem layouts : authentication gotchas ? On 03/03/2015 15:21, SCHAER Frederic wrote: By the way : looks like the ceph fs ls command is inconsistent when the cephfs is mounted (I used a locally compiled kmod-ceph rpm): [root@ceph0 ~]# ceph fs ls name: cephfs_puppet, metadata pool: puppet_metadata, data pools: [puppet ] (umount /mnt .) [root@ceph0 ~]# ceph fs ls name: cephfs_puppet, metadata pool: puppet_metadata, data pools: [puppet root ] This is probably #10288, which was fixed in 0.87.1 So, I have this pool named root that I added in the cephfs filesystem. I then edited the filesystem xattrs : [root@ceph0 ~]# getfattr -n ceph.dir.layout /mnt/root getfattr: Removing leading '/' from absolute path names # file: mnt/root ceph.dir.layout=stripe_unit=4194304 stripe_count=1 object_size=4194304 pool=root I'm therefore assuming client.puppet should not be allowed to write or read anything in /mnt/root, which belongs to the root pool. but that is not the case. On another machine where I mounted cephfs using the client.puppet key, I can do this : The mount was done with the client.puppet key, not the admin one that is not deployed on that node : 1.2.3.4:6789:/ on /mnt type ceph (rw,relatime,name=puppet,secret=hidden,nodcache) [root@dev7248 ~]# echo not allowed /mnt/root/secret.notfailed [root@dev7248 ~]# [root@dev7248 ~]# cat /mnt/root/secret.notfailed not allowed This is data you're seeing from the page cache, it hasn't been written to RADOS. You have used the nodcache setting, but that doesn't mean what you think it does (it was about caching dentries, not data). It's actually not even used in recent kernels (http://tracker.ceph.com/issues/11009). You could try the nofsc option, but I don't know exactly how much caching that turns off -- the safer approach here is probably to do your testing using I/Os that have O_DIRECT set. And I can even see the xattrs inherited from the parent dir : [root@dev7248 ~]# getfattr -n ceph.file.layout /mnt/root/secret.notfailed getfattr: Removing leading '/' from absolute path names # file: mnt/root/secret.notfailed ceph.file.layout=stripe_unit=4194304 stripe_count=1 object_size=4194304 pool=root Whereas on the node where I mounted cephfs as ceph admin, I get nothing : [root@ceph0 ~]# cat /mnt/root/secret.notfailed [root@ceph0 ~]# ls -l /mnt/root/secret.notfailed -rw-r--r-- 1 root root 12 Mar 3 15:27 /mnt/root/secret.notfailed After some time, the file also gets empty on the puppet client host : [root@dev7248 ~]# cat /mnt/root/secret.notfailed [root@dev7248 ~]# (but the metadata remained ?) Right -- eventually the cache goes away, and you see the true (empty) state of the file. 
Also, as an unpriviledged user, I can get ownership of a secret file by changing the extended attribute : [root@dev7248 ~]# setfattr -n ceph.file.layout.pool -v puppet /mnt/root/secret.notfailed [root@dev7248 ~]# getfattr -n ceph.file.layout /mnt/root/secret.notfailed getfattr: Removing leading '/' from absolute path names # file: mnt/root/secret.notfailed ceph.file.layout=stripe_unit=4194304 stripe_count=1 object_size=4194304 pool=puppet Well, you're not really getting ownership of anything here: you're modifying the file's metadata, which you are entitled to do (pool permissions have nothing to do with file metadata). There was a recent bug where a file's pool layout could be changed even if it had data, but that was about safety rather than permissions. Final question for those that read down here : it appears that before creating the cephfs filesystem, I used the puppet pool to store a test rbd instance. And it appears I cannot get the list of cephfs objects in that pool, whereas I can get those that are on the newly created root pool : [root@ceph0 ~]# rados -p puppet ls test.rbd rbd_directory [root@ceph0 ~]# rados -p root ls 10a. 10b. Bug, or feature ? I didn't see anything in your earlier steps that would have led to any objects in
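As a concrete way to apply John's O_DIRECT suggestion above, a test along these lines bypasses the client page cache, so the write should fail promptly for a client that lacks write access to the pool named in the file's layout (a sketch only; block size and count are arbitrary):

  dd if=/dev/zero of=/mnt/root/secret.notfailed bs=4M count=1 oflag=direct
  dd if=/mnt/root/secret.notfailed of=/dev/null bs=4M count=1 iflag=direct

With buffered I/O the write appears to succeed because it only lands in the page cache, which is the behaviour described earlier in this thread.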
Re: [ceph-users] Fail to bring OSD back to cluster
Hi Luke, Maybe you can set these flags: ceph osd set nodown ceph osd set noout Regards Sahana On Wed, Mar 4, 2015 at 2:32 PM, Luke Kao luke@mycom-osi.com wrote: Hello ceph community, We need some immediate help: our cluster is in a very strange and bad state after an unexpected reboot of many OSD nodes in a very short time frame. We have a cluster with 195 OSDs configured on 9 different OSD nodes, original version 0.80.5. After an issue in the datacenter, at least 5 OSD nodes rebooted, and after the reboot not all OSDs came up, which triggered a lot of recovery; many PGs also went into a dead / incomplete state. Then we tried to restart the OSDs and found that they keep crashing with the error FAILED assert(log.head >= olog.tail && olog.head >= log.tail), so we upgraded to 0.80.7 which covers the fix for #9482; however we still see the error, with different behavior: 0.80.5: once an OSD crashes with this error, any attempt to restart the OSD ends in the same crash 0.80.7: the OSD can be restarted, but after some time another OSD will crash with this error We also tried to set the nobackfill and norecover flags but it doesn't help. So the cluster is stuck and we cannot bring more OSDs back. Any suggestion that may give us a chance to recover the cluster? Many thanks, Luke Kao MYCOM-OSI http://www.mycom-osi.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Implement replication network with live cluster
Hi, I have a live cluster with only a public network (so no explicit network configuration in the ceph.conf file). I'm wondering what the procedure is to implement a dedicated replication/private network alongside the public one. I've read the manual and know how to do it in ceph.conf, but since this is an already running cluster - what should I do after I change ceph.conf on all nodes? Restart the OSDs one by one, or...? Is any downtime expected before the replication network is actually implemented completely?
Another related question: I'm also demoting some old OSDs on old servers. I will have them all stopped, but I would like to implement the replication network before actually removing the old OSDs from the crush map, since a lot of data will be moved around. My old nodes/OSDs (which will be stopped before I implement the replication network) do NOT have a dedicated NIC for the replication network, in contrast to the new nodes/OSDs. So there will still be references to these old OSDs in the crush map. Will it be a problem if I implement a replication network that WILL work on the new nodes/OSDs but not on the old ones, since they don't have a dedicated NIC? I guess not, since the old OSDs are stopped anyway, but I would like an opinion. Or perhaps I could remove the OSDs from the crush map after first setting nobackfill and norecover (so no rebalancing happens) and then implement the replication network? Sorry for the long post, but... Thanks, -- Andrija Panić
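For reference, the change being discussed usually amounts to something like the following in ceph.conf on every node, followed by rolling OSD restarts (the subnets and the osd id are examples only; monitors keep using the public network):
[global]
    public network  = 192.168.0.0/24
    cluster network = 10.10.10.0/24
# then, one OSD at a time:
service ceph restart osd.<id>
ceph -s     # wait for HEALTH_OK before moving on to the next OSD
Done this way, the move to a dedicated cluster network should not require client downtime, as long as the cluster is allowed to return to HEALTH_OK between restarts.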
Re: [ceph-users] Persistent Write Back Cache
On 04/03/2015 08:26, Nick Fisk wrote: To illustrate the difference a proper write back cache can make, I put a 1GB (512mb dirty threshold) flashcache in front of my RBD and tweaked the flush parameters to flush dirty blocks at a large queue depth. The same fio test (128k iodepth=1) now runs at 120MB/s and is limited by the performance of SSD used by flashcache, as everything is stored as 4k blocks on the ssd. In fact since everything is stored as 4k blocks, pretty much all IO sizes are accelerated to max speed of the SSD. Looking at iostat I can see all the IO’s are getting coalesced into nice large 512kb IO’s at a high queue depth, which Ceph easily swallows. If librbd could support writing its cache out to SSD it would hopefully achieve the same level of performance and having it integrated would be really neat.
What are you hoping to gain from building something into ceph instead of using flashcache/bcache/dm-cache on top of it? It seems like since you would anyway need to handle your HA configuration, setting up the actual cache device would be the simple part. Cheers, John
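To make the comparison concrete, the kind of client-side setup John is pointing at looks roughly like this (a sketch only; the device names, mount point and writeback policy are examples, and flashcache is just one of the three options he lists):
# put a writeback flashcache device, backed by a local SSD partition, in front of a mapped RBD
flashcache_create -p back rbdcache /dev/sdb1 /dev/rbd0
mkfs.xfs /dev/mapper/rbdcache
mount /dev/mapper/rbdcache /srv/data
The cache device itself is the easy part; as John notes, the hard part is HA - the SSD now holds acknowledged writes, so it has to be shared or mirrored between whichever hosts might take over the RBD.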
[ceph-users] Perf problem after upgrade from dumpling to firefly
Hi, last Saturday I upgraded my production cluster from dumpling to emperor (since we were successfully using it on a test cluster). A couple of hours later, we had falling OSDs: some of them were marked as down by Ceph, probably because of IO starvation. I set the cluster to «noout», started the downed OSDs, then let it recover. 24h later, same problem (at nearly the same hour). So I chose to upgrade directly to firefly, which is maintained. Things are better, but the cluster is slower than with dumpling. The main problem seems to be that the OSDs do twice as many write operations per second:
https://daevel.fr/img/firefly/firefly-upgrade-OSD70-IO.png
https://daevel.fr/img/firefly/firefly-upgrade-OSD71-IO.png
But the journal doesn't change (SSD dedicated to OSD 70+71+72):
https://daevel.fr/img/firefly/firefly-upgrade-OSD70+71-journal.png
Neither does the node bandwidth:
https://daevel.fr/img/firefly/firefly-upgrade-dragan-bandwidth.png
Or the whole cluster IO activity:
https://daevel.fr/img/firefly/firefly-upgrade-cluster-IO.png
Some background: The cluster is split into pools with «full SSD» OSDs and «HDD+SSD journal» OSDs. Only the «HDD+SSD» OSDs seem to be affected. I have 9 OSDs per «HDD+SSD» node (9 HDD and 3 SSD), and only 3 «HDD+SSD» nodes (so a total of 27 «HDD+SSD» OSDs). The IO peak between 03h00 and 09h00 corresponds to snapshot rotation (= «rbd snap rm» operations). osd_snap_trim_sleep has been set to 0.8 for months. Yesterday I tried to reduce osd_pg_max_concurrent_snap_trims to 1. It doesn't seem to really help. The only thing which seems to help is to reduce osd_disk_threads from 8 to 1. So... Any idea about what's happening? Thanks for any help, Olivier
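One way to see what the snap-trim and disk-thread settings actually are on a running OSD, and to change them without a restart, is the admin socket plus injectargs (a sketch; osd.70 is just an example id and the values simply mirror the ones mentioned above):
# check the values the running daemon is using
ceph --admin-daemon /var/run/ceph/ceph-osd.70.asok config show | grep -E 'snap_trim|disk_threads'
# adjust at runtime on all OSDs
ceph tell osd.* injectargs '--osd_snap_trim_sleep 0.8'
ceph tell osd.* injectargs '--osd_pg_max_concurrent_snap_trims 1'
Whether these settings help at all here is of course exactly the open question in this thread.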
Re: [ceph-users] Rbd image's data deletion
An RBD image is split up into (by default 4MB) objects within the OSDs. When you delete an RBD image, all the objects associated with the image are removed from the OSDs. The objects are not securely erased from the OSDs, if that is what you are asking. -- Jason Dillaman Red Hat dilla...@redhat.com http://www.redhat.com
- Original Message - From: Giuseppe Civitella giuseppe.civite...@gmail.com To: ceph-users ceph-us...@ceph.com Sent: Tuesday, March 3, 2015 11:36:46 AM Subject: [ceph-users] Rbd image's data deletion
Hi all, what happens to the data contained in an rbd image when the image itself gets deleted? Is the data just unlinked, or is it destroyed in a way that makes it unreadable? Thanks, Giuseppe
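A simple way to watch this happen (a sketch; the pool name, image name and prefix value are examples) is to look up the image's object name prefix and list the backing objects before and after the delete:
# the prefix that every data object of this image shares
rbd -p rbd info myimage | grep block_name_prefix
# objects currently backing the image
rados -p rbd ls | grep rbd_data.10074b0dc51
rbd -p rbd rm myimage
# after the rm completes, the same listing should come back empty
rados -p rbd ls | grep rbd_data.10074b0dc51
As Jason says, removal just deletes these objects; the freed space on the OSDs is not overwritten, so this is not a secure erase.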
Re: [ceph-users] Perf problem after upgrade from dumpling to firefly
Hi, maybe this is related?:
http://tracker.ceph.com/issues/9503 Dumpling: removing many snapshots in a short time makes OSDs go berserk
http://tracker.ceph.com/issues/9487 dumpling: snaptrimmer causes slow requests while backfilling. osd_snap_trim_sleep not helping
http://lists.opennebula.org/pipermail/ceph-users-ceph.com/2014-December/045116.html
I think it's already backported in dumpling, not sure it's already done for firefly. Alexandre
[ceph-users] Firefly, cephfs issues: different unix rights depending on the client and ls are slow
Hi, I'm trying cephfs and I have some problems. Here is the context:
All the nodes (in the cluster and the clients) are Ubuntu 14.04 with a 3.16 kernel (after apt-get install linux-generic-lts-utopic and a reboot). The cluster:
- one server with just one monitor daemon (RAM 2GB)
- 2 servers (RAM 24GB) with one monitor daemon, ~10 OSD daemons (one per disk of 275 GB), and one mds daemon (I use the default active/standby mode and the pools for cephfs are data and metadata)
The cluster is totally unused (the servers are idle as regards the RAM, the load average etc), it's a little cluster for testing, the raw space is 5172G, the number of replicas is 2. Another remark, facing my problem, I have put mds cache size = 100 in my ceph conf but without a lot of effect (or else I would not be posting this message). Initially, the cephfs is completely empty. The clients, test-cephfs and test-cephfs2, have 512MB of RAM. On these clients, I mount the cephfs like this (with the root account):
~# mkdir /cephfs
~# mount -t ceph 10.0.2.150,10.0.2.151,10.0.2.152:/ /cephfs/ -o name=cephfs,secretfile=/etc/ceph/ceph.client.cephfs.secret
Then on test-cephfs, I do:
root@test-cephfs:~# mkdir /cephfs/d1
root@test-cephfs:~# ll /cephfs/
total 4
drwxr-xr-x 1 root root 0 Mar 4 11:45 ./
drwxr-xr-x 24 root root 4096 Mar 4 11:42 ../
drwxr-xr-x 1 root root 0 Mar 4 11:45 d1/
Then, on test-cephfs2, I do:
root@test-cephfs2:~# ll /cephfs/
total 4
drwxr-xr-x 1 root root 0 Mar 4 11:45 ./
drwxr-xr-x 24 root root 4096 Mar 4 11:42 ../
drwxrwxrwx 1 root root 0 Mar 4 11:45 d1/
1) Why are the unix rights of d1/ different on test-cephfs and on test-cephfs2? They should be the same, shouldn't they?
2) If I create 100 files in /cephfs/d1/ on test-cephfs:
for i in $(seq 100)
do
    echo $(date +%s.%N) > /cephfs/d1/f_$i
done
then sometimes, on test-cephfs2, when I do a simple:
root@test-cephfs2:~# time \ls -la /cephfs
the command can take 2 or 3 seconds, which seems to me very long for a directory with just 100 files. Generally, if I repeat the command on test-cephfs2 just after, it's immediate, but not always. I cannot reproduce the problem in a deterministic way. Sometimes, to reproduce the problem, I must remove all the files in /cephfs/ on test-cephfs and recreate them. It's very strange. Sometimes and randomly, something seems to be stalled but I don't know what. I suspect a problem of mds tuning but, in fact, I don't know what to do. Do you have an idea of the problem?
3) I plan to use cephfs in production in a project of web servers (which share a cephfs storage) but I would like to solve the issue above first. If you have any suggestion about cephfs and mds tuning, I am highly interested. Thanks in advance for your help. -- François Lafont
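Two things commonly looked at in this situation (only a sketch; the cache value and the mds id are example placeholders, not a recommendation for this cluster) are the size of the MDS inode cache and the MDS-side counters during a slow ls:
# in ceph.conf on the MDS nodes, takes effect after an MDS restart;
# the default is 100000 inodes, and a larger cache uses more RAM on the MDS
[mds]
    mds cache size = 300000
# on the active MDS, the admin socket exposes cache and request counters
ceph --admin-daemon /var/run/ceph/ceph-mds.<id>.asok perf dump
Comparing the perf dump output before and after a slow `ls -la` can show whether the delay is spent in the MDS at all.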
Re: [ceph-users] Perf problem after upgrade from dumpling to firefly
Thanks Alexandre. The load problem is permanent: I have twice the IO/s on the HDDs since firefly. And yes, the problem hangs production at night during snap trimming. I suppose there is a new OSD parameter which changes the behavior of the journal, or something like that. But I didn't find anything about that. Olivier
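One way to test the "new OSD parameter" theory (a sketch; the osd id is an example, and it assumes you still have a dumpling OSD, or a saved dump from before the upgrade, to compare against) is to dump the full running configuration and diff it:
ceph --admin-daemon /var/run/ceph/ceph-osd.70.asok config show | sort > osd70-firefly.txt
# diff osd70-firefly.txt osd70-dumpling.txt
Any option whose default changed between the two releases shows up directly in the diff; the filestore and journal sections are the interesting ones for a write-amplification problem like this.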
Re: [ceph-users] Perf problem after upgrade from dumpling to firefly
The load problem is permanent: I have twice the IO/s on HDD since firefly.
Oh, permanent, that's strange. (If you don't see more traffic coming from clients, I don't understand...) Do you also see twice the ios/ops in the ceph -w stats? Is the ceph health ok?
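A quick way to answer that from the cluster side (plain status commands, nothing specific to this cluster):
# cluster-wide client IO, updated live - compare the op/s figures with the pre-upgrade graphs
ceph -w
# or a one-shot, per-pool view of client read/write rates
ceph osd pool stats
If the client op/s are unchanged while the HDD OSDs show twice the disk writes, the extra IO is being generated on the OSD side rather than by the clients.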
Re: [ceph-users] Perf problem after upgrade from dumpling to firefly
Ceph health is OK, yes. The «firefly-upgrade-cluster-IO.png» graph shows the IO stats seen by ceph: there is no change between dumpling and firefly. The change is only on the OSDs (and not on the OSD journals).
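To pin down where the extra writes on the HDD OSDs come from, one option (a sketch; osd.70 is an example id, and which counters matter depends on the version) is to capture the OSD's internal counters twice over a fixed interval and compare them:
ceph --admin-daemon /var/run/ceph/ceph-osd.70.asok perf dump | python -mjson.tool > perf_t0.json
sleep 60
ceph --admin-daemon /var/run/ceph/ceph-osd.70.asok perf dump | python -mjson.tool > perf_t1.json
diff perf_t0.json perf_t1.json
# the filestore and journal sections show how many ops and bytes the backend actually issued,
# which can be set against the client op rate reported by ceph -w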