Re: [ceph-users] about rgw region sync

2015-05-12 Thread Craig Lewis
Are you trying to set up replication on one cluster right now?

Generally, replication is set up between two different clusters, each having
one zone.  Both clusters are in the same region.

I can't think of a reason why two zones in one cluster wouldn't work.  It's
more complicated to set up, though.  Anything outside of a test setup would
need a lot of planning to make sure the two zones are as fault-isolated as
possible.  I'm pretty sure you need separate RadosGW nodes for each zone.
Sharing might be possible, but it will be easier if you don't.
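
For reference, two zones in one cluster roughly means two gateway instances
defined in ceph.conf, one per zone.  A minimal sketch, assuming Apache/FastCGI
and the region/zone names used later in this thread (hostnames and paths are
illustrative, not taken from your setup):

[client.radosgw.us-east-1]
    host = rgw-east                  # assumed: a dedicated RGW node for this zone
    rgw region = us
    rgw region root pool = .us.rgw.root
    rgw zone = us-east
    rgw zone root pool = .us-east.rgw.root
    keyring = /etc/ceph/ceph.client.radosgw.keyring
    rgw socket path = /var/run/ceph/ceph.radosgw.us-east-1.fastcgi.sock
    log file = /var/log/ceph/radosgw.us-east-1.log

[client.radosgw.us-west-1]
    host = rgw-west                  # assumed: a separate RGW node for the other zone
    rgw region = us
    rgw region root pool = .us.rgw.root
    rgw zone = us-west
    rgw zone root pool = .us-west.rgw.root
    keyring = /etc/ceph/ceph.client.radosgw.keyring
    rgw socket path = /var/run/ceph/ceph.radosgw.us-west-1.fastcgi.sock
    log file = /var/log/ceph/radosgw.us-west-1.log

Running each instance on its own RadosGW node keeps a gateway failure from
taking out both zones at once.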


I still haven't gone through your previous logs carefully.

On Tue, May 12, 2015 at 6:46 AM, TERRY 316828...@qq.com wrote:

 Could I build one region using two clusters, each cluster having one zone, so
 that I can sync metadata and data from one cluster to the other?
  I built two ceph clusters.
 For the first cluster, I did the following steps:
 1.create pools
 sudo ceph osd pool create .us-east.rgw.root 64  64
 sudo ceph osd pool create .us-east.rgw.control 64 64
 sudo ceph osd pool create .us-east.rgw.gc 64 64
 sudo ceph osd pool create .us-east.rgw.buckets 64 64
 sudo ceph osd pool create .us-east.rgw.buckets.index 64 64
 sudo ceph osd pool create .us-east.rgw.buckets.extra 64 64
 sudo ceph osd pool create .us-east.log 64 64
 sudo ceph osd pool create .us-east.intent-log 64 64
 sudo ceph osd pool create .us-east.usage 64 64
 sudo ceph osd pool create .us-east.users 64 64
 sudo ceph osd pool create .us-east.users.email 64 64
 sudo ceph osd pool create .us-east.users.swift 64 64
 sudo ceph osd pool create .us-east.users.uid 64 64

 2.create a keyring
 sudo ceph-authtool --create-keyring /etc/ceph/ceph.client.radosgw.keyring
 sudo chmod +r /etc/ceph/ceph.client.radosgw.keyring
 sudo ceph-authtool /etc/ceph/ceph.client.radosgw.keyring -n
 client.radosgw.us-east-1 --gen-key
 sudo ceph-authtool -n client.radosgw.us-east-1 --cap osd 'allow rwx' --cap
 mon 'allow rwx' /etc/ceph/ceph.client.radosgw.keyring
 sudo ceph -k /etc/ceph/ceph.client.admin.keyring auth add
 client.radosgw.us-east-1 -i /etc/ceph/ceph.client.radosgw.keyring

 3.create a region
 sudo radosgw-admin region set --infile us.json --name
 client.radosgw.us-east-1
 sudo radosgw-admin region default --rgw-region=us --name
 client.radosgw.us-east-1
 sudo radosgw-admin regionmap update --name client.radosgw.us-east-1
the content of us.json:
 cat us.json
 { "name": "us",
   "api_name": "us",
   "is_master": "true",
   "endpoints": [
       "http:\/\/WH-CEPH-TEST01.MATRIX.CTRIPCORP.COM:80\/",
       "http:\/\/WH-CEPH-TEST02.MATRIX.CTRIPCORP.COM:80\/"],
   "master_zone": "us-east",
   "zones": [
       { "name": "us-east",
         "endpoints": [
             "http:\/\/WH-CEPH-TEST01.MATRIX.CTRIPCORP.COM:80\/"],
         "log_meta": "true",
         "log_data": "true"},
       { "name": "us-west",
         "endpoints": [
             "http:\/\/WH-CEPH-TEST02.MATRIX.CTRIPCORP.COM:80\/"],
         "log_meta": "true",
         "log_data": "true"}],
   "placement_targets": [
       {
         "name": "default-placement",
         "tags": []
       }
   ],
   "default_placement": "default-placement"}
 4.create zones
 sudo radosgw-admin zone set --rgw-zone=us-east --infile
 us-east-secert.json --name client.radosgw.us-east-1
 sudo radosgw-admin regionmap update --name client.radosgw.us-east-1
 cat us-east-secert.json
 { "domain_root": ".us-east.domain.rgw",
   "control_pool": ".us-east.rgw.control",
   "gc_pool": ".us-east.rgw.gc",
   "log_pool": ".us-east.log",
   "intent_log_pool": ".us-east.intent-log",
   "usage_log_pool": ".us-east.usage",
   "user_keys_pool": ".us-east.users",
   "user_email_pool": ".us-east.users.email",
   "user_swift_pool": ".us-east.users.swift",
   "user_uid_pool": ".us-east.users.uid",
   "system_key": { "access_key": "XNK0ST8WXTMWZGN29NF9",
                   "secret_key": "7VJm8uAp71xKQZkjoPZmHu4sACA1SY8jTjay9dP5"},
   "placement_pools": [
       { "key": "default-placement",
         "val": { "index_pool": ".us-east.rgw.buckets.index",
                  "data_pool": ".us-east.rgw.buckets"}
       }
   ]
 }

 #5 Create zone users (system users)
 sudo radosgw-admin user create --uid=us-east --display-name=Region-US
 Zone-East --name client.radosgw.us-east-1
 --access_key=XNK0ST8WXTMWZGN29NF9
 --secret=7VJm8uAp71xKQZkjoPZmHu4sACA1SY8jTjay9dP5 --system
 sudo radosgw-admin user create --uid=us-west --display-name=Region-US
 Zone-West --name client.radosgw.us-east-1
 --access_key=AAK0ST8WXTMWZGN29NF9
 --secret=AAJm8uAp71xKQZkjoPZmHu4sACA1SY8jTjay9dP5 --system
 #6 Create zone users (non-system users)
 sudo radosgw-admin user create --uid=us-test-east
 --display-name=Region-US Zone-East-test --name client.radosgw.us-east-1
 --access_key=DDK0ST8WXTMWZGN29NF9
 --secret=DDJm8uAp71xKQZkjoPZmHu4sACA1SY8jTjay9dP5
 #7 subuser create
 sudo radosgw-admin subuser create --uid=us-test-east
 --subuser=us-test-east:swift --access=full --name
 client.radosgw.us-east-1 --key-type swift
 --secret=ffJm8uAp71xKQZkjoPZmHu4sACA1SY8jTjay9dP5
 sudo /etc/init.d/ceph -a restart
 sudo /etc/init.d/httpd restart
 sudo /etc/init.d/ceph-radosgw restart

 For the second cluster, I do the following steps
 

Re: [ceph-users] RadosGW - Hardware recomendations

2015-05-06 Thread Craig Lewis
RadosGW is pretty light compared to the rest of Ceph, but it depends on
your use case.


RadosGW just needs network bandwidth and a bit of CPU.  It doesn't access
the cluster network, just the public network.  If you have some spare
public network bandwidth, you can run on existing nodes.  If you plan to
build a big object store, you should dedicate some nodes.  Either way,
you'll want a big enough load balancer in front of them.  RadosGW is just
HTTP, so re-organizing the RadosGW topology is very easy.  For dedicated
hardware, I would use the same hardware that I use for a MON node.

For network bandwidth planning, think of RadosGW as a load balancer.  It's
simplistic, but it works to a first approximation.  An upload comes in to
RadosGW, and gets streamed out to the OSDs.  A download request is made,
RadosGW pulls the data from the OSDs, and sends it to the client.
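
A back-of-envelope version of that approximation, with made-up numbers (these
are assumptions for illustration, not figures from this thread):

client_gbps=10                        # assumed aggregate S3 client traffic
rgw_front=$client_gbps                # HTTP side of the gateways
rgw_back=$client_gbps                 # ceph public side; replication runs on the cluster network
echo "RGW tier NIC budget ~ $((rgw_front + rgw_back)) Gbit/s, split across your gateways, plus headroom"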

If you want the RadosGW public IPs on a network that isn't the ceph public
network, then I'd give them dedicated hardware with a connection to the
HTTP network and the ceph public network.



I only have 7 nodes in my cluster, and all RadosGW processes use a total of
~150 Mbps.  Because my usage is light, I'm running Apache and the RadosGW
daemon on the MON nodes.  Once those nodes start using 50% of their public
network bandwidth, I'll move RadosGW to dedicated hardware.



On Wed, May 6, 2015 at 2:09 PM, Italo Santos okd...@gmail.com wrote:

  Hello everyone,

 I’m building a new infrastructure which will serve the S3 protocol, and I’d
 like your help to estimate a hardware configuration for the radosgw servers.
 I found a lot of information at
 http://ceph.com/docs/master/start/hardware-recommendations/ but nothing
 about the radosgw daemon.

 Regards.

 *Italo Santos*
 http://italosantos.com.br/




Re: [ceph-users] about rgw region sync

2015-05-06 Thread Craig Lewis
System users are the only ones that need to be created in both zones.
Non-system users (and their sub-users) should be created in the primary
zone.  radosgw-agent will replicate them to the secondary zone.  I didn't
create sub-users for my system users, but I don't think it matters.

I can read my objects from the primary and secondary zones using the same
non-system user's Access and Secret.  Using the S3 API, I only had to
change the host name to use the DNS entries that point at the secondary
cluster, e.g. http://bucket1.us-east.myceph.com/object and
http://bucket1.us-west.myceph.com/object.


It's possible that adding the non-system users to the secondary zone causes
replication to fail.

I would verify that users, buckets, and objects are being replicated using
radosgw-admin.
`radosgw-admin --name $name bucket list`, `radosgw-admin --name $name user
info --uid=$username`, and `radosgw-admin --name $name --bucket=$bucket
bucket list`.  That will let you determine if you have a replication or an
access problem.
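
As a concrete sketch using the instance names from your setup and the example
bucket name above (substitute whatever non-system user and bucket you actually
created):

# on the primary (us-east) gateway:
radosgw-admin --name client.radosgw.us-east-1 bucket list
radosgw-admin --name client.radosgw.us-east-1 user info --uid=us-test-east
radosgw-admin --name client.radosgw.us-east-1 bucket list --bucket=bucket1

# on the secondary (us-west) gateway -- after radosgw-agent has run, the
# same users, buckets, and objects should show up here too:
radosgw-admin --name client.radosgw.us-west-1 bucket list
radosgw-admin --name client.radosgw.us-west-1 user info --uid=us-test-east
radosgw-admin --name client.radosgw.us-west-1 bucket list --bucket=bucket1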



On Wed, Apr 29, 2015 at 10:27 PM, TERRY 316828...@qq.com wrote:

 hi:
 I am using the following script to set up my cluster.
 I upgraded my radosgw-agent from version 1.2.0 to 1.2.2-1 (1.2.0 results in
 an error!)

 cat repeat.sh
 #!/bin/bash
 set -e
 set -x
 #1 create pools
 sudo ./create_pools.sh
 #2 create a keyring
 sudo ceph-authtool --create-keyring /etc/ceph/ceph.client.radosgw.keyring
 sudo chmod +r /etc/ceph/ceph.client.radosgw.keyring
 sudo ceph-authtool /etc/ceph/ceph.client.radosgw.keyring -n
 client.radosgw.us-east-1 --gen-key
 sudo ceph-authtool /etc/ceph/ceph.client.radosgw.keyring -n
 client.radosgw.us-west-1 --gen-key
 sudo ceph-authtool -n client.radosgw.us-east-1 --cap osd 'allow rwx' --cap
 mon 'allow rwx' /etc/ceph/ceph.client.radosgw.keyring
 sudo ceph-authtool -n client.radosgw.us-west-1 --cap osd 'allow rwx' --cap
 mon 'allow rwx' /etc/ceph/ceph.client.radosgw.keyring
 sudo ceph -k /etc/ceph/ceph.client.admin.keyring auth del
 client.radosgw.us-east-1
 sudo ceph -k /etc/ceph/ceph.client.admin.keyring auth del
 client.radosgw.us-west-1
 sudo ceph -k /etc/ceph/ceph.client.admin.keyring auth add
 client.radosgw.us-east-1 -i /etc/ceph/ceph.client.radosgw.keyring
 sudo ceph -k /etc/ceph/ceph.client.admin.keyring auth add
 client.radosgw.us-west-1 -i /etc/ceph/ceph.client.radosgw.keyring
 # 3 create a region
 sudo radosgw-admin region set --infile us.json --name
 client.radosgw.us-east-1
 set +e
 sudo rados -p .us.rgw.root rm region_info.default
 set -e
 sudo radosgw-admin region default --rgw-region=us --name
 client.radosgw.us-east-1
 sudo radosgw-admin regionmap update --name client.radosgw.us-east-1
 # try don't do it
 sudo radosgw-admin region set --infile us.json --name
 client.radosgw.us-west-1
 set +e
 sudo rados -p .us.rgw.root rm region_info.default
 set -e
 sudo radosgw-admin region default --rgw-region=us --name
 client.radosgw.us-west-1
 sudo radosgw-admin regionmap update --name client.radosgw.us-west-1
 # 4 create zones
 # try changing the us-east-no-secert.json file contents
 sudo radosgw-admin zone set --rgw-zone=us-east --infile
 us-east-no-secert.json --name client.radosgw.us-east-1
 sudo radosgw-admin zone set --rgw-zone=us-east --infile
 us-east-no-secert.json --name client.radosgw.us-west-1
 sudo radosgw-admin zone set --rgw-zone=us-west --infile
 us-west-no-secert.json --name client.radosgw.us-east-1
 sudo radosgw-admin zone set --rgw-zone=us-west --infile
 us-west-no-secert.json --name client.radosgw.us-west-1
 set +e
 sudo rados -p .rgw.root rm zone_info.default
 set -e
 sudo radosgw-admin regionmap update --name client.radosgw.us-east-1
 # try don't do it
 sudo radosgw-admin regionmap update --name client.radosgw.us-west-1
 #5 Create Zone Users system user
 sudo radosgw-admin user create --uid=us-east --display-name=Region-US
 Zone-East --name client.radosgw.us-east-1
 --access_key=XNK0ST8WXTMWZGN29NF9
 --secret=7VJm8uAp71xKQZkjoPZmHu4sACA1SY8jTjay9dP5 --system
 sudo radosgw-admin user create --uid=us-west --display-name=Region-US
 Zone-West --name client.radosgw.us-west-1
 --access_key=AAK0ST8WXTMWZGN29NF9
 --secret=AAJm8uAp71xKQZkjoPZmHu4sACA1SY8jTjay9dP5 --system
 sudo radosgw-admin user create --uid=us-east --display-name=Region-US
 Zone-East --name client.radosgw.us-west-1
 --access_key=XNK0ST8WXTMWZGN29NF9
 --secret=7VJm8uAp71xKQZkjoPZmHu4sACA1SY8jTjay9dP5 --system
 sudo radosgw-admin user create --uid=us-west --display-name=Region-US
 Zone-West --name client.radosgw.us-east-1
 --access_key=AAK0ST8WXTMWZGN29NF9
 --secret=AAJm8uAp71xKQZkjoPZmHu4sACA1SY8jTjay9dP5 --system
 #6 subuser create
 #may create a user without --system?
 sudo radosgw-admin subuser create --uid=us-east
 --subuser=us-east:swift --access=full --name client.radosgw.us-east-1
 --key-type swift --secret=7VJm8uAp71xKQZkjoPZmHu4sACA1SY8jTjay9dP5
 sudo radosgw-admin subuser create --uid=us-west
 --subuser=us-west:swift --access=full --name client.radosgw.us-west-1
 --key-type swift 

Re: [ceph-users] How to backup hundreds or thousands of TB

2015-05-06 Thread Craig Lewis
This is an older post of mine on this topic:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-April/038484.html.

The only thing that's changed since then is that Hammer now supports
RadosGW object versioning.  A combination of RadosGW replication,
versioning, and access control meets my needs for offsite backup.  I've
abandoned the RadosGW snapshots hack I was working on.


On Wed, May 6, 2015 at 1:25 AM, Götz Reinicke - IT Koordinator 
goetz.reini...@filmakademie.de wrote:

 Hi folks,

 Besides hardware, performance, and failover design: how do you manage
 to back up hundreds or thousands of TB :) ?

 Any suggestions? Best practices?

 A second ceph cluster at a different location? Bigger archive disks in
 good boxes? Or tape libraries?

 What kind of backup software can handle such volumes nicely?

 Thanks and regards . Götz
 --
 Götz Reinicke
 IT-Koordinator

 Tel. +49 7141 969 82 420
 E-Mail goetz.reini...@filmakademie.de

 Filmakademie Baden-Württemberg GmbH
 Akademiehof 10
 71638 Ludwigsburg
 www.filmakademie.de

 Eintragung Amtsgericht Stuttgart HRB 205016

 Vorsitzender des Aufsichtsrats: Jürgen Walter MdL
 Staatssekretär im Ministerium für Wissenschaft,
 Forschung und Kunst Baden-Württemberg

 Geschäftsführer: Prof. Thomas Schadt




Re: [ceph-users] Ceph Radosgw multi zone data replication failure

2015-04-27 Thread Craig Lewis
 [root@us-east-1 ceph]# ceph -s --name client.radosgw.us-east-1

 [root@us-east-1 ceph]# ceph -s --name client.radosgw.us-west-1

Are you trying to set up two zones on one cluster?  That's possible, but
you'll also want to spend some time on your CRUSH map making sure that the
two zones are as independent as possible (no shared disks, etc).

Are you using Civetweb or Apache + FastCGI?

Can you include the output (from both clusters):
radosgw-admin --name=client.radosgw.us-east-1 region get
radosgw-admin --name=client.radosgw.us-east-1 zone get

Double check that both system users exist in both clusters, with the same
secret.




On Sun, Apr 26, 2015 at 8:01 AM, Vickey Singh vickey.singh22...@gmail.com
wrote:

 Hello Geeks


 I am trying to setup Ceph Radosgw multi site data replication using
 official documentation
 http://ceph.com/docs/master/radosgw/federated-config/#multi-site-data-replication


 Everything seems to work except the radosgw-agent sync. Please check the
 outputs below and help me in any way possible.


 *Environment : *


 CentOS 7.0.1406

 Ceph Version 0.87.1

 Rados Gateway configured using Civetweb



 *Radosgw zone list : Works nicely *


 [root@us-east-1 ceph]# radosgw-admin zone list --name
 client.radosgw.us-east-1

 { "zones": [
     "us-west",
     "us-east"]}

 [root@us-east-1 ceph]#


 *Curl request to master zone : Works nicely *


 [root@us-east-1 ceph]# curl http://us-east-1.crosslogic.com:7480

 <?xml version="1.0" encoding="UTF-8"?><ListAllMyBucketsResult
 xmlns="http://s3.amazonaws.com/doc/2006-03-01/"><Owner><ID>anonymous</ID>
 <DisplayName></DisplayName></Owner><Buckets></Buckets></ListAllMyBucketsResult>

 [root@us-east-1 ceph]#


 *Curl request to secondary zone : Works nicely *


 [root@us-east-1 ceph]# curl http://us-west-1.crosslogic.com:7480

 <?xml version="1.0" encoding="UTF-8"?><ListAllMyBucketsResult
 xmlns="http://s3.amazonaws.com/doc/2006-03-01/"><Owner><ID>anonymous</ID>
 <DisplayName></DisplayName></Owner><Buckets></Buckets></ListAllMyBucketsResult>

 [root@us-east-1 ceph]#


 *Rados Gateway agent configuration file : Seems correct, no TYPO errors*


 [root@us-east-1 ceph]# cat cluster-data-sync.conf

 src_access_key: M7QAKDH8CYGTK86CG93U

 src_secret_key: 0xQR6PINk23W\/GYrWJ14aF+1stG56M6xMkqkdloO

 destination: http://us-west-1.crosslogic.com:7480

 dest_access_key: ZQ32ES1WAWPG05YMZ7T7

 dest_secret_key: INvk8AkrZRsejLEL34yRpMLmOqydt8ncOXy4RHCM

 log_file: /var/log/radosgw/radosgw-sync-us-east-west.log

 [root@us-east-1 ceph]#


 *Rados Gateway agent SYNC : fails. However, it can fetch the region map, so I
 think the src and dest KEYS are correct, but I don't know why it fails with
 AttributeError *



 *[root@us-east-1 ceph]# radosgw-agent -c cluster-data-sync.conf*

 *region map is: {u'us': [u'us-west', u'us-east']}*

 *Traceback (most recent call last):*

 *  File /usr/bin/radosgw-agent, line 21, in module*

 *sys.exit(main())*

 *  File /usr/lib/python2.7/site-packages/radosgw_agent/cli.py, line 275,
 in main*

 *except client.ClientException as e:*

 *AttributeError: 'module' object has no attribute 'ClientException'*

 *[root@us-east-1 ceph]#*


 *Can query to Ceph cluster using us-east-1 ID*


 [root@us-east-1 ceph]# ceph -s --name client.radosgw.us-east-1

 cluster 9609b429-eee2-4e23-af31-28a24fcf5cbc

  health HEALTH_OK

  monmap e3: 3 mons at {ceph-node1=
 192.168.1.101:6789/0,ceph-node2=192.168.1.102:6789/0,ceph-node3=192.168.1.103:6789/0},
 election epoch 448, quorum 0,1,2 ceph-node1,ceph-node2,ceph-node3

  osdmap e1063: 9 osds: 9 up, 9 in

   pgmap v8473: 1500 pgs, 43 pools, 374 MB data, 2852 objects

 1193 MB used, 133 GB / 134 GB avail

 1500 active+clean

 [root@us-east-1 ceph]#


 *Can query to Ceph cluster using us-west-1 ID*


 [root@us-east-1 ceph]# ceph -s --name client.radosgw.us-west-1

 cluster 9609b429-eee2-4e23-af31-28a24fcf5cbc

  health HEALTH_OK

  monmap e3: 3 mons at {ceph-node1=
 192.168.1.101:6789/0,ceph-node2=192.168.1.102:6789/0,ceph-node3=192.168.1.103:6789/0},
 election epoch 448, quorum 0,1,2 ceph-node1,ceph-node2,ceph-node3

  osdmap e1063: 9 osds: 9 up, 9 in

   pgmap v8473: 1500 pgs, 43 pools, 374 MB data, 2852 objects

 1193 MB used, 133 GB / 134 GB avail

 1500 active+clean

 [root@us-east-1 ceph]#


 *Hope these packages are correct*


 [root@us-east-1 ceph]# rpm -qa | egrep -i "ceph|radosgw"

 libcephfs1-0.87.1-0.el7.centos.x86_64

 ceph-common-0.87.1-0.el7.centos.x86_64

 python-ceph-0.87.1-0.el7.centos.x86_64

 ceph-radosgw-0.87.1-0.el7.centos.x86_64

 ceph-release-1-0.el7.noarch

 ceph-0.87.1-0.el7.centos.x86_64

 radosgw-agent-1.2.1-0.el7.centos.noarch

 [root@us-east-1 ceph]#



 Regards

 VS



Re: [ceph-users] cluster not coming up after reboot

2015-04-23 Thread Craig Lewis
On Thu, Apr 23, 2015 at 5:20 AM, Kenneth Waegeman

 So it is all fixed now, but is it explainable that at first about 90% of
 the OSDs were going into shutdown over and over, and only reached a stable
 situation after some time, because of one host's network failure?

 Thanks again!


Yes, unless you've adjusted:
[global]
  mon osd min down reporters = 9
  mon osd min down reports = 12

OSDs talk to the MONs on the public network.  The cluster network is only
used for OSD to OSD communication.

If one OSD node can't talk on that network, the other nodes will tell the
MONs that its OSDs are down.  And that node will also tell the MONs that
all the other OSDs are down.  Then the OSDs marked down will tell the MONs
that they're not down, and the cycle will repeat.

I'm somewhat surprised that your cluster eventually stabilized.


I have 8 OSDs per node.  I set my min down reporters high enough that no
single node can mark another node's OSDs down.
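
If you want to apply that to a running cluster without restarting the MONs,
something like this should work (a sketch; pick numbers that fit your own
OSDs-per-node count, and keep the ceph.conf entries above so the change
survives restarts):

ceph tell mon.* injectargs '--mon-osd-min-down-reporters 9 --mon-osd-min-down-reports 12'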


Re: [ceph-users] unbalanced OSDs

2015-04-22 Thread Craig Lewis
`ceph osd reweight-by-utilization <percentage>` needs that extra argument to
do something here.  The recommended starting value is 120.  Run it again with
lower and lower values until you're happy.  The value is a percentage, and I'm
not sure what happens if you go below 100.  If you get into trouble with
this (too much backfilling causing problems), you can use `ceph osd reweight
<osd-id> 1` to go back to normal; just look at `ceph osd tree` to see the
reweighted OSDs.

Bear in mind that reweight-by-utilization adjusts the osd weight, which is
not a permanent value.  In/out events will reset this weight.

But that's ok, because you don't need the reweight to last very long.  Even
if you get it perfectly balanced, you're going to be at ~75%.  I order more
hardware when I hit 70% utilization.  Once you start adding hardware, the
data distribution will change, so any permanent weights you set will
probably be wrong.


If you do want the weights to be permanent, you should look at `ceph osd
crush reweight osd.<id> <weight>`.  This permanently changes the weight in
the crush map, and it's not affected by in/out events.  Bear in mind that
you'll probably have to revisit all of these weights anytime your cluster
changes.  Also note that this weight is different than `ceph osd
reweight`.  This weight is the disk size in TiB.  I recommend small changes
to all over- and under-utilized disks, then re-evaluate after each pass.
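
A rough sketch of both approaches side by side (the OSD id and weights below
are made up; for the CRUSH case the weight is roughly the disk size in TiB):

# temporary, utilization-based; resets on in/out events:
ceph osd reweight-by-utilization 120        # re-run with 115, 110, ... until happy
ceph osd tree                               # check which OSDs got reweighted
ceph osd reweight 42 1                      # undo a single OSD if backfill hurts

# permanent, survives in/out events:
ceph osd crush reweight osd.42 3.5          # nudge an over-full 4 TB (3.64 TiB) disk down a bit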


On Wed, Apr 22, 2015 at 4:12 AM, Stefan Priebe - Profihost AG 
s.pri...@profihost.ag wrote:

 Hello,

 I've got heavily unbalanced OSDs.

 Some are at 61% usage and some at 86%.

 Which is 372G free space vs 136G free space.

 All are up and are weighted at 1.

 I'm running firefly with tunables to optimal and hashpspool 1.

 Also a reweight-by-utilization does nothing.

 # ceph osd reweight-by-utilization
 no change: average_util: 0.714381, overload_util: 0.857257. overloaded
 osds: (none)

 Stefan


Re: [ceph-users] Odp.: Odp.: CEPH 1 pgs incomplete

2015-04-22 Thread Craig Lewis
`ceph pg query` says all the OSDs are being probed.  If those 6 OSDs are
staying up, it probably just needs some time.  The OSDs need to stay up
longer than 15 minutes.  If any of them are getting marked down at all,
that'll cause problems.  I'd like to see the past intervals in the recovery
state get smaller.  All of those entries indicate potential history that
needs to be reconciled.  If that array is getting smaller, then recovery is
proceeding.

You could try pushing it a bit with a `ceph pg scrub 0.37`.  If that finishes
without any improvement, try `ceph pg deep-scrub 0.37`.  Sometimes it helps
move things faster, and sometimes it doesn't.
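
Using the PG from this thread, the sequence is just (a sketch of the steps,
not a guaranteed fix):

ceph pg scrub 0.37
# if nothing improves once that completes:
ceph pg deep-scrub 0.37
# then re-check whether the past intervals list is shrinking:
ceph pg 0.37 query | less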



On Wed, Apr 22, 2015 at 11:54 AM, MEGATEL / Rafał Gawron 
rafal.gaw...@megatel.com.pl wrote:

  All OSDs work fine now
  ceph osd tree
  ID  WEIGHT TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY
  -1 1080.71985 root default
  -2  120.07999 host s1
   0   60.03999 osd.0   up  1.0  1.0
   1   60.03999 osd.1   up  1.0  1.0
  -3  120.07999 host s2
   2   60.03999 osd.2   up  1.0  1.0
   3   60.03999 osd.3   up  1.0  1.0
  -4  120.07999 host s3
   4   60.03999 osd.4   up  1.0  1.0
   5   60.03999 osd.5   up  1.0  1.0
  -5  120.07999 host s4
   6   60.03999 osd.6   up  1.0  1.0
   7   60.03999 osd.7   up  1.0  1.0
  -6  120.07999 host s5
   9   60.03999 osd.9   up  1.0  1.0
   8   60.03999 osd.8   up  1.0  1.0
  -7  120.07999 host s6
  10   60.03999 osd.10  up  1.0  1.0
  11   60.03999 osd.11  up  1.0  1.0
   -8  120.07999 host s7
  12   60.03999 osd.12  up  1.0  1.0
  13   60.03999 osd.13  up  1.0  1.0
   -9  120.07999 host s8
  14   60.03999 osd.14  up  1.0  1.0
   15   60.03999 osd.15  up  1.0  1.0
 -10  120.07999 host s9
  17   60.03999 osd.17  up  1.0  1.0
  16   60.03999 osd.16  up  1.0  1.0


 Earlier I had a power failure and my cluster was down.
 After it came back up it was recovering, but now I have:
 1 pgs incomplete
 1 pgs stuck inactive
 1 pgs stuck unclean

 The cluster can't recover this PG.
 I tried taking some OSDs out and adding them back to the cluster, but the
 recovery after that didn't rebuild my cluster.


  --
 *From:* Craig Lewis cle...@centraldesktop.com
 *Sent:* 22 April 2015 20:40
 *To:* MEGATEL / Rafał Gawron
 *Subject:* Re: Odp.: [ceph-users] CEPH 1 pgs incomplete

  So you have flapping OSDs.  None of the 6 OSDs involved in that PG are
 staying up long enough to complete the recovery.

  What's happened is that because of how quickly the OSDs are coming up
 and failing, no single OSD has a complete copy of the data.  There should
 be a complete copy of the data, but different osds have different chunks of
 it.

  Figure out why those 6 OSDs are failing, and Ceph should recover.  Do
 you see anything interesting in those OSD logs?  If not, you might need to
 increase the logging levels.



Re: [ceph-users] What is a dirty object

2015-04-20 Thread Craig Lewis
On Mon, Apr 20, 2015 at 3:38 AM, John Spray john.sp...@redhat.com wrote:


 I hadn't noticed that we presented this as nonzero for regular pools
 before, it is a bit weird.  Perhaps we should show zero here instead for
 non-cache-tier pools.


I have always planned to add a cold EC tier later, once my cluster was
large enough to make tiers worthwhile.  This minor change seems like it
would make that more complicated.


Re: [ceph-users] many slow requests on different osds (scrubbing disabled)

2015-04-17 Thread Craig Lewis
I've seen something like this a few times.

Once, I lost the battery in my battery-backed RAID card.  That caused all
the OSDs on that host to be slow, which triggered slow request notices
pretty much cluster-wide.  It was only when I histogrammed the slow request
notices that I saw most of them were on a single node.  I compared the disk
latency graphs between nodes, and saw that one node had a much higher write
latency.  This took me a while to track down.
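
For what it's worth, a crude way to build that histogram from the cluster log
(the log path is an assumption; adjust it for your cluster name):

grep 'slow request' /var/log/ceph/ceph.log \
    | grep -o 'osd\.[0-9]*' | sort | uniq -c | sort -rn | head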

Another time, I had a consumer HDD that was slowly failing.  It would hit a
group of bad sectors, remap, repeat.  SMART warned me about it, so I
replaced the disk after the second slow request alert.  This was pretty
straightforward to diagnose, only because smartd notified me.


In both cases, I saw slow request notices on the affected disks.  Your
osd.284 says osd.186 and osd.177 are being slow, but osd.186 and osd.177
don't claim to be slow.

It's possible that there is another disk that is slow, causing osd.186 and
osd.177 replication to slow down.  With the PG distribution over OSDs, one
disk being a little slow can affect a large number of OSDs.


If SMART doesn't show you a disk is failing, I'd start looking for disks
(the disk itself, not the OSD daemon) with a high latency around your
problem times.  If you focus on the problem times, give it a +/- 10 minutes
window.  Sometimes it takes a little while for the disk slowness to spread
out enough for Ceph to complain.


On Wed, Apr 15, 2015 at 3:20 PM, Dominik Mostowiec 
dominikmostow...@gmail.com wrote:

 Hi,
 For a few days we have noticed many slow requests on our cluster.
 Cluster:
 ceph version 0.67.11
 3 x mon
 36 hosts - 10 osd ( 4T ) + 2 SSD (journals)
 Scrubbing and deep scrubbing are disabled, but the count of slow requests is
 still increasing.
 Disk utilisation has been very low since we disabled scrubbing.
 Log from one slow write with debug osd = 20/20:
 osd.284 - master: http://pastebin.com/xPtpNU6n
 osd.186 - replica: http://pastebin.com/NS1gmhB0
 osd.177 - replica: http://pastebin.com/Ln9L2Z5Z

 Can you help me find what is reason of it?

 --
 Regards
 Dominik


Re: [ceph-users] Managing larger ceph clusters

2015-04-17 Thread Craig Lewis
I'm running a small cluster, but I'll chime in since nobody else has.

Cern had a presentation a while ago (dumpling time-frame) about their
deployment.  They go over some of your questions:
http://www.slideshare.net/Inktank_Ceph/scaling-ceph-at-cern

My philosophy on Config Management is that it should save me time.  If it's
going to take me longer to write a recipe to do something, I'll just do it
by hand. Since my cluster is small, there are many things I can do faster
by hand.  This may or may not work for you, depending on your documentation
/ repeatability requirements.  For things that need to be documented, I'll
usually write the recipe anyway (I accept Chef recipes as documentation).


For my clusters, I'm using Chef to set up all nodes and manage ceph.conf.
I manually manage my pools, CRUSH map, RadosGW users, and disk
replacement.  I was using Chef to add new disks, but I ran into load
problems due to my small cluster size.  I'm currently adding disks
manually, to manage cluster load better.  As my cluster gets larger,
that'll be less important.

I'm also doing upgrades manually, because it's less work than writing the
Chef recipe to do a cluster upgrade.  Since Chef isn't cluster-aware, it
would be a pain to make the recipe cluster-aware enough to handle the
upgrade.  And I figure if I stall long enough, somebody else will write it
:-)  Ansible, with its cluster-wide coordination, looks like it would
handle that a bit better.



On Wed, Apr 15, 2015 at 2:05 PM, Stillwell, Bryan 
bryan.stillw...@twcable.com wrote:

 I'm curious what people managing larger ceph clusters are doing with
 configuration management and orchestration to simplify their lives?

 We've been using ceph-deploy to manage our ceph clusters so far, but
 feel that moving the management of our clusters to standard tools would
 provide a little more consistency and help prevent some mistakes that
 have happened while using ceph-deploy.

 We're looking at using the same tools we use in our OpenStack
 environment (puppet/ansible), but I'm interested in hearing from people
 using chef/salt/juju as well.

 Some of the cluster operation tasks that I can think of along with
 ideas/concerns I have are:

 Keyring management
   Seems like hiera-eyaml is a natural fit for storing the keyrings.

 ceph.conf
   I believe the puppet ceph module can be used to manage this file, but
   I'm wondering if using a template (erb?) might be better method to
   keeping it organized and properly documented.

 Pool configuration
   The puppet module seems to be able to handle managing replicas and the
   number of placement groups, but I don't see support for erasure coded
   pools yet.  This is probably something we would want the initial
   configuration to be set up by puppet, but not something we would want
   puppet changing on a production cluster.

 CRUSH maps
   Describing the infrastructure in yaml makes sense.  Things like which
   servers are in which rows/racks/chassis.  Also describing the type of
   server (model, number of HDDs, number of SSDs) makes sense.

 CRUSH rules
   I could see puppet managing the various rules based on the backend
   storage (HDD, SSD, primary affinity, erasure coding, etc).

 Replacing a failed HDD disk
   Do you automatically identify the new drive and start using it right
   away?  I've seen people talk about using a combination of udev and
   special GPT partition IDs to automate this.  If you have a cluster
   with thousands of drives I think automating the replacement makes
   sense.  How do you handle the journal partition on the SSD?  Does
   removing the old journal partition and creating a new one create a
   hole in the partition map (because the old partition is removed and
   the new one is created at the end of the drive)?

 Replacing a failed SSD journal
   Has anyone automated recreating the journal drive using Sebastien
   Han's instructions, or do you have to rebuild all the OSDs as well?


 http://www.sebastien-han.fr/blog/2014/11/27/ceph-recover-osds-after-ssd-jou
 rnal-failure/

 Adding new OSD servers
   How are you adding multiple new OSD servers to the cluster?  I could
   see an ansible playbook which disables nobackfill, noscrub, and
   nodeep-scrub followed by adding all the OSDs to the cluster being
   useful.

 Upgrading releases
   I've found an ansible playbook for doing a rolling upgrade which looks
   like it would work well, but are there other methods people are using?


 http://www.sebastien-han.fr/blog/2015/03/30/ceph-rolling-upgrades-with-ansi
 ble/

 Decommissioning hardware
   Seems like another ansible playbook for reducing the OSDs weights to
   zero, marking the OSDs out, stopping the service, removing the OSD ID,
   removing the CRUSH entry, unmounting the drives, and finally removing
   the server would be the best method here.  Any other ideas on how to
   approach this?


 That's all I can think of right now.  Is there any other tasks that
 people have run into 

Re: [ceph-users] Rebalance after empty bucket addition

2015-04-06 Thread Craig Lewis
Yes, it's expected.  The crush map contains the inputs to the CRUSH hashing
algorithm.  Every change made to the crush map causes the hashing algorithm
to behave slightly differently.  It is consistent though.  If you removed
the new bucket, it would go back to the way it was before you made the
change.

The Ceph team is working to reduce this, but it's unlikely to go away
completely.
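
If you want to estimate up front how much data a map change will move, you can
compare mappings offline with crushtool -- a sketch (a replica count of 2 is
assumed here; use your pools' real size):

ceph osd getcrushmap -o before.map
# ...add the empty bucket (or edit a decompiled copy of the map)...
ceph osd getcrushmap -o after.map
crushtool -i before.map --test --num-rep 2 --show-mappings > before.txt
crushtool -i after.map  --test --num-rep 2 --show-mappings > after.txt
diff before.txt after.txt | grep -c '^>'    # rough count of mappings that change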


On Sun, Apr 5, 2015 at 11:45 AM, Andrey Korolyov and...@xdel.ru wrote:

 Hello,

 after reaching certain ceiling of host/PG ratio, moving empty bucket
 in causes a small rebalance:

 ceph osd crush add-bucket 10.10.2.13
 ceph osd crush move 10.10.2.13 root=default rack=unknownrack

 I have two pools, one is very large and it is keeping up with proper
 amount of pg/osd but another one contains in fact lesser amount of PGs
 than the number of active OSDs and after insertion of empty bucket in
 it goes to a rebalance, though that the actual placement map is not
 changed. Keeping in mind that this case is very far from being
 offensive to any kind of a sane production configuration, is this an
 expected behavior?

 Thanks!


Re: [ceph-users] Recovering incomplete PGs with ceph_objectstore_tool

2015-04-06 Thread Craig Lewis
In that case, I'd set the crush weight to the disk's size in TiB, and mark
the osd out:
ceph osd crush reweight osd.<OSDID> <weight>
ceph osd out <OSDID>

Then your tree should look like:
-9  *2.72*   host ithome
30  *2.72* osd.30  up  *0*



An OSD can be UP and OUT, which causes Ceph to migrate all of its data
away.



On Thu, Apr 2, 2015 at 10:20 PM, Chris Kitzmiller ca...@hampshire.edu
wrote:

 On Apr 3, 2015, at 12:37 AM, LOPEZ Jean-Charles jelo...@redhat.com
 wrote:
 
  according to your ceph osd tree capture, although the OSD reweight is
 set to 1, the OSD CRUSH weight is set to 0 (2nd column). You need to assign
 the OSD a CRUSH weight so that it can be selected by CRUSH: ceph osd crush
 reweight osd.30 x.y (where 1.0=1TB)
 
  Only when this is done will you see if it joins.

 I don't really want osd.30 to join my cluster though. It is a purely
 temporary device that I restored just those two PGs to. It should still be
 able to (and be trying to) push out those two PGs with a weight of zero,
 right? I don't want any of my production data to migrate towards osd.30.


Re: [ceph-users] Error DATE 1970

2015-04-02 Thread Craig Lewis
No, but I've seen it in RadosGW too.  I've been meaning to post about it.
I get about ten a day, out of about 50k objects/day.


clewis@clewis-mac ~ (-) $ s3cmd ls s3://live32/ | grep '1970-01' | head -1
1970-01-01 00:00 0
s3://live-32/39020f17716a18b39efd8daa96e8245eb2901f353ba1004e724cb56de5367055

Also note the 0 byte filesize and the lack of MD5 checksum.  What I find
interesting is that I can fix this by downloading the file and uploading it
again.  The filename is a SHA256 hash of the file contents, and the file
downloads correctly every time.  I never see this in my replication
cluster, only the primary cluster.
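
The fix is nothing fancier than a round trip with the S3 client -- a sketch
with s3cmd, using the bucket and key from the example above (the local path is
illustrative):

key=39020f17716a18b39efd8daa96e8245eb2901f353ba1004e724cb56de5367055
s3cmd get s3://live-32/$key /tmp/$key
s3cmd put /tmp/$key s3://live-32/$key
s3cmd ls s3://live-32/$key         # the date and size should now look sane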


The access log looks kind of interesting for this file:
192.168.2.146 - - [23/Mar/2015:20:11:30 -0700] PUT
/39020f17716a18b39efd8daa96e8245eb2901f353ba1004e724cb56de5367055 HTTP/1.1
500 722 - aws-sdk-php2/2.7.20 Guzzle/3.9.2 curl/7.40.0 PHP/5.5.21
live-32.us-west-1.ceph.cdlocal
192.168.2.146 - - [23/Mar/2015:20:12:01 -0700] PUT
/39020f17716a18b39efd8daa96e8245eb2901f353ba1004e724cb56de5367055 HTTP/1.1
200 205 - aws-sdk-php2/2.7.20 Guzzle/3.9.2 curl/7.40.0 PHP/5.5.21
live-32.us-west-1.ceph.cdlocal
192.168.2.146 - - [23/Mar/2015:20:12:09 -0700] HEAD
/39020f17716a18b39efd8daa96e8245eb2901f353ba1004e724cb56de5367055 HTTP/1.1
200 250 - aws-sdk-php2/2.7.20 Guzzle/3.9.2 curl/7.40.0 PHP/5.5.21
live-32.us-west-1.ceph.cdlocal

31 seconds is a big spread between the initial PUT and the second PUT.  The
file is only 43k, so it'll be using the direct PUT, not multi-part upload.
I haven't verified this for all of them.

There's nothing in radosgw.log at the time of the 500.





On Wed, Apr 1, 2015 at 2:42 AM, Jimmy Goffaux ji...@goffaux.fr wrote:

 Hello,

 I found a strange behavior in Ceph. This behavior is visible on buckets
 (RGW) and pools (RBD).
 pools:

 ``
 root@:~# qemu-img info rbd:pool/kibana2
 image: rbd:pool/kibana2
 file format: raw
 virtual size: 30G (32212254720 bytes)
 disk size: unavailable
 Snapshot list:
 IDTAG VM SIZE  DATE   VM   CLOCK
 snap2014-08-26-kibana2snap2014-08-26-kibana2 30G 1970-01-01 01:00:00
 00:00:00.000
 snap2014-09-05-kibana2snap2014-09-05-kibana2 30G 1970-01-01 01:00:00
 00:00:00.000
 ``

 As you can see, all the dates are set to 1970-01-01?

 Here's the content of a JSON file in a bucket.

 ``
 {'bytes': 0, 'last_modified': '1970-01-01T00:00:00.000Z', 'hash': u'',
 'name': 'bab34dad-531c-4609-ae5e-62129b43b181'}
 ```

 You can see this is the same for the Last Modified date.

 Do you have any ideas?



 --

 Jimmy Goffaux


Re: [ceph-users] Production Ceph :: PG data lost : Cluster PG incomplete, inactive, unclean

2015-04-01 Thread Craig Lewis
 3.d60 0 0 0 0 0 0 0 incomplete 2015-04-01 21:21:16.158179 0'0 262813:169
 [60,56,220] 60 [60,56,220] 60 33552'321 2015-03-12 13:44:43.502907
 28356'39 2015-03-11 13:44:41.663482
 4.1fc 0 0 0 0 0 0 0 incomplete 2015-04-01 21:21:16.217291 0'0 262813:163
 [144,58,153] 144 [144,58,153] 144 0'0 2015-03-12 17:58:19.254170 0'0 
 2015-03-09
 17:54:55.720479
 3.e02 72 0 0 0 585105425 304 304 down+incomplete 2015-04-01
 21:21:16.099150 33568'304 262813:169744 [15,102,147] 15 [15,102,147] 15
 33568'304 2015-03-16 10:04:19.894789 2246'4 2015-03-09 11:43:44.176331
 8.1d4 0 0 0 0 0 0 0 down+incomplete 2015-04-01 21:21:16.218644 0'0
 262813:21867 [126,43,174] 126 [126,43,174] 126 0'0 2015-03-12
 14:34:35.258338 0'0 2015-03-12 14:34:35.258338
 4.2f4 0 0 0 0 0 0 0 down+incomplete 2015-04-01 21:21:16.117515 0'0
 262813:116150 [181,186,13] 181 [181,186,13] 181 0'0 2015-03-12
 14:59:03.529264 0'0 2015-03-09 13:46:40.601301
 3.e5a 76 70 0 0 623902741 325 325 incomplete 2015-04-01 21:21:16.043300
 33569'325 262813:73426 [97,22,62] 97 [97,22,62] 97 33569'325 2015-03-12
 13:58:05.813966 28433'44 2015-03-11 13:57:53.909795
 8.3a0 0 0 0 0 0 0 0 incomplete 2015-04-01 21:21:16.056437 0'0
 262813:175168 [62,14,224] 62 [62,14,224] 62 0'0 2015-03-12 13:52:44.546418
 0'0 2015-03-12 13:52:44.546418
 3.24e 0 0 0 0 0 0 0 incomplete 2015-04-01 21:21:16.130831 0'0 262813:165
 [39,202,90] 39 [39,202,90] 39 33556'272 2015-03-13 11:44:41.263725 2327'4 
 2015-03-09
 17:54:43.675552
 5.f7 0 0 0 0 0 0 0 incomplete 2015-04-01 21:21:16.145298 0'0 262813:153
 [54,193,123] 54 [54,193,123] 54 0'0 2015-03-12 17:58:30.257371 0'0 2015-03-09
 17:55:11.725629
 [root@pouta-s01 ceph]#


 ##  Example 1 : PG 10.70 ###


 *10.70 0 0 0 0 0 0 0 incomplete 2015-04-01 21:21:16.152179 0'0 262813:163
 [213,88,80] 213 [213,88,80] 213 0'0 2015-03-12 17:59:43.275049 0'0
 2015-03-09 17:55:58.745662*


 This is how i found location of each OSD

 [root@pouta-s01 ceph]# *ceph osd find 88*

 { "osd": 88,
   "ip": "10.100.50.3:7079\/916853",
   "crush_location": { "host": "pouta-s03",
       "root": "default"}}
 [root@pouta-s01 ceph]#


 When i manually check current/pg_head directory , data is not present here
 ( i.e. data is lost from all the copies )


 [root@pouta-s04 current]# ls -l
 /var/lib/ceph/osd/ceph-80/current/10.70_head
 *total 0*
 [root@pouta-s04 current]#


 On some of the OSD’s HEAD directory does not exists

 [root@pouta-s03 ~]# ls -l /var/lib/ceph/osd/ceph-88/current/10.70_head
 *ls: cannot access /var/lib/ceph/osd/ceph-88/current/10.70_head: No such
 file or directory*
 [root@pouta-s03 ~]#

 [root@pouta-s02 ~]# ls -l /var/lib/ceph/osd/ceph-213/current/10.70_head
 *total 0*
 [root@pouta-s02 ~]#


 # ceph pg 10.70 query  ---  *http://paste.ubuntu.com/10719840/
 http://paste.ubuntu.com/10719840/*


 ##  Example 2 : PG 3.7d0 ###

 *3.7d0 78 0 0 0 609222686 376 376 down+incomplete 2015-04-01
 21:21:16.135599 33538'376 262813:185045 [117,118,177] 117 [117,118,177] 117
 33538'376 2015-03-12 13:51:03.984454 28394'62 2015-03-11 13:50:58.196288*


 [root@pouta-s04 current]# ceph pg map 3.7d0
 osdmap e262813 pg 3.7d0 (3.7d0) - up [117,118,177] acting [117,118,177]
 [root@pouta-s04 current]#


 *Data is present here , so 1 copy is present out of 3 *

 *[root@pouta-s04 current]# ls -l
 /var/lib/ceph/osd/ceph-117/current/3.7d0_head/ | wc -l*
 *63*
 *[root@pouta-s04 current]#*



 [root@pouta-s03 ~]#  ls -l /var/lib/ceph/osd/ceph-118/current/3.7d0_head/
 *total 0*
 [root@pouta-s03 ~]#


 [root@pouta-s01 ceph]# ceph osd find 177
 { "osd": 177,
   "ip": "10.100.50.2:7062\/99",
   "crush_location": { "host": "pouta-s02",
       "root": "default"}}
 [root@pouta-s01 ceph]#

 *Even directory is not present here *

 [root@pouta-s02 ~]#  ls -l /var/lib/ceph/osd/ceph-177/current/3.7d0_head/
 *ls: cannot access /var/lib/ceph/osd/ceph-177/current/3.7d0_head/: No such
 file or directory*
 [root@pouta-s02 ~]#


 *# ceph pg  3.7d0 query http://paste.ubuntu.com/10720107/
 http://paste.ubuntu.com/10720107/*


 - Karan -

 On 20 Mar 2015, at 22:43, Craig Lewis cle...@centraldesktop.com wrote:

  osdmap e261536: 239 osds: 239 up, 238 in

 Why is that last OSD not IN?  The history you need is probably there.

 Run `ceph pg <pgid> query` on some of the stuck PGs.  Look for
 the recovery_state section.  That should tell you what Ceph needs to
 complete the recovery.


 If you need more help, post the output of a couple pg queries.



 On Fri, Mar 20, 2015 at 4:22 AM, Karan Singh karan.si...@csc.fi wrote:

 Hello Guys

 My CEPH cluster lost data and now it's not recovering.  This problem
 occurred when Ceph performed recovery while one of the nodes was down.
 Now all the nodes are up, but Ceph is showing PGs as incomplete, unclean,
 recovering.


 I have tried several things to recover them like , *scrub , deep-scrub ,
 pg repair , try changing primary affinity and then scrubbing ,
 osd_pool_default_size etc. BUT NO LUCK*

 Could yo please advice , how to recover PG and achieve HEALTH_OK

 # ceph

Re: [ceph-users] PGs issue

2015-03-20 Thread Craig Lewis
This seems to be a fairly consistent problem for new users.

The create-or-move is adjusting the crush weight, not the osd weight.
Perhaps the init script should set the default weight to 0.01 if it's <= 0?

It seems like there's a downside to this, but I don't see it.
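
A sketch of what that could look like -- the size-derived weight with a floor
bolted on (the path and the df-based calculation are assumptions about how the
init script computes the weight, not a patch):

id=0
# weight in TiB, derived from the mounted OSD's size
weight=$(df -P /var/lib/ceph/osd/ceph-$id | awk 'NR==2 {printf "%.2f", $2/(1024*1024*1024)}')
if [ "$(echo "$weight <= 0" | bc)" -eq 1 ]; then
    weight=0.01                     # the proposed floor for tiny test disks
fi
ceph osd crush create-or-move osd.$id $weight host=$(hostname -s) root=default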




On Fri, Mar 20, 2015 at 1:25 PM, Robert LeBlanc rob...@leblancnet.us
wrote:

 The weight can be based on anything, size, speed, capability, some random
 value, etc. The important thing is that it makes sense to you and that you
 are consistent.

 Ceph by default (ceph-disk and I believe ceph-deploy) take the approach of
 using size. So if you use a different weighting scheme, you should manually
 add the OSDs, or clean up after using ceph-disk/ceph-deploy. Size works
 well for most people, unless the disks are less than 10 GB so most people
 don't bother messing with it.

 On Fri, Mar 20, 2015 at 12:06 PM, Bogdan SOLGA bogdan.so...@gmail.com
 wrote:

 Thank you for the clarifications, Sahana!

 I haven't got to that part, yet, so these details were (yet) unknown to
 me. Perhaps some information on the PGs weight should be provided in the
 'quick deployment' page, as this issue might be encountered in the future
 by other users, as well.

 Kind regards,
 Bogdan


 On Fri, Mar 20, 2015 at 12:05 PM, Sahana shna...@gmail.com wrote:

 Hi Bogdan,

  Here is the link for hardware recommendations:
 http://ceph.com/docs/master/start/hardware-recommendations/#hard-disk-drives.
 As per this link, the minimum size recommended for OSDs is 1TB.
  But as Nick said, Ceph OSDs must be at least 10GB to get a weight of
 0.01.
 Here is the snippet from crushmaps section of ceph docs:

 Weighting Bucket Items

 Ceph expresses bucket weights as doubles, which allows for fine
 weighting. A weight is the relative difference between device capacities.
 We recommend using 1.00 as the relative weight for a 1TB storage
 device. In such a scenario, a weight of 0.5 would represent
 approximately 500GB, and a weight of 3.00 would represent approximately
 3TB. Higher level buckets have a weight that is the sum total of the leaf
 items aggregated by the bucket.

 Thanks

 Sahana

 On Fri, Mar 20, 2015 at 2:08 PM, Bogdan SOLGA bogdan.so...@gmail.com
 wrote:

 Thank you for your suggestion, Nick! I have re-weighted the OSDs and
 the status has changed to '256 active+clean'.

 Is this information clearly stated in the documentation, and I have
 missed it? In case it isn't - I think it would be recommended to add it, as
 the issue might be encountered by other users, as well.

 Kind regards,
 Bogdan


 On Fri, Mar 20, 2015 at 10:33 AM, Nick Fisk n...@fisk.me.uk wrote:

 I see the Problem, as your OSD's are only 8GB they have a zero weight,
 I think the minimum size you can get away with is 10GB in Ceph as the size
 is measured in TB and only has 2 decimal places.

 For a work around try running :-

 ceph osd crush reweight osd.X 1

 for each osd, this will reweight the OSD's. Assuming this is a test
 cluster and you won't be adding any larger OSD's in the future this
 shouldn't cause any problems.

 
  admin@cp-admin:~/safedrive$ ceph osd tree
  # id    weight  type name       up/down reweight
  -1      0       root default
  -2      0       host osd-001
  0       0       osd.0           up      1
  1       0       osd.1           up      1
  -3      0       host osd-002
  2       0       osd.2           up      1
  3       0       osd.3           up      1
  -4      0       host osd-003
  4       0       osd.4           up      1
  5       0       osd.5           up      1








Re: [ceph-users] Production Ceph :: PG data lost : Cluster PG incomplete, inactive, unclean

2015-03-20 Thread Craig Lewis
 osdmap e261536: 239 osds: 239 up, 238 in

Why is that last OSD not IN?  The history you need is probably there.

Run `ceph pg <pgid> query` on some of the stuck PGs.  Look for
the recovery_state section.  That should tell you what Ceph needs to
complete the recovery.
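
For example, with one of the PGs from your dump (just a quick way to pull out
the interesting part):

ceph pg 10.70 query > pg-10.70.json
grep -A 20 recovery_state pg-10.70.json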


If you need more help, post the output of a couple pg queries.



On Fri, Mar 20, 2015 at 4:22 AM, Karan Singh karan.si...@csc.fi wrote:

 Hello Guys

 My CEPH cluster lost data and now it's not recovering.  This problem
 occurred when Ceph performed recovery while one of the nodes was down.
 Now all the nodes are up, but Ceph is showing PGs as incomplete, unclean,
 recovering.


 I have tried several things to recover them like , *scrub , deep-scrub ,
 pg repair , try changing primary affinity and then scrubbing ,
 osd_pool_default_size etc. BUT NO LUCK*

 Could yo please advice , how to recover PG and achieve HEALTH_OK

 # ceph -s
 cluster 2bd3283d-67ef-4316-8b7e-d8f4747eae33
  health *HEALTH_WARN 19 pgs incomplete; 3 pgs recovering; 20 pgs
 stuck inactive; 23 pgs stuck unclean*; 2 requests are blocked  32 sec;
 recovery 531/980676 objects degraded (0.054%); 243/326892 unfound (0.074%)
  monmap e3: 3 mons at
 {xxx=:6789/0,xxx=:6789:6789/0,xxx=:6789:6789/0}, election epoch
 1474, quorum 0,1,2 xx,xx,xx
  osdmap e261536: 239 osds: 239 up, 238 in
   pgmap v415790: 18432 pgs, 13 pools, 2330 GB data, 319 kobjects
 20316 GB used, 844 TB / 864 TB avail
 531/980676 objects degraded (0.054%); 243/326892 unfound
 (0.074%)
1 creating
18409 active+clean
3 active+recovering
   19 incomplete




 # ceph pg dump_stuck unclean
 ok
 pg_stat objects mip degr unf bytes log disklog state state_stamp v
 reported up up_primary acting acting_primary last_scrub scrub_stamp
 last_deep_scrub deep_scrub_stamp
 10.70 0 0 0 0 0 0 0 incomplete 2015-03-20 12:19:49.534911 0'0 261536:1015
 [153,140,80] 153 [153,140,80] 153 0'0 2015-03-12 17:59:43.275049 0'0 
 2015-03-09
 17:55:58.745662
 3.dde 68 66 0 66 552861709 297 297 incomplete 2015-03-20 12:19:49.584839
 33547'297 261536:228352 [174,5,179] 174 [174,5,179] 174 33547'297 2015-03-12
 14:19:15.261595 28522'43 2015-03-11 14:19:13.894538
 5.a2 0 0 0 0 0 0 0 incomplete 2015-03-20 12:19:49.560756 0'0 261536:897
 [214,191,170] 214 [214,191,170] 214 0'0 2015-03-12 17:58:29.257085 0'0 
 2015-03-09
 17:55:07.684377
 13.1b6 0 0 0 0 0 0 0 incomplete 2015-03-20 12:19:49.846253 0'0 261536:1050
 [0,176,131] 0 [0,176,131] 0 0'0 2015-03-12 18:00:13.286920 0'0 2015-03-09
 17:56:18.715208
 7.25b 16 0 0 0 67108864 16 16 incomplete 2015-03-20 12:19:49.639102
 27666'16 261536:4777 [194,145,45] 194 [194,145,45] 194 27666'16 2015-03-12
 17:59:06.357864 2330'3 2015-03-09 17:55:30.754522
 5.19 0 0 0 0 0 0 0 incomplete 2015-03-20 12:19:49.742698 0'0 261536:25410
 [212,43,131] 212 [212,43,131] 212 0'0 2015-03-12 13:51:37.777026 0'0 
 2015-03-11
 13:51:35.406246
 3.a2f 0 0 0 0 0 0 0 creating 2015-03-20 12:42:15.586372 0'0 0:0 [] -1 []
 -1 0'0 0.00 0'0 0.00
 7.298 0 0 0 0 0 0 0 incomplete 2015-03-20 12:19:49.566966 0'0 261536:900
 [187,95,225] 187 [187,95,225] 187 27666'13 2015-03-12 17:59:10.308423
 2330'4 2015-03-09 17:55:35.750109
 3.a5a 77 87 261 87 623902741 325 325 active+recovering 2015-03-20
 10:54:57.443670 33569'325 261536:182464 [150,149,181] 150 [150,149,181]
 150 33569'325 2015-03-12 13:58:05.813966 28433'44 2015-03-11
 13:57:53.909795
 1.1e7 0 0 0 0 0 0 0 incomplete 2015-03-20 12:19:49.610547 0'0 261536:772
 [175,182] 175 [175,182] 175 0'0 2015-03-12 17:55:45.203232 0'0 2015-03-09
 17:53:49.694822
 3.774 79 0 0 0 645136397 339 339 incomplete 2015-03-20 12:19:49.821708
 33570'339 261536:166857 [162,39,161] 162 [162,39,161] 162 33570'339 2015-03-12
 14:49:03.869447 2226'2 2015-03-09 13:46:49.783950
 3.7d0 78 0 0 0 609222686 376 376 incomplete 2015-03-20 12:19:49.534004
 33538'376 261536:182810 [117,118,177] 117 [117,118,177] 117 33538'376 
 2015-03-12
 13:51:03.984454 28394'62 2015-03-11 13:50:58.196288
 3.d60 0 0 0 0 0 0 0 incomplete 2015-03-20 12:19:49.647196 0'0 261536:833
 [154,172,1] 154 [154,172,1] 154 33552'321 2015-03-12 13:44:43.502907
 28356'39 2015-03-11 13:44:41.663482
 4.1fc 0 0 0 0 0 0 0 incomplete 2015-03-20 12:19:49.610103 0'0 261536:1069
 [70,179,58] 70 [70,179,58] 70 0'0 2015-03-12 17:58:19.254170 0'0 2015-03-09
 17:54:55.720479
 3.e02 72 0 0 0 585105425 304 304 incomplete 2015-03-20 12:19:49.564768
 33568'304 261536:167428 [15,102,147] 15 [15,102,147] 15 33568'304 2015-03-16
 10:04:19.894789 2246'4 2015-03-09 11:43:44.176331
 8.1d4 0 0 0 0 0 0 0 incomplete 2015-03-20 12:19:49.614727 0'0 261536:19611
 [126,43,174] 126 [126,43,174] 126 0'0 2015-03-12 14:34:35.258338 0'0 
 2015-03-12
 14:34:35.258338
 4.2f4 0 0 0 0 0 0 0 incomplete 2015-03-20 12:19:49.595109 0'0
 261536:113791 [181,186,13] 181 [181,186,13] 181 0'0 2015-03-12
 14:59:03.529264 0'0 2015-03-09 13:46:40.601301
 3.52c 65 23 69 23 

Re: [ceph-users] Ceiling on number of PGs in a OSD

2015-03-20 Thread Craig Lewis
This isn't a hard limit on the number, but it's recommended that you keep
it around 100.  Smaller values cause data distribution evenness problems.
Larger values cause the OSD processes to use more CPU, RAM, and file
descriptors, particularly during recovery.  With that many OSDs, you're
going to want to increase your sysctl's, particularly open file
descriptors, open sockets, FDs per process, etc.
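
Some illustrative starting points (the values below are assumptions to tune,
not recommendations from this thread):

cat > /etc/sysctl.d/90-ceph.conf <<'EOF'
fs.file-max = 524288
kernel.pid_max = 4194303
net.core.somaxconn = 1024
EOF
sysctl --system
# plus, in ceph.conf under [global], raise the per-daemon open-file limit:
#   max open files = 131072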


You don't need the same number of placement groups for every pool.  Pools
without much data don't need as many PGs.  For example, I have a bunch of
pools for RGW zones, and they have 32 PGs each.  I have a total of 2600
PGs, 2048 are in the .rgw.buckets pool.

Also keep in mind that your pg_num and pgp_num need to be multiplied by the
number of replicas to get the PG-per-OSD count.  I have 2600 PGs and
replication 3, so I really have 7800 PG copies spread over 72 OSDs (about
108 per OSD).

Assuming you have one big pool, 750 OSDs, and replication 3, I'd go with
32k PGs on the big pool.  Same thing, but replication 2, I'd still go 32k,
but prepare to expand PGs with your next addition of OSDs.

If you're going to have several big pools (i.e., you're using RGW and RBD
heavily), I'd go with 16k PGs for the big pools, and adjust those over time
depending on which is used more heavily.  If RBD is consuming 2x the space,
then increase its pg_num and pgp_num during the next OSD expansion, but
don't increase RGW's pg_num and pgp_num.


The number of PGs per OSD should stay around 100 as you add OSDs.  If you
add 10x the OSDs, you'll multiply the pg_num and pgp_num by 10 too, which
gives you the same number of PGs per OSD.  My (pg_num / osd_num) fluctuates
between 75 and 200, depending on when I do the pg_num and pgp_num increase
relative to the OSD adds.

When you increase pg_num and pgp_num, don't do a large jump.  Ceph will
only allow you to double the value.  Even that is extreme.  It will cause
every OSD in the cluster to start splitting PGs.  When you want to double
your pg_num and pgp_num, it's recommended that you make several passes.  I
don't recall seeing any recommendations, but I'm planning to break my next
increase up into 10 passes.  I'm at 2048 now, so I'll probably add 204 PGs
until I get to 4096.
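
For illustration, those passes could look something like this (the pool name
and numbers follow the example above; the wait loops are crude health checks,
not an exact science):

pool=.rgw.buckets
cur=2048 step=204 target=4096
while [ "$cur" -lt "$target" ]; do
    cur=$((cur + step)); [ "$cur" -gt "$target" ] && cur=$target
    ceph osd pool set "$pool" pg_num "$cur"
    while ceph -s | grep -q creating; do sleep 30; done      # let the splits finish
    ceph osd pool set "$pool" pgp_num "$cur"
    while ceph -s | grep -qE 'peering|backfill|recover'; do sleep 60; done
done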




On Thu, Mar 19, 2015 at 6:12 AM, Sreenath BH bhsreen...@gmail.com wrote:

 Hi,

 Is there a ceiling on the number of placement groups in an OSD beyond
 which steady-state and/or recovery performance will start to suffer?

 Example: I need to create a pool with 750 osds (25 OSD per server, 50
 servers).
 The PG calculator gives me 65536 placement groups with 300 PGs per OSD.
 Now as the cluster expands, the number of PGs in a OSD has to increase as
 well.

 If the cluster size increases by a factor of 10, the number of PGs per
 OSD will also need to be increased.
 What would be the impact of a large PG number in an OSD on peering and
 rebalancing?

 There is 3GB per OSD available.

 thanks,
 Sreenath


Re: [ceph-users] RADOS Gateway Maturity

2015-03-20 Thread Craig Lewis
I have found a few incompatibilities, but so far they're all on the Ceph
side.  One example I remember was having to change the way we delete
objects.  The function we originally used fetches a list of object
versions, and deletes all versions.  Ceph is implementing object versions
now (I believe that'll ship with Hammer), so we had to call a different
function to delete the object without iterating over the versions.

AFAIK, that code should work fine if we point it at Amazon.  I haven't
tried it though.
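
For the curious, a rough boto (v2) sketch of the two delete approaches
described above; the endpoint, credentials, and names are placeholders, and
this is an illustration rather than our actual code:

  import boto
  from boto.s3.connection import OrdinaryCallingFormat

  conn = boto.connect_s3(aws_access_key_id='ACCESS', aws_secret_access_key='SECRET',
                         host='rgw.example.com', is_secure=False,
                         calling_format=OrdinaryCallingFormat())
  bucket = conn.get_bucket('mybucket')

  # old approach: enumerate the versions of an object and delete each one
  for version in bucket.list_versions(prefix='myobject'):
      bucket.delete_key(version.name, version_id=version.version_id)

  # simpler approach: delete the object directly, without touching versions
  bucket.delete_key('myobject')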


I've been using RGW (with replication) in production for 2 years now,
although my deployment isn't large.  So far, all of my RGW issues have been Ceph
issues.  Most of my issues are caused by my under-powered hardware, or
shooting myself in the foot with aggressive optimizations.  Things are
better with my journals on SSD, but the best thing I did was slow down with
my changes.  For example, I have 7 OSD nodes and 72 OSDs.  When I add new
OSDs, I add a couple at a time instead of adding all the disks in a node at
once.  Guess how I learned that lesson. :-)



On Wed, Mar 18, 2015 at 10:03 AM, Jerry Lam jerry@oicr.on.ca wrote:

  Hi Chris,

  Thank you for your reply.
 We are also thinking about using the S3 API but we are concerned about how
 compatible it is with the real S3. For instance, we would like to design
 the system using pre-signed URL for storing some objects. I read the ceph
 documentation, it does not mention if it supports it or not.

  My question is: do you find that code using the RADOS S3 API can
 easily run against Amazon S3 without any changes? If not, how much effort is
 needed to make it compatible?

  Best Regards,

  Jerry
  From: Chris Jones cjo...@cloudm2.com
 Date: Tuesday, March 17, 2015 at 4:39 PM
 To: Jerry Lam jerry@oicr.on.ca
 Cc: ceph-users@lists.ceph.com ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] RADOS Gateway Maturity

   Hi Jerry,

  I currently work at Bloomberg and we currently have a very large Ceph
 installation in production and we use the S3 compatible API for rados
 gateway. We are also re-architecting our new RGW and evaluating a different
 Apache configuration for a little better performance. We only use replicas
 right now, no erasure coding yet. Actually, you can take a look at our
 current configuration at https://github.com/bloomberg/chef-bcpc.

  -Chris

 On Tue, Mar 17, 2015 at 10:40 AM, Jerry Lam jerry@oicr.on.ca wrote:

  Hi Ceph user,

  I’m new to Ceph but I need to use Ceph as the storage for the Cloud we
 are building in house.
 Has anyone used RADOS Gateway in production? How mature is it in terms of
 compatibility with S3 / Swift?
 Anyone can share their experience on it?

  Best Regards,

  Jerry

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




  --
   Best Regards,
 Chris Jones

  http://www.cloudm2.com

  cjo...@cloudm2.com
  (p) 770.655.0770


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Uneven CPU usage on OSD nodes

2015-03-20 Thread Craig Lewis
I would say you're a little light on RAM.  With 4TB disks 70% full, I've
seen some ceph-osd processes using 3.5GB of RAM during recovery.  You'll be
fine during normal operation, but you might run into issues at the worst
possible time.

I have 8 OSDs per node, and 32G of RAM.  I've had ceph-osd processes start
swapping, and that's a great way to get them kicked out for being
unresponsive.


I'm not a dev, but I can make some wild and uninformed guesses :-) .  The
primary OSD uses more CPU than the replicas, and I suspect that you have
more primaries on the hot nodes.

Since you're testing, try repeating the test on 3 OSD nodes instead of 4.
If you don't want to run that test, you can generate a histogram from ceph
pg dump data, and see if there are more primary osds (the first one in the
acting array) on the hot nodes.
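
A quick way to build that histogram (a sketch; it assumes the JSON output of
ceph pg dump on your version exposes a pg_stats array with an acting list,
which it does on the versions I've used):

  ceph pg dump --format json 2>/dev/null | python -c '
  import json, sys, collections
  pgs = json.load(sys.stdin)["pg_stats"]
  counts = collections.Counter(pg["acting"][0] for pg in pgs)
  for osd, num in counts.most_common():
      print("osd.%d is primary for %d PGs" % (osd, num))
  '

Then map the osd ids back to hosts with ceph osd tree and compare the hot
nodes against the rest.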



On Wed, Mar 18, 2015 at 7:18 AM, f...@univ-lr.fr f...@univ-lr.fr wrote:

 Hi to the ceph-users list !

 We're setting up a new Ceph infrastructure :
 - 1 MDS admin node
 - 4 OSD storage nodes (60 OSDs)
   each of them running a monitor
 - 1 client

 Each 32GB RAM/16 cores OSD node supports 15 x 4TB SAS OSDs (XFS) and 1 SSD
 with 5GB journal partitions, all in JBOD attachement.
 Every node has 2x10Gb LACP attachement.
 The OSD nodes are freshly installed with puppet then from the admin node
 Default OSD weight in the OSD tree
 1 test pool with 4096 PGs

 During setup phase, we're trying to qualify the performance
 characteristics of our setup.
 Rados benchmark are done from a client with these commandes :
 rados -p pool -b 4194304 bench 60 write -t 32 --no-cleanup
 rados -p pool -b 4194304 bench 60 seq -t 32 --no-cleanup

 Each time we observed a recurring phenomena : 2 of the 4 OSD nodes have
 twice the CPU load :
 http://www.4shared.com/photo/Ua0umPVbba/UnevenLoad.html
 (What to look at is the real-time %CPU and the cumulated CPU time per
 ceph-osd process)

 And after a fresh complete reinstall to be sure, this twice-as-high CPU
 load is observed but not on the same 2 nodes :
 http://www.4shared.com/photo/2AJfd1B_ba/UnevenLoad-v2.html

 Nothing obvious about the installation seems able to explain that.

 The crush distribution function doesn't have more than 4.5% inequality
 between the 4 OSD nodes for the primary OSDs of the objects, and less than
 3% between the hosts if we considere the whole acting sets for the objects
 used during the benchmark. And the differences are not accordingly
 comparable to the CPU loads. So the cause has to be elsewhere.

 I cannot be sure it has no impact on performance. Even if we have enough
 CPU cores headroom, logic would say it has to have some consequences on
 delays and also on performances .

 Would someone have any idea, or reproduce the test on its setup to see if
 this is a common comportment ?


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Question Blackout

2015-03-20 Thread Craig Lewis
I'm not a CephFS user, but I have had a few cluster outages.

Each OSD has a journal, and Ceph ensures that a write is in all of the
journals (primary and replicas) before it acknowledges the write.  If an
OSD process crashes, it replays the journal on startup, and recovers the
write.

I've lost power at my data center, and had the whole cluster down.  Ceph
came back up when power was restored without me getting involved.


You might want the paid support package.  For extra peace of mind, you can
get a paid cluster review, and an engineer will go through your use case
with you.



On Tue, Mar 17, 2015 at 8:32 PM, Jesus Chavez (jeschave) jesch...@cisco.com
 wrote:

  Hi everyone, I am ready to launch ceph in production but there is one
 thing that keeps on my mind... If there was a blackout where all the ceph
 nodes went off, what would really happen to the filesystem? Would it get
 corrupted? Or does ceph have any kind of mechanism to survive something like
 that?
 Thanks


 Jesus Chavez
 SYSTEMS ENGINEER-C.SALES

 jesch...@cisco.com
 Phone: +52 55 5267 3146
 Mobile: +51 1 5538883255

 CCIE - 44433

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mapping users to different rgw pools

2015-03-16 Thread Craig Lewis
Yes, the placement target feature is logically separate from multi-zone
setups.  Placement targets are configured in the region though, which
somewhat muddies the issue.

Placement targets are a useful feature for multi-zone, so different zones in
a cluster don't share the same disks.  Federation setup is the only place
I've seen any discussion about the topic.  Even that is just a brief
mention.  I didn't see any documentation directly talking about setting up
placement targets, even in the federation guides.

It looks like you'll need to edit the default region to add the placement
targets, but you won't need to setup zones.  As far as I can tell, You'll
have to piece together what you need from the federation setup and some
experimentation.  I highly recommend a test VM that you can experiment on
before attempting anything in production.
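
As a rough, untested outline of what that experimentation might look like
(commands only; the JSON editing is by hand, and you'd pass --name for your
gateway user as usual):

  radosgw-admin region get > region.json
  # add your new target under "placement_targets" (and adjust "default_placement" if desired)
  radosgw-admin region set --infile region.json
  radosgw-admin regionmap update

  radosgw-admin zone get > zone.json
  # add a matching entry under "placement_pools" pointing at the rados pools you created
  radosgw-admin zone set --infile zone.json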




On Sun, Mar 15, 2015 at 11:53 PM, Sreenath BH bhsreen...@gmail.com wrote:

 Thanks.

 Is this possible outside of multi-zone setup. (With only one Zone)?

 For example, I want to have pools with different replication
 factors(or erasure codings) and map users to these pools.

 -Sreenath


 On 3/13/15, Craig Lewis cle...@centraldesktop.com wrote:
  Yes, RadosGW has the concept of Placement Targets and Placement Pools.
 You
  can create a target, and point it at a set of RADOS pools.  Those pools can
 be
  configured to use different storage strategies by creating different
  crushmap rules, and assigning those rules to the pool.
 
  RGW users can be assigned a default placement target.  When they create a
  bucket, they can either specify the target, or use their default one.
 All
  objects in a bucket are stored according to the bucket's placement
 target.
 
 
  I haven't seen a good guide for making use of these features.  The best
  guide I know of is the Federation guide (
  http://ceph.com/docs/giant/radosgw/federated-config/), but it only
 briefly
  mentions placement targets.
 
 
 
  On Thu, Mar 12, 2015 at 11:48 PM, Sreenath BH bhsreen...@gmail.com
 wrote:
 
  Hi all,
 
  Can one Radow gateway support more than one pool for storing objects?
 
  And as a follow-up question, is there a way to map different users to
  separate rgw pools so that their obejcts get stored in different
  pools?
 
  thanks,
  Sreenath
  ___
  ceph-users mailing list
  ceph-users@lists.ceph.com
  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RadosGW Direct Upload Limitation

2015-03-16 Thread Craig Lewis


 Maybe, but I'm not sure if Yehuda would want to take it upstream or
 not. This limit is present because it's part of the S3 spec. For
 larger objects you should use multi-part upload, which can get much
 bigger.
 -Greg


Note that the multi-part upload has a lower limit of 4MiB per part, and the
direct upload has an upper limit of 5GiB.

So you have to use both methods - direct upload for small files, and
multi-part upload for big files.

Your best bet is to use the Amazon S3 libraries.  They have functions that
take care of it for you.
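
If you do roll it yourself, here's roughly what the multi-part path looks
like with boto (v2); the endpoint, credentials, bucket and file names are
placeholders, and the part size is just an example well above the minimum:

  import os
  import boto
  from boto.s3.connection import OrdinaryCallingFormat

  conn = boto.connect_s3(aws_access_key_id='ACCESS', aws_secret_access_key='SECRET',
                         host='rgw.example.com', is_secure=False,
                         calling_format=OrdinaryCallingFormat())
  bucket = conn.get_bucket('mybucket')

  part_size = 100 * 1024 * 1024   # 100 MiB per part
  filename = 'bigfile.bin'
  filesize = os.path.getsize(filename)

  mp = bucket.initiate_multipart_upload('bigfile.bin')
  with open(filename, 'rb') as fp:
      part_num = 1
      offset = 0
      while offset < filesize:
          chunk = min(part_size, filesize - offset)
          fp.seek(offset)
          mp.upload_part_from_file(fp, part_num, size=chunk)
          offset += chunk
          part_num += 1
  mp.complete_upload()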


I'd like to see this mentioned in the Ceph documentation someplace.  When I
first encountered the issue, I couldn't find a limit in the RadosGW
documentation anywhere.  I only found the 5GiB limit in the Amazon API
documentation, which led me to test on RadosGW.  Now that I know it was
done to preserve Amazon compatibility, I don't want to override the value
anymore.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PGs stuck unclean active+remapped after an osd marked out

2015-03-16 Thread Craig Lewis


 If I remember/guess correctly, if you mark an OSD out it won't
 necessarily change the weight of the bucket above it (ie, the host),
 whereas if you change the weight of the OSD then the host bucket's
 weight changes.
 -Greg



That sounds right.  Marking an OSD out is a ceph osd reweight, not a ceph
osd crush reweight.

Experimentally confirmed.  I have an OSD out right now, and the host's
crush weight is the same as the other hosts' crush weight.
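
For reference, the commands look like this (the osd id and weight are
examples only):

  ceph osd out 12                       # what "marking out" does; effectively a reweight to 0
  ceph osd reweight 12 1.0              # the temporary weight; reset to 1 when the OSD is marked in
  ceph osd crush reweight osd.12 3.64   # the persistent CRUSH weight; this one changes the host bucket's weight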
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] query about mapping of Swift/S3 APIs to Ceph cluster APIs

2015-03-16 Thread Craig Lewis
On Sat, Mar 14, 2015 at 3:04 AM, pragya jain prag_2...@yahoo.co.in wrote:

 Hello all!

 I am working on Ceph object storage architecture from last few months.

 I am unable to find a document which describes how the Ceph object
 storage APIs (Swift/S3 APIs) are mapped to the Ceph storage cluster APIs
 (librados APIs) to store data in the Ceph storage cluster.

 As the documents say: RadosGW, a gateway interface for ceph object storage
 users, accepts user requests to store or retrieve data in the form of Swift
 APIs or S3 APIs and converts the user's request into RADOS requests.

 Please help me in knowing
 1. how does Radosgw convert user request to RADOS request ?
 2. how are HTTP requests mapped with RADOS request?


The RadosGW daemon takes care of that.  It's an application that sits on
top of RADOS.

For HTTP, there are a couple ways.  The older way has Apache accepting the
HTTP request, then forwarding that to the RadosGW daemon using FastCGI.
Newer versions support RadosGW handling the HTTP directly.

For the full details, you'll want to check out the source code at
https://github.com/ceph/ceph

If you're not interested enough to read the source code (I wasn't :-) ),
setup a test cluster.  Create a user, bucket, and object, and look at the
contents of the rados pools.
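
Something along these lines, assuming the default pool names (.rgw,
.rgw.buckets) of that era:

  radosgw-admin user create --uid=testuser --display-name="Test User"
  # create a bucket and upload an object with any S3 or Swift client, then poke around:
  rados lspools | grep rgw
  rados -p .rgw ls              # bucket entries
  rados -p .rgw.buckets ls      # the actual object data (head and shadow objects)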
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Shadow files

2015-03-16 Thread Craig Lewis
Out of curiousity, what's the frequency of the peaks and troughs?

RadosGW has configs on how long it should wait after deleting before
garbage collecting, how long between GC runs, and how many objects it can
GC in per run.

The defaults are 2 hours, 1 hour, and 32 respectively.  Search
http://docs.ceph.com/docs/master/radosgw/config-ref/ for rgw gc.
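
For reference, those settings and their defaults look like this in ceph.conf
(the section name is whatever your RGW client section is called):

  [client.radosgw.gateway]
    rgw gc obj min wait = 7200        # delay after delete before an object is eligible for GC
    rgw gc processor period = 3600    # how often a GC cycle runs
    rgw gc max objs = 32              # how much one GC cycle will process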

If your peaks and troughs have a frequency less than 1 hour, then GC is
going to delay and alias the disk usage w.r.t. the object count.

If you have millions of objects, you probably need to tweak those values.
If RGW is only GCing 32 objects an hour, it's never going to catch up.


Now that I think about it, I bet I'm having issues here too.  I delete more
than (32*24) objects per day...



On Sun, Mar 15, 2015 at 4:41 PM, Ben b@benjackson.email wrote:

 It is either a problem with CEPH, Civetweb or something else in our
 configuration.
 But deletes in user buckets is still leaving a high number of old shadow
 files. Since we have millions and millions of objects, it is hard to
 reconcile what should and shouldn't exist.

 Looking at our cluster usage, there are no troughs, it is just a rising
 peak.
 But when looking at users data usage, we can see peaks and troughs as you
 would expect as data is deleted and added.

 Our ceph version 0.80.9

 Please ideas?

 On 2015-03-13 02:25, Yehuda Sadeh-Weinraub wrote:

 - Original Message -

 From: Ben b@benjackson.email
 To: ceph-us...@ceph.com
 Sent: Wednesday, March 11, 2015 8:46:25 PM
 Subject: Re: [ceph-users] Shadow files

 Anyone got any info on this?

 Is it safe to delete shadow files?


 It depends. Shadow files are badly named objects that represent part
 of the object's data. They are only safe to remove if you know that the
 corresponding objects no longer exist.

 Yehuda


 On 2015-03-11 10:03, Ben wrote:
  We have a large number of shadow files in our cluster that aren't
  being deleted automatically as data is deleted.
 
  Is it safe to delete these files?
  Is there something we need to be aware of when deleting them?
  Is there a script that we can run that will delete these safely?
 
  Is there something wrong with our cluster that it isn't deleting these
  files when it should be?
 
  We are using civetweb with radosgw, with tengine ssl proxy infront of
  it
 
  Any advice please
  Thanks
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

  ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Can not list objects in large bucket

2015-03-13 Thread Craig Lewis
By default, radosgw only returns the first 1000 objects.  Looks like
radosgw-admin has the same limit.

Looking at the man page, I don't see any way to page through the list.  I
must be missing something.


The S3 API does have the ability to page through the list.  I use the
command line tool s3cmd to get the full bucket list.  It does require user
credentials though, so that might be a pain if you have many users.
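
For what it's worth, boto (v2) hides the marker handling from you, so a
listing script is short (the endpoint, credentials, and bucket name are
placeholders):

  import boto
  from boto.s3.connection import OrdinaryCallingFormat

  conn = boto.connect_s3(aws_access_key_id='ACCESS', aws_secret_access_key='SECRET',
                         host='rgw.example.com', is_secure=False,
                         calling_format=OrdinaryCallingFormat())
  # bucket.list() transparently follows the 1000-entry markers
  for key in conn.get_bucket('bucketB').list():
      print(key.name)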


On Wed, Mar 11, 2015 at 6:47 PM, Sean Sullivan seapasu...@uchicago.edu
wrote:

  I have a single radosgw user with 2 s3 keys and 1 swift key. I have
 created a few buckets and I can list all of the contents of bucket A and C
 but not B with either S3 (boto) or python-swiftclient. I am able to list
 the first 1000 entries using radosgw-admin 'bucket list --bucket=bucketB'
 without any issues but this doesn't really help.

 The odd thing is I can still upload and download objects in the bucket. I
 just can't list them. I tried setting the bucket canned_acl to private and
 public but I still can't list the objects inside.

 I'm using ceph .87 (Giant) Here is some info about the cluster::
 http://pastebin.com/LvQYnXem -- ceph.conf
 http://pastebin.com/efBBPCwa -- ceph -s
 http://pastebin.com/tF62WMU9 -- radosgw-admin bucket list
 http://pastebin.com/CZ8TkyNG -- python list bucket objects script
 http://pastebin.com/TUCyxhMD -- radosgw-admin bucket stats --bucketB
 http://pastebin.com/uHbEtGHs -- rados -p .rgw.buckets ls | grep
 default.20283.2 (bucketB marker)
 http://pastebin.com/WYwfQndV -- Python Error when trying to list BucketB
 via boto

 I have no idea why this could be happening outside of the acl. Has anyone
 seen this before? Any idea on how I can get access to this bucket again via
 s3/swift? Also is there a way to list the full list of a bucket via
 radosgw-admin and not the first 9000 lines / 1000 entries, or a way to page
 through them?

 EDIT:: I just fixed it (I hope) but the fix doesn't make any sense:

 radosgw-admin bucket unlink --uid=user --bucket=bucketB
 radosgw-admin bucket link --uid=user --bucket=bucketB
 --bucket-id=default.20283.2

 Now with swift or s3 (boto) I am able to list the bucket contents without
 issue ^_^

 Can someone elaborate on why this works and how it broke in the first
 place when ceph was health_ok the entire time? With 3 replicas how did this
 happen? Could this be a bug?  sorry for the rambling. I am confused and
 tired ;p




 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mapping users to different rgw pools

2015-03-13 Thread Craig Lewis
Yes, RadosGW has the concept of Placement Targets and Placement Pools.  You
can create a target, and point it a set of RADOS pools.  Those pools can be
configured to use different storage strategies by creating different
crushmap rules, and assigning those rules to the pool.

RGW users can be assigned a default placement target.  When they create a
bucket, they can either specify the target, or use their default one.  All
objects in a bucket are stored according to the bucket's placement target.
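
Assigning a user's default target can be done through the user metadata; a
rough, untested sketch (the uid is a placeholder):

  radosgw-admin metadata get user:johndoe > user.json
  # set "default_placement" in user.json to the name of your placement target
  radosgw-admin metadata put user:johndoe < user.json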


I haven't seen a good guide for making use of these features.  The best
guide I know of is the Federation guide (
http://ceph.com/docs/giant/radosgw/federated-config/), but it only briefly
mentions placement targets.



On Thu, Mar 12, 2015 at 11:48 PM, Sreenath BH bhsreen...@gmail.com wrote:

 Hi all,

 Can one Radow gateway support more than one pool for storing objects?

 And as a follow-up question, is there a way to map different users to
 separate rgw pools so that their obejcts get stored in different
 pools?

 thanks,
 Sreenath
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CEPH Expansion

2015-01-23 Thread Craig Lewis
It depends.  There are a lot of variables, like how many nodes and disks
you currently have, whether you're using journals on SSD, how much data is
already in the cluster, and what the client load is on the cluster.

Since you only have 40 GB in the cluster, it shouldn't take long to
backfill.  You may find that it finishes backfilling faster than you can
format the new disks.


Since you only have a single OSD node, you must've changed the crushmap to
allow replication over OSDs instead of hosts.  After you get the new node
in would be the best time to switch back to host level replication.  The
more data you have, the more painful that change will become.






On Sun, Jan 18, 2015 at 10:09 AM, Georgios Dimitrakakis 
gior...@acmac.uoc.gr wrote:

 Hi Jiri,

 thanks for the feedback.

 My main concern is if it's better to add each OSD one-by-one and wait for
 the cluster to rebalance every time or do it all-together at once.

 Furthermore an estimate of the time to rebalance would be great!

 Regards,

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CEPH Expansion

2015-01-23 Thread Craig Lewis
You've either modified the crushmap, or changed the pool size to 1.  The
defaults create 3 replicas on different hosts.

What does `ceph osd dump | grep ^pool` output?  If the size param is 1,
then you reduced the replica count.  If the size param is greater than 1, you must've
adjusted the crushmap.

Either way, after you add the second node would be the ideal time to change
that back to the default.


Given that you only have 40GB of data in the cluster, you shouldn't have a
problem adding the 2nd node.


On Fri, Jan 23, 2015 at 3:58 PM, Georgios Dimitrakakis gior...@acmac.uoc.gr
 wrote:

 Hi Craig!


 For the moment I have only one node with 10 OSDs.
 I want to add a second one with 10 more OSDs.

 Each OSD in every node is a 4TB SATA drive. No SSD disks!

 The data are approximately 40GB and I will do my best to have zero
 or at least very, very low load during the expansion process.

 To be honest I haven't touched the crushmap. I wasn't aware that I
 should have changed it. Therefore, it is still the default one.
 Is that OK? Where can I read about host-level replication in the CRUSH map
 in order to make sure that it's applied, or how can I find out whether this
 is already enabled?

 Any other things that I should be aware of?

 All the best,


 George


  It depends.  There are a lot of variables, like how many nodes and
 disks you currently have.  Are you using journals on SSD.  How much
 data is already in the cluster.  What the client load is on the
 cluster.

 Since you only have 40 GB in the cluster, it shouldn't take long to
 backfill.  You may find that it finishes backfilling faster than you
 can format the new disks.

 Since you only have a single OSD node, you must've changed the crushmap
 to allow replication over OSDs instead of hosts.  After you get the
 new node in would be the best time to switch back to host level
 replication.  The more data you have, the more painful that change
 will become.

 On Sun, Jan 18, 2015 at 10:09 AM, Georgios Dimitrakakis  wrote:

  Hi Jiri,

 thanks for the feedback.

 My main concern is if it's better to add each OSD one-by-one and
 wait for the cluster to rebalance every time or do it all-together
 at once.

 Furthermore an estimate of the time to rebalance would be great!

 Regards,



 Links:
 --
 [1] mailto:gior...@acmac.uoc.gr


 --

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Slow/Hung IOs

2015-01-09 Thread Craig Lewis
It doesn't seem like the problem here, but I've noticed that slow OSDs have
a large fan-out.  I have fewer than 100 OSDs, so every OSD talks to every
other OSD in my cluster.

I was getting slow notices from all of my OSDs.  Nothing jumped out, so I
started looking at disk write latency graphs.  I noticed that all the OSDs
in one node had 10x the write latency of the other nodes.  After that, I
graphed the number of slow notices per OSD, and noticed that a much higher
number of slow requests on that node.

Long story short, I lost a battery on my write cache.  But it wasn't at all
obvious from the slow request notices, not until I dug deeper.



On Mon, Jan 5, 2015 at 4:07 PM, Sanders, Bill bill.sand...@teradata.com
wrote:

  Thanks for the reply.

 14 and 18 happened to show up during that run, but it's certainly not only
 those OSDs.  It seems to vary each run.  Just from the runs I've done
 today I've seen the following pairs of OSD's:

 ['0,13', '0,18', '0,24', '0,25', '0,32', '0,34', '0,36', '10,22', '11,30',
 '12,28', '13,30', '14,22', '14,24', '14,27', '14,30', '14,31', '14,33',
 '14,34', '14,35', '14,39', '16,20', '16,27', '18,38', '19,30', '19,31',
 '19,39', '20,38', '22,30', '26,37', '26,38', '27,33', '27,34', '27,36',
 '28,32', '28,34', '28,36', '28,37', '3,18', '3,27', '3,29', '3,37', '4,10',
 '4,29', '5,19', '5,37', '6,25', '9,28', '9,29', '9,37']

 Which is almost all of the OSD's in the system.

 Bill

  --
 From: Lincoln Bryant [linco...@uchicago.edu]
 Sent: Monday, January 05, 2015 3:40 PM
 To: Sanders, Bill
 Cc: ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] Slow/Hung IOs

   Hi Bill,

  From your log excerpt, it looks like your slow requests are happening on
 OSDs 14 and 18. Is it always these two OSDs?

  If you don't have a long recovery time (e.g., the cluster is just full
 of test data), maybe you could try setting OSDs 14 and 18 out and
 re-benching?

  Alternatively I suppose you could just use bonnie++ or dd etc to write
 to those OSDs (careful to not clobber any Ceph dirs) and see how the
 performance looks.

  Cheers,
 Lincoln

   On Jan 5, 2015, at 4:36 PM, Sanders, Bill wrote:

   Hi Ceph Users,

 We've got a Ceph cluster we've built, and we're experiencing issues with
 slow or hung IO's, even running 'rados bench' on the OSD cluster.  Things
 start out great, ~600 MB/s, then rapidly drops off as the test waits for
 IO's. Nothing seems to be taxed... the system just seems to be waiting.
 Any help trying to figure out what could cause the slow IO's is appreciated.

 For example, 'rados -p rbd bench 60 write -t 32' takes over 900s to
 complete:

 A typical rados bench:
  Total time run: 957.458274
 Total writes made:  9251
 Write size: 4194304
 Bandwidth (MB/sec): 38.648

 Stddev Bandwidth:   157.323
 Max bandwidth (MB/sec): 964
 Min bandwidth (MB/sec): 0
 Average Latency:3.21126
 Stddev Latency: 51.9546
 Max latency:910.72
 Min latency:0.04516


 According to ceph.log, we're not experiencing any OSD flapping or monitor
 election cycles, just slow requests:

 # grep slow /var/log/ceph/ceph.log:
 2015-01-05 13:42:42.937678 osd.18 39.7.48.7:6803/11185 220 : [WRN] 3 slow
 requests, 1 included below; oldest blocked for > 513.611379 secs
 2015-01-05 13:42:42.937685 osd.18 39.7.48.7:6803/11185 221 : [WRN] slow
 request 30.136429 seconds old, received at 2015-01-05 13:42:12.801205:
 osd_op(client.92008.1:3101508 rb.0.1437.238e1f29.000f [write
 114688~512] 3.841c0edf ondisk+write e994) v4 currently waiting for subops
 from 3,37
 2015-01-05 13:42:49.938681 osd.18 39.7.48.7:6803/11185 222 : [WRN] 3 slow
 requests, 1 included below; oldest blocked for > 520.612372 secs
 2015-01-05 13:42:49.938688 osd.18 39.7.48.7:6803/11185 223 : [WRN] slow
 request 480.636547 seconds old, received at 2015-01-05 13:34:49.302080:
 osd_op(client.92008.1:3100010 rb.0.140d.238e1f29.0c77 [write
 3622400~512] 3.d031a69f ondisk+write e994) v4 currently waiting for subops
 from 26,37
 2015-01-05 13:43:12.941838 osd.18 39.7.48.7:6803/11185 224 : [WRN] 3 slow
 requests, 1 included below; oldest blocked for > 543.615545 secs
 2015-01-05 13:43:12.941844 osd.18 39.7.48.7:6803/11185 225 : [WRN] slow
 request 60.140595 seconds old, received at 2015-01-05 13:42:12.801205:
 osd_op(client.92008.1:3101508 rb.0.1437.238e1f29.000f [write
 114688~512] 3.841c0edf ondisk+write e994) v4 currently waiting for subops
 from 3,37
 2015-01-05 13:44:04.933440 osd.14 39.7.48.7:6818/11640 251 : [WRN] 4 slow
 requests, 1 included below; oldest blocked for > 606.941954 secs
 2015-01-05 13:44:04.933469 osd.14 39.7.48.7:6818/11640 252 : [WRN] slow
 request 240.101138 seconds old, received at 2015-01-05 13:40:04.832272:
 osd_op(client.92008.1:3101102 rb.0.142b.238e1f29.0010 [write
 475136~512] 3.5e623815 ondisk+write e994) v4 currently waiting for subops
 from 27,33
 2015-01-05 13:44:12.950805 osd.18 

Re: [ceph-users] backfill_toofull, but OSDs not full

2015-01-09 Thread Craig Lewis
What was the osd_backfill_full_ratio?  That's the config that controls
backfill_toofull.  By default, it's 85%.  The mon_osd_*_ratio settings affect
the ceph status output.

I've noticed that it takes a while for backfilling to restart after
changing osd_backfill_full_ratio.  Backfilling usually restarts for me in
10-15 minutes.  Some PGs will stay in that state until the cluster is
nearly done recoverying.

I've only seen backfill_toofull happen after the OSD exceeds the ratio (so
it's reactive, not proactive).  Mine usually happen when I'm rebalancing a
nearfull cluster, and an OSD backfills itself toofull.




On Mon, Jan 5, 2015 at 11:32 AM, c3 ceph-us...@lopkop.com wrote:

 Hi,

 I am wondering how a PG gets marked backfill_toofull.

 I reweighted several OSDs using ceph osd crush reweight. As expected, PGs
 began moving around (backfilling).

 Some PGs got marked +backfilling (~10), some +wait_backfill (~100).

 But some are marked +backfill_toofull. My OSDs are between 25% and 72%
 full.

 Looking at ceph pg dump, I can find the backfill_toofull PGs and verified
 the OSDs involved are less than 72% full.

 Do backfill reservations include a size? Are these OSDs projected to be
 toofull, once the current backfilling complete? Some of the
 backfill_toofull and backfilling point to the same OSDs.

 I did adjust the full ratios, but that did not change the backfill_toofull
 status.
 ceph tell mon.\* injectargs '--mon_osd_full_ratio 0.95'
 ceph tell osd.\* injectargs '--osd_backfill_full_ratio 0.92'


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Different disk usage on different OSDs

2015-01-08 Thread Craig Lewis
The short answer is that uniform distribution is a lower priority feature
of the CRUSH hashing algorithm.

CRUSH is designed to be consistent and stable in its hashing.  For the
details, you can read Sage's paper (
http://ceph.com/papers/weil-rados-pdsw07.pdf).  The goal is that if you
make a change to your cluster, there will be some moderate data movement,
but not everything moves.  If you then undo the change, things will go back
to exactly how they were before.

Doing that and getting uniform distribution is hard, and it's work in
progress.  The tunables are progress on this front, but they are by no
means the last word.


The current work around is to use ceph osd reweight-by-utilization.  That
tool will look at data distributions, and reweight things to bring the OSDs
more inline with each other.  Unfortunately, it does a ceph osd reweight,
not a ceph osd crush reweight.  (The existence of two different weighs with
different behavior is unfortunate too).  ceph osd reweight is temporary, in
that the value will be lost if a OSD is marked out.  ceph osd crush
reweight updates the CRUSHMAP, and it's not temporary.  So I use ceph osd
crush reweight manually.

While it would be nice if Ceph would automatically rebalance itself, I'd
turn that off.  Moving data around in my small cluster involves a major
performance hit.  By manually adjusting the crush weights, I have some
control over when and how much data is moved around.


I recommend taking a look at ceph osd tree and df on all nodes, and start
adjusting the crush weight of heavily used disks down, and under utilized
disks up.  The crush weight is generally the size (base2) of the disk in
TiB.  I adjust my OSDs up or down by 0.05 each step, then decide if I need
to make another pass. I have one 4 TiB drive with a weight of 4.14, and
another with a weight of 3.04.  They're still not balanced, but it's better.
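
Concretely, a pass looks something like this (the weights shown are made up;
start from the current weight in ceph osd tree and move it by 0.05):

  ceph osd tree                          # note each OSD's current crush weight, and compare with df
  ceph osd crush reweight osd.30 1.75    # example: an over-full OSD currently at 1.80, stepped down
  ceph osd crush reweight osd.87 1.85    # example: an under-full OSD currently at 1.80, stepped up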


If data migration has a smaller impact on your cluster, larger steps should
be fine.  And if anything causes major problems, just revert the change.
CRUSH is stable and consistent :-)




On Mon, Jan 5, 2015 at 2:04 AM, ivan babrou ibob...@gmail.com wrote:

 Hi!

 I have a cluster with 106 osds and disk usage is varying from 166gb to
 316gb. Disk usage is highly correlated to number of pg per osd (no surprise
 here). Is there a reason for ceph to allocate more pg on some nodes?

 The biggest osds are 30, 42 and 69 (300gb+ each) and the smallest are 87,
 33 and 55 (170gb each). The biggest pool has 2048 pgs, pools with very
 little data has only 8 pgs. PG size in biggest pool is ~6gb (5.1..6.3
 actually).

 Lack of balanced disk usage prevents me from using all the disk space.
 When the biggest osd is full, cluster does not accept writes anymore.

 Here's gist with info about my cluster:
 https://gist.github.com/bobrik/fb8ad1d7c38de0ff35ae

 --
 Regards, Ian Babrou
 http://bobrik.name http://twitter.com/ibobrik skype:i.babrou

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph PG Incomplete = Cluster unusable

2015-01-07 Thread Craig Lewis
On Mon, Dec 29, 2014 at 4:49 PM, Alexandre Oliva ol...@gnu.org wrote:

 However, I suspect that temporarily setting min size to a lower number
 could be enough for the PGs to recover.  If ceph osd pool set <pool>
 min_size 1 doesn't get the PGs going, I suppose restarting at least one
 of the OSDs involved in the recovery, so that they PG undergoes peering
 again, would get you going again.


It depends on how incomplete your incomplete PGs are.

min_size is defined as Sets the minimum number of replicas required for
I/O..  By default, size is 3 and min_size is 2 on recent versions of ceph.

If the number of replicas you have drops below min_size, then Ceph will
mark the PG as incomplete.  As long as you have one copy of the PG, you can
recover by lowering the min_size to the number of copies you do have, then
restoring the original value after recovery is complete.  I did this last
week when I deleted the wrong PGs as part of a toofull experiment.

If the number of replicas drops to 0, I think you can use ceph pg
force_create_pg, but I haven't tested it.
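
The min_size dance I mentioned is just (the pool name is an example):

  ceph osd pool set rbd min_size 1    # temporarily, so the surviving copy can go active and recover
  # ... wait for the PGs to get back to active+clean ...
  ceph osd pool set rbd min_size 2    # restore the original value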
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Behaviour of a cluster with full OSD(s)

2014-12-23 Thread Craig Lewis
On Tue, Dec 23, 2014 at 3:34 AM, Max Power 
mailli...@ferienwohnung-altenbeken.de wrote:

 I understand that the status osd full should never be reached. As I am
 new to
 ceph I want to be prepared for this case. I tried two different scenarios
 and
 here are my experiences:


For a real cluster, you should be monitoring your cluster, and taking
immediate action once you get an OSD in nearfull state.  Waiting until OSDs
are toofull is too late.

For a test cluster, it's a great learning experience. :-)



 The first one is to completely fill the storage (for me: writing files to a
 rados blockdevice). I discovered that the writing client (dd for example)
 then gets completely stuck, and this prevents me from stopping the process
 (SIGTERM, SIGKILL). At the moment I restart the whole computer to prevent
 writing to the cluster. Then I unmap the rbd device and set the full ratio
 a bit higher (0.95 to 0.97). I do a mount on my admin node and delete files
 till everything is okay again.
 Is this the best practice?


It is a design feature of Ceph that all cluster reads and writes stop until
the toofull situation is resolved.

The route you took is one of two ways to recover.  The other route you
found in your replica test.



 Is it possible to prevent the system from running into an osd full state?
 I could make the block devices smaller than the cluster can hold. But it's
 hard to calculate this exactly.


If you continue to add data to the cluster after it's nearfull, then you're
going to hit toofull.
Once you hit nearfull, you need to delete existing data, or add more OSDs.

You've probably noticed that some OSDs are using more space than others.
You can try to even them out with `ceph osd reweight` or `ceph osd crush
reweight`, but that's a delaying tactic.  When I hit nearfull, I place an
order for new hardware, then use `ceph osd reweight` until it arrives.



 The next scenario is to change a pool size from, say, 2 to 3 replicas. While
 the cluster copies the objects it gets stuck as an osd reaches its limit.
 Normally the osd process then quits and I cannot restart it (even after
 setting the replicas back). The only possibility is to manually delete
 complete PG folders after exploring them with 'pg dump'. Is this the only
 way to get it back working again?


There are some other configs that might have come into play here.  You
might have run into osd_failsafe_nearfull_ratio
or osd_failsafe_full_ratio.  You could try bumping those up a bit, and see
if that lets the process stay up long enough to start reducing replicas.

Since osd_failsafe_full_ratio is already 0.97, I wouldn't take it any
higher than 0.98.  Ceph triggers on greater-than percentages, so 0.99
will let you fill a disk to 100% full.  If you get a disk to 100% full, the
only way to cleanup is to start deleting PG directories.
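
If you want to experiment with that, a sketch (the values are examples;
injectargs only helps for OSDs that are still running, otherwise set the same
options in the [osd] section of ceph.conf before restarting):

  ceph tell osd.\* injectargs '--osd-failsafe-nearfull-ratio 0.92 --osd-failsafe-full-ratio 0.98'

  [osd]
    osd failsafe nearfull ratio = 0.92
    osd failsafe full ratio = 0.98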
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Any Good Ceph Web Interfaces?

2014-12-23 Thread Craig Lewis
Are you asking because you want to manage a Ceph cluster point and click?
Or do you need some shiny to show the boss?


I'm using a combination of Chef and Zabbix.  I'm not running RHEL though,
but I would assume those are available in the repos.

It's not as slick as Calamari, and it really doesn't give me a whole
cluster view.  Ganglia did a better job of that, but I went with Zabbix
for the graphing and alerting in a single product.


If you're looking for some shiny for the boss, Zabbix's web interface
should work fine.

If you're looking for a point and click way to build a Ceph cluster, I
think Calamari is your only option.



On Mon, Dec 22, 2014 at 4:11 PM, Tony unix...@gmail.com wrote:

 Please don't mention calamari :-)

 The best web interface for ceph that actually works with RHEL6.6

 Preferable something in repo and controls and monitors all other ceph osd,
 mon, etc.


 Take everything and live for the moment.



 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-deploy state of documentation [was: OSD JOURNAL not associated - ceph-disk list ?]

2014-12-22 Thread Craig Lewis
I get the impression that more people on the ML are using a config
management system.  ceph-deploy questions seem to come from new users
following the quick start guide.

I know both Puppet and Chef are fairly well represented here.  I've seen a
few posts about Salt and Ansible, but not much.  Calamari is built on top
of Salt, so I suppose that means Salt is well represented.  I really
haven't seen anything from the CFEngine or Bcfg2 camps.


I'm personally using Chef with a private fork of the Ceph cookbook.  The
Ceph cookbook doesn't use ceph-deploy, but it does use ceph-disk.  Whenever
I have problems with the ceph-disk command, I first go look at the cookbook
to see how it's doing things.



On Sun, Dec 21, 2014 at 10:37 AM, Nico Schottelius 
nico-ceph-us...@schottelius.org wrote:

 Hello list,

 I am a bit wondering about ceph-deploy and the development of ceph: I
 see that many people in the community are pushing towards the use of
 ceph-deploy, likely to ease use of ceph.

 However, I have run multiple times into issues using ceph-deploy, when
 it failed or incorrectly set up partitions or created a cluster of
 monitors that never reached quorum.

 I have also recognised debugging and learning of ceph being much more
 difficult with ceph-deploy, compared to going the manual way, because as
 a user I miss a lot of information.

 Furthermore as the maintainer of a configuration management system [0],
 I am interested in knowing how things are working behind the scenes to
 be able to automate them.

 Thus I was wondering, if it is an option for the ceph community to
 focus on both (the manual & ceph-deploy) ways instead of just pushing
 ceph-deploy?

 Cheers,

 Nico

 p.s.: Loic, just taking your mail as an example, but it is not personal
 - just want to show my point.

 Loic Dachary [Sun, Dec 21, 2014 at 06:08:27PM +0100]:
  [...]
 
  Is there a reason why you need to do this instead of letting ceph-disk
 prepare do it for you ?
 
  [...]

 --
 New PGP key: 659B 0D91 E86E 7E24 FD15  69D0 C729 21A1 293F 2D24
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 1256 OSD/21 server ceph cluster performance issues.

2014-12-22 Thread Craig Lewis
On Mon, Dec 22, 2014 at 2:57 PM, Sean Sullivan seapasu...@uchicago.edu
wrote:

  Thanks Craig!

 I think that this may very well be my issue with osds dropping out but I
 am still not certain as I had the cluster up for a small period while
 running rados bench for a few days without any status changes.


Mine were fine for a while too, through several benchmarks and a large
RadosGW import.  My problems were memory pressure plus an XFS bug, so it
took a while to manifest.  When it did, all of the ceph-osd processes on
that node would have periods of ~30 seconds with 100% CPU.  Some OSDs would
get kicked out.  Once that started, it was a downward spiral of recovery
causing increasing load causing more OSDs to get kicked out...

Once I found the memory problem, I cronned a buffer flush, and that usually
kept things from getting too bad.

I was able to see on the CPU graphs that CPU was increasing before the
problems started.  Once CPU got close to 100% usage on all cores, that's
when the OSDs started dropping out.  Hard to say if it was the CPU itself,
or if the CPU was just a symptom of the memory pressure plus XFS bug.




 The real big issue that I have is the radosgw one currently. After I
 figure out the root cause of the slow radosgw performance and correct that,
 it should hopefully buy me enough time to figure out the osd slow issue.

 It just doesn't make sense that I am getting 8mbps per client no matter 1
 or 60 clients while rbd and rados shoot well above 600MBs (above 1000 as
 well).


That is strange.  I was able to get 300 Mbps per client, on a 3 node
cluster with GigE.  I expected that each client would saturate the GigE on
their own, but 300 Mbps is more than enough for now.

I am using the Ceph apache and fastcgi module, but otherwise it's a pretty
standard apache setup.  My RadosGW processes are using a fair amount of
CPU, but as long as you have some idle CPU, that shouldn't be the
bottleneck.





 May I ask how you are monitoring your clusters logs? Are you just using
 rsyslog or do you have a logstash type system set up? Load wise I do not
 see a spike until I pull an osd out of the cluster or stop then start an
 osd without marking nodown.


I'm monitoring the cluster with Zabbix, and that gives me pretty much the
same info that I'd get in the logs.  I am planning to start pushing the
logs to Logstash soon, as soon as my Logstash setup is able to handle the
extra load.



 I do think that CPU is probably the cause of the osd slow issue though as
 it makes the most logical sense. Did you end up dropping ceph and moving to
 zfs or did you stick with it and try to mitigate it via file flusher/ other
 tweaks?


I'm still on Ceph.  I worked around the memory pressure by reformatting my
XFS filesystems to use regular sized inodes.  It was a rough couple of
months, but everything has been stable for the last two months.

I do still want to use ZFS on my OSDs.  It's got all the features of BtrFS,
with the extra feature of being production ready.  It's just not production
ready in Ceph yet.  It's coming along nicely though, and I hope to reformat
one node to be all ZFS sometime next year.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Have 2 different public networks

2014-12-19 Thread Craig Lewis
On Thu, Dec 18, 2014 at 10:47 PM, Francois Lafont flafdiv...@free.fr
wrote:

 Le 19/12/2014 02:18, Craig Lewis a écrit :
  The daemons bind to *,

 Yes but *only* for the OSD daemon. Am I wrong?

 Personally I must provide IP addresses for the monitors
 in the /etc/ceph/ceph.conf, like this:

 [global]
 mon host = 10.0.1.1, 10.0.1.2, 10.0.1.3

 Or like this:

 [mon.1]
 mon addr = 10.0.1.1
 [mon.2]
 mon addr = 10.0.1.2
 [mon.3]
 mon addr = 10.0.1.3


I'm not using mon addr lines, and my ceph-mon daemons are bound to 0.0.0.0:*.
I have no [mon.#] or [osd.#] sections at all.

I do have the global mon host line.  On the management nodes, try putting
the 10.0.2.0/24 addresses there instead of the 10.0.1.0/24 addresses.


  Do you really plan on having enough traffic creating and deleting RDB
  images that you need a dedicated network?  It seems like setting up link
  aggregation on 10.0.1.0/24 would be simpler and less error prone.

 This is not for traffic. I must have a node to manage rbd images and this
 node is in a different VLAN (this is an Openstack install... I try... ;).


If it's not a traffic volume problem, can you allow the 10.0.2.0/24 network
to route to the 10.0.1.0/24 network, and open the firewall enough? There
should be enough info in the network config to get the firewall working:
http://docs.ceph.com/docs/next/rados/configuration/network-config-ref/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Need help from Ceph experts

2014-12-19 Thread Craig Lewis
I've done single nodes.  I have a couple VMs for RadosGW Federation
testing.  It has a single virtual network, with both clusters on the same
network.

Because I'm only using a single OSD on a single host, I had to update the
crushmap to handle that.  My Chef recipe runs:
ceph osd getcrushmap -o /tmp/compiled-crushmap.old

crushtool -d /tmp/compiled-crushmap.old -o /tmp/decompiled-crushmap.old

sed -e '/step chooseleaf firstn 0 type/s/host/osd/' \
    /tmp/decompiled-crushmap.old > /tmp/decompiled-crushmap.new

crushtool -c /tmp/decompiled-crushmap.new -o /tmp/compiled-crushmap.new

ceph osd setcrushmap -i /tmp/compiled-crushmap.new


Those are the only extra commands I run for a single node cluster.
Otherwise, it looks the same as my production nodes that run mon, osd, and
rgw.


Here's my single node's ceph.conf:
[global]
  fsid = a7798848-1d31-421b-8f3c-5a34d60f6579
  mon initial members = test0-ceph0
  mon host = 172.16.205.143:6789
  auth client required = none
  auth cluster required = none
  auth service required = none
  mon warn on legacy crush tunables = false
  osd crush chooseleaf type = 0
  osd pool default flag hashpspool = true
  osd pool default min size = 1
  osd pool default size = 1
  public network = 172.16.205.0/24

[osd]
  osd journal size = 1000
  osd mkfs options xfs = -s size=4096
  osd mkfs type = xfs
  osd mount options xfs = rw,noatime,nodiratime,nosuid,noexec,inode64
  osd_scrub_sleep = 1.0
  osd_snap_trim_sleep = 1.0



[client.radosgw.test0-ceph0]
  host = test0-ceph0
  rgw socket path = /var/run/ceph/radosgw.test0-ceph0
  keyring = /etc/ceph/ceph.client.radosgw.test0-ceph0.keyring
  log file = /var/log/ceph/radosgw.log
  admin socket = /var/run/ceph/radosgw.asok
  rgw dns name = test0-ceph
  rgw region = us
  rgw region root pool = .us.rgw.root
  rgw zone = us-west
  rgw zone root pool = .us-west.rgw.root



On Thu, Dec 18, 2014 at 11:23 PM, Debashish Das deba@gmail.com wrote:

 Hi Team,

 Thanks for the insight and the replies. As I understood from the mails,
 running a Ceph cluster on a single node is possible but definitely not
 recommended.

 The challenge I see is that there is no clear documentation for a
 single-node installation.

 So I would request that if anyone has installed Ceph on a single node,
 please share the link or document which I can refer to in order to install
 Ceph on my local server.

 Again thanks guys !!

 Kind Regards
 Debashish Das

 On Fri, Dec 19, 2014 at 6:08 AM, Robert LeBlanc rob...@leblancnet.us
 wrote:

 Thanks, I'll look into these.

 On Thu, Dec 18, 2014 at 5:12 PM, Craig Lewis cle...@centraldesktop.com
 wrote:

 I think this is it:
 https://engage.redhat.com/inktank-ceph-reference-architecture-s-201409080939

 You can also check out a presentation on Cern's Ceph cluster:
 http://www.slideshare.net/Inktank_Ceph/scaling-ceph-at-cern


 At large scale, the biggest problem will likely be network I/O on the
 inter-switch links.



 On Thu, Dec 18, 2014 at 3:29 PM, Robert LeBlanc rob...@leblancnet.us
 wrote:

 I'm interested to know if there is a reference to this reference
 architecture. It would help alleviate some of the fears we have about
 scaling this thing to a massive scale (10,000's OSDs).

 Thanks,
 Robert LeBlanc

 On Thu, Dec 18, 2014 at 3:43 PM, Craig Lewis cle...@centraldesktop.com
  wrote:



 On Thu, Dec 18, 2014 at 5:16 AM, Patrick McGarry patr...@inktank.com
 wrote:


  2. What should be the minimum hardware requirement of the server
 (CPU,
  Memory, NIC etc)

 There is no real minimum to run Ceph, it's all about what your
 workload will look like and what kind of performance you need. We have
 seen Ceph run on Raspberry Pis.


 Technically, the smallest cluster is a single node with a 10 GiB
 disk.  Anything smaller won't work.

 That said, Ceph was envisioned to run on large clusters.  IIRC, the
 reference architecture has 7 rows, each row having 10 racks, all full.

 Those of us running small clusters (less than 10 nodes) are noticing
 that it doesn't work quite as well.  We have to significantly scale back
 the amount of backfilling and recovery that is allowed.  I try to keep all
 backfill/recovery operations touching less than 20% of my OSDs.  In the
 reference architecture, it could lose a whole row, and still keep under
 that limit.  My 5-node cluster is noticeably better than the 3-node
 cluster.  It's faster, has lower latency, and latency doesn't increase as
 much during recovery operations.

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Recovering from PG in down+incomplete state

2014-12-19 Thread Craig Lewis
Why did you remove osd.7?

Something else appears to be wrong.  With all 11 OSDs up, you shouldn't
have any PGs stuck in stale or peering.


How badly are the clocks skewed between nodes?  If it's bad enough, it can
cause communication problems between nodes.  Ceph will complain if the
clocks are more than 50ms different. It's best if you run ntpd on all
nodes.

I'm thinking that cleaning up the clock skew will fix most of your issues.


If that does fix the issue, you can try bringing osd.7 back in.  Don't
reformat it, just deploy it as you normally would.  The CRUSHMAP will go
back to the way it was before you removed osd.7.  Ceph will start to
backfill+remap data onto the new osd, and see that most of it is already
there.  It should recover relatively quickly... I think.


On Fri, Dec 19, 2014 at 10:28 AM, Robert LeBlanc rob...@leblancnet.us
wrote:

 I'm still pretty new at troubleshooting Ceph and since no one has
 responded yet I'll give a stab.

 What is the size of your pool?
 'ceph osd pool get <pool name> size'

 It seems like based on the number of incomplete PGs that it was '1'. I
 understand that if you are able to bring osd 7 back in, it would clear up.
 I'm just not seeing a secondary osd for that PG.

 Disclaimer: I could be totally wrong.

 Robert LeBlanc

 On Thu, Dec 18, 2014 at 11:41 PM, Mallikarjun Biradar 
 mallikarjuna.bira...@gmail.com wrote:

 Hi all,

 I had 12 OSDs in my cluster across 2 OSD nodes. One of the OSDs was in the
 down state; I removed it from the cluster by removing the crush rule for
 that OSD.

 Now cluster with 11 OSD's, started rebalancing. After sometime, cluster
 status was

 ems@rack6-client-5:~$ sudo ceph -s
 cluster eb5452f4-5ce9-4b97-9bfd-2a34716855f1
  health HEALTH_WARN 1 pgs down; 252 pgs incomplete; 10 pgs peering;
 73 pgs stale; 262 pgs stuck inactive; 73 pgs stuck stale; 262 pgs stuck
 unclean; clock skew detected on mon.rack6-client-5, mon.rack6-client-6
  monmap e1: 3 mons at {rack6-client-4=
 10.242.43.105:6789/0,rack6-client-5=10.242.43.106:6789/0,rack6-client-6=10.242.43.107:6789/0},
 election epoch 12, quorum 0,1,2 rack6-client-4,rack6-client-5,rack6-client-6
  osdmap e2648: 11 osds: 11 up, 11 in
   pgmap v554251: 846 pgs, 3 pools, 4383 GB data, 1095 kobjects
 11668 GB used, 26048 GB / 37717 GB avail
   63 stale+active+clean
1 down+incomplete
  521 active+clean
  251 incomplete
   10 stale+peering
 ems@rack6-client-5:~$


 To fix this, I can't run ceph osd lost <osd.id> to remove the PG which
 is in the down state, as the OSD has already been removed from the cluster.

 ems@rack6-client-4:~$ sudo ceph pg dump all | grep down
 dumped all in format plain
 1.3815480   0   0   0   6492782592  3001
  3001down+incomplete 2014-12-18 15:58:29.681708  1118'508438
 2648:1073892[6,3,1] 6   [6,3,1] 6   76'437184
 2014-12-16 12:38:35.322835  76'437184   2014-12-16 12:38:35.322835
 ems@rack6-client-4:~$

 ems@rack6-client-4:~$ sudo ceph pg 1.38 query
 .
 recovery_state: [
 { name: Started\/Primary\/Peering,
   enter_time: 2014-12-18 15:58:29.681666,
   past_intervals: [
 { first: 1109,
   last: 1118,
   maybe_went_rw: 1,
 ...
 ...
 down_osds_we_would_probe: [
 7],
   peering_blocked_by: []},
 ...
 ...

 ems@rack6-client-4:~$ sudo ceph osd tree
 # idweight  type name   up/down reweight
 -1  36.85   root default
 -2  20.1host rack2-storage-1
 0   3.35osd.0   up  1
 1   3.35osd.1   up  1
 2   3.35osd.2   up  1
 3   3.35osd.3   up  1
 4   3.35osd.4   up  1
 5   3.35osd.5   up  1
 -3  16.75   host rack2-storage-5
 6   3.35osd.6   up  1
 8   3.35osd.8   up  1
 9   3.35osd.9   up  1
 10  3.35osd.10  up  1
 11  3.35osd.11  up  1
 ems@rack6-client-4:~$ sudo ceph osd lost 7 --yes-i-really-mean-it
 osd.7 is not down or doesn't exist
 ems@rack6-client-4:~$


 Can somebody suggest any other recovery step to come out of this?

 -Thanks  Regards,
 Mallikarjun Biradar


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com

Re: [ceph-users] Placement groups stuck inactive after down out of 1/9 OSDs

2014-12-19 Thread Craig Lewis
That seems odd.  So you have 3 nodes, with 3 OSDs each.  You should've been
able to mark osd.0 down and out, then stop the daemon without having those
issues.

It's generally best to mark an osd down, then out, and wait until the
cluster has recovered completely before stopping the daemon and removing it
from the cluster.  That guarantees that you always have 3+ copies of the
data.

Disks don't always fail gracefully though.  If you have a sudden and
complete failure, you can't do it the nice way.  At that point, just mark
the osd down and out.  If your cluster was healthy before this event, you
shouldn't have any data problems.  If the cluster wasn't HEALTH_OK before
the event, you will likely have some problems.

Is your cluster HEALTH_OK now?  If not, can you give me the following?

   - ceph -s
   - ceph osd tree
   - ceph osd dump | grep ^pool
   - ceph pg dump_stuck
   - ceph pg pgid query  # For just one of the stuck PGs


I'm a bit confused why your cluster has a bunch of PGs in the remapped
state, but none in the remapping state.  It's supposed to be recovering,
and something is blocking that.



As to the hung VMs, during any recovery or backfill, you'll probably have
IO problems.  The ceph.conf defaults are intended for large clusters,
probably with SSD journals.  In my 3-node, 24-OSD cluster with no SSD
journals, recovery was IO-starving my clients.  I de-prioritized recovery
with:
[osd]
  osd max backfills = 1
  osd recovery max active = 1
  osd recovery op priority = 1

It was still painful, but those values kept my cluster usable.  Since I've
grown to 5 nodes, and added SSD journals, I've been able to increase the
backfills and recovery active to 3.  I found those values through trial and
error, watching my RadosGW latency, and playing with ceph tell osd.\*
injectargs ...
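
For anyone who wants to try the same thing, the runtime equivalent looks
roughly like this (a sketch only; the values are the ones from the [osd]
block above, not a universal recommendation, and injected values are lost
when an OSD restarts, so persist whatever works into ceph.conf):

  ceph tell osd.\* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1 --osd_recovery_op_priority 1'
  # watch latency, then write the values you settle on into the [osd] section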

I've found that I have problems if more than 20% of my OSDs are involved in
a backfilling operation.  With your 9 OSDs, you're guaranteeing that any
single event will always hit at least 22% of your OSDs, and probably more.
If you're unable to add more disks, I would highly recommend adding SSD
journals.



On Fri, Dec 19, 2014 at 8:08 AM, Chris Murray chrismurra...@gmail.com
wrote:

 Hello,

 I'm a newbie to CEPH, gaining some familiarity by hosting some virtual
 machines on a test cluster. I'm using a virtualisation product called
 Proxmox Virtual Environment, which conveniently handles cluster setup,
 pool setup, OSD creation etc.

 During the attempted removal of an OSD, my pool appeared to cease
 serving IO to virtual machines, and I'm wondering if I did something
 wrong or if there's something more to the process of removing an OSD.

 The CEPH cluster is small; 9 OSDs in total across 3 nodes. There's a
 pool called 'vmpool', with size=3 and min_size=1. It's a bit slow, but I
 see plenty of information on how to troubleshoot that, and understand I
 should be separating cluster communication onto a separate network
 segment to improve performance. CEPH version is Firefly - 0.80.7

 So, the issue was: I marked osd.0 as down & out (or possibly out & down,
 if order matters), and virtual machines hung. Almost immediately, 78 pgs
 were 'stuck inactive', and after some activity overnight, they remained
 that way:


 cluster e3dd7a1a-bd5f-43fe-a06f-58e830b93b7a
  health HEALTH_WARN 290 pgs degraded; 78 pgs stuck inactive; 496 pgs
 stuck unclean; 4 requests are blocked > 32 sec; recovery 69696/685356
 objects degraded (10.169%)
  monmap e3: 3 mons at
 {0=192.168.12.25:6789/0,1=192.168.12.26:6789/0,2=192.168.12.27:6789/0},
 election epoch 50, quorum 0,1,2 0,1,2
  osdmap e669: 9 osds: 8 up, 8 in
   pgmap v100175: 1216 pgs, 4 pools, 888 GB data, 223 kobjects
 2408 GB used, 7327 GB / 9736 GB avail
 69696/685356 objects degraded (10.169%)
   78 inactive
  720 active+clean
  290 active+degraded
  128 active+remapped


 I started the OSD to bring it back 'up'. It was still 'out'.


 cluster e3dd7a1a-bd5f-43fe-a06f-58e830b93b7a
  health HEALTH_WARN 59 pgs degraded; 496 pgs stuck unclean; recovery
 30513/688554 objects degraded (4.431%)
  monmap e3: 3 mons at
 {0=192.168.12.25:6789/0,1=192.168.12.26:6789/0,2=192.168.12.27:6789/0},
 election epoch 50, quorum 0,1,2 0,1,2
  osdmap e671: 9 osds: 9 up, 8 in
   pgmap v103181: 1216 pgs, 4 pools, 892 GB data, 224 kobjects
 2408 GB used, 7327 GB / 9736 GB avail
 30513/688554 objects degraded (4.431%)
  720 active+clean
   59 active+degraded
  437 active+remapped
   client io 2303 kB/s rd, 153 kB/s wr, 85 op/s


 The inactive pgs had disappeared.
 I stopped the OSD again, making it 'down' and 'out', as it was previous.
 At this point, I started my virtual machines again, which functioned
 correctly.


 cluster e3dd7a1a-bd5f-43fe-a06f-58e830b93b7a
  health HEALTH_WARN 368 pgs degraded; 496 

Re: [ceph-users] Placement groups stuck inactive after down & out of 1/9 OSDs

2014-12-19 Thread Craig Lewis
With only one OSD down and size = 3, you shouldn't've had any PGs
inactive.  At worst, they should've been active+degraded.

The only thought I have is that some of your PGs aren't mapping to the
correct number of OSDs.  That's not supposed to be able to happen unless
you've messed up your crush rules.

You might go through ceph pg dump, and verify that all PGs have 3 OSDs in
the up and acting columns, and that there are no duplicate OSDs in those
lists.

With your 1216 PGs, it might be faster to write a script to parse the JSON
than to do it manually.  If you happen to remember some PGs that were
inactive or degraded, you could spot check those.
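
A plain-text variant of that idea, as a rough and untested sketch (it only
checks that the up and acting sets contain exactly three OSDs, it does not
look for duplicates, and the column numbers assume the Firefly pg dump
layout, so verify them against your own output first):

  ceph pg dump 2>/dev/null | awk '
    $1 ~ /^[0-9]+\.[0-9a-f]+$/ {
      up = $14; acting = $16                 # sets look like [6,3,1]
      if (split(up, a, ",") != 3 || split(acting, b, ",") != 3)
        print $1, up, acting
    }'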



On Fri, Dec 19, 2014 at 11:45 AM, Chris Murray chrismurra...@gmail.com
wrote:

 Interesting indeed, those tuneables were suggested on the pve-user mailing
 list too, and they certainly sound like they’ll ease the pressure during
 the recovery operation. What I might not have explained very well though is
 that the VMs hung indefinitely and past the end of the recovery process,
 rather than being slow; almost as if the 78 stuck inactive placement groups
 contained data which was critical to VM operation. Looking at IO and
 performance in the cluster is certainly on the to-do list, with a scale-out
 of nodes and move of journals to SSD, but of course that needs some
 investment and I’d like to prove things first. It’s a bit catch-22 :-)

 To my knowledge, the cluster was HEALTH_OK before and it is HEALTH_OK now,
 BUT ... I've not followed my usual advice of stopping and thinking about
 things before trying something else, so I suppose the marking of the OSD
 'up' this morning (which turned those 78 into some other ACTIVE+* states)
 has spoiled the chance of troubleshooting. I’ve been messing around with
 osd.0 since too, and the health is now:

 cluster e3dd7a1a-bd5f-43fe-a06f-58e830b93b7a
  health HEALTH_OK
  monmap e3: 3 mons at {0=
 192.168.12.25:6789/0,1=192.168.12.26:6789/0,2=192.168.12.27:6789/0},
 election epoch 58, quorum 0,1,2 0,1,2
  osdmap e1205: 9 osds: 9 up, 9 in
   pgmap v120175: 1216 pgs, 4 pools, 892 GB data, 224 kobjects
 2679 GB used, 9790 GB / 12525 GB avail
 1216 active+clean

 If it helps at all, the other details are as follows. Nothing from 'dump
 stuck' although I expect there would have been this morning.

 root@ceph25:~# ceph osd tree
 # id    weight  type name               up/down reweight
 -1      12.22   root default
 -2      4.3     host ceph25
 3       0.9     osd.3   up      1
 6       0.68    osd.6   up      1
 0       2.72    osd.0   up      1
 -3      4.07    host ceph26
 1       2.72    osd.1   up      1
 4       0.9     osd.4   up      1
 7       0.45    osd.7   up      1
 -4      3.85    host ceph27
 2       2.72    osd.2   up      1
 5       0.68    osd.5   up      1
 8       0.45    osd.8   up      1
 root@ceph25:~# ceph osd dump | grep ^pool
 pool 0 'data' replicated size 3 min_size 1 crush_ruleset 0 object_hash
 rjenkins pg_num 64 pgp_num 64 last_change 1 flags hashpspool
 crash_replay_interval 45 stripe_width 0
 pool 1 'metadata' replicated size 3 min_size 1 crush_ruleset 0 object_hash
 rjenkins pg_num 64 pgp_num 64 last_change 1 flags hashpspool stripe_width 0
 pool 2 'rbd' replicated size 3 min_size 1 crush_ruleset 0 object_hash
 rjenkins pg_num 64 pgp_num 64 last_change 1 flags hashpspool stripe_width 0
 pool 3 'vmpool' replicated size 3 min_size 1 crush_ruleset 0 object_hash
 rjenkins pg_num 1024 pgp_num 1024 last_change 187 flags hashpspool
 stripe_width 0
 root@ceph25:~# ceph pg dump_stuck
 ok


 The more I think about this problem, the less I think there'll be an easy
 answer, and it's more likely that I'll have to reproduce the scenario and
 actually pause myself next time in order to troubleshoot it?

 From: Craig Lewis [mailto:cle...@centraldesktop.com]
 Sent: 19 December 2014 19:17
 To: Chris Murray
 Cc: ceph-users
 Subject: Re: [ceph-users] Placement groups stuck inactive after down & out
 of 1/9 OSDs

 That seems odd.  So you have 3 nodes, with 3 OSDs each.  You should've
 been able to mark osd.0 down and out, then stop the daemon without having
 those issues.

 It's generally best to mark an osd down, then out, and wait until the
 cluster has recovered completely before stopping the daemon and removing it
 from the cluster.  That guarantees that you always have 3+ copies of the
 data.

 Disks don't always fail gracefully though.  If you have a sudden and
 complete failure, you can't do it the nice way.  At that point, just mark
 the osd down and out.  If your cluster was healthy before this event, you
 shouldn't have any data problems.  If the cluster wasn't HEALTH_OK before
 the event, you will likely have some problems.

 Is your cluster HEALTH_OK now?  If not, can you give me the following

Re: [ceph-users] Have 2 different public networks

2014-12-19 Thread Craig Lewis
On Fri, Dec 19, 2014 at 4:03 PM, Francois Lafont flafdiv...@free.fr wrote:

 Le 19/12/2014 19:17, Craig Lewis a écrit :

  I'm not using mon addr lines, and my ceph-mon daemons are bound to
 0.0.0.0:*.

 And do you have several IP addresses on your server?
 Can you contact the *same* monitor process with different IP addresses?
 For instance:
 telnet -e ']' ip_addr1 6789
 telnet -e ']' ip_addr2 6789


Oh.  The second one fails, even though ceph-mon is bound to 0.0.0.0.  I
guess that's not going to work.

Looking again... I'm an idiot.  I was looking at the wrong column in
netstat.  The daemon is bound to a single IP.  netstat | grep, with no
column headers, bites me again.

I apologize for that wild goose chase.




 Please, could you post your ceph.conf here (or just lines about monitors)?


Probably doesn't help now, but:
[global]
  fsid = snip
  mon initial members = ceph0c, ceph1c
  mon host = 10.193.0.6:6789, 10.193.0.7:6789
  auth client required = none
  auth cluster required = none
  auth service required = none
  cluster network = 10.194.0.0/16
  mon warn on legacy crush tunables = false
  public network = 10.193.0.0/16


There is something that I don't understand. Personally I
 don't use ceph-deploy and I use manual deployment (because
 I want to make a Puppet deployment of my labs for Ubuntu Trusty
 with Ceph Firefly) and


I'm using Chef, which is also more like a manual deployment than
ceph-deploy.


 when I create my cluster with the
 first monitor, I have to generate a monitor map with this
 command:

 monmaptool --create --add {hostname} {ip-address} --fsid {uuid}
 /tmp/monmap
  

 And I have to provide an IP address, so it seems logical to me
 that a monitor is bound to only one IP address.


I don't see the Chef rule doing anything like that though.



 Can you post your monitors map?
 ceph mon getmap -o /tmp/monmap
 monmaptool --print /tmp/monmap


monmaptool: monmap file /tmp/monmap
epoch 2
fsid 1604ec7a-6ceb-42fc-8c68-0a7896c4e120
last_changed 2013-11-22 17:34:21.462685
created 0.00
0: 10.193.0.6:6789/0 mon.ceph0c
1: 10.193.0.7:6789/0 mon.ceph1c



  If it's not a traffic volume problem, can you allow the 10.0.2.0/24
 network
  to route to the 10.0.1.0/24 network, and open the firewall enough? There
  should be enough info in the network config to get the firewall working:
  http://docs.ceph.com/docs/next/rados/configuration/network-config-ref/

 Yes indeed, It could be enough. But I find it a shame to do this
 workaround because I'm not able to have monitors bound to several
 IP addresses. ;)


 Looks like you'll have to go this route.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Have 2 different public networks

2014-12-19 Thread Craig Lewis
On Fri, Dec 19, 2014 at 6:19 PM, Francois Lafont flafdiv...@free.fr wrote:


 So, indeed, I have to use routing *or* maybe create 2 monitors
 by server like this:

 [mon.node1-public1]
 host = ceph-node1
 mon addr = 10.0.1.1

 [mon.node1-public2]
 host = ceph-node1
 mon addr = 10.0.2.1

 # etc...

 But, in this case, the working directories of mon.node1-public1
 and mon.node1-public2 will be in the same disk (I have no
 choice). Is it a problem? Are monitors big consumers of I/O disk?


Interesting idea.  While you will have an even number of monitors, you'll
still have an odd number of failure domains.  I'm not sure if it'll work
though... make sure you test having the leader on both networks.  It might
cause problems if the leader is on the 10.0.1.0/24 network?

Monitors can be big consumers of disk IO, if there is a lot of cluster
activity.  Monitors record all of the cluster changes in LevelDB, and send
copies to all of the daemons.  There have been posts to the ML about people
running out of Disk IOps on the monitors, and the problems it causes.  The
bigger the cluster, the more IOps.  As long as you monitor and alert on
your monitor disk IOps, I don't think it would be a problem.
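
Two quick, low-tech checks go a long way here (a sketch; the mon data path
is the packaged default, and the device is whatever disk holds it):

  iostat -x 5                   # from sysstat; watch %util and await on the monitor's disk
  du -sh /var/lib/ceph/mon/*    # a steadily growing store is an early warning sign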
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Need help from Ceph experts

2014-12-18 Thread Craig Lewis
On Thu, Dec 18, 2014 at 5:16 AM, Patrick McGarry patr...@inktank.com
wrote:


  2. What should be the minimum hardware requirement of the server (CPU,
  Memory, NIC etc)

 There is no real minimum to run Ceph, it's all about what your
 workload will look like and what kind of performance you need. We have
 seen Ceph run on Raspberry Pis.


Technically, the smallest cluster is a single node with a 10 GiB disk.
Anything smaller won't work.

That said, Ceph was envisioned to run on large clusters.  IIRC, the
reference architecture has 7 rows, each row having 10 racks, all full.

Those of us running small clusters (less than 10 nodes) are noticing that
it doesn't work quite as well.  We have to significantly scale back the
amount of backfilling and recovery that is allowed.  I try to keep all
backfill/recovery operations touching less than 20% of my OSDs.  In the
reference architecture, it could lose a whole row, and still keep under
that limit.  My 5-node cluster is noticeably better than the 3-node
cluster.  It's faster, has lower latency, and latency doesn't increase as
much during recovery operations.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Need help from Ceph experts

2014-12-18 Thread Craig Lewis
I think this is it:
https://engage.redhat.com/inktank-ceph-reference-architecture-s-201409080939

You can also check out a presentation on Cern's Ceph cluster:
http://www.slideshare.net/Inktank_Ceph/scaling-ceph-at-cern


At large scale, the biggest problem will likely be network I/O on the
inter-switch links.



On Thu, Dec 18, 2014 at 3:29 PM, Robert LeBlanc rob...@leblancnet.us
wrote:

 I'm interested to know if there is a reference to this reference
 architecture. It would help alleviate some of the fears we have about
 scaling this thing to a massive scale (10,000's OSDs).

 Thanks,
 Robert LeBlanc

 On Thu, Dec 18, 2014 at 3:43 PM, Craig Lewis cle...@centraldesktop.com
 wrote:



 On Thu, Dec 18, 2014 at 5:16 AM, Patrick McGarry patr...@inktank.com
 wrote:


  2. What should be the minimum hardware requirement of the server (CPU,
  Memory, NIC etc)

 There is no real minimum to run Ceph, it's all about what your
 workload will look like and what kind of performance you need. We have
 seen Ceph run on Raspberry Pis.


 Technically, the smallest cluster is a single node with a 10 GiB disk.
 Anything smaller won't work.

 That said, Ceph was envisioned to run on large clusters.  IIRC, the
 reference architecture has 7 rows, each row having 10 racks, all full.

 Those of us running small clusters (less than 10 nodes) are noticing that
 it doesn't work quite as well.  We have to significantly scale back the
 amount of backfilling and recovery that is allowed.  I try to keep all
 backfill/recovery operations touching less than 20% of my OSDs.  In the
 reference architecture, it could lose a whole row, and still keep under
  that limit.  My 5-node cluster is noticeably better than the 3-node
 cluster.  It's faster, has lower latency, and latency doesn't increase as
 much during recovery operations.



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Have 2 different public networks

2014-12-18 Thread Craig Lewis
The daemons bind to *, so adding the 3rd interface to the machine will
allow you to talk to the daemons on that IP.

I'm not really sure how you'd set up the management network though.  I'd
start by setting public network = 10.0.2.0/24 in ceph.conf on the management
nodes, and adding an /etc/hosts file with the monitors' names on the
10.0.2.0/24 network.

Make sure the management nodes can't route to the 10.0.1.0/24 network, and
see what happens.


Do you really plan on having enough traffic creating and deleting RBD
images that you need a dedicated network?  It seems like setting up link
aggregation on 10.0.1.0/24 would be simpler and less error prone.



On Thu, Dec 18, 2014 at 4:19 PM, Francois Lafont flafdiv...@free.fr wrote:

 Hi,

 Is it possible to have 2 different public networks in a Ceph cluster?
 I explain my question below.

 Currently, I have 3 identical nodes in my Ceph cluster. Each node has:

 - only 1 monitor;
 - n osds (we don't care about the value n here);
 - and 3 interfaces.

 One interface for the cluster network (10.0.0.0/24):
 - node1 - 10.0.0.1
 - node2 - 10.0.0.2
 - node3 - 10.0.0.3

 One interface for the public network (10.0.1.0/24):
 - node1 - [mon.1] mon addr = 10.0.1.1
 - node2 - [mon.2] mon addr = 10.0.1.2
 - node3 - [mon.3] mon addr = 10.0.1.3

 And one interface not used yet (see below).

 With this configuration, if I have a Ceph client in the
 public network, I can use rbd images etc. No problem,
 it works.

 But now I would like to use the third interface of the
 nodes for a *different* public network - 10.0.2.0/24.
 The Ceph clients in this network will not really use the
 storage but will create and delete rbd images in a pool.
 In fact it's just a network for *Ceph management*.

 So, I want to have 2 different public networks:
 - 10.0.1.0/24 (already exists)
 - *and* 10.0.2.0/24

 Am I wrong if I say that mon.1, mon.2 and mon.3
 must have one more IP address? Is it possible to
 have a monitor that listens on 2 addresses? Something
 like that:

 - node1 - [mon.1] mon addr = 10.0.1.1 *and* 10.0.2.1
 - node2 - [mon.2] mon addr = 10.0.1.2 *and* 10.0.2.2
 - node3 - [mon.3] mon addr = 10.0.1.3 *and* 10.0.2.3

 My environment is not a production environment, just a
 lab. So, if necessary I can reinstall everything, no
 problem.

 Thanks for your help.

 --
 François Lafont

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Dual RADOSGW Network

2014-12-16 Thread Craig Lewis
You may need split horizon DNS.  The internal machines' DNS should resolve
to the internal IP, and the external machines' DNS should resolve to the
external IP.

There are various ways to do that.  The RadosGW config has an example of
setting up Dnsmasq:
http://ceph.com/docs/master/radosgw/config/#enabling-subdomain-s3-calls
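
As a minimal illustration of the split-horizon idea (the hostname and address
below are placeholders, not values from this thread), the internal resolver
only needs to override the gateway's record while public DNS keeps the
external one:

  # /etc/dnsmasq.d/rgw.conf on the internal DNS server
  address=/rgw.example.com/192.168.1.10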

On Tue, Dec 16, 2014 at 3:05 AM, Georgios Dimitrakakis gior...@acmac.uoc.gr
 wrote:

 Thanks Craig.

 I will try that!

 I thought it was more complicated than that because of the entries for the
 public_network and rgw dns name in the config file...

 I will give it a try.

 Best,


 George



  That shouldn't be a problem.  Just have Apache bind to all interfaces
 instead of the external IP.

 In my case, I only have Apache bound to the internal interface.  My
 load balancer has an external and internal IP, and I'm able to talk to
 it on both interfaces.

 On Mon, Dec 15, 2014 at 2:00 PM, Georgios Dimitrakakis  wrote:

  Hi all!

 I have a single CEPH node which has two network interfaces.

 One is configured to be accessed directly by the internet (153.*)
 and the other one is configured on an internal LAN (192.*)

 For the moment radosgw is listening on the external (internet)
 interface.

 Can I configure radosgw to be accessed by both interfaces? What I
 would like to do is to save bandwidth and time for the machines on
 the internal network and use the internal net for all rados
 communications.

 Any ideas?

 Best regards,

 George
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com [1]
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com [2]



 Links:
 --
 [1] mailto:ceph-users@lists.ceph.com
 [2] http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 [3] mailto:gior...@acmac.uoc.gr


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Test 6

2014-12-16 Thread Craig Lewis
I always wondered why my posts didn't show up until somebody replied to
them.  I thought it was my filters.

Thanks!

On Mon, Dec 15, 2014 at 10:57 PM, Leen de Braal l...@braha.nl wrote:

 If you are trying to see if your mails come through, don't check on the
 list. You have a gmail account, gmail removes mails that you have sent
 yourself.
 You can check the archives to see.

 And your mails did come on the list.


  --
  Lindsay
 


 --
 L. de Braal
 BraHa Systems
 NL - Terneuzen
 T +31 115 649333


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD Crash makes whole cluster unusable ?

2014-12-16 Thread Craig Lewis
So the problem started once remapping+backfilling started, and lasted until
the cluster was healthy again?  Have you adjusted any of the recovery
tunables?  Are you using SSD journals?

I had a similar experience the first time my OSDs started backfilling.  The
average RadosGW operation latency went from 0.1 seconds to 10 seconds,
which is longer than the default HAProxy timeout.  Fun times.

Since then, I've increased HAProxy's timeouts, de-prioritized Ceph's
recovery, and I added SSD journals.

The relevant sections of ceph.conf are:

[global]
  mon osd down out interval = 900
  mon osd min down reporters = 9
  mon osd min down reports = 12
  mon warn on legacy crush tunables = false
  osd pool default flag hashpspool = true

[osd]
  osd max backfills = 3
  osd recovery max active = 3
  osd recovery op priority = 1
  osd scrub sleep = 1.0
  osd snap trim sleep = 1.0


Before the SSD journals, I had osd_max_backfills and
osd_recovery_max_active set to 1.  I watched my latency graphs, and used ceph
tell osd.\* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1' to
tweak the values until the latency was acceptable.

On Tue, Dec 16, 2014 at 5:37 AM, Christoph Adomeit 
christoph.adom...@gatworks.de wrote:


 Hi there,

 today I had an osd crash with ceph 0.87/giant which made my whole cluster
 unusable for 45 minutes.

 First it began with a disk error:

 sd 0:1:2:0: [sdc] CDB: Read(10)Read(10):: 28 28 00 00 0d 15 fe d0 fd 7b e8
 f8 00 00 00 00 b0 08 00 00
 XFS (sdc1): xfs_imap_to_bp: xfs_trans_read_buf() returned error 5.

 Then most other osds found out that my osd.3 is down:

 2014-12-16 08:45:15.873478 mon.0 10.67.1.11:6789/0 3361077 : cluster
 [INF] osd.3 10.67.1.11:6810/713621 failed (42 reports from 35 peers after
 23.642482 = grace 23.348982)

 5 minutes later the osd is marked as out:
 2014-12-16 08:50:21.095903 mon.0 10.67.1.11:6789/0 3361367 : cluster
 [INF] osd.3 out (down for 304.581079)

 However, from 8:45 until 9:20 I have 1000 slow requests and 107
 incomplete pgs. Many requests are not answered:

 2014-12-16 08:46:03.029094 mon.0 10.67.1.11:6789/0 3361126 : cluster
 [INF] pgmap v6930583: 4224 pgs: 4117 active+clean, 107 incomplete; 7647 GB
 data, 19090 GB used, 67952 GB / 87042 GB avail; 2307 kB/s rd, 2293 kB/s wr,
 407 op/s

 Also a recovery to another osd was not starting

 Seems the osd thinks it is still up and all other osds think this osd is
 down ?
 I found this in the log of osd3:
 ceph-osd.3.log:2014-12-16 08:45:19.319152 7faf81296700  0
 log_channel(default) log [WRN] : map e61177 wrongly marked me down
 ceph-osd.3.log:  -440 2014-12-16 08:45:19.319152 7faf81296700  0
 log_channel(default) log [WRN] : map e61177 wrongly marked me down

 Luckily I was able to restart osd3 and everything was working again but I
 do not understand what has happened. The cluster was simply not usable for
 45 minutes.

 Any ideas

 Thanks
   Christoph



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] my cluster has only rbd pool

2014-12-15 Thread Craig Lewis
If you're running Ceph 0.88 or newer, only the rbd pool is created by
default now.  Greg Farnum mentioned that the docs are out of date there.
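
If you want the pool that the older quick-start docs assume, you can simply
create it yourself and retry (a sketch; the PG counts are placeholders for a
small test cluster):

  ceph osd pool create data 64 64
  echo hello > hello
  rados put hello_obj hello --pool=data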

On Sat, Dec 13, 2014 at 8:25 PM, wang lin linw...@hotmail.com wrote:

 Hi All
   I set up my first ceph cluster according to instructions in
  http://ceph.com/docs/master/start/quick-ceph-deploy/#storing-retrieving-object-data,
   but I got this error error opening pool data: (2) No such file or
 directory when using command rados put hello_obj hello --pool=data.
   I typed the command ceph osd lspools, the result only show 0
 rbd,, no other pools.
   Did I missing anything?
   Could anyone give me some advise?

 Thanks
 Lin, Wangf










___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Dual RADOSGW Network

2014-12-15 Thread Craig Lewis
That shouldn't be a problem.  Just have Apache bind to all interfaces
instead of the external IP.

In my case, I only have Apache bound to the internal interface.  My load
balancer has an external and internal IP, and I'm able to talk to it on
both interfaces.

On Mon, Dec 15, 2014 at 2:00 PM, Georgios Dimitrakakis gior...@acmac.uoc.gr
 wrote:

 Hi all!

 I have a single CEPH node which has two network interfaces.

 One is configured to be accessed directly by the internet (153.*) and the
 other one is configured on an internal LAN (192.*)

 For the moment radosgw is listening on the external (internet) interface.

 Can I configure radosgw to be accessed by both interfaces? What I would
 like to do is to save bandwidth and time for the machines on the internal
 network and use the internal net for all rados communications.


 Any ideas?


 Best regards,


 George

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Number of SSD for OSD journal

2014-12-15 Thread Craig Lewis
I was going with a low perf scenario, and I still ended up adding SSDs.
Everything was fine in my 3 node cluster, until I wanted to add more nodes.


Admittedly, I was a bit aggressive with the expansion.  I added a whole
node at once, rather than one or two disks at a time.  Still, I wasn't
expecting the average RadosGW latency to go from 0.1 seconds to 10
seconds.  With the SSDs, I can do the same thing, and latency only goes up
to 1 second.

I'll be adding the Intel DC S3700's to all my nodes.


On Mon, Dec 15, 2014 at 12:45 PM, Sebastien Han sebastien@enovance.com
wrote:

 If you’re going with a low perf scenario I don’t think you should bother
 buying SSD, just remove them from the picture and do 12 SATA 7.2K 4TB.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Multiple issues :( Ubuntu 14.04, latest Ceph

2014-12-15 Thread Craig Lewis
On Sun, Dec 14, 2014 at 6:31 PM, Benjamin zor...@gmail.com wrote:

 The machines each have Ubuntu 14.04 64-bit, with 1GB of RAM and 8GB of
 disk. They have between 10% and 30% disk utilization but common between all
 of them is that they *have free disk space* meaning I have no idea what
 the heck is causing Ceph to complain.


Each OSD is 8GB?  You need to make them at least 10 GB.

Ceph weights each disk as its size in TiB, and it truncates to two decimal
places.  So your 8 GiB disks have a weight of 0.00.  Bump it up to 10 GiB,
and it'll get a weight of 0.01.
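
A quick way to confirm this, and a stop-gap if resizing the virtual disks is
not convenient right away (the osd id below is just an example):

  ceph osd tree                        # weights of 0.00 are the giveaway
  ceph osd crush reweight osd.0 0.01   # force a non-zero weight by hand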

You should have 3 OSDs, one for each of ceph0,ceph1,ceph2.

If that doesn't fix the problem, go ahead and post the things Udo mentioned.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] active+degraded on an empty new cluster

2014-12-09 Thread Craig Lewis
When I first created a test cluster, I used 1 GiB disks.  That causes
problems.

Ceph has a CRUSH weight.  By default, the weight is the size of the disk in
TiB, truncated to 2 decimal places.  ie, any disk smaller than 10 GiB will
have a weight of 0.00.

I increased all of my virtual disks to 10 GiB.  After rebooting the nodes
(to see the changes), everything healed.


On Tue, Dec 9, 2014 at 9:45 AM, Gregory Farnum g...@gregs42.com wrote:

 It looks like your OSDs all have weight zero for some reason. I'd fix
 that. :)
 -Greg

 On Tue, Dec 9, 2014 at 6:24 AM Giuseppe Civitella 
 giuseppe.civite...@gmail.com wrote:

 Hi,

 thanks for the quick answer.
 I did try the force_create_pg on a pg but is stuck on creating:
 root@ceph-mon1:/home/ceph# ceph pg dump |grep creating
 dumped all in format plain
  2.2f    0   0   0   0   0   0   0   creating    2014-12-09 13:11:37.384808  0'0 0:0 []  -1  []  -1  0'0 0.00    0'0 0.00

 root@ceph-mon1:/home/ceph# ceph pg 2.2f query
 { state: active+degraded,
   epoch: 105,
   up: [
 0],
   acting: [
 0],
   actingbackfill: [
 0],
   info: { pgid: 2.2f,
   last_update: 0'0,
   last_complete: 0'0,
   log_tail: 0'0,
   last_user_version: 0,
   last_backfill: MAX,
   purged_snaps: [],
   last_scrub: 0'0,
   last_scrub_stamp: 2014-12-06 14:15:11.499769,
   last_deep_scrub: 0'0,
   last_deep_scrub_stamp: 2014-12-06 14:15:11.499769,
   last_clean_scrub_stamp: 0.00,
   log_size: 0,
   ondisk_log_size: 0,
   stats_invalid: 0,
   stat_sum: { num_bytes: 0,
   num_objects: 0,
   num_object_clones: 0,
   num_object_copies: 0,
   num_objects_missing_on_primary: 0,
   num_objects_degraded: 0,
   num_objects_unfound: 0,
   num_objects_dirty: 0,
   num_whiteouts: 0,
   num_read: 0,
   num_read_kb: 0,
   num_write: 0,
   num_write_kb: 0,
   num_scrub_errors: 0,
   num_shallow_scrub_errors: 0,
   num_deep_scrub_errors: 0,
   num_objects_recovered: 0,
   num_bytes_recovered: 0,
   num_keys_recovered: 0,
   num_objects_omap: 0,
   num_objects_hit_set_archive: 0},
   stat_cat_sum: {},
   up: [
 0],
   acting: [
 0],
   up_primary: 0,
   acting_primary: 0},
   empty: 1,
   dne: 0,
   incomplete: 0,
   last_epoch_started: 104,
   hit_set_history: { current_last_update: 0'0,
   current_last_stamp: 0.00,
   current_info: { begin: 0.00,
   end: 0.00,
   version: 0'0},
   history: []}},
   peer_info: [],
   recovery_state: [
 { name: Started\/Primary\/Active,
   enter_time: 2014-12-09 12:12:52.760384,
   might_have_unfound: [],
   recovery_progress: { backfill_targets: [],
   waiting_on_backfill: [],
   last_backfill_started: 0\/\/0\/\/-1,
   backfill_info: { begin: 0\/\/0\/\/-1,
   end: 0\/\/0\/\/-1,
   objects: []},
   peer_backfill_info: [],
   backfills_in_flight: [],
   recovering: [],
   pg_backend: { pull_from_peer: [],
   pushing: []}},
   scrub: { scrubber.epoch_start: 0,
   scrubber.active: 0,
   scrubber.block_writes: 0,
   scrubber.finalizing: 0,
   scrubber.waiting_on: 0,
   scrubber.waiting_on_whom: []}},
 { name: Started,
   enter_time: 2014-12-09 12:12:51.845686}],
   agent_state: {}}root@ceph-mon1:/home/ceph#



 2014-12-09 13:01 GMT+01:00 Irek Fasikhov malm...@gmail.com:

 Hi.

 http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/

 ceph pg force_create_pg pgid


 2014-12-09 14:50 GMT+03:00 Giuseppe Civitella 
 giuseppe.civite...@gmail.com:

 Hi all,

 last week I installed a new ceph cluster on 3 vm running Ubuntu 14.04
 with default kernel.
 There is a ceph monitor a two osd hosts. Here are some datails:
 ceph -s
 cluster c46d5b02-dab1-40bf-8a3d-f8e4a77b79da
  health HEALTH_WARN 192 pgs degraded; 192 pgs stuck unclean
  monmap e1: 1 mons at {ceph-mon1=10.1.1.83:6789/0}, election epoch
 1, quorum 0 ceph-mon1
  osdmap e83: 6 osds: 6 up, 6 in
   pgmap v231: 192 pgs, 3 pools, 0 bytes data, 0 objects
 207 MB used, 30446 MB / 30653 MB avail
  192 active+degraded

 root@ceph-mon1:/home/ceph# ceph osd dump
 epoch 99
 fsid c46d5b02-dab1-40bf-8a3d-f8e4a77b79da
 created 2014-12-06 13:15:06.418843
 modified 2014-12-09 11:38:04.353279
 flags
 pool 0 'data' replicated size 2 min_size 1 crush_ruleset 0 

Re: [ceph-users] Scrub while cluster re-balancing

2014-12-03 Thread Craig Lewis
What's the output of ceph osd dump | grep ^pool ?

On Tue, Dec 2, 2014 at 10:44 PM, Mallikarjun Biradar 
mallikarjuna.bira...@gmail.com wrote:

 Hi Craig,


  But my concern is why ceph status is not reporting anything for pool 2
  (testPool2 in this case). Is it not performing the scrub, or is it a ceph
  status reporting issue?

  Though I have enough objects in testPool2, the scrub is not reported as
  active+clean+scrubbing in ceph -s.

 ems@rack6-ramp-4:~$ sudo ceph osd lspools
 0 rbd,1 testPool,2 testPool2,
 ems@rack6-ramp-4:~$

 ems@rack6-ramp-4:~$ sudo rados df
  pool name       category        KB           objects   clones  degraded  unfound  rd         rd KB        wr         wr KB
  rbd             -               0            0         0       0         0        0          0            0          0
  testPool        -               5948025217   1452174   0       0         0        141056332  22948324301  141070117  22950524809
  testPool2       -               45039617     10999     0       0         0        11238999   44955958     11259655   45038593
    total used    18004641796     1463173
    total avail   32330689516
    total space   50335331312
 ems@rack6-ramp-4:~$

  -Thanks & regards,
 Mallikarjun Biradar

 On Wed, Dec 3, 2014 at 1:20 AM, Craig Lewis cle...@centraldesktop.com
 wrote:

 ceph osd dump | grep ^pool will map pool names to numbers.  PGs are
 named after the pool; PG 2.xx belongs to pool 2.

  rados df will tell you how many items and how much data are in a pool.

 On Tue, Dec 2, 2014 at 10:53 AM, Mallikarjun Biradar 
 mallikarjuna.bira...@gmail.com wrote:

 Hi Craig,

 ceph -s is not showing any PG's in pool2.
  I have 3 pools: rbd and two pools that I created, testPool and testPool2.

  I have more than 10TB of data in testPool1 and a good amount of data in
  testPool2 as well.
  I am not using the rbd pool.

  -Thanks & regards,
 Mallikarjun Biradar
 On 3 Dec 2014 00:15, Craig Lewis cle...@centraldesktop.com wrote:

 You mean `ceph -w` and `ceph -s` didn't show any PGs in
 the active+clean+scrubbing state while pool 2's PGs were being scrubbed?

 I see that happen with my really small pools.  I have a bunch of
 RadosGW pools that contain 5 objects, and ~1kB of data.  When I scrub the
 PGs in those pools, they complete so fast that they never show up in `ceph
 -w`.


 Since you have pools 0, 1, and 2, I assume those are the default
  'data', 'metadata', and 'rbd'.  If you're not using RBD, then the rbd pool
 will be very small.



 On Tue, Dec 2, 2014 at 5:32 AM, Mallikarjun Biradar 
 mallikarjuna.bira...@gmail.com wrote:

 Hi all,

 I was running scrub while cluster is in re-balancing state.

 From the osd logs..

 2014-12-02 18:50:26.934802 7fcc6b614700  0 log_channel(default) log
 [INF] : 0.3 scrub ok
 2014-12-02 18:50:27.890785 7fcc6b614700  0 log_channel(default) log
 [INF] : 0.24 scrub ok
 2014-12-02 18:50:31.902978 7fcc6b614700  0 log_channel(default) log
 [INF] : 0.25 scrub ok
 2014-12-02 18:50:33.088060 7fcc6b614700  0 log_channel(default) log
 [INF] : 0.33 scrub ok
 2014-12-02 18:50:50.828893 7fcc6b614700  0 log_channel(default) log
 [INF] : 1.61 scrub ok
 2014-12-02 18:51:06.774648 7fcc6b614700  0 log_channel(default) log
 [INF] : 1.68 scrub ok
 2014-12-02 18:51:20.463283 7fcc6b614700  0 log_channel(default) log
 [INF] : 1.80 scrub ok
 2014-12-02 18:51:39.883295 7fcc6b614700  0 log_channel(default) log
 [INF] : 1.89 scrub ok
 2014-12-02 18:52:00.568808 7fcc6b614700  0 log_channel(default) log
 [INF] : 1.9f scrub ok
 2014-12-02 18:52:15.897191 7fcc6b614700  0 log_channel(default) log
 [INF] : 1.a3 scrub ok
 2014-12-02 18:52:34.681874 7fcc6b614700  0 log_channel(default) log
 [INF] : 1.aa scrub ok
 2014-12-02 18:52:47.833630 7fcc6b614700  0 log_channel(default) log
 [INF] : 1.b1 scrub ok
 2014-12-02 18:53:09.312792 7fcc6b614700  0 log_channel(default) log
 [INF] : 1.b3 scrub ok
 2014-12-02 18:53:25.324635 7fcc6b614700  0 log_channel(default) log
 [INF] : 1.bd scrub ok
 2014-12-02 18:53:48.638475 7fcc6b614700  0 log_channel(default) log
 [INF] : 1.c3 scrub ok
 2014-12-02 18:54:02.996972 7fcc6b614700  0 log_channel(default) log
 [INF] : 1.d7 scrub ok
 2014-12-02 18:54:19.660038 7fcc6b614700  0 log_channel(default) log
 [INF] : 1.d8 scrub ok
 2014-12-02 18:54:32.780646 7fcc6b614700  0 log_channel(default) log
 [INF] : 1.fa scrub ok
 2014-12-02 18:54:36.772931 7fcc6b614700  0 log_channel(default) log
 [INF] : 2.4 scrub ok
 2014-12-02 18:54:41.758487 7fcc6b614700  0 log_channel(default) log
 [INF] : 2.9 scrub ok
 2014-12-02 18:54:46.910043 7fcc6b614700  0 log_channel(default) log
 [INF] : 2.a scrub ok
 2014-12-02 18:54:51.908335 7fcc6b614700  0 log_channel(default) log
 [INF] : 2.16 scrub ok
 2014-12-02 18:54:54.940807 7fcc6b614700  0 log_channel(default) log
 [INF] : 2.19 scrub ok
 2014-12-02 18:55:00.956170 7fcc6b614700  0 log_channel(default) log
 [INF] : 2.44 scrub ok
 2014-12-02 18:55:01.948455 7fcc6b614700  0 log_channel(default

Re: [ceph-users] Slow Requests when taking down OSD Node

2014-12-02 Thread Craig Lewis
I've found that it helps to shut down the osds before shutting down the
host.  Especially if the node is also a monitor.  It seems that some OSD
shutdown messages get lost while monitors are holding elections.
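
In practice that ordering looks something like this (a sketch using the
upstart/sysvinit-style commands of this era of Ceph; adjust the OSD ids and
service names to your setup):

  ceph osd set noout
  stop ceph-osd id=51        # or: service ceph stop osd.51
  # ... do the host maintenance, then bring it back ...
  start ceph-osd id=51       # or: service ceph start osd.51
  ceph osd unset noout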

On Tue, Dec 2, 2014 at 10:10 AM, Christoph Adomeit 
christoph.adom...@gatworks.de wrote:

 Hi there,

 I have a giant cluster with 60 OSDs on 6 OSD Hosts.

 Now I want to do maintenance on one of the OSD Hosts.

 The documented Procedure is to ceph osd set noout and then shutdown
 the OSD Node for maintenance.

 However, as soon as I even shut down 1 OSD I get around 200 slow requests
 and the number of slow requests is growing for minutes.

 The test was done at night with low IOPS and I was expecting the cluster
 to handle this condition much better.

 Is there some way of a more graceful shutdown of OSDs so that I can prevent
 those slow requests ? I suppose it takes some time until monitor gets
 notified that an OSD was shutdown.

 Thanks
   Christoph




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Scrub while cluster re-balancing

2014-12-02 Thread Craig Lewis
You mean `ceph -w` and `ceph -s` didn't show any PGs in
the active+clean+scrubbing state while pool 2's PGs were being scrubbed?

I see that happen with my really small pools.  I have a bunch of RadosGW
pools that contain 5 objects, and ~1kB of data.  When I scrub the PGs in
those pools, they complete so fast that they never show up in `ceph -w`.


Since you have pools 0, 1, and 2, I assume those are the default 'data',
'metadata', and 'rbd'.  If you're not using RBD, then the rbd pool will be
very small.
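
One way to convince yourself the scrub really ran, even when it never shows
up in ceph -w, is to compare the PG's scrub stamps before and after (the PG
id here is just an example):

  ceph pg 2.4 query | grep -E 'last_scrub|last_deep_scrub'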



On Tue, Dec 2, 2014 at 5:32 AM, Mallikarjun Biradar 
mallikarjuna.bira...@gmail.com wrote:

 Hi all,

 I was running scrub while cluster is in re-balancing state.

 From the osd logs..

 2014-12-02 18:50:26.934802 7fcc6b614700  0 log_channel(default) log [INF]
 : 0.3 scrub ok
 2014-12-02 18:50:27.890785 7fcc6b614700  0 log_channel(default) log [INF]
 : 0.24 scrub ok
 2014-12-02 18:50:31.902978 7fcc6b614700  0 log_channel(default) log [INF]
 : 0.25 scrub ok
 2014-12-02 18:50:33.088060 7fcc6b614700  0 log_channel(default) log [INF]
 : 0.33 scrub ok
 2014-12-02 18:50:50.828893 7fcc6b614700  0 log_channel(default) log [INF]
 : 1.61 scrub ok
 2014-12-02 18:51:06.774648 7fcc6b614700  0 log_channel(default) log [INF]
 : 1.68 scrub ok
 2014-12-02 18:51:20.463283 7fcc6b614700  0 log_channel(default) log [INF]
 : 1.80 scrub ok
 2014-12-02 18:51:39.883295 7fcc6b614700  0 log_channel(default) log [INF]
 : 1.89 scrub ok
 2014-12-02 18:52:00.568808 7fcc6b614700  0 log_channel(default) log [INF]
 : 1.9f scrub ok
 2014-12-02 18:52:15.897191 7fcc6b614700  0 log_channel(default) log [INF]
 : 1.a3 scrub ok
 2014-12-02 18:52:34.681874 7fcc6b614700  0 log_channel(default) log [INF]
 : 1.aa scrub ok
 2014-12-02 18:52:47.833630 7fcc6b614700  0 log_channel(default) log [INF]
 : 1.b1 scrub ok
 2014-12-02 18:53:09.312792 7fcc6b614700  0 log_channel(default) log [INF]
 : 1.b3 scrub ok
 2014-12-02 18:53:25.324635 7fcc6b614700  0 log_channel(default) log [INF]
 : 1.bd scrub ok
 2014-12-02 18:53:48.638475 7fcc6b614700  0 log_channel(default) log [INF]
 : 1.c3 scrub ok
 2014-12-02 18:54:02.996972 7fcc6b614700  0 log_channel(default) log [INF]
 : 1.d7 scrub ok
 2014-12-02 18:54:19.660038 7fcc6b614700  0 log_channel(default) log [INF]
 : 1.d8 scrub ok
 2014-12-02 18:54:32.780646 7fcc6b614700  0 log_channel(default) log [INF]
 : 1.fa scrub ok
 2014-12-02 18:54:36.772931 7fcc6b614700  0 log_channel(default) log [INF]
 : 2.4 scrub ok
 2014-12-02 18:54:41.758487 7fcc6b614700  0 log_channel(default) log [INF]
 : 2.9 scrub ok
 2014-12-02 18:54:46.910043 7fcc6b614700  0 log_channel(default) log [INF]
 : 2.a scrub ok
 2014-12-02 18:54:51.908335 7fcc6b614700  0 log_channel(default) log [INF]
 : 2.16 scrub ok
 2014-12-02 18:54:54.940807 7fcc6b614700  0 log_channel(default) log [INF]
 : 2.19 scrub ok
 2014-12-02 18:55:00.956170 7fcc6b614700  0 log_channel(default) log [INF]
 : 2.44 scrub ok
 2014-12-02 18:55:01.948455 7fcc6b614700  0 log_channel(default) log [INF]
 : 2.4f scrub ok
 2014-12-02 18:55:07.273587 7fcc6b614700  0 log_channel(default) log [INF]
 : 2.76 scrub ok
 2014-12-02 18:55:10.641274 7fcc6b614700  0 log_channel(default) log [INF]
 : 2.9e scrub ok
 2014-12-02 18:55:11.621669 7fcc6b614700  0 log_channel(default) log [INF]
 : 2.ab scrub ok
 2014-12-02 18:55:18.261900 7fcc6b614700  0 log_channel(default) log [INF]
 : 2.b0 scrub ok
 2014-12-02 18:55:19.560766 7fcc6b614700  0 log_channel(default) log [INF]
 : 2.b1 scrub ok
 2014-12-02 18:55:20.501591 7fcc6b614700  0 log_channel(default) log [INF]
 : 2.bb scrub ok
 2014-12-02 18:55:21.523936 7fcc6b614700  0 log_channel(default) log [INF]
 : 2.cd scrub ok

  Interestingly, for PGs 2.x (2.4, 2.9, etc.) in the logs here, the cluster
  status was not reporting scrubbing, whereas for 0.x & 1.x it was reported
  as scrubbing in the cluster status.

  In the case of the scrub operations on PGs 2.x, is scrubbing really
  performed, or is the cluster status failing to report them?

  -Thanks & Regards,
 Mallikarjun Biradar



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Removing Snapshots Killing Cluster Performance

2014-12-02 Thread Craig Lewis
On Mon, Dec 1, 2014 at 1:51 AM, Daniel Schneller 
daniel.schnel...@centerdevice.com wrote:


 I could not find any way to throttle the background deletion activity

 (the command returns almost immediately).


I'm only aware of osd snap trim sleep.  I haven't tried this since my
Firefly upgrade though.

I have tested out osd scrub sleep under a heavy deep-scrub load, and found
that I needed a value of 1.0, which is much higher than the recommended
starting point of 0.005.  I'll revisit this when #9487 gets backported
(Thanks Dan Van Der Ster!).

I used ceph tell osd.\* injectargs, and watched my IO graphs.  Start with
0.005, and multiply by 10 until you see a change.  It took 10-60 seconds to
see a change after injecting the args.
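
The loop was roughly the following (a sketch; the numbers are starting points
for experimentation, not recommendations):

  ceph tell osd.\* injectargs '--osd_scrub_sleep 0.005'
  # watch latency for a minute or two, then multiply by 10:
  ceph tell osd.\* injectargs '--osd_scrub_sleep 0.05'
  # the same approach applies to snap trimming:
  ceph tell osd.\* injectargs '--osd_snap_trim_sleep 0.05'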

While this is a big issue in itself for us, we would at least try to
 estimate how long the process will take per snapshot / per pool. I
 assume the time needed is a function of the number of objects that were
 modified between two snapshots.


That matches my experiences as well.  Big snapshots take longer, and
are much more likely to cause a cluster outage than small snapshots.



 1) Is there any way to control how much such an operation will
 tax the cluster (we would be happy to have it run longer, if that meant
 not utilizing all disks fully during that time)?


On Firefly,  osd snap trim sleep, and playing with the CFQ scheduler are
your only options.  They're not great options.  If you can upgrade to
Giant, the snap trim sleep should solve your problem.

There is some work being done in Hammer:
https://wiki.ceph.com/Planning/Blueprints/Hammer/osd%3A_Scrub%2F%2FSnapTrim_IO_prioritization

For the time being, I'm letting my snapshots accumulate.  I can't recover
anything without the database backups, and those are deleted on time, so I
can say with a straight face that their data is deleted.  I'll collect the
garbage later.


 3) Would SSD journals help here? Or any other hardware configuration
 change for that matter?


Probably, but it's not going to fix it.  I added SSD journals.  It's
better, but I still had downtime after trimming.  I'm glad I added them
though.  The cluster is overall much healthier and more responsive.  In
particular, backfilling doesn't cause massive latency anymore.



 4) Any other recommendations? We definitely need to remove the data,
 not because of a lack of space (at least not at the moment), but because
 when customers delete stuff / cancel accounts, we are obliged to remove
 their data at least after a reasonable amount of time.


I know it's kind of snarky, but perhaps you can redefine reasonable until
you have a chance to upgrade to Giant or Hammer?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Slow Requests when taking down OSD Node

2014-12-02 Thread Craig Lewis
If you watch `ceph -w` while stopping the OSD, do you see
2014-12-02 11:45:17.715629 mon.0 [INF] osd.X marked itself down

?

On Tue, Dec 2, 2014 at 11:06 AM, Christoph Adomeit 
christoph.adom...@gatworks.de wrote:

 Thanks Craig,

 but this is what I am doing.

 After setting ceph osd set noout I do a service ceph stop osd.51
 and as soon as I do this I get growing numbers (200) of slow requests,
 although there is not a big load on my cluster.

 Christoph

 On Tue, Dec 02, 2014 at 10:40:13AM -0800, Craig Lewis wrote:
  I've found that it helps to shut down the osds before shutting down the
  host.  Especially if the node is also a monitor.  It seems that some OSD
  shutdown messages get lost while monitors are holding elections.
 
  On Tue, Dec 2, 2014 at 10:10 AM, Christoph Adomeit 
  christoph.adom...@gatworks.de wrote:
 
   Hi there,
  
   I have a giant cluster with 60 OSDs on 6 OSD Hosts.
  
   Now I want to do maintenance on one of the OSD Hosts.
  
   The documented Procedure is to ceph osd set noout and then shutdown
   the OSD Node for maintenance.
  
   However, as soon as I even shut down 1 OSD I get around 200 slow
 requests
   and the number of slow requests is growing for minutes.
  
   The test was done at night with low IOPS and I was expecting the
 cluster
   to handle this condition much better.
  
   Is there some way of a more graceful shutdown of OSDs so that I can
 prevent
   those slow requests ? I suppose it takes some time until monitor gets
   notified that an OSD was shutdown.
  
   Thanks
 Christoph
  
  
  
  


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Scrub while cluster re-balancing

2014-12-02 Thread Craig Lewis
ceph osd dump | grep ^pool will map pool names to numbers.  PGs are named
after the pool; PG 2.xx belongs to pool 2.

rados df will tell you how many items and how much data are in a pool.

On Tue, Dec 2, 2014 at 10:53 AM, Mallikarjun Biradar 
mallikarjuna.bira...@gmail.com wrote:

 Hi Craig,

 ceph -s is not showing any PG's in pool2.
  I have 3 pools: rbd and two pools that I created, testPool and testPool2.

  I have more than 10TB of data in testPool1 and a good amount of data in
  testPool2 as well.
  I am not using the rbd pool.

  -Thanks & regards,
 Mallikarjun Biradar
 On 3 Dec 2014 00:15, Craig Lewis cle...@centraldesktop.com wrote:

 You mean `ceph -w` and `ceph -s` didn't show any PGs in
 the active+clean+scrubbing state while pool 2's PGs were being scrubbed?

 I see that happen with my really small pools.  I have a bunch of RadosGW
 pools that contain 5 objects, and ~1kB of data.  When I scrub the PGs in
 those pools, they complete so fast that they never show up in `ceph -w`.


 Since you have pools 0, 1, and 2, I assume those are the default 'data',
  'metadata', and 'rbd'.  If you're not using RBD, then the rbd pool will be
 very small.



 On Tue, Dec 2, 2014 at 5:32 AM, Mallikarjun Biradar 
 mallikarjuna.bira...@gmail.com wrote:

 Hi all,

 I was running scrub while cluster is in re-balancing state.

 From the osd logs..

 2014-12-02 18:50:26.934802 7fcc6b614700  0 log_channel(default) log
 [INF] : 0.3 scrub ok
 2014-12-02 18:50:27.890785 7fcc6b614700  0 log_channel(default) log
 [INF] : 0.24 scrub ok
 2014-12-02 18:50:31.902978 7fcc6b614700  0 log_channel(default) log
 [INF] : 0.25 scrub ok
 2014-12-02 18:50:33.088060 7fcc6b614700  0 log_channel(default) log
 [INF] : 0.33 scrub ok
 2014-12-02 18:50:50.828893 7fcc6b614700  0 log_channel(default) log
 [INF] : 1.61 scrub ok
 2014-12-02 18:51:06.774648 7fcc6b614700  0 log_channel(default) log
 [INF] : 1.68 scrub ok
 2014-12-02 18:51:20.463283 7fcc6b614700  0 log_channel(default) log
 [INF] : 1.80 scrub ok
 2014-12-02 18:51:39.883295 7fcc6b614700  0 log_channel(default) log
 [INF] : 1.89 scrub ok
 2014-12-02 18:52:00.568808 7fcc6b614700  0 log_channel(default) log
 [INF] : 1.9f scrub ok
 2014-12-02 18:52:15.897191 7fcc6b614700  0 log_channel(default) log
 [INF] : 1.a3 scrub ok
 2014-12-02 18:52:34.681874 7fcc6b614700  0 log_channel(default) log
 [INF] : 1.aa scrub ok
 2014-12-02 18:52:47.833630 7fcc6b614700  0 log_channel(default) log
 [INF] : 1.b1 scrub ok
 2014-12-02 18:53:09.312792 7fcc6b614700  0 log_channel(default) log
 [INF] : 1.b3 scrub ok
 2014-12-02 18:53:25.324635 7fcc6b614700  0 log_channel(default) log
 [INF] : 1.bd scrub ok
 2014-12-02 18:53:48.638475 7fcc6b614700  0 log_channel(default) log
 [INF] : 1.c3 scrub ok
 2014-12-02 18:54:02.996972 7fcc6b614700  0 log_channel(default) log
 [INF] : 1.d7 scrub ok
 2014-12-02 18:54:19.660038 7fcc6b614700  0 log_channel(default) log
 [INF] : 1.d8 scrub ok
 2014-12-02 18:54:32.780646 7fcc6b614700  0 log_channel(default) log
 [INF] : 1.fa scrub ok
 2014-12-02 18:54:36.772931 7fcc6b614700  0 log_channel(default) log
 [INF] : 2.4 scrub ok
 2014-12-02 18:54:41.758487 7fcc6b614700  0 log_channel(default) log
 [INF] : 2.9 scrub ok
 2014-12-02 18:54:46.910043 7fcc6b614700  0 log_channel(default) log
 [INF] : 2.a scrub ok
 2014-12-02 18:54:51.908335 7fcc6b614700  0 log_channel(default) log
 [INF] : 2.16 scrub ok
 2014-12-02 18:54:54.940807 7fcc6b614700  0 log_channel(default) log
 [INF] : 2.19 scrub ok
 2014-12-02 18:55:00.956170 7fcc6b614700  0 log_channel(default) log
 [INF] : 2.44 scrub ok
 2014-12-02 18:55:01.948455 7fcc6b614700  0 log_channel(default) log
 [INF] : 2.4f scrub ok
 2014-12-02 18:55:07.273587 7fcc6b614700  0 log_channel(default) log
 [INF] : 2.76 scrub ok
 2014-12-02 18:55:10.641274 7fcc6b614700  0 log_channel(default) log
 [INF] : 2.9e scrub ok
 2014-12-02 18:55:11.621669 7fcc6b614700  0 log_channel(default) log
 [INF] : 2.ab scrub ok
 2014-12-02 18:55:18.261900 7fcc6b614700  0 log_channel(default) log
 [INF] : 2.b0 scrub ok
 2014-12-02 18:55:19.560766 7fcc6b614700  0 log_channel(default) log
 [INF] : 2.b1 scrub ok
 2014-12-02 18:55:20.501591 7fcc6b614700  0 log_channel(default) log
 [INF] : 2.bb scrub ok
 2014-12-02 18:55:21.523936 7fcc6b614700  0 log_channel(default) log
 [INF] : 2.cd scrub ok

  Interestingly, for PGs 2.x (2.4, 2.9, etc.) in the logs here, the cluster
  status was not reporting scrubbing, whereas for 0.x & 1.x it was reported
  as scrubbing in the cluster status.

  In the case of the scrub operations on PGs 2.x, is scrubbing really
  performed, or is the cluster status failing to report them?

   -Thanks & Regards,
 Mallikarjun Biradar




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Rebuild OSD's

2014-12-02 Thread Craig Lewis
You have a total of 2 OSDs, and 2 disks, right?

The safe method is to mark one OSD out, and wait for the cluster to heal.
Delete, reformat, add it back to the cluster, and wait for the cluster to
heal.  Repeat.  But that only works when you have enough OSDs that the
cluster can heal.

So you'll have to go the less safe route, and hope you don't suffer a
failure in the middle.  I went
this route, because it was taking too long to do the safe route:



First, set up your ceph.conf with the new osd options: osd mkfs *, osd
journal *, whatever you want the OSDs to look like when you're done.  You
may want to set osd max backfills to 1 before you start.  The default value
of 10 is really only a good idea if you have a large cluster and SSD
journals.
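
For illustration, the [osd] section meant here could look something like this
(a sketch only; the mkfs, mount, and journal values are placeholders, not
recommendations):

  [osd]
    osd mkfs type = xfs
    osd mkfs options xfs = -f -i size=2048
    osd mount options xfs = rw,noatime,inode64
    osd journal size = 5120
    osd max backfills = 1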

Remove the disk, format, and put it back in:

   - ceph osd set norecover
   - ceph osd set nobackfill
   - ceph osd out $OSDID
   - sleep 30
   - stop ceph-osd id=$OSDID
   - ceph osd crush remove osd.$OSDID
   - ceph osd lost $OSDID --yes-i-really-mean-it
   - ceph auth del osd.$OSDID
   - ceph osd rm $OSDID
   - ceph-disk-prepare --zap $dev $journal   # ceph-deploy would also work
   - ceph osd unset norecover
   - ceph osd unset nobackfill

Wait for the cluster to heal, then repeat.



It's more complicated if you have multiple devices in the zpool and you're
using more than a small percentage of the disk space.


On Sat, Nov 29, 2014 at 2:29 PM, Lindsay Mathieson 
lindsay.mathie...@gmail.com wrote:

 I have 2 OSDs on two nodes on top of zfs that I'd like to rebuild in a more
 standard (xfs) setup.

 Would the following be a non-destructive, if somewhat tedious, way of doing
 so?

 Following the instructions from here:


 http://ceph.com/docs/master/rados/operations/add-or-rm-osds/#removing-osds-manual

 1. Remove osd.0
 2. Recreate osd.0
 3. Add. osd.0
 4. Wait for health to be restored
 i.e all data be copied from osd.1 to osd.0

 5. Remove osd.1
 6. Recreate osd.1
 7. Add. osd.1
 8. Wait for health to be restored
 i.e all data be copied from osd.0 to osd.1

 9. Profit!


 There's 1TB of data total. I can do this after hours while the system &
 network are not being used.

 I do have complete backups in case it all goes pear shaped.

 thanks,
 --
 Lindsay


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Optimal or recommended threads values

2014-12-01 Thread Craig Lewis
I'm still using the default values, mostly because I haven't had time to
test.

On Thu, Nov 27, 2014 at 2:44 AM, Andrei Mikhailovsky and...@arhont.com
wrote:

 Hi Craig,

 Are you keeping the filestore, disk and op threads at their default
 values? or did you also change them?

 Cheers


 Tuning these values depends on a lot more than just the SSDs and HDDs.
 Which kernel and IO scheduler are you using?  Does your HBA do write
 caching?  It also depends on what your goals are.  Tuning for a RadosGW
  cluster is different than for an RBD cluster.  The short answer is that you
  are the only person who can tell you what your optimal values are.  As
 always, the best benchmark is production load.


 In my small cluster (5 nodes, 44 osds), I'm optimizing to minimize latency
 during recovery.  When the cluster is healthy, bandwidth and latency are
 more than adequate for my needs.  Even with journals on SSDs, I've found
 that reducing the number of operations and threads has reduced my average
 latency.

 I use injectargs to try out new values while I monitor cluster latency.  I
 monitor latency while the cluster is healthy and recovering.  If a change
 is deemed better, only then will I persist the change to ceph.conf.  This
 gives me a fallback that any changes that causes massive problems can be
 undone with a restart or reboot.


 So far, the configs that I've written to ceph.conf are
 [global]
   mon osd down out interval = 900
   mon osd min down reporters = 9
   mon osd min down reports = 12
   osd pool default flag hashpspool = true

 [osd]
   osd max backfills = 1
   osd recovery max active = 1
   osd recovery op priority = 1


 I have it on my list to investigate filestore max sync interval.  And now
 that I've pasted that, I need to revisit the min down reports/reporters.  I
 have some nodes with 10 OSDs, and I don't want any one node able to mark
 the rest of the cluster as down (it happened once).




 On Sat, Nov 22, 2014 at 6:24 AM, Andrei Mikhailovsky and...@arhont.com
 wrote:

 Hello guys,

 Could some one comment on the optimal or recommended values of various
 threads values in ceph.conf?

 At the moment I have the following settings:

 filestore_op_threads = 8
 osd_disk_threads = 8
 osd_op_threads = 8
 filestore_merge_threshold = 40
 filestore_split_multiple = 8

 Are these reasonable for a small cluster made of 7.2K SAS disks with ssd
 journals with a ratio of 4:1?

 What are the settings that other people are using?

 Thanks

 Andrei



 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Tip of the week: don't use Intel 530 SSD's for journals

2014-11-25 Thread Craig Lewis
I have suffered power losses in every data center I've been in.  I have
lost SSDs because of it (Intel 320 Series).  The worst time, I lost both
SSDs in a RAID1.  That was a bad day.

I'm using the Intel DC S3700 now, so I don't have a repeat of that.  My cluster is
small enough that losing a journal SSD would be a major headache.

I'm manually monitoring wear level.  So far all of my journals are still at
100% lifetime.  I do have some of the Intel 320 that are down to 45%
lifetime remaining.  (Those Intel 320s are in less critical roles).  One of
these days I'll get around to automating it.
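
The manual check is just smartctl; the attribute name varies by vendor
(Media_Wearout_Indicator on these Intel drives), and /dev/sdX is a
placeholder:

    smartctl -A /dev/sdX | grep -i wear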


Speed wise, my small cluster was fast enough without SSDs, until I started
to expand.  I'm only using RadosGW, and I only care about latency in the
human timeframe.  A second or two of latency is annoying, but not a big
deal.

I went from 3 nodes to 5, and the expansion was extremely painful.  I admit
that I inflicted a lot of pain on myself.  I expanded too fast (add all the
OSDs at the same time?  Sure, why not.), and I was using the default
configs.  Things got better after I lowered the backfill priority and
count, and learned to add one or two disks at a time.  Still, customers
noticed the increase in latency when I was adding osds.

Now that I have the journals on SSDs, customers don't notice the
maintenance anymore.  RadosGW latency goes from ~50ms to ~80ms, not ~50ms
to 2000ms.



On Tue, Nov 25, 2014 at 9:12 AM, Michael Kuriger mk7...@yp.com wrote:

 My cluster is actually very fast without SSD drives.  Thanks for the
 advice!

 Michael Kuriger
 mk7...@yp.com
 818-649-7235

 MikeKuriger (IM)




 On 11/25/14, 7:49 AM, Mark Nelson mark.nel...@inktank.com wrote:

 On 11/25/2014 09:41 AM, Erik Logtenberg wrote:
  If you are like me, you have the journals for your OSD's with rotating
  media stored separately on an SSD. If you are even more like me, you
  happen to use Intel 530 SSD's in some of your hosts. If so, please do
  check your S.M.A.R.T. statistics regularly, because these SSD's really
  can't cope with Ceph.
 
  Check out the media-wear graphs for the two Intel 530's in my cluster.
  As soon as those declining lines get down to 30% or so, they need to be
  replaced. That means less than half a year between purchase and
  end-of-life :(
 
  Tip of the week, keep an eye on those statistics, don't let a failing
  SSD surprise you.
 
 This is really good advice, and it's not just the Intel 530s.  Most
 consumer grade SSDs have pretty low write endurance.  If you mostly are
 doing reads from your cluster you may be OK, but if you have even
 moderately high write workloads and you care about avoiding OSD downtime
 (which in a production cluster is pretty important though not usually
 100% critical), get high write endurance SSDs.
 
 Mark
 
 
  Erik.
 
 
 
  ___
  ceph-users mailing list
  ceph-users@lists.ceph.com
 
 
  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 
  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] private network - VLAN vs separate switch

2014-11-25 Thread Craig Lewis
It's mostly about bandwidth.  With VLANs, the public and cluster networks
are going to be sharing the inter-switch links.

For a cluster that size, I don't see much advantage to the VLANs.  You'll
save a few ports by having the inter-switch links shared, at the expense of
contention on those links.

If you're trying to save ports, I'd go with a single network.  Adding a
cluster network later is relatively straight forward.  Just monitor the
bandwidth on the inter-switch links, and plan to expand when you saturate
those links.


That said, I am using VLANs, but my cluster is much smaller.  I only have 5
nodes and a single switch.  I'm planning to transition to a dedicated
cluster switch when I need the extra ports.  I don't anticipate the
transition being difficult.  I'll continue to use the same VLAN on the
dedicated switch, just to make the migration less complicated.


On Tue, Nov 25, 2014 at 3:11 AM, Sreenath BH bhsreen...@gmail.com wrote:

 Hi
 For a large network (say 100 servers and 2500 disks), are there any
 strong advantages to using separate switch and physical network
 instead of VLAN?

 Also, how difficult it would be to switch from a VLAN to using
 separate switches later?

 -Sreenath
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Create OSD on ZFS Mount (firefly)

2014-11-25 Thread Craig Lewis
There was a good thread on the mailing list a little while ago.  There were
several recommendations in that thread, maybe some of them will help.

Found it:
https://www.mail-archive.com/ceph-users@lists.ceph.com/msg14154.html


On Tue, Nov 25, 2014 at 4:16 AM, Lindsay Mathieson 
lindsay.mathie...@gmail.com wrote:

  Testing ceph on top of ZFS (zfsonlinux), kernel driver.



 - Have created ZFS mount:

 /var/lib/ceph/osd/ceph-0



 - followed the instructions at:

 http://ceph.com/docs/firefly/rados/operations/add-or-rm-osds/



 failing on the step 4. Initialize the OSD data directory.





 ceph-osd -i 0 --mkfs --mkkey

 2014-11-25 22:12:26.563666 7ff12b466780 -1
 filestore(/var/lib/ceph/osd/ceph-0) mkjournal error creating journal on
 /var/lib/ceph/osd/ceph-0/journal: (22) Invalid argument

 2014-11-25 22:12:26.563691 7ff12b466780 -1 OSD::mkfs: ObjectStore::mkfs
 failed with error -22

 2014-11-25 22:12:26.563765 7ff12b466780 -1 ** ERROR: error creating empty
 object store in /var/lib/ceph/osd/ceph-0: (22) Invalid argument





 Is this supported?



 thanks,



 --

 Lindsay

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Regarding Federated Gateways - Zone Sync Issues

2014-11-24 Thread Craig Lewis
/::initializing
 2014-11-22 14:19:21.918229 7f73b07c0700 10 host=us-west-1.lt.com
 rgw_dns_name=us-west-1.lt.com
 2014-11-22 14:19:21.918288 7f73b07c0700  2 req 1:0.53:swift-auth:GET
 /auth/::getting op
 2014-11-22 14:19:21.918300 7f73b07c0700  2 req 1:0.71:swift-auth:GET
 /auth/:swift_auth_get:authorizing
 2014-11-22 14:19:21.918307 7f73b07c0700  2 req 1:0.78:swift-auth:GET
 /auth/:swift_auth_get:reading permissions
 2014-11-22 14:19:21.918313 7f73b07c0700  2 req 1:0.84:swift-auth:GET
 /auth/:swift_auth_get:init op
 2014-11-22 14:19:21.918319 7f73b07c0700  2 req 1:0.90:swift-auth:GET
 /auth/:swift_auth_get:verifying op mask
 2014-11-22 14:19:21.918325 7f73b07c0700 20 required_mask= 0 user.op_mask=7
 2014-11-22 14:19:21.918330 7f73b07c0700  2 req 1:0.000100:swift-auth:GET
 /auth/:swift_auth_get:verifying op permissions
 2014-11-22 14:19:21.918336 7f73b07c0700  2 req 1:0.000107:swift-auth:GET
 /auth/:swift_auth_get:verifying op params
 2014-11-22 14:19:21.918341 7f73b07c0700  2 req 1:0.000112:swift-auth:GET
 /auth/:swift_auth_get:executing
 2014-11-22 14:19:21.918470 7f73b07c0700 20 get_obj_state:
 rctx=0x7f73dc002030 obj=.us-west.users.swift:east-user:swift
 state=0x7f73dc0066d8 s-prefetch_data=0
 2014-11-22 14:19:21.918494 7f73b07c0700 10 cache get:
 name=.us-west.users.swift+east-user:swift : miss
 2014-11-22 14:19:21.931892 7f73b07c0700 10 cache put:
 name=.us-west.users.swift+east-user:swift
 2014-11-22 14:19:21.931892 7f73b07c0700 10 adding
 .us-west.users.swift+east-user:swift to cache LRU end
 2014-11-22 14:19:21.931892 7f73b07c0700 20 get_obj_state: s-obj_tag was
 set empty
 2014-11-22 14:19:21.931892 7f73b07c0700 10 cache get:
 name=.us-west.users.swift+east-user:swift : type miss (requested=1,
 cached=6)
 2014-11-22 14:19:21.931893 7f73b07c0700 20 get_obj_state:
 rctx=0x7f73dc007300 obj=.us-west.users.swift:east-user:swift
 state=0x7f73dc006558 s-prefetch_data=0
 2014-11-22 14:19:21.931893 7f73b07c0700 10 cache get:
 name=.us-west.users.swift+east-user:swift : hit
 2014-11-22 14:19:21.931893 7f73b07c0700 20 get_obj_state: s-obj_tag was
 set empty
 2014-11-22 14:19:21.931893 7f73b07c0700 20 get_obj_state:
 rctx=0x7f73dc007300 obj=.us-west.users.swift:east-user:swift
 state=0x7f73dc006558 s-prefetch_data=0
 2014-11-22 14:19:21.931893 7f73b07c0700 20 state for
 obj=.us-west.users.swift:east-user:swift is not atomic, not appending
 atomic test
 2014-11-22 14:19:21.931893 7f73b07c0700 20 rados-read obj-ofs=0
 read_ofs=0 read_len=524288
 2014-11-22 14:19:21.932003 7f73b07c0700 20 rados-read r=0 bl.length=13
 2014-11-22 14:19:21.932021 7f73b07c0700 10 cache put:
 name=.us-west.users.swift+east-user:swift
 2014-11-22 14:19:21.932023 7f73b07c0700 10 moving
 .us-west.users.swift+east-user:swift to cache LRU end
 2014-11-22 14:19:21.932054 7f73b07c0700 20 get_obj_state:
 rctx=0x7f73dc006b30 obj=.us-west.users.uid:east-user state=0x7f73dc006498
 s-prefetch_data=0
 2014-11-22 14:19:21.932062 7f73b07c0700 10 cache get:
 name=.us-west.users.uid+east-user : miss
 2014-11-22 14:19:21.933559 7f73b07c0700 10 cache put:
 name=.us-west.users.uid+east-user
 2014-11-22 14:19:21.933567 7f73b07c0700 10 adding
 .us-west.users.uid+east-user to cache LRU end
 2014-11-22 14:19:21.933572 7f73b07c0700 20 get_obj_state: s-obj_tag was
 set empty
 2014-11-22 14:19:21.933580 7f73b07c0700 10 cache get:
 name=.us-west.users.uid+east-user : type miss (requested=1, cached=6)
 2014-11-22 14:19:21.933601 7f73b07c0700 20 get_obj_state:
 rctx=0x7f73dc006b30 obj=.us-west.users.uid:east-user state=0x7f73dc006498
 s-prefetch_data=0
 2014-11-22 14:19:21.933607 7f73b07c0700 10 cache get:
 name=.us-west.users.uid+east-user : hit
 2014-11-22 14:19:21.933611 7f73b07c0700 20 get_obj_state: s-obj_tag was
 set empty
 2014-11-22 14:19:21.933617 7f73b07c0700 20 get_obj_state:
 rctx=0x7f73dc006b30 obj=.us-west.users.uid:east-user state=0x7f73dc006498
 s-prefetch_data=0
 2014-11-22 14:19:21.933620 7f73b07c0700 20 state for
 obj=.us-west.users.uid:east-user is not atomic, not appending atomic test
 2014-11-22 14:19:21.933622 7f73b07c0700 20 rados-read obj-ofs=0
 read_ofs=0 read_len=524288
 2014-11-22 14:19:21.934709 7f73b07c0700 20 rados-read r=0 bl.length=310
 2014-11-22 14:19:21.934725 7f73b07c0700 10 cache put:
 name=.us-west.users.uid+east-user
 2014-11-22 14:19:21.934727 7f73b07c0700 10 moving
 .us-west.users.uid+east-user to cache LRU end
 2014-11-22 14:19:21.934790 7f73b07c0700  2 req 1:0.016560:swift-auth:GET
 /auth/:swift_auth_get:http status=403
 2014-11-22 14:19:21.934794 7f73b07c0700  1 == req done
 req=0x7f73e000d010 http_status=403 ==
 2014-11-22 14:19:21.934800 7f73b07c0700 20 process_request() returned -1

 Why am I not able to authenticate?

 On Fri, Nov 21, 2014 at 1:04 AM, Craig Lewis cle...@centraldesktop.com
 wrote:

 You need to create two system users, in both zones.  They should have the
 same name, access key, and secret in both zones.  By convention, these
 system users are named the same as the zones.

 You

Re: [ceph-users] Optimal or recommended threads values

2014-11-24 Thread Craig Lewis
Tuning these values depends on a lot more than just the SSDs and HDDs.
Which kernel and IO scheduler are you using?  Does your HBA do write
caching?  It also depends on what your goals are.  Tuning for a RadosGW
cluster is different than for an RBD cluster.  The short answer is that you
are the only person who can tell you what your optimal values are.  As
always, the best benchmark is production load.


In my small cluster (5 nodes, 44 osds), I'm optimizing to minimize latency
during recovery.  When the cluster is healthy, bandwidth and latency are
more than adequate for my needs.  Even with journals on SSDs, I've found
that reducing the number of operations and threads has reduced my average
latency.

I use injectargs to try out new values while I monitor cluster latency.  I
monitor latency while the cluster is healthy and recovering.  If a change
is deemed better, only then will I persist the change to ceph.conf.  This
gives me a fallback that any changes that causes massive problems can be
undone with a restart or reboot.
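
For example, roughly (the setting and value here are illustrative, not a
recommendation):

    # try the new value at runtime on every OSD
    ceph tell osd.* injectargs '--osd_recovery_max_active 1'
    # watch latency while healthy and while recovering; only if it helps,
    # add the same setting under [osd] in ceph.conf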


So far, the configs that I've written to ceph.conf are
[global]
  mon osd down out interval = 900
  mon osd min down reporters = 9
  mon osd min down reports = 12
  osd pool default flag hashpspool = true

[osd]
  osd max backfills = 1
  osd recovery max active = 1
  osd recovery op priority = 1


I have it on my list to investigate filestore max sync interval.  And now
that I've pasted that, I need to revisit the min down reports/reporters.  I
have some nodes with 10 OSDs, and I don't want any one node able to mark
the rest of the cluster as down (it happened once).




On Sat, Nov 22, 2014 at 6:24 AM, Andrei Mikhailovsky and...@arhont.com
wrote:

 Hello guys,

 Could some one comment on the optimal or recommended values of various
 threads values in ceph.conf?

 At the moment I have the following settings:

 filestore_op_threads = 8
 osd_disk_threads = 8
 osd_op_threads = 8
 filestore_merge_threshold = 40
 filestore_split_multiple = 8

 Are these reasonable for a small cluster made of 7.2K SAS disks with ssd
 journals with a ratio of 4:1?

 What are the settings that other people are using?

 Thanks

 Andrei



 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Regarding Federated Gateways - Zone Sync Issues

2014-11-20 Thread Craig Lewis
You need to create two system users, in both zones.  They should have the
same name, access key, and secret in both zones.  By convention, these
system users are named the same as the zones.
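
For example, roughly (uids, display names and keys below are placeholders;
run the same two commands, with identical keys, against the other zone's
gateway as well):

    radosgw-admin user create --uid=us-east --display-name="us-east replication" \
        --access-key=EAST_KEY --secret=EAST_SECRET --system
    radosgw-admin user create --uid=us-west --display-name="us-west replication" \
        --access-key=WEST_KEY --secret=WEST_SECRET --system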

You shouldn't use those system users for anything other than replication.
You should create a non-system user to interact with the cluster.  Just
like you don't run as root all the time, you don't want to be a radosgw
system user all the time.  You only need to create this user in the primary
zone.

Once replication is working, it should copy the non-system user to the
secondary cluster, as well as any buckets and objects this user creates.


On Wed, Nov 19, 2014 at 1:16 AM, Vinod H I vinvi...@gmail.com wrote:

 Hi,
 I am using firefly version 0.80.7.
 I am testing disaster recovery mechanism for rados gateways.
 I have followed the federated gateway setup as mentioned in the docs.
 There is one region with two zones on the same cluster.
 After sync(using radosgw-agent, with --sync-scope=full), container
 created by the swift user(with --system flag) on the master zone gateway
 is not visible for the swift user(with --system flag) on the slave zone.
 There are no error during the syncing process.
 I tried by creating a new slave zone user with same uid and access and
 secret keys as that of master. It did not work!
 Any idea on how to be able to read the synced containers from the slave
 zone?
 Is there any requirement that the two zones must be on separate clusters?
 --
 Vinod H I


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pg's degraded

2014-11-20 Thread Craig Lewis
Just to be clear, this is from a cluster that was healthy, had a disk
replaced, and hasn't returned to healthy?  It's not a new cluster that has
never been healthy, right?

Assuming it's an existing cluster, how many OSDs did you replace?  It
almost looks like you replaced multiple OSDs at the same time, and lost
data because of it.

Can you give us the output of `ceph osd tree`, and `ceph pg 2.33 query`?


On Wed, Nov 19, 2014 at 2:14 PM, JIten Shah jshah2...@me.com wrote:

 After rebuilding a few OSD’s, I see that the pg’s are stuck in degraded
 mode. Some are in the unclean and others are in the stale state. Somehow
 the MDS is also degraded. How do I recover the OSD’s and the MDS back to
 healthy ? Read through the documentation and on the web but no luck so far.

 pg 2.33 is stuck unclean since forever, current state
 stale+active+degraded+remapped, last acting [3]
 pg 0.30 is stuck unclean since forever, current state
 stale+active+degraded+remapped, last acting [3]
 pg 1.31 is stuck unclean since forever, current state
 stale+active+degraded, last acting [2]
 pg 2.32 is stuck unclean for 597129.903922, current state
 stale+active+degraded, last acting [2]
 pg 0.2f is stuck unclean for 597129.903951, current state
 stale+active+degraded, last acting [2]
 pg 1.2e is stuck unclean since forever, current state
 stale+active+degraded+remapped, last acting [3]
 pg 2.2d is stuck unclean since forever, current state
 stale+active+degraded+remapped, last acting [2]
 pg 0.2e is stuck unclean since forever, current state
 stale+active+degraded+remapped, last acting [3]
 pg 1.2f is stuck unclean for 597129.904015, current state
 stale+active+degraded, last acting [2]
 pg 2.2c is stuck unclean since forever, current state
 stale+active+degraded+remapped, last acting [3]
 pg 0.2d is stuck stale for 422844.566858, current state
 stale+active+degraded, last acting [2]
 pg 1.2c is stuck stale for 422598.539483, current state
 stale+active+degraded+remapped, last acting [3]
 pg 2.2f is stuck stale for 422598.539488, current state
 stale+active+degraded+remapped, last acting [3]
 pg 0.2c is stuck stale for 422598.539487, current state
 stale+active+degraded+remapped, last acting [3]
 pg 1.2d is stuck stale for 422598.539492, current state
 stale+active+degraded+remapped, last acting [3]
 pg 2.2e is stuck stale for 422598.539496, current state
 stale+active+degraded+remapped, last acting [3]
 pg 0.2b is stuck stale for 422598.539491, current state
 stale+active+degraded+remapped, last acting [3]
 pg 1.2a is stuck stale for 422598.539496, current state
 stale+active+degraded+remapped, last acting [3]
 pg 2.29 is stuck stale for 422598.539504, current state
 stale+active+degraded+remapped, last acting [3]
 .
 .
 .
 6 ops are blocked > 2097.15 sec
 3 ops are blocked > 2097.15 sec on osd.0
 2 ops are blocked > 2097.15 sec on osd.2
 1 ops are blocked > 2097.15 sec on osd.4
 3 osds have slow requests
 recovery 40/60 objects degraded (66.667%)
 mds cluster is degraded
 mds.Lab-cephmon001 at X.X.16.111:6800/3424727 rank 0 is replaying journal

 —Jiten


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pg's degraded

2014-11-20 Thread Craig Lewis
So you have your crushmap set to choose osd instead of choose host?

Did you wait for the cluster to recover between each OSD rebuild?  If you
rebuilt all 3 OSDs at the same time (or without waiting for a complete
recovery between them), that would cause this problem.



On Thu, Nov 20, 2014 at 11:40 AM, JIten Shah jshah2...@me.com wrote:

 Yes, it was a healthy cluster and I had to rebuild because the OSD’s got
 accidentally created on the root disk. Out of 4 OSD’s I had to rebuild 3 of
 them.


 [jshah@Lab-cephmon001 ~]$ ceph osd tree
 # id weight type name up/down reweight
 -1 0.5 root default
 -2 0.0 host Lab-cephosd005
 4 0.0 osd.4 up 1
 -3 0.0 host Lab-cephosd001
 0 0.0 osd.0 up 1
 -4 0.0 host Lab-cephosd002
 1 0.0 osd.1 up 1
 -5 0.0 host Lab-cephosd003
 2 0.0 osd.2 up 1
 -6 0.0 host Lab-cephosd004
 3 0.0 osd.3 up 1


 [jshah@Lab-cephmon001 ~]$ ceph pg 2.33 query
 Error ENOENT: i don't have pgid 2.33

 —Jiten


 On Nov 20, 2014, at 11:18 AM, Craig Lewis cle...@centraldesktop.com
 wrote:

 Just to be clear, this is from a cluster that was healthy, had a disk
 replaced, and hasn't returned to healthy?  It's not a new cluster that has
 never been healthy, right?

 Assuming it's an existing cluster, how many OSDs did you replace?  It
 almost looks like you replaced multiple OSDs at the same time, and lost
 data because of it.

 Can you give us the output of `ceph osd tree`, and `ceph pg 2.33 query`?


 On Wed, Nov 19, 2014 at 2:14 PM, JIten Shah jshah2...@me.com wrote:

 After rebuilding a few OSD’s, I see that the pg’s are stuck in degraded
 mode. Some are in the unclean and others are in the stale state. Somehow
 the MDS is also degraded. How do I recover the OSD’s and the MDS back to
 healthy ? Read through the documentation and on the web but no luck so far.

 pg 2.33 is stuck unclean since forever, current state
 stale+active+degraded+remapped, last acting [3]
 pg 0.30 is stuck unclean since forever, current state
 stale+active+degraded+remapped, last acting [3]
 pg 1.31 is stuck unclean since forever, current state
 stale+active+degraded, last acting [2]
 pg 2.32 is stuck unclean for 597129.903922, current state
 stale+active+degraded, last acting [2]
 pg 0.2f is stuck unclean for 597129.903951, current state
 stale+active+degraded, last acting [2]
 pg 1.2e is stuck unclean since forever, current state
 stale+active+degraded+remapped, last acting [3]
 pg 2.2d is stuck unclean since forever, current state
 stale+active+degraded+remapped, last acting [2]
 pg 0.2e is stuck unclean since forever, current state
 stale+active+degraded+remapped, last acting [3]
 pg 1.2f is stuck unclean for 597129.904015, current state
 stale+active+degraded, last acting [2]
 pg 2.2c is stuck unclean since forever, current state
 stale+active+degraded+remapped, last acting [3]
 pg 0.2d is stuck stale for 422844.566858, current state
 stale+active+degraded, last acting [2]
 pg 1.2c is stuck stale for 422598.539483, current state
 stale+active+degraded+remapped, last acting [3]
 pg 2.2f is stuck stale for 422598.539488, current state
 stale+active+degraded+remapped, last acting [3]
 pg 0.2c is stuck stale for 422598.539487, current state
 stale+active+degraded+remapped, last acting [3]
 pg 1.2d is stuck stale for 422598.539492, current state
 stale+active+degraded+remapped, last acting [3]
 pg 2.2e is stuck stale for 422598.539496, current state
 stale+active+degraded+remapped, last acting [3]
 pg 0.2b is stuck stale for 422598.539491, current state
 stale+active+degraded+remapped, last acting [3]
 pg 1.2a is stuck stale for 422598.539496, current state
 stale+active+degraded+remapped, last acting [3]
 pg 2.29 is stuck stale for 422598.539504, current state
 stale+active+degraded+remapped, last acting [3]
 .
 .
 .
 6 ops are blocked > 2097.15 sec
 3 ops are blocked > 2097.15 sec on osd.0
 2 ops are blocked > 2097.15 sec on osd.2
 1 ops are blocked > 2097.15 sec on osd.4
 3 osds have slow requests
 recovery 40/60 objects degraded (66.667%)
 mds cluster is degraded
 mds.Lab-cephmon001 at X.X.16.111:6800/3424727 rank 0 is replaying journal

 —Jiten


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pg's degraded

2014-11-20 Thread Craig Lewis
If there's no data to lose, tell Ceph to re-create all the missing PGs.

ceph pg force_create_pg 2.33

Repeat for each of the missing PGs.  If that doesn't do anything, you might
need to tell Ceph that you lost the OSDs.  For each OSD you moved, run ceph
osd lost OSDID, then try the force_create_pg command again.
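
With a lot of missing PGs, a small loop over the stuck ones saves typing; a
rough sketch, assuming every one of those PGs really is expendable:

    ceph pg dump_stuck stale | awk '$1 ~ /^[0-9]+\./ {print $1}' |
        while read pg; do ceph pg force_create_pg $pg; done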

If that doesn't work, you can keep fighting with it, but it'll be faster to
rebuild the cluster.



On Thu, Nov 20, 2014 at 1:45 PM, JIten Shah jshah2...@me.com wrote:

 Thanks for your help.

 I was using puppet to install the OSD’s where it chooses a path over a
 device name. Hence it created the OSD in the path within the root volume
 since the path specified was incorrect.

 And all 3 of the OSD’s were rebuilt at the same time because it was unused
 and we had not put any data in there.

 Any way to recover from this or should i rebuild the cluster altogether.

 —Jiten

 On Nov 20, 2014, at 1:40 PM, Craig Lewis cle...@centraldesktop.com
 wrote:

 So you have your crushmap set to choose osd instead of choose host?

 Did you wait for the cluster to recover between each OSD rebuild?  If you
 rebuilt all 3 OSDs at the same time (or without waiting for a complete
 recovery between them), that would cause this problem.



 On Thu, Nov 20, 2014 at 11:40 AM, JIten Shah jshah2...@me.com wrote:

 Yes, it was a healthy cluster and I had to rebuild because the OSD’s got
 accidentally created on the root disk. Out of 4 OSD’s I had to rebuild 3 of
 them.


 [jshah@Lab-cephmon001 ~]$ ceph osd tree
 # id weight type name up/down reweight
 -1 0.5 root default
 -2 0.0 host Lab-cephosd005
 4 0.0 osd.4 up 1
 -3 0.0 host Lab-cephosd001
 0 0.0 osd.0 up 1
 -4 0.0 host Lab-cephosd002
 1 0.0 osd.1 up 1
 -5 0.0 host Lab-cephosd003
 2 0.0 osd.2 up 1
 -6 0.0 host Lab-cephosd004
 3 0.0 osd.3 up 1


 [jshah@Lab-cephmon001 ~]$ ceph pg 2.33 query
 Error ENOENT: i don't have pgid 2.33

 —Jiten


 On Nov 20, 2014, at 11:18 AM, Craig Lewis cle...@centraldesktop.com
 wrote:

 Just to be clear, this is from a cluster that was healthy, had a disk
 replaced, and hasn't returned to healthy?  It's not a new cluster that has
 never been healthy, right?

 Assuming it's an existing cluster, how many OSDs did you replace?  It
 almost looks like you replaced multiple OSDs at the same time, and lost
 data because of it.

 Can you give us the output of `ceph osd tree`, and `ceph pg 2.33 query`?


 On Wed, Nov 19, 2014 at 2:14 PM, JIten Shah jshah2...@me.com wrote:

 After rebuilding a few OSD’s, I see that the pg’s are stuck in degraded
 mode. Some are in the unclean and others are in the stale state. Somehow
 the MDS is also degraded. How do I recover the OSD’s and the MDS back to
 healthy ? Read through the documentation and on the web but no luck so far.

 pg 2.33 is stuck unclean since forever, current state
 stale+active+degraded+remapped, last acting [3]
 pg 0.30 is stuck unclean since forever, current state
 stale+active+degraded+remapped, last acting [3]
 pg 1.31 is stuck unclean since forever, current state
 stale+active+degraded, last acting [2]
 pg 2.32 is stuck unclean for 597129.903922, current state
 stale+active+degraded, last acting [2]
 pg 0.2f is stuck unclean for 597129.903951, current state
 stale+active+degraded, last acting [2]
 pg 1.2e is stuck unclean since forever, current state
 stale+active+degraded+remapped, last acting [3]
 pg 2.2d is stuck unclean since forever, current state
 stale+active+degraded+remapped, last acting [2]
 pg 0.2e is stuck unclean since forever, current state
 stale+active+degraded+remapped, last acting [3]
 pg 1.2f is stuck unclean for 597129.904015, current state
 stale+active+degraded, last acting [2]
 pg 2.2c is stuck unclean since forever, current state
 stale+active+degraded+remapped, last acting [3]
 pg 0.2d is stuck stale for 422844.566858, current state
 stale+active+degraded, last acting [2]
 pg 1.2c is stuck stale for 422598.539483, current state
 stale+active+degraded+remapped, last acting [3]
 pg 2.2f is stuck stale for 422598.539488, current state
 stale+active+degraded+remapped, last acting [3]
 pg 0.2c is stuck stale for 422598.539487, current state
 stale+active+degraded+remapped, last acting [3]
 pg 1.2d is stuck stale for 422598.539492, current state
 stale+active+degraded+remapped, last acting [3]
 pg 2.2e is stuck stale for 422598.539496, current state
 stale+active+degraded+remapped, last acting [3]
 pg 0.2b is stuck stale for 422598.539491, current state
 stale+active+degraded+remapped, last acting [3]
 pg 1.2a is stuck stale for 422598.539496, current state
 stale+active+degraded+remapped, last acting [3]
 pg 2.29 is stuck stale for 422598.539504, current state
 stale+active+degraded+remapped, last acting [3]
 .
 .
 .
 6 ops are blocked > 2097.15 sec
 3 ops are blocked > 2097.15 sec on osd.0
 2 ops are blocked > 2097.15 sec on osd.2
 1 ops are blocked > 2097.15 sec on osd.4
 3 osds have slow requests
 recovery 40/60 objects

Re: [ceph-users] OSD commits suicide

2014-11-18 Thread Craig Lewis
That would probably have helped.  The XFS deadlocks would only occur when
there was relatively little free memory.  Kernel 3.18 is supposed to have a
fix for that, but I haven't tried it yet.

Looking at my actual usage, I don't even need 64k inodes.  64k inodes
should make things a bit faster when you have a large number of files in a
directory.  Ceph will automatically split directories with too many files
into multiple sub-directories, so it's kinda pointless.

I may try the experiment again, but probably not.  It took several weeks to
reformat all of the OSDs.  Even on a single node, it takes 4-5 days to
drain, format, and backfill.  That was months ago, and I'm still dealing
with the side effects.  I'm not eager to try again.


On Mon, Nov 17, 2014 at 2:04 PM, Andrey Korolyov and...@xdel.ru wrote:

 On Tue, Nov 18, 2014 at 12:54 AM, Craig Lewis cle...@centraldesktop.com
 wrote:
  I did have a problem in my secondary cluster that sounds similar to
 yours.
  I was using XFS, and traced my problem back to 64 kB inodes (osd mkfs
  options xfs = -i size=64k).   This showed up with a lot of XFS: possible
  memory allocation deadlock in kmem_alloc in the kernel logs.  I was
 able to
  keep things limping along by flushing the cache frequently, but I
 eventually
  re-formatted every OSD to get rid of the 64k inodes.
 
  After I finished the reformat, I had problems because of deep-scrubbing.
  While reformatting, I disabled deep-scrubbing.  Once I re-enabled it,
 Ceph
  wanted to deep-scrub the whole cluster, and sometimes 90% of my OSDs
 would
  be doing a deep-scrub.  I'm manually deep-scrubbing now, trying to spread
  out the schedule a bit.  Once this finishes in a few days, I should be
 able
  to re-enable deep-scrubbing and keep my HEALTH_OK.
 
 

 Would you mind to check suggestions by following mine hints or hints
 from mentioned URLs from there
 http://marc.info/?l=linux-mmm=141607712831090w=2 with 64k again? As
 for me, I am not observing lock loop after setting min_free_kbytes for
 a half of gigabyte per OSD. Even if your locks has a different nature,
 it may be worthy to try anyway.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd crashed while there was no space

2014-11-18 Thread Craig Lewis
You shouldn't let the cluster get so full that losing a few OSDs will make
you go toofull.  Letting the cluster get to 100% full is such a bad idea
that you should make sure it doesn't happen.


Ceph is supposed to stop moving data to an OSD once that OSD hits
osd_backfill_full_ratio, which defaults to 0.85.  Any disk at 86% full will
stop backfilling.

I have verified this works when the disks fill up while the cluster is
healthy, but I haven't failed a disk once I'm in the toofull state.  Even
so, mon_osd_full_ratio (default 0.95) or osd_failsafe_full_ratio (default
0.97) should stop all IO until a human gets involved.

The only gotcha I can find is that the values are percentages, and the test
is a greater-than comparison done with two significant digits.  i.e., if
osd_backfill_full_ratio is 0.85, it will continue backfilling until the
disk is 86% full.  So values like 0.99 and 1.00 will cause problems.
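
To double-check the ratios actually in effect on a given OSD (osd.0 here is
just an example), the admin socket will show them:

    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep full_ratio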


On Mon, Nov 17, 2014 at 6:50 PM, han vincent hang...@gmail.com wrote:

 hi, craig:

 Your solution did work very well. But if the data is very
 important, a small mistake while removing PG directories from OSDs will
 result in loss of data. And if the cluster is very large, don't you think
 deleting data on the disks to get from 100% down to 95% is a tedious and
 error-prone task, with so many OSDs, large disks, and so on?

  so my key question is: if there is no space left in the cluster while
 some OSDs have crashed, why does the cluster choose to migrate at all? And during
 the migration, other
 OSDs will crash one by one until the cluster can no longer work.

 2014-11-18 5:28 GMT+08:00 Craig Lewis cle...@centraldesktop.com:
  At this point, it's probably best to delete the pool.  I'm assuming the
 pool
  only contains benchmark data, and nothing important.
 
  Assuming you can delete the pool:
  First, figure out the ID of the data pool.  You can get that from ceph
 osd
  dump | grep '^pool'
 
  Once you have the number, delete the data pool: rados rmpool data data
  --yes-i-really-really-mean-it
 
  That will only free up space on OSDs that are up.  You'll need to manually
  delete some PGs on the OSDs that are 100% full.  Go to
  /var/lib/ceph/osd/ceph-OSDID/current, and delete a few directories that
  start with your data pool ID.  You don't need to delete all of them.
 Once
  the disk is below 95% full, you should be able to start that OSD.  Once
 it's
  up, it will finish deleting the pool.
 
  If you can't delete the pool, it is possible, but it's more work, and you
  still run the risk of losing data if you make a mistake.  You need to
  disable backfilling, then delete some PGs on each OSD that's full. Try to
  only delete one copy of each PG.  If you delete every copy of a PG on all
  OSDs, then you lost the data that was in that PG.  As before, once you
  delete enough that the disk is less than 95% full, you can start the OSD.
  Once you start it, start deleting your benchmark data out of the data
 pool.
  Once that's done, you can re-enable backfilling.  You may need to scrub
 or
  deep-scrub the OSDs you deleted data from to get everything back to
 normal.
 
 
  So how did you get the disks 100% full anyway?  Ceph normally won't let
 you
  do that.  Did you increase mon_osd_full_ratio, osd_backfill_full_ratio,
 or
  osd_failsafe_full_ratio?
 
 
  On Mon, Nov 17, 2014 at 7:00 AM, han vincent hang...@gmail.com wrote:
 
  hello, every one:
 
  These days a problem of ceph has troubled me for a long time.
 
  I build a cluster with 3 hosts and each host has three osds in it.
  And after that
  I used the command rados bench 360 -p data -b 4194304 -t 300 write
  --no-cleanup
  to test the write performance of the cluster.
 
  When the cluster is near full, there couldn't write any data to
  it. Unfortunately,
  there was a host hung up, then a lots of PG was going to migrate to
 other
  OSDs.
  After a while, a lots of OSD was marked down and out, my cluster
 couldn't
  work
  any more.
 
  The following is the output of ceph -s:
 
  cluster 002c3742-ab04-470f-8a7a-ad0658b547d6
  health HEALTH_ERR 103 pgs degraded; 993 pgs down; 617 pgs
  incomplete; 1008 pgs peering; 12 pgs recovering; 534 pgs stale; 1625
  pgs stuck inactive; 534 pgs stuck stale; 1728 pgs stuck unclean;
  recovery 945/29649 objects degraded (3.187%); 1 full osd(s); 1 mons
  down, quorum 0,2 2,1
   monmap e1: 3 mons at
  {0=10.0.0.97:6789/0,1=10.0.0.98:6789/0,2=10.0.0.70:6789/0}, election
  epoch 40, quorum 0,2 2,1
   osdmap e173: 9 osds: 2 up, 2 in
  flags full
pgmap v1779: 1728 pgs, 3 pools, 39528 MB data, 9883 objects
  37541 MB used, 3398 MB / 40940 MB avail
  945/29649 objects degraded (3.187%)
34 stale+active+degraded+remapped
   176 stale+incomplete
   320 stale+down+peering
53 active+degraded+remapped
   408 incomplete
 1 active+recovering+degraded
   673 down

Re: [ceph-users] OSDs down

2014-11-17 Thread Craig Lewis
Firstly, any chance of getting node4 and node5 back up?  You can move the
disks (monitor and osd) to a new chassis, and bring it back up.  As long as
it has the same IP as the original node4 and node5, the monitor should join.

How much is the clock skewed on node2?  I haven't had problems with small
skew (~100 ms), but I've seen posts to the mailing list about large skews
(minutes) causing quorum and authentication problems.
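
A quick way to see how bad the skew is (the monitor names will be whatever
yours are called):

    ceph health detail | grep 'clock skew'
    ntpq -p     # on the skewed monitor host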

When you say "Nevertheless on node3 every ceph * commands stay freezed", do
you by chance mean node2 instead of node3?  If so, that supports the clock
skew being a problem, preventing the commands and the OSDs
from authenticating with the monitors.

If you really did mean node3, then something strange else going on.



On Mon, Nov 17, 2014 at 7:07 AM, NEVEU Stephane 
stephane.ne...@thalesgroup.com wrote:

 Hi all J ,



 I need some help, I’m in a sad situation : i’ve lost 2 ceph server nodes
 physically (5 nodes initialy/ 5 monitors). So 3 nodes left : node1, node2,
 node3

 On my first node leaving, I’ve updated the crush map to remove every osds
 running on those 2 lost servers :

 Ceph osd crush remove osds && ceph auth del osds && ceph osd rm osds &&
 ceph osd remove my2Lostnodes

 So the crush map seems to be ok now on node1.

 Ceph osd tree on node 1 returns that every osds running on node2 are “down
 1” and “up 1” on node 3 and “up 1” on node1. Nevertheless on node3 every
 ceph * commands stay freezed, so I’m not sure the crush map has been
 updated on node2 and node3. I don’t know how to set ods on node 2 up again.

 My node2 says it cannot connect to the cluster !



 Ceph –s on node 1 gives me (so still 5 monitors):



 cluster 45d9195b-365e-491a-8853-34b46553db94
  health HEALTH_WARN 10016 pgs degraded; 10016 pgs stuck unclean;
 recovery 181055/544038 objects degraded (33.280%); 11/33 in osds are down;
 noout flag(s) set; 2 mons down, quorum 0,1,2 node1,node2,node3; clock skew
 detected on mon.node2
  monmap e1: 5 mons at {node1=
 172.23.6.11:6789/0,node2=172.23.6.12:6789/0,node3=172.23.6.13:6789/0,node4=172.23.6.14:6789/0,node5=172.23.6.15:6789/0},
 election epoch 488, quorum 0,1,2 node1,node2,node3
  mdsmap e48: 1/1/1 up {0=node3=up:active}
  osdmap e3852: 33 osds: 22 up, 33 in
 flags noout
   pgmap v8189785: 10016 pgs, 9 pools, 705 GB data, 177 kobjects
 2122 GB used, 90051 GB / 92174 GB avail
 181055/544038 objects degraded (33.280%)
10016 active+degraded
   client io 0 B/s rd, 233 kB/s wr, 22 op/s





 Thx for your help !!



 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd crashed while there was no space

2014-11-17 Thread Craig Lewis
At this point, it's probably best to delete the pool.  I'm assuming the
pool only contains benchmark data, and nothing important.

Assuming you can delete the pool:
First, figure out the ID of the data pool.  You can get that from ceph osd
dump | grep '^pool'

Once you have the number, delete the data pool: rados rmpool data
data --yes-i-really-really-mean-it

That will only free up space on OSDs that are up.  You'll need to manually
delete some PGs on the OSDs that are 100% full.  Go
to /var/lib/ceph/osd/ceph-OSDID/current, and delete a few directories
that start with your data pool ID.  You don't need to delete all of them.
Once the disk is below 95% full, you should be able to start that OSD.
Once it's up, it will finish deleting the pool.
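
As a rough sketch, assuming the data pool turned out to have ID 0 and the
full OSD is osd.3 (both placeholders, and the PG directory names are made
up too):

    ceph osd dump | grep '^pool'              # note the data pool's ID
    rados rmpool data data --yes-i-really-really-mean-it
    cd /var/lib/ceph/osd/ceph-3/current
    du -sh 0.*_head | sort -h | tail          # biggest PG dirs for pool 0
    rm -rf 0.1a_head 0.3f_head                # remove a few, re-check df
    df -h .
    start ceph-osd id=3                       # once usage is below 95%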

If you can't delete the pool, it is possible, but it's more work, and you
still run the risk of losing data if you make a mistake.  You need to
disable backfilling, then delete some PGs on each OSD that's full. Try to
only delete one copy of each PG.  If you delete every copy of a PG on all
OSDs, then you lost the data that was in that PG.  As before, once you
delete enough that the disk is less than 95% full, you can start the OSD.
Once you start it, start deleting your benchmark data out of the data
pool.  Once that's done, you can re-enable backfilling.  You may need to
scrub or deep-scrub the OSDs you deleted data from to get everything back
to normal.


So how did you get the disks 100% full anyway?  Ceph normally won't let you
do that.  Did you increase mon_osd_full_ratio, osd_backfill_full_ratio, or
 osd_failsafe_full_ratio?


On Mon, Nov 17, 2014 at 7:00 AM, han vincent hang...@gmail.com wrote:

 hello, every one:

 These days a problem of ceph has troubled me for a long time.

 I built a cluster with 3 hosts and each host has three osds in it.
 After that
 I used the command rados bench 360 -p data -b 4194304 -t 300 write
 --no-cleanup
 to test the write performance of the cluster.

 When the cluster was near full, no more data could be written to
 it. Unfortunately,
 a host then hung up, and a lot of PGs started migrating to other
 OSDs.
 After a while, a lot of OSDs were marked down and out, and my cluster couldn't
 work
 any more.

 The following is the output of ceph -s:

 cluster 002c3742-ab04-470f-8a7a-ad0658b547d6
 health HEALTH_ERR 103 pgs degraded; 993 pgs down; 617 pgs
 incomplete; 1008 pgs peering; 12 pgs recovering; 534 pgs stale; 1625
 pgs stuck inactive; 534 pgs stuck stale; 1728 pgs stuck unclean;
 recovery 945/29649 objects degraded (3.187%); 1 full osd(s); 1 mons
 down, quorum 0,2 2,1
  monmap e1: 3 mons at
 {0=10.0.0.97:6789/0,1=10.0.0.98:6789/0,2=10.0.0.70:6789/0}, election
 epoch 40, quorum 0,2 2,1
  osdmap e173: 9 osds: 2 up, 2 in
 flags full
   pgmap v1779: 1728 pgs, 3 pools, 39528 MB data, 9883 objects
 37541 MB used, 3398 MB / 40940 MB avail
 945/29649 objects degraded (3.187%)
   34 stale+active+degraded+remapped
  176 stale+incomplete
  320 stale+down+peering
   53 active+degraded+remapped
  408 incomplete
1 active+recovering+degraded
  673 down+peering
1 stale+active+degraded
   15 remapped+peering
3 stale+active+recovering+degraded+remapped
3 active+degraded
   33 remapped+incomplete
8 active+recovering+degraded+remapped

 The following is the output of ceph osd tree:
 # id    weight  type name   up/down reweight
 -1  9   root default
 -3  9   rack unknownrack
 -2  3   host 10.0.0.97
  0   1   osd.0   down0
  1   1   osd.1   down0
  2   1   osd.2   down0
  -4  3   host 10.0.0.98
  3   1   osd.3   down0
  4   1   osd.4   down0
  5   1   osd.5   down0
  -5  3   host 10.0.0.70
  6   1   osd.6   up  1
  7   1   osd.7   up  1
  8   1   osd.8   down0

 The following is part of output os osd.0.log

 -3 2014-11-14 17:33:02.166022 7fd9dd1ab700  0
 filestore(/data/osd/osd.0)  error (28) No space left on device not
 handled on operation 10 (15804.0.13, or op 13, counting from 0)
 -2 2014-11-14 17:33:02.216768 7fd9dd1ab700  0
 filestore(/data/osd/osd.0) ENOSPC handling not implemented
 -1 2014-11-14 17:33:02.216783 7fd9dd1ab700  0
 filestore(/data/osd/osd.0)  transaction dump:
 ...
 ...
 0 2014-11-14 17:33:02.541008 7fd9dd1ab700 -1 

Re: [ceph-users] jbod + SMART : how to identify failing disks ?

2014-11-17 Thread Craig Lewis
I use `dd` to force activity to the disk I want to replace, and watch the
activity lights.  That only works if your disks aren't 100% busy.  If they
are, stop the ceph-osd daemon, and see which drive stops having activity.
Repeat until you're 100% confident that you're pulling the right drive.
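
The dd trick is nothing fancier than the following, with /dev/sdX standing
in for the suspect drive (Ctrl-C once you have spotted the light):

    dd if=/dev/sdX of=/dev/null bs=1M iflag=direct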

On Wed, Nov 12, 2014 at 5:05 AM, SCHAER Frederic frederic.sch...@cea.fr
wrote:

  Hi,



 I’m used to RAID software giving me the failing disks  slots, and most
 often blinking the disks on the disk bays.

 I recently installed a  DELL “6GB HBA SAS” JBOD card, said to be an LSI
 2008 one, and I now have to identify 3 pre-failed disks (so says S.M.A.R.T)
 .



 Since this is an LSI, I thought I’d use MegaCli to identify the disks
 slot, but MegaCli does not see the HBA card.

 Then I found the LSI “sas2ircu” utility, but again, this one fails at
 giving me the disk slots (it finds the disks, serials and others, but slot
 is always 0)

 Because of this, I’m going to head over to the disk bay and unplug the
 disk which I think corresponds to the alphabetical order in linux, and see
 if it’s the correct one…. But even if this is correct this time, it might
 not be next time.



 But this makes me wonder : how do you guys, Ceph users, manage your disks
 if you really have JBOD servers ?

 I can’t imagine having to guess slots that each time, and I can’t imagine
 neither creating serial number stickers for every single disk I could have
 to manage …

 Is there any specific advice reguarding JBOD cards people should (not) use
 in their systems ?

 Any magical way to “blink” a drive in linux ?



 Thanks  regards

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD commits suicide

2014-11-17 Thread Craig Lewis
I did have a problem in my secondary cluster that sounds similar to yours.
I was using XFS, and traced my problem back to 64 kB inodes (osd mkfs
options xfs = -i size=64k).   This showed up with a lot of "XFS: possible
memory allocation deadlock in kmem_alloc" in the kernel logs.  I was able
to keep things limping along by flushing the cache frequently, but I
eventually re-formatted every OSD to get rid of the 64k inodes.

After I finished the reformat, I had problems because of deep-scrubbing.
While reformatting, I disabled deep-scrubbing.  Once I re-enabled it, Ceph
wanted to deep-scrub the whole cluster, and sometimes 90% of my OSDs would
be doing a deep-scrub.  I'm manually deep-scrubbing now, trying to spread
out the schedule a bit.  Once this finishes in a few days, I should be able
to re-enable deep-scrubbing and keep my HEALTH_OK.


My primary cluster has always been well behaved.  It completed the
re-format without having any problems.  The clusters are nearly identical,
the biggest difference being that the secondary had a higher sustained load
due to a replication backlog.




On Sat, Nov 15, 2014 at 12:38 PM, Erik Logtenberg e...@logtenberg.eu
wrote:

 Hi,

 Thanks for the tip, I applied these configuration settings and it does
 lower the load during rebuilding a bit. Are there settings like these
 that also tune Ceph down a bit during regular operations? The slow
 requests, timeouts and OSD suicides are killing me.

 If I allow the cluster to regain consciousness and stay idle a bit, it
 all seems to settle down nicely, but as soon as I apply some load it
 immediately starts to overstress and complain like crazy.

 I'm also seeing this behaviour: http://tracker.ceph.com/issues/9844
 This was reported by Dmitry Smirnov 26 days ago, but the report has no
 response yet. Any ideas?

 In my experience, OSD's are quite unstable in Giant and very easily
 stressed, causing chain effects, further worsening the issues. It would
 be nice to know if this is also noticed by other users?

 Thanks,

 Erik.


 On 11/10/2014 08:40 PM, Craig Lewis wrote:
  Have you tuned any of the recovery or backfill parameters?  My ceph.conf
  has:
  [osd]
osd max backfills = 1
osd recovery max active = 1
osd recovery op priority = 1
 
  Still, if it's running for a few hours, then failing, it sounds like
  there might be something else at play.  OSDs use a lot of RAM during
  recovery.  How much RAM and how many OSDs do you have in these nodes?
  What does memory usage look like after a fresh restart, and what does it
  look like when the problems start?  Even better if you know what it
  looks like 5 minutes before the problems start.
 
  Is there anything interesting in the kernel logs?  OOM killers, or
  memory deadlocks?
 
 
 
  On Sat, Nov 8, 2014 at 11:19 AM, Erik Logtenberg e...@logtenberg.eu
  mailto:e...@logtenberg.eu wrote:
 
  Hi,
 
  I have some OSD's that keep committing suicide. My cluster has ~1.3M
  misplaced objects, and it can't really recover, because OSD's keep
  failing before recovering finishes. The load on the hosts is quite
 high,
  but the cluster currently has no other tasks than just the
  backfilling/recovering.
 
  I attached the logfile from a failed OSD. It shows the suicide, the
  recent events and also me starting the OSD again after some time.
 
  It'll keep running for a couple of hours and then fail again, for the
  same reason.
 
  I noticed a lot of timeouts. Apparently ceph stresses the hosts to
 the
  limit with the recovery tasks, so much that they timeout and can't
  finish that task. I don't understand why. Can I somehow throttle
 ceph a
  bit so that it doesn't keep overrunning itself? I kinda feel like it
  should chill out a bit and simply recover one step at a time instead
 of
  full force and then fail.
 
  Thanks,
 
  Erik.
 
  ___
  ceph-users mailing list
  ceph-users@lists.ceph.com mailto:ceph-users@lists.ceph.com
  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deep scrub parameter tuning

2014-11-17 Thread Craig Lewis
The minimum value for osd_deep_scrub_interval  is osd_scrub_min_interval,
and it wouldn't be advisable to go that low.

I can't find the documentation, but basically Ceph will attempt a scrub
sometime between osd_scrub_min_interval and osd_scrub_max_interval.  If the
PG hasn't been deep-scrubbed in the last osd_deep_scrub_interval seconds,
it does a deep-scrub instead.

So if you set  osd_deep_scrub_interval to osd_scrub_min_interval, you'll
never scrub your PGs, you'll only deep-scrub.

Obviously, you can lower the two scrub intervals too.  As Loïc says, test
it well.  I find when I'm playing with these values, I use injectargs to
find a good value, then persist that value in the ceph.conf.
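
For example (values in seconds, purely illustrative: scrub at most once a
day, deep-scrub every 3 days instead of 7):

    ceph tell osd.* injectargs '--osd_scrub_min_interval 86400'
    ceph tell osd.* injectargs '--osd_deep_scrub_interval 259200'
    # if disk IO stays acceptable, persist the same values under [osd] in ceph.conf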


On Fri, Nov 14, 2014 at 3:16 AM, Loic Dachary l...@dachary.org wrote:

 Hi,

 On 14/11/2014 12:11, Mallikarjun Biradar wrote:
  Hi,
 
  Default deep scrub interval is once per week, which we can set using
 osd_deep_scrub_interval parameter.
 
  Whether can we reduce it to less than a week or minimum interval is one
 week?

 You can reduce it to a shorter period. It is worth testing the impact on
 disk IO before going to production with shorter intervals though.

 Cheers

 
  -Thanks  regards,
  Mallikarjun Biradar
 
 
  ___
  ceph-users mailing list
  ceph-users@lists.ceph.com
  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 

 --
 Loïc Dachary, Artisan Logiciel Libre


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Negative number of objects degraded for extended period of time

2014-11-17 Thread Craig Lewis
Well, after 4 days, this is probably moot.  Hopefully it's finished
backfilling, and your problem is gone.

If not, I believe that if you fix those backfill_toofull, the negative
numbers will start approaching zero.  I seem to recall that negative
degraded is a special case of degraded, but I don't remember exactly, and
can't find any references.  I have seen it before, and it went away when my
cluster became healthy.

As long as you still have OSDs completing their backfilling, I'd let it
run.

If you get to the point that all of the backfills are done, and you're left
with only wait_backfill+backfill_toofull, then you can bump
osd_backfill_full_ratio, mon_osd_nearfull_ratio, and maybe
osd_failsafe_nearfull_ratio.
 If you do, be careful, and only bump them just enough to let them start
backfilling.  If you set them to 0.99, bad things will happen.
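
For example, nudging only the backfill ratio at runtime (0.87 is
illustrative; the idea is a hair above the fullest OSD, nowhere near 0.99):

    ceph tell osd.* injectargs '--osd_backfill_full_ratio 0.87'
    # once the backfill_toofull PGs have drained, drop it back:
    ceph tell osd.* injectargs '--osd_backfill_full_ratio 0.85'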




On Thu, Nov 13, 2014 at 7:57 AM, Fred Yang frederic.y...@gmail.com wrote:

 Hi,

 The Ceph cluster we are running have few OSDs approaching to 95% 1+ weeks
 ago so I ran a reweight to balance it out, in the meantime, instructing
 application to purge data not required. But after large amount of data
 purge issued from application side(all OSDs' usage dropped below 20%), the
 cluster fall into this weird state for days, the objects degraded remain
 negative for more than 7 days, I'm seeing some IOs going on on OSDs
 consistently, but the number(negative) objects degraded does not change
 much:

 2014-11-13 10:43:07.237292 mon.0 [INF] pgmap v5935301: 44816 pgs: 44713
 active+clean, 1 active+backfilling, 20 active+remapped+wait_backfill, 27
 active+remapped+wait_backfill+backfill_toofull, 11 active+recovery_wait, 33
 active+remapped+backfilling, 11 active+wait_backfill+backfill_toofull; 1473
 GB data, 2985 GB used, 17123 GB / 20109 GB avail; 30172 kB/s wr, 58 op/s;
 -13582/1468299 objects degraded (-0.925%)
 2014-11-13 10:43:08.248232 mon.0 [INF] pgmap v5935302: 44816 pgs: 44713
 active+clean, 1 active+backfilling, 20 active+remapped+wait_backfill, 27
 active+remapped+wait_backfill+backfill_toofull, 11 active+recovery_wait, 33
 active+remapped+backfilling, 11 active+wait_backfill+backfill_toofull; 1473
 GB data, 2985 GB used, 17123 GB / 20109 GB avail; 26459 kB/s wr, 51 op/s;
 -13582/1468303 objects degraded (-0.925%)

 Any idea what might be happening here? It
 seems active+remapped+wait_backfill+backfill_toofull stuck?

  osdmap e43029: 36 osds: 36 up, 36 in
   pgmap v5935658: 44816 pgs, 32 pools, 1488 GB data, 714 kobjects
 3017 GB used, 17092 GB / 20109 GB avail
 -13438/1475773 objects degraded (-0.911%)
44713 active+clean
1 active+backfilling
   20 active+remapped+wait_backfill
   27 active+remapped+wait_backfill+backfill_toofull
   11 active+recovery_wait
   33 active+remapped+backfilling
   11 active+wait_backfill+backfill_toofull
   client io 478 B/s rd, 40170 kB/s wr, 80 op/s

 The cluster is running on v0.72.2, we are planning to upgrade cluster to
 firefly, but I would like to get the cluster state clean first before the
 upgrade.

 Thanks,
 Fred

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Federated gateways

2014-11-14 Thread Craig Lewis
I have identical regionmaps in both clusters.

I only created the zone's pools in that cluster.  I didn't delete the
default .rgw.* pools, so those exist in both zones.

Both users need to be system on both ends, and have identical access and
secrets.  If they're not, this is likely your problem.
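For what it's worth, I create that user the same way on both sides,
something like this (the uid, keys, and --name instance are placeholders;
the important part is that the access key and secret are identical in both
zones):

  radosgw-admin user create --uid=replication --display-name=Replication \
    --access-key=REPLACCESSKEY --secret=REPLSECRETKEY \
    --system --name client.radosgw.<zone-instance>

Run the same command, with the same keys, against each zone's gateway
instance.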


On Fri, Nov 14, 2014 at 11:38 AM, Aaron Bassett aa...@five3genomics.com
wrote:

 Well I upgraded both clusters to giant this morning just to see if that
 would help, and it didn’t. I have a couple questions though. I have the
 same regionmap on both clusters, with both zones in it, but then i only
 have the buckets and zone info for one zone in each cluster, is this right?
 Or do I need all the buckets and zones in both clusters? Reading the docs
 it doesn’t seem like I do, because I’m expecting data to sync from one zone
 in one cluster to the other zone on the other cluster, but I don’t know
 what to think anymore.

 Also do both users need to be system users on both ends?

 Aaron



 On Nov 12, 2014, at 4:00 PM, Craig Lewis cle...@centraldesktop.com
 wrote:

 http://tracker.ceph.com/issues/9206

 My post to the ML: http://www.spinics.net/lists/ceph-users/msg12665.html


 IIRC, the system users didn't see the other user's bucket in a bucket
 listing, but they could read and write the objects fine.



 On Wed, Nov 12, 2014 at 11:16 AM, Aaron Bassett aa...@five3genomics.com
 wrote:

 In playing around with this a bit more, I noticed that the two users on
 the secondary node can't see each other's buckets. Is this a problem?


 IIRC, the system users couldn't see each other's buckets, but they could
 read and write the objects.

 On Nov 11, 2014, at 6:56 PM, Craig Lewis cle...@centraldesktop.com
 wrote:

 I see you're running 0.80.5.  Are you using Apache 2.4?  There is a known
 issue with Apache 2.4 on the primary and replication.  It's fixed, just
 waiting for the next firefly release.  Although, that causes 40x errors
 with Apache 2.4, not 500 errors.

 It is apache 2.4, but I’m actually running 0.80.7 so I probably have
 that bug fix?


 No, the unreleased 0.80.8 has the fix.




 Have you verified that both system users can read and write to both
 clusters?  (Just make sure you clean up the writes to the slave cluster).

 Yes I can write everywhere and radosgw-agent isn’t getting any 403s like
 it was earlier when I had mismatched keys. The .us-nh.rgw.buckets.index
 pool is syncing properly, as are the users. It seems like really the only
 thing that isn’t syncing is the .zone.rgw.buckets pool.


 That's pretty much the same behavior I was seeing with Apache 2.4.

 Try downgrading the primary cluster to Apache 2.2.  In my testing, the
 secondary cluster could run 2.2 or 2.4.

 Do you have a link to that bug#? I want to see if it gives me any clues.

 Aaron




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Federated gateways

2014-11-12 Thread Craig Lewis
http://tracker.ceph.com/issues/9206

My post to the ML: http://www.spinics.net/lists/ceph-users/msg12665.html


IIRC, the system users didn't see the other user's bucket in a bucket
listing, but they could read and write the objects fine.



On Wed, Nov 12, 2014 at 11:16 AM, Aaron Bassett aa...@five3genomics.com
wrote:

 In playing around with this a bit more, I noticed that the two users on
 the secondary node can't see each other's buckets. Is this a problem?


IIRC, the system users couldn't see each other's buckets, but they could
read and write the objects.

 On Nov 11, 2014, at 6:56 PM, Craig Lewis cle...@centraldesktop.com
 wrote:

 I see you're running 0.80.5.  Are you using Apache 2.4?  There is a known
 issue with Apache 2.4 on the primary and replication.  It's fixed, just
 waiting for the next firefly release.  Although, that causes 40x errors
 with Apache 2.4, not 500 errors.

 It is apache 2.4, but I’m actually running 0.80.7 so I probably have that
 bug fix?


 No, the unreleased 0.80.8 has the fix.




 Have you verified that both system users can read and write to both
 clusters?  (Just make sure you clean up the writes to the slave cluster).

 Yes I can write everywhere and radosgw-agent isn’t getting any 403s like
 it was earlier when I had mismatched keys. The .us-nh.rgw.buckets.index
 pool is syncing properly, as are the users. It seems like really the only
 thing that isn’t syncing is the .zone.rgw.buckets pool.


 That's pretty much the same behavior I was seeing with Apache 2.4.

 Try downgrading the primary cluster to Apache 2.2.  In my testing, the
 secondary cluster could run 2.2 or 2.4.

 Do you have a link to that bug#? I want to see if it gives me any clues.

 Aaron


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Federated gateways

2014-11-11 Thread Craig Lewis
=0x7f53f00053f0).reader got front 190
 2014-11-11 14:37:06.701449 7f51ff0f0700 10 -- 172.16.10.103:0/1007381 
 172.16.10.103:6934/14875 pipe(0x7f53f0005160 sd=61 :33168 s=2 pgs=2524
 cs=1 l=1 c=0x7f53f00053f0).aborted = 0
 2014-11-11 14:37:06.701458 7f51ff0f0700 20 -- 172.16.10.103:0/1007381 
 172.16.10.103:6934/14875 pipe(0x7f53f0005160 sd=61 :33168 s=2 pgs=2524
 cs=1 l=1 c=0x7f53f00053f0).reader got 190 + 0 + 0 byte message
 2014-11-11 14:37:06.701569 7f51ff0f0700 10 -- 172.16.10.103:0/1007381 
 172.16.10.103:6934/14875 pipe(0x7f53f0005160 sd=61 :33168 s=2 pgs=2524
 cs=1 l=1 c=0x7f53f00053f0).reader got message 49 0x7f51b4001460
 osd_op_reply(1784 statelog.obj_opstate.97 [call] v47531'14 uv14 ondisk = 0)
 v6
 2014-11-11 14:37:06.701597 7f51ff0f0700 20 -- 172.16.10.103:0/1007381
 queue 0x7f51b4001460 prio 127
 2014-11-11 14:37:06.701627 7f51ff0f0700 20 -- 172.16.10.103:0/1007381 
 172.16.10.103:6934/14875 pipe(0x7f53f0005160 sd=61 :33168 s=2 pgs=2524
 cs=1 l=1 c=0x7f53f00053f0).reader reading tag...
 2014-11-11 14:37:06.701636 7f51ff1f1700 10 -- 172.16.10.103:0/1007381 
 172.16.10.103:6934/14875 pipe(0x7f53f0005160 sd=61 :33168 s=2 pgs=2524
 cs=1 l=1 c=0x7f53f00053f0).writer: state = open policy.server=0
 2014-11-11 14:37:06.701678 7f51ff1f1700 10 -- 172.16.10.103:0/1007381 
 172.16.10.103:6934/14875 pipe(0x7f53f0005160 sd=61 :33168 s=2 pgs=2524
 cs=1 l=1 c=0x7f53f00053f0).write_ack 49
 2014-11-11 14:37:06.701684 7f54ebfff700  1 -- 172.16.10.103:0/1007381 ==
 osd.25 172.16.10.103:6934/14875 49  osd_op_reply(1784
 statelog.obj_opstate.97 [call] v47531'14 uv14 ondisk = 0) v6  190+0+0
 (1714651716 0 0) 0x7f51b4001460 con 0x7f53f00053f0
 2014-11-11 14:37:06.701710 7f51ff1f1700 10 -- 172.16.10.103:0/1007381 
 172.16.10.103:6934/14875 pipe(0x7f53f0005160 sd=61 :33168 s=2 pgs=2524
 cs=1 l=1 c=0x7f53f00053f0).writer: state = open policy.server=0
 2014-11-11 14:37:06.701728 7f51ff1f1700 20 -- 172.16.10.103:0/1007381 
 172.16.10.103:6934/14875 pipe(0x7f53f0005160 sd=61 :33168 s=2 pgs=2524
 cs=1 l=1 c=0x7f53f00053f0).writer sleeping
 2014-11-11 14:37:06.701751 7f54ebfff700 10 -- 172.16.10.103:0/1007381
 dispatch_throttle_release 190 to dispatch throttler 190/104857600
 2014-11-11 14:37:06.701762 7f54ebfff700 20 -- 172.16.10.103:0/1007381
 done calling dispatch on 0x7f51b4001460
 2014-11-11 14:37:06.701815 7f54447f0700  0 WARNING: set_req_state_err
 err_no=5 resorting to 500
 2014-11-11 14:37:06.701894 7f54447f0700  1 == req done
 req=0x7f546800f3b0 http_status=500 ==


 Any information you could give me would be wonderful as I’ve been banging
 my head against this for a few days.

 Thanks, Aaron

 On Nov 5, 2014, at 3:02 PM, Aaron Bassett aa...@five3genomics.com wrote:

 Ah so I need both users in both clusters? I think I missed that bit, let
 me see if that does the trick.

 Aaron

 On Nov 5, 2014, at 2:59 PM, Craig Lewis cle...@centraldesktop.com wrote:

 One region two zones is the standard setup, so that should be fine.

 Is metadata (users and buckets) being replicated, but not data (objects)?


 Let's go through a quick checklist:

- Verify that you enabled log_meta and log_data in the region.json for
the master zone
- Verify that RadosGW is using your region map with radosgw-admin
regionmap get --name client.radosgw.name
- Verify that RadosGW is using your zone map with radosgw-admin zone
get --name client.radosgw.name
- Verify that all the pools in your zone exist (RadosGW only
auto-creates the basic ones).
- Verify that your system users exist in both zones with the same
access and secret.

 Hopefully that gives you an idea what's not working correctly.

 If it doesn't, crank up the logging on the radosgw daemon on both sides,
 and check the logs.  Add debug rgw = 20 to both ceph.conf (in the
 client.radosgw.name section), and restart.  Hopefully those logs will
 tell you what's wrong.
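In other words, something like this in each ceph.conf (use whatever your
radosgw section is actually called; debug ms = 1 is optional, but I find it
useful):

  [client.radosgw.<instance>]
    debug rgw = 20
    debug ms = 1

Then restart the radosgw daemons and watch the logs while the agent runs.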


 On Wed, Nov 5, 2014 at 11:39 AM, Aaron Bassett aa...@five3genomics.com
 wrote:

 Hello everyone,
 I am attempted to setup a two cluster situation for object storage
 disaster recovery. I have two physically separate sites so using 1 big
 cluster isn’t an option. I’m attempting to follow the guide at:
 http://ceph.com/docs/v0.80.5/radosgw/federated-config/ . After a couple
 days of flailing, I’ve settled on using 1 region with two zones, where each
 cluster is a zone. I’m now attempting to set up an agent as per the
 “Multi-Site Data Replication” section. The agent kicks off ok and starts
 making all sorts of connections, but no objects were being copied to the
 non-master zone. I re-ran the agent with the -v flag and saw a lot of:

 DEBUG:urllib3.connectionpool:GET
 /admin/opstate?client-id=radosgw-agentobject=test%2F_shadow_.JjVixjWmebQTrRed36FL6D0vy2gDVZ__39op-id=phx-r1-head1%3A2451615%3A1
 HTTP/1.1 200 None

 DEBUG:radosgw_agent.worker:op state is []

 DEBUG:radosgw_agent.worker:error geting op state: list index out of range


 So it appears something is still

Re: [ceph-users] pg's stuck for 4-5 days after reaching backfill_toofull

2014-11-11 Thread Craig Lewis
How many OSDs are nearfull?

I've seen Ceph want two toofull OSDs to swap PGs.  In that case, I
dynamically raised mon_osd_nearfull_ratio and osd_backfill_full_ratio a
bit, then put it back to normal once the scheduling deadlock finished.

Keep in mind that ceph osd reweight is temporary.  If you mark an osd OUT
then IN, the weight will be set to 1.0.  If you need something that's
persistent, you can use ceph osd crush reweight osd.NUM crush_weight.
Look at ceph osd tree to get the current weight.

I also recommend stepping towards your goal.  Changing either weight can
cause a lot of unrelated migrations, and the crush weight seems to cause
more than the osd weight.  I step osd weight by 0.125, and crush weight by
0.05.
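In command form that looks roughly like this (osd.12 and the weights are
just an example; check ceph osd tree first for the real starting values):

  ceph osd tree
  ceph osd reweight 12 0.875            # osd reweight, stepped down by 0.125
  ceph osd crush reweight osd.12 3.59   # crush weight, stepped down by 0.05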


On Tue, Nov 11, 2014 at 12:47 PM, Chad Seys cws...@physics.wisc.edu wrote:

 Find out which OSD it is:

 ceph health detail

 Squeeze blocks off the affected OSD:

 ceph osd reweight OSDNUM 0.8

 Repeat with any OSD which becomes toofull.

 Your cluster is only about 50% used, so I think this will be enough.

 Then when it finishes, allow data back on OSD:

 ceph osd reweight OSDNUM 1

 Hopefully ceph will someday be taught to move PGs in a better order!
 Chad.
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Federated gateways

2014-11-11 Thread Craig Lewis

 I see you're running 0.80.5.  Are you using Apache 2.4?  There is a known
 issue with Apache 2.4 on the primary and replication.  It's fixed, just
 waiting for the next firefly release.  Although, that causes 40x errors
 with Apache 2.4, not 500 errors.

 It is apache 2.4, but I’m actually running 0.80.7 so I probably have that
 bug fix?


No, the unreleased 0.80.8 has the fix.




 Have you verified that both system users can read and write to both
 clusters?  (Just make sure you clean up the writes to the slave cluster).

 Yes I can write everywhere and radosgw-agent isn’t getting any 403s like
 it was earlier when I had mismatched keys. The .us-nh.rgw.buckets.index
 pool is syncing properly, as are the users. It seems like really the only
 thing that isn’t syncing is the .zone.rgw.buckets pool.


That's pretty much the same behavior I was seeing with Apache 2.4.

Try downgrading the primary cluster to Apache 2.2.  In my testing, the
secondary cluster could run 2.2 or 2.4.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] emperor - firefly 0.80.7 upgrade problem

2014-11-10 Thread Craig Lewis
If all of your PGs now have an empty down_osds_we_would_probe, I'd run
through this discussion again.  The commands to tell Ceph to give up on
lost data should have an effect now.

That's my experience anyway.  Nothing progressed until I took care of
down_osds_we_would_probe.
After that was empty, I was able to repair.  It wasn't immediate though.
It still took ~24 hours, and a few OSD restarts, for the cluster to get
itself healthy.  You might try sequentially restarting OSDs.  It shouldn't
be necessary, but it shouldn't make anything worse.
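For reference, the commands I mean are along these lines (pg 2.5 is just an
example id):

  ceph pg 2.5 query | grep -A3 down_osds_we_would_probe
  ceph pg 2.5 mark_unfound_lost revert   # or delete, to drop the objects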



On Mon, Nov 10, 2014 at 7:17 AM, Chad Seys cws...@physics.wisc.edu wrote:

 Hi Craig and list,

If you create a real osd.20, you might want to leave it OUT until you
get things healthy again.

 I created a real osd.20 (and it turns out I needed an osd.21 also).

 ceph pg x.xx query no longer lists down osds for probing:
 down_osds_we_would_probe: [],

 But I cannot find the magic command line which will remove these incomplete
 PGs.

 Anyone know how to remove incomplete PGs ?

 Thanks!
 Chad.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG inconsistency

2014-11-10 Thread Craig Lewis
For #1, it depends what you mean by fast.  I wouldn't worry about it taking
15 minutes.

If you mark the old OSD out, ceph will start remapping data immediately,
including a bunch of PGs on unrelated OSDs.  Once you replace the disk, and
put the same OSDID back in the same host, the CRUSH map will be back to
what it was before you started.  All of those remaps on unrelated OSDs will
reverse.  They'll complete fairly quickly, because they only have to
backfill the data that was written during the remap.


I prefer #1.  ceph pg repair will just overwrite the replicas with whatever
the primary OSD has, which may copy bad data from your bad OSD over good
replicas.  So #2 has the potential to corrupt the data.  #1 will delete the
data you know is bad, leaving only good data behind to replicate.  Once
ceph pg repair gets more intelligent, I'll revisit this.

I also prefer the simplicity.  If it's dead or corrupt, they're treated the
same.
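For completeness, the repair path in #2 is just:

  ceph health detail | grep inconsistent
  ceph pg repair <pgid>

which is exactly the overwrite-from-the-primary behavior I described, so I'd
hold off on it for a disk you already suspect is bad.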




On Sun, Nov 9, 2014 at 7:25 PM, GuangYang yguan...@outlook.com wrote:


 In terms of disk replacement, to avoid migrating data back and forth, are
 the below two approaches reasonable?
  1. Keep the OSD in and do an ad-hoc disk replacement and provision a new
 OSD (so that the OSD id stays the same), and then trigger data migration.
 In this way the data migration only happens once, however, it does require
 operators to replace the disk very fast.
  2. Move the data on the broken disk to a new disk completely and use Ceph
 to repair bad objects.

 Thanks,
 Guang

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD commits suicide

2014-11-10 Thread Craig Lewis
Have you tuned any of the recovery or backfill parameters?  My ceph.conf
has:
[osd]
  osd max backfills = 1
  osd recovery max active = 1
  osd recovery op priority = 1
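If you want to try those without restarting the OSDs, something like this
should apply them at runtime:

  ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1 --osd_recovery_op_priority 1'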

Still, if it's running for a few hours, then failing, it sounds like there
might be something else at play.  OSDs use a lot of RAM during recovery.
How much RAM and how many OSDs do you have in these nodes?  What does
memory usage look like after a fresh restart, and what does it look like
when the problems start?  Even better if you know what it looks like 5
minutes before the problems start.

Is there anything interesting in the kernel logs?  OOM killers, or memory
deadlocks?



On Sat, Nov 8, 2014 at 11:19 AM, Erik Logtenberg e...@logtenberg.eu wrote:

 Hi,

 I have some OSD's that keep committing suicide. My cluster has ~1.3M
 misplaced objects, and it can't really recover, because OSD's keep
 failing before recovering finishes. The load on the hosts is quite high,
 but the cluster currently has no other tasks than just the
 backfilling/recovering.

 I attached the logfile from a failed OSD. It shows the suicide, the
 recent events and also me starting the OSD again after some time.

 It'll keep running for a couple of hours and then fail again, for the
 same reason.

 I noticed a lot of timeouts. Apparently ceph stresses the hosts to the
 limit with the recovery tasks, so much that they timeout and can't
 finish that task. I don't understand why. Can I somehow throttle ceph a
 bit so that it doesn't keep overrunning itself? I kinda feel like it
 should chill out a bit and simply recover one step at a time instead of
 full force and then fail.

 Thanks,

 Erik.

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] An OSD always crash few minutes after start

2014-11-10 Thread Craig Lewis
You're running 0.87-6.  There were various fixes for this problem in
Firefly.  Were any of these snapshots created on early version of Firefly?

So far, every fix for this issue has gotten developers involved.  I'd see
if you can talk to some devs on IRC, or post to the ceph-devel mailing list.


My own experience is that I had to delete the affected PGs, and force
create them.  Hopefully there's a better answer now.



On Fri, Nov 7, 2014 at 8:10 PM, Chu Duc Minh chu.ducm...@gmail.com wrote:

 One of my OSDs has problems and can NOT be started. I tried to start it many
 times but it always crashes a few minutes after starting.
 I think about two reasons to make it crash:
 1. A read/write request to this OSD, but due to the corrupted
 volume/snapshot/parent-image/..., it crash.
 2. The recovering process can NOT work properly due to the corrupted
 volumes/snapshot/parent-image/...

 After many retries and checking the logs, I guess reason (2) is the main
 cause, because if (1) were the main cause, other OSDs (containing the buggy
 volume/snapshot) would crash too.

 State of my ceph cluster (just few seconds before crash time):

   111/57706299 objects degraded (0.001%)
 14918 active+clean
1 active+clean+scrubbing+deep
   52 active+recovery_wait+degraded
2 active+recovering+degraded


 PS: i attach crash-dump log of that OSD in this email for your information.

 Thank you!

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Stuck in stale state

2014-11-10 Thread Craig Lewis
A "fault with nothing to send, going to standby" message isn't necessarily bad; I see it from
time to time.  It shouldn't stay like that for long though.  If it's been 5
minutes, and the cluster still isn't doing anything, I'd restart that osd.
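Roughly: find the stuck PGs, see which OSD they map to, and bounce it (the
restart command depends on your distro / init setup):

  ceph pg dump_stuck stale
  ceph pg map <pgid>            # shows the up/acting OSDs for that pg
  service ceph restart osd.1    # or /etc/init.d/ceph restart osd.1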

On Fri, Nov 7, 2014 at 1:55 PM, Jan Pekař jan.pe...@imatic.cz wrote:

 Hi,

 I was testing ceph cluster map changes and I got into a stuck state which
 seems to be indefinite.
 First, a description of what I have done.

 I'm testing special case with only one copy of pg's (pool size = 1).

 All pg's was on one osd.0. I created second osd.1 and modified cluster map
 to transfer one pool (metadata) to the newly created osd.1
 PG's started to remap and objects degraded number was dropping - so
 everything looked normal.

 During that recovery process I restarted both osd daemons.
 After that I noticed that the pgs that should be remapped had a stale state
 - stale+active+remapped+backfilling - and other objects had stale states too.
 I tried to run ceph pg force_create_pg on one pg that should be remapped,
 but nothing changed (that is the 1 stuck / creating PG below in ceph health).

 Command rados -p metadata ls hangs so data are unavailable, but it should
 be there.

 What should I do in this state to get it working?

 ceph -s below:

 cluster 93418692-8e2e-4689-a237-ed5b47f39f72
  health HEALTH_WARN 52 pgs backfill; 1 pgs backfilling; 63 pgs stale;
 1 pgs stuck inactive; 63 pgs stuck stale; 54 pgs stuck unclean; recovery
 107232/1881806 objects degraded (5.698%); mon.imatic-mce low disk space
  monmap e1: 1 mons at {imatic-mce=192.168.11.165:6789/0}, election
 epoch 1, quorum 0 imatic-mce
  mdsmap e450: 1/1/1 up {0=imatic-mce=up:active}
  osdmap e275: 2 osds: 2 up, 2 in
   pgmap v51624: 448 pgs, 4 pools, 790 GB data, 1732 kobjects
 804 GB used, 2915 GB / 3720 GB avail
 107232/1881806 objects degraded (5.698%)
   52 stale+active+remapped+wait_backfill
1 creating
1 stale+active+remapped+backfilling
   10 stale+active+clean
  384 active+clean

 Last message in OSD log's:

 2014-11-07 22:17:45.402791 deb4db70  0 -- 192.168.11.165:6804/29564 
 192.168.11.165:6807/29939 pipe(0x9d52f00 sd=213 :53216 s=2 pgs=1 cs=1 l=0
 c=0x2c7f58c0).fault with nothing to send, going to standby

 Thank you for help
 With regards
 Jan Pekar, ceph fan
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd down

2014-11-10 Thread Craig Lewis
Yes, removing an OSD before re-creating it will give you the same OSD ID.
That's my preferred method, because it keeps the crushmap the same.  Only
PGs that existed on the replaced disk need to be backfilled.

I don't know if adding the replacement to the same host then removing the
old OSD gives you the same CRUSH map as the reverse.  I suspect not,
because the OSDs are re-ordered on that host.


On Mon, Nov 10, 2014 at 1:29 PM, Shain Miley smi...@npr.org wrote:

   Craig,

  Thanks for the info.

  I ended up doing a zap and then a create via ceph-deploy.

  One question that I still have is surrounding adding the failed osd back
 into the pool.

  In this example... osd.70 was bad; when I added it back in via
 ceph-deploy... the disk was brought up as osd.108.

  Only after osd.108 was up and running did I think to remove osd.70 from
 the crush map etc.

  My question is this...had I removed it from the crush map prior to my
 ceph-deploy create...should/would Ceph have reused the osd number 70?

  I would prefer to replace a failed disk with a new one and keep the old
 osd assignment...if possible that is why I am asking.

  Anyway...thanks again for all the help.

  Shain

 Sent from my iPhone

 On Nov 7, 2014, at 2:09 PM, Craig Lewis cle...@centraldesktop.com wrote:

   I'd stop that osd daemon, and run xfs_check / xfs_repair on that
 partition.

  If you repair anything, you should probably force a deep-scrub on all
 the PGs on that disk.  I think ceph osd deep-scrub osdid will do that,
 but you might have to manually grep the output of ceph pg dump.


  Or you could just treat it like a failed disk, but re-use the disk. 
 ceph-disk-prepare
 --zap-disk should take care of you.


 On Thu, Nov 6, 2014 at 5:06 PM, Shain Miley smi...@npr.org wrote:

 I tried restarting all the osd's on that node, osd.70 was the only ceph
 process that did not come back online.

 There is nothing in the ceph-osd log for osd.70.

 However I do see over 13,000 of these messages in the kern.log:

 Nov  6 19:54:27 hqosd6 kernel: [34042786.392178] XFS (sdl1):
 xfs_log_force: error 5 returned.

 Does anyone have any suggestions on how I might be able to get this HD
 back in the cluster (or whether or not it is worth even trying).

 Thanks,

 Shain

 Shain Miley | Manager of Systems and Infrastructure, Digital Media |
 smi...@npr.org | 202.513.3649

 
 From: Shain Miley [smi...@npr.org]
 Sent: Tuesday, November 04, 2014 3:55 PM
 To: ceph-users@lists.ceph.com
 Subject: osd down

 Hello,

 We are running ceph version 0.80.5 with 108 osd's.

 Today I noticed that one of the osd's is down:

 root@hqceph1:/var/log/ceph# ceph -s
  cluster 504b5794-34bd-44e7-a8c3-0494cf800c23
   health HEALTH_WARN crush map has legacy tunables
   monmap e1: 3 mons at
 {hqceph1=
 10.35.1.201:6789/0,hqceph2=10.35.1.203:6789/0,hqceph3=10.35.1.205:6789/0
 },
 election epoch 146, quorum 0,1,2 hqceph1,hqceph2,hqceph3
   osdmap e7119: 108 osds: 107 up, 107 in
pgmap v6729985: 3208 pgs, 17 pools, 81193 GB data, 21631 kobjects
  216 TB used, 171 TB / 388 TB avail
  3204 active+clean
 4 active+clean+scrubbing
client io 4079 kB/s wr, 8 op/s


 Using osd dump I determined that it is osd number 70:

 osd.70 down out weight 0 up_from 2668 up_thru 6886 down_at 6913
 last_clean_interval [488,2665) 10.35.1.217:6814/22440
 10.35.1.217:6820/22440 10.35.1.217:6824/22440 10.35.1.217:6830/22440
 autoout,exists 5dbd4a14-5045-490e-859b-15533cd67568


 Looking at that node, the drive is still mounted and I did not see any
 errors in any of the system logs, and the raid level status shows the
 drive as up and healthy, etc.


 root@hqosd6:~# df -h |grep 70
 /dev/sdl1   3.7T  1.9T  1.9T  51% /var/lib/ceph/osd/ceph-70


 I was hoping that someone might be able to advise me on the next course
 of action (can I add the osd back in?, should I replace the drive
 altogether, etc)

 I have attached the osd log to this email.

 Any suggestions would be great.

 Thanks,

 Shain
 --
 Shain Miley | Manager of Systems and Infrastructure, Digital Media |
 smi...@npr.org | 202.513.3649
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] emperor - firefly 0.80.7 upgrade problem

2014-11-10 Thread Craig Lewis
I had the same experience with force_create_pg too.

I ran it, and the PGs sat there in creating state.  I left the cluster
overnight, and sometime in the middle of the night, they created.  The
actual transition from creating to active+clean happened during the
recovery after a single OSD was kicked out.  I don't recall if that single
OSD was responsible for the creating PGs.  I really can't say what
un-jammed my creating PGs.


On Mon, Nov 10, 2014 at 12:33 PM, Chad Seys cws...@physics.wisc.edu wrote:

 Hi Craig,

  If all of your PGs now have an empty down_osds_we_would_probe, I'd run
  through this discussion again.

 Yep, looks to be true.

 So I ran:

 # ceph pg force_create_pg 2.5

 and it has been creating for about 3 hours now. :/


 # ceph health detail | grep creating
 pg 2.5 is stuck inactive since forever, current state creating, last
 acting []
 pg 2.5 is stuck unclean since forever, current state creating, last acting
 []

 Then I restart all OSDs.  The creating label disapears and I'm back with
 same number of incomplete PGs.  :(

 is the 'force_create_pg' the right command?  The 'mark_unfound_lost'
 complains
 that 'pg has no unfound objects' .

 I shall start the 'force_create_pg' again and wait longer.  Unless there
 is a
 different command to use. ?

 Thanks!
 Chad.


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] buckets and users

2014-11-07 Thread Craig Lewis
You need separate pools for the different zones, otherwise both zones will
have the same data.

You could use the defaults for the first zone, but the second zone will
need its own.  You might as well follow the convention of creating
non-default pools for the zone.


This is all semantics, but regions are generally seen as more distinct than
zones.  It's up to you if you want separate regions or the same region with
separate zones.  The end result is the same either way.


On Fri, Nov 7, 2014 at 2:06 AM, Marco Garcês ma...@garces.cc wrote:

 So I really need to create the region also? I thought it was using the
 default region, so I didn't have to create extra regions.
 Let me try to figure this out, the docs are a little bit confusing.

 Marco Garcês



 On Thu, Nov 6, 2014 at 6:39 PM, Craig Lewis cle...@centraldesktop.com
 wrote:
  You need to tell each radosgw daemon which zone to use.  In ceph.conf, I
  have:
  [client.radosgw.ceph3c]
host = ceph3c
rgw socket path = /var/run/ceph/radosgw.ceph3c
keyring = /etc/ceph/ceph.client.radosgw.ceph3c.keyring
log file = /var/log/ceph/radosgw.log
admin socket = /var/run/ceph/radosgw.asok
rgw dns name = us-central-1.ceph.cdlocal
rgw region = us
rgw region root pool = .us.rgw.root
rgw zone = us-central-1
rgw zone root pool = .us-central-1.rgw.root
 
 
 
 
  On Thu, Nov 6, 2014 at 6:35 AM, Marco Garcês ma...@garces.cc wrote:
 
  Update:
 
  I was able to fix the authentication error, and I have 2 radosgw
  running on the same host.
  The problem now, is, I believe I have created the zone wrong, or, I am
  doing something wrong, because I can login with the user I had before,
  and I can access his buckets. I need to have everything separated.
 
  Here are my zone info:
 
  default zone:
  { domain_root: .rgw,
control_pool: .rgw.control,
gc_pool: .rgw.gc,
log_pool: .log,
intent_log_pool: .intent-log,
usage_log_pool: .usage,
user_keys_pool: .users,
user_email_pool: .users.email,
user_swift_pool: .users.swift,
user_uid_pool: .users.uid,
system_key: { access_key: ,
secret_key: },
placement_pools: [
  { key: default-placement,
val: { index_pool: .rgw.buckets.index,
data_pool: .rgw.buckets,
data_extra_pool: .rgw.buckets.extra}}]}
 
  env2 zone:
  { domain_root: .rgw,
control_pool: .rgw.control,
gc_pool: .rgw.gc,
log_pool: .log,
intent_log_pool: .intent-log,
usage_log_pool: .usage,
user_keys_pool: .users,
user_email_pool: .users.email,
user_swift_pool: .users.swift,
user_uid_pool: .users.uid,
system_key: { access_key: ,
secret_key: },
placement_pools: [
  { key: default-placement,
val: { index_pool: .rgw.buckets.index,
data_pool: .rgw.buckets,
data_extra_pool: .rgw.buckets.extra}}]}
 
  Could you guys help me?
 
 
 
  Marco Garcês
 
 
  On Thu, Nov 6, 2014 at 3:56 PM, Marco Garcês ma...@garces.cc wrote:
   By the way,
   Is it possible to run 2 radosgw on the same host?
  
   I think I have created the zone, not sure if it was correct, because
   it used the default pool names, even though I had changed them in the
   json file I had provided.
  
   Now I am trying to run ceph-radosgw with two different entries in the
    ceph.conf file, but without success. Example:
  
   [client.radosgw.gw]
   host = GATEWAY
   keyring = /etc/ceph/keyring.radosgw.gw
   rgw socket path = /var/run/ceph/ceph.radosgw.gateway.fastcgi.sock
   log file = /var/log/ceph/client.radosgw.gateway.log
   rgw print continue = false
   rgw dns name = gateway.local
   rgw enable ops log = false
   rgw enable usage log = true
   rgw usage log tick interval = 30
   rgw usage log flush threshold = 1024
   rgw usage max shards = 32
   rgw usage max user shards = 1
   rgw cache lru size = 15000
   rgw thread pool size = 2048
  
   #[client.radosgw.gw.env2]
   #host = GATEWAY
   #keyring = /etc/ceph/keyring.radosgw.gw
   #rgw socket path =
 /var/run/ceph/ceph.env2.radosgw.gateway.fastcgi.sock
   #log file = /var/log/ceph/client.env2.radosgw.gateway.log
   #rgw print continue = false
   #rgw dns name = cephppr.local
   #rgw enable ops log = false
   #rgw enable usage log = true
   #rgw usage log tick interval = 30
   #rgw usage log flush threshold = 1024
   #rgw usage max shards = 32
   #rgw usage max user shards = 1
   #rgw cache lru size = 15000
   #rgw thread pool size = 2048
   #rgw zone = ppr
  
   It fails to create the socket:
   2014-11-06 15:39:08.862364 7f80cc670880  0 ceph version 0.80.5
   (38b73c67d375a2552d8ed67843c8a65c2c0feba6), process radosgw, pid 7930
   2014-11-06 15:39:08.870429 7f80cc670880  0 librados:
   client.radosgw.gw.env2 authentication error (1) Operation not
   permitted
   2014-11-06 15:39:08.870889 7f80cc670880 -1 Couldn't init storage
   provider (RADOS)
  
  
   What am I doing wrong?
  
   Marco Garcês
   #sysadmin
   Maputo

Re: [ceph-users] Is it normal that osd's memory exceed 1GB under stresstest?

2014-11-07 Thread Craig Lewis
It depends on which version of ceph, but it's pretty normal under newer
versions.

There are a bunch of variables.  How many PGs per OSD, how much data is in
the PGs, etc.  I'm a bit light on the PGs (~60 PGs per OSD), and heavy on
the data (~3 TiB of data on each OSD).  In the production cluster, under
peak user traffic, my OSDs are using around 1GiB of memory.

If there is some scrubbing, deep-scrubbing, or a recovery, I've seen
individual OSDs go as high as 4 GiB.  Which causes some problems...



On Thu, Nov 6, 2014 at 11:00 PM, 谢锐 xie...@szsandstone.com wrote:

 and make one osd down, then do a stress test with fio.

 -- Original --


 From:  谢锐xie...@szsandstone.com;

 Date:  Fri, Nov 7, 2014 02:50 PM

 To:  ceph-usersceph-us...@ceph.com;


 Subject:  [ceph-users] Is it normal that osd's memory exceed 1GB under
 stresstest?


 I set mon_osd_down_out_interval to two days, and do a stress test. The memory
 of the osd exceeds 1GB.
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd down

2014-11-07 Thread Craig Lewis
I'd stop that osd daemon, and run xfs_check / xfs_repair on that partition.

If you repair anything, you should probably force a deep-scrub on all the
PGs on that disk.  I think ceph osd deep-scrub osdid will do that, but
you might have to manually grep the output of ceph pg dump.


Or you could just treat it like a failed disk, but re-use the disk.
ceph-disk-prepare
--zap-disk should take care of you.
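Something like this (osd.70 and the grep pattern are rough examples; adjust
the pattern to match the acting-set column of your dump output):

  ceph osd deep-scrub 70
  # or per PG:
  ceph pg dump pgs_brief | egrep '\[70,|,70,|,70\]'
  ceph pg deep-scrub <pgid>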


On Thu, Nov 6, 2014 at 5:06 PM, Shain Miley smi...@npr.org wrote:

 I tried restarting all the osd's on that node, osd.70 was the only ceph
 process that did not come back online.

 There is nothing in the ceph-osd log for osd.70.

 However I do see over 13,000 of these messages in the kern.log:

 Nov  6 19:54:27 hqosd6 kernel: [34042786.392178] XFS (sdl1):
 xfs_log_force: error 5 returned.

 Does anyone have any suggestions on how I might be able to get this HD
 back in the cluster (or whether or not it is worth even trying).

 Thanks,

 Shain

 Shain Miley | Manager of Systems and Infrastructure, Digital Media |
 smi...@npr.org | 202.513.3649

 
 From: Shain Miley [smi...@npr.org]
 Sent: Tuesday, November 04, 2014 3:55 PM
 To: ceph-users@lists.ceph.com
 Subject: osd down

 Hello,

 We are running ceph version 0.80.5 with 108 osd's.

 Today I noticed that one of the osd's is down:

 root@hqceph1:/var/log/ceph# ceph -s
  cluster 504b5794-34bd-44e7-a8c3-0494cf800c23
   health HEALTH_WARN crush map has legacy tunables
   monmap e1: 3 mons at
 {hqceph1=
 10.35.1.201:6789/0,hqceph2=10.35.1.203:6789/0,hqceph3=10.35.1.205:6789/0},
 election epoch 146, quorum 0,1,2 hqceph1,hqceph2,hqceph3
   osdmap e7119: 108 osds: 107 up, 107 in
pgmap v6729985: 3208 pgs, 17 pools, 81193 GB data, 21631 kobjects
  216 TB used, 171 TB / 388 TB avail
  3204 active+clean
 4 active+clean+scrubbing
client io 4079 kB/s wr, 8 op/s


 Using osd dump I determined that it is osd number 70:

 osd.70 down out weight 0 up_from 2668 up_thru 6886 down_at 6913
 last_clean_interval [488,2665) 10.35.1.217:6814/22440
 10.35.1.217:6820/22440 10.35.1.217:6824/22440 10.35.1.217:6830/22440
 autoout,exists 5dbd4a14-5045-490e-859b-15533cd67568


 Looking at that node, the drive is still mounted and I did not see any
 errors in any of the system logs, and the raid level status shows the
 drive as up and healthy, etc.


 root@hqosd6:~# df -h |grep 70
 /dev/sdl1   3.7T  1.9T  1.9T  51% /var/lib/ceph/osd/ceph-70


 I was hoping that someone might be able to advise me on the next course
 of action (can I add the osd back in?, should I replace the drive
 altogether, etc)

 I have attached the osd log to this email.

 Any suggestions would be great.

 Thanks,

 Shain
 --
 Shain Miley | Manager of Systems and Infrastructure, Digital Media |
 smi...@npr.org | 202.513.3649
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] emperor - firefly 0.80.7 upgrade problem

2014-11-07 Thread Craig Lewis
ceph-disk-prepare will give you the next unused number.  So this will work
only if the osd you remove is greater than 20.

On Thu, Nov 6, 2014 at 12:12 PM, Chad Seys cws...@physics.wisc.edu wrote:

 Hi Craig,

  You'll have trouble until osd.20 exists again.
 
  Ceph really does not want to lose data.  Even if you tell it the osd is
  gone, ceph won't believe you.  Once ceph can probe any osd that claims to
  be 20, it might let you proceed with your recovery.  Then you'll probably
  need to use ceph pg pgid mark_unfound_lost.
 
  If you don't have a free bay to create a real osd.20, it's possible to
 fake
  it with some small loop-back filesystems.  Bring it up and mark it OUT.
 It
  will probably cause some remapping.  I would keep it around until you get
  things healthy.
 
  If you create a real osd.20, you might want to leave it OUT until you get
  things healthy again.

 Thanks for the recovery tip!

 I would guess that safely removing an OSD (mark OUT, wait for migration to
 stop, then crush osd rm) and then adding it back in as osd.20 would work?

 New switch:
 --yes-i-really-REALLY-mean-it

 ;)
 Chad.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] buckets and users

2014-11-06 Thread Craig Lewis
You need to tell each radosgw daemon which zone to use.  In ceph.conf, I
have:
[client.radosgw.ceph3c]
  host = ceph3c
  rgw socket path = /var/run/ceph/radosgw.ceph3c
  keyring = /etc/ceph/ceph.client.radosgw.ceph3c.keyring
  log file = /var/log/ceph/radosgw.log
  admin socket = /var/run/ceph/radosgw.asok
  rgw dns name = us-central-1.ceph.cdlocal
  rgw region = us
  rgw region root pool = .us.rgw.root
  rgw zone = us-central-1
  rgw zone root pool = .us-central-1.rgw.root




On Thu, Nov 6, 2014 at 6:35 AM, Marco Garcês ma...@garces.cc wrote:

 Update:

 I was able to fix the authentication error, and I have 2 radosgw
 running on the same host.
 The problem now, is, I believe I have created the zone wrong, or, I am
 doing something wrong, because I can login with the user I had before,
 and I can access his buckets. I need to have everything separated.

 Here are my zone info:

 default zone:
 { domain_root: .rgw,
   control_pool: .rgw.control,
   gc_pool: .rgw.gc,
   log_pool: .log,
   intent_log_pool: .intent-log,
   usage_log_pool: .usage,
   user_keys_pool: .users,
   user_email_pool: .users.email,
   user_swift_pool: .users.swift,
   user_uid_pool: .users.uid,
   system_key: { access_key: ,
   secret_key: },
   placement_pools: [
 { key: default-placement,
   val: { index_pool: .rgw.buckets.index,
   data_pool: .rgw.buckets,
   data_extra_pool: .rgw.buckets.extra}}]}

 env2 zone:
 { domain_root: .rgw,
   control_pool: .rgw.control,
   gc_pool: .rgw.gc,
   log_pool: .log,
   intent_log_pool: .intent-log,
   usage_log_pool: .usage,
   user_keys_pool: .users,
   user_email_pool: .users.email,
   user_swift_pool: .users.swift,
   user_uid_pool: .users.uid,
   system_key: { access_key: ,
   secret_key: },
   placement_pools: [
 { key: default-placement,
   val: { index_pool: .rgw.buckets.index,
   data_pool: .rgw.buckets,
   data_extra_pool: .rgw.buckets.extra}}]}

 Could you guys help me?



 Marco Garcês


 On Thu, Nov 6, 2014 at 3:56 PM, Marco Garcês ma...@garces.cc wrote:
  By the way,
  Is it possible to run 2 radosgw on the same host?
 
  I think I have created the zone, not sure if it was correct, because
  it used the default pool names, even though I had changed them in the
  json file I had provided.
 
  Now I am trying to run ceph-radosgw with two different entries in the
   ceph.conf file, but without success. Example:
 
  [client.radosgw.gw]
  host = GATEWAY
  keyring = /etc/ceph/keyring.radosgw.gw
  rgw socket path = /var/run/ceph/ceph.radosgw.gateway.fastcgi.sock
  log file = /var/log/ceph/client.radosgw.gateway.log
  rgw print continue = false
  rgw dns name = gateway.local
  rgw enable ops log = false
  rgw enable usage log = true
  rgw usage log tick interval = 30
  rgw usage log flush threshold = 1024
  rgw usage max shards = 32
  rgw usage max user shards = 1
  rgw cache lru size = 15000
  rgw thread pool size = 2048
 
  #[client.radosgw.gw.env2]
  #host = GATEWAY
  #keyring = /etc/ceph/keyring.radosgw.gw
  #rgw socket path = /var/run/ceph/ceph.env2.radosgw.gateway.fastcgi.sock
  #log file = /var/log/ceph/client.env2.radosgw.gateway.log
  #rgw print continue = false
  #rgw dns name = cephppr.local
  #rgw enable ops log = false
  #rgw enable usage log = true
  #rgw usage log tick interval = 30
  #rgw usage log flush threshold = 1024
  #rgw usage max shards = 32
  #rgw usage max user shards = 1
  #rgw cache lru size = 15000
  #rgw thread pool size = 2048
  #rgw zone = ppr
 
  It fails to create the socket:
  2014-11-06 15:39:08.862364 7f80cc670880  0 ceph version 0.80.5
  (38b73c67d375a2552d8ed67843c8a65c2c0feba6), process radosgw, pid 7930
  2014-11-06 15:39:08.870429 7f80cc670880  0 librados:
  client.radosgw.gw.env2 authentication error (1) Operation not
  permitted
  2014-11-06 15:39:08.870889 7f80cc670880 -1 Couldn't init storage
  provider (RADOS)
 
 
  What am I doing wrong?
 
  Marco Garcês
  #sysadmin
  Maputo - Mozambique
  [Skype] marcogarces
 
 
  On Thu, Nov 6, 2014 at 10:11 AM, Marco Garcês ma...@garces.cc wrote:
  Your solution of pre-pending the environment name to the bucket, was
  my first choice, but at the moment I can't ask the devs to change the
  code to do that. For now I have to stick with the zones solution.
  Should I follow the federated zones docs
  (http://ceph.com/docs/master/radosgw/federated-config/) but skip the
  sync step?
 
  Thank you,
 
  Marco Garcês
 
  On Wed, Nov 5, 2014 at 8:13 PM, Craig Lewis cle...@centraldesktop.com
 wrote:
  You could setup dedicated zones for each environment, and not
  replicate between them.
 
  Each zone would have it's own URL, but you would be able to re-use
  usernames and bucket names.  If different URLs are a problem, you
  might be able to get around that in the load balancer or the web
  servers.  I wouldn't really recommend that, but it's possible.
 
 
  I have a similar requirement.  I was able to pre-pending

Re: [ceph-users] Basic Ceph Questions

2014-11-06 Thread Craig Lewis
On Wed, Nov 5, 2014 at 11:57 PM, Wido den Hollander w...@42on.com wrote:

 On 11/05/2014 11:03 PM, Lindsay Mathieson wrote:

 
  - Geo Replication - that's done via federated gateways? looks complicated
 :(
* The remote slave, it would be read only?
 

 That is only for the RADOS Gateway. Ceph itself (RADOS) does not support
 Geo Replication.



That is only for the RADOS Gateway. Ceph itself (RADOS) does not support
Geo Replication.

The 3 services built on top of RADOS support backups, but RADOS itself does
not.  For RBD, you can use snapshot diffs, and ship them offsite (see
various threads on the ML).  For RadosGW, there is Federation.  For CephFS,
you can use traditional POSIX filesystem backup tools.
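For RBD that looks roughly like this (pool, image, and snapshot names and
the remote host are all placeholders):

  rbd snap create rbd/myimage@backup-2014-11-06
  rbd export-diff --from-snap backup-2014-11-05 \
      rbd/myimage@backup-2014-11-06 - | ssh backuphost rbd import-diff - rbd/myimage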




  - Disaster strikes, apart from DR backups how easy is it to recover your
 data
  off ceph OSD's? one of the things I liked about gluster was that if I
 totally
  screwed up the gluster masters, I could always just copy the data off the
  filesystem. Not so much with ceph.
 

 It's a bit harder with Ceph. Eventually it is doable, but that is
 something that would take a lot of time.


In practice, not really.  Out of curiosity, I attempted this for some
RadosGW objects.  It was easy when there was a single object less than
4MB.  It very quickly became complicated with a few larger objects.  You'd
have to have a very deep understanding of the service to track all of the
information down with the cluster offline.

It's definitely possible, just not practical.




 
  - Am I abusing ceph? :) I just have a small 3 node VM server cluster
 with 20
  Windows VMs, some servers, some VDI. The shared store is a QNAP NAS
 which is
  struggling. I'm using ceph for
  - Shared Storage
  - Replication/Redundancy
  - Improved performance
 

 I think that 3 nodes is not sufficient, Ceph really starts performing
 when you go to 10 nodes (excluding monitors).


If it meets your needs, then it's working.  :-)

You're going to spend a lot more time managing the 3 node Ceph cluster than
you spent on the QNAP.  If it doesn't make sense for you to spent a lot of
time dealing with storage, then a single shared store with more IOPS would
be a better fit.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] emperor - firefly 0.80.7 upgrade problem

2014-11-06 Thread Craig Lewis
On Thu, Nov 6, 2014 at 11:27 AM, Chad Seys


  Also, are you certain that osd 20 is not up?
  -Sam

 Yep.

 # ceph osd metadata 20
 Error ENOENT: osd.20 does not exist

 So part of ceph thinks osd.20 doesn't exist, but another part (the
 down_osds_we_would_probe) thinks the osd exists and is down?


You'll have trouble until osd.20 exists again.

Ceph really does not want to lose data.  Even if you tell it the osd is
gone, ceph won't believe you.  Once ceph can probe any osd that claims to
be 20, it might let you proceed with your recovery.  Then you'll probably
need to use ceph pg pgid mark_unfound_lost.

If you don't have a free bay to create a real osd.20, it's possible to fake
it with some small loop-back filesystems.  Bring it up and mark it OUT.  It
will probably cause some remapping.  I would keep it around until you get
things healthy.

If you create a real osd.20, you might want to leave it OUT until you get
things healthy again.
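A rough sketch of the loop-back version (paths, size, and host name are
placeholders, and it assumes 20 is the lowest free id so ceph osd create
hands it back):

  truncate -s 10G /var/tmp/osd-20.img
  mkfs.xfs /var/tmp/osd-20.img
  mkdir -p /var/lib/ceph/osd/ceph-20
  mount -o loop /var/tmp/osd-20.img /var/lib/ceph/osd/ceph-20
  ceph osd create                        # should return 20
  ceph-osd -i 20 --mkfs --mkkey
  ceph auth add osd.20 osd 'allow *' mon 'allow rwx' \
      -i /var/lib/ceph/osd/ceph-20/keyring
  ceph osd crush add osd.20 0 host=<somehost>
  service ceph start osd.20
  ceph osd out 20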
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

