Re: [ceph-users] Serious performance problems with small file writes

2014-08-21 Thread Christian Balzer

Hello,

On Wed, 20 Aug 2014 15:39:11 +0100 Hugo Mills wrote:

We have a ceph system here, and we're seeing performance regularly
 descend into unusability for periods of minutes at a time (or longer).
 This appears to be triggered by writing large numbers of small files.
 
Specifications:
 
 ceph 0.80.5
 6 machines running 3 OSDs each (one 4 TB rotational HD per OSD, 2
 threads) 
 2 machines running primary and standby MDS
 3 monitors on the same machines as the OSDs
 Infiniband to about 8 CephFS clients (headless, in the machine room)
 Gigabit ethernet to a further 16 or so CephFS clients (Linux desktop
machines, in the analysis lab)
 
Please let us know the CPU and memory specs of the OSD nodes as well.
And the replication factor -- I presume it's 3 if you value that data.
Also the PG and PGP values for the pool(s) you're using.
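
For example, something like this will give those numbers (pool name below is a
placeholder):

ceph osd pool get your-pool size
ceph osd pool get your-pool pg_num
ceph osd pool get your-pool pgp_num
ceph osd dump | grep pool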

The cluster stores home directories of the users and a larger area
 of scientific data (approx 15 TB) which is being processed and
 analysed by the users of the cluster.
 
We have a relatively small number of concurrent users (typically
 4-6 at most), who use GUI tools to examine their data, and then
 complex sets of MATLAB scripts to process it, with processing often
 being distributed across all the machines using Condor.
 
It's not unusual to see the analysis scripts write out large
 numbers (thousands, possibly tens or hundreds of thousands) of small
 files, often from many client machines at once in parallel. When this
 happens, the ceph cluster becomes almost completely unresponsive for
 tens of seconds (or even for minutes) at a time, until the writes are
 flushed through the system. Given the nature of modern GUI desktop
 environments (often reading and writing small state files in the
 user's home directory), this means that desktop interactiveness and
 responsiveness for all the other users of the cluster suffer.
 
1-minute load on the servers typically peaks at about 8 during
 these events (on 4-core machines). Load on the clients also peaks
 high, because of the number of processes waiting for a response from
 the FS. The MDS shows little sign of stress -- it seems to be entirely
 down to the OSDs. ceph -w shows requests blocked for more than 10
 seconds, and in bad cases, ceph -s shows up to many hundreds of
 requests blocked for more than 32s.
 
We've had to turn off scrubbing and deep scrubbing completely --
 except between 01.00 and 04.00 every night -- because it triggers the
 exact same symptoms, even with only 2-3 PGs being scrubbed. If it gets
 up to 7 PGs being scrubbed, as it did on Monday, it's completely
 unusable.
 
Note that I know next to nothing about CephFS, and while there are probably
tunables for it, the slow requests you're seeing and the hardware described
above definitely suggest slow OSDs.

Now with a replication factor of 3, your total sustained cluster performance
is that of just 6 disks, and 4TB drives are never speed wonders -- minus the
latency overhead from the network, which should be minimal in your case.

You wrote that your old NFS setup (cluster?) had twice the spindles; if that
means 36 disks, it was quite a bit faster than what you have now.

A cluster I'm just building with 3 nodes, 4 journal SSDs and 8 OSD HDDs
per node can do about 7000 write IOPS (4KB), so I would expect yours to be
worse off.

Having the journals on dedicated partitions instead of files on the rootfs
would not only be faster (though probably not significantly so), but would
also avoid any potential failures caused by FS corruption.
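
For example, a rough sketch of moving one journal onto a raw partition (the
device path is made up here and the steps are from memory -- test first):

/etc/init.d/ceph stop osd.0
ceph-osd -i 0 --flush-journal
# then in ceph.conf:
[osd.0]
    osd journal = /dev/disk/by-partlabel/journal-osd0
ceph-osd -i 0 --mkjournal
/etc/init.d/ceph start osd.0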

The SSD journals will compensate for some spikes of high IOPS, but 25
files is clearly beyond that.

Putting lots of RAM (relatively cheap these days) into the OSD nodes has
the big benefit that reads of hot objects will not have to go to disk and
thus compete with write IOPS.

Is this problem something that's often seen? If so, what are the
 best options for mitigation or elimination of the problem? I've found
 a few references to issue #6278 [1], but that seems to be referencing
 scrub specifically, not ordinary (if possibly pathological) writes.
 
You need to match your cluster to your workload.
Aside from tuning things (which tends to have limited effects), you can
either scale out by adding more servers or scale up by using faster
storage and/or a cache pool.
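
If you ever try the cache pool route on firefly, the basic commands look
roughly like this (pool names are placeholders, and cache tiering in 0.80 is
still young, so test carefully first):

ceph osd tier add cephfs-data hot-cache
ceph osd tier cache-mode hot-cache writeback
ceph osd tier set-overlay cephfs-data hot-cache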

What are the sorts of things I should be looking at to work out
 where the bottleneck(s) are? I'm a bit lost about how to drill down
 into the ceph system for identifying performance issues. Is there a
 useful guide to tools somewhere?
 
Reading/scouring this ML can be quite helpful. 

Watch your OSD nodes (all of them!) with iostat or preferably atop (which
will also show you how your CPUs and network are doing) while running the
tests below.

To get a baseline do:
rados -p pool-in-question bench 60 write -t 64
This will test your throughput most of all and due to the 4MB block size
spread the load very equally amongst the OSDs.
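
To get something closer to your small-file workload, you could also do a
small-object run, e.g. (parameters are only a suggestion):

rados -p pool-in-question bench 60 write -t 64 -b 4096

which stresses IOPS rather than bandwidth. The 4MB run above is still the
better first baseline.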
During that test you should see all OSDs more or 

Re: [ceph-users] RadosGW problems

2014-08-21 Thread Marco Garcês
I have noticed that when I make the request over HTTPS, the response comes
back as http with port 443... Where is this happening? Do you have any idea?

On Wed, Aug 20, 2014 at 1:30 PM, Marco Garcês ma...@garces.cc wrote:

 swift --insecure -V 1 -A https://gateway.bcitestes.local/auth -U
 testuser:swift -K MHA4vFaDy5XsJq+F5NuZLcBMCoJcuot44ASDuReY stat
 Account HEAD failed: http://gateway.bcitestes.local:443/swift/v1 400 Bad
 Request



Re: [ceph-users] Serious performance problems with small file writes

2014-08-21 Thread Dan Van Der Ster
Hi Hugo,

On 20 Aug 2014, at 17:54, Hugo Mills h.r.mi...@reading.ac.uk wrote:

 What are you using for OSD journals?
 
   On each machine, the three OSD journals live on the same ext4
 filesystem on an SSD, which is also the root filesystem of the
 machine.
 
 Also check the CPU usage for the mons and osds...
 
   The mons are doing pretty much nothing in terms of CPU, as far as I
 can see. I will double-check during an incident.
 
 Does your hardware provide enough IOPS for what your users need?
 (e.g. what is the op/s from ceph -w)
 
   Not really an answer to your question, but: Before the ceph cluster
 went in, we were running the system on two 5-year-old NFS servers for
 a while. We have about half the total number of spindles that we used
 to, but more modern drives.

NFS exported async or sync? If async, it can’t be compared to CephFS. Also, if 
those NFS servers had RAID cards with a wb-cache, it can’t really be compared.

 
   I'll look at how the op/s values change when we have the problem.
 At the moment (with what I assume to be normal desktop usage from the
 3-4 users in the lab), they're flapping wildly somewhere around a
 median of 350-400, with peaks up to 800. Somewhere around 15-20 MB/s
 read and write.


Another tunable to look at is the filestore max sync interval — in my 
experience the colocated journal/OSD setup suffers with the default (5s, IIRC), 
especially when an OSD is getting a constant stream of writes. When this 
happens, the disk heads are constantly seeking back and forth between 
synchronously writing to the journal and flushing the outstanding writes. If 
there were a dedicated (spinning) disk for the journal, the synchronous writes 
to the journal could be done sequentially (and thus quickly), and the flushes 
would also be quicker. SSD journals can obviously also help with this.

For a short test I would try increasing filestore max sync interval to 30s or 
maybe even 60s to see if it helps. (I know that at least one of the Inktank 
experts advises against changing the filestore max sync interval -- but in my 
experience 5s is much too short for the colocated journal setup.) You need to 
make sure your journals are large enough to store 30/60s of writes, but when 
you have predominantly small writes even a few GB of journal ought to be 
enough. 
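
A sketch of how to try it (values are examples only):

# ceph.conf, [osd] section:
filestore max sync interval = 30

# or injected at runtime on all OSDs, which is easy to revert:
ceph tell osd.* injectargs '--filestore_max_sync_interval 30'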

Cheers, Dan


Re: [ceph-users] MON running 'ceph -w' doesn't see OSD's booting

2014-08-21 Thread Dan Van Der Ster
Hi,
You only have one OSD? I’ve seen similar strange things in test pools having 
only one OSD — and I kinda explained it by assuming that OSDs need peers (other 
OSDs sharing the same PG) to behave correctly. Install a second OSD and see how 
it goes...
Cheers, Dan


On 21 Aug 2014, at 02:59, Bruce McFarland bruce.mcfarl...@taec.toshiba.com 
wrote:

I have a cluster with 1 monitor and 3 OSD Servers. Each server has multiple 
OSD’s running on it. When I start the OSD using /etc/init.d/ceph start osd.0
I see the expected interaction between the OSD and the monitor authenticating 
keys etc and finally the OSD starts.

Watching the cluster with 'ceph -w' running on the monitor, I never see the 
INFO messages I expect. There isn't a message from osd.0 for the boot event, 
nor the expected INFO messages from the osdmap and pgmap for the OSD and its 
PGs being added to those maps. I only see the last time the monitor was booted, 
when it wins the monitor election and reports monmap, pgmap, and mdsmap info.

The firewalls are disabled with selinux==disabled and iptables turned off. All 
hosts can ssh w/o passwords into each other and I’ve verified traffic between 
hosts using tcpdump captures. Any ideas on what I’d need to add to ceph.conf or 
have overlooked would be greatly appreciated.
Thanks,
Bruce

[root@ceph0 ceph]# /etc/init.d/ceph restart osd.0
=== osd.0 ===
=== osd.0 ===
Stopping Ceph osd.0 on ceph0...kill 15676...done
=== osd.0 ===
2014-08-20 17:43:46.456592 7fa51a034700  1 -- :/0 messenger.start
2014-08-20 17:43:46.457363 7fa51a034700  1 -- :/1025971 -- 
209.243.160.84:6789/0 -- auth(proto 0 26 bytes epoch 0) v1 -- ?+0 
0x7fa51402f9e0 con 0x7fa51402f570
2014-08-20 17:43:46.458229 7fa5189f0700  1 -- 209.243.160.83:0/1025971 learned 
my addr 209.243.160.83:0/1025971
2014-08-20 17:43:46.459664 7fa5135fe700  1 -- 209.243.160.83:0/1025971 == 
mon.0 209.243.160.84:6789/0 1  mon_map v1  200+0+0 (3445960796 0 0) 
0x7fa508000ab0 con 0x7fa51402f570
2014-08-20 17:43:46.459849 7fa5135fe700  1 -- 209.243.160.83:0/1025971 == 
mon.0 209.243.160.84:6789/0 2  auth_reply(proto 2 0 (0) Success) v1  
33+0+0 (536914167 0 0) 0x7fa508000f60 con 0x7fa51402f570
2014-08-20 17:43:46.460180 7fa5135fe700  1 -- 209.243.160.83:0/1025971 -- 
209.243.160.84:6789/0 -- auth(proto 2 32 bytes epoch 0) v1 -- ?+0 
0x7fa4fc0012d0 con 0x7fa51402f570
2014-08-20 17:43:46.461341 7fa5135fe700  1 -- 209.243.160.83:0/1025971 == 
mon.0 209.243.160.84:6789/0 3  auth_reply(proto 2 0 (0) Success) v1  
206+0+0 (409581826 0 0) 0x7fa508000f60 con 0x7fa51402f570
2014-08-20 17:43:46.461514 7fa5135fe700  1 -- 209.243.160.83:0/1025971 -- 
209.243.160.84:6789/0 -- auth(proto 2 165 bytes epoch 0) v1 -- ?+0 
0x7fa4fc001cf0 con 0x7fa51402f570
2014-08-20 17:43:46.462824 7fa5135fe700  1 -- 209.243.160.83:0/1025971 == 
mon.0 209.243.160.84:6789/0 4  auth_reply(proto 2 0 (0) Success) v1  
393+0+0 (2134012784 0 0) 0x7fa5080011d0 con 0x7fa51402f570
2014-08-20 17:43:46.463011 7fa5135fe700  1 -- 209.243.160.83:0/1025971 -- 
209.243.160.84:6789/0 -- mon_subscribe({monmap=0+}) v2 -- ?+0 0x7fa51402bbc0 
con 0x7fa51402f570
2014-08-20 17:43:46.463073 7fa5135fe700  1 -- 209.243.160.83:0/1025971 -- 
209.243.160.84:6789/0 -- auth(proto 2 2 bytes epoch 0) v1 -- ?+0 0x7fa4fc0025d0 
con 0x7fa51402f570
2014-08-20 17:43:46.463329 7fa51a034700  1 -- 209.243.160.83:0/1025971 -- 
209.243.160.84:6789/0 -- mon_subscribe({monmap=2+,osdmap=0}) v2 -- ?+0 
0x7fa514030490 con 0x7fa51402f570
2014-08-20 17:43:46.463363 7fa51a034700  1 -- 209.243.160.83:0/1025971 -- 
209.243.160.84:6789/0 -- mon_subscribe({monmap=2+,osdmap=0}) v2 -- ?+0 
0x7fa5140309b0 con 0x7fa51402f570
2014-08-20 17:43:46.463564 7fa5135fe700  1 -- 209.243.160.83:0/1025971 == 
mon.0 209.243.160.84:6789/0 5  mon_map v1  200+0+0 (3445960796 0 0) 
0x7fa508001100 con 0x7fa51402f570
2014-08-20 17:43:46.463639 7fa5135fe700  1 -- 209.243.160.83:0/1025971 == 
mon.0 209.243.160.84:6789/0 6  mon_subscribe_ack(300s) v1  20+0+0 
(540052875 0 0) 0x7fa5080013e0 con 0x7fa51402f570
2014-08-20 17:43:46.463707 7fa5135fe700  1 -- 209.243.160.83:0/1025971 == 
mon.0 209.243.160.84:6789/0 7  auth_reply(proto 2 0 (0) Success) v1  
194+0+0 (1040860857 0 0) 0x7fa5080015d0 con 0x7fa51402f570
2014-08-20 17:43:46.468877 7fa51a034700  1 -- 209.243.160.83:0/1025971 -- 
209.243.160.84:6789/0 -- mon_command({prefix: get_command_descriptions} v 
0) v1 -- ?+0 0x7fa514030e20 con 0x7fa51402f570
2014-08-20 17:43:46.469862 7fa5135fe700  1 -- 209.243.160.83:0/1025971 == 
mon.0 209.243.160.84:6789/0 8  osd_map(554..554 src has 1..554) v3  
59499+0+0 (2180258623 0 0) 0x7fa50800f980 con 0x7fa51402f570
2014-08-20 17:43:46.470428 7fa5135fe700  1 -- 209.243.160.83:0/1025971 == 
mon.0 209.243.160.84:6789/0 9  mon_subscribe_ack(300s) v1  20+0+0 
(540052875 0 0) 0x7fa50800fc40 con 0x7fa51402f570
2014-08-20 17:43:46.475021 7fa5135fe700  1 -- 

Re: [ceph-users] Serious performance problems with small file writes

2014-08-21 Thread Hugo Mills
   Just to fill in some of the gaps from yesterday's mail:

On Wed, Aug 20, 2014 at 04:54:28PM +0100, Hugo Mills wrote:
Some questions below I can't answer immediately, but I'll spend
 tomorrow morning irritating people by triggering these events (I think
 I have a reproducer -- unpacking a 1.2 GiB tarball with 25 small
 files in it) and giving you more details. 

   Yes, the tarball with the 25 small files in it is definitely a
reproducer.

[snip]
  What about iostat on the OSDs — are your OSD disks busy reading or
  writing during these incidents?
 
Not sure. I don't think so, but I'll try to trigger an incident and
 report back on this one.

   Mostly writing. I'm seeing figures of up to about 2-3 MB/s writes,
and 200-300 kB/s reads on all three, but it fluctuates a lot (with
5-second intervals). Sample data at the end of the email.

  What are you using for OSD journals?
 
On each machine, the three OSD journals live on the same ext4
 filesystem on an SSD, which is also the root filesystem of the
 machine.
 
  Also check the CPU usage for the mons and osds...
 
The mons are doing pretty much nothing in terms of CPU, as far as I
 can see. I will double-check during an incident.

   The mons are just ticking over with a 1% CPU usage.

  Does your hardware provide enough IOPS for what your users need?
  (e.g. what is the op/s from ceph -w)
 
Not really an answer to your question, but: Before the ceph cluster
 went in, we were running the system on two 5-year-old NFS servers for
 a while. We have about half the total number of spindles that we used
 to, but more modern drives.
 
I'll look at how the op/s values change when we have the problem.
 At the moment (with what I assume to be normal desktop usage from the
 3-4 users in the lab), they're flapping wildly somewhere around a
 median of 350-400, with peaks up to 800. Somewhere around 15-20 MB/s
 read and write.

   With minimal users and one machine running the tar unpacking
process, I'm getting somewhere around 100-200 op/s on the ceph
cluster, but interactivity on the desktop machine I'm logged in on is
horrible -- I'm frequently getting tens of seconds of latency. Compare
that to the (relatively) comfortable 350-400 op/s we had yesterday
with what is probably workloads with larger files.

  If disabling deep scrub helps, then it might be that something else
  is reading the disks heavily. One thing to check is updatedb — we
  had to disable it from indexing /var/lib/ceph on our OSDs.
 
I haven't seen that running at all during the day, but I'll look
 into it.

   No, it's not anything like that -- iotop reports pretty much the
only things doing IO are ceph-osd and the occasional xfsaild.

   Hugo.

Hugo.
 
  Best Regards,
  Dan
  
  -- Dan van der Ster || Data & Storage Services || CERN IT Department --
  
  
  On 20 Aug 2014, at 16:39, Hugo Mills h.r.mi...@reading.ac.uk wrote:
  
 We have a ceph system here, and we're seeing performance regularly
   descend into unusability for periods of minutes at a time (or longer).
   This appears to be triggered by writing large numbers of small files.
   
 Specifications:
   
   ceph 0.80.5
   6 machines running 3 OSDs each (one 4 TB rotational HD per OSD, 2 threads)
   2 machines running primary and standby MDS
   3 monitors on the same machines as the OSDs
   Infiniband to about 8 CephFS clients (headless, in the machine room)
   Gigabit ethernet to a further 16 or so CephFS clients (Linux desktop
 machines, in the analysis lab)
   
 The cluster stores home directories of the users and a larger area
   of scientific data (approx 15 TB) which is being processed and
   analysed by the users of the cluster.
   
 We have a relatively small number of concurrent users (typically
   4-6 at most), who use GUI tools to examine their data, and then
   complex sets of MATLAB scripts to process it, with processing often
   being distributed across all the machines using Condor.
   
 It's not unusual to see the analysis scripts write out large
   numbers (thousands, possibly tens or hundreds of thousands) of small
   files, often from many client machines at once in parallel. When this
   happens, the ceph cluster becomes almost completely unresponsive for
   tens of seconds (or even for minutes) at a time, until the writes are
   flushed through the system. Given the nature of modern GUI desktop
   environments (often reading and writing small state files in the
   user's home directory), this means that desktop interactiveness and
   responsiveness for all the other users of the cluster suffer.
   
 1-minute load on the servers typically peaks at about 8 during
   these events (on 4-core machines). Load on the clients also peaks
   high, because of the number of processes waiting for a response from
   the FS. The MDS shows little sign of stress -- it seems to be entirely
   down to the OSDs. ceph -w shows requests blocked for more than 10
   seconds, 

Re: [ceph-users] Serious performance problems with small file writes

2014-08-21 Thread Hugo Mills
On Thu, Aug 21, 2014 at 07:40:45AM +, Dan Van Der Ster wrote:
 On 20 Aug 2014, at 17:54, Hugo Mills h.r.mi...@reading.ac.uk wrote:
  Does your hardware provide enough IOPS for what your users need?
  (e.g. what is the op/s from ceph -w)
  
Not really an answer to your question, but: Before the ceph cluster
  went in, we were running the system on two 5-year-old NFS servers for
  a while. We have about half the total number of spindles that we used
  to, but more modern drives.
 
 NFS exported async or sync? If async, it can’t be compared to
 CephFS. Also, if those NFS servers had RAID cards with a wb-cache,
 it can’t really be compared.

   Hmm. Yes, async. Probably wouldn't have been my choice... (I only
started working with this system recently -- about the same time that
the ceph cluster was deployed to replace the older machines. I haven't
had much of say in what's implemented here, but I have to try to
support it.)

   I'm tempted to put the users' home directories back on an NFS
server, and keep ceph for the research data. That at least should give
us more in the way of interactivity (which is the main thing I'm
getting complaints about).

I'll look at how the op/s values change when we have the problem.
  At the moment (with what I assume to be normal desktop usage from the
  3-4 users in the lab), they're flapping wildly somewhere around a
  median of 350-400, with peaks up to 800. Somewhere around 15-20 MB/s
  read and write.

 Another tunable to look at is the filestore max sync interval — in
 my experience the colocated journal/OSD setup suffers with the
 default (5s, IIRC), especially when an OSD is getting a constant
 stream of writes. When this happens, the disk heads are constantly
 seeking back and forth between synchronously writing to the journal
 and flushing the outstanding writes. If we would have a dedicated
 (spinning) disk for the journal, then the synchronous writes (to the
 journal) could be done sequentially (thus, quickly) and the flushes
 would also be quick(er). SSD journals can obviously also help with
 this.

   Not sure what you mean about colocated journal/OSD. The journals
aren't on the same device as the OSDs. However, all three journals on
each machine are on the same SSD.

 For a short test I would try increasing filestore max sync interval
 to 30s or maybe even 60s to see if it helps. (I know that at least
 one of the Inktank experts advise against changing the filestore max
 sync interval — but in my experience 5s is much too short for the
 colocated journal setup.) You need to make sure your journals are
 large enough to store 30/60s of writes, but when you have
 predominantly small writes even a few GB of journal ought to be
 enough.

   I'll have a play with that.

   Thanks for all the help so far -- it's been useful. I'm learning
what the right kind of questions are.

   Hugo.

-- 
Hugo Mills :: IT Services, University of Reading
Specialist Engineer, Research Servers :: x6943 :: R07 Harry Pitt Building


Re: [ceph-users] Serious performance problems with small file writes

2014-08-21 Thread Dan Van Der Ster
Hi Hugo,

On 21 Aug 2014, at 14:17, Hugo Mills h.r.mi...@reading.ac.uk wrote:

 
   Not sure what you mean about colocated journal/OSD. The journals
 aren't on the same device as the OSDs. However, all three journals on
 each machine are on the same SSD.

*embarrassed* I obviously didn't drink enough coffee this morning. I read your 
reply as something like "... On each machine, the three OSD journals live on 
the same ext4 filesystem on an OSD".

Anyway… what kind of SSD do you have? With iostat -xm 1, do you see high % 
utilisation on that SSD during these incidents? It could be that you’re 
exceeding even the iops capacity of the SSD.
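
(For example: run iostat -xm 1 on an OSD node during an incident and watch the
%util and await columns for the SSD device -- sustained %util near 100 would
point at the journal SSD as the bottleneck.)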

Cheers, Dan


[ceph-users] ceph-users@lists.ceph.com

2014-08-21 Thread Paweł Sadowski
Hi,

I'm trying to start Qemu on top of RBD. In documentation[1] there is a
big warning:

Important

If you set rbd_cache=true, you must set cache=writeback or risk data
loss. Without cache=writeback, QEMU will not send flush requests to
librbd. If QEMU exits uncleanly in this configuration, filesystems
on top of rbd can be corrupted.

But the last part of that page says that QEMU command line options override
the ceph.conf settings, and that setting cache=writethrough will force
rbd_cache=true and rbd_cache_max_dirty=0. In that configuration rbd will write
directly to Ceph and there is no risk of data loss (except for things cached
in the VM OS). Am I right, or am I missing something?

1: http://ceph.com/docs/master/rbd/qemu-rbd/
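
For reference, the drive line the QEMU/RBD docs use is along these lines (pool
and image names are placeholders):

qemu -drive format=raw,file=rbd:volumes/myvm:rbd_cache=true,cache=writeback

and the question above is essentially whether, with cache=writethrough instead,
librbd really clamps rbd_cache_max_dirty to 0 on its own.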

Thanks,
PS


Re: [ceph-users] Ceph + Qemu cache=writethrough

2014-08-21 Thread Paweł Sadowski
Sorry for missing subject.

On 08/21/2014 03:09 PM, Paweł Sadowski wrote:
 Hi,

 I'm trying to start Qemu on top of RBD. In documentation[1] there is a
 big warning:

 Important

 If you set rbd_cache=true, you must set cache=writeback or risk data
 loss. Without cache=writeback, QEMU will not send flush requests to
 librbd. If QEMU exits uncleanly in this configuration, filesystems
 on top of rbd can be corrupted.

 But in last part of that page there is written that Qemu command line
 override ceph.conf settings and setting *cache=writethrough* will force
 *rbd_cache**=**true* and *rbd_cache_max_dirty=0*. In that configuration
 rbd will write directly to Ceph and there is no risk of data loss
 (except for things cached in VM OS). Am I right or am I missing something?

 1: http://ceph.com/docs/master/rbd/qemu-rbd/

 Thanks,
 PS



[ceph-users] Question on OSD node failure recovery

2014-08-21 Thread LaBarre, James (CTR) A6IT
I understand the concept with Ceph being able to recover from the failure of an 
OSD (presumably with a single OSD being on a single disk), but I'm wondering 
what the scenario is if an OSD server node containing  multiple disks should 
fail.  Presuming you have a server containing 8-10 disks, your duplicated 
placement groups could end up on the same system.  From diagrams I've seen they 
show duplicates going to separate nodes, but is this in fact how it handles it?


Re: [ceph-users] Question on OSD node failure recovery

2014-08-21 Thread Sean Noonan
Ceph uses CRUSH (http://ceph.com/docs/master/rados/operations/crush-map/) to 
determine object placement.  The default generated crush maps are sane, in that 
they will put replicas in placement groups into separate failure domains.  You 
do not need to worry about this simple failure case, but you should consider 
the network and disk i/o consequences of re-replicating large amounts of data.
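
For illustration, the relevant part of a default decompiled CRUSH map looks
roughly like this -- the chooseleaf step with type host is what forces replicas
onto different hosts:

rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
}

You can check your own rules with ceph osd crush rule dump, or by decompiling
the map with crushtool -d.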

Sean

From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of LaBarre, 
James  (CTR)  A6IT [james.laba...@cigna.com]
Sent: Thursday, August 21, 2014 9:17 AM
To: ceph-us...@ceph.com
Subject: [ceph-users] Question on OSD node failure recovery

I understand the concept with Ceph being able to recover from the failure of an 
OSD (presumably with a single OSD being on a single disk), but I’m wondering 
what the scenario is if an OSD server node containing  multiple disks should 
fail.  Presuming you have a server containing 8-10 disks, your duplicated 
placement groups could end up on the same system.  From diagrams I’ve seen they 
show duplicates going to separate nodes, but is this in fact how it handles it?



[ceph-users] Hanging ceph client

2014-08-21 Thread Damien Churchill
Hi,

On a freshly created 4 node cluster I'm struggling to get the 4th node
to create correctly. ceph-deploy is unable to create the OSDs on it
and when logging in to the node and attempting to run `ceph -s`
manually (after copying the client.admin keyring) with debug
parameters it ends up hanging and looping over mon_command({prefix:
get_command_descriptions} v 0).

I'm not sure what else to try to find out why this is happening. It
seems like it's able to talk to the monitors okay as it looks like it
is authenticating, and the same command runs fine on the first 3 nodes
which are running monitors, but just hangs on the node that isn't.

Thanks in advance for any help!
root@ceph4:~# ceph -s --debug-ms=5 --debug-client=5 --debug-mon=10
2014-08-21 14:45:32.689379 7ff622841700  1 -- :/0 messenger.start
2014-08-21 14:45:32.691284 7ff622841700  1 -- :/1007607 -- 
192.168.78.13:6789/0 -- auth(proto 0 30 bytes epoch 0) v1 -- ?+0 0x7ff61c024980 
con 0x7ff61c024530
2014-08-21 14:45:32.692075 7ff61a7fc700  1 -- 192.168.78.14:0/1007607 learned 
my addr 192.168.78.14:0/1007607
2014-08-21 14:45:32.693174 7ff620885700  1 -- 192.168.78.14:0/1007607 == mon.2 
192.168.78.13:6789/0 1  mon_map v1  485+0+0 (2066881705 0 0) 
0x7ff61bd0 con 0x7ff61c024530
2014-08-21 14:45:32.693383 7ff620885700  1 -- 192.168.78.14:0/1007607 == mon.2 
192.168.78.13:6789/0 2  auth_reply(proto 2 0 (0) Success) v1  33+0+0 
(3596119886 0 0) 0x7ff610001080 con 0x7ff61c024530
2014-08-21 14:45:32.693691 7ff620885700  1 -- 192.168.78.14:0/1007607 -- 
192.168.78.13:6789/0 -- auth(proto 2 32 bytes epoch 0) v1 -- ?+0 0x7ff604001680 
con 0x7ff61c024530
2014-08-21 14:45:32.694549 7ff620885700  1 -- 192.168.78.14:0/1007607 == mon.2 
192.168.78.13:6789/0 3  auth_reply(proto 2 0 (0) Success) v1  206+0+0 
(1790499909 0 0) 0x7ff610001080 con 0x7ff61c024530
2014-08-21 14:45:32.694750 7ff620885700  1 -- 192.168.78.14:0/1007607 -- 
192.168.78.13:6789/0 -- auth(proto 2 165 bytes epoch 0) v1 -- ?+0 
0x7ff604003810 con 0x7ff61c024530
2014-08-21 14:45:32.695641 7ff620885700  1 -- 192.168.78.14:0/1007607 == mon.2 
192.168.78.13:6789/0 4  auth_reply(proto 2 0 (0) Success) v1  393+0+0 
(350251809 0 0) 0x7ff618c0 con 0x7ff61c024530
2014-08-21 14:45:32.695780 7ff620885700  1 -- 192.168.78.14:0/1007607 -- 
192.168.78.13:6789/0 -- mon_subscribe({monmap=0+}) v2 -- ?+0 0x7ff61c020c20 con 
0x7ff61c024530
2014-08-21 14:45:32.696051 7ff622841700  1 -- 192.168.78.14:0/1007607 -- 
192.168.78.13:6789/0 -- mon_subscribe({monmap=2+,osdmap=0}) v2 -- ?+0 
0x7ff61c025200 con 0x7ff61c024530
2014-08-21 14:45:32.696079 7ff622841700  1 -- 192.168.78.14:0/1007607 -- 
192.168.78.13:6789/0 -- mon_subscribe({monmap=2+,osdmap=0}) v2 -- ?+0 
0x7ff61c0257a0 con 0x7ff61c024530
2014-08-21 14:45:32.696324 7ff620885700  1 -- 192.168.78.14:0/1007607 == mon.2 
192.168.78.13:6789/0 5  mon_map v1  485+0+0 (2066881705 0 0) 
0x7ff6100012f0 con 0x7ff61c024530
2014-08-21 14:45:32.696422 7ff620885700  1 -- 192.168.78.14:0/1007607 == mon.2 
192.168.78.13:6789/0 6  mon_subscribe_ack(300s) v1  20+0+0 (1427523647 
0 0) 0x7ff610001590 con 0x7ff61c024530
2014-08-21 14:45:32.696834 7ff620885700  1 -- 192.168.78.14:0/1007607 == mon.2 
192.168.78.13:6789/0 7  osd_map(46..46 src has 1..46) v3  7172+0+0 
(2083907578 0 0) 0x7ff618c0 con 0x7ff61c024530
2014-08-21 14:45:32.697095 7ff620885700  1 -- 192.168.78.14:0/1007607 == mon.2 
192.168.78.13:6789/0 8  mon_subscribe_ack(300s) v1  20+0+0 (1427523647 
0 0) 0x7ff610002fd0 con 0x7ff61c024530
2014-08-21 14:45:32.704621 7ff622841700  1 -- 192.168.78.14:0/1007607 -- 
192.168.78.13:6789/0 -- mon_command({prefix: get_command_descriptions} v 0) 
v1 -- ?+0 0x7ff61c025c10 con 0x7ff61c024530
2014-08-21 14:45:32.900195 7ff620885700  1 -- 192.168.78.14:0/1007607 == mon.2 
192.168.78.13:6789/0 9  osd_map(46..46 src has 1..46) v3  7172+0+0 
(2083907578 0 0) 0x7ff618c0 con 0x7ff61c024530
2014-08-21 14:45:32.900265 7ff620885700  1 -- 192.168.78.14:0/1007607 == mon.2 
192.168.78.13:6789/0 10  mon_subscribe_ack(300s) v1  20+0+0 (1427523647 
0 0) 0x7ff610002fd0 con 0x7ff61c024530
2014-08-21 14:46:05.691726 7ff61b7fe700  1 -- 192.168.78.14:0/1007607 mark_down 
0x7ff61c024530 -- 0x7ff61c0242c0
2014-08-21 14:46:05.691818 7ff61a6fb700  2 -- 192.168.78.14:0/1007607  
192.168.78.13:6789/0 pipe(0x7ff61c0242c0 sd=3 :60918 s=4 pgs=174 cs=1 l=1 
c=0x7ff61c024530).fault (0) Success
2014-08-21 14:46:05.691913 7ff61b7fe700  1 -- 192.168.78.14:0/1007607 -- 
192.168.78.12:6789/0 -- auth(proto 0 30 bytes epoch 1) v1 -- ?+0 0x7ff608001ba0 
con 0x7ff608001760
2014-08-21 14:46:05.693707 7ff620885700  1 -- 192.168.78.14:0/1007607 == mon.1 
192.168.78.12:6789/0 1  auth_reply(proto 2 0 (0) Success) v1  33+0+0 
(2330663482 0 0) 0x7ff610001220 con 0x7ff608001760
2014-08-21 14:46:05.693982 7ff620885700  1 -- 192.168.78.14:0/1007607 -- 
192.168.78.12:6789/0 -- auth(proto 2 128 bytes epoch 0) v1 -- ?+0 
0x7ff604007520 

[ceph-users] fail to upload file from RadosGW by Python+S3

2014-08-21 Thread debian Only
I can upload files to RadosGW with s3cmd and with the DragonDisk client.

The script below can list all buckets and all the files in a bucket, but
uploading from Python (boto S3) does not work.
###
#coding=utf-8
__author__ = 'Administrator'

#!/usr/bin/env python
import fnmatch
import os, sys
import boto
import boto.s3.connection

access_key = 'VC8R6C193WDVKNTDCRKA'
secret_key = 'ASUWdUTx6PwVXEf/oJRRmDnvKEWp509o3rl1Xt+h'

pidfile = "copytoceph.pid"


def check_pid(pid):
    try:
        os.kill(pid, 0)
    except OSError:
        return False
    else:
        return True


if os.path.isfile(pidfile):
    pid = long(open(pidfile, 'r').read())
    if check_pid(pid):
        print "%s already exists, doing natting" % pidfile
        sys.exit()

pid = str(os.getpid())
file(pidfile, 'w').write(pid)

conn = boto.connect_s3(
    aws_access_key_id=access_key,
    aws_secret_access_key=secret_key,
    host='ceph-radosgw.lab.com',
    port=80,
    is_secure=False,
    calling_format=boto.s3.connection.OrdinaryCallingFormat(),
)

print conn
mybucket = conn.get_bucket('foo')
print mybucket
mylist = mybucket.list()
print mylist
buckets = conn.get_all_buckets()
for bucket in buckets:
    print "{name}\t{created}".format(
        name=bucket.name,
        created=bucket.creation_date,
    )

    for key in bucket.list():
        print "{name}\t{size}\t{modified}".format(
            name=(key.name).encode('utf8'),
            size=key.size,
            modified=key.last_modified,
        )


key = mybucket.new_key('hello.txt')
print key
key.set_contents_from_string('Hello World!')

###

root@ceph-radosgw:~# python rgwupload.py
S3Connection:ceph-radosgw.lab.com
Bucket: foo
boto.s3.bucketlistresultset.BucketListResultSet object at 0x1d6ae10
backup  2014-08-21T10:23:08.000Z
add volume for vms.png  23890   2014-08-21T10:53:43.000Z
foo 2014-08-20T16:11:19.000Z
file0001.txt29  2014-08-21T04:22:25.000Z
galley/DSC_0005.JPG 2142126 2014-08-21T04:24:29.000Z
galley/DSC_0006.JPG 2005662 2014-08-21T04:24:29.000Z
galley/DSC_0009.JPG 1922686 2014-08-21T04:24:29.000Z
galley/DSC_0010.JPG 2067713 2014-08-21T04:24:29.000Z
galley/DSC_0011.JPG 2027689 2014-08-21T04:24:30.000Z
galley/DSC_0012.JPG 2853358 2014-08-21T04:24:30.000Z
galley/DSC_0013.JPG 2844746 2014-08-21T04:24:30.000Z
iso 2014-08-21T04:43:16.000Z
pdf 2014-08-21T09:36:15.000Z
Key: foo,hello.txt

It hangs at this point.

I get the same error when I run the script on the radosgw host itself.

Traceback (most recent call last):
  File "D:/Workspace/S3-Ceph/test.py", line 65, in <module>
    key.set_contents_from_string('Hello World!')
  File "c:\Python27\lib\site-packages\boto\s3\key.py", line 1419, in set_contents_from_string
    encrypt_key=encrypt_key)
  File "c:\Python27\lib\site-packages\boto\s3\key.py", line 1286, in set_contents_from_file
    chunked_transfer=chunked_transfer, size=size)
  File "c:\Python27\lib\site-packages\boto\s3\key.py", line 746, in send_file
    chunked_transfer=chunked_transfer, size=size)
  File "c:\Python27\lib\site-packages\boto\s3\key.py", line 944, in _send_file_internal
    query_args=query_args
  File "c:\Python27\lib\site-packages\boto\s3\connection.py", line 664, in make_request
    retry_handler=retry_handler
  File "c:\Python27\lib\site-packages\boto\connection.py", line 1053, in make_request
    retry_handler=retry_handler)
  File "c:\Python27\lib\site-packages\boto\connection.py", line 1009, in _mexe
    raise BotoServerError(response.status, response.reason, body)
boto.exception.BotoServerError: BotoServerError: 500 Internal Server Error
None


Re: [ceph-users] active+remapped after remove osd via ceph osd out

2014-08-21 Thread Dominik Mostowiec
Hi,
I have 2 PGs in active+remapped state.

ceph health detail
HEALTH_WARN 2 pgs stuck unclean; recovery 24/348041229 degraded (0.000%)
pg 3.1a07 is stuck unclean for 29239.046024, current state
active+remapped, last acting [167,80,145]
pg 3.154a is stuck unclean for 29239.039777, current state
active+remapped, last acting [377,224,292]
recovery 24/348041229 degraded (0.000%)

This happened when I ran 'ceph osd reweight-by-utilization 102'.

What could be wrong?

ceph -v - ceph version 0.67.10 (9d446bd416c52cd785ccf048ca67737ceafcdd7f)

Tunables:
ceph osd crush dump | tail -n 4
  "tunables": { "choose_local_tries": 0,
      "choose_local_fallback_tries": 0,
      "choose_total_tries": 60,
      "chooseleaf_descend_once": 1}}

Cluster:
6 racks X 3 hosts X 22 OSDs. (396 osds: 396 up, 396 in)

 crushtool -i ../crush2  --min-x 0 --num-rep 3  --max-x 10624 --test 
 --show-bad-mappings
is clean.

When 'ceph osd reweight' is 1.0 for all OSDs everything is OK, but then I have nearfull OSDs.

There are no missing OSDs in the crushmap:
 grep device /tmp/crush.txt | grep -v osd
# devices

ceph osd dump | grep -i pool
pool 0 'data' rep size 3 min_size 1 crush_ruleset 0 object_hash
rjenkins pg_num 64 pgp_num 64 last_change 28459 owner 0
crash_replay_interval 45
pool 1 'metadata' rep size 3 min_size 1 crush_ruleset 1 object_hash
rjenkins pg_num 64 pgp_num 64 last_change 28460 owner 0
pool 2 'rbd' rep size 3 min_size 1 crush_ruleset 2 object_hash
rjenkins pg_num 64 pgp_num 64 last_change 28461 owner 0
pool 3 '.rgw.buckets' rep size 3 min_size 1 crush_ruleset 0
object_hash rjenkins pg_num 8192 pgp_num 8192 last_change 73711 owner
0
pool 4 '.log' rep size 3 min_size 1 crush_ruleset 0 object_hash
rjenkins pg_num 2048 pgp_num 2048 last_change 90517 owner 0
pool 5 '.rgw' rep size 3 min_size 1 crush_ruleset 0 object_hash
rjenkins pg_num 128 pgp_num 128 last_change 72467 owner 0
pool 6 '.users.uid' rep size 3 min_size 1 crush_ruleset 0 object_hash
rjenkins pg_num 8 pgp_num 8 last_change 28465 owner 0
pool 7 '.users' rep size 3 min_size 1 crush_ruleset 0 object_hash
rjenkins pg_num 8 pgp_num 8 last_change 28466 owner 0
pool 8 '.usage' rep size 2 min_size 1 crush_ruleset 0 object_hash
rjenkins pg_num 8 pgp_num 8 last_change 28467 owner
18446744073709551615
pool 9 '.intent-log' rep size 3 min_size 1 crush_ruleset 0 object_hash
rjenkins pg_num 8 pgp_num 8 last_change 28468 owner
18446744073709551615
pool 10 '.rgw.control' rep size 3 min_size 1 crush_ruleset 0
object_hash rjenkins pg_num 8 pgp_num 8 last_change 33485 owner
18446744073709551615
pool 11 '.rgw.gc' rep size 3 min_size 1 crush_ruleset 0 object_hash
rjenkins pg_num 8 pgp_num 8 last_change 33487 owner
18446744073709551615
pool 12 '.rgw.root' rep size 2 min_size 1 crush_ruleset 0 object_hash
rjenkins pg_num 8 pgp_num 8 last_change 44540 owner 0
pool 13 '' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins
pg_num 8 pgp_num 8 last_change 46912 owner 0

 ceph pg 3.1a07 query
{ state: active+remapped,
  epoch: 181721,
  up: [
167,
80],
  acting: [
167,
80,
145],
  info: { pgid: 3.1a07,
  last_update: 181719'94809,
  last_complete: 181719'94809,
  log_tail: 159997'91808,
  last_backfill: MAX,
  purged_snaps: [],
  history: { epoch_created: 4,
  last_epoch_started: 179611,
  last_epoch_clean: 179611,
  last_epoch_split: 11522,
  same_up_since: 179610,
  same_interval_since: 179610,
  same_primary_since: 179610,
  last_scrub: 160655'94695,
  last_scrub_stamp: 2014-08-19 04:16:20.308318,
  last_deep_scrub: 158290'91157,
  last_deep_scrub_stamp: 2014-08-12 05:15:25.557591,
  last_clean_scrub_stamp: 2014-08-19 04:16:20.308318},
  stats: { version: 181719'94809,
  reported_seq: 995830,
  reported_epoch: 181721,
  state: active+remapped,
  last_fresh: 2014-08-21 14:53:14.050284,
  last_change: 2014-08-21 09:42:07.473356,
  last_active: 2014-08-21 14:53:14.050284,
  last_clean: 2014-08-21 07:38:51.366084,
  last_became_active: 2013-10-25 13:59:36.125019,
  last_unstale: 2014-08-21 14:53:14.050284,
  mapping_epoch: 179606,
  log_start: 159997'91808,
  ondisk_log_start: 159997'91808,
  created: 4,
  last_epoch_clean: 179611,
  parent: 0.0,
  parent_split_bits: 0,
  last_scrub: 160655'94695,
  last_scrub_stamp: 2014-08-19 04:16:20.308318,
  last_deep_scrub: 158290'91157,
  last_deep_scrub_stamp: 2014-08-12 05:15:25.557591,
  last_clean_scrub_stamp: 2014-08-19 04:16:20.308318,
  log_size: 3001,
  ondisk_log_size: 3001,
  stats_invalid: 0,
  stat_sum: { num_bytes: 2880784014,
  num_objects: 12108,
  num_object_clones: 0,
  num_object_copies: 0,
  num_objects_missing_on_primary: 0,
  num_objects_degraded: 0,
   

Re: [ceph-users] MON running 'ceph -w' doesn't see OSD's booting

2014-08-21 Thread Bruce McFarland
I have 3 storage servers, each with 30 OSDs. Each OSD has a journal that is a 
partition on a virtual drive that is a RAID0 of 6 SSDs. I brought up a 3-OSD (1 
per storage server) cluster to bring up Ceph and figure out configuration etc.

From: Dan Van Der Ster [mailto:daniel.vanders...@cern.ch]
Sent: Thursday, August 21, 2014 1:17 AM
To: Bruce McFarland
Cc: ceph-us...@ceph.com
Subject: Re: [ceph-users] MON running 'ceph -w' doesn't see OSD's booting

Hi,
You only have one OSD? I've seen similar strange things in test pools having 
only one OSD - and I kinda explained it by assuming that OSDs need peers (other 
OSDs sharing the same PG) to behave correctly. Install a second OSD and see how 
it goes...
Cheers, Dan


On 21 Aug 2014, at 02:59, Bruce McFarland bruce.mcfarl...@taec.toshiba.com 
wrote:


I have a cluster with 1 monitor and 3 OSD Servers. Each server has multiple 
OSD's running on it. When I start the OSD using /etc/init.d/ceph start osd.0
I see the expected interaction between the OSD and the monitor authenticating 
keys etc and finally the OSD starts.

Running watching the cluster with 'ceph -w' running on the monitor I never see 
the INFO messages I expect. There isn't a msg from osd.0 for the boot event and 
the expected INFO messages from osdmap and pgmap  for the osd and it's pages 
being added to those maps.  I only see the last time the monitor was booted and 
it wins the monitor election and reports monmap, pgmap, and mdsmap info.

The firewalls are disabled with selinux==disabled and iptables turned off. All 
hosts can ssh w/o passwords into each other and I've verified traffic between 
hosts using tcpdump captures. Any ideas on what I'd need to add to ceph.conf or 
have overlooked would be greatly appreciated.
Thanks,
Bruce

[root@ceph0 ceph]# /etc/init.d/ceph restart osd.0
=== osd.0 ===
=== osd.0 ===
Stopping Ceph osd.0 on ceph0...kill 15676...done
=== osd.0 ===
2014-08-20 17:43:46.456592 7fa51a034700  1 -- :/0 messenger.start
2014-08-20 17:43:46.457363 7fa51a034700  1 -- :/1025971 -- 
209.243.160.84:6789/0 -- auth(proto 0 26 bytes epoch 0) v1 -- ?+0 
0x7fa51402f9e0 con 0x7fa51402f570
2014-08-20 17:43:46.458229 7fa5189f0700  1 -- 209.243.160.83:0/1025971 learned 
my addr 209.243.160.83:0/1025971
2014-08-20 17:43:46.459664 7fa5135fe700  1 -- 209.243.160.83:0/1025971 == 
mon.0 209.243.160.84:6789/0 1  mon_map v1  200+0+0 (3445960796 0 0) 
0x7fa508000ab0 con 0x7fa51402f570
2014-08-20 17:43:46.459849 7fa5135fe700  1 -- 209.243.160.83:0/1025971 == 
mon.0 209.243.160.84:6789/0 2  auth_reply(proto 2 0 (0) Success) v1  
33+0+0 (536914167 0 0) 0x7fa508000f60 con 0x7fa51402f570
2014-08-20 17:43:46.460180 7fa5135fe700  1 -- 209.243.160.83:0/1025971 -- 
209.243.160.84:6789/0 -- auth(proto 2 32 bytes epoch 0) v1 -- ?+0 
0x7fa4fc0012d0 con 0x7fa51402f570
2014-08-20 17:43:46.461341 7fa5135fe700  1 -- 209.243.160.83:0/1025971 == 
mon.0 209.243.160.84:6789/0 3  auth_reply(proto 2 0 (0) Success) v1  
206+0+0 (409581826 0 0) 0x7fa508000f60 con 0x7fa51402f570
2014-08-20 17:43:46.461514 7fa5135fe700  1 -- 209.243.160.83:0/1025971 -- 
209.243.160.84:6789/0 -- auth(proto 2 165 bytes epoch 0) v1 -- ?+0 
0x7fa4fc001cf0 con 0x7fa51402f570
2014-08-20 17:43:46.462824 7fa5135fe700  1 -- 209.243.160.83:0/1025971 == 
mon.0 209.243.160.84:6789/0 4  auth_reply(proto 2 0 (0) Success) v1  
393+0+0 (2134012784 0 0) 0x7fa5080011d0 con 0x7fa51402f570
2014-08-20 17:43:46.463011 7fa5135fe700  1 -- 209.243.160.83:0/1025971 -- 
209.243.160.84:6789/0 -- mon_subscribe({monmap=0+}) v2 -- ?+0 0x7fa51402bbc0 
con 0x7fa51402f570
2014-08-20 17:43:46.463073 7fa5135fe700  1 -- 209.243.160.83:0/1025971 -- 
209.243.160.84:6789/0 -- auth(proto 2 2 bytes epoch 0) v1 -- ?+0 0x7fa4fc0025d0 
con 0x7fa51402f570
2014-08-20 17:43:46.463329 7fa51a034700  1 -- 209.243.160.83:0/1025971 -- 
209.243.160.84:6789/0 -- mon_subscribe({monmap=2+,osdmap=0}) v2 -- ?+0 
0x7fa514030490 con 0x7fa51402f570
2014-08-20 17:43:46.463363 7fa51a034700  1 -- 209.243.160.83:0/1025971 -- 
209.243.160.84:6789/0 -- mon_subscribe({monmap=2+,osdmap=0}) v2 -- ?+0 
0x7fa5140309b0 con 0x7fa51402f570
2014-08-20 17:43:46.463564 7fa5135fe700  1 -- 209.243.160.83:0/1025971 == 
mon.0 209.243.160.84:6789/0 5  mon_map v1  200+0+0 (3445960796 0 0) 
0x7fa508001100 con 0x7fa51402f570
2014-08-20 17:43:46.463639 7fa5135fe700  1 -- 209.243.160.83:0/1025971 == 
mon.0 209.243.160.84:6789/0 6  mon_subscribe_ack(300s) v1  20+0+0 
(540052875 0 0) 0x7fa5080013e0 con 0x7fa51402f570
2014-08-20 17:43:46.463707 7fa5135fe700  1 -- 209.243.160.83:0/1025971 == 
mon.0 209.243.160.84:6789/0 7  auth_reply(proto 2 0 (0) Success) v1  
194+0+0 (1040860857 0 0) 0x7fa5080015d0 con 0x7fa51402f570
2014-08-20 17:43:46.468877 7fa51a034700  1 -- 209.243.160.83:0/1025971 -- 
209.243.160.84:6789/0 -- mon_command({prefix: get_command_descriptions} v 
0) v1 -- ?+0 0x7fa514030e20 con 0x7fa51402f570
2014-08-20 

[ceph-users] Ceph Cinder Capabilities reports wrong free size

2014-08-21 Thread Jens-Christian Fischer
I am working with Cinder Multi Backends on an Icehouse installation and have 
added another backend (Quobyte) to a previously running Cinder/Ceph 
installation.

I can now create Quobyte volumes, but no longer any Ceph volumes. The 
cinder-scheduler log gets an incorrect number for the free size of the volumes 
pool and disregards the RBD backend as a viable storage system:

2014-08-21 16:42:49.847 1469 DEBUG 
cinder.openstack.common.scheduler.filters.capabilities_filter [r...] extra_spec 
requirement 'rbd' does not match 'quobyte' _satisfies_extra_specs 
/usr/lib/python2.7/dist-packages/cinder/openstack/common/scheduler/filters/capabilities_filter.py:55
2014-08-21 16:42:49.848 1469 DEBUG 
cinder.openstack.common.scheduler.filters.capabilities_filter [r...] host 
'controller@quobyte': free_capacity_gb: 156395.931061 fails resource_type 
extra_specs requirements host_passes 
/usr/lib/python2.7/dist-packages/cinder/openstack/common/scheduler/filters/capabilities_filter.py:68
2014-08-21 16:42:49.848 1469 WARNING cinder.scheduler.filters.capacity_filter 
[r...-] Insufficient free space for volume creation (requested / avail): 20/8.0
2014-08-21 16:42:49.849 1469 ERROR cinder.scheduler.flows.create_volume [r.] 
Failed to schedule_create_volume: No valid host was found.

here’s our /etc/cinder/cinder.conf

— cut —
[DEFAULT]
rootwrap_config = /etc/cinder/rootwrap.conf
api_paste_confg = /etc/cinder/api-paste.ini
# iscsi_helper = tgtadm
volume_name_template = volume-%s
# volume_group = cinder-volumes
verbose = True
auth_strategy = keystone
state_path = /var/lib/cinder
lock_path = /var/lock/cinder
volumes_dir = /var/lib/cinder/volumes
rabbit_host=10.2.0.10
use_syslog=False
api_paste_config=/etc/cinder/api-paste.ini
glance_num_retries=0
debug=True
storage_availability_zone=nova
glance_api_ssl_compression=False
glance_api_insecure=False
rabbit_userid=openstack
rabbit_use_ssl=False
log_dir=/var/log/cinder
osapi_volume_listen=0.0.0.0
glance_api_servers=1.2.3.4:9292
rabbit_virtual_host=/
scheduler_driver=cinder.scheduler.filter_scheduler.FilterScheduler
default_availability_zone=nova
rabbit_hosts=10.2.0.10:5672
control_exchange=openstack
rabbit_ha_queues=False
glance_api_version=2
amqp_durable_queues=False
rabbit_password=secret
rabbit_port=5672
rpc_backend=cinder.openstack.common.rpc.impl_kombu
enabled_backends=quobyte,rbd
default_volume_type=rbd

[database]
idle_timeout=3600
connection=mysql://cinder:secret@10.2.0.10/cinder

[quobyte]
quobyte_volume_url=quobyte://hostname.cloud.example.com/openstack-volumes
volume_driver=cinder.volume.drivers.quobyte.QuobyteDriver

[rbd-volumes]
volume_backend_name=rbd-volumes
rbd_pool=volumes
rbd_flatten_volume_from_snapshot=False
rbd_user=cinder
rbd_ceph_conf=/etc/ceph/ceph.conf
rbd_secret_uuid=1234-5678-ABCD-…-DEF
rbd_max_clone_depth=5
volume_driver=cinder.volume.drivers.rbd.RBDDriver

— cut ---

any ideas?

cheers
Jens-Christian

-- 
SWITCH
Jens-Christian Fischer, Peta Solutions
Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
phone +41 44 268 15 15, direct +41 44 268 15 71
jens-christian.fisc...@switch.ch
http://www.switch.ch

http://www.switch.ch/stories



Re: [ceph-users] Hanging ceph client

2014-08-21 Thread Gregory Farnum
Yeah, that's fairly bizarre. Have you turned up the monitor logs and
seen what they're doing? Have you checked that the nodes otherwise
have the same configuration (firewall rules, client key permissions,
installed version of Ceph...)?
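(For example, something like ceph tell mon.ID injectargs '--debug_mon 10
--debug_ms 1', or debug mon = 10 / debug ms = 1 under [mon] in ceph.conf plus a
restart; the exact levels are just a starting point.)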
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Thu, Aug 21, 2014 at 6:50 AM, Damien Churchill dam...@gmail.com wrote:
 Hi,

 On a freshly created 4 node cluster I'm struggling to get the 4th node
 to create correctly. ceph-deploy is unable to create the OSDs on it
 and when logging in to the node and attempting to run `ceph -s`
 manually (after copying the client.admin keyring) with debug
 parameters it ends up hanging and looping over mon_command({prefix:
 get_command_descriptions} v 0).

 I'm not sure what else to try to find out why this is happening. It
 seems like it's able to talk to the monitors okay as it looks like it
 is authenticating, and the same command runs fine on the first 3 nodes
 which are running monitors, but just hangs on the node that isn't.

 Thanks in advance for any help!



Re: [ceph-users] fail to upload file from RadosGW by Python+S3

2014-08-21 Thread debian Only
When I use DragonDisk and unselect the "Expect: 100-continue" header, the
upload succeeds. When that option is selected, the upload hangs.

Maybe the Python script cannot upload the file because of 100-continue? My
radosgw Apache2 does not use 100-continue.

If my guess is right, how can I disable this in the boto S3 connection so the
Python script can upload files?



2014-08-21 20:57 GMT+07:00 debian Only onlydeb...@gmail.com:

 i can upload file to RadosGW by s3cmd , and software Dragondisk.

 the script can list all bucket and all file in the bucket.  but can not
 from python s3.
 ###
 #coding=utf-8
 __author__ = 'Administrator'

 #!/usr/bin/env python
 import fnmatch
 import os, sys
 import boto
 import boto.s3.connection

 access_key = 'VC8R6C193WDVKNTDCRKA'
 secret_key = 'ASUWdUTx6PwVXEf/oJRRmDnvKEWp509o3rl1Xt+h'

 pidfile = copytoceph.pid


 def check_pid(pid):
 try:
 os.kill(pid, 0)
 except OSError:
 return False
 else:
 return True


 if os.path.isfile(pidfile):
 pid = long(open(pidfile, 'r').read())
 if check_pid(pid):
 print %s already exists, doing natting % pidfile
 sys.exit()

 pid = str(os.getpid())
 file(pidfile, 'w').write(pid)

 conn = boto.connect_s3(
 aws_access_key_id=access_key,
 aws_secret_access_key=secret_key,
 host='ceph-radosgw.lab.com',
 port=80,
 is_secure=False,
 calling_format=boto.s3.connection.OrdinaryCallingFormat(),
 )

 print conn
 mybucket = conn.get_bucket('foo')
 print mybucket
 mylist = mybucket.list()
 print mylist
 buckets = conn.get_all_buckets()
 for bucket in buckets:
 print {name}\t{created}.format(
 name=bucket.name,
 created=bucket.creation_date,
 )

 for key in bucket.list():
 print {name}\t{size}\t{modified}.format(
 name=(key.name).encode('utf8'),
 size=key.size,
 modified=key.last_modified,
 )


 key = mybucket.new_key('hello.txt')
 print key
 key.set_contents_from_string('Hello World!')

 ###

 root@ceph-radosgw:~# python rgwupload.py
 S3Connection:ceph-radosgw.lab.com
 Bucket: foo
 boto.s3.bucketlistresultset.BucketListResultSet object at 0x1d6ae10
 backup  2014-08-21T10:23:08.000Z
 add volume for vms.png  23890   2014-08-21T10:53:43.000Z
 foo 2014-08-20T16:11:19.000Z
 file0001.txt29  2014-08-21T04:22:25.000Z
 galley/DSC_0005.JPG 2142126 2014-08-21T04:24:29.000Z
 galley/DSC_0006.JPG 2005662 2014-08-21T04:24:29.000Z
 galley/DSC_0009.JPG 1922686 2014-08-21T04:24:29.000Z
 galley/DSC_0010.JPG 2067713 2014-08-21T04:24:29.000Z
 galley/DSC_0011.JPG 2027689 2014-08-21T04:24:30.000Z
 galley/DSC_0012.JPG 2853358 2014-08-21T04:24:30.000Z
 galley/DSC_0013.JPG 2844746 2014-08-21T04:24:30.000Z
 iso 2014-08-21T04:43:16.000Z
 pdf 2014-08-21T09:36:15.000Z
 Key: foo,hello.txt

 it hanged at here.

 Same error when i run this script on radosgw host.

 Traceback (most recent call last):
   File D:/Workspace/S3-Ceph/test.py, line 65, in module
 key.set_contents_from_string('Hello World!')
   File c:\Python27\lib\site-packages\boto\s3\key.py, line 1419, in
 set_contents_from_string
 encrypt_key=encrypt_key)
   File c:\Python27\lib\site-packages\boto\s3\key.py, line 1286, in
 set_contents_from_file
 chunked_transfer=chunked_transfer, size=size)
   File c:\Python27\lib\site-packages\boto\s3\key.py, line 746, in
 send_file
 chunked_transfer=chunked_transfer, size=size)
   File c:\Python27\lib\site-packages\boto\s3\key.py, line 944, in
 _send_file_internal
 query_args=query_args
   File c:\Python27\lib\site-packages\boto\s3\connection.py, line 664, in
 make_request
 retry_handler=retry_handler
   File c:\Python27\lib\site-packages\boto\connection.py, line 1053, in
 make_request
 retry_handler=retry_handler)
   File c:\Python27\lib\site-packages\boto\connection.py, line 1009, in
 _mexe
 raise BotoServerError(response.status, response.reason, body)
 boto.exception.BotoServerError: BotoServerError: 500 Internal Server Error
 None



Re: [ceph-users] fail to upload file from RadosGW by Python+S3

2014-08-21 Thread debian Only
My radosgw has 100-continue disabled:

[global]
fsid = 075f1aae-48de-412e-b024-b0f014dbc8cf
mon_initial_members = ceph01-vm, ceph02-vm, ceph04-vm
mon_host = 192.168.123.251,192.168.123.252,192.168.123.250
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true

rgw print continue = false
rgw dns name = ceph-radosgw
osd pool default pg num = 128
osd pool default pgp num = 128

#debug rgw = 20
[client.radosgw.gateway]
host = ceph-radosgw
keyring = /etc/ceph/ceph.client.radosgw.keyring
rgw socket path = /var/run/ceph/ceph.radosgw.gateway.fastcgi.sock
log file = /var/log/ceph/client.radosgw.gateway.log



2014-08-21 22:42 GMT+07:00 debian Only onlydeb...@gmail.com:

 when i use Dragondisk , i unselect Expect 100-continue  header , upload
 file sucessfully.  when select this option, upload file will hang.

 maybe the python script can not upload file due to the 100-continue ??  my
 radosgw Apache2 not use 100-continue.

 if my guess is ture,  how to disable this in python s3-connection and make
 python script working for upload file?



 2014-08-21 20:57 GMT+07:00 debian Only onlydeb...@gmail.com:

 i can upload file to RadosGW by s3cmd , and software Dragondisk.

 the script can list all bucket and all file in the bucket.  but can not
 from python s3.
 ###
 #coding=utf-8
 __author__ = 'Administrator'

 #!/usr/bin/env python
 import fnmatch
 import os, sys
 import boto
 import boto.s3.connection

 access_key = 'VC8R6C193WDVKNTDCRKA'
 secret_key = 'ASUWdUTx6PwVXEf/oJRRmDnvKEWp509o3rl1Xt+h'

 pidfile = copytoceph.pid


 def check_pid(pid):
 try:
 os.kill(pid, 0)
 except OSError:
 return False
 else:
 return True


 if os.path.isfile(pidfile):
 pid = long(open(pidfile, 'r').read())
 if check_pid(pid):
 print %s already exists, doing natting % pidfile
  sys.exit()

 pid = str(os.getpid())
 file(pidfile, 'w').write(pid)

 conn = boto.connect_s3(
 aws_access_key_id=access_key,
 aws_secret_access_key=secret_key,
 host='ceph-radosgw.lab.com',
 port=80,
 is_secure=False,
 calling_format=boto.s3.connection.OrdinaryCallingFormat(),
  )

 print conn
 mybucket = conn.get_bucket('foo')
 print mybucket
 mylist = mybucket.list()
 print mylist
 buckets = conn.get_all_buckets()
 for bucket in buckets:
 print {name}\t{created}.format(
 name=bucket.name,
 created=bucket.creation_date,
 )

 for key in bucket.list():
 print {name}\t{size}\t{modified}.format(
 name=(key.name).encode('utf8'),
  size=key.size,
 modified=key.last_modified,
 )


 key = mybucket.new_key('hello.txt')
 print key
 key.set_contents_from_string('Hello World!')

 ###

 root@ceph-radosgw:~# python rgwupload.py
 S3Connection:ceph-radosgw.lab.com
 <Bucket: foo>
 <boto.s3.bucketlistresultset.BucketListResultSet object at 0x1d6ae10>
 backup  2014-08-21T10:23:08.000Z
 add volume for vms.png  23890   2014-08-21T10:53:43.000Z
 foo 2014-08-20T16:11:19.000Z
 file0001.txt29  2014-08-21T04:22:25.000Z
 galley/DSC_0005.JPG 2142126 2014-08-21T04:24:29.000Z
 galley/DSC_0006.JPG 2005662 2014-08-21T04:24:29.000Z
 galley/DSC_0009.JPG 1922686 2014-08-21T04:24:29.000Z
 galley/DSC_0010.JPG 2067713 2014-08-21T04:24:29.000Z
 galley/DSC_0011.JPG 2027689 2014-08-21T04:24:30.000Z
 galley/DSC_0012.JPG 2853358 2014-08-21T04:24:30.000Z
 galley/DSC_0013.JPG 2844746 2014-08-21T04:24:30.000Z
 iso 2014-08-21T04:43:16.000Z
 pdf 2014-08-21T09:36:15.000Z
 <Key: foo,hello.txt>

 It hangs at this point.

 I get the same error when I run this script on the radosgw host.

 Traceback (most recent call last):
   File "D:/Workspace/S3-Ceph/test.py", line 65, in <module>
     key.set_contents_from_string('Hello World!')
   File "c:\Python27\lib\site-packages\boto\s3\key.py", line 1419, in set_contents_from_string
     encrypt_key=encrypt_key)
   File "c:\Python27\lib\site-packages\boto\s3\key.py", line 1286, in set_contents_from_file
     chunked_transfer=chunked_transfer, size=size)
   File "c:\Python27\lib\site-packages\boto\s3\key.py", line 746, in send_file
     chunked_transfer=chunked_transfer, size=size)
   File "c:\Python27\lib\site-packages\boto\s3\key.py", line 944, in _send_file_internal
     query_args=query_args
   File "c:\Python27\lib\site-packages\boto\s3\connection.py", line 664, in make_request
     retry_handler=retry_handler
   File "c:\Python27\lib\site-packages\boto\connection.py", line 1053, in make_request
     retry_handler=retry_handler)
   File "c:\Python27\lib\site-packages\boto\connection.py", line 1009, in _mexe
     raise BotoServerError(response.status, response.reason, body)
 boto.exception.BotoServerError: BotoServerError: 500 Internal Server Error
 None



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MON running 'ceph -w' doesn't see OSD's booting

2014-08-21 Thread Gregory Farnum
Are the OSD processes still alive? What's the osdmap output of ceph
-w (which was not in the output you pasted)?
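For reference, the osdmap and pgmap state can also be pulled on demand rather
than waiting for INFO lines to scroll past in ceph -w; a quick sketch of the
usual checks:

  ceph osd stat                  # how many OSDs the map knows about, and how many are up/in
  ceph osd tree                  # whether osd.0 appears under its host and is marked up
  ceph osd dump | grep '^osd'    # per-OSD state and address recorded in the osdmap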
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Thu, Aug 21, 2014 at 7:11 AM, Bruce McFarland
bruce.mcfarl...@taec.toshiba.com wrote:
 I have 3 storage servers each with 30 osds. Each osd has a journal that is a
 partition on a virtual drive that is a raid0 of 6 ssds. I brought up a 3 osd
 (1 per storage server) cluster to bring up Ceph and figure out configuration
 etc.



 From: Dan Van Der Ster [mailto:daniel.vanders...@cern.ch]
 Sent: Thursday, August 21, 2014 1:17 AM
 To: Bruce McFarland
 Cc: ceph-us...@ceph.com
 Subject: Re: [ceph-users] MON running 'ceph -w' doesn't see OSD's booting



 Hi,

 You only have one OSD? I’ve seen similar strange things in test pools having
 only one OSD — and I kinda explained it by assuming that OSDs need peers
 (other OSDs sharing the same PG) to behave correctly. Install a second OSD
 and see how it goes...

 Cheers, Dan





 On 21 Aug 2014, at 02:59, Bruce McFarland bruce.mcfarl...@taec.toshiba.com
 wrote:



 I have a cluster with 1 monitor and 3 OSD servers. Each server has multiple
 OSDs running on it. When I start an OSD using /etc/init.d/ceph start osd.0

 I see the expected interaction between the OSD and the monitor
 (authenticating keys, etc.) and finally the OSD starts.



 Watching the cluster with 'ceph -w' running on the monitor, I never see the
 INFO messages I expect. There is no message from osd.0 for the boot event,
 and none of the expected INFO messages from the osdmap and pgmap for the OSD
 and its PGs being added to those maps. I only see the last time the monitor
 was booted, when it won the monitor election and reported monmap, pgmap, and
 mdsmap info.



 The firewalls are disabled, with SELinux set to disabled and iptables turned
 off. All hosts can ssh into each other without passwords, and I've verified
 traffic between hosts using tcpdump captures. Any ideas on what I'd need to
 add to ceph.conf, or what I might have overlooked, would be greatly appreciated.

 Thanks,

 Bruce



 [root@ceph0 ceph]# /etc/init.d/ceph restart osd.0

 === osd.0 ===

 === osd.0 ===

 Stopping Ceph osd.0 on ceph0...kill 15676...done

 === osd.0 ===

 2014-08-20 17:43:46.456592 7fa51a034700  1 -- :/0 messenger.start

 2014-08-20 17:43:46.457363 7fa51a034700  1 -- :/1025971 --
 209.243.160.84:6789/0 -- auth(proto 0 26 bytes epoch 0) v1 -- ?+0
 0x7fa51402f9e0 con 0x7fa51402f570

 2014-08-20 17:43:46.458229 7fa5189f0700  1 -- 209.243.160.83:0/1025971
 learned my addr 209.243.160.83:0/1025971

 2014-08-20 17:43:46.459664 7fa5135fe700  1 -- 209.243.160.83:0/1025971 ==
 mon.0 209.243.160.84:6789/0 1  mon_map v1  200+0+0 (3445960796 0 0)
 0x7fa508000ab0 con 0x7fa51402f570

 2014-08-20 17:43:46.459849 7fa5135fe700  1 -- 209.243.160.83:0/1025971 ==
 mon.0 209.243.160.84:6789/0 2  auth_reply(proto 2 0 (0) Success) v1 
 33+0+0 (536914167 0 0) 0x7fa508000f60 con 0x7fa51402f570

 2014-08-20 17:43:46.460180 7fa5135fe700  1 -- 209.243.160.83:0/1025971 --
 209.243.160.84:6789/0 -- auth(proto 2 32 bytes epoch 0) v1 -- ?+0
 0x7fa4fc0012d0 con 0x7fa51402f570

 2014-08-20 17:43:46.461341 7fa5135fe700  1 -- 209.243.160.83:0/1025971 ==
 mon.0 209.243.160.84:6789/0 3  auth_reply(proto 2 0 (0) Success) v1 
 206+0+0 (409581826 0 0) 0x7fa508000f60 con 0x7fa51402f570

 2014-08-20 17:43:46.461514 7fa5135fe700  1 -- 209.243.160.83:0/1025971 --
 209.243.160.84:6789/0 -- auth(proto 2 165 bytes epoch 0) v1 -- ?+0
 0x7fa4fc001cf0 con 0x7fa51402f570

 2014-08-20 17:43:46.462824 7fa5135fe700  1 -- 209.243.160.83:0/1025971 ==
 mon.0 209.243.160.84:6789/0 4  auth_reply(proto 2 0 (0) Success) v1 
 393+0+0 (2134012784 0 0) 0x7fa5080011d0 con 0x7fa51402f570

 2014-08-20 17:43:46.463011 7fa5135fe700  1 -- 209.243.160.83:0/1025971 --
 209.243.160.84:6789/0 -- mon_subscribe({monmap=0+}) v2 -- ?+0 0x7fa51402bbc0
 con 0x7fa51402f570

 2014-08-20 17:43:46.463073 7fa5135fe700  1 -- 209.243.160.83:0/1025971 --
 209.243.160.84:6789/0 -- auth(proto 2 2 bytes epoch 0) v1 -- ?+0
 0x7fa4fc0025d0 con 0x7fa51402f570

 2014-08-20 17:43:46.463329 7fa51a034700  1 -- 209.243.160.83:0/1025971 --
 209.243.160.84:6789/0 -- mon_subscribe({monmap=2+,osdmap=0}) v2 -- ?+0
 0x7fa514030490 con 0x7fa51402f570

 2014-08-20 17:43:46.463363 7fa51a034700  1 -- 209.243.160.83:0/1025971 --
 209.243.160.84:6789/0 -- mon_subscribe({monmap=2+,osdmap=0}) v2 -- ?+0
 0x7fa5140309b0 con 0x7fa51402f570

 2014-08-20 17:43:46.463564 7fa5135fe700  1 -- 209.243.160.83:0/1025971 ==
 mon.0 209.243.160.84:6789/0 5  mon_map v1  200+0+0 (3445960796 0 0)
 0x7fa508001100 con 0x7fa51402f570

 2014-08-20 17:43:46.463639 7fa5135fe700  1 -- 209.243.160.83:0/1025971 ==
 mon.0 209.243.160.84:6789/0 6  mon_subscribe_ack(300s) v1  20+0+0
 (540052875 0 0) 0x7fa5080013e0 con 0x7fa51402f570

 2014-08-20 17:43:46.463707 7fa5135fe700  1 -- 209.243.160.83:0/1025971 ==
 mon.0 209.243.160.84:6789/0 7  

Re: [ceph-users] Ceph Cinder Capabilities reports wrong free size

2014-08-21 Thread Gregory Farnum
On Thu, Aug 21, 2014 at 8:29 AM, Jens-Christian Fischer
jens-christian.fisc...@switch.ch wrote:
 I am working with Cinder Multi Backends on an Icehouse installation and have 
 added another backend (Quobyte) to a previously running Cinder/Ceph 
 installation.

 I can now create Quobyte volumes, but no longer any Ceph volumes. The 
 cinder-scheduler log gets an incorrect number for the free size of the 
 volumes pool and disregards the RBD backend as a viable storage system:

I don't know much about Cinder, but given this output:

 2014-08-21 16:42:49.847 1469 DEBUG 
 cinder.openstack.common.scheduler.filters.capabilities_filter [r...] 
 extra_spec requirement 'rbd' does not match 'quobyte' _satisfies_extra_specs 
 /usr/lib/python2.7/dist-packages/cinder/openstack/common/scheduler/filters/capabilities_filter.py:55
 2014-08-21 16:42:49.848 1469 DEBUG 
 cinder.openstack.common.scheduler.filters.capabilities_filter [r...] host 
 'controller@quobyte': free_capacity_gb: 156395.931061 fails resource_type 
 extra_specs requirements host_passes 
 /usr/lib/python2.7/dist-packages/cinder/openstack/common/scheduler/filters/capabilities_filter.py:68
 2014-08-21 16:42:49.848 1469 WARNING cinder.scheduler.filters.capacity_filter 
 [r...-] Insufficient free space for volume creation (requested / avail): 
 20/8.0
 2014-08-21 16:42:49.849 1469 ERROR cinder.scheduler.flows.create_volume [r.] 
 Failed to schedule_create_volume: No valid host was found.

I suspect you'll have better luck on the Openstack mailing list. :)

Although as a random quick guess, I think you may need to make the 'rbd' and
'rbd-volumes' strings (from your conf file) match?
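As a sketch only (the section name and the type-key value below are guesses
based on the conf further down, not a verified fix), that could mean
something like:

  # cinder.conf: list the real config section name in enabled_backends
  enabled_backends = quobyte,rbd-volumes

  # and point the 'rbd' volume type at that backend's volume_backend_name:
  #   cinder type-key rbd set volume_backend_name=rbd-volumes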
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com



 here’s our /etc/cinder/cinder.conf

 — cut —
 [DEFAULT]
 rootwrap_config = /etc/cinder/rootwrap.conf
 api_paste_confg = /etc/cinder/api-paste.ini
 # iscsi_helper = tgtadm
 volume_name_template = volume-%s
 # volume_group = cinder-volumes
 verbose = True
 auth_strategy = keystone
 state_path = /var/lib/cinder
 lock_path = /var/lock/cinder
 volumes_dir = /var/lib/cinder/volumes
 rabbit_host=10.2.0.10
 use_syslog=False
 api_paste_config=/etc/cinder/api-paste.ini
 glance_num_retries=0
 debug=True
 storage_availability_zone=nova
 glance_api_ssl_compression=False
 glance_api_insecure=False
 rabbit_userid=openstack
 rabbit_use_ssl=False
 log_dir=/var/log/cinder
 osapi_volume_listen=0.0.0.0
 glance_api_servers=1.2.3.4:9292
 rabbit_virtual_host=/
 scheduler_driver=cinder.scheduler.filter_scheduler.FilterScheduler
 default_availability_zone=nova
 rabbit_hosts=10.2.0.10:5672
 control_exchange=openstack
 rabbit_ha_queues=False
 glance_api_version=2
 amqp_durable_queues=False
 rabbit_password=secret
 rabbit_port=5672
 rpc_backend=cinder.openstack.common.rpc.impl_kombu
 enabled_backends=quobyte,rbd
 default_volume_type=rbd

 [database]
 idle_timeout=3600
 connection=mysql://cinder:secret@10.2.0.10/cinder

 [quobyte]
 quobyte_volume_url=quobyte://hostname.cloud.example.com/openstack-volumes
 volume_driver=cinder.volume.drivers.quobyte.QuobyteDriver

 [rbd-volumes]
 volume_backend_name=rbd-volumes
 rbd_pool=volumes
 rbd_flatten_volume_from_snapshot=False
 rbd_user=cinder
 rbd_ceph_conf=/etc/ceph/ceph.conf
 rbd_secret_uuid=1234-5678-ABCD-…-DEF
 rbd_max_clone_depth=5
 volume_driver=cinder.volume.drivers.rbd.RBDDriver

 — cut ---

 any ideas?

 cheers
 Jens-Christian

 --
 SWITCH
 Jens-Christian Fischer, Peta Solutions
 Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
 phone +41 44 268 15 15, direct +41 44 268 15 71
 jens-christian.fisc...@switch.ch
 http://www.switch.ch

 http://www.switch.ch/stories

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MON running 'ceph -w' doesn't see OSD's booting

2014-08-21 Thread Bruce McFarland
Yes, all of the ceph-osd processes are up and running. I performed a ceph-mon 
restart to see if that might trigger the osdmap update, but there are none of the 
INFO messages from the osdmap or the pgmap that I expect when the OSDs are started. 
All of the OSDs and their hosts appear in the CRUSH map and in ceph.conf. 

Since I went through a bunch of issues getting the multiple-OSDs-per-host setup 
working, I'm assuming that the monitor's tables might be hosed, so I'm going to 
purgedata and reinstall the monitor and see if it builds the proper mappings. 
I've stopped all of the OSDs and verified that there aren't any active 
ceph-osd processes. Then I'll follow the procedure for bringing a new monitor 
online into an existing cluster so that I use the proper fsid.

2014-08-20 17:20:24.648538 7f326ebfd700  0 monclient: hunting for new mon
2014-08-20 17:20:24.648857 7f327455f700  0 -- 209.243.160.84:0/1005462  
209.243.160.84:6789/0 pipe(0x7f3264020300 sd=3 :0 s=1 pgs=0 cs=0 l=1 
c=0x7f3264020570).fault
2014-08-20 17:20:26.077687 mon.0 [INF] mon.ceph-mon01@0 won leader election 
with quorum 0
2014-08-20 17:20:26.077810 mon.0 [INF] monmap e1: 1 mons at 
{ceph-mon01=209.243.160.84:6789/0}
2014-08-20 17:20:26.077931 mon.0 [INF] pgmap v555: 192 pgs: 192 creating; 0 
bytes data, 0 kB used, 0 kB / 0 kB avail
2014-08-20 17:20:26.078032 mon.0 [INF] mdsmap e1: 0/0/1 up


-Original Message-
From: Gregory Farnum [mailto:g...@inktank.com] 
Sent: Thursday, August 21, 2014 8:44 AM
To: Bruce McFarland
Cc: Dan Van Der Ster; ceph-us...@ceph.com
Subject: Re: [ceph-users] MON running 'ceph -w' doesn't see OSD's booting

Are the OSD processes still alive? What's the osdmap output of ceph -w (which 
was not in the output you pasted)?
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Thu, Aug 21, 2014 at 7:11 AM, Bruce McFarland 
bruce.mcfarl...@taec.toshiba.com wrote:
 I have 3 storage servers each with 30 osds. Each osd has a journal 
 that is a partition on a virtual drive that is a raid0 of 6 ssds. I 
 brought up a 3 osd
 (1 per storage server) cluster to bring up Ceph and figure out 
 configuration etc.



 From: Dan Van Der Ster [mailto:daniel.vanders...@cern.ch]
 Sent: Thursday, August 21, 2014 1:17 AM
 To: Bruce McFarland
 Cc: ceph-us...@ceph.com
 Subject: Re: [ceph-users] MON running 'ceph -w' doesn't see OSD's 
 booting



 Hi,

 You only have one OSD? I’ve seen similar strange things in test pools 
 having only one OSD — and I kinda explained it by assuming that OSDs 
 need peers (other OSDs sharing the same PG) to behave correctly. 
 Install a second OSD and see how it goes...

 Cheers, Dan





 On 21 Aug 2014, at 02:59, Bruce McFarland 
 bruce.mcfarl...@taec.toshiba.com
 wrote:



 I have a cluster with 1 monitor and 3 OSD Servers. Each server has 
 multiple OSD’s running on it. When I start the OSD using 
 /etc/init.d/ceph start osd.0

 I see the expected interaction between the OSD and the monitor 
 authenticating keys etc and finally the OSD starts.



 Running watching the cluster with ‘ceph –w’ running on the monitor I 
 never see the INFO messages I expect. There isn’t a msg from osd.0 for 
 the boot event and the expected INFO messages from osdmap and pgmap  
 for the osd and it’s pages being added to those maps.  I only see the 
 last time the monitor was booted and it wins the monitor election and 
 reports monmap, pgmap, and mdsmap info.



 The firewalls are disabled with selinux==disabled and iptables turned off.
 All hosts can ssh w/o passwords into each other and I’ve verified 
 traffic between hosts using tcpdump captures. Any ideas on what I’d 
 need to add to ceph.conf or have overlooked would be greatly appreciated.

 Thanks,

 Bruce



 [root@ceph0 ceph]# /etc/init.d/ceph restart osd.0

 === osd.0 ===

 === osd.0 ===

 Stopping Ceph osd.0 on ceph0...kill 15676...done

 === osd.0 ===

 2014-08-20 17:43:46.456592 7fa51a034700  1 -- :/0 messenger.start

 2014-08-20 17:43:46.457363 7fa51a034700  1 -- :/1025971 --
 209.243.160.84:6789/0 -- auth(proto 0 26 bytes epoch 0) v1 -- ?+0
 0x7fa51402f9e0 con 0x7fa51402f570

 2014-08-20 17:43:46.458229 7fa5189f0700  1 -- 209.243.160.83:0/1025971 
 learned my addr 209.243.160.83:0/1025971

 2014-08-20 17:43:46.459664 7fa5135fe700  1 -- 209.243.160.83:0/1025971 
 ==
 mon.0 209.243.160.84:6789/0 1  mon_map v1  200+0+0 (3445960796 
 0 0)
 0x7fa508000ab0 con 0x7fa51402f570

 2014-08-20 17:43:46.459849 7fa5135fe700  1 -- 209.243.160.83:0/1025971 
 ==
 mon.0 209.243.160.84:6789/0 2  auth_reply(proto 2 0 (0) Success) 
 v1 
 33+0+0 (536914167 0 0) 0x7fa508000f60 con 0x7fa51402f570

 2014-08-20 17:43:46.460180 7fa5135fe700  1 -- 209.243.160.83:0/1025971 
 --
 209.243.160.84:6789/0 -- auth(proto 2 32 bytes epoch 0) v1 -- ?+0
 0x7fa4fc0012d0 con 0x7fa51402f570

 2014-08-20 17:43:46.461341 7fa5135fe700  1 -- 209.243.160.83:0/1025971 
 ==
 mon.0 209.243.160.84:6789/0 3  auth_reply(proto 2 0 (0) Success) 
 v1 
 206+0+0 (409581826 0 0) 

Re: [ceph-users] Problem setting tunables for ceph firefly

2014-08-21 Thread Craig Lewis
There was a good discussion of this a month ago:
https://www.mail-archive.com/ceph-users%40lists.ceph.com/msg11483.html

That'll give you some things you can try, and information on how to undo it
if it does cause problems.


You can disable the warning by adding this to the [mon] section of
ceph.conf:
  mon warn on legacy crush tunables = false
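If you do decide to try an intermediate chooseleaf_vary_r value, the usual
cycle is to decompile the CRUSH map, edit the tunable, and inject it back.
A rough sketch (expect data movement when the new map goes in):

  ceph osd getcrushmap -o crush.bin        # dump the current CRUSH map
  crushtool -d crush.bin -o crush.txt      # decompile to text
  # edit crush.txt and set:  tunable chooseleaf_vary_r 4   (or whichever value you pick)
  crushtool -c crush.txt -o crush.new      # recompile
  ceph osd setcrushmap -i crush.new        # inject the edited map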





On Thu, Aug 21, 2014 at 7:17 AM, Gerd Jakobovitsch g...@mandic.net.br
wrote:

 Dear all,

 I have a ceph cluster running in 3 nodes, 240 TB space with 60% usage,
 used by rbd and radosgw clients. Recently I upgraded from emperor to
 firefly, and I got the message about legacy tunables described in
 http://ceph.com/docs/master/rados/operations/crush-map/#tunables. After
 some data rearrangement to minimize risks, I tried to apply the optimal
 settings. This resulted in 28% object degradation, much more than I
 expected, and worse, I lost communication with the rbd clients, which run
 kernels 3.10 or 3.11.

 Searching for a solution, I got to this proposed solution:
 https://www.mail-archive.com/ceph-users@lists.ceph.com/msg11199.html.
 Applying it (before the data was all moved), I got an additional 2% of object
 degradation, but the rbd clients started working again. However, I then got a
 large number of degraded or stale PGs that are not backfilling. Looking
 for the definition of chooseleaf_vary_r, I reached the definition in
 http://ceph.com/docs/master/rados/operations/crush-map/:
 chooseleaf_vary_r: Whether a recursive chooseleaf attempt will start with
 a non-zero value of r, based on how many attempts the parent has already
 made. Legacy default is 0, but with this value CRUSH is sometimes unable to
 find a mapping. The optimal value (in terms of computational cost and
 correctness) is 1. However, for legacy clusters that have lots of existing
 data, changing from 0 to 1 will cause a lot of data to move; a value of 4
 or 5 will allow CRUSH to find a valid mapping but will make less data move.

 Is there any suggestion for handling this? Should I set chooseleaf_vary_r to
 some other value? Will I lose communication with my rbd clients? Or should
 I return to legacy tunables?

 Regards,

 Gerd Jakobovitsch


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Question on OSD node failure recovery

2014-08-21 Thread Craig Lewis
The default rules are sane for small clusters with few failure domains.
Anyone running anything larger than a single rack should customize their rules.

It's a good idea to figure this out early.  Changes to your CRUSH rules can
result in a large percentage of data moving around, which will make your
cluster unusable until the migration completes.

It is possible to make changes after the cluster has a lot of data.  From
what I've been able to figure out, it involves a lot of work to manually
migrate data to new pools using the new rules.
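For reference, a rule that spreads replicas across racks rather than hosts
looks roughly like the sketch below (names are placeholders; the stock rules
generated for a small cluster use "type host" as the failure domain):

  rule replicated_rack {
          ruleset 1
          type replicated
          min_size 1
          max_size 10
          step take default
          step chooseleaf firstn 0 type rack
          step emit
  }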




On Thu, Aug 21, 2014 at 6:23 AM, Sean Noonan sean.noo...@twosigma.com
wrote:

 Ceph uses CRUSH (http://ceph.com/docs/master/rados/operations/crush-map/)
 to determine object placement.  The default generated crush maps are sane,
 in that they will put replicas in placement groups into separate failure
 domains.  You do not need to worry about this simple failure case, but you
 should consider the network and disk i/o consequences of re-replicating
 large amounts of data.
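 If that re-replication load is a concern, recovery and backfill can be
 throttled; a sketch of the commonly tuned knobs (the values are only
 examples, not recommendations):

   ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'

   # or persistently, in the [osd] section of ceph.conf:
   #   osd max backfills = 1
   #   osd recovery max active = 1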

 Sean
 
 From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of
 LaBarre, James  (CTR)  A6IT [james.laba...@cigna.com]
 Sent: Thursday, August 21, 2014 9:17 AM
 To: ceph-us...@ceph.com
 Subject: [ceph-users] Question on OSD node failure recovery

 I understand the concept with Ceph being able to recover from the failure
 of an OSD (presumably with a single OSD being on a single disk), but I’m
 wondering what the scenario is if an OSD server node containing  multiple
 disks should fail.  Presuming you have a server containing 8-10 disks, your
 duplicated placement groups could end up on the same system. The diagrams
 I've seen show replicas going to separate nodes, but is this in fact how
 Ceph handles it?


 --
 CONFIDENTIALITY NOTICE: If you have received this email in error,
 please immediately notify the sender by e-mail at the address shown.
 This email transmission may contain confidential information.  This
 information is intended only for the use of the individual(s) or entity to
 whom it is intended even if addressed incorrectly.  Please delete it from
 your files if you are not the intended recipient.  Thank you for your
 compliance.  Copyright (c) 2014 Cigna

 ==
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com