Re: [ceph-users] pg incomplete second osd in acting set still available

2016-03-25 Thread John-Paul Robinson
So I think I know what might have gone wrong.

When I took my osds out of the cluster and shut them down, the first
set of osds likely came back up and in the cluster before 300 seconds
expired.  This would have prevented the cluster from triggering recovery of
the pg from the replica osd.

So the question is, can I force this to happen?  Can I take the supposed
primary osd down for 300+ seconds to allow the cluster to start
recovering the pgs?  (This will of course affect all other pgs on those
osds.)  Or is there a better way?

Note that all my secondary osds in these pgs have the expected amount of
data for the pg, remained up during the primary's downtime and should
have the state to become the primary for the acting set.
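
Concretely, what I have in mind is something like the following (osd.53 is
the primary of pg 3.5 from the quoted output below; marking it out skips
waiting on the 300 second down-out timer; a sketch, not something I've run
yet):

$ sudo ceph osd out 53
$ sudo service ceph stop osd.53            # on the host carrying osd.53
$ sudo ceph health detail | grep 'pg 3.5 '
# ...and once the pg starts recovering from the replica:
$ sudo ceph osd in 53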

Thanks for listening.

~jpr


On 03/25/2016 11:57 AM, John-Paul Robinson wrote:
> Hi Folks,
>
> One last dip into my old bobtail cluster.  (new hardware is on order)
>
> I have three pgs in an incomplete state.  The cluster was previously
> stable, though in a health warn state due to a few near full osds.  I
> started resizing drives on one host to expand space after taking the
> osds that served them out and down.  My failure domain has two levels,
> osds and hosts, and I keep two copies per placement group.
>
> Three of my pgs are flagged incomplete.
>
> root@d90-b1-1c-3a-c4-8f:~# date; sudo ceph --id nova health detail |
> grep incomplete
> Fri Mar 25 11:28:47 CDT 2016
> HEALTH_WARN 168 pgs backfill; 107 pgs backfilling; 241 pgs degraded; 3
> pgs incomplete; 3 pgs stuck inactive; 287 pgs stuck unclean; recovery
> 4913393/39589336 degraded (12.411%);  recovering 120 o/s, 481MB/s; 4
> near full osd(s)
> pg 3.5 is stuck inactive since forever, current state incomplete, last
> acting [53,22]
> pg 3.150 is stuck inactive since forever, current state incomplete, last
> acting [50,74]
> pg 3.38c is stuck inactive since forever, current state incomplete, last
> acting [14,70]
> pg 3.5 is stuck unclean since forever, current state incomplete, last
> acting [53,22]
> pg 3.150 is stuck unclean since forever, current state incomplete, last
> acting [50,74]
> pg 3.38c is stuck unclean since forever, current state incomplete, last
> acting [14,70]
> pg 3.38c is incomplete, acting [14,70]
> pg 3.150 is incomplete, acting [50,74]
> pg 3.5 is incomplete, acting [53,22]
>
> Given that incomplete means:
>
> "Ceph detects that a placement group is missing information about writes
> that may have occurred, or does not have any healthy copies. If you see
> this state, try to start any failed OSDs that may contain the needed
> information or temporarily adjust min_size to allow recovery."
>
> I have restarted all osds in these acting sets and they log normally,
> opening their respective journals and such. However, the incomplete
> state remains.
>
> All three of the primary osds (53, 50, 14) were reformatted to expand
> their size, so I know there's no "spare" journal if it's referring to what
> was there before.  Btw, I did take all osds out and down before resizing
> their drives, so I'm not sure how these pgs would actually be expecting an
> old journal.
>
> I suspect I need to forgo the journal and let the secondaries become
> primary for these pg.
>
> I sure hope that's possible.
>
> As always, thanks for any pointers.
>
> ~jpr
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] pg incomplete second osd in acting set still available

2016-03-25 Thread John-Paul Robinson
Hi Folks,

One last dip into my old bobtail cluster.  (new hardware is on order)

I have three pgs in an incomplete state.  The cluster was previously
stable, though in a health warn state due to a few near full osds.  I
started resizing drives on one host to expand space after taking the
osds that served them out and down.  My failure domain has two levels,
osds and hosts, and I keep two copies per placement group.

Three of my pgs are flagged incomplete.

root@d90-b1-1c-3a-c4-8f:~# date; sudo ceph --id nova health detail |
grep incomplete
Fri Mar 25 11:28:47 CDT 2016
HEALTH_WARN 168 pgs backfill; 107 pgs backfilling; 241 pgs degraded; 3
pgs incomplete; 3 pgs stuck inactive; 287 pgs stuck unclean; recovery
4913393/39589336 degraded (12.411%);  recovering 120 o/s, 481MB/s; 4
near full osd(s)
pg 3.5 is stuck inactive since forever, current state incomplete, last
acting [53,22]
pg 3.150 is stuck inactive since forever, current state incomplete, last
acting [50,74]
pg 3.38c is stuck inactive since forever, current state incomplete, last
acting [14,70]
pg 3.5 is stuck unclean since forever, current state incomplete, last
acting [53,22]
pg 3.150 is stuck unclean since forever, current state incomplete, last
acting [50,74]
pg 3.38c is stuck unclean since forever, current state incomplete, last
acting [14,70]
pg 3.38c is incomplete, acting [14,70]
pg 3.150 is incomplete, acting [50,74]
pg 3.5 is incomplete, acting [53,22]

Given that incomplete means:

"Ceph detects that a placement group is missing information about writes
that may have occurred, or does not have any healthy copies. If you see
this state, try to start any failed OSDs that may contain the needed
information or temporarily adjust min_size to allow recovery."

I have restarted all osds in these acting sets and they log normally,
opening their respective journals and such. However, the incomplete
state remains.

All three of the primary osds (53, 50, 14) were reformatted to expand
their size, so I know there's no "spare" journal if it's referring to what
was there before.  Btw, I did take all osds out and down before resizing
their drives, so I'm not sure how these pgs would actually be expecting an
old journal.

I suspect I need to forgo the journal and let the secondaries become
primary for these pg.

I sure hope that's possible.

As always, thanks for any pointers.

~jpr

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] upgrading to major releases

2015-10-23 Thread John-Paul Robinson
Hi,

When upgrading to the next release, is it necessary to first upgrade to
the most recent point release of the prior release, or can one upgrade
from the initial release of the named version?  The release notes
(http://docs.ceph.com/docs/master/release-notes) don't appear to indicate
it is necessary; I'm just wondering if there are benefits or assumptions.
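
In case it matters, this is how I'd plan to check what's actually running
before and after (a sketch; the wildcard tell form may not be available on
every release):

$ ceph --version                 # packages installed on the local node
$ ceph tell osd.* version        # what each running osd reports, if supported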

Thanks,

~jpr
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] hanging nfsd requests on an RBD to NFS gateway

2015-10-22 Thread John-Paul Robinson
Hi,

Has anyone else experienced a problem with RBD-to-NFS gateways blocking
nfsd server requests when their ceph cluster has a placement group that
is not servicing I/O for some reason, eg. too few replicas or an osd
with slow request warnings?

We have an RBD-NFS gateway that stops responding to NFS clients
(interaction with RBD-backed NFS shares hangs on the NFS client),
whenever our ceph cluster has some part of it in an I/O block
condition.   This issue only affects the ability of the nfsd processes
to serve requests to the client.  I can look at and access underlying
mounted RBD containers without issue, although they appear hung from the
NFS client side.   The gateway node load numbers spike to a number that
reflects the number of nfsd processes, but the system is otherwise
untaxed (unlike a normal high-load situation, i.e. I can still type and
run commands with normal responsiveness).

The behavior comes across as if there were some nfsd global lock that an
nfsd sets before requesting I/O from a backend device.  In the case
above, the I/O request hangs on one RBD image affected by the I/O block
caused by the problematic pg or OSD.   The nfsd request blocks on the
ceph I/O and, because it has set a global lock, all other nfsd processes
are prevented from servicing requests to their clients.  The nfsd
processes are now all in the wait queue, causing the load number on the
gateway system to spike. Once the Ceph I/O issue is resolved, the nfsd
I/O request completes and all service returns to normal.  The load on
the gateway drops to normal immediately and all NFS clients can again
interact with the nfsd processes.  Throughout this time unaffected ceph
objects remain available to other clients, eg. OpenStack volumes.

Our RBD-NFS gateway is running Ubuntu 12.04.5 with kernel
3.11.0-15-generic.  The ceph version installed on this client is 0.72.2,
though I assume only the kernel resident RBD module matters.
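
In case it helps, this is roughly how we confirm the nfsd threads are stuck
and capture their stacks while a hang is in progress (a diagnostic sketch,
not a fix):

$ ps -eo pid,stat,wchan:32,comm | grep nfsd   # nfsd threads sitting in D state
$ dmesg | tail                                # hung task warnings ("blocked for more than 120 seconds")
$ echo w | sudo tee /proc/sysrq-trigger       # dump stacks of all uninterruptible tasks to dmesg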

Any thoughts or pointers appreciated.

~jpr
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pg incomplete state

2015-10-22 Thread John-Paul Robinson
Greg,

Thanks for providing this background on the incomplete state.

With that context, and a little more digging online and in our
environment, I was able to resolve the issue. My cluster is back in
health ok.

The key to fixing the incomplete state was the information provided by
pg query.  I did not have to change the min_size setting.  In addition
to your comments, these two references were helpful.

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-August/042102.html
http://tracker.ceph.com/issues/5226


The tail of `ceph pg 3.ea query` showed there were a number of osds
involved in servicing the backfill.

 "probing_osds": [
10,
11,
30,
37,
39,
54],
  "down_osds_we_would_probe": [],
  "peering_blocked_by": []},
{ "name": "Started",
  "enter_time": "2015-10-21 14:39:13.824613"}]}

After checking all the OSDs, I confirmed that only osd.11 had the pg
data and all the rest had an empty dir for pg 3.ea.  Because osd.10 was
listed first and had an empty copy of the pg, my assumption was that it
was blocking the backfill.  I stopped osd.10 briefly and the state of pg
3.ea immediately entered "active+degraded+remapped+backfilling".  Once
the backfill started I restarted osd.10.  osd.11 became the primary (as
desired) and began backfilling osd.30.

 { "state": "active+degraded+remapped+backfilling",
  "up": [
30,
11],
  "acting": [
11,
30],

osd.10 was no longer holding up the start of backfill operation:

"recovery_state": [
{ "name": "Started\/Primary\/Active",
  "enter_time": "2015-10-22 12:46:50.907955",
  "might_have_unfound": [
{ "osd": 10,
  "status": "not queried"}],
  "recovery_progress": { "backfill_target": 30,
  "waiting_on_backfill": 0,

Based on the steps that triggered the original incomplete state, my
guess is that when I took osd.30 down and out to reformat, a number of
alternates (including osd.10) were mapped as backfill targets for the
pg.  These operations didn't have a chance to start before osd.30's
reformat completed and it was back in the cluster.  At that point, pg 3.ea
was remapped again, leaving osd.10 at the top of the list.  Not having
any data, it blocked osd.11's backfill from starting.

Not sure if that was the exact cause, but it makes some sense.

Thanks again for pointing me in a useful direction.

~jpr

On 10/21/2015 03:01 PM, Gregory Farnum wrote:
> I don't remember the exact timeline, but min_size is designed to
> prevent data loss from under-replicated objects (ie, if you only have
> 1 copy out of 3 and you lose that copy, you're in trouble, so maybe
> you don't want it to go active). Unfortunately it could also prevent
> the OSDs from replicating/backfilling the data to new OSDs in the case
> where you only had one copy left — that's fixed now, but wasn't
> initially. And in that case it reported the PG as incomplete (in later
> versions, PGs in this state get reported as undersized).
>
> So if you drop the min_size to 1, it will allow new writes to the PG
> (which might not be great), but it will also let the OSD go into the
> backfilling state. (At least, assuming the number of replicas is the
> only problem.). Based on your description of the problem I think this
> is the state you're in, and decreasing min_size is the solution.
> *shrug*
> You could also try and do something like extracting the PG from osd.11
> and copying it to osd.30, but that's quite tricky without the modern
> objectstore tool stuff, and I don't know if any of that works on
> dumpling (which it sounds like you're on — incidentally, you probably
> want to upgrade from that).
> -Greg
>
> On Wed, Oct 21, 2015 at 12:55 PM, John-Paul Robinson <j...@uab.edu> wrote:
>> Greg,
>>
>> Thanks for the insight.  I suspect things are somewhat sane given that I
>> did erase the primary (osd.30) and the secondary (osd.11) still contains
>> pg data.
>>
>> If I may, could you clarify the process of backfill a little?
>>
>> I understand the min_size allows I/O on the object to resume while there
>> are only that many replicas (ie. 1 once changed) and this would let
>> things move forward.
>>
>> I would expect, however, that some backfill would already be on-going
>> for pg 3.ea on osd.30.  As far as I can tell, there isn't anything
>> happening.  The pg 3.ea directory is just as empty today as it was
>> yesterday.
>>
>> Will changing the min_size actually trigger backfill to begin for an
>> object if it has stalled or never got started?

Re: [ceph-users] hanging nfsd requests on an RBD to NFS gateway

2015-10-22 Thread John-Paul Robinson


On 10/22/2015 04:03 PM, Wido den Hollander wrote:
> On 10/22/2015 10:57 PM, John-Paul Robinson wrote:
>> Hi,
>>
>> Has anyone else experienced a problem with RBD-to-NFS gateways blocking
>> nfsd server requests when their ceph cluster has a placement group that
>> is not servicing I/O for some reason, eg. too few replicas or an osd
>> with slow request warnings?
>>
>> We have an RBD-NFS gateway that stops responding to NFS clients
>> (interaction with RBD-backed NFS shares hang on the NFS client),
>> whenever our ceph cluster has some part of it in an I/O block
>> condition.   This issue only affects the ability of the nfsd processes
>> to serve requests to the client.  I can look at and access underlying
>> mounted RBD containers without issue, although they appear hung from the
>> NFS client side.   The gateway node load numbers spike to a number that
>> reflects the number of nfsd processes, but the system is otherwise
>> untaxed (unlike a normal high-load situation, i.e. I can still type and
>> run commands with normal responsiveness).
>>
> Well, that is normal I think. Certain objects become unresponsive if a
> PG is not serving I/O.
>
> With a simple 'ls' or 'df -h' you might not be touching those objects,
> so for you it seems like everything is functioning.
>
> The nfsd process however might be hung due to a blocking I/O call. That
> is completely normal and to be excpected.

I agree that an nfsd process blocking on a blocked backend I/O request
is expected and normal.

> That it hangs the complete NFS server might be just a side-effect on how
> nfsd was written.

Hanging all nfsd processes is the part I find unexpected.  I'm just
wondering if someone has experience with this or if this is a known nfsd
issue.

> It might be that Ganesha works better for you:
> http://blog.widodh.nl/2014/12/nfs-ganesha-with-libcephfs-on-ubuntu-14-04/

Thanks, Ganesha looks very interesting!

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] hanging nfsd requests on an RBD to NFS gateway

2015-10-22 Thread John-Paul Robinson
A few clarifications on our experience:

* We have 200+ rbd images mounted on our RBD-NFS gateway.  (There's
nothing easier for a user to understand than "your disk is full".)

* I'd expect more contention potential with a single shared RBD back
end, but with many distinct and presumably isolated backend RBD images,
I've always been surprised that *all* the nfsd tasks hang.  This leads me
to think it's an nfsd issue rather than an rbd issue.  (I realize this
is an rbd list; just looking for shared experience. ;) )
 
* I haven't seen any difference between reads and writes.  Any access to
any backing RBD store from the NFS client hangs.
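
For anyone poking at the same thing, the kernel rbd side can be inspected
while a hang is in progress; this is what I look at on our gateway (a
sketch; requires debugfs to be mounted):

$ sudo cat /sys/kernel/debug/ceph/*/osdc      # in-flight object requests from the krbd client
$ rbd showmapped                              # map rbd device numbers back to images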

~jpr

On 10/22/2015 06:42 PM, Ryan Tokarek wrote:
>> On Oct 22, 2015, at 3:57 PM, John-Paul Robinson <j...@uab.edu> wrote:
>>
>> Hi,
>>
>> Has anyone else experienced a problem with RBD-to-NFS gateways blocking
>> nfsd server requests when their ceph cluster has a placement group that
>> is not servicing I/O for some reason, eg. too few replicas or an osd
>> with slow request warnings?
> We have experienced exactly that kind of problem except that it sometimes 
> happens even when ceph health reports "HEALTH_OK". This has been incredibly 
> vexing for us. 
>
>
> If the cluster is unhealthy for some reason, then I'd expect your/our 
> symptoms as writes can't be completed. 
>
> I'm guessing that you have file systems with barriers turned on. Whichever 
> file system that has a barrier write stuck on the problem pg, will cause any 
> other process trying to write anywhere in that FS also to block. This likely 
> means a cascade of nfsd processes will block as they each try to service 
> various client writes to that FS. Even though, theoretically, the rest of the 
> "disk" (rbd) and other file systems might still be writable, the NFS 
> processes will still be in uninterruptible sleep just because of that stuck 
> write request (or such is my understanding). 
>
> Disabling barriers on the gateway machine might postpone the problem (never 
> tried it and don't want to) until you hit your vm.dirty_bytes or 
> vm.dirty_ratio thresholds, but it is dangerous as you could much more easily 
> lose data. You'd be better off solving the underlying issues when they happen 
> (too few replicas available or overloaded osds). 
>
>
> For us, even when the cluster reports itself as healthy, we sometimes have 
> this problem. All nfsd processes block. sync blocks. echo 3 > 
> /proc/sys/vm/drop_caches blocks. There is a persistent 4-8MB "Dirty" in 
> /proc/meminfo. None of the osds log slow requests. Everything seems fine on 
> the osds and mons. Neither CPU nor I/O load is extraordinary on the ceph 
> nodes, but at least one file system on the gateway machine will stop 
> accepting writes. 
>
> If we just wait, the situation resolves itself in 10 to 30 minutes. A forced 
> reboot of the NFS gateway "solves" the performance problem, but is annoying 
> and dangerous (we unmount all of the file systems that are still unmountable, 
> but the stuck ones lead us to a sysrq-b). 
>
> This is on Scientific Linux 6.7 systems with elrepo 4.1.10 Kernels running 
> Ceph Firefly (0.8.10) and XFS file systems exported over NFS and samba. 
>
> Ryan
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pg incomplete state

2015-10-21 Thread John-Paul Robinson
Greg,

Thanks for the insight.  I suspect things are somewhat sane given that I
did erase the primary (osd.30) and the secondary (osd.11) still contains
pg data.

If I may, could you clarify the process of backfill a little?

I understand the min_size allows I/O on the object to resume while there
are only that many replicas (ie. 1 once changed) and this would let
things move forward.

I would expect, however, that some backfill would already be on-going
for pg 3.ea on osd.30.  As far as I can tell, there isn't anything
happening.  The pg 3.ea directory is just as empty today as it was
yesterday.

Will changing the min_size actually trigger backfill to begin for an
object if it has stalled or never got started?
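
For my own notes, this is what I understand the suggested min_size change
to look like (a sketch; assuming 'rbd' is the pool backing pg 3.ea and 2 is
its normal min_size):

$ sudo ceph osd pool get rbd min_size
$ sudo ceph osd pool set rbd min_size 1   # allow the pg to go active/backfill with one copy
# ...wait for pg 3.ea to finish backfilling, then restore:
$ sudo ceph osd pool set rbd min_size 2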

An alternative idea I had was to take osd.30 back out of the cluster so
that pg 3.ea [30,11] would get mapped to some other osd to maintain
replication.  This seems a bit heavy handed though, given that only this
one pg is affected.

Thanks for any follow up.

~jpr 


On 10/21/2015 01:21 PM, Gregory Farnum wrote:
> On Tue, Oct 20, 2015 at 7:22 AM, John-Paul Robinson <j...@uab.edu> wrote:
>> Hi folks
>>
>> I've been rebuilding drives in my cluster to add space.  This has gone
>> well so far.
>>
>> After the last batch of rebuilds, I'm left with one placement group in
>> an incomplete state.
>>
>> [sudo] password for jpr:
>> HEALTH_WARN 1 pgs incomplete; 1 pgs stuck inactive; 1 pgs stuck unclean
>> pg 3.ea is stuck inactive since forever, current state incomplete, last
>> acting [30,11]
>> pg 3.ea is stuck unclean since forever, current state incomplete, last
>> acting [30,11]
>> pg 3.ea is incomplete, acting [30,11]
>>
>> I've restarted both OSD a few times but it hasn't cleared the error.
>>
>> On the primary I see errors in the log related to slow requests:
>>
>> 2015-10-20 08:40:36.678569 7f361585c700  0 log [WRN] : 8 slow requests,
>> 3 included below; oldest blocked for > 31.922487 secs
>> 2015-10-20 08:40:36.678580 7f361585c700  0 log [WRN] : slow request
>> 31.531606 seconds old, received at 2015-10-20 08:40:05.146902:
>> osd_op(client.158903.1:343217143 rb.0.25cf8.238e1f29.a044 [read
>> 1064960~262144] 3.ae9968ea RETRY) v4 currently reached pg
>> 2015-10-20 08:40:36.678592 7f361585c700  0 log [WRN] : slow request
>> 31.531591 seconds old, received at 2015-10-20 08:40:05.146917:
>> osd_op(client.158903.1:343217144 rb.0.25cf8.238e1f29.a044 [read
>> 2113536~262144] 3.ae9968ea RETRY) v4 currently reached pg
>> 2015-10-20 08:40:36.678599 7f361585c700  0 log [WRN] : slow request
>> 31.531551 seconds old, received at 2015-10-20 08:40:05.146957:
>> osd_op(client.158903.1:343232634 ekessler-default.rbd [watch 35~0]
>> 3.e4bd50ea) v4 currently reached pg
>>
>> Notes online suggest this is an issue with the journal and that it may
>> be possible to export and rebuild the pg.  I don't have firefly.
>>
>> https://ceph.com/community/incomplete-pgs-oh-my/
>>
>> Interestingly, pg 3.ea appears to be complete on osd.11 (the secondary)
>> but missing entirely on osd.30 (the primary).
>>
>> on osd.30 (primary):
>>
>> crowbar@da0-36-9f-0e-2b-88:~$ du -sk
>> /var/lib/ceph/osd/ceph-30/current/3.ea_head/
>> 0   /var/lib/ceph/osd/ceph-30/current/3.ea_head/
>>
>> on osd.11 (secondary):
>>
>> crowbar@da0-36-9f-0e-2b-40:~$ du -sh
>> /var/lib/ceph/osd/ceph-11/current/3.ea_head/
>> 63G /var/lib/ceph/osd/ceph-11/current/3.ea_head/
>>
>> This makes some sense since, my disk drive rebuilding activity
>> reformatted the primary osd.30.  It also gives me some hope that my data
>> is not lost.
>>
>> I understand incomplete means a problem with the journal, but is there a
>> way to dig deeper into this, or is it possible to get the secondary's data
>> to take over?
> If you're running an older version of Ceph (Firefly or earlier,
> maybe?), "incomplete" can also mean "not enough replicas". It looks
> like that's what you're hitting here, if osd.11 is not reporting any
> issues. If so, simply setting the min_size on this pool to 1 until the
> backfilling is done should let you get going.
> -Greg

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pg incomplete state

2015-10-21 Thread John-Paul Robinson
Yes.  That's the intention.  I was fixing the osd sizes to ensure the
cluster was in health ok for the upgrades (instead of having multiple osds
near full).

Thanks again for all the insight.  Very helpful.

~jpr

On 10/21/2015 03:01 PM, Gregory Farnum wrote:
>  (which it sounds like you're on — incidentally, you probably
> want to upgrade from that).

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] pg incomplete state

2015-10-20 Thread John-Paul Robinson
Hi folks

I've been rebuilding drives in my cluster to add space.  This has gone
well so far.

After the last batch of rebuilds, I'm left with one placement group in
an incomplete state.

[sudo] password for jpr:
HEALTH_WARN 1 pgs incomplete; 1 pgs stuck inactive; 1 pgs stuck unclean
pg 3.ea is stuck inactive since forever, current state incomplete, last
acting [30,11]
pg 3.ea is stuck unclean since forever, current state incomplete, last
acting [30,11]
pg 3.ea is incomplete, acting [30,11]

I've restarted both OSD a few times but it hasn't cleared the error.

On the primary I see errors in the log related to slow requests:

2015-10-20 08:40:36.678569 7f361585c700  0 log [WRN] : 8 slow requests,
3 included below; oldest blocked for > 31.922487 secs
2015-10-20 08:40:36.678580 7f361585c700  0 log [WRN] : slow request
31.531606 seconds old, received at 2015-10-20 08:40:05.146902:
osd_op(client.158903.1:343217143 rb.0.25cf8.238e1f29.a044 [read
1064960~262144] 3.ae9968ea RETRY) v4 currently reached pg
2015-10-20 08:40:36.678592 7f361585c700  0 log [WRN] : slow request
31.531591 seconds old, received at 2015-10-20 08:40:05.146917:
osd_op(client.158903.1:343217144 rb.0.25cf8.238e1f29.a044 [read
2113536~262144] 3.ae9968ea RETRY) v4 currently reached pg
2015-10-20 08:40:36.678599 7f361585c700  0 log [WRN] : slow request
31.531551 seconds old, received at 2015-10-20 08:40:05.146957:
osd_op(client.158903.1:343232634 ekessler-default.rbd [watch 35~0]
3.e4bd50ea) v4 currently reached pg

Notes online suggest this is an issue with the journal and that it may
be possible to export and rebuild the pg.  I don't have firefly.

https://ceph.com/community/incomplete-pgs-oh-my/

Interestingly, pg 3.ea appears to be complete on osd.11 (the secondary)
but missing entirely on osd.30 (the primary). 

on osd.30 (primary):

crowbar@da0-36-9f-0e-2b-88:~$ du -sk
/var/lib/ceph/osd/ceph-30/current/3.ea_head/
0   /var/lib/ceph/osd/ceph-30/current/3.ea_head/

on osd.11 (secondary):

crowbar@da0-36-9f-0e-2b-40:~$ du -sh
/var/lib/ceph/osd/ceph-11/current/3.ea_head/
63G /var/lib/ceph/osd/ceph-11/current/3.ea_head/

This makes some sense, since my disk drive rebuilding activity
reformatted the primary osd.30.  It also gives me some hope that my data
is not lost.

I understand incomplete means a problem with the journal, but is there a
way to dig deeper into this, or is it possible to get the secondary's data
to take over?
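
Is something along these lines the right way to dig deeper (pg id from
above)?

$ sudo ceph pg 3.ea query | less      # recovery_state / probing_osds?
$ sudo ceph pg dump_stuck inactive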

Thanks,

~jpr



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] question on reusing OSD

2015-09-16 Thread John-Paul Robinson
The move  journal, partition resize, grow file system approach would
work nicely if the spare capacity were at the end of the disk.

Unfortunately, the gdisk (0.8.1) end of disk location bug caused the
journal placement to be at the 800GB mark, leaving the largest remaining
partition at the end of the disk.   I'm assuming the gdisk bug was
caused by overflowing a 32bit int during the -1000M offset from end of
disk calculation.  When it computed the end of disk for the journal
placement on disks >2TB it dropped the 2TB part of the size and was left
only with the 800GB value, putting the journal there.  After gdisk
created the journal at the 800GB mark (splitting the disk),
ceph-disk-prepare told gdisk to take the largest remaining partition for
data, using the 2TB partition at the end.

Here's an example of the buggy partitioning:

crowbar@da0-36-9f-0e-28-2c:~$ sudo gdisk -l /dev/sdd
GPT fdisk (gdisk) version 0.8.8

Partition table scan:
  MBR: protective
  BSD: not present
  APM: not present
  GPT: present

Found valid GPT with protective MBR; using GPT.
Disk /dev/sdd: 5859442688 sectors, 2.7 TiB
Logical sector size: 512 bytes
Disk identifier (GUID): 6F76BD12-05D6-4FA2-A132-CAC3E1C26C81
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 5859442654
Partitions will be aligned on 2048-sector boundaries
Total free space is 1562425343 sectors (745.0 GiB)

Number  Start (sector)    End (sector)   Size         Code  Name
     1      1564475392      5859442654   2.0 TiB            ceph data
     2      1562425344      1564475358   1001.0 MiB         ceph journal


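For what it's worth, the sector numbers above are consistent with a 32-bit
wrap; a quick sanity check using the gdisk output:

$ echo $(( 5859442688 % 2**32 ))           # total sectors, wrapped at 2^32
1564475392                                 # = the start sector of the "2.0 TiB" data partition
$ echo $(( 1564475392 * 512 / 1000**3 ))   # that offset expressed in GB
801                                        # i.e. the ~800GB point gdisk treated as end-of-disk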

I assume I could still follow a disk-level relocation of data using dd
and shift all my content forward in the disk and then grow the file
system to the end, but this would take a significant amount of time,
more than a quick restart of the OSD. 

This leaves me the option of setting noout and hoping for the best (no
other failures) during my somewhat lengthy dd data movement or taking my
osd down and letting the cluster begin repairing the redundancy.

If I follow the second option of normal osd loss repair, my disk
repartition step would be fast and I could bring the OSD back up rather
quickly.  Does taking an OSD out of service, erasing it and bringing the
same OSD back into service present any undue stress to the cluster?  

I'd prefer to use the second option if I can because I'm likely to
repeat this in the near future in order to add encryption to these disks.

~jpr

On 09/15/2015 06:44 PM, Lionel Bouton wrote:
> Le 16/09/2015 01:21, John-Paul Robinson a écrit :
>> Hi,
>>
>> I'm working to correct a partitioning error from when our cluster was
>> first installed (ceph 0.56.4, ubuntu 12.04).  This left us with 2TB
>> partitions for our OSDs, instead of the 2.8TB actually available on
>> disk, a 29% space hit.  (The error was due to a gdisk bug that
>> mis-computed the end of the disk during the ceph-disk-prepare and placed
>> the journal at the 2TB mark instead of the true end of the disk at
>> 2.8TB. I've updated gdisk to a newer release that works correctly.)
>>
>> I'd like to fix this problem by taking my existing 2TB OSDs offline one
>> at a time, repartitioning them and then bringing them back into the
>> cluster.  Unfortunately I can't just grow the partitions, so the
>> repartition will be destructive.
> Hum, why should it be? If the journal is at the 2TB mark, you should be
> able to:
> - stop the OSD,
> - flush the journal, (ceph-osd -i  --flush-journal),
> - unmount the data filesystem (might be superfluous but the kernel seems
> to cache the partition layout when a partition is active),
> - remove the journal partition,
> - extend the data partition,
> - place the journal partition at the end of the drive (in fact you
> probably want to write a precomputed partition layout in one go).
> - mount the data filesystem, resize it online,
> - ceph-osd -i  --mkjournal (assuming your setup can find the
> partition again automatically without reconfiguration)
> - start the OSD
>
> If you script this you should not have to use noout: the OSD should come
> back in a matter of seconds and the impact on the storage network minimal.
>
> Note that the start of the disk is where you get the best sequential
> reads/writes. Given that most data accesses are random and all journal
> accesses are sequential I put the journal at the start of the disk when
> data and journal are sharing the same platters.
>
> Best regards,
>
> Lionel

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] question on reusing OSD

2015-09-16 Thread John-Paul Robinson (Campus)
So I just realized I had described the partition error incorrectly in my 
initial post. The journal was placed at the 800GB mark leaving the 2TB data 
partition at the end of the disk. (See my follow-up to Lionel for details.) 

I'm working to correct that so I have a single large partition the size of the 
disk, save for the journal.

Sorry for any confusion. 

~jpr



> On Sep 15, 2015, at 6:21 PM, John-Paul Robinson <j...@uab.edu> wrote:
> 
> I'm working to correct a partitioning error from when our cluster was
> first installed (ceph 0.56.4, ubuntu 12.04).  This left us with 2TB
> partitions for our OSDs, instead of the 2.8TB actually available on
> disk, a 29% space hit.  (The error was due to a gdisk bug that
> mis-computed the end of the disk during the ceph-disk-prepare and placed
> the journal at the 2TB mark instead of the true end of the disk at
> 2.8TB. I've updated gdisk to a newer release that works correctly.)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] question on reusing OSD

2015-09-16 Thread John-Paul Robinson
Christian,

Thanks for the feedback.

I guess I'm wondering about step 4, "clobber partition, leaving data
intact and grow partition and the file system as needed".

My understanding of xfs_growfs is that the free space must be at the end
of the existing file system.  In this case the existing partition starts
around the 800GB mark on the disk and extends to the end of the
disk.  My goal is to add the first 800GB of the disk to that partition
so it can become a single data partition.

Note that my volumes are not LVM based so I can't extend the volume by
incorporating the free space at the start of the disk.

Am I misunderstanding something about file system grow commands?

Regarding your comments on the impact to the cluster of a downed OSD: I
have lost OSDs before and the impact is minimal (acceptable).

My concern is around taking an OSD down, having the cluster initiate
recovery, and then bringing that same OSD back into the cluster in an
empty state.  Are the placement groups that originally had data on this
OSD already remapped by this point (even if they aren't fully recovered),
so that bringing the empty, replacement OSD on-line simply causes a
different set of placement groups to be mapped onto it to achieve the
rebalance?

Thanks,

~jpr

On 09/16/2015 08:37 AM, Christian Balzer wrote:
> Hello,
>
> On Wed, 16 Sep 2015 07:21:26 -0500 John-Paul Robinson wrote:
>
>> > The move  journal, partition resize, grow file system approach would
>> > work nicely if the spare capacity were at the end of the disk.
>> >
> That shouldn't matter, you can "safely" lose your journal in controlled
> circumstances.
>
> This would also be an ideal time to put your journals on SSDs. ^o^
>
> Roughly (you do have a test cluster, do you? Or at least try this with
> just one OSD):
>
> 1. set noout just to be sure.
> 2. stop the OSD
> 3. "ceph-osd -i osdnum --flush-journal" for warm fuzzies (see man page or
> --help)
> 4. clobber your partitions in a way that leaves you with an intact data
> partition, grow that and the FS in it as desired.
> 5. re-init the journal with "ceph-osd -i osdnum --mkjournal"
> 6. start the OSD and rejoice. 
>  
> More below.
>

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] question on reusing OSD

2015-09-15 Thread John-Paul Robinson
Hi,

I'm working to correct a partitioning error from when our cluster was
first installed (ceph 0.56.4, ubuntu 12.04).  This left us with 2TB
partitions for our OSDs, instead of the 2.8TB actually available on
disk, a 29% space hit.  (The error was due to a gdisk bug that
mis-computed the end of the disk during the ceph-disk-prepare and placed
the journal at the 2TB mark instead of the true end of the disk at
2.8TB. I've updated gdisk to a newer release that works correctly.)

I'd like to fix this problem by taking my existing 2TB OSDs offline one
at a time, repartitioning them and then bringing them back into the
cluster.  Unfortunately I can't just grow the partitions, so the
repartition will be destructive.

I would like for the reformatted OSD to come back into the cluster
looking just like the original OSD, except that it now has 2.8TB for
its data.  That is, I'd like the OSD number to stay the same and for
the cluster to think of it like the original disk (save for not having
any data on it).

Ordinarily, I would add an OSD by bringing a system into the cluster
triggering these events:

ceph-disk-prepare /dev/sdb /dev/sdb  # partitions disk, note older
cluster with journal on same disk
ceph-disk-activate /dev/sdb # registers osd with cluster

The ceph-disk-prepare is focused on partitioning and doesn't interact
with the cluster.  The ceph-disk-activate takes care of making the OSD
look like an OSD and adding it into the cluster.

Inside of the ceph-disk-activate the code looks for some special files
at the top of the /dev/sdb1 file system, including magic, ceph_fsid, and
whoami (which is where the osd number is stored).
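
For example, on one of our existing osds (osd id is just an example; this
is a bobtail-era filestore layout, so the exact files may differ elsewhere):

$ cat /var/lib/ceph/osd/ceph-30/whoami
30
$ cat /var/lib/ceph/osd/ceph-30/ceph_fsid     # should match `ceph fsid` for the cluster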

My first question is, can I preserve these special files and put them
back on the repartitioned/formatted drive, causing ceph-disk-activate to
just bring the OSD back into the cluster using its original identity, or
is there a better way to do what I want?

My second question is, if I take an OSD out of the cluster, should I
wait for the subsequent rebalance to complete before bringing the
reformatted OSD back in the cluster?  That is, will it cause problems to
drop an OSD out of the cluster and then bring the same OSD back into the
cluster, except without any of the data?   I'm assuming this is similar
to what would happen in a standard disk replacement scenario.

I reviewed the thread from Sept 2014
(https://www.mail-archive.com/ceph-users@lists.ceph.com/msg13394.html)
discussing a similar scenario.  This was more focused on re-using a
journal slot on an SSD.  In my case the journal is on the same disk as
the data.  Also, I don't have a recent release of ceph, so I likely
won't benefit from the associated fix.

Thanks for any suggestions.

~jpr
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] NFS interaction with RBD

2015-05-29 Thread John-Paul Robinson
In the end this came down to one slow OSD.  There were no hardware
issues, so I have to assume something gummed up during rebalancing and
peering.

I restarted the osd process after setting the cluster to noout.  After
the osd was restarted the rebalance completed and the cluster returned
to health ok.

As soon as the osd restarted all previously hanging operations returned
to normal.

I'm surprised by a single slow OSD impacting access to the entire
cluster.   I understand now that only the primary osd is used for reads,
and that writes must go to the primary and then the secondary, but I would
have expected the impact to be more contained.

We currently build XFS file systems directly on RBD images.  I'm
wondering if there would be any value in using an LVM abstraction on top
to spread access to other osds  for read and failure scenarios.

Any thoughts on the above appreciated.

~jpr


On 05/28/2015 03:18 PM, John-Paul Robinson wrote:
 To follow up on the original post,

 Further digging indicates this is a problem with RBD image access and
 is not related to NFS-RBD interaction as initially suspected.  The
 nfsd is simply hanging as a result of a hung request to the XFS file
 system mounted on our RBD-NFS gateway.  This hung XFS call is caused
 by a problem with the RBD module interacting with our Ceph pool.

 I've found a reliable way to trigger a hang directly on an rbd image
 mapped into our RBD-NFS gateway box.  The image contains an XFS file
 system.  When I try to list the contents of a particular directory,
 the request hangs indefinitely.

 Two weeks ago our ceph status was:

 jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova status
health HEALTH_WARN 1 near full osd(s)
monmap e1: 3 mons at
 
 {da0-36-9f-0e-28-2c=172.16.171.6:6789/0,da0-36-9f-0e-2b-88=172.16.171.5:6789/0,da0-36-9f-0e-2b-a0=172.16.171.4:6789/0},
 election epoch 350, quorum 0,1,2
 da0-36-9f-0e-28-2c,da0-36-9f-0e-2b-88,da0-36-9f-0e-2b-a0
osdmap e5978: 66 osds: 66 up, 66 in
 pgmap v26434260: 3072 pgs: 3062 active+clean, 6
 active+clean+scrubbing, 4 active+clean+scrubbing+deep; 45712 GB
 data, 91590 GB used, 51713 GB / 139 TB avail; 12234B/s wr, 1op/s
mdsmap e1: 0/0/1 up


 The near full osd was number 53 and we updated our crush map to
 reweight the osd.  All of the OSDs had a weight of 1 based on the
 assumption that all osds were 2.0TB.  Apparently one of our servers had
 the OSDs sized to 2.8TB and this caused the OSD imbalance even though
 we are only at 50% utilization.  We reweighted the near full osd to .8
 and that initiated a rebalance that has since relieved the 95% full
 condition on that OSD.

 However, since that time the repeering has not completed and we
 suspect this is causing problems with our access of RBD images.   Our
 current ceph status is:

 jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova status
health HEALTH_WARN 1 pgs peering; 1 pgs stuck inactive; 4 pgs
 stuck unclean; recovery 9/23842120 degraded (0.000%)
monmap e1: 3 mons at
 
 {da0-36-9f-0e-28-2c=172.16.171.6:6789/0,da0-36-9f-0e-2b-88=172.16.171.5:6789/0,da0-36-9f-0e-2b-a0=172.16.171.4:6789/0},
 election epoch 350, quorum 0,1,2
 da0-36-9f-0e-28-2c,da0-36-9f-0e-2b-88,da0-36-9f-0e-2b-a0
osdmap e6036: 66 osds: 66 up, 66 in
 pgmap v27104371: 3072 pgs: 3 active, 3056 active+clean, 9
 active+clean+scrubbing, 1 remapped+peering, 3
 active+clean+scrubbing+deep; 45868 GB data, 92006 GB used, 51297
 GB / 139 TB avail; 3125B/s wr, 0op/s; 9/23842120 degraded (0.000%)
mdsmap e1: 0/0/1 up


 Here are further details on our stuck pgs:

 jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova pg
 dump_stuck inactive
 ok
 pg_stat objects mip degrunf bytes   log disklog
 state   state_stamp v   reportedup  acting 
 last_scrub   scrub_stamp  last_deep_scrub deep_scrub_stamp
 3.3af   11600   0   0   0   47941791744 153812 
 153812  remapped+peering2015-05-15 12:47:17.223786 
 5979'293066  6000'1248735 [48,62] [53,48,62] 
 5979'293056 2015-05-15 07:40:36.275563  5979'293056
 2015-05-15 07:40:36.275563

 jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova pg
 dump_stuck unclean
 ok
 pg_stat objects mip degrunf bytes   log disklog
 state   state_stamp v   reportedup  acting 
 last_scrub   scrub_stamp  last_deep_scrub deep_scrub_stamp
 3.106   11870   0   9   0   49010106368 163991 
 163991  active  2015-05-15 12:47:19.761469  6035'356332
 5968'1358516 [62,53]  [62,53] 5979'356242 2015-05-14
 22:22:12.966150  5979'351351 2015-05-12 18:04:41.838686
 5.104   0   0   0   0   0   0   0  
 active  2015-05-15 12:47:19.800676  0'0 5968'1615

Re: [ceph-users] NFS interaction with RBD

2015-05-28 Thread John-Paul Robinson
To follow up on the original post,

Further digging indicates this is a problem with RBD image access and is
not related to NFS-RBD interaction as initially suspected.  The nfsd is
simply hanging as a result of a hung request to the XFS file system
mounted on our RBD-NFS gateway.  This hung XFS call is caused by a
problem with the RBD module interacting with our Ceph pool.

I've found a reliable way to trigger a hang directly on an rbd image
mapped into our RBD-NFS gateway box.  The image contains an XFS file
system.  When I try to list the contents of a particular directory, the
request hangs indefinitely.

Two weeks ago our ceph status was:

jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova status
   health HEALTH_WARN 1 near full osd(s)
   monmap e1: 3 mons at

{da0-36-9f-0e-28-2c=172.16.171.6:6789/0,da0-36-9f-0e-2b-88=172.16.171.5:6789/0,da0-36-9f-0e-2b-a0=172.16.171.4:6789/0},
election epoch 350, quorum 0,1,2
da0-36-9f-0e-28-2c,da0-36-9f-0e-2b-88,da0-36-9f-0e-2b-a0
   osdmap e5978: 66 osds: 66 up, 66 in
pgmap v26434260: 3072 pgs: 3062 active+clean, 6
active+clean+scrubbing, 4 active+clean+scrubbing+deep; 45712 GB
data, 91590 GB used, 51713 GB / 139 TB avail; 12234B/s wr, 1op/s
   mdsmap e1: 0/0/1 up


The near full osd was number 53 and we updated our crush map to reweight
the osd.  All of the OSDs had a weight of 1, based on the assumption that
all osds were 2.0TB.  Apparently one of our servers had the OSDs sized to
2.8TB, and this caused the OSD imbalance even though we are only at 50%
utilization.  We reweighted the near full osd to .8 and that initiated a
rebalance that has since relieved the 95% full condition on that OSD.
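
For the record, the reweight was done along these lines (a sketch of the
command form; 0.8 is the crush weight we settled on):

$ sudo ceph osd crush reweight osd.53 0.8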

However, since that time the repeering has not completed and we suspect
this is causing problems with our access of RBD images.   Our current
ceph status is:

jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova status
   health HEALTH_WARN 1 pgs peering; 1 pgs stuck inactive; 4 pgs
stuck unclean; recovery 9/23842120 degraded (0.000%)
   monmap e1: 3 mons at

{da0-36-9f-0e-28-2c=172.16.171.6:6789/0,da0-36-9f-0e-2b-88=172.16.171.5:6789/0,da0-36-9f-0e-2b-a0=172.16.171.4:6789/0},
election epoch 350, quorum 0,1,2
da0-36-9f-0e-28-2c,da0-36-9f-0e-2b-88,da0-36-9f-0e-2b-a0
   osdmap e6036: 66 osds: 66 up, 66 in
pgmap v27104371: 3072 pgs: 3 active, 3056 active+clean, 9
active+clean+scrubbing, 1 remapped+peering, 3
active+clean+scrubbing+deep; 45868 GB data, 92006 GB used, 51297 GB
/ 139 TB avail; 3125B/s wr, 0op/s; 9/23842120 degraded (0.000%)
   mdsmap e1: 0/0/1 up


Here are further details on our stuck pgs:

jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova pg
dump_stuck inactive
ok
pg_stat objects mip degrunf bytes   log disklog
state   state_stamp v   reportedup  acting 
last_scrub   scrub_stamp  last_deep_scrub deep_scrub_stamp
3.3af   11600   0   0   0   47941791744 153812 
153812  remapped+peering2015-05-15 12:47:17.223786 
5979'293066  6000'1248735 [48,62] [53,48,62] 
5979'293056 2015-05-15 07:40:36.275563  5979'293056
2015-05-15 07:40:36.275563

jpr@rcs-02:~/projects/rstore-utils$ sudo ceph --id nova pg
dump_stuck unclean
ok
pg_stat objects mip degrunf bytes   log disklog
state   state_stamp v   reportedup  acting 
last_scrub   scrub_stamp  last_deep_scrub deep_scrub_stamp
3.106   11870   0   9   0   49010106368 163991 
163991  active  2015-05-15 12:47:19.761469  6035'356332
5968'1358516 [62,53]  [62,53] 5979'356242 2015-05-14
22:22:12.966150  5979'351351 2015-05-12 18:04:41.838686
5.104   0   0   0   0   0   0   0  
active  2015-05-15 12:47:19.800676  0'0 5968'1615  
[62,53] [62,53]   0'0 2015-05-14 18:43:22.425105 
0'0 2015-05-08 10:19:54.938934
4.105   0   0   0   0   0   0   0  
active  2015-05-15 12:47:19.801028  0'0 5968'1615  
[62,53] [62,53]   0'0 2015-05-14 18:43:04.434826 
0'0 2015-05-14 18:43:04.434826
3.3af   11600   0   0   0   47941791744 153812 
153812  remapped+peering2015-05-15 12:47:17.223786 
5979'293066  6000'1248735 [48,62] [53,48,62] 
5979'293056 2015-05-15 07:40:36.275563  5979'293056
2015-05-15 07:40:36.275563


The servers in the pool are not overloaded.  On the ceph server that
originally had the nearly full osd (osd.53), I'm seeing entries like
this in the osd log:

2015-05-28 06:25:02.900129 7f2ea8a4f700  0 log [WRN] : 6 slow
requests, 6 included below; oldest blocked for > 1096430.805069 secs
2015-05-28 06:25:02.900145 7f2ea8a4f700  0 log [WRN] : slow request

[ceph-users] NFS interaction with RBD

2015-05-23 Thread John-Paul Robinson (Campus)
We've had an NFS gateway serving up RBD images successfully for over a year.
Ubuntu 12.04 and ceph .73 iirc.

In the past couple of weeks we have developed a problem where the nfs clients 
hang while accessing exported rbd containers. 

We see errors on the server about nfsd hanging for 120sec etc. 

The nfs server is still able to successfully interact with the images it is 
serving. We can export non rbd shares from the local file system and nfs 
clients can use them just fine. 

There seems to be something weird going on with rbd and nfs kernel modules. 

Our ceph pool is in a warn state due to an osd rebalance that is continuing
slowly. But the fact that we continue to have good rbd image access directly on
the server makes me think this is not related.  Also, the nfs server is only a
client of the pool; it doesn't participate in it.

Has anyone experienced similar issues?  

We do have a lot of images attached to the server but the issue is there even
when we map only a few.

Thanks for any pointers. 

~jpr
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Recommended way to use Ceph as storage for file server

2014-06-09 Thread John-Paul Robinson
We have an NFS to RBD gateway with a large number of smaller RBDs.  In
our use case we are allowing users to request their own RBD containers
that are then served up via NFS into a mixed cluster of clients.    Our
gateway is quite beefy, probably more than it needs to be, with 2x8 core
cpus and 96GB ram.  It has been pressed into this service, drawn from a
pool of homogeneous servers rather than being spec'd out for this role
explicitly (it could likely be less beefy).  It has performed well.  Our
RBD nodes are connected via 2x10GB nics in a transmit-load-balance config.

The server has performed well in this role.  It could just be the
specs.  An individual RBD in this NFS gateway won't see the parallel
performance advantages that CephFS promises; however, one potential
advantage is that a multi-RBD backend will be able to simultaneously
manage NFS client requests isolated to different RBDs.   One RBD may
still get a heavy load, but at least the server as a whole has the
potential to spread requests across different devices.

I haven't done load comparisons so this is just a point of interest. 
It's probably moot if the kernel doesn't do a good job of spreading NFS
load across threads or there is some other kernel/RBD constriction point.

~jpr

On 06/02/2014 12:35 PM, Dimitri Maziuk wrote:
 A more or less obvious alternative for CephFS would be to simply create
  a huge RBD and have a separate file server (running NFS / Samba /
  whatever) use that block device as backend. Just put a regular FS on top
  of the RBD and use it that way.
  Clients wouldn't really have any of the real performance and resilience
  benefits that Ceph could offer though, because the (single machine?)
  file server is now the bottleneck.
 Performance: assuming all your nodes are fast storage on a quad-10Gb
 pipe. Resilience: your gateway can be an active-passive HA pair, that
 shouldn't be any different from NFS+DRBD setups.


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] question on harvesting freed space

2014-04-17 Thread John-Paul Robinson
So in the mean time, are there any common work-arounds?

I'm assuming that monitoring the image-used/image-size ratio and, if it's
greater than some tolerance, creating a new image and moving the file system
content over is an effective, if crude, approach.  I'm not clear on how to
measure the amount of storage an image uses at the RBD level, probably
because I don't understand the info output:

$ sudo rbd --id nova info somecontainer
rbd image 'somecontainer':
size 1024 GB in 262144 objects
order 22 (4096 kB objects)
block_name_prefix: rb.0.176f3.238e1f29
format: 1

Are there others?

I assume snapshotting images doesn't help here since RBD still wouldn't
be able to distinguish what's in use and what's not.
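
The crudest measure I can think of is counting the rados objects that
actually exist under the image's block_name_prefix (prefix and pool are
taken from the info output above; it's only an upper bound, since partially
written 4MB objects count fully):

$ sudo rados --id nova -p rbd ls | grep -c '^rb\.0\.176f3\.238e1f29\.'
# multiply the count by the 4MB object size for an allocation estimate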

Thoughts?

~jpr

On 04/17/2014 01:38 AM, Wido den Hollander wrote:
 On 04/17/2014 02:39 AM, Somnath Roy wrote:
 It seems Discard support for kernel rbd is targeted for v0.80.

 http://tracker.ceph.com/issues/190

 
 True, but it will obviously take time before this hits the upstream
 kernels and goes into distributions.
 
 For RHEL 7 it might be that the krbd module from the Ceph extra repo
 might work. For Ubuntu it's waiting for newer kernels to be backported
 to the LTS releases.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] question on harvesting freed space

2014-04-16 Thread John-Paul Robinson
So having learned some about fstrim, I ran it on an SSD backed file
system and it reported space freed. I ran it on an RBD backed file
system and was told it's not implemented. 

This is consistent with the test for FITRIM. 

$ cat /sys/block/rbd3/queue/discard_max_bytes
0

On my SSD backed device I get:

$ cat /sys/block/sda/queue/discard_max_bytes
2147450880

Is this just not needed by RBD or is cleanup handled in a different way?

I'm wondering what will happen to a thin provisioned RBD image over time
on a file system with lots of file create/delete activity.  Will the
storage in the ceph pool stay allocated to this application (the file
system) in that case?
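
Once a kernel with krbd discard support lands, I assume the periodic
cleanup would just be something like (mount point is an example):

$ sudo fstrim -v /mnt/rbd3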

Thanks for any additional insights.

~jpr

On 04/15/2014 04:16 PM, John-Paul Robinson wrote:
 Thanks for the insight.

 Based on that I found the fstrim command for xfs file systems. 

 http://xfs.org/index.php/FITRIM/discard

 Anyone had experience using this command with RBD image backends?

 ~jpr

 On 04/15/2014 02:00 PM, Kyle Bader wrote:
 I'm assuming Ceph/RBD doesn't have any direct awareness of this since
 the file system doesn't traditionally have a give back blocks
 operation to the block device.  Is there anything special RBD does in
 this case that communicates the release of the Ceph storage back to the
 pool?
 VMs running a 3.2+ kernel (iirc) can give back blocks by issuing TRIM.

 http://wiki.qemu.org/Features/QED/Trim

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] question on harvesting freed space

2014-04-15 Thread John-Paul Robinson
Hi,

If I have a 1GB RBD image and format it with, say, xfs or ext4, then I
basically have a thin provisioned disk.  It takes up only as much space
from the Ceph pool as is needed to hold the data structures of the empty
file system.

If I add files to my file systems and then remove them, how does Ceph
deal with these freed blocks?

At the file system level the pointers to the blocks get removed from the
dir tree and the blocks get added to the free list for potential use by
other files.

I'm assuming Ceph/RBD doesn't have any direct awareness of this, since
the file system doesn't traditionally have a "give back blocks"
operation to the block device.  Is there anything special RBD does in
this case that communicates the release of the Ceph storage back to the
pool?

Sorry for any gross oversimplifications.

~jpr
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Live database files on Ceph

2014-04-04 Thread John-Paul Robinson
I've seen this "fast everything except sequential reads" asymmetry in my
own simple dd tests on RBD images but haven't really understood the cause.

Could you clarify what's going on that would cause that kind of
asymmetry?  I've been assuming that once I get around to turning
on/tuning read caching on my underlying OSD nodes the situation will
improve, but I haven't dug into that yet.
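
One client-side knob I plan to try for the slow sequential reads is the
block-layer readahead on the mapped device (device name is an example; the
default is only 128KB):

$ cat /sys/block/rbd0/queue/read_ahead_kb
$ echo 4096 | sudo tee /sys/block/rbd0/queue/read_ahead_kb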

~jpr

On 04/04/2014 04:46 AM, Mark Kirkwood wrote:
 However you may see some asymmetry in this performance - fast random and
 sequential writes, fast random reads but considerably slower sequential
 reads. The RBD cache may help here, but I need to investigate this
 further (and also some of the more fiddly settings to do with virtio
 disk config).
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] rebooting nodes in a ceph cluster

2013-12-19 Thread John-Paul Robinson
What impact does rebooting nodes in a ceph cluster have on the health of
the ceph cluster?  Can it trigger rebalancing activities that then have
to be undone once the node comes back up?

I have a 4 node ceph cluster each node has 11 osds.  There is a single
pool with redundant storage.

If it takes 15 minutes for one of my servers to reboot is there a risk
that some sort of needless automatic processing will begin?

I'm assuming that the ceph cluster can go into a not-ok state, but that
in this particular configuration all the data is protected against the
single node failure and there is no place for the data to migrate to, so
nothing bad will happen.
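
For a planned reboot I assume the usual approach applies, something along
the lines of:

$ sudo ceph osd set noout     # keep the node's osds from being marked out while it reboots
# ...reboot the node and wait for its osds to rejoin...
$ sudo ceph osd unset noout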

Thanks for any feedback.

~jpr
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rebooting nodes in a ceph cluster

2013-12-19 Thread John-Paul Robinson (Campus)
So is it recommended to adjust  the rebalance timeout to align with the time to 
reboot individual nodes?  

I didn't see this in my pass through the ops manual but maybe I'm not looking 
in the right place. 

Thanks,

~jpr

 On Dec 19, 2013, at 6:51 PM, Sage Weil s...@inktank.com wrote:
 
 On Thu, 19 Dec 2013, John-Paul Robinson wrote:
 What impact does rebooting nodes in a ceph cluster have on the health of
 the ceph cluster?  Can it trigger rebalancing activities that then have
 to be undone once the node comes back up?
 
 I have a 4 node ceph cluster each node has 11 osds.  There is a single
 pool with redundant storage.
 
 If it takes 15 minutes for one of my servers to reboot is there a risk
 that some sort of needless automatic processing will begin?
 
 By default, we start rebalancing data after 5 minutes.  You can adjust 
 this (to, say, 15 minutes) with
 
 mon osd down out interval = 900
 
 in ceph.conf.
 
 sage
 
 
 I'm assuming that the ceph cluster can go into a not ok state but that
 in this particular configuration all the data is protected against the
  single node failure and there is no place for the data to migrate to, so
 nothing bad will happen.
 
 Thanks for any feedback.
 
 ~jpr
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Is Ceph a provider of block device too ?

2013-11-21 Thread John-Paul Robinson
Is this statement accurate?

As I understand DRBD, you can replicate online block devices reliably,
but with Ceph the replication for RBD images requires that the file
system be offline.

Thanks for the clarification,

~jpr


On 11/08/2013 03:46 PM, Gregory Farnum wrote:
 Does Ceph provides the distributed filesystem and block device?
 Ceph's RBD is a distributed block device and works very well; you
 could use it to replace DRBD. 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph and RAID

2013-10-03 Thread John-Paul Robinson
What is the list's take on such a configuration?

Is it worth the effort of tracking rebalancing at two layers, RAID
mirror and possibly Ceph if the pool has a redundancy policy?  Or is it
better to just let ceph rebalance itself when you lose a non-mirrored disk?

If following the raid mirror approach, would you then skip redundancy
at the ceph layer to keep your total overhead the same?  It seems that
would be risky in the event you lose your storage server with the
raid-1'd drives.  No Ceph level redundancy would then be fatal.  But if
you do raid-1 plus ceph redundancy, doesn't that mean it takes 4TB for
each 1 real TB?

~jpr

On 10/02/2013 10:03 AM, Dimitri Maziuk wrote:
 I would consider (mdadm) raid-1, dep. on the hardware  budget,
 because this way a single disk failure will not trigger a cluster-wide
 rebalance.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Using RBD with LVM

2013-09-25 Thread John-Paul Robinson
Thanks.  This fixed the problem.

BTW, after adding this line I still got the same error on my pvcreate,
but then I ran pvcreate -vvv and found that it was ignoring my
/dev/rbd1 device because it had detected a partition signature (which I
had added in an earlier attempt to work around this "ignored" issue).

I deleted the partition and the pvcreate worked on all my RBD devices.

A basic recipe for creating an LVM volume is:

for i in 1 2 3
do
  rbd create user1-home-lvm-p0$i --size 102400
  rbd map user1-home-lvm-p0$i
  # pvcreate takes the mapped block device, not the image name
  pvcreate /dev/rbd/rbd/user1-home-lvm-p0$i
done
vgcreate user1-home-vg \
   /dev/rbd/rbd/user1-home-lvm-p01 \
   /dev/rbd/rbd/user1-home-lvm-p02 \
   /dev/rbd/rbd/user1-home-lvm-p03
lvcreate -n user1-home-lv -l 100%FREE user1-home-vg
mkfs.ext4 /dev/user1-home-vg/user1-home-lv
mount /dev/user1-home-vg/user1-home-lv /somewhere
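
And the grow path this buys later looks something like the following (same
naming scheme; the lv can be extended and resized online):

rbd create user1-home-lvm-p04 --size 102400
rbd map user1-home-lvm-p04
pvcreate /dev/rbd/rbd/user1-home-lvm-p04
vgextend user1-home-vg /dev/rbd/rbd/user1-home-lvm-p04
lvextend -l +100%FREE /dev/user1-home-vg/user1-home-lv
resize2fs /dev/user1-home-vg/user1-home-lv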

~jpr

On 09/24/2013 07:58 PM, Mandell Degerness wrote:
 You need to add a line to /etc/lvm/lvm.conf:
 
 types = [ "rbd", 1024 ]
 
 It should be in the devices section of the file.
 
 On Tue, Sep 24, 2013 at 5:00 PM, John-Paul Robinson j...@uab.edu wrote:
 Hi,

 I'm exploring a configuration with multiple Ceph block devices used with
 LVM.  The goal is to provide a way to grow and shrink my file systems
 while they are on line.

 I've created three block devices:

 $ sudo ./ceph-ls  | grep home
 jpr-home-lvm-p01: 102400 MB
 jpr-home-lvm-p02: 102400 MB
 jpr-home-lvm-p03: 102400 MB

 And have them mapped into my kernel (3.2.0-23-generic #36-Ubuntu SMP):

 $ sudo rbd showmapped
 id pool imagesnap device
 0  rbd  jpr-test-vol01   -/dev/rbd0
 1  rbd  jpr-home-lvm-p01 -/dev/rbd1
 2  rbd  jpr-home-lvm-p02 -/dev/rbd2
 3  rbd  jpr-home-lvm-p03 -/dev/rbd3

 In order to use them with LVM, I need to define them as physical
 volumes.  But when I run this command I get an unexpected error:

 $ sudo pvcreate /dev/rbd1
   Device /dev/rbd1 not found (or ignored by filtering).

 I am able to use other RBD on this same machine to create file systems
 directly and mount them:

 $ df -h /mnt-test
 Filesystem  Size  Used Avail Use% Mounted on
 /dev/rbd050G  885M   47G   2% /mnt-test

 Is there a reason that the /dev/rbd[1-2] devices can't be initialized as
 physical volumes in LVM?

 Thanks,

 ~jpr
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Using RBD with LVM

2013-09-25 Thread John-Paul Robinson
Thanks.

After fixing the issue with the types entry in lvm.conf, I discovered
the -vvv option, which helped me detect the second cause of the "ignored"
error: pvcreate saw a partition signature and skipped the device.

The -vvv is a good flag. :)

~jpr

On 09/25/2013 01:52 AM, Wido den Hollander wrote:
 Try this:
 
 $ sudo pvcreate -vvv /dev/rbd1
 
 It has something to do with LVM filtering RBD devices away, you might
 need to add them manually in /etc/lvm/lvm.conf
 
 I've seen this before and fixed it, but I forgot what the root cause was.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RBD on CentOS 6

2013-09-25 Thread John-Paul Robinson
Hi,

We've been working with Ceph 0.56 on Ubuntu 12.04 and are able to
create, map, and mount ceph block devices via the RBD kernel module. We
have a CentOS 6.4 box on which we would like to do the same.

http://ceph.com/docs/next/install/os-recommendations/

The OS recommendations state that we should be at kernel v3.4.20 or better.

Does anyone have any recommendations for or against using a CentOS6.4
platform to work with RBD in the kernel?

We're assuming we will have to upgrade the kernel to 3.4.20 or better
(if possible).
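
The checks we plan to run on any candidate kernel are roughly (a sketch):

$ uname -r
$ modinfo rbd | head -n 3                  # confirm an rbd module is actually built for this kernel
$ sudo modprobe rbd && lsmod | grep rbd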

Thanks,

~jpr
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com