Re: [ceph-users] Ceph PG Incomplete = Cluster unusable

2015-01-07 Thread Christian Balzer
On Wed, 7 Jan 2015 17:07:46 -0800 Craig Lewis wrote:

> On Mon, Dec 29, 2014 at 4:49 PM, Alexandre Oliva  wrote:
> 
> > However, I suspect that temporarily setting min size to a lower number
> > could be enough for the PGs to recover.  If "ceph osd pool  set
> > min_size 1" doesn't get the PGs going, I suppose restarting at least
> > one of the OSDs involved in the recovery, so that the PG undergoes
> > peering again, would get you going again.
> >
> 
> It depends on how incomplete your incomplete PGs are.
> 
> min_size is defined as "Sets the minimum number of replicas required for
> I/O.".  By default, size is 3 and min_size is 2 on recent versions of
> ceph.
> 
> If the number of replicas you have drops below min_size, then Ceph will
> mark the PG as incomplete.  As long as you have one copy of the PG, you
> can recover by lowering the min_size to the number of copies you do
> have, then restoring the original value after recovery is complete.  I
> did this last week when I deleted the wrong PGs as part of a toofull
> experiment.
> 
Which of course raises the question of why not keep min_size at 1
permanently, so that in the (hopefully rare) case of losing 2 OSDs at the
same time your cluster still keeps working (as it should with a size of 3).

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph Minimum Cluster Install (ARM)

2015-01-07 Thread Garg, Pankaj
Hi,
I am trying to get a very minimal Ceph cluster up and running (on ARM) and I'm 
wondering what is the smallest unit that I can run rados-bench on?
Documentation at (http://ceph.com/docs/next/start/quick-ceph-deploy/) seems to 
refer to 4 different nodes: an admin node, a monitor node, and 2 OSD-only nodes.

Can the Admin node be an x86 machine even if the deployment is ARM based?

Or can the Admin Node and Monitor Node co-exist?

Finally, I'm assuming I can get by with only 1 independent OSD node.

If that's possible, I can get by with 2 ARM systems only. Can someone please 
shed some light on whether this will work?

Thanks
Pankaj
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG num calculator live on Ceph.com

2015-01-07 Thread Sanders, Bill
Excellent, thanks for the detailed breakdown.

Take care,
Bill

From: Michael J. Kidd [michael.k...@inktank.com]
Sent: Wednesday, January 07, 2015 4:50 PM
To: Sanders, Bill
Cc: Loic Dachary; ceph-us...@ceph.com
Subject: Re: [ceph-users] PG num calculator live on Ceph.com

Hello Bill,
  Either 2048 or 4096 should be acceptable.  4096 gives about a 300 PG per OSD 
ratio, which would leave room for tripling the OSD count without needing to 
increase the PG number.  2048 gives about 150 PGs per OSD, leaving room for 
only about a 50% OSD count expansion.

The high PG count per OSD issue really doesn't manifest aggressively until you 
get around 1000 PGs per OSD and beyond.  At those levels, steady state 
operation continues without issue.. but recovery within the cluster will see 
the memory utilization of the OSDs climb and could push into out of memory 
conditions on the OSD host (or at a minimum, heavy swap usage if enabled).  It 
still depends of course on the # of OSDs per node, and the amount of memory on 
the node as to if you'll actually experience issues or not.

As an example though, I worked on a cluster which was about 5500 PGs per OSD.  
The cluster experienced a network config issue in the switchgear which isolated 
2/3's of the OSD nodes from each other and the other 1/3 of the cluster.  When 
the network issue was cleared, the OSDs started dropping like flies... They'd 
start up, spool up the memory they needed for map update parsing, and get 
killed before making any real headway.  We were finally able to get the cluster 
online by limiting what the OSDs were doing to a small slice of the normal 
start-up, waiting for the OSDs to calm down, then opening up a bit more for 
them to do (noup, noin, norecover, nobackfill, pause, noscrub, nodeep-scrub 
were all set, and then unset one at a time until all OSDs were up/in and able 
to handle the recovery).

6 weeks later, that same cluster lost about 40% of the OSDs during a power 
outage due to corruption from an HBA bug.. (it didn't flush the write cache to 
disk).  This pushed the PG per OSD count over 9000!!  It simply couldn't 
recover with the available memory at that PG count.  Each OSD, started by 
itself, would consume > 60gb of RAM and get killed (the nodes only had 64gb 
total).

While this is an extreme example... we see cases generated with > 1000 PGs per 
OSD on a regular basis.  This is the type of thing we're trying to head off.

It should be noted that you can increase the PG num of a pool.. but cannot 
decrease!   The only way to reduce your cluster PG count is to create new 
smaller PG num pools, migrate the data and then delete the old, high PG count 
pools.  You could also simply add more OSDs to reduce the PG per OSD ratio.

The issue with too few PGs is poor data distribution.  So it's all about having 
enough PGs to get good data distribution without going too high and having 
resource exhaustion during recovery.

Hope this helps put things into perspective.

Michael J. Kidd
Sr. Storage Consultant
Inktank Professional Services
 - by Red Hat

On Wed, Jan 7, 2015 at 4:34 PM, Sanders, Bill 
<bill.sand...@teradata.com> wrote:
This is interesting.  Kudos to you guys for getting the calculator up, I think 
this'll help some folks.

I have 1 pool, 40 OSDs, and replica of 3.  I based my PG count on: 
http://ceph.com/docs/master/rados/operations/placement-groups/

'''
Less than 5 OSDs set pg_num to 128
Between 5 and 10 OSDs set pg_num to 512
Between 10 and 50 OSDs set pg_num to 4096
'''

But the calculator gives a different result of 2048.  Out of curiosity, what 
sorts of issues might one encounter by having too many placement groups?  I 
understand there's some resource overhead.  I don't suppose it would manifest 
itself in a recognizable way?

Bill


From: ceph-users 
[ceph-users-boun...@lists.ceph.com] 
on behalf of Michael J. Kidd 
[michael.k...@inktank.com]
Sent: Wednesday, January 07, 2015 3:51 PM
To: Loic Dachary
Cc: ceph-us...@ceph.com
Subject: Re: [ceph-users] PG num calculator live on Ceph.com

> Where is the source ?
On the page.. :)  It does link out to jquery and jquery-ui, but all the custom 
bits are embedded in the HTML.

Glad it's helpful :)

Michael J. Kidd
Sr. Storage Consultant
Inktank Professional Services
 - by Red Hat

On Wed, Jan 7, 2015 at 3:46 PM, Loic Dachary 
<l...@dachary.org> wrote:


On 07/01/2015 23:08, Michael J. Kidd wrote:
> Hello all,
>   Just a quick heads up that we now have a PG calculator to help determine 
> the proper PG per pool numbers to achieve a target PG per OSD ratio.
>
> http://ceph.com/pgcalc
>
> Please check it out!  Happy to answer any questions, and always welcome any 
> feedback on the tool / verbiage, etc...

Great work ! That will be immensely useful :-)

Where is the source ?

Cheers

>
> A

Re: [ceph-users] Slow/Hung IOs

2015-01-07 Thread Sanders, Bill
Thanks for your reply, Christian.  Sorry for my delay in responding.

The kernel logs are silent.  Forgot to mention before that ntpd is running and 
the nodes are sync'd.

I'm working on some folks for an updated kernel, but I'm not holding my breath. 
 That said, If I'm seeing this problem by running rados bench on the storage 
cluster itself, is it fair to say that the kernel code isn't the issue?

vm/min_free_kbytes is now set to 512M, though that didn't solve the issue.  I 
also set "filestore_max_sync_interval = 30" (and commented out the journal 
line) as you suggested, but that didn't seem to change anything, either.  Not 
sure what you mean about the monitors and SSD's... they currently *are* hosted 
on SSD's, which don't appear to be 

When rados bench starts, atop (holy crap that's a lot of info) shows that the 
HDD's go crazy for a little while (busy >85%).  The SSD's never get that busy 
(certainly <50%).  I attached a few 'snapshots' of atop taken just after the 
test starts (~12s), while it was still running (~30s), and after the test was 
supposed to have ended (~70s), but was essentially waiting for slow-requests.  
The only thing red-lining at all were the HDD's

I wonder how I could test our network.  Are you thinking its possible we're 
losing packets?  I'll ping (har!) our network guy... 

I have to admit that the OSD logs don't mean a whole lot to me.  Are OSD log 
entries like this normal?  This is not from during the test, but just before 
when the system was essentially idle.

2015-01-07 15:38:40.340883 7fa264ff7700  0 -- 39.71.48.8:6800/46686 >> 
39.71.48.6:6806/47930 pipe(0x7fa268c14480 sd=111 :40639 s=2 pgs=559 cs=13 l=0 
c=0x7fa283060080).fault with nothing to send, going to standby
2015-01-07 15:38:53.573890 7fa2b99f6700  0 -- 39.71.48.8:6800/46686 >> 
39.71.48.9:6805/23130 pipe(0x7fa268c55800 sd=127 :6800 s=2 pgs=152 cs=13 l=0 
c=0x7fa268c17e00).fault with nothing to send, going to standby
2015-01-07 15:38:55.881934 7fa281bfd700  0 -- 39.71.48.8:6800/46686 >> 
39.71.48.9:6809/44433 pipe(0x7fa268c12180 sd=65 :41550 s=2 pgs=599 cs=19 l=0 
c=0x7fa28305fc00).fault with nothing to send, going to standby
2015-01-07 15:38:56.360866 7fa29e1f6700  0 -- 39.71.48.8:6800/46686 >> 
39.71.48.6:6820/48681 pipe(0x7fa268c14980 sd=145 :6800 s=2 pgs=500 cs=21 l=0 
c=0x7fa28305fa80).fault with nothing to send, going to standby
2015-01-07 15:38:58.767181 7fa2a85f6700  0 -- 39.71.48.8:6800/46686 >> 
39.71.48.6:6820/48681 pipe(0x7fa268c55d00 sd=52 :6800 s=0 pgs=0 cs=0 l=0 
c=0x7fa268c18b80).accept connect_seq 22 vs existing 21 state standby
2015-01-07 15:38:58.943514 7fa253cf0700  0 -- 39.71.48.8:6800/46686 >> 
39.71.48.9:6805/23130 pipe(0x7fa268c55f80 sd=49 :6800 s=0 pgs=0 cs=0 l=0 
c=0x7fa268c18d00).accept connect_seq 14 vs existing 13 state standby


For the OSD complaining about slow requests, its logs show something like this 
during the test:

2015-01-07 15:47:28.463470 7fc0714f0700  0 -- 39.7.48.7:6812/16907 >> 
39.7.48.4:0/3544514455 pipe(0x7fc08f827a80 sd=153 :6812 s=0 pgs=0 cs=0 l=0 
c=0x7fc08f882580).accept peer addr is really 39.7.48.4:0/3544514455 (socket is 
39.7.48.4:464
35/0)
2015-01-07 15:48:04.426399 7fc0e9bfd700  0 log [WRN] : 1 slow requests, 1 
included below; oldest blocked for > 30.738429 secs
2015-01-07 15:48:04.426416 7fc0e9bfd700  0 log [WRN] : slow request 30.738429 
seconds old, received at 2015-01-07 15:47:33.687935: osd_op(client.92886.0:4711 
benchmark_data_tvsaq1_29431_object4710 [write 0~4194304] 3.1639422f ack+ondisk+
write e1464) v4 currently waiting for subops from 22,36
2015-01-07 15:48:34.429979 7fc0e9bfd700  0 log [WRN] : 1 slow requests, 1 
included below; oldest blocked for > 60.742016 secs
2015-01-07 15:48:34.429997 7fc0e9bfd700  0 log [WRN] : slow request 60.742016 
seconds old, received at 2015-01-07 15:47:33.687935: osd_op(client.92886.0:4711 
benchmark_data_tvsaq1_29431_object4710 [write 0~4194304] 3.1639422f ack+ondisk+
write e1464) v4 currently waiting for subops from 22,36


From: Christian Balzer [ch...@gol.com]
Sent: Tuesday, January 06, 2015 12:25 AM
To: ceph-users@lists.ceph.com
Cc: Sanders, Bill
Subject: Re: [ceph-users] Slow/Hung IOs

On Mon, 5 Jan 2015 22:36:29 + Sanders, Bill wrote:

> Hi Ceph Users,
>
> We've got a Ceph cluster we've built, and we're experiencing issues with
> slow or hung IO's, even running 'rados bench' on the OSD cluster.
> Things start out great, ~600 MB/s, then rapidly drops off as the test
> waits for IO's. Nothing seems to be taxed... the system just seems to be
> waiting.  Any help trying to figure out what could cause the slow IO's
> is appreciated.
>
I assume nothing in the logs of the respective OSDs either?
Kernel or other logs equally silent?

Watching things with atop (while running the test) not showing anything
particular?

Looking at the myriad of throttles and other data in
http://ceph.com/docs/next/dev/perf_counters/
might be helpful for the affected OSDs.

Having this

Re: [ceph-users] PG num calculator live on Ceph.com

2015-01-07 Thread Lindsay Mathieson
With cephfs we have the two pools - data & metadata. Does that affect the
pg calculations? The metadata pool will have substantially less data than the
data pool.


-- 
Lindsay
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG num calculator live on Ceph.com

2015-01-07 Thread Michael J. Kidd
Hello Bill,
  Either 2048 or 4096 should be acceptable.  4096 gives about a 300 PG per
OSD ratio, which would leave room for tripling the OSD count without
needing to increase the PG number.  2048 gives about 150 PGs per OSD,
leaving room for only about a 50% OSD count expansion.
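For reference, the arithmetic behind those two numbers (assuming your setup of
one pool, 40 OSDs and 3 replicas) is simply pg_num * replicas / OSD count; a
quick shell check:

  echo $(( 4096 * 3 / 40 ))   # -> 307, i.e. ~300 PGs per OSD
  echo $(( 2048 * 3 / 40 ))   # -> 153, i.e. ~150 PGs per OSD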

The high PG count per OSD issue really doesn't manifest aggressively until
you get around 1000 PGs per OSD and beyond.  At those levels, steady state
operation continues without issue.. but recovery within the cluster will
see the memory utilization of the OSDs climb and could push into out of
memory conditions on the OSD host (or at a minimum, heavy swap usage if
enabled).  It still depends of course on the # of OSDs per node, and the
amount of memory on the node as to if you'll actually experience issues or
not.

As an example though, I worked on a cluster which was about 5500 PGs per
OSD.  The cluster experienced a network config issue in the switchgear
which isolated 2/3's of the OSD nodes from each other and the other 1/3 of
the cluster.  When the network issue was cleared, the OSDs started dropping
like flies... They'd start up, spool up the memory they needed for map
update parsing, and get killed before making any real headway.  We were
finally able to get the cluster online by limiting what the OSDs were doing
to a small slice of the normal start-up, waiting for the OSDs to calm down,
then opening up a bit more for them to do (noup, noin, norecover,
nobackfill, pause, noscrub, nodeep-scrub were all set, and then unset one
at a time until all OSDs were up/in and able to handle the recovery).

6 weeks later, that same cluster lost about 40% of the OSDs during a power
outage due to corruption from an HBA bug.. (it didn't flush the write cache
to disk).  This pushed the PG per OSD count over 9000!!  It simply couldn't
recover with the available memory at that PG count.  Each OSD, started by
itself, would consume > 60gb of RAM and get killed (the nodes only had 64gb
total).

While this is an extreme example... we see cases generated with > 1000 PGs
per OSD on a regular basis.  This is the type of thing we're trying to head
off.

It should be noted that you can increase the PG num of a pool.. but cannot
decrease!   The only way to reduce your cluster PG count is to create new
smaller PG num pools, migrate the data and then delete the old, high PG
count pools.  You could also simply add more OSDs to reduce the PG per OSD
ratio.
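For reference, a minimal sketch of the increase (the pool name and target value
are placeholders; pgp_num should be raised to match pg_num so the new PGs are
actually used for data placement):

  ceph osd pool set <poolname> pg_num 2048
  ceph osd pool set <poolname> pgp_num 2048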

The issue with too few PGs is poor data distribution.  So it's all about
having enough PGs to get good data distribution without going too high and
having resource exhaustion during recovery.

Hope this helps put things into perspective.

Michael J. Kidd
Sr. Storage Consultant
Inktank Professional Services
 - by Red Hat

On Wed, Jan 7, 2015 at 4:34 PM, Sanders, Bill 
wrote:

>  This is interesting.  Kudos to you guys for getting the calculator up, I
> think this'll help some folks.
>
> I have 1 pool, 40 OSDs, and replica of 3.  I based my PG count on:
> http://ceph.com/docs/master/rados/operations/placement-groups/
>
> '''
> Less than 5 OSDs set pg_num to 128
> Between 5 and 10 OSDs set pg_num to 512
> Between 10 and 50 OSDs set pg_num to 4096
> '''
>
> But the calculator gives a different result of 2048.  Out of curiosity,
> what sorts of issues might one encounter by having too many placement
> groups?  I understand there's some resource overhead.  I don't suppose it
> would manifest itself in a recognizable way?
>
> Bill
>
>  --
> *From:* ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of
> Michael J. Kidd [michael.k...@inktank.com]
> *Sent:* Wednesday, January 07, 2015 3:51 PM
> *To:* Loic Dachary
> *Cc:* ceph-us...@ceph.com
> *Subject:* Re: [ceph-users] PG num calculator live on Ceph.com
>
>> Where is the source ?
>  On the page.. :)  It does link out to jquery and jquery-ui, but all the
> custom bits are embedded in the HTML.
>
>  Glad it's helpful :)
>
>   Michael J. Kidd
> Sr. Storage Consultant
> Inktank Professional Services
>   - by Red Hat
>
> On Wed, Jan 7, 2015 at 3:46 PM, Loic Dachary  wrote:
>
>>
>>
>> On 07/01/2015 23:08, Michael J. Kidd wrote:
>> > Hello all,
>> >   Just a quick heads up that we now have a PG calculator to help
>> determine the proper PG per pool numbers to achieve a target PG per OSD
>> ratio.
>> >
>> > http://ceph.com/pgcalc
>> >
>> > Please check it out!  Happy to answer any questions, and always welcome
>> any feedback on the tool / verbiage, etc...
>>
>> Great work ! That will be immensely useful :-)
>>
>> Where is the source ?
>>
>> Cheers
>>
>> >
>> > As an aside, we're also working to update the documentation to reflect
>> the best practices.  See Ceph.com tracker for this at:
>> > http://tracker.ceph.com/issues/9867
>> >
>> > Thanks!
>> > Michael J. Kidd
>> > Sr. Storage Consultant
>> > Inktank Professional Services
>> >  - by Red Hat
>> >
>> >
>>  > ___
>> > ceph-use

Re: [ceph-users] PG num calculator live on Ceph.com

2015-01-07 Thread Sanders, Bill
This is interesting.  Kudos to you guys for getting the calculator up, I think 
this'll help some folks.

I have 1 pool, 40 OSDs, and replica of 3.  I based my PG count on: 
http://ceph.com/docs/master/rados/operations/placement-groups/

'''
Less than 5 OSDs set pg_num to 128
Between 5 and 10 OSDs set pg_num to 512
Between 10 and 50 OSDs set pg_num to 4096
'''

But the calculator gives a different result of 2048.  Out of curiosity, what 
sorts of issues might one encounter by having too many placement groups?  I 
understand there's some resource overhead.  I don't suppose it would manifest 
itself in a recognizable way?

Bill


From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Michael J. 
Kidd [michael.k...@inktank.com]
Sent: Wednesday, January 07, 2015 3:51 PM
To: Loic Dachary
Cc: ceph-us...@ceph.com
Subject: Re: [ceph-users] PG num calculator live on Ceph.com

> Where is the source ?
On the page.. :)  It does link out to jquery and jquery-ui, but all the custom 
bits are embedded in the HTML.

Glad it's helpful :)

Michael J. Kidd
Sr. Storage Consultant
Inktank Professional Services
 - by Red Hat

On Wed, Jan 7, 2015 at 3:46 PM, Loic Dachary 
<l...@dachary.org> wrote:


On 07/01/2015 23:08, Michael J. Kidd wrote:
> Hello all,
>   Just a quick heads up that we now have a PG calculator to help determine 
> the proper PG per pool numbers to achieve a target PG per OSD ratio.
>
> http://ceph.com/pgcalc
>
> Please check it out!  Happy to answer any questions, and always welcome any 
> feedback on the tool / verbiage, etc...

Great work ! That will be immensely useful :-)

Where is the source ?

Cheers

>
> As an aside, we're also working to update the documentation to reflect the 
> best practices.  See Ceph.com tracker for this at:
> http://tracker.ceph.com/issues/9867
>
> Thanks!
> Michael J. Kidd
> Sr. Storage Consultant
> Inktank Professional Services
>  - by Red Hat
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

--
Loïc Dachary, Artisan Logiciel Libre


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Slow/Hung IOs

2015-01-07 Thread Christian Balzer

Hello,

On Thu, 8 Jan 2015 00:17:11 + Sanders, Bill wrote:

> Thanks for your reply, Christian.  Sorry for my delay in responding.
> 
> The kernel logs are silent.  Forgot to mention before that ntpd is
> running and the nodes are sync'd.
> 
> I'm working on some folks for an updated kernel, but I'm not holding my
> breath.  That said, If I'm seeing this problem by running rados bench on
> the storage cluster itself, is it fair to say that the kernel code isn't
> the issue?
> 
Well, aside from such nuggets as:
http://tracker.ceph.com/issues/6301 
(which you're obviously not facing, but still)
most people tend to run Ceph with the latest stable-ish kernels for a
variety of reasons. 
If nothing else, you're going to hopefully get some other improvements and
are able to compare notes with a broader group of Ceph users. 

> vm/min_free_kbytes is now set to 512M, though that didn't solve the
> issue.  
I wasn't expecting it to, but if you look at threads as recent as this
one:
http://comments.gmane.org/gmane.comp.file-systems.ceph.user/15167

Setting this with IB HCAs makes a lot of sense.

> I also set "filestore_max_sync_interval = 30" (and commented out
> the journal line) as you suggested, but that didn't seem to change
> anything, either.  

That setting could/should improve journal utilization; it has nothing to
do per se with your problem. Of course you will need to restart all OSDs
(and make sure the change took effect by looking at the active
configuration via the admin socket). 
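For example (a sketch, assuming the default admin socket path and osd.16 as in
the perf dump example further down):

  ceph --admin-daemon /var/run/ceph/ceph-osd.16.asok config get filestore_max_sync_interval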

> Not sure what you mean about the monitors and
> SSD's... they currently *are* hosted on SSD's, which don't appear to be 
> 
Cut off in the middle of the sentence?
Anyways, from your description "2x1TB spinners configured in RAID for the
OS" I have to assume that /var/lib/ceph/ is part of that RAID and that's
where the monitors keep their very active leveldb. 
It really likes to be on SSDs; I could make monitors go wonky on a similar
setup when running bonnie++ on those OS disks.

> When rados bench starts, atop (holy crap that's a lot of info) shows
> that the HDD's go crazy for a little while (busy >85%).  The SSD's never
> get that busy (certainly <50%).  I attached a few 'snapshots' of atop
> taken just after the test starts (~12s), while it was still running
> (~30s), and after the test was supposed to have ended (~70s), but was
> essentially waiting for slow-requests.  The only thing red-lining at all
> were the HDD's
> 
Yeah, atop is quite informative in a big window and if you think that's
TMI, look at the performance counters on each OSD as I mentioned earlier.
"ceph --admin-daemon /var/run/ceph/ceph-osd.16.asok perf dump"

HDDs are supposed to get 100% busy and nothing stands out in particular.
Was one of those disks (on this node) part of a slow request?

I find irqbalance clumsy and often plain wrong, but while your top IRQ
load is nothing to worry about you might want to investigate separating
your network and disk controller IRQs onto separate (real) cores (but
within the same CPU/numa region).

> I wonder how I could test our network.  Are you thinking its possible
> we're losing packets?  I'll ping (har!) our network guy... 
> 
Network people tend to run away screaming when mentioning IB, that's why
I'm the IB guy here and not the 4 (in our team alone) network guys. 

What exactly are you using (hardware, IB stack, IPoIB mode) and are those
single ports or are they bonded?

> I have to admit that the OSD logs don't mean a whole lot to me.  Are OSD
> log entries like this normal?  This is not from during the test, but
> just before when the system was essentially idle.
> 
> 2015-01-07 15:38:40.340883 7fa264ff7700  0 -- 39.71.48.8:6800/46686 >>
> 39.71.48.6:6806/47930 pipe(0x7fa268c14480 sd=111 :40639 s=2 pgs=559
> cs=13 l=0 c=0x7fa283060080).fault with nothing to send, going to standby
> 2015-01-07 15:38:53.573890 7fa2b99f6700  0 -- 39.71.48.8:6800/46686 >>
> 39.71.48.9:6805/23130 pipe(0x7fa268c55800 sd=127 :6800 s=2 pgs=152 cs=13
> l=0 c=0x7fa268c17e00).fault with nothing to send, going to standby
> 2015-01-07 15:38:55.881934 7fa281bfd700  0 -- 39.71.48.8:6800/46686 >>
> 39.71.48.9:6809/44433 pipe(0x7fa268c12180 sd=65 :41550 s=2 pgs=599 cs=19
> l=0 c=0x7fa28305fc00).fault with nothing to send, going to standby
> 2015-01-07 15:38:56.360866 7fa29e1f6700  0 -- 39.71.48.8:6800/46686 >>
> 39.71.48.6:6820/48681 pipe(0x7fa268c14980 sd=145 :6800 s=2 pgs=500 cs=21
> l=0 c=0x7fa28305fa80).fault with nothing to send, going to standby
> 2015-01-07 15:38:58.767181 7fa2a85f6700  0 -- 39.71.48.8:6800/46686 >>
> 39.71.48.6:6820/48681 pipe(0x7fa268c55d00 sd=52 :6800 s=0 pgs=0 cs=0 l=0
> c=0x7fa268c18b80).accept connect_seq 22 vs existing 21 state standby
> 2015-01-07 15:38:58.943514 7fa253cf0700  0 -- 39.71.48.8:6800/46686 >>
> 39.71.48.9:6805/23130 pipe(0x7fa268c55f80 sd=49 :6800 s=0 pgs=0 cs=0 l=0
> c=0x7fa268c18d00).accept connect_seq 14 vs existing 13 state standby
> 
Totally normal.

> 
> For the OSD complaining about 

Re: [ceph-users] PG num calculator live on Ceph.com

2015-01-07 Thread Mark Nelson

Hi Michael,

Good job!  It would be really useful to add in calculations to show the 
expected distribution and max deviation from the mean.


I'm dredging this up from an old email I sent out a year ago, but if we 
treat this as a "balls into bins" problem ala Raab & Steger:


http://www14.in.tum.de/personen/raab/publ/balls.pdf

I believe we can get a tight bound on the maximally loaded bin where:

- m balls in n bins
- m > n

with the formula:

m/n + sqrt(2m*ln(n)/n)

IE, if we say have 9000 balls spread across 90 bins:

9000/90 + sqrt(2*9000*ln(90)/90) =~ 130

vs 9000/90 = 100 on average
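For anyone who wants to plug in their own numbers, the bound is easy to
reproduce from a shell (a sketch; bc -l provides sqrt() and the natural-log
function l()):

  echo "9000/90 + sqrt(2*9000*l(90)/90)" | bc -l   # prints roughly 130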

That would allow folks to get a feel for how much deviation they *could* 
see given different PG/OSD counts.  There are techniques that we can use 
to cheat around this like applying new random seeds during pool creation 
to throw away particularly bad pool topologies.  Unfortunately once the 
topology changes you are bound by random variation again.  Changing OSD 
weight might help, but with multiple pools the skew may be right for one 
pool but wrong for another.


On 01/07/2015 04:08 PM, Michael J. Kidd wrote:

Hello all,
   Just a quick heads up that we now have a PG calculator to help
determine the proper PG per pool numbers to achieve a target PG per OSD
ratio.

http://ceph.com/pgcalc

Please check it out!  Happy to answer any questions, and always welcome
any feedback on the tool / verbiage, etc...

As an aside, we're also working to update the documentation to reflect
the best practices.  See Ceph.com tracker for this at:
http://tracker.ceph.com/issues/9867

Thanks!
Michael J. Kidd
Sr. Storage Consultant
Inktank Professional Services
  - by Red Hat


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph PG Incomplete = Cluster unusable

2015-01-07 Thread Craig Lewis
On Mon, Dec 29, 2014 at 4:49 PM, Alexandre Oliva  wrote:

> However, I suspect that temporarily setting min size to a lower number
> could be enough for the PGs to recover.  If "ceph osd pool  set
> min_size 1" doesn't get the PGs going, I suppose restarting at least one
> of the OSDs involved in the recovery, so that the PG undergoes peering
> again, would get you going again.
>

It depends on how incomplete your incomplete PGs are.

min_size is defined as "Sets the minimum number of replicas required for
I/O.".  By default, size is 3 and min_size is 2 on recent versions of ceph.

If the number of replicas you have drops below min_size, then Ceph will
mark the PG as incomplete.  As long as you have one copy of the PG, you can
recover by lowering the min_size to the number of copies you do have, then
restoring the original value after recovery is complete.  I did this last
week when I deleted the wrong PGs as part of a toofull experiment.
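In command form, that recovery sequence is roughly (a sketch only; <pool> is a
placeholder, and the values assume the defaults of size 3 / min_size 2
mentioned above):

  ceph osd pool get <pool> min_size     # confirm the current value
  ceph osd pool set <pool> min_size 1   # let the PGs go active with a single copy
  ceph -w                               # watch until the PGs are active+clean again
  ceph osd pool set <pool> min_size 2   # restore the original value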

If the number of replicas drops to 0, I think you can use ceph pg
force_create_pg, but I haven't tested it.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG num calculator live on Ceph.com

2015-01-07 Thread Michael J. Kidd
> Where is the source ?
On the page.. :)  It does link out to jquery and jquery-ui, but all the
custom bits are embedded in the HTML.

Glad it's helpful :)

Michael J. Kidd
Sr. Storage Consultant
Inktank Professional Services
 - by Red Hat

On Wed, Jan 7, 2015 at 3:46 PM, Loic Dachary  wrote:

>
>
> On 07/01/2015 23:08, Michael J. Kidd wrote:
> > Hello all,
> >   Just a quick heads up that we now have a PG calculator to help
> determine the proper PG per pool numbers to achieve a target PG per OSD
> ratio.
> >
> > http://ceph.com/pgcalc
> >
> > Please check it out!  Happy to answer any questions, and always welcome
> any feedback on the tool / verbiage, etc...
>
> Great work ! That will be immensely useful :-)
>
> Where is the source ?
>
> Cheers
>
> >
> > As an aside, we're also working to update the documentation to reflect
> the best practices.  See Ceph.com tracker for this at:
> > http://tracker.ceph.com/issues/9867
> >
> > Thanks!
> > Michael J. Kidd
> > Sr. Storage Consultant
> > Inktank Professional Services
> >  - by Red Hat
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
> --
> Loïc Dachary, Artisan Logiciel Libre
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG num calculator live on Ceph.com

2015-01-07 Thread Loic Dachary


On 07/01/2015 23:08, Michael J. Kidd wrote:
> Hello all,
>   Just a quick heads up that we now have a PG calculator to help determine 
> the proper PG per pool numbers to achieve a target PG per OSD ratio. 
> 
> http://ceph.com/pgcalc
> 
> Please check it out!  Happy to answer any questions, and always welcome any 
> feedback on the tool / verbiage, etc...

Great work ! That will be immensely useful :-)

Where is the source ?

Cheers

> 
> As an aside, we're also working to update the documentation to reflect the 
> best practices.  See Ceph.com tracker for this at:
> http://tracker.ceph.com/issues/9867
> 
> Thanks!
> Michael J. Kidd
> Sr. Storage Consultant
> Inktank Professional Services
>  - by Red Hat
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

-- 
Loïc Dachary, Artisan Logiciel Libre



signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG num calculator live on Ceph.com

2015-01-07 Thread Christopher O'Connell
Ah, so I've been doing it wrong all this time (I thought we had to take the
size multiple into account ourselves).

Thanks!

On Wed, Jan 7, 2015 at 4:25 PM, Michael J. Kidd 
wrote:

> Hello Christopher,
>   Keep in mind that the PGs per OSD (and per pool) calculations take into
> account the replica count ( pool size= parameter ).  So, for example.. if
> you're using a default of 3 replicas.. 16 * 3 = 48 PGs which allows for at
> least one PG per OSD on that pool.  Even with a size=2, 32 PGs total still
> gives very close to 1 PG per OSD.  Being that it's such a low utilization
> pool, this is still sufficient.
>
> Thanks,
> Michael J. Kidd
> Sr. Storage Consultant
> Inktank Professional Services
>  - by Red Hat
>
> On Wed, Jan 7, 2015 at 3:17 PM, Christopher O'Connell 
> wrote:
>
>> Hi,
>>
>> I"m playing with this with a modest sized ceph cluster (36x6TB disks).
>> Based on this it says that small pools (such as .users) would have just 16
>> PGs. Is this correct? I've historically always made even these small pools
>> have at least as many PGs as the next power of 2 over my number of OSDs (64
>> in this case).
>>
>> All the best,
>>
>> ~ Christopher
>>
>> On Wed, Jan 7, 2015 at 3:08 PM, Michael J. Kidd wrote:
>>
>>> Hello all,
>>>   Just a quick heads up that we now have a PG calculator to help
>>> determine the proper PG per pool numbers to achieve a target PG per OSD
>>> ratio.
>>>
>>> http://ceph.com/pgcalc
>>>
>>> Please check it out!  Happy to answer any questions, and always welcome
>>> any feedback on the tool / verbiage, etc...
>>>
>>> As an aside, we're also working to update the documentation to reflect
>>> the best practices.  See Ceph.com tracker for this at:
>>> http://tracker.ceph.com/issues/9867
>>>
>>> Thanks!
>>> Michael J. Kidd
>>> Sr. Storage Consultant
>>> Inktank Professional Services
>>>  - by Red Hat
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>>
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG num calculator live on Ceph.com

2015-01-07 Thread Michael J. Kidd
Hello Christopher,
  Keep in mind that the PGs per OSD (and per pool) calculations take into
account the replica count ( pool size= parameter ).  So, for example.. if
you're using a default of 3 replicas.. 16 * 3 = 48 PGs which allows for at
least one PG per OSD on that pool.  Even with a size=2, 32 PGs total still
gives very close to 1 PG per OSD.  Being that it's such a low utilization
pool, this is still sufficient.

Thanks,
Michael J. Kidd
Sr. Storage Consultant
Inktank Professional Services
 - by Red Hat

On Wed, Jan 7, 2015 at 3:17 PM, Christopher O'Connell 
wrote:

> Hi,
>
> I"m playing with this with a modest sized ceph cluster (36x6TB disks).
> Based on this it says that small pools (such as .users) would have just 16
> PGs. Is this correct? I've historically always made even these small pools
> have at least as many PGs as the next power of 2 over my number of OSDs (64
> in this case).
>
> All the best,
>
> ~ Christopher
>
> On Wed, Jan 7, 2015 at 3:08 PM, Michael J. Kidd 
> wrote:
>
>> Hello all,
>>   Just a quick heads up that we now have a PG calculator to help
>> determine the proper PG per pool numbers to achieve a target PG per OSD
>> ratio.
>>
>> http://ceph.com/pgcalc
>>
>> Please check it out!  Happy to answer any questions, and always welcome
>> any feedback on the tool / verbiage, etc...
>>
>> As an aside, we're also working to update the documentation to reflect
>> the best practices.  See Ceph.com tracker for this at:
>> http://tracker.ceph.com/issues/9867
>>
>> Thanks!
>> Michael J. Kidd
>> Sr. Storage Consultant
>> Inktank Professional Services
>>  - by Red Hat
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG num calculator live on Ceph.com

2015-01-07 Thread Christopher O'Connell
Hi,

I"m playing with this with a modest sized ceph cluster (36x6TB disks).
Based on this it says that small pools (such as .users) would have just 16
PGs. Is this correct? I've historically always made even these small pools
have at least as many PGs as the next power of 2 over my number of OSDs (64
in this case).

All the best,

~ Christopher

On Wed, Jan 7, 2015 at 3:08 PM, Michael J. Kidd 
wrote:

> Hello all,
>   Just a quick heads up that we now have a PG calculator to help determine
> the proper PG per pool numbers to achieve a target PG per OSD ratio.
>
> http://ceph.com/pgcalc
>
> Please check it out!  Happy to answer any questions, and always welcome
> any feedback on the tool / verbiage, etc...
>
> As an aside, we're also working to update the documentation to reflect the
> best practices.  See Ceph.com tracker for this at:
> http://tracker.ceph.com/issues/9867
>
> Thanks!
> Michael J. Kidd
> Sr. Storage Consultant
> Inktank Professional Services
>  - by Red Hat
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] PG num calculator live on Ceph.com

2015-01-07 Thread Michael J. Kidd
Hello all,
  Just a quick heads up that we now have a PG calculator to help determine
the proper PG per pool numbers to achieve a target PG per OSD ratio.

http://ceph.com/pgcalc

Please check it out!  Happy to answer any questions, and always welcome any
feedback on the tool / verbiage, etc...

As an aside, we're also working to update the documentation to reflect the
best practices.  See Ceph.com tracker for this at:
http://tracker.ceph.com/issues/9867

Thanks!
Michael J. Kidd
Sr. Storage Consultant
Inktank Professional Services
 - by Red Hat
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-deploy dependency errors on fc20 with firefly

2015-01-07 Thread Travis Rhoden
Hi Noah,

I'll try to recreate this on a fresh FC20 install as well.  Looks to
me like there might be a repo priority issue.  It's mixing packages
from Fedora downstream repos and the ceph.com upstream repos.  That's
not supposed to happen.

 - Travis

On Wed, Jan 7, 2015 at 2:15 PM, Noah Watkins  wrote:
> I'm trying to install Firefly on an up-to-date FC20 box. I'm getting
> the following errors:
>
> [nwatkins@kyoto cluster]$ ../ceph-deploy/ceph-deploy install --release
> firefly kyoto
> [ceph_deploy.conf][DEBUG ] found configuration file at:
> /home/nwatkins/.cephdeploy.conf
> [ceph_deploy.cli][INFO  ] Invoked (1.5.21): ../ceph-deploy/ceph-deploy
> install --release firefly kyoto
> [ceph_deploy.install][DEBUG ] Installing stable version firefly on
> cluster ceph hosts kyoto
> [ceph_deploy.install][DEBUG ] Detecting platform for host kyoto ...
> [kyoto][DEBUG ] connection detected need for sudo
> [kyoto][DEBUG ] connected to host: kyoto
> [kyoto][DEBUG ] detect platform information from remote host
> [kyoto][DEBUG ] detect machine type
> [ceph_deploy.install][INFO  ] Distro info: Fedora 20 Heisenbug
> [kyoto][INFO  ] installing ceph on kyoto
> [kyoto][INFO  ] Running command: sudo yum -y install yum-plugin-priorities
> [kyoto][DEBUG ] Loaded plugins: langpacks, priorities, refresh-packagekit
> [kyoto][DEBUG ] Package yum-plugin-priorities-1.1.31-27.fc20.noarch
> already installed and latest version
> [kyoto][DEBUG ] Nothing to do
> [kyoto][INFO  ] Running command: sudo rpm --import
> https://ceph.com/git/?p=ceph.git;a=blob_plain;f=keys/release.asc
> [kyoto][INFO  ] Running command: sudo rpm -Uvh --replacepkgs --force
> --quiet 
> http://ceph.com/rpm-firefly/fc20/noarch/ceph-release-1-0.fc20.noarch.rpm
> [kyoto][DEBUG ] 
> [kyoto][DEBUG ] Updating / installing...
> [kyoto][DEBUG ] 
> [kyoto][WARNIN] ensuring that /etc/yum.repos.d/ceph.repo contains a
> high priority
> [kyoto][WARNIN] altered ceph.repo priorities to contain: priority=1
> [kyoto][INFO  ] Running command: sudo yum -y -q install ceph
> [kyoto][WARNIN] Error: Package: 1:python-cephfs-0.80.7-1.fc20.x86_64 (updates)
> [kyoto][WARNIN]Requires: libcephfs1 = 1:0.80.7-1.fc20
> [kyoto][WARNIN]Available: libcephfs1-0.80.1-0.fc20.x86_64 (Ceph)
> [kyoto][DEBUG ]  You could try using --skip-broken to work around the problem
> [kyoto][WARNIN]libcephfs1 = 0.80.1-0.fc20
> [kyoto][WARNIN]Available: libcephfs1-0.80.3-0.fc20.x86_64 (Ceph)
> [kyoto][WARNIN]libcephfs1 = 0.80.3-0.fc20
> [kyoto][WARNIN]Available: libcephfs1-0.80.4-0.fc20.x86_64 (Ceph)
> [kyoto][WARNIN]libcephfs1 = 0.80.4-0.fc20
> [kyoto][WARNIN]Available: libcephfs1-0.80.5-0.fc20.x86_64 (Ceph)
> [kyoto][WARNIN]libcephfs1 = 0.80.5-0.fc20
> [kyoto][WARNIN]Available: libcephfs1-0.80.6-0.fc20.x86_64 (Ceph)
> [kyoto][WARNIN]libcephfs1 = 0.80.6-0.fc20
> [kyoto][WARNIN]Installing: libcephfs1-0.80.7-0.fc20.x86_64 (Ceph)
> [kyoto][WARNIN]libcephfs1 = 0.80.7-0.fc20
> [kyoto][WARNIN] Error: Package: 1:python-rbd-0.80.7-1.fc20.x86_64 (updates)
> [kyoto][WARNIN]Requires: librbd1 = 1:0.80.7-1.fc20
> [kyoto][WARNIN]Available: librbd1-0.80.1-0.fc20.x86_64 (Ceph)
> [kyoto][WARNIN]librbd1 = 0.80.1-0.fc20
> [kyoto][WARNIN]Available: librbd1-0.80.3-0.fc20.x86_64 (Ceph)
> [kyoto][WARNIN]librbd1 = 0.80.3-0.fc20
> [kyoto][WARNIN]Available: librbd1-0.80.4-0.fc20.x86_64 (Ceph)
> [kyoto][WARNIN]librbd1 = 0.80.4-0.fc20
> [kyoto][WARNIN]Available: librbd1-0.80.5-0.fc20.x86_64 (Ceph)
> [kyoto][WARNIN]librbd1 = 0.80.5-0.fc20
> [kyoto][WARNIN]Available: librbd1-0.80.6-0.fc20.x86_64 (Ceph)
> [kyoto][WARNIN]librbd1 = 0.80.6-0.fc20
> [kyoto][WARNIN]Installing: librbd1-0.80.7-0.fc20.x86_64 (Ceph)
> [kyoto][WARNIN]librbd1 = 0.80.7-0.fc20
> [kyoto][WARNIN] Error: Package: 1:python-rados-0.80.7-1.fc20.x86_64 (updates)
> [kyoto][WARNIN]Requires: librados2 = 1:0.80.7-1.fc20
> [kyoto][WARNIN]Available: librados2-0.80.1-0.fc20.x86_64 (Ceph)
> [kyoto][WARNIN]librados2 = 0.80.1-0.fc20
> [kyoto][WARNIN]Available: librados2-0.80.3-0.fc20.x86_64 (Ceph)
> [kyoto][WARNIN]librados2 = 0.80.3-0.fc20
> [kyoto][WARNIN]Available: librados2-0.80.4-0.fc20.x86_64 (Ceph)
> [kyoto][WARNIN]librados2 = 0.80.4-0.fc20
> [kyoto][WARNIN]Available: librados2-0.80.5-0.fc20.x86_64 (Ceph)
> [kyoto][WARNIN]librados2 = 0.80.5-0.fc20
> [kyoto][WARNIN]Available: librados2-0.80.6-0.fc20.x86_64 (Ceph)
> [kyoto][WARNIN]librados2 = 0.80.6-0.fc20
> [kyoto][WARNIN]Installing: librados

Re: [ceph-users] Erasure code pool overhead

2015-01-07 Thread Italo Santos
Thanks Nick.  

At.

Italo Santos
http://italosantos.com.br/


On Wednesday, January 7, 2015 at 18:44, Nick Fisk wrote:

> Hi Italo,
>   
> =k/(k+m)
>   
> Where k is data chunks and m is coding chunks.
>   
> For example k=8 m=2 would give you
>   
> =8/(8+2)
>   
> .8 or 80% usable storage and 20% used for coding. Please keep in mind however 
> that you can’t fill up the storage completely.
>   
> Nick
>   
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> Italo Santos
> Sent: 06 January 2015 22:14
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] Erasure code pool overhead
>   
> Hello,  
>  
>   
>  
> I’d like to know how can I calculate the overhead of a erasure pool?
>  
>   
>  
> Regards.
>  
>   
>  
> Italo Santos
>  
> http://italosantos.com.br/
>  
>   
>  
>  
>  
>  

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Is ceph production ready? [was: Ceph PG Incomplete = Cluster unusable]

2015-01-07 Thread Nico Schottelius
Hello Dan,

it is good to know that there are actually people using ceph + qemu in
production!

Regarding replicas: I thought about using size = 2, but I see that
this resembles RAID 5, and that size = 3 is more or less equivalent to RAID 6
in terms of loss tolerance.

Regarding the kernel panics: I am still researching / trying to find out
why they happen. They can easily be reproduced by triggering high amount
of i/o in a VM. 

We are mostly running Debian (stable, testing, stable+backports) that
shows the kernel panics.
Ubuntu has not shown this behaviour so far, afair.

So if anyone has experienced kernel panics in Qemu-VMs running on RBD
(and fixed it), please let me know!

Cheers,

Nico

p.s.: We are *not* using rbdmap / kernel mounts - it's just qemu running with
qemu-system-x86_64 -enable-kvm -name one-204 -S -machine 
pc-i440fx-trusty,accel=kvm,usb=off -m 512 -realtime mlock=off -smp 
2,sockets=2,cores=1,threads=1 -uuid d7c3374e-349e-4db6-8f54-f3c607f93101
-no-user-config -nodefaults -chardev 
socket,id=charmonitor,path=/var/lib/libvirt/qemu/one-204.monitor,server,nowait 
-mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown 
-boot strict=on
-device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device 
lsi,id=scsi0,bus=pci.0,addr=0x4 -drive
file=rbd:one/one-53-204-0:id=libvirt:key=...:auth_supported=cephx\;none:mon_host=kaffee.private.ungleich.ch\;wein.private.ungleich.ch\;tee.private.ungleich.ch,if=none,id=drive-scsi0-0-0,format=raw,cache=none
-device 
scsi-hd,bus=scsi0.0,scsi-id=0,drive=drive-scsi0-0-0,id=scsi0-0-0,bootindex=1 
-drive 
file=/var/lib/one//datastores/0/204/disk.1,if=none,id=drive-ide0-0-0,readonly=on,format=raw
 -device
ide-cd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -netdev 
tap,fd=24,id=hostnet0 -device 
rtl8139,netdev=hostnet0,id=net0,mac=02:00:4d:6d:96:ae,bus=pci.0,addr=0x3 -vnc 
0.0.0.0:204 -device
cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device 
virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5


Dan Van Der Ster [Wed, Jan 07, 2015 at 08:12:29PM +]:
> Hi Nico,
> Yes Ceph is production ready. Yes people are using it in production for qemu. 
> Last time I heard, Ceph was surveyed as the most popular backend for 
> OpenStack Cinder in production.
> 
> When using RBD in production, it really is critically important to (a) use 3 
> replicas and (b) pay attention to pg distribution early on so that you don't 
> end up with unbalanced OSDs.
> 
> Replication is especially important for RBD because you 
> _must_not_ever_lose_an_entire_pg_. Parts of every single rbd device are 
> stored on every single PG... So losing a PG means you lost random parts of 
> every single block device. If this happens, the only safe course of action is 
> to restore from backups. But the whole point of Ceph is that it enables you 
> to configure adequate replication across failure domains, which makes this 
> scenario very very very unlikely to occur.
> 
> I don't know why you were getting kernel panics. It's probably advisable to 
> stick to the most recent mainline kernel when using kRBD.
> 
> Cheers, Dan
> 
> On 7 Jan 2015 20:45, Nico Schottelius  wrote:
> Good evening,
> 
> we also tried to rescue data *from* our old / broken pool by map'ing the
> rbd devices, mounting them on a host and rsync'ing away as much as
> possible.
> 
> However, after some time rsync got completly stuck and eventually the
> host which mounted the rbd mapped devices decided to kernel panic at
> which time we decided to drop the pool and go with a backup.
> 
> This story and the one of Christian makes me wonder:
> 
> Is anyone using ceph as a backend for qemu VM images in production?
> 
> And:
> 
> Has anyone on the list been able to recover from a pg incomplete /
> stuck situation like ours?
> 
> Reading about the issues on the list here gives me the impression that
> ceph as a software is stuck/incomplete and has not yet become ready
> "clean" for production (sorry for the word joke).
> 
> Cheers,
> 
> Nico
> 
> Christian Eichelmann [Tue, Dec 30, 2014 at 12:17:23PM +0100]:
> > Hi Nico and all others who answered,
> >
> > After some more trying to somehow get the pgs in a working state (I've
> > tried force_create_pg, which was putting then in creating state. But
> > that was obviously not true, since after rebooting one of the containing
> > osd's it went back to incomplete), I decided to save what can be saved.
> >
> > I've created a new pool, created a new image there, mapped the old image
> > from the old pool and the new image from the new pool to a machine, to
> > copy data on posix level.
> >
> > Unfortunately, formatting the image from the new pool hangs after some
> > time. So it seems that the new pool is suffering from the same problem
> > as the old pool. Which is totally not understandable for me.
> >
> > Right now, it seems like Ceph is giving me no options to either save
> > some of the still intact rbd volumes, or to create a new pool along the
> > old one to at least enable our clients

Re: [ceph-users] Block and NAS Services for Non Linux OS

2015-01-07 Thread Nick Fisk
Hi Steven,

 

Until the RBD/FS drivers are developed for those particular OS’s you are forced 
to use a Linux server to “proxy” the storage into another format which those 
OS’s can understand.

 

However if you take a look on the Dev mailing list, somebody has just posted a 
link to a Windows CephFS driver, with the potential for there to be a Windows 
RBD driver sometime in the future.

 

I believe the ESXi Driver API’s are available, so who knows, somebody may 
develop a native ESXi driver in the future too.

 

But in the meantime if you are worried about scale, you can always make use of 
multiple proxy nodes to spread the load across more hardware.

 

Nick

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Steven 
Sim
Sent: 30 December 2014 12:26
To: Eneko Lacunza
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Block and NAS Services for Non Linux OS

 

Hello Eneko;

 

Firstly, thanks for your comments!

 

You mentioned that machines see a QEMU IDE/SCSI disk, they don't know whether 
it's on ceph, NFS, local, LVM, ... so it works OK for any VM guest OS.

 

But what if I want to CEPH cluster to serve a whole range of clients in the 
data center, ranging from ESXi, Microsoft Hypervisors, Solaris (unvirtualized), 
AIX (unvirtualized) etc ...

 

In particular, I'm being asked to create a NAS and iSCSI Block storage farm 
with an ability to serve not just Linux but a range of operating system(s), 
some virtualized, some not . ...

 

I love the distributed nature of Ceph, but using proxy nodes (or heads) sort of 
goes against the distributed concept...



Warmest Regards
Steven Sim
Mobile : 96963117
Principal Systems
77 High Street
#10-07 High Street Plaza
Singapore 179433
Company Registration Number : 201002783M

 

On 30 December 2014 at 18:55, Eneko Lacunza <elacu...@binovo.es> wrote:

Hi Steven,

Welcome to the list.

On 30/12/14 11:47, Steven Sim wrote:

This is my first posting and I apologize if the content or query is not 
appropriate.

My understanding for CEPH is the block and NAS services are through specialized 
(albeit opensource) kernel modules for Linux.

What about the other OS e.g. Solaris, AIX, Windows, ESX ...

If the solution is to use a proxy, would using the MON servers (as iSCSI and 
NAS proxies) be okay?

Virtual machines see a QEMU IDE/SCSI disk; they don't know whether it's on ceph, 
NFS, local, LVM, ... so it works OK for any VM guest OS.

Currently on Proxmox, it's qemu-kvm the ceph (RBD) client, not the linux kernel.


What about performance?


It depends a lot on the setup. Do you have something on your mind? :)

Cheers
Eneko

-- 
Zuzendari Teknikoa / Director Técnico
Binovo IT Human Project, S.L.
Telf. 943575997
  943493611
Astigarraga bidea 2, planta 6 dcha., ofi. 3-2; 20180 Oiartzun (Gipuzkoa)
www.binovo.es  

___
ceph-users mailing list
ceph-users@lists.ceph.com  
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

 




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph on Centos 7

2015-01-07 Thread Travis Rhoden
Hello,

Can you give the link the exact instructions you followed?

For CentOS7 (EL7) ceph-extras should not be necessary.  The instructions at
[1] do not have you enable the ceph-extras repo.  You will find that there
are EL7 packages at [2].  I recently found a README that was incorrectly
referencing ceph-extras when it came to ceph-deploy.  I'm wondering if
there may be other incorrect instructions floating around. I'm guessing the
confusion may be coming from [3].  I think a note should be added there
that ceph-extras is not needed for EL7.  Right now it just says this is
needed for "some Ceph deployments", but as you have found, if you enable it
on EL7, it won't work.

Can you try removing the ceph-extras repo definition and see if that fixes
things?
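Something along these lines should do it (a sketch; I'm assuming the repo
definition landed in /etc/yum.repos.d/ceph-extras.repo, so adjust the filename
to whatever was actually added on your box):

  sudo rm /etc/yum.repos.d/ceph-extras.repo
  sudo yum clean all
  sudo yum install ceph-deploy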

 - Travis


[1]
http://ceph.com/docs/master/start/quick-start-preflight/#red-hat-package-manager-rpm
[2] http://ceph.com/rpm-giant/
[3] http://ceph.com/docs/master/install/get-packages/#add-ceph-extras

On Tue, Jan 6, 2015 at 2:40 AM, Nur Aqilah 
wrote:

> Hi all,
>
> I was wondering if anyone can give me some guidelines in installing ceph
> on Centos 7. I followed the guidelines on ceph.com on how to do the Quick
> Installation. But there was always this one particular error. When I typed
> in this command "sudo yum update && sudo yum install ceph-deploy" a long
> error pops up. I later checked and found out that el7/CentOS 7 is not
> listed in here http://ceph.com/packages/ceph-extras/rpm/
>
> Together attached is a screenshot of the error that i was talking about. I
> would really appreciate it if someone would kindly help me out
>
> Thank you and regards,
>
> *Nur Aqilah Abdul Rahman*
>
> Systems Engineer
>
> *impact* *business solutions Sdn Bhd*
>
> E303, Level 3 East Wing Metropolitan Square,
> Jalan PJU 8/1, Damansara Perdana,
> 47820 Petaling Jaya, Selangor Darul Ehsan
>
> P: 03 7728 6826
> F: 03 7728 5826
>
> Thanks & Regards,
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Hanging VMs with Qemu + RBD

2015-01-07 Thread Nico Schottelius
Hello Achim,

good to hear someone else running this setup. We have changed the number
of backfills using

ceph tell osd.\* injectargs '--osd-max-backfills 1'

and it seems to work mostly in regards of issues when rebalancing.

One unsolved problem we have is machines kernel panic'ing, when i/o is
slow. We usually see a kernel panic in the sym53c8xx driver, especially for
those VMs with high i/o rates. We tried to upgrade the kernel in the VM
(Debian stable 3.2.0 -> Debian backports 3.16.0), but just have
different kernel panic in the same driver now.

Have you had the same problem and if so, how did you get it fixed?

Cheers,

Nico

Achim Ledermüller [Wed, Jan 07, 2015 at 05:42:38PM +0100]:
> Hi,
> 
> We have the same setup including OpenNebula 4.10.1. We had some
> backfilling due to node failures and node expansion. If we throttle
> osd_max_backfills there is not a problem at all. If the value for
> backfilling jobs is too high, we can see delayed reactions within the
> shell, eg. `ls -lh` needs 2 seconds.
> 
> Kind regards,
> Achim
> 
> -- 
> Achim Ledermüller, M. Sc.
> Systems Engineer
> 
> NETWAYS Managed Services GmbH | Deutschherrnstr. 15-19 | D-90429 Nuernberg
> Tel: +49 911 92885-0 | Fax: +49 911 92885-77
> GF: Julian Hein, Bernd Erk | AG Nuernberg HRB25207
> http://www.netways.de | achim.ledermuel...@netways.de
> 
> ** OSDC 2015 - April - osdc.de **
> ** Puppet Camp Berlin 2015 - April - netways.de/puppetcamp **
> ** OSBConf 2015 - September – osbconf.org **
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

-- 
New PGP key: 659B 0D91 E86E 7E24 FD15  69D0 C729 21A1 293F 2D24
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd directory listing performance issues

2015-01-07 Thread Shain Miley
Just to follow up on this thread, the main reason that the rbd directory 
listing latency was an issue for us was that we were seeing a large amount of 
IO delay in a PHP app that reads from that rbd image.

It occurred to me (based on Roberts cache_dir suggestion below) that maybe 
doing a recursive find or a recursive directory listing inside the one folder 
in question might speed things up.

After doing the recursive find...the directory listing seems much faster and 
the responsiveness of the PHP app has increased as well.

Hopefully nothing else will need to be done here, however it seems that worst 
case...a daily or weekly cronjob that traverses the directory tree in that 
folder might be all we need.
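Something like the following is what I have in mind (a sketch only;
/mnt/archive is a stand-in for wherever the rbd image is mounted). Stat'ing
every entry keeps the dentry/inode information warm in the page cache:

  # hypothetical /etc/cron.daily/warm-archive-cache
  #!/bin/sh
  find /mnt/archive -ls > /dev/null 2>&1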

Thanks again for all the help.

Shain 



Shain Miley | Manager of Systems and Infrastructure, Digital Media | 
smi...@npr.org | 202.513.3649


From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Shain Miley 
[smi...@npr.org]
Sent: Tuesday, January 06, 2015 8:16 PM
To: Christian Balzer; ceph-us...@ceph.com
Subject: Re: [ceph-users] rbd directory listing performance issues

Christian,

Each of the OSD's server nodes are running on Dell R-720xd's with 64 GB or RAM.

We have 107 OSD's so I have not checked all of them..however the ones I have 
checked with xfs_db, have shown anywhere from 1% to 4% fragmentation.

I'll try to upgrade the client server to 32 or 64 GB of RAM at some point 
soon; however, at this point all the tuning that I have done has not yielded 
all that much in terms of results.

It may simply be that I need to look into adding some SSDs, and that the 
overall bottleneck here is the 4TB 7200 rpm disks we are using.

In general, when looking at the graphs in Calamari, we see around 20ms latency 
(await) for our OSD's however there are lots of times where we see (via the 
graphs) spikes of 250ms to 400ms as well.

Thanks again,

Shain


Shain Miley | Manager of Systems and Infrastructure, Digital Media | 
smi...@npr.org | 202.513.3649


From: Christian Balzer [ch...@gol.com]
Sent: Tuesday, January 06, 2015 7:34 PM
To: ceph-us...@ceph.com
Cc: Shain Miley
Subject: Re: [ceph-users] rbd directory listing performance issues

Hello,

On Tue, 6 Jan 2015 15:29:50 + Shain Miley wrote:

> Hello,
>
> We currently have a 12 node (3 monitor+9 OSD) ceph cluster, made up of
> 107 x 4TB drives formatted with xfs. The cluster is running ceph version
> 0.80.7:
>
I assume journals on the same HDD then.

How much memory per node?

[snip]
>
> A while back I created an 80 TB rbd image to be used as an archive
> repository for some of our audio and video files. We are still seeing
> good rados and rbd read and write throughput performance, however we
> seem to be having quite a long delay in response times when we try to
> list out the files in directories with a large number of folders, files,
> etc.
>
> Subsequent directory listing times seem to run a lot faster (but I am
> not sure for long that is the case before we see another instance of
> slowness), however the initial directory listings can take 20 to 45
> seconds.
>

Basically the same thing(s) that Robert said.
How big is "large"?
How much memory on the machine you're mounting this image?
Ah, never mind, just saw your follow-up.

Definitely add memory to this machine if you can.

The initial listing is always going to be slow-ish of sorts depending on
a number of things in the cluster.

As in, how busy is it (IOPS)? With journals on disk your HDDs are going to
be sluggish individually and your directory information might reside
mostly in one object (on one OSD), thus limiting you to the speed of that
particular disk.

And this is also where the memory of your storage nodes comes in, if it is
large enough your "hot" objects will get cached there as well.
To see if that's the case (at least temporarily), drop the caches on all
of your storage nodes (echo 3 > /proc/sys/vm/drop_caches), mount your
image, do the "ls -l" until it's "fast", umount it, mount it again and do
the listing again.
In theory, unless your cluster is extremely busy or your storage nodes have
very little pagecache, the re-mounted image should get all the info it
needs from said pagecache on your storage nodes, never having to go to the
actual OSD disks and thus be fast(er) than the initial test.

Finally to potentially improve the initial scan that has to come from the
disks obviously, see how fragmented your OSDs are and depending on the
results defrag them.

Christian
--
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] Erasure code pool overhead

2015-01-07 Thread Nick Fisk
Hi Italo,

The usable fraction is k/(k+m), where k is data chunks and m is coding chunks.

For example, k=8 and m=2 would give you:

8/(8+2) = 0.8

i.e. 80% usable storage and 20% used for coding. Please keep in mind however 
that you can’t fill up the storage completely.

Nick
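
A quick way to check the numbers for any profile from the shell, assuming bc
is available (k and m below are just the example values above):

    k=8; m=2
    echo "scale=2; $k / ($k + $m)" | bc    # usable fraction         -> .80
    echo "scale=2; ($k + $m) / $k" | bc    # raw bytes per data byte -> 1.25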

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Italo 
Santos
Sent: 06 January 2015 22:14
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Erasure code pool overhead

 

Hello, 

 

I’d like to know how I can calculate the overhead of an erasure pool?

 

Regards.

 

Italo Santos

  http://italosantos.com.br/

 




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph-deploy dependency errors on fc20 with firefly

2015-01-07 Thread Noah Watkins
I'm trying to install Firefly on an up-to-date FC20 box. I'm getting
the following errors:

[nwatkins@kyoto cluster]$ ../ceph-deploy/ceph-deploy install --release
firefly kyoto
[ceph_deploy.conf][DEBUG ] found configuration file at:
/home/nwatkins/.cephdeploy.conf
[ceph_deploy.cli][INFO  ] Invoked (1.5.21): ../ceph-deploy/ceph-deploy
install --release firefly kyoto
[ceph_deploy.install][DEBUG ] Installing stable version firefly on
cluster ceph hosts kyoto
[ceph_deploy.install][DEBUG ] Detecting platform for host kyoto ...
[kyoto][DEBUG ] connection detected need for sudo
[kyoto][DEBUG ] connected to host: kyoto
[kyoto][DEBUG ] detect platform information from remote host
[kyoto][DEBUG ] detect machine type
[ceph_deploy.install][INFO  ] Distro info: Fedora 20 Heisenbug
[kyoto][INFO  ] installing ceph on kyoto
[kyoto][INFO  ] Running command: sudo yum -y install yum-plugin-priorities
[kyoto][DEBUG ] Loaded plugins: langpacks, priorities, refresh-packagekit
[kyoto][DEBUG ] Package yum-plugin-priorities-1.1.31-27.fc20.noarch
already installed and latest version
[kyoto][DEBUG ] Nothing to do
[kyoto][INFO  ] Running command: sudo rpm --import
https://ceph.com/git/?p=ceph.git;a=blob_plain;f=keys/release.asc
[kyoto][INFO  ] Running command: sudo rpm -Uvh --replacepkgs --force
--quiet http://ceph.com/rpm-firefly/fc20/noarch/ceph-release-1-0.fc20.noarch.rpm
[kyoto][DEBUG ] 
[kyoto][DEBUG ] Updating / installing...
[kyoto][DEBUG ] 
[kyoto][WARNIN] ensuring that /etc/yum.repos.d/ceph.repo contains a
high priority
[kyoto][WARNIN] altered ceph.repo priorities to contain: priority=1
[kyoto][INFO  ] Running command: sudo yum -y -q install ceph
[kyoto][WARNIN] Error: Package: 1:python-cephfs-0.80.7-1.fc20.x86_64 (updates)
[kyoto][WARNIN]Requires: libcephfs1 = 1:0.80.7-1.fc20
[kyoto][WARNIN]Available: libcephfs1-0.80.1-0.fc20.x86_64 (Ceph)
[kyoto][DEBUG ]  You could try using --skip-broken to work around the problem
[kyoto][WARNIN]libcephfs1 = 0.80.1-0.fc20
[kyoto][WARNIN]Available: libcephfs1-0.80.3-0.fc20.x86_64 (Ceph)
[kyoto][WARNIN]libcephfs1 = 0.80.3-0.fc20
[kyoto][WARNIN]Available: libcephfs1-0.80.4-0.fc20.x86_64 (Ceph)
[kyoto][WARNIN]libcephfs1 = 0.80.4-0.fc20
[kyoto][WARNIN]Available: libcephfs1-0.80.5-0.fc20.x86_64 (Ceph)
[kyoto][WARNIN]libcephfs1 = 0.80.5-0.fc20
[kyoto][WARNIN]Available: libcephfs1-0.80.6-0.fc20.x86_64 (Ceph)
[kyoto][WARNIN]libcephfs1 = 0.80.6-0.fc20
[kyoto][WARNIN]Installing: libcephfs1-0.80.7-0.fc20.x86_64 (Ceph)
[kyoto][WARNIN]libcephfs1 = 0.80.7-0.fc20
[kyoto][WARNIN] Error: Package: 1:python-rbd-0.80.7-1.fc20.x86_64 (updates)
[kyoto][WARNIN]Requires: librbd1 = 1:0.80.7-1.fc20
[kyoto][WARNIN]Available: librbd1-0.80.1-0.fc20.x86_64 (Ceph)
[kyoto][WARNIN]librbd1 = 0.80.1-0.fc20
[kyoto][WARNIN]Available: librbd1-0.80.3-0.fc20.x86_64 (Ceph)
[kyoto][WARNIN]librbd1 = 0.80.3-0.fc20
[kyoto][WARNIN]Available: librbd1-0.80.4-0.fc20.x86_64 (Ceph)
[kyoto][WARNIN]librbd1 = 0.80.4-0.fc20
[kyoto][WARNIN]Available: librbd1-0.80.5-0.fc20.x86_64 (Ceph)
[kyoto][WARNIN]librbd1 = 0.80.5-0.fc20
[kyoto][WARNIN]Available: librbd1-0.80.6-0.fc20.x86_64 (Ceph)
[kyoto][WARNIN]librbd1 = 0.80.6-0.fc20
[kyoto][WARNIN]Installing: librbd1-0.80.7-0.fc20.x86_64 (Ceph)
[kyoto][WARNIN]librbd1 = 0.80.7-0.fc20
[kyoto][WARNIN] Error: Package: 1:python-rados-0.80.7-1.fc20.x86_64 (updates)
[kyoto][WARNIN]Requires: librados2 = 1:0.80.7-1.fc20
[kyoto][WARNIN]Available: librados2-0.80.1-0.fc20.x86_64 (Ceph)
[kyoto][WARNIN]librados2 = 0.80.1-0.fc20
[kyoto][WARNIN]Available: librados2-0.80.3-0.fc20.x86_64 (Ceph)
[kyoto][WARNIN]librados2 = 0.80.3-0.fc20
[kyoto][WARNIN]Available: librados2-0.80.4-0.fc20.x86_64 (Ceph)
[kyoto][WARNIN]librados2 = 0.80.4-0.fc20
[kyoto][WARNIN]Available: librados2-0.80.5-0.fc20.x86_64 (Ceph)
[kyoto][WARNIN]librados2 = 0.80.5-0.fc20
[kyoto][WARNIN]Available: librados2-0.80.6-0.fc20.x86_64 (Ceph)
[kyoto][WARNIN]librados2 = 0.80.6-0.fc20
[kyoto][WARNIN]Installing: librados2-0.80.7-0.fc20.x86_64 (Ceph)
[kyoto][WARNIN]librados2 = 0.80.7-0.fc20
[kyoto][DEBUG ]  You could try running: rpm -Va --nofiles --nodigest
[kyoto][ERROR ] RuntimeError: command returned non-zero exit status: 1
[ceph_deploy][ERROR ] RuntimeError: Failed to execute command: yum -y
-q install ceph
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] Placement groups stuck inactive after down & out of 1/9 OSDs

2015-01-07 Thread Chris Murray
Thank you for your assistance Craig. At the time, I hadn’t noted placement 
group details, but I know to do that if I get inactive placement groups again. 
I’m still getting familiar with the cluster, with 15 OSDs now across five 
hosts, a mix of good and bad drives, XFS/BTRFS and with/without SSD journals so 
I can start to understand what sort of differences the options make.

 

Thanks again.

 

From: Craig Lewis [mailto:cle...@centraldesktop.com] 
Sent: 19 December 2014 23:22
To: Chris Murray
Cc: ceph-users
Subject: Re: [ceph-users] Placement groups stuck inactive after down & out of 
1/9 OSDs

 

With only one OSD down and size = 3, you shouldn't've had any PGs inactive.  At 
worst, they should've been active+degraded.

 

The only thought I have is that some of your PGs aren't mapping to the correct 
number of OSDs.  That's not supposed to be able to happen unless you've messed 
up your crush rules.

 

You might go through ceph pg dump, and verify that all PGs have 3 OSDs in the 
reporting and acting columns, and that there are no duplicate OSDs in those 
lists.  

 

With your 1216 PGs, it might be faster to write a script to parse the JSON than 
to do it manually.  If you happen to remember some PGs that were inactive or 
degraded, you could spot check those.
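
A rough sketch of such a check, assuming jq is installed and that the
firefly-era 'ceph pg dump --format json' output has a top-level pg_stats array
with up/acting lists (adjust the key paths and the replica count if your
version differs):

    ceph pg dump --format json 2>/dev/null | jq -r '
      .pg_stats[]
      | select((.up | length) != 3
               or (.acting | length) != 3
               or (.up | unique | length)     != (.up | length)
               or (.acting | unique | length) != (.acting | length))
      | .pgid'
    # prints the pgid of every PG whose up/acting set is not 3 distinct OSDs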

 

 

 

On Fri, Dec 19, 2014 at 11:45 AM, Chris Murray  wrote:

Interesting indeed, those tuneables were suggested on the pve-user mailing list 
too, and they certainly sound like they’ll ease the pressure during the 
recovery operation. What I might not have explained very well though is that 
the VMs hung indefinitely and past the end of the recovery process, rather than 
being slow; almost as if the 78 stuck inactive placement groups contained data 
which was critical to VM operation. Looking at IO and performance in the 
cluster is certainly on the to-do list, with a scale-out of nodes and move of 
journals to SSD, but of course that needs some investment and I’d like to prove 
things first. It’s a bit catch-22 :-)

To my knowledge, the cluster was HEALTH_OK before and it is HEALTH_OK now, BUT 
... I've not followed my usual advice of stopping and thinking about things 
before trying something else, so I suppose the marking of the OSD 'up' this 
morning (which turned those 78 into some other ACTIVE+* states) has spoiled the 
chance of troubleshooting. I’ve been messing around with osd.0 since too, and 
the health is now:

cluster e3dd7a1a-bd5f-43fe-a06f-58e830b93b7a
 health HEALTH_OK
 monmap e3: 3 mons at 
{0=192.168.12.25:6789/0,1=192.168.12.26:6789/0,2=192.168.12.27:6789/0}, 
election epoch 58, quorum 0,1,2 0,1,2
 osdmap e1205: 9 osds: 9 up, 9 in
  pgmap v120175: 1216 pgs, 4 pools, 892 GB data, 224 kobjects
2679 GB used, 9790 GB / 12525 GB avail
1216 active+clean

If it helps at all, the other details are as follows. Nothing from 'dump stuck' 
although I expect there would have been this morning.

root@ceph25:~# ceph osd tree
# id    weight  type name       up/down reweight
-1  12.22   root default
-2  4.3 host ceph25
3   0.9 osd.3   up  1
6   0.68osd.6   up  1
0   2.72osd.0   up  1
-3  4.07host ceph26
1   2.72osd.1   up  1
4   0.9 osd.4   up  1
7   0.45osd.7   up  1
-4  3.85host ceph27
2   2.72osd.2   up  1
5   0.68osd.5   up  1
8   0.45osd.8   up  1
root@ceph25:~# ceph osd dump | grep ^pool
pool 0 'data' replicated size 3 min_size 1 crush_ruleset 0 object_hash rjenkins 
pg_num 64 pgp_num 64 last_change 1 flags hashpspool crash_replay_interval 45 
stripe_width 0
pool 1 'metadata' replicated size 3 min_size 1 crush_ruleset 0 object_hash 
rjenkins pg_num 64 pgp_num 64 last_change 1 flags hashpspool stripe_width 0
pool 2 'rbd' replicated size 3 min_size 1 crush_ruleset 0 object_hash rjenkins 
pg_num 64 pgp_num 64 last_change 1 flags hashpspool stripe_width 0
pool 3 'vmpool' replicated size 3 min_size 1 crush_ruleset 0 object_hash 
rjenkins pg_num 1024 pgp_num 1024 last_change 187 flags hashpspool stripe_width 0
root@ceph25:~# ceph pg dump_stuck
ok


The more I think about this problem, the less I think there'll be an easy 
answer, and it's more likely that I'll have to reproduce the scenario and 
actually pause myself next time in order to troubleshoot it?

From: Craig Lewis [mailto:cle...@centraldesktop.com]
Sent: 19 December 2014 19:17
To: Chris Murray
Cc: ceph-users
Subject: Re: [ceph-users] Placement groups stuck inactive after down & out of 
1/9 OSDs


That seems odd.  So you have 3 nodes, with 3 OSDs each.  You should've been 
able to mark osd.0 down and out, then stop the daemon without having those 
issues.

It's generally best to mark an osd down, then out, and wait until the clust

Re: [ceph-users] Is ceph production ready? [was: Ceph PG Incomplete = Cluster unusable]

2015-01-07 Thread Dan Van Der Ster
Hi Nico,
Yes Ceph is production ready. Yes people are using it in production for qemu. 
Last time I heard, Ceph was surveyed as the most popular backend for OpenStack 
Cinder in production.

When using RBD in production, it really is critically important to (a) use 3 
replicas and (b) pay attention to pg distribution early on so that you don't 
end up with unbalanced OSDs.

Replication is especially important for RBD because you 
_must_not_ever_lose_an_entire_pg_. Parts of every single rbd device are stored 
on every single PG... So losing a PG means you lost random parts of every 
single block device. If this happens, the only safe course of action is to 
restore from backups. But the whole point of Ceph is that it enables you to 
configure adequate replication across failure domains, which makes this 
scenario very very very unlikely to occur.
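
For reference, the per-pool replica settings can be checked and adjusted like
this ('rbd' is just an example pool name):

    ceph osd pool get rbd size        # current number of replicas
    ceph osd pool get rbd min_size    # copies required before I/O is allowed
    ceph osd pool set rbd size 3
    ceph osd pool set rbd min_size 2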

I don't know why you were getting kernel panics. It's probably advisable to 
stick to the most recent mainline kernel when using kRBD.

Cheers, Dan

On 7 Jan 2015 20:45, Nico Schottelius  wrote:
Good evening,

we also tried to rescue data *from* our old / broken pool by map'ing the
rbd devices, mounting them on a host and rsync'ing away as much as
possible.

However, after some time rsync got completly stuck and eventually the
host which mounted the rbd mapped devices decided to kernel panic at
which time we decided to drop the pool and go with a backup.

This story and the one of Christian makes me wonder:

Is anyone using ceph as a backend for qemu VM images in production?

And:

Has anyone on the list been able to recover from a pg incomplete /
stuck situation like ours?

Reading about the issues on the list here gives me the impression that
ceph as a software is stuck/incomplete and has not yet become ready
"clean" for production (sorry for the word joke).

Cheers,

Nico

Christian Eichelmann [Tue, Dec 30, 2014 at 12:17:23PM +0100]:
> Hi Nico and all others who answered,
>
> After some more trying to somehow get the pgs in a working state (I've
> tried force_create_pg, which was putting then in creating state. But
> that was obviously not true, since after rebooting one of the containing
> osd's it went back to incomplete), I decided to save what can be saved.
>
> I've created a new pool, created a new image there, mapped the old image
> from the old pool and the new image from the new pool to a machine, to
> copy data on posix level.
>
> Unfortunately, formatting the image from the new pool hangs after some
> time. So it seems that the new pool is suffering from the same problem
> as the old pool. Which is totaly not understandable for me.
>
> Right now, it seems like Ceph is giving me no options to either save
> some of the still intact rbd volumes, or to create a new pool along the
> old one to at least enable our clients to send data to ceph again.
>
> To tell the truth, I guess that will result in the end of our ceph
> project (running for already 9 Monthes).
>
> Regards,
> Christian
>
> Am 29.12.2014 15:59, schrieb Nico Schottelius:
> > Hey Christian,
> >
> > Christian Eichelmann [Mon, Dec 29, 2014 at 10:56:59AM +0100]:
> >> [incomplete PG / RBD hanging, osd lost also not helping]
> >
> > that is very interesting to hear, because we had a similar situation
> > with ceph 0.80.7 and had to re-create a pool, after I deleted 3 pg
> > directories to allow OSDs to start after the disk filled up completly.
> >
> > So I am sorry not to being able to give you a good hint, but I am very
> > interested in seeing your problem solved, as it is a show stopper for
> > us, too. (*)
> >
> > Cheers,
> >
> > Nico
> >
> > (*) We migrated from sheepdog to gluster to ceph and so far sheepdog
> > seems to run much smoother. The first one is however not supported
> > by opennebula directly, the second one not flexible enough to host
> > our heterogeneous infrastructure (mixed disk sizes/amounts) - so we
> > are using ceph at the moment.
> >
>
>
> --
> Christian Eichelmann
> Systemadministrator
>
> 1&1 Internet AG - IT Operations Mail & Media Advertising & Targeting
> Brauerstraße 48 · DE-76135 Karlsruhe
> Telefon: +49 721 91374-8026
> christian.eichelm...@1und1.de
>
> Amtsgericht Montabaur / HRB 6484
> Vorstände: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Robert
> Hoffmann, Markus Huhn, Hans-Henning Kettler, Dr. Oliver Mauss, Jan Oetjen
> Aufsichtsratsvorsitzender: Michael Scheeren

--
New PGP key: 659B 0D91 E86E 7E24 FD15  69D0 C729 21A1 293F 2D24
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Archives haven't been updated since Dec 8?

2015-01-07 Thread Patrick McGarry
Looks like there was (is) a technical issue at Dreamhost that is being
actively worked on. I put in a request to get mmarch run manually for
now until the issue is resolved. You can always browse the posts in
real time from the archive pages:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/



Best Regards,

Patrick McGarry
Director Ceph Community || Red Hat
http://ceph.com  ||  http://community.redhat.com
@scuttlemonkey || @ceph


On Tue, Dec 23, 2014 at 5:09 PM, Christopher Armstrong
 wrote:
> I was trying to link a colleague to a message on the mailing list, and
> noticed the archives haven't been rebuilt since Dec 8:
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/
>
> Did something break there?
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd resize (shrink) taking forever and a day

2015-01-07 Thread Josh Durgin

On 01/06/2015 10:24 AM, Robert LeBlanc wrote:

Can't this be done in parallel? If the OSD doesn't have an object then
it is a noop and should be pretty quick. The number of outstanding
operations can be limited to 100 or a 1000 which would provide a
balance between speed and performance impact if there is data to be
trimmed. I'm not a big fan of a "--skip-trimming" option as there is
the potential to leave some orphan objects that may not be cleaned up
correctly.


Yeah, a --skip-trimming option seems a bit dangerous. This trimming
actually is parallelized (10 ops at once by default, changeable via
--rbd-concurrent-management-ops) since dumpling.
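
For example, something like this should work (the image name is the one from
Edwin's original post; the value of 20 is arbitrary):

    rbd resize --allow-shrink --size 665600 \
        --rbd-concurrent-management-ops 20 \
        client-disk-img0/vol-x318644f-0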

What will really help without being dangerous is keeping a map of
object existence [1]. This will avoid any unnecessary trimming
automatically, and it should be possible to add to existing images.
It should be in hammer.

Josh

[1] https://github.com/ceph/ceph/pull/2700


On Tue, Jan 6, 2015 at 8:09 AM, Jake Young  wrote:



On Monday, January 5, 2015, Chen, Xiaoxi  wrote:


When you shrink the RBD, most of the time is spent in
librbd/internal.cc::trim_image(); in this function, the client will iterate over
all unnecessary objects (no matter whether they exist) and delete them.



So in this case, when Edwin shrinks his RBD from 650PB to 650GB,
there are [ (650PB * 1024GB/PB - 650GB) * 1024MB/GB ] / 4MB/Object =
170,227,200 objects that need to be deleted. That will definitely take a long time,
since the rbd client needs to send a delete request to the OSD, and the OSD needs
to find the object context and delete it (or find that it doesn’t exist at all). The
time needed to trim an image is proportional to the size being trimmed.



Making another image of the correct size and copying your VM's file system to
the new image, then deleting the old one, will NOT help in general, because
deleting the old volume will take exactly the same time as shrinking;
they both need to call trim_image().



The solution in my mind may be that we can provide a “--skip-trimming” flag to
skip the trimming. When the administrator is absolutely sure that no writes have
taken place in the shrinking area (that means there is no object created in
that area), they can use this flag to skip the time-consuming trimming.



How do you think?



That sounds like a good solution. Like doing "undo grow image"





From: Jake Young [mailto:jak3...@gmail.com]
Sent: Monday, January 5, 2015 9:45 PM
To: Chen, Xiaoxi
Cc: Edwin Peer; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] rbd resize (shrink) taking forever and a day





On Sunday, January 4, 2015, Chen, Xiaoxi  wrote:

You could use rbd info   to see the block_name_prefix, the
object name consist like .,  so for
example, rb.0.ff53.3d1b58ba.e6ad should be the th object  of
the volume with block_name_prefix rb.0.ff53.3d1b58ba.

  $ rbd info huge
 rbd image 'huge':
  size 1024 TB in 268435456 objects
  order 22 (4096 kB objects)
  block_name_prefix: rb.0.8a14.2ae8944a
  format: 1

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
Edwin Peer
Sent: Monday, January 5, 2015 3:55 AM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] rbd resize (shrink) taking forever and a day

Also, which rbd objects are of interest?


ganymede ~ # rados -p client-disk-img0 ls | wc -l
1672636


And, all of them have cryptic names like:

rb.0.ff53.3d1b58ba.e6ad
rb.0.6d386.1d545c4d.00011461
rb.0.50703.3804823e.1c28
rb.0.1073e.3d1b58ba.b715
rb.0.1d76.2ae8944a.022d

which seem to bear no resemblance to the actual image names that the rbd
command line tools understands?

Regards,
Edwin Peer

On 01/04/2015 08:48 PM, Jake Young wrote:



On Sunday, January 4, 2015, Dyweni - Ceph-Users
<6exbab4fy...@dyweni.com > wrote:

 Hi,

 If it's the only thing in your pool, you could try deleting the
 pool instead.

 I found that to be faster in my testing; I had created 500TB when
 I meant to create 500GB.

 Note for the Devs: It would be nice if rbd create/resize would
 accept sizes with units (i.e. MB, GB, TB, PB, etc).




 On 2015-01-04 08:45, Edwin Peer wrote:

 Hi there,

 I did something stupid while growing an rbd image. I
accidentally
 mistook the units of the resize command for bytes instead of
 megabytes
 and grew an rbd image to 650PB instead of 650GB. This all
happened
 instantaneously enough, but trying to rectify the mistake is
 not going
 nearly as well.

 
 ganymede ~ # rbd resize --size 665600 --allow-shrink
 client-disk-img0/vol-x318644f-0
 Resizing image: 1% complete...
 

 It took a couple days before it started showing 1% complete
 and has
 been stuck on 1% for a couple more. At this rate, I should be
 able to
 shrink the image back to the intended size in about 2016.

   

[ceph-users] Erasure code pool overhead

2015-01-07 Thread Italo Santos
Hello,  

I’d like to know how can I calculate the overhead of a erasure pool?  

Regards.

Italo Santos
http://italosantos.com.br/

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Data recovery after RBD I/O error

2015-01-07 Thread Austin S Hemmelgarn

On 2015-01-06 23:11, Jérôme Poulin wrote:

On Mon, Jan 5, 2015 at 6:59 AM, Austin S Hemmelgarn
 wrote:

Secondly, I would highly recommend not using ANY non-cluster-aware FS on top
of a clustered block device like RBD



For my use-case, this is just a single server using the RBD device. No
clustering involved on the BTRFS side of thing.
My only point is that there isn't anything in BTRFS to handle it 
accidentally being multiply mounted.  Ext* for example aren't clustered, 
but do have an optional feature to prevent multiple mounting.

However, it was really useful to take snapshots (just like LVM) before 
modifying the
filesystem in any way.

Have you tried Ceph's built in snapshot support?  I don't remember how 
to use it, but I do know it is there (at least, it is in the most recent 
versions), and it is a bit more like LVM's snapshots than BTRFS is.
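
(For reference, a minimal sketch of that workflow; pool, image and snapshot 
names here are placeholders:)

    rbd snap create rbd/myimage@before-repair    # point-in-time snapshot
    rbd snap ls rbd/myimage                      # list snapshots of the image
    rbd snap rollback rbd/myimage@before-repair  # roll back (image must not be in use)
    rbd snap rm rbd/myimage@before-repair        # delete the snapshot when done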




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Rebuilding Cluster from complete MON failure with existing OSDs

2015-01-07 Thread Dan Geist
Hi, I have a situation where I moved the interfaces over which my ceph-public 
network is connected (only the interfaces, not the IPs, etc.); this was done to 
increase available bandwidth, but it backfired catastrophically. My monitors 
all failed and somehow became corrupted, and I was unable to repair them. So I 
rebuilt the monitors in the hope that I could add the existing OSDs back in and 
recover the cluster.

There are three hosts. Each has a monitor and 6 osds. Each osd is a spinning 
disk partition with a journal located on an SSD partition on the same host. From 
what I can tell, all the data on the osd disks is intact, but even after (what 
I think was) adding all the OSDs back into the crushmap, etc. the cluster 
doesn't seem like it is "seeing" the partitions and I'm at a loss for how to 
troubleshoot it further.

Hosts are all Ubuntu trusty running 0.80.7 ceph packages.

dgeist# ceph -s
cluster ac486394-802a-49d3-a92c-a103268ea189
 health HEALTH_WARN 4288 pgs stuck inactive; 4288 pgs stuck unclean; 18/18 
in osds are down
 monmap e1: 3 mons at 
{hypd01=10.100.100.11:6789/0,hypd02=10.100.100.12:6789/0,hypd03=10.100.100.13:6789/0},
 election epoch 40, quorum 0,1,2 hypd01,hypd02,hypd03
 osdmap e65: 18 osds: 0 up, 18 in
  pgmap v66: 4288 pgs, 4 pools, 0 bytes data, 0 objects
0 kB used, 0 kB / 0 kB avail
4288 creating

dgeist# ceph osd tree
# id    weight  type name       up/down reweight
-1  18  root default
-2  6   host hypd01
0   1   osd.0   down1   
1   1   osd.1   down1   
2   1   osd.2   down1   
3   1   osd.3   down1   
4   1   osd.4   down1   
5   1   osd.5   down1   
-3  6   host hypd02
6   1   osd.6   down1   
7   1   osd.7   down1   
8   1   osd.8   down1   
9   1   osd.9   down1   
10  1   osd.10  down1   
11  1   osd.11  down1   
-4  6   host hypd03
12  1   osd.12  down1   
13  1   osd.13  down1   
14  1   osd.14  down1   
15  1   osd.15  down1   
16  1   osd.16  down1   
17  1   osd.17  down1


Thanks in advance for any thoughts on how to recover this.

Dan

Dan Geist dan(@)polter.net
(33.942973, -84.312472)
http://www.polter.net


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd directory listing performance issues

2015-01-07 Thread Christian Balzer

Hello,

On Tue, 6 Jan 2015 15:29:50 + Shain Miley wrote:

> Hello,
> 
> We currently have a 12 node (3 monitor+9 OSD) ceph cluster, made up of
> 107 x 4TB drives formatted with xfs. The cluster is running ceph version
> 0.80.7:
> 
I assume journals on the same HDD then.

How much memory per node?

[snip]
> 
> A while back I created an 80 TB rbd image to be used as an archive
> repository for some of our audio and video files. We are still seeing
> good rados and rbd read and write throughput performance, however we
> seem to be having quite a long delay in response times when we try to
> list out the files in directories with a large number of folders, files,
> etc.
> 
> Subsequent directory listing times seem to run a lot faster (but I am
> not sure for long that is the case before we see another instance of
> slowness), however the initial directory listings can take 20 to 45
> seconds.
> 

Basically the same thing(s) that Robert said.
How big is "large"?
How much memory on the machine you're mounting this image?
Ah, never mind, just saw your follow-up.

Definitely add memory to this machine if you can.

The initial listing is always going to be slow-ish of sorts depending on
a number of things in the cluster.

As in, how busy is it (IOPS)? With journals on disk your HDDs are going to
be sluggish individually and your directory information might reside
mostly in one object (on one OSD), thus limiting you to the speed of that
particular disk.

And this is also where the memory of your storage nodes comes in, if it is
large enough your "hot" objects will get cached there as well. 
To see if that's the case (at least temporarily), drop the caches on all
of your storage nodes (echo 3 > /proc/sys/vm/drop_caches), mount your
image, do the "ls -l" until it's "fast", umount it, mount it again and do
the listing again. 
In theory, unless your cluster is extremely busy or your storage nodes have
very little pagecache, the re-mounted image should get all the info it
needs from said pagecache on your storage nodes, never having to go to the
actual OSD disks and thus be fast(er) than the initial test.
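
Spelled out as a rough test script (host names, the rbd device and the mount
point are placeholders; it assumes the image is already mapped and that you
have root ssh access to the storage nodes):

    # drop the pagecache on every storage node
    for node in storage1 storage2 storage3; do
        ssh root@$node 'sync; echo 3 > /proc/sys/vm/drop_caches'
    done

    mount /dev/rbd0 /mnt/test
    time ls -l /mnt/test/some/dir > /dev/null   # cold: comes from the OSD disks
    umount /mnt/test

    mount /dev/rbd0 /mnt/test
    time ls -l /mnt/test/some/dir > /dev/null   # re-mounted: served from OSD pagecache
    umount /mnt/test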

Finally to potentially improve the initial scan that has to come from the
disks obviously, see how fragmented your OSDs are and depending on the
results defrag them.

Christian
-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd resize (shrink) taking forever and a day

2015-01-07 Thread Robert LeBlanc
Seems like a message bus would be nice. Each opener of an RBD could
subscribe for messages on the bus for that RBD. Anytime the map is modified
a message could be put on the bus to update the others. That opens up a
whole other can of worms though.

Robert LeBlanc

Sent from a mobile device please excuse any typos.
On Jan 6, 2015 5:35 PM, "Josh Durgin"  wrote:

> On 01/06/2015 04:19 PM, Robert LeBlanc wrote:
>
>> The bitmap certainly sounds like it would help shortcut a lot of code
>> that Xiaoxi mentions. Is the idea that the client caches the bitmap
>> for the RBD so it know which OSDs to contact (thus saving a round trip
>> to the OSD), or only for the OSD to know which objects exist on it's
>> disk?
>>
>
> It's purely at the rbd level, so librbd caches it and maintains its
> consistency. The idea is that since it's kept consistent, librbd can do
> things like delete exactly the objects that exist without any
> extra communication with the osds. Many things that were
> O(size of image) become O(written objects in image).
>
> The only restriction is that keeping the object map consistent requires
> a single writer, so this does not work for the rare case of e.g. ocfs2
> on top of rbd, where there are multiple clients writing to the same
> rbd image at once.
>
> Josh
>
>  On Tue, Jan 6, 2015 at 4:19 PM, Josh Durgin 
>> wrote:
>>
>>> On 01/06/2015 10:24 AM, Robert LeBlanc wrote:
>>>

 Can't this be done in parallel? If the OSD doesn't have an object then
 it is a noop and should be pretty quick. The number of outstanding
 operations can be limited to 100 or a 1000 which would provide a
 balance between speed and performance impact if there is data to be
 trimmed. I'm not a big fan of a "--skip-trimming" option as there is
 the potential to leave some orphan objects that may not be cleaned up
 correctly.

>>>
>>>
>>> Yeah, a --skip-trimming option seems a bit dangerous. This trimming
>>> actually is parallelized (10 ops at once by default, changeable via
>>> --rbd-concurrent-management-ops) since dumpling.
>>>
>>> What will really help without being dangerous is keeping a map of
>>> object existence [1]. This will avoid any unnecessary trimming
>>> automatically, and it should be possible to add to existing images.
>>> It should be in hammer.
>>>
>>> Josh
>>>
>>> [1] https://github.com/ceph/ceph/pull/2700
>>>
>>>
>>>  On Tue, Jan 6, 2015 at 8:09 AM, Jake Young  wrote:

>
>
>
> On Monday, January 5, 2015, Chen, Xiaoxi 
> wrote:
>
>>
>>
>> When you shrinking the RBD, most of the time was spent on
>> librbd/internal.cc::trim_image(), in this function, client will
>> iterator
>> all
>> unnecessary objects(no matter whether it exists) and delete them.
>>
>>
>>
>> So in this case,  when Edwin shrinking his RBD from 650PB to 650GB,
>> there are[ (650PB * 1024GB/PB -650GB) * 1024MB/GB ] / 4MB/Object =
>> 170,227,200 Objects need to be deleted.That will definitely take a
>> long
>> time
>> since rbd client need to send a delete request to OSD, OSD need to
>> find
>> out
>> the object context and delete(or doesn’t exist at all). The time
>> needed
>> to
>> trim an image is ratio to the size needed to trim.
>>
>>
>>
>> make another image of the correct size and copy your VM's file system
>> to
>> the new image, then delete the old one will  NOT help in general, just
>> because delete the old volume will take exactly the same time as
>> shrinking ,
>> they both need to call trim_image().
>>
>>
>>
>> The solution in my mind may be we can provide a “—skip-triming” flag
>> to
>> skip the trimming. When the administrator absolutely sure there is no
>> written have taken place in the shrinking area(that means there is no
>> object
>> created in these area), they can use this flag to skip the time
>> consuming
>> trimming.
>>
>>
>>
>> How do you think?
>>
>
>
>
> That sounds like a good solution. Like doing "undo grow image"
>
>
>
>>
>> From: Jake Young [mailto:jak3...@gmail.com]
>> Sent: Monday, January 5, 2015 9:45 PM
>> To: Chen, Xiaoxi
>> Cc: Edwin Peer; ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] rbd resize (shrink) taking forever and a day
>>
>>
>>
>>
>>
>> On Sunday, January 4, 2015, Chen, Xiaoxi 
>> wrote:
>>
>> You could use rbd info   to see the block_name_prefix,
>> the
>> object name consist like .,  so
>> for
>> example, rb.0.ff53.3d1b58ba.e6ad should be the th
>> object
>> of
>> the volume with block_name_prefix rb.0.ff53.3d1b58ba.
>>
>>$ rbd info huge
>>   rbd image 'huge':
>>size 1024 TB in 268435456 objects
>>order 22 (4096 kB objects)
>>block_name_prefi

Re: [ceph-users] rbd directory listing performance issues

2015-01-07 Thread Robert LeBlanc
I think your free memory is just fine. If you have lots of data change
(read/write) then I think it is just aging out your directory cache.
If fast directory listing is important to you, you can always write a
script to periodically read the directory listing so it stays in cache
or use http://lime-technology.com/forum/index.php?topic=4500.0.
Otherwise you are limited to trying to reduce the latency in your Ceph
environment for small block sizes. We have tweaked the RBD cache and
added an SSD caching layer (on Giant) and it has helped some, but
nothing spectacular. There have been references that increasing the
readahead on RBD to 4M helps, but it didn't do anything for us.
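
For reference, those knobs look roughly like this (rbd0 is a placeholder for
whatever device your image is mapped to; as noted, the readahead change may or
may not help in your case):

    # 4 MB readahead on the mapped device (value in KB)
    echo 4096 > /sys/block/rbd0/queue/read_ahead_kb

    # equivalent via blockdev (value in 512-byte sectors)
    blockdev --setra 8192 /dev/rbd0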

On Tue, Jan 6, 2015 at 12:18 PM, Shain Miley  wrote:
> It does seem like the entries get cached for a certain period of time.
>
> Here is the memory listing for the rbd client server:
>
> root@cephmount1:~# free -m
>  total   used   free sharedbuffers cached
> Mem: 11965  11816149  3139  10823
> -/+ buffers/cache:853  2
> Swap: 4047  0   4047
>
> I can add more memory to the server if I need to I have 2 or 4 16GB DIMM 
> laying around here someplace.
>
>
> Here are the some of the pagecache sysctl settings:
> vm.dirty_background_bytes = 0
> vm.dirty_background_ratio = 10
> vm.dirty_bytes = 0
> vm.dirty_expire_centisecs = 3000
> vm.dirty_ratio = 10
> vm.dirty_writeback_centisecs = 500
>
>
> In terms of the number of files:
>
> root@cephmount1:/mnt/ceph-block-device-archive/library/E# time ls
> real0m8.073s
> user0m0.000s
> sys 0m0.012s
>
> root@cephmount1:/mnt/ceph-block-device-archive/library/E# ls |wc
> 228 5103413
>
>
> However looking at some other directories...I see numbers in the range of 500 
> and 600, etc...so they will vary based on the name of the artist..however if 
> I had to guess we would not use any more than 800 - 1000 in the very heavy 
> directories at this point.
>
> Also...one thing I just noticed is that the 'ls |wc' returns right 
> away...even in cases when right after that I do an 'ls -l' and it takes a 
> while.
>
> Thanks,
>
> Shain
>
> Shain Miley | Manager of Systems and Infrastructure, Digital Media | 
> smi...@npr.org | 202.513.3649
>
> 
> From: Robert LeBlanc [rob...@leblancnet.us]
> Sent: Tuesday, January 06, 2015 1:57 PM
> To: Shain Miley
> Cc: ceph-us...@ceph.com
> Subject: Re: [ceph-users] rbd directory listing performance issues
>
> I would think that the RBD mounter would cache the directory listing
> which should always make it fast, unless there is so much memory
> pressure that it is dropping it frequently.
>
> How many entries are in your directory and total on the RBD?
> ls | wc -l
> find . | wc -l
>
> What does your memory look like?
> free -h
>
> I'm not sure how much help I can be, but if memory pressure is causing
> buffers to be freed, then it can cause the system to have to go to disk
> to get the directory listing. I'm guessing that if the directory is
> large enough it could cause the system to have to go back to the RBD
> many times. Very small I/O on RBD is very expensive compared to big
> sequential access.
>
> On Tue, Jan 6, 2015 at 11:33 AM, Shain Miley  wrote:
>> Robert,
>>
>> xfs on the rbd image as well:
>>
>> /dev/rbd0 on /mnt/ceph-block-device-archive type xfs (rw)
>>
>> However looking at the mount options...it does not look like I've enabled 
>> anything special in terms of mount options.
>>
>> Thanks,
>>
>> Shain
>>
>>
>> Shain Miley | Manager of Systems and Infrastructure, Digital Media | 
>> smi...@npr.org | 202.513.3649
>>
>> 
>> From: Robert LeBlanc [rob...@leblancnet.us]
>> Sent: Tuesday, January 06, 2015 1:27 PM
>> To: Shain Miley
>> Cc: ceph-us...@ceph.com
>> Subject: Re: [ceph-users] rbd directory listing performance issues
>>
>> What fs are you running inside the RBD?
>>
>> On Tue, Jan 6, 2015 at 8:29 AM, Shain Miley  wrote:
>>> Hello,
>>>
>>> We currently have a 12 node (3 monitor+9 OSD) ceph cluster, made up of 107 x
>>> 4TB drives formatted with xfs. The cluster is running ceph version 0.80.7:
>>>
>>> Cluster health:
>>> cluster 504b5794-34bd-44e7-a8c3-0494cf800c23
>>>  health HEALTH_WARN crush map has legacy tunables
>>>  monmap e1: 3 mons at
>>> {hqceph1=10.35.1.201:6789/0,hqceph2=10.35.1.203:6789/0,hqceph3=10.35.1.205:6789/0},
>>> election epoch 156, quorum 0,1,2 hqceph1,hqceph2,hqceph3
>>>  osdmap e19568: 107 osds: 107 up, 107 in
>>>   pgmap v10117422: 2952 pgs, 15 pools, 77202 GB data, 19532 kobjects
>>> 226 TB used, 161 TB / 388 TB avail
>>>
>>> Relevant ceph.conf entries:
>>> osd_journal_size = 10240
>>> filestore_xattr_use_omap = true
>>> osd_mount_options_xfs =
>>> "rw,noatime,nodiratime,logbsize=256k,logbufs=8,inode64"
>>> osd_mkfs_options_xfs = "-f -i size=2048"
>>>
>>>
>>> A while back I created an 80 TB rbd image to

Re: [ceph-users] Is ceph production ready? [was: Ceph PG Incomplete = Cluster unusable]

2015-01-07 Thread Lionel Bouton
On 12/30/14 16:36, Nico Schottelius wrote:
> Good evening,
>
> we also tried to rescue data *from* our old / broken pool by map'ing the
> rbd devices, mounting them on a host and rsync'ing away as much as
> possible.
>
> However, after some time rsync got completly stuck and eventually the
> host which mounted the rbd mapped devices decided to kernel panic at
> which time we decided to drop the pool and go with a backup.
>
> This story and the one of Christian makes me wonder:
>
> Is anyone using ceph as a backend for qemu VM images in production?

Yes with Ceph 0.80.5 since September after extensive testing over
several months (including an earlier version IIRC) and some hardware
failure simulations. We plan to upgrade one storage host and one monitor
to 0.80.7 to validate this version over several months too before
migrating the others.

>
> And:
>
> Has anyone on the list been able to recover from a pg incomplete /
> stuck situation like ours?

Only by adding back an OSD with the data needed to reach min_size for
said pg, which is expected behavior. Even with some experimentations
with isolated unstable OSDs I've not yet witnessed a case where Ceph
lost multiple replicates simultaneously (we lost one OSD to disk failure
and another to a BTRFS bug but without trying to recover the filesystem
so we might have been able to recover this OSD).

If your setup is susceptible to situations where you can lose all
replicates you will lose data but there's not much that can be done
about that. Ceph actually begins to generate new replicates to replace
the missing ones after "mon osd down out interval", so the actual loss
should not happen unless you lose (and can't recover) OSDs on
separate hosts (with default crush map) simultaneously. Before going in
production you should know how long Ceph will take to fully recover from
a disk or host failure by testing it with load. Your setup might not be
robust if it doesn't have the available disk space or the speed needed to
recover quickly from such a failure.
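
A simple way to measure that on a test cluster is to take a single OSD out by
hand and time how long the cluster takes to return to all PGs active+clean
(osd.3 is just an example id):

    ceph osd out 3    # mark the OSD out; backfill to the remaining OSDs starts
    ceph -w           # watch until all PGs are active+clean, note the elapsed time
    ceph osd in 3     # bring it back in afterwards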

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Cache Tiering vs. OSD Journal

2015-01-07 Thread deeepdish
Hello.

Quick question RE: cache tiering vs. OSD journals.

As I understand it, SSD acceleration is possible at the pool or OSD level. 

When considering cache tiering, should I still put OSD journals on SSDs, or 
should they be disabled altogether?

Can a single SSD pool function as a cache tier for multiple pools?

Thank you.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v0.90 released

2015-01-07 Thread Alfredo Deza
On Sat, Dec 20, 2014 at 1:15 AM, Anthony Alba  wrote:
> Hi Sage,
>
> Has the repo metadata been regenerated?
>
> One of my reposync jobs can only see up to 0.89, using
> http://ceph.com/rpm-testing.

It was generated but we somehow missed out on properly syncing it. You
should now see 0.90 properly.


>
> Thanks
>
> Anthony
>
>
>
> On Sat, Dec 20, 2014 at 6:22 AM, Sage Weil  wrote:
>> This is the last development release before Christmas.  There are some API
>> cleanups for librados and librbd, and lots of bug fixes across the board
>> for the OSD, MDS, RGW, and CRUSH.  The OSD also gets support for discard
>> (potentially helpful on SSDs, although it is off by default), and there
>> are several improvements to ceph-disk.
>>
>> The next two development releases will be getting a slew of new
>> functionality for hammer.  Stay tuned!
>>
>> Upgrading
>> -
>>
>> * Previously, the formatted output of 'ceph pg stat -f ...' was a full
>>   pg dump that included all metadata about all PGs in the system.  It
>>   is now a concise summary of high-level PG stats, just like the
>>   unformatted 'ceph pg stat' command.
>>
>> * All JSON dumps of floating point values were incorrectly surrounding the
>>   value with quotes.  These quotes have been removed.  Any consumer of
>>   structured JSON output that was consuming the floating point values was
>>   previously having to interpret the quoted string and will most likely
>>   need to be fixed to take the unquoted number.
>>
>> Notable Changes
>> ---
>>
>> * arch: fix NEON feature detection (#10185 Loic Dachary)
>> * build: adjust build deps for yasm, virtualenv (Jianpeng Ma)
>> * build: improve build dependency tooling (Loic Dachary)
>> * ceph-disk: call partx/partprobe consistency (#9721 Loic Dachary)
>> * ceph-disk: fix dmcrypt key permissions (Loic Dachary)
>> * ceph-disk: fix umount race condition (#10096 Blaine Gardner)
>> * ceph-disk: init=none option (Loic Dachary)
>> * ceph-monstore-tool: fix shutdown (#10093 Loic Dachary)
>> * ceph-objectstore-tool: fix import (#10090 David Zafman)
>> * ceph-objectstore-tool: many improvements and tests (David Zafman)
>> * ceph.spec: package rbd-replay-prep (Ken Dreyer)
>> * common: add 'perf reset ...' admin command (Jianpeng Ma)
>> * common: do not unlock rwlock on destruction (Federico Simoncelli)
>> * common: fix block device discard check (#10296 Sage Weil)
>> * common: remove broken CEPH_LOCKDEP option (Kefu Chai)
>> * crush: fix tree bucket behavior (Rongze Zhu)
>> * doc: add build-doc guidelines for Fedora and CentOS/RHEL (Nilamdyuti
>>   Goswami)
>> * doc: enable rbd cache on openstack deployments (Sebastien Han)
>> * doc: improved installation notes on CentOS/RHEL installs (John Wilkins)
>> * doc: misc cleanups (Adam Spiers, Sebastien Han, Nilamdyuti Goswami, Ken
>>   Dreyer, John Wilkins)
>> * doc: new man pages (Nilamdyuti Goswami)
>> * doc: update release descriptions (Ken Dreyer)
>> * doc: update sepia hardware inventory (Sandon Van Ness)
>> * librados: only export public API symbols (Jason Dillaman)
>> * libradosstriper: fix stat strtoll (Dongmao Zhang)
>> * libradosstriper: fix trunc method (#10129 Sebastien Ponce)
>> * librbd: fix list_children from invalid pool ioctxs (#10123 Jason
>>   Dillaman)
>> * librbd: only export public API symbols (Jason Dillaman)
>> * many coverity fixes (Danny Al-Gaaf)
>> * mds: 'flush journal' admin command (John Spray)
>> * mds: fix MDLog IO callback deadlock (John Spray)
>> * mds: fix deadlock during journal probe vs purge (#10229 Yan, Zheng)
>> * mds: fix race trimming log segments (Yan, Zheng)
>> * mds: store backtrace for stray dir (Yan, Zheng)
>> * mds: verify backtrace when fetching dirfrag (#9557 Yan, Zheng)
>> * mon: add max pgs per osd warning (Sage Weil)
>> * mon: fix *_ratio units and types (Sage Weil)
>> * mon: fix JSON dumps to dump floats as floats and not strings (Sage Weil)
>> * mon: fix formatter 'pg stat' command output (Sage Weil)
>> * msgr: async: several fixes (Haomai Wang)
>> * msgr: simple: fix rare deadlock (Greg Farnum)
>> * osd: batch pg log trim (Xinze Chi)
>> * osd: clean up internal ObjectStore interface (Sage Weil)
>> * osd: do not abort deep scrub on missing hinfo (#10018 Loic Dachary)
>> * osd: fix ghobject_t formatted output to include shard (#10063 Loic
>>   Dachary)
>> * osd: fix osd peer check on scrub messages (#9555 Sage Weil)
>> * osd: fix pgls filter ops (#9439 David Zafman)
>> * osd: flush snapshots from cache tier immediately (Sage Weil)
>> * osd: keyvaluestore: fix getattr semantics (Haomai Wang)
>> * osd: keyvaluestore: fix key ordering (#10119 Haomai Wang)
>> * osd: limit in-flight read requests (Jason Dillaman)
>> * osd: log when scrub or repair starts (Loic Dachary)
>> * osd: support for discard for journal trim (Jianpeng Ma)
>> * qa: fix osd create dup tests (#10083 Loic Dachary)
>> * rgw: add location header when object is in another region (VRan Liu)
>> * rgw: check timestamp on s3 keystone auth (#10062 Abhishek 

[ceph-users] EC + RBD Possible?

2015-01-07 Thread deeepdish
Hello.

I wasn’t able to obtain a clear answer from my googling and from reading the 
official Ceph docs: are Erasure Coded pools possible/supported for RBD access?

The idea is to have block (cold) storage for archival purposes. I would 
access an RBD device and format it as EXT or XFS for block use. I understand 
that acceleration is possible by using SSDs as a cache tier or OSD journals.

Thank you.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Monitors and read/write latency

2015-01-07 Thread Robert LeBlanc
Monitors are in charge of the CRUSH map. Whenever there is a change
to the CRUSH map, an OSD goes down, a new OSD is added, PGs are
increased, etc., the monitor(s) builds a new CRUSH map and distributes
it to all clients and OSDs. Once the client has the CRUSH map, it does
not need to contact the monitor for placement or retrieval of an
object, because any object's location can be computed by the
client.[1][2]
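
The placement calculation itself can be inspected with the ceph osd map
command (pool and object name below are arbitrary; the output line is
illustrative and abridged):

    ceph osd map rbd some-object
    # osdmap eNNN pool 'rbd' (2) object 'some-object' -> pg 2.xxxxxxxx -> up [3,7,2] acting [3,7,2]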

Having your monitors on a 1 Gb link may be just fine based on the
number of OSDs you have and what things look like when you are doing
backfills. It is suggested that the monitors have very fast disks as
it makes sure things are committed to disk before sending new maps to
clients/OSDs.

[1] http://ceph.com/docs/master/rados/operations/crush-map/
[2] http://ceph.com/docs/master/architecture/#scalability-and-high-availability

On Tue, Jan 6, 2015 at 1:37 PM, Logan Barfield  wrote:
> Do monitors have any impact on read/write latencies?  Everything I've read
> says no, but since a client needs to talk to a monitor before reading or
> writing to OSDs it would seem like that would introduce some overhead.
>
> I ask for two reasons:
> 1) We are currently using SSD based OSD nodes for our RBD pools.  These
> nodes are connected to our hypervisors over 10Gbit links for VM block
> devices.  The rest of the cluster is on 1Gbit links, so the RBD nodes
> contact the monitors across 1Gbit instead of 10Gbit.  I'm not sure if this
> would degrade performance at all.
>
> 2) In a multi-datacenter cluster a client may end up contacting a monitor
> located in a remote location (e.g., over a high latency WAN link).  I would
> think the client would have to wait for a response from the monitor before
> beginning read/write operations on the local OSDs.
>
> I'm not sure exactly what the monitor interactions are.  Do clients only
> pull the cluster map from the monitors (then ping it occasionally for
> updates), or do clients talk to the monitors any time they write a new
> object to determine what placement group / OSDs to write to or read from?
>
>
> Thank You,
>
> Logan Barfield
> Tranquil Hosting
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Ceph-community] Problem with Rados gateway

2015-01-07 Thread Patrick McGarry
This is probably more suited to the ceph-user list. Moving it there. Thanks.


Best Regards,

Patrick McGarry
Director Ceph Community || Red Hat
http://ceph.com  ||  http://community.redhat.com
@scuttlemonkey || @ceph


On Wed, Jan 7, 2015 at 9:17 AM, Walter Valenti  wrote:
> Scenario:
> Openstack Juno RDO on Centos7.
> Ceph version: Giant.
>
> On CentOS 7 the old fastcgi module is no longer available,
> but there is "mod_fcgid"
>
>
>
> The apache VH is the following:
> 
> ServerName rdo-ctrl01
> DocumentRoot /var/www/radosgw
> RewriteEngine On
> RewriteRule ^/([a-zA-Z0-9-_.]*)([/]?.*) 
> /s3gw.fcgi?page=$1¶ms=$2&%{QUERY_STRING} 
> [E=HTTP_AUTHORIZATION:%{HTTP:Authorization},L]
> 
> Options +ExecCGI
> AllowOverride All
> SetHandler fcgid-script
> Order allow,deny
> Allow from all
> AuthBasicAuthoritative Off
> 
> AllowEncodedSlashes On
> ErrorLog /var/log/httpd/error.log
> CustomLog /var/log/httpd/access.log combined
> ServerSignature Off
> 
>
>
> On "/var/www/radosgw" there's the cgi file "s3gw.fcgi":
> #!/bin/sh
> exec /usr/bin/radosgw -c /etc/ceph/ceph.conf -n client.radosgw.gateway -d 
> --debug-rgw 20 --debug-ms 1
>
>
> For the configuration I've followed this documentation:
> http://docs.ceph.com/docs/next/radosgw/config/
>
> When I try to access the object storage I get the following errors:
>
> 1) Apache VH error:
> [Wed Jan 07 13:15:22.029411 2015] [fcgid:info] [pid 2051] mod_fcgid: server 
> rdo-ctrl01:/var/www/radosgw/s3gw.fcgi(28527) started
> 2015-01-07 13:15:22.046644 7ff16e240880  0 ceph version 0.87 
> (c51c8f9d80fa4e0168aa52685b8de40e42758578), process radosgw, pid 28527
> 2015-01-07 13:15:22.053673 7ff16e240880  1 -- :/0 messenger.start
> 2015-01-07 13:15:22.054783 7ff16e240880  1 -- :/1028527 --> 
> 163.162.90.120:6789/0 -- auth(proto 0 40 bytes epoch 0) v1 -- ?+0 0x11d9100 
> con 0x11a0870
> 2015-01-07 13:15:22.055339 7ff16e238700  1 -- 163.162.90.120:0/1028527 
> learned my addr 163.162.90.120:0/1028527
> 2015-01-07 13:15:22.056425 7ff15e7fc700  1 -- 163.162.90.120:0/1028527 <== 
> mon.0 163.162.90.120:6789/0 1  mon_map magic: 0 v1  200+0+0 
> (3839442293 0 0) 0x7ff148000ab0 con 0x11a0870
> 2015-01-07 13:15:22.056547 7ff15e7fc700  1 -- 163.162.90.120:0/1028527 <== 
> mon.0 163.162.90.120:6789/0 2  auth_reply(proto 2 0 (0) Success) v1  
> 33+0+0 (3991100068 0 0) 0x7ff148000f70 con 0x11a0870
> 2015-01-07 13:15:22.056900 7ff15e7fc700  1 -- 163.162.90.120:0/1028527 --> 
> 163.162.90.120:6789/0 -- auth(proto 2 32 bytes epoch 0) v1 -- ?+0 
> 0x7ff14c0012e0 con 0x11a0870
> 2015-01-07 13:15:22.057505 7ff15e7fc700  1 -- 163.162.90.120:0/1028527 <== 
> mon.0 163.162.90.120:6789/0 3  auth_reply(proto 2 0 (0) Success) v1  
> 222+0+0 (1145796146 0 0) 0x7ff148000f70 con 0x11a0870
> 2015-01-07 13:15:22.057768 7ff15e7fc700  1 -- 163.162.90.120:0/1028527 --> 
> 163.162.90.120:6789/0 -- auth(proto 2 181 bytes epoch 0) v1 -- ?+0 
> 0x7ff14c001ca0 con 0x11a0870
> 2015-01-07 13:15:22.058496 7ff15e7fc700  1 -- 163.162.90.120:0/1028527 <== 
> mon.0 163.162.90.120:6789/0 4  auth_reply(proto 2 0 (0) Success) v1  
> 425+0+0 (2903986998 0 0) 0x7ff148001200 con 0x11a0870
> 2015-01-07 13:15:22.058694 7ff15e7fc700  1 -- 163.162.90.120:0/1028527 --> 
> 163.162.90.120:6789/0 -- mon_subscribe({monmap=0+}) v2 -- ?+0 0x11d94c0 con 
> 0x11a0870
> 2015-01-07 13:15:22.058843 7ff16e240880  1 -- 163.162.90.120:0/1028527 --> 
> 163.162.90.120:6789/0 -- mon_subscribe({monmap=2+,osdmap=0}) v2 -- ?+0 
> 0x11d91d0 con 0x11a0870
> 2015-01-07 13:15:22.058934 7ff16e240880  1 -- 163.162.90.120:0/1028527 --> 
> 163.162.90.120:6789/0 -- mon_subscribe({monmap=2+,osdmap=0}) v2 -- ?+0 
> 0x11d9ab0 con 0x11a0870
> 2015-01-07 13:15:22.059214 7ff15e7fc700  1 -- 163.162.90.120:0/1028527 <== 
> mon.0 163.162.90.120:6789/0 5  mon_map magic: 0 v1  200+0+0 
> (3839442293 0 0) 0x7ff148001130 con 0x11a0870
> 2015-01-07 13:15:22.059140 7ff1567fc700  2 
> RGWDataChangesLog::ChangesRenewThread: start
> 2015-01-07 13:15:22.059737 7ff15e7fc700  1 -- 163.162.90.120:0/1028527 <== 
> mon.0 163.162.90.120:6789/0 6  mon_subscribe_ack(300s) v1  20+0+0 
> (1877860257 0 0) 0x7ff148001410 con 0x11a0870
> 2015-01-07 13:15:22.059869 7ff15e7fc700  1 -- 163.162.90.120:0/1028527 <== 
> mon.0 163.162.90.120:6789/0 7  osd_map(52..52 src has 1..52) v3  
> 5987+0+0 (3066791464 0 0) 0x7ff148002d50 con 0x11a0870
> 2015-01-07 13:15:22.060250 7ff16e240880 20 get_obj_state: rctx=0x119c2f0 
> obj=.rgw.root:default.region state=0x119dba8 s->prefetch_data=0
> 2015-01-07 13:15:22.060302 7ff15e7fc700  1 -- 163.162.90.120:0/1028527 <== 
> mon.0 163.162.90.120:6789/0 8  mon_subscribe_ack(300s) v1  20+0+0 
> (1877860257 0 0) 0x7ff148001130 con 0x11a0870
> 2015-01-07 13:15:22.060325 7ff15e7fc700  1 -- 163.162.90.120:0/1028527 <== 
> mon.0 163.162.90.120:6789/0 9  osd_map(52..52 src has 1..52) v3  
> 5987+0+0 (3066791464 0 0) 0x7ff1480046f0 con 0x11a0870
> 2015-01-07 13:15:22.060333 7ff16e240880 10 cache get: 
> name=.rgw.r

Re: [ceph-users] Regarding Federated Gateways - Zone Sync Issues

2015-01-07 Thread Yehuda Sadeh
I created a ceph tracker issue:

http://tracker.ceph.com/issues/10471

Thanks,
Yehuda

On Tue, Jan 6, 2015 at 10:19 PM, Mark Kirkwood
 wrote:
> On 07/01/15 17:43, hemant burman wrote:
>>
>> Hello Yehuda,
>>
>> The issue seems to be with the user data file for the swift subuser not
>> getting synced properly.
>
>
>
> FWIW, I'm seeing exactly the same thing as well (Hermant - that was well
> spotted)!
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Is ceph production ready? [was: Ceph PG Incomplete = Cluster unusable]

2015-01-07 Thread Nico Schottelius
Good evening,

we also tried to rescue data *from* our old / broken pool by map'ing the
rbd devices, mounting them on a host and rsync'ing away as much as
possible.

However, after some time rsync got completly stuck and eventually the
host which mounted the rbd mapped devices decided to kernel panic at
which time we decided to drop the pool and go with a backup.

This story and the one of Christian makes me wonder:

Is anyone using ceph as a backend for qemu VM images in production?

And:

Has anyone on the list been able to recover from a pg incomplete /
stuck situation like ours?

Reading about the issues on the list here gives me the impression that
ceph as software is itself stuck/incomplete and has not yet become ready,
or "clean", for production (sorry for the word play).

Cheers,

Nico

Christian Eichelmann [Tue, Dec 30, 2014 at 12:17:23PM +0100]:
> Hi Nico and all others who answered,
> 
> After some more attempts to somehow get the pgs into a working state (I
> tried force_create_pg, which put them into the creating state. But
> that was obviously not true, since after rebooting one of the OSDs
> containing them it went back to incomplete), I decided to save what can be saved.
> 
> I've created a new pool, created a new image there, mapped the old image
> from the old pool and the new image from the new pool to a machine, to
> copy the data at the POSIX level.
> 
> Unfortunately, formatting the image from the new pool hangs after some
> time. So it seems that the new pool is suffering from the same problem
> as the old pool, which is totally incomprehensible to me.
> 
> Right now, it seems like Ceph is giving me no options to either save
> some of the still intact rbd volumes, or to create a new pool along the
> old one to at least enable our clients to send data to ceph again.
> 
> To tell the truth, I guess that will result in the end of our ceph
> project (which has already been running for 9 months).
> 
> Regards,
> Christian
> 
> Am 29.12.2014 15:59, schrieb Nico Schottelius:
> > Hey Christian,
> > 
> > Christian Eichelmann [Mon, Dec 29, 2014 at 10:56:59AM +0100]:
> >> [incomplete PG / RBD hanging, osd lost also not helping]
> > 
> > that is very interesting to hear, because we had a similar situation
> > with ceph 0.80.7 and had to re-create a pool, after I deleted 3 pg
> > directories to allow OSDs to start after the disk filled up completely.
> > 
> > So I am sorry not to being able to give you a good hint, but I am very
> > interested in seeing your problem solved, as it is a show stopper for
> > us, too. (*)
> > 
> > Cheers,
> > 
> > Nico
> > 
> > (*) We migrated from sheepdog to gluster to ceph and so far sheepdog
> > seems to run much smoother. The first one is however not supported
> > by opennebula directly, the second one not flexible enough to host
> > our heterogeneous infrastructure (mixed disk sizes/amounts) - so we 
> > are using ceph at the moment.
> > 
> 
> 
> -- 
> Christian Eichelmann
> Systemadministrator
> 
> 1&1 Internet AG - IT Operations Mail & Media Advertising & Targeting
> Brauerstraße 48 · DE-76135 Karlsruhe
> Telefon: +49 721 91374-8026
> christian.eichelm...@1und1.de
> 
> Amtsgericht Montabaur / HRB 6484
> Vorstände: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Robert
> Hoffmann, Markus Huhn, Hans-Henning Kettler, Dr. Oliver Mauss, Jan Oetjen
> Aufsichtsratsvorsitzender: Michael Scheeren

-- 
New PGP key: 659B 0D91 E86E 7E24 FD15  69D0 C729 21A1 293F 2D24
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSDs with btrfs are down

2015-01-07 Thread Dyweni - BTRFS

Hi,

BTRFS crashed because the system ran out of memory...

I see these entries in your logs:


Jan  4 17:11:06 ceph1 kernel: [756636.535661] kworker/0:2: page
allocation failure: order:1, mode:0x204020



Jan  4 17:11:06 ceph1 kernel: [756636.536112] BTRFS: error (device
sdb1) in create_pending_snapshot:1334: errno=-12 Out of memory



Jan  4 17:11:06 ceph1 kernel: [756636.536135] BTRFS: error (device
sdb1) in cleanup_transaction:1577: errno=-12 Out of memory



How much memory do you have in this node?  Were you using Ceph
(as a client) on this node?  Do you have swap configured on this
node?
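
If it helps to narrow this down, a few standard commands for checking
memory pressure and fragmentation on the node (order:1 failures are often
fragmentation rather than a plain lack of RAM; the sysctl value is only an
illustrative starting point, not a recommendation):

    $ free -m                               # RAM and swap usage
    $ cat /proc/buddyinfo                   # free pages per order
    $ sysctl vm.min_free_kbytes             # current reserve
    $ sysctl -w vm.min_free_kbytes=131072   # a larger reserve can help
                                            # higher-order allocations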









On 2015-01-04 07:12, Jiri Kanicky wrote:

Hi,

My OSDs with btrfs are down on one node. I found the cluster in this 
state:


cephadmin@ceph1:~$ ceph osd tree
# id    weight  type name       up/down reweight
-1      10.88   root default
-2      5.44            host ceph1
0       2.72                    osd.0   down    0
1       2.72                    osd.1   down    0
-3      5.44            host ceph2
2       2.72                    osd.2   up      1
3       2.72                    osd.3   up      1


cephadmin@ceph1:~$ ceph status
cluster bce2ff4d-e03b-4b75-9b17-8a48ee4d7788
 health HEALTH_ERR 645 pgs degraded; 29 pgs inconsistent; 14 pgs
recovering; 645 pgs stuck degraded; 768 pgs stuck unclean; 631 pgs
stuck undersized; 631 pgs undersized; recovery 397226/915548 objects
degraded (43.387%); 72026/915548 objects misplaced (7.867%); 783 scrub
errors
 monmap e1: 2 mons at
{ceph1=192.168.30.21:6789/0,ceph2=192.168.30.22:6789/0}, election
epoch 30, quorum 0,1 ceph1,ceph2
 mdsmap e30: 1/1/1 up {0=ceph1=up:active}, 1 up:standby
 osdmap e242: 4 osds: 2 up, 2 in
  pgmap v38318: 768 pgs, 3 pools, 1572 GB data, 447 kobjects
1811 GB used, 3764 GB / 5579 GB avail
397226/915548 objects degraded (43.387%); 72026/915548
objects misplaced (7.867%)
  14 active+recovering+degraded+remapped
 122 active+remapped
   1 active+remapped+inconsistent
 603 active+undersized+degraded
  28 active+undersized+degraded+inconsistent


Would you know if this is a pure BTRFS issue, or is there any setting I
forgot to use?
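
One setting that may be worth checking, although this is only a guess at
the root cause here: the filestore takes btrfs snapshots for its commit
points by default, and the crash below is inside snapshot creation.
Snapshots can be turned off per OSD in ceph.conf, followed by an OSD
restart:

    [osd]
        filestore btrfs snap = false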

Jan  4 17:11:06 ceph1 kernel: [756636.535661] kworker/0:2: page
allocation failure: order:1, mode:0x204020
Jan  4 17:11:06 ceph1 kernel: [756636.535669] CPU: 0 PID: 62644 Comm:
kworker/0:2 Not tainted 3.16.0-0.bpo.4-amd64 #1 Debian
3.16.7-ckt2-1~bpo70+1
Jan  4 17:11:06 ceph1 kernel: [756636.535671] Hardware name: HP
ProLiant MicroServer Gen8, BIOS J06 11/09/2013
Jan  4 17:11:06 ceph1 kernel: [756636.535701] Workqueue: events
do_async_commit [btrfs]
Jan  4 17:11:06 ceph1 kernel: [756636.535704]  
0001 81541f8f 00204020
Jan  4 17:11:06 ceph1 kernel: [756636.535707]  811519ed
0001 880075de0c00 0002
Jan  4 17:11:06 ceph1 kernel: [756636.535710]  
0001 880075de0c08 0096
Jan  4 17:11:06 ceph1 kernel: [756636.535713] Call Trace:
Jan  4 17:11:06 ceph1 kernel: [756636.535720] [] ?
dump_stack+0x41/0x51
Jan  4 17:11:06 ceph1 kernel: [756636.535725] [] ?
warn_alloc_failed+0xfd/0x160
Jan  4 17:11:06 ceph1 kernel: [756636.535730] [] ?
__alloc_pages_nodemask+0x91f/0xbb0
Jan  4 17:11:06 ceph1 kernel: [756636.535734] [] ?
kmem_getpages+0x60/0x110
Jan  4 17:11:06 ceph1 kernel: [756636.535737] [] ?
fallback_alloc+0x158/0x220
Jan  4 17:11:06 ceph1 kernel: [756636.535741] [] ?
kmem_cache_alloc+0x1a4/0x1e0
Jan  4 17:11:06 ceph1 kernel: [756636.535745] [] ?
ida_pre_get+0x60/0xd0
Jan  4 17:11:06 ceph1 kernel: [756636.535749] [] ?
get_anon_bdev+0x21/0xe0
Jan  4 17:11:06 ceph1 kernel: [756636.535762] [] ?
btrfs_init_fs_root+0xff/0x1b0 [btrfs]
Jan  4 17:11:06 ceph1 kernel: [756636.535774] [] ?
btrfs_read_fs_root+0x33/0x40 [btrfs]
Jan  4 17:11:06 ceph1 kernel: [756636.535785] [] ?
btrfs_get_fs_root+0xd6/0x230 [btrfs]
Jan  4 17:11:06 ceph1 kernel: [756636.535796] [] ?
create_pending_snapshot+0x793/0xa00 [btrfs]
Jan  4 17:11:06 ceph1 kernel: [756636.535807] [] ?
create_pending_snapshots+0x89/0xa0 [btrfs]
Jan  4 17:11:06 ceph1 kernel: [756636.535818] [] ?
btrfs_commit_transaction+0x35a/0xa10 [btrfs]
Jan  4 17:11:06 ceph1 kernel: [756636.535824] [] ?
mod_timer+0x10e/0x220
Jan  4 17:11:06 ceph1 kernel: [756636.535834] [] ?
do_async_commit+0x2a/0x40 [btrfs]
Jan  4 17:11:06 ceph1 kernel: [756636.535839] [] ?
process_one_work+0x15c/0x450
Jan  4 17:11:06 ceph1 kernel: [756636.535843] [] ?
worker_thread+0x112/0x540
Jan  4 17:11:06 ceph1 kernel: [756636.535847] [] ?
create_and_start_worker+0x60/0x60
Jan  4 17:11:06 ceph1 kernel: [756636.535851] [] ?
kthread+0xc1/0xe0
Jan  4 17:11:06 ceph1 kernel: [756636.535854] [] ?
flush_kthread_worker+0xb0/0xb0
Jan  4 17:11:06 ceph1 kernel: [756636.535858] [] ?
ret_from_fork+0x7c/0xb0
Jan  4 17:11:06 ceph1 kernel: [756636.535861] [] ?
flush_kthread_worker+0xb0/0xb0
J

Re: [ceph-users] cephfs usable or not?

2015-01-07 Thread Jiri Kanicky

Hi Max,

Thanks for this info.

I am planning to use CephFS (ceph version 0.87) at home, because it's more
convenient than NFS over RBD. I don't have a large environment; about 20TB,
so hopefully it will hold.


I backup all important data just in case. :)

Thank you.
Jiri

On 29/12/2014 21:09, Thomas Lemarchand wrote:

Hi Max,

I do use CephFS (Giant) in a production environment. It works really
well, but I have backups ready to use, just in case.

As Wido said, kernel version is not really relevant if you use ceph-fuse
(which I recommend over cephfs kernel, for stability and ease of upgrade
reasons).

However, I found ceph-mds memory usage hard to predict, and I had some
problems with that. At first it was undersized (16GB, for ~8M files /
dirs, and 1M inodes cached); it worked well until a server crash from
which the MDS did not recover (mds rejoin / rebuild) because of the lack
of memory. So I gave it 24GB memory + 24GB swap, no problem anymore.
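
For what it's worth, the number of inodes the MDS tries to keep cached is
tunable; it is counted in inodes, not bytes, and the memory used per inode
varies, so the value below is only an example:

    [mds]
        mds cache size = 1000000    # default is 100000 inodes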



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Data recovery after RBD I/O error

2015-01-07 Thread Austin S Hemmelgarn

On 2015-01-04 15:26, Jérôme Poulin wrote:

Happy holiday everyone,

TL;DR: Hardware corruption is really bad; if btrfs-restore works,
the kernel Btrfs code can too!

I'm cross-posting this message since the root cause for this problem
is the Ceph RBD device however, my main concern is data loss from a
BTRFS filesystem hosted on this device.

I'm running a file server which is a staging area for rsync backups of
many folders, and also a snapshot store which allows me to recover older
files and folders much faster, while our backup is still exported to
an EXT4 filesystem using rdiff-backup.

The server is running Debian Wheezy with kernel 3.16, and I already had
corruption on this volume before: I had to copy the whole device, and
since we now had a working Ceph cluster, I copied the volume using
«btrfs send» to another BTRFS hosted on an RBD device. The corruption
was not causing any issue for reading; however, when writing, the volume
would switch read-only once in a while.

On the first day of the new year, I woke up to the monitoring telling me
the FS on the server had switched to read-only. I took a look at dmesg
and saw some I/O errors from the RBD device. I was unable to unmount
it but had full access to the data, so I wanted to reboot to see if
the glitch would clear now that the I/O errors were gone. After the
reboot, the BTRFS would not mount anymore.


After trying the usual (read-only mount, recovery mount, btrfsck
--repair on a snapshot), only btrfs-restore was working. Btrfs-restore
could restore everything, but my data was in snapshots, the regex option
was not working correctly, and it didn't restore file attributes
(normal/extended) even with -x; I used btrfs-tools 3.18.
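
For anyone ending up in the same spot, the rough shape of a btrfs-restore
run (device and target paths are placeholders, and flag behaviour differs
between btrfs-progs versions, so check btrfs restore --help first):

    $ btrfs restore -D /dev/rbd0 /mnt/rescue        # dry run, list only
    $ btrfs restore -x -s -i /dev/rbd0 /mnt/rescue  # xattrs, snapshots,
                                                    # ignore errors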

This is what I was getting:
[   31.582823] parent transid verify failed on 308470693888 wanted
91730 found 90755
[   31.584738] parent transid verify failed on 308470693888 wanted
91730 found 90755
[   31.584743] BTRFS: Failed to read block groups: -5

After looking at the code a bit, I did this change to get BTRFS
recovery working and rsync my stuff. I also tried to use btrfs send by
forcing it to use a read/write snapshot since the whole volume is read
only anyway but failed with oopses.

Patch for recovery
---
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 0229c37..aed4062 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2798,7 +2798,8 @@ retry_root_backup:
 ret = btrfs_read_block_groups(extent_root);
 if (ret) {
 printk(KERN_ERR "BTRFS: Failed to read block groups:
%d\n", ret);
-   goto fail_sysfs;
+   if (!btrfs_test_opt(tree_root, RECOVERY))
+   goto fail_sysfs;
 }
 fs_info->num_tolerated_disk_barrier_failures =
 btrfs_calc_num_tolerated_disk_barrier_failures(fs_info);
---
Also: http://pastebin.com/YPY3eMMX


Trace when forcing BTRFS send on my R/O volume with R/W subvolume:
[ cut here ]
WARNING: CPU: 3 PID: 27883 at fs/btrfs/send.c:5533
btrfs_ioctl_send+0x8c9/0xfa0 [btrfs]()
Modules linked in: btrfs(O) ufs qnx4 hfsplus hfs minix ntfs vfat msdos
fat jfs xfs reiserfs vhost_net vhost macvtap macvlan tun
ip6table_filter ip6_tabl
es ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat
nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT cbc
rbd libceph xt_CHECKSUM iptable_mangle libcrc32c xt_tcpudp ip
table_filter ip_tables x_tables parport_pc ppdev lp parport ib_iser
rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp
libiscsi_tcp libiscsi scsi_transport_iscsi nfsd auth_rpcgss
oid_registry n
fs_acl nfs lockd fscache sunrpc bridge fuse ipmi_devintf 8021q garp
stp mrp llc loop iTCO_wdt iTCO_vendor_support ttm drm_kms_helper
pcspkr drm evdev lpc_ich i2c_algo_bit i2c_core mfd_core i7core_edac
processor edac_core button coretemp tpm_tis tpm dcdbas kvm_intel
acpi_power_meter ipmi_si thermal_sys ipmi_msghandler kvm ext4 crc16
mbcache jbd2 dm_mod raid456 async_raid6_recov async_memcpy async_pq
async_xor async_tx xor ra
Jan  2 18:55:43 CASRV0104 kernel: id6_pq raid1 md_mod sg sd_mod
crc_t10dif crct10dif_common mvsas libsas ehci_pci ehci_hcd bnx2
crc32c_intel libata scsi_transport_sas scsi_mod usbcore usb_common
[last
unloaded: btrfs]
CPU: 3 PID: 27883 Comm: btrfs Tainted: G   O
3.16.0-0.bpo.4-amd64 #1 Debian 3.16.7-ckt2-1~bpo70+1
Hardware name: Dell Inc. PowerEdge R310/05XKKK, BIOS 1.5.2 10/15/2010
   a0a52557 81541f8f 
  8106cecc 8800ba625a00 8803152da000 7fffa69f7ab0
  880312f2d1e0 8800ba625a00 a0a419c9 
Call Trace:
  [] ? dump_stack+0x41/0x51
  [] ? warn_slowpath_common+0x8c/0xc0
  [] ? btrfs_ioctl_send+0x8c9/0xfa0 [btrfs]
  [] ? __alloc_pages_nodemask+0x165/0xbb0
  [] ? dput+0x31/0x1a0
  [] ? cache_alloc_refill+0x92/0x2e0
  [] ? btrfs_ioctl+0x1a50/0x2890 [btrfs]
  [] ? alloc_pid+0x1e8/0x4d0
  [] ? set_tas

[ceph-users] CEPH: question on journal placement

2015-01-07 Thread Marco Kuendig
Hello

Newbie on CEPH here. I have three lab servers with CEPH. Each server has
2 x 3TB SATA disks. Up to now I run 2 OSDs per server and partitioned the 2
disks into 4 partitions, with the 2 OSDs split over the 4 partitions. 1 disk = 1
OSD = 2 partitions (data and journal).

Now I started to think about it and wondered whether it would be wiser to set
it up like this:

1 server = 2 Disks = 1 OSD (1 Disk Data, 1 Disk Journal).

What is the general recommendation ?

My setup works; however, I feel that performance is not where it should be.
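
For comparison, a sketch of how the two common layouts are usually
expressed with ceph-deploy (host and device names are placeholders). With
only two spinning disks per server, one OSD per disk with the journal on a
small partition of the same disk is the usual starting point; a dedicated
journal device normally only pays off when it is an SSD:

    # one OSD per disk, journal co-located (ceph-disk creates the partition)
    $ ceph-deploy osd prepare server1:/dev/sda
    $ ceph-deploy osd prepare server1:/dev/sdb

    # one OSD on sda with its journal on a partition of another device
    $ ceph-deploy osd prepare server1:/dev/sda:/dev/sdb1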

thanks for any opinion

marco

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd tree to show primary-affinity value

2015-01-07 Thread Mykola Golub
On Thu, Dec 25, 2014 at 03:57:15PM +1100, Dmitry Smirnov wrote:

> Please don't withhold this improvement -- go ahead and submit pull request to 
> let developers decide whether they want this or not. IMHO it is a very useful 
> improvement. Thank you very much for implementing it.

Done. https://github.com/ceph/ceph/pull/3254

-- 
Mykola Golub
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Block and NAS Services for Non Linux OS

2015-01-07 Thread Steven Sim
Hello Eneko;

Firstly, thanks for your comments!

You mentioned that machines see a QEMU IDE/SCSI disk and don't know
whether it's on ceph, NFS, local, LVM, ... so it works OK for any VM guest
OS.

But what if I want the CEPH cluster to serve a whole range of clients in the
data center, ranging from ESXi and Microsoft hypervisors to Solaris
(unvirtualized), AIX (unvirtualized), etc.?

In particular, I'm being asked to create a NAS and iSCSI block storage farm
with the ability to serve not just Linux but a range of operating systems,
some virtualized, some not...

I love the distributed nature of CEPH, but using proxy nodes (or heads)
sort of goes against the distributed concept...
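
For the non-Linux clients the pattern is indeed one or more Linux gateway
hosts that map RBD images and re-export them. A bare-bones sketch using
the in-kernel client and the tgt iSCSI target (pool, image and target
names are placeholders, this says nothing about HA or multipathing, and
the exact tgtadm flags should be checked against the man page on your
distribution):

    $ rbd create iscsi/lun0 --size 1048576    # 1 TB image in an 'iscsi' pool
    $ rbd map iscsi/lun0                      # -> /dev/rbd0 on the gateway
    $ tgtadm --lld iscsi --op new --mode target --tid 1 \
          --targetname iqn.2015-01.com.example:lun0
    $ tgtadm --lld iscsi --op new --mode logicalunit --tid 1 --lun 1 \
          --backing-store /dev/rbd0
    # for the NAS case, put a filesystem on /dev/rbd0 and export it via NFS/SMB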

Warmest Regards
Steven Sim
Mobile : 96963117
Principal Systems
77 High Street
#10-07 High Street Plaza
Singapore 179433
Company Registration Number : 201002783M

On 30 December 2014 at 18:55, Eneko Lacunza  wrote:

> Hi Steven,
>
> Welcome to the list.
>
> On 30/12/14 11:47, Steven Sim wrote:
>
>> This is my first posting and I apologize if the content or query is not
>> appropriate.
>>
>> My understanding for CEPH is the block and NAS services are through
>> specialized (albeit opensource) kernel modules for Linux.
>>
>> What about the other OS e.g. Solaris, AIX, Windows, ESX ...
>>
>> If the solution is to use a proxy, would using the MON servers (as iSCSI
>> and NAS proxies) be okay?
>>
> Virtual machines see a QEMU IDE/SCSI disk, they don't know whether its on
> ceph, NFS, local, LVM, ... so it works OK for any VM guest SO.
>
> Currently on Proxmox, it's qemu-kvm the ceph (RBD) client, not the linux
> kernel.
>
>>
>> What about performance?
>>
>
> It depends a lot on the setup. Do you have something on your mind? :)
>
> Cheers
> Eneko
>
> --
> Zuzendari Teknikoa / Director Técnico
> Binovo IT Human Project, S.L.
> Telf. 943575997
>   943493611
> Astigarraga bidea 2, planta 6 dcha., ofi. 3-2; 20180 Oiartzun (Gipuzkoa)
> www.binovo.es
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Hanging VMs with Qemu + RBD

2015-01-07 Thread Achim Ledermüller
Hi,

We have the same setup including OpenNebula 4.10.1. We had some
backfilling due to node failures and node expansion. If we throttle
osd_max_backfills there is not a problem at all. If the value for
backfilling jobs is too high, we can see delayed reactions within the
shell, e.g. `ls -lh` takes 2 seconds.
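
The backfill throttle can be changed on a running cluster without a
restart; a value of 1 is the most conservative (these are the standard
knobs, the exact defaults depend on the release):

    $ ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'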

Kind regards,
Achim

-- 
Achim Ledermüller, M. Sc.
Systems Engineer

NETWAYS Managed Services GmbH | Deutschherrnstr. 15-19 | D-90429 Nuernberg
Tel: +49 911 92885-0 | Fax: +49 911 92885-77
GF: Julian Hein, Bernd Erk | AG Nuernberg HRB25207
http://www.netways.de | achim.ledermuel...@netways.de

** OSDC 2015 - April - osdc.de **
** Puppet Camp Berlin 2015 - April - netways.de/puppetcamp **
** OSBConf 2015 - September – osbconf.org **
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Making objects available via FTP

2015-01-07 Thread Carlo Santos
Hi all,

I'm wondering if it's possible to make the files in Ceph available via FTP
by just configuring Ceph. If this is not possible, what are the typical
steps on how to make the files available via FTP?
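
There is no FTP front end built into Ceph itself; the usual pattern is a
small gateway host that mounts CephFS (or a filesystem on an RBD image)
and runs a stock FTP daemon on top of it. A rough sketch, with the mount
point and the choice of vsftpd purely as placeholders:

    $ ceph-fuse /mnt/cephfs       # or: mount -t ceph mon1:6789:/ /mnt/cephfs
    $ apt-get install vsftpd      # any FTP daemon will do
    # then point the daemon's root (e.g. local_root in vsftpd.conf) at /mnt/cephfs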

Thanks!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Regarding Federated Gateways - Zone Sync Issues

2015-01-07 Thread hemant burman
Hello Yehuda,

The issue seems to be with the user data file for the swift subuser not getting
synced properly.
MasterZone:

root@ceph-all:/var/local# ceph osd map .us-1-east-1.users.uid johndoe2

osdmap e796 pool '.us-1-east-1.users.uid' (286) object 'johndoe2' -> pg
286.c384ed51 (286.51) -> up [2] acting [2]

root@ceph-all:/var/local# *strings
/var/local/osd2/current/286.51_head/johndoe2__head_C384ED51__11e *

johndoe2

KYNM9FLJAFNKCB9HN7UG(

dwQpa4dWCXXJqGZWcNTBo7SzAFFaRX2LWvMtM3so

John Doe_2

joh...@example.com

johndoe2:swift(

J8tmIvmszn8NEtBOKAHa4zOoJO65C4DoIm6Sl9kt

johndoe2

KYNM9FLJAFNKCB9HN7UG

KYNM9FLJAFNKCB9HN7UG(

dwQpa4dWCXXJqGZWcNTBo7SzAFFaRX2LWvMtM3so

swift

swift

*johndoe2:swift*

johndoe2:swift(

J8tmIvmszn8NEtBOKAHa4zOoJO65C4DoIm6Sl9kt

swift




Slave Zone After Replication:

root@ceph-all-1:/var/chef/cache#  ceph osd map .us-1-west-1.users.uid
johndoe2

osdmap e202 pool '.us-1-west-1.users.uid' (63) object 'johndoe2' -> pg
63.c384ed51 (63.51) -> up [2] acting [2]

root@ceph-all-1:/var/chef/cache# *strings
/var/local/osd2/current/63.51_head/johndoe2__head_C384ED51__3f*

johndoe2

KYNM9FLJAFNKCB9HN7UG(

dwQpa4dWCXXJqGZWcNTBo7SzAFFaRX2LWvMtM3so

John Doe_2

joh...@example.com

johndoe2:swift(

J8tmIvmszn8NEtBOKAHa4zOoJO65C4DoIm6Sl9kt

johndoe2

KYNM9FLJAFNKCB9HN7UG

KYNM9FLJAFNKCB9HN7UG(

dwQpa4dWCXXJqGZWcNTBo7SzAFFaRX2LWvMtM3so

swift

swift

*swift*

johndoe2:swift(

J8tmIvmszn8NEtBOKAHa4zOoJO65C4DoIm6Sl9kt

swift

Look at the diff in bold; even the checksums and file sizes are different.
MasterZone:

root@ceph-all:/var/local# ls -l
/var/local/osd2/current/286.51_head/johndoe2__head_C384ED51__11e

-rw-r--r-- 1 root root *469* Jan  6 19:13
/var/local/osd2/current/286.51_head/johndoe2__head_C384ED51__11e

SlaveZone:

root@ceph-all-1:/var/chef/cache# ls -l
/var/local/osd2/current/63.51_head/johndoe2__head_C384ED51__3f

-rw-r--r-- 1 root root *460* Jan  6 19:22
/var/local/osd2/current/63.51_head/johndoe2__head_C384ED51__3f

-Hemant

On Wed, Jan 7, 2015 at 9:27 AM, Mark Kirkwood  wrote:

> On 07/01/15 16:22, Mark Kirkwood wrote:
>
>>
>>
>> FWIW I can reproduce this too (ceph 0.90-663-ge1384af). The *user*
>> replicates ok (complete with its swift keys and secret). I can
>> authenticate to both zones ok using S3 api (boto version 2.29), but only
>> to the master using swift (swift client versions 2.3.1 and 2.0.3). In
>> the case of the slave zone I'm seeing the same error stack as the above.
>>
>> I'm running Ubuntu 14.10 for ceph and rgw with Apache (version 2.4.10)
>> the standard repos. I'll try replacing the fastcgi module to see if that
>> is a factor.
>>
>>
> Does not appear to be - downgraded apache to 2.4.7 + fastcgi from ceph
> repo and seeing the same sort of thing on the slave zone:
>
>
> 2015-01-07 16:56:13.889445 7febba77c700  1 == starting new request
> req=0x7fec34075ba0 =
> 2015-01-07 16:56:13.889456 7febba77c700  2 req 2454:0.10::GET
> /auth::initializing
> 2015-01-07 16:56:13.889461 7febba77c700 10 host=192.168.122.21
> rgw_dns_name=ceph1
> 2015-01-07 16:56:13.889475 7febba77c700  2 req
> 2454:0.30:swift-auth:GET /auth::getting op
> 2015-01-07 16:56:13.889480 7febba77c700  2 req
> 2454:0.34:swift-auth:GET /auth:swift_auth_get:authorizing
> 2015-01-07 16:56:13.889481 7febba77c700  2 req
> 2454:0.36:swift-auth:GET /auth:swift_auth_get:reading permissions
> 2015-01-07 16:56:13.889482 7febba77c700  2 req
> 2454:0.37:swift-auth:GET /auth:swift_auth_get:init op
> 2015-01-07 16:56:13.889484 7febba77c700  2 req
> 2454:0.38:swift-auth:GET /auth:swift_auth_get:verifying op mask
> 2015-01-07 16:56:13.889485 7febba77c700 20 required_mask= 0 user.op_mask=7
> 2015-01-07 16:56:13.889486 7febba77c700  2 req
> 2454:0.41:swift-auth:GET /auth:swift_auth_get:verifying op permissions
> 2015-01-07 16:56:13.889487 7febba77c700  2 req
> 2454:0.42:swift-auth:GET /auth:swift_auth_get:verifying op params
> 2015-01-07 16:56:13.889488 7febba77c700  2 req
> 2454:0.43:swift-auth:GET /auth:swift_auth_get:executing
> 2015-01-07 16:56:13.889516 7febba77c700  2 req
> 2454:0.70:swift-auth:GET /auth:swift_auth_get:http status=403
> 2015-01-07 16:56:13.889518 7febba77c700  1 == req done
> req=0x7fec34075ba0 http_status=403 ==
> 2015-01-07 16:56:13.889521 7febba77c700 20 process_request() returned -1
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd resize (shrink) taking forever and a day

2015-01-07 Thread Robert LeBlanc
The bitmap certainly sounds like it would help shortcut a lot of code
that Xiaoxi mentions. Is the idea that the client caches the bitmap
for the RBD so it knows which OSDs to contact (thus saving a round trip
to the OSD), or only for the OSD to know which objects exist on its
disk?

On Tue, Jan 6, 2015 at 4:19 PM, Josh Durgin  wrote:
> On 01/06/2015 10:24 AM, Robert LeBlanc wrote:
>>
>> Can't this be done in parallel? If the OSD doesn't have an object then
>> it is a noop and should be pretty quick. The number of outstanding
>> operations can be limited to 100 or a 1000 which would provide a
>> balance between speed and performance impact if there is data to be
>> trimmed. I'm not a big fan of a "--skip-trimming" option as there is
>> the potential to leave some orphan objects that may not be cleaned up
>> correctly.
>
>
> Yeah, a --skip-trimming option seems a bit dangerous. This trimming
> actually is parallelized (10 ops at once by default, changeable via
> --rbd-concurrent-management-ops) since dumpling.
>
> What will really help without being dangerous is keeping a map of
> object existence [1]. This will avoid any unnecessary trimming
> automatically, and it should be possible to add to existing images.
> It should be in hammer.
>
> Josh
>
> [1] https://github.com/ceph/ceph/pull/2700
>
>
>> On Tue, Jan 6, 2015 at 8:09 AM, Jake Young  wrote:
>>>
>>>
>>>
>>> On Monday, January 5, 2015, Chen, Xiaoxi  wrote:


 When you shrinking the RBD, most of the time was spent on
 librbd/internal.cc::trim_image(), in this function, client will iterator
 all
 unnecessary objects(no matter whether it exists) and delete them.



 So in this case,  when Edwin shrinking his RBD from 650PB to 650GB,
 there are[ (650PB * 1024GB/PB -650GB) * 1024MB/GB ] / 4MB/Object =
 170,227,200 Objects need to be deleted.That will definitely take a long
 time
 since rbd client need to send a delete request to OSD, OSD need to find
 out
 the object context and delete(or doesn’t exist at all). The time needed
 to
 trim an image is ratio to the size needed to trim.



 make another image of the correct size and copy your VM's file system to
 the new image, then delete the old one will  NOT help in general, just
 because delete the old volume will take exactly the same time as
 shrinking ,
 they both need to call trim_image().



 The solution in my mind may be we can provide a "--skip-trimming" flag to
 skip the trimming. When the administrator absolutely sure there is no
 written have taken place in the shrinking area(that means there is no
 object
 created in these area), they can use this flag to skip the time
 consuming
 trimming.



 How do you think?
>>>
>>>
>>>
>>> That sounds like a good solution. Like doing "undo grow image"
>>>
>>>


 From: Jake Young [mailto:jak3...@gmail.com]
 Sent: Monday, January 5, 2015 9:45 PM
 To: Chen, Xiaoxi
 Cc: Edwin Peer; ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] rbd resize (shrink) taking forever and a day





 On Sunday, January 4, 2015, Chen, Xiaoxi  wrote:

 You could use rbd info   to see the block_name_prefix, the
 object name consist like .,  so for
 example, rb.0.ff53.3d1b58ba.e6ad should be the th object
 of
 the volume with block_name_prefix rb.0.ff53.3d1b58ba.

   $ rbd info huge
  rbd image 'huge':
   size 1024 TB in 268435456 objects
   order 22 (4096 kB objects)
   block_name_prefix: rb.0.8a14.2ae8944a
   format: 1

 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Edwin Peer
 Sent: Monday, January 5, 2015 3:55 AM
 To: ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] rbd resize (shrink) taking forever and a day

 Also, which rbd objects are of interest?

 
 ganymede ~ # rados -p client-disk-img0 ls | wc -l
 1672636
 

 And, all of them have cryptic names like:

 rb.0.ff53.3d1b58ba.e6ad
 rb.0.6d386.1d545c4d.00011461
 rb.0.50703.3804823e.1c28
 rb.0.1073e.3d1b58ba.b715
 rb.0.1d76.2ae8944a.022d

 which seem to bear no resemblance to the actual image names that the rbd
 command line tools understands?

 Regards,
 Edwin Peer

 On 01/04/2015 08:48 PM, Jake Young wrote:
>
>
>
> On Sunday, January 4, 2015, Dyweni - Ceph-Users
> <6exbab4fy...@dyweni.com > wrote:
>
>  Hi,
>
>  If its the only think in your pool, you could try deleting the
>  pool instead.
>
>  I found that to be faster in my testing; I had created 500TB when
>  I meant to create 500GB.
>
>

Re: [ceph-users] rbd resize (shrink) taking forever and a day

2015-01-07 Thread Josh Durgin

On 01/06/2015 04:19 PM, Robert LeBlanc wrote:

The bitmap certainly sounds like it would help shortcut a lot of code
that Xiaoxi mentions. Is the idea that the client caches the bitmap
for the RBD so it knows which OSDs to contact (thus saving a round trip
to the OSD), or only for the OSD to know which objects exist on its
disk?


It's purely at the rbd level, so librbd caches it and maintains its
consistency. The idea is that since it's kept consistent, librbd can do
things like delete exactly the objects that exist without any
extra communication with the osds. Many things that were
O(size of image) become O(written objects in image).

The only restriction is that keeping the object map consistent requires
a single writer, so this does not work for the rare case of e.g. ocfs2
on top of rbd, where there are multiple clients writing to the same
rbd image at once.

Josh


On Tue, Jan 6, 2015 at 4:19 PM, Josh Durgin  wrote:

On 01/06/2015 10:24 AM, Robert LeBlanc wrote:


Can't this be done in parallel? If the OSD doesn't have an object then
it is a noop and should be pretty quick. The number of outstanding
operations can be limited to 100 or a 1000 which would provide a
balance between speed and performance impact if there is data to be
trimmed. I'm not a big fan of a "--skip-trimming" option as there is
the potential to leave some orphan objects that may not be cleaned up
correctly.



Yeah, a --skip-trimming option seems a bit dangerous. This trimming
actually is parallelized (10 ops at once by default, changeable via
--rbd-concurrent-management-ops) since dumpling.

What will really help without being dangerous is keeping a map of
object existence [1]. This will avoid any unnecessary trimming
automatically, and it should be possible to add to existing images.
It should be in hammer.

Josh

[1] https://github.com/ceph/ceph/pull/2700



On Tue, Jan 6, 2015 at 8:09 AM, Jake Young  wrote:




On Monday, January 5, 2015, Chen, Xiaoxi  wrote:



When you shrink the RBD, most of the time is spent in
librbd/internal.cc::trim_image(); in this function, the client will iterate
over all
unnecessary objects (no matter whether they exist) and delete them.



So in this case, when Edwin shrinks his RBD from 650PB to 650GB,
there are [ (650PB * 1024GB/PB - 650GB) * 1024MB/GB ] / 4MB/object =
170,227,200 objects that need to be deleted. That will definitely take a
long time,
since the rbd client needs to send a delete request to the OSD for each
one, and the OSD needs to find
the object context and delete it (or find that it doesn't exist at all).
The time needed
to trim an image is proportional to the size being trimmed.



Making another image of the correct size, copying your VM's file system to
the new image, and then deleting the old one will NOT help in general,
because deleting the old volume will take exactly the same time as
shrinking;
they both need to call trim_image().



The solution in my mind may be that we provide a "--skip-trimming" flag to
skip the trimming. When the administrator is absolutely sure that no
writes have taken place in the area being shrunk away (that means no
objects
were created in that area), they can use this flag to skip the
time-consuming
trimming.



What do you think?




That sounds like a good solution. Like doing "undo grow image"





From: Jake Young [mailto:jak3...@gmail.com]
Sent: Monday, January 5, 2015 9:45 PM
To: Chen, Xiaoxi
Cc: Edwin Peer; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] rbd resize (shrink) taking forever and a day





On Sunday, January 4, 2015, Chen, Xiaoxi  wrote:

You could use rbd info   to see the block_name_prefix, the
object name consist like .,  so for
example, rb.0.ff53.3d1b58ba.e6ad should be the th object
of
the volume with block_name_prefix rb.0.ff53.3d1b58ba.

   $ rbd info huge
  rbd image 'huge':
   size 1024 TB in 268435456 objects
   order 22 (4096 kB objects)
   block_name_prefix: rb.0.8a14.2ae8944a
   format: 1

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
Edwin Peer
Sent: Monday, January 5, 2015 3:55 AM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] rbd resize (shrink) taking forever and a day

Also, which rbd objects are of interest?


ganymede ~ # rados -p client-disk-img0 ls | wc -l
1672636


And, all of them have cryptic names like:

rb.0.ff53.3d1b58ba.e6ad
rb.0.6d386.1d545c4d.00011461
rb.0.50703.3804823e.1c28
rb.0.1073e.3d1b58ba.b715
rb.0.1d76.2ae8944a.022d

which seem to bear no resemblance to the actual image names that the rbd
command line tools understands?

Regards,
Edwin Peer

On 01/04/2015 08:48 PM, Jake Young wrote:




On Sunday, January 4, 2015, Dyweni - Ceph-Users
<6exbab4fy...@dyweni.com > wrote:

  Hi,

  If its the only think in your pool, you could try deleting the
  pool instead.

  I found that to be faster in my testing; I had created 500TB when
  I meant to create 500GB.


Re: [ceph-users] librbd cache

2015-01-07 Thread Stuart Longland
Hi all, apologies for the slow reply.

Been flat out lately and so any cluster work has been relegated to the
back-burner.  I'm only just starting to get back to it now.

On 06/06/14 01:00, Sage Weil wrote:
> On Thu, 5 Jun 2014, Wido den Hollander wrote:
>> On 06/05/2014 08:59 AM, Stuart Longland wrote:
>>> Hi all,
>>>
>>> I'm looking into other ways I can boost the performance of RBD devices
>>> on the cluster here and I happened to see these settings:
>>>
>>> http://ceph.com/docs/next/rbd/rbd-config-ref/
>>>
>>> A query, is it possible for the cache mentioned there to be paged out to
>>> swap residing on a SSD or is it purely RAM-only?
> 
> Right now it is RAM only.
> 
>>> I see mention of cache-tiers, but these will be at the wrong end of the
>>> Ethernet cable for my usage: I want the cache on the Ceph clients
>>> themselves not back at the OSDs.
>>>
>>
>> So you want this to serve as a read cache as well?

Yes, this is probably more important to my needs than write cache.  The
disks in the storage (OSD+MON) nodes are fast enough, but the problem
seems to be the speed at which data can be shunted across the network.

The storage nodes each have one gigabit NIC on the "server" network
(exposed to clients) and one on a back-end "storage" network.

I want to eventually put another two network cards in those boxes, but
2U-compatible cards aren't that common, and the budget is not high.
(10GbE can't come down in price fast enough either.  AU$600 a card?  Ouch!)

>> The librbd cache is mainly used as a write-cache for small writes, it's not
>> intended to be a large read cache.
> 
> Right.  There was a blueprint describing a larger (shared) read cache that 
> could be stored on a local SSD or file system, but it hasn't moved beyond 
> the concept stage.
> 
>   http://wiki.ceph.com/Planning/Sideboard/rbd%3A_shared_read_cache

Ahh okay, so a future release.  That document also answers another
question I had, that being "was the RAM cache shared between all RBDs on
a client or per-RBD?".  The answer, of course, is that it's per-RBD.

In the interests of science I did some testing over the last couple of
days.  When I deployed the cluster I used the (then latest) Emperor
release.  Monday I did an update to the Firefly release, checked
everything over, then moved to Giant.

So the storage nodes are now all running ceph version 0.87
(c51c8f9d80fa4e0168aa52685b8de40e42758578).  OS there is Ubuntu 12.04 LTS.

I have my laptop plugged in to the "client" network (so one router hop
away) with its on-board gigabit interface, and decided to do some tests
there with a KVM virtual machine.  The host OS is Gentoo with ceph
version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3) and QEMU
2.1.2.  The machine itself is a Core i5 with 8GB RAM.

My VM had 256MB RAM, ran Debian Wheezy with two RBD "virtio" disks (8GB
OS and 80GB data), and I used bonnie++ on the data RBD formatted xfs
(default mkfs.xfs options).

The tests were each conducted by starting up the VM, logging in,
performing a test with bonnie++ (8GB file and specifying 256MB RAM,
otherwise using defaults), then powering off the VM before altering
/etc/ceph/ceph.conf for the next test.

With the stock Ceph cache settings, so 32MB RBD cache, default writeback
threshold, I get the following from bonnie++:
> Version  1.96   --Sequential Output-- --Sequential Input- 
> --Random-
> Concurrency   1 -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- 
> --Seeks--
> MachineSize K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec 
> %CP
> debian   8G  1631  81 11260   0  4353   0  2913  97 11046   1 112.7   
> 2
> Latency  6564us 539ms 863ms   16660us 433ms 587ms
> Version  1.96   --Sequential Create-- Random 
> Create
> debian  -Create-- --Read--- -Delete-- -Create-- --Read--- 
> -Delete--
>   files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec 
> %CP
>  16 11833  22 + +++ 25112  44  9567  17 + +++ 15944  
> 28
> Latency   441ms 149us 138us 765ms  39us  97us
> 1.96,1.96,debian,1,1420488810,8G,,1631,81,11260,0,4353,0,2913,97,11046,1,112.7,2,16,11833,22,+,+++,25112,44,9567,17,+,+++,15944,28,6564us,539ms,863ms,16660us,433ms,587ms,441ms,149us,138us,765ms,39us,97us

If I disable writeback and up the RBD cache to 2GB, I get:
> Version  1.96   --Sequential Output-- --Sequential Input- 
> --Random-
> Concurrency   1 -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- 
> --Seeks--
> MachineSize K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec 
> %CP
> debian   8G  1506  82 11225   0  4091   0  2096  51  9227   0 117.3   
> 3
> Latency  8966us2540ms1554ms 472ms2190ms 747ms
> Version  1.96   --Sequential Create-- Random 
> Create
> debian  -Create-- --Read--- -Delete-- -Create-- --Read--- 
> -

Re: [ceph-users] Re: rbd resize (shrink) taking forever and a day

2015-01-07 Thread Sage Weil
On Tue, 6 Jan 2015, Chen, Xiaoxi wrote:
> it is already done in parallel, the outstanding ops are limited to ~10 per
> client (tuneable), so enlarging this may help.
> 
> But please note that there is no no-op here: the OSD has no idea whether it
> has an object until it fails to find it on disk, which means the op has
> almost traveled the whole code path.

Also keep in mind that the new object map stuff we're about to merge for 
hammer makes this problem go away.  From hammer onwards we'll know which 
objects exist and will only try to delete (or export, or clone, or 
read) ones that exist.

sage


> 
> Robert LeBlanc wrote:
> 
> > Can't this be done in parallel? If the OSD doesn't have an object then
> > it is a noop and should be pretty quick. The number of outstanding
> > operations can be limited to 100 or a 1000 which would provide a
> > balance between speed and performance impact if there is data to be
> > trimmed. I'm not a big fan of a "--skip-trimming" option as there is
> > the potential to leave some orphan objects that may not be cleaned up
> > correctly.
> > 
> > On Tue, Jan 6, 2015 at 8:09 AM, Jake Young  wrote:
> > >
> > >
> > > On Monday, January 5, 2015, Chen, Xiaoxi  wrote:
> > >>
> > >> When you shrinking the RBD, most of the time was spent on
> > >> librbd/internal.cc::trim_image(), in this function, client will iterator 
> > >> all
> > >> unnecessary objects(no matter whether it exists) and delete them.
> > >>
> > >>
> > >>
> > >> So in this case,  when Edwin shrinking his RBD from 650PB to 650GB,
> > >> there are[ (650PB * 1024GB/PB -650GB) * 1024MB/GB ] / 4MB/Object =
> > >> 170,227,200 Objects need to be deleted.That will definitely take a long 
> > >> time
> > >> since rbd client need to send a delete request to OSD, OSD need to find 
> > >> out
> > >> the object context and delete(or doesn't exist at all). The time needed 
> > >> to
> > >> trim an image is ratio to the size needed to trim.
> > >>
> > >>
> > >>
> > >> make another image of the correct size and copy your VM's file system to
> > >> the new image, then delete the old one will  NOT help in general, just
> > >> because delete the old volume will take exactly the same time as 
> > >> shrinking ,
> > >> they both need to call trim_image().
> > >>
> > >>
> > >>
> > >> The solution in my mind may be we can provide a "--skip-trimming" flag to
> > >> skip the trimming. When the administrator absolutely sure there is no
> > >> written have taken place in the shrinking area(that means there is no 
> > >> object
> > >> created in these area), they can use this flag to skip the time consuming
> > >> trimming.
> > >>
> > >>
> > >>
> > >> How do you think?
> > >
> > >
> > > That sounds like a good solution. Like doing "undo grow image"
> > >
> > >
> > >>
> > >>
> > >> From: Jake Young [mailto:jak3...@gmail.com]
> > >> Sent: Monday, January 5, 2015 9:45 PM
> > >> To: Chen, Xiaoxi
> > >> Cc: Edwin Peer; ceph-users@lists.ceph.com
> > >> Subject: Re: [ceph-users] rbd resize (shrink) taking forever and a day
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> On Sunday, January 4, 2015, Chen, Xiaoxi  wrote:
> > >>
> > >> You could use rbd info   to see the block_name_prefix, the
> > >> object name consist like .,  so for
> > >> example, rb.0.ff53.3d1b58ba.e6ad should be the th object  
> > >> of
> > >> the volume with block_name_prefix rb.0.ff53.3d1b58ba.
> > >>
> > >>  $ rbd info huge
> > >> rbd image 'huge':
> > >>  size 1024 TB in 268435456 objects
> > >>  order 22 (4096 kB objects)
> > >>  block_name_prefix: rb.0.8a14.2ae8944a
> > >>  format: 1
> > >>
> > >> -Original Message-
> > >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> > >> Edwin Peer
> > >> Sent: Monday, January 5, 2015 3:55 AM
> > >> To: ceph-users@lists.ceph.com
> > >> Subject: Re: [ceph-users] rbd resize (shrink) taking forever and a day
> > >>
> > >> Also, which rbd objects are of interest?
> > >>
> > >> 
> > >> ganymede ~ # rados -p client-disk-img0 ls | wc -l
> > >> 1672636
> > >> 
> > >>
> > >> And, all of them have cryptic names like:
> > >>
> > >> rb.0.ff53.3d1b58ba.e6ad
> > >> rb.0.6d386.1d545c4d.00011461
> > >> rb.0.50703.3804823e.1c28
> > >> rb.0.1073e.3d1b58ba.b715
> > >> rb.0.1d76.2ae8944a.022d
> > >>
> > >> which seem to bear no resemblance to the actual image names that the rbd
> > >> command line tools understands?
> > >>
> > >> Regards,
> > >> Edwin Peer
> > >>
> > >> On 01/04/2015 08:48 PM, Jake Young wrote:
> > >> >
> > >> >
> > >> > On Sunday, January 4, 2015, Dyweni - Ceph-Users
> > >> > <6exbab4fy...@dyweni.com > wrote:
> > >> >
> > >> > Hi,
> > >> >
> > >> > If its the only think in your pool, you could try deleting the
> > >> > pool instead.
> > >> >
> > >> > I found that to be faster in my testing; I had created 500TB when
> > >> > I meant to create 500GB.
> > >> >
> > >> > Note for the Dev

[ceph-users] Undeleted objects - is there a garbage collector?

2015-01-07 Thread Max Power
Hi,

my osd folder "current" has a size of ~360MB but I do not have any
objects inside the corresponding pool; ceph status reports '8 bytes
data'. Even with 'rados -p mypool ls --all' I do not see any objects.
But there are a few current/12._head folders with files consuming
disk space.

How to "cleanup" the folders to free disk space?

Greetings!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Fwd: Multi-site deployment RBD and Federated Gateways

2015-01-07 Thread Logan Barfield
Hello,

I'm re-sending this message since I didn't see it picked up on the list
archives yesterday.  My apologies if it was received previously.

We are currently running a single datacenter Ceph deployment.  Our setup is
as follows:
- 4 HDD OSD nodes (primarily used for RadosGW/Object Storage)
- 2 SSD OSD nodes (used for RBD/VM block devices)
- 3 Monitor daemons running on 3 of the HDD OSD nodes
- The CRUSH rules are set to push all data to the HDD nodes except for the
RBD pool, which uses the SSD nodes.

Our goal is to have OSD nodes in 3 datacenters (US East, US West, Europe).
I'm thinking that we would want the following setup:
- RadosGW instance in each datacenter with geo-dns to direct clients to the
closest one.
- Same OSD configuration as our current location (HDD for RadosGW, SSD for
RBD)
- Separate RBD pool in each datacenter for VM block devices.
- CRUSH rules:
-> RadosGW: 3 replicas, different OSD nodes, at least 1 off-site (e.g., 2
replicas on 2 OSD nodes in one datacenter, 1 replica on 1 OSD node in a
different datacenter).  I don't know if RadosGW is geo-aware enough to do
this efficiently (a rough CRUSH sketch of this 2+1 layout follows below).
-> RBD: 2 replicas across 2 OSD nodes in the same datacenter.
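
A hedged sketch of what such a 2+1 rule could look like in a single
cluster spanning two datacenters (the bucket names dc-east/dc-west are
placeholders and would have to exist as datacenter-level buckets in the
CRUSH hierarchy; this only illustrates the placement, not the WAN latency
question):

    rule rgw_two_plus_one {
            ruleset 2
            type replicated
            min_size 2
            max_size 3
            step take dc-east
            step chooseleaf firstn 2 type host
            step emit
            step take dc-west
            step chooseleaf firstn 1 type host
            step emit
    }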

From the documentation it looks like the best way to accomplish this would
be to have a separate cluster in each datacenter, then use a federated
RadosGW configuration to keep geo-redundant replicas of objects.  The other
option would be to have one cluster spanning all 3 locations, but since
they would be connected over VPN/WAN links that doesn't seem ideal.

Concerns:
- With a federated configuration it looks like only one zone will be
writable, so if the master zone is on the east coast all of the west coast
clients would be uploading there as well.
- It doesn't appear that there is a way to only have 1 replica sent to the
secondary zone, rather all data written to the master is replicated to the
secondary (e.g., 3 replicas in each location).  Alternatively with multiple
regions both zones would be read/write, but only metadata would be synced.
- From the documentation I understand that there should be different pools
for each zone, and each cluster will need to have a different name.  Since
our current cluster is in production I don't know how safe it would be to
rename/move pools, or re-name the cluster.  We are using the default "ceph"
cluster name right now because different names add complexity (e.g,
requiring '--cluster' for all commands), and we noticed in testing that
some of the init scripts don't play well with custom cluster names.

It would seem to me that having a federated configuration would add a lot
of complexity. It wouldn't get us exactly what we'd like for replication
(one offsite copy), and doesn't allow for geo-aware writes.

I've seen a few examples of CRUSH maps that span multiple datacenters.
This would seem to be an easier setup, and would get us closer to what we
want with replication.  My only concern would be the WAN latency, setting
up site-to-site VPN (which I don't think is necessary for the federated
setup), and how well Ceph would handle losing a connection to one of the
remote sites for a few seconds or minutes.

Is there a recommended deployment for what we want to do, or any reference
guides beyond the official Ceph docs?  I know Ceph is being used for
multi-site deployments, but other than a few blog posts demonstrating
theoretical setups and vague Powerpoint slides I haven't seen any details
on it.  Unfortunately we are a very small company, so consulting with
Inktank/RedHat isn't financially feasible right now.

Any suggestions/insight would be much appreciated.


Thank You,

Logan Barfield
Tranquil Hosting
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] pg repair unsuccessful

2015-01-07 Thread Jiri Kanicky

Hi,

I have been experiencing issues with several PGs which remained in an
inconsistent state (I use BTRFS). "ceph pg repair" is not able to repair
them. The only way out I have found is to delete the corresponding file,
which is causing the issue (see logs below), from the OSDs. This however
means loss of data.


Is there any other way to fix it?

$ ceph health detail
HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
pg 2.17 is active+clean+inconsistent, acting [1,3]
1 scrub errors

Log output:
2015-01-07 21:43:13.396376 7f0c5ac53700 -1 log_channel(default) log 
[ERR] : repair 2.17 f2a47417/100f485./head//2 on disk size 
(4194304) does not match object info size (0) adjusted for ondisk to (0)
2015-01-07 21:43:56.771820 7f0c5ac53700 -1 log_channel(default) log 
[ERR] : 2.17 repair 1 errors, 0 fixed
2015-01-07 21:44:10.473870 7f0c5ac53700 -1 log_channel(default) log 
[ERR] : deep-scrub 2.17 f2a47417/100f485./head//2 on disk 
size (4194304) does not match object info size (0) adjusted for ondisk 
to (0)
2015-01-07 21:44:42.919425 7f0c5ac53700 -1 log_channel(default) log 
[ERR] : 2.17 deep-scrub 1 errors
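
When "ceph pg repair" cannot sort it out, the usual manual route (hedged,
and only after double-checking which copy is actually the wrong one - in
the log above the on-disk size of 4194304 disagrees with the recorded
object info size of 0) is to move the bad replica aside and repair again;
the paths below assume the default filestore layout and the object name
pattern comes from the log:

    $ ceph pg map 2.17      # acting set, here [1,3]
    $ find /var/lib/ceph/osd/ceph-1/current/2.17_head/ -name '*100f485*' -ls
    $ find /var/lib/ceph/osd/ceph-3/current/2.17_head/ -name '*100f485*' -ls
    # stop the OSD holding the bad copy, move (do not delete) that file aside,
    # start the OSD again, then:
    $ ceph pg repair 2.17
    $ ceph pg deep-scrub 2.17   # confirm it comes back clean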



Thx Jiri
cephver 0.87, Debian Wheezy, BTRFS

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Regarding Federated Gateways - Zone Sync Issues

2015-01-07 Thread Mark Kirkwood

On 07/01/15 16:22, Mark Kirkwood wrote:



FWIW I can reproduce this too (ceph 0.90-663-ge1384af). The *user*
replicates ok (complete with its swift keys and secret). I can
authenticate to both zones ok using S3 api (boto version 2.29), but only
to the master using swift (swift client versions 2.3.1 and 2.0.3). In
the case of the slave zone I'm seeing the same error stack as the above.

I'm running Ubuntu 14.10 for ceph and rgw with Apache (version 2.4.10)
the standard repos. I'll try replacing the fastcgi module to see if that
is a factor.



Does not appear to be - downgraded apache to 2.4.7 + fastcgi from ceph 
repo and seeing the same sort of thing on the slave zone:



2015-01-07 16:56:13.889445 7febba77c700  1 == starting new request 
req=0x7fec34075ba0 =
2015-01-07 16:56:13.889456 7febba77c700  2 req 2454:0.10::GET 
/auth::initializing
2015-01-07 16:56:13.889461 7febba77c700 10 host=192.168.122.21 
rgw_dns_name=ceph1
2015-01-07 16:56:13.889475 7febba77c700  2 req 
2454:0.30:swift-auth:GET /auth::getting op
2015-01-07 16:56:13.889480 7febba77c700  2 req 
2454:0.34:swift-auth:GET /auth:swift_auth_get:authorizing
2015-01-07 16:56:13.889481 7febba77c700  2 req 
2454:0.36:swift-auth:GET /auth:swift_auth_get:reading permissions
2015-01-07 16:56:13.889482 7febba77c700  2 req 
2454:0.37:swift-auth:GET /auth:swift_auth_get:init op
2015-01-07 16:56:13.889484 7febba77c700  2 req 
2454:0.38:swift-auth:GET /auth:swift_auth_get:verifying op mask

2015-01-07 16:56:13.889485 7febba77c700 20 required_mask= 0 user.op_mask=7
2015-01-07 16:56:13.889486 7febba77c700  2 req 
2454:0.41:swift-auth:GET /auth:swift_auth_get:verifying op permissions
2015-01-07 16:56:13.889487 7febba77c700  2 req 
2454:0.42:swift-auth:GET /auth:swift_auth_get:verifying op params
2015-01-07 16:56:13.889488 7febba77c700  2 req 
2454:0.43:swift-auth:GET /auth:swift_auth_get:executing
2015-01-07 16:56:13.889516 7febba77c700  2 req 
2454:0.70:swift-auth:GET /auth:swift_auth_get:http status=403
2015-01-07 16:56:13.889518 7febba77c700  1 == req done 
req=0x7fec34075ba0 http_status=403 ==

2015-01-07 16:56:13.889521 7febba77c700 20 process_request() returned -1

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Regarding Federated Gateways - Zone Sync Issues

2015-01-07 Thread Mark Kirkwood

On 06/01/15 06:45, hemant burman wrote:

One more thing Yehuda,

In radosgw log in Slave Zone:

2015-01-05 17:22:42.188108 7fe4b66d2780 20 enqueued request req=0xbc1f50

2015-01-05 17:22:42.188125 7fe4b66d2780 20 RGWWQ:

2015-01-05 17:22:42.188126 7fe4b66d2780 20 req: 0xbc1f50

2015-01-05 17:22:42.188129 7fe4b66d2780 10 allocated request req=0xc1b4f0

2015-01-05 17:22:42.190310 7fe4617b2700 20 dequeued request req=0xbc1f50

2015-01-05 17:22:42.190951 7fe4617b2700 20 RGWWQ: empty

2015-01-05 17:22:42.191466 7fe4617b2700  1 == starting new request
req=0xbc1f50 =

2015-01-05 17:22:42.192023 7fe4617b2700  2 req 4374:0.000558::GET
/auth::initializing

2015-01-05 17:22:42.192046 7fe4617b2700 20 FCGI_ROLE=RESPONDER

2015-01-05 17:22:42.192047 7fe4617b2700 20 SCRIPT_URL=/auth

2015-01-05 17:22:42.192047 7fe4617b2700 20
SCRIPT_URI=http://ceph-all-1:81/auth

2015-01-05 17:22:42.192048 7fe4617b2700 20 HTTP_AUTHORIZATION=

2015-01-05 17:22:42.192048 7fe4617b2700 20 HTTP_USER_AGENT=curl/7.22.0
(x86_64-pc-linux-gnu) libcurl/7.22.0 OpenSSL/1.0.1 zlib/1.2.3.4
 libidn/1.23 librtmp/2.3

2015-01-05 17:22:42.192050 7fe4617b2700 20 HTTP_HOST=ceph-all-1:81

2015-01-05 17:22:42.192050 7fe4617b2700 20 HTTP_ACCEPT=*/*

2015-01-05 17:22:42.192051 7fe4617b2700 20 HTTP_X_AUTH_USER=johndoe3\:swift

2015-01-05 17:22:42.192051 7fe4617b2700 20
HTTP_X_AUTH_KEY=pehvS2YFl8QcaI1ehvBcSANkQyrQUOQqbib8V2wK

2015-01-05 17:22:42.192052 7fe4617b2700 20 PATH=/usr/local/bin:/usr/bin:/bin

2015-01-05 17:22:42.192052 7fe4617b2700 20 SERVER_SIGNATURE=

2015-01-05 17:22:42.192052 7fe4617b2700 20 SERVER_SOFTWARE=Apache

2015-01-05 17:22:42.192053 7fe4617b2700 20 SERVER_NAME=ceph-all-1

2015-01-05 17:22:42.192053 7fe4617b2700 20 SERVER_ADDR=192.168.56.108

2015-01-05 17:22:42.192054 7fe4617b2700 20 SERVER_PORT=81

2015-01-05 17:22:42.192054 7fe4617b2700 20 REMOTE_ADDR=192.168.56.107

2015-01-05 17:22:42.192054 7fe4617b2700 20 DOCUMENT_ROOT=/var/www

2015-01-05 17:22:42.192055 7fe4617b2700 20
SERVER_ADMIN=ad...@example.com 

2015-01-05 17:22:42.192055 7fe4617b2700 20
SCRIPT_FILENAME=/var/www/s3gw-us-1-west-1.fcgi

2015-01-05 17:22:42.192055 7fe4617b2700 20 REMOTE_PORT=46084

2015-01-05 17:22:42.192056 7fe4617b2700 20 GATEWAY_INTERFACE=CGI/1.1

2015-01-05 17:22:42.192056 7fe4617b2700 20 SERVER_PROTOCOL=HTTP/1.1

2015-01-05 17:22:42.192056 7fe4617b2700 20 REQUEST_METHOD=GET

2015-01-05 17:22:42.192057 7fe4617b2700 20 QUERY_STRING=page=auth¶ms=

2015-01-05 17:22:42.192057 7fe4617b2700 20 REQUEST_URI=/auth

2015-01-05 17:22:42.192058 7fe4617b2700 20 SCRIPT_NAME=/auth

2015-01-05 17:22:42.192058 7fe4617b2700  2 req
4374:0.000593:swift-auth:GET /auth::getting op

2015-01-05 17:22:42.192060 7fe4617b2700  2 req
4374:0.000595:swift-auth:GET /auth:swift_auth_get:authorizing

2015-01-05 17:22:42.192061 7fe4617b2700  2 req
4374:0.000596:swift-auth:GET /auth:swift_auth_get:reading permissions

2015-01-05 17:22:42.192062 7fe4617b2700  2 req
4374:0.000597:swift-auth:GET /auth:swift_auth_get:verifying op mask

2015-01-05 17:22:42.192063 7fe4617b2700 20 required_mask= 0 user.op_mask=7

2015-01-05 17:22:42.192064 7fe4617b2700  2 req
4374:0.000599:swift-auth:GET /auth:swift_auth_get:verifying op permissions

2015-01-05 17:22:42.192065 7fe4617b2700  2 req
4374:0.000600:swift-auth:GET /auth:swift_auth_get:verifying op params

2015-01-05 17:22:42.192066 7fe4617b2700  2 req
4374:0.000601:swift-auth:GET /auth:swift_auth_get:executing

2015-01-05 17:22:42.192082 7fe4617b2700 20 get_obj_state:
rctx=0x7fe494009210 obj=.us-1-west-1.users.swift:johndoe3\:swift
state=0x7fe49401e968 s->prefetch_data=0

2015-01-05 17:22:42.192090 7fe4617b2700 10 moving
.us-1-west-1.users.swift+johndoe3\:swift to cache LRU end

2015-01-05 17:22:42.192092 7fe4617b2700 10 cache get:
name=.us-1-west-1.users.swift+johndoe3\:swift : type miss (requested=6,
cached=0)

2015-01-05 17:22:42.197835 7fe4617b2700 10 cache put:
name=.us-1-west-1.users.swift+johndoe3\:swift

2015-01-05 17:22:42.197842 7fe4617b2700 10 moving
.us-1-west-1.users.swift+johndoe3\:swift to cache LRU end

2015-01-05 17:22:42.197871 7fe4617b2700  5 nothing to log for operation

2015-01-05 17:22:42.197873 7fe4617b2700  2 req
4374:0.006408:swift-auth:GET /auth:swift_auth_get:http status=403

*2015-01-05 17:22:42.198725 7fe4617b2700  1 == req done req=0xbc1f50
http_status=403 ==*



In Master Zone:


I can see a call going to build the token, but it's not happening for the
slave zone; it seems like it's failing somewhere in rgw_swift_auth.cc, but
I'm not sure which section. It could be in the bold section below
(get_random_bytes), or maybe somewhere before that:

static int encode_token(CephContext *cct, string& swift_user, string&
key, bufferlist& bl)
{
uint64_t nonce;

*int ret = get_random_bytes((char *)&nonce, sizeof(nonce));*
*if (ret < 0)*
*return ret;*

utime_t expiration = ceph_clock_now(cct);
expiration += cct->_conf->rgw_swi

Re: [ceph-users] Regarding Federated Gateways - Zone Sync Issues

2015-01-07 Thread Mark Kirkwood

On 07/01/15 17:43, hemant burman wrote:

Hello Yehuda,

The issue seems to be with the user data file for the swift subuser not
getting synced properly.



FWIW, I'm seeing exactly the same thing as well (Hemant - that was well 
spotted)!


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Re: rbd resize (shrink) taking forever and a day

2015-01-07 Thread Chen, Xiaoxi

it is already done in parallel; the outstanding ops are limited to ~10 per
client (tuneable), so enlarging this may help.

But please note that there is no no-op here: the OSD has no idea whether it
has an object until it fails to find it on disk, which means the op has
almost traveled the whole code path.
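
The limit referred to above can be raised per invocation, for example (pool
and image names here are placeholders, the value 20 is arbitrary, the
deletes still have to touch every candidate object, and depending on the
rbd version shrinking may also require --allow-shrink):

    $ rbd resize --size 665600 client-disk-img0/huge --rbd-concurrent-management-ops 20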

Robert LeBlanc wrote:

> Can't this be done in parallel? If the OSD doesn't have an object then
> it is a noop and should be pretty quick. The number of outstanding
> operations can be limited to 100 or a 1000 which would provide a
> balance between speed and performance impact if there is data to be
> trimmed. I'm not a big fan of a "--skip-trimming" option as there is
> the potential to leave some orphan objects that may not be cleaned up
> correctly.
> 
> On Tue, Jan 6, 2015 at 8:09 AM, Jake Young  wrote:
> >
> >
> > On Monday, January 5, 2015, Chen, Xiaoxi  wrote:
> >>
> >> When you shrinking the RBD, most of the time was spent on
> >> librbd/internal.cc::trim_image(), in this function, client will iterator 
> >> all
> >> unnecessary objects(no matter whether it exists) and delete them.
> >>
> >>
> >>
> >> So in this case,  when Edwin shrinking his RBD from 650PB to 650GB,
> >> there are[ (650PB * 1024GB/PB -650GB) * 1024MB/GB ] / 4MB/Object =
> >> 170,227,200 Objects need to be deleted.That will definitely take a long 
> >> time
> >> since rbd client need to send a delete request to OSD, OSD need to find out
> >> the object context and delete(or doesn’t exist at all). The time needed to
> >> trim an image is ratio to the size needed to trim.
> >>
> >>
> >>
> >> make another image of the correct size and copy your VM's file system to
> >> the new image, then delete the old one will  NOT help in general, just
> >> because delete the old volume will take exactly the same time as shrinking 
> >> ,
> >> they both need to call trim_image().
> >>
> >>
> >>
> >> The solution in my mind may be we can provide a “—skip-triming” flag to
> >> skip the trimming. When the administrator absolutely sure there is no
> >> written have taken place in the shrinking area(that means there is no 
> >> object
> >> created in these area), they can use this flag to skip the time consuming
> >> trimming.
> >>
> >>
> >>
> >> How do you think?
> >
> >
> > That sounds like a good solution. Like doing "undo grow image"
> >
> >
> >>
> >>
> >> From: Jake Young [mailto:jak3...@gmail.com]
> >> Sent: Monday, January 5, 2015 9:45 PM
> >> To: Chen, Xiaoxi
> >> Cc: Edwin Peer; ceph-users@lists.ceph.com
> >> Subject: Re: [ceph-users] rbd resize (shrink) taking forever and a day
> >>
> >>
> >>
> >>
> >>
> >> On Sunday, January 4, 2015, Chen, Xiaoxi  wrote:
> >>
> >> You could use rbd info <volume> to see the block_name_prefix; the
> >> object name is of the form <block_name_prefix>.<object number>, so for
> >> example, rb.0.ff53.3d1b58ba.e6ad should be the <object number>-th object of
> >> the volume with block_name_prefix rb.0.ff53.3d1b58ba.
> >>
> >>  $ rbd info huge
> >> rbd image 'huge':
> >>  size 1024 TB in 268435456 objects
> >>  order 22 (4096 kB objects)
> >>  block_name_prefix: rb.0.8a14.2ae8944a
> >>  format: 1
> >>
> >> -Original Message-
> >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> >> Edwin Peer
> >> Sent: Monday, January 5, 2015 3:55 AM
> >> To: ceph-users@lists.ceph.com
> >> Subject: Re: [ceph-users] rbd resize (shrink) taking forever and a day
> >>
> >> Also, which rbd objects are of interest?
> >>
> >> 
> >> ganymede ~ # rados -p client-disk-img0 ls | wc -l
> >> 1672636
> >> 
> >>
> >> And, all of them have cryptic names like:
> >>
> >> rb.0.ff53.3d1b58ba.e6ad
> >> rb.0.6d386.1d545c4d.00011461
> >> rb.0.50703.3804823e.1c28
> >> rb.0.1073e.3d1b58ba.b715
> >> rb.0.1d76.2ae8944a.022d
> >>
> >> which seem to bear no resemblance to the actual image names that the rbd
> >> command line tools understands?
> >>
> >> Regards,
> >> Edwin Peer
> >>
> >> On 01/04/2015 08:48 PM, Jake Young wrote:
> >> >
> >> >
> >> > On Sunday, January 4, 2015, Dyweni - Ceph-Users
> >> > <6exbab4fy...@dyweni.com > wrote:
> >> >
> >> > Hi,
> >> >
> >> > If its the only think in your pool, you could try deleting the
> >> > pool instead.
> >> >
> >> > I found that to be faster in my testing; I had created 500TB when
> >> > I meant to create 500GB.
> >> >
> >> > Note for the Devs: I would be nice if rbd create/resize would
> >> > accept sizes with units (i.e. MB GB TB PB, etc).
> >> >
> >> >
> >> >
> >> >
> >> > On 2015-01-04 08:45, Edwin Peer wrote:
> >> >
> >> > Hi there,
> >> >
> >> > I did something stupid while growing an rbd image. I
> >> > accidentally
> >> > mistook the units of the resize command for bytes instead of
> >> > megabytes
> >> > and grew an rbd image to 650PB instead of 650GB. This all
> >> > happened
> >> > instantaneously enough, but trying to rectify the mistake

Re: [ceph-users] Different disk usage on different OSDs

2015-01-07 Thread Christian Balzer
On Wed, 7 Jan 2015 00:54:13 +0900 Christian Balzer wrote:

> On Tue, 6 Jan 2015 19:28:44 +0400 ivan babrou wrote:
> 
> > Restarting OSD fixed PGs that were stuck:
> > http://i.imgur.com/qd5vuzV.png
> > 
> Good to hear that. 
> 
> Funny (not really) how often restarting OSDs fixes stuff like that.
> 
> > Still OSD disk usage is very different, 150..250gb. Shall I double PGs
> > again?
> > 
> Not really, your settings are now if anything on the high side.
> 
> Looking at your graph and data the current variance is clearly an
> improvement over the previous state. 
> Though far from ideal of course.
> 
> I had a Firefly cluster that had non-optimal CRUSH tunables until 20
> minutes ago.
> From the looks of it so far it will improve data placement, however it is
> a very involved process (lots of data movement) and on top of that your
> clients need to all support this.
> 

So the re-balancing finished after moving 35% of my objects in about 1.5
hours. 
Clearly this is something that should be done during off-peak times and
with potentially tuning the backfilling stuff down.
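
For reference, this is roughly what I used (from memory, so treat it as a
sketch and adjust the values to your own environment):

ceph osd crush tunables optimal
ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'

The injectargs values only last until the OSDs restart, which is fine for
a one-off migration like this.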

Before getting to the results, a question for the devs:
Why can't I see tunables_3 (or chooseleaf_vary_r) in either the running
config or the "ceph osd crush show-tunables" output?
---
{ "choose_local_tries": 0,
  "choose_local_fallback_tries": 0,
  "choose_total_tries": 50,
  "chooseleaf_descend_once": 1,
  "profile": "bobtail",
  "optimal_tunables": 0,
  "legacy_tunables": 0,
  "require_feature_tunables": 1,
  "require_feature_tunables2": 1}
---
Note that after setting things to optimal unsurprisingly the only thing
that changes is the profile (to firefly) and optimal_tunables to 1.
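
The only place I managed to see chooseleaf_vary_r at all is the decompiled
crushmap, for example (assuming crushtool is installed on the node):

ceph osd getcrushmap -o /tmp/crushmap
crushtool -d /tmp/crushmap -o /tmp/crushmap.txt
grep chooseleaf /tmp/crushmap.txt

which, after switching to the firefly profile, should show a line like
"tunable chooseleaf_vary_r 1" in the tunables section at the top.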

Now for the results, it reduced my variance from 30% to 25%. 
Actually nearly all OSDs are now within 15% of each other, but one OSD
still is 10% larger than the average.

It might turn out better for Ivan, but no guarantees of course. 
Given that even 5% should help and you've just reduced the data size to
accommodate such a data rebalancing I'd go for it, provided your clients
can handle this change as pointed out below.

Christian

> So let me get back to you tomorrow if that actually improved things
> massively and you should read up at:
> 
> http://ceph.com/docs/master/rados/operations/crush-map/#tunables
> 
> In particular:
> ---
> WHICH CLIENT VERSIONS SUPPORT CRUSH_TUNABLES3
> 
> v0.78 (firefly) or later
> Linux kernel version v3.15 or later (for the file system and RBD kernel
> clients) ---
> 
> Regards,
> 
> Christian
> 
> > On 6 January 2015 at 17:12, ivan babrou  wrote:
> > 
> > > I deleted some old backups and GC is returning some disk space back.
> > > But cluster state is still bad:
> > >
> > > 2015-01-06 13:35:54.102493 mon.0 [INF] pgmap v4017947: 5832 pgs: 23
> > > active+remapped+wait_backfill, 1
> > > active+remapped+wait_backfill+backfill_toofull, 2
> > > active+remapped+backfilling, 5806 active+clean; 9453 GB data, 22784
> > > GB used, 21750 GB / 46906 GB avail; 0 B/s wr, 78 op/s; 47275/8940623
> > > objects degraded (0.529%)
> > >
> > > Here's how disk utilization across OSDs looks like:
> > > http://i.imgur.com/RWk9rvW.png
> > >
> > > Still one OSD is super-huge. I don't understand why one PG is toofull if
> > > the biggest OSD moved from 348gb to 294gb.
> > >
> > > root@51f2dde75901:~# ceph pg dump | grep '^[0-9]\+\.' | fgrep full
> > > dumped all in format plain
> > > 10.f26 1018 0 1811 0 2321324247 3261 3261
> > > active+remapped+wait_backfill+backfill_toofull 2015-01-05
> > > 15:06:49.504731 22897'359132 22897:48571 [91,1] 91 [8,40] 8
> > > 19248'358872 2015-01-05 11:58:03.062029 18326'358786 2014-12-31
> > > 23:43:02.285043
> > >
> > >
> > > On 6 January 2015 at 03:40, Christian Balzer  wrote:
> > >
> > >> On Mon, 5 Jan 2015 23:41:17 +0400 ivan babrou wrote:
> > >>
> > >> > Rebalancing is almost finished, but things got even worse:
> > >> > http://i.imgur.com/0HOPZil.png
> > >> >
> > >> Looking at that graph only one OSD really kept growing and growing,
> > >> everything else seems to be a lot denser, less varied than before,
> > >> as one would have expected.
> > >>
> > >> Since I don't think you mentioned it before, what version of Ceph
> > >> are you using and how are your CRUSH tunables set?
> > >>
> > >
> > > I'm on 0.80.7 upgraded from 0.80.5. I didn't change CRUSH settings at
> > > all.
> > >
> > > > Moreover, one pg is in
> > > > active+remapped+wait_backfill+backfill_toofull
> > >> > state:
> > >> >
> > >> > 2015-01-05 19:39:31.995665 mon.0 [INF] pgmap v3979616: 5832 pgs:
> > >> > 23 active+remapped+wait_backfill, 1
> > >> > active+remapped+wait_backfill+backfill_toofull, 2
> > >> > active+remapped+backfilling, 5805 active+clean, 1
> > >> > active+remapped+backfill_toofull; 11210 GB data, 26174 GB used,
> > >> > 18360 GB / 46906 GB avail; 65246/10590590 objects degraded
> > >> > (0.616%)
> > >> >
> > >> > So at 55.8% disk space utilization ceph is full. That doesn't look
> > >> > good.
> > >> >
> > >> Indeed it doesn

Re: [ceph-users] rbd resize (shrink) taking forever and a day

2015-01-07 Thread Josh Durgin

On 01/06/2015 04:45 PM, Robert LeBlanc wrote:

Seems like a message bus would be nice. Each opener of an RBD could
subscribe for messages on the bus for that RBD. Anytime the map is
modified a message could be put on the bus to update the others. That
opens up a whole other can of worms though.


Rados' watch/notify functions are used as a limited form of this. That's
how rbd can notice that e.g. snapshots are created or disks are resized
online. With the object map code the idea is to funnel all management
operations like that through a single client that's locked the image
for write access (all handled automatically by librbd).

Using watch/notify to coordinate multi-client access would get complex
and inefficient pretty fast, and in general is best left to cephfs
rather than rbd.
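
(If you want to see the watch side of this in action, rados can list the
clients watching an image header. For a format 1 image the header object
is named <imagename>.rbd, so something along the lines of:

rados -p rbd listwatchers myimage.rbd

with the pool and image name adjusted to yours.)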

Josh


On Jan 6, 2015 5:35 PM, "Josh Durgin" <josh.dur...@inktank.com> wrote:

On 01/06/2015 04:19 PM, Robert LeBlanc wrote:

The bitmap certainly sounds like it would help shortcut a lot of
code
that Xiaoxi mentions. Is the idea that the client caches the bitmap
for the RBD so it knows which OSDs to contact (thus saving a
round trip
to the OSD), or only for the OSD to know which objects exist on it's
disk?


It's purely at the rbd level, so librbd caches it and maintains its
consistency. The idea is that since it's kept consistent, librbd can do
things like delete exactly the objects that exist without any
extra communication with the osds. Many things that were
O(size of image) become O(written objects in image).

The only restriction is that keeping the object map consistent requires
a single writer, so this does not work for the rare case of e.g. ocfs2
on top of rbd, where there are multiple clients writing to the same
rbd image at once.

Josh

On Tue, Jan 6, 2015 at 4:19 PM, Josh Durgin
<josh.dur...@inktank.com> wrote:

On 01/06/2015 10:24 AM, Robert LeBlanc wrote:


Can't this be done in parallel? If the OSD doesn't have
an object then
it is a noop and should be pretty quick. The number of
outstanding
operations can be limited to 100 or a 1000 which would
provide a
balance between speed and performance impact if there is
data to be
trimmed. I'm not a big fan of a "--skip-trimming" option
as there is
the potential to leave some orphan objects that may not
be cleaned up
correctly.



Yeah, a --skip-trimming option seems a bit dangerous. This
trimming
actually is parallelized (10 ops at once by default,
changeable via
--rbd-concurrent-management-ops) since dumpling.

What will really help without being dangerous is keeping a
map of
object existence [1]. This will avoid any unnecessary trimming
automatically, and it should be possible to add to existing
images.
It should be in hammer.

Josh

[1] https://github.com/ceph/ceph/pull/2700



On Tue, Jan 6, 2015 at 8:09 AM, Jake Young
<jak3...@gmail.com> wrote:




On Monday, January 5, 2015, Chen, Xiaoxi
<xiaoxi.c...@intel.com> wrote:



When you shrinking the RBD, most of the time was
spent on
librbd/internal.cc::trim_image(), in this
function, client will iterator
all
unnecessary objects(no matter whether it exists)
and delete them.



So in this case,  when Edwin shrinking his RBD
from 650PB to 650GB,
there are[ (650PB * 1024GB/PB -650GB) *
1024MB/GB ] / 4MB/Object =
170,227,200 Objects need to be deleted.That will
definitely take a long
time
since rbd client need to send a delete request
to OSD, OSD need to find
out
the object context and delete(or doesn’t exist
at all). The time needed
to
trim an image is ratio to the size needed to trim.



make another image of the correct size and copy
your VM's file system to
the new image, then delete the old one will  NOT
help in general, just
because delete the old volume will take exa

[ceph-users] Monitors and read/write latency

2015-01-07 Thread Logan Barfield
Do monitors have any impact on read/write latencies?  Everything I've read
says no, but since a client needs to talk to a monitor before reading or
writing to OSDs it would seem like that would introduce some overhead.

I ask for two reasons:
1) We are currently using SSD based OSD nodes for our RBD pools.  These
nodes are connected to our hypervisors over 10Gbit links for VM block
devices.  The rest of the cluster is on 1Gbit links, so the RBD nodes
contact the monitors across 1Gbit instead of 10Gbit.  I'm not sure if this
would degrade performance at all.

2) In a multi-datacenter cluster a client may end up contacting a monitor
located in a remote location (e.g., over a high latency WAN link).  I would
think the client would have to wait for a response from the monitor before
beginning read/write operations on the local OSDs.

I'm not sure exactly what the monitor interactions are.  Do clients only
pull the cluster map from the monitors (then ping it occasionally for
updates), or do clients talk to the monitors any time they write a new
object to determine what placement group / OSDs to write to or read from?
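
For what it's worth, I can see the computed placement for a given object
with something like:

ceph osd map <pool> <object-name>

which prints the PG and the up/acting OSD set, so I assume clients do the
same CRUSH calculation locally once they have the maps, but I'd like to
confirm that.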


Thank You,

Logan Barfield
Tranquil Hosting
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Data recovery after RBD I/O error

2015-01-07 Thread Jérôme Poulin
On Mon, Jan 5, 2015 at 6:59 AM, Austin S Hemmelgarn
 wrote:
> Secondly, I would highly recommend not using ANY non-cluster-aware FS on top
> of a clustered block device like RBD


For my use-case, this is just a single server using the RBD device. No
clustering involved on the BTRFS side of things. However, it was really
useful to take snapshots (just like LVM) before modifying the
filesystem in any way.
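
For what it's worth, a snapshot at the RBD level is also cheap insurance
before attempting any repair (names below are placeholders, and rollback
of course requires the device to be unmapped first):

rbd snap create rbd/myimage@before-repair
# and if the repair attempt makes things worse:
rbd snap rollback rbd/myimage@before-repair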
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Multi-site deployment RBD and Federated Gateways

2015-01-07 Thread Logan Barfield
Hello,

We are currently running a single datacenter Ceph deployment.  Our setup is
as follows:
- 4 HDD OSD nodes (primarily used for RadosGW/Object Storage)
- 2 SSD OSD nodes (used for RBD/VM block devices)
- 3 Monitor daemons running on 3 of the HDD OSD nodes
- The CRUSH rules are set to push all data to the HDD nodes except for the
RBD pool, which uses the SSD nodes.

Our goal is to have OSD nodes in 3 datacenters (US East, US West, Europe).
I'm thinking that we would want the following setup:
- RadosGW instance in each datacenter with geo-dns to direct clients to the
closest one.
- Same OSD configuration as our current location (HDD for RadosGW, SSD for
RBD)
- Separate RBD pool in each datacenter for VM block devices.
- CRUSH rules:
-> RadosGW: 3 replicas, different OSD nodes, at least 1 off-site (e.g., 2
replicas on 2 OSD nodes in one datacenter, 1 replica on 1 OSD node in a
different datacenter).  I don't know if RadosGW is geo-aware enough to do
this efficiently (rough rule sketch just below this list)
-> RBD: 2 replicas across 2 OSD nodes in the same datacenter.
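
As a rough sketch of the RadosGW rule I have in mind (untested; it assumes
datacenter buckets named us-east and us-west already exist in the CRUSH
map, and the ruleset number is arbitrary):

rule radosgw_one_offsite {
        ruleset 3
        type replicated
        min_size 3
        max_size 3
        step take us-east
        step chooseleaf firstn 2 type host
        step emit
        step take us-west
        step chooseleaf firstn 1 type host
        step emit
}

and then point the RadosGW pools at it with something like
'ceph osd pool set .rgw.buckets crush_ruleset 3'.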

From the documentation it looks like the best way to accomplish this would
be to have a separate cluster in each datacenter, then use a federated
RadosGW configuration to keep geo-redundant replicas of objects.  The other
option would be to have one cluster spanning all 3 locations, but since
they would be connected over VPN/WAN links that doesn't seem ideal.

Concerns:
- With a federated configuration it looks like only one zone will be
writable, so if the master zone is on the east coast all of the west coast
clients would be uploading there as well.
- It doesn't appear that there is a way to only have 1 replica sent to the
secondary zone; rather, all data written to the master is replicated to the
secondary (e.g., 3 replicas in each location).  Alternatively with multiple
regions both zones would be read/write, but only metadata would be synced.
- From the documentation I understand that there should be different pools
for each zone, and each cluster will need to have a different name.  Since
our current cluster is in production I don't know how safe it would be to
rename/move pools, or re-name the cluster.  We are using the default "ceph"
cluster name right now because different names add complexity (e.g.,
requiring '--cluster' for all commands), and we noticed in testing that
some of the init scripts don't play well with custom cluster names.

It would seem to me that having a federated configuration would add a lot
of complexity. It wouldn't get us exactly what we'd like for replication
(one offsite copy), and doesn't allow for geo-aware writes.

I've seen a few examples of CRUSH maps that span multiple datacenters.
This would seem to be an easier setup, and would get us closer to what we
want with replication.  My main concerns would be the WAN latency, setting
up site-to-site VPN (which I don't think is necessary for the federated
setup), and how well Ceph would handle losing a connection to one of the
remote sites for a few seconds or minutes.

Is there a recommended deployment for what we want to do, or any reference
guides beyond the official Ceph docs?  I know Ceph is being used for
multi-site deployments, but other than a few blog posts demonstrating
theoretical setups and vague Powerpoint slides I haven't seen any details
on it.  Unfortunately we are a very small company, so consulting with
Inktank/RedHat isn't financially feasible right now.

Any suggestions/insight would be much appreciated.


Thank You,

Logan Barfield
Tranquil Hosting
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com