Re: [ceph-users] Journals on all SSD cluster

2015-01-21 Thread Alexandre DERUMIER
Hi,

From my last benchmark,
I was around 12 iops rand read 4k and 2 iops rand write 4k (3 nodes, with
2 SSD OSDs + journal on SSD, Intel S3500, per node).

My main bottleneck was CPU (2x 4-core 1.4GHz Intel), on both the OSD and
client sides.


Next month I'm going to test my production cluster with bigger nodes
(2x 10-core 3.1GHz): 3 nodes with 6 Intel S3500 1.6TB OSDs per node,
and the same CPU config for the clients.

I'll try to post full benchmark results next month (including qemu-kvm
optimisations).

Regards

Alexandre

- Original message -
From: "Christian Balzer" 
To: "ceph-users" 
Sent: Thursday, 22 January 2015 01:28:58
Subject: Re: [ceph-users] Journals on all SSD cluster

Hello, 

On Wed, 21 Jan 2015 23:28:15 +0100 Sebastien Han wrote: 

> It has been proven that the OSDs can’t take advantage of the SSD, so 
> I’ll probably collocate both journal and osd data. Search in the ML for 
> [Single OSD performance on SSD] Can't go over 3, 2K IOPS 
> 
> You will see that there is no difference in terms of performance between 
> the following: 
> 
> * 1 SSD for journal + 1 SSD for osd data 
> * 1 SSD for both journal and data 
> 
Very, very true. 
And that would also be the case in any future where the Ceph code gets 
closer to leveraging full SSD performance. 

Now where splitting things _may_ make sense would be if you had different 
types of SSDs, like fast and durable DC S3700s versus less durable and 
slower (but really still too fast for Ceph) ones like Samsung 845DC Evo. 
In that case putting the journal on the Intels would double the lifetime 
of the Samsungs, while hardly making a dent on the Intels endurance. 

> What you can do in order to max out your SSD is to run multiple journals 
> and osd data on the same SSD. Something like this gave me more IOPS: 
> 
> * /dev/sda1 ceph journal 
> * /dev/sda2 ceph data 
> * /dev/sda3 ceph journal 
> * /dev/sda4 ceph data 
> 
Yup, the limitations are in the Ceph OSD code right now. 

However a setup like this will of course kill multiple OSDs if a single 
SSD fails, not that it matters all that much with normal CRUSH rules. 

Christian 

> > On 21 Jan 2015, at 04:32, Andrew Thrift  
> > wrote: 
> > 
> > Hi All, 
> > 
> > We have a bunch of shiny new hardware we are ready to configure for an 
> > all SSD cluster. 
> > 
> > I am wondering what are other people doing for their journal 
> > configuration on all SSD clusters ? 
> > 
> > - Separate Journal partition and OSD partition on each SSD 
> > 
> > or 
> > 
> > - Journal on OSD 
> > 
> > 
> > Thanks, 
> > 
> > 
> > 
> > 
> > Andrew 
> > ___ 
> > ceph-users mailing list 
> > ceph-users@lists.ceph.com 
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 
> 
> Cheers. 
>  
> Sébastien Han 
> Cloud Architect 
> 
> "Always give 100%. Unless you're giving blood." 
> 
> Phone: +33 (0)1 49 70 99 72 
> Mail: sebastien@enovance.com 
> Address : 11 bis, rue Roquépine - 75008 Paris 
> Web : www.enovance.com - Twitter : @enovance 
> 


-- 
Christian Balzer Network/Systems Engineer 
ch...@gol.com Global OnLine Japan/Fusion Communications 
http://www.gol.com/ 



Re: [ceph-users] How to do maintenance without falling out of service?

2015-01-21 Thread Luke Kao
Hi David,
What are your pools' size & min_size settings?
In your cluster, you may need to set min_size=1 on all pools before shutting down a server.
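A sketch of the commands involved (the pool name rbd is a placeholder; repeat per pool, and restore the original value after maintenance):

```shell
# Check the current settings, then allow I/O with a single surviving replica.
ceph osd pool get rbd size
ceph osd pool get rbd min_size
ceph osd pool set rbd min_size 1
# ... do the maintenance, then put it back:
ceph osd pool set rbd min_size 2
```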


BR,
Luke
MYCOM-OSI

From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of J David 
[j.david.li...@gmail.com]
Sent: Tuesday, January 20, 2015 12:40 AM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] How to do maintenance without falling out of service?

A couple of weeks ago, we had some involuntary maintenance come up
that required us to briefly turn off one node of a three-node ceph
cluster.

To our surprise, this resulted in failure to write on the VM's on that
ceph cluster, even though we set noout before the maintenance.

This cluster is for bulk storage, it has copies=1 (2 total) and very
large SATA drives.  The OSD tree looks like this:

# id weight type name up/down reweight
-1 127.1 root default
-2 18.16 host f16
0 4.54 osd.0 up 1
1 4.54 osd.1 up 1
2 4.54 osd.2 up 1
3 4.54 osd.3 up 1
-3 54.48 host f17
4 4.54 osd.4 up 1
5 4.54 osd.5 up 1
6 4.54 osd.6 up 1
7 4.54 osd.7 up 1
8 4.54 osd.8 up 1
9 4.54 osd.9 up 1
10 4.54 osd.10 up 1
11 4.54 osd.11 up 1
12 4.54 osd.12 up 1
13 4.54 osd.13 up 1
14 4.54 osd.14 up 1
15 4.54 osd.15 up 1
-4 54.48 host f18
16 4.54 osd.16 up 1
17 4.54 osd.17 up 1
18 4.54 osd.18 up 1
19 4.54 osd.19 up 1
20 4.54 osd.20 up 1
21 4.54 osd.21 up 1
22 4.54 osd.22 up 1
23 4.54 osd.23 up 1
24 4.54 osd.24 up 1
25 4.54 osd.25 up 1
26 4.54 osd.26 up 1
27 4.54 osd.27 up 1

The host that was turned off was f18.  f16 does have a handful of
OSDs, but it is mostly there to provide an odd number of monitors.
The cluster is very lightly used, here is the current status:

cluster e9c32e63-f3eb-4c25-b172-4815ed566ec7
 health HEALTH_OK
 monmap e3: 3 mons at
{f16=192.168.19.216:6789/0,f17=192.168.19.217:6789/0,f18=192.168.19.218:6789/0},
election epoch 28, quorum 0,1,2 f16,f17,f18
 osdmap e1674: 28 osds: 28 up, 28 in
  pgmap v12965109: 1152 pgs, 3 pools, 11139 GB data, 2784 kobjects
22314 GB used, 105 TB / 127 TB avail
1152 active+clean
  client io 38162 B/s wr, 9 op/s

Where did we go wrong last time?  How can we do the same maintenance
to f17 (taking it offline for about 15-30 minutes) without repeating
our mistake?

As it stands, it seems like we have inadvertently created a cluster
with three single points of failure, rather than none.  That has not
been our experience with our other clusters, so we're really confused
at present.

Thanks for any advice!





[ceph-users] rbd loaded 100%

2015-01-21 Thread Никитенко Виталий
Hi!
I have a server (ceph version 0.80.7, 10Gb links) with one pool that is
written to 5 OSDs. I am using an iSCSI target to write some data from another
server to this pool (disk rbd3), and the network speed is near 150 Mbit/sec.
iostat shows the rbd3 device at 100% utilization, but the drives holding the
5 OSDs (sdc sdd sde sdf sdg) are loaded at around 20% each. Does anyone know
why this could be, and which utility I can run to diagnose it?

iostat -x 1

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.80    0.00    1.46    0.71    0.00   96.03

Device:  rrqm/s  wrqm/s   r/s     w/s    rkB/s     wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sda        0.00    0.00  0.00    0.00     0.00      0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00
sdb        0.00    9.00  0.00    6.00     0.00     68.00     22.67      0.00   0.67     0.00     0.67   0.67   0.40
sdc        0.00    2.00  0.00   33.00     0.00   7756.00    470.06      2.76  83.76     0.00    83.76   5.45  18.00
sdd        0.00    0.00  0.00   59.00     0.00   9236.00    313.08      0.57   9.69     0.00     9.69   6.58  38.80
sde        0.00    0.00  0.00   29.00     0.00   5112.00    352.55      0.43  13.93     0.00    13.93   7.03  20.40
sdf        0.00    0.00  0.00   28.00     0.00   4612.00    329.43      0.26   9.14     0.00     9.14   6.57  18.40
sdg        0.00    0.00  0.00   24.00     0.00   4032.00    336.00      0.22   8.67     0.00     8.67   6.67  16.00
rbd0       0.00    0.00  0.00    0.00     0.00      0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00
rbd1       0.00    0.00  0.00    0.00     0.00      0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00
rbd2       0.00    0.00  0.00    0.00     0.00      0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00
rbd3       0.00    0.00  0.00  318.00     0.00  20045.00    126.07      7.28  28.29     0.00    28.29   3.13  99.60


[ceph-users] RGW Enabling non default region on existing cluster - data migration

2015-01-21 Thread Mark Kirkwood
I've been looking at the steps required to enable (say) multi-region 
metadata sync where there is an existing RGW that has been in use (i.e. a 
non-trivial number of buckets and objects) and was set up without any 
region parameters.


Now, given that the existing objects are all in the pools corresponding 
to the default (lack of) region - *not* the new region-prefixed ones - 
is there a migration procedure to get them into the *new* ones?


Cheers

Mark


[ceph-users] Installation of 2 radosgw, ceph username and instance

2015-01-21 Thread Francois Lafont
Hi,

I have a Ceph cluster that works correctly (Firefly on Ubuntu Trusty servers).
I would like to install a radosgw. In fact, I would like to install 2 radosgws:
radosgw-1 and radosgw-2 with a floating IP address to support failover etc.

After reading the doc, I still have a point that is not clear to me.


1. Must I create 2 different ceph accounts, one for each radosgw (one account
for radosgw-1 and one for radosgw-2)? Is that correct?

In this case, what is the good naming?

a. This in ceph.conf?

[client.radosgw-1.gateway]
host = radosgw-1
...
[client.radosgw-2.gateway]
host = radosgw-2
...

b. Or this?

[client.radosgw.gateway-1]
host = radosgw-1
...
[client.radosgw.gateway-2]
host = radosgw-2
...

c. Or maybe we don't care?


2. The meaning of "instance-name" for a radosgw is not clear to me.
To me, it's just a substring of the account name used by a
radosgw server (i.e. "client.xxx.{instance-name}"). But it's
probably more subtle...

What is the meaning of "instance-name"? Why can't we create
a "classical" ceph account like "client.radosgw"?


3. Will it be a problem if a client contacts a radosgw
with the fqdn radosgw.mydom.tld, which resolves to
the floating IP address (instead of radosgw-1.mydom.tld
or radosgw-2.mydom.tld)?

Thanks for your help.

-- 
François Lafont


Re: [ceph-users] Journals on all SSD cluster

2015-01-21 Thread Christian Balzer

Hello,

On Wed, 21 Jan 2015 23:28:15 +0100 Sebastien Han wrote:

> It has been proven that the OSDs can’t take advantage of the SSD, so
> I’ll probably collocate both journal and osd data. Search in the ML for
> [Single OSD performance on SSD] Can't go over 3, 2K IOPS
> 
> You will see that there is no difference in terms of performance between
> the following:
> 
> * 1 SSD for journal + 1 SSD for osd data
> * 1 SSD for both journal and data
> 
Very, very true. 
And that would also be the case in any future where the Ceph code gets
closer to leveraging full SSD performance. 

Now where splitting things _may_ make sense would be if you had different
types of SSDs, like fast and durable DC S3700s versus less durable and
slower (but really still too fast for Ceph) ones like Samsung 845DC Evo.
In that case putting the journal on the Intels would double the lifetime
of the Samsungs, while hardly making a dent on the Intels endurance.
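The doubling follows from the filestore write path: every client write lands first in the journal and then on the data partition, so a colocated SSD absorbs each byte twice. A back-of-the-envelope sketch in Python (the endurance and workload numbers are purely illustrative, not from this thread):

```python
def ssd_lifetime_years(endurance_tbw, client_writes_tb_per_day, write_factor):
    # write_factor = 2.0 when journal and data share the SSD,
    # 1.0 when the journal lives on a separate device.
    return endurance_tbw / (client_writes_tb_per_day * write_factor * 365.0)

# Hypothetical numbers: 600 TBW endurance, 0.5 TB/day of client writes.
colocated = ssd_lifetime_years(600, 0.5, 2.0)  # journal + data on one SSD
split = ssd_lifetime_years(600, 0.5, 1.0)      # journal moved to e.g. a DC S3700
print(round(colocated, 2), round(split, 2))    # -> 1.64 3.29
```

Whatever the absolute numbers, moving the journal off the device always halves the write load on it, hence the "double the lifetime" estimate above.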
  
> What you can do in order to max out your SSD is to run multiple journals
> and osd data on the same SSD. Something like this gave me more IOPS:
> 
> * /dev/sda1 ceph journal
> * /dev/sda2 ceph data
> * /dev/sda3 ceph journal
> * /dev/sda4 ceph data
> 
Yup, the limitations are in the Ceph OSD code right now.

However a setup like this will of course kill multiple OSDs if a single
SSD fails, not that it matters all that much with normal CRUSH rules.

Christian

> > On 21 Jan 2015, at 04:32, Andrew Thrift 
> > wrote:
> > 
> > Hi All,
> > 
> > We have a bunch of shiny new hardware we are ready to configure for an
> > all SSD cluster.
> > 
> > I am wondering what are other people doing for their journal
> > configuration on all SSD clusters ?
> > 
> > - Separate Journal partition and OSD partition on each SSD
> > 
> > or
> > 
> > - Journal on OSD
> > 
> > 
> > Thanks,
> > 
> > 
> > 
> > 
> > Andrew
> 
> 
> Cheers.
> 
> Sébastien Han
> Cloud Architect
> 
> "Always give 100%. Unless you're giving blood."
> 
> Phone: +33 (0)1 49 70 99 72
> Mail: sebastien@enovance.com
> Address : 11 bis, rue Roquépine - 75008 Paris
> Web : www.enovance.com - Twitter : @enovance
> 


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/


Re: [ceph-users] erasure coded pool why ever k>1?

2015-01-21 Thread Don Doerner
Well, look at it this way: with 3X replication, for each TB of data you need 3 
TB of disk.  With (for example) 10+3 EC, you get better protection, and for each 
TB of data you need only 1.3 TB of disk.
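That trade-off is easy to put into numbers; a quick sketch in Python, using the k/m examples from this thread (3x replication behaves like k=1, m=2):

```python
def raw_tb_per_data_tb(k, m):
    # An erasure-coded pool stores k data chunks plus m coding chunks,
    # so each TB of data consumes (k + m) / k TB of raw disk.
    return (k + m) / float(k)

# 3x replication is equivalent to k=1, m=2: survives 2 lost copies, costs 3x raw.
print(raw_tb_per_data_tb(1, 2))   # -> 3.0
# 10+3 EC survives any 3 lost chunks (better protection) at far lower cost.
print(raw_tb_per_data_tb(10, 3))  # -> 1.3
```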

-don-


-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Loic 
Dachary
Sent: 21 January, 2015 15:18
To: Chad William Seys; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] erasure coded pool why ever k>1?



On 21/01/2015 22:42, Chad William Seys wrote:
> Hello all,
>   What reasons would one want k>1?
>   I read that m determines the number of OSD which can fail before 
> loss.  But I don't see explained how to choose k.  Any benefits for choosing 
> k>1?

The size of each chunk is object size / K. If you have K=1 and M=2 it will be 
the same as 3 replicas with none of the advantages ;-)

Cheers

--
Loïc Dachary, Artisan Logiciel Libre



Re: [ceph-users] Journals on all SSD cluster

2015-01-21 Thread Sebastien Han
It has been proven that the OSDs can’t take advantage of the SSD, so I’ll 
probably collocate both journal and osd data.
Search in the ML for [Single OSD performance on SSD] Can't go over 3, 2K IOPS

You will see that there is no difference in terms of performance between the 
following:

* 1 SSD for journal + 1 SSD for osd data
* 1 SSD for both journal and data

What you can do in order to max out your SSD is to run multiple journals and 
osd data on the same SSD. Something like this gave me more IOPS:

* /dev/sda1 ceph journal
* /dev/sda2 ceph data
* /dev/sda3 ceph journal
* /dev/sda4 ceph data

> On 21 Jan 2015, at 04:32, Andrew Thrift  wrote:
> 
> Hi All,
> 
> We have a bunch of shiny new hardware we are ready to configure for an all 
> SSD cluster.
> 
> I am wondering what are other people doing for their journal configuration on 
> all SSD clusters ?
> 
> - Separate Journal partition and OSD partition on each SSD
> 
> or
> 
> - Journal on OSD
> 
> 
> Thanks,
> 
> 
> 
> 
> Andrew


Cheers.

Sébastien Han
Cloud Architect

"Always give 100%. Unless you're giving blood."

Phone: +33 (0)1 49 70 99 72
Mail: sebastien@enovance.com
Address : 11 bis, rue Roquépine - 75008 Paris
Web : www.enovance.com - Twitter : @enovance





[ceph-users] Different flavors of storage?

2015-01-21 Thread Don Doerner
OK, I've set up 'giant' in a single-node cluster, played with a replicated pool 
and an EC pool.  All goes well so far.  Question: I have two different kinds of 
HDD in my server - some fast, 15K RPM SAS drives and some big, slow (5400 RPM!) 
SATA drives.

Right now, I have OSDs on all, and when I created my pool, it got spread over 
all of these drives like peanut butter.

The documentation (e.g., the documentation on cache tiering) hints that it's 
possible to differentiate fast from slow devices, but for the life of me, I 
can't see how to create a pool on specific OSDs.  So it must be done some 
different way...

Can someone please provide a pointer?

 
Regards,

-don-


Re: [ceph-users] erasure coded pool why ever k>1?

2015-01-21 Thread Loic Dachary


On 21/01/2015 22:42, Chad William Seys wrote:
> Hello all,
>   What reasons would one want k>1?
>   I read that m determines the number of OSD which can fail before loss.  But 
> I don't see explained how to choose k.  Any benefits for choosing k>1?

The size of each chunk is object size / K. If you have K=1 and M=2 it will be 
the same as 3 replicas with none of the advantages ;-)

Cheers

-- 
Loïc Dachary, Artisan Logiciel Libre





[ceph-users] inkscope RPMS and DEBS packages

2015-01-21 Thread eric mourgaya
Hi,

 the ceph admin and supervision interface, Inkscope, is now packaged.
RPM and DEB packages are available at:
  https://github.com/inkscope/inkscope-packaging

enjoy it!

-- 
Eric Mourgaya,


Respectons la planete!
Luttons contre la mediocrite!


Re: [ceph-users] CEPHFS with Erasure Coded Pool for Data and Replicated Pool for Meta Data

2015-01-21 Thread Gregory Farnum
You can run CephFS with a caching pool that is backed by an EC pool,
but you can't use just an EC pool for either of them. There are
currently no plans to develop direct EC support; we have some ideas
but the RADOS EC interface is way more limited than the replicated
one, and we have a lot of other things we'd like to get right first.
:)
-Greg
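For reference, the layering described above looks roughly like this on the CLI; the pool names and PG counts are illustrative, and the exact flags may vary by release:

```shell
# Base pool is erasure coded; a replicated pool fronts it as a writeback cache.
ceph osd pool create ecdata 128 128 erasure
ceph osd pool create eccache 128 128
ceph osd tier add ecdata eccache
ceph osd tier cache-mode eccache writeback
ceph osd tier set-overlay ecdata eccache
# CephFS then uses ecdata (via the cache) for data and a replicated pool
# for metadata -- metadata itself cannot live on an EC pool.
```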

On Tue, Jan 20, 2015 at 9:53 PM, Mohamed Pakkeer  wrote:
> Hi Greg,
>
> Thanks for your reply. Can we have mixed pools (EC and replicated) for
> CephFS data and metadata, or do we have to use one kind of pool (EC or
> replicated) for creating CephFS? Also we would like to know when the
> production release of CephFS with an erasure-coded pool will happen. We are
> ready to test a petabyte-scale CephFS cluster with an erasure-coded pool.
>
>
> -Mohammed Pakkeer
>
> On Wed, Jan 21, 2015 at 9:11 AM, Gregory Farnum  wrote:
>>
>> On Tue, Jan 20, 2015 at 5:48 AM, Mohamed Pakkeer 
>> wrote:
>> >
>> > Hi all,
>> >
>> > We are trying to create 2 PB scale Ceph storage cluster for file system
>> > access using erasure coded profiles in giant release. Can we create
>> > Erasure
>> > coded pool (k+m = 10 +3) for data and replicated (4 replicas) pool for
>> > metadata for creating CEPHFS? What are the pros and cons of using two
>> > different pools to create CEPHFS ?
>>
>> It's standard to use separate pools. Unfortunately you can't use EC
>> pools for CephFS right now.
>> -Greg
>
>
>
>
>


[ceph-users] get pool replicated size through api

2015-01-21 Thread wuhaling
Hello.

How can I get a pool's replicated size through the API, not by the command line?
I can't find the answer on the following page:
http://ceph.com/docs/master/rados/api/python/
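For what it's worth, one way to do this with the librados Python bindings is mon_command(), which accepts the same JSON command the CLI sends to the monitors. A sketch under those assumptions (the conffile path and pool name handling are illustrative, not verified against every release):

```python
import json

def build_pool_get_cmd(pool, var='size'):
    # JSON equivalent of `ceph osd pool get <pool> size`.
    return json.dumps({'prefix': 'osd pool get',
                       'pool': pool, 'var': var, 'format': 'json'})

def parse_pool_get(outbuf, var='size'):
    # mon_command() returns the monitor's JSON reply in its output buffer.
    if isinstance(outbuf, bytes):
        outbuf = outbuf.decode('utf-8')
    return json.loads(outbuf)[var]

def pool_replicated_size(pool, conffile='/etc/ceph/ceph.conf'):
    # Needs python-rados and a reachable cluster.
    import rados
    cluster = rados.Rados(conffile=conffile)
    cluster.connect()
    try:
        ret, outbuf, errs = cluster.mon_command(build_pool_get_cmd(pool), b'')
        return parse_pool_get(outbuf)
    finally:
        cluster.shutdown()
```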


Bell


Re: [ceph-users] Rados GW | Multi uploads fail

2015-01-21 Thread Yehuda Sadeh
I think you're hitting issue #10271. It has been fixed, but not in a
formal firefly release yet. You can try picking up the unofficial
firefly branch package off the ceph gitbuilder and testing it.

Yehuda

On Wed, Jan 21, 2015 at 11:37 AM, Castillon de la Cruz, Eddy Gonzalo
 wrote:
>
> Hello Team,
>
> I have a radosgw node and a storage cluster running. I am able to upload a
> single file, but the process fails when I enable the multipart option on
> the client side.  I am using firefly (ceph version 0.80.8).
>
> Attached the debug log. Below  an extract of the  log.
>
>2015-01-21 19:29:28.902082 7efeef7de700 10 failed to
> authorize request
>   2015-01-21 19:29:28.902171 7efeef7de700 2 req
> 19:0.000769:s3:PUT /Test/file1.mp3:put_obj:http status=403
>   2015-01-21 19:29:28.902179 7efeef7de700 1 == req done
> req=0x17d2340 http_status=403 ==
>   2015-01-21 19:29:28.902211 7efeef7de700 20 process_request()
> returned -1
>
>
>
> Also I created a pool named "  according to this link, but the issue
> persists.
>
> http://comments.gmane.org/gmane.comp.file-systems.ceph.user/11367
>
> I hope that someone can help me.
>
> Regards,
> Eddy Castillon
>
>
>
>
>
>


Re: [ceph-users] Behaviour of Ceph while OSDs are down

2015-01-21 Thread Samuel Just
Version?
-Sam

On Tue, Jan 20, 2015 at 9:45 AM, Gregory Farnum  wrote:
> On Tue, Jan 20, 2015 at 2:40 AM, Christian Eichelmann
>  wrote:
>> Hi all,
>>
>> I want to understand what Ceph does if several OSDs are down. First of all,
>> some words about our setup:
>>
>> We have 5 Monitors and 12 OSD Server, each has 60x2TB Disks. These Servers
>> are spread across 4 racks in our datacenter. Every rack holds 3 OSD Server.
>> We have a replication factor of 4 and a crush rule applied that says "step
>> chooseleaf firstn 0 type rack". So, in my opinion, every rack should hold a
>> copy of all the data in our ceph cluster. Is that more or less correct?
>>
>> So, our cluster is in state health OK and I am rebooting one of our OSD
>> servers. That means 60 of 720 OSDs are going down. Since this hardware takes
>> quite some time to boot up, we are using "mon osd down out subtree limit =
>> host" to avoid rebalancing when a whole server goes down. Ceph shows this
>> output of "ceph -s" while the OSDs are down:
>>
>>  health HEALTH_WARN 7 pgs degraded; 1 pgs peering; 7 pgs stuck
>> degraded; 1 pgs stuck inactive; 8 pgs stuck unclean; 7 pgs stuck und
>> ersized; 7 pgs undersized; recovery 623/7420 objects degraded (8.396%);
>> 60/720 in osds are down
>>  monmap e5: 5 mons at
>> {mon-bs01=10.76.28.160:6789/0,mon-bs02=10.76.28.161:6789/0,mon-bs03=10.76.28.162:6789/0,mon-bs04=10.76.28.8:6789/0,mon-bs05=1
>> 0.76.28.9:6789/0}, election epoch 228, quorum 0,1,2,3,4
>> mon-bs04,mon-bs05,mon-bs01,mon-bs02,mon-bs03
>>  osdmap e60390: 720 osds: 660 up, 720 in
>>   pgmap v15427437: 67584 pgs, 2 pools, 7253 MB data, 1855 objects
>> 3948 GB used, 1304 TB / 1308 TB avail
>> 623/7420 objects degraded (8.396%)
>>45356 active+clean
>>1 peering
>>7 active+undersized+degraded
>>
>> The pgs that are degraded and undersized are not a problem, since this
>> behaviour is expected. I am worried about the peering pg (it stays in this
>> state until all osds are up again) since this would cause I/O to hang if I
>> am not mistaken.
>>
>> After the host is back up and all OSDs are up and running again, I see this:
>>
>>  health HEALTH_WARN 2 pgs stuck unclean
>>  monmap e5: 5 mons at
>> {mon-bs01=10.76.28.160:6789/0,mon-bs02=10.76.28.161:6789/0,mon-bs03=10.76.28.162:6789/0,mon-bs04=10.76.28.8:6789/0,mon-bs05=10.76.28.9:6789/0},
>> election epoch 228, quorum 0,1,2,3,4
>> mon-bs04,mon-bs05,mon-bs01,mon-bs02,mon-bs03
>>  osdmap e60461: 720 osds: 720 up, 720 in
>>   pgmap v15427555: 67584 pgs, 2 pools, 7253 MB data, 1855 objects
>> 3972 GB used, 1304 TB / 1308 TB avail
>>2 inactive
>>67582 active+clean
>>
>> Without any interaction, it will stay in this state. I guess these two
>> inactive pgs will also cause I/O to hang? Some more information:
>>
>> ceph health detail
>> HEALTH_WARN 2 pgs stuck unclean
>> pg 9.f765 is stuck unclean for 858.298811, current state inactive, last
>> acting [91,362,484,553]
>> pg 9.ea0f is stuck unclean for 963.441117, current state inactive, last
>> acting [91,233,485,524]
>>
>> I was trying to give osd.91 a kick with "ceph osd down 91"
>>
>> After the osd is back in the cluster:
>> health HEALTH_WARN 3 pgs peering; 54 pgs stuck inactive; 57 pgs stuck
>> unclean
>>
>> So even worse. I decided to take the osd out. The cluster goes back to
>> HEALTH_OK. Bringing the OSD back in, the cluster does some rebalancing,
>> ending with the cluster in an OK state again.
>>
>> That actually happens every time some OSDs go down. I don't
>> understand why the cluster is not able to get back to a healthy state
>> without admin interaction. In a setup with several hundred OSDs it is normal
>> business that some of them go down from time to time. Are there any ideas why
>> this is happening? Right now, we do not have many data in our cluster, so I
>> can do some tests. Any suggestions would be appreciated.
>
> Have you done any digging into the state of the PGs reported as
> peering or inactive or whatever when this pops up? Running pg_query,
> looking at their calculated and acting sets, etc.
>
> I suspect it's more likely you're exposing a reporting bug with stale
> data, rather than actually stuck PGs, but it would take more
> information to check that out.
> -Greg


Re: [ceph-users] 4 GB mon database?

2015-01-21 Thread Gregory Farnum
On Mon, Jan 19, 2015 at 2:48 PM, Brian Rak  wrote:
> A while ago, I ran into this issue: http://tracker.ceph.com/issues/10411
>
> I did manage to solve that by deleting the PGs, however ever since that
> issue my mon databases have been growing indefinitely.  At the moment, I'm
> up to 3404 sst files, totaling 7.4GB of space.
>
> This appears to be causing a significant performance hit to all cluster
> operations.
>
> How can I get Ceph to clean up these files?  I've tried 'ceph tell mon.X
> compact', which had no effect (well, it updated the modification time on a
> lot of files, but they're all still there). I don't see any other obvious
> commands that would help.
>
> I tried running 'ceph-monstore-tool --mon-store-path . --command dump-keys >
> keys' (I have no idea if this is even the right direction), but it
> segfaults:
>
> # ceph-monstore-tool --mon-store-path . --command dump-keys > keys
> ./mon/MonitorDBStore.h: In function 'MonitorDBStore::~MonitorDBStore()'
> thread 7fbea24b2760 time 2015-01-19 17:45:52.015742
> ./mon/MonitorDBStore.h: 630: FAILED assert(!is_open)
>  ceph version 0.87-73-gabdbbd6 (abdbbd6e846727385cf0a1412393bc9759dc0244)
>  1: (MonitorDBStore::~MonitorDBStore()+0x88) [0x4bf3c8]
>  2: (main()+0xdba) [0x4bbe2a]
>  3: (__libc_start_main()+0xfd) [0x3efc21ed5d]
>  4: ceph-monstore-tool() [0x4bad39]
>  NOTE: a copy of the executable, or `objdump -rdS ` is needed to
> interpret this.
> 2015-01-19 17:45:52.015987 7fbea24b2760 -1 ./mon/MonitorDBStore.h: In
> function 'MonitorDBStore::~MonitorDBStore()' thread 7fbea24b2760 time
> 2015-01-19 17:45:52.015742
> ./mon/MonitorDBStore.h: 630: FAILED assert(!is_open)
>
>  ceph version 0.87-73-gabdbbd6 (abdbbd6e846727385cf0a1412393bc9759dc0244)
>  1: (MonitorDBStore::~MonitorDBStore()+0x88) [0x4bf3c8]
>  2: (main()+0xdba) [0x4bbe2a]
>  3: (__libc_start_main()+0xfd) [0x3efc21ed5d]
>  4: ceph-monstore-tool() [0x4bad39]
>  NOTE: a copy of the executable, or `objdump -rdS ` is needed to
> interpret this.
>
> --- begin dump of recent events ---
>-13> 2015-01-19 17:45:46.843470 7fbea24b2760  5 asok(0x3eb1ad0)
> register_command perfcounters_dump hook 0x3eb1a80
>-12> 2015-01-19 17:45:46.843483 7fbea24b2760  5 asok(0x3eb1ad0)
> register_command 1 hook 0x3eb1a80
>-11> 2015-01-19 17:45:46.843486 7fbea24b2760  5 asok(0x3eb1ad0)
> register_command perf dump hook 0x3eb1a80
>-10> 2015-01-19 17:45:46.843491 7fbea24b2760  5 asok(0x3eb1ad0)
> register_command perfcounters_schema hook 0x3eb1a80
> -9> 2015-01-19 17:45:46.843494 7fbea24b2760  5 asok(0x3eb1ad0)
> register_command 2 hook 0x3eb1a80
> -8> 2015-01-19 17:45:46.843496 7fbea24b2760  5 asok(0x3eb1ad0)
> register_command perf schema hook 0x3eb1a80
> -7> 2015-01-19 17:45:46.843498 7fbea24b2760  5 asok(0x3eb1ad0)
> register_command config show hook 0x3eb1a80
> -6> 2015-01-19 17:45:46.843501 7fbea24b2760  5 asok(0x3eb1ad0)
> register_command config set hook 0x3eb1a80
> -5> 2015-01-19 17:45:46.843505 7fbea24b2760  5 asok(0x3eb1ad0)
> register_command config get hook 0x3eb1a80
> -4> 2015-01-19 17:45:46.843508 7fbea24b2760  5 asok(0x3eb1ad0)
> register_command config diff hook 0x3eb1a80
> -3> 2015-01-19 17:45:46.843510 7fbea24b2760  5 asok(0x3eb1ad0)
> register_command log flush hook 0x3eb1a80
> -2> 2015-01-19 17:45:46.843514 7fbea24b2760  5 asok(0x3eb1ad0)
> register_command log dump hook 0x3eb1a80
> -1> 2015-01-19 17:45:46.843516 7fbea24b2760  5 asok(0x3eb1ad0)
> register_command log reopen hook 0x3eb1a80
>  0> 2015-01-19 17:45:52.015987 7fbea24b2760 -1 ./mon/MonitorDBStore.h:
> In function 'MonitorDBStore::~MonitorDBStore()'
>
> It did dump some data (it crashed while printing out pgmap_pg entries)..
> this is a summary of what's in there:
>
> # cat keys | awk '{print $1}' | sort | uniq -c
> 173 auth
>1351 logm
>   3 mdsmap
>   1 mkfs
>   6 monitor
>  22 monmap
>   1 mon_sync
>   95521 osdmap
> 105 osd_metadata
> 595 paxos
> 534 pgmap
>   6 pgmap_meta
> 105 pgmap_osd
>   13121 pgmap_pg

You appear to have 95000 untrimmed osdmaps, which would be... a lot.
That's probably the cause of your store growth.
These should be trimmed (automatically, of course) as long as the
cluster is clean; if it's not, you should get it healthy, and if it is,
then there's a bug in the monitor.
-Greg


Re: [ceph-users] How to do maintenance without falling out of service?

2015-01-21 Thread Gregory Farnum
On Mon, Jan 19, 2015 at 8:40 AM, J David  wrote:
> A couple of weeks ago, we had some involuntary maintenance come up
> that required us to briefly turn off one node of a three-node ceph
> cluster.
>
> To our surprise, this resulted in failure to write on the VM's on that
> ceph cluster, even though we set noout before the maintenance.
>
> This cluster is for bulk storage, it has copies=1 (2 total) and very
> large SATA drives.  The OSD tree looks like this:

2 total? Is that the pool size, you mean?
Depending on how you configured things it's possible that the min_size
is also set to 2, which would be bad for your purposes (it should be
at 1).

But without more information about what the cluster was reporting
during that time we can't tell you more.
-Greg

>
> # id weight type name up/down reweight
> -1 127.1 root default
> -2 18.16 host f16
> 0 4.54 osd.0 up 1
> 1 4.54 osd.1 up 1
> 2 4.54 osd.2 up 1
> 3 4.54 osd.3 up 1
> -3 54.48 host f17
> 4 4.54 osd.4 up 1
> 5 4.54 osd.5 up 1
> 6 4.54 osd.6 up 1
> 7 4.54 osd.7 up 1
> 8 4.54 osd.8 up 1
> 9 4.54 osd.9 up 1
> 10 4.54 osd.10 up 1
> 11 4.54 osd.11 up 1
> 12 4.54 osd.12 up 1
> 13 4.54 osd.13 up 1
> 14 4.54 osd.14 up 1
> 15 4.54 osd.15 up 1
> -4 54.48 host f18
> 16 4.54 osd.16 up 1
> 17 4.54 osd.17 up 1
> 18 4.54 osd.18 up 1
> 19 4.54 osd.19 up 1
> 20 4.54 osd.20 up 1
> 21 4.54 osd.21 up 1
> 22 4.54 osd.22 up 1
> 23 4.54 osd.23 up 1
> 24 4.54 osd.24 up 1
> 25 4.54 osd.25 up 1
> 26 4.54 osd.26 up 1
> 27 4.54 osd.27 up 1
>
> The host that was turned off was f18.  f16 does have a handful of
> OSDs, but it is mostly there to provide an odd number of monitors.
> The cluster is very lightly used, here is the current status:
>
> cluster e9c32e63-f3eb-4c25-b172-4815ed566ec7
>  health HEALTH_OK
>  monmap e3: 3 mons at
> {f16=192.168.19.216:6789/0,f17=192.168.19.217:6789/0,f18=192.168.19.218:6789/0},
> election epoch 28, quorum 0,1,2 f16,f17,f18
>  osdmap e1674: 28 osds: 28 up, 28 in
>   pgmap v12965109: 1152 pgs, 3 pools, 11139 GB data, 2784 kobjects
> 22314 GB used, 105 TB / 127 TB avail
> 1152 active+clean
>   client io 38162 B/s wr, 9 op/s
>
> Where did we go wrong last time?  How can we do the same maintenance
> to f17 (taking it offline for about 15-30 minutes) without repeating
> our mistake?
>
> As it stands, it seems like we have inadvertently created a cluster
> with three single points of failure, rather than none.  That has not
> been our experience with our other clusters, so we're really confused
> at present.
>
> Thanks for any advice!


[ceph-users] erasure coded pool why ever k>1?

2015-01-21 Thread Chad William Seys
Hello all,
  What reasons would one want k>1?
  I read that m determines the number of OSDs which can fail before data loss.  But 
I don't see it explained how to choose k.  Any benefits to choosing k>1?

Thanks!
Chad.
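One benefit of k>1 is raw-space efficiency: an erasure-coded pool stores (k+m)/k bytes of raw data per byte of user data, so for a fixed m, a larger k gives the same failure tolerance at a lower overhead. A quick illustrative calculation:

```python
def ec_overhead(k: int, m: int) -> float:
    # Raw bytes stored per byte of user data in a k+m erasure-coded pool
    return (k + m) / k

# m=2 tolerates two OSD failures in every profile below;
# raising k only shrinks the space overhead.
assert ec_overhead(1, 2) == 3.0    # k=1: same raw cost as 3x replication
assert ec_overhead(4, 2) == 1.5
assert ec_overhead(10, 2) == 1.2
```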


[ceph-users] verifying tiered pool functioning

2015-01-21 Thread Chad William Seys
Hello,
  Could anyone provide a howto on verifying that a tiered pool is working correctly?
E.g.
  A command to watch as PGs migrate from one pool to another?  (Or determine 
which pool a PG is currently in.)
  A command to see how much data is in each pool (a global view of the number of PGs, I 
guess)?

Thanks!
Chad.


Re: [ceph-users] how do I show active ceph configuration

2015-01-21 Thread Sebastien Han
You can use the admin socket:

$ ceph daemon mon.<id> config show

or locally

ceph --admin-daemon /var/run/ceph/ceph-osd.2.asok config show

> On 21 Jan 2015, at 19:46, Robert Fantini  wrote:
> 
> Hello
> 
>  Is there a way to see running / active ceph.conf configuration items?
> 
> kind regards
> Rob Fantini


Cheers.

Sébastien Han
Cloud Architect

"Always give 100%. Unless you're giving blood."

Phone: +33 (0)1 49 70 99 72
Mail: sebastien@enovance.com
Address : 11 bis, rue Roquépine - 75008 Paris
Web : www.enovance.com - Twitter : @enovance





[ceph-users] how do I show active ceph configuration

2015-01-21 Thread Robert Fantini
Hello

 Is there a way to see running / active ceph.conf configuration items?

kind regards
Rob Fantini


Re: [ceph-users] Ceph, LIO, VMWARE anyone?

2015-01-21 Thread Nick Fisk
Hi Jake,

 

Thanks for this. I have been going through it and have a pretty good idea of 
what you are doing now. However, I may be missing something looking through your 
scripts, but I’m still not quite understanding how you are managing to make 
sure locking is happening with the ESXi ATS SCSI command.

 

>From this slide

 

https://wiki.ceph.com/@api/deki/files/38/hammer-ceph-devel-summit-scsi-target-clustering.pdf (page 8)

 

It seems to indicate that for a true active/active setup the two targets need 
to be aware of each other and exchange locking information for it to work 
reliably. I’ve also watched the video from the Ceph developer summit where this 
is discussed, and it seems that Ceph and the kernel need changes to allow this 
locking to be pushed back to the RBD layer so it can be shared. From what I can 
see browsing through the Linux Git repo, these patches haven’t made the 
mainline kernel yet.

 

Can you shed any light on this? As tempting as having active/active is, I’m 
wary about using the configuration until I understand how the locking is 
working and if fringe cases involving multiple ESXi hosts writing to the same 
LUN on different targets could spell disaster.

 

Many thanks,

Nick

 

From: Jake Young [mailto:jak3...@gmail.com] 
Sent: 14 January 2015 16:54
To: Nick Fisk
Cc: Giuseppe Civitella; ceph-users
Subject: Re: [ceph-users] Ceph, LIO, VMWARE anyone?

 

Yes, it's active/active and I found that VMWare can switch from path to path 
with no issues or service impact.

 

  

I posted some config files here: github.com/jak3kaj/misc

 

One set is from my LIO nodes, both the primary and secondary configs so you can 
see what I needed to make unique.  The other set (targets.conf) are from my tgt 
nodes.  They are both 4 LUN configs.

 

Like I said in my previous email, there is no performance difference between 
LIO and tgt.  The only service I'm running on these nodes is a single iscsi 
target instance (either LIO or tgt).

 

Jake

 

On Wed, Jan 14, 2015 at 8:41 AM, Nick Fisk <n...@fisk.me.uk> wrote:

Hi Jake,

 

I can’t remember the exact details, but it was something to do with a potential 
problem when using the pacemaker resource agents. I think it was to do with a 
potential hanging issue when one LUN on a shared target failed and then it 
tried to kill all the other LUNS to fail the target over to another host. This 
then leaves the TCM part of LIO locking the RBD which also can’t fail over.

 

That said I did try multiple LUNS on one target as a test and didn’t experience 
any problems.

 

I’m interested in the way you have your setup configured though. Are you saying 
you effectively have an active/active configuration with a path going to either 
host, or are you failing the iSCSI IP between hosts? If it’s the former, have 
you had any problems with scsi locking/reservations…etc between the two targets?

 

I can see the advantage to that configuration as you reduce/eliminate a lot of 
the troubles I have had with resources failing over.

 

Nick

 

From: Jake Young [mailto:jak3...@gmail.com] 
Sent: 14 January 2015 12:50
To: Nick Fisk
Cc: Giuseppe Civitella; ceph-users
Subject: Re: [ceph-users] Ceph, LIO, VMWARE anyone?

 

Nick,

 

Where did you read that having more than 1 LUN per target causes stability 
problems?

 

I am running 4 LUNs per target. 

 

For HA I'm running two linux iscsi target servers that map the same 4 rbd 
images. The two targets have the same serial numbers, T10 address, etc.  I copy 
the primary's config to the backup and change IPs. This way VMWare thinks they 
are different target IPs on the same host. This has worked very well for me. 

 

One suggestion I have is to try using rbd enabled tgt. The performance is 
equivalent to LIO, but I found it is much better at recovering from a cluster 
outage. I've had LIO lock up the kernel or simply not recognize that the rbd 
images are available; where tgt will eventually present the rbd images again. 

 

I have been slowly adding servers and am expanding my test setup to a 
production setup (nice thing about ceph). I now have 6 OSD hosts with 7 disks 
on each. I'm using the LSI Nytro cache raid controller, so I don't have a 
separate journal and have 40Gb networking. I plan to add another 6 OSD hosts in 
another rack in the next 6 months (and then another 6 next year). I'm doing 3x 
replication, so I want to end up with 3 racks. 

 

Jake

On Wednesday, January 14, 2015, Nick Fisk <n...@fisk.me.uk> wrote:

Hi Giuseppe,

 

I am working on something very similar at the moment. I currently have it 
working on some test hardware but seems to be working reasonably well.

 

I say reasonably as I have had a few instabilities, but these are on the HA 
side; the LIO and RBD side of things has been rock solid so far. The main 
problems I have had seem to be around recovering from failure, with resources 
ending up in a u

Re: [ceph-users] Behaviour of Ceph while OSDs are down

2015-01-21 Thread Christian Eichelmann

Hi Samuel, Hi Gregory,

we are using Giant (0.87).

Sure, I was checking on these PGs. The strange thing was that they 
reported a bad state ("state": "inactive"), but looking at the recovery 
state, everything seemed to be fine. That would point to the mentioned 
bug. Do you have a link to this bug, so I can have a look at it to 
confirm that we are having the same issues?


Here is a pg_query (slightly older and with only 3x replication, so 
don't be confused):

http://pastebin.com/fyC8Qepv

Regards,
Christian

On 01/20/2015 10:57 PM, Samuel Just wrote:

Version?
-Sam

On Tue, Jan 20, 2015 at 9:45 AM, Gregory Farnum  wrote:

On Tue, Jan 20, 2015 at 2:40 AM, Christian Eichelmann
 wrote:

Hi all,

I want to understand what Ceph does if several OSDs are down. First of all,
some words about our setup:

We have 5 Monitors and 12 OSD Server, each has 60x2TB Disks. These Servers
are spread across 4 racks in our datacenter. Every rack holds 3 OSD Server.
We have a replication factor of 4 and a crush rule applied that says "step
chooseleaf firstn 0 type rack". So, in my opinion, every rack should hold a
copy of all the data in our ceph cluster. Is that more or less correct?

So, our cluster is in state health OK and I am rebooting one of our OSD
servers. That means 60 of 720 OSDs are going down. Since this hardware takes
quite some time to boot up, we are using "mon osd down out subtree limit =
host" to avoid rebalancing when a whole server goes down. Ceph show this
output of "ceph -s" while the OSDs are down:

  health HEALTH_WARN 7 pgs degraded; 1 pgs peering; 7 pgs stuck
degraded; 1 pgs stuck inactive; 8 pgs stuck unclean; 7 pgs stuck und
ersized; 7 pgs undersized; recovery 623/7420 objects degraded (8.396%);
60/720 in osds are down
  monmap e5: 5 mons at
{mon-bs01=10.76.28.160:6789/0,mon-bs02=10.76.28.161:6789/0,mon-bs03=10.76.28.162:6789/0,mon-bs04=10.76.28.8:6789/0,mon-bs05=10.76.28.9:6789/0}, election epoch 228, quorum 0,1,2,3,4
mon-bs04,mon-bs05,mon-bs01,mon-bs02,mon-bs03
  osdmap e60390: 720 osds: 660 up, 720 in
   pgmap v15427437: 67584 pgs, 2 pools, 7253 MB data, 1855 objects
 3948 GB used, 1304 TB / 1308 TB avail
 623/7420 objects degraded (8.396%)
45356 active+clean
1 peering
7 active+undersized+degraded

The pgs that are degraded and undersized are not a problem, since this
behaviour is expected. I am worried about the peering pg (it stays in this
state until all osds are up again) since this would cause I/O to hang if I
am not mistaken.

After the host is back up and all OSDs are up and running again, I see this:

  health HEALTH_WARN 2 pgs stuck unclean
  monmap e5: 5 mons at
{mon-bs01=10.76.28.160:6789/0,mon-bs02=10.76.28.161:6789/0,mon-bs03=10.76.28.162:6789/0,mon-bs04=10.76.28.8:6789/0,mon-bs05=10.76.28.9:6789/0},
election epoch 228, quorum 0,1,2,3,4
mon-bs04,mon-bs05,mon-bs01,mon-bs02,mon-bs03
  osdmap e60461: 720 osds: 720 up, 720 in
   pgmap v15427555: 67584 pgs, 2 pools, 7253 MB data, 1855 objects
 3972 GB used, 1304 TB / 1308 TB avail
2 inactive
67582 active+clean

Without any interaction, it will stay in this state. I guess these two
inactive pgs will also cause I/O to hang? Some more information:

ceph health detail
HEALTH_WARN 2 pgs stuck unclean
pg 9.f765 is stuck unclean for 858.298811, current state inactive, last
acting [91,362,484,553]
pg 9.ea0f is stuck unclean for 963.441117, current state inactive, last
acting [91,233,485,524]

I was trying to give osd.91 a kick with "ceph osd down 91"

After the osd is back in the cluster:
health HEALTH_WARN 3 pgs peering; 54 pgs stuck inactive; 57 pgs stuck
unclean

So even worse. I decided to take the osd out. The cluster goes back to
HEALTH_OK. Bringing the OSD back in, the cluster does some rebalancing,
ending with the cluster in an OK state again.

That actually happens every time when there are some OSDs going down. I don't
understand why the cluster is not able to get back to a healthy state
without admin interaction. In a setup with several hundred OSDs it is normal
business that some of them go down from time to time. Are there any ideas why
this is happening? Right now, we do not have many data in our cluster, so I
can do some tests. Any suggestions would be appreciated.

Have you done any digging into the state of the PGs reported as
peering or inactive or whatever when this pops up? Running pg_query,
looking at their calculated and acting sets, etc.

I suspect it's more likely you're exposing a reporting bug with stale
data, rather than actually stuck PGs, but it would take more
information to check that out.
-Greg



Re: [ceph-users] Cache data consistency among multiple RGW instances

2015-01-21 Thread ZHOU Yuan
Greg, Thanks a lot for the education!

Sincerely, Yuan


On Tue, Jan 20, 2015 at 2:37 PM, Gregory Farnum  wrote:
> You don't need to list them anywhere for this to work. They set up the
> necessary communication on their own by making use of watch-notify.
>
> On Mon, Jan 19, 2015 at 6:55 PM ZHOU Yuan  wrote:
>>
>> Thanks Greg, that's an awesome feature I missed. I found some
>> explanation on the watch-notify thing:
>> http://www.slideshare.net/Inktank_Ceph/sweil-librados.
>>
>> Just want to confirm, it looks like I need to list all the RGW
>> instances in ceph.conf, and then these RGW instances will
>> automatically do the cache invalidation if necessary?
>>
>>
>> Sincerely, Yuan
>>
>>
>> On Mon, Jan 19, 2015 at 10:58 PM, Gregory Farnum  wrote:
>> > On Sun, Jan 18, 2015 at 6:40 PM, ZHOU Yuan  wrote:
>> >> Hi list,
>> >>
>> >> I'm trying to understand the RGW cache consistency model. My Ceph
>> >> cluster has multiple RGW instances with HAProxy as the load balancer.
>> >> HAProxy would choose one RGW instance to serve the request(with
>> >> round-robin).
>> >> The question is if RGW cache was enabled, which is the default
>> >> behavior, there seem to be some cache inconsistency issue. e.g.,
>> >> object0 was cached in RGW-0 and RGW-1 at the same time. Sometime later
>> >> it was updated from RGW-0. In this case if the next read was issued to
>> >> RGW-1, the outdated cache would be served out then since RGW-1 wasn't
>> >> aware of the updates. Thus the data would be inconsistent. Is this
>> >> behavior expected or is there anything I missed?
>> >
>> > The RGW instances make use of the watch-notify primitive to keep their
>> > caches consistent. It shouldn't be a problem.
>> > -Greg


Re: [ceph-users] Cache data consistency among multiple RGW instances

2015-01-21 Thread Ashish Chandra
Hi Greg/Zhou,

I have got a similar setup where I have got one HAProxy node and 3 RadosGW
client. I have got rgw cache disabled in my setup.
Earlier I had only one node running RadosGW; there I could see a
difference between inbound and outbound network traffic, sometimes by a
factor of 10. If the traffic received from the OSDs was some 700-800 MB, only
90-100 MB of data was sent back to the client. Can you please guide me on what
could be the reason for that?

Next, I set up one HAProxy node and three RadosGW nodes. I can still
see the difference between outbound and inbound traffic, but it is smaller,
around 100 MB.

Not sure what is happening; is it due to Ceph not supporting
parallelized reads, or something else? Please help.

I am running on CentOS 7, and ceph version is Firefly.



On Tue, Jan 20, 2015 at 10:37 AM, Gregory Farnum  wrote:

> You don't need to list them anywhere for this to work. They set up the
> necessary communication on their own by making use of watch-notify.
> On Mon, Jan 19, 2015 at 6:55 PM ZHOU Yuan  wrote:
>
>> Thanks Greg, that's an awesome feature I missed. I found some
>> explanation on the watch-notify thing:
>> http://www.slideshare.net/Inktank_Ceph/sweil-librados.
>>
>> Just want to confirm, it looks like I need to list all the RGW
>> instances in ceph.conf, and then these RGW instances will
>> automatically do the cache invalidation if necessary?
>>
>>
>> Sincerely, Yuan
>>
>>
>> On Mon, Jan 19, 2015 at 10:58 PM, Gregory Farnum 
>> wrote:
>> > On Sun, Jan 18, 2015 at 6:40 PM, ZHOU Yuan  wrote:
>> >> Hi list,
>> >>
>> >> I'm trying to understand the RGW cache consistency model. My Ceph
>> >> cluster has multiple RGW instances with HAProxy as the load balancer.
>> >> HAProxy would choose one RGW instance to serve the request(with
>> >> round-robin).
>> >> The question is if RGW cache was enabled, which is the default
>> >> behavior, there seem to be some cache inconsistency issue. e.g.,
>> >> object0 was cached in RGW-0 and RGW-1 at the same time. Sometime later
>> >> it was updated from RGW-0. In this case if the next read was issued to
>> >> RGW-1, the outdated cache would be served out then since RGW-1 wasn't
>> >> aware of the updates. Thus the data would be inconsistent. Is this
>> >> behavior expected or is there anything I missed?
>> >
>> > The RGW instances make use of the watch-notify primitive to keep their
>> > caches consistent. It shouldn't be a problem.
>> > -Greg
>>
>


-- 



Thanks and Regards

Ashish Chandra

Openstack Developer, Cloud Engineering


Re: [ceph-users] CEPHFS with Erasure Coded Pool for Data and Replicated Pool for Meta Data

2015-01-21 Thread Mohamed Pakkeer
Hi Greg,

We are planning to create a 3 PB EC-based storage cluster initially. What
would be the recommended hardware configuration for the caching pool?
How many nodes will the cache pool require to cater for the 3 PB storage cluster?
What are the size and network connectivity of each node?

-- Mohammed Pakkeer

On Wed, Jan 21, 2015 at 11:25 AM, Gregory Farnum  wrote:

> You can run CephFS with a caching pool that is backed by an EC pool,
> but you can't use just an EC pool for either of them. There are
> currently no plans to develop direct EC support; we have some ideas
> but the RADOS EC interface is way more limited than the replicated
> one, and we have a lot of other things we'd like to get right first.
> :)
> -Greg
>
> On Tue, Jan 20, 2015 at 9:53 PM, Mohamed Pakkeer 
> wrote:
> > Hi Greg,
> >
> > Thanks for your reply. Can we have mixed pools( EC and replicated) for
> > CephFS data and metadata or we have to use  anyone pool( EC or
> Replicated)
> > for creating CephFS? Also we would like to know, when will the production
> > release of CephFS happen with erasure coded pool ? We are ready to test
> > peta-byte scale CephFS cluster with erasure coded pool.
> >
> >
> > -Mohammed Pakkeer
> >
> > On Wed, Jan 21, 2015 at 9:11 AM, Gregory Farnum 
> wrote:
> >>
> >> On Tue, Jan 20, 2015 at 5:48 AM, Mohamed Pakkeer 
> >> wrote:
> >> >
> >> > Hi all,
> >> >
> >> > We are trying to create 2 PB scale Ceph storage cluster for file
> system
> >> > access using erasure coded profiles in giant release. Can we create
> >> > Erasure
> >> > coded pool (k+m = 10 +3) for data and replicated (4 replicas) pool for
> >> > metadata for creating CEPHFS? What are the pros and cons of using two
> >> > different pools to create CEPHFS ?
> >>
> >> It's standard to use separate pools. Unfortunately you can't use EC
> >> pools for CephFS right now.
> >> -Greg
> >
> >
> >
> >
> >
>


Re: [ceph-users] CEPHFS with Erasure Coded Pool for Data and Replicated Pool for Meta Data

2015-01-21 Thread Gregory Farnum
I've not built such a system myself, so I can't really be sure. The
size and speed of the cache pool would have to depend on how much hot
data you have at a time.
-Greg

On Wed, Jan 21, 2015 at 12:53 AM, Mohamed Pakkeer  wrote:
> Hi Greg,
>
> We are planning to create a 3 PB EC-based storage cluster initially. What
> would be the recommended hardware configuration for the caching pool?
> How many nodes will the cache pool require to cater for the 3 PB storage cluster?
> What are the size and network connectivity of each node?
>
> -- Mohammed Pakkeer
>
> On Wed, Jan 21, 2015 at 11:25 AM, Gregory Farnum  wrote:
>>
>> You can run CephFS with a caching pool that is backed by an EC pool,
>> but you can't use just an EC pool for either of them. There are
>> currently no plans to develop direct EC support; we have some ideas
>> but the RADOS EC interface is way more limited than the replicated
>> one, and we have a lot of other things we'd like to get right first.
>> :)
>> -Greg
>>
>> On Tue, Jan 20, 2015 at 9:53 PM, Mohamed Pakkeer 
>> wrote:
>> > Hi Greg,
>> >
>> > Thanks for your reply. Can we have mixed pools( EC and replicated) for
>> > CephFS data and metadata or we have to use  anyone pool( EC or
>> > Replicated)
>> > for creating CephFS? Also we would like to know, when will the
>> > production
>> > release of CephFS happen with erasure coded pool ? We are ready to test
>> > peta-byte scale CephFS cluster with erasure coded pool.
>> >
>> >
>> > -Mohammed Pakkeer
>> >
>> > On Wed, Jan 21, 2015 at 9:11 AM, Gregory Farnum 
>> > wrote:
>> >>
>> >> On Tue, Jan 20, 2015 at 5:48 AM, Mohamed Pakkeer 
>> >> wrote:
>> >> >
>> >> > Hi all,
>> >> >
>> >> > We are trying to create 2 PB scale Ceph storage cluster for file
>> >> > system
>> >> > access using erasure coded profiles in giant release. Can we create
>> >> > Erasure
>> >> > coded pool (k+m = 10 +3) for data and replicated (4 replicas) pool
>> >> > for
>> >> > metadata for creating CEPHFS? What are the pros and cons of using two
>> >> > different pools to create CEPHFS ?
>> >>
>> >> It's standard to use separate pools. Unfortunately you can't use EC
>> >> pools for CephFS right now.
>> >> -Greg
>> >
>> >
>> >
>> >
>> >
>
>
>
>


Re: [ceph-users] PGs degraded with 3 MONs and 1 OSD node

2015-01-21 Thread Jiri Kanicky

Hi,

BTW, is there a way to achieve redundancy over multiple OSDs in one 
box by changing the CRUSH map?
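One way to do this is to edit the CRUSH rule so that chooseleaf picks the `osd` bucket type instead of `host`, letting replicas land on different OSDs inside one box (at the cost of the box being a single point of failure, as noted below). A sketch of such a rule as it might appear in a decompiled CRUSH map (rule name and ruleset number are illustrative):

```
rule replicated_osd {
    ruleset 1
    type replicated
    min_size 1
    max_size 10
    step take default
    # choose leaves of type "osd" rather than "host", so all replicas
    # may live on one node as long as they sit on different disks
    step chooseleaf firstn 0 type osd
    step emit
}
```

The map can be decompiled and recompiled with `crushtool -d` / `crushtool -c` and injected with `ceph osd setcrushmap -i <file>`.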


Thank you
Jiri

On 20/01/2015 13:37, Jiri Kanicky wrote:

Hi,

Thanks for the reply. That clarifies it. I thought that the redundancy 
can be achieved with multiple OSDs (like multiple disks in RAID) in 
case you don't have more nodes. Obviously the single point of failure 
would be the box.


My current setting is:
osd_pool_default_size = 2

Thank you
Jiri


On 20/01/2015 13:13, Lindsay Mathieson wrote:
You only have one osd node (ceph4). The default replication 
requirement for your pools (size = 3) requires OSDs spread over 
three nodes, so the data can be replicated on three different nodes. 
That will be why your pgs are degraded.


You need to either add more osd nodes or reduce your size setting 
down to the number of osd nodes you have.


Setting your size to 1 would be a bad idea: there would be no 
redundancy in your data at all. Losing one disk would destroy all 
your data.


The command to see your pool size is: 

sudo ceph osd pool get <poolname> size 

Assuming the default setup: 

ceph osd pool get rbd size 
returns: 3 

On 20 January 2015 at 10:51, Jiri Kanicky wrote:


Hi,

I just would like to clarify if I should expect degraded PGs with
11 OSD in one node. I am not sure if a setup with 3 MON and 1 OSD
(11 disks) nodes allows me to have healthy cluster.

$ sudo ceph osd pool create test 512
pool 'test' created

$ sudo ceph status
cluster 4e77327a-118d-450d-ab69-455df6458cd4
 health HEALTH_WARN 512 pgs degraded; 512 pgs stuck unclean;
512 pgs undersized
 monmap e1: 3 mons at
{ceph1=172.16.41.31:6789/0,ceph2=172.16.41.32:6789/0,ceph3=172.16.41.33:6789/0},
election epoch 36, quorum 0,1,2 ceph1,ceph2,ceph3
 osdmap e190: 11 osds: 11 up, 11 in
  pgmap v342: 512 pgs, 1 pools, 0 bytes data, 0 objects
53724 kB used, 9709 GB / 9720 GB avail
 512 active+undersized+degraded

$ sudo ceph osd tree
# idweight  type name   up/down reweight
-1  9.45root default
-2  9.45host ceph4
0   0.45osd.0   up  1
1   0.9 osd.1   up  1
2   0.9 osd.2   up  1
3   0.9 osd.3   up  1
4   0.9 osd.4   up  1
5   0.9 osd.5   up  1
6   0.9 osd.6   up  1
7   0.9 osd.7   up  1
8   0.9 osd.8   up  1
9   0.9 osd.9   up  1
10  0.9 osd.10  up  1


Thank you,
Jiri




--
Lindsay






Re: [ceph-users] Cache data consistency among multiple RGW instances

2015-01-21 Thread ZHOU Yuan
Thanks Greg, that's an awesome feature I missed. I found some
explanation on the watch-notify thing:
http://www.slideshare.net/Inktank_Ceph/sweil-librados.

Just want to confirm, it looks like I need to list all the RGW
instances in ceph.conf, and then these RGW instances will
automatically do the cache invalidation if necessary?


Sincerely, Yuan


On Mon, Jan 19, 2015 at 10:58 PM, Gregory Farnum  wrote:
> On Sun, Jan 18, 2015 at 6:40 PM, ZHOU Yuan  wrote:
>> Hi list,
>>
>> I'm trying to understand the RGW cache consistency model. My Ceph
>> cluster has multiple RGW instances with HAProxy as the load balancer.
>> HAProxy would choose one RGW instance to serve the request(with
>> round-robin).
>> The question is if RGW cache was enabled, which is the default
>> behavior, there seem to be some cache inconsistency issue. e.g.,
>> object0 was cached in RGW-0 and RGW-1 at the same time. Sometime later
>> it was updated from RGW-0. In this case if the next read was issued to
>> RGW-1, the outdated cache would be served out then since RGW-1 wasn't
>> aware of the updates. Thus the data would be inconsistent. Is this
>> behavior expected or is there anything I missed?
>
> The RGW instances make use of the watch-notify primitive to keep their
> caches consistent. It shouldn't be a problem.
> -Greg
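A toy in-process model of that watch-notify pattern (an illustration of the idea only, not the actual librados API): every cache holder registers as a watcher, and an update triggers a notification that invalidates the other holders' stale copies:

```python
class CacheNode:
    """Toy stand-in for an RGW instance watching a shared control object."""
    def __init__(self, name, watchers):
        self.name = name
        self.cache = {}
        self.watchers = watchers
        watchers.append(self)        # register as a watcher

    def update(self, key, value):
        self.cache[key] = value      # write through to our own cache
        # notify: every other watcher drops its (now stale) copy
        for node in self.watchers:
            if node is not self:
                node.cache.pop(key, None)

watchers = []
rgw0, rgw1 = CacheNode("rgw0", watchers), CacheNode("rgw1", watchers)
rgw1.cache["object0"] = "v1"         # rgw1 holds an old copy
rgw0.update("object0", "v2")         # update arrives via rgw0
assert "object0" not in rgw1.cache   # rgw1's stale entry was invalidated
assert rgw0.cache["object0"] == "v2"
```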


Re: [ceph-users] PGs degraded with 3 MONs and 1 OSD node

2015-01-21 Thread Jiri Kanicky

Hi,

Thanks for the reply. That clarifies it. I thought that the redundancy 
can be achieved with multiple OSDs (like multiple disks in RAID) in case 
you don't have more nodes. Obviously the single point of failure would 
be the box.


My current setting is:
osd_pool_default_size = 2

Thank you
Jiri


On 20/01/2015 13:13, Lindsay Mathieson wrote:
You only have one osd node (ceph4). The default replication 
requirement for your pools (size = 3) requires OSDs spread over 
three nodes, so the data can be replicated on three different nodes. 
That will be why your pgs are degraded.


You need to either add more osd nodes or reduce your size setting down 
to the number of osd nodes you have.


Setting your size to 1 would be a bad idea: there would be no 
redundancy in your data at all. Losing one disk would destroy all 
your data.


The command to see your pool size is: 

sudo ceph osd pool get <poolname> size 

Assuming the default setup: 

ceph osd pool get rbd size 
returns: 3 

On 20 January 2015 at 10:51, Jiri Kanicky wrote:


Hi,

I just would like to clarify if I should expect degraded PGs with
11 OSD in one node. I am not sure if a setup with 3 MON and 1 OSD
(11 disks) nodes allows me to have healthy cluster.

$ sudo ceph osd pool create test 512
pool 'test' created

$ sudo ceph status
cluster 4e77327a-118d-450d-ab69-455df6458cd4
 health HEALTH_WARN 512 pgs degraded; 512 pgs stuck unclean;
512 pgs undersized
 monmap e1: 3 mons at
{ceph1=172.16.41.31:6789/0,ceph2=172.16.41.32:6789/0,ceph3=172.16.41.33:6789/0},
election epoch 36, quorum 0,1,2 ceph1,ceph2,ceph3
 osdmap e190: 11 osds: 11 up, 11 in
  pgmap v342: 512 pgs, 1 pools, 0 bytes data, 0 objects
53724 kB used, 9709 GB / 9720 GB avail
 512 active+undersized+degraded

$ sudo ceph osd tree
# idweight  type name   up/down reweight
-1  9.45root default
-2  9.45host ceph4
0   0.45osd.0   up  1
1   0.9 osd.1   up  1
2   0.9 osd.2   up  1
3   0.9 osd.3   up  1
4   0.9 osd.4   up  1
5   0.9 osd.5   up  1
6   0.9 osd.6   up  1
7   0.9 osd.7   up  1
8   0.9 osd.8   up  1
9   0.9 osd.9   up  1
10  0.9 osd.10  up  1


Thank you,
Jiri




--
Lindsay




[ceph-users] Journals on all SSD cluster

2015-01-21 Thread Andrew Thrift
Hi All,

We have a bunch of shiny new hardware we are ready to configure for an all
SSD cluster.

I am wondering what are other people doing for their journal configuration
on all SSD clusters ?

- Seperate Journal partition and OSD partition on each SSD

or

- Journal on OSD


Thanks,




Andrew


Re: [ceph-users] PGs degraded with 3 MONs and 1 OSD node

2015-01-21 Thread Lindsay Mathieson
You only have one osd node (ceph4). The default replication requirement
for your pools (size = 3) requires OSDs spread over three nodes, so the
data can be replicated on three different nodes. That will be why your pgs
are degraded.

You need to either add more osd nodes or reduce your size setting down to
the number of osd nodes you have.

Setting your size to 1 would be a bad idea: there would be no redundancy in
your data at all. Losing one disk would destroy all your data.

The command to see your pool size is:

sudo ceph osd pool get <poolname> size

Assuming the default setup:

ceph osd pool get rbd size
returns: 3

On 20 January 2015 at 10:51, Jiri Kanicky  wrote:

> Hi,
>
> I just would like to clarify if I should expect degraded PGs with 11 OSD
> in one node. I am not sure if a setup with 3 MON and 1 OSD (11 disks) nodes
> allows me to have healthy cluster.
>
> $ sudo ceph osd pool create test 512
> pool 'test' created
>
> $ sudo ceph status
> cluster 4e77327a-118d-450d-ab69-455df6458cd4
>  health HEALTH_WARN 512 pgs degraded; 512 pgs stuck unclean; 512 pgs
> undersized
>  monmap e1: 3 mons at {ceph1=172.16.41.31:6789/0,
> ceph2=172.16.41.32:6789/0,ceph3=172.16.41.33:6789/0}, election epoch 36,
> quorum 0,1,2 ceph1,ceph2,ceph3
>  osdmap e190: 11 osds: 11 up, 11 in
>   pgmap v342: 512 pgs, 1 pools, 0 bytes data, 0 objects
> 53724 kB used, 9709 GB / 9720 GB avail
>  512 active+undersized+degraded
>
> $ sudo ceph osd tree
> # id    weight  type name       up/down reweight
> -1      9.45    root default
> -2      9.45    host ceph4
> 0   0.45osd.0   up  1
> 1   0.9 osd.1   up  1
> 2   0.9 osd.2   up  1
> 3   0.9 osd.3   up  1
> 4   0.9 osd.4   up  1
> 5   0.9 osd.5   up  1
> 6   0.9 osd.6   up  1
> 7   0.9 osd.7   up  1
> 8   0.9 osd.8   up  1
> 9   0.9 osd.9   up  1
> 10  0.9 osd.10  up  1
>
>
> Thank you,
> Jiri



-- 
Lindsay


Re: [ceph-users] MDS aborted after recovery and active, FAILED assert (r >=0)

2015-01-21 Thread Mohd Bazli Ab Karim
Hi all,

Our MDS is still fine today. Thanks everyone!

Regards,
Bazli

-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Mohd Bazli Ab Karim
Sent: Monday, January 19, 2015 11:38 AM
To: John Spray
Cc: ceph-users@lists.ceph.com; ceph-de...@vger.kernel.org
Subject: RE: MDS aborted after recovery and active, FAILED assert (r >=0)

Hi John,

Good shot!
I've increased osd_max_write_size to 1 GB (still smaller than the osd journal
size) and the MDS has now been running fine for an hour.
Now checking whether the fs is still accessible. Will update from time to time.

Thanks again John.

Regards,
Bazli
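For reference, a hedged sketch of applying the change Bazli describes; the 1024 value is illustrative, the setting is expressed in MB (default 90), and it must stay below the OSD journal size:

```shell
# Runtime change across all OSDs (not persistent across restarts)
ceph tell osd.* injectargs '--osd-max-write-size 1024'

# To persist, add to ceph.conf under [osd] and restart the OSDs:
#   osd max write size = 1024
```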


-Original Message-
From: john.sp...@inktank.com [mailto:john.sp...@inktank.com] On Behalf Of John 
Spray
Sent: Friday, January 16, 2015 11:58 PM
To: Mohd Bazli Ab Karim
Cc: ceph-users@lists.ceph.com; ceph-de...@vger.kernel.org
Subject: Re: MDS aborted after recovery and active, FAILED assert (r >=0)

It has just been pointed out to me that you can also workaround this issue on 
your existing system by increasing the osd_max_write_size setting on your OSDs 
(default 90MB) to something higher, but still smaller than your osd journal 
size.  That might get you on a path to having an accessible filesystem before 
you consider an upgrade.

John

On Fri, Jan 16, 2015 at 10:57 AM, John Spray  wrote:
> Hmm, upgrading should help here, as the problematic data structure
> (anchortable) no longer exists in the latest version.  I haven't
> checked, but hopefully we don't try to write it during upgrades.
>
> The bug you're hitting is more or less the same as a similar one we
> have with the sessiontable in the latest ceph, but you won't hit it
> there unless you're very unlucky!
>
> John
>
> On Fri, Jan 16, 2015 at 7:37 AM, Mohd Bazli Ab Karim
>  wrote:
>> Dear Ceph-Users, Ceph-Devel,
>>
>> Apologize me if you get double post of this email.
>>
>> I am running a Ceph cluster (version 0.72.2) with one MDS at the moment (in
>> fact there are 3: 2 are down and only 1 is up).
>> Plus I have one CephFS client mounted to it.
>>
>> Now, the MDS always get aborted after recovery and active for 4 secs.
>> Some parts of the log are as below:
>>
>> -3> 2015-01-15 14:10:28.464706 7fbcc8226700  1 --
>> 10.4.118.21:6800/5390 <== osd.19 10.4.118.32:6821/243161 73 
>> osd_op_re
>> ply(3742 1000240c57e. [create 0~0,setxattr (99)]
>> v56640'1871414 uv1871414 ondisk = 0) v6  221+0+0 (261801329 0 0)
>> 0x
>> 7770bc80 con 0x69c7dc0
>> -2> 2015-01-15 14:10:28.464730 7fbcc8226700  1 --
>> 10.4.118.21:6800/5390 <== osd.18 10.4.118.32:6818/243072 67 
>> osd_op_re
>> ply(3645 107941c. [tmapup 0~0] v56640'1769567 uv1769567
>> ondisk = 0) v6  179+0+0 (3759887079 0 0) 0x7757ec80 con
>> 0x1c6bb00
>> -1> 2015-01-15 14:10:28.464754 7fbcc8226700  1 --
>> 10.4.118.21:6800/5390 <== osd.47 10.4.118.35:6809/8290 79 
>> osd_op_repl
>> y(3419 mds_anchortable [writefull 0~94394932] v0'0 uv0 ondisk = -90
>> (Message too long)) v6  174+0+0 (3942056372 0 0) 0x69f94
>> a00 con 0x1c6b9a0
>>  0> 2015-01-15 14:10:28.471684 7fbcc8226700 -1 mds/MDSTable.cc:
>> In function 'void MDSTable::save_2(int, version_t)' thread 7
>> fbcc8226700 time 2015-01-15 14:10:28.46
>> mds/MDSTable.cc: 83: FAILED assert(r >= 0)
>>
>>  ceph version  ()
>>  1: (MDSTable::save_2(int, unsigned long)+0x325) [0x769e25]
>>  2: (Context::complete(int)+0x9) [0x568d29]
>>  3: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0x1097) [0x7c15d7]
>>  4: (MDS::handle_core_message(Message*)+0x5a0) [0x588900]
>>  5: (MDS::_dispatch(Message*)+0x2f) [0x58908f]
>>  6: (MDS::ms_dispatch(Message*)+0x1e3) [0x58ab93]
>>  7: (DispatchQueue::entry()+0x549) [0x975739]
>>  8: (DispatchQueue::DispatchThread::entry()+0xd) [0x8902dd]
>>  9: (()+0x7e9a) [0x7fbcccb0de9a]
>>  10: (clone()+0x6d) [0x7fbccb4ba3fd]
>>  NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
>> interpret this.
>>
>> Is there any workaround/patch to fix this issue? Let me know if need to see 
>> the log with debug-mds of certain level as well.
>> Any helps would be very much appreciated.
>>
>> Thanks.
>> Bazli
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>> in the body of a message to majord...@vger.kernel.org More majordomo
>> info at  http://vger.kernel.org/majordomo-info.html



[ceph-users] Rados GW | Multi uploads fail

2015-01-21 Thread Castillon de la Cruz, Eddy Gonzalo

Hello Team, 

I have a radosgw node and a storage cluster running. I am able to upload a
single file, but the process fails when I enable the multipart option on the
client side. I am using Firefly (ceph version 0.80.8).

Attached the debug log. Below an extract of the log. 



2015-01-21 19:29:28.902082 7efeef7de700 10 failed to authorize request 
2015-01-21 19:29:28.902171 7efeef7de700 2 req 19:0.000769:s3:PUT 
/Test/file1.mp3:put_obj:http status=403 
2015-01-21 19:29:28.902179 7efeef7de700 1 == req done req=0x17d2340 
http_status=403 == 
2015-01-21 19:29:28.902211 7efeef7de700 20 process_request() returned -1 



Also I created a pool named " according to this link, but the issue persists.

http://comments.gmane.org/gmane.comp.file-systems.ceph.user/11367 

I hope that someone can help me. 

Regards, 
Eddy Castillon 



2015-01-21 19:28:54.347994 7eff33fff700  2 RGWDataChangesLog::ChangesRenewThread: start
2015-01-21 19:29:02.635724 7efef700 20 enqueued request req=0x17be520
2015-01-21 19:29:02.635771 7efef700 20 RGWWQ:
2015-01-21 19:29:02.635774 7efef700 20 req: 0x17be520
2015-01-21 19:29:02.635782 7efef700 10 allocated request req=0x17be870
2015-01-21 19:29:02.635812 7efeeefdd700 20 dequeued request req=0x17be520
2015-01-21 19:29:02.635818 7efeeefdd700 20 RGWWQ: empty
2015-01-21 19:29:02.635913 7efeeefdd700 20 DOCUMENT_ROOT=/var/www
2015-01-21 19:29:02.635918 7efeeefdd700 20 FCGI_ROLE=RESPONDER
2015-01-21 19:29:02.635920 7efeeefdd700 20 GATEWAY_INTERFACE=CGI/1.1
2015-01-21 19:29:02.635922 7efeeefdd700 20 HTTP_AUTHORIZATION=AWS LE790DJCU2H4111X0W93:NRhk5PQ8NsXFFCPr3KFu7duyWSI=
2015-01-21 19:29:02.635923 7efeeefdd700 20 HTTP_HOST=10.128.56.11
2015-01-21 19:29:02.635924 7efeeefdd700 20 HTTP_USER_AGENT=S3 Browser 4-8-7 http://s3browser.com
2015-01-21 19:29:02.635925 7efeeefdd700 20 HTTP_X_AMZ_DATE=Wed, 21 Jan 2015 19:29:01 GMT
2015-01-21 19:29:02.635927 7efeeefdd700 20 PATH=/usr/local/bin:/usr/bin:/bin
2015-01-21 19:29:02.635929 7efeeefdd700 20 QUERY_STRING=acl=
2015-01-21 19:29:02.635930 7efeeefdd700 20 REMOTE_ADDR=10.65.52.31
2015-01-21 19:29:02.635932 7efeeefdd700 20 REMOTE_PORT=24123
2015-01-21 19:29:02.635933 7efeeefdd700 20 REQUEST_METHOD=GET
2015-01-21 19:29:02.635935 7efeeefdd700 20 REQUEST_URI=/Test/?acl=
2015-01-21 19:29:02.635936 7efeeefdd700 20 SCRIPT_FILENAME=/var/www/s3gw.fcgi
2015-01-21 19:29:02.635937 7efeeefdd700 20 SCRIPT_NAME=/Test/
2015-01-21 19:29:02.635938 7efeeefdd700 20 SCRIPT_URI=http://10.128.56.11/Test/
2015-01-21 19:29:02.635939 7efeeefdd700 20 SCRIPT_URL=/Test/
2015-01-21 19:29:02.635941 7efeeefdd700 20 SERVER_ADDR=10.128.56.11
2015-01-21 19:29:02.635942 7efeeefdd700 20 SERVER_ADMIN=dl_runteam_p...@axcess-financial.com
2015-01-21 19:29:02.635943 7efeeefdd700 20 SERVER_NAME=10.128.56.11
2015-01-21 19:29:02.635944 7efeeefdd700 20 SERVER_PORT=80
2015-01-21 19:29:02.635945 7efeeefdd700 20 SERVER_PROTOCOL=HTTP/1.1
2015-01-21 19:29:02.635946 7efeeefdd700 20 SERVER_SIGNATURE=
2015-01-21 19:29:02.635947 7efeeefdd700 20 SERVER_SOFTWARE=Apache/2.2.22 (Ubuntu)
2015-01-21 19:29:02.635949 7efeeefdd700  1 == starting new request req=0x17be520 =
2015-01-21 19:29:02.636009 7efeeefdd700  2 req 13:0.61::GET /Test/::initializing
2015-01-21 19:29:02.636056 7efeeefdd700 10 meta>> HTTP_X_AMZ_DATE
2015-01-21 19:29:02.636099 7efeeefdd700 10 x>> x-amz-date:Wed, 21 Jan 2015 19:29:01 GMT
2015-01-21 19:29:02.636181 7efeeefdd700 10 s->object= s->bucket=Test
2015-01-21 19:29:02.636205 7efeeefdd700  2 req 13:0.000256:s3:GET /Test/::getting op
2015-01-21 19:29:02.636213 7efeeefdd700  2 req 13:0.000265:s3:GET /Test/:get_acls:authorizing
2015-01-21 19:29:02.636515 7efeeefdd700 20 get_obj_state: rctx=0x7eff6570 obj=.users:LE790DJCU2H4111X0W93 state=0x7eff0001b8e8 s->prefetch_data=0
2015-01-21 19:29:02.636585 7efeeefdd700 10 cache get: name=.users+LE790DJCU2H4111X0W93 : hit
2015-01-21 19:29:02.636618 7efeeefdd700 20 get_obj_state: s->obj_tag was set empty
2015-01-21 19:29:02.636662 7efeeefdd700 10 cache get: name=.users+LE790DJCU2H4111X0W93 : hit
2015-01-21 19:29:02.636722 7efeeefdd700 20 get_obj_state: rctx=0x7eff67f0 obj=.users.uid:noc state=0x7eff0001b9b8 s->prefetch_data=0
2015-01-21 19:29:02.636731 7efeeefdd700 10 cache get: name=.users.uid+noc : hit
2015-01-21 19:29:02.636737 7efeeefdd700 20 get_obj_state: s->obj_tag was set empty
2015-01-21 19:29:02.636745 7efeeefdd700 10 cache get: name=.users.uid+noc : hit
2015-01-21 19:29:02.636958 7efeeefdd700 10 get_canon_resource(): dest=/Test/?acl
2015-01-21 19:29:02.636965 7efeeefdd700 10 auth_hdr:
GET



x-amz-date:Wed, 21 Jan 2015 19:29:01 GMT
/Test/?acl
2015-01-21 19:29:02.637131 7efeeefdd700 15 calculated digest=NRhk5PQ8NsXFFCPr3KFu7duyWSI=
2015-01-21 19:29:02.637135 7efeeefdd700 15 auth_sign=NRhk5PQ8NsXFFCPr3KFu7duyWSI=
20

[ceph-users] RGW Unexpectedly high number of objects in .rgw pool

2015-01-21 Thread Mark Kirkwood
We have a cluster running RGW (Giant release). We've noticed that the 
".rgw" pool has an unexpectedly high number of objects:


$ ceph df
...
POOLS:
    NAME                 ID  USED    %USED  MAX AVAIL  OBJECTS
    ...
    .rgw.root            5   840     0      29438G     3
    .rgw.control         6   0       0      29438G     8
    .rgw                 7   9145k   0      29438G     37953
    .rgw.gc              8   0       0      29438G     32
    .users.uid           9   4364    0      29438G     23
    .users               10  22      0      29438G     2
    .users.swift         11  22      0      29438G     2
    .rgw.buckets.index   12  0       0      29438G     13
    .rgw.buckets         13  16536M  0.02   29438G     38091



So there are about the same number of objects in .rgw as there are in
.rgw.buckets. Is that expected? Or are we perhaps having some gc issues
that we are unaware of?


The reason this came up is that we have sized .rgw with only a small
number of PGs. Do we need to rethink that (or does the fact that the
objects are tiny mitigate the need to increase PGs)?
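A hedged sketch of how one might inspect what those objects actually are (read-only commands; pool names as in the listing above):

```shell
# Sample the object names in the .rgw pool; here they are typically
# small bucket metadata/instance entries rather than user data
rados -p .rgw ls | head -20

# Check whether garbage collection has a backlog
radosgw-admin gc list --include-all | head -40
```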


Cheers

Mark


[ceph-users] 4 GB mon database?

2015-01-21 Thread Brian Rak

A while ago, I ran into this issue: http://tracker.ceph.com/issues/10411

I did manage to solve that by deleting the PGs; however, ever since that
issue my mon databases have been growing indefinitely. At the moment,
I'm up to 3404 .sst files, totaling 7.4 GB of space.


This appears to be causing a significant performance hit to all cluster 
operations.


How can I get Ceph to clean up these files?  I've tried 'ceph tell mon.X 
compact', which had no effect (well, it updated the modification time on 
a lot of files, but they're all still there). I don't see any other 
obvious commands that would help.
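For reference, a hedged sketch of the compaction options available here (the mon name `mon.a` and the store path are placeholders):

```shell
# On-demand compaction of a running monitor
ceph tell mon.a compact

# If that has no effect, compaction at startup sometimes helps:
# add to ceph.conf under [mon] and restart the monitor:
#   mon compact on start = true

# Disk usage of the mon store before/after, to verify
du -sh /var/lib/ceph/mon/ceph-a/store.db
```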


I tried running 'ceph-monstore-tool --mon-store-path . --command 
dump-keys > keys' (I have no idea if this is even the right direction), 
but it segfaults:


# ceph-monstore-tool --mon-store-path . --command dump-keys > keys
./mon/MonitorDBStore.h: In function 'MonitorDBStore::~MonitorDBStore()' 
thread 7fbea24b2760 time 2015-01-19 17:45:52.015742

./mon/MonitorDBStore.h: 630: FAILED assert(!is_open)
 ceph version 0.87-73-gabdbbd6 (abdbbd6e846727385cf0a1412393bc9759dc0244)
 1: (MonitorDBStore::~MonitorDBStore()+0x88) [0x4bf3c8]
 2: (main()+0xdba) [0x4bbe2a]
 3: (__libc_start_main()+0xfd) [0x3efc21ed5d]
 4: ceph-monstore-tool() [0x4bad39]
 NOTE: a copy of the executable, or `objdump -rdS ` is 
needed to interpret this.
2015-01-19 17:45:52.015987 7fbea24b2760 -1 ./mon/MonitorDBStore.h: In 
function 'MonitorDBStore::~MonitorDBStore()' thread 7fbea24b2760 time 
2015-01-19 17:45:52.015742

./mon/MonitorDBStore.h: 630: FAILED assert(!is_open)

 ceph version 0.87-73-gabdbbd6 (abdbbd6e846727385cf0a1412393bc9759dc0244)
 1: (MonitorDBStore::~MonitorDBStore()+0x88) [0x4bf3c8]
 2: (main()+0xdba) [0x4bbe2a]
 3: (__libc_start_main()+0xfd) [0x3efc21ed5d]
 4: ceph-monstore-tool() [0x4bad39]
 NOTE: a copy of the executable, or `objdump -rdS ` is 
needed to interpret this.


--- begin dump of recent events ---
   -13> 2015-01-19 17:45:46.843470 7fbea24b2760  5 asok(0x3eb1ad0) 
register_command perfcounters_dump hook 0x3eb1a80
   -12> 2015-01-19 17:45:46.843483 7fbea24b2760  5 asok(0x3eb1ad0) 
register_command 1 hook 0x3eb1a80
   -11> 2015-01-19 17:45:46.843486 7fbea24b2760  5 asok(0x3eb1ad0) 
register_command perf dump hook 0x3eb1a80
   -10> 2015-01-19 17:45:46.843491 7fbea24b2760  5 asok(0x3eb1ad0) 
register_command perfcounters_schema hook 0x3eb1a80
-9> 2015-01-19 17:45:46.843494 7fbea24b2760  5 asok(0x3eb1ad0) 
register_command 2 hook 0x3eb1a80
-8> 2015-01-19 17:45:46.843496 7fbea24b2760  5 asok(0x3eb1ad0) 
register_command perf schema hook 0x3eb1a80
-7> 2015-01-19 17:45:46.843498 7fbea24b2760  5 asok(0x3eb1ad0) 
register_command config show hook 0x3eb1a80
-6> 2015-01-19 17:45:46.843501 7fbea24b2760  5 asok(0x3eb1ad0) 
register_command config set hook 0x3eb1a80
-5> 2015-01-19 17:45:46.843505 7fbea24b2760  5 asok(0x3eb1ad0) 
register_command config get hook 0x3eb1a80
-4> 2015-01-19 17:45:46.843508 7fbea24b2760  5 asok(0x3eb1ad0) 
register_command config diff hook 0x3eb1a80
-3> 2015-01-19 17:45:46.843510 7fbea24b2760  5 asok(0x3eb1ad0) 
register_command log flush hook 0x3eb1a80
-2> 2015-01-19 17:45:46.843514 7fbea24b2760  5 asok(0x3eb1ad0) 
register_command log dump hook 0x3eb1a80
-1> 2015-01-19 17:45:46.843516 7fbea24b2760  5 asok(0x3eb1ad0) 
register_command log reopen hook 0x3eb1a80
 0> 2015-01-19 17:45:52.015987 7fbea24b2760 -1 
./mon/MonitorDBStore.h: In function 'MonitorDBStore::~MonitorDBStore()'


It did dump some data (it crashed while printing out pgmap_pg entries).
This is a summary of what's in there:


# cat keys | awk '{print $1}' | sort | uniq -c
    173 auth
   1351 logm
      3 mdsmap
      1 mkfs
      6 monitor
     22 monmap
      1 mon_sync
  95521 osdmap
    105 osd_metadata
    595 paxos
    534 pgmap
      6 pgmap_meta
    105 pgmap_osd
  13121 pgmap_pg



Re: [ceph-users] CEPHFS with Erasure Coded Pool for Data and Replicated Pool for Meta Data

2015-01-21 Thread Gregory Farnum
On Tue, Jan 20, 2015 at 5:48 AM, Mohamed Pakkeer  wrote:
>
> Hi all,
>
> We are trying to create 2 PB scale Ceph storage cluster for file system
> access using erasure coded profiles in giant release. Can we create Erasure
> coded pool (k+m = 10 +3) for data and replicated (4 replicas) pool for
> metadata for creating CEPHFS? What are the pros and cons of using two
> different pools to create CEPHFS ?

It's standard to use separate pools. Unfortunately you can't use EC
pools for CephFS right now.
-Greg
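For completeness, a hedged sketch of the replicated-pool setup that does work for CephFS at this point (pool names and PG counts are illustrative, not recommendations):

```shell
# Replicated pools for both data and metadata; EC data pools are not
# supported for CephFS here
ceph osd pool create cephfs_data 1024
ceph osd pool create cephfs_metadata 256

# 4 metadata copies, as the poster wanted
ceph osd pool set cephfs_metadata size 4

# Create the filesystem from the two pools
ceph fs new cephfs cephfs_metadata cephfs_data
```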


[ceph-users] How to do maintenance without falling out of service?

2015-01-21 Thread J David
A couple of weeks ago, we had some involuntary maintenance come up
that required us to briefly turn off one node of a three-node ceph
cluster.

To our surprise, this resulted in write failures on the VMs backed by that
Ceph cluster, even though we had set noout before the maintenance.

This cluster is for bulk storage; it has copies=1 (2 total) and very
large SATA drives. The OSD tree looks like this:

# id weight type name up/down reweight
-1 127.1 root default
-2 18.16 host f16
0 4.54 osd.0 up 1
1 4.54 osd.1 up 1
2 4.54 osd.2 up 1
3 4.54 osd.3 up 1
-3 54.48 host f17
4 4.54 osd.4 up 1
5 4.54 osd.5 up 1
6 4.54 osd.6 up 1
7 4.54 osd.7 up 1
8 4.54 osd.8 up 1
9 4.54 osd.9 up 1
10 4.54 osd.10 up 1
11 4.54 osd.11 up 1
12 4.54 osd.12 up 1
13 4.54 osd.13 up 1
14 4.54 osd.14 up 1
15 4.54 osd.15 up 1
-4 54.48 host f18
16 4.54 osd.16 up 1
17 4.54 osd.17 up 1
18 4.54 osd.18 up 1
19 4.54 osd.19 up 1
20 4.54 osd.20 up 1
21 4.54 osd.21 up 1
22 4.54 osd.22 up 1
23 4.54 osd.23 up 1
24 4.54 osd.24 up 1
25 4.54 osd.25 up 1
26 4.54 osd.26 up 1
27 4.54 osd.27 up 1

The host that was turned off was f18.  f16 does have a handful of
OSDs, but it is mostly there to provide an odd number of monitors.
The cluster is very lightly used, here is the current status:

cluster e9c32e63-f3eb-4c25-b172-4815ed566ec7
 health HEALTH_OK
 monmap e3: 3 mons at
{f16=192.168.19.216:6789/0,f17=192.168.19.217:6789/0,f18=192.168.19.218:6789/0},
election epoch 28, quorum 0,1,2 f16,f17,f18
 osdmap e1674: 28 osds: 28 up, 28 in
  pgmap v12965109: 1152 pgs, 3 pools, 11139 GB data, 2784 kobjects
22314 GB used, 105 TB / 127 TB avail
1152 active+clean
  client io 38162 B/s wr, 9 op/s

Where did we go wrong last time?  How can we do the same maintenance
to f17 (taking it offline for about 15-30 minutes) without repeating
our mistake?

As it stands, it seems like we have inadvertently created a cluster
with three single points of failure, rather than none.  That has not
been our experience with our other clusters, so we're really confused
at present.

Thanks for any advice!
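Not an answer from the thread, but a hedged sketch of the usual pre-maintenance checklist for a size=2 cluster like this one (the pool name `rbd` is a placeholder; a frequent cause of blocked writes in this situation is min_size being equal to size, which stops I/O as soon as one of the two replica holders goes down):

```shell
# With size=2, min_size must be 1 for I/O to continue while one
# replica holder is offline
ceph osd pool get rbd min_size
ceph osd pool set rbd min_size 1   # only if it was 2

# Prevent rebalancing during the planned outage
ceph osd set noout

# ... power down the node, do the maintenance, bring it back ...

ceph osd unset noout
```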


Re: [ceph-users] Is it possible to compile and use ceph with Raspberry Pi single-board computers?

2015-01-21 Thread Gregory Farnum
Joao has done it in the past so it's definitely possible, but I
confess I don't know what if anything he had to hack up to make it
work or what's changed since then. ARMv6 is definitely not something
we worry about when adding dependencies. :/
-Greg

On Thu, Jan 15, 2015 at 12:17 AM, Prof. Dr. Christian Baun
 wrote:
> Hi all,
>
> I try to compile and use Ceph on a cluster of Raspberry Pi
> single-board computers with Raspbian as operating system. I tried it
> this way:
>
> wget http://ceph.com/download/ceph-0.91.tar.bz2
> tar -xvjf ceph-0.91.tar.bz2
> cd ceph-0.91
> ./autogen.sh
> ./configure  --without-tcmalloc
> make -j2
>
> But result, I got this error message:
>
> ...
>  CC common/module.lo
>  CXXcommon/Readahead.lo
>  CXXcommon/Cycles.lo
> In file included from common/Cycles.cc:38:0:
> common/Cycles.h:76:2: error: #error No high-precision counter
> available for your OS/arch
> common/Cycles.h: In static member function 'static uint64_t Cycles::rdtsc()':
> common/Cycles.h:78:3: warning: no return statement in function
> returning non-void [-Wreturn-type]
> Makefile:13166: recipe for target 'common/Cycles.lo' failed
> make[3]: *** [common/Cycles.lo] Error 1
> make[3]: *** Waiting for unfinished jobs
> make[3]: Leaving directory '/usr/src/ceph-0.91/src'
> Makefile:17129: recipe for target 'all-recursive' failed
> make[2]: *** [all-recursive] Error 1
> make[2]: Leaving directory '/usr/src/ceph-0.91/src'
> Makefile:6645: recipe for target 'all' failed
> make[1]: *** [all] Error 2
> make[1]: Leaving directory '/usr/src/ceph-0.91/src'
> Makefile:405: recipe for target 'all-recursive' failed
> make: *** [all-recursive] Error 1
>
> Is it possible at all to build and use Ceph on the ARMv6 architecture?
>
> Thanks for any help.
>
> Best Regards
>Christian Baun


Re: [ceph-users] CEPHFS with Erasure Coded Pool for Data and Replicated Pool for Meta Data

2015-01-21 Thread Mohamed Pakkeer
Hi Greg,

Thanks for your reply. Can we mix pool types (EC and replicated) for
CephFS data and metadata, or do we have to use a single pool type (EC or
replicated) for CephFS? Also, we would like to know when a production
release of CephFS with erasure-coded pools will happen. We are ready to
test a petabyte-scale CephFS cluster with an erasure-coded pool.

-Mohammed Pakkeer

On Wed, Jan 21, 2015 at 9:11 AM, Gregory Farnum  wrote:

> On Tue, Jan 20, 2015 at 5:48 AM, Mohamed Pakkeer 
> wrote:
> >
> > Hi all,
> >
> > We are trying to create 2 PB scale Ceph storage cluster for file system
> > access using erasure coded profiles in giant release. Can we create
> Erasure
> > coded pool (k+m = 10 +3) for data and replicated (4 replicas) pool for
> > metadata for creating CEPHFS? What are the pros and cons of using two
> > different pools to create CEPHFS ?
>
> It's standard to use separate pools. Unfortunately you can't use EC
> pools for CephFS right now.
> -Greg
>