Re: [ceph-users] Fast Ceph a Cluster with PB storage

2016-08-16 Thread Christian Balzer

Hello,

On Wed, 17 Aug 2016 09:27:30 +0500 Дробышевский, Владимир wrote:

> Christian,
> 
>   thanks a lot for your time. Please see below.
> 
> 
> 2016-08-17 5:41 GMT+05:00 Christian Balzer :
> 
> >
> > Hello,
> >
> > On Wed, 17 Aug 2016 00:09:14 +0500 Дробышевский, Владимир wrote:
> >
> > >   So demands look like these:
> > >
> > > 1. He has a number of clients which need to periodically write a set of
> > > data as big as 160GB to storage. The acceptable write time is about a
> > > minute for such an amount, so that works out to around 2700-2800MB per
> > > second. Each write session will happen in a dedicated manner.
> >
> > Let me confirm that "dedicated" here means non-concurrent, sequential.
> > So not more than one client at a time, the cluster and network would be
> > good if doing 3GB/s?
> >
> Yes, this is what I meant.
>
That's good to know; it makes that data dump from a single client/server
at least marginally possible without resorting to even more expensive
network infrastructure.
  
> 
> >
> > Note that with IPoIB and QDR 3GB/s is about the best you can hope for,
> > that's with a single client of course.
> >
> I understand, thank you. Alexander doesn't have any setup yet and would
> like to build a cost-effective one (not exactly 'cheap', but with minimal
> costs to satisfy the requirements), so I've recommended QDR IB to him as a
> minimal setup, if they are able to live with used hardware (which is pretty
> cheap in general and would allow an inexpensive multi-port-per-server
> setup with bonding, but is hard to get hold of in Russia), or FDR if it is
> only possible to get new network hardware.
> 
Single link QDR should do the trick.
Bonding via a Linux bondN interface with IPoIB currently only supports
failover (active-backup), not load balancing.
Never mind that load balancing may still not improve bandwidth for a
single client talking to a single target (it would help on a server
talking to Ceph, thus multiple OSD nodes).

There are of course other ways of using 2 interfaces to achieve higher
bandwidth, like using routing to the host. 
But that gets more involved. 
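If you do end up with a bond, a quick sanity check for the mode it is
actually running in (the interface name is just an example):

grep "Bonding Mode" /proc/net/bonding/bond0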

> 
> >
> > >Data read should also be
> > > pretty fast. The written data must be shared after the write.
> > Fast reading might be achieved by these factors:
> > a) lots of RAM, to hold all FS SLAB data and of course page cache.
> > b) splitting writes and reads amongst the pools by using the readforward
> > cache mode, so writes go (primarily, initially) to the SSD cache pool and
> > (cold) reads come from the HDD base pool.
> > c) having a large cache pool.
> >
> > >Clients OS -
> > > Windows.
> > So what server(s) are they writing to?
> > I don't think that Windows RBD port (dokan) is a well tested
> > implementation, besides not being updated for a year or so.
> >
> This is the question I haven't asked (I hope Alexander will read this and
> write me an answer, and I'll answer here), but I believe they use local P3608s
> for this at the moment. The main problem is that P3608s are pretty
> expensive, and a local setup doesn't provide enough reliability, so they
> would like to build a cost-effective, reliable setup with more inexpensive
> drives, while also providing network storage for other data.
> The situation with dokan is exactly what I thought and told Alexander. So
> the only way is to set up intermediate servers, which will significantly
> reduce speed.
> 
I haven't even tried to use Samba or NFS on top of RBD or CephFS, but
given that fio (with direct=1!) gives me the full speed of the OSDs, same
as with a "cp -ar", I'd hope that such file servers wouldn't be
significantly slower than their storage system.
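For reference, the sort of fio run I'm talking about is roughly the
following; path, size and iodepth are placeholders to adapt to your own
setup:

fio --name=seqwrite --filename=/mnt/test/fio.dat --rw=write --bs=4M \
    --size=16G --direct=1 --ioengine=libaio --iodepth=16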

> 
> > > 2. It is necessary to have regular storage as well. He thinks about
> > > 1.2PB of HDD storage with a 34TB SSD cache tier at the moment.
> > >
> > A 34TB cache pool with (at the very least) 2x replication will not be
> > cheap.
> >
> > > The main question with an answer I don't have is how to calculate/predict
> > > per-client write speed for a ceph cluster?
> > This question has been asked before and in fact quite recently, see the
> > very short lived "Ceph performance calculator" thread.
> >
> Thank you, I've found it. I've been following the list for a pretty
> long time but it seems that I missed this discussion.
> 
> 
> >
> > In short, too many variables.
> >
> > >For example, if there will be a
> > > cache tier or even a dedicated SSD-only pool with Intel S3710 or Samsung
> > > SM863 drives - how do we get an approximation of the write speed? Concurrent
> > > writes to 6-8 good SSD drives could probably give such speed, but is it
> > > true for the cluster in general?
> >
> > Since we're looking here at one of the relatively few use cases where
> > bandwidth/throughput is the main factor and not IOPS, this calculation
> > becomes a bit easier and more predictable.
> > For an example, see my recent post:
> > "Better late than never, some XFS versus EXT4 test results"
> >
> Found it too, 

Re: [ceph-users] Testing Ceph cluster for future deployment.

2016-08-16 Thread Christian Balzer

Hello,

On Mon, 15 Aug 2016 14:14:05 +0200 jan hugo prins wrote:

> Hello,
> 
> I'm currently in the phase of testing a Ceph setup to see if it will fit
> our need for a 3 DC storage solution.
> 
The usual warnings about performance (network latency) with a multi DC
Ceph cluster apply.
Do (re-)search the ML archives.

> I installed CentOS 7 with Ceph version 10.2.2.
> 
> I have a few things that I noticed so far:
> 
[no RGW insights from me]
> 
> - I currently have 3 OSD nodes, each with 3 x 1TB SSD drives used as
> OSDs. So in total I have 9 OSD drives. Looking at the documentation this
> would give me a total of 512 PGs. The total number of pools
> that we are going to house on this storage is currently unknown, but I
> have started with the installation of S3 which gives me 12 pools to
> start with, so the pg_num and the pgp_num per pool should be set to 32.
> Is this correct, or am I missing something here? 

That is basically correct; however, you want to allocate more PGs to busy
and large (data) pools and fewer to infrequently used and small pools. 
Again, no insights about RGW from me, but looking at http://ceph.com/pgcalc/
I'd say that most pools will be better off with 16 PGs and the buckets
pool with the remainder.
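If you do move PGs around, keep in mind that pg_num can only ever be
increased, never decreased, and pgp_num has to follow. A rough sketch for
growing the buckets pool (the target value is just an example, run it past
pgcalc first):

ceph osd pool set default.rgw.buckets.data pg_num 256
ceph osd pool set default.rgw.buckets.data pgp_num 256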

>What if I create more
> pools over time and have more than 16 pools? Then my total number of
> PGs is higher than this number of 512. 
You need to balance pools, PGs and OSDs.
The idea here obviously being that a pool that's actually needed will also
consume data/space and thus require more OSDs anyway.

>I already see the message "too
> many PGs per OSD (609 > max 300)" and I could make this warning level
> higher, but where are the limits?
> 
That's way too high and you should not be seeing this if all your 12 pools
have 32 PGs. 
So you probably already have more pools, with a LOT of PGs (like 1500
more than your 12 RGW ones account for).
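A quick way to see which pools those PGs actually belong to:

ceph osd dump | grep "^pool"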

> - I currently have a warning stating the following: pool
> default.rgw.buckets.data has many more objects per pg than average (too
> few pgs?)

See above and pgcalc.

Christian
> Is it possible to spread the buckets in a pure S3 workload on multiple
> pools? Could I make a dedicated pool for a bucket if I expect that
> bucket to be very big, or to make a split between the buckets of
> different customers? Or maybe have different protection levels for
> different buckets?
> 
> - I try to follow the following howto
> (http://cephnotes.ksperis.com/blog/2014/11/28/placement-pools-on-rados-gw)
> on how to put a bucket in a specific placement pool so I can split the data
> of different customers into different pools, but some commands return an error:
> 
> [root@blsceph01-1 ~]# radosgw-admin region get > region.conf.json
> failed to init zonegroup: (2) No such file or directory
> [root@blsceph01-1 ~]# radosgw-admin zone get > zone.conf.json
> unable to initialize zone: (2) No such file or directory
> 
> This could have something to do with the other error radosgw-admin is
> giving me.
> 
> 


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Rakuten Communications
http://www.gol.com/


[ceph-users] radosgw ERROR rgw_bucket_sync_user_stats() for user

2016-08-16 Thread zhu tong
Hi all,

Version: 0.94.7
radosgw has reported the following error:

2016-08-16 15:26:06.883957 7fc2f0bfe700  0 ERROR: rgw_bucket_sync_user_stats() 
for user=user1, 
bucket=2537e61b32ca783432138237f234e610d1ee186e(@{i=.rgw.buckets.index,e=.rgw.buckets.extra}.rgw.buckets[default.4151.167])
 returned -2
2016-08-16 15:26:06.883989 7fc2f0bfe700  0 WARNING: sync_bucket() returned r=-2

Errors like this happen for all of user1's buckets during that time.


Problem description:

user1 and user2 can both use the same bucket as if it were publicly accessible,
while the bucket is not configured that way.


According to https://bugzilla.redhat.com/show_bug.cgi?format=multiple&id=1296130

this seems to be a bug but with a different description.


Thanks.



[ceph-users] Re: Testing Ceph cluster for future deployment.

2016-08-16 Thread zhu tong
[root@blsceph01-1 ~]# radosgw-admin user info --uid=testuser
2016-08-15 12:04:33.290367 7f7bea1f09c0  0 RGWZoneParams::create():
error creating default zone params: (17) File exists

You might not have a radosgw user named testuser. To see a list of users:
radosgw-admin --name client.admin metadata list user

[root@blsceph01-1 ~]# radosgw-admin region get > region.conf.json
failed to init zonegroup: (2) No such file or directory
[root@blsceph01-1 ~]# radosgw-admin zone get > zone.conf.json
unable to initialize zone: (2) No such file or directory

You might not have a zone or region configured matching those parameters.
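In Jewel the old "region" commands were renamed to "zonegroup", so something
like the following may show what is actually defined (just a suggestion, I
have not verified it on your exact version):

radosgw-admin zonegroup list
radosgw-admin zone list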

Other than the above, I have also noticed that not every command's result
matches what its man page says. More often than not, the result is empty: []

Regarding PGs, OSDs and objects, my experience is that once you have built a
cluster with its initial settings (PGs per pool), the more pools you create,
the more PGs you get. Sooner or later you will get a health warning like "too
many PGs per OSD" or "too many objects per xxx", and then adding more OSDs to
the cluster is what helps.

From: ceph-users  on behalf of jan hugo prins 

Sent: 15 August 2016 12:14:05
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Testing Ceph cluster for future deployment.

Hello,

I'm currently in the phase of testing a Ceph setup to see if it will fit
our need for a 3 DC storage solution.

I installed CentOS 7 with Ceph version 10.2.2.

I have a few things that I noticed so far:

- In S3 radosgw-admin I see an error:

[root@blsceph01-1 ~]# radosgw-admin user info --uid=testuser
2016-08-15 12:04:33.290367 7f7bea1f09c0  0 RGWZoneParams::create():
error creating default zone params: (17) File exists

I have found some reference to this error online but this was related to
some upgrade issue (http://tracker.ceph.com/issues/15597).
My install is a fresh install of 10.2.2. I think someone else also
mentioned he saw this error, but I can't find a solution so far.


- I chose not to name my cluster simply "ceph", because we could end up
with multiple clusters in the future, so I named my cluster blsceph01.
During installation I ran into the issue that the cluster wouldn't start,
and I found a hard-coded reference in the systemd files
(/usr/lib/systemd/system/) to the cluster name ceph
(Environment=CLUSTER=ceph); only after changing this to my
cluster name did everything work normally.


- I currently have 3 OSD nodes, each with 3 x 1TB SSD drives used as
OSDs. So in total I have 9 OSD drives. Looking at the documentation this
would give me a total of 512 PGs. The total number of pools
that we are going to house on this storage is currently unknown, but I
have started with the installation of S3 which gives me 12 pools to
start with, so the pg_num and the pgp_num per pool should be set to 32.
Is this correct, or am I missing something here? What if I create more
pools over time and have more than 16 pools? Then my total number of
PGs is higher than this number of 512. I already see the message "too
many PGs per OSD (609 > max 300)" and I could make this warning level
higher, but where are the limits?

- I currently have a warning stating the following: pool
default.rgw.buckets.data has many more objects per pg than average (too
few pgs?)
Is it possible to spread the buckets in a pure S3 workload on multiple
pools? Could I make a dedicated pool for a bucket if I expect that
bucket to be very big, or to make a split between the buckets of
different customers? Or maybe have different protection levels for
different buckets?

- I try to follow the following howto
(http://cephnotes.ksperis.com/blog/2014/11/28/placement-pools-on-rados-gw)
on how to put a bucket in a specific placement pool so I can split the data
of different customers into different pools, but some commands return an error:

[root@blsceph01-1 ~]# radosgw-admin region get > region.conf.json
failed to init zonegroup: (2) No such file or directory
[root@blsceph01-1 ~]# radosgw-admin zone get > zone.conf.json
unable to initialize zone: (2) No such file or directory

This could have something to do with the other error radosgw-admin is
giving me.


--
Met vriendelijke groet / Best regards,

Jan Hugo Prins
Infra and Isilon storage consultant

Better.be B.V.
Auke Vleerstraat 140 E | 7547 AN Enschede | KvK 08097527
T +31 (0) 53 48 00 694 | M +31 (0)6 26 358 951
jpr...@betterbe.com | www.betterbe.com

This e-mail is intended exclusively for the addressee(s), and may not
be passed on to, or made available for use by any person other than
the addressee(s). Better.be B.V. rules out any and every liability
resulting from any electronic transmission.


Re: [ceph-users] Fast Ceph a Cluster with PB storage

2016-08-16 Thread Christian Balzer

Hello,

On Wed, 17 Aug 2016 00:09:14 +0500 Дробышевский, Владимир wrote:

> Dear community,
> 
>   I've had a conversation with Alexander, and he asked me to explain the
> situation and will be very grateful for any advices.
>
Your summary makes it somewhat clearer, but it still leaves some
questions, see below.
 
>   So demands look like these:
> 
> 1. He has a number of clients which need to periodically write a set of
> data as big as 160GB to storage. The acceptable write time is about a
> minute for such an amount, so that works out to around 2700-2800MB per
> second. Each write session will happen in a dedicated manner. 

Let me confirm that "dedicated" here means non-concurrent, sequential.
So not more than one client at a time, the cluster and network would be
good if doing 3GB/s?

Note that with IPoIB and QDR 3GB/s is about the best you can hope for,
that's with a single client of course.

>Data read should also be
> pretty fast. The written data must be shared after the write. 
Fast reading might be achieved by these factors:
a) lots of RAM, to hold all FS SLAB data and of course page cache.
b) splitting writes and reads amongst the pools by using the readforward
cache mode, so writes go (primarily, initially) to the SSD cache pool and
(cold) reads come from the HDD base pool. 
c) having a large cache pool.
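For illustration, the wiring for b) would look roughly like this, assuming a
base pool and a cache pool that already exist (the pool names are
placeholders, and depending on the release readforward may ask for a
--yes-i-really-mean-it):

ceph osd tier add hdd-data ssd-cache
ceph osd tier cache-mode ssd-cache readforward
ceph osd tier set-overlay hdd-data ssd-cache
ceph osd pool set ssd-cache hit_set_type bloom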

>Clients OS -
> Windows.

So what server(s) are they writing to? 
I don't think that Windows RBD port (dokan) is a well tested
implementation, besides not being updated for a year or so.

> 2. It is necessary to have regular storage as well. He thinks about 1.2PB of
> HDD storage with a 34TB SSD cache tier at the moment.
> 
A 34TB cache pool with (at the very least) 2x replication will not be
cheap.

> The main question with an answer I don't have is how to calculate/predict
> per-client write speed for a ceph cluster? 
This question has been asked before and in fact quite recently, see the
very short lived "Ceph performance calculator" thread.

In short, too many variables.
 
>For example, if there will be a
> cache tier or even a dedicated SSD-only pool with Intel S3710 or Samsung
> SM863 drives - how do we get an approximation of the write speed? Concurrent
> writes to 6-8 good SSD drives could probably give such speed, but is it
> true for the cluster in general? 

Since we're looking here at one of the relatively few use cases where
bandwidth/throughput is the main factor and not IOPS, this calculation
becomes a bit easier and more predictable. 
For an example, see my recent post: 
"Better late than never, some XFS versus EXT4 test results" 

Which basically shows that with sufficient network bandwidth all available
drive speed can be utilized.

With fio randwrite and 4MB blocks the above setup gives me 440MB/s and
with 4K blocks 8000 IOPS.
So throughput wise, 100% utilization, full speed present.
IOPS, less than a third (the SSDs are at 33% utilization, the delays are
caused by Ceph and network latencies).
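If you want to get a rough feel for this on a given cluster, rados bench
with 4MB objects is a reasonable stand-in (pool name and runtime are just
examples; --no-cleanup keeps the objects around for the read pass):

rados bench -p rbd 60 write -b 4194304 -t 16 --no-cleanup
rados bench -p rbd 60 seq -t 16
rados -p rbd cleanup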

>3 sets per 8 drives in 13 servers (with an
> additional overhead for the network operations, ACKs and placement
> calculations), QDR or FDR InfiniBand or 40GbE; we know the drive specs; does
> a formula exist to calculate speed expectations from the raw speed
> and/or IOPS point of view?
> 

Let's look at a simplified example:
10 nodes (with fast enough CPU cores to fully utilize those SSDs/NVMes),
40Gb/s (QDR, Ether) interconnects.
Each node with 2 x 1.6TB P3608s, which are rated at 2000MB/s write speed.
Of course journals need to go somewhere, so the effective speed is half
of that.
Thus we get a top speed per node of 2GB/s.
With a replication of 2 we would get a 10GB/s write-capable cluster, with
3 it's down to a theoretical 6.6GB/s. 
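Or, condensed into a back-of-the-envelope formula (numbers from the example
above):

  client write throughput ~= (nodes x per-node drive write speed / 2 for journals) / replication
                           = (10 x 4GB/s / 2) / 2  = 10GB/s   (size=2)
                           = (10 x 4GB/s / 2) / 3 ~= 6.6GB/s  (size=3)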

I'm ignoring the latency and ACK overhead up there, which have a significantly
lower impact on throughput than on IOPS. 

Having a single client or intermediary file server write all that to the
Ceph cluster over a single link is the bit I'd be more worried about.

Christian

> Or, from another side, if there are pre-requisites exist, how to be sure
> the projected cluster meets them? I'm pretty sure it's a typical task, how
> would you solve it?
> 
> Thanks a lot in advance and best regards,
> Vladimir
> 
> 
> Best regards,
> Дробышевский Владимир
> "АйТи Город" company
> +7 343 192
> 
> Hardware and software
> IBM, Microsoft, Eset
> Turnkey project delivery
> IT services outsourcing
> 
> 2016-08-08 19:39 GMT+05:00 Александр Пивушков :
> 
> > Hello dear community!
> > I'm new to Ceph and only recently took up the topic of building
> > clusters.
> > Therefore your opinion is very important to me.
> >
> > It is necessary to create a cluster with 1.2 PB of storage and very rapid
> > access to the data. Earlier, "Intel® SSD DC P3608 Series 1.6TB NVMe
> > PCIe 3.0 x4 Solid State Drive" disks were used; their speed satisfies everyone, but
> > as the volume of storage increases, the price of such a cluster grows very
> > strongly, and therefore there 

Re: [ceph-users] PG is in 'stuck unclean' state, but all acting OSD are up

2016-08-16 Thread Goncalo Borges

Hi Chris

According to the documentation I mentioned previously, if you can get 
osd.116 back on, that should remove the blocking. So it is indeed 
worthwhile to try that before actually marking the osd lost.


I’d like to understand more why the down OSD would cause the PG to get 
stuck after CRUSH was able to locate enough OSD to map the PG.




This is just a hypothesis after checking your pg query results. I would 
say that, once you get osd.116 on, if the pg recovers, the hypothesis is 
correct. If the pg does not recover, the hypothesis is wrong, and there 
is some other problem.


But this is just my 2 cents (as a site admin who recently is dealing 
with a lot of pg issues)



Cheers
Goncalo



Is this some form of safety catch that prevents it from recovering, 
even though OSD.116 is no longer important for data integrity?


Marking the OSD lost is an option here, but it’s not really lost … it 
just takes some time to get a machine rebooted.


I’m still working out my operational procedures for CEPH and marking 
the OSD lost but having it pop back up once the system reboots could 
be an issue that I’m not yet sure how to resolve.


Can an OSD be marked as ‘found’ once it returns to the network?

-Chris

*From: *Goncalo Borges 
*Date: *Monday, August 15, 2016 at 11:36 PM
*To: *"Heller, Chris" , 
"ceph-users@lists.ceph.com" 
*Subject: *Re: [ceph-users] PG is in 'stuck unclean' state, but all 
acting OSD are up


Hi Chris...

The precise osd set you see now [79,8,74] was obtained on epoch 104536,
but this was after a lot of tries, as shown by the recovery section.


Actually, in the first try (on epoch 100767) osd 116 was selected
somehow (maybe it was up at the time?) and the pg probably got stuck
because it went down during the recovery process?


recovery_state": [
{
"name": "Started\/Primary\/Peering\/GetInfo",
"enter_time": "2016-08-11 11:45:06.052568",
"requested_info_from": []
},
{
"name": "Started\/Primary\/Peering",
"enter_time": "2016-08-11 11:45:06.052558",
"past_intervals": [
{
"first": 100767,
"last": 100777,
"maybe_went_rw": 1,
"up": [
79,
116,
74
],
"acting": [
79,
116,
74
],
"primary": 79,
"up_primary": 79
},

The pg query also shows

peering_blocked_by": [
{
"osd": 116,
"current_lost_at": 0,
"comment": "starting or marking this osd lost
may let us proceed"
}

Maybe you can check the documentation in [1] and see if you think you
could follow the suggestion inside the pg query output and mark osd 116 as lost.
This should only be done after proper evaluation on your side.
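Just to make the two options concrete (adjust to your init system, and the
lost command is irreversible, so only run it once you are sure):

# bring osd.116 back once the host is reachable again
systemctl start ceph-osd@116      # systemd based installs
start ceph-osd id=116             # older upstart based installs

# or, if you decide the osd really is gone
ceph osd lost 116 --yes-i-really-mean-it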


Another thing I found strange is that in the recovery section, there 
are a lot of tries where you do not get a proper osd set. The very 
last recovery attempt was on epoch 104540.


{
"first": 104536,
"last": 104540,
"maybe_went_rw": 1,
"up": [
2147483647,
8,
74
],
"acting": [
2147483647,
8,
74
],
"primary": 8,
"up_primary": 8
}

From [2], "When CRUSH fails to find enough OSDs to map to a PG, it 
will show as a 2147483647 which is ITEM_NONE or no OSD found.".


This could be an artifact of the peering being blocked by osd.116, or 
a genuine problem where you are not able to get a proper osd 
set. That could be for a variety of reasons: from network issues, to 
osds being almost full, or simply because the system can't get 3 osds 
in 3 different hosts.
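A few generic checks that usually help to tell these apart (nothing specific
to your cluster assumed):

ceph health detail
ceph osd df tree            # any OSDs near full?
ceph pg dump_stuck unclean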


Cheers

Goncalo

[1] 
http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/#placement-group-down-peering-failure 



[2] 
http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/ 

Re: [ceph-users] Rbd map command doesn't work

2016-08-16 Thread Bruce McFarland
EP,
Try setting the crush map to use legacy tunables. I've had the same issue with
the "feature mismatch" errors when using a krbd that didn't support format 2 and
running Jewel 10.2.2 on the storage nodes. 

From the command line:
ceph osd crush tunables legacy
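You can check what the cluster is currently set to with:

ceph osd crush show-tunables

Depending on how old the client kernels are, the hammer profile may already
be enough and is less of a step back than legacy, but verify that against
your kernel versions first:

ceph osd crush tunables hammer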

Bruce

> On Aug 16, 2016, at 4:21 PM, Somnath Roy  wrote:
> 
> This is the usual feature mismatch stuff; the stock krbd you are using does
> not support Jewel.
> Try googling the error and I am sure you will find a lot of prior
> discussion around it.
>  
> From: EP Komarla [mailto:ep.koma...@flextronics.com] 
> Sent: Tuesday, August 16, 2016 4:15 PM
> To: Somnath Roy; ceph-users@lists.ceph.com
> Subject: RE: Rbd map command doesn't work
>  
> Somnath,
>  
> Thanks.
>  
> I am trying your suggestion.  See the commands below.  Still it doesn’t seem 
> to go.
>  
> I am missing something here…
>  
> Thanks,
>  
> - epk
>  
> =
> [test@ep-c2-client-01 ~]$ rbd create rbd/test1 --size 1G --image-format 1
> rbd: image format 1 is deprecated
> [test@ep-c2-client-01 ~]$ rbd map rbd/test1
> rbd: sysfs write failed
> In some cases useful info is found in syslog - try "dmesg | tail" or so.
> rbd: map failed: (13) Permission denied
> [test@ep-c2-client-01 ~]$ sudo rbd map rbd/test1
> ^C[test@ep-c2-client-01 ~]$
> [test@ep-c2-client-01 ~]$
> [test@ep-c2-client-01 ~]$
> [test@ep-c2-client-01 ~]$
> [test@ep-c2-client-01 ~]$ dmesg|tail -20
> [1201954.248195] libceph: mon0 172.20.60.51:6789 feature set mismatch, my 
> 102b84a842a42 < server's 40102b84a842a42, missing 400
> [1201954.253365] libceph: mon0 172.20.60.51:6789 missing required protocol 
> features
> [1201964.274082] libceph: mon0 172.20.60.51:6789 feature set mismatch, my 
> 102b84a842a42 < server's 40102b84a842a42, missing 400
> [1201964.281195] libceph: mon0 172.20.60.51:6789 missing required protocol 
> features
> [1201974.298195] libceph: mon0 172.20.60.51:6789 feature set mismatch, my 
> 102b84a842a42 < server's 40102b84a842a42, missing 400
> [1201974.305300] libceph: mon0 172.20.60.51:6789 missing required protocol 
> features
> [1204128.917562] libceph: mon0 172.20.60.51:6789 feature set mismatch, my 
> 102b84a842a42 < server's 40102b84a842a42, missing 400
> [1204128.924173] libceph: mon0 172.20.60.51:6789 missing required protocol 
> features
> [1204138.956737] libceph: mon0 172.20.60.51:6789 feature set mismatch, my 
> 102b84a842a42 < server's 40102b84a842a42, missing 400
> [1204138.964011] libceph: mon0 172.20.60.51:6789 missing required protocol 
> features
> [1204148.980701] libceph: mon0 172.20.60.51:6789 feature set mismatch, my 
> 102b84a842a42 < server's 40102b84a842a42, missing 400
> [1204148.987892] libceph: mon0 172.20.60.51:6789 missing required protocol 
> features
> [1204159.004939] libceph: mon2 172.20.60.53:6789 feature set mismatch, my 
> 102b84a842a42 < server's 40102b84a842a42, missing 400
> [1204159.012136] libceph: mon2 172.20.60.53:6789 missing required protocol 
> features
> [1204169.028802] libceph: mon0 172.20.60.51:6789 feature set mismatch, my 
> 102b84a842a42 < server's 40102b84a842a42, missing 400
> [1204169.035992] libceph: mon0 172.20.60.51:6789 missing required protocol 
> features
> [1204476.803192] libceph: mon0 172.20.60.51:6789 feature set mismatch, my 
> 102b84a842a42 < server's 40102b84a842a42, missing 400
> [1204476.810578] libceph: mon0 172.20.60.51:6789 missing required protocol 
> features
> [1204486.821279] libceph: mon0 172.20.60.51:6789 feature set mismatch, my 
> 102b84a842a42 < server's 40102b84a842a42, missing 400
>  
>  
>  
> From: Somnath Roy [mailto:somnath@sandisk.com] 
> Sent: Tuesday, August 16, 2016 3:59 PM
> To: EP Komarla ; ceph-users@lists.ceph.com
> Subject: RE: Rbd map command doesn't work
>  
> > The default format of an rbd image in Jewel is 2, along with a bunch of other
> > features enabled, so you have the following two options:
> >  
> > 1. Create a format 1 image with --image-format 1
>  
> 2. Or, do this in the ceph.conf file [client] or [global] before creating 
> image..
> rbd_default_features = 3
>  
> Thanks & Regards
> Somnath
>  
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of EP 
> Komarla
> Sent: Tuesday, August 16, 2016 2:52 PM
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] Rbd map command doesn't work
>  
> All,
>  
> I am creating an image and mapping it.  The below commands used to work in 
> Hammer, now the same is not working in Jewel.  I see the message about some 
> feature set mismatch – what features are we talking about here?  Is this a 
> known issue in Jewel with a workaround?
>  
> Thanks,
>  
> - epk
>  
> =
>  
>  
> [test@ep-c2-client-01 ~]$  rbd create rbd/test1 --size 1G
> [test@ep-c2-client-01 ~]$ rbd info test1
> rbd image 'test1':
> size 1024 MB in 256 objects
> order 22 

Re: [ceph-users] Rbd map command doesn't work

2016-08-16 Thread Somnath Roy
This is the usual feature mismatch stuff; the stock krbd you are using does not
support Jewel.
Try googling the error and I am sure you will find a lot of prior discussion
around it.

From: EP Komarla [mailto:ep.koma...@flextronics.com]
Sent: Tuesday, August 16, 2016 4:15 PM
To: Somnath Roy; ceph-users@lists.ceph.com
Subject: RE: Rbd map command doesn't work

Somnath,

Thanks.

I am trying your suggestion.  See the commands below.  Still it doesn't seem to 
go.

I am missing something here...

Thanks,

- epk

=
[test@ep-c2-client-01 ~]$ rbd create rbd/test1 --size 1G --image-format 1
rbd: image format 1 is deprecated
[test@ep-c2-client-01 ~]$ rbd map rbd/test1
rbd: sysfs write failed
In some cases useful info is found in syslog - try "dmesg | tail" or so.
rbd: map failed: (13) Permission denied
[test@ep-c2-client-01 ~]$ sudo rbd map rbd/test1
^C[test@ep-c2-client-01 ~]$
[test@ep-c2-client-01 ~]$
[test@ep-c2-client-01 ~]$
[test@ep-c2-client-01 ~]$
[test@ep-c2-client-01 ~]$ dmesg|tail -20
[1201954.248195] libceph: mon0 172.20.60.51:6789 feature set mismatch, my 
102b84a842a42 < server's 40102b84a842a42, missing 400
[1201954.253365] libceph: mon0 172.20.60.51:6789 missing required protocol 
features
[1201964.274082] libceph: mon0 172.20.60.51:6789 feature set mismatch, my 
102b84a842a42 < server's 40102b84a842a42, missing 400
[1201964.281195] libceph: mon0 172.20.60.51:6789 missing required protocol 
features
[1201974.298195] libceph: mon0 172.20.60.51:6789 feature set mismatch, my 
102b84a842a42 < server's 40102b84a842a42, missing 400
[1201974.305300] libceph: mon0 172.20.60.51:6789 missing required protocol 
features
[1204128.917562] libceph: mon0 172.20.60.51:6789 feature set mismatch, my 
102b84a842a42 < server's 40102b84a842a42, missing 400
[1204128.924173] libceph: mon0 172.20.60.51:6789 missing required protocol 
features
[1204138.956737] libceph: mon0 172.20.60.51:6789 feature set mismatch, my 
102b84a842a42 < server's 40102b84a842a42, missing 400
[1204138.964011] libceph: mon0 172.20.60.51:6789 missing required protocol 
features
[1204148.980701] libceph: mon0 172.20.60.51:6789 feature set mismatch, my 
102b84a842a42 < server's 40102b84a842a42, missing 400
[1204148.987892] libceph: mon0 172.20.60.51:6789 missing required protocol 
features
[1204159.004939] libceph: mon2 172.20.60.53:6789 feature set mismatch, my 
102b84a842a42 < server's 40102b84a842a42, missing 400
[1204159.012136] libceph: mon2 172.20.60.53:6789 missing required protocol 
features
[1204169.028802] libceph: mon0 172.20.60.51:6789 feature set mismatch, my 
102b84a842a42 < server's 40102b84a842a42, missing 400
[1204169.035992] libceph: mon0 172.20.60.51:6789 missing required protocol 
features
[1204476.803192] libceph: mon0 172.20.60.51:6789 feature set mismatch, my 
102b84a842a42 < server's 40102b84a842a42, missing 400
[1204476.810578] libceph: mon0 172.20.60.51:6789 missing required protocol 
features
[1204486.821279] libceph: mon0 172.20.60.51:6789 feature set mismatch, my 
102b84a842a42 < server's 40102b84a842a42, missing 400



From: Somnath Roy [mailto:somnath@sandisk.com]
Sent: Tuesday, August 16, 2016 3:59 PM
To: EP Komarla >; 
ceph-users@lists.ceph.com
Subject: RE: Rbd map command doesn't work

The default format of an rbd image in Jewel is 2, along with a bunch of other
features enabled, so you have the following two options:

1. Create a format 1 image with --image-format 1

2. Or, do this in the ceph.conf file [client] or [global] before creating 
image..
rbd_default_features = 3

Thanks & Regards
Somnath

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of EP 
Komarla
Sent: Tuesday, August 16, 2016 2:52 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Rbd map command doesn't work

All,

I am creating an image and mapping it.  The below commands used to work in 
Hammer, now the same is not working in Jewel.  I see the message about some 
feature set mismatch - what features are we talking about here?  Is this a 
known issue in Jewel with a workaround?

Thanks,

- epk

=


[test@ep-c2-client-01 ~]$  rbd create rbd/test1 --size 1G
[test@ep-c2-client-01 ~]$ rbd info test1
rbd image 'test1':
size 1024 MB in 256 objects
order 22 (4096 kB objects)
block_name_prefix: rbd_data.8146238e1f29
format: 2
features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
flags:
[test@ep-c2-client-01 ~]$ rbd map rbd/test1
rbd: sysfs write failed
In some cases useful info is found in syslog - try "dmesg | tail" or so.
rbd: map failed: (13) Permission denied
[test@ep-c2-client-01 ~]$ dmesg|tail
[1197731.547522] libceph: mon1 172.20.60.52:6789 feature set mismatch, my 
102b84a842a42 < server's 

Re: [ceph-users] Rbd map command doesn't work

2016-08-16 Thread EP Komarla
Somnath,

Thanks.

I am trying your suggestion.  See the commands below.  Still it doesn't seem to 
go.

I am missing something here...

Thanks,

- epk

=
[test@ep-c2-client-01 ~]$ rbd create rbd/test1 --size 1G --image-format 1
rbd: image format 1 is deprecated
[test@ep-c2-client-01 ~]$ rbd map rbd/test1
rbd: sysfs write failed
In some cases useful info is found in syslog - try "dmesg | tail" or so.
rbd: map failed: (13) Permission denied
[test@ep-c2-client-01 ~]$ sudo rbd map rbd/test1
^C[test@ep-c2-client-01 ~]$
[test@ep-c2-client-01 ~]$
[test@ep-c2-client-01 ~]$
[test@ep-c2-client-01 ~]$
[test@ep-c2-client-01 ~]$ dmesg|tail -20
[1201954.248195] libceph: mon0 172.20.60.51:6789 feature set mismatch, my 
102b84a842a42 < server's 40102b84a842a42, missing 400
[1201954.253365] libceph: mon0 172.20.60.51:6789 missing required protocol 
features
[1201964.274082] libceph: mon0 172.20.60.51:6789 feature set mismatch, my 
102b84a842a42 < server's 40102b84a842a42, missing 400
[1201964.281195] libceph: mon0 172.20.60.51:6789 missing required protocol 
features
[1201974.298195] libceph: mon0 172.20.60.51:6789 feature set mismatch, my 
102b84a842a42 < server's 40102b84a842a42, missing 400
[1201974.305300] libceph: mon0 172.20.60.51:6789 missing required protocol 
features
[1204128.917562] libceph: mon0 172.20.60.51:6789 feature set mismatch, my 
102b84a842a42 < server's 40102b84a842a42, missing 400
[1204128.924173] libceph: mon0 172.20.60.51:6789 missing required protocol 
features
[1204138.956737] libceph: mon0 172.20.60.51:6789 feature set mismatch, my 
102b84a842a42 < server's 40102b84a842a42, missing 400
[1204138.964011] libceph: mon0 172.20.60.51:6789 missing required protocol 
features
[1204148.980701] libceph: mon0 172.20.60.51:6789 feature set mismatch, my 
102b84a842a42 < server's 40102b84a842a42, missing 400
[1204148.987892] libceph: mon0 172.20.60.51:6789 missing required protocol 
features
[1204159.004939] libceph: mon2 172.20.60.53:6789 feature set mismatch, my 
102b84a842a42 < server's 40102b84a842a42, missing 400
[1204159.012136] libceph: mon2 172.20.60.53:6789 missing required protocol 
features
[1204169.028802] libceph: mon0 172.20.60.51:6789 feature set mismatch, my 
102b84a842a42 < server's 40102b84a842a42, missing 400
[1204169.035992] libceph: mon0 172.20.60.51:6789 missing required protocol 
features
[1204476.803192] libceph: mon0 172.20.60.51:6789 feature set mismatch, my 
102b84a842a42 < server's 40102b84a842a42, missing 400
[1204476.810578] libceph: mon0 172.20.60.51:6789 missing required protocol 
features
[1204486.821279] libceph: mon0 172.20.60.51:6789 feature set mismatch, my 
102b84a842a42 < server's 40102b84a842a42, missing 400



From: Somnath Roy [mailto:somnath@sandisk.com]
Sent: Tuesday, August 16, 2016 3:59 PM
To: EP Komarla ; ceph-users@lists.ceph.com
Subject: RE: Rbd map command doesn't work

The default format of an rbd image in Jewel is 2, along with a bunch of other
features enabled, so you have the following two options:

1. Create a format 1 image with --image-format 1

2. Or, do this in the ceph.conf file [client] or [global] before creating 
image..
rbd_default_features = 3

Thanks & Regards
Somnath

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of EP 
Komarla
Sent: Tuesday, August 16, 2016 2:52 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Rbd map command doesn't work

All,

I am creating an image and mapping it.  The below commands used to work in 
Hammer, now the same is not working in Jewel.  I see the message about some 
feature set mismatch - what features are we talking about here?  Is this a 
known issue in Jewel with a workaround?

Thanks,

- epk

=


[test@ep-c2-client-01 ~]$  rbd create rbd/test1 --size 1G
[test@ep-c2-client-01 ~]$ rbd info test1
rbd image 'test1':
size 1024 MB in 256 objects
order 22 (4096 kB objects)
block_name_prefix: rbd_data.8146238e1f29
format: 2
features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
flags:
[test@ep-c2-client-01 ~]$ rbd map rbd/test1
rbd: sysfs write failed
In some cases useful info is found in syslog - try "dmesg | tail" or so.
rbd: map failed: (13) Permission denied
[test@ep-c2-client-01 ~]$ dmesg|tail
[1197731.547522] libceph: mon1 172.20.60.52:6789 feature set mismatch, my 
102b84a842a42 < server's 40102b84a842a42, missing 400
[1197731.554621] libceph: mon1 172.20.60.52:6789 missing required protocol 
features
[1197741.571645] libceph: mon2 172.20.60.53:6789 feature set mismatch, my 
102b84a842a42 < server's 40102b84a842a42, missing 400
[1197741.578760] libceph: mon2 172.20.60.53:6789 missing required protocol 
features
[1198586.766120] libceph: mon1 172.20.60.52:6789 feature set mismatch, my 
102b84a842a42 < 

Re: [ceph-users] Rbd map command doesn't work

2016-08-16 Thread Somnath Roy
The default format of an rbd image in Jewel is 2, along with a bunch of other
features enabled, so you have the following two options:

1. Create a format 1 image with --image-format 1

2. Or, do this in the ceph.conf file [client] or [global] before creating 
image..
rbd_default_features = 3
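For an image that already exists, the per-image equivalent would be to
disable the extra features, something along these lines (image name taken
from your example):

rbd feature disable rbd/test1 deep-flatten
rbd feature disable rbd/test1 fast-diff
rbd feature disable rbd/test1 object-map
rbd feature disable rbd/test1 exclusive-lock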

Thanks & Regards
Somnath

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of EP 
Komarla
Sent: Tuesday, August 16, 2016 2:52 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Rbd map command doesn't work

All,

I am creating an image and mapping it.  The below commands used to work in 
Hammer, now the same is not working in Jewel.  I see the message about some 
feature set mismatch - what features are we talking about here?  Is this a 
known issue in Jewel with a workaround?

Thanks,

- epk

=


[test@ep-c2-client-01 ~]$  rbd create rbd/test1 --size 1G
[test@ep-c2-client-01 ~]$ rbd info test1
rbd image 'test1':
size 1024 MB in 256 objects
order 22 (4096 kB objects)
block_name_prefix: rbd_data.8146238e1f29
format: 2
features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
flags:
[test@ep-c2-client-01 ~]$ rbd map rbd/test1
rbd: sysfs write failed
In some cases useful info is found in syslog - try "dmesg | tail" or so.
rbd: map failed: (13) Permission denied
[test@ep-c2-client-01 ~]$ dmesg|tail
[1197731.547522] libceph: mon1 172.20.60.52:6789 feature set mismatch, my 
102b84a842a42 < server's 40102b84a842a42, missing 400
[1197731.554621] libceph: mon1 172.20.60.52:6789 missing required protocol 
features
[1197741.571645] libceph: mon2 172.20.60.53:6789 feature set mismatch, my 
102b84a842a42 < server's 40102b84a842a42, missing 400
[1197741.578760] libceph: mon2 172.20.60.53:6789 missing required protocol 
features
[1198586.766120] libceph: mon1 172.20.60.52:6789 feature set mismatch, my 
102b84a842a42 < server's 40102b84a842a42, missing 400
[1198586.771248] libceph: mon1 172.20.60.52:6789 missing required protocol 
features
[1198596.789453] libceph: mon0 172.20.60.51:6789 feature set mismatch, my 
102b84a842a42 < server's 40102b84a842a42, missing 400
[1198596.796557] libceph: mon0 172.20.60.51:6789 missing required protocol 
features
[1198606.813825] libceph: mon1 172.20.60.52:6789 feature set mismatch, my 
102b84a842a42 < server's 40102b84a842a42, missing 400
[1198606.820929] libceph: mon1 172.20.60.52:6789 missing required protocol 
features
[test@ep-c2-client-01 ~]$ sudo rbd map rbd/test1


EP KOMARLA,
Email: ep.koma...@flextronics.com
Address: 677 Gibraltor Ct, Building #2, Milpitas, CA 94035, USA
Phone: 408-674-6090 (mobile)


Legal Disclaimer:
The information contained in this message may be privileged and confidential. 
It is intended to be read only by the individual or entity to whom it is 
addressed or by their designee. If the reader of this message is not the 
intended recipient, you are on notice that any distribution of this message, in 
any form, is strictly prohibited. If you have received this message in error, 
please immediately notify the sender and delete or destroy any copy of this 
message!
PLEASE NOTE: The information contained in this electronic mail message is 
intended only for the use of the designated recipient(s) named above. If the 
reader of this message is not the intended recipient, you are hereby notified 
that you have received this message in error and that any review, 
dissemination, distribution, or copying of this message is strictly prohibited. 
If you have received this communication in error, please notify the sender by 
telephone or e-mail (as shown above) immediately and destroy any and all copies 
of this message in your possession (whether hard copies or electronically 
stored copies).


[ceph-users] Rbd map command doesn't work

2016-08-16 Thread EP Komarla
All,

I am creating an image and mapping it.  The below commands used to work in 
Hammer, now the same is not working in Jewel.  I see the message about some 
feature set mismatch - what features are we talking about here?  Is this a 
known issue in Jewel with a workaround?

Thanks,

- epk

=


[test@ep-c2-client-01 ~]$  rbd create rbd/test1 --size 1G
[test@ep-c2-client-01 ~]$ rbd info test1
rbd image 'test1':
size 1024 MB in 256 objects
order 22 (4096 kB objects)
block_name_prefix: rbd_data.8146238e1f29
format: 2
features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
flags:
[test@ep-c2-client-01 ~]$ rbd map rbd/test1
rbd: sysfs write failed
In some cases useful info is found in syslog - try "dmesg | tail" or so.
rbd: map failed: (13) Permission denied
[test@ep-c2-client-01 ~]$ dmesg|tail
[1197731.547522] libceph: mon1 172.20.60.52:6789 feature set mismatch, my 
102b84a842a42 < server's 40102b84a842a42, missing 400
[1197731.554621] libceph: mon1 172.20.60.52:6789 missing required protocol 
features
[1197741.571645] libceph: mon2 172.20.60.53:6789 feature set mismatch, my 
102b84a842a42 < server's 40102b84a842a42, missing 400
[1197741.578760] libceph: mon2 172.20.60.53:6789 missing required protocol 
features
[1198586.766120] libceph: mon1 172.20.60.52:6789 feature set mismatch, my 
102b84a842a42 < server's 40102b84a842a42, missing 400
[1198586.771248] libceph: mon1 172.20.60.52:6789 missing required protocol 
features
[1198596.789453] libceph: mon0 172.20.60.51:6789 feature set mismatch, my 
102b84a842a42 < server's 40102b84a842a42, missing 400
[1198596.796557] libceph: mon0 172.20.60.51:6789 missing required protocol 
features
[1198606.813825] libceph: mon1 172.20.60.52:6789 feature set mismatch, my 
102b84a842a42 < server's 40102b84a842a42, missing 400
[1198606.820929] libceph: mon1 172.20.60.52:6789 missing required protocol 
features
[test@ep-c2-client-01 ~]$ sudo rbd map rbd/test1


EP KOMARLA,
Email: ep.koma...@flextronics.com
Address: 677 Gibraltor Ct, Building #2, Milpitas, CA 94035, USA
Phone: 408-674-6090 (mobile)


Legal Disclaimer:
The information contained in this message may be privileged and confidential. 
It is intended to be read only by the individual or entity to whom it is 
addressed or by their designee. If the reader of this message is not the 
intended recipient, you are on notice that any distribution of this message, in 
any form, is strictly prohibited. If you have received this message in error, 
please immediately notify the sender and delete or destroy any copy of this 
message!


Re: [ceph-users] Fast Ceph a Cluster with PB storage

2016-08-16 Thread Дробышевский, Владимир
Dear community,

  I've had a conversation with Alexander, and he asked me to explain the
situation and will be very grateful for any advices.

  So demands look like these:

1. He has a number of clients which need to periodically write a set of
data as big as 160GB to storage. The acceptable write time is about a
minute for such an amount, so that works out to around 2700-2800MB per
second. Each write session will happen in a dedicated manner. Data read should also be
pretty fast. The written data must be shared after the write. Clients OS -
Windows.
2. It is necessary to have regular storage as well. He thinks about 1.2PB of
HDD storage with a 34TB SSD cache tier at the moment.

The main question with an answer I don't have is how to calculate/predict
per-client write speed for a ceph cluster? For example, if there will be a
cache tier or even a dedicated SSD-only pool with Intel S3710 or Samsung
SM863 drives - how do we get an approximation of the write speed? Concurrent
writes to 6-8 good SSD drives could probably give such speed, but is it
true for the cluster in general? 3 sets per 8 drives in 13 servers (with an
additional overhead for the network operations, ACKs and placement
calculations), QDR or FDR InfiniBand or 40GbE; we know the drive specs; does
a formula exist to calculate speed expectations from the raw speed
and/or IOPS point of view?

Or, from another side, if there are pre-requisites exist, how to be sure
the projected cluster meets them? I'm pretty sure it's a typical task, how
would you solve it?

Thanks a lot in advance and best regards,
Vladimir


Best regards,
Дробышевский Владимир
"АйТи Город" company
+7 343 192

Hardware and software
IBM, Microsoft, Eset
Turnkey project delivery
IT services outsourcing

2016-08-08 19:39 GMT+05:00 Александр Пивушков :

> Hello dear community!
> I'm new to Ceph and only recently took up the topic of building
> clusters.
> Therefore your opinion is very important to me.
>
> It is necessary to create a cluster with 1.2 PB of storage and very rapid
> access to the data. Earlier, "Intel® SSD DC P3608 Series 1.6TB NVMe
> PCIe 3.0 x4 Solid State Drive" disks were used; their speed satisfies everyone, but
> as the volume of storage increases, the price of such a cluster grows very
> strongly, and therefore the idea arose to use Ceph.
> The requirements are as follows:
>
> - The 160 GB amount of data should be read and written at the speed of the SSD
> P3608
> - A high-speed storage pool of SSD drives with a 36 TB volume must be
> created, with read/write speed approaching that of the SSD P3608
> - A 1.2 PB store must be created, with access speed the higher, the
> better ...
> - Must have triple redundancy
> I do not really understand yet how to create a configuration with the SSD
> P3608 disks. Of course, the configuration needs to be changed; it is very
> expensive.
>
> InfiniBand will be used, and 40 Gb Ethernet.
> We will also use virtualization on high-performance hardware to optimize
> the number of physical servers.
> I'm not tied to specific server models or manufacturers. I am only creating
> the cluster scheme, which should be criticized :)
>
> 1. OSD - 13 pieces.
>  a. 1.4 TB SSD-drive analogue Intel® SSD DC P3608 Series - 2 pieces
>  b. Fiber Channel 16 Gbit / c - 2 port.
>  c. An array (not RAID) to 284 TB of SATA-based drives (36 drives for
> 8TB);
>  d. 360 GB SSD- analogue Intel SSD DC S3500 1 piece
>  e. SATA drive 40 GB for installation of the operating system (or
> booting from the network, which is preferable)
>  f. RAM 288 GB
>  g. 2 x CPU - 9 core 2 Ghz. - E-5-2630v4
> 2. MON - 3 pieces. All virtual server:
>  a. 1 Gbps Ethernet / c - 1 port.
>  b. SATA drive 40 GB for installation of the operating system (or
> booting from the network, which is preferable)
>  c. SATA drive 40 GB
>  d. 6GB RAM
>  e. 1 x CPU - 2 cores at 1.9 Ghz
> 3. MDS - 2 pcs. All virtual server:
>  a. 1 Gbps Ethernet / c - 1 port.
>  b. SATA drive 40 GB for installation of the operating system (or
> booting from the network, which is preferable)
>  c. SATA drive 40 GB
>  d. 6GB RAM
>  e. 1 x CPU - min. 2 cores at 1.9 Ghz
>
> I assume I will use SSDs for acceleration, as a cache and for the OSD journal.
>
> --
> Alexander Pushkov
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


Re: [ceph-users] Auto recovering after loosing all copies of a PG(s)

2016-08-16 Thread Wido den Hollander

> On 16 August 2016 at 15:59, Iain Buclaw wrote:
> 
> 
> Hi,
> 
> I've been slowly getting some insight into this, but I haven't yet
> found any compromise that works well.
> 
> I'm currently testing ceph using librados C bindings directly, there
> are two components that access the storage cluster, one that only
> writes, and another that only reads.  Between them, all that happens
> is stat(), write(), and read() on a pool - for efficiency we're using
> the AIO variants.
> 
> On the reader side, it's stat(), and read() if an object exists, with
> an nginx proxy cache infront, a fair amount of accesses just stat()
> and return 304 if the IMS headers match the object's mtime.  On the
> writer side, it's only stat() and write() if an object doesn't exist
> in the pool.  Each object is independent, there is no relationship
> between that and any other objects stored.
> 
> If this makes it to production, this pool will have around 1.8 billion
> objects, anywhere between 4 and 30kbs in size.  The store itself is
> pretty much 90% write.  Of what is read, there is zero reference of
> locality, a given object could be read once, then not again for many
> days.  Even then, the seek times are mission critical, and so can't
> have a situation where it takes a longer than a normal disk's seek
> time to stat() a file.  On that front, Ceph has been working very well
> for us, with most requests taking an average of around 6 milliseconds
> - though there are problems that relate to deep-scrubbing running in
> the background that I may come back to at a later date.
> 
> With that brief description out the way, what probably makes our usage
> maybe unique is that of the data we do write, it's actually not a big
> a deal if just goes missing, infact we've even had a situation where
> 10% of our data was being deleted on a daily basis, and no one noticed
> for months.  This is because our writers guarantee that whatever isn't
> present on disk will be regenerated in the next loop of their
> workload.
> 
> Probably the one thing we do care about, is that our clients continue
> working, no matter what state the cluster is in.  Unfortunately, this
> is where a nasty (feature? bug?) side-effect of Ceph's internals comes
> in.  Where something happens(tm) to cause PGs to go missing, and all
> client operations will become effectively blocked, and indefinitely
> so.  That is 100% downtime for losing something even as small as 4-6%
> of the data held; this is outrageously undesirable!
> 
> When playing around with a test instance, I managed to get normal
> operations to resume using something to the effect of the following
> commands, though I'm not sure which were required or not, probably
> all.
> 
>   ceph osd down osd.0
>   ceph osd out osd.0
>   ceph out lost osd.0
>   for pg in $(get list of missing pgs) do ceph pg force_create & done
>   ceph osd crush rm osd.0
> 
> Only 20 minutes later after being stuck and stale, probably repeating
> some steps in a different order, (or doing something else that I
> didn't make a note of) did the cluster finally decide to recreate the
> lost PGs on the OSDs still standing, and normal operations were
> unblocked.  Sometime later, the lost osd.0 came back up, though I
> didn't look too much into whether the objects it held were merged
> back, or just wiped.  It wouldn't really make a difference either way.
> 
> So, the short question I have, is there a way to keep ceph running
> following a small data loss that would be completely catastrophic in
> probably any situation except my specific use case?  Increasing
> replication count isn't a solution I can afford.  Infact, with the
> relationship this application has with the storage layer, it would
> actually be better off without any sort of replication whatsoever.
> 
> The desired behaviour for me would be for the client to get an instant
> "not found" response from stat() operations.  For write() to recreate
> unfound objects.  And for missing placement groups to be recreated on
> an OSD that isn't overloaded.  Halting the entire cluster when 96% of
> it can still be accessed is just not workable, I'm afraid.
> 

Well, you can't make Ceph do that, but you can make librados do such a thing.

I'm using the OSD and MON timeout settings in libvirt for example: 
http://libvirt.org/git/?p=libvirt.git;a=blob;f=src/storage/storage_backend_rbd.c;h=9665fbca3a18fbfc7e4caec3ee8e991e13513275;hb=HEAD#l157

You can set these options:
- client_mount_timeout
- rados_mon_op_timeout
- rados_osd_op_timeout

Where I think only the last two should be sufficient in your case.

You will get ETIMEDOUT back as an error when an operation times out.
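As a ceph.conf sketch (values are in seconds and only examples, pick what
your application can tolerate):

[client]
    rados_mon_op_timeout = 30
    rados_osd_op_timeout = 30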

Wido

> Thanks ahead of time for any suggestions.
> 
> -- 
> Iain Buclaw
> 
> *(p < e ? p++ : p) = (c & 0x0f) + '0';
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] Fwd: Ceph Storage Migration from SAN storage to Local Disks

2016-08-16 Thread Gaurav Goyal
Hello


I need your help to redesign my ceph storage network.

As suggested in earlier discussions, I must not use SAN storage. So we have
decided to remove it.

Now we are ordering Local HDDs.

My network would be:

Host1 --> Controller + Compute --> Local Disk 600GB
Host2 --> Compute2 --> Local Disk 600GB
Host3 --> Compute2

Is this the right setup for a ceph network? For Host1 and Host2, we are using one
600GB disk for the basic filesystem.

Should we use the same size storage disks for the ceph environment, or can I
order 2TB disks for the ceph cluster?

Making it:

2TB x 2 on Host1
2TB x 2 on Host2
2TB x 2 on Host3

12TB in total. A replication factor of 2 should make it 6TB?


Regards

Gaurav Goyal

On Thu, Aug 4, 2016 at 1:52 AM, Bharath Krishna 
wrote:

> Hi Gaurav,
>
> There are several ways to do it depending on how you deployed your ceph
> cluster. Easiest way to do it is using ceph-ansible with purge-cluster yaml
> ready made to wipe off CEPH.
>
> https://github.com/ceph/ceph-ansible/blob/master/purge-cluster.yml
>
> You may need to configure ansible inventory with ceph hosts.
>
> Else if you want to purge manually, you can do it using:
> http://docs.ceph.com/docs/hammer/rados/deployment/ceph-deploy-purge/
>
>
> Thanks
> Bharath
>
> From: ceph-users  on behalf of Gaurav
> Goyal 
> Date: Thursday, August 4, 2016 at 8:19 AM
> To: David Turner 
> Cc: ceph-users 
> Subject: Re: [ceph-users] Fwd: Ceph Storage Migration from SAN storage to
> Local Disks
>
> Please suggest a procedure for this uninstallation process?
>
>
> Regards
> Gaurav Goyal
>
> On Wed, Aug 3, 2016 at 5:58 PM, Gaurav Goyal  wrote:
>
> Thanks for your  prompt
> response!
>
> The situation is a bit different now. The customer wants us to remove the ceph
> storage configuration from scratch, and let the openstack system work without
> ceph. Later on we will install ceph with local disks.
>
> So I need to know a procedure to uninstall ceph and unconfigure it from
> openstack.
>
> Regards
> Gaurav Goyal
> On 03-Aug-2016 4:59 pm, "David Turner"  > wrote:
> If I'm understanding your question correctly that you're asking how to
> actually remove the SAN osds from ceph, then it doesn't matter what is
> using the storage (ie openstack, cephfs, krbd, etc) as the steps are the
> same.
>
> I'm going to assume that you've already added the new storage/osds to the
> cluster, weighted the SAN osds to 0.0 and that the backfilling has
> finished.  If that is true, then your disk used space on the SAN's should
> be basically empty while the new osds on the local disks should have a fair
> amount of data.  If that is the case, then for every SAN osd, you just run
> the following commands replacing OSD_ID with the osd's id:
>
> # On the server with the osd being removed
> sudo stop ceph-osd id=OSD_ID
> ceph osd down OSD_ID
> ceph osd out OSD_ID
> ceph osd crush remove osd.OSD_ID
> ceph auth del osd.OSD_ID
> ceph osd rm OSD_ID
>
> Test running those commands on a test osd and if you had set the weight of
> the osd to 0.0 previously and if the backfilling had finished, then what
> you should see is that your cluster has 1 less osd than it used to, and no
> pgs should be backfilling.
>
> HOWEVER, if my assumptions above are incorrect, please provide the output
> of the following commands and try to clarify your question.
>
> ceph status
> ceph osd tree
>
> I hope this helps.
>
> > Hello David,
> >
> > Can you help me with steps/Procedure to uninstall Ceph storage from
> openstack environment?
> >
> >
> > Regards
> > Gaurav Goyal
> 
>
> David Turner | Cloud Operations Engineer | StorageCraft Technology
> Corporation
> 380 Data Drive Suite 300 | Draper | Utah | 84020
> Office: 801.871.2760 | Mobile: 385.224.2943
>
> 
> If you are not the intended recipient of this message or received it
> erroneously, please notify the sender and delete it, together with any
> attachments, and be advised that any dissemination or copying of this
> message is prohibited.
>
> 
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Auto recovering after loosing all copies of a PG(s)

2016-08-16 Thread Iain Buclaw
Hi,

I've been slowly getting some insight into this, but I haven't yet
found any compromise that works well.

I'm currently testing ceph using the librados C bindings directly. There
are two components that access the storage cluster: one that only
writes, and another that only reads.  Between them, all that happens
is stat(), write(), and read() on a pool - for efficiency we're using
the AIO variants.

On the reader side, it's stat(), and read() if an object exists, with
an nginx proxy cache in front; a fair amount of accesses just stat()
and return 304 if the IMS headers match the object's mtime.  On the
writer side, it's only stat() and write() if an object doesn't exist
in the pool.  Each object is independent; there is no relationship
between it and any other object stored.
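
The application itself talks to the cluster through the librados C bindings;
purely for illustration, the same access pattern expressed with the rados CLI
looks roughly like this (the pool and object names are made up):

  rados -p objstore stat img0001               # reader: size/mtime for the IMS check
  rados -p objstore get img0001 /tmp/img0001   # reader: fetch the object on a cache miss
  rados -p objstore stat img0001 || \
    rados -p objstore put img0001 /tmp/img0001 # writer: only write if it doesn't exist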

If this makes it to production, this pool will have around 1.8 billion
objects, anywhere between 4 and 30KB in size.  The store itself is
pretty much 90% write.  Of what is read, there is zero locality of
reference; a given object could be read once, then not again for many
days.  Even then, the seek times are mission critical, and we can't
have a situation where it takes longer than a normal disk's seek
time to stat() a file.  On that front, Ceph has been working very well
for us, with most requests taking an average of around 6 milliseconds
- though there are problems that relate to deep-scrubbing running in
the background that I may come back to at a later date.

With that brief description out of the way, what probably makes our usage
unique is that it's actually not a big deal if the data we write just goes
missing; in fact we've even had a situation where 10% of our data was being
deleted on a daily basis, and no one noticed for months.  This is because
our writers guarantee that whatever isn't present on disk will be
regenerated in the next loop of their workload.

Probably the one thing we do care about is that our clients continue
working, no matter what state the cluster is in.  Unfortunately, this
is where a nasty (feature? bug?) side-effect of Ceph's internals comes
in: when something happens(tm) to cause PGs to go missing, all
client operations become effectively blocked, and indefinitely
so.  That is 100% downtime for losing something even as small as 4-6%
of the data held, which is outrageously undesirable!

When playing around with a test instance, I managed to get normal
operations to resume using something to the effect of the following
commands, though I'm not sure which were required or not, probably
all.

  ceph osd down osd.0
  ceph osd out osd.0
  ceph osd lost osd.0
  for pg in $(get list of missing pgs); do ceph pg force_create_pg $pg & done
  ceph osd crush rm osd.0

Only 20 minutes later after being stuck and stale, probably repeating
some steps in a different order, (or doing something else that I
didn't make a note of) did the cluster finally decide to recreate the
lost PGs on the OSDs still standing, and normal operations were
unblocked.  Sometime later, the lost osd.0 came back up, though I
didn't look too much into whether the objects it held were merged
back, or just wiped.  It wouldn't really make a difference either way.
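
For the record, a cleaned-up sketch of roughly the sequence I ended up with
(the osd id and the stale-PG filter are illustrative, not an exact transcript
of what I typed):

  ceph osd down 0
  ceph osd out 0
  ceph osd lost 0 --yes-i-really-mean-it
  for pg in $(ceph pg dump_stuck stale 2>/dev/null | awk '/^[0-9]+\./ {print $1}'); do
      ceph pg force_create_pg $pg
  done
  ceph osd crush rm osd.0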

So, the short question I have: is there a way to keep ceph running
following a small data loss that would be completely catastrophic in
probably any situation except my specific use case?  Increasing the
replication count isn't a solution I can afford.  In fact, with the
relationship this application has with the storage layer, it would
actually be better off without any sort of replication whatsoever.

The desired behaviour for me would be for the client to get an instant
"not found" response from stat() operations.  For write() to recreate
unfound objects.  And for missing placement groups to be recreated on
an OSD that isn't overloaded.  Halting the entire cluster when 96% of
it can still be accessed is just not workable, I'm afraid.

Thanks ahead of time for any suggestions.

-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fresh Jewel install with RDS missing default REALM

2016-08-16 Thread jan hugo prins
After doing all the things below I still ran into an error:

2016-08-16 15:13:13.462548 7fe4d530d9c0  0 Cannot find zone id= (name=),
switching to local zonegroup configuration
2016-08-16 15:13:13.464303 7fe4d530d9c0 -1 Cannot find zone id= (name=)

Searching some more I found out that this might have to do with default
not being set.

radosgw-admin realm default --rgw-realm=default
radosgw-admin zonegroup default --rgw-zonegroup=default
radosgw-admin zone default --rgw-zone=default

After this I can create a user, etc. So I think I now have a
configuration with a default realm, zonegroup and zone, and I can start
creating placement targets.
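
For completeness, the related commands that explicitly mark the defaults as
master before committing the period - a sketch only, based on the Jewel
multisite notes as I understand them and not re-verified here:

radosgw-admin zone modify --rgw-zone=default --master --default
radosgw-admin zonegroup modify --rgw-zonegroup=default --master --default
radosgw-admin period update --commit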

If I'm missing something please let me know because I really want to
learn more about this.

Jan Hugo Prins


On 08/16/2016 02:48 PM, jan hugo prins wrote:
> Hi,
>
> I'm currently testing a fresh Jewel install with RDS and I run into some
> issues.
> The last couple of days everything was fine until I started playing
> around with creating buckets in different storage pools and creating
> different placement_targets. Then I found out that the fresh install
> of Jewel RDS is missing a default Realm.
>
> So this morning I created a complete fresh RDS install and then I did
> the following:
> Could someone tell me if everything below looks fine, or if I'm missing
> something?
>
>
> [root@blsceph01-1 ~]# radosgw-admin metadata zone get --rgw-zone=default
> {
> "id": "1adc4b51-3345-4a4b-bf6d-b55b35991530",
> "name": "default",
> "domain_root": "default.rgw.data.root",
> "control_pool": "default.rgw.control",
> "gc_pool": "default.rgw.gc",
> "log_pool": "default.rgw.log",
> "intent_log_pool": "default.rgw.intent-log",
> "usage_log_pool": "default.rgw.usage",
> "user_keys_pool": "default.rgw.users.keys",
> "user_email_pool": "default.rgw.users.email",
> "user_swift_pool": "default.rgw.users.swift",
> "user_uid_pool": "default.rgw.users.uid",
> "system_key": {
> "access_key": "",
> "secret_key": ""
> },
> "placement_pools": [
> {
> "key": "default-placement",
> "val": {
> "index_pool": "default.rgw.buckets.index",
> "data_pool": "default.rgw.buckets.data",
> "data_extra_pool": "default.rgw.buckets.non-ec",
> "index_type": 0
> }
> }
> ],
> "metadata_heap": "default.rgw.meta",
> "realm_id": ""
> }
> [root@blsceph01-1 ~]# radosgw-admin metadata zonegroup get
> --rgw-zonegroup=default
> {
> "id": "d5ad18ed-dfb3-4e4a-a6ee-3c7b4f0cddae",
> "name": "default",
> "api_name": "",
> "is_master": "true",
> "endpoints": [],
> "hostnames": [],
> "hostnames_s3website": [],
> "master_zone": "1adc4b51-3345-4a4b-bf6d-b55b35991530",
> "zones": [
> {
> "id": "1adc4b51-3345-4a4b-bf6d-b55b35991530",
> "name": "default",
> "endpoints": [],
> "log_meta": "false",
> "log_data": "false",
> "bucket_index_max_shards": 0,
> "read_only": "false"
> }
> ],
> "placement_targets": [
> {
> "name": "default-placement",
> "tags": []
> }
> ],
> "default_placement": "default-placement",
> "realm_id": ""
> }
>
> [root@blsceph01-1 ~]# radosgw-admin realm create --rgw-realm=default
> --default
> 2016-08-16 13:15:24.082459 7fa5788d19c0  0 error read_lastest_epoch
> .rgw.root:periods.d825f817-43d1-4ca0-9ca2-0f3946c1e9b7.latest_epoch
> {
> "id": "f8cdcfe3-238a-4e0d-84c4-d58fada869aa",
> "name": "default",
> "current_period": "d825f817-43d1-4ca0-9ca2-0f3946c1e9b7",
> "epoch": 1
> }
>
> [root@blsceph01-1 ~]# radosgw-admin period update --commit
> 2016-08-16 13:15:30.518961 7fe43116c9c0  0 RGWZoneParams::create():
> error creating default zone params: (17) File exists
> 2016-08-16 13:15:30.607504 7fe43116c9c0  0 error read_lastest_epoch
> .rgw.root:periods.f8cdcfe3-238a-4e0d-84c4-d58fada869aa:staging.latest_epoch
> cannot commit period: period does not have a master zone of a master
> zonegroup
> failed to commit period: (22) Invalid argument
>
> Looks like the realm is missing some information.
>
> I found out that it could have something to do with the zonegroup-map:
>
> [root@blsceph01-1 ~]# radosgw-admin zonegroup-map get
> {
> "zonegroups": [],
> "master_zonegroup": "",
> "bucket_quota": {
> "enabled": false,
> "max_size_kb": -1,
> "max_objects": -1
> },
> "user_quota": {
> "enabled": false,
> "max_size_kb": -1,
> "max_objects": -1
> }
> }
>
> Looks like something is indeed missing here.
>
> [root@blsceph01-1 ~]# radosgw-admin zonegroup-map get >zonegroup-map.json
> [root@blsceph01-1 ~]# vi zonegroup-map.json
>
> {
> "zonegroups": ["default"],
> "master_zonegroup": "default",
> "bucket_quota": {
> "enabled": false,
>  

Re: [ceph-users] MDS restart when create million of files with smallfile tool

2016-08-16 Thread Yan, Zheng
It seems you have multiple active MDS. Multiple active MDS is not
stable yet. Please use single active MDS.
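
A rough sketch of going back to a single active MDS on Jewel (the filesystem
name "cephfs" is a placeholder, and the syntax is from memory, so please check
it against the docs for your version):

ceph fs set cephfs max_mds 1
ceph mds deactivate 1        # stop rank 1 so that only rank 0 stays active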


On Tue, Aug 16, 2016 at 8:10 PM, yu2xiangyang  wrote:
> I have found the MDS restarting several times, switching between two MDS
> processes in ACTIVE and BACKUP mode, when I use smallfile to create lots of
> files (3 clients, each with 8 threads creating 1 files). Has anyone
> encountered the same problem? Is there any configuration I can set? Thank
> you for any reply.
>
> Here is one of MDS logs.
> 2016-08-16 19:53:43.246001 7f90e4864180  0 ceph version 10.2.2
> (45107e21c568dd033c2f0a3107dec8f0b0e58374), process ceph-mds, pid 21852
> 2016-08-16 19:53:43.246494 7f90e4864180 -1 deprecation warning: MDS id
> 'mds.1' is invalid and will be forbidden in a future version.  MDS names may
> not start with a numeric digit.
> 2016-08-16 19:53:43.248084 7f90e4864180  0 pidfile_write: ignore empty
> --pid-file
> 2016-08-16 19:53:44.369886 7f90dea57700  1 mds.1 handle_mds_map standby
> 2016-08-16 19:53:45.719945 7f90dea57700  1 mds.1 handle_mds_map standby
> 2016-08-16 19:53:46.812074 7f90dea57700  1 mds.1 handle_mds_map standby
> 2016-08-16 19:53:48.412859 7f90dea57700  1 mds.1 handle_mds_map standby
> 2016-08-16 19:53:51.967246 7f90dea57700  1 mds.1 handle_mds_map standby
> 2016-08-16 19:53:53.163012 7f90dea57700  1 mds.1 handle_mds_map standby
> 2016-08-16 19:53:56.930083 7f90dea57700  1 mds.1 handle_mds_map standby
> 2016-08-16 19:54:05.376155 7f90dea57700  1 mds.1 handle_mds_map standby
> 2016-08-16 19:54:09.801776 7f90dea57700  1 mds.1 handle_mds_map standby
> 2016-08-16 19:54:13.442563 7f90dea57700  1 mds.1 handle_mds_map standby
> 2016-08-16 19:54:17.019500 7f90dea57700  1 mds.1 handle_mds_map standby
> 2016-08-16 19:54:17.220698 7f90dea57700  1 mds.0.137 handle_mds_map i am now
> mds.0.137
> 2016-08-16 19:54:17.220704 7f90dea57700  1 mds.0.137 handle_mds_map state
> change up:boot --> up:replay
> 2016-08-16 19:54:17.220718 7f90dea57700  1 mds.0.137 replay_start
> 2016-08-16 19:54:17.220728 7f90dea57700  1 mds.0.137  recovery set is
> 2016-08-16 19:54:17.220734 7f90dea57700  1 mds.0.137  waiting for osdmap
> 51053 (which blacklists prior instance)
> 2016-08-16 19:54:17.291291 7f90d974a700  0 mds.0.cache creating system inode
> with ino:100
> 2016-08-16 19:54:17.291548 7f90d974a700  0 mds.0.cache creating system inode
> with ino:1
> 2016-08-16 19:54:18.871153 7f90d7b3c700  1 mds.0.137 replay_done
> 2016-08-16 19:54:18.871166 7f90d7b3c700  1 mds.0.137 making mds journal
> writeable
> 2016-08-16 19:54:19.710851 7f90dea57700  1 mds.0.137 handle_mds_map i am now
> mds.0.137
> 2016-08-16 19:54:19.710860 7f90dea57700  1 mds.0.137 handle_mds_map state
> change up:replay --> up:reconnect
> 2016-08-16 19:54:19.710874 7f90dea57700  1 mds.0.137 reconnect_start
> 2016-08-16 19:54:19.710877 7f90dea57700  1 mds.0.137 reopen_log
> 2016-08-16 19:54:19.710912 7f90dea57700  1 mds.0.server reconnect_clients --
> 5 sessions
> 2016-08-16 19:54:19.711646 7f90d6931700  0 -- 192.168.5.12:6817/21852 >>
> 192.168.5.9:0/2954821946 pipe(0x7f90f02aa000 sd=61 :6817 s=0 pgs=0 cs=0 l=0
> c=0x7f90efbc6780).accept peer addr is really 192.168.5.9:0/2954821946
> (socket is 192.168.5.9:51609/0)
> 2016-08-16 19:54:19.712664 7f90d652d700  0 -- 192.168.5.12:6817/21852 >>
> 192.168.5.13:0/3688491801 pipe(0x7f90f02ac800 sd=63 :6817 s=0 pgs=0 cs=0 l=0
> c=0x7f90efbc6a80).accept peer addr is really 192.168.5.13:0/3688491801
> (socket is 192.168.5.13:57657/0)
> 2016-08-16 19:54:19.713002 7f90dea57700  0 log_channel(cluster) log [DBG] :
> reconnect by client.25434663 192.168.5.13:0/643433156 after 0.002023
> 2016-08-16 19:54:19.725704 7f90dea57700  0 log_channel(cluster) log [DBG] :
> reconnect by client.25421481 192.168.5.9:0/2954821946 after 0.014790
> 2016-08-16 19:54:19.728322 7f90dea57700  0 log_channel(cluster) log [DBG] :
> reconnect by client.25434981 192.168.5.13:0/3688491801 after 0.017410
> 2016-08-16 19:54:19.734812 7f90dea57700  0 log_channel(cluster) log [DBG] :
> reconnect by client.23765175 192.168.5.9:0/2024125279 after 0.023899
> 2016-08-16 19:54:19.740344 7f90d6129700  0 -- 192.168.5.12:6817/21852 >>
> 192.168.5.8:0/1814981959 pipe(0x7f90f03a3400 sd=65 :6817 s=0 pgs=0 cs=0 l=0
> c=0x7f90efbc7c80).accept peer addr is really 192.168.5.8:0/1814981959
> (socket is 192.168.5.8:46034/0)
> 2016-08-16 19:54:19.746170 7f90dea57700  0 log_channel(cluster) log [DBG] :
> reconnect by client.25434930 192.168.5.8:0/1814981959 after 0.035255
> 2016-08-16 19:54:19.746722 7f90dea57700  1 mds.0.137 reconnect_done
> 2016-08-16 19:54:20.860114 7f90dea57700  1 mds.0.137 handle_mds_map i am now
> mds.0.137
> 2016-08-16 19:54:20.860123 7f90dea57700  1 mds.0.137 handle_mds_map state
> change up:reconnect --> up:rejoin
> 2016-08-16 19:54:20.860138 7f90dea57700  1 mds.0.137 rejoin_start
> 2016-08-16 19:54:20.870836 7f90dea57700  1 mds.0.137 rejoin_joint_start
> 2016-08-16 19:54:21.115345 7f90da14d700  1 mds.0.137 rejoin_done
> 

[ceph-users] Understanding throughput/bandwidth changes in object store

2016-08-16 Thread hrast
Env: Ceph 10.2.2, 6 nodes, 96 OSDs, journals on ssd (8 per ssd), OSDs are 
enterprise SATA disks, 50KB objects, dual 10 Gbe, 3 copies of each object

I'm running some tests with COSbench against our object store, and I'm not
really understanding what I'm seeing when changing the number of nodes. With
6 nodes, I'm getting a write bandwidth of about 160MB/s and 3300 operations
per second.

I suspect we're IOPS bound, so to verify, we thought we'd take down one node
and see if the performance dropped by roughly 1/6th. I removed the OSDs, mon,
and rgw on that node from the cluster (and waited for the rebalance to
complete), then turned the node off. When I reran the same test I had run
before, my bandwidth and operations per second were now 1/3 of what they had
been at 6 nodes. I'm at a loss to understand why removing a single node has
such a huge impact; does anyone have an explanation?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Fresh Jewel install with RDS missing default REALM

2016-08-16 Thread jan hugo prins
Hi,

I'm currently testing a fresh Jewel install with RDS and I run into some
issues.
The last couple of days everything was fine until I started playing
around with creating buckets in different storage pools and creating
different placement_targets. Then I found out that the fresh install
of Jewel RDS is missing a default Realm.

So this morning I created a complete fresh RDS install and then I did
the following:
Could someone tell me if everything below looks fine, or if I'm missing
something?


[root@blsceph01-1 ~]# radosgw-admin metadata zone get --rgw-zone=default
{
"id": "1adc4b51-3345-4a4b-bf6d-b55b35991530",
"name": "default",
"domain_root": "default.rgw.data.root",
"control_pool": "default.rgw.control",
"gc_pool": "default.rgw.gc",
"log_pool": "default.rgw.log",
"intent_log_pool": "default.rgw.intent-log",
"usage_log_pool": "default.rgw.usage",
"user_keys_pool": "default.rgw.users.keys",
"user_email_pool": "default.rgw.users.email",
"user_swift_pool": "default.rgw.users.swift",
"user_uid_pool": "default.rgw.users.uid",
"system_key": {
"access_key": "",
"secret_key": ""
},
"placement_pools": [
{
"key": "default-placement",
"val": {
"index_pool": "default.rgw.buckets.index",
"data_pool": "default.rgw.buckets.data",
"data_extra_pool": "default.rgw.buckets.non-ec",
"index_type": 0
}
}
],
"metadata_heap": "default.rgw.meta",
"realm_id": ""
}
[root@blsceph01-1 ~]# radosgw-admin metadata zonegroup get
--rgw-zonegroup=default
{
"id": "d5ad18ed-dfb3-4e4a-a6ee-3c7b4f0cddae",
"name": "default",
"api_name": "",
"is_master": "true",
"endpoints": [],
"hostnames": [],
"hostnames_s3website": [],
"master_zone": "1adc4b51-3345-4a4b-bf6d-b55b35991530",
"zones": [
{
"id": "1adc4b51-3345-4a4b-bf6d-b55b35991530",
"name": "default",
"endpoints": [],
"log_meta": "false",
"log_data": "false",
"bucket_index_max_shards": 0,
"read_only": "false"
}
],
"placement_targets": [
{
"name": "default-placement",
"tags": []
}
],
"default_placement": "default-placement",
"realm_id": ""
}

[root@blsceph01-1 ~]# radosgw-admin realm create --rgw-realm=default
--default
2016-08-16 13:15:24.082459 7fa5788d19c0  0 error read_lastest_epoch
.rgw.root:periods.d825f817-43d1-4ca0-9ca2-0f3946c1e9b7.latest_epoch
{
"id": "f8cdcfe3-238a-4e0d-84c4-d58fada869aa",
"name": "default",
"current_period": "d825f817-43d1-4ca0-9ca2-0f3946c1e9b7",
"epoch": 1
}

[root@blsceph01-1 ~]# radosgw-admin period update --commit
2016-08-16 13:15:30.518961 7fe43116c9c0  0 RGWZoneParams::create():
error creating default zone params: (17) File exists
2016-08-16 13:15:30.607504 7fe43116c9c0  0 error read_lastest_epoch
.rgw.root:periods.f8cdcfe3-238a-4e0d-84c4-d58fada869aa:staging.latest_epoch
cannot commit period: period does not have a master zone of a master
zonegroup
failed to commit period: (22) Invalid argument

Looks like the realm is missing some information.

I found out that it could have something to do with the zonegroup-map:

[root@blsceph01-1 ~]# radosgw-admin zonegroup-map get
{
"zonegroups": [],
"master_zonegroup": "",
"bucket_quota": {
"enabled": false,
"max_size_kb": -1,
"max_objects": -1
},
"user_quota": {
"enabled": false,
"max_size_kb": -1,
"max_objects": -1
}
}

Looks like something is indeed missing here.

[root@blsceph01-1 ~]# radosgw-admin zonegroup-map get >zonegroup-map.json
[root@blsceph01-1 ~]# vi zonegroup-map.json

{
"zonegroups": ["default"],
"master_zonegroup": "default",
"bucket_quota": {
"enabled": false,
"max_size_kb": -1,
"max_objects": -1
},
"user_quota": {
"enabled": false,
"max_size_kb": -1,
"max_objects": -1
}
}


[root@blsceph01-1 ~]# radosgw-admin zonegroup-map set zonegroup.conf.json
[root@blsceph01-1 ~]# vi zonegroup.conf.json
[root@blsceph01-1 ~]# radosgw-admin metadata zonegroup set
--rgw-zonegroup=default
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rados cppool slooooooowness

2016-08-16 Thread Simon Murray
My crush map is already set up and the rules exist for the various roots.
Just tried altering the crush rule set on a test pool and it migrates
successfully... didn't know you could do that!
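
For anyone finding this in the archive later, the change amounts to something
like this (ruleset id 1 is just an example for the cold-storage rule; syntax
as of Jewel):

ceph osd crush rule dump                        # find the ruleset id of the cold-storage rule
ceph osd pool set .rgw.buckets crush_ruleset 1  # point the pool at that ruleset
ceph -w                                         # watch the resulting rebalance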

Thanks Maxime

On 16 August 2016 at 12:13, Maxime Guyot  wrote:

> Hi Simon,
>
>
>
> If everything is in the same Ceph cluster and you want to move the whole
> “.rgw.buckets” (I assume your RBD traffic is targeted into a “data” or
> “rbd” pool) to your cold storage OSD maybe you could edit the CRUSH map,
> then it’s just a matter of rebalancing.
>
> You can check the ssd/platter example in the doc:
> http://docs.ceph.com/docs/master/rados/operations/crush-map/ or this
> article detailing different maps: http://cephnotes.ksperis.com/
> blog/2015/02/02/crushmap-example-of-a-hierarchical-cluster-map
>
>
>
> Cheers,
>
> Maxime
>
> *From: *ceph-users  on behalf of Simon
> Murray 
> *Date: *Tuesday 16 August 2016 12:25
> *To: *"ceph-users@lists.ceph.com" 
> *Subject: *[ceph-users] rados cppool slooowness
>
>
>
> Morning guys,
>
> I've got about 8 million objects sat in .rgw.buckets that want moving out
> of the way of OpenStack RBD traffic onto its own (admittedly small) cold
> storage pool on separate OSDs.
>
> I attempted to do this over the weekend during a 12h scheduled downtime,
> however my estimates had this pool completing in a rather un-customer
> friendly (think no backups...) 7 days.
>
> Anyone had any experience in doing this quicker?  Any obvious reasons why
> I can't hack do_copy_pool() to spawn a bunch of threads and bang this off
> in a few hours?
>
> Cheers
>
> Si
>
>
> DataCentred Limited registered in England and Wales no. 05611763
>

-- 
DataCentred Limited registered in England and Wales no. 05611763
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] openATTIC 2.0.13 beta has been released

2016-08-16 Thread Lenz Grimmer
Hi Alexander,

sorry for the late reply, I've been on vacation for a bit.

On 08/11/2016 07:16 AM, Александр Пивушков wrote:

> and what does not suit calamari?

Thank you for your comment! openATTIC has a somewhat different scope: we
aim at providing a versatile storage management system that supports
both "traditional" storage (e.g. NFS/CIFS/iSCSI) as well as adding
support for managing and monitoring Ceph for users that have storage
demands that exceed the boundaries of individual servers with local
attached storage.

We intend to organically grow the Ceph management and monitoring
functionality over time, based on user feedback and demand. Currently,
we're in the final stretch of completing a dashboard that displays the
overall status and health of the configured Ceph Cluster(s). We're also
working on extending the Ceph Pool management and monitoring
functionality for the next release (we release a new openATTIC version
every 5-6 weeks).

I blogged about the state of Ceph support a few months ago [1], a
followup posting is currently in the works.

Our roadmap and development process is fully open - we look forward to
your feedback and suggestions.

Thanks,

Lenz

[1]
https://blog.openattic.org/posts/update-the-state-of-ceph-support-in-openattic/



signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] MDS restart when create million of files with smallfile tool

2016-08-16 Thread yu2xiangyang
I have found the MDS restarting several times, switching between two MDS
processes in ACTIVE and BACKUP mode, when I use smallfile to create lots of
files (3 clients, each with 8 threads creating 1 files). Has anyone
encountered the same problem? Is there any configuration I can set? Thank you
for any reply.


Here is one of MDS logs.
2016-08-16 19:53:43.246001 7f90e4864180  0 ceph version 10.2.2 
(45107e21c568dd033c2f0a3107dec8f0b0e58374), process ceph-mds, pid 21852
2016-08-16 19:53:43.246494 7f90e4864180 -1 deprecation warning: MDS id 'mds.1' 
is invalid and will be forbidden in a future version.  MDS names may not start 
with a numeric digit.
2016-08-16 19:53:43.248084 7f90e4864180  0 pidfile_write: ignore empty 
--pid-file
2016-08-16 19:53:44.369886 7f90dea57700  1 mds.1 handle_mds_map standby
2016-08-16 19:53:45.719945 7f90dea57700  1 mds.1 handle_mds_map standby
2016-08-16 19:53:46.812074 7f90dea57700  1 mds.1 handle_mds_map standby
2016-08-16 19:53:48.412859 7f90dea57700  1 mds.1 handle_mds_map standby
2016-08-16 19:53:51.967246 7f90dea57700  1 mds.1 handle_mds_map standby
2016-08-16 19:53:53.163012 7f90dea57700  1 mds.1 handle_mds_map standby
2016-08-16 19:53:56.930083 7f90dea57700  1 mds.1 handle_mds_map standby
2016-08-16 19:54:05.376155 7f90dea57700  1 mds.1 handle_mds_map standby
2016-08-16 19:54:09.801776 7f90dea57700  1 mds.1 handle_mds_map standby
2016-08-16 19:54:13.442563 7f90dea57700  1 mds.1 handle_mds_map standby
2016-08-16 19:54:17.019500 7f90dea57700  1 mds.1 handle_mds_map standby
2016-08-16 19:54:17.220698 7f90dea57700  1 mds.0.137 handle_mds_map i am now 
mds.0.137
2016-08-16 19:54:17.220704 7f90dea57700  1 mds.0.137 handle_mds_map state 
change up:boot --> up:replay
2016-08-16 19:54:17.220718 7f90dea57700  1 mds.0.137 replay_start
2016-08-16 19:54:17.220728 7f90dea57700  1 mds.0.137  recovery set is
2016-08-16 19:54:17.220734 7f90dea57700  1 mds.0.137  waiting for osdmap 51053 
(which blacklists prior instance)
2016-08-16 19:54:17.291291 7f90d974a700  0 mds.0.cache creating system inode 
with ino:100
2016-08-16 19:54:17.291548 7f90d974a700  0 mds.0.cache creating system inode 
with ino:1
2016-08-16 19:54:18.871153 7f90d7b3c700  1 mds.0.137 replay_done
2016-08-16 19:54:18.871166 7f90d7b3c700  1 mds.0.137 making mds journal 
writeable
2016-08-16 19:54:19.710851 7f90dea57700  1 mds.0.137 handle_mds_map i am now 
mds.0.137
2016-08-16 19:54:19.710860 7f90dea57700  1 mds.0.137 handle_mds_map state 
change up:replay --> up:reconnect
2016-08-16 19:54:19.710874 7f90dea57700  1 mds.0.137 reconnect_start
2016-08-16 19:54:19.710877 7f90dea57700  1 mds.0.137 reopen_log
2016-08-16 19:54:19.710912 7f90dea57700  1 mds.0.server reconnect_clients -- 5 
sessions
2016-08-16 19:54:19.711646 7f90d6931700  0 -- 192.168.5.12:6817/21852 >> 
192.168.5.9:0/2954821946 pipe(0x7f90f02aa000 sd=61 :6817 s=0 pgs=0 cs=0 l=0 
c=0x7f90efbc6780).accept peer addr is really 192.168.5.9:0/2954821946 (socket 
is 192.168.5.9:51609/0)
2016-08-16 19:54:19.712664 7f90d652d700  0 -- 192.168.5.12:6817/21852 >> 
192.168.5.13:0/3688491801 pipe(0x7f90f02ac800 sd=63 :6817 s=0 pgs=0 cs=0 l=0 
c=0x7f90efbc6a80).accept peer addr is really 192.168.5.13:0/3688491801 (socket 
is 192.168.5.13:57657/0)
2016-08-16 19:54:19.713002 7f90dea57700  0 log_channel(cluster) log [DBG] : 
reconnect by client.25434663 192.168.5.13:0/643433156 after 0.002023
2016-08-16 19:54:19.725704 7f90dea57700  0 log_channel(cluster) log [DBG] : 
reconnect by client.25421481 192.168.5.9:0/2954821946 after 0.014790
2016-08-16 19:54:19.728322 7f90dea57700  0 log_channel(cluster) log [DBG] : 
reconnect by client.25434981 192.168.5.13:0/3688491801 after 0.017410
2016-08-16 19:54:19.734812 7f90dea57700  0 log_channel(cluster) log [DBG] : 
reconnect by client.23765175 192.168.5.9:0/2024125279 after 0.023899
2016-08-16 19:54:19.740344 7f90d6129700  0 -- 192.168.5.12:6817/21852 >> 
192.168.5.8:0/1814981959 pipe(0x7f90f03a3400 sd=65 :6817 s=0 pgs=0 cs=0 l=0 
c=0x7f90efbc7c80).accept peer addr is really 192.168.5.8:0/1814981959 (socket 
is 192.168.5.8:46034/0)
2016-08-16 19:54:19.746170 7f90dea57700  0 log_channel(cluster) log [DBG] : 
reconnect by client.25434930 192.168.5.8:0/1814981959 after 0.035255
2016-08-16 19:54:19.746722 7f90dea57700  1 mds.0.137 reconnect_done
2016-08-16 19:54:20.860114 7f90dea57700  1 mds.0.137 handle_mds_map i am now 
mds.0.137
2016-08-16 19:54:20.860123 7f90dea57700  1 mds.0.137 handle_mds_map state 
change up:reconnect --> up:rejoin
2016-08-16 19:54:20.860138 7f90dea57700  1 mds.0.137 rejoin_start
2016-08-16 19:54:20.870836 7f90dea57700  1 mds.0.137 rejoin_joint_start
2016-08-16 19:54:21.115345 7f90da14d700  1 mds.0.137 rejoin_done
2016-08-16 19:54:21.995720 7f90dea57700  1 mds.0.137 handle_mds_map i am now 
mds.0.137
2016-08-16 19:54:21.995727 7f90dea57700  1 mds.0.137 handle_mds_map state 
change up:rejoin --> up:clientreplay
2016-08-16 19:54:21.995739 7f90dea57700  1 mds.0.137 recovery_done -- 
successful recovery!
2016-08-16 

Re: [ceph-users] How to hide monitoring ip in cephfs mounted clients

2016-08-16 Thread gjprabu
Hi John,



 Any further update on this.



Regards

Prabu GJ




 On Thu, 28 Jul 2016 16:05:00 +0530 gjprabu 
gjpr...@zohocorp.comwrote  




Hi John,



   Thanks for your reply. It is a normal docker container that can see mount
information like /dev/sda..., but this means the monitor IP is exposed, and
for security reasons we should avoid exposing IP addresses. For now we will
try to use hostnames instead of the monitor IP addresses, but is there any way
to prevent the containers from seeing the monitor IPs?



Regards

Prabu GJ





 On Wed, 20 Jul 2016 15:32:48 +0530 John Spray jsp...@redhat.com wrote:

On Wed, Jul 20, 2016 at 8:33 AM, gjprabu gjpr...@zohocorp.com wrote:

 Hi Team,

 We are using cephfs file systems to mount client machines. While mounting
 we have to provide the monitor IP address; is there any option to hide the
 monitor IP addresses in the mounted partition? We are using containers on
 all ceph clients and they are all able to see the monitor IPs, which could
 be a security issue for us. Kindly let us know if there is any solution
 for this.

Hmm, so you have a situation where the containers are prevented from
actually communicating with the monitor IPs, but the cephfs mounts are
exposed to the containers in a way that lets them see them when they
run `mount`?

I don't think we've thought about this case before. Is it normal that
when you have e.g. a docker container with a volume attached, the
container can see the mount information for the filesystem that the
volume lives on?

John

 Regards
 Prabu GJ

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG is in 'stuck unclean' state, but all acting OSD are up

2016-08-16 Thread Heller, Chris
I’d like to understand more why the down OSD would cause the PG to get stuck 
after CRUSH was able to locate enough OSD to map the PG.

Is this some form of safety catch that prevents it from recovering, even though 
OSD.116 is no longer important for data integrity?

Marking the OSD lost is an option here, but it’s not really lost … it just 
takes some time to get a machine rebooted.
I’m still working out my operational procedures for CEPH and marking the OSD 
lost but having it pop back up once the system reboots could be an issue that 
I’m not yet sure how to resolve.

Can an OSD be marked as ‘found’ once it returns to the network?

-Chris

From: Goncalo Borges 
Date: Monday, August 15, 2016 at 11:36 PM
To: "Heller, Chris" , "ceph-users@lists.ceph.com" 

Subject: Re: [ceph-users] PG is in 'stuck unclean' state, but all acting OSD 
are up


Hi Chris...

The precise osd set you see now [79,8,74] was obtained on epoch 104536, but
this was after a lot of tries, as shown by the recovery section.

Actually, in the first try (on epoch 100767) osd 116 was selected somehow
(maybe it was up at the time?) and probably the pg got stuck because it went
down during the recovery process?

recovery_state": [
{
"name": "Started\/Primary\/Peering\/GetInfo",
"enter_time": "2016-08-11 11:45:06.052568",
"requested_info_from": []
},
{
"name": "Started\/Primary\/Peering",
"enter_time": "2016-08-11 11:45:06.052558",
"past_intervals": [
{
"first": 100767,
"last": 100777,
"maybe_went_rw": 1,
"up": [
79,
116,
74
],
"acting": [
79,
116,
74
],
"primary": 79,
"up_primary": 79
},

The pg query also shows

peering_blocked_by": [
{
"osd": 116,
"current_lost_at": 0,
"comment": "starting or marking this osd lost may let us 
proceed"
}

Maybe, you can check the documentation in [1] and see if you think you could 
follow the suggestion inside the pg and mark osd 116 as lost. This should be 
done after proper evaluation from you.
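
If you do decide to go that way, a sketch of the commands (only after you are
sure osd.116's data is expendable or fully replicated elsewhere):

ceph osd lost 116 --yes-i-really-mean-it
ceph pg 4.2a8 query        # re-check whether peering is still blocked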

Another thing I found strange is that in the recovery section, there are a lot 
of tries where you do not get a proper osd set. The very last recover try was 
on epoch 104540.

{
"first": 104536,
"last": 104540,
"maybe_went_rw": 1,
"up": [
2147483647,
8,
74
],
"acting": [
2147483647,
8,
74
],
"primary": 8,
"up_primary": 8
}

From [2], "When CRUSH fails to find enough OSDs to map to a PG, it will show as 
a 2147483647 which is ITEM_NONE or no OSD found.".

This could be an artifact of the peering being blocked by osd.116, or a genuine 
problem where you are not being able to get a proper osd set. That could be for 
a variety of reasons: from network issues, to osds being almost full or simply 
because the system can't get 3 osds in 3 different hosts.

Cheers

Goncalo


[1] 
http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/#placement-group-down-peering-failure

[2] 
http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/

On 08/16/2016 11:42 AM, Heller, Chris wrote:
Output of `ceph pg dump_stuck`

# ceph pg dump_stuck
ok
pg_stat state   up  up_primary  acting  acting_primary
4.2a8   down+peering[79,8,74]   79  [79,8,74]   79
4.c3down+peering[56,79,67]  56  [56,79,67]  56

-Chris

From: Goncalo Borges 

Date: Monday, August 15, 2016 at 9:03 PM
To: 

[ceph-users] help

2016-08-16 Thread yu2xiangyang
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rados cppool slooooooowness

2016-08-16 Thread Maxime Guyot
Hi Simon,

If everything is in the same Ceph cluster and you want to move the whole 
“.rgw.buckets” (I assume your RBD traffic is targeted into a “data” or “rbd” 
pool) to your cold storage OSD maybe you could edit the CRUSH map, then it’s 
just a matter of rebalancing.
You can check the ssd/platter example in the doc: 
http://docs.ceph.com/docs/master/rados/operations/crush-map/ or this article 
detailing different maps: 
http://cephnotes.ksperis.com/blog/2015/02/02/crushmap-example-of-a-hierarchical-cluster-map

Cheers,
Maxime
From: ceph-users  on behalf of Simon Murray 

Date: Tuesday 16 August 2016 12:25
To: "ceph-users@lists.ceph.com" 
Subject: [ceph-users] rados cppool slooowness

Morning guys,
I've got about 8 million objects sat in .rgw.buckets that want moving out of 
the way of OpenStack RBD traffic onto its own (admittedly small) cold storage 
pool on separate OSDs.
I attempted to do this over the weekend during a 12h scheduled downtime, 
however my estimates had this pool completing in a rather un-customer friendly 
(think no backups...) 7 days.
Anyone had any experience in doing this quicker?  Any obvious reasons why I 
can't hack do_copy_pool() to spawn a bunch of threads and bang this off in a 
few hours?
Cheers
Si

DataCentred Limited registered in England and Wales no. 05611763
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] rados cppool slooooooowness

2016-08-16 Thread Simon Murray
Morning guys,

I've got about 8 million objects sat in .rgw.buckets that want moving out
of the way of OpenStack RBD traffic onto its own (admittedly small) cold
storage pool on separate OSDs.

I attempted to do this over the weekend during a 12h scheduled downtime,
however my estimates had this pool completing in a rather un-customer
friendly (think no backups...) 7 days.

Anyone had any experience in doing this quicker?  Any obvious reasons why I
can't hack do_copy_pool() to spawn a bunch of threads and bang this off in
a few hours?

Cheers
Si

-- 
DataCentred Limited registered in England and Wales no. 05611763
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph map error

2016-08-16 Thread Chengwei Yang
On Tue, Aug 16, 2016 at 10:21:55AM +0200, Ilya Dryomov wrote:
> On Tue, Aug 16, 2016 at 5:18 AM, Yanjun Shen  wrote:
> > hi,
> >when i run cep map -p pool rbd test, error
> > hdu@ceph-mon2:~$ sudo rbd map -p rbd test
> > rbd: sysfs write failed
> > In some cases useful info is found in syslog - try "dmesg | tail" or so.
> > rbd: map failed: (5) Input/output error
> >
> > dmesg |tail
> > [ 4148.672530] libceph: mon1 172.22.111.173:6789 feature set mismatch, my
> > 384a042a42 < server's 4384a042a42, missing 400
> > [ 4148.672576] libceph: mon1 172.22.111.173:6789 socket error on read
> > [ 4158.688709] libceph: mon0 172.22.111.172:6789 feature set mismatch, my
> > 384a042a42 < server's 4384a042a42, missing 400
> > [ 4158.688750] libceph: mon0 172.22.111.172:6789 socket error on read
> > [ 4168.704629] libceph: mon1 172.22.111.173:6789 feature set mismatch, my
> > 384a042a42 < server's 4384a042a42, missing 400
> > [ 4168.704674] libceph: mon1 172.22.111.173:6789 socket error on read
> > [ 4178.721313] libceph: mon2 172.22.111.174:6789 feature set mismatch, my
> > 384a042a42 < server's 4384a042a42, missing 400
> > [ 4178.721396] libceph: mon2 172.22.111.174:6789 socket error on read
> > [ 4188.736345] libceph: mon1 172.22.111.173:6789 feature set mismatch, my
> > 384a042a42 < server's 4384a042a42, missing 400
> > [ 4188.736383] libceph: mon1 172.22.111.173:6789 socket error on read
> >
> >
> > sudo ceph -v
> > ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
> >
> > hdu@ceph-mon2:~$ uname -a
> > Linux ceph-mon2 3.14.0-031400-generic #201403310035 SMP Mon Mar 31 04:36:23
> > UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
> >
> > can you help me ?
> 
> Hi Yanjun,
> 
> Please disregard Chengwei's message.  The page he linked to clearly
> states that hashpspool is supported in kernels 3.9 and above...
> 
> Your CRUSH tunables are set to jewel (tunables5).  As per [1],
> tunables5 are supported starting with kernel 4.5.  You need to either
> upgrade your kernel or set your tunables to legacy with

Ah, thanks Ilya for correcting me.

> 
> $ ceph osd crush tunables legacy
> 
> [1] 
> http://docs.ceph.com/docs/master/rados/operations/crush-map/#which-client-versions-support-crush-tunables5
> 
> Thanks,
> 
> Ilya
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

-- 
Thanks,
Chengwei


signature.asc
Description: Digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd image features supported by which kernel version?

2016-08-16 Thread Chengwei Yang
On Tue, Aug 16, 2016 at 10:46:37AM +0200, Ilya Dryomov wrote:
> On Tue, Aug 16, 2016 at 4:06 AM, Chengwei Yang
>  wrote:
> > On Mon, Aug 15, 2016 at 03:27:50PM +0200, Ilya Dryomov wrote:
> >> On Mon, Aug 15, 2016 at 9:54 AM, Chengwei Yang
> >>  wrote:
> >> > Hi List,
> >> >
> >> > I read from ceph document[1] that there are several rbd image features
> >> >
> >> >   - layering: layering support
> >> >   - striping: striping v2 support
> >> >   - exclusive-lock: exclusive locking support
> >> >   - object-map: object map support (requires exclusive-lock)
> >> >   - fast-diff: fast diff calculations (requires object-map)
> >> >   - deep-flatten: snapshot flatten support
> >> >   - journaling: journaled IO support (requires exclusive-lock)
> >> >
> >> > But I didn't found any document/blog/google tells these features 
> >> > supported since
> >> > which kernel version.
> >>
> >> No released kernel currently supports these features.  exclusive-lock
> >> is staged for 4.9, we are working on staging object-map and fast-diff.
> >
> > Thanks Ilya
> >
> > Since object-map, fast-diff, journaling are depend on exclusive-lock, so I
> > expect them will be available after exclusive-lock.
> >
> > And I verified that layering is supported fine by centos kernel 3.10.0-327 
> > while
> > neither striping nor deep-flatten is supported, do you know which vanilla
> > kernel
> > version started to support rbd striping and deep-flatten?
> 
> Right, I should have been more specific - layering has been supported
> for a long time (3.10+), so I crossed it off.
> 
> None of the features except layering are currenty supported by the
> kernel client.  It will let you map an image with the striping feature
> enabled, but only if stripe_unit == object_size and stripe_count == 1
> (the default striping pattern).
> 
> exclusive-lock, object-map and fast-diff are being worked on.
> striping, deep-flatten and journaling are farther away.

Thanks Ilya, much appreciate!

> 
> Thanks,
> 
> Ilya

-- 
Thanks,
Chengwei


signature.asc
Description: Digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd image features supported by which kernel version?

2016-08-16 Thread Ilya Dryomov
On Tue, Aug 16, 2016 at 4:06 AM, Chengwei Yang
 wrote:
> On Mon, Aug 15, 2016 at 03:27:50PM +0200, Ilya Dryomov wrote:
>> On Mon, Aug 15, 2016 at 9:54 AM, Chengwei Yang
>>  wrote:
>> > Hi List,
>> >
>> > I read from ceph document[1] that there are several rbd image features
>> >
>> >   - layering: layering support
>> >   - striping: striping v2 support
>> >   - exclusive-lock: exclusive locking support
>> >   - object-map: object map support (requires exclusive-lock)
>> >   - fast-diff: fast diff calculations (requires object-map)
>> >   - deep-flatten: snapshot flatten support
>> >   - journaling: journaled IO support (requires exclusive-lock)
>> >
>> > But I didn't found any document/blog/google tells these features supported 
>> > since
>> > which kernel version.
>>
>> No released kernel currently supports these features.  exclusive-lock
>> is staged for 4.9, we are working on staging object-map and fast-diff.
>
> Thanks Ilya
>
> Since object-map, fast-diff, journaling are depend on exclusive-lock, so I
> expect them will be available after exclusive-lock.
>
> And I verified that layering is supported fine by centos kernel 3.10.0-327 
> while
> neither striping nor deep-flatten is supported, do you know which vanilla
> kernel
> version started to support rbd striping and deep-flatten?

Right, I should have been more specific - layering has been supported
for a long time (3.10+), so I crossed it off.

None of the features except layering are currenty supported by the
kernel client.  It will let you map an image with the striping feature
enabled, but only if stripe_unit == object_size and stripe_count == 1
(the default striping pattern).

exclusive-lock, object-map and fast-diff are being worked on.
striping, deep-flatten and journaling are farther away.
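
In practice, for an image that needs to be mapped with krbd today, something
like this (the pool and image names are placeholders):

$ rbd create mypool/myimage --size 10240 --image-feature layering
$ sudo rbd map mypool/myimage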

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] what happen to the OSDs if the OS disk dies?

2016-08-16 Thread Félix Barbeira
Thanks everybody for the answers, it really helped me a lot. So, to sum up,
these are the options that I have:


   - OS in a RAID1.
  - PROS: the cluster is protected against OS failures. If one of these
  disks fails, it can easily be replaced because it is hot-swappable.
  - CONS: we are "wasting" 2 disk bays that could be dedicated to OSDs.

* In the case of the R730xd we have the option to put 2x2.5" SSD disks in the
slots on the back, like Brian says. For me this is clearly the best option.
We'll see if the department of finance has the same opinion :)

   - OS on a single disk.
  - PROS: we are only using 1 disk slot. It could be a cheaper disk than
  the 4TB model because we are only going to use ~10GB.
  - CONS: the OS is not protected against failures, and if this disk
  fails, the OSDs in this machine (11) fail too. In this case we might try
  to adjust the configuration so that all of this OSD data is not
  reconstructed, and wait until the OS disk is replaced (I'm not sure if
  this is possible, I should check the docs).
   - OS on a SATADOM ( http://www.innodisk.com/intel/product.html )
  - PROS: we have all the disk slots available to use for OSDs.
  - CONS: I have no experience with this kind of device, and I'm not sure
  whether they are trustworthy. These devices are fast but they are not
  RAID protected; it's a single point of failure like the previous option.
   - OS boot from a SAN (this is the option I'm considering for the non
   R730xd machines, which do not have the 2x2.5" slots on the back).
  - PROS: all the disk slots are available for OSDs. The OS disk is
  protected by RAID on the remote storage.
  - CONS: we depend on the network. I guess the OS device does not
  require a lot of traffic; all the ceph OSD network traffic should be
  handled through another network card.

Maybe I'm missing some other option; in that case please tell me, it would
be helpful.

It would be really helpful if somebody with experience booting the OS from a
SAN could share their pros/cons, because that option is very interesting to
me.


2016-08-14 14:57 GMT+02:00 Christian Balzer :

>
> Hello,
>
> I shall top-quote, summarize here.
>
> Firstly we have to consider that Ceph is deployed by people with a wide
> variety of needs, budgets and most of all cluster sizes.
>
> Wido has the pleasure (or is that nightmare? ^o^) to deal with a really
> huge cluster, thousands of OSDs and an according larg number of nodes (if
> memory serves me).
>
> While many others have comparatively small clusters, with decisively less
> than 10 storage nodes, like me.
>
> So the approach and philosophy is obviously going to differ quite a bit
> on either end of this spectrum.
>
> If you start large (dozens of nodes and hundreds of OSDs), where only a
> small fraction of your data (10% or less) is in a failure domain (host
> initially), then you can play fast and loose and save a lot of money by
> designing your machines and infrastructure accordingly.
> Things like redundant OS drives, PSUs, even network links on the host if
> the cluster big enough.
> In a cluster of sufficient size, a node failure and the resulting data
> movements is just background noise.
>
> OTOH with smaller clusters, you obviously want to avoid failures if at all
> possible, since not only the re-balancing is going to be more painful, but
> the resulting smaller cluster will also have less performance.
> This is why my OSD nodes have all the redundancy bells and whistles there
> are, simply because a cluster big enough to not need them would be both
> vastly more expensive despite cheaper individual node costs and also
> underutilized.
>
> Of course if you should grow to a certain point, maybe your next
> generation of OSD nodes can be build on the cheap w/o compromising safe
> operations.
>
> No matter what size your cluster is though, setting
> "mon_osd_down_out_subtree_limit" to an appropriate value (host for small
> clusters) is a good way to avoid re-balancing storms when a node (or some
> larger segment) goes down, given that recovering the failed part can be
> significantly faster than moving tons of data around.
> This of course implies 24/7 monitoring and access to the HW.
>
>
> As for dedicated MONs, I usually try to have the primary MON (lowest IP)
> on dedicated HW and to be sure that MONs residing on OSD nodes have fast
> storage and enough CPU/RAM to be happy even if the OSDs go on full spin.
>
> Which incidentally is why your shared MONs are likely a better fit for a
> HDD based OSD node than a SSD based one used for a cache pool for example.
>
> Anyway, MONs are clearly candidates for having their OS (where /var/lib
> resides) on RAIDed, hot-swappable fast and durable and power-loss safe
> SSDs, just so you can avoid loosing one and having to shut down the whole
> thing in the (unlikely) case of a SSD failure.
>
>
> Regards,
>
> 
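
A sketch of the ceph.conf fragment for the mon_osd_down_out_subtree_limit
suggestion quoted above (value and spelling as I understand them; it may need
a mon restart or injectargs to take effect):

[mon]
mon osd down out subtree limit = host

or, at runtime:

ceph tell mon.* injectargs '--mon-osd-down-out-subtree-limit=host'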

Re: [ceph-users] ceph map error

2016-08-16 Thread Ilya Dryomov
On Tue, Aug 16, 2016 at 5:18 AM, Yanjun Shen  wrote:
> hi,
>when i run cep map -p pool rbd test, error
> hdu@ceph-mon2:~$ sudo rbd map -p rbd test
> rbd: sysfs write failed
> In some cases useful info is found in syslog - try "dmesg | tail" or so.
> rbd: map failed: (5) Input/output error
>
> dmesg |tail
> [ 4148.672530] libceph: mon1 172.22.111.173:6789 feature set mismatch, my
> 384a042a42 < server's 4384a042a42, missing 400
> [ 4148.672576] libceph: mon1 172.22.111.173:6789 socket error on read
> [ 4158.688709] libceph: mon0 172.22.111.172:6789 feature set mismatch, my
> 384a042a42 < server's 4384a042a42, missing 400
> [ 4158.688750] libceph: mon0 172.22.111.172:6789 socket error on read
> [ 4168.704629] libceph: mon1 172.22.111.173:6789 feature set mismatch, my
> 384a042a42 < server's 4384a042a42, missing 400
> [ 4168.704674] libceph: mon1 172.22.111.173:6789 socket error on read
> [ 4178.721313] libceph: mon2 172.22.111.174:6789 feature set mismatch, my
> 384a042a42 < server's 4384a042a42, missing 400
> [ 4178.721396] libceph: mon2 172.22.111.174:6789 socket error on read
> [ 4188.736345] libceph: mon1 172.22.111.173:6789 feature set mismatch, my
> 384a042a42 < server's 4384a042a42, missing 400
> [ 4188.736383] libceph: mon1 172.22.111.173:6789 socket error on read
>
>
> sudo ceph -v
> ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
>
> hdu@ceph-mon2:~$ uname -a
> Linux ceph-mon2 3.14.0-031400-generic #201403310035 SMP Mon Mar 31 04:36:23
> UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
>
> can you help me ?

Hi Yanjun,

Please disregard Chengwei's message.  The page he linked to clearly
states that hashpspool is supported in kernels 3.9 and above...

Your CRUSH tunables are set to jewel (tunables5).  As per [1],
tunables5 are supported starting with kernel 4.5.  You need to either
upgrade your kernel or set your tunables to legacy with

$ ceph osd crush tunables legacy
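
For what it's worth, a quick way to see what the cluster is currently
advertising before and after the change (a sketch; the output format varies
between releases):

$ ceph osd crush show-tunables

Once the tunables are set to legacy, the rbd map attempt above can simply be
retried.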

[1] 
http://docs.ceph.com/docs/master/rados/operations/crush-map/#which-client-versions-support-crush-tunables5

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com