Re: [ceph-users] Old CEPH (0.87) cluster degradation - putting OSDs down one by one

2016-03-09 Thread maxxik
Hi Jan,

Yes - that's exactly the case: the FS on the OSDs was corrupted, but it was not
the Intel SATA/SAS driver:

Hardware :  3x Serial Attached SCSI controller: LSI Logic / Symbios
Logic SAS2308 PCI-Express Fusion-MPT SAS-2 (rev 05)

and mpt2sas

However, in our case it happened almost at once (there was hardly any delay
between removing the old drive and inserting the new one).
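
For reference, the kind of checks that showed the corruption here (a sketch only
- device names are examples, and xfs_repair is run in no-modify mode with the
OSD stopped):

  dmesg | grep -iE 'mpt2sas|i/o error|xfs'           # controller / filesystem errors
  grep -i xfs /var/log/kern.log                      # messages around the crash time
  xfs_repair -n /dev/sdX1                            # read-only check of an OSD data partition
  ceph osd set noscrub ; ceph osd set nodeep-scrub   # stop-gap while investigating,
                                                     # since the assert fires during scrub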

Max

On 27/02/2016 11:54 PM, Jan Schermer wrote:
> Anything in dmesg/kern.log at the time this happened?
>
>  0> 2016-02-26 23:20:47.931236 7f9434e0f700 -1 *** Caught signal
> (Aborted) **
>
> I think your filesystem was somehow corrupted.
>
> And regarding this: "2. Physical HDD replaced and NOT added to CEPH - here we
> had a strange kernel crash just after the HDD was connected to the
> controller."
> What are the drives connected to? We have had problems with the Intel
> SATA/SAS driver. You can hotplug a drive, but if you remove one
> and put in another the kernel crashes (it only happens if some time
> passes between those two actions, which makes it very nasty).
>
> Jan
>
>
>
>> On 27 Feb 2016, at 00:14, maxxik wrote:
>>
>> Hi Cephers
>>
>> At the moment we are trying to recover our CEPH cluster (0.87), which
>> is behaving very oddly.
>>
>> What has been done:
>>
>> 1. An OSD drive failure happened - CEPH put the OSD down and out.
>> 2. The physical HDD was replaced and NOT added to CEPH - here we had a strange
>> kernel crash just after the HDD was connected to the controller.
>> 3. The physical host was rebooted.
>> 4. CEPH started recovery and began putting OSDs down one by one
>> (I can actually see the osd process crash in the logs).
>>
>> ceph.conf is attached.
>>
>>
>> OSD failure :
>>
>> -4> 2016-02-26 23:20:47.906443 7f942b4b6700  5 -- op tracker --
>> seq: 471061, time: 2016-02-26 23:20:47.906404, event: header_read,
>> op: pg_backfill(progress 13.77 e 183964/183964 lb
>> 45e69877/rb.0.25e43.6b8b4567.2c3b/head//13)
>> -3> 2016-02-26 23:20:47.906451 7f942b4b6700  5 -- op tracker --
>> seq: 471061, time: 2016-02-26 23:20:47.906406, event: throttled,
>> op: pg_backfill(progress 13.77 e 183964/183964 lb
>> 45e69877/rb.0.25e43.6b8b4567.2c3b/head//13)
>> -2> 2016-02-26 23:20:47.906456 7f942b4b6700  5 -- op tracker --
>> seq: 471061, time: 2016-02-26 23:20:47.906421, event: all_read,
>> op: pg_backfill(progress 13.77 e 183964/183964 lb
>> 45e69877/rb.0.25e43.6b8b4567.2c3b/head//13)
>> -1> 2016-02-26 23:20:47.906462 7f942b4b6700  5 -- op tracker --
>> seq: 471061, time: 0.00, event: dispatched, op:
>> pg_backfill(progress 13.77 e 183964/183964 lb
>> 45e69877/rb.0.25e43.6b8b4567.2c3b/head//13)
>>  0> 2016-02-26 23:20:47.931236 7f9434e0f700 -1 *** Caught signal
>> (Aborted) **
>>  in thread 7f9434e0f700
>>
>>  ceph version 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578)
>>  1: /usr/bin/ceph-osd() [0x9e2015]
>>  2: (()+0xfcb0) [0x7f945459fcb0]
>>  3: (gsignal()+0x35) [0x7f94533d30d5]
>>  4: (abort()+0x17b) [0x7f94533d683b]
>>  5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f9453d2569d]
>>  6: (()+0xb5846) [0x7f9453d23846]
>>  7: (()+0xb5873) [0x7f9453d23873]
>>  8: (()+0xb596e) [0x7f9453d2396e]
>>  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> const*)+0x259) [0xacb979]
>>  10: (SnapSet::get_clone_bytes(snapid_t) const+0x15f) [0x732c0f]
>>  11: (ReplicatedPG::_scrub(ScrubMap&)+0x10c4) [0x7f5e54]
>>  12: (PG::scrub_compare_maps()+0xcb6) [0x7876e6]
>>  13: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x1c3) [0x7880b3]
>>  14: (PG::scrub(ThreadPool::TPHandle&)+0x33d) [0x789abd]
>>  15: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle&)+0x13) [0x67ccf3]
>>  16: (ThreadPool::worker(ThreadPool::WorkThread*)+0x48e) [0xabb3ce]
>>  17: (ThreadPool::WorkThread::entry()+0x10) [0xabe160]
>>  18: (()+0x7e9a) [0x7f9454597e9a]
>>  19: (clone()+0x6d) [0x7f94534912ed]
>>  NOTE: a copy of the executable, or `objdump -rdS ` is
>> needed to interpret this.
>>
>> --- logging levels ---
>>0/ 5 none
>>0/ 1 lockdep
>>0/ 1 context
>>1/ 1 crush
>>1/ 5 mds
>>1/ 5 mds_balancer
>>1/ 5 mds_locker
>>1/ 5 mds_log
>>1/ 5 mds_log_expire
>>1/ 5 mds_migrator
>>0/ 1 buffer
>>0/ 1 timer
>>0/ 1 filer
>>0/ 1 striper
>>0/ 1 objecter
>>0/ 5 rados
>>0/ 5 rbd
>>0/ 5 rbd_replay
>>0/ 5 journaler
>>0/ 5 objectcacher
>>0/ 5 client
>>0/ 5 osd
>>0/ 5 optracker
>>0/ 5 objclass
>>1/ 3 filestore
>>1/ 3 keyvaluestore
>>1/ 3 journal
>>0/ 5 ms
>>1/ 5 mon
>>0/10 monc
>>1/ 5 paxos
>>0/ 5 tp
>>1/ 5 auth
>>1/ 5 crypto
>>1/ 1 finisher
>>1/ 5 heartbeatmap
>>1/ 5 perfcounter
>>1/ 5 rgw
>>1/10 civetweb
>>1/ 5 javaclient
>>1/ 5 asok
>>1/ 1 throttle
>>0/ 0 refs
>>   -1/-1 (syslog threshold)
>>   -1/-1 (stderr threshold)
>>   max_recent 1
>>   max_new 1000
>>   log_file /var/log/ceph/ceph-osd.27.log
>>
>>
>> Current OSD tree:
>>
>>
>> # id  weight  type name       up/down  reweight
>> -10 2   root ssdtree
>> -8  1 

Re: [ceph-users] how ceph osd handle ios sent from crashed ceph client

2016-03-09 Thread Ric Wheeler

On 03/08/2016 08:09 PM, Jason Dillaman wrote:

librbd provides crash-consistent IO.  It is still up to your application to 
provide its own consistency by adding barriers (flushes) where necessary.  If 
you flush your IO, once that flush completes you are guaranteed that your 
previous IO is safely committed to disk.



Jeff Moyer wrote a good article on how applications can manage data durability 
for lwn.net a few years back - still worth reading:


https://lwn.net/Articles/457667

This is more focused on applications on top of file systems, but is still 
relevant for applications running on block devices.
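
As a tiny concrete illustration (a sketch - the mount point is an example,
assuming an RBD image is mapped and mounted at /mnt/rbd):

  dd if=/dev/zero of=/mnt/rbd/testfile bs=1M count=100 conv=fsync
  # conv=fsync makes dd call fsync() before exiting, so the data (and the
  # corresponding RBD writes) are known to be flushed; without it, a crash
  # right after dd returns could still lose data sitting in caches.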


Regards,

Ric



Re: [ceph-users] New added OSD always down when full flag of osdmap is set

2016-03-09 Thread hzwulibin
Sure, here it is:
# ceph osd tree
ID WEIGHT  TYPE NAME       UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0.04997 root default
-2 0.04997     host ceph01
 0 0.00999         osd.0        up  1.0              1.0
 1 0.00999         osd.1      down    0              1.0
 3 0.00999         osd.3        up  1.0              1.0
 4 0.00999         osd.4      down    0              1.0
 2 0.00999         osd.2      down    0              1.0


root@ceph01:~# ceph osd df
ID WEIGHT  REWEIGHT SIZE   USE    AVAIL %USE  VAR
 0 0.00999 1.0      15348M 14920M  428M 97.21 1.00
 1 0.00999 0             0      0     0     0    0
 3 0.00999 1.0      15348M 14918M  430M 97.19 1.00
 4 0.00999 0             0      0     0     0    0
 2 0.00999 0             0      0     0     0    0
  TOTAL             30697M 29839M  858M 97.20
MIN/MAX VAR: 0/1.00  STDDEV: 0.01


root@ceph01:~# ceph df
GLOBAL:
SIZE   AVAIL RAW USED %RAW USED 
30697M  858M   29839M 97.20 
POOLS:
    NAME ID USED   %USED MAX AVAIL OBJECTS
    rbd  0  15360M 50.04     1070M    3842


And the size (number of replicas) of the pool is 2; min_size is 1.
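
A possible workaround, sketched below - temporarily raising the full ratios so
peering/backfill can proceed; the values are only examples and should be
restored to the defaults afterwards, since the OSDs really are nearly full:

# ceph pg set_full_ratio 0.98
# ceph pg set_nearfull_ratio 0.90
# ceph health detail | grep -i full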
--   
hzwulibin
2016-03-10

-
From: Shinobu Kinjo
Date: 2016-03-10 11:28
To: hzwulibin
Cc: ceph-devel, ceph-users
Subject: Re: [ceph-users] New added OSD always down when full flag of osdmap is set

Can you provide us with:

 sudo ceph osd tree
 sudo ceph osd df
 sudo ceph df

Cheers,
S

On Thu, Mar 10, 2016 at 11:58 AM, hzwulibin  wrote:
> No, just 98%.
>
> Another scenario: if backfill is too full (backfill_toofull), we also could not bring the OSD in and up.
>
> --
> hzwulibin
> 2016-03-10
>
> -
> From: Shinobu Kinjo
> Date: 2016-03-10 10:49
> To: hzwulibin
> Cc: ceph-devel, ceph-users
> Subject: Re: [ceph-users] New added OSD always down when full flag of osdmap is set
>
> On Thu, Mar 10, 2016 at 11:37 AM, hzwulibin  wrote:
>> Hi, cephers
>>
>> Recently, I found that we could not add a new OSD to the cluster when the full
>> flag is set in the osdmap.
>>
>> Briefly, the scenario is:
>> 1. Some OSDs are full, and the osdmap has the full flag set
>
> Is the usage of those OSDs really 100%? That would be unexpected.
>
>> 2. Add a new OSD
>> 3. The new osd service is running, but its state is always down
>>
>> Here is the issue:
>> http://tracker.ceph.com/issues/15025
>>
>> Does anyone know whether this is expected behaviour or a bug?
>>
>> --
>> hzwulibin
>> 2016-03-10
>
>
>
> --
> Email:
> shin...@linux.com
> GitHub:
> shinobu-x
> Blog:
> Life with Distributed Computational System based on OpenSource
>



-- 
Email:
shin...@linux.com
GitHub:
shinobu-x
Blog:
Life with Distributed Computational System based on OpenSource



Re: [ceph-users] New added OSD always down when full flag of osdmap is set

2016-03-09 Thread Shinobu Kinjo
Can you provide us with:

 sudo ceph osd tree
 sudo ceph osd df
 sudo ceph df

Cheers,
S

On Thu, Mar 10, 2016 at 11:58 AM, hzwulibin  wrote:
> No, just 98%.
>
> Another scenario: if backfill is too full (backfill_toofull), we also could not bring the OSD in and up.
>
> --
> hzwulibin
> 2016-03-10
>
> -
> From: Shinobu Kinjo
> Date: 2016-03-10 10:49
> To: hzwulibin
> Cc: ceph-devel, ceph-users
> Subject: Re: [ceph-users] New added OSD always down when full flag of osdmap is set
>
> On Thu, Mar 10, 2016 at 11:37 AM, hzwulibin  wrote:
>> Hi, cephers
>>
>> Recently, I found that we could not add a new OSD to the cluster when the full
>> flag is set in the osdmap.
>>
>> Briefly, the scenario is:
>> 1. Some OSDs are full, and the osdmap has the full flag set
>
> Is the usage of those OSDs really 100%? That would be unexpected.
>
>> 2. Add a new OSD
>> 3. The new osd service is running, but its state is always down
>>
>> Here is the issue:
>> http://tracker.ceph.com/issues/15025
>>
>> Does anyone know whether this is expected behaviour or a bug?
>>
>> --
>> hzwulibin
>> 2016-03-10
>
>
>
> --
> Email:
> shin...@linux.com
> GitHub:
> shinobu-x
> Blog:
> Life with Distributed Computational System based on OpenSource
>



-- 
Email:
shin...@linux.com
GitHub:
shinobu-x
Blog:
Life with Distributed Computational System based on OpenSource


Re: [ceph-users] New added OSD always down when full flag of osdmap is set

2016-03-09 Thread Shinobu Kinjo
On Thu, Mar 10, 2016 at 11:37 AM, hzwulibin  wrote:
> Hi, cephers
>
> Recently, I found that we could not add a new OSD to the cluster when the full
> flag is set in the osdmap.
>
> Briefly, the scenario is:
> 1. Some OSDs are full, and the osdmap has the full flag set

Is the usage of those OSDs really 100%? That would be unexpected.

> 2. Add a new OSD
> 3. The new osd service is running, but its state is always down
>
> Here is the issue:
> http://tracker.ceph.com/issues/15025
>
> Does anyone know whether this is expected behaviour or a bug?
>
> --
> hzwulibin
> 2016-03-10



-- 
Email:
shin...@linux.com
GitHub:
shinobu-x
Blog:
Life with Distributed Computational System based on OpenSource


[ceph-users] New added OSD always down when full flag of osdmap is set

2016-03-09 Thread hzwulibin
Hi, cephers

Recently, I found that we could not add a new OSD to the cluster when the full
flag is set in the osdmap.

Briefly, the scenario is:
1. Some OSDs are full, and the osdmap has the full flag set
2. Add a new OSD
3. The new osd service is running, but its state is always down

Here is the issue:
http://tracker.ceph.com/issues/15025

Does anyone know whether this is expected behaviour or a bug?

--
hzwulibin
2016-03-10


Re: [ceph-users] v0.94.6 Hammer released

2016-03-09 Thread Chris Dunlop
Hi Loic,

On Wed, Mar 02, 2016 at 06:32:18PM +0700, Loic Dachary wrote:
> I think you misread what Sage wrote : "The intention was to
> continue building stable releases (0.94.x) on the old list of
> supported platforms (which includes 12.04 and el6)". In other
> words, the old OSes are still supported. Their absence is a
> glitch in the release process that will be fixed.

Any news on a release of v0.94.6 for debian wheezy?

Cheers,

Chris


[ceph-users] uncompiled crush map for ceph-rest-api /osd/crush/set

2016-03-09 Thread Jared Watts
Hi, I have a few questions about the usage of ceph-rest-api's /osd/crush/set.

Documentation obtained from GET /api/v0.1/osd:

osd/crush/set PUT "set crush map from input file"

1) Does the input crush map have to be compiled before using this API or can an 
uncompiled map be used?
2) Is there anything in the ceph-rest-api API to compile a crush map?

I can successfully set a compiled crush map, but I get an error using an
uncompiled one, which seems to indicate it must be compiled first. Below is
my attempt with curl to use an uncompiled map; it gets 400 Bad Request:
"Error: Failed to parse crushmap: buffer::malformed_input: bad magic number
(-22)".

Is there a way to do this with an uncompiled map?
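
For what it's worth, compiling the map locally first does work for me - a
sketch, assuming the text map is at /tmp/crushmap-uncompiled:

  crushtool -c /tmp/crushmap-uncompiled -o /tmp/crushmap-compiled
  curl -XPUT --data-binary "@/tmp/crushmap-compiled" -H "Content-type: application/octet-stream" '0.0.0.0:53279/api/v0.1/osd/crush/set'

The question is whether that crushtool step can be avoided.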


curl -iv -XPUT --data-binary "@/tmp/crushmap-uncompiled" -H "Accept: application/json" -H "Content-type: application/octet-stream" '0.0.0.0:53279/api/v0.1/osd/crush/set'

*   Trying 0.0.0.0...
* Connected to 0.0.0.0 (127.0.0.1) port 53279 (#0)
> PUT /api/v0.1/osd/setcrushmap HTTP/1.1
> User-Agent: curl/7.38.0
> Host: 0.0.0.0:53279
> Accept: application/json
> Content-type: application/octet-stream
> Content-Length: 1297
> Expect: 100-continue
>
< HTTP/1.1 100 Continue
HTTP/1.1 100 Continue
* HTTP 1.0, assume close after body
< HTTP/1.0 400 BAD REQUEST
HTTP/1.0 400 BAD REQUEST
< Content-Type: text/html; charset=utf-8
Content-Type: text/html; charset=utf-8
< Content-Length: 108
Content-Length: 108
* Server Werkzeug/0.9.6 Python/2.7.9 is not blacklisted
< Server: Werkzeug/0.9.6 Python/2.7.9
Server: Werkzeug/0.9.6 Python/2.7.9
< Date: Wed, 09 Mar 2016 22:45:57 GMT
Date: Wed, 09 Mar 2016 22:45:57 GMT
<
* Closing connection 0
{"status": "Error: Failed to parse crushmap: buffer::malformed_input: bad magic number (-22)", "output": []}

Thanks for any help!




Re: [ceph-users] Recovering a secondary replica from another secondary replica

2016-03-09 Thread Gregory Farnum
On Wed, Mar 9, 2016 at 1:57 PM, Александр Шишенко  wrote:
> Well, my aim is to make replicas request chunks of data from each
> other during the recovery process, but not from primary. So, let's
> consider the three-osd cluster with a PG residing on it:
> node1(primary), node2(replica), node3(replica), 3in/3up active+clean
>
> Now, let's shut down node3.
> node1(primary), node2(replica), node3(replica), 3in/2up
> active+undersized+degraded
>
> After node3 comes back online, recovery starts and node1 sends
> chunks of data to node3. Is there a way to make node2 send these chunks
> instead of node1 without making node2 a primary?

Oh, I see. I think you're missing that there are many PGs on each OSD,
and you will generally be recovering more than one PG at a time. So
rather than doing the (extreme and difficult) bookkeeping to allow
recovery directly via replicas, we count on having PGs whose primary
is different to distribute that work. :)
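
A quick way to see this on a live cluster (a sketch - the PG and OSD ids are
just examples):

  ceph pg map 13.77              # up/acting sets and the primary for one PG
  ceph pg ls-by-osd osd.1        # all PGs that have a replica on osd.1
  ceph pg ls-by-primary osd.1    # only the PGs for which osd.1 is primary

Comparing the last two shows that only a fraction of the PGs on an OSD have
that OSD as their primary.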
-Greg

>
> 2016-03-10 0:42 GMT+03:00 Gregory Farnum :
>> On Wed, Mar 9, 2016 at 2:21 AM, Александр Шишенко wrote:
>>> Hello,
>>>
>>> I have a development cluster of three OSD's. My aim is to make a
>>> secondary replica recover from another secondary replica (not
>>> primary). Is it possible to do so with minor changes to ceph-osd
>>> source code? Are ceph algorithms compatible with these changes?
>>
>> I'm not sure what you mean by these statements. While we have a
>> primary and replicas in Ceph, that status is ephemeral — if the
>> primary disappears, one of the replicas will become primary. Recovery
>> is orchestrated and largely involves whichever node is currently the
>> primary, but if the primary is unsuitable for that it maps one of the
>> secondary replicas to be primary for the duration of the recovery...
>> -Greg


Re: [ceph-users] Recovering a secondary replica from another secondary replica

2016-03-09 Thread Александр Шишенко
Well, my aim is to make replicas request chunks of data from each
other during the recovery process, but not from primary. So, let's
consider the three-osd cluster with a PG residing on it:
node1(primary), node2(replica), node3(replica), 3in/3up active+clean

Now, let's shut down node3.
node1(primary), node2(replica), node3(replica), 3in/2up
active+undersized+degraded

After node3 comes back online, recovery starts and node1 sends
chunks of data to node3. Is there a way to make node2 send these chunks
instead of node1 without making node2 a primary?

2016-03-10 0:42 GMT+03:00 Gregory Farnum :
> On Wed, Mar 9, 2016 at 2:21 AM, Александр Шишенко  wrote:
>> Hello,
>>
>> I have a development cluster of three OSD's. My aim is to make a
>> secondary replica recover from another secondary replica (not
>> primary). Is it possible to do so with minor changes to ceph-osd
>> source code? Are ceph algorithms compatible with these changes?
>
> I'm not sure what you mean by these statements. While we have a
> primary and replicas in Ceph, that status is ephemeral — if the
> primary disappears, one of the replicas will become primary. Recovery
> is orchestrated and largely involves whichever node is currently the
> primary, but if the primary is unsuitable for that it maps one of the
> secondary replicas to be primary for the duration of the recovery...
> -Greg


[ceph-users] Announcing new download mirrors for Ceph

2016-03-09 Thread Wido den Hollander
Hi,

In the past few months I have been working with various people and companies in
the Ceph community to get more mirrors online.

Thanks to those people and companies I am happy to announce that we have new
international locations where you can download Ceph!

Currently the main download location for Ceph is download.ceph.com in the US and
eu.ceph.com in the EU (Netherlands).

As of today multiple new mirrors are available to download Ceph tarballs and
packages from. This ensures that Ceph can be downloaded from a location near
your datacenter(s).

The new locations:
* au.ceph.com: Australia
* cz.ceph.com: Czech Republic
* de.ceph.com: Germany
* se.ceph.com: Sweden
* hk.ceph.com: Hong Kong
* fr.ceph.com: France
* us-east.ceph.com: US East Coast

You can find a list of mirrors in the Ceph documentation:
http://docs.ceph.com/docs/master/install/mirrors/

Using the mirrors is simple. Just replace any URL for download.ceph.com with the
hostname of the mirror.
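
For example, on an apt-based system (a sketch - this assumes the usual
/etc/apt/sources.list.d/ceph.list entry and the hammer/trusty repository;
adjust the release and distro names to your setup):

  # before:  deb http://download.ceph.com/debian-hammer/ trusty main
  # after:   deb http://eu.ceph.com/debian-hammer/ trusty main
  sed -i 's#download.ceph.com#eu.ceph.com#g' /etc/apt/sources.list.d/ceph.list
  apt-get update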

With the addition of the new mirrors there are also some new scripts to easily
mirror Ceph to your local repositories:
https://github.com/ceph/ceph/tree/master/mirroring

All the Ceph mirrors use the same Bash/rsync script to mirror Ceph.

Should there be any questions, please let me know!

Wido


Re: [ceph-users] Recovering a secondary replica from another secondary replica

2016-03-09 Thread Gregory Farnum
On Wed, Mar 9, 2016 at 2:21 AM, Александр Шишенко  wrote:
> Hello,
>
> I have a development cluster of three OSD's. My aim is to make a
> secondary replica recover from another secondary replica (not
> primary). Is it possible to do so with minor changes to ceph-osd
> source code? Are ceph algorithms compatible with these changes?

I'm not sure what you mean by these statements. While we have a
primary and replicas in Ceph, that status is ephemeral — if the
primary disappears, one of the replicas will become primary. Recovery
is orchestrated and largely involves whichever node is currently the
primary, but if the primary is unsuitable for that it maps one of the
secondary replicas to be primary for the duration of the recovery...
-Greg


[ceph-users] how to choose EC plugins and rulesets

2016-03-09 Thread Yoann Moulin
Hello,

We are looking for recommendations and guidelines for using erasure codes (EC)
with Ceph.

Our setup consists of 25 identical nodes which we dedicate to Ceph. Each node
contains 10 HDDs (full specs below)

We started with 10 nodes (comprising 100 OSDs) and created a pool with 3-times
replication.

In order to increase the usable capacity, we would like to go for EC instead of
replication.

- Can anybody share with us recommendations regarding the choice of plugins and
rulesets?
- In particular, how should the parameters relate to the number of nodes and OSDs?
Any formulas or rules of thumb?
- Is it possible to change rulesets live on a pool?

We currently use Infernalis but plan to move to Jewel.

- Are there any improvements in Jewel with regard to erasure codes?
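
To make the questions concrete, this is the kind of setup we are looking at (a
sketch only - the plugin, k/m values and pool name are placeholders, not a
recommendation):

  ceph osd erasure-code-profile set ec-6-3 plugin=jerasure k=6 m=3 ruleset-failure-domain=host
  ceph osd erasure-code-profile get ec-6-3
  ceph osd pool create ecpool 1024 1024 erasure ec-6-3

With failure domain "host", the k+m chunks (here 9) must land on distinct
hosts, so k+m cannot exceed the number of hosts (10 today, 25 later) - which
is part of what we are unsure about.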

Looking forward to your answers.


=

Full specs of nodes

CPU: 2 x Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Memory: 128GB of Memory
OS Storage: 2 x SSD 240GB Intel S3500 DC (raid 1)
Journal Storage: 2 x SSD 400GB Intel S3300 DC (no Raid)
OSD Disk: 10 x HGST ultrastar-7k6000 6TB
Network: 1 x 10Gb/s
OS: Ubuntu 14.04

-- 
Yoann Moulin
EPFL IC-IT


Re: [ceph-users] yum install ceph on RHEL 7.2

2016-03-09 Thread Deneau, Tom


> -Original Message-
> From: Ken Dreyer [mailto:kdre...@redhat.com]
> Sent: Tuesday, March 08, 2016 10:24 PM
> To: Shinobu Kinjo
> Cc: Deneau, Tom; ceph-users
> Subject: Re: [ceph-users] yum install ceph on RHEL 7.2
> 
> On Tue, Mar 8, 2016 at 4:11 PM, Shinobu Kinjo 
> wrote:
> > If you register subscription properly, you should be able to install
> > the Ceph without the EPEL.
> 
> The opposite is true (when installing upstream / ceph.com).
> 
> We rely on EPEL for several things, like leveldb and xmlstarlet.
> 
> - Ken

Ken --

What about when you just do yum install from the preconfigured repos?
Should EPEL be required for that?
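
In case it is useful, enabling EPEL first is straightforward (a sketch - the
URL is the standard EL7 EPEL release RPM):

  sudo yum install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
  sudo yum install -y ceph

Per your note, dependencies such as leveldb and xmlstarlet then come from EPEL
when installing the upstream ceph.com packages.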

-- Tom


Re: [ceph-users] inconsistent PG -> unfound objects on an erasure coded system

2016-03-09 Thread Jeffrey McDonald
Hi, I went back to the mon logs to see if I could elicit any additional
information about this PG.
Prior to 1/27/16, the deep-scrub on this PG passes (then I see obsolete
rollback objects found):
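
(The lines below came out of the rotated cluster logs; something like the
following reproduces the search - the paths are only an example:

  zgrep '70\.459' /var/log/ceph/ceph.log.*.gz | grep -E 'deep-scrub|repair|missing|rollback'
)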

ceph.log.4.gz:2016-01-20 09:43:36.195640 osd.307 10.31.0.67:6848/127170 538
: cluster [INF] 70.459 deep-scrub ok
ceph.log.4.gz:2016-01-27 09:51:49.952459 osd.307 10.31.0.67:6848/127170 583
: cluster [INF] 70.459 deep-scrub starts
ceph.log.4.gz:2016-01-27 10:10:57.196311 osd.108 10.31.0.69:6816/4283 335 :
cluster [ERR] osd.108 pg 70.459s5 found obsolete rollback obj
79ced459/default.724733.17__shadow_prostate/rnaseq/8e5da6e8-8881-4813-a4e3-327df57fd1b7/UNCID_2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304_UNC14-SN744_0400_AC3LWGACXX_7_GAGTGG.tar.gz.2~_1r_FGidmpEP8GRsJkNLfAh9CokxkLf.4_156/head//70/202909/5
generation < trimmed_to 130605'206504...repaired
ceph.log.4.gz:2016-01-27 10:10:57.043942 osd.307 10.31.0.67:6848/127170 584
: cluster [ERR] osd.307 pg 70.459s0 found obsolete rollback obj
79ced459/default.724733.17__shadow_prostate/rnaseq/8e5da6e8-8881-4813-a4e3-327df57fd1b7/UNCID_2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304_UNC14-SN744_0400_AC3LWGACXX_7_GAGTGG.tar.gz.2~_1r_FGidmpEP8GRsJkNLfAh9CokxkLf.4_156/head//70/202909/0
generation < trimmed_to 130605'206504...repaired
ceph.log.4.gz:2016-01-27 10:10:58.225017 osd.307 10.31.0.67:6848/127170 585
: cluster [ERR] 70.459s0 shard 4(3) missing
cffed459/default.325671.93__shadow_wrfout_d01_2005-04-18_00_00_00.2~DyHqLoH7FFV_6fz8MOzmPEVO3Td4bZx.10_82/head//70
ceph.log.4.gz:2016-01-27 10:10:58.225068 osd.307 10.31.0.67:6848/127170 586
: cluster [ERR] 70.459s0 shard 10(2) missing
cffed459/default.325671.93__shadow_wrfout_d01_2005-04-18_00_00_00.2~DyHqLoH7FFV_6fz8MOzmPEVO3Td4bZx.10_82/head//70
ceph.log.4.gz:2016-01-27 10:10:58.225088 osd.307 10.31.0.67:6848/127170 587
: cluster [ERR] 70.459s0 shard 26(1) missing
cffed459/default.325671.93__shadow_wrfout_d01_2005-04-18_00_00_00.2~DyHqLoH7FFV_6fz8MOzmPEVO3Td4bZx.10_82/head//70
ceph.log.4.gz:2016-01-27 10:10:58.225127 osd.307 10.31.0.67:6848/127170 588
: cluster [ERR] 70.459s0 shard 132(4) missing
cffed459/default.325671.93__shadow_wrfout_d01_2005-04-18_00_00_00.2~DyHqLoH7FFV_6fz8MOzmPEVO3Td4bZx.10_82/head//70
ceph.log.4.gz:2016-01-27 10:13:52.926032 osd.307 10.31.0.67:6848/127170 589
: cluster [ERR] 70.459s0 deep-scrub stat mismatch, got 21324/21323 objects,
0/0 clones, 21324/21323 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0
whiteouts, 64313094166/64308899862 bytes,0/0 hit_set_archive bytes.
ceph.log.4.gz:2016-01-27 10:13:52.927589 osd.307 10.31.0.67:6848/127170 590
: cluster [ERR] 70.459s0 deep-scrub 1 missing, 0 inconsistent objects
ceph.log.4.gz:2016-01-27 10:13:52.931250 osd.307 10.31.0.67:6848/127170 591
: cluster [ERR] 70.459 deep-scrub 5 errors
ceph.log.4.gz:2016-01-28 10:32:37.083809 osd.307 10.31.0.67:6848/127170 592
: cluster [INF] 70.459 repair starts
ceph.log.4.gz:2016-01-28 10:51:44.608297 osd.307 10.31.0.67:6848/127170 593
: cluster [ERR] osd.307 pg 70.459s0 found obsolete rollback obj
79ced459/default.724733.17__shadow_prostate/rnaseq/8e5da6e8-8881-4813-a4e3-327df57fd1b7/UNCID_2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304_UNC14-SN744_0400_AC3LWGACXX_7_GAGTGG.tar.gz.2~_1r_FGidmpEP8GRsJkNLfAh9CokxkLf.4_156/head//70/202909/0
generation < trimmed_to 130605'206504...repaired
ceph.log.4.gz:2016-01-28 10:51:45.802549 osd.307 10.31.0.67:6848/127170 594
: cluster [ERR] 70.459s0 shard 4(3) missing
cffed459/default.325671.93__shadow_wrfout_d01_2005-04-18_00_00_00.2~DyHqLoH7FFV_6fz8MOzmPEVO3Td4bZx.10_82/head//70
ceph.log.4.gz:2016-01-28 10:51:45.802933 osd.307 10.31.0.67:6848/127170 595
: cluster [ERR] 70.459s0 shard 10(2) missing
cffed459/default.325671.93__shadow_wrfout_d01_2005-04-18_00_00_00.2~DyHqLoH7FFV_6fz8MOzmPEVO3Td4bZx.10_82/head//70
ceph.log.4.gz:2016-01-28 10:51:45.802978 osd.307 10.31.0.67:6848/127170 596
: cluster [ERR] 70.459s0 shard 26(1) missing
cffed459/default.325671.93__shadow_wrfout_d01_2005-04-18_00_00_00.2~DyHqLoH7FFV_6fz8MOzmPEVO3Td4bZx.10_82/head//70
ceph.log.4.gz:2016-01-28 10:51:45.803039 osd.307 10.31.0.67:6848/127170 597
: cluster [ERR] 70.459s0 shard 132(4) missing
cffed459/default.325671.93__shadow_wrfout_d01_2005-04-18_00_00_00.2~DyHqLoH7FFV_6fz8MOzmPEVO3Td4bZx.10_82/head//70
ceph.log.4.gz:2016-01-28 10:51:44.781639 osd.108 10.31.0.69:6816/4283 338 :
cluster [ERR] osd.108 pg 70.459s5 found obsolete rollback obj
79ced459/default.724733.17__shadow_prostate/rnaseq/8e5da6e8-8881-4813-a4e3-327df57fd1b7/UNCID_2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304_UNC14-SN744_0400_AC3LWGACXX_7_GAGTGG.tar.gz.2~_1r_FGidmpEP8GRsJkNLfAh9CokxkLf.4_156/head//70/202909/5
generation < trimmed_to 130605'206504...repaired
ceph.log.4.gz:2016-01-28 11:01:18.119350 osd.26 10.31.0.103:6812/77378 2312
: cluster [INF] 70.459s1 restarting backfill on osd.305(0) from (0'0,0'0]
MAX to 130605'206506
ceph.log.4.gz:2016-02-01 13:40:55.096030 osd.307 10.31.0.67:6848/13421 16 :
cluster [INF] 70.459s0 

Re: [ceph-users] Infernalis 9.2.1 MDS crash

2016-03-09 Thread John Spray
On Wed, Mar 9, 2016 at 11:37 AM, Florent B  wrote:
> Hi John and thank you for your explanations :)
>
> It could be a network issue.
>
> MDS should respawn, but "ceph-mds" process was no more running after
> last log message, so I deduced it crashed...

Hmm, that's worth investigating.  You can induce the MDS to respawn
itself by simply doing "ceph mds fail <id>", or "ceph tell mds.<id>
respawn"

Can you play around and see if it's consistently failing to respawn,
and if you can see any extra evidence, maybe try running the MDS in
the foreground to make it easier to see any output ("ceph-mds -i <id>
-f -d")

John

>
> On 03/09/2016 12:26 PM, John Spray wrote:
>> The MDS restarted because it received an MDSMap from the monitors in
>> which its own entry had been removed.
>>
>> This is usually a sign that the MDS was failing to communicate with
>> the mons for some period of time, and as a result the mons have given
>> up and cause another MDS to take over.  However, in this instance we
>> can see the mds and mon exchanging beacons regularly.
>>
>> The last acknowledged beacon was at 2016-03-09 04:53:38.824983
>>
>> The updated mdsmap came at  04:53:56.  18 seconds shouldn't have been
>> long enough for anything to time out, unless you've changed the
>> defaults.
>>
>> I notice that the new MDSMap (epoch 573) also indicates that peer MDS
>> daemons have been failed, and that shortly before receiving the new
>> map, there are a bunch of log messages indicating various client
>> connections resetting.
>>
>> So from this log I would guess some kind of network issue?
>>
>> You say that the MDS crashed, why?  From the log it looks like it's
>> respawning itself, which shouldn't immediately be noticeable, you
>> should just see another MDS daemon take over, and a few seconds later
>> this guy would come back as a standby.
>>
>> John
>>
>> On Wed, Mar 9, 2016 at 9:55 AM, Florent B  wrote:
>>> Hi everyone,
>>>
>>> Last night one of my MDS crashed.
>>>
>>> It was running last Infernalis packaged version for Jessie.
>>>
>>> Here is last minutes log : http://paste.ubuntu.com/15333772/
>>>
>>> Does anyone have an idea of what caused the crash ?
>>>
>>> Thank you.
>>>
>>> Florent


[ceph-users] osd timeout

2016-03-09 Thread Luis Periquito
I have a cluster spread across 2 racks, with a crush rule that splits
data across those racks.

To test a failure scenario we powered off one of the racks, and
expected Ceph to continue running. Of the 56 OSDs that were powered
off, 52 were quickly set as down in the cluster (it took around 30
seconds), but the remaining 4, all on different hosts, took the 900s
with the "marked down after no pg stats for 900.118248 seconds"
message.

now for some questions:
Is it expected that some OSDs do not go down as they should? I/O was
happening to the cluster before, during and after this event...
Should we reduce the 900s timeout to a much lower value? How can we
make sure it's not too low beforehand? How low should we go?
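
(The knob behind the 900 seconds appears to be "mon osd report timeout" - a
sketch of lowering it, untested on our side; it would also need to be set in
ceph.conf on the monitors to persist across restarts:

  ceph tell mon.* injectargs '--mon_osd_report_timeout 300'
)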

After those 18mins elapsed (900 seconds) the cluster resumed all IO
and came back to life as expected.

The big issue is that having these 4 OSDs down caused almost all IO to
this cluster to stop, especially as the OSDs built up a bigger and bigger
queue of slow requests...

Another issue was during recovery: after we turned those servers back
on and started Ceph on those nodes, the cluster just ground to a
halt for a while. "osd perf" had all the numbers <100, slow requests
were at the same level on all OSDs, nodes didn't have any IOWait, CPUs
were idle, load avg < 0.5, so we couldn't find anything that pointed
to a culprit. However, one of the OSDs timed out a "tell osd.* version"
and restarting that OSD made the cluster responsive again. Any idea on
how to detect this type of situation?

This cluster is running hammer (0.94.5) and has 112 OSDs, 56 in each rack.

thanks,


Re: [ceph-users] Infernalis 9.2.1 MDS crash

2016-03-09 Thread John Spray
The MDS restarted because it received an MDSMap from the monitors in
which its own entry had been removed.

This is usually a sign that the MDS was failing to communicate with
the mons for some period of time, and as a result the mons have given
up and cause another MDS to take over.  However, in this instance we
can see the mds and mon exchanging beacons regularly.

The last acknowledged beacon was at 2016-03-09 04:53:38.824983

The updated mdsmap came at  04:53:56.  18 seconds shouldn't have been
long enough for anything to time out, unless you've changed the
defaults.

I notice that the new MDSMap (epoch 573) also indicates that peer MDS
daemons have been failed, and that shortly before receiving the new
map, there are a bunch of log messages indicating various client
connections resetting.

So from this log I would guess some kind of network issue?

You say that the MDS crashed, why?  From the log it looks like it's
respawning itself, which shouldn't immediately be noticeable, you
should just see another MDS daemon take over, and a few seconds later
this guy would come back as a standby.

John

On Wed, Mar 9, 2016 at 9:55 AM, Florent B  wrote:
> Hi everyone,
>
> Last night one of my MDS crashed.
>
> It was running last Infernalis packaged version for Jessie.
>
> Here is last minutes log : http://paste.ubuntu.com/15333772/
>
> Does anyone have an idea of what caused the crash ?
>
> Thank you.
>
> Florent


[ceph-users] rgw (infernalis docker) with hammer cluster

2016-03-09 Thread Félix Barbeira
I want to use the Ceph object gateway. The Docker container has version 9.2.1
(Infernalis) and my cluster is on the Hammer LTS release (0.94.6).

Is it possible to use the rgw Docker container (ceph/daemon rgw) with this
Hammer cluster, or might something break because of the newer version?

-- 
Félix Barbeira.


Re: [ceph-users] OSDs go down with infernalis

2016-03-09 Thread Yoann Moulin
Hello,

> If you manually create your journal partition, you need to specify the correct
> Ceph partition GUID in order for the system and Ceph to identify the partition
> as a Ceph journal and apply the correct ownership and permissions at boot via udev.

In my latest run, I let ceph-ansible create the partitions, and everything seems
to be fine.

> I used something like this to create the partition :
> sudo sgdisk --new=1:0G:15G --typecode=1:45B0969E-9B03-4F30-B4C6-B4B80CEFF106
>  --partition-guid=$(uuidgen -r) --mbrtogpt -- /dev/sda
> 
> 45B0969E-9B03-4F30-B4C6-B4B80CEFF106 being the GUID. More info on GTP GUID is
> available on wikipedia [1].
> 
> I think the issue with the label you had was linked to some bugs in the disk
> initialization process. This was discussed a few weeks back on this mailing 
> list.
> 
> [1] https://en.wikipedia.org/wiki/GUID_Partition_Table

That is what I read on the IRC channel; it seems to be a common mistake. It might
be good to mention it in the documentation or FAQ.
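
For anyone else hitting this, verifying the type code after creating a journal
partition by hand is quick (a sketch - the device and partition number are
examples):

  sudo sgdisk --info=1 /dev/sda | grep -i 'Partition GUID code'
  # should report 45B0969E-9B03-4F30-B4C6-B4B80CEFF106 for a Ceph journal
  sudo ceph-disk list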

Yoann

> On Tue, Mar 8, 2016 at 5:21 PM, Yoann Moulin wrote:
> 
> Hello Adrien,
> 
> > I think I faced the same issue setting up my own cluster. If it is the same,
> > it's one of the many issues people encounter(ed) during disk initialization.
> > Could you please give the output of :
> >  - ll /dev/disk/by-partuuid/
> >  - ll /var/lib/ceph/osd/ceph-*
> 
> Unfortunately, I already reinstalled my test cluster, but I got some
> information that might explain this issue.
>
> I was creating the journal partitions before running the ansible playbook.
> Firstly, owner and permissions were not persistent at boot (I had to add
> udev rules). And I strongly suspect a side effect of not letting ceph-disk
> create the journal partitions.
> 
> Yoann
> 
> > On Thu, Mar 3, 2016 at 3:42 PM, Yoann Moulin wrote:
> >
> > Hello,
> >
> > I'm (almost) a new user of Ceph (a couple of months). At my university,
> > we started to do some tests with Ceph a couple of months ago.
> >
> > We have 2 clusters. Each cluster has 100 OSDs on 10 servers:
> >
> > Each server has this setup:
> >
> > CPU : 2 x Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
> > Memory : 128GB of Memory
> > OS Storage : 2 x SSD 240GB Intel S3500 DC (raid 1)
> > Journal Storage : 2 x SSD 400GB Intel S3300 DC (no Raid)
> > OSD Disk : 10 x HGST ultrastar-7k6000 6TB
> > Network : 1 x 10Gb/s
> > OS : Ubuntu 14.04
> > Ceph version : infernalis 9.2.0
> >
> > One cluster gives access to some users through an S3 gateway (the service
> > is still in beta). We call this cluster "ceph-beta".
> >
> > One cluster is for our internal needs, to learn more about Ceph. We call
> > this cluster "ceph-test". (Those servers will be integrated into the
> > ceph-beta cluster when we need more space.)
> >
> > We have deployed both clusters with the ceph-ansible playbook[1]
> >
> > Journals are raw partitions on the SSDs (400GB Intel S3300 DC) with no
> > RAID, 5 journal partitions on each SSD.
> >
> > OSD disks are formatted with XFS.
> >
> > 1. https://github.com/ceph/ceph-ansible
> >
> > We have an issue. Some OSDs go down and don't start. It seems to be
> > related to the fsid of the journal partition:
> >
> > > -1> 2016-03-03 14:09:05.422515 7f31118d0940 -1 journal FileJournal::open:
> > > ondisk fsid ---- doesn't match expected
> > > eeadbce2-f096-4156-ba56-dfc634e59106, invalid (someone else's?) journal
> >
> > in attachment, the full logs of one of the dead OSDs
> >
> > We had this issue with 2 OSDs on the ceph-beta cluster; it was fixed by
> > removing, zapping and re-adding them.
> >
> > Now, we have the same issue on ceph-test cluster but on 18 OSDs.
> >
> > Now the stats of this cluster
> >
> > > root@icadmin004:~# ceph -s
> > > cluster 4fb4773c-0873-44ad-a65f-269f01bfcff8
> > >  health HEALTH_WARN
> > > 1024 pgs incomplete
> > > 1024 pgs stuck inactive
> > > 1024 pgs stuck unclean
> > >  monmap e1: 3 mons at
> > > {iccluster003=10.90.37.4:6789/0,iccluster014=10.90.37.15:6789/0,iccluster022=10.90.37.23:6789/0}
> > > election epoch 62, quorum 0,1,2

[ceph-users] Recovering a secondary replica from another secondary replica

2016-03-09 Thread Александр Шишенко
Hello,

I have a development cluster of three OSD's. My aim is to make a
secondary replica recover from another secondary replica (not
primary). Is it possible to do so with minor changes to ceph-osd
source code? Are ceph algorithms compatible with these changes?