Re: [ceph-users] debian repositories path change?

2015-09-19 Thread Sage Weil
On Sat, 19 Sep 2015, Brian Kroth wrote:
> Just to be clear, there's no longer going to be a generic
> http://download.ceph.com/debian (sans -{ceph-release-name}) path?  In other
> words, we'll have to monitor something else to determine what's considered
> stable for our {distro-release} and then update the sources to point at a
> new debian-{ceph-release-name} ourselves, correct?

This was an oversight... I'll add the symlinks for debian and rpm, pointing 
to hammer for now.  They'll generally point to the most recent 
stable release.
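
For example, a wheezy sources.list entry can then either track the generic
symlink or pin a named release; a sketch, assuming hammer is the current
stable at the time of writing:

deb http://download.ceph.com/debian/ wheezy main         # follows the "latest stable" symlink
deb http://download.ceph.com/debian-hammer/ wheezy main  # pins the hammer release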

sage


> 
> Thanks,
> Brian
> 
> 
> On Fri, Sep 18, 2015, 09:45 Alfredo Deza  wrote:
>   The new locations are in:
> 
> 
>   http://packages.ceph.com/
> 
>   For debian this would be:
> 
>   http://packages.ceph.com/debian-{release}
> 
>   Note that ceph-extras is no longer available: the current repos should
>   provide everything that is needed to properly install ceph. Otherwise,
>   please let us know.
> 
>   On Fri, Sep 18, 2015 at 10:35 AM, Brian Kroth
>    wrote:
>   > Hmm, apparently I haven't gotten that far in my email backlog
>   yet.  That's
>   > good to know too.
>   >
>   > Thanks,
>   > Brian
>   >
>   > Olivier Bonvalet  2015-09-18 16:02:
>   >
>   >> Hi,
>   >>
>   >> not sure if it's related, but there are recent changes because of a
>   >> security issue:
>   >>
>   >> http://ceph.com/releases/important-security-notice-regarding-signing-key-and-binary-downloads-of-ceph/
>   >>
>   >>
>   >>
>   >>
>   >> On Friday 18 September 2015 at 08:45 -0500, Brian Kroth wrote:
>   >>>
>   >>> Hi all, we've had the following in our
>   >>> /etc/apt/sources.list.d/ceph.list
>   >>> for a while based on some previous docs,
>   >>>
>   >>> # ceph upstream stable (currently giant) release packages for wheezy:
>   >>> deb http://ceph.com/debian/ wheezy main
>   >>>
>   >>> # ceph extras:
>   >>> deb http://ceph.com/packages/ceph-extras/debian wheezy main
>   >>>
>   >>> but it seems like the straight "debian/" portion of that path has gone
>   >>> missing recently, and now there's only debian-firefly/, debian-giant/,
>   >>> debian-hammer/, etc.
>   >>>
>   >>> Is that just an oversight, or should we be switching our sources to one
>   >>> of the named releases?  I figured that the unnamed one would
>   >>> automatically track what ceph currently considered "stable" for the
>   >>> target distro release for me, but maybe that's not the case.
>   >>>
>   >>> Thanks,
>   >>> Brian
>   >>> ___
>   >>> ceph-users mailing list
>   >>> ceph-users@lists.ceph.com
>   >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>   >>>
>   > ___
>   > ceph-users mailing list
>   > ceph-users@lists.ceph.com
>   > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>   ___
>   ceph-users mailing list
>   ceph-users@lists.ceph.com
>   http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> ___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] debian repositories path change?

2015-09-19 Thread Lindsay Mathieson
On 19 September 2015 at 01:55, Ken Dreyer  wrote:

> To avoid confusion here, I've deleted packages.ceph.com from DNS
> today, and the change will propagate soon.
>
> Please use download.ceph.com (it's the same IP address and server,
> 173.236.248.54)
>


I'm getting:

  W: GPG error: http://download.ceph.com wheezy Release: The following
signatures couldn't be verified because the public key is not available:
NO_PUBKEY E84AC2C0460F3994


Trying to update from there
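
For reference, the new release key usually needs to be imported before apt
will trust the repository again; a sketch, assuming the key location from the
security notice linked earlier in the thread (verify the fingerprint
independently before trusting it):

wget -q -O- 'https://download.ceph.com/keys/release.asc' | sudo apt-key add -
sudo apt-get update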


-- 
Lindsay
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] multi-datacenter crush map

2015-09-19 Thread Wouter De Borger
Ok, so if I understand correctly, for replication level 3 or 4 I would have
to use the rule

rule replicated_ruleset {

ruleset 0
type replicated
min_size 1
max_size 10
step take root
step choose firstn 2 type datacenter
step chooseleaf firstn 2 type host
step emit
}
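
A rule like this can be sanity-checked offline with crushtool before injecting
it into the cluster; a minimal sketch, assuming the rule keeps ruleset id 0:

ceph osd getcrushmap -o crushmap.bin
# show which OSDs the rule selects for a 4-way replicated pool
crushtool -i crushmap.bin --test --rule 0 --num-rep 4 --show-mappings | head
# report any mappings that return fewer than 4 OSDs
crushtool -i crushmap.bin --test --rule 0 --num-rep 4 --show-bad-mappings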

The question I have now is: how will it behave when a DC goes down?
(Assuming catastrophic failure, the thing burns down)

For example, if I set replication to 3 and min_rep to 3:
then, if a DC goes down, crush will only return 2 PG's, so everything will
hang (same for 4/4 and 4/3).

If I set replication to 3 and min_rep to 2, it could occur that all data of a
PG is in one DC (degraded mode). If this DC goes down, the PG will hang.
As far as I know, degraded PG's will still accept writes, so data loss is
possible (same for 4/2).



I can't seem to find a way around this. What am I missing?


Wouter




On Fri, Sep 18, 2015 at 10:10 PM, Gregory Farnum  wrote:

> On Fri, Sep 18, 2015 at 4:57 AM, Wouter De Borger 
> wrote:
> > Hi all,
> >
> > I have found on the mailing list that it should be possible to have a
> multi
> > datacenter setup, if latency is low enough.
> >
> > I would like to set this up, so that each datacenter has at least two
> > replicas and each PG has a replication level of 3.
> >
> > In this mail, it is suggested that I should use the following crush map
> for
> > multi DC:
> >
> > rule dc {
> > ruleset 0
> > type replicated
> > min_size 1
> > max_size 10
> > step take default
> > step chooseleaf firstn 0 type datacenter
> > step emit
> > }
> >
> > This looks suspicious to me, as it will only generate a list of two PG's,
> > (and only one PG if one DC is down).
> >
> > I think I should use:
> >
> > rule replicated_ruleset {
> > ruleset 0
> > type replicated
> > min_size 1
> > max_size 10
> > step take root
> > step choose firstn 2 type datacenter
> > step chooseleaf firstn 2 type host
> > step emit
> > step take root
> > step chooseleaf firstn -4 type host
> > step emit
> > }
> >
> > This correctly generates a list with 2 PG's in one DC, then 2 PG's in the
> > other and then a list of PG's
> >
> > The problem is that this list contains duplicates (e.g. for 8 OSDS per
> DC)
> >
> > [13,11,1,8,13,11,16,4,3,7]
> > [9,2,13,11,9,15,12,18,3,5]
> > [3,5,17,10,3,5,7,13,18,10]
> > [7,6,11,14,7,14,3,16,4,11]
> > [6,3,15,18,6,3,12,9,16,15]
> >
> > Will this be a problem?
>
> For replicated pools, it probably will cause trouble. For EC pools I
> think it should work fine, but obviously you're losing all kinds of
> redundancy. Nothing in the system will do work to avoid colocating
> them if you use a rule like this. Rather than distributing some of the
> replicas randomly across DCs, you really just want to split them up
> evenly across datacenters (or in some ratio, if one has more space
> than the other). Given CRUSH's current abilities that does require
> building the replication size into the rule, but such is life.
>
>
> > If crush is executed, will it only consider osd's which are (up,in)  or
> all
> > OSD's in the map and then filter them from the list afterwards?
>
> CRUSH will consider all OSDs, but if it selects any OSDs which are out
> then it retries until it gets one that is still marked in.
> -Greg
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Potential OSD deadlock?

2015-09-19 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

We have had two situations where I/O just seems to be indefinitely
blocked on our production cluster today (0.94.3). In the case this
morning, it was just normal I/O traffic, with no recovery or backfill.
In the case this evening, we were backfilling to some new OSDs. I would
have loved to bump up the debugging to get an idea of what was going
on, but time was exhausted. During the incident this evening I was able
to do some additional troubleshooting, but got really anxious after I/O
had been blocked for 10 minutes and Ops was getting hot under the collar.

Here are the important parts of the logs:
[osd.30]
2015-09-18 23:05:36.188251 7efed0ef0700  0 log_channel(cluster) log
[WRN] : slow request 30.662958 seconds old,
 received at 2015-09-18 23:05:05.525220: osd_op(client.3117179.0:18654441
 rbd_data.1099d2f67aaea.0f62 [set-alloc-hint object_size
8388608 write_size 8388608,write 1048576~643072] 4.5ba1672c
ack+ondisk+write+known_if_redirected e55919)
 currently waiting for subops from 32,70,72

[osd.72]
2015-09-18 23:05:19.302985 7f3fa19f8700  0 log_channel(cluster) log
[WRN] : slow request 30.200408 seconds old,
 received at 2015-09-18 23:04:49.102519: osd_op(client.4267090.0:3510311
 rbd_data.3f41d41bd65b28.9e2b [set-alloc-hint object_size
4194304 write_size 4194304,write 1048576~421888] 17.40adcada
ack+ondisk+write+known_if_redirected e55919)
 currently waiting for subops from 2,30,90

The other OSDs listed (32,70,2,90) did not have any errors in the logs
about blocked I/O. It seems that osd.30 was waiting for osd.72 and
vice versa. I looked at top and iostat of these two hosts and the OSD
processes and disk I/O were pretty idle.

I know that this isn't a lot to go on. Our cluster is under very heavy
load and we get several blocked I/Os every hour, but they usually
clear up within 15 seconds. We seem to get I/O blocked when the op
latency of the cluster goes above 1 (average from all OSDs as seen by
Graphite).

Has anyone seen this kind of indefinitely blocked I/O? Bouncing osd.72
immediately cleared all the blocked I/O, and it was fine after rejoining
the cluster. Which logs, and at what level, would be most beneficial to
increase for troubleshooting in this case?
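
For what it's worth, the usual knobs for this kind of investigation are
debug_osd and debug_ms on the OSDs named in the slow-request messages; a rough
sketch (osd ids taken from the log excerpts above, levels are just a common
starting point):

ceph tell osd.30 injectargs '--debug_osd 20 --debug_ms 1'
ceph tell osd.72 injectargs '--debug_osd 20 --debug_ms 1'
# ...reproduce and capture logs, then drop back to the defaults
ceph tell osd.30 injectargs '--debug_osd 0/5 --debug_ms 0/5'
ceph tell osd.72 injectargs '--debug_osd 0/5 --debug_ms 0/5'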

I hope this makes sense, it has been a long day.

- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
-BEGIN PGP SIGNATURE-
Version: Mailvelope v1.1.0
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJV/QiuCRDmVDuy+mK58QAAfskP/A0+RRAtq49pwfJcmuaV
LKMsdaOFu0WL1zNLgnj4KOTR1oYyEShXW3Xn0axw1C2U2qXkJQfvMyQ7PTj7
cKqNeZl7rcgwkgXlij1hPYs9tjsetjYXBmmui+CqbSyNNo95aPrtUnWPcYnc
K7blP6wuv7p0ddaF8wgw3Jf0GhzlHyykvVlxLYjQWwBh1CTrSzNWcEiHz5NE
9Y/GU5VZn7o8jeJDh6tQGgSbUjdk4NM2WuhyWNEP1klV+x1P51krXYDR7cNC
DSWaud1hNtqYdquVPzx0UCcUVR0JfVlEX26uxRLgNd0dDkq+CRXIGhakVU75
Yxf8jwVdbAg1CpGtgHx6bWyho2rrsTzxeul8AFLWtELfod0e5nLsSUfQuQ2c
MXrIoyHUcs7ySP3ozazPOdxwBEpiovUZOBy1gl2sCSGvYsmYokHEO0eop2rl
kVS4dSAvDezmDhWumH60Y661uzySBGtrMlV/u3nw8vfvLhEAbuE+lLybMmtY
nJvJIzbTqFzxaeX4PTWcUhXRNaPp8PDS5obmx5Fpn+AYOeLet/S1Alz1qNM2
4w34JKwKO92PtDYqzA6cj628fltdLkxFNoz7DFfqxr80DM7ndLukmSkPY+Oq
qYOQMoownMnHuL0IrC9Jo8vK07H8agQyLF8/m4c3oTqnzZhh/rPRlPfyHEio
Roj5
=ut4B
-END PGP SIGNATURE-
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] how to clear unnormal huge objects

2015-09-19 Thread Xiangyu (Raijin, BP Dept)
Hello,

I have built a Ceph cluster for testing. After I performed some recovery
testing, some OSDs went down because there was no available disk space. When I
checked the OSD data folder, I found many huge objects with the prefix
obj-xvrzfdsafd. I would like to know how those objects were generated, what
they are used for, and how to clear them? It seems that I cannot delete them
directly.
PS: I used block storage (RBD) only.
Also, I found the pool usage is even more than 100% - why is that?
Waiting for your feedback, thanks.
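
For reference, per-pool and raw usage can be cross-checked with something like
the following (a sketch; the pool name "rbd" is an assumption):

ceph df detail          # per-pool usage and object counts
rados df                # per-pool usage as rados sees it
rados -p rbd ls | head  # sample a few object names in the pool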



Best Regards!
*
Xiang Yu (Raijin.Xiang)
Computing and Storage Dept
Huawei Technologies Co., Ltd.
Mobile: +86 186 2032 2562
Mail: xiang...@huawei.com
Bldg 1-B, Cloud Park, Huancheng Road, Bantian Str., Longgang District, 518129
Shenzhen, P. R. China
***

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] lttng duplicate registration problem when using librados2 and libradosstriper

2015-09-19 Thread Nick Fisk
Hi Paul,

I hit the same problem here (see last post):

https://groups.google.com/forum/#!topic/bareos-users/mEzJ7IbDxvA

If I ever get to the bottom of it, I will let you know. Sorry I can't be of
any more help.
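
In case it helps, one way to check whether both libraries pull in their own
copy of the tracepoint provider is to inspect the shared objects directly; a
rough sketch (library paths vary by distro, ./test is the affected binary):

ldd ./test | grep -i lttng
nm -D --defined-only /usr/lib/librados.so.2 | grep -ci tracepoint
nm -D --defined-only /usr/lib/libradosstriper.so.1 | grep -ci tracepoint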

Nick

> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Paul Mansfield
> Sent: 18 September 2015 17:16
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] lttng duplicate registration problem when using
> librados2 and libradosstriper
> 
> Hello,
> thanks for your attention.
> 
> I have started using rados striper library, calling the functions from a C
> program.
> 
> As soon as I add libradosstriper to the linking process, I get this error
> when the program runs, even though I am not calling any functions from the
> rados striper library (I commented them out).
> 
> LTTng-UST: Error (-17) while registering tracepoint probe. Duplicate
> registration of tracepoint probes having the same name is not allowed.
> /bin/sh: line 1: 61001 Aborted (core dumped) ./$test
> 
> 
> I had been using lttng in my program but removed it to ensure it wasn't
> causing the problem.
> 
> I have tried running the program using gdb but the calls to initialise
> lttng occur before main() is called and so I cannot add a break point to
> see what is happening.
> 
> 
> thanks
> Paul
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] How to move OSD form 1TB disk to 2TB disk

2015-09-19 Thread wsnote
I know another way: mark the 1TB osd out and bring up the 2TB osd as osd.X
without data; rados will then backfill the data to the 2TB disks.
Now I am using rsync to move the data from the 1TB disk to the 2TB disk, but
the new osd core dumps.
What's the problem?


ceph version:0.80.1
osd.X 
host1 with 1TB disks
host2 with 2TB disks


on host1:
osd.X down
ceph-osd -i X --flush-journal
rsync -av /data/osd/osd.X/ root:host2:/data/osd/osd.X/  
on host2:
vim ceph.conf
ceph-osd -i X --mkjournal
ceph-osd -i X


then osd.X coredump
osd log:
-1> 2015-09-19 14:52:22.371149 7f008cd007a0 0 osd.29 416 load_pgs
0> 2015-09-19 14:52:22.378677 7f008cd007a0 -1 osd/PG.cc: In function 'static 
epoch_t PG::peek_map_epoch(ObjectStore*, coll_t, hobject_t&, 
ceph::bufferlist*)' thread 7f008cd007a0 time 2015-09-19 14:52:22.377569
osd/PG.cc: 2559: FAILED assert(r > 0)

ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
1: (PG::peek_map_epoch(ObjectStore*, coll_t, hobject_t&, 
ceph::buffer::list*)+0x48d) [0x7fa4ad]
2: (OSD::load_pgs()+0x18f1) [0x63c771]
3: (OSD::init()+0x22b0) [0x6550e0]
4: (main()+0x359e) [0x5f931e]
5: (__libc_start_main()+0xfd) [0x3073c1ed5d]
6: ceph-osd() [0x5f59c9]

coredump:
(gdb) bt

#0  0x00307400f5db in raise () from /lib64/libpthread.so.0
#1  0x009ab7f4 in ?? ()
#2  <signal handler called>
#3  0x003073c32635 in raise () from /lib64/libc.so.6
#4  0x003073c33e15 in abort () from /lib64/libc.so.6
#5  0x003b4febea7d in __gnu_cxx::__verbose_terminate_handler() () from 
    /usr/lib64/libstdc++.so.6
#6  0x003b4febcbd6 in ?? () from /usr/lib64/libstdc++.so.6
#7  0x003b4febcc03 in std::terminate() () from /usr/lib64/libstdc++.so.6
#8  0x003b4febcd22 in __cxa_throw () from /usr/lib64/libstdc++.so.6
#9  0x00aec612 in ceph::__ceph_assert_fail(char const*, char const*, int, 
    char const*) ()
#10 0x007fa4ad in PG::peek_map_epoch(ObjectStore*, coll_t, hobject_t&, 
    ceph::buffer::list*) ()
#11 0x0063c771 in OSD::load_pgs() ()
#12 0x006550e0 in OSD::init() ()
#13 0x005f931e in main ()





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] debian repositories path change?

2015-09-19 Thread Daniel Swarbrick

On 18/09/15 17:28, Sage Weil wrote:


> Make that download.ceph.com .. the packages URL was temporary while we got
> the new site ready and will go away shortly!
>
> (Also, HTTPS is enabled now.)



But still no jessie packages available... :(


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to move OSD form 1TB disk to 2TB disk

2015-09-19 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

Just use the built-in Ceph recovery to move data to the new disk. By
changing disk sizes, you also change the mapping across the cluster so
you are going to be moving more data than necessary.

My recommendation: bring the new disk in as a new OSD. Then set the
old disk to 'out'. This will keep the OSD participating in the
backfills until it is empty. Once the backfill is done, stop the old
OSD and remove it from the cluster.
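
In command form that sequence looks roughly like this (a sketch; X is the old
OSD's id and the service syntax depends on your init system):

ceph osd out X               # start draining; backfill moves PGs off osd.X
ceph -s                      # wait until all PGs are active+clean again
service ceph stop osd.X      # on the old host, once it is empty
ceph osd crush remove osd.X
ceph auth del osd.X
ceph osd rm X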
- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Sat, Sep 19, 2015 at 2:30 AM, wsnote  wrote:
> I know another way: out 1TB osd, up 2TB osd as osd.X without data, then
> rados will backfill the data to 2TB disks.
> Now I use rsync to mv data form 1TB disk to 2TB disk, but the new osd
> coredump.
> What's the problem?
>
> ceph version:0.80.1
> osd.X
> host1 with 1TB disks
> host2 with 2TB disks
>
> on host1:
> osd.X down
> ceph-osd -i X --flush-journal
> rsync -av /data/osd/osd.X/ root:host2:/data/osd/osd.X/
> on host2:
> vim ceph.conf
> ceph-osd -i X --mkjournal
> ceph-osd -i X
>
> then osd.X coredump
> osd log:
> -1> 2015-09-19 14:52:22.371149 7f008cd007a0 0 osd.29 416 load_pgs
> 0> 2015-09-19 14:52:22.378677 7f008cd007a0 -1 osd/PG.cc: In function 'static
> epoch_t PG::peek_map_epoch(ObjectStore*, coll_t, hobject_t&,
> ceph::bufferlist*)' thread 7f008cd007a0 time 2015-09-19 14:52:22.377569
> osd/PG.cc: 2559: FAILED assert(r > 0)
>
> ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
> 1: (PG::peek_map_epoch(ObjectStore*, coll_t, hobject_t&,
> ceph::buffer::list*)+0x48d) [0x7fa4ad]
> 2: (OSD::load_pgs()+0x18f1) [0x63c771]
> 3: (OSD::init()+0x22b0) [0x6550e0]
> 4: (main()+0x359e) [0x5f931e]
> 5: (__libc_start_main()+0xfd) [0x3073c1ed5d]
> 6: ceph-osd() [0x5f59c9]
>
> coredump:
> (gdb) bt
>
> 0 0x00307400f5db in raise () from /lib64/libpthread.so.0
>
> 1 0x009ab7f4 in ?? ()
>
> 2
>
> 3 0x003073c32635 in raise () from /lib64/libc.so.6
>
> 4 0x003073c33e15 in abort () from /lib64/libc.so.6
>
> 5 0x003b4febea7d in __gnu_cxx::__verbose_terminate_handler() () from
> /usr/lib64/libstdc++.so.6
>
> 6 0x003b4febcbd6 in ?? () from /usr/lib64/libstdc++.so.6
>
> 7 0x003b4febcc03 in std::terminate() () from /usr/lib64/libstdc++.so.6
>
> 8 0x003b4febcd22 in __cxa_throw () from /usr/lib64/libstdc++.so.6
>
> 9 0x00aec612 in ceph::__ceph_assert_fail(char const*, char const*,
> int, char const*) ()
>
> 10 0x007fa4ad in PG::peek_map_epoch(ObjectStore*, coll_t,
> hobject_t&, ceph::buffer::list*) ()
>
> 11 0x0063c771 in OSD::load_pgs() ()
>
> 12 0x006550e0 in OSD::init() ()
>
> 13 0x005f931e in main ()
>
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

-BEGIN PGP SIGNATURE-
Version: Mailvelope v1.1.0
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJV/XWUCRDmVDuy+mK58QAAfLgP/RPZkQDcV+odf2eaI3vR
CVt0nxWq4RDf6jARMPtQRO9k+BROvoNON0UGFeOeeX3AXGd46gw/DqrAiswb
PsYfM5FSmirWAjB6vQ0A4+nZsVfSPyazA2/XZJ6oIGjiS3RVLDFB+ZrLZO7C
+/XBGet7LUjvp6F8WtQr7lBWY+i93aYeNHdmD3u7hIypSgyNqbWvpjM9xZR1
nJLvoUan/qG96bvAGWyJQz1CSmEfYxJeFTooOZTeCXTKu+bGwQxFqsam4JoJ
Og4t7vlo0zA4LRrT6q/8SdO3u1cPf6CZJ1LY4uPmNlu8FsBQpk44ILsCY3hY
wj49JVGistxK3ADJSPxxJRs/Tuh53lvBYaY5D16sMLVw98lhEBXEyTH6ivII
cUc60HK/v91iVFef2oVNBlVtxIMHK9PXxIBfnmLffjcRcl22w9crvtUuz9ts
ime/ebMiFTUnK/xUqrDbjgvgBDWkpdvYb7Mq/koGhG+7TJXx5cEktd1q0rvr
WWaphubYPXTfPF/UCjX73jAKIMav8Frl+LbKzZyQtNuJbdwI8s9DrbTaMVKa
711Du6G+YzccAZ4BdKk32+8xuNtblnnEanbLHdrhwwuTBpsb+ynrtf2yCqix
yHW7DSeK19y+aEu0PurBvOUMmu41viUQQEd4aXYVh8hAExpjqVVBsr3T/Ow1
Qen3
=klMb
-END PGP SIGNATURE-
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to move OSD form 1TB disk to 2TB disk

2015-09-19 Thread Wido den Hollander
On 09/19/2015 10:30 AM, wsnote wrote:
> I know another way: out 1TB osd, up 2TB osd as osd.X without data, then rados 
> will backfill the data to 2TB disks.
> Now I use rsync to mv data form 1TB disk to 2TB disk, but the new osd 
> coredump.
> What's the problem?
> 

Did you use rsync with the -X option to also transfer xattrs?
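
That is, something along these lines (a sketch; note the -X for extended
attributes, which the OSD metadata depends on):

rsync -avX --numeric-ids /data/osd/osd.X/ root@host2:/data/osd/osd.X/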

Wido

> 
> ceph version:0.80.1
> osd.X 
> host1 with 1TB disks
> host2 with 2TB disks
> 
> 
> on host1:
> osd.X down
> ceph-osd -i X --flush-journal
> rsync -av /data/osd/osd.X/ root:host2:/data/osd/osd.X/  
> on host2:
> vim ceph.conf
> ceph-osd -i X --mkjournal
> ceph-osd -i X
> 
> 
> then osd.X coredump
> osd log:
> -1> 2015-09-19 14:52:22.371149 7f008cd007a0 0 osd.29 416 load_pgs
> 0> 2015-09-19 14:52:22.378677 7f008cd007a0 -1 osd/PG.cc: In function 'static 
> epoch_t PG::peek_map_epoch(ObjectStore*, coll_t, hobject_t&, 
> ceph::bufferlist*)' thread 7f008cd007a0 time 2015-09-19 14:52:22.377569
> osd/PG.cc: 2559: FAILED assert(r > 0)
> 
> ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
> 1: (PG::peek_map_epoch(ObjectStore*, coll_t, hobject_t&, 
> ceph::buffer::list*)+0x48d) [0x7fa4ad]
> 2: (OSD::load_pgs()+0x18f1) [0x63c771]
> 3: (OSD::init()+0x22b0) [0x6550e0]
> 4: (main()+0x359e) [0x5f931e]
> 5: (__libc_start_main()+0xfd) [0x3073c1ed5d]
> 6: ceph-osd() [0x5f59c9]
> 
> coredump:
> (gdb) bt
> 
> 0 0x00307400f5db in raise () from /lib64/libpthread.so.0
> 1 0x009ab7f4 in ?? ()
> 2 
> 3 0x003073c32635 in raise () from /lib64/libc.so.6
> 4 0x003073c33e15 in abort () from /lib64/libc.so.6
> 5 0x003b4febea7d in __gnu_cxx::__verbose_terminate_handler() () from 
> /usr/lib64/libstdc++.so.6
> 6 0x003b4febcbd6 in ?? () from /usr/lib64/libstdc++.so.6
> 7 0x003b4febcc03 in std::terminate() () from /usr/lib64/libstdc++.so.6
> 8 0x003b4febcd22 in __cxa_throw () from /usr/lib64/libstdc++.so.6
> 9 0x00aec612 in ceph::__ceph_assert_fail(char const*, char const*, 
> int, char const*) ()
> 10 0x007fa4ad in PG::peek_map_epoch(ObjectStore*, coll_t, hobject_t&, 
> ceph::buffer::list*) ()
> 11 0x0063c771 in OSD::load_pgs() ()
> 12 0x006550e0 in OSD::init() ()
> 13 0x005f931e in main ()
> 
> 
> 
> 
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] multi-datacenter crush map

2015-09-19 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

You will want size=4 min_size=2 if you want to keep I/O going if a DC
fails and still ensure some data integrity. Data checksumming (which I
think is being added) would provide much stronger data integrity checking
in a two-copy situation, as you would be able to tell which of the two
copies is the good one instead of needing a third to break the tie.
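
On an existing pool that would be set with something like (a sketch; the pool
name is a placeholder):

ceph osd pool set <poolname> size 4
ceph osd pool set <poolname> min_size 2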

However, you have yet another problem on your hands. The way monitors
work makes this tricky. If you have one monitor in one DC and two in
the other, and the two-monitor DC burns down, the surviving part of the
cluster stops working too because there isn't more than 50% of the
monitors available. Putting two monitors in each DC means the cluster
stops working if either DC goes down (with four monitors you need three
for quorum). It has been suggested that putting the odd monitor in the
cloud (or another location off-site from both DCs) could be an option,
but latency could cause problems. The cloud monitor would complete the
quorum with whichever DC survives.

Also remember that there is no data locality awareness in Ceph at the
moment. This could mean that the primary for a PG is in the other DC.
So your client has to contact the primary in the other DC, then that
OSD contacts one OSD in its DC and two in the other and has to get
confirmation that the write is acknowledged then ack the write to the
client. For a write you will be between 2 x ( LAN latency + WAN
latency ) and 2 x ( LAN latency + 2 x WAN latency ). Additionally your
reads will be between 2 x LAN latency and 2 x WAN latency. Then there
is write amplification so you need to make sure you have a lot more
WAN bandwidth than you think you need.
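
To make that concrete with made-up numbers (say 0.2 ms LAN latency and 5 ms
WAN latency):

writes: between 2 x (0.2 + 5) = 10.4 ms and 2 x (0.2 + 2 x 5) = 20.4 ms
reads:  between 2 x 0.2 = 0.4 ms and 2 x 5 = 10 ms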

I think the large majority of us are eagerly waiting for the RBD
replication feature or some sort of lag behind OSD for situations like
this.
-BEGIN PGP SIGNATURE-
Version: Mailvelope v1.1.0
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJV/bgrCRDmVDuy+mK58QAAOTsQAJd5Q/3uVMP6D0U+iZv/
FGvEThfxLqarEo/n/TAPiJdCeZP9sKr8szTP72Iajt5UAwH8Ry5qcvClUoet
LmMXfOxHJQJcMbXcKHxI8G7w9h/8ExkGA3GkoBYltUvZ9+oEI30ANHZphBiK
HhaLWanrEKh8L4EbXnqA9JvEYwf1BGDvxKbdvFDNSIIbDywN3DJn7OavRhC9
M63GQnFmxSO6F+Oy1q5vMfpur/VtZ27GRfzIDsougRTmM5q9zbdpSY8pHrrZ
RDExkM1t0orl1gUnbNhl/YgQTGfU/XWpEKtJju7Wk9Ciem5SFczJRWsputHc
AhBtnxBoEInlsnpHKnCsPvbY8wEcoo+YxNt79/M3cR8x0UzXl+/4SoDlYnSK
X3afL/YmVnbCV6hoxl2LAOqHbTYasN9VxQIbpQe4kAzSq45yJX//k8NRXBfD
+hGF8qfxpcbTe/9IjJiqwe+ZpaAd4vX7Xfq4oHHeMwWUrvd8sXSbr5CIV1AJ
CYsixEy2gJ0oFFVKcBGtzAfBUxJHb/FAcAuV97zSdYyYRplMq5Qjaz/hwGeu
9pC83kxY40pfzdD9uEElWoI3+6/34LdNo4TLi3IM8aeZmNGzzIgt/MxAuFOk
9Jf2Dwmab0+Ut6uJasY4Fr6HiyNoeTXea+CSWrnvsMohOseyJg996GUP3gUl
OEoA
=PfhN
-END PGP SIGNATURE-

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Sat, Sep 19, 2015 at 12:54 PM, Wouter De Borger  wrote:
> Ok, so if I understand correctly, for replication level 3 or 4 I would have
> to use the rule
>
> rule replicated_ruleset {
>
> ruleset 0
> type replicated
> min_size 1
> max_size 10
> step take root
> step choose firstn 2 type datacenter
> step chooseleaf firstn 2 type host
> step emit
> }
>
> The question I have now is: how will it behave when a DC goes down?
> (Assuming catastrophic failure, the thing burns down)
>
> For example, if I set replication to 3, min_rep to 3.
> Then, if a DC goes down, crush will only return 2 PG's, so everything will
> hang  (same for 4/4 and 4/3)
>
> If I set replication to 3, min_rep to 2, it could occur that all data of a
> PG is in one DC (degraded mode). if this DC goes down, the PG will hang,
> As far as I know, degraded PG's will still accept writes, so data loss is
> possible. (same for 4/2)
>
>
>
> I can't seem to find a way around this. What am I missing.
>
>
> Wouter
>
>
>
>
> On Fri, Sep 18, 2015 at 10:10 PM, Gregory Farnum  wrote:
>>
>> On Fri, Sep 18, 2015 at 4:57 AM, Wouter De Borger 
>> wrote:
>> > Hi all,
>> >
>> > I have found on the mailing list that it should be possible to have a
>> > multi
>> > datacenter setup, if latency is low enough.
>> >
>> > I would like to set this up, so that each datacenter has at least two
>> > replicas and each PG has a replication level of 3.
>> >
>> > In this mail, it is suggested that I should use the following crush map
>> > for
>> > multi DC:
>> >
>> > rule dc {
>> > ruleset 0
>> > type replicated
>> > min_size 1
>> > max_size 10
>> > step take default
>> > step chooseleaf firstn 0 type datacenter
>> > step emit
>> > }
>> >
>> > This looks suspicious to me, as it will only generate a list of two
>> > PG's,
>> > (and only one PG if one DC is down).
>> >
>> > I think I should use:
>> >
>> > rule replicated_ruleset {
>> > ruleset 0
>> > type replicated
>> > min_size 1
>> > max_size 10
>> > step take root
>> > step choose firstn 2 type datacenter
>> > step chooseleaf