Re: [ceph-users] Disable intra-host replication?

2018-11-23 Thread Janne Johansson
On Fri, 23 Nov 2018 at 15:19, Marco Gaiarin wrote:
>
>
> Previous (partial) node failures and my current experiments on adding a
> node have led me to the fact that, when rebalancing is needed, Ceph
> also rebalances intra-node: e.g., if an OSD of a node dies, data is
> rebalanced onto all OSDs, even though I have pool replication size 3 and 3 nodes.
>
> This, indeed, makes perfect sense: overall data scattering gives better
> performance and safety.
>
>
> But... is there some way to tell CRUSH 'don't rebalance within the same
> node, go into degraded mode instead'?
>

The default CRUSH rules with replication=3 would only place PGs on
separate hosts, so in that case it would go into degraded mode if a node
goes away, and not place replicas on different disks on the remaining
hosts.
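
For reference, a decompiled CRUSH map's default replicated rule typically
looks roughly like this (a sketch; names and min/max values may differ on
your cluster) - the 'chooseleaf ... type host' step is what forces each
replica onto a distinct host:

    rule replicated_rule {
            id 0
            type replicated
            min_size 1
            max_size 10
            step take default
            step chooseleaf firstn 0 type host
            step emit
    }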

-- 
May the most significant bit of your life be positive.


Re: [ceph-users] New OSD with weight 0, rebalance still happen...

2018-11-23 Thread Matthew H
Greetings,

You need to set the following configuration option under [osd] in your 
ceph.conf file for your new OSDs.

[osd]
osd_crush_initial_weight = 0

This will ensure your new OSDs come up with a CRUSH weight of 0, thus preventing
the automatic rebalance that you are seeing.
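
Once the new OSDs are in with weight 0, you can then ramp them up in small
steps; a rough sketch (osd.12 and the step size are just examples):

    ceph osd crush reweight osd.12 0.05
    # wait for the cluster to return to HEALTH_OK, then repeat
    ceph osd crush reweight osd.12 0.10
    # ... continue until the CRUSH weight matches the disk size in TB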

Good luck,


From: ceph-users  on behalf of Marco Gaiarin 

Sent: Thursday, November 22, 2018 3:22 AM
To: ceph-us...@ceph.com
Subject: [ceph-users] New OSD with weight 0, rebalance still happen...


Ceph still surprises me: whenever I'm sure I've fully understood it,
something 'strange' (to my knowledge) happens.


I need to move a server out of my Ceph Hammer cluster (3 nodes, 4 OSDs
per node), and for some reasons I cannot simply move the disks.
So I've added a new node, and yesterday I set up the new 4 OSDs.
My plan was to add the 4 OSDs with weight 0, and then slowly lower
the old OSDs' weight and increase the weight of the new ones.

Beforehand I ran:

ceph osd set noin

and then added the OSDs, and (as expected) the new OSDs started with weight 0.

But, despite the fact that the weight is zero, a rebalance happens, and
the percentage rebalanced is 'weighted' to the size of the new disk (e.g.,
I had roughly 18TB of space, I added a 2TB disk, and roughly 10% of the
data started to rebalance).


Why? Thanks.

--
dott. Marco Gaiarin GNUPG Key ID: 240A3D66
  Associazione ``La Nostra Famiglia''  http://www.lanostrafamiglia.it/
  Polo FVG   -   Via della Bontà, 7 - 33078   -   San Vito al Tagliamento (PN)
  marco.gaiarin(at)lanostrafamiglia.it   t +39-0434-842711   f +39-0434-842797

Donate your 5 PER MILLE to LA NOSTRA FAMIGLIA!
  http://www.lanostrafamiglia.it/index.php/it/sostienici/5x1000
(tax code 00307430132, category ONLUS or RICERCA SANITARIA)


[ceph-users] will crush rule be used during object relocation in OSD failure ?

2018-11-23 Thread ST Wong (ITSC)
Hi all,


We have 8 OSD hosts, 4 in room 1 and 4 in room 2.

A pool with size = 3 using the following CRUSH rule is created, to cater for
room failure.


rule multiroom {
        id 0
        type replicated
        min_size 2
        max_size 4
        step take default
        step choose firstn 2 type room
        step chooseleaf firstn 2 type host
        step emit
}



We're expecting:

1. For each object, there are always 2 replicas in one room and 1 replica in
the other room, making size=3.  But we can't control which room has 1 or 2
replicas.

2. In case an OSD host fails, Ceph will assign remaining OSDs to the same PG
to hold the replicas that were on the failed host.  Selection is based on the
pool's CRUSH rule, thus maintaining the same failure domain - it won't put all
replicas in the same room.

3. In case the entire room holding 1 replica fails, the pool will remain
degraded but won't do any replica relocation.

4. In case the entire room holding 2 replicas fails, Ceph will make use of
OSDs in the surviving room to make 2 replicas.  The pool will not be writeable
before all objects have 2 copies (unless we make pool size=4?).  Then, when
recovery is complete, the pool will remain in a degraded state until the
failed room recovers.

Is our understanding correct?  Thanks a lot.
Will do some simulation later to verify.
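
One way to simulate it offline is with crushtool (a sketch; adjust the rule id
and the range of test inputs to your setup):

    ceph osd getcrushmap -o crushmap.bin
    crushtool -i crushmap.bin --test --rule 0 --num-rep 3 \
        --show-mappings --min-x 0 --max-x 99
    # simulate failed OSDs by overriding their weights to 0, e.g. osd.0 and osd.1:
    crushtool -i crushmap.bin --test --rule 0 --num-rep 3 \
        --show-mappings --weight 0 0 --weight 1 0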

Regards,
/stwong


[ceph-users] CephFS file contains garbage zero padding after an unclean cluster shutdown

2018-11-23 Thread Hector Martin
Background: I'm running single-node Ceph with CephFS as an experimental
replacement for "traditional" filesystems. In this case I have 11 OSDs,
1 mon, and 1 MDS.

I just had an unclean shutdown (kernel panic) while a large (>1TB) file
was being copied to CephFS (via rsync). Upon bringing the system back
up, I noticed that the (incomplete) file has about 320MB worth of zeroes
at the end.

This is the kind of behavior I would expect of traditional local
filesystems, where file metadata was updated to reflect the new size of
a growing file before disk extents were allocated and filled with data,
so an unclean shutdown results in files with tails of zeroes, but I'm
surprised to see it with Ceph. I expected the OSD side of things to be
atomic with all the BlueStore goodness, checksums, etc. I figured
CephFS would build upon those primitives in a way that makes this kind of
inconsistency impossible.

Is this expected behavior? It's not a huge dealbreaker, but I'd like to
understand how this kind of situation happens in CephFS (and how it
could affect a proper cluster, if at all - can this happen if e.g. a
client, or an MDS, or an OSD dies uncleanly? Or only if several things
go down at once?)

-- 
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub


Re: [ceph-users] ceph 12.2.9 release

2018-11-23 Thread Matthew Vernon
On 07/11/2018 23:28, Neha Ojha wrote:

> For those who haven't upgraded to 12.2.9 -
> 
> Please avoid this release and wait for 12.2.10.

Any idea when 12.2.10 is going to be here, please?

Regards,

Matthew


-- 
 The Wellcome Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 


[ceph-users] Disable intra-host replication?

2018-11-23 Thread Marco Gaiarin


Previous (partial) node failures and my current experiments on adding a
node have led me to the fact that, when rebalancing is needed, Ceph
also rebalances intra-node: e.g., if an OSD of a node dies, data is
rebalanced onto all OSDs, even though I have pool replication size 3 and 3 nodes.

This, indeed, makes perfect sense: overall data scattering gives better
performance and safety.


But... is there some way to tell CRUSH 'don't rebalance within the same
node, go into degraded mode instead'?


Thanks.

-- 
dott. Marco Gaiarin GNUPG Key ID: 240A3D66
  Associazione ``La Nostra Famiglia''  http://www.lanostrafamiglia.it/
  Polo FVG   -   Via della Bontà, 7 - 33078   -   San Vito al Tagliamento (PN)
  marco.gaiarin(at)lanostrafamiglia.it   t +39-0434-842711   f +39-0434-842797

Donate your 5 PER MILLE to LA NOSTRA FAMIGLIA!
  http://www.lanostrafamiglia.it/index.php/it/sostienici/5x1000
(tax code 00307430132, category ONLUS or RICERCA SANITARIA)


Re: [ceph-users] Problem with CephFS

2018-11-23 Thread Rodrigo Embeita
Hi Daniel, thanks a lot for your help.
Do you know how I can recover the data in this scenario, since I lost
1 node with 6 OSDs?
My configuration had 12 OSDs (6 per host).

Regards

On Wed, Nov 21, 2018 at 3:16 PM Daniel Baumann 
wrote:

> Hi,
>
> On 11/21/2018 07:04 PM, Rodrigo Embeita wrote:
> > Reduced data availability: 7 pgs inactive, 7 pgs down
>
> this is your first problem: unless you have all data available again,
> cephfs will not be back.
>
> after that, I would take care of the redundancy next, and get the one
> missing monitor back online.
>
> once that is done, get the mds working again and your cephfs should be
> back in service.
>
> if you encounter problems with any of the steps, send all the necessary
> commands and outputs to the list and I (or others) can try to help.
>
> Regards,
> Daniel


Re: [ceph-users] Fwd: Re: RocksDB and WAL migration to new block device

2018-11-23 Thread Francois Scheurer

Dear Igor


Thank you for your help!
I am working with Florian.
We have built the ceph-bluestore-tool with your patch on SLES 12SP3.

We will post back the results  ASAP.


Best Regards
Francois Scheurer





 Forwarded Message 
Subject: Re: [ceph-users] RocksDB and WAL migration to new block device
Date: Wed, 21 Nov 2018 11:34:47 +0300
From: Igor Fedotov 
To: Florian Engelmann , 
ceph-users@lists.ceph.com


Actually (given that your devices are already expanded) you don't
need to expand them once again - one can just update the size labels with
my new PR.


For new migrations you can use the updated bluefs expand command, which
sets the size label automatically.
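
Something along these lines, presumably (a sketch, assuming osd.0 and that the
OSD is stopped while the tool runs):

    systemctl stop ceph-osd@0
    ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-0
    systemctl start ceph-osd@0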



Thanks,
Igor
On 11/21/2018 11:11 AM, Florian Engelmann wrote:
Great support, Igor! Both thumbs up! We will try to build the tool
today and expand those bluefs devices once again.



Am 11/20/18 um 6:54 PM schrieb Igor Fedotov:

FYI: https://github.com/ceph/ceph/pull/25187


On 11/20/2018 8:13 PM, Igor Fedotov wrote:


On 11/20/2018 7:05 PM, Florian Engelmann wrote:

Am 11/20/18 um 4:59 PM schrieb Igor Fedotov:



On 11/20/2018 6:42 PM, Florian Engelmann wrote:

Hi Igor,



what's your Ceph version?


12.2.8 (SES 5.5 - patched to the latest version)



Can you also check the output for

ceph-bluestore-tool show-label -p 


ceph-bluestore-tool show-label --path /var/lib/ceph/osd/ceph-0/
infering bluefs devices from bluestore path
{
    "/var/lib/ceph/osd/ceph-0//block": {
        "osd_uuid": "1e5b3908-20b1-41e4-b6eb-f5636d20450b",
        "size": 8001457295360,
        "btime": "2018-06-29 23:43:12.088842",
        "description": "main",
        "bluefs": "1",
        "ceph_fsid": "a146-6561-307e-b032-c5cee2ee520c",
        "kv_backend": "rocksdb",
        "magic": "ceph osd volume v026",
        "mkfs_done": "yes",
        "ready": "ready",
        "whoami": "0"
    },
    "/var/lib/ceph/osd/ceph-0//block.wal": {
        "osd_uuid": "1e5b3908-20b1-41e4-b6eb-f5636d20450b",
        "size": 524288000,
        "btime": "2018-06-29 23:43:12.098690",
        "description": "bluefs wal"
    },
    "/var/lib/ceph/osd/ceph-0//block.db": {
        "osd_uuid": "1e5b3908-20b1-41e4-b6eb-f5636d20450b",
        "size": 524288000,
        "btime": "2018-06-29 23:43:12.098023",
        "description": "bluefs db"
    }
}





It should report 'size' labels for every volume, please check 
they contain new values.




That's exactly the problem: neither "ceph-bluestore-tool
show-label" nor "ceph daemon osd.0 perf dump | jq '.bluefs'"
recognized the new sizes. But we are 100% sure the new devices
are used, as we already deleted the old ones...


We tried to delete the key "size" in order to re-add it with the new
value, but:


ceph-bluestore-tool rm-label-key --dev 
/var/lib/ceph/osd/ceph-0/block.db -k size

key 'size' not present

even if:

ceph-bluestore-tool show-label --dev 
/var/lib/ceph/osd/ceph-0/block.db

{
    "/var/lib/ceph/osd/ceph-0/block.db": {
        "osd_uuid": "1e5b3908-20b1-41e4-b6eb-f5636d20450b",
        "size": 524288000,
        "btime": "2018-06-29 23:43:12.098023",
        "description": "bluefs db"
    }
}

So it looks like the key "size" is "read-only"?


There was a bug in updating specific keys, see
https://github.com/ceph/ceph/pull/24352

This PR also eliminates the need to set sizes manually on 
bdev-expand.


I thought it had been backported to Luminous but it looks like it
hasn't been.

Will submit a PR shortly.




Thank you so much Igor! So we have to decide how to proceed. Maybe 
you could help us here as well.


Option A: Wait for this fix to be available. -> could last weeks 
or even months
if you can build a custom version of ceph-bluestore-tool then this
is a short path. I'll submit a patch today or tomorrow which you
need to integrate into your private build.

Then you need to upgrade just the tool and apply new sizes.
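
For reference, with the patched tool, applying the new sizes would presumably
be something like this (a sketch; the values must be the exact byte sizes of
the new partitions, e.g. 60 GiB and 2 GiB here - check with
'blockdev --getsize64 <device>'):

    ceph-bluestore-tool set-label-key --dev /var/lib/ceph/osd/ceph-0/block.db \
        -k size -v 64424509440
    ceph-bluestore-tool set-label-key --dev /var/lib/ceph/osd/ceph-0/block.wal \
        -k size -v 2147483648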



Option B: Recreate OSDs "one-by-one". -> will take a very long 
time as well

No need for that IMO.


Option C: Is there some "low-level" command allowing us to fix those
sizes?
Well, a hex editor might help here as well. What you need is just to
update the 64-bit size value in the block.db and block.wal files. In my
lab I can find it at offset 0x52. Most probably this is the fixed
location, but it's better to check beforehand - the existing field
should contain a value corresponding to the one reported with
show-label. Or I can do that for you - please send me the first 4K
chunks along with the corresponding label report.
Then update with the new values - the field has to contain exactly the
same size as your new partition.
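
As a rough sketch for grabbing those chunks (assuming osd.0; plain dd and xxd
are enough):

    dd if=/var/lib/ceph/osd/ceph-0/block.db  of=blockdb-head.bin  bs=4096 count=1
    dd if=/var/lib/ceph/osd/ceph-0/block.wal of=blockwal-head.bin bs=4096 count=1
    # the 64-bit little-endian size field should be visible around offset 0x52
    xxd blockdb-head.bin | head -16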










Thanks,

Igor


On 11/20/2018 5:29 PM, Florian Engelmann wrote:

Hi,

today we migrated all of our RocksDB and WAL devices to new
ones. The new ones are much bigger (500MB for WAL/DB -> 60GB
DB and 2GB WAL) and LVM based.


We migrated like this:

    export OSD=x

    systemctl stop ceph-osd@$OSD

    lvcreate -n db-osd$OSD -L60g data || exit 1
    lvcreate -n wal-osd$OSD 

Re: [ceph-users] New OSD with weight 0, rebalance still happen...

2018-11-23 Thread Janne Johansson
On Fri, 23 Nov 2018 at 11:08, Marco Gaiarin wrote:

> Reading the Ceph docs led me to believe that 'ceph osd reweight' and 'ceph
> osd crush reweight' were roughly the same: the first is effectively
> 'temporary' and expressed as a percentage (0-1), while the second is
> 'permanent' and expressed, normally, in disk terabytes.
>
> You are saying that instead the first modifies only the disk occupation,
> while only the latter alters the CRUSH map.

The CRUSH weight tells the cluster how much this disk adds to the
capacity of the host it is attached to; the OSD weight says (from 0 to 1)
how much of the advertised size it actually wants to receive/handle.
If you add CRUSH weight, data will flow to the node, but if an OSD has a
low OSD weight, the other OSDs on the host will have to bear the extra
data. So starting out with 0 for the CRUSH weight and 1.0 for the OSD
weight is fine; it will not cause data movement until you start (slowly
perhaps) to add to the CRUSH weight until it matches the size of the disk.
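
In command terms (a sketch, using osd.12 just as an example):

    ceph osd crush reweight osd.12 1.81999   # permanent CRUSH weight, normally ~ disk size in TB
    ceph osd reweight osd.12 1.0             # temporary override, range 0-1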

--
May the most significant bit of your life be positive.


Re: [ceph-users] New OSD with weight 0, rebalance still happen...

2018-11-23 Thread Marco Gaiarin
Hi, Paweł Sadowski!
  In that message you wrote...

> This is most probably due to the big difference in weights between your hosts
> (the new one has a 20x lower weight than the old ones), which in combination
> with the straw algorithm is a 'known' issue.

OK. I've reweighted that disk back to '1' and the status went back to
HEALTH_OK.


> You could try to increase choose_total_tries in your crush map from 50 to
> some bigger number. The best IMO would be to use straw2 (which will cause
> some rebalance) and then use 'ceph osd crush reweight' (instead of
> 'ceph osd reweight') with small steps to slowly rebalance data onto
> new OSDs.

For now I'm bringing in the new disks with 'ceph osd reweight';
probably when I'm at 50% on the new disks I'll start to use
'ceph osd crush reweight' against the old ones.

Thanks.

-- 
dott. Marco Gaiarin GNUPG Key ID: 240A3D66
  Associazione ``La Nostra Famiglia''  http://www.lanostrafamiglia.it/
  Polo FVG   -   Via della Bontà, 7 - 33078   -   San Vito al Tagliamento (PN)
  marco.gaiarin(at)lanostrafamiglia.it   t +39-0434-842711   f +39-0434-842797

Donate your 5 PER MILLE to LA NOSTRA FAMIGLIA!
  http://www.lanostrafamiglia.it/index.php/it/sostienici/5x1000
(tax code 00307430132, category ONLUS or RICERCA SANITARIA)


Re: [ceph-users] New OSD with weight 0, rebalance still happen...

2018-11-23 Thread Marco Gaiarin
Hi, Paweł Sadowski!
  In that message you wrote...

> Exactly, your 'new' OSDs have weight 1.81999 (osd.12, osd.13) and 0.90999
> (osd.14, osd.15). As Jarek pointed out, you should add them using
>   'osd crush initial weight = 0'
> and then use
>   'ceph osd crush reweight osd.x 0.05'
> to slowly increase the weight on them.
> From your osd tree it looks like you used 'ceph osd reweight'.

Reading the Ceph docs led me to believe that 'ceph osd reweight' and 'ceph
osd crush reweight' were roughly the same: the first is effectively
'temporary' and expressed as a percentage (0-1), while the second is
'permanent' and expressed, normally, in disk terabytes.

You are saying that instead the first modifies only the disk occupation,
while only the latter alters the CRUSH map.

Right?


Is this true only for the 'straw' algorithm, or is it general? Thanks.

-- 
dott. Marco Gaiarin GNUPG Key ID: 240A3D66
  Associazione ``La Nostra Famiglia''  http://www.lanostrafamiglia.it/
  Polo FVG   -   Via della Bontà, 7 - 33078   -   San Vito al Tagliamento (PN)
  marco.gaiarin(at)lanostrafamiglia.it   t +39-0434-842711   f +39-0434-842797

Donate your 5 PER MILLE to LA NOSTRA FAMIGLIA!
  http://www.lanostrafamiglia.it/index.php/it/sostienici/5x1000
(tax code 00307430132, category ONLUS or RICERCA SANITARIA)