On Mon, Feb 5, 2018 at 3:23 AM Caspar Smit <caspars...@supernas.eu> wrote:

> Hi Gregory,
>
> Thanks for your answer.
>
> I had to add another step emit to your suggestion to make it work:
>
> step take default
> step chooseleaf indep 4 type host
> step emit
> step take default
> step chooseleaf indep 4 type host
> step emit
>
> However, now the same OSDs are chosen twice for every PG:
>
> # crushtool --test -i compiled-crushmap-new --rule 1 --show-mappings --x 1
> --num-rep 8
> CRUSH rule 1 x 1 [5,9,3,12,5,9,3,12]
>

Oh, that must be because both passes get the exact same inputs on every run,
so they pick the same OSDs. Hrmmm... Sage, is there a way to seed them
differently? Or do you have any other ideas? :/
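
As a sanity check on that theory, you could run the same test over a range of
inputs rather than a single x; if every input maps both passes to the same
OSDs, it's the identical-input problem rather than bad luck on x=1. This is
just a sketch: the flags are standard crushtool options, the map file and rule
number are the ones from your mail, and the 0-1023 range is arbitrary:

# evaluate rule 1 for inputs 0..1023, asking for 8 positions each,
# and print every resulting mapping so duplicates are easy to spot
crushtool --test -i compiled-crushmap-new --rule 1 --num-rep 8 \
    --min-x 0 --max-x 1023 --show-mappings

With the two identical take/chooseleaf/emit passes I'd expect every line to
show the same four OSDs twice instead of eight distinct ones.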




> I'm wondering why something like this won't work (the crushtool test output
> ends up empty):
>
> step take default
> step chooseleaf indep 4 type host
> step choose indep 2 type osd
> step emit
>

Chooseleaf tells CRUSH to go all the way down to individual OSDs. I'm not
quite sure what happens when you then tell it to pick OSDs again, but
obviously it fails (the instruction is nonsense once you are already holding
leaves) and emits an empty list.
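
If you want to see where that chooseleaf step leaves you, a quick sketch: add
a throwaway rule that stops right after the chooseleaf step and test it on its
own. The rule name and ruleset number here are made up for the experiment; the
rest mirrors the map from your mail:

rule test_chooseleaf_only {
  ruleset 2
  type erasure
  min_size 4
  max_size 4
  step take default
  step chooseleaf indep 4 type host
  step emit
}

# recompile the map, then:
# crushtool --test -i compiled-crushmap-new --rule 2 --show-mappings --x 1 --num-rep 4

I'd expect that to print four OSD ids (one per host), which is the point: the
step already returns leaves, so a following "step choose indep 2 type osd" has
no host buckets left to descend into.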



>
> # crushtool --test -i compiled-crushmap-new --rule 1 --show-mappings --x 1
> --num-rep 8
> CRUSH rule 1 x 1 []
>
> Kind regards,
> Caspar Smit
>
> 2018-02-02 19:09 GMT+01:00 Gregory Farnum <gfar...@redhat.com>:
>
>> On Fri, Feb 2, 2018 at 8:13 AM, Caspar Smit <caspars...@supernas.eu>
>> wrote:
>> > Hi all,
>> >
>> > I'd like to set up a small cluster (5 nodes) using erasure coding. I
>> > would like to use k=5 and m=3.
>> > Normally you would need a minimum of 8 nodes (preferably 9 or more) for
>> > this.
>> >
>> > Then I found this blog:
>> > https://ceph.com/planet/erasure-code-on-small-clusters/
>> >
>> > This sounded ideal to me, so I started building a test setup using the
>> > 5+3 profile.
>> >
>> > Changed the erasure ruleset to:
>> >
>> > rule erasure_ruleset {
>> >   ruleset X
>> >   type erasure
>> >   min_size 8
>> >   max_size 8
>> >   step take default
>> >   step choose indep 4 type host
>> >   step choose indep 2 type osd
>> >   step emit
>> > }
>> >
>> > Created a pool, and now every PG has 8 shards across 4 hosts with 2
>> > shards each, perfect.
>> >
>> > But then I tested a node failure; no problem again, all PGs stayed active
>> > (most undersized+degraded, but still active). Then after 10 minutes the
>> > OSDs on the failed node were all marked out, as expected.
>> >
>> > I waited for the data to be recovered to the other (fifth) node, but that
>> > doesn't happen; there is no recovery whatsoever.
>> >
>> > Only when I completely remove the down+out OSDs from the cluster is the
>> > data recovered.
>> >
>> > My guess is that the "step choose indep 4 type host" chooses 4 hosts
>> > beforehand to store data on.
>>
>> Hmm, basically, yes. The basic process is:
>>
>> >   step take default
>>
>> take the default root.
>>
>> >   step choose indep 4 type host
>>
>> Choose four hosts that exist under the root. *Note that at this layer,
>> it has no idea what OSDs exist under the hosts.*
>>
>> >   step choose indep 2 type osd
>>
>> Within each of the hosts chosen above, choose two OSDs.
>>
>>
>> Marking out an OSD does not change the weight of its host, because
>> that would cause massive data movement across the whole cluster on a
>> single disk failure. The "chooseleaf" commands deal with this (because if
>> they fail to pick an OSD within the host, they will back out and go
>> for a different host), but that doesn't work when you're doing
>> independent "choose" steps.
>>
>> I don't remember the implementation details well enough to be sure,
>> but you *might* be able to do something like
>>
>> step take default
>> step chooseleaf indep 4 type host
>> step take default
>> step chooseleaf indep 4 type host
>> step emit
>>
>> And that will make sure you get at least 4 OSDs involved?
>> -Greg
>>
>> >
>> > Would it be possible to do something like this:
>> >
>> > Create a 5+3 EC profile where every host has a maximum of 2 shards (so 4
>> > hosts are needed); in case of node failure -> recover the data from the
>> > failed node to the fifth node.
>> >
>> > Thank you in advance,
>> > Caspar
>> >
>> >
>> >