Oops... too fast to answer...

G.

On Mon, 11 May 2015 12:13:48 +0300, Timofey Titovets wrote:
Hey! I caught it again. It's a kernel bug: the kernel crashes if I try to
map an rbd device with a map like the one above!
Hooray!

2015-05-11 12:11 GMT+03:00 Timofey Titovets <nefelim...@gmail.com>:
FYI and for the record.
Rule:
# rules
rule replicated_ruleset {
  ruleset 0
  type replicated
  min_size 1
  max_size 10
  step take default
  step choose firstn 0 type room
  step choose firstn 0 type rack
  step choose firstn 0 type host
  step chooseleaf firstn 0 type osd
  step emit
}
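
(As an aside: a map like this can be exercised offline with crushtool before
mapping anything. A minimal sketch, where crushmap.bin is an arbitrary file
name and the replica count is just a placeholder:

  ceph osd getcrushmap -o crushmap.bin     # export the compiled map from the cluster
  crushtool -i crushmap.bin --test --rule 0 --num-rep 3 --show-mappings

--show-mappings prints the OSDs that each sample input maps to with ruleset 0,
so you can at least see what the room/rack/host/osd steps actually select. Of
course it will not predict a kernel-client crash.)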

And after resetting the node, I can't find any usable info. The cluster works
fine and the data was simply rebalanced across the OSD disks.
syslog:
May  9 19:30:02 srv-lab-ceph-node-01 systemd[1]: Reloading.
May 9 19:30:02 srv-lab-ceph-node-01 systemd[1]: Starting Network Time
Synchronization...
May 9 19:30:02 srv-lab-ceph-node-01 systemd[1]: Started Network Time
Synchronization.
May  9 19:30:02 srv-lab-ceph-node-01 systemd[1]: Reloading.
May  9 19:30:02 srv-lab-ceph-node-01 CRON[1731]: (CRON) info (No MTA
installed, discarding output)
May 11 11:54:57 srv-lab-ceph-node-01 rsyslogd: [origin software="rsyslogd" swVersion="7.4.4" x-pid="689" x-info="http://www.rsyslog.com"] start
May 11 11:54:56 srv-lab-ceph-node-01 rsyslogd: rsyslogd's groupid changed to 103
May 11 11:54:57 srv-lab-ceph-node-01 rsyslogd: rsyslogd's userid changed to 100

Sorry for the noise, guys. Georgios, thanks anyway for helping.

2015-05-10 12:44 GMT+03:00 Georgios Dimitrakakis <gior...@acmac.uoc.gr>:
Timofey,

maybe your best chance is to connect directly to the server and see what is
going on, and then try to debug why the problem occurred. If you don't want
to wait until tomorrow, you can check on the machine through the server's
direct remote console access. Most servers provide one, just under a
different name (Dell calls it iDRAC, Fujitsu iRMC, etc.), so if you have it
up and running you can use that.

I think this should be your starting point and you can take it from there.

I am sorry I cannot help you further with the CRUSH rules and the reason why
it crashed, since I am far from being an expert in the field :-(

Regards,

George


Georgios, oh, sorry for my poor English _-_, maybe I expressed poorly what
I want =]

I know how to write a simple CRUSH rule and how to use it; what I want is
several things:
1. To understand why my test node went offline after injecting the bad map.
This was unexpected.
2. Maybe somebody can explain what happens with this map, and why.
3. Writing several crushmaps and/or switching between them while the cluster
is running is not a problem (see the sketch after this list). But in
production we have several NFS servers that I am thinking about moving to
Ceph, and I can't take more than one server down for maintenance at a time.
I want to avoid a data disaster while setting up and moving the data to
Ceph, so a rule like "use local data replication if only one node exists"
looks usable as a temporary solution until I add a second node _-_.
4. Maybe someone else also has a test cluster and can check what happens to
clients when a crushmap like this one is injected.
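
A minimal sketch of the switching in point 3, assuming a second rule was
compiled into the map with ruleset id 1 and the pool is named rbd (both are
placeholders):

  ceph osd pool set rbd crush_ruleset 1    # point the pool at the host-level rule
  ceph osd pool get rbd crush_ruleset      # confirm which ruleset the pool now uses

The data is then rebalanced according to the new rule while the cluster
stays online.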

2015-05-10 8:23 GMT+03:00 Georgios Dimitrakakis <gior...@acmac.uoc.gr>:

Hi Timofey,

assuming that you have more than one OSD host and that the replication
factor is equal to (or less than) the number of hosts, why don't you just
change the crushmap to host replication?

You just need to change the default CRUSHmap rule from

step chooseleaf firstn 0 type osd

to

step chooseleaf firstn 0 type host

I believe that this is the easiest way to have replication across OSD
nodes, unless you have a much more "sophisticated" setup.
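
For completeness, a rough sketch of that edit cycle (the file names are
arbitrary):

  ceph osd getcrushmap -o crushmap.bin        # export the compiled map
  crushtool -d crushmap.bin -o crushmap.txt   # decompile it to text
  # edit crushmap.txt: change "step chooseleaf firstn 0 type osd"
  #                    to     "step chooseleaf firstn 0 type host"
  crushtool -c crushmap.txt -o crushmap.new   # recompile
  ceph osd setcrushmap -i crushmap.new        # inject the new map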

Regards,

George



Hi list,
I have been experimenting with CRUSH maps, trying to get RAID1-like
behaviour: if the cluster has only one working OSD node, duplicate the data
across its local disks, to avoid data loss on a local disk failure and to
let clients keep working, because that would not be a degraded state.
(
  In the best case, I want a dynamic rule like:
  if there is only one host -> spread the data over the local disks;
  else if the host count > 1 -> spread over hosts (racks or something else);
)

I wrote a rule like the one below:

rule test {
              ruleset 0
              type replicated
              min_size 0
              max_size 10
              step take default
              step choose firstn 0 type host
              step chooseleaf firstn 0 type osd
              step emit
}

I injected it into the cluster, and the client node now appears to have hit
a kernel panic; I've lost my connection to it. No ssh, no ping. It is a
remote node and I can't see what happened until Monday.
Yes, it looks like I've shot myself in the foot.
This is just a test setup, so destroying the cluster is not a problem, but I
think broken rules must not crash anything else and, in the worst case,
should simply be rejected by the cluster/crushtool compiler (a sketch of
such an offline check follows).
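
At least the edited map can be compiled and dry-run locally before injecting
it. A minimal sketch, with the file names and replica count as placeholders:

  crushtool -c crushmap.txt -o crushmap.bin
  crushtool -i crushmap.bin --test --rule 0 --num-rep 2 --show-bad-mappings

crushtool refuses to compile a syntactically broken map, and
--show-bad-mappings reports inputs that end up with fewer replicas than
requested, though neither check would have caught the kernel-side crash.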

Maybe someone can explain how this rule can crash the system? Or is there a
crazy mistake somewhere?



--
Have a nice day,
Timofey.

--
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
