Hi all,

we have a small (?) problem with a 2-node cluster on Debian 8:

Linux h1b 3.16.0-4-amd64 #1 SMP Debian 3.16.43-2+deb8u2 (2017-06-26)
x86_64 GNU/Linux

ocfs2-tools 1.6.4-3

Two ocfs2 filesystems (drbd0, 600 GB with 8 slots, and drbd1, 6 TB with
6 slots) were created on top of DRBD with 4k block and cluster size and
the 'max-features' feature level enabled.
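
For reference, the filesystems were created roughly along these lines
(a sketch from memory rather than a copy of our shell history; the
labels are placeholders):

  mkfs.ocfs2 -b 4K -C 4K -N 8 --fs-feature-level=max-features -L drbd0-fs /dev/drbd0
  mkfs.ocfs2 -b 4K -C 4K -N 6 --fs-feature-level=max-features -L drbd1-fs /dev/drbd1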

cluster.conf assigns sequential node numbers 1-8. Nodes 1 and 2 are the
hypervisors. Nodes 3, 4, 5 are VMs on node 1; nodes 6, 7, 8 are the
corresponding VMs on node 2.
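
For illustration, the relevant cluster.conf stanzas look like this (IP
addresses, hostnames and the cluster name are placeholders here, the
node numbering is the part that matters):

  cluster:
          node_count = 8
          name = <cluster name>

  node:
          ip_port = 7777
          ip_address = <hypervisor 1 IP>
          number = 1
          name = <hypervisor 1 hostname>
          cluster = <cluster name>

  node:
          ip_port = 7777
          ip_address = <VM on node 2 IP>
          number = 8
          name = <VM on node 2 hostname>
          cluster = <cluster name>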

VMs all run Debian 8 as well:

Linux srv2 3.16.0-4-amd64 #1 SMP Debian 3.16.39-1 (2016-12-30) x86_64
GNU/Linux

When mounting drbd0 in order of increasing node numbers, while
concurrently watching the 'hb' output from debugfs.ocfs2, we get a
clean slot map (?):

hb
        node: node              seq       generation checksum
           1:    1 0000000059b8d94a fa60f0d8423590d9 edec9643
           2:    2 0000000059b8d94c aca059df4670f467 994e3458
           3:    3 0000000059b8d949 f03dc9ba8f27582c d4473fc2
           4:    4 0000000059b8d94b df5bbdb756e757f8 12a198eb
           5:    5 0000000059b8d94a 1af81d94a7cb681b 91fba906
           6:    6 0000000059b8d94b 104538f30cdb35fa 8713e798
           7:    7 0000000059b8d94b 195658c9fb8ca7f9 5e54edf6
           8:    8 0000000059b8d949 dc6bfb46b9cf1ac3 de7a8757
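
(For reference, we poll the slot map with something along these lines;
the interval is arbitrary:)

  watch -n 2 "debugfs.ocfs2 -R hb /dev/drbd0"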

Device drbd1, in contrast, yields the following table after mounting on
nodes 1 and 2:

hb
        node: node              seq       generation checksum
           8:    1 0000000059b8d9ba 73a63eb550a33095 f4e074d1
          16:    2 0000000059b8d9b9 5c7504c05637983e 07d696ec

Proceeding with the drbd1 mounts on nodes 3, 5, 6 leads us to:

hb
        node: node              seq       generation checksum
           3:    3 0000000059b8da3b 9443b4b209b16175 f2cc87ec
           5:    5 0000000059b8da3c 4b742f709377466f 3ac41cf3
           6:    6 0000000059b8da3b d96e2de0a55514f6 335a4d90
           8:    1 0000000059b8da3c 73a63eb550a33095 2312c1c4
          16:    2 0000000059b8da3d 5c7504c05637983e 659571a1

The problem arises when we try to mount drbd1 on node 8, since its
heartbeat slot is already occupied by node 1:

kern.log node 1:

(o2hb-0AEE381A14,50990,4):o2hb_check_own_slot:582 ERROR: Another node is
heartbeating on device (drbd1): expected(1:0x18acf7b0b3e5544c,
0x59b8445c), ondisk(8:0xb91302db72a65364, 0x59b8445b)

kern.log node 8:

ocfs2: Mounting device (254,16) on (node 8, slot 7) with ordered data mode.
(o2hb-0AEE381A14,518,1):o2hb_check_own_slot:582 ERROR: Another node is
heartbeating on device (vdc): expected(8:0x18acf7b0b3e5544c,
0x59b8445c), ondisk(1:0x18acf7b0b3e5544c, 0x59b8445c)

This can be "fixed" by exchanging node numbers 1 and 8 in cluster.conf.
Node 8 is then assigned slot 8, node 2 stays in slot 16, and nodes 3 to
7 come up as expected. Since no node 16 is configured, there is no
conflict. But because we are also seeing some so far unexplained
instabilities with this ocfs2 device / system during operation, we
decided to get to the bottom of this issue first.

Somehow the failure is reminiscent of a bit shift or masking problem:

1 << 3 = 8
2 << 3 = 16

But then again - what do I know ...
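
Just to spell the hunch out (plain shell arithmetic, nothing
ocfs2-specific, and it may well be a coincidence): a left shift by 3
maps the two hypervisor node numbers exactly onto the block indices we
see, while the VM nodes 3-7 do not show that pattern at all:

  $ for n in 1 2 3; do echo "$n << 3 = $(( n << 3 ))"; done
  1 << 3 = 8
  2 << 3 = 16
  3 << 3 = 24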

Tried so far:

A. Recreated the offending file system with 8 slots instead of 6 ->
same issue.
B. Set the feature level to 'default' (which disables the
'extended-slotmap' feature) -> same issue.

We'd very much appreciate any comments on this. Has anyone experienced
anything similar before? Are we completely missing something important
here?

If a fix for this is already out there, any pointers (source files /
commits) on where to look would be greatly appreciated.

Thanks in advance + Best regards ... Michael U.
