On 2021-08-10 3:16 p.m., Janusz Jaskiewicz wrote:
Hi,

Thanks for your answers.

Answering your questions:
DRBD_KERNEL_VERSION=9.0.25

Linux kernel:
4.18.0-305.3.1.el8.x86_64

File system type: 
XFS.

So the file system is not cluster-aware, but as far as I understand, in an active/passive, single-primary setup (which is what I have) that should be OK.
I just checked the documentation, which seems to confirm that.

I think the problem may come from the way I'm testing it.
I came up with the testing scenario that I described in my first post because I didn't have an easy way to abruptly restart the server.
When I do a hard reset of the primary server, it works as expected (or at least I can find a logical explanation for what I see).

I think what happened in my previous scenario was:
The service is writing to the disk, and some portion of the written data is still in the cache. As the picture https://linbit.com/wp-content/uploads/drbd/drbd-guide-9_0-en/images/drbd-in-kernel.png shows, that cache sits above the DRBD module.
Then I kill the service and the network, but some data is still in the cache.
At some point the cache is flushed and the data gets written to the disk.
DRBD probably reports some error at this point, as it can't send that data to the secondary node (DRBD thinks the other node has left the cluster).

When I check the files at this point, I see more data on the primary, because it also contains the data from the cache, which was not replicated because the network was already down by the time that data hit DRBD.

When I do a hard restart of the server, the data in the cache is lost, so we don't observe the result described above.
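
For what it's worth, if I wanted to take the cache out of the picture in this test, I suppose the service could fsync() after each write, so the data reaches DRBD before I cut the network. A minimal sketch of what I mean (the file path is just a placeholder for my DRBD-backed mount point):

/* Sketch only: write a record and force it out of the page cache,
 * down through the file system and DRBD, before the test continues.
 * The path below is a placeholder, not my real mount point. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/mnt/drbd0/testfile";

    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) { perror("open"); return 1; }

    const char rec[] = "test record\n";
    if (write(fd, rec, strlen(rec)) < 0) { perror("write"); close(fd); return 1; }

    /* fsync() makes the kernel flush this file's dirty pages now,
     * instead of whenever writeback decides to, so the data actually
     * passes through DRBD while the network is still up. */
    if (fsync(fd) < 0) { perror("fsync"); close(fd); return 1; }

    close(fd);
    return 0;
}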

Does it make sense?

Regards,
Janusz.

OK, it sounded from your first post like you had the FS mounted on both nodes at the same time; that would be a problem. If it's only mounted in one place at a time, then it's OK.

As for caching: DRBD on the Secondary will report "write complete" to the Primary, in protocol C, when it has been told that the disk write is complete. So if the cache is _above_ DRBD's kernel module, that's probably not the problem, because the Secondary won't tell the Primary it's done until it receives the data. If there is a caching issue _below_ DRBD on the Secondary, then it's _possible_ that's the problem, but I doubt it. The reason is that whatever is managing the cache below DRBD on the Secondary should know that a given block hasn't been flushed yet and, on a read request, serve it from the cache rather than the disk. This is a guess on my part.

What are your 'disk-flushes' and 'md-flushes' options in the 'disk { }' section set to, yes or no?
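
For reference, both of those live in the resource's disk { } section. As I recall the syntax, it looks something like this (the resource name and values here are just an example on my part; the drbd.conf man page is authoritative):

resource r0 {
    net {
        protocol C;          # a write completes only after the peer confirms its disk write
    }
    disk {
        disk-flushes yes;    # pass flushes down to the backing device
        md-flushes yes;      # flush after DRBD metadata updates
    }
    # device / disk / address definitions omitted
}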

-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of Einstein’s brain than in the near certainty that people of equal talent have lived and died in cotton fields and sweatshops." - Stephen Jay Gould

