Hi,
On 20/02/2021 09:48, Andreas Gruenbacher wrote:
Hi all,
once we change the journal format, in addition to recording block
numbers as extents, there are some additional issues we should address
at the same time:
I. The current transaction format of our journals is as follows:
* One METADATA log descriptor block for each [503 / 247 / 119 / 55]
metadata blocks, followed by those metadata blocks. For each
metadata block, the log descriptor records the 64-bit block number.
* One JDATA log descriptor block for each [251 / 123 / 59 / 27]
  journaled data blocks, followed by those data blocks. For each
  data block, the log descriptor records the 64-bit block number
  and another 64-bit field indicating whether the block needed
  escaping.
* One REVOKE log descriptor block for the initial [503 / 247 / 119 /
55] revokes, followed by a metadata header (not to be confused
with the log header) for each additional [509 / 253 / 125 / 61]
revokes. Each revoke is recorded as a 64-bit block number in its
REVOKE log descriptor or metadata header.
* One log header with various necessary and useful metadata that
acts as a COMMIT record. If the log header is incorrect or
missing, the preceding log descriptors are ignored.
^^^^ succeeding? (I hope!)
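For reference, the per-descriptor entry counts quoted above follow directly from the block size. The sketch below assumes a 72-byte log descriptor header and a 24-byte metadata header (the sizes of the on-disk gfs2_log_descriptor and gfs2_meta_header structures), which reproduces the [503 / 247 / 119 / 55] style figures for 4096 / 2048 / 1024 / 512-byte blocks:

```python
# Derive the per-block entry counts quoted above from the block size.
# Assumes a 72-byte log descriptor header (gfs2_log_descriptor) and a
# 24-byte metadata header (gfs2_meta_header).
LOG_DESC_HDR = 72
META_HDR = 24

def entries(blksize):
    return {
        # METADATA: one 64-bit block number per entry
        "metadata": (blksize - LOG_DESC_HDR) // 8,
        # JDATA: 64-bit block number + 64-bit escape flag per entry
        "jdata": (blksize - LOG_DESC_HDR) // 16,
        # REVOKE: one 64-bit block number per entry in the descriptor...
        "revoke": (blksize - LOG_DESC_HDR) // 8,
        # ...and in each continuation metadata header
        "revoke_cont": (blksize - META_HDR) // 8,
    }

for bs in (4096, 2048, 1024, 512):
    print(bs, entries(bs))
```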
We should change that so that a single log descriptor contains a
number of records. There should be records for METADATA and JDATA
blocks that follow, as well as for REVOKES and for COMMIT. If a
transaction contains metadata and/or jdata blocks, those will
obviously need a precursor and a commit block like today, but we
shouldn't need separate blocks for metadata and journaled data in many
cases. Small transactions that consist only of revokes and a commit
should frequently fit entirely into a single block, though.
Yes, it makes sense to try to condense what we are writing. Why would
we not need separate blocks for journaled data, though? That one seems
difficult to avoid, but since journaled data is used so infrequently,
it is perhaps not such an important issue.
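A record-based descriptor along these lines might look something like the sketch below. All sizes here are made up purely for illustration (none come from an actual on-disk format); the point is just to show why a revokes-plus-commit transaction would usually fit in one block:

```python
# Hypothetical sketch only: one possible record-based layout in which a
# single block holds a descriptor header, typed records, and a commit
# record. The sizes are assumptions, not an actual journal format.
BLKSIZE = 4096
DESC_HDR = 72     # assumed combined descriptor header
REC_HDR = 8       # assumed per-record header: 32-bit type + 32-bit count
COMMIT_REC = 48   # assumed commit record (sequence number, checksum, ...)

def revokes_in_one_block():
    # Space left for revoke entries after the header, one record
    # header, and the commit record.
    space = BLKSIZE - DESC_HDR - REC_HDR - COMMIT_REC
    return space // 8   # one 64-bit block number per revoke

print(revokes_in_one_block())
```

With these assumed sizes, several hundred revokes plus the commit fit into a single 4096-byte block, so a revoke-only transaction needs no separate commit block.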
Right now, we're writing log headers ("commits") with REQ_PREFLUSH to
make sure all the log descriptors of a transaction make it to disk
before the log header. Depending on the device, this is often costly.
If we can fit an entire transaction into a single block, REQ_PREFLUSH
won't be needed anymore.
I'm not sure I agree. The purpose of the preflush is to ensure that the
data and the preceding log blocks are really on disk before we write the
commit record. That will still be required while we use ordered writes,
even if we can use (as you suggest below) a checksum to cover the whole
transaction, and thus check for a complete log record after the fact.
We would also still have to issue the flush in the case of an
fsync-derived log flush.
III. We could also checksum entire transactions to avoid REQ_PREFLUSH.
At replay time, all the blocks that make up a transaction will either
be there and the checksum will match, or the transaction will be
invalid. This should no longer be prohibitively expensive given CPU
support for CRC32C nowadays, but depending on the hardware, it may
make sense to turn it off.
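The replay-side check described here might be sketched as below. zlib.crc32 is a stand-in; the text suggests CRC32C, which has hardware support on many CPUs but is not in the Python standard library. The commit record is assumed to carry the transaction's block count alongside the checksum:

```python
# Sketch: checksum a whole transaction so replay can detect a partially
# written one without relying on REQ_PREFLUSH ordering.
import zlib

def txn_checksum(blocks):
    # Chain the CRC over every block of the transaction, in log order.
    crc = 0
    for blk in blocks:
        crc = zlib.crc32(blk, crc)
    return crc

def txn_valid(blocks, count, crc):
    # Replay-side check: the expected number of blocks must be present
    # and the checksum must match, otherwise the transaction is invalid.
    return len(blocks) == count and txn_checksum(blocks) == crc

# Log-writing side: the commit record stores the count and checksum.
blocks = [b"descriptor", b"metadata-1", b"metadata-2"]
count, crc = len(blocks), txn_checksum(blocks)

assert txn_valid(blocks, count, crc)           # complete transaction
assert not txn_valid(blocks[:-1], count, crc)  # torn write: rejected
```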
IV. We need recording of unwritten blocks / extents (allocations /
fallocate) as this will significantly speed up moving glocks from one
node to another:
That would definitely be a step forward.
At the moment, data=ordered is implemented by keeping a list of all
inodes that did an ordered write. When it comes time to flush the log,
the data of all those ordered inodes is flushed first. When all we
want is to flush a single glock in order to move it to a different
node, we currently flush all the ordered inodes as well as the journal.
If we only flushed the ordered data of the glock being moved plus the
entire journal, the ordering guarantees for the other ordered inodes
in the journal would be violated. In that scenario, unwritten blocks
could (and would) show up in files after crashes.
If we instead record unwritten blocks in the journal, we'll know which
blocks need to be zeroed out at recovery time. Once an unwritten block
is written, we record a REVOKE entry for that block.
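A rough sketch of what replay might do under this scheme, with illustrative (not actual) record names: a block recorded as unwritten is zeroed at recovery unless a later REVOKE shows it was written in the meantime.

```python
# Sketch of replay-time handling of unwritten-block records. Record
# names ("UNWRITTEN", "REVOKE") are illustrative, not a real format.
def blocks_to_zero(journal):
    unwritten, revoked = set(), set()
    for rec_type, blkno in journal:      # records in log order
        if rec_type == "UNWRITTEN":
            unwritten.add(blkno)
        elif rec_type == "REVOKE":
            revoked.add(blkno)
    # Anything still unwritten and never revoked must be zeroed.
    return unwritten - revoked

journal = [("UNWRITTEN", 100), ("UNWRITTEN", 101), ("REVOKE", 100)]
print(sorted(blocks_to_zero(journal)))   # block 101 still needs zeroing
```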
This comes at the cost of tracking those blocks of course, but with
that in place, moving a glock from one node to another will only
require flushing the underlying inode (assuming it's an inode glock)
and the journal. And most likely, we won't have to bother with
implementing "simple" transactions as described in
https://bugzilla.redhat.com/show_bug.cgi?id=1631499.
Thanks,
Andreas
That would be another way of looking at the problem, yes. It does add a
lot to the complexity though, and it doesn't scale very well on systems
with large amounts of memory (and therefore potentially lots of
unwritten extents to record and keep track of). If there are lots of
small transactions, then each one might be significantly expanded by
the information needed to track blocks that have not yet been written.
If we keep track of individual allocations/deallocations, as per Abhi's
suggestion, then we know which areas may potentially contain unwritten
data. That may allow us to avoid having to do the data writeback ahead
of the journal flush in the first place, moving us more towards the XFS
way of doing things. We would have to ensure that data is written back
before the allocation records vanish from the active part of the log,
though, so it is a slightly different constraint from the current one.
Steve.