Re: [Cluster-devel] [PATCH v4 17/20] gfs2: No revokes for transactions at the tail of the log

2021-02-22 Thread Andreas Gruenbacher
On Wed, Feb 3, 2021 at 7:08 PM Andreas Gruenbacher  wrote:
> In gfs2_log_flush, we're going through all active transactions.  For
> each of the buffers in those transactions that has completed, we either
> add a revoke to the active transaction immediately or we move the
> buffer to the transaction's ail2 list, which may result in a revoke
> later.
>
> However, whenever a transaction at the tail of the log completes, the
> current tail of the log advances.  gfs2_log_flush writes out the log
> header for the system transaction, with lh_tail set to that current tail
> (sd_log_flush_head).  This implicitly revokes all previous blocks in the
> log, so the revokes we've just written become obsolete.
>
> So instead of writing unnecessary revokes, just skip completed
> transactions at the tail of the log.
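
As a rough illustration of the idea (a toy model only, not the actual patch;
all names below are made up):

#include <stdbool.h>
#include <stdio.h>

/* A journaled transaction: where it starts in the log, how many log
 * blocks it occupies, and whether all of its buffers have been written
 * back to their final locations. */
struct toy_trans {
        unsigned int first;
        unsigned int nblocks;
        bool completed;
};

int main(void)
{
        struct toy_trans txs[] = {
                { 0, 2, true },   /* completed and at the tail */
                { 2, 3, true },   /* completed and at the advanced tail */
                { 5, 1, false },  /* still in flight */
                { 6, 2, true },   /* completed, but behind an incomplete one */
        };
        unsigned int tail = 0;    /* current tail of the log */

        for (unsigned int i = 0; i < sizeof(txs) / sizeof(txs[0]); i++) {
                const struct toy_trans *tr = &txs[i];

                if (tr->completed && tr->first == tail) {
                        /* The log header written by the flush advances lh_tail
                         * past this transaction, revoking it implicitly, so no
                         * explicit revokes are needed. */
                        tail = tr->first + tr->nblocks;
                        printf("tx@%u: skip, tail advances to %u\n",
                               tr->first, tail);
                } else if (tr->completed) {
                        printf("tx@%u: add revokes (or move buffers to ail2)\n",
                               tr->first);
                } else {
                        printf("tx@%u: still in flight, nothing to do\n",
                               tr->first);
                }
        }
        return 0;
}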

This patch is still causing problems during testing, so we'll put it
aside for now and fix it later.

Thanks,
Andreas



Re: [Cluster-devel] Recording extents in GFS2

2021-02-22 Thread Andreas Gruenbacher
On Mon, Feb 22, 2021 at 12:41 PM Andreas Gruenbacher 
wrote:

> On Mon, Feb 22, 2021 at 11:21 AM Steven Whitehouse 
> wrote:
>
>> Hi,
>> On 20/02/2021 09:48, Andreas Gruenbacher wrote:
>>
>> Hi all,
>>
>> Once we change the journal format to record block numbers as extents, there
>> are some further issues we should address at the same time:
>>
>> I. The current transaction format of our journals is as follows:
>>
>>- One METADATA log descriptor block for each [503 / 247 / 119 / 55]
>>metadata blocks, followed by those metadata blocks. For each metadata
>>block, the log descriptor records the 64-bit block number.
>>    - One JDATA log descriptor block for each [251 / 123 / 59 / 27]
>>    journaled data blocks, followed by those blocks. For each journaled data
>>    block, the log descriptor records the 64-bit block number and another
>>    64-bit field indicating whether the block needed escaping.
>>- One REVOKE log descriptor block for the initial [503 / 247 / 119 /
>>    55] revokes, followed by a metadata header (not to be confused with the log
>>    header) for each additional [509 / 253 / 125 / 61] revokes. Each revoke is
>>recorded as a 64-bit block number in its REVOKE log descriptor or metadata
>>header.
>>- One log header with various necessary and useful metadata that acts
>>as a COMMIT record. If the log header is incorrect or missing, the
>>preceding log descriptors are ignored.
>>
>>   
>> succeeding? (I hope!)
>>
>
> No, we call lops_before_commit (which writes the various log descriptors,
> metadata, and journaled data blocks) before writing the log header in
> log_write_header -> gfs2_write_log_header. In that sense, we could call it
> a trailer.
>
>> We should change that so that a single log descriptor contains a number of
>> records. There should be records for METADATA and JDATA blocks that follow,
>> as well as for REVOKES and for COMMIT. If a transaction contains metadata
>> and/or jdata blocks, those will obviously need a precursor and a commit
>> block like today, but we shouldn't need separate blocks for metadata and
>> journaled data in many cases. Small transactions that only consist of
>> revokes and of a commit should frequently fit into a single block entirely,
>> though.
>>
>> Yes, it makes sense to try and condense what we are writing. Why would we
>> not need to have separate blocks for journaled data though? That one seems
>> difficult to avoid, and since it is used so infrequently, perhaps not such
>> an important issue.
>>
> Journaled data would of course still need to be written. We could have a
> single log descriptor with METADATA and JDATA records, followed by the
> metadata and journaled data blocks, followed by a log descriptor with a
> COMMIT record.
>
>> Right now, we're writing log headers ("commits") with REQ_PREFLUSH to
>> make sure all the log descriptors of a transaction make it to disk before
>> the log header. Depending on the device, this is often costly. If we can
>> fit an entire transaction into a single block, REQ_PREFLUSH won't be needed
>> anymore.
>>
>> I'm not sure I agree. The purpose of the preflush is to ensure that the
>> data and the preceding log blocks are really on disk before we write the
>> commit record. That will still be required while we use ordered writes,
>> even if we can use (as you suggest below) a checksum to cover the whole
>> transaction, and thus check for a complete log record after the fact. Also,
>> we would still have to issue the flush in the case of an fsync-derived log
>> flush.
>>
>>
>>
>> III. We could also checksum entire transactions to avoid REQ_PREFLUSH. At
>> replay time, all the blocks that make up a transaction will either be there
>> and the checksum will match, or the transaction will be invalid. This
>> should no longer be prohibitively expensive thanks to CPU support for CRC32C
>> nowadays, but depending on the hardware, it may make sense to turn this off.
>>
>> IV. We need recording of unwritten blocks / extents (allocations /
>> fallocate) as this will significantly speed up moving glocks from one node
>> to another:
>>
>> That would definitely be a step forward.
>>
>>
>>
>> At the moment, data=ordered is implemented by keeping a list of all
>> inodes that did an ordered write. When it comes time to flush the log, the
>> data of all those ordered inodes is flushed first. When all we want is to
>> flush a single glock in order to move it to a different node, we currently
>> flush all the ordered inodes as well as the journal.
>>
>> If we only flushed the ordered data of the glock being moved plus the
>> entire journal, the ordering guarantees for the other ordered inodes in the
>> journal would be violated. In that scenario, unwritten blocks could (and
>> would) show up in files after crashes.
>>
>> If we instead record unwritten blocks in the journal, we'll know which
>> blocks 

Re: [Cluster-devel] Recording extents in GFS2

2021-02-22 Thread Andreas Gruenbacher
On Mon, Feb 22, 2021 at 11:21 AM Steven Whitehouse 
wrote:

> Hi,
> On 20/02/2021 09:48, Andreas Gruenbacher wrote:
>
> Hi all,
>
> Once we change the journal format to record block numbers as extents, there
> are some further issues we should address at the same time:
>
> I. The current transaction format of our journals is as follows:
>
>- One METADATA log descriptor block for each [503 / 247 / 119 / 55]
>metadata blocks, followed by those metadata blocks. For each metadata
>block, the log descriptor records the 64-bit block number.
>    - One JDATA log descriptor block for each [251 / 123 / 59 / 27]
>    journaled data blocks, followed by those blocks. For each journaled data
>    block, the log descriptor records the 64-bit block number and another
>    64-bit field indicating whether the block needed escaping.
>- One REVOKE log descriptor block for the initial [503 / 247 / 119 /
>55] revokes, followed by a metadata header (not to be confused with the log
>header) for each additional [509 / 253 / 125 / 61] revokes. Each revoke is
>recorded as a 64-bit block number in its REVOKE log descriptor or metadata
>header.
>- One log header with various necessary and useful metadata that acts
>as a COMMIT record. If the log header is incorrect or missing, the
>preceding log descriptors are ignored.
>
>   
> succeeding? (I hope!)
>

No, we call lops_before_commit (which writes the various log descriptors,
metadata, and journaled data blocks) before writing the log header in
log_write_header -> gfs2_write_log_header. In that sense, we could call it
a trailer.

> We should change that so that a single log descriptor contains a number of
> records. There should be records for METADATA and JDATA blocks that follow,
> as well as for REVOKES and for COMMIT. If a transaction contains metadata
> and/or jdata blocks, those will obviously need a precursor and a commit
> block like today, but we shouldn't need separate blocks for metadata and
> journaled data in many cases. Small transactions that only consist of
> revokes and of a commit should frequently fit into a single block entirely,
> though.
>
> Yes, it makes sense to try and condense what we are writing. Why would we
> not need to have separate blocks for journaled data though? That one seems
> difficult to avoid, and since it is used so infrequently, perhaps not such
> an important issue.
>
Journaled data would of course still need to be written. We could have a
single log descriptor with METADATA and JDATA records, followed by the
metadata and journaled data blocks, followed by a log descriptor with a
COMMIT record.
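
A rough sketch of what a record-based descriptor block could look like; all
of the names and fields below are purely illustrative, nothing here is an
agreed on-disk format:

#include <stdint.h>

/* Hypothetical record types packed back to back inside one log
 * descriptor block.  Each record announces how many entries follow. */
enum {
        LR_METADATA = 1,        /* entries are block extents of metadata */
        LR_JDATA    = 2,        /* extents of journaled data, with escape flags */
        LR_REVOKE   = 3,        /* extents of revoked blocks */
        LR_COMMIT   = 4,        /* closes the transaction */
};

struct log_record {
        uint32_t lr_type;
        uint32_t lr_count;      /* number of log_extent entries that follow */
};

/* Recording runs of blocks as extents instead of one 64-bit number per
 * block is what lets small transactions fit into a single block. */
struct log_extent {
        uint64_t le_start;      /* first block number of the run */
        uint32_t le_len;        /* number of consecutive blocks */
        uint32_t le_flags;      /* e.g. "needs escaping" for jdata */
};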

> Right now, we're writing log headers ("commits") with REQ_PREFLUSH to make
> sure all the log descriptors of a transaction make it to disk before the
> log header. Depending on the device, this is often costly. If we can fit an
> entire transaction into a single block, REQ_PREFLUSH won't be needed
> anymore.
>
> I'm not sure I agree. The purpose of the preflush is to ensure that the
> data and the preceding log blocks are really on disk before we write the
> commit record. That will still be required while we use ordered writes,
> even if we can use (as you suggest below) a checksum to cover the whole
> transaction, and thus check for a complete log record after the fact. Also,
> we would still have to issue the flush in the case of an fsync-derived log
> flush.
>
>
>
> III. We could also checksum entire transactions to avoid REQ_PREFLUSH. At
> replay time, all the blocks that make up a transaction will either be there
> and the checksum will match, or the transaction will be invalid. This
> should no longer be prohibitively expensive thanks to CPU support for CRC32C
> nowadays, but depending on the hardware, it may make sense to turn this off.
>
> IV. We need recording of unwritten blocks / extents (allocations /
> fallocate) as this will significantly speed up moving glocks from one node
> to another:
>
> That would definitely be a step forward.
>
>
>
> At the moment, data=ordered is implemented by keeping a list of all inodes
> that did an ordered write. When it comes time to flush the log, the data of
> all those ordered inodes is flushed first. When all we want is to flush a
> single glock in order to move it to a different node, we currently flush
> all the ordered inodes as well as the journal.
>
> If we only flushed the ordered data of the glock being moved plus the
> entire journal, the ordering guarantees for the other ordered inodes in the
> journal would be violated. In that scenario, unwritten blocks could (and
> would) show up in files after crashes.
>
> If we instead record unwritten blocks in the journal, we'll know which
> blocks need to be zeroed out at recovery time. Once an unwritten block is
> written, we record a REVOKE entry for that block.
>
> This comes at the cost of tracking those blocks of course, 

Re: [Cluster-devel] Recording extents in GFS2

2021-02-22 Thread Steven Whitehouse

Hi,

On 20/02/2021 09:48, Andreas Gruenbacher wrote:

Hi all,

Once we change the journal format to record block numbers as extents,
there are some further issues we should address at the same time:


I. The current transaction format of our journals is as follows:

  * One METADATA log descriptor block for each [503 / 247 / 119 / 55]
metadata blocks, followed by those metadata blocks. For each
metadata block, the log descriptor records the 64-bit block number.
  * One JDATA log descriptor block for each [251 / 123 / 59 / 27]
journaled data blocks, followed by those blocks. For each journaled
data block, the log descriptor records the 64-bit block number and
another 64-bit field indicating whether the block needed escaping.
  * One REVOKE log descriptor block for the initial [503 / 247 / 119 /
55] revokes, followed by a metadata header (not to be confused
with the log header) for each additional [509 / 253 / 125 / 61]
revokes. Each revoke is recorded as a 64-bit block number in its
REVOKE log descriptor or metadata header.
  * One log header with various necessary and useful metadata that
acts as a COMMIT record. If the log header is incorrect or
missing, the preceding log descriptors are ignored.


 succeeding? (I hope!)
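
For reference, those per-descriptor counts fall straight out of the current
on-disk header sizes (72 bytes for a log descriptor, 24 bytes for a metadata
header, 8 bytes per block number, 16 bytes per jdata entry); a standalone
sketch that reproduces them:

#include <stdio.h>

#define LOG_DESCRIPTOR_SIZE     72      /* struct gfs2_log_descriptor */
#define META_HEADER_SIZE        24      /* struct gfs2_meta_header */

int main(void)
{
        const unsigned int bsizes[] = { 4096, 2048, 1024, 512 };

        for (unsigned int i = 0; i < 4; i++) {
                unsigned int b = bsizes[i];

                printf("bsize %4u: metadata %3u, jdata %3u, revokes %3u "
                       "(+%3u per continuation block)\n", b,
                       (b - LOG_DESCRIPTOR_SIZE) / 8,   /* 64-bit block numbers */
                       (b - LOG_DESCRIPTOR_SIZE) / 16,  /* block number + escape flag */
                       (b - LOG_DESCRIPTOR_SIZE) / 8,   /* first REVOKE descriptor */
                       (b - META_HEADER_SIZE) / 8);     /* each following meta header */
        }
        return 0;
}
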
We should change that so that a single log descriptor contains a 
number of records. There should be records for METADATA and JDATA 
blocks that follow, as well as for REVOKES and for COMMIT. If a 
transaction contains metadata and/or jdata blocks, those will 
obviously need a precursor and a commit block like today, but we 
shouldn't need separate blocks for metadata and journaled data in many 
cases. Small transactions that only consist of revokes and of a commit 
should frequently fit into a single block entirely, though.


Yes, it makes sense to try and condense what we are writing. Why would 
we not need to have separate blocks for journaled data though? That one 
seems difficult to avoid, and since it is used so infrequently, perhaps 
not such an important issue.



Right now, we're writing log headers ("commits") with REQ_PREFLUSH to 
make sure all the log descriptors of a transaction make it to disk 
before the log header. Depending on the device, this is often costly. 
If we can fit an entire transaction into a single block, REQ_PREFLUSH 
won't be needed anymore.


I'm not sure I agree. The purpose of the preflush is to ensure that the 
data and the preceding log blocks are really on disk before we write the 
commit record. That will still be required while we use ordered writes, 
even if we can use (as you suggest below) a checksum to cover the whole 
transaction, and thus check for a complete log record after the fact. 
Also, we would still have to issue the flush in the case of an
fsync-derived log flush.





III. We could also checksum entire transactions to avoid REQ_PREFLUSH. 
At replay time, all the blocks that make up a transaction will either 
be there and the checksum will match, or the transaction will be 
invalid. This should no longer be prohibitively expensive thanks to CPU support
for CRC32C nowadays, but depending on the hardware, it may make sense 
to turn this off.
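
Where exactly such a per-transaction checksum would be stored is still open;
the sketch below only illustrates the arithmetic, with a plain bitwise CRC32C
(Castagnoli) standing in for the kernel's crc32c() helper, which is hardware
accelerated on most current CPUs:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

static uint32_t crc32c(uint32_t crc, const void *data, size_t len)
{
        const uint8_t *p = data;

        crc = ~crc;
        while (len--) {
                crc ^= *p++;
                for (int k = 0; k < 8; k++)
                        crc = (crc >> 1) ^ ((crc & 1) ? 0x82F63B78U : 0);
        }
        return ~crc;
}

int main(void)
{
        /* Pretend transaction: one descriptor block plus two payload blocks. */
        unsigned char blocks[3][4096];
        uint32_t crc = 0;

        memset(blocks, 0x5a, sizeof(blocks));

        /* Checksum the whole transaction; at replay time a mismatch (or a
         * missing block) simply means the transaction is ignored. */
        for (int i = 0; i < 3; i++)
                crc = crc32c(crc, blocks[i], sizeof(blocks[i]));

        printf("transaction crc32c: 0x%08x\n", crc);
        return 0;
}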


IV. We need recording of unwritten blocks / extents (allocations / 
fallocate) as this will significantly speed up moving glocks from one 
node to another:


That would definitely be a step forward.




At the moment, data=ordered is implemented by keeping a list of all 
inodes that did an ordered write. When it comes time to flush the log, 
the data of all those ordered inodes is flushed first. When all we 
want is to flush a single glock in order to move it to a different 
node, we currently flush all the ordered inodes as well as the journal.


If we only flushed the ordered data of the glock being moved plus the 
entire journal, the ordering guarantees for the other ordered inodes 
in the journal would be violated. In that scenario, unwritten blocks 
could (and would) show up in files after crashes.


If we instead record unwritten blocks in the journal, we'll know which 
blocks need to be zeroed out at recovery time. Once an unwritten block 
is written, we record a REVOKE entry for that block.
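
At replay time that boils down to simple set arithmetic: zero out every block
that was recorded as unwritten and not revoked before the crash. A toy model
(standalone, names illustrative):

#include <stdbool.h>
#include <stdio.h>

#define NBLOCKS 16

int main(void)
{
        /* What the journal recorded before the crash. */
        bool unwritten[NBLOCKS] = { false };  /* allocated, never written */
        bool revoked[NBLOCKS]   = { false };  /* written later, record is void */

        unwritten[3] = unwritten[4] = unwritten[9] = true;
        revoked[4] = true;   /* block 4 was written (and revoked) in time */

        /* Recovery: anything still marked unwritten must not expose stale
         * data, so it gets zeroed (or re-marked as unwritten) here. */
        for (int b = 0; b < NBLOCKS; b++)
                if (unwritten[b] && !revoked[b])
                        printf("zero out block %d\n", b);
        return 0;
}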


This comes at the cost of tracking those blocks of course, but with 
that in place, moving a glock from one node to another will only 
require flushing the underlying inode (assuming it's an inode glock)
and the journal. And most likely, we won't have to bother with 
implementing "simple" transactions as described in 
https://bugzilla.redhat.com/show_bug.cgi?id=1631499.


Thanks,
Andreas


That would be another way of looking at the problem, yes. It does add a 
lot to the complexity though, and it doesn't scale very well on systems 
with large amounts of memory (and therefore potentially lots of 
unwritten extents to record & keep track of). If there are lots of small 
transactions, then each one might be significantly expanded by