On Tue, Feb 2, 2021 at 6:35 PM Steven Whitehouse <swhit...@redhat.com>
wrote:

> Hi,
> On 24/01/2021 06:44, Abhijith Das wrote:
>
> Hi all,
>
> I've been looking at rgrp.c:gfs2_alloc_blocks(), which is called from
> various places to allocate single/multiple blocks for inodes. I've come up
> with some data structures to accomplish recording of these allocations as
> extents.
>
> I'm proposing we add a new metadata type for journal blocks that will hold
> these extent records.
>
> #define GFS2_METATYPE_EX 15 /* New metadata type for a block that will
> hold extents */
>
> The structure below will be at the start of the block, followed by a
> number of alloc_ext structures.
>
> struct gfs2_extents { /* This structure is 32 bytes long */
>     struct gfs2_meta_header ex_header;
>     __be32 ex_count; /* number of alloc_ext structs that follow this
> header. */
>     __be32 __pad;
> };
> /* flags for the alloc_ext struct */
> #define AE_FL_XXX
>
> struct alloc_ext { /* This structure is 48 bytes long */
>     struct gfs2_inum ae_num; /* The inode this allocation/deallocation
> belongs to */
>     __be32 ae_flags; /* specifies if we're allocating/deallocating,
> data/metadata, etc. */
>     __be64 ae_start; /* starting physical block number of the extent */
>     __be64 ae_len;   /* length of the extent */
>     __be32 ae_uid;   /* user this belongs to, for quota accounting */
>     __be32 ae_gid;   /* group this belongs to, for quota accounting */
>     __be32 __pad;
> };
>
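> As a rough sketch, the block would then be laid out as this gfs2_extents
> header followed by as many alloc_ext records as fit; the helper below is
> purely illustrative, not a final name:
>
> /*
>  * Illustrative only: number of alloc_ext records per journal block.
>  * With sizeof(struct gfs2_extents) == 32 and sizeof(struct alloc_ext) == 48,
>  * a 4k block holds (4096 - 32) / 48 = 84 records.
>  */
> static inline unsigned int gfs2_extents_per_block(unsigned int bsize)
> {
>         return (bsize - sizeof(struct gfs2_extents)) /
>                 sizeof(struct alloc_ext);
> }
>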
> The gfs2_inum structure is a bit OTT for this I think. A single 64 bit
> inode number should be enough? Also, it is quite likely we may have
> multiple extents for the same inode... so should we split this into two so
> we can have something like this? It is more complicated, but should save
> space in the average case.
>
> struct alloc_hdr {
>
>     __be64 inum;
>
>     __be32 uid; /* This is duplicated from the inode... various options
> here depending on whether we think this is something we should do. Should
> we also consider logging chown using this structure? We will have to
> carefully check chown sequence wrt to allocations/deallocations for quota
> purposes */
>
>     __be32 gid;
>
>     __u8 num_extents; /* Never likely to have huge numbers of extents per
> header, due to block size! */
>
>     /* padding... or is there something else we could/should add here? */
>
> };
>
> followed by num_extents copies of:
>
> struct alloc_extent {
>
>     __be64 phys_start;
>
>     __be64 logical_start; /* Do we need a logical & physical start? Maybe
> we don't care about the logical start? */
>
>     __be32 length; /* Max extent length is limited by rgrp length... only
> need 32 bits */
>
>     __be32 flags; /* Can we support unwritten, zero extents with this?
> Need to indicate alloc/free/zero, data/metadata */
>
> };
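>
> Roughly, each per-inode record in the journal block would then look like
> this (a sketch only; the struct name and the assumption that alloc_hdr pads
> out to 24 bytes are mine):
>
> /*
>  * One record: an alloc_hdr (~24 bytes with padding) followed by
>  * num_extents copies of alloc_extent (24 bytes each), i.e. roughly
>  * 24 + 24 * N bytes per inode versus 48 * N bytes with the flat
>  * alloc_ext records - break-even at one extent, and 24 bytes saved
>  * for every further extent of the same inode.
>  */
> struct alloc_rec {
>         struct alloc_hdr hdr;
>         struct alloc_extent extents[];  /* hdr.num_extents entries */
> };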
>
We're trying to keep allocations relatively close together and within the
same resource group, so to store extent lists more compactly, we could
store the first extent's start address absolutely, and the start of each
successive extent within range as a signed 32-bit number relative to that.
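
For example (purely a sketch with made-up names), the records could then
look something like this, with only the first extent of each group carrying
an absolute address:

struct alloc_extent_abs {
        __be64 phys_start;   /* absolute start of the first extent */
        __be32 length;
        __be32 flags;
};

struct alloc_extent_rel {
        __be32 delta;        /* start relative to the previous extent's
                                start, stored as a signed 32-bit value;
                                fall back to a new absolute record when
                                the distance doesn't fit */
        __be32 length;
        __be32 flags;
};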

> Just wondering if there is also some shorthand we might be able to use in
> case we have multiple extents all separated by either one metadata block,
> or a very small number of metadata blocks (which will be the case for
> streaming writes). Again it increases the complexity, but will likely
> reduce the amount we have to write into the new journal blocks quite a lot.
> Not much point in having a 32-bit length field if we never fill it with a
> value above 509 (at a 4k block size)...
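>
> (For reference, the 509 comes from the pointer capacity of a 4k indirect
> block: (4096 - sizeof(struct gfs2_meta_header)) / sizeof(__be64) =
> (4096 - 24) / 8 = 509, so streaming writes allocate runs of at most 509
> data blocks between consecutive metadata blocks.)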
>
The current allocator fills at most one indirect block before allocating
the next indirect block(s), which is why we end up with the pattern
described above. Once we switch to extent-based inodes, we won't be
allocating indirect blocks anymore, so we also won't end up with those
chopped-up extents. There will be the occasional node split in the inode
extent tree, but that will be a much less frequent occurrence, and it won't
happen when extending an existing extent. Delayed allocation would further
improve the on-disk allocation patterns. On the other hand, we'll end up
with more overhead when files are highly fragmented.

As long as we're only storing extents in the journal, I don't think those
509-block chunks are a problem; we'll still end up with more compact
metadata for mostly-contiguous files. We'll do much worse for test cases
that write every other block, for example.

> With a 4k block size, we can fit 84 extents in one block (10 for 512b, 20
> for 1k, 42 for 2k block sizes). As we process more allocs/deallocs, we
> keep creating more such alloc_ext records and tack them onto the back of
> this block if there's space, or else create a new block. For smaller
> extents, this might not be efficient, so we might just want to revert to
> the old method of recording the bitmap blocks instead.
> During journal replay, we decode these new blocks and flip the
> corresponding bitmap bits for each of the blocks covered by the extents.
> For the allocations where we recorded the bitmap blocks the old-fashioned
> way, we also replay them the old-fashioned way, which keeps us backward
> compatible with older versions of gfs2 that only record the bitmaps.
> Since we record the uid/gid with each extent, we can do the quota
> accounting without relying on the quota change file. We might need to keep
> the quota change file around for backward compatibility and for the cases
> where we might want to record allocs/deallocs the old-fashioned way.
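>
> To illustrate (a rough sketch only; set_bitmap_range(), account_quota() and
> the AE_FL_ALLOC flag below are hypothetical, not existing code), replaying
> one of these blocks would walk the alloc_ext records and update bitmaps and
> quotas in a single pass:
>
> /* Sketch only: replay one GFS2_METATYPE_EX journal block. */
> static void replay_extent_block(struct gfs2_sbd *sdp, const void *blk)
> {
>         const struct gfs2_extents *ex = blk;
>         const struct alloc_ext *ae = (const void *)(ex + 1);
>         u32 i, count = be32_to_cpu(ex->ex_count);
>
>         for (i = 0; i < count; i++, ae++) {
>                 u64 start = be64_to_cpu(ae->ae_start);
>                 u64 len = be64_to_cpu(ae->ae_len);
>                 bool alloc = be32_to_cpu(ae->ae_flags) & AE_FL_ALLOC;
>
>                 /* flip the bitmap bits for [start, start + len) */
>                 set_bitmap_range(sdp, start, len, alloc);
>                 /* charge or credit the owning uid/gid */
>                 account_quota(sdp, be32_to_cpu(ae->ae_uid),
>                               be32_to_cpu(ae->ae_gid),
>                               alloc ? (s64)len : -(s64)len);
>         }
> }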
>
> I'm going to play around with this and come up with some patches to see if
> this works and what kind of performance improvements we get. These data
> structures will most likely need reworking and renaming, but this is the
> general direction I'm thinking in.
>
> Please let me know what you think.
>
> Cheers!
> --Abhi
>
> That all sounds good. I'm sure it will take a little while to figure out
> how to get this right,
>
> Steve.
>
Thanks,
Andreas
