Dear Kim:
Happy New Year!

> +static struct f2fs_folio_state *
> +ffs_find_or_alloc(struct folio *folio)
> +{
> +  struct f2fs_folio_state *ffs = folio->private;
> +
> +  if (ffs)
> +          return ffs;
> +
> +  ffs = f2fs_kmem_cache_alloc(ffs_entry_slab, GFP_NOIO, true, NULL);
> +
> +  spin_lock_init(&ffs->state_lock);
> +  folio_attach_private(folio, ffs);
> +  return ffs;
> +}

It looks like ffs_find_or_alloc() does not initialize
read_pages_pending. When I print read_pages_pending right after the
allocation in local debugging, it holds an uninitialized, random value.
A basic read test with dd can then leave tasks hanging:
read_pages_pending never reaches zero, so the folio is never unlocked
and never marked uptodate.
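
For context, this is roughly how I understand the completion side
(my own sketch of folio_finish_read(), not the exact code in the
patch), which is why a garbage initial count keeps the folio locked:

static void folio_finish_read(struct folio *folio)
{
        struct f2fs_folio_state *ffs = folio->private;
        unsigned long flags;
        bool last;

        spin_lock_irqsave(&ffs->state_lock, flags);
        /* with a garbage initial value this never reaches zero */
        last = (--ffs->read_pages_pending == 0);
        spin_unlock_irqrestore(&ffs->state_lock, flags);

        if (!last)
                return;

        /* only the last pending subpage read may finish the folio */
        folio_mark_uptodate(folio);
        folio_unlock(folio);
}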

I know this function is modeled after iomap's ifs_alloc():

static struct iomap_folio_state *ifs_alloc(struct inode *inode,
                                        struct folio *folio, unsigned int flags)
{
        struct iomap_folio_state *ifs = folio->private;
        unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
        gfp_t gfp;

        if (ifs || nr_blocks <= 1)
                return ifs;
        /*...*/
        /*
         * ifs->state tracks two sets of state flags when the
         * filesystem block size is smaller than the folio size.
         * The first state tracks per-block uptodate and the
         * second tracks per-block dirty state.
         */
        ifs = kzalloc(struct_size(ifs, state,
                      BITS_TO_LONGS(2 * nr_blocks)), gfp);
        if (!ifs)
                return ifs;

        spin_lock_init(&ifs->state_lock);
        if (folio_test_uptodate(folio))
                bitmap_set(ifs->state, 0, nr_blocks);
        if (folio_test_dirty(folio))
                bitmap_set(ifs->state, nr_blocks, nr_blocks);
        folio_attach_private(folio, ifs);

        return ifs;
}

Note that ifs_alloc() uses kzalloc(), which zero-initializes the
allocation, whereas f2fs_kmem_cache_alloc() does not.

We could fix this by explicitly setting read_pages_pending = 0,
or by doing a memset() right after f2fs_kmem_cache_alloc()
(the latter seems more extensible if the struct grows). What do you think?
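
For illustration, the memset() variant would look roughly like this
(only a sketch on top of the hunk quoted above, not a formal patch):

static struct f2fs_folio_state *
ffs_find_or_alloc(struct folio *folio)
{
        struct f2fs_folio_state *ffs = folio->private;

        if (ffs)
                return ffs;

        ffs = f2fs_kmem_cache_alloc(ffs_entry_slab, GFP_NOIO, true, NULL);
        /* start every field, including read_pages_pending, from zero */
        memset(ffs, 0, sizeof(*ffs));

        spin_lock_init(&ffs->state_lock);
        folio_attach_private(folio, ffs);
        return ffs;
}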

>            /*
> +           * This page will go to BIO.  Do we need to send this
> +           * BIO off first?
> +           */
> +          if (bio && (!page_is_mergeable(F2FS_I_SB(inode), bio,
> +                                  last_block_in_bio, block_nr) ||
> +                  !f2fs_crypt_mergeable_bio(bio, inode, index, NULL))) {
> +submit_and_realloc:
> +                  f2fs_submit_read_bio(F2FS_I_SB(inode), bio, DATA);
> +                  bio = NULL;
> +          }
> +          if (bio == NULL)
> +                  bio = f2fs_grab_read_bio(inode, block_nr,
> +                                  max_nr_pages,
> +                                  f2fs_ra_op_flags(rac),
> +                                  index, false);
> +
> +          /*
> +           * If the page is under writeback, we need to wait for
> +           * its completion to see the correct decrypted data.
> +           */
> +          f2fs_wait_on_block_writeback(inode, block_nr);
> +
> +          if (!bio_add_folio(bio, folio, F2FS_BLKSIZE,
> +                                  offset << PAGE_SHIFT))
> +                  goto submit_and_realloc;
> +
> +          if (folio_test_large(folio)) {
> +                  ffs = ffs_find_or_alloc(folio);
> +
> +                  /* set the bitmap to wait */
> +                  spin_lock_irq(&ffs->state_lock);
> +                  ffs->read_pages_pending++;
> +                  spin_unlock_irq(&ffs->state_lock);
> +          }

In the current code, it looks like a subpage is added to the BIO (or a
cached BIO is submitted) before read_pages_pending is incremented. That
can cause the following behaviour:

After one subpage of a folio is submitted, if that I/O completes very
quickly, the completion can interrupt the read loop: bio_endio() runs
and eventually calls f2fs_finish_read_bio(), which decrements
read_pages_pending down to zero. folio_finish_read() can then run too
early, even though other parts of the same folio have not been added to
a BIO yet.

I managed to trigger this locally by creating a heavily fragmented file
and temporarily injecting the following code right after BIO submission:

        f2fs_io_schedule_timeout(1);
        WARN_ON_ONCE(!folio_test_locked(folio));

I think the correct ordering is to increment read_pages_pending first
and only then add the corresponding subpage to the BIO.
With that ordering, the BIO side will either:
  1) add a subpage after its increment (matching the new pending count),
or
  2) submit a BIO whose subpages correspond to pending increments from
     *previous* iterations,
so read_pages_pending cannot reach zero prematurely.
This is exactly the order that iomap_readpage_iter() implements.
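
Concretely, I mean something like the following reordering of the hunk
quoted above (just a sketch, not a formal patch). Moving the increment
above the merge check also keeps the submit_and_realloc retry from
counting the same subpage twice:

        if (folio_test_large(folio)) {
                ffs = ffs_find_or_alloc(folio);

                /* account for this subpage before any BIO that could
                 * complete it is submitted */
                spin_lock_irq(&ffs->state_lock);
                ffs->read_pages_pending++;
                spin_unlock_irq(&ffs->state_lock);
        }

        /*
         * This page will go to BIO.  Do we need to send this
         * BIO off first?
         */
        if (bio && (!page_is_mergeable(F2FS_I_SB(inode), bio,
                                last_block_in_bio, block_nr) ||
                        !f2fs_crypt_mergeable_bio(bio, inode, index, NULL))) {
submit_and_realloc:
                f2fs_submit_read_bio(F2FS_I_SB(inode), bio, DATA);
                bio = NULL;
        }
        if (bio == NULL)
                bio = f2fs_grab_read_bio(inode, block_nr, max_nr_pages,
                                f2fs_ra_op_flags(rac), index, false);

        /*
         * If the page is under writeback, we need to wait for
         * its completion to see the correct decrypted data.
         */
        f2fs_wait_on_block_writeback(inode, block_nr);

        if (!bio_add_folio(bio, folio, F2FS_BLKSIZE,
                                offset << PAGE_SHIFT))
                goto submit_and_realloc;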

If you need the script I used to reproduce the bug, please let me know.
I will attach it in my next reply. Thanks!

Best regards,
Nanzhe Zhao


