Hi,

Is it possible that multiple ranges of the same page, submitted to one
or more bios, race with each other and cause data corruption?

I have recently been working on subpage read/write support for btrfs,
and noticed a strange false data corruption.

E.g., there is a 64K page to be read from disk:

0       16K     32K     48K     64K
|///////|       |///////|       |

Where |///| means data which needs to be read from disk,
and |   | means a hole, where we just zero the range.

Currently the code will:

- Submit bio for [0, 16K)
- Zero [16K, 32K)
- Submit bio for [32K, 48K)
- Zero [48K, 64K)
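
To make the sequence above concrete, here is a minimal userspace sketch.
It is not the actual btrfs code: the names (model_read_page, range_desc)
are made up for illustration, and memcpy() merely stands in for a bio
completing. It runs sequentially, so it only shows which bytes each step
is expected to touch, not the asynchronous behaviour.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SZ   (64 * 1024)   /* 64K page, as in the example above */
#define SECTOR_SZ (16 * 1024)   /* size of each range in the example */
#define NR_RANGES (PAGE_SZ / SECTOR_SZ)

/* Hypothetical description of one 16K range of the page. */
struct range_desc {
	int is_hole;    /* 1: zero the range, 0: read it from "disk" */
};

/*
 * Model of the read path: walk the page range by range.
 * Data ranges are copied from @disk (standing in for a bio finishing),
 * hole ranges are zeroed. Each step writes only to
 * [start, start + SECTOR_SZ) of @page.
 */
static void model_read_page(unsigned char *page, const unsigned char *disk,
			    const struct range_desc *ranges)
{
	for (size_t i = 0; i < NR_RANGES; i++) {
		size_t start = i * SECTOR_SZ;

		/* No step may run past the end of the page. */
		assert(start + SECTOR_SZ <= PAGE_SZ);

		if (ranges[i].is_hole)
			memset(page + start, 0, SECTOR_SZ);
		else
			memcpy(page + start, disk + start, SECTOR_SZ);
	}
}
```

With the layout from the diagram (data, hole, data, hole), the data
ranges should end up holding the disk content and the hole ranges should
be all zero, with nothing leaking across the 16K boundaries.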

Between bio submission and zeroing, there is no need to wait for the
submitted bio to finish, as I assume a submitted bio won't touch any
range of the page except the one specified.

But randomly (not reliably reproducible), btrfs csum verification at
endio time reports that the data read from disk does not match its csum.

However, the following observations suggest that something is wrong in
the read path:
- On-disk data matches the csum

- If the read path is fully serialized, the error disappears
  E.g., if I change the read path to:
  - Submit bio for [0, 16K)
  - Wait for the bio for [0, 16K) to finish
  - Zero [16K, 32K)
  - Submit bio for [32K, 48K)
  - Wait for the bio for [32K, 48K) to finish
  - Zero [48K, 64K)
  then the problem completely disappears.

So it looks like the hole zeroing and the bio submission in the read
path are racing with each other?

Shouldn't a bio only touch the range specified, and nothing else?

Or is there something I missed, like an off-by-one bug?
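
One way to rule out an off-by-one on my side would be a sanity check
that the ranges handed to bios and to zeroing tile the page exactly. A
rough userspace sketch follows; struct byte_range and
ranges_tile_exactly() are hypothetical names, not existing kernel
helpers, and the ranges are assumed to be sorted by start offset.

```c
#include <assert.h>
#include <stddef.h>

/* Half-open byte range [start, end) inside a page. Hypothetical type. */
struct byte_range {
	size_t start;
	size_t end;
};

/*
 * Return 1 if the sorted ranges in @r cover [0, total) exactly, with no
 * gap, no overlap and no empty or inverted range; return 0 otherwise.
 */
static int ranges_tile_exactly(const struct byte_range *r, size_t nr,
			       size_t total)
{
	size_t expected = 0;    /* where the next range must begin */

	for (size_t i = 0; i < nr; i++) {
		if (r[i].start != expected || r[i].end <= r[i].start)
			return 0;
		expected = r[i].end;
	}
	return expected == total;
}
```

For the 64K example, the four ranges [0, 16K), [16K, 32K), [32K, 48K)
and [48K, 64K) pass this check, while a range whose end is off by one
byte in either direction makes it fail.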

Thanks,
Qu
