On Thu, Oct 30, 2014 at 1:28 AM, Zhiwen Zheng <[email protected]> wrote:
> Hello, all!
>
> /*
> * For i/o error checking, read the first and last level-0
> * blocks (if they are not aligned), and all the level-1 blocks.
> */
> if (dn->dn_maxblkid == 0) {
> delta = dn->dn_datablksz;
> start = (off < dn->dn_datablksz) ? 0 : 1;
> end = (off+len <= dn->dn_datablksz) ? 0 : 1;
> if (start == 0 && (off > 0 || len < dn->dn_datablksz)) {
> err = dmu_tx_check_ioerr(NULL, dn, 0, 0);
> if (err)
> goto out;
> delta -= off;
> }
> } else {
> zio_t *zio = zio_root(dn->dn_objset->os_spa,
> NULL, NULL, ZIO_FLAG_CANFAIL);
>
> /* first level-0 block */
> start = off >> dn->dn_datablkshift;
> if (P2PHASE(off, dn->dn_datablksz) ||
> len < dn->dn_datablksz) {
> err = dmu_tx_check_ioerr(zio, dn, 0, start);
> if (err)
> goto out;
> }
>
> /* last level-0 block */
> end = (off+len-1) >> dn->dn_datablkshift;
> if (end != start && end <= dn->dn_maxblkid &&
> P2PHASE(off+len, dn->dn_datablksz)) {
> err = dmu_tx_check_ioerr(zio, dn, 0, end);
> if (err)
> goto out;
> }
>
> /* level-1 blocks */
> if (nlvls > 1) {
> int shft = dn->dn_indblkshift - SPA_BLKPTRSHIFT;
> for (i = (start>>shft)+1; i < end>>shft; i++) {
> err = dmu_tx_check_ioerr(zio, dn, 1, i);
> if (err)
> goto out;
> }
> }
>
> err = zio_wait(zio);
> if (err)
> goto out;
> delta = P2NPHASE(off, dn->dn_datablksz);
> }
>
>
> Why should read the block from disk before starting write ?
> Can't the errors be handled in the write path ?
>
In theory, the code could be rewritten to handle the errors later. This
would require undoing any modifications made prior to the error (e.g.
rename removes the entry from the old directory, adds an entry to the new
directory, and updates the file to point to the new directory; if the last
step fails with EIO, you would have to undo the previous steps).
What benefit would that bring?
>
> We observe this zio can take 60ms to complete on a busy zpool,
> it kills the performance.
>
How would doing the reads later improve performance? (The reads would
still by synchronous, and you'd have more locks held, e.g. the tx would be
assigned.)
--matt
>
>
> _______________________________________________
> developer mailing list
> [email protected]
> http://lists.open-zfs.org/mailman/listinfo/developer
>
>
_______________________________________________
developer mailing list
[email protected]
http://lists.open-zfs.org/mailman/listinfo/developer