So usually this should be functionality handled by the raid/san
controller I guess, but given that btrfs is playing the role of a
controller here, at what point are we drawing the line of not
implementing block-level functionality into the filesystem?

    Don't worry, this is not invading the block layer. How
    could you even build this functionality in the block layer?
    The block layer won't even know that the disks are mirrored.
    RAID does, or BTRFS in our case.

By block layer I guess I meant the storage driver of a particular raid card. Because what is currently happening is re-implementing functionality that will generally sit in the driver. So my question was more generic and high-level - at what point do we draw the line of implementing features that are generally implemented in hardware devices (be it their drivers or firmware)?

   Not all HW configs use RAID-capable HBAs. A server connected to a SATA JBOD using a SATA HBA without MD will rely on BTRFS to provide all the features and capabilities that would otherwise have been provided by such a presumed HW config.

That does sort of sound like it means implementing some portion of the
HBA features/capabilities in the filesystem.

To me it seems this could be workable at the fs level, provided it
deals just with policies and remains hardware-neutral.

  Thanks. Ok.

However most
of the use cases appear to involve some hardware-dependent knowledge
or assumptions.

What happens when someone sets this on a virtual disk,
or say a (persistent) memory-backed block device?

  Do you have any policy in particular ?

No, this is your proposal ;^)

 Policy added here:
 It is about the devid, which is assigned by btrfs.
 Future policies:
 They aren't hardware dependent, though the ssd policy says to use
 an SSD for reading if one is available. The LBA policy divides the
 read IO access based on the sector #. The logic is quite simple:
       read-sector < FS-SIZE/2 ? mirror1 : mirror2;
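
 The one-line LBA rule above could be sketched in C roughly as follows.
 This is only an illustration under stated assumptions: the function name
 lba_pick_mirror, the sector-based units, and the 0/1 mirror indices are
 all hypothetical and are not taken from the btrfs code or the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not a btrfs function: choose which of two mirrors
 * serves a read, based only on where the sector falls in the filesystem.
 * Returns 0 for the first mirror and 1 for the second.
 */
static int lba_pick_mirror(uint64_t read_sector, uint64_t fs_size_sectors)
{
	assert(fs_size_sectors > 0);
	assert(read_sector < fs_size_sectors);

	/* Lower half of the address space -> mirror 0, upper half -> mirror 1. */
	return read_sector < fs_size_sectors / 2 ? 0 : 1;
}
```

 Note that the split depends only on the sector number, so it says nothing
 about how busy either device currently is.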

You've said cases #3 thru #6 are illustrative only. However they make
assumptions about the underlying storage, and/or introduce potential for
unexpected behaviors.

 The assumption I am making is that the user will understand their
 storage and tune this parameter accordingly, and there is a heuristic
 (which Tim wrote) to do things automatically. Sometimes manual settings
 provide better performance than the heuristic.

Plus they could end up replicating functionality
from other layers as Nikolay pointed out. Seems unlikely these would be
practical to implement.

The I/O one would actually be rather nice to have and wouldn't really be duplicating anything (at least, not duplicating anything we consistently run on top of).  The pid-based selector works fine for cases where the only thing on the disks is a single BTRFS filesystem.  When there's more than that, it can very easily result in highly asymmetrical load on the disks because it doesn't account for current I/O load when picking a copy to read.  Last I checked, both MD and DM-RAID have at least the option to use I/O load in determining where to send reads for RAID1 setups, and they do a far better job than BTRFS at balancing load in these cases.
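
To make the contrast concrete, here is a minimal sketch of the two selection strategies. It assumes the pid-based selector reduces to the caller's pid modulo the number of copies, and it assumes a per-device count of in-flight requests is available; struct mirror_dev, its inflight field, and both function names are hypothetical, not btrfs, MD, or DM-RAID interfaces.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-device state; the field is an assumption for illustration. */
struct mirror_dev {
	unsigned int inflight;	/* requests currently outstanding on this device */
};

/* pid-based choice: depends only on the reading process, not on device load. */
static size_t pick_by_pid(unsigned int pid, size_t num_mirrors)
{
	assert(num_mirrors > 0);
	return pid % num_mirrors;
}

/*
 * Load-aware choice: pick the copy with the fewest in-flight requests.
 * Ties go to the lowest index.
 */
static size_t pick_by_load(const struct mirror_dev *devs, size_t num_mirrors)
{
	size_t best = 0;

	assert(num_mirrors > 0);
	for (size_t i = 1; i < num_mirrors; i++) {
		if (devs[i].inflight < devs[best].inflight)
			best = i;
	}
	return best;
}
```

With one device already busy serving something other than this filesystem, pick_by_pid can keep sending a given process to the busy device, while pick_by_load steers the read to the idle one.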

 Yeah, some enterprise filesystems and storage systems communicate
 performance tunables to each other automatically. We will get there too.

Case #2 seems concerning if it exposes internal,
implementation-dependent filesystem data into a de facto user-level
interface. (Do we ensure the devid is unique, and cannot get changed or
re-assigned internally to a different device, etc?)

The devid gets assigned when a device is added to a filesystem, it's a monotonically increasing number that gets incremented for every new device, and never changes for a given device as long as it remains in the filesystem (it will change if you remove the device and then re-add it).  The only exception to this is that the replace command will assign the new device the same devid that the device it is replacing had (which I would argue leads to consistent behavior here).  Given that, I think it's sufficiently safe to use it for something like this.
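
As a toy model of the devid rules as stated above (add increments a counter, remove plus re-add yields a new number, replace inherits the old number), something like the following could be used to reason about the behavior. Everything here is hypothetical: struct toy_fs, the fixed-size table, and the choice to start numbering at 1 are assumptions for illustration and are not how btrfs stores or allocates devids.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_DEVS 8

/* Toy model only; not btrfs code. */
struct toy_fs {
	uint64_t next_devid;		/* next devid to hand out; only ever grows */
	uint64_t devids[MAX_DEVS];	/* devid per slot, 0 means the slot is empty */
};

static void toy_fs_init(struct toy_fs *fs)
{
	fs->next_devid = 1;	/* starting value is an assumption of this model */
	for (int i = 0; i < MAX_DEVS; i++)
		fs->devids[i] = 0;
}

/* Add a device: returns its newly assigned devid, or 0 if the table is full. */
static uint64_t toy_fs_add(struct toy_fs *fs)
{
	for (int i = 0; i < MAX_DEVS; i++) {
		if (fs->devids[i] == 0) {
			fs->devids[i] = fs->next_devid++;
			return fs->devids[i];
		}
	}
	return 0;
}

/* Remove a device by devid: returns 1 if it was present, 0 otherwise. */
static int toy_fs_remove(struct toy_fs *fs, uint64_t devid)
{
	assert(devid != 0);
	for (int i = 0; i < MAX_DEVS; i++) {
		if (fs->devids[i] == devid) {
			fs->devids[i] = 0;
			return 1;
		}
	}
	return 0;
}

/*
 * Replace a device: the new device inherits the old devid, so the table
 * is unchanged. Returns the inherited devid, or 0 if old_devid is unknown.
 */
static uint64_t toy_fs_replace(const struct toy_fs *fs, uint64_t old_devid)
{
	assert(old_devid != 0);
	for (int i = 0; i < MAX_DEVS; i++) {
		if (fs->devids[i] == old_devid)
			return old_devid;
	}
	return 0;
}
```

In this model a read policy keyed on devid keeps pointing at the same slot across a replace, but not across a remove followed by an add.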

Thanks, Anand

To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to
More majordomo info at