On 2015-10-05 00:00, Martin Tippmann wrote:
2015-10-03 16:50 GMT+02:00 Jim Dowling <[email protected] <mailto:[email protected]>>:

    As you point out, hdfs does its own checksumming of blocks, which
    is needed as blocks are transferred over the network. So, yes it
    is double checksumming if you will.

    We are keeping the data node as it is. The only change needed will
    be to identify a block device as an "archive" device or a normal
    device. We're interested in archive devices for this work.
    The bigger picture is that Apache HDFS is moving towards striping
    blocks over different data nodes, losing data locality. We are
    investigating btrfs/raid5 for archived data. Its workload would
    be much lower than standard.


Hi, thanks for the clarification!

[snip]

    So the idea is to erasure code twice, checksum twice. Overall
    overhead will be about 50%, half of this for raid5, half hdfs
    erasure coding.
    Hypothesis: For cold storage data, with normally at most one
    active job per data node, jobs will read/write data faster,
    improving performance, particularly over 10GbE.
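The overhead arithmetic above can be sketched roughly as follows. The geometries are assumptions picked only to make each layer cost 25%: a 5-disk RAID5 (4 data + 1 parity) and a hypothetical 8+2 HDFS erasure-coding layout. Neither is stated in the thread. Note that stacked layers multiply rather than add, so two 25% layers come out slightly above the simple sum of 50%.

```python
def parity_overhead(data_units: int, parity_units: int) -> float:
    """Extra raw storage needed per unit of user data for one layer."""
    return parity_units / data_units


def combined_overhead(*layers: tuple) -> float:
    """Overhead of stacked layers; each layer is (data_units, parity_units).

    Each layer expands the data it receives, so the expansion factors multiply.
    """
    total = 1.0
    for data_units, parity_units in layers:
        total *= 1.0 + parity_overhead(data_units, parity_units)
    return total - 1.0


# Assumed geometries (not from the thread), each giving 25% overhead.
RAID5_LAYOUT = (4, 1)    # 5-disk RAID5: 4 data + 1 parity
HDFS_EC_LAYOUT = (8, 2)  # hypothetical 8 data + 2 parity erasure coding

if __name__ == "__main__":
    print(parity_overhead(*RAID5_LAYOUT))                    # 0.25
    print(parity_overhead(*HDFS_EC_LAYOUT))                  # 0.25
    print(combined_overhead(RAID5_LAYOUT, HDFS_EC_LAYOUT))   # 0.5625
```

With these assumed layouts the combined figure is about 56%, which is in line with the "about 50%" estimate.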


btrfs RAID5 should do the job - I don't think the checksumming is really a problem, as it's CRC32C, which modern Intel CPUs provide an instruction for.
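For reference, here is a minimal software sketch of CRC32C (the Castagnoli polynomial). This is only a table-driven illustration of what the checksum computes; it is not how btrfs or HDFS implement it, and real code would normally use the hardware instruction or an optimized library. The function names are made up for this example.

```python
CRC32C_POLY_REFLECTED = 0x82F63B78  # Castagnoli polynomial, bit-reflected form


def _make_table() -> list:
    """Precompute the CRC of every possible byte value."""
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ CRC32C_POLY_REFLECTED
            else:
                crc >>= 1
        table.append(crc)
    return table


_TABLE = _make_table()


def crc32c(data: bytes, crc: int = 0) -> int:
    """Return the CRC32C of data; pass a previous result as crc to continue."""
    crc ^= 0xFFFFFFFF
    for b in data:
        crc = _TABLE[(crc ^ b) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


if __name__ == "__main__":
    # Standard check value for the ASCII string "123456789".
    print(hex(crc32c(b"123456789")))  # 0xe3069283
```

The pure-Python loop is slow by design; the point of the hardware instruction is that the same result costs only a few cycles per 8 bytes.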

If the performance is not good enough, you could try btrfs on top of mdraid RAID5 - mdraid should be better optimized than btrfs at this point. If you don't need btrfs snapshots and subvolumes, you could implement the HDFS snapshotting using the upcoming XFS reflink support - it provides CoW semantics and should work with HDFS blocks if you cp --reflink them for snapshots.
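A rough sketch of the cp --reflink idea, assuming GNU coreutils cp and a filesystem that supports reflinks. The helper names and the notion of copying one block file per call are illustrative assumptions, not part of HDFS.

```python
import subprocess


def reflink_copy_command(src: str, dst: str, mode: str = "always") -> list:
    """Build a GNU cp command line that makes a CoW clone of src at dst.

    mode "always" fails if the filesystem cannot reflink;
    mode "auto" silently falls back to a normal copy.
    """
    if mode not in ("always", "auto"):
        raise ValueError("mode must be 'always' or 'auto'")
    return ["cp", "--reflink=" + mode, "--", src, dst]


def snapshot_block(src: str, dst: str, mode: str = "always") -> None:
    """Hypothetical helper: clone one block file into a snapshot directory."""
    subprocess.run(reflink_copy_command(src, dst, mode), check=True)


if __name__ == "__main__":
    # Paths are placeholders; print the command rather than running it.
    print(reflink_copy_command("blk_0001", "snapshot/blk_0001"))
```

Using "always" makes a missing reflink capability show up as an error instead of silently doubling disk usage.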

From numbers that got posted here a while ago, mdraid + XFS is at the moment quite a bit faster than btrfs - XFS provides metadata checksumming (no duplication though), so you could spare at least the double checksumming of data. However, using mdraid has some caveats, as it's not as easy to grow or shrink once configured.

HTH
Martin

Thanks for the tips Martin. We have a bit more research to do before we get started.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
