Ryan Blough created HDDS-15403:
----------------------------------
Summary: Build a second EC reconstruction procedure to target
faster time to recovery
Key: HDDS-15403
URL: https://issues.apache.org/jira/browse/HDDS-15403
Project: Apache Ozone
Issue Type: Improvement
Components: EC, Ozone Datanode
Reporter: Ryan Blough
This Jira is to track and implement a different erasure coding reconstruction
procedure with a focus on completing the reconstruction task faster.
The EC reconstruction procedure we currently use is sequential, by chunk, and
has a minimal resource footprint. The core loop of the reconstruction is:
# Fetch a container chunk over the network.
# Load the chunk into off-heap memory.
# Do reconstruction on the chunk.
# Write that chunk over the network to the target nodes.
# On confirmation of write, iterate the loop.
This has the advantage of consuming minimal resources at each step: the memory
footprint is limited to the size of one container chunk, and the work runs on a
single thread.
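
For readers unfamiliar with the loop, here is a minimal sketch of its shape. It
is not the Ozone implementation: the function names are hypothetical, the
network is simulated with in-memory byte strings, and a single XOR parity
stands in for the Reed-Solomon decode so that the example stays self-contained.

```python
from functools import reduce

CHUNK_SIZE = 4  # bytes; tiny on purpose, real chunks are far larger


def xor_blocks(blocks):
    """XOR equal-length byte strings together (stand-in for the RS decode)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def reconstruct_per_chunk(sources, chunk_size=CHUNK_SIZE):
    """Rebuild one lost replica, one chunk per loop iteration.

    `sources` holds the surviving replicas (data plus XOR parity) as bytes.
    Returns (rebuilt bytes, largest rebuilt buffer held at once, round trips).
    """
    target = bytearray()  # stands in for the remote target node
    peak_buffered = 0
    round_trips = 0
    for offset in range(0, len(sources[0]), chunk_size):
        # Steps 1-2: fetch one chunk from each source and hold it in memory.
        chunks = [s[offset:offset + chunk_size] for s in sources]
        round_trips += 1
        # Step 3: reconstruct the missing chunk.
        rebuilt = xor_blocks(chunks)
        peak_buffered = max(peak_buffered, len(rebuilt))
        # Steps 4-5: write the chunk out, then iterate once it is confirmed.
        target += rebuilt
        round_trips += 1
    return bytes(target), peak_buffered, round_trips


d0 = b"ABCDEFGHIJ"
d1 = b"0123456789"             # the replica we pretend to lose
parity = xor_blocks([d0, d1])
rebuilt, peak, trips = reconstruct_per_chunk([d0, parity])
```

In this toy run the rebuilt buffer never exceeds one chunk, while the number of
simulated network round trips grows with the number of chunks.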
However, it also has a network operation at either end of a loop that iterates
many times, so the cost of waiting on the network is paid on every iteration.
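
A rough back-of-the-envelope model makes the concern concrete. The figures
below are hypothetical and chosen only for easy arithmetic; they are not
measurements from Ozone, and the model assumes each iteration waits fully on
its read and its write before the next one starts.

```python
def loop_time_us(num_chunks, read_latency_us, work_us, write_latency_us):
    """Total time for a strictly sequential per-chunk loop, in microseconds."""
    return num_chunks * (read_latency_us + work_us + write_latency_us)


def network_wait_us(num_chunks, read_latency_us, write_latency_us):
    """Portion of that total spent waiting on the network."""
    return num_chunks * (read_latency_us + write_latency_us)


# Hypothetical figures: 1000 chunks, 1 ms per read, 0.5 ms of reconstruction
# work per chunk, 1 ms per confirmed write.
total = loop_time_us(1000, 1000, 500, 1000)
waiting = network_wait_us(1000, 1000, 1000)
print(total, waiting)  # 2500000 2000000
```

Under these assumed numbers, network wait scales linearly with the chunk count
and makes up most of the total.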
The concept of this second reconstruction method is to complete each of steps
1-4 as a single stage over the whole container, rather than once per chunk. The
tradeoff will be faster time to recovery for the individual container in
exchange for a larger resource footprint (namely enough memory to store the
full-size container).
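
As a sketch of the staged shape, under the same caveats as any toy example:
the names are hypothetical, the network is simulated in memory, XOR parity
stands in for Reed-Solomon, and treating each stage as one bulk transfer is an
assumption about how the staged method would behave, not a confirmed design.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together (stand-in for the RS decode)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def reconstruct_staged(sources, chunk_size=4):
    """Rebuild one lost replica with each step run once over the container.

    Returns (rebuilt bytes, largest rebuilt buffer held at once, round trips).
    """
    # Stages 1-2: fetch every source in full and hold all of it in memory.
    fetched = [bytes(s) for s in sources]
    round_trips = 1  # assumption: one bulk fetch
    # Stage 3: reconstruct the whole container, still using chunk work units.
    rebuilt = bytearray()
    for offset in range(0, len(fetched[0]), chunk_size):
        rebuilt += xor_blocks([f[offset:offset + chunk_size] for f in fetched])
    peak_buffered = len(rebuilt)  # the full container is resident at once
    # Stage 4: write everything to the target in one go.
    target = bytes(rebuilt)
    round_trips += 1  # assumption: one bulk write
    return target, peak_buffered, round_trips


d0 = b"ABCDEFGHIJ"
d1 = b"0123456789"             # the replica we pretend to lose
parity = xor_blocks([d0, d1])
rebuilt, peak, trips = reconstruct_staged([d0, parity])
```

Here the rebuilt buffer grows to the full container size, while the simulated
round-trip count no longer depends on the number of chunks.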
After a first pass to establish end-to-end single-threaded behavior in
comparison with the loop, additional considerations are likely to include
async, multithreading, and revisiting the single-chunk work unit depending on
how conventional Reed-Solomon (the algorithm in libhadoop) scales with work
unit size.