Ryan Blough created HDDS-15403:
----------------------------------

             Summary: Build a second EC reconstruction procedure to target faster time to recovery
                 Key: HDDS-15403
                 URL: https://issues.apache.org/jira/browse/HDDS-15403
             Project: Apache Ozone
          Issue Type: Improvement
          Components: EC, Ozone Datanode
            Reporter: Ryan Blough


This Jira is to track and implement a different erasure coding reconstruction 
procedure with a focus on completing the reconstruction task faster.

The EC reconstruction procedure we currently use is sequential, by chunk, and 
has a minimal resource footprint. The core loop of the reconstruction is:
 # Fetch a container chunk over the network.
 # Load the chunk into off-heap memory.
 # Do reconstruction on the chunk.
 # Write that chunk over the network to the target nodes.
 # On confirmation of write, iterate the loop.
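
As a rough illustration of the loop above, here is a minimal Java sketch. It is 
not the Ozone implementation: ChunkSource, ChunkSink, and ChunkLoopSketch are 
hypothetical names, simple XOR parity stands in for Reed-Solomon, and heap 
arrays stand in for off-heap buffers. It only shows that each iteration pays 
for one network read and one confirmed network write.

```java
/** Hypothetical sketch of the per-chunk loop; none of these names are Ozone APIs. */
class ChunkLoopSketch {

    /** Assumed stand-in for reading one chunk from each surviving replica. */
    interface ChunkSource {
        int chunkCount();
        byte[][] fetch(int chunkIndex); // one byte[] per surviving replica
    }

    /** Assumed stand-in for a network write that returns once the target confirms. */
    interface ChunkSink {
        void writeAndAwaitAck(int chunkIndex, byte[] recovered);
    }

    /** XOR parity stand-in for Reed-Solomon: recovers a single missing cell. */
    static byte[] xorDecode(byte[][] surviving) {
        byte[] out = new byte[surviving[0].length];
        for (byte[] cell : surviving) {
            for (int i = 0; i < out.length; i++) {
                out[i] ^= cell[i];
            }
        }
        return out;
    }

    /** Runs the loop and returns how many network interactions it made. */
    static int reconstruct(ChunkSource source, ChunkSink sink) {
        int networkCalls = 0;
        for (int c = 0; c < source.chunkCount(); c++) {
            byte[][] surviving = source.fetch(c);    // steps 1-2: fetch and hold one chunk
            networkCalls++;
            byte[] recovered = xorDecode(surviving); // step 3: reconstruct that chunk
            sink.writeAndAwaitAck(c, recovered);     // steps 4-5: write, wait for confirmation
            networkCalls++;
        }
        return networkCalls;
    }
}
```

In this sketch only one chunk per replica is held at a time, and the number of 
network interactions is twice the number of chunks.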

This has the advantage of consuming minimal resources at each step: the memory 
footprint is limited to the size of one container chunk, and the work uses a 
single thread.

However, it also puts a network call at each end of a loop that iterates many 
times, so network latency is paid on every iteration.

The concept of this second reconstruction method is to complete each of steps 
1-4 as a single stage over the whole container, rather than once per chunk. The 
tradeoff will be faster time to recovery for the individual container in 
exchange for a larger resource footprint (namely enough memory to store the 
full-size container).
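
A minimal Java sketch of the staged idea follows, under the same kind of 
assumptions: BulkSource, BulkSink, and StagedReconstructionSketch are 
hypothetical names rather than Ozone APIs, and XOR parity stands in for 
Reed-Solomon. It is meant only to show the shape of the tradeoff, namely one 
bulk read and one bulk write in exchange for holding every chunk in memory.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a staged reconstruction; none of these names are Ozone APIs. */
class StagedReconstructionSketch {

    /** Assumed stand-in for one bulk read of every chunk from the surviving replicas. */
    interface BulkSource {
        List<byte[][]> fetchAll(); // per chunk: one byte[] per surviving replica
    }

    /** Assumed stand-in for one bulk write that returns once the target confirms. */
    interface BulkSink {
        void writeAllAndAwaitAck(List<byte[]> recovered);
    }

    /** XOR parity stand-in for Reed-Solomon: recovers a single missing cell. */
    static byte[] xorDecode(byte[][] surviving) {
        byte[] out = new byte[surviving[0].length];
        for (byte[] cell : surviving) {
            for (int i = 0; i < out.length; i++) {
                out[i] ^= cell[i];
            }
        }
        return out;
    }

    /** Runs each stage once and returns how many network interactions it made. */
    static int reconstruct(BulkSource source, BulkSink sink) {
        // Stages 1-2: a single read; the whole container is now held in memory.
        List<byte[][]> all = source.fetchAll();

        // Stage 3: reconstruct every chunk with no network calls in between.
        List<byte[]> recovered = new ArrayList<>(all.size());
        for (byte[][] surviving : all) {
            recovered.add(xorDecode(surviving));
        }

        // Stage 4: a single write of all recovered chunks.
        sink.writeAllAndAwaitAck(recovered);
        return 2; // one bulk read plus one bulk write, independent of chunk count
    }
}
```

Here the number of network interactions no longer grows with the chunk count, 
while peak memory grows to cover the surviving data for the whole container 
plus the recovered output.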

After a first pass to establish end-to-end single-threaded behavior in 
comparison with the loop, additional considerations are likely to include 
async, multithreading, and revisiting the single-chunk work unit depending on 
how conventional Reed-Solomon (the algorithm in libhadoop) scales with work 
unit size.


