adoroszlai opened a new pull request, #5084:
URL: https://github.com/apache/ozone/pull/5084

   ## What changes were proposed in this pull request?
   
   HDDS-8698 introduced a hard limit on the EC pipeline count.  This may be too 
restrictive, since a few bad nodes could make every existing pipeline unusable.
   
   The problem can occur in the window between a datanode failing and SCM 
marking it as dead and closing its related pipelines.  During that window, a 
client trying to write to the failed datanode detects the failure sooner than 
SCM does.  It requests a new block, excluding the failed datanode.  If all 
existing pipelines are excluded due to such failures, the client encounters a 
block allocation failure and gives up.
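
To illustrate the failure mode described above, here is a minimal sketch. It is not Ozone code: `ExcludeListSketch`, `usablePipelines`, and the `dn1`..`dn6` node names are hypothetical, and a pipeline is modelled simply as the set of datanode IDs it spans. With six datanodes and five-node pipelines, one failed node can belong to all five existing pipelines, so excluding it leaves nothing to reuse.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical model, not Ozone code: a pipeline is the set of datanode IDs it spans.
final class ExcludeListSketch {

  /** Returns the pipelines that contain none of the excluded datanodes. */
  static List<Set<String>> usablePipelines(
      List<Set<String>> pipelines, Set<String> excludedNodes) {
    return pipelines.stream()
        .filter(p -> p.stream().noneMatch(excludedNodes::contains))
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    // Five pipelines of five nodes each; every one of them includes dn1.
    List<Set<String>> pipelines = List.of(
        Set.of("dn1", "dn2", "dn3", "dn4", "dn5"),
        Set.of("dn1", "dn2", "dn3", "dn4", "dn6"),
        Set.of("dn1", "dn2", "dn3", "dn5", "dn6"),
        Set.of("dn1", "dn2", "dn4", "dn5", "dn6"),
        Set.of("dn1", "dn3", "dn4", "dn5", "dn6"));

    // The client excludes the failed node dn1: no existing pipeline survives.
    System.out.println(usablePipelines(pipelines, Set.of("dn1")).size()); // prints 0
  }
}
```

Under a hard limit of 5, this state has no reusable pipeline and no room to create a new one, which matches the allocation failure described above.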
   
   This PR proposes to turn the limit into a soft one: for the final attempt 
(after searching through all existing pipelines), the limit is raised to the 
number of datanodes.  (This only affects small clusters with few disks per 
node.  The problem may exist in larger clusters, too, and we may need to 
fine-tune this rule later.)
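
The soft-limit rule can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual `WritableECContainerProvider` logic: the class `SoftPipelineLimitSketch`, its methods, and the node-selection strategy are all hypothetical, and a pipeline is again modelled as a set of datanode IDs. The order of steps (create while below the limit, otherwise reuse, otherwise one final attempt with the raised limit) is assumed from the description above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

// Hypothetical model, not Ozone code.
final class SoftPipelineLimitSketch {
  private final Set<String> datanodes;
  private final int pipelineSize;     // data + parity, e.g. 5 for rs-3-2
  private final int configuredLimit;  // the formerly hard limit
  private final List<Set<String>> pipelines = new ArrayList<>();

  SoftPipelineLimitSketch(Set<String> datanodes, int pipelineSize, int configuredLimit) {
    this.datanodes = new TreeSet<>(datanodes);
    this.pipelineSize = pipelineSize;
    this.configuredLimit = configuredLimit;
  }

  void addExisting(Set<String> pipeline) { pipelines.add(new TreeSet<>(pipeline)); }

  int pipelineCount() { return pipelines.size(); }

  Optional<Set<String>> allocate(Set<String> excluded) {
    // 1. Below the configured limit: try to create a new pipeline.
    if (pipelines.size() < configuredLimit) {
      Optional<Set<String>> created = create(excluded);
      if (created.isPresent()) {
        return created;
      }
    }
    // 2. Otherwise reuse an existing pipeline that avoids all excluded nodes.
    for (Set<String> p : pipelines) {
      if (Collections.disjoint(p, excluded)) {
        return Optional.of(p);
      }
    }
    // 3. Final attempt: raise the limit to the number of datanodes.
    int softLimit = Math.max(configuredLimit, datanodes.size());
    if (pipelines.size() < softLimit) {
      return create(excluded);
    }
    return Optional.empty();
  }

  // Picks the first pipelineSize non-excluded nodes; real placement is more involved.
  private Optional<Set<String>> create(Set<String> excluded) {
    Set<String> members = datanodes.stream()
        .filter(dn -> !excluded.contains(dn))
        .limit(pipelineSize)
        .collect(Collectors.toCollection(TreeSet::new));
    if (members.size() < pipelineSize) {
      return Optional.empty();
    }
    pipelines.add(members);
    return Optional.of(members);
  }

  public static void main(String[] args) {
    Set<String> all = Set.of("dn1", "dn2", "dn3", "dn4", "dn5", "dn6");
    SoftPipelineLimitSketch scm = new SoftPipelineLimitSketch(all, 5, 5);
    // Five existing pipelines, each omitting one node other than dn1.
    for (String omit : new String[] {"dn2", "dn3", "dn4", "dn5", "dn6"}) {
      Set<String> p = new TreeSet<>(all);
      p.remove(omit);
      scm.addExisting(p);
    }
    // dn1 has failed; the final attempt creates a sixth pipeline without it.
    System.out.println(scm.allocate(Set.of("dn1")));
    System.out.println(scm.pipelineCount());
  }
}
```

In this model the raised limit is still bounded by the datanode count, so a client that keeps excluding nodes cannot cause unbounded pipeline creation.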
   
   https://issues.apache.org/jira/browse/HDDS-9033
   
   ## How was this patch tested?
   
   Tested manually in the `ozone-topology` environment:
   
    * create 5 EC keys (creates new container for each)
    * find a datanode that is part of all 5 pipelines
    * stop the datanode
    * try to create another EC key
   
   Client output shows a write failure, but the write succeeds eventually:
   
   ```
   WARN io.KeyOutputStream: EC stripe write failed: S S F S S
   ```
   
   SCM debug log indicates pipeline limit was hit:
   
   ```
   scm_1         | 2023-07-18 09:49:35,216 [IPC Server handler 2 on default port 9863] DEBUG pipeline.WritableECContainerProvider: Pipeline count 5 reached limit 5, checking existing ones; requested size=268435456, replication=EC{rs-3-2-1024k}, owner=om1, ExcludeList {datanodes = [0c764f3f-ac11-4757-8a20-bb379dbabfa8(null/null)], containerIds = [], pipelineIds = [PipelineID=fa15da28-8d55-4b6e-aad6-dc5898b4c92e]}
   scm_1         | 2023-07-18 09:49:35,216 [IPC Server handler 2 on default port 9863] DEBUG pipeline.WritableECContainerProvider: Checking existing pipelines: []
   scm_1         | 2023-07-18 09:49:35,216 [IPC Server handler 2 on default port 9863] DEBUG pipeline.WritableECContainerProvider: Increasing pipeline limit 5 -> 6 for final attempt
   scm_1         | 2023-07-18 09:49:35,228 [IPC Server handler 2 on default port 9863] INFO pipeline.WritableECContainerProvider: Created and opened new pipeline Pipeline[ Id: 8052e76e-a184-43f5-afa3-9532c88d3275, Nodes: 618d8a2e-64f2-421e-9269-2e3cb1437542(ozone-topology_datanode_3_1.ozone-topology_net/10.5.0.6)90b392fa-95bf-4ea3-b2f7-43015451cfc7(ozone-topology_datanode_5_1.ozone-topology_net/10.5.0.8)38fc6c5a-fcd5-4819-97c5-468ec4648e19(ozone-topology_datanode_4_1.ozone-topology_net/10.5.0.7)5059a97c-bd7c-4819-832f-41ac0d5b4c7f(ozone-topology_datanode_1_1.ozone-topology_net/10.5.0.4)5b2b6e9d-d67e-4c39-aa85-57066d69154f(ozone-topology_datanode_2_1.ozone-topology_net/10.5.0.5), ReplicationConfig: EC{rs-3-2-1024k}, State:ALLOCATED, leaderId:, CreationTimestamp2023-07-18T09:49:35.218009Z[UTC]]
   ```
   
   Added unit test.
   
   CI:
   https://github.com/adoroszlai/hadoop-ozone/actions/runs/5586806884



