adoroszlai opened a new pull request, #5084: URL: https://github.com/apache/ozone/pull/5084
## What changes were proposed in this pull request?

HDDS-8698 introduced a hard limit on EC pipeline count. This may be too restrictive, since a few bad nodes could spoil all pipelines.

The problem can happen in the period between a datanode failing and SCM marking it as dead and closing its related pipelines. In this case, a client trying to write to the failed datanode detects the failure sooner than SCM does. It requests a new block, excluding the failed datanode. If all existing pipelines are excluded due to such a failure, the client encounters a block allocation failure and gives up.

This PR proposes to turn the limit into a soft one, increasing the limit for the final attempt (after searching through all existing pipelines) to the number of datanodes.

(This only affects small clusters with few disks per node. The problem may exist in larger clusters, too, and we may need to fine-tune this rule later.)

https://issues.apache.org/jira/browse/HDDS-9033

## How was this patch tested?

Tested manually in the `ozone-topology` environment:

* create 5 EC keys (creates a new container for each)
* find a datanode that is part of all 5 pipelines
* stop the datanode
* try to create another EC key

Client output shows a write failure, but the write eventually succeeds:

```
WARN io.KeyOutputStream: EC stripe write failed: S S F S S
```

SCM debug log indicates the pipeline limit was hit:

```
scm_1 | 2023-07-18 09:49:35,216 [IPC Server handler 2 on default port 9863] DEBUG pipeline.WritableECContainerProvider: Pipeline count 5 reached limit 5, checking existing ones; requested size=268435456, replication=EC{rs-3-2-1024k}, owner=om1, ExcludeList {datanodes = [0c764f3f-ac11-4757-8a20-bb379dbabfa8(null/null)], containerIds = [], pipelineIds = [PipelineID=fa15da28-8d55-4b6e-aad6-dc5898b4c92e]}
scm_1 | 2023-07-18 09:49:35,216 [IPC Server handler 2 on default port 9863] DEBUG pipeline.WritableECContainerProvider: Checking existing pipelines: []
scm_1 | 2023-07-18 09:49:35,216 [IPC Server handler 2 on default port 9863] DEBUG pipeline.WritableECContainerProvider: Increasing pipeline limit 5 -> 6 for final attempt
scm_1 | 2023-07-18 09:49:35,228 [IPC Server handler 2 on default port 9863] INFO pipeline.WritableECContainerProvider: Created and opened new pipeline Pipeline[ Id: 8052e76e-a184-43f5-afa3-9532c88d3275, Nodes: 618d8a2e-64f2-421e-9269-2e3cb1437542(ozone-topology_datanode_3_1.ozone-topology_net/10.5.0.6)90b392fa-95bf-4ea3-b2f7-43015451cfc7(ozone-topology_datanode_5_1.ozone-topology_net/10.5.0.8)38fc6c5a-fcd5-4819-97c5-468ec4648e19(ozone-topology_datanode_4_1.ozone-topology_net/10.5.0.7)5059a97c-bd7c-4819-832f-41ac0d5b4c7f(ozone-topology_datanode_1_1.ozone-topology_net/10.5.0.4)5b2b6e9d-d67e-4c39-aa85-57066d69154f(ozone-topology_datanode_2_1.ozone-topology_net/10.5.0.5), ReplicationConfig: EC{rs-3-2-1024k}, State:ALLOCATED, leaderId:, CreationTimestamp2023-07-18T09:49:35.218009Z[UTC]]
```

Added unit test.

CI: https://github.com/adoroszlai/hadoop-ozone/actions/runs/5586806884
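
For readers who want the soft-limit rule in code form, here is a minimal sketch of the decision order described in this pull request. It is not the actual `WritableECContainerProvider` implementation: the class `SoftPipelineLimitSketch`, the `Decision` enum, the `allocate` method, and the map-based pipeline model are all hypothetical names introduced for illustration, and the real provider also considers container space, excluded pipeline IDs, and other conditions that are left out here.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Illustrative sketch only; all names here are hypothetical, not Ozone APIs. */
class SoftPipelineLimitSketch {

  enum Decision { CREATE_NEW, REUSE_EXISTING, CREATE_NEW_OVER_LIMIT, FAIL }

  /**
   * @param pipelines       existing pipeline id -> ids of its datanodes (simplified model)
   * @param excludedNodes   datanodes the client reported as failed
   * @param configuredLimit the normal EC pipeline limit
   * @param datanodeCount   number of datanodes; used as the raised limit
   */
  static Decision allocate(Map<String, Set<String>> pipelines,
      Set<String> excludedNodes, int configuredLimit, int datanodeCount) {
    // Below the configured limit: create a pipeline as usual.
    if (pipelines.size() < configuredLimit) {
      return Decision.CREATE_NEW;
    }
    // At the limit: look for an existing pipeline without any excluded datanode.
    for (Set<String> nodes : pipelines.values()) {
      if (Collections.disjoint(nodes, excludedNodes)) {
        return Decision.REUSE_EXISTING;
      }
    }
    // Final attempt: treat the limit as soft and raise it to the datanode count.
    int raisedLimit = Math.max(configuredLimit, datanodeCount);
    if (pipelines.size() < raisedLimit) {
      return Decision.CREATE_NEW_OVER_LIMIT;
    }
    return Decision.FAIL;
  }

  public static void main(String[] args) {
    // Five pipelines that all contain the failed datanode "dn1" (assumed ids).
    Map<String, Set<String>> pipelines = new LinkedHashMap<>();
    for (int i = 1; i <= 5; i++) {
      pipelines.put("p" + i,
          new HashSet<>(Arrays.asList("dn1", "dn2", "dn3", "dn4", "dn5")));
    }
    Set<String> excluded = Collections.singleton("dn1");
    // Limit 5 with 6 datanodes, as in the manual test of this PR.
    System.out.println(allocate(pipelines, excluded, 5, 6));
  }
}
```

With the inputs in `main`, which mirror the manual test (limit 5, 6 datanodes, every pipeline containing the stopped node), the sketch takes the raised-limit branch, corresponding to the `Increasing pipeline limit 5 -> 6 for final attempt` log line. Under a hard limit the same request would end in the `FAIL` branch.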
