[ https://issues.apache.org/jira/browse/HDFS-17737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17929264#comment-17929264 ]
ASF GitHub Bot commented on HDFS-17737:
---------------------------------------

dannytbecker opened a new pull request, #7427:
URL: https://github.com/apache/hadoop/pull/7427

### Description of PR

Currently, EC reads are less stable than replication reads because if 4 out of 9 DataNodes in the block group are busy, the whole read fails. Erasure coding reads need to handle ERROR_BUSY signals from DataNodes and retry after a backoff duration, which avoids overloading the DataNodes while increasing the stability of the read (a rough backoff sketch follows below).

Throttling on the server side was another proposed solution, but we prefer this client-side backoff for a few main reasons (see https://msasg.visualstudio.com/DefaultCollection/Multi%20Tenancy/_git/Hadoop/pullRequest/5272897#1739008224):
1. Throttling on the server would use up thread connections, which have a maximum limit.
2. Throttling was originally added only for the cohosting scenario, to reduce impact on other services.
3. Throttling would use up resources on a DataNode that is already busy.

#### What

The previous implementation followed a four-phase read algorithm:
1. Attempt to read chunks from the data blocks.
2. Check for missing data chunks. Fail if there are more missing chunks than parity blocks; otherwise read parity blocks and null data blocks.
3. Wait for data to be read into the buffers and handle any read errors by reading from more parity blocks.
4. Check for missing blocks and either decode or fail.

The new implementation merges phases 1-3 into a single loop (see the loop sketch below):
1. Loop until we have enough blocks to read or decode, or we have too many missing blocks to succeed:
   - Determine the number of chunks we need to fetch. ALLZERO chunks count towards this total. Null data chunks also count towards this total unless there are missing data chunks.
   - Read chunks until we have enough pending or fetched to be able to decode or do a normal read.
   - Get results from the reads and handle exceptions by preparing more reads for decoding the missing data.
   - Check whether we should sleep before retrying any reads.
2. Check for missing blocks and either decode or fail.

Add two new states to StripingChunk:
- "SLEEPING" to indicate that the read from the node where the chunk is stored has failed and will be retried in the future.
- "READY" to indicate that the node where the chunk is stored is ready to be attempted.

### How was this patch tested?

Add unit tests to `TestWriteReadStripedFile`:
- Covers RS(3,2) with 1 chunk on failed nodes, 2 chunks on failed nodes, and 3 chunks on failed nodes.

### For code changes:

- [x] Does the title or this PR starts with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
- [ ] Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
- [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?
- [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, `NOTICE-binary` files?
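A minimal sketch of the client-side backoff idea described above, assuming hypothetical names (`ChunkReader`, `BusyException`, `MAX_RETRIES`, `BASE_BACKOFF_MS`); the actual patch wires the retry into the striped read path rather than a standalone helper:

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ThreadLocalRandom;

public class BusyBackoffSketch {
  private static final int MAX_RETRIES = 3;          // assumed retry budget
  private static final long BASE_BACKOFF_MS = 100L;  // assumed base delay

  /** Stand-in for a chunk read that signals ERROR_BUSY by throwing. */
  interface ChunkReader {
    ByteBuffer readChunk(int chunkIndex) throws BusyException;
  }

  static class BusyException extends Exception { }

  /**
   * Retry a single chunk read with exponential backoff plus jitter instead of
   * failing the whole stripe the first time a DataNode reports it is busy.
   */
  static ByteBuffer readWithBackoff(ChunkReader reader, int chunkIndex)
      throws BusyException, InterruptedException {
    for (int attempt = 0; ; attempt++) {
      try {
        return reader.readChunk(chunkIndex);
      } catch (BusyException busy) {
        if (attempt >= MAX_RETRIES) {
          throw busy; // give up; the caller falls back to decoding from parity
        }
        long backoff = BASE_BACKOFF_MS << attempt;                 // exponential growth
        backoff += ThreadLocalRandom.current().nextLong(backoff);  // add jitter
        Thread.sleep(backoff);
      }
    }
  }
}
```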
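And a toy model of the merged single-loop read with the new SLEEPING/READY chunk states. The states and the loop shape follow the PR description; the RS(3,2) numbers, `tryRead()`, and the fixed 50 ms backoff are illustrative assumptions, not the real `StripeReader` code:

```java
import java.util.Arrays;

/** Toy model of the merged read loop described above; not the real StripeReader. */
public class StripedReadLoopSketch {
  // SLEEPING and READY are the two states added by this change.
  enum ChunkState { READY, FETCHED, SLEEPING, MISSING }
  enum ReadResult { SUCCESS, BUSY, FAILED }

  public static void main(String[] args) throws InterruptedException {
    final int dataBlocks = 3, parityBlocks = 2;              // RS(3,2)
    ChunkState[] chunks = new ChunkState[dataBlocks + parityBlocks];
    Arrays.fill(chunks, ChunkState.READY);

    int fetched = 0;
    // Loop until enough chunks are fetched to read or decode,
    // or too many chunks are missing to succeed.
    while (fetched < dataBlocks) {
      long missing = Arrays.stream(chunks)
          .filter(s -> s == ChunkState.MISSING).count();
      if (missing > parityBlocks) {
        throw new IllegalStateException("too many missing chunks to decode");
      }
      for (int i = 0; i < chunks.length && fetched < dataBlocks; i++) {
        if (chunks[i] != ChunkState.READY) {
          continue;
        }
        switch (tryRead(i)) {
          case SUCCESS: chunks[i] = ChunkState.FETCHED; fetched++; break;
          case BUSY:    chunks[i] = ChunkState.SLEEPING; break; // retry after backoff
          case FAILED:  chunks[i] = ChunkState.MISSING; break;  // decode from parity
        }
      }
      if (fetched < dataBlocks) {
        // Sleep before retrying, then wake SLEEPING chunks back to READY.
        Thread.sleep(50L);
        for (int i = 0; i < chunks.length; i++) {
          if (chunks[i] == ChunkState.SLEEPING) {
            chunks[i] = ChunkState.READY;
          }
        }
      }
    }
    System.out.println("fetched enough chunks: " + Arrays.toString(chunks));
  }

  /** Stand-in for a chunk read from a DataNode; randomly reports BUSY. */
  private static ReadResult tryRead(int chunkIndex) {
    return Math.random() < 0.3 ? ReadResult.BUSY : ReadResult.SUCCESS;
  }
}
```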
> Implement Backoff Retry for ErasureCoding reads
> -----------------------------------------------
>
>                 Key: HDFS-17737
>                 URL: https://issues.apache.org/jira/browse/HDFS-17737
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: dfsclient, ec, erasure-coding
>    Affects Versions: 3.3.4
>            Reporter: Danny Becker
>            Assignee: Danny Becker
>            Priority: Major
>
> # Why
> Currently, EC reads are less stable than replication reads because if 4 out of 9 DataNodes in the block group are busy, the whole read fails. Erasure coding reads need to handle ERROR_BUSY signals from DataNodes and retry after a backoff duration, which avoids overloading the DataNodes while increasing the stability of the read.
> Throttling on the server side was another proposed solution, but we prefer this client-side backoff for a few main reasons (see https://msasg.visualstudio.com/DefaultCollection/Multi%20Tenancy/_git/Hadoop/pullRequest/5272897#1739008224):
> 1. Throttling on the server would use up thread connections, which have a maximum limit.
> 2. Throttling was originally added only for the cohosting scenario, to reduce impact on other services.
> 3. Throttling would use up resources on a DataNode that is already busy.
> # What
> The previous implementation followed a four-phase read algorithm:
> 1. Attempt to read chunks from the data blocks.
> 2. Check for missing data chunks. Fail if there are more missing chunks than parity blocks; otherwise read parity blocks and null data blocks.
> 3. Wait for data to be read into the buffers and handle any read errors by reading from more parity blocks.
> 4. Check for missing blocks and either decode or fail.
> The new implementation merges phases 1-3 into a single loop:
> 1. Loop until we have enough blocks to read or decode, or we have too many missing blocks to succeed:
>    - Determine the number of chunks we need to fetch. ALLZERO chunks count towards this total. Null data chunks also count towards this total unless there are missing data chunks.
>    - Read chunks until we have enough pending or fetched to be able to decode or do a normal read.
>    - Get results from the reads and handle exceptions by preparing more reads for decoding the missing data.
>    - Check whether we should sleep before retrying any reads.
> 2. Check for missing blocks and either decode or fail.
> # Tests
> Add unit tests to `TestWriteReadStripedFile`:
> - Covers RS(3,2) with 1 chunk busy, 2 chunks busy, and 3 chunks busy.