[ https://issues.apache.org/jira/browse/HADOOP-19232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17926775#comment-17926775 ]
ASF GitHub Bot commented on HADOOP-19232:
-----------------------------------------

anmolanmol1234 opened a new pull request, #7385:
URL: https://github.com/apache/hadoop/pull/7385

## Description of PR:
This PR is part of the series of work done under the parent Jira [HADOOP-19179](https://issues.apache.org/jira/browse/HADOOP-19179).
Jira for this patch: https://issues.apache.org/jira/browse/HADOOP-19232

The scope of this task is to refactor the `AbfsOutputStream` class to handle ingress for both the DFS and Blob endpoints effectively.

## Production code changes:
The `AbfsOutputStream` class handles the data being written to Azure Storage. Its primary responsibilities include:
- **Buffering**: Temporarily holding data in memory before it is uploaded.
- **Streaming**: Efficiently streaming data to Azure Storage.
- **Committing**: Ensuring that buffered data is correctly uploaded and committed to Azure Storage.

### New Additions
The new additions introduce a more modular and flexible approach to managing data ingress (data being written to storage), catering to both Azure Data Lake Storage (ADLS) and Azure Blob Storage.

#### AzureIngressHandler
The `AzureIngressHandler` is a new parent class designed to encapsulate common logic for data ingress operations. It simplifies the process of writing data to Azure Storage by providing a unified interface. This class has two specialized child classes:

1. **AzureDfsIngressHandler**:
   - Manages data ingress specifically for Azure Data Lake Storage (DFS).
   - Handles operations like creating, appending, and flushing data blocks for DFS.
2. **AzureBlobIngressHandler**:
   - Manages data ingress specifically for Azure Blob Storage (BLOB).
   - Handles operations like creating, appending, and flushing data blocks for Blob Storage, while ensuring that each block has a unique `blockId`.

#### AbfsBlock and AbfsBlobBlock
Data is managed in discrete blocks to improve efficiency and manageability.

1. **AbfsBlock**:
   - A basic structure for buffering data.
   - Used as a common block type for both DFS and Blob Storage.
2. **AbfsBlobBlock**:
   - A subclass of `AbfsBlock` tailored for Blob Storage.
   - Requires a unique `blockId` for each block, which is necessary for the Blob Storage API.

#### Block Managers
To manage these data blocks, new manager classes have been introduced. These classes handle the lifecycle of blocks, including creation, appending, and flushing.

1. **AzureBlockManager**:
   - A parent class for managing the lifecycle of data blocks.
   - Provides common functionality for block management.
2. **AzureDFSBlockManager**:
   - Manages the lifecycle of `AbfsBlock` instances for DFS.
   - Handles the specifics of appending and flushing blocks in DFS.
3. **AzureBlobBlockManager**:
   - Manages the lifecycle of `AbfsBlobBlock` instances for Blob Storage.
   - Ensures each block has a unique `blockId`.
   - Handles the specifics of appending and flushing blocks in Blob Storage.

### Integration with AbfsOutputStream
The `AbfsOutputStream` class has been updated to incorporate the new ingress flow logic, enhancing its ability to handle data writes to both DFS and Blob Storage. Here’s how it integrates:

1. **Configuration Selection**:
   - The `AbfsOutputStream` reads the configuration parameter `fs.azure.ingress.service.type` to determine whether the user has configured the system to use `BLOB` or `DFS` for data ingress.
2. **Handler Initialization**:
   - Based on the configuration, `AbfsOutputStream` initializes the appropriate handler (`AzureBlobIngressHandler` or `AzureDfsIngressHandler`).
3. **Buffering Data**:
   - As data is written to `AbfsOutputStream`, it is buffered into blocks (`AbfsBlock` for DFS or `AbfsBlobBlock` for Blob Storage).
4. **Managing Blocks**:
   - The corresponding block manager (`AzureDFSBlockManager` or `AzureBlobBlockManager`) manages the lifecycle of these blocks, ensuring that data is correctly created, appended, and flushed.
5. **Block ID Management (Blob Specific)**:
   - For Blob Storage, `AzureBlobBlockManager` ensures that each block has a unique `blockId`, adhering to the requirements of the Blob Storage API.

### Detailed Flow
1. **Creating Data Blocks**:
   - When data is written to `AbfsOutputStream`, it is divided into blocks (`AbfsBlock` for DFS or `AbfsBlobBlock` for Blob Storage).
2. **Appending Data**:
   - These blocks are appended to the Azure storage system via the appropriate handler (`AzureBlobIngressHandler` or `AzureDfsIngressHandler`).
3. **Flushing Data**:
   - Once all data has been buffered, the blocks are flushed so that all buffered data is committed to the storage system.
4. **Lifecycle Management**:
   - The block managers (`AzureDFSBlockManager` and `AzureBlobBlockManager`) oversee the lifecycle of blocks, handling retries and errors and ensuring data integrity.

## Test Code Changes:
1. Existing tests are modified to work with the new abstracted design.
2. The test suite was run on the DFS endpoint to make sure the original driver continues to work unchanged.
3. New tests are added around the new code.

> ABFS: [FnsOverBlob] Implementing Ingress Support with various Fallback
> Handling
> -------------------------------------------------------------------------------
>
> Key: HADOOP-19232
> URL: https://issues.apache.org/jira/browse/HADOOP-19232
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs/azure
> Affects Versions: 3.4.0
> Reporter: Anuj Modi
> Assignee: Anmol Asrani
> Priority: Major
> Labels: pull-request-available
>
> Scope of this task is to refactor the AbfsOutputStream class to handle the
> ingress for DFS and Blob endpoint effectively.
> More details will be added soon.
> Prerequisites for this Patch:
> 1. [HADOOP-19187] ABFS: [FnsOverBlob]Making AbfsClient Abstract for
> supporting both DFS and Blob Endpoint - ASF JIRA (apache.org)
> 2. [HADOOP-19226] ABFS: [FnsOverBlob]Implementing Azure Rest APIs on Blob
> Endpoint for AbfsBlobClient - ASF JIRA (apache.org)
> 3. [HADOOP-19207] ABFS: [FnsOverBlob]Response Handling of Blob Endpoint APIs
> and Metadata APIs - ASF JIRA (apache.org)
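
For readers who want a concrete picture of the ingress design described in the pull request, here is a minimal Java sketch of two of its ideas: selecting a handler from the configured service type, and giving every Blob block its own unique ID. It is not the code from the patch. All names in it (`IngressSketch`, `IngressHandler`, `DfsHandler`, `BlobHandler`, `forType`, `labelFor`) are hypothetical stand-ins, and the ID scheme (a zero-padded counter, Base64-encoded) is an assumption, since the description does not say how `blockId` values are generated.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;
import java.util.Locale;

// Illustrative sketch only: none of these types are the real Hadoop ABFS classes.
class IngressSketch {

    // Hypothetical stand-in for the values of fs.azure.ingress.service.type.
    enum ServiceType { DFS, BLOB }

    // Common parent: tracks the write position and records one label per appended block.
    abstract static class IngressHandler {
        private final List<String> blockLabels = new ArrayList<>();
        private long position = 0;

        // Records a block, advances the write position, and returns the block's label.
        final String append(byte[] data) {
            String label = labelFor(position, blockLabels.size());
            blockLabels.add(label);
            position += data.length;
            return label;
        }

        // Returns a copy of the labels of all blocks appended so far, in order.
        final List<String> appendedBlocks() {
            return new ArrayList<>(blockLabels);
        }

        // Returns the total number of bytes appended so far.
        final long position() {
            return position;
        }

        // Endpoint-specific: how a block is identified when it is appended.
        abstract String labelFor(long offset, int index);
    }

    // DFS-style handler: a block is identified by the file offset it is appended at.
    static final class DfsHandler extends IngressHandler {
        @Override
        String labelFor(long offset, int index) {
            return "offset=" + offset;
        }
    }

    // Blob-style handler: every block carries its own unique, Base64-encoded id.
    static final class BlobHandler extends IngressHandler {
        @Override
        String labelFor(long offset, int index) {
            // Zero-padding keeps all ids the same length (an assumed id scheme).
            String raw = String.format(Locale.ROOT, "block-%08d", index);
            return Base64.getEncoder().encodeToString(raw.getBytes(StandardCharsets.UTF_8));
        }
    }

    // Picks a handler from the configured service type, ignoring case and whitespace.
    static IngressHandler forType(String configured) {
        ServiceType type = ServiceType.valueOf(configured.trim().toUpperCase(Locale.ROOT));
        return type == ServiceType.BLOB ? new BlobHandler() : new DfsHandler();
    }

    public static void main(String[] args) {
        IngressHandler handler = forType("blob");
        handler.append(new byte[16]);
        handler.append(new byte[16]);
        System.out.println(handler.appendedBlocks());
    }
}
```

The sketch keeps the shared bookkeeping in the parent class and leaves only block identification to the children, mirroring the parent/child split the description gives for `AzureIngressHandler`. Retries, error handling, and the actual REST calls are deliberately left out.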