[ https://issues.apache.org/jira/browse/HADOOP-19604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18012602#comment-18012602 ]
ASF GitHub Bot commented on HADOOP-19604: ----------------------------------------- anujmodi2021 commented on code in PR #7853: URL: https://github.com/apache/hadoop/pull/7853#discussion_r2260558410 ########## hadoop-tools/hadoop-azure/src/test/java/org/apache/hadoop/fs/azurebfs/services/ITestAbfsOutputStream.java: ########## @@ -481,6 +481,8 @@ public void testResetCalledOnExceptionInRemoteFlush() throws Exception { //expected exception } // Verify that reset was called on the message digest Review Comment: Can we add a similar test where with config disabled we assert that methods to compute md5 hash were not called at all? ########## hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/AbfsConfiguration.java: ########## @@ -438,6 +438,10 @@ public class AbfsConfiguration{ FS_AZURE_ABFS_ENABLE_CHECKSUM_VALIDATION, DefaultValue = DEFAULT_ENABLE_ABFS_CHECKSUM_VALIDATION) private boolean isChecksumValidationEnabled; + @BooleanConfigurationValidatorAnnotation(ConfigurationKey = + FS_AZURE_ABFS_ENABLE_FULL_BLOB_CHECKSUM_VALIDATION, DefaultValue = DEFAULT_ENABLE_FULL_BLOB_ABFS_CHECKSUM_VALIDATION) Review Comment: Do we need "ABFS" in Configuration Key variable name? We always have AZURE alone. ########## hadoop-tools/hadoop-azure/src/test/java/org/apache/hadoop/fs/azurebfs/services/AbfsClientTestUtil.java: ########## @@ -166,7 +166,12 @@ public static void setMockAbfsRestOperationForFlushOperation( requestHeaders.add(new AbfsHttpHeader(CONTENT_LENGTH, String.valueOf(buffer.length))); requestHeaders.add(new AbfsHttpHeader(CONTENT_TYPE, APPLICATION_XML)); requestHeaders.add(new AbfsHttpHeader(IF_MATCH, eTag)); - requestHeaders.add(new AbfsHttpHeader(X_MS_BLOB_CONTENT_MD5, blobMd5)); + if (spiedClient.isFullBlobChecksumValidationEnabled()) { Review Comment: Can be simplified by having a single statement for adding blobMd5 to request header. blobMd5 value can be computed using a trilean operator. Similar can be used in production code to make sure one of the way is used to compute Md5 and variable is never null. ########## hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/services/AbfsBlobClient.java: ########## @@ -1103,8 +1107,21 @@ public AbfsRestOperation flush(byte[] buffer, AbfsRestOperation op1 = getPathStatus(path, true, tracingContext, contextEncryptionAdapter); String metadataMd5 = op1.getResult().getResponseHeader(CONTENT_MD5); - if (blobMd5 != null && !blobMd5.equals(metadataMd5)) { - throw ex; + /* + * Validate the response by comparing the server's MD5 metadata against either: + * 1. The full blob content MD5 (if full blob checksum validation is enabled), or + * 2. The full block ID buffer MD5 (fallback if blob checksum validation is disabled) + */ + if (getAbfsConfiguration().isFullBlobChecksumValidationEnabled() && blobMd5 != null) { Review Comment: We can simply call the new method introduced in client here as well `isFullBlobChecksumValidationEnabled` ########## hadoop-tools/hadoop-azure/src/test/java/org/apache/hadoop/fs/azurebfs/services/ITestAbfsOutputStream.java: ########## @@ -481,6 +481,8 @@ public void testResetCalledOnExceptionInRemoteFlush() throws Exception { //expected exception } // Verify that reset was called on the message digest - Mockito.verify(mockMessageDigest, Mockito.times(1)).reset(); + if (spiedClient.isChecksumValidationEnabled()) { + Mockito.verify(mockMessageDigest, Mockito.times(1)).reset(); Review Comment: Nit: assert message. ########## hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/services/AbfsBlobClient.java: ########## @@ -1073,11 +1073,15 @@ public AbfsRestOperation flush(byte[] buffer, requestHeaders.add(new AbfsHttpHeader(CONTENT_LENGTH, String.valueOf(buffer.length))); requestHeaders.add(new AbfsHttpHeader(CONTENT_TYPE, APPLICATION_XML)); requestHeaders.add(new AbfsHttpHeader(IF_MATCH, eTag)); + String md5Hash = null; if (leaseId != null) { requestHeaders.add(new AbfsHttpHeader(X_MS_LEASE_ID, leaseId)); } - if (blobMd5 != null) { + if (isFullBlobChecksumValidationEnabled() && blobMd5 != null) { requestHeaders.add(new AbfsHttpHeader(X_MS_BLOB_CONTENT_MD5, blobMd5)); + } else { Review Comment: Is there a config even to avoid this checksum computation? ########## hadoop-tools/hadoop-azure/src/test/java/org/apache/hadoop/fs/azurebfs/ITestWasbAbfsCompatibility.java: ########## @@ -113,7 +114,10 @@ public void testReadFile() throws Exception { boolean[] createFileWithAbfs = new boolean[]{false, true, false, true}; boolean[] readFileWithAbfs = new boolean[]{false, true, true, false}; - AzureBlobFileSystem abfs = getFileSystem(); + Configuration conf = getRawConfiguration(); + conf.setBoolean(FS_AZURE_ABFS_ENABLE_FULL_BLOB_CHECKSUM_VALIDATION, true); + FileSystem fileSystem = FileSystem.newInstance(conf); Review Comment: When creating new instance, we need to close it. Can you check for all the tests in this class and make sure any new instance sis getting closed within test itself? ########## hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/services/AbfsOutputStream.java: ########## @@ -544,7 +547,12 @@ private void uploadBlockAsync(AbfsBlock blockToUpload, outputStreamStatistics.bytesToUpload(bytesLength); outputStreamStatistics.writeCurrentBuffer(); DataBlocks.BlockUploadData blockUploadData = blockToUpload.startUpload(); - String md5Hash = getMd5(); + String md5Hash; + if (getClient().getAbfsConfiguration().getIsChecksumValidationEnabled()) { Review Comment: `isChecksumValidationEnabled()` can be used here? ########## hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/services/AbfsBlobClient.java: ########## @@ -1103,8 +1107,21 @@ public AbfsRestOperation flush(byte[] buffer, AbfsRestOperation op1 = getPathStatus(path, true, tracingContext, contextEncryptionAdapter); String metadataMd5 = op1.getResult().getResponseHeader(CONTENT_MD5); - if (blobMd5 != null && !blobMd5.equals(metadataMd5)) { - throw ex; + /* + * Validate the response by comparing the server's MD5 metadata against either: + * 1. The full blob content MD5 (if full blob checksum validation is enabled), or + * 2. The full block ID buffer MD5 (fallback if blob checksum validation is disabled) + */ + if (getAbfsConfiguration().isFullBlobChecksumValidationEnabled() && blobMd5 != null) { + // Full blob content MD5 mismatch — integrity check failed + if (!blobMd5.equals(metadataMd5)) { Review Comment: `blobMd5` and `md5Hash` are the 2 variables used here for holding the md5 hash of data. For a single flush case there will either be a valid blobMd5 or a valid md5Hash. Can we not merge them into a single variable and use that everywher? Can there be a case where both are null?? If not we can simply use same variable for them and avoid null checks and any chances of mixmathching variables as they 2 sounds very similar. > ABFS: Fix WASB ABFS compatibility issues > ---------------------------------------- > > Key: HADOOP-19604 > URL: https://issues.apache.org/jira/browse/HADOOP-19604 > Project: Hadoop Common > Issue Type: Sub-task > Affects Versions: 3.4.1 > Reporter: Anmol Asrani > Assignee: Anmol Asrani > Priority: Major > Labels: pull-request-available > Fix For: 3.4.1 > > > Fix WASB ABFS compatibility issues. Fix issues such as:- > # BlockId computation to be consistent across clients for PutBlock and > PutBlockList > # Restrict url encoding of certain json metadata during setXAttr calls. > # Maintain the md5 hash of whole block to validate data integrity during > flush. -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org