[ https://issues.apache.org/jira/browse/HADOOP-19604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18012602#comment-18012602 ]

ASF GitHub Bot commented on HADOOP-19604:
-----------------------------------------

anujmodi2021 commented on code in PR #7853:
URL: https://github.com/apache/hadoop/pull/7853#discussion_r2260558410


##########
hadoop-tools/hadoop-azure/src/test/java/org/apache/hadoop/fs/azurebfs/services/ITestAbfsOutputStream.java:
##########
@@ -481,6 +481,8 @@ public void testResetCalledOnExceptionInRemoteFlush() throws Exception {
       //expected exception
     }
     // Verify that reset was called on the message digest

Review Comment:
   Can we add a similar test where, with the config disabled, we assert that the methods used to compute the MD5 hash were not called at all?
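
   Something along these lines, maybe (sketch only; assumes the same mocked MessageDigest wiring as the existing test, with FS_AZURE_ABFS_ENABLE_CHECKSUM_VALIDATION turned off):
   ```java
   // Hypothetical disabled-config variant of the test above.
   Configuration conf = getRawConfiguration();
   conf.setBoolean(FS_AZURE_ABFS_ENABLE_CHECKSUM_VALIDATION, false);
   try (FileSystem fs = FileSystem.newInstance(conf)) {
     // ... trigger the same remote flush failure as in the existing test ...
     // With validation disabled, the digest should never be touched.
     Mockito.verify(mockMessageDigest, Mockito.never())
         .update(ArgumentMatchers.any(byte[].class));
     Mockito.verify(mockMessageDigest, Mockito.never()).digest();
     Mockito.verify(mockMessageDigest, Mockito.never()).reset();
   }
   ```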



##########
hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/AbfsConfiguration.java:
##########
@@ -438,6 +438,10 @@ public class AbfsConfiguration{
      FS_AZURE_ABFS_ENABLE_CHECKSUM_VALIDATION, DefaultValue = DEFAULT_ENABLE_ABFS_CHECKSUM_VALIDATION)
   private boolean isChecksumValidationEnabled;
 
+  @BooleanConfigurationValidatorAnnotation(ConfigurationKey =
+      FS_AZURE_ABFS_ENABLE_FULL_BLOB_CHECKSUM_VALIDATION, DefaultValue = DEFAULT_ENABLE_FULL_BLOB_ABFS_CHECKSUM_VALIDATION)

Review Comment:
   Do we need "ABFS" in the configuration key variable name? We always have AZURE alone.



##########
hadoop-tools/hadoop-azure/src/test/java/org/apache/hadoop/fs/azurebfs/services/AbfsClientTestUtil.java:
##########
@@ -166,7 +166,12 @@ public static void setMockAbfsRestOperationForFlushOperation(
     requestHeaders.add(new AbfsHttpHeader(CONTENT_LENGTH, String.valueOf(buffer.length)));
     requestHeaders.add(new AbfsHttpHeader(CONTENT_TYPE, APPLICATION_XML));
     requestHeaders.add(new AbfsHttpHeader(IF_MATCH, eTag));
-    requestHeaders.add(new AbfsHttpHeader(X_MS_BLOB_CONTENT_MD5, blobMd5));
+    if (spiedClient.isFullBlobChecksumValidationEnabled()) {

Review Comment:
   Can be simplified by having a single statement that adds blobMd5 to the request headers. The blobMd5 value can be computed with a ternary operator.
   
   The same pattern can be used in the production code to make sure exactly one of the two ways is used to compute the MD5 and the variable is never null.
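
   Roughly (sketch; `computeBlockBufferMd5()` is just a hypothetical placeholder for whatever the fallback computation is when full blob validation is off):
   ```java
   // Sketch: choose the MD5 source once so the header is added in a single statement.
   String effectiveMd5 = spiedClient.isFullBlobChecksumValidationEnabled()
       ? blobMd5
       : computeBlockBufferMd5();  // hypothetical fallback helper
   requestHeaders.add(new AbfsHttpHeader(X_MS_BLOB_CONTENT_MD5, effectiveMd5));
   ```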



##########
hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/services/AbfsBlobClient.java:
##########
@@ -1103,8 +1107,21 @@ public AbfsRestOperation flush(byte[] buffer,
         AbfsRestOperation op1 = getPathStatus(path, true, tracingContext,
             contextEncryptionAdapter);
         String metadataMd5 = op1.getResult().getResponseHeader(CONTENT_MD5);
-        if (blobMd5 != null && !blobMd5.equals(metadataMd5)) {
-          throw ex;
+        /*
+         * Validate the response by comparing the server's MD5 metadata against either:
+         * 1. The full blob content MD5 (if full blob checksum validation is enabled), or
+         * 2. The full block ID buffer MD5 (fallback if blob checksum validation is disabled)
+         */
+        if (getAbfsConfiguration().isFullBlobChecksumValidationEnabled() && blobMd5 != null) {

Review Comment:
   We can simply call the new method introduced in the client, `isFullBlobChecksumValidationEnabled()`, here as well.
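
   i.e., roughly the same condition, just via the client helper (sketch):
   ```java
   // Sketch: reuse the client-level helper instead of going through the configuration.
   if (isFullBlobChecksumValidationEnabled() && blobMd5 != null) {
     // Full blob content MD5 mismatch: integrity check failed.
     if (!blobMd5.equals(metadataMd5)) {
       throw ex;
     }
   }
   ```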



##########
hadoop-tools/hadoop-azure/src/test/java/org/apache/hadoop/fs/azurebfs/services/ITestAbfsOutputStream.java:
##########
@@ -481,6 +481,8 @@ public void testResetCalledOnExceptionInRemoteFlush() throws Exception {
       //expected exception
     }
     // Verify that reset was called on the message digest
-    Mockito.verify(mockMessageDigest, Mockito.times(1)).reset();
+    if (spiedClient.isChecksumValidationEnabled()) {
+      Mockito.verify(mockMessageDigest, Mockito.times(1)).reset();

Review Comment:
   Nit: add an assertion message.



##########
hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/services/AbfsBlobClient.java:
##########
@@ -1073,11 +1073,15 @@ public AbfsRestOperation flush(byte[] buffer,
     requestHeaders.add(new AbfsHttpHeader(CONTENT_LENGTH, String.valueOf(buffer.length)));
     requestHeaders.add(new AbfsHttpHeader(CONTENT_TYPE, APPLICATION_XML));
     requestHeaders.add(new AbfsHttpHeader(IF_MATCH, eTag));
+    String md5Hash = null;
     if (leaseId != null) {
       requestHeaders.add(new AbfsHttpHeader(X_MS_LEASE_ID, leaseId));
     }
-    if (blobMd5 != null) {
+    if (isFullBlobChecksumValidationEnabled() && blobMd5 != null) {
       requestHeaders.add(new AbfsHttpHeader(X_MS_BLOB_CONTENT_MD5, blobMd5));
+    } else {

Review Comment:
   Is there a config to avoid even this checksum computation?



##########
hadoop-tools/hadoop-azure/src/test/java/org/apache/hadoop/fs/azurebfs/ITestWasbAbfsCompatibility.java:
##########
@@ -113,7 +114,10 @@ public void testReadFile() throws Exception {
     boolean[] createFileWithAbfs = new boolean[]{false, true, false, true};
     boolean[] readFileWithAbfs = new boolean[]{false, true, true, false};
 
-    AzureBlobFileSystem abfs = getFileSystem();
+    Configuration conf = getRawConfiguration();
+    conf.setBoolean(FS_AZURE_ABFS_ENABLE_FULL_BLOB_CHECKSUM_VALIDATION, true);
+    FileSystem fileSystem = FileSystem.newInstance(conf);

Review Comment:
   When creating a new instance, we need to close it. Can you check all the tests in this class and make sure any new instance is getting closed within the test itself?
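
   For example, try-with-resources keeps the instance scoped to the test (sketch based on the lines above):
   ```java
   // Sketch: scope the new FileSystem instance so it is always closed within the test.
   Configuration conf = getRawConfiguration();
   conf.setBoolean(FS_AZURE_ABFS_ENABLE_FULL_BLOB_CHECKSUM_VALIDATION, true);
   try (FileSystem fileSystem = FileSystem.newInstance(conf)) {
     // ... rest of the test body using fileSystem ...
   }
   ```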



##########
hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/services/AbfsOutputStream.java:
##########
@@ -544,7 +547,12 @@ private void uploadBlockAsync(AbfsBlock blockToUpload,
     outputStreamStatistics.bytesToUpload(bytesLength);
     outputStreamStatistics.writeCurrentBuffer();
     DataBlocks.BlockUploadData blockUploadData = blockToUpload.startUpload();
-    String md5Hash = getMd5();
+    String md5Hash;
+    if (getClient().getAbfsConfiguration().getIsChecksumValidationEnabled()) {

Review Comment:
   `isChecksumValidationEnabled()` can be used here?
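
   i.e., maybe something like (sketch; assumes leaving `md5Hash` null is acceptable on the disabled path):
   ```java
   // Sketch: call the client-level helper and collapse the branches with a ternary.
   String md5Hash = getClient().isChecksumValidationEnabled() ? getMd5() : null;
   ```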



##########
hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/services/AbfsBlobClient.java:
##########
@@ -1103,8 +1107,21 @@ public AbfsRestOperation flush(byte[] buffer,
         AbfsRestOperation op1 = getPathStatus(path, true, tracingContext,
             contextEncryptionAdapter);
         String metadataMd5 = op1.getResult().getResponseHeader(CONTENT_MD5);
-        if (blobMd5 != null && !blobMd5.equals(metadataMd5)) {
-          throw ex;
+        /*
+         * Validate the response by comparing the server's MD5 metadata against either:
+         * 1. The full blob content MD5 (if full blob checksum validation is enabled), or
+         * 2. The full block ID buffer MD5 (fallback if blob checksum validation is disabled)
+         */
+        if (getAbfsConfiguration().isFullBlobChecksumValidationEnabled() && blobMd5 != null) {
+          // Full blob content MD5 mismatch — integrity check failed
+          if (!blobMd5.equals(metadataMd5)) {

Review Comment:
   `blobMd5` and `md5Hash` are the two variables used here to hold the MD5 hash of the data. For a single flush there will be either a valid `blobMd5` or a valid `md5Hash`. Can we not merge them into a single variable and use that everywhere? Can there be a case where both are null?
   
   If not, we can simply use the same variable for both and avoid the null checks and any chance of mismatching the variables, as the two sound very similar.
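
   Roughly what I have in mind (sketch; the selection mirrors the config check above):
   ```java
   // Sketch: pick one effective MD5 up front and use it for both the request
   // header and the post-failure comparison against the blob metadata.
   String effectiveMd5 = getAbfsConfiguration().isFullBlobChecksumValidationEnabled()
       ? blobMd5   // full blob content MD5
       : md5Hash;  // full block id buffer MD5 (fallback)
   if (effectiveMd5 != null && !effectiveMd5.equals(metadataMd5)) {
     throw ex;
   }
   ```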
   





> ABFS: Fix WASB ABFS compatibility issues
> ----------------------------------------
>
>                 Key: HADOOP-19604
>                 URL: https://issues.apache.org/jira/browse/HADOOP-19604
>             Project: Hadoop Common
>          Issue Type: Sub-task
>    Affects Versions: 3.4.1
>            Reporter: Anmol Asrani
>            Assignee: Anmol Asrani
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.4.1
>
>
> Fix WASB ABFS compatibility issues, such as:
>  # Make BlockId computation consistent across clients for PutBlock and PutBlockList.
>  # Restrict URL encoding of certain JSON metadata during setXAttr calls.
>  # Maintain the MD5 hash of the whole block to validate data integrity during flush.


