[jira] [Commented] (HDFS-17156) mapreduce job encounters java.io.IOException

2023-08-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17755631#comment-17755631
 ] 

ASF GitHub Bot commented on HDFS-17156:
---

tasanuma commented on PR #5951:
URL: https://github.com/apache/hadoop/pull/5951#issuecomment-1682643278

   Merged. Thanks for your contribution, @chunyiyang! Thanks for your review 
and providing the unit test, @simbadzina!




> mapreduce job encounters java.io.IOException
> ---------------------------------------------
>
>                 Key: HDFS-17156
>                 URL: https://issues.apache.org/jira/browse/HDFS-17156
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: rbf
>            Reporter: Chunyi Yang
>            Assignee: Chunyi Yang
>            Priority: Minor
>              Labels: Observer, RBF, pull-request-available
>
> While executing a mapreduce job in an environment utilizing Router-Based 
> Federation with Observer read enabled, there is an estimated 1% chance of 
> encountering the following error.
> {code:java}
> "java.io.IOException: Resource 
> hdfs:///user//.staging/job_XX/.tez/application_XX/tez-conf.pb 
> changed on src filesystem - expected: \"2023-07-07T12:41:16.801+0900\", was: 
> \"2023-07-07T12:41:16.822+0900\", current time: 
> \"2023-07-07T12:41:22.386+0900\"",
> {code}
> This error happens in the verifyAndCopy function inside FSDownload.java when 
> the nodemanager tries to download a file right after the file has been 
> written to HDFS. The write operation runs on the active namenode and the 
> read operation runs on the observer namenode, as expected.
> The edits file and hdfs-audit files indicate that the expected timestamp 
> mentioned in the error message matches the OP_CLOSE MTIME of the 
> 'tez-conf.pb' file (which is correct). However, the actual timestamp 
> returned by the read operation corresponds to the OP_ADD MTIME of the same 
> 'tez-conf.pb' file (which is incorrect). This inconsistency suggests that 
> the observer namenode responds to the client before its edits have been 
> applied up to the latest stateId.
> Further troubleshooting revealed that during write operations the router 
> responds to the client before it has received the latest stateId from the 
> active namenode. The outdated stateId is then used in the subsequent read 
> operation on the observer namenode, so the observer namenode returns stale 
> information.
> To resolve this issue, the router must send its response to the client only 
> after receiving the latest stateId from the active namenode.
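
For context, the exception above comes from a straightforward timestamp
comparison during resource localization: the modification time recorded when
the resource was registered is compared against what the filesystem reports at
download time. The sketch below is a simplified, hypothetical illustration of
that check, not the actual YARN FSDownload source; the class, method, and
message format are stand-ins.

{code:java}
import java.io.IOException;

public class ResourceTimestampCheck {

  /**
   * Simplified stand-in for the verification performed before localizing a
   * resource: fail if the file's current modification time differs from the
   * timestamp recorded when the resource was registered (the OP_CLOSE MTIME).
   */
  public static void verifyUnchanged(String path, long expectedMtime,
      long actualMtime) throws IOException {
    if (actualMtime != expectedMtime) {
      throw new IOException("Resource " + path
          + " changed on src filesystem - expected: " + expectedMtime
          + ", was: " + actualMtime);
    }
  }

  public static void main(String[] args) {
    try {
      // A stale observer read can return the OP_ADD MTIME instead of the
      // OP_CLOSE MTIME, so the check fails even though the file is unchanged.
      verifyUnchanged("/user/app/tez-conf.pb", 1688701276801L, 1688701276822L);
    } catch (IOException e) {
      System.out.println(e.getMessage());
    }
  }
}
{code}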



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17156) mapreduce job encounters java.io.IOException

2023-08-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17755630#comment-17755630
 ] 

ASF GitHub Bot commented on HDFS-17156:
---

tasanuma merged PR #5951:
URL: https://github.com/apache/hadoop/pull/5951







[jira] [Commented] (HDFS-17156) mapreduce job encounters java.io.IOException

2023-08-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17755375#comment-17755375
 ] 

ASF GitHub Bot commented on HDFS-17156:
---

chunyiyang commented on PR #5951:
URL: https://github.com/apache/hadoop/pull/5951#issuecomment-1681667372

   This PR has passed all the checks, but it seems I don't have permission to 
merge. Could someone help merge this if possible? Thanks in advance!







[jira] [Commented] (HDFS-17156) mapreduce job encounters java.io.IOException

2023-08-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17755372#comment-17755372
 ] 

ASF GitHub Bot commented on HDFS-17156:
---

hadoop-yetus commented on PR #5951:
URL: https://github.com/apache/hadoop/pull/5951#issuecomment-1681656464

   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 51s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 1 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  49m 26s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |  18m 17s |  |  trunk passed with JDK 
Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04  |
   | +1 :green_heart: |  compile  |  16m 29s |  |  trunk passed with JDK 
Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05  |
   | +1 :green_heart: |  checkstyle  |   1m 11s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   1m 36s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m  9s |  |  trunk passed with JDK 
Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   0m 45s |  |  trunk passed with JDK 
Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05  |
   | +1 :green_heart: |  spotbugs  |   2m 34s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  39m 52s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 54s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  17m 19s |  |  the patch passed with JDK 
Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04  |
   | +1 :green_heart: |  javac  |  17m 19s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  16m 35s |  |  the patch passed with JDK 
Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05  |
   | +1 :green_heart: |  javac  |  16m 35s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | -0 :warning: |  checkstyle  |   1m  9s | 
[/results-checkstyle-hadoop-common-project_hadoop-common.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5951/3/artifact/out/results-checkstyle-hadoop-common-project_hadoop-common.txt)
 |  hadoop-common-project/hadoop-common: The patch generated 1 new + 127 
unchanged - 0 fixed = 128 total (was 127)  |
   | +1 :green_heart: |  mvnsite  |   1m 34s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   1m  3s |  |  the patch passed with JDK 
Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   0m 45s |  |  the patch passed with JDK 
Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05  |
   | +1 :green_heart: |  spotbugs  |   2m 44s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  39m 53s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |  18m 54s |  |  hadoop-common in the patch 
passed.  |
   | +1 :green_heart: |  asflicense  |   0m 56s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 236m 34s |  |  |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.43 ServerAPI=1.43 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5951/3/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/5951 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 3da3be0432d9 4.15.0-212-generic #223-Ubuntu SMP Tue May 23 
13:09:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 49cae565e0c1235e0bf279c776b2c38ba6c60a42 |
   | Default Java | Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
   | Multi-JDK versions | 
/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04 
/usr/lib/jvm/java-8-openjdk-amd64:Private 
Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5951/3/testReport/ |
   | Max. process+thread count | 1263 (vs. ulimit of 5500) |
   | modules | C: hadoop-common-project/hadoop-common U: 
hadoop-common-project/hadoop-common |
   | Console output | 

[jira] [Commented] (HDFS-17156) mapreduce job encounters java.io.IOException

2023-08-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17755318#comment-17755318
 ] 

ASF GitHub Bot commented on HDFS-17156:
---

simbadzina commented on PR #5951:
URL: https://github.com/apache/hadoop/pull/5951#issuecomment-1681496761

   > @tasanuma Thank you for the review and approval! @simbadzina Thanks for 
your comments. I cherry-picked your unit test and it worked well in my local 
environment. Thanks!
   > 
   > Below is the test result in my local environment: with the fix:
   > 
   > ```
   > mvn test -Dtest=TestIPC.java
   > [INFO] ---
   > [INFO]  T E S T S
   > [INFO] ---
   > [INFO] Running org.apache.hadoop.ipc.TestIPC
   > [WARNING] Tests run: 46, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 
112.083 s - in org.apache.hadoop.ipc.TestIPC
   > [INFO]
   > [INFO] Results:
   > [INFO]
   > [WARNING] Tests run: 46, Failures: 0, Errors: 0, Skipped: 1
   > ```
   > 
   > without the fix: failed as expected
   > 
   > ```
   > mvn test -Dtest=TestIPC.java
   > [INFO] Running org.apache.hadoop.ipc.TestIPC
   > [ERROR] Tests run: 46, Failures: 1, Errors: 0, Skipped: 1, Time elapsed: 
113.739 s <<< FAILURE! - in org.apache.hadoop.ipc.TestIPC
   > [ERROR] 
testReceiveStateBeforeCallerNotification(org.apache.hadoop.ipc.TestIPC)  Time 
elapsed: 0.181 s  <<< FAILURE!
   > java.lang.AssertionError
   >at org.junit.Assert.fail(Assert.java:87)
   >at org.junit.Assert.assertTrue(Assert.java:42)
   >at org.junit.Assert.assertTrue(Assert.java:53)
   >at 
org.apache.hadoop.ipc.TestIPC.testReceiveStateBeforeCallerNotification(TestIPC.java:1365)
   >at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   >at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   >at 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   >at java.base/java.lang.reflect.Method.invoke(Method.java:566)
   >at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
   >at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
   >at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
   >at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
   >at 
org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:299)
   >at 
org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:293)
   >at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
   >at java.base/java.lang.Thread.run(Thread.java:829)
   > ```
   
   For sure. Thanks for finding and fixing this bug.




[jira] [Commented] (HDFS-17156) mapreduce job encounters java.io.IOException

2023-08-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17755317#comment-17755317
 ] 

ASF GitHub Bot commented on HDFS-17156:
---

chunyiyang commented on code in PR #5951:
URL: https://github.com/apache/hadoop/pull/5951#discussion_r1296594414


##
hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/ipc/TestIPC.java:
##
@@ -1330,6 +1337,37 @@ public void run() {
   server.stop();
 }
   }
+
+  /**
+   * Verify that stateID is received into call before
+   * call is notified.

Review Comment:
   Fixed. Thanks!








[jira] [Commented] (HDFS-17156) mapreduce job encounters java.io.IOException

2023-08-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17755316#comment-17755316
 ] 

ASF GitHub Bot commented on HDFS-17156:
---

chunyiyang commented on PR #5951:
URL: https://github.com/apache/hadoop/pull/5951#issuecomment-1681491223

   @tasanuma Thank you for the review and approval!
   @simbadzina Thanks for your comments. I cherry-picked your unit test and it 
worked well in my local environment. Thanks!
   
   Below is the test result in my local environment:
   with the fix:
   ```
   mvn test -Dtest=TestIPC.java
   [INFO] ---
   [INFO]  T E S T S
   [INFO] ---
   [INFO] Running org.apache.hadoop.ipc.TestIPC
   [WARNING] Tests run: 46, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 
112.083 s - in org.apache.hadoop.ipc.TestIPC
   [INFO]
   [INFO] Results:
   [INFO]
   [WARNING] Tests run: 46, Failures: 0, Errors: 0, Skipped: 1
   ```
   without the fix: failed as expected
   ```
   mvn test -Dtest=TestIPC.java
   [INFO] Running org.apache.hadoop.ipc.TestIPC
   [ERROR] Tests run: 46, Failures: 1, Errors: 0, Skipped: 1, Time elapsed: 
113.739 s <<< FAILURE! - in org.apache.hadoop.ipc.TestIPC
   [ERROR] 
testReceiveStateBeforeCallerNotification(org.apache.hadoop.ipc.TestIPC)  Time 
elapsed: 0.181 s  <<< FAILURE!
   java.lang.AssertionError
at org.junit.Assert.fail(Assert.java:87)
at org.junit.Assert.assertTrue(Assert.java:42)
at org.junit.Assert.assertTrue(Assert.java:53)
at 
org.apache.hadoop.ipc.TestIPC.testReceiveStateBeforeCallerNotification(TestIPC.java:1365)
at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at 
org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:299)
at 
org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:293)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.lang.Thread.run(Thread.java:829)
   ```
   
   




[jira] [Commented] (HDFS-17156) mapreduce job encounters java.io.IOException

2023-08-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17755315#comment-17755315
 ] 

ASF GitHub Bot commented on HDFS-17156:
---

simbadzina commented on code in PR #5951:
URL: https://github.com/apache/hadoop/pull/5951#discussion_r1296591248


##
hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/ipc/TestIPC.java:
##
@@ -1330,6 +1337,37 @@ public void run() {
   server.stop();
 }
   }
+
+  /**
+   * Verify that stateID is received into call before
+   * call is notified.

Review Comment:
   Nit: Typo "caller" is notified.








[jira] [Commented] (HDFS-17156) mapreduce job encounters java.io.IOException

2023-08-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17755265#comment-17755265
 ] 

ASF GitHub Bot commented on HDFS-17156:
---

simbadzina commented on PR #5951:
URL: https://github.com/apache/hadoop/pull/5951#issuecomment-1681305358

   I've created a unit test here: 
https://github.com/simbadzina/hadoop/pull/1/commits/d216bfe8c5f506f08263cf490479a0398b097f05
   
   ```
 /**
  * Verify that stateID is received into call before
  * call is notified.
  * @throws IOException
  */
 @Test(timeout=6)
 public void testReceiveStateBeforeCallerNotification() throws IOException {
   AtomicBoolean stateReceived = new AtomicBoolean(false);
   AlignmentContext alignmentContext = Mockito.mock(AlignmentContext.class);
   Mockito.doAnswer((Answer) invocation -> {
     Thread.sleep(1000);
     stateReceived.set(true);
     return null;
   }).when(alignmentContext)
       .receiveResponseState(any(RpcHeaderProtos.RpcResponseHeaderProto.class));

   final Client client = Mockito.spy(new Client(LongWritable.class, conf));
   final TestServer server = new TestServer(1, false);

   try {
     InetSocketAddress addr = NetUtils.getConnectAddress(server);
     server.start();
     call(client, new LongWritable(RANDOM.nextLong()), addr,
         0, conf, alignmentContext);
     Assert.assertTrue(stateReceived.get());
   } finally {
     client.stop();
     server.stop();
   }
 }
   ```







[jira] [Commented] (HDFS-17156) mapreduce job encounters java.io.IOException

2023-08-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17754938#comment-17754938
 ] 

ASF GitHub Bot commented on HDFS-17156:
---

hadoop-yetus commented on PR #5951:
URL: https://github.com/apache/hadoop/pull/5951#issuecomment-1680234074

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 58s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  1s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | -1 :x: |  test4tests  |   0m  0s |  |  The patch doesn't appear to include 
any new or modified tests. Please justify why no new tests are needed for this 
patch. Also please list what manual steps were performed to verify this patch.  
|
    _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  50m 34s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |  17m 23s |  |  trunk passed with JDK 
Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04  |
   | +1 :green_heart: |  compile  |  16m 16s |  |  trunk passed with JDK 
Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05  |
   | +1 :green_heart: |  checkstyle  |   1m 23s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   1m 55s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m 36s |  |  trunk passed with JDK 
Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   0m 58s |  |  trunk passed with JDK 
Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05  |
   | +1 :green_heart: |  spotbugs  |   2m 43s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  36m 31s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 59s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  16m 21s |  |  the patch passed with JDK 
Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04  |
   | +1 :green_heart: |  javac  |  16m 21s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  16m 17s |  |  the patch passed with JDK 
Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05  |
   | +1 :green_heart: |  javac  |  16m 17s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   1m 20s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   1m 40s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   1m 17s |  |  the patch passed with JDK 
Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   1m  1s |  |  the patch passed with JDK 
Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05  |
   | +1 :green_heart: |  spotbugs  |   2m 46s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  36m 26s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |  19m 18s |  |  hadoop-common in the patch 
passed.  |
   | +1 :green_heart: |  asflicense  |   1m 13s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 232m 25s |  |  |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.43 ServerAPI=1.43 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5951/1/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/5951 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 84be26a3ecef 4.15.0-212-generic #223-Ubuntu SMP Tue May 23 
13:09:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / c95258cbef20f04bb20ded485ff4c38dbcaf1d47 |
   | Default Java | Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
   | Multi-JDK versions | 
/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04 
/usr/lib/jvm/java-8-openjdk-amd64:Private 
Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5951/1/testReport/ |
   | Max. process+thread count | 3149 (vs. ulimit of 5500) |
   | modules | C: hadoop-common-project/hadoop-common U: 
hadoop-common-project/hadoop-common |
   | Console output | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5951/1/console |
   | versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
   | Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |
   
   
  

[jira] [Commented] (HDFS-17156) mapreduce job encounters java.io.IOException

2023-08-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17754860#comment-17754860
 ] 

ASF GitHub Bot commented on HDFS-17156:
---

chunyiyang opened a new pull request, #5951:
URL: https://github.com/apache/hadoop/pull/5951

   
   
   ### Description of PR
   By moving `call.setRpcResponse` after 
`call.alignmentContext.receiveResponseState(header)`, the router is guaranteed 
to send a response to the client only after it has received the latest stateId 
from the active namenode, which fixes the issue.
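   For illustration, below is a minimal, self-contained sketch of that 
ordering. The class and field names are stand-ins, not the actual 
`org.apache.hadoop.ipc.Client` internals; only the idea is the same: record 
the stateId carried in the response header before notifying the waiting caller.
   ```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicLong;

class AlignmentContextSketch {
  final AtomicLong lastSeenStateId = new AtomicLong(-1);

  void receiveResponseState(long stateIdFromHeader) {
    lastSeenStateId.set(stateIdFromHeader);
  }
}

class CallSketch {
  final AlignmentContextSketch alignmentContext;
  final CountDownLatch done = new CountDownLatch(1);
  volatile Object rpcResponse;

  CallSketch(AlignmentContextSketch ctx) {
    this.alignmentContext = ctx;
  }

  // Notifies the thread blocked on this call.
  void setRpcResponse(Object value) {
    rpcResponse = value;
    done.countDown();
  }
}

public class ReceiveResponseOrdering {
  // Fixed ordering: update the alignment context from the response header
  // *before* waking the caller, so a follow-up observer read carries the
  // latest stateId. Previously the caller was notified first, causing the race.
  static void onResponse(CallSketch call, long stateIdFromHeader, Object value) {
    if (call.alignmentContext != null) {
      call.alignmentContext.receiveResponseState(stateIdFromHeader);
    }
    call.setRpcResponse(value);
  }

  public static void main(String[] args) throws InterruptedException {
    AlignmentContextSketch ctx = new AlignmentContextSketch();
    CallSketch call = new CallSketch(ctx);
    onResponse(call, 42L, "ok");
    call.done.await();
    System.out.println("stateId seen by caller: " + ctx.lastSeenStateId.get());
  }
}
   ```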
   
   ### How was this patch tested?
   Before this patch, the mapreduce job failed in roughly 1 out of 100 runs. 
After deploying this patch, it did not fail in a total of 2000 mapreduce jobs.
   
   ### For code changes:
   - [x] Does the title or this PR starts with the corresponding JIRA issue id 
(e.g. 'HADOOP-17799. Your PR title ...')?
   
   







[jira] [Commented] (HDFS-17156) mapreduce job encounters java.io.IOException when dfs.client.rbf.observer.read.enable is true

2023-08-14 Thread Takanobu Asanuma (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17754126#comment-17754126
 ] 

Takanobu Asanuma commented on HDFS-17156:
-

[~chunyiyang] I added you to the contributor role and assigned you to this 
jira. You can assign yourself next time. Thanks!




[jira] [Commented] (HDFS-17156) mapreduce job encounters java.io.IOException when dfs.client.rbf.observer.read.enable is true

2023-08-13 Thread Chunyi Yang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17753871#comment-17753871
 ] 

Chunyi Yang commented on HDFS-17156:


[~tasanuma] Thank you! I would like to work on this, but I couldn't find the 
'Assign to Me' option. Could you please advise where to apply for assignee 
permission? Thanks in advance.




[jira] [Commented] (HDFS-17156) mapreduce job encounters java.io.IOException when dfs.client.rbf.observer.read.enable is true

2023-08-11 Thread Takanobu Asanuma (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17753235#comment-17753235
 ] 

Takanobu Asanuma commented on HDFS-17156:
-

It looked like a bug in RBF SBN, so I moved this jira from HADOOP to HDFS.




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org