[
https://issues.apache.org/jira/browse/HDFS-17156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17755316#comment-17755316
]
ASF GitHub Bot commented on HDFS-17156:
---------------------------------------
chunyiyang commented on PR #5951:
URL: https://github.com/apache/hadoop/pull/5951#issuecomment-1681491223
@tasanuma Thank you for the review and approval!
@simbadzina Thanks for your comments. I cherry-picked your unit test and it
worked well in my local environment. Thanks!
Below are the test results from my local environment:
with the fix:
```
mvn test -Dtest=TestIPC.java
[INFO] -------------------------------------------------------
[INFO] T E S T S
[INFO] -------------------------------------------------------
[INFO] Running org.apache.hadoop.ipc.TestIPC
[WARNING] Tests run: 46, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 112.083 s - in org.apache.hadoop.ipc.TestIPC
[INFO]
[INFO] Results:
[INFO]
[WARNING] Tests run: 46, Failures: 0, Errors: 0, Skipped: 1
```
without the fix: failed as expected
```
mvn test -Dtest=TestIPC.java
[INFO] Running org.apache.hadoop.ipc.TestIPC
[ERROR] Tests run: 46, Failures: 1, Errors: 0, Skipped: 1, Time elapsed: 113.739 s <<< FAILURE! - in org.apache.hadoop.ipc.TestIPC
[ERROR] testReceiveStateBeforeCallerNotification(org.apache.hadoop.ipc.TestIPC) Time elapsed: 0.181 s <<< FAILURE!
java.lang.AssertionError
at org.junit.Assert.fail(Assert.java:87)
at org.junit.Assert.assertTrue(Assert.java:42)
at org.junit.Assert.assertTrue(Assert.java:53)
at org.apache.hadoop.ipc.TestIPC.testReceiveStateBeforeCallerNotification(TestIPC.java:1365)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:299)
at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:293)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.lang.Thread.run(Thread.java:829)
```
> mapreduce job encounters java.io.IOException
> --------------------------------------------
>
> Key: HDFS-17156
> URL: https://issues.apache.org/jira/browse/HDFS-17156
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: rbf
> Reporter: Chunyi Yang
> Assignee: Chunyi Yang
> Priority: Minor
> Labels: Observer, RBF, pull-request-available
>
> While executing a mapreduce job in an environment utilizing Router-Based
> Federation with Observer read enabled, there is an estimated 1% chance of
> encountering the following error.
> {code:java}
> "java.io.IOException: Resource
> hdfs://XXXX/user/XXXX/.staging/job_XXXXXX/.tez/application_XXXXXX/tez-conf.pb
> changed on src filesystem - expected: \"2023-07-07T12:41:16.801+0900\", was:
> \"2023-07-07T12:41:16.822+0900\", current time:
> \"2023-07-07T12:41:22.386+0900\"",
> {code}
> This error occurs in the verifyAndCopy function in FSDownload.java when the
> NodeManager tries to download a file right after the file has been written to
> HDFS. The write operation runs on the active namenode and the read operation
> runs on the observer namenode, as expected.
> The edits file and hdfs-audit files indicate that the expected timestamp
> mentioned in the error message aligns with the OP_CLOSE MTIME of the
> 'tez-conf.pb' file (which is correct). However, the actual timestamp
> retrieved from the read operation corresponds to the OP_ADD MTIME of the
> target 'tez-conf.pb' file (which is incorrect). This inconsistency suggests
> that the observer namenode responds to the client before its edits file is
> updated with the latest stateId.
> Further troubleshooting has revealed that during write operations, the router
> responds to the client before receiving the latest stateId from the active
> namenode. Consequently, the outdated stateId is then used in the subsequent
> read operation on the observer namenode, leading to inaccuracies in the
> information provided by the observer namenode.
> To resolve this issue, it is essential to ensure that the router sends a
> response to the client only after receiving the latest stateId from the
> active namenode.
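
For readers following the description above, here is a minimal sketch of the ordering problem. It is not the actual change in the PR: the class and method names (StateIdOrderingSketch, AlignmentState, Call, respondBuggy, respondFixed) are hypothetical, and the real IPC client is multi-threaded, whereas this sketch is single-threaded and only shows why the order of the two steps matters.

```java
import java.util.concurrent.atomic.AtomicLong;

public class StateIdOrderingSketch {

    /** Hypothetical stand-in for the record of the last state ID seen from the active namenode. */
    static final class AlignmentState {
        private final AtomicLong lastSeenStateId = new AtomicLong(0);

        /** Records a state ID from a response; the stored value never moves backwards. */
        void receive(long stateId) {
            lastSeenStateId.accumulateAndGet(stateId, Math::max);
        }

        long get() {
            return lastSeenStateId.get();
        }
    }

    /** Hypothetical pending call; complete() stands in for waking up the waiting caller. */
    static final class Call {
        private final Runnable onComplete;

        Call(Runnable onComplete) {
            this.onComplete = onComplete;
        }

        void complete() {
            onComplete.run();
        }
    }

    /** Problematic order: the caller is notified before the state ID is recorded. */
    static long respondBuggy(AlignmentState state, long responseStateId) {
        long[] seenByCaller = new long[1];
        Call call = new Call(() -> seenByCaller[0] = state.get());
        call.complete();                 // caller proceeds with a stale state ID
        state.receive(responseStateId);  // too late for the follow-up read
        return seenByCaller[0];
    }

    /** Intended order: the state ID is recorded first, then the caller is notified. */
    static long respondFixed(AlignmentState state, long responseStateId) {
        long[] seenByCaller = new long[1];
        Call call = new Call(() -> seenByCaller[0] = state.get());
        state.receive(responseStateId);  // record the latest state ID first
        call.complete();                 // caller now sees the updated state ID
        return seenByCaller[0];
    }

    public static void main(String[] args) {
        System.out.println("buggy order, caller sees stateId "
            + respondBuggy(new AlignmentState(), 42L));
        System.out.println("fixed order, caller sees stateId "
            + respondFixed(new AlignmentState(), 42L));
    }
}
```

In the problematic order the caller reads the old state ID (0 in this sketch), which is the value that would then be sent with the subsequent read to the observer namenode. In the intended order the caller reads the state ID carried by the response (42 here).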