Chunyi Yang created HADOOP-18847:
------------------------------------
Summary: mapreduce job encounters java.io.IOException when
dfs.client.rbf.observer.read.enable is true
Key: HADOOP-18847
URL: https://issues.apache.org/jira/browse/HADOOP-18847
Project: Hadoop Common
Issue Type: Bug
Components: common
Reporter: Chunyi Yang
When running a MapReduce job in an environment with Router-Based Federation and
Observer reads enabled, we hit the following error with a probability of
approximately 1%.
{code:java}
"java.io.IOException: Resource
hdfs://XXXX/user/XXXX/.staging/job_XXXXXX/.tez/application_XXXXXX/tez-conf.pb
changed on src filesystem - expected: \"2023-07-07T12:41:16.801+0900\", was:
\"2023-07-07T12:41:16.822+0900\", current time:
\"2023-07-07T12:41:22.386+0900\"",
{code}
This error is thrown by verifyAndCopy in FSDownload.java when a NodeManager
tries to download a file right after the file has been written to HDFS. The
write operation runs on the active NameNode and the read operation runs on the
observer NameNode, as expected.
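For context, the check that produces this message can be pictured as a plain comparison between the modification time recorded for the resource and the one read back at download time. The following is a minimal standalone sketch, not the actual FSDownload code: the class name TimestampCheckSketch, the method verifyUnchanged, the path, and the millisecond values are all hypothetical, and the message omits the timestamp formatting and the current-time field of the real one.

```java
import java.io.IOException;

// Hypothetical, Hadoop-free sketch of a "resource changed" timestamp check.
final class TimestampCheckSketch {

    // Throws if the mtime read at download time differs from the recorded one.
    static void verifyUnchanged(String path, long expectedMtime, long actualMtime)
            throws IOException {
        if (actualMtime != expectedMtime) {
            throw new IOException("Resource " + path
                    + " changed on src filesystem - expected: " + expectedMtime
                    + ", was: " + actualMtime);
        }
    }

    public static void main(String[] args) {
        // Assumed values: the two mtimes differ by a few milliseconds,
        // which is enough for the strict equality check to fail.
        long recorded = 1_000L;
        long readBack = 1_021L;
        try {
            verifyUnchanged("/staging/tez-conf.pb", recorded, readBack);
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Because the comparison is a strict equality, any read that returns an mtime from an earlier or later edit of the same file fails the download, even when the file content is otherwise intact.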
The edits file and hdfs-audit files show that the expected time in the error
message is the OP_CLOSE MTIME of the tez-conf.pb file (which is correct), while
the timestamp actually returned by the read operation is the OP_ADD MTIME of
the same tez-conf.pb file (which is wrong). This mismatch indicates that the
observer NameNode responds to the client before it has applied edits up to the
latest state ID, which causes the issue.
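The suspected race can be illustrated with a toy model in which an observer applies edits one transaction at a time and a read carries the state ID the client last saw. This is only a sketch under assumptions: ObserverReadSketch, ToyObserver, Edit, and both read methods are hypothetical names, the transaction IDs and mtimes are made up, and none of this is the HDFS or Router implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy model of a stale observer read; not HDFS code.
final class ObserverReadSketch {

    // One edit-log entry: a transaction id and the mtime it sets on a path.
    static final class Edit {
        final long txId;
        final String path;
        final long mtime;

        Edit(long txId, String path, long mtime) {
            this.txId = txId;
            this.path = path;
            this.mtime = mtime;
        }
    }

    static final class ToyObserver {
        private final Map<String, Long> mtimes = new HashMap<>();
        private long lastAppliedTxId = 0L;

        // Apply an edit tailed from the active node.
        void apply(Edit e) {
            mtimes.put(e.path, e.mtime);
            lastAppliedTxId = e.txId;
        }

        // Read without any state-ID check: may return a stale mtime.
        long getMtimeUnchecked(String path) {
            Long m = mtimes.get(path);
            if (m == null) {
                throw new IllegalArgumentException("no such path: " + path);
            }
            return m;
        }

        // Read that refuses to answer until the observer has caught up
        // to the state the client has already seen.
        long getMtimeAligned(String path, long clientSeenTxId) {
            if (lastAppliedTxId < clientSeenTxId) {
                throw new IllegalStateException("observer is behind: applied="
                        + lastAppliedTxId + ", client saw=" + clientSeenTxId);
            }
            return getMtimeUnchecked(path);
        }
    }

    public static void main(String[] args) {
        ToyObserver observer = new ToyObserver();
        // tx 1 stands in for OP_ADD, tx 2 (not yet applied) for OP_CLOSE.
        observer.apply(new Edit(1L, "/staging/tez-conf.pb", 100L));
        // The writer has already seen tx 2 on the active node.
        System.out.println(observer.getMtimeUnchecked("/staging/tez-conf.pb"));
        try {
            observer.getMtimeAligned("/staging/tez-conf.pb", 2L);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this model the unchecked read returns the mtime set by the first edit even though the client has already seen the second one, which matches the symptom described above; the aligned read rejects the call instead. Whether the state ID is lost or not enforced somewhere on the Router path is exactly what this issue would need to establish.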