[ 
https://issues.apache.org/jira/browse/HADOOP-18847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chunyi Yang updated HADOOP-18847:
---------------------------------
    Summary: mapreduce job encounters java.io.IOException when observer read is 
enabled  (was: mapreduce job encounters java.io.IOException when 
dfs.client.rbf.observer.read.enable is true)

> mapreduce job encounters java.io.IOException when observer read is enabled
> --------------------------------------------------------------------------
>
>                 Key: HADOOP-18847
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18847
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: common
>            Reporter: Chunyi Yang
>            Priority: Minor
>
> While executing a mapreduce job in an environment that uses Router-Based 
> Federation with observer read enabled, the job fails with the following 
> error roughly 1% of the time.
> {code:java}
> "java.io.IOException: Resource 
> hdfs://XXXX/user/XXXX/.staging/job_XXXXXX/.tez/application_XXXXXX/tez-conf.pb 
> changed on src filesystem - expected: \"2023-07-07T12:41:16.801+0900\", was: 
> \"2023-07-07T12:41:16.822+0900\", current time: 
> \"2023-07-07T12:41:22.386+0900\"",
> {code}
> This error is raised in verifyAndCopy in FSDownload.java when the 
> nodemanager tries to download a file right after the file has been written 
> to HDFS. As expected, the write operation runs on the active namenode and 
> the read operation runs on the observer namenode.
> The edits file and hdfs-audit files indicate that the timestamp mentioned in 
> the error message aligns with the OP_CLOSE MTIME of the 'tez-conf.pb' file 
> (which is accurate). However, the actual timestamp retrieved from the read 
> operation corresponds to the OP_ADD MTIME of the target 'tez-conf.pb' file 
> (which is incorrect). This inconsistency suggests that the observer namenode 
> responds to the client before its edits file is updated with the latest 
> stateId.
> Further troubleshooting has revealed that during write operations, the router 
> responds to the client before receiving the latest stateId from the active 
> namenode. Consequently, the outdated stateId is then used in the subsequent 
> read operation on the observer namenode, leading to inaccuracies in the 
> information provided by the observer namenode.
> To resolve this issue, it is essential to ensure that the router sends a 
> response to the client only after receiving the latest stateId from the 
> active namenode.
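
The ordering problem described above can be illustrated with a small toy 
model. This is only a sketch under stated assumptions, not Hadoop code: the 
class ObserverReadSketch, its Edit and Observer types, and the transaction 
ids 100 and 101 are all hypothetical, and the mtimes 801 and 822 are just 
the millisecond parts of the two timestamps in the quoted error. A lagging 
observer is modeled as an exception; how a real observer namenode handles 
a read it cannot serve yet is not modeled here.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of an observer read; none of these types are Hadoop APIs. */
public class ObserverReadSketch {

    /** One edit-log entry: its transaction id and the mtime it sets. */
    static final class Edit {
        final long txId;
        final long mtime;

        Edit(long txId, long mtime) {
            this.txId = txId;
            this.mtime = mtime;
        }
    }

    /** Observer that has applied only a prefix of the active namenode's edits. */
    static final class Observer {
        private final List<Edit> applied = new ArrayList<>();

        void apply(Edit edit) {
            applied.add(edit);
        }

        long lastAppliedTxId() {
            return applied.isEmpty() ? 0L : applied.get(applied.size() - 1).txId;
        }

        /**
         * Returns the file mtime if the observer has caught up to the
         * client's state id. A lagging observer is modeled as an exception.
         */
        long readMtime(long clientStateId) {
            if (applied.isEmpty() || lastAppliedTxId() < clientStateId) {
                throw new IllegalStateException(
                        "observer is behind state id " + clientStateId);
            }
            return applied.get(applied.size() - 1).mtime;
        }
    }

    /** Same-timestamp check, analogous to the one the IOException reports. */
    static boolean unchanged(long expectedMtime, long actualMtime) {
        return expectedMtime == actualMtime;
    }

    public static void main(String[] args) {
        Edit opAdd = new Edit(100L, 801L);   // hypothetical txId, OP_ADD mtime
        Edit opClose = new Edit(101L, 822L); // hypothetical txId, OP_CLOSE mtime

        Observer observer = new Observer();
        observer.apply(opAdd);               // OP_CLOSE has not been tailed yet

        // Stale state id (100): the observer answers with the OP_ADD mtime.
        long expected = observer.readMtime(100L);

        observer.apply(opClose);             // the observer catches up later
        long actual = observer.readMtime(101L);
        System.out.println("unchanged = " + unchanged(expected, actual));

        // Fresh state id (101) against a lagging observer: the read is refused.
        Observer lagging = new Observer();
        lagging.apply(opAdd);
        try {
            lagging.readMtime(101L);
        } catch (IllegalStateException e) {
            System.out.println("refused: " + e.getMessage());
        }
    }
}
```

Run as written, the sketch prints "unchanged = false" for the stale state id 
and "refused: observer is behind state id 101" for the fresh one. That is 
the point of the proposed fix: if the client always carries the latest 
stateId, the observer cannot answer from a state that predates OP_CLOSE.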



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
