[ 
https://issues.apache.org/jira/browse/HDFS-17223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17794523#comment-17794523
 ] 

ASF GitHub Bot commented on HDFS-17223:
---------------------------------------

xinglin commented on PR #6183:
URL: https://github.com/apache/hadoop/pull/6183#issuecomment-1846575811

   My understanding is that this is similar to the issue I tried to fix 
in [HDFS-17030](https://issues.apache.org/jira/browse/HDFS-17030): when a JN is 
unresponsive (either down or hung), the starting NN still tries to connect to 
it, with retries. As a result, the NN waits for 
`ipc.client.connect.timeout` * `ipc.client.connect.max.retries.on.timeouts` 
when it cannot establish a socket to the journal node, or for 
`ipc.client.rpc-timeout.ms` when a socket is established but the journal node 
does not send back a response. 
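
   As a rough illustration of the two cases above, here is a minimal sketch 
that computes the wait for a single unresponsive JN. The function names and 
the sample values are assumptions chosen for illustration, not code or 
settings taken from Hadoop or from this cluster, and the real IPC client may 
differ from the simple product by one attempt.

```python
def connect_wait_ms(connect_timeout_ms: int, max_retries_on_timeouts: int) -> int:
    """Wait when no socket can be established.

    Mirrors the product described above:
    ipc.client.connect.timeout * ipc.client.connect.max.retries.on.timeouts
    """
    return connect_timeout_ms * max_retries_on_timeouts


def unresponsive_jn_wait_ms(
    socket_established: bool,
    connect_timeout_ms: int,
    max_retries_on_timeouts: int,
    rpc_timeout_ms: int,
) -> int:
    """Pick the wait that applies to one unresponsive journal node."""
    if socket_established:
        # Socket is up but the JN never answers: bounded by ipc.client.rpc-timeout.ms.
        return rpc_timeout_ms
    # JN is down or unreachable: bounded by connect timeout times retries.
    return connect_wait_ms(connect_timeout_ms, max_retries_on_timeouts)


if __name__ == "__main__":
    # Illustrative values only (assumptions); read the real ones from core-site.xml.
    down = unresponsive_jn_wait_ms(False, 20_000, 45, 120_000)
    hung = unresponsive_jn_wait_ms(True, 20_000, 45, 120_000)
    print(f"JN down: {down / 60_000:.1f} min, JN hung: {hung / 60_000:.1f} min")
```

   If the NN contacts the JN more than once during startup, this wait would 
be paid on each such call, which is one way a total in the tens of minutes 
could add up.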
   
   




> Add journalnode maintenance node list
> -------------------------------------
>
>                 Key: HDFS-17223
>                 URL: https://issues.apache.org/jira/browse/HDFS-17223
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: qjm
>    Affects Versions: 3.3.6
>            Reporter: kuper
>            Priority: Major
>              Labels: pull-request-available
>
> * When HDFS is configured with 3 journal nodes and only 2 are available 
> because 1 fails to start due to machine issues, namenode initialization 
> takes a long time (around 30-40 minutes, depending on the IPC timeout and 
> retry policy configuration). 
> * The failed journal node cannot recover immediately, but HDFS can still 
> function in this situation. We encountered this issue in our production 
> environment and had to reduce the IPC timeout and adjust the retry policy to 
> speed up namenode initialization and restore service. 
> * Would it be possible to have a journal node maintenance list, so that 
> namenode initialization is faster when it is known in advance that a journal 
> node cannot provide service?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
