[ 
https://issues.apache.org/jira/browse/HDFS-16912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17686241#comment-17686241
 ] 

Ayush Saxena commented on HDFS-16912:
-------------------------------------

It looks to me like a cluster configuration issue, given the 
UnknownHostException for the HDFS nameservice.

Does this happen with a specific DataNode? And if the DataNodes are shared 
across namespaces, is it specific to one namespace?

 

I would suggest checking the DataNode that is throwing this error: verify that 
it is heartbeating to the active NameNode and that its regular block reports 
are succeeding.

 

Second, you said you have two nameservices, ns1 and ns2; where is the ns1hdfs 
in the exception coming from?
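For reference, every logical nameservice a client resolves must be declared in 
its hdfs-site.xml; if ns1hdfs is not listed in dfs.nameservices (and is not a 
resolvable hostname), the client fails with exactly this UnknownHostException. 
A minimal client-side HA/federation sketch, with hypothetical NameNode 
hostnames:

{code:xml}
<!-- Minimal sketch; hostnames and ports are placeholders, not from the issue. -->
<property>
  <name>dfs.nameservices</name>
  <value>ns1,ns2</value>
</property>
<property>
  <name>dfs.ha.namenodes.ns1</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns1.nn1</name>
  <value>namenode1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns1.nn2</name>
  <value>namenode2.example.com:8020</value>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.ns1</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
{code}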



Unless this turns out to be a bug, I would suggest closing this and reaching 
out to the Hadoop user mailing list:

https://hadoop.apache.org/mailing_lists.html

 

> Block is COMMITTED but not COMPLETE(numNodes= 0 <  minimum = 1)
> ---------------------------------------------------------------
>
>                 Key: HDFS-16912
>                 URL: https://issues.apache.org/jira/browse/HDFS-16912
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: block placement
>    Affects Versions: 3.3.1
>         Environment: hadoop:3.3.1
>            Reporter: jiangchunyang
>            Priority: Major
>
> We use HDFS federation with two nameservices: ns1 and ns2. Table data is 
> written under dc-hdfs, but each database is assigned to a specific 
> nameservice according to business division.
> We use ParquetWriter to write data to each table's temporary staging 
> directory under its designated nameservice. When the writer is closed, an 
> exception is sometimes thrown, which triggers our file-lease recovery; the 
> lease recovery then fails with the exception below.
> It looks like the DataNode and NameNode temporarily lost communication, and 
> this does not happen on every write.
> {code:java}
> java.lang.IllegalArgumentException: java.net.UnknownHostException: ns1hdfs
>     at 
> org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:445)
>     at 
> org.apache.hadoop.hdfs.NameNodeProxiesClient.createProxyWithClientProtocol(NameNodeProxiesClient.java:140)
>     at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:357)
>     at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:291)
>     at 
> org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:181)
>     at 
> com.onething.dc.flink.parquet.sink.template.ProcessParquetSinkTemplate.recoverLease(ProcessParquetSinkTemplate.java:180)
>     at 
> com.onething.dc.flink.parquet.sink.template.ProcessParquetSinkTemplate.close(ProcessParquetSinkTemplate.java:208)
>     at 
> com.onething.dc.flink.parquet.sink.ProcessParquetSink$1.processElement(ProcessParquetSink.java:118)
>     at 
> com.onething.dc.flink.parquet.sink.ProcessParquetSink$1.processElement(ProcessParquetSink.java:35)
>     at 
> org.apache.flink.streaming.api.operators.ProcessOperator.processElement(ProcessOperator.java:66)
>     at 
> org.apache.flink.streaming.runtime.tasks.OneInputStreamTask$StreamTaskNetworkOutput.emitRecord(OneInputStreamTask.java:205)
>     at 
> org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.processElement(AbstractStreamTaskNetworkInput.java:134)
>     at 
> org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.emitNext(AbstractStreamTaskNetworkInput.java:105)
>     at 
> org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:66)
>     at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:419)
>     at 
> org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:204)
>     at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:661)
>     at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:623)
>     at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:776)
>     at org.apache.flink.runtime.taskmanager.Task.run(Task.java:563)
>     at java.lang.Thread.run(Thread.java:748)
>     Suppressed: java.lang.IllegalArgumentException: 
> java.net.UnknownHostException: ns1hdfs
>         at 
> org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:445)
>         at 
> org.apache.hadoop.hdfs.NameNodeProxiesClient.createProxyWithClientProtocol(NameNodeProxiesClient.java:140)
>         at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:357)
>         at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:291)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:181)
>         at 
> com.onething.dc.flink.parquet.sink.template.ProcessParquetSinkTemplate.recoverLease(ProcessParquetSinkTemplate.java:180)
>         at 
> com.onething.dc.flink.parquet.sink.template.ProcessParquetSinkTemplate.close(ProcessParquetSinkTemplate.java:208)
>         at 
> org.apache.flink.api.common.functions.util.FunctionUtils.closeFunction(FunctionUtils.java:41)
>         at 
> org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.dispose(AbstractUdfStreamOperator.java:117)
>         at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.disposeAllOperators(StreamTask.java:837)
>         at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.runAndSuppressThrowable(StreamTask.java:816)
>         at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.cleanUpInvoke(StreamTask.java:733)
>         at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:637)
>         ... 3 more
>     Caused by: java.net.UnknownHostException: ns1hdfs
>         ... 16 more
> Caused by: java.net.UnknownHostException: ns1hdfs
>     ... 21 more {code}
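> For context, the recoverLease call in the trace above presumably wraps 
> DistributedFileSystem#recoverLease; a hedged sketch of such a helper (the 
> actual ProcessParquetSinkTemplate code is not shown in this issue, and the 
> backoff values here are illustrative):
> {code:java}
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.hdfs.DistributedFileSystem;
>
> // Sketch only: recoverLease() returns true once the lease is released and
> // the last block is COMPLETE; otherwise poll with a short doubling backoff.
> static void recoverLease(DistributedFileSystem dfs, Path file)
>         throws Exception {
>     long waitMs = 400;
>     while (!dfs.recoverLease(file)) {
>         Thread.sleep(waitMs);
>         waitMs = Math.min(waitMs * 2, 6400);
>     }
> }
> {code}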
> I can't see what the problem is, but when I checked the NameNode log I found 
> that the block could not transition from COMMITTED to COMPLETE. When a file 
> is closed, the DataNode must send an incremental block report (IBR) to the 
> NameNode, and the close only completes after that acknowledgement. Here the 
> IBR failed, so the file could not be closed. The client retries the close 
> each time it fails, doubling its wait in turn: 400 ms, 800 ms, 1600 ms, 
> 3200 ms, 6400 ms. These retries are visible in the NameNode log.
>  
> {code:java}
> 2023-02-09 10:00:10,638 INFO  hdfs.StateChange 
> (FSDirWriteFileOp.java:logAllocatedBlock(802)) - BLOCK* allocate 
> blk_1092451654_18751000, replicas=10.146.144.69:1019, 10.146.80.45:1019 for 
> /user/hive/warehouse/dc_orig.db/o_dc_http_topic_metric/_flume_parquet_insert_staging/bdyun-suzhou-tw04a0032.su.baidu.internal_1675908007110_628122711
> 2023-02-09 10:00:11,072 INFO  namenode.FSNamesystem 
> (FSNamesystem.java:checkBlocksComplete(3151)) - BLOCK* 
> blk_1092451654_18751000 is COMMITTED but not COMPLETE(numNodes= 0 <  minimum 
> = 1) in file 
> /user/hive/warehouse/dc_orig.db/o_dc_http_topic_metric/_flume_parquet_insert_staging/bdyun-suzhou-tw04a0032.su.baidu.internal_1675908007110_628122711
> 2023-02-09 10:00:11,474 INFO  namenode.FSNamesystem 
> (FSNamesystem.java:checkBlocksComplete(3151)) - BLOCK* 
> blk_1092451654_18751000 is COMMITTED but not COMPLETE(numNodes= 0 <  minimum 
> = 1) in file 
> /user/hive/warehouse/dc_orig.db/o_dc_http_topic_metric/_flume_parquet_insert_staging/bdyun-suzhou-tw04a0032.su.baidu.internal_1675908007110_628122711
> 2023-02-09 10:00:12,285 INFO  namenode.FSNamesystem 
> (FSNamesystem.java:checkBlocksComplete(3151)) - BLOCK* 
> blk_1092451654_18751000 is COMMITTED but not COMPLETE(numNodes= 0 <  minimum 
> = 1) in file 
> /user/hive/warehouse/dc_orig.db/o_dc_http_topic_metric/_flume_parquet_insert_staging/bdyun-suzhou-tw04a0032.su.baidu.internal_1675908007110_628122711
> 2023-02-09 10:00:13,887 INFO  namenode.FSNamesystem 
> (FSNamesystem.java:checkBlocksComplete(3151)) - BLOCK* 
> blk_1092451654_18751000 is COMMITTED but not COMPLETE(numNodes= 0 <  minimum 
> = 1) in file 
> /user/hive/warehouse/dc_orig.db/o_dc_http_topic_metric/_flume_parquet_insert_staging/bdyun-suzhou-tw04a0032.su.baidu.internal_1675908007110_628122711
> 2023-02-09 10:00:17,089 INFO  namenode.FSNamesystem 
> (FSNamesystem.java:checkBlocksComplete(3151)) - BLOCK* 
> blk_1092451654_18751000 is COMMITTED but not COMPLETE(numNodes= 0 <  minimum 
> = 1) in file 
> /user/hive/warehouse/dc_orig.db/o_dc_http_topic_metric/_flume_parquet_insert_staging/bdyun-suzhou-tw04a0032.su.baidu.internal_1675908007110_628122711
> 2023-02-09 10:00:23,490 INFO  namenode.FSNamesystem 
> (FSNamesystem.java:checkBlocksComplete(3151)) - BLOCK* 
> blk_1092451654_18751000 is COMMITTED but not COMPLETE(numNodes= 0 <  minimum 
> = 1) in file 
> /user/hive/warehouse/dc_orig.db/o_dc_http_topic_metric/_flume_parquet_insert_staging/bdyun-suzhou-tw04a0032.su.baidu.internal_1675908007110_628122711{code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
