[ https://issues.apache.org/jira/browse/HDFS-16912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

jiangchunyang updated HDFS-16912:
---------------------------------
    Attachment: image-2023-02-09-16-36-42-818.png

> Block is COMMITTED but not COMPLETE(numNodes= 0 <  minimum = 1)
> ---------------------------------------------------------------
>
>                 Key: HDFS-16912
>                 URL: https://issues.apache.org/jira/browse/HDFS-16912
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: block placement
>    Affects Versions: 3.3.1
>         Environment: hadoop:3.3.1
>            Reporter: jiangchunyang
>            Priority: Major
>         Attachments: image-2023-02-09-16-36-42-818.png
>
>
> We use HDFS federation with two nameservices, ns1 and ns2. Table data is 
> written under dc-hdfs, but each database is assigned to a specific 
> nameservice according to the business division.
> We use ParquetWriter to write data to each table's staging temporary 
> directory under its nameservice. When the writer is closed, an exception is 
> sometimes thrown, which triggers our file-lease recovery; recovering the 
> lease then fails with the exception below.
> It looks as if the DataNode and NameNode temporarily lost communication; 
> this does not happen on every write.
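Our recovery path boils down to "call recoverLease(path), then poll until the file is closed, doubling the wait between attempts". A minimal, self-contained sketch of that wait loop (the `isClosed` supplier is a stand-in for `DistributedFileSystem.isFileClosed(path)`; the real code calls `recoverLease` first, and the delay/retry values here are illustrative):

```java
import java.util.function.BooleanSupplier;

// Simplified stand-in for a lease-recovery wait loop: after recoverLease()
// has been requested, poll until the file is reported closed, doubling the
// wait between attempts.
public class LeaseRecoveryWait {
    public static boolean waitUntilClosed(BooleanSupplier isClosed,
                                          long initialDelayMs, int maxRetries) {
        long delay = initialDelayMs;
        for (int i = 0; i < maxRetries; i++) {
            if (isClosed.getAsBoolean()) {
                return true;               // file reached COMPLETE
            }
            try {
                Thread.sleep(delay);       // back off before the next check
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
            delay *= 2;                    // 400 ms, 800 ms, 1600 ms, ...
        }
        return isClosed.getAsBoolean();    // one last check after the retries
    }
}
```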
> {code:java}
> java.lang.IllegalArgumentException: java.net.UnknownHostException: ns1hdfs
>     at 
> org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:445)
>     at 
> org.apache.hadoop.hdfs.NameNodeProxiesClient.createProxyWithClientProtocol(NameNodeProxiesClient.java:140)
>     at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:357)
>     at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:291)
>     at 
> org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:181)
>     at 
> com.onething.dc.flink.parquet.sink.template.ProcessParquetSinkTemplate.recoverLease(ProcessParquetSinkTemplate.java:180)
>     at 
> com.onething.dc.flink.parquet.sink.template.ProcessParquetSinkTemplate.close(ProcessParquetSinkTemplate.java:208)
>     at 
> com.onething.dc.flink.parquet.sink.ProcessParquetSink$1.processElement(ProcessParquetSink.java:118)
>     at 
> com.onething.dc.flink.parquet.sink.ProcessParquetSink$1.processElement(ProcessParquetSink.java:35)
>     at 
> org.apache.flink.streaming.api.operators.ProcessOperator.processElement(ProcessOperator.java:66)
>     at 
> org.apache.flink.streaming.runtime.tasks.OneInputStreamTask$StreamTaskNetworkOutput.emitRecord(OneInputStreamTask.java:205)
>     at 
> org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.processElement(AbstractStreamTaskNetworkInput.java:134)
>     at 
> org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.emitNext(AbstractStreamTaskNetworkInput.java:105)
>     at 
> org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:66)
>     at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:419)
>     at 
> org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:204)
>     at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:661)
>     at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:623)
>     at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:776)
>     at org.apache.flink.runtime.taskmanager.Task.run(Task.java:563)
>     at java.lang.Thread.run(Thread.java:748)
>     Suppressed: java.lang.IllegalArgumentException: 
> java.net.UnknownHostException: ns1hdfs
>         at 
> org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:445)
>         at 
> org.apache.hadoop.hdfs.NameNodeProxiesClient.createProxyWithClientProtocol(NameNodeProxiesClient.java:140)
>         at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:357)
>         at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:291)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:181)
>         at 
> com.onething.dc.flink.parquet.sink.template.ProcessParquetSinkTemplate.recoverLease(ProcessParquetSinkTemplate.java:180)
>         at 
> com.onething.dc.flink.parquet.sink.template.ProcessParquetSinkTemplate.close(ProcessParquetSinkTemplate.java:208)
>         at 
> org.apache.flink.api.common.functions.util.FunctionUtils.closeFunction(FunctionUtils.java:41)
>         at 
> org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.dispose(AbstractUdfStreamOperator.java:117)
>         at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.disposeAllOperators(StreamTask.java:837)
>         at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.runAndSuppressThrowable(StreamTask.java:816)
>         at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.cleanUpInvoke(StreamTask.java:733)
>         at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:637)
>         ... 3 more
>     Caused by: java.net.UnknownHostException: ns1hdfs
>         ... 16 more
> Caused by: java.net.UnknownHostException: ns1hdfs
>     ... 21 more {code}
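For context, `java.net.UnknownHostException: ns1hdfs` normally means the client treated `ns1hdfs` as a plain hostname, i.e. the Configuration used to build this DistributedFileSystem did not define `ns1hdfs` as a logical nameservice. If `ns1hdfs` is meant to be a nameservice, the client side would need something along these lines (hostnames and the second nameservice name are placeholders, not taken from this report):

```xml
<property>
  <name>dfs.nameservices</name>
  <value>ns1hdfs,ns2hdfs</value>
</property>
<property>
  <name>dfs.ha.namenodes.ns1hdfs</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns1hdfs.nn1</name>
  <value>namenode1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns1hdfs.nn2</name>
  <value>namenode2.example.com:8020</value>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.ns1hdfs</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
```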
> I could not see what the problem was from the client side, but when I 
> checked the NameNode log I found that the file's last block never 
> transitioned from COMMITTED to COMPLETE. To close a file, the DataNode must 
> send an incremental block report (IBR) to the NameNode, and the NameNode 
> completes the file only after receiving that report. Here the IBR 
> apparently never arrived, so the file could not be closed. The client 
> retries, doubling its wait each time: 400 ms, 800 ms, 1600 ms, 3200 ms, 
> 6400 ms. These retries are visible in the NameNode log:
>  
> {code:java}
> 2023-02-09 10:00:10,638 INFO  hdfs.StateChange 
> (FSDirWriteFileOp.java:logAllocatedBlock(802)) - BLOCK* allocate 
> blk_1092451654_18751000, replicas=10.146.144.69:1019, 10.146.80.45:1019 for 
> /user/hive/warehouse/dc_orig.db/o_dc_http_topic_metric/_flume_parquet_insert_staging/bdyun-suzhou-tw04a0032.su.baidu.internal_1675908007110_628122711
> 2023-02-09 10:00:11,072 INFO  namenode.FSNamesystem 
> (FSNamesystem.java:checkBlocksComplete(3151)) - BLOCK* 
> blk_1092451654_18751000 is COMMITTED but not COMPLETE(numNodes= 0 <  minimum 
> = 1) in file 
> /user/hive/warehouse/dc_orig.db/o_dc_http_topic_metric/_flume_parquet_insert_staging/bdyun-suzhou-tw04a0032.su.baidu.internal_1675908007110_628122711
> 2023-02-09 10:00:11,474 INFO  namenode.FSNamesystem 
> (FSNamesystem.java:checkBlocksComplete(3151)) - BLOCK* 
> blk_1092451654_18751000 is COMMITTED but not COMPLETE(numNodes= 0 <  minimum 
> = 1) in file 
> /user/hive/warehouse/dc_orig.db/o_dc_http_topic_metric/_flume_parquet_insert_staging/bdyun-suzhou-tw04a0032.su.baidu.internal_1675908007110_628122711
> 2023-02-09 10:00:12,285 INFO  namenode.FSNamesystem 
> (FSNamesystem.java:checkBlocksComplete(3151)) - BLOCK* 
> blk_1092451654_18751000 is COMMITTED but not COMPLETE(numNodes= 0 <  minimum 
> = 1) in file 
> /user/hive/warehouse/dc_orig.db/o_dc_http_topic_metric/_flume_parquet_insert_staging/bdyun-suzhou-tw04a0032.su.baidu.internal_1675908007110_628122711
> 2023-02-09 10:00:13,887 INFO  namenode.FSNamesystem 
> (FSNamesystem.java:checkBlocksComplete(3151)) - BLOCK* 
> blk_1092451654_18751000 is COMMITTED but not COMPLETE(numNodes= 0 <  minimum 
> = 1) in file 
> /user/hive/warehouse/dc_orig.db/o_dc_http_topic_metric/_flume_parquet_insert_staging/bdyun-suzhou-tw04a0032.su.baidu.internal_1675908007110_628122711
> 2023-02-09 10:00:17,089 INFO  namenode.FSNamesystem 
> (FSNamesystem.java:checkBlocksComplete(3151)) - BLOCK* 
> blk_1092451654_18751000 is COMMITTED but not COMPLETE(numNodes= 0 <  minimum 
> = 1) in file 
> /user/hive/warehouse/dc_orig.db/o_dc_http_topic_metric/_flume_parquet_insert_staging/bdyun-suzhou-tw04a0032.su.baidu.internal_1675908007110_628122711
> 2023-02-09 10:00:23,490 INFO  namenode.FSNamesystem 
> (FSNamesystem.java:checkBlocksComplete(3151)) - BLOCK* 
> blk_1092451654_18751000 is COMMITTED but not COMPLETE(numNodes= 0 <  minimum 
> = 1) in file 
> /user/hive/warehouse/dc_orig.db/o_dc_http_topic_metric/_flume_parquet_insert_staging/bdyun-suzhou-tw04a0032.su.baidu.internal_1675908007110_628122711{code}
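The intervals between the retries logged above (roughly 400 ms, 800 ms, 1600 ms, 3200 ms, 6400 ms) match an exponential backoff that starts at 400 ms and doubles on each attempt. A small sketch of that delay schedule (the 400 ms start and five retries reflect what this log shows, not a verified reading of the client defaults):

```java
import java.util.ArrayList;
import java.util.List;

// Exponential backoff schedule: start at initialDelayMs and double the
// delay on every retry, matching the gaps between the log lines above.
public class CompleteFileBackoff {
    public static List<Long> delaysMs(long initialDelayMs, int retries) {
        List<Long> delays = new ArrayList<>();
        long d = initialDelayMs;
        for (int i = 0; i < retries; i++) {
            delays.add(d);
            d *= 2;
        }
        return delays;
    }
}
```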
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]