[
https://issues.apache.org/jira/browse/HDFS-16912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
jiangchunyang updated HDFS-16912:
---------------------------------
Attachment: image-2023-02-09-16-36-42-818.png
> Block is COMMITTED but not COMPLETE(numNodes= 0 < minimum = 1)
> ---------------------------------------------------------------
>
> Key: HDFS-16912
> URL: https://issues.apache.org/jira/browse/HDFS-16912
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: block placement
> Affects Versions: 3.3.1
> Environment: hadoop:3.3.1
> Reporter: jiangchunyang
> Priority: Major
> Attachments: image-2023-02-09-16-36-42-818.png
>
>
> We use HDFS federation with two nameservices, ns1 and ns2. Table data is
> written under dc-hdfs, but each database is assigned to a specific
> nameservice according to its business division.
> We use ParquetWriter to write data into the staging temporary directory of
> each table under the assigned nameservice. Occasionally an exception is
> thrown when the Writer is closed, which triggers our file lease recovery
> logic; recovering the lease then fails with the exception below.
> It looks as if the DataNode and NameNode temporarily lost communication;
> this does not happen on every write.
> {code:java}
> java.lang.IllegalArgumentException: java.net.UnknownHostException: ns1hdfs
>     at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:445)
>     at org.apache.hadoop.hdfs.NameNodeProxiesClient.createProxyWithClientProtocol(NameNodeProxiesClient.java:140)
>     at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:357)
>     at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:291)
>     at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:181)
>     at com.onething.dc.flink.parquet.sink.template.ProcessParquetSinkTemplate.recoverLease(ProcessParquetSinkTemplate.java:180)
>     at com.onething.dc.flink.parquet.sink.template.ProcessParquetSinkTemplate.close(ProcessParquetSinkTemplate.java:208)
>     at com.onething.dc.flink.parquet.sink.ProcessParquetSink$1.processElement(ProcessParquetSink.java:118)
>     at com.onething.dc.flink.parquet.sink.ProcessParquetSink$1.processElement(ProcessParquetSink.java:35)
>     at org.apache.flink.streaming.api.operators.ProcessOperator.processElement(ProcessOperator.java:66)
>     at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask$StreamTaskNetworkOutput.emitRecord(OneInputStreamTask.java:205)
>     at org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.processElement(AbstractStreamTaskNetworkInput.java:134)
>     at org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.emitNext(AbstractStreamTaskNetworkInput.java:105)
>     at org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:66)
>     at org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:419)
>     at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:204)
>     at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:661)
>     at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:623)
>     at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:776)
>     at org.apache.flink.runtime.taskmanager.Task.run(Task.java:563)
>     at java.lang.Thread.run(Thread.java:748)
>     Suppressed: java.lang.IllegalArgumentException: java.net.UnknownHostException: ns1hdfs
>         at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:445)
>         at org.apache.hadoop.hdfs.NameNodeProxiesClient.createProxyWithClientProtocol(NameNodeProxiesClient.java:140)
>         at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:357)
>         at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:291)
>         at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:181)
>         at com.onething.dc.flink.parquet.sink.template.ProcessParquetSinkTemplate.recoverLease(ProcessParquetSinkTemplate.java:180)
>         at com.onething.dc.flink.parquet.sink.template.ProcessParquetSinkTemplate.close(ProcessParquetSinkTemplate.java:208)
>         at org.apache.flink.api.common.functions.util.FunctionUtils.closeFunction(FunctionUtils.java:41)
>         at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.dispose(AbstractUdfStreamOperator.java:117)
>         at org.apache.flink.streaming.runtime.tasks.StreamTask.disposeAllOperators(StreamTask.java:837)
>         at org.apache.flink.streaming.runtime.tasks.StreamTask.runAndSuppressThrowable(StreamTask.java:816)
>         at org.apache.flink.streaming.runtime.tasks.StreamTask.cleanUpInvoke(StreamTask.java:733)
>         at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:637)
>         ... 3 more
>     Caused by: java.net.UnknownHostException: ns1hdfs
>         ... 16 more
> Caused by: java.net.UnknownHostException: ns1hdfs
>     ... 21 more
> {code}
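The {{java.net.UnknownHostException: ns1hdfs}} in the trace above is what the HDFS client throws when it treats a nameservice ID as a plain hostname because that ID is not declared in its configuration, i.e. "ns1hdfs" does not appear in the client's {{dfs.nameservices}}. A minimal non-HA federation client sketch is shown below; the nameservice IDs and NameNode host names are illustrative assumptions, not the actual cluster settings:

{code:xml}
<!-- Hypothetical client-side hdfs-site.xml fragment. Every nameservice the
     client addresses must be listed in dfs.nameservices and have a
     matching rpc-address; otherwise the ID is resolved as a hostname. -->
<property>
  <name>dfs.nameservices</name>
  <value>ns1,ns2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns1</name>
  <value>nn1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns2</name>
  <value>nn2.example.com:8020</value>
</property>
{code}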
> I cannot see what the problem is from the client side, but when I checked
> the NameNode log, I found that the block state of the file could not change
> from COMMITTED to COMPLETE. To close a file, the DataNode must send an
> incremental block report (IBR) to the NameNode after receiving the ack
> confirmation; only then can the file be closed. Here the DataNode failed to
> report the IBR, so the file could not be closed. The client retries each
> time completion fails, doubling the wait between attempts: 400 ms, 800 ms,
> 1600 ms, 3200 ms, 6400 ms. These retries can be seen in the NameNode log.
>
> {code:java}
> 2023-02-09 10:00:10,638 INFO hdfs.StateChange (FSDirWriteFileOp.java:logAllocatedBlock(802)) - BLOCK* allocate blk_1092451654_18751000, replicas=10.146.144.69:1019, 10.146.80.45:1019 for /user/hive/warehouse/dc_orig.db/o_dc_http_topic_metric/_flume_parquet_insert_staging/bdyun-suzhou-tw04a0032.su.baidu.internal_1675908007110_628122711
> 2023-02-09 10:00:11,072 INFO namenode.FSNamesystem (FSNamesystem.java:checkBlocksComplete(3151)) - BLOCK* blk_1092451654_18751000 is COMMITTED but not COMPLETE(numNodes= 0 < minimum = 1) in file /user/hive/warehouse/dc_orig.db/o_dc_http_topic_metric/_flume_parquet_insert_staging/bdyun-suzhou-tw04a0032.su.baidu.internal_1675908007110_628122711
> 2023-02-09 10:00:11,474 INFO namenode.FSNamesystem (FSNamesystem.java:checkBlocksComplete(3151)) - BLOCK* blk_1092451654_18751000 is COMMITTED but not COMPLETE(numNodes= 0 < minimum = 1) in file /user/hive/warehouse/dc_orig.db/o_dc_http_topic_metric/_flume_parquet_insert_staging/bdyun-suzhou-tw04a0032.su.baidu.internal_1675908007110_628122711
> 2023-02-09 10:00:12,285 INFO namenode.FSNamesystem (FSNamesystem.java:checkBlocksComplete(3151)) - BLOCK* blk_1092451654_18751000 is COMMITTED but not COMPLETE(numNodes= 0 < minimum = 1) in file /user/hive/warehouse/dc_orig.db/o_dc_http_topic_metric/_flume_parquet_insert_staging/bdyun-suzhou-tw04a0032.su.baidu.internal_1675908007110_628122711
> 2023-02-09 10:00:13,887 INFO namenode.FSNamesystem (FSNamesystem.java:checkBlocksComplete(3151)) - BLOCK* blk_1092451654_18751000 is COMMITTED but not COMPLETE(numNodes= 0 < minimum = 1) in file /user/hive/warehouse/dc_orig.db/o_dc_http_topic_metric/_flume_parquet_insert_staging/bdyun-suzhou-tw04a0032.su.baidu.internal_1675908007110_628122711
> 2023-02-09 10:00:17,089 INFO namenode.FSNamesystem (FSNamesystem.java:checkBlocksComplete(3151)) - BLOCK* blk_1092451654_18751000 is COMMITTED but not COMPLETE(numNodes= 0 < minimum = 1) in file /user/hive/warehouse/dc_orig.db/o_dc_http_topic_metric/_flume_parquet_insert_staging/bdyun-suzhou-tw04a0032.su.baidu.internal_1675908007110_628122711
> 2023-02-09 10:00:23,490 INFO namenode.FSNamesystem (FSNamesystem.java:checkBlocksComplete(3151)) - BLOCK* blk_1092451654_18751000 is COMMITTED but not COMPLETE(numNodes= 0 < minimum = 1) in file /user/hive/warehouse/dc_orig.db/o_dc_http_topic_metric/_flume_parquet_insert_staging/bdyun-suzhou-tw04a0032.su.baidu.internal_1675908007110_628122711
> {code}
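The gaps between the COMMITTED-but-not-COMPLETE entries above match a simple exponential backoff on the client's retries. A minimal sketch of that schedule; the 400 ms starting delay and retry count are inferred from the log timestamps, not taken from the HDFS client source:

```java
import java.util.ArrayList;
import java.util.List;

public class CompleteRetryBackoff {
    // Successive client wait times, doubling from an initial delay.
    static List<Long> waits(long initialMs, int retries) {
        List<Long> out = new ArrayList<>();
        long sleepMs = initialMs;
        for (int i = 0; i < retries; i++) {
            out.add(sleepMs);
            sleepMs *= 2;
        }
        return out;
    }

    public static void main(String[] args) {
        // Matches the gaps between the retry log lines above:
        System.out.println(waits(400, 5)); // [400, 800, 1600, 3200, 6400]
    }
}
```

Summed, these waits are about 12.4 s, which is consistent with the span between the first (10:00:11,072) and last (10:00:23,490) retry entries in the log.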
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)