jiangchunyang created HDFS-16912:
------------------------------------
Summary: Block is COMMITTED but not COMPLETE(numNodes= 0 <
minimum = 1)
Key: HDFS-16912
URL: https://issues.apache.org/jira/browse/HDFS-16912
Project: Hadoop HDFS
Issue Type: Bug
Components: block placement
Affects Versions: 3.3.1
Environment: hadoop:3.3.1
Reporter: jiangchunyang
We use HDFS federation with two nameservices, ns1 and ns2. Table data is written under dc-hdfs, and each database is assigned to a specific nameservice according to its business division.
We use ParquetWriter to write data into each table's staging temporary directory under the assigned nameservice. Occasionally an exception is thrown when the writer is closed, which triggers our file lease recovery; the lease recovery itself then fails with the exception below (a rough sketch of that recovery step follows the stack trace).
It looks like the DataNode and NameNode temporarily lost communication, and this does not happen on every write.
{code:java}
java.lang.IllegalArgumentException: java.net.UnknownHostException: ns1hdfs
    at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:445)
    at org.apache.hadoop.hdfs.NameNodeProxiesClient.createProxyWithClientProtocol(NameNodeProxiesClient.java:140)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:357)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:291)
    at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:181)
    at com.onething.dc.flink.parquet.sink.template.ProcessParquetSinkTemplate.recoverLease(ProcessParquetSinkTemplate.java:180)
    at com.onething.dc.flink.parquet.sink.template.ProcessParquetSinkTemplate.close(ProcessParquetSinkTemplate.java:208)
    at com.onething.dc.flink.parquet.sink.ProcessParquetSink$1.processElement(ProcessParquetSink.java:118)
    at com.onething.dc.flink.parquet.sink.ProcessParquetSink$1.processElement(ProcessParquetSink.java:35)
    at org.apache.flink.streaming.api.operators.ProcessOperator.processElement(ProcessOperator.java:66)
    at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask$StreamTaskNetworkOutput.emitRecord(OneInputStreamTask.java:205)
    at org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.processElement(AbstractStreamTaskNetworkInput.java:134)
    at org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.emitNext(AbstractStreamTaskNetworkInput.java:105)
    at org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:66)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:419)
    at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:204)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:661)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:623)
    at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:776)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:563)
    at java.lang.Thread.run(Thread.java:748)
    Suppressed: java.lang.IllegalArgumentException: java.net.UnknownHostException: ns1hdfs
        at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:445)
        at org.apache.hadoop.hdfs.NameNodeProxiesClient.createProxyWithClientProtocol(NameNodeProxiesClient.java:140)
        at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:357)
        at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:291)
        at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:181)
        at com.onething.dc.flink.parquet.sink.template.ProcessParquetSinkTemplate.recoverLease(ProcessParquetSinkTemplate.java:180)
        at com.onething.dc.flink.parquet.sink.template.ProcessParquetSinkTemplate.close(ProcessParquetSinkTemplate.java:208)
        at org.apache.flink.api.common.functions.util.FunctionUtils.closeFunction(FunctionUtils.java:41)
        at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.dispose(AbstractUdfStreamOperator.java:117)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.disposeAllOperators(StreamTask.java:837)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.runAndSuppressThrowable(StreamTask.java:816)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.cleanUpInvoke(StreamTask.java:733)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:637)
        ... 3 more
    Caused by: java.net.UnknownHostException: ns1hdfs
        ... 16 more
Caused by: java.net.UnknownHostException: ns1hdfs
    ... 21 more
{code}
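For reference, our lease-recovery step is essentially the following. This is only a minimal sketch, not the exact ProcessParquetSinkTemplate code: it assumes the standard DistributedFileSystem.recoverLease / isFileClosed API and a client Configuration that can resolve the target nameservice, which is exactly what fails above (the UnknownHostException is thrown while the FileSystem is being initialized, before recoverLease is ever reached).
{code:java}
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

// Minimal sketch (hypothetical, not our production class): recover the lease
// on a staging file and wait until the NameNode reports it closed.
public class LeaseRecoveryExample {
    public static boolean recoverLease(Path file, Configuration conf, long timeoutMs)
            throws Exception {
        // FileSystem.get resolves the nameservice from the path's URI; an
        // unresolvable nameservice surfaces here as UnknownHostException.
        FileSystem fs = FileSystem.get(file.toUri(), conf);
        if (!(fs instanceof DistributedFileSystem)) {
            throw new IllegalStateException("Not an HDFS path: " + file);
        }
        DistributedFileSystem dfs = (DistributedFileSystem) fs;

        long deadline = System.currentTimeMillis() + timeoutMs;
        // recoverLease returns true once the file is already closed; otherwise
        // poll isFileClosed until the NameNode completes the last block.
        boolean closed = dfs.recoverLease(file);
        while (!closed && System.currentTimeMillis() < deadline) {
            TimeUnit.MILLISECONDS.sleep(1000);
            closed = dfs.isFileClosed(file);
        }
        return closed;
    }
}
{code}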
I cannot see what the root problem is, but when I checked the NameNode log I found that the file's last block never transitioned from COMMITTED to COMPLETE. To close a file, the DataNode must send an incremental block report (IBR) for the finalized replica to the NameNode, and the client's complete call can only succeed after that report has been processed. In this case the DataNode's IBR did not arrive, so the file could not be closed. The client keeps retrying the complete call, doubling its wait after each failed attempt: 400ms, 800ms, 1600ms, 3200ms, 6400ms. These retries are visible in the NameNode log below; a sketch of the backoff follows the excerpt.
{code:java}
2023-02-09 10:00:10,638 INFO hdfs.StateChange (FSDirWriteFileOp.java:logAllocatedBlock(802)) - BLOCK* allocate blk_1092451654_18751000, replicas=10.146.144.69:1019, 10.146.80.45:1019 for /user/hive/warehouse/dc_orig.db/o_dc_http_topic_metric/_flume_parquet_insert_staging/bdyun-suzhou-tw04a0032.su.baidu.internal_1675908007110_628122711
2023-02-09 10:00:11,072 INFO namenode.FSNamesystem (FSNamesystem.java:checkBlocksComplete(3151)) - BLOCK* blk_1092451654_18751000 is COMMITTED but not COMPLETE(numNodes= 0 < minimum = 1) in file /user/hive/warehouse/dc_orig.db/o_dc_http_topic_metric/_flume_parquet_insert_staging/bdyun-suzhou-tw04a0032.su.baidu.internal_1675908007110_628122711
2023-02-09 10:00:11,474 INFO namenode.FSNamesystem (FSNamesystem.java:checkBlocksComplete(3151)) - BLOCK* blk_1092451654_18751000 is COMMITTED but not COMPLETE(numNodes= 0 < minimum = 1) in file /user/hive/warehouse/dc_orig.db/o_dc_http_topic_metric/_flume_parquet_insert_staging/bdyun-suzhou-tw04a0032.su.baidu.internal_1675908007110_628122711
2023-02-09 10:00:12,285 INFO namenode.FSNamesystem (FSNamesystem.java:checkBlocksComplete(3151)) - BLOCK* blk_1092451654_18751000 is COMMITTED but not COMPLETE(numNodes= 0 < minimum = 1) in file /user/hive/warehouse/dc_orig.db/o_dc_http_topic_metric/_flume_parquet_insert_staging/bdyun-suzhou-tw04a0032.su.baidu.internal_1675908007110_628122711
2023-02-09 10:00:13,887 INFO namenode.FSNamesystem (FSNamesystem.java:checkBlocksComplete(3151)) - BLOCK* blk_1092451654_18751000 is COMMITTED but not COMPLETE(numNodes= 0 < minimum = 1) in file /user/hive/warehouse/dc_orig.db/o_dc_http_topic_metric/_flume_parquet_insert_staging/bdyun-suzhou-tw04a0032.su.baidu.internal_1675908007110_628122711
2023-02-09 10:00:17,089 INFO namenode.FSNamesystem (FSNamesystem.java:checkBlocksComplete(3151)) - BLOCK* blk_1092451654_18751000 is COMMITTED but not COMPLETE(numNodes= 0 < minimum = 1) in file /user/hive/warehouse/dc_orig.db/o_dc_http_topic_metric/_flume_parquet_insert_staging/bdyun-suzhou-tw04a0032.su.baidu.internal_1675908007110_628122711
2023-02-09 10:00:23,490 INFO namenode.FSNamesystem (FSNamesystem.java:checkBlocksComplete(3151)) - BLOCK* blk_1092451654_18751000 is COMMITTED but not COMPLETE(numNodes= 0 < minimum = 1) in file /user/hive/warehouse/dc_orig.db/o_dc_http_topic_metric/_flume_parquet_insert_staging/bdyun-suzhou-tw04a0032.su.baidu.internal_1675908007110_628122711
{code}
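The doubling intervals match the client-side backoff around the complete call. The following is a minimal standalone sketch of that backoff, not the actual DFSOutputStream implementation; as far as I can tell the attempt count and initial delay are governed by dfs.client.block.write.locateFollowingBlock.retries (default 5) and dfs.client.block.write.locateFollowingBlock.initial.delay.ms (default 400), and tryComplete() below is a hypothetical stand-in for the ClientProtocol.complete() RPC, which returns false while the last block is still COMMITTED.
{code:java}
// Illustrative sketch of the completeFile backoff that produces the
// 400/800/1600/3200/6400 ms gaps visible in the NameNode log above.
public class CompleteFileBackoffSketch {

    /** Hypothetical stand-in for namenode.complete(src, clientName, lastBlock, fileId). */
    interface CompleteCall {
        boolean tryComplete() throws InterruptedException;
    }

    static boolean completeWithBackoff(CompleteCall call) throws InterruptedException {
        int retries = 5;           // cf. dfs.client.block.write.locateFollowingBlock.retries
        long sleepMs = 400L;       // cf. dfs.client.block.write.locateFollowingBlock.initial.delay.ms
        while (!call.tryComplete()) {
            if (retries-- == 0) {
                return false;      // the real client gives up and throws an IOException here
            }
            Thread.sleep(sleepMs); // sleeps 400, 800, 1600, 3200, 6400 ms between attempts
            sleepMs *= 2;          // double the wait after each failure
        }
        return true;
    }
}
{code}
The checkBlocksComplete entries in the log above are roughly 0.4 s, 0.8 s, 1.6 s, 3.2 s and 6.4 s apart, which matches this schedule.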