[
https://issues.apache.org/jira/browse/HDFS-16912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17686278#comment-17686278
]
jiangchunyang commented on HDFS-16912:
--------------------------------------
[~ayushtkn] Thanks for your comment and your suggestion.
The exception is thrown right after the last block is submitted, at exactly
2023-02-09 10:00:23. The errors and retries can be seen in the last code block
of the description. In other words, the program throws the exception after the
last retry fails.
{code:java}
2023-02-09 10:00:10,638 INFO hdfs.StateChange
(FSDirWriteFileOp.java:logAllocatedBlock(802)) - BLOCK* allocate
blk_1092451654_18751000, replicas=10.146.144.69:1019, 10.146.80.45:1019 for
/user/hive/warehouse/dc_orig.db/o_dc_http_topic_metric/_flume_parquet_insert_staging/bdyun-suzhou-tw04a0032.su.baidu.internal_1675908007110_628122711
2023-02-09 10:00:11,072 INFO namenode.FSNamesystem
(FSNamesystem.java:checkBlocksComplete(3151)) - BLOCK* blk_1092451654_18751000
is COMMITTED but not COMPLETE(numNodes= 0 < minimum = 1) in file
/user/hive/warehouse/dc_orig.db/o_dc_http_topic_metric/_flume_parquet_insert_staging/bdyun-suzhou-tw04a0032.su.baidu.internal_1675908007110_628122711
2023-02-09 10:00:11,474 INFO namenode.FSNamesystem
(FSNamesystem.java:checkBlocksComplete(3151)) - BLOCK* blk_1092451654_18751000
is COMMITTED but not COMPLETE(numNodes= 0 < minimum = 1) in file
/user/hive/warehouse/dc_orig.db/o_dc_http_topic_metric/_flume_parquet_insert_staging/bdyun-suzhou-tw04a0032.su.baidu.internal_1675908007110_628122711
2023-02-09 10:00:12,285 INFO namenode.FSNamesystem
(FSNamesystem.java:checkBlocksComplete(3151)) - BLOCK* blk_1092451654_18751000
is COMMITTED but not COMPLETE(numNodes= 0 < minimum = 1) in file
/user/hive/warehouse/dc_orig.db/o_dc_http_topic_metric/_flume_parquet_insert_staging/bdyun-suzhou-tw04a0032.su.baidu.internal_1675908007110_628122711
2023-02-09 10:00:13,887 INFO namenode.FSNamesystem
(FSNamesystem.java:checkBlocksComplete(3151)) - BLOCK* blk_1092451654_18751000
is COMMITTED but not COMPLETE(numNodes= 0 < minimum = 1) in file
/user/hive/warehouse/dc_orig.db/o_dc_http_topic_metric/_flume_parquet_insert_staging/bdyun-suzhou-tw04a0032.su.baidu.internal_1675908007110_628122711
2023-02-09 10:00:17,089 INFO namenode.FSNamesystem
(FSNamesystem.java:checkBlocksComplete(3151)) - BLOCK* blk_1092451654_18751000
is COMMITTED but not COMPLETE(numNodes= 0 < minimum = 1) in file
/user/hive/warehouse/dc_orig.db/o_dc_http_topic_metric/_flume_parquet_insert_staging/bdyun-suzhou-tw04a0032.su.baidu.internal_1675908007110_628122711
2023-02-09 10:00:23,490 INFO namenode.FSNamesystem
(FSNamesystem.java:checkBlocksComplete(3151)) - BLOCK* blk_1092451654_18751000
is COMMITTED but not COMPLETE(numNodes= 0 < minimum = 1) in file
/user/hive/warehouse/dc_orig.db/o_dc_http_topic_metric/_flume_parquet_insert_staging/bdyun-suzhou-tw04a0032.su.baidu.internal_1675908007110_628122711{code}
!image-2023-02-09-16-36-42-818.png!
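To make the timing easier to check against the log lines above, here is a minimal sketch of the doubling wait described in the issue (400ms, 800ms, 1600ms, 3200ms, 6400ms). The class and method names are hypothetical and are not part of Hadoop. The initial wait of 400 ms and the 5 retries are taken from the description, not read from any configuration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not a Hadoop class): models the doubling wait between close retries. */
final class CloseRetrySchedule {

    /** Returns the wait before each retry in ms: initial, 2 * initial, 4 * initial, ... */
    static List<Long> waitsMs(long initialMs, int retries) {
        List<Long> waits = new ArrayList<>();
        long wait = initialMs;
        for (int i = 0; i < retries; i++) {
            waits.add(wait);
            wait *= 2; // each failed attempt doubles the next wait
        }
        return waits;
    }

    /** Returns the sum of all waits, i.e. how long the client keeps retrying before giving up. */
    static long totalMs(long initialMs, int retries) {
        long total = 0;
        for (long w : waitsMs(initialMs, retries)) {
            total += w;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(waitsMs(400, 5)); // [400, 800, 1600, 3200, 6400]
        System.out.println(totalMs(400, 5)); // 12400
    }
}
```

The total of 12400 ms is close to the roughly 12.4 s between the first and the last "COMMITTED but not COMPLETE" lines (10:00:11,072 to 10:00:23,490), which is consistent with the exception being thrown after the last retry.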
We currently have 10 DataNodes in total for storage, and we will expand the
capacity later. The error is not limited to one node, though; it shows up on
basically all nodes, because basically every task hits this problem, and it is
occurring more and more often.
We are not sure whether it is caused by too many small files. There are
indeed a lot of files in the cluster.
Current number of files in ns1: about 8.8 million (880w)
Number of files in ns2: about 3.22 million (322w)
The same error is reported on both of our nameservices. The ns1hdfs exception
is really confusing. Judging from the code, it is thrown from the method
SecurityUtil.buildTokenService: throw new IllegalArgumentException(
new UnknownHostException(addr.getHostName())
);
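For illustration only, here is a stdlib-only imitation of that check. It is a simplified sketch, not the Hadoop source: the class name is hypothetical, port 8020 is an assumed value, and `InetSocketAddress.createUnresolved` stands in for a name that cannot be resolved through DNS. One possible reading, not confirmed in this thread, is that ns1hdfs is a logical nameservice name that the client ends up treating as a plain hostname.

```java
import java.net.InetSocketAddress;
import java.net.UnknownHostException;

/** Hypothetical stand-in (not the Hadoop class): reproduces only the unresolved-address check. */
final class TokenServiceCheck {

    /** Returns "ip:port" for a resolved address, or throws like the statement quoted above. */
    static String buildTokenService(InetSocketAddress addr) {
        if (addr.isUnresolved()) {
            // The host name has no IP address, so there is nothing to build the service from.
            throw new IllegalArgumentException(new UnknownHostException(addr.getHostName()));
        }
        return addr.getAddress().getHostAddress() + ":" + addr.getPort();
    }

    public static void main(String[] args) {
        // createUnresolved performs no DNS lookup; it models a name that does not resolve.
        InetSocketAddress logical = InetSocketAddress.createUnresolved("ns1hdfs", 8020);
        try {
            buildTokenService(logical);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage()); // java.net.UnknownHostException: ns1hdfs
        }
    }
}
```

The message has the same shape as the first line of the stack trace in the description, which suggests the address for ns1hdfs was unresolved at the moment the client was created inside recoverLease.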
> Block is COMMITTED but not COMPLETE(numNodes= 0 < minimum = 1)
> ---------------------------------------------------------------
>
> Key: HDFS-16912
> URL: https://issues.apache.org/jira/browse/HDFS-16912
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: block placement
> Affects Versions: 3.3.1
> Environment: hadoop:3.3.1
> Reporter: jiangchunyang
> Priority: Major
> Attachments: image-2023-02-09-16-36-42-818.png
>
>
> We use HDFS federation with two nameservices, ns1 and ns2. The table data is
> written under dc-hdfs, but we assign each database to a specific ns according
> to the business division.
> We use parquetWriter to write data to the staging temporary directory of
> each table under a specific ns. When the writer is closed, an exception is
> reported, which triggers our operation to recover the file lease. While the
> lease is being recovered, another exception is reported (shown below).
> It looks like dn and nn have temporarily lost communication, and this doesn't
> happen with every write.
> {code:java}
> java.lang.IllegalArgumentException: java.net.UnknownHostException: ns1hdfs
> at
> org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:445)
> at
> org.apache.hadoop.hdfs.NameNodeProxiesClient.createProxyWithClientProtocol(NameNodeProxiesClient.java:140)
> at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:357)
> at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:291)
> at
> org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:181)
> at
> com.onething.dc.flink.parquet.sink.template.ProcessParquetSinkTemplate.recoverLease(ProcessParquetSinkTemplate.java:180)
> at
> com.onething.dc.flink.parquet.sink.template.ProcessParquetSinkTemplate.close(ProcessParquetSinkTemplate.java:208)
> at
> com.onething.dc.flink.parquet.sink.ProcessParquetSink$1.processElement(ProcessParquetSink.java:118)
> at
> com.onething.dc.flink.parquet.sink.ProcessParquetSink$1.processElement(ProcessParquetSink.java:35)
> at
> org.apache.flink.streaming.api.operators.ProcessOperator.processElement(ProcessOperator.java:66)
> at
> org.apache.flink.streaming.runtime.tasks.OneInputStreamTask$StreamTaskNetworkOutput.emitRecord(OneInputStreamTask.java:205)
> at
> org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.processElement(AbstractStreamTaskNetworkInput.java:134)
> at
> org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.emitNext(AbstractStreamTaskNetworkInput.java:105)
> at
> org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:66)
> at
> org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:419)
> at
> org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:204)
> at
> org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:661)
> at
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:623)
> at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:776)
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:563)
> at java.lang.Thread.run(Thread.java:748)
> Suppressed: java.lang.IllegalArgumentException:
> java.net.UnknownHostException: ns1hdfs
> at
> org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:445)
> at
> org.apache.hadoop.hdfs.NameNodeProxiesClient.createProxyWithClientProtocol(NameNodeProxiesClient.java:140)
> at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:357)
> at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:291)
> at
> org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:181)
> at
> com.onething.dc.flink.parquet.sink.template.ProcessParquetSinkTemplate.recoverLease(ProcessParquetSinkTemplate.java:180)
> at
> com.onething.dc.flink.parquet.sink.template.ProcessParquetSinkTemplate.close(ProcessParquetSinkTemplate.java:208)
> at
> org.apache.flink.api.common.functions.util.FunctionUtils.closeFunction(FunctionUtils.java:41)
> at
> org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.dispose(AbstractUdfStreamOperator.java:117)
> at
> org.apache.flink.streaming.runtime.tasks.StreamTask.disposeAllOperators(StreamTask.java:837)
> at
> org.apache.flink.streaming.runtime.tasks.StreamTask.runAndSuppressThrowable(StreamTask.java:816)
> at
> org.apache.flink.streaming.runtime.tasks.StreamTask.cleanUpInvoke(StreamTask.java:733)
> at
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:637)
> ... 3 more
> Caused by: java.net.UnknownHostException: ns1hdfs
> ... 16 more
> Caused by: java.net.UnknownHostException: ns1hdfs
> ... 21 more {code}
> I can't tell what the problem is, but when I checked the namenode log, I
> found that the block of the file could not change from COMMITTED to
> COMPLETE. The reason is that the dn needs to report an IBR (incremental block
> report) to the namenode when the file is closed, and the file is closed only
> after the acknowledgement is received. However, the dn failed to report the
> IBR, which made it impossible to close the file. The client retries every
> time the close fails, and its waiting time doubles on each attempt: 400ms,
> 800ms, 1600ms, 3200ms, 6400ms. These retries can be seen in the namenode log.
>
> {code:java}
> 2023-02-09 10:00:10,638 INFO hdfs.StateChange
> (FSDirWriteFileOp.java:logAllocatedBlock(802)) - BLOCK* allocate
> blk_1092451654_18751000, replicas=10.146.144.69:1019, 10.146.80.45:1019 for
> /user/hive/warehouse/dc_orig.db/o_dc_http_topic_metric/_flume_parquet_insert_staging/bdyun-suzhou-tw04a0032.su.baidu.internal_1675908007110_628122711
> 2023-02-09 10:00:11,072 INFO namenode.FSNamesystem
> (FSNamesystem.java:checkBlocksComplete(3151)) - BLOCK*
> blk_1092451654_18751000 is COMMITTED but not COMPLETE(numNodes= 0 < minimum
> = 1) in file
> /user/hive/warehouse/dc_orig.db/o_dc_http_topic_metric/_flume_parquet_insert_staging/bdyun-suzhou-tw04a0032.su.baidu.internal_1675908007110_628122711
> 2023-02-09 10:00:11,474 INFO namenode.FSNamesystem
> (FSNamesystem.java:checkBlocksComplete(3151)) - BLOCK*
> blk_1092451654_18751000 is COMMITTED but not COMPLETE(numNodes= 0 < minimum
> = 1) in file
> /user/hive/warehouse/dc_orig.db/o_dc_http_topic_metric/_flume_parquet_insert_staging/bdyun-suzhou-tw04a0032.su.baidu.internal_1675908007110_628122711
> 2023-02-09 10:00:12,285 INFO namenode.FSNamesystem
> (FSNamesystem.java:checkBlocksComplete(3151)) - BLOCK*
> blk_1092451654_18751000 is COMMITTED but not COMPLETE(numNodes= 0 < minimum
> = 1) in file
> /user/hive/warehouse/dc_orig.db/o_dc_http_topic_metric/_flume_parquet_insert_staging/bdyun-suzhou-tw04a0032.su.baidu.internal_1675908007110_628122711
> 2023-02-09 10:00:13,887 INFO namenode.FSNamesystem
> (FSNamesystem.java:checkBlocksComplete(3151)) - BLOCK*
> blk_1092451654_18751000 is COMMITTED but not COMPLETE(numNodes= 0 < minimum
> = 1) in file
> /user/hive/warehouse/dc_orig.db/o_dc_http_topic_metric/_flume_parquet_insert_staging/bdyun-suzhou-tw04a0032.su.baidu.internal_1675908007110_628122711
> 2023-02-09 10:00:17,089 INFO namenode.FSNamesystem
> (FSNamesystem.java:checkBlocksComplete(3151)) - BLOCK*
> blk_1092451654_18751000 is COMMITTED but not COMPLETE(numNodes= 0 < minimum
> = 1) in file
> /user/hive/warehouse/dc_orig.db/o_dc_http_topic_metric/_flume_parquet_insert_staging/bdyun-suzhou-tw04a0032.su.baidu.internal_1675908007110_628122711
> 2023-02-09 10:00:23,490 INFO namenode.FSNamesystem
> (FSNamesystem.java:checkBlocksComplete(3151)) - BLOCK*
> blk_1092451654_18751000 is COMMITTED but not COMPLETE(numNodes= 0 < minimum
> = 1) in file
> /user/hive/warehouse/dc_orig.db/o_dc_http_topic_metric/_flume_parquet_insert_staging/bdyun-suzhou-tw04a0032.su.baidu.internal_1675908007110_628122711{code}
>
>