[
https://issues.apache.org/jira/browse/HDFS-17920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18081430#comment-18081430
]
András Bokor commented on HDFS-17920:
-------------------------------------
The improvement cut the log file size roughly in half (10M to 5.3M):
{code:java}
➜ hadoop-hdfs git:(trunk) ls -lha target/surefire-reports/
total 43624
drwxr-xr-x@ 5 andrasbokor staff 160B May 16 18:56 .
drwxr-xr-x@ 17 andrasbokor staff 544B May 16 18:52 ..
-rw-r--r--@ 1 andrasbokor staff 10M May 16 18:52 TEST-org.apache.hadoop.hdfs.server.datanode.TestDiskError.xml
-rw-r--r--@ 1 andrasbokor staff 10M May 16 18:52 org.apache.hadoop.hdfs.server.datanode.TestDiskError-output.txt
-rw-r--r--@ 1 andrasbokor staff 354B May 16 18:52 org.apache.hadoop.hdfs.server.datanode.TestDiskError.txt
{code}
vs
{code:java}
ls -lah hadoop-hdfs-project/hadoop-hdfs/target/surefire-reports/
total 23344
drwxr-xr-x@ 5 andrasbokor staff 160B May 16 19:10 .
drwxr-xr-x@ 17 andrasbokor staff 544B May 16 19:10 ..
-rw-r--r--@ 1 andrasbokor staff 5.3M May 16 19:10 TEST-org.apache.hadoop.hdfs.server.datanode.TestDiskError.xml
-rw-r--r--@ 1 andrasbokor staff 5.3M May 16 19:10 org.apache.hadoop.hdfs.server.datanode.TestDiskError-output.txt
-rw-r--r--@ 1 andrasbokor staff 354B May 16 19:10 org.apache.hadoop.hdfs.server.datanode.TestDiskError.txt
{code}
Also, the loop now runs about 10-20 times instead of 400+:
{code:java}
grep "test.txt" hadoop-hdfs-project/hadoop-hdfs/target/surefire-reports/org.apache.hadoop.hdfs.server.datanode.TestDiskError-output.txt | tail -n 1
2026-05-16 18:52:39,984 [main] INFO hdfs.DFSTestUtil (DFSTestUtil.java:waitReplication(836)) - All blocks of file /test.txt430 verified to have replication factor 2
{code}
vs
{code:java}
2026-05-16 19:10:27,491 [main] INFO hdfs.DFSTestUtil (DFSTestUtil.java:waitReplication(836)) - All blocks of file /test.txt12 verified to have replication factor 2
{code}
{*}Note{*}: The metrics above represent a positive side effect; the core focus
of this fix remains preventing the infinite loop.
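For readers who have not opened the patch, the general idea of bounding such a
loop with an interval and a timeout can be sketched roughly as follows. This is
not the committed change: the helper name runWhileUp, the interval and the
timeout values are hypothetical, and a plain counter stands in for the
MiniDFSCluster and DataNode calls so that the sketch runs on its own.

```java
import java.util.function.BooleanSupplier;

final class BoundedLoopSketch {

    // Hypothetical helper, not part of Hadoop or of the actual patch.
    // Repeats `work` while `stillUp` is true, sleeping `intervalMs` between
    // iterations, and fails with an AssertionError once `timeoutMs` has
    // elapsed instead of spinning forever. Returns the iteration count.
    static int runWhileUp(BooleanSupplier stillUp, Runnable work,
                          long intervalMs, long timeoutMs) {
        final long start = System.nanoTime();
        final long timeoutNanos = timeoutMs * 1_000_000L;
        int iterations = 0;
        while (stillUp.getAsBoolean()) {
            if (System.nanoTime() - start >= timeoutNanos) {
                throw new AssertionError("Condition still true after "
                    + timeoutMs + " ms and " + iterations + " iterations");
            }
            work.run();
            iterations++;
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and stop waiting.
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting", e);
            }
        }
        return iterations;
    }

    public static void main(String[] args) {
        // Stand-in for a DataNode that goes down after three write cycles.
        int[] cycles = {0};
        int n = runWhileUp(() -> cycles[0] < 3, () -> cycles[0]++, 10, 1_000);
        System.out.println("iterations=" + n);
    }
}
```

With a shape like this, a DataNode that never shuts down produces an assertion
failure after the timeout rather than an ever-growing log, and the sleep
between iterations keeps the number of file operations small.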
> TestDiskError.testShutdown can run into infinite loop
> -----------------------------------------------------
>
> Key: HDFS-17920
> URL: https://issues.apache.org/jira/browse/HDFS-17920
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode
> Reporter: András Bokor
> Priority: Critical
>
> The test case tests if DN shuts down when there is a disk failure (that is
> simulated by the test).
> We found that when this feature does not work for whatever reason,
> TestDiskError.testShutdown runs for a long time without finishing and
> consumes all the storage space. The log file is somewhere around 11 GB, and
> it would grow further with a larger container.
> Since the log file is huge and the test is capable of running indefinitely,
> I suspect there is an infinite loop somewhere in the test.
> I checked what loops exist [in the test
> file;|https://github.com/apache/hadoop/blob/734dd8a67cd6df56b59ff75aa43de57834a0d248/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestDiskError.java#L121]
> there aren't many, and with one exception, they all run only a few
> iterations:
> {code:java}
> DataNode dn = cluster.getDataNodes().get(dnIndex);
> for (int i=0; dn.isDatanodeUp(); i++) {
>   Path fileName = new Path("/test.txt"+i);
>   DFSTestUtil.createFile(fs, fileName, 1024, (short)2, 1L);
>   DFSTestUtil.waitReplication(fs, fileName, (short)2);
>   fs.delete(fileName, true);
> }
> {code}
> Here, we keep creating and deleting new files until the DataNode (DN) dies. I
> don't know how long the replication takes, but based on the file size and the
> replication factor of 2, it should happen quickly. This is a suspicious
> section because if the test doesn't finish quickly (meaning the "bad" DN
> doesn't shut itself down), it’s conceivable that a vast number of file
> operations are generating a massive amount of logs. I ran a grep on the log
> file to see how many iterations are executed, and I found a line like this:
>
> {code:java}
> BLOCK* allocate blk_1073970157_229333, replicas=127.0.0.1:34219, 127.0.0.1:39923 for /test.txt114166
> {code}
>
> This indicates that this single unit test case generates over a hundred
> thousand file operations on its own. Based on the log I examined, which
> covers a half-hour window, the loop is running about 60 times per second; I'm
> not even sure if this makes sense.
> Introducing some kind of interval plus a timeout would likely help, as the
> test currently works in a way where if the feature under test fails, you
> don't get an assertion error—you get an infinite loop.
> *Please note that* {*}we are not addressing the root cause of the possible
> shutdown failure{*}; instead, we are targeting the resulting infinite loop
> and the unnecessarily large log file.
> Also, I have set the priority to Critical (even though a unit test failure
> does not normally warrant that) because this issue can block the CI process.
>