[jira] [Updated] (HDFS-17920) TestDiskError.testShutdown can run into infinite loop

Jira Sat, 16 May 2026 22:42:08 -0700


     [ 
https://issues.apache.org/jira/browse/HDFS-17920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


András Bokor updated HDFS-17920:
--------------------------------
    Description: 
The test case checks whether the DN shuts down when there is a disk failure
(simulated by the test).

We found that when this feature does not work, for whatever reason,
TestDiskError.testShutdown never finishes and consumes all available storage
space. The log file grew to around 11 GB; with a larger container it would
grow further.

Since the log file is huge and the test can run indefinitely, I suspect there
is an infinite loop somewhere in the test.

I checked what loops exist [in the test file|https://github.com/apache/hadoop/blob/734dd8a67cd6df56b59ff75aa43de57834a0d248/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestDiskError.java#L121];
there aren't many, and with one exception, they all run only a few iterations:
{code:java}
DataNode dn = cluster.getDataNodes().get(dnIndex);
for (int i=0; dn.isDatanodeUp(); i++) {
  Path fileName = new Path("/test.txt"+i);
  DFSTestUtil.createFile(fs, fileName, 1024, (short)2, 1L);
  DFSTestUtil.waitReplication(fs, fileName, (short)2);
  fs.delete(fileName, true);
}
{code}
Here, we keep creating and deleting new files until the DataNode (DN) dies. I
don't know how long replication takes, but given the 1024-byte file size and
the replication factor of 2, it should be quick. This section is suspicious:
if the "bad" DN never shuts itself down, the loop never exits, and it is
conceivable that a vast number of file operations are generating a massive
amount of logs.

I ran a grep on the log file to see how many iterations were executed, and I
found a line like this:

 
{code:java}
BLOCK* allocate blk_1073970157_229333, replicas=127.0.0.1:34219, 
127.0.0.1:39923 for /test.txt114166{code}
 

This indicates that this single unit test case generates over a hundred
thousand file operations on its own. The log I examined covers a half-hour
window, so the loop is running about 60 times per second; I'm not even sure
whether that rate makes sense.
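
As a rough cross-check of that rate, the sketch below (plain Java, no Hadoop dependencies) extracts the file index from a log line like the one above and divides it by the window length. The names LoopRate, fileIndex and iterationsPerSecond are hypothetical, and the calculation assumes the half-hour window (1800 seconds) begins near the first iteration, which the log excerpt alone does not confirm.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for estimating how fast the test loop is spinning. */
final class LoopRate {
  // Matches the numeric suffix of the per-iteration file name, e.g. /test.txt114166.
  private static final Pattern FILE_INDEX = Pattern.compile("/test\\.txt(\\d+)");

  private LoopRate() {}

  /** Returns the loop index embedded in the file name found in the log line. */
  static long fileIndex(String logLine) {
    Matcher m = FILE_INDEX.matcher(logLine);
    if (!m.find()) {
      throw new IllegalArgumentException("no /test.txt<N> path in: " + logLine);
    }
    return Long.parseLong(m.group(1));
  }

  /** Average iterations per second over the observed window. */
  static double iterationsPerSecond(long iterations, long windowSeconds) {
    return (double) iterations / windowSeconds;
  }
}
```

Applied to the line above, fileIndex gives 114166, and 114166 / 1800 is roughly 63 iterations per second, which is consistent with the estimate of about 60.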

Introducing some kind of interval plus a timeout would likely help: as the test
currently works, if the feature under test fails, you don't get an assertion
error, you get an infinite loop.
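
To illustrate the "interval plus timeout" idea, here is a minimal sketch in plain Java. It is not the actual patch: BoundedLoop and runUntil are hypothetical names, and the interval and timeout values would need to be chosen for the real test. The loop runs a step, pauses, and fails with an AssertionError once the deadline passes instead of spinning forever.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

/** Hypothetical helper, not part of Hadoop: a loop bounded by a deadline. */
final class BoundedLoop {
  private BoundedLoop() {}

  /**
   * Runs {@code step} and then sleeps for {@code interval}, repeating until
   * {@code done} reports true. Returns the number of steps executed.
   * Throws AssertionError if {@code timeout} elapses before {@code done} holds.
   */
  static int runUntil(BooleanSupplier done, Runnable step,
                      Duration interval, Duration timeout) {
    long deadline = System.nanoTime() + timeout.toNanos();
    int iterations = 0;
    while (!done.getAsBoolean()) {
      // Overflow-safe comparison of nanoTime values.
      if (System.nanoTime() - deadline >= 0) {
        throw new AssertionError("condition not met within " + timeout
            + " after " + iterations + " iterations");
      }
      step.run();
      iterations++;
      try {
        // The pause throttles file operations and therefore log volume.
        Thread.sleep(interval.toMillis());
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException("interrupted while waiting", e);
      }
    }
    return iterations;
  }
}
```

In the test, the condition would presumably be the negation of dn.isDatanodeUp() and the step would wrap the create, waitReplication and delete calls (which throw checked exceptions and would need wrapping in a Runnable). A failing shutdown would then surface as an assertion error after the timeout rather than as an endlessly growing log.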

*Please note that* {*}we are not addressing the root cause of the possible 
shutdown failure{*}; instead, we are targeting the resulting infinite loop and 
the unnecessarily large log file.

Also, I have set the priority to Critical (even though a unit test issue would
not normally warrant it) because allowing a test to loop indefinitely and
exhaust host storage poses a severe operational risk. It can trigger OS-level
instability, corrupt local application caches/databases, and act as a
*CI/CD Blocker* by causing disk-full failures on shared build agents,
ultimately disrupting the entire engineering pipeline.

 

  was:
Also, I have set the priority to Critical (even though a unit test issue would
not normally warrant it) because this issue can block the CI process or cause
problems on a local computer.

 


> TestDiskError.testShutdown can run into infinite loop
> -----------------------------------------------------
>
>                 Key: HDFS-17920
>                 URL: https://issues.apache.org/jira/browse/HDFS-17920
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>            Reporter: András Bokor
>            Assignee: András Bokor
>            Priority: Critical
>              Labels: pull-request-available


