[ 
https://issues.apache.org/jira/browse/HDFS-17920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

András Bokor updated HDFS-17920:
--------------------------------
    Summary: TestDiskError.testShutdown can run into infinite loop  (was: 
TestDiskError.testShutdown can run into infint loop)

> TestDiskError.testShutdown can run into infinite loop
> -----------------------------------------------------
>
>                 Key: HDFS-17920
>                 URL: https://issues.apache.org/jira/browse/HDFS-17920
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>            Reporter: András Bokor
>            Priority: Critical
>
> When running the JUnit tests, we found that TestDiskError.testShutdown runs 
> for a long time without finishing and consumes all available storage space. 
> The log file reaches around 11 GB, and it grows further if the container 
> size is increased.
> Since the log file is huge and the test seems able to run indefinitely, we 
> suspect that there is an infinite loop somewhere in the test.
> I checked what loops exist [in the test 
> file;|https://github.com/apache/hadoop/blob/734dd8a67cd6df56b59ff75aa43de57834a0d248/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestDiskError.java#L121]
>  there aren't many, and with one exception, they all run only a few 
> iterations:
> {code:java}
> DataNode dn = cluster.getDataNodes().get(dnIndex);
> for (int i = 0; dn.isDatanodeUp(); i++) {
>   Path fileName = new Path("/test.txt" + i);
>   DFSTestUtil.createFile(fs, fileName, 1024, (short) 2, 1L);
>   DFSTestUtil.waitReplication(fs, fileName, (short) 2);
>   fs.delete(fileName, true);
> }
> {code}
> Here, we keep creating and deleting new files until the DataNode (DN) dies. I 
> don't know how long the replication takes, but based on the file size and the 
> replication factor of 2, it should happen quickly. This section is suspicious 
> because, if the "bad" DN doesn't shut itself down, the loop never exits, and 
> it's conceivable that a vast number of file operations generates a massive 
> amount of logs.
> I ran a grep on the log file to see how many iterations are executed, and I 
> found a line like this:
>  
> {code:java}
> BLOCK* allocate blk_1073970157_229333, replicas=127.0.0.1:34219, 
> 127.0.0.1:39923 for /test.txt114166{code}
>  
> This indicates that this single unit test case generates over a hundred 
> thousand file operations on its own. Based on the log I examined, which 
> covers a half-hour window, the loop runs about 60 times per second; I'm 
> not even sure whether that rate makes sense.
> Introducing a pause between iterations plus a timeout would likely help: as 
> the test is currently written, if the feature under test fails, you don't 
> get an assertion error; you get an infinite loop.
> *Please note that* in our internal release, this unit test fails because the 
> faulty DataNode does not shut down. In this ticket, {*}we are not addressing 
> the root cause of the shutdown failure{*}; instead, we are targeting the 
> resulting infinite loop and the unnecessarily large log file.
> Also, I have set the priority to Critical (even though a unit test failure 
> would not normally warrant it) because this issue can block the CI process.
>  
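
The "interval plus a timeout" idea from the description could look roughly 
like the sketch below. This is only an illustration, not the actual fix for 
this ticket: BoundedLoop and runWhile are hypothetical names that do not 
exist in Hadoop, and the timeout and pause values are arbitrary. In the real 
test, the condition would presumably be dn.isDatanodeUp() and the body would 
be the create / waitReplication / delete calls quoted above.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical helper (not part of Hadoop): a condition-driven loop with a pause and a deadline. */
final class BoundedLoop {
  private BoundedLoop() {}

  /**
   * Runs body while condition holds, sleeping pauseMillis after each iteration.
   * Throws AssertionError if the condition is still true once timeoutMillis has
   * elapsed, so a stuck condition fails the test instead of looping forever.
   * Returns the number of iterations that were executed.
   */
  static int runWhile(BooleanSupplier condition, Runnable body,
                      long timeoutMillis, long pauseMillis) {
    long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
    int iterations = 0;
    while (condition.getAsBoolean()) {
      if (System.nanoTime() - deadline >= 0) {
        throw new AssertionError("condition still true after " + timeoutMillis
            + " ms (" + iterations + " iterations)");
      }
      body.run();
      iterations++;
      try {
        Thread.sleep(pauseMillis); // the interval: caps the iteration rate and log volume
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new AssertionError("interrupted while waiting", e);
      }
    }
    return iterations;
  }

  public static void main(String[] args) {
    // Stand-in for dn.isDatanodeUp(): reports "up" for the first five checks only.
    int[] checks = {0};
    int n = runWhile(() -> checks[0]++ < 5, () -> { }, 10_000L, 10L);
    System.out.println("loop ended after " + n + " iterations");
  }
}
```

With a helper of this shape, a DataNode that never shuts down would surface 
as an AssertionError after the timeout, and the pause would keep the number 
of file operations (and log lines) bounded in the meantime.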



