[ 
https://issues.apache.org/jira/browse/HDFS-9549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Chen updated HDFS-9549:
----------------------------
    Attachment: HDFS-9549.01.patch

This one is a bit tricky to me. IIUC:

- {{testExceedsCapacity}} waits for the DNs to reach their cache capacity, then 
verifies that there are no {{pendingCached}} blocks left.
- The test fails because, although 
{{CacheReplicationMonitor#addNewPendingCached}} checks the remaining bytes 
before adding a block, a datanode can have a block that is being cached on the 
{{CachingTask}} thread: it is no longer pending, but is not yet counted in 
cacheUsed. A new block can then be added as {{pendingCached}}, and it will 
never succeed because the capacity has been reached. (I'll attach the log I 
used to analyze this: 1073741826 is the block that had already transitioned to 
cached, and 1073741841 is the one that was added later and never succeeded. I 
had a {{waitFor}} in place when capturing this log, so the end of the log just 
repeats and can be ignored.)
- Given this root cause, fixing it in 
{{CacheReplicationMonitor#addNewPendingCached}} would be difficult without 
additional synchronization. Instead, I took another approach: in 
{{CacheReplicationMonitor#rescanCachedBlockMap}}, conditionally remove the 
extra pending blocks if a DN doesn't have enough remaining capacity to fit 
them. I understand this makes the scan slower, so I combined it with the 
existing iteration over {{pendingCached}}, hoping to minimize the impact.
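
To illustrate the idea, here is a rough, self-contained sketch. It is not the 
code in the patch and does not use the real Hadoop classes: {{Node}}, 
{{prunePending}} and the byte sizes are all hypothetical stand-ins, and I'm 
assuming a pending block is dropped whenever it no longer fits in the DN's 
remaining capacity.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical, simplified model of the race; not the real Hadoop classes. */
class PendingCachePruneSketch {

  /** Illustrative stand-in for a datanode's cache accounting. */
  static final class Node {
    final long capacity;  // total cache capacity in bytes
    long cacheUsed;       // bytes already counted as cached
    final List<Long> pendingCached = new ArrayList<>(); // pending block sizes

    Node(long capacity, long cacheUsed) {
      this.capacity = capacity;
      this.cacheUsed = cacheUsed;
    }

    long remaining() {
      return capacity - cacheUsed;
    }
  }

  /**
   * Walks the pending list once and removes every block that does not fit in
   * the capacity left after the earlier pending blocks are accounted for.
   * Returns the number of blocks removed.
   */
  static int prunePending(Node dn) {
    long budget = dn.remaining();
    int removed = 0;
    for (Iterator<Long> it = dn.pendingCached.iterator(); it.hasNext();) {
      long size = it.next();
      if (size <= budget) {
        budget -= size;   // this block can still be cached
      } else {
        it.remove();      // this block can never succeed, so drop it
        removed++;
      }
    }
    return removed;
  }

  public static void main(String[] args) {
    long blockSize = 4096;
    // Room for 4 blocks: 3 are counted as used, and a 4th is still being
    // cached (no longer pending, but not yet counted in cacheUsed).
    Node dn = new Node(4 * blockSize, 3 * blockSize);

    // The monitor sees one block of free space and queues another block.
    if (blockSize <= dn.remaining()) {
      dn.pendingCached.add(blockSize);
    }

    // The in-flight block finishes and is now counted as used.
    dn.cacheUsed += blockSize;

    // The queued block no longer fits, so the rescan drops it.
    int removed = prunePending(dn);
    System.out.println("removed=" + removed + ", pending=" + dn.pendingCached);
  }
}
```

In this toy model the late block is removed on the next pass, so the pending 
list ends up empty, which is what {{checkPendingCachedEmpty}} expects. Doing 
the check inside the same loop that already walks the pending list avoids a 
second pass over it.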

[~andrew.wang] and [~cmccabe], could you please review? Thanks!

> TestCacheDirectives#testExceedsCapacity is flaky
> ------------------------------------------------
>
>                 Key: HDFS-9549
>                 URL: https://issues.apache.org/jira/browse/HDFS-9549
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 3.0.0
>         Environment: Jenkins
>            Reporter: Wei-Chiu Chuang
>            Assignee: Xiao Chen
>         Attachments: HDFS-9549.01.patch
>
>
> I have observed that this test (TestCacheDirectives.testExceedsCapacity) 
> fails quite frequently in Jenkins (trunk, trunk-Java8).
> Error Message
> Pending cached list of 127.0.0.1:54134 is not empty, [{blockId=1073741841, 
> replication=1, mark=true}]
> Stacktrace
> java.lang.AssertionError: Pending cached list of 127.0.0.1:54134 is not 
> empty, [{blockId=1073741841, replication=1, mark=true}]
>       at org.junit.Assert.fail(Assert.java:88)
>       at org.junit.Assert.assertTrue(Assert.java:41)
>       at 
> org.apache.hadoop.hdfs.server.namenode.TestCacheDirectives.checkPendingCachedEmpty(TestCacheDirectives.java:1479)
>       at 
> org.apache.hadoop.hdfs.server.namenode.TestCacheDirectives.testExceedsCapacity(TestCacheDirectives.java:1502)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
