[
https://issues.apache.org/jira/browse/HDFS-9549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Xiao Chen updated HDFS-9549:
----------------------------
Attachment: HDFS-9549.01.patch
This one is a bit tricky to me. IIUC:
- {{testExceedsCapacity}} waits for the DNs to reach their cache capacity, then
verifies that no {{pendingCached}} blocks are left.
- The test fails because, although {{CacheReplicationMonitor#addNewPendingCached}}
checks the remaining bytes before adding a block, a datanode can have a block
that is still being cached on the {{CachingTask}} thread: it is no longer
pending, but it is not yet counted in cacheUsed. A new block can then be added
to {{pendingCached}}, and it will never succeed because the capacity has
already been reached. (I'll attach the log I used to analyze this: 1073741826
is the block that had already transitioned to cached, and 1073741841 is the one
that was added later and never succeeded. I had a {{waitFor}} in place when
capturing this log, so the end of the log just repeats and can be ignored.)
- Given this root cause, fixing it in
{{CacheReplicationMonitor#addNewPendingCached}} would be difficult without some
synchronization. Instead, I took another approach: in
{{CacheReplicationMonitor#rescanCachedBlockMap}}, conditionally remove the
extra blocks if a DN doesn't have enough remaining capacity to fit them. I
understand that this makes the scan slower, so I combined it with the existing
iteration over {{pendingCached}}, hoping to minimize the impact.
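To make the race described above easier to follow, here is a minimal toy model in plain Java. It is only a sketch and not the Hadoop code: the class {{CacheRaceSketch}}, its fields, the fixed block size, and the {{pruneUnfittable}} helper are hypothetical names invented for illustration, and the pruning step only approximates the idea of dropping pending blocks that no longer fit during the rescan. The two large block IDs are the ones from the log.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Toy model of the NameNode's view of one datanode's cache.
// Hypothetical names throughout; this is not the Hadoop implementation.
class CacheRaceSketch {
    static final long BLOCK = 4096;  // assumption: every block has the same size

    final long capacity;
    long cacheUsed = 0;                                   // bytes already reported as cached
    final List<Long> pendingCached = new ArrayList<>();   // scheduled, not started yet
    final List<Long> inFlight = new ArrayList<>();        // being cached, not yet reported

    CacheRaceSketch(long capacity) {
        this.capacity = capacity;
    }

    // Capacity check that only sees cacheUsed and pendingCached,
    // so blocks that are in flight are invisible to it.
    boolean addNewPendingCached(long blockId) {
        long remaining = capacity - cacheUsed - BLOCK * pendingCached.size();
        if (remaining < BLOCK) {
            return false;
        }
        pendingCached.add(blockId);
        return true;
    }

    // The caching thread picks the block up: no longer pending, not yet counted.
    void startCaching(long blockId) {
        pendingCached.remove(Long.valueOf(blockId));
        inFlight.add(blockId);
    }

    // Caching finished and was reported: the bytes now count as used.
    void reportCached(long blockId) {
        if (inFlight.remove(Long.valueOf(blockId))) {
            cacheUsed += BLOCK;
        }
    }

    // Rescan-time cleanup: walk pendingCached once and drop blocks that
    // cannot fit in the space that is actually left. Returns the number dropped.
    int pruneUnfittable() {
        long free = capacity - cacheUsed;
        int removed = 0;
        for (Iterator<Long> it = pendingCached.iterator(); it.hasNext();) {
            it.next();
            if (free < BLOCK) {
                it.remove();
                removed++;
            } else {
                free -= BLOCK;
            }
        }
        return removed;
    }

    public static void main(String[] args) {
        CacheRaceSketch dn = new CacheRaceSketch(2 * BLOCK);

        // One block is fully cached and accounted for.
        dn.addNewPendingCached(1L);
        dn.startCaching(1L);
        dn.reportCached(1L);

        // 1073741826 is picked up by the caching thread but not yet reported.
        dn.addNewPendingCached(1073741826L);
        dn.startCaching(1073741826L);

        // The check cannot see the in-flight block, so the late block is admitted.
        boolean admitted = dn.addNewPendingCached(1073741841L);
        dn.reportCached(1073741826L);

        System.out.println("late block admitted: " + admitted);
        System.out.println("stuck in pendingCached: " + dn.pendingCached);
        System.out.println("pruned on rescan: " + dn.pruneUnfittable());
        System.out.println("pendingCached after prune: " + dn.pendingCached);
    }
}
```

In this model the late block stays in {{pendingCached}} indefinitely unless something removes it, which matches the assertion failure in the test. Pruning during the pass that already iterates over {{pendingCached}} avoids adding a lock around the capacity check, at the cost of a little extra work per rescan.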
[~andrew.wang] and [~cmccabe], could you please review? Thanks!
> TestCacheDirectives#testExceedsCapacity is flaky
> ------------------------------------------------
>
> Key: HDFS-9549
> URL: https://issues.apache.org/jira/browse/HDFS-9549
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 3.0.0
> Environment: Jenkins
> Reporter: Wei-Chiu Chuang
> Assignee: Xiao Chen
> Attachments: HDFS-9549.01.patch
>
>
> I have observed that this test (TestCacheDirectives.testExceedsCapacity)
> fails quite frequently in Jenkins (trunk, trunk-Java8)
> Error Message
> Pending cached list of 127.0.0.1:54134 is not empty, [{blockId=1073741841, replication=1, mark=true}]
> Stacktrace
> java.lang.AssertionError: Pending cached list of 127.0.0.1:54134 is not empty, [{blockId=1073741841, replication=1, mark=true}]
> at org.junit.Assert.fail(Assert.java:88)
> at org.junit.Assert.assertTrue(Assert.java:41)
> at org.apache.hadoop.hdfs.server.namenode.TestCacheDirectives.checkPendingCachedEmpty(TestCacheDirectives.java:1479)
> at org.apache.hadoop.hdfs.server.namenode.TestCacheDirectives.testExceedsCapacity(TestCacheDirectives.java:1502)