[
https://issues.apache.org/jira/browse/HDDS-2477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16982798#comment-16982798
]
Attila Doroszlai commented on HDDS-2477:
----------------------------------------
Hi [~bharat], there was a unit test failure in {{TestTableCacheImpl}} during
the post-commit build, but I see that the PR build was clean. Could this be an
intermittent problem?
{code:title=https://github.com/apache/hadoop-ozone/runs/321747893}
2019-11-26T18:27:13.2498941Z [ERROR] Tests run: 10, Failures: 1, Errors: 0,
Skipped: 0, Time elapsed: 2.813 s <<< FAILURE! - in
org.apache.hadoop.hdds.utils.db.cache.TestTableCacheImpl
2019-11-26T18:27:13.2505113Z [ERROR]
testPartialTableCacheWithOverrideAndDelete[0](org.apache.hadoop.hdds.utils.db.cache.TestTableCacheImpl)
Time elapsed: 0.135 s <<< FAILURE!
2019-11-26T18:27:13.2507359Z java.lang.AssertionError: expected:<2> but was:<6>
2019-11-26T18:27:13.2510182Z at org.junit.Assert.fail(Assert.java:88)
2019-11-26T18:27:13.2513376Z at
org.junit.Assert.failNotEquals(Assert.java:743)
2019-11-26T18:27:13.2515256Z at
org.junit.Assert.assertEquals(Assert.java:118)
2019-11-26T18:27:13.2517279Z at
org.junit.Assert.assertEquals(Assert.java:555)
2019-11-26T18:27:13.2520318Z at
org.junit.Assert.assertEquals(Assert.java:542)
2019-11-26T18:27:13.2544916Z at
org.apache.hadoop.hdds.utils.db.cache.TestTableCacheImpl.testPartialTableCacheWithOverrideAndDelete(TestTableCacheImpl.java:308)
{code}
> TableCache cleanup issue for OM non-HA
> --------------------------------------
>
> Key: HDDS-2477
> URL: https://issues.apache.org/jira/browse/HDDS-2477
> Project: Hadoop Distributed Data Store
> Issue Type: Bug
> Components: Ozone Manager
> Reporter: Bharat Viswanadham
> Assignee: Bharat Viswanadham
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.5.0
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> In the OM non-HA case, the ratisTransactionLogIndex is generated by
> OmProtocolServersideTranslatorPB.java, and validateAndUpdateCache is called
> from multiple handler threads. So consider a case where one thread with
> index 10 has added its entry to the doubleBuffer while indices 0-9 have not
> yet been added. The doubleBuffer flush thread flushes and calls cleanup,
> which removes all cache entries with an epoch less than 10. Cleanup should
> not remove entries that were put into the cache later and are still in the
> process of being flushed to the DB. This causes inconsistency for some OM
> requests.
>
>
> Example:
> 4 threads committing 4 parts of a multipart upload.
> 1st thread - part 1 - ratis index - 3
> 2nd thread - part 2 - ratis index - 2
> 3rd thread - part 3 - ratis index - 1
>
> The first thread acquires the lock and puts OmMultipartInfo (with part 1)
> into the doubleBuffer and the cache, and cleanup is called to remove all
> cache entries with an epoch less than 3. In the meantime, the 2nd and 3rd
> threads put OmMultipartInfo for parts 2 and 3 into the cache and
> doubleBuffer. But the first thread's cleanup, called with index 3, may
> remove those entries.
>
> Now when the 4th part upload arrives and commits the multipart upload, it
> reads the multipart info and finds only part 1 in the OmMultipartInfo,
> because the OmMultipartInfo with parts 1, 2, and 3 is still in the process
> of being committed to the DB. So after the 4th part upload completes, the
> DB and cache contain only parts 1 and 4; the part 2 and 3 information is
> lost.
>
> So for the non-HA case, cleanup will be called with the list of epochs that
> need to be cleaned up.
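The race described above can be sketched in a few lines. This is a hypothetical, simplified model (the class and method names below are illustrative, not the actual Ozone TableCache API): the cache maps a key to the epoch (ratis index) at which it was written, threshold-based cleanup evicts everything below a flushed index, and the proposed fix evicts only the exact epochs the double buffer has flushed.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch, not the real Ozone TableCache API.
class SketchCache {
    // cache key -> epoch (ratis index) at which the entry was written
    private final Map<String, Long> cache = new ConcurrentHashMap<>();

    void put(String key, long epoch) { cache.put(key, epoch); }
    Long get(String key) { return cache.get(key); }

    // Buggy cleanup: evicts every entry below the threshold, including
    // entries added later that are still being flushed to the DB.
    void cleanupBelow(long epochThreshold) {
        cache.entrySet().removeIf(e -> e.getValue() < epochThreshold);
    }

    // Proposed fix: evict only the epochs the double buffer actually flushed.
    void cleanupEpochs(List<Long> flushedEpochs) {
        cache.entrySet().removeIf(e -> flushedEpochs.contains(e.getValue()));
    }
}

public class TableCacheCleanupSketch {
    public static void main(String[] args) {
        // Threshold-based cleanup loses the in-flight parts 2 and 3.
        SketchCache buggy = new SketchCache();
        buggy.put("part1", 3);  // first thread, index 3, already flushed
        buggy.put("part2", 2);  // still in flight
        buggy.put("part3", 1);  // still in flight
        buggy.cleanupBelow(3);  // flush thread cleans up everything < 3
        System.out.println(buggy.get("part2")); // null: part 2 lost

        // List-based cleanup preserves the in-flight entries.
        SketchCache fixed = new SketchCache();
        fixed.put("part1", 3);
        fixed.put("part2", 2);
        fixed.put("part3", 1);
        fixed.cleanupEpochs(List.of(3L)); // only epoch 3 has been flushed
        System.out.println(fixed.get("part2")); // 2: part 2 survives
    }
}
```

With the list-based variant, the flush thread removes exactly what it flushed, so entries written by slower threads with lower indices stay in the cache until their own flush completes.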
--
This message was sent by Atlassian Jira
(v8.3.4#803005)