[ 
https://issues.apache.org/jira/browse/HDDS-2477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16982798#comment-16982798
 ] 

Attila Doroszlai commented on HDDS-2477:
----------------------------------------

Hi [~bharat], there was a unit test failure in {{TestTableCacheImpl}} during the 
post-commit build, but I see that the PR build was clean.  Could this be an 
intermittent problem?

{code:title=https://github.com/apache/hadoop-ozone/runs/321747893}
2019-11-26T18:27:13.2498941Z [ERROR] Tests run: 10, Failures: 1, Errors: 0, 
Skipped: 0, Time elapsed: 2.813 s <<< FAILURE! - in 
org.apache.hadoop.hdds.utils.db.cache.TestTableCacheImpl
2019-11-26T18:27:13.2505113Z [ERROR] 
testPartialTableCacheWithOverrideAndDelete[0](org.apache.hadoop.hdds.utils.db.cache.TestTableCacheImpl)
  Time elapsed: 0.135 s  <<< FAILURE!
2019-11-26T18:27:13.2507359Z java.lang.AssertionError: expected:<2> but was:<6>
2019-11-26T18:27:13.2510182Z    at org.junit.Assert.fail(Assert.java:88)
2019-11-26T18:27:13.2513376Z    at 
org.junit.Assert.failNotEquals(Assert.java:743)
2019-11-26T18:27:13.2515256Z    at 
org.junit.Assert.assertEquals(Assert.java:118)
2019-11-26T18:27:13.2517279Z    at 
org.junit.Assert.assertEquals(Assert.java:555)
2019-11-26T18:27:13.2520318Z    at 
org.junit.Assert.assertEquals(Assert.java:542)
2019-11-26T18:27:13.2544916Z    at 
org.apache.hadoop.hdds.utils.db.cache.TestTableCacheImpl.testPartialTableCacheWithOverrideAndDelete(TestTableCacheImpl.java:308)
{code}

> TableCache cleanup issue for OM non-HA
> --------------------------------------
>
>                 Key: HDDS-2477
>                 URL: https://issues.apache.org/jira/browse/HDDS-2477
>             Project: Hadoop Distributed Data Store
>          Issue Type: Bug
>          Components: Ozone Manager
>            Reporter: Bharat Viswanadham
>            Assignee: Bharat Viswanadham
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.5.0
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> In the OM non-HA case, the ratisTransactionLogIndex is generated by 
> OmProtocolServersideTranslatorPB.java, and validateAndUpdateCache is called 
> from multiple handler threads. So consider a case where one thread with 
> index 10 has added its entry to the doubleBuffer while indices 0-9 have not 
> been added yet. The doubleBuffer flush thread flushes and calls cleanup, 
> which evicts all cache entries with epoch less than 10. But cleanup must not 
> evict entries that were put into the cache later and are still in the 
> process of being flushed to the DB. This causes inconsistency for those OM 
> requests.
>  
>  
> Example:
> 4 threads committing 4 parts.
> 1st thread - part 1 - ratis index - 3
> 2nd thread - part 2 - ratis index - 2
> 3rd thread - part 3 - ratis index - 1
>  
> The first thread acquires the lock and puts OmMultipartInfo (with part 1) 
> into the doubleBuffer and the cache. Cleanup is then called to evict all 
> cache entries with epoch less than 3. In the meantime, the 2nd and 3rd 
> threads put OmMultipartInfo with parts 2 and 3 into the cache and the 
> doubleBuffer. But the first thread's cleanup, called with index 3, may evict 
> those entries.
>  
> Now when the 4th part upload arrives and the multipart upload is committed, 
> fetching the multipart info returns only part 1 in OmMultipartInfo, because 
> the OmMultipartInfo with parts 1, 2, 3 is still in the process of being 
> committed to the DB. So after the 4th part upload completes, the DB and the 
> cache hold only parts 1 and 4; the part 2 and 3 information is lost.
>  
> So for the non-HA case, cleanup will be called with the list of epochs that 
> need to be cleaned up.
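The race in the quoted description can be sketched in a few lines. This is a hypothetical illustration, not code from the Ozone tree: the class and method names (EpochCleanupSketch, cleanupUpTo, cleanupEpochs) are invented, and the cache is reduced to an epoch-to-value map. It contrasts threshold-based cleanup (safe only when epochs flush in order) with the epoch-list cleanup proposed as the fix.

```java
import java.util.Arrays;
import java.util.List;
import java.util.NavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical sketch of the cleanup race; names are not from the Ozone codebase.
public class EpochCleanupSketch {
    // epoch (ratis index) -> cached value; package-private for illustration
    final NavigableMap<Long, String> cache = new ConcurrentSkipListMap<>();

    void put(long epoch, String value) {
        cache.put(epoch, value);
    }

    // HA-style cleanup: assumes epochs flush in order, so everything up to
    // and including 'epoch' is evicted.
    void cleanupUpTo(long epoch) {
        cache.headMap(epoch, true).clear();
    }

    // Non-HA fix sketched in the issue: evict only the epochs actually flushed.
    void cleanupEpochs(List<Long> flushedEpochs) {
        flushedEpochs.forEach(cache::remove);
    }

    public static void main(String[] args) {
        EpochCleanupSketch c = new EpochCleanupSketch();
        // The thread with index 3 commits first; indices 1 and 2 land in the
        // cache while index 3 is being flushed.
        c.put(3, "part1");
        c.put(1, "part3");
        c.put(2, "part2");

        // Threshold cleanup with index 3 wrongly evicts the unflushed 1 and 2.
        c.cleanupUpTo(3);
        System.out.println("threshold cleanup leaves: " + c.cache);

        // Epoch-list cleanup evicts only what was flushed (epoch 3 here).
        c.put(3, "part1");
        c.put(1, "part3");
        c.put(2, "part2");
        c.cleanupEpochs(Arrays.asList(3L));
        System.out.println("epoch-list cleanup leaves: " + c.cache);
    }
}
```

Under the flush order in the example, threshold cleanup leaves the cache empty (parts 2 and 3 are lost), while epoch-list cleanup keeps the still-unflushed epochs 1 and 2.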



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
