[ 
https://issues.apache.org/jira/browse/HDFS-12940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16297507#comment-16297507
 ] 

Chen Liang edited comment on HDFS-12940 at 12/19/17 10:06 PM:
--------------------------------------------------------------

Thanks [~nandakumar131] for the catch! But when I ran the test locally, I still 
occasionally hit the assertion on line 1140, which implies that a key that 
should have been deleted still exists. I suspect that in addition to the issue 
you fixed, there is another issue caused by the other tests: since the cleanup 
service starts when the minicluster starts, and multiple tests run against the 
same cluster, the timing of the cleanup checks relative to this test is 
arbitrary. So it's possible that the second {{Thread.sleep(2000);}} window does 
not actually include a cleanup check, given that the cleanup check interval is 
set to 3 sec.

If this is true, then I think there are two ways to fix it:
1. Instead of a single sleep of 2000, first sleep 2000, then create key 5, then 
sleep another 2000. For keys 1~4 the total sleep is then 4000, which is 
guaranteed to include a check that cleans up keys 1~4. For key 5 the sleep is 
only 2000, so even if a check runs after its creation, it is guaranteed not to 
remove the key, because key 5 is guaranteed to have lived <2000 by the checking 
time. We may also want to decrease the check interval and the cleanup threshold 
to make this test run faster.
2. Make testExpiredOpenKey a separate unit test class, so it is not affected by 
the other tests.
Either way looks good to me.
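To illustrate the timing argument in option 1, here is a minimal sketch of the arithmetic; the 3000 ms interval and the two 2000 ms sleeps are taken from the discussion above, while the class and variable names are hypothetical and not the actual test constants:

```java
// Hypothetical sketch of the option-1 timing argument; values are assumptions
// taken from the comment (3 sec check interval, two 2000 ms sleeps), not the
// real TestKeySpaceManager constants.
public class ExpiredOpenKeyTiming {
  public static void main(String[] args) {
    long checkIntervalMs = 3000;  // cleanup check interval (3 sec, per comment)
    long sleepBeforeKey5 = 2000;  // first sleep: keys 1~4 age through this
    long sleepAfterKey5 = 2000;   // second sleep: key 5's maximum age

    long keys1to4AgeMs = sleepBeforeKey5 + sleepAfterKey5;  // 4000 ms
    // A 4000 ms window always spans at least one check at a 3000 ms interval,
    // so keys 1~4 are guaranteed to be cleaned up:
    System.out.println(keys1to4AgeMs > checkIntervalMs);    // true
    // Key 5 has lived at most 2000 ms by the end of the second window, so no
    // check during that window can see it as expired:
    System.out.println(sleepAfterKey5 < checkIntervalMs);   // true
  }
}
```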


> Ozone: KSM: TestKeySpaceManager#testExpiredOpenKey fails occasionally
> ---------------------------------------------------------------------
>
>                 Key: HDFS-12940
>                 URL: https://issues.apache.org/jira/browse/HDFS-12940
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: Nanda kumar
>            Assignee: Nanda kumar
>         Attachments: HDFS-12940-HDFS-7240.000.patch
>
>
> {{TestKeySpaceManager#testExpiredOpenKey}} is flaky.
> In {{testExpiredOpenKey}} we are opening four keys for writing and waiting 
> for them to expire (without committing). Verification/Assertion is done by 
> querying {{MiniOzoneCluster}} and matching the count. Since the {{cluster}} 
> instance of {{MiniOzoneCluster}} is shared between test-cases in 
> {{TestKeySpaceManager}}, we should not rely on the count. The verification 
> should only happen by matching the keyNames and not with the count.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
