[
https://issues.apache.org/jira/browse/HADOOP-12699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Xiao Chen updated HADOOP-12699:
-------------------------------
Attachment: HADOOP-12699.06.patch
Thanks [~asuresh] for the comment and for confirming the fix direction!
I found what was missing in my previously mentioned approach: I was only
removing the runnables from the executor's queue, but did not correctly handle
the runnables that had already started.
bq. in the drain call, do an executors.shutdownnow(), clear all the queues and
restart the executor threads.
My thought on this is that shutting down all tasks may be overkill - we only
need to cancel the tasks for the key that's being rolled. Also, going that way
we would need to handle the {{executorThreadsStarted}} race between
{{submitRefillTask}} and the restart. What do you think about my approach below?
Attached patch 6; locally ran the repro snippet 10k times with no failure.
- As discussed, focusing on fixing the {{ValueQueue}}, instead of hacking it
for the test.
- On {{ValueQueue#drain}}, remove all {{NamedRunnable}}s from the
{{ThreadPoolExecutor}}'s queue for the specific key being drained. Updated the
{{keysInProgress}} data structure in {{NamedRunnable}} from a set to a map to
achieve this.
- We also need to cancel any {{NamedRunnable}} that is already executing.
-* Since we're not submitting to the executor, we can't get a {{Future}} to
cancel. I added an {{AtomicBoolean}} to flag whether a {{NamedRunnable}} is
canceled, and manually cancel the job / clear the wrongly-filled queue
according to this flag.
-* We need to handle the race on {{keyQueue}} between drain and get. Added a
{{ReadWriteLock}} for this purpose. Each key uses a separate lock, so that
operations on different keys don't impact each other.
-* Since the {{keyQueue}}s are loaded by a {{LoadingCache}}, I didn't find a
perfect place to initialize the locks for them, so lazy initialization is used
to create the locks.
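To make the idea concrete, here is a minimal standalone sketch (not the patch itself - class and field names like {{DrainSketch}}, {{cancelled}}, and {{lockFor}} are illustrative, and the refill is simulated) of the cancellation flag plus the lazily created per-key {{ReadWriteLock}}:

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class DrainSketch {

    // Illustrative stand-in for ValueQueue's NamedRunnable: carries the key
    // name plus a cancellation flag that is checked before publishing the
    // refilled value, so an already-started refill becomes a no-op.
    static class NamedRunnable implements Runnable {
        final String keyName;
        final Queue<String> keyQueue;
        final AtomicBoolean cancelled = new AtomicBoolean(false);

        NamedRunnable(String keyName, Queue<String> keyQueue) {
            this.keyName = keyName;
            this.keyQueue = keyQueue;
        }

        @Override
        public void run() {
            String value = keyName + "@refilled"; // simulate fetching a value
            if (!cancelled.get()) {
                keyQueue.add(value); // only publish if not drained meanwhile
            }
        }
    }

    // Per-key locks, created lazily since the queues themselves come from a
    // cache and there is no single initialization point.
    static final ConcurrentMap<String, ReadWriteLock> locks =
        new ConcurrentHashMap<>();

    static ReadWriteLock lockFor(String key) {
        return locks.computeIfAbsent(key, k -> new ReentrantReadWriteLock());
    }

    // Simulates drain("k6") racing with an already-started refill task.
    static boolean demo() {
        Queue<String> q = new ConcurrentLinkedQueue<>();
        NamedRunnable inFlight = new NamedRunnable("k6", q);

        // drain: cancel the in-flight refill and clear the queue under the
        // key's write lock, so gets on the same key cannot interleave.
        Lock w = lockFor("k6").writeLock();
        w.lock();
        try {
            inFlight.cancelled.set(true);
            q.clear();
        } finally {
            w.unlock();
        }

        inFlight.run(); // the cancelled refill does not publish its value
        return q.isEmpty(); // no stale value leaked into the queue
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Here the lock is per key, so draining {{k6}} never blocks gets on other keys, which matches the goal of keeping the impact scoped to the key being rolled.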
I understand this patch adds locking and may impact performance, but I think
that is the trade-off for supporting the same client rolling and getting (and
I feel we should support that). I've been trying to keep the impact
(locking / runnable canceling) scoped to each key, which is how roll and get
work currently.
Please review and share your thoughts. Thank you!
> TestKMS#testKMSProvider intermittently fails during 'test rollover draining'
> ----------------------------------------------------------------------------
>
> Key: HADOOP-12699
> URL: https://issues.apache.org/jira/browse/HADOOP-12699
> Project: Hadoop Common
> Issue Type: Bug
> Reporter: Xiao Chen
> Assignee: Xiao Chen
> Attachments: HADOOP-12699.01.patch, HADOOP-12699.02.patch,
> HADOOP-12699.03.patch, HADOOP-12699.04.patch, HADOOP-12699.06.patch,
> HADOOP-12699.repro.2, HADOOP-12699.repro.patch
>
>
> I've seen several failures of testKMSProvider, all failed in the following
> snippet:
> {code}
> // test rollover draining
> KeyProviderCryptoExtension kpce = KeyProviderCryptoExtension.
> createKeyProviderCryptoExtension(kp);
> .....
> EncryptedKeyVersion ekv1 = kpce.generateEncryptedKey("k6");
> kpce.rollNewVersion("k6");
> EncryptedKeyVersion ekv2 = kpce.generateEncryptedKey("k6");
> Assert.assertNotEquals(ekv1.getEncryptionKeyVersionName(),
> ekv2.getEncryptionKeyVersionName());
> {code}
> with error message
> {quote}Values should be different. Actual: k6@0{quote}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)