[
https://issues.apache.org/jira/browse/HDFS-13731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16590030#comment-16590030
]
Zsolt Venczel commented on HDFS-13731:
--------------------------------------
While investigating the above timeouts, I found the following concurrency issue:
* the ReencryptionUpdater.processCheckpoints method iterates over the task list
and removes tasks from it
* at the same time, a different thread can add a new re-encryption task to the
same list through ReencryptionHandler.submitCurrentBatch, which calls
ZoneSubmissionTracker.addTask
My latest patch contains a proposal to prevent this.
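For readers unfamiliar with this failure mode, here is a minimal, self-contained sketch of it. It is not the HDFS code and not necessarily what the patch does: TaskListRaceSketch, GuardedTaskList, removeCompleted and the "batch-N" strings are hypothetical names made up for illustration. The first method replays the interleaving on a single thread so the exception is deterministic; the second shows one common remedy, guarding both the add path and the iterate-and-remove path with the same lock.

```java
import java.util.ConcurrentModificationException;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;

// Hypothetical illustration only; none of these names exist in HDFS.
final class TaskListRaceSketch {

    /** Replays the bad interleaving on one thread: an add lands mid-iteration. */
    static boolean unsafeInterleavingFails() {
        List<String> tasks = new LinkedList<>();
        tasks.add("batch-0");
        tasks.add("batch-1");
        Iterator<String> it = tasks.iterator();
        it.next();              // the checkpoint loop is positioned on a task
        tasks.add("batch-2");   // what an addTask call from another thread would do
        try {
            it.remove();        // LinkedList's iterator is fail-fast
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    /** One possible remedy (an assumption, not the patch): a shared lock. */
    static final class GuardedTaskList {
        private final List<String> tasks = new LinkedList<>();

        synchronized void addTask(String task) {
            tasks.add(task);
        }

        /** Removes every queued task and returns how many were removed. */
        synchronized int removeCompleted() {
            int removed = 0;
            for (Iterator<String> it = tasks.iterator(); it.hasNext();) {
                it.next();
                it.remove();
                removed++;
            }
            return removed;
        }
    }

    /** Runs a producer and a consumer concurrently against the guarded list. */
    static int guardedRun(int n) {
        GuardedTaskList list = new GuardedTaskList();
        int[] removed = {0};
        Thread producer = new Thread(() -> {
            for (int i = 0; i < n; i++) {
                list.addTask("batch-" + i);
            }
        });
        Thread consumer = new Thread(() -> {
            while (removed[0] < n) {
                removed[0] += list.removeCompleted();
            }
        });
        producer.start();
        consumer.start();
        try {
            producer.join();
            consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return removed[0];
    }

    public static void main(String[] args) {
        System.out.println("unguarded interleaving throws: " + unsafeInterleavingFails());
        System.out.println("guarded run removed: " + guardedRun(1000));
    }
}
```

With the shared lock, an add can only happen before or after a full pass over the list, never in the middle of one, so the iterator never observes a modification it did not make itself.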
I've attached the full log produced for the issue.
The important section is below: the *processCheckpoints* iteration is still
running when a new ZoneSubmissionTracker task is added:
{code:java}
2018-08-22 17:16:01,535 INFO FSTreeTraverser - Submitted batch
(start:/zones/zone/0, size:5) of zone 16387 to re-encrypt.
2018-08-22 17:16:01,535 INFO ReencryptionHandler - Processing batched
re-encryption for zone 16387, batch size 5, start:/zones/zone/0
2018-08-22 17:16:01,536 INFO ReencryptionHandler - Completed re-encrypting one
batch of 5 edeks from KMS, time consumed: 922873, start: /zones/zone/0.
2018-08-22 17:16:01,536 INFO ReencryptionUpdater - Processing returned
re-encryption task for zone /zones/zone(16387), batch size 5,
start:/zones/zone/0
2018-08-22 17:16:01,536 DEBUG ReencryptionUpdater - Updating file xattrs for
re-encrypting zone /zones/zone, starting at /zones/zone/0
2018-08-22 17:16:01,536 TRACE ReencryptionUpdater - Updating 16388 for
re-encryption.
2018-08-22 17:16:01,536 TRACE ReencryptionUpdater - Updating 16389 for
re-encryption.
2018-08-22 17:16:01,536 TRACE ReencryptionUpdater - Updating 16390 for
re-encryption.
2018-08-22 17:16:01,536 TRACE ReencryptionUpdater - Updating 16391 for
re-encryption.
2018-08-22 17:16:01,536 TRACE ReencryptionUpdater - Updating 16392 for
re-encryption.
2018-08-22 17:16:01,536 INFO ReencryptionUpdater - Updated xattrs on 5(5)
files in zone /zones/zone for re-encryption, starting:/zones/zone/0.
2018-08-22 17:16:01,536 DEBUG ReencryptionUpdater - Updating re-encryption
checkpoint with completed task. last: /zones/zone/4 size:5.
2018-08-22 17:16:01,536 INFO FSTreeTraverser - Submitted batch
(start:/zones/zone/5, size:5) of zone 16387 to re-encrypt.
2018-08-22 17:16:01,536 INFO ReencryptionHandler - Processing batched
re-encryption for zone 16387, batch size 5, start:/zones/zone/5
2018-08-22 17:16:01,537 ERROR ReencryptionUpdater - Re-encryption updater
thread exiting.
java.util.ConcurrentModificationException
at java.util.LinkedList$ListItr.checkForComodification(LinkedList.java:966)
at java.util.LinkedList$ListItr.remove(LinkedList.java:921)
at
org.apache.hadoop.hdfs.server.namenode.ReencryptionUpdater.processCheckpoints(ReencryptionUpdater.java:411)
at
org.apache.hadoop.hdfs.server.namenode.ReencryptionUpdater.processTask(ReencryptionUpdater.java:488)
at
org.apache.hadoop.hdfs.server.namenode.ReencryptionUpdater.takeAndProcessTasks(ReencryptionUpdater.java:437)
at
org.apache.hadoop.hdfs.server.namenode.ReencryptionUpdater.run(ReencryptionUpdater.java:264)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2018-08-22 17:16:01,537 INFO ReencryptionHandler - Submission completed of
zone 16387 for re-encryption.
{code}
This results in the re-encryption tasks being cancelled:
{code:java}
2018-08-22 17:16:51,612 INFO ReencryptionUpdater - Cancelling 2 re-encryption
tasks
...
2018-08-22 17:16:51,621 INFO ReencryptionUpdater - Cancelling 2 re-encryption
tasks
{code}
My uploaded patch also fixes two other test-related issues:
* in testRestartAfterReencryptAndCheckpoint the fs.saveNamespace() call was
sometimes slow, so the test should wait for the operation to finish
* the cancelFutureDuringReencryption method introduced a race condition at
{code:java}
callableRunning.set(true); Thread.sleep(Long.MAX_VALUE);{code}
where, in rare cases, a concurrent modification can happen between setting
callableRunning to true and putting the thread to sleep.
> ReencryptionUpdater fails with ConcurrentModificationException during
> processCheckpoints
> ----------------------------------------------------------------------------------------
>
> Key: HDFS-13731
> URL: https://issues.apache.org/jira/browse/HDFS-13731
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: encryption, test
> Affects Versions: 3.0.0
> Reporter: Xiao Chen
> Assignee: Zsolt Venczel
> Priority: Major
> Attachments: HDFS-13731-failure.log
>
>
> HDFS-12837 fixed some flakiness in the re-encryption tests. But per
> [~zvenczel]'s comment, there are still a few timeouts. We should investigate
> them.