[jira] [Commented] (HDFS-13731) ReencryptionUpdater fails with ConcurrentModificationException during processCheckpoints

Zsolt Venczel (JIRA) Thu, 23 Aug 2018 03:21:02 -0700


    [ 
https://issues.apache.org/jira/browse/HDFS-13731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16590030#comment-16590030
 ]


Zsolt Venczel commented on HDFS-13731:
--------------------------------------

While investigating the above timeouts I found the following concurrency issue:
 * while the ReencryptionUpdater.processCheckpoints method is iterating over 
the task list and removing tasks from it,
 * a different thread can add a new re-encryption task to the same task list 
by calling ReencryptionHandler.submitCurrentBatch, which calls 
ZoneSubmissionTracker.addTask

My latest patch contains a proposal to prevent this.
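
To illustrate the interleaving described above in isolation, here is a minimal sketch. The class and method names (TaskListRaceSketch, Tracker, addTask, drainCompleted) are hypothetical stand-ins, not the HDFS classes, and the synchronized guard is only one possible way to serialize the two operations; it is not necessarily what the attached patch does. The first method replays the bad ordering on a single thread so that it fails deterministically, and the second runs a submitter and an updater thread against a guarded list.

```java
import java.util.ConcurrentModificationException;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;

public class TaskListRaceSketch {

  /** Hypothetical stand-in for a per-zone task list; not the HDFS class. */
  static final class Tracker {
    private final List<String> tasks = new LinkedList<>();

    // Unguarded add, used only to replay the failing interleaving.
    void addTaskUnsafe(String task) {
      tasks.add(task);
    }

    // Guarded add: takes the same monitor as drainCompleted().
    synchronized void addTask(String task) {
      tasks.add(task);
    }

    // Guarded iteration: no add can interleave with next()/remove().
    synchronized int drainCompleted() {
      int removed = 0;
      for (Iterator<String> it = tasks.iterator(); it.hasNext();) {
        it.next();
        it.remove();
        removed++;
      }
      return removed;
    }
  }

  /** Replays "add between next() and remove()" on one thread. */
  static boolean reproduceFailure() {
    Tracker tracker = new Tracker();
    tracker.addTaskUnsafe("batch-0");
    Iterator<String> it = tracker.tasks.iterator();
    it.next();
    // Stands in for the submitting thread adding a task mid-iteration.
    tracker.addTaskUnsafe("batch-5");
    try {
      it.remove(); // the fail-fast iterator detects the structural change
      return false;
    } catch (ConcurrentModificationException e) {
      return true;
    }
  }

  /** Runs a submitter and an updater concurrently on the guarded list. */
  static int runGuarded(int batches) {
    Tracker tracker = new Tracker();
    int[] removed = {0}; // written by the updater only, read after join()
    Thread submitter = new Thread(() -> {
      for (int i = 0; i < batches; i++) {
        tracker.addTask("batch-" + i);
      }
    });
    Thread updater = new Thread(() -> {
      while (removed[0] < batches) {
        removed[0] += tracker.drainCompleted();
      }
    });
    submitter.start();
    updater.start();
    try {
      submitter.join();
      updater.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return removed[0];
  }

  public static void main(String[] args) {
    System.out.println("unguarded interleaving throws CME: " + reproduceFailure());
    System.out.println("guarded run removed: " + runGuarded(1000));
  }
}
```

The unguarded replay fails in the same place as the stack trace in the attached log (the iterator's remove call), because LinkedList iterators are fail-fast. In the guarded variant every add and every iterate-and-remove pass holds the same monitor, so the list cannot change structurally between next() and remove().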

I've attached the full log produced for the issue.

The important section of the log, where the *processCheckpoints* iteration is 
still running while a new ZoneSubmissionTracker task is being added:
{code:java}
2018-08-22 17:16:01,535 INFO  FSTreeTraverser - Submitted batch (start:/zones/zone/0, size:5) of zone 16387 to re-encrypt.
2018-08-22 17:16:01,535 INFO  ReencryptionHandler - Processing batched re-encryption for zone 16387, batch size 5, start:/zones/zone/0
2018-08-22 17:16:01,536 INFO  ReencryptionHandler - Completed re-encrypting one batch of 5 edeks from KMS, time consumed: 922873, start: /zones/zone/0.
2018-08-22 17:16:01,536 INFO  ReencryptionUpdater - Processing returned re-encryption task for zone /zones/zone(16387), batch size 5, start:/zones/zone/0
2018-08-22 17:16:01,536 DEBUG ReencryptionUpdater - Updating file xattrs for re-encrypting zone /zones/zone, starting at /zones/zone/0
2018-08-22 17:16:01,536 TRACE ReencryptionUpdater - Updating 16388 for re-encryption.
2018-08-22 17:16:01,536 TRACE ReencryptionUpdater - Updating 16389 for re-encryption.
2018-08-22 17:16:01,536 TRACE ReencryptionUpdater - Updating 16390 for re-encryption.
2018-08-22 17:16:01,536 TRACE ReencryptionUpdater - Updating 16391 for re-encryption.
2018-08-22 17:16:01,536 TRACE ReencryptionUpdater - Updating 16392 for re-encryption.
2018-08-22 17:16:01,536 INFO  ReencryptionUpdater - Updated xattrs on 5(5) files in zone /zones/zone for re-encryption, starting:/zones/zone/0.
2018-08-22 17:16:01,536 DEBUG ReencryptionUpdater - Updating re-encryption checkpoint with completed task. last: /zones/zone/4 size:5.
2018-08-22 17:16:01,536 INFO  FSTreeTraverser - Submitted batch (start:/zones/zone/5, size:5) of zone 16387 to re-encrypt.
2018-08-22 17:16:01,536 INFO  ReencryptionHandler - Processing batched re-encryption for zone 16387, batch size 5, start:/zones/zone/5
2018-08-22 17:16:01,537 ERROR ReencryptionUpdater - Re-encryption updater thread exiting.
java.util.ConcurrentModificationException
    at java.util.LinkedList$ListItr.checkForComodification(LinkedList.java:966)
    at java.util.LinkedList$ListItr.remove(LinkedList.java:921)
    at org.apache.hadoop.hdfs.server.namenode.ReencryptionUpdater.processCheckpoints(ReencryptionUpdater.java:411)
    at org.apache.hadoop.hdfs.server.namenode.ReencryptionUpdater.processTask(ReencryptionUpdater.java:488)
    at org.apache.hadoop.hdfs.server.namenode.ReencryptionUpdater.takeAndProcessTasks(ReencryptionUpdater.java:437)
    at org.apache.hadoop.hdfs.server.namenode.ReencryptionUpdater.run(ReencryptionUpdater.java:264)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
2018-08-22 17:16:01,537 INFO  ReencryptionHandler - Submission completed of zone 16387 for re-encryption.
{code}
This results in the re-encryption tasks being cancelled:
{code:java}
2018-08-22 17:16:51,612 INFO  ReencryptionUpdater - Cancelling 2 re-encryption tasks
...
2018-08-22 17:16:51,621 INFO  ReencryptionUpdater - Cancelling 2 re-encryption tasks
{code}
My uploaded patch also fixes two other test-related issues:
 * in testRestartAfterReencryptAndCheckpoint the fs.saveNamespace() call was 
sometimes slow, so the test should wait for it to finish
 * the cancelFutureDuringReencryption method introduced a race condition: at
{code:java}
callableRunning.set(true);
Thread.sleep(Long.MAX_VALUE);
{code}
a concurrent modification can happen, in rare cases, between setting 
callableRunning to true and putting the thread to sleep.

> ReencryptionUpdater fails with ConcurrentModificationException during 
> processCheckpoints
> ----------------------------------------------------------------------------------------
>
>                 Key: HDFS-13731
>                 URL: https://issues.apache.org/jira/browse/HDFS-13731
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: encryption, test
>    Affects Versions: 3.0.0
>            Reporter: Xiao Chen
>            Assignee: Zsolt Venczel
>            Priority: Major
>         Attachments: HDFS-13731-failure.log
>
>
> HDFS-12837 fixed some flakiness in the re-encryption related tests. But per 
> [~zvenczel]'s comment, there are still a few timeouts. We should investigate 
> them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
