[ https://issues.apache.org/jira/browse/HDDS-8471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17717587#comment-17717587 ]
Attila Doroszlai commented on HDDS-8471:
----------------------------------------
In addition to possibly adding failed tasks to the new queue, I think there is
another issue: the replication processor may also take items from the new
queue. This is probably harmless too, but it may result in some containers
being processed irregularly (i.e. twice in one loop, then skipped in the next).
Both problems are fixed by having each iteration consistently use a single
instance of the queue.
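
To illustrate the idea, here is a minimal sketch (not the actual Ozone code).
It assumes the shared queue is reachable through a reference that the health
check can swap at any time. The names SingleQueueSketch, CURRENT and
processAll are hypothetical, and a plain ArrayDeque of container names stands
in for the real ReplicationQueue. The processor reads the reference once per
iteration, then takes items from and requeues failures onto that same
instance.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

class SingleQueueSketch {
  // Hypothetical stand-in for the shared queue that the health check replaces.
  static final AtomicReference<Queue<String>> CURRENT =
      new AtomicReference<>(new ArrayDeque<>());

  /** Runs one iteration and returns the containers whose handler threw. */
  static List<String> processAll(Consumer<String> handler) {
    // Read the shared reference exactly once; use this instance throughout.
    final Queue<String> queue = CURRENT.get();
    List<String> failed = new ArrayList<>();
    String container;
    while ((container = queue.poll()) != null) {
      try {
        handler.accept(container);
      } catch (RuntimeException e) {
        // Defer requeueing so a repeatedly failing container cannot loop forever.
        failed.add(container);
      }
    }
    // Requeue onto the same instance the items were taken from.
    queue.addAll(failed);
    return failed;
  }

  public static void main(String[] args) {
    Queue<String> old = new ArrayDeque<>(List.of("c1", "c2", "c3"));
    Queue<String> fresh = new ArrayDeque<>(List.of("c2", "c4"));
    CURRENT.set(old);
    List<String> failed = processAll(c -> {
      CURRENT.set(fresh); // simulate the health check swapping the queue
      if (c.equals("c2")) {
        throw new IllegalStateException("simulated failure");
      }
    });
    // prints: failed=[c2] old=[c2] fresh=[c2, c4]
    System.out.println("failed=" + failed + " old=" + old + " fresh=" + fresh);
  }
}
```

Even though the queue is swapped mid-iteration, the new queue is neither
drained nor given a duplicate entry. The sketch is single-threaded; real code
would also need a thread-safe queue.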
> Ensure replication processors use a single queue for each iteration
> -------------------------------------------------------------------
>
> Key: HDDS-8471
> URL: https://issues.apache.org/jira/browse/HDDS-8471
> Project: Apache Ozone
> Issue Type: Sub-task
> Reporter: Stephen O'Donnell
> Assignee: Attila Doroszlai
> Priority: Major
> Labels: pull-request-available
>
> The under- and over-replication queues in ReplicationManager are created
> when ReplicationManager checks the health of all containers in the system.
> When it does that, it forms a new "ReplicationQueue" object wrapping the
> under- and over-replicated queues.
> The OverReplicatedProcessor and UnderReplicatedProcessor both extend
> UnhealthyReplicationProcessor, which dequeues messages and processes them.
> If there is an exception, it saves the message in a list, ready to enqueue
> it again later. It saves the message, rather than enqueuing it immediately,
> to avoid entering an infinite loop when a container fails repeatedly.
> The issue is that while the under- or over-replication processor is running,
> it could be saving up containers to requeue, but then ReplicationManager
> could process all the containers and replace the queue. The failed
> containers are then requeued onto the "new" queue, possibly creating
> duplicates.
> While the duplicates should not cause any problem, it would be better if this
> were handled more gracefully.
> For example, if the queue has been replaced, drop the failed containers - but
> how to check if the queue has been replaced?
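
One possible answer to the question in the description is a reference-identity
check. The following is only a sketch under assumptions, not the approach
taken in the issue: DropIfReplacedSketch, CURRENT and requeueIfCurrent are
hypothetical names, and a plain ArrayDeque stands in for the real queue. The
processor remembers the instance it started with and drops the failed
containers if the shared reference no longer points to it.

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.atomic.AtomicReference;

class DropIfReplacedSketch {
  // Hypothetical stand-in for the shared queue that the health check replaces.
  static final AtomicReference<Queue<String>> CURRENT =
      new AtomicReference<>(new ArrayDeque<>());

  /**
   * Requeues the failed containers only if 'seen' is still the current queue.
   * Returns true if they were requeued, false if they were dropped.
   */
  static boolean requeueIfCurrent(Queue<String> seen, List<String> failed) {
    // Identity comparison: a replaced queue is a different object.
    if (CURRENT.get() != seen) {
      return false;
    }
    // Not atomic with the check above; real code would need locking here.
    seen.addAll(failed);
    return true;
  }

  public static void main(String[] args) {
    Queue<String> seen = CURRENT.get();
    System.out.println(requeueIfCurrent(seen, List.of("c1"))); // prints: true
    CURRENT.set(new ArrayDeque<>()); // simulate the health check swapping the queue
    System.out.println(requeueIfCurrent(seen, List.of("c2"))); // prints: false
  }
}
```

The check and the requeue are two separate steps, so without extra
synchronization the queue could still be replaced in between.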