[
https://issues.apache.org/jira/browse/HDDS-7728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17767569#comment-17767569
]
Sammi Chen commented on HDDS-7728:
----------------------------------
{quote}SCM requires that all replicas are empty before moving the container to
deleted state. One non-empty replica will block deletion of all replicas.
My point is that both cases can be solved by the missing/orphan container
cleanup. If we are already implementing missing container cleanup then there is
no need to add complexity to the RM to additionally handle the orphan block
case as well.{quote}
I guess you mean the case where a container has 4 replicas: 3 are empty and 1
has blocks. Recon can identify this container as an orphan container to
delete. This is one special case of orphan blocks.
The majority case is the following, where a container has 4 live replicas:
* Replica-1 10 blocks
* Replica-2 10 blocks
* Replica-3 10 blocks
* Replica-4 15 blocks
Replica-4 has 15 blocks; the extra 5 blocks are orphan blocks. There are no
pending block deletion txs for this container. Currently no single module
knows which 5 blocks are orphans, because the DN's container report doesn't
include per-block info. A single DN also doesn't know which blocks are
orphans, as it only knows about its own replica of the container.
Since there are 4 replicas, RM will choose one to delete. IIRC, currently it
picks the first one based on some hash sort result. In the above case, any
replica can be a delete candidate. If we are lucky and Replica-4 is chosen
and deleted, then both the over-replication and the orphan blocks are resolved
for this container. If another replica is chosen and deleted, then the
over-replication is resolved but the orphan blocks are still there.
So the proposal is to leverage deleteTransactionId in the container replica
info. Each container replica has two IDs. One is blockCommitSequenceId
(bcsid), which increases monotonically every time metadata is updated for an
OPEN container. The other is deleteTransactionId, which is an SCM-wide,
globally monotonically increasing number. Once a container transitions from
OPEN to CLOSED, its bcsid never changes again. But we can still delete blocks
in a CLOSED container. Every time a new batch of blocks is deleted, the
deleteTransactionId in this container is updated. So the container replica
with the smaller deleteTransactionId is the one that has orphan blocks
compared to the others. This makes the choice of which replica to delete
deterministic: Replica-4 will be chosen. The orphan blocks are then resolved
naturally when Replica-4 is deleted.
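The selection rule proposed above can be sketched roughly as follows. This is
only an illustration in Python, not RM code: the ReplicaInfo type, the
pick_replica_to_delete helper, the tie-break on replica name, and all ID
values are hypothetical and chosen to match the Replica-1..4 example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicaInfo:
    # Hypothetical stand-in for the per-replica info SCM keeps.
    name: str
    block_count: int
    bcsid: int
    delete_transaction_id: int


def pick_replica_to_delete(replicas):
    """Pick the replica with the smallest deleteTransactionId.

    Ties are broken by name only so that the result stays deterministic;
    that tie-break is an assumption of this sketch.
    """
    if not replicas:
        raise ValueError("no replicas to choose from")
    return min(replicas, key=lambda r: (r.delete_transaction_id, r.name))


# Made-up values: Replica-4 missed the last delete batch, so it still holds
# 5 extra blocks and reports an older deleteTransactionId.
replicas = [
    ReplicaInfo("Replica-1", 10, 100, 42),
    ReplicaInfo("Replica-2", 10, 100, 42),
    ReplicaInfo("Replica-3", 10, 100, 42),
    ReplicaInfo("Replica-4", 15, 100, 37),
]

victim = pick_replica_to_delete(replicas)
print(victim.name)
```

If all replicas report the same deleteTransactionId, this sketch falls back to
an arbitrary but stable choice, which is no worse than the current hash-based
pick.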
[~erose], I think we may have a communication gap here. Let me summarize the
cases of orphan containers and orphan blocks below. Set aside the title of
the JIRA; more cases than the one stated in the title are discussed here.
1. Orphan containers. Containers that are no longer referenced from OM
metadata. They may or may not have replicas reported from DNs. For this type
of container, I agree that Recon is the best place to do the cleanup if there
are replicas reported, because Recon has the OM data and RM in SCM doesn't.
RM cannot know whether a container is orphan or not. For those orphan
containers, their related block deletion transactions in SCM, if any, can be
skipped without executing and deleted too. Some containers can be both
missing and orphan.
2. Missing containers. Containers that are referenced from OM, but don't have
any replicas reported from DNs. This causes data loss, a severe problem for
Ozone. This type of container may have pending block deletion txs too. It's
better to keep the container metadata, block deletion txs, and other
container-related data untouched, to preserve context for further data loss
investigation.
3. Over-replicated containers that are neither orphan nor missing. Two cases:
a. There are no pending block deletion txs for those containers. The
proposal for this case is already explained at the beginning of this
comment.
b. There are pending block deletion txs for those containers. It looks like
the RM and the block deletion service are not synchronized on this. RM can
send out the replica deletion command to one DN. In the meantime, the block
deletion service can send out block deletion transactions to four DNs. When 3
DNs ack the txs as successful, SCM will delete the transaction from RocksDB.
So the above sample container could end up as:
* Replica-1 10 blocks -> 6 blocks
* Replica-2 10 blocks // deleted
* Replica-3 10 blocks -> 6 blocks
* Replica-4 15 blocks -> 11 blocks
The 5 extra orphan blocks are still there. The key point here is that
Replica-4 is not the one that was deleted.
So we can see that, whether or not there are pending deletion txs for the
over-replicated container, the key to resolving the orphan blocks is to choose
the replica with the smallest deleteTransactionId to delete.
4. Under-replicated containers
RM will copy 1 replica to bring the count to 3 replicas. Which replica is the
better source? The replica with the bigger bcsid and bigger
deleteTransactionId.
5. Mis-replicated containers
A mis-replicated container is equivalent to an under-replicated case plus an
over-replicated case. Follow the solutions in item 4 and item 3 above,
respectively.
6. Unhealthy containers, where all replicas are in an unhealthy state.
I am not sure how the block deletion service currently handles this type of
container. This needs more checking.
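For item 4, the source choice can be sketched as the mirror image of the
deletion rule in item 3. This is a rough Python illustration only: the
Replica tuple, the pick_copy_source helper, and the ID values are
hypothetical, and comparing bcsid first and deleteTransactionId second is an
assumption of this sketch, since the order of the two checks is not spelled
out above.

```python
from typing import NamedTuple


class Replica(NamedTuple):
    # Hypothetical stand-in for the per-replica info SCM keeps.
    name: str
    bcsid: int
    delete_transaction_id: int


def pick_copy_source(replicas):
    """Prefer the replica with the biggest bcsid, then the biggest
    deleteTransactionId (ordering is an assumption of this sketch)."""
    if not replicas:
        raise ValueError("no replicas to copy from")
    return max(replicas, key=lambda r: (r.bcsid, r.delete_transaction_id))


# Made-up values: both replicas are CLOSED with the same bcsid, but Replica-2
# missed the last delete batch and may still hold orphan blocks.
live = [
    Replica("Replica-1", 100, 42),
    Replica("Replica-2", 100, 37),
]

source = pick_copy_source(live)
print(source.name)
```

Copying from the replica with the most recent deleteTransactionId avoids
duplicating orphan blocks into the new replica.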
All along, [~ashishk] and I have been proposing a solution for item 3 above,
while you and Stephen have been emphasizing items 1 and 2; Stephen also
mentioned items 4 and 5. I think that is where the communication gap comes
from. I agree with Ethan's proposal that orphan containers be handled by
Recon. For orphan blocks, the special case can be covered by orphan container
handling, while the majority case is better handled by RM in SCM, since Recon
doesn't have any advantage over RM on this problem. If required, we can have
a sync meeting on this topic. What do you think, [~erose] [~sodonnell]?
> Block should be safely deleted from the containers if they are instructed
> from OM and containers are in missing state.
> ----------------------------------------------------------------------------------------------------------------------
>
> Key: HDDS-7728
> URL: https://issues.apache.org/jira/browse/HDDS-7728
> Project: Apache Ozone
> Issue Type: Improvement
> Components: SCM
> Affects Versions: 1.3.0
> Reporter: Uma Maheswara Rao G
> Assignee: Ashish Kumar
> Priority: Major
>
> Currently when OM instructs to delete the blocks and the containers are in
> missing state, deletion may not be processed properly. This Jira is to track
> this requirement and implement safe deletion of blocks whatever state
> they are in. Otherwise containers would never get cleaned up even though all
> blocks in the files are deleted.
>