[ 
https://issues.apache.org/jira/browse/HDDS-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Rose updated HDDS-6235:
-----------------------------
    Description: 
An empty KeyValueContainer will have an empty chunks directory. 
TarContainerPacker#pack recurses into directories, adding files to the tar, 
but if the chunks directory is empty, it will not be included in the tar. The 
receiver will unpack the tar successfully, but the resulting container will not 
have a chunks directory. After this, the container cannot be replicated 
further, as the tar packing step requires all container pieces to be 
present on disk. This issue is more likely to occur due to HDDS-5359, which 
causes many empty containers to be tracked by SCM indefinitely.
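
The omission described above can be reproduced without any tar library. The 
following is a minimal sketch, not the TarContainerPacker code: the class and 
method names are hypothetical, and the on-disk layout (a "chunks" and a 
"metadata" directory under the container root) is assumed for illustration. It 
compares the entry names produced by a walk that only adds regular files with 
one that also records directories.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical illustration class; it is not part of Apache Ozone.
final class EmptyDirPackingSketch {

  /** Entry names when only regular files are added: an empty directory leaves no trace. */
  static List<String> fileOnlyEntries(Path containerRoot) throws IOException {
    List<String> entries = new ArrayList<>();
    try (Stream<Path> paths = Files.walk(containerRoot)) {
      paths.filter(Files::isRegularFile)
          .forEach(p -> entries.add(relativeName(containerRoot, p)));
    }
    Collections.sort(entries);
    return entries;
  }

  /** Entry names when directories get their own entry (trailing slash), so empty ones survive. */
  static List<String> entriesWithDirectories(Path containerRoot) throws IOException {
    List<String> entries = new ArrayList<>();
    try (Stream<Path> paths = Files.walk(containerRoot)) {
      paths.filter(p -> !p.equals(containerRoot))
          .forEach(p -> {
            String name = relativeName(containerRoot, p);
            entries.add(Files.isDirectory(p) ? name + "/" : name);
          });
    }
    Collections.sort(entries);
    return entries;
  }

  /** Path relative to the container root, with forward slashes as used in archive entry names. */
  private static String relativeName(Path root, Path p) {
    return root.relativize(p).toString().replace('\\', '/');
  }
}
```

For a container root holding an empty chunks directory and one file under 
metadata, the first method returns only the file's name, while the second also 
returns entries for both directories. A packer that writes an explicit 
directory entry for chunks would let the receiver recreate it.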

Since the issue only affects empty containers, there does not appear to be any 
data loss risk, even though the container scanner may detect it as 
"corruption". The issue may manifest as the container being marked unhealthy by 
the background container scanner (if it is enabled), or a container 
continuously attempting to be replicated and failing. In the latter case, logs 
like this may be observed on the receiver of an import:
{code}
2020-06-23 14:11:20,504 [grpc-default-executor-111] INFO 
org.apache.hadoop.ozone.container.replication.GrpcReplicationClient: Container 
206 is downloaded to /tmp/container-copy/container-206.tar.gz
2020-06-23 14:11:20,505 [ContainerReplicationThread-0] INFO 
org.apache.hadoop.ozone.container.replication.DownloadAndImportReplicator: 
Container 206 is downloaded, starting to import.
2020-06-23 14:11:20,616 [ContainerReplicationThread-0] ERROR 
org.apache.hadoop.ozone.container.replication.DownloadAndImportReplicator: 
Can't import the downloaded container data id=206
java.io.IOException: Container descriptor is missing from the container archive.
{code}
This happens because the sender stopped packing contents into the tar when it 
found the chunks dir missing, so it did not add the .container file. The send 
happens anyway, but the receiver tries to unpack the .container file first, 
and aborts when it sees it is not there.

This Jira will fix the issue with the tar packer, and also add a repair step on 
datanode startup to create the chunks directory for containers that do not have 
one. This step should be a quick addition, since datanode startup already 
iterates over all the containers, and it should not impact startup time.
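
The repair step could look roughly like the sketch below. It is an 
illustration under stated assumptions, not the actual change: the class name, 
the method, and the "chunks" directory name constant are hypothetical, and the 
list of container directories stands in for whatever the datanode already 
iterates at startup.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; it is not part of Apache Ozone.
final class ChunksDirRepairSketch {

  // Assumed name of the chunks directory inside a container directory.
  static final String CHUNKS_DIR_NAME = "chunks";

  /**
   * Creates a missing chunks directory for each given container directory
   * and returns the container directories that needed the repair.
   */
  static List<Path> repairMissingChunksDirs(List<Path> containerDirs) throws IOException {
    List<Path> repaired = new ArrayList<>();
    for (Path containerDir : containerDirs) {
      Path chunksDir = containerDir.resolve(CHUNKS_DIR_NAME);
      if (!Files.isDirectory(chunksDir)) {
        // Throws FileAlreadyExistsException if a non-directory file is in the way.
        Files.createDirectories(chunksDir);
        repaired.add(containerDir);
      }
    }
    return repaired;
  }
}
```

The check is a single directory lookup per container and creates nothing when 
the directory already exists, so running it on every startup is idempotent.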

  was:
An empty KeyValueContainer will have an empty chunks directory. 
TarContainerPacker#pack recurses into directories adding files into containers, 
but if the chunks directory is empty, it will not be included in the tar. The 
receiver will unpack the tar successfully, but the resulting container will not 
have a chunks directory. After this, the container cannot be replicated 
further, as the tar packing step requires all container pieces to be 
present on disk. This issue is more likely to occur due to HDDS-5359, which 
causes many empty containers to be tracked by SCM indefinitely.

Since the issue only affects empty containers, there does not appear to be any 
data loss risk, even though the container scanner may detect it as 
"corruption". The issue may manifest as a container continuously attempting to 
be replicated and failing, or the container being marked unhealthy by the 
background container scanner (if it is enabled).

This Jira will fix the issue with the tar packer, and also add a repair step on 
datanode startup to create the chunks directory for containers that do not have 
one. This step should be a quick addition to datanode startup that already 
iterates all the containers, and should not impact startup time.


> Empty KeyValueContainers are replicated without chunks directory
> ----------------------------------------------------------------
>
>                 Key: HDDS-6235
>                 URL: https://issues.apache.org/jira/browse/HDDS-6235
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: Ozone Datanode
>    Affects Versions: 1.0.0, 1.1.0, 1.2.0
>            Reporter: Ethan Rose
>            Assignee: Ethan Rose
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.3.0



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
