[
https://issues.apache.org/jira/browse/HDDS-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ethan Rose updated HDDS-6235:
-----------------------------
Description:
An empty KeyValueContainer will have an empty chunks directory.
TarContainerPacker#pack recurses into directories and adds the files it finds
to the tar, but if the chunks directory is empty, nothing is added for it and
the directory is not included in the tar. The receiver will unpack the tar
successfully, but the resulting container will not have a chunks directory.
After this, the container cannot be replicated further, as the tar packing
step requires all container pieces to be present on disk. This issue is more
likely to occur due to HDDS-5359, which causes many empty containers to be
tracked by SCM indefinitely.
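The failure mode described above can be reproduced outside Ozone. The sketch
below is not the TarContainerPacker code (which is Java); it uses Python's
standard tarfile module, and the directory layout and the names
pack_files_only and pack_with_dirs are assumptions made for illustration. It
shows that a packer which only adds regular files silently drops an empty
directory, while one that also writes directory entries preserves it.

```python
import os
import tarfile
import tempfile


def pack_files_only(container_dir, tar_path):
    """Add only regular files; a directory with no files leaves no entry."""
    with tarfile.open(tar_path, "w:gz") as tar:
        for root, _dirs, files in os.walk(container_dir):
            for name in files:
                full = os.path.join(root, name)
                tar.add(full, arcname=os.path.relpath(full, container_dir))


def pack_with_dirs(container_dir, tar_path):
    """Also add an entry for every directory, so empty ones survive."""
    with tarfile.open(tar_path, "w:gz") as tar:
        for root, dirs, files in os.walk(container_dir):
            for name in dirs + files:
                full = os.path.join(root, name)
                # recursive=False: add just this entry, os.walk does the descent
                tar.add(full,
                        arcname=os.path.relpath(full, container_dir),
                        recursive=False)


def entry_names(tar_path):
    """Return the sorted entry names stored in the archive."""
    with tarfile.open(tar_path, "r:gz") as tar:
        return sorted(member.name for member in tar.getmembers())


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical on-disk layout: an empty chunks dir plus a descriptor.
        container = os.path.join(tmp, "container-206")
        os.makedirs(os.path.join(container, "chunks"))
        os.makedirs(os.path.join(container, "metadata"))
        descriptor = os.path.join(container, "metadata", "206.container")
        with open(descriptor, "w") as f:
            f.write("containerID: 206\n")

        broken = os.path.join(tmp, "broken.tar.gz")
        fixed = os.path.join(tmp, "fixed.tar.gz")
        pack_files_only(container, broken)
        pack_with_dirs(container, fixed)
        return entry_names(broken), entry_names(fixed)


if __name__ == "__main__":
    broken_entries, fixed_entries = demo()
    print("files only:", broken_entries)
    print("with dirs: ", fixed_entries)
```

Unpacking the first archive yields a container without a chunks directory,
which is the state the description says blocks any later replication.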
Since the issue only affects empty containers, there does not appear to be any
data loss risk, even though the container scanner may detect it as
"corruption". The issue may manifest as the container being marked unhealthy by
the background container scanner (if it is enabled), or a container
continuously attempting to be replicated and failing. In the latter case, logs
like this may be observed on the receiver of an import:
{code}
2020-06-23 14:11:20,504 [grpc-default-executor-111] INFO
org.apache.hadoop.ozone.container.replication.GrpcReplicationClient: Container
206 is downloaded to /tmp/container-copy/container-206.tar.gz
2020-06-23 14:11:20,505 [ContainerReplicationThread-0] INFO
org.apache.hadoop.ozone.container.replication.DownloadAndImportReplicator:
Container 206 is downloaded, starting to import.
2020-06-23 14:11:20,616 [ContainerReplicationThread-0] ERROR
org.apache.hadoop.ozone.container.replication.DownloadAndImportReplicator:
Can't import the downloaded container data id=206
java.io.IOException: Container descriptor is missing from the container archive.
{code}
This happens because the sender stopped packing contents into the tar when it
found the chunks dir missing, so it never added the .container file. The send
happens anyway, but the receiver tries to unpack the .container file first,
and aborts when it sees that the file is not there.
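The receiver-side behaviour can be sketched as a check that runs before
anything else is unpacked. This is an illustration in Python, not the Ozone
import code; the function find_descriptor, the ".container" suffix match, and
the in-memory archives are assumptions. Only the error message is taken from
the log above.

```python
import io
import tarfile


def find_descriptor(tar_bytes, suffix=".container"):
    """Return the name of the descriptor entry, or raise if there is none."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:gz") as tar:
        for member in tar:
            if member.isfile() and member.name.endswith(suffix):
                return member.name
    raise IOError(
        "Container descriptor is missing from the container archive.")


def build_archive(files):
    """Build a gzipped tar in memory from a {name: bytes} mapping."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def import_result(tar_bytes):
    """Report what an import attempt would see for this archive."""
    try:
        return "ok: " + find_descriptor(tar_bytes)
    except IOError as e:
        return "failed: " + str(e)


if __name__ == "__main__":
    complete = build_archive({"metadata/206.container": b"containerID: 206\n"})
    truncated = build_archive({})  # sender stopped before adding anything
    print(import_result(complete))
    print(import_result(truncated))
```

Because the sender gave up before writing the descriptor, the archive is
well-formed but incomplete, so the failure only surfaces on the receiver.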
This Jira will fix the issue with the tar packer, and also add a repair step on
datanode startup to create the chunks directory for containers that do not have
one. Datanode startup already iterates over all the containers, so this step
should be a quick addition that does not noticeably impact startup time.
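A minimal sketch of such a repair pass, written in Python for illustration
only: the function repair_missing_chunks_dirs, the constant CHUNKS_DIR_NAME,
and the idea that each container is a plain directory path are assumptions,
not the datanode's actual startup code.

```python
import tempfile
from pathlib import Path

CHUNKS_DIR_NAME = "chunks"  # assumed name, taken from the description above


def repair_missing_chunks_dirs(container_dirs):
    """Create the chunks directory for each container that lacks one.

    Returns the list of container directories that were repaired.
    """
    repaired = []
    for container_dir in container_dirs:
        chunks = Path(container_dir) / CHUNKS_DIR_NAME
        if not chunks.is_dir():
            chunks.mkdir(parents=True, exist_ok=True)
            repaired.append(Path(container_dir))
    return repaired


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        healthy = Path(tmp) / "1"
        (healthy / CHUNKS_DIR_NAME).mkdir(parents=True)
        damaged = Path(tmp) / "2"  # imported without a chunks directory
        damaged.mkdir()

        repaired = repair_missing_chunks_dirs([healthy, damaged])
        return [p.name for p in repaired], (damaged / CHUNKS_DIR_NAME).is_dir()


if __name__ == "__main__":
    names, exists_now = demo()
    print("repaired:", names, "chunks dir present:", exists_now)
```

The pass does one existence check per container and touches the disk only
for damaged ones, which fits the expectation that it adds little to a loop
that startup already performs.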
was:
An empty KeyValueContainer will have an empty chunks directory.
TarContainerPacker#pack recurses into directories adding files into the tar,
but if the chunks directory is empty, it will not be included in the tar. The
receiver will unpack the tar successfully, but the resulting container will not
have a chunks directory. After this, the container will not be able to be
replicated further, as the tar packing step requires all container pieces to be
present on disk. This issue is more likely to occur due to HDDS-5359, which
causes many empty containers to be tracked by SCM indefinitely.
Since the issue only affects empty containers, there does not appear to be any
data loss risk, even though the container scanner may detect it as
"corruption". The issue may manifest as a container continuously attempting to
be replicated and failing, or the container being marked unhealthy by the
background container scanner (if it is enabled).
This Jira will fix the issue with the tar packer, and also add a repair step on
datanode startup to create the chunks directory for containers that do not have
one. This step should be a quick addition to datanode startup that already
iterates all the containers, and should not impact startup time.
> Empty KeyValueContainers are replicated without chunks directory
> ----------------------------------------------------------------
>
> Key: HDDS-6235
> URL: https://issues.apache.org/jira/browse/HDDS-6235
> Project: Apache Ozone
> Issue Type: Bug
> Components: Ozone Datanode
> Affects Versions: 1.0.0, 1.1.0, 1.2.0
> Reporter: Ethan Rose
> Assignee: Ethan Rose
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.3.0