Re: [ADVISORY] Possible data loss during HDFS decommissioning

Josh Elser Wed, 23 Sep 2015 10:07:24 -0700

-cc user@ (figure I'm forking this into a more dev-focused question now)

True, we don't have procedures for retroactively changing docs. I guess JIRA essentially acts as this version-affected discovery mechanism for us. People generally seem to understand a search of JIRA to find known issues too.

My only worry of creating a page on the website is that it's yet-another place people have to search to get the details on some operational subject. We've been doing well (since 1.6) to capture details like this in the user manual, so I figured this would also make sense to mention there. Perhaps multiple places is reasonable too?


[email protected] wrote:

Known issue in the release notes on the web page? We would have to
update every version though. Seems like we need a known issues document
that lists issues in dependencies that transcend Accumulo versions.

------------------------------------------------------------------------
*From: *"Josh Elser" <[email protected]>
*To: *[email protected]
*Cc: *[email protected]
*Sent: *Wednesday, September 23, 2015 10:26:50 AM
*Subject: *Re: [ADVISORY] Possible data loss during HDFS decommissioning

What kind of documentation can we put in the user manual about this?
Recommend to only decom one rack at a time until we get the issue sorted
out in Hadoop-land?

[email protected] wrote:
 > BLUF: There exists the possibility of data loss when performing
 > DataNode decommissioning with Accumulo running. This note applies to
 > installations of Accumulo 1.5.0+ and Hadoop 2.5.0+.
 >
 > DETAILS: During DataNode decommissioning it is possible for the
 > NameNode to report stale block locations (HDFS-8208). If Accumulo is
 > running during this process then it is possible that files currently
 > being written will not close properly. Accumulo is affected in two ways:
 >
 > 1. During compactions temporary rfiles are created, then closed, and
 > renamed. If a failure happens during the close, the compaction will fail.
 > 2. Write ahead log files are created, written to, and then closed. If
 > a failure happens during the close, then the NameNode will have a walog
 > file with no finalized blocks.
 >
 > If either of these cases happen, decommissioning of the DataNode
 > could hang (HDFS-3599, HDFS-5579) because the files are left in an open
 > for write state. If Accumulo needs the write ahead log for recovery it
 > will be unable to read the file and will not recover.
 >
 > RECOMMENDATION: Assuming that the replication pipeline for the write
 > ahead log is working properly, then you should not run into this issue
 > if you only decommission one rack at a time.
 >
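
For anyone scripting the "one rack at a time" recommendation, here is a rough sketch of how the batching might look. It only groups DataNode hosts by rack and hands back one rack's worth of hosts per step; it does not talk to HDFS. The host names, rack paths, and the rack_batches helper are all made up for illustration. The assumption is that each batch would be added to the excludes file and fully decommissioned before the next batch is started.

```python
from collections import defaultdict


def rack_batches(host_to_rack):
    """Yield (rack, hosts) pairs, one rack at a time, in sorted rack order.

    host_to_rack is a hypothetical mapping of DataNode host name to rack
    path; how you obtain it (topology script, inventory, etc.) is up to you.
    """
    by_rack = defaultdict(list)
    for host, rack in host_to_rack.items():
        by_rack[rack].append(host)
    for rack in sorted(by_rack):
        yield rack, sorted(by_rack[rack])


# Hypothetical topology, not taken from any real cluster.
topology = {
    "dn1.example.com": "/rack1",
    "dn2.example.com": "/rack1",
    "dn3.example.com": "/rack2",
}

batches = list(rack_batches(topology))
for rack, hosts in batches:
    # In practice: exclude these hosts, wait for decommissioning to finish,
    # and only then move on to the next rack.
    print(rack, hosts)
```

Keeping each batch to a single rack means the other racks stay untouched while the excluded nodes drain, which is the condition the recommendation above relies on.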
