[
https://issues.apache.org/jira/browse/HDFS-9908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15281774#comment-15281774
]
Wei-Chiu Chuang commented on HDFS-9908:
---------------------------------------
Need to rebase due to HADOOP-12973.
> Datanode should tolerate disk scan failure during NN handshake
> --------------------------------------------------------------
>
> Key: HDFS-9908
> URL: https://issues.apache.org/jira/browse/HDFS-9908
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode
> Affects Versions: 2.5.0
> Environment: CDH5.3.3
> Reporter: Wei-Chiu Chuang
> Assignee: Wei-Chiu Chuang
> Attachments: HDFS-9908.001.patch, HDFS-9908.002.patch,
> HDFS-9908.003.patch, HDFS-9908.004.patch, HDFS-9908.005.patch,
> HDFS-9908.006.patch, HDFS-9908.007.patch
>
>
> DN may treat a disk scan failure exception as an NN handshake exception, and
> this can prevent a DN to join a cluster even if most of its disks are healthy.
> During NN handshake, DN initializes block pools. It will create a lock files
> per disk, and then scan the volumes. However, if the scanning throws
> exceptions due to disk failure, DN will think it's an exception because NN is
> inconsistent with the local storage (see {{DataNode#initBlockPool}}. As a
> result, it will attempt to reconnect to NN again.
> However, at this point, DN has not deleted its lock files on the disks. If it
> reconnects to NN again, it will think the same disks are already being used,
> and then it will fail handshake again because all disks can not be used (due
> to locking), and repeatedly. This will happen even if the DN has multiple
> disks, and only one of them fails. The DN will not be able to connect to NN
> despite just one failing disk. Note that it is possible to successfully
> create a lock file on a disk, and then has error scanning the disk.
> We saw this on a CDH 5.3.3 cluster (which is based on Apache Hadoop 2.5.0,
> and we still see the same bug in 3.0.0 trunk branch). The root cause is that
> DN treats an internal error (single disk failure) as an external one (NN
> handshake failure) and we should fix it.
> {code:title=DataNode.java}
> /**
> * One of the Block Pools has successfully connected to its NN.
> * This initializes the local storage for that block pool,
> * checks consistency of the NN's cluster ID, etc.
> *
> * If this is the first block pool to register, this also initializes
> * the datanode-scoped storage.
> *
> * @param bpos Block pool offer service
> * @throws IOException if the NN is inconsistent with the local storage.
> */
> void initBlockPool(BPOfferService bpos) throws IOException {
> NamespaceInfo nsInfo = bpos.getNamespaceInfo();
> if (nsInfo == null) {
> throw new IOException("NamespaceInfo not found: Block pool " + bpos
> + " should have retrieved namespace info before initBlockPool.");
> }
>
> setClusterId(nsInfo.clusterID, nsInfo.getBlockPoolID());
> // Register the new block pool with the BP manager.
> blockPoolManager.addBlockPool(bpos);
>
> // In the case that this is the first block pool to connect, initialize
> // the dataset, block scanners, etc.
> initStorage(nsInfo);
> // Exclude failed disks before initializing the block pools to avoid
> startup
> // failures.
> checkDiskError();
> data.addBlockPool(nsInfo.getBlockPoolID(), conf); <----- this line
> throws disk error exception
> blockScanner.enableBlockPoolId(bpos.getBlockPoolId());
> initDirectoryScanner(conf);
> }
> {code}
> {{FsVolumeList#addBlockPool}} is the source of exception.
> {code:title=FsVolumeList.java}
> void addBlockPool(final String bpid, final Configuration conf) throws
> IOException {
> long totalStartTime = Time.monotonicNow();
>
> final List<IOException> exceptions = Collections.synchronizedList(
> new ArrayList<IOException>());
> List<Thread> blockPoolAddingThreads = new ArrayList<Thread>();
> for (final FsVolumeImpl v : volumes) {
> Thread t = new Thread() {
> public void run() {
> try (FsVolumeReference ref = v.obtainReference()) {
> FsDatasetImpl.LOG.info("Scanning block pool " + bpid +
> " on volume " + v + "...");
> long startTime = Time.monotonicNow();
> v.addBlockPool(bpid, conf);
> long timeTaken = Time.monotonicNow() - startTime;
> FsDatasetImpl.LOG.info("Time taken to scan block pool " + bpid +
> " on " + v + ": " + timeTaken + "ms");
> } catch (ClosedChannelException e) {
> // ignore.
> } catch (IOException ioe) {
> FsDatasetImpl.LOG.info("Caught exception while scanning " + v +
> ". Will throw later.", ioe);
> exceptions.add(ioe);
> }
> }
> };
> blockPoolAddingThreads.add(t);
> t.start();
> }
> for (Thread t : blockPoolAddingThreads) {
> try {
> t.join();
> } catch (InterruptedException ie) {
> throw new IOException(ie);
> }
> }
> if (!exceptions.isEmpty()) {
> throw exceptions.get(0); <----- here's the original of exception
> }
>
> long totalTimeTaken = Time.monotonicNow() - totalStartTime;
> FsDatasetImpl.LOG.info("Total time to scan all replicas for block pool " +
> bpid + ": " + totalTimeTaken + "ms");
> }
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]