[
https://issues.apache.org/jira/browse/HDFS-3332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13264867#comment-13264867
]
amith commented on HDFS-3332:
-----------------------------
Hi Nicholas,
Please correct me if I am wrong :)
I have an HA configuration with two NNs (nn1=40.95 and nn2=40.96); nn2 is not
started.
I started only one NN and made it active, then wrote a file and corrupted it
manually.
The directory scanner reports the bad block to all the NNs via BPServiceActor.
BPServiceActor#reportBadBlocks(ExtendedBlock block) does not check whether the
DN is registered with the NN.
We try to report bad blocks using bpRegistration (which is null), causing the
NPE.
{code}
void reportBadBlocks(ExtendedBlock block) {
  DatanodeInfo[] dnArr = { new DatanodeInfo(bpRegistration) };
  LocatedBlock[] blocks = { new LocatedBlock(block, dnArr) };
{code}
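To illustrate the kind of check that is missing, here is a minimal, self-contained sketch. The class BadBlockReportSketch, the Registration type, and the String block names are hypothetical stand-ins, not the real BPServiceActor or DatanodeRegistration, and this is not necessarily how the issue should be fixed:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a per-NN actor; NOT the real BPServiceActor. */
public class BadBlockReportSketch {

    /** Hypothetical stand-in for the registration object. */
    static final class Registration {
        final String storageId;
        Registration(String storageId) { this.storageId = storageId; }
    }

    // Stays null until the handshake with the NN has completed.
    private volatile Registration bpRegistration;

    // Records what would have been sent to the NN (illustration only).
    private final List<String> reported = new ArrayList<>();

    /** Simulates a successful registration with the NN. */
    void register(String storageId) {
        bpRegistration = new Registration(storageId);
    }

    /**
     * Reports a bad block only if this actor is registered.
     * Returns false (and sends nothing) when registration is still null,
     * instead of dereferencing it and throwing an NPE.
     */
    boolean reportBadBlock(String blockName) {
        Registration reg = bpRegistration; // read the volatile field once
        if (reg == null) {
            return false; // not registered with this NN yet; skip the report
        }
        reported.add(reg.storageId + ":" + blockName);
        return true;
    }

    List<String> reported() {
        return reported;
    }

    public static void main(String[] args) {
        BadBlockReportSketch actor = new BadBlockReportSketch();
        System.out.println(actor.reportBadBlock("blk_1")); // before registration
        actor.register("DS-1");
        System.out.println(actor.reportBadBlock("blk_1")); // after registration
    }
}
```

Whether a skipped report should be dropped or queued for later is a design choice this sketch does not address.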
Why is bpRegistration null?
{code}
private void connectToNNAndHandshake() throws IOException {
  // get NN proxy
  bpNamenode = dn.connectToNN(nnAddr);

  // First phase of the handshake with NN - get the namespace info.
  NamespaceInfo nsInfo = retrieveNamespaceInfo();

  // Verify that this matches the other NN in this HA pair.
  // This also initializes our block pool in the DN if we are
  // the first NN connection for this BP.
  bpos.verifyAndSetNamespaceInfo(nsInfo);

  // Second phase of the handshake with the NN.
  register();
}
{code}
bpRegistration is assigned in the register() call. However,
retrieveNamespaceInfo() is effectively an infinite loop that keeps trying to
get the version from the NN:
{code}
NamespaceInfo retrieveNamespaceInfo() throws IOException {
  NamespaceInfo nsInfo = null;
  while (shouldRun()) {
    try {
      nsInfo = bpNamenode.versionRequest();
      LOG.debug(this + " received versionRequest response: " + nsInfo);
      break;
    } catch(SocketTimeoutException e) { // namenode is busy
      LOG.warn("Problem connecting to server: " + nnAddr);
    } catch(IOException e) { // namenode is not available
      LOG.warn("Problem connecting to server: " + nnAddr);
    }

    // try again after 5 seconds
    sleepAndLogInterrupts(5000, "requesting version info from NN");
  }

  if (nsInfo != null) {
    checkNNVersion(nsInfo);
  } else {
    throw new IOException("DN shut down before block pool connected");
  }
  return nsInfo;
}
{code}
So while the NN is unreachable, register() is never reached and bpRegistration
is never assigned.
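The ordering problem described above can be reproduced in isolation with a small sketch. Everything here is an assumption made for illustration: HandshakeOrderSketch, its NameNode interface, the String-typed namespace info, and the bounded attempt counter (standing in for shouldRun()) are hypothetical and are not the real HDFS classes.

```java
import java.io.IOException;

/** Hypothetical model of the two-phase handshake; NOT the real HDFS code. */
public class HandshakeOrderSketch {

    /** Hypothetical stand-in for the NN proxy. */
    interface NameNode {
        String versionRequest() throws IOException;
    }

    private final NameNode nn;
    private int remainingAttempts; // stands in for shouldRun()
    String bpRegistration;         // null until phase 2 has run

    HandshakeOrderSketch(NameNode nn, int maxAttempts) {
        this.nn = nn;
        this.remainingAttempts = maxAttempts;
    }

    /** Phase 1: retry until the NN answers or we are told to stop. */
    private String retrieveNamespaceInfo() throws IOException {
        while (remainingAttempts-- > 0) {
            try {
                return nn.versionRequest();
            } catch (IOException e) {
                // NN not available: retry (the quoted code sleeps between attempts)
            }
        }
        throw new IOException("DN shut down before block pool connected");
    }

    /** Phase 2 (registration) only runs if phase 1 returns normally. */
    void connectToNNAndHandshake() throws IOException {
        String nsInfo = retrieveNamespaceInfo();
        bpRegistration = "registered-with-" + nsInfo;
    }

    public static void main(String[] args) {
        HandshakeOrderSketch up = new HandshakeOrderSketch(() -> "ns-1", 3);
        HandshakeOrderSketch down = new HandshakeOrderSketch(() -> {
            throw new IOException("connection refused");
        }, 3);
        for (HandshakeOrderSketch actor : new HandshakeOrderSketch[] { up, down }) {
            try {
                actor.connectToNNAndHandshake();
            } catch (IOException e) {
                // handshake never completed for this NN
            }
            System.out.println("bpRegistration = " + actor.bpRegistration);
        }
    }
}
```

In the sketch, the actor whose NN never answers ends with bpRegistration still null, so any code that uses it without a check would fail in the same way as in the report.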
> NullPointerException in DN when directoryscanner is trying to report bad
> blocks
> -------------------------------------------------------------------------------
>
> Key: HDFS-3332
> URL: https://issues.apache.org/jira/browse/HDFS-3332
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: data-node
> Affects Versions: 3.0.0
> Environment: HDFS
> Reporter: amith
> Assignee: amith
> Fix For: 3.0.0
>
>
> There is 1 NN and 1 DN (the NN is started with an HA configuration).
> I corrupted 1 block and found the following in the DN log:
> {code}
> 2012-04-27 09:59:01,214 INFO datanode.DataNode (BPServiceActor.java:blockReport(401)) - BlockReport of 2 blocks took 0 msec to generate and 5 msecs for RPC and NN processing
> 2012-04-27 09:59:01,214 INFO datanode.DataNode (BPServiceActor.java:blockReport(420)) - sent block report, processed command:org.apache.hadoop.hdfs.server.protocol.FinalizeCommand@3b756db3
> 2012-04-27 09:59:01,726 INFO datanode.DirectoryScanner (DirectoryScanner.java:scan(390)) - BlockPool BP-2087868617-10.18.40.95-1335500488012 Total blocks: 2, missing metadata files:0, missing block files:0, missing blocks in memory:0, mismatched blocks:1
> 2012-04-27 09:59:01,727 WARN impl.FsDatasetImpl (FsDatasetImpl.java:checkAndUpdate(1366)) - Updating size of block -4466699320171028643 from 1024 to 1034
> 2012-04-27 09:59:01,727 WARN impl.FsDatasetImpl (FsDatasetImpl.java:checkAndUpdate(1374)) - Reporting the block blk_-4466699320171028643_1004 as corrupt due to length mismatch
> 2012-04-27 09:59:01,728 DEBUG ipc.Client (Client.java:sendParam(807)) - IPC Client (1957050620) connection to /10.18.40.95:8020 from root sending #257
> 2012-04-27 09:59:01,730 DEBUG ipc.Client (Client.java:receiveResponse(848)) - IPC Client (1957050620) connection to /10.18.40.95:8020 from root got value #257
> 2012-04-27 09:59:01,730 DEBUG ipc.ProtobufRpcEngine (ProtobufRpcEngine.java:invoke(193)) - Call: reportBadBlocks 2
> 2012-04-27 09:59:01,731 ERROR datanode.DirectoryScanner (DirectoryScanner.java:run(288)) - Exception during DirectoryScanner execution - will continue next cycle
> java.lang.NullPointerException
> at org.apache.hadoop.hdfs.protocol.DatanodeID.<init>(DatanodeID.java:66)
> at org.apache.hadoop.hdfs.protocol.DatanodeInfo.<init>(DatanodeInfo.java:87)
> at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.reportBadBlocks(BPServiceActor.java:238)
> at org.apache.hadoop.hdfs.server.datanode.BPOfferService.reportBadBlocks(BPOfferService.java:187)
> at org.apache.hadoop.hdfs.server.datanode.DataNode.reportBadBlocks(DataNode.java:559)
> at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.checkAndUpdate(FsDatasetImpl.java:1377)
> at org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.reconcile(DirectoryScanner.java:318)
> at org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.run(DirectoryScanner.java:284)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
> at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:181)
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:205)
> at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:619)
> {code}
> Here, when the directory scanner tries to report the bad block, we get an NPE.