[
https://issues.apache.org/jira/browse/HDFS-9781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15453821#comment-15453821
]
Manoj Govindassamy commented on HDFS-9781:
------------------------------------------
Ref:
{noformat}
@Override
public Map<DatanodeStorage, BlockListAsLongs> getBlockReports(String bpid) {
  Map<DatanodeStorage, BlockListAsLongs> blockReportsMap =
      new HashMap<DatanodeStorage, BlockListAsLongs>();
  Map<String, BlockListAsLongs.Builder> builders =
      new HashMap<String, BlockListAsLongs.Builder>();
  List<FsVolumeImpl> curVolumes = volumes.getVolumes();
  for (FsVolumeSpi v : curVolumes) {
    builders.put(v.getStorageID(), BlockListAsLongs.builder(maxDataLength));
  }
  try (AutoCloseableLock lock = datasetLock.acquire()) {
    for (ReplicaInfo b : volumeMap.replicas(bpid)) {
      switch(b.getState()) {
      case FINALIZED:
      case RBW:
      case RWR:
        builders.get(b.getVolume().getStorageID()).add(b); <=== NPE
        break;
      case RUR:
{noformat}
[~daryn],
{quote}
One way to fix this might be to call volumes.waitVolumeRemoved() before
volumes.removeVolume() in FsDatasetImpl#removeVolumes()
{quote}
{noformat}
LOG.info("Removing " + absRoot + " from FsDataset.");
// Disable the volume from the service.
asyncDiskService.removeVolume(sd.getCurrentDir());
volumes.removeVolume(absRoot, clearFailure);
volumes.waitVolumeRemoved(5000, this);
{noformat}
* I believe {{volumes.waitVolumeRemoved(5000, this)}} cannot be moved before
{{volumes.removeVolume(absRoot, clearFailure)}}, since {{waitVolumeRemoved()}} is
dependent on the {{volumesBeingRemoved}} queue, which is only filled in by
{{removeVolume()}}.
* Also, the intention of {{waitVolumeRemoved()}} is to wait for the removed
volume's reference count to go to zero.
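The ordering constraint in the first bullet can be sketched with a toy model (the class and method bodies below are hypothetical simplifications, not the real {{FsVolumeList}}): {{removeVolume()}} fills the queue that {{waitVolumeRemoved()}} drains, so calling the wait first finds nothing to wait on.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical simplification of the FsVolumeList removal hand-off.
public class VolumeRemovalSketch {
    private final Queue<String> volumesBeingRemoved = new ArrayDeque<>();

    // removeVolume() is what populates volumesBeingRemoved.
    public void removeVolume(String storageId) {
        volumesBeingRemoved.add(storageId);
    }

    // In the real code this blocks until each queued volume's reference
    // count drops to zero; here we only report whether there was any
    // queued volume to wait for.
    public boolean waitVolumeRemoved() {
        boolean hadWork = !volumesBeingRemoved.isEmpty();
        volumesBeingRemoved.clear();
        return hadWork;
    }

    public static void main(String[] args) {
        VolumeRemovalSketch v = new VolumeRemovalSketch();
        // Reordered: the wait runs against an empty queue and is a no-op.
        System.out.println(v.waitVolumeRemoved()); // false
        v.removeVolume("DS-1");
        System.out.println(v.waitVolumeRemoved()); // true
    }
}
```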
[~virajith],[~daryn],
{quote}
make FsDatasetImpl#getBlockReports() hold the lock while it is referring to
both volumes and volumeMap i.e., move synchronized(this) before
volumes.getVolumes().
{quote}
* I guess your proposal here is something like the below (in the latest code
base, I am assuming {{datasetLock}} is the synchronized block you were
referring to):
** {noformat}
try (AutoCloseableLock lock = datasetLock.acquire()) {
  curVolumes = volumes.getVolumes();
  for (FsVolumeSpi v : curVolumes) {
    builders.put(v.getStorageID(), BlockListAsLongs.builder(maxDataLength));
  }
{noformat}
* IMHO, moving {{volumes.getVolumes()}} alone inside the lock will not solve
the problem, particularly the NPE we are seeing in
TestDataNodeHotSwapVolumes. The NPE happens on the {{add}} after
{{builders.get(<storageID>)}}, which means the {{builders}} map doesn't have
the storage ID being passed in.
* That is,
** The {{curVolumes}} list is not the problem here. This list has the right
set of volumes after the previous volume removal operation.
** It is the {{ReplicaInfo}} from {{volumeMap.replicas(bpid)}} which still
holds a reference to the removed volume.
** But the {{builders}} map doesn't have an entry for the removed volume.
** So {{builders.get(b.getVolume().getStorageID()).add(b);}} turns into
*null.add(b)*, hence the NPE.
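The chain above can be reproduced with a toy version of the builders map (all names below are illustrative, not the actual HDFS types): {{Map.get()}} returns {{null}} for an absent key, so the chained call throws.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Minimal reproduction of the failure mode: builders is keyed by the
// *current* volumes, but a replica can still point at a removed volume.
public class BuildersNpeSketch {
    public static void main(String[] args) {
        Map<String, StringBuilder> builders = new HashMap<>();
        for (String storageId : Arrays.asList("DS-1", "DS-2")) {
            builders.put(storageId, new StringBuilder());
        }
        // Storage ID of a volume removed in between; absent from builders.
        String removedVolumeStorageId = "DS-0";
        try {
            // get() returns null, so this is null.append(...) -> NPE,
            // just like null.add(b) in getBlockReports().
            builders.get(removedVolumeStorageId).append("blk_0");
        } catch (NullPointerException e) {
            System.out.println("NPE on removed volume's storage ID");
        }
    }
}
```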
Please let me know if my understanding is wrong.
> FsDatasetImpl#getBlockReports can occasionally throw NullPointerException
> -------------------------------------------------------------------------
>
> Key: HDFS-9781
> URL: https://issues.apache.org/jira/browse/HDFS-9781
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode
> Affects Versions: 3.0.0-alpha1
> Environment: Jenkins
> Reporter: Wei-Chiu Chuang
> Assignee: Manoj Govindassamy
> Attachments: HDFS-9781.01.patch
>
>
> FsDatasetImpl#getBlockReports occasionally throws NPE. The NPE caused
> TestFsDatasetImpl#testRemoveVolumeBeingWritten to time out, because the test
> waits for the call to FsDatasetImpl#getBlockReports to complete without
> exceptions.
> Additionally, the test should be updated to identify an expected exception,
> using {{GenericTestUtils.assertExceptionContains()}}
> {noformat}
> Exception in thread "Thread-20" java.lang.NullPointerException
> at
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getBlockReports(FsDatasetImpl.java:1709)
> at
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestFsDatasetImpl$1BlockReportThread.run(TestFsDatasetImpl.java:587)
> 2016-02-08 15:47:30,379 [Thread-21] WARN impl.TestFsDatasetImpl
> (TestFsDatasetImpl.java:run(606)) - Exception caught. This should not affect
> the test
> java.io.IOException: Failed to move meta file for ReplicaBeingWritten,
> blk_0_0, RBW
> getNumBytes() = 0
> getBytesOnDisk() = 0
> getVisibleLength()= 0
> getVolume() =
> /home/weichiu/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/Nmi6rYndvr/data0/current
> getBlockFile() =
> /home/weichiu/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/Nmi6rYndvr/data0/current/bpid-0/current/rbw/blk_0
> bytesAcked=0
> bytesOnDisk=0 from
> /home/weichiu/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/Nmi6rYndvr/data0/current/bpid-0/current/rbw/blk_0_0.meta
> to
> /home/weichiu/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/Nmi6rYndvr/data0/current/bpid-0/current/finalized/subdir0/subdir0/blk_0_0.meta
> at
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.moveBlockFiles(FsDatasetImpl.java:857)
> at
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.addFinalizedBlock(BlockPoolSlice.java:295)
> at
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.addFinalizedBlock(FsVolumeImpl.java:819)
> at
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.finalizeReplica(FsDatasetImpl.java:1620)
> at
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.finalizeBlock(FsDatasetImpl.java:1601)
> at
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestFsDatasetImpl$1ResponderThread.run(TestFsDatasetImpl.java:603)
> Caused by: java.io.IOException:
> renameTo(src=/home/weichiu/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/Nmi6rYndvr/data0/current/bpid-0/current/rbw/blk_0_0.meta,
>
> dst=/home/weichiu/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/Nmi6rYndvr/data0/current/bpid-0/current/finalized/subdir0/subdir0/blk_0_0.meta)
> failed.
> at org.apache.hadoop.io.nativeio.NativeIO.renameTo(NativeIO.java:873)
> at
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.moveBlockFiles(FsDatasetImpl.java:855)
> ... 5 more
> 2016-02-08 15:47:34,381 [Thread-19] INFO impl.FsDatasetImpl
> (FsVolumeList.java:waitVolumeRemoved(287)) - Volume reference is released.
> 2016-02-08 15:47:34,384 [Thread-19] INFO impl.TestFsDatasetImpl
> (TestFsDatasetImpl.java:testRemoveVolumeBeingWritten(622)) - Volumes removed
> {noformat}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)