[
https://issues.apache.org/jira/browse/HDDS-354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16591549#comment-16591549
]
Nanda kumar commented on HDDS-354:
----------------------------------
When a Datanode starts up, it initializes and sends a {{getVersion}} call to
SCM. As part of Datanode initialization, we also start all the ReportPublishers
(including {{NodeReportPublisher}}).
When the Datanode receives the {{getVersion}} response, it runs a set of checks
on all of its volumes using {{HddsVolumeUtil#checkVolume}}.
If a volume shows any discrepancy, we mark that volume as failed by calling
{{VolumeSet#failVolume}}.
In the meantime, based on the configured interval, {{NodeReportPublisher}} wakes
up and tries to generate a NodeReport by calling {{VolumeSet#getNodeReport}}.
{{VolumeSet}} is the class that maintains the list of volumes in the Datanode.
If both calls, {{failVolume}} and {{getNodeReport}}, land in {{VolumeSet}} at
the same time, we access {{VolumeSet#volumeMap}} from two different threads
simultaneously.
*Inconsistent synchronization of {{VolumeSet#volumeMap}}*
{{VolumeSet#volumeMap}} is not thread-safe. We guard it with {{volumeSetLock}}
only when updating the map; when reading or iterating through the map, we don't
acquire any lock.
In our case, the {{VolumeSet#failVolume}} call acquired the lock and called
{{hddsVolume#failVolume}}, which shuts down the {{volumeInfo}} thread in
{{HddsVolume}}, which in turn shuts down the {{VolumeUsage}} thread and drops
the reference in {{VolumeInfo#usage}} (sets it to null).
At this point the {{HddsVolume}} entry has not yet been removed from
{{VolumeSet#volumeMap}}.
In the meantime, the ReportPublisher thread calls {{VolumeSet#getNodeReport}},
which doesn't take any lock, iterates through {{VolumeSet#volumeMap}}, and gets
the {{HddsVolume}} that is already shut down. When we call
{{hddsVolume.getVolumeInfo().getScmUsed()}} on a volume that is shut down, we
end up accessing {{VolumeInfo#usage}}, which is {{null}}.
This race condition is what produces the {{NullPointerException}}.
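The interleaving above can be replayed on a single thread. This is only a minimal sketch: {{VolumeRaceSketch}} and its nested classes are hypothetical, simplified stand-ins, not the actual Ozone classes, and the volume path is made up.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-ins; not the real Ozone classes.
class VolumeRaceSketch {
  static class VolumeUsage {
    long getUsed() { return 42L; }
  }

  static class VolumeInfo {
    private VolumeUsage usage = new VolumeUsage();

    // Throws NullPointerException once usage has been set to null.
    long getScmUsed() { return usage.getUsed(); }

    void shutdown() { usage = null; }
  }

  /** Replays the interleaving; returns true if the reader hits an NPE. */
  static boolean readerHitsNpe() {
    Map<String, VolumeInfo> volumeMap = new HashMap<>();
    VolumeInfo volume = new VolumeInfo();
    volumeMap.put("/data/disk1", volume);

    // Step 1 (failVolume thread): volume is shut down but still in the map.
    volume.shutdown();

    // Step 2 (ReportPublisher thread): iterate without a lock, read usage.
    boolean sawNpe = false;
    try {
      for (VolumeInfo v : volumeMap.values()) {
        v.getScmUsed();
      }
    } catch (NullPointerException e) {
      sawNpe = true;
    }

    // Step 3 (failVolume thread): only now is the entry removed.
    volumeMap.remove("/data/disk1");
    return sawNpe;
  }

  public static void main(String[] args) {
    System.out.println("reader hit NPE: " + readerHitsNpe());
  }
}
```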
The simple fix would be to remove the {{HddsVolume}} entry from
{{VolumeSet#volumeMap}} in methods {{VolumeSet#failVolume}} and
{{VolumeSet#removeVolume}} before calling {{hddsVolume.failVolume()}} or
{{hddsVolume.shutdown()}}.
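A rough sketch of that ordering, assuming simplified stand-in classes ({{RemoveFirstSketch}}, {{Volume}} and {{totalScmUsed}} are hypothetical names) and assuming a {{ConcurrentHashMap}} so that iteration stays safe while an entry is removed:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, simplified stand-ins; not the real Ozone classes.
class RemoveFirstSketch {
  static class VolumeUsage {
    long getUsed() { return 42L; }
  }

  static class Volume {
    private volatile VolumeUsage usage = new VolumeUsage();

    long getScmUsed() { return usage.getUsed(); }

    void shutdown() { usage = null; }
  }

  // Assumption: a concurrent map, so readers can iterate during removal.
  private final Map<String, Volume> volumeMap = new ConcurrentHashMap<>();

  void addVolume(String root) {
    volumeMap.put(root, new Volume());
  }

  void failVolume(String root) {
    // Remove the entry first so new iterations no longer see the volume...
    Volume v = volumeMap.remove(root);
    // ...and only then shut it down.
    if (v != null) {
      v.shutdown();
    }
  }

  // Stand-in for the part of getNodeReport that reads each volume's usage.
  long totalScmUsed() {
    long total = 0;
    for (Volume v : volumeMap.values()) {
      total += v.getScmUsed();
    }
    return total;
  }
}
```

Note that this only narrows the window: a reader that fetched the volume reference just before the removal can still call {{getScmUsed}} after the shutdown.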
A proper fix would be to use a read-write lock and properly synchronize all
access, reads included, to {{VolumeSet#volumeMap}}.
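One possible shape for the read-write lock approach, using {{java.util.concurrent.locks.ReentrantReadWriteLock}}. Again, this is a sketch under assumptions: {{LockedVolumeSetSketch}}, {{Volume}} and {{totalScmUsed}} are hypothetical names, and the real {{VolumeSet}} has more state than is shown here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical, simplified stand-ins; not the real Ozone classes.
class LockedVolumeSetSketch {
  static class VolumeUsage {
    long getUsed() { return 42L; }
  }

  static class Volume {
    private VolumeUsage usage = new VolumeUsage();

    long getScmUsed() { return usage.getUsed(); }

    void shutdown() { usage = null; }
  }

  // A plain HashMap is enough because every access goes through the lock.
  private final Map<String, Volume> volumeMap = new HashMap<>();
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

  void addVolume(String root) {
    lock.writeLock().lock();
    try {
      volumeMap.put(root, new Volume());
    } finally {
      lock.writeLock().unlock();
    }
  }

  void failVolume(String root) {
    // Writers are exclusive: no reader runs between remove and shutdown.
    lock.writeLock().lock();
    try {
      Volume v = volumeMap.remove(root);
      if (v != null) {
        v.shutdown();
      }
    } finally {
      lock.writeLock().unlock();
    }
  }

  // Stand-in for the part of getNodeReport that reads each volume's usage.
  long totalScmUsed() {
    // Readers share the lock with each other but wait for writers.
    lock.readLock().lock();
    try {
      long total = 0;
      for (Volume v : volumeMap.values()) {
        total += v.getScmUsed();
      }
      return total;
    } finally {
      lock.readLock().unlock();
    }
  }
}
```

Because the usage lookup happens while the read lock is held, a reader either sees the volume fully alive or does not see it at all.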
cc. [~arpitagarwal], [~hanishakoneru], [~bharatviswa]
> VolumeInfo.getScmUsed throws NPE
> --------------------------------
>
> Key: HDDS-354
> URL: https://issues.apache.org/jira/browse/HDDS-354
> Project: Hadoop Distributed Data Store
> Issue Type: Bug
> Reporter: Ajay Kumar
> Priority: Major
>
> {code}java.lang.NullPointerException
> at org.apache.hadoop.ozone.container.common.volume.VolumeInfo.getScmUsed(VolumeInfo.java:107)
> at org.apache.hadoop.ozone.container.common.volume.VolumeSet.getNodeReport(VolumeSet.java:366)
> at org.apache.hadoop.ozone.container.ozoneimpl.OzoneContainer.getNodeReport(OzoneContainer.java:264)
> at org.apache.hadoop.ozone.container.common.report.NodeReportPublisher.getReport(NodeReportPublisher.java:64)
> at org.apache.hadoop.ozone.container.common.report.NodeReportPublisher.getReport(NodeReportPublisher.java:39)
> at org.apache.hadoop.ozone.container.common.report.ReportPublisher.publishReport(ReportPublisher.java:86)
> at org.apache.hadoop.ozone.container.common.report.ReportPublisher.run(ReportPublisher.java:73)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:266)
> at java.util.concurrent.FutureTask.run(FutureTask.java)
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745){code}