[jira] [Commented] (HDDS-354) VolumeInfo.getScmUsed throws NPE

Nanda kumar (JIRA) Fri, 24 Aug 2018 05:10:18 -0700


    [ 
https://issues.apache.org/jira/browse/HDDS-354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16591549#comment-16591549
 ]


Nanda kumar commented on HDDS-354:
----------------------------------

When a Datanode starts up, it initializes and sends a {{getVersion}} call to 
SCM. As part of Datanode initialization, we also start all the 
ReportPublishers (this includes {{NodeReportPublisher}}).
When the Datanode receives the response to the {{getVersion}} call, it 
performs a set of checks on all of its volumes using 
{{HddsVolumeUtil#checkVolume}}.
If there is a discrepancy in a volume, we mark that volume as failed by 
calling {{VolumeSet#failVolume}}.

In the meantime, based on the configured interval, {{NodeReportPublisher}} 
wakes up and tries to generate a NodeReport using the 
{{VolumeSet#getNodeReport}} call.

VolumeSet is the one which maintains the list of Volumes in the Datanode.

If both calls, {{failVolume}} and {{getNodeReport}}, land in {{VolumeSet}} 
at the same time, we access {{VolumeSet#volumeMap}} from two different 
threads simultaneously.

*Inconsistent synchronization of {{VolumeSet#volumeMap}}*
 {{VolumeSet#volumeMap}} is not thread safe. We lock it using 
{{volumeSetLock}} only when we update the map; while reading or iterating 
through the map, we don't acquire any lock.

 

In our case, the {{VolumeSet#failVolume}} call acquired the lock and called 
{{hddsVolume#failVolume}}, which shuts down the {{volumeInfo}} thread in 
{{HddsVolume}}, which in turn shuts down the {{VolumeUsage}} thread and clears 
the {{VolumeInfo#usage}} reference (sets it to null).

At this point, we still haven't removed the {{HddsVolume}} entry from 
{{VolumeSet#volumeMap}}.

In the meantime, the ReportPublisher thread calls {{VolumeSet#getNodeReport}}, 
which doesn't acquire any lock, iterates through {{VolumeSet#volumeMap}}, and 
gets the {{HddsVolume}} which is already shut down. When we call 
{{hddsVolume.getVolumeInfo().getScmUsed()}} on a volume which is shut down, we 
end up accessing {{VolumeInfo#usage}}, which is {{null}}.

Because of this race condition, we end up with a {{NullPointerException}}.
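
The interleaving described above can be replayed deterministically on a single 
thread. The following is only a minimal sketch: {{SketchVolume}}, 
{{RaceReplay}}, the volume path, and the usage value are hypothetical stand-ins 
and not the actual Ozone classes; only the order of operations is taken from 
the analysis above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for HddsVolume/VolumeInfo; only the nullable
// "usage" field matters for this sketch.
class SketchVolume {
    private Long usage = 42L;            // models VolumeInfo#usage

    void shutdown() {
        usage = null;                    // shutdown clears the reference
    }

    long getScmUsed() {
        return usage;                    // unboxing null throws NullPointerException
    }
}

class RaceReplay {
    /** Replays the bad interleaving; returns true if the report step hit an NPE. */
    static boolean replay() {
        Map<String, SketchVolume> volumeMap = new HashMap<>();
        volumeMap.put("/data/disk1", new SketchVolume());

        // Step 1 (failVolume thread): the volume is shut down while its
        // entry is still present in the map.
        volumeMap.get("/data/disk1").shutdown();

        // Step 2 (report thread): iterates the map without any lock and
        // reaches the volume that was just shut down.
        try {
            long total = 0;
            for (SketchVolume volume : volumeMap.values()) {
                total += volume.getScmUsed();
            }
            return false;
        } catch (NullPointerException e) {
            return true;
        }
        // Step 3 (failVolume thread) would remove the entry, but too late.
    }

    public static void main(String[] args) {
        System.out.println("NPE reproduced: " + replay());
    }
}
```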

The simple fix would be to remove the {{HddsVolume}} entry from 
{{VolumeSet#volumeMap}} in methods {{VolumeSet#failVolume}} and 
{{VolumeSet#removeVolume}} before calling {{hddsVolume.failVolume()}} or 
{{hddsVolume.shutdown()}}.
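
A rough sketch of that reordering, assuming a plain monitor in place of 
{{volumeSetLock}} and hypothetical {{OrderedVolume}} / {{OrderedVolumeSet}} 
classes (none of these names or signatures come from the Ozone code base). The 
read path is deliberately left lock-free, as it is today, so this only narrows 
the window: a reader that already obtained the volume reference before the 
removal could presumably still see it shut down.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for HddsVolume/VolumeInfo.
class OrderedVolume {
    private Long usage;

    OrderedVolume(long used) {
        usage = used;
    }

    void shutdown() {
        usage = null;
    }

    long getScmUsed() {
        return usage;
    }
}

class OrderedVolumeSet {
    private final Map<String, OrderedVolume> volumeMap = new LinkedHashMap<>();
    private final Object volumeSetLock = new Object(); // assumption: plain monitor

    void addVolume(String root, long used) {
        synchronized (volumeSetLock) {
            volumeMap.put(root, new OrderedVolume(used));
        }
    }

    void failVolume(String root) {
        synchronized (volumeSetLock) {
            // 1. Remove the entry first, so later iterations no longer find it.
            OrderedVolume volume = volumeMap.remove(root);
            // 2. Only then shut the volume down.
            if (volume != null) {
                volume.shutdown();
            }
        }
    }

    // Still unsynchronized, mirroring the current getNodeReport read path.
    long totalScmUsed() {
        long total = 0;
        for (OrderedVolume volume : volumeMap.values()) {
            total += volume.getScmUsed();
        }
        return total;
    }
}
```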

A proper fix would be to use a read-write lock and properly synchronize all 
access to {{VolumeSet#volumeMap}}.
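
One possible shape for this, sketched with 
{{java.util.concurrent.locks.ReentrantReadWriteLock}}. The classes 
{{LockedVolume}} and {{LockedVolumeSet}} and their methods are hypothetical 
and only illustrate the locking discipline: every mutation takes the write 
lock, every read or iteration takes the read lock, so a report can never 
observe a volume that is halfway through {{failVolume}}.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for HddsVolume/VolumeInfo.
class LockedVolume {
    private Long usage;

    LockedVolume(long used) {
        usage = used;
    }

    void shutdown() {
        usage = null;
    }

    long getScmUsed() {
        return usage;
    }
}

class LockedVolumeSet {
    private final Map<String, LockedVolume> volumeMap = new HashMap<>();
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    void addVolume(String root, long used) {
        lock.writeLock().lock();
        try {
            volumeMap.put(root, new LockedVolume(used));
        } finally {
            lock.writeLock().unlock();
        }
    }

    void failVolume(String root) {
        lock.writeLock().lock();
        try {
            // Removal and shutdown happen under the write lock, so no
            // reader can run between the two steps.
            LockedVolume volume = volumeMap.remove(root);
            if (volume != null) {
                volume.shutdown();
            }
        } finally {
            lock.writeLock().unlock();
        }
    }

    long totalScmUsed() {
        lock.readLock().lock();
        try {
            // Multiple reports may iterate concurrently under the read lock.
            long total = 0;
            for (LockedVolume volume : volumeMap.values()) {
                total += volume.getScmUsed();
            }
            return total;
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

The read lock keeps concurrent reports cheap, while {{failVolume}} and 
{{removeVolume}} style updates are serialized against them.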

cc. [~arpitagarwal], [~hanishakoneru], [~bharatviswa]

> VolumeInfo.getScmUsed throws NPE
> --------------------------------
>
>                 Key: HDDS-354
>                 URL: https://issues.apache.org/jira/browse/HDDS-354
>             Project: Hadoop Distributed Data Store
>          Issue Type: Bug
>            Reporter: Ajay Kumar
>            Priority: Major
>
> {code}java.lang.NullPointerException
>       at org.apache.hadoop.ozone.container.common.volume.VolumeInfo.getScmUsed(VolumeInfo.java:107)
>       at org.apache.hadoop.ozone.container.common.volume.VolumeSet.getNodeReport(VolumeSet.java:366)
>       at org.apache.hadoop.ozone.container.ozoneimpl.OzoneContainer.getNodeReport(OzoneContainer.java:264)
>       at org.apache.hadoop.ozone.container.common.report.NodeReportPublisher.getReport(NodeReportPublisher.java:64)
>       at org.apache.hadoop.ozone.container.common.report.NodeReportPublisher.getReport(NodeReportPublisher.java:39)
>       at org.apache.hadoop.ozone.container.common.report.ReportPublisher.publishReport(ReportPublisher.java:86)
>       at org.apache.hadoop.ozone.container.common.report.ReportPublisher.run(ReportPublisher.java:73)
>       at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>       at java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:266)
>       at java.util.concurrent.FutureTask.run(FutureTask.java)
>       at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>       at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>       at java.lang.Thread.run(Thread.java:745){code}



