[
https://issues.apache.org/jira/browse/HDFS-10830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15458960#comment-15458960
]
Arpit Agarwal commented on HDFS-10830:
--------------------------------------
bq. Wouldn't it be better to go for the following ..wait and signal model
compared to polling
I completely agree, but that may be a more complex change. Let's fix the
immediate problem first and address the signaling improvement later. Sound fair?
I assigned it to myself.
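For reference, the wait-and-signal model mentioned above could look roughly like the sketch below. The class name VolumeRefCount and its methods are hypothetical and are not part of the HDFS code base; this only illustrates replacing a polling loop with a monitor wait that is woken when the last reference is dropped.

```java
/**
 * Hypothetical reference counter (not an HDFS class) showing a
 * wait-and-signal alternative to polling a reference count.
 */
public class VolumeRefCount {
    private int refs;

    /** Record that one more user holds the volume. */
    public synchronized void reference() {
        refs++;
    }

    /** Drop one reference and wake any waiters when the count reaches zero. */
    public synchronized void unreference() {
        if (refs <= 0) {
            throw new IllegalStateException("no outstanding references");
        }
        refs--;
        if (refs == 0) {
            notifyAll();
        }
    }

    /**
     * Block until the count reaches zero or the timeout elapses.
     * The method is synchronized, so the caller owns this object's
     * monitor while calling wait().
     *
     * @return true if the count is zero, false on timeout or interrupt
     */
    public synchronized boolean awaitZero(long timeoutMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (refs > 0) {
            long remainingMs = (deadline - System.nanoTime()) / 1_000_000L;
            if (remainingMs <= 0) {
                return false;
            }
            try {
                wait(remainingMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }
}
```

The loop around wait() guards against spurious wakeups, and the deadline is recomputed on each pass so the total wait stays bounded by the timeout.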
> FsDatasetImpl#removeVolumes() crashes with IllegalMonitorStateException when
> vol being removed is in use
> --------------------------------------------------------------------------------------------------------
>
> Key: HDFS-10830
> URL: https://issues.apache.org/jira/browse/HDFS-10830
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: hdfs
> Affects Versions: 3.0.0-alpha1
> Reporter: Manoj Govindassamy
> Assignee: Arpit Agarwal
> Attachments: HDFS-10830.01.patch
>
>
> {{FsDatasetImpl#removeVolumes()}} operation crashes abruptly with
> IllegalMonitorStateException whenever the volume being removed is in use
> concurrently.
> Looks like {{removeVolumes()}} is waiting on a monitor object "this" (that is
> FsDatasetImpl) which it has never locked, leading to
> IllegalMonitorStateException. This monitor wait happens only when the volume
> being removed is in use (reference count > 0). The thread performing this
> remove volume operation thus crashes abruptly and block invalidations for the
> removed volumes are totally skipped.
> {code:title=FsDatasetImpl.java|borderStyle=solid}
> @Override
> public void removeVolumes(Set<File> volumesToRemove, boolean clearFailure) {
>   ..
>   ..
>   try (AutoCloseableLock lock = datasetLock.acquire()) {  // <== LOCK acquire datasetLock
>     for (int idx = 0; idx < dataStorage.getNumStorageDirs(); idx++) {
>       .. .. ..
>       asyncDiskService.removeVolume(sd.getCurrentDir());  // <== volume SD1 remove
>       volumes.removeVolume(absRoot, clearFailure);
>       volumes.waitVolumeRemoved(5000, this);  // <== WAIT on "this"
>       // ?? But, we haven't locked it yet. This will cause
>       // IllegalMonitorStateException and crash getBlockReports()/FBR thread!
>       for (String bpid : volumeMap.getBlockPoolList()) {
>         List<ReplicaInfo> blocks = new ArrayList<>();
>         for (Iterator<ReplicaInfo> it = volumeMap.replicas(bpid).iterator();
>             it.hasNext(); ) {
>           .. .. ..
>           it.remove();  // <== volumeMap removal
>         }
>         blkToInvalidate.put(bpid, blocks);
>       }
>       .. ..
>   }  // <== LOCK release datasetLock
>   // Call this outside the lock.
>   for (Map.Entry<String, List<ReplicaInfo>> entry :
>       blkToInvalidate.entrySet()) {
>     ..
>     for (ReplicaInfo block : blocks) {
>       invalidate(bpid, block);  // <== Notify NN of Block removal
>     }
>   }
> {code}
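
The failure mode described above can be reproduced in isolation. The sketch below is not HDFS code: the class MonitorWaitDemo is hypothetical, and a plain java.util.concurrent.locks.ReentrantLock stands in for datasetLock. It shows that holding an explicit lock does not make the thread the owner of an object's intrinsic monitor, so Object.wait() still throws IllegalMonitorStateException unless the call is made inside a synchronized block on that same object.

```java
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical stand-alone demo; not part of FsDatasetImpl. */
public class MonitorWaitDemo {
    // Stand-in for datasetLock (assumption: any explicit lock behaves the same here).
    private static final ReentrantLock LOCK = new ReentrantLock();

    /** Waits on the monitor while holding only the explicit lock. */
    static boolean throwsWithoutMonitor(Object monitor) {
        LOCK.lock();
        try {
            monitor.wait(10);  // the thread does not own monitor's intrinsic lock
            return false;
        } catch (IllegalMonitorStateException e) {
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            LOCK.unlock();
        }
    }

    /** Waits on the monitor from inside a synchronized block on it. */
    static boolean throwsWithMonitor(Object monitor) {
        synchronized (monitor) {
            try {
                monitor.wait(10);  // legal: the thread owns the monitor
                return false;
            } catch (IllegalMonitorStateException e) {
                return true;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        Object monitor = new Object();
        System.out.println("explicit lock only, threw: " + throwsWithoutMonitor(monitor));
        System.out.println("synchronized block, threw: " + throwsWithMonitor(monitor));
    }
}
```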