[jira] [Commented] (HDFS-15273) CacheReplicationMonitor hold lock for long time and lead to NN out of service

Srinivasu Majeti (Jira) Thu, 26 Oct 2023 09:46:05 -0700


    [ https://issues.apache.org/jira/browse/HDFS-15273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17780003#comment-17780003 ]


Srinivasu Majeti commented on HDFS-15273:
-----------------------------------------

CCing [~weichiu] to review this and approve.

> CacheReplicationMonitor hold lock for long time and lead to NN out of service
> -----------------------------------------------------------------------------
>
>                 Key: HDFS-15273
>                 URL: https://issues.apache.org/jira/browse/HDFS-15273
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: caching, namenode
>            Reporter: Xiaoqiao He
>            Assignee: Xiaoqiao He
>            Priority: Major
>         Attachments: HDFS-15273.001.patch, HDFS-15273.002.patch, 
> HDFS-15273.003.patch
>
>
> CacheReplicationMonitor periodically scans the cache directives and the 
> cached block map. As more and more cache directives are added, 
> CacheReplicationMonitor takes longer and longer to rescan all cache 
> directives and cached blocks. The scan holds the global namesystem write 
> lock, so the NameNode cannot process other requests while a scan is running.
> I think we should warn end users who turn on the CacheManager feature 
> about this risk until the implementation is improved.
> {code:java}
>   private void rescan() throws InterruptedException {
>     scannedDirectives = 0;
>     scannedBlocks = 0;
>     try {
>       // Global namesystem write lock: held until the outer finally block.
>       namesystem.writeLock();
>       try {
>         // Monitor's own lock: held only to check shutdown and set the scan count.
>         lock.lock();
>         if (shutdown) {
>           throw new InterruptedException("CacheReplicationMonitor was " +
>               "shut down.");
>         }
>         curScanCount = completedScanCount + 1;
>       } finally {
>         lock.unlock();
>       }
>       resetStatistics();
>       // Both rescans run while the namesystem write lock is still held.
>       rescanCacheDirectives();
>       rescanCachedBlockMap();
>       blockManager.getDatanodeManager().resetLastCachingDirectiveSentTime();
>     } finally {
>       namesystem.writeUnlock();
>     }
>   }
> {code}
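
For illustration only, here is a minimal sketch of one common way to bound how long a write lock is held during a long scan: process the items in fixed-size batches and release the lock between batches. This is not the attached patch and not Hadoop code. The class ChunkedRescanSketch, its batchSize parameter, and the use of a plain ReentrantReadWriteLock as a stand-in for the namesystem lock are all assumptions made for the example.

```java
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical sketch; names and structure are assumptions, not Hadoop's implementation. */
class ChunkedRescanSketch {
  // Stand-in for the namesystem lock (assumption: a fair read/write lock).
  private final ReentrantReadWriteLock namesystemLock = new ReentrantReadWriteLock(true);
  private final int batchSize;
  private int lockAcquisitions = 0;
  private int scanned = 0;

  ChunkedRescanSketch(int batchSize) {
    if (batchSize <= 0) {
      throw new IllegalArgumentException("batchSize must be positive");
    }
    this.batchSize = batchSize;
  }

  /** Scans every directive, holding the write lock for at most batchSize items at a time. */
  void rescan(List<String> directives) {
    for (int start = 0; start < directives.size(); start += batchSize) {
      int end = Math.min(start + batchSize, directives.size());
      // Acquire before the try block so that finally only unlocks a lock we hold.
      namesystemLock.writeLock().lock();
      try {
        lockAcquisitions++;
        for (String directive : directives.subList(start, end)) {
          scanOne(directive);
        }
      } finally {
        namesystemLock.writeLock().unlock();
      }
      // The lock is free here, so waiting requests can run before the next batch.
    }
  }

  // Placeholder for the per-directive work; here it only counts the item.
  private void scanOne(String directive) {
    scanned++;
  }

  int getLockAcquisitions() {
    return lockAcquisitions;
  }

  int getScanned() {
    return scanned;
  }
}
```

The trade-off is that the scan no longer sees one consistent snapshot: directives may be added or removed between batches, so a real change along these lines would have to tolerate that.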



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

