[jira] [Work logged] (HDFS-16262) Async refresh of cached locations in DFSInputStream

ASF GitHub Bot (Jira) Thu, 07 Oct 2021 16:38:50 -0700


     [ 
https://issues.apache.org/jira/browse/HDFS-16262?focusedWorklogId=662350&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-662350
 ]


ASF GitHub Bot logged work on HDFS-16262:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 07/Oct/21 23:37
            Start Date: 07/Oct/21 23:37
    Worklog Time Spent: 10m 
      Work Description: bbeaudreault opened a new pull request #3527:
URL: https://github.com/apache/hadoop/pull/3527


   ### Description of PR
   
   Refactor the refresh of cached block locations so that it runs in an 
asynchronous, rate-limited background process. Add an option to refresh a 
DFSInputStream only when necessary. That option defaults to false to preserve 
backwards compatibility with the existing behavior from 
https://issues.apache.org/jira/browse/HDFS-15119
   
   See https://issues.apache.org/jira/browse/HDFS-16262
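
As a rough illustration of the approach (not the code in this PR), the sketch below shows a single background thread that walks every registered stream, optionally skips those that do not need a refresh, and pauses between refreshes as a crude rate limit. `RefreshableStream`, `LocationRefresherSketch`, and their parameters are hypothetical names invented for this example; see the PR for the actual classes and settings.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for a stream with cached block locations. */
interface RefreshableStream {
  boolean needsRefresh();   // e.g. no local replica, or known dead nodes
  void refreshLocations();  // a real client would call the NameNode here
}

/** Sketch of a per-client background refresher (assumed design, not the PR's class). */
final class LocationRefresherSketch implements AutoCloseable {
  private final Set<RefreshableStream> streams = ConcurrentHashMap.newKeySet();
  private final ScheduledExecutorService executor =
      Executors.newSingleThreadScheduledExecutor(r -> {
        Thread t = new Thread(r, "location-refresher-sketch");
        t.setDaemon(true);
        return t;
      });
  private final boolean onlyIfNeeded;
  private final long pauseBetweenRefreshesMs;

  LocationRefresherSketch(boolean onlyIfNeeded, long pauseBetweenRefreshesMs) {
    this.onlyIfNeeded = onlyIfNeeded;
    this.pauseBetweenRefreshesMs = pauseBetweenRefreshesMs;
  }

  void register(RefreshableStream s) { streams.add(s); }

  void unregister(RefreshableStream s) { streams.remove(s); }

  /** Schedules a full pass over all registered streams every intervalMs. */
  void start(long intervalMs) {
    executor.scheduleWithFixedDelay(() -> {
      try {
        runOnce();
      } catch (RuntimeException e) {
        // Swallow so that one failed pass does not cancel future passes.
      }
    }, intervalMs, intervalMs, TimeUnit.MILLISECONDS);
  }

  /** Runs one pass over every stream and returns how many were refreshed. */
  int runOnce() {
    int refreshed = 0;
    for (RefreshableStream s : streams) {
      if (onlyIfNeeded && !s.needsRefresh()) {
        continue; // skip streams that are already fully local and healthy
      }
      s.refreshLocations();
      refreshed++;
      if (pauseBetweenRefreshesMs > 0) {
        try {
          Thread.sleep(pauseBetweenRefreshesMs); // crude rate limit
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          break;
        }
      }
    }
    return refreshed;
  }

  @Override
  public void close() { executor.shutdownNow(); }
}
```

Because one thread serves all streams of a client, the pause bounds the refresh rate for that client as a whole, and every registered stream is still visited on each pass.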
   
   ### How was this patch tested?
   
   I added a new test class, TestLocatedBlocksRefresher. I am also deploying 
this internally on one of our hadoop-3.3 clusters and will report back.
   
   ### For code changes:
   
   - [x] Does the title of this PR start with the corresponding JIRA issue id 
(e.g. 'HADOOP-17799. Your PR title ...')?
   - [ ] Object storage: have the integration tests been executed and the 
endpoint declared according to the connector-specific documentation?
   - [ ] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)?
   - [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, 
`NOTICE-binary` files?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.


Issue Time Tracking
-------------------

    Worklog Id:     (was: 662350)
    Time Spent: 1h 20m  (was: 1h 10m)

> Async refresh of cached locations in DFSInputStream
> ---------------------------------------------------
>
>                 Key: HDFS-16262
>                 URL: https://issues.apache.org/jira/browse/HDFS-16262
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Bryan Beaudreault
>            Assignee: Bryan Beaudreault
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> HDFS-15119 added the ability to invalidate cached block locations in 
> DFSInputStream. As written, the feature affects all DFSInputStreams 
> regardless of whether they need it. The invalidation also only applies 
> on the next request, so that request pays the cost of calling 
> openInfo before reading the data.
>
> I'm working on a feature for HBase which enables efficient healing of 
> locality through Balancer-style low-level block moves (HBASE-26250). I'd like 
> to build on the idea started in HDFS-15119 in order to update DFSInputStreams 
> after blocks have been moved to local hosts.
>
> I was considering using the feature as is, but some of our clusters are quite 
> large and I'm concerned about the impact on the NameNode:
>  * We have some clusters with over 350k StoreFiles, so that'd be 350k 
> DFSInputStreams. With such a large number and very active usage, having the 
> refresh be in-line makes it too hard to ensure we don't DDoS the NameNode.
>  * Currently we need to pay the price of openInfo the next time a 
> DFSInputStream is invoked. Making that asynchronous would minimize the 
> latency hit. Also, some StoreFiles might be far less frequently accessed, so 
> they may live on for a long time before ever refreshing. We'd like to be able 
> to know that all DFSInputStreams are refreshed by a given time.
>  * We may have 350k files, but only a small percentage of them are ever 
> non-local at a given time. Refreshing only if necessary will save a lot of 
> work.
>
> In order to make this as painless to end users as possible, I'd like to:
>  * Update the implementation to use an async thread for managing 
> refreshes. This will give more control over rate limiting across all 
> DFSInputStreams in a DFSClient, and also ensure that all DFSInputStreams are 
> refreshed.
>  * Only refresh files which are lacking a local replica or have known 
> deadNodes to be cleaned up.
>  
>  
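
The "only refresh if necessary" rule in the last bullet could be expressed as a predicate like the one below. This is a sketch under assumptions: block locations are modeled as plain sets of host names, and `RefreshPolicySketch` and `needsRefresh` are hypothetical names, not HDFS APIs.

```java
import java.util.List;
import java.util.Set;

/** Hypothetical policy: decide whether a file's cached locations are worth refreshing. */
final class RefreshPolicySketch {
  private RefreshPolicySketch() {}

  /**
   * @param blockReplicaHosts for each block of the file, the hosts holding a replica
   * @param localHost         the host the reader runs on
   * @param deadNodes         hosts the reader has marked as dead for this file
   * @return true if any block lacks a local replica or any dead node is recorded
   */
  static boolean needsRefresh(List<Set<String>> blockReplicaHosts,
                              String localHost,
                              Set<String> deadNodes) {
    if (!deadNodes.isEmpty()) {
      return true; // a refresh gives the reader a chance to clear stale dead nodes
    }
    for (Set<String> hosts : blockReplicaHosts) {
      if (!hosts.contains(localHost)) {
        return true; // at least one block would be read remotely
      }
    }
    return false; // fully local and no dead nodes: skip the NameNode call
  }
}
```

With a check like this, a client holding many fully local files would issue NameNode calls only for the small fraction that are non-local or carry dead nodes.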



--
This message was sent by Atlassian Jira
(v8.3.4#803005)