[
https://issues.apache.org/jira/browse/HDFS-16262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17680883#comment-17680883
]
Bryan Beaudreault commented on HDFS-16262:
------------------------------------------
Hey [~mccormickt12]. It's not exactly that it was ignored/forgotten, but I
think the problems are sort of orthogonal. I'm not sure it would have made
sense to solve for ignoredNodes in the context of this JIRA.
This Jira addresses a problem caused by persistent state. A DFSInputStream
caches block locations and dead nodes for the lifetime of the stream. While a
stream is open, the underlying replicas of blocks may have moved. If so, the
cached block locations may list datanodes in the wrong order for the best
locality. Also, dead nodes may build up in a way that inevitably leads to a
BlockMissingException even when nothing is actually wrong:
* File has 4 blocks
* Block A has locations 1, 2, 3
* Block B has locations 3, 4, 5
* Block C has locations 6, 7, 8
* Block D has locations 3, 1, 6
* Opening this file involves fetching these block locations
* After opening the file, the block locations are all shuffled around in the
cluster out of band
* Reading this file involves reading these blocks in sequence:
** First we read block A, but it no longer exists at location 1. We add 1 to
deadNodes and find it at location 2, so success.
** Same problem for B: it doesn't exist at its first location 3. We add 3 to
deadNodes and find it at location 4.
** Block C doesn't exist at 6; we add 6 to deadNodes and find it at 7.
** Now we get to D, but all 3 replicas (3, 1, 6) are in deadNodes – those
nodes were never dead, they just no longer hold the replicas for A, B, and C.
They do hold the replicas for D, but the input stream doesn't care – it logs
"Could not obtain block" and makes an expensive refetchLocations call.
* That refetchLocations call increments the stream's global failure counter
by 1.
* Now say your file actually has 100s of blocks, or that you're often
preading in ways that re-request data from blocks over time. If enough
replicas move around frequently enough, you very quickly exceed the failure
count and trigger a BlockMissingException (there's a toy sketch of this
accumulation right after this list).
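Here's that toy sketch – standalone Java with made-up names and data, just
modeling the bookkeeping described above, not the actual DFSInputStream code:
{code:java}
import java.util.*;

// Toy model only (not DFSInputStream code): shows how a per-stream deadNodes
// set can accumulate entries for healthy datanodes after replicas move, until
// some block has every cached location marked "dead".
public class DeadNodeAccumulation {
  public static void main(String[] args) {
    // Block locations as cached when the stream was opened (block -> datanode ids).
    Map<String, List<Integer>> cachedLocations = new LinkedHashMap<>();
    cachedLocations.put("A", List.of(1, 2, 3));
    cachedLocations.put("B", List.of(3, 4, 5));
    cachedLocations.put("C", List.of(6, 7, 8));
    cachedLocations.put("D", List.of(3, 1, 6));

    // Where replicas actually live after out-of-band moves: nodes 1, 3, and 6
    // no longer hold A, B, and C respectively, but all three still hold D.
    Map<String, Set<Integer>> actualLocations = Map.of(
        "A", Set.of(2, 3, 9),
        "B", Set.of(4, 5, 9),
        "C", Set.of(7, 8, 9),
        "D", Set.of(3, 1, 6));

    Set<Integer> deadNodes = new HashSet<>(); // persists for the stream's lifetime
    for (String block : cachedLocations.keySet()) {
      Integer found = null;
      for (Integer node : cachedLocations.get(block)) {
        if (deadNodes.contains(node)) {
          continue; // we already gave up on this node for this stream
        }
        if (actualLocations.get(block).contains(node)) {
          found = node;
          break;
        }
        deadNodes.add(node); // replica moved, but the node gets blamed anyway
      }
      System.out.println("block " + block + ": "
          + (found != null
              ? "read from node " + found
              : "\"Could not obtain block\" -> refetchLocations")
          + ", deadNodes=" + deadNodes);
    }
  }
}
{code}
Running it, A, B, and C each succeed from their second cached location, while
D hits the "Could not obtain block" path with deadNodes having grown to 1, 3,
and 6 even though all of those nodes are healthy.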
For us the openInfo in refetchLocations would cause clear latency spikes, and
the BlockMissingExceptions were worse. So the goal of this Jira was to prevent
that from happening by updating the locations when we know they have changed –
hopefully before they are requested. Once locations are updated, it makes
little sense to keep the same dead nodes list.
The difference with ignoredNodes is that it's scoped to a single request.
Every request starts off with no nodes ignored. The idea with ignoredNodes I
think is that since you're looping and kicking off hedged requests, you want to
make sure you don't submit a request to the same node twice. So on each loop
you add the last node to ignoredNodes.
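Very roughly, my mental model of that loop is something like the sketch below
(made-up names, greatly simplified – not the actual hedgedFetchBlockByteRange
code):
{code:java}
import java.util.*;
import java.util.concurrent.*;

// Rough sketch of my mental model of the hedged-read loop (made-up names,
// greatly simplified - not the actual hedgedFetchBlockByteRange code).
class HedgedReadSketch {
  byte[] hedgedRead(List<Integer> locations, Set<Integer> deadNodes,
                    ExecutorService pool, long hedgeDelayMs) throws Exception {
    List<Integer> ignoredNodes = new ArrayList<>();      // fresh for every request
    CompletionService<byte[]> done = new ExecutorCompletionService<>(pool);
    int outstanding = 0;
    while (true) {
      Integer node = pickNode(locations, deadNodes, ignoredNodes);
      if (node != null) {
        done.submit(() -> readFrom(node));               // hedge to a new node
        ignoredNodes.add(node);                          // never hedge to it twice
        outstanding++;
      } else if (outstanding == 0) {
        // Every cached location is dead or ignored and nothing is in flight:
        // roughly where "Could not obtain block" / refetchLocations would hit.
        throw new IllegalStateException("no usable datanode for this request");
      }
      // Wait a while for any in-flight read; if one finishes, we're done.
      Future<byte[]> first = done.poll(hedgeDelayMs, TimeUnit.MILLISECONDS);
      if (first != null) {
        return first.get();
      }
      // Timed out: loop around and hedge to the next non-dead, non-ignored node.
    }
  }

  private Integer pickNode(List<Integer> locations, Set<Integer> deadNodes,
                           List<Integer> ignoredNodes) {
    for (Integer n : locations) {
      if (!deadNodes.contains(n) && !ignoredNodes.contains(n)) {
        return n;
      }
    }
    return null;
  }

  private byte[] readFrom(Integer node) { return new byte[64]; } // stand-in read
}
{code}
The key point is just that ignoredNodes is created fresh inside each call, so
it only ever reflects the hedges submitted for that one request.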
It's hard to believe ignoredNodes is a problem unless there's really a problem
with those nodes. I imagine there would be other logs leading up to the error
which might indicate why ignoredNodes had to grow beyond 1. After the first
hedge, it looks like further hedge requests are only kicked off if that first
hedge throws an InterruptedException in getFirstToComplete.
I suppose it's possible ignoredNodes is interacting poorly with deadNodes.
Imagine my example above, except block B doesn't fail, so deadNodes is 1, 6. A
request for D comes in and has to hedge the request to 3... So 3 is in
ignoredNodes and 1/6 are in deadNodes – refetchLocations required.
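As a toy illustration of that state (again made-up names, not HDFS code):
{code:java}
import java.util.*;

// Toy illustration of the deadNodes + ignoredNodes interaction (made-up names,
// not HDFS code): block D's cached locations are {3, 1, 6}.
public class IgnoredPlusDeadExample {
  public static void main(String[] args) {
    Set<Integer> deadNodes = Set.of(1, 6);       // blamed while reading A and C
    List<Integer> ignoredNodes = List.of(3);     // a hedge to 3 is already pending
    List<Integer> locationsOfD = List.of(3, 1, 6);

    boolean anyUsable = locationsOfD.stream()
        .anyMatch(n -> !deadNodes.contains(n) && !ignoredNodes.contains(n));
    // Prints false: nowhere left to go, so the read falls back to the expensive
    // refetchLocations even though all three nodes are perfectly healthy.
    System.out.println("usable location for block D? " + anyUsable);
  }
}
{code}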
If that's true, I think this Jira should help the situation since it would
clear out deadNodes periodically. I don't think this Jira can clear out
ignoredNodes because it's not global state that we have access to –
ignoredNodes is allocated in hedgedFetchBlockByteRange itself.
Looking at your Jira, the line numbers in the stack trace are off, so it's
unclear if your install has this Jira available. I'm also not 100% sure of your
approach to clearing ignoredNodes. Based on my understanding, ignoredNodes
should really only contain nodes that actively have hedged requests pending
within the current request context. Clearing the list could cause the next
loop to submit a request to the same node that's already serving one, so you
end up with 2 pending futures to the same node. I could be wrong though; I
haven't spent a lot of time studying the code recently. I'd be looking for
other log indicators leading up to the exception you saw, to see what caused
so many additions to ignoredNodes that all locations were ignored for a single
request.
> Async refresh of cached locations in DFSInputStream
> ---------------------------------------------------
>
> Key: HDFS-16262
> URL: https://issues.apache.org/jira/browse/HDFS-16262
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: Bryan Beaudreault
> Assignee: Bryan Beaudreault
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.4.0, 3.3.5
>
> Time Spent: 5h 40m
> Remaining Estimate: 0h
>
> HDFS-15119 added the ability to invalidate cached block locations in
> DFSInputStream. As written, the feature will affect all DFSInputStreams
> regardless of whether they need it or not. The invalidation also only applies
> on the next request, so the next request will pay the cost of calling
> openInfo before reading the data.
> I'm working on a feature for HBase which enables efficient healing of
> locality through Balancer-style low level block moves (HBASE-26250). I'd like
> to utilize the idea started in HDFS-15119 in order to update DFSInputStreams
> after blocks have been moved to local hosts.
> I was considering using the feature as is, but some of our clusters are quite
> large and I'm concerned about the impact on the namenode:
> * We have some clusters with over 350k StoreFiles, so that'd be 350k
> DFSInputStreams. With such a large number and very active usage, having the
> refresh be in-line makes it too hard to ensure we don't DDOS the NameNode.
> * Currently we need to pay the price of openInfo the next time a
> DFSInputStream is invoked. Moving that async would minimize the latency hit.
> Also, some StoreFiles might be far less frequently accessed, so they may live
> on for a long time before ever refreshing. We'd like to be able to know that
> all DFSInputStreams are refreshed by a given time.
> * We may have 350k files, but only a small percentage of them are ever
> non-local at a given time. Refreshing only if necessary will save a lot of
> work.
> In order to make this as painless to end users as possible, I'd like to:
> * Update the implementation to utilize an async thread for managing
> refreshes. This will give more control over rate limiting across all
> DFSInputStreams in a DFSClient, and also ensure that all DFSInputStreams are
> refreshed.
> * Only refresh files which are lacking a local replica or have known
> deadNodes to be cleaned up
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]