[jira] [Commented] (HDFS-14283) DFSInputStream to prefer cached replica

Siyao Meng (Jira) Tue, 01 Oct 2019 16:14:19 -0700


    [ 
https://issues.apache.org/jira/browse/HDFS-14283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16942366#comment-16942366
 ]


Siyao Meng commented on HDFS-14283:
-----------------------------------

Thanks for the patch [~leosun08].

1. After a quick review, it seems that the rev 001 patch essentially chooses 
the *last* cached DataNode location, because the loop keeps overwriting the 
choice instead of stopping at the first match. We should at least put a
{code}
break;
{code}
after
{code}
chosenNode = cachedLocs[i];
{code}
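
To illustrate the point, here is a minimal sketch. It is not the code from the 
patch: the class name, the method name, the *ignored* set and the use of plain 
strings in place of *DatanodeInfo* are all assumptions made for the example. 
It only shows how the early exit makes the loop return the first usable cached 
location rather than the last one.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for the selection loop; strings replace DatanodeInfo.
class FirstCachedLocation {

  /** Returns the first cached location that is not ignored, or null. */
  static String chooseFirstCached(String[] cachedLocs, Set<String> ignored) {
    String chosenNode = null;
    for (int i = 0; i < cachedLocs.length; i++) {
      if (!ignored.contains(cachedLocs[i])) {
        chosenNode = cachedLocs[i];
        break; // without this, later iterations overwrite chosenNode
      }
    }
    return chosenNode;
  }

  public static void main(String[] args) {
    String[] cachedLocs = {"dn1", "dn2", "dn3"};
    Set<String> ignored = new HashSet<>(Arrays.asList("dn1"));
    // dn1 is skipped, dn2 is the first usable entry.
    System.out.println(chooseFirstCached(cachedLocs, ignored)); // prints dn2
  }
}
```

With the *break* removed, the same input would yield dn3, the last usable 
entry.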

2. *block.getLocations()* returns a list of DataNodes in priority order. Can 
you confirm whether *block.getCachedLocations()* does the same? Skimming 
through the usages of *LocatedBlock.cachedLocs*, it isn't clear to me that the 
cached locations are in any priority order. If they aren't, the change could 
turn a cached DataNode into a hotspot: the client will ALWAYS choose the same 
DataNode among those that cached the block, possibly saturating the bandwidth 
to that specific DataNode and slowing down the client. CMIIW. [~jojochuang]

In my mind, we should arrange *LocatedBlock.cachedLocs* in some priority order, 
just like *LocatedBlock.locs*. Only then can we (almost) safely use the first 
valid location in the *LocatedBlock.cachedLocs* array.
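
As a rough sketch of a related, client-side alternative (not what the patch 
does, and not the reordering of *cachedLocs* suggested above): instead of 
trusting the order of the cached array, walk the already prioritized location 
list and take the first entry that is also cached. The class and method names 
and the *ignored* set are hypothetical, plain strings stand in for 
*DatanodeInfo*, and the sketch assumes every cached location also appears in 
the regular location list.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper; strings replace DatanodeInfo for self-containment.
class PriorityCachedLocation {

  /**
   * Walks locs in its given (priority) order and returns the first node that
   * is cached and not ignored, or null if there is none.
   */
  static String chooseCachedByPriority(
      String[] locs, String[] cachedLocs, Set<String> ignored) {
    Set<String> cached = new HashSet<>(Arrays.asList(cachedLocs));
    for (String node : locs) {
      if (cached.contains(node) && !ignored.contains(node)) {
        return node;
      }
    }
    return null; // caller would fall back to the normal replica choice
  }

  public static void main(String[] args) {
    String[] locs = {"dn3", "dn1", "dn2"};   // assumed priority order
    String[] cachedLocs = {"dn2", "dn3"};    // order carries no meaning here
    Set<String> ignored = new HashSet<>();
    // dn3 is the highest-priority location that is also cached.
    System.out.println(chooseCachedByPriority(locs, cachedLocs, ignored)); // prints dn3
  }
}
```

This only makes the choice follow whatever order *locs* already has; whether 
that is enough to avoid a hotspot depends on how that order is produced.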

> DFSInputStream to prefer cached replica
> ---------------------------------------
>
>                 Key: HDFS-14283
>                 URL: https://issues.apache.org/jira/browse/HDFS-14283
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>    Affects Versions: 2.6.0
>         Environment: HDFS Caching
>            Reporter: Wei-Chiu Chuang
>            Assignee: Lisheng Sun
>            Priority: Major
>         Attachments: HDFS-14283.001.patch
>
>
> HDFS Caching offers performance benefits. However, NameNode currently does 
> not give cached replicas higher priority, so HDFS caching is only useful 
> when cache replication = 3, that is to say, when all replicas are cached in 
> memory, so that a client doesn't randomly pick an uncached replica.
> HDFS-6846 proposed to let NameNode give higher priority to cached replicas. 
> Changing logic in NameNode is always tricky, so that didn't get much 
> traction. Here I propose a different approach: let the client 
> (DFSInputStream) prefer cached replicas.
> A {{LocatedBlock}} object already contains the cached replica locations, so 
> a client has the needed information. I think we can change 
> {{DFSInputStream#getBestNodeDNAddrPair()}} for this purpose.


