[
https://issues.apache.org/jira/browse/HDFS-8182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14507939#comment-14507939
]
Andrew Wang commented on HDFS-8182:
-----------------------------------
Thanks for the explanation, Gera. This sounds like auto-tiering / hierarchical
storage management to an extent. If you (or whoever takes this JIRA) build out
something to calculate file temperature, we could also use it for moving data
between storage types or for controlling HDFS caching.
Even without that, though, there are some interesting questions to think about:
* Should the policy be fair-share, priority-based, or simply the user quota?
Quota is my favorite.
* If we run too close to a user's quota, writes might become temporarily
unavailable: deleting takes time, since it's rate-limited and happens on the
heartbeat. We probably want to limit how much space we use opportunistically
(a rough headroom check is sketched after this list).
* We might get some hotspotting, since the first reader on a rack will localize
everything. Probably still random enough though?
* The idea of having the DN save a local copy as the client reads is efficient,
but somewhat complicated. It might be simpler to have the NN manage all the
replication work.
* We don't have the equivalent of a read file descriptor in HDFS, so unless we
start looking at client-provided per-job identifiers, it's hard to interpret
DONTNEED. That is, if DONTNEED is sent, can we delete immediately, or is some
other job still using the file? With proper monitoring of data temperature,
DONTNEED is kind of optional. (A hypothetical reference-counting sketch also
follows this list.)
* Deleting the file is the strongest signal of all, and that is what should
happen for intermediate data.
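To make the opportunistic-space point concrete, here is a minimal sketch of the
headroom check I have in mind. All names here (CacheAdmissionPolicy,
HEADROOM_FRACTION) are made up for illustration, not existing HDFS code:

    // Hypothetical: only admit an opportunistic in-rack replica while the
    // owner keeps comfortable headroom under their space quota.
    class CacheAdmissionPolicy {
      // Never let cached replicas push usage past this fraction of the quota.
      private static final double HEADROOM_FRACTION = 0.8;

      boolean mayCache(long spaceQuota, long spaceConsumed, long blockBytes) {
        if (spaceQuota < 0) {
          return true; // no quota set (negative by HDFS convention)
        }
        return spaceConsumed + blockBytes
            <= (long) (spaceQuota * HEADROOM_FRACTION);
      }
    }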
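On the DONTNEED question, assuming we did introduce client-provided per-job
identifiers, the NN-side bookkeeping could be plain reference counting: discard
a cached replica only once every interested job has released it. Everything
below (CachedReplicaTracker and its methods) is hypothetical:

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Hypothetical NN-side tracker: a cached replica may be discarded only
    // after every job that registered interest has sent DONTNEED.
    class CachedReplicaTracker {
      private final Map<String, Set<String>> jobsByPath = new HashMap<>();

      // Called when a client opens the path with the caching hint,
      // passing its per-job identifier.
      synchronized void addInterest(String path, String jobId) {
        jobsByPath.computeIfAbsent(path, p -> new HashSet<>()).add(jobId);
      }

      // Called on DONTNEED from jobId; returns true when no job still
      // needs the cached replica, i.e. the DNs may discard it.
      synchronized boolean dontNeed(String path, String jobId) {
        Set<String> jobs = jobsByPath.get(path);
        if (jobs == null) {
          return true; // nobody registered interest; safe to discard
        }
        jobs.remove(jobId);
        if (jobs.isEmpty()) {
          jobsByPath.remove(path);
          return true;
        }
        return false;
      }
    }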
> Implement topology-aware CDN-style caching
> ------------------------------------------
>
> Key: HDFS-8182
> URL: https://issues.apache.org/jira/browse/HDFS-8182
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: hdfs-client, namenode
> Affects Versions: 2.6.0
> Reporter: Gera Shegalov
>
> To scale reads of hot blocks in large clusters, it would be beneficial if we
> could read a block across the ToR switches only once. Example scenarios are
> localization of binaries, MR distributed cache files for map-side joins and
> similar. There are multiple layers where this could be implemented (a YARN
> service or individual apps such as MR), but I believe it is best done in HDFS,
> or even in the common FileSystem layer, to support as many use cases as
> possible.
> The life cycle could look like this e.g. for the YARN localization scenario:
> 1. inputStream = fs.open(path, ..., CACHE_IN_RACK)
> 2. instead of reading from a remote DN directly, the NN tells the client to
> read via the local DN1, and DN1 creates a replica of each block as it streams
> it to the client.
> When the next localizer on DN2 in the same rack starts, it will learn from the
> NN about the replica on DN1, and the client will read from DN1 via the
> conventional path.
> When the application ends, the AM or NMs can instruct the NN, fadvise-DONTNEED
> style, to start telling DNs to discard the extraneous replicas.
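> A minimal sketch of the client side of this lifecycle, assuming a jar being
> localized; the CACHE_IN_RACK hint and any open(...) overload carrying it are
> part of this proposal, not existing FileSystem API:
>
>     import org.apache.hadoop.conf.Configuration;
>     import org.apache.hadoop.fs.FSDataInputStream;
>     import org.apache.hadoop.fs.FileSystem;
>     import org.apache.hadoop.fs.Path;
>
>     public class CacheInRackExample {
>       public static void main(String[] args) throws Exception {
>         FileSystem fs = FileSystem.get(new Configuration());
>         Path jobJar = new Path("/apps/myjob/job.jar"); // illustrative path
>
>         // 1. First localizer on the rack opens with the proposed hint; today
>         //    this is plain fs.open(jobJar), the extra flag is the proposal.
>         try (FSDataInputStream in = fs.open(jobJar /*, CACHE_IN_RACK */)) {
>           // The NN directs the read through the local DN, which keeps a
>           // replica of each block while streaming it to the client.
>           byte[] buf = new byte[64 * 1024];
>           while (in.read(buf) != -1) {
>             // ... write the bytes to the local resource directory ...
>           }
>         }
>
>         // 2. The next localizer on the same rack learns about the in-rack
>         //    replica from the NN and reads from DN1 via the conventional
>         //    path.
>
>         // 3. When the application ends, the AM/NM sends the
>         //    fadvise-DONTNEED style hint so the NN can tell DNs to discard
>         //    the extra replicas.
>       }
>     }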