[jira] [Commented] (HDFS-8182) Implement topology-aware CDN-style caching
[ https://issues.apache.org/jira/browse/HDFS-8182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14511936#comment-14511936 ]

Gera Shegalov commented on HDFS-8182:
-------------------------------------

Thanks for the comments [~andrew.wang] and [~jlowe].

bq. We might get some hotspotting, since the first reader on a rack will localize everything. Probably still random enough though?

I agree there is a hotspotting risk, and hopefully it is indeed low. If it turns out to be a big deal, we can introduce random node picking within the rack.

Agreed that we need to clarify the flag names and their semantics (re: DONTNEED). The issue of quota/accounting is something I haven't thought about yet either; thanks for bringing it up.

> Implement topology-aware CDN-style caching
> ------------------------------------------
>
>                 Key: HDFS-8182
>                 URL: https://issues.apache.org/jira/browse/HDFS-8182
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: hdfs-client, namenode
>    Affects Versions: 2.6.0
>            Reporter: Gera Shegalov
>
> To scale reads of hot blocks in large clusters, it would be beneficial if we
> could read a block across the ToR switches only once. Example scenarios are
> localization of binaries, MR distributed cache files for map-side joins, and
> similar. There are multiple layers where this could be implemented (a YARN
> service or individual apps such as MR), but I believe it is best done in HDFS,
> or even in the common FileSystem, to support as many use cases as possible.
> The life cycle could look like this, e.g. for the YARN localization scenario:
> 1. inputStream = fs.open(path, ..., CACHE_IN_RACK)
> 2. Instead of reading from a remote DN directly, the NN tells the client to read
> via the local DN1, and DN1 creates a replica of each block.
> When the next localizer on DN2 in the same rack starts, it will learn from the NN
> about the replica on DN1, and the client will read from DN1 using the
> conventional path.
> When the application ends, the AM or NMs can instruct the NN in an fadvise-DONTNEED
> style, and it can start telling DNs to discard the extraneous replicas.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
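The life cycle quoted above can be sketched as a toy simulation. Everything here is illustrative: `ToyNameNode`, its methods, and the `CACHE_IN_RACK`/DONTNEED behavior are a model of the proposal, not real HDFS APIs.

```python
# Toy model of the proposed CACHE_IN_RACK life cycle. The first reader on a
# rack pulls the block across the ToR switch via its local DN, which keeps a
# copy; later readers on the same rack are served rack-locally; a
# fadvise-DONTNEED-style call discards the opportunistic replicas.

class ToyNameNode:
    def __init__(self, primary_rack):
        self.primary_rack = primary_rack  # rack holding the original replica
        self.cached = {}                  # rack -> DN holding a cached replica

    def open(self, client_rack, client_dn):
        """Return (source DN, whether the read crosses the ToR switch)."""
        if client_rack == self.primary_rack:
            return ("primary", False)
        if client_rack in self.cached:
            # Rack-local cached replica: conventional read path, no cross-rack.
            return (self.cached[client_rack], False)
        # First reader on this rack: read via the local DN, which keeps a copy.
        self.cached[client_rack] = client_dn
        return ("primary", True)

    def dontneed(self):
        """Application ended: discard all opportunistic replicas."""
        self.cached.clear()

nn = ToyNameNode(primary_rack="rackA")
print(nn.open("rackB", "dn1"))  # first rackB reader crosses the switch once
print(nn.open("rackB", "dn2"))  # second rackB reader is served by dn1
nn.dontneed()
print(nn.open("rackB", "dn2"))  # cache dropped: crosses the switch again
```

The key property is that each rack pays the cross-switch cost once per DONTNEED cycle, regardless of how many localizers run on it.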
[ https://issues.apache.org/jira/browse/HDFS-8182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14507982#comment-14507982 ]

Jason Lowe commented on HDFS-8182:
----------------------------------

Quota seems like it could be problematic for some of the use cases involved in the distributed cache. Not all files being downloaded by a user belong to that user; public localized resources are a good example. How does quota enter the picture when the user who owns the file is not the user requesting it? For example, could other users accessing my public files cause my quota usage to increase via this feature?
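One possible answer to the accounting question above is to charge opportunistic replicas to the reader who requested the caching rather than to the file owner. This is purely a sketch of that policy, with made-up names; HDFS has no such accounting rule today.

```python
# Sketch: attribute cached-replica space to the requesting reader's quota, so
# other users localizing a public file never inflate the owner's usage.

def charge_cached_replica(reader, nbytes, quota_used):
    """Charge nbytes of opportunistic replica space to the reader."""
    quota_used[reader] = quota_used.get(reader, 0) + nbytes
    return quota_used

usage = {"alice": 100}
# bob localizes alice's public file into his rack; alice's usage is untouched
charge_cached_replica("bob", 50, usage)
print(usage)  # {'alice': 100, 'bob': 50}
```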
[ https://issues.apache.org/jira/browse/HDFS-8182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14507939#comment-14507939 ]

Andrew Wang commented on HDFS-8182:
-----------------------------------

Thanks for the explanation, Gera. This sounds like auto-tiering / hierarchical storage management to an extent. If you (or whoever takes this JIRA) build out something to calculate file temperature, we could also use it for moving things between storage types or for controlling HDFS caching.

Even without that, though, some interesting questions to think about:
* Policy like fair-share or priorities? Or just use the user quota? Quota is my favorite.
* If we run too close to a user's quota, writes might be temporarily unavailable. Deleting takes time, since it's rate-limited and happens on the heartbeat. We probably want to limit how much space we use opportunistically.
* We might get some hotspotting, since the first reader on a rack will localize everything. Probably still random enough though?
* The idea of having the DN save a local copy as the client reads is efficient, but somewhat complicated. It might be simpler to have the NN manage all the replication work.
* We don't have the equivalent of a read file descriptor in HDFS, so unless we start looking at client-provided per-job identifiers, it's hard to interpret DONTNEED. I.e., if DONTNEED is sent, can we immediately delete, or is some other job still using it? If we have proper monitoring of data temperature, DONTNEED is kind of optional.
* Deleting the file is the best kind of signal, which is what should happen for intermediate data.
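The DONTNEED ambiguity raised above (can we delete, or is another job still reading?) maps naturally onto reference counting over the client-provided per-job identifiers the comment mentions. A minimal sketch, assuming such identifiers existed; nothing like this tracker exists in HDFS today.

```python
# Sketch: only discard an opportunistic replica when the last job that
# requested caching for the path has issued DONTNEED.

class CachedReplicaTracker:
    def __init__(self):
        self.users = {}  # path -> set of job ids relying on the cached replica

    def cache_in_rack(self, path, job_id):
        """Record that job_id depends on the cached replica of path."""
        self.users.setdefault(path, set()).add(job_id)

    def dontneed(self, path, job_id):
        """Return True when it is safe to tell DNs to discard the replica."""
        jobs = self.users.get(path, set())
        jobs.discard(job_id)
        if not jobs:
            self.users.pop(path, None)
            return True   # last reference gone
        return False      # another job is still using it

t = CachedReplicaTracker()
t.cache_in_rack("/cache/join.dat", "job_1")
t.cache_in_rack("/cache/join.dat", "job_2")
print(t.dontneed("/cache/join.dat", "job_1"))  # False: job_2 still reading
print(t.dontneed("/cache/join.dat", "job_2"))  # True: safe to discard
```

As the comment notes, temperature monitoring would make this hint optional, and deleting the file itself remains the unambiguous signal.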
[ https://issues.apache.org/jira/browse/HDFS-8182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1450#comment-1450 ]

Gera Shegalov commented on HDFS-8182:
-------------------------------------

Hi Andrew, I think the block placement policy you mention works fine for data whose usage we know a priori, such as binaries in the YARN-1492 Shared Cache (a few terabytes in our case), MR/Spark staging directories, etc. For such cases, we/frameworks already set a high replication factor, and the solution with rf = #racks is already good enough, except for the race between replication speed and YARN scheduling, which the approach proposed in this JIRA would eliminate.

In some cases we have no a priori knowledge. The most prominent ones are primary or temporary files used as the build input of a hash join in an ad hoc manner. Having a solution that works transparently, irrespective of the specified replication factor, is a win.

Another drawback of a block-placement-based solution (besides currently being global, not per-file) is that it is not elastic and is oblivious to data temperature.

I think this JIRA would cover both families of cases above well.
[ https://issues.apache.org/jira/browse/HDFS-8182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14504105#comment-14504105 ]

Andrew Wang commented on HDFS-8182:
-----------------------------------

Hi Gera, just for my own edification: given the "spread" block placement policy, is it not sufficient to set the distributed cache replication factor to the number of racks in the cluster? Is the space overhead too much? It's hard for me to evaluate, as I don't have much familiarity with the dist cache, e.g. the typical number of bytes in the dist cache, the number of racks, or the dist cache replication factor. If the overhead ends up being tens of GBs per additional rack, though, that seems acceptable, considering a rack holds a PB+ these days.
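The overhead question above is simple arithmetic once the inputs are pinned down. A back-of-envelope check with made-up but plausible numbers; none of these figures come from the thread.

```python
# Space cost of bumping the dist cache replication factor from the default
# to one replica per rack (illustrative inputs only).

dist_cache_bytes = 20 * 10**9  # assume ~20 GB of hot distributed-cache files
default_rf = 3                 # HDFS default replication factor
num_racks = 40                 # assumed cluster size

extra_replicas = num_racks - default_rf
overhead_bytes = dist_cache_bytes * extra_replicas
print(overhead_bytes / 10**9, "GB extra")  # 740.0 GB extra
```

Per additional rack the cost is just the dist cache size itself (20 GB here), which is the "tens of GBs per rack" regime the comment calls acceptable; the total grows linearly with rack count, which is where the elasticity concern in the thread comes from.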
[ https://issues.apache.org/jira/browse/HDFS-8182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503601#comment-14503601 ]

Gera Shegalov commented on HDFS-8182:
-------------------------------------

[~andrew.wang], [~zhz], thanks for your comments.

bq. IIUC most apps also use the distributed cache, so there isn't too much code duplication that would be reduced by pushing this to HDFS.

This proposal is specifically motivated by the scalability issues with DistributedCache localization in a large cluster. When we get to a per-path block placement policy, that will be more acceptable than the current per-block-manager approach. However, it is still not as flexible as needed to indicate a temporary demand. Thanks for pointing out these JIRAs.
[ https://issues.apache.org/jira/browse/HDFS-8182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503587#comment-14503587 ]

Zhe Zhang commented on HDFS-8182:
---------------------------------

I agree with Andrew that a simpler version of this proposal can be achieved by a _wide spreading_ placement policy. This requires us to support a per-file placement policy, though. HDFS-8186 (under the EC branch) tries to add {{BlockPlacementPolicies}} in {{BlockManager}} to address this requirement.
[ https://issues.apache.org/jira/browse/HDFS-8182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503572#comment-14503572 ]

Zhe Zhang commented on HDFS-8182:
---------------------------------

bq. ... a block placement policy that tries to spread replicas across as many racks as possible? The EC branch already has this policy which we could reuse.

This is actually in trunk (HDFS-7891).
[ https://issues.apache.org/jira/browse/HDFS-8182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503565#comment-14503565 ]

Andrew Wang commented on HDFS-8182:
-----------------------------------

Could this be achieved more simply with a block placement policy that tries to spread replicas across as many racks as possible? The EC branch already has this policy, which we could reuse.

This type of optimistic replica creation has also typically been done outside of HDFS; we don't really keep much in the way of per-file read stats right now, and this proposal overall sounds like a bigger change than a new block placement policy. IIUC most apps also use the distributed cache, so there isn't too much code duplication that would be reduced by pushing this to HDFS.