[jira] [Commented] (HDFS-8182) Implement topology-aware CDN-style caching

2015-04-24 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14511936#comment-14511936
 ] 

Gera Shegalov commented on HDFS-8182:
-------------------------------------

Thanks for the comments, [~andrew.wang] and [~jlowe].

bq. We might get some hotspotting, since the first reader on a rack will 
localize everything. Probably still random enough though?
I agree there is a hotspotting risk, though hopefully it is indeed low. But if 
it turns out to be a big deal, we can introduce random node picking within the 
rack.
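A minimal sketch of the in-rack random pick mentioned above (all names are hypothetical for illustration; the real NN would draw on its NetworkTopology rather than a plain dict):

```python
import random

def pick_cache_node(datanodes_by_rack, client_rack, seed=None):
    """Pick a random live DN in the client's rack to host the cached
    replica, spreading load instead of always choosing the first
    reader's local DN."""
    rng = random.Random(seed)
    candidates = datanodes_by_rack.get(client_rack, [])
    if not candidates:
        return None  # no DN in this rack; fall back to a remote read
    return rng.choice(candidates)

# toy topology: rack name -> datanodes
topology = {"/rack1": ["dn1", "dn2", "dn3"], "/rack2": ["dn4"]}
node = pick_cache_node(topology, "/rack1", seed=42)
```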

Agreed that we need to clarify the flag names and their semantics (re: DONTNEED).

And the issue of quota/accounting is something I haven't thought of yet either. 
Thanks for bringing it up.

> Implement topology-aware CDN-style caching
> ------------------------------------------
>
> Key: HDFS-8182
> URL: https://issues.apache.org/jira/browse/HDFS-8182
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs-client, namenode
>Affects Versions: 2.6.0
>Reporter: Gera Shegalov
>
> To scale reads of hot blocks in large clusters, it would be beneficial if we 
> could read a block across the ToR switches only once. Example scenarios are 
> localization of binaries, MR distributed cache files for map-side joins and 
> similar. There are multiple layers where this could be implemented (YARN 
> service or individual apps such as MR) but I believe it is best done in HDFS 
> or even common FileSystem to support as many use cases as possible. 
> The life cycle could look like this e.g. for the YARN localization scenario:
> 1. inputStream = fs.open(path, ..., CACHE_IN_RACK)
> 2. Instead of reading from a remote DN directly, the NN tells the client to 
> read via the local DN1, and DN1 creates a replica of each block.
> When the next localizer on DN2 in the same rack starts it will learn from NN 
> about the replica in DN1 and the client will read from DN1 using the 
> conventional path.
> When the application ends, the AM or NMs can instruct the NN in an 
> fadvise-DONTNEED style, and the NN can start telling DNs to discard the 
> extraneous replicas.
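To make the proposed life cycle concrete, here is a toy simulation of the flow described above (method and class names are hypothetical; the real feature would live in the NN/DN protocol, not client-side code):

```python
class MiniNameNode:
    """Toy NN: tracks which racks already hold a cached replica of a path."""
    def __init__(self):
        self.cached = {}  # path -> {rack: datanode holding the cached replica}

    def open_cache_in_rack(self, path, client_rack, local_dn):
        # If a same-rack cached replica exists, read it via the conventional
        # path; otherwise route the read through the local DN, which keeps a
        # copy as the bytes stream through -- one cross-ToR transfer per rack.
        rack_cache = self.cached.setdefault(path, {})
        if client_rack in rack_cache:
            return ("read-local", rack_cache[client_rack])
        rack_cache[client_rack] = local_dn
        return ("read-through-and-cache", local_dn)

    def dontneed(self, path):
        # fadvise-DONTNEED style hint: the NN may now tell DNs to drop
        # the extraneous cached replicas.
        return self.cached.pop(path, {})

nn = MiniNameNode()
first = nn.open_cache_in_rack("/apps/job.jar", "/rack1", "dn1")   # crosses ToR once
second = nn.open_cache_in_rack("/apps/job.jar", "/rack1", "dn2")  # served in-rack from dn1
dropped = nn.dontneed("/apps/job.jar")                            # replicas reclaimed
```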



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8182) Implement topology-aware CDN-style caching

2015-04-22 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14507982#comment-14507982
 ] 

Jason Lowe commented on HDFS-8182:
-------------------------------------

Quota seems like it could be problematic for some of the use-cases involved in 
the distributed cache.  Not all files being downloaded by a user belong to that 
user.  Public localized resources are a good example of this.  How does quota 
enter the picture in the case where the user who owns the file is not the user 
requesting the file?  For example, are other users accessing my public files 
going to be able to cause my quota usage to increase by this feature?



[jira] [Commented] (HDFS-8182) Implement topology-aware CDN-style caching

2015-04-22 Thread Andrew Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14507939#comment-14507939
 ] 

Andrew Wang commented on HDFS-8182:
-------------------------------------

Thanks for the explanation Gera. This sounds like auto-tiering / hierarchical 
storage management to an extent. If you (or whoever takes this JIRA) builds out 
something to calculate file temperature, we could also use it for moving things 
between storage types or to control HDFS caching.

Even without that though, some interesting questions to think about:

* Policy like fair-share or priorities? Or just use user quota? Quota is my 
favorite.
* If we run too close to a user's quota, writes might be temporarily 
unavailable. Deleting takes time, since it's rate limited and happens on the 
heartbeat. Probably want to limit how much space we use opportunistically.
* We might get some hotspotting, since the first reader on a rack will localize 
everything. Probably still random enough though?
* The idea of having the DN save a local copy as the client reads is efficient, 
but somewhat complicated. It might be simpler to have the NN manage all the 
replication work.
* We don't have the equivalent of a read file descriptor in HDFS, so unless we 
start looking at client-provided per-job identifiers, it's hard to interpret 
DONTNEED. i.e. if DONTNEED is sent, can we immediately delete, or is some other 
job still using it? If we have proper monitoring of data temperature, DONTNEED 
is kind of optional.
* Deleting the file is the best kind of signal, which should happen for 
intermediate data.
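One way to interpret the DONTNEED question above is with client-provided per-job identifiers and reference counting, as the comment hints. A sketch under that assumption (HDFS has no such per-job read handle today, so everything here is hypothetical):

```python
class CachedReplicaTracker:
    """Toy refcount of cached-replica readers by job id, so a DONTNEED
    from one job does not evict data another job still reads."""
    def __init__(self):
        self.readers = {}  # path -> set of job ids currently using it

    def record_read(self, path, job_id):
        self.readers.setdefault(path, set()).add(job_id)

    def dontneed(self, path, job_id):
        jobs = self.readers.get(path, set())
        jobs.discard(job_id)
        if not jobs:
            self.readers.pop(path, None)
            return True   # safe to tell DNs to drop the cached replica
        return False      # another job still uses it; keep the replica
```

Without such identifiers, DONTNEED can only be a soft hint, which is why the comment calls it "kind of optional" when data-temperature monitoring exists.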



[jira] [Commented] (HDFS-8182) Implement topology-aware CDN-style caching

2015-04-20 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1450#comment-1450
 ] 

Gera Shegalov commented on HDFS-8182:
-------------------------------------

Hi Andrew,

I think the said block placement policy works fine for data whose usage we know 
a priori, such as the binaries in the YARN-1492 Shared Cache (a few terabytes 
in our case), MR/Spark staging directories, etc. For such cases we/frameworks 
already set a high replication factor, and the solution with rf=#racks is 
already good enough, except for the replication-speed vs. YARN-scheduling race, 
which would be eliminated by the approach proposed in this JIRA.

In some cases we have no a priori knowledge. The most prominent ones are 
primary or temporary files used as the build input of a hash join in an ad-hoc 
manner. Having a solution that works transparently, irrespective of the 
specified replication factor, is a win.

Another drawback of a block-placement based solution (besides currently being 
global, not per file) is that it's not elastic, and is oblivious of the data 
temperature. I think this JIRA would cover both families of cases above well.



[jira] [Commented] (HDFS-8182) Implement topology-aware CDN-style caching

2015-04-20 Thread Andrew Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14504105#comment-14504105
 ] 

Andrew Wang commented on HDFS-8182:
-------------------------------------

Hi Gera, just for my own edification, given the "spread" block placement 
policy, is it not sufficient to set the distributed cache repl factor to the # 
of racks in the cluster? Is the space overhead too much?

It's hard for me to evaluate as I don't have much familiarity with dist cache, 
e.g. typical # bytes in dist cache, # racks, dist cache repl factor. If the 
overhead ends up being tens of GBs per additional rack though, that seems 
acceptable considering a rack holds a PB+ these days.
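A back-of-the-envelope check of that overhead, with made-up numbers (the comment itself notes the typical dist-cache size and rack count are unknown here):

```python
def extra_bytes_for_rack_wide_repl(dist_cache_bytes, num_racks, base_repl=3):
    """Extra space consumed if the repl factor is raised from base_repl
    to one replica per rack (rf = num_racks)."""
    extra_replicas = max(num_racks - base_repl, 0)
    return dist_cache_bytes * extra_replicas

gib = 1024 ** 3
# e.g. 10 GiB of localized binaries on a hypothetical 40-rack cluster
extra = extra_bytes_for_rack_wide_repl(10 * gib, 40)
print(extra / gib)  # → 370.0, i.e. ~10 GiB per additional rack
```

That is indeed "tens of GBs per additional rack" territory only when the dist-cache payload itself reaches tens of GiB, which supports the point that the overhead is small next to a PB-scale rack.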



[jira] [Commented] (HDFS-8182) Implement topology-aware CDN-style caching

2015-04-20 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503601#comment-14503601
 ] 

Gera Shegalov commented on HDFS-8182:
-------------------------------------

[~andrew.wang], [~zhz] thanks for your comments. 

bq. IIUC most apps also use the distributed cache, so there isn't too much code 
duplication that would be reduced by pushing this to HDFS.
This proposal is specifically motivated by the scalability issues of 
DistributedCache localization in a large cluster.

When we get to per-path block placement policy, this will be more acceptable 
than the current per-block-manager approach. However, it's still not as 
flexible as needed to indicate a temporary demand. Thanks for pointing out 
these JIRAs.



[jira] [Commented] (HDFS-8182) Implement topology-aware CDN-style caching

2015-04-20 Thread Zhe Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503587#comment-14503587
 ] 

Zhe Zhang commented on HDFS-8182:
-------------------------------------

I agree with Andrew that a simpler version of this proposal can be achieved by 
a _wide spreading_ placement policy. This requires us to support per-file 
placement policy though. HDFS-8186 (under the EC branch) tries to add 
{{BlockPlacementPolicies}} in {{BlockManager}} to address this requirement.



[jira] [Commented] (HDFS-8182) Implement topology-aware CDN-style caching

2015-04-20 Thread Zhe Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503572#comment-14503572
 ] 

Zhe Zhang commented on HDFS-8182:
-------------------------------------

bq. ... a block placement policy that tries to spread replicas across as many 
racks as possible? The EC branch already has this policy which we could reuse.
This is actually in trunk (HDFS-7891). 



[jira] [Commented] (HDFS-8182) Implement topology-aware CDN-style caching

2015-04-20 Thread Andrew Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503565#comment-14503565
 ] 

Andrew Wang commented on HDFS-8182:
-------------------------------------

Could this be achieved more simply with a block placement policy that tries to 
spread replicas across as many racks as possible? The EC branch already has 
this policy which we could reuse.

This type of optimistic replica creation has also typically been done outside of 
HDFS; we don't really keep much in the way of per-file read stats right now, 
and this proposal overall sounds like a bigger change than a new block 
placement policy.

IIUC most apps also use the distributed cache, so there isn't too much code 
duplication that would be reduced by pushing this to HDFS.
