[ https://issues.apache.org/jira/browse/HDFS-3672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13421728#comment-13421728 ]
Todd Lipcon commented on HDFS-3672:
-----------------------------------

bq. My concern is, if this is used in MapReduce it might be okay. But once it starts getting used in other downstream projects removing this would be a challenge

That's the whole point of the Unstable API annotation, isn't it? We can change the API, and downstream projects should accept that. What if we also explicitly marked it as {{throws UnsupportedOperationException}}? Users of the API would then be encouraged to catch this exception. Since it's a performance API, it's always going to be used in an "advisory" role anyway -- any use of it can safely fall back to the non-optimized code path.

I'd be OK compromising and calling it LimitedPrivate(MapReduce), but I know that at least one of our customers is interested in using an API like this as well. Unfortunately I can't give many details on their use case due to NDA (lame, I know), but I wanted to provide a data point that there is demand for this "in the wild".

bq. We're still running some experiments locally, but our assumption is that, within short time-scales (~0.5 seconds), the lagging 0.5 second usage is a reasonably good predictor of the next 0.5 seconds, given most Hadoop-style access is of 100MB+ chunks of data

I ran a simple experiment yesterday on one of our test clusters. The cluster was doing a mix of workloads -- I think at the time it was running a Hive benchmark suite on ~100 nodes -- so it was under load, but not at 100% utilization. On all of the nodes, I collected /proc/diskstats once a second for an hour. I then removed all disk samples where there was zero load on the disk, since those were just periods of inactivity between test runs. Next, I took the disk utilization at each sample and appended it as a column to the data from the previous second. Finally, I loaded the data into R and constructed a few simple models that predict each second's utilization on a given disk from the previous second's disk statistics.
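As a side note, the per-second utilization signal described above can be derived by differencing the "milliseconds spent doing I/O" field of {{/proc/diskstats}} between consecutive snapshots. Here is a minimal Python sketch of that step (mine, not part of the experiment -- the actual analysis was done in R; the {{parse_diskstats}}/{{utilization}} helper names are hypothetical, and the field offset follows the kernel's documented diskstats layout):

```python
def parse_diskstats(text):
    """Map device name -> cumulative milliseconds spent doing I/O.

    Each /proc/diskstats line is: major minor name followed by 11
    counters; "ms doing I/O" is the 10th counter, i.e. split index 12.
    """
    stats = {}
    for line in text.strip().splitlines():
        fields = line.split()
        name, ms_doing_io = fields[2], int(fields[12])
        stats[name] = ms_doing_io
    return stats

def utilization(prev_text, curr_text, interval_ms=1000):
    """Fraction of the interval each disk spent busy (0.0 to 1.0),
    computed from two snapshots taken interval_ms apart."""
    prev = parse_diskstats(prev_text)
    curr = parse_diskstats(curr_text)
    return {dev: (curr[dev] - prev[dev]) / interval_ms
            for dev in curr if dev in prev}
```

On a live node, snapshotting {{/proc/diskstats}} once a second and differencing this counter yields the utilization column that the models below regress on.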
Linear model using only the current utilization to predict the next second's utilization (this corresponds to a trivial model like "assume that if a disk is busy now, it will still be busy in the next second"):

{code}
> m.linear.on.only.util <- lm(next_sec_util ~ ms_doing_io, data=d)
{code}

Linear model using all of the current statistics (queue length, read/write mix, etc.) to predict the next second's utilization:

{code}
> m.linear <- lm(next_sec_util ~ ., data=d)
{code}

Quadratic model using all of the current statistics, and their interaction terms, to predict the next second's utilization:

{code}
> d.sample.200k <- d[sample(nrow(d), size=200000),]
> m.quadratic <- lm(next_sec_util ~ .:., data=d.sample.200k)
{code}

Random forest (a decision-tree-based model, trained on only 1% of the data, since it's slow):

{code}
> d.sample.10k <- d[sample(nrow(d), size=10000),]
> m.rf <- randomForest(next_sec_util ~ ., data=d.sample.10k)
{code}

The models fared as follows:

||Model||Percent variance explained||
| Linear on only utilization | 58.4% |
| Linear | 70.6% |
| Quadratic | 73.9% |
| Random forest | 76.9% |

Certainly the above analysis is just one workload, and one in which the disks are not being particularly slammed. But it does show that a disk's current status is a reasonable predictor of its status over the next second on a typical MR cluster.

> Expose disk-location information for blocks to enable better scheduling
> -----------------------------------------------------------------------
>
>                 Key: HDFS-3672
>                 URL: https://issues.apache.org/jira/browse/HDFS-3672
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>    Affects Versions: 2.0.0-alpha
>            Reporter: Andrew Wang
>            Assignee: Andrew Wang
>         Attachments: hdfs-3672-1.patch
>
>
> Currently, HDFS exposes on which datanodes a block resides, which allows
> clients to make scheduling decisions for locality and load balancing.
> Extending this to also expose on which disk on a datanode a block resides
> would enable even better scheduling, on a per-disk rather than coarse
> per-datanode basis.
>
> This API would likely look similar to FileSystem#getFileBlockLocations, but
> also involve a series of RPCs to the responsible datanodes to determine disk
> ids.