[ https://issues.apache.org/jira/browse/HDFS-3672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13421728#comment-13421728 ]

Todd Lipcon commented on HDFS-3672:
-----------------------------------

bq. My concern is, if this is used in MapReduce it might be okay. But once it 
starts getting used in other downstream projects removing this would be a 
challenge

That's the whole point of the Unstable API annotation, isn't it? We can change 
the API and downstream projects should accept that.

What if we also explicitly marked it as {{throws UnsupportedOperationException}}? 
Users of the API would then be encouraged to catch this exception.

Since it's a performance API, it's always going to be used in an "advisory" 
role anyway -- any use of it could safely fall back to the non-optimized code 
path.
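To sketch what that advisory pattern could look like for a caller (the client type and method names below, {{DiskLocationClient}} and {{getDiskIds}}, are hypothetical stand-ins, not the actual HDFS API under discussion):

```java
import java.util.Arrays;
import java.util.List;

public class DiskAwareScheduling {

    /** Hypothetical client: a cluster may or may not support disk-id lookups. */
    interface DiskLocationClient {
        /** Returns a disk id per block, or throws if disk ids aren't exposed. */
        List<Integer> getDiskIds(List<String> blocks);
    }

    /**
     * Advisory lookup: the caller catches UnsupportedOperationException and
     * falls back to the existing per-datanode scheduling, so the unstable API
     * can change or disappear without breaking downstream code.
     */
    static List<Integer> diskIdsOrFallback(DiskLocationClient client,
                                           List<String> blocks) {
        try {
            return client.getDiskIds(blocks);
        } catch (UnsupportedOperationException e) {
            // Non-optimized path: treat every block as on an unknown disk.
            Integer[] unknown = new Integer[blocks.size()];
            Arrays.fill(unknown, -1);
            return Arrays.asList(unknown);
        }
    }

    public static void main(String[] args) {
        DiskLocationClient unsupported = blocks -> {
            throw new UnsupportedOperationException("disk ids not exposed");
        };
        // Falls back gracefully when the cluster can't answer.
        System.out.println(
            diskIdsOrFallback(unsupported, Arrays.asList("blk_1", "blk_2")));
        // → [-1, -1]
    }
}
```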

I'd be OK compromising and calling it LimitedPrivate(MapReduce), but I know 
that at least one of our customers is interested in using an API like this as 
well. Unfortunately I can't give too many details on their use case due to NDA 
(lame, I know), but I just wanted to provide a data point that there is demand 
for this "in the wild".

bq. We're still running some experiments locally, but our assumption is that, 
within short time-scales (~0.5 seconds), the lagging 0.5 second usage is a 
reasonably good predictor of the next 0.5 seconds, given most Hadoop-style 
access is of 100MB+ chunks of data

I ran a simple experiment yesterday on one of our test clusters. The cluster 
was running a mix of workloads - I think at the time it was a Hive benchmark 
suite on ~100 nodes. So it was under load, but not at 100% utilization.

On all of the nodes, I collected /proc/diskstats once a second for an hour. I 
then removed all disk samples where there was 0 load on the disk, since that 
was just periods of inactivity between test runs. Then I took the disk 
utilization at each sample and appended it as a column to the data from the 
previous second. I loaded the data into R and constructed a few simple models 
that predict each second's disk utilization on a given disk from the previous 
second's disk statistics.
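A rough sketch of that preprocessing step (not the actual script; this keeps only the {{ms_doing_io}} counter for brevity, whereas the real dataset kept all of the diskstats fields):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DiskstatsPrep {

    /**
     * In /proc/diskstats, milliseconds spent doing I/O is the 10th stats
     * field, i.e. whitespace token index 12 (after major, minor, device name).
     * The counters are cumulative.
     */
    static long msDoingIo(String diskstatsLine) {
        return Long.parseLong(diskstatsLine.trim().split("\\s+")[12]);
    }

    /**
     * samples: one /proc/diskstats line per second for one disk.
     * Returns (this_sec_util_ms, next_sec_util_ms) pairs, dropping seconds
     * with zero load, mirroring the filtering described above.
     */
    static List<long[]> lagPairs(List<String> samples) {
        // Difference the cumulative counter to get per-second utilization.
        List<Long> perSecond = new ArrayList<>();
        for (int i = 1; i < samples.size(); i++) {
            perSecond.add(msDoingIo(samples.get(i)) - msDoingIo(samples.get(i - 1)));
        }
        // Pair each second's utilization with the NEXT second's, the value
        // the regression models predict.
        List<long[]> pairs = new ArrayList<>();
        for (int i = 0; i + 1 < perSecond.size(); i++) {
            if (perSecond.get(i) == 0) continue;  // idle second, discard
            pairs.add(new long[] { perSecond.get(i), perSecond.get(i + 1) });
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<String> samples = Arrays.asList(
            "8 0 sda 0 0 0 0 0 0 0 0 0 0 0",
            "8 0 sda 0 0 0 0 0 0 0 0 0 500 0",
            "8 0 sda 0 0 0 0 0 0 0 0 0 500 0",
            "8 0 sda 0 0 0 0 0 0 0 0 0 800 0");
        for (long[] p : pairs(samples)) {
            System.out.println(p[0] + " -> " + p[1]);
        }
    }

    private static List<long[]> pairs(List<String> s) { return lagPairs(s); }
}
```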

Linear model using only the current utilization to predict the next second's 
utilization:
{code}
> m.linear.on.only.util <- lm(next_sec_util ~ ms_doing_io, data=d)
{code}
(this would correspond to a trivial model like "assume that if a disk is busy 
now, it will still be busy in the next second")

Linear model using all of the current statistics (queue length, read/write mix, 
etc) to predict next second's util:
{code}
> m.linear <- lm(next_sec_util ~ ., data=d)
{code}

Quadratic model using all of the current statistics, and their interaction 
terms, to predict next second's util:
{code}
> d.sample.200k <- d[sample(nrow(d), size=200000),]
> m.quadratic <- lm(next_sec_util ~ .:., data=d.sample.200k)
{code}

Random forest (a decision-tree based model, trained using only 1% of the data, 
since it's slow):
{code}
> d.sample.10k <- d[sample(nrow(d), size=10000),]
> m.rf <- randomForest(next_sec_util~., data=d.sample.10k)
{code}

The models fared as follows:

||Model||Percent variance explained||
| Linear on only utilization | 58.4% |
| Linear | 70.6% |
| Quadratic | 73.9% |
| Random forest | 76.9% |

Certainly the above analysis is just one workload, and one in which the disks 
are not being particularly slammed. But, it does show that looking at a disk's 
current status is a reasonable predictor of status over the next second on a 
typical MR cluster.
                
> Expose disk-location information for blocks to enable better scheduling
> -----------------------------------------------------------------------
>
>                 Key: HDFS-3672
>                 URL: https://issues.apache.org/jira/browse/HDFS-3672
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>    Affects Versions: 2.0.0-alpha
>            Reporter: Andrew Wang
>            Assignee: Andrew Wang
>         Attachments: hdfs-3672-1.patch
>
>
> Currently, HDFS exposes on which datanodes a block resides, which allows 
> clients to make scheduling decisions for locality and load balancing. 
> Extending this to also expose on which disk on a datanode a block resides 
> would enable even better scheduling, on a per-disk rather than coarse 
> per-datanode basis.
> This API would likely look similar to FileSystem#getFileBlockLocations, but 
> also involve a series of RPCs to the responsible datanodes to determine disk 
> ids.
