[ https://issues.apache.org/jira/browse/HDFS-7343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15592678#comment-15592678 ]

Andrew Wang commented on HDFS-7343:
-----------------------------------

Thanks for the replies Wei Zhou, inline:

{quote}
 To solve this, we collect IO statistics at both the HDFS level and the system 
level (system wide, like the data given by the system tool ‘iostat’). IO caused 
by SCR can be measured from the system-level data.
{quote}

Just iostat-level data won't tell you what file or file range is being read 
though. Do you also plan to capture strace or other information? What's the 
performance impact?
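
For context on granularity: the counters iostat reports come from 
/proc/diskstats and are per block device, with no notion of files or byte 
ranges. A minimal sketch of reading them on a Linux DataNode host (nothing 
HDFS-specific is assumed here beyond the standard /proc field positions):

{code:java}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Minimal sketch: these counters are per block device, so they cannot be
// attributed to a particular HDFS file or byte range without additional
// tracing (strace/blktrace-style correlation), which is where the
// performance-impact question comes in.
public class DiskStats {
  public static void main(String[] args) throws IOException {
    List<String> lines =
        Files.readAllLines(Paths.get("/proc/diskstats"), StandardCharsets.UTF_8);
    for (String line : lines) {
      String[] f = line.trim().split("\\s+");
      // Fields: major, minor, device, reads completed, reads merged,
      // sectors read, ms reading, writes completed, writes merged,
      // sectors written, ...
      String device = f[2];
      long mbRead = Long.parseLong(f[5]) * 512 / (1024 * 1024);
      long mbWritten = Long.parseLong(f[9]) * 512 / (1024 * 1024);
      System.out.printf("%s read=%dMB written=%dMB%n", device, mbRead, mbWritten);
    }
  }
}
{code}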

{quote}
For SSM, it has to consider factors like file length, access history, and 
memory availability before making a decision to cache a file. It tries to 
minimize the possibility of caching a file that doesn't need to be cached.
{quote}

{quote}
For performance considerations, SSM makes the first action have higher priority 
than the second one. It depends, and is not always the case.
{quote}

Unfortunately this doesn't really clarify things for me. The rules engine is 
the most important part of this work, and if it's a black box, it's much harder 
for admins to use it effectively. This is especially true in debugging 
scenarios when the rules engine isn't doing what the admin wants.
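
To illustrate what I mean by not being a black box, here is a purely 
hypothetical sketch (the FileState fields, thresholds, and rule below are made 
up, not taken from the design doc) of a rule that can explain every decision it 
makes, which is the kind of output an admin needs when debugging:

{code:java}
// Hypothetical sketch only; none of these types are part of the SSM design.
// The point is that every cache/skip decision carries a human-readable reason.
public class CacheRuleSketch {
  static class FileState {
    final String path;
    final long lengthBytes;
    final int accessesLastWeek;
    final long freeCacheBytes;
    FileState(String path, long lengthBytes, int accessesLastWeek, long freeCacheBytes) {
      this.path = path;
      this.lengthBytes = lengthBytes;
      this.accessesLastWeek = accessesLastWeek;
      this.freeCacheBytes = freeCacheBytes;
    }
  }

  /** Returns null if the file should be cached, otherwise the reason it was skipped. */
  static String whyNotCached(FileState s, int minAccesses) {
    if (s.accessesLastWeek < minAccesses) {
      return s.path + ": only " + s.accessesLastWeek
          + " accesses in the last week (need " + minAccesses + ")";
    }
    if (s.lengthBytes > s.freeCacheBytes) {
      return s.path + ": file is " + s.lengthBytes + " bytes but only "
          + s.freeCacheBytes + " bytes of cache are free";
    }
    return null; // cache it
  }

  public static void main(String[] args) {
    FileState hot = new FileState("/data/hot.parquet", 64L << 20, 12, 1L << 30);
    FileState big = new FileState("/data/huge.parquet", 4L << 30, 12, 1L << 30);
    for (FileState s : new FileState[] {hot, big}) {
      String reason = whyNotCached(s, 3);
      System.out.println(reason == null ? "cache " + s.path : "skip: " + reason);
    }
  }
}
{code}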

Have we done any prototyping and simulation of the rules engine? There are 
workload generators like SWIM which could be useful here.

{quote}
For example, Jingcheng Du and I did a study on HSM last year. We found that the 
throughput of a cluster with 4 SSDs + 4 HDDs on each DN is 1.36x larger than 
that of a cluster with 8 HDDs on each DN; it's almost as good as a cluster with 
8 SSDs on each DN.
{quote}

Thanks for the reference. Some questions about this study though:

* What is the comparative cost-per-byte of SSD vs. HDD? I'm pretty sure it's 
greater than 1.36x, meaning we might be better off buying more HDD to get more 
throughput (see the back-of-the-envelope sketch after this list). Alternatively, 
buying more RAM depending on the dataset size.
* This is an example of application-specific tuning for HSM, which is a best 
case. If the SSM doesn't correctly recognize the workload pattern, we won't 
achieve the full 1.36x improvement.
* I'm also unable to find the "com.yahoo.ycsb.workloads.CareWorkload" 
mentioned, do you have a reference?
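
To make the cost-per-byte point in the first bullet concrete, here is a 
back-of-the-envelope sketch. The prices and drive sizes are purely illustrative 
assumptions, not figures from the study; only the 1.36x throughput ratio comes 
from the quoted result.

{code:java}
// Illustrative only: prices and drive sizes below are assumptions; 1.36x is
// the throughput ratio quoted above for 4 SSDs + 4 HDDs vs. 8 HDDs per DN.
public class ThroughputPerDollar {
  public static void main(String[] args) {
    double hddCostPerTb = 30.0;   // assumed HDD price, $/TB
    double ssdCostPerTb = 250.0;  // assumed SSD price, $/TB
    double driveTb = 4.0;         // assumed capacity per drive, TB

    double hddOnlyCost = 8 * driveTb * hddCostPerTb;                 // 8 HDDs per DN
    double mixedCost = 4 * driveTb * (hddCostPerTb + ssdCostPerTb);  // 4 HDDs + 4 SSDs per DN

    // Throughput per storage dollar, normalized so the 8-HDD node = 1.0.
    double mixedPerDollar = (1.36 / mixedCost) / (1.0 / hddOnlyCost);

    System.out.printf("storage cost per DN: 8 HDD = $%.0f, 4 SSD + 4 HDD = $%.0f%n",
        hddOnlyCost, mixedCost);
    System.out.printf("mixed node throughput per storage dollar: %.2fx the 8-HDD node%n",
        mixedPerDollar);
  }
}
{code}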

{quote}
For example, the amount of memory available has to be checked before caching a 
file; if not enough memory is available, then the action will be canceled.
{quote}

The NameNode already does this checking, so it seems better for the enforcement 
of quotas to be done in one place for consistency.
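
For reference, this is roughly what the existing enforcement point looks like 
through the caching API (the pool name, path, and limit below are made up). 
Since the NameNode checks the pool limit itself, a second memory check inside 
SSM would duplicate logic that's already centralized:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.CacheDirectiveInfo;
import org.apache.hadoop.hdfs.protocol.CachePoolInfo;

// Minimal sketch: the pool's byte limit is enforced by the NameNode, so
// directives that would exceed it are rejected rather than over-committing
// DataNode cache memory.
public class CachePoolSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    DistributedFileSystem dfs =
        (DistributedFileSystem) new Path("hdfs://nn:8020/").getFileSystem(conf);

    // 10 GB of centrally enforced cache for this pool.
    dfs.addCachePool(new CachePoolInfo("analytics").setLimit(10L * 1024 * 1024 * 1024));

    dfs.addCacheDirective(new CacheDirectiveInfo.Builder()
        .setPath(new Path("/data/hot_table"))
        .setPool("analytics")
        .build());
  }
}
{code}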

A related question: how does this interact with user-generated actions? Maybe a 
user changes some files from ALL_SSD to DISK since they want to free up SSD 
quota for an important job they're going to run later. The SSM then sees there 
is available SSD and uses it. Then the user is out of SSD quota and their 
important job runs slowly.
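
To make that scenario concrete, here is a sketch using the existing storage 
policy and storage-type quota APIs (the paths, sizes, and directory layout are 
made up). If the SSM later promotes other data to SSD by changing its policy, 
that usage would presumably be charged against the same quota the user was 
trying to preserve:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.StorageType;
import org.apache.hadoop.hdfs.DistributedFileSystem;

// Minimal sketch: the user frees SSD quota by demoting a directory's policy
// (existing blocks still need the Mover to actually migrate), expecting to
// spend that quota on an upcoming job's input.
public class StoragePolicySketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    DistributedFileSystem dfs =
        (DistributedFileSystem) new Path("hdfs://nn:8020/").getFileSystem(conf);

    Path userDir = new Path("/user/alice");

    // Admin grants the user 1 TB of SSD via a per-storage-type quota.
    dfs.setQuotaByStorageType(userDir, StorageType.SSD, 1L << 40);

    // User demotes old data to free SSD quota (HOT = all replicas on DISK)...
    dfs.setStoragePolicy(new Path(userDir, "old_results"), "HOT");

    // ...and plans to promote the important job's input shortly before it runs.
    dfs.setStoragePolicy(new Path(userDir, "job_input"), "ALL_SSD");
  }
}
{code}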

{quote}
SSM pays more attention to the efficiency of the whole cluster than to a 
particular workload; it may not improve the end-to-end execution time of one 
workload, but it may improve another workload in the cluster. 
{quote}

Sorry to be this direct, but is improving average cluster throughput useful to 
end users? Broadly speaking, for ad-hoc user-submitted jobs, you care about 
end-to-end latency. For large batch jobs, they aren't that performance 
sensitive, and their working sets are unlikely to fit in memory/SSD anyway. In 
this case, we care very much about improving a particular workload.

I'll end with some overall comments:

If the goal is improving performance, would our time be better spent on the I/O 
paths, HSM and caching? I mentioned sub-block caching and client-side metrics 
as potential improvements for in-memory caching. Integrating it with the 
storage policies API and YARN resource management would also be useful. I'm 
sure there's work to be done in the I/O path too, particularly the write path 
which hasn't seen as much love as reads. This means we'd get more upside from 
fast storage like SSD.

I'm also not convinced that our average I/O utilization is that high to begin 
with. Typical YARN CPU utilization is <50%, and that's with many jobs being CPU 
bound. On the storage side, most clusters are capacity-bound. Optimistic 
scheduling and better resource isolation might lead to big improvements here.

I'm also concerned about scope creep, particularly since the replies to my 
comments indicate a system even bigger than the one described in the design 
document. It involves:

* A policy engine that can understand a wide variety of OS, HDFS, and 
application-level hints and performance metrics, as well as additional 
constraints from user-provided rules, system quotas, and data movement costs.
* Adding a metrics collection system for OS-level metrics which needs to be 
operated, managed, and deployed.
* The SSM itself, which is a stateful service, which again needs to be 
operated, managed, and deployed.
* Potentially a Kafka dependency.

I recommend stripping down the system to focus on the most important use cases 
to start and growing it from there. My preference is for archival use cases. In 
my experience, many clusters are capacity-bound, so erasure coding and archival 
storage can mean big improvements in cost.

> HDFS smart storage management
> -----------------------------
>
>                 Key: HDFS-7343
>                 URL: https://issues.apache.org/jira/browse/HDFS-7343
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Kai Zheng
>            Assignee: Wei Zhou
>         Attachments: HDFS-Smart-Storage-Management.pdf
>
>
> As discussed in HDFS-7285, it would be better to have a comprehensive and 
> flexible storage policy engine considering file attributes, metadata, data 
> temperature, storage type, EC codec, available hardware capabilities, 
> user/application preference and etc.
> Modified the title for re-purpose.
> We'd extend this effort some bit and aim to work on a comprehensive solution 
> to provide smart storage management service in order for convenient, 
> intelligent and effective utilizing of erasure coding or replicas, HDFS cache 
> facility, HSM offering, and all kinds of tools (balancer, mover, disk 
> balancer and so on) in a large cluster.


