[ https://issues.apache.org/jira/browse/HDFS-7343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15576893#comment-15576893 ]
Andrew Wang commented on HDFS-7343:
-----------------------------------

Hi everyone, thanks for the great document. This is a very ambitious project and would be a great addition to HDFS. Some comments:

* We'd thought about doing automatic cache management as a follow-on to HDFS-4949, but there were a couple of issues. First, most reads happen via short-circuit reads (SCR), so we do not have reliable IO statistics. Second, performance-sensitive apps are often reading structured columnar data like Parquet, and thus only ranges of a file; whole-file or whole-block caching is therefore very coarse, and wastes precious memory or SSD. Do you plan to address these issues as part of this work?
* It's difficult to prioritize at the HDFS level, since performance is measured at the app level. Clusters typically run two broad categories of jobs: time-sensitive queries issued by users, and low-priority batch work. These two categories can also access the same data, though with different access patterns. If you're looking purely at HDFS-level information, without awareness of users, jobs, and their corresponding priorities, admins will have a hard time mapping rules to their actual SLOs.
* Could you talk a little more about the rules solver? What happens when a rule cannot be satisfied? This also ties back into app-level performance: there are "all-or-nothing" properties where caching half a dataset might improve average throughput, but not improve end-to-end execution time (the SLO metric).
* Also on the rules solver, how do we quantify the cost of executing an action? It's important to avoid unnecessarily migrating data back and forth.
* Could you talk some more about the value of Kafka in this architecture, compared to a naive implementation that just polls the NN and DN for information? I'm also wondering whether, with Kafka, we would still need a periodic snapshot of state, since Kafka is just a log.
* HDFS's inotify mechanism might also be interesting here.
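To illustrate the snapshot question in the Kafka bullet above: because a log only records changes, a consumer that restarts must either replay the whole log or load a periodic snapshot and replay just the tail. The sketch below is a minimal, hypothetical model of that pattern (not HDFS or Kafka code; the event types and the file-to-access-count state are invented for illustration):

```python
def apply_event(state, event):
    """Fold one log event into the state (a dict of path -> access count)."""
    op, path = event
    if op == "create":
        state[path] = 0
    elif op == "read":
        state[path] = state.get(path, 0) + 1
    elif op == "delete":
        state.pop(path, None)
    return state

def rebuild(snapshot, snapshot_offset, log):
    """Restore state from a snapshot, then replay only the log tail
    past the snapshot's offset, instead of the whole log."""
    state = dict(snapshot)
    for event in log[snapshot_offset:]:
        state = apply_event(state, event)
    return state

log = [("create", "/a"), ("read", "/a"), ("create", "/b"), ("read", "/a")]
# Snapshot taken after the first two events, so only two events are replayed:
snap = {"/a": 1}
print(rebuild(snap, 2, log))  # {'/a': 2, '/b': 0}
```

Without the snapshot, a restarting consumer would have to replay from offset 0, which grows unboundedly with cluster activity; the snapshot bounds recovery time, which is presumably why the design would want one even with Kafka in place.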
The doc talks a lot about improving performance, but I think the more important use case is actually saving cost by migrating data to archival or EC storage, given the above difficulties in understanding application-level performance from FS-level information alone. FWIW, we've had reasonable success with time-based policies for aging data out to colder storage with HDFS-4949, because many workloads have an access distribution that skews heavily toward newer data. So, some simple rules with time-based triggers, or rules that look at file atimes, might get us 80% of what users want.

> HDFS smart storage management
> -----------------------------
>
>                 Key: HDFS-7343
>                 URL: https://issues.apache.org/jira/browse/HDFS-7343
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Kai Zheng
>            Assignee: Wei Zhou
>         Attachments: HDFS-Smart-Storage-Management.pdf
>
> As discussed in HDFS-7285, it would be better to have a comprehensive and
> flexible storage policy engine considering file attributes, metadata, data
> temperature, storage type, EC codec, available hardware capabilities,
> user/application preference, etc.
> Modified the title for re-purpose.
> We'd extend this effort somewhat and aim to work on a comprehensive solution
> to provide a smart storage management service, for convenient, intelligent,
> and effective use of erasure coding or replicas, the HDFS cache facility,
> HSM offerings, and all kinds of tools (balancer, mover, disk balancer and
> so on) in a large cluster.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)