[
https://issues.apache.org/jira/browse/HDFS-17316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17807603#comment-17807603
]
Han Liu commented on HDFS-17316:
--------------------------------
Thank you for the detailed comments, [[email protected]], and apologies for the
late reply. I have recently reviewed my code so that it conforms to the Hadoop
code style.
{quote}1. filesystem contract tests are designed to do this from junit; If your
FS implementation doesn't subclass and run these, you need to start there.
{quote}
Contract tests play an essential role in evaluating storage service
capabilities: they closely examine core FS functions such as create, open and
delete. There is some overlap between the contract tests and the benchmark
discussed here. The main difference is that contract tests focus on the quality
of the most important subset of FS APIs, with a series of cases designed for
each API.
The goal of the benchmark proposed here is to provide a general way to check
basic compatibility of the public FS APIs. It treats all interfaces equally and
covers all of them, including ACL, XAttr, StoragePolicy, Snapshot, Symlink,
etc. For a new FS implementation, it should be possible to run the benchmark
quickly as long as the implementation jar file is supplied. The benchmark would
also introduce the concept of a 'suite' corresponding to a subset of APIs,
aiming to check compatibility for specific scenarios such as 'tpcds'.
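As a rough sketch of the 'suite' concept (the names CompatSuite, addCase and
run are illustrative assumptions, not the actual benchmark API), a suite could
group named API cases and report the fraction that pass:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BooleanSupplier;

// Hypothetical sketch of a 'suite': a named subset of API compatibility
// cases that can be run together and scored. Names are illustrative only.
public class CompatSuite {
    private final String name;
    private final Map<String, BooleanSupplier> cases = new LinkedHashMap<>();

    public CompatSuite(String name) { this.name = name; }

    // Register one API-compatibility case under this suite.
    public CompatSuite addCase(String api, BooleanSupplier check) {
        cases.put(api, check);
        return this;
    }

    // Run all cases and return the fraction that passed.
    public double run() {
        int passed = 0;
        for (Map.Entry<String, BooleanSupplier> e : cases.entrySet()) {
            boolean ok;
            try {
                ok = e.getValue().getAsBoolean();
            } catch (Exception ex) {
                ok = false; // an unsupported API may throw rather than fail cleanly
            }
            if (ok) passed++;
        }
        return cases.isEmpty() ? 0.0 : (double) passed / cases.size();
    }

    public static void main(String[] args) {
        CompatSuite suite = new CompatSuite("tpcds")
                .addCase("create", () -> true)    // stand-ins for real FS checks
                .addCase("setXAttr", () -> false);
        System.out.println(suite.run()); // prints 0.5
    }
}
```

A real case would of course call the target FileSystem instead of returning a
constant; the point is only that suites are plain collections of named checks.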
{quote}2. filesystem API specification is intended to specify the API and
document where problems surface. maintenance there always welcome -and as the
contract tests are derived from it, enhancements in those tests to follow
{quote}
It is important that the API specification keeps being maintained. As the
Hadoop ecosystem develops, the core API functions may evolve and require new
contract cases; MultipartUploaderTest is an example. I am glad to keep an eye
on it and contribute more cases when needed.
{quote}3. there's also terasort to validate commit protocols
{quote}
I agree that TeraSort can be used as part of the compatibility benchmark. There
can be an individual suite for validating the MapReduce file output committer.
{quote}4. + distcp contract tests for its semantics
{quote}
The validity of DistCp can also be an individual suite, where the test case is
a DistCp job from a MiniDFSCluster to the target storage service.
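Some suites could simply shell out to an existing tool and score the exit
code. As a hedged sketch (the ShellCase helper below is an assumption for
illustration, not existing code), a case might run a command such as hadoop
distcp and treat a zero exit status as compatible:

```java
import java.io.IOException;

// Hypothetical sketch: a shell-based case runs one command (e.g. a real
// case might invoke "hadoop distcp <src> <dst>") and treats a zero exit
// code as "compatible". The helper name is an assumption.
public class ShellCase {
    public static boolean passes(String... command) {
        try {
            Process p = new ProcessBuilder(command)
                    .inheritIO()  // forward the tool's output to the console
                    .start();
            return p.waitFor() == 0;
        } catch (IOException | InterruptedException e) {
            return false; // missing command or interruption counts as a failure
        }
    }

    public static void main(String[] args) {
        // Stand-in command; a real case would invoke the hdfs shell or distcp.
        System.out.println(passes("true"));
    }
}
```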
{quote}5. dfsio does a lot, but needs maintenance -it only targets the
clusterfs, when really you should be able to point at cloud storage from your
own computer. extending that to take a specific target fs would be good.
{quote}
I agree that DFSIO should be extended to support general targets. This should
be done in Hadoop as a separate task, so that the benchmark tool can use it.
Good idea!
{quote}6. output must go into the class ant junit xml format so jenkins can
present it.
{quote}
Good suggestion. The benchmark is designed as a tool for quickly evaluating the
compatibility score of a FS implementation, so it might be inappropriate to
treat it as a unit test system. All cases must be simple, and after a quick run
a report is automatically generated showing an overall score and a list of 'not
compatible APIs'. The framework contains both Java cases and pjdfstest-style
shell scripts, so it is more flexible and does not need a JUnit report.
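To illustrate the intended report (the class name and output format below are
only assumptions for illustration, not the actual design), the tool could
aggregate per-API results into an overall score plus the list of incompatible
APIs:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the final report: an overall score and the list
// of 'not compatible' APIs. Names and format are assumptions.
public class CompatReport {
    public static String render(Map<String, Boolean> results) {
        List<String> failed = new ArrayList<>();
        int passed = 0;
        for (Map.Entry<String, Boolean> e : results.entrySet()) {
            if (e.getValue()) passed++; else failed.add(e.getKey());
        }
        long score = Math.round(100.0 * passed / results.size());
        return "score: " + score + "%\nnot compatible: " + failed;
    }

    public static void main(String[] args) {
        Map<String, Boolean> results = new LinkedHashMap<>();
        results.put("create", true);
        results.put("setAcl", true);
        results.put("createSnapshot", false);
        System.out.println(render(results));
        // score: 67%
        // not compatible: [createSnapshot]
    }
}
```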
{quote}We can create a new hadoop git repo for this. Do you have existing code
and any detailed specification/docs. this also allows you to add dependencies
on other things, e.g. spark.
{quote}
Yes, I already have some initial code and will submit a PR later for easier
reference. I am also preparing a design doc with more details and will share
the link here when it is ready. The benchmark we discussed does not need extra
dependencies on Spark or Hive; on the contrary, the design may limit the
dependencies to Hadoop itself. Thus, a small submodule of the Hadoop repo
should be enough, perhaps a hadoop-compat-bench module under hadoop-tools.
Further discussion and code review are welcome!
> Compatibility Benchmark over HCFS Implementations
> -------------------------------------------------
>
> Key: HDFS-17316
> URL: https://issues.apache.org/jira/browse/HDFS-17316
> Project: Hadoop HDFS
> Issue Type: New Feature
> Reporter: Han Liu
> Priority: Major
>
> {*}Background:{*}Hadoop-Compatible File System (HCFS) is a core concept in
> the big data storage ecosystem, providing unified interfaces and generally
> clear semantics, and has become the de-facto standard for industry storage
> systems to follow and conform to. There have been a series of HCFS
> implementations in Hadoop, such as S3AFileSystem for Amazon's S3 Object
> Store, WASB for Microsoft's Azure Blob Storage and the OSS connector for
> Alibaba Cloud Object Storage, with more from storage service providers on
> their own.
> {*}Problems:{*}However, as indicated by introduction.md, there is no formal
> suite for assessing the compatibility of all such HCFS implementations.
> Thus, whether the functionality is well implemented and meets the core
> compatibility expectations mainly relies on the service provider's own
> report. Meanwhile, Hadoop keeps developing, and new features are
> continuously contributed to the HCFS interfaces for existing implementations
> to follow and adopt, in which case Hadoop also needs a tool to quickly
> assess whether a specific HCFS implementation supports these features.
> Besides, the well-known hadoop command line tool (the hdfs shell) is used to
> interact directly with an HCFS storage system, where most commands
> correspond to specific HCFS interfaces and work well. Still, there are cases
> that are complicated and may not work, such as the expunge command. To check
> such commands for an HCFS, we also need an approach to figure them out.
> {*}Proposal:{*}Accordingly, we propose to define a formal HCFS compatibility
> benchmark and provide a corresponding tool to perform the compatibility
> assessment for an HCFS storage system. The benchmark and tool should cover
> both the HCFS interfaces and the hdfs shell commands. Different scenarios
> require different kinds of compatibility, so we could define different
> suites in the benchmark.
> *Benefits:* We intend the benchmark and tool to be useful for both storage
> providers and storage users. For end users, it can be used to evaluate the
> compatibility level and determine whether the storage system in question is
> suitable for the required scenarios. For storage providers, it helps to
> quickly generate an objective and reliable report about the core functions
> of the storage service. For instance, if an HCFS scores 100% on a suite
> named 'tpcds', it demonstrates that all functions needed by a TPC-DS
> program are well supported. The benchmark is also a guide indicating how
> storage service capabilities can map to HCFS interfaces, such as storage
> class on S3.
> Any thoughts? Comments and feedback are mostly welcomed. Thanks in advance.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)