[
https://issues.apache.org/jira/browse/HDFS-17316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17807603#comment-17807603
]
Han Liu commented on HDFS-17316:
--------------------------------
Thank you for the detailed comments, [[email protected]], and apologies for the
late reply. I have recently reviewed my code so that it conforms to the Hadoop
code style.
{quote}1. filesystem contract tests are designed to do this from junit; If your
FS implementation doesn't subclass and run these, you need to start there.
{quote}
Contract tests play an essential role in evaluating storage service
capabilities: they closely examine core FS functions such as create, open and
delete. There is some overlap between the contract tests and the benchmark
discussed here. The main difference is that contract tests focus on the quality
of the most important subset of FS APIs, with a series of cases designed for
each API.
The goal of the benchmark proposed here is to provide a general way to check
basic compatibility of the public FS APIs. It treats all interfaces equally and
covers all of them, including ACL, XAttr, StoragePolicy, Snapshot, Symlink,
etc. For a new FS implementation, it should be possible to run the benchmark
quickly as long as the implementation jar file is supplied. The benchmark would
also introduce the concept of a 'suite' corresponding to a subset of APIs,
aiming to check compatibility for specific scenarios such as 'tpcds'.
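As a rough sketch of the 'suite' concept (the names CompatSuite, addCase and
run are illustrative assumptions, not the actual benchmark API), a suite could
group named API cases and report the fraction that pass:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BooleanSupplier;

// Hypothetical sketch of a 'suite': a named subset of API compatibility
// cases that can be run together and scored. Names are illustrative only.
public class CompatSuite {
    private final String name;
    private final Map<String, BooleanSupplier> cases = new LinkedHashMap<>();

    public CompatSuite(String name) { this.name = name; }

    // Register one API-compatibility case under this suite.
    public CompatSuite addCase(String api, BooleanSupplier check) {
        cases.put(api, check);
        return this;
    }

    // Run all cases and return the fraction that passed.
    public double run() {
        int passed = 0;
        for (Map.Entry<String, BooleanSupplier> e : cases.entrySet()) {
            boolean ok;
            try {
                ok = e.getValue().getAsBoolean();
            } catch (Exception ex) {
                ok = false; // an unsupported API may throw rather than fail cleanly
            }
            if (ok) passed++;
        }
        return cases.isEmpty() ? 0.0 : (double) passed / cases.size();
    }

    public static void main(String[] args) {
        CompatSuite suite = new CompatSuite("tpcds")
                .addCase("create", () -> true)    // stand-ins for real FS checks
                .addCase("setXAttr", () -> false);
        System.out.println(suite.run()); // prints 0.5
    }
}
```

A real case would of course call the target FileSystem instead of returning a
constant; the point is only that suites are plain collections of named checks.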
{quote}2. filesystem API specification is intended to specify the API and
document where problems surface. maintenance there always welcome -and as the
contract tests are derived from it, enhancements in those tests to follow
{quote}
It is important that the API specification keeps being maintained. As the
Hadoop ecosystem develops, the core API functions may evolve and require new
contract cases; MultipartUploaderTest is an example. I am glad to keep an eye
on it and contribute more cases when needed.
{quote}3. there's also terasort to validate commit protocols
{quote}
I agree that TeraSort can be used as part of the compatibility benchmark. There
can be an individual suite for validating the MapReduce file output committer.
{quote}4. + distcp contract tests for its semantics
{quote}
The validity of DistCp can also be an individual suite, where the test case is
a DistCp job from a MiniDFSCluster to the target storage service.
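Some suites could simply shell out to an existing tool and score the exit
code. As a hedged sketch (the ShellCase helper below is an assumption for
illustration, not existing code), a case might run a command such as hadoop
distcp and treat a zero exit status as compatible:

```java
import java.io.IOException;

// Hypothetical sketch: a shell-based case runs one command (e.g. a real
// case might invoke "hadoop distcp <src> <dst>") and treats a zero exit
// code as "compatible". The helper name is an assumption.
public class ShellCase {
    public static boolean passes(String... command) {
        try {
            Process p = new ProcessBuilder(command)
                    .inheritIO()  // forward the tool's output to the console
                    .start();
            return p.waitFor() == 0;
        } catch (IOException | InterruptedException e) {
            return false; // missing command or interruption counts as a failure
        }
    }

    public static void main(String[] args) {
        // Stand-in command; a real case would invoke the hdfs shell or distcp.
        System.out.println(passes("true"));
    }
}
```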
{quote}5. dfsio does a lot, but needs maintenance -it only targets the
clusterfs, when really you should be able to point at cloud storage from your
own computer. extending that to take a specific target fs would be good.
{quote}
I agree that DFSIO should be extended to support general targets. This should
be done in Hadoop as a separate task, so that the benchmark tool can use it.
Good idea!
{quote}6. output must go into the class ant junit xml format so jenkins can
present it.
{quote}
Good suggestion. The benchmark is designed as a tool for quickly evaluating the
compatibility score of a FS implementation, so it might be inappropriate to
treat it as a unit test system. All cases must be simple, and after a quick run
a report is automatically generated showing an overall score and a list of 'not
compatible APIs'. The framework contains both Java cases and pjdfstest-style
shell scripts, so it is more flexible and does not need a JUnit report.
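To illustrate the intended report (the class name and output format below are
only assumptions for illustration, not the actual design), the tool could
aggregate per-API results into an overall score plus the list of incompatible
APIs:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the final report: an overall score and the list
// of 'not compatible' APIs. Names and format are assumptions.
public class CompatReport {
    public static String render(Map<String, Boolean> results) {
        List<String> failed = new ArrayList<>();
        int passed = 0;
        for (Map.Entry<String, Boolean> e : results.entrySet()) {
            if (e.getValue()) passed++; else failed.add(e.getKey());
        }
        long score = Math.round(100.0 * passed / results.size());
        return "score: " + score + "%\nnot compatible: " + failed;
    }

    public static void main(String[] args) {
        Map<String, Boolean> results = new LinkedHashMap<>();
        results.put("create", true);
        results.put("setAcl", true);
        results.put("createSnapshot", false);
        System.out.println(render(results));
        // score: 67%
        // not compatible: [createSnapshot]
    }
}
```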
{quote}We can create a new hadoop git repo for this. Do you have existing code
and any detailed specification/docs. this also allows you to add dependencies
on other things, e.g. spark.
{quote}
Yes, I already have some initial code and will submit a PR later for easier
reference. I am also preparing a design doc with more details and will share
the link here when it is ready. The benchmark we discussed does not need extra
dependencies on Spark or Hive; on the contrary, the design may limit the
dependencies to Hadoop itself. Thus, a small submodule of the Hadoop repo
should be enough, perhaps a hadoop-compat-bench module under hadoop-tools.
Further discussion and code review are welcome!
> Compatibility Benchmark over HCFS Implementations
> -------------------------------------------------
>
> Key: HDFS-17316
> URL: https://issues.apache.org/jira/browse/HDFS-17316
> Project: Hadoop HDFS
> Issue Type: New Feature
> Reporter: Han Liu
> Priority: Major
>
> {*}Background:{*}Hadoop-Compatible File System (HCFS) is a core concept in
> the big data storage ecosystem, providing unified interfaces and generally
> clear semantics, and has become the de-facto standard for industry storage
> systems to follow and conform to. There have been a series of HCFS
> implementations in Hadoop, such as S3AFileSystem for Amazon's S3 Object
> Store, WASB for Microsoft's Azure Blob Storage and the OSS connector for
> Alibaba Cloud Object Storage, with more from storage service providers on
> their own.
> {*}Problems:{*}However, as indicated by introduction.md, there is no formal
> suite for assessing the compatibility of all such HCFS implementations.
> Thus, whether the functionality is well implemented and meets the core
> compatibility expectations mainly relies on the service provider's own
> report. Meanwhile, Hadoop keeps developing, and new features are
> continuously contributed to the HCFS interfaces for existing implementations
> to follow and adopt, in which case Hadoop also needs a tool to quickly
> assess whether a specific HCFS implementation supports these features.
> Besides, the well-known hadoop command line tool (the hdfs shell) is used to
> interact directly with an HCFS storage system, where most commands
> correspond to specific HCFS interfaces and work well. Still, there are cases
> that are complicated and may not work, such as the expunge command. To check
> such commands for an HCFS, we also need an approach to figure them out.
> {*}Proposal:{*}Accordingly, we propose to define a formal HCFS compatibility
> benchmark and provide a corresponding tool to perform the compatibility
> assessment for an HCFS storage system. The benchmark and tool should cover
> both the HCFS interfaces and the hdfs shell commands. Different scenarios
> require different kinds of compatibility, so we could define different
> suites in the benchmark.
> *Benefits:* We intend the benchmark and tool to be useful for both storage
> providers and storage users. For end users, it can be used to evaluate the
> compatibility level and determine whether the storage system in question is
> suitable for the required scenarios. For storage providers, it helps to
> quickly generate an objective and reliable report about the core functions
> of the storage service. For instance, if an HCFS scores 100% on a suite
> named 'tpcds', it demonstrates that all functions needed by a TPC-DS
> program are well supported. The benchmark is also a guide indicating how
> storage service capabilities can map to HCFS interfaces, such as storage
> class on S3.
> Any thoughts? Comments and feedback are mostly welcomed. Thanks in advance.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)