[
https://issues.apache.org/jira/browse/HDFS-17316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17814042#comment-17814042
]
Han Liu commented on HDFS-17316:
--------------------------------
Thanks for the comments from [[email protected]]
{quote}If you look at the hadoop test classes
* we always need an explicit timeout
* on abfs/wasb/s3a tests we want to support parallel test execution, where you
can run multiple JUnit threads in parallel… each path and even temporary
directories must be uniquely allocated to the given thread, which we do by
passing a thread ID down.{quote}
Great suggestions. I'm preparing the initial code, which has not yet taken them
into account. I'll add them to the doc and make a plan for the update.
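The per-thread path allocation described above could be sketched roughly as follows. This is a minimal illustration only, not the hadoop-aws implementation; the function name and `fork_id` parameter are hypothetical stand-ins for the fork/thread identifier that Hadoop's parallel test profiles pass down.

```python
import os
import threading

def unique_test_dir(base, fork_id=None):
    """Allocate a per-thread test directory so parallel test runs never
    collide on paths. fork_id stands in for the thread/fork identifier
    passed down by the parallel test harness; it defaults to the current
    thread's ident. (Illustrative sketch only.)"""
    if fork_id is None:
        fork_id = threading.get_ident()
    return os.path.join(base, f"fork-{fork_id}")
```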
{quote} * In HADOOP-18508 I'm trying to support having different (local/remote)
hadoop source trees running tests against the same S3 store. Goal: I can run a
hadoop public branch from one directory while debugging a different branch
locally. Something like that is needed here, given that it seems intended to
run against live HDFS clusters. Note that also brings authentication into the
mix, e.g. the option to use kinit to log in before running the tests. This is
not exclusive to HDFS either.{quote}
Exactly. All files/directories are created under the URI, so executions would
not affect each other as long as each URI is uniquely generated.
For running tests with different Hadoop branches against the same S3 cloud
service, my understanding is that this works naturally as long as the URI is
uniquely generated for each run.
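Generating a unique per-run URI could look something like this sketch. The helper name and the `benchmark-` prefix are illustrative assumptions, not part of the proposed tool.

```python
import uuid

def unique_run_uri(base_uri):
    """Append a fresh UUID path segment so each benchmark run gets an
    isolated namespace under the configured store URI, letting concurrent
    runs (or different Hadoop branches) share one store safely.
    (Illustrative sketch; naming is hypothetical.)"""
    return base_uri.rstrip("/") + "/benchmark-" + uuid.uuid4().hex
```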
{quote}It will be good -if not initially supported then at least designed as a
future option- to allow me to provide a list of stores to run the tests
against. This is because for S3A testing I now have to qualify with: amazon
s3, amazon s3 express, google gcs and at least one other implementation of the
API. All of which have slightly different capabilities -test process is going
to need to somehow be driven so that for the different implementations it knows
which features to test/results to expect. The current hadoop-aws/contract test
design is not up to this.
{quote}
Definitely. I personally run the community-supported S3A FS implementation. I
am glad that there will be many more targets so that we can evaluate the
benchmark comprehensively. However, only one FS instance can be supported per
run, because the benchmark should stay simple and fast.
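The capability-driven test selection mentioned in the quote could be expressed with a per-store capability map, along these lines. The store names and feature keys below are purely illustrative assumptions, not a proposed schema.

```python
# Hypothetical capability map: which optional semantics each target store
# supports, so the benchmark can report "unsupported" rather than "failed"
# for features a store never claimed. Keys are illustrative only.
CAPABILITIES = {
    "hdfs": {"append": True, "atomic_rename": True, "acl": True},
    "s3a":  {"append": False, "atomic_rename": False, "acl": False},
}

def should_expect_pass(store, feature):
    """Return True when the benchmark should expect this feature to work
    on the given store; unknown stores/features default to False."""
    return CAPABILITIES.get(store, {}).get(feature, False)
```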
{quote}Microsoft are pushing hard at windows support. For the shell operations
it might be very good if rather than using bash/sh/zsh, python and pyunit
were the test runner, which could then invoke Windows commands as well as shell
scripts. Pyunit test reports can be aggregated and displayed in Jenkins, which is
another nice feature of them.
{quote}
Windows support is important but has not been taken into account yet. It is in
the future plan.
Some initial code will be finished soon, and I'll post a link then.
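A pyunit test along the lines suggested in the quote might look like this sketch: pick the native command per OS instead of requiring bash. The test case name and command choice are illustrative assumptions, not part of the actual benchmark.

```python
import platform
import subprocess
import unittest

class ShellCommandCompatTest(unittest.TestCase):
    """Sketch of a cross-platform pyunit test that shells out to whatever
    command interpreter the host OS provides, so the same runner works on
    Windows and POSIX. (Illustrative only; the real benchmark would invoke
    hdfs shell commands instead.)"""

    def test_list_directory(self):
        if platform.system() == "Windows":
            cmd = ["cmd", "/c", "dir"]
        else:
            cmd = ["ls", "-la"]
        result = subprocess.run(cmd, capture_output=True, text=True)
        self.assertEqual(result.returncode, 0)
```

Reports from such tests can be emitted as JUnit-style XML and aggregated in Jenkins, which is the feature the quote highlights.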
> Compatibility Benchmark over HCFS Implementations
> -------------------------------------------------
>
> Key: HDFS-17316
> URL: https://issues.apache.org/jira/browse/HDFS-17316
> Project: Hadoop HDFS
> Issue Type: New Feature
> Reporter: Han Liu
> Priority: Major
> Attachments: HDFS Compatibility Benchmark Design.pdf
>
>
> {*}Background:{*} Hadoop-Compatible File System (HCFS) is a core concept in
> the big data storage ecosystem, providing unified interfaces and generally
> clear semantics, and it has become the de facto standard for industry storage
> systems to follow and conform to. There have been a series of HCFS
> implementations in Hadoop, such as S3AFileSystem for Amazon's S3 Object
> Store, WASB for Microsoft's Azure Blob Storage and the OSS connector for
> Alibaba Cloud Object Storage, and more from storage service providers on
> their own.
> {*}Problems:{*} However, as indicated by introduction.md, there is no formal
> suite to do a compatibility assessment of a file system for all such HCFS
> implementations. Thus, whether the functionality is well accomplished and
> meets the core compatibility expectations mainly relies on the service
> provider's own report. Meanwhile, Hadoop is also evolving and new features
> are continuously contributed to HCFS interfaces for existing implementations
> to follow and adopt, in which case Hadoop also needs a tool to quickly assess
> whether these features are supported by a specific HCFS implementation.
> Besides, the well-known hadoop command line tool, or hdfs shell, is used to
> directly interact with an HCFS storage system, where most commands correspond
> to specific HCFS interfaces and work well. Still, there are cases that are
> complicated and may not work, like the expunge command. To check such
> commands for an HCFS, we also need an approach to figure them out.
> {*}Proposal:{*} Accordingly, we propose to define a formal HCFS
> compatibility benchmark and provide a corresponding tool to do the
> compatibility assessment for an HCFS storage system. The benchmark and tool
> should cover both HCFS interfaces and hdfs shell commands. Different
> scenarios require different kinds of compatibility, so we could define
> different suites in the benchmark.
> *Benefits:* We intend the benchmark and tool to be useful for both storage
> providers and storage users. For end users, it can be used to evaluate the
> compatibility level and determine whether the storage system in question is
> suitable for the required scenarios. For storage providers, it helps to
> quickly generate an objective and reliable report about core functions of
> the storage service. For instance, if an HCFS scores 100% on a suite named
> 'tpcds', that demonstrates that all functions needed by a tpcds program are
> well supported. It is also a guide indicating how storage service abilities
> can map to HCFS interfaces, such as storage class on S3.
> Any thoughts? Comments and feedback are most welcome. Thanks in advance.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)