I've just created an initial project "fs-api-shim" to provide controlled
access to the hadoop 3.3.3+ filesystem API calls on hadoop 3.2.0+ releases
https://github.com/steveloughran/fs-api-shim

The goal here is to make it possible for core file format libraries
(Parquet, Avro, ORC, Arrow etc) and other apps (HBase, ...) to take
advantage of those APIs which we have updated and optimised for access to
cloud stores. Currently the applications do not use them, and so they
underperform on recent releases. I have the ability to change our internal
forks, but I would like to let others gain from the changes and avoid
having our internal libraries diverge too much.

Currently too many libraries seem frozen in time:

Avro: still rejecting changes which don't compile on hadoop 2
https://github.com/apache/avro/pull/1431

Parquet: still using reflection to access non hadoop 1.x filesystem API
calls
https://github.com/apache/parquet-mr/pull/971

I'm not going to support hadoop 2.10, but we can at least say "move up to
hadoop 3.2.x and we will let you use later APIs when available".

Some calls, like openFile(), will work everywhere. On versions with the
openFile() builder API they will take the file status and seek policy,
which lets libraries declare whether their IO is random or sequential and
skip the HEAD requests which the object store connectors otherwise issue to
verify that the file exists and to determine its length for the ranged GET
requests which will follow.

https://github.com/steveloughran/fs-api-shim/blob/main/fs-api-shim-library/src/main/java/org/apache/hadoop/fs/shim/FileSystemShim.java#L38

On Hadoop 3.2.x, or if openFile() fails for some reason, it will just
downgrade to the classic open() call.
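
As a rough illustration of that downgrade path, here is a minimal JDK-only
sketch. ClassicFs, BuilderFs and OpenFileShimSketch are hypothetical
stand-ins, not classes from the shim library or from Hadoop, and plain
strings stand in for paths and streams; the real FileSystemShim may well be
structured differently.

```java
import java.lang.reflect.Method;

/** Hypothetical stand-in for a filesystem with only the classic open(). */
class ClassicFs {
    public String open(String path) {
        return "open:" + path;
    }
}

/** Hypothetical stand-in for a newer filesystem which also has openFile(). */
class BuilderFs extends ClassicFs {
    public String openFile(String path) {
        return "openFile:" + path;
    }
}

/** Sketch of a shim: prefer openFile(), downgrade to open(). */
final class OpenFileShimSketch {
    private final ClassicFs fs;
    private final Method openFile; // null when the method is absent

    OpenFileShimSketch(ClassicFs fs) {
        this.fs = fs;
        Method m;
        try {
            // Probe once, at construction time, for the newer call.
            m = fs.getClass().getMethod("openFile", String.class);
        } catch (NoSuchMethodException e) {
            m = null;
        }
        this.openFile = m;
    }

    String open(String path) {
        if (openFile != null) {
            try {
                return (String) openFile.invoke(fs, path);
            } catch (ReflectiveOperationException | RuntimeException e) {
                // The newer call failed: fall through to the classic one.
            }
        }
        return fs.open(path);
    }
}
```

The caller only ever sees open(); whether the newer method was used is an
internal detail of the shim.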

Other API calls we can bind to dynamically through reflection, but with no
fallback if they are unavailable. This will allow libraries to use the API
calls if present, but force them to come up with alternative solutions if
not.
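
A minimal sketch of what "bind if present, no fallback" could look like,
using only the JDK. PlainStream, BufferStream and OptionalReadBinding are
hypothetical names made up for this example, and read(ByteBuffer) is used
purely as an illustrative optional method; none of this is taken from the
shim library's actual code.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.nio.ByteBuffer;

/** Hypothetical stand-in for a stream lacking the optional call. */
class PlainStream {
}

/** Hypothetical stand-in for a stream offering read(ByteBuffer). */
class BufferStream extends PlainStream {
    public int read(ByteBuffer buf) {
        int n = buf.remaining();
        while (buf.hasRemaining()) {
            buf.put((byte) 0); // pretend to fill the buffer with data
        }
        return n;
    }
}

/** Sketch of a reflection binding with a probe and no fallback. */
final class OptionalReadBinding {
    private final Object target;
    private final Method read; // null when the method is absent

    OptionalReadBinding(Object target) {
        this.target = target;
        Method m;
        try {
            m = target.getClass().getMethod("read", ByteBuffer.class);
        } catch (NoSuchMethodException e) {
            m = null;
        }
        this.read = m;
    }

    /** Callers probe this and pick their own alternative if false. */
    boolean isAvailable() {
        return read != null;
    }

    int read(ByteBuffer buf) {
        if (read == null) {
            throw new UnsupportedOperationException("read(ByteBuffer) not available");
        }
        try {
            return (Integer) read.invoke(target, buf);
        } catch (InvocationTargetException e) {
            throw new IllegalStateException(e.getCause());
        } catch (IllegalAccessException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

A library would check isAvailable() once and choose its code path up front,
rather than catching UnsupportedOperationException on every call.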

A key part of this is FSDataInputStream, where the ByteBufferReadable API
would be of benefit to Parquet:

https://github.com/steveloughran/fs-api-shim/blob/main/fs-api-shim-library/src/main/java/org/apache/hadoop/fs/shim/FSDataInputStreamShim.java

When we get the vectored IO feature branch in, we can offer similar
reflection-based access. It means applications can compile against hadoop
3.2.x and 3.3.x but still take advantage of the newer APIs when they are
running on a version which has them.

I'm going to stay clear of more complicated APIs which don't offer tangible
performance gains and which are very hard to do (IOStatistics).

Testing is fun; I have a plan there which consists of FS contract tests in
the shim test source tree to verify the 3.2.0 functionality and an adjacent
module which will run those same tests against more recent versions. The
tests will have to be targetable against object stores as well as the local
and mini HDFS filesystems.

This is all on github; however it is very much a hadoop extension library.
Is there a way we could release it as an ASF library but on a different
timetable from normal Hadoop releases? There is always incubator, but this
is such a minor project that it is closer to the
org.apache.hadoop.thirdparty library, in that it is something all current
committers should be able to commit to and release, while releasing on a
schedule independent of hadoop releases themselves. Having it come from
this project should give it more legitimacy.

Steve
