I've just created an initial project, "fs-api-shim", to provide controlled access to the hadoop 3.3.3+ filesystem API calls on hadoop 3.2.0+ releases:
https://github.com/steveloughran/fs-api-shim

The goal here is to make it possible for core file format libraries (Parquet, Avro, ORC, Arrow etc) and other apps (HBase, ...) to take advantage of those APIs which we have updated and optimised for access to cloud stores. Currently the applications do not, and so they underperform on recent releases.

I have the ability to change our internal forks, but I would like to let others gain from the changes and avoid having our internal libraries diverge too much.

Currently too many libraries seem frozen in time:

Avro: still rejecting changes which don't compile on hadoop 2
https://github.com/apache/avro/pull/1431

Parquet: still using reflection to access non hadoop 1.x filesystem API calls
https://github.com/apache/parquet-mr/pull/971

I'm not going to support hadoop 2.10, but we can at least say "move up to hadoop 3.2.x and we will let you use later APIs when available".

Some calls, like openFile(), will work everywhere. On versions with the openFile() builder API they will take the file status and seek policy, so libraries can declare whether their IO is random or sequential, and skip the HEAD requests which are otherwise made against the object stores to verify that the file exists and to determine its length for the ranged GET requests which will follow.
https://github.com/steveloughran/fs-api-shim/blob/main/fs-api-shim-library/src/main/java/org/apache/hadoop/fs/shim/FileSystemShim.java#L38

On Hadoop 3.2.x, or if openFile() fails for some reason, it will just downgrade to the classic open() call.

Other API calls we can support through dynamic binding via reflection, but with no fallback if they are unavailable. This will allow libraries to use the API calls if present, but force them to come up with alternative solutions if not.
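
To illustrate the "bind if present, no fallback in the shim" idea, here is a rough sketch that runs without Hadoop on the classpath. Everything in it is hypothetical and is not taken from fs-api-shim: OptionalMethod, OldStream, NewStream and readInto() are made-up names, and the two stream classes are only stand-ins for a stream with and without a read(ByteBuffer) call. The binding class reports whether the method exists and refuses to invoke it otherwise; the alternative path lives in the calling library.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.nio.ByteBuffer;

/** Illustrative only: none of these names come from fs-api-shim or Hadoop. */
final class ShimBindingSketch {

  /** Binds to a method by name at runtime; nothing is emulated if it is missing. */
  public static final class OptionalMethod {
    private final String name;
    private final Method method; // null when the target class lacks the method

    public OptionalMethod(Class<?> target, String name, Class<?>... parameterTypes) {
      this.name = name;
      Method found;
      try {
        found = target.getMethod(name, parameterTypes);
      } catch (NoSuchMethodException e) {
        found = null;
      }
      this.method = found;
    }

    /** Probe which callers check before choosing a code path. */
    public boolean available() {
      return method != null;
    }

    /** Invokes the bound method, or fails fast if it was never found. */
    public Object invoke(Object instance, Object... args) throws Exception {
      if (method == null) {
        throw new UnsupportedOperationException(name + " is not available");
      }
      try {
        return method.invoke(instance, args);
      } catch (InvocationTargetException e) {
        // Surface the exception raised by the target method itself.
        Throwable cause = e.getCause();
        if (cause instanceof Exception) {
          throw (Exception) cause;
        }
        throw e;
      }
    }
  }

  /** Stand-in for a stream from an older release: byte[] reads only. */
  public static class OldStream {
    protected final byte[] data;
    protected int pos;

    public OldStream(byte[] data) {
      this.data = data;
    }

    public int read(byte[] dest) {
      if (pos >= data.length) {
        return -1;
      }
      int n = Math.min(dest.length, data.length - pos);
      System.arraycopy(data, pos, dest, 0, n);
      pos += n;
      return n;
    }
  }

  /** Stand-in for a stream from a newer release: adds a ByteBuffer read. */
  public static class NewStream extends OldStream {
    public NewStream(byte[] data) {
      super(data);
    }

    public int read(ByteBuffer dest) {
      if (pos >= data.length) {
        return -1;
      }
      int n = Math.min(dest.remaining(), data.length - pos);
      dest.put(data, pos, n);
      pos += n;
      return n;
    }
  }

  /** Caller-side logic: use the newer call if bound, else the caller's own alternative. */
  public static int readInto(OldStream in, ByteBuffer dest) throws Exception {
    OptionalMethod byteBufferRead = new OptionalMethod(in.getClass(), "read", ByteBuffer.class);
    if (byteBufferRead.available()) {
      return (Integer) byteBufferRead.invoke(in, dest);
    }
    // The library's own alternative: copy through a temporary array.
    byte[] tmp = new byte[dest.remaining()];
    int n = in.read(tmp);
    if (n > 0) {
      dest.put(tmp, 0, n);
    }
    return n;
  }

  public static void main(String[] args) throws Exception {
    byte[] data = {1, 2, 3, 4};
    System.out.println(readInto(new NewStream(data), ByteBuffer.allocate(8)));
    System.out.println(readInto(new OldStream(data), ByteBuffer.allocate(8)));
  }
}
```

In this sketch the lookup is repeated on every readInto() call to keep the example short; a real binding would presumably resolve the method once and cache it.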
A key part of this is FSDataInputStream, where the ByteBufferReadable API would be of benefit to Parquet.
https://github.com/steveloughran/fs-api-shim/blob/main/fs-api-shim-library/src/main/java/org/apache/hadoop/fs/shim/FSDataInputStreamShim.java

When we get the vectored IO feature branch in, we can offer similar reflection-based access. It means applications can compile against hadoop 3.2.x and 3.3.x but still take advantage of the newer APIs when they are running on a version which has them.

I'm going to stay clear of more complicated APIs which don't offer tangible performance gains and which are very hard to do (IOStatistics).

Testing is fun; I have a plan there which consists of FS contract tests in the shim test source tree to verify the 3.2.0 functionality, and an adjacent module which will run those same tests against more recent versions. The tests will have to be targetable against object stores as well as local and mini HDFS filesystems.

This is all in github; however it is very much a hadoop extension library. Is there a way we could release it as an ASF library, but on a different timetable from normal Hadoop releases? There is always incubator, but this is such a minor project that it is closer to the org.apache.hadoop.thirdparty library, in that it is something all current committers should be able to commit to and release, while releasing on a schedule independent of hadoop releases themselves.

Having it come from this project should give it more legitimacy.

Steve