https://issues.apache.org/jira/browse/HDFS-13616
I don't want to be territorial here - but as I keep reminding this list whenever it happens, I do not want any changes to go into the core FileSystem class without:

* raising a HADOOP JIRA
* involving those of us who work on object stores. We have different problems (latencies, failure modes) and want to move to async/completable APIs, ideally with builder APIs for future flexibility and per-FS options.
* specifying the semantics formally enough that people implementing and using the API know what they get
* a specification in filesystem.md
* contract tests which match the spec and which the object stores, as well as HDFS, can implement

The change has ~no javadocs and doesn't even state:

* whether it's recursive or not
* whether it includes directories or not

batchedListStatusIterator is exactly the kind of feature this should apply to - it is where we get a chance to fix the limitations of the previous calls (blocking sync, no expectation of a right to cancel listings), ...

I'd like to be able to:

* provide a hint on batch sizes
* get an async response, so the fact that the LIST can take time is more visible
* cancel that query if it is taking too long

I'd also like to be able to close an iterator; that is something we can/should retrofit, or require all implementations to add. Something like:

  CompletableFuture<RemoteIterator<PartialListing<FileStatus>>> listing =
      batchList(path)
          .recursive(true)
          .opt("fs.option.batchlist.size", 100)
          .build();

  RemoteIterator<PartialListing<FileStatus>> it = listing.get();
  FileStatus largeFile = null;
  try {
    while (largeFile == null && it.hasNext()) {
      // each next() hands back one batch of the listing
      for (FileStatus st : it.next().get()) {
        if (st.getLen() > 1_000_000) {
          largeFile = st;
          break;
        }
      }
    }
  } finally {
    // close the iterator if it supports it, so the FS can release resources
    if (it instanceof Closeable) {
      IOUtils.closeQuietly((Closeable) it);
    }
  }
  if (largeFile != null) {
    processLargeFile(largeFile);
  }

See: something for slower IO, controllable batch sizes, and a way to cancel the scan - so we can recycle the HTTP connection even when breaking out early.

This is a recurrent problem and I am getting as bored of sending these emails out as people probably are of receiving them. Please, please, at least talk to me. Yes, I'm going to add more homework, but the goal is to make this something well documented, well testable, and straightforward for other filesystems to implement, without us having to reverse engineer HDFS's behaviour and consider that normative.

What do I do here?

1. Do I overreact and revert the change until my needs are met? Because I know that if I volunteer to do this work myself it's going to get neglected.
2. Is someone going to put their hand up to help with this?

At the very least, I'm going to tag the APIs as unstable and likely to break, so that anyone who uses them in hadoop-3.3.0 isn't going to be upset when they move to a builder API. And they will have to, for the object stores.

sorry,
steve
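
PS: for contrast, here is roughly what callers are stuck with today against the shipping FileSystem API (an untested sketch; fs and path declared elsewhere) - synchronous, no batch-size hint, no cancellation:

  // blocks until the first page of results has come back from the store
  RemoteIterator<LocatedFileStatus> it = fs.listFiles(path, true);
  FileStatus largeFile = null;
  while (it.hasNext()) {   // on a paged store, hasNext() can be a blocking LIST
    FileStatus st = it.next();
    if (st.getLen() > 1_000_000) {
      largeFile = st;
      break;   // no way to close the listing and recycle the HTTP connection
    }
  }

Exactly the limitations the builder sketch above is there to fix.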