wombatu-kun opened a new pull request, #16641: URL: https://github.com/apache/iceberg/pull/16641
## Problem

When the Hadoop FileSystem cache is disabled (for example `fs.abfs.impl.disable.cache=true`), Parquet writes through Iceberg can fail mid-write. The reporter hit this on Spark writing to Azure ADLS Gen2 over `abfs://` with a Hadoop catalog: the job fails with `Could not submit task to executor ... ThreadPoolExecutor [Terminated]`, and the debug log shows `AzureBlobFileSystem.finalize()` running while the file is still being written.

## Root cause

`ParquetWriter#ensureWriterInitialized` builds `new ParquetFileWriter(ParquetIO.file(output, conf), ...)` and does not keep the Parquet `OutputFile` it passes in. For a `HadoopOutputFile`, `ParquetIO.file(...)` returns a Parquet-native `org.apache.parquet.hadoop.util.HadoopOutputFile` that resolves its own `FileSystem` through `path.getFileSystem(conf)`; with the cache disabled this is a fresh instance. Because `ParquetFileWriter` retains only the output stream and not the `OutputFile`, nothing keeps that `FileSystem` reachable once the writer has been constructed, so it can be garbage-collected while the write is still in progress.

On Azure this is fatal. `AbfsOutputStream` references the store's bounded thread pool (through a `SemaphoredDelegatingExecutor`) and the `AbfsClient`, but never the `AzureBlobFileSystem` wrapper. When the wrapper becomes unreachable, `AzureBlobFileSystem.finalize()` calls `close()`, which shuts down `AzureBlobFileSystemStore`'s `boundedThreadPool`; the still-open stream's next asynchronous flush then submits to that terminated pool and fails, which is exactly the reported log sequence.

The read path is unaffected because Parquet's `ParquetFileReader` retains its `InputFile`, and therefore the `FileSystem`, for the reader's lifetime. This change restores the same symmetry on the write side.

## Change

Keep the Parquet `OutputFile` on `ParquetWriter` so its `FileSystem` stays reachable for the writer's lifetime, including through `close()` and the footer flush.
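
The reachability argument above can be illustrated with a small, dependency-free sketch. Every class here (`FakeFileSystem`, `FakeStream`, `FakeOutputFile`, `SketchWriter`) is a hypothetical stand-in invented for illustration, not Iceberg, Parquet, or Hadoop code, and the sketch models reachability only: it does not reproduce finalization or the ABFS thread pool. A writer that holds the output file keeps the file system strongly reachable; a writer that holds only the stream does not, so the file system may be collected once a GC runs.

```java
import java.lang.ref.WeakReference;

public class FsReachabilitySketch {

    /** Hypothetical stand-in for a FileSystem wrapper that owns a shared resource. */
    static final class FakeFileSystem {
        final StringBuilder sink = new StringBuilder();
    }

    /** Stand-in for the output stream: it references the shared resource, never the wrapper. */
    static final class FakeStream {
        private final StringBuilder sink;

        FakeStream(StringBuilder sink) {
            this.sink = sink;
        }

        void write(String record) {
            sink.append(record).append('\n');
        }
    }

    /** Stand-in for an output file that resolves and holds its own file system. */
    static final class FakeOutputFile {
        final FakeFileSystem fs = new FakeFileSystem();

        FakeStream create() {
            return new FakeStream(fs.sink);
        }
    }

    /** Stand-in writer; retainOutputFile=false mimics keeping only the stream. */
    static final class SketchWriter {
        private final FakeStream stream;
        private final FakeOutputFile retained; // held only to keep the file system reachable

        SketchWriter(FakeOutputFile file, boolean retainOutputFile) {
            this.stream = file.create();
            this.retained = retainOutputFile ? file : null;
        }

        void write(String record) {
            stream.write(record);
        }
    }

    /** Reports whether the file system is still reachable while the writer is open. */
    public static boolean fileSystemReachableWhileOpen(boolean retainOutputFile) {
        FakeOutputFile file = new FakeOutputFile();
        WeakReference<FakeFileSystem> fsRef = new WeakReference<>(file.fs);
        SketchWriter writer = new SketchWriter(file, retainOutputFile);
        file = null;      // the caller drops its own reference, as the writer's caller does
        System.gc();      // only a hint; collection is not guaranteed
        boolean reachable = fsRef.get() != null;
        writer.write("row"); // the writer is still in use after the GC request
        return reachable;
    }

    public static void main(String[] args) {
        System.out.println("retaining writer:   " + fileSystemReachableWhileOpen(true));
        System.out.println("stream-only writer: " + fileSystemReachableWhileOpen(false));
    }
}
```

With the output file retained, the result is always `true`, because a strongly reachable object is never collected. The stream-only result depends on whether the JVM actually runs a collection, so it should be read as "may become unreachable" rather than as a fixed outcome.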
This keeps alive the exact `FileSystem` instance that backs the write until the write finishes. The fix is limited to this reported path. The no-writer-function path (`ParquetWriteAdapter` over parquet-mr's writer) and the Avro write path share the same root cause but are left as follow-ups.

## Tests

Added `TestParquetWriterFileSystemReachability`. It writes through a `ParquetWriter` backed by a `HadoopOutputFile` with the Hadoop FileSystem cache disabled, using a local FileSystem whose output stream does not hold a back-reference to the FileSystem (mirroring `AbfsOutputStream`). It asserts that every FileSystem resolved for the write stays reachable while the writer is open and becomes collectible after the writer is dropped. The test fails without the production change and passes with it.

Closes #16640

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
