pantShrey opened a new issue, #21215:
URL: https://github.com/apache/datafusion/issues/21215
### Is your feature request related to a problem or challenge?
DataFusion's spill infrastructure is currently hardcoded to OS-level files
in two distinct areas, leaving no way to plug in a custom file backend
**1. File Creation (`DiskManager`)**
`DiskManagerMode` has no extension point for custom file creation:
```rust
pub enum DiskManagerMode {
OsTmpDirectory,
Directories(Vec<PathBuf>),
Disabled,
// No way to plug in a custom backend
}
```
And `create_tmp_file` returns a concrete struct hardcoded to `NamedTempFile`:
```rust
pub fn create_tmp_file(
self: &Arc,
request_description: &str,
) -> Result<RefCountedTempFile> // hardcoded to NamedTempFile internally
```
**2. File Access (`IPCStreamWriter` / `InProgressSpillFile`)**
Even if `DiskManager` were made extensible,
`InProgressSpillFile::append_batch` opens the file by OS path:
```rust
self.writer = Some(IPCStreamWriter::new(
in_progress_file.path(), // assumes an OS path exists
schema.as_ref(),
self.spill_writer.compression,
)?);
```
This means `IPCStreamWriter` also assumes an OS-level file with a real
filesystem path.
**The concrete use case**
This blocks embedding DataFusion in environments where temp files must go
through the host system's own file management APIs. A concrete example is
Postgres extensions (like [ParadeDB](https://github.com/paradedb/paradedb)),
where spill files need to go through Postgres's `BufFile`/VFD APIs in order to:
- Respect `temp_tablespaces` (controls where temp files are written)
- Enforce `temp_file_limit` (Postgres's GUC for capping temp file usage per
session)
- Participate in Postgres's resource owner cleanup on transaction abort — if
Postgres `longjmp`s through a `FATAL` error, Rust's `Drop` won't run, but
Postgres's resource owner will still clean up `BufFile`-backed files
automatically
Postgres's `BufFile` has no OS-visible filesystem path, it is an internal
abstraction, so it cannot be made to fit the current path-based interface at
either level.
### Describe the solution you'd like
The fix requires abstracting both the creation and the access of these files:
**Make `DiskManager` extensible**
Introduce a `TempFileFactory` trait and a `Custom` variant in
`DiskManagerMode`:
```rust
pub trait TempFileFactory: Send + Sync {
fn create_temp_file(
&self,
request_description: &str,
) -> Result<Box<dyn SpillFile>, DataFusionError>;
}
pub enum DiskManagerMode {
OsTmpDirectory,
Directories(Vec<PathBuf>),
Disabled,
Custom(Arc<dyn TempFileFactory>), // new
}
```
**Abstract over file access in `IPCStreamWriter`**
Instead of accepting a path, `IPCStreamWriter` should accept a `Write` trait
object so it can write through any backend:
```rust
// Instead of:
IPCStreamWriter::new(in_progress_file.path(), schema, compression)
// Something like:
IPCStreamWriter::new(Box, schema, compression)
```
Similarly `SpillFile` (the trait returned by `TempFileFactory`) would need
to expose `Read`/`Write`/`Seek` rather than a path, so `InProgressSpillFile`
and `SpillReaderStream` can work through the abstraction.
All existing behavior defaults unchanged, the current OS-backed path is just
one implementation of the trait.
### Describe alternatives you've considered
Configuring `DiskManager` with `Directories(postgres_temp_path)` to point it
at Postgres's temp tablespace directory, but this bypasses Postgres's resource
accounting entirely. `temp_file_limit` won't be enforced, VFD descriptor limits
won't be respected, and files won't be cleaned up through Postgres's resource
owner on transaction abort.
### Additional context
_No response_
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]