pantShrey opened a new issue, #21215:
URL: https://github.com/apache/datafusion/issues/21215

   ### Is your feature request related to a problem or challenge?
   
   DataFusion's spill infrastructure is currently hardcoded to OS-level files 
in two distinct areas, leaving no way to plug in a custom file backend
   
   **1. File Creation (`DiskManager`)**
   `DiskManagerMode` has no extension point for custom file creation:
   ```rust
   pub enum DiskManagerMode {
       OsTmpDirectory,
       Directories(Vec<PathBuf>),
       Disabled,
       // No way to plug in a custom backend
   }
   ```
   
   And `create_tmp_file` returns a concrete struct hardcoded to `NamedTempFile`:
   ```rust
   pub fn create_tmp_file(
       self: &Arc,
       request_description: &str,
   ) -> Result<RefCountedTempFile>  // hardcoded to NamedTempFile internally
   ```
   
   **2. File Access (`IPCStreamWriter` / `InProgressSpillFile`)**
   Even if `DiskManager` were made extensible, 
`InProgressSpillFile::append_batch` opens the file by OS path:
   ```rust
   self.writer = Some(IPCStreamWriter::new(
       in_progress_file.path(),  // assumes an OS path exists
       schema.as_ref(),
       self.spill_writer.compression,
   )?);
   ```
   
   This means `IPCStreamWriter` also assumes an OS-level file with a real 
filesystem path.
   
   **The concrete use case**
   
   This blocks embedding DataFusion in environments where temp files must go 
through the host system's own file management APIs. A concrete example is 
Postgres extensions (like [ParadeDB](https://github.com/paradedb/paradedb)), 
where spill files need to go through Postgres's `BufFile`/VFD APIs in order to:
   
   - Respect `temp_tablespaces` (controls where temp files are written)
   - Enforce `temp_file_limit` (Postgres's GUC for capping temp file usage per 
session)
   - Participate in Postgres's resource owner cleanup on transaction abort — if 
Postgres `longjmp`s through a `FATAL` error, Rust's `Drop` won't run, but 
Postgres's resource owner will still clean up `BufFile`-backed files 
automatically
   
   Postgres's `BufFile` has no OS-visible filesystem path, it is an internal 
abstraction, so it cannot be made to fit the current path-based interface at 
either level.
   
   
   ### Describe the solution you'd like
   
   The fix requires abstracting both the creation and the access of these files:
   
   **Make `DiskManager` extensible**
   
   Introduce a `TempFileFactory` trait and a `Custom` variant in 
`DiskManagerMode`:
   ```rust
   pub trait TempFileFactory: Send + Sync {
       fn create_temp_file(
           &self,
           request_description: &str,
       ) -> Result<Box<dyn SpillFile>, DataFusionError>;
   }
   
   pub enum DiskManagerMode {
       OsTmpDirectory,
       Directories(Vec<PathBuf>),
       Disabled,
       Custom(Arc<dyn TempFileFactory>), // new
   }
   ```
   
   **Abstract over file access in `IPCStreamWriter`**
   
   Instead of accepting a path, `IPCStreamWriter` should accept a `Write` trait 
object so it can write through any backend:
   ```rust
   // Instead of:
   IPCStreamWriter::new(in_progress_file.path(), schema, compression)
   
   // Something like:
   IPCStreamWriter::new(Box, schema, compression)
   ```
   
   Similarly `SpillFile` (the trait returned by `TempFileFactory`) would need 
to expose `Read`/`Write`/`Seek` rather than a path, so `InProgressSpillFile` 
and `SpillReaderStream` can work through the abstraction.
   
   All existing behavior defaults unchanged, the current OS-backed path is just 
one implementation of the trait.
   
   ### Describe alternatives you've considered
   
   Configuring `DiskManager` with `Directories(postgres_temp_path)` to point it 
at Postgres's temp tablespace directory, but this bypasses Postgres's resource 
accounting entirely. `temp_file_limit` won't be enforced, VFD descriptor limits 
won't be respected, and files won't be cleaned up through Postgres's resource 
owner on transaction abort.
   
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to