pantShrey commented on issue #21215:
URL: https://github.com/apache/datafusion/issues/21215#issuecomment-4189614153
Hey @alamb, I’m planning to work on this issue and would really appreciate
your guidance on a few questions I have.
Following up on my proposal and looking at the implementation, I noticed that
`IPCStreamWriter` hardcodes `StreamWriter<File>`, even though Arrow's
`StreamWriter` is already generic over `W: Write`. Additionally, I saw that the
codebase frequently passes `RefCountedTempFile` around and uses `.path()` to
spawn new readers on demand (e.g., `File::open(spill_file.path())` in
bitwise_stream.rs). To abstract away the filesystem, I was considering
introducing a `SpillFile` trait with `open_write()` and `open_read()` methods,
allowing backends to produce I/O handles lazily without relying on file paths.
Does this approach seem reasonable, or did you have a different abstraction
in mind? To make this concrete, here is the architecture I'm thinking of before
I open a draft PR:
```rust
pub trait SpillFile: Send + Sync {
    fn size(&self) -> u64;

    /// Kept for debugging and observability of OS-file backends.
    fn path(&self) -> Option<&Path> {
        None
    }

    fn open_write(&self) -> Result<Box<dyn std::io::Write + Send>>;
    fn open_read(&self) -> Result<Box<dyn std::io::Read + Send>>;
}

pub trait TempFileFactory: Send + Sync {
    fn create_temp_file(&self, request_description: &str)
        -> Result<std::sync::Arc<dyn SpillFile>>;
}
```
Planned changes:
- `RefCountedTempFile` stays as-is but implements `SpillFile`.
- Add a `Custom(Arc<dyn TempFileFactory>)` variant to `DiskManagerMode`.
- Change `IPCStreamWriter` to accept `Box<dyn Write + Send>` instead of `File`.
- Update the scattered callers currently doing `File::open(spill_file.path())`
  to use `spill_file.open_read()`.
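To illustrate the kind of decoupling I'm after, here is a toy in-memory backend for the proposed trait. It is only a sketch under the trait signatures above; `InMemorySpillFile` and `InMemoryWriter` are hypothetical names of mine, not anything in DataFusion:

```rust
use std::io::{Cursor, Read, Result, Write};
use std::path::Path;
use std::sync::{Arc, Mutex};

// Proposed trait (from the sketch above).
pub trait SpillFile: Send + Sync {
    fn size(&self) -> u64;
    fn path(&self) -> Option<&Path> {
        None // no filesystem path for non-OS-file backends
    }
    fn open_write(&self) -> Result<Box<dyn Write + Send>>;
    fn open_read(&self) -> Result<Box<dyn Read + Send>>;
}

// Hypothetical backend: spilled bytes live in a shared buffer, so
// readers can be spawned lazily without ever touching the filesystem.
#[derive(Default)]
pub struct InMemorySpillFile {
    buf: Arc<Mutex<Vec<u8>>>,
}

struct InMemoryWriter {
    buf: Arc<Mutex<Vec<u8>>>,
}

impl Write for InMemoryWriter {
    fn write(&mut self, data: &[u8]) -> Result<usize> {
        self.buf.lock().unwrap().extend_from_slice(data);
        Ok(data.len())
    }
    fn flush(&mut self) -> Result<()> {
        Ok(())
    }
}

impl SpillFile for InMemorySpillFile {
    fn size(&self) -> u64 {
        self.buf.lock().unwrap().len() as u64
    }
    fn open_write(&self) -> Result<Box<dyn Write + Send>> {
        Ok(Box::new(InMemoryWriter { buf: Arc::clone(&self.buf) }))
    }
    fn open_read(&self) -> Result<Box<dyn Read + Send>> {
        // Each reader gets an independent snapshot of the bytes.
        let snapshot = self.buf.lock().unwrap().clone();
        Ok(Box::new(Cursor::new(snapshot)))
    }
}

fn main() -> Result<()> {
    let spill = InMemorySpillFile::default();
    spill.open_write()?.write_all(b"spilled batch")?;
    let mut out = String::new();
    spill.open_read()?.read_to_string(&mut out)?;
    assert_eq!(out, "spilled batch");
    assert_eq!(spill.size(), 13);
    Ok(())
}
```

The point of the sketch is that callers only ever see `open_read()`/`open_write()`, so swapping the backend never requires a `path()`.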
I also had a question regarding disk quota tracking:
Currently, `update_disk_usage()` updates DataFusion's global tracker to
enforce `max_temp_directory_size`. If we use a custom factory (e.g., one backed
by Postgres or S3), how should we handle this?
- Option A: Add `update_disk_usage()` to the `SpillFile` trait, so custom
  backends are forced into DataFusion's tracker.
- Option B: Bypass DataFusion's tracking for `Custom` backends. We assume the
  custom backend enforces its own limits (e.g., Postgres's `temp_file_limit`),
  and DataFusion only tracks OS-level files.
Do you have a preference here, or any guidance on how this should be handled?
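For Option A, the shared tracking I have in mind would look roughly like this. This is only a sketch; `DiskUsageTracker` and `grow()` are hypothetical names, not the existing DataFusion API:

```rust
use std::io::{Error, ErrorKind, Result};
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

// Hypothetical global tracker handed to every backend, OS files and
// custom ones alike, so all spill bytes count against one limit.
pub struct DiskUsageTracker {
    used: AtomicU64,
    limit: u64, // analogous to max_temp_directory_size
}

impl DiskUsageTracker {
    pub fn new(limit: u64) -> Arc<Self> {
        Arc::new(Self { used: AtomicU64::new(0), limit })
    }

    /// Called by a backend after it writes `bytes` more data;
    /// rolls the reservation back and errors if the limit is exceeded.
    pub fn grow(&self, bytes: u64) -> Result<()> {
        let prev = self.used.fetch_add(bytes, Ordering::SeqCst);
        if prev + bytes > self.limit {
            self.used.fetch_sub(bytes, Ordering::SeqCst);
            return Err(Error::new(
                ErrorKind::Other,
                "temp storage limit exceeded",
            ));
        }
        Ok(())
    }
}

fn main() {
    let tracker = DiskUsageTracker::new(100);
    assert!(tracker.grow(60).is_ok());
    assert!(tracker.grow(60).is_err()); // would exceed the 100-byte limit
    assert!(tracker.grow(40).is_ok()); // rollback above left room
}
```

Under Option B this tracker would simply never be passed to `Custom` backends.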
Thanks again for your time!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]