pantShrey commented on issue #21215:
URL: https://github.com/apache/datafusion/issues/21215#issuecomment-4189614153

   Hey @alamb, I’m planning to work on this issue and would really appreciate 
your guidance on a few questions I have.
   
   Following up on my proposal and looking at the implementation, I noticed `IPCStreamWriter` hardcodes `StreamWriter<File>` even though Arrow's `StreamWriter` is already generic over `W: Write`. Additionally, I saw that the codebase frequently passes `RefCountedTempFile` around and uses `.path()` to spawn new readers on demand (e.g., `File::open(spill_file.path())` in `bitwise_stream.rs`).
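   To illustrate what the generic bound buys us, here is a minimal pure-std sketch (names like `GenericStreamWriter` are hypothetical stand-ins, not DataFusion's actual types): the same writer code path can target a file, an in-memory buffer, or any future spill backend without changes.

   ```rust
   use std::io::Write;

   // Hypothetical stand-in for a writer that is generic over its sink
   // instead of hardcoding File.
   struct GenericStreamWriter<W: Write> {
       inner: W,
       bytes_written: usize,
   }

   impl<W: Write> GenericStreamWriter<W> {
       fn new(inner: W) -> Self {
           Self { inner, bytes_written: 0 }
       }

       // Writes one opaque payload and tracks how many bytes went out.
       fn write_batch(&mut self, payload: &[u8]) -> std::io::Result<()> {
           self.inner.write_all(payload)?;
           self.bytes_written += payload.len();
           Ok(())
       }
   }

   fn main() -> std::io::Result<()> {
       // The same writer works against an in-memory Vec<u8>...
       let mut mem = GenericStreamWriter::new(Vec::new());
       mem.write_batch(b"batch-0")?;
       // ...or a File, or any other `Write` impl, without code changes.
       println!("{}", mem.bytes_written);
       Ok(())
   }
   ```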
   
   To abstract away the filesystem, I was considering introducing a `SpillFile` 
trait with `open_write()` and `open_read()` methods, allowing backends to 
generate I/O handles lazily without relying on file paths.
   
   Does this approach seem reasonable, or did you have a different abstraction 
in mind? To make this concrete, here is the architecture I'm thinking of before 
I open a draft PR:
   ```rust
   pub trait SpillFile: Send + Sync {
       fn size(&self) -> u64;
       /// `Some` only for OS-backed files; kept for debugging and observability.
       fn path(&self) -> Option<&Path> { None }
       fn open_write(&self) -> Result<Box<dyn std::io::Write + Send>>;
       fn open_read(&self) -> Result<Box<dyn std::io::Read + Send>>;
   }

   pub trait TempFileFactory: Send + Sync {
       fn create_temp_file(&self, request_description: &str) -> Result<std::sync::Arc<dyn SpillFile>>;
   }
   ```
   Planned changes:
   - `RefCountedTempFile` stays as-is but implements `SpillFile`.
   - Add a `Custom(Arc<dyn TempFileFactory>)` variant to `DiskManagerMode`.
   - Change `IPCStreamWriter` to accept `Box<dyn Write + Send>` instead of `File`.
   - Update scattered callers currently doing `File::open(spill_file.path())` to use `spill_file.open_read()`.
   
   I also had a question regarding disk quota tracking:
   Currently, `update_disk_usage()` updates DataFusion's global tracker to enforce `max_temp_directory_size`. If a custom factory (e.g., backed by Postgres or S3) is used, how should this be handled?
   - Option A: Add `update_disk_usage()` to the `SpillFile` trait, so custom backends are forced into DataFusion's tracker.
   - Option B: Bypass DataFusion's tracking for `Custom` backends, assuming the custom backend enforces its own limits (e.g., Postgres's `temp_file_limit`) while DataFusion only tracks OS-level files.
   
   Do you have a preference here, or any guidance on how this should be handled?
   Thanks again for your time!
   

