tustvold opened a new issue, #7994: URL: https://github.com/apache/arrow-datafusion/issues/7994
### Is your feature request related to a problem or challenge?

Currently we accommodate streaming workloads within DataFusion by overloading the file IO abstractions. This is not always a very good fit and results in a number of workarounds:

* Providing `PartitionedFile` information to methods that write new files via `FileSinkConfig` - https://docs.rs/datafusion/latest/datafusion/datasource/physical_plan/struct.FileSinkConfig.html#structfield.file_groups
* Passing `unbounded_input` to parallel write logic - https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/src/datasource/file_format/write/orchestration.rs#L79
* Special logic to handle reading files that aren't regular files - https://github.com/apache/arrow-datafusion/pull/7282#discussion_r1294019448
* An `ObjectStore::append` method that doesn't naturally fit with the design goals of the project to be modelled off object stores and not filesystems
* And more...

As DataFusion gets more sophisticated about handling catalogs and reading/writing partitioned data, this overloading is getting more and more arcane and hard to reason about, and I think it is overdue that we do something to address it.

### Describe the solution you'd like

I would like to separate the notions of `FileSink` and `FileScan` from a `StreamSink` and `StreamSource`. This would allow abstractions that better fit their respective use-cases.
In particular:

* `FileSink` and `FileScan` can focus on reading/writing partitioned immutable files following standard big data practices
* `StreamSink` and `StreamSource` can focus on reading/writing CSV / JSON (/ Avro) data from streaming sources

Not only would this simplify the current code, but it would also expand the streaming support in DataFusion:

* Allows for more efficient non-blocking IO, as Linux FIFOs support poll(2) (unlike regular files)
* Potential integrations with data streaming systems

### Describe alternatives you've considered

_No response_

### Additional context

_No response_
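
To make the proposed separation a little more concrete, here is a minimal sketch of what stream-oriented traits might look like. Only the names `StreamSource` and `StreamSink` come from the proposal. The method signatures, `LineSource`, `LineSink`, and `pump` are hypothetical and are not part of DataFusion. The sketch uses blocking `std::io` for brevity, whereas a real design would presumably be async so it can take advantage of poll(2) on FIFOs.

```rust
use std::io::{self, BufRead, Write};

/// Hypothetical trait: an append-only source of records.
/// Unlike a file scan, it has no notion of file size, byte ranges,
/// or partitioned files.
trait StreamSource {
    /// Returns the next record, or `None` once the stream is closed.
    fn next_record(&mut self) -> io::Result<Option<String>>;
}

/// Hypothetical trait: a sink that accepts records one at a time,
/// without needing to know about existing file groups.
trait StreamSink {
    fn write_record(&mut self, record: &str) -> io::Result<()>;
    fn flush(&mut self) -> io::Result<()>;
}

/// Newline-delimited source over any `BufRead` (a FIFO, socket, stdin, ...).
struct LineSource<R: BufRead> {
    reader: R,
}

impl<R: BufRead> StreamSource for LineSource<R> {
    fn next_record(&mut self) -> io::Result<Option<String>> {
        let mut line = String::new();
        // `read_line` returns 0 bytes only at end of stream.
        if self.reader.read_line(&mut line)? == 0 {
            return Ok(None);
        }
        // Strip the trailing line terminator, if any.
        let record = line.trim_end_matches(|c| c == '\r' || c == '\n');
        Ok(Some(record.to_string()))
    }
}

/// Newline-delimited sink over any `Write`.
struct LineSink<W: Write> {
    writer: W,
}

impl<W: Write> StreamSink for LineSink<W> {
    fn write_record(&mut self, record: &str) -> io::Result<()> {
        self.writer.write_all(record.as_bytes())?;
        self.writer.write_all(b"\n")
    }

    fn flush(&mut self) -> io::Result<()> {
        self.writer.flush()
    }
}

/// Copies records from a source to a sink until the source is closed,
/// returning the number of records written.
fn pump(source: &mut dyn StreamSource, sink: &mut dyn StreamSink) -> io::Result<usize> {
    let mut count = 0;
    while let Some(record) = source.next_record()? {
        sink.write_record(&record)?;
        count += 1;
    }
    sink.flush()?;
    Ok(count)
}
```

Neither trait mentions `PartitionedFile`, file groups, or an append operation on an object store. That is the kind of simplification the separation is intended to allow.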
