[I] Decouple Streaming Use-Case from File IO Abstractions [arrow-datafusion]

via GitHub Mon, 30 Oct 2023 16:29:24 -0700


tustvold opened a new issue, #7994:
URL: https://github.com/apache/arrow-datafusion/issues/7994


   ### Is your feature request related to a problem or challenge?
   
   Currently we accommodate streaming workloads within DataFusion by 
overloading the file IO abstractions. 
   
   This is not always a very good fit and results in a number of workarounds:
   
   * Providing PartitionedFile information to methods that write new files via 
FileSinkConfig - 
https://docs.rs/datafusion/latest/datafusion/datasource/physical_plan/struct.FileSinkConfig.html#structfield.file_groups
   * Passing unbounded_input to parallel write logic - 
https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/src/datasource/file_format/write/orchestration.rs#L79
   * Special logic to handle reading files that aren't regular files - 
https://github.com/apache/arrow-datafusion/pull/7282#discussion_r1294019448
   * An ObjectStore::append method that doesn't naturally fit the project's 
design goal of being modelled on object stores rather than filesystems
   * And more...
   
   As DataFusion gets more sophisticated about handling catalogs and 
reading/writing partitioned data, this overloading is becoming increasingly 
arcane and hard to reason about, and I think it is overdue that we address it.
   
   
   ### Describe the solution you'd like
   
   I would like to separate the notions of FileSink and FileScan from those 
of StreamSink and StreamSource. This would allow abstractions that better fit 
their respective use-cases.
   
   In particular:
   
   * FileSink and FileScan can focus on reading/writing partitioned immutable 
files following standard big data practices
   * StreamSink and StreamSource can focus on reading/writing CSV / JSON (/ 
Avro) data from streaming sources
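   
   To make the proposed split more concrete, here is a minimal sketch of what 
the stream-side abstractions could look like. Everything in it is an 
assumption for illustration: the trait names reuse the StreamSource / 
StreamSink terms from above, but the methods (next_record, write_record), the 
LineSource / LineSink helpers and the pipe function are hypothetical and are 
not DataFusion APIs. It uses only the standard library and treats each 
newline-delimited line as one record.

```rust
use std::io::{self, BufRead, BufReader, Read, Write};

/// Hypothetical trait: yields records one at a time from a possibly
/// unbounded input. Note there is no notion of files or partitions here.
pub trait StreamSource {
    /// Returns the next record, or `None` once the stream is closed.
    fn next_record(&mut self) -> io::Result<Option<String>>;
}

/// Hypothetical trait: accepts records one at a time.
pub trait StreamSink {
    fn write_record(&mut self, record: &str) -> io::Result<()>;
}

/// Illustrative source that reads newline-delimited records (e.g. CSV or
/// JSON lines) from any `Read`, such as a FIFO or a socket.
pub struct LineSource<R: Read> {
    reader: BufReader<R>,
}

impl<R: Read> LineSource<R> {
    pub fn new(inner: R) -> Self {
        Self { reader: BufReader::new(inner) }
    }
}

impl<R: Read> StreamSource for LineSource<R> {
    fn next_record(&mut self) -> io::Result<Option<String>> {
        let mut line = String::new();
        // `read_line` returns 0 bytes only at end of stream.
        if self.reader.read_line(&mut line)? == 0 {
            return Ok(None);
        }
        // Strip the trailing line terminator, if any.
        let record = line.trim_end_matches(&['\r', '\n'][..]).to_string();
        Ok(Some(record))
    }
}

/// Illustrative sink that writes one record per line to any `Write`.
pub struct LineSink<W: Write> {
    writer: W,
}

impl<W: Write> LineSink<W> {
    pub fn new(writer: W) -> Self {
        Self { writer }
    }

    /// Hands back the underlying writer.
    pub fn into_inner(self) -> W {
        self.writer
    }
}

impl<W: Write> StreamSink for LineSink<W> {
    fn write_record(&mut self, record: &str) -> io::Result<()> {
        self.writer.write_all(record.as_bytes())?;
        self.writer.write_all(b"\n")
    }
}

/// Copies records from a source to a sink until the source is closed,
/// returning the number of records copied.
pub fn pipe<S: StreamSource, K: StreamSink>(
    source: &mut S,
    sink: &mut K,
) -> io::Result<usize> {
    let mut copied = 0;
    while let Some(record) = source.next_record()? {
        sink.write_record(&record)?;
        copied += 1;
    }
    Ok(copied)
}
```

   The point of the sketch is what is absent: neither trait mentions 
PartitionedFile, file groups or an unbounded_input flag, because a stream has 
no partitioned, immutable files to describe. A real design would presumably 
produce Arrow record batches asynchronously rather than strings; that is left 
out here to keep the example short.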
   
   Not only would this simplify the current code, it would also expand the 
streaming support in DataFusion:
   
   * It allows for more efficient non-blocking IO, as Linux FIFOs support 
poll(2) (unlike regular files)
   * It enables potential integrations with data streaming systems
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   _No response_


