alamb commented on issue #7994:
URL: 
https://github.com/apache/arrow-datafusion/issues/7994#issuecomment-1806831012

   I have read [the 
document](https://synnada.notion.site/Decoupling-Streaming-Scenarios-from-the-ListingTable-Paradigm-98bd168baa8b497a9396dc0de7d1a29c?pvs=4)
 and I really like it. Thank you @metesynnada.
   
   ## Use Case 
   My understanding is that this exercise is not primarily about the 
specifics of decoupling append/FIFO from `ListingTable`, but about the more 
general goal of easily using DataFusion to build systems with functionality 
similar to Apache Flink, as described in 
https://github.com/apache/arrow-datafusion/issues/4285
   
   In particular, the goal is to make it easy to:
   1. Read data from "streaming" sources
   2. Write data to "streaming" sinks 
   
   Where "streaming" means the data is NOT composed of a fixed set of immutable 
objects (as in object storage), but rather some number of "streams" which can 
be appended / consumed. Some examples are FIFO files, Kafka, RabbitMQ, Kinesis, 
etc.
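
   As a rough illustration of that distinction, here is a minimal sketch using 
only the Rust standard library. The names `ObjectSetSource` and `AppendStream` 
are hypothetical and exist only for this example; they are not DataFusion or 
`object_store` APIs, and a real source would deal in record batches rather than 
`Vec<i64>`.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

/// Hypothetical: a source backed by a fixed set of immutable objects,
/// analogous to files listed from object storage.
struct ObjectSetSource {
    objects: Vec<Vec<i64>>,
}

impl ObjectSetSource {
    /// The whole input is known up front, so a scan can enumerate all of it.
    fn scan(&self) -> Vec<i64> {
        self.objects.iter().flatten().copied().collect()
    }
}

/// Hypothetical: a stream that producers append to and a consumer drains.
struct AppendStream {
    rx: Receiver<Vec<i64>>,
}

impl AppendStream {
    /// Returns the producer handle together with the consumer side.
    fn new() -> (Sender<Vec<i64>>, AppendStream) {
        let (tx, rx) = channel();
        (tx, AppendStream { rx })
    }

    /// Drain whatever batches have arrived so far, without blocking.
    fn poll_available(&self) -> Vec<i64> {
        self.rx.try_iter().flatten().collect()
    }
}

fn main() {
    // Fixed set of objects: one scan sees everything there will ever be.
    let listing = ObjectSetSource { objects: vec![vec![1, 2], vec![3]] };
    println!("{:?}", listing.scan());

    // Appendable stream: each poll sees only what has been appended so far.
    let (tx, stream) = AppendStream::new();
    tx.send(vec![10, 20]).unwrap();
    println!("{:?}", stream.poll_available());
    tx.send(vec![30]).unwrap();
    println!("{:?}", stream.poll_available());
}
```

   The point of the sketch is only that the first source has a known, finite 
extent while the second can keep growing after a consumer has started reading.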
   
   One of the major challenges in trying to implement such streaming systems 
today is that `ListingTable`, as its name implies, is implemented assuming the 
data is being read from or written to an object store, with both the benefits 
and limitations of the `object_store` API.
   
   ## Design Feedback
   
   I really like the idea of creating a `StreamingTable` parallel to 
`ListingTable` that is backed by a different API than object store and that can 
be used to build such streaming use cases. It is a much better match for:
   1. The differences between reading from a streaming source and reading 
immutable discrete objects in an object store
   2. The differences between writing data out in a streaming fashion and 
writing immutable discrete objects to an object store
   
   ### Requests / ideas
   
   As we expand DataFusion in this streaming direction, I would like to request 
that we take this opportunity to define some more crate boundaries so that 
`ListingTable` is not so tightly integrated / intertwined, and likewise 
`StreamingTable` is not so intertwined -- perhaps we can aim to end up with 
three new crates: `datafusion_listing_table`, `datafusion_streaming_table`, and 
`datafusion_data_format`.
   
   
   ## Next steps
   Would it be possible to create a PoC / proposal with the basic APIs and make 
sure they fit together and into the rest of DataFusion? This is likely to be a 
large change, so I think getting the skeleton in place and then filling out the 
details in subsequent PRs (rather than one massive one) would be my preferred 
process. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
