alamb commented on issue #7994: URL: https://github.com/apache/arrow-datafusion/issues/7994#issuecomment-1806831012
I have read [the document](https://synnada.notion.site/Decoupling-Streaming-Scenarios-from-the-ListingTable-Paradigm-98bd168baa8b497a9396dc0de7d1a29c?pvs=4) and I really like it. Thank you @metesynnada.

## Use Case

My understanding is that the goal of this exercise is not primarily about the specifics of decoupling append/FIFO from ListingTable, but the more general goal of easily using DataFusion to build systems with functionality similar to Apache Flink, as described in https://github.com/apache/arrow-datafusion/issues/4285

In particular the goal is to make it easy to

1. Read data from "streaming" sources
2. Write data to "streaming" sinks

Where "streaming" means the data is NOT composed of a fixed set of immutable objects (as in object storage), but rather some number of "streams" which can be appended / consumed. Some examples are FIFO files, Kafka, RabbitMQ, Kinesis, etc.

One of the major challenges in trying to implement such streaming systems today is that `ListingTable`, as its name implies, is implemented assuming the data is being read from or written to an object store, with both the benefits and limitations of the object_store API.

## Design Feedback

I really like the idea to create a parallel `StreamingTable` to `ListingTable` that is backed by a different API than object store and that can be used to build such streaming use cases. It much better matches:

1. The differences in reading from a stream vs immutable discrete objects in object store
2. The difference in writing data out in a streaming fashion compared to immutable discrete objects in object store

### Requests / ideas

As we expand DataFusion in this streaming direction, I would like to request we take this opportunity to define some more crate boundaries so the `ListingTable` is not so tightly integrated / intertwined, and likewise `StreamingTable` is not so intertwined. Perhaps we can aim to end up with three new crates: `datafusion_listing_table`, `datafusion_streaming_table`, and `datafusion_data_format`.

## Next steps:

Would it be possible to create a PoC / proposal with the basic APIs and make sure they fit together and into the rest of DataFusion? This is likely to be a large change, so I think getting the skeleton in place and then filling out the details in subsequent PRs (rather than one massive one) would be my preferred process.
