hntd187 opened a new issue #1544:
URL: https://github.com/apache/arrow-datafusion/issues/1544


   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   Consume a streaming source from Kafka (to start). DataFusion already has a 
batch-oriented processing framework that works pretty well. I want to extend 
this to also be able to consume from streaming sources such as a Kafka topic.
   
   **Describe the solution you'd like**
   So I think it makes sense to start with what Spark Streaming did for its 
streaming implementation, which is the idea of micro-batching.
   
   ![Blank 
diagram](https://user-images.githubusercontent.com/6778339/148859801-d020ea6e-0927-4d7c-a35b-f0ab8f50df25.png)
   Generally, the idea pictured above is that DataFusion will listen to a topic 
for some period of time (defined at startup), then execute operations on that 
collected batch window of events. In the case of Kafka, these events are 
normally either JSON or Avro, both formats that DataFusion can already decode.
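   As a rough illustration of that micro-batching loop (this is not DataFusion code; `Event` and `collect_micro_batch` are made-up names for the sketch):

   ```rust
   use std::time::{Duration, Instant};

   // Hypothetical event type standing in for a decoded Kafka record.
   #[derive(Debug, Clone)]
   struct Event {
       payload: String,
   }

   // Collect events from a source for a fixed window, then hand the whole
   // batch off for processing. The source is modeled as an iterator here;
   // in reality it would be a Kafka consumer.
   fn collect_micro_batch<I: Iterator<Item = Event>>(
       source: &mut I,
       window: Duration,
   ) -> Vec<Event> {
       let deadline = Instant::now() + window;
       let mut batch = Vec::new();
       while Instant::now() < deadline {
           match source.next() {
               Some(event) => batch.push(event),
               None => break, // source exhausted before the window closed
           }
       }
       batch
   }

   fn main() {
       let mut source = (0..3).map(|i| Event { payload: format!("msg-{i}") });
       let batch = collect_micro_batch(&mut source, Duration::from_millis(10));
       println!("collected {} events", batch.len());
   }
   ```

   Each collected batch would then flow through the existing batch execution machinery unchanged; the streaming part is only the outer loop.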
   
   I spent some time looking at the types in the data source module and came to 
the conclusion that it would probably be possible to implement this on top of 
the current API, but frankly, it would suck. The source traits all have a 
naming convention centered around tables and files, and a Kafka topic is 
technically neither. Basically, what I am saying is that an implementation here 
would be highly confusing to anyone trying to understand why things are named 
what they are. I propose we add a set of additional traits specifically for 
streaming sources. These would map to an execution plan like the other data 
sources, but should have the ability to manage stream information such as 
committing offsets, checkpointing, and watermarking. Those are probably 
secondary things to come a bit after a "get the thing to work" implementation, 
but I wanted to put it out there that these traits would initially look rather 
bare and not differ much from the other data sources. They would, though, 
quickly diverge from those contracts into ones that support managing these 
stream operations. 
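   To make the proposal concrete, here is a minimal sketch of what such a trait might look like. Everything below (`StreamSource`, `StreamOffset`, `VecSource`, the method names) is hypothetical, not an existing DataFusion API:

   ```rust
   // A committed position in the stream, e.g. a Kafka partition offset.
   #[derive(Debug, Clone, Copy, PartialEq)]
   struct StreamOffset(u64);

   // A streaming-specific source contract, deliberately separate from the
   // table/file-oriented traits. Records are plain strings here; a real
   // version would produce Arrow record batches.
   trait StreamSource {
       /// Pull the next micro-batch of records, if any are available.
       fn next_batch(&mut self) -> Option<Vec<String>>;

       /// Commit progress back to the source (e.g. Kafka consumer offsets).
       fn commit(&mut self, offset: StreamOffset);

       /// The last committed offset, for checkpoint/restart.
       fn committed(&self) -> Option<StreamOffset>;
   }

   // An in-memory stand-in implementation, just to show the contract.
   struct VecSource {
       records: Vec<String>,
       pos: usize,
       committed: Option<StreamOffset>,
       batch_size: usize,
   }

   impl StreamSource for VecSource {
       fn next_batch(&mut self) -> Option<Vec<String>> {
           if self.pos >= self.records.len() {
               return None;
           }
           let end = (self.pos + self.batch_size).min(self.records.len());
           let batch = self.records[self.pos..end].to_vec();
           self.pos = end;
           Some(batch)
       }

       fn commit(&mut self, offset: StreamOffset) {
           self.committed = Some(offset);
       }

       fn committed(&self) -> Option<StreamOffset> {
           self.committed
       }
   }

   fn main() {
       let mut src = VecSource {
           records: (0..5).map(|i| format!("r{i}")).collect(),
           pos: 0,
           committed: None,
           batch_size: 2,
       };
       let mut total = 0;
       while let Some(batch) = src.next_batch() {
           total += batch.len();
           src.commit(StreamOffset(total as u64));
       }
       println!("{total}"); // prints 5
   }
   ```

   The point is just that commit/checkpoint state lives on the streaming trait itself, kept apart from the table- and file-oriented contracts.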
   
   What I am not sure about is that while these types should likely live in 
DataFusion, the actual implementation probably should not. It could at least 
start as a contrib module and maybe be promoted into the main repo eventually 
if that makes sense.
   
   Does this make sense? I can start a draft PR soon to get the ball rolling on 
turning this discussion into actual code.
   
   **Describe alternatives you've considered**
   I dunno, I could play more video games or something, but I'd rather do this. 
Joking aside, Flock has streaming capabilities, but it's mostly built around 
cloud offerings rather than a long-running process like a Spark Streaming job.
   
   **Additional context**
   @alamb any additional ideas? How do you wanna get started?
   

