hntd187 opened a new issue #1544:
URL: https://github.com/apache/arrow-datafusion/issues/1544


   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   Consume a streaming source from Kafka (to start). DataFusion already has a 
batch-oriented processing framework that works pretty well. I want to extend 
this to also be able to consume from streaming sources such as a Kafka topic.
   
   **Describe the solution you'd like**
   So I think it makes sense to start with what Spark Streaming did for its 
streaming implementation, which is the idea of micro-batching.
   
   ![Blank 
diagram](https://user-images.githubusercontent.com/6778339/148859801-d020ea6e-0927-4d7c-a35b-f0ab8f50df25.png)
   Generally, the idea pictured above is that DataFusion will listen to a topic 
for some period of time (defined at startup), then execute operations on that 
collected batch window of events. In the case of Kafka, these events are 
normally either JSON or Avro, both formats that DataFusion can already decode.
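   As a rough illustration of that micro-batching loop (this is not DataFusion code; `Event` and `collect_micro_batch` are made-up names for the sketch):

   ```rust
   use std::time::{Duration, Instant};

   // Hypothetical event type standing in for a decoded Kafka record.
   #[derive(Debug, Clone)]
   struct Event {
       payload: String,
   }

   // Collect events from a source for a fixed window, then hand the whole
   // batch off for processing. The source is modeled as an iterator here;
   // in reality it would be a Kafka consumer.
   fn collect_micro_batch<I: Iterator<Item = Event>>(
       source: &mut I,
       window: Duration,
   ) -> Vec<Event> {
       let deadline = Instant::now() + window;
       let mut batch = Vec::new();
       while Instant::now() < deadline {
           match source.next() {
               Some(event) => batch.push(event),
               None => break, // source exhausted before the window closed
           }
       }
       batch
   }

   fn main() {
       let mut source = (0..3).map(|i| Event { payload: format!("msg-{i}") });
       let batch = collect_micro_batch(&mut source, Duration::from_millis(10));
       println!("collected {} events", batch.len());
   }
   ```

   Each collected batch would then flow through the existing batch execution machinery unchanged; the streaming part is only the outer loop.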
   
   I spent some time looking at the types in the data source module and came to 
the conclusion that it would probably be possible to implement this on top of 
the current API, but frankly, it would suck. The source traits all have a 
naming convention centered around tables and files, and a Kafka topic is 
technically neither. Basically, what I am saying is that an implementation here 
would be highly confusing to anyone trying to understand why things are named 
what they are. I propose we add a set of additional traits specifically for 
streaming sources. These would map to an execution plan like the other data 
sources, but should have the ability to manage stream information such as 
committing offsets, checkpointing, and watermarking. Those are probably 
secondary things to come a bit after a "get the thing to work" implementation, 
but I wanted to put it out there that these traits would initially look rather 
bare and not differ much from the other data sources. They would, though, 
quickly diverge from those contracts into ones that support managing these 
stream operations. 
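   To make the proposal concrete, here is a minimal sketch of what such a trait might look like. Everything below (`StreamSource`, `StreamOffset`, `VecSource`, the method names) is hypothetical, not an existing DataFusion API:

   ```rust
   // A committed position in the stream, e.g. a Kafka partition offset.
   #[derive(Debug, Clone, Copy, PartialEq)]
   struct StreamOffset(u64);

   // A streaming-specific source contract, deliberately separate from the
   // table/file-oriented traits. Records are plain strings here; a real
   // version would produce Arrow record batches.
   trait StreamSource {
       /// Pull the next micro-batch of records, if any are available.
       fn next_batch(&mut self) -> Option<Vec<String>>;

       /// Commit progress back to the source (e.g. Kafka consumer offsets).
       fn commit(&mut self, offset: StreamOffset);

       /// The last committed offset, for checkpoint/restart.
       fn committed(&self) -> Option<StreamOffset>;
   }

   // An in-memory stand-in implementation, just to show the contract.
   struct VecSource {
       records: Vec<String>,
       pos: usize,
       committed: Option<StreamOffset>,
       batch_size: usize,
   }

   impl StreamSource for VecSource {
       fn next_batch(&mut self) -> Option<Vec<String>> {
           if self.pos >= self.records.len() {
               return None;
           }
           let end = (self.pos + self.batch_size).min(self.records.len());
           let batch = self.records[self.pos..end].to_vec();
           self.pos = end;
           Some(batch)
       }

       fn commit(&mut self, offset: StreamOffset) {
           self.committed = Some(offset);
       }

       fn committed(&self) -> Option<StreamOffset> {
           self.committed
       }
   }

   fn main() {
       let mut src = VecSource {
           records: (0..5).map(|i| format!("r{i}")).collect(),
           pos: 0,
           committed: None,
           batch_size: 2,
       };
       let mut total = 0;
       while let Some(batch) = src.next_batch() {
           total += batch.len();
           src.commit(StreamOffset(total as u64));
       }
       println!("{total}"); // prints 5
   }
   ```

   The point is just that commit/checkpoint state lives on the streaming trait itself, kept apart from the table- and file-oriented contracts.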
   
   What I am not sure about is that while these types should likely live in 
DataFusion, the actual implementation probably should not. It could at least 
start as a contrib module and maybe be promoted into the main repo eventually 
if that makes sense.
   
   Does this make sense? I can start a draft PR soon to get the ball rolling on 
turning this discussion into actual code.
   
   **Describe alternatives you've considered**
   I dunno, I could play more video games or something, but I'd rather do this. 
Joking aside, Flock has streaming capabilities, but it's mostly built around 
cloud offerings rather than a long-running process like a Spark Streaming job.
   
   **Additional context**
   @alamb any additional ideas? How do you wanna get started?
   

