Hi Pedro,
I think the answer is: it likely depends.  The main trade-off in
using Arrow in a streaming process is the high relative metadata
overhead when record batches contain very few rows.  There have been
prior discussions on the mailing list about row-based formats and
streaming [1][2] that expand on these trade-offs and might be useful.
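
To make that trade-off concrete, here is a rough sketch (Python with
pyarrow) of micro-batching messages from a source like Kafka into
Arrow record batches.  The schema, field names, batch size, and the
on_message hook are all made up for illustration; the point is that
larger batches amortize the per-batch metadata at the cost of
latency:

import pyarrow as pa

# Hypothetical two-field schema, purely for illustration.
schema = pa.schema([("key", pa.string()), ("value", pa.float64())])
BATCH_SIZE = 1024  # larger batches amortize per-batch metadata;
                   # smaller batches reduce latency

sink = pa.BufferOutputStream()
writer = pa.ipc.new_stream(sink, schema)
buffered = []

def on_message(msg):
    # msg is assumed to be a dict like {"key": ..., "value": ...}.
    buffered.append(msg)
    if len(buffered) >= BATCH_SIZE:
        batch = pa.RecordBatch.from_pydict(
            {"key": [m["key"] for m in buffered],
             "value": [m["value"] for m in buffered]},
            schema=schema)
        writer.write_batch(batch)
        buffered.clear()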

For some additional color: Brian Hulette gave a talk [3] a while ago
about potentially using Arrow within Beam (I believe Flink has a high
overlap with the Beam API) and some of the challenges involved.  It
also looks like there was a Flink JIRA (that you might already be
on?) about using Arrow directly in Flink and some of the
trade-offs [4].

The questions you posed are a little vague; if you can share more
context, it might help make the conversation more productive.

-Micah

[1]
https://lists.apache.org/thread.html/33a4e1a272e77d4959c851481aa25c6e4aa870db172e4c1bbf2e3a35%40%3Cdev.arrow.apache.org%3E
[2]
https://lists.apache.org/thread.html/27945533db782361143586fd77ca08e15e96e2f2a5250ff084b462d6%40%3Cdev.arrow.apache.org%3E
[3] https://www.youtube.com/watch?v=avy1ifTZlhE
[4] https://issues.apache.org/jira/browse/FLINK-10929


On Fri, Sep 4, 2020 at 12:39 AM Pedro Silva <pedro.cl...@gmail.com> wrote:

> Hello,
>
> This may be a stupid question, but is Arrow used for, or designed
> with, stream-processing use-cases in mind, where data is
> non-stationary (e.g., Flink stream-processing jobs)?
>
> Particularly, is it possible from a given event source (say Kafka) to
> efficiently generate incremental record batches for stream processing?
>
> Suppose there is a data source that continuously generates messages
> with 100+ fields. You want to compute grouped aggregations (sums,
> averages, count distinct, etc.) over a select few of those fields,
> say at most 5 fields used across all queries.
>
> Is this a valid use-case for Arrow?
> What if time is important and some windowing technique has to be applied?
>
> Thank you very much for your time!
> Have a good day.
>
