Hi Micah,

Thank you for your reply and the links; the threads were quite interesting.
You are right: I opened the Flink issue regarding Arrow support to
understand whether it was on their roadmap.

My use-case is processing a stream of events (rows, if you will) to
compute ~100-150 sliding window aggregations over a subset of the received
fields (say, 10 out of a row with 80+ fields).

Something of the sort:
*average (session_time) group by ID over 1 hour.*

For the query above only 3 fields are required: session_time, ID, and the
timestamp (implicitly required to define time windows), meaning that we can
discard a significant amount of information from the original event.

Furthermore, these types of queries seem to fit what I would call (for lack
of a better term) "sliding" dataframes. Arrow's aim (as I understand it) is
to standardize the memory model for the static dataframe data structure;
can it also support a sliding version?

Usually these queries are defined by data scientists and domain experts who
are comfortable using Python rather than Java or C++, the languages that
streaming engines are built in.

My goal is to understand whether existing streaming engines like Flink
can converge on a common model that could, in the future, help develop
efficient cross-language streaming engines.

I hope I've been able to clarify some points.

Thanks


On Fri, Sep 4, 2020 at 8:17 PM Micah Kornfield <emkornfi...@gmail.com>
wrote:

> Hi Pedro,
> I think the answer is it likely depends.  The main trade-off in using Arrow
> in a streaming process is the high metadata overhead if you have very few
> rows.  There have been prior discussions on the mailing list about
> row-based and streaming that might be useful [1][2] in expanding on the
> trade-offs.
>
> For some additional color: Brian Hulette gave a talk [3] a while ago about
> potentially using Arrow within Beam (I believe flink has a high overlap
> with the Beam API) and some of the challenges.  It also looks like there
> was a Flink JIRA (that you might be on?) about using Arrow directly in
> Flink and some of the trade-offs [4].
>
> The questions you posed are a little bit vague, if there is more context it
> might be able to help make the conversation more productive.
>
> -Micah
>
> [1]
>
> https://lists.apache.org/thread.html/33a4e1a272e77d4959c851481aa25c6e4aa870db172e4c1bbf2e3a35%40%3Cdev.arrow.apache.org%3E
> [2]
>
> https://lists.apache.org/thread.html/27945533db782361143586fd77ca08e15e96e2f2a5250ff084b462d6%40%3Cdev.arrow.apache.org%3E
> [3] https://www.youtube.com/watch?v=avy1ifTZlhE
> [4] https://issues.apache.org/jira/browse/FLINK-10929
>
>
> On Fri, Sep 4, 2020 at 12:39 AM Pedro Silva <pedro.cl...@gmail.com> wrote:
>
> > Hello,
> >
> > This may be a stupid question but is Arrow used for or designed with
> > streaming processing use-cases in mind, where data is non-stationary.
> I.e:
> > Flink stream processing jobs?
> >
> > Particularly, is it possible from a given event source (say Kafka) to
> > efficiently generate incremental record batches for stream processing?
> >
> > Suppose there is a data source that continuously generates messages with
> > 100+ fields. You want to compute grouped aggregations (sums, averages,
> > count distinct, etc...) over a select few of those fields, say 5 fields
> at
> > most used for all queries.
> >
> > Is this a valid use-case for Arrow?
> > What if time is important and some windowing technique has to be applied?
> >
> > Thank you very much for your time!
> > Have a good day.
> >
>