+1 on this also. As per previous questions, this is something I am also looking into.
IIoT realtime streaming can be as low as one datapoint per 'message' / block / packet etc., or at best one 'row' -- i.e. 1-second (or faster) streaming sensor data that also has a 1-second latency / update requirement, so it can't be delayed and batched, at least not by much.

Even if the data is received that way, I really don't want to store the points as discrete blocks. Though 'Table' seems to cover this logically, it still results in fragmented memory and minimal Arrow benefit. So I've been concentrating on how to take those points/rows coming in by alternate transfer means (not Arrow format) and build up the Arrow block in memory. The best way to do this while keeping the block usable at the same time is still unclear, since it goes against the entire concept of an immutable block! Still thinking through the best pattern.

In essence, there is one block that is 'growing' in memory. At some point you draw a line under it and start a new one. Taking that one step further, it may even be possible to merge blocks together to create superset blocks. Unclear if this is needed or not.

It would be ideal if even the incoming streaming format were based on Arrow concepts / datatypes / organization etc. More thinking required.

Regards,
Mark

On 9/10/20, 5:25 AM, "Fan Liya" <liya.fa...@gmail.com> wrote:

+1 for introducing Arrow in streaming processing, as we have made some attempts at this.

IMO, the metadata overhead is not likely to be a problem. If the streaming data has a high arrival rate, we can compensate with a large batch size without impacting the response time, while if the arrival rate is low, the metadata overhead is not likely to be a problem either, as the system load is low as well.

Best,
Liya Fan

On Sat, Sep 5, 2020 at 3:17 AM Micah Kornfield <emkornfi...@gmail.com> wrote:

> Hi Pedro,
> I think the answer is it likely depends. The main trade-off in using
> Arrow in a streaming process is the high metadata overhead if you have
> very few rows.
> There have been prior discussions on the mailing list about row-based
> formats and streaming that might be useful [1][2] in expanding on the
> trade-offs.
>
> For some additional color: Brian Hulette gave a talk [3] a while ago about
> potentially using Arrow within Beam (I believe Flink has a high overlap
> with the Beam API) and some of the challenges. It also looks like there
> was a Flink JIRA (that you might be on?) about using Arrow directly in
> Flink and some of the trade-offs [4].
>
> The questions you posed are a little bit vague; more context might help
> make the conversation more productive.
>
> -Micah
>
> [1]
> https://lists.apache.org/thread.html/33a4e1a272e77d4959c851481aa25c6e4aa870db172e4c1bbf2e3a35%40%3Cdev.arrow.apache.org%3E
> [2]
> https://lists.apache.org/thread.html/27945533db782361143586fd77ca08e15e96e2f2a5250ff084b462d6%40%3Cdev.arrow.apache.org%3E
> [3] https://www.youtube.com/watch?v=avy1ifTZlhE
> [4] https://issues.apache.org/jira/browse/FLINK-10929
>
>
> On Fri, Sep 4, 2020 at 12:39 AM Pedro Silva <pedro.cl...@gmail.com> wrote:
>
> > Hello,
> >
> > This may be a stupid question, but is Arrow used for or designed with
> > streaming processing use-cases in mind, where data is non-stationary,
> > i.e. Flink stream processing jobs?
> >
> > In particular, is it possible from a given event source (say Kafka) to
> > efficiently generate incremental record batches for stream processing?
> >
> > Suppose there is a data source that continuously generates messages with
> > 100+ fields. You want to compute grouped aggregations (sums, averages,
> > count distinct, etc...) over a select few of those fields, say 5 fields
> > at most used for all queries.
> >
> > Is this a valid use-case for Arrow?
> > What if time is important and some windowing technique has to be applied?
> >
> > Thank you very much for your time!
> > Have a good day.
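[Sketch, not part of the original thread.] The "growing block" pattern Mark describes -- buffer incoming rows column-wise in a mutable builder, "draw a line under it" by sealing an immutable block, and optionally merge sealed blocks into a superset -- could look roughly like the plain-Python sketch below. All class and function names are hypothetical; in real Arrow code the sealed block would become a RecordBatch and the merge would produce a Table, but no Arrow API is assumed here.

```python
# Illustrative sketch of the "growing block" pattern from the thread.
# Rows arrive one at a time, are buffered column-wise in mutable lists,
# and the block is sealed (made immutable) once a row threshold is hit.
# Names are hypothetical, not any Arrow API.

class GrowingBlock:
    def __init__(self, fields, max_rows=1024):
        self.fields = list(fields)
        self.max_rows = max_rows
        self.columns = {f: [] for f in self.fields}  # mutable, growing block
        self.sealed = []                             # immutable, finished blocks

    def append_row(self, row):
        """Append one incoming datapoint/row; seal the block when full."""
        for f in self.fields:
            self.columns[f].append(row[f])
        if len(self.columns[self.fields[0]]) >= self.max_rows:
            self.seal()

    def seal(self):
        """Draw a line under the current block and start a new one."""
        if self.columns[self.fields[0]]:
            # tuples stand in for immutable Arrow buffers / a RecordBatch
            self.sealed.append({f: tuple(v) for f, v in self.columns.items()})
            self.columns = {f: [] for f in self.fields}


def merge(blocks, fields):
    """Merge sealed blocks into one superset block (Table-like concat)."""
    return {f: tuple(v for b in blocks for v in b[f]) for f in fields}


# Usage: 1 Hz sensor rows, sealing every 3 rows for illustration.
gb = GrowingBlock(["ts", "value"], max_rows=3)
for i in range(7):
    gb.append_row({"ts": i, "value": i * 0.5})
# gb.sealed now holds two 3-row blocks; 1 row is still growing.
```

One design note on the "usable while growing" problem: because readers only ever see sealed (immutable) blocks, the mutable tail never violates immutability for consumers; a query spanning the tail would need either a seal-on-demand or a snapshot copy of the tail.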