+1 on this also. As per previous questions, this is something I am also looking into.
IIoT realtime streaming can be as low as one datapoint per 'message' / block / packet etc., or at best one 'row' -- i.e. 1-second (or faster) streaming sensor data that also has a 1-second latency / update requirement, so it can't be delayed and batched, at least not by much.

Even if the data is received that way, I really don't want to store the points as discrete blocks. Though 'Table' seems to cover this logically, it still results in fragmented memory and minimal Arrow benefit. So I've been concentrating on how to take those points/rows coming in by alternate transfer means (not Arrow format) and build up the Arrow block in memory. The best way to do this while keeping the block usable at the same time is still unclear, since it goes against the entire concept of an immutable block! Still thinking through the best pattern.

In essence, there is one block that is 'growing' in memory. At some point you draw a line under it and start a new one. Taking that one step further, it may even be possible to merge blocks together to create superset blocks. Unclear if this is needed or not.

It would be ideal if even the incoming streaming format were based on Arrow concepts / datatypes / organization etc. More thinking required.

Regards,
Mark

On 9/10/20, 5:25 AM, "Fan Liya" <liya.fa...@gmail.com> wrote:

+1 for introducing Arrow in streaming processing, as we have made some attempts at this.

IMO, the metadata overhead is not likely to be a problem. If the streaming data has a high arrival rate, we can compensate with a large batch size without impacting the response time, while if the arrival rate is low, the metadata overhead is not likely to be a problem either, as the system load is low as well.

Best,
Liya Fan

On Sat, Sep 5, 2020 at 3:17 AM Micah Kornfield <emkornfi...@gmail.com> wrote:

> Hi Pedro,
> I think the answer is it likely depends. The main trade-off in using
> Arrow in a streaming process is the high metadata overhead if you have
> very few rows.
> There have been prior discussions on the mailing list about row-based
> formats and streaming that might be useful [1][2] in expanding on the
> trade-offs.
>
> For some additional color: Brian Hulette gave a talk [3] a while ago about
> potentially using Arrow within Beam (I believe Flink has a high overlap
> with the Beam API) and some of the challenges. It also looks like there
> was a Flink JIRA (that you might be on?) about using Arrow directly in
> Flink and some of the trade-offs [4].
>
> The questions you posed are a little bit vague; more context might help
> make the conversation more productive.
>
> -Micah
>
> [1]
> https://lists.apache.org/thread.html/33a4e1a272e77d4959c851481aa25c6e4aa870db172e4c1bbf2e3a35%40%3Cdev.arrow.apache.org%3E
> [2]
> https://lists.apache.org/thread.html/27945533db782361143586fd77ca08e15e96e2f2a5250ff084b462d6%40%3Cdev.arrow.apache.org%3E
> [3] https://www.youtube.com/watch?v=avy1ifTZlhE
> [4] https://issues.apache.org/jira/browse/FLINK-10929
>
>
> On Fri, Sep 4, 2020 at 12:39 AM Pedro Silva <pedro.cl...@gmail.com> wrote:
>
> > Hello,
> >
> > This may be a stupid question, but is Arrow used for or designed with
> > streaming processing use-cases in mind, where data is non-stationary,
> > i.e. Flink stream processing jobs?
> >
> > In particular, is it possible from a given event source (say Kafka) to
> > efficiently generate incremental record batches for stream processing?
> >
> > Suppose there is a data source that continuously generates messages with
> > 100+ fields. You want to compute grouped aggregations (sums, averages,
> > count distinct, etc...) over a select few of those fields, say 5 fields
> > at most used for all queries.
> >
> > Is this a valid use-case for Arrow?
> > What if time is important and some windowing technique has to be applied?
> >
> > Thank you very much for your time!
> > Have a good day.
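[Sketch, not part of the original thread.] The "growing block" pattern Mark describes -- buffer incoming rows column-wise in a mutable builder, "draw a line under it" by sealing an immutable block, and optionally merge sealed blocks into a superset -- could look roughly like the plain-Python sketch below. All class and function names are hypothetical; in real Arrow code the sealed block would become a RecordBatch and the merge would produce a Table, but no Arrow API is assumed here.

```python
# Illustrative sketch of the "growing block" pattern from the thread.
# Rows arrive one at a time, are buffered column-wise in mutable lists,
# and the block is sealed (made immutable) once a row threshold is hit.
# Names are hypothetical, not any Arrow API.

class GrowingBlock:
    def __init__(self, fields, max_rows=1024):
        self.fields = list(fields)
        self.max_rows = max_rows
        self.columns = {f: [] for f in self.fields}  # mutable, growing block
        self.sealed = []                             # immutable, finished blocks

    def append_row(self, row):
        """Append one incoming datapoint/row; seal the block when full."""
        for f in self.fields:
            self.columns[f].append(row[f])
        if len(self.columns[self.fields[0]]) >= self.max_rows:
            self.seal()

    def seal(self):
        """Draw a line under the current block and start a new one."""
        if self.columns[self.fields[0]]:
            # tuples stand in for immutable Arrow buffers / a RecordBatch
            self.sealed.append({f: tuple(v) for f, v in self.columns.items()})
            self.columns = {f: [] for f in self.fields}


def merge(blocks, fields):
    """Merge sealed blocks into one superset block (Table-like concat)."""
    return {f: tuple(v for b in blocks for v in b[f]) for f in fields}


# Usage: 1 Hz sensor rows, sealing every 3 rows for illustration.
gb = GrowingBlock(["ts", "value"], max_rows=3)
for i in range(7):
    gb.append_row({"ts": i, "value": i * 0.5})
# gb.sealed now holds two 3-row blocks; 1 row is still growing.
```

One design note on the "usable while growing" problem: because readers only ever see sealed (immutable) blocks, the mutable tail never violates immutability for consumers; a query spanning the tail would need either a seal-on-demand or a snapshot copy of the tail.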