Hi,

This is a somewhat lengthy email about thoughts around a streaming
computation engine for Arrow dataset that I would like to hear feedback
from Arrow devs.

The main use cases that we are thinking for the streaming engine are time
series data, i.e., data arrives in time order (e.g. daily US stock prices)
and the query often follows the time order of the data (e.g., compute 7 day
rolling mean of daily US stock prices).

The main motivations for a streaming engine is (1) performance: always
keeps small amount of hot data always in memory and cache (2)
memory efficiency: the engine only need to keep small amounts of data in
memory, e.g., for the 7 day rolling mean case, the engine never need to
keep more than 7 day worth of stock price data, even it is computing this
for a stream of 20 year data. (3) Live data application: data arrives in
real time

I have talked to Phillip Cloud and am aware of an effort going on to build
a computation engine for SQL-like queries (mostly query on the entire
dataset) but am unfamiliar with the details. So I am wondering if there is
a way to design an engine that can satisfy both streaming and batch mode of
processing. Or maybe it needs to be seperate engines but we can minimize
the amount of duplication?

Looking forward to any thoughts around this.

Li

Reply via email to