Hi, This is a somewhat lengthy email about thoughts around a streaming computation engine for Arrow dataset that I would like to hear feedback from Arrow devs.
The main use cases that we are thinking for the streaming engine are time series data, i.e., data arrives in time order (e.g. daily US stock prices) and the query often follows the time order of the data (e.g., compute 7 day rolling mean of daily US stock prices). The main motivations for a streaming engine is (1) performance: always keeps small amount of hot data always in memory and cache (2) memory efficiency: the engine only need to keep small amounts of data in memory, e.g., for the 7 day rolling mean case, the engine never need to keep more than 7 day worth of stock price data, even it is computing this for a stream of 20 year data. (3) Live data application: data arrives in real time I have talked to Phillip Cloud and am aware of an effort going on to build a computation engine for SQL-like queries (mostly query on the entire dataset) but am unfamiliar with the details. So I am wondering if there is a way to design an engine that can satisfy both streaming and batch mode of processing. Or maybe it needs to be seperate engines but we can minimize the amount of duplication? Looking forward to any thoughts around this. Li