Dear Members of Calcite Community,

I'm working on Apache Beam SQL and we use Calcite for query optimization.
We represent both tables and streams as a subclass of
AbstractQueryableTable. In calcite implementation of cost model and
statistics, one of the key elements is row count. Also all the nodes can
present a rowcount estimation based on their inputs. For instance, in the
case of joins, the rowcount is estimated by:
left.rowCount*right.rowCount*selectivity_estimate.

My first question is, what is the correct way of representing streams in
Calcite's Optimizer? Does calcite still uses row_count for streams? If so,
what does row count represent in case of streams?

In [1] they are suggesting to use both window (size of window in terms of
tuples) and rate to represent output of all nodes in stream processing
systems, and for every node these two values are estimated. For instance,
they suggest to estimate window and rate of the joins using:
join_rate = (left_rate*right_window + right_rate*left_window)*selectivitiy
join_window = (left_window*right_window)*selectivitiy

We were thinking about using this approach for Beam SQL; however, I am
wondering where would be the point of extension? I was thinking to
implement computeSelfCost() using a different cost model (rate,
window_size) for our physical Rel Nodes, in which we don't call
estimate_row_count and instead we use inputs' non cumulative cost to
estimate the node's cost. However, I am not sure if this is a good approach
and whether this can potentially cause problems in the optimization
(because there will still be logical nodes that are implemented in calcite
and may use row count estimation). Does calcite uses cost estimation for
logical nodes such as logical join? or it only calculates the cost when the
nodes are physical?

I will appreciate if someone can help me. I will also appreciate if someone
has other suggestions for streams query optimization.

Best,
Alireza Samadian

[1] Ayad, Ahmed M., and Jeffrey F. Naughton. "Static optimization of
conjunctive queries with sliding windows over infinite streams." *Proceedings
of the 2004 ACM SIGMOD international conference on Management of data*.
ACM, 2004.

Reply via email to