Currently I can use the following code to get the cost:
def getCost(rootSchema: SchemaPlus, sql: String): RelOptCost = {
  val configBuilder = Frameworks.newConfigBuilder()
  configBuilder.defaultSchema(rootSchema)
  val frameworkConfig = configBuilder.build()
  val planner = Frameworks.getPlanner(frameworkConfig)
  val sqlNode = planner.parse(sql)
  val validated = planner.validate(sqlNode)
  val rel = planner.rel(validated).project()
  val mq = rel.getCluster.getMetadataQuery
  // cost of the root node only, not of the whole tree
  mq.getNonCumulativeCost(rel)
}
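As an aside on the snippet above: getNonCumulativeCost returns the cost of the root operator alone. A cumulative variant, summing the whole tree, is just a sketch of the same setup (untested here):

```scala
import org.apache.calcite.plan.RelOptCost
import org.apache.calcite.schema.SchemaPlus
import org.apache.calcite.tools.Frameworks

def getCumulativeCost(rootSchema: SchemaPlus, sql: String): RelOptCost = {
  val config = Frameworks.newConfigBuilder()
    .defaultSchema(rootSchema)
    .build()
  val planner = Frameworks.getPlanner(config)
  val validated = planner.validate(planner.parse(sql))
  val rel = planner.rel(validated).project()
  val mq = rel.getCluster.getMetadataQuery
  // Sums the cost of rel and its entire input tree, unlike
  // getNonCumulativeCost, which covers only rel itself.
  mq.getCumulativeCost(rel)
}
```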
but the result does not seem quite right. For example,
between two MySQL tables:
PLAN: EnumerableCalc(expr#0..3=[{inputs}], proj#0..2=[{exprs}])
  EnumerableHashJoin(condition=[=($0, $3)], joinType=[inner])
    JdbcToEnumerableConverter
      JdbcTableScan(table=[[perm, user_table]])
    JdbcToEnumerableConverter
      JdbcProject(id=[$0])
        JdbcTableScan(table=[[insight_user, user_tab]])
{1500.0 rows, 4500.0 cpu, 0.0 io}
between Elasticsearch and MySQL:
PLAN: EnumerableLimit(fetch=[10])
  EnumerableHashJoin(condition=[=($2, $8)], joinType=[inner])
    JdbcToEnumerableConverter
      JdbcTableScan(table=[[insight_user, user_tab]])
    ElasticsearchToEnumerableConverter
      ElasticsearchProject(status=[ITEM($0, 'status')], user=[ITEM($0, 'user')])
        ElasticsearchTableScan(table=[[es, es_table_1]])
{10.0 rows, 828.9306334778566 cpu, 0.0 io}
In practice the latter is much slower than the former, since the ES table
is very large.
It seems the planner cost is only a logical estimate; may I know the
correct way to use it?
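What I ultimately want is to refuse execution when the estimated output is too large. A sketch of the kind of guard I have in mind (the threshold and method names are my own, not a Calcite API, and this is untested):

```scala
import org.apache.calcite.rel.RelNode

// Hypothetical guard: reject a plan whose estimated row count exceeds
// a threshold chosen empirically for our application's memory budget.
def checkRowLimit(rel: RelNode, maxRows: Double): Unit = {
  val mq = rel.getCluster.getMetadataQuery
  val estimatedRows = mq.getRowCount(rel) // may be null if unknown
  if (estimatedRows != null && estimatedRows > maxRows) {
    throw new IllegalStateException(
      s"Refusing to execute: estimated $estimatedRows rows exceeds $maxRows")
  }
}
```

For Join nodes specifically, RelMdUtil.getJoinRowCount could be used in place of the generic getRowCount metadata call.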
Thanks
On 2020/02/20 03:00:47, Yang Liu <[email protected]> wrote:
> Hi all,
>
> Diving into the docs, I found getJoinRowCount
> <https://calcite.apache.org/apidocs/org/apache/calcite/rel/metadata/RelMdUtil.html#getJoinRowCount(org.apache.calcite.rel.metadata.RelMetadataQuery,org.apache.calcite.rel.core.Join,org.apache.calcite.rex.RexNode)>
> which seems to be the very thing I am looking for; I can reject the join
> when the estimated rows are too many. Any ideas?
>
> Yang Liu <[email protected]> wrote on Mon, Feb 17, 2020 at 5:13 PM:
>
> > Thanks Muhammad, could you give a more detailed description?
> > Currently I have found a RelOptCost; is this the one you are referring
> > to?
> > And to clarify, I do not mind the cost incurred in the data sources,
> > since that will not affect my application; it may just take longer to
> > get the result if the cost is high.
> > What I care about is the cost inside our application, and I am afraid
> > of OOM.
> >
> > Thanks
> >
> > Muhammad Gelbana <[email protected]> wrote on Mon, Feb 17, 2020 at 4:26 AM:
> >
> >> If your only concern is memory utilization, I would try estimating
> >> this using the plan's cost. But I guess you'll have to run some tests
> >> to work out the ranges you can accept.
> >>
> >>
> >> On Sun, Feb 16, 2020 at 5:50 PM Yang Liu <[email protected]> wrote:
> >>
> >> > Is it possible to have some limitations on the SQLs to make sure our
> >> > application, which depends on Calcite, is "safe"? For example, when
> >> > doing a merge join between 2 large data sources, our application may
> >> > OOM since the joining happens in memory. If we had a "limitation
> >> > mechanism", we could refuse to execute the join to avoid OOM.
> >> >
> >> > Or we can only do the check outside Calcite?
> >> >
> >> > Thanks
> >> >
> >>
> >
>