Currently I can use the following code to get the cost:
def getCost(rootSchema: SchemaPlus, sql: String): RelOptCost = {
  val configBuilder = Frameworks.newConfigBuilder()
  configBuilder.defaultSchema(rootSchema)
  val frameworkConfig = configBuilder.build()
  val planner = Frameworks.getPlanner(frameworkConfig)
  val sqlNode = planner.parse(sql)
  val validated = planner.validate(sqlNode)
  val rel = planner.rel(validated).project()
  val mq = rel.getCluster.getMetadataQuery
  // cost of the root node only, not of the whole tree
  mq.getNonCumulativeCost(rel)
}
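As an aside on the snippet above: getNonCumulativeCost returns the cost of the root operator alone. A cumulative variant, summing the whole tree, is just a sketch of the same setup (untested here):

```scala
import org.apache.calcite.plan.RelOptCost
import org.apache.calcite.schema.SchemaPlus
import org.apache.calcite.tools.Frameworks

def getCumulativeCost(rootSchema: SchemaPlus, sql: String): RelOptCost = {
  val config = Frameworks.newConfigBuilder()
    .defaultSchema(rootSchema)
    .build()
  val planner = Frameworks.getPlanner(config)
  val validated = planner.validate(planner.parse(sql))
  val rel = planner.rel(validated).project()
  val mq = rel.getCluster.getMetadataQuery
  // Sums the cost of rel and its entire input tree, unlike
  // getNonCumulativeCost, which covers only rel itself.
  mq.getCumulativeCost(rel)
}
```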
but the result does not seem quite right. For example,
between two MySQL tables:
PLAN: EnumerableCalc(expr#0..3=[{inputs}], proj#0..2=[{exprs}])
  EnumerableHashJoin(condition=[=($0, $3)], joinType=[inner])
    JdbcToEnumerableConverter
      JdbcTableScan(table=[[perm, user_table]])
    JdbcToEnumerableConverter
      JdbcProject(id=[$0])
        JdbcTableScan(table=[[insight_user, user_tab]])
{1500.0 rows, 4500.0 cpu, 0.0 io}
between Elasticsearch and MySQL:
PLAN: EnumerableLimit(fetch=[10])
  EnumerableHashJoin(condition=[=($2, $8)], joinType=[inner])
    JdbcToEnumerableConverter
      JdbcTableScan(table=[[insight_user, user_tab]])
    ElasticsearchToEnumerableConverter
      ElasticsearchProject(status=[ITEM($0, 'status')], user=[ITEM($0, 'user')])
        ElasticsearchTableScan(table=[[es, es_table_1]])
{10.0 rows, 828.9306334778566 cpu, 0.0 io}
In practice the latter is much slower than the former, since the ES table
is very large.
It seems the planner cost is only a logical estimate; may I know the
correct way to use it?
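What I ultimately want is to refuse execution when the estimated output is too large. A sketch of the kind of guard I have in mind (the threshold and method names are my own, not a Calcite API, and this is untested):

```scala
import org.apache.calcite.rel.RelNode

// Hypothetical guard: reject a plan whose estimated row count exceeds
// a threshold chosen empirically for our application's memory budget.
def checkRowLimit(rel: RelNode, maxRows: Double): Unit = {
  val mq = rel.getCluster.getMetadataQuery
  val estimatedRows = mq.getRowCount(rel) // may be null if unknown
  if (estimatedRows != null && estimatedRows > maxRows) {
    throw new IllegalStateException(
      s"Refusing to execute: estimated $estimatedRows rows exceeds $maxRows")
  }
}
```

For Join nodes specifically, RelMdUtil.getJoinRowCount could be used in place of the generic getRowCount metadata call.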
Thanks
On 2020/02/20 03:00:47, Yang Liu <[email protected]> wrote:
> Hi all,
>
> Diving into the docs, I found getJoinRowCount
> <https://calcite.apache.org/apidocs/org/apache/calcite/rel/metadata/RelMdUtil.html#getJoinRowCount(org.apache.calcite.rel.metadata.RelMetadataQuery,org.apache.calcite.rel.core.Join,org.apache.calcite.rex.RexNode)>
> which seems to be the very thing I am looking for; I can reject the join
> when the estimated rows are too many. Any ideas?
>
> Yang Liu <[email protected]> wrote on Mon, Feb 17, 2020 at 5:13 PM:
>
> > Thanks Muhammad, could you give a more detailed description?
> > Currently I have found a RelOptCost; is this the one you are referring
> > to?
> > And to clarify, I do not mind the cost incurred in the data sources,
> > since that will not affect my application; it may just take longer to
> > get the result if the cost is high.
> > What I care about is the cost inside our application, and I am afraid
> > of OOM.
> >
> > Thanks
> >
> > Muhammad Gelbana <[email protected]> wrote on Mon, Feb 17, 2020 at 4:26 AM:
> >
> >> If your only concern is memory utilization, I would try estimating
> >> this using the plan's cost. But I guess you'll have to run some tests
> >> to work out the ranges you can accept.
> >>
> >>
> >> On Sun, Feb 16, 2020 at 5:50 PM Yang Liu <[email protected]> wrote:
> >>
> >> > Is it possible to have some limitations on the SQLs to make sure our
> >> > application, which depends on Calcite, is "safe"? For example, when
> >> > doing a merge join between 2 large data sources, our application may
> >> > OOM since the joining happens in memory. If we had a "limitation
> >> > mechanism", we could refuse to execute the join to avoid OOM.
> >> >
> >> > Or we can only do the check outside Calcite?
> >> >
> >> > Thanks
> >> >
> >>
> >
>