Hi all,
I have written a Drill rule that pushes count(distinct) queries down to Druid, but the query takes 2~7 seconds just to go through Volcano physical planning. Are there any tips for writing a rule that plans faster?
I have implemented two rules:
1st:
agg->agg->scan
agg->agg->project->scan
2nd:
agg->project->scan
agg->scan
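For context, the rewrite behind the 1st pattern (splitting count(distinct) into two stacked aggregates) can be sketched on toy rows; the sample data and helper names below are hypothetical, just to show the two forms agree:

```python
from collections import defaultdict

# Toy rows: (employee_id, position_title). Hypothetical sample data.
rows = [
    (1, "VP"), (1, "VP"), (1, "Clerk"),
    (2, "Clerk"), (2, "Clerk"),
]

# Direct form: count(distinct position_title) group by employee_id.
def count_distinct_direct(rows):
    groups = defaultdict(set)
    for emp, title in rows:
        groups[emp].add(title)
    return {emp: len(titles) for emp, titles in groups.items()}

# Two-phase form (agg -> agg): the inner aggregate groups by
# (employee_id, position_title) to deduplicate, then the outer
# aggregate counts rows per employee_id.
def count_distinct_two_phase(rows):
    deduped = {(emp, title) for emp, title in rows}  # inner aggregate
    counts = defaultdict(int)
    for emp, _title in deduped:                      # outer aggregate
        counts[emp] += 1
    return dict(counts)

print(count_distinct_direct(rows))  # {1: 2, 2: 1}
print(count_distinct_two_phase(rows) == count_distinct_direct(rows))  # True
```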
The SQL query is:
select count(distinct position_title) from testDruid.employee where birth_date='1961-09-24' group by employee_id
The initial (logical) plan output is:
LogicalProject(EXPR$0=[$1]): rowcount = 1.5, cumulative cost = {133.1875 rows, 232.5 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 9
  LogicalAggregate(group=[{0}], EXPR$0=[COUNT(DISTINCT $1)]): rowcount = 1.5, cumulative cost = {131.6875 rows, 231.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 8
    LogicalProject(employee_id=[$4], position_title=[$14]): rowcount = 15.0, cumulative cost = {130.0 rows, 231.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 7
      LogicalFilter(condition=[=($0, '1961-09-24')]): rowcount = 15.0, cumulative cost = {115.0 rows, 201.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 6
        EnumerableTableScan(table=[[testDruid, employee]]): rowcount = 100.0, cumulative cost = {100.0 rows, 101.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 5
The Volcano physical planning output is:
15:59:58.201 [278cbd08-f2c6-306d-e78e-b479d088cd5d:foreman] DEBUG o.a.d.e.p.s.h.DefaultSqlHandler - VOLCANO:Physical Planning (7315ms):
ScreenPrel: rowcount = 2500.0, cumulative cost = {77750.0 rows, 970250.0 cpu, 0.0 io, 2.1504E8 network, 440000.00000000006 memory}, id = 115180
  UnionExchangePrel: rowcount = 2500.0, cumulative cost = {77500.0 rows, 970000.0 cpu, 0.0 io, 2.1504E8 network, 440000.00000000006 memory}, id = 115179
    ProjectPrel(EXPR$0=[$1]): rowcount = 2500.0, cumulative cost = {75000.0 rows, 950000.0 cpu, 0.0 io, 2.048E8 network, 440000.00000000006 memory}, id = 115178
      HashAggPrel(group=[{0}], EXPR$0=[$SUM0($1)]): rowcount = 2500.0, cumulative cost = {75000.0 rows, 950000.0 cpu, 0.0 io, 2.048E8 network, 440000.00000000006 memory}, id = 115177
        HashToRandomExchangePrel(dist0=[[$0]]): rowcount = 25000.0, cumulative cost = {50000.0 rows, 450000.0 cpu, 0.0 io, 2.048E8 network, 0.0 memory}, id = 115176
          ProjectPrel(employee_id=[$0], EXPR$0=[$1]): rowcount = 25000.0, cumulative cost = {25000.0 rows, 50000.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 115175
            ScanPrel(groupscan=[DruidGroupScan{groupScanSpec=DruidGroupScanSpec{dataSource='employee', druidQueryConvertState=MODIFIED, filter=null, allFiltersConverted=true, hasMultiValueFilters=false, groupbyColumns=[employee_id, position_title], query=StreamingGroupByQuery{groupByQuery=GroupByQuery{dataSource='employee', querySegmentSpec=LegacySegmentSpec{intervals=[1000-01-01T00:00:00.000+08:05:43/3000-01-01T00:00:00.000+08:00]}, limitSpec=NoopLimitSpec, dimFilter=null, granularity=AllGranularity, dimensions=[DefaultDimensionSpec{dimension='employee_id', outputName='employee_id'}], aggregatorSpecs=[DistinctCountAggregatorFactory{name='EXPR$0', fieldName='position_title'}], postAggregatorSpecs=[], havingSpec=null}, batchSize=50000}}, columns=[`employee_id`, `EXPR$0`], segmentIntervals=[1961-09-24T00:00:00.000+08:00/1961-09-25T00:00:00.000+08:00]}]): rowcount = 25000.0, cumulative cost = {25000.0 rows, 50000.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 115081
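Reading that plan bottom-up: the ScanPrel returns pre-aggregated (employee_id, EXPR$0) rows from Druid's DistinctCountAggregatorFactory, which are hash-distributed on employee_id and then combined by the HashAggPrel with $SUM0. A minimal sketch of that final merge step, assuming each partial row carries a distinct count over values not seen by any other fragment for the same key (the values and function name here are hypothetical):

```python
from collections import defaultdict

# Partial results as the exchange would deliver them: one
# (employee_id, partial_distinct_count) row per scan fragment.
# Hypothetical values; summing is only exact if no position_title
# is counted in more than one fragment for the same employee_id.
partials = [
    (1, 2),  # fragment A saw 2 distinct titles for employee 1
    (1, 1),  # fragment B saw 1 more distinct title for employee 1
    (2, 1),
]

# $SUM0-style merge: sum partial counts per group key,
# treating an empty group as 0 rather than NULL.
def sum0_merge(partials):
    totals = defaultdict(int)
    for key, partial in partials:
        totals[key] += partial
    return dict(totals)

print(sum0_merge(partials))  # {1: 3, 2: 1}
```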