An update.

I’ve run the algorithm on TPC-DS Q17. Here’s the output.

EnumerableProjectRel(I_ITEM_ID=[$0], I_ITEM_DESC=[$1], S_STATE=[$2], 
STORE_SALES_QUANTITYCOUNT=[$3], STORE_SALES_QUANTITYAVE=[$4], 
STORE_SALES_QUANTITYSTDEV=[$5], STORE_SALES_QUANTITYCOV=[/($5, $4)], 
AS_STORE_RETURNS_QUANTITYCOUNT=[$6], AS_STORE_RETURNS_QUANTITYAVE=[$7], 
AS_STORE_RETURNS_QUANTITYSTDEV=[$8], STORE_RETURNS_QUANTITYCOV=[/($8, $7)], 
CATALOG_SALES_QUANTITYCOUNT=[$9], CATALOG_SALES_QUANTITYAVE=[$10], 
CATALOG_SALES_QUANTITYSTDEV=[/($11, $10)], CATALOG_SALES_QUANTITYCOV=[/($11, 
$10)]): rowcount = 5.434029018852197E26, cumulative cost = 
{1.618185849567114E30 rows, 6.412154242245593E28 cpu, 0.0 io}
  EnumerableSortRel(sort0=[$0], sort1=[$1], sort2=[$2], dir0=[ASC], dir1=[ASC], 
dir2=[ASC]): rowcount = 5.434029018852197E26, cumulative cost = 
{1.6176424466652288E30 rows, 5.597049889417763E28 cpu, 0.0 io}
    EnumerableProjectRel(I_ITEM_ID=[$0], I_ITEM_DESC=[$1], S_STATE=[$2], 
STORE_SALES_QUANTITYCOUNT=[$3], STORE_SALES_QUANTITYAVE=[CAST(/($4, 
$5)):JavaType(class java.lang.Integer)], 
STORE_SALES_QUANTITYSTDEV=[CAST(POWER(/(-($6, /(*($4, $4), $5)), CASE(=($5, 1), 
null, -($5, 1))), 0.5)):JavaType(class java.lang.Integer)], 
AS_STORE_RETURNS_QUANTITYCOUNT=[$7], AS_STORE_RETURNS_QUANTITYAVE=[CAST(/($8, 
$7)):JavaType(class java.lang.Integer)], 
AS_STORE_RETURNS_QUANTITYSTDEV=[CAST(POWER(/(-($9, /(*($8, $8), $7)), 
CASE(=($7, 1), null, -($7, 1))), 0.5)):JavaType(class java.lang.Integer)], 
CATALOG_SALES_QUANTITYCOUNT=[$10], CATALOG_SALES_QUANTITYAVE=[CAST(/($11, 
$10)):JavaType(class java.lang.Integer)], $f11=[CAST(POWER(/(-($12, /(*($11, 
$11), $10)), CASE(=($10, 1), null, -($10, 1))), 0.5)):JavaType(class 
java.lang.Integer)]): rowcount = 5.434029018852197E26, cumulative cost = 
{1.1954863841615548E28 rows, 5.5427095992292415E28 cpu, 0.0 io}
      EnumerableAggregateRel(group=[{0, 1, 2}], 
STORE_SALES_QUANTITYCOUNT=[COUNT()], agg#1=[SUM($3)], agg#2=[COUNT($3)], 
agg#3=[SUM($6)], AS_STORE_RETURNS_QUANTITYCOUNT=[COUNT($4)], agg#5=[SUM($4)], 
agg#6=[SUM($7)], CATALOG_SALES_QUANTITYCOUNT=[COUNT($5)], agg#8=[SUM($5)], 
agg#9=[SUM($8)]): rowcount = 5.434029018852197E26, cumulative cost = 
{1.1411460939730328E28 rows, 4.890626116966977E28 cpu, 0.0 io}
        EnumerableProjectRel(I_ITEM_ID=[$58], I_ITEM_DESC=[$61], S_STATE=[$24], 
SS_QUANTITY=[$89], SR_RETURN_QUANTITY=[$140], CS_QUANTITY=[$196], $f6=[*($89, 
$89)], $f7=[*($140, $140)], $f8=[*($196, $196)]): rowcount = 
5.434029018852197E27, cumulative cost = {1.0868058037845108E28 rows, 
4.890626116966977E28 cpu, 0.0 io}
          EnumerableJoinRel(condition=[AND(=($82, $133), =($81, $132), =($88, 
$139))], joinType=[inner]): rowcount = 5.434029018852197E27, cumulative cost = 
{5.434029018992911E27 rows, 5065780.0 cpu, 0.0 io}
            EnumerableJoinRel(condition=[=($0, $86)], joinType=[inner]): 
rowcount = 2.3008402586892598E13, cumulative cost = {4.8588854672852766E13 
rows, 3044518.0 cpu, 0.0 io}
              EnumerableTableAccessRel(table=[[TPCDS, STORE]]): rowcount = 
12.0, cumulative cost = {12.0 rows, 13.0 cpu, 0.0 io}
              EnumerableJoinRel(condition=[=($0, $50)], joinType=[inner]): 
rowcount = 1.2782445881607E13, cumulative cost = {1.279800620431134E13 rows, 
3044505.0 cpu, 0.0 io}
                EnumerableFilterRel(condition=[=(CAST($15):VARCHAR(6) CHARACTER 
SET \"ISO-8859-1\" COLLATE \"ISO-8859-1$en_US$primary\", '1998Q1')]): rowcount 
= 10957.35, cumulative cost = {84006.35 rows, 146099.0 cpu, 0.0 io}
                  EnumerableTableAccessRel(table=[[TPCDS, DATE_DIM]]): rowcount 
= 73049.0, cumulative cost = {73049.0 rows, 73050.0 cpu, 0.0 io}
                EnumerableJoinRel(condition=[=($0, $24)], joinType=[inner]): 
rowcount = 7.7770908E9, cumulative cost = {7.783045975286664E9 rows, 2898406.0 
cpu, 0.0 io}
                  EnumerableTableAccessRel(table=[[TPCDS, ITEM]]): rowcount = 
18000.0, cumulative cost = {18000.0 rows, 18001.0 cpu, 0.0 io}
                  EnumerableTableAccessRel(table=[[TPCDS, STORE_SALES]]): 
rowcount = 2880404.0, cumulative cost = {2880404.0 rows, 2880405.0 cpu, 0.0 io}
            EnumerableJoinRel(condition=[AND(=($31, $79), =($30, $91))], 
joinType=[inner]): rowcount = 6.9978029381741304E16, cumulative cost = 
{6.9978054204658736E16 rows, 2021262.0 cpu, 0.0 io}
              EnumerableJoinRel(condition=[=($28, $0)], joinType=[inner]): 
rowcount = 7.87597881975E8, cumulative cost = {7.884434222216867E8 rows, 
433614.0 cpu, 0.0 io}
                EnumerableFilterRel(condition=[OR(=($15, '1998Q1'), =($15, 
'1998Q2'), =($15, '1998Q3'))]): rowcount = 18262.25, cumulative cost = 
{91311.25 rows, 146099.0 cpu, 0.0 io}
                  EnumerableTableAccessRel(table=[[TPCDS, DATE_DIM]]): rowcount 
= 73049.0, cumulative cost = {73049.0 rows, 73050.0 cpu, 0.0 io}
                EnumerableTableAccessRel(table=[[TPCDS, STORE_RETURNS]]): 
rowcount = 287514.0, cumulative cost = {287514.0 rows, 287515.0 cpu, 0.0 io}
              EnumerableJoinRel(condition=[=($28, $0)], joinType=[inner]): 
rowcount = 3.94888649445E9, cumulative cost = {3.9520401026966867E9 rows, 
1587648.0 cpu, 0.0 io}
                EnumerableFilterRel(condition=[OR(=($15, '1998Q1'), =($15, 
'1998Q2'), =($15, '1998Q3'))]): rowcount = 18262.25, cumulative cost = 
{91311.25 rows, 146099.0 cpu, 0.0 io}
                  EnumerableTableAccessRel(table=[[TPCDS, DATE_DIM]]): rowcount 
= 73049.0, cumulative cost = {73049.0 rows, 73050.0 cpu, 0.0 io}
                EnumerableTableAccessRel(table=[[TPCDS, CATALOG_SALES]]): 
rowcount = 1441548.0, cumulative cost = {1441548.0 rows, 1441549.0 cpu, 0.0 io}

The join order is as follows:

(store x (date_dim x (item x store_sales)))
x ((date_dim x store_returns)
   x (date_dim x catalog_sales))

It has some parts right, some parts wrong. Note that it chose "date_dim x (item 
x store_sales)” when it should have chosen "item x (date_dim x store_sales)”. 
The estimated the row-count for date_dim x store_sales was probably way too 
high, due to date_dim having a lot of values that don’t correspond to any 
sales. (The designers of TPC-DS, in their wisdom, populated the date dimension 
with 200 yrs of data.)

Note that item x store_sales is estimated at 7.8e9 rows, but should be 2.8e7 
(the cardinality of store_sales, because it’s a many-to-one join).

All things considered, it’s a good start. I’ve checked in the algorithm [1], 
disabled by default.

I’ve logged https://issues.apache.org/jira/browse/OPTIQ-357 with some ideas to 
make it better.

Julian

[1] http://git-wip-us.apache.org/repos/asf/incubator-optiq/commit/b0aff0df

Reply via email to