Army wrote:
So my question is this: if two cost estimates have the same estimatedCost but different rowCounts (or even singleScanRowCounts), is it reasonable to suppose that the estimate with the lower rowCount is actually better?

Mike Matrigali wrote:
I would think it would be better to figure out why the costs are the same for these vastly different plans. Seems like a problem with costing. It seems cleaner to me to compare based on one field. On the face of this there is some bad assumption in building the cost - maybe it is the 0 cost per hash lookup you also mentioned? What were the actual 2 plans?

Thanks for the input, Mike. In looking more at the query plans to answer your questions, I think I ended up stumbling into an explanation of why I'm seeing this behavior.

I looked some more at the plans and the only difference is join order. For context, here are the queries again:

select * from V1, V2 where V1.j = V2.b;
select * from V2, V1 where V1.j = V2.b;

where V1 and V2 are both unions of two tables, with V2 having 100,000 rows and V1 having 7 (seven) rows.

In both cases the optimizer chooses to do a Hash join between V1 and V2. In the first case V1 is the inner table; in the second, V2 is the inner table. Now, the cost calculation for a JoinNode, as found in JoinNode.optimizeIt(), is:

        /*
        ** We add the costs for the inner and outer table, but the number
        ** of rows is that for the inner table only.
        */
        costEstimate.setCost(
                leftResultSet.getCostEstimate().getEstimatedCost() +
                rightResultSet.getCostEstimate().getEstimatedCost(),
                rightResultSet.getCostEstimate().rowCount(),
                rightResultSet.getCostEstimate().singleScanRowCount());

Thus even though the cost will be the same regardless of which "table" comes first, the rowCount will always be that of the inner table. In this case, the optimizer guesses "20" for the row count of V1 and "100,016" for the row count of V2.
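To make the asymmetry concrete, here is a minimal standalone sketch (my own class names, not Derby's actual CostEstimate API) of the calculation above: summing the two costs is order-independent, but taking the row count from the inner side only is not.

```java
// Hypothetical sketch, not Derby source: shows why swapping the join order
// leaves estimatedCost unchanged while rowCount flips to the inner table's.
class JoinCostSketch {
    final double estimatedCost;
    final double rowCount;

    JoinCostSketch(double estimatedCost, double rowCount) {
        this.estimatedCost = estimatedCost;
        this.rowCount = rowCount;
    }

    // Mirrors the JoinNode logic quoted above: costs of outer and inner
    // are added, but the row count comes from the inner (right) side only.
    static JoinCostSketch join(JoinCostSketch outer, JoinCostSketch inner) {
        return new JoinCostSketch(
                outer.estimatedCost + inner.estimatedCost,
                inner.rowCount);
    }

    public static void main(String[] args) {
        // Illustrative per-side costs; the row counts are the optimizer's
        // guesses mentioned above (20 for V1, 100,016 for V2).
        JoinCostSketch v1 = new JoinCostSketch(500.0, 20);
        JoinCostSketch v2 = new JoinCostSketch(500.0, 100_016);

        JoinCostSketch v1Inner = JoinCostSketch.join(v2, v1); // V1 as inner
        JoinCostSketch v2Inner = JoinCostSketch.join(v1, v2); // V2 as inner

        // Same total cost either way; very different row counts.
        System.out.println(v1Inner.estimatedCost == v2Inner.estimatedCost);
        System.out.println(v1Inner.rowCount + " vs " + v2Inner.rowCount);
    }
}
```

So any comparison based on estimatedCost alone cannot distinguish these two plans, even though their row counts differ by four orders of magnitude.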

So that explains why we have the same cost but different row counts.

As to why the performance is so much worse when V2 is the inner table, I think it's because we try to materialize all 100,000 rows into a hash table, which (I would guess?) spills over to disk and causes us to use a backing store hash table. Use of the backing store hash table isn't yet factored into the optimizer's cost estimates. Instead, the optimizer checks whether the memory required for the hash table is "excessive" (meaning it will probably be too large for memory and require a backing store hash) and, if so, it skips the plan (assuming there is another feasible plan that does not require so much memory).
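The shape of that guard can be sketched as follows. This is a hypothetical illustration in my own terms (the class, constant, and memory budget are all made up, not Derby's actual check): estimate the memory needed to materialize the inner table and skip the plan when it exceeds some budget.

```java
// Hypothetical sketch, not Derby's API: the kind of "excessive memory"
// guard applied before accepting a hash join plan.
class HashJoinMemoryCheck {
    // Assumed per-table memory budget (1 MB) purely for illustration.
    static final long MAX_MEMORY_PER_TABLE = 1_048_576;

    // Estimated memory = rows to materialize * average row width in bytes.
    static boolean isExcessive(double innerRowCount, int avgRowWidthBytes) {
        return innerRowCount * avgRowWidthBytes > MAX_MEMORY_PER_TABLE;
    }

    public static void main(String[] args) {
        // V1 as inner: ~20 rows easily fit in memory, so the plan is kept.
        System.out.println(HashJoinMemoryCheck.isExcessive(20, 100));
        // V2 as inner: ~100,016 rows would need a backing store hash table,
        // so the optimizer should skip this join order when it can.
        System.out.println(HashJoinMemoryCheck.isExcessive(100_016, 100));
    }
}
```

With a check like this working correctly for both join orders, the plan that materializes V2 should be skipped regardless of how the FROM list is written.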

That said, when we specify "V1, V2" in the query, the check for excessive memory occurs as it should. But when we specify "V2, V1" the check for excessive memory returns incorrect results--i.e., it doesn't realize that we're going to be using so much memory, and so we do not skip the join order like we should. As it turns out, this incorrect behavior is caused by a bug in the 10.1 code that has been fixed in 10.2 as Phase 1 of DERBY-805. So in short, I was using 10.1 to try to analyze the behavior *without* my DERBY-805 changes, and it turns out that the incorrect behavior I'm seeing is actually fixed as part of the DERBY-805 Phase 1 patch. Oops. And just to make sure, I did verify this with the 10.2 codeline (after disabling the predicate pushdown for unions), and the optimizer does indeed choose the same join order regardless of whether I give "V1, V2" or "V2, V1".

> It seems cleaner to me to compare based on one field.

Okay, I'm fine with leaving the compare as a one-field check for now. I do however think this means that we can still end up with the optimizer choosing different plans depending on the order in which the user specifies the FROM list elements. In cases where one join order requires excessive memory we will (as of Phase 1 for DERBY-805) correctly skip that join order regardless of how the user specified the FROM list. But in cases where memory is not an issue, I think the order of the FROM list in the user's query can still affect the plan chosen by the optimizer. My gut instinct is that that seems odd, but I guess I need to look at it some more...

Thanks to Mike for prodding me in the right direction so that I could understand this more.

And if anyone has input for the other 2 questions I asked on this thread, I'm all ears... :)

Army
