Hi Paul,

> Not sure if this helps, but increasing the value returned by scan_time()
> increases the cost of a table scan, so the optimizer is more likely to use
> an index scan.

This certainly helps. Thank you.
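For the archives, here is a minimal sketch of how I understand the idea. The Cursor base class, IO_SIZE constant, and the inflation factor below are stand-ins I made up for illustration; the real interface lives in drizzled/cursor.h and differs in detail:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for Drizzle's Cursor base class; the real
// interface (drizzled/cursor.h) is different, this just shows the shape.
class Cursor {
public:
  static const unsigned IO_SIZE = 4096;   // invented block size
  uint64_t stats_data_file_length;

  Cursor() : stats_data_file_length(0) {}
  virtual ~Cursor() {}

  // Default estimate of the cost of a full table scan.
  virtual double scan_time() {
    return static_cast<double>(stats_data_file_length) / IO_SIZE + 2;
  }
};

// Engine cursor that inflates the table-scan cost, so the optimizer
// becomes more likely to choose an index scan instead.
class ToyCursor : public Cursor {
public:
  virtual double scan_time() {
    return Cursor::scan_time() * 10.0;  // arbitrary inflation factor
  }
};
```

The factor itself is arbitrary; the point is only that a larger return value tips the optimizer's cost comparison toward the index.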

I've been lurking around PBXT's source code recently for some
inspiration on this matter and noticed that both InnoDB and PBXT have
a comment pointing out this function. I also noticed that PBXT's
records_in_range() is very concise and, at a glance, looks fast.
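Along those lines, here is a rough sketch of the constant-estimate records_in_range() I was experimenting with. The key_range type, the signature, and the 10% selectivity figure are all invented stand-ins, not PBXT's or Drizzle's actual code:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical key-range endpoint type; details omitted since only
// the overall shape of the call matters for this sketch.
struct key_range {};

class ToyCursor {
public:
  uint64_t table_row_count;

  ToyCursor() : table_row_count(0) {}

  // Constant-factor estimate: claim a fixed fraction of the table
  // falls in any range instead of walking the index to count rows.
  // Cheap, but it skews EXPLAIN and can misfire badly when the real
  // number of rows in the range is much larger than the guess.
  uint64_t records_in_range(unsigned /*key_index*/,
                            const key_range * /*min_key*/,
                            const key_range * /*max_key*/) {
    uint64_t estimate = table_row_count / 10;  // assume ~10% selectivity
    return estimate < 2 ? 2 : estimate;        // never claim fewer than 2
  }
};
```

This avoids traversing the tree on every estimate call, at the price of lying to the optimizer, which is exactly the trade-off discussed below.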

It would be awesome if we could talk about this next month :)

Cheers,
Toru

On Tue, Mar 23, 2010 at 7:06 PM, Paul McCullagh
<[email protected]> wrote:
> Hi Toru,
>
> Not sure if this helps, but increasing the value returned by scan_time()
> increases the cost of a table scan, so the optimizer is more likely to use
> an index scan.
>
> On Mar 23, 2010, at 4:25 AM, Toru Maesaka wrote:
>
>> G'day Jay,
>>
>>> Hi Toru! By "prioritize" I'm assuming you mean "always prefer an index
>>> scan over a table scan"?
>>
>> Yep! I should have made this clearer :( I also realized that I
>> spoke too soon in my previous email. Returning a constant cost
>> factor in records_in_range() does seem to work. I really need to
>> thank Domas for pointing me to this approach.
>>
>> Another solution that could be neat is an API that would return a
>> boolean of whether the index should be used for certain query types.
>>
>>> Actually, this is precisely what InnoDB does. :(  Not only does this
>>> result in a call to malloc() for a copy of the MySQL key value
>>> converted into InnoDB's internal format, but it also results in a new
>>> InnoDB mem_heap allocation and a call to
>>> btr_estimate_n_rows_in_range().  For *every* call to
>>> Cursor::records_in_range().
>>
>> Ouch. Although I wonder what the performance difference would be if we
>> could optimize this away...
>>
>>> Not exactly efficient...I agree that the interface and implementation
>>> could be much simpler and cleaner.
>>
>> Yeah... What really caught my attention was that an engine could
>> end up doing the same search twice in a simple statement involving
>> ranges. That is, we give the optimizer a count of the rows in the
>> given range by traversing the tree. If the optimizer then decides
>> to use the index, the engine will most likely repeat the same
>> search, which is what I wanted to avoid.
>>
>> Cheers,
>> Toru
>>
>>
>> On Tue, Mar 23, 2010 at 4:31 AM, Jay Pipes <[email protected]> wrote:
>>>
>>> On Mon, Mar 22, 2010 at 1:05 PM, Toru Maesaka <[email protected]> wrote:
>>>>
>>>> Hi!
>>>>
>>>> Would it make sense to have an engine-level option to tell the
>>>> optimizer to prioritize index use at all times? Something along the
>>>> line of HA_PRIORITIZE_INDEX.
>>>
>>> Hi Toru! By "prioritize" I'm assuming you mean "always prefer an index
>>> scan over a table scan"?
>>>
>>> Or do you mean something else?
>>>
>>>> I'm saying this because the relationship between
>>>> Cursor::records_in_range() and HA_READ_RANGE seems important to
>>>> effectively perform range scans at the moment. I tried purposely
>>>> returning a constant small cost factor in records_in_range() but it
>>>> seems that the range search will fail if the number of estimated rows
>>>> (in my case a total lie) is significantly smaller than the number of
>>>> actual rows in the table. If I don't set HA_READ_RANGE, it seems
>>>> Drizzle will resort to a table scan (despite knowing that
>>>> 'possible_keys' exist in EXPLAIN).
>>>>
>>>> Preferably, I don't want to count the rows in a range for the
>>>> optimizer on every simple query. That said, a completely
>>>> dishonest implementation of records_in_range() would skew the
>>>> EXPLAIN output, so there is definitely a downside to what I was
>>>> experimenting with.
>>>
>>> Actually, this is precisely what InnoDB does. :(  Not only does this
>>> result in a call to malloc() for a copy of the MySQL key value
>>> converted into InnoDB's internal format, but it also results in a new
>>> InnoDB mem_heap allocation and a call to
>>> btr_estimate_n_rows_in_range().  For *every* call to
>>> Cursor::records_in_range().
>>>
>>> Not exactly efficient...I agree that the interface and implementation
>>> could be much simpler and cleaner.
>>>
>>> -jay
>>>
>>>> It would be great to hear what you all think :)
>>>>
>>>> - Toru
>>>>
>>>> _______________________________________________
>>>> Mailing list: https://launchpad.net/~drizzle-discuss
>>>> Post to     : [email protected]
>>>> Unsubscribe : https://launchpad.net/~drizzle-discuss
>>>> More help   : https://help.launchpad.net/ListHelp
>>>>
>>>
>>
>>
>>
>> --
>> Toru Maesaka <[email protected]>
>>
>
>
>
> --
> Paul McCullagh
> PrimeBase Technologies
> www.primebase.org
> www.blobstreaming.org
> pbxt.blogspot.com
>
>
>
>



-- 
Toru Maesaka <[email protected]>

