Many thanks Justin & Jeff for your replies.

Presumbly the conditions are partially redundant, so loc_id => user_id


Yes you're right. I had overlooked this.

I've done some further testing and this confirms what you say: if the WHERE
columns are independent, then the Planner makes a reasonable estimate of
the number of rows. irrespective of whether it uses a single index or a
"Bitmap And" of two indexes.

I've also tested "create statistics" on Postgres 12:

   - gives good estimate with WHERE user_id = 'USER123' and loc_id =
   'LOC12345678'
   - but Plan Rows = 5 with WHERE user_id = 'USER123'  and loc_id = ANY('{
   LOC12345678 }'::text[])
   - Note: if I omit the user_id condition then it gives a good estimate,
   i.e. with WHERE loc_id = ANY('{ LOC12345678 }'::text[])

So statistics objects don't seem to be able to handle the combination of
dependencies and arrays (at least in 12.2).

Steve

On Wed, 6 May 2020 at 19:25, Jeff Janes <jeff.ja...@gmail.com> wrote:

> On Wed, May 6, 2020 at 12:20 PM Steve Pritchard <steve.pritch...@bto.org>
> wrote:
>
>> Version: Postgres 9.6.3 production system (but also tested on Postgres 12)
>>
>> For my query the Planner is sometimes choosing an execution plan that
>> uses "Bitmap And" (depending on the parameters):
>>
>> ->  Bitmap Heap Scan on observation  (cost=484.92..488.93 rows=1
>> width=203) (actual time=233.129..330.886 rows=15636 loops=1)
>>   Recheck Cond: (((user_id)::text = 'USER123'::text) AND ((loc_id)::text
>> = ANY ('{LOC12345678}'::text[])))
>>
>
> If you change " = ANY(array_of_one)" to " = scalar", does that change
> anything?  You might be able to fix this (in v12) using CREATE STATISTICS,
> but I don't know if that mechanism can see through the ANY(array_of_one)
> wrapper.
>
>
>> Note that in cases where the Planner selects a single Index Scan for this
>> query (with different parameters), the Planner makes an accurate estimate
>> of the number of rows and then makes sensible selections of joins (i.e.
>> quick).
>> i.e. the issue seems to be with the "Bitmap And".
>>
>
>
> I don't know if this nitpick matters, but I don't think that that is how
> the planner works.  The row estimates work from the top down, not the
> bottom up.  The row estimate of 1 is based on what conditions the bitmap
> heap scan implements, it is not arrived at by combining the estimates from
> the index scans below it.  If it were to change to a different type of node
> but implemented the same conditions, I think it would have the same row
> estimate.
>
>
>>
>> I don't have an index with both user_id & loc_id, as this is one of
>> several different combinations that can arise (it would require quite a few
>> indexes to cover all the possible combinations).
>>
>
> Are you actually experiencing problems with those other combinations as
> well?  If not, I wouldn't worry about solving hypothetical problems.  If
> those other combinations are actually problems and you go with CREATE
> STATISTICS, then you would have to be creating a lot of different
> statistics.  That would still be ugly, but at least the overhead for
> statistics is lower than for indexes.
>
>
>> However if I did have such an index, the planner would presumably be able
>> to use the statistics for user_id and loc_id to estimate the number of rows.
>>
>
> Indexes on physical columns do not have statistics, so making that index
> would not help with the estimation.  (Expressional indexes do have
> statistics, but I don't see that helping you here).   So while this node
> would execute faster with that index, it would still be kicking the unshown
> nested loop left join 15,636 times when it thinks it will be doing it
> once, and so would still be slow.  The most robust solution might be to
> make the outer part of that nested loop left join faster, so that your
> system would be more tolerant of statistics problems.
>
>
>>
>> So why can't it make an accurate estimate of the rows with a "Bitmap And"
>> & " Bitmap Heap Scan"? (as above)
>>
>
> In the absence of custom statistics, it assumes the selectivities of user_id
> = 'USER123', of loc_id = ANY ('{LOC12345678}'::text[]), and of taxa =
> 'Birds' are all independent of each other and can be multiplied to arrive
> at the overall selectivity.  But clearly that is not the case.  Bird
> watchers mostly watch near where they live, not in random other places.
>
> Cheers,
>
> Jeff
>


-- 
Steve Pritchard
Database Developer

British Trust for Ornithology, The Nunnery, Thetford, Norfolk IP24 2PU, UK
Tel: +44 (0)1842 750050, fax: +44 (0)1842 750030
Registered Charity No 216652 (England & Wales) No SC039193 (Scotland)
Company Limited by Guarantee No 357284 (England & Wales)

Reply via email to