Hi, David,

Yes, switch sorting order would loose an interesting order so if user
dictates order by t, i; planner need to resort to its cost model.
Estimating cardinality of groupby is a much bigger topic than this thread.

I feel sorting string as if it is bytea is particularly interesting.  I am
aware Peter G's patch and I think it is great, but for this sort agg case,
first, I believe it is still slower than sorting bytea, and second, Peter
G's patch depends on data.   A common long prefix will make the patch less
effective, which is probably not so uncommon (for example, URL with a
domain prefix).  I don't see any downside of sort bytea, other than lost
the interest ordering.

Thanks,
Feng



On Fri, Oct 17, 2014 at 4:36 PM, David Rowley <dgrowle...@gmail.com> wrote:

> On Sat, Oct 18, 2014 at 5:10 AM, Feng Tian <feng...@gmail.com> wrote:
>
>> Hi,
>>
>> Consider the following queries.
>>
>> create table t(i int, j int, k int, t text);
>> insert into t select i, i % 100, i % 1000, 'AABBCCDD' || i from
>> generate_series(1, 1000000) i;
>>
>> ftian=# explain select count(distinct j) from t group by t, i;
>>                                QUERY PLAN
>> ------------------------------------------------------------------------
>>  GroupAggregate  (cost=158029.84..178029.84 rows=1000000 width=22)
>>    ->  Sort  (cost=158029.84..160529.84 rows=1000000 width=22)
>>          Sort Key: t, i
>>          ->  Seq Scan on t  (cost=0.00..17352.00 rows=1000000 width=22)
>> (4 rows)
>>
>>
>> The query,
>> select count(distinct j) from t group by t, i;
>>
>> runs for 35 seconds.  However, if I change the query to,
>> select count(distinct j) from t group by i, t;  -- note switching the
>> ordering
>> select count(distinct j) from t group by decode(t, 'escape'), i; --
>> convert t to bytea
>>
>> Run times are just about 5 and 6.5 seconds.  The reason is clear, compare
>> a string with collation is slow, which is well understood by pg hackers.
>> However, here, the sorting order is forced by the planner, not user.
>> Planner can do the following optimizations,
>>
>> 1. for the sort we generated for sort agg, planner can switch column
>> ordering, put int before string,
>> 2. for the sort we generated for sort agg, use bytea compare instead of
>> string compare.
>>
>> They will bring big improvement to this common query.   Is this something
>> reasonable?
>>
>>
> It seems like it might be worth looking into, but I think it's likely more
> complex than sorting the group by order into narrowest column first.
>
> If the query was:
>
> select count(distinct j) from t group by t, i order by t, i;
>
> Then if that was  rewritten to make the group by i,t then the planner
> would then need to perform an additional sort on t,i to get the correct
> order by for the final result, which may or may not be faster, it would
> depend on how many groups there were to sort. Though I guess you could
> possibly just skip the optimisation if the next node up didn't need any
> specific ordering.
>
> I also wonder if taking into account the column's width is not quite
> enough. For example if the 'i' column only had 1 distinct value, then
> performing the group by on 'i' first wouldn't help at all. Keep in mind
> that the columns could be much closer in width than in your example, e.g
> int and bigint. Perhaps you'd need to include some sort of heuristics to
> look at the number of distinct values and the width, and form some sort of
> weighted estimates about which column to put first.
>
> Regards
>
> David Rowley
>

Reply via email to