Re: [PERFORM] Querying distinct values from a large table

Luke Lonergan Tue, 30 Jan 2007 23:00:54 -0800

Tom,

On 1/30/07 9:55 PM, "Tom Lane" <[EMAIL PROTECTED]> wrote:


> Alvaro Herrera <[EMAIL PROTECTED]> writes:
>> Gregory Stark wrote:
>>> (Incidentally I'm not sure where 2-5x comes from. It's entirely dependant on
>>> your data distribution. It's not hard to come up with distributions where
>>> it's
>>> 1000x as fast and others where there's no speed difference.)
> 
>> So the figure is really "1-1000x"?  I bet this one is more impressive in
>> PHB terms.
> 
> Luke has a bad habit of quoting numbers that are obviously derived from
> narrow benchmarking scenarios as Universal Truths, rather than providing
> the context they were derived in.  I wish he'd stop doing that...

In this case I was referring to results obtained using grouping and distinct
optimizations within sort where the benefit is from the use of a single pass
instead of the multiple merge passes for external sort followed by a UNIQUE
operator.  In this case, the benefit ranges from 2-5x in many examples as I
mentioned: "from 2-5 times faster than a typical external sort".  This is
also the same range of benefits we see for this optimization with a popular
commercial database. With the limit/sort optimization we have seen more
dramatic results, but I think those are less typical cases.

Here are some results for a 1GB table and a simple COUNT(DISTINCT) on a
column with 7 unique values from my dual CPU laptop running Greenplum DB (PG
8.2.1 compatible) on both CPUs. Note that my laptop has 2GB of RAM so I have
the 1GB table loaded into OS I/O cache.  The unmodified external sort spills
the sorted attribute to disk, but that takes little time.  Note that the
COUNT(DISTINCT) plan embeds a sort as the transition function in the
aggregation node.

=================================================================
================= No Distinct Optimization in Sort ==============
=================================================================
lukelonergan=# select count(distinct l_shipmode) from lineitem;
 count 
-------
     7
(1 row)

Time: 37832.308 ms

lukelonergan=# explain analyze select count(distinct l_shipmode) from
lineitem;          
QUERY PLAN         
----------------------------------------------------------------------------
----------------------------------------------------------------------------
----
 Aggregate  (cost=159175.30..159175.31 rows=1 width=8)
   Total 1 rows with 40899 ms to end, start offset by 3.189 ms.
   ->  Gather Motion 2:1  (slice2)  (cost=159175.25..159175.28 rows=1
width=8)
         recv:  Total 2 rows with 39387 ms to first row, 40899 ms to end,
start offset by 3.191 ms.
         ->  Aggregate  (cost=159175.25..159175.26 rows=1 width=8)
               Avg 1.00 rows x 2 workers.  Max 1 rows (seg0) with 39367 ms
to end, start offset by 22 ms.
               ->  Redistribute Motion 2:2  (slice1)  (cost=0.00..151672.00
rows=3001300 width=8)
                     recv:  Avg 3000607.50 rows x 2 workers.  Max 3429492
rows (seg1) with 0.362 ms to first row, 8643 ms to end, start offset by 23
ms.
                     Hash Key: lineitem.l_shipmode
                     ->  Seq Scan on lineitem  (cost=0.00..91646.00
rows=3001300 width=8)
                           Avg 3000607.50 rows x 2 workers.  Max 3001300
rows (seg0) with 0.049 ms to first row, 2813 ms to end, start offset by
12.998 ms.
 Total runtime: 40903.321 ms
(12 rows)

Time: 40904.013 ms

=================================================================
================= With Distinct Optimization in Sort ==============
=================================================================
lukelonergan=# set mpp_sort_flags=1;
SET
Time: 1.425 ms
lukelonergan=# select count(distinct l_shipmode) from lineitem;
 count 
-------
     7
(1 row)

Time: 12846.466 ms

lukelonergan=# explain analyze select count(distinct l_shipmode) from
lineitem;
                   
QUERY PLAN         
----------------------------------------------------------------------------
----------------------------------------------------------------------------
----
 Aggregate  (cost=159175.30..159175.31 rows=1 width=8)
   Total 1 rows with 13754 ms to end, start offset by 2.998 ms.
   ->  Gather Motion 2:1  (slice2)  (cost=159175.25..159175.28 rows=1
width=8)
         recv:  Total 2 rows with 13754 ms to end, start offset by 3.000 ms.
         ->  Aggregate  (cost=159175.25..159175.26 rows=1 width=8)
               Avg 1.00 rows x 2 workers.  Max 1 rows (seg0) with 13734 ms
to end, start offset by 23 ms.
               ->  Redistribute Motion 2:2  (slice1)  (cost=0.00..151672.00
rows=3001300 width=8)
                     recv:  Avg 3000607.50 rows x 2 workers.  Max 3429492
rows (seg1) with 0.352 ms to first row, 10145 ms to end, start offset by 26
ms.
                     Hash Key: lineitem.l_shipmode
                     ->  Seq Scan on lineitem  (cost=0.00..91646.00
rows=3001300 width=8)
                           Avg 3000607.50 rows x 2 workers.  Max 3001300
rows (seg0) with 0.032 ms to first row, 4048 ms to end, start offset by
13.037 ms.
 Total runtime: 13757.524 ms
(12 rows)

Time: 13758.182 ms

================= Background Information ==============
lukelonergan=# select count(*) from lineitem;
  count  
---------
 6001215
(1 row)

Time: 1661.337 ms
lukelonergan=# \d lineitem
               Table "public.lineitem"
     Column      |         Type          | Modifiers
-----------------+-----------------------+-----------
 l_orderkey      | integer               | not null
 l_partkey       | integer               | not null
 l_suppkey       | integer               | not null
 l_linenumber    | integer               | not null
 l_quantity      | double precision      | not null
 l_extendedprice | double precision      | not null
 l_discount      | double precision      | not null
 l_tax           | double precision      | not null
 l_returnflag    | text                  | not null
 l_linestatus    | text                  | not null
 l_shipdate      | date                  | not null
 l_commitdate    | date                  | not null
 l_receiptdate   | date                  | not null
 l_shipinstruct  | text                  | not null
 l_shipmode      | text                  | not null
 l_comment       | character varying(44) | not null
Distributed by: (l_orderkey)

lukelonergan=# select pg_relation_size(oid)/(1000.*1000.) as MB, relname as
Table from pg_class order by MB desc limit 10;
           mb           |             table
------------------------+--------------------------------
  1009.4755840000000000 | lineitem
   230.0559360000000000 | orders
   146.7678720000000000 | partsupp
    35.8072320000000000 | part
    32.8908800000000000 | customer
     1.9333120000000000 | supplier
     1.1304960000000000 | pg_proc
 0.88473600000000000000 | pg_proc_proname_args_nsp_index
 0.81100800000000000000 | pg_attribute
 0.81100800000000000000 | pg_depend


- Luke

Re: [PERFORM] Querying distinct values from a large table

Reply via email to