Hi Robert,

"There is no significant disk activity (read 0), one CPU is pegged, and
that process is consuming 218M Resident memory, 168M Shared (10% available
memory total). All reasonable, except for the fact it doesn't come back..."

Just to let you know, I've observed the identical phenomenon on my RHEL3-WS
server running PostgreSQL V7.3.4: one of the CPUs pegged at 100% (2-way
SMP with hyperthreading, so 4 apparent CPUs), virtually zero disk I/O
activity, high memory usage, etc.  I thought it might be due to a casting
problem in a JOIN's ON clause, but that did not turn out to be the case.  I
*have* recently observed that if I run a vacuum analyze on the entire
database, the amount of time spent in this looping state decreases greatly,
but it has *not* disappeared in all cases.
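
For reference, the statistics refresh itself is a one-liner; the per-table
variants below are just a sketch of a cheaper alternative for very large
tables (the table names are borrowed from Robert's plan, so substitute your
own):

    -- Rebuild planner statistics for every table in the current database:
    VACUUM ANALYZE;

    -- Or target only the largest tables if a full pass takes too long:
    VACUUM ANALYZE obs_v;
    VACUUM ANALYZE obs_i;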

Next week I hope to be able to run some directed tests with stats collection
turned on, to see if I can find out what's causing this to occur.
I'll post the results if I find anything significant.
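
In case anyone wants to try the same thing, these are the collector settings
I plan to flip -- names as I recall them from the 7.3/7.4 documentation, so
double-check against your own version:

    # postgresql.conf
    stats_start_collector = true
    stats_command_string = true    # current query visible in pg_stat_activity
    stats_block_level = true      # per-table block I/O counters
    stats_row_level = true        # per-table tuple counters

Then, while the runaway query spins, something along these lines:

    -- What is each backend doing right now?
    SELECT procpid, current_query FROM pg_stat_activity;

    -- Which tables are being hammered with sequential scans?
    SELECT relname, seq_scan, idx_scan
      FROM pg_stat_user_tables
     ORDER BY seq_scan DESC;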

--- Steve

Steven Rosenstein
Senior IT Architect/Specialist | IBM Virtual Server Administration
Voice/FAX: 845-689-2064 | Cell: 646-345-6978 | Tieline: 930-6001
Text Messaging: 6463456978 @ mobile.mycingular.com
Email: srosenst @ us.ibm.com

"Learn from the mistakes of others because you can't live long enough to
make them all yourself." -- Eleanor Roosevelt

Robert Creager <[EMAIL PROTECTED]>
To: Tom Lane <[EMAIL PROTECTED]>
cc: PGPerformance <[EMAIL PROTECTED]>
Date: 09/28/2004 11:44
Subject: Re: [PERFORM] This query is still running after 10 hours...

When grilled further on (Tue, 28 Sep 2004 11:04:23 -0400),
Tom Lane <[EMAIL PROTECTED]> confessed:

> Robert Creager <[EMAIL PROTECTED]> writes:
> > Normally, this query takes from 5 minutes to 2 hours to run.  On this
> > update, it's been running for more than 10 hours.
> > ...
> >         ->  Nested Loop  (cost=250.69..129094.19 rows=77211 width=59)
> >               ->  Hash Join  (cost=250.69..307.34 rows=67 width=12)
> >                     Hash Cond: ("outer".pair_id = "inner".pair_id)
> > ...
> It chose a nested loop here because it was only expecting 67 rows out of
> the next-lower join, and so it thought it would only need 67 repetitions
> of the index probe into obs_v_file_id_index.  I'm suspicious that that
> estimate was way low and so the nestloop is taking forever.  You might
> try "SET enable_nestloop = off" as a crude way of avoiding that trap.
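
For the archives: that setting is session-local and easy to undo.  A minimal
sketch, with the problem query left as a placeholder:

    SET enable_nestloop = off;   -- discourage nested-loop joins by
                                 -- inflating their estimated cost
    EXPLAIN SELECT ...;          -- re-plan the problem query and verify
                                 -- the nested loop is gone
    RESET enable_nestloop;       -- restore the default afterward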

I tried your suggestion.  It did generate a different plan (below), but the
estimate is still blown, and it still used a nested loop.  The query is
currently running (42 minutes so far).  For the query in question, there are
151 pair_id's in the pairs table, which equates to 302 entries in the files
table (part of the query), which moves on to 533592 entries in the obs_v
table and 533699 entries in the obs_i table.

The groups table has 76 total entries, files 9028, zero_pair 2532, 147.
Only the obs_v and obs_i tables have data of any significant size, with 10M
rows apiece.  The trigger hitting the catalog table (875499 entries) is
searching for single entries to match (one fire per obs_v/obs_i update) on
an index (took 54ms on the first query of a random id just now).
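
A quick way to sanity-check that single-row probe -- the column name star_id
is my assumption about the key, and 12345 stands in for any existing id:

    -- Time one trigger-style lookup against the catalog table:
    EXPLAIN ANALYZE
    SELECT * FROM catalog WHERE star_id = 12345;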

There is no significant disk activity (read 0), one CPU is pegged, and that
process is consuming 218M Resident memory, 168M Shared (10% available
memory total). All reasonable, except for the fact it doesn't come back...

Hash Join  (cost=100267870.17..100751247.13 rows=1578889 width=63)
  Hash Cond: ("outer".star_id = "inner".star_id)
  ->  Seq Scan on obs_i i  (cost=0.00..213658.19 rows=10391319 width=8)
  ->  Hash  (cost=100266886.39..100266886.39 rows=77113 width=59)
        ->  Hash Join  (cost=100000307.51..100266886.39 rows=77113 width=59)
              Hash Cond: ("outer".file_id = "inner".file_id)
              ->  Seq Scan on obs_v  (cost=0.00..213854.50 rows=10390650 width=51)
              ->  Hash  (cost=100000307.34..100000307.34 rows=67 width=12)
                    ->  Hash Join  (cost=100000250.69..100000307.34 rows=67 width=12)
                          Hash Cond: ("outer".pair_id = "inner".pair_id)
                          ->  Seq Scan on zero_pair zp  (cost=0.00..43.32 rows=2532 width=8)
                          ->  Hash  (cost=100000250.40..100000250.40 rows=118 width=12)
                                ->  Hash Join  (cost=100000004.80..100000250.40 rows=118 width=12)
                                      Hash Cond: ("outer".night_id = "inner".night_id)
                                      ->  Seq Scan on files f  (cost=0.00..199.28 rows=9028 width=12)
                                      ->  Hash  (cost=100000004.80..100000004.80 rows=1 width=8)
                                            ->  Nested Loop  (cost=100000000.00..100000004.80 rows=1 width=8)
                                                  ->  Seq Scan on color_groups cg  (cost=0.00..2.84 rows=1 width=8)
                                                        Filter: (175 = group_id)
                                                  ->  Seq Scan on groups g  (cost=0.00..1.95 rows=1 width=8)
                                                        Filter: (group_id = 175)

 20:48:23 up 5 days, 23:14,  2 users,  load average: 2.56, 2.91, 2.78
Linux 2.6.5-02 #8 SMP Mon Jul 12 21:34:44 MDT 2004
(See attached file: color.explain)
(See attached file: attlakjy.dat)
