"Peter J. Holzer" <hjp-pg...@hjp.at> writes:
> On 2025-04-30 17:16:47 +0200, Markus Demleitner wrote:
>> dfbsspec.raw_spectra is about 23 Megarows,
>> ->  Parallel Seq Scan on raw_spectra  (cost=0.00..2626995.66 rows=5803266 
>> width=756) (actual time=0.137..6841.379 rows=4642657 loops=5)

> It estimates that it has to read 5803266 of those 23000000 rows.

No, you're misreading that (I admit it's confusing).  The rows report
is the average per parallel worker, and the loops count indicates we
had 5 workers.  So actually this parallel seqscan emitted 4642657*5
= 23213285 rows, or the whole table, which is what should be expected
given it has no filter condition and no LIMIT.

(I am wondering why the estimate is only 5803266 rows, because I don't
think that number is scaled for the number of workers...)

>> ->  Nested Loop  (cost=0.56..4871.60 rows=561 width=0) (actual 
>> time=2.478..2.479 rows=0 loops=1)
>>       ->  Seq Scan on main  (cost=0.00..52.61 rows=561 width=48) (actual 
>> time=0.011..0.317 rows=561 loops=1)
>>       ->  Index Scan using raw_spectra_pub_did on raw_spectra  
>> (cost=0.56..8.58 rows=1 width=66) (actual time=0.003..0.003 rows=0 loops=561)
>>             Index Cond: (pub_did = main.obs_publisher_did)

> Here you select only 561 rows. That's just a tiny fraction of the whole
> table, so the optimizer estimates that doing a few hundred index lookups
> is faster than reading the whole table.

The point here is that the chosen plan shape allows pushing the join
qual "raw_spectra.pub_did = main.obs_publisher_did" down to be an
index condition, which is exactly what we have to do if we want to
avoid reading all of raw_spectra.  What Markus is complaining about
is that that fails to happen if there's a UNION ALL in the way.
Postgres is capable of doing that in other cases, so it's a fair
question.

                        regards, tom lane


Reply via email to