On Wed, 2004-05-19 at 11:56, Tom Lane wrote:
> "Jeffrey W. Baker" <[EMAIL PROTECTED]> writes:
> > Are you saying that index scan results are sorted by something other
> > than the index key? Because in my scheme they would still be sorted by
> > index key.
>
> Not unless you add yet another sort step after the fetch step, that is
> the idea becomes
> 1. read index to get candidate TIDs
> 2. sort TIDs into physical order
> 3. read heap in physical order, check row visibility
> 4. sort selected rows into index order
>
> That would start to look like an unpleasant amount of overhead, since
> the sort would have to be over the entire output; you couldn't divide
> the scan into reasonable-size batches, whereas steps 1-3 can easily
> be run in batched style.
>
> Moreover, this'd not be any win compared to just dropping the assumption
> of sorted output; the planner could stick in the same sort for itself if
> it decided the cost was worth it.
>
> What you probably want to do is support both our traditional indexscan
> style and an "unsorted indexscan" plan type that delivers unsorted
> output (implementing steps 1-3 above). With appropriate cost estimation
> the planner could make an intelligent choice between the two.
> I'm too lazy to look in the archives right now, but my private notes
> about this say:
Thanks for the notes. Introducing a new query execution step sounds
like a better/easier idea than what I was going to do, which was to
rework some of the access methods to operate on vectors instead of
scalars. That idea looks increasingly difficult to implement.
> We could improve indexscan performance if we were to read a bunch of TIDs from
> the index, sort in page number order, and access the heap tuples in page
> number order. But probably need a batch size in the thousands of TIDs to get
> any really useful effect.
Depends on the query, but I saw run times improve by about half with
batches of only 400.
> This would not work very well given the present
> assumption that we hold a pin on the index page until we have fetched the
> tuple (cf lazy VACUUM). Pinning lots of index pages is no good for
> concurrency or buffer usage. QUESTION: is that assumption really necessary?
> Can't we just make the code ignore TIDs that lead to deleted tuples? Worst
> case is that we fetch a newly-inserted tuple that does not actually have the
> expected index value, but won't we ignore it because of MVCC tqual rules?
> (This does NOT work for dirty read or SnapshotNow cases, but we could probably
> set things up to only try this optimization for standard snapshot reads.)
> An even nicer property is that we could loop inside the index AM to scoop up
> tuples, saving locking overhead. This might be worth doing even when we want
> to keep returning the tuples in index order.
I think this is *definitely* worth doing. The current implementation in
the vicinity of index_getnext wastes valuable opportunities to optimize
the I/O and/or to let the kernel do a better job. Descending through
index_getnext -> heap_getnext for every single tuple looks expensive in
my application.
Thanks again for your notes. I'll revisit the source code and see if I
can come up with a plan.
-jwb
PS Regarding the archive, I've never seen it work. I use MARC instead.