On Tue, Jan 23, 2024 at 12:43 PM Tomas Vondra <tomas.von...@enterprisedb.com> wrote:
>
> On 1/19/24 22:43, Melanie Plageman wrote:
>
> > We fill a queue with blocks from TIDs that we fetched from the index.
> > The queue is saved in a scan descriptor that is made available to the
> > streaming read callback. Once the queue is full, we invoke the table
> > AM specific index_fetch_tuple() function which calls
> > pg_streaming_read_buffer_get_next(). When the streaming read API
> > invokes the callback we registered, it simply dequeues a block number
> > for prefetching.
>
> So in a way there are two queues in IndexFetchTableData. One (blk_queue)
> is being filled from IndexNext, and then the queue in StreamingRead.

I've changed the name from blk_queue to tid_queue to fix the issue you
mention in your later remarks. I suppose there are two queues. The
tid_queue is just to pass the block requests to the streaming read
API. The prefetch distance will be the smaller of the two sizes.

> > The only change to the streaming read API is that now, even if the
> > callback returns InvalidBlockNumber, we may not be finished, so make
> > it resumable.
>
> Hmm, not sure when can the callback return InvalidBlockNumber before
> reaching the end. Perhaps for the first index_fetch_heap call? Any
> reason not to fill the blk_queue before calling index_fetch_heap?

The callback will return InvalidBlockNumber whenever the queue is
empty. Let's say your queue size is 5 and your effective prefetch
distance is 10 (some combination of the PgStreamingReadRange sizes and
PgStreamingRead->max_ios). The first time you call index_fetch_heap(),
the callback returns InvalidBlockNumber. Then the tid_queue is filled
with 5 TIDs. Then index_fetch_heap() is called.
pg_streaming_read_look_ahead() will prefetch the blocks of all 5 of
these TIDs, emptying the queue. Once all 5 have been dequeued, the
callback will return InvalidBlockNumber.
pg_streaming_read_buffer_get_next() will return one of the 5 blocks in
a buffer and save the associated TID in the per_buffer_data. Before
index_fetch_heap() is called again, we will see that the queue is not
full and fill it up again with 5 TIDs. So, the callback will return
InvalidBlockNumber 3 times in this scenario.

> > Structurally, this changes the timing of when the heap blocks are
> > prefetched. Your code would get a tid from the index and then prefetch
> > the heap block -- doing this until it filled a queue that had the
> > actual tids saved in it. With my approach and the streaming read API,
> > you fetch tids from the index until you've filled up a queue of block
> > numbers. Then the streaming read API will prefetch those heap blocks.
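
To make the fill-then-drain flow above concrete, here is a minimal toy
model in Python. It is not PostgreSQL code and does not use the real
streaming read API: ToyStream, toy_index_scan, max_in_flight and
max_burst are all hypothetical names, and INVALID_BLOCK merely stands
in for InvalidBlockNumber. It only shows a callback that reports an
empty tid_queue without ending the stream, so that a later call can
resume once the queue has been refilled.

```python
from collections import deque

INVALID_BLOCK = None  # stand-in for InvalidBlockNumber (assumption)


class ToyStream:
    """Hypothetical stand-in for the streaming read machinery."""

    def __init__(self, callback, max_in_flight):
        self.callback = callback
        self.max_in_flight = max_in_flight  # toy "prefetch distance"
        self.in_flight = deque()            # "prefetched" (block, tid) pairs
        self.invalid_returns = 0            # times the callback found no TID
        self.max_burst = 0                  # most blocks issued in one pass

    def look_ahead(self):
        # Pull from the callback until the distance is reached or the
        # callback reports an empty queue. An empty queue is not treated
        # as end-of-stream, so the next call simply tries again.
        issued = 0
        while len(self.in_flight) < self.max_in_flight:
            item = self.callback()
            if item is INVALID_BLOCK:
                self.invalid_returns += 1
                break
            self.in_flight.append(item)
            issued += 1
        self.max_burst = max(self.max_burst, issued)

    def get_next(self):
        self.look_ahead()
        return self.in_flight.popleft() if self.in_flight else None


def toy_index_scan(index_tids, queue_size, max_in_flight):
    """Drive the toy stream; TIDs are (block, offset) tuples."""
    tid_queue = deque()
    index_iter = iter(index_tids)

    def callback():
        if not tid_queue:
            return INVALID_BLOCK
        tid = tid_queue.popleft()
        return (tid[0], tid)  # block to "prefetch", plus per-buffer data

    def refill():
        # Top the queue up before every fetch, as described above.
        while len(tid_queue) < queue_size:
            tid = next(index_iter, None)
            if tid is None:
                return
            tid_queue.append(tid)

    stream = ToyStream(callback, max_in_flight)
    fetched = []
    while True:
        refill()
        item = stream.get_next()
        if item is None:  # nothing queued and nothing in flight
            break
        fetched.append(item[1])
    return {
        "tids": fetched,
        "invalid_returns": stream.invalid_returns,
        "max_burst": stream.max_burst,
    }
```

Calling toy_index_scan([(b, 1) for b in range(12)], queue_size=5,
max_in_flight=10) returns the TIDs in index order. In this toy model a
single look-ahead pass can never issue more blocks than the smaller of
queue_size and max_in_flight, because the queue is only refilled
between fetches.
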
>
> And is that a good/desirable change? I'm not saying it's not, but maybe
> we should not be filling either queue in one go - we don't want to
> overload the prefetching.

We can focus on the prefetch distance algorithm maintained in the
streaming read API and then make sure that the tid_queue is larger
than the desired prefetch distance maintained by the streaming read
API.

> > I didn't actually implement the block queue -- I just saved a single
> > block number and pretended it was a block queue. I was imagining we
> > replace this with something like your IndexPrefetch->blockItems --
> > which has light deduplication. We'd probably have to flesh it out more
> > than that.
>
> I don't understand how this passes the TID to the index_fetch_heap.
> Isn't it working only by accident, due to blk_queue only having a single
> entry? Shouldn't the first queue (blk_queue) store TIDs instead?

Oh dear! Fixed in the attached v2. I've replaced the single
BlockNumber with a single ItemPointerData. I will work on implementing
an actual queue next week.

> > There are also table AM layering violations in my sketch which would
> > have to be worked out (not to mention some resource leakage I didn't
> > bother investigating [which causes it to fail tests]).
> >
> > 0001 is all of Thomas' streaming read API code that isn't yet in
> > master and 0002 is my rough sketch of index prefetching using the
> > streaming read API
> >
> > There are also numerous optimizations that your index prefetching
> > patch set does that would need to be added in some way. I haven't
> > thought much about it yet. I wanted to see what you thought of this
> > approach first. Basically, is it workable?
>
> It seems workable, yes. I'm not sure it's much simpler than my patch
> (considering a lot of the code is in the optimizations, which are
> missing from this patch).
>
> I think the question is where should the optimizations happen.
> I suppose some of them might/should happen in the StreamingRead API
> itself - like the detection of sequential patterns, recently
> prefetched blocks, ...

So, the streaming read API detects sequential patterns and avoids
prefetching blocks that are already in shared buffers. It doesn't yet
avoid prefetching recently prefetched blocks, AFAIK. But I daresay this
would be relevant for other streaming read users and could certainly
be implemented there.

> But I'm not sure what to do about optimizations that are more specific
> to the access path. Consider for example the index-only scans. We don't
> want to prefetch all the pages, we need to inspect the VM and prefetch
> just the not-all-visible ones. And then pass the info to the index scan,
> so that it does not need to check the VM again. It's not clear to me how
> to do this with this approach.

Yeah, this is an issue I'll need to think about. To really spell out
the problem: the callback dequeues a TID from the tid_queue and looks
up its block in the VM. Suppose it's all visible. Then the callback
shouldn't return that block to the streaming read API to fetch from
the heap, because it doesn't need to be read. But where does the
callback put the TID so that the caller can get it? I'm going to think
more about this.

As for passing around the all-visible status so as to not reread the
VM block -- that feels solvable but I haven't looked into it.

- Melanie
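
The index-only scan problem spelled out above can be illustrated with
a small toy model in Python. This is only a sketch of the question,
not of the patch or of any agreed design: make_callback, the skipped
side list and the set standing in for the visibility map are all
hypothetical, and INVALID_BLOCK stands in for InvalidBlockNumber.

```python
from collections import deque

INVALID_BLOCK = None  # stand-in for InvalidBlockNumber (assumption)


def make_callback(tid_queue, all_visible_blocks, skipped):
    """Build a toy callback; TIDs are (block, offset) tuples.

    all_visible_blocks is a set standing in for a VM lookup, and
    skipped is a hypothetical side list for TIDs needing no heap read.
    """

    def callback():
        while tid_queue:
            tid = tid_queue.popleft()
            block = tid[0]
            if block in all_visible_blocks:
                # No heap read is needed, but the caller still has to
                # see this TID. Parking it on a side list is only one
                # hypothetical answer to "where does the TID go?".
                skipped.append(tid)
                continue
            return (block, tid)  # block to read, TID as per-buffer data
        return INVALID_BLOCK

    return callback


# Usage: block 2 is all-visible, so it is never handed to the stream.
tid_queue = deque([(1, 1), (2, 1), (3, 1)])
skipped = []
callback = make_callback(tid_queue, {2}, skipped)
first = callback()
second = callback()
third = callback()
```

Here first and second carry blocks 1 and 3, third is INVALID_BLOCK,
and the TID on block 2 ends up only in skipped. The side list also
loses that TID's position relative to the heap-fetched ones, which is
part of why the question above is still open.
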
v2-0002-use-streaming-reads-in-index-scan.nocfbot
v2-0001-Streaming-Read-API.nocfbot