On Sat, Jul 19, 2025 at 11:23 PM Tomas Vondra <to...@vondra.me> wrote: > Thanks for the link. It seems I came up with an almost the same patch, > with three minor differences: > > 1) There's another place that sets "distance = 0" in > read_stream_next_buffer, so maybe this should preserve the distance too? > > 2) I suspect we need to preserve the distance at the beginning of > read_stream_reset, like > > stream->reset_distance = Max(stream->reset_distance, > stream->distance); > > because what if you call _reset before reaching the end of the stream? > > 3) Shouldn't it reset the reset_distance to 0 after restoring it?
Probably. Hmm... an earlier version of this code didn't use distance == 0 to indicate end-of-stream, but instead had a separate internal end_of_stream flag. If we brought that back and didn't clobber distance, we wouldn't need this save-and-restore dance. It seemed shorter and sweeter without it back then, before _reset() existed in its present form, but I wonder if end_of_stream would be nicer than having to add this kind of stuff, without measurable downsides. > > There was also some discussion at the time about whether "reset so I > > can rescan", and "reset so I can continue after a temporary stop" > > should be different operations requiring different APIs. It now seems > > like one operation is sufficient, but it should preserve the distance > > as you showed and then let the algorithm learn about already-cached > > data in the rescan case (if it is even true then, which is also > > debatable since it depends on the size of the scan). So, I think we > > should just go ahead and commit a patch like that. > > Not sure. To me it seems more like two distinct cases, but I'm not sure > if it requires two distinct "operations" with distinct API. Perhaps a > simple flag for the _reset() would be enough? It'd need to track the > distance anyway, just in case. > > Consider for example a nested loop, which does a rescan every time the > outer row changes. Is there a reason to believe the outer rows will need > the same number of inner rows? Aren't those "distinct streams"? Maybe > I'm thinking about this wrong, of course. Good question. Yeah, your flag idea seems like a good way to avoid baking opinion into this level. I wonder if it should be a bitmask rather than a boolean, in case we think of more things that need to be included or not when resetting. > The thing that however concerns me is that what I observed was not the > distance getting reset to 1, and then ramping up. Which should happen > pretty quickly, thanks to the doubling. In my experiments it *never* > ramped up again, it stayed at 1. I still don't quite understand why. Huh. Will look into that on Monday.