Attached test case illustrates how a bug in nbtree's handling of primitive index scan scheduling can lead to wrong answers when a scrollable cursor is used. This happens when the scan direction changes for a scan with array keys, at exactly the wrong time, when certain precise conditions are met (more on that below, and in the attached test case). The bug in question is from my commit 5bf748b8. Both 17 and master are affected.
Attached is the fix. It's fairly simple. I propose to address the issue by making sure to unset the "needs primitive index scan" flag when we detect that it is invalid for the current (also recently changed) scan direction. This happens at the point that we "change our mind and step to the page to the left, and not to the right as initially expected" (since the test case happens to work by starting the scan in the forward direction and then switching it to scan backwards). In other words, with the patch applied we'll *reliably* detect and recover from violations of _bt_readpage's soft assumption that the scan direction won't change before the prim scan it schedules is allowed to get started -- _bt_steppage is the right choke point for making sure the soft assumption cannot ever really be violated. FWIW the bug is far more subtle than the fix itself may suggest. Our general approach to dealing with scrollable cursors that change direction mostly works already. The test case reveals one particular way in which it might not quite work as intended. To recap, we generally won't get confused about primitive scan direction in this way (in spite of the so->needPrimScan flag being set) because we'll usually *also* find the currPos "moreLeft" flag set to 'true' by the time _bt_readnextpage is reached (since we set both moreLeft and moreRight to 'true' within _bt_readfirstpage iff _bt_readfirstpage finds so->needPrimScan set from before). As a result of all this, the so->needPrimScan flag won't usually be effective at all in most cases where the scan direction subsequently changes before we leave the page in question. Everything generally works out because the scan's array keys will simply get reset (in a way that considers the new scan direction) when we next call _bt_readpage/_bt_checkkeys/_bt_advance_array_keys (this time in the opposite scan direction to last time). And so _bt_first won't ever get the chance to get called, so everything still works out. My test case is designed in such a way as to make recovering from the problem via _bt_readfirstpage's "set both moreLeft and moreRight to 'true' within _bt_readfirstpage iff _bt_readfirstpage finds so->needPrimScan set from before" behavior not save us -- we don't get that behavior for the very first _bt_first, which is how things work within the test case (the test case's initial call to _bt_readfirstpage will set moreLeft to 'false', since it'll be dealing with the first page for the entire top-level scan, not just one individual primitive index scan). Another very specific adversarial aspect of my test case seems worth drawing attention to, for context: the test case fetches tuples from the cursor in a way that is precisely aligned to a boundary between a pair of sibling leaf pages. IOW, for things to visibly break as they do in the test case, the scan needs to reach the precise point of scheduling the next primitive index scan, *without* fetching even a single additional index tuple in the same direction (same direction as the one _bt_readpage used when it scheduled that same primitive index scan) before backing up/changing the scan direction. And so if we were to (say) fetch 11 rows (instead of 10) in the first fetch statement, then we'll actually end up performing the scheduled primitive scan as expected, in order to be able to fetch that 11th index tuple -- a factor that'll deny _bt_first the opportunity to get called with the wrong array keys for the then-current scan direction. The repro is complicated, but fortunately the fix itself is much simpler. -- Peter Geoghegan
v1-0001-Avoid-nbtree-array-scrollable-cursor-confusion.patch
Description: Binary data
btfirst_array_confusion_repro.sql
Description: Binary data