I had misunderstood the content of the comment. I think that I can
understand the problem now.
I might be able to present the solution of the restarting an index scan.
I try to correct the patch.
Simon Riggs wrote:
> On Tue, 2006-01-31 at 15:18 +0900, Junji TERAMOTO wrote:
>> Tom Lane wrote:
>>> I think this is quite likely to break things :-(. What sort of
>>> conditions have you tested it under? (If this were safe, we'd
>>> not have invented the LP_DELETE flag to begin with, but just have
>>> deleted known-dead items immediately.)
>> I apologize for my insufficient description(and bad english).
>> I explain the operation of this patch in detail again.
>> In _bt_insertonpg(), we insert the tupple by the following methods.
>> (1) Finds the right place(page) to insert the tuple.
>> (2) If free space is insufficient, we operate as follows.
>> + (2.1) Confirm whether the lock to the found page is considerable
>> super-exclusive lock in BufferSuperLockHeldByMe()[new]. We
>> check bufHdr->refcount(=pin), and if pin == 1, we know this
>> lock is super-exclusive lock.
>> If our lock is not super-exclusive lock, goto (2.4).
>> + (2.2) If our lock is super-exclusive lock, and the found page is
>> leaf, we look for tupples marked as "LP_DELETE" from the found
>> page, and remove found tuples in _bt_vacuum_page()[new].
>> + (2.3) If free space is sufficient after removal, goto (4).
>> (2.4) Step right to next non-dead page.
>> (2.5) (2) is repeated until an enough free space is found or it
>> reaches a right edge at last.
>> (3) When an enough free space is not found by processing (2), splits
>> the target page (making sure that the split is equitable as far as
>> post-insert free space goes).
>> (4) Inserts the tuple.
>> The numbers from 2.1) to 2.3) are new processing.
>> The pointer's shifting by removing "LP_DELETE" tuples as shown in the
>> above-mentioned becomes a problem, when we resume a stopped indexscan.
>> Therefore, I am having it confirm the state of the lock by (2.1), and
>> process only at super-exclucive lock, same as btbulkdelete().
>> However, because own indexscan might be lost because of this removal,
>> I also modify _bt_restscan() as follows.
>> (1) Check for index tuple we stopped on.
>> (2) If index tuple is moved, first of all, we search in this page
>> +(3) If not found, we try to look for from the head of this page
>> because index tuple may moved left due to removal.
>> (4) If still not found, we look for index tuple right.
>> The number (3) is new processing.
>> We do not delete the empty page. Therefore, there is no necessity for
>> tracing a left page.
>> I think that no problem occurs by this logic.
>> Please point it out if there is a wrong point. Thanks.
> Please read comments in nbtree.c p.557-566 which describe the required
> locking protocol for btree pages.
> Just because you hold the lock does *not* mean you are allowed to delete
> tuples marked as deleted. This patch doesn't work in all cases, which is
> what is required, since the failure cases don't raise an ERROR - they
> just miss data - which is unacceptable. So your patch works 99.9% of the
> time, but sometimes gives wrong answers to queries without you knowing.
> So, patch rejected, but good thinking all the same.
> You will need to argue for a change in the locking protocol before it
> could be accepted. That would speed up VACUUM also, so if it were
> possible it would be gratefully accepted.
> The only help I can give is this: Why does an index scan ever want to
> stop on a deleted tuple? Currently, the scan only remembers one tuple
> its seen. If the tuple was dead, but wasn't yet marked as dead the index
> scan will have read it and stopped there. Some while later, the index
> scan will return and begin scanning right for the tuple. If you delete
> it, the index scan *may* not be able to work out where to restart --
> ERROR. So, if you can work out a way of not restarting an index scan on
> a deleted tuple (and yet still marking them as deleted when it sees
> them) then you may be able to solve the problem.
> I'm looking at the same problem domain also (table bloat), but focusing
> on a different approach that avoids this issue.
> Best Regards, Simon Riggs
Junji Teramoto / teramoto.junji (a) lab.ntt.co.jp
---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?