On Thu, Dec 8, 2022 at 12:37 PM houzj.f...@fujitsu.com <houzj.f...@fujitsu.com> wrote: >
Review comments ============== 1. Currently, we don't release the stream lock in LA (leade apply worker) for "rollback to savepoint" and the reason is mentioned in comments of apply_handle_stream_abort() in the patch. But, today, while testing, I found that can lead to deadlock which otherwise, won't happen on the publisher. The key point is rollback to savepoint releases the locks acquired by the particular subtransaction, so parallel apply worker should also do the same. Consider the following example where the transaction in session-1 is being performed by the parallel apply worker and the transaction in session-2 is being performed by the leader apply worker. I have simulated it by using GUC force_stream_mode. Publisher ========== Session-1 postgres=# begin; BEGIN postgres=*# savepoint s1; SAVEPOINT postgres=*# truncate t1; TRUNCATE TABLE Session-2 postgres=# begin; BEGIN postgres=*# insert into t1 values(4); Session-1 postgres=*# rollback to savepoint s1; ROLLBACK Session-2 Commit; With or without commit of Session-2, this scenario will lead to deadlock on the subscriber because PA (parallel apply worker) is waiting for LA to send the next command, and LA is blocked by Exclusive of PA. There is no deadlock on the publisher because rollback to savepoint will release the lock acquired by truncate. To solve this, How about if we do three things before sending abort of sub-transaction (a) unlock the stream lock, (b) increment pending_stream_count, (c) take the stream lock again? Now, if the PA is not already waiting on the stop, it will not wait at stream_stop but will wait after applying abort of sub-transaction and if it is already waiting at stream_stop, the wait will be released. If this works then probably we should try to do (b) before (a) to match the steps with stream_start. 2. There seems to be another general problem in the way the patch waits for stream_stop in PA (parallel apply worker). Currently, PA checks, if there are no more pending streams then it tries to wait for the next stream by waiting on a stream lock. However, it is possible after PA checks there is no pending stream and before it actually starts waiting on a lock, the LA sends another stream for which even stream_stop is sent, in this case, PA will start waiting for the next stream whereas there is actually a pending stream available. In this case, it won't lead to any problem apart from delay in applying the changes in such cases but for the case mentioned in the previous point (Pont 1), it can lead to deadlock even after we implement the solution proposed to solve it. 3. The other point to consider is that for stream_commit/prepare/abort, in LA, we release the stream lock after sending the message whereas for stream_start we release it before sending the message. I think for the earlier cases (stream_commit/prepare/abort), the patch has done like this because pa_send_data() may need to require the lock again when it times out and start serializing, so there will be no sense in first releasing it, then re-acquiring it, and then again releasing it. Can't we also release the lock for stream_start after pa_send_data() only if it is not switched to serialize mode? -- With Regards, Amit Kapila.