We've just written a patch to implement this auto-retry logic in all affected branches: 6.0, 6.1, 7.0 (and trunk). It should be fairly safe, with few side effects: it replays RPC calls that end in a transaction rollback caused by one of these 3 PostgreSQL error codes [1]:
- SERIALIZATION_FAILURE (40001 - "could not serialize access due to concurrent update")
- DEADLOCK_DETECTED (40P01)
- LOCK_NOT_AVAILABLE (55P03 - "could not obtain lock on row in relation ...")

Each of these errors is transient and caused by another concurrent transaction working on the same database entries. The likelihood that the other transaction has committed increases with every passing millisecond, so in most cases it should be sufficient to retry once after a short delay. After testing this patch with several clients hammering the server at the same time, we noticed that 3-4 retries with a randomized delay of several hundred milliseconds seem to allow them all to pass, whereas with a single retry we still get a few failures when more than 2 concurrent transactions are doing the same thing.

Concerning side effects: the failed transactions have just been rolled back, so replaying them is correct on a semantic level. In rare cases the rolled-back transaction might have had a side effect on the rest of the world (e.g. sent an email or written a file), so replaying it might cause that side effect to occur a second time. However, this would be true even with manual replay instead of automatic replay, because the user could simply press the same button again to retry. Basically we're just assuming the user did mean the transaction to happen, so we're pressing the button again for her.

We've thought of making the retry delay and/or count configurable, but the defaults should be fine for most cases. And if the default values are not good enough, a proper analysis of the concurrency issue would probably be better than bumping up the settings without understanding them. With the default settings the auto-retry could delay the transaction for up to several dozen seconds, which already seems like a very large limit. Most auto-retried transactions will not be delayed for more than a few hundred milliseconds, though.
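To make the idea concrete, here is a minimal sketch of such a retry loop. It is not the code from the patch: the function name call_with_retry, the MAX_TRIES value and the 0.1-0.5 second delay range are assumptions chosen for illustration. The only thing it relies on from psycopg2 is that its exceptions carry the SQLSTATE in a `pgcode` attribute. It also assumes that `func` opens its own transaction on every call and that the failed one has already been rolled back.

```python
import random
import time

# SQLSTATE codes treated as transient concurrency failures.
RETRYABLE_PGCODES = {
    "40001",  # SERIALIZATION_FAILURE
    "40P01",  # DEADLOCK_DETECTED
    "55P03",  # LOCK_NOT_AVAILABLE
}

MAX_TRIES = 5  # assumed value for illustration, not the patch's default


def call_with_retry(func, *args, max_tries=MAX_TRIES, sleep=time.sleep, **kwargs):
    """Call func, replaying it when it fails with a retryable pgcode.

    Any other exception, or a retryable one on the last attempt,
    is re-raised to the caller unchanged.
    """
    for attempt in range(1, max_tries + 1):
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            # psycopg2 errors expose the SQLSTATE as `pgcode`;
            # exceptions without that attribute are never retried.
            if getattr(exc, "pgcode", None) not in RETRYABLE_PGCODES:
                raise
            if attempt == max_tries:
                raise
            # Randomized delay so that competing clients do not
            # all retry at the same instant.
            sleep(random.uniform(0.1, 0.5))
```

The `sleep` parameter exists only so the delay can be stubbed out in tests. A real implementation would wrap the RPC dispatch rather than an arbitrary callable, and would probably log each retry.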
Any feedback/tests for these sensitive patches would be appreciated. We're planning to merge them soon unless a problem is detected.

Thanks!

[1] see http://www.postgresql.org/docs/current/static/errcodes-appendix.html#ERRCODES-TABLE
    and http://initd.org/psycopg/docs/errorcodes.html

** Changed in: openobject-server
       Status: Confirmed => Fix Committed

https://bugs.launchpad.net/bugs/992525

Title: TransactionRollbackError due to concurrent update could be better handled

Status in OpenERP Server: Fix Committed

Bug description:

While using openerp, psycopg2 raises TransactionRollbackError quite often, even on a small database. This does not seem to be easily reproducible, as it seems to be a conflict between two threads accessing the same table. Nevertheless, I provided a quick video reproducing this while installing "base_crypt" on my computer.

This occurs mostly at module installation, and can completely mess up the installation, for instance by showing empty wizard windows. I guess it could also occur in other situations (in a multi-user context), where the bug would be quite difficult to reproduce and would have unforeseeable consequences ;)

I've spotted another bug that seems to be caused by this: https://bugs.launchpad.net/bugs/956715

In my case (single user), it seems to hit more often on fast computers. As a probably better guess, it seems to hit more often when the browser and the server communicate over a local connection. It could be that the web module tries to update the res_users session info and collides with normal operation.

On my computer, from a new database, installing 'base_crypt' will trigger the exception. When using a distant connection, the bug won't show up. Please check the video I've posted with the bug report if you want more detail on the procedure I used. Sorry for the bad sound recording.
Note that the video will show you the bug occurring on my computer and NOT occurring on a distant computer.

I'm providing a merge proposal with a patch that solves the issue for me, but it needs a patient review.

