Hi,

Using the database as the locking mechanism may cause performance issues: we would need to lock the whole table every time a new task is picked up.
We will need a way to synchronize the jobs' status. Some options:

1. Use a master-worker model. The master communicates with the database and dispatches jobs to the workers. This model is simple to understand and implement, but the master node may become a performance bottleneck. There is a lot of communication between the master and the workers, and when the number of tasks is large, the synchronization between the workers and the master will block and affect the whole system's performance.
2. Use a third-party service for distributed locking, like etcd or Redis. As we already use a SQL database for storing data, Redis may work better for us. Or we can just use etcd to replace the SQL database.
3. Implement a distributed lock ourselves. That seems like overkill.

On Sun, Jan 28, 2018 at 10:52 AM, Eric Lee <[email protected]> wrote:

> I guess we can add a status column in the timeout table. It has three
> values: NEW, PENDING, DONE. When an event starts, its status is set to NEW.
> When the EventScanner detects the timeout event, it sets the status to
> PENDING. When another EventScanner scans the same timeout event, if it
> cannot update its status to PENDING, it will skip this event.
>
> 2018-01-28 9:51 GMT+08:00 Willem Jiang <[email protected]>:
>
> > Yeah, this could help us resolve the issue of different alpha servers
> > checking the same TxEvent.
> > But what if some of the alphas are offline? Then the timeout message
> > cannot be handled any more.
> > Maybe we can create a leader from the alpha servers to do the job
> > assignment work and supervise the timeout events.
> >
> >
> > Willem Jiang
> >
> > Blog: http://willemjiang.blogspot.com (English)
> > http://jnn.iteye.com (Chinese)
> > Twitter: willemjiang
> > Weibo: 姜宁willem
> >
> > On Sat, Jan 27, 2018 at 11:53 PM, Zheng Feng <[email protected]> wrote:
> >
> > > I assume that each alpha server has a different name, so we can
> > > insert the name of the alpha server in the TxEvent record.
> > > When the alpha server is scanning the TxEvent records for timeout
> > > handling, it could select only the ones that match its alpha name.
> > > It looks like we don't need the lock here, though we have to make sure
> > > the alpha server name is unique.
> > >
> > > 2018-01-27 14:36 GMT+08:00 郑扬勇 <[email protected]>:
> > >
> > > > It seems all solutions need to introduce a "lock".
> > > > If one event can only be handled by one alpha at a time, we may need
> > > > an election mechanism?
> > > >
> > > > ------------------ Original Message ------------------
> > > > From: "Eric Lee" <[email protected]>
> > > > Sent: Friday, January 26, 2018, 10:58 AM
> > > > To: "dev" <[email protected]>
> > > >
> > > > Subject: [Discussion] How to make sure events are handled only once
> > > > among different stateless Saga pack alphas
> > > >
> > > >
> > > >
> > > > Background
> > > > Currently, the transaction timeout is controlled by omega, which
> > > > makes omega stateful. Being stateful makes omega recovery rely
> > > > greatly on the previous states. Hence, we need to move the timeout
> > > > management from omega to alpha to simplify the implementation of
> > > > omega. After that, omega will be a stateless agent.
> > > >
> > > > Difficulty
> > > > How to make sure each timeout record is handled only once globally
> > > > by multiple alpha servers? Each alpha server is also stateless. All
> > > > states are stored in the database. Alpha will scan the timeout events
> > > > and handle them one by one periodically. Different alphas may process
> > > > the same event at the same time, which should be avoided because each
> > > > event should be handled only once.
> > > >
> > > > Possible Solutions:
> > > > 1. Add an expireTime column in the TxEvent entity. Then lock the
> > > > access to the timeout event to avoid concurrent access to the same
> > > > event.
> > > > Since TxEvent may involve many operations, adding the lock may
> > > > introduce latency in other transactions.
> > > > 2. Create a new entity like the Command entity. Then lock the access
> > > > to this entity and update the status asynchronously when it is done.
> > > > 3. Register timeout settings to alpha whenever omega starts. Then
> > > > query the TxEvent and ServiceConfig tables to find timeout events.
> > > > This way still cannot make sure each event is handled once, as the
> > > > range of the lock is too wide to target a specific event.
> > > >
> > > > However, the above solutions are still not perfect for the problem,
> > > > because the lock becomes invalid as soon as the query is done, and
> > > > another alpha may query the database and process the same event
> > > > before the timeout event has been processed by the previous alpha.
> > > >
> > > > Current implementation details can be found at
> > > > https://github.com/apache/incubator-servicecomb-saga/pull/122 .
> > > >
> > > > Any suggestion is welcome.
> > > >
> > > >
> > > > Best Regards!
> > > > Eric Lee
> > > >
> > >
> >
>

--
Yang,
Best Regards
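
P.S. For reference, here is a minimal sketch of the status-column idea quoted above (NEW -> PENDING via a conditional update), so the trade-off is easier to discuss. It is only an illustration under assumptions: the table name tx_timeout and the helpers find_new_events and claim_timeout_event are hypothetical, and an in-memory SQLite database stands in for the real one. It is not taken from the pull request. The point is that the UPDATE only matches a row that is still NEW, so when two scanners see the same event, only one of them gets an affected-row count of 1.

```python
import sqlite3


def find_new_events(conn):
    """Return the ids of all events that are still in status NEW."""
    rows = conn.execute(
        "SELECT id FROM tx_timeout WHERE status = 'NEW' ORDER BY id"
    )
    return [row[0] for row in rows]


def claim_timeout_event(conn, event_id):
    """Try to move one event from NEW to PENDING.

    The WHERE clause acts as a compare-and-set: the update matches only
    if the row is still NEW, so at most one caller sees rowcount == 1.
    """
    cur = conn.execute(
        "UPDATE tx_timeout SET status = 'PENDING' "
        "WHERE id = ? AND status = 'NEW'",
        (event_id,),
    )
    conn.commit()
    return cur.rowcount == 1


# Hypothetical schema: one row per timeout event (assumption, not the real one).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tx_timeout (id INTEGER PRIMARY KEY, status TEXT NOT NULL)"
)
conn.executemany("INSERT INTO tx_timeout (status) VALUES (?)", [("NEW",)] * 3)
conn.commit()

# Two scanners read the same candidates before either of them claims anything.
candidates_a = find_new_events(conn)
candidates_b = find_new_events(conn)

# Each scanner keeps only the events it managed to claim and skips the rest.
won_a = [e for e in candidates_a if claim_timeout_event(conn, e)]
won_b = [e for e in candidates_b if claim_timeout_event(conn, e)]
```

This sketch does not cover Willem's concern: if the alpha that claimed an event goes offline, the event stays PENDING. It also says nothing about how such updates perform on the real database under load.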
