[ https://issues.apache.org/jira/browse/KYLIN-5339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xiaoxiang Yu resolved KYLIN-5339.
---------------------------------
    Resolution: Fixed

> Renew Epoch Retry did not interrupt the old thread in time, and the new thread failed to write data, resulting in kylin losing epoch
> ------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: KYLIN-5339
>                 URL: https://issues.apache.org/jira/browse/KYLIN-5339
>             Project: Kylin
>          Issue Type: Bug
>    Affects Versions: 5.0-alpha
>            Reporter: sibing.zhang
>            Priority: Major
>             Fix For: 5.0-alpha
>
>         Attachments: 31c439f4-0a2b-4616-949d-415f4b417f2e.png, 602360ee-fa81-4c8c-a7d2-4fdd73d284ff.png
>
>
> Epoch renew is retried twice, and each attempt has a 60s timeout. The renew runs on a thread pool whose size is set by kylin.server.renew-epoch-pool-size=3. The problem is that when a renew thread times out after 60s, that thread is not terminated, yet another renew thread is started and updates the same data. Because the first thread was never terminated, it eventually renews successfully and increments the MVCC of the data by 1. When the later thread then renews, it checks the MVCC:
> !31c439f4-0a2b-4616-949d-415f4b417f2e.png|width=583,height=64!
> At this point no row satisfies the condition, so the update returns affectedRows = 0. As a result, the current node loses control of all projects. The flow is shown in the figure below:
>
> *!602360ee-fa81-4c8c-a7d2-4fdd73d284ff.png|width=560,height=574!*
> *fix design*
> Epoch renew has a retry-on-timeout mechanism ({{kylin.server.leader-race.heart-beat-timeout=60s}}). On retry, the original transaction is not stopped, and a new transaction is opened to update the database. Because the epoch update validates the mvcc value, the second transaction conflicts with the first. To address this, a transaction timeout is added: Timeout = {{kylin.server.leader-race.heart-beat-timeout=60s}} - 1s. A transaction that times out is rolled back automatically, which avoids the transaction conflict during renew retries.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
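
The MVCC conflict described in this issue can be illustrated with a minimal sketch. It is not Kylin's code: `EpochMvccSketch`, `EpochRow`, `updateIfMvccMatches`, and `transactionTimeoutSeconds` are hypothetical names, and the in-memory row only stands in for a conditional database update of the kind the description implies (the statement in the attached screenshot is not reproduced here). Two renew attempts read the same MVCC value; the first, stale attempt commits and bumps the MVCC, so the retry matches no row.

```java
public class EpochMvccSketch {

    /** Hypothetical in-memory stand-in for one epoch row; not Kylin's actual storage. */
    static final class EpochRow {
        private long mvcc;

        EpochRow(long mvcc) {
            this.mvcc = mvcc;
        }

        /**
         * Assumed shape of an optimistic update: apply the change only if the
         * stored mvcc still equals the value the caller read earlier.
         * Returns the number of affected rows (1 on success, 0 on conflict).
         */
        synchronized int updateIfMvccMatches(long expectedMvcc) {
            if (mvcc != expectedMvcc) {
                return 0; // another writer already bumped the mvcc
            }
            mvcc++;
            return 1;
        }

        synchronized long mvcc() {
            return mvcc;
        }
    }

    /** Transaction timeout from the fix design: heart-beat timeout minus 1s. */
    static long transactionTimeoutSeconds(long heartBeatTimeoutSeconds) {
        return heartBeatTimeoutSeconds - 1;
    }

    public static void main(String[] args) {
        EpochRow row = new EpochRow(5);

        // Both attempts read the row before either one writes.
        long seenByFirstAttempt = row.mvcc();
        long seenByRetry = row.mvcc();

        // The first attempt was not terminated after its timeout and commits late.
        int firstAffected = row.updateIfMvccMatches(seenByFirstAttempt);
        // The retry now carries a stale mvcc and matches no row.
        int retryAffected = row.updateIfMvccMatches(seenByRetry);

        System.out.println("first attempt affectedRows = " + firstAffected);
        System.out.println("retry affectedRows = " + retryAffected);
        System.out.println("transaction timeout (s) = " + transactionTimeoutSeconds(60));
    }
}
```

With a 60s heart-beat timeout, the transaction timeout in this sketch is 59s, so the first transaction would roll back before the retry starts instead of committing late and invalidating the retry's MVCC.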