[ 
https://issues.apache.org/jira/browse/KYLIN-4307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17038941#comment-17038941
 ] 

ASF GitHub Bot commented on KYLIN-4307:
---------------------------------------

nichunen commented on pull request #1107: KYLIN-4307 WritrConflictException 
when build cube by layer
URL: https://github.com/apache/kylin/pull/1107
 
 
   
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


> ExponentialBackoffRetry implemented a incorrect retry policy
> ------------------------------------------------------------
>
>                 Key: KYLIN-4307
>                 URL: https://issues.apache.org/jira/browse/KYLIN-4307
>             Project: Kylin
>          Issue Type: Bug
>          Components: Storage - HBase
>    Affects Versions: v2.6.4
>         Environment: HBase 1.1.5
>            Reporter: Liang Geng
>            Assignee: Yaqian Zhang
>            Priority: Major
>         Attachments: Screen Shot 2019-12-19 at 12.16.05.png, 
> image-2020-02-13-15-58-23-114.png
>
>
> Today we had encountered a build error. The step "Build N-Dimension Cuboid : 
> level 1" failed caused by WriteConflictException.
>  
>  
> {code:java}
> 2019-12-19 03:19:59,049 TRACE [Scheduler 1528161485 Job 
> 4759f195-9443-2604-5358-d5804d9aaefc-879] hbase.HBaseResourceStore:335 : 
> Update row /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from 
> oldTs: 1576696795196, to newTs: 1576696799048, operation result: 
> true2019-12-19 03:19:59,049 TRACE [Scheduler 1528161485 Job 
> 4759f195-9443-2604-5358-d5804d9aaefc-879] hbase.HBaseResourceStore:335 : 
> Update row /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from 
> oldTs: 1576696795196, to newTs: 1576696799048, operation result: 
> true2019-12-19 03:20:02,775 INFO  [BadQueryDetector] 
> service.BadQueryDetector:147 : Detect bad query.2019-12-19 03:20:03,254 TRACE 
> [Scheduler 1528161485 Job d33e65d0-4214-3b96-9256-607a730cb53f-880] 
> hbase.HBaseResourceStore:335 : Update row 
> /execute_output/d33e65d0-4214-3b96-9256-607a730cb53f-10 from oldTs: 
> 1576696793142, to newTs: 1576696803253, operation result: true2019-12-19 
> 03:20:03,256 TRACE [Scheduler 1528161485 Job 
> d33e65d0-4214-3b96-9256-607a730cb53f-880] hbase.HBaseResourceStore:335 : 
> Update row /execute_output/d33e65d0-4214-3b96-9256-607a730cb53f-10 from 
> oldTs: 1576696803253, to newTs: 1576696803255, operation result: 
> true2019-12-19 03:20:04,187 INFO  [FetcherRunner 2062551616-143] 
> threadpool.DefaultFetcherRunner:85 : Job Fetcher: 2 should running, 2 actual 
> running, 0 stopped, 0 ready, 1754 already succeed, 7 error, 10 discarded, 0 
> others2019-12-19 03:20:09,124 TRACE [Scheduler 1528161485 Job 
> 4759f195-9443-2604-5358-d5804d9aaefc-879] hbase.HBaseResourceStore:335 : 
> Update row /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from 
> oldTs: 1576696799048, to newTs: 1576696809122, operation result: 
> true2019-12-19 03:20:09,125 TRACE [Scheduler 1528161485 Job 
> 4759f195-9443-2604-5358-d5804d9aaefc-879] hbase.HBaseResourceStore:335 : 
> Update row /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from 
> oldTs: 1576696809122, to newTs: 1576696809124, operation result: 
> true2019-12-19 03:20:13,263 TRACE [Scheduler 1528161485 Job 
> d33e65d0-4214-3b96-9256-607a730cb53f-880] hbase.HBaseResourceStore:335 : 
> Update row /execute_output/d33e65d0-4214-3b96-9256-607a730cb53f-10 from 
> oldTs: 1576696803255, to newTs: 1576696813262, operation result: 
> true2019-12-19 03:20:24,133 INFO  [Scheduler 1528161485 Job 
> 4759f195-9443-2604-5358-d5804d9aaefc-879] 
> persistence.ExponentialBackoffRetry:87 : Will try to re-connect after 1 
> seconds.2019-12-19 03:20:25,228 TRACE [Scheduler 1528161485 Job 
> 4759f195-9443-2604-5358-d5804d9aaefc-879] hbase.HBaseResourceStore:335 : 
> Update row /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from 
> oldTs: 1576696809124, to newTs: 1576696819131, operation result: 
> false2019-12-19 03:20:25,284 TRACE [Scheduler 1528161485 Job 
> d33e65d0-4214-3b96-9256-607a730cb53f-880] hbase.HBaseResourceStore:335 : 
> Update row /execute_output/d33e65d0-4214-3b96-9256-607a730cb53f-10 from 
> oldTs: 1576696813262, to newTs: 1576696825245, operation result: 
> true2019-12-19 03:20:25,299 ERROR [Scheduler 1528161485 Job 
> 4759f195-9443-2604-5358-d5804d9aaefc-879] common.MapReduceExecutable:198 : 
> error execute MapReduceExecutable{id=4759f195-9443-2604-5358-d5804d9aaefc-08, 
> name=Build N-Dimension Cuboid : level 1, 
> state=RUNNING}org.apache.kylin.common.persistence.WriteConflictException: 
> Overwriting conflict /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08, 
> expect old TS 1576696809124, but it is 1576696819131 at 
> org.apache.kylin.storage.hbase.HBaseResourceStore.checkAndPutResourceImpl(HBaseResourceStore.java:338)
>  at 
> org.apache.kylin.common.persistence.ResourceStore.lambda$checkAndPutResourceWithRetry$3(ResourceStore.java:423)
>  at 
> org.apache.kylin.common.persistence.ExponentialBackoffRetry.doWithRetry(ExponentialBackoffRetry.java:52)
>  at 
> org.apache.kylin.common.persistence.ResourceStore.checkAndPutResourceWithRetry(ResourceStore.java:423)
>  at 
> org.apache.kylin.common.persistence.ResourceStore.checkAndPutResourceCheckpoint(ResourceStore.java:411)
>  at 
> org.apache.kylin.common.persistence.ResourceStore.checkAndPutResource(ResourceStore.java:405)
>  at 
> org.apache.kylin.common.persistence.ResourceStore.checkAndPutResource(ResourceStore.java:388)
>  at 
> org.apache.kylin.common.persistence.ResourceStore.checkAndPutResource(ResourceStore.java:368)
>  at 
> org.apache.kylin.job.dao.ExecutableDao.writeJobOutputResource(ExecutableDao.java:252)
>  at 
> org.apache.kylin.job.dao.ExecutableDao.updateJobOutput(ExecutableDao.java:426)
>  at 
> org.apache.kylin.job.execution.ExecutableManager.addJobInfo(ExecutableManager.java:566)
>  at 
> org.apache.kylin.engine.mr.common.MapReduceExecutable.doWork(MapReduceExecutable.java:163)
>  at 
> org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:167)
>  at 
> org.apache.kylin.job.execution.DefaultChainedExecutable.doWork(DefaultChainedExecutable.java:71)
>  at 
> org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:167)
>  at 
> org.apache.kylin.job.impl.threadpool.DefaultScheduler$JobRunner.run(DefaultScheduler.java:114)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
> {code}
>  
> At 03:20:24,133, a thread "Scheduler 1528161485 Job 
> 4759f195-9443-2604-5358-d5804d9aaefc-879" is trying to write content to the 
> path "/execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08" but the 
> exception was raised for some reasons. We don't know what kind of exception 
> happened because the exception was eaten by doWithRetry.
> ExponentialBackoffRetry will retry if the following exceptions happened: 
> SocketTimeoutException, or ConnectException, or RetriesExhaustedException or 
> any other exceptions configured in 
> "kylin.resourcestore.connection-exceptions".
> By default the retry policy is disabled, but we have enabled it by setting 
> "kylin.resourcestore.reconnect-enabled".
>  
> {code:java}
> // A successful written, 1576696809122 -> 1576696809124// A successful 
> written, 1576696809122 -> 15766968091242019-12-19 03:20:09,125 TRACE 
> [Scheduler 1528161485 Job 4759f195-9443-2604-5358-d5804d9aaefc-879] 
> hbase.HBaseResourceStore:335 : Update row 
> /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from oldTs: 
> 1576696809122, to newTs: 1576696809124, operation result: true
> // A written with exception but actually persistence to HBase, 1576696809124 
> -> 1576696819131// ExponentialBackoffRetry.doWithRetry catched the exception, 
> and then back-off2019-12-19 03:20:24,133 INFO  [Scheduler 1528161485 Job 
> 4759f195-9443-2604-5358-d5804d9aaefc-879] 
> persistence.ExponentialBackoffRetry:87 : Will try to re-connect after 1 
> seconds.
> // Retry as back-off time reached. But this time, the invoking of checkAndPut 
> failed because the content is already written.2019-12-19 03:20:25,228 TRACE 
> [Scheduler 1528161485 Job 4759f195-9443-2604-5358-d5804d9aaefc-879] 
> hbase.HBaseResourceStore:335 : Update row 
> /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from oldTs: 
> 1576696809124, to newTs: 1576696819131, operation result: false
> // checkAndPutResourceImpl throws WriteConflictException2019-12-19 
> 03:20:25,299 ERROR [Scheduler 1528161485 Job 
> 4759f195-9443-2604-5358-d5804d9aaefc-879] common.MapReduceExecutable:198 : 
> error execute MapReduceExecutable{id=4759f195-9443-2604-5358-d5804d9aaefc-08, 
> name=Build N-Dimension Cuboid : level 1, 
> state=RUNNING}org.apache.kylin.common.persistence.WriteConflictException: 
> Overwriting conflict /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08, 
> expect old TS 1576696809124, but it is 1576696819131 at 
> org.apache.kylin.storage.hbase.HBaseResourceStore.checkAndPutResourceImpl(HBaseResourceStore.java:338)
>  at 
> org.apache.kylin.common.persistence.ResourceStore.lambda$checkAndPutResourceWithRetry$3(ResourceStore.java:423)
>  at 
> org.apache.kylin.common.persistence.ExponentialBackoffRetry.doWithRetry(ExponentialBackoffRetry.java:52)
>  at 
> org.apache.kylin.common.persistence.ResourceStore.checkAndPutResourceWithRetry(ResourceStore.java:423)
>  at 
> org.apache.kylin.common.persistence.ResourceStore.checkAndPutResourceCheckpoint(ResourceStore.java:411)
>  at 
> org.apache.kylin.common.persistence.ResourceStore.checkAndPutResource(ResourceStore.java:405)
>  at 
> org.apache.kylin.common.persistence.ResourceStore.checkAndPutResource(ResourceStore.java:388)
>  at 
> org.apache.kylin.common.persistence.ResourceStore.checkAndPutResource(ResourceStore.java:368)
>  at 
> org.apache.kylin.job.dao.ExecutableDao.writeJobOutputResource(ExecutableDao.java:252)
>  at 
> org.apache.kylin.job.dao.ExecutableDao.updateJobOutput(ExecutableDao.java:426)
>  at 
> org.apache.kylin.job.execution.ExecutableManager.addJobInfo(ExecutableManager.java:566)
>  at 
> org.apache.kylin.engine.mr.common.MapReduceExecutable.doWork(MapReduceExecutable.java:163)
>  at 
> org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:167)
>  at 
> org.apache.kylin.job.execution.DefaultChainedExecutable.doWork(DefaultChainedExecutable.java:71)
>  at 
> org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:167)
>  at 
> org.apache.kylin.job.impl.threadpool.DefaultScheduler$JobRunner.run(DefaultScheduler.java:114)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
> {code}
>  
> As shown in the log, the thread failed to call checkAndPutResourceImpl and 
> sleep 1 second to back-off at "2019-12-19 03:20:25,228".
> At 03:20:25,299, the WriteConflictException was raised. The message is 
> "expect old TS 1576696809124, but it is 1576696819131". Obviously, the thread 
> is trying to update with the same timestamp. That proves the first written is 
> successful but throws exceptions. And then, that exception was caught by 
> doWithRetry.
> I'm not a HBase expert, the picture above shows my rough guess:  !Screen Shot 
> 2019-12-19 at 12.16.05.png!
> Thus, we think ExponentialBackoffRetry implemented an incorrect retry policy. 
> Determining whether to retry only by the type of exception is not reliable. 
> At least, we should pull the last modified time to verify the written is 
> successful or not. Right?
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to