[
https://issues.apache.org/jira/browse/KYLIN-4307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999869#comment-16999869
]
Liang Geng commented on KYLIN-4307:
-----------------------------------
Dear all,
I have come up with two solution to resolve this issue and minimize the
modifications. The first one is raise the WriteConflictException conditional,
like this:
{code:java}
protected long checkAndPutResourceImpl(String resPath, byte[] content, long
oldTS, long newTS)
throws IOException, IllegalStateException {
try {
.......
boolean ok = table.checkAndPut(row, B_FAMILY, B_COLUMN_TS, bOldTS, put);
logger.trace("Update row {} from oldTs: {}, to newTs: {}, operation
result: {}", resPath, oldTS, newTS, ok);
if (!ok) {
long real = getResourceTimestampImpl(resPath);
if(newTS != real)
throw new WriteConflictException(
"Overwriting conflict " + resPath + ", expect old TS "
+ oldTS + ", but it is " + real);
else
logger.warn("Potential Overwriting conflict " + resPath + ",
expect old TS " + oldTS + ", but it is " + real);
}
} ......
}{code}
But this solution has a chance to prune the conflict when multiple writers are
trying to write with same new timestamp.
Another one looks more reasonable: throw the exception conditional.
{code:java}
protected long checkAndPutResourceImpl(String resPath, byte[] content, long
oldTS, long newTS)
throws IOException, IllegalStateException {
try {
......
boolean ok = table.checkAndPut(row, B_FAMILY, B_COLUMN_TS, bOldTS, put);
logger.trace("Update row {} from oldTs: {}, to newTs: {}, operation
result: {}", resPath, oldTS, newTS, ok);
if (!ok) {
long real = getResourceTimestampImpl(resPath);
throw new WriteConflictException(
"Overwriting conflict " + resPath + ", expect old TS " +
oldTS + ", but it is " + real);
}
} catch (Exception ex) {
if (pushdown != null)
pushdown.rollback();
// don't throw the exception if the write operation performed
if (checkShouldThrow(resPath, newTS))
throw ex;
else
logger.warn("Exception occurred but the checkAndPut seems
successful", ex);
} finally {
if (pushdown != null)
pushdown.close();
IOUtils.closeQuietly(table);
}
return newTS;
}
private boolean checkShouldThrow(String resPath, long newTS) {
boolean shouldThrow = true;
try {
long real = getResourceTimestampImpl(resPath);
if (real == newTS) {
shouldThrow = false;
}
} catch (Exception e) {
logger.error("Failed to get last modified timestamp", e);
}
return shouldThrow;
}
{code}
This solution may also invalid if any of the two scenarios below happened.
# if getResourceTimestampImpl raises any exception or
# If some writer modified with the same timestamp as newTS before calling
getResourceTimestampImpl
> ExponentialBackoffRetry implemented a incorrect retry policy
> ------------------------------------------------------------
>
> Key: KYLIN-4307
> URL: https://issues.apache.org/jira/browse/KYLIN-4307
> Project: Kylin
> Issue Type: Bug
> Components: Storage - HBase
> Affects Versions: v2.6.4
> Environment: HBase 1.1.5
> Reporter: Liang Geng
> Priority: Major
> Attachments: Screen Shot 2019-12-19 at 12.16.05.png
>
>
> Today we had encountered a build error. The step "Build N-Dimension Cuboid :
> level 1" failed caused by WriteConflictException.
>
>
> {code:java}
> 2019-12-19 03:19:59,049 TRACE [Scheduler 1528161485 Job
> 4759f195-9443-2604-5358-d5804d9aaefc-879] hbase.HBaseResourceStore:335 :
> Update row /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from
> oldTs: 1576696795196, to newTs: 1576696799048, operation result:
> true2019-12-19 03:19:59,049 TRACE [Scheduler 1528161485 Job
> 4759f195-9443-2604-5358-d5804d9aaefc-879] hbase.HBaseResourceStore:335 :
> Update row /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from
> oldTs: 1576696795196, to newTs: 1576696799048, operation result:
> true2019-12-19 03:20:02,775 INFO [BadQueryDetector]
> service.BadQueryDetector:147 : Detect bad query.2019-12-19 03:20:03,254 TRACE
> [Scheduler 1528161485 Job d33e65d0-4214-3b96-9256-607a730cb53f-880]
> hbase.HBaseResourceStore:335 : Update row
> /execute_output/d33e65d0-4214-3b96-9256-607a730cb53f-10 from oldTs:
> 1576696793142, to newTs: 1576696803253, operation result: true2019-12-19
> 03:20:03,256 TRACE [Scheduler 1528161485 Job
> d33e65d0-4214-3b96-9256-607a730cb53f-880] hbase.HBaseResourceStore:335 :
> Update row /execute_output/d33e65d0-4214-3b96-9256-607a730cb53f-10 from
> oldTs: 1576696803253, to newTs: 1576696803255, operation result:
> true2019-12-19 03:20:04,187 INFO [FetcherRunner 2062551616-143]
> threadpool.DefaultFetcherRunner:85 : Job Fetcher: 2 should running, 2 actual
> running, 0 stopped, 0 ready, 1754 already succeed, 7 error, 10 discarded, 0
> others2019-12-19 03:20:09,124 TRACE [Scheduler 1528161485 Job
> 4759f195-9443-2604-5358-d5804d9aaefc-879] hbase.HBaseResourceStore:335 :
> Update row /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from
> oldTs: 1576696799048, to newTs: 1576696809122, operation result:
> true2019-12-19 03:20:09,125 TRACE [Scheduler 1528161485 Job
> 4759f195-9443-2604-5358-d5804d9aaefc-879] hbase.HBaseResourceStore:335 :
> Update row /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from
> oldTs: 1576696809122, to newTs: 1576696809124, operation result:
> true2019-12-19 03:20:13,263 TRACE [Scheduler 1528161485 Job
> d33e65d0-4214-3b96-9256-607a730cb53f-880] hbase.HBaseResourceStore:335 :
> Update row /execute_output/d33e65d0-4214-3b96-9256-607a730cb53f-10 from
> oldTs: 1576696803255, to newTs: 1576696813262, operation result:
> true2019-12-19 03:20:24,133 INFO [Scheduler 1528161485 Job
> 4759f195-9443-2604-5358-d5804d9aaefc-879]
> persistence.ExponentialBackoffRetry:87 : Will try to re-connect after 1
> seconds.2019-12-19 03:20:25,228 TRACE [Scheduler 1528161485 Job
> 4759f195-9443-2604-5358-d5804d9aaefc-879] hbase.HBaseResourceStore:335 :
> Update row /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from
> oldTs: 1576696809124, to newTs: 1576696819131, operation result:
> false2019-12-19 03:20:25,284 TRACE [Scheduler 1528161485 Job
> d33e65d0-4214-3b96-9256-607a730cb53f-880] hbase.HBaseResourceStore:335 :
> Update row /execute_output/d33e65d0-4214-3b96-9256-607a730cb53f-10 from
> oldTs: 1576696813262, to newTs: 1576696825245, operation result:
> true2019-12-19 03:20:25,299 ERROR [Scheduler 1528161485 Job
> 4759f195-9443-2604-5358-d5804d9aaefc-879] common.MapReduceExecutable:198 :
> error execute MapReduceExecutable{id=4759f195-9443-2604-5358-d5804d9aaefc-08,
> name=Build N-Dimension Cuboid : level 1,
> state=RUNNING}org.apache.kylin.common.persistence.WriteConflictException:
> Overwriting conflict /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08,
> expect old TS 1576696809124, but it is 1576696819131 at
> org.apache.kylin.storage.hbase.HBaseResourceStore.checkAndPutResourceImpl(HBaseResourceStore.java:338)
> at
> org.apache.kylin.common.persistence.ResourceStore.lambda$checkAndPutResourceWithRetry$3(ResourceStore.java:423)
> at
> org.apache.kylin.common.persistence.ExponentialBackoffRetry.doWithRetry(ExponentialBackoffRetry.java:52)
> at
> org.apache.kylin.common.persistence.ResourceStore.checkAndPutResourceWithRetry(ResourceStore.java:423)
> at
> org.apache.kylin.common.persistence.ResourceStore.checkAndPutResourceCheckpoint(ResourceStore.java:411)
> at
> org.apache.kylin.common.persistence.ResourceStore.checkAndPutResource(ResourceStore.java:405)
> at
> org.apache.kylin.common.persistence.ResourceStore.checkAndPutResource(ResourceStore.java:388)
> at
> org.apache.kylin.common.persistence.ResourceStore.checkAndPutResource(ResourceStore.java:368)
> at
> org.apache.kylin.job.dao.ExecutableDao.writeJobOutputResource(ExecutableDao.java:252)
> at
> org.apache.kylin.job.dao.ExecutableDao.updateJobOutput(ExecutableDao.java:426)
> at
> org.apache.kylin.job.execution.ExecutableManager.addJobInfo(ExecutableManager.java:566)
> at
> org.apache.kylin.engine.mr.common.MapReduceExecutable.doWork(MapReduceExecutable.java:163)
> at
> org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:167)
> at
> org.apache.kylin.job.execution.DefaultChainedExecutable.doWork(DefaultChainedExecutable.java:71)
> at
> org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:167)
> at
> org.apache.kylin.job.impl.threadpool.DefaultScheduler$JobRunner.run(DefaultScheduler.java:114)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
>
> At 03:20:24,133, a thread "Scheduler 1528161485 Job
> 4759f195-9443-2604-5358-d5804d9aaefc-879" is trying to write content to the
> path "/execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08" but the
> exception was raised for some reasons. We don't know what kind of exception
> happened because the exception was eaten by doWithRetry.
> ExponentialBackoffRetry will retry if the following exceptions happened:
> SocketTimeoutException, or ConnectException, or RetriesExhaustedException or
> any other exceptions configured in
> "kylin.resourcestore.connection-exceptions".
> By default the retry policy is disabled, but we have enabled it by setting
> "kylin.resourcestore.reconnect-enabled".
>
> {code:java}
> // A successful written, 1576696809122 -> 1576696809124// A successful
> written, 1576696809122 -> 15766968091242019-12-19 03:20:09,125 TRACE
> [Scheduler 1528161485 Job 4759f195-9443-2604-5358-d5804d9aaefc-879]
> hbase.HBaseResourceStore:335 : Update row
> /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from oldTs:
> 1576696809122, to newTs: 1576696809124, operation result: true
> // A written with exception but actually persistence to HBase, 1576696809124
> -> 1576696819131// ExponentialBackoffRetry.doWithRetry catched the exception,
> and then back-off2019-12-19 03:20:24,133 INFO [Scheduler 1528161485 Job
> 4759f195-9443-2604-5358-d5804d9aaefc-879]
> persistence.ExponentialBackoffRetry:87 : Will try to re-connect after 1
> seconds.
> // Retry as back-off time reached. But this time, the invoking of checkAndPut
> failed because the content is already written.2019-12-19 03:20:25,228 TRACE
> [Scheduler 1528161485 Job 4759f195-9443-2604-5358-d5804d9aaefc-879]
> hbase.HBaseResourceStore:335 : Update row
> /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from oldTs:
> 1576696809124, to newTs: 1576696819131, operation result: false
> // checkAndPutResourceImpl throws WriteConflictException2019-12-19
> 03:20:25,299 ERROR [Scheduler 1528161485 Job
> 4759f195-9443-2604-5358-d5804d9aaefc-879] common.MapReduceExecutable:198 :
> error execute MapReduceExecutable{id=4759f195-9443-2604-5358-d5804d9aaefc-08,
> name=Build N-Dimension Cuboid : level 1,
> state=RUNNING}org.apache.kylin.common.persistence.WriteConflictException:
> Overwriting conflict /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08,
> expect old TS 1576696809124, but it is 1576696819131 at
> org.apache.kylin.storage.hbase.HBaseResourceStore.checkAndPutResourceImpl(HBaseResourceStore.java:338)
> at
> org.apache.kylin.common.persistence.ResourceStore.lambda$checkAndPutResourceWithRetry$3(ResourceStore.java:423)
> at
> org.apache.kylin.common.persistence.ExponentialBackoffRetry.doWithRetry(ExponentialBackoffRetry.java:52)
> at
> org.apache.kylin.common.persistence.ResourceStore.checkAndPutResourceWithRetry(ResourceStore.java:423)
> at
> org.apache.kylin.common.persistence.ResourceStore.checkAndPutResourceCheckpoint(ResourceStore.java:411)
> at
> org.apache.kylin.common.persistence.ResourceStore.checkAndPutResource(ResourceStore.java:405)
> at
> org.apache.kylin.common.persistence.ResourceStore.checkAndPutResource(ResourceStore.java:388)
> at
> org.apache.kylin.common.persistence.ResourceStore.checkAndPutResource(ResourceStore.java:368)
> at
> org.apache.kylin.job.dao.ExecutableDao.writeJobOutputResource(ExecutableDao.java:252)
> at
> org.apache.kylin.job.dao.ExecutableDao.updateJobOutput(ExecutableDao.java:426)
> at
> org.apache.kylin.job.execution.ExecutableManager.addJobInfo(ExecutableManager.java:566)
> at
> org.apache.kylin.engine.mr.common.MapReduceExecutable.doWork(MapReduceExecutable.java:163)
> at
> org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:167)
> at
> org.apache.kylin.job.execution.DefaultChainedExecutable.doWork(DefaultChainedExecutable.java:71)
> at
> org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:167)
> at
> org.apache.kylin.job.impl.threadpool.DefaultScheduler$JobRunner.run(DefaultScheduler.java:114)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
>
> As shown in the log, the thread failed to call checkAndPutResourceImpl and
> sleep 1 second to back-off at "2019-12-19 03:20:25,228".
> At 03:20:25,299, the WriteConflictException was raised. The message is
> "expect old TS 1576696809124, but it is 1576696819131". Obviously, the thread
> is trying to update with the same timestamp. That proves the first written is
> successful but throws exceptions. And then, that exception was caught by
> doWithRetry.
> I'm not a HBase expert, the picture above shows my rough guess: !Screen Shot
> 2019-12-19 at 12.16.05.png!
> Thus, we think ExponentialBackoffRetry implemented an incorrect retry policy.
> Determining whether to retry only by the type of exception is not reliable.
> At least, we should pull the last modified time to verify the written is
> successful or not. Right?
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)